『データ解析のための統計モデリング入門』3章と6章覚え書き

Pythonでやってみた系のあれ。結局pandasとかstatsmodelsとか初めて触れるんだからRで書いてある通りに実行したほうがいいのでは？とかQiitaに書けとかQiitaにすでに記事いくつかあるじゃんとかは気にしない気にしない。

3章

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
%matplotlib inline

d = pd.read_csv("data3a.csv")

Rでd&xみたいなことがやりたいときはこうすればいいぽい

d.x

0      8.31
1      9.44
2      9.50
3      9.07
4     10.16
5      8.32
6     10.61
7     10.06
8      9.93
9     10.43
10    10.36
11    10.15
12    10.92
13     8.85
14     9.42
15    11.11
16     8.02
17    11.93
18     8.55
19     7.19
20     9.83
21    10.79
22     8.89
23    10.09
24    11.63
25    10.21
26     9.45
27    10.44
28     9.44
29    10.48
      ...  
70    10.54
71    11.30
72    12.40
73    10.18
74     9.53
75    10.24
76    11.76
77     9.52
78    10.40
79     9.96
80    10.30
81    11.54
82     9.42
83    11.28
84     9.73
85    10.78
86    10.21
87    10.51
88    10.73
89     8.85
90    11.20
91     9.86
92    11.54
93    10.03
94    11.88
95     9.15
96     8.52
97    10.24
98    10.86
99     9.97
Name: x, dtype: float64

Rのsummary()関数の代わりにdescribe()メソッドがある。ただsummryと違ってfの列をカウントしてくれない。

d.describe()

	y	x
count	100.000000	100.000000
mean	7.830000	10.089100
std	2.624881	1.008049
min	2.000000	7.190000
25%	6.000000	9.427500
50%	8.000000	10.155000
75%	10.000000	10.685000
max	15.000000	12.400000

d.f.describe()

count     100
unique      2
top         T
freq       50
Name: f, dtype: object

plt.scatter(d.x[d.f == "C"], d.y[d.f == "C"] ,c = "red", label = "C")
plt.scatter(d.x[d.f == "T"], d.y[d.f == "T"] ,c = "blue", label = "T")
plt.show()

f:id:spellm:20160607121834p:plain

でGLM。3.4.1と同じく$ \lambda_i = \exp(\beta_1 + \beta_2 x_i)$としてやる。本と同じ結果が得られる(当然だけど)。この結果ってNewton-Raphson法とかで求めているのかしらん。

fit = smf.glm(formula = 'y ~ x', data = d ,family = sm.families.Poisson())
result = fit.fit()

result.summary()

Generalized Linear Model Regression Results
Dep. Variable:	y	No. Observations:	100
Model:	GLM	Df Residuals:	98
Model Family:	Poisson	Df Model:	1
Link Function:	log	Scale:	1.0
Method:	IRLS	Log-Likelihood:	-235.39
Date:	Mon, 06 Jun 2016	Deviance:	84.993
Time:	21:03:35	Pearson chi2:	83.8
No. Iterations:	7

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	1.2917	0.364	3.552	0.000	0.579 2.005
x	0.0757	0.036	2.125	0.034	0.006 0.145

図示する。

plt.scatter(d.x, d.y)
xx = np.linspace(d.x.min(),d.x.max(),100)
plt.plot(xx,np.exp(1.29 + 0.0757*xx))
plt.show()

f:id:spellm:20160607121906p:plain

説明変数が因子型の場合に移る前に多項式だとどう書けばいいんだろうって思った。 http://stackoverflow.com/questions/31978948/python-stats-models-quadratic-term-in-regression でいいっぽい？

ということで必要性はないけど$\lambda_i = \exp(\beta_1 + \beta_2 x_i + \beta_3 x_i²)$で回帰してみる。

fit2 = smf.glm(formula = 'y ~ x + np.power(x,2)', data = d ,family = sm.families.Poisson())
fit2.fit().summary()

Generalized Linear Model Regression Results
Dep. Variable:	y	No. Observations:	100
Model:	GLM	Df Residuals:	97
Model Family:	Poisson	Df Model:	2
Link Function:	log	Scale:	1.0
Method:	IRLS	Log-Likelihood:	-234.40
Date:	Mon, 06 Jun 2016	Deviance:	83.015
Time:	21:03:36	Pearson chi2:	81.6
No. Iterations:	7

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-2.5873	2.844	-0.910	0.363	-8.162 2.987
x	0.8449	0.560	1.510	0.131	-0.252 1.942
np.power(x, 2)	-0.0378	0.027	-1.378	0.168	-0.092 0.016

plt.scatter(d.x, d.y)
xx = np.linspace(d.x.min(),d.x.max(),100)
plt.plot(xx,np.exp(-2.59 + 0.845*xx - 0.0378*xx**2))
plt.show()

f:id:spellm:20160607122145p:plain

説明変数が因子型の場合もRと似た感じでできるっぽい。因子型のみは略

fit_all = smf.glm(formula = 'y ~ x + f', data = d ,family = sm.families.Poisson())
fit_all.fit().summary()

Generalized Linear Model Regression Results
Dep. Variable:	y	No. Observations:	100
Model:	GLM	Df Residuals:	97
Model Family:	Poisson	Df Model:	2
Link Function:	log	Scale:	1.0
Method:	IRLS	Log-Likelihood:	-235.29
Date:	Mon, 06 Jun 2016	Deviance:	84.808
Time:	21:03:37	Pearson chi2:	83.8
No. Iterations:	7

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	1.2631	0.370	3.417	0.001	0.539 1.988
f[T.T]	-0.0320	0.074	-0.430	0.667	-0.178 0.114
x	0.0801	0.037	2.162	0.031	0.007 0.153

6章

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
%matplotlib inline

d = pd.read_csv("data4a.csv")

d.describe()

	N	y	x
count	100.0	100.000000	100.000000
mean	8.0	5.080000	9.967200
std	0.0	2.743882	1.088954
min	8.0	0.000000	7.660000
25%	8.0	3.000000	9.337500
50%	8.0	6.000000	9.965000
75%	8.0	8.000000	10.770000
max	8.0	8.000000	12.440000

d.f.describe()

count     100
unique      2
top         T
freq       50
Name: f, dtype: object

plt.scatter(d.x[d.f == "C"], d.y[d.f == "C"], c = "red",label = "C")
plt.scatter(d.x[d.f == "T"], d.y[d.f == "T"], c = "blue", label = "T")
plt.legend(loc = "upper left")

f:id:spellm:20160607122309p:plain

グラフにラベルがないとあれなんですがmatplotlibでラベルに日本語使うのは少し面倒くさかったことを思い出したので略。横軸が植物の体サイズ$x_i$、縦軸が生存種子数$y_i$、$f$が施肥処理の有無(ありがT)。

今回は二項分布Bi(100,$q$)で$q = \frac{1}{1+\exp(-z_i)}$、線形予測子を$z_i = \beta_1 + \beta_2 x_i + \beta_3 f_i$としてロジスティック回帰を行う。二項分布＋ロジットリンク関数の組み合わせだとオッズの対数が線形予測子そのものなので解釈しやすいっぽい。

d_failure = pd.DataFrame({'z':d.N - d.y})

d = pd.concat([d,d_failure],axis =1)

http://tomoshige-n.hatenablog.com/entry/2014/11/07/202201 によるとcbind(y,N-y)~説明変数を"成功回数+失敗回数~説明変数"に置き換えればいいっぽい。

fit = smf.glm(formula = 'y + z ~ x + f', data = d ,family = sm.families.Binomial()).fit()

fit.summary()

Generalized Linear Model Regression Results
Dep. Variable:	['y', 'z']	No. Observations:	100
Model:	GLM	Df Residuals:	97
Model Family:	Binomial	Df Model:	2
Link Function:	logit	Scale:	1.0
Method:	IRLS	Log-Likelihood:	-133.11
Date:	Tue, 07 Jun 2016	Deviance:	123.03
Time:	11:40:28	Pearson chi2:	109.
No. Iterations:	8

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-19.5361	1.414	-13.818	0.000	-22.307 -16.765
f[T.T]	2.0215	0.231	8.740	0.000	1.568 2.475
x	1.9524	0.139	14.059	0.000	1.680 2.225

xx = np.linspace(d.x.min(),d.x.max(),100)

plt.scatter(d.x[d.f == "C"], d.y[d.f == "C"], c = "red", label = "C")
plt.scatter(d.x[d.f == "T"], d.y[d.f == "T"], c = "blue", label = "T")
y_c = fit.predict({'x': xx, 'f': ['C']*len(xx)})*d.y.max()
y_t = fit.predict({'x': xx, 'f': ['T']*len(xx)})*d.y.max()
plt.plot(xx,y_c, label = "C")
plt.plot(xx,y_t, label = "T" )
plt.legend(loc = "upper left")

f:id:spellm:20160607122331p:plain

交互作用項のある場合: $ \mathrm{logit}(q_i) = \beta_1 + \beta_2 x_i + \beta_3 f_i + \beta_4 x_i f_i $として回帰。本というかRと同じようにx *fと書けばよいっぽい

fit_cross = smf.glm(formula = 'y + z ~ x*f', data = d ,family = sm.families.Binomial()).fit()

fit_cross.summary()

Generalized Linear Model Regression Results
Dep. Variable:	['y', 'z']	No. Observations:	100
Model:	GLM	Df Residuals:	96
Model Family:	Binomial	Df Model:	3
Link Function:	logit	Scale:	1.0
Method:	IRLS	Log-Likelihood:	-132.81
Date:	Tue, 07 Jun 2016	Deviance:	122.43
Time:	11:40:31	Pearson chi2:	109.
No. Iterations:	8

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-18.5233	1.886	-9.821	0.000	-22.220 -14.827
f[T.T]	-0.0638	2.704	-0.024	0.981	-5.363 5.235
x	1.8525	0.186	9.983	0.000	1.489 2.216
x:f[T.T]	0.2163	0.280	0.772	0.440	-0.333 0.765

次はガンマ分布のGLM。配布されているのはRdataファイルなので適当にcsvをwriteしておく。縦軸が花の重量で横軸が葉の重量の架空データらしい。

d_g = pd.read_csv("data_gamma.csv")

plt.scatter(d_g.x,d_g.y)

f:id:spellm:20160607122352p:plain

今度は平均花重量$\mu_i$が(何らかの理由があって)葉重量$x_i$を用いて$\mu = Ax_i^b$と書けると仮定してGLMを行う。

$\log \mu_i = a + b\log x_i$と、$\mu_i$が対数リンク関数と線形予測子$ a + b\log x_i$で与えられることになる。

fit_g = smf.glm("y ~ np.log(x)",data = d_g ,family = sm.families.Gamma(link=sm.families.links.log)).fit()

fit_g.summary()

Generalized Linear Model Regression Results
Dep. Variable:	y	No. Observations:	50
Model:	GLM	Df Residuals:	48
Model Family:	Gamma	Df Model:	1
Link Function:	log	Scale:	0.325084605974
Method:	IRLS	Log-Likelihood:	58.471
Date:	Tue, 07 Jun 2016	Deviance:	17.251
Time:	11:40:36	Pearson chi2:	15.6
No. Iterations:	12

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-1.0403	0.119	-8.759	0.000	-1.273 -0.808
np.log(x)	0.6832	0.068	9.992	0.000	0.549 0.817

xxx = np.linspace(d_g.x.min(),d_g.x.max(),100)
yyy = fit_g.predict({'x': xxx,})*d_g.y.max()
plt.plot(xxx,yyy)
plt.scatter(d_g.x,d_g.y)
plt.ylim([0,0.7])

f:id:spellm:20160607122414p:plain

glmによる平均の予測曲線と合わせた図。本みたいに予測区間も描きたいところだけど面倒くさかったのでパス。http://qiita.com/makora9143/items/7354c61bd3f1df33e791

・雑感

頻繁に書かないからただのIPythonだと対話できるからってそこまで幸せか？みたいなことを思っていたんですがJupyterはすごく便利なんですね。

AICとかGLM自体の話とか最尤法とかは個人的なメモだから既知ということにしてすっ飛ばした。

あとはpandasはともかくstatsmodelsだけ使えても何だかなーだしGLMの最尤推定くらい速くなくていいから自分でやったほうがいい気がした

99ln(99/e)

人生のログ(にしたい)。本のメモや感想を中心に。

『データ解析のための統計モデリング入門』3章と6章覚え書き