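Chapter 13: Introduction to Modeling Libraries in Python

This chapter reviews some pandas features that are helpful when moving between data wrangling with pandas and model fitting and scoring. It then gives short introductions to two popular modeling toolkits, statsmodels and scikit-learn.

13.1 Interfacing Between pandas and Model Code

A common workflow for model development is to use pandas for data loading and cleaning, then switch to a modeling library to build the model itself.

The point of contact between pandas and other analysis libraries is usually NumPy arrays.
To convert a DataFrame to a NumPy array, use the .values attribute (or the equivalent to_numpy() method in newer pandas versions):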

import numpy as np
import pandas as pd


data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]})

print(data)
# x0 x1 y
# 0 1 0.01 -1.5
# 1 2 -0.01 0.0
# 2 3 0.25 3.6
# 3 4 -4.10 1.3
# 4 5 0.00 -2.0

print(data.columns)
# Index(['x0', 'x1', 'y'], dtype='object')


print(data.values)
# [[ 1. 0.01 -1.5 ]
# [ 2. -0.01 0. ]
# [ 3. 0.25 3.6 ]
# [ 4. -4.1 1.3 ]
# [ 5. 0. -2. ]]

# .values returns a NumPy array
print(type(data.values))
# <class 'numpy.ndarray'>


model_cols = ['x0', 'x1']
print(data.loc[:, model_cols].values)
# [[ 1. 0.01]
# [ 2. -0.01]
# [ 3. 0.25]
# [ 4. -4.1 ]
# [ 5. 0. ]]

print(type(data.loc[:, model_cols].values))
# <class 'numpy.ndarray'>


# Use the DataFrame constructor to turn a NumPy array back into a DataFrame
df2 = pd.DataFrame(data.values, columns=['one', 'two', 'three'])
print(df2)
# one two three
# 0 1.0 0.01 -1.5
# 1 2.0 -0.01 0.0
# 2 3.0 0.25 3.6
# 3 4.0 -4.10 1.3
# 4 5.0 0.00 -2.0
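
As a small supplement to the conversion above, the sketch below uses to_numpy(), the method form of .values, and shows how the dtype of the result depends on the columns involved. The label column is a hypothetical addition used only for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]})

# All-numeric columns: the integers are upcast so the array has one float dtype
numeric = data.to_numpy()
print(numeric.dtype)

# Hypothetical string column added for illustration: numbers and strings
# only share the generic Python object dtype
mixed = data.assign(label=['a', 'b', 'a', 'a', 'b']).to_numpy()
print(mixed.dtype)

# Selecting only the model columns keeps the array numeric
features = data[['x0', 'x1']].to_numpy(dtype=float)
print(features.shape)
```

An object array is usually a sign that some non-numeric column still needs to be encoded before it is handed to a modeling library.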

Adding and removing dummy variables:

import numpy as np
import pandas as pd


data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]})


# The category column is non-numeric
data['category'] = pd.Categorical(['a', 'b', 'a', 'a', 'b'],
                                  categories=['a', 'b'])
print(data)
# x0 x1 y category
# 0 1 0.01 -1.5 a
# 1 2 -0.01 0.0 b
# 2 3 0.25 3.6 a
# 3 4 -4.10 1.3 a
# 4 5 0.00 -2.0 b


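# Create dummy variables from the category column
# (pandas 2.0+ returns boolean columns by default; pass dtype=int
# to get the 0/1 output shown below)
dummies = pd.get_dummies(data.category, prefix='category')
# Drop the category column, then join the dummy variables
data_with_dummies = data.drop('category', axis=1).join(dummies)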
print(data_with_dummies)
# x0 x1 y category_a category_b
# 0 1 0.01 -1.5 1 0
# 1 2 -0.01 0.0 0 1
# 2 3 0.25 3.6 1 0
# 3 4 -4.10 1.3 1 0
# 4 5 0.00 -2.0 0 1
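
When the model will also contain an intercept, keeping a column for every level makes the columns linearly dependent. A minimal sketch, assuming you want pandas to leave out one level for you, uses the drop_first and dtype arguments of get_dummies:

```python
import pandas as pd

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'category': pd.Categorical(['a', 'b', 'a', 'a', 'b'],
                               categories=['a', 'b'])})

# drop_first=True omits the first level ('a'); dtype=int gives 0/1 columns
dummies = pd.get_dummies(data['category'], prefix='category',
                         drop_first=True, dtype=int)
encoded = data.drop('category', axis=1).join(dummies)
print(encoded)
```

The remaining category_b column is then read relative to the omitted level 'a'.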

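13.2 Creating Model Descriptions with Patsy

Patsy is a Python library for describing statistical models (especially linear models) with a short string-based "formula syntax".

Patsy is well suited to specifying linear models in statsmodels. A Patsy formula is a special string syntax that looks like this: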

y ~ x0 + x1

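y ~ x0 + x1 is called a formula string.
The syntax a + b does not mean adding a to b; it means that a and b are both terms in the design matrix created for the model.

The patsy.dmatrices function takes a formula string along with a dataset (which can be a DataFrame or a dictionary of arrays) and produces the design matrices for a linear model: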

import numpy as np
import pandas as pd
import patsy


data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]})
print(data)
# x0 x1 y
# 0 1 0.01 -1.5
# 1 2 -0.01 0.0
# 2 3 0.25 3.6
# 3 4 -4.10 1.3
# 4 5 0.00 -2.0

# Create the design matrices for the model
y, X = patsy.dmatrices('y ~ x0 + x1', data)

print(y)
# DesignMatrix with shape (5, 1)
# y
# -1.5
# 0.0
# 3.6
# 1.3
# -2.0
# Terms:
# 'y' (column 0)


# X includes an Intercept column
print(X)
# DesignMatrix with shape (5, 3)
# Intercept x0 x1
# 1 1 0.01
# 1 2 -0.01
# 1 3 0.25
# 1 4 -4.10
# 1 5 0.00
# Terms:
# 'Intercept' (column 0)
# 'x0' (column 1)
# 'x1' (column 2)


print(type(y))
# <class 'patsy.design_info.DesignMatrix'>

print(type(X))
# <class 'patsy.design_info.DesignMatrix'>

These Patsy DesignMatrix instances are NumPy ndarrays with additional metadata:


# Convert to plain ndarray objects
print(np.asarray(y))
# [[-1.5]
# [ 0. ]
# [ 3.6]
# [ 1.3]
# [-2. ]]

print(np.asarray(X))
# [[ 1. 1. 0.01]
# [ 1. 2. -0.01]
# [ 1. 3. 0.25]
# [ 1. 4. -4.1 ]
# [ 1. 5. 0. ]]


print(type(np.asarray(y)))
# <class 'numpy.ndarray'>

print(type(np.asarray(X)))
# <class 'numpy.ndarray'>
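
To make the contents of these arrays concrete, the sketch below builds the same matrices for y ~ x0 + x1 by hand with NumPy. It is only an illustration of the layout (a column of ones followed by the predictors), not a description of how Patsy works internally; the names X_manual and y_manual are made up for this example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]})

# Intercept column of ones followed by the predictor columns
X_manual = np.column_stack([
    np.ones(len(data)),
    data['x0'].to_numpy(),
    data['x1'].to_numpy()])

# Response as a column vector, matching the (5, 1) shape shown above
y_manual = data[['y']].to_numpy()

print(X_manual)
print(y_manual.shape)
```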

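As noted above, X contains an Intercept column.
This is a convention for linear models such as ordinary least squares (OLS) regression. You can suppress the intercept by adding the term + 0 to the model: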

print(patsy.dmatrices('y ~ x0 + x1 + 0', data)[1])
# DesignMatrix with shape (5, 2)
# x0 x1
# 1 0.01
# 2 -0.01
# 3 0.25
# 4 -4.10
# 5 0.00
# Terms:
# 'x0' (column 0)
# 'x1' (column 1)

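Patsy objects can be passed directly into algorithms such as numpy.linalg.lstsq, which performs an ordinary least squares regression.
The model metadata is retained in the design_info attribute, so you can reattach the column names to the fitted coefficients to obtain a Series: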

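# Pass the Patsy objects directly to the algorithm
# (rcond=None selects the machine-precision default and silences the
# FutureWarning raised by older NumPy versions)
coef, resid, _, _ = np.linalg.lstsq(X, y, rcond=None)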

print(coef)
# [[ 0.31290976]
# [-0.07910564]
# [-0.26546384]]


# The model metadata is kept in design_info, so the column names can be
# reattached to the fitted coefficients to obtain a Series
coef = pd.Series(coef.squeeze(), index=X.design_info.column_names)
print(coef)
# Intercept 0.312910
# x0 -0.079106
# x1 -0.265464
# dtype: float64
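
As a cross-check on the fit above, the following sketch solves the same least squares problem twice with plain NumPy: once with lstsq and once through the normal equations (X'X)b = X'y. The design matrix is built by hand here so that the example does not depend on Patsy; the variable names are chosen for this illustration only.

```python
import numpy as np
import pandas as pd

x0 = np.array([1, 2, 3, 4, 5], dtype=float)
x1 = np.array([0.01, -0.01, 0.25, -4.1, 0.])
y = np.array([-1.5, 0., 3.6, 1.3, -2.])

# Intercept, x0, x1: the same column order as the design matrix above
X = np.column_stack([np.ones_like(x0), x0, x1])

# Least squares via lstsq
coef_lstsq, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Least squares via the normal equations (fine for a small,
# well-conditioned problem like this one)
coef_normal = np.linalg.solve(X.T @ X, X.T @ y)

coef = pd.Series(coef_lstsq, index=['Intercept', 'x0', 'x1'])
print(coef)
print(np.allclose(coef_lstsq, coef_normal))
```

For larger or ill-conditioned problems lstsq is generally the safer choice, since forming X'X squares the condition number.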

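Data Transformations in Patsy Formulas

You can mix Python code into your Patsy formulas; when the formula is evaluated, Patsy looks up the functions you use in the enclosing scope: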

import numpy as np
import pandas as pd
import patsy


data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]})
print(data)
# x0 x1 y
# 0 1 0.01 -1.5
# 1 2 -0.01 0.0
# 2 3 0.25 3.6
# 3 4 -4.10 1.3
# 4 5 0.00 -2.0


# Mix Python code into the Patsy formula
y, X = patsy.dmatrices('y ~ x0 + np.log(np.abs(x1) + 1)', data)
print(X)
# DesignMatrix with shape (5, 3)
# Intercept x0 np.log(np.abs(x1) + 1)
# 1 1 0.00995
# 1 2 0.00995
# 1 3 0.22314
# 1 4 1.62924
# 1 5 0.00000
# Terms:
# 'Intercept' (column 0)
# 'x0' (column 1)
# 'np.log(np.abs(x1) + 1)' (column 2)


# Use Patsy's built-in transformation functions:
# standardize (mean 0, variance 1) and center (subtract the mean)
y, X = patsy.dmatrices('y ~ standardize(x0) + center(x1)', data)
print(X)
# DesignMatrix with shape (5, 3)
# Intercept standardize(x0) center(x1)
# 1 -1.41421 0.78
# 1 -0.70711 0.76
# 1 0.00000 1.02
# 1 0.70711 -3.33
# 1 1.41421 0.77
# Terms:
# 'Intercept' (column 0)
# 'standardize(x0)' (column 1)
# 'center(x1)' (column 2)
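
The two built-in transformations can be reproduced with a few lines of NumPy, which makes it clear what the columns above contain. This is a sketch of the arithmetic only; the population standard deviation (ddof=0) is used because it reproduces the values printed above.

```python
import numpy as np

x0 = np.array([1, 2, 3, 4, 5], dtype=float)
x1 = np.array([0.01, -0.01, 0.25, -4.1, 0.])

# center: subtract the mean
centered_x1 = x1 - x1.mean()

# standardize: subtract the mean and divide by the standard deviation
standardized_x0 = (x0 - x0.mean()) / x0.std(ddof=0)

print(standardized_x0)
print(centered_x1)
```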

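Transformations such as center and standardize are stateful: when the model is applied to new data, the mean and standard deviation of the original dataset must be used. The patsy.build_design_matrices function applies the transformations to new data using the information saved from the original sample: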

new_data = pd.DataFrame({
    'x0': [6, 7, 8, 9],
    'x1': [3.1, -0.5, 0, 2.3],
    'y': [1, 2, 3, 4]})
new_X = patsy.build_design_matrices([X.design_info], new_data)
print(new_X)
# [DesignMatrix with shape (4, 3)
# Intercept standardize(x0) center(x1)
# 1 2.12132 3.87
# 1 2.82843 0.27
# 1 3.53553 0.77
# 1 4.24264 3.07
# Terms:
# 'Intercept' (column 0), 'standardize(x0)' (column 1), 'center(x1)' (column 2)]
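
The idea of reusing saved statistics can be sketched without Patsy. The SavedStandardize class below is a hypothetical helper written for this illustration (it is not part of Patsy or NumPy): it remembers the mean and standard deviation of the original data and applies them to new values.

```python
import numpy as np


class SavedStandardize:
    """Hypothetical helper: remembers the mean and std of the original data."""

    def fit(self, values):
        values = np.asarray(values, dtype=float)
        self.mean_ = values.mean()
        self.std_ = values.std(ddof=0)
        return self

    def transform(self, values):
        return (np.asarray(values, dtype=float) - self.mean_) / self.std_


# Statistics come from the original x0 column, then are applied to new data
scaler = SavedStandardize().fit([1, 2, 3, 4, 5])
new_x0 = scaler.transform([6, 7, 8, 9])
print(new_x0)

# Recomputing the statistics on the new data alone gives a different,
# incompatible scale
naive = SavedStandardize().fit([6, 7, 8, 9]).transform([6, 7, 8, 9])
print(naive)
```

The first result matches the standardize(x0) column printed above, while the naive version would silently feed the model values on a different scale from the one it was fitted on.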

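Because the plus symbol (+) in a Patsy formula does not mean addition, you must wrap the expression in the special I function when you want to add dataset columns by name: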

y, X = patsy.dmatrices('y ~ I(x0 + x1)', data)
print(X)
# DesignMatrix with shape (5, 2)
# Intercept I(x0 + x1)
# 1 1.01
# 1 1.99
# 1 3.25
# 1 -0.10
# 1 5.00
# Terms:
# 'Intercept' (column 0)
# 'I(x0 + x1)' (column 1)

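Categorical Data and Patsy

When you use non-numeric terms in a Patsy formula, they are converted to dummy variables by default. If there is an intercept, one of the levels is left out to avoid collinearity: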

import pandas as pd
import numpy as np
import patsy


data = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
    'key2': [0, 1, 0, 1, 0, 1, 0, 0],
    'v1': [1, 2, 3, 4, 5, 6, 7, 8],
    'v2': [-1, 0, 2.5, -0.5, 4.0, -1.2, 0.2, -1.7]
})
y, X = patsy.dmatrices('v2 ~ key1', data)
print(X)
# DesignMatrix with shape (8, 2)
# Intercept key1[T.b]
# 1 0
# 1 0
# 1 1
# 1 1
# 1 0
# 1 1
# 1 0
# 1 1
# Terms:
# 'Intercept' (column 0)
# 'key1' (column 1)

If you omit the intercept from the model, columns for every category value are included in the design matrix:

import pandas as pd
import numpy as np
import patsy


data = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
    'key2': [0, 1, 0, 1, 0, 1, 0, 0],
    'v1': [1, 2, 3, 4, 5, 6, 7, 8],
    'v2': [-1, 0, 2.5, -0.5, 4.0, -1.2, 0.2, -1.7]
})
y, X = patsy.dmatrices('v2 ~ key1 + 0', data)
print(X)
# DesignMatrix with shape (8, 2)
# key1[a] key1[b]
# 1 0
# 1 0
# 0 1
# 0 1
# 1 0
# 0 1
# 1 0
# 0 1
# Terms:
# 'key1' (columns 0:2)



# Numeric columns can be interpreted as categorical with the C function:
y, X = patsy.dmatrices('v2 ~ C(key2)', data)
print(X)
# DesignMatrix with shape (8, 2)
# Intercept C(key2)[T.1]
# 1 0
# 1 1
# 1 0
# 1 1
# 1 0
# 1 1
# 1 0
# 1 0
# Terms:
# 'Intercept' (column 0)
# 'C(key2)' (column 1)

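When you use multiple categorical terms in a model, things can get more complicated, because you can include interaction terms of the form key1:key2, which can be used, for example, in analysis of variance (ANOVA) models: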

import pandas as pd
import numpy as np
import patsy


data = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
    'key2': [0, 1, 0, 1, 0, 1, 0, 0],
    'v1': [1, 2, 3, 4, 5, 6, 7, 8],
    'v2': [-1, 0, 2.5, -0.5, 4.0, -1.2, 0.2, -1.7]
})


data['key2'] = data['key2'].map({0: 'zero', 1: 'one'})
print(data)
# key1 key2 v1 v2
# 0 a zero 1 -1.0
# 1 a one 2 0.0
# 2 b zero 3 2.5
# 3 b one 4 -0.5
# 4 a zero 5 4.0
# 5 b one 6 -1.2
# 6 a zero 7 0.2
# 7 b zero 8 -1.7


y, X = patsy.dmatrices('v2 ~ key1 + key2', data)
print(X)
# DesignMatrix with shape (8, 3)
# Intercept key1[T.b] key2[T.zero]
# 1 0 1
# 1 0 0
# 1 1 1
# 1 1 0
# 1 0 1
# 1 1 0
# 1 0 1
# 1 1 1
# Terms:
# 'Intercept' (column 0)
# 'key1' (column 1)
# 'key2' (column 2)


y, X = patsy.dmatrices('v2 ~ key1 + key2 + key1:key2', data)
print(X)
# DesignMatrix with shape (8, 4)
# Intercept key1[T.b] key2[T.zero] key1[T.b]:key2[T.zero]
# 1 0 1 0
# 1 0 0 0
# 1 1 1 1
# 1 1 0 0
# 1 0 1 0
# 1 1 0 0
# 1 0 1 0
# 1 1 1 1
# Terms:
# 'Intercept' (column 0)
# 'key1' (column 1)
# 'key2' (column 2)
# 'key1:key2' (column 3)
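
The interaction column in the last design matrix is simply the elementwise product of the two indicator columns. A minimal sketch with pandas and NumPy (the names key1_b, key2_zero, and X_manual are made up for this example) reproduces it by hand:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
    'key2': ['zero', 'one', 'zero', 'one', 'zero', 'one', 'zero', 'zero']})

# Indicator columns matching the treatment-coded terms shown above
key1_b = (data['key1'] == 'b').astype(int)
key2_zero = (data['key2'] == 'zero').astype(int)

# The interaction column is 1 only where both indicators are 1
interaction = key1_b * key2_zero

X_manual = np.column_stack([np.ones(len(data), dtype=int),
                            key1_b, key2_zero, interaction])
print(X_manual)
```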

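13.3 Introduction to statsmodels

statsmodels is a Python library for fitting many kinds of statistical models, performing statistical tests, and data exploration and visualization. statsmodels contains many classical statistical methods, but not Bayesian methods or machine learning models.

Some kinds of models found in statsmodels include:

  • Linear models, generalized linear models, and robust linear models
  • Linear mixed effects models
  • Analysis of variance (ANOVA) methods
  • Time series processes and state space models
  • Generalized method of moments

Estimating Linear Models

statsmodels has two main interfaces for linear models: array-based and formula-based. They are accessed through these API module imports: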

import statsmodels.api as sm
import statsmodels.formula.api as smf

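The following example generates random data from a known linear model and fits it with both interfaces: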

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf


def dnorm(mean, variance, size=1):
    # Draw normally distributed samples with the given mean and variance
    if isinstance(size, int):
        size = size,
    return mean + np.sqrt(variance) * np.random.randn(*size)


np.random.seed(12345)

N = 100
X = np.c_[dnorm(0, 0.4, size=N),
          dnorm(0, 0.6, size=N),
          dnorm(0, 0.2, size=N)]
eps = dnorm(0, 0.1, size=N)
beta = [0.1, 0.3, 0.5]

y = np.dot(X, beta) + eps


print(X[:5])
# [[-0.12946849 -1.21275292 0.50422488]
# [ 0.30291036 -0.43574176 -0.25417986]
# [-0.32852189 -0.02530153 0.13835097]
# [-0.35147471 -0.71960511 -0.25821463]
# [ 1.2432688 -0.37379916 -0.52262905]]

print(y[:5])
# [ 0.42786349 -0.67348041 -0.09087764 -0.48949442 -0.12894109]



# sm.add_constant adds an intercept column to an existing matrix:
X_model = sm.add_constant(X)
print(X_model[:5])
# [[ 1. -0.12946849 -1.21275292 0.50422488]
# [ 1. 0.30291036 -0.43574176 -0.25417986]
# [ 1. -0.32852189 -0.02530153 0.13835097]
# [ 1. -0.35147471 -0.71960511 -0.25821463]
# [ 1. 1.2432688 -0.37379916 -0.52262905]]


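# The sm.OLS class fits an ordinary least squares linear regression.
# Note that X (without the intercept column) is passed here, which is
# why three coefficients are printed below:
model = sm.OLS(y, X)

# The fit method returns a regression results object containing the
# estimated model parameters and other diagnostics: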
results = model.fit()
print(results.params)
# [0.17826108 0.22303962 0.50095093]


# The summary method on the results prints detailed diagnostic output for the model:
print(results.summary())
# OLS Regression Results
# ==============================================================================
# Dep. Variable: y R-squared: 0.430
# Model: OLS Adj. R-squared: 0.413
# Method: Least Squares F-statistic: 24.42
# Date: Tue, 14 May 2019 Prob (F-statistic): 7.44e-12
# Time: 17:12:00 Log-Likelihood: -34.305
# No. Observations: 100 AIC: 74.61
# Df Residuals: 97 BIC: 82.42
# Df Model: 3
# Covariance Type: nonrobust
# ==============================================================================
# coef std err t P>|t| [0.025 0.975]
# ------------------------------------------------------------------------------
# x1 0.1783 0.053 3.364 0.001 0.073 0.283
# x2 0.2230 0.046 4.818 0.000 0.131 0.315
# x3 0.5010 0.080 6.237 0.000 0.342 0.660
# ==============================================================================
# Omnibus: 4.662 Durbin-Watson: 2.201
# Prob(Omnibus): 0.097 Jarque-Bera (JB): 4.098
# Skew: 0.481 Prob(JB): 0.129
# Kurtosis: 3.243 Cond. No. 1.74
# ==============================================================================
# Warnings:
# [1] Standard Errors assume that the covariance matrix of the errors is correctly
# specified.


data = pd.DataFrame(X, columns=['col0', 'col1', 'col2'])
data['y'] = y
print(data[:5])
# col0 col1 col2 y
# 0 -0.129468 -1.212753 0.504225 0.427863
# 1 0.302910 -0.435742 -0.254180 -0.673480
# 2 -0.328522 -0.025302 0.138351 -0.090878
# 3 -0.351475 -0.719605 -0.258215 -0.489494
# 4 1.243269 -0.373799 -0.522629 -0.128941


# Use the statsmodels formula API with a Patsy formula string:
results = smf.ols('y ~ col0 + col1 + col2', data=data).fit()
print(results.params)
# Intercept 0.033559
# col0 0.176149
# col1 0.224826
# col2 0.514808
# dtype: float64

print(results.tvalues)
# Intercept 0.952188
# col0 3.319754
# col1 4.850730
# col2 6.303971
# dtype: float64


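# Given new out-of-sample data, you can compute predicted values from the
# estimated model parameters (the first five rows of the training data
# are reused here for illustration):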
print(results.predict(data[:5]))
# 0 -0.002327
# 1 -0.141904
# 2 0.041226
# 3 -0.323070
# 4 -0.100535
# dtype: float64
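
To show where quantities such as params, tvalues, and the predictions come from, the sketch below recomputes them with NumPy for an ordinary least squares fit, assuming the default (nonrobust) covariance estimate. The data is simulated with a different random generator than the one used above, so the numbers will not match the printed output.

```python
import numpy as np

rng = np.random.default_rng(12345)
N = 100
X = rng.normal(size=(N, 3))
beta = np.array([0.1, 0.3, 0.5])
y = X @ beta + rng.normal(scale=np.sqrt(0.1), size=N)

# Add the intercept column and solve the least squares problem
X_model = np.column_stack([np.ones(N), X])
coef, _, _, _ = np.linalg.lstsq(X_model, y, rcond=None)

# Residual variance with N - k degrees of freedom
resid = y - X_model @ coef
dof = N - X_model.shape[1]
sigma2 = resid @ resid / dof

# Standard errors from the diagonal of sigma^2 * (X'X)^-1
cov = sigma2 * np.linalg.inv(X_model.T @ X_model)
std_err = np.sqrt(np.diag(cov))
t_values = coef / std_err

# Predictions are the design matrix times the coefficients
predictions = X_model[:5] @ coef

print(coef)
print(t_values)
print(predictions)
```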

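Estimating Time Series Processes

Another class of models in statsmodels is for time series analysis. Among these are autoregressive processes, Kalman filtering and other state space models, and multivariate autoregressive models.

Let's simulate some time series data with an autoregressive structure and noise: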

import random

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf


def dnorm(mean, variance, size=1):
    # Draw normally distributed samples with the given mean and variance
    if isinstance(size, int):
        size = size,
    return mean + np.sqrt(variance) * np.random.randn(*size)


np.random.seed(12345)
init_x = 4
values = [init_x, init_x]
N = 1000

b0 = 0.8
b1 = -0.4
noise = dnorm(0, 0.1, N)
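for i in range(N):
    new_x = values[-1] * b0 + values[-2] * b1 + noise[i]
    values.append(new_x)

# Fit an AR model with up to MAXLAGS lags.
# Note: sm.tsa.AR was deprecated and later removed from statsmodels;
# on recent versions use statsmodels.tsa.ar_model.AutoReg instead.
MAXLAGS = 5
model = sm.tsa.AR(values)
results = model.fit(MAXLAGS)


# The estimated parameters have the intercept first, followed by the
# estimates for each lag (the first two are close to b0 and b1)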
print(results.params)
# [-0.0008394 0.7990018 -0.41539155 -0.00639167 -0.00286703 0.01663687]
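
For intuition about what the fit recovers, an AR(2) model can also be estimated with ordinary least squares by regressing each value on its two predecessors. The sketch below uses only NumPy and its own random generator, so its numbers will differ from the statsmodels output above; it is an illustration of the idea, not a replacement for the statsmodels estimators.

```python
import numpy as np

rng = np.random.default_rng(12345)
N = 5000
b0, b1 = 0.8, -0.4

# Simulate x[t] = b0 * x[t-1] + b1 * x[t-2] + noise
values = [4.0, 4.0]
noise = rng.normal(scale=np.sqrt(0.1), size=N)
for i in range(N):
    values.append(values[-1] * b0 + values[-2] * b1 + noise[i])
series = np.asarray(values)

# Regress x[t] on an intercept, x[t-1], and x[t-2]
target = series[2:]
lagged = np.column_stack([
    np.ones(len(target)),
    series[1:-1],   # lag 1
    series[:-2]])   # lag 2
params, _, _, _ = np.linalg.lstsq(lagged, target, rcond=None)
print(params)
```

The intercept estimate should be near zero and the two lag coefficients near b0 and b1, mirroring the pattern in the first three entries of results.params.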

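13.4 Introduction to scikit-learn

scikit-learn is one of the most widely used general-purpose Python machine learning toolkits.
It contains a broad selection of standard supervised and unsupervised machine learning methods, along with tools for model selection and evaluation, data transformation, data loading, and model persistence.
These models can be used for classification, clustering, prediction, and other common tasks.

The example below works through the Titanic passenger survival dataset: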

import numpy as np
import pandas as pd


train = pd.read_csv('datasets/titanic/train.csv')
test = pd.read_csv('datasets/titanic/test.csv')
print(train[:4])
# PassengerId Survived Pclass ... Fare Cabin Embarked
# 0 1 0 3 ... 7.2500 NaN S
# 1 2 1 1 ... 71.2833 C85 C
# 2 3 1 3 ... 7.9250 NaN S
# 3 4 1 1 ... 53.1000 C123 S

# [4 rows x 12 columns]


# Libraries like statsmodels and scikit-learn generally cannot be fed
# missing data, so check which columns contain missing values:
print(train.isnull().sum())
# PassengerId 0
# Survived 0
# Pclass 0
# Name 0
# Sex 0
# Age 177
# SibSp 0
# Parch 0
# Ticket 0
# Fare 0
# Cabin 687
# Embarked 2
# dtype: int64

print(test.isnull().sum())
# PassengerId 0
# Pclass 0
# Name 0
# Sex 0
# Age 86
# SibSp 0
# Parch 0
# Ticket 0
# Fare 1
# Cabin 327
# Embarked 0
# dtype: int64


# We would like to use Age as a predictor, but it has missing values.
# Fill the nulls in both tables with the median of the training dataset:
impute_value = train['Age'].median()
train['Age'] = train['Age'].fillna(impute_value)
test['Age'] = test['Age'].fillna(impute_value)


# Now specify the model. Add an IsFemale column as an encoded version of the 'Sex' column
train['IsFemale'] = (train['Sex'] == 'female').astype(int)
test['IsFemale'] = (test['Sex'] == 'female').astype(int)


# Decide on the model variables and create NumPy arrays:
predictors = ['Pclass', 'IsFemale', 'Age']
X_train = train[predictors].values
X_test = test[predictors].values
y_train = train['Survived'].values
print(X_train[:5])
# [[ 3. 0. 22.]
# [ 1. 1. 38.]
# [ 3. 1. 26.]
# [ 1. 1. 35.]
# [ 3. 0. 35.]]

print(y_train[:5])
# [0 1 1 1 0]


from sklearn.linear_model import LogisticRegression
# Create an instance of scikit-learn's LogisticRegression model:
model = LogisticRegression()


# As with statsmodels, fit the model to the training data using the model's fit method:
print(model.fit(X_train, y_train))
# LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
# intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
# penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
# verbose=0, warm_start=False)


# Form predictions for the test dataset using model.predict:
y_predict = model.predict(X_test)
print(y_predict[:10])
# [0 0 0 0 1 0 1 0 1 0]


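# The LogisticRegressionCV class takes a parameter (Cs) that controls how
# fine a grid search to do on the model regularization parameter C.
# Recent scikit-learn versions require it to be passed by keyword:
from sklearn.linear_model import LogisticRegressionCV
model_cv = LogisticRegressionCV(Cs=10)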
print(model_cv.fit(X_train, y_train))
# LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
# fit_intercept=True, intercept_scaling=1.0, max_iter=100,
# multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
# refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)


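# To do cross-validation by hand, you can use the cross_val_score helper
# function, which handles the data splitting process.

# Cross-validate our model with four non-overlapping splits of the training data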
from sklearn.model_selection import cross_val_score
model = LogisticRegression(C=10)
scores = cross_val_score(model, X_train, y_train, cv=4)
print(scores)
# [0.77232143 0.80269058 0.77027027 0.78828829]
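
To illustrate what "non-overlapping splits" means, the sketch below generates the index sets for a plain k-fold split with NumPy. The kfold_indices function is a hypothetical helper written for this example; scikit-learn's own splitters differ in details (for classifiers with an integer cv, a stratified variant is used), so this is a simplified picture rather than the library's implementation.

```python
import numpy as np


def kfold_indices(n_samples, n_splits):
    """Hypothetical helper: yield (train_idx, test_idx) for contiguous folds."""
    indices = np.arange(n_samples)
    folds = np.array_split(indices, n_splits)
    for k in range(n_splits):
        test_idx = folds[k]
        # Every fold except the k-th one is used for training
        train_idx = np.concatenate(folds[:k] + folds[k + 1:])
        yield train_idx, test_idx


splits = list(kfold_indices(10, 4))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Each sample appears in exactly one test fold, so every score reported by a cross-validation run of this kind is computed on data the model did not see while being fitted for that fold.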