Chapter 13: Introduction to Modeling Libraries in Python
13.1 Interfacing Between pandas and Model Code
import pandas as pd
import numpy as np
data = pd.DataFrame({
'x0': [1, 2, 3, 4, 5],
'x1': [0.01, -0.01, 0.25, -4.1, 0.],
'y': [-1.5, 0., 3.6, 1.3, -2.]})
print(data)
x0 x1 y
0 1 0.01 -1.5
1 2 -0.01 0.0
2 3 0.25 3.6
3 4 -4.10 1.3
4 5 0.00 -2.0
data.columns
Index(['x0', 'x1', 'y'], dtype='object')
data.values
array([[ 1. , 0.01, -1.5 ],
[ 2. , -0.01, 0. ],
[ 3. , 0.25, 3.6 ],
[ 4. , -4.1 , 1.3 ],
[ 5. , 0. , -2. ]])
df2 = pd.DataFrame(data.values, columns=['one', 'two', 'three'])
print(df2)
one two three
0 1.0 0.01 -1.5
1 2.0 -0.01 0.0
2 3.0 0.25 3.6
3 4.0 -4.10 1.3
4 5.0 0.00 -2.0
data['category'] = pd.Categorical(['a', 'b', 'a', 'a', 'b'],
categories=['a', 'b'])
print(data)
x0 x1 y category
0 1 0.01 -1.5 a
1 2 -0.01 0.0 b
2 3 0.25 3.6 a
3 4 -4.10 1.3 a
4 5 0.00 -2.0 b
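13.2 Creating Model Descriptions with Patsy
Patsy is a library for describing statistical models, especially linear models, and statsmodels uses it for its formula interface.
A Patsy formula is a string with a special syntax: y ~ x0 + x1.
Adding + 0 to the formula suppresses the intercept: y ~ x0 + x1 + 0.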
data = pd.DataFrame({
'x0': [1, 2, 3, 4, 5],
'x1': [0.01, -0.01, 0.25, -4.1, 0.],
'y': [-1.5, 0., 3.6, 1.3, -2.]})
import patsy
y, X = patsy.dmatrices('y ~ x0 + x1', data)
patsy.dmatrices('y ~ x0 + x1 + 0', data)[1]
DesignMatrix with shape (5, 2)
x0 x1
1 0.01
2 -0.01
3 0.25
4 -4.10
5 0.00
Terms:
'x0' (column 0)
'x1' (column 1)
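The design matrices that Patsy returns are NumPy array subclasses, so they can be handed straight to numerical routines. The following is a minimal sketch, assuming the same small data frame as above, that fits y ~ x0 + x1 by least squares with np.linalg.lstsq and then labels the coefficients with the column names stored in X.design_info. The choice of np.linalg.lstsq is only for illustration; it is not the only way to fit this model.

```python
import numpy as np
import pandas as pd
import patsy

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]})

# Build the response vector and the design matrix (with an intercept column).
y, X = patsy.dmatrices('y ~ x0 + x1', data)

# Ordinary least squares fit using plain NumPy.
coef, resid, rank, sv = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)

# Reattach the column names that Patsy recorded in the design metadata.
coef = pd.Series(coef.squeeze(), index=X.design_info.column_names)
print(coef)
```

The design_info attribute is what keeps the link between the numeric columns and the terms of the formula.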
Data Transformations in Patsy Formulas
y, X = patsy.dmatrices('y ~ x0 + np.log(np.abs(x1) + 1)', data)
print(X)
[[1. 1. 0.00995033]
[1. 2. 0.00995033]
[1. 3. 0.22314355]
[1. 4. 1.62924054]
[1. 5. 0. ]]
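The patsy.build_design_matrices function transforms new data using the information saved from the original dataset.
Because the plus sign in a Patsy formula does not mean addition, adding columns of the dataset by name requires wrapping the expression in the special I function.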
new_data = pd.DataFrame({
'x0': [6, 7, 8, 9],
'x1': [3.1, -0.5, 0, 2.3],
'y': [1, 2, 3, 4]})
new_X = patsy.build_design_matrices([X.design_info], new_data)
print(new_X)
[DesignMatrix with shape (4, 3)
Intercept x0 np.log(np.abs(x1) + 1)
1 6 1.41099
1 7 0.40547
1 8 0.00000
1 9 1.19392
Terms:
'Intercept' (column 0)
'x0' (column 1)
'np.log(np.abs(x1) + 1)' (column 2)]
y, X = patsy.dmatrices('y ~ I(x0 + x1)', data)
print(X)
[[ 1. 1.01]
[ 1. 1.99]
[ 1. 3.25]
[ 1. -0.1 ]
[ 1. 5. ]]
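The point about build_design_matrices reusing saved information is easiest to see with a stateful transform. The sketch below uses Patsy's built-in center function, which subtracts the mean; the small train and new frames are made up for illustration. When the new data is transformed, the mean of the original x0 (which is 3) is subtracted, not the mean of the new values.

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical training and new data, used only for this illustration.
train = pd.DataFrame({'x0': [1, 2, 3, 4, 5],
                      'y': [-1.5, 0., 3.6, 1.3, -2.]})
new = pd.DataFrame({'x0': [6, 7, 8, 9]})

# center(x0) subtracts the mean of x0 as seen in the training data.
y, X = patsy.dmatrices('y ~ center(x0)', train)

# Reuse the saved design information to transform the new data.
new_X = patsy.build_design_matrices([X.design_info], new)[0]

centered_train = np.asarray(X)[:, 1]
centered_new = np.asarray(new_X)[:, 1]
print(centered_train)
print(centered_new)
```

If the new data were centered with its own mean (7.5), the columns would no longer be comparable with those the model was fitted on.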
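Categorical Data and Patsy
When you use non-numeric data in a Patsy formula, it is converted to dummy variables by default. If the model has an intercept, one of the levels is dropped to avoid collinearity.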
data = pd.DataFrame({
'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
'key2': [0, 1, 0, 1, 0, 1, 0, 0],
'v1': [1, 2, 3, 4, 5, 6, 7, 8],
'v2': [-1, 0, 2.5, -0.5, 4.0, -1.2, 0.2, -1.7]
})
y, X = patsy.dmatrices('v2 ~ key1', data)
print(X)
[[1. 0.]
[1. 0.]
[1. 1.]
[1. 1.]
[1. 0.]
[1. 1.]
[1. 0.]
[1. 1.]]
y, X = patsy.dmatrices('v2 ~ key1 + 0', data)
print(X)
[[1. 0.]
[1. 0.]
[0. 1.]
[0. 1.]
[1. 0.]
[0. 1.]
[1. 0.]
[0. 1.]]
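The collinearity that motivates dropping one level can be checked numerically. This is a minimal sketch that uses pandas.get_dummies and NumPy rather than Patsy (a substitution made only to keep the example small): with an intercept plus one indicator column per level, the indicator columns sum to the intercept column, so the matrix loses rank.

```python
import numpy as np
import pandas as pd

key1 = pd.Series(['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'], name='key1')

# One indicator column per level (nothing dropped).
full = pd.get_dummies(key1, dtype=float)
# Drop the first level, mirroring what happens when an intercept is present.
reduced = pd.get_dummies(key1, drop_first=True, dtype=float)

intercept = np.ones((len(key1), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

# Compare the number of columns with the matrix rank in each case.
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)
print(X_reduced.shape[1], rank_reduced)
```

X_full has three columns but only two of them are linearly independent, while X_reduced has full column rank.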
# Using more than one categorical term
data['key2'] = data['key2'].map({0: 'zero', 1: 'one'})
y, X = patsy.dmatrices('v2 ~ key1 + key2', data)
print(X)
[[1. 0. 1.]
[1. 0. 0.]
[1. 1. 1.]
[1. 1. 0.]
[1. 0. 1.]
[1. 1. 0.]
[1. 0. 1.]
[1. 1. 1.]]
y, X = patsy.dmatrices('v2 ~ key1 + key2 + key1:key2', data)
print(X)
[[1. 0. 1. 0.]
[1. 0. 0. 0.]
[1. 1. 1. 1.]
[1. 1. 0. 0.]
[1. 0. 1. 0.]
[1. 1. 0. 0.]
[1. 0. 1. 0.]
[1. 1. 1. 1.]]
13.3 Introduction to statsmodels
Estimating Linear Models
Linear models in statsmodels have two different interfaces: array-based and formula-based.
import statsmodels.api as sm
import statsmodels.formula.api as smf
def dnorm(mean, variance, size=1):
    if isinstance(size, int):
        size = size,
    return mean + np.sqrt(variance) * np.random.randn(*size)
np.random.seed(12345)
N = 100
X = np.c_[dnorm(0, 0.4, size=N),
dnorm(0, 0.6, size=N),
dnorm(0, 0.2, size=N)]
eps = dnorm(0, 0.1, size=N)
beta = [0.1, 0.3, 0.5]
y = np.dot(X, beta) + eps
print(X[:5])
print('\n')
print(y[:5])
[[-0.12946849 -1.21275292 0.50422488]
[ 0.30291036 -0.43574176 -0.25417986]
[-0.32852189 -0.02530153 0.13835097]
[-0.35147471 -0.71960511 -0.25821463]
[ 1.2432688 -0.37379916 -0.52262905]]
[ 0.42786349 -0.67348041 -0.09087764 -0.48949442 -0.12894109]
X_model = sm.add_constant(X)  # sm.add_constant adds an intercept column
X_model[:5]
array([[ 1. , -0.12946849, -1.21275292, 0.50422488],
[ 1. , 0.30291036, -0.43574176, -0.25417986],
[ 1. , -0.32852189, -0.02530153, 0.13835097],
[ 1. , -0.35147471, -0.71960511, -0.25821463],
[ 1. , 1.2432688 , -0.37379916, -0.52262905]])
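# sm.OLS sets up an ordinary least squares regression.
# Note that X (not X_model) is passed here, so this model has no intercept.
model = sm.OLS(y, X)
# model is the unfitted model; its fit method returns a regression results object.
results = model.fit()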
results.params
array([0.17826108, 0.22303962, 0.50095093])
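The example above uses the array-based interface. As a minimal sketch of the formula-based interface mentioned at the start of this section, the code below puts similar simulated data into a DataFrame and fits it with smf.ols. The column names col0, col1 and col2, the noise level, and the use of np.random.default_rng are assumptions chosen for illustration, so the numbers will not match the output above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (illustrative only): y = X @ beta + noise.
rng = np.random.default_rng(12345)
N = 100
X = rng.normal(size=(N, 3))
beta = np.array([0.1, 0.3, 0.5])
y = X @ beta + rng.normal(scale=0.3, size=N)

# The formula interface works on a DataFrame with named columns.
df = pd.DataFrame(X, columns=['col0', 'col1', 'col2'])
df['y'] = y

# An intercept is included automatically unless "+ 0" is added to the formula.
results = smf.ols('y ~ col0 + col1 + col2', data=df).fit()
print(results.params)
```

With the formula interface, results.params is a pandas Series labeled by term name, and there is no need to call sm.add_constant.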
13.4 Introduction to scikit-learn
A machine learning library that is well worth studying in depth.
train = pd.read_csv('datasets/titanic/train.csv')
test = pd.read_csv('datasets/titanic/test.csv')
# statsmodels and scikit-learn generally cannot be fed missing data, so check the dataset for missing values first.
train.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
impute_value = train['Age'].median()  # median of Age in the training set
train['Age'] = train['Age'].fillna(impute_value)  # fill missing ages with the training median
test['Age'] = test['Age'].fillna(impute_value)  # reuse the training median for the test set
train['IsFemale'] = (train['Sex'] == 'female').astype(int)  # add an IsFemale column encoding Sex as 0/1
test['IsFemale'] = (test['Sex'] == 'female').astype(int)
predictors = ['Pclass', 'IsFemale', 'Age']
X_train = train[predictors].values
X_test = test[predictors].values
y_train = train['Survived'].values
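# The training features, test features and training labels are now NumPy arrays.
from sklearn.linear_model import LogisticRegression
# Create a LogisticRegression model instance.
model = LogisticRegression()
model.fit(X_train, y_train)  # fit the model to the training data
y_predict = model.predict(X_test)  # predict on the test set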
y_predict[:10]
E:\anaconda\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)
from sklearn.linear_model import LogisticRegressionCV
model_cv = LogisticRegressionCV(Cs=10)  # Cs passed by keyword; recent scikit-learn releases reject positional arguments here
model_cv.fit(X_train, y_train)
E:\anaconda\lib\site-packages\sklearn\model_selection\_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
warnings.warn(CV_WARNING, FutureWarning)
LogisticRegressionCV(Cs=10, class_weight=None, cv='warn', dual=False,
fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
max_iter=100, multi_class='warn', n_jobs=None,
penalty='l2', random_state=None, refit=True, scoring=None,
solver='lbfgs', tol=0.0001, verbose=0)
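LogisticRegressionCV handles cross-validation internally. To run the cross-validation by hand, one option is the cross_val_score helper in sklearn.model_selection. The sketch below is self-contained and therefore uses synthetic data from make_classification instead of the Titanic files; the sample size, C=10 and cv=4 are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=3,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Naming the solver explicitly avoids the default-solver warning in older releases.
model = LogisticRegression(C=10, solver='lbfgs')

# Four non-overlapping train/validation splits; one accuracy score per split.
scores = cross_val_score(model, X_demo, y_demo, cv=4)
print(scores)
```

The default scoring depends on the estimator; for a classifier such as LogisticRegression it is accuracy.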
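Notes:
Here is the reference link; this whole series reproduces the content found there.
Original link: https://www.jianshu.com/p/04d180d90a3f
The author of that post provides the book and related resources. Since I have picked up bits and pieces of this material along the way, this post may reproduce only part of the book, and it may also include some ideas of my own; the content is not settled yet. Thanks to the original Jianshu author SeanCheney for sharing.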