python note

Author: thuxuxs | Published: 2018-07-25 21:48


Setup

import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data=pd.DataFrame(np.random.random([200,3]),columns=['a','b','c'])
data['d']=[np.random.randint(3) for i in range(200)]

Heatmap

help(sns.heatmap)
data.corr()
sns.heatmap(data.corr(),annot=True,linewidths=.5,fmt='.2f')

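[x] annot: whether to display the correlation coefficients in the cells
[x] fmt: the format of the displayed coefficients, e.g. '.2f' for two decimal places
[x] linewidths: the width of the gaps between the cells of the heatmap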

Plotting from a DataFrame

data.a.plot(kind='line',linestyle=':',marker='*')
data.a.plot.hist(bins=20)

data.plot.scatter(x='a',y='b')
data.plot(kind='hist',bins=20,subplots=True,cumulative=True)

[x] kind: the type of plot
- line: line plot
- hist: histogram; bins sets the number of bins
- bar: bar chart
- scatter: scatter plot
[x] subplots: whether to draw each column in its own subplot

Counting categories

data.d.value_counts()
sns.countplot(x=data.d)

[x] pd.Series.value_counts: counts how many times each distinct value occurs

Logical operations

a=[1,0,0,1,1]
b=[0,1,0,0,1]

np.logical_and(a,b)
np.logical_not(a)
np.logical_or(a,b)
np.logical_xor(a,b)
np.info(np.logical_and)

partial

from functools import partial
f=partial(lambda x,y,z:(x+z)*y,1,2)
f(3)

[x] partial(func,*args): fixes the first few arguments of a function without modifying the original function
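
As a small sketch of the note above, the lambda can be replaced by a named function so that the fixed arguments are easier to see. The name combine is made up for this illustration; the keyword form shown at the end is standard functools.partial behaviour.

```python
from functools import partial

def combine(x, y, z):
    """Same formula as the lambda above: (x + z) * y."""
    return (x + z) * y

# Fix the first two positional arguments: x=1, y=2.
f = partial(combine, 1, 2)
result_positional = f(3)   # combine(1, 2, 3) -> (1 + 3) * 2

# Keyword arguments can be fixed too, whatever their position.
g = partial(combine, z=10)
result_keyword = g(1, 2)   # combine(1, 2, z=10) -> (1 + 10) * 2

print(result_positional, result_keyword)  # 8 22
print(f.func is combine, f.args)          # True (1, 2)
```

The original function stays untouched: combine can still be called with all three arguments.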

Clipping an array

a=range(9)
np.clip(a,3,7)

Splitting into training and test sets

from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y=train_test_split(data[['a','b','c']],data.d,test_size=0.3,random_state=1)
train_x.head()
train_y.head()

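[x] test_size: the fraction of the data held out as the test set (here 30%)
[x] random_state: sets the random seed so that the split is reproducible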
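
The two parameters can be checked on a self-contained table. This is only a sketch: the frame df is rebuilt here with a fixed seed (an assumption, so the block runs on its own), and the feature columns are picked by name.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild a table like the one in the setup section, with a fixed seed.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.random_sample((200, 3)), columns=['a', 'b', 'c'])
df['d'] = rng.randint(3, size=200)

features, target = df[['a', 'b', 'c']], df['d']
train_x, test_x, train_y, test_y = train_test_split(
    features, target, test_size=0.3, random_state=1)

# test_size=0.3 holds out 30% of the 200 rows.
print(len(train_x), len(test_x))  # 140 60

# The same random_state reproduces exactly the same split.
train_x2, _, _, _ = train_test_split(
    features, target, test_size=0.3, random_state=1)
same_split = train_x.index.equals(train_x2.index)
print(same_split)  # True
```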

Definitions of some error metrics

Compare the error between the predicted values x and the true values y.

x=[1,2,3,4,5,6]
y=[2,1,5,3,2,4]

Mean squared error (MSE)
$\text{MSE}=\frac{1}{n}\sum_{i=1}^n(x_i-y_i)^2$

mse=1.0/len(x)*sum([(x[i]-y[i])**2 for i in range(len(x))])
print(mse)

Root mean squared error (RMSE)
$\text{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^n(x_i-y_i)^2}$

rmse=np.sqrt(1.0/len(x)*sum([(x[i]-y[i])**2 for i in range(len(x))]))
print(rmse)

Mean absolute error (MAE)
$\text{MAE}=\frac{1}{n}\sum_{i=1}^n\lvert x_i-y_i\rvert$

mae=1.0/len(x)*sum([abs(x[i]-y[i]) for i in range(len(x))])
print(mae)

Root mean squared logarithmic error (RMSLE)
$\text{RMSLE}=\sqrt{\frac{1}{n}\sum_{i=1}^n(\log(x_i+1)-\log(y_i+1))^2}$

rmsle=np.sqrt(1.0/len(x)*sum([(np.log1p(x[i])-np.log1p(y[i]))**2 for i in range(len(x))]))
print(rmsle)

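[x] RMSLE penalizes under-prediction more heavily than over-prediction
[x] np.log1p computes log(1 + x); its inverse is np.expm1
[x] When using this metric, one can first apply np.log1p to the raw target values and then use RMSE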
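
The notes above can be verified with a few lines of plain Python. This is a minimal sketch: the helper names rmse and rmsle are chosen here, and the 50/150 versus 100 example is an illustrative assumption, not data from the notes.

```python
import math

def rmse(pred, true):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def rmsle(pred, true):
    """Root mean squared logarithmic error."""
    return math.sqrt(sum((math.log1p(p) - math.log1p(t)) ** 2
                         for p, t in zip(pred, true)) / len(pred))

x = [1, 2, 3, 4, 5, 6]   # predictions
y = [2, 1, 5, 3, 2, 4]   # true values

# RMSLE is just RMSE computed on log1p-transformed values.
log_x = [math.log1p(v) for v in x]
log_y = [math.log1p(v) for v in y]
same = math.isclose(rmsle(x, y), rmse(log_x, log_y))
print(same)  # True

# Same absolute error of 50 around a true value of 100:
under = rmsle([50], [100])    # prediction too low
over = rmsle([150], [100])    # prediction too high
print(under > over)  # True
```

Because the error is measured on a log scale, missing low by 50 costs more than missing high by 50.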

ROC

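In binary classification, ROC and AUC are commonly used as evaluation metrics. The predicted result and the actual result can be related in the following ways: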

              Predicted 1            Predicted 0
Actual 1      True Positive (TP)     False Negative (FN)
Actual 0      False Positive (FP)    True Negative (TN)

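(True/False)(Positive/Negative) = (whether the prediction is correct)(the predicted value)

[x] True positive rate (TPR): the fraction of all actual positives that are correctly identified, TPR=TP/(TP+FN)
[x] False positive rate (FPR): the fraction of all actual negatives that are wrongly identified as positive, FPR=FP/(FP+TN)
Because the predicted value is continuous, we choose different thresholds for deciding whether a prediction counts as a positive or a negative instance, and compute the TPR and FPR at each threshold. Plotting FPR on the x axis against TPR on the y axis gives the ROC curve, and the area under the ROC curve is the AUC.
The closer the AUC is to 1, the more accurate the predictions; an AUC of 0.5 means the predictions are unrelated to the actual labels.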

from sklearn.metrics import roc_curve,auc
import numpy as np
import matplotlib.pyplot as plt
x=np.array([np.random.choice([0,1]) for i in range(1000)])
y=x+(np.random.random(1000)-0.5)*2
fpr,tpr,thresholds=roc_curve(x,y)
print('auc:{}'.format(auc(fpr,tpr)))
plt.plot(fpr,tpr)
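
To make the threshold sweep concrete, the curve can also be built by hand. The sketch below uses six made-up labels and scores (an assumption for illustration), computes TPR and FPR at each threshold from the confusion-matrix counts, and integrates with the trapezoid rule. It does not replace roc_curve.

```python
def tpr_fpr(labels, scores, threshold):
    """TPR and FPR when every score >= threshold is predicted as 1."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for l, s in pairs if l == 1 and s >= threshold)
    fn = sum(1 for l, s in pairs if l == 1 and s < threshold)
    fp = sum(1 for l, s in pairs if l == 0 and s >= threshold)
    tn = sum(1 for l, s in pairs if l == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

# Sweep the threshold from "nothing is positive" down to the lowest score.
thresholds = [float('inf')] + sorted(set(scores), reverse=True)
points = [tpr_fpr(labels, scores, t) for t in thresholds]
tpr = [p[0] for p in points]
fpr = [p[1] for p in points]

# Area under the (fpr, tpr) curve by the trapezoid rule.
auc_value = sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
                for i in range(1, len(fpr)))
print(round(auc_value, 4))  # 0.8889
```

The result, 8/9, equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score.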

Dropping a DataFrame column

data.drop('a',axis=1).head()

Another way to build dictionaries

tmp={'a':1,'b':2}
dict(tmp,a=3,c=4)
dict(a=3,d=4)
dict(dict(a=3,d=4),a=1)
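
A short sketch of what these calls return. The variable names base, updated and merged are chosen here for illustration.

```python
base = {'a': 1, 'b': 2}

# dict(mapping, **kwargs) copies the mapping, then applies the keywords:
# existing keys are overwritten, new keys are appended.
updated = dict(base, a=3, c=4)
print(updated)  # {'a': 3, 'b': 2, 'c': 4}

# The source dictionary is left unchanged.
print(base)     # {'a': 1, 'b': 2}

# Unpacking gives the same result and also works for non-string keys.
merged = {**base, 'a': 3, 'c': 4}
print(merged == updated)  # True
```

The keyword form only works for keys that are valid Python identifiers, since they are passed as keyword arguments.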

Time in pandas

The basic structure: Timestamp

from datetime import datetime  # pd.datetime is no longer available in recent pandas
pd.to_datetime(datetime(2018,6,13,23,54))
now=pd.to_datetime('20180613235430')
now
now.day_name()  # replaces the removed Timestamp.weekday_name attribute
now.dayofyear

Generating a time series

time=pd.date_range('6/12/2018',periods=200,freq='5D')
time=pd.Series(time)
data['time']=time

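[x] periods: the total number of timestamps to generate
[x] freq: the spacing between timestamps, in days, months, years, etc.

Extracting time components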

time.dt.dayofweek
time.dt.month

The pandas index

help(pd.DataFrame.reset_index)
t=pd.DataFrame([[1,2],[3,4]],['a','b'],columns=['c','d'])
t.reset_index()
t.reset_index(drop=True)

[x] drop=True: discards the old index and replaces it with an integer sequence starting at 0
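
The difference between the two calls is easiest to see in the resulting columns. A minimal sketch using the same small table, with index= written out explicitly:

```python
import pandas as pd

t = pd.DataFrame([[1, 2], [3, 4]], index=['a', 'b'], columns=['c', 'd'])

# Default: the old index is kept as a new column named 'index'.
kept = t.reset_index()
print(list(kept.columns))    # ['index', 'c', 'd']
print(list(kept['index']))   # ['a', 'b']

# drop=True: the old index is thrown away.
dropped = t.reset_index(drop=True)
print(list(dropped.columns))  # ['c', 'd']
print(list(dropped.index))    # [0, 1]
```

Neither call modifies t itself. Both return a new DataFrame.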

Selecting columns by dtype in pandas

data.dtypes.to_dict()
data.select_dtypes(exclude=['int64'])
data.select_dtypes(include=['datetime64','float64'])

[x] Series.to_dict: converts a Series directly into a dictionary; similarly, to_csv can be used to save the data
[x] select_dtypes: selects the columns of the given dtypes

Encoding values of arbitrary type as integers in pandas

help(pd.factorize)
pd.factorize(['a','c',np.nan,1,120,np.max,np.mean,np.max])
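
pd.factorize returns two things: an array of integer codes and the distinct values those codes refer to. A small sketch on a made-up array of strings (the values are assumptions for illustration), wrapped in a NumPy object array:

```python
import numpy as np
import pandas as pd

values = np.array(['b', 'a', 'b', np.nan, 'c', 'a'], dtype=object)
codes, uniques = pd.factorize(values)

# Codes are assigned in order of first appearance; missing values get -1.
print(codes.tolist())  # [0, 1, 0, -1, 2, 1]
print(list(uniques))   # ['b', 'a', 'c']

# The original (non-missing) values can be recovered from the codes.
recovered = [uniques[c] for c in codes if c >= 0]
print(recovered)  # ['b', 'a', 'b', 'c', 'a']
```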

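Data processing tips

[x] Put the training set and the test set into one large table before processing, so that both receive exactly the same treatment
[x] Columns of object dtype can be converted to numbers with pd.factorize, which can be thought of as assigning each distinct value an ID
[x] For datasets that involve time, the following features can be added to the table
- the day of the week
- the month
- the week of the year
- how many times each of the above feature values occurs in the whole dataset, implemented with value_counts().to_dict() and map()
- the shortest distance from this day to the start or the end of its month
- 0 if the day of the month is before the 7th, 2 if it is after the 24th, and 1 otherwise
[x] scipy.stats.skew: adds the skewness of a distribution as a feature
[x] scipy.stats.kurtosis: adds the kurtosis of a distribution as a feature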

from scipy.stats import skew,kurtosis
skew([1,3,2,4,5])
skew([0,1,2,2,3,3,4])
kurtosis([0,1,2,2,3,3,4])

[x] numpy.percentile: adds percentiles of a distribution as features

np.percentile([1,2,3,3,4],50)
np.percentile([1,2,3,3,4],75)
np.percentile([1,2,3,3,4],25)
np.percentile([1,2,3,3,4],90)
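
The time-related features listed in the tips could be built roughly as follows. This is a sketch under assumptions: the column names are invented here, the date range mirrors the one used in the time series section, and dt.isocalendar() requires a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': pd.date_range('6/12/2018', periods=200, freq='5D')})

# Day of week, month, ISO week of the year.
df['dayofweek'] = df['time'].dt.dayofweek
df['month'] = df['time'].dt.month
df['weekofyear'] = df['time'].dt.isocalendar().week.astype(int)

# How often each month value occurs in the whole table:
# value_counts().to_dict() builds the lookup, map() applies it row by row.
month_counts = df['month'].value_counts().to_dict()
df['month_count'] = df['month'].map(month_counts)

# Shortest distance (in days) to the start or the end of the month.
day = df['time'].dt.day
days_in_month = df['time'].dt.days_in_month
df['dist_month_edge'] = np.minimum(day - 1, days_in_month - day)

# 0 before the 7th, 2 after the 24th, 1 otherwise.
df['month_part'] = 1
df.loc[day < 7, 'month_part'] = 0
df.loc[day > 24, 'month_part'] = 2

print(df.head())
```

The same value_counts().to_dict() plus map() pattern applies unchanged to the day-of-week and week-of-year columns.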

XGBoost

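Parameter definitions

[x] eta [default=0.3]: the shrinkage parameter (learning rate). Leaf weights are multiplied by this factor when they are updated, which keeps each step from being too large. The larger the value, the more likely training fails to converge; a smaller eta makes the later rounds learn more carefully.
[x] min_child_weight [default=1]: the minimum sum of h (the second derivative of the loss) required in each leaf. For 0-1 classification with imbalanced positive and negative samples, if h is around 0.01, min_child_weight=1 means a leaf must contain at least about 100 samples. This parameter strongly affects the result; the smaller it is, the easier it is to overfit.
[x] max_depth [default=6]: the maximum depth of each tree; the deeper the tree, the easier it is to overfit.
[x] max_leaf_nodes: the maximum number of leaf nodes (XGBoost itself names this parameter max_leaves); its effect partly overlaps with max_depth.
[x] gamma [default=0]: the minimum loss reduction a split must achieve to be kept; it controls whether a split is removed during post-pruning.
[x] max_delta_step [default=0]: takes effect in the update step. 0 means no constraint; a positive value makes the update step more conservative, preventing overly large updates and making them smoother.
[x] subsample [default=1]: random sampling of the training rows. Lower values make the algorithm more conservative and prevent overfitting, but values that are too small cause underfitting.
[x] colsample_bytree [default=1]: column sampling; the features used to build each tree are sampled. Typically set to 0.5-1.
[x] lambda [default=1]: the L2 regularization term on the weights, which controls model complexity; the larger the value, the less likely the model is to overfit.
[x] alpha [default=0]: the L1 regularization term on the weights, which controls model complexity; the larger the value, the less likely the model is to overfit.
[x] scale_pos_weight [default=1]: controls the balance of positive and negative weights. A value greater than 0 helps convergence when the classes are imbalanced; a typical value to consider is (number of negative samples)/(number of positive samples).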
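
The "about 100 samples" figure in the min_child_weight note can be reproduced with a little arithmetic. The sketch below assumes a logistic loss, for which the second derivative of a sample with predicted probability p is h = p(1 - p). Both helper names are made up for this illustration, and no XGBoost call is involved.

```python
import math

def logistic_hessian(p):
    """Second derivative of the logistic loss at predicted probability p."""
    return p * (1.0 - p)

def min_samples_per_leaf(min_child_weight, p):
    """Samples needed in a leaf so that the sum of h reaches min_child_weight,
    assuming every sample in the leaf has the same predicted probability p."""
    return math.ceil(min_child_weight / logistic_hessian(p))

# Rare positives: p is near 0.01, so h is near 0.0099 per sample.
print(min_samples_per_leaf(1, 0.01))  # 102

# Balanced predictions: p = 0.5 gives the largest h, 0.25.
print(min_samples_per_leaf(1, 0.5))   # 4
```

The same min_child_weight therefore acts as a much stricter constraint on imbalanced data than on balanced data.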

sklearn

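[x] joblib (formerly bundled as sklearn.externals.joblib, now imported directly as the standalone joblib package): used to save data
[x] sklearn.feature_selection.VarianceThreshold: selects features by a threshold on their variance; a boolean array a is inverted with ~a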
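
A small sketch of how VarianceThreshold and the ~ operator fit together. The 3x3 matrix is made up for illustration, and its first column is constant on purpose.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 1.0, 5],
              [0, 2.0, 5],
              [0, 3.0, 6]])

# Keep only the features whose variance is above the threshold.
selector = VarianceThreshold(threshold=0.0)
X_kept = selector.fit_transform(X)

# get_support() returns a boolean mask of the kept features;
# ~mask flips it to mark the features that were removed.
mask = selector.get_support()
print(mask.tolist())     # [False, True, True]
print((~mask).tolist())  # [True, False, False]
print(X_kept.shape)      # (3, 2)
```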

numpy

[x] np.unique: returns the unique values of an array and, with return_index=True, the position of the first occurrence of each

np.unique(['a','b','b','c','a'],return_index=True)

[x] np.isin(x,y): returns a boolean array telling whether each element of x is in y

np.isin([[1,2],[4,3]],[2,3])

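[x] np.where(condition,[x,y]): returns the row and column indices of the positions where the condition holds; if x and y are given, returns an array with the shape of the condition, filled with x where the condition holds and with y where it does not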

np.where(np.array([[1,3],[4,2]]))
np.where(np.array([[1,3],[4,2]])>2)
np.where(np.array([[1,3],[4,2]])>2,'big','small')
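
For reference, here is what the three functions return on the small inputs used in this section. This is a sketch with the results written out as comments, and the variable names are chosen here.

```python
import numpy as np

m = np.array([[1, 3], [4, 2]])

# One-argument form: row indices and column indices where the condition holds.
rows, cols = np.where(m > 2)
print(rows.tolist(), cols.tolist())  # [0, 1] [1, 0]

# Three-argument form: pick 'big' where the condition holds, 'small' elsewhere.
labels = np.where(m > 2, 'big', 'small')
print(labels.tolist())  # [['small', 'big'], ['big', 'small']]

# np.isin keeps the shape of its first argument.
inside = np.isin(m, [2, 3])
print(inside.tolist())  # [[False, True], [False, True]]

# np.unique with return_index: sorted unique values and first positions.
values, first_pos = np.unique(['a', 'b', 'b', 'c', 'a'], return_index=True)
print(values.tolist(), first_pos.tolist())  # ['a', 'b', 'c'] [0, 1, 3]
```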
