干货 | 教你一文掌握数据预处理

作者: 6bb0c5e1b2e1 | 来源:发表于2020-03-05 13:59 被阅读0次

干货 | 教你一文掌握数据预处理
手把手教你从数据预处理开始体验图数据库
人工智能领域：想学习大数据要掌握些什么知识？
算法笔记（13）数据预处理及Python代码实现
kaggle竞赛：Jigsaw Unintended Bias
1分钟了解数据分析挖掘体系
傻了吧？我只写了3行Python代码，就能让数据预处理速度高2-
机器学习笔记
一文教你看懂大数据的技术生态圈 Hadoop,hive,spar
金融投资高分书单 | 金融从业者的精选推荐

关注公众号“Python数据分析实战与AI干货”，二维码在文末~

满满干货等你来～

数据分析一定少不了数据预处理，预处理的好坏决定了后续的模型效果，今天我们就来看看预处理有哪些方法呢？

记录实战过程中在数据预处理环节用到的方法~

主要从以下几个方面介绍：

常用方法

Numpy部分

Pandas部分

Sklearn 部分

处理文本数据

一、常用方法

1、生成随机数序列

randIndex = random.sample(range(trainSize, len(trainData_copy)), 5*trainSize)

2、计算某个值出现的次数

titleSet = set(titleData)
for i in titleSet:
    count = titleData.count(i)

用文本出现的次数替换非空的地方。词袋模型 Word Count

titleData = allData['title']
titleSet = set(list(titleData))
title_counts = titleData.value_counts()
for i in titleSet:
    if isNaN(i):
        continue
    count = title_counts[i]
    titleData.replace(i, count, axis=0, inplace=True)
title = pd.DataFrame(titleData)
allData['title'] = title

3、判断值是否为NaN

def isNaN(num):
    return num != num

4、 Matplotlib在jupyter中显示图像

%matplotlib inline

5、处理日期

birth = trainData['birth_date']
birthDate = pd.to_datetime(birth)
end = pd.datetime(2020, 3, 5)
# 计算天数
birthDay = end - birthDate
birthDay.astype('timedelta64[D]')
# timedelta64 转到 int64
trainData['birth_date'] = birthDay.dt.days

6、计算多列数的平均值等

trainData['operate_able'] = trainData.iloc[ : , 20:53].mean(axis=1)
trainData['local_able'] = trainData.iloc[ : , 53:64].mean(axis=1)

7、数据分列（对列进行one-hot）

train_test = pd.get_dummies(train_test,columns=["Embarked"])
train_test = pd.get_dummies(train_test,columns = ['SibSp','Parch','SibSp_Parch'])

8、正则提取指定内容

df['Name].str.extract()是提取函数,配合正则一起使用

train_test['Name1'] = train_test['Name'].str.extract('.+,(.+)').str.extract( '^(.+?)\.').str.strip()

9、根据数据是否缺失进行处理

train_test.loc[train_test["Age"].isnull() ,"age_nan"] = 1
train_test.loc[train_test["Age"].notnull() ,"age_nan"] = 0

10、按区间分割-数据离散化

返回x所属区间的索引值，半开区间

#将年龄划分五个阶段10以下,10-18,18-30,30-50,50以上
train_test['Age'] = pd.cut(train_test['Age'], bins=[0,10,18,30,50,100],labels=[1,2,3,4,5])

二、Numpy部分

1、where索引列表

delLocal = np.array(np.where(np.array(trainData['acc_now_delinq']) == 1))

2、permutation(x) 随机生成一个排列或返回一个range

如果x是一个多维数组，则只会沿着它的第一个索引进行混洗。

import numpy as np

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

3、numpy.argmax() 返回沿轴的最大值的`索引`

返回沿轴的最大值的索引。

np.argmax(some_digit_scores)

a : array_like; 输入数组

axis : int, optional; 默认情况下，索引是放在平面数组中，否则沿着指定的轴。

out : array, optional; 如果提供，结果将被插入到这个数组中。它应该是适当的形状和dtype。

4、numpy.dot(a, b, out=None) 计算两个数组的点积

>>> np.dot(3, 4)

5、numpy.random.randn() 从标准正太分布返回样本

>>> np.random.seed(42) # 可设置随机数种子
>>> theta = np.random.randn(2,1)
array([[ 4.21509616],
       [ 2.77011339]])

参数

d0, d1, …, dn : int, optional；返回的数组维度，应该都是正值。如果没有给出，将返回一个Python float值。

6、numpy.linspace() 在指定区间返回间隔均匀的样本[start, stop]

X_new=np.linspace(-3, 3, 100).reshape(100, 1)
X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)
plt.plot(X, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.legend(loc="upper left", fontsize=14)
plt.axis([-3, 3, 0, 10])
save_fig("quadratic_predictions_plot")
plt.show()

start : scalar；序列的起始值

stop : scalar；序列的结束值

num : int, optional；要生成的样本数量，默认为50个。

endpoint : bool, optional；若为True则包括结束值，否则不包括结束值，即[start, stop)区间。默认为True。

dtype : dtype, optional；输出数组的类型，若未给出则从输入数据推断类型。

三、Pandas部分

1、Jupyter notebook中设置最大显示行列数

pd.set_option('display.max_columns', 64)
pd.set_option('display.max_rows', 1000000)

2、读入数据

homePath = 'game'
trainPath = os.path.join(homePath, 'train.csv')
testPath = os.path.join(homePath, 'test.csv')
trainData = pd.read_csv(trainPath)
testData = pd.read_csv(testPath)

3、数据简单预览

~head()
获取前五行数据，供快速参考。

~info()
获取总行数、每个属性的类型、非空值的数量。

~value_counts()
获取每个值出现的次数

~hist()
直方图的形式展示数值型数据

~describe()
简要显示数据的数字特征；例如：总数、平均值、标准差、最大值最小值、25%/50%/75%值

4、拷贝数据

mthsMajorTest = fullData.copy()

5、数据相关性

计算相关性矩阵

corrMatrix = trainData.corr()
corrMatrix['acc_now_delinq'].sort_values(ascending=False) # 降序排列

6、删除某列

trainData.drop('acc_now_delinq', axis=1, inplace=True)

# 此方法并不会从内存中释放内存
del fullData['member_id']

7、列表类型转换

termData = list(map(int, termData))

8、替换数据

gradeData.replace(['A','B','C','D','E','F','G'], [7,6,5,4,3,2,1],inplace=True)

9、数据集合并

allData = trainData.append(testData)

allData = pd.concat([trainData, testData], axis=0, ignore_index=True)

10、分割

termData = termData.str.split(' ', n=2, expand=True)[1]

11、~where() 相当于三目运算符( ? : )

通过判断自身的值来修改自身对应的值，相当于三目运算符( ? : )

housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)

cond 如果为True则保持原始值，若为False则使用第二个参数other替换值。

other 替换的目标值

inplace 是否在数据上执行操作

12、np.ceil(x, y) 限制元素范围

x 输入的数据

y float型，每个元素的上限

housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)     # 每个元素都除1.5

13、~loc[] 纯粹基于标签位置的索引器

strat_train_set = housing.loc[train_index]
strat_test_set = housing.loc[test_index]

14、~dropna() 返回略去丢失数据部分后的剩余数据

sample_incomplete_rows.dropna(subset=["total_bedrooms"])

15、~fillna() 用指定的方法填充

# 用中位数填充
median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True)

16、重置索引

allData = subTrain.reset_index()

四、Sklearn 部分

1、数据标准化

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss.fit(mthsMajorTrain)
mthsMajorTrain_d = ss.transform(mthsMajorTrain)
mthsMajorTest_d = ss.transform(mthsMajorTest)

2、预测缺失值

from sklearn import linear_model
lin = linear_model.BayesianRidge()
lin.fit(mthsMajorTrain_d, mthsMajorTrainLabel)
trainData.loc[(trainData['mths_since_last_major_derog'].isnull()), 'mths_since_last_major_derog'] = lin.predict(mthsMajorTest_d)

3、Lightgbm提供的特征重要性

import lightgbm as lgb

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'auc'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

lgb_train = lgb.Dataset(totTrain[:400000], totLabel[:400000])
lgb_eval = lgb.Dataset(totTrain[400000:], totLabel[400000:])
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=20,
                valid_sets=lgb_eval,
                early_stopping_rounds=5)
lgb.plot_importance(gbm, figsize=(10,10))

对于缺失值，一般手动挑选几个重要的特征，然后进行预测

upFeatures = ['revol_util', 'revol_bal', 'annual_inc']  # 通过上一步挑选出的特征
totTrain = totTrain[upFeatures]
totTest = trainData.loc[(trainData['total_rev_hi_lim'].isnull())][upFeatures]
totTest['annual_inc'].fillna(-9999, inplace=True)

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss.fit(totTrain)
train_d = ss.transform(totTrain)
test_d = ss.transform(totTest)

from sklearn import linear_model
lin = linear_model.BayesianRidge()
lin.fit(train_d, totLabel)
trainData.loc[(trainData['total_rev_hi_lim'].isnull()), 'total_rev_hi_lim'] = lin.predict(test_d)

4、用中位数填充

trainData['total_acc'].fillna(trainData['total_acc'].median(), inplace=True)

5、用均值填充

trainData['total_acc'].fillna(trainData['total_acc'].mean(), inplace=True)

6、Imputer() 处理丢失值

各属性必须是数值

from sklearn.preprocessing import Imputer
# 指定用何值替换丢失的值，此处为中位数
imputer = Imputer(strategy="median")

# 使实例适应数据
imputer.fit(housing_num)

# 结果在statistics_ 变量中
imputer.statistics_

# 替换
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index = list(housing.index.values))

# 预览
housing_tr.loc[sample_incomplete_rows.index.values]

五、处理文本数据

1、pandas.factorize() 将输入值编码为枚举类型或分类变量

housing_cat = housing['ocean_proximity']
housing_cat.head(10)
# 输出
# 17606     <1H OCEAN
# 18632     <1H OCEAN
# 14650    NEAR OCEAN
# 3230         INLAND
# 3555      <1H OCEAN
# 19480        INLAND
# 8879      <1H OCEAN
# 13685        INLAND
# 4937      <1H OCEAN
# 4861      <1H OCEAN
# Name: ocean_proximity, dtype: object

housing_cat_encoded, housing_categories = housing_cat.factorize()
housing_cat_encoded[:10]
# 输出
# array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)

2、参数

values : ndarray (1-d)；序列

sort : boolean, default False；根据值排序

na_sentinel : int, default -1；给未找到赋的值

size_hint : hint to the hashtable sizer

3、返回值

labels : the indexer to the original array

uniques : ndarray (1-d) or Index；当传递的值是Index或Series时，返回独特的索引。

4、OneHotEncoder 编码整数特征为one-hot向量

返回值为稀疏矩阵

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot

注意fit_transform()期望一个二维数组，所以这里将数据reshape了。

5、处理文本特征示例

housing_cat = housing['ocean_proximity']
housing_cat.head(10)
# 17606     <1H OCEAN
# 18632     <1H OCEAN
# 14650    NEAR OCEAN
# 3230         INLAND
# 3555      <1H OCEAN
# 19480        INLAND
# 8879      <1H OCEAN
# 13685        INLAND
# 4937      <1H OCEAN
# 4861      <1H OCEAN
# Name: ocean_proximity, dtype: object

housing_cat_encoded, housing_categories = housing_cat.factorize()
housing_cat_encoded[:10]
# array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)

housing_categories
# Index(['<1H OCEAN', 'NEAR OCEAN', 'INLAND', 'NEAR BAY', 'ISLAND'], dtype='object')

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
print(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot
# [[0]
#  [0]
#  [1]
#  ..., 
#  [2]
#  [0]
#  [3]]
# <16512x5 sparse matrix of type '<class 'numpy.float64'>'
#     with 16512 stored elements in Compressed Sparse Row format>

6、LabelEncoder 标签编码

LabelEncoder`是一个可以用来将标签规范化的工具类，它可以将标签的编码值范围限定在[0,n_classes-1]。简单来说就是对不连续的数字或者文本进行编号。

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

当然，它也可以用于非数值型标签的编码转换成数值标签（只要它们是可哈希并且可比较的）:

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

7、LabelBinarizer 标签二值化

LabelBinarizer 是一个用来从多类别列表创建标签矩阵的工具类:

>>> from sklearn import preprocessing
>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit([1, 2, 6, 4, 2])
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
>>> lb.classes_
array([1, 2, 4, 6])
>>> lb.transform([1, 6])
array([[1, 0, 0, 0],
       [0, 0, 0, 1]])

对于多类别是实例，可以使用:class:MultiLabelBinarizer:

>>> lb = preprocessing.MultiLabelBinarizer()
>>> lb.fit_transform([(1, 2), (3,)])
array([[1, 1, 0],
       [0, 0, 1]])
>>> lb.classes_
array([1, 2, 3])

看完可以收藏，方便以后灵活运用哦~

干货 | 教你一文掌握数据预处理
关注公众号“Python数据分析实战与AI干货”，二维码在文末~ 满满干货等你来～数据分析一定少不了数据预处理，...
手把手教你从数据预处理开始体验图数据库
本文首发于 Nebula 公众号：手把手教你从数据预处理开始体验图数据库[https://mp.weixin.qq...
人工智能领域：想学习大数据要掌握些什么知识？
需要掌握的知识：大数据技术体系太庞杂了，基础技术覆盖数据采集、数据预处理、分布式存储、 NOSQL数据库、...
算法笔记（13）数据预处理及Python代码实现
常用数据预处理工具:使用StandardScaler进行数据预处理、使用MinMaxScaler进行数据预处理、使...
kaggle竞赛：Jigsaw Unintended Bias
1 数据预处理上面的句子用来预处理数据。
1分钟了解数据分析挖掘体系
总体上来讲，数据分析挖掘体系可分为数据预处理、分析挖掘、数据探索、数据展现和分析工具。数据预处理数据预处理包含...
傻了吧？我只写了3行Python代码，就能让数据预处理速度高2-
在 Python 中，我们可以找到原生的并行化运算指令。本文可以教你仅使用 3 行代码，大大加快数据预处理的速度。...
机器学习笔记
精品笔记 ML AI 斯坦福机器学习笔记 GTD 数据预处理数据预处理预处理终版.
一文教你看懂大数据的技术生态圈 Hadoop,hive,spar
一文教你看懂大数据的技术生态圈 Hadoop,hive,spark 原文地址：http://data.qq.com...
金融投资高分书单 | 金融从业者的精选推荐
这是一篇纯干货文章，这个书单不只是教你如何赚钱，而是教你如何系统建立投资理念。一千个人有一千种投资方法，掌握投资逻...