Using Python, pandas, seaborn and scikit-L

Author: 14e61d025165 | Published: 2019-03-22 16:10


In this post, I will perform exploratory data analysis (EDA) on the Titanic machine learning dataset (https://www.kaggle.com/francksylla/titanic-machine-learning-from-disaster) using popular Python packages: pandas, matplotlib, seaborn, and scikit-learn.

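Outline:

What is data
Categorical analysis
Quantitative analysis
Clustering
Feature importance from tree-based estimators
Dashboarding techniques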

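1. What is data

First, some theory. The word "data" was first used in 1946 to mean "transmissible and storable computer information". At the highest level, data can be broadly divided into two categories: structured and unstructured. Structured data has a predefined data model and usually resides in a relational database or data warehouse with a fixed schema. Common examples include transaction records, customer information, and dates. Unstructured data has no predefined data model and is typically found in NoSQL databases and data lakes. Examples include images, video files, and audio files.

In this post we will focus on structured data, and I will propose a systematic approach for quickly surfacing the underlying statistics in your data. Structured data can be further divided into categorical and quantitative data. Arithmetic does not apply to categorical data. Categorical data comprises nominal and ordinal data, while quantitative data comprises interval and ratio data. It is worth taking the time to define each term clearly and understand the nuances between them, because they affect our analysis and preprocessing techniques later on.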

Figure: The 4 different types of data

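Nominal data

The name "nominal" comes from the Latin nomen, meaning name. Nominal data are objects that are distinguished by a simple naming system. One important thing to note is that nominal data may also have numbers assigned to them. This can make them look ordinal, but they are not: the numbers are used only to capture and reference the items. Some examples include:

A set of countries
The numbers assigned to athletes

Ordinal data

Ordinal data are items whose order matters. More formally, their relative position on an ordinal scale is what gives them meaning. By default, the order is defined by assigning numbers to the items, but letters or other sequential symbols can also be used. Some examples include:

The finishing rank in a race (1st, 2nd, 3rd)
Salary grades in an organization (Associate, AVP, VP, SVP)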
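
As a small illustration of the difference between nominal and ordinal data, the sketch below uses pandas categoricals. The values and variable names are made up for this example and are not part of the Titanic dataset.

```python
import pandas as pd

# Nominal: labels only, no meaningful order (hypothetical example values).
countries = pd.Categorical(["France", "Japan", "Brazil", "Japan"])

# Ordinal: the order of the categories carries meaning.
grades = pd.Categorical(
    ["AVP", "Associate", "SVP", "VP"],
    categories=["Associate", "AVP", "VP", "SVP"],
    ordered=True,
)

print(countries.ordered)
print(grades.ordered)

# Comparisons and min/max are only defined for ordered categoricals.
print(grades.min(), grades.max())
senior = grades >= "VP"
print(list(senior))
```

Declaring the order explicitly keeps pandas from falling back to alphabetical order, which would be wrong for grades like these.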

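Interval data

Like ordinal data, interval data are ordered, but they are measured along a scale on which each position is equidistant from the next. This property makes differences between values meaningful, so addition and subtraction can be applied to them. An example is:

Temperature in degrees Fahrenheit, where the difference between 78 and 79 degrees is the same as the difference between 45 and 46 degrees.

Ratio data

As with interval data, differences between ratio data are meaningful. Ratio data have the additional property that ratios between objects are also meaningful, because the scale has a true zero point. Zero indicates the absence of the property being measured. So when we say something has zero weight, we mean that it has no mass. Some examples include:

A person's weight on a scale

Interval vs. ratio

The difference between interval and ratio data is that one has no true zero point while the other does. An example illustrates this well: when we say something is 0 degrees Fahrenheit, it does not mean that it has no heat. This is why ratio statements such as "80 degrees Fahrenheit is twice as hot as 40 degrees Fahrenheit" do not hold.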
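
The Fahrenheit example can be checked with a few lines of arithmetic. The sketch below converts both temperatures to kelvin, a scale with a true zero, and compares the two ratios. The helper name fahrenheit_to_kelvin is introduced here only for illustration.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert a Fahrenheit temperature to kelvin (a ratio scale)."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15

# Differences on an interval scale are meaningful ...
diff_f = 80.0 - 40.0

# ... but the naive ratio is not: 80 / 40 is 2 on the Fahrenheit scale,
# while the ratio of the absolute temperatures is only about 1.08.
naive_ratio = 80.0 / 40.0
kelvin_ratio = fahrenheit_to_kelvin(80.0) / fahrenheit_to_kelvin(40.0)

print(naive_ratio)
print(round(kelvin_ratio, 3))
```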

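Before we dive into the other sections, I want to formalize a few ideas so that the reasoning behind the operations shown below is clear.

First, I would argue that the best way to show a quick summary of data is through 2D plots. Although we live in a 3D world, people find it difficult to perceive the third dimension, such as depth, in a 3D plot projected onto a 2D screen. In the following sections you will therefore see that we use only bar plots for categorical data and box plots for quantitative data, since each expresses the distribution of the data succinctly. We focus only on univariate analysis and on bivariate analysis with the target variable.

We will mainly use seaborn and pandas for this. Statistics is an essential part of any data scientist's toolkit, and seaborn makes it quick and convenient to display the statistics of your data through matplotlib. matplotlib is powerful, but it can get complicated. seaborn provides a high-level abstraction over matplotlib that makes it easy to draw attractive statistical plots. To make the most of seaborn we also need pandas, because seaborn works best with pandas DataFrames.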

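2. Categorical analysis

We can start by reading the data with pd.read_csv(). Calling .head() on the data frame gives a quick look at the first 5 rows. Other useful methods are .describe() and .info(). The latter shows the dtype and the non-null count of every column.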
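
A minimal sketch of these calls is shown below. To keep it self-contained it reads a tiny inline CSV instead of the real file. The three rows are invented for illustration and are not taken from the Titanic dataset, and the name train_df is simply the one used in the later snippets.

```python
import io
import pandas as pd

# Toy rows with a few Titanic-style columns; the values are invented
# for illustration and are NOT taken from the real dataset.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22.0,7.25
2,1,1,female,38.0,71.28
3,1,3,female,,7.92
"""

# With the real data this would read the downloaded training CSV instead.
train_df = pd.read_csv(io.StringIO(csv_text))

print(train_df.head())      # first rows (up to 5)
print(train_df.describe())  # summary statistics of the numeric columns
train_df.info()             # dtype and non-null count of every column
```

The missing Age in the third row shows up as a lower non-null count in the .info() output, which is a quick way to spot columns that need imputation.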

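We can now see that

the categorical data are:

PassengerId,
Survived,
Pclass,
Name,
Sex,
Ticket,
Cabin,
Embarked

while the quantitative data are:

Age,
SibSp,
Parch,
Fare

With this knowledge and what we learned in Section 1, let us write a custom helper function that can handle most kinds of categorical data and summarize them quickly. We will do this with the help of pandas methods and seaborn's .countplot() function. The helper function is called categorical_summarized, and its Python implementation is shown below.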

<pre style="-webkit-tap-highlight-color: transparent;box-sizing: border-box;font-family: Consolas, Menlo, Courier, monospace;font-size: 16px;white-space: pre-wrap;line-height: 1.5;color: rgb(153, 153, 153);margin-top: 1em;margin-bottom: 1em;padding: 12px 10px;border-width: 1px;border-style: solid;border-color: rgb(232, 232, 232);font-variant-numeric: normal;font-variant-east-asian: normal;text-align: start;widows: 1;background: rgb(244, 245, 246);">import matplotlib.pyplot as plt
import seaborn as sns


def categorical_summarized(dataframe, x=None, y=None, hue=None, palette='Set1', verbose=True):
    '''
    Helper function that gives a quick summary of a given column of categorical data
    Arguments
    =========
    dataframe: pandas dataframe
    x: str. horizontal axis to plot the labels of categorical data, y would be the count
    y: str. vertical axis to plot the labels of categorical data, x would be the count
    hue: str. if you want to compare it with another variable (usually the target variable)
    palette: array-like. Colour of the plot
    verbose: bool. if True, also print the frequency of every class
    Returns
    =======
    Quick Stats of the data and also the count plot
    '''
    # The column to summarise is whichever of x / y was given.
    if x is None:
        column_interested = y
    else:
        column_interested = x
    series = dataframe[column_interested]
    print(series.describe())
    print('mode: ', series.mode())
    if verbose:
        print('='*80)
        print(series.value_counts())
    # Bar plot of the count of each class, optionally split by hue.
    sns.countplot(x=x, y=y, hue=hue, data=dataframe, palette=palette)
    plt.show()
</pre>

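categorical_summarized takes a data frame and some input arguments, and outputs the following:

The count, mean, std, min, max, and quartiles for numerical data, or the count, number of unique classes, top class, and frequency of the top class for non-numerical data
The class frequencies of the column of interest, if verbose is set to True
A bar plot of the count of each class in the column of interest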
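
The first two outputs come straight from pandas. As a rough sketch on two made-up Series (the values are illustrative only), the summary looks different for numerical and non-numerical data:

```python
import pandas as pd

numeric = pd.Series([1, 2, 2, 3, 10])
labels = pd.Series(["a", "b", "b", "c", "b"])

num_stats = numeric.describe()   # count, mean, std, min, quartiles, max
cat_stats = labels.describe()    # count, unique, top, freq

print(list(num_stats.index))
print(list(cat_stats.index))
print(labels.value_counts())     # frequency of every class
```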

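Let us talk about the input arguments. x and y take a str that corresponds to the column we want to study. Setting the column name as x creates a bar plot with the different classes on the x-axis and their counts on the y-axis. Setting the column name as y flips the axes of the previous plot, so that the classes are on the y-axis and the counts on the x-axis. By setting hue to the target variable (Survived in this case), the function shows how the target variable depends on the column of interest. Some sample code showing the usage of categorical_summarized follows:

Univariate analysis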

<pre style="-webkit-tap-highlight-color: transparent;box-sizing: border-box;font-family: Consolas, Menlo, Courier, monospace;font-size: 16px;white-space: pre-wrap;line-height: 1.5;color: rgb(153, 153, 153);margin-top: 1em;margin-bottom: 1em;padding: 12px 10px;border-width: 1px;border-style: solid;border-color: rgb(232, 232, 232);font-variant-numeric: normal;font-variant-east-asian: normal;text-align: start;widows: 1;background: rgb(244, 245, 246);"># Target Variable: Survival
c_palette = ['tab:blue', 'tab:orange']
categorical_summarized(train_df, y = 'Survived', palette=c_palette)
</pre>

This gives the following:

Figure: output of categorical_summarized on the Survived variable

Bivariate analysis

<pre style="-webkit-tap-highlight-color: transparent;box-sizing: border-box;font-family: Consolas, Menlo, Courier, monospace;font-size: 16px;white-space: pre-wrap;line-height: 1.5;color: rgb(153, 153, 153);margin-top: 1em;margin-bottom: 1em;padding: 12px 10px;border-width: 1px;border-style: solid;border-color: rgb(232, 232, 232);font-variant-numeric: normal;font-variant-east-asian: normal;text-align: start;widows: 1;background: rgb(244, 245, 246);"># Feature Variable: Gender
categorical_summarized(train_df, y = 'Sex', hue='Survived', palette=c_palette)
</pre>

This gives the following:

Figure: output of categorical_summarized on the Sex variable with hue set to Survived

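3. Quantitative analysis

Technically we could use bar plots for quantitative data, but the result is usually quite messy (you can try categorical_summarized on the Age column). A neater approach is a box plot, which shows the distribution through a five-number summary: minimum, Q1, median, Q3, and maximum.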
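
To make the five-number summary concrete, here is a small sketch on a made-up array (not Titanic data). It also applies the 1.5 × IQR rule that seaborn and matplotlib box plots use by default to decide which points are drawn as outliers beyond the whiskers.

```python
import numpy as np

values = np.array([2, 4, 4, 5, 7, 9, 10, 12, 40], dtype=float)

# Five-number summary: minimum, Q1, median, Q3, maximum.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])

# Points further than 1.5 * IQR from the box are treated as outliers.
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(minimum, q1, median, q3, maximum)
print(outliers)
```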

The next helper function, quantitative_summarized, is defined as follows:

<pre style="-webkit-tap-highlight-color: transparent;box-sizing: border-box;font-family: Consolas, Menlo, Courier, monospace;font-size: 16px;white-space: pre-wrap;line-height: 1.5;color: rgb(153, 153, 153);margin-top: 1em;margin-bottom: 1em;padding: 12px 10px;border-width: 1px;border-style: solid;border-color: rgb(232, 232, 232);font-variant-numeric: normal;font-variant-east-asian: normal;text-align: start;widows: 1;background: rgb(244, 245, 246);">def quantitative_summarized(dataframe, x=None, y=None, hue=None, palette='Set1', ax=None, verbose=True, swarm=False):
    '''
    Helper function that gives a quick summary of quantitative data
    Arguments
    =========
    dataframe: pandas dataframe
    x: str. horizontal axis to plot the labels of categorical data (usually the target variable)
    y: str. vertical axis to plot the quantitative data
    hue: str. if you want to compare it with another categorical variable (usually the target variable if x is another variable)
    palette: array-like. Colour of the plot
    swarm: if swarm is set to True, a swarm plot would be overlaid
    Returns
    =======
    Quick Stats of the data and also the box plot of the distribution
    '''
    series = dataframe[y]
    print(series.describe())
    print('mode: ', series.mode())
    if verbose:
        print('='*80)
        print(series.value_counts())
    # Box plot of y, optionally grouped by x and split by hue.
    sns.boxplot(x=x, y=y, hue=hue, data=dataframe, palette=palette, ax=ax)
    if swarm:
        # Overlay the individual observations on top of the boxes.
        sns.swarmplot(x=x, y=y, hue=hue, data=dataframe,
                      palette=palette, ax=ax)
    plt.show()
</pre>

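Like categorical_summarized, quantitative_summarized takes a data frame and some input arguments, and outputs the underlying statistics along with a box plot and a swarm plot (if swarm is set to True).

quantitative_summarized can take one quantitative variable and up to two categorical variables. The quantitative variable must be assigned to y, and the two categorical variables can be assigned to x and hue respectively. Some sample code showing its usage follows:

Univariate analysis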

<pre style="-webkit-tap-highlight-color: transparent;box-sizing: border-box;font-family: Consolas, Menlo, Courier, monospace;font-size: 16px;white-space: pre-wrap;line-height: 1.5;color: rgb(153, 153, 153);margin-top: 1em;margin-bottom: 1em;padding: 12px 10px;border-width: 1px;border-style: solid;border-color: rgb(232, 232, 232);font-variant-numeric: normal;font-variant-east-asian: normal;text-align: start;widows: 1;background: rgb(244, 245, 246);"># univariate analysis
quantitative_summarized(dataframe= train_df, y = 'Age', palette=c_palette, verbose=False, swarm=True)
</pre>

This gives the following:

Figure: output of quantitative_summarized on the Age variable

Bivariate analysis

<pre style="-webkit-tap-highlight-color: transparent;box-sizing: border-box;font-family: Consolas, Menlo, Courier, monospace;font-size: 16px;white-space: pre-wrap;line-height: 1.5;color: rgb(153, 153, 153);margin-top: 1em;margin-bottom: 1em;padding: 12px 10px;border-width: 1px;border-style: solid;border-color: rgb(232, 232, 232);font-variant-numeric: normal;font-variant-east-asian: normal;text-align: start;widows: 1;background: rgb(244, 245, 246);"># bivariate analysis with target variable
quantitative_summarized(dataframe= train_df, y = 'Age', x = 'Survived', palette=c_palette, verbose=False, swarm=True)
</pre>

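This gives the following:

Figure: output of quantitative_summarized on the Age variable with x set to Survived

Multivariate analysis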

<pre style="-webkit-tap-highlight-color: transparent;box-sizing: border-box;font-family: Consolas, Menlo, Courier, monospace;font-size: 16px;white-space: pre-wrap;line-height: 1.5;color: rgb(153, 153, 153);margin-top: 1em;margin-bottom: 1em;padding: 12px 10px;border-width: 1px;border-style: solid;border-color: rgb(232, 232, 232);font-variant-numeric: normal;font-variant-east-asian: normal;text-align: start;widows: 1;background: rgb(244, 245, 246);"># multivariate analysis with Embarked variable and Pclass variable
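# c_palette3 is not defined in the original post; a three-colour palette (one per Pclass) is assumed here
c_palette3 = ['tab:blue', 'tab:orange', 'tab:green']
quantitative_summarized(dataframe= train_df, y = 'Age', x = 'Embarked', hue = 'Pclass', palette=c_palette3, verbose=False, swarm=False)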
</pre>

This gives the following:

Figure: output of quantitative_summarized on the Age variable with x set to Embarked and hue set to Pclass

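4. Clustering

k-Means clustering

k-means clustering belongs to the family of partitioning clustering methods. In partitioning clustering we have to specify the number of clusters, k, that we want. This can be done by picking the "elbow" point of the plot below.

Figure: elbow plot for choosing k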
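
One common way to produce such an elbow plot is to fit k-means for a range of k and record the inertia (the within-cluster sum of squared distances). The sketch below does this on synthetic blobs rather than on the Titanic data. The data, the range of k, and the variable names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with three well-separated blobs (not Titanic data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Inertia always drops as k grows, so we look for the k
# after which the drop levels off (the "elbow").
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting inertias against k would show a sharp bend at k = 3 for this toy data, which is the number of blobs that were generated.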

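Because k-means computes distances between observations in feature space to decide which centroid an observation belongs to, we have to preprocess the data by encoding the categorical variables and filling in missing values. A simple preprocessing function is shown below.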

<pre style="-webkit-tap-highlight-color: transparent;box-sizing: border-box;font-family: Consolas, Menlo, Courier, monospace;font-size: 16px;white-space: pre-wrap;line-height: 1.5;color: rgb(153, 153, 153);margin-top: 1em;margin-bottom: 1em;padding: 12px 10px;border-width: 1px;border-style: solid;border-color: rgb(232, 232, 232);font-variant-numeric: normal;font-variant-east-asian: normal;text-align: start;widows: 1;background: rgb(244, 245, 246);">import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def simple_preprocessing(dataframe, train=True):
    le = LabelEncoder()
    # Drop identifier-like and free-text columns that are not useful as features.
    X = dataframe.drop(['PassengerId', 'Cabin', 'Name', 'Ticket'], axis=1)
    # Fill missing values with the most frequent value of each column.
    X['Age'] = X['Age'].fillna(value=X['Age'].mode()[0])
    X['Embarked'] = le.fit_transform(X['Embarked'].fillna(value=X['Embarked'].mode()[0]))
    # Encode Sex as 1 for male and 0 otherwise.
    X['Sex'] = np.where(X['Sex'] == 'male', 1, 0)

    if train:
        X = X.drop(['Survived'], axis=1)
        # One-hot encode the target into 'Alive' / 'Dead' columns.
        y = np.where(dataframe['Survived'] == 1, 'Alive', 'Dead')
        y = pd.get_dummies(y, columns=['Survived'])
        return X, y
    else:
        return X
</pre>

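Now that the data has been processed, we have to perform feature scaling so that distances between features are comparable. This is easily done with the sklearn.preprocessing library. After running the k-means algorithm with k = 2, we can plot the variables. The Python code is shown below.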
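
The scaling and clustering step itself is not shown in the post. A minimal sketch of what it could look like is given here, assuming a StandardScaler chained with KMeans in a scikit-learn pipeline. The synthetic features stand in for the preprocessed Titanic columns, and the choice of StandardScaler is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales (not Titanic data).
rng = np.random.default_rng(0)
group_a = np.column_stack([rng.normal(20, 2, 50), rng.normal(10, 5, 50)])
group_b = np.column_stack([rng.normal(60, 2, 50), rng.normal(500, 5, 50)])
features = np.vstack([group_a, group_b])

# Scale first so that no single feature dominates the Euclidean distance,
# then run k-means with k = 2.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
pipeline.fit(features)
labels = pipeline.predict(features)

print(labels[:5], labels[-5:])
```

Wrapping both steps in one pipeline means the same scaling is applied automatically whenever predict is called on new rows.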

<pre style="-webkit-tap-highlight-color: transparent;box-sizing: border-box;font-family: Consolas, Menlo, Courier, monospace;font-size: 16px;white-space: pre-wrap;line-height: 1.5;color: rgb(153, 153, 153);margin-top: 1em;margin-bottom: 1em;padding: 12px 10px;border-width: 1px;border-style: solid;border-color: rgb(232, 232, 232);font-variant-numeric: normal;font-variant-east-asian: normal;text-align: start;widows: 1;background: rgb(244, 245, 246);">import matplotlib as mpl
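import matplotlib.pyplot as plt

# `pipeline` (fitted scaler + k-means) and `sample_train` (preprocessed data)
# are not defined in this post and are assumed to exist already.
fig = plt.figure(figsize = (8,10))
mpl.rcParams['image.cmap'] = 'jet'
labels = pipeline.predict(sample_train)
x_label = 'Survived'
y_label = 'Age'
# Colour each point by its predicted cluster.
plt.scatter(sample_train[x_label], sample_train[y_label], c = labels, alpha = 0.3)
plt.xlabel(x_label)
plt.xticks(sample_train[x_label])
plt.ylabel(y_label)
plt.show()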
</pre>

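Agglomerative hierarchical clustering

In this subsection I will introduce another quick way to perform EDA through clustering. Agglomerative clustering takes a bottom-up approach in which individual observations are iteratively joined together based on their distance. We will use the scipy.cluster.hierarchy package to perform the linkage and show the result as a dendrogram. In the code below, the distance between two clusters is computed with complete linkage (method='complete'), that is, the distance between their farthest members. The nearest-neighbour method would correspond to method='single'.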

<pre style="-webkit-tap-highlight-color: transparent;box-sizing: border-box;font-family: Consolas, Menlo, Courier, monospace;font-size: 16px;white-space: pre-wrap;line-height: 1.5;color: rgb(153, 153, 153);margin-top: 1em;margin-bottom: 1em;padding: 12px 10px;border-width: 1px;border-style: solid;border-color: rgb(232, 232, 232);font-variant-numeric: normal;font-variant-east-asian: normal;text-align: start;widows: 1;background: rgb(244, 245, 246);">from scipy.cluster.hierarchy import linkage
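from scipy.cluster.hierarchy import dendrogram
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# `scaler` is not defined in the original post; a StandardScaler is assumed.
scaler = StandardScaler()
# Hold out 5% of the rows so that the dendrogram stays readable.
sample_train, sample_val, gt_train, gt_val = train_test_split(train_df,
                                                              train_df['Survived'],
                                                              test_size=0.05,
                                                              random_state=99)
sample_val_processed = simple_preprocessing(sample_val, train = False)
sample_val_processed = scaler.fit_transform(sample_val_processed)
mergings = linkage(sample_val_processed, method='complete')
fig = plt.figure(figsize = (16,10))
dendrogram(mergings,
           labels=np.array(sample_val['Name']),
           leaf_rotation=90,
           leaf_font_size=10)
plt.show()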
</pre>
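
To see what linkage returns without drawing anything, the following sketch runs it on six made-up points and cuts the resulting tree into two flat clusters with fcluster. The points and variable names are illustrative assumptions, not part of the analysis above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Six synthetic 2-D observations forming two obvious groups.
observations = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

# Each row of the linkage matrix records one merge:
# (cluster i, cluster j, distance, size of the new cluster).
mergings = linkage(observations, method='complete')

# Cut the tree into two flat clusters instead of drawing a dendrogram.
flat_labels = fcluster(mergings, t=2, criterion='maxclust')

print(mergings.shape)
print(flat_labels)
```

For n observations the linkage matrix always has n - 1 rows, one per merge, and the dendrogram is just a drawing of those merges.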

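5. Feature importance from tree-based estimators

Another quick way to perform EDA is through tree-based estimators. A decision tree learns how to "best" split the dataset into smaller subsets before finally outputting a prediction at a leaf node. The splits are usually defined by an impurity criterion such as Gini impurity or information gain (entropy). Since this is a post about EDA rather than decision trees, I will not explain the math behind them in detail, but I will show you how to use them to understand your features better.

Based on the impurity criterion, a tree can be built by greedily picking the features that contribute the most information gain. To illustrate this, I will use the scikit-learn library.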
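
For readers who want a feel for the two criteria, here is a minimal pure-Python sketch of Gini impurity, entropy, and the information gain of a single split. The label lists are made up, and the function names are introduced only for this illustration. scikit-learn's own implementation is more elaborate.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]

print(gini(parent), entropy(parent))
print(round(information_gain(parent, left, right), 4))
```

A greedy tree builder evaluates candidate splits like this one and keeps the split with the largest gain at each node.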

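Building a random forest classifier

We start by building a random forest classifier. By default, the impurity criterion is set to Gini. With the following Python code, we can see the corresponding feature importances for our Titanic dataset.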

<pre style="-webkit-tap-highlight-color: transparent;box-sizing: border-box;font-family: Consolas, Menlo, Courier, monospace;font-size: 16px;white-space: pre-wrap;line-height: 1.5;color: rgb(153, 153, 153);margin-top: 1em;margin-bottom: 1em;padding: 12px 10px;border-width: 1px;border-style: solid;border-color: rgb(232, 232, 232);font-variant-numeric: normal;font-variant-east-asian: normal;text-align: start;widows: 1;background: rgb(244, 245, 246);">from sklearn.ensemble import RandomForestClassifier
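# X_train, y_train and X_val are assumed to come from simple_preprocessing and a train/validation split.
rf_clf = RandomForestClassifier(n_estimators = 500, max_depth=12)
rf_clf.fit(X_train, y_train)
rf_y_pred = rf_clf.predict(X_val)
# Horizontal bar plot of the (up to) 12 most important features, largest on top.
pd.Series(rf_clf.feature_importances_, index = X_train.columns).nlargest(12).plot(
    kind = 'barh',
    figsize = (10, 10),
    title = 'Feature importance from RandomForest').invert_yaxis()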
</pre>
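
The same idea can be tried on synthetic data where the informative features are known in advance. The sketch below is only an illustration under that assumption: the data come from make_classification, and the column names f0 to f5 are made up.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the informative features come first,
# so 'f0' and 'f1' carry the signal and 'f2'..'f5' are pure noise.
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
columns = [f'f{i}' for i in range(X.shape[1])]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are normalised to sum to 1.
importances = pd.Series(clf.feature_importances_, index=columns)
print(importances.sort_values(ascending=False))
```

Because the importances sum to 1, each value can be read as that feature's share of the total impurity reduction across the forest.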

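XGBoost

Another way to create an ensemble of decision trees is XGBoost, which is part of the family of gradient boosting frameworks. With the following Python code, we can see which features are important to our XGBoost model. Note that, unlike the random forest above, XGBoost does not use the Gini criterion: it chooses splits by the gain in its regularized training objective.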

<pre style="-webkit-tap-highlight-color: transparent;box-sizing: border-box;font-family: Consolas, Menlo, Courier, monospace;font-size: 16px;white-space: pre-wrap;line-height: 1.5;color: rgb(153, 153, 153);margin-top: 1em;margin-bottom: 1em;padding: 12px 10px;border-width: 1px;border-style: solid;border-color: rgb(232, 232, 232);font-variant-numeric: normal;font-variant-east-asian: normal;text-align: start;widows: 1;background: rgb(244, 245, 246);">from xgboost import XGBClassifier
xgb_clf = XGBClassifier(max_depth=12, learning_rate=1e-4,n_estimators=500)
# y_train is one-hot encoded ('Alive', 'Dead'), so argmax recovers integer class labels.
xgb_clf.fit(X_train, np.argmax(np.array(y_train), axis = 1))
xgb_y_pred = xgb_clf.predict(X_val)
# Horizontal bar plot of the (up to) 12 most important features, largest on top.
pd.Series(xgb_clf.feature_importances_, index = X_train.columns).nlargest(12).plot(
    kind = 'barh',
    figsize = (10, 10),
    title = 'Feature importance from XGBoost').invert_yaxis()
</pre>