Python数据分析(八):农粮组织数据集探索性分析(EDA)

作者: 风之舟 | 来源:发表于2020-01-04 22:33 被阅读0次

这里我们用FAO(Food and Agriculture Organization)组织提供的数据集，练习一下如何利用python进行探索性数据分析。

探索性数据分析(Exploratory Data Analysis，简称EDA)是指对已有的数据(特别是调查或观察得来的原始数据)在尽量少的先验假定下进行探索，通过作图、制表、方程拟合、计算特征量等手段探索数据的结构和规律的一种数据分析方法，特别是当我们面对大数据时代到来的时候，各种杂乱的“脏数据”，往往不知所措，不知道从哪里开始了解目前拿到手上的数据时候，探索性数据分析就非常有效。

1、导包

我们先导入需要用到的包

%matplotlib inline
%config InlineBackend.figure_format='retina'
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import os,sys
import warnings 
sns.set_context("poster",font_scale=1.3)

接下来，加载数据集

#导入数据
data = pd.read_csv('../数据集/农粮数据/aquastat.csv.gzip',compression='gzip')#compression直接使用磁盘上的压缩文件
data.head()

看一下数据量，

data.shape
#输出结果
(143280, 7)

看一下数据的信息，

data.info()
输出结果
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143280 entries, 0 to 143279
Data columns (total 7 columns):
country          143280 non-null object
region           143280 non-null object
variable         143280 non-null object
variable_full    143280 non-null object
time_period      143280 non-null object
year_measured    96411 non-null float64
value            96411 non-null float64
dtypes: float64(2), object(5)
memory usage: 7.7+ MB

2、数据切片分析

我们先来看一下variable,variable_full这两列的信息，

#数据切片分析
#看一下variable,variable_full这两列的信息，去掉重复项
data[['variable','variable_full']].drop_duplicates()#drop_duplicate方法是对DataFrame格式的数据，去除特定列下面的重复行。返回DataFrame格式的数据。
#total_area 国土面积(1000公顷)
#arable_land  可耕作面积
#permanent_crop_area 多年生作物面积
#cultivated_area  耕地面积
#percent_cultivated 耕地面积占比
#total_pop 总人口
#rural_pop 农村人口
#urban_pop 城市人口
#gdp 国内生产总值
#gdp_per_capita 人均国内生产总值
#agg_to_gdp 农业，增加国内生产总值
#human_dev_index  人类发展指数
#gender_inequal_index 性别不平等指数
#percent_undernourished 营养不良患病率
#avg_annual_rain_depth 长期平均年降水量
#national_rainfall_index  全国降雨指数

看一下统计了多少国家，

data.country.nunique()
输出结果
199

看一下有多少个时间周期，

data.time_period.nunique()#时间周期
输出结果
12

看一下时间周期有哪些，

time_period = data.time_period.unique()
time_period

我们看一下某一列某个指标的缺失值的个数，比如variable是total_area时缺失值的个数,

data[data.variable=='total_area'].value.isnull().sum()#缺失值个数
输出结果
220

我们通过几个维度来进行数据的分析：

横截面：一个时期内所有国家
时间序列：一个国家随着时间的推移
面板数据：所有国家随着时间的推移(作为数据给出)
地理空间：所有地理上相互联系的国家

我们按照上面的处理继续，现在我们想统计一下对于一个时间周期来说，不同国家在这个周期内的变化情况，

def time_slice(df,time_period):
    df = df[df.time_period == time_period]
    df = df.pivot(index = 'country',columns = 'variable',values='value')
    # 根据列对数据表进行重塑
    # index是重塑的新表的索引名称是什么
    # 第二个columns是重塑的新表的列名称是什么
    # values就是生成新列的值应该是多少
    df.columns.name=time_period
    return df

time_slice(data,time_period[0])

我们也可以按照国家分类，查看某个国家在不同时期的变化，

#按照国家分类，国家在不同时间周期内的变化
countries = data.country.unique()
def country_slice(df,country):
    df = df[df.country == country]
    df = df.pivot(index = 'variable',columns='time_period',values='value')
    df.index.name=country
    return df

country_slice(data,countries[40]).head(10)

我们还可以根据属性，查看不同国家在不同周期内的变化情况，

#对某一个属性或者变量进行分析
def variable_slice(df,variable):
    df = df[df.variable == variable]
    df = df.pivot(index = 'country',columns='time_period',values='value')
    df.index.name='country'
    return df

variable_slice(data,'total_pop').head(10)

我们还可以给定国家和指标，查看这个国家在这个指标上的变化情况，

#给定国家和指标，查看这个国家在这个指标上的变化情况
def time_series(df,country,variable):
    series = df[(df.country == country) & (df.variable == variable)]
    series = series.dropna()[['year_measured','value']]
    
    series.year_measured = series.year_measured.astype(int)#可用于转化dateframe某一列的数据类型
    series.set_index('year_measured',inplace=True)
    series.columns = [variable]
    return series
time_series(data,'Belarus','total_pop')

我们还有region(区域)没有查看，我们来看一下：

data.region.unique()

通过上图可以看出，区域太多，不便于观察，我们可以将一些区域进行合并。减少区域数量有助于模型评估，可以创建一个字典来查找新的，更简单的区域(亚洲，北美洲，南美洲，大洋洲)

simple_regions = {
    'World | Asia':'Asia',
       'Americas | Central America and Caribbean | Central America':'North America',
       'Americas | Central America and Caribbean | Greater Antilles':'North America',
       'Americas | Central America and Caribbean | Lesser Antilles and Bahamas':'North America',
       'Americas | Northern America | Northern America':'North America',
       'Americas | Northern America | Mexico':'North America',
       'Americas | Southern America | Guyana':'South America',
       'Americas | Southern America | Andean':'South America',
       'Americas | Southern America | Brazil':'South America',
       'Americas | Southern America | Southern America':'South America',
       'World | Africa':'Africa',
       'World | Europe':'Europe', 
       'World | Oceania':'Oceania'
}

data.region = data.region.apply(lambda x:simple_regions[x])

print(data.region.unique())
#输出结果
['Asia' 'North America' 'South America' 'Africa' 'Europe' 'Oceania']

我们来看一下数据变化，

data

这样看起来，region数据变得很简洁，我们提取某一个区域来看一下，

#提取单个区域
def subregion(data,region):
    data = data[data.region == region]
    return data

subregion(data,'Asia').head(10)

3、单变量分析

紧接着上面的数据处理，我们重新导入一下包，这次有一些新包，

%matplotlib inline

#plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("poster",font_scale=1.3)
import folium  #画世界地图

import os,sys
import numpy as np
import pandas as pd

import pivottablejs
import missingno as msno#缺失值可视化
import pandas_profiling#可以以网页的形式展现给你数据总体概况

msno.matrix(recent,labels=True)#缺失值可视化

我们看一下水资源的情况，

#水资源总量
msno.matrix(variable_slice(data,'exploitable_total'),inline=False,sort='descending')
plt.xlabel('Time period')
plt.ylabel('Country')
plt.title('Missing total exploitable water resources data across countries and time  period \n \n \n')

通过上图可以看出只有一小部分国家报告了可利用的水资源总量，这些国家中只有极少数国家拥有最近一段时间的数据，我们将删除变量，因为这么少的数据点会导致很多问题。

data.loc[~data.variable.str.contains('exploitable'),:]

接下来我们看一下全国降雨指数，

# 全国降水指数（NRI）(毫米/每年)
msno.matrix(variable_slice(data,'national_rainfall_index'),inline=False,sort='descending')
plt.xlabel('Time period')
plt.ylabel('Country')
plt.title('Missing national rainfall index data across countries and time periods \n \n \n')

全国降雨在2002年以后不再报到，所以我们也删除这个数据，

data = data.loc[~(data.variable=='national_rainfall_index')]

我们单独拿出一个洲来进行分析，举例南美洲，我们来看一下数据的完整性，

#看美洲
# 先把数据筛选出来
north_america = subregion(data,'North America')

msno.matrix(msno.nullity_sort(time_slice(north_america,'2013-2017'),sort='descending').T,inline=False)#数据的完整性

通过上图发现，南美洲各国缺失得的数据并不是很多，前几个国家是不缺数据的，我们抽查一下巴哈马(图中倒数第二)，看一下它缺少了哪些数据，

msno.nullity_filter(country_slice(data,'Bahamas').T,filter='bottom',p=0.1)

接下来我们探索一下区域之间的关系，看一下缺失值的出现，是不是跟国家之间有关系，我们画一下世界地图，选择区域，然后在这个区域选择某一个指标，看一下它的缺失值的分布情况，颜色越深，缺失值越严重，我们以农业对GDPagg_to_gdp,为指标看一下分布情况，

#颜色越深，缺失值越严重
geo = r'../../数据集/农粮数据/world.json'
null_data = recent['agg_to_gdp'].notnull()*1
map = folium.Map(location=[48,-102],zoom_start=2)
map.choropleth(geo_data=geo,
              data=null_data,
              columns=['country','agg_to_gdp'],
              key_on='feature.properties.name',reset=True,
              fill_color='GnBu',fill_opacity=1,line_opacity=0.2,
              legend_name='Missing agricultural contribution to GDP data 2013-2017')

map

我们也可以指定不同的指标，

def plot_null_map(df,time_period,variable,legend_name=None):
    ts=time_slice(df,time_period).reset_index().copy()#不指明，从0开始
    ts[variable]=ts[variable].notnull()*1
    map = folium.Map(location=[48,-102],zoom_start=2)
    map.choropleth(geo_data=geo,
        data=ts,
        columns=['country',variable],
        key_on='feature.properties.name',reset=True,
        fill_color='GnBu',fill_opacity=1,line_opacity=0.2,
        legend_name=legend_name if legend_name else variable)
    return map

plot_null_map(data,'2013-2017','number_undernourished','Number undernourished id missing')

接下来我们用热力图展示一下，不同指标随着时间的变化情况，颜色越深说明在这个指标上收集到的国家数据越少，反之则越多。

fig,ax = plt.subplots(figsize=(16,16))
sns.heatmap(data.groupby(['time_period','variable']).value.count().unstack().T,ax=ax)
plt.xticks(rotation=45)
plt.xlabel('Time period')
plt.ylabel('Variable')
plt.title('Number of countries with data reported for each variable over time')

通过上图我们知道，刚开始，统计这些指标的国家比较少，随着时间的推移统计这个指标的国家越来越多，可以通过颜色由深到浅的变化看出来，从过上图分析可以知道不同指标国家的重视程度，有些指标也正好相反，国家越来越不重视了。

接下来，我们使用pandas_profiling来对单变量以及多变量之间的关系进行统计一下，

pandas_profiling.ProfileReport(time_slice(data,'2013-2017'))

4、峰度与偏度

这里我们要计算的是，比如

均值，中位数，模式，四分位
标准差、方差、范围、间距范围
偏度、峰度
首先我们先来看一下数据的描述，

recent[['total_pop','urban_pop','rural_pop']].describe().astype(int) #总人口，城镇人口，农村人口

通过上图我们就可以大体得出我们想要的结果，但是通过看出图表我们发现，rural_pop的最小值是负数，人口统计不应该出现负值，我们继续来看一下，

recent.sort_values('rural_pop')[['total_pop','urban_pop','rural_pop']].head()

我们按照rural_pop从小到大进行排序，发现的确有几个国家的农村人口是负数，

我们再具体看一下这几个国家哪一年的农村人口是负的，我们以Qatar这个国家为例，

time_series(data,'Qatar','total_pop').join(time_series(data,'Qatar','urban_pop')).join(time_series(data,'Qatar','rural_pop'))

人口数目是不可能小于0，所以这说明数据有问题，存在脏数据，如果做分析预测时，要注意将这些脏数据处理一下。

接下来我们看一下偏度，我们规定，

均值<中位数，定为左偏；
均值>中位数，定为有偏；

通过这张图可以看出来，比如total_pop列mean(36890)>50%(7595)，所以是右偏。
我们看一下如何计算偏度，

# 计算偏度
import scipy
recent[['total_pop','urban_pop','rural_pop']].apply(scipy.stats.skew)

正态分布的偏度应为零，负偏度表示左偏，正偏表示右偏。

偏度计算完后，我们计算一下峰度，峰度也是一个正态分布，峰度不能为负，只能是正数，越大说明越陡峭，

recent[['total_pop','urban_pop','rural_pop']].apply(scipy.stats.kurtosis)

接下来我们看一下，如果数据分布非常不均匀该怎么办呢，

fig,ax=plt.subplots(figsize=(12,6))
ax.hist(recent.total_pop.values,bins=50)
ax.set_xlabel('Total population')
ax.set_ylabel('Number of countries')
ax.set_title('Distribution of population of countries 2013-2017')

上图是2013-2017年国家总人数的分布，通过上图我们发现，人口量少于200000(不考虑单位)的国家非常多，人口大于1200000的国家非常少，如果我们需要建模的话，这种数据我们是不能要的。这个时候我们应该怎么办呢？

通常，遇到这种情况，使用log变换将其变为正常。对数变换是数据变换的一种常用方式，数据变换的目的在于使数据的呈现方式接近我们所希望的前提假设，从而更好的进行统计推断。

接下来，我们用log转换一下，并看一下它的偏度和峰值，

#偏度
recent[['total_pop']].apply(np.log).apply(scipy.stats.skew)

可以看出偏度下降了很多，减少了倾斜。

#峰度
recent[['total_pop']].apply(np.log).apply(scipy.stats.kurtosis)

可以发现峰度也下降了，接下来我们看一下经过log转换后的数据分布，

def plot_hist(df,variable,bins=20,xlabel=None,by=None,ylabel=None,title=None,logx=False,ax=None):
    if not ax:
        fig,ax=plt.subplots(figsize=(12,8))
    if logx:
        if df[variable].min()<=0:
            df[variable] = df[variable] -df[variable].min()+1
            print('Warning:data<=0 exists,data transformed by %0.2g before plotting' % (-df[variable].min()))
        bins = np.logspace(np.log10(df[variable].min()),np.log10(df[variable].max()),bins)
        ax.set_xscale("log")
    ax.hist(df[variable].dropna().values,bins=bins)
    
    if xlabel:
        ax.set_xlabel(xlabel)
    if ylabel:
        ax.set_ylabel(ylabel)
    if title:
        ax.set_title(title)
    
    return ax

plot_hist(recent,'total_pop',bins=25,logx=True,xlabel='Log of total population',ylabel='Number of countries',
         title='Distribution of total population of countries 2013-2017')

虽然数据还有一些偏度，但是明显好了很多，呈现的分布也比较标准。

5、数据的分析维度

首先我们先来看一下美国的人口总数随时间的变化，

plt.figure(figsize=(10,10))
plt.plot(time_series(data,'United States of America','total_pop'))
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('United States population over time')

接下来，我们查看北美洲每个国家人口总数随着时间的变化，

plt.figure(figsize=(15, 10))
with sns.color_palette(sns.diverging_palette(220,280,s=85,l=25,n=23)):
    north_america = time_slice(subregion(data,'North America'),'1958-1962').sort_values('total_pop').index.tolist()
    for country in north_america:
        plt.plot(time_series(data,country,'total_pop'),label=country)
        plt.xlabel('Year')
        plt.ylabel('Population')
        plt.title('North American populations over time')

    plt.legend(loc=2,prop={'size':10})

这个时候我们发现，一些国家由于人口数量本身就少，所以整个图像显示的不明显，我们可以改变一下参照指标，那我们通过什么标准化？我们可以选择一个国家的最小、平均、中位数、最大值...或任何其他位置。那我们选择最小值，这样我们就能看到每个国家的起始人口上的增长。

plt.figure(figsize=(15, 10))
with sns.color_palette(sns.diverging_palette(220,280,s=85,l=25,n=23)):
    for country in north_america:
        ts=time_series(data,country,'total_pop')
        ts['norm_pop']=ts.total_pop/ts.total_pop.min()*100
        plt.plot(ts['norm_pop'],label=country)
        plt.xlabel('Year')
        plt.ylabel('Percent increase in population')
        plt.title('Percent increase in population from 1960 in North American countries')

    plt.legend(loc=2,prop={'size':10})

人口增长倍率图

我们也可以用热度图来展示，用颜色的深浅来比较大小关系，

north_america_pop = variable_slice(subregion(data,'North America'),'total_pop')
north_america_norm_pop = north_america_pop.div(north_america_pop.min(axis=1),axis=0)*100
north_america_norm_pop = north_america_norm_pop.loc[north_america]

fig,ax = plt.subplots(figsize=(16,12))
sns.heatmap(north_america_norm_pop,ax=ax,cmap=sns.light_palette((214,90,60),input="husl",as_cmap=True))
plt.xticks(rotation=45)
plt.xlabel('Time period')
plt.ylabel('Country,ordered by population in 1960(<-greatest to least->)')
plt.title('Percent increase in population from 1960')

接下来我们分析一下水资源的分布情况，

plot_hist(recent,'total_renewable',bins=50,
         xlabel='Total renewable water resources (10^9 m^3/yr$)',
         ylabel='Number of countries',
         title='Distribution of total renewable water resources,2013-2017')

我们可以进行一下log转换，

plot_hist(recent,'total_renewable',bins=50,
         xlabel='Total renewable water resources (10^9 m^3/yr$)',
         ylabel='Number of countries',logx=True,
         title='Distribution of total renewable water resources,2013-2017')

我们用热度图画一下，

north_america_renew = variable_slice(subregion(data,'North America'),'total_renewable')

fig,ax = plt.subplots(figsize=(16,12))
sns.heatmap(north_america_renew,ax=ax,cmap=sns.light_palette((214,90,60),input="husl",as_cmap=True))
plt.xticks(rotation=45)
plt.xlabel('Time period')
plt.ylabel('Country,ordered by Total renewable water resources in 1960(<-greatest to least->)')
plt.title('Total renewable water resources increase in population from 1960')

6、变量关系可视化展示

连续值可以画成散点图，方便观看，
我们来看一下随着季节变化，人均GDP的变化情况，

data=data.loc[~data.variable.str.contains('exploitable')]
data=data.loc[~(data.variable=='national_rainfall_index')]

plt.figure(figsize=(8,8))
plt.scatter(recent.seasonal_variability,recent.gdp_per_capita)
plt.xlabel('Seasonal variability')
plt.ylabel('GDP per capita ($USD/person)')

不仅可以用散点图，我们还可以Seaborn中的函数JointGrid画出来

def plot_scatter(df,x,y,bins=20,xlabel=None,ylabel=None,title=None,ax=None,logx=False,logy=False):
    if not ax:
        fig,ax=plt.subplots(figsize=(12,8))
    colors = mpl.rcParams['axes.prop_cycle'].by_key()['color']
    if by:
        groups = df.groupby(by)
        for j,(name,group) in enumerate(groups):
            ax.scatter(group[x],group[y],color=colors[j],label=name)
        ax.legend()
    else:
        ax.scatter(df[x],df[y],color=colors[0])
    if logx:
        ax.set_xscale('log')
    if logy:
        ax.set_yscale('log')
    ax.set_xlabel(xlabel if xlabel else x)
    ax.set_ylabel(ylabel if ylabel else y)
    if title:
        ax.set_title(title)
    
    return ax

svr = [recent.seasonal_variability.min(),recent.seasonal_variability.max()]
gdpr = [recent.gdp_per_capita.min(),recent.gdp_per_capita.max()]
gdpbins = np.logspace(*np.log10(gdpr),25)

g = sns.JointGrid(x="seasonal_variability",y="gdp_per_capita",data=recent,ylim=gdpr)
g.ax_marg_x.hist(recent.seasonal_variability,range=svr)
g.ax_marg_y.hist(recent.gdp_per_capita,range=gdpr,bins=gdpbins,orientation="horizontal")
g.plot_joint(plt.hexbin,gridsize=25)
ax = g.ax_joint

g.fig.set_figheight(8)
g.fig.set_figwidth(9)

相关程度：
相关度量两个变量之间的线性关系的强度，我们可以用相关性来识别变量。

现在我们单独拿出来一个指标分析是什么因素与人均GDP的变化有关系，正相关就是积极影响，负相关就是消极影响。

recent_corr = recent.corr().loc['gdp_per_capita'].drop(['gdp','gdp_per_capita'])

def conditional_bar(series,bar_colors=None,color_labels=None,figsize=(13,24),
                   xlabel=None,by=None,ylabel=None,title=None):
    fig,ax = plt.subplots(figsize=figsize)
    if not bar_colors:
        bar_colors = mpl.rcParams['axes.prop_cycle'].by_key()['color'][0]
    plt.barh(range(len(series)),series.values,color=bar_colors)
    plt.xlabel('' if not xlabel else xlabel)
    plt.ylabel('' if not ylabel else ylabel)
    plt.yticks(range(len(series)),series.index.tolist())
    plt.title('' if not title else title)
    plt.ylim([-1,len(series)])
    if color_labels:
        for col,lab in color_labels.items():
            plt.plot([],linestyle='',marker='s',c=col,label=lab)
        lines,labels=ax.get_legend_handles_labels()
        ax.legend(lines[-len(color_labels.keys()):],labels[-len(color_labels.keys()):],loc='upper right')
    plt.close()
    return fig

bar_colors = ['#0055A7' if x else '#2C3E4F' for x in list(recent_corr.values<0)]
color_labels = {'#0055A7':'Negative correlation','#2C3E4F':'Positive correlation'}

conditional_bar(recent_corr.apply(np.abs),bar_colors,color_labels,
               title='Magnitude of correlation with GDP per capita,2013-2017',
               xlabel='|Correlation|')

当我们在画图的时候也可以考虑一下利用bined设置一下区间，比如说连续值我们可以分成几个区间进行分析，这里我们以人均GDP的数量来进行分析，我们可以将人均GDP的数据映射到不同的区间，比如人均GDP比较低，比较落后的国家，以及人均GDP比较高，比较发达的国家，这个也是我们经常需要的操作，

plot_hist(recent,'gdp_per_capita',xlabel='GDP per capita($)',
         ylabel='Number of countries',
         title='Distribution of GDP per capita,2013-2017')

做一下log变换，这里是25个bin

plot_hist(recent,'gdp_per_capita',xlabel='GDP per capita($)',logx=True,
         ylabel='Number of countries',bins=25,
         title='Distribution of GDP per capita,2013-2017')

我们指定一下分割的标准，

capita_bins = ['Very low','Low','Medium','High','Very high']
recent['gdp_bin'] = pd.qcut(recent.gdp_per_capita,5,capita_bins)
bin_ranges = pd.qcut(recent.gdp_per_capita,5).unique()

def plot_hist_div(df,variable,bins=None,xlabel=None,by=None,ylabel=None,title=None,logx=False,ax=None):
    if not ax:
        fig,ax=plt.subplots(figsize=(12,8))
    if logx:
        bins = np.logspace(np.log10(df[variable].min()),np.log10(df[variable].max()),bins)#返回数以对数刻度均匀分布。
        ax.set_xscale("log")
    if by:
        if type(df[by].unique()) == pd.Categorical:
            cats = df[by].unique().categories.tolist()
        else:
            cats = df[by].unique().tolist()
        for cat in cats:
            to_plot = df[df[by] == cat][variable].dropna()
            ax.hist(to_plot,bins=bins)
    else:
        ax.hist(df[variable].dropna().values,bins=bins)
    
    if xlabel:
        ax.set_xlabel(xlabel)
    if ylabel:
        ax.set_ylabel(ylabel)
    if title:
        ax.set_title(title)
    
    return ax

plot_hist_div(recent,'gdp_per_capita',xlabel='GDP per capita($)',logx=True,
         ylabel='Number of countries',bins=25,by='gdp_bin',
         title='Distribution of log GDP per capita,2013-2017')

我们还可以看一下人均GDP较低，落后国家的内部数据，下面我们看一下内部数据分布情况，用boxplot进行画图，

recent[['gdp_bin','total_pop_access_drinking']].boxplot(by='gdp_bin',figsize=(10,10))
plt.title('Distribution of percent of total population with access to drinking water across gdp per capita categories')
plt.xlabel('GDP per capita quintile')
plt.ylabel('Total population of country')

对于这部分的分布，我们还可以统计看一下其他指标，如下图所示，我们还可以看一下洪水的统计信息，

def mult_boxplots(df,variable,category,
                 xlabel=None,ylabel=None,
                 title=None,ylim=None):
    df[[variable,category]].boxplot(by=category,figsize=(10,10))
    
    if xlabel:
        plt.xlabel(xlabel)
    if ylabel:
        plt.ylabel(ylabel)
    if title:
        plt.title(title)
    if ylim:
        plt.ylim(ylim)

mult_boxplots(recent,'flood_occurence','gdp_bin',
             xlabel='GDP per capita quintile')

我们现在总结一下上面用的函数，为了以后方便使用，我们将用到的函数在统计一下，

#某一个时间区域内，各个国家与各个指标之间的关系
def time_slice(df,time_period):
    df = df[df.time_period == time_period]
    df = df.pivot(index = 'country',columns = 'variable',values='value')
    # 根据列对数据表进行重塑
    # index是重塑的新表的索引名称是什么
    # 第二个columns是重塑的新表的列名称是什么
    # values就是生成新列的值应该是多少
    df.columns.name=time_period
    return df

#指定国家，查看各个指标随时间的变化
countries = data.country.unique()
def country_slice(df,country):
    df = df[df.country == country]
    df = df.pivot(index = 'variable',columns='time_period',values='value')
    df.index.name=country
    return df


#指定指标，查看各个国家随着时间的变化
def variable_slice(df,variable):
    df = df[df.variable == variable]
    df = df.pivot(index = 'country',columns='time_period',values='value')
    df.index.name='country'
    return df


#给定国家和指标，查看这个国家在这个指标上的变化情况
def time_series(df,country,variable):
    series = df[(df.country == country) & (df.variable == variable)]
    series = series.dropna()[['year_measured','value']]
    
    series.year_measured = series.year_measured.astype(int)#可用于转化dateframe某一列的数据类型
    series.set_index('year_measured',inplace=True)
    series.columns = [variable]
    return series

#将国家按照各大洲分类
simple_regions = {
    'World | Asia':'Asia',
       'Americas | Central America and Caribbean | Central America':'North America',
       'Americas | Central America and Caribbean | Greater Antilles':'North America',
       'Americas | Central America and Caribbean | Lesser Antilles and Bahamas':'North America',
       'Americas | Northern America | Northern America':'North America',
       'Americas | Northern America | Mexico':'North America',
       'Americas | Southern America | Guyana':'South America',
       'Americas | Southern America | Andean':'South America',
       'Americas | Southern America | Brazil':'South America',
       'Americas | Southern America | Southern America':'South America',
       'World | Africa':'Africa',
       'World | Europe':'Europe', 
       'World | Oceania':'Oceania'
}
data.region = data.region.apply(lambda x:simple_regions[x])


#提取单个区域
def subregion(data,region):
    data = data[data.region == region]
    return data

#查看各个国家总人口的分布情况(数据经过log转换)
def plot_hist(df,variable,bins=20,xlabel=None,by=None,ylabel=None,title=None,logx=False,ax=None):
    if not ax:
        fig,ax=plt.subplots(figsize=(12,8))
    if logx:
        if df[variable].min()<=0:
            df[variable] = df[variable] -df[variable].min()+1
            print('Warning:data<=0 exists,data transformed by %0.2g before plotting' % (-df[variable].min()))
        bins = np.logspace(np.log10(df[variable].min()),np.log10(df[variable].max()),bins)#返回数以对数刻度均匀分布。
        ax.set_xscale("log")
    ax.hist(df[variable].dropna().values,bins=bins)
    
    if xlabel:
        ax.set_xlabel(xlabel)
    if ylabel:
        ax.set_ylabel(ylabel)
    if title:
        ax.set_title(title)
    
    return ax


#提取某个洲的各个国家，查看总人口随时间的变化
plt.figure(figsize=(15, 10))
with sns.color_palette(sns.diverging_palette(220,280,s=85,l=25,n=23)):
    north_america = time_slice(subregion(data,'North America'),'1958-1962').sort_values('total_pop').index.tolist()
    for country in north_america:
        plt.plot(time_series(data,country,'total_pop'),label=country)
        plt.xlabel('Year')
        plt.ylabel('Population')
        plt.title('North American populations over time')

    plt.legend(loc=2,prop={'size':10})


#指定人口基数，查看各个国家在这个基数上人口变化的倍率
plt.figure(figsize=(15, 10))
with sns.color_palette(sns.diverging_palette(220,280,s=85,l=25,n=23)):
    for country in north_america:
        ts=time_series(data,country,'total_pop')
        ts['norm_pop']=ts.total_pop/ts.total_pop.min()*100
        plt.plot(ts['norm_pop'],label=country)
        plt.xlabel('Year')
        plt.ylabel('Percent increase in population')
        plt.title('Percent increase in population from 1960 in North American countries')

    plt.legend(loc=2,prop={'size':10})


#相关性的条形图
def conditional_bar(series,bar_colors=None,color_labels=None,figsize=(13,24),
                   xlabel=None,by=None,ylabel=None,title=None):
    fig,ax = plt.subplots(figsize=figsize)
    if not bar_colors:
        bar_colors = mpl.rcParams['axes.prop_cycle'].by_key()['color'][0]
    plt.barh(range(len(series)),series.values,color=bar_colors)
    plt.xlabel('' if not xlabel else xlabel)
    plt.ylabel('' if not ylabel else ylabel)
    plt.yticks(range(len(series)),series.index.tolist())
    plt.title('' if not title else title)
    plt.ylim([-1,len(series)])
    if color_labels:
        for col,lab in color_labels.items():
            plt.plot([],linestyle='',marker='s',c=col,label=lab)
        lines,labels=ax.get_legend_handles_labels()
        ax.legend(lines[-len(color_labels.keys()):],labels[-len(color_labels.keys()):],loc='upper right')
    plt.close()
    return fig


#将数据分开不同的区域进行相关性分析
def plot_hist_div(df,variable,bins=None,xlabel=None,by=None,ylabel=None,title=None,logx=False,ax=None):
    if not ax:
        fig,ax=plt.subplots(figsize=(12,8))
    if logx:
        bins = np.logspace(np.log10(df[variable].min()),np.log10(df[variable].max()),bins)#返回数以对数刻度均匀分布。
        ax.set_xscale("log")
    if by:
        if type(df[by].unique()) == pd.Categorical:
            cats = df[by].unique().categories.tolist()
        else:
            cats = df[by].unique().tolist()
        for cat in cats:
            to_plot = df[df[by] == cat][variable].dropna()
            ax.hist(to_plot,bins=bins)
    else:
        ax.hist(df[variable].dropna().values,bins=bins)
    
    if xlabel:
        ax.set_xlabel(xlabel)
    if ylabel:
        ax.set_ylabel(ylabel)
    if title:
        ax.set_title(title)
    
    return ax


#指标的不同类别进行boxplot分析
def mult_boxplots(df,variable,category,
                 xlabel=None,ylabel=None,
                 title=None,ylim=None):
    df[[variable,category]].boxplot(by=category,figsize=(10,10))
    
    if xlabel:
        plt.xlabel(xlabel)
    if ylabel:
        plt.ylabel(ylabel)
    if title:
        plt.title(title)
    if ylim:
        plt.ylim(ylim)

关于农粮组织数据的探索性分析到这里就结束了，希望大家能动手做一下。

Python数据分析(八):农粮组织数据集探索性分析(EDA)

1、导包

2、数据切片分析

3、单变量分析

4、峰度与偏度

5、数据的分析维度

6、变量关系可视化展示

相关文章

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

数据分析