Python Data Analysis and Machine Learning 51: EDA with FAO Data

Author: 只是甲 | Published: 2022-08-09 17:46

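1. About the data source

http://www.fao.org/nr/water/aquastat/data/query/index.html

The three main goals of the FAO (Food and Agriculture Organization of the United Nations) are:

  1. Eradicating hunger, food insecurity and malnutrition
  2. Eliminating poverty and driving economic and social progress
  3. Sustainable management and use of natural resources, including land, water, air, climate and genetic resources, for the benefit of present and future generations

To support these goals, Article 1 of its Constitution requires FAO to "collect, analyze, interpret and disseminate information relating to nutrition, food and agriculture". AQUASTAT was started for this reason. Its aim is to contribute to FAO's goals by collecting, analyzing and disseminating information on water resources, water use and agricultural water management, with an emphasis on countries in Africa, Asia, Latin America and the Caribbean.

Through AQUASTAT, FAO provides data, metadata, reports, country profiles, river basin profiles, regional analyses, maps, tables, spatial data, guidelines and other online tools on:

  1. Water resources: internal, transboundary, total
  2. Water uses: by sector, by source, wastewater
  3. Irrigation: location, area, typology, technology, crops
  4. Dams: location, height, capacity, surface area
  5. Institutions, policies and legislation related to water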

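Data overview:

(Figure: overview of the dataset)
# total_area: total area of the country (1000 ha)
# arable_land: arable land area (1000 ha)
# permanent_crop_area: permanent crops area (1000 ha)
# cultivated_area: cultivated area (arable land + permanent crops)
# percent_cultivated: % of total country area cultivated
# total_pop: total population
# rural_pop: rural population
# urban_pop: urban population
# gdp: gross domestic product
# gdp_per_capita: GDP per capita
# agg_to_gdp: agriculture, value added to GDP
# human_dev_index: Human Development Index
# gender_inequal_index: Gender Inequality Index
# percent_undernourished: prevalence of undernourishment
# avg_annual_rain_depth: long-term average annual precipitation
# national_rainfall_index: National Rainfall Index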

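2. Posing the question

Question:
Are water availability and water use related to GDP per capita?

Our plan:
Following CRISP-DM, exploratory data analysis consists of the main tasks below. We present them linearly here because each task makes little sense without the ones before it. In practice, however, you will jump from one step to another all the time. You may want to run all the steps on a subset of the variables first. Or, quite often, an observation raises a question you want to investigate, and you branch off to answer it before returning to the main path of exhaustive EDA.

  1. Form hypotheses and develop investigation themes to explore
  2. Wrangle the data
  3. Assess data quality
  4. Profile the data
  5. Explore each individual variable in the dataset
  6. Assess the relationship between each variable and the target
  7. Assess interactions between variables
  8. Explore the data across many dimensions

Throughout the analysis, you should:

  1. Keep a list of hypotheses and questions for further exploration.
  2. Record things to watch out for in future analyses.
  3. Show intermediate results to colleagues to get fresh perspectives, feedback and domain knowledge. Don't do EDA in a bubble! Get feedback, especially from people who are removed from the problem and/or have relevant domain knowledge.
  4. Keep visuals and results together. EDA relies on your natural pattern-recognition abilities, so placing visualizations and results close to each other maximizes what you will find.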

3. Preliminary analysis

3.1 A first look at the data

Code:

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# Load the data
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
print(data.head())

# Variable codes and their full descriptions, without duplicates
data_v = data[['variable','variable_full']].drop_duplicates()
print('\n' + "Variable descriptions:")
print(data_v.head())

# Countries covered by the dataset
countries = data.country.unique()
print('\n' + "Countries:")
print(countries)

# Distinct time periods
time_periods = data.time_period.unique()
print('\n' + "Time periods:")
print(time_periods)

# Number of missing values for total_area
data_null1 = data[data.variable=='total_area'].value.isnull().sum()
print('\n' + "Missing values for total_area:")
print(data_null1)

Output:

       country        region    variable  ... time_period year_measured    value
0  Afghanistan  World | Asia  total_area  ...   1958-1962        1962.0  65286.0
1  Afghanistan  World | Asia  total_area  ...   1963-1967        1967.0  65286.0
2  Afghanistan  World | Asia  total_area  ...   1968-1972        1972.0  65286.0
3  Afghanistan  World | Asia  total_area  ...   1973-1977        1977.0  65286.0
4  Afghanistan  World | Asia  total_area  ...   1978-1982        1982.0  65286.0

[5 rows x 7 columns]

Variable descriptions:
                 variable                                      variable_full
0              total_area                Total area of the country (1000 ha)
576           arable_land                         Arable land area (1000 ha)
1152  permanent_crop_area                     Permanent crops area (1000 ha)
1728      cultivated_area  Cultivated area (arable land + permanent crops...
2304   percent_cultivated             % of total country area cultivated (%)

Countries:
['Afghanistan' 'Armenia' 'Azerbaijan' 'Bahrain' 'Bangladesh' 'Bhutan'
 'Brunei Darussalam' 'Cambodia' 'China'
 "Democratic People's Republic of Korea" 'Georgia' 'India' 'Indonesia'
 'Iran (Islamic Republic of)' 'Iraq' 'Israel' 'Japan' 'Jordan'
 'Kazakhstan' 'Kuwait' 'Kyrgyzstan' "Lao People's Democratic Republic"
 'Lebanon' 'Malaysia' 'Maldives' 'Mongolia' 'Myanmar' 'Nepal'
 'Occupied Palestinian Territory' 'Oman' 'Pakistan' 'Papua New Guinea'
 'Philippines' 'Qatar' 'Republic of Korea' 'Saudi Arabia' 'Singapore'
 'Sri Lanka' 'Syrian Arab Republic' 'Tajikistan' 'Thailand' 'Timor-Leste'
 'Turkey' 'Turkmenistan' 'United Arab Emirates' 'Uzbekistan' 'Viet Nam'
 'Yemen' 'Belize' 'Costa Rica' 'El Salvador' 'Guatemala' 'Honduras'
 'Nicaragua' 'Panama' 'Cuba' 'Dominican Republic' 'Haiti' 'Jamaica'
 'Antigua and Barbuda' 'Bahamas' 'Barbados' 'Dominica' 'Grenada'
 'Saint Kitts and Nevis' 'Saint Lucia' 'Saint Vincent and the Grenadines'
 'Trinidad and Tobago' 'Canada' 'United States of America' 'Mexico'
 'Guyana' 'Suriname' 'Bolivia (Plurinational State of)' 'Colombia'
 'Ecuador' 'Peru' 'Venezuela (Bolivarian Republic of)' 'Brazil'
 'Argentina' 'Chile' 'Paraguay' 'Uruguay' 'Algeria' 'Angola' 'Benin'
 'Botswana' 'Burkina Faso' 'Burundi' 'Cabo Verde' 'Cameroon'
 'Central African Republic' 'Chad' 'Comoros' 'Congo' "Côte d'Ivoire"
 'Democratic Republic of the Congo' 'Djibouti' 'Egypt' 'Equatorial Guinea'
 'Eritrea' 'Ethiopia' 'Gabon' 'Gambia' 'Ghana' 'Guinea' 'Guinea-Bissau'
 'Kenya' 'Lesotho' 'Liberia' 'Libya' 'Madagascar' 'Malawi' 'Mali'
 'Mauritania' 'Mauritius' 'Morocco' 'Mozambique' 'Namibia' 'Niger'
 'Nigeria' 'Rwanda' 'Sao Tome and Principe' 'Senegal' 'Seychelles'
 'Sierra Leone' 'Somalia' 'South Africa' 'South Sudan' 'Sudan' 'Swaziland'
 'Togo' 'Tunisia' 'Uganda' 'United Republic of Tanzania' 'Zambia'
 'Zimbabwe' 'Albania' 'Andorra' 'Austria' 'Belarus' 'Belgium'
 'Bosnia and Herzegovina' 'Bulgaria' 'Croatia' 'Cyprus' 'Czechia'
 'Denmark' 'Estonia' 'Faroe Islands' 'Finland' 'France' 'Germany' 'Greece'
 'Holy See' 'Hungary' 'Iceland' 'Ireland' 'Italy' 'Latvia' 'Liechtenstein'
 'Lithuania' 'Luxembourg' 'Malta' 'Monaco' 'Montenegro' 'Netherlands'
 'Norway' 'Poland' 'Portugal' 'Republic of Moldova' 'Romania'
 'Russian Federation' 'San Marino' 'Serbia' 'Slovakia' 'Slovenia' 'Spain'
 'Sweden' 'Switzerland' 'The former Yugoslav Republic of Macedonia'
 'Ukraine' 'United Kingdom' 'Australia' 'Cook Islands' 'Fiji' 'Kiribati'
 'Marshall Islands' 'Micronesia (Federated States of)' 'Nauru'
 'New Zealand' 'Niue' 'Palau' 'Samoa' 'Solomon Islands' 'Tokelau' 'Tonga'
 'Tuvalu' 'Vanuatu']

Time periods:
['1958-1962' '1963-1967' '1968-1972' '1973-1977' '1978-1982' '1983-1987'
 '1988-1992' '1993-1997' '1998-2002' '2003-2007' '2008-2012' '2013-2017']

Missing values for total_area:
220
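
The single count for total_area generalizes to every variable at once. The following is a minimal sketch, assuming the same long format (`variable` and `value` columns) and using a small made-up frame named `toy` in place of the AQUASTAT file:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the long-format AQUASTAT table (values are made up).
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B', 'A', 'B'],
    'variable': ['total_area', 'total_area', 'total_area', 'total_area', 'gdp', 'gdp'],
    'time_period': ['1958-1962', '1963-1967', '1958-1962', '1963-1967',
                    '1958-1962', '1958-1962'],
    'value': [10.0, np.nan, 5.0, 5.0, np.nan, np.nan],
})

# Count missing values and the missing share for every variable in one pass.
missing = (
    toy.assign(is_missing=toy['value'].isnull())
       .groupby('variable')['is_missing']
       .agg(n_missing='sum', share_missing='mean')
       .sort_values('share_missing', ascending=False)
)
print(missing)
```

Sorting by the missing share puts the sparsest variables at the top, which is usually where the first data quality questions come from.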

3.2 Preliminary time-series analysis

Code:

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# Load the data
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')

# Countries in the dataset
countries = data.country.unique()

# Time-period labels in the dataset
time_periods = data.time_period.unique()

# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]

    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period

    return df

print('\n' + "All countries in one time period:")
print(time_slice(data, time_periods[0]).head())

# Time series: one country over time
def country_slice(df, country):
    # Only take data for country of interest
    df = df[df.country == country]

    # Pivot table
    df = df.pivot(index='variable', columns='time_period', values='value')

    df.index.name = country
    return df

print('\n' + "Time series: one country over time")
print(country_slice(data, countries[40]).head())

# Panel data: all countries over time (the form the data come in)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]

    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df

print('\n' + "Panel data: all countries over time (the form the data come in)")
print(variable_slice(data, 'total_pop').head())

# Time series for one country and one variable
def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country == country) & (df.variable == variable)]

    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']]

    # Change years to int and set as index
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    return series

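print('\n' + "Time series for one country and one variable:")
print(time_series(data, 'Belarus', 'total_pop'))


# Geospatial: all countries, connected geographically
# Reducing the number of regions makes patterns easier to assess
# Build a dictionary that maps to newer, simpler regions (Asia, North America, South America, Africa, Europe, Oceania)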
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America',
    'World | Africa':'Africa',
    'World | Europe':'Europe',
    'World | Oceania':'Oceania'
}

data.region = data.region.apply(lambda x: simple_regions[x])

print('\n' + "Distinct regions:")
print(data.region.unique())

Output:

All countries in one time period:
1958-1962    accounted_flow  ...  water_total_external_renewable
country                      ...                                
Afghanistan           19.00  ...                           18.18
Albania                3.30  ...                            3.30
Algeria                0.39  ...                            0.42
Andorra                 NaN  ...                             NaN
Angola                 0.40  ...                            0.40

[5 rows x 60 columns]

Time series: one country over time
time_period                   1958-1962  1963-1967  ...  2008-2012  2013-2017
Thailand                                            ...                      
accounted_flow                    214.1     214.10  ...     214.10      214.1
accounted_flow_border_rivers      214.1     214.10  ...     214.10      214.1
agg_to_gdp                         34.0      29.24  ...      11.57       10.5
arable_land                     10600.0   11600.00  ...   16560.00    16810.0
avg_annual_rain_depth            1622.0    1622.00  ...    1622.00     1622.0

[5 rows x 12 columns]

Panel data: all countries over time (the form the data come in)
time_period  1958-1962  1963-1967  1968-1972  ...  2003-2007  2008-2012  2013-2017
country                                       ...                                 
Afghanistan    9344.00   10369.00   11717.00  ...   25878.00   29727.00   32527.00
Albania        1738.00    1999.00    2254.00  ...    3011.00    2881.00    2897.00
Algeria       11690.00   13354.00   15377.00  ...   34262.00   37439.00   39667.00
Andorra          15.38      20.75      26.89  ...      84.88      79.32      70.47
Angola         5466.00    5963.00    6588.00  ...   19184.00   22686.00   25022.00

[5 rows x 12 columns]

Time series for one country and one variable:
               total_pop
year_measured           
1992             10235.0
1997             10091.0
2002              9826.0
2007              9556.0
2012              9491.0
2015              9496.0

Distinct regions:
['Asia' 'North America' 'South America' 'Africa' 'Europe' 'Oceania']
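
The dictionary lookup inside `apply` raises a `KeyError` as soon as a region label is missing from `simple_regions`. As a hedged alternative, the sketch below uses `Series.map`, which yields NaN for unknown labels so they can be listed before the mapping is extended. The shortened dictionary and the `'World | Antarctica'` label are made up for illustration:

```python
import pandas as pd

# Shortened version of the region dictionary, for illustration only.
simple_regions = {
    'World | Asia': 'Asia',
    'World | Africa': 'Africa',
    'World | Europe': 'Europe',
}

# Hypothetical region column; 'World | Antarctica' is deliberately not in the dictionary.
region = pd.Series(['World | Asia', 'World | Europe', 'World | Antarctica'])

mapped = region.map(simple_regions)           # unknown labels become NaN
unmapped = region[mapped.isnull()].unique()   # labels that still need a mapping
print(mapped.tolist())
print(unmapped)
```

Checking `unmapped` before overwriting `data.region` makes a change in the source labels visible instead of failing midway through the script.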

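4. Data quality assessment and profiling

Before trying to understand what information is in the data, make sure you understand what the data represent and what is missing.

What we need to do:

  1. Categorical: count, count distinct, assess unique values
  2. Numerical: count, min, max
  3. Spot-check random samples and samples you are familiar with
  4. Slice and dice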
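
The first two checks can be sketched with plain pandas. This is an illustrative helper only: the name `quick_profile` and the toy frame are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def quick_profile(df):
    """Return count, distinct count, and min/max (numeric columns only) per column."""
    rows = {}
    for col in df.columns:
        s = df[col]
        numeric = pd.api.types.is_numeric_dtype(s)
        rows[col] = {
            'count': s.count(),            # non-null values
            'n_distinct': s.nunique(),     # distinct non-null values
            'min': s.min() if numeric else None,
            'max': s.max() if numeric else None,
        }
    return pd.DataFrame(rows).T

# Made-up data with one categorical and one numerical column.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'C'],
    'value': [1.0, np.nan, 3.0, 3.0],
})
profile = quick_profile(toy)
print(profile)
```

A gap between the row count of the frame and `count` shows missing values, and an unexpectedly low or high `n_distinct` is a cue to inspect the unique values directly.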

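Main questions:

  1. What isn't there?
  2. Are the data that are there right?
  3. Were the data generated the way you think they were?

Useful Python libraries:

  1. missingno
  2. pivottablejs
  3. pandas_profiling

Example backlog:

  1. Assess the prevalence of missing data across all fields, assess whether it is missing at random or systematically, and identify patterns when data are missing
  2. Identify any default values that stand in for missing data in a given field
  3. Determine the sampling strategy for quality assessment and initial EDA
  4. For datetime types, ensure consistent format and granularity, and sanity-check all dates in the data
  5. Where multiple fields capture the same or similar information, understand the relationships between them and assess which field is most effective to use
  6. Assess the data type of each field
  7. For discrete value types, ensure the data formats are consistent
  8. For discrete value types, assess the number of distinct values and the percentage of unique values, and sanity-check the kinds of answers
  9. For continuous data types, assess descriptive statistics and sanity-check the values
  10. Understand the relationships between timestamps and assess which to use in the analysis
  11. Slice the data by device type, operating system and software version, and ensure the data are consistent across slices
  12. For device or application data, identify version release dates and assess any changes in data format or values around those dates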

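4.1 Missing data

  1. Is there a systematic reason for the missing data?
  2. Are there fields that are always missing at the same time?
  3. Is there information in what data are missing?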
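
The second question can be checked numerically by correlating the missing-value indicators of the fields. The sketch below assumes a wide table with one column per variable; the frame `wide` and its values are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per country, one column per variable.
wide = pd.DataFrame({
    'gdp':            [1.0, np.nan, 3.0, np.nan, 5.0],
    'gdp_per_capita': [2.0, np.nan, 1.0, np.nan, 4.0],
    'total_pop':      [9.0, 8.0, np.nan, 7.0, 6.0],
})

nullity = wide.isnull().astype(int)   # 1 where a value is missing
co_missing = nullity.corr()           # +1 means two fields are always missing together
print(co_missing.round(2))
```

A correlation near +1 suggests two fields share a cause of missingness, for example because one is derived from the other. Columns that are never (or always) missing have no variance and show up as NaN in this matrix. The `missingno` library offers a `heatmap` function that plots this kind of nullity correlation.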

4.1.1 Variables for all countries in one time period

Code:

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling

# system packages
import os, sys
sys.path.append('../../scripts/')

# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]

    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period

    return df

# Load the data
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')

# Countries in the dataset
countries = data.country.unique()

# Time-period labels in the dataset
time_periods = data.time_period.unique()

# Reducing the number of regions makes patterns easier to assess
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America',
    'World | Africa':'Africa',
    'World | Europe':'Europe',
    'World | Oceania':'Oceania'
}

data.region = data.region.apply(lambda x: simple_regions[x])


recent = time_slice(data, '2013-2017')
msno.matrix(recent, labels=True)

plt.show()

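Output:

(Figure: missingno nullity matrix of all variables by country, 2013-2017)

Preliminary conclusions:
Discussion: what additional information does this provide, or what additional questions does it raise?

Digging deeper: the exploitable variables
Most countries have no data for the "exploitable" variables.

Question to consider: does this happen in every time period?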
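
One way to answer this without a plot is to compute, per time period, the share of countries that report a value. A minimal sketch on made-up long-format rows (the frame `toy` is hypothetical, the column names follow the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical long-format rows for one variable (values are made up).
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B', 'C'],
    'variable': ['exploitable_total'] * 6,
    'time_period': ['2008-2012'] * 3 + ['2013-2017'] * 3,
    'value': [1.0, 2.0, np.nan, np.nan, np.nan, 3.0],
})

# Share of countries with a reported value in each time period.
reported = (
    toy[toy['variable'].str.contains('exploitable')]
    .groupby('time_period')['value']
    .apply(lambda s: s.notnull().mean())
)
print(reported)
```

If the share stays low in every period, the variable is sparse throughout rather than only in recent years.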

4.1.2 Total exploitable water resources

Code:

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling

# system packages
import os, sys
sys.path.append('../../scripts/')

# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]

    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period

    return df

# Panel data: all countries over time (the form the data come in)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]

    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df

# Load the data
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')

# Countries in the dataset
countries = data.country.unique()

# Time-period labels in the dataset
time_periods = data.time_period.unique()

# Reducing the number of regions makes patterns easier to assess
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America',
    'World | Africa':'Africa',
    'World | Europe':'Europe',
    'World | Oceania':'Oceania'
}

data.region = data.region.apply(lambda x: simple_regions[x])

# Total exploitable water resources
msno.matrix(variable_slice(data, 'exploitable_total'), sort='descending');
plt.xlabel('Time period');
plt.ylabel('Country');
plt.title('Missing total exploitable water resources data across countries and time periods \n \n \n \n');

plt.show()

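Output:

(Figure: missingno matrix of exploitable_total by country and time period)

Preliminary conclusions:
Only a small fraction of countries report total exploitable water resources, and very few of those have data for the most recent time periods.

We will drop this variable, because so few data points would cause many problems.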

4.1.3 National Rainfall Index (NRI) (mm/yr)

Code:

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling

# system packages
import os, sys
sys.path.append('../../scripts/')

# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]

    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period

    return df

# Panel data: all countries over time (the form the data come in)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]

    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df

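# Load the data
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')

# Drop the rows for the exploitable variables
# (left commented out in this section)
#data = data.loc[~data.variable.str.contains('exploitable'),:]

# Drop the rows for national_rainfall_index
# (left commented out here because the plot below still needs this variable)
#data = data.loc[~(data.variable=='national_rainfall_index')]

# Countries in the dataset
countries = data.country.unique()

# Time-period labels in the dataset
time_periods = data.time_period.unique()

# Reducing the number of regions makes patterns easier to assess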
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America',
    'World | Africa':'Africa',
    'World | Europe':'Europe',
    'World | Oceania':'Oceania'
}

data.region = data.region.apply(lambda x: simple_regions[x])

# national_rainfall_index: National Rainfall Index (NRI) (mm/yr)
msno.matrix(variable_slice(data, 'national_rainfall_index'),
            sort='descending');
plt.xlabel('Time period');
plt.ylabel('Country');
plt.title('Missing national rainfall index data across countries and time periods \n \n \n \n');

plt.show()

Output:

(Figure: missingno matrix of national_rainfall_index by country and time period)

4.1.4 By region

Code:

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling

# system packages
import os, sys
sys.path.append('../../scripts/')

# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]

    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period

    return df

# Panel data: all countries over time (the form the data come in)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]

    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df

# Helper that extracts a single region
def subregion(data, region):
    return data[data.region==region]

# Load the data
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')

# Drop the rows for the exploitable variables
data = data.loc[~data.variable.str.contains('exploitable'),:]

# Drop the rows for national_rainfall_index
data = data.loc[~(data.variable=='national_rainfall_index')]

# Countries in the dataset
countries = data.country.unique()

# Time-period labels in the dataset
time_periods = data.time_period.unique()

# Reducing the number of regions makes patterns easier to assess
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America',
    'World | Africa':'Africa',
    'World | Europe':'Europe',
    'World | Oceania':'Oceania'
}

data.region = data.region.apply(lambda x: simple_regions[x])

# Keep only the North American data
north_america = subregion(data, 'North America')

# Data completeness, with the variables sorted by nullity
msno.matrix(msno.nullity_sort(time_slice(north_america, '2013-2017'), sort='descending').T)

plt.show()

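Output:

(Figure: missingno matrix for North American countries, 2013-2017)

Conclusions:
Q: Is there any pattern among the countries with the most missing data?

Q: What are the potential reasons for the missing data? What can we check?

Spot-check which data are missing for the Bahamas to learn more (country_slice is the helper defined in section 3.2):

msno.nullity_filter(country_slice(data, 'Bahamas').T, filter='bottom', p=0.1)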

4.1.5 A single variable, geographically

Code:

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium

# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling

# system packages
import os, sys
sys.path.append('../../scripts/')

# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]

    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period

    return df

# Panel data: all countries over time (the form the data come in)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]

    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df

# Helper that extracts a single region
def subregion(data, region):
    return data[data.region==region]

# Load the data
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')

# Drop the rows for the exploitable variables
data = data.loc[~data.variable.str.contains('exploitable'),:]

# Drop the rows for national_rainfall_index
data = data.loc[~(data.variable=='national_rainfall_index')]

# Countries in the dataset
countries = data.country.unique()

# Time-period labels in the dataset
time_periods = data.time_period.unique()

# Reducing the number of regions makes patterns easier to assess
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America',
    'World | Africa':'Africa',
    'World | Europe':'Europe',
    'World | Oceania':'Oceania'
}

data.region = data.region.apply(lambda x: simple_regions[x])

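geo = r'E:/file/world.json'

recent = time_slice(data, '2013-2017')

# 1 where agg_to_gdp is reported, 0 where it is missing
null_data = recent['agg_to_gdp'].notnull()*1

# Named world_map rather than map, to avoid shadowing the built-in map()
world_map = folium.Map(location=[48, -102], zoom_start=2)
# Note: Map.choropleth is deprecated in newer folium releases,
# where folium.Choropleth(...).add_to(world_map) takes its place
world_map.choropleth(geo_data=geo,
                     data=null_data,
                     columns=['country', 'agg_to_gdp'],
                     key_on='feature.properties.name', reset=True,
                     fill_color='GnBu', fill_opacity=1, line_opacity=0.2,
                     legend_name='Missing agricultural contribution to GDP data 2013-2017')

world_map.save('map.html')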

Output:

(Figure: choropleth map of countries with and without agg_to_gdp data, 2013-2017)

4.1.6 Over time

Code:

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium

# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling

# system packages
import os, sys
sys.path.append('../../scripts/')

# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]

    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period

    return df

# Panel data: all countries over time (the form the data come in)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]

    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df

# Helper that extracts a single region
def subregion(data, region):
    return data[data.region==region]

# Load the data
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')

# Drop the rows for the exploitable variables
data = data.loc[~data.variable.str.contains('exploitable'),:]

# Drop the rows for national_rainfall_index
data = data.loc[~(data.variable=='national_rainfall_index')]

# Countries in the dataset
countries = data.country.unique()

# Time-period labels in the dataset
time_periods = data.time_period.unique()

# Reducing the number of regions makes patterns easier to assess
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America',
    'World | Africa':'Africa',
    'World | Europe':'Europe',
    'World | Oceania':'Oceania'
}

data.region = data.region.apply(lambda x: simple_regions[x])


fig, ax = plt.subplots(figsize=(16, 16));
sns.heatmap(data.groupby(['time_period','variable']).value.count().unstack().T , ax=ax);
plt.xticks(rotation=45);
plt.xlabel('Time period');
plt.ylabel('Variable');
plt.title('Number of countries with data reported for each variable over time');

plt.show()

Output:

(Figure: heatmap of the number of countries with data reported for each variable over time)

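4.2 Exploring population

Location: mean, median, mode, quartiles
Spread: standard deviation, variance, range, interquartile range
Shape: skewness, kurtosis

4.2.1 Location and spread of the data

Code: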

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium

# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling

# system packages
import os, sys
sys.path.append('../../scripts/')

# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]

    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period

    return df

# Panel data: all countries over time (the form the data come in)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]

    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df

# Helper that extracts a single region
def subregion(data, region):
    return data[data.region==region]

# Load the data
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')

# Drop the rows for the exploitable variables
data = data.loc[~data.variable.str.contains('exploitable'),:]

# Drop the rows for national_rainfall_index
data = data.loc[~(data.variable=='national_rainfall_index')]

# Countries in the dataset
countries = data.country.unique()

# Time-period labels in the dataset
time_periods = data.time_period.unique()

# Reducing the number of regions makes patterns easier to assess
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America',
    'World | Africa':'Africa',
    'World | Europe':'Europe',
    'World | Oceania':'Oceania'
}

data.region = data.region.apply(lambda x: simple_regions[x])

recent = time_slice(data, '2013-2017')


def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country == country) & (df.variable == variable)]

    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']]

    # Change years to int and set as index
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    return series

df1 = time_series(data, 'Qatar', 'total_pop').join(time_series(data, 'Qatar', 'urban_pop')).join(time_series(data, 'Qatar', 'rural_pop'))
print(df1)

Output:

               total_pop  urban_pop  rural_pop
year_measured                                 
1962               56.19      48.39       7.80
1967               86.16      75.48      10.68
1972              130.40     115.60      14.80
1977              182.40     162.40      20.00
1982              277.20     248.60      28.60
1987              423.30     385.40      37.90
1992              489.70     459.10      30.60
1997              528.20     506.50      21.70
2002              634.40     608.90      25.50
2007             1179.00    1130.00      49.00
2012             2016.00    2029.00     -13.00
2015             2235.00    2333.00     -98.00
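
The last two rows report a negative rural population, which cannot be real. In every row of this table `rural_pop` equals `total_pop - urban_pop`, so the negative values arise where `urban_pop` exceeds `total_pop`. The following is a minimal sketch of a check that flags such rows, using the last four rows of the Qatar table above:

```python
import pandas as pd

# Last four rows of the Qatar table above.
qatar = pd.DataFrame(
    {
        'total_pop': [634.40, 1179.00, 2016.00, 2235.00],
        'urban_pop': [608.90, 1130.00, 2029.00, 2333.00],
        'rural_pop': [25.50, 49.00, -13.00, -98.00],
    },
    index=pd.Index([2002, 2007, 2012, 2015], name='year_measured'),
)

# rural_pop should match total_pop - urban_pop (up to rounding) ...
derived = qatar['total_pop'] - qatar['urban_pop']
consistent = (derived - qatar['rural_pop']).abs() < 0.01

# ... and a population can never be negative.
impossible = qatar[qatar['rural_pop'] < 0]
print(consistent.all())
print(impossible)
```

Rows flagged this way are candidates for a closer look at how the total and urban figures were sourced, since the two were probably not measured consistently.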

4.2.2 Shape of the data

  1. Is the distribution skewed?
  2. Are there outliers? Are they plausible?
  3. Are there discontinuities?

Code:

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium
import scipy.stats  # provides scipy.stats.skew, used at the end of this script

# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling

# system packages
import os, sys
sys.path.append('../../scripts/')

# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]

    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period

    return df

# Panel data: all countries over time (the form the data come in)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]

    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df

# Helper that extracts a single region
def subregion(data, region):
    return data[data.region==region]

# Load the data
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')

# Drop the rows for the exploitable variables
data = data.loc[~data.variable.str.contains('exploitable'),:]

# Drop the rows for national_rainfall_index
data = data.loc[~(data.variable=='national_rainfall_index')]

# Countries in the dataset
countries = data.country.unique()

# Time-period labels in the dataset
time_periods = data.time_period.unique()

# Reducing the number of regions makes patterns easier to assess
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America',
    'World | Africa':'Africa',
    'World | Europe':'Europe',
    'World | Oceania':'Oceania'
}

data.region = data.region.apply(lambda x: simple_regions[x])

recent = time_slice(data, '2013-2017')


def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country == country) & (df.variable == variable)]

    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']]

    # Change years to int and set as index
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    return series

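import scipy.stats  # `import scipy` alone does not load scipy.stats in every SciPy version

# Summary statistics for the population variables
df1 = recent[['total_pop', 'urban_pop', 'rural_pop']].describe().astype(int)
print(df1)

# Skewness of each population variable
df2 = recent[['total_pop', 'urban_pop', 'rural_pop']].apply(scipy.stats.skew)
print(df2)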

Test output:

2013-2017  total_pop  urban_pop  rural_pop
count            199        199        199
mean           36890      19849      17040
std           140720      69681      77461
min                0          0        -98
25%             1368        822        500
50%             7595       3967       2404
75%            25088      11656      10677
max          1407306     805387     891112

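Preliminary conclusion:
Yes, the population figures look skewed. Let's compute the skewness and kurtosis and plot a histogram to confirm.

A normal distribution has a skewness of zero. Negative skewness means the distribution is skewed to the left; positive skewness means it is skewed to the right.

Excess kurtosis is also zero for a normal distribution; large positive values point to heavy tails. We definitely have some outliers!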
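The two shape statistics can be checked on synthetic data before trusting them on the population columns. The sketch below is only an illustration: the samples and the helper `shape_summary` are hypothetical and not part of the original script. It uses `scipy.stats.skew` and `scipy.stats.kurtosis`, which by default returns excess kurtosis (the Fisher definition, where a normal distribution gives 0).

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumptions for illustration, not the AQUASTAT data)
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=10_000)
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)   # long right tail
uniform_sample = rng.uniform(low=0.0, high=1.0, size=10_000)      # light tails


def shape_summary(values):
    """Return (skewness, excess kurtosis) of a 1-D sample."""
    return stats.skew(values), stats.kurtosis(values)


for name, sample in [('normal', normal_sample),
                     ('lognormal', skewed_sample),
                     ('uniform', uniform_sample)]:
    print('%-9s skew=%.2f, excess kurtosis=%.2f' % ((name,) + shape_summary(sample)))
```

The normal sample should give values near zero for both statistics, the log-normal sample large positive values, and the uniform sample a negative excess kurtosis, which shows that this statistic is not restricted to positive values.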

4.2.3 The trusty histogram

Code:

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium
import scipy

# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling

# system packages
import os, sys
sys.path.append('../../scripts/')

# Cross-section: all countries in a single time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]

    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period

    return df

# Panel data: one variable for all countries over time
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]

    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df

# Helper to extract the rows for a single region
def subregion(data, region):
    return data[data.region==region]

# Load the data source
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')

# Drop rows whose variable name contains 'exploitable'
data = data.loc[~data.variable.str.contains('exploitable'),:]

# Drop rows for the national_rainfall_index variable
data = data.loc[~(data.variable=='national_rainfall_index')]

# Unique countries
countries = data.country.unique()

# Unique time-period labels in the data
time_periods = data.time_period.unique()

# Collapse the regions into fewer groups to make patterns easier to see
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America',
    'World | Africa':'Africa',
    'World | Europe':'Europe',
    'World | Oceania':'Oceania'
}

data.region = data.region.apply(lambda x: simple_regions[x])

recent = time_slice(data, '2013-2017')


def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country == country) & (df.variable == variable)]

    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']]

    # Change years to int and set as index
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    return series

fig, ax = plt.subplots(figsize=(12, 8))
ax.hist(recent.total_pop.values, bins=50);
ax.set_xlabel('Total population');
ax.set_ylabel('Number of countries');
ax.set_title('Distribution of population of countries 2013-2017');

plt.show()

Test output:

[Figure: histogram of the total population of countries, 2013-2017]

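4.2.4 Log transformation

A log transformation is a common way of transforming data. The purpose of transforming data is to bring it closer to the assumptions we want to rely on, so that statistical inference works better.

Consider raw electricity production data as an example: as time passes, the variance of production keeps growing, i.e. the series becomes less and less stable. In that situation the usual analysis assumptions (errors that are independent and identically normally distributed, a stationary time series) are often not met.

In theory, this kind of problem is abstracted as a model in which the standard deviation of the distribution is linearly related to its mean. When the standard deviation is proportional to the mean, taking the logarithm makes the spread roughly constant across levels.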
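A small simulation can make the variance-stabilizing effect concrete. This is a minimal sketch under stated assumptions: the three levels and the multiplicative log-normal noise are invented for illustration and have nothing to do with the AQUASTAT or electricity data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical series: the level grows and the noise is multiplicative,
# so the standard deviation is proportional to the mean
levels = [10.0, 100.0, 1000.0]
samples = {level: level * rng.lognormal(mean=0.0, sigma=0.2, size=5_000)
           for level in levels}

# Spread of the raw values versus spread of their logarithms
raw_std = {level: x.std() for level, x in samples.items()}
log_std = {level: np.log(x).std() for level, x in samples.items()}

for level in levels:
    print('level=%6.0f  std(raw)=%8.2f  std(log)=%.3f'
          % (level, raw_std[level], log_std[level]))
```

On the raw scale the standard deviation grows with the level, while on the log scale it stays close to the noise parameter (0.2 here) at every level, which is why the log transform is a natural choice when spread scales with magnitude.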

Code:

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium
import scipy

# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling

# system packages
import os, sys
sys.path.append('../../scripts/')

# Cross-section: all countries in a single time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]

    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period

    return df

# Panel data: one variable for all countries over time
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]

    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df

# Helper to extract the rows for a single region
def subregion(data, region):
    return data[data.region==region]

# Load the data source
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')

# Drop rows whose variable name contains 'exploitable'
data = data.loc[~data.variable.str.contains('exploitable'),:]

# Drop rows for the national_rainfall_index variable
data = data.loc[~(data.variable=='national_rainfall_index')]

# Unique countries
countries = data.country.unique()

# Unique time-period labels in the data
time_periods = data.time_period.unique()

# Collapse the regions into fewer groups to make patterns easier to see
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America',
    'World | Africa':'Africa',
    'World | Europe':'Europe',
    'World | Oceania':'Oceania'
}

data.region = data.region.apply(lambda x: simple_regions[x])

recent = time_slice(data, '2013-2017')


def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country == country) & (df.variable == variable)]

    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']]

    # Change years to int and set as index
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    return series


def plot_hist(df, variable, bins=20, xlabel=None, by=None,
              ylabel=None, title=None, logx=False, ax=None):
    # `by` is kept in the signature but is not used by this function
    if ax is None:
        fig, ax = plt.subplots(figsize=(12, 8))

    # Work on a local copy so the caller's DataFrame is not modified
    values = df[variable].dropna()
    if logx:
        if values.min() <= 0:
            # Shift the data so the minimum becomes 1 before taking logs
            shift = 1 - values.min()
            values = values + shift
            print('Warning: data <=0 exists, data transformed by %0.2g before plotting' % shift)

        # Log-spaced bin edges between the smallest and largest value
        bins = np.logspace(np.log10(values.min()),
                           np.log10(values.max()), bins)
        ax.set_xscale("log")

    ax.hist(values.values, bins=bins)

    if xlabel:
        ax.set_xlabel(xlabel)
    if ylabel:
        ax.set_ylabel(ylabel)
    if title:
        ax.set_title(title)

    return ax

plot_hist(recent, 'total_pop', bins=25, logx=True, 
          xlabel='Log of total population', ylabel='Number of countries',
          title='Distribution of total population of countries 2013-2017');

plt.show()

Test output:

[Figure: histogram of the total population of countries on a log-scaled x-axis, 2013-2017]

4.3 Change over time

Code:

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium
import scipy

# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling

# system packages
import os, sys
sys.path.append('../../scripts/')

# Cross-section: all countries in a single time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]

    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period

    return df

# Panel data: one variable for all countries over time
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]

    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df

# Helper to extract the rows for a single region
def subregion(data, region):
    return data[data.region==region]

# Load the data source
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')

# Drop rows whose variable name contains 'exploitable'
data = data.loc[~data.variable.str.contains('exploitable'),:]

# Drop rows for the national_rainfall_index variable
data = data.loc[~(data.variable=='national_rainfall_index')]

# Unique countries
countries = data.country.unique()

# Unique time-period labels in the data
time_periods = data.time_period.unique()

# Collapse the regions into fewer groups to make patterns easier to see
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America',
    'World | Africa':'Africa',
    'World | Europe':'Europe',
    'World | Oceania':'Oceania'
}

data.region = data.region.apply(lambda x: simple_regions[x])

recent = time_slice(data, '2013-2017')


def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country == country) & (df.variable == variable)]

    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']]

    # Change years to int and set as index
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    return series


recent['population_density'] = recent.total_pop.divide(recent.total_area)



with sns.color_palette(sns.diverging_palette(220, 280, s=85, l=25, n=23)):
    # Order the countries by their total population in 1958-1962
    north_america = time_slice(subregion(data, 'North America'), '1958-1962').sort_values('total_pop').index.tolist()
    for country in north_america:
        ts = time_series(data, country, 'total_pop')
        # Index each series so that the country's minimum population equals 100
        ts['norm_pop'] = ts.total_pop/ts.total_pop.min()*100
        plt.plot(ts['norm_pop'], label=country)
    # Axis labels, title and legend only need to be set once
    plt.xlabel('Year')
    plt.ylabel('Percent increase in population')
    plt.title('Percent increase in population from 1960 in North American countries')
    plt.legend(loc=2,prop={'size':10})


plt.show()

Test output:

[Figure: population of North American countries over time, indexed to each country's minimum]

4.4 Exploring total renewable water resources

Code:

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium
import scipy

# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling

# system packages
import os, sys
sys.path.append('../../scripts/')

# Cross-section: all countries in a single time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]

    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period

    return df

# Panel data: one variable for all countries over time
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]

    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df

# Helper to extract the rows for a single region
def subregion(data, region):
    return data[data.region==region]

# Load the data source
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')

# Drop rows whose variable name contains 'exploitable'
data = data.loc[~data.variable.str.contains('exploitable'),:]

# Drop rows for the national_rainfall_index variable
data = data.loc[~(data.variable=='national_rainfall_index')]

# Unique countries
countries = data.country.unique()

# Unique time-period labels in the data
time_periods = data.time_period.unique()

# Collapse the regions into fewer groups to make patterns easier to see
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America',
    'World | Africa':'Africa',
    'World | Europe':'Europe',
    'World | Oceania':'Oceania'
}

data.region = data.region.apply(lambda x: simple_regions[x])

recent = time_slice(data, '2013-2017')


def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country == country) & (df.variable == variable)]

    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']]

    # Change years to int and set as index
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    return series


recent['population_density'] = recent.total_pop.divide(recent.total_area)


def plot_hist(df, variable, bins=20, xlabel=None, by=None,
              ylabel=None, title=None, logx=False, ax=None):
    # `by` is kept in the signature but is not used by this function
    if ax is None:
        fig, ax = plt.subplots(figsize=(12, 8))

    # Work on a local copy so the caller's DataFrame is not modified
    values = df[variable].dropna()
    if logx:
        if values.min() <= 0:
            # Shift the data so the minimum becomes 1 before taking logs
            shift = 1 - values.min()
            values = values + shift
            print('Warning: data <=0 exists, data transformed by %0.2g before plotting' % shift)

        # Log-spaced bin edges between the smallest and largest value
        bins = np.logspace(np.log10(values.min()),
                           np.log10(values.max()), bins)
        ax.set_xscale("log")

    ax.hist(values.values, bins=bins)

    if xlabel:
        ax.set_xlabel(xlabel)
    if ylabel:
        ax.set_ylabel(ylabel)
    if title:
        ax.set_title(title)

    return ax

plot_hist(recent, 'total_renewable', bins=50,
          xlabel='Total renewable water resources ($10^9 m^3/yr$)',
          ylabel='Number of countries',
          title='Distribution of total renewable water resources, 2013-2017');


plt.show()

Test output:

[Figure: histogram of total renewable water resources, 2013-2017]

4.5 Assessing the relationship between each variable and the target

4.5.1 Target: GDP per capita

Code:

from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium
import scipy

# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling

# system packages
import os, sys
sys.path.append('../../scripts/')

data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')

# Cross-section: all countries in a single time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]

    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period

    return df



# Simplify regions: fewer groups make patterns easier to see
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America',
    'World | Africa':'Africa',
    'World | Europe':'Europe',
    'World | Oceania':'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])

# remove exploitable fields and national rainfall index
data = data.loc[~data.variable.str.contains('exploitable'),:]
data = data.loc[~(data.variable=='national_rainfall_index')]

# Subset for cross-sectional analysis
recent = time_slice(data, '2013-2017')

plt.scatter(recent.seasonal_variability, recent.gdp_per_capita)
plt.xlabel('Seasonal variability');
plt.ylabel('GDP per capita ($USD/person)');

plt.show()

Test output:

[Figure: scatter plot of GDP per capita against seasonal variability]

4.5.2 Relationship between each feature and GDP per capita

Code:

from matplotlib import pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np
import pandas as pd
import folium
import scipy

# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling

# system packages
import os, sys
sys.path.append('../../scripts/')

data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')

# Cross-section: all countries in a single time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]

    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period

    return df



# Simplify regions: fewer groups make patterns easier to see
simple_regions ={
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America',
    'World | Africa':'Africa',
    'World | Europe':'Europe',
    'World | Oceania':'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])

# remove exploitable fields and national rainfall index
data = data.loc[~data.variable.str.contains('exploitable'),:]
data = data.loc[~(data.variable=='national_rainfall_index')]

# Subset for cross-sectional analysis
recent = time_slice(data, '2013-2017')

recent_corr = recent.corr().loc['gdp_per_capita'].drop(['gdp','gdp_per_capita'])

def conditional_bar(series, bar_colors=None, color_labels=None, figsize=(13,24),
                   xlabel=None, by=None, ylabel=None, title=None):
    fig, ax  = plt.subplots(figsize=figsize)
    if not bar_colors:
        bar_colors = mpl.rcParams['axes.prop_cycle'].by_key()['color'][0]
    plt.barh(range(len(series)),series.values, color=bar_colors)
    plt.xlabel('' if not xlabel else xlabel);
    plt.ylabel('' if not ylabel else ylabel)
    plt.yticks(range(len(series)), series.index.tolist())
    plt.title('' if not title else title);
    plt.ylim([-1,len(series)]);
    if color_labels:
        for col, lab in color_labels.items():
            plt.plot([], linestyle='',marker='s',c=col, label= lab);
        lines, labels = ax.get_legend_handles_labels();
        ax.legend(lines[-len(color_labels.keys()):], labels[-len(color_labels.keys()):], loc='upper right');
    #plt.close()
    return fig

bar_colors = ['#0055A7' if x else '#2C3E4F' for x in list(recent_corr.values < 0)]
color_labels = {'#0055A7':'Negative correlation', '#2C3E4F':'Positive correlation'}

conditional_bar(recent_corr.apply(np.abs), bar_colors, color_labels,
               title='Magnitude of correlation with GDP per capita, 2013-2017',
               xlabel='|Correlation|')

plt.show()

Test output:

[Figure: bar chart of the magnitude of each variable's correlation with GDP per capita, 2013-2017]

Reference:

  1. https://study.163.com/course/introduction.htm?courseId=1003590004#/courseDetail?tab=1
