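1. Introduction to the data source
http://www.fao.org/nr/water/aquastat/data/query/index.html
The Food and Agriculture Organization of the United Nations (FAO) has three main goals:
- Eradicating hunger, food insecurity, and malnutrition
- Eliminating poverty and driving economic and social progress
- Sustainably managing and using natural resources, including land, water, air, climate, and genetic resources, for the benefit of present and future generations.
To support these goals, Article 1 of its Constitution requires FAO to "collect, analyse, interpret and disseminate information relating to nutrition, food and agriculture". AQUASTAT was started for this reason: it contributes to FAO's goals by collecting, analysing, and disseminating information on water resources, water use, and agricultural water management, with an emphasis on countries in Africa, Asia, Latin America, and the Caribbean.
Through AQUASTAT, FAO provides data, metadata, reports, country profiles, river basin profiles, regional analyses, maps, tables, spatial data, guidelines, and other online tools on:
- Water resources: internal, transboundary, total
- Water uses: by sector, by source, wastewater
- Irrigation: location, area, typology, technology, crops
- Dams: location, height, capacity, surface area
- Institutions, policies, and legislation related to water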
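Data overview:
#total_area Total area of the country (1000 ha)
#arable_land Arable land area (1000 ha)
#permanent_crop_area Permanent crops area (1000 ha)
#cultivated_area Cultivated area (arable land + permanent crops)
#percent_cultivated % of total country area cultivated
#total_pop Total population
#rural_pop Rural population
#urban_pop Urban population
#gdp Gross domestic product (GDP)
#gdp_per_capita GDP per capita
#agg_to_gdp Agriculture, value added to GDP
#human_dev_index Human Development Index
#gender_inequal_index Gender Inequality Index
#percent_undernourished Prevalence of undernourishment
#avg_annual_rain_depth Long-term average annual precipitation in depth
#national_rainfall_index National Rainfall Index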
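2. Framing the question
Question:
Are water supply and water use related to GDP per capita?
Our plan:
Within the CRISP-DM process, exploratory data analysis (EDA) consists of the following major tasks. We present them linearly here because each task makes little sense without the ones before it. In practice, however, you will jump between steps constantly. You may want to run all the steps on a subset of the variables first. Or, more often, an observation raises a question you want to investigate, and you branch off to explore it before returning to the main path of exhaustive EDA.
- Form hypotheses and develop investigation themes to explore
- Wrangle the data
- Assess data quality
- Profile the data
- Explore each individual variable in the dataset
- Assess the relationship between each variable and the target
- Assess interactions between variables
- Explore the data across many dimensions
Throughout the analysis, you should:
- Keep a list of hypotheses and questions for further exploration.
- Record things to watch out for in future analyses.
- Show intermediate results to colleagues to get fresh perspectives, feedback, and domain knowledge. Don't do EDA in a bubble! Get feedback, especially from people removed from the problem and/or with relevant domain knowledge.
- Place visuals and results together. EDA relies on your natural pattern recognition, so putting visualizations and results close to each other maximizes what you will notice.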
3. Initial analysis
3.1 A first look at the data
Code:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
# Read the data source
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
print(data.head())
# Look up each variable's full name (one row per variable)
data_v = data[['variable','variable_full']].drop_duplicates()
print('\n' + "Variables and their full names:")
print(data_v.head())
# List the countries
countries = data.country.unique()
print('\n' + "Countries:")
print(countries)
# List the distinct time periods
time_periods = data.time_period.unique()
print('\n' + "Time periods:")
print(time_periods)
# Count the missing values for total_area
data_null1 = data[data.variable=='total_area'].value.isnull().sum()
print('\n' + "Missing values for total_area:")
print(data_null1)
Test output:
country region variable ... time_period year_measured value
0 Afghanistan World | Asia total_area ... 1958-1962 1962.0 65286.0
1 Afghanistan World | Asia total_area ... 1963-1967 1967.0 65286.0
2 Afghanistan World | Asia total_area ... 1968-1972 1972.0 65286.0
3 Afghanistan World | Asia total_area ... 1973-1977 1977.0 65286.0
4 Afghanistan World | Asia total_area ... 1978-1982 1982.0 65286.0
[5 rows x 7 columns]
Variables and their full names:
variable variable_full
0 total_area Total area of the country (1000 ha)
576 arable_land Arable land area (1000 ha)
1152 permanent_crop_area Permanent crops area (1000 ha)
1728 cultivated_area Cultivated area (arable land + permanent crops...
2304 percent_cultivated % of total country area cultivated (%)
Countries:
['Afghanistan' 'Armenia' 'Azerbaijan' 'Bahrain' 'Bangladesh' 'Bhutan'
'Brunei Darussalam' 'Cambodia' 'China'
"Democratic People's Republic of Korea" 'Georgia' 'India' 'Indonesia'
'Iran (Islamic Republic of)' 'Iraq' 'Israel' 'Japan' 'Jordan'
'Kazakhstan' 'Kuwait' 'Kyrgyzstan' "Lao People's Democratic Republic"
'Lebanon' 'Malaysia' 'Maldives' 'Mongolia' 'Myanmar' 'Nepal'
'Occupied Palestinian Territory' 'Oman' 'Pakistan' 'Papua New Guinea'
'Philippines' 'Qatar' 'Republic of Korea' 'Saudi Arabia' 'Singapore'
'Sri Lanka' 'Syrian Arab Republic' 'Tajikistan' 'Thailand' 'Timor-Leste'
'Turkey' 'Turkmenistan' 'United Arab Emirates' 'Uzbekistan' 'Viet Nam'
'Yemen' 'Belize' 'Costa Rica' 'El Salvador' 'Guatemala' 'Honduras'
'Nicaragua' 'Panama' 'Cuba' 'Dominican Republic' 'Haiti' 'Jamaica'
'Antigua and Barbuda' 'Bahamas' 'Barbados' 'Dominica' 'Grenada'
'Saint Kitts and Nevis' 'Saint Lucia' 'Saint Vincent and the Grenadines'
'Trinidad and Tobago' 'Canada' 'United States of America' 'Mexico'
'Guyana' 'Suriname' 'Bolivia (Plurinational State of)' 'Colombia'
'Ecuador' 'Peru' 'Venezuela (Bolivarian Republic of)' 'Brazil'
'Argentina' 'Chile' 'Paraguay' 'Uruguay' 'Algeria' 'Angola' 'Benin'
'Botswana' 'Burkina Faso' 'Burundi' 'Cabo Verde' 'Cameroon'
'Central African Republic' 'Chad' 'Comoros' 'Congo' "Côte d'Ivoire"
'Democratic Republic of the Congo' 'Djibouti' 'Egypt' 'Equatorial Guinea'
'Eritrea' 'Ethiopia' 'Gabon' 'Gambia' 'Ghana' 'Guinea' 'Guinea-Bissau'
'Kenya' 'Lesotho' 'Liberia' 'Libya' 'Madagascar' 'Malawi' 'Mali'
'Mauritania' 'Mauritius' 'Morocco' 'Mozambique' 'Namibia' 'Niger'
'Nigeria' 'Rwanda' 'Sao Tome and Principe' 'Senegal' 'Seychelles'
'Sierra Leone' 'Somalia' 'South Africa' 'South Sudan' 'Sudan' 'Swaziland'
'Togo' 'Tunisia' 'Uganda' 'United Republic of Tanzania' 'Zambia'
'Zimbabwe' 'Albania' 'Andorra' 'Austria' 'Belarus' 'Belgium'
'Bosnia and Herzegovina' 'Bulgaria' 'Croatia' 'Cyprus' 'Czechia'
'Denmark' 'Estonia' 'Faroe Islands' 'Finland' 'France' 'Germany' 'Greece'
'Holy See' 'Hungary' 'Iceland' 'Ireland' 'Italy' 'Latvia' 'Liechtenstein'
'Lithuania' 'Luxembourg' 'Malta' 'Monaco' 'Montenegro' 'Netherlands'
'Norway' 'Poland' 'Portugal' 'Republic of Moldova' 'Romania'
'Russian Federation' 'San Marino' 'Serbia' 'Slovakia' 'Slovenia' 'Spain'
'Sweden' 'Switzerland' 'The former Yugoslav Republic of Macedonia'
'Ukraine' 'United Kingdom' 'Australia' 'Cook Islands' 'Fiji' 'Kiribati'
'Marshall Islands' 'Micronesia (Federated States of)' 'Nauru'
'New Zealand' 'Niue' 'Palau' 'Samoa' 'Solomon Islands' 'Tokelau' 'Tonga'
'Tuvalu' 'Vanuatu']
Time periods:
['1958-1962' '1963-1967' '1968-1972' '1973-1977' '1978-1982' '1983-1987'
'1988-1992' '1993-1997' '1998-2002' '2003-2007' '2008-2012' '2013-2017']
Missing values for total_area:
220
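The check above counts missing values for total_area only. As a minimal sketch, the same count can be taken for every variable at once by grouping on the variable column. The toy frame below is made up for illustration; only the column names (country, variable, value) follow the AQUASTAT extract.

```python
import numpy as np
import pandas as pd

# Toy long-format frame with the same column names as the AQUASTAT extract
# (the values are invented for illustration).
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "variable": ["total_area", "gdp", "total_area", "gdp", "total_area", "gdp"],
    "value": [100.0, np.nan, np.nan, 5.0, 30.0, np.nan],
})

# Flag missing values as 0/1, then sum the flags within each variable.
missing_per_variable = (
    toy["value"].isnull().astype(int)
    .groupby(toy["variable"])
    .sum()
    .sort_values(ascending=False)
)
print(missing_per_variable)
```

Sorting in descending order puts the worst-covered variables first, which is usually where the follow-up questions are.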
3.2 Initial time-series analysis
Code:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
# Read the data source
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
# Get the countries
countries = data.country.unique()
# Get the time-period labels in the data
time_periods = data.time_period.unique()
# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]
    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
print('\n' + "Cross-section: all countries in one time period")
print(time_slice(data, time_periods[0]).head())
# Time series: one country over time
def country_slice(df, country):
    # Only take data for country of interest
    df = df[df.country == country]
    # Pivot table
    df = df.pivot(index='variable', columns='time_period', values='value')
    df.index.name = country
    return df
print('\n' + "Time series: one country over time")
print(country_slice(data, countries[40]).head())
# Panel data: all countries over time (as given in the data)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]
    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df
print('\n' + "Panel data: all countries over time (as given in the data)")
print(variable_slice(data, 'total_pop').head())
# Time series for one country and one variable
def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country == country) & (df.variable == variable)]
    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']]
    # Change years to int and set as index
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    return series
print('\n' + "Time series for country and variable:")
print(time_series(data, 'Belarus', 'total_pop'))
# Geospatial: all countries, related to each other geographically
# Fewer regions make patterns easier to assess
# Build a dictionary that maps each region to a simpler one (Asia, North America, South America, Africa, Europe, Oceania)
simple_regions ={
'World | Asia':'Asia',
'Americas | Central America and Caribbean | Central America': 'North America',
'Americas | Central America and Caribbean | Greater Antilles': 'North America',
'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
'Americas | Northern America | Northern America': 'North America',
'Americas | Northern America | Mexico': 'North America',
'Americas | Southern America | Guyana':'South America',
'Americas | Southern America | Andean':'South America',
'Americas | Southern America | Brazil':'South America',
'Americas | Southern America | Southern America':'South America',
'World | Africa':'Africa',
'World | Europe':'Europe',
'World | Oceania':'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
print('\n' + "Simplified regions:")
print(data.region.unique())
Test output:
Cross-section: all countries in one time period
1958-1962 accounted_flow ... water_total_external_renewable
country ...
Afghanistan 19.00 ... 18.18
Albania 3.30 ... 3.30
Algeria 0.39 ... 0.42
Andorra NaN ... NaN
Angola 0.40 ... 0.40
[5 rows x 60 columns]
Time series: one country over time
time_period 1958-1962 1963-1967 ... 2008-2012 2013-2017
Thailand ...
accounted_flow 214.1 214.10 ... 214.10 214.1
accounted_flow_border_rivers 214.1 214.10 ... 214.10 214.1
agg_to_gdp 34.0 29.24 ... 11.57 10.5
arable_land 10600.0 11600.00 ... 16560.00 16810.0
avg_annual_rain_depth 1622.0 1622.00 ... 1622.00 1622.0
[5 rows x 12 columns]
Panel data: all countries over time (as given in the data)
time_period 1958-1962 1963-1967 1968-1972 ... 2003-2007 2008-2012 2013-2017
country ...
Afghanistan 9344.00 10369.00 11717.00 ... 25878.00 29727.00 32527.00
Albania 1738.00 1999.00 2254.00 ... 3011.00 2881.00 2897.00
Algeria 11690.00 13354.00 15377.00 ... 34262.00 37439.00 39667.00
Andorra 15.38 20.75 26.89 ... 84.88 79.32 70.47
Angola 5466.00 5963.00 6588.00 ... 19184.00 22686.00 25022.00
[5 rows x 12 columns]
Time series for country and variable:
total_pop
year_measured
1992 10235.0
1997 10091.0
2002 9826.0
2007 9556.0
2012 9491.0
2015 9496.0
Simplified regions:
['Asia' 'North America' 'South America' 'Africa' 'Europe' 'Oceania']
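The region lookup above uses apply with a dictionary lookup, which raises a KeyError as soon as a region label is absent from simple_regions. A hedged alternative is Series.map, which returns NaN for unmapped labels so that they can be listed before any rows are lost. The label 'World | Antarctica' below is hypothetical and only serves to trigger that case.

```python
import pandas as pd

# Two entries borrowed from simple_regions; the third label below is
# hypothetical and deliberately left unmapped.
simple = {"World | Asia": "Asia", "World | Europe": "Europe"}
regions = pd.Series(["World | Asia", "World | Europe", "World | Antarctica"])

# Series.map with a dict leaves NaN where no key matches instead of raising.
mapped = regions.map(simple)

# Report the labels that still need an entry in the dictionary.
unmapped = regions[mapped.isnull()].unique()
print(mapped)
print(unmapped)
```

Whether a silent NaN or a loud KeyError is preferable depends on the workflow; the point of the sketch is only that the unmapped labels can be inspected explicitly.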
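4. Data quality assessment and profiling
Before trying to understand what information is in the data, make sure you understand what the data represents and what is missing.
What we need to do:
- Categorical: count, distinct count, assess unique values
- Numerical: count, min, max
- Spot-check random samples that you are familiar with
- Slice and dice
Main questions:
- What data isn't there?
- Is the data that is there right?
- Is the data being generated in the way you think?
Useful Python libraries: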
- missingno
- pivottablejs
- pandas_profiling
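Example backlog:
- Assess the prevalence of missing data across all data fields, assess whether it is missing at random or systematically, and identify patterns in when data is missing
- Identify any default values that stand in for missing data in a given field.
- Determine the sampling strategy for the quality assessment and initial EDA
- For datetime data types, ensure consistent formatting and granularity, and sanity-check all dates in the data.
- Where multiple fields capture the same or similar information, understand the relationship between them and assess which field is the most effective to use.
- Assess the data type of each field
- For discrete value types, ensure the data formats are consistent.
- For discrete value types, assess the number of distinct values and the percentage that are unique, and sanity-check the types of answers.
- For continuous data types, assess descriptive statistics and sanity-check the values.
- Understand the relationships between timestamps and assess which to use in the analysis
- Slice the data by device type, operating system, and software version, and ensure the data is consistent across slices
- For device or application data, identify version release dates and assess the data for any changes in format or value before and after those dates.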
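Several of the backlog items above (counts, distinct values, min and max per data type) can be sketched as one small helper. This is an illustration under stated assumptions: profile_column is a hypothetical helper name, and the toy frame is made up.

```python
import numpy as np
import pandas as pd

# Toy frame with one categorical and one numerical column (invented values).
toy = pd.DataFrame({
    "country": ["A", "A", "B", "C", None],
    "value": [1.0, 2.5, np.nan, 40.0, 3.0],
})

def profile_column(s):
    """Return basic quality metrics for one column (hypothetical helper)."""
    out = {"count": int(s.count()), "missing": int(s.isnull().sum())}
    if pd.api.types.is_numeric_dtype(s):
        # Numerical: count, min, max
        out["min"] = float(s.min())
        out["max"] = float(s.max())
    else:
        # Categorical: count, distinct count
        out["distinct"] = int(s.nunique())
    return out

report = {name: profile_column(toy[name]) for name in toy.columns}
for name, metrics in report.items():
    print(name, metrics)
```

For a fuller report, the pandas_profiling library listed above covers the same ground with far more detail.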
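4.1 Missing data
- Is there a systematic reason for the missing data?
- Are there fields that are always missing at the same time?
- Is there information in what data is missing?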
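The second question above, whether some fields are always missing together, can be checked numerically by correlating the 0/1 missingness indicators of the columns. The sketch below uses a made-up wide frame; the column names are borrowed from the data overview, but the values and the missingness pattern are invented. missingno's heatmap function plots a similar nullity correlation.

```python
import numpy as np
import pandas as pd

# Toy wide frame: gdp and gdp_per_capita are missing in exactly the same rows.
wide = pd.DataFrame({
    "gdp":            [1.0, np.nan, 3.0, np.nan, 5.0],
    "gdp_per_capita": [0.1, np.nan, 0.3, np.nan, 0.5],
    "total_pop":      [10.0, 20.0, np.nan, 40.0, 50.0],
})

# 1 where a value is missing, 0 where it is present.
nullity = wide.isnull().astype(int)

# Correlation of the indicators: values near 1 mean "missing together".
nullity_corr = nullity.corr()
print(nullity_corr.round(2))
```

A correlation of 1 between two columns means their gaps coincide exactly, which often points to a shared upstream source.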
4.1.1 Variables for all countries in one time period
Code:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling
# system packages
import os, sys
sys.path.append('../../scripts/')
# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]
    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
# Read the data source
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
# Get the countries
countries = data.country.unique()
# Get the time-period labels in the data
time_periods = data.time_period.unique()
# Fewer regions make patterns easier to assess
simple_regions ={
'World | Asia':'Asia',
'Americas | Central America and Caribbean | Central America': 'North America',
'Americas | Central America and Caribbean | Greater Antilles': 'North America',
'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
'Americas | Northern America | Northern America': 'North America',
'Americas | Northern America | Mexico': 'North America',
'Americas | Southern America | Guyana':'South America',
'Americas | Southern America | Andean':'South America',
'Americas | Southern America | Brazil':'South America',
'Americas | Southern America | Southern America':'South America',
'World | Africa':'Africa',
'World | Europe':'Europe',
'World | Oceania':'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
recent = time_slice(data, '2013-2017')
msno.matrix(recent, labels=True)
plt.show()
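Test output:
Preliminary conclusion:
Discussion: what additional information does this provide, and what further questions does it suggest?
Digging deeper: the exploitable variables
Most countries report no data for the "exploitable" variables.
Question to consider: does this happen in every time period?
4.1.2 Total exploitable water resources
Code: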
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling
# system packages
import os, sys
sys.path.append('../../scripts/')
# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]
    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
# Panel data: all countries over time (as given in the data)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]
    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df
# Read the data source
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
# Get the countries
countries = data.country.unique()
# Get the time-period labels in the data
time_periods = data.time_period.unique()
# Fewer regions make patterns easier to assess
simple_regions ={
'World | Asia':'Asia',
'Americas | Central America and Caribbean | Central America': 'North America',
'Americas | Central America and Caribbean | Greater Antilles': 'North America',
'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
'Americas | Northern America | Northern America': 'North America',
'Americas | Northern America | Mexico': 'North America',
'Americas | Southern America | Guyana':'South America',
'Americas | Southern America | Andean':'South America',
'Americas | Southern America | Brazil':'South America',
'Americas | Southern America | Southern America':'South America',
'World | Africa':'Africa',
'World | Europe':'Europe',
'World | Oceania':'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
# exploitable_total: total exploitable water resources
msno.matrix(variable_slice(data, 'exploitable_total'), sort='descending');
plt.xlabel('Time period');
plt.ylabel('Country');
plt.title('Missing total exploitable water resources data across countries and time periods \n \n \n \n');
plt.show()
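Test output:
Preliminary conclusion:
Only a small fraction of countries report total exploitable water resources, and only a handful of those have data for the most recent time period.
We will drop this variable, because so few data points would cause many problems.
4.1.3 National Rainfall Index (NRI) (mm/yr)
Code: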
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling
# system packages
import os, sys
sys.path.append('../../scripts/')
# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]
    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
# Panel data: all countries over time (as given in the data)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]
    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df
# Read the data source
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
# Drop rows whose variable name contains 'exploitable' (left commented out in this section)
#data = data.loc[~data.variable.str.contains('exploitable'),:]
# Drop the national_rainfall_index rows (left commented out: this section plots that variable)
#data = data.loc[~(data.variable=='national_rainfall_index')]
# Get the countries
countries = data.country.unique()
# Get the time-period labels in the data
time_periods = data.time_period.unique()
# Fewer regions make patterns easier to assess
simple_regions ={
'World | Asia':'Asia',
'Americas | Central America and Caribbean | Central America': 'North America',
'Americas | Central America and Caribbean | Greater Antilles': 'North America',
'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
'Americas | Northern America | Northern America': 'North America',
'Americas | Northern America | Mexico': 'North America',
'Americas | Southern America | Guyana':'South America',
'Americas | Southern America | Andean':'South America',
'Americas | Southern America | Brazil':'South America',
'Americas | Southern America | Southern America':'South America',
'World | Africa':'Africa',
'World | Europe':'Europe',
'World | Oceania':'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
# national_rainfall_index: National Rainfall Index (NRI) (mm/yr)
msno.matrix(variable_slice(data, 'national_rainfall_index'),
sort='descending');
plt.xlabel('Time period');
plt.ylabel('Country');
plt.title('Missing national rainfall index data across countries and time periods \n \n \n \n');
plt.show()
Test output:
4.1.4 By region
Code:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling
# system packages
import os, sys
sys.path.append('../../scripts/')
# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]
    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
# Panel data: all countries over time (as given in the data)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]
    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df
# Helper to extract a single region
def subregion(data, region):
    return data[data.region==region]
# Read the data source
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
# Drop rows whose variable name contains 'exploitable'
data = data.loc[~data.variable.str.contains('exploitable'),:]
# Drop the national_rainfall_index rows
data = data.loc[~(data.variable=='national_rainfall_index')]
# Get the countries
countries = data.country.unique()
# Get the time-period labels in the data
time_periods = data.time_period.unique()
# Fewer regions make patterns easier to assess
simple_regions ={
'World | Asia':'Asia',
'Americas | Central America and Caribbean | Central America': 'North America',
'Americas | Central America and Caribbean | Greater Antilles': 'North America',
'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
'Americas | Northern America | Northern America': 'North America',
'Americas | Northern America | Mexico': 'North America',
'Americas | Southern America | Guyana':'South America',
'Americas | Southern America | Andean':'South America',
'Americas | Southern America | Brazil':'South America',
'Americas | Southern America | Southern America':'South America',
'World | Africa':'Africa',
'World | Europe':'Europe',
'World | Oceania':'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
# Keep only the North America rows
north_america = subregion(data, 'North America')
# Data completeness per country, sorted from most to least complete
msno.matrix(msno.nullity_sort(time_slice(north_america, '2013-2017'), sort='descending').T)
plt.show()
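Test output:
Conclusion:
Q: Is there any pattern in which countries have the most missing data?
Q: What are the potential reasons for the missing data? What could we check?
Spot-check which data is missing for the Bahamas to learn more (country_slice is the helper defined in section 3.2):
msno.nullity_filter(country_slice(data, 'Bahamas').T, filter='bottom', p=0.1)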
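The same spot check can be sketched with plain pandas: rank countries by their share of missing variables, then list the variables missing for one country. The small wide frame below stands in for the output of time_slice; its values and its missingness pattern are invented and say nothing about the real Bahamas data.

```python
import numpy as np
import pandas as pd

# Stand-in for time_slice(north_america, '2013-2017'): countries x variables.
recent_toy = pd.DataFrame(
    {
        "gdp": [1.0, np.nan],
        "total_pop": [5.0, 6.0],
        "agg_to_gdp": [np.nan, np.nan],
    },
    index=pd.Index(["Belize", "Bahamas"], name="country"),
)

# Share of variables missing per country, worst first.
missing_share = recent_toy.isnull().mean(axis=1).sort_values(ascending=False)

# Names of the variables that are missing for one country.
row = recent_toy.loc["Bahamas"]
bahamas_missing = row[row.isnull()].index.tolist()

print(missing_share)
print(bahamas_missing)
```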
4.1.5 A single variable by region
Code:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium
# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling
# system packages
import os, sys
sys.path.append('../../scripts/')
# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]
    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
# Panel data: all countries over time (as given in the data)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]
    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df
# Helper to extract a single region
def subregion(data, region):
    return data[data.region==region]
# Read the data source
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
# Drop rows whose variable name contains 'exploitable'
data = data.loc[~data.variable.str.contains('exploitable'),:]
# Drop the national_rainfall_index rows
data = data.loc[~(data.variable=='national_rainfall_index')]
# Get the countries
countries = data.country.unique()
# Get the time-period labels in the data
time_periods = data.time_period.unique()
# Fewer regions make patterns easier to assess
simple_regions ={
'World | Asia':'Asia',
'Americas | Central America and Caribbean | Central America': 'North America',
'Americas | Central America and Caribbean | Greater Antilles': 'North America',
'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
'Americas | Northern America | Northern America': 'North America',
'Americas | Northern America | Mexico': 'North America',
'Americas | Southern America | Guyana':'South America',
'Americas | Southern America | Andean':'South America',
'Americas | Southern America | Brazil':'South America',
'Americas | Southern America | Southern America':'South America',
'World | Africa':'Africa',
'World | Europe':'Europe',
'World | Oceania':'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
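geo = r'E:/file/world.json'
recent = time_slice(data, '2013-2017')
# 1 where agg_to_gdp is reported for a country, 0 where it is missing
null_data = recent['agg_to_gdp'].notnull()*1
# Name the map m to avoid shadowing the built-in map()
m = folium.Map(location=[48, -102], zoom_start=2)
# Note: Map.choropleth is deprecated in newer folium releases in favour of
# the folium.Choropleth class; this call assumes an older folium version.
m.choropleth(geo_data=geo,
             data=null_data,
             columns=['country', 'agg_to_gdp'],
             key_on='feature.properties.name', reset=True,
             fill_color='GnBu', fill_opacity=1, line_opacity=0.2,
             legend_name='Missing agricultural contribution to GDP data 2013-2017')
m.save('map.html')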
Test output:
4.1.6 Over time
Code:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium
# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling
# system packages
import os, sys
sys.path.append('../../scripts/')
# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]
    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
# Panel data: all countries over time (as given in the data)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]
    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df
# Helper to extract a single region
def subregion(data, region):
    return data[data.region==region]
# Read the data source
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
# Drop rows whose variable name contains 'exploitable'
data = data.loc[~data.variable.str.contains('exploitable'),:]
# Drop the national_rainfall_index rows
data = data.loc[~(data.variable=='national_rainfall_index')]
# Get the countries
countries = data.country.unique()
# Get the time-period labels in the data
time_periods = data.time_period.unique()
# Fewer regions make patterns easier to assess
simple_regions ={
'World | Asia':'Asia',
'Americas | Central America and Caribbean | Central America': 'North America',
'Americas | Central America and Caribbean | Greater Antilles': 'North America',
'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
'Americas | Northern America | Northern America': 'North America',
'Americas | Northern America | Mexico': 'North America',
'Americas | Southern America | Guyana':'South America',
'Americas | Southern America | Andean':'South America',
'Americas | Southern America | Brazil':'South America',
'Americas | Southern America | Southern America':'South America',
'World | Africa':'Africa',
'World | Europe':'Europe',
'World | Oceania':'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
fig, ax = plt.subplots(figsize=(16, 16));
sns.heatmap(data.groupby(['time_period','variable']).value.count().unstack().T , ax=ax);
plt.xticks(rotation=45);
plt.xlabel('Time period');
plt.ylabel('Variable');
plt.title('Number of countries with data reported for each variable over time');
plt.show()
Test output:
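The heatmap above is built from a count of non-missing values per time period and variable. As a hedged sketch, the same table can be turned into the share of countries reporting and used to list sparsely reported variables. The toy data is invented, and the 50% cutoff is an arbitrary assumption, not a threshold taken from the analysis.

```python
import numpy as np
import pandas as pd

# Toy long-format data: 3 countries x 2 variables x 2 time periods.
idx = pd.MultiIndex.from_product(
    [["A", "B", "C"], ["gdp", "exploitable_total"], ["2008-2012", "2013-2017"]],
    names=["country", "variable", "time_period"],
)
toy = idx.to_frame(index=False)
toy["value"] = 1.0
# Pretend that only country A reports exploitable_total.
sparse_rows = (toy["variable"] == "exploitable_total") & (toy["country"] != "A")
toy.loc[sparse_rows, "value"] = np.nan

# Same aggregation as the heatmap: countries with data per variable and period.
counts = toy.groupby(["time_period", "variable"])["value"].count().unstack().T

# Convert counts into the share of countries reporting.
share = counts / toy["country"].nunique()

# Variables below 50% coverage in at least one period (arbitrary cutoff).
sparse = share[(share < 0.5).any(axis=1)].index.tolist()
print(share.round(2))
print(sparse)
```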
4.2 Exploring the population data
Location:
Mean, median, mode, quartiles
Spread:
Standard deviation, variance, range, interquartile range
Shape:
Skewness, kurtosis
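As a quick reference, each of the statistics listed above is available directly on a pandas Series. The sample values below are made up, with one large value added so that the shape statistics have something to show.

```python
import pandas as pd

# Made-up sample with one large value on the right.
s = pd.Series([1, 2, 2, 3, 4, 5, 50], name="toy")

# Location: where the data sits.
location = {
    "mean": s.mean(),
    "median": s.median(),
    "mode": s.mode().tolist(),
    "quartiles": s.quantile([0.25, 0.5, 0.75]).tolist(),
}

# Spread: how far the data is dispersed.
spread = {
    "std": s.std(),
    "var": s.var(),
    "range": s.max() - s.min(),
    "iqr": s.quantile(0.75) - s.quantile(0.25),
}

# Shape: asymmetry and tail weight (pandas reports excess kurtosis).
shape = {"skew": s.skew(), "kurtosis": s.kurt()}

for group in (location, spread, shape):
    print(group)
```

Comparing the mean with the median, and the range with the interquartile range, already hints at how much a single large value distorts the non-robust statistics.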
4.2.1 Location and spread of the data
Code:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium
# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling
# system packages
import os, sys
sys.path.append('../../scripts/')
# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]
    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
# Panel data: all countries over time (as given in the data)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]
    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df
# Helper to extract a single region
def subregion(data, region):
    return data[data.region==region]
# Read the data source
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
# Drop rows whose variable name contains 'exploitable'
data = data.loc[~data.variable.str.contains('exploitable'),:]
# Drop the national_rainfall_index rows
data = data.loc[~(data.variable=='national_rainfall_index')]
# Get the countries
countries = data.country.unique()
# Get the time-period labels in the data
time_periods = data.time_period.unique()
# Fewer regions make patterns easier to assess
simple_regions ={
'World | Asia':'Asia',
'Americas | Central America and Caribbean | Central America': 'North America',
'Americas | Central America and Caribbean | Greater Antilles': 'North America',
'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
'Americas | Northern America | Northern America': 'North America',
'Americas | Northern America | Mexico': 'North America',
'Americas | Southern America | Guyana':'South America',
'Americas | Southern America | Andean':'South America',
'Americas | Southern America | Brazil':'South America',
'Americas | Southern America | Southern America':'South America',
'World | Africa':'Africa',
'World | Europe':'Europe',
'World | Oceania':'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
recent = time_slice(data, '2013-2017')
# Time series for one country and one variable
def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country == country) & (df.variable == variable)]
    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']]
    # Change years to int and set as index
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    return series
# Compare Qatar's total, urban and rural population over time
df1 = time_series(data, 'Qatar', 'total_pop').join(time_series(data, 'Qatar', 'urban_pop')).join(time_series(data, 'Qatar', 'rural_pop'))
print(df1)
Test output:
total_pop urban_pop rural_pop
year_measured
1962 56.19 48.39 7.80
1967 86.16 75.48 10.68
1972 130.40 115.60 14.80
1977 182.40 162.40 20.00
1982 277.20 248.60 28.60
1987 423.30 385.40 37.90
1992 489.70 459.10 30.60
1997 528.20 506.50 21.70
2002 634.40 608.90 25.50
2007 1179.00 1130.00 49.00
2012 2016.00 2029.00 -13.00
2015 2235.00 2333.00 -98.00
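The last two rows above report a negative rural population, which cannot be right. A minimal sketch of an automated sanity check follows. The three rows are copied from the output above; the two rules (no negative counts, urban not larger than total) are assumptions about what a valid row looks like rather than rules taken from AQUASTAT.

```python
import pandas as pd

# Last three rows of the Qatar output above.
qatar = pd.DataFrame(
    {
        "total_pop": [1179.0, 2016.0, 2235.0],
        "urban_pop": [1130.0, 2029.0, 2333.0],
        "rural_pop": [49.0, -13.0, -98.0],
    },
    index=pd.Index([2007, 2012, 2015], name="year_measured"),
)

# A population cannot be negative, and a part cannot exceed the whole.
negative_rural = qatar["rural_pop"] < 0
urban_exceeds_total = qatar["urban_pop"] > qatar["total_pop"]
suspect = qatar[negative_rural | urban_exceeds_total]

# In these rows rural_pop equals total_pop minus urban_pop, which is
# consistent with (but does not prove) rural_pop being derived as a difference.
derived = bool((qatar["total_pop"] - qatar["urban_pop"] == qatar["rural_pop"]).all())

print(suspect)
print(derived)
```

If the rural figure is indeed a difference of two independently estimated series, the negative values would point to inconsistent total and urban estimates rather than to a typo in one cell.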
4.2.2 Shape of the data
- Is the distribution skewed?
- Are there outliers? Are they plausible?
- Are there discontinuities?
Code:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium
import scipy.stats
# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling
# system packages
import os, sys
sys.path.append('../../scripts/')
# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]
    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
# Panel data: all countries over time (as given in the data)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]
    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df
# Helper to extract a single region
def subregion(data, region):
    return data[data.region==region]
# Read the data source
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
# Drop rows whose variable name contains 'exploitable'
data = data.loc[~data.variable.str.contains('exploitable'),:]
# Drop the national_rainfall_index rows
data = data.loc[~(data.variable=='national_rainfall_index')]
# Get the countries
countries = data.country.unique()
# Get the time-period labels in the data
time_periods = data.time_period.unique()
# Fewer regions make patterns easier to assess
simple_regions ={
'World | Asia':'Asia',
'Americas | Central America and Caribbean | Central America': 'North America',
'Americas | Central America and Caribbean | Greater Antilles': 'North America',
'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
'Americas | Northern America | Northern America': 'North America',
'Americas | Northern America | Mexico': 'North America',
'Americas | Southern America | Guyana':'South America',
'Americas | Southern America | Andean':'South America',
'Americas | Southern America | Brazil':'South America',
'Americas | Southern America | Southern America':'South America',
'World | Africa':'Africa',
'World | Europe':'Europe',
'World | Oceania':'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
recent = time_slice(data, '2013-2017')
# Time series for one country and one variable
def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country == country) & (df.variable == variable)]
    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']]
    # Change years to int and set as index
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    return series
# Summary statistics for the three population variables
df1 = recent[['total_pop', 'urban_pop', 'rural_pop']].describe().astype(int)
print(df1)
# Skewness of each population variable
df2 = recent[['total_pop', 'urban_pop', 'rural_pop']].apply(scipy.stats.skew)
print(df2)
Test output:
2013-2017 total_pop urban_pop rural_pop
count 199 199 199
mean 36890 19849 17040
std 140720 69681 77461
min 0 0 -98
25% 1368 822 500
50% 7595 3967 2404
75% 25088 11656 10677
max 1407306 805387 891112
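Preliminary conclusion:
Yes, the population looks skewed: the mean of total_pop (36890) is far above the median (7595). Let's compute the skewness and kurtosis and plot a histogram to see it.
For a normal distribution the skewness should be zero. Negative skewness indicates a left skew and positive skewness a right skew.
Excess kurtosis is also zero for a normal distribution; large positive values indicate heavy tails. We definitely have some outliers!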
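As a small illustration of the two shape statistics, the sketch below applies pandas' Series.skew and Series.kurt to two made-up samples. pandas reports excess kurtosis (zero for a normal distribution), which can come out negative for flat distributions, so it is the large positive value on the skewed sample that signals outliers.

```python
import pandas as pd

# Two made-up samples: one symmetric, one with a single large outlier.
samples = {
    "symmetric": pd.Series([1, 2, 3, 4, 5, 6, 7]),
    "right_skewed": pd.Series([1, 1, 2, 2, 3, 4, 100]),
}

# Series.skew() and Series.kurt() return sample skewness and excess kurtosis.
shape = {name: (s.skew(), s.kurt()) for name, s in samples.items()}

for name, (skew, kurt) in shape.items():
    print(f"{name}: skew={skew:.2f}, excess kurtosis={kurt:.2f}")
```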
4.2.3 The trusty histogram
Code:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium
import scipy
# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling
# system packages
import os, sys
sys.path.append('../../scripts/')
# Cross-section: all countries in one time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]
    # Pivot table
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
# Panel data: all countries over time (as given in the data)
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]
    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df
# Helper to extract a single region
def subregion(data, region):
    return data[data.region==region]
# Read the data source
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
# Drop rows whose variable name contains 'exploitable'
data = data.loc[~data.variable.str.contains('exploitable'),:]
# Drop the national_rainfall_index rows
data = data.loc[~(data.variable=='national_rainfall_index')]
# Get the countries
countries = data.country.unique()
# Get the time-period labels in the data
time_periods = data.time_period.unique()
# Fewer regions make patterns easier to assess
simple_regions ={
'World | Asia':'Asia',
'Americas | Central America and Caribbean | Central America': 'North America',
'Americas | Central America and Caribbean | Greater Antilles': 'North America',
'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
'Americas | Northern America | Northern America': 'North America',
'Americas | Northern America | Mexico': 'North America',
'Americas | Southern America | Guyana':'South America',
'Americas | Southern America | Andean':'South America',
'Americas | Southern America | Brazil':'South America',
'Americas | Southern America | Southern America':'South America',
'World | Africa':'Africa',
'World | Europe':'Europe',
'World | Oceania':'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
recent = time_slice(data, '2013-2017')
# Time series for one country and one variable
def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country == country) & (df.variable == variable)]
    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']]
    # Change years to int and set as index
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    return series
fig, ax = plt.subplots(figsize=(12, 8))
ax.hist(recent.total_pop.values, bins=50);
ax.set_xlabel('Total population');
ax.set_ylabel('Number of countries');
ax.set_title('Distribution of population of countries 2013-2017');
plt.show()
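Test output: [figure: histogram, 'Distribution of population of countries 2013-2017']
4.2.4 Log transformation
Log transformation is a common way of transforming data. The purpose of a data transformation is to bring the data closer to the assumptions we want to rely on, so that statistical inference works better.
In untransformed data (the left-hand panel of the original figure, an electricity production series), the variance grows as time goes on, that is, the series becomes less and less stable. In that situation the usual analysis assumptions (errors that are independent, identically distributed and normal; a stationary time series) often do not hold.
In theory, this kind of problem is abstracted into a model in which the standard deviation of the distribution is linearly related to its mean. Taking the logarithm of such data makes the spread roughly constant, which is why the log transformation helps here.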
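To make the model above concrete, the following minimal sketch simulates three groups whose spread is proportional to their mean and compares the standard deviation before and after taking base-10 logarithms. It uses only the Python standard library. The group means, the 20% relative spread and the helper name `simulate_group` are assumptions chosen for illustration; none of them come from the AQUASTAT data.

```python
import math
import random
import statistics

def simulate_group(mean, rel_sd, n, rng):
    """Draw n positive values whose spread scales with `mean` (hypothetical helper)."""
    # Multiplicative (log-normal) noise: the standard deviation is roughly rel_sd * mean
    return [mean * math.exp(rng.gauss(0.0, rel_sd)) for _ in range(n)]

rng = random.Random(0)           # fixed seed so that repeated runs give the same numbers
group_means = [10, 100, 1000]    # assumed levels, for illustration only
raw_sd = []
log_sd = []
for m in group_means:
    values = simulate_group(m, rel_sd=0.2, n=2000, rng=rng)
    # Spread on the original scale
    raw_sd.append(statistics.stdev(values))
    # Spread after the log transformation
    log_sd.append(statistics.stdev([math.log10(v) for v in values]))

for m, r, l in zip(group_means, raw_sd, log_sd):
    print('mean=%6d  sd(raw)=%8.2f  sd(log10)=%.4f' % (m, r, l))
```

On the raw scale the standard deviation grows roughly in proportion to the mean, while on the log scale it stays close to the same value for every group, which is the behaviour the log transformation is meant to deliver.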
Code:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium
import scipy
# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling
# system packages
import os, sys
sys.path.append('../../scripts/')
# Cross-section: all countries within a single time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]
    # Pivot table: one row per country, one column per variable
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
# Panel data: one variable for all countries across the time periods
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]
    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df
# Select the rows that belong to a single region
def subregion(data, region):
    return data[data.region == region]
# Load the data
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
# Collapse the regions into fewer groups so that patterns are easier to see
simple_regions = {
    'World | Asia': 'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana': 'South America',
    'Americas | Southern America | Andean': 'South America',
    'Americas | Southern America | Brazil': 'South America',
    'Americas | Southern America | Southern America': 'South America',
    'World | Africa': 'Africa',
    'World | Europe': 'Europe',
    'World | Oceania': 'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
# Drop rows whose variable name contains 'exploitable'
data = data.loc[~data.variable.str.contains('exploitable'), :]
# Drop the national_rainfall_index rows
data = data.loc[~(data.variable == 'national_rainfall_index')]
# Unique countries
countries = data.country.unique()
# Unique time-period labels in the data
time_periods = data.time_period.unique()
# Cross-section for the most recent period
recent = time_slice(data, '2013-2017')
def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country == country) & (df.variable == variable)]
    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']].copy()
    # Change years to int and set as index
    series['year_measured'] = series['year_measured'].astype(int)
    series = series.set_index('year_measured')
    series.columns = [variable]
    return series
def plot_hist(df, variable, bins=20, xlabel=None, by=None,
              ylabel=None, title=None, logx=False, ax=None):
    if ax is None:
        fig, ax = plt.subplots(figsize=(12, 8))
    # Work on a copy of the column so the caller's DataFrame is not modified
    values = df[variable].dropna()
    if logx:
        if values.min() <= 0:
            # The log of values <= 0 is undefined: shift so the minimum becomes 1
            shift = 1 - values.min()
            values = values + shift
            print('Warning: data <=0 exists, data shifted by %0.2g before plotting' % shift)
        # Bin edges evenly spaced on a log scale
        bins = np.logspace(np.log10(values.min()),
                           np.log10(values.max()), bins)
        ax.set_xscale('log')
    ax.hist(values.values, bins=bins)
    if xlabel:
        ax.set_xlabel(xlabel)
    if ylabel:
        ax.set_ylabel(ylabel)
    if title:
        ax.set_title(title)
    return ax
plot_hist(recent, 'total_pop', bins=25, logx=True,
          xlabel='Total population (log scale)', ylabel='Number of countries',
          title='Distribution of total population of countries 2013-2017')
plt.show()
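Test output: [figure: histogram on a log x-axis, 'Distribution of total population of countries 2013-2017']
4.3 Changes over time
Code: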
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium
import scipy
# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling
# system packages
import os, sys
sys.path.append('../../scripts/')
# Cross-section: all countries within a single time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]
    # Pivot table: one row per country, one column per variable
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
# Panel data: one variable for all countries across the time periods
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]
    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df
# Select the rows that belong to a single region
def subregion(data, region):
    return data[data.region == region]
# Load the data
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
# Collapse the regions into fewer groups so that patterns are easier to see
simple_regions = {
    'World | Asia': 'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana': 'South America',
    'Americas | Southern America | Andean': 'South America',
    'Americas | Southern America | Brazil': 'South America',
    'Americas | Southern America | Southern America': 'South America',
    'World | Africa': 'Africa',
    'World | Europe': 'Europe',
    'World | Oceania': 'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
# Drop rows whose variable name contains 'exploitable'
data = data.loc[~data.variable.str.contains('exploitable'), :]
# Drop the national_rainfall_index rows
data = data.loc[~(data.variable == 'national_rainfall_index')]
# Unique countries
countries = data.country.unique()
# Unique time-period labels in the data
time_periods = data.time_period.unique()
# Cross-section for the most recent period
recent = time_slice(data, '2013-2017')
def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country == country) & (df.variable == variable)]
    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']].copy()
    # Change years to int and set as index
    series['year_measured'] = series['year_measured'].astype(int)
    series = series.set_index('year_measured')
    series.columns = [variable]
    return series
recent['population_density'] = recent.total_pop.divide(recent.total_area)
with sns.color_palette(sns.diverging_palette(220, 280, s=85, l=25, n=23)):
    # North American countries, ordered by their 1958-1962 population
    north_america = time_slice(subregion(data, 'North America'), '1958-1962').sort_values('total_pop').index.tolist()
    for country in north_america:
        ts = time_series(data, country, 'total_pop')
        # Rescale so that the smallest observed population equals 100
        ts['norm_pop'] = ts.total_pop / ts.total_pop.min() * 100
        plt.plot(ts['norm_pop'], label=country)
    plt.xlabel('Year')
    plt.ylabel('Population relative to minimum (%)')
    plt.title('Percent increase in population from 1960 in North American countries')
    plt.legend(loc=2, prop={'size': 10})
plt.show()
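Test output: [figure: line chart, 'Percent increase in population from 1960 in North American countries']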
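The normalization in the script above divides each country's population by its smallest observed value and multiplies by 100, so the plotted quantity is an index in which 100 marks the minimum. The sketch below shows the same arithmetic on a plain dictionary and how the index relates to a percent increase. The population figures and the helper name `normalize_to_minimum` are made up for illustration.

```python
def normalize_to_minimum(series):
    """Rescale a {year: value} mapping so the smallest value becomes 100 (hypothetical helper)."""
    base = min(series.values())
    return {year: value / base * 100 for year, value in series.items()}

# Made-up population figures, not taken from the AQUASTAT data
population = {1960: 2.0, 1980: 3.0, 2000: 5.0}

# Index relative to the minimum: 100 means "equal to the smallest value"
index = normalize_to_minimum(population)

# Percent increase over the minimum is the index minus 100
percent_increase = {year: value - 100 for year, value in index.items()}

for year in sorted(population):
    print(year, index[year], percent_increase[year])
```

Putting every country on the same index makes growth rates comparable even when the absolute populations differ by orders of magnitude.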
4.4 Exploring total renewable water resources
Code:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium
import scipy
# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling
# system packages
import os, sys
sys.path.append('../../scripts/')
# Cross-section: all countries within a single time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]
    # Pivot table: one row per country, one column per variable
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
# Panel data: one variable for all countries across the time periods
def variable_slice(df, variable):
    # Only data for that variable
    df = df[df.variable == variable]
    # Get variable for each country over the time periods
    df = df.pivot(index='country', columns='time_period', values='value')
    return df
# Select the rows that belong to a single region
def subregion(data, region):
    return data[data.region == region]
# Load the data
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
# Collapse the regions into fewer groups so that patterns are easier to see
simple_regions = {
    'World | Asia': 'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana': 'South America',
    'Americas | Southern America | Andean': 'South America',
    'Americas | Southern America | Brazil': 'South America',
    'Americas | Southern America | Southern America': 'South America',
    'World | Africa': 'Africa',
    'World | Europe': 'Europe',
    'World | Oceania': 'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
# Drop rows whose variable name contains 'exploitable'
data = data.loc[~data.variable.str.contains('exploitable'), :]
# Drop the national_rainfall_index rows
data = data.loc[~(data.variable == 'national_rainfall_index')]
# Unique countries
countries = data.country.unique()
# Unique time-period labels in the data
time_periods = data.time_period.unique()
# Cross-section for the most recent period
recent = time_slice(data, '2013-2017')
def time_series(df, country, variable):
    # Only take data for country/variable combo
    series = df[(df.country == country) & (df.variable == variable)]
    # Drop years with no data
    series = series.dropna()[['year_measured', 'value']].copy()
    # Change years to int and set as index
    series['year_measured'] = series['year_measured'].astype(int)
    series = series.set_index('year_measured')
    series.columns = [variable]
    return series
recent['population_density'] = recent.total_pop.divide(recent.total_area)
def plot_hist(df, variable, bins=20, xlabel=None, by=None,
              ylabel=None, title=None, logx=False, ax=None):
    if ax is None:
        fig, ax = plt.subplots(figsize=(12, 8))
    # Work on a copy of the column so the caller's DataFrame is not modified
    values = df[variable].dropna()
    if logx:
        if values.min() <= 0:
            # The log of values <= 0 is undefined: shift so the minimum becomes 1
            shift = 1 - values.min()
            values = values + shift
            print('Warning: data <=0 exists, data shifted by %0.2g before plotting' % shift)
        # Bin edges evenly spaced on a log scale
        bins = np.logspace(np.log10(values.min()),
                           np.log10(values.max()), bins)
        ax.set_xscale('log')
    ax.hist(values.values, bins=bins)
    if xlabel:
        ax.set_xlabel(xlabel)
    if ylabel:
        ax.set_ylabel(ylabel)
    if title:
        ax.set_title(title)
    return ax
plot_hist(recent, 'total_renewable', bins=50,
          xlabel='Total renewable water resources ($10^9 m^3/yr$)',
          ylabel='Number of countries',
          title='Distribution of total renewable water resources, 2013-2017')
plt.show()
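Test output: [figure: histogram, 'Distribution of total renewable water resources, 2013-2017']
4.5 Assessing the relationship between each variable and the target
4.5.1 Target: GDP per capita
Code: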
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import folium
import scipy
# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling
# system packages
import os, sys
sys.path.append('../../scripts/')
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
# Cross-section: all countries within a single time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]
    # Pivot table: one row per country, one column per variable
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
# Simplify regions: fewer groups make patterns easier to see
simple_regions = {
    'World | Asia': 'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana': 'South America',
    'Americas | Southern America | Andean': 'South America',
    'Americas | Southern America | Brazil': 'South America',
    'Americas | Southern America | Southern America': 'South America',
    'World | Africa': 'Africa',
    'World | Europe': 'Europe',
    'World | Oceania': 'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
# Remove exploitable fields and national rainfall index
data = data.loc[~data.variable.str.contains('exploitable'), :]
data = data.loc[~(data.variable == 'national_rainfall_index')]
# Subset for cross-sectional analysis
recent = time_slice(data, '2013-2017')
# Scatter plot of one candidate variable against the target
plt.scatter(recent.seasonal_variability, recent.gdp_per_capita)
plt.xlabel('Seasonal variability')
plt.ylabel('GDP per capita ($USD/person)')
plt.show()
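Test output: [figure: scatter plot of seasonal variability against GDP per capita]
4.5.2 Relationship between each feature and GDP per capita
Code: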
from matplotlib import pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np
import pandas as pd
import folium
import scipy
# eda tools
import pivottablejs
import missingno as msno
import pandas_profiling
# system packages
import os, sys
sys.path.append('../../scripts/')
data = pd.read_csv('E:/file/aquastat.csv.gzip', compression='gzip')
# Cross-section: all countries within a single time period
def time_slice(df, time_period):
    # Only take data for time period of interest
    df = df[df.time_period == time_period]
    # Pivot table: one row per country, one column per variable
    df = df.pivot(index='country', columns='variable', values='value')
    df.columns.name = time_period
    return df
# Simplify regions: fewer groups make patterns easier to see
simple_regions = {
    'World | Asia': 'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana': 'South America',
    'Americas | Southern America | Andean': 'South America',
    'Americas | Southern America | Brazil': 'South America',
    'Americas | Southern America | Southern America': 'South America',
    'World | Africa': 'Africa',
    'World | Europe': 'Europe',
    'World | Oceania': 'Oceania'
}
data.region = data.region.apply(lambda x: simple_regions[x])
# Remove exploitable fields and national rainfall index
data = data.loc[~data.variable.str.contains('exploitable'), :]
data = data.loc[~(data.variable == 'national_rainfall_index')]
# Subset for cross-sectional analysis
recent = time_slice(data, '2013-2017')
# Correlation of every variable with GDP per capita (GDP itself excluded)
recent_corr = recent.corr().loc['gdp_per_capita'].drop(['gdp', 'gdp_per_capita'])
def conditional_bar(series, bar_colors=None, color_labels=None, figsize=(13, 24),
                    xlabel=None, by=None, ylabel=None, title=None):
    fig, ax = plt.subplots(figsize=figsize)
    if not bar_colors:
        # Fall back to the first colour of the default colour cycle
        bar_colors = mpl.rcParams['axes.prop_cycle'].by_key()['color'][0]
    plt.barh(range(len(series)), series.values, color=bar_colors)
    plt.xlabel('' if not xlabel else xlabel)
    plt.ylabel('' if not ylabel else ylabel)
    plt.yticks(range(len(series)), series.index.tolist())
    plt.title('' if not title else title)
    plt.ylim([-1, len(series)])
    if color_labels:
        # Empty plots act as legend entries, one per colour
        for col, lab in color_labels.items():
            plt.plot([], linestyle='', marker='s', c=col, label=lab)
        lines, labels = ax.get_legend_handles_labels()
        ax.legend(lines[-len(color_labels.keys()):], labels[-len(color_labels.keys()):], loc='upper right')
    return fig
# One colour for negative correlations, another for positive ones
bar_colors = ['#0055A7' if x else '#2C3E4F' for x in list(recent_corr.values < 0)]
color_labels = {'#0055A7': 'Negative correlation', '#2C3E4F': 'Positive correlation'}
conditional_bar(recent_corr.apply(np.abs), bar_colors, color_labels,
                title='Magnitude of correlation with GDP per capita, 2013-2017',
                xlabel='|Correlation|')
plt.show()
Test output: [figure: horizontal bar chart, 'Magnitude of correlation with GDP per capita, 2013-2017']
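The script above plots the absolute value of each correlation and uses colour to keep track of the sign. The sketch below reproduces that idea in plain Python: it computes Pearson correlations against a target, ranks the features by magnitude and reports the sign separately. The data and feature names are invented, and the hand-written `pearson` helper assumes complete data, whereas pandas `corr` skips missing values pairwise.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists without missing values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented stand-in for the target column (gdp_per_capita in the script above)
target = [1.0, 2.0, 3.0, 4.0, 5.0]
# Invented feature columns, one value per country
features = {
    'feature_up': [2.0, 4.0, 6.0, 8.0, 10.0],    # rises with the target
    'feature_down': [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as the target rises
    'feature_noisy': [1.0, 3.0, 2.0, 5.0, 4.0],  # weaker positive relationship
}

corr = {name: pearson(values, target) for name, values in features.items()}

# Rank by magnitude; the sign is reported separately, as the bar colours do
ranked = sorted(corr.items(), key=lambda item: abs(item[1]), reverse=True)
for name, r in ranked:
    sign = 'negative' if r < 0 else 'positive'
    print('%-14s |r|=%.2f (%s)' % (name, abs(r), sign))
```

Ranking by magnitude puts strongly negative and strongly positive features side by side, which is useful when the goal is to find variables worth a closer look regardless of direction.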