Learn专题五——Pandas

作者: Python解决方案 | 来源:发表于2019-06-03 19:53 被阅读0次

Learn专题五——Pandas
pandas 2
python数据分析推荐安装包
机器学习
2018-10-14-python认识
独热编码
学习资料
Learn_for_Pandas
Pandas 技巧（一）
【机器学习工程师】学习路径

第一课创建和读写工作薄

1.导入Pandas包

import pandas as pd

2.创建Series结构的数据(注意大小写)

s1=pd.Series([1,2,3,4],index=["one","two","three","four"]，name="s1")

3.创建DataFrame结构数据

通过字典创建

方法一：直接在创建DataFrame时设置index即可

dict = {'a':1,'b':2,'c':3}
data = pd.DataFrame(dict,index=[0])
print(data)

方法二：通过from_dict函数将value为标称变量的字典转换为DataFrame对象

dict = {'a':1,'b':2,'c':3}
pd.DataFrame.from_dict(dict,orient='index').T
print(data)

方法三：输入字典时不要让Value为标称属性，把Value转换为list对象再传入即可

dict = {'a':[1],'b':[2],'c':[3]}
data = pd.DataFrame(dict)
print(data)

方法四：直接将key和value取出来，都转换成list对象

dict = {'a':1,'b':2,'c':3}
pd.DataFrame(list(dict.items()))
print(data)

错误方法：直接传入标称属性为value的字典需要写入index，也就是说，需要在创建DataFrame对象时设定index。

dict = {'a':1,'b':2,'c':3}
data = pd.DataFrame(dict)

直接创建

方法一:自动产生序号为1,2……的index

pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})

方法二:自定义index

pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])

              Bob             Sue
Product A   I liked it. Pretty good.
Product B   It was awful.   Bland.

由上面的例子可以看出创建DataFrame时，字典的键对应DataFrame的列名，行名要么是自动生成，要么自定义，同时掌握index的用法。

4.读写常见文件

CSV(comma-seperated values)

#read
wine=pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
#write
wine_reviews.head().to_csv("wine_reviews.csv")

Excel(单个sheet)

#read
wic = pd.read_excel("../input/publicassistance/xls_files_all/WICAgencies2013ytd.xlsx", 
                    sheet_name='Total Women')
#write
wic.to_excel('wic.xlsx', sheet_name='Total Women')

Excel(多个sheet)

#read
wic = pd.ExcelFile("../input/publicassistance/xls_files_all/WICAgencies2013ytd.xls", 
                    sheet_name='Total Women')
df=pd.read_excel(wic,"sheet1")

SQL文件(Structured Query Language ,仅支持SQLite数据库)

#read
import sqlite3
conn = sqlite3.connect("../input/188-million-us-wildfires/FPA_FOD_20170508.sqlite")

fires = pd.read_sql_query("SELECT * FROM fires", conn)

#write
conn = sqlite3.connect("fires.sqlite")
fires.head(10).to_sql("fires", conn)

注意：创建DataFrame时加索引用index=[0]，而读取文件的时候则是index_col=0。此外，python to_csv和read_csv索引加入内容的问题可以参考：DemonHunter211

更详细的内容参考：pandas-docs

第二课索引、选择与分配

1.索引：Series和DataFrame的索引方式有很多种；这里以DataFrame为例作简略介绍，Series与之相似。

直接通过索引和标签，类似于列表索引
通过各种内置方法：.iloc、.loc、.ix、.iat、.index、.column等，关于.iloc、.loc、.ix、.iat等的区别见：roamer314

例如由一个名为reviews的数据表如下：

reviews
操作如下：
Select the description column from reviews and assign the result to the variable desc.

desc = reviews["description"]

desc = reviews.description

Select the first value from the description column of reviews, assigning it to variable first_description.

first_description = reviews["description"][0]

first_row=reviews.index[0]

Select the first 10 values from the description column in reviews, assigning the result to variable first_descriptions.

first_descriptions = reviews["description"][:10]

reviews.loc[:9, "description"]

review.head(10)

Select the records with index labels 1, 2, 3, 5, and 8, assigning the result to the variable sample_reviews.

sample_reviews = reviews.iloc[[1,2,3,5,8]]

indices = [1, 2, 3, 5, 8]
sample_reviews = reviews.loc[indices]

Create a variable df containing the country, province, region_1, and region_2 columns of the records with the index labels 0, 1, 10, and 100.

dict=["country","province","region1","region2"]
df = pd.DataFrame(dict,index=[0,1,10,100],columns=[0])

cols = ['country', 'province', 'region_1', 'region_2']
indices = [0, 1, 10, 100]
df = reviews.loc[indices, cols]

Create a variable df containing the country and variety columns of the first 100 records.

df = reviews.loc[:99][["country","variety"]]
df=reviews.iloc[:100][["country","variety"]]
df=reviews[["country","variety"]][:100]
df = reviews.loc[:99, ["country","variety"]]
df = reviews.iloc[:100, [0,11]]

2.选择：通常利用关系表达式来进行判断选择，例如：
Create a DataFrame italian_wines containing reviews of wines made in Italy.

italian_wines = reviews[reviews.country=="Italy"]

Create a DataFrame top_oceania_wines containing all reviews with at least 95 points (out of 100) for wines from Australia or New Zealand.

top_oceania_wines = reviews.loc[(reviews.country in (["Australia","New Zealand"])) 
&  (reviews.points>=95)]

第三课总结的方法与映射

1.总结
pandas内置了很多总结的方法:.sum()、.cumsum()(用于计算各行累加值)、.min()、.max()、.describe()、.mean()、.median()、.idxmin()(用于计算最小的索引值)、.idxmax()等。
其他的方法包括：.unique()、.count()、.isna()、isnull()、.value_counts()等等。

例子：
Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.

bargain_wine = reviews.loc[((reviews.points)/(reviews.price)).idxmax(),"title"]
# print(type(((reviews.points)/(reviews.price))))#Series类型

2.map(映射)

例子1：
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset.
一种实现方法：

import time
time_start=time.time()

ind1=[]
for i in range(len(reviews["description"])):
    if "tropical" in reviews["description"][i]:
        ind1.append(i)

ind2=[]
for i in range(len(reviews["description"])):
    if "fruity" in reviews["description"][i]:
        ind2.append(i)

descriptor_counts = pd.Series([len(ind1), len(ind2)], index=['tropical', 'fruity'])
time_end=time.time()
print('totally cost',time_end-time_start)

耗时：3.5854461193084717
另一种实现方法：

import time
time_start=time.time()

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])

time_end=time.time()
print('totally cost',time_end-time_start)

耗时：0.15914034843444824
可见内置的map函数对于提高程序执行速度有很大提升作用，多用map少用循环。

例子2：
Create a series star_ratings with the number of stars corresponding to each review in the dataset.
一种实现方法：

def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings = reviews.apply(stars, axis='columns')

耗时：5.851803779602051
另一种实现方法：

s1=reviews.country.map(lambda y: y=="Canada")|reviews.points.map(lambda x: x>=95)
s2=reviews.points.map(lambda z: (z>=85)) & reviews.country.map(lambda l: l!="Canada")
s3=reviews.points.map(lambda z: z<85) & reviews.country.map(lambda l: l!="Canada")
star_ratings=(s1.replace(True,3)+s2.replace(True,2)+s3.replace(True,2)).astype(int)

耗时：0.20263314247131348
记住：Series用map,DataFrame用apply，apply要指定axis.

第四课分组和排序

1.分组
pandas中进行分组的主要方法是groupby，它会是一个非常有用的函数，常常和其他方法组合在一起使用，来看几个例子：

taster_twitter_handle数量

reviews_written = reviews.groupby('taster_twitter_handle').size()

得分统计

reviews.groupby('points').points.count()

以得分情况为分组依据，得出各得分值对应价格的最小值

reviews.groupby('points').price.min()

以价格作为分组依据，得出各不同价格对应得分最大值，并升序排列

best_rating_per_price = reviews.groupby('price')['points'].max().sort_index()

以酒品分类作为分组依据，对各分类对应价格的最大值和最小值进行输出，生成DataFrame,

price_extremes = reviews.groupby('variety').price.agg([min, max])

以国家和酒的分类作为分组依据，得出每种分类对应的酒之数量，至少有三种实现方法，第一种生成的是DataFrame，第二、三种是Series

country_variety_counts = reviews.groupby(['country', 'variety']).description.agg([len]).sort_values(by='len', ascending=False)

scountry_variety_counts = reviews.groupby(['country', 'variety']).size().sort_values(ascending=False)

country_variety_counts = reviews.groupby(["country","variety"]).variety.count().sort_values(ascending=False)

结合apply函数，以酿酒厂作为分类依据，分析每个酿酒厂的第一种酒名

reviews.groupby('winery').apply(lambda df: df.title.iloc[0])

以多个特征:country和province作为分类依据,然后在各自的分组中按照得分的降序排列。

reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.argmax()])

结合agg方法，以country作为分类依据，对价格进行输出：不同价格数，最低价、最高价

reviews.groupby(['country']).price.agg([len, min, max])

多个索引，可以使用reset_index()进行索引复原

countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed

Grouping returns data in index order, not in value order. That is to say, when outputting the result of a groupby, the order of the rows is dependent on the values in the index, not the data.

2.排序
pandas的中常用的排序是sort_values以及sort_index,可以一次对多个特征进行排序。

sorted_varieties = price_extremes.sort_values(["min","max"],ascending=False)

countries_reviewed.sort_index()

第五课数据类型和缺失数据

1.数据类型

dtype:int、float、object(没有str)

注意用dtype和dtypes的区别

reviews.price.dtype
reviews.price.dtypes

reviews.dtypes

数据类型的转换：astype

reviews.points.astype('float64')

使用TAB键结合通配符可以查询个多的内置方法。

reviews.*type*?

reviews.astype
reviews.dtypes
reviews.ftypes
reviews.get_dtype_counts
reviews.get_ftype_counts
reviews.select_dtypes

2.缺失数据

缺失数据的查询与统计

reviews.isna
reviews.notna

reviews.price.isnull()
reviews.price.isnull()

reviews[reviews.country.isnull()]

reviews.price.isnull().sum()

缺失数据的填充与丢弃

reviews["region_1"].fillna("Unknown")

reviews.dropna("region_1")

NaN值无法用replace方法进行替换.

第六课重命名和合并

1.重命名

改变特征名

renamed = reviews.rename(columns={"region_1":"region","region_2":"locale"})
renamed = reviews.rename(columns=dict(region_1='region', region_2='locale'))

改变索引名，可以用rename,也可以用rename_axis

reindexed = reviews.rename(columns=dict(index="wines"))

reindexed = reviews.rename_axis('wines', axis='rows')

改变索引值

reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})

2.合并（连接）

pandas提供了三种主要的数据结合的方式，依据复杂由低到高的顺序依次是：concat、join、merge，另外还有一个combine_first实例方法。

concat

当数据具有相同的列属性时，进行数据堆叠，建议使用concat。

canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])

joint

方便实现按索引合并，还可以用于合并多个带有相同或者相似索引的DataFrame对象，而不管它们之间有没有重叠的列。

left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')

这里lsuffix、rsuffix用来区别两个数据集。

merge
可根据一个或者多个键将不同的DF中的行连接起来，其功能比较强大，具体参pandas.DataFrame.merge
combine_first
可以将重复数据编结在一起，用一个对象中的值填充另一个对象中的缺失值。

最后贴上一张图：

Pandas基础使用方法表

参考1：小白白白又白cdllp
参考2：DemonHunter211
参考3：roamer314
参考4：pandas-docs
参考5：pandas.DataFrame.merge
参考6：《利用Python进行数据分析》Wes McKinney 著唐学韬等译

Learn专题五——Pandas
第一课创建和读写工作薄 1.导入Pandas包 2.创建Series结构的数据(注意大小写) 3.创建DataF...
pandas 2
pandas 2 Learn to handle missing data using pandas and a ...
python数据分析推荐安装包
numpy pandas[http://pandas.pydata.org/] scikit-learn scip...
机器学习
随机森林参考文章 scikit-learn随机森林调参小结用scikit-learn和pandas学习线性回归...
2018-10-14-python认识
第一、工具库：numpy、pandas、matplotlib 、scikit-learn、tensorflow 第...
独热编码
关于独热编码的实现一. scikit-learn库二. scikit-learn库三. 使用pandas的ge...
学习资料
AI Cheatsheets：各种 cheat sheet，包括 Pandas，Scikit-learn等htt...
Learn_for_Pandas
Learn_for_Pandas 1.Series Series类似于一维向量组，只不过包含了value和inde...
Pandas 技巧（一）
导语在众多的科学计算库中，我认为Pandas对数据科学运算最有用。Pandas，加上Scikit-learn几乎...
【机器学习工程师】学习路径
机器学习路径将带你学会使用 Numpy，Scipy，Pandas，Scikit-learn，Matplotlib ...