Python数据分析_Pandas03_数据整理

作者: ChZ_CC | 来源:发表于2017-02-03 16:23 被阅读236次

Python数据分析_Pandas03_数据整理
python的pandas函数学习
销售数据分析 (python)
【分析篇】：Pandas像sql操作python 进行数据分析
《Python数据分析与挖掘实战》读书笔记-数据探索
解决Python输出图不显示中文/数据分组统计
2019-09-14 分析lianjia数据（四）——Power
2019-10-03 分析lianjia数据（五）——生成词云图
Adventure项目总结
用 Python 做数据分析（4）

主要内容：

增删行列
修改数值：apply lambda
描述统计
合并、补齐数据
极端值处理

起始数据框：

In [76]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

In [77]: df
Out[77]:
                   A         B         C         D
2013-01-01 -0.411674  0.273549  0.629843  1.881497
2013-01-02  1.240512  0.970725  0.033099  1.553420
2013-01-03 -0.544326  0.545738 -1.325810  0.130738
2013-01-04  1.044803 -0.117151  0.874583  2.278227
2013-01-05 -2.194728 -2.536257  0.478644  0.057728
2013-01-06 -1.092031  1.249952  1.598761 -0.153423

In [98]: df2 = df.copy()
    ...: df2['E'] = ['one', 'one','two','three','four','three']
    ...: df2
    ...:
Out[98]:
                   A         B         C         D      E
2013-01-01 -0.411674  0.273549  0.629843  1.881497    one
2013-01-02  1.240512  0.970725  0.033099  1.553420    one
2013-01-03 -0.544326  0.545738 -1.325810  0.130738    two
2013-01-04  1.044803 -0.117151  0.874583  2.278227  three
2013-01-05 -2.194728 -2.536257  0.478644  0.057728   four
2013-01-06 -1.092031  1.249952  1.598761 -0.153423  three

增删行列

增加新列column

In [110]: df2["F"] = np.arange(3,9)   # 如果赋的是一个值比如3，那这一列都是3。

In [111]: df2
Out[111]:
                   A         B         C         D      E  F
2013-01-01 -0.411674  0.273549  0.629843  1.881497    ONE  3
2013-01-02  1.240512  0.970725  0.033099  1.553420    ONE  4
2013-01-03 -0.544326  0.545738 -1.325810  0.130738    TWO  5
2013-01-04  1.044803 -0.117151  0.874583  2.278227  THREE  6
2013-01-05 -2.194728 -2.536257  0.478644  0.057728   FOUR  7
2013-01-06 -1.092031  1.249952  1.598761 -0.153423  THREE  8

改变列名

In [112]: df2.columns = list("qwerty")

In [113]: df2.columns
Out[113]: Index(['q', 'w', 'e', 'r', 't', 'y'], dtype='object')

设置index

pd.DataFrame.set_index(keys,drop=False, inplace=False) 
# keys : column label or list of column labels / arrays
# 默认drop=True，设置为index的列从数据库删除。

修改数值

用lambda和apply更改数据值

In [101]: df2.loc[:,"E"].unique()
Out[101]: array(['one', 'two', 'three', 'four'], dtype=object)

In [105]: df2.loc[:,"E"]=df2.E.apply(lambda x: x.upper())

In [106]: df2.loc[:,"E"].unique()
Out[106]: array(['ONE', 'TWO', 'THREE', 'FOUR'], dtype=object)

函数应用 function application

三种情况：

整张表：pipe()
行列应用：apply()
元素级应用： applymap()

这一部分在【function application部分】：http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-discretization。

apply()的使用

In [34]: df.apply(np.mean)
Out[34]:
A   -0.274648
B   -0.260124
C    0.152004
D   -0.659868
dtype: float64

In [35]: df.apply(lambda x: x.max() - x.min())  # 求了个全距
Out[35]:
A    3.203982
B    4.084655
C    1.984507
D    3.190727
dtype: float64

合并重叠数据

#1
a = Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
           index=['f', 'e', 'd', 'c', 'b', 'a'])
b = Series(np.arange(len(a), dtype=np.float64),
           index=['f', 'e', 'd', 'c', 'b', 'a'])
b[-1] = np.nan


np.where(pd.isnull(a), b, a)

#2  补齐数据
b[:-2].combine_first(a[2:])

#3 
df1 = DataFrame({'a': [1., np.nan, 5., np.nan],
                 'b': [np.nan, 2., np.nan, 6.],
                 'c': range(2, 18, 4)})
df2 = DataFrame({'a': [5., 4., np.nan, 3., 7.],
                 'b': [np.nan, 3., 4., 6., 8.]})
df1.combine_first(df2)  
# 把df1中缺失的数据NaN，补充为df2中对应的数据。

缺失值处理

df1.dropna(how='any')       # 删除空数据，按行。
df1.fillna(value=5)         # 填充空数据，9999。
pd.isnull(df1)              # 获得df1数据库的nan布尔面具。

DataFrame.fillna(value=None,    # scalar, dict, Series, or DataFrame. 【不能是list】
                                # dict Series等没有的值就没法填充。
                method=None,    # {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
                                # backfill/bfill，用后面的第一个值，ffill用前面的最有一个值。
                axis=None,      # {0 or ‘index’, 1 or ‘columns’}
                inplace=False,  # 布尔型，True或False。默认False。
                limit=None, 
                downcast=None, 
                **kwargs)

limit : int, default None
    If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.
downcast : dict, default is None
    a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)
Returns:    
    filled : DataFrame

1.96个标准差以外的
分组后，组内1.96个标准差以外的
多重分组后。。。

#---创建了两个数据表，然后合在一起---
In [116]: States = ['NY', 'NY', 'NY', 'NY', 'FL', 'FL', 'GA', 'GA', 'FL', 'FL']
     ...: data = [1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10]
     ...: idx = pd.date_range('1/1/2012', periods=10, freq='MS')
     ...: df1 = pd.DataFrame(data, index=idx, columns=['Revenue'])
     ...: df1['State'] = States
     ...:

In [118]: data2 = [10.0, 10.0, 9, 9, 8, 8, 7, 7, 6, 6]
     ...: idx2 = pd.date_range('1/1/2013', periods=10, freq='MS')
     ...: df2 = pd.DataFrame(data2, index=idx2, columns=['Revenue'])
     ...: df2['State'] = States
     ...:

In [119]: df = pd.concat([df1,df2])

In [120]: df
Out[120]:
            Revenue State
2012-01-01      1.0    NY
2012-02-01      2.0    NY
2012-03-01      3.0    NY
2012-04-01      4.0    NY
2012-05-01      5.0    FL
2012-06-01      6.0    FL
2012-07-01      7.0    GA
2012-08-01      8.0    GA
2012-09-01      9.0    FL
2012-10-01     10.0    FL
2013-01-01     10.0    NY
2013-02-01     10.0    NY
2013-03-01      9.0    NY
2013-04-01      9.0    NY
2013-05-01      8.0    FL
2013-06-01      8.0    FL
2013-07-01      7.0    GA
2013-08-01      7.0    GA
2013-09-01      6.0    FL
2013-10-01      6.0    FL

方法1：超总体均值1.96个std

In [121]: newdf = df.copy()
     ...:
     ...: newdf['x-Mean'] = abs(newdf['Revenue'] - newdf['Revenue'].mean())
     ...: newdf['1.96*std'] = 1.96*newdf['Revenue'].std()
     ...: newdf['Outlier'] = abs(newdf['Revenue'] - newdf['Revenue'].mean()) > 1.96*newdf['Revenue'].std()
     ...: newdf
     ...:
Out[121]:
            Revenue State  x-Mean  1.96*std Outlier
2012-01-01      1.0    NY    5.75  5.200273    True
2012-02-01      2.0    NY    4.75  5.200273   False
2012-03-01      3.0    NY    3.75  5.200273   False
2012-04-01      4.0    NY    2.75  5.200273   False
2012-05-01      5.0    FL    1.75  5.200273   False
2012-06-01      6.0    FL    0.75  5.200273   False
2012-07-01      7.0    GA    0.25  5.200273   False
2012-08-01      8.0    GA    1.25  5.200273   False
2012-09-01      9.0    FL    2.25  5.200273   False
2012-10-01     10.0    FL    3.25  5.200273   False
2013-01-01     10.0    NY    3.25  5.200273   False
2013-02-01     10.0    NY    3.25  5.200273   False
2013-03-01      9.0    NY    2.25  5.200273   False
2013-04-01      9.0    NY    2.25  5.200273   False
2013-05-01      8.0    FL    1.25  5.200273   False
2013-06-01      8.0    FL    1.25  5.200273   False
2013-07-01      7.0    GA    0.25  5.200273   False
2013-08-01      7.0    GA    0.25  5.200273   False
2013-09-01      6.0    FL    0.75  5.200273   False
2013-10-01      6.0    FL    0.75  5.200273   False

方法2：分组后超小组1.96个std

In [122]: newdf = df.copy()
     ...:
     ...: State = newdf.groupby('State')   # 分类汇总的方法。groupby.transform
     ...:
     ...: newdf['Outlier'] = State.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() )
     ...: newdf['x-Mean'] = State.transform( lambda x: abs(x-x.mean()) )
     ...: newdf['1.96*std'] = State.transform( lambda x: 1.96*x.std() )
     ...: newdf
     ...:
Out[122]:
            Revenue State Outlier  x-Mean  1.96*std
2012-01-01      1.0    NY   False    5.00  7.554813
2012-02-01      2.0    NY   False    4.00  7.554813
2012-03-01      3.0    NY   False    3.00  7.554813
2012-04-01      4.0    NY   False    2.00  7.554813
2012-05-01      5.0    FL   False    2.25  3.434996
2012-06-01      6.0    FL   False    1.25  3.434996
2012-07-01      7.0    GA   False    0.25  0.980000
2012-08-01      8.0    GA   False    0.75  0.980000
2012-09-01      9.0    FL   False    1.75  3.434996
2012-10-01     10.0    FL   False    2.75  3.434996
2013-01-01     10.0    NY   False    4.00  7.554813
2013-02-01     10.0    NY   False    4.00  7.554813
2013-03-01      9.0    NY   False    3.00  7.554813
2013-04-01      9.0    NY   False    3.00  7.554813
2013-05-01      8.0    FL   False    0.75  3.434996
2013-06-01      8.0    FL   False    0.75  3.434996
2013-07-01      7.0    GA   False    0.25  0.980000
2013-08-01      7.0    GA   False    0.25  0.980000
2013-09-01      6.0    FL   False    1.25  3.434996
2013-10-01      6.0    FL   False    1.25  3.434996

#---函数式的写法---
newdf = df.copy()

State = newdf.groupby('State')

def s(group):
    group['x-Mean'] = abs(group['Revenue'] - group['Revenue'].mean())
    group['1.96*std'] = 1.96*group['Revenue'].std()  
    group['Outlier'] = abs(group['Revenue'] - group['Revenue'].mean()) > 1.96*group['Revenue'].std()
    return group

Newdf2 = State.apply(s)
Newdf2

方法3：多重分组之后，1.96

In [123]: newdf = df.copy()
     ...:
     ...: StateMonth = newdf.groupby(['State', lambda x: x.month])
     ...:
     ...: newdf['Outlier'] = StateMonth.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() )
     ...: newdf['x-Mean'] = StateMonth.transform( lambda x: abs(x-x.mean()) )
     ...: newdf['1.96*std'] = StateMonth.transform( lambda x: 1.96*x.std() )
     ...: newdf
     ...:
Out[123]:
            Revenue State Outlier  x-Mean   1.96*std
2012-01-01      1.0    NY   False     4.5  12.473364
2012-02-01      2.0    NY   False     4.0  11.087434
2012-03-01      3.0    NY   False     3.0   8.315576
2012-04-01      4.0    NY   False     2.5   6.929646
2012-05-01      5.0    FL   False     1.5   4.157788
2012-06-01      6.0    FL   False     1.0   2.771859
2012-07-01      7.0    GA   False     0.0   0.000000
2012-08-01      8.0    GA   False     0.5   1.385929
2012-09-01      9.0    FL   False     1.5   4.157788
2012-10-01     10.0    FL   False     2.0   5.543717
2013-01-01     10.0    NY   False     4.5  12.473364
2013-02-01     10.0    NY   False     4.0  11.087434
2013-03-01      9.0    NY   False     3.0   8.315576
2013-04-01      9.0    NY   False     2.5   6.929646
2013-05-01      8.0    FL   False     1.5   4.157788
2013-06-01      8.0    FL   False     1.0   2.771859
2013-07-01      7.0    GA   False     0.0   0.000000
2013-08-01      7.0    GA   False     0.5   1.385929
2013-09-01      6.0    FL   False     1.5   4.157788
2013-10-01      6.0    FL   False     2.0   5.543717

In [124]: StateMonth
Out[124]: <pandas.core.groupby.DataFrameGroupBy object at 0x0000017F52F4DAC8>

方法4：四分位数。对于非高斯分布/正态分布的数据。

# make a copy of original df
In [126]: newdf = df.copy()
     ...:
     ...: State = newdf.groupby('State')
     ...:
     ...: newdf['Lower'] = State['Revenue'].transform( lambda x: x.quantile(q=.25) - (1.5*(x.quantile(q=.75)-x.quantile(q=.25))) )
     ...: newdf['Upper'] = State['Revenue'].transform( lambda x: x.quantile(q=.75) + (1.5*(x.quantile(q=.75)-x.quantile(q=.25))) )
     ...: newdf['Outlier'] = (newdf['Revenue'] < newdf['Lower']) | (newdf['Revenue'] > newdf['Upper'])
     ...: newdf
     ...:
Out[126]:
            Revenue State  Lower   Upper Outlier
2012-01-01      1.0    NY -7.000  19.000   False
2012-02-01      2.0    NY -7.000  19.000   False
2012-03-01      3.0    NY -7.000  19.000   False
2012-04-01      4.0    NY -7.000  19.000   False
2012-05-01      5.0    FL  2.625  11.625   False
2012-06-01      6.0    FL  2.625  11.625   False
2012-07-01      7.0    GA  6.625   7.625   False
2012-08-01      8.0    GA  6.625   7.625    True
2012-09-01      9.0    FL  2.625  11.625   False
2012-10-01     10.0    FL  2.625  11.625   False
2013-01-01     10.0    NY -7.000  19.000   False
2013-02-01     10.0    NY -7.000  19.000   False
2013-03-01      9.0    NY -7.000  19.000   False
2013-04-01      9.0    NY -7.000  19.000   False
2013-05-01      8.0    FL  2.625  11.625   False
2013-06-01      8.0    FL  2.625  11.625   False
2013-07-01      7.0    GA  6.625   7.625   False
2013-08-01      7.0    GA  6.625   7.625   False
2013-09-01      6.0    FL  2.625  11.625   False
2013-10-01      6.0    FL  2.625  11.625   False

Python数据分析_Pandas03_数据整理
主要内容：增删行列修改数值：apply lambda 描述统计合并、补齐数据极端值处理起始数据框：增删...
python的pandas函数学习
下面是对教程的一个整理： Python数据分析之Pandas——基础篇 1. Python数据分析实战 | 初识...
销售数据分析 (python)
分析思路： 1.了解数据（1）对数据源的数据进行整理 (2) 数据导入到python （3）查看数据 - 数据有...
【分析篇】：Pandas像sql操作python 进行数据分析
前言：Python数据分析的基础在于对数据的提取，清洗，汇总，并思考整理发现数据规律的过程。以下模块是个人在数据分...
《Python数据分析与挖掘实战》读书笔记-数据探索
《Python数据分析与挖掘实战》，第三章数据探索前面的内容偏理论，简单整理下，我感觉有点儿用的吧数据质量分析...
解决Python输出图不显示中文/数据分组统计
Python分析建模，日常问题整理（一） 2018.07.23~2018.07.09 当数据类型是categori...
2019-09-14 分析lianjia数据（四）——Power
分析lianjia房源数据（一）——Python数据清洗分析lianjia房源数据（二）——SPSS数据清洗分...
2019-10-03 分析lianjia数据（五）——生成词云图
分析lianjia房源数据（一）——Python数据清洗分析lianjia房源数据（二）——SPSS数据清洗分...
Adventure项目总结
分析成果的链接本文目录：一项目背景介绍二数据整理和罗列分析指标三使用Python聚合数据并实现自动更新四 ...
用 Python 做数据分析（4）
推荐阅读：人人都是数据分析师推荐阅读：用 Python 做数据分析（1）推荐阅读：用 Python 做数据分析...