Python Data Analysis_Pandas05_Statistics/statisti


Author: ChZ_CC | Published: 2017-02-06 16:26 | Views: 621

    Contents:

    • Common statistics in Pandas
    • Frequency counts
    • Binning data
    • Percentage change
    • Covariance
    • Correlation

    Common statistics in Pandas

    Method          Description
    -----------------------------------------------------------
    count()         Number of non-null observations
    sum()           Sum of values
    mean()          Mean of values
    median()        Arithmetic median of values
    min()           Minimum
    max()           Maximum
    std()           Bessel-corrected sample standard deviation
    var()           Unbiased variance
    skew()          Sample skewness (3rd moment)
    kurt()          Sample kurtosis (4th moment)
    quantile()      Sample quantile (value at %)
    apply()         Generic apply
    cov()           Unbiased covariance (binary)
    corr()          Correlation (binary)
    

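    Usage is simple, and these methods work on both DataFrame and Series. On a DataFrame the statistics are computed per column by default. For example: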

    df.mean()         # mean of each column
    df.mean(axis=1)   # mean of each row
    df.T              # the transpose attribute is another way to swap rows and columns
    

    describe() computes several of the most common statistics in one call.

    df.describe()   # per column: count, mean, std, min, 25%, 50%, 75%, max
    

    These calls are straightforward, so they are not walked through in detail here.
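
    For readers who want to see the axis behaviour concretely, here is a minimal sketch. The frame `df` and its column names `x` and `y` are made up for illustration and are not from the original post.

```python
import pandas as pd

# Small hypothetical frame: 3 rows, 2 columns.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 60.0]})

col_means = df.mean()         # one value per column (axis=0 is the default)
row_means = df.mean(axis=1)   # one value per row
summary = df.describe()       # count, mean, std, min, 25%, 50%, 75%, max per column

print(col_means)
print(row_means)
print(summary)
```

    Note that std in the summary is the sample standard deviation (ddof=1), matching the table above.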

    Frequency counts

    For discrete data, the frequency of each value is an important statistic.

    In [13]: data = np.random.randint(0, 7, size=50)
    
    In [14]: data
    Out[14]:
    array([0, 1, 6, 2, 4, 2, 4, 2, 5, 2, 0, 2, 0, 2, 2, 3, 4, 6, 6, 1, 6, 3, 6,
           5, 0, 0, 6, 1, 5, 6, 2, 6, 0, 3, 3, 6, 4, 6, 0, 0, 0, 4, 0, 6, 5, 6,
           5, 6, 3, 1])
    
    In [15]: s = pd.Series(data)
    
    In [16]: s.value_counts()
    Out[16]:
    6    13
    0    10
    2     8
    5     5
    4     5
    3     5
    1     4
    dtype: int64
    
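    # Frequency counts can also be computed with the top-level function,
    # which accepts a NumPy array as well.
    # (pd.value_counts is deprecated since pandas 2.1; pd.Series(data).value_counts() is the replacement.)
    In [17]: pd.value_counts(data)
    In [18]: pd.value_counts(s)
    # Same result as s.value_counts(), so the output is omitted.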
    

    mode() returns the most frequent value (the mode).

    In [20]: s.mode()
    Out[20]:
    0    6
    dtype: int32
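
    Two details are easy to miss here: value_counts() can return relative frequencies instead of raw counts, and mode() returns a Series because several values may tie for most frequent. A small sketch with made-up data (the series below is an assumption, not the random data used above):

```python
import pandas as pd

s = pd.Series([0, 1, 1, 2, 2, 3])  # hypothetical data; 1 and 2 both appear twice

counts = s.value_counts()                 # absolute frequencies, most frequent first
shares = s.value_counts(normalize=True)   # relative frequencies that sum to 1
modes = s.mode()                          # every most-frequent value, sorted ascending

print(counts)
print(shares)
print(modes)
```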
    

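    Binning data

    pd.cut(arr, n) splits the range of the data into n equal-width bins; pd.qcut(arr, [...]) bins the data at the given quantiles. See the examples.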

    In [25]: arr = np.random.randn(20)
    
    In [26]: factor = pd.cut(arr, 4) # four equal-width bins
    
    In [27]: factor # the result shows which interval each value falls into
    Out[27]:
    [(-0.356, 0.695], (-0.356, 0.695], (-0.356, 0.695], (-0.356, 0.695], (-0.356, 0.695], ..., (-1.412, -0.356], (-0.356, 0.695], (-1.412, -0.356], (-1.412, -0.356], (0.695, 1.747]]
    Length: 20
    Categories (4, object): [(-1.412, -0.356] < (-0.356, 0.695] < (0.695, 1.747] < (1.747, 2.799]]
    
    In [28]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])
    
    In [29]: factor
    Out[29]:
    [(-1, 0], (0, 1], (0, 1], (0, 1], (0, 1], ..., (-1, 0], (0, 1], (-1, 0], (-5, -1], (0, 1]]
    Length: 20
    Categories (4, object): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]
    
    # count the number of values in each bin
    In [30]: pd.value_counts(factor)
    Out[30]:
    (0, 1]      7
    (-1, 0]     7
    (1, 5]      3
    (-5, -1]    3
    dtype: int64
    
    In [31]: factor2 = pd.qcut(arr, [0, .25, .5, .75, 1])
    
    In [32]: factor2
    Out[32]:
    [(-0.392, 0.0332], (0.0332, 0.663], (0.0332, 0.663], (0.0332, 0.663], (0.0332, 0.663], ..., [-1.408, -0.392], (0.0332, 0.663], [-1.408, -0.392], [-1.408, -0.392], (0.663, 2.799]]
    Length: 20
    Categories (4, object): [[-1.408, -0.392] < (-0.392, 0.0332] < (0.0332, 0.663] < (0.663, 2.799]]
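
    The difference between the two functions is easiest to see on skewed data: equal-width bins can end up nearly empty, while quantile bins hold roughly equal numbers of points. A minimal sketch, assuming a small hand-picked array with one outlier (not the random data above):

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data: seven small values and one large outlier.
values = np.array([1, 2, 3, 4, 5, 6, 7, 100])

equal_width = pd.cut(values, 4)                        # 4 bins of equal width over [min, max]
equal_count = pd.qcut(values, [0, .25, .5, .75, 1])    # bin edges placed at the quartiles

# Count how many values land in each bin, keeping the bins in interval order.
width_counts = pd.Series(equal_width).value_counts(sort=False)
quart_counts = pd.Series(equal_count).value_counts(sort=False)

print(width_counts)
print(quart_counts)
```

    With cut, seven of the eight values share the first bin; with qcut, each quartile bin holds two values.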
    

    Percentage change

    Percentage change (percentage increase): (new value - previous value) / previous value

    In [19]: ser = pd.Series(np.random.randn(8))
    
    In [20]: ser
    Out[20]:
    0   -0.299098
    1   -0.083835
    2   -0.899787
    3    1.331207
    4   -0.931809
    5    0.363284
    6   -1.324239
    7   -0.187943
    dtype: float64
    
    In [21]: ser.pct_change()
    Out[21]:
    0         NaN
    1   -0.719708
    2    9.732820
    3   -2.479470
    4   -1.699973
    5   -1.389869
    6   -4.645190
    7   -0.858075
    dtype: float64
    
    In [22]: ser.pct_change(periods=3)
    Out[22]:
    0          NaN
    1          NaN
    2          NaN
    3    -5.450733
    4    10.114794
    5    -1.403744
    6    -1.994765
    7    -0.798303
    dtype: float64
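
    As a sanity check on the definition, pct_change() can be compared with the formula written out by hand using shift(). The series `prices` below is a made-up example chosen so the arithmetic is easy to follow.

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0, 120.0])  # hypothetical values

by_method = prices.pct_change()
# Same thing spelled out: (new - previous) / previous
by_formula = (prices - prices.shift(1)) / prices.shift(1)

# periods=3 compares each value with the one three positions earlier.
three_back = prices.pct_change(periods=3)

print(by_method)
print(three_back)
```

    The first `periods` entries are NaN because they have no earlier value to compare against.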
    

    Covariance

    In [5]: s1 = pd.Series(np.random.randn(1000))
    
    In [6]: s2 = pd.Series(np.random.randn(1000))
    
    In [7]: s1.cov(s2)
    
    In [8]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
    
    In [9]: frame.cov()
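
    The calls above are shown without output. The sketch below repeats them with a seeded generator and checks two properties that follow from the definition: Series.cov() agrees with NumPy's sample covariance, and the diagonal of DataFrame.cov() holds each column's variance. The seed and the use of np.random.default_rng are choices made here, not part of the original.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the run is repeatable
s1 = pd.Series(rng.standard_normal(1000))
s2 = pd.Series(rng.standard_normal(1000))

pair_cov = s1.cov(s2)  # a single number
numpy_cov = np.cov(s1.to_numpy(), s2.to_numpy())[0, 1]  # same quantity via NumPy (ddof=1)

frame = pd.DataFrame(rng.standard_normal((1000, 5)), columns=list("abcde"))
cov_matrix = frame.cov()  # 5x5 symmetric matrix of pairwise covariances

print(pair_cov, numpy_cov)
print(cov_matrix)
```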
    

    Correlation

    Available methods: pearson (default), kendall, spearman

    In [3]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
    
    In [4]: frame.loc[::2] = np.nan   # .ix was removed in pandas 1.0; .loc does the same here
    
    In [5]: frame.head()
    Out[5]:
              a         b         c         d         e
    0       NaN       NaN       NaN       NaN       NaN
    1  1.323677  0.000110 -1.333564  0.099595  0.151210
    2       NaN       NaN       NaN       NaN       NaN
    3 -0.038058 -0.559302  1.011685 -0.796040  0.123981
    4       NaN       NaN       NaN       NaN       NaN
    
    # Correlation between two columns, i.e. between two Series.
    In [6]: frame['a'].corr(frame['b'])
    Out[6]: 0.043408577177721466
    
    # Specifying a different method gives a different result.
    In [7]: frame['a'].corr(frame['b'], method='spearman')
    Out[7]: 0.045387541550166201
    
    # Correlation over the whole DataFrame: a matrix of pairwise correlation coefficients.
    In [8]: frame.corr()
    Out[8]:
              a         b         c         d         e
    a  1.000000  0.043409 -0.044429 -0.062085  0.059482
    b  0.043409  1.000000 -0.068241  0.061163 -0.019741
    c -0.044429 -0.068241  1.000000 -0.028021 -0.066897
    d -0.062085  0.061163 -0.028021  1.000000 -0.034094
    e  0.059482 -0.019741 -0.066897 -0.034094  1.000000
    

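    The min_periods parameter sets the minimum number of non-null observations required for each pair; if fewer are available, the result is NA. It comes up again in the next post on rolling, so it is not covered in detail here.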
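
    A minimal sketch of the effect, using a small hand-made frame (the columns `a`, `b`, `c` and their values are assumptions for illustration). Column `b` shares only two complete rows with `a`, so raising the threshold to 3 turns that coefficient into NaN while the a/c coefficient is unaffected.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'b' is mostly missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, np.nan, np.nan, np.nan, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

loose = df.corr()                  # uses whatever complete pairs exist
strict = df.corr(min_periods=3)    # requires at least 3 complete pairs per column pair

print(loose)
print(strict)
```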

    Another way to compute correlations is corrwith(), which correlates the matching columns (or rows) of two DataFrames.

    In [25]: index = ['a', 'b', 'c', 'd', 'e']
    
    In [26]: columns = ['one', 'two', 'three', 'four']
    
    In [27]: df1 = pd.DataFrame(np.random.randn(5, 4), index=index, columns=columns)
    
    In [28]: df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)
    
    In [29]: df1.corrwith(df2)
    Out[29]: 
    one     -0.125501
    two     -0.493244
    three    0.344056
    four     0.004183
    dtype: float64
    
    In [30]: df2.corrwith(df1, axis=1)
    Out[30]: 
    a   -0.675817
    b    0.458296
    c    0.190809
    d   -0.186275
    e         NaN
    dtype: float64
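
    To make clear what corrwith() is pairing up, the sketch below rebuilds two frames of the same shapes with a seeded generator (the seed is an arbitrary choice made here) and checks one column by hand: the result for 'one' matches Series.corr() on the rows the two frames share.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # seeded so the run is repeatable
cols = ["one", "two", "three", "four"]
df1 = pd.DataFrame(rng.standard_normal((5, 4)), index=list("abcde"), columns=cols)
df2 = pd.DataFrame(rng.standard_normal((4, 4)), index=list("abcd"), columns=cols)

by_column = df1.corrwith(df2)   # one coefficient per shared column
# The same number for column 'one', computed on the shared rows a..d only.
manual = df1.loc[df2.index, "one"].corr(df2["one"])

by_row = df1.corrwith(df2, axis=1)  # one coefficient per row label; 'e' has no partner row

print(by_column)
print(manual)
print(by_row)
```

    Row 'e' exists only in df1, so its entry is NaN, as in the output above.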
    

    rank() computes the rank of each value; tied values receive the average of their ranks by default.

    In [31]: s = pd.Series(np.random.randn(5), index=list('abcde'))
    
    In [32]: s['d'] = s['b'] # so there's a tie
    
    In [33]: s.rank()
    Out[33]: 
    a    5.0
    b    2.5
    c    1.0
    d    2.5
    e    4.0
    dtype: float64
    
    In [29]: df = pd.DataFrame(np.random.randn(10, 6))
        ...: df.rank(1)  # axis=1 ranks across the columns, i.e. ranks the values within each row
    In [30]: df.rank(0)  # axis=0 is the default: ranks the values within each column
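
    How ties are resolved depends on the method argument. A short sketch with hand-picked values (the series below is an assumption, chosen so that 'a' and 'c' tie):

```python
import pandas as pd

s = pd.Series([0.3, -1.2, 0.3, 2.0], index=list("abcd"))  # 'a' and 'c' are tied

avg = s.rank()                   # default method='average': ties share the mean rank
low = s.rank(method="min")       # ties all get the lowest rank in the group
first = s.rank(method="first")   # ties are broken by order of appearance
desc = s.rank(ascending=False)   # largest value gets rank 1

print(pd.DataFrame({"average": avg, "min": low, "first": first, "descending": desc}))
```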
    
