美文网首页学习提升模版Python
传统统计分析在Python中的使用

传统统计分析在Python中的使用

作者: 数据人阿多 | 来源:发表于2020-01-14 14:17 被阅读0次

    背景

    大家都知道现在大数据非常火爆,在大数据还没有出现时,用的都是“小数据”,这些“小数据”在分析时大部分用的都是Excel、SPSS等工具,直到现在把Excel运用的很熟练的人,仍然很受青睐。但是Python的到来,使处理大数据比较方便。那么之前在Excel、SPSS中的描述统计、假设检验在Python中怎么使用呢?下面将进行详细介绍,前提是已经掌握numpy、pandas这两个库,并且对统计知识有所了解

    加载数据

    这里引用的是GitHub上的一个CSV数据文件,insurance.csv 保险数据
    ,可以下载下来参考练习
    网址:https://github.com/stedy/Machine-Learning-with-R-datasets

    >>> import numpy as np
    >>> import pandas as pd
    >>> data=pd.read_csv('F:/Machine-Learning datasets/insurance.csv')
    >>> data
          age     sex     bmi  children smoker     region      charges
    0      19  female  27.900         0    yes  southwest  16884.92400
    1      18    male  33.770         1     no  southeast   1725.55230
    2      28    male  33.000         3     no  southeast   4449.46200
    3      33    male  22.705         0     no  northwest  21984.47061
    4      32    male  28.880         0     no  northwest   3866.85520
    ...   ...     ...     ...       ...    ...        ...          ...
    1333   50    male  30.970         3     no  northwest  10600.54830
    1334   18  female  31.920         0     no  northeast   2205.98080
    1335   18  female  36.850         0     no  southeast   1629.83350
    1336   21  female  25.800         0     no  southwest   2007.94500
    1337   61  female  29.070         0    yes  northwest  29141.36030
    
    [1338 rows x 7 columns]
    >>> data.isnull().sum()
    age         0
    sex         0
    bmi         0
    children    0
    smoker      0
    region      0
    charges     0
    dtype: int64
    >>> data.dtypes
    age           int64
    sex          object
    bmi         float64
    children      int64
    smoker       object
    region       object
    charges     float64
    dtype: object
    

    描述统计各指标

    • 最大值、最小值

    >>> data['age'].max()
    64
    >>> data['age'].min()
    18
    
    • 均值

    >>> data['age'].mean()
    39.20702541106129
    
    • 方差

    >>> data['age'].std()
    14.049960379216172
    
    • 中位数

    >>> data['age'].median()
    39.0
    
    • 众数

    众数有时会有多个,这里是全部返回

    >>> data['age'].mode()
    0    18
    dtype: int64
    
    >>> d=pd.DataFrame([1,1,1,2,2,2,3,4,4,4,5,5,6],columns=['a'])
    >>> d['a'].mode()
    0    1
    1    2
    2    4
    dtype: int64
    
    • 分位数

    >>> data['age'].quantile([0,0.25,0.5,0.75,1])
    0.00    18.0
    0.25    27.0
    0.50    39.0
    0.75    51.0
    1.00    64.0
    Name: age, dtype: float64
    
    • 频数统计

    >>> data['region'].value_counts()
    southeast    364
    northwest    325
    southwest    325
    northeast    324
    Name: region, dtype: int64
    
    • 极差

    >>> data['age'].max()-data['age'].min()
    46
    
    • 四分位差

    >>> data['age'].quantile(0.75)-data['age'].quantile(0.25)
    24.0
    
    • 变异系数

    >>> data['age'].std()/data['age'].mean()
    0.3583531326824994
    

    假设检验

    • 单样本t检验

    >>> data
          age     sex     bmi  children smoker     region      charges
    0      19  female  27.900         0    yes  southwest  16884.92400
    1      18    male  33.770         1     no  southeast   1725.55230
    2      28    male  33.000         3     no  southeast   4449.46200
    3      33    male  22.705         0     no  northwest  21984.47061
    4      32    male  28.880         0     no  northwest   3866.85520
    ...   ...     ...     ...       ...    ...        ...          ...
    1333   50    male  30.970         3     no  northwest  10600.54830
    1334   18  female  31.920         0     no  northeast   2205.98080
    1335   18  female  36.850         0     no  southeast   1629.83350
    1336   21  female  25.800         0     no  southwest   2007.94500
    1337   61  female  29.070         0    yes  northwest  29141.36030
    
    [1338 rows x 7 columns]
    >>> data['age'].mean()
    39.20702541106129
    >>> 
    >>> import statsmodels.api as sm   #加载分析库
    >>> t=sm.stats.DescrStatsW(data['age'])      #构造统计量对象
    >>> t.ttest_mean(38)    #t检验,假设总体均值为38,返回t值、P值、自由度
    (3.142457193279878, 0.0017121567548687802, 1337.0)
    >>> t.ttest_mean(39)   #假设总体均值为39,p>0.05,接受原假设
    (0.5389849179805168, 0.5899869939488361, 1337.0)
    >>> t.ttest_mean(18)    #假设总体均值为18,p<0.05,小于0.05拒绝原假设
    (55.21190269926711, 0.0, 1337.0)
    
    • 双样本t检验

    >>> data.groupby('sex').mean()['charges']
    sex
    female    12569.578844
    male      13956.751178
    Name: charges, dtype: float64
    >>> sex0=data[data['sex']=='female']['charges']
    >>> sex1=data[data['sex']=='male']['charges']
    >>> from scipy import stats
    >>> leveneTestRes=stats.levene(sex0,sex1)  #方差齐性检验
    >>> leveneTestRes   #p值<0.05,说明方差非齐性
    LeveneResult(statistic=9.90925122305512, pvalue=0.0016808765833903443)
    >>> stats.stats.ttest_ind(sex0,sex1,equal_var=False)   #双样本T检验
    Ttest_indResult(statistic=-2.1008878232359565, pvalue=0.035841014956016645)
    

    相关系数

    >>> data[['age','charges']].corr(method='pearson')   #皮尔逊相关系数
                  age   charges
    age      1.000000  0.299008
    charges  0.299008  1.000000
    >>> data[['age','charges']].corr(method='spearman')   #斯皮尔曼等级相关系数
                  age   charges
    age      1.000000  0.534392
    charges  0.534392  1.000000
    >>> data[['age','charges']].corr(method='kendall')   #肯德尔相关系数
                  age   charges
    age      1.000000  0.475302
    charges  0.475302  1.000000
    

    相关文章

      网友评论

        本文标题:传统统计分析在Python中的使用

        本文链接:https://www.haomeiwen.com/subject/gsrnactx.html