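This week's task is to determine whether a dataset follows a normal distribution.
This article only considers whether the data are normally distributed; what the data mean is out of scope.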
- Dataset URL: http://jse.amstat.org/datasets/normtemp.dat.txt
- Dataset description: three columns in total: body temperature, gender, heart rate
- Detailed dataset description: Journal of Statistics Education, V4N2: Shoemaker
Approach:
- Take a rough look at the shape of the distribution
- Call functions from scientific computing packages to test whether the data are normally distributed
  - kstest https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html
  - shapiro https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html?highlight=shapiro
  - normaltest https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html?highlight=normaltest#scipy.stats.normaltest
  - lilliefors https://www.statsmodels.org/devel/generated/statsmodels.stats.diagnostic.lilliefors.html
  - anderson https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html
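The first step, a rough look at the shape, can be sketched without a plotting library. The example below is a minimal sketch on synthetic data: the seeded `sample` array, the helper `text_histogram`, and its `bins` and `width` parameters are assumptions made for illustration and are not part of the original article. It prints a text histogram and uses `scipy.stats.probplot` to summarize how closely the Q-Q points follow a straight line.

```python
import numpy as np
from scipy import stats

# Assumption: synthetic stand-in data; replace with the real temperature column
rng = np.random.default_rng(0)
sample = rng.normal(loc=98.2, scale=0.7, size=130)

def text_histogram(values, bins=10, width=40):
    """Return one text bar per histogram bin (hypothetical helper)."""
    counts, edges = np.histogram(values, bins=bins)
    scale = width / counts.max()
    lines = []
    for count, left, right in zip(counts, edges[:-1], edges[1:]):
        bar = "#" * int(round(count * scale))
        lines.append(f"{left:7.2f} .. {right:7.2f} | {bar} {count}")
    return lines

hist_lines = text_histogram(sample)
print("\n".join(hist_lines))

# Q-Q summary: r close to 1 means the ordered data line up with normal quantiles
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print(f"Q-Q correlation r = {r:.4f}")
```

A roughly bell-shaped histogram and an r value near 1 only suggest normality; the formal tests are still needed for a decision.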
Preparation: load the data
import numpy as np
data = np.loadtxt('temp_data.txt')
temp = data[:,0]
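For readers without the local `temp_data.txt` file, the loading step can be tried on an in-memory buffer. This is a sketch only: the three rows are made-up values in the three-column layout described above (temperature, gender, heart rate), and the names `raw`, `gender`, and `heart_rate` are assumptions for illustration.

```python
import io
import numpy as np

# Made-up illustrative rows: temperature, gender, heart rate
raw = io.StringIO("98.6 1 72\n97.9 2 68\n98.2 1 75\n")

# np.loadtxt splits on whitespace by default and returns a 2-D float array
data = np.loadtxt(raw)

# Column 0 holds the temperatures, as in the article's `temp = data[:,0]`
temp, gender, heart_rate = data[:, 0], data[:, 1], data[:, 2]
print(data.shape, temp.mean())
```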
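Testing for normality
- kstest
from scipy import stats

def check_normality(testData):
    # The Kolmogorov-Smirnov test is applied only to large samples (n > 300);
    # for smaller samples the function falls through and returns None
    if len(testData) > 300:
        # Pass the sample mean and std: plain 'norm' means the standard normal N(0, 1)
        params = (np.mean(testData), np.std(testData, ddof=1))
        p_value = stats.kstest(testData, 'norm', args=params)[1]
        if p_value < 0.05:
            print("use kstest:")
            print("data are not normally distributed")
            return False
        else:
            print("use kstest:")
            print("data are normally distributed")
            return True

# Check
print(check_normality(temp))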
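One detail of `kstest` is easy to miss: passing `'norm'` alone compares the data with the standard normal N(0, 1), not with a normal distribution fitted to the data. The sketch below illustrates the difference on synthetic data (the seeded `sample` and the variable names are assumptions for illustration). Note also that estimating the mean and standard deviation from the same sample makes the KS p-value conservative, which is the case the Lilliefors test is designed for.

```python
import numpy as np
from scipy import stats

# Assumption: synthetic normal data centered far from 0, like body temperatures
rng = np.random.default_rng(0)
sample = rng.normal(loc=98.2, scale=0.7, size=500)

# Compared with N(0, 1): rejected, although the data are normal
p_standard = stats.kstest(sample, "norm")[1]

# Compared with a normal using the sample's own mean and standard deviation
params = (sample.mean(), sample.std(ddof=1))
p_fitted = stats.kstest(sample, "norm", args=params)[1]

print(f"p against N(0, 1):       {p_standard:.3g}")
print(f"p against fitted normal: {p_fitted:.3g}")
```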
- shapiro
For sample sizes below 50, use the Shapiro-Wilk test
from scipy import stats

def check_normality(testData):
    # Shapiro-Wilk is used here only for small samples (n < 50);
    # otherwise the function falls through and returns None
    if len(testData) < 50:
        p_value = stats.shapiro(testData)[1]
        if p_value < 0.05:
            print("use shapiro:")
            print("data are not normally distributed")
            return False
        else:
            print("use shapiro:")
            print("data are normally distributed")
            return True

# Check
print(check_normality(temp))
- normaltest
For sample sizes in (20, 50), use normaltest to check for normality
from scipy.stats import normaltest

def check_normality(testData):
    # normaltest is used here only when 20 < n < 50;
    # otherwise the function falls through and returns None
    if 20 < len(testData) < 50:
        p_value = normaltest(testData)[1]
        if p_value < 0.05:
            print("use normaltest:")
            print("data are not normally distributed")
            return False
        else:
            print("use normaltest:")
            print("data are normally distributed")
            return True

# Check
print(check_normality(temp))
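It may help to see what `normaltest` computes. According to the SciPy documentation, its statistic is the sum of the squared z-scores from `skewtest` and `kurtosistest`, and the p-value comes from a chi-squared distribution with 2 degrees of freedom. The sketch below checks this on synthetic data (the seeded `sample` and the variable names are assumptions for illustration).

```python
import numpy as np
from scipy import stats

# Assumption: synthetic sample with a size inside the (20, 50) range used above
rng = np.random.default_rng(42)
sample = rng.normal(loc=98.2, scale=0.7, size=40)

statistic, p_value = stats.normaltest(sample)

# Rebuild the statistic from its two components
z_skew = stats.skewtest(sample).statistic
z_kurt = stats.kurtosistest(sample).statistic
combined = z_skew**2 + z_kurt**2

# The p-value is the upper tail of a chi-squared distribution with 2 df
chi2_p = stats.chi2.sf(combined, df=2)

print(f"normaltest: statistic={statistic:.4f}, p={p_value:.4f}")
print(f"rebuilt:    statistic={combined:.4f}, p={chi2_p:.4f}")
```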
- lilliefors
This test is suitable for sample sizes in [50, 300]
from statsmodels.stats.diagnostic import lilliefors

def check_normality(testData):
    # Lilliefors is used here only when 50 <= n <= 300;
    # otherwise the function falls through and returns None
    if 300 >= len(testData) >= 50:
        p_value = lilliefors(testData)[1]
        if p_value < 0.05:
            print("use lilliefors:")
            print("data are not normally distributed")
            return False
        else:
            print("use lilliefors:")
            print("data are normally distributed")
            return True

# Check
print(check_normality(temp))
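The four functions above share one name and each handles only its own sample-size range, so each call returns None when the sample falls outside that range. One possible way to combine them is a single dispatcher, sketched below. The names `pick_test` and `normality_p_value` are hypothetical, and collapsing the overlapping ranges into three branches (Shapiro-Wilk below 50, Lilliefors for 50 to 300, KS above 300) is an assumption, not the article's stated rule. `statsmodels` is imported lazily, so it is needed only for the middle branch.

```python
import numpy as np
from scipy import stats

def pick_test(n):
    """Return the test name for sample size n (thresholds taken from this article)."""
    if n < 50:
        return "shapiro"
    if n <= 300:
        return "lilliefors"
    return "kstest"

def normality_p_value(values):
    """Run the size-appropriate test and return (test name, p-value)."""
    values = np.asarray(values, dtype=float)
    name = pick_test(len(values))
    if name == "shapiro":
        p = stats.shapiro(values)[1]
    elif name == "lilliefors":
        # Lazy import: statsmodels is required only for this branch
        from statsmodels.stats.diagnostic import lilliefors
        p = lilliefors(values)[1]
    else:
        # Fitted mean and std, since plain 'norm' would mean N(0, 1)
        p = stats.kstest(values, "norm", args=(values.mean(), values.std(ddof=1)))[1]
    return name, float(p)

# Usage on synthetic data (assumption: seeded normal samples)
rng = np.random.default_rng(1)
for n in (30, 500):
    name, p = normality_p_value(rng.normal(98.2, 0.7, size=n))
    print(n, name, round(p, 3), "normal" if p >= 0.05 else "not normal")
```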
- anderson
from scipy.stats import anderson
anderson(temp)
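Unlike the tests above, `anderson` does not return a single p-value here; its result carries a statistic plus critical values at several significance levels. The sketch below shows one way to read it, on synthetic data (the seeded `sample` and the `decisions` list are assumptions for illustration): normality is rejected at a given level when the statistic exceeds the matching critical value.

```python
import numpy as np
from scipy import stats

# Assumption: synthetic stand-in for the temperature column
rng = np.random.default_rng(7)
sample = rng.normal(loc=98.2, scale=0.7, size=130)

result = stats.anderson(sample, dist="norm")
print(f"Anderson-Darling statistic = {result.statistic:.3f}")

# Compare the statistic with each critical value
decisions = []
for crit, sig in zip(result.critical_values, result.significance_level):
    rejected = bool(result.statistic > crit)
    decisions.append((float(sig), rejected))
    verdict = "reject normality" if rejected else "cannot reject normality"
    print(f"significance level {sig}%: critical value {crit:.3f}, {verdict}")
```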
References:
- SciPy documentation: https://docs.scipy.org/doc/
- Post shared by 夜跑 on 知识星球 (Knowledge Planet): https://blog.csdn.net/YEPAO01/article/details/99197487
- 知识星球 (Knowledge Planet): post shared by 追寻原风景
- anderson https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html