pandas Examples - Stats - Wind Statistics

Author: 橘猫吃不胖 | Published: 2020-05-13 17:38

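Continuing the earlier exercises; for the previous articles, see:

The separator in this dataset is a little unusual.

Columns are separated by one, two, or three spaces, and I don't know whether other variants appear further down, so the file needs some care when loading.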

import pandas as pd
df = pd.read_csv(data_path, sep=r'\s+', skiprows=1)

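sep accepts a regular expression. As for skiprows, blank lines appear to be skipped by default even without it (read_csv defaults to skip_blank_lines=True).

One issue: the first three columns of the data are year, month, and day, and we need to convert them into a date.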

df2 = pd.read_csv(data_path, sep=r'\s+', skiprows=1, parse_dates=[[0, 1, 2]])

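This automatically combines the three columns into a single date. I do have a question, though: how does the system know whether 61 means 1961 or 2061? So I think the data is slightly problematic. In any case, it was parsed as 2061 by default here.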
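
On the 1961-versus-2061 question, here is a minimal sketch of how an explicit %y format resolves two-digit years. It uses made-up date strings rather than the wind data, and the read_csv call above may take a different parsing path, so treat it only as an illustration of the POSIX convention that %y follows.

```python
import pandas as pd

# With an explicit %y format, two-digit years follow the POSIX convention:
# 00-68 map to 2000-2068, and 69-99 map to 1969-1999.
early = pd.to_datetime("61-01-01", format="%y-%m-%d")
late = pd.to_datetime("69-01-01", format="%y-%m-%d")

print(early.year)  # 2061
print(late.year)   # 1969
```

So a two-digit 61 cannot come out as 1961 under this rule, which is why a fix-up step is needed afterwards.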

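I had also wanted to load the data first and convert the dates afterwards, but I kept running into a problem:

The error says the format can't be recognized, even though I explicitly specified the format %y%m%d.

Noting this here; I'll look into it later.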
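
I have not confirmed the cause of that error. One plausible explanation (an assumption on my part) is that the year, month, and day were joined without zero padding, so 61, 1, 1 becomes "6111", which does not match %y%m%d. A sketch of two load-then-convert approaches on a toy frame follows; the frame and all variable names are hypothetical. It also avoids the nested-list form of parse_dates, which, as far as I know, newer pandas releases deprecate.

```python
import pandas as pd

# Toy stand-in for the first three columns of the wind data.
raw = pd.DataFrame({"Yr": [61, 61, 78], "Mo": [1, 1, 12], "Dy": [1, 2, 31]})

# Option 1: zero-pad each part so "%y%m%d" always sees six digits.
parts = raw.astype(str).apply(lambda col: col.str.zfill(2))
stamps = parts["Yr"] + parts["Mo"] + parts["Dy"]
dates_from_strings = pd.to_datetime(stamps, format="%y%m%d")

# Option 2: skip strings and pass year/month/day components directly.
# Adding 1900 assumes every year in the data belongs to the 20th century.
components = pd.DataFrame(
    {"year": raw["Yr"] + 1900, "month": raw["Mo"], "day": raw["Dy"]}
)
dates_from_components = pd.to_datetime(components)

print(dates_from_strings.dt.year.tolist())
print(dates_from_components.dt.year.tolist())
```

Option 1 still maps 61 to 2061 because of the %y rule, while option 2 produces 1961 directly and makes the year fix unnecessary.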

1. Year 2061? Do we really have data from this year? Create a function to fix it and apply it

Fine, the exercise author wants this fixed. I still think the raw data should simply store the full four-digit year.

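import datetime

def fix_year(x):
    # Shift any year after 1989 back by a century (e.g. 2061 -> 1961)
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)

# fix_year returns datetime.date objects, so the column is object dtype until converted back
df2['Yr_Mo_Dy'] = df2['Yr_Mo_Dy'].apply(fix_year)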
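
As an alternative to apply, here is a vectorized sketch of the same rule. It is not the exercise's official answer; the sample dates are made up, and the 1989 cutoff is simply carried over from fix_year.

```python
import pandas as pd

# Made-up dates: two wrongly parsed into the 2000s, one already correct.
dates = pd.Series(pd.to_datetime(["2061-01-01", "2078-12-31", "1978-12-31"]))

# Keep dates at or before 1989; move everything else back 100 years.
needs_fix = dates.dt.year > 1989
fixed = dates.where(~needs_fix, dates - pd.DateOffset(years=100))

print(fixed.dt.year.tolist())  # [1961, 1978, 1978]
```

This keeps the column as a datetime64 dtype throughout, so no conversion back from Python date objects is needed.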

2. Set the right dates as the index. Pay attention at the data type, it should be datetime64[ns]

There are two steps here: convert the column's type, then set the date column as the index.

df2['Yr_Mo_Dy'] = pd.to_datetime(df2['Yr_Mo_Dy'])
df2.dtypes

df2.set_index('Yr_Mo_Dy' , inplace=True)

On setting a column as the index, see: "Pandas Examples: Setting a Column as the Index"

3. Compute how many values are missing for each location over the entire record

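I really didn't know how to do this one. I had never used the functions for counting missing values, so I went to look them up first.
It turns out to be simple; see: "pandas Missing-Value Functions: isna"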

df2.isna().sum()
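
To see what this returns, here is a small sketch on a toy frame. The column names are borrowed for illustration and the values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "RPT": [15.04, np.nan, 18.50],
        "VAL": [14.96, 9.29, np.nan],
        "ROS": [13.17, 10.00, 11.00],
    }
)

# isna() marks each missing cell as True; sum() counts the Trues per column.
missing_per_location = toy.isna().sum()

# The complement: how many usable readings each column has.
present_per_location = toy.notna().sum()

print(missing_per_location)
print(present_per_location)
```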

4. Calculate the mean windspeeds of the windspeeds over all the locations and all the times

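I didn't fully understand this one at first: is it asking for the mean windspeed over all locations and all times?

After looking at the answer I think I get it. Because it covers every location and every time, it amounts to summing all the values and dividing by the number of records.

The thing to watch is how the records are counted: all cells, or only the non-missing ones? That depends on the business context; sometimes missing cells should count and sometimes they should not.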

df2.sum().sum() / df2.notna().sum().sum()
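
The two ways of counting give different answers as soon as any value is missing. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"A": [1.0, 2.0, np.nan], "B": [4.0, 5.0, 6.0]})

# Divide by the number of non-missing cells: 18 / 5.
mean_skipping_nan = toy.sum().sum() / toy.notna().sum().sum()

# Same result through NumPy, which ignores NaN in nanmean.
mean_nanmean = np.nanmean(toy.to_numpy())

# Divide by every cell, treating missing values as zero: 18 / 6.
mean_nan_as_zero = toy.fillna(0).to_numpy().mean()

print(mean_skipping_nan, mean_nanmean, mean_nan_as_zero)
```

Which one is right depends on whether a missing reading should be treated as "no data" or as a zero.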

5. Create a DataFrame called loc_stats and calculate the min, max and mean windspeeds and standard deviations of the windspeeds at each location over all the days

loc_stats = df2.describe()

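One more note here, on the parameters of describe:

The percentiles parameter controls which percentiles are reported. The exercise doesn't ask for them, which is why the answer uses this parameter; I simply ignored it here.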
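
If only the four requested statistics are wanted, agg can list them explicitly instead of trimming describe. This is a sketch on made-up values, not the exercise's official answer.

```python
import pandas as pd

toy = pd.DataFrame({"RPT": [10.0, 12.0, 14.0], "VAL": [8.0, 9.0, 13.0]})

# One row per statistic, one column per location.
loc_stats = toy.agg(["min", "max", "mean", "std"])

print(loc_stats)
```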

6. Create a DataFrame called day_stats and calculate the min, max and mean windspeed and standard deviations of the windspeeds across all the locations at each day

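This exercise is similar to the previous one: one aggregates per column, the other per row (per index entry).
The previous exercise used describe, but describe does not summarize row-wise, so here we can use agg directly with axis=1.

day_stats = df2.agg(['min', 'max', 'mean', 'std'], axis=1)
day_stats.head()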
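
The same table can also be assembled from individual row-wise reductions, which makes it explicit what each column is. A sketch with made-up values:

```python
import pandas as pd

dates = pd.to_datetime(["1961-01-01", "1961-01-02"])
toy = pd.DataFrame(
    {"RPT": [10.0, 12.0], "VAL": [14.0, 8.0], "ROS": [12.0, 10.0]},
    index=dates,
)

# axis=1 reduces across the locations within each day.
day_stats = pd.DataFrame(
    {
        "min": toy.min(axis=1),
        "max": toy.max(axis=1),
        "mean": toy.mean(axis=1),
        "std": toy.std(axis=1),
    }
)

print(day_stats)
```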

7. Find the average windspeed in January for each location.

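For each location, i.e. each column, compute the mean over January.

How do we pick out January? The DatetimeIndex exposes a month attribute, so we can filter on it.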

df2.loc[df2.index.month == 1].mean()

I originally wanted to use query here, but it didn't work; it apparently couldn't resolve the month attribute.
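
A small sketch of the boolean-mask approach on made-up data, showing that Januaries from different years are pooled together:

```python
import pandas as pd

dates = pd.to_datetime(["1961-01-01", "1961-02-01", "1962-01-15"])
toy = pd.DataFrame(
    {"RPT": [10.0, 99.0, 20.0], "VAL": [4.0, 99.0, 6.0]},
    index=dates,
)

# index.month is an array of month numbers; comparing it to 1 gives a row mask.
is_january = toy.index.month == 1
january_mean = toy.loc[is_january].mean()

print(january_mean)
```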

8. Downsample the record to a yearly frequency for each location.

This is essentially resampling a time series. I remember writing about it before; see:

"pandas: Time Series Resampling with resample"

df2.resample('Y').mean()

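The downside is that each label is the last day of the year; I don't yet know how to display just the year directly. (In newer pandas releases, 2.2 onward as far as I know, the alias 'Y' for resample is deprecated in favor of 'YE'.)

The answer isn't written this way; the original author used the following (the 'A' alias is likewise an older spelling of 'Y' for periods):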

df2.groupby(df2.index.to_period('A')).mean()
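
For labels that show only the year, grouping on the index works. Here is a sketch with made-up values and two variants: plain integer years, and yearly periods using the 'Y' alias.

```python
import pandas as pd

dates = pd.to_datetime(["1961-03-01", "1961-09-01", "1962-05-01"])
toy = pd.DataFrame({"RPT": [10.0, 20.0, 30.0]}, index=dates)

# Variant 1: group by the integer year; the result index is 1961, 1962.
by_year = toy.groupby(toy.index.year).mean()

# Variant 2: group by yearly Period; labels print as '1961', '1962'.
by_period = toy.groupby(toy.index.to_period("Y")).mean()

print(by_year)
print(by_period)
```

The integer-year version loses the time-series index type, while the period version keeps it, so the choice depends on what comes next in the analysis.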

That's it for this post. There are a few more similar exercises after this, but we'll skip them for now.

Article link: https://www.haomeiwen.com/subject/lsxinhtx.html