This content comes from the DataCamp course pandas Foundations.
The data and code are on GitHub.
Data:
Dataset 1
weather_data_austin_2010:
Weather conditions in Austin for 2010.
[Figure: output of df.head()]
To make later steps easier, use Date as the index.
df.Date=pd.to_datetime(df.Date)
df.index=df.Date
df=df.drop(['Date'],axis=1)
df.head()
[Figure: output of df.head() with Date as the index]
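The index step above can be tried on a tiny stand-in table. This is only a sketch: `df_demo` and its values are made up for illustration, and `set_index` is used as a one-step alternative to assigning the index and then dropping the column.

```python
import pandas as pd

# Hypothetical stand-in for the Austin data; only the column names mirror the post.
df_demo = pd.DataFrame({
    "Date": ["2010-01-01 00:00", "2010-01-01 01:00", "2010-01-01 02:00"],
    "Temperature": [46.2, 44.6, 44.1],
})

# Parse the strings into timestamps, then move the column into the index.
df_demo["Date"] = pd.to_datetime(df_demo["Date"])
df_demo = df_demo.set_index("Date")

print(type(df_demo.index).__name__)
print(list(df_demo.columns))
```

With a DatetimeIndex in place, the date-based slicing and resampling used later in the post become available.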
Dataset 2
NOAA_QCLCD_2011_hourly_13904.txt
Weather conditions for 2011. The file has no header row and 44 columns, some of which are dropped later.
[Figure: output of df2.head()]
column_labels='Wban,date,Time,StationType,sky_condition,sky_conditionFlag,visibility,visibilityFlag,wx_and_obst_to_vision,wx_and_obst_to_visionFlag,dry_bulb_faren,dry_bulb_farenFlag,dry_bulb_cel,dry_bulb_celFlag,wet_bulb_faren,wet_bulb_farenFlag,wet_bulb_cel,wet_bulb_celFlag,dew_point_faren,dew_point_farenFlag,dew_point_cel,dew_point_celFlag,relative_humidity,relative_humidityFlag,wind_speed,wind_speedFlag,wind_direction,wind_directionFlag,value_for_wind_character,value_for_wind_characterFlag,station_pressure,station_pressureFlag,pressure_tendency,pressure_tendencyFlag,presschange,presschangeFlag,sea_level_pressure,sea_level_pressureFlag,record_type,hourly_precip,hourly_precipFlag,altimeter,altimeterFlag,junk'
column_labels_list = column_labels.split(',')
df2.columns = column_labels_list
list_to_drop=['sky_conditionFlag', 'visibilityFlag', 'wx_and_obst_to_vision', 'wx_and_obst_to_visionFlag', 'dry_bulb_farenFlag', 'dry_bulb_celFlag', 'wet_bulb_farenFlag', 'wet_bulb_celFlag', 'dew_point_farenFlag', 'dew_point_celFlag', 'relative_humidityFlag', 'wind_speedFlag', 'wind_directionFlag', 'value_for_wind_character', 'value_for_wind_characterFlag', 'station_pressureFlag', 'pressure_tendencyFlag', 'pressure_tendency', 'presschange', 'presschangeFlag', 'sea_level_pressureFlag', 'hourly_precip', 'hourly_precipFlag', 'altimeter', 'record_type', 'altimeterFlag', 'junk']
df2_dropped = df2.drop(list_to_drop,axis='columns')
print(df2_dropped.head())
Only the columns that are not in list_to_drop are kept.
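The same label-then-drop pattern on a small scale, as a rough sketch. The four-column table `raw` and its values are invented for illustration. Only the column names are borrowed from the real label list.

```python
import pandas as pd

# Hypothetical four-column table read without a header row.
raw = pd.DataFrame([
    [13904, 20110101, 53, " "],
    [13904, 20110101, 153, " "],
])

# Split one comma-separated string into labels and assign them.
labels = "Wban,date,Time,junk".split(",")
raw.columns = labels

# Drop the unwanted columns by name.
to_drop = ["junk"]
kept = raw.drop(columns=to_drop)
print(list(kept.columns))
```

Assigning `columns` raises an error when the number of labels does not match the number of columns, which is a cheap check that the label string matches the file.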
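Data cleaning: combine date and Time and use the result as the index. (The code comments keep the course's names df_dropped, df_clean, and df_climate, which correspond to df2_dropped, df2_clean, and df in this post.)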
# Convert the date column to string: df_dropped['date']
df2_dropped['date'] = df2_dropped['date'].astype(str)
# Pad leading zeros to the Time column: df_dropped['Time']
df2_dropped['Time'] = df2_dropped['Time'].apply(lambda x:'{:0>4}'.format(x))
# Concatenate the new date and Time columns: date_string
date_string = df2_dropped['date'] + df2_dropped['Time']
# Convert the date_string Series to datetime: date_times
date_times = pd.to_datetime(date_string, format='%Y%m%d%H%M')
# Set the index to be the new date_times container: df_clean
df2_clean = df2_dropped.set_index(date_times)
# Print the output of df_clean.head()
print(df2_clean.head())
Dataset 2 after cleaning
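To see why the zero padding matters, here is a minimal sketch on made-up rows. `demo` and its values are hypothetical, and `str.zfill(4)` is used as an equivalent of the `'{:0>4}'.format` call for non-negative integers.

```python
import pandas as pd

# Hypothetical rows: the date is stored as an integer, and so is the time (53 means 00:53).
demo = pd.DataFrame({
    "date": [20110101, 20110101, 20110620],
    "Time": [53, 153, 1453],
})

date_str = demo["date"].astype(str)
# Pad to four digits so that 53 becomes "0053" and matches %H%M.
time_str = demo["Time"].astype(str).str.zfill(4)

stamps = pd.to_datetime(date_str + time_str, format="%Y%m%d%H%M")
demo_clean = demo.set_index(stamps)
print(demo_clean.index)
```

Without the padding, "2011010153" does not match the `%Y%m%d%H%M` format, so the explicit format acts as a guard against silently misparsed times.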
Handling missing values: convert the missing values marked as M in the table to NaN.
# Print the dry_bulb_faren temperature between 8 AM and 9 AM on June 20, 2011
print(df2_clean.loc['2011-6-20 8:00:00':'2011-6-20 9:00:00','dry_bulb_faren' ])
# Convert the dry_bulb_faren column to numeric values: df_clean['dry_bulb_faren']
df2_clean['dry_bulb_faren'] = pd.to_numeric(df2_clean['dry_bulb_faren'], errors='coerce')
# Print the transformed dry_bulb_faren temperature between 8 AM and 9 AM on June 20, 2011
print(df2_clean.loc['2011-6-20 8:00:00':'2011-6-20 9:00:00', 'dry_bulb_faren'])
# Convert the wind_speed and dew_point_faren columns to numeric values
df2_clean['wind_speed'] = pd.to_numeric(df2_clean['wind_speed'], errors='coerce')
df2_clean['dew_point_faren'] = pd.to_numeric(df2_clean['dew_point_faren'], errors='coerce')
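A small sketch of what `errors='coerce'` does, using an invented Series `temps` in which 'M' stands for a missing reading:

```python
import pandas as pd

# Hypothetical readings stored as strings, with 'M' marking missing values.
temps = pd.Series(["51", "M", "48", "M", "50"])

# Anything that cannot be parsed as a number becomes NaN.
numeric = pd.to_numeric(temps, errors="coerce")
n_missing = int(numeric.isna().sum())

print(numeric.dtype)
print(n_missing)
```

Counting the NaN values before and after the conversion is a quick way to see how many entries were coerced.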
Exploring dataset 2
# Print the median of the dry_bulb_faren column
print(df2_clean.dry_bulb_faren.median())
# Print the median of the dry_bulb_faren column for the time range '2011-Apr':'2011-Jun'
print(df2_clean.loc['2011-Apr':'2011-Jun', 'dry_bulb_faren'].median())
# Print the median of the dry_bulb_faren column for the month of January
print(df2_clean.loc['2011-Jan', 'dry_bulb_faren'].median())
72.0
78.0
48.0
Only the median of the dry-bulb temperature column is analyzed here, both overall and for specific time ranges.
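The month-based selections rely on partial string indexing of a DatetimeIndex. A minimal sketch, assuming a made-up daily Series `s` whose value is simply the month number, and using ISO-style strings such as '2011-04' instead of '2011-Apr':

```python
import pandas as pd

# Hypothetical daily series for 2011; each value is the month number of its date.
idx = pd.date_range("2011-01-01", "2011-12-31", freq="D")
s = pd.Series(idx.month.to_numpy(dtype=float), index=idx)

# A year-month string selects every row in that month.
jan = s.loc["2011-01"]
# A slice between two year-month strings includes both end months.
spring = s.loc["2011-04":"2011-06"]

print(len(jan), len(spring))
print(spring.median())
```

Both slice endpoints are inclusive, so the April to June selection covers all 91 days.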
How much hotter was every day in 2011 than expected from the 30-year average? Compute the mean difference.
# Downsample df_clean by day and aggregate by mean (numeric columns only): daily_mean_2011
daily_mean_2011 = df2_clean.resample('D').mean(numeric_only=True)
# Extract the dry_bulb_faren column from daily_mean_2011 using .values: daily_temp_2011
daily_temp_2011 = daily_mean_2011['dry_bulb_faren'].values
# Downsample df_climate by day and aggregate by mean: daily_climate
daily_climate = df.resample('D').mean()
# Extract the Temperature column from daily_climate using .reset_index(): daily_temp_climate
daily_temp_climate = daily_climate.reset_index()['Temperature']
# Compute the difference between the two arrays and print the mean difference
difference = daily_temp_2011 - daily_temp_climate
print(difference.mean())
1.3301831870056477
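The `.values` and `.reset_index()` steps are there because the two series carry different dates (2011 versus 2010), and pandas aligns on index labels when subtracting. A sketch with invented numbers, where `normals` and `observed` are hypothetical three-day series:

```python
import pandas as pd

# Hypothetical three-day series, one year apart.
idx_2010 = pd.date_range("2010-01-01", periods=3, freq="D")
idx_2011 = pd.date_range("2011-01-01", periods=3, freq="D")
normals = pd.Series([50.0, 51.0, 52.0], index=idx_2010)
observed = pd.Series([53.0, 52.0, 55.0], index=idx_2011)

# Subtracting by label finds no shared dates, so every result is NaN.
aligned_by_label = observed - normals

# Dropping the index compares day 1 with day 1, day 2 with day 2, and so on.
aligned_by_position = observed.to_numpy() - normals.to_numpy()
mean_diff = aligned_by_position.mean()

print(aligned_by_label.isna().all())
print(mean_diff)
```

Positional subtraction only makes sense when both series have the same length and the same day ordering, which holds for two non-leap years sampled daily.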
Sunny or overcast?
On average, how much hotter is it when the sun is shining? In this exercise, you will compare temperatures on sunny days against temperatures on overcast days.
Your job is to use Boolean selection to filter out sunny and overcast days, and then compute the difference of the mean daily maximum temperatures between each type of day.
The column 'sky_condition' provides information about whether the day was sunny ('CLR') or overcast ('OVC').
# Using df_clean, when is sky_condition 'CLR'?
is_sky_clear = df2_clean['sky_condition']=='CLR'
# Filter df_clean using is_sky_clear
sunny = df2_clean.loc[is_sky_clear]
# Resample sunny by day then calculate the max
sunny_daily_max = sunny.resample('D').max()
# Using df_clean, when does sky_condition contain 'OVC'?
is_sky_overcast = df2_clean['sky_condition'].str.contains('OVC')
# Filter df_clean using is_sky_overcast
overcast = df2_clean.loc[is_sky_overcast]
# Resample overcast by day then calculate the max
overcast_daily_max = overcast.resample('D').max()
# Calculate the mean of sunny_daily_max (numeric columns only)
sunny_daily_max_mean = sunny_daily_max.mean(numeric_only=True)
# Calculate the mean of overcast_daily_max (numeric columns only)
overcast_daily_max_mean = overcast_daily_max.mean(numeric_only=True)
# Print the difference (sunny minus overcast)
print(sunny_daily_max_mean-overcast_daily_max_mean)
Output:
Wban 0.000000
StationType 0.000000
dry_bulb_faren 6.504304
dew_point_faren -4.339286
wind_speed -3.246062
dtype: float64
The average daily maximum dry bulb temperature was 6.5 degrees Fahrenheit higher on sunny days compared to overcast days.
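The filter, resample, and compare steps can be checked by hand on a few rows. This is a sketch with invented observations, where `obs` is hypothetical. It selects only the temperature column before resampling and passes `na=False` to `str.contains` so that missing sky conditions count as not overcast. Both choices are assumptions rather than part of the course code.

```python
import pandas as pd

# Hypothetical observations: a clear first day and an overcast second day.
idx = pd.to_datetime([
    "2011-01-01 08:00", "2011-01-01 14:00",
    "2011-01-02 08:00", "2011-01-02 14:00",
])
obs = pd.DataFrame({
    "sky_condition": ["CLR", "CLR", "OVC045", "BKN025 OVC040"],
    "dry_bulb_faren": [50.0, 62.0, 45.0, 51.0],
}, index=idx)

# Exact match for clear skies, substring match for overcast layers.
clear = obs["sky_condition"] == "CLR"
overcast = obs["sky_condition"].str.contains("OVC", na=False)

# Daily maximum temperature for each group, then the gap between the group means.
sunny_max = obs.loc[clear, "dry_bulb_faren"].resample("D").max()
overcast_max = obs.loc[overcast, "dry_bulb_faren"].resample("D").max()
gap = sunny_max.mean() - overcast_max.mean()
print(gap)
```

Here the clear day peaks at 62 and the overcast day at 51, so the gap is 11 degrees.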
Visibility and temperature
Your job is to plot the weekly average temperature and visibility as subplots.
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
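# Make sure visibility is numeric (missing values marked 'M' become NaN)
df2_clean['visibility'] = pd.to_numeric(df2_clean['visibility'], errors='coerce')
# Select the visibility and dry_bulb_faren columns and resample them: weekly_mean
weekly_mean = df2_clean[['visibility','dry_bulb_faren']].resample('W').mean()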
# Print the output of weekly_mean.corr()
print(weekly_mean.corr())
# Plot weekly_mean with subplots=True
weekly_mean.plot(subplots=True)
plt.show()
Does higher temperature go with greater visibility?
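One way to answer that question with a number is to read the off-diagonal entry of the correlation matrix. The sketch below uses synthetic data in which visibility falls as temperature rises, so the coefficient is close to -1 by construction. The frame `demo` is hypothetical and says nothing about the real 2011 data.

```python
import numpy as np
import pandas as pd

# Hypothetical 6-hourly data over four weeks with perfectly opposed linear trends.
idx = pd.date_range("2011-01-01", periods=28 * 4, freq=pd.Timedelta(hours=6))
t = np.arange(len(idx), dtype=float)
demo = pd.DataFrame({
    "visibility": 10.0 - 0.01 * t,
    "dry_bulb_faren": 40.0 + 0.1 * t,
}, index=idx)

# Weekly means, then the Pearson correlation between the two columns.
weekly = demo.resample("W").mean()
r = weekly.corr().loc["visibility", "dry_bulb_faren"]
print(round(r, 3))
```

On the real data, both columns need to be numeric before resampling, otherwise the mean cannot be computed for them.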
Computing the fraction of sunny hours
# Using df_clean, when is sky_condition 'CLR'?
is_sky_clear = df2_clean['sky_condition']=='CLR'
# Resample is_sky_clear by day
resampled = is_sky_clear.resample('D')
# Calculate the number of sunny hours per day
sunny_hours = resampled.sum()
# Calculate the number of measured hours per day
total_hours = resampled.count()
# Calculate the fraction of hours per day that were sunny
sunny_fraction = sunny_hours/total_hours
sunny_fraction.plot(kind='box')
plt.show()
[Figure: box plot of the daily fraction of sunny hours]
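The sum-over-count calculation can be verified on a handful of observations. In this sketch `is_clear` is an invented Boolean series with four readings per day:

```python
import pandas as pd

# Hypothetical flags: four observations per day over two days.
idx = pd.date_range("2011-01-01", periods=8, freq=pd.Timedelta(hours=6))
is_clear = pd.Series(
    [True, True, False, True, False, False, False, True], index=idx
)

daily = is_clear.resample("D")
# True counts as 1 in the sum; count gives the number of observations.
fraction = daily.sum() / daily.count()

# The daily mean of the 0/1 values gives the same fraction.
fraction_alt = is_clear.astype(int).resample("D").mean()
print(fraction.tolist())
```

Because the denominator is the number of observations rather than 24, days with gaps in the record still get a fraction between 0 and 1.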
Dew point and temperature
Dew point is a measure of relative humidity based on pressure and temperature. A dew point above 65 is considered uncomfortable while a temperature above 90 is also considered uncomfortable.
In this exercise, you will explore the maximum temperature and dew point of each month. The columns of interest are 'dew_point_faren' and 'dry_bulb_faren'. After resampling them appropriately to get the maximum temperature and dew point in each month, generate a histogram of these values as subplots.
# Resample dew_point_faren and dry_bulb_faren by Month, aggregating the maximum values: monthly_max
monthly_max = df2_clean[['dew_point_faren','dry_bulb_faren']].resample('M').max()
# Generate a histogram with bins=8, alpha=0.5, subplots=True
monthly_max.plot(kind='hist',bins=8,alpha=0.5,subplots=True)
# Show the plot
plt.show()
[Figure: histograms of the monthly maximum dew point and dry-bulb temperature]
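A sketch of monthly maxima on invented values that straddle a month boundary. It uses the 'MS' (month start) alias as an assumption, since newer pandas releases warn about the 'M' alias used above. The only difference is that each bin is labelled with the first day of the month instead of the last.

```python
import pandas as pd

# Hypothetical daily readings across the end of January.
idx = pd.to_datetime(["2011-01-30", "2011-01-31", "2011-02-01", "2011-02-02"])
s = pd.Series([60.0, 65.0, 70.0, 68.0], index=idx)

# One bin per calendar month, labelled by the month's first day.
monthly = s.resample("MS").max()
print(monthly)
```

The maxima themselves do not depend on which end of the month is used as the label.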
Probability of high temperatures: CDF
We already know that 2011 was hotter than the climate normals for the previous thirty years. In this final exercise, you will compare the maximum temperature in August 2011 against that of the August 2010 climate normals. More specifically, you will use a CDF plot to determine the probability of the 2011 daily maximum temperature in August being above the 2010 climate normal value. To do this, you will leverage the data manipulation, filtering, resampling, and visualization skills you have acquired throughout this course.
The two DataFrames df_clean and df_climate are available in the workspace. Your job is to select the maximum temperature in August in df_climate, and then maximum daily temperatures in August 2011. You will then filter out the days in August 2011 that were above the August 2010 maximum, and use this to construct a CDF plot.
# Extract the maximum temperature in August 2010 from df_climate: august_max
august_max = df.loc['2010-Aug','Temperature'].max()
print(august_max)
# Resample August 2011 temps in df_clean by day & aggregate the max value: august_2011
august_2011 = df2_clean.loc['2011-Aug','dry_bulb_faren'].resample('D').max()
# Filter for days in august_2011 where the value exceeds august_max: august_2011_high
august_2011_high = august_2011.loc[august_2011 > august_max]
# Construct a CDF of august_2011_high
# density=True replaces the normed argument, which newer matplotlib versions no longer accept
august_2011_high.plot(kind='hist', density=True, cumulative=True, bins=25)
# Display the plot
plt.show()
[Figure: cumulative histogram of August 2011 daily maximum temperatures above the August 2010 maximum]
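The numbers behind such a plot can also be computed directly. This is a minimal sketch with invented daily maxima and an assumed threshold (`daily_max` and `threshold` are hypothetical). It separates two quantities: the share of days above the threshold, and the empirical CDF among the days that exceeded it, which is what the cumulative histogram of the filtered series describes.

```python
import numpy as np
import pandas as pd

# Hypothetical daily maxima and an assumed threshold.
daily_max = pd.Series([93.0, 96.0, 99.0, 101.0, 104.0])
threshold = 95.0

# Share of days that exceeded the threshold.
share_above = (daily_max > threshold).mean()

# Empirical CDF of the exceeding days: sorted values against cumulative share.
above = daily_max[daily_max > threshold].sort_values()
cdf = np.arange(1, len(above) + 1) / len(above)

print(share_above)
print(list(zip(above.tolist(), cdf.tolist())))
```

Since the filtering happens before the CDF is built, the curve always reaches 1 at the hottest day regardless of how many days fell below the threshold.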