Python practice: Case Study - Sunlight i

Author: 鲸鱼酱375 | Published: 2019-06-19 22:15

    This material comes from the DataCamp course pandas Foundations.
    The data and code are on GitHub.

    Data:

    Dataset 1

    • weather_data_austin_2010:
      Weather conditions in Austin in 2010

      (Figure: output of head())

      To make later steps easier, use the Date column as the index:

    import pandas as pd

    # df is assumed to already hold the weather_data_austin_2010 table
    df['Date'] = pd.to_datetime(df['Date'])
    df = df.set_index('Date')
    df.head()
    
    (Figure: output of df.head() with Date as the index)
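
As a minimal sketch of the same idea on made-up data (the three rows below are illustrative, not taken from the Austin file), parsing a text column into datetimes and moving it into the index looks like this:

```python
import pandas as pd

# Toy stand-in for the 2010 Austin table; the values are invented for illustration.
toy = pd.DataFrame({
    "Date": ["2010-01-01 00:00", "2010-01-01 01:00", "2010-01-01 02:00"],
    "Temperature": [46.2, 44.6, 44.1],
})

# Parse the strings into datetimes, then move the column into the index.
toy["Date"] = pd.to_datetime(toy["Date"])
toy = toy.set_index("Date")

# Rows can now be looked up by timestamp.
print(toy.loc[pd.Timestamp("2010-01-01 01:00"), "Temperature"])
```

With a DatetimeIndex in place, time-based slicing and resampling become available on the table.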

    Dataset 2

    NOAA_QCLCD_2011_hourly_13904.txt
    Weather conditions in 2011. The file has no header row and 44 columns, some of which are dropped below.


    (Figure: output of head())
    column_labels='Wban,date,Time,StationType,sky_condition,sky_conditionFlag,visibility,visibilityFlag,wx_and_obst_to_vision,wx_and_obst_to_visionFlag,dry_bulb_faren,dry_bulb_farenFlag,dry_bulb_cel,dry_bulb_celFlag,wet_bulb_faren,wet_bulb_farenFlag,wet_bulb_cel,wet_bulb_celFlag,dew_point_faren,dew_point_farenFlag,dew_point_cel,dew_point_celFlag,relative_humidity,relative_humidityFlag,wind_speed,wind_speedFlag,wind_direction,wind_directionFlag,value_for_wind_character,value_for_wind_characterFlag,station_pressure,station_pressureFlag,pressure_tendency,pressure_tendencyFlag,presschange,presschangeFlag,sea_level_pressure,sea_level_pressureFlag,record_type,hourly_precip,hourly_precipFlag,altimeter,altimeterFlag,junk'
    
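    # df2 is assumed to hold the raw file read without a header row,
    # e.g. pd.read_csv('NOAA_QCLCD_2011_hourly_13904.txt', header=None) if the file is comma-separated
    column_labels_list = column_labels.split(',')
    df2.columns = column_labels_list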
    list_to_drop=['sky_conditionFlag', 'visibilityFlag', 'wx_and_obst_to_vision', 'wx_and_obst_to_visionFlag', 'dry_bulb_farenFlag', 'dry_bulb_celFlag', 'wet_bulb_farenFlag', 'wet_bulb_celFlag', 'dew_point_farenFlag', 'dew_point_celFlag', 'relative_humidityFlag', 'wind_speedFlag', 'wind_directionFlag', 'value_for_wind_character', 'value_for_wind_characterFlag', 'station_pressureFlag', 'pressure_tendencyFlag', 'pressure_tendency', 'presschange', 'presschangeFlag', 'sea_level_pressureFlag', 'hourly_precip', 'hourly_precipFlag', 'altimeter', 'record_type', 'altimeterFlag', 'junk']
    df2_dropped = df2.drop(list_to_drop,axis='columns')
    print(df2_dropped.head())
    
    (Figure: only these columns are kept)
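
The label-then-drop pattern can be tried on a tiny header-less table. This is only a sketch: the two rows and the shortened four-name label string are hypothetical, chosen to mirror the shape of the real file.

```python
import pandas as pd

# Hypothetical 2-row table with no header, so pandas numbers the columns 0..3.
raw = pd.DataFrame([[13904, 20110101, 53, "x"], [13904, 20110101, 153, "y"]])

# Assign names from a comma-separated string, as in the article.
labels = "Wban,date,Time,junk".split(",")
raw.columns = labels

# Drop unwanted columns by name; axis="columns" is equivalent to axis=1.
trimmed = raw.drop(["junk"], axis="columns")
print(trimmed.columns.tolist())
```

Note that `drop` returns a new DataFrame and leaves `raw` unchanged.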

    Data cleaning: combine the date and Time columns and use the result as the index.

    # Convert the date column to string: df2_dropped['date']
    df2_dropped['date'] = df2_dropped['date'].astype(str)
    
    # Pad leading zeros to the Time column: df2_dropped['Time']
    df2_dropped['Time'] = df2_dropped['Time'].apply(lambda x: '{:0>4}'.format(x))
    
    # Concatenate the new date and Time columns: date_string
    date_string = df2_dropped['date'] + df2_dropped['Time']
    
    # Convert the date_string Series to datetime: date_times
    date_times = pd.to_datetime(date_string, format='%Y%m%d%H%M')
    
    # Set the index to be the new date_times container: df2_clean
    df2_clean = df2_dropped.set_index(date_times)
    
    # Print the output of df2_clean.head()
    print(df2_clean.head())
    
    (Figure: dataset 2 after cleaning)
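
The date and time merge can be checked on a few invented rows. This sketch uses `str.zfill` for the zero padding, which is an alternative to the `'{:0>4}'.format` call in the article rather than the course's own code.

```python
import pandas as pd

# Made-up rows: dates as integers like 20110101, times as integers like 53 (meaning 00:53).
toy = pd.DataFrame({"date": [20110101, 20110101, 20110102],
                    "Time": [53, 1153, 2353]})

date_str = toy["date"].astype(str)
time_str = toy["Time"].astype(str).str.zfill(4)   # 53 -> "0053"

# Concatenate and parse with an explicit format so nothing is guessed.
stamps = pd.to_datetime(date_str + time_str, format="%Y%m%d%H%M")
toy_clean = toy.set_index(stamps)
print(toy_clean.index)
```

Without the padding, a time such as 53 would produce the string "2011010153", which does not match the `%Y%m%d%H%M` format.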

    Handling missing values: entries marked M in the table are converted to NaN.

    # Print the dry_bulb_faren temperature between 8 AM and 9 AM on June 20, 2011
    print(df2_clean.loc['2011-6-20 8:00:00':'2011-6-20 9:00:00', 'dry_bulb_faren'])
    
    # Convert the dry_bulb_faren column to numeric values: df2_clean['dry_bulb_faren']
    # errors='coerce' turns anything that cannot be parsed (such as 'M') into NaN
    df2_clean['dry_bulb_faren'] = pd.to_numeric(df2_clean['dry_bulb_faren'], errors='coerce')
    
    # Print the transformed dry_bulb_faren temperature between 8 AM and 9 AM on June 20, 2011
    print(df2_clean.loc['2011-6-20 8:00:00':'2011-6-20 9:00:00', 'dry_bulb_faren'])
    
    # Convert the wind_speed and dew_point_faren columns to numeric values
    df2_clean['wind_speed'] = pd.to_numeric(df2_clean['wind_speed'], errors='coerce')
    df2_clean['dew_point_faren'] = pd.to_numeric(df2_clean['dew_point_faren'], errors='coerce')
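
A small sketch of what `errors='coerce'` does, using an invented Series in which one reading is the missing-value marker M:

```python
import pandas as pd

# Made-up readings stored as text, with 'M' marking a missing value.
readings = pd.Series(["51", "M", "50", "49"])

# Unparseable entries become NaN instead of raising an error.
numeric = pd.to_numeric(readings, errors="coerce")

print(numeric.isna().sum())   # number of values that could not be parsed
print(numeric.median())       # NaN is skipped by default
```

After the conversion the column has a float dtype, so statistics such as the median skip the missing entries instead of failing on text.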
    

    Exploring dataset 2

    # Print the median of the dry_bulb_faren column
    print(df2_clean.dry_bulb_faren.median())
    
    # Print the median of the dry_bulb_faren column for the time range '2011-Apr':'2011-Jun'
    print(df2_clean.loc['2011-Apr':'2011-Jun', 'dry_bulb_faren'].median())
    
    # Print the median of the dry_bulb_faren column for the month of January
    print(df2_clean.loc['2011-Jan', 'dry_bulb_faren'].median())
    
    72.0
    78.0
    48.0
    

    Only the median of the dry bulb temperature is analyzed here, both overall and for specific time periods.
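
The month-based selections above rely on partial string indexing of a DatetimeIndex. Here is a minimal sketch on a synthetic daily series (the values are just a running counter, not temperatures):

```python
import pandas as pd

# One value per day of 2011; the value is simply the day's position in the year.
idx = pd.date_range("2011-01-01", "2011-12-31", freq="D")
temps = pd.Series(range(len(idx)), index=idx, dtype="float64")

# A year-month string selects the whole month.
january = temps.loc["2011-01"]

# A slice of year-month strings includes every day of the end month as well.
spring = temps.loc["2011-04":"2011-06"]

print(len(january), january.median())
print(len(spring), spring.median())
```

Unlike positional slices, label slices on a DatetimeIndex are inclusive at both ends, which is why June 30 is part of `spring`.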

    How much hotter was each day in 2011 than expected from the 30-year average? Compute the mean difference.

    # Downsample the 2011 temperatures by day and aggregate by mean: daily_mean_2011
    # (only the numeric column is selected, since newer pandas versions refuse to average text columns)
    daily_mean_2011 = df2_clean[['dry_bulb_faren']].resample('D').mean()
    
    # Extract the dry_bulb_faren column from daily_mean_2011 using .values: daily_temp_2011
    daily_temp_2011 = daily_mean_2011['dry_bulb_faren'].values
    
    # Downsample df (the 2010 climate data) by day and aggregate by mean: daily_climate
    daily_climate = df.resample('D').mean()
    
    # Extract the Temperature column from daily_climate using .reset_index(): daily_temp_climate
    daily_temp_climate = daily_climate.reset_index()['Temperature']
    
    # Compute the difference between the two arrays and print the mean difference
    difference = daily_temp_2011 - daily_temp_climate
    print(difference.mean())
    
    1.3301831870056477
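
The reason for stripping the index with `.values` and `.reset_index()` is that pandas aligns Series on their labels before subtracting, and 2011 dates never match 2010 dates. A short sketch with invented numbers shows the difference:

```python
import pandas as pd

# Two made-up daily series whose dates fall in different years.
a = pd.Series([70.0, 72.0, 74.0],
              index=pd.date_range("2011-01-01", periods=3, freq="D"))
b = pd.Series([68.0, 71.0, 75.0],
              index=pd.date_range("2010-01-01", periods=3, freq="D"))

# Label-aligned subtraction: no dates in common, so every result is NaN.
aligned = a - b

# Plain NumPy arrays subtract position by position.
positional = a.values - b.values

print(aligned.isna().all())
print(positional)
```

Positional subtraction is only meaningful when both arrays cover the same number of days in the same order.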
    

    Sunny or overcast?

    On average, how much hotter is it when the sun is shining? In this exercise, you will compare temperatures on sunny days against temperatures on overcast days.
    Your job is to use Boolean selection to filter out sunny and overcast days, and then compute the difference of the mean daily maximum temperatures between each type of day.
    The column 'sky_condition' provides information about whether the day was sunny ('CLR') or overcast ('OVC').

    # Using df2_clean, when is sky_condition 'CLR'?
    is_sky_clear = df2_clean['sky_condition']=='CLR'
    
    # Filter df2_clean using is_sky_clear
    sunny = df2_clean.loc[is_sky_clear]
    
    # Resample sunny by day then calculate the max
    sunny_daily_max = sunny.resample('D').max()
    
    # Using df2_clean, when does sky_condition contain 'OVC'?
    is_sky_overcast = df2_clean['sky_condition'].str.contains('OVC')
    
    # Filter df2_clean using is_sky_overcast
    overcast = df2_clean.loc[is_sky_overcast]
    
    # Resample overcast by day then calculate the max
    overcast_daily_max = overcast.resample('D').max()
    
    # Calculate the mean of sunny_daily_max (numeric columns only)
    sunny_daily_max_mean = sunny_daily_max.mean(numeric_only=True)
    
    # Calculate the mean of overcast_daily_max (numeric columns only)
    overcast_daily_max_mean = overcast_daily_max.mean(numeric_only=True)
    
    # Print the difference (sunny minus overcast)
    print(sunny_daily_max_mean-overcast_daily_max_mean)
    
    Wban               0.000000
    StationType        0.000000
    dry_bulb_faren     6.504304
    dew_point_faren   -4.339286
    wind_speed        -3.246062
    dtype: float64
    

    The average daily maximum dry bulb temperature was 6.5 degrees Fahrenheit higher on sunny days compared to overcast days.
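
The comparison can be reproduced in miniature. In this sketch the four observations and the sky codes such as "OVC050" are invented for illustration; only the 'CLR' and 'OVC' markers come from the exercise text.

```python
import pandas as pd

idx = pd.to_datetime(["2011-01-01 09:00", "2011-01-01 15:00",
                      "2011-01-02 09:00", "2011-01-02 15:00"])
toy = pd.DataFrame({"sky_condition": ["CLR", "CLR", "OVC050", "BKN030 OVC070"],
                    "dry_bulb_faren": [50.0, 60.0, 45.0, 52.0]}, index=idx)

# Exact match for clear sky, substring match for overcast layers.
is_clear = toy["sky_condition"] == "CLR"
is_overcast = toy["sky_condition"].str.contains("OVC")

# Daily maximum temperature within each group of observations.
sunny_max = toy.loc[is_clear, "dry_bulb_faren"].resample("D").max()
overcast_max = toy.loc[is_overcast, "dry_bulb_faren"].resample("D").max()

diff = sunny_max.mean() - overcast_max.mean()
print(diff)
```

Selecting the temperature column before resampling keeps text columns out of the aggregation entirely.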

    Visibility and temperature
    Your job is to plot the weekly average temperature and visibility as subplots.

    # Import matplotlib.pyplot as plt
    import matplotlib.pyplot as plt
    
    # visibility was not converted earlier; coerce it to numbers in case it was read as text
    df2_clean['visibility'] = pd.to_numeric(df2_clean['visibility'], errors='coerce')
    
    # Select the visibility and dry_bulb_faren columns and resample them: weekly_mean
    weekly_mean = df2_clean[['visibility','dry_bulb_faren']].resample('W').mean()
    
    # Print the output of weekly_mean.corr()
    print(weekly_mean.corr())
    
    # Plot weekly_mean with subplots=True
    weekly_mean.plot(subplots=True)
    plt.show()
    
    Does higher temperature go along with higher visibility?
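
To see what the weekly resample and the correlation matrix return, here is a sketch on four synthetic weeks in which both columns rise together (the numbers are invented, so the high correlation is by construction):

```python
import pandas as pd

# 28 daily rows starting on Monday 2011-01-03, i.e. four full Monday-to-Sunday weeks.
idx = pd.date_range("2011-01-03", periods=28, freq="D")
toy = pd.DataFrame({
    "visibility":     [5.0] * 7 + [7.0] * 7 + [8.0] * 7 + [10.0] * 7,
    "dry_bulb_faren": [40.0] * 7 + [50.0] * 7 + [55.0] * 7 + [70.0] * 7,
}, index=idx)

# 'W' bins end on Sunday, so each bin covers one of the four weeks.
weekly = toy.resample("W").mean()

# Pairwise Pearson correlation between the two weekly series.
corr = weekly.corr()
print(weekly)
print(corr)
```

A correlation near 1 only says the two weekly series move together; it does not by itself show that warmth causes better visibility.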

    Computing the fraction of sunny hours

    # Using df2_clean, when is sky_condition 'CLR'?
    is_sky_clear = df2_clean['sky_condition']=='CLR'
    
    # Resample is_sky_clear by day
    resampled = is_sky_clear.resample('D')
    
    # Calculate the number of sunny hours per day
    sunny_hours = resampled.sum()
    
    # Calculate the number of measured hours per day
    total_hours = resampled.count()
    
    # Calculate the fraction of hours per day that were sunny
    sunny_fraction = sunny_hours/total_hours
    sunny_fraction.plot(kind='box')
    plt.show()
    
    (Figure: box plot of the daily sunny fraction)
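
The sum-over-count trick works because True counts as 1 when a boolean Series is summed. A minimal sketch with eight made-up observations over two days:

```python
import pandas as pd

idx = pd.to_datetime(["2011-01-01 00:00", "2011-01-01 06:00",
                      "2011-01-01 12:00", "2011-01-01 18:00",
                      "2011-01-02 00:00", "2011-01-02 06:00",
                      "2011-01-02 12:00", "2011-01-02 18:00"])
sky = pd.Series(["CLR", "CLR", "OVC", "CLR", "OVC", "OVC", "CLR", "OVC"], index=idx)

is_clear = sky == "CLR"
by_day = is_clear.resample("D")

# sum() counts the True values, count() counts all observations in the day.
fraction = by_day.sum() / by_day.count()
print(fraction)
```

Dividing by the observed count rather than by 24 keeps the fraction correct on days with missing hourly records.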

    Dew point and temperature

    Dew point is a measure of relative humidity based on pressure and temperature. A dew point above 65 °F is considered uncomfortable, and a temperature above 90 °F is also considered uncomfortable.

    In this exercise, you will explore the maximum temperature and dew point of each month. The columns of interest are 'dew_point_faren' and 'dry_bulb_faren'. After resampling them appropriately to get the maximum temperature and dew point in each month, generate a histogram of these values as subplots.

    # Resample dew_point_faren and dry_bulb_faren by Month, aggregating the maximum values: monthly_max
    # ('M' means month-end; pandas 2.2 and later prefer the alias 'ME')
    monthly_max = df2_clean[['dew_point_faren','dry_bulb_faren']].resample('M').max()
    
    # Generate a histogram with bins=8, alpha=0.5, subplots=True
    monthly_max.plot(kind='hist',bins=8,alpha=0.5,subplots=True)
    
    # Show the plot
    plt.show()
    
    (Figure: histograms of the monthly maximum dew point and temperature)
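
As a complement to the histogram, the comfort thresholds from the exercise text (65 for dew point, 90 for temperature) can be applied directly to monthly maxima. This sketch uses invented readings and groups by calendar month with `to_period`, which is a swapped-in alternative to `resample('M')` and not the course's code:

```python
import pandas as pd

idx = pd.to_datetime(["2011-01-10", "2011-01-20", "2011-07-10", "2011-07-20"])
toy = pd.DataFrame({"dew_point_faren": [40.0, 55.0, 70.0, 74.0],
                    "dry_bulb_faren":  [60.0, 75.0, 98.0, 103.0]}, index=idx)

# Group rows by calendar month and take the maximum of each column.
monthly_max = toy.groupby(toy.index.to_period("M")).max()

# Flag months whose maximum crosses each comfort threshold.
muggy = monthly_max["dew_point_faren"] > 65
hot = monthly_max["dry_bulb_faren"] > 90
print(monthly_max[muggy & hot])
```

Only months where both flags are True are printed, which in this toy data is July.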

    Probability of high temperatures (CDF)

    We already know that 2011 was hotter than the climate normals for the previous thirty years. In this final exercise, you will compare the maximum temperature in August 2011 against that of the August 2010 climate normals. More specifically, you will use a CDF plot to determine the probability of the 2011 daily maximum temperature in August being above the 2010 climate normal value. To do this, you will leverage the data manipulation, filtering, resampling, and visualization skills you have acquired throughout this course.

    The two DataFrames df_clean and df_climate are available in the workspace (named df2_clean and df in the code here). Your job is to select the maximum temperature in August in df_climate, and then the maximum daily temperatures in August 2011. You will then filter out the days in August 2011 that were above the August 2010 maximum, and use this to construct a CDF plot.

    # Extract the maximum temperature in August 2010 from df (the climate data): august_max
    august_max = df.loc['2010-Aug','Temperature'].max()
    print(august_max)
    
    # Resample August 2011 temps in df2_clean by day & aggregate the max value: august_2011
    august_2011 = df2_clean.loc['2011-Aug','dry_bulb_faren'].resample('D').max()
    
    # Filter for days in august_2011 where the value exceeds august_max: august_2011_high
    august_2011_high = august_2011.loc[august_2011 > august_max]
    
    # Construct a CDF of august_2011_high
    # (density=True replaces the normed=True argument, which newer matplotlib versions no longer accept)
    august_2011_high.plot(kind='hist', density=True, cumulative=True, bins=25)
    
    # Display the plot
    plt.show()
    
    (Figure: CDF of the August 2011 daily maxima above the August 2010 maximum)
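
A cumulative histogram approximates the empirical CDF. The same curve can be computed directly with NumPy, as in this sketch on five invented daily maxima (this is an alternative to the histogram call, not the course's method):

```python
import numpy as np

# Made-up daily maximum temperatures for illustration.
highs = np.array([96.0, 101.0, 99.0, 104.0, 97.0])

# Empirical CDF: sort the values, then assign each the fraction of data at or below it.
x = np.sort(highs)
cdf = np.arange(1, len(x) + 1) / len(x)

# Probability of a daily maximum at or below 100 degrees in this sample.
frac_le_100 = np.mean(highs <= 100.0)

for value, prob in zip(x, cdf):
    print(value, prob)
print(frac_le_100)
```

Reading the curve at a given temperature gives the share of days at or below it, and one minus that value gives the share of hotter days.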
