Analyzing Two-Dimensional Data with NumPy and Pandas


By esskeetit | Published 2018-04-18 17:33

    1. Data Description

    • UNIT
      Remote unit that collects turnstile information. Can collect from multiple banks of turnstiles. Large subway stations can have more than one unit.

    • DATEn
      Date in “yyyy-mm-dd” (2011-05-21) format.

    • TIMEn
      Time in “hh:mm:ss” (08:05:02) format.

    • ENTRIESn
      Raw reading of cumulative turnstile entries from the remote unit. Occasionally resets to 0.

    • EXITSn
      Raw reading of cumulative turnstile exits from the remote unit. Occasionally resets to 0.

    • ENTRIESn_hourly
      Difference in ENTRIES from the previous REGULAR reading.

    • EXITSn_hourly
      Difference in EXITS from the previous REGULAR reading.

    • datetime
      Date and time in “yyyy-mm-dd hh:mm:ss” format (2011-05-01 00:00:00). Can be parsed into a Pandas datetime object without modifications.

    • hour
      Hour of the timestamp from TIMEn. Truncated rather than rounded.

    • day_week
      Integer (0-6, Mon-Sun) corresponding to the day of the week.

    • weekday
      Indicator (0 or 1) if the date is a weekday (Mon-Fri).

    • station
      Subway station corresponding to the remote unit.

    • latitude
      Latitude of the subway station corresponding to the remote unit.

    • longitude
      Longitude of the subway station corresponding to the remote unit.

    • conds
      Categorical variable of the weather conditions (Clear, Cloudy, etc.) for the time and location.

    • fog
      Indicator (0 or 1) if there was fog at the time and location.

    • precipi
      Precipitation in inches at the time and location.

    • pressurei
      Barometric pressure in inches Hg at the time and location.

    • rain
      Indicator (0 or 1) if rain occurred within the calendar day at the location.

    • tempi
      Temperature in ℉ at the time and location.

    • wspdi
      Wind speed in mph at the time and location.

    • meanprecipi
      Daily average of precipi for the location.

    • meanpressurei
      Daily average of pressurei for the location.

    • meantempi
      Daily average of tempi for the location.

    • meanwspdi
      Daily average of wspdi for the location.

    • weather_lat
      Latitude of the weather station the weather data is from.

    • weather_lon
      Longitude of the weather station the weather data is from.

    2. Questions I thought of:

    • what variables are related to subway ridership?
      -- which stations have the most riders?
      -- what are the ridership patterns over time?
      -- how does the weather affect ridership?

    • what patterns can I find in the weather?
      -- is the temperature rising throughout the month?
      -- how does weather vary across the city?

    3. Two-Dimensional NumPy Arrays

    Two-dimensional data:
    Python: list of lists
    NumPy: 2D array
    Pandas: DataFrame

    2D arrays, as opposed to arrays of arrays:

    • more memory efficient
    • accessing an element is a bit different: a[1, 3]
    • mean(), std() operate on the entire array
    import numpy as np
    
    
    ridership = np.array([
        [   0,    0,    2,    5,    0],
        [1478, 3877, 3674, 2328, 2539],
        [1613, 4088, 3991, 6461, 2691],
        [1560, 3392, 3826, 4787, 2613],
        [1608, 4802, 3932, 4477, 2705],
        [1576, 3933, 3909, 4979, 2685],
        [  95,  229,  255,  496,  201],
        [   2,    0,    1,   27,    0],
        [1438, 3785, 3589, 4174, 2215],
        [1342, 4043, 4009, 4665, 3033]
    ])
    print(ridership)
    print(ridership[1, 3])
    print(ridership[1:3, 3:5])
    print(ridership[1, :])

    # Vectorized operations on rows or columns
    print(ridership[0, :] + ridership[1, :])
    print(ridership[:, 0] + ridership[:, 1])

    # Vectorized operations on entire arrays
    a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    b = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
    print(a + b)
    
    
    Write a function that finds the station with the most riders on the first
    day, then returns the overall mean ridership and the mean ridership per day
    for that station:

    def mean_riders_for_max_station(ridership):
        overall_mean = ridership.mean()
        max_station = ridership[0, :].argmax()           # busiest station on day 1
        mean_for_max = ridership[:, max_station].mean()  # its mean daily ridership

        return (overall_mean, mean_for_max)
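As a quick check, here is the same function applied to a small, made-up ridership array (the values below are hypothetical, not from the subway data):

```python
import numpy as np

def mean_riders_for_max_station(ridership):
    overall_mean = ridership.mean()
    max_station = ridership[0, :].argmax()           # busiest station on day 1
    mean_for_max = ridership[:, max_station].mean()  # its mean daily ridership
    return (overall_mean, mean_for_max)

# Hypothetical 3-day x 3-station ridership counts
small = np.array([
    [0, 5, 2],
    [4, 9, 6],
    [2, 7, 1]
])

print(mean_riders_for_max_station(small))  # (4.0, 7.0)
```

Station 1 has the most riders on day 1 (5), and its column mean is (5 + 9 + 7) / 3 = 7.0.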
    

    4. NumPy Axes

    Mean of each row (averaging across the columns)

    ridership.mean(axis=1)
    

    Mean of each column (averaging down the rows)

    ridership.mean(axis=0)
    
    import numpy as np
    
    
    a = np.array([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ])
        
    print(a.sum())
    print(a.sum(axis=0))
    print(a.sum(axis=1))
        
    
    ridership = np.array([
        [   0,    0,    2,    5,    0],
        [1478, 3877, 3674, 2328, 2539],
        [1613, 4088, 3991, 6461, 2691],
        [1560, 3392, 3826, 4787, 2613],
        [1608, 4802, 3932, 4477, 2705],
        [1576, 3933, 3909, 4979, 2685],
        [  95,  229,  255,  496,  201],
        [   2,    0,    1,   27,    0],
        [1438, 3785, 3589, 4174, 2215],
        [1342, 4043, 4009, 4665, 3033]
    ])
    
    def min_and_max_riders_per_day(ridership):
        # Mean ridership per day for each station (averaging down the rows)
        mean_ridership_per_station = ridership.mean(axis=0)

        max_daily_ridership = mean_ridership_per_station.max()
        min_daily_ridership = mean_ridership_per_station.min()

        return (max_daily_ridership, min_daily_ridership)
    

    5. NumPy and Pandas Data Types

    Each column of a Pandas DataFrame can have a different type.
    dataframe.mean() computes the mean of each column.
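A minimal illustration of this (the column names here are made up):

```python
import pandas as pd

# Each column of a DataFrame can have its own dtype.
df = pd.DataFrame({
    'count': [1, 2, 3],        # integer column
    'ratio': [0.5, 1.5, 2.5],  # float column
    'label': ['a', 'b', 'c']   # string (object) column
})
print(df.dtypes)

# mean() computes one value per numeric column.
means = df[['count', 'ratio']].mean()
print(means)  # count 2.0, ratio 1.5
```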

    6. Accessing DataFrame Elements

    .loc['row_label']   # access a row by its index label
    .iloc[9]            # access a row by position
    .iloc[1, 3]         # access a single value by position
    df['column_name']   # access a column
    df.values           # the values as a 2D NumPy array, without column names or
                        # row index -- useful for statistics over the entire frame
    
    import pandas as pd
    
    # Subway ridership for 5 stations on 10 different days
    ridership_df = pd.DataFrame(
        data=[[   0,    0,    2,    5,    0],
              [1478, 3877, 3674, 2328, 2539],
              [1613, 4088, 3991, 6461, 2691],
              [1560, 3392, 3826, 4787, 2613],
              [1608, 4802, 3932, 4477, 2705],
              [1576, 3933, 3909, 4979, 2685],
              [  95,  229,  255,  496,  201],
              [   2,    0,    1,   27,    0],
              [1438, 3785, 3589, 4174, 2215],
              [1342, 4043, 4009, 4665, 3033]],
        index=['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
               '05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11'],
        columns=['R003', 'R004', 'R005', 'R006', 'R007']
    )
    
    
    # Accessing DataFrame elements
    print(ridership_df.iloc[0])
    print(ridership_df.loc['05-05-11'])
    print(ridership_df['R003'])
    print(ridership_df.iloc[1, 3])

    print(ridership_df[['R003', 'R005']])

    df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
    print(df.sum())
    print(df.sum(axis=1))
    print(df.values.sum())
        
    def mean_riders_for_max_station(ridership):
        overall_mean = ridership.values.mean()
        max_station = ridership.iloc[0].idxmax()   # idxmax() returns the column name
        mean_for_max = ridership[max_station].mean()

        return (overall_mean, mean_for_max)
    
    

    7. Loading Data into a DataFrame

    A DataFrame is an effective representation of a CSV file's contents,
    since each column can have its own data type.

    df = pd.read_csv('filename.csv')
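The 'filename.csv' above is a placeholder. A self-contained sketch using an in-memory buffer instead of a real file (column names are made up):

```python
import io
import pandas as pd

# Simulate a small CSV file in memory; a real call would pass a file path.
csv_text = "UNIT,ENTRIESn_hourly\nR003,1478\nR004,3877\n"
df = pd.read_csv(io.StringIO(csv_text))

# read_csv infers a dtype for each column separately.
print(df.dtypes)  # UNIT is object, ENTRIESn_hourly is int64
print(df)
```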
    

    8. Computing Correlation

    By default, Pandas' std() function applies Bessel's correction when
    computing the standard deviation. Calling std(ddof=0) disables it.
    When computing the Pearson correlation this way, use ddof=0.

    NumPy's corrcoef() function can be used to compute the Pearson
    product-moment correlation coefficient, often shortened to just the
    "correlation coefficient".

    import pandas as pd
    def correlation(x, y):
        x_standard = (x-x.mean())/x.std(ddof=0) 
        y_standard = (y-y.mean())/y.std(ddof=0)
        return (x_standard * y_standard).mean()
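A sketch checking the hand-rolled version against np.corrcoef (the x and y values below are arbitrary):

```python
import numpy as np
import pandas as pd

def correlation(x, y):
    x_standard = (x - x.mean()) / x.std(ddof=0)
    y_standard = (y - y.mean()) / y.std(ddof=0)
    return (x_standard * y_standard).mean()

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

manual = correlation(x, y)
reference = np.corrcoef(x, y)[0, 1]
print(manual, reference)  # the two values agree
```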
    

    9. Pandas Axis Names

    axis=1 or axis='columns': operate across the columns (one result per row)
    axis=0 or axis='index': operate down the rows (one result per column)
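A small demonstration of the two axis names:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

col_sums = df.sum(axis='index')    # same as axis=0: one sum per column
row_sums = df.sum(axis='columns')  # same as axis=1: one sum per row

print(col_sums)  # a: 6, b: 15
print(row_sums)  # 5, 7, 9
```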

    10. DataFrame Vectorized Operations

    import pandas as pd
    
    df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
    df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]})
    print(df1 + df2)
        
    df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
    df2 = pd.DataFrame({'d': [10, 20, 30], 'c': [40, 50, 60], 'b': [70, 80, 90]})
    print(df1 + df2)
    
    df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
                        index=['row1', 'row2', 'row3'])
    df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]},
                        index=['row4', 'row3', 'row2'])
    print(df1 + df2)
    
    
    # Cumulative entries and exits for one station for a few hours.
    entries_and_exits = pd.DataFrame({
        'ENTRIESn': [3144312, 3144335, 3144353, 3144424, 3144594,
                     3144808, 3144895, 3144905, 3144941, 3145094],
        'EXITSn': [1088151, 1088159, 1088177, 1088231, 1088275,
                   1088317, 1088328, 1088331, 1088420, 1088753]
    })
    
    def get_hourly_entries_and_exits(entries_and_exits):
        '''
        Fill in this function to take a DataFrame with cumulative entries
        and exits (entries in the first column, exits in the second) and
        return a DataFrame with hourly entries and exits (entries in the
        first column, exits in the second).
        '''
        return entries_and_exits-entries_and_exits.shift(1)
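To see what shift(1) does here, a quick check on a subset of the cumulative counts above; note that Pandas' built-in .diff() computes the same differences:

```python
import pandas as pd

entries_and_exits = pd.DataFrame({
    'ENTRIESn': [3144312, 3144335, 3144353],
    'EXITSn':   [1088151, 1088159, 1088177]
})

hourly = entries_and_exits - entries_and_exits.shift(1)
print(hourly)  # first row is NaN, the rest are per-interval differences

# .diff() is the built-in equivalent of subtracting shift(1).
print(hourly.equals(entries_and_exits.diff()))  # True
```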
    

    11. DataFrame applymap()

    import pandas as pd
    
    df = pd.DataFrame({
        'a': [1, 2, 3],
        'b': [10, 20, 30],
        'c': [5, 10, 15]
    })
        
    def add_one(x):
        return x + 1
            
    print(df.applymap(add_one))
        
    grades_df = pd.DataFrame(
        data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
              'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
        index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio', 
               'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
    )
     
    def convert_grade(x):
        if x >= 90:
            return 'A'
        elif x >= 80:
            return 'B'
        elif x >= 70:
            return 'C'
        elif x >= 60:
            return 'D'
        else:
            return 'F'
    def convert_grades(grades):
        
        return grades.applymap(convert_grade)
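A quick check on a subset of the grades above:

```python
import pandas as pd

grades_df = pd.DataFrame(
    data={'exam1': [43, 81, 78], 'exam2': [24, 63, 56]},
    index=['Andre', 'Barry', 'Chris']
)

def convert_grade(x):
    if x >= 90:
        return 'A'
    elif x >= 80:
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    else:
        return 'F'

# applymap() applies the function to every element of the DataFrame.
letter_grades = grades_df.applymap(convert_grade)
print(letter_grades)
```

In recent pandas versions, applymap() has been renamed to DataFrame.map(), though the old name still works with a deprecation warning.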
    

    12. DataFrame apply()

    def standardize_column(column):
        return (column - column.mean())/column.std(ddof=0)
    def standardize(df):
        return df.apply(standardize_column)
    

    The default standard deviation computed by NumPy's .std() and Pandas' .std() functions differs. By default, NumPy computes the population standard deviation, with ddof=0. Pandas, on the other hand, computes the sample standard deviation, with ddof=1. If we know all the scores, then we have the population -- so to standardize with Pandas, we need to set ddof to 0.
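The difference is easy to see on a tiny dataset:

```python
import numpy as np
import pandas as pd

data = [1.0, 2.0, 3.0, 4.0]

np_std = np.std(data)           # population std: ddof=0 by default
pd_std = pd.Series(data).std()  # sample std: ddof=1 by default
print(np_std, pd_std)           # the two values differ

# Passing ddof explicitly makes them agree.
print(np.isclose(pd.Series(data).std(ddof=0), np_std))  # True
print(np.isclose(np.std(data, ddof=1), pd_std))         # True
```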

    13. DataFrame apply() Use Case 2

    Reducing a column to a single value

    def column_second_largest(column):
        sorted_values = column.sort_values(ascending=False)
        return sorted_values.iloc[1]
        
    def second_largest(df):
        '''
        Fill in this function to return the second-largest value of each 
        column of the input DataFrame.
        '''
        return df.apply(column_second_largest)
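For example (made-up values):

```python
import pandas as pd

df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30]
})

def column_second_largest(column):
    sorted_values = column.sort_values(ascending=False)
    return sorted_values.iloc[1]

# apply() reduces each column to a single value, returning a Series.
second = df.apply(column_second_largest)
print(second)  # a: 4, b: 40
```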
    

    14. Adding a DataFrame to a Series

    import pandas as pd
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({
        0: [10, 20, 30, 40],
        1: [50, 60, 70, 80],
        2: [90, 100, 110, 120],
        3: [130, 140, 150, 160]
    })
    
    # Adding a Series to a square DataFrame    
    print(df + s)
        
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({0: [10], 1: [20], 2: [30], 3: [40]})
    # Adding a Series to a one-row DataFrame 
    print(df + s)
    
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({0: [10, 20, 30, 40]})
    # Adding a Series to a one-column DataFrame
    print(df + s)
        
    
        
    # Adding when DataFrame column names match Series index
    s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
    df = pd.DataFrame({
        'a': [10, 20, 30, 40],
        'b': [50, 60, 70, 80],
        'c': [90, 100, 110, 120],
        'd': [130, 140, 150, 160]
    })
        
    print(df + s)
        
    # Adding when DataFrame column names don't match Series index
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({
        'a': [10, 20, 30, 40],
        'b': [50, 60, 70, 80],
        'c': [90, 100, 110, 120],
        'd': [130, 140, 150, 160]
    })
    print(df + s)
    
    

    df.add(s)                  # same as df + s
    df.add(s, axis='columns')  # match s against the column names (the default)
    df.add(s, axis='index')    # match s against the row index

    Adding a DataFrame and a Series adds the Series to every row of the DataFrame: values are matched by aligning the Series index with the DataFrame's column names.
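When the Series index matches the DataFrame's row index rather than its column names, plain + mis-aligns and produces NaN; add(..., axis='index') fixes that:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60]},
                  index=['r1', 'r2', 'r3'])
s = pd.Series([1, 2, 3], index=['r1', 'r2', 'r3'])

# df + s matches the Series index against the column names -> all NaN here.
print(df + s)

# add(..., axis='index') matches it against the row index instead.
result = df.add(s, axis='index')
print(result)  # row r1 gets +1, r2 gets +2, r3 gets +3
```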

    15. Standardizing Each Column, Revisited

    def standardize(df):
        '''
        Standardize each column
        '''
        return (df - df.mean()) / df.std(ddof=0)
    
    def standardize_rows(df):
        '''
        Standardize each row. Plain df - mean would align the Series index
        with the column names, so sub()/div() with axis='index' are needed
        to match it against the row index instead.
        '''
        mean = df.mean(axis='columns')
        mean_difference = df.sub(mean, axis='index')
        std = df.std(axis='columns', ddof=0)
        return mean_difference.div(std, axis='index')
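A sanity check on made-up data: after row standardization, every row should have mean 0 and population standard deviation 1:

```python
import pandas as pd

def standardize_rows(df):
    mean = df.mean(axis='columns')
    std = df.std(axis='columns', ddof=0)
    # sub()/div() with axis='index' align the Series with the row index.
    return df.sub(mean, axis='index').div(std, axis='index')

df = pd.DataFrame({'a': [1.0, 4.0], 'b': [2.0, 5.0], 'c': [3.0, 9.0]})
standardized = standardize_rows(df)

print(standardized.mean(axis='columns'))         # ~0 for every row
print(standardized.std(axis='columns', ddof=0))  # ~1 for every row
```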
    
    

    16. Pandas groupby()

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns
    
    values = np.array([1, 3, 2, 4, 1, 6, 4])
    example_df = pd.DataFrame({
        'value': values,
        'even': values % 2 == 0,
        'above_three': values > 3 
    }, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
    
    
    print(example_df)

    grouped_data = example_df.groupby('even')
    print(grouped_data.groups)

    # Group by multiple columns
    grouped_data = example_df.groupby(['even', 'above_three'])
    print(grouped_data.groups)

    # Get sum of each group
    grouped_data = example_df.groupby('even')
    print(grouped_data.sum())


    grouped_data = example_df.groupby('even')
    print(grouped_data.sum()['value'])
    print(grouped_data['value'].sum())
    

    17. Hourly Entries and Exits

    def hourly(column):
        return column - column.shift(1)

    def get_hourly_entries_and_exits(entries_and_exits):
        '''
        Fill in this function to take a DataFrame with cumulative entries
        and exits and return a DataFrame with hourly entries and exits.
        The hourly entries and exits should be calculated separately for
        each station (the 'UNIT' column).
        '''
        return entries_and_exits.groupby('UNIT')[['ENTRIESn', 'EXITSn']].apply(hourly)
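A minimal check, using groupby(...).diff(), which computes the same per-group differences as apply(hourly) (the values below are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'UNIT': ['R003', 'R003', 'R004', 'R004'],
    'ENTRIESn': [100, 130, 500, 560],
    'EXITSn': [50, 70, 200, 230]
})

# diff() restarts at each group, so differences never cross unit boundaries.
result = df.groupby('UNIT')[['ENTRIESn', 'EXITSn']].diff()
print(result)  # rows 0 and 2 (the first row of each UNIT) are NaN
```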

    18. Merging Pandas DataFrames

    import pandas as pd
    
    subway_df = pd.DataFrame({
        'UNIT': ['R003', 'R003', 'R003', 'R003', 'R003', 'R004', 'R004', 'R004',
                 'R004', 'R004'],
        'DATEn': ['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
                  '05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11'],
        'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        'ENTRIESn': [ 4388333,  4388348,  4389885,  4391507,  4393043, 14656120,
                     14656174, 14660126, 14664247, 14668301],
        'EXITSn': [ 2911002,  2911036,  2912127,  2913223,  2914284, 14451774,
                   14451851, 14454734, 14457780, 14460818],
        'latitude': [ 40.689945,  40.689945,  40.689945,  40.689945,  40.689945,
                      40.69132 ,  40.69132 ,  40.69132 ,  40.69132 ,  40.69132 ],
        'longitude': [-73.872564, -73.872564, -73.872564, -73.872564, -73.872564,
                      -73.867135, -73.867135, -73.867135, -73.867135, -73.867135]
    })
    
    weather_df = pd.DataFrame({
        'DATEn': ['05-01-11', '05-01-11', '05-02-11', '05-02-11', '05-03-11',
                  '05-03-11', '05-04-11', '05-04-11', '05-05-11', '05-05-11'],
        'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        'latitude': [ 40.689945,  40.69132 ,  40.689945,  40.69132 ,  40.689945,
                      40.69132 ,  40.689945,  40.69132 ,  40.689945,  40.69132 ],
        'longitude': [-73.872564, -73.867135, -73.872564, -73.867135, -73.872564,
                      -73.867135, -73.872564, -73.867135, -73.872564, -73.867135],
        'pressurei': [ 30.24,  30.24,  30.32,  30.32,  30.14,  30.14,  29.98,  29.98,
                       30.01,  30.01],
        'fog': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        'rain': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        'tempi': [ 52. ,  52. ,  48.9,  48.9,  54. ,  54. ,  57.2,  57.2,  48.9,  48.9],
        'wspdi': [  8.1,   8.1,   6.9,   6.9,   3.5,   3.5,  15. ,  15. ,  15. ,  15. ]
    })
    
    def combine_dfs(subway_df, weather_df):
        '''
        Fill in this function to take 2 DataFrames, one with subway data and one with weather data,
        and return a single dataframe with one row for each date, hour, and location. Only include
        times and locations that have both subway data and weather data available.
        '''
        return subway_df.merge(weather_df,
            on=['DATEn','hour','latitude','longitude'],
            how='inner')
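A smaller sketch of the inner-merge behavior (made-up values):

```python
import pandas as pd

subway = pd.DataFrame({
    'DATEn': ['05-01-11', '05-02-11', '05-03-11'],
    'hour': [0, 0, 0],
    'ENTRIESn': [4388333, 4388348, 4389885]
})
weather = pd.DataFrame({
    'DATEn': ['05-01-11', '05-02-11', '05-04-11'],
    'hour': [0, 0, 0],
    'tempi': [52.0, 48.9, 57.2]
})

# An inner merge keeps only the key combinations present in both frames:
# 05-03-11 and 05-04-11 each appear in only one frame, so they are dropped.
merged = subway.merge(weather, on=['DATEn', 'hour'], how='inner')
print(merged)
```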
    
