(8) pandas Study Notes 3 - Python Data Analysis and Machine Learning in Practice

Author: 努力奋斗的durian | Published: 2018-05-03 20:39 | Read 57 times

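    Original article, last updated: 2018-05-03

    Introduction: an overview of the pandas Series

    To make it easy to follow along with this Series example, the file fandango_score_comparison.csv is shared via Baidu Netdisk. Link: https://pan.baidu.com/s/1U6z7OvXK75L1AGm1vYlN4w Password: qe1a

    Course source: Python Data Analysis and Machine Learning in Practice, by 唐宇迪

    A DataFrame is roughly a matrix (a 2-D table), and a Series is roughly one column of it (a single row taken from a DataFrame is also a Series). A Series consists of a sequence of values together with an index that labels them.
    For example, here is a small case: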

    import pandas as pd
    a=pd.Series([9,8,7,6])
    a
    Out[19]: 
    0    9
    1    8
    2    7
    3    6
    dtype: int64
    
    

    Below is a dataset of movie ratings and related fields. Let's see what is distinctive about the Series structure.

    import pandas as pd
    
    fandango=pd.read_csv('fandango_score_comparison.csv')
    
    series_film = fandango['FILM']
    
    type(series_film)
    Out[85]: pandas.core.series.Series
    

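    As shown above, fandango is a DataFrame. Selecting one of its columns with fandango['FILM'] returns a Series.

    How does selecting data from a Series differ from selecting data from a DataFrame?

    The usage is essentially the same: by index and by slice.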

    series_film = fandango['FILM']
    series_film[0:5]
    Out[84]: 
    0    Avengers: Age of Ultron (2015)
    1                 Cinderella (2015)
    2                    Ant-Man (2015)
    3            Do You Believe? (2015)
    4     Hot Tub Time Machine 2 (2015)
    Name: FILM, dtype: object
    
    series_rt = fandango['RottenTomatoes']
    series_rt[0:5]
    Out[87]: 
    0    74
    1    85
    2    80
    3    18
    4    14
    Name: RottenTomatoes, dtype: int64
    

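    How do we create a new Series?

    First, look at what series.values returns: an ndarray. The values of a Series are stored in a NumPy array. In other words, a DataFrame is made up of Series, and a Series is backed by an ndarray. pandas is built on top of NumPy.

    Many pandas operations are conveniences assembled from NumPy building blocks, and many operations work interchangeably between the two libraries.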

    film_names=series_film.values
    
    type(film_names)
    Out[89]: numpy.ndarray
    

    The next step creates a Series. To do that, import Series from pandas.

    from pandas  import Series
    

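    The string representation of a Series shows the index on the left and the values on the right. If no index is specified for the data, an integer index from 0 to N-1 (where N is the length of the data) is created automatically. The values and index attributes of a Series return its array representation and its index object, respectively.

    Unlike a plain NumPy array, a Series lets you use index labels to select a single value or a group of values.

    Example: create a Series in which each film name maps to the score given by one of the rating sources.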
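The values and index attributes and label-based selection can be sketched on a small Series. The labels and numbers below are hypothetical and chosen only for illustration; they are not taken from the Fandango file.

```python
import pandas as pd

# Hypothetical ratings, used only for illustration.
s = pd.Series([74, 85, 80], index=["film_a", "film_b", "film_c"])

values = s.values            # the underlying NumPy array
labels = s.index.tolist()    # the index labels as a plain Python list

single = s["film_b"]                # one value, selected by label
subset = s[["film_a", "film_c"]]    # a smaller Series, selected by a list of labels

print(type(values))
print(labels)
print(single)
print(subset)
```

Passing a list of labels returns a Series, while passing a single label returns the scalar value itself.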

    from pandas  import Series
    rt_scores = series_rt.values
    series_custom = Series(rt_scores , index=film_names)
    
    series_custom[['Minions (2015)', 'Leviathan (2014)']]
    Out[96]: 
    Minions (2015)      54
    Leviathan (2014)    99
    dtype: int64
    

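    How do we sort a Series?

    reindex does not rename the labels of a pandas object; it rearranges the object to follow a new sequence of labels. If a requested label does not exist in the original, that entry is filled with the missing value NaN. reindex does not modify the original object, so assign the result to a variable if you want to keep it.

    First, extract the film names by converting the index to a list.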

    original_index = series_custom.index.tolist()
    
    original_index
    Out[110]: 
    ['Avengers: Age of Ultron (2015)',
     'Cinderella (2015)',
     'Ant-Man (2015)',
     'Do You Believe? (2015)',
     'Hot Tub Time Machine 2 (2015)',
     ....
     'Mr. Holmes (2015)',
     "'71 (2015)",
     'Two Days, One Night (2014)',
     'Gett: The Trial of Viviane Amsalem (2015)',
     'Kumiko, The Treasure Hunter (2015)']
    

    Sort the film names. The sorted result is:

    sorted_index = sorted(original_index)
    
    sorted_index
    Out[112]: 
    ["'71 (2015)",
     '5 Flights Up (2015)',
     'A Little Chaos (2015)',
     'A Most Violent Year (2014)',
     'About Elly (2015)',
    ....
     'What We Do in the Shadows (2015)',
     'When Marnie Was There (2015)',
     "While We're Young (2015)",
     'Wild Tales (2014)',
     'Woman in Gold (2015)']
    

    Use reindex to reorder series_custom according to the sorted film names:

    sorted_by_index = series_custom.reindex(sorted_index)
    
    sorted_by_index
    Out[114]: 
    '71 (2015)                                         97
    5 Flights Up (2015)                                52
    A Little Chaos (2015)                              40
    A Most Violent Year (2014)                         90
    About Elly (2015)                                  97
    ....
    When Marnie Was There (2015)                       89
    While We're Young (2015)                           83
    Wild Tales (2014)                                  96
    Woman in Gold (2015)                               52
    Length: 146, dtype: int64
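A minimal sketch of how reindex treats labels, using a small hypothetical Series rather than the film data: labels that exist are reordered, a label that does not exist gets NaN, and the original Series is left untouched.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
s = pd.Series([3, 1, 2], index=["c", "a", "b"])

# "d" is not a label of s, so its entry in the result is NaN.
reordered = s.reindex(["a", "b", "c", "d"])

print(reordered)
print(s.index.tolist())   # the original order is unchanged
```

Because NaN is a floating-point value, the result is converted to float64 as soon as a missing label is introduced.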
    

    How do we sort a Series by its index or by its values?

    Use sort_index() to sort by index, giving sc2.

    sc2 = series_custom.sort_index()
    
    sc2
    Out[116]: 
    '71 (2015)                                         97
    5 Flights Up (2015)                                52
    A Little Chaos (2015)                              40
    A Most Violent Year (2014)                         90
    About Elly (2015)                                  97
    ....
    What We Do in the Shadows (2015)                   96
    When Marnie Was There (2015)                       89
    While We're Young (2015)                           83
    Wild Tales (2014)                                  96
    Woman in Gold (2015)                               52
    Length: 146, dtype: int64
    

    Use sort_values() to sort by value, giving sc3.

    sc3 = series_custom.sort_values()
    
    sc3
    Out[118]: 
    Paul Blart: Mall Cop 2 (2015)                    5
    Hitman: Agent 47 (2015)                          7
    Hot Pursuit (2015)                               8
    Fantastic Four (2015)                            9
    Taken 3 (2015)                                   9
    ....
    Song of the Sea (2014)                          99
    Phoenix (2015)                                  99
    Selma (2014)                                    99
    Seymour: An Introduction (2015)                100
    Gett: The Trial of Viviane Amsalem (2015)      100
    Length: 146, dtype: int64
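As a compact illustration of the two sorting methods, here is a sketch on three sample entries (shortened, hypothetical labels). It also shows the ascending parameter of sort_values, which the walkthrough above does not use.

```python
import pandas as pd

# Sample entries with shortened labels, used only for illustration.
s = pd.Series([54, 99, 9], index=["Minions", "Leviathan", "Taken 3"])

by_label = s.sort_index()                        # alphabetical by label
by_value_desc = s.sort_values(ascending=False)   # highest score first

print(by_label)
print(by_value_desc)
```

Both methods return a new Series; s keeps its original order.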
    
    

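    How do we add two Series?

    Adding two Series produces a new Series. pandas aligns the operands by index label: values that share a label are added together, and a label that appears in only one of the two Series yields NaN in the result.


    Here series_custom is added to itself with np.add. For reference, series_custom is: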

    series_custom
    Out[123]: 
    Avengers: Age of Ultron (2015)                     74
    Cinderella (2015)                                  85
    Ant-Man (2015)                                     80
    Do You Believe? (2015)                             18
    Hot Tub Time Machine 2 (2015)                      14
    ....
    Mr. Holmes (2015)                                  87
    '71 (2015)                                         97
    Two Days, One Night (2014)                         97
    Gett: The Trial of Viviane Amsalem (2015)         100
    Kumiko, The Treasure Hunter (2015)                 87
    Length: 146, dtype: int64
    

    np.add(a, b) is equivalent to a + b. The result of the addition is:

    import numpy as np
    np.add(series_custom, series_custom)  # equivalent to series_custom + series_custom
    Out[124]: 
    Avengers: Age of Ultron (2015)                    148
    Cinderella (2015)                                 170
    Ant-Man (2015)                                    160
    Do You Believe? (2015)                             36
    Hot Tub Time Machine 2 (2015)                      28
    ....
    Mr. Holmes (2015)                                 174
    '71 (2015)                                        194
    Two Days, One Night (2014)                        194
    Gett: The Trial of Viviane Amsalem (2015)         200
    Kumiko, The Treasure Hunter (2015)                174
    Length: 146, dtype: int64
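Adding a Series to itself hides what happens when the labels differ. The sketch below uses two small hypothetical Series whose indexes only partly overlap, and also shows Series.add with fill_value as one way to treat a missing entry as 0.

```python
import pandas as pd

# Hypothetical data: the labels "y" and "z" are shared, "x" and "w" are not.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

total = a + b                      # unmatched labels become NaN
filled = a.add(b, fill_value=0)    # a missing side is treated as 0

print(total)
print(filled)
```

Whether NaN or a fill value is the right choice depends on what a missing score should mean in the analysis.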
    

    Use np.sin() to compute the sine of every value in the Series

    np.sin(series_custom)
    Out[126]: 
    Avengers: Age of Ultron (2015)                   -0.985146
    Cinderella (2015)                                -0.176076
    Ant-Man (2015)                                   -0.993889
    Do You Believe? (2015)                           -0.750987
    Hot Tub Time Machine 2 (2015)                     0.990607
    ....
    Mr. Holmes (2015)                                -0.821818
    '71 (2015)                                        0.379608
    Two Days, One Night (2014)                        0.379608
    Gett: The Trial of Viviane Amsalem (2015)        -0.506366
    Kumiko, The Treasure Hunter (2015)               -0.821818
    Length: 146, dtype: float64
    

    Find the maximum value of series_custom with np.max()

    np.max(series_custom)
    Out[127]: 100
    

    Test which values in series_custom are greater than 50

    series_custom > 50
    Out[128]: 
    Avengers: Age of Ultron (2015)                     True
    Cinderella (2015)                                  True
    Ant-Man (2015)                                     True
    Do You Believe? (2015)                            False
    Hot Tub Time Machine 2 (2015)                     False
    ....
    Mr. Holmes (2015)                                  True
    '71 (2015)                                         True
    Two Days, One Night (2014)                         True
    Gett: The Trial of Viviane Amsalem (2015)          True
    Kumiko, The Treasure Hunter (2015)                 True
    Length: 146, dtype: bool
    

    Select the values in series_custom that are greater than 50

    series_greater_than_50 = series_custom[series_custom > 50]
    
    series_greater_than_50
    Out[130]: 
    Avengers: Age of Ultron (2015)                                             74
    Cinderella (2015)                                                          85
    Ant-Man (2015)                                                             80
    The Water Diviner (2015)                                                   63
    Top Five (2014)                                                            86
    ....
    Mr. Holmes (2015)                                                          87
    '71 (2015)                                                                 97
    Two Days, One Night (2014)                                                 97
    Gett: The Trial of Viviane Amsalem (2015)                                 100
    Kumiko, The Treasure Hunter (2015)                                         87
    Length: 94, dtype: int64
    

    Select the values in series_custom that are greater than 50 and less than 75

    criteria_one = series_custom > 50
    
    criteria_two = series_custom < 75
    
    both_criteria = series_custom[criteria_one & criteria_two]
    
    both_criteria
    Out[134]: 
    Avengers: Age of Ultron (2015)                                            74
    The Water Diviner (2015)                                                  63
    Unbroken (2014)                                                           51
    Southpaw (2015)                                                           59
    Insidious: Chapter 3 (2015)                                               59
    The Man From U.N.C.L.E. (2015)                                            68
    ....
    Woman in Gold (2015)                                                      52
    The Last Five Years (2015)                                                60
    Jurassic World (2015)                                                     71
    Minions (2015)                                                            54
    Spare Parts (2015)                                                        52
    dtype: int64
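The same filter can be written in one expression. A small sketch on hypothetical scores follows; note that each comparison needs its own parentheses, because & and | bind more tightly than > and < in Python.

```python
import pandas as pd

# Hypothetical scores, used only for illustration.
scores = pd.Series([74, 18, 63, 97], index=["a", "b", "c", "d"])

# Element-wise AND: strictly between 50 and 75.
mid_range = scores[(scores > 50) & (scores < 75)]

# Element-wise OR: everything outside that range.
outside = scores[(scores <= 50) | (scores >= 75)]

print(mid_range)
print(outside)
```

The Python keywords and / or do not work here, since they expect a single truth value rather than a Series of booleans.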
    

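    How do we give two Series the same index, and how are they combined?

    When two Series share the same index, values with matching labels are combined pairwise, producing a new Series.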

    rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
    
    rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
    
    rt_mean = (rt_critics + rt_users)/2
    
    rt_mean
    Out[138]: 
    FILM
    Avengers: Age of Ultron (2015)                    80.0
    Cinderella (2015)                                 82.5
    Ant-Man (2015)                                    85.0
    Do You Believe? (2015)                            51.0
    Hot Tub Time Machine 2 (2015)                     21.0
    ....
    Inside Out (2015)                                 94.0
    Mr. Holmes (2015)                                 82.5
    '71 (2015)                                        89.5
    Two Days, One Night (2014)                        87.5
    Gett: The Trial of Viviane Amsalem (2015)         90.5
    Kumiko, The Treasure Hunter (2015)                75.0
    Length: 146, dtype: float64
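Matching is done by label, not by position. As a sketch with hypothetical film labels and scores, two Series that hold the same labels in a different order still line up correctly:

```python
import pandas as pd

# Hypothetical data: the same two labels, stored in opposite order.
critics = pd.Series([74, 85], index=["Film A", "Film B"])
users = pd.Series([80, 86], index=["Film B", "Film A"])

mean = (critics + users) / 2

print(mean)
```

Here "Film A" averages 74 and 86, and "Film B" averages 85 and 80, regardless of the row order in either input.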
    

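    How do we set an index?

    More on set_index:
    A DataFrame's set_index method can set either a single index or a compound (multi-level) index.
    DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
    append=True adds the new index to the existing one instead of replacing it; drop=False keeps the column used as the index as a regular column too; inplace=True modifies the DataFrame itself instead of returning a new one.

    The default index of fandango runs from 0 to 145 (146 rows).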
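The effect of the drop and append parameters can be sketched on a tiny hypothetical DataFrame (the column name score and the values are made up for illustration):

```python
import pandas as pd

# Hypothetical two-row table, used only for illustration.
df = pd.DataFrame({"FILM": ["A", "B"], "score": [74, 85]})

dropped = df.set_index("FILM")                 # default drop=True: FILM leaves the columns
kept = df.set_index("FILM", drop=False)        # FILM is both the index and a column
appended = df.set_index("FILM", append=True)   # FILM is added as a second index level

print(dropped.columns.tolist())
print(kept.columns.tolist())
print(appended.index.nlevels)
print(df.index.tolist())   # df itself is unchanged
```

Without inplace=True, each call returns a new DataFrame and leaves df as it was.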

    fandango=pd.read_csv('fandango_score_comparison.csv')
    fandango.index
    Out[149]: RangeIndex(start=0, stop=146, step=1)
    

    Use set_index to replace the integer index 0 to 145 with the values of the 'FILM' column. The result is:

    fandango_films = fandango.set_index('FILM', drop=False)
    fandango_films.index
    Out[140]: 
    Index(['Avengers: Age of Ultron (2015)', 'Cinderella (2015)', 'Ant-Man (2015)',
           'Do You Believe? (2015)', 'Hot Tub Time Machine 2 (2015)',
           'The Water Diviner (2015)', 'Irrational Man (2015)', 'Top Five (2014)',
           'Shaun the Sheep Movie (2015)', 'Love & Mercy (2015)',
           ...
           'The Woman In Black 2 Angel of Death (2015)', 'Danny Collins (2015)',
           'Spare Parts (2015)', 'Serena (2015)', 'Inside Out (2015)',
           'Mr. Holmes (2015)', ''71 (2015)', 'Two Days, One Night (2014)',
           'Gett: The Trial of Viviane Amsalem (2015)',
           'Kumiko, The Treasure Hunter (2015)'],
          dtype='object', name='FILM', length=146)
    
    

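    Slicing on a label index

    Just as integer positions can be sliced, string labels can be sliced with a colon. The slice follows the order of the labels in the index, and both endpoints are included: on an index a, b, c, the slice 'a':'c' returns rows a, b, and c. All columns of the matching rows are returned. The usage mirrors slicing with integer positions.

    Example: slice the rows from "Avengers: Age of Ultron (2015)" through "Hot Tub Time Machine 2 (2015)".

    fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"] is equivalent to fandango_films.loc["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"].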

    fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]
    Out[147]: 
                                                              FILM  \
    FILM                                                             
    Avengers: Age of Ultron (2015)  Avengers: Age of Ultron (2015)   
    Cinderella (2015)                            Cinderella (2015)   
    Ant-Man (2015)                                  Ant-Man (2015)   
    Do You Believe? (2015)                  Do You Believe? (2015)   
    Hot Tub Time Machine 2 (2015)    Hot Tub Time Machine 2 (2015)   
    
    
                                    RT_user_norm         ...           IMDB_norm  \
    FILM                                                 ...                       
    Avengers: Age of Ultron (2015)           4.3         ...                3.90   
    Cinderella (2015)                        4.0         ...                3.55   
    Ant-Man (2015)                           4.5         ...                3.90   
    Do You Believe? (2015)                   4.2         ...                2.70   
    Hot Tub Time Machine 2 (2015)            1.4         ...                2.55   
    
                                    RT_norm_round  RT_user_norm_round  \
    
                                    Fandango_Difference  
    FILM                                                 
    Avengers: Age of Ultron (2015)                  0.5  
    Cinderella (2015)                               0.5  
    Ant-Man (2015)                                  0.5  
    Do You Believe? (2015)                          0.5  
    Hot Tub Time Machine 2 (2015)                   0.5  
    
    [5 rows x 22 columns]
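Two properties of label slices are easy to miss: the end label is included, and the slice follows the row order of the index rather than alphabetical order. A sketch on a hypothetical DataFrame with unique, unsorted labels:

```python
import pandas as pd

# Hypothetical data: unique labels stored in a non-alphabetical order.
df = pd.DataFrame({"score": [1, 2, 3, 4]}, index=["d", "b", "a", "c"])

by_label = df.loc["b":"a"]     # rows from "b" through "a", both endpoints included
by_position = df.iloc[1:3]     # positions 1 and 2; the end position is excluded

print(by_label)
print(by_position)
```

Both expressions select the same two rows here, but only the positional slice drops its end point.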
    
    

    A similar short exercise:

    # Look up the row for a single index label
    fandango_films.loc['Kumiko, The Treasure Hunter (2015)']
    # Look up the rows for three index labels
    movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']
    fandango_films.loc[movies]
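The two lookups return different types. As a sketch on a hypothetical DataFrame, a single label gives a Series (one row), while a list of labels gives a DataFrame whose rows follow the order of the list:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"score": [74, 85, 80]}, index=["A", "B", "C"])

one = df.loc["B"]              # a Series holding the row labelled "B"
several = df.loc[["C", "A"]]   # a DataFrame, rows in the requested order

print(type(one).__name__)
print(type(several).__name__)
print(several)
```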
    

    How do we inspect data types and select columns by type?

    The dtypes attribute reports the data type of each column of the DataFrame:

    import numpy as np
    
    types = fandango_films.dtypes
    
    types
    Out[158]: 
    FILM                           object
    RottenTomatoes                  int64
    RottenTomatoes_User             int64
    Metacritic                      int64
    Metacritic_User               float64
    ....
    IMDB_norm_round               float64
    Metacritic_user_vote_count      int64
    IMDB_user_vote_count            int64
    Fandango_votes                  int64
    Fandango_Difference           float64
    dtype: object
    
    

    Get the labels (column names) whose data type is float64

    float_columns = types[types.values == 'float64'].index
    
    float_columns
    Out[160]: 
    Index(['Metacritic_User', 'IMDB', 'Fandango_Stars', 'Fandango_Ratingvalue',
           'RT_norm', 'RT_user_norm', 'Metacritic_norm', 'Metacritic_user_nom',
           'IMDB_norm', 'RT_norm_round', 'RT_user_norm_round',
           'Metacritic_norm_round', 'Metacritic_user_norm_round',
           'IMDB_norm_round', 'Fandango_Difference'],
          dtype='object')
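pandas also provides DataFrame.select_dtypes, which picks columns by type in a single call. The sketch below, on a hypothetical two-column table, shows it next to the dtypes-based approach used above; it is an alternative, not the method the original walkthrough uses.

```python
import pandas as pd

# Hypothetical columns: one integer, one float.
df = pd.DataFrame({"votes": [10, 20], "rating": [7.8, 7.1]})

# Approach from the text: filter the dtypes Series, then take its labels.
types = df.dtypes
float_columns = types[types == "float64"].index

# Alternative: let pandas select the columns directly.
float_df = df.select_dtypes(include=["float64"])

print(float_columns.tolist())
print(float_df.columns.tolist())
```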
    

    Use the float64 column names obtained above to select those columns, with all of their rows

    float_df = fandango_films[float_columns]
    
    float_df
    Out[162]: 
                                                    Metacritic_User  IMDB  \
    FILM                                                                    
    Avengers: Age of Ultron (2015)                              7.1   7.8   
    Cinderella (2015)                                           7.5   7.1   
    Ant-Man (2015)                                              8.1   7.8   
    Do You Believe? (2015)                                      4.7   5.4   
    Hot Tub Time Machine 2 (2015)                               3.4   5.1   
    The Water Diviner (2015)                                    6.8   7.2   
    Irrational Man (2015)                                       7.6   6.9   
    Top Five (2014)                                             6.8   6.5   
    Shaun the Sheep Movie (2015)                                8.8   7.4   
    Love & Mercy (2015)                                         8.5   7.8   
    Far From The Madding Crowd (2015)                           7.5   7.2   
    Black Sea (2015)                                            6.6   6.4   
    Leviathan (2014)                                            7.2   7.7   
    Unbroken (2014)                                             6.5   7.2   
    The Imitation Game (2014)                                   8.2   8.1   
    Taken 3 (2015)                                              4.6   6.1   
    Ted 2 (2015)                                                6.5   6.6   
    Southpaw (2015)                                             8.2   7.8   
    Night at the Museum: Secret of the Tomb (2014)              5.8   6.3   
    Pixels (2015)                                               5.3   5.6   
    McFarland, USA (2015)                                       7.2   7.5   
    Insidious: Chapter 3 (2015)                                 6.9   6.3   
    The Man From U.N.C.L.E. (2015)                              7.9   7.6   
    Run All Night (2015)                                        7.3   6.6   
    Trainwreck (2015)                                           6.0   6.7   
    Selma (2014)                                                7.1   7.5   
    Ex Machina (2015)                                           7.9   7.7   
    Still Alice (2015)                                          7.8   7.5   
    Wild Tales (2014)                                           8.8   8.2   
    The End of the Tour (2015)                                  7.5   7.9   
                                                                ...  
    Clouds of Sils Maria (2015)                                     0.1  
    Testament of Youth (2015)                                       0.1  
    Infinitely Polar Bear (2015)                                    0.1  
    Phoenix (2015)                                                  0.1  
    The Wolfpack (2015)                                             0.1  
    The Stanford Prison Experiment (2015)                           0.1  
    Tangerine (2015)                                                0.1  
    Magic Mike XXL (2015)                                           0.1  
    Home (2015)                                                     0.1  
    The Wedding Ringer (2015)                                       0.1  
    Woman in Gold (2015)                                            0.1  
    The Last Five Years (2015)                                      0.1  
    Mission: Impossible – Rogue Nation (2015)                     0.1  
    Amy (2015)                                                      0.1  
    Jurassic World (2015)                                           0.0  
    Minions (2015)                                                  0.0  
    Max (2015)                                                      0.0  
    Paul Blart: Mall Cop 2 (2015)                                   0.0  
    The Longest Ride (2015)                                         0.0  
    The Lazarus Effect (2015)                                       0.0  
    The Woman In Black 2 Angel of Death (2015)                      0.0  
    Danny Collins (2015)                                            0.0  
    Spare Parts (2015)                                              0.0  
    Serena (2015)                                                   0.0  
    Inside Out (2015)                                               0.0  
    Mr. Holmes (2015)                                               0.0  
    '71 (2015)                                                      0.0  
    Two Days, One Night (2014)                                      0.0  
    Gett: The Trial of Viviane Amsalem (2015)                       0.0  
    Kumiko, The Treasure Hunter (2015)                              0.0  
    
    [146 rows x 15 columns]
    

    Use apply with np.std() to compute the standard deviation of each column

    deviations = float_df.apply(lambda x: np.std(x))
    
    deviations
    Out[165]: 
    Metacritic_User               1.505529
    IMDB                          0.955447
    Fandango_Stars                0.538532
    Fandango_Ratingvalue          0.501106
    RT_norm                       1.503265
    RT_user_norm                  0.997787
    Metacritic_norm               0.972522
    Metacritic_user_nom           0.752765
    IMDB_norm                     0.477723
    RT_norm_round                 1.509404
    RT_user_norm_round            1.003559
    Metacritic_norm_round         0.987561
    Metacritic_user_norm_round    0.785412
    IMDB_norm_round               0.501043
    Fandango_Difference           0.152141
    dtype: float64
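One detail to keep in mind when comparing standard deviations: NumPy's np.std defaults to ddof=0 (population standard deviation), while the pandas method Series.std defaults to ddof=1 (sample standard deviation). A sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only for illustration.
s = pd.Series([1.0, 2.0, 3.0, 4.0])

population_std = np.std(s.to_numpy())   # NumPy default: ddof=0
sample_std = s.std()                    # pandas default: ddof=1
matched = s.std(ddof=0)                 # pandas asked for the population version

print(population_std)
print(sample_std)
print(matched)
```

Passing ddof explicitly removes any ambiguity about which definition is being used.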
    

    A similar short exercise:

    rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]
    rt_mt_user.apply(lambda x: np.std(x), axis=1)
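The axis argument decides what the function passed to apply receives. As a sketch on a hypothetical DataFrame, using the range (max minus min) as a simple stand-in statistic: with the default axis=0 the function sees one column at a time, and with axis=1 it sees one row at a time.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"a": [1.0, 3.0], "b": [2.0, 6.0]}, index=["r1", "r2"])

# axis=0 (default): one result per column.
per_column = df.apply(lambda col: col.max() - col.min())

# axis=1: one result per row.
per_row = df.apply(lambda row: row.max() - row.min(), axis=1)

print(per_column)
print(per_row)
```

In the exercise above, axis=1 therefore yields one value per film, computed across the two selected rating columns.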
    


      Article link: https://www.haomeiwen.com/subject/nullrftx.html