Reproducing Python for Data Analysis, 2nd Edition (13)_2

Author: 一行白鹭上青天 | Published 2020-02-09 11:28

    Worked Examples

    14.2 MovieLens 1M Dataset

    This dataset contains movie ratings, split across three tables: users, ratings, and movies.

    import pandas as pd
    
    # Show at most 10 rows when printing a DataFrame
    pd.options.display.max_rows = 10
    
    # The .dat files use '::' as the field separator. A multi-character
    # separator is treated as a regex, which the C engine does not support;
    # passing engine='python' explicitly avoids the ParserWarning.
    unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
    users = pd.read_table('datasets/movielens/users.dat', sep='::',
                          header=None, names=unames, engine='python')
    
    rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
    ratings = pd.read_table('datasets/movielens/ratings.dat', sep='::',
                            header=None, names=rnames, engine='python')
    
    mnames = ['movie_id', 'title', 'genres']
    movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                           header=None, names=mnames, engine='python')
    
    print(users[:5])
    
       user_id gender  age  occupation    zip
    0        1      F    1          10  48067
    1        2      M   56          16  70072
    2        3      M   25          15  55117
    3        4      M   45           7  02460
    4        5      M   25          20  55455
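
    The loading step can be tried without the data files. The sketch below is only an illustration: it assumes the raw file layout is `field::field::...`, reuses three rows from the output above, and the name `sample_users` and the `dtype={'zip': str}` argument are additions that are not in the original code. On a sample this small the zip column would otherwise be inferred as integers and lose the leading zero of 02460.

```python
import io

import pandas as pd

# Three rows taken from the printed output above, written in the
# assumed 'field::field' layout of users.dat.
sample = io.StringIO(
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "4::M::45::7::02460\n"
)
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A multi-character separator needs the Python parsing engine.
# dtype={'zip': str} (an addition for this sketch) keeps zip codes as text.
sample_users = pd.read_csv(sample, sep='::', header=None, names=unames,
                           engine='python', dtype={'zip': str})
print(sample_users)
```

    Reading zip codes as strings is a design choice: they are identifiers, not quantities, so nothing is lost by skipping numeric parsing.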
    
    print(ratings[:5])
    
       user_id  movie_id  rating  timestamp
    0        1      1193       5  978300760
    1        1       661       3  978302109
    2        1       914       3  978301968
    3        1      3408       4  978300275
    4        1      2355       5  978824291
    
    print(movies[:5])
    
       movie_id                               title                        genres
    0         1                    Toy Story (1995)   Animation|Children's|Comedy
    1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
    2         3             Grumpier Old Men (1995)                Comedy|Romance
    3         4            Waiting to Exhale (1995)                  Comedy|Drama
    4         5  Father of the Bride Part II (1995)                        Comedy
    
    print(ratings)
    
             user_id  movie_id  rating  timestamp
    0              1      1193       5  978300760
    1              1       661       3  978302109
    2              1       914       3  978301968
    3              1      3408       4  978300275
    4              1      2355       5  978824291
    ...          ...       ...     ...        ...
    1000204     6040      1091       1  956716541
    1000205     6040      1094       5  956704887
    1000206     6040       562       5  956704746
    1000207     6040      1096       4  956715648
    1000208     6040      1097       4  956715569
    
    [1000209 rows x 4 columns]
    
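    # pd.merge joins two tables on the column names they share
    # (user_id first, then movie_id), so nesting two calls combines all three tables.
    data = pd.merge(pd.merge(ratings, users), movies)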
    print(data)
    
             user_id  movie_id  rating  timestamp gender  age  occupation    zip  \
    0              1      1193       5  978300760      F    1          10  48067   
    1              2      1193       5  978298413      M   56          16  70072   
    2             12      1193       4  978220179      M   25          12  32793   
    3             15      1193       4  978199279      M   25           7  22903   
    4             17      1193       5  978158471      M   50           1  95350   
    ...          ...       ...     ...        ...    ...  ...         ...    ...   
    1000204     5949      2198       5  958846401      M   18          17  47901   
    1000205     5675      2703       3  976029116      M   35          14  30030   
    1000206     5780      2845       1  958153068      M   18          17  92886   
    1000207     5851      3607       5  957756608      F   18          20  55410   
    1000208     5938      2909       4  957273353      M   25           1  35401   
    
                                                   title                genres  
    0             One Flew Over the Cuckoo's Nest (1975)                 Drama  
    1             One Flew Over the Cuckoo's Nest (1975)                 Drama  
    2             One Flew Over the Cuckoo's Nest (1975)                 Drama  
    3             One Flew Over the Cuckoo's Nest (1975)                 Drama  
    4             One Flew Over the Cuckoo's Nest (1975)                 Drama  
    ...                                              ...                   ...  
    1000204                           Modulations (1998)           Documentary  
    1000205                        Broken Vessels (1998)                 Drama  
    1000206                            White Boys (1999)                 Drama  
    1000207                     One Little Indian (1973)  Comedy|Drama|Western  
    1000208  Five Wives, Three Secretaries and Me (1998)           Documentary  
    
    [1000209 rows x 10 columns]
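
    To make the join keys visible, here is a minimal sketch of the same two-step merge on made-up miniature tables. The names `ratings_small`, `users_small`, `movies_small` and their contents are hypothetical; only the column names mirror the real data. Passing `on=` spells out the key that `pd.merge` otherwise infers from the overlapping column names.

```python
import pandas as pd

# Made-up miniature tables; only the column names mirror the real data.
ratings_small = pd.DataFrame({'user_id': [1, 1, 2],
                              'movie_id': [10, 20, 10],
                              'rating': [5, 3, 4]})
users_small = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_small = pd.DataFrame({'movie_id': [10, 20],
                             'title': ['Movie A', 'Movie B']})

# Same shape of join as pd.merge(pd.merge(ratings, users), movies),
# but with the join keys written explicitly.
merged = (ratings_small
          .merge(users_small, on='user_id')
          .merge(movies_small, on='movie_id'))
print(merged)
```

    Each rating row picks up the matching user and movie columns, so the row count stays equal to the number of ratings as long as every key has a match.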
    
    data.iloc[0]
    
    user_id                                            1
    movie_id                                        1193
    rating                                             5
    timestamp                                  978300760
    gender                                             F
    age                                                1
    occupation                                        10
    zip                                            48067
    title         One Flew Over the Cuckoo's Nest (1975)
    genres                                         Drama
    Name: 0, dtype: object
    
    # pivot_table aggregates while reshaping: here aggfunc='mean' gives
    # the mean rating of each title, with one column per gender.
    mean_ratings = data.pivot_table('rating', index='title',
                                    columns='gender', aggfunc='mean')
    print(mean_ratings[:5])
    
    gender                                F         M
    title                                            
    $1,000,000 Duck (1971)         3.375000  2.761905
    'Night Mother (1986)           3.388889  3.352941
    'Til There Was You (1997)      2.675676  2.733333
    'burbs, The (1989)             2.793478  2.962085
    ...And Justice for All (1979)  3.828571  3.689024
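
    One way to read the pivot table is as a group-by followed by a reshape. The sketch below uses made-up ratings (the frame `toy` is hypothetical) and computes the same per-title, per-gender means both ways.

```python
import pandas as pd

# Made-up ratings for illustration only.
toy = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 5, 3, 2, 4],
})

via_pivot = toy.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')

# Group by both keys, average, then move 'gender' into the columns.
via_groupby = toy.groupby(['title', 'gender'])['rating'].mean().unstack()

print(via_pivot)
print(via_groupby)
```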
    
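    # As in the book, keep only titles that received at least 250 ratings.
    ratings_by_title = data.groupby('title').size()  # number of ratings per title
    active_titles = ratings_by_title.index[ratings_by_title >= 250]  # titles that pass the threshold
    mean_ratings = mean_ratings.loc[active_titles]  # select those titles from the mean ratings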
    
    
    # Sort by the F (female) column in descending order
    top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
    print(top_female_ratings[:10])
    
    gender                                                     F         M
    title                                                                 
    Close Shave, A (1995)                               4.644444  4.473795
    Wrong Trousers, The (1993)                          4.588235  4.478261
    Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.572650  4.464589
    Wallace & Gromit: The Best of Aardman Animation...  4.563107  4.385075
    Schindler's List (1993)                             4.562602  4.491415
    Shawshank Redemption, The (1994)                    4.539075  4.560625
    Grand Day Out, A (1992)                             4.537879  4.293255
    To Kill a Mockingbird (1962)                        4.536667  4.372611
    Creature Comforts (1990)                            4.513889  4.272277
    Usual Suspects, The (1995)                          4.513317  4.518248
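
    The filtering step used before this ranking (count per title, boolean mask on the index, then `.loc`) can be sketched on made-up data. Here `toy` and `min_ratings` are hypothetical, with a threshold of 2 standing in for the 250 used above.

```python
import pandas as pd

# Made-up ratings for illustration only.
toy = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie A',
               'Movie B', 'Movie C', 'Movie C'],
    'rating': [4, 5, 3, 2, 4, 5],
})
min_ratings = 2  # stand-in for the threshold of 250

counts = toy.groupby('title').size()              # ratings per title
popular = counts.index[counts >= min_ratings]     # titles that pass
mean_by_title = toy.groupby('title')['rating'].mean()

# Movie B has a single rating, so it is dropped from the result.
print(mean_by_title.loc[popular])
```

    Filtering first keeps titles with only a handful of ratings from dominating a ranking by mean.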
    

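    Measuring Rating Disagreement

    To measure how much the two groups disagree, add a column holding the difference between the male and female mean ratings.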

    mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
    # Sort by the new column in ascending order, so titles rated higher by women come first
    sorted_by_diff = mean_ratings.sort_values(by='diff')
    print(sorted_by_diff[:10])
    
    gender                                        F         M      diff
    title                                                              
    Dirty Dancing (1987)                   3.790378  2.959596 -0.830782
    Jumpin' Jack Flash (1986)              3.254717  2.578358 -0.676359
    Grease (1978)                          3.975265  3.367041 -0.608224
    Little Women (1994)                    3.870588  3.321739 -0.548849
    Steel Magnolias (1989)                 3.901734  3.365957 -0.535777
    Anastasia (1997)                       3.800000  3.281609 -0.518391
    Rocky Horror Picture Show, The (1975)  3.673016  3.160131 -0.512885
    Color Purple, The (1985)               4.158192  3.659341 -0.498851
    Age of Innocence, The (1993)           3.827068  3.339506 -0.487561
    Free Willy (1993)                      2.921348  2.438776 -0.482573
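
    Because `diff` is M minus F, the sign carries the direction of the disagreement. A minimal sketch on made-up means (the frame `toy_means` and its values are hypothetical) shows that sorting in the opposite direction brings up the titles men rated higher.

```python
import pandas as pd

# Made-up mean ratings for illustration only.
toy_means = pd.DataFrame(
    {'F': [3.8, 3.1, 4.2], 'M': [3.0, 3.9, 4.1]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'),
)
toy_means['diff'] = toy_means['M'] - toy_means['F']

# Ascending: most negative diff first, i.e. titles women rated higher.
preferred_by_women = toy_means.sort_values(by='diff')
# Descending: most positive diff first, i.e. titles men rated higher.
preferred_by_men = toy_means.sort_values(by='diff', ascending=False)

print(preferred_by_women.index[0])
print(preferred_by_men.index[0])
```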
    

    Computing the Standard Deviation of Ratings

    rating_std_by_title = data.groupby('title')['rating'].std()
    rating_std_by_title = rating_std_by_title.loc[active_titles]
    print(rating_std_by_title.sort_values(ascending=False)[:10])
    
    title
    Dumb & Dumber (1994)                     1.321333
    Blair Witch Project, The (1999)          1.316368
    Natural Born Killers (1994)              1.307198
    Tank Girl (1995)                         1.277695
    Rocky Horror Picture Show, The (1975)    1.260177
    Eyes Wide Shut (1999)                    1.259624
    Evita (1996)                             1.253631
    Billy Madison (1995)                     1.249970
    Fear and Loathing in Las Vegas (1998)    1.246408
    Bicentennial Man (1999)                  1.245533
    Name: rating, dtype: float64
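
    A high standard deviation means the ratings for a title are spread out, regardless of gender. The sketch below uses made-up ratings (`toy` is hypothetical) to show that pandas computes the sample standard deviation (ddof=1) by default, and checks one value against NumPy.

```python
import numpy as np
import pandas as pd

# Made-up ratings: Movie A is rated consistently, Movie B is polarizing.
toy = pd.DataFrame({
    'title':  ['Movie A'] * 4 + ['Movie B'] * 4,
    'rating': [3, 3, 4, 4, 1, 5, 1, 5],
})

# pandas uses the sample standard deviation (ddof=1) by default.
std_by_title = toy.groupby('title')['rating'].std()
print(std_by_title)

# The same quantity for Movie B with NumPy, where ddof must be set
# explicitly because np.std defaults to ddof=0.
manual = np.std([1, 5, 1, 5], ddof=1)
print(manual)
```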
    

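    Notes:
    This series reproduces the content of the reference linked below.
    Original link: https://www.jianshu.com/p/04d180d90a3f
    The author of that post provides the book and related resources there. Since I have already picked up bits and pieces along the way, this series may reproduce only parts of the book and may also include material of my own; the exact scope is not settled yet. Thanks to the original Jianshu author SeanCheney for sharing.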
