pandas Interview Question Challenge, Part 3

Author: 人工智能人话翻译官 | Published 2019-05-20 23:00

    11 Binning data with cut

    Given the following data, ages:

    ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32, 101]
    

    We want to bin this data into the ordered categories [(18, 25] < (25, 35] < (35, 60] < (60, 100]].

    Solution:

    import pandas as pd

    # Bin edges; by default each bin is open on the left and closed on the right
    bins = [18, 25, 35, 60, 100]
    cats = pd.cut(ages, bins)
    cats
    

    Output:

    [(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (60.0, 100.0], (35.0, 60.0], (35.0, 60.0], (25.0, 35.0], NaN]
    Length: 13
    Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
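
The NaN at the end of this output, and the bin that 25 lands in, both follow from the default right=True. As a minimal sketch (the names cats_left and cats_open, and the choice of an infinite top edge, are illustrative assumptions and not part of the original solution), the following compares the default with right=False and with an open-ended last bin:

```python
import numpy as np
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32, 101]
bins = [18, 25, 35, 60, 100]

# Default: right-closed bins, so 25 belongs to (18, 25] and 101 has no bin
cats = pd.cut(ages, bins)
print(cats.codes[-1])        # -1 marks a missing (NaN) category

# right=False makes the bins left-closed, so 25 moves into [25, 35)
cats_left = pd.cut(ages, bins, right=False)
print(cats_left[2])

# Assumption for illustration: replace the top edge with infinity
# so that values above 100 still fall into the last bin
cats_open = pd.cut(ages, [18, 25, 35, 60, np.inf])
print(cats_open.codes[-1])   # 3, the index of the last bin
```

Whether out-of-range values should become NaN or be absorbed into an open-ended bin depends on the task; the original solution keeps the NaN.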
    

    The cats object returned by pd.cut is a special kind of data that pandas calls a Categorical.
    A Categorical can easily be converted to a Series with pd.Series():

    ser_cats = pd.Series(cats)
    ser_cats
    

    Output:

    0      (18.0, 25.0]
    1      (18.0, 25.0]
    2      (18.0, 25.0]
    3      (25.0, 35.0]
    4      (18.0, 25.0]
    5      (18.0, 25.0]
    6      (35.0, 60.0]
    7      (25.0, 35.0]
    8     (60.0, 100.0]
    9      (35.0, 60.0]
    10     (35.0, 60.0]
    11     (25.0, 35.0]
    12              NaN
    dtype: category
    Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
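
To see what the category dtype actually stores, the Series can be inspected through its .cat accessor. This is a small sketch rather than part of the original solution; it relies only on the cat.categories, cat.codes, and cat.ordered attributes:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32, 101]
bins = [18, 25, 35, 60, 100]
ser_cats = pd.Series(pd.cut(ages, bins))

# The four interval categories, in order
print(ser_cats.cat.categories)

# Each value is stored as an integer code into the categories; NaN is -1
codes = ser_cats.cat.codes.tolist()
print(codes)   # [0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1, -1]

# pd.cut produces an ordered categorical
print(ser_cats.cat.ordered)   # True
```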
    

    These interval labels are not very readable, though. You can pass labels to get more human-friendly output:

    group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
    cats = pd.cut(ages, bins, labels=group_names)
    ser_cats = pd.Series(cats)
    ser_cats
    

    Output:

    0          Youth
    1          Youth
    2          Youth
    3     YoungAdult
    4          Youth
    5          Youth
    6     MiddleAged
    7     YoungAdult
    8         Senior
    9     MiddleAged
    10    MiddleAged
    11    YoungAdult
    12           NaN
    dtype: category
    Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
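
Because the labelled result is still an ordered categorical, it can be counted and compared by label. A hedged sketch (the names counts, mask, and older are illustrative and not from the original):

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32, 101]
bins = [18, 25, 35, 60, 100]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
ser_cats = pd.Series(pd.cut(ages, bins, labels=group_names))

# Count how many ages fall into each labelled group (NaN is not counted)
counts = ser_cats.value_counts().to_dict()
print(counts)

# Ordered categories support comparisons against one of the labels;
# the NaN entry compares as False
mask = ser_cats >= 'MiddleAged'
older = [age for age, keep in zip(ages, mask) if keep]
print(older)   # [37, 61, 45, 41]
```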
    

    12 Binning data with qcut

    Given the following data, ages:

    ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32, 101]
    

    We want to split the data into 4 categories, [(19.999, 23.0] < (23.0, 31.0] < (31.0, 41.0] < (41.0, 101.0]],
    that is, to bin the data at the quantiles [0, .25, .5, .75, 1.].

    Solution:

    ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32, 101]
    # q=4 bins the data at the quantiles [0, .25, .5, .75, 1.]
    cats = pd.qcut(ages, 4)
    cats
    

    Output:

    [(19.999, 23.0], (19.999, 23.0], (23.0, 31.0], (23.0, 31.0], (19.999, 23.0], ..., (41.0, 101.0], (41.0, 101.0], (31.0, 41.0], (31.0, 41.0], (41.0, 101.0]]
    Length: 13
    Categories (4, interval[float64]): [(19.999, 23.0] < (23.0, 31.0] < (31.0, 41.0] < (41.0, 101.0]]
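
To check which edges qcut chose, retbins=True returns them alongside the result, and passing the quantiles explicitly should give the same binning. A minimal sketch (the names edges, cats_explicit, and per_bin are illustrative):

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32, 101]

# retbins=True also returns the computed bin edges
cats, edges = pd.qcut(ages, 4, retbins=True)
print(edges.tolist())   # [20.0, 23.0, 31.0, 41.0, 101.0]

# Passing the quantiles explicitly is equivalent to q=4
cats_explicit = pd.qcut(ages, [0, .25, .5, .75, 1.])
print((cats.codes == cats_explicit.codes).all())   # True

# Unlike cut, qcut aims for roughly equal counts per bin
per_bin = pd.Series(cats).value_counts().sort_index().tolist()
print(per_bin)   # [4, 3, 3, 3]
```

The lowest interval is displayed as (19.999, 23.0] rather than (20.0, 23.0] so that the minimum value, 20, is included in the first right-closed bin.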
    

    Reference: the pandas cut & qcut functions


    Original link: https://www.haomeiwen.com/subject/oefrzqtx.html