58. gss_cat: a useful dataset for learning about factors

Author: 心惊梦醒 | Published: 2021-09-13 22:41

    [Previous: 57. The four essentials of factors: creating factors]
    [Next: 59. Adjusting the order of a factor's levels (part 1)]

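        forcats::gss_cat is a sample of data from the General Social Survey, a long-running US survey conducted by NORC, an independent research organization at the University of Chicago. The survey contains thousands of questions, and the categorical variables included in gss_cat are well suited to demonstrating the common challenges of working with factors.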

    > gss_cat
    # A tibble: 21,483 x 9
        year marital     age race  rincome   partyid     relig    denom    tvhours
       <int> <fct>     <int> <fct> <fct>     <fct>       <fct>    <fct>      <int>
     1  2000 Never ma~    26 White $8000 to~ Ind,near r~ Protest~ Souther~      12
     2  2000 Divorced     48 White $8000 to~ Not str re~ Protest~ Baptist~      NA
     3  2000 Widowed      67 White Not appl~ Independent Protest~ No deno~       2
     4  2000 Never ma~    39 White Not appl~ Ind,near r~ Orthodo~ Not app~       4
     5  2000 Divorced     25 White Not appl~ Not str de~ None     Not app~       1
     6  2000 Married      25 White $20000 -~ Strong dem~ Protest~ Souther~      NA
     7  2000 Never ma~    36 White $25000 o~ Not str re~ Christi~ Not app~       3
     8  2000 Divorced     44 White $7000 to~ Ind,near d~ Protest~ Luthera~      NA
     9  2000 Married      44 White $25000 o~ Not str de~ Protest~ Other          0
    10  2000 Married      47 White $25000 o~ Strong rep~ Protest~ Souther~       3
    # ... with 21,473 more rows
    
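    year: survey year, 2000-2014
    age: age in years; the maximum age is truncated to 89
    marital: marital status
    race: race
    rincome: reported income
    partyid: party affiliation
    relig: religion, e.g. Protestant, Catholic, Buddhism
    denom: denomination, i.e. a branch within a religion, e.g. Southern baptist or United methodist within Protestant
    tvhours: hours per day spent watching TV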
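
The column descriptions above can be checked directly against the data. The following is a minimal sketch, assuming the forcats package is installed; the names `col_classes` and `factor_cols` are introduced here for illustration only.

```r
library(forcats)  # provides the gss_cat dataset

# First class of every column: "factor" for factor columns, "integer" otherwise
col_classes <- sapply(gss_cat, function(col) class(col)[1])
print(col_classes)

# Names of the columns that are stored as factors
factor_cols <- names(col_classes)[col_classes == "factor"]
print(factor_cols)

# Survey years covered, and the largest recorded age (NA values ignored)
print(range(gss_cat$year))
print(max(gss_cat$age, na.rm = TRUE))
```

If the descriptions hold, `factor_cols` lists the categorical variables and the largest age is the truncation value 89.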
    

        Six of the columns in gss_cat are factors. levels() lists every level defined for a factor, while count() shows which values actually occur in the data.

    > gss_cat %>% count(race)
    # A tibble: 3 x 2
      race      n
      <fct> <int>
    1 Other  1959
    2 Black  3129
    3 White 16395
    
    > levels(gss_cat$race)
    [1] "Other"          "Black"          "White"          "Not applicable"
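
The two outputs differ by one level: "Not applicable" is defined for race but never occurs in the data. Below is a minimal sketch of a few ways to surface such unused levels, assuming dplyr and forcats are installed. `unused_levels` and `race_counts` are illustrative names; `.drop = FALSE` is the `count()` argument that keeps empty factor levels.

```r
library(dplyr)
library(forcats)

# Levels defined on the factor that never appear in the data
unused_levels <- setdiff(levels(gss_cat$race),
                         as.character(unique(gss_cat$race)))
print(unused_levels)

# Keep zero-count levels in the count() result
race_counts <- gss_cat %>% count(race, .drop = FALSE)
print(race_counts)

# Base R table() also reports every level, including empty ones
print(table(gss_cat$race))
```

With `.drop = FALSE`, the result gains a row for the unused level with n equal to 0.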
    

        Use geom_bar() to plot the number of respondents of each race:

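    > library(forcats)   # provides gss_cat
    > library(ggpubr)    # also attaches ggplot2
    # By default, ggplot2 drops levels that have no values
    > p1 <- ggplot(gss_cat, aes(race)) + geom_bar()
    # scale_x_discrete(drop = FALSE) turns off this default behavior
    > p2 <- ggplot(gss_cat, aes(race)) + geom_bar() + scale_x_discrete(drop = FALSE)
    > ggarrange(p1, p2)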
    
    Figure: by default, ggplot2 drops levels that have no values

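        Which combinations of relig (religion) and denom (denomination) occur in gss_cat? The output below suggests that Protestantism has far more denominations than any other religion in the data (my own reading).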

    > gss_cat %>% count(relig,denom) %>% print(n=Inf)
    # A tibble: 47 x 3
       relig                   denom                    n
       <fct>                   <fct>                <int>
     1 No answer               No answer               93
     2 Don't know              Not applicable          15
     3 Inter-nondenominational Not applicable         109
     4 Native american         Not applicable          23
     5 Christian               No answer                2
     6 Christian               Don't know              11
     7 Christian               No denomination        452
     8 Christian               Not applicable         224
     9 Orthodox-christian      Not applicable          95
    10 Moslem/islam            Not applicable         104
    11 Other eastern           Not applicable          32
    12 Hinduism                Not applicable          71
    13 Buddhism                Not applicable         147
    14 Other                   No denomination          7
    15 Other                   Not applicable         217
    16 None                    Not applicable        3523
    17 Jewish                  Not applicable         388
    18 Catholic                Not applicable        5124
    19 Protestant              No answer               22
    20 Protestant              Don't know              41
    21 Protestant              No denomination       1224
    22 Protestant              Other                 2534
    23 Protestant              Episcopal              397
    24 Protestant              Presbyterian-dk wh     244
    25 Protestant              Presbyterian, merged    67
    26 Protestant              Other presbyterian      47
    27 Protestant              United pres ch in us   110
    28 Protestant              Presbyterian c in us   104
    29 Protestant              Lutheran-dk which      267
    30 Protestant              Evangelical luth       122
    31 Protestant              Other lutheran          30
    32 Protestant              Wi evan luth synod      71
    33 Protestant              Lutheran-mo synod      212
    34 Protestant              Luth ch in america      71
    35 Protestant              Am lutheran            146
    36 Protestant              Methodist-dk which     239
    37 Protestant              Other methodist         33
    38 Protestant              United methodist      1067
    39 Protestant              Afr meth ep zion        32
    40 Protestant              Afr meth episcopal      77
    41 Protestant              Baptist-dk which      1457
    42 Protestant              Other baptists         213
    43 Protestant              Southern baptist      1536
    44 Protestant              Nat bapt conv usa       40
    45 Protestant              Nat bapt conv of am     76
    46 Protestant              Am bapt ch in usa      130
    47 Protestant              Am baptist asso        237
    
    > gss_cat %>% count(relig,denom) %>% ggplot(aes(x=relig,y=n,fill=denom))+geom_bar(stat = "identity",position = "dodge")
    
    Figure: visualization of religions and their denominations
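
The observation that Protestantism has the most denominations can be put into numbers by counting the distinct denom values per relig. This is a minimal sketch, assuming dplyr and forcats are installed; `denoms_per_relig` and `n_denom` are illustrative names.

```r
library(dplyr)
library(forcats)

# Number of distinct denom values recorded for each relig, largest first
denoms_per_relig <- gss_cat %>%
  group_by(relig) %>%
  summarise(n_denom = n_distinct(denom)) %>%
  arrange(desc(n_denom))

print(denoms_per_relig, n = Inf)
```

Note that the counts include placeholder values such as "No answer", "Don't know" and "Not applicable", so they slightly overstate the number of real denominations.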


          Article link: https://www.haomeiwen.com/subject/sekowltx.html