Learning R (4): Advanced Data Management

Author: 邱俊辉 | Published 2019-01-23 19:08

    As before, we start with an example to motivate the advanced data management operations.
    First, create a data frame:

    > Student<-c("John Davis","Angela Williams","Bullwinkle Moose","David Jones","Janice Markhammer","Cheryl Cushing","Reuven Ytzrhak","Greg Knox","Joel Enghland","Marry Rayburn")
    > math<-c(502,600,412,358,495,512,410,625,573,522)
    > Science<-c(95,99,80,82,75,85,80,95,89,86)
    > English<-c(25,22,18,15,20,28,15,30,27,18)
    > grade<-data.frame(Student,math,Science,English)
    > grade
                 Student math Science English
    1         John Davis  502      95      25
    2    Angela Williams  600      99      22
    3   Bullwinkle Moose  412      80      18
    4        David Jones  358      82      15
    5  Janice Markhammer  495      75      20
    6     Cheryl Cushing  512      85      28
    7     Reuven Ytzrhak  410      80      15
    8          Greg Knox  625      95      30
    9      Joel Enghland  573      89      27
    10     Marry Rayburn  522      86      18
    

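    The problems to solve are:
    1. Give each student a single performance measure, which requires combining the scores across subjects
    2. Assign an A to the top 20% of students, a B to the next 20%, and so on
    3. Sort the students alphabetically by name
    Before tackling these, we need to get familiar with R's numeric and character processing functions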

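    Math functions

    abs(x) absolute value
    sqrt(x) square root
    ceiling(x) smallest integer not less than x
    floor(x) largest integer not greater than x
    trunc(x) integer formed by truncating x toward 0
    round(x,digits=n) round x to n decimal places
    signif(x,digits=n) round x to n significant digits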
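
The rounding helpers above are easiest to tell apart on a negative value. The following is a minimal sketch; the vector `x` is an illustrative value chosen here, not data from the article.

```r
# Illustrative input (an assumption, not part of the article's data set).
x <- c(5.78, -5.78)

ceiling(x)             # smallest integer not less than x:    6 -5
floor(x)               # largest integer not greater than x:  5 -6
trunc(x)               # drop the fractional part (toward 0): 5 -5
round(x, digits = 1)   # one decimal place:                   5.8 -5.8
signif(x, digits = 2)  # two significant digits:              5.8 -5.8
abs(x)                 # absolute value:                      5.78 5.78
sqrt(c(4, 16))         # square root:                         2 4
```

Note that `floor()` and `trunc()` agree for positive numbers but differ for negative ones.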

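    Statistical functions

    mean(x) mean
    median(x) median
    sd(x) standard deviation
    var(x) variance
    mad(x) median absolute deviation
    quantile(x,probs) quantiles of x at the probabilities given in probs
    range(x) range (minimum and maximum)
    sum(x) sum
    diff(x,lag=n) lagged differences
    min(x) minimum
    max(x) maximum
    scale() center each column, then standardize it (divide by its standard deviation)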
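
As a quick illustration of the statistical functions, the sketch below applies a few of them to the `math` scores from the data frame above. The variable name `z_math` is introduced here only for illustration.

```r
# Math scores copied from the article's data frame.
math <- c(502, 600, 412, 358, 495, 512, 410, 625, 573, 522)

mean(math)                   # 500.9
median(math)                 # 507
range(math)                  # 358 625
sd(math)                     # about 86.67
quantile(math, c(0.2, 0.8))  # 20th and 80th percentiles
diff(math, lag = 1)[1:3]     # first three lagged differences: 98 -188 -54

# scale() returns a one-column matrix here; as.vector() flattens it.
z_math <- as.vector(scale(math))
mean(z_math)                 # essentially 0 after centering
sd(z_math)                   # 1 after standardizing
```

After `scale()`, every subject has mean 0 and standard deviation 1, which is what makes scores on different scales comparable.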

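    Character functions

    nchar(x) number of characters in each element of x

    sub(pattern,replacement,x,ignore.case=FALSE,fixed=FALSE) search x for
    pattern and replace the first match with the text replacement

    strsplit(x,split,fixed=FALSE) split the elements of the character vector x at split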
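
A short sketch of the three character functions follows, using two names from the data set. The `"a.b"` string is a made-up input, added to show how `fixed` changes the way `pattern` is interpreted.

```r
# Two names taken from the article's data frame.
x <- c("John Davis", "Angela Williams")

nchar(x)                  # 10 15
sub(" ", "_", x)          # "John_Davis" "Angela_Williams"

parts <- strsplit(x, " ") # a list with one character vector per element of x
parts[[1]]                # "John" "Davis"

# By default pattern is a regular expression, so "." matches any character.
regex_result <- sub(".", "_", "a.b")                # "_.b"
# With fixed = TRUE the pattern is treated as literal text.
fixed_result <- sub(".", "_", "a.b", fixed = TRUE)  # "a_b"
```

Because `strsplit()` returns a list, the pieces usually need to be pulled out with a function such as `sapply()`.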

    Now let's work through the problems listed above
    1. Combine each student's subject scores into a single performance measure

    > grade
                 Student math Science English
    1         John Davis  502      95      25
    2    Angela Williams  600      99      22
    3   Bullwinkle Moose  412      80      18
    4        David Jones  358      82      15
    5  Janice Markhammer  495      75      20
    6     Cheryl Cushing  512      85      28
    7     Reuven Ytzrhak  410      80      15
    8          Greg Knox  625      95      30
    9      Joel Enghland  573      89      27
    10     Marry Rayburn  522      86      18
    > z<-scale(grade[,2:4]) # center and standardize each subject's scores so they are comparable
    > z
                 math     Science     English
     [1,]  0.01269128  1.07806562  0.58685145
     [2,]  1.14336936  1.59143020  0.03667822
     [3,] -1.02568654 -0.84705156 -0.69688609
     [4,] -1.64871324 -0.59036927 -1.24705932
     [5,] -0.06807144 -1.48875728 -0.33010394
     [6,]  0.12806660 -0.20534583  1.13702468
     [7,] -1.04876160 -0.84705156 -1.24705932
     [8,]  1.43180765  1.07806562  1.50380683
     [9,]  0.83185601  0.30801875  0.95363360
    [10,]  0.24344191 -0.07700469 -0.69688609
    attr(,"scaled:center")
       math Science English 
      500.9    86.6    21.8 
    attr(,"scaled:scale")
         math   Science   English 
    86.673654  7.791734  5.452828 
    > score<-apply(z,1,mean) # take the mean of each row of z
    > score
     [1]  0.5592028  0.9238259 -0.8565414 -1.1620473 -0.6289776  0.3532485 -1.0476242
     [8]  1.3378934  0.6978361 -0.1768163
    > grade<-cbind(grade,score)
    > grade # the data frame with the composite score appended
                 Student math Science English      score
    1         John Davis  502      95      25  0.5592028
    2    Angela Williams  600      99      22  0.9238259
    3   Bullwinkle Moose  412      80      18 -0.8565414
    4        David Jones  358      82      15 -1.1620473
    5  Janice Markhammer  495      75      20 -0.6289776
    6     Cheryl Cushing  512      85      28  0.3532485
    7     Reuven Ytzrhak  410      80      15 -1.0476242
    8          Greg Knox  625      95      30  1.3378934
    9      Joel Enghland  573      89      27  0.6978361
    10     Marry Rayburn  522      86      18 -0.1768163
    > y<-quantile(grade$score,c(0.8,0.6,0.4,0.2)) # use quantile() to compute the cutoff score for each grade
    > y
           80%        60%        40%        20% 
     0.7430341  0.4356302 -0.3576808 -0.8947579 
    # recode each student's percentile rank into a new categorical grade variable
    > grade$level[grade$score>=y[1]]<-"A"
    > grade$level[grade$score<y[1] & grade$score>=y[2]]<-"B"
    > grade$level[grade$score<y[2] & grade$score>=y[3]]<-"C"
    > grade$level[grade$score<y[3] & grade$score>=y[4]]<-"D"
    > grade$level[grade$score<y[4]]<-"F"
    > grade
                 Student math Science English      score level
    1         John Davis  502      95      25  0.5592028     B
    2    Angela Williams  600      99      22  0.9238259     A
    3   Bullwinkle Moose  412      80      18 -0.8565414     D
    4        David Jones  358      82      15 -1.1620473     F
    5  Janice Markhammer  495      75      20 -0.6289776     D
    6     Cheryl Cushing  512      85      28  0.3532485     C
    7     Reuven Ytzrhak  410      80      15 -1.0476242     F
    8          Greg Knox  625      95      30  1.3378934     A
    9      Joel Enghland  573      89      27  0.6978361     B
    10     Marry Rayburn  522      86      18 -0.1768163     C
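
The five assignments above can also be written as a single call to `cut()`. This is an alternative sketch, not the approach the article uses. The `score` values are copied from the printed output (so they are rounded to seven decimals), and `breaks` and `level` are names chosen here for illustration.

```r
# Composite scores as printed in the article, in the original row order.
score <- c(0.5592028, 0.9238259, -0.8565414, -1.1620473, -0.6289776,
           0.3532485, -1.0476242, 1.3378934, 0.6978361, -0.1768163)

# Cutoffs at the 20th, 40th, 60th and 80th percentiles, padded with -Inf and Inf.
breaks <- c(-Inf, quantile(score, c(0.2, 0.4, 0.6, 0.8), names = FALSE), Inf)

# right = FALSE makes each interval [lower, upper), matching the
# ">= lower & < upper" conditions used in the article.
level <- cut(score, breaks = breaks,
             labels = c("F", "D", "C", "B", "A"), right = FALSE)
as.character(level)
# "B" "A" "D" "F" "D" "C" "F" "A" "B" "C"
```

`cut()` returns a factor, so `as.character()` is needed only if a plain character column is wanted.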
    # use strsplit() to split each student's name into first and last name
    > name<-strsplit(grade$Student," ")
    Error in strsplit(grade$Student, " ") : non-character argument
    # this fails because the Student variable is not a character variable
    > is.character(grade$Student)
    [1] FALSE
    > class(grade$Student)
    [1] "factor"
    # it is a factor (before R 4.0.0, data.frame() converted strings to factors by default), so convert it to character
    > grade$Student<-as.character(grade$Student)
    > name<-strsplit((grade$Student)," ")
    > name
    [[1]]
    [1] "John"  "Davis"
    
    [[2]]
    [1] "Angela"   "Williams"
    
    [[3]]
    [1] "Bullwinkle" "Moose"     
    
    [[4]]
    [1] "David" "Jones"
    
    [[5]]
    [1] "Janice"     "Markhammer"
    
    [[6]]
    [1] "Cheryl"  "Cushing"
    
    [[7]]
    [1] "Reuven"  "Ytzrhak"
    
    [[8]]
    [1] "Greg" "Knox"
    
    [[9]]
    [1] "Joel"     "Enghland"
    
    [[10]]
    [1] "Marry"   "Rayburn"
    # use sapply() to extract the first element of each list component as Firstname and the second as Lastname
    > Firstname<-sapply(name,"[",1)
    > Lastname<-sapply(name,"[",2)
    > Firstname
     [1] "John"       "Angela"     "Bullwinkle" "David"      "Janice"     "Cheryl"    
     [7] "Reuven"     "Greg"       "Joel"       "Marry"     
    > Lastname
     [1] "Davis"      "Williams"   "Moose"      "Jones"      "Markhammer" "Cushing"   
     [7] "Ytzrhak"    "Knox"       "Enghland"   "Rayburn"   
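
The factor error and the `"["` trick can be reproduced on a smaller example. The sketch below assumes a three-row data frame named `students`, which is hypothetical. It sets `stringsAsFactors = FALSE` explicitly so the names stay character on any R version.

```r
# Hypothetical three-row data frame; the names come from the article's data.
students <- data.frame(
  Student = c("John Davis", "Angela Williams", "Bullwinkle Moose"),
  stringsAsFactors = FALSE  # keep Student as character, not factor
)

name <- strsplit(students$Student, " ", fixed = TRUE)

# "[" is R's subsetting function, so sapply(name, "[", 1) evaluates
# piece[1] for every list element piece.
Firstname <- sapply(name, "[", 1)

# vapply() does the same job but also checks that each result is one string.
Lastname <- vapply(name, function(piece) piece[2], character(1))

Firstname  # "John" "Angela" "Bullwinkle"
Lastname   # "Davis" "Williams" "Moose"
```

With the column already stored as character, the `as.character()` conversion step is not needed.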
    # drop the original name column and bind the split first and last names to the data frame
    > grade<-grade[,-1]
    > grade
       math Science English      score level
    1   502      95      25  0.5592028     B
    2   600      99      22  0.9238259     A
    3   412      80      18 -0.8565414     D
    4   358      82      15 -1.1620473     F
    5   495      75      20 -0.6289776     D
    6   512      85      28  0.3532485     C
    7   410      80      15 -1.0476242     F
    8   625      95      30  1.3378934     A
    9   573      89      27  0.6978361     B
    10  522      86      18 -0.1768163     C
    > grade<-cbind(Firstname,Lastname,grade)
    > grade
        Firstname   Lastname math Science English      score level
    1        John      Davis  502      95      25  0.5592028     B
    2      Angela   Williams  600      99      22  0.9238259     A
    3  Bullwinkle      Moose  412      80      18 -0.8565414     D
    4       David      Jones  358      82      15 -1.1620473     F
    5      Janice Markhammer  495      75      20 -0.6289776     D
    6      Cheryl    Cushing  512      85      28  0.3532485     C
    7      Reuven    Ytzrhak  410      80      15 -1.0476242     F
    8        Greg       Knox  625      95      30  1.3378934     A
    9        Joel   Enghland  573      89      27  0.6978361     B
    10      Marry    Rayburn  522      86      18 -0.1768163     C
    # final step: sort by first name, then by last name
    > grade[order(Firstname,Lastname),]
        Firstname   Lastname math Science English      score level
    2      Angela   Williams  600      99      22  0.9238259     A
    3  Bullwinkle      Moose  412      80      18 -0.8565414     D
    6      Cheryl    Cushing  512      85      28  0.3532485     C
    4       David      Jones  358      82      15 -1.1620473     F
    8        Greg       Knox  625      95      30  1.3378934     A
    5      Janice Markhammer  495      75      20 -0.6289776     D
    9        Joel   Enghland  573      89      27  0.6978361     B
    1        John      Davis  502      95      25  0.5592028     B
    10      Marry    Rayburn  522      86      18 -0.1768163     C
    7      Reuven    Ytzrhak  410      80      15 -1.0476242     F
    # more practically, we can also sort by score, from highest to lowest
    > grade[order(-score),]
        Firstname   Lastname math Science English      score level
    8        Greg       Knox  625      95      30  1.3378934     A
    2      Angela   Williams  600      99      22  0.9238259     A
    9        Joel   Enghland  573      89      27  0.6978361     B
    1        John      Davis  502      95      25  0.5592028     B
    6      Cheryl    Cushing  512      85      28  0.3532485     C
    10      Marry    Rayburn  522      86      18 -0.1768163     C
    5      Janice Markhammer  495      75      20 -0.6289776     D
    3  Bullwinkle      Moose  412      80      18 -0.8565414     D
    7      Reuven    Ytzrhak  410      80      15 -1.0476242     F
    4       David      Jones  358      82      15 -1.1620473     F
    # the result can also be exported as a CSV file, which Excel can open
    > grade<-grade[order(-score),]
    > write.csv(grade,file = "grade.csv")
    

    Task complete.
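
The sorting calls above rely on the standalone vectors `Firstname`, `Lastname` and `score` still being in the same row order as the data frame. A slightly more defensive variant, sketched below, sorts by the data frame's own columns. The three-row `grade` subset, the names `by_name`, `by_score` and `out`, and the temporary output path are all assumptions made for illustration.

```r
# Hypothetical three-row subset of the article's final data frame.
grade <- data.frame(
  Firstname = c("John", "Angela", "Greg"),
  Lastname  = c("Davis", "Williams", "Knox"),
  score     = c(0.5592028, 0.9238259, 1.3378934),
  stringsAsFactors = FALSE
)

# Sort by last name, then first name, using the columns themselves.
by_name <- grade[order(grade$Lastname, grade$Firstname), ]

# Sort by score, highest first.
by_score <- grade[order(grade$score, decreasing = TRUE), ]

# Write to a temporary file; row.names = FALSE omits the row-number column.
out <- file.path(tempdir(), "grade.csv")
write.csv(by_score, file = out, row.names = FALSE)
```

Sorting by `grade$score` keeps working even after the data frame has been reordered or filtered, which is not true of a separate `score` vector.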


        Article link: https://www.haomeiwen.com/subject/gsmhjqtx.html