R Language Beginner Tutorial (21): Data Frames (Final Part)

Author: R语言和Python学堂 | Published: 2020-02-22 16:34

This is the last article in the series on working with data frames.

10. Removing duplicate rows

A data frame sometimes contains duplicate rows, where two or more rows have exactly the same value for every variable. Here is a simple example:

> var1 <- c(1,2,3,4,3,6,1)
> var2 <- c(2,2,2,4,2,1,2)
> var3 <- c(3,2,1,2,1,2,3)
> var4 <- c(1,1,1,1,1,5,2)
> dups <- data.frame(var1, var2, var3, var4)
> dups
  var1 var2 var3 var4
1    1    2    3    1
2    2    2    2    1
3    3    2    1    1
4    4    4    2    1
5    3    2    1    1
6    6    1    2    5
7    1    2    3    2

Note that row 5 is identical to row 3. To drop all duplicate rows, use the unique() function:

> unique(dups)
  var1 var2 var3 var4
1    1    2    3    1
2    2    2    2    1
3    3    2    1    1
4    4    4    2    1
6    6    1    2    5
7    1    2    3    2

Note that the row names in the new data frame are the same as in the original, which makes it easy to see that unique() removed row 5.

To view the rows that are duplicates (if any), use duplicated() to create a vector of TRUE and FALSE values that acts as a filter:

> dups[duplicated(dups), ]
  var1 var2 var3 var4
5    3    2    1    1
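
As a small sketch of how the two functions relate: negating the result of duplicated() keeps the first copy of every row, which is what unique() returns. duplicated() can also be applied to a subset of columns to find rows that repeat only on those columns. The names is_dup and no_dups, and the choice of var1 and var2 as the columns to compare, are illustrative assumptions and not part of the original example.

```r
# Rebuild the example data frame from this section
dups <- data.frame(
  var1 = c(1, 2, 3, 4, 3, 6, 1),
  var2 = c(2, 2, 2, 4, 2, 1, 2),
  var3 = c(3, 2, 1, 2, 1, 2, 3),
  var4 = c(1, 1, 1, 1, 1, 5, 2)
)

# duplicated() flags the second and later copies of a row
is_dup <- duplicated(dups)

# Negating the flags keeps the first copy of every row, like unique()
no_dups <- dups[!is_dup, ]
identical(no_dups, unique(dups))

# Illustrative choice: look for rows that repeat on var1 and var2 only
partial_dups <- dups[duplicated(dups[, c("var1", "var2")]), ]
partial_dups
```

Comparing on a subset of columns flags more rows than the full comparison, so it is worth checking which definition of "duplicate" the analysis needs before dropping anything.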

11. Working with date variables

Data frames often contain date variables, for example:

> nums <- read.table("C:/data/sortdata.txt", header=T)
> attach(nums)
> head(nums)
     name       date   response treatment
1  albert 25/08/2003 0.05963704         A
2     ann 21/05/2003 1.46555993         A
3    john 12/10/2003 1.59406539         B
4     ian 02/12/2003 2.09505949         A
5 michael 18/10/2003 2.38330748         B
6     ann 02/07/2003 2.86983693         B

Suppose we want to sort the rows by date (the date variable). The sorting method used so far does not give the result we want:

> nums[order(date), ]
        name       date    response treatment
6        ann 02/07/2003  2.86983693         B
4        ian 02/12/2003  2.09505949         A
8      james 05/06/2003  4.90041370         A
13       ian 06/05/2003 39.97237868         A
9    william 11/06/2003  6.54439283         A
11 elizabeth 12/05/2003 39.39536726         B
3       john 12/10/2003  1.59406539         B
12   michael 14/06/2003 39.56900878         A
10    albert 14/07/2003 39.19746613         A
17   heather 14/11/2003 41.81821146         B
14      rose 15/05/2003 39.98892034         A
16  georgina 16/08/2003 40.81249037         A
5    michael 18/10/2003  2.38330748         B
2        ann 21/05/2003  1.46555993         A
15  georgina 24/05/2003 40.35117518         B
1     albert 25/08/2003  0.05963704         A
7   georgina 27/09/2003  3.37154802         B

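This happens because the dates are stored as character strings (day first, then month, then year), so the rows are sorted in alphabetical order rather than in date order. To sort by date, we first need to convert the variable into a proper date-time format with the strptime() function (date handling will be covered in detail in a later article):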

> dates <- strptime(date, format="%d/%m/%Y")
> dates
 [1] "2003-08-25 CST" "2003-05-21 CST" "2003-10-12 CST" "2003-12-02 CST" "2003-10-18 CST"
 [6] "2003-07-02 CST" "2003-09-27 CST" "2003-06-05 CST" "2003-06-11 CST" "2003-07-14 CST"
[11] "2003-05-12 CST" "2003-06-14 CST" "2003-05-06 CST" "2003-05-15 CST" "2003-05-24 CST"
[16] "2003-08-16 CST" "2003-11-14 CST"

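Note how strptime() displays the date objects: first the year, then a hyphen, then the month, then a hyphen, then the day, and finally the time zone abbreviation CST. This is exactly the sequence we need. Bind the new variable to the data frame as follows: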

> nums <- cbind(nums, dates)

Now that the new date variable is in the correct format, the dates sort properly:

> nums[order(dates), ]
        name       date    response treatment      dates
13       ian 06/05/2003 39.97237868         A 2003-05-06
11 elizabeth 12/05/2003 39.39536726         B 2003-05-12
14      rose 15/05/2003 39.98892034         A 2003-05-15
2        ann 21/05/2003  1.46555993         A 2003-05-21
15  georgina 24/05/2003 40.35117518         B 2003-05-24
8      james 05/06/2003  4.90041370         A 2003-06-05
9    william 11/06/2003  6.54439283         A 2003-06-11
12   michael 14/06/2003 39.56900878         A 2003-06-14
6        ann 02/07/2003  2.86983693         B 2003-07-02
10    albert 14/07/2003 39.19746613         A 2003-07-14
16  georgina 16/08/2003 40.81249037         A 2003-08-16
1     albert 25/08/2003  0.05963704         A 2003-08-25
7   georgina 27/09/2003  3.37154802         B 2003-09-27
3       john 12/10/2003  1.59406539         B 2003-10-12
5    michael 18/10/2003  2.38330748         B 2003-10-18
17   heather 14/11/2003 41.81821146         B 2003-11-14
4        ian 02/12/2003  2.09505949         A 2003-12-02
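
The same idea can be sketched without the external file and without attach(). The example below uses as.Date() rather than strptime(), which is a substitution on my part: it accepts the same format codes and is sufficient when the data has no time-of-day component. The four-row data frame is a hypothetical miniature of sortdata.txt built from values shown above.

```r
# Hypothetical miniature version of the sortdata.txt table
nums <- data.frame(
  name = c("albert", "ann", "john", "ian"),
  date = c("25/08/2003", "21/05/2003", "12/10/2003", "02/12/2003"),
  stringsAsFactors = FALSE
)

# Sorting the raw strings compares characters, so the day dominates
nums[order(nums$date), ]

# as.Date() parses the strings using the same format codes as strptime()
nums$dates <- as.Date(nums$date, format = "%d/%m/%Y")

# Ordering on the parsed dates gives calendar order
sorted <- nums[order(nums$dates), ]
sorted
```

Referring to columns as nums$date instead of attaching the data frame avoids picking up a stale copy of a variable after the data frame has been modified.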

12. Using the `match` function

The worms data frame introduced earlier contains fields with five different vegetation types:

> worms <- read.table("C:/data/worms.txt",header=T)
> unique(worms$Vegetation)
[1] Grassland Arable    Meadow    Scrub     Orchard  
Levels: Arable Grassland Meadow Orchard Scrub
> worms
          Field.Name Area Slope Vegetation Soil.pH  Damp Worm.density
1        Nashs.Field  3.6    11  Grassland     4.1 FALSE            4
2     Silwood.Bottom  5.1     2     Arable     5.2 FALSE            7
3      Nursery.Field  2.8     3  Grassland     4.3 FALSE            2
4        Rush.Meadow  2.4     5     Meadow     4.9  TRUE            5
5    Gunness.Thicket  3.8     0      Scrub     4.2 FALSE            6
6           Oak.Mead  3.1     2  Grassland     3.9 FALSE            2
7       Church.Field  3.5     3  Grassland     4.2 FALSE            3
8            Ashurst  2.1     0     Arable     4.8 FALSE            4
9        The.Orchard  1.9     0    Orchard     5.7 FALSE            9
10     Rookery.Slope  1.5     4  Grassland     5.0  TRUE            7
11       Garden.Wood  2.9    10      Scrub     5.2 FALSE            8
12      North.Gravel  3.3     1  Grassland     4.1 FALSE            1
13      South.Gravel  3.7     2  Grassland     4.0 FALSE            2
14 Observatory.Ridge  1.8     6  Grassland     3.8 FALSE            0
15        Pond.Field  4.1     0     Meadow     5.0  TRUE            6
16      Water.Meadow  3.9     0     Meadow     4.9  TRUE            8
17         Cheapside  2.2     8      Scrub     4.7  TRUE            4
18        Pound.Hill  4.4     2     Arable     4.5 FALSE            5
19        Gravel.Pit  2.9     1  Grassland     3.5 FALSE            1
20         Farm.Wood  0.8    10      Scrub     5.1  TRUE            3

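We want to know the appropriate herbicide for each of the 20 fields. The herbicide data is in a separate data frame, which lists the recommended herbicide for a wider range of plant community types: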

> herbicides <- read.table("C:/data/herbicides.txt",header=T)
> herbicides
        Type Herbicide
1   Woodland  Fusilade
2    Conifer  Weedwipe
3     Arable  Twinspan
4       Hill  Weedwipe
5    Bracken  Fusilade
6      Scrub  Weedwipe
7  Grassland  Allclear
8      Chalk  Vanquish
9     Meadow  Propinol
10      Lawn  Vanquish
11   Orchard  Fusilade
12     Verge  Allclear

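The task is to create a vector of length 20 (one element per row of the worms data frame) and add it to worms as a new column. The first value must be Allclear because Nash's Field is grassland, the second must be Twinspan because Silwood Bottom is arable, and so on. The first argument to match() is worms$Vegetation and the second is herbicides$Type. For each field, match() returns the row position of its vegetation type in herbicides$Type; this vector of positions is then used as a subscript to extract the relevant herbicide from herbicides$Herbicide: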

> herbicides$Herbicide[match(worms$Vegetation, herbicides$Type)]
 [1] Allclear Twinspan Allclear Propinol Weedwipe Allclear Allclear Twinspan
 [9] Fusilade Allclear Weedwipe Allclear Allclear Allclear Propinol Propinol
[17] Weedwipe Twinspan Allclear Weedwipe
Levels: Allclear Fusilade Propinol Twinspan Vanquish Weedwipe

Add this information as a new column in the worms data frame:

> worms$hb <- herbicides$Herbicide[match(worms$Vegetation,herbicides$Type)]
> worms
          Field.Name Area Slope Vegetation Soil.pH  Damp Worm.density       hb
1        Nashs.Field  3.6    11  Grassland     4.1 FALSE            4 Allclear
2     Silwood.Bottom  5.1     2     Arable     5.2 FALSE            7 Twinspan
3      Nursery.Field  2.8     3  Grassland     4.3 FALSE            2 Allclear
4        Rush.Meadow  2.4     5     Meadow     4.9  TRUE            5 Propinol
5    Gunness.Thicket  3.8     0      Scrub     4.2 FALSE            6 Weedwipe
6           Oak.Mead  3.1     2  Grassland     3.9 FALSE            2 Allclear
7       Church.Field  3.5     3  Grassland     4.2 FALSE            3 Allclear
8            Ashurst  2.1     0     Arable     4.8 FALSE            4 Twinspan
9        The.Orchard  1.9     0    Orchard     5.7 FALSE            9 Fusilade
10     Rookery.Slope  1.5     4  Grassland     5.0  TRUE            7 Allclear
11       Garden.Wood  2.9    10      Scrub     5.2 FALSE            8 Weedwipe
12      North.Gravel  3.3     1  Grassland     4.1 FALSE            1 Allclear
13      South.Gravel  3.7     2  Grassland     4.0 FALSE            2 Allclear
14 Observatory.Ridge  1.8     6  Grassland     3.8 FALSE            0 Allclear
15        Pond.Field  4.1     0     Meadow     5.0  TRUE            6 Propinol
16      Water.Meadow  3.9     0     Meadow     4.9  TRUE            8 Propinol
17         Cheapside  2.2     8      Scrub     4.7  TRUE            4 Weedwipe
18        Pound.Hill  4.4     2     Arable     4.5 FALSE            5 Twinspan
19        Gravel.Pit  2.9     1  Grassland     3.5 FALSE            1 Allclear
20         Farm.Wood  0.8    10      Scrub     5.1  TRUE            3 Weedwipe
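
To make the intermediate step visible, here is a minimal sketch that prints the positions returned by match() before they are used as subscripts. The lookup table is a three-row excerpt of the herbicides data, and the vegetation vector, including the unmatched type "Heath", is a hypothetical input chosen to show what happens when a value has no entry in the lookup table.

```r
# Three-row excerpt of the herbicide lookup table
lookup <- data.frame(
  Type = c("Arable", "Scrub", "Grassland"),
  Herbicide = c("Twinspan", "Weedwipe", "Allclear"),
  stringsAsFactors = FALSE
)

# Hypothetical vegetation values; "Heath" is not in the lookup table
vegetation <- c("Grassland", "Arable", "Heath", "Scrub")

# For each vegetation value, match() gives its row position in lookup$Type
idx <- match(vegetation, lookup$Type)
idx

# Indexing with NA yields NA, so unmatched types are easy to spot
hb <- lookup$Herbicide[idx]
hb
vegetation[is.na(idx)]
```

Checking for NA in the positions before adding the new column is a cheap way to catch vegetation types that are missing from, or spelled differently in, the lookup table.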

13. Merging two data frames

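Suppose we have two data frames, the first containing information on plant life forms and the second containing information on flowering times. We want to combine them into a single data frame showing both life form and flowering time. Both data frames contain variables for the genus name (Genus) and the species name (species):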

> (lifeforms <- read.table("C:/data/lifeforms.txt",header=T))
   Genus     species lifeform
1   Acer platanoides     tree
2   Acer    palmatum     tree
3  Ajuga     reptans     herb
4 Conyza sumatrensis   annual
5 Lamium       album     herb

> (flowering <- read.table("C:/data/fltimes.txt",header=T))
      Genus       species flowering
1      Acer   platanoides       May
2     Ajuga       reptans      June
3  Brassica         napus     April
4 Chamerion angustifolium      July
5    Conyza     bilbaoana    August
6    Lamium         album   January

Because the two data frames share at least one variable name (here they share two, Genus and species), we can use the simplest form of the merge command:

> merge(flowering, lifeforms)
   Genus     species flowering lifeform
1   Acer platanoides       May     tree
2  Ajuga     reptans      June     herb
3 Lamium       album   January     herb

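Note that the merged data frame contains only those rows that have complete entries in both data frames. Two rows were dropped from the lifeforms data frame (Acer palmatum and Conyza sumatrensis) because there is no flowering time for them. Three rows were dropped from the flowering data frame (Chamerion angustifolium, Conyza bilbaoana and Brassica napus) because there is no life form data for them.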

To include all species, with missing values (NA) wherever the flowering time or life form is unknown, use the all=T option:

> (both <- merge(flowering,lifeforms,all=T))
      Genus       species flowering lifeform
1      Acer   platanoides       May     tree
2      Acer      palmatum      <NA>     tree
3     Ajuga       reptans      June     herb
4  Brassica         napus     April     <NA>
5 Chamerion angustifolium      July     <NA>
6    Conyza     bilbaoana    August     <NA>
7    Conyza   sumatrensis      <NA>   annual
8    Lamium         album   January     herb

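A common complication is that the variables to merge on have different names in the two data frames. The simplest solution is often to rename the variables so that they match. The alternative is to use the by.x and by.y options of merge, which name the key columns in the first data frame (conventionally called x) and in the second data frame (conventionally called y). Suppose a third data frame contains seed weights for all eight species, but the Genus variable is called name1 and the species variable is called name2.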

> (seeds <- read.table("C:/data/seedwts.txt",header=T))
      name1         name2 seed
1      Acer   platanoides 32.0
2    Lamium         album 12.0
3     Ajuga       reptans  4.0
4 Chamerion angustifolium  1.5
5    Conyza     bilbaoana  0.5
6  Brassica         napus  7.0
7      Acer      palmatum 21.0
8    Conyza   sumatrensis  0.6

> merge(both,seeds,by.x=c("Genus","species"),by.y=c("name1","name2"))
      Genus       species flowering lifeform seed
1      Acer      palmatum      <NA>     tree 21.0
2      Acer   platanoides       May     tree 32.0
3     Ajuga       reptans      June     herb  4.0
4  Brassica         napus     April     <NA>  7.0
5 Chamerion angustifolium      July     <NA>  1.5
6    Conyza     bilbaoana    August     <NA>  0.5
7    Conyza   sumatrensis      <NA>   annual  0.6
8    Lamium         album   January     herb 12.0

Note that the variable names in the merged data frame are those used in the x data frame.
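
The two approaches mentioned above, renaming the key columns or using by.x and by.y, can be compared side by side. This is a minimal sketch on two-row excerpts of the tables; seeds2, m1 and m2 are hypothetical names introduced for the comparison.

```r
both <- data.frame(
  Genus = c("Acer", "Ajuga"),
  species = c("platanoides", "reptans"),
  flowering = c("May", "June"),
  stringsAsFactors = FALSE
)
seeds <- data.frame(
  name1 = c("Ajuga", "Acer"),
  name2 = c("reptans", "platanoides"),
  seed = c(4, 32),
  stringsAsFactors = FALSE
)

# Option 1: tell merge() which columns correspond to each other
m1 <- merge(both, seeds,
            by.x = c("Genus", "species"),
            by.y = c("name1", "name2"))

# Option 2: rename the key columns in a copy, then merge on the shared names
seeds2 <- seeds
names(seeds2)[names(seeds2) == "name1"] <- "Genus"
names(seeds2)[names(seeds2) == "name2"] <- "species"
m2 <- merge(both, seeds2)

# Both routes should produce the same table
all.equal(m1, m2)
```

Using by.x and by.y leaves the original data frames untouched, whereas renaming is convenient when the same table will be merged several times.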

14. Adding margins (row and column summaries) to a data frame

Suppose we have a data frame of sales broken down by season and by person:

> frame <- read.table("C:/data/sales.txt", header=T)
> frame
             name spring summer autumn winter
1      Jane.Smith     14     18     11     12
2    Robert.Jones     17     18     10     13
3     Dick.Rogers     12     16      9     14
4 William.Edwards     15     14     11     10
5     Janet.Jones     11     17     11     16

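We want to add margins to this data frame showing how the seasonal means depart from the overall mean (an extra row at the bottom) and how each person's mean departs from it (an extra column on the right). Finally, we want the sales figures in the body of the data frame to be expressed as deviations from the overall mean.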

> people <- rowMeans(frame[,2:5])
> people <- people-mean(people)
> people
[1]  0.30  1.05 -0.70 -0.95  0.30

Adding a new column to the data frame with cbind() is straightforward:

> (new.frame <- cbind(frame, people))
             name spring summer autumn winter people
1      Jane.Smith     14     18     11     12   0.30
2    Robert.Jones     17     18     10     13   1.05
3     Dick.Rogers     12     16      9     14  -0.70
4 William.Edwards     15     14     11     10  -0.95
5     Janet.Jones     11     17     11     16   0.30

Robert Jones is the most effective salesperson (+1.05) and William Edwards the least effective (-0.95). The column means are calculated in a similar way:

> seasons <- colMeans(frame[ ,2:5])
> seasons <- seasons-mean(seasons)
> seasons
spring summer autumn winter 
  0.35   3.15  -3.05  -0.45

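Summer sales are the highest (+3.15) and autumn sales the lowest (-3.05). There is a problem, however: seasons holds only four values, whereas the new data frame has six columns, so rbind() cannot be used directly. The simplest workaround is to copy one row of the new data frame: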

> new.row <- new.frame[1, ]

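and then edit it to contain the required values: the label "seasonal effects" in the first column, the four seasonal deviations in the next four columns, and finally a zero for the deviation of the overall mean: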

> new.row[1] <- "seasonal effects"
> new.row[2:5] <- seasons
> new.row[6] <- 0
> new.row
              name spring summer autumn winter people
1 seasonal effects   0.35   3.15  -3.05  -0.45      0

Now we can use rbind() to add the new row to the bottom of the extended data frame:

> (new.frame <- rbind(new.frame,new.row))
              name spring summer autumn winter people
1       Jane.Smith  14.00  18.00  11.00  12.00   0.30
2     Robert.Jones  17.00  18.00  10.00  13.00   1.05
3      Dick.Rogers  12.00  16.00   9.00  14.00  -0.70
4  William.Edwards  15.00  14.00  11.00  10.00  -0.95
5      Janet.Jones  11.00  17.00  11.00  16.00   0.30
6 seasonal effects   0.35   3.15  -3.05  -0.45   0.00

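The final task is to replace the sales counts in new.frame[1:5, 2:5] with their departures from the overall mean sale per person per season (the grand mean, gm = 13.45). We use unlist() to flatten the block of sales figures into a single vector, so that mean() returns one grand mean instead of treating the columns separately. We then create a vector of length 4 containing the grand mean repeated (one copy per column of sales). Finally, we use sweep() to subtract the grand mean from every value: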

> gm <- mean(unlist(new.frame[1:5, 2:5]))
> gm <- rep(gm, 4)
> new.frame[1:5, 2:5] <- sweep(new.frame[1:5,2:5], 2, gm)
> new.frame
              name spring summer autumn winter people
1       Jane.Smith   0.55   4.55  -2.45  -1.45   0.30
2     Robert.Jones   3.55   4.55  -3.45  -0.45   1.05
3      Dick.Rogers  -1.45   2.55  -4.45   0.55  -0.70
4  William.Edwards   1.55   0.55  -2.45  -3.45  -0.95
5      Janet.Jones  -2.45   3.55  -2.45   2.55   0.30
6 seasonal effects   0.35   3.15  -3.05  -0.45   0.00

To complete the table, we put the grand mean in the bottom right-hand corner:

> new.frame[6,6] <- gm[1]
> new.frame
              name spring summer autumn winter people
1       Jane.Smith   0.55   4.55  -2.45  -1.45   0.30
2     Robert.Jones   3.55   4.55  -3.45  -0.45   1.05
3      Dick.Rogers  -1.45   2.55  -4.45   0.55  -0.70
4  William.Edwards   1.55   0.55  -2.45  -3.45  -0.95
5      Janet.Jones  -2.45   3.55  -2.45   2.55   0.30
6 seasonal effects   0.35   3.15  -3.05  -0.45  13.45

In summer, Jane Smith and Robert Jones performed best, with sales 4.55 above the overall mean.
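
As a check on the sweep() step, the sketch below recomputes the deviations on a plain numeric matrix. Converting the sales columns to a matrix, and the names sales, dev_simple and dev_sweep, are assumptions made for the illustration; the article itself works on the data frame directly.

```r
frame <- data.frame(
  name = c("Jane.Smith", "Robert.Jones", "Dick.Rogers",
           "William.Edwards", "Janet.Jones"),
  spring = c(14, 17, 12, 15, 11),
  summer = c(18, 18, 16, 14, 17),
  autumn = c(11, 10, 9, 11, 11),
  winter = c(12, 13, 14, 10, 16)
)

# Work on the numeric part only
sales <- as.matrix(frame[, 2:5])

# Grand mean of all 20 sales figures
gm <- mean(sales)

# Subtracting a single number from a matrix is elementwise
dev_simple <- sales - gm

# sweep() with MARGIN = 2 subtracts one value per column (default FUN is "-")
dev_sweep <- sweep(sales, 2, rep(gm, 4))

# The two results should agree
all.equal(dev_simple, dev_sweep)
```

Because the same grand mean is subtracted from every column here, plain subtraction is enough; sweep() becomes necessary when each column (or row) needs its own value, for example sweep(sales, 2, colMeans(sales)).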

15. Summarizing a data frame with aggregate

Besides summary(), we can also use the aggregate() function to summarize the contents of a data frame. We again use the worms data frame as the example:

> worms <- read.table("C:/data/worms.txt",header=T)
> attach(worms)
> worms
          Field.Name Area Slope Vegetation Soil.pH  Damp Worm.density
1        Nashs.Field  3.6    11  Grassland     4.1 FALSE            4
2     Silwood.Bottom  5.1     2     Arable     5.2 FALSE            7
3      Nursery.Field  2.8     3  Grassland     4.3 FALSE            2
4        Rush.Meadow  2.4     5     Meadow     4.9  TRUE            5
5    Gunness.Thicket  3.8     0      Scrub     4.2 FALSE            6
6           Oak.Mead  3.1     2  Grassland     3.9 FALSE            2
7       Church.Field  3.5     3  Grassland     4.2 FALSE            3
8            Ashurst  2.1     0     Arable     4.8 FALSE            4
9        The.Orchard  1.9     0    Orchard     5.7 FALSE            9
10     Rookery.Slope  1.5     4  Grassland     5.0  TRUE            7
11       Garden.Wood  2.9    10      Scrub     5.2 FALSE            8
12      North.Gravel  3.3     1  Grassland     4.1 FALSE            1
13      South.Gravel  3.7     2  Grassland     4.0 FALSE            2
14 Observatory.Ridge  1.8     6  Grassland     3.8 FALSE            0
15        Pond.Field  4.1     0     Meadow     5.0  TRUE            6
16      Water.Meadow  3.9     0     Meadow     4.9  TRUE            8
17         Cheapside  2.2     8      Scrub     4.7  TRUE            4
18        Pound.Hill  4.4     2     Arable     4.5 FALSE            5
19        Gravel.Pit  2.9     1  Grassland     3.5 FALSE            1
20         Farm.Wood  0.8    10      Scrub     5.1  TRUE            3

Like tapply(), aggregate() applies a function (here mean) to each level of a specified categorical variable (here Vegetation), for the specified variables (Area, Slope, Soil.pH and Worm.density, which are columns 2, 3, 5 and 7):

> aggregate(worms[ ,c(2,3,5,7)], by=list(veg=Vegetation), mean)
        veg     Area    Slope  Soil.pH Worm.density
1    Arable 3.866667 1.333333 4.833333     5.333333
2 Grassland 2.911111 3.666667 4.100000     2.444444
3    Meadow 3.466667 1.666667 4.933333     6.333333
4   Orchard 1.900000 0.000000 5.700000     9.000000
5     Scrub 2.425000 7.000000 4.800000     5.250000

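The by argument must be a list, even when there is only one classifying factor as in the example above. Here is the summary cross-classified by Vegetation and Damp: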

> aggregate(worms[ ,c(2,3,5,7)], by=list(veg=Vegetation, d=Damp), mean)
        veg     d     Area    Slope  Soil.pH Worm.density
1    Arable FALSE 3.866667 1.333333 4.833333     5.333333
2 Grassland FALSE 3.087500 3.625000 3.987500     1.875000
3   Orchard FALSE 1.900000 0.000000 5.700000     9.000000
4     Scrub FALSE 3.350000 5.000000 4.700000     7.000000
5 Grassland  TRUE 1.500000 4.000000 5.000000     7.000000
6    Meadow  TRUE 3.466667 1.666667 4.933333     6.333333
7     Scrub  TRUE 1.500000 9.000000 4.900000     3.500000
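
aggregate() also has a formula interface, which selects variables by name instead of by column index and does not require attach(). The sketch below uses six rows copied from the worms table (rows 1, 2, 3, 4, 8 and 15); worms_small and by_veg are names made up for this illustration.

```r
# Six rows copied from the worms data frame
worms_small <- data.frame(
  Vegetation = c("Grassland", "Arable", "Grassland",
                 "Meadow", "Arable", "Meadow"),
  Damp = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE),
  Area = c(3.6, 5.1, 2.8, 2.4, 2.1, 4.1),
  Worm.density = c(4, 7, 2, 5, 4, 6)
)

# Formula interface: variables to summarize on the left, grouping on the right
by_veg <- aggregate(cbind(Area, Worm.density) ~ Vegetation,
                    data = worms_small, FUN = mean)
by_veg

# tapply() gives the same group means for a single variable
tapply(worms_small$Worm.density, worms_small$Vegetation, mean)
```

Naming the columns makes the call robust to changes in column order, which the index vector c(2,3,5,7) is not.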

That is all for today. This completes the coverage of data frames; I hope you found it helpful.

Thank you for reading! For more tips, follow my WeChat official account "R语言和Python学堂", where I post new articles regularly.
