dplyr advanced: Filtering data (filter)

Author: drlee_fc74 | Published: 2020-05-13 07:03


Filtering rows is one of the most common steps in data processing. In dplyr, this is done with filter().

Basic usage

library(tidyverse)
data("msleep")

Filtering numeric variables

For numeric variables, we can filter with comparison operators, which include:
>, <, >=, <=, ==, !=

msleep %>% 
  select(name, sleep_total) %>% 
  filter(sleep_total > 18)

## # A tibble: 4 x 2
##   name                 sleep_total
##   <chr>                      <dbl>
## 1 Big brown bat               19.7
## 2 Thick-tailed opposum        19.4
## 3 Little brown bat            19.9
## 4 Giant armadillo             18.1
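One detail worth keeping in mind with comparison operators: filter() keeps only rows where the condition is TRUE, so rows where the comparison evaluates to NA are dropped as well. The sketch below uses a small made-up tibble (`toy` is not part of msleep) to illustrate this.

```r
library(dplyr)

# Toy data for illustration only: one sleep_total value is missing
toy <- tibble(name = c("a", "b", "c"), sleep_total = c(19.7, NA, 4.0))

# NA > 18 evaluates to NA, so row "b" is dropped together with row "c"
above_18 <- toy %>% filter(sleep_total > 18)

# Ask for missing values explicitly if they should survive the filter
above_18_or_na <- toy %>% filter(sleep_total > 18 | is.na(sleep_total))
```

If missing values matter for the analysis, it is usually safer to handle them explicitly like this than to rely on the default.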

We often need to keep values within a range. This can be done by giving filter() two conditions, or by using between() inside filter(). Note that between() includes both endpoints.

### Keep values between 16 and 18
msleep %>% 
  select(name, sleep_total) %>% 
  filter(between(sleep_total, 16, 18))

## # A tibble: 4 x 2
##   name                   sleep_total
##   <chr>                        <dbl>
## 1 Owl monkey                    17  
## 2 Long-nosed armadillo          17.4
## 3 North American Opossum        18  
## 4 Arctic ground squirrel        16.6
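As a quick check of what between() does, the sketch below compares it with the two explicit comparisons on a made-up vector `x` (an illustration, not msleep data). Both endpoints are included.

```r
library(dplyr)

# Made-up values around the 16-18 range
x <- c(15.9, 16, 17.4, 18, 18.1)

# between() is inclusive on both sides
in_range <- between(x, 16, 18)

# The same test written with explicit comparisons
in_range_manual <- x >= 16 & x <= 18
```

Inside filter(), `filter(between(sleep_total, 16, 18))` and `filter(sleep_total >= 16, sleep_total <= 18)` should therefore select the same rows.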

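Another function that is often useful for numeric filtering is near(). It keeps values that lie close to a given number, and the tol argument sets how far from that number a value may be.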

### Keep values within one standard deviation of 17
msleep %>% 
  select(name, sleep_total) %>% 
  filter(near(sleep_total, 17, tol = sd(sleep_total)))

## # A tibble: 26 x 2
##    name                       sleep_total
##    <chr>                            <dbl>
##  1 Owl monkey                        17  
##  2 Mountain beaver                   14.4
##  3 Greater short-tailed shrew        14.9
##  4 Three-toed sloth                  14.4
##  5 Long-nosed armadillo              17.4
##  6 North American Opossum            18  
##  7 Big brown bat                     19.7
##  8 Western american chipmunk         14.9
##  9 Thick-tailed opposum              19.4
## 10 Mongolian gerbil                  14.2
## # … with 16 more rows
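near() is primarily a floating-point-safe replacement for ==, and with a larger tol it behaves like a window around the target value. The sketch below shows both uses on made-up numbers; the vector `x` and the tolerance of 3 are chosen for illustration only.

```r
library(dplyr)

# Floating-point rounding makes an exact comparison fail
exact_equal <- sqrt(2)^2 == 2      # FALSE
close_enough <- near(sqrt(2)^2, 2) # TRUE with the default tolerance

# With an explicit tol, near() tests abs(x - target) < tol
x <- c(14.2, 17, 19.9, 22)
within_3 <- near(x, 17, tol = 3)
within_3_manual <- abs(x - 17) < 3
```

Note that the comparison is strict, so a value exactly tol away from the target is not kept.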

Filtering character variables

Matching a single string

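Filtering character variables is similar to filtering numbers. We can use == and != to keep or drop one particular string, and > or < to keep strings that sort after or before a given string. (Roughly, think of the letters a-z as 1-26; the exact ordering, including how upper and lower case compare, depends on the locale's collation rules.)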

msleep %>% 
  select(name) %>% 
  filter(name > "v")

## # A tibble: 3 x 1
##   name                       
##   <chr>                      
## 1 "Vesper mouse"             
## 2 "Western american chipmunk"
## 3 "Vole "
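Because string comparison with > and < follows the current locale, the result above can differ between machines (in a C locale, for example, uppercase letters sort before lowercase ones). One way to reduce that dependence, sketched here on a made-up tibble `toy`, is to lowercase the column before comparing.

```r
library(dplyr)

# Toy names for illustration; "Vole " keeps its trailing space on purpose
toy <- tibble(name = c("Vesper mouse", "Cow", "Western american chipmunk",
                       "Vole ", "Dog"))

# Compare lowercase strings so that case does not decide the outcome
after_v <- toy %>% filter(tolower(name) > "v")
```

This is only a sketch: accented characters and other locale-specific rules can still affect the ordering.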

Matching multiple strings

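To match any of several strings, use the %in% operator. To invert the selection, note that !%in% is not valid syntax; put the ! in front of the whole condition instead.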

remove <- c("Rodentia", "Carnivora", "Primates")
msleep %>% 
  select(order, name, sleep_total) %>% 
  filter(!order %in% remove) %>% head

## # A tibble: 6 x 3
##   order        name                       sleep_total
##   <chr>        <chr>                            <dbl>
## 1 Soricomorpha Greater short-tailed shrew        14.9
## 2 Artiodactyla Cow                                4  
## 3 Pilosa       Three-toed sloth                  14.4
## 4 Artiodactyla Roe deer                           3  
## 5 Artiodactyla Goat                               5.3
## 6 Soricomorpha Star-nosed mole                   10.3
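The negation works because %in% binds more tightly than !, so `!order %in% x` is read as `!(order %in% x)`. The sketch below checks this on a made-up tibble and also defines `%nin%`, a hypothetical helper that is not part of dplyr or base R, for readers who prefer an infix form.

```r
library(dplyr)

# Toy data for illustration only
toy <- tibble(order = c("Rodentia", "Pilosa", "Carnivora", "Artiodactyla"))
to_drop <- c("Rodentia", "Carnivora", "Primates")

# These two calls are equivalent; the parentheses only make the grouping explicit
kept_a <- toy %>% filter(!order %in% to_drop)
kept_b <- toy %>% filter(!(order %in% to_drop))

# Hypothetical "not in" operator built from base R's Negate()
`%nin%` <- Negate(`%in%`)
kept_c <- toy %>% filter(order %nin% to_drop)
```

Writing the parentheses explicitly costs nothing and tends to be easier to read in longer conditions.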

Filtering with regular expressions

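filter() itself has no regular-expression option, but once you understand how it works you can build one yourself. filter() simply evaluates a condition to one logical value per row and keeps the rows where it is TRUE. So all we need is a function that returns those logical values, such as str_detect().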

msleep %>% 
  select(name, sleep_total) %>% 
  filter(str_detect(tolower(name), pattern = "mouse"))

## # A tibble: 5 x 2
##   name                       sleep_total
##   <chr>                            <dbl>
## 1 Vesper mouse                       7  
## 2 House mouse                       12.5
## 3 Northern grasshopper mouse        14.5
## 4 Deer mouse                        11.5
## 5 African striped mouse              8.7
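As an alternative to wrapping the column in tolower(), stringr's regex() helper can make the pattern itself case-insensitive, and ordinary regular-expression anchors can narrow the match. The sketch below uses a made-up tibble `toy` to show both.

```r
library(dplyr)
library(stringr)

# Toy names for illustration only
toy <- tibble(name = c("House mouse", "Mouse lemur", "Mole rat", "Deer mouse"))

# Case-insensitive match anywhere in the name
any_mouse <- toy %>%
  filter(str_detect(name, regex("mouse", ignore_case = TRUE)))

# Case-sensitive match only at the end of the name
ends_mouse <- toy %>%
  filter(str_detect(name, "mouse$"))
```

Which variant is appropriate depends on how consistently the names are capitalized in the data.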

Filtering on multiple conditions

We often need to filter on several conditions at once. filter() supports the following combinations.

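filter(condition1, condition2): keep rows where both are true
filter(condition1, !condition2): keep rows where 1 is true and 2 is false
filter(condition1 | condition2): keep rows where 1 or 2 (or both) is true
filter(xor(condition1, condition2)): keep rows where exactly one of the two is true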

msleep %>%
  select(name, bodywt:brainwt) %>% 
  filter(xor(bodywt > 100, brainwt > 1))

## # A tibble: 5 x 3
##   name            bodywt brainwt
##   <chr>            <dbl>   <dbl>
## 1 Cow               600    0.423
## 2 Horse             521    0.655
## 3 Donkey            187    0.419
## 4 Human              62    1.32 
## 5 Brazilian tapir   208.   0.169
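To see how the four combinations differ, the sketch below applies each of them to a made-up tibble that contains every TRUE/FALSE pairing once. The columns `a` and `b` stand in for condition1 and condition2 and are illustrative only.

```r
library(dplyr)

# One row per combination of the two conditions
toy <- tibble(
  id = 1:4,
  a  = c(TRUE, TRUE, FALSE, FALSE),
  b  = c(TRUE, FALSE, TRUE, FALSE)
)

both_true    <- toy %>% filter(a, b)       # id 1
a_not_b      <- toy %>% filter(a, !b)      # id 2
either_true  <- toy %>% filter(a | b)      # ids 1, 2, 3
exactly_one  <- toy %>% filter(xor(a, b))  # ids 2, 3
```

In the msleep example above, xor() is why Human appears (large brain, body weight under 100) while animals that pass both tests or neither do not.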

Filtering across multiple columns

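To apply the same test to several columns at once, filter() has scoped variants, just as mutate() does: filter_all(), filter_if() and filter_at(). Unlike the mutate() variants, the filter variants need any_vars() or all_vars() to say how the per-column results are combined.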

filter_all

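Applies the test to every column. Whether a row is kept depends on whether any_vars() (at least one column passes) or all_vars() (every column passes) is used.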

msleep %>% 
  select(name:order, sleep_total, -vore) %>% 
  filter_all(any_vars(str_detect(., pattern = "Ca"))) %>% head

## # A tibble: 6 x 4
##   name              genus       order        sleep_total
##   <chr>             <chr>       <chr>              <dbl>
## 1 Cheetah           Acinonyx    Carnivora           12.1
## 2 Northern fur seal Callorhinus Carnivora            8.7
## 3 Vesper mouse      Calomys     Rodentia             7  
## 4 Dog               Canis       Carnivora           10.1
## 5 Roe deer          Capreolus   Artiodactyla         3  
## 6 Goat              Capri       Artiodactyla         5.3

filter_if

Applies the test only to columns that satisfy a predicate. Again, any_vars() or all_vars() determines which rows are returned.

msleep %>% 
  select(name:order, sleep_total:sleep_rem) %>% 
  filter_if(is.character, any_vars(is.na(.))) %>% head

## # A tibble: 6 x 6
##   name            genus       vore  order          sleep_total sleep_rem
##   <chr>           <chr>       <chr> <chr>                <dbl>     <dbl>
## 1 Vesper mouse    Calomys     <NA>  Rodentia               7        NA  
## 2 Desert hedgehog Paraechinus <NA>  Erinaceomorpha        10.3       2.7
## 3 Deer mouse      Peromyscus  <NA>  Rodentia              11.5      NA  
## 4 Phalanger       Phalanger   <NA>  Diprotodontia         13.7       1.8
## 5 Rock hyrax      Procavia    <NA>  Hyracoidea             5.4       0.5
## 6 Mole rat        Spalax      <NA>  Rodentia              10.6       2.4

filter_at

Applies the test to a specified set of columns.

msleep %>% 
  select(name, sleep_total:sleep_rem, brainwt:bodywt) %>% 
  filter_at(vars(sleep_total, sleep_rem), all_vars(.>5))

## # A tibble: 2 x 5
##   name                 sleep_total sleep_rem brainwt bodywt
##   <chr>                      <dbl>     <dbl>   <dbl>  <dbl>
## 1 Thick-tailed opposum        19.4       6.6  NA       0.37
## 2 Giant armadillo             18.1       6.1   0.081  60
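In more recent dplyr releases (1.0.4 and later), the scoped filter_all/if/at verbs are superseded by if_any() and if_all() used inside a plain filter(). The sketch below shows what the equivalents might look like on a made-up tibble `toy`; it assumes a dplyr version that provides these helpers.

```r
library(dplyr)

# Toy data for illustration only
toy <- tibble(
  name        = c("a", "b", "c"),
  vore        = c("herbi", NA, "carni"),
  sleep_total = c(19.4, 18.1, 4.0),
  sleep_rem   = c(6.6, 6.1, 0.7)
)

# Counterpart of filter_at(vars(sleep_total, sleep_rem), all_vars(. > 5))
both_above_5 <- toy %>%
  filter(if_all(c(sleep_total, sleep_rem), ~ .x > 5))

# Counterpart of filter_if(is.character, any_vars(is.na(.)))
any_chr_missing <- toy %>%
  filter(if_any(where(is.character), is.na))
```

The scoped verbs still work, so existing code does not need to change; the newer form mainly avoids the extra any_vars()/all_vars() wrappers.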
