Smart data file reading: the R package vroom

Author: Jason数据分析生信教室 | Published 2022-10-25 14:28

    While tinkering with Shiny recently I came across a very handy package for reading data. These are my notes for future reference.

    1. Automatic delimiter detection

    vroom detects the delimiter automatically, so CSV and TSV files alike are read with the same call, for example vroom("xxx.csv").

    library(vroom)
    
    data <- vroom("flights.tsv")
    #> Observations: 336,776
    #> Variables: 19
    #> chr  [ 4]: carrier, tailnum, origin, dest
    #> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
    #> dttm [ 1]: time_hour
    #> 
    #> Call `spec()` for a copy-pastable column specification
    #> Specify the column types with `col_types` to quiet this message
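To see the delimiter detection on something small, the sketch below writes the same toy table once as CSV and once as TSV and reads both with an identical call. The data frame and the temporary file paths are made up for illustration; only vroom() and its delim argument come from the package.

```r
library(vroom)

# Hypothetical toy data, written to temporary files for illustration only
df <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  score = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
write.csv(df, csv_path, row.names = FALSE, quote = FALSE)
write.table(df, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# The same call reads both files; vroom guesses the delimiter from the first lines
from_csv <- vroom(csv_path)
from_tsv <- vroom(tsv_path)

# If guessing goes wrong on an unusual file, the delimiter can be stated explicitly
from_tsv_explicit <- vroom(tsv_path, delim = "\t")
```

Stating delim explicitly is also a reasonable habit in scripts where the input format is known and should not silently change.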
    

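    Each read prints a long message describing the column types. If you don't need it, you can silence it by passing the column specification back in through col_types: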

    s <- spec(data)
    
    data <- vroom("flights.tsv", col_types = s)
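Besides reusing spec(), there are a couple of other ways to keep the message quiet. The sketch below uses a made-up two-column file; show_col_types is, to my knowledge, only available in vroom 1.5.0 and later, so treat that option as an assumption about the installed version.

```r
library(vroom)

# Hypothetical two-column file for illustration
path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "2,b"), path)

# Option 1: switch the message off (assumes vroom >= 1.5.0)
quiet1 <- vroom(path, show_col_types = FALSE)

# Option 2: declare every column type up front with cols()
quiet2 <- vroom(path, col_types = cols(x = col_double(), y = col_character()))

# Option 3: reuse the specification from an earlier read, as in the text above
quiet3 <- vroom(path, col_types = spec(quiet1))
```

Option 2 has the side benefit that a change in the file's column types surfaces as a parsing problem rather than a silently different data frame.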
    

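    2. Reading multiple files at once

    Batch reading is one of vroom's highlights: pass a vector of file paths and the files, which must share the same columns, are combined into a single data frame.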

    files <- fs::dir_ls(glob = "flights_*tsv")
    files
    #> flights_9E.tsv flights_AA.tsv flights_AS.tsv flights_B6.tsv flights_DL.tsv 
    #> flights_EV.tsv flights_F9.tsv flights_FL.tsv flights_HA.tsv flights_MQ.tsv 
    #> flights_OO.tsv flights_UA.tsv flights_US.tsv flights_VX.tsv flights_WN.tsv 
    #> flights_YV.tsv
    data <- vroom(files)
    #> Observations: 336,776
    #> Variables: 19
    #> chr  [ 4]: carrier, tailnum, origin, dest
    #> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
    #> dttm [ 1]: time_hour
    #> 
    #> Call `spec()` for a copy-pastable column specification
    #> Specify the column types with `col_types` to quiet this message
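When several files are combined it is often useful to know which file each row came from. A minimal sketch, assuming two small made-up files in a temporary directory; the id argument of vroom() adds a column holding the source path, and the column name source_file is my own choice.

```r
library(vroom)

# Hypothetical per-carrier files in a temporary directory
demo_dir <- tempfile("flights_demo_")
dir.create(demo_dir)
writeLines(c("carrier\tflight", "AA\t1", "AA\t2"), file.path(demo_dir, "flights_AA.tsv"))
writeLines(c("carrier\tflight", "UA\t10"), file.path(demo_dir, "flights_UA.tsv"))

# Base R equivalent of fs::dir_ls(glob = "flights_*tsv")
files <- list.files(demo_dir, pattern = "^flights_.*\\.tsv$", full.names = TRUE)

# id = "source_file" records the originating path of every row
combined <- vroom(files, delim = "\t", id = "source_file", show_col_types = FALSE)
```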
    

    3. Reading and writing compressed files

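    • vroom_write() can write compressed files directly; the compression is chosen from the file extension (here .gz)
    # flights is assumed to be the data frame behind flights.tsv (presumably nycflights13::flights)
    vroom_write(flights, "flights.tsv.gz")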
    
    # Check file sizes to show file is compressed
    fs::file_size(c("flights.tsv", "flights.tsv.gz"))
    #> 29.62M  7.87M
    
    # Read the file back in
    data <- vroom("flights.tsv.gz")
    #> Observations: 336,776
    #> Variables: 19
    #> chr  [ 4]: carrier, tailnum, origin, dest
    #> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
    #> dttm [ 1]: time_hour
    #> 
    #> Call `spec()` for a copy-pastable column specification
    #> Specify the column types with `col_types` to quiet this message
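The same round trip can be tried without the flights data. This is a sketch on a made-up, highly repetitive data frame, so the size difference is much larger than it would be for real data.

```r
library(vroom)

# Hypothetical data frame; the repeated values compress very well
df <- data.frame(
  id = 1:1000,
  value = rep(c("alpha", "beta"), 500),
  stringsAsFactors = FALSE
)
plain_path <- tempfile(fileext = ".tsv")
gz_path <- tempfile(fileext = ".tsv.gz")

# Same call for both; the .gz extension triggers compression
vroom_write(df, plain_path)
vroom_write(df, gz_path)

# Compare sizes in bytes, then read the compressed file back
sizes <- file.size(c(plain_path, gz_path))
restored <- vroom(gz_path, show_col_types = FALSE)
```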
    

    4. Reading files from a URL

    file <- "https://raw.githubusercontent.com/r-lib/vroom/master/inst/extdata/mtcars.csv"
    data <- vroom(file)
    #> Observations: 32
    #> Variables: 12
    #> chr [ 1]: model
    #> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
    #> 
    #> Call `spec()` for a copy-pastable column specification
    #> Specify the column types with `col_types` to quiet this message
    

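    5. Reading from and writing to pipe connections

    This one feels a bit like magic: shell tools can filter the data as it is read, which for this kind of job can stand in for a Perl script.

    • Extract the United Airlines rows (lines containing UA as a whole word). grep drops the header line, so the column names are passed in through col_names.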
    # Return only flights on United Airlines
    data <- vroom(pipe("grep -w UA flights.tsv"), col_names = names(flights))
    #> Observations: 58,665
    #> Variables: 19
    #> chr  [ 4]: carrier, tailnum, origin, dest
    #> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
    #> dttm [ 1]: time_hour
    #> 
    #> Call `spec()` for a copy-pastable column specification
    #> Specify the column types with `col_types` to quiet this message
    
    • Or, when writing a compressed file, pipe the output to an external compressor such as pigz
    bench::workout({
      vroom_write(flights, "flights.tsv.gz")
      vroom_write(flights, pipe("pigz > flights.tsv.gz"))
    })
    #> # A tibble: 2 x 3
    #>   exprs                                                process     real
    #>   <bch:expr>                                          <bch:tm> <bch:tm>
    #> 1 vroom_write(flights, "flights.tsv.gz")                  3.5s    2.69s
    #> 2 vroom_write(flights, pipe("pigz > flights.tsv.gz"))    1.54s 975.09ms
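The grep filter above can be reproduced on a tiny made-up file to see exactly which lines survive. This sketch assumes a Unix-like shell with grep on the PATH; the file contents and column names are hypothetical.

```r
library(vroom)

# Hypothetical file; "UAX" is included to show the effect of grep -w
path <- tempfile(fileext = ".tsv")
writeLines(c("carrier\tflight", "UA\t10", "AA\t1", "UA\t11", "UAX\t99"), path)

# grep -w keeps only lines where UA appears as a whole word, so "UAX" is dropped.
# The header line does not match either, which is why col_names is given by hand.
cmd <- paste("grep -w UA", shQuote(path))
ua <- vroom(
  pipe(cmd),
  delim = "\t",
  col_names = c("carrier", "flight"),
  show_col_types = FALSE
)
```

Filtering in the shell keeps unwanted rows from ever being parsed, at the cost of tying the script to the tools installed on the machine.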
    

    6. Selecting columns

    • Keep only the specified columns
    data <- vroom("flights.tsv", col_select = c(year, flight, tailnum))
    #> Observations: 336,776
    #> Variables: 3
    #> chr [1]: tailnum
    #> dbl [2]: year, flight
    #> 
    #> Call `spec()` for a copy-pastable column specification
    #> Specify the column types with `col_types` to quiet this message
    
    • Drop the specified columns
    data <- vroom("flights.tsv", col_select = c(-dep_time, -air_time:-time_hour))
    #> Observations: 336,776
    #> Variables: 13
    #> chr [4]: carrier, tailnum, origin, dest
    #> dbl [9]: year, month, day, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr...
    #> 
    #> Call `spec()` for a copy-pastable column specification
    #> Specify the column types with `col_types` to quiet this message
    
    • Rename a column while selecting (here tailnum becomes plane)
    data <- vroom("flights.tsv", col_select = list(plane = tailnum, everything()))
    #> Observations: 336,776
    #> Variables: 19
    #> chr  [ 4]: carrier, tailnum, origin, dest
    #> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
    #> dttm [ 1]: time_hour
    #> 
    #> Call `spec()` for a copy-pastable column specification
    #> Specify the column types with `col_types` to quiet this message
    data
    #> # A tibble: 336,776 x 19
    #>    plane  year month   day dep_time sched_dep_time dep_delay arr_time
    #>    <chr> <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
    #>  1 N142…  2013     1     1      517            515         2      830
    #>  2 N242…  2013     1     1      533            529         4      850
    #>  3 N619…  2013     1     1      542            540         2      923
    #>  4 N804…  2013     1     1      544            545        -1     1004
    #>  5 N668…  2013     1     1      554            600        -6      812
    #>  6 N394…  2013     1     1      554            558        -4      740
    #>  7 N516…  2013     1     1      555            600        -5      913
    #>  8 N829…  2013     1     1      557            600        -3      709
    #>  9 N593…  2013     1     1      557            600        -3      838
    #> 10 N3AL…  2013     1     1      558            600        -2      753
    #> # … with 336,766 more rows, and 11 more variables: sched_arr_time <dbl>,
    #> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, origin <chr>,
    #> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
    #> #   time_hour <dttm>
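Since col_select uses the same selection mini-language as dplyr::select(), helper functions should work too. A small sketch on a made-up four-column file; ends_with() is a tidyselect helper that, as far as I know, vroom re-exports, so if it is not found, attaching dplyr or tidyselect should provide it.

```r
library(vroom)

# Hypothetical file with a few flights-like columns
path <- tempfile(fileext = ".csv")
writeLines(
  c("year,dep_time,arr_time,tailnum",
    "2013,517,830,N14228",
    "2013,533,850,N24211"),
  path
)

# Keep every column whose name ends in "time"
times <- vroom(path, delim = ",", col_select = ends_with("time"), show_col_types = FALSE)

# Rename while selecting: tailnum becomes plane, and only two columns are kept
renamed <- vroom(path, delim = ",", col_select = c(plane = tailnum, year), show_col_types = FALSE)
```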
    

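    7. Specifying column types

    In most cases vroom guesses the column types correctly, but it occasionally gets one wrong, and then you can specify the type by hand. You could also convert the columns afterwards with dplyr, though that is a little more work.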

    Type reference; the character in quotes is the abbreviation accepted by col_types.

    • col_logical() 'l': containing only T, F, TRUE, FALSE, 1 or 0.
    • col_integer() 'i': integer values.
    • col_double() 'd': floating point values.
    • col_number() 'n': numbers containing the grouping_mark.
    • col_date(format = "") 'D': with the locale's date_format.
    • col_time(format = "") 't': with the locale's time_format.
    • col_datetime(format = "") 'T': ISO8601 date times.
    • col_factor(levels, ordered) 'f': a fixed set of values.
    • col_character() 'c': everything else.
    • col_skip() '_' or '-': don't import this column.
    • col_guess() '?': parse using the "best" type based on the input.

    Usage examples:

    # read the 'year' column as an integer
    data <- vroom("flights.tsv", col_types = c(year = "i"))
    
    # also skip reading the 'time_hour' column
    data <- vroom("flights.tsv", col_types = c(year = "i", time_hour = "_"))
    
    # also read the carrier as a factor
    data <- vroom("flights.tsv", col_types = c(year = "i", time_hour = "_", carrier = "f"))
    
    # the same specification written with the col_*() functions
    data <- vroom("flights.tsv",
      col_types = list(year = col_integer(), time_hour = col_skip(), carrier = col_factor())
    )
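To confirm that a specification did what was intended, it helps to inspect the resulting column classes. The sketch below uses a made-up three-column file; the compact string form "if_" (one character per column, in file order) is my reading of the col_types documentation rather than something shown in the examples above.

```r
library(vroom)

# Hypothetical file mirroring three of the flights columns
path <- tempfile(fileext = ".csv")
writeLines(
  c("year,carrier,time_hour",
    "2013,UA,2013-01-01 05:00:00",
    "2013,AA,2013-01-01 06:00:00"),
  path
)

# Named specification: integer year, factor carrier, time_hour skipped
typed <- vroom(
  path,
  delim = ",",
  col_types = cols(year = col_integer(), carrier = col_factor(), time_hour = col_skip())
)

# Compact form: one abbreviation per column, in file order
typed_compact <- vroom(path, delim = ",", col_types = "if_")

# First class of each column, to check the result
col_classes <- vapply(typed, function(col) class(col)[1], character(1))
```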
    

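    8. Read speed

    In a word: fast! It is well suited to machine-learning datasets, which routinely run to several GB. Part of the reason is that vroom indexes the file first and parses values lazily, only when they are actually used.

    [Figure: time taken by each package to read and write 1.55 GB of data]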

