Spark中DataFrame的基本操作

作者: 环境与方法 | 来源:发表于2019-10-07 23:41 被阅读0次

    前言

    在使用spark进行数据处理时,之前RDD使用较多,目前随着任务复杂度增加,有schema的DataFrame更加简洁便利。现在总结一下df的基本操作,作为备忘。

    DataFrame是什么

    DataFrame是一种以RDD为基础的分布式数据集,每一列都有自己的名称和数据类型。 Spark中的DataFrame属于比较基本的数据结构,在进行数据处理时,用DataFrame进行处理较RDD相比,速度更快,是带有schema的RDD数据结构。反观RDD,在进行使用时,只有内部分区,无法知道数据元素内部的数据结构。

    下面介绍一下在使用DataFrame(df)时的基本操作:

    1.在服务器上启动spark-shell
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
          /_/
    
    Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
    
    2.生成DF格式数据

    首先我们创造一个map,用字母指向该字母为首的水果英文单词,并转换为array形式。再将该array转换为rdd,并转换为名称为"alphabet"和"fruit"的df
    同理生成名称为"alphabet"和"fruit1"的df1,为之后两个df做join操作作准备

    scala> val test = Map("a" -> "apple", "b" -> "banana", "c" -> "cherry", "d" -> "durian", "g" -> "grape").toArray
    test: Array[(String, String)] = Array((a,apple), (b,banana), (g,grape), (c,cherry), (d,durian))
    
    scala> val df = sc.parallelize(test).toDF("alphabet", "fruit")
    df: org.apache.spark.sql.DataFrame = [alphabet: string, fruit: string]
    
    scala> val test1 = Map("a" -> "apple1", "b" -> "banana1", "c" -> "cherry1", "d" -> "durian1", "g" -> "grape1").toArray
    test1: Array[(String, String)] = Array((a,apple1), (b,banana1), (g,grape1), (c,cherry1), (d,durian1))
    
    scala> val df1 = sc.parallelize(test1).toDF("alphabet", "fruit1")
    df1: org.apache.spark.sql.DataFrame = [alphabet: string, fruit1: string]
    
    3.利用show()查看当前的df数据
    scala> df.show()
    +--------+------+
    |alphabet| fruit|
    +--------+------+
    |       a| apple|
    |       b|banana|
    |       g| grape|
    |       c|cherry|
    |       d|durian|
    +--------+------+
    
    4.增加一列fruit_copy,值和fruit一样
    scala> df.withColumn("fruit_copy", df("fruit")).show()
    +--------+------+----------+
    |alphabet| fruit|fruit_copy|
    +--------+------+----------+
    |       a| apple|     apple|
    |       b|banana|    banana|
    |       g| grape|     grape|
    |       c|cherry|    cherry|
    |       d|durian|    durian|
    +--------+------+----------+
    
    5.将其中一列重命名,将fruit_copy重命名为fruit_double
    scala> df.withColumn("fruit_copy", df("fruit")).withColumnRenamed("fruit_copy", "fruit_douuble").show()
    +--------+------+-------------+
    |alphabet| fruit|fruit_douuble|
    +--------+------+-------------+
    |       a| apple|        apple|
    |       b|banana|       banana|
    |       g| grape|        grape|
    |       c|cherry|       cherry|
    |       d|durian|       durian|
    +--------+------+-------------+
    
    6.将df和df1两个DataFrame做join left的操作,相当于rdd的leftOuterJoin,根据"alphabet"这列进行join
    scala> df.join(df1, df("alphabet") === df1("alphabet"), "left").show()
    +--------+------+--------+-------+
    |alphabet| fruit|alphabet| fruit1|
    +--------+------+--------+-------+
    |       g| grape|       g| grape1|
    |       d|durian|       d|durian1|
    |       c|cherry|       c|cherry1|
    |       b|banana|       b|banana1|
    |       a| apple|       a| apple1|
    +--------+------+--------+-------+
    
    7.explode操作, 将"fruit"列按照"a"分割。两种写法:

    方法一:

    scala> df.withColumn("split", split(col("fruit"), "a")).withColumn("split", explode(col("split"))).show()
    +--------+------+------+
    |alphabet| fruit| split|
    +--------+------+------+
    |       a| apple|      |
    |       a| apple|  pple|
    |       b|banana|     b|
    |       b|banana|     n|
    |       b|banana|     n|
    |       b|banana|      |
    |       g| grape|    gr|
    |       g| grape|    pe|
    |       c|cherry|cherry|
    |       d|durian|  duri|
    |       d|durian|     n|
    +--------+------+------+
    

    方法二:

    scala> df.explode("fruit", "split") { it: String => it.split("a")}.show()
    +--------+------+------+
    |alphabet| fruit| split|
    +--------+------+------+
    |       a| apple|      |
    |       a| apple|  pple|
    |       b|banana|     b|
    |       b|banana|     n|
    |       b|banana|     n|
    |       b|banana|      |
    |       g| grape|    gr|
    |       g| grape|    pe|
    |       c|cherry|cherry|
    |       d|durian|  duri|
    |       d|durian|     n|
    +--------+------+------+
    

    相关文章

      网友评论

        本文标题:Spark中DataFrame的基本操作

        本文链接:https://www.haomeiwen.com/subject/svrvpctx.html