RDD

Author: 自由编程 | Published 2019-10-20 14:49

    Terminology

    resilient distributed dataset (RDD): Spark's core abstraction, a fault-tolerant collection of elements that can be operated on in parallel.

    Runtime environment

    Spark 2.4.0 is built and distributed to work with Scala 2.11 by default. Note that the Spark and Scala version numbers must correspond; otherwise you will hit assorted unexpected errors at runtime. Also, Spark 2.4.0 is best paired with JDK 1.8, and if you use it together with Hadoop, Hadoop 2.7 is a suitable choice.
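
    A quick sanity check, assuming you already have a spark-shell session open (the shell provides sc as a ready-made SparkContext):

    // confirm the Spark and Scala versions actually in use
    sc.version                     // e.g. 2.4.0
    util.Properties.versionString  // e.g. version 2.11.12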

    Maven dependencies

    #spark
    groupId = org.apache.spark
    artifactId = spark-core_2.11
    version = 2.4.0
    #hdfs
    groupId = org.apache.hadoop
    artifactId = hadoop-client
    version = <your-hdfs-version>
    
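    The same coordinates, expressed as a pom.xml fragment (a sketch; replace the hadoop-client version placeholder with the version matching your HDFS cluster):

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>your-hdfs-version</version>
    </dependency>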

    Initializing the environment

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkConf
    
    val conf = new SparkConf().setAppName(appName).setMaster(master)
    val sc = new SparkContext(conf)
    
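    Here appName is a name for your application shown in the cluster UI, and master is a Spark, Mesos or YARN cluster URL, or a local[n] string to run locally. A minimal self-contained sketch (the app name and local[2] master are arbitrary choices for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    object RddDemo {
      def main(args: Array[String]): Unit = {
        // run locally with 2 worker threads
        val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[2]")
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 10).sum())  // prints 55.0
        sc.stop()  // shut down the context when finished
      }
    }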

    Initializing from the shell

    ./bin/spark-shell --master local[4] --packages "org.example:example:0.1"
    

    Data operations

    Creating an RDD: Parallelized Collections

    val data = Array(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)
    
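    Once created, distData can be operated on in parallel. Continuing the example above (reduce and the optional partition-count argument are standard Spark API):

    // sum the array elements in parallel
    val sum = distData.reduce((a, b) => a + b)  // 15
    // the second argument to parallelize sets the number of partitions
    val distData10 = sc.parallelize(data, 10)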

    Loading data: External Datasets

    // either a local path on the machine, or an hdfs://, s3a://, etc. URI
    val distFile = sc.textFile("data.txt")
    
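    distFile is an RDD[String] with one element per line of the file. Continuing the example (map and reduce are standard RDD operations; data.txt is assumed to exist):

    // add up the lengths of all lines in the file
    val totalLength = distFile.map(s => s.length).reduce((a, b) => a + b)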

    RDD Operations
    ... to be continued
