Terminology
resilient distributed dataset (RDD): a fault-tolerant collection of elements, partitioned across the nodes of the cluster, that can be operated on in parallel.
Runtime environment
Spark 2.4.0 is built and distributed to work with Scala 2.11 by default. Note that the Spark and Scala version numbers must match, otherwise you will hit all kinds of obscure errors at runtime. Also, Spark 2.4.0 is best paired with JDK 1.8, and if you use it together with Hadoop, version 2.7 is a good choice.
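As a quick sanity check, you can print both versions from a running spark-shell (sc is the shell's pre-created SparkContext; the versions shown in the comments are just what you would expect for this setup):
println(sc.version)                          // Spark version, e.g. 2.4.0
println(scala.util.Properties.versionString) // Scala version, e.g. version 2.11.12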
Maven dependencies
#spark
groupId = org.apache.spark
artifactId = spark-core_2.11
version = 2.4.0
#hdfs
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
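For sbt users, the same coordinates can be written in build.sbt as below (a sketch; 2.7.3 is only an example hadoop-client version, match it to your cluster):
// build.sbt: %% appends the Scala binary suffix, yielding spark-core_2.11
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0",
  "org.apache.hadoop" % "hadoop-client" % "2.7.3" // example version only
)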
Initializing the environment
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

// appName is the name shown in the cluster UI; master is a Spark, Mesos or
// YARN cluster URL, or the special string "local[n]" to run locally with n threads
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
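One usage note from the SparkContext API: only one SparkContext may be active per JVM, so stop the current one before constructing another:
// Stop the active context before creating a new SparkContext
sc.stop()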
Initializing from the shell
In the spark-shell, a SparkContext is already created for you as the variable sc; --master sets the master URL (here, four local threads) and --packages adds Maven coordinates to the classpath:
./bin/spark-shell --master local[4] --packages "org.example:example:0.1"
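Local jars can be put on the shell's classpath with --jars in the same way (code.jar here is a hypothetical path to your own artifact):
./bin/spark-shell --master local[4] --jars code.jar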
Data operations
Creating an RDD: Parallelized Collections
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
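Once created, the distributed dataset can be operated on in parallel. For example, the elements can be summed with reduce, and parallelize also takes an optional second argument setting the number of partitions (both calls are part of the standard RDD API):
// Sum the elements across partitions; returns 15 for the array above
val sum = distData.reduce((a, b) => a + b)
// Explicitly cut the dataset into 10 partitions instead of the default
val distData10 = sc.parallelize(data, 10)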
Loading data: External Datasets
# either a local path on the machine, or an hdfs://, s3a://, etc. URI
val distFile = sc.textFile("data.txt")
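Once loaded, distFile behaves like any other RDD; for instance, the total length of all lines can be computed with map and reduce (an example from the standard RDD API):
// Total number of characters in the file: map each line to its length, then sum
val totalLength = distFile.map(s => s.length).reduce((a, b) => a + b)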
RDD Operations
...to be continued