1. Reading and Saving Files
The default read (and save) format is Parquet.
// 1. Read and save using the default Parquet format
val df1 = spark.read.load("***.parquet")
df1.write.save("output")
// 2. Working with JSON files
//val df2 = spark.read.format("json").load("***.json")
val df2 = spark.read.json("***.json")
df2.write.format("json").save("output")
// 3. Working with CSV files
val df3 = spark.read.format("csv")
.option("sep",",")//分隔符
.option("header","true")//是否有表头
.load("***.csv")
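// A hedged counterpart (the output path below is a placeholder): write the CSV
// DataFrame back out with the same separator and header options
df3.write
  .format("csv")
  .option("sep", ",")
  .option("header", "true")
  .save("output_csv")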
// 4. Query a file directly with SQL
val df = spark.sql("select * from json.`file/1.txt`")
df.show()
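The format prefix in the FROM clause is not limited to json; parquet and csv work the same way. A minimal sketch, reusing the Parquet "output" directory written above:
val dfp = spark.sql("select * from parquet.`output`")
dfp.show()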
2. Save Modes: SaveMode
// ErrorIfExists: the default; throws an error if the output path already exists
df.write.format("json").save("output")
// Append: appends by writing new files into the existing directory
df.write.format("json").mode("append").save("output")
// Overwrite: overwrites any existing data at the path
df.write.format("json").mode("overwrite").save("output")
// Ignore: skips the save if the path already exists; otherwise saves
df.write.format("json").mode("ignore").save("output")
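The string modes correspond to the org.apache.spark.sql.SaveMode enum, which mode() also accepts directly; the equivalent calls look like this:
import org.apache.spark.sql.SaveMode

df.write.format("json").mode(SaveMode.Append).save("output")
df.write.format("json").mode(SaveMode.Overwrite).save("output")
df.write.format("json").mode(SaveMode.Ignore).save("output")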
3. Spark SQL with MySQL
// Read from MySQL
val df = spark.read
.format("jdbc")
.option("driver", "com.mysql.cj.jdbc.Driver")
.option("url", "jdbc:mysql://sinan01:3306/caster?characterEncoding=utf-8&serverTimezone=UTC")
.option("user", "root")
.option("password", "123456")
.option("dbtable", "caster1")
.load()
df.show()
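// Optionally push a query down to MySQL instead of pulling the whole table.
// A sketch assuming caster1 has an id column; the "query" option (Spark 2.4+)
// replaces "dbtable" and must not be combined with it.
val dfq = spark.read
  .format("jdbc")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .option("url", "jdbc:mysql://sinan01:3306/caster?characterEncoding=utf-8&serverTimezone=UTC")
  .option("user", "root")
  .option("password", "123456")
  .option("query", "select * from caster1 where id > 0")
  .load()
dfq.show()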
// Write to MySQL
import org.apache.spark.sql.SaveMode
df.write
.format("jdbc")
.option("driver", "com.mysql.cj.jdbc.Driver")
.option("url", "jdbc:mysql://sinan01:3306/caster?characterEncoding=utf-8&serverTimezone=UTC")
.option("user", "root")
.option("password", "123456")
.option("dbtable", "caster2")
.mode(SaveMode.Append)
.save()
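Both the reader and the writer also expose a jdbc() shorthand that takes the connection settings as a java.util.Properties object; a minimal sketch with the same placeholder credentials:
import java.util.Properties

val props = new Properties()
props.put("driver", "com.mysql.cj.jdbc.Driver")
props.put("user", "root")
props.put("password", "123456")
val url = "jdbc:mysql://sinan01:3306/caster?characterEncoding=utf-8&serverTimezone=UTC"

val dfJdbc = spark.read.jdbc(url, "caster1", props)
dfJdbc.write.mode(SaveMode.Append).jdbc(url, "caster2", props)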
4. Spark SQL with Hive
4.1 The Built-in Hive Warehouse for Practice
Add the Maven dependency:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.3.2</version>
</dependency>
Creating built-in Hive tables and loading data requires enableHiveSupport():
val spark = SparkSession.builder
.master("local").appName("test")
.enableHiveSupport()
.getOrCreate()
// Create a built-in Hive table
spark.sql("create table user1(id int)")
// Load local data into the table
spark.sql("load data local inpath 'file/id.txt' into table user1")
// List the tables
spark.sql("show tables").show
// Query the table
spark.sql("select * from user1").show
The built-in tables are stored in the current project directory (the spark-warehouse folder):
![](https://img.haomeiwen.com/i5906405/1d6e9d895c811268.png)
4.2 Connecting to an External Hive
- For the spark-shell environment on Linux:
  - Copy hive-site.xml into Spark's conf directory
  - Copy the MySQL driver JAR into the jars directory (the Hive metastore is stored in MySQL)
  - Copy HDFS's core-site.xml and hdfs-site.xml into the conf directory
- For development in IDEA:
  - Maven dependencies:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.12</artifactId>
<version>3.1.2</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>3.1.2</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.11</version>
</dependency>
  - Copy Hive's configuration file hive-site.xml into the project's resources directory
  - Call enableHiveSupport() when building the SparkSession
  - Make sure the dependencies include the MySQL driver
The code is then the same as for the built-in Hive.
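A minimal SparkSession sketch for the external Hive, assuming hive-site.xml is already under resources; the HDFS user name, NameNode port, and warehouse path are placeholders for your cluster:
import org.apache.spark.sql.SparkSession

// Avoids HDFS permission errors when the local user differs from the cluster user (assumed user name)
System.setProperty("HADOOP_USER_NAME", "root")

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("external-hive-test")
  .config("spark.sql.warehouse.dir", "hdfs://sinan01:8020/user/hive/warehouse") // assumed path
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show tables").show()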