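With the environment installed, it is time to write a demo project.
The requirement is simple: print the number of lines in a log file that has been uploaded to HDFS.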
Dependencies
The program needs the spark-sql library. First check which spark-sql version ships in the Spark directory:
spark-2.4.3-bin-hadoop2.7/jars/spark-sql_2.11-2.4.3.jar
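Then reference the same library in the program, matching both the Scala version (2.11) and the Spark version (2.4.3):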
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.3</version>
</dependency>
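Main program
package cn.com.sjfx.sparkappdemo;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Application {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("sparkappdemo").getOrCreate();
        Dataset<Row> logs = spark.read().json("hdfs://192.168.1.26:9000/sjfxlogs/gateway-json-2019-05-31-1.log");
        System.out.println(">>>######## 日志总行数:" + logs.count()); // label means "total log lines"
        spark.stop();
    }
}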
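Packaging
The program is submitted to a Spark cluster for execution, so Spark recommends packaging it as a fat jar (also called an uber jar). The jar contains every dependency except the Hadoop and Spark libraries, which the cluster provides at run time, so the job does not fail on missing dependencies.
A fat jar can be built with Maven's assembly plugin: http://maven.apache.org/plugins/maven-assembly-plugin/assembly.html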
pom.xml
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.1.0</version>
            <configuration>
                <!-- each descriptor file goes in its own <descriptor> element -->
                <descriptors>
                    <descriptor>assembly.xml</descriptor>
                </descriptors>
            </configuration>
            <executions>
                <execution>
                    <id>assemble-all</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Define the packaging rules (assembly.xml):
<assembly xmlns="http://maven.apache.org/ASSEMBLY/2.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/ASSEMBLY/2.0.0 http://maven.apache.org/xsd/assembly-2.0.0.xsd">
    <!-- TODO: a jarjar format would be better -->
    <id>jar-with-dependencies</id>
    <formats>
        <format>jar</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
        <dependencySet>
            <outputDirectory>/</outputDirectory>
            <useProjectArtifact>true</useProjectArtifact>
            <unpack>true</unpack>
            <scope>runtime</scope>
            <excludes>
                <exclude>org.apache.hadoop:*</exclude>
                <exclude>org.apache.spark:*</exclude>
            </excludes>
        </dependencySet>
    </dependencySets>
</assembly>
Run the packaging command:
mvn clean package
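Before submitting, it can be worth confirming that the exclude rules really kept the Spark and Hadoop classes out of the jar. The following is a minimal sketch that uses only the JDK; the class name `FatJarCheck` and its methods are hypothetical helpers and not part of the original project.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper (not part of the original project): reports jar entries
// that belong to packages the cluster is expected to provide.
class FatJarCheck {
    // Package paths that should NOT appear inside the fat jar.
    static final List<String> PROVIDED_PREFIXES =
            Arrays.asList("org/apache/spark/", "org/apache/hadoop/");

    // Returns the entry names that start with one of the given prefixes.
    static List<String> findBundled(List<String> entryNames, List<String> prefixes) {
        List<String> hits = new ArrayList<>();
        for (String name : entryNames) {
            for (String prefix : prefixes) {
                if (name.startsWith(prefix)) {
                    hits.add(name);
                    break;
                }
            }
        }
        return hits;
    }

    // Lists every entry name in the jar at the given path.
    static List<String> entryNames(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            List<String> names = new ArrayList<>();
            for (JarEntry entry : Collections.list(jar.entries())) {
                names.add(entry.getName());
            }
            return names;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FatJarCheck <path-to-jar>");
            System.exit(2);
        }
        List<String> hits = findBundled(entryNames(args[0]), PROVIDED_PREFIXES);
        System.out.println(hits.isEmpty()
                ? "OK: no Spark or Hadoop classes bundled"
                : "Found " + hits.size() + " bundled Spark/Hadoop entries, e.g. " + hits.get(0));
    }
}
```

Compile it with javac and pass the path of the jar under target/ as the only argument. The check is prefix-based, so it would not notice Spark classes that a third-party dependency has relocated under another package.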
Submit and run
See the official guide on submitting applications: https://spark.apache.org/docs/latest/submitting-applications.html
./spark-2.4.3-bin-hadoop2.7/bin/spark-submit --master spark://192.168.1.26:5030 --class cn.com.sjfx.sparkappdemo.Application ~/sjfx-spark-app-demo/target/spark-app-demo-1.0-SNAPSHOT-jar-with-dependencies.jar
Errors
The following error occurred at run time:
2019-06-06 14:52:33 INFO DAGScheduler:54 - Job 0 failed: json at Application.java:10, took 1.526856 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost task 2.3 in stage 0.0 (TID 16, 192.168.1.26, executor 1): java.io.InvalidClassException: org.apache.spark.sql.catalyst.expressions.AttributeReference; local class incompatible: stream classdesc serialVersionUID = 6743846238922907819, local class serialVersionUID = -3473797288281215461
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:687)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1883)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1749)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2040)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1571)
........
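The InvalidClassException says that the serialized AttributeReference class (stream classdesc) and the class loaded on the executor (local class) have different serialVersionUID values, which means the two sides are running different builds of the same class. Possible causes:
- The JDK used to compile the program differs from the JDK that runs it
- The versions of the program's dependencies differ from the Spark runtime libraries on the cluster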
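One way to narrow down which of these causes applies is to print the JDK version and the serialVersionUID that a given classpath produces for the offending class, then compare the values between the build side and the cluster side. The sketch below uses only JDK APIs (`ObjectStreamClass.lookup` and the class's `CodeSource`); the class `SerialVersionCheck` and the nested `Sample` class are hypothetical and exist only for illustration.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;
import java.security.CodeSource;

// Hypothetical diagnostic helper (not part of the original project).
class SerialVersionCheck {
    // Tiny serializable class used only as a default when no class name is given.
    static class Sample implements Serializable {
        private static final long serialVersionUID = 42L;
    }

    // Returns the serialVersionUID that Java serialization uses for this class.
    static long uidOf(Class<?> cls) {
        ObjectStreamClass desc = ObjectStreamClass.lookup(cls);
        if (desc == null) {
            throw new IllegalArgumentException(cls.getName() + " is not Serializable");
        }
        return desc.getSerialVersionUID();
    }

    // Returns the jar or directory the class was loaded from, if it is known.
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        return source == null ? "unknown (bootstrap class)" : String.valueOf(source.getLocation());
    }

    public static void main(String[] args) throws ClassNotFoundException {
        System.out.println("java.version = " + System.getProperty("java.version"));
        // With no argument the demo class is inspected; otherwise the named class.
        Class<?> cls = args.length > 0 ? Class.forName(args[0]) : Sample.class;
        System.out.println(cls.getName() + " serialVersionUID = " + uidOf(cls));
        System.out.println("loaded from " + locationOf(cls));
    }
}
```

Running it once with the jars from spark-2.4.3-bin-hadoop2.7/jars on the classpath and once with the application's own dependency jars, passing org.apache.spark.sql.catalyst.expressions.AttributeReference as the argument each time, should show whether the two sides agree. If the UIDs differ, the "loaded from" line points at the jar to fix.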
Result
Output of a successful run (日志总行数 is the "total log lines" label printed by the program):
19/06/10 10:32:58 INFO DAGScheduler: Job 1 finished: count at Application.java:11, took 2.744932 s
>>>######## 日志总行数:668244
19/06/10 10:32:58 INFO SparkUI: Stopped Spark web UI at http://mo-x:4040
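A count like this can be sanity-checked against a local copy of the log file. The sketch below is a rough cross-check only: it assumes the file holds one JSON record per line and that blank lines are not counted as records. Both are assumptions about the data and the reader's behavior, not something the original project states, and the class name `LocalLineCount` is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper (not part of the original project): counts the
// non-blank lines of a local file as a rough cross-check of the Spark count.
class LocalLineCount {
    // Counts the lines that contain something other than whitespace.
    static long countNonBlank(Stream<String> lines) {
        return lines.filter(line -> !line.trim().isEmpty()).count();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LocalLineCount <path-to-log>");
            System.exit(2);
        }
        // Files.lines reads lazily and decodes as UTF-8, so large logs do not
        // have to fit in memory.
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            System.out.println("non-blank lines: " + countNonBlank(lines));
        }
    }
}
```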
Submitting in cluster deploy mode (a node inside the cluster runs the driver program)
/home/mo/sjfx-hadoop/hadoop-2.7.7/bin/hadoop fs -put -f ~/sjfx-spark-app-demo/target/spark-app-demo-1.0-SNAPSHOT-jar-with-dependencies.jar hdfs://192.168.1.22:5020/ && \
/home/mo/sjfx-spark/spark-2.4.3-bin-hadoop2.7/bin/spark-submit --deploy-mode cluster --master spark://192.168.1.22:5030 --class cn.com.sjfx.sparkappdemo.Application hdfs://192.168.1.22:5020/spark-app-demo-1.0-SNAPSHOT-jar-with-dependencies.jar