In TinkerPop, SparkGraphComputer can be combined with HadoopGraph to perform distributed OLAP over a graph using the resources of a big-data cluster. The official documentation only shows how to run OLAP from the Gremlin Console, but in a real production environment you usually need to package your code into a jar and run it as a standalone program. This article shows how to run such a program with `java -cp` to achieve Spark on YARN.
Client program
This example adapts the official sample Using CloneVertexProgram into a Java program. The logic is simple: load the raw data file tinkerpop-modern.json (shipped with TinkerPop) from HDFS to build a graph, then use CloneVertexProgram to copy that graph and write the copy back to HDFS in GraphSON format.
```java
package com.woople.tinkerpop.gremlin;

import java.io.File;

import org.apache.commons.configuration.FileConfiguration;
import org.apache.commons.configuration.PropertiesConfiguration;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph;
import org.apache.tinkerpop.gremlin.process.computer.clone.CloneVertexProgram;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;

public class HadoopGraphSparkComputerDemo {
    public static void main(String[] args) throws Exception {
        // Load the job configuration (hadoop-graphson.properties) passed as the first argument.
        FileConfiguration configuration = new PropertiesConfiguration();
        configuration.load(new File(args[0]));

        // Load core-site.xml etc. from the classpath so the Kerberos check below
        // actually sees the cluster's security settings.
        final Configuration hadoopConfig = new Configuration();
        if ("kerberos".equalsIgnoreCase(hadoopConfig.get("hadoop.security.authentication"))) { // 1
            UserGroupInformation.setConfiguration(hadoopConfig);
            try {
                UserGroupInformation userGroupInformation = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
                        configuration.getString("user.principal"), configuration.getString("user.keytab"));
                UserGroupInformation.setLoginUser(userGroupInformation);
                System.out.println("Login successfully!");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        HadoopGraph graph = HadoopGraph.open(configuration);
        // Submit CloneVertexProgram to the Spark cluster and wait for completion.
        graph.compute(SparkGraphComputer.class).program(CloneVertexProgram.build().create()).submit().get();
    }
}
```
- `// 1`: adapt to a Kerberos-secured environment by logging in from a keytab before submitting the job.
- The last line performs the actual graph clone.
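The clone job is just one VertexProgram; the same build-and-submit pattern applies to any OLAP computation. As an illustration (not part of the original project), here is a minimal sketch that runs TinkerPop's built-in PageRankVertexProgram through the same SparkGraphComputer, with the Kerberos login omitted for brevity:

```java
import java.io.File;

import org.apache.commons.configuration.FileConfiguration;
import org.apache.commons.configuration.PropertiesConfiguration;
import org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph;
import org.apache.tinkerpop.gremlin.process.computer.ranking.pagerank.PageRankVertexProgram;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;

// Hypothetical variant: run PageRank instead of the clone, reusing the same
// hadoop-graphson.properties; the rank values end up as vertex properties in
// the graph written to the output location.
public class HadoopGraphPageRankDemo {
    public static void main(String[] args) throws Exception {
        FileConfiguration configuration = new PropertiesConfiguration();
        configuration.load(new File(args[0]));
        HadoopGraph graph = HadoopGraph.open(configuration);
        graph.compute(SparkGraphComputer.class)
             .program(PageRankVertexProgram.build().create(graph))
             .submit().get();
    }
}
```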
Configuration file
Create a configuration file named `hadoop-graphson.properties` with the following content:
```properties
# the graph class
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
# both input and output are in GraphSON format
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
# path of the input data
gremlin.hadoop.inputLocation=/tmp/tinkerpop-modern.json
# path of the output
gremlin.hadoop.outputLocation=/tmp/output
# if the job jars are not on the classpath of every hadoop node, then they must be provided to the distributed cache at runtime
gremlin.hadoop.jarsInDistributedCache=true

####################################
# SparkGraphComputer Configuration #
####################################
spark.master=yarn
# SparkGraphComputer only supports client deploy mode
spark.submit.deployMode=client
# HDFS path holding the jars that Spark on YARN needs at runtime
spark.yarn.jars=/tmp/graph-jars/*.jar
spark.driver.extraJavaOptions=-Dhdp.version=2.6.0.3-8
spark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.0.3-8
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator

# Kerberos principal and keytab read by the demo program
user.principal=user@AA.COM
user.keytab=/tmp/hdfs.headless.keytab
```
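The input file must exist at `gremlin.hadoop.inputLocation` before the job runs. One way to put it there is the Hadoop FileSystem API; the sketch below is illustrative only (the local path `data/tinkerpop-modern.json` is an assumption based on where the TinkerPop distribution ships the file, and an equivalent `hdfs dfs -put` works just as well):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: copy the sample data to the configured input location.
public class UploadSampleData {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml/hdfs-site.xml from the classpath (the conf directory).
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            fs.copyFromLocalFile(new Path("data/tinkerpop-modern.json"),
                                 new Path("/tmp/tinkerpop-modern.json"));
        }
    }
}
```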
Building the program
Declare all the required dependencies in the pom file, and use the maven-dependency-plugin so that the build copies the dependency jars into the lib directory:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.woople</groupId>
    <artifactId>graph-tutorials</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency><!--required-->
            <groupId>org.apache.tinkerpop</groupId>
            <artifactId>gremlin-core</artifactId>
            <version>3.4.4</version>
        </dependency>
        <dependency><!--required-->
            <groupId>org.apache.tinkerpop</groupId>
            <artifactId>tinkergraph-gremlin</artifactId>
            <version>3.4.4</version>
        </dependency>
        <dependency><!--required-->
            <groupId>org.apache.tinkerpop</groupId>
            <artifactId>spark-gremlin</artifactId>
            <version>3.4.4</version>
        </dependency>
        <dependency><!--required-->
            <groupId>org.apache.tinkerpop</groupId>
            <artifactId>hadoop-gremlin</artifactId>
            <version>3.4.4</version>
        </dependency>
        <dependency><!--required-->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-yarn_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency><!--required-->
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-reflect</artifactId>
            <version>2.11.8</version>
        </dependency>
        <dependency><!--required-->
            <groupId>com.sun.jersey</groupId>
            <artifactId>jersey-core</artifactId>
            <version>1.9</version>
        </dependency>
        <dependency><!--required-->
            <groupId>com.sun.jersey</groupId>
            <artifactId>jersey-client</artifactId>
            <version>1.9</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.28</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>1.2.3</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-core</artifactId>
            <version>1.2.3</version>
        </dependency>
    </dependencies>

    <build>
        <defaultGoal>package</defaultGoal>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <configuration>
                    <encoding>UTF-8</encoding>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>copy-resources</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <id>eclipse-add-source</id>
                        <goals>
                            <goal>add-source</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile-first</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>attach-scaladocs</id>
                        <phase>verify</phase>
                        <goals>
                            <goal>doc-jar</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <scalaVersion>2.11.8</scalaVersion>
                    <recompileMode>incremental</recompileMode>
                    <useZincServer>true</useZincServer>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-dependency-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <execution>
                        <id>copy-dependencies</id>
                        <phase>prepare-package</phase>
                        <goals>
                            <goal>copy-dependencies</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>${project.build.directory}/lib</outputDirectory>
                            <overWriteReleases>false</overWriteReleases>
                            <overWriteSnapshots>false</overWriteSnapshots>
                            <overWriteIfNewer>true</overWriteIfNewer>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
```
Deploy and run
- Upload all the dependency jars generated under lib in the previous step to the HDFS path specified by `spark.yarn.jars`.
- Copy the generated lib folder and `graph-tutorials-1.0-SNAPSHOT.jar` to the runtime environment, e.g. /opt/graph-tutorials.
- Put hadoop-graphson.properties, core-site.xml, hdfs-site.xml and yarn-site.xml into a directory that will be on the classpath, e.g. /opt/graph-tutorials/conf.
- `cd /opt/graph-tutorials` and run the command below (note that `lib/*` is quoted so the wildcard is expanded by the JVM rather than the shell):

```
java -cp "lib/*:conf:graph-tutorials-1.0-SNAPSHOT.jar" com.woople.tinkerpop.gremlin.HadoopGraphSparkComputerDemo hadoop-graphson.properties
```
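Once the job finishes, the cloned graph can be read back to verify the result. Below is a minimal sketch, under the assumption that SparkGraphComputer wrote the graph under `<outputLocation>/~g` (the layout shown in the TinkerPop reference docs); the tinkerpop-modern sample should yield 6 vertices:

```java
import java.io.File;

import org.apache.commons.configuration.FileConfiguration;
import org.apache.commons.configuration.PropertiesConfiguration;
import org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;

// Hypothetical check: count the vertices of the cloned graph.
public class VerifyCloneDemo {
    public static void main(String[] args) throws Exception {
        FileConfiguration configuration = new PropertiesConfiguration();
        configuration.load(new File(args[0]));
        // Read from the clone job's output instead of the original input,
        // and write any intermediate output elsewhere to avoid clobbering it.
        configuration.setProperty("gremlin.hadoop.inputLocation", "/tmp/output/~g");
        configuration.setProperty("gremlin.hadoop.outputLocation", "/tmp/output-verify");
        HadoopGraph graph = HadoopGraph.open(configuration);
        long count = graph.traversal().withComputer(SparkGraphComputer.class)
                          .V().count().next();
        System.out.println("cloned vertices: " + count);
    }
}
```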
Summary
The problems hit while debugging were mostly Hadoop configuration files not being picked up, jar conflicts, and missing classes. In particular, choose a Spark version that matches your TinkerPop version. For the complete example, see graph-tutorials.