
Running OLAP in TinkerPop with Spark on YARN

Author: Woople | Published 2019-12-07 21:33

    In TinkerPop, SparkGraphComputer can be combined with HadoopGraph to run OLAP over a graph in a distributed fashion on a big-data cluster. The official documentation only shows how to do this from the Gremlin Console, but in a real production environment you usually need to package the logic into a jar and run it as a program. This article shows how to run such a program with java -cp so that the OLAP job executes as Spark on YARN.

    Client Program

    The example below turns the official "Using CloneVertexProgram" sample into a Java program. The logic is simple: load the raw data file tinkerpop-modern.json (provided by the TinkerPop project) from HDFS to build a graph, then use CloneVertexProgram to copy that graph and write the copy back to HDFS in GraphSON format.

    package com.woople.tinkerpop.gremlin;

    import java.io.File;
    import org.apache.commons.configuration.FileConfiguration;
    import org.apache.commons.configuration.PropertiesConfiguration;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;
    import org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph;
    import org.apache.tinkerpop.gremlin.process.computer.clone.CloneVertexProgram;
    import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;

    public class HadoopGraphSparkComputerDemo {
        public static void main(String[] args) throws Exception {
            // Load hadoop-graphson.properties, passed as the first program argument.
            FileConfiguration configuration = new PropertiesConfiguration();
            configuration.load(new File(args[0]));

            // Load the Hadoop configuration (core-site.xml etc.) from the classpath.
            final Configuration hadoopConfig = new Configuration();

            if ("kerberos".equalsIgnoreCase(hadoopConfig.get("hadoop.security.authentication"))) { //1
                UserGroupInformation.setConfiguration(hadoopConfig);
                try {
                    UserGroupInformation userGroupInformation =
                            UserGroupInformation.loginUserFromKeytabAndReturnUGI(
                                    configuration.getString("user.principal"),
                                    configuration.getString("user.keytab"));
                    UserGroupInformation.setLoginUser(userGroupInformation);

                    System.out.println("Login successfully!");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }

            HadoopGraph graph = HadoopGraph.open(configuration);
            graph.compute(SparkGraphComputer.class).program(CloneVertexProgram.build().create()).submit().get(); //2
        }
    }
    
    1. Adapts the program to a Kerberos-secured environment
    2. The last line (marked //2) submits the job that clones the graph
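
    To confirm that the clone succeeded, the same SparkGraphComputer can run an OLAP traversal over the result. The sketch below is a minimal, optional example; it assumes a second, hypothetical properties file whose gremlin.hadoop.inputLocation points at the cloned data (Hadoop-Gremlin writes the resulting graph under the ~g subdirectory of the configured output location, i.e. /tmp/output/~g here).

    import java.io.File;
    import org.apache.commons.configuration.FileConfiguration;
    import org.apache.commons.configuration.PropertiesConfiguration;
    import org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph;
    import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
    import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;

    public class CloneResultCheck {
        public static void main(String[] args) throws Exception {
            // args[0]: a properties file like hadoop-graphson.properties, but with
            // gremlin.hadoop.inputLocation pointing at the cloned output (e.g. /tmp/output/~g).
            FileConfiguration configuration = new PropertiesConfiguration();
            configuration.load(new File(args[0]));

            HadoopGraph graph = HadoopGraph.open(configuration);
            // Run the counts as an OLAP traversal on Spark.
            GraphTraversalSource g = graph.traversal().withComputer(SparkGraphComputer.class);
            System.out.println("vertices = " + g.V().count().next());
            System.out.println("edges    = " + g.E().count().next());
        }
    }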

    Configuration File

    Create a configuration file named hadoop-graphson.properties with the following content:

    # the graph class
    gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
    # both the input and the output use the GraphSON (JSON) format
    gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat
    gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
    # path of the input data on HDFS
    gremlin.hadoop.inputLocation=/tmp/tinkerpop-modern.json
    # path of the output on HDFS
    gremlin.hadoop.outputLocation=/tmp/output
    # if the job jars are not on the classpath of every hadoop node, then they must be provided to the distributed cache at runtime
    gremlin.hadoop.jarsInDistributedCache=true
    
    ####################################
    # SparkGraphComputer Configuration #
    ####################################
    spark.master=yarn
    # SparkGraphComputer only supports the client deploy mode
    spark.submit.deployMode=client
    # HDFS path holding the jars that Spark on YARN needs at runtime
    spark.yarn.jars=/tmp/graph-jars/*.jar
    spark.driver.extraJavaOptions=-Dhdp.version=2.6.0.3-8
    spark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.0.3-8
    
    spark.serializer=org.apache.spark.serializer.KryoSerializer
    spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
    user.principal=user@AA.COM
    user.keytab=/tmp/hdfs.headless.keytab
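
    Note that the gremlin.graph key is what lets TinkerPop's generic GraphFactory decide which Graph implementation to instantiate, so the job could equally be bootstrapped through GraphFactory instead of calling HadoopGraph.open directly. A minimal sketch, assuming the properties file is available locally at conf/hadoop-graphson.properties (a path chosen only for illustration):

    import org.apache.tinkerpop.gremlin.process.computer.clone.CloneVertexProgram;
    import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
    import org.apache.tinkerpop.gremlin.structure.Graph;
    import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

    public class GraphFactoryBootstrapDemo {
        public static void main(String[] args) throws Exception {
            // GraphFactory reads gremlin.graph from the file and instantiates a HadoopGraph.
            Graph graph = GraphFactory.open("conf/hadoop-graphson.properties");
            graph.compute(SparkGraphComputer.class)
                 .program(CloneVertexProgram.build().create())
                 .submit().get();
        }
    }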
    

    Building the Program

    All of the required dependencies are declared in the pom file, and the maven-dependency-plugin copies them into the lib directory at build time:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>com.woople</groupId>
        <artifactId>graph-tutorials</artifactId>
        <version>1.0-SNAPSHOT</version>
    
        <dependencies>
            <dependency><!--required-->
                <groupId>org.apache.tinkerpop</groupId>
                <artifactId>gremlin-core</artifactId>
                <version>3.4.4</version>
            </dependency>
            <dependency><!--required-->
                <groupId>org.apache.tinkerpop</groupId>
                <artifactId>tinkergraph-gremlin</artifactId>
                <version>3.4.4</version>
            </dependency>
            <dependency><!--required-->
                <groupId>org.apache.tinkerpop</groupId>
                <artifactId>spark-gremlin</artifactId>
                <version>3.4.4</version>
            </dependency>
            <dependency><!--required-->
                <groupId>org.apache.tinkerpop</groupId>
                <artifactId>hadoop-gremlin</artifactId>
                <version>3.4.4</version>
            </dependency>
            <dependency><!--required-->
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-yarn_2.11</artifactId>
                <version>2.4.0</version>
            </dependency>
            <dependency><!--required-->
                <groupId>org.scala-lang</groupId>
                <artifactId>scala-reflect</artifactId>
                <version>2.11.8</version>
            </dependency>
            <dependency><!--required-->
                <groupId>com.sun.jersey</groupId>
                <artifactId>jersey-core</artifactId>
                <version>1.9</version>
            </dependency>
            <dependency><!--required-->
                <groupId>com.sun.jersey</groupId>
                <artifactId>jersey-client</artifactId>
                <version>1.9</version>
            </dependency>
    
            <dependency>
                <groupId>log4j</groupId>
                <artifactId>log4j</artifactId>
                <version>1.2.17</version>
            </dependency>
    
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-api</artifactId>
                <version>1.7.28</version>
            </dependency>
            <dependency>
                <groupId>ch.qos.logback</groupId>
                <artifactId>logback-classic</artifactId>
                <version>1.2.3</version>
            </dependency>
            <dependency>
                <groupId>ch.qos.logback</groupId>
                <artifactId>logback-core</artifactId>
                <version>1.2.3</version>
            </dependency>
        </dependencies>
        <build>
            <defaultGoal>package</defaultGoal>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-resources-plugin</artifactId>
                    <configuration>
                        <encoding>UTF-8</encoding>
                    </configuration>
                    <executions>
                        <execution>
                            <goals>
                                <goal>copy-resources</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.2</version>
                    <executions>
                        <execution>
                            <id>eclipse-add-source</id>
                            <goals>
                                <goal>add-source</goal>
                            </goals>
                        </execution>
                        <execution>
                            <id>scala-compile-first</id>
                            <phase>process-resources</phase>
                            <goals>
                                <goal>compile</goal>
                            </goals>
                        </execution>
                        <execution>
                            <id>scala-test-compile-first</id>
                            <phase>process-test-resources</phase>
                            <goals>
                                <goal>testCompile</goal>
                            </goals>
                        </execution>
                        <execution>
                            <id>attach-scaladocs</id>
                            <phase>verify</phase>
                            <goals>
                                <goal>doc-jar</goal>
                            </goals>
                        </execution>
                    </executions>
                    <configuration>
                        <scalaVersion>2.11.8</scalaVersion>
                        <recompileMode>incremental</recompileMode>
                        <useZincServer>true</useZincServer>
                    </configuration>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.5.1</version>
                    <executions>
                        <execution>
                            <phase>compile</phase>
                            <goals>
                                <goal>compile</goal>
                            </goals>
                        </execution>
                    </executions>
                    <configuration>
                        <source>8</source>
                        <target>8</target>
                    </configuration>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-dependency-plugin</artifactId>
                    <version>3.1.1</version>
                    <executions>
                        <execution>
                            <id>copy-dependencies</id>
                            <phase>prepare-package</phase>
                            <goals>
                                <goal>copy-dependencies</goal>
                            </goals>
                            <configuration>
                                <outputDirectory>${project.build.directory}/lib</outputDirectory>
                                <overWriteReleases>false</overWriteReleases>
                                <overWriteSnapshots>false</overWriteSnapshots>
                                <overWriteIfNewer>true</overWriteIfNewer>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>
    </project>
    

    Deployment and Execution

    1. Upload all of the dependency jars generated under lib in the previous step to the HDFS path specified by spark.yarn.jars
    2. Copy the generated lib directory and graph-tutorials-1.0-SNAPSHOT.jar to the runtime environment, for example /opt/graph-tutorials
    3. Put hadoop-graphson.properties, core-site.xml, hdfs-site.xml and yarn-site.xml into a dedicated directory, for example /opt/graph-tutorials/conf
    4. cd /opt/graph-tutorials and run java -cp lib/*:conf:graph-tutorials-1.0-SNAPSHOT.jar com.woople.tinkerpop.gremlin.HadoopGraphSparkComputerDemo hadoop-graphson.properties (an optional check of the output is sketched below)
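
    After the job finishes, the cloned graph should show up under the configured gremlin.hadoop.outputLocation. As an optional sanity check, the output directory can be listed with the Hadoop FileSystem API; the sketch below is only an illustration and assumes the same conf directory is on the classpath (on a Kerberos-secured cluster the same UserGroupInformation login as in the demo program would be needed first).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OutputCheck {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml and hdfs-site.xml from the conf directory on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // /tmp/output is the gremlin.hadoop.outputLocation configured earlier.
            for (FileStatus status : fs.listStatus(new Path("/tmp/output"))) {
                System.out.println(status.getPath());
            }
        }
    }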

    Summary

    The problems encountered while debugging were mainly Hadoop configuration files not being found, jar conflicts, and missing classes. Note that the Spark version must be chosen to match the TinkerPop version in use. For the complete example used in this article, see graph-tutorials.
