In TinkerPop, the underlying engine for computation over a graph is provided through GraphComputer.
1. Classification of the computation
TinkerPop provides two ways of interacting with graph data:
| | OLTP | OLAP |
|---|---|---|
| Data that drives the computation | a single vertex / a handful of vertices | many vertices; the whole graph may be used during the computation |
| Latency | milliseconds / seconds | minutes / hours |
| Data access pattern | random | sequential |
| Processing model | serial | parallel |
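To make the contrast concrete, here is a small sketch of my own (assuming TinkerPop's TinkerGraph and its toy "modern" graph) showing the two ways of interacting with the same graph; traversal(), withComputer(), and the rest are the regular TinkerPop API:

```java
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory;

public class OltpVsOlapSketch {
    public static void main(String[] args) {
        Graph graph = TinkerFactory.createModern();

        // OLTP: start from one vertex, random access, answers in milliseconds
        GraphTraversalSource g = graph.traversal();
        g.V(1).out("knows").values("name").forEachRemaining(System.out::println);

        // OLAP: the same kind of traversal submitted to a GraphComputer,
        // which scans all vertices and processes them in parallel
        GraphTraversalSource og = graph.traversal().withComputer();
        og.V().out("knows").values("name").forEachRemaining(System.out::println);
    }
}
```

The OLTP form touches a single vertex and returns immediately, while the OLAP form hands the traversal to a GraphComputer that works over the whole graph.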
2. Decomposition of the computation
The computation model that TinkerPop uses is, in my view, one big MapReduce: big MapReduce = BSP + MapReduce, where:
- BSP: the VertexProgram
- MapReduce: the MapReduce jobs (which may be absent)
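Before looking at each phase, here is a rough sketch of how the two phases are composed; the helper class is my own, but compute(), program(), mapReduce(), and submit() are the actual GraphComputer API:

```java
import org.apache.tinkerpop.gremlin.process.computer.ComputerResult;
import org.apache.tinkerpop.gremlin.process.computer.MapReduce;
import org.apache.tinkerpop.gremlin.process.computer.VertexProgram;
import org.apache.tinkerpop.gremlin.structure.Graph;

// Hedged sketch: any VertexProgram (BSP phase) optionally followed by a MapReduce phase.
public final class TwoPhaseSketch {
    public static ComputerResult run(Graph graph, VertexProgram<?> program,
                                     MapReduce<?, ?, ?, ?, ?> mapReduce) throws Exception {
        return graph.compute()          // obtain the graph's default GraphComputer
                .program(program)       // BSP phase: executed on every vertex in supersteps
                .mapReduce(mapReduce)   // MapReduce phase: collapses the distributed result
                .submit()               // asynchronous submission
                .get();                 // wait for the ComputerResult (result graph + memory)
    }
}
```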
1. The entry point of the computation / where parallelism starts: VertexProgram
Every vertex involved in the graph executes the code of the VertexProgram. An abstract execution machine called a worker runs this code, and workers can run it in parallel, i.e. each of them calls VertexProgram.execute() (the BSP computation model is used here). Vertices communicate with one another through two kinds of messages:
- MessageScope.Local: communication between adjacent vertices
- MessageScope.Global: communication with any vertex in the graph
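A minimal sketch of constructing the two scopes; MessageScope.Local.of(__::outE) also appears in the PageRank code below, while MessageScope.Global.instance() is my assumption for obtaining the global scope:

```java
import org.apache.tinkerpop.gremlin.process.computer.MessageScope;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;

public class MessageScopeSketch {
    // local scope: messages travel along each vertex's outgoing edges, i.e. to adjacent vertices
    static final MessageScope.Local<Double> LOCAL = MessageScope.Local.of(__::outE);
    // global scope: messages may be delivered to any vertex in the graph
    static final MessageScope.Global GLOBAL = MessageScope.Global.instance();
}
```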
After the VertexProgram has finished, its associated MapReduce jobs are executed next; they can be obtained via VertexProgram.getMapReducers().
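For example, a VertexProgram can declare its follow-up jobs by overriding getMapReducers(); the concrete MapReduce used below is only an illustration:

```java
// Hedged sketch: this method sits inside a VertexProgram implementation and tells the
// GraphComputer which MapReduce jobs to run once the BSP phase is done.
// ClusterCountMapReduce is used purely as an illustrative follow-up job.
@Override
public Set<MapReduce> getMapReducers() {
    return Collections.singleton(ClusterCountMapReduce.build().create());
}
```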
The BSP execution proceeds as follows:
1. local computation
2. communicate the computed results
3. block until every worker has finished the current round of computation
4. go back to step 1
Taking the implementation of PageRankVertexProgram as an example, we can see the concrete process:

```java
public class PageRankVertexProgram implements VertexProgram<Double> { //1
public static final String PAGE_RANK = "gremlin.pageRankVertexProgram.pageRank";
private static final String EDGE_COUNT = "gremlin.pageRankVertexProgram.edgeCount";
private static final String PROPERTY = "gremlin.pageRankVertexProgram.property";
private static final String VERTEX_COUNT = "gremlin.pageRankVertexProgram.vertexCount";
private static final String ALPHA = "gremlin.pageRankVertexProgram.alpha";
private static final String EPSILON = "gremlin.pageRankVertexProgram.epsilon";
private static final String MAX_ITERATIONS = "gremlin.pageRankVertexProgram.maxIterations";
private static final String EDGE_TRAVERSAL = "gremlin.pageRankVertexProgram.edgeTraversal";
private static final String INITIAL_RANK_TRAVERSAL = "gremlin.pageRankVertexProgram.initialRankTraversal";
private static final String TELEPORTATION_ENERGY = "gremlin.pageRankVertexProgram.teleportationEnergy";
private static final String CONVERGENCE_ERROR = "gremlin.pageRankVertexProgram.convergenceError";
private MessageScope.Local<Double> incidentMessageScope = MessageScope.Local.of(__::outE); //2
private MessageScope.Local<Double> countMessageScope = MessageScope.Local.of(new MessageScope.Local.ReverseTraversalSupplier(this.incidentMessageScope));
private PureTraversal<Vertex, Edge> edgeTraversal = null;
private PureTraversal<Vertex, ? extends Number> initialRankTraversal = null;
private double alpha = 0.85d;
private double epsilon = 0.00001d;
private int maxIterations = 20;
private String property = PAGE_RANK; //3
private Set<VertexComputeKey> vertexComputeKeys;
private Set<MemoryComputeKey> memoryComputeKeys;
private PageRankVertexProgram() { }
@Override
public void loadState(final Graph graph, final Configuration configuration) { //4
if (configuration.containsKey(INITIAL_RANK_TRAVERSAL))
this.initialRankTraversal = PureTraversal.loadState(configuration, INITIAL_RANK_TRAVERSAL, graph);
if (configuration.containsKey(EDGE_TRAVERSAL)) {
this.edgeTraversal = PureTraversal.loadState(configuration, EDGE_TRAVERSAL, graph);
this.incidentMessageScope = MessageScope.Local.of(() -> this.edgeTraversal.get().clone());
this.countMessageScope = MessageScope.Local.of(new MessageScope.Local.ReverseTraversalSupplier(this.incidentMessageScope));
}
this.alpha = configuration.getDouble(ALPHA, this.alpha);
this.epsilon = configuration.getDouble(EPSILON, this.epsilon);
this.maxIterations = configuration.getInt(MAX_ITERATIONS, 20);
this.property = configuration.getString(PROPERTY, PAGE_RANK);
this.vertexComputeKeys = new HashSet<>(Arrays.asList(
VertexComputeKey.of(this.property, false),
VertexComputeKey.of(EDGE_COUNT, true))); //5
this.memoryComputeKeys = new HashSet<>(Arrays.asList(
MemoryComputeKey.of(TELEPORTATION_ENERGY, Operator.sum, true, true),
MemoryComputeKey.of(VERTEX_COUNT, Operator.sum, true, true),
MemoryComputeKey.of(CONVERGENCE_ERROR, Operator.sum, false, true)));
}
@Override
public void storeState(final Configuration configuration) {
VertexProgram.super.storeState(configuration);
configuration.setProperty(ALPHA, this.alpha);
configuration.setProperty(EPSILON, this.epsilon);
configuration.setProperty(PROPERTY, this.property);
configuration.setProperty(MAX_ITERATIONS, this.maxIterations);
if (null != this.edgeTraversal)
this.edgeTraversal.storeState(configuration, EDGE_TRAVERSAL);
if (null != this.initialRankTraversal)
this.initialRankTraversal.storeState(configuration, INITIAL_RANK_TRAVERSAL);
}
@Override
public void setup(final Memory memory) {
memory.set(TELEPORTATION_ENERGY, null == this.initialRankTraversal ? 1.0d : 0.0d);
memory.set(VERTEX_COUNT, 0.0d);
memory.set(CONVERGENCE_ERROR, 1.0d);
}
@Override
public void execute(final Vertex vertex, Messenger<Double> messenger, final Memory memory) { //7
if (memory.isInitialIteration()) {
messenger.sendMessage(this.countMessageScope, 1.0d); //8
memory.add(VERTEX_COUNT, 1.0d);
} else {
final double vertexCount = memory.<Double>get(VERTEX_COUNT);
final double edgeCount;
double pageRank;
if (1 == memory.getIteration()) {
edgeCount = IteratorUtils.reduce(messenger.receiveMessages(), 0.0d, (a, b) -> a + b);
vertex.property(VertexProperty.Cardinality.single, EDGE_COUNT, edgeCount);
pageRank = null == this.initialRankTraversal ?
0.0d :
TraversalUtil.apply(vertex, this.initialRankTraversal.get()).doubleValue(); //9
} else {
edgeCount = vertex.value(EDGE_COUNT);
pageRank = IteratorUtils.reduce(messenger.receiveMessages(), 0.0d, (a, b) -> a + b); //10
}
//////////////////////////
final double teleporationEnergy = memory.get(TELEPORTATION_ENERGY);
if (teleporationEnergy > 0.0d) {
final double localTerminalEnergy = teleporationEnergy / vertexCount;
pageRank = pageRank + localTerminalEnergy;
memory.add(TELEPORTATION_ENERGY, -localTerminalEnergy);
}
final double previousPageRank = vertex.<Double>property(this.property).orElse(0.0d);
memory.add(CONVERGENCE_ERROR, Math.abs(pageRank - previousPageRank));
vertex.property(VertexProperty.Cardinality.single, this.property, pageRank);
memory.add(TELEPORTATION_ENERGY, (1.0d - this.alpha) * pageRank);
pageRank = this.alpha * pageRank;
if (edgeCount > 0.0d)
messenger.sendMessage(this.incidentMessageScope, pageRank / edgeCount);
else
memory.add(TELEPORTATION_ENERGY, pageRank);
}
}
@Override
public boolean terminate(final Memory memory) { //11
boolean terminate = memory.<Double>get(CONVERGENCE_ERROR) < this.epsilon || memory.getIteration() >= this.maxIterations;
memory.set(CONVERGENCE_ERROR, 0.0d);
return terminate;
}
}
```
- loadState and storeState save and restore the state of the instance/configuration so that the code can be executed on other machines
- execute shows the message passing between vertices
- terminate expresses the termination condition of the job
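Putting these pieces together, a hedged usage sketch (assuming TinkerGraph's toy "modern" graph; the property key is the PAGE_RANK constant defined above):

```java
import org.apache.tinkerpop.gremlin.process.computer.ComputerResult;
import org.apache.tinkerpop.gremlin.process.computer.ranking.pagerank.PageRankVertexProgram;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory;

public class PageRankSketch {
    public static void main(String[] args) throws Exception {
        Graph graph = TinkerFactory.createModern();
        // submit the VertexProgram to the graph's GraphComputer and wait for the result
        ComputerResult result = graph.compute()
                .program(PageRankVertexProgram.build().create(graph))
                .submit().get();
        // the BSP result is distributed over the vertices as the PAGE_RANK property
        result.graph().traversal().V()
                .valueMap("name", PageRankVertexProgram.PAGE_RANK)
                .forEachRemaining(System.out::println);
        // global bookkeeping, e.g. how many iterations were run, sits in Memory
        System.out.println("iterations: " + result.memory().getIteration());
    }
}
```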
2. Extending / converging the computation: MapReduce
Generally, once a BSP computation has finished, its result is distributed across the vertices of the graph, like vertex properties.
The end of a BSP computation does not mean that all results have converged to a single value; it only means that the iteration reached a certain number of steps or that there is nothing left to compute in the current round.
So when we need to answer a global question, the BSP result has to be processed further.
For example:
- after graph clustering finishes, how many vertices are in each cluster
- after graph clustering finishes, how many clusters there are in total (see the sketch below)
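A hedged sketch of answering the second question with TinkerPop's peer-pressure clustering classes (assuming TinkerGraph; the memory key is read from the MapReduce itself rather than hard-coded):

```java
import org.apache.tinkerpop.gremlin.process.computer.ComputerResult;
import org.apache.tinkerpop.gremlin.process.computer.MapReduce;
import org.apache.tinkerpop.gremlin.process.computer.clustering.peerpressure.ClusterCountMapReduce;
import org.apache.tinkerpop.gremlin.process.computer.clustering.peerpressure.PeerPressureVertexProgram;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory;

public class ClusterCountSketch {
    public static void main(String[] args) throws Exception {
        Graph graph = TinkerFactory.createModern();
        // BSP phase: peer-pressure clustering leaves a cluster label on every vertex;
        // MapReduce phase: collapse those labels into one global number of clusters
        MapReduce<?, ?, ?, ?, ?> clusterCount = ClusterCountMapReduce.build().create();
        ComputerResult result = graph.compute()
                .program(PeerPressureVertexProgram.build().create(graph))
                .mapReduce(clusterCount)
                .submit().get();
        Integer clusters = result.memory().get(clusterCount.getMemoryKey());
        System.out.println("clusters: " + clusters);
    }
}
```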
3. Spark's implementation of this computation
The Spark implementation of graph computation here does not depend on GraphX; instead it follows the two computation steps described above. Concretely:
Execution of the VertexProgram:
```java
while (true) {
    if (Thread.interrupted()) {
        sparkContext.cancelAllJobs();
        throw new TraversalInterruptedException();
    }
    // one BSP superstep: run VertexProgram.execute() for every vertex and build the
    // view (messages/state) that the next iteration will consume
    memory.setInExecute(true);
    viewIncomingRDD = SparkExecutor.executeVertexProgramIteration(loadedGraphRDD, viewIncomingRDD, memory, graphComputerConfiguration, vertexProgramConfiguration);
    memory.setInExecute(false);
    // the VertexProgram decides whether to stop (cf. terminate() above)
    if (this.vertexProgram.terminate(memory))
        break;
    else {
        memory.incrIteration();
        memory.broadcastMemory(sparkContext);
    }
}
```
Execution of the MapReduce jobs:
```java
for (final MapReduce mapReduce : this.mapReducers) {
    // execute the map reduce job
    final HadoopConfiguration newApacheConfiguration = new HadoopConfiguration(graphComputerConfiguration);
    mapReduce.storeState(newApacheConfiguration);
    // map
    final JavaPairRDD mapRDD = SparkExecutor.executeMap((JavaPairRDD) mapReduceRDD, mapReduce, newApacheConfiguration);
    // combine
    final JavaPairRDD combineRDD = mapReduce.doStage(MapReduce.Stage.COMBINE) ? SparkExecutor.executeCombine(mapRDD, newApacheConfiguration) : mapRDD;
    // reduce
    final JavaPairRDD reduceRDD = mapReduce.doStage(MapReduce.Stage.REDUCE) ? SparkExecutor.executeReduce(combineRDD, mapReduce, newApacheConfiguration) : combineRDD;
    // write the map reduce output back to disk and the computer result memory
    if (null != outputRDD)
        mapReduce.addResultToMemory(finalMemory, outputRDD.writeMemoryRDD(graphComputerConfiguration, mapReduce.getMemoryKey(), reduceRDD));
}
```