In TinkerPop, the underlying engine for computation over a graph is provided through GraphComputer.
1. Classification of the computation
TinkerPop provides two ways of interacting with graph data:
| | OLTP | OLAP |
|---|---|---|
| Data that drives the computation | a single vertex / a handful of vertices | many vertices; the whole graph may be used during the computation |
| Latency | milliseconds / seconds | minutes / hours |
| Data access pattern | random | sequential |
| Processing model | serial | parallel |
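To make the contrast concrete, here is a small sketch of my own (assuming TinkerPop's TinkerGraph and its toy "modern" graph) showing the two ways of interacting with the same graph; traversal(), withComputer(), and the rest are the regular TinkerPop API:

```java
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory;

public class OltpVsOlapSketch {
    public static void main(String[] args) {
        Graph graph = TinkerFactory.createModern();

        // OLTP: start from one vertex, random access, answers in milliseconds
        GraphTraversalSource g = graph.traversal();
        g.V(1).out("knows").values("name").forEachRemaining(System.out::println);

        // OLAP: the same kind of traversal submitted to a GraphComputer,
        // which scans all vertices and processes them in parallel
        GraphTraversalSource og = graph.traversal().withComputer();
        og.V().out("knows").values("name").forEachRemaining(System.out::println);
    }
}
```

The OLTP form touches a single vertex and returns immediately, while the OLAP form hands the traversal to a GraphComputer that works over the whole graph.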
2. Decomposition of the computation
The computation model that TinkerPop uses is, in my view, one big MapReduce: big MapReduce = BSP + MapReduce, where:
- BSP: the VertexProgram
- MapReduce: the MapReduce jobs (which may be absent)
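Before looking at each phase, here is a rough sketch of how the two phases are composed; the helper class is my own, but compute(), program(), mapReduce(), and submit() are the actual GraphComputer API:

```java
import org.apache.tinkerpop.gremlin.process.computer.ComputerResult;
import org.apache.tinkerpop.gremlin.process.computer.MapReduce;
import org.apache.tinkerpop.gremlin.process.computer.VertexProgram;
import org.apache.tinkerpop.gremlin.structure.Graph;

// Hedged sketch: any VertexProgram (BSP phase) optionally followed by a MapReduce phase.
public final class TwoPhaseSketch {
    public static ComputerResult run(Graph graph, VertexProgram<?> program,
                                     MapReduce<?, ?, ?, ?, ?> mapReduce) throws Exception {
        return graph.compute()          // obtain the graph's default GraphComputer
                .program(program)       // BSP phase: executed on every vertex in supersteps
                .mapReduce(mapReduce)   // MapReduce phase: collapses the distributed result
                .submit()               // asynchronous submission
                .get();                 // wait for the ComputerResult (result graph + memory)
    }
}
```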
1. The entry point of the computation / where parallelism starts: VertexProgram
Every vertex involved in the graph executes the code of the VertexProgram. An abstract execution machine called a worker runs this code, and workers can run it in parallel, i.e. each of them calls VertexProgram.execute() (the BSP computation model is used here). Vertices communicate with one another through two kinds of messages:
- MessageScope.Local: communication between adjacent vertices
- MessageScope.Global: communication with any vertex in the graph
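A minimal sketch of constructing the two scopes; MessageScope.Local.of(__::outE) also appears in the PageRank code below, while MessageScope.Global.instance() is my assumption for obtaining the global scope:

```java
import org.apache.tinkerpop.gremlin.process.computer.MessageScope;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;

public class MessageScopeSketch {
    // local scope: messages travel along each vertex's outgoing edges, i.e. to adjacent vertices
    static final MessageScope.Local<Double> LOCAL = MessageScope.Local.of(__::outE);
    // global scope: messages may be delivered to any vertex in the graph
    static final MessageScope.Global GLOBAL = MessageScope.Global.instance();
}
```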
After the VertexProgram has finished, its associated MapReduce jobs are executed next; they can be obtained via VertexProgram.getMapReducers().
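For example, a VertexProgram can declare its follow-up jobs by overriding getMapReducers(); the concrete MapReduce used below is only an illustration:

```java
// Hedged sketch: this method sits inside a VertexProgram implementation and tells the
// GraphComputer which MapReduce jobs to run once the BSP phase is done.
// ClusterCountMapReduce is used purely as an illustrative follow-up job.
@Override
public Set<MapReduce> getMapReducers() {
    return Collections.singleton(ClusterCountMapReduce.build().create());
}
```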
The BSP execution proceeds as follows:
1. local computation
2. communicate the computed results
3. block until every worker has finished the current round of computation
4. go back to step 1
Taking the implementation of PageRankVertexProgram as an example, we can see the concrete process:

```java
public class PageRankVertexProgram implements VertexProgram<Double> { //1
public static final String PAGE_RANK = "gremlin.pageRankVertexProgram.pageRank";
private static final String EDGE_COUNT = "gremlin.pageRankVertexProgram.edgeCount";
private static final String PROPERTY = "gremlin.pageRankVertexProgram.property";
private static final String VERTEX_COUNT = "gremlin.pageRankVertexProgram.vertexCount";
private static final String ALPHA = "gremlin.pageRankVertexProgram.alpha";
private static final String EPSILON = "gremlin.pageRankVertexProgram.epsilon";
private static final String MAX_ITERATIONS = "gremlin.pageRankVertexProgram.maxIterations";
private static final String EDGE_TRAVERSAL = "gremlin.pageRankVertexProgram.edgeTraversal";
private static final String INITIAL_RANK_TRAVERSAL = "gremlin.pageRankVertexProgram.initialRankTraversal";
private static final String TELEPORTATION_ENERGY = "gremlin.pageRankVertexProgram.teleportationEnergy";
private static final String CONVERGENCE_ERROR = "gremlin.pageRankVertexProgram.convergenceError";
private MessageScope.Local<Double> incidentMessageScope = MessageScope.Local.of(__::outE); //2
private MessageScope.Local<Double> countMessageScope = MessageScope.Local.of(new MessageScope.Local.ReverseTraversalSupplier(this.incidentMessageScope));
private PureTraversal<Vertex, Edge> edgeTraversal = null;
private PureTraversal<Vertex, ? extends Number> initialRankTraversal = null;
private double alpha = 0.85d;
private double epsilon = 0.00001d;
private int maxIterations = 20;
private String property = PAGE_RANK; //3
private Set<VertexComputeKey> vertexComputeKeys;
private Set<MemoryComputeKey> memoryComputeKeys;
private PageRankVertexProgram() { }
@Override
public void loadState(final Graph graph, final Configuration configuration) { //4
if (configuration.containsKey(INITIAL_RANK_TRAVERSAL))
this.initialRankTraversal = PureTraversal.loadState(configuration, INITIAL_RANK_TRAVERSAL, graph);
if (configuration.containsKey(EDGE_TRAVERSAL)) {
this.edgeTraversal = PureTraversal.loadState(configuration, EDGE_TRAVERSAL, graph);
this.incidentMessageScope = MessageScope.Local.of(() -> this.edgeTraversal.get().clone());
this.countMessageScope = MessageScope.Local.of(new MessageScope.Local.ReverseTraversalSupplier(this.incidentMessageScope));
}
this.alpha = configuration.getDouble(ALPHA, this.alpha);
this.epsilon = configuration.getDouble(EPSILON, this.epsilon);
this.maxIterations = configuration.getInt(MAX_ITERATIONS, 20);
this.property = configuration.getString(PROPERTY, PAGE_RANK);
this.vertexComputeKeys = new HashSet<>(Arrays.asList(
VertexComputeKey.of(this.property, false),
VertexComputeKey.of(EDGE_COUNT, true))); //5
this.memoryComputeKeys = new HashSet<>(Arrays.asList(
MemoryComputeKey.of(TELEPORTATION_ENERGY, Operator.sum, true, true),
MemoryComputeKey.of(VERTEX_COUNT, Operator.sum, true, true),
MemoryComputeKey.of(CONVERGENCE_ERROR, Operator.sum, false, true)));
}
@Override
public void storeState(final Configuration configuration) {
VertexProgram.super.storeState(configuration);
configuration.setProperty(ALPHA, this.alpha);
configuration.setProperty(EPSILON, this.epsilon);
configuration.setProperty(PROPERTY, this.property);
configuration.setProperty(MAX_ITERATIONS, this.maxIterations);
if (null != this.edgeTraversal)
this.edgeTraversal.storeState(configuration, EDGE_TRAVERSAL);
if (null != this.initialRankTraversal)
this.initialRankTraversal.storeState(configuration, INITIAL_RANK_TRAVERSAL);
}
@Override
public void setup(final Memory memory) {
memory.set(TELEPORTATION_ENERGY, null == this.initialRankTraversal ? 1.0d : 0.0d);
memory.set(VERTEX_COUNT, 0.0d);
memory.set(CONVERGENCE_ERROR, 1.0d);
}
@Override
public void execute(final Vertex vertex, Messenger<Double> messenger, final Memory memory) { //7
if (memory.isInitialIteration()) {
messenger.sendMessage(this.countMessageScope, 1.0d); //8
memory.add(VERTEX_COUNT, 1.0d);
} else {
final double vertexCount = memory.<Double>get(VERTEX_COUNT);
final double edgeCount;
double pageRank;
if (1 == memory.getIteration()) {
edgeCount = IteratorUtils.reduce(messenger.receiveMessages(), 0.0d, (a, b) -> a + b);
vertex.property(VertexProperty.Cardinality.single, EDGE_COUNT, edgeCount);
pageRank = null == this.initialRankTraversal ?
0.0d :
TraversalUtil.apply(vertex, this.initialRankTraversal.get()).doubleValue(); //9
} else {
edgeCount = vertex.value(EDGE_COUNT);
pageRank = IteratorUtils.reduce(messenger.receiveMessages(), 0.0d, (a, b) -> a + b); //10
}
//////////////////////////
final double teleporationEnergy = memory.get(TELEPORTATION_ENERGY);
if (teleporationEnergy > 0.0d) {
final double localTerminalEnergy = teleporationEnergy / vertexCount;
pageRank = pageRank + localTerminalEnergy;
memory.add(TELEPORTATION_ENERGY, -localTerminalEnergy);
}
final double previousPageRank = vertex.<Double>property(this.property).orElse(0.0d);
memory.add(CONVERGENCE_ERROR, Math.abs(pageRank - previousPageRank));
vertex.property(VertexProperty.Cardinality.single, this.property, pageRank);
memory.add(TELEPORTATION_ENERGY, (1.0d - this.alpha) * pageRank);
pageRank = this.alpha * pageRank;
if (edgeCount > 0.0d)
messenger.sendMessage(this.incidentMessageScope, pageRank / edgeCount);
else
memory.add(TELEPORTATION_ENERGY, pageRank);
}
}
@Override
public boolean terminate(final Memory memory) { //11
boolean terminate = memory.<Double>get(CONVERGENCE_ERROR) < this.epsilon || memory.getIteration() >= this.maxIterations;
memory.set(CONVERGENCE_ERROR, 0.0d);
return terminate;
}
}
```
- loadState and storeState save and restore the state of the instance/configuration so that the code can be executed on other machines
- execute shows the message passing between vertices
- terminate expresses the termination condition of the job
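Putting these pieces together, a hedged usage sketch (assuming TinkerGraph's toy "modern" graph; the property key is the PAGE_RANK constant defined above):

```java
import org.apache.tinkerpop.gremlin.process.computer.ComputerResult;
import org.apache.tinkerpop.gremlin.process.computer.ranking.pagerank.PageRankVertexProgram;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory;

public class PageRankSketch {
    public static void main(String[] args) throws Exception {
        Graph graph = TinkerFactory.createModern();
        // submit the VertexProgram to the graph's GraphComputer and wait for the result
        ComputerResult result = graph.compute()
                .program(PageRankVertexProgram.build().create(graph))
                .submit().get();
        // the BSP result is distributed over the vertices as the PAGE_RANK property
        result.graph().traversal().V()
                .valueMap("name", PageRankVertexProgram.PAGE_RANK)
                .forEachRemaining(System.out::println);
        // global bookkeeping, e.g. how many iterations were run, sits in Memory
        System.out.println("iterations: " + result.memory().getIteration());
    }
}
```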
2. Extending / converging the computation: MapReduce
Generally, once a BSP computation has finished, its result is distributed across the vertices of the graph, like vertex properties.
The end of a BSP computation does not mean that all results have converged to a single value; it only means that the iteration reached a certain number of steps or that there is nothing left to compute in the current round.
So when we need to answer a global question, the BSP result has to be processed further.
For example:
- after graph clustering finishes, how many vertices are in each cluster
- after graph clustering finishes, how many clusters there are in total (see the sketch below)
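A hedged sketch of answering the second question with TinkerPop's peer-pressure clustering classes (assuming TinkerGraph; the memory key is read from the MapReduce itself rather than hard-coded):

```java
import org.apache.tinkerpop.gremlin.process.computer.ComputerResult;
import org.apache.tinkerpop.gremlin.process.computer.MapReduce;
import org.apache.tinkerpop.gremlin.process.computer.clustering.peerpressure.ClusterCountMapReduce;
import org.apache.tinkerpop.gremlin.process.computer.clustering.peerpressure.PeerPressureVertexProgram;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory;

public class ClusterCountSketch {
    public static void main(String[] args) throws Exception {
        Graph graph = TinkerFactory.createModern();
        // BSP phase: peer-pressure clustering leaves a cluster label on every vertex;
        // MapReduce phase: collapse those labels into one global number of clusters
        MapReduce<?, ?, ?, ?, ?> clusterCount = ClusterCountMapReduce.build().create();
        ComputerResult result = graph.compute()
                .program(PeerPressureVertexProgram.build().create(graph))
                .mapReduce(clusterCount)
                .submit().get();
        Integer clusters = result.memory().get(clusterCount.getMemoryKey());
        System.out.println("clusters: " + clusters);
    }
}
```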
3. Spark's implementation of this computation
The Spark implementation of graph computation here does not depend on GraphX; instead it follows the two computation steps described above. Concretely:
Execution of the VertexProgram:
```java
while (true) {
    if (Thread.interrupted()) {
        sparkContext.cancelAllJobs();
        throw new TraversalInterruptedException();
    }
    // one BSP superstep: run VertexProgram.execute() for every vertex and build the
    // view (messages/state) that the next iteration will consume
    memory.setInExecute(true);
    viewIncomingRDD = SparkExecutor.executeVertexProgramIteration(loadedGraphRDD, viewIncomingRDD, memory, graphComputerConfiguration, vertexProgramConfiguration);
    memory.setInExecute(false);
    // the VertexProgram decides whether to stop (cf. terminate() above)
    if (this.vertexProgram.terminate(memory))
        break;
    else {
        memory.incrIteration();
        memory.broadcastMemory(sparkContext);
    }
}
```
Execution of the MapReduce jobs:
```java
for (final MapReduce mapReduce : this.mapReducers) {
    // execute the map reduce job
    final HadoopConfiguration newApacheConfiguration = new HadoopConfiguration(graphComputerConfiguration);
    mapReduce.storeState(newApacheConfiguration);
    // map
    final JavaPairRDD mapRDD = SparkExecutor.executeMap((JavaPairRDD) mapReduceRDD, mapReduce, newApacheConfiguration);
    // combine
    final JavaPairRDD combineRDD = mapReduce.doStage(MapReduce.Stage.COMBINE) ? SparkExecutor.executeCombine(mapRDD, newApacheConfiguration) : mapRDD;
    // reduce
    final JavaPairRDD reduceRDD = mapReduce.doStage(MapReduce.Stage.REDUCE) ? SparkExecutor.executeReduce(combineRDD, mapReduce, newApacheConfiguration) : combineRDD;
    // write the map reduce output back to disk and the computer result memory
    if (null != outputRDD)
        mapReduce.addResultToMemory(finalMemory, outputRDD.writeMemoryRDD(graphComputerConfiguration, mapReduce.getMemoryKey(), reduceRDD));
}
```