十五 hugegraph 对几个 step的优化

作者: NazgulSun | 来源:发表于2021-07-20 17:09 被阅读0次

十五 hugegraph 对几个 step的优化
十九 Gremlin Step 模型
记一次 HttpClient 连接半释放导致的问题
九 hugegraph+cassandra 底层存储结构
xiaolinBot（Twitter笑话集锦爬虫Bot） Ste
爬取有道翻译
HugeGraph入门
每日感恩日记D177
BP神经网络|手写数字识别
一： Hugegraph 源代码： HugeGraph对象

gremlin-driver，我们可以实现各类自定义的strategy，可以对我们的 gremlin query 语句进行优化；
比如，在baidu-hugegraph 里面，我们可以对多个 has 进行优化；
例如：我们的查询：
g.V().limit(1).out().has('name','xx').has('eid','xx')
如果不做优化，我们可能就是需要两次has查询数据库；

这个是最原始解析：
[HugeGraphStep(vertex,[]), RangeGlobalStep(0,1), HugeVertexStep(OUT,vertex), NoOpBarrierStep(2500), HasStep([name.eq(xx), eid.eq(xx)])]

百度的优化：

    public static void extractHasContainer(HugeGraphStep<?, ?> newStep,
                                           Traversal.Admin<?, ?> traversal) {
        Step<?, ?> step = newStep;
        do {
            step = step.getNextStep();
            if (step instanceof HasStep) {
                HasContainerHolder holder = (HasContainerHolder) step;
                for (HasContainer has : holder.getHasContainers()) {
                    if (!GraphStep.processHasContainerIds(newStep, has)) {
                        newStep.addHasContainer(has);
                    }
                }
                TraversalHelper.copyLabels(step, step.getPreviousStep(), false);
                traversal.removeStep(step);
            }
        } while (step instanceof HasStep || step instanceof NoOpBarrierStep);
    }

可以把HasStep 给去掉，然后合并到一个 VertexStep 中；
[HugeGraphStep(vertex,[]), RangeGlobalStep(0,1), HugeVertexStep(OUT,Vertex,[name.eq(xx), eid.eq(xx)]), NoOpBarrierStep(2500)]

那么在 HugeVertexStep的查询中，就可以以一个复合索引方式来查询了；对性能有一定的提升；

我们再看看，hugegraph 里面对其他两个步骤的重构；

        for (GraphStep originStep : steps) {
            HugeGraphStep<?, ?> newStep = new HugeGraphStep<>(originStep);
            TraversalHelper.replaceStep(originStep, newStep, traversal);

            TraversalUtil.extractHasContainer(newStep, traversal);

            // TODO: support order-by optimize
            // TraversalUtil.extractOrder(newStep, traversal);

            TraversalUtil.extractRange(newStep, traversal, false);

            TraversalUtil.extractCount(newStep, traversal);
        }

rangeSetp:

g.V().range(5,10)

编译之后变成了range(0,5),

/*

NOTE: keep the step to limit results after query from DB
due to limit(in DB) may not be implemented accurately.
but the backend driver should ensure offset accurately.
*/

5作为offset，放在了 g.V(),hugegraphGraphSetp 上。
也就是由不同的底层存储，管理offset；然后在内存里面实现 range limit;

这里的优化是正确性保证， offset 一般都是正确实现。

我们再看看，extractCount() 这个功能的优化；

    public static void extractCount(Step<?, ?> newStep,
                                    Traversal.Admin<?, ?> traversal) {
        Step<?, ?> step = newStep;
        do {
            step = step.getNextStep();
            if (step instanceof CountGlobalStep) {
                QueryHolder holder = (QueryHolder) newStep;
                holder.setCount();
            }
        } while (step instanceof CountGlobalStep ||
                 step instanceof FilterStep ||
                 step instanceof IdentityStep ||
                 step instanceof NoOpBarrierStep);
    }

主要是把相应的信息回写到 QueryHolder里面；注意每个 VertexStep/GraphStep 在Hugegraph里面都实现了 queryHolder接口；

最后看一下， tinkpop作者提供的一个例子： Then check this cool strategy out: g.V().count() is O(1) time for TinkerGraph:

https://github.com/apache/tinkerpop/blob/master/tinkergraph-gremlin/src/main/java/org/apache/tinkerpop/gremlin/tinkergraph/process/traversal/strategy/optimization/TinkerGraphCountStrategy.java

对于优化的地方，需要结合后端实现的backend，如果百度只专注于做一个存储后端，那么他的性能应该可以更好，这也正是我们优化的地方；

进一步的优化

count的优化，根据不同的backend使用不同的 countStep，有些数据库是支持直接查询count的
out().out()，这样的step的优化；可以使用 KoutStep来做；在查询数据库的时候可以使用batch-get，并发去查询；
我们看到 hugegraph 在提供的算法包里面，包含了kout，但是依旧是使用的是单步的邻居节点查询；应该也是考虑到了通用性的考虑；
其他耗时的step我们都可以做定制化的step 优化；但是会牺牲通用性；我觉得nebula 之所有快，应该是做了很多优化；针对特型的存储，使用并行+ 下推等方式；