Join Queries in Elasticsearch: An Introduction to the Nested Type and Quer…

Author: Ombres | Published 2019-06-20 21:01

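    A first look at nested

    Before looking at nested, it helps to understand two basic concepts, index-time joins and query-time joins, along with two related es types: object and join.

    Joins in Lucene

    1. Query-time join

    Overview

    The most direct approach resembles how a traditional relational database handles multi-table joins: establish a foreign-key relationship between the two "tables" and resolve the join through it. Below is a utility method that Lucene exposes for this. It is not the only one; there are others that I will not list here.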

    public final class JoinUtil {
    /**
       * A query time join using global ordinals over a dedicated join field.
       *
       * This join has certain restrictions and requirements:
       * 1) A document can only refer to one other document. (but can be referred by one or more documents)
       * 2) Documents on each side of the join must be distinguishable. Typically this can be done by adding an extra field
       *    that identifies the "from" and "to" side and then the fromQuery and toQuery must take the this into account.
       * 3) There must be a single sorted doc values join field used by both the "from" and "to" documents. This join field
       *    should store the join values as UTF-8 strings.
       * 4) An ordinal map must be provided that is created on top of the join field.
       *
       * Note: min and max filtering and the avg score mode will require this join to keep track of the number of times
       * a document matches per join value. This will increase the per join cost in terms of execution time and memory.
       *
       * @param joinField   The {@link SortedDocValues} field containing the join values
       * @param fromQuery   The query containing the actual user query. Also the fromQuery can only match "from" documents.
       * @param toQuery     The query identifying all documents on the "to" side.
       * @param searcher    The index searcher used to execute the from query
       * @param scoreMode   Instructs how scores from the fromQuery are mapped to the returned query
       * @param ordinalMap  The ordinal map constructed over the joinField. In case of a single segment index, no ordinal map
       *                    needs to be provided.
       * @param min         Optionally the minimum number of "from" documents that are required to match for a "to" document
       *                    to be a match. The min is inclusive. Setting min to 0 and max to <code>Interger.MAX_VALUE</code>
       *                    disables the min and max "from" documents filtering
       * @param max         Optionally the maximum number of "from" documents that are allowed to match for a "to" document
       *                    to be a match. The max is inclusive. Setting min to 0 and max to <code>Interger.MAX_VALUE</code>
       *                    disables the min and max "from" documents filtering
       * @return a {@link Query} instance that can be used to join documents based on the join field
       * @throws IOException If I/O related errors occur
       */
      public static Query createJoinQuery(String joinField,
                                          Query fromQuery,
                                          Query toQuery,
                                          IndexSearcher searcher,
                                          ScoreMode scoreMode,
                                          OrdinalMap ordinalMap,
                                          int min,
                                          int max) throws IOException
    }
    
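    Query process
    1. Run fromQuery against the child documents first and gather the matches in a collector.
    2. Then use what the collector gathered as the condition for finding the toQuery results.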
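
    The two passes above can be sketched with plain Java collections. This is a toy model, not Lucene's JoinUtil implementation: the Doc class, the "side" marker, and the substring match standing in for fromQuery are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of a query-time join; it does not use or reproduce Lucene's JoinUtil.
final class QueryTimeJoinSketch {

    // Hypothetical document: a side marker ("from" or "to"), a join value, and some text.
    static final class Doc {
        final int id;
        final String side;
        final String joinValue;
        final String text;

        Doc(int id, String side, String joinValue, String text) {
            this.id = id;
            this.side = side;
            this.joinValue = joinValue;
            this.text = text;
        }
    }

    // Pass 1: run the "from" query and collect the join values of the matching docs.
    static Set<String> collectJoinValues(List<Doc> index, String term) {
        Set<String> values = new HashSet<>();
        for (Doc d : index) {
            if (d.side.equals("from") && d.text.contains(term)) {
                values.add(d.joinValue);
            }
        }
        return values;
    }

    // Pass 2: keep the "to" docs whose join value was collected in pass 1.
    static List<Integer> join(List<Doc> index, String term) {
        Set<String> values = collectJoinValues(index, term);
        List<Integer> hits = new ArrayList<>();
        for (Doc d : index) {
            if (d.side.equals("to") && values.contains(d.joinValue)) {
                hits.add(d.id);
            }
        }
        return hits;
    }

    static List<Doc> sampleIndex() {
        return List.of(
            new Doc(0, "to", "A", "post about lucene"),
            new Doc(1, "to", "B", "post about kafka"),
            new Doc(2, "from", "A", "great article"),
            new Doc(3, "from", "B", "not convinced"),
            new Doc(4, "from", "A", "great examples"));
    }

    public static void main(String[] args) {
        // Docs 2 and 4 match "great" and both point at join value A, so only doc 0 is returned.
        System.out.println(join(sampleIndex(), "great"));
    }
}
```

    The index is scanned twice, once per side, which is where the extra cost of this kind of join comes from.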
    Summary

    Because a query-time join runs two query passes, performance suffers.

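    2. Index-time join

    Overview

    An index-time join, as the name suggests, links the two kinds of documents while the index is being built.

    How the join works

    Documents in Lucene are stored in order, so the most convenient approach is to write related documents in a fixed order; the related documents can then be found quickly. The current practice is to index the child documents first and the parent document after them, which gives the following layout in the index.
    Take three docs A, B, and C, with child documents 1, 2, ...
    A - 1,2,3
    B - 4
    C - 5,6
    In the index: 1,2,3,A,4,B,5,6,C
    To find the parent of child document 1, locate 1 and scan forward; the first document that is not a child document is its parent.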
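
    A minimal sketch of this lookup, assuming the layout above and using java.util.BitSet to mark which positions hold parent documents. The class and method names are made up for illustration; Lucene uses its own bit set types for the same purpose.

```java
import java.util.BitSet;

// Toy model of the block layout 1,2,3,A,4,B,5,6,C; not Lucene code.
final class BlockLayoutSketch {

    // Documents in index order; children always precede their parent.
    static final String[] DOCS = {"1", "2", "3", "A", "4", "B", "5", "6", "C"};

    // Bit i is set when the document at position i is a parent.
    static BitSet parentBits() {
        BitSet parents = new BitSet(DOCS.length);
        parents.set(3); // A
        parents.set(5); // B
        parents.set(8); // C
        return parents;
    }

    // The parent of a child is the first parent position at or after the child.
    static int parentOf(BitSet parents, int childDocId) {
        return parents.nextSetBit(childDocId);
    }

    public static void main(String[] args) {
        BitSet parents = parentBits();
        for (int doc = 0; doc < DOCS.length; doc++) {
            if (!parents.get(doc)) {
                System.out.println(DOCS[doc] + " -> " + DOCS[parentOf(parents, doc)]);
            }
        }
    }
}
```

    Because the parent positions are kept in a bit set, the forward scan is a single nextSetBit call and never has to inspect the documents in between.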

    Usage in Lucene
    public class ToParentBlockJoinQuery extends Query {
    /** Create a ToParentBlockJoinQuery.
       *
       * @param childQuery Query matching child documents.
       * @param parentsFilter Filter identifying the parent documents.
       * @param scoreMode How to aggregate multiple child scores
       * into a single parent score.
       **/
      public ToParentBlockJoinQuery(Query childQuery, BitSetProducer parentsFilter, ScoreMode scoreMode)
    }
    

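    The Query above is the class Lucene uses to query index-time joins. Its first argument, childQuery, is the query condition on the child documents.

    Summary

    This approach is somewhat faster than a query-time join, by roughly 30%. The current advice is to pick whichever of the two join styles fits the situation.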

    object

    In Elasticsearch, an object field is actually handled as multiple flat columns. For example:

    {
      "group" : "fans",
      "user" : [
        {
          "first" : "John",
          "last" :  "Smith"
        },
        {
          "first" : "Alice",
          "last" :  "White"
        }
      ]
    }
    

    If user is mapped as the object type, this JSON is indexed exactly as the following JSON would be:

    {
      "group" : "fans",
      "user.first" : ["John","Alice"],
      "user.last" : ["Smith","White"]
    }
    

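    Clearly, a search that requires both first name "Alice" and last name "Smith" also returns this document, even though no single user has that combination.
    So object cannot be used to index documents that have a parent-child relationship.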
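
    The false match can be reproduced with a small sketch. It assumes a simplified model in which each flattened field is just a list of values and a conjunction checks each field independently; flatten and matches are hypothetical helpers, not Elasticsearch APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of how an object array is flattened; not Elasticsearch code.
final class ObjectFlatteningSketch {

    // Flatten an array of objects into one multi-valued field per key,
    // e.g. "user.first" -> [John, Alice].
    static Map<String, List<String>> flatten(String prefix, List<Map<String, String>> objects) {
        Map<String, List<String>> fields = new HashMap<>();
        for (Map<String, String> obj : objects) {
            for (Map.Entry<String, String> e : obj.entrySet()) {
                fields.computeIfAbsent(prefix + "." + e.getKey(), k -> new ArrayList<>())
                      .add(e.getValue());
            }
        }
        return fields;
    }

    // A conjunction over flattened fields only checks that each value occurs
    // somewhere in its own field; the pairing between first and last is lost.
    static boolean matches(Map<String, List<String>> fields, String first, String last) {
        return fields.getOrDefault("user.first", List.of()).contains(first)
            && fields.getOrDefault("user.last", List.of()).contains(last);
    }

    static List<Map<String, String>> sampleUsers() {
        return List.of(
            Map.of("first", "John", "last", "Smith"),
            Map.of("first", "Alice", "last", "White"));
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = flatten("user", sampleUsers());
        // No user is named Alice Smith, yet the flattened document matches.
        System.out.println(matches(fields, "Alice", "Smith"));
    }
}
```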

    nested

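    nested was created to solve the problem above. In the index, a nested field actually produces one parent document plus several child documents (in the example above, the number depends on the size of the user array).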
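
    As a rough illustration, the sketch below expands the example document into two child documents followed by one parent document and evaluates the conditions per child. The Doc class and the expand and nestedMatch helpers are assumptions made for this example; they only mimic the idea and are not how Elasticsearch builds its Lucene documents internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of nested expansion; not Elasticsearch code.
final class NestedExpansionSketch {

    // Hypothetical in-memory stand-in for an indexed document.
    static final class Doc {
        final boolean isParent;
        final Map<String, String> fields;

        Doc(boolean isParent, Map<String, String> fields) {
            this.isParent = isParent;
            this.fields = fields;
        }
    }

    // One child doc per user, written first, then the parent doc that closes the block.
    static List<Doc> expand(String group, List<Map<String, String>> users) {
        List<Doc> block = new ArrayList<>();
        for (Map<String, String> u : users) {
            block.add(new Doc(false,
                Map.of("user.first", u.get("first"), "user.last", u.get("last"))));
        }
        block.add(new Doc(true, Map.of("group", group)));
        return block;
    }

    // Both conditions must hold on the SAME child doc for the parent to match.
    static boolean nestedMatch(List<Doc> block, String first, String last) {
        for (Doc d : block) {
            if (!d.isParent
                    && first.equals(d.fields.get("user.first"))
                    && last.equals(d.fields.get("user.last"))) {
                return true;
            }
        }
        return false;
    }

    static List<Map<String, String>> sampleUsers() {
        return List.of(
            Map.of("first", "John", "last", "Smith"),
            Map.of("first", "Alice", "last", "White"));
    }

    public static void main(String[] args) {
        List<Doc> block = expand("fans", sampleUsers());
        System.out.println(block.size());                         // two children plus one parent
        System.out.println(nestedMatch(block, "Alice", "Smith")); // no single child has both values
        System.out.println(nestedMatch(block, "Alice", "White"));
    }
}
```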

    How nested search works

    Query

    The main query logic lives in NestedQueryBuilder.java. It builds an ESToParentBlockJoinQuery object, which actually wraps a ToParentBlockJoinQuery. ToParentBlockJoinQuery is a Lucene query used mainly for querying index-time joins.

    protected Query doToQuery(QueryShardContext context) throws IOException {
            ObjectMapper nestedObjectMapper = context.getObjectMapper(path);
            if (nestedObjectMapper == null) {
                if (ignoreUnmapped) {
                    return new MatchNoDocsQuery();
                } else {
                    throw new IllegalStateException("[" + NAME + "] failed to find nested object under path [" + path + "]");
                }
            }
            if (!nestedObjectMapper.nested().isNested()) {
                throw new IllegalStateException("[" + NAME + "] nested object under path [" + path + "] is not of nested type");
            }
            final BitSetProducer parentFilter;
            Query innerQuery;
            ObjectMapper objectMapper = context.nestedScope().getObjectMapper();
            if (objectMapper == null) {
                parentFilter = context.bitsetFilter(Queries.newNonNestedFilter(context.indexVersionCreated()));
            } else {
                parentFilter = context.bitsetFilter(objectMapper.nestedTypeFilter());
            }
    
            try {
                context.nestedScope().nextLevel(nestedObjectMapper);
                innerQuery = this.query.toQuery(context);
            } finally {
                context.nestedScope().previousLevel();
            }
    
            // ToParentBlockJoinQuery requires that the inner query only matches documents
            // in its child space
            if (new NestedHelper(context.getMapperService()).mightMatchNonNestedDocs(innerQuery, path)) {
                innerQuery = Queries.filtered(innerQuery, nestedObjectMapper.nestedTypeFilter());
            }
    
            return new ESToParentBlockJoinQuery(innerQuery, parentFilter, scoreMode,
                    objectMapper == null ? null : objectMapper.fullPath());
        }
    

    Aggregation

    For aggregations, nested is handled by an aggregator, NestedAggregator, implemented mainly in NestedAggregator.java:

     public LeafBucketCollector getLeafCollector(final LeafReaderContext ctx, final LeafBucketCollector sub) throws IOException {
            IndexReaderContext topLevelContext = ReaderUtil.getTopLevelContext(ctx);
            IndexSearcher searcher = new IndexSearcher(topLevelContext);
            searcher.setQueryCache(null);
            Weight weight = searcher.createWeight(searcher.rewrite(childFilter), ScoreMode.COMPLETE_NO_SCORES, 1f);
            Scorer childDocsScorer = weight.scorer(ctx);
    
            final BitSet parentDocs = parentFilter.getBitSet(ctx);
            final DocIdSetIterator childDocs = childDocsScorer != null ? childDocsScorer.iterator() : null;
            if (collectsFromSingleBucket) {
                return new LeafBucketCollectorBase(sub, null) {
                    @Override
                    public void collect(int parentDoc, long bucket) throws IOException {
                        // if parentDoc is 0 then this means that this parent doesn't have child docs (b/c these appear always before the parent
                        // doc), so we can skip:
                        if (parentDoc == 0 || parentDocs == null || childDocs == null) {
                            return;
                        }
    
                        final int prevParentDoc = parentDocs.prevSetBit(parentDoc - 1);
                        int childDocId = childDocs.docID();
                        if (childDocId <= prevParentDoc) {
                            childDocId = childDocs.advance(prevParentDoc + 1);
                        }
    
                        for (; childDocId < parentDoc; childDocId = childDocs.nextDoc()) {
                            collectBucket(sub, childDocId, bucket);
                        }
                    }
                };
            } else {
                return bufferingNestedLeafBucketCollector = new BufferingNestedLeafBucketCollector(sub, parentDocs, childDocs);
            }
        }
    

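    It consists of the following steps:

    1. Get the set of parent docids first.
    2. Get the docid iterator over the child documents.
    3. Check whether each child document qualifies, and pass the qualifying ones to collectBucket.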
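
    The steps above can be sketched with java.util.BitSet standing in for both the parent bit set and the child iterator. This mirrors only the control flow of the single-bucket branch of the quoted method; childrenOf and bits are hypothetical helpers, and returning a list takes the place of calling collectBucket.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model of collecting the matching children of one parent; not Elasticsearch code.
final class NestedCollectSketch {

    // Small helper that builds a BitSet from a list of docids.
    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    // Return the docids in childDocs that lie between the previous parent and parentDoc.
    static List<Integer> childrenOf(BitSet parentDocs, BitSet childDocs, int parentDoc) {
        List<Integer> collected = new ArrayList<>();
        if (parentDoc == 0) {
            // Children are always indexed before their parent, so doc 0 cannot have any.
            return collected;
        }
        // Position of the previous parent, or -1 when this is the first block.
        int prevParentDoc = parentDocs.previousSetBit(parentDoc - 1);
        for (int child = childDocs.nextSetBit(prevParentDoc + 1);
             child >= 0 && child < parentDoc;
             child = childDocs.nextSetBit(child + 1)) {
            collected.add(child); // stands in for collectBucket(sub, childDocId, bucket)
        }
        return collected;
    }

    public static void main(String[] args) {
        // Layout 1,2,3,A,4,B,5,6,C: parents at positions 3, 5, 8.
        BitSet parents = bits(3, 5, 8);
        BitSet children = bits(0, 1, 2, 4, 6, 7);
        System.out.println(childrenOf(parents, children, 8)); // children of C
    }
}
```

    Passing a sparser child set, for example bits(1, 6), models a child filter that only some nested documents satisfy.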

    join

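    join in es is actually an implementation of the query-time join: one field serves as the join key, and the join query runs against it. The concrete query logic is shown below; underneath, it still relies on JoinUtil.createJoinQuery.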

    public class HasChildQueryBuilder extends AbstractQueryBuilder<HasChildQueryBuilder> {
        public Query rewrite(IndexReader reader) throws IOException {
                Query rewritten = super.rewrite(reader);
                if (rewritten != this) {
                    return rewritten;
                }
                if (reader instanceof DirectoryReader) {
                    IndexSearcher indexSearcher = new IndexSearcher(reader);
                    indexSearcher.setQueryCache(null);
                    indexSearcher.setSimilarity(similarity);
                    IndexOrdinalsFieldData indexParentChildFieldData = fieldDataJoin.loadGlobal((DirectoryReader) reader);
                    OrdinalMap ordinalMap = indexParentChildFieldData.getOrdinalMap();
                    return JoinUtil.createJoinQuery(joinField, innerQuery, toQuery, indexSearcher, scoreMode,
                        ordinalMap, minChildren, maxChildren);
                } else {
                    if (reader.leaves().isEmpty() && reader.numDocs() == 0) {
                        // asserting reader passes down a MultiReader during rewrite which makes this
                        // blow up since for this query to work we have to have a DirectoryReader otherwise
                        // we can't load global ordinals - for this to work we simply check if the reader has no leaves
                        // and rewrite to match nothing
                        return new MatchNoDocsQuery();
                    }
                    throw new IllegalStateException("can't load global ordinals for reader of type: " +
                        reader.getClass() + " must be a DirectoryReader");
                }
            }
    
    }
    
