
Why you should avoid writing sparse data into ES

Author: 莱克蒙 | Published 2017-11-11 14:25

Reposting a few documentation links that explain why, on current versions (<= 5.x), you should avoid writing sparse data into ES. As the ES/Lucene encodings improve, this problem should ease in future releases; in particular, ES 6.0 / Lucene 7.0 improves how doc_values encode sparse data.

general-recommendations.html#sparsity

Avoid sparsity

The data structures behind Lucene, which Elasticsearch relies on in order to index and store data, work best with dense data, i.e. when all documents have the same fields. This is especially true for fields that have norms enabled (which is the case for text fields by default) or doc values enabled (which is the case for numerics, date, ip and keyword by default).
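
For illustration (this example is not part of the quoted documentation): when a field is known to be sparse and is not needed for scoring, sorting or aggregations, norms and doc values can be switched off per field in the mapping. A minimal sketch using the elasticsearch-py client, assuming a reachable 5.x cluster on localhost:9200; the index and field names are made up:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # assumed local 5.x cluster

# "title" keeps the defaults: text fields carry norms, keyword/numeric fields
# carry doc values. The "rare_*" fields are hypothetical sparse fields that are
# never scored, sorted on or aggregated, so their per-document structures are
# disabled to avoid paying the sparsity cost described above.
es.indices.create(index="articles", body={
    "mappings": {
        "doc": {
            "properties": {
                "title":      {"type": "text"},
                "rare_note":  {"type": "text",    "norms":      False},
                "rare_tag":   {"type": "keyword", "doc_values": False},
                "rare_count": {"type": "integer", "doc_values": False},
            }
        }
    }
})
```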

The reason is that Lucene internally identifies documents with so-called doc ids, which are integers between 0 and the total number of documents in the index. These doc ids are used for communication between the internal APIs of Lucene: for instance searching on a term with a match query produces an iterator of doc ids, and these doc ids are then used to retrieve the value of the norm in order to compute a score for these documents. The way this norm lookup is currently implemented is by reserving one byte for each document. The norm value for a given doc id can then be retrieved by reading the byte at index doc_id. While this is very efficient and gives Lucene fast access to the norm values of every document, it has the drawback that documents that do not have a value will also require one byte of storage.
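
To make that layout concrete, here is a toy Python model (not Lucene code) of the one-byte-per-document norms column described above:

```python
# Toy model: one norm byte is reserved per document, and the norm for a
# document is read at index doc_id. Documents without the field still occupy
# a slot, which is exactly the sparsity cost being discussed.

class NormsColumn:
    def __init__(self, max_doc):
        self.norms = bytearray(max_doc)  # one byte per document

    def set_norm(self, doc_id, value):
        self.norms[doc_id] = value

    def get_norm(self, doc_id):
        return self.norms[doc_id]  # O(1) lookup by doc id

# 1,000,000 documents, but only 1,000 of them actually have the field:
column = NormsColumn(max_doc=1_000_000)
for doc_id in range(1_000):
    column.set_norm(doc_id, 120)

print(len(column.norms))  # 1000000 bytes, regardless of how sparse the field is
```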

In practice, this means that if an index has M documents, norms will require M bytes of storage per field, even for fields that only appear in a small fraction of the documents of the index. The situation is slightly more complex with doc values, because doc values can be encoded in several ways depending on the type of field and on the actual data that the field stores, but the problem is very similar. In case you wonder: fielddata, which was used in Elasticsearch pre-2.0 before being replaced with doc values, also suffered from this issue, except that the impact was only on the memory footprint since fielddata was not explicitly materialized on disk.
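
As a back-of-the-envelope illustration of that per-field cost (the numbers below are made up):

```python
# Rough storage estimate for norms under the one-byte-per-document scheme.
num_docs = 50_000_000        # documents in the index (M)
sparse_text_fields = 20      # text fields that appear in only a few documents
bytes_per_doc_per_field = 1  # one norm byte reserved per document per field

norms_bytes = num_docs * sparse_text_fields * bytes_per_doc_per_field
print(f"{norms_bytes / 1024 ** 2:.0f} MB")  # ~954 MB of norms, even if each
                                            # field occurs in <1% of documents
```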

Note that even though the most notable impact of sparsity is on storage requirements, it also has an impact on indexing speed and search speed since these bytes for documents that do not have a field still need to be written at index time and skipped over at search time.

It is totally fine to have a minority of sparse fields in an index. But beware that if sparsity becomes the rule rather than the exception, then the index will not be as efficient as it could be.

This section mostly focused on norms and doc values because those are the two features that are most affected by sparsity. Sparsity also affects the efficiency of the inverted index (used to index text/keyword fields) and dimensional points (used to index geo_point and numerics), but to a lesser extent.

Here are some recommendations that can help avoid sparsity:

index-vs-type

Fields that exist in one type will also consume resources for documents of types where this field does not exist.

This is a general issue with Lucene indices: they don’t like sparsity. Sparse postings lists can’t be compressed efficiently because of high deltas between consecutive matches. And the issue is even worse with doc values: for speed reasons, doc values often reserve a fixed amount of disk space for every document, so that values can be addressed efficiently. This means that if Lucene establishes that it needs one byte to store all values of a given numeric field, it will also consume one byte for documents that don’t have a value for this field. Future versions of Elasticsearch will have improvements in this area but I would still advise you to model your data in a way that will limit sparsity as much as possible.
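
To show what that looks like in practice, here is a sketch (again with elasticsearch-py against an assumed local 5.x cluster; all index, type and field names are made up) of the sparse layout that mixing unrelated types in one index produces, and the denser one-index-per-structure alternative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # assumed local 5.x cluster

# Anti-pattern on <= 5.x: two unrelated types share one index, so every "user"
# document leaves holes in the "product" fields and vice versa -- the sparsity
# described above.
es.indices.create(index="everything", body={
    "mappings": {
        "user":    {"properties": {"name":  {"type": "keyword"},
                                   "age":   {"type": "integer"}}},
        "product": {"properties": {"sku":   {"type": "keyword"},
                                   "price": {"type": "float"}}},
    }
})

# Denser alternative: one index per document structure, so every field is
# populated for (almost) every document in its index.
es.indices.create(index="users", body={
    "mappings": {"user": {"properties": {"name": {"type": "keyword"},
                                         "age":  {"type": "integer"}}}}
})
es.indices.create(index="products", body={
    "mappings": {"product": {"properties": {"sku":   {"type": "keyword"},
                                            "price": {"type": "float"}}}}
})
```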

sparse-versus-dense-document-values-with-apache-lucene
LUCENE-6863
elasticsearch-6-0-0-alpha1-released
