InfluxDB Principles and Details in Depth (Part 2)

Author: 逗逼程序员 | Published: 2020-03-22 22:14


Latest version at the time of writing:

v2.0.0-beta.6

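The core concerns of any database engine come down to the following:

How data is stored in memory
How data is stored in files
How the index is structured and stored
The write path
The read path

This series does not attempt to cover every one of these points exhaustively.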

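Default InfluxDB configuration path on CentOS:

/influxdb-1.6.3-1/etc/influxdb/influxdb.conf

Data directories:

/data holds the actual data files, which end in .tsm

/meta holds database metadata; the meta directory contains a meta.db file

/wal holds the write-ahead log files, which end in .wal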

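InfluxDB's storage moved from LevelDB (an LSM Tree) to BoltDB (an mmap-backed B+ tree), and it now uses its own TSM Tree algorithm, which resembles an LSM Tree but is specifically optimized for how InfluxDB is used.

The TSM Tree is what InfluxDB arrived at by adapting and tuning the LSM Tree to its actual needs.

The TSM storage engine consists of four main parts: cache, wal, tsm file, and compactor.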


Shard

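A shard is not really one of the engine's components, because it is a concept layered on top of the TSM storage engine. InfluxDB creates separate shards according to the time range that a point's timestamp falls into, and every shard has its own cache, wal, tsm files, and compactor. The purpose is to locate the resources relevant to a query quickly by time, which speeds up queries, and to make later bulk deletion simple and efficient.

In an LSM Tree, deleting data means inserting a delete marker (tombstone) for the given key. The data is not removed immediately; it is only physically removed later, when files are compacted and merged. Deleting a large amount of data is therefore very inefficient in an LSM Tree.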
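The tombstone behaviour can be illustrated with a toy model. This is only a sketch under assumed names (TinyLSM, TOMBSTONE, and the dict-based segments are invented for the example); it is not how LevelDB or any real LSM engine is implemented, but it shows why a delete costs extra space until compaction runs.

```python
# Toy LSM model: deletes add a marker, and space is reclaimed only at compaction.
# All names here are hypothetical and exist only for illustration.
TOMBSTONE = object()


class TinyLSM:
    def __init__(self):
        self.memtable = {}   # in-memory writes not yet flushed
        self.segments = []   # flushed immutable tables, oldest first

    def put(self, key, value):
        self.memtable[key] = value

    def delete(self, key):
        # A delete is just another write: it stores a marker for the key.
        self.memtable[key] = TOMBSTONE

    def flush(self):
        if self.memtable:
            self.segments.append(self.memtable)
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then segments newest first.
        for table in [self.memtable] + self.segments[::-1]:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        # Merge all segments; only now do deleted keys physically disappear.
        merged = {}
        for segment in self.segments:
            merged.update(segment)
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.segments = [live] if live else []

    def stored_entries(self):
        return sum(len(segment) for segment in self.segments)
```

In this model, deleting a key after it has been flushed increases the number of stored entries by one, and the count only drops once compact() is called.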

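In InfluxDB, by contrast, a retention policy sets how long data is kept. When the data in a shard is detected to have expired, InfluxDB only needs to release that shard's resources and delete its files, which makes removing expired data very efficient.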
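A minimal sketch of time-based sharding and retention, assuming integer timestamps and fixed-width shards. The names ShardedStore, shard_duration, and enforce_retention are hypothetical and chosen for this example; they are not InfluxDB identifiers.

```python
# Toy model of time-partitioned shards with whole-shard expiry.
# Names and the integer time unit are assumptions made for illustration.
class Shard:
    def __init__(self, start, end):
        self.start = start   # inclusive
        self.end = end       # exclusive
        self.points = []


class ShardedStore:
    def __init__(self, shard_duration, retention):
        self.shard_duration = shard_duration
        self.retention = retention
        self.shards = {}     # shard start time -> Shard

    def write(self, timestamp, value):
        # Route the point to the shard whose time range contains it.
        start = timestamp - timestamp % self.shard_duration
        shard = self.shards.get(start)
        if shard is None:
            shard = Shard(start, start + self.shard_duration)
            self.shards[start] = shard
        shard.points.append((timestamp, value))

    def shards_for_range(self, t_min, t_max):
        # A query only needs the shards that overlap [t_min, t_max].
        return [
            s.start
            for s in sorted(self.shards.values(), key=lambda s: s.start)
            if s.start <= t_max and s.end > t_min
        ]

    def enforce_retention(self, now):
        # Expired data is removed by dropping whole shards, not single keys.
        cutoff = now - self.retention
        expired = [start for start, s in self.shards.items() if s.end <= cutoff]
        for start in expired:
            del self.shards[start]
        return len(expired)
```

Dropping a shard here is a single dictionary deletion regardless of how many points it holds, which mirrors the idea of deleting a shard's files rather than marking individual keys.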

Write Ahead Log (WAL)

The Write Ahead Log (WAL) retains InfluxDB data when the storage engine restarts. The WAL ensures data is durable in case of an unexpected failure.

When the storage engine receives a write request, the following steps occur:

The write request is appended to the end of the WAL file.
Data is written to disk using fsync().
The in-memory cache is updated.
When data is successfully written to disk, a response confirms the write request was successful.

fsync() takes the file and pushes pending writes all the way to the disk. As a system call, fsync() has a kernel context switch that’s computationally expensive, but guarantees that data is safe on disk.
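The write steps above can be sketched roughly as follows. This is a minimal illustration, not InfluxDB's code: the class name TinyWAL, the JSON-lines entry format, and the dictionary cache are assumptions made for the example (the real WAL uses its own binary format). Only os.fsync() corresponds directly to the system call described in the text.

```python
import json
import os


class TinyWAL:
    """Hypothetical append-only log with an in-memory cache, for illustration."""

    def __init__(self, path):
        self.path = path
        self.cache = {}                      # series key -> list of (time, value)
        if os.path.exists(path):
            self._replay()                   # rebuild the cache after a restart
        self.file = open(path, "a", encoding="utf-8")

    def _apply(self, entry):
        points = self.cache.setdefault(entry["key"], [])
        points.append((entry["time"], entry["value"]))

    def _replay(self):
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                self._apply(json.loads(line))

    def write(self, key, time, value):
        entry = {"key": key, "time": time, "value": value}
        self.file.write(json.dumps(entry) + "\n")   # 1. append to the log
        self.file.flush()                           # push Python's buffer to the OS
        os.fsync(self.file.fileno())                # 2. force the OS to write to disk
        self._apply(entry)                          # 3. update the in-memory cache
        return True                                 # 4. acknowledge the write

    def close(self):
        self.file.close()
```

Reopening the same path with a new TinyWAL instance replays the log and reconstructs the cache, which is the recovery behaviour the text describes. The sketch does not handle a partially written final line, which a real implementation would have to detect.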

When the storage engine restarts, the WAL file is read back into the in-memory database. InfluxDB then answers requests to the /read endpoint.

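The wal file holds the same content as the in-memory cache. Its purpose is durability: after a crash, data that has not yet been written to tsm files can be recovered from the wal file.

Because data is appended to the wal file sequentially, writes are very efficient. However, if incoming data is not ordered by time and arrives in arbitrary order, points are routed by timestamp to different shards, and each shard has its own wal file. Writes are then no longer purely sequential, which has some impact on performance. The official community has mentioned a future optimization that would use a single wal file instead of creating one per shard.

When a single wal file reaches a certain size, it is rolled over: a new wal segment file is created to receive subsequent writes.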

Cache

The cache is an in-memory copy of data points currently stored in the WAL. The cache:

Organizes points by key (measurement, tag set, and unique field). Each field is stored in its own time-ordered range.
Stores uncompressed data.
Gets updates from the WAL each time the storage engine restarts. The cache is queried at runtime and merged with the data stored in TSM files.

Queries to the storage engine merge data from the cache with data from the TSM files. Queries execute on a copy of the data that is made from the cache at query processing time. This way writes that come in while a query is running do not affect the result. Deletes sent to the cache clear the specified key or time range for a specified key.
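A rough sketch of that merge, assuming both the cache and the TSM data are plain dictionaries mapping a series key to a list of (time, value) pairs. The function name query_series and the rule that the cache value wins for a duplicate timestamp are assumptions made for this example.

```python
# Hypothetical read path: merge a copy of the cache with TSM data for one key.
def query_series(cache, tsm_points, key, t_min, t_max):
    # Copy the cached points first, so writes that arrive while the query
    # runs do not change the result.
    snapshot = list(cache.get(key, []))

    merged = dict(tsm_points.get(key, []))   # time -> value from TSM files
    merged.update(dict(snapshot))            # assumed: newer cached value wins

    return [(t, v) for t, v in sorted(merged.items()) if t_min <= t <= t_max]
```

For example, with cache = {"k": [(3, 30.0)]} and tsm_points = {"k": [(1, 10.0), (3, 29.0)]}, querying key "k" over the range 1 to 3 returns the TSM point at time 1 and the cached point at time 3.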

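An insert actually writes to both the cache and the wal, so the cache can be thought of as an in-memory copy of the data in the wal files. When InfluxDB starts, it reads through all wal files and rebuilds the cache, so a system failure does not cause data loss.

The cache does not grow without bound. A maxSize parameter controls how much memory the cache may occupy before its data is written to a tsm file; if it is not configured, the default limit is 25MB. Whenever the cache reaches this threshold, a snapshot of the current cache is taken, the current cache is cleared, and a new wal file is created for incoming writes. The remaining wal files are eventually deleted, and the data in the snapshot is sorted and written to a new tsm file.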
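The snapshot-and-flush cycle can be sketched as below. This is a simplified model under stated assumptions: the threshold counts points rather than bytes, the "tsm file" is just a sorted Python list, and the names TinyCache and snapshot_and_flush are invented for the example.

```python
# Toy cache that flushes a sorted snapshot once a size threshold is reached.
# max_size counts points here; the real threshold is a memory size.
class TinyCache:
    def __init__(self, max_size):
        self.max_size = max_size
        self.entries = {}         # series key -> list of (time, value)
        self.size = 0
        self.flushed_files = []   # stand-in for tsm files written so far

    def write(self, key, time, value):
        self.entries.setdefault(key, []).append((time, value))
        self.size += 1
        if self.size >= self.max_size:
            self.snapshot_and_flush()

    def snapshot_and_flush(self):
        # Take the snapshot and start an empty cache in one step.
        snapshot, self.entries, self.size = self.entries, {}, 0
        # Sort by series key, then by time within each key, before writing.
        tsm_file = [(key, sorted(points)) for key, points in sorted(snapshot.items())]
        self.flushed_files.append(tsm_file)
```

New writes after the flush land in the fresh, empty cache, while the snapshot is what gets persisted. In a real engine the old wal segments could be removed only after the snapshot has been written successfully.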

TSM (Time-Structured Merge Tree)

To efficiently compact and store data, the storage engine groups field values by series key, and then orders those field values by time. (A series key is defined by measurement, tag key and value, and field key.)

The storage engine uses a Time-Structured Merge Tree (TSM) data format. TSM files store compressed series data in a columnar format. To improve efficiency, the storage engine only stores differences (or deltas) between values in a series. Column-oriented storage lets the engine read by series key and omit extraneous data.

After fields are stored safely in TSM files, the WAL is truncated and the cache is cleared. The TSM compaction code is quite complex. However, the high-level goal is quite simple: organize values for a series together into long runs to best optimize compression and scanning queries.

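A single tsm file holds data and is at most 2GB in size.

TSM files use a custom format with many optimizations for query performance and compression; the file structure is described in detail in a later chapter.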

TSI (Time Series Index)

As data cardinality (the number of series) grows, queries read more series keys and become slower. The Time Series Index ensures queries remain fast as data cardinality grows. The TSI stores series keys grouped by measurement, tag, and field. This allows the database to answer two questions well:

What measurements, tags, fields exist? (This happens in meta queries.)
Given a measurement, tags, and fields, what series keys exist?
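Those two questions map naturally onto an inverted index. The sketch below is an in-memory illustration only: the class TinySeriesIndex, its methods, and the series key layout are assumptions made for the example, it indexes tags but not fields, and it says nothing about how TSI is laid out on disk.

```python
from collections import defaultdict


class TinySeriesIndex:
    """Hypothetical inverted index from measurements and tags to series keys."""

    def __init__(self):
        self.by_measurement = defaultdict(set)   # measurement -> series keys
        self.by_tag = defaultdict(set)           # (measurement, tag key, tag value) -> series keys

    def add_series(self, measurement, tags):
        parts = [measurement] + [f"{k}={v}" for k, v in sorted(tags.items())]
        series_key = ",".join(parts)
        self.by_measurement[measurement].add(series_key)
        for tag_key, tag_value in tags.items():
            self.by_tag[(measurement, tag_key, tag_value)].add(series_key)
        return series_key

    def measurement_names(self):
        # Answers "what measurements exist?" without touching any data files.
        return sorted(self.by_measurement)

    def series_for(self, measurement, **tags):
        # Answers "given a measurement and tags, what series keys exist?"
        result = set(self.by_measurement.get(measurement, set()))
        for tag_key, tag_value in tags.items():
            result &= self.by_tag.get((measurement, tag_key, tag_value), set())
        return sorted(result)
```

Lookups are set intersections over the matching tag entries, so their cost depends on the number of matching series rather than on the total number of series stored.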

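InfluxDB uses indexes to speed up queries. There are two main kinds of index.

1) Metadata index

The metadata index of a database is stored in the DatabaseIndex struct. It is initialized when the database starts: index data is loaded from the tsm files under every shard, and the information about all Measurements and Series is extracted and cached in memory.

2) TSM file index

The Index section of each tsm file is given an indirect index in memory, which makes fast lookups possible.

Reading data from tsm files

All data reads in InfluxDB are performed through Iterators.

An Iterator is an abstraction that supports nesting: an Iterator can pull data from lower-level Iterators, process it, and then pass the result up to the Iterator above it.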
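Nested iterators of this kind can be imitated with Python generators. The three stages below (scan, time_filter, window_mean) are hypothetical names for this sketch and do not correspond to InfluxDB's Iterator types; the input is assumed to be ordered by time.

```python
# Each stage pulls from the stage below it and yields results to the stage above.
def scan(points):
    # Lowest level: yields raw (time, value) pairs, e.g. from a tsm file.
    for point in points:
        yield point


def time_filter(source, t_min, t_max):
    # Middle level: keeps only points inside the requested time range.
    for t, v in source:
        if t_min <= t <= t_max:
            yield (t, v)


def window_mean(source, width):
    # Top level: averages values per fixed-width time window.
    bucket_start, total, count = None, 0.0, 0
    for t, v in source:
        start = t - t % width
        if bucket_start is not None and start != bucket_start:
            yield (bucket_start, total / count)
            total, count = 0.0, 0
        bucket_start = start
        total += v
        count += 1
    if count:
        yield (bucket_start, total / count)


points = [(0, 1.0), (5, 3.0), (10, 5.0), (15, 7.0), (25, 9.0)]
pipeline = window_mean(time_filter(scan(points), 0, 20), 10)
print(list(pipeline))   # [(0, 2.0), (10, 6.0)]
```

Because each stage is lazy, no intermediate list is built: a point flows through the whole chain before the next one is read.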

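Summary

To recap the InfluxDB principles covered above:

1. InfluxDB's main job is storing time-series data and presenting it continuously along the time dimension.

2. When the storage engine starts, it loads the .wal files of every shard into the cache, which improves query efficiency and also ensures that no data is lost. This is why each write goes to both the cache and the wal.

3. When the cache reaches maxSize, a snapshot is created and a new wal file is started. The snapshot is written to a tsm file, and the original files are then deleted.

4. To speed up queries, InfluxDB also uses indexes. When the storage engine starts, information about Measurements and Series is cached in memory.

5. Queries fetch data by iterating through the Iterator abstraction.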


Article link: https://www.haomeiwen.com/subject/yfebyhtx.html
