Druid: Tasks in Detail

Author: 李小李的路 | Published: 2020-03-02 23:32

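Based on apache-druid-0.17

Overview

In Druid, tasks do all ingestion-related work.
For batch ingestion, tasks are usually submitted through the Task API. For streaming ingestion, tasks are submitted by a supervisor.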

Task API

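The Overlord process provides an HTTP API to submit tasks, cancel tasks, check task status, and view logs and reports. See the Task API list for details.
Druid SQL includes a sys.tasks table that provides information about currently running tasks. The table is read-only and exposes a limited but useful subset of the information available through the Overlord API.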
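
The Overlord calls described above can be wrapped in a few small helpers. This is a minimal sketch, assuming the Overlord endpoints `POST /druid/indexer/v1/task` (submit) and `GET /druid/indexer/v1/task/<task-id>/status` (status); the function names, host, and port are placeholders chosen for illustration. The helpers only build the request and the URL, so nothing is sent until you call `urllib.request.urlopen` yourself.

```python
import json
import urllib.parse
import urllib.request


def overlord_base(host: str, port: int) -> str:
    """Base path shared by the Overlord task endpoints."""
    return f"http://{host}:{port}/druid/indexer/v1"


def build_submit_request(host: str, port: int, task_spec: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that submits a task spec."""
    body = json.dumps(task_spec).encode("utf-8")
    return urllib.request.Request(
        url=f"{overlord_base(host, port)}/task",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def task_status_url(host: str, port: int, task_id: str) -> str:
    """URL for checking the status of a task; the task ID is percent-encoded."""
    encoded_id = urllib.parse.quote(task_id, safe="")
    return f"{overlord_base(host, port)}/task/{encoded_id}/status"
```

To actually submit, pass the request to `urllib.request.urlopen(req)` and read the JSON response.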

Task reports

A report containing the total number of rows ingested and any parse exceptions that occurred is available for both completed and running tasks.
The reporting feature is supported by the simple native batch task, the Hadoop batch task, and Kafka and Kinesis ingestion tasks.

Completion report

After a task completes, its completion report can be retrieved from:

http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/task/<task-id>/reports

An example completion report is shown below:

{
  "ingestionStatsAndErrors": {
    "taskId": "compact_twitter_2018-09-24T18:24:23.920Z",
    "payload": {
      "ingestionState": "COMPLETED",
      "unparseableEvents": {},
      "rowStats": {
        "determinePartitions": {
          "processed": 0,
          "processedWithError": 0,
          "thrownAway": 0,
          "unparseable": 0
        },
        "buildSegments": {
          "processed": 5390324,
          "processedWithError": 0,
          "thrownAway": 0,
          "unparseable": 0
        }
      },
      "errorMsg": null
    },
    "type": "ingestionStatsAndErrors"
  }
}
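
A completion report like the one above can be reduced to per-counter totals across phases. The following is a small sketch; the function name and the shape of the returned dictionary are illustrative choices, and it assumes the completion-report layout shown above (one entry per phase directly under `rowStats`), not the live-report layout.

```python
ROW_COUNTERS = ("processed", "processedWithError", "thrownAway", "unparseable")


def summarize_completion_report(report: dict) -> dict:
    """Sum the four row counters over all phases of a completion report."""
    payload = report["ingestionStatsAndErrors"]["payload"]
    totals = {name: 0 for name in ROW_COUNTERS}
    for phase_stats in payload["rowStats"].values():
        for name in ROW_COUNTERS:
            totals[name] += phase_stats.get(name, 0)
    return {
        "state": payload["ingestionState"],
        "error": payload["errorMsg"],
        "totals": totals,
    }
```

For the sample report above, `totals["processed"]` would be the sum of the `determinePartitions` and `buildSegments` counts.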

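Live report

While a task is running, a live report can be retrieved. It contains the ingestion state, unparseable events, and moving averages of the number of events processed over 1-minute, 5-minute, and 15-minute windows. It is available at: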

http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/task/<task-id>/reports

and

http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/liveReports

An example is shown below:

{
  "ingestionStatsAndErrors": {
    "taskId": "compact_twitter_2018-09-24T18:24:23.920Z",
    "payload": {
      "ingestionState": "RUNNING",
      "unparseableEvents": {},
      "rowStats": {
        "movingAverages": {
          "buildSegments": {
            "5m": {
              "processed": 3.392158326408501,
              "unparseable": 0,
              "thrownAway": 0,
              "processedWithError": 0
            },
            "15m": {
              "processed": 1.736165476881023,
              "unparseable": 0,
              "thrownAway": 0,
              "processedWithError": 0
            },
            "1m": {
              "processed": 4.206417693750045,
              "unparseable": 0,
              "thrownAway": 0,
              "processedWithError": 0
            }
          }
        },
        "totals": {
          "buildSegments": {
            "processed": 1994,
            "processedWithError": 0,
            "thrownAway": 0,
            "unparseable": 0
          }
        }
      },
      "errorMsg": null
    },
    "type": "ingestionStatsAndErrors"
  }
}
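
The moving averages in a live report arrive keyed by window name, in no particular order. As a rough sketch, the helper below pulls out the `processed` rate per window, sorted from the shortest to the longest window. The function name is hypothetical, and it assumes the live-report layout shown above.

```python
WINDOW_ORDER = ("1m", "5m", "15m")


def processed_rates(live_report: dict, phase: str = "buildSegments") -> list:
    """Return (window, processed moving average) pairs, shortest window first."""
    row_stats = live_report["ingestionStatsAndErrors"]["payload"]["rowStats"]
    windows = row_stats["movingAverages"][phase]
    return [(w, windows[w]["processed"]) for w in WINDOW_ORDER if w in windows]
```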

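The fields are defined as follows:
ingestionStatsAndErrors: provides information about row counts and errors.
ingestionState: the stage of ingestion the task has reached. Possible values:
- NOT_STARTED: The task has not begun reading any rows.
- DETERMINE_PARTITIONS: The task is processing rows to determine partitioning.
- BUILD_SEGMENTS: The task is processing rows to construct segments.
- COMPLETED: The task has finished its work.
Note: only batch tasks have the DETERMINE_PARTITIONS phase. Streaming tasks do not.
unparseableEvents: lists of exception messages caused by unparseable input. These help identify problematic input rows. There is one list each for the DETERMINE_PARTITIONS and BUILD_SEGMENTS phases. Note that the Hadoop batch task does not support saving unparseable events.
rowStats: information about row counts. There is one entry per ingestion phase:
- processed: number of rows successfully ingested without parsing errors.
- processedWithError: number of rows ingested that contained a parsing error in one or more columns. This typically happens when an input row has a parseable structure but an invalid type for a column, such as a non-numeric string passed for a numeric column.
- thrownAway: number of rows skipped. This includes rows with timestamps outside the interval defined by the ingestion task and rows filtered out with transformSpec, but not rows skipped by explicit user configuration. For example, rows skipped by skipHeaderRows or hasHeaderRow in CSV format are not counted.
- unparseable: number of rows that could not be parsed at all and were discarded. This tracks input rows with no parseable structure, such as non-JSON data passed to a JSON parser.
errorMsg: a message describing the error that caused the task to fail. It is null if the task succeeded.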
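
One way to use these counters is to compute the share of rows that had parsing problems. The sketch below is illustrative only: the function name is made up, and it assumes the four counters are disjoint, which is how the definitions above read.

```python
def bad_row_ratio(counters: dict) -> float:
    """Fraction of rows seen that were unparseable or processed with errors."""
    bad = counters["processedWithError"] + counters["unparseable"]
    seen = counters["processed"] + counters["thrownAway"] + bad
    # Avoid dividing by zero when the task has not read any rows yet.
    return bad / seen if seen else 0.0
```

A monitoring script could compare this ratio against a threshold of its own choosing before trusting the ingested data.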

Live report metrics

Row stats

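The non-parallel simple native batch task, the Hadoop batch task, and Kafka and Kinesis ingestion tasks support retrieving row stats while the task is running.
They can be retrieved with a GET request to the following URL: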

http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/rowStats

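The movingAverages section contains 1-minute, 5-minute, and 15-minute moving averages of the increases to the four row counters, which have the same definitions as in the completion report. The totals section shows the current totals.
An example is shown below: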

{
  "movingAverages": {
    "buildSegments": {
      "5m": {
        "processed": 3.392158326408501,
        "unparseable": 0,
        "thrownAway": 0,
        "processedWithError": 0
      },
      "15m": {
        "processed": 1.736165476881023,
        "unparseable": 0,
        "thrownAway": 0,
        "processedWithError": 0
      },
      "1m": {
        "processed": 4.206417693750045,
        "unparseable": 0,
        "thrownAway": 0,
        "processedWithError": 0
      }
    }
  },
  "totals": {
    "buildSegments": {
      "processed": 1994,
      "processedWithError": 0,
      "thrownAway": 0,
      "unparseable": 0
    }
  }
}

For the Kafka Indexing Service, a GET to the following Overlord API will retrieve live row stat reports from each task being managed by the supervisor and provide a combined report.

http://<OVERLORD-HOST>:<OVERLORD-PORT>/druid/indexer/v1/supervisor/<supervisor-id>/stats

Unparseable events

Lists of recently-encountered unparseable events can be retrieved from a running task with a GET to the following Peon API:

http://<middlemanager-host>:<worker-port>/druid/worker/v1/chat/<task-id>/unparseableEvents

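Note that not all task types support this feature. Currently it is supported only by the non-parallel native batch task (type `index`) and by tasks created by the Kafka and Kinesis indexing services.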
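
The per-task endpoints listed in this article (`liveReports`, `rowStats`, `unparseableEvents`) share one URL pattern. The helper below is a sketch that builds those URLs; the function name and the validation step are illustrative, and the host and port remain placeholders for your MiddleManager host and the task's worker port.

```python
from urllib.parse import quote

# Resources named in this article under /druid/worker/v1/chat/<task-id>/
CHAT_RESOURCES = ("liveReports", "rowStats", "unparseableEvents")


def chat_url(host: str, port: int, task_id: str, resource: str) -> str:
    """Build the URL of a running task's chat endpoint."""
    if resource not in CHAT_RESOURCES:
        raise ValueError(f"unknown chat resource: {resource}")
    return f"http://{host}:{port}/druid/worker/v1/chat/{quote(task_id, safe='')}/{resource}"
```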

Task lock system

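This section explains Druid's task locking system. The locking system and the versioning system are tightly coupled to guarantee the correctness of ingested data.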

"Overshadowing" between segments

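In Druid you can run a task to overwrite existing data. The segments created by the overwriting task overshadow the existing segments. Note that the overshadow relation holds only within the same time chunk and the same datasource. Overshadowed segments are not considered in query processing, which filters out stale data.
Each segment has a major version and a minor version. The major version is a timestamp in the format "yyyy-MM-dd'T'hh:mm:ss", while the minor version is an integer. These versions determine the overshadow relation between segments as follows.
A segment s1 overshadows another segment s2 if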

s1 has a higher major version than s2, or
s1 has the same major version and a higher minor version than s2.

Here are some examples.

A segment of the major version of 2019-01-01T00:00:00.000Z and the minor version of 0 overshadows
another of the major version of 2018-01-01T00:00:00.000Z and the minor version of 1.
A segment of the major version of 2019-01-01T00:00:00.000Z and the minor version of 1 overshadows
another of the major version of 2019-01-01T00:00:00.000Z and the minor version of 0.
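
The two rules can be written as a short comparison. This is a sketch only: `SegmentVersion` and `overshadows` are hypothetical names, both segments are assumed to belong to the same datasource and time chunk, and major versions are assumed to be ISO-8601 strings like those in the examples above.

```python
from datetime import datetime
from typing import NamedTuple


class SegmentVersion(NamedTuple):
    major: str  # timestamp such as "2019-01-01T00:00:00.000Z"
    minor: int


def _parse_major(ts: str) -> datetime:
    # %z accepts a trailing "Z" on Python 3.7 and later.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")


def overshadows(s1: SegmentVersion, s2: SegmentVersion) -> bool:
    """True if s1 overshadows s2 (same datasource and time chunk assumed)."""
    m1, m2 = _parse_major(s1.major), _parse_major(s2.major)
    if m1 != m2:
        return m1 > m2          # higher major version wins
    return s1.minor > s2.minor  # same major: higher minor version wins
```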

Locking

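If you run two or more Druid tasks that generate segments for the same datasource and the same time chunk, the generated segments can overshadow each other, which can lead to incorrect query results.
To avoid this problem, a task tries to acquire a lock before it creates any segment. There are two types of locks: the time chunk lock and the segment lock.
When the time chunk lock is used, a task locks the entire time chunk of the datasource in which it will create segments. For example, suppose a task ingests data into the time chunk 2019-01-01T00:00:00.000Z/2019-01-02T00:00:00.000Z of the wikipedia datasource. With time chunk locking, the task locks that entire time chunk of the wikipedia datasource before it creates any segment. As long as it holds the lock, no other task can create segments for the same time chunk of the same datasource. Segments created with the time chunk lock have a higher major version than existing segments. Their minor version is always 0.
When the segment lock is used, a task locks individual segments instead of the entire time chunk. As a result, two or more tasks can create segments for the same time chunk of the same datasource at the same time, provided they are reading different segments. For example, a Kafka indexing task and a compaction task can always write segments into the same time chunk of the same datasource simultaneously, because a Kafka indexing task always appends new segments while a compaction task always overwrites existing segments. Segments created with the segment lock have the same major version and a higher minor version.
- The segment locking is still experimental. It could have unknown bugs which potentially lead to incorrect query results.
To enable segment locking, set forceTimeChunkLock to false in the task context. Once forceTimeChunkLock is unset, the task automatically chooses the appropriate lock type. Note that the segment lock is not always available. The most common case in which the time chunk lock is enforced is when an overwriting task changes the segment granularity. In addition, only native indexing tasks and Kafka/Kinesis indexing tasks support segment locking. Hadoop indexing tasks and index_realtime tasks (used by Tranquility) do not support it yet.
forceTimeChunkLock in the task context applies to an individual task only. To unset it for all tasks, set druid.indexer.tasklock.forceTimeChunkLock to false in the overlord configuration.
If two or more tasks try to acquire locks for overlapping time chunks of the same datasource, the lock requests can conflict with each other. Note that lock conflicts can occur between different lock types.
The behavior on a lock conflict depends on task priority. If all tasks with conflicting lock requests have the same priority, the task that requested first gets the lock. The other tasks wait for it to release the lock.
If a lower-priority task requests a lock later than a higher-priority task, it also waits for the higher-priority task to release the lock. If a higher-priority task requests a lock later than a lower-priority task, it preempts the lower-priority task: the lower-priority task's lock is revoked and the higher-priority task acquires a new lock.
Lock preemption can happen at any time while a task is running, except while it is publishing segments in a critical section. Once publishing is finished, its locks become preemptible again.
Note that locks are shared by tasks with the same groupId. For example, Kafka indexing tasks of the same supervisor have the same groupId and share all locks with each other.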
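
The conflict rules above can be condensed into a small decision function. This is a toy model, not Druid's implementation: `LockHolder`, `resolve_conflict`, and the returned strings are invented for illustration, and the choice to make a higher-priority requester wait while the holder is publishing is an assumption drawn from the statement that preemption cannot happen inside the critical section.

```python
from dataclasses import dataclass


@dataclass
class LockHolder:
    group_id: str
    priority: int
    publishing: bool = False  # True while publishing segments in the critical section


def resolve_conflict(holder: LockHolder, requester_priority: int, requester_group_id: str) -> str:
    """Decide what happens when a later request conflicts with a held lock."""
    if requester_group_id == holder.group_id:
        return "share"    # tasks of the same groupId share locks
    if requester_priority > holder.priority and not holder.publishing:
        return "preempt"  # holder's lock is revoked, requester gets a new lock
    return "wait"         # equal or lower priority, or holder is publishing
```

For instance, a requester with priority 75 preempts a holder with priority 50, while a requester with priority 50 or 25 waits.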

Lock priority

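Each task type has a different default lock priority. The table below shows the default priorities of the different task types. The higher the number, the higher the priority.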

task type	default priority
Realtime index task	75
Batch index task	50
Merge/Append/Compaction task	25
Other tasks	0

You can override the default priority by setting it in the task context:

"context" : {
  "priority" : 100
}
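
The interaction between the default priorities and the context override can be sketched as a lookup. The category keys below are made-up labels for the rows of the table above, not Druid task type names, and `effective_priority` is a hypothetical helper.

```python
from typing import Optional

# Default priorities from the table above, keyed by illustrative category labels.
DEFAULT_PRIORITIES = {
    "realtime_index": 75,
    "batch_index": 50,
    "merge_append_compaction": 25,
}


def effective_priority(category: str, context: Optional[dict] = None) -> int:
    """Return the context priority if set, else the default for the category."""
    context = context or {}
    if "priority" in context:
        return int(context["priority"])
    return DEFAULT_PRIORITIES.get(category, 0)  # other tasks default to 0
```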

Context parameters

The following parameters apply to all task types.

property	default	description
taskLockTimeout	300000	Task lock timeout in milliseconds. For more details, see Locking.
forceTimeChunkLock	true	Setting this to false is still experimental. Forces the task to always use the time chunk lock. If not set, each task automatically chooses a lock type to use. If set, it overrides the `druid.indexer.tasklock.forceTimeChunkLock` configuration for the overlord. See Locking for more details.
priority	Different based on task types. See Priority.	Task priority

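When a task acquires a lock, it sends a request over HTTP and waits until it receives a response containing the lock acquisition result. As a result, an HTTP timeout error can occur if taskLockTimeout is greater than druid.server.http.maxIdleTime of the Overlords.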
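
A simple guard can catch this misconfiguration before a task is submitted. The sketch below is an assumption-laden illustration: `build_lock_context` is a hypothetical helper, and the caller is expected to supply the Overlord's `druid.server.http.maxIdleTime` already converted to milliseconds.

```python
DEFAULT_TASK_LOCK_TIMEOUT_MS = 300_000  # default from the context parameter table


def build_lock_context(http_max_idle_time_ms: int,
                       task_lock_timeout_ms: int = DEFAULT_TASK_LOCK_TIMEOUT_MS) -> dict:
    """Return a task context fragment, rejecting timeouts that exceed the HTTP idle time."""
    if task_lock_timeout_ms > http_max_idle_time_ms:
        raise ValueError(
            "taskLockTimeout exceeds druid.server.http.maxIdleTime; "
            "the lock request could fail with an HTTP timeout"
        )
    return {"taskLockTimeout": task_lock_timeout_ms}
```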

All task types

`index`

See Native batch ingestion (simple task).

`index_parallel`

See Native batch ingestion (parallel task).

`index_sub`

Submitted automatically, on your behalf, by an index_parallel task.

`index_hadoop`

See Hadoop-based ingestion.

`index_kafka`

Submitted automatically, on your behalf, by a
Kafka-based ingestion supervisor.

`index_kinesis`

Submitted automatically, on your behalf, by a
Kinesis-based ingestion supervisor.

`index_realtime`

Submitted automatically, on your behalf, by Tranquility.

`compact`

Compaction tasks merge all segments of the given interval. See the documentation on
compaction for details.

`kill`

Kill tasks delete all metadata about certain segments and remove them from deep storage.
See the documentation on deleting data for details.

`append`

Append tasks append a list of segments together into a single segment (one after the other). The grammar is:

{
    "type": "append",
    "id": <task_id>,
    "dataSource": <task_datasource>,
    "segments": <JSON list of DataSegment objects to append>,
    "aggregations": <optional list of aggregators>,
    "context": <task context>
}

`merge`

Merge tasks merge a list of segments together. Any common timestamps are merged.
If rollup is disabled as part of ingestion, common timestamps are not merged and rows are reordered by their timestamp.

The compact task is often a better choice than the merge task.

The grammar is:

{
    "type": "merge",
    "id": <task_id>,
    "dataSource": <task_datasource>,
    "aggregations": <list of aggregators>,
    "rollup": <whether or not to rollup data during a merge>,
    "segments": <JSON list of DataSegment objects to merge>,
    "context": <task context>
}

`same_interval_merge`

The same interval merge task is a shortcut for the merge task: all segments in the given interval are merged.

The compact task is often a better choice than the same_interval_merge task.

The grammar is:

{
    "type": "same_interval_merge",
    "id": <task_id>,
    "dataSource": <task_datasource>,
    "aggregations": <list of aggregators>,
    "rollup": <whether or not to rollup data during a merge>,
    "interval": <DataSegment objects in this interval are going to be merged>,
    "context": <task context>
}