美文网首页
Dataflow Model笔记

Dataflow Model笔记

作者: chaokunyang | 来源:发表于2018-11-04 22:34 被阅读0次

Dataflow Model笔记原文链接http://timeyang.com/articles/27/2018/10/26/The%20Dataflow%20Model

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in MassiveScale, Unbounded, OutofOrder Data Processing

最近看了Google发表的dataflow论文,其中两点颇有感触:

  1. dataflow提供了一个思考维度:从什么结果被计算、在事件时间的哪里计算、处理时间的什么时候观察到结果、以及早先的结果如何与之后的修正相关。这种思维方式的转变是很重要的。
  2. dataflow分析了lambda架构的思想,即实时计算结果不准确但是低延迟,然后在批处理中修正结果,达到最终正确性。它对其推广,在流处理中使用触发器低延时产生结果,这个结果不一定准确,然后在后续计算中,利用晚到的数据产生新的结果来修正之前的结果,从而实现低延迟和正确性,而不是依赖等待数据完整来实现正确性(这会造成延时)

下面是它的主要内容:

A single unified model

  • 允许在无界乱序数据源上,使用各种correctness、latency、cost的combinations和tradeoff,进行事件时间有序结果、按数据自身特征划分窗口进行计算
  • 分解data pipeline为四个维度,提供清晰性、可组合性和灵活性
    • What results are being computed
    • Where in event time they are being computed.
    • When in processing time they are materialized.
    • How earlier results relate to later renements.
  • 分离数据处理的逻辑与下层的实现,允许基于correctness, latency, and cost选择batch, micro-batch, or streaming engine

Concrete contribution

  • A windowing model which supports unaligned event-time windows, and a simple API for their creation and use.
  • A triggering model that binds the output times of results to runtime characteristics of the pipeline, with a powerful and flexible declarative API for describing desired triggering semantics.
  • An incremental processing model that integrates retractions and updates into the windowing and triggering models described above

Dataflow Model

Core Primitives

  • ParDo
  • GroupByKey

Windowing

  • unaligned windows

  • windowing can be broken apart:

    • AssignWindows
      • Elements are initially assigned to a default global window, covering all of event time, providing semantics that match the defaults in the standard batch model.
      • Since windows are associated directly with the elements to which they belong, this means window assignment can happen anywhere in the pipeline before grouping is applied. This is important, as the grouping operation may be buried somewhere downstream inside a composite transformation
    • MergeWindows
      • All are initially placed in a default global window by the system.
      • Then implementation of AssignWindows puts each element into a single window.
      • DropTimestamps
      • GroupByKey
      • MergeWindows
      • GroupAlsoByWindow
      • ExpandToElements

Triggers & Incremental Processing

  • Triggers are complementary to the windowing model, in that they each affect system behaviour along a different axis of time:

  • Windowing determines where in event time data are grouped together for processing.

  • Triggering determines when in processing time the results of groupings are emitted as panes.

  • Triggers system provides a way to control how multiple panes for the same window relate to each other

  • Discarding

  • Accumulating

  • Accumulating & Retracting

Design Principles

  • Never rely on any notion of completeness.
  • Be flexible, to accommodate the diversity of known use cases, and those to come in the future.
  • Not only make sense, but also add value, in the context of each of the envisioned execution engines.
  • Encourage clarity of implementation.
  • Support robust analysis of data in the context in which they occurred.

相关文章

  • Dataflow Model笔记

    Dataflow Model笔记原文链接: http://timeyang.com/articles/27/201...

  • The Dataflow Model

    现状 现实中的数据:无边界、乱序、超大规模数据集数据消费需求:低延迟,exactly once, 有序(按发生时间...

  • Dataflow Programming Model

    原文地址 抽象层级 Flink提供不同的抽象层级来开发流/批处理应用。 最低层次的抽象只提供有状态的流。它可以通过...

  • 【Data Flow】The Dataflow Model: A

    正文之前 三部曲最后一个系列 正文 IMPLEMENTATION & DESIGN3.实现 & 设计 3.1 Im...

  • 【Data Flow】The Dataflow Model: A

    正文之前 终于翻译完了,可以开始看论文了,开心啊。。。。。。 正文 Event time for a given ...

  • Paper Reading《The Dataflow Model

    What is DataFlow ? 谷歌的Dataflow首先是一个为用户提供以流式或批量模式处理海量数据能力的...

  • Dataflow模型分析

    The Dataflow Model 是 Google Research 于2015年发表的一篇流式处理领域的有指...

  • streaming 102

    将数据处理的概念拆分成dataflow model中的四个问题:what, where,when,how What...

  • NIFI术语

    DataFlow Manager:DataFlow Manager(DFM)是一个NiFi用户,具有添加,移除和修...

  • Dataflow

    拥有数据流的对象: 项目数据流:整个项目最顶级的数据流,只能单向向下连接其他阶层的数据流,不能反向连接,修改方法M...

网友评论

      本文标题:Dataflow Model笔记

      本文链接:https://www.haomeiwen.com/subject/bcadxqtx.html