Kafka

Author: Wilbur_ | Published 2021-08-08 11:31

    Summary

    Kafka is a distributed messaging system for log processing. Logs are generated by all kinds of system activity, such as (1) user activity events: logins, page views, clicks, "likes"; and (2) operational metrics: service call stacks, latency, errors, disk utilization, etc. Log data is important for analytics, both to track user engagement and to monitor system utilization.

    A recent trend is that log data is also used in real time to provide features such as (1) search relevance, (2) recommendations driven by item popularity, (3) ad targeting, and (4) newsfeed features that aggregate user status updates or actions for their friends to read.

    Kafka was designed because this real-time usage of log data is orders of magnitude larger in volume than the "real" data. Early systems for processing this kind of data relied on physically scraping log files off production servers for analysis. In recent years, log aggregators have been built for collecting and loading log data into a data warehouse or Hadoop for offline consumption. LinkedIn, however, needed to support most of its real-time applications with delays of no more than a few seconds, and Kafka was designed for that.

    Kafka Design principles

    Kafka defines a stream of messages of a particular type as a topic. Published messages are stored at a set of servers called brokers.

    A consumer can subscribe to one or more topics from the brokers, and consumes messages by pulling data from the brokers.
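
    As a concrete illustration of the topic/broker/pull model, here is a minimal sketch using the modern Kafka Java client (which postdates the paper); the broker address, topic name, and group id are placeholders:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TopicDemo {
        public static void main(String[] args) {
            // A producer publishes messages of a particular type to a topic.
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("page-views", "user42", "clicked /home"));
            }

            // A consumer subscribes to the topic and pulls data from the brokers.
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");
            c.put("group.id", "analytics"); // consumers in one group divide the topic's partitions
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                consumer.subscribe(List.of("page-views"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1)); // pull, not push
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
                }
            }
        }
    }

    Note that the broker never pushes; the application decides when to poll and therefore consumes at its own rate.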

    What makes Kafka fast?

    Simple Storage

    Kafka uses a logical log to store each partition of a topic. A log is implemented as a set of segment files of approximately the same size (e.g., 1 GB). Every time a producer publishes a message to a partition, the broker simply appends it to the last segment file. The segment file is flushed only after a configurable number of messages have been published or a certain amount of time has elapsed. (A message is only exposed to consumers after it is flushed; I assume this batches writes so we don't initiate a disk transfer for every message while the accumulated data is still small.)
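
    A minimal sketch of the append-and-roll idea (not Kafka's actual code), assuming a fixed segment size and a hypothetical flush-every-N-messages policy:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Sketch of a partition log: append to the last segment file, roll to a new
    // segment when it is full, and flush (expose to consumers) every N messages.
    class PartitionLog {
        static final long SEGMENT_BYTES = 1L << 30;  // ~1 GB per segment
        static final int FLUSH_EVERY = 1000;         // hypothetical flush threshold

        private final Path dir;
        private FileChannel active;
        private long baseOffset = 0;   // logical offset of the active segment's first message
        private long nextOffset = 0;
        private int unflushed = 0;

        PartitionLog(Path dir) throws IOException {
            this.dir = dir;
            roll();
        }

        void append(byte[] message) throws IOException {
            if (active.size() >= SEGMENT_BYTES) roll();   // start a new segment file
            active.write(ByteBuffer.wrap(message));       // simple append, no per-message index
            nextOffset++;
            if (++unflushed >= FLUSH_EVERY) {             // batch writes before hitting disk
                active.force(false);                      // only now is the data visible to consumers
                unflushed = 0;
            }
        }

        private void roll() throws IOException {
            if (active != null) { active.force(false); active.close(); }
            baseOffset = nextOffset;
            // Name the segment file after the logical offset of its first message.
            active = FileChannel.open(dir.resolve(baseOffset + ".log"),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }
    }

    In this sketch, naming each segment after its first message's logical offset is what makes the offset lookup in the next section cheap.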

    Addressing by logical offset

    Kafka uses the logical offset to address each message. This avoids the overhead of maintaining auxiliary, seek-intensive random-access index structures (B-trees?) that map message ids to the actual message locations.

    Each pull request from a consumer contains the offset of the message from which consumption begins and an acceptable number of bytes to fetch. Each broker keeps in memory a sorted list of offsets, the offset of the first message in every segment file, and locates the requested segment by searching this list.
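
    The lookup amounts to a floor search over the segment base offsets; a sketch with an assumed in-memory data structure, not Kafka's code:

    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Sketch: map a consumer's requested logical offset to the segment file
    // containing it, via a sorted list of each segment's base offset.
    class SegmentIndex {
        // base offset of the first message in each segment -> segment file name
        private final NavigableMap<Long, String> segments = new TreeMap<>();

        void addSegment(long baseOffset, String file) {
            segments.put(baseOffset, file);
        }

        // The containing segment has the greatest base offset <= the request.
        String locate(long requestedOffset) {
            var entry = segments.floorEntry(requestedOffset);
            if (entry == null) throw new IllegalArgumentException("offset already deleted");
            return entry.getValue();
        }

        public static void main(String[] args) {
            SegmentIndex idx = new SegmentIndex();
            idx.addSegment(0, "0.log");
            idx.addSegment(57000, "57000.log");
            System.out.println(idx.locate(60123)); // -> 57000.log
        }
    }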

    Rely on underlying file system page cache

    Kafka explicitly avoids caching messages in memory at the Kafka layer and instead relies on the underlying file system page cache. The main benefit is avoiding double buffering. Kafka also incurs very little overhead in garbage collecting its memory, making an efficient implementation in a VM-based language feasible (Java?).

    Since both the producer and the consumer access the segment files sequentially, with the consumer often lagging slightly behind the producer, normal operating system caching heuristics are effectively used (write-through caching and read-ahead).

    Network optimization

    On Linux and other Unix operating systems, there exists a sendfile API that can directly transfer bytes from a file channel to a socket channel. Compared with the typical approach of sending bytes from a local file to a remote socket (read into the page cache, copy into the application buffer, copy into a kernel socket buffer, then send to the NIC), this avoids 2 of the 4 data copies and 1 of the 2 system calls.
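
    In Java, this path is exposed as FileChannel.transferTo, which delegates to sendfile(2) on Linux; a sketch of sending a segment region straight from the page cache to a consumer's socket (the addresses are placeholders):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Sketch: zero-copy transfer of a log segment region to a socket. The bytes
    // never pass through an application buffer.
    class ZeroCopySend {
        static void sendRegion(Path segment, long position, long count,
                               InetSocketAddress consumer) throws IOException {
            try (FileChannel file = FileChannel.open(segment, StandardOpenOption.READ);
                 SocketChannel socket = SocketChannel.open(consumer)) {
                long sent = 0;
                while (sent < count) { // transferTo may send fewer bytes than requested
                    sent += file.transferTo(position + sent, count - sent, socket);
                }
            }
        }
    }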

    Stateless broker

    The broker is stateless in that it does not track how much each consumer has consumed; each consumer maintains its own offset. Kafka therefore uses a simple time-based SLA as its retention policy: a message is automatically deleted once it has been retained longer than a set period, typically 7 days. An important benefit of this design is that a consumer can deliberately rewind to an old offset and re-consume data. This is particularly important for ETL loads into a data warehouse or Hadoop system.
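
    With the modern Java consumer (again, not the client described in the paper), rewinding is an explicit seek; the topic, partition, and offset below are made up for illustration:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    // Sketch: deliberately rewind and re-consume, e.g. after a failed ETL load.
    class RewindDemo {
        public static void main(String[] args) {
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092"); // placeholder
            c.put("group.id", "etl-loader");              // hypothetical group
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                TopicPartition tp = new TopicPartition("page-views", 0);
                consumer.assign(List.of(tp));
                consumer.seek(tp, 1_000_000L);            // jump back to an old offset
                var records = consumer.poll(Duration.ofSeconds(1));
                System.out.println("re-consumed " + records.count() + " messages");
            }
        }
    }

    Because the broker keeps no per-consumer state, this rewind costs the broker nothing; the consumer simply starts pulling from an earlier offset.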

    Distributed

    Kafka has the concept of consumer groups: each group consists of one or more consumers that jointly consume a set of subscribed topics (i.e., they process a topic's partitions in parallel).

    To facilitate coordination, Kafka employs ZooKeeper, a highly available coordination service.

    ZooKeeper is used for the following tasks:

    1. detecting the addition and the removal of brokers and consumers
    2. triggering a rebalance process in each consumer when the above events happen
    3. maintaining the consumption relationship and keeping track of the consumed offset of each partition. Specifically, when each broker or consumer starts up, it stores its information in a broker or consumer registry in ZooKeeper (a sketch of such a registry follows this list).
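
    A sketch of the registry idea using the ZooKeeper Java client: an ephemeral node vanishes automatically when its session dies, which is how additions and removals are detected; the paths here are illustrative, not Kafka's exact layout:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch: register a consumer under an ephemeral znode. If the consumer
    // crashes or leaves, the node disappears, and the watch notifies the rest
    // of the group so each member can re-run the rebalance process.
    class ConsumerRegistry {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});
            // Hypothetical path layout; assumes the parent nodes already exist.
            String path = "/consumers/analytics/ids/consumer-1";
            zk.create(path, "page-views".getBytes(),          // topics this consumer reads
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            // Watch the ids directory: any change means a consumer joined or left.
            zk.getChildren("/consumers/analytics/ids", true);
        }
    }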

    Experiment results

    Producer test: Kafka's producer throughput was orders of magnitude higher than ActiveMQ's, and at least 2 times higher than RabbitMQ's.

    There are a few reasons. First, the Kafka producer does not wait for acknowledgements from the broker and sends messages as fast as the broker can handle them. Figure 4 shows a single Kafka producer nearly saturating the 1 Gb link, which supports this hypothesis.


    [Figure 4: producer performance]

    We note that without acknowledgements from the broker, there is no guarantee that every published message is actually received; this is a deliberate durability tradeoff. For many types of log data, it is desirable to trade durability for throughput, as long as the number of dropped messages is relatively small.
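
    In the modern producer, this tradeoff is an explicit configuration rather than fixed behavior; a sketch (broker address and topic are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Sketch: the modern producer makes the durability/throughput tradeoff explicit.
    class AcksDemo {
        public static void main(String[] args) {
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092"); // placeholder
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("acks", "0");             // fire-and-forget, like the paper's benchmark
            // p.put("acks", "all");        // wait for replication: durable but slower
            p.put("batch.size", "65536");   // larger batches amortize per-message cost
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                for (int i = 0; i < 10_000; i++) {
                    producer.send(new ProducerRecord<>("page-views", "msg-" + i));
                }
            }
        }
    }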

    The second reason is Kafka's more efficient storage format: addressing by logical offset keeps the per-message overhead small, so fewer bytes are written and sent per message.

    Consumer test:

    On average, Kafka consumed 22,000 messages per second, more than 4 times that of ActiveMQ and RabbitMQ.

    Reasons:

    1. Fewer bytes were transferred from the broker in Kafka due to its more efficient storage format.
    2. The brokers in ActiveMQ and RabbitMQ had to maintain the delivery state of every message, whereas the Kafka broker had no disk write activity during consumption (thanks to the stateless broker design and the sendfile API from the Linux system?).

    Conclusion

    Kafka is designed for processing huge volumes of log data streams. It uses a pull-based consumption model that allows an application to consume data at its own rate and to rewind the consumption whenever needed.

    Future work: durability and data availability guarantees, including asynchronous and synchronous replication models.
