Kafka Streams: Getting Started, Part 1

Author: 来福马斯特 | Published 2017-09-22 06:59


Recently at work I have frequently needed to aggregate large volumes of heartbeat data and then derive various kinds of intermediate data sets from those aggregates for later processing. My original approach was to write a number of jobs and coordinate their progress with ZooKeeper. The problem with that model is that it takes a lot of code just to guarantee that data is not consumed twice, and I still felt that some data could be lost whenever the program hit an exception.

So I started looking for a mainstream stream-processing solution, one that also supports batch operations over historical data.

I compared Spark, Storm, and Kafka Streams. I have no hands-on big data experience, but my impression is that the first two are much more mature, while Kafka Streams is newer and has relatively little material available. The first two, however, are full frameworks: with Spark, for example, you generally need to deploy a dedicated Spark cluster (unless your organization already has one you can use), which we do not have, and the hardware requirements for standing one up are high. Kafka Streams, by contrast, is just a library whose only dependency is Kafka, so its hardware footprint is small. I decided to study it first and, if it proves workable, put it into production. Below is the simplest getting-started example, excerpted here.

More to follow in later posts.

    package io.confluent.examples.streams;
    
    import org.apache.kafka.common.serialization.Serde;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KStreamBuilder;
    import org.apache.kafka.streams.kstream.KTable;
    
    import java.util.Arrays;
    import java.util.Properties;
    import java.util.regex.Pattern;
    
    /**
     * Demonstrates, using the high-level KStream DSL, how to implement the WordCount program that
     * computes a simple word occurrence histogram from an input text. This example uses lambda
     * expressions and thus works with Java 8+ only.
     * <p>
     * In this example, the input stream reads from a topic named "TextLinesTopic", where the values of
     * messages represent lines of text; and the histogram output is written to topic
     * "WordsWithCountsTopic", where each record is an updated count of a single word, i.e. {@code word (String) -> currentCount (Long)}.
     * <p>
     * Note: Before running this example you must 1) create the source topic (e.g. via {@code kafka-topics --create ...}),
     * then 2) start this example and 3) write some data to the source topic (e.g. via {@code kafka-console-producer}).
     * Otherwise you won't see any data arriving in the output topic.
     * <p>
     * <br>
     * HOW TO RUN THIS EXAMPLE
     * <p>
     * 1) Start Zookeeper and Kafka. Please refer to <a href='http://docs.confluent.io/current/quickstart.html#quickstart'>QuickStart</a>.
     * <p>
     * 2) Create the input and output topics used by this example.
     * <pre>
     * {@code
     * $ bin/kafka-topics --create --topic TextLinesTopic \
     *                    --zookeeper localhost:2181 --partitions 1 --replication-factor 1
     * $ bin/kafka-topics --create --topic WordsWithCountsTopic \
     *                    --zookeeper localhost:2181 --partitions 1 --replication-factor 1
     * }</pre>
     * Note: The above commands are for the Confluent Platform. For Apache Kafka it should be {@code bin/kafka-topics.sh ...}.
     * <p>
     * 3) Start this example application either in your IDE or on the command line.
     * <p>
     * If via the command line please refer to <a href='https://github.com/confluentinc/examples/tree/master/kafka-streams#packaging-and-running'>Packaging</a>.
     * Once packaged you can then run:
     * <pre>
     * {@code
     * $ java -cp target/streams-examples-3.3.0-standalone.jar io.confluent.examples.streams.WordCountLambdaExample
     * }</pre>
     * 4) Write some input data to the source topic "TextLinesTopic" (e.g. via {@code kafka-console-producer}).
     * The already running example application (step 3) will automatically process this input data and write the
     * results to the output topic "WordsWithCountsTopic".
     * <pre>
     * {@code
     * # Start the console producer. You can then enter input data by writing some line of text, followed by ENTER:
     * #
     * #   hello kafka streams<ENTER>
     * #   all streams lead to kafka<ENTER>
     * #   join kafka summit<ENTER>
     * #
     * # Every line you enter will become the value of a single Kafka message.
     * $ bin/kafka-console-producer --broker-list localhost:9092 --topic TextLinesTopic
     * }</pre>
     * 5) Inspect the resulting data in the output topic, e.g. via {@code kafka-console-consumer}.
     * <pre>
     * {@code
     * $ bin/kafka-console-consumer --topic WordsWithCountsTopic --from-beginning \
     *                              --new-consumer --bootstrap-server localhost:9092 \
     *                              --property print.key=true \
     *                              --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
     * }</pre>
     * You should see output data similar to below. Please note that the exact output
     * sequence will depend on how fast you type the above sentences. If you type them
     * slowly, you are likely to get each count update, e.g., kafka 1, kafka 2, kafka 3.
     * If you type them quickly, you are likely to get fewer count updates, e.g., just kafka 3.
     * This is because the commit interval is set to 10 seconds. Anything typed within
     * that interval will be compacted in memory.
     * <pre>
     * {@code
     * hello    1
     * kafka    1
     * streams  1
     * all      1
     * streams  2
     * lead     1
     * to       1
     * join     1
     * kafka    3
     * summit   1
     * }</pre>
     * 6) Once you're done with your experiments, you can stop this example via {@code Ctrl-C}. If needed,
     * also stop the Kafka broker ({@code Ctrl-C}), and only then stop the ZooKeeper instance ({@code Ctrl-C}).
     */
    public class WordCountLambdaExample {
    
      public static void main(final String[] args) throws Exception {
        final String bootstrapServers = args.length > 0 ? args[0] : "localhost:9092";
        final Properties streamsConfiguration = new Properties();
        // Give the Streams application a unique name.  The name must be unique in the Kafka cluster
        // against which the application is run.
        streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-lambda-example");
        streamsConfiguration.put(StreamsConfig.CLIENT_ID_CONFIG, "wordcount-lambda-example-client");
        // Where to find Kafka broker(s).
        streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        // Specify default (de)serializers for record keys and for record values.
        streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        // Records should be flushed every 10 seconds. This is less than the default
        // in order to keep this example interactive.
        streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10 * 1000);
        // For illustrative purposes we disable record caches
        streamsConfiguration.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
    
        // Set up serializers and deserializers, which we will use for overriding the default serdes
        // specified above.
        final Serde<String> stringSerde = Serdes.String();
        final Serde<Long> longSerde = Serdes.Long();
    
        // In the subsequent lines we define the processing topology of the Streams application.
        final KStreamBuilder builder = new KStreamBuilder();
    
        // Construct a `KStream` from the input topic "TextLinesTopic", where message values
        // represent lines of text (for the sake of this example, we ignore whatever may be stored
        // in the message keys).
        //
        // Note: We could also just call `builder.stream("TextLinesTopic")` if we wanted to leverage
        // the default serdes specified in the Streams configuration above, because these defaults
        // match what's in the actual topic.  However we explicitly set the deserializers in the
        // call to `stream()` below in order to show how that's done, too.
        final KStream<String, String> textLines = builder.stream(stringSerde, stringSerde, "TextLinesTopic");
    
        final Pattern pattern = Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);
    
        final KTable<String, Long> wordCounts = textLines
          // Split each text line, by whitespace, into words.  The text lines are the record
          // values, i.e. we can ignore whatever data is in the record keys and thus invoke
          // `flatMapValues()` instead of the more generic `flatMap()`.
          .flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase())))
          // Count the occurrences of each word (record key).
          //
          // This will change the stream type from `KStream<String, String>` to `KTable<String, Long>`
          // (word -> count).  In the `count` operation we must provide a name for the resulting KTable,
          // which will be used to name e.g. its associated state store and changelog topic.
          //
          // Note: no need to specify explicit serdes because the resulting key and value types match our default serde settings
          .groupBy((key, word) -> word)
          .count("Counts");
    
        // Write the `KTable<String, Long>` to the output topic.
        wordCounts.to(stringSerde, longSerde, "WordsWithCountsTopic");
    
        // Now that we have finished the definition of the processing topology we can actually run
        // it via `start()`.  The Streams application as a whole can be launched just like any
        // normal Java application that has a `main()` method.
        final KafkaStreams streams = new KafkaStreams(builder, streamsConfiguration);
        // Always (and unconditionally) clean local state prior to starting the processing topology.
        // We opt for this unconditional call here because this will make it easier for you to play around with the example
        // when resetting the application for doing a re-run (via the Application Reset Tool,
        // http://docs.confluent.io/current/streams/developer-guide.html#application-reset-tool).
        //
        // The drawback of cleaning up local state prior to startup is that your app must rebuild its local state from scratch, which
        // will take time and will require reading all the state-relevant data from the Kafka cluster over the network.
        // Thus in a production scenario you typically do not want to always clean up as we do here, but rather only when it
        // is truly needed, i.e., only under certain conditions (e.g., the presence of a command line flag for your app).
        // See `ApplicationResetExample.java` for a production-like example.
        streams.cleanUp();
        streams.start();
    
        // Add shutdown hook to respond to SIGTERM and gracefully close Kafka Streams
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
      }
    
    }
    

The logic here, roughly, is to continuously read the lines of text that users type into the Kafka input topic "TextLinesTopic":

     final KStream<String, String> textLines = builder.stream(stringSerde, stringSerde, "TextLinesTopic");
    
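As the comment in the example points out, the explicit serdes here are optional: because the default key and value serdes configured above are both String, the same stream could be created relying on those defaults. A minimal sketch of that variant:

    // Uses the default String serdes from the StreamsConfig instead of passing them explicitly.
    final KStream<String, String> textLines = builder.stream("TextLinesTopic");
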

Each input line is then split into individual words with a regular expression and flat-mapped into a stream of words:

    flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase())))
    
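To make the split step concrete, here is a small standalone sketch (the class name SplitDemo is made up for illustration) showing what the \W+ pattern produces for one of the sample lines from the Javadoc above:

    import java.util.Arrays;
    import java.util.regex.Pattern;
    
    public class SplitDemo {
      public static void main(final String[] args) {
        final Pattern pattern = Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);
        // Splitting on runs of non-word characters turns one text line into its words:
        // prints [all, streams, lead, to, kafka]
        System.out.println(Arrays.asList(pattern.split("all streams lead to kafka".toLowerCase())));
      }
    }
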

Next, the records are grouped by word:

    groupBy((key, word) -> word)
    
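In this API version, groupBy((key, word) -> word) is effectively shorthand for re-keying each record by the word and then grouping by that new key (the re-keying also causes the data to be repartitioned by word). A sketch of the equivalent formulation, assuming the same 0.11-era KStream API used in the example:

    // Equivalent to .groupBy((key, word) -> word): first make the word the record key,
    // then group by key before counting. The key change triggers a repartition.
    final KTable<String, Long> wordCounts = textLines
      .flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase())))
      .selectKey((key, word) -> word)
      .groupByKey()
      .count("Counts");
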

Finally, count materializes the grouped stream into a KTable of word counts, which is also written to the output topic and can later be queried (for example with KSQL). One thing to remember: the Streams application only starts processing once you call

    streams.start();
    
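Besides writing the counts to the output topic, count("Counts") also materializes a local state store named "Counts". Once the application is running, that store can be queried directly from the same JVM via Kafka Streams interactive queries, without KSQL. A minimal sketch (right after start() the store may not be available yet, in which case store() throws InvalidStateStoreException and the call should be retried):

    import org.apache.kafka.streams.state.QueryableStoreTypes;
    import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;
    
    // Query the state store backing the "Counts" KTable while the application is running.
    final ReadOnlyKeyValueStore<String, Long> counts =
        streams.store("Counts", QueryableStoreTypes.keyValueStore());
    System.out.println("current count for 'kafka': " + counts.get("kafka"));
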
