A First Look at Spark Streaming

Author: 枫隐_5f5f | Published: 2019-04-29 17:31

I. Basic concepts

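An important feature of streaming in recent Spark versions is that it builds on Spark DataFrames (through Structured Streaming, covered in part 3 of the code practice below); the classic DStream API described first is built on RDDs.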

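Spark Streaming is a stream processing system that uses RDD batch processing to speed up data handling: it runs in small batches, one per batch interval.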

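Spark Streaming receives an input data stream and splits it into many smaller batches. The Spark engine processes these batches and produces a result set of processed batches.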
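The batching idea above can be sketched in plain Python, without Spark. This is only an illustration of grouping by batch interval, not how Spark implements it; the function name `split_into_batches` and the sample events are hypothetical.

```python
from collections import defaultdict

def split_into_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive batches.

    Batch k covers the time range [k * batch_interval, (k + 1) * batch_interval).
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Return the non-empty batches in time order.
    return [batches[k] for k in sorted(batches)]

# Hypothetical events: (arrival time in seconds, payload)
events = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.5, "d"), (2.9, "e")]
batches = split_into_batches(events, batch_interval=1.0)
print(batches)  # [['a', 'b'], ['c'], ['d', 'e']]
```

With a 1-second interval, each inner list plays the role of one small batch handed to the engine.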

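The main abstraction in Spark Streaming is the discretized stream (DStream), which represents the data stream as a sequence of small batches. DStreams are built on RDDs
and can be integrated with MLlib, Spark SQL, DataFrames, and GraphX.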

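Spark Streaming currently has four widely used application scenarios:

Streaming ETL: data is continuously cleaned and aggregated before it is pushed to downstream analytics systems.
Triggers: behavior or anomalous events are detected in real time so that downstream actions can be triggered promptly.
Data enrichment: real-time data is joined with other datasets for richer analysis.
Complex sessions and continuous learning: multiple sets of events related to a real-time stream are analyzed continuously to update machine learning models, for example the stream of user activity associated with an online game.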
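As a rough illustration of the first three scenarios, the sketch below applies cleaning, enrichment, and a trigger to a single batch in plain Python. The event fields (`user`, `amount`), the `user_profiles` lookup, and the threshold are all hypothetical, chosen only for this example.

```python
def process_batch(raw_events, user_profiles, alert_threshold):
    # Streaming ETL: drop malformed events before anything else sees them.
    cleaned = [e for e in raw_events
               if e.get("user") and e.get("amount") is not None]
    # Data enrichment: join each event with a static lookup table.
    enriched = [{**e, "country": user_profiles.get(e["user"], "unknown")}
                for e in cleaned]
    # Trigger: flag events that should cause a downstream action.
    alerts = [e for e in enriched if e["amount"] > alert_threshold]
    return enriched, alerts

raw_events = [
    {"user": "u1", "amount": 20},
    {"user": None, "amount": 5},      # malformed: no user
    {"user": "u2", "amount": 900},
]
user_profiles = {"u1": "CN"}
enriched, alerts = process_batch(raw_events, user_profiles, alert_threshold=500)
print(enriched)
print(alerts)
```

In a real job the same three steps would run once per batch on the incoming data rather than on an in-memory list.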

II. Streaming workflow

[Figure: Spark Streaming workflow]

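[1] When the Spark Streaming context starts, the driver launches long-running tasks on the executors of the worker nodes.
[2] The receiver inside an executor pulls data from the streaming source and splits it into multiple data blocks, which are kept in memory.
[3] These data blocks are replicated to another executor as a backup.
[4] The IDs of the data blocks are sent to the block manager master on the driver.
[5] At each batch interval configured in the StreamingContext, the driver launches Spark tasks to process the data blocks of that batch; the results are then persisted to a target store such as a NoSQL database.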
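The steps above can be mimicked with a toy simulation in plain Python. Nothing here is Spark's real API: `BlockManagerMaster`, `receive`, and `run_batch` are hypothetical stand-ins that only show how blocks, replicas, and block IDs relate to each other.

```python
import itertools

class BlockManagerMaster:
    """Hypothetical stand-in for the driver-side registry of block locations."""

    def __init__(self):
        self.locations = {}

    def register(self, block_id, executors):
        self.locations[block_id] = list(executors)

def receive(records, block_size, primary, replica, master):
    """Cut records into blocks, note both storage locations, report IDs to the driver."""
    ids = itertools.count()
    blocks = {}
    for start in range(0, len(records), block_size):
        block_id = f"block-{next(ids)}"
        # Step 2: the receiver cuts the incoming data into blocks.
        blocks[block_id] = records[start:start + block_size]
        # Steps 3 and 4: the block lives on two executors, and its ID goes to the driver.
        master.register(block_id, [primary, replica])
    return blocks

def run_batch(blocks, master):
    """Step 5: the driver processes every block registered for this batch."""
    return sum(len(blocks[block_id]) for block_id in master.locations)

master = BlockManagerMaster()
blocks = receive(["a", "b", "c", "d", "e"], block_size=2,
                 primary="executor-1", replica="executor-2", master=master)
print(master.locations)
print(run_batch(blocks, master))  # 5 records processed in this batch
```

The point of the sketch is that the driver only tracks block IDs and locations; the data itself stays on the executors.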

Code practice

1. Word count with DStreams

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]","NewWordCount")
ssc = StreamingContext(sc,1)    # batch interval of 1 second

lines = ssc.socketTextStream("localhost",9999)

words = lines.flatMap(lambda line:line.strip().split(" "))
pairs = words.map(lambda x:(x,1))
wordCounts  = pairs.reduceByKey(lambda x, y: x + y)
wordCounts.pprint()

ssc.start()   # start the streaming computation
ssc.awaitTermination()
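To see what the flatMap, map, and reduceByKey steps compute for one batch, here is a plain-Python sketch that needs no cluster. The helper name `word_count_batch` and the sample lines are hypothetical; the real job does the same work in parallel on each batch RDD.

```python
from collections import Counter

def word_count_batch(lines):
    # flatMap: every line becomes zero or more words
    words = [w for line in lines for w in line.strip().split(" ")]
    # map: every word becomes a (word, 1) pair
    pairs = [(w, 1) for w in words]
    # reduceByKey: add up the 1s of each word
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

print(word_count_batch(["hello spark", "hello streaming"]))
# {'hello': 2, 'spark': 1, 'streaming': 1}
```

To try the streaming job itself, something has to serve text on port 9999 of localhost; a common choice is running `nc -lk 9999` in another terminal and typing lines into it.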

2. Stateful global aggregation: keep per-key state across batches with updateStateByKey/mapWithState

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]","statefulNetworkWordCount")
ssc = StreamingContext(sc,1)

ssc.checkpoint("checkpoint")

def updateFunc(new_values,last_sum):
    # last_sum is None the first time a key appears, so fall back to 0
    return sum(new_values) + (last_sum or 0)


lines = ssc.socketTextStream("localhost",9999)
running_counts = lines.flatMap(lambda line:line.split(" ")).map(lambda x:(x,1)).updateStateByKey(updateFunc)
running_counts.pprint()

ssc.start()
ssc.awaitTermination()

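Note: updateStateByKey was the usual choice up to Spark 1.5.
From Spark 1.6 on, mapWithState is preferred where it is available (the Scala and Java APIs; PySpark does not expose it). Its cost is proportional to the size of the batch rather than to the total number of keys held in state.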
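The semantics of updateStateByKey can be sketched in plain Python. This is a simplified model, assuming a missing previous state arrives as `None`; the names `update_func` and `update_state_by_key` are hypothetical and stand in for Spark's own machinery.

```python
def update_func(new_values, last_sum):
    # last_sum is None the first time a key is seen
    return sum(new_values) + (last_sum or 0)

def update_state_by_key(state, batch_pairs, func):
    """Apply func to every key that is in the old state or in the new batch."""
    grouped = {}
    for key, value in batch_pairs:
        grouped.setdefault(key, []).append(value)
    new_state = {}
    for key in set(state) | set(grouped):
        result = func(grouped.get(key, []), state.get(key))
        if result is not None:  # returning None removes the key from the state
            new_state[key] = result
    return new_state

state = {}
for batch in (["a", "b", "a"], ["b", "c"]):
    state = update_state_by_key(state, [(w, 1) for w in batch], update_func)
    print(sorted(state.items()))
# [('a', 2), ('b', 1)]
# [('a', 2), ('b', 2), ('c', 1)]
```

Note that the function runs for `a` in the second batch even though no new `a` arrived. Touching every stored key on every batch is why the cost of this approach grows with the number of keys.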

3. Spark 2.0 introduced Structured Streaming, which integrates the streaming concept with Dataset/DataFrame

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession.builder \
        .appName("ATestName") \
        .getOrCreate()

lines = spark.readStream \
        .format("socket") \
        .option("host","localhost") \
        .option("port",9999) \
        .load()

words = lines.select(explode(split(lines.value," ")).alias("word"))

wordCounts = words.groupBy("word").count()

query = wordCounts.writeStream \
        .outputMode("complete") \
        .format("console") \
        .start()
query.awaitTermination()
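The "complete" output mode used above writes the whole result table after every trigger, not just the rows that changed. A plain-Python sketch of that behavior, with the hypothetical helper `run_complete_mode` and made-up input lines, might look like this:

```python
from collections import Counter

def run_complete_mode(triggers):
    """Yield the full word-count table after each trigger."""
    counts = Counter()
    for new_lines in triggers:
        for line in new_lines:
            counts.update(line.split(" "))
        # A snapshot of the entire table, including words that did not change.
        yield dict(counts)

tables = list(run_complete_mode([["cat dog"], ["dog owl"]]))
for table in tables:
    print(table)
# {'cat': 1, 'dog': 1}
# {'cat': 1, 'dog': 2, 'owl': 1}
```

The second table still contains `cat`, although no new `cat` arrived in that trigger; that is the difference from a mode that emits only updated rows.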

Article link: https://www.haomeiwen.com/subject/drlunqtx.html
