How to Implement Parquet Storage in Structured Streaming

Author: 祝威廉 | Published: 2018-05-23 15:54

Motivation

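StreamingPro now supports writing Structured Streaming programs as SQL scripts: mlsql-stream. Along the way I ran into a problem. I want to partition the output by day, but the partitioning is unusual: it is based on the time the data is received and written out, not on the time the record was produced.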

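Of course, I could add a time column and solve this with partitionBy dynamic partitioning, but dynamic partitioning makes deleting data awkward. The streaming job writes data continuously, and we need to purge anything older than seven days. With partitionBy, the Parquet sink's metadata all lives in a single directory, and the files in it record which data files each batch was written to. Removing one day's directory by hand therefore leaves the metadata pointing at files that no longer exist, which is what makes deletion inconvenient.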

So the ideal approach would look something like this:

set today="select current_date..." options type=sql;
load kafka9....;

save append table21  
as parquet.`/tmp/abc2/hp_date=${today}` 
options mode="Append"
and duration="10"
and checkpointLocation="/tmp/cpl2";

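The advantage of this approach is that dropping a partition is just a matter of deleting its directory. The disadvantage is that Structured Streaming does not allow the output directory to change: the value is read once and is fixed from then on, so all data ends up in the directory for the day the service was started.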
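To make the "just delete the directory" benefit concrete, here is a minimal sketch of how a cleanup job might pick the directories to remove. It assumes directory names of the form hp_date=yyyy-MM-dd as in the script above; the object name, the function names, and the keepDays parameter are hypothetical, and the actual deletion (for example through the Hadoop FileSystem API) is deliberately left out.

```scala
import java.time.LocalDate
import scala.util.Try

// Hypothetical helper, not part of StreamingPro or Spark.
object PartitionRetentionSketch {
  private val Prefix = "hp_date="

  // Parse the date out of a directory name such as "hp_date=2018-05-23".
  // Names that do not match (e.g. metadata directories) yield None.
  def partitionDate(dirName: String): Option[LocalDate] =
    if (dirName.startsWith(Prefix)) Try(LocalDate.parse(dirName.stripPrefix(Prefix))).toOption
    else None

  // Directories whose date is strictly earlier than `today - keepDays`.
  def expired(dirNames: Seq[String], today: LocalDate, keepDays: Int): Seq[String] = {
    val cutoff = today.minusDays(keepDays.toLong)
    dirNames.filter(name => partitionDate(name).exists(_.isBefore(cutoff)))
  }
}
```

Passing today in as a parameter, rather than calling LocalDate.now() inside, keeps the selection logic easy to check in isolation.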

Solution

The fix is to implement our own Parquet sink, and not much needs to change. Add a new class:

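package org.apache.spark.sql.execution.streaming.newfile

import org.apache.hadoop.fs.Path
import org.apache.spark.internal.Logging
import org.apache.spark.internal.io.FileCommitProtocol
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.datasources.{FileFormat, FileFormatWriter}
import org.apache.spark.sql.execution.streaming.{FileStreamSink, FileStreamSinkLog, ManifestFileCommitProtocol, Sink}
import org.joda.time.DateTime
// RenderEngine is StreamingPro's template helper; its import is omitted here
// because its package is not shown in this article.

class NewFileStreamSink(
                         sparkSession: SparkSession,
                         _path: String,
                         fileFormat: FileFormat,
                         partitionColumnNames: Seq[String],
                         options: Map[String, String]) extends Sink with Logging {
  // Use the Velocity template engine so that complex templates are easy to render
  def evaluate(value: String, context: Map[String, AnyRef]) = {
    RenderEngine.render(value, context)
  }

  // Turn the path lookup into a method call, so that every write goes through
  // the method and gets a fresh value
  def path = {
    evaluate(_path, Map("date" -> new DateTime()))
  }

  // The paths derived from it all need to become methods as well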
  private def basePath = new Path(path)

  private def logPath = new Path(basePath, FileStreamSink.metadataDir)

  private def fileLog =
    new FileStreamSinkLog(FileStreamSinkLog.VERSION, sparkSession, logPath.toUri.toString)

  private val hadoopConf = sparkSession.sessionState.newHadoopConf()

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    if (batchId <= fileLog.getLatest().map(_._1).getOrElse(-1L)) {
      logInfo(s"Skipping already committed batch $batchId")
    } else {
      val committer = FileCommitProtocol.instantiate(
        className = sparkSession.sessionState.conf.streamingFileCommitProtocolClass,
        jobId = batchId.toString,
        outputPath = path,
        isAppend = false)

      committer match {
        case manifestCommitter: ManifestFileCommitProtocol =>
          manifestCommitter.setupManifestOptions(fileLog, batchId)
        case _ => // Do nothing
      }

      FileFormatWriter.write(
        sparkSession = sparkSession,
        queryExecution = data.queryExecution,
        fileFormat = fileFormat,
        committer = committer,
        outputSpec = FileFormatWriter.OutputSpec(path, Map.empty),
        hadoopConf = hadoopConf,
        partitionColumnNames = partitionColumnNames,
        bucketSpec = None,
        refreshFunction = _ => (),
        options = options)
    }
  }

  override def toString: String = s"FileSink[$path]"
}
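The essential change in the class above is that path is a def rather than a value fixed at construction time. The following standalone sketch illustrates only that difference; it does not use Spark, and SinkSketch, fixedPath, dynamicPath, and the injectable today clock are all hypothetical names introduced for the illustration.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object PathEvaluationSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

  // Hypothetical sink skeleton; `today` is an injectable clock so the
  // behaviour can be observed without waiting for midnight.
  class SinkSketch(base: String, today: () => LocalDate) {
    // Evaluated once, when the sink is constructed.
    val fixedPath: String = s"$base/hp_date=${today().format(fmt)}"

    // Re-evaluated on every access, like `path` in NewFileStreamSink.
    def dynamicPath: String = s"$base/hp_date=${today().format(fmt)}"
  }

  // Build the sink on 2018-05-23, advance the clock by one day,
  // then read both paths.
  def demo(): (String, String) = {
    var now = LocalDate.of(2018, 5, 23)
    val sink = new SinkSketch("/tmp/abc2", () => now)
    now = now.plusDays(1)
    (sink.fixedPath, sink.dynamicPath)
  }
}
```

In this sketch fixedPath keeps pointing at the start day while dynamicPath follows the clock. Note that a def is re-evaluated on each access, so code that reads it several times within one batch could see two different values around midnight.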

With the sink implemented, we also need a DataSource so that the new sink can be plugged into Spark and used from outside:

package org.apache.spark.sql.execution.streaming.newfile

import org.apache.spark.sql.{AnalysisException, SQLContext}
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

class DefaultSource extends StreamSinkProvider {
  override def createSink(sqlContext: SQLContext, parameters: Map[String, String], partitionColumns: Seq[String], outputMode: OutputMode): Sink = {
    val path = parameters.getOrElse("path", {
      throw new IllegalArgumentException("'path' is not specified")
    })
    if (outputMode != OutputMode.Append) {
      throw new AnalysisException(
        s"Data source ${getClass.getCanonicalName} does not support $outputMode output mode")
    }
    new NewFileStreamSink(sqlContext.sparkSession, path, new ParquetFileFormat(), partitionColumns, parameters)
  }
}

This is the standard DataSource API. It can now be used like this:

save append table21  
-- uses Joda-Time syntax
as parquet.`/tmp/jack/hp_date=${date.toString("yyyy-MM-dd")}` 
options mode="Append"
and duration="10"
-- specify the implementation class
and implClass="org.apache.spark.sql.execution.streaming.newfile"
and checkpointLocation="/tmp/cpl2";

Pretty convenient, isn't it?
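For readers who want to see what rendering the path template amounts to, here is a small stand-in sketch. It is not StreamingPro's RenderEngine and does not use Velocity or Joda-Time: it is a plain regex substitution with java.time, handling only the single expression form ${date.toString("...")} used above. PathTemplateSketch and render are hypothetical names.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.util.regex.Matcher
import scala.util.matching.Regex

// Stand-in for a template engine: supports only ${date.toString("<pattern>")}.
object PathTemplateSketch {
  // Matches ${date.toString("<pattern>")} and captures <pattern>.
  private val DateExpr: Regex = """\$\{date\.toString\("([^"]+)"\)\}""".r

  // Replace every date expression in the template with `now`,
  // formatted using the captured pattern.
  def render(template: String, now: LocalDateTime): String =
    DateExpr.replaceAllIn(template, m =>
      Matcher.quoteReplacement(now.format(DateTimeFormatter.ofPattern(m.group(1)))))
}
```

For simple patterns such as yyyy-MM-dd, java.time and Joda-Time agree; other pattern letters can differ between the two libraries, so this sketch should not be read as a drop-in replacement.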

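Reader comments

97373edf99d9: 威廉大神, I implemented time-based dynamic partitioning following your approach, and in real use I did run into a metadata compaction problem.
329.compact is in the dt=20181024 directory.
From the dt=20181025 directory onward, batch IDs 330 and later are recorded there. When the job reached batch 339, it tried to read 329.compact from the dt=20181025 directory and failed because the file could not be found:
dt=20181025/_spark_metadata/329.compact doesn't exist when compacting batch 339 (compactInterval: 10)
Do you have any ideas for solving this? Is there a way to forcibly turn off metadata compaction, or is modifying the Spark source the only option?
The Spark version is 2.3.0.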
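The batch numbers in this report follow from how the sink log picks compaction batches. The sketch below restates that arithmetic as I understand it from Spark's CompactibleFileStreamLog; treat the exact rule as an assumption to verify against the Spark version in use. CompactionBatchSketch and batchesToMerge are hypothetical names.

```scala
// Illustrative arithmetic only; not Spark code.
object CompactionBatchSketch {
  // Assumed rule: a batch is a compaction batch when (batchId + 1)
  // is a multiple of the compact interval.
  def isCompactionBatch(batchId: Long, compactInterval: Int): Boolean =
    (batchId + 1) % compactInterval == 0

  // Batches whose log entries are read and merged when the given
  // compaction batch is written.
  def batchesToMerge(compactionBatchId: Long, compactInterval: Int): Seq[Long] =
    math.max(0L, compactionBatchId - compactInterval) until compactionBatchId
}
```

With an interval of 10, batch 339 is a compaction batch and merges batches 329 through 338, so it needs 329.compact. Because the log directory is derived from the date-dependent path, that file sits under the previous day's directory, which matches the error above.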

Article link: https://www.haomeiwen.com/subject/eobrjftx.html