[Spark] Schema Inference - Spark

作者: pingpong_龘 | 来源:发表于2019-07-07 14:07 被阅读0次

[Spark] Schema Inference - Spark
20-SparkSQL01
2018-01-23 spark local[*] janusG
【Spark】Spark DataFrame schema转换方
Spark SQL程序
每日一读 12.15
大数据面试必备知识点总结：Spark，Hadoop，kafka，
spark安装与部署
Proof Theory 摘要第五辑
Spark 入门

1. 背景

Spark在的Dataframe在使用的过程中或涉及到schema的问题，schema就是这个Row的数据结构(StructType)，在代码中就是这个类的定义。如果你想解析一个json或者csv文件成dataframe，那么就需要知道他的StructType。

徒手写一个复杂类的StructType是个吃力不讨好的事情，所以Spark默认是支持自动推断schema的。但是如果使用流处理(Streaming)的话，他的支持力度是很受限的，最近在做Streaming处理的时候，遇到一些schema inference的问题，所以借机学习整理下Spark源码是如何实现的。

2. Spark版本

以下的代码基于spark的版本：

项目	version
scala	2.11
spark-core	2.4.0
spark-sql	2.4.0
mongo-spark-connector	2.11

gradle的配置：

providedRuntime group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.4.0'
providedRuntime group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.4.0'
providedRuntime group: 'org.mongodb.spark', name: 'mongo-spark-connector_2.11', version: '2.3.1'

3. Schema inference

3.1 spark的Schema inference

3.1.1 通过DDL来解析Schema

DDL的格式类似于："a INT, b STRING, c DOUBLE",

深入学习看这里：Open Data Description Language (OpenDDL)

StructType提供了接口直接通过解析DDL来识别StructType

    this.userSpecifiedSchema = Option(StructType.fromDDL(schemaString))

先把DDL string解析成SqlBaseLexer

val lexer = new SqlBaseLexer(new UpperCaseCharStream(CharStreams.fromString(command)))

然后，然后...就看的不太懂了...

3.1.2 解析一个Json的Schema

Spark中Dataframe的文件读取是通过DataFrameReader来完成的.

都是通过DataSet的ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan)方法转为DataFrame

  def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
    val qe = sparkSession.sessionState.executePlan(logicalPlan)
    qe.assertAnalyzed()
    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
  }

schema是由QueryExecution得到的

  def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
    val qe = sparkSession.sessionState.executePlan(logicalPlan)
    qe.assertAnalyzed()
    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
  }

其中的qe.analyzed.schema这句就是QueryExecution先分析生成LogicPlan,分析的源码在CheckAnalysis.scala中的def checkAnalysis(plan: LogicalPlan): Unit

 def checkAnalysis(plan: LogicalPlan): Unit = {
    // We transform up and order the rules so as to catch the first possible failure instead
    // of the result of cascading resolution failures.
    plan.foreachUp {

      case p if p.analyzed => // Skip already analyzed sub-plans

      case u: UnresolvedRelation =>
        u.failAnalysis(s"Table or view not found: ${u.tableIdentifier}")

      case operator: LogicalPlan =>
        // Check argument data types of higher-order functions downwards first.
        // If the arguments of the higher-order functions are resolved but the type check fails,
        // the argument functions will not get resolved, but we should report the argument type
        // check failure instead of claiming the argument functions are unresolved.
        operator transformExpressionsDown {
          case hof: HigherOrderFunction
              if hof.argumentsResolved && hof.checkArgumentDataTypes().isFailure =>
            hof.checkArgumentDataTypes() match {
              case TypeCheckResult.TypeCheckFailure(message) =>
                hof.failAnalysis(
                  s"cannot resolve '${hof.sql}' due to argument data type mismatch: $message")
            }
      
      
      ...
      ...
      
}

最终由Logic的output: Seq[Attribute]转为StructType:

  lazy val schema: StructType = StructType.fromAttributes(output)

具体每个Attribute转你为StructType的代码如下：

  private[sql] def fromAttributes(attributes: Seq[Attribute]): StructType =
    StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))

3.1.3 Kafka的Schema

在使用Kafka的Streaming的时候，自动推断只能推断到固定的几个StructField, 如果value是Json的话，也不会进一步解析出来。
这个是因为Kafka和json的dataSource是不一样的
DataFrame在load的时候，会有DataSource基于provider name来找到这个provider的data source的类定义

// DataSource.scala line 613
def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
      case name if name.equalsIgnoreCase("orc") &&
          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
        classOf[OrcFileFormat].getCanonicalName
      case name if name.equalsIgnoreCase("orc") &&
          conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "hive" =>
        "org.apache.spark.sql.hive.orc.OrcFileFormat"
      case "com.databricks.spark.avro" if conf.replaceDatabricksSparkAvroEnabled =>
        "org.apache.spark.sql.avro.AvroFileFormat"
      case name => name
    }
    val provider2 = s"$provider1.DefaultSource" 
    ...
}

如果输入provider name是"json"返回的是JsonFileFormat
如果是”kafka“返回的是KafkaSourceProvider
而KafkaSourceProvider的sourceSchema是kafkaSchema

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) = {
    validateStreamOptions(parameters)
    require(schema.isEmpty, "Kafka source has a fixed schema and cannot be set with a custom one")
    (shortName(), KafkaOffsetReader.kafkaSchema)
  }

具体kafkaSchema的定义如下：

  def kafkaSchema: StructType = StructType(Seq(
    StructField("key", BinaryType),
    StructField("value", BinaryType),
    StructField("topic", StringType),
    StructField("partition", IntegerType),
    StructField("offset", LongType),
    StructField("timestamp", TimestampType),
    StructField("timestampType", IntegerType)
  ))

3.2 mongo-spark的Schema inference

3.2.1 MongoInferSchema源码分析

看mongoSpark源码的时候，意外从一个toDF的方法里发现了有个MongoInferSchema实现了类型推断.

  /**
   * Creates a `DataFrame` based on the schema derived from the optional type.
   *
   * '''Note:''' Prefer [[toDS[T<:Product]()*]] as computations will be more efficient.
   *  The rdd must contain an `_id` for MongoDB versions < 3.2.
   *
   * @tparam T The optional type of the data from MongoDB, if not provided the schema will be inferred from the collection
   * @return a DataFrame
   */
  def toDF[T <: Product: TypeTag](): DataFrame = {
    val schema: StructType = MongoInferSchema.reflectSchema[T]() match {
      case Some(reflectedSchema) => reflectedSchema
      case None                  => MongoInferSchema(toBsonDocumentRDD)
    }
    toDF(schema)
  }

于是研究了下，发现MongoInferSchema的实现分两种情况：

给定了要解析的class类型

如果是给定了要解析的class类型，那就很好办，直接基于Spark的ScalaReflection的 schemaFor函数将class转为Schema:

case class Schema(dataType: DataType, nullable: Boolean)

这个Schema是ScalaReflection中定义的一个case class,本质是个catalyst DataType
所以可以再进一步直接转为StructType, 所以代码实现很简单：

ScalaReflection.schemaFor[T].dataType.asInstanceOf[StructType]

未给定要解析的class类型

如果没有给定要解析的class类型，那就直接根据从mongo里读取的RDD来推断Schema. 这个具体的实现方式是对RDD进行采样，采样数可以在readConfig中设置，默认值是1000(private val DefaultSampleSize: Int = 1000).

因为从mongo读取出来的格式就是BsonDocument, 所以采样的过程就是将每个BsonDocument转为StructType

  private def getSchemaFromDocument(document: BsonDocument, readConfig: ReadConfig): StructType = {
    val fields = new util.ArrayList[StructField]()
    document.entrySet.asScala.foreach(kv => fields.add(DataTypes.createStructField(kv.getKey, getDataType(kv.getValue, readConfig), true)))
    DataTypes.createStructType(fields)
  }

然后将采样的1000个集合进行两两merge，获取兼容的类型，最终得到RootType，即为所需的Schema：

// perform schema inference on each row and merge afterwards
val rootType: DataType = sampleData
.map(getSchemaFromDocument(_, mongoRDD.readConfig))
.treeAggregate[DataType](StructType(Seq()))(
compatibleType(_, _, mongoRDD.readConfig, nested = false),
compatibleType(_, _, mongoRDD.readConfig, nested = false)
)

3.2.2 MongoInferSchema存在的问题

3.2.2.1 Java兼容性问题

虽然scala脱胎于java，但是在类型和结构上也逐渐出现了很多的不同点，包括部分基础结构和各种各样的复杂结构。所以如果要推断的类是java类，MongoInferSchema 也提供了MongoInferSchemaJava 实现类型反射:

/**
 * A helper for inferring the schema from Java
 *
 * In Spark 2.2.0 calling this method from Scala 2.10 caused compilation errors with the shadowed library in
 * `JavaTypeInference`. Moving it into Java stops Scala falling over and allows it to continue to work.
 *
 * See: SPARK-126
 */
final class MongoInferSchemaJava {

    @SuppressWarnings("unchecked")
    public static <T> StructType reflectSchema(final Class<T> beanClass) {
        return (StructType) JavaTypeInference.inferDataType(beanClass)._1();
    }

}

具体的推断实现在def inferDataType(typeToken: TypeToken[_], seenTypeSet: Set[Class[_]]函数中，代码如下，这里就不详细展开了。

 /**
   * Infers the corresponding SQL data type of a Java type.
   * @param typeToken Java type
   * @return (SQL data type, nullable)
   */
  private def inferDataType(typeToken: TypeToken[_], seenTypeSet: Set[Class[_]] = Set.empty)
    : (DataType, Boolean) = {
    typeToken.getRawType match {
      case c: Class[_] if c.isAnnotationPresent(classOf[SQLUserDefinedType]) =>
        (c.getAnnotation(classOf[SQLUserDefinedType]).udt().newInstance(), true)

      case c: Class[_] if UDTRegistration.exists(c.getName) =>
        val udt = UDTRegistration.getUDTFor(c.getName).get.newInstance()
          .asInstanceOf[UserDefinedType[_ >: Null]]
        (udt, true)
       ...
       ...
       ...
        val properties = getJavaBeanReadableProperties(other)
        val fields = properties.map { property =>
          val returnType = typeToken.method(property.getReadMethod).getReturnType
          val (dataType, nullable) = inferDataType(returnType, seenTypeSet + other)
          new StructField(property.getName, dataType, nullable)
        }
        (new StructType(fields), true)
    }
  }

所以如果大家要使用mongo-spark的类型推断，那么可以基于scala和java封装2个接口函数用于Schema Infer, 下面是我自己封装的2个函数：

  /**
    * @see [[MongoInferSchema.apply]]
    */
  protected def inferSchemaScala[T <: Product : TypeTag](): StructType = {
    MongoInferSchema.reflectSchema[T]() match {
      case Some(reflectedSchema) => reflectedSchema
      // canonicalizeType erases all empty structs, including the only one we want to keep
      case None => StructType(Seq())
    }
  }

  /**
    * @see [[MongoInferSchema.apply]]
    */
  protected def inferSchemaJava[T](beanClass: Class[T]): StructType = {
    MongoInferSchema.reflectSchema(beanClass)
  }

3.2.2.2 采样推断不准确问题

产生不准确的原因在于：

采用点不完整
毕竟是采样，如果某个字段在采样点没出现，则会导致最终推断的不准确
集合类结构泛型推断错误
另外一个问题是，比如字段里有个Map[String , String]类型，可能会把其中的key推断成不同的StrutType,而不是统一推断成String。我自己做过测试，会一定程度上依赖某些key是否会高频出现，所以说这种infer schema具有不确定性。

解决方案：

使用的数据结构尽量简单，不要有嵌套或者复杂结构
但这种情况，真正的生产环境很难，大部分公司的代码结构，迭代了那么久，怎么会那么简单呢，对吧？
给定要解析的class类型
这个是个很好的方案，可以确保没有错误，同时，如果类的字段或者结构发生变化了，可以确保无缝兼容，不用重新修改代码。

4. 总结

以上介绍了几种spark内部实现 schema inference 源码和使用方式。在日常大部分工作中这些东西都是被spark隐藏的，而且如果没有特殊场景，也是不需要涉及到这里的东西。我是因为刚好遇到Spark Streaming读写Kafka的Topic，但发现读到的RDD/DataFrame没有很好的解析Schema，于是研究了下相关的实现。
最终基于项目选择了MongoInferSchema的实现方式，友好的解决了问题。

[Spark] Schema Inference - Spark
1. 背景 Spark在的Dataframe在使用的过程中或涉及到schema的问题，schema就是这个Row的...
20-SparkSQL01
Spark SQL IOE SQL：schema + file select ... from xxx where...
2018-01-23 spark local[*] janusG
spark local[*] 模式操作janusgraph。如果在创建schema时：graph.openMan...
【Spark】Spark DataFrame schema转换方
比如原始表的schema如下：现在想将该DataFrame 的schema转换成:id:String,goods...
Spark SQL程序
在IDEA中开发Spark SQL程序 1、指定Schema格式数据E:/RmDownloads/student...
每日一读 12.15
spark sql编程之实现合并Parquet格式的DataFrame的schema http://www.abo...
大数据面试必备知识点总结：Spark，Hadoop，kafka，
spark spark core spark sql spark streaming spark编程模式 spar...
spark安装与部署
spark安装与部署 spark概述 spark平台结构spark统一栈 spark官网 spark的安装，配置，...
Proof Theory 摘要第五辑
Apply rules of inference to derive conclusions Schema：符合我...
Spark 入门
Spark Spark 背景什么是 Spark 官网：http://spark.apache.org Spark...

[Spark] Schema Inference - Spark

1. 背景

2. Spark版本

3. Schema inference

3.1 spark的Schema inference

3.1.1 通过DDL来解析Schema

3.1.2 解析一个Json的Schema

3.1.3 Kafka的Schema

3.2 mongo-spark的Schema inference

3.2.1 MongoInferSchema源码分析

3.2.2 MongoInferSchema存在的问题

3.2.2.1 Java兼容性问题

3.2.2.2 采样推断不准确问题

4. 总结

相关文章

[Spark] Schema Inference - Spark

20-SparkSQL01

2018-01-23 spark local[*] janusG

【Spark】Spark DataFrame schema转换方

Spark SQL程序

每日一读 12.15

大数据面试必备知识点总结：Spark，Hadoop，kafka，

spark安装与部署

Proof Theory 摘要第五辑

Spark 入门

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读