Solved: the "Caused by: jav" error that appears during Spark data cleaning

Author: Sam_L | Published: 2019-02-27 16:48

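Background: I was using Spark SQL for a data-cleaning job, at the step where an RDD is converted to a DataFrame. I had prepared my own access.log file for it, written in Notepad, with Tab as the field separator.

Running the job produced this error: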

Caused by: java.lang.RuntimeException: java.lang.Integer is not a valid external type for schema of string

Image 1.png

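Analysis: the data types do not match. The "Error while encoding" exception is thrown from toRow in the Spark source: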

 /**
   * Returns an encoded version of `t` as a Spark SQL row.  Note that multiple calls to
   * toRow are allowed to return the same actual [[InternalRow]] object.  Thus, the caller should
   * copy the result before making another call if required.
   */
  def toRow(t: T): InternalRow = try {
    inputRow(0) = t
    extractProjection(inputRow)
  } catch {
    case e: Exception =>
      throw new RuntimeException(
        s"Error while encoding: $e\n${serializer.map(_.treeString).mkString("\n")}", e)
  }
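A minimal sketch of how this message can arise, assuming a local SparkSession and a hypothetical two-column schema (the object name ReproduceTypeMismatch and the field names url and traffic are illustrative, not taken from the original job). A Row whose first value is an Int is encoded against a schema whose first field is a string. Because createDataFrame on an RDD of Rows is lazy, the failure only surfaces once an action such as collect runs.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import scala.util.Try

object ReproduceTypeMismatch {
  // Hypothetical schema: a string column followed by a long column.
  val schema: StructType = StructType(Seq(
    StructField("url", StringType),
    StructField("traffic", LongType)
  ))

  /** Returns true if encoding `row` against `schema` fails when an action runs. */
  def encodingFails(spark: SparkSession, row: Row): Boolean = {
    val rdd = spark.sparkContext.parallelize(Seq(row))
    // Building the DataFrame is lazy; collect() forces the row encoder to run.
    Try(spark.createDataFrame(rdd, schema).collect()).isFailure
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReproduceTypeMismatch")
      .master("local[1]")
      .getOrCreate()

    // Row(0) holds a java.lang.Integer where the schema expects a string.
    println(encodingFails(spark, Row(0)))
    // A row whose values match the schema types encodes without error.
    println(encodingFails(spark, Row("", 0L)))

    spark.stop()
  }
}
```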

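First attempt: tidy up the format of the log file. Result: the job ran successfully.

My first thought was that I had carelessly written the log file in the wrong format, even though what I had written looked correct. So I went through the formatting carefully in a plain-text editor (记事本), and after that the job ran and produced the cleaned output.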
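Rather than checking the format by eye, a small helper can list the lines that do not split into the expected number of Tab-separated fields. This is only a sketch: the object LogFormatCheck and its method are hypothetical, and the expected field count is left as a parameter because the article does not state how many columns the raw log has.

```scala
object LogFormatCheck {
  /** Returns (1-based line number, actual field count) for every line whose
    * Tab-separated field count differs from `expectedFields`.
    * The limit -1 keeps trailing empty fields so they are counted too. */
  def malformedLines(lines: Seq[String], expectedFields: Int): Seq[(Int, Int)] =
    lines.zipWithIndex
      .map { case (line, idx) => (idx + 1, line.split("\t", -1).length) }
      .filter { case (_, count) => count != expectedFields }
}
```

For a file on disk, the lines could be read with scala.io.Source.fromFile("access.log").getLines().toSeq and passed in. A line where spaces were typed instead of Tabs shows up with a field count of 1.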

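Second attempt: change the code. Result: the job ran successfully. (However I looked at it, the log file written in Notepad seemed correct.)

Judging by where the error is raised and the hint in the source code, the problem is most likely at Row(0): it puts an Integer into the first column, while the schema declares that column as a string.

I checked the declared types of the output fields. traffic and cmsId are both defined as LongType and both values are converted to long, so nothing is wrong there. The likely point of failure is the exception handler shown here: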

Image 2.png

Change it to:

case e:Exception => Row("","",0L,0L,"","","","")

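With this change, both the log file originally written in Notepad and the one later fixed in the plain-text editor are handled correctly, and running the job produces the cleaned data.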
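The fix works because the fallback row now has the same number of values, with the same types, as the schema. A minimal sketch of that pattern follows. Only traffic and cmsId (both LongType) come from the article; the object name AccessConvertSketch, the other field names, and the assumption that each line already holds the eight fields in schema order are placeholders for illustration.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object AccessConvertSketch {
  // Eight fields: two strings, two longs, four strings (names partly hypothetical).
  val struct: StructType = StructType(Seq(
    StructField("url", StringType),
    StructField("cmsType", StringType),
    StructField("cmsId", LongType),
    StructField("traffic", LongType),
    StructField("ip", StringType),
    StructField("city", StringType),
    StructField("time", StringType),
    StructField("day", StringType)
  ))

  /** Parses one Tab-separated line assumed to hold the fields in schema order. */
  def parseLog(line: String): Row =
    try {
      val f = line.split("\t")
      Row(f(0), f(1), f(2).toLong, f(3).toLong, f(4), f(5), f(6), f(7))
    } catch {
      // The fallback matches `struct` in arity and types, so encoding succeeds.
      case _: Exception => Row("", "", 0L, 0L, "", "", "", "")
    }
}
```

One consequence of this design is that unparsable lines end up in the output as placeholder rows with empty strings and zeros, so they may need to be filtered out afterwards.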

Article link: https://www.haomeiwen.com/subject/jysfuqtx.html
