Newline Character Problems When Saving a DataFrame as a Hive Table

Author: mvpboss1004 | Published on 2020-11-08 22:59

    When a pyspark DataFrame is saved directly as a Hive table, strings containing newline characters break the rows apart. Using Spark 3.0.0 as an example, we save one row containing a string with an embedded newline to a Hive table, yet counting the rows yields 2:

    >>> df = spark.createDataFrame([(1,'hello\nworld')], ('id','msg'))
    >>> df.write.format('hive').saveAsTable('test.newline0')
    >>> spark.sql('SELECT COUNT(1) FROM test.newline0').show()
    +--------+
    |count(1)|
    +--------+
    |       2|
    +--------+
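
    A quick way to confirm what is happening is to inspect the table's storage metadata. With default Hive settings, a table created through the plain hive format should list org.apache.hadoop.mapred.TextInputFormat as its input format (a minimal check; the exact metadata rows vary by Spark/Hive version):

    >>> # Look for the InputFormat / OutputFormat rows in the output;
    >>> # a text-backed table shows org.apache.hadoop.mapred.TextInputFormat
    >>> spark.sql('DESCRIBE FORMATTED test.newline0').show(truncate=False)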

    It took me quite a while to find documentation for this behavior; it finally turned up in the "Specifying storage format for Hive tables" section of the Spark SQL docs. When saving directly in the hive format, the underlying storage is 'textfile' and the line delimiter defaults to '\n', so embedded newlines naturally split records apart. This can be verified by trying to change the delimiter:

    >>> df.write.format('hive').option('fileFormat', 'textfile').option('lineDelim', '\x13').saveAsTable('test.newline1')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/share/spark-3.0.0-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 868, in saveAsTable
        self._jwrite.saveAsTable(name)
      File "/usr/share/spark-3.0.0-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
      File "/usr/share/spark-3.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 137, in deco
        raise_from(converted)
      File "<string>", line 3, in raise_from
    pyspark.sql.utils.IllegalArgumentException: Hive data source only support newline '\n' as line delimiter, but given: �
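
    So the delimiter cannot be changed. If the table must remain textfile-backed, one workaround is to strip the newlines before writing (a sketch using regexp_replace; the replacement character and the table name test.newline2 are just illustrative choices):

    >>> from pyspark.sql.functions import regexp_replace
    >>> # replace embedded newlines with spaces so each record stays on one physical line
    >>> cleaned = df.withColumn('msg', regexp_replace('msg', '\n', ' '))
    >>> cleaned.write.format('hive').saveAsTable('test.newline2')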

    Otherwise, the fix is simple: save with a different file format:

    >>> df.write.format('hive').option('fileFormat', 'parquet').saveAsTable('test.newline1')
    >>> spark.sql('SELECT COUNT(1) FROM test.newline1').show()
    +--------+
    |count(1)|
    +--------+
    |       1|
    +--------+
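
    Other binary formats such as 'orc' work the same way. Alternatively, dropping format('hive') entirely makes saveAsTable use Spark's native data source, which defaults to parquet via spark.sql.sources.default (a sketch; the table name test.newline3 is just for illustration):

    >>> # without format('hive'), Spark writes a native-source table,
    >>> # parquet by default, which preserves embedded newlines
    >>> df.write.saveAsTable('test.newline3')
    >>> spark.sql('SELECT COUNT(1) FROM test.newline3').show()

    The count should again come back as 1, since parquet stores values in a binary layout rather than delimiter-separated text.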
