Two Ways to Store Hive Data with PySpark

By 小甜瓜Melon | Published 2017-07-21 10:40

    Background: Hive's CREATE TABLE AS and PySpark's .write.saveAsTable produce tables with different on-disk storage formats: the former stores the data as delimited Text, while the latter stores it as Parquet.

    Example

    Storage format of the original table

    hiveContext.sql("SHOW CREATE TABLE testdb.tttest").show(n=1000, truncate=False)
    
    +--------------------------------------------------------------+
    |result                                                        |
    +--------------------------------------------------------------+
    |CREATE TABLE `testdb.tttest`(                                 |
    |  `username` string,                                          |
    |  `sex` string)                                               |
    |COMMENT 'Imported by sqoop on 2017/04/17 10:11:26'            |
    |ROW FORMAT SERDE                                              |
    |  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'        |
    |WITH SERDEPROPERTIES (                                        |
    |  'field.delim'='\t',                                         |
    |  'line.delim'='\n',                                          |
    |  'serialization.format'='\t')                                |
    |STORED AS INPUTFORMAT                                         |
    |  'org.apache.hadoop.mapred.TextInputFormat'                  |
    |OUTPUTFORMAT                                                  |
    |  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'|
    |LOCATION                                                      |
    |  'hdfs://nameservice1/user/hive/warehouse/testdb.db/tttest'  |
    |TBLPROPERTIES (                                               |
    |  'COLUMN_STATS_ACCURATE'='true',                             |
    |  'numFiles'='1',                                             |
    |  'numRows'='0',                                              |
    |  'rawDataSize'='0',                                          |
    |  'totalSize'='66',                                           |
    |  'transient_lastDdlTime'='1492395090')                       |
    +--------------------------------------------------------------+
    

    The source table is stored as Text.

    Method 1: store with Hive's CREATE TABLE AS

    hiveContext.sql("DROP TABLE IF EXISTS testdb.test_a")
    hiveContext.sql("""CREATE TABLE IF NOT EXISTS testdb.test_a AS SELECT * FROM testdb.tttest""")
    hiveContext.sql("SHOW CREATE TABLE testdb.test_a").show(n=1000, truncate=False)
    
    +------------------------------------------------------------+
    |result                                                      |
    +------------------------------------------------------------+
    |CREATE TABLE `testdb.test_a`(                               |
    |  `username` string,                                        |
    |  `sex` string)                                             |
    |ROW FORMAT SERDE                                            |
    |  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'      |
    |STORED AS INPUTFORMAT                                       |
    |  'org.apache.hadoop.mapred.TextInputFormat'                |
    |OUTPUTFORMAT                                                |
    |  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'  |
    |LOCATION                                                    |
    |  'hdfs://nameservice1/user/hive/warehouse/testdb.db/test_a'|
    |TBLPROPERTIES (                                             |
    |  'COLUMN_STATS_ACCURATE'='false',                          |
    |  'numFiles'='2',                                           |
    |  'numRows'='-1',                                           |
    |  'rawDataSize'='-1',                                       |
    |  'totalSize'='66',                                         |
    |  'transient_lastDdlTime'='1500603886')                     |
    +------------------------------------------------------------+
    

    Method 2: store with PySpark's .write.saveAsTable

    hiveContext.sql("DROP TABLE IF EXISTS testdb.test_b")
    hiveContext.sql("""SELECT * FROM testdb.tttest""").write.saveAsTable("testdb.test_b")
    hiveContext.sql("SHOW CREATE TABLE testdb.test_b").show(n=1000, truncate=False)
    
    +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                                                           |
    +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |CREATE TABLE `testdb.test_b`(                                                                                                                                                                                                    |
    |  `username` string COMMENT '',                                                                                                                                                                                                  |
    |  `sex` string COMMENT '')                                                                                                                                                                                                       |
    |ROW FORMAT SERDE                                                                                                                                                                                                                 |
    |  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'                                                                                                                                                                  |
    |WITH SERDEPROPERTIES (                                                                                                                                                                                                           |
    |  'path'='hdfs://nameservice1/user/hive/warehouse/testdb.db/test_b')                                                                                                                                                             |
    |STORED AS INPUTFORMAT                                                                                                                                                                                                            |
    |  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'                                                                                                                                                                |
    |OUTPUTFORMAT                                                                                                                                                                                                                     |
    |  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'                                                                                                                                                               |
    |LOCATION                                                                                                                                                                                                                         |
    |  'hdfs://nameservice1/user/hive/warehouse/testdb.db/test_b'                                                                                                                                                                     |
    |TBLPROPERTIES (                                                                                                                                                                                                                  |
    |  'COLUMN_STATS_ACCURATE'='false',                                                                                                                                                                                               |
    |  'EXTERNAL'='FALSE',                                                                                                                                                                                                            |
    |  'numFiles'='2',                                                                                                                                                                                                                |
    |  'numRows'='-1',                                                                                                                                                                                                                |
    |  'rawDataSize'='-1',                                                                                                                                                                                                            |
    |  'spark.sql.sources.provider'='org.apache.spark.sql.parquet',                                                                                                                                                                   |
    |  'spark.sql.sources.schema.numParts'='1',                                                                                                                                                                                       |
    |  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"username\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"sex\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}', |
    |  'totalSize'='1308',                                                                                                                                                                                                            |
    |  'transient_lastDdlTime'='1500603889')                                                                                                                                                                                          |
    +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    
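    To confirm which format a table ended up in without eyeballing the output, the SHOW CREATE TABLE text can be checked for the INPUTFORMAT class. A minimal sketch in plain Python (the helper name storage_format is my own, not a Spark or Hive API):

```python
def storage_format(create_stmt):
    """Classify a Hive table's storage format from its SHOW CREATE TABLE text."""
    if "parquet.MapredParquetInputFormat" in create_stmt:
        return "parquet"
    if "mapred.TextInputFormat" in create_stmt:
        return "text"
    return "unknown"

# The rows returned by SHOW CREATE TABLE can be joined into one string, e.g.:
# stmt = "\n".join(r.result for r in
#                  hiveContext.sql("SHOW CREATE TABLE testdb.test_b").collect())
```

Applied to the two outputs above, this returns "text" for testdb.test_a and "parquet" for testdb.test_b.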

    Tips
    The first method sometimes produces garbled characters (mojibake); for large tables, the second method is preferable, since Parquet is far less prone to this.
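    The garbling happens because a delimited text file carries no encoding metadata: every reader has to guess the charset, and a wrong guess yields mojibake. Parquet embeds its schema and stores strings as UTF-8, so there is nothing to guess. A minimal illustration of the failure mode, in plain Python and independent of Spark:

```python
# Bytes as a text-format table stores them (UTF-8 on the writing side)...
raw = "小甜瓜".encode("utf-8")

# ...decoded by a reader that wrongly assumes GBK: the string is garbled
# and the round trip does not survive.
garbled = raw.decode("gbk", errors="replace")
assert garbled != "小甜瓜"
```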

    Drop the two newly created tables

    hiveContext.sql("DROP TABLE testdb.test_a PURGE")
    hiveContext.sql("DROP TABLE testdb.test_b PURGE")
    

    Done.
