Big Data Development with Hive, Case 1: Garbled Chinese Comments in Hive Tables


Author: 只是甲 | Published 2022-02-09 16:48

    Environment:
    Hive 2.1.1
    MySQL 5.6.49

    1. Problem description

    Chinese table and column comments turn into question marks after creating a table:

    hive> create table t1 (id int comment 'id',name string comment "名字",login_date timestamp comment "登陆时间") comment "登陆日志表" stored as textfile;
    OK
    Time taken: 0.079 seconds
    hive> show create table t1;
    OK
    CREATE TABLE `t1`(
      `id` int COMMENT 'id', 
      `name` string COMMENT '??', 
      `login_date` timestamp COMMENT '????')
    COMMENT '?????'
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      'hdfs://nameservice1/user/hive/warehouse/test.db/t1'
    TBLPROPERTIES (
      'transient_lastDdlTime'='1606979085')
    Time taken: 0.052 seconds, Fetched: 15 row(s)
    

    2. Solution

    The garbling happens in the metadata, and Hive stores its metadata in MySQL, so the first step is to inspect the character set of the relevant MySQL tables:

    mysql> use hive;
    Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A
    
    Database changed
    mysql> 
    mysql> show create table COLUMNS_V2\G
    *************************** 1. row ***************************
           Table: COLUMNS_V2
    Create Table: CREATE TABLE `COLUMNS_V2` (
      `CD_ID` bigint(20) NOT NULL,
      `COMMENT` varchar(256) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
      `COLUMN_NAME` varchar(767) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
      `TYPE_NAME` mediumtext,
      `INTEGER_IDX` int(11) NOT NULL,
      PRIMARY KEY (`CD_ID`,`COLUMN_NAME`),
      KEY `COLUMNS_V2_N49` (`CD_ID`),
      CONSTRAINT `COLUMNS_V2_FK1` FOREIGN KEY (`CD_ID`) REFERENCES `CDS` (`CD_ID`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1
    1 row in set (0.01 sec)
    

    As shown above, the comment columns do indeed use the latin1 character set, which is the server default in MySQL 5.6.
    MySQL only changed its default character set (to utf8mb4) in version 8.0, and the CDH installer creates the metastore tables with whatever the database default happens to be. For a new CDH installation it is therefore best to use MySQL 8.0 or later, or to set a utf8 default explicitly beforehand.
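The '?????' placeholders in the SHOW CREATE TABLE output come from MySQL substituting a question mark for every character that latin1 cannot represent. A minimal Python sketch of that same substitution rule (this is not MySQL itself, just the equivalent encoding behavior):

```python
# The Chinese table comment that Hive passes to the metastore.
comment = "登陆日志表"  # "login log table"

# latin1 has no code points for CJK characters, so each one is
# replaced with '?', which is exactly how the comment ends up garbled.
stored = comment.encode("latin1", errors="replace")
print(stored.decode("latin1"))  # -> ?????
```

Five unrepresentable characters become five question marks, matching the '?????' seen in the table comment above.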

    Fix: change the character set of the affected metastore columns.

    Log in to MySQL and switch to the hive database:
    use hive;
    Then alter the following three tables:
    alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8; 
    alter table TABLE_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8; 
    alter table PARTITION_KEYS modify column PKEY_COMMENT varchar(4000) character set utf8; 
    

    Dropping and recreating the table confirms the problem is fixed:

        > drop table t1;
    OK
    Time taken: 0.096 seconds
    hive> create table t1 (id int comment 'id',name string comment "名字",login_date timestamp comment "登陆时间") comment "登陆日志表" stored as textfile;
    OK
    Time taken: 0.061 seconds
    hive> 
        > show create table t1;
    OK
    CREATE TABLE `t1`(
      `id` int COMMENT 'id', 
      `name` string COMMENT '名字', 
      `login_date` timestamp COMMENT '登陆时间')
    COMMENT '登陆日志表'
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      'hdfs://nameservice1/user/hive/warehouse/test.db/t1'
    TBLPROPERTIES (
      'transient_lastDdlTime'='1606980419')
    Time taken: 0.051 seconds, Fetched: 15 row(s)
    

    3. Converting the entire hive database to utf8

    The fix above solves this particular case, but other metastore tables still cannot store Chinese text. To settle the problem once and for all, I converted the whole database to utf8.

    3.1 Back up the hive database

    Back up the hive database with mysqldump:

    mysqldump -uroot -p --quick --extended-insert --default-character-set=latin1 hive > hive_backup_20201203.sql
    

    3.2 Edit the backup file

    Open hive_backup_20201203.sql and change SET NAMES latin1 to SET NAMES utf8;
    in every table definition, change DEFAULT CHARSET=latin1 to DEFAULT CHARSET=utf8.

    Before the edits:

    [root@hp1 mysql]# more hive_backup_20201203.sql
    -- MySQL dump 10.13  Distrib 5.6.49, for Linux (x86_64)
    --
    -- Host: localhost    Database: hive
    -- ------------------------------------------------------
    -- Server version       5.6.49
    
    /*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
    /*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
    /*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
    /*!40101 SET NAMES latin1 */;
    /*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
    /*!40103 SET TIME_ZONE='+00:00' */;
    /*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
    /*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
    /*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
    /*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
    
    --
    -- Table structure for table `AUX_TABLE`
    --
    
    DROP TABLE IF EXISTS `AUX_TABLE`;
    /*!40101 SET @saved_cs_client     = @@character_set_client */;
    /*!40101 SET character_set_client = utf8 */;
    CREATE TABLE `AUX_TABLE` (
      `MT_KEY1` varchar(128) NOT NULL,
      `MT_KEY2` bigint(20) NOT NULL,
      `MT_COMMENT` varchar(255) DEFAULT NULL,
      PRIMARY KEY (`MT_KEY1`,`MT_KEY2`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
    /*!40101 SET character_set_client = @saved_cs_client */;
    
    --
    -- Dumping data for table `AUX_TABLE`
    --
    
    LOCK TABLES `AUX_TABLE` WRITE;
    /*!40000 ALTER TABLE `AUX_TABLE` DISABLE KEYS */;
    /*!40000 ALTER TABLE `AUX_TABLE` ENABLE KEYS */;
    UNLOCK TABLES;
    
    --
    -- Table structure for table `BUCKETING_COLS`
    --
    **(remaining output omitted)**
    

    Apply the changes with sed:

    sed -i "s/SET NAMES latin1/SET NAMES utf8/g" hive_backup_20201203.sql
    sed -i "s/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/g" hive_backup_20201203.sql
    sed -i "s/CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL/ /g" hive_backup_20201203.sql
    sed -i "s/CHARACTER SET latin1 COLLATE latin1_bin NOT NULL/ /g" hive_backup_20201203.sql
    sed -i "s/CHARACTER SET latin1 COLLATE latin1_bin/ /g" hive_backup_20201203.sql
    

    Repeat until the following command produces no output:

    more hive_backup_20201203.sql | grep latin  
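The grep check can also be done as a short, reusable scan, which makes it easy to see exactly which lines still need attention (a sketch; hive_backup_20201203.sql is the dump file from this walkthrough):

```python
def remaining_latin1_lines(path):
    """Return the lines of a dump file that still mention latin1."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in f if "latin1" in line]

# The dump is ready to re-import once this returns an empty list:
# remaining_latin1_lines("hive_backup_20201203.sql")
```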
    

    After the edits, all charset references have been updated:

    [root@hp1 mysql]# more hive_backup_20201203.sql
    -- MySQL dump 10.13  Distrib 5.6.49, for Linux (x86_64)
    --
    -- Host: localhost    Database: hive
    -- ------------------------------------------------------
    -- Server version       5.6.49
    
    /*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
    /*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
    /*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
    /*!40101 SET NAMES utf8 */;
    /*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
    /*!40103 SET TIME_ZONE='+00:00' */;
    /*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
    /*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
    /*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
    /*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
    
    --
    -- Table structure for table `AUX_TABLE`
    --
    
    DROP TABLE IF EXISTS `AUX_TABLE`;
    /*!40101 SET @saved_cs_client     = @@character_set_client */;
    /*!40101 SET character_set_client = utf8 */;
    CREATE TABLE `AUX_TABLE` (
      `MT_KEY1` varchar(128) NOT NULL,
      `MT_KEY2` bigint(20) NOT NULL,
      `MT_COMMENT` varchar(255) DEFAULT NULL,
      PRIMARY KEY (`MT_KEY1`,`MT_KEY2`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
    /*!40101 SET character_set_client = @saved_cs_client */;
    
    --
    -- Dumping data for table `AUX_TABLE`
    --
    
    LOCK TABLES `AUX_TABLE` WRITE;
    /*!40000 ALTER TABLE `AUX_TABLE` DISABLE KEYS */;
    /*!40000 ALTER TABLE `AUX_TABLE` ENABLE KEYS */;
    UNLOCK TABLES;
    
    --
    -- Table structure for table `BUCKETING_COLS`
    --
    
    **(remaining output omitted)**
    

    3.3 Re-import with the new character set

    First import attempt:

    mysql -uroot -p hive < hive_backup_20201203.sql
    

    The import fails:

    [root@hp1 mysql]# mysql -uroot -p hive < hive_backup_20201203.sql
    Enter password: 
    ERROR 1071 (42000) at line 548: Specified key was too long; max key length is 767 bytes
    

    InnoDB limits index keys to 767 bytes (under MySQL 5.6 defaults), which for a varchar key column means at most 767 single-byte characters, 767/2 = 383 double-byte characters, or 767/3 = 255 triple-byte characters.
    GBK uses two bytes per character and UTF-8 up to three, hence the 255-character ceiling under utf8. MySQL 8.0 raises the limit from 767 to 3072 bytes.
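The per-charset ceilings follow directly from byte arithmetic, since MySQL budgets the worst-case bytes per character for the column's charset (a quick sketch):

```python
INNODB_KEY_LIMIT = 767  # max index key size in bytes (MySQL 5.6 defaults)

# Worst-case bytes per character for the charsets discussed here.
BYTES_PER_CHAR = {"latin1": 1, "gbk": 2, "utf8": 3}

for charset, width in BYTES_PER_CHAR.items():
    print(f"{charset}: at most varchar({INNODB_KEY_LIMIT // width}) is indexable")
# latin1: at most varchar(767) is indexable
# gbk: at most varchar(383) is indexable
# utf8: at most varchar(255) is indexable
```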

    Find the primary-key varchar columns longer than 255 characters:

    select table_schema,table_name,column_name,data_type,character_maximum_length,COLUMN_KEY 
       from INFORMATION_SCHEMA.COLUMNS 
      where COLUMN_KEY='PRI' 
      and table_schema = 'hive' 
      and data_type = 'varchar' 
     order by character_maximum_length;
    
    mysql> select table_schema,table_name,column_name,data_type,character_maximum_length,COLUMN_KEY 
        ->   from INFORMATION_SCHEMA.COLUMNS 
        ->  where COLUMN_KEY='PRI' 
        ->  and table_schema = 'hive' 
        ->  and data_type = 'varchar' 
        -> order by character_maximum_length;
    +--------------+-------------------------+---------------+-----------+--------------------------+------------+
    | table_schema | table_name              | column_name   | data_type | character_maximum_length | COLUMN_KEY |
    +--------------+-------------------------+---------------+-----------+--------------------------+------------+
    | hive         | AUX_TABLE               | MT_KEY1       | varchar   |                      128 | PRI        |
    | hive         | PARTITION_KEYS          | PKEY_NAME     | varchar   |                      128 | PRI        |
    | hive         | NUCLEUS_TABLES          | CLASS_NAME    | varchar   |                      128 | PRI        |
    | hive         | TYPE_FIELDS             | FIELD_NAME    | varchar   |                      128 | PRI        |
    | hive         | DATABASE_PARAMS         | PARAM_KEY     | varchar   |                      180 | PRI        |
    | hive         | METASTORE_DB_PROPERTIES | PROPERTY_KEY  | varchar   |                      255 | PRI        |
    | hive         | SEQUENCE_TABLE          | SEQUENCE_NAME | varchar   |                      255 | PRI        |
    | hive         | TABLE_PARAMS            | PARAM_KEY     | varchar   |                      256 | PRI        |
    | hive         | SERDE_PARAMS            | PARAM_KEY     | varchar   |                      256 | PRI        |
    | hive         | SD_PARAMS               | PARAM_KEY     | varchar   |                      256 | PRI        |
    | hive         | PARTITION_PARAMS        | PARAM_KEY     | varchar   |                      256 | PRI        |
    | hive         | INDEX_PARAMS            | PARAM_KEY     | varchar   |                      256 | PRI        |
    | hive         | COLUMNS_V2              | COLUMN_NAME   | varchar   |                      767 | PRI        |
    | hive         | DELEGATION_TOKENS       | TOKEN_IDENT   | varchar   |                      767 | PRI        |
    +--------------+-------------------------+---------------+-----------+--------------------------+------------+
    14 rows in set (0.01 sec)
    

    Find these columns in hive_backup_20201203.sql and shorten every length above 255 down to 255, e.g.:

    COLUMN_NAME` varchar(767)
    becomes
    COLUMN_NAME` varchar(255)

    and likewise for the rest.
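Rather than editing each definition by hand, a small script along these lines could rewrite the over-long key columns in the dump in one pass (a sketch, assuming the dump formatting shown above; the column list comes from the INFORMATION_SCHEMA queries):

```python
import re

def shrink_varchar(sql_line, columns, limit=255):
    """Shrink varchar(N > limit) down to varchar(limit), but only on
    lines that define one of the named index/key columns."""
    if not any(f"`{c}`" in sql_line for c in columns):
        return sql_line
    return re.sub(
        r"varchar\((\d+)\)",
        lambda m: f"varchar({min(int(m.group(1)), limit)})",
        sql_line,
    )

cols = {"COLUMN_NAME", "TOKEN_IDENT", "PARAM_KEY"}  # from the query above
print(shrink_varchar("`COLUMN_NAME` varchar(767) NOT NULL,", cols))
# -> `COLUMN_NAME` varchar(255) NOT NULL,
```

Restricting the rewrite to the named columns leaves unrelated long columns, such as PARAM_VALUE varchar(4000), untouched.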
    

    Second import attempt:

    [root@hp1 mysql]# mysql -uroot -p hive < hive_backup_20201203.sql
    Enter password: 
    ERROR 1709 (HY000) at line 548: Index column size too large. The maximum column size is 767 bytes.
    

    Still an error: the 767-byte limit applies to every indexed varchar column, not only primary keys, so under utf8 no indexed varchar may exceed 255 characters.
    Find all indexed varchar columns longer than 255:

    select t2.table_schema,t2.table_name,t2.column_name,t2.data_type,t2.character_maximum_length,t2.COLUMN_KEY,t2.column_type
    from information_schema.statistics t1
    inner join information_schema.COLUMNS t2
    on t1.table_schema = t2.TABLE_SCHEMA
    and t1.table_name = t2.TABLE_NAME
    and t1.COLUMN_NAME = t2.COLUMN_NAME
    where t1.table_schema='hive'
    and   t2.data_type = 'varchar' 
    and   t2.character_maximum_length > 255
    ;
    
    
    mysql> select t2.table_schema,t2.table_name,t2.column_name,t2.data_type,t2.character_maximum_length,t2.COLUMN_KEY,t2.column_type
        -> from information_schema.statistics t1
        -> inner join information_schema.COLUMNS t2
        -> on t1.table_schema = t2.TABLE_SCHEMA
        -> and t1.table_name = t2.TABLE_NAME
        -> and t1.COLUMN_NAME = t2.COLUMN_NAME
        -> where t1.table_schema='hive'
        -> and   t2.data_type = 'varchar' 
        -> and   t2.character_maximum_length > 255
        -> ;
    +--------------+-------------------+-----------------+-----------+--------------------------+------------+--------------+
    | table_schema | table_name        | column_name     | data_type | character_maximum_length | COLUMN_KEY | column_type  |
    +--------------+-------------------+-----------------+-----------+--------------------------+------------+--------------+
    | hive         | COLUMNS_V2        | COLUMN_NAME     | varchar   |                      767 | PRI        | varchar(767) |
    | hive         | DELEGATION_TOKENS | TOKEN_IDENT     | varchar   |                      767 | PRI        | varchar(767) |
    | hive         | INDEX_PARAMS      | PARAM_KEY       | varchar   |                      256 | PRI        | varchar(256) |
    | hive         | KEY_CONSTRAINTS   | CONSTRAINT_NAME | varchar   |                      400 | PRI        | varchar(400) |
    | hive         | PARTITIONS        | PART_NAME       | varchar   |                      767 | MUL        | varchar(767) |
    | hive         | PARTITION_EVENTS  | PARTITION_NAME  | varchar   |                      767 | MUL        | varchar(767) |
    | hive         | PARTITION_PARAMS  | PARAM_KEY       | varchar   |                      256 | PRI        | varchar(256) |
    | hive         | PART_COL_PRIVS    | COLUMN_NAME     | varchar   |                      767 |            | varchar(767) |
    | hive         | PART_COL_STATS    | TABLE_NAME      | varchar   |                      256 |            | varchar(256) |
    | hive         | PART_COL_STATS    | PARTITION_NAME  | varchar   |                      767 |            | varchar(767) |
    | hive         | PART_COL_STATS    | COLUMN_NAME     | varchar   |                      767 |            | varchar(767) |
    | hive         | SD_PARAMS         | PARAM_KEY       | varchar   |                      256 | PRI        | varchar(256) |
    | hive         | SERDE_PARAMS      | PARAM_KEY       | varchar   |                      256 | PRI        | varchar(256) |
    | hive         | TABLE_PARAMS      | PARAM_KEY       | varchar   |                      256 | PRI        | varchar(256) |
    | hive         | TAB_COL_STATS     | TABLE_NAME      | varchar   |                      256 |            | varchar(256) |
    | hive         | TAB_COL_STATS     | COLUMN_NAME     | varchar   |                      767 |            | varchar(767) |
    | hive         | TBLS              | TBL_NAME        | varchar   |                      256 | MUL        | varchar(256) |
    | hive         | TBL_COL_PRIVS     | COLUMN_NAME     | varchar   |                      767 |            | varchar(767) |
    +--------------+-------------------+-----------------+-----------+--------------------------+------------+--------------+
    18 rows in set (0.03 sec)
    
    

    Change each of these columns in hive_backup_20201203.sql one by one.

    Third import attempt:
    This time the import succeeds:

    [root@hp1 mysql]# mysql -uroot -p hive < hive_backup_20201203.sql
    Enter password: 
    [root@hp1 mysql]# 
    

    References:
    1.https://blog.csdn.net/leen0304/article/details/80534385
    2.https://www.cnblogs.com/ilifeilong/p/11435675.html
    3.https://blog.csdn.net/weixin_30475039/article/details/101817331


    Source: https://www.haomeiwen.com/subject/hekdwktx.html