Notes:
Hive version 2.1.1
MySQL version 5.6.49
1. Problem description
Chinese text in table comments is displayed as garbled characters after creating a table:
hive> create table t1 (id int comment 'id',name string comment "名字",login_date timestamp comment "登陆时间") comment "登陆日志表" stored as textfile;
OK
Time taken: 0.079 seconds
hive> show create table t1;
OK
CREATE TABLE `t1`(
`id` int COMMENT 'id',
`name` string COMMENT '??',
`login_date` timestamp COMMENT '????')
COMMENT '?????'
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://nameservice1/user/hive/warehouse/test.db/t1'
TBLPROPERTIES (
'transient_lastDdlTime'='1606979085')
Time taken: 0.052 seconds, Fetched: 15 row(s)
2. Solution
The garbled text is in the metadata, and Hive stores its metadata in MySQL, so inspect the character set of the relevant MySQL tables:
mysql> use hive;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql>
mysql> show create table COLUMNS_V2\G
*************************** 1. row ***************************
Table: COLUMNS_V2
Create Table: CREATE TABLE `COLUMNS_V2` (
`CD_ID` bigint(20) NOT NULL,
`COMMENT` varchar(256) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
`COLUMN_NAME` varchar(767) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
`TYPE_NAME` mediumtext,
`INTEGER_IDX` int(11) NOT NULL,
PRIMARY KEY (`CD_ID`,`COLUMN_NAME`),
KEY `COLUMNS_V2_N49` (`CD_ID`),
CONSTRAINT `COLUMNS_V2_FK1` FOREIGN KEY (`CD_ID`) REFERENCES `CDS` (`CD_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.01 sec)
As the output shows, these columns are indeed latin1, the default character set in MySQL 5.6. MySQL did not switch its default to utf8mb4 until version 8.0, and the CDH installer creates the metastore tables with the database default, so for a new CDH deployment it is preferable to configure MySQL 8.0 or later.
Fix
Modify the character set of the relevant metastore tables.
Log in to MySQL and switch to the hive database:
use hive
Modifying the following three tables is sufficient:
alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
alter table TABLE_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;
alter table PARTITION_KEYS modify column PKEY_COMMENT varchar(4000) character set utf8;
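To keep the change reviewable, the three statements can first be collected into a small script and then fed to the metastore in one go; the script name below is illustrative:

```shell
# Collect the three ALTER statements into one reviewable script
# (file name is illustrative, not part of the original procedure).
cat > fix_hive_charset.sql <<'EOF'
ALTER TABLE COLUMNS_V2     MODIFY COLUMN COMMENT      varchar(256)  CHARACTER SET utf8;
ALTER TABLE TABLE_PARAMS   MODIFY COLUMN PARAM_VALUE  varchar(4000) CHARACTER SET utf8;
ALTER TABLE PARTITION_KEYS MODIFY COLUMN PKEY_COMMENT varchar(4000) CHARACTER SET utf8;
EOF

# After reviewing, apply it to the metastore (commented out here;
# requires a running MySQL with the hive database):
# mysql -uroot -p hive < fix_hive_charset.sql
```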
Drop and recreate the table, and the comments now display correctly:
> drop table t1;
OK
Time taken: 0.096 seconds
hive> create table t1 (id int comment 'id',name string comment "名字",login_date timestamp comment "登陆时间") comment "登陆日志表" stored as textfile;
OK
Time taken: 0.061 seconds
hive>
> show create table t1;
OK
CREATE TABLE `t1`(
`id` int COMMENT 'id',
`name` string COMMENT '名字',
`login_date` timestamp COMMENT '登陆时间')
COMMENT '登陆日志表'
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://nameservice1/user/hive/warehouse/test.db/t1'
TBLPROPERTIES (
'transient_lastDdlTime'='1606980419')
Time taken: 0.051 seconds, Fetched: 15 row(s)
3. Converting the entire hive database to utf8
The method above fixes this particular symptom, but other metastore tables still cannot store Chinese. To solve the problem once and for all, I converted the character set of the whole database to utf8.
3.1 Back up the hive database
Back up the hive database with mysqldump:
mysqldump -uroot -p --quick --extended-insert --default-character-set=latin1 hive > hive_backup_20201203.sql
3.2 Edit the backup file
In hive_backup_20201203.sql, change SET NAMES latin1 to SET NAMES utf8, and change DEFAULT CHARSET=latin1 in each table definition to DEFAULT CHARSET=utf8.
Before the change:
[root@hp1 mysql]# more hive_backup_20201203.sql
-- MySQL dump 10.13 Distrib 5.6.49, for Linux (x86_64)
--
-- Host: localhost Database: hive
-- ------------------------------------------------------
-- Server version 5.6.49
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES latin1 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
--
-- Table structure for table `AUX_TABLE`
--
DROP TABLE IF EXISTS `AUX_TABLE`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `AUX_TABLE` (
`MT_KEY1` varchar(128) NOT NULL,
`MT_KEY2` bigint(20) NOT NULL,
`MT_COMMENT` varchar(255) DEFAULT NULL,
PRIMARY KEY (`MT_KEY1`,`MT_KEY2`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Dumping data for table `AUX_TABLE`
--
LOCK TABLES `AUX_TABLE` WRITE;
/*!40000 ALTER TABLE `AUX_TABLE` DISABLE KEYS */;
/*!40000 ALTER TABLE `AUX_TABLE` ENABLE KEYS */;
UNLOCK TABLES;
--
-- Table structure for table `BUCKETING_COLS`
--
**(further output omitted)**
Apply the changes with sed:
sed -i "s/SET NAMES latin1/SET NAMES utf8/g" hive_backup_20201203.sql
sed -i "s/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/g" hive_backup_20201203.sql
sed -i "s/CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL/ /g" hive_backup_20201203.sql
sed -i "s/CHARACTER SET latin1 COLLATE latin1_bin NOT NULL/ /g" hive_backup_20201203.sql
sed -i "s/CHARACTER SET latin1 COLLATE latin1_bin/ /g" hive_backup_20201203.sql
Repeat until the following command returns no output:
more hive_backup_20201203.sql | grep latin
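As a sanity check, the same substitutions can be exercised on a small sample before touching the real dump; the sample lines below are copied from the dump shown earlier:

```shell
# Build a sample file containing each latin1 pattern seen in the dump.
printf '%s\n' \
  '/*!40101 SET NAMES latin1 */;' \
  '`COMMENT` varchar(256) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,' \
  ') ENGINE=InnoDB DEFAULT CHARSET=latin1;' > sample.sql

# Apply the same sed substitutions as above.
sed -i 's/SET NAMES latin1/SET NAMES utf8/g' sample.sql
sed -i 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/g' sample.sql
sed -i 's/CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL/ /g' sample.sql
sed -i 's/CHARACTER SET latin1 COLLATE latin1_bin NOT NULL/ /g' sample.sql
sed -i 's/CHARACTER SET latin1 COLLATE latin1_bin/ /g' sample.sql

# No line should mention latin any more.
grep latin sample.sql || echo "clean"   # prints "clean"
```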
After the change:
The character-set declarations have been updated:
[root@hp1 mysql]# more hive_backup_20201203.sql
-- MySQL dump 10.13 Distrib 5.6.49, for Linux (x86_64)
--
-- Host: localhost Database: hive
-- ------------------------------------------------------
-- Server version 5.6.49
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
--
-- Table structure for table `AUX_TABLE`
--
DROP TABLE IF EXISTS `AUX_TABLE`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `AUX_TABLE` (
`MT_KEY1` varchar(128) NOT NULL,
`MT_KEY2` bigint(20) NOT NULL,
`MT_COMMENT` varchar(255) DEFAULT NULL,
PRIMARY KEY (`MT_KEY1`,`MT_KEY2`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Dumping data for table `AUX_TABLE`
--
LOCK TABLES `AUX_TABLE` WRITE;
/*!40000 ALTER TABLE `AUX_TABLE` DISABLE KEYS */;
/*!40000 ALTER TABLE `AUX_TABLE` ENABLE KEYS */;
UNLOCK TABLES;
--
-- Table structure for table `BUCKETING_COLS`
--
**(further output omitted)**
3.3 Import the converted dump
Import attempt 1:
mysql -uroot -p hive < hive_backup_20201203.sql
The import fails:
[root@hp1 mysql]# mysql -uroot -p hive < hive_backup_20201203.sql
Enter password:
ERROR 1071 (42000) at line 548: Specified key was too long; max key length is 767 bytes
A varchar index key in MySQL (InnoDB, before 8.0) is limited to 767 bytes, which means 767 single-byte characters, 767/2 = 383 double-byte characters, or 767/3 = 255 three-byte characters.
GBK is a double-byte encoding, and MySQL's utf8 uses up to three bytes per character. MySQL 8.0 raised the limit from 767 to 3072 bytes.
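The arithmetic behind those limits can be checked directly; a 767-byte key holds at most floor(767 / bytes-per-character) characters:

```shell
# InnoDB index keys (MySQL before 8.0, COMPACT row format) are capped at 767 bytes.
# Maximum varchar length that fits, by bytes-per-character of the column charset:
echo "latin1 (1 byte/char): $((767 / 1)) chars"   # 767
echo "gbk    (2 bytes/char): $((767 / 2)) chars"  # 383
echo "utf8   (3 bytes/char): $((767 / 3)) chars"  # 255
```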
Find the varchar primary-key columns declared longer than 255 characters:
select table_schema,table_name,column_name,data_type,character_maximum_length,COLUMN_KEY
from INFORMATION_SCHEMA.COLUMNS
where COLUMN_KEY='PRI'
and table_schema = 'hive'
and data_type = 'varchar'
order by character_maximum_length;
mysql> select table_schema,table_name,column_name,data_type,character_maximum_length,COLUMN_KEY
-> from INFORMATION_SCHEMA.COLUMNS
-> where COLUMN_KEY='PRI'
-> and table_schema = 'hive'
-> and data_type = 'varchar'
-> order by character_maximum_length;
+--------------+-------------------------+---------------+-----------+--------------------------+------------+
| table_schema | table_name | column_name | data_type | character_maximum_length | COLUMN_KEY |
+--------------+-------------------------+---------------+-----------+--------------------------+------------+
| hive | AUX_TABLE | MT_KEY1 | varchar | 128 | PRI |
| hive | PARTITION_KEYS | PKEY_NAME | varchar | 128 | PRI |
| hive | NUCLEUS_TABLES | CLASS_NAME | varchar | 128 | PRI |
| hive | TYPE_FIELDS | FIELD_NAME | varchar | 128 | PRI |
| hive | DATABASE_PARAMS | PARAM_KEY | varchar | 180 | PRI |
| hive | METASTORE_DB_PROPERTIES | PROPERTY_KEY | varchar | 255 | PRI |
| hive | SEQUENCE_TABLE | SEQUENCE_NAME | varchar | 255 | PRI |
| hive | TABLE_PARAMS | PARAM_KEY | varchar | 256 | PRI |
| hive | SERDE_PARAMS | PARAM_KEY | varchar | 256 | PRI |
| hive | SD_PARAMS | PARAM_KEY | varchar | 256 | PRI |
| hive | PARTITION_PARAMS | PARAM_KEY | varchar | 256 | PRI |
| hive | INDEX_PARAMS | PARAM_KEY | varchar | 256 | PRI |
| hive | COLUMNS_V2 | COLUMN_NAME | varchar | 767 | PRI |
| hive | DELEGATION_TOKENS | TOKEN_IDENT | varchar | 767 | PRI |
+--------------+-------------------------+---------------+-----------+--------------------------+------------+
14 rows in set (0.01 sec)
Locate these columns in hive_backup_20201203.sql and reduce their declared lengths; anything above 255 must become 255. For example,
`COLUMN_NAME` varchar(767)
becomes
`COLUMN_NAME` varchar(255)
and likewise for the others.
Import attempt 2:
[root@hp1 mysql]# mysql -uroot -p hive < hive_backup_20201203.sql
Enter password:
ERROR 1709 (HY000) at line 548: Index column size too large. The maximum column size is 767 bytes.
Still an error. It turns out the 767-byte limit applies to any indexed varchar column, not just primary keys, so under utf8 no indexed varchar may exceed 255 characters.
Find all indexed varchar columns longer than 255:
select t2.table_schema,t2.table_name,t2.column_name,t2.data_type,t2.character_maximum_length,t2.COLUMN_KEY,t2.column_type
from information_schema.statistics t1
inner join information_schema.COLUMNS t2
on t1.table_schema = t2.TABLE_SCHEMA
and t1.table_name = t2.TABLE_NAME
and t1.COLUMN_NAME = t2.COLUMN_NAME
where t1.table_schema='hive'
and t2.data_type = 'varchar'
and t2.character_maximum_length > 255
;
mysql> select t2.table_schema,t2.table_name,t2.column_name,t2.data_type,t2.character_maximum_length,t2.COLUMN_KEY,t2.column_type
-> from information_schema.statistics t1
-> inner join information_schema.COLUMNS t2
-> on t1.table_schema = t2.TABLE_SCHEMA
-> and t1.table_name = t2.TABLE_NAME
-> and t1.COLUMN_NAME = t2.COLUMN_NAME
-> where t1.table_schema='hive'
-> and t2.data_type = 'varchar'
-> and t2.character_maximum_length > 255
-> ;
+--------------+-------------------+-----------------+-----------+--------------------------+------------+--------------+
| table_schema | table_name | column_name | data_type | character_maximum_length | COLUMN_KEY | column_type |
+--------------+-------------------+-----------------+-----------+--------------------------+------------+--------------+
| hive | COLUMNS_V2 | COLUMN_NAME | varchar | 767 | PRI | varchar(767) |
| hive | DELEGATION_TOKENS | TOKEN_IDENT | varchar | 767 | PRI | varchar(767) |
| hive | INDEX_PARAMS | PARAM_KEY | varchar | 256 | PRI | varchar(256) |
| hive | KEY_CONSTRAINTS | CONSTRAINT_NAME | varchar | 400 | PRI | varchar(400) |
| hive | PARTITIONS | PART_NAME | varchar | 767 | MUL | varchar(767) |
| hive | PARTITION_EVENTS | PARTITION_NAME | varchar | 767 | MUL | varchar(767) |
| hive | PARTITION_PARAMS | PARAM_KEY | varchar | 256 | PRI | varchar(256) |
| hive | PART_COL_PRIVS | COLUMN_NAME | varchar | 767 | | varchar(767) |
| hive | PART_COL_STATS | TABLE_NAME | varchar | 256 | | varchar(256) |
| hive | PART_COL_STATS | PARTITION_NAME | varchar | 767 | | varchar(767) |
| hive | PART_COL_STATS | COLUMN_NAME | varchar | 767 | | varchar(767) |
| hive | SD_PARAMS | PARAM_KEY | varchar | 256 | PRI | varchar(256) |
| hive | SERDE_PARAMS | PARAM_KEY | varchar | 256 | PRI | varchar(256) |
| hive | TABLE_PARAMS | PARAM_KEY | varchar | 256 | PRI | varchar(256) |
| hive | TAB_COL_STATS | TABLE_NAME | varchar | 256 | | varchar(256) |
| hive | TAB_COL_STATS | COLUMN_NAME | varchar | 767 | | varchar(767) |
| hive | TBLS | TBL_NAME | varchar | 256 | MUL | varchar(256) |
| hive | TBL_COL_PRIVS | COLUMN_NAME | varchar | 767 | | varchar(767) |
+--------------+-------------------+-----------------+-----------+--------------------------+------------+--------------+
18 rows in set (0.03 sec)
Change each of these columns in hive_backup_20201203.sql as well.
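Instead of editing each definition by hand, one option is a single sed pass that shrinks every varchar wider than 255 down to varchar(255). This is a hedged sketch, not part of the original procedure: the oversized lengths observed above are 256, 400 and 767, all covered by the 256-999 regex below (varchar(4000) values such as PARAM_VALUE are deliberately not matched), and blanket shrinking assumes none of the affected columns actually stores values longer than 255 characters:

```shell
# Demonstrated on a sample file here; on the real dump the target would be
# hive_backup_20201203.sql.
printf '%s\n' \
  '`COLUMN_NAME` varchar(767) NOT NULL,' \
  '`CONSTRAINT_NAME` varchar(400) DEFAULT NULL,' \
  '`MT_KEY1` varchar(128) NOT NULL,' > cols.sql

# Rewrite varchar(256)..varchar(999) to varchar(255); smaller lengths are left alone.
sed -i -E 's/varchar\((25[6-9]|2[6-9][0-9]|[3-9][0-9]{2})\)/varchar(255)/g' cols.sql

cat cols.sql   # 767 and 400 become 255; 128 is untouched
```

Reviewing the result with `grep -n 'varchar([0-9]*)' cols.sql` before importing is still advisable.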
Import attempt 3:
This time the import succeeds:
[root@hp1 mysql]# mysql -uroot -p hive < hive_backup_20201203.sql
Enter password:
[root@hp1 mysql]#