
Troubleshooting garbled Chinese fields in Hive

Author: david9 | Published 2019-11-26 18:00

The Hive version is 2.1.1-cdh6.2.0 (download: hive-2.1.1-cdh6.2.0).

Problem

Some columns of ORC-format tables store Chinese text, and select queries against them occasionally come back garbled, like this:

    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 1   Cumulative CPU: 3.45 sec   HDFS Read: 5241 HDFS Write: 120 HDFS EC Read: 0 SUCCESS
    Total MapReduce CPU Time Spent: 3 seconds 450 msec
    OK
    ��� 2
    Time taken: 22.704 seconds, Fetched: 1 row(s)
    

Testing narrowed it down: any ORC table with a varchar or char column whose values contain Chinese characters garbles the output of certain select statements every time. A minimal reproduction:

    set hive.fetch.task.conversion=none;  -- force the query through YARN rather than a local fetch task
    set hive.execution.engine=mr;  -- force the MapReduce execution engine
    create table `test_luan_ma`(`name` varchar(100)) stored as orc;  -- create the test table
    insert into table test_luan_ma values('乱');  -- insert one row containing a Chinese character
    select name, case when name == null then 1 else 2 end as status from test_luan_ma;  -- this select garbles the output
    

Root-cause investigation

Logging was added to Hive's ql module to print out the MR execution path, which pinpointed the org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl class:

        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.createReaderFromFile(OrcInputFormat.java:319)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.<init>(OrcInputFormat.java:226)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1696)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:75)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:257)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:217)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:345)
        at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:702)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:175)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
    

Reading the code turns up two methods in this class that write data into the MapReduce value: next(Object previous) and nextBatch(VectorizedRowBatch theirBatch). nextBatch(...) delegates to the orc library and produces no garbling. next(Object previous), however, calls nextValue(ColumnVector vector, int row, TypeDescription schema, Object previous) to write out each column, and for varchar or char columns the write path is:

        if (vector.isRepeating) {
          row = 0;
        }
        if (vector.noNulls || !vector.isNull[row]) {
          HiveCharWritable result;
          if (previous == null || previous.getClass() != HiveCharWritable.class) {
            result = new HiveCharWritable();
          } else {
            result = (HiveCharWritable) previous;
          }
          BytesColumnVector bytes = (BytesColumnVector) vector;
          result.set(bytes.toString(row), size);  // toString(row) decodes the raw bytes -- see below
          return result;
        } else {
          return null;
        }
    

The interesting piece is the BytesColumnVector class: its toString method is called, and stepping into it shows that it constructs the String directly, without specifying a charset:

      public String toString(int row) {
        if (isRepeating) {
          row = 0;
        }
        if (noNulls || !isNull[row]) {
          return new String(vector[row], start[row], length[row]);
        } else {
          return null;
        }
      }
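
That unqualified new String(...) is easy to test in isolation. The sketch below (DefaultCharsetDemo is a made-up name, not Hive code) feeds UTF-8 bytes, as stored in ORC, through the same kind of call; on JDK 8, running it with java -Dfile.encoding=US-ASCII DefaultCharsetDemo reproduces the replacement characters:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    // Illustrative standalone demo (not Hive code): the column bytes are UTF-8,
    // but new String(byte[]) decodes them with the platform default charset.
    public class DefaultCharsetDemo {
        public static void main(String[] args) {
            byte[] utf8 = "乱".getBytes(StandardCharsets.UTF_8);  // bytes as stored in ORC
            String decoded = new String(utf8);  // what toString(row) effectively does
            System.out.println("defaultCharset: " + Charset.defaultCharset().name());
            System.out.println("decoded:        " + decoded);  // garbled when the default is US-ASCII
            System.out.println("explicit UTF-8: " + new String(utf8, StandardCharsets.UTF_8));  // always correct
        }
    }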
    

Suspecting this is where the problem lies, continue into the JDK's decoding path:

        static char[] decode(byte[] ba, int off, int len) {
            String csn = Charset.defaultCharset().name();
            try {
                // use charset name decode() variant which provides caching.
                return decode(csn, ba, off, len);
            } catch (UnsupportedEncodingException x) {
                warnUnsupportedCharset(csn);
            }
            ...
    

String obtains the JVM's default charset and decodes with it directly (Charset.defaultCharset() is initialized from the file.encoding system property at JVM startup). To see what that default actually is inside the MR task, add logging to BytesColumnVector:

        ...
        if (noNulls || !isNull[row]) {
          String str = new String(vector[row], start[row], length[row]);
          LOG.info("@@@ str is {}" , str);
          LOG.info("@@@ defaultCharset is {}" , Charset.defaultCharset().name());
          LOG.info("@@@ file.encoding is {}" , System.getProperty("file.encoding"));
          return str;
         ...
      }
    

Sure enough, the default charset is US-ASCII, which is what garbles the string:

    ...
    2019-11-27 10:18:35,241 INFO [main] org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector: @@@ str is ???
    2019-11-27 10:18:35,241 INFO [main] org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector: @@@ defaultCharset is US-ASCII
    2019-11-27 10:18:35,241 INFO [main] org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector: @@@ file.encoding is ANSI_X3.4-1968
    ...
    

Strangely, a plain Java program run on the same cluster reports UTF-8 as the default charset; why it becomes US-ASCII inside the MR task is not obvious. The most likely cause is the container launch environment: YARN builds the child JVM's environment from a whitelist rather than inheriting the shell's, so LANG/LC_ALL are typically absent, the JVM falls back to the POSIX locale, and file.encoding comes out as ANSI_X3.4-1968 (i.e. US-ASCII), exactly as the log above shows.
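
To pin down where the default comes from, the same properties can be printed both from a login shell and from inside a YARN container (for example by packaging the class into a trivial MR job). CharsetProbe is a made-up name for this sketch:

    import java.nio.charset.Charset;

    // Illustrative probe (not Hive code): print the charset settings a JVM
    // derives from its launch environment.
    public class CharsetProbe {
        public static void main(String[] args) {
            System.out.println("defaultCharset = " + Charset.defaultCharset().name());
            System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
            // With LANG/LC_ALL absent, glibc falls back to the POSIX locale and
            // the JVM reports file.encoding as ANSI_X3.4-1968 (US-ASCII).
            System.out.println("LANG   = " + System.getenv("LANG"));
            System.out.println("LC_ALL = " + System.getenv("LC_ALL"));
        }
    }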

Solutions

Option 1: configure the YARN MR task launch options to specify UTF-8 encoding.


[screenshot: YARN MR launch-option configuration]
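
Since the screenshot's exact contents are unknown, the sketch below is an assumption built on the standard MapReduce2 property names; note that setting these properties replaces whatever JVM flags the cluster already passes (heap sizes and so on), so append -Dfile.encoding=UTF-8 to the existing values rather than copying these lines verbatim:

    -- session-level sketch; property names are standard MR2 settings
    set mapreduce.map.java.opts=-Dfile.encoding=UTF-8;
    set mapreduce.reduce.java.opts=-Dfile.encoding=UTF-8;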

Option 2: force file.encoding to UTF-8 in the system environment, so that every newly started Java process defaults to UTF-8 (each JVM confirms it on stderr with "Picked up JAVA_TOOL_OPTIONS: ..."):

    export JAVA_TOOL_OPTIONS='-Dfile.encoding=UTF-8'
    

Option 3: for UTF-8 text, use string as the column type instead of varchar/char; the string type is encoded and decoded as UTF-8 throughout Hive's processing, as the variant repro below illustrates.
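
As a check, the earlier repro can be re-run with a string column (test_luan_ma_str is a made-up name); per the behavior described above, the select comes back intact:

    create table `test_luan_ma_str`(`name` string) stored as orc;
    insert into table test_luan_ma_str values('乱');
    select name, case when name == null then 1 else 2 end as status from test_luan_ma_str;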

Option 4: patch BytesColumnVector in Hive's storage-api module to decode with UTF-8 explicitly, recompile the storage-api and ql modules (ql depends on storage-api, so build storage-api first), replace hive-storage-api-2.1.1-cdh6.2.0.jar and hive-exec-2.1.1-cdh6.2.0.jar on the cluster, and restart it:

      // requires: import java.nio.charset.StandardCharsets;
      public String toString(int row) {
        if (isRepeating) {
          row = 0;
        }
        if (noNulls || !isNull[row]) {
          return new String(vector[row], start[row], length[row], StandardCharsets.UTF_8);
        } else {
          return null;
        }
      }
    

Option 5: use a file format such as textfile or parquet; their readers encode and decode with UTF-8 by default.
