Text Encoding in Java Memory

Author: SpaceCat | Published 2020-11-01 18:58

    1. Introduction to Encoding

    1.1 Key concepts: character, character set, coded character set, code point, code unit, and character encoding scheme

    First, we need to be clear about what character, character set, coded character set, code point, code unit, and character encoding scheme each mean.

    A character is just an abstract minimal unit of text. It doesn't have a fixed shape (that would be a glyph), and it doesn't have a value. "A" is a character, and so is "€", the symbol for the common currency of Germany, France, and numerous other European countries.
    Put simply, a character is the smallest abstract unit of text. It has no concrete shape (shape belongs to the glyph). "A" is a character, and so is "€".
    A character set is a collection of characters. For example, the Han characters are the characters originally invented by the Chinese, which have been used to write Chinese, Japanese, Korean, and Vietnamese.
    That is, a character set is simply a collection of characters.
    A coded character set is a character set where each character has been assigned a unique number. At the core of the Unicode standard is a coded character set that assigns the letter "A" the number 0041(16) and the letter "€" the number 20AC(16). The Unicode standard always uses hexadecimal numbers, and writes them with the prefix "U+", so the number for "A" is written as "U+0041".
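    A coded character set, then, is a character set in which every character has been assigned a unique number. The core of the Unicode standard is such a set: "A" is 0041 (hexadecimal) and "€" is 20AC (hexadecimal). Unicode writes these numbers in hexadecimal with the prefix "U+", so "A" is written U+0041.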
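
As a small illustration of the "U+" notation, the sketch below formats a number the way the Unicode standard writes code points. The class and method names (CodePointNotation, toUPlus) are made up for this example and are not part of any library.

```java
class CodePointNotation {
    // Formats a code point in "U+XXXX" notation: uppercase hex, at least four digits.
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(toUPlus('A'));       // U+0041
        System.out.println(toUPlus('\u20AC'));  // U+20AC (the euro sign)
        System.out.println(toUPlus(0x10400));   // U+10400
    }
}
```

The euro sign is written as the escape '\u20AC' so that the example does not depend on the encoding of the source file.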
    Code points are the numbers that can be used in a coded character set. A coded character set defines a range of valid code points, but doesn't necessarily assign characters to all those code points. The valid code points for Unicode are U+0000 to U+10FFFF. Unicode 4.0 assigns characters to 96,382 of these more than a million code points.
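    A code point, in other words, is a number that can be used in a coded character set. The set defines the range of valid code points, but not every code point has been assigned a character (some are reserved for future extension). Unicode's code points run from U+0000 to U+10FFFF, more than a million in total; as of Unicode 4.0, 96,382 of them had been assigned characters.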
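
The difference between a valid code point and an assigned one can be checked from Java. The following is a minimal sketch; the class name CodePointRange and the helper codeSpaceSize are invented for illustration, while the Character constants and methods are standard.

```java
class CodePointRange {
    // Number of code points in the Unicode code space, U+0000 through U+10FFFF.
    static int codeSpaceSize() {
        return Character.MAX_CODE_POINT - Character.MIN_CODE_POINT + 1;
    }

    public static void main(String[] args) {
        System.out.printf("Code space: U+%04X to U+%04X (%d code points)%n",
                Character.MIN_CODE_POINT, Character.MAX_CODE_POINT, codeSpaceSize());
        // "Valid" means inside the code space, not "assigned to a character".
        System.out.println(Character.isValidCodePoint(0x10FFFF)); // true
        System.out.println(Character.isValidCodePoint(0x110000)); // false
        // isDefined asks whether the Unicode version bundled with the running JDK
        // assigns a character to the code point.
        System.out.println(Character.isDefined(0x0041));          // true
    }
}
```

Because isDefined depends on the Unicode version shipped with the JDK, its answer for a currently unassigned code point may change between Java releases.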
    Supplementary characters are characters with code points in the range U+10000 to U+10FFFF, that is, those characters that could not be represented in the original 16-bit design of Unicode. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Thus, each Unicode character is either in the BMP or a supplementary character.
    To restate: the characters with code points U+0000 to U+FFFF form the Basic Multilingual Plane (BMP). Characters with code points U+10000 to U+10FFFF could not be represented in Unicode's original 16-bit design and are called supplementary characters.
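
Java exposes this BMP/supplementary split directly on the Character class. A short sketch, in which the class name PlaneCheck and the method describe are hypothetical:

```java
class PlaneCheck {
    // Classifies a number as a BMP code point, a supplementary code point, or invalid.
    static String describe(int codePoint) {
        if (Character.isBmpCodePoint(codePoint)) {
            return "BMP";
        }
        if (Character.isSupplementaryCodePoint(codePoint)) {
            return "supplementary";
        }
        return "invalid";
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0041));   // BMP
        System.out.println(describe(0xFFFF));   // BMP
        System.out.println(describe(0x10000));  // supplementary
        System.out.println(describe(0x10FFFF)); // supplementary
        System.out.println(describe(0x110000)); // invalid
    }
}
```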
    A character encoding scheme is a mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units. The most commonly used code units are bytes, but 16-bit or 32-bit integers can also be used for internal processing. UTF-32, UTF-16, and UTF-8 are character encoding schemes for the coded character set of the Unicode standard.
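    A character encoding scheme, then, is a rule that maps the code points of one or more coded character sets to sequences of one or more fixed-width code units. The most common code unit is the byte, but 16-bit and 32-bit integers are also often used for internal processing. UTF-32, UTF-16, and UTF-8 are the character encoding schemes of the Unicode standard.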
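
To make the idea of an encoding scheme concrete, the sketch below encodes the same three code points (U+0041, U+20AC, U+10400) with two of the schemes named above and prints the resulting bytes. EncodingSchemes and hex are illustrative names only; the charsets come from java.nio.charset.StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

class EncodingSchemes {
    // Renders a byte array as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "A\u20AC\uD801\uDC00"; // A, the euro sign, and U+10400
        // UTF-8: 1 + 3 + 4 bytes
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));    // 41 E2 82 AC F0 90 90 80
        // UTF-16 (big-endian, no byte order mark): 2 + 2 + 4 bytes
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16BE))); // 00 41 20 AC D8 01 DC 00
    }
}
```

Both encodings happen to need 8 bytes for this string, but they split them differently: UTF-8 uses 8-bit code units and UTF-16 uses 16-bit ones. UTF-32 would use one 32-bit code unit per code point, 12 bytes in total.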

    1.2 Diagram: distinguishing character, character set, coded character set, code point, code unit, and character encoding scheme

    [Figure: diagram distinguishing the concepts above]

    2. Java Code Examples

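    Java represents text in memory as UTF-16. The char type corresponds to a code unit, so a code unit in Java is 16 bits, that is, two bytes. A code point occupies one or two code units: one for a BMP character, two (a surrogate pair) for a supplementary character.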
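
One practical consequence is that String.length() counts code units, not code points. A minimal sketch, with the class name LengthVsCodePoints and its helpers chosen only for illustration:

```java
class LengthVsCodePoints {
    // Number of UTF-16 code units (chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String text = "A\u20AC\uD801\uDC00"; // A, the euro sign, and U+10400
        System.out.println(codeUnits(text));  // 4
        System.out.println(codePoints(text)); // 3
    }
}
```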

    2.1 Demonstrating the concepts in code: code point, code unit, and char

    package com.lfqy.trying.javaenc;
    
    /**
     * Created by chengxia on 2020/10/25.
     */
    public class CharsetTest {
        public static void main(String []args){
            // ------------- Defining and printing a char (BMP characters) ------------------
            char codeUnit = 'A';
            System.out.println(codeUnit); // A
            codeUnit = '\u0041'; // the hexadecimal escape for 'A'; equivalent to the assignment above
            System.out.println(codeUnit); // A
            codeUnit = '€';
            System.out.println(codeUnit); // €
            codeUnit = '\u20AC';
            System.out.println(codeUnit); // €
    
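            // ------------- Defining and printing a char (supplementary character) ------------------
            // Define the first (high) and second (low) code unit of one supplementary character
            char codeUnit1 = '\uD801', codeUnit2 = '\uDC00';
            System.out.println(codeUnit1); // ?
            System.out.println(codeUnit2); // ? (each code unit is only half of a surrogate pair, not a complete character, so it cannot be printed on its own)
            String tempStr = "" + codeUnit1 + codeUnit2; // concatenating both code units into one String yields a printable character
            System.out.println(tempStr); // 𐐀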
    
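            // ------------- Printing a code point ------------------
            // A code point is just a number
            int codePoint = 0x0041;
            // %0#6X prints uppercase hexadecimal with a 0X prefix, zero-padded to a minimum width of 6 (prefix included); %c prints the character for a code point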
            System.out.printf("Code point %0#6X is encoded for %c.%n", codePoint, codePoint);// Code point 0X0041 is encoded for A.
            codePoint = 0x10400;
            System.out.printf("Code point %0#6X is encoded for %c.%n", codePoint, codePoint);// Code point 0X10400 is encoded for 𐐀.
    
            // ------------- Printing the code units that make up a code point ------------------
            char [] cpchs = Character.toChars(codePoint);
            System.out.printf("In JVM inner representation(UTF-16), code point %0#6X contains code unit %0#6X and %0#6X .%n", codePoint, (int)cpchs[0], (int)cpchs[1]);// In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00 .
        }
    }
    

    The program prints:

    A
    A
    €
    €
    ?
    ?
    𐐀
    Code point 0X0041 is encoded for A.
    Code point 0X10400 is encoded for 𐐀.
    In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00 .
    
    

    This example distinguishes the concepts introduced above and shows how to print each of them in Java.
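
The pair 0XD801 and 0XDC00 printed for code point 0X10400 follows from the UTF-16 rule for supplementary characters: subtract 0x10000, then put the upper 10 bits into a high surrogate (starting at 0xD800) and the lower 10 bits into a low surrogate (starting at 0xDC00). The sketch below spells out that arithmetic; SurrogateMath and its two methods are hypothetical names, and the standard library already offers the same results through Character.highSurrogate, Character.lowSurrogate, and Character.toCodePoint.

```java
class SurrogateMath {
    // Splits a supplementary code point into its UTF-16 high and low surrogates.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // lower 10 bits
        return new char[] { high, low };
    }

    // Recombines a surrogate pair into the code point it encodes.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10400);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);   // D801 DC00
        System.out.printf("%X%n", fromSurrogates(pair[0], pair[1]));      // 10400
        // The library methods agree with the manual arithmetic.
        System.out.println(pair[0] == Character.highSurrogate(0x10400)); // true
        System.out.println(pair[1] == Character.lowSurrogate(0x10400));  // true
    }
}
```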

    2.2 Converting a code point to a String

    A code point can be converted to a String. The two methods below show how.
    Method 1:

    private static String codePointToString(int codePoint) {
        return new String(Character.toChars(codePoint));
    }
    

    Method 2:

    private static String codePointToString(int codePoint) {
        StringBuilder stringOut = new StringBuilder();
        stringOut.appendCodePoint(codePoint);
        return stringOut.toString();
    }
    

    The first method is more concise.
    A complete example, com.lfqy.trying.javaenc.CodePoint2String:

    package com.lfqy.trying.javaenc;
    
    /**
     * Created by chengxia on 2020/11/1.
     */
    public class CodePoint2String {
        private static String codePointToString1(int codePoint) {
            return new String(Character.toChars(codePoint));
        }
    
        private static String codePointToString2(int codePoint) {
            StringBuilder stringOut = new StringBuilder();
            stringOut.appendCodePoint(codePoint);
            return stringOut.toString();
        }
    
        public static void main(String []args){
    
            int codePoint = 0x10400;
            System.out.printf("Code point %0#6X is encoded for %c.%n", codePoint, codePoint);// Code point 0X10400 is encoded for 𐐀.
    
            // ------------- Printing the code units that make up a code point ------------------
            char [] cpchs = Character.toChars(codePoint);
            System.out.printf("In JVM inner representation(UTF-16), code point %0#6X contains code unit %0#6X and %0#6X .%n", codePoint, (int)cpchs[0], (int)cpchs[1]);// In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00.
    
            String resStr = null;
            // Method 1
            resStr = codePointToString1(codePoint);
            System.out.printf("Method 1, codePointToString1, resStr contains chars: %0#6X and %0#6X .%n",  (int)resStr.charAt(0), (int)resStr.charAt(1));
    
            // Method 2
            resStr = codePointToString2(codePoint);
            System.out.printf("Method 2, codePointToString2, resStr contains chars: %0#6X and %0#6X .%n",  (int)resStr.charAt(0), (int)resStr.charAt(1));
        }
    }
    

    The output is:

    Code point 0X10400 is encoded for 𐐀.
    In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00 .
    Method 1, codePointToString1, resStr contains chars: 0XD801 and 0XDC00 .
    Method 2, codePointToString2, resStr contains chars: 0XD801 and 0XDC00 .
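
A third option, not used in the article's code, is the String constructor that takes an array of code points. The sketch below shows it and also what happens with a number outside the Unicode code space; CodePointToStringAlt and fromIntArray are names made up for this example.

```java
class CodePointToStringAlt {
    // Builds a String from a single code point via the String(int[], int, int) constructor.
    static String fromIntArray(int codePoint) {
        return new String(new int[] { codePoint }, 0, 1);
    }

    public static void main(String[] args) {
        String s = fromIntArray(0x10400);
        System.out.println(s.length());                                   // 2
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // D801 DC00
        System.out.println(fromIntArray(0x41));                           // A

        // Numbers above U+10FFFF are not code points and are rejected.
        try {
            fromIntArray(0x110000);
        } catch (IllegalArgumentException e) {
            System.out.println("invalid code point");
        }
    }
}
```

Character.toChars and StringBuilder.appendCodePoint reject invalid code points with the same exception. On Java 11 and later, Character.toString(int codePoint) is a shorter way to get the same String.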
    
    

    2.3 Iterating over the characters of a String

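    Printing each char directly is wrong, because a supplementary character spans two chars. The correct approach is to iterate over the code points one by one. Sample code: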

    package com.lfqy.trying.javaenc;
    
    /**
     * Created by chengxia on 2020/11/1.
     */
    public class TraverseString {
    
        public static void main(String []args){
            String testStr = "\u0041\u20AC\uD801\uDC00"; //"A€𐐀"
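            // Method 1: print each char of the String directly. Incorrect: a supplementary character is split into two halves that cannot be printed
            System.out.println("Method 1:");
            for(int i = 0; i < testStr.length(); i++){
                System.out.println(testStr.charAt(i));
            }
            // Method 2: iterate forward over the code points. Correct
            System.out.println("Method 2:");
            for(int i = 0; i < testStr.length(); ){
                int cp = testStr.codePointAt(i);
                System.out.printf("%c%n", cp); // %c prints the character for a code point
                if(Character.isSupplementaryCodePoint(cp)){ // a supplementary code point occupies two code units, so advance i by 2
                    i = i + 2;
                }else{
                    i++;
                }
            }
            // Method 3: iterate backward over the code points. Correct for well-formed strings, where every surrogate belongs to a pair
            System.out.println("Method 3:");
            for(int i = testStr.length(); i > 0; ){
                i--;
                if(Character.isSurrogate(testStr.charAt(i))){ // landed on the low half of a pair, so step back to the high half
                    i--;
                }
                int cp = testStr.codePointAt(i);
                System.out.printf("%c%n", cp); // %c prints the character for a code point
            }
        }
    }
    

    The output is:

    Method 1:
    A
    €
    ?
    ?
    Method 2:
    A
    €
    𐐀
    Method 3:
    𐐀
    €
    A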
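
The forward traversal can also be written without the explicit supplementary check. The sketch below shows two alternatives: advancing by Character.charCount(cp), and the String.codePoints() stream available since Java 8. The class and method names (CodePointIteration, withCharCount, withStream) are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointIteration {
    // Forward traversal that advances by the number of code units each code point uses.
    static List<Integer> withCharCount(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return out;
    }

    // The same traversal expressed with the code point stream.
    static int[] withStream(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String testStr = "\u0041\u20AC\uD801\uDC00";
        for (int cp : withCharCount(testStr)) {
            System.out.printf("U+%04X%n", cp); // U+0041, U+20AC, U+10400
        }
        System.out.println(withStream(testStr).length); // 3
    }
}
```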
    
    

    About the Character.isSurrogate(char ch) method:

    The Character.isSurrogate(char ch) java method determines if the given char value is a Unicode surrogate code unit. Such values do not represent characters by themselves, but are used in the representation of supplementary characters in the UTF-16 encoding.
    A char value is a surrogate code unit if and only if it is either a low-surrogate code unit or a high-surrogate code unit.

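    In other words, the method returns true whenever the code unit being tested is one half of the surrogate pair that encodes a supplementary character, whether it is the high surrogate or the low surrogate.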
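
Because isSurrogate does not say which half it found, a backward traversal built on it relies on the string being well formed. One possible alternative, sketched here under the assumption that unpaired surrogates should simply be reported as they are, is String.codePointBefore, which only combines a low surrogate with a preceding high surrogate. ReverseCodePoints and reverse are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class ReverseCodePoints {
    // Collects the code points of s from last to first.
    static List<Integer> reverse(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = s.length(); i > 0; ) {
            int cp = s.codePointBefore(i); // pairs up surrogates only when both halves are present
            out.add(cp);
            i -= Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        for (int cp : reverse("\u0041\u20AC\uD801\uDC00")) {
            System.out.printf("U+%04X%n", cp); // U+10400, U+20AC, U+0041
        }
        // The two halves can be told apart with the more specific checks.
        System.out.println(Character.isHighSurrogate('\uD801')); // true
        System.out.println(Character.isLowSurrogate('\uDC00'));  // true
        System.out.println(Character.isSurrogate('A'));          // false
    }
}
```

With a malformed input such as a low surrogate at index 0, this loop returns the lone surrogate as its own value instead of stepping to index -1.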
