Text Encoding in Java Memory

Author: SpaceCat | Published: 2020-11-01 18:58

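1. Introduction to Encoding

1.1 Concepts: character, character set, coded character set, code point, code unit, and character encoding scheme

Start by pinning down what these terms mean: character, character set, coded character set, code point, code unit, and character encoding scheme.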

A character is just an abstract minimal unit of text. It doesn't have a fixed shape (that would be a glyph), and it doesn't have a value. "A" is a character, and so is "€", the symbol for the common currency of Germany, France, and numerous other European countries.
In short: a character is the smallest abstract unit of text. It has no concrete shape (shape belongs to the glyph). "A" is a character, and so is "€".
A character set is a collection of characters. For example, the Han characters are the characters originally invented by the Chinese, which have been used to write Chinese, Japanese, Korean, and Vietnamese.
In short: a character set is a collection of characters.
A coded character set is a character set where each character has been assigned a unique number. At the core of the Unicode standard is a coded character set that assigns the letter "A" the number 0041(16) and the letter "€" the number 20AC(16). The Unicode standard always uses hexadecimal numbers, and writes them with the prefix "U+", so the number for "A" is written as "U+0041".
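In short: a coded character set assigns every character a unique number. The core of the Unicode standard is such a set, in which "A" maps to hexadecimal 0041 and "€" to hexadecimal 20AC. Unicode writes these numbers in hexadecimal with the prefix "U+", so "A" is U+0041.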
Code points are the numbers that can be used in a coded character set. A coded character set defines a range of valid code points, but doesn't necessarily assign characters to all those code points. The valid code points for Unicode are U+0000 to U+10FFFF. Unicode 4.0 assigns characters to 96,382 of these more than a million code points.
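In short: a code point is a number that can be used in a coded character set. The set defines the range of valid code points, but not every code point is assigned to a character (some are reserved for future extension). Unicode's range is U+0000 to U+10FFFF, more than a million code points, of which Unicode 4.0 assigns 96,382.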
Supplementary characters are characters with code points in the range U+10000 to U+10FFFF, that is, those characters that could not be represented in the original 16-bit design of Unicode. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Thus, each Unicode character is either in the BMP or a supplementary character.
In short: the characters with code points from U+0000 to U+FFFF form the Basic Multilingual Plane (BMP). The characters with code points from U+10000 to U+10FFFF could not be represented in Unicode's original 16-bit design and are called supplementary characters.
A character encoding scheme is a mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units. The most commonly used code units are bytes, but 16-bit or 32-bit integers can also be used for internal processing. UTF-32, UTF-16, and UTF-8 are character encoding schemes for the coded character set of the Unicode standard.
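In short: a character encoding scheme is a rule that maps the numbers of one or more coded character sets to sequences of one or more fixed-width code units. The most common code unit is the byte, but 16-bit and 32-bit integers are also used for internal processing. UTF-32, UTF-16, and UTF-8 are the character encoding schemes of the Unicode standard.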
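
To make the difference between a coded character set and its encoding schemes concrete, the sketch below encodes the three code points used in this article with UTF-8 and UTF-16BE and prints the resulting bytes. The class name EncodingSchemesDemo and the helper encodeToHex are illustrative names that do not appear in the original article; only standard JDK classes are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSchemesDemo {
    // Encodes a single code point with the given charset and
    // returns the bytes as space-separated hex, e.g. "E2 82 AC".
    static String encodeToHex(int codePoint, Charset charset) {
        String text = new String(Character.toChars(codePoint));
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // The same three code points, seen through two encoding schemes.
        int[] codePoints = {0x0041, 0x20AC, 0x10400};
        for (int cp : codePoints) {
            System.out.printf("U+%04X  UTF-8: %-11s  UTF-16BE: %s%n",
                    cp,
                    encodeToHex(cp, StandardCharsets.UTF_8),
                    encodeToHex(cp, StandardCharsets.UTF_16BE));
        }
    }
}
```

The code point stays the same in every row; only the code units change. U+20AC takes three bytes in UTF-8 (E2 82 AC) but one 16-bit unit in UTF-16 (20 AC), while U+10400 takes four bytes in both (F0 90 90 80 and D8 01 DC 00).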

1.2 Diagram: character, character set, coded character set, code point, code unit, and character encoding scheme

[Figure: diagram distinguishing the concepts above]

2. Java Code Examples

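Java uses UTF-16 as its in-memory text encoding at the API level. A char is one UTF-16 code unit, so the code unit in Java is 16 bits, that is, two bytes. A code point is made up of one or two code units. (Since JDK 9 the JVM may store strings that contain only Latin-1 characters more compactly, but char and the String API still behave as sequences of UTF-16 code units.)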
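
How one code point turns into two code units can be worked out by hand. The sketch below applies the standard UTF-16 surrogate arithmetic and checks it against the JDK helpers Character.highSurrogate, Character.lowSurrogate, and Character.toCodePoint. The class name SurrogatePairDemo and the method toSurrogates are illustrative names, not from the original article.

```java
public class SurrogatePairDemo {
    // Splits a supplementary code point into its UTF-16 high and low surrogates by hand.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        int codePoint = 0x10400;
        char[] pair = toSurrogates(codePoint);
        System.out.printf("U+%X -> %04X %04X%n", codePoint, (int) pair[0], (int) pair[1]);

        // The JDK helpers should agree with the manual arithmetic.
        System.out.println(pair[0] == Character.highSurrogate(codePoint));
        System.out.println(pair[1] == Character.lowSurrogate(codePoint));
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == codePoint);
    }
}
```

For U+10400 the offset is 0x400, so the high surrogate is 0xD800 + 1 = 0xD801 and the low surrogate is 0xDC00 + 0 = 0xDC00, the same pair used in the examples in this article.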

2.1 The concepts in code: code point, code unit, char

package com.lfqy.trying.javaenc;

/**
 * Created by chengxia on 2020/10/25.
 */
public class CharsetTest {
    public static void main(String []args){
        // ------------- Defining and printing a char (BMP characters) ------------------
        char codeUnit = 'A';
        System.out.println(codeUnit); // A
        codeUnit = '\u0041'; // the same 'A', written as its hexadecimal Unicode escape
        System.out.println(codeUnit); // A
        codeUnit = '€';
        System.out.println(codeUnit); // €
        codeUnit = '\u20AC';
        System.out.println(codeUnit); // €

        // ------------- Defining and printing a char (supplementary characters) ------------------
        // Define the first and the second code unit of one supplementary character
        char codeUnit1 = '\uD801', codeUnit2 = '\uDC00';
        System.out.println(codeUnit1); // ?
        System.out.println(codeUnit2); // ?, either code unit alone is only half of the character, so it cannot be printed
        String tempStr = "" + codeUnit1 + codeUnit2; // concatenating both code units into one String makes the character printable
        System.out.println(tempStr); // 𐐀

        // ------------- Printing a code point ------------------
        // A code point is just a number
        int codePoint = 0x0041;
        // %0#6X prints hexadecimal with the 0X prefix, zero-padded to a minimum width of 6; %c prints a code point
        System.out.printf("Code point %0#6X is encoded for %c.%n", codePoint, codePoint); // Code point 0X0041 is encoded for A.
        codePoint = 0x10400;
        System.out.printf("Code point %0#6X is encoded for %c.%n", codePoint, codePoint); // Code point 0X10400 is encoded for 𐐀.

        // ------------- Printing the code units that make up a code point ------------------
        char [] cpchs = Character.toChars(codePoint);
        System.out.printf("In JVM inner representation(UTF-16), code point %0#6X contains code unit %0#6X and %0#6X .%n", codePoint, (int)cpchs[0], (int)cpchs[1]); // In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00 .
    }
}

The program prints:

A
A
€
€
?
?
𐐀
Code point 0X0041 is encoded for A.
Code point 0X10400 is encoded for 𐐀.
In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00 .

This example tells the concepts above apart and shows how to print each of them in Java.
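
One practical consequence of the char/code point distinction is that String.length() counts code units, not characters. The following is a minimal sketch using String.codePointCount and String.offsetByCodePoints; the class name LengthDemo and the helper countCodePoints are illustrative names, not from the original article.

```java
public class LengthDemo {
    // Number of Unicode code points in s (length() would count UTF-16 code units instead).
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String text = "A\u20AC\uD801\uDC00"; // 'A', the euro sign, and U+10400

        System.out.println(text.length());         // 4 code units
        System.out.println(countCodePoints(text)); // 3 code points

        // Index (in code units) at which the third code point starts.
        int index = text.offsetByCodePoints(0, 2);
        System.out.printf("%d -> U+%X%n", index, text.codePointAt(index)); // 2 -> U+10400
    }
}
```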

2.2 Converting a code point to a String

A code point can be converted to a String. The two sample methods below show how.
Method 1:

private static String codePointToString(int codePoint) {
    return new String(Character.toChars(codePoint));
}

Method 2:

private static String codePointToString(int codePoint) {
    StringBuilder stringOut = new StringBuilder();
    stringOut.appendCodePoint(codePoint);
    return stringOut.toString();
}

The first method is the more concise of the two.
A complete example, com.lfqy.trying.javaenc.CodePoint2String:

package com.lfqy.trying.javaenc;

/**
 * Created by chengxia on 2020/11/1.
 */
public class CodePoint2String {
    private static String codePointToString1(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    private static String codePointToString2(int codePoint) {
        StringBuilder stringOut = new StringBuilder();
        stringOut.appendCodePoint(codePoint);
        return stringOut.toString();
    }

    public static void main(String []args){

        int codePoint = 0x10400;
        System.out.printf("Code point %0#6X is encoded for %c.%n", codePoint, codePoint);// Code point 0X10400 is encoded for 𐐀.

        // ------------- Printing the code units that make up the code point ------------------
        char [] cpchs = Character.toChars(codePoint);
        System.out.printf("In JVM inner representation(UTF-16), code point %0#6X contains code unit %0#6X and %0#6X .%n", codePoint, (int)cpchs[0], (int)cpchs[1]); // In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00 .

        String resStr = null;
        // Method 1
        resStr = codePointToString1(codePoint);
        System.out.printf("Method 1, codePointToString1, resStr contains chars: %0#6X and %0#6X .%n",  (int)resStr.charAt(0), (int)resStr.charAt(1));

        // Method 2
        resStr = codePointToString2(codePoint);
        System.out.printf("Method 2, codePointToString2, resStr contains chars: %0#6X and %0#6X .%n",  (int)resStr.charAt(0), (int)resStr.charAt(1));
    }
}

Output:

Code point 0X10400 is encoded for 𐐀.
In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00 .
Method 1, codePointToString1, resStr contains chars: 0XD801 and 0XDC00 .
Method 2, codePointToString2, resStr contains chars: 0XD801 and 0XDC00 .
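
Besides Character.toChars and StringBuilder.appendCodePoint, the JDK also offers the String(int[] codePoints, int offset, int count) constructor, which builds a String directly from an array of code points. The sketch below is a possible third variant, not something the original article uses; the class name CodePointToStringAlt and both method names are illustrative. On Java 11 and later, Character.toString(int codePoint) is another option.

```java
public class CodePointToStringAlt {
    // Builds a String from a single code point via the int[] constructor.
    // Throws IllegalArgumentException if the value is not a valid code point.
    static String codePointToString3(int codePoint) {
        return new String(new int[] {codePoint}, 0, 1);
    }

    // The same constructor handles several code points at once.
    static String codePointsToString(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String single = codePointToString3(0x10400);
        System.out.printf("%04X %04X%n", (int) single.charAt(0), (int) single.charAt(1)); // D801 DC00

        String mixed = codePointsToString(0x0041, 0x20AC, 0x10400);
        System.out.println(mixed.length()); // 4 code units for 3 code points
    }
}
```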

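2.3 Iterating over the characters of a string

Printing each char on its own is wrong for supplementary characters, because it splits them into two surrogates. The correct approach is to iterate over the code points one at a time. Reference code: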

package com.lfqy.trying.javaenc;

/**
 * Created by chengxia on 2020/11/1.
 */
public class TraverseString {

    public static void main(String []args){
        String testStr = "\u0041\u20AC\uD801\uDC00"; // "A€𐐀"
        // Method 1: print each char of the String. Incorrect: supplementary characters do not print properly
        System.out.println("Method 1:");
        for(int i = 0; i < testStr.length(); i++){
            System.out.println(testStr.charAt(i));
        }
        // Method 2: iterate over the code points front to back. Correct
        System.out.println("Method 2:");
        for(int i = 0; i < testStr.length(); ){
            int cp = testStr.codePointAt(i);
            System.out.printf("%c%n", cp); // %c prints a code point
            if(Character.isSupplementaryCodePoint(cp)){ // a supplementary code point takes two code units, so advance i by 2
                i = i + 2;
            }else{
                i++;
            }
        }
        // Method 3: iterate over the code points back to front. Correct for well-formed surrogate pairs
        System.out.println("Method 3:");
        for(int i = testStr.length(); i > 0; ){
            i--;
            // i is on the low surrogate of a pair: step back once more to the high surrogate
            if(i > 0 && Character.isSurrogate(testStr.charAt(i))){
                i--;
            }
            int cp = testStr.codePointAt(i);
            System.out.printf("%c%n", cp); // %c prints a code point
        }
    }
}

Output:

Method 1:
A
€
?
?
Method 2:
A
€
𐐀
Method 3:
𐐀
€
A
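
The manual index bookkeeping above can also be left to the JDK. The sketch below shows two alternatives: forward iteration with String.codePoints() (Java 8 and later) and backward iteration with String.codePointBefore plus Character.charCount. The class name CodePointIterationDemo and its two helper methods are illustrative names, not from the original article.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointIterationDemo {
    // Splits a string into one String per code point, front to back.
    static List<String> splitByCodePoint(String s) {
        List<String> result = new ArrayList<>();
        s.codePoints().forEach(cp -> result.add(new String(Character.toChars(cp))));
        return result;
    }

    // Reverses a string code point by code point, keeping surrogate pairs intact.
    static String reverseByCodePoint(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = s.length(); i > 0; ) {
            int cp = s.codePointBefore(i); // the code point that ends just before index i
            out.appendCodePoint(cp);
            i -= Character.charCount(cp);  // step back by 1 or 2 code units
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String testStr = "A\u20AC\uD801\uDC00";
        for (String piece : splitByCodePoint(testStr)) {
            System.out.println(piece);
        }
        System.out.println(reverseByCodePoint(testStr));
    }
}
```

Character.charCount(cp) returns 2 for supplementary code points and 1 otherwise, so it replaces the explicit isSupplementaryCodePoint branch.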

About the Character.isSurrogate(char ch) method:

The Character.isSurrogate(char ch) java method determines if the given char value is a Unicode surrogate code unit. Such values do not represent characters by themselves, but are used in the representation of supplementary characters in the UTF-16 encoding.
A char value is a surrogate code unit if and only if it is either a low-surrogate code unit or a high-surrogate code unit.

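In other words, the method returns true whenever the code unit being tested is a surrogate, that is, one half of the UTF-16 encoding of a supplementary character, regardless of whether it is the high or the low surrogate.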
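
When it matters which half a code unit is, the JDK also provides Character.isHighSurrogate, Character.isLowSurrogate, and Character.isSurrogatePair. The following sketch classifies each code unit of a short string; the class name SurrogateCheckDemo and the method classify are illustrative names, not from the original article.

```java
public class SurrogateCheckDemo {
    // Describes the role a single UTF-16 code unit plays.
    static String classify(char ch) {
        if (Character.isHighSurrogate(ch)) {
            return "high surrogate";
        }
        if (Character.isLowSurrogate(ch)) {
            return "low surrogate";
        }
        return "not a surrogate";
    }

    public static void main(String[] args) {
        String text = "A\uD801\uDC00";
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            System.out.printf("%04X: %s (isSurrogate=%b)%n",
                    (int) ch, classify(ch), Character.isSurrogate(ch));
        }
        // A high surrogate followed by a low surrogate forms a valid pair.
        System.out.println(Character.isSurrogatePair(text.charAt(1), text.charAt(2)));
    }
}
```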
