Text Encoding in Java Memory

Author: SpaceCat | Published: 2020-11-01 18:58

1. Introduction to Encoding

1.1 Concepts: character, character set, coded character set, code point, code unit, and character encoding scheme

The first step is to be clear about what character, character set, coded character set, code point, code unit, and character encoding scheme each mean.

A character is just an abstract minimal unit of text. It doesn't have a fixed shape (that would be a glyph), and it doesn't have a value. "A" is a character, and so is "€", the symbol for the common currency of Germany, France, and numerous other European countries.
A character set is a collection of characters. For example, the Han characters are the characters originally invented by the Chinese, which have been used to write Chinese, Japanese, Korean, and Vietnamese.
A coded character set is a character set where each character has been assigned a unique number. At the core of the Unicode standard is a coded character set that assigns the letter "A" the number 0041 (hexadecimal) and the character "€" the number 20AC (hexadecimal). The Unicode standard always uses hexadecimal numbers, and writes them with the prefix "U+", so the number for "A" is written as "U+0041".
Code points are the numbers that can be used in a coded character set. A coded character set defines a range of valid code points, but doesn't necessarily assign characters to all those code points (some are left unassigned for possible future extension). The valid code points for Unicode are U+0000 to U+10FFFF. Unicode 4.0 assigns characters to 96,382 of these more than a million code points.
Supplementary characters are characters with code points in the range U+10000 to U+10FFFF, that is, those characters that could not be represented in the original 16-bit design of Unicode. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Thus, each Unicode character is either in the BMP or a supplementary character.
A character encoding scheme is a mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units. The most commonly used code units are bytes, but 16-bit or 32-bit integers can also be used for internal processing. UTF-32, UTF-16, and UTF-8 are character encoding schemes for the coded character set of the Unicode standard.
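
To make the difference between a code point and its encoded form concrete, here is a minimal sketch that encodes the three sample code points used in this article with two of the JDK's standard charsets. The class name EncodingSchemeSketch and the helpers hex and describe are made up for this illustration; they are not part of any library.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSchemeSketch {
    // Hypothetical helper: renders bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Hypothetical helper: shows one code point under two encoding schemes.
    static String describe(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return String.format("U+%04X UTF-8=[%s] UTF-16BE=[%s]",
                codePoint,
                hex(s.getBytes(StandardCharsets.UTF_8)),
                hex(s.getBytes(StandardCharsets.UTF_16BE)));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0041));  // U+0041 UTF-8=[41] UTF-16BE=[00 41]
        System.out.println(describe(0x20AC));  // U+20AC UTF-8=[E2 82 AC] UTF-16BE=[20 AC]
        System.out.println(describe(0x10400)); // U+10400 UTF-8=[F0 90 90 80] UTF-16BE=[D8 01 DC 00]
    }
}
```

The same code point yields a different number of code units depending on the scheme: U+10400 takes four 8-bit units in UTF-8 and two 16-bit units in UTF-16.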

1.2 Diagram: character, character set, coded character set, code point, code unit, and character encoding scheme

[Figure: diagram distinguishing the concepts above]

2. Java Code Examples

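Java represents text in memory as UTF-16. The char type corresponds to a code unit, so a code unit in Java is 16 bits (two bytes) wide. A code point consists of one or two code units: one for a BMP character, two (a surrogate pair) for a supplementary character.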
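
How a supplementary code point becomes two 16-bit code units can be sketched by hand and compared with what the JDK returns. The class SurrogatePairSketch and the method toSurrogates are illustrative names only; the arithmetic follows the UTF-16 definition (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```java
public class SurrogatePairSketch {
    // Hypothetical helper: splits a supplementary code point into its surrogate pair by hand.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int codePoint = 0x10400;
        char[] manual = toSurrogates(codePoint);
        System.out.printf("manual: %04X %04X%n", (int) manual[0], (int) manual[1]); // manual: D801 DC00
        System.out.printf("JDK:    %04X %04X%n",
                (int) Character.highSurrogate(codePoint),
                (int) Character.lowSurrogate(codePoint));                          // JDK:    D801 DC00
        // Character.charCount reports how many code units a code point needs.
        System.out.println(Character.charCount(0x0041));    // 1
        System.out.println(Character.charCount(codePoint)); // 2
    }
}
```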

2.1 The concepts in code: code point, code unit, char

package com.lfqy.trying.javaenc;

/**
 * Created by chengxia on 2020/10/25.
 */
public class CharsetTest {
    public static void main(String []args){
        // ------------- Defining and printing a char (BMP characters) ------------------
        char codeUnit = 'A';
        System.out.println(codeUnit); // A
        codeUnit = '\u0041'; // Same value as 'A', written as a hexadecimal Unicode escape
        System.out.println(codeUnit); // A
        codeUnit = '€';
        System.out.println(codeUnit); // €
        codeUnit = '\u20AC';
        System.out.println(codeUnit); // €

        // ------------- Defining and printing a char (supplementary characters) ------------------
        // Define the first and the second code unit of one supplementary character separately
        char codeUnit1 = '\uD801', codeUnit2 = '\uDC00';
        System.out.println(codeUnit1); // ?
        System.out.println(codeUnit2); // ? (each code unit is only half of the character, so neither prints on its own)
        String tempStr = "" + codeUnit1 + codeUnit2; // Concatenating both code units into one String makes it printable
        System.out.println(tempStr); // 𐐀

        // ------------- Printing a code point ------------------
        // A code point is just a number
        int codePoint = 0x0041;
        // %0#6X: uppercase hex with the 0X prefix, minimum width 6 (prefix included), zero-padded; %c prints a code point
        System.out.printf("Code point %0#6X is encoded for %c.%n", codePoint, codePoint); // Code point 0X0041 is encoded for A.
        codePoint = 0x10400;
        System.out.printf("Code point %0#6X is encoded for %c.%n", codePoint, codePoint); // Code point 0X10400 is encoded for 𐐀.

        // ------------- Printing the code units that make up a code point ------------------
        char[] cpchs = Character.toChars(codePoint);
        System.out.printf("In JVM inner representation(UTF-16), code point %0#6X contains code unit %0#6X and %0#6X .%n", codePoint, (int)cpchs[0], (int)cpchs[1]); // In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00 .
    }
}

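The program prints the following (the two ? lines are the unpaired surrogates; what is actually shown for them may depend on the console):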

A
A
€
€
?
?
𐐀
Code point 0X0041 is encoded for A.
Code point 0X10400 is encoded for 𐐀.
In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00 .

This example separates the concepts introduced above and shows how to print each of them in Java.
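
One practical consequence is that String.length() counts code units, not code points. A minimal sketch, with the class name LengthVsCodePoints and its two helpers chosen only for this illustration:

```java
public class LengthVsCodePoints {
    // Number of UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String text = "A\u20AC\uD801\uDC00"; // "A€𐐀"
        System.out.println(codeUnits(text));  // 4
        System.out.println(codePoints(text)); // 3
    }
}
```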

2.2 Converting a code point to a String

A code point can be converted to a String. The two sample methods below show how.
Method 1:

private static String codePointToString(int codePoint) {
    return new String(Character.toChars(codePoint));
}

Method 2:

private static String codePointToString(int codePoint) {
    StringBuilder stringOut = new StringBuilder();
    stringOut.appendCodePoint(codePoint);
    return stringOut.toString();
}

The first method is the more concise of the two.
Below is a sample program, com.lfqy.trying.javaenc.CodePoint2String:

package com.lfqy.trying.javaenc;

/**
 * Created by chengxia on 2020/11/1.
 */
public class CodePoint2String {
    private static String codePointToString1(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    private static String codePointToString2(int codePoint) {
        StringBuilder stringOut = new StringBuilder();
        stringOut.appendCodePoint(codePoint);
        return stringOut.toString();
    }

    public static void main(String []args){

        int codePoint = 0x10400;
        System.out.printf("Code point %0#6X is encoded for %c.%n", codePoint, codePoint);// Code point 0X10400 is encoded for 𐐀.

        // ------------- Printing the code units that make up a code point ------------------
        char[] cpchs = Character.toChars(codePoint);
        System.out.printf("In JVM inner representation(UTF-16), code point %0#6X contains code unit %0#6X and %0#6X .%n", codePoint, (int)cpchs[0], (int)cpchs[1]); // In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00 .

        String resStr = null;
        // Method 1
        resStr = codePointToString1(codePoint);
        System.out.printf("Method 1, codePointToString1, resStr contains chars: %0#6X and %0#6X .%n",  (int)resStr.charAt(0), (int)resStr.charAt(1));

        // Method 2
        resStr = codePointToString2(codePoint);
        System.out.printf("Method 2, codePointToString2, resStr contains chars: %0#6X and %0#6X .%n",  (int)resStr.charAt(0), (int)resStr.charAt(1));
    }
}

Running it produces:

Code point 0X10400 is encoded for 𐐀.
In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00 .
Method 1, codePointToString1, resStr contains chars: 0XD801 and 0XDC00 .
Method 2, codePointToString2, resStr contains chars: 0XD801 and 0XDC00 .
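
The same idea extends from a single code point to a whole sequence. The sketch below is not from the original program: it assumes Java 8 or later for codePoints(), and uses the String(int[], int, int) constructor to go from code points to a String and back. The class and method names are illustrative.

```java
import java.util.Arrays;

public class CodePointRoundTrip {
    // String -> array of code points (codePoints() is available from Java 8).
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Array of code points -> String, using the String(int[], int, int) constructor.
    static String fromCodePoints(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        int[] codePoints = { 0x0041, 0x20AC, 0x10400 };
        String s = fromCodePoints(codePoints);
        System.out.println(s.length());                                  // 4 code units
        System.out.println(Arrays.equals(toCodePoints(s), codePoints));  // true
    }
}
```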

2.3 Iterating over the characters of a string

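Printing each char one at a time is wrong, because a supplementary character spans two chars. The correct approach is to iterate over the code points one by one. Reference code: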

package com.lfqy.trying.javaenc;

/**
 * Created by chengxia on 2020/11/1.
 */
public class TraverseString {

    public static void main(String []args){
        String testStr = "\u0041\u20AC\uD801\uDC00"; //"A€𐐀"
        // Method 1: print each char of the String directly. Incorrect: supplementary characters do not print properly.
        System.out.println("Method 1:");
        for(int i = 0; i < testStr.length(); i++){
            System.out.println(testStr.charAt(i));
        }
        // Method 2: iterate over the code points forward. Correct.
        System.out.println("Method 2:");
        for(int i = 0; i < testStr.length(); ){
            int cp = testStr.codePointAt(i);
            System.out.printf("%c%n", cp); // %c prints a code point
            if(Character.isSupplementaryCodePoint(cp)){ // a supplementary code point takes two code units, so advance i by 2
                i = i + 2;
            }else{
                i++;
            }
        }
        // Method 3: iterate over the code points backward. Correct, assuming the string is well-formed UTF-16.
        System.out.println("Method 3:");
        for(int i = testStr.length(); i > 0; ){
            i--;
            // A surrogate found here is the low half of a pair, so step back once more to the high half.
            if(Character.isSurrogate(testStr.charAt(i))){
                i--;
            }
            int cp = testStr.codePointAt(i);
            System.out.printf("%c%n", cp); // %c prints a code point
        }
    }
}

The code prints:

Method 1:
A
€
?
?
Method 2:
A
€
𐐀
Method 3:
𐐀
€
A
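
As a possible variation on the loops above, the step size can come from Character.charCount, and the backward loop can use String.codePointBefore, so that neither loop tests for surrogates itself. This is a sketch with illustrative names (CodePointTraversalSketch, forward, backward), not the author's code.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointTraversalSketch {
    // Forward traversal: advance by the number of code units each code point occupies.
    static List<Integer> forward(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp);
        }
        return out;
    }

    // Backward traversal: codePointBefore(i) reads the code point that ends just before index i.
    static List<Integer> backward(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = s.length(); i > 0; ) {
            int cp = s.codePointBefore(i);
            out.add(cp);
            i -= Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        String testStr = "\u0041\u20AC\uD801\uDC00"; // "A€𐐀"
        for (int cp : forward(testStr)) {
            System.out.printf("U+%04X%n", cp);  // U+0041, U+20AC, U+10400
        }
        for (int cp : backward(testStr)) {
            System.out.printf("U+%04X%n", cp);  // U+10400, U+20AC, U+0041
        }
    }
}
```

An unpaired surrogate is returned by both codePointAt and codePointBefore as a single code unit, so these loops also terminate on malformed input instead of indexing out of range.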

About the Character.isSurrogate(char ch) method:

The Character.isSurrogate(char ch) Java method determines if the given char value is a Unicode surrogate code unit. Such values do not represent characters by themselves, but are used in the representation of supplementary characters in the UTF-16 encoding.
A char value is a surrogate code unit if and only if it is either a low-surrogate code unit or a high-surrogate code unit.

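In other words, the method returns true whenever the code unit being tested is one of the code units used to represent a supplementary character, whether it is the high (leading) or the low (trailing) one.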
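
The related checks in java.lang.Character can be tried side by side. In this small sketch the class SurrogateChecks and the helper classify are made-up names; the Character methods themselves are standard JDK API.

```java
public class SurrogateChecks {
    // Hypothetical helper: names the role a single code unit plays in UTF-16.
    static String classify(char ch) {
        if (Character.isHighSurrogate(ch)) {
            return "high surrogate";
        }
        if (Character.isLowSurrogate(ch)) {
            return "low surrogate";
        }
        return "not a surrogate";
    }

    public static void main(String[] args) {
        char high = '\uD801';
        char low = '\uDC00';
        System.out.println(classify('A'));   // not a surrogate
        System.out.println(classify(high));  // high surrogate
        System.out.println(classify(low));   // low surrogate
        System.out.println(Character.isSurrogate(high) && Character.isSurrogate(low)); // true
        System.out.println(Character.isSurrogatePair(high, low));                      // true
        System.out.printf("U+%04X%n", Character.toCodePoint(high, low));               // U+10400
    }
}
```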

3. References

Link to this article: https://www.haomeiwen.com/subject/oyduvktx.html