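1. Introduction to character encoding
1.1 Concepts: character, character set, coded character set, code point, code unit, and character encoding scheme
First, it helps to be clear about these concepts: character, character set, coded character set, code point, code unit, and character encoding scheme.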
A character is just an abstract minimal unit of text. It doesn't have a fixed shape (that would be a glyph), and it doesn't have a value. "A" is a character, and so is "€", the symbol for the common currency of Germany, France, and numerous other European countries.
In short: a character is the smallest abstract unit of text. It has no concrete shape (shape belongs to the glyph). "A" is a character, and so is "€".
A character set is a collection of characters. For example, the Han characters are the characters originally invented by the Chinese, which have been used to write Chinese, Japanese, Korean, and Vietnamese.
In short: a character set is a collection of characters.
A coded character set is a character set where each character has been assigned a unique number. At the core of the Unicode standard is a coded character set that assigns the letter "A" the number 0041(16) and the letter "€" the number 20AC(16). The Unicode standard always uses hexadecimal numbers, and writes them with the prefix "U+", so the number for "A" is written as "U+0041".
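In short: a coded character set is a character set in which every character has been assigned a unique number. The core of the Unicode standard is such a coded character set, in which "A" maps to 0041 (hexadecimal) and "€" maps to 20AC (hexadecimal). Unicode writes these numbers in hexadecimal with the prefix "U+", so "A" is written as U+0041.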
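As a small illustration of the "U+" notation, the sketch below formats a numeric code point the way the Unicode standard writes it. The class and method names (CodePointNotation, toUPlus) are made up for this example and do not come from the quoted article.

```java
// Hypothetical helper class; the names are illustrative only.
class CodePointNotation {
    // Formats a code point in "U+" notation: uppercase hex, padded to at least four digits.
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(toUPlus('A'));       // U+0041
        System.out.println(toUPlus('\u20AC'));  // U+20AC (the euro sign)
        System.out.println(toUPlus(0x10400));   // U+10400
    }
}
```

The padding width of four matches the convention that BMP code points are written with four hex digits, while larger values simply use as many digits as they need.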
Code points are the numbers that can be used in a coded character set. A coded character set defines a range of valid code points, but doesn't necessarily assign characters to all those code points. The valid code points for Unicode are U+0000 to U+10FFFF. Unicode 4.0 assigns characters to 96,382 of these more than a million code points.
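In short: a code point is a number that can be used in a coded character set. A coded character set defines the range of valid code points, but not every code point has a character assigned to it (some are reserved for future extension). The code point range of the Unicode standard is U+0000 to U+10FFFF; as of Unicode 4.0, 96,382 of these more than one million code points have been assigned.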
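The range can be checked from Java with constants and methods on java.lang.Character. This is only a sketch: the class name CodePointRange and the helper totalCodePoints are invented for the example, and Character.isDefined reflects the Unicode version bundled with the running JDK, not Unicode 4.0.

```java
// Hypothetical demo class; the names are illustrative only.
class CodePointRange {
    // Number of code points in the Unicode code space, U+0000 through U+10FFFF.
    static int totalCodePoints() {
        return Character.MAX_CODE_POINT - Character.MIN_CODE_POINT + 1;
    }

    public static void main(String[] args) {
        System.out.printf("U+%04X .. U+%04X%n",
                Character.MIN_CODE_POINT, Character.MAX_CODE_POINT); // U+0000 .. U+10FFFF
        System.out.println(totalCodePoints());                       // 1114112
        System.out.println(Character.isValidCodePoint(0x10FFFF));    // true
        System.out.println(Character.isValidCodePoint(0x110000));    // false
        // isDefined reports whether a character is assigned in the JDK's Unicode version.
        System.out.println(Character.isDefined(0x0041));             // true
    }
}
```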
Supplementary characters are characters with code points in the range U+10000 to U+10FFFF, that is, those characters that could not be represented in the original 16-bit design of Unicode. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Thus, each Unicode character is either in the BMP or a supplementary character.
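In short: the set of characters with code points from U+0000 to U+FFFF is called the Basic Multilingual Plane (BMP). Characters with code points from U+10000 to U+10FFFF could not be represented in the original 16-bit design of Unicode; they are called supplementary characters.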
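To make the BMP/supplementary split concrete, here is a minimal sketch of the standard UTF-16 arithmetic that turns a supplementary code point into a high and a low surrogate, compared against Character.toChars. The class SurrogateMath and the method toSurrogates are hypothetical names used only for this illustration.

```java
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
class SurrogateMath {
    // Splits a supplementary code point into its UTF-16 surrogate pair by hand.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // lower 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x10400;
        char[] manual = toSurrogates(cp);
        System.out.printf("%04X %04X%n", (int) manual[0], (int) manual[1]); // D801 DC00
        System.out.println(Character.isBmpCodePoint(0x20AC));               // true
        System.out.println(Character.charCount(cp));                        // 2
        System.out.println(Arrays.equals(manual, Character.toChars(cp)));   // true
    }
}
```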
A character encoding scheme is a mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units. The most commonly used code units are bytes, but 16-bit or 32-bit integers can also be used for internal processing. UTF-32, UTF-16, and UTF-8 are character encoding schemes for the coded character set of the Unicode standard.
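In short: a character encoding scheme is a rule that maps the numbers of one or more coded character sets to sequences of one or more fixed-width code units. The most common code unit is the byte, but 16-bit or 32-bit integers are also often used for internal processing. UTF-32, UTF-16, and UTF-8 are the character encoding schemes of the Unicode standard.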
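The difference between the encoding schemes shows up when the same three code points (U+0041, U+20AC, U+10400) are encoded to bytes. The sketch below uses String.getBytes with the standard charsets; the class EncodingSchemesDemo and the helper hex are invented for this example. UTF-32 is not among the charsets every JVM is required to support, so that line is guarded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class EncodingSchemesDemo {
    // Renders bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A\u20AC\uD801\uDC00"; // U+0041, U+20AC, U+10400
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // 41 E2 82 AC F0 90 90 80
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 41 20 AC D8 01 DC 00
        // UTF-32 support is optional for a JVM, so check before using it.
        if (Charset.isSupported("UTF-32BE")) {
            System.out.println(hex(s.getBytes(Charset.forName("UTF-32BE")))); // 00 00 00 41 00 00 20 AC 00 01 04 00
        }
    }
}
```

UTF-8 uses 1, 3, and 4 one-byte code units for the three characters; UTF-16 uses 1, 1, and 2 sixteen-bit code units; UTF-32 always uses one 32-bit code unit.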
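1.2 Diagram: distinguishing character, character set, coded character set, code point, code unit, and character encoding scheme
[Figure: diagram distinguishing the concepts]
2. Java code examples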
Java represents text in memory as UTF-16. The char type corresponds to a code unit, so a code unit in Java is 16 bits, that is, two bytes long. A code point is encoded as either one or two code units.
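One practical consequence is that String.length() counts code units, not code points. The following sketch shows the difference; the class LengthVsCodePoints and the helper codePointLength are hypothetical names for this example, and Character.BYTES assumes Java 8 or later.

```java
// Hypothetical demo class; the names are illustrative only.
class LengthVsCodePoints {
    // Counts code points rather than UTF-16 code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "A\u20AC\uD801\uDC00"; // U+0041, U+20AC, U+10400
        System.out.println(s.length());          // 4 (code units)
        System.out.println(codePointLength(s));  // 3 (code points)
        System.out.println(Character.SIZE);      // 16 (bits per char)
        System.out.println(Character.BYTES);     // 2 (bytes per char)
    }
}
```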
2.1 The concepts in code: code point, code unit, char
package com.lfqy.trying.javaenc;

/**
 * Created by chengxia on 2020/10/25.
 */
public class CharsetTest {
    public static void main(String[] args) {
        // ---- Defining and printing a char (BMP characters) ----
        char codeUnit = 'A';
        System.out.println(codeUnit); // A
        codeUnit = '\u0041'; // the same 'A', written as a hexadecimal Unicode escape
        System.out.println(codeUnit); // A
        codeUnit = '€';
        System.out.println(codeUnit); // €
        codeUnit = '\u20AC';
        System.out.println(codeUnit); // €

        // ---- Defining and printing chars (supplementary character) ----
        // The leading (high) and trailing (low) code units of one supplementary character
        char codeUnit1 = '\uD801', codeUnit2 = '\uDC00';
        System.out.println(codeUnit1); // ?
        System.out.println(codeUnit2); // ? (either half alone is not a complete character, so it cannot be printed)
        String tempStr = "" + codeUnit1 + codeUnit2; // concatenated into one string, the pair prints correctly
        System.out.println(tempStr); // 𐐀

        // ---- Printing a code point ----
        // A code point is just a number
        int codePoint = 0x0041;
        // %0#6X: hexadecimal with the 0X prefix, minimum width 6, zero-padded; %c prints a code point
        System.out.printf("Code point %0#6X is encoded for %c.%n", codePoint, codePoint); // Code point 0X0041 is encoded for A.
        codePoint = 0x10400;
        System.out.printf("Code point %0#6X is encoded for %c.%n", codePoint, codePoint); // Code point 0X10400 is encoded for 𐐀.

        // ---- Printing the code units that make up a code point ----
        char[] cpchs = Character.toChars(codePoint);
        System.out.printf("In JVM inner representation(UTF-16), code point %0#6X contains code unit %0#6X and %0#6X .%n", codePoint, (int) cpchs[0], (int) cpchs[1]); // In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00 .
    }
}
The program prints:
A
A
€
€
?
?
𐐀
Code point 0X0041 is encoded for A.
Code point 0X10400 is encoded for 𐐀.
In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00 .
This example separates the concepts introduced above and shows how to print each of them in Java.
2.2 Converting a code point to a String
A code point can be converted to a String. The two example methods below show how.
Method 1:
    private static String codePointToString(int codePoint) {
        return new String(Character.toChars(codePoint));
    }
Method 2:
    private static String codePointToString(int codePoint) {
        StringBuilder stringOut = new StringBuilder();
        stringOut.appendCodePoint(codePoint);
        return stringOut.toString();
    }
The first method is the more concise of the two.
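For completeness, a third option is the String(int[] codePoints, int offset, int count) constructor, which builds a string directly from an array of code points. This is an additional sketch rather than one of the two methods above; the class name CodePointToStringAlt is made up. On Java 11 and later, Character.toString(int) should give the same result.

```java
// Hypothetical demo class; the names are illustrative only.
class CodePointToStringAlt {
    // Builds a String from a single code point via the int[] constructor.
    // Throws IllegalArgumentException if the value is not a valid code point.
    static String codePointToString(int codePoint) {
        return new String(new int[] { codePoint }, 0, 1);
    }

    public static void main(String[] args) {
        String s = codePointToString(0x10400);
        System.out.println(s.length());                      // 2 (one surrogate pair)
        System.out.printf("%04X %04X%n",
                (int) s.charAt(0), (int) s.charAt(1));       // D801 DC00
        System.out.println(codePointToString(0x41));         // A
    }
}
```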
The following example, com.lfqy.trying.javaenc.CodePoint2String, exercises both methods:
package com.lfqy.trying.javaenc;

/**
 * Created by chengxia on 2020/11/1.
 */
public class CodePoint2String {
    private static String codePointToString1(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    private static String codePointToString2(int codePoint) {
        StringBuilder stringOut = new StringBuilder();
        stringOut.appendCodePoint(codePoint);
        return stringOut.toString();
    }

    public static void main(String[] args) {
        int codePoint = 0x10400;
        System.out.printf("Code point %0#6X is encoded for %c.%n", codePoint, codePoint); // Code point 0X10400 is encoded for 𐐀.
        // ---- Printing the code units that make up the code point ----
        char[] cpchs = Character.toChars(codePoint);
        System.out.printf("In JVM inner representation(UTF-16), code point %0#6X contains code unit %0#6X and %0#6X .%n", codePoint, (int) cpchs[0], (int) cpchs[1]); // In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00 .
        String resStr = null;
        // Method 1
        resStr = codePointToString1(codePoint);
        System.out.printf("Method 1, codePointToString1, resStr contains chars: %0#6X and %0#6X .%n", (int) resStr.charAt(0), (int) resStr.charAt(1));
        // Method 2
        resStr = codePointToString2(codePoint);
        System.out.printf("Method 2, codePointToString2, resStr contains chars: %0#6X and %0#6X .%n", (int) resStr.charAt(0), (int) resStr.charAt(1));
    }
}
The output is:
Code point 0X10400 is encoded for 𐐀.
In JVM inner representation(UTF-16), code point 0X10400 contains code unit 0XD801 and 0XDC00 .
Method 1, codePointToString1, resStr contains chars: 0XD801 and 0XDC00 .
Method 2, codePointToString2, resStr contains chars: 0XD801 and 0XDC00 .
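2.3 Iterating over the characters in a string
Printing each char individually is wrong when the string contains supplementary characters; the correct approach is to iterate over its code points one by one. Reference code: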
package com.lfqy.trying.javaenc;

/**
 * Created by chengxia on 2020/11/1.
 */
public class TraverseString {
    public static void main(String[] args) {
        String testStr = "\u0041\u20AC\uD801\uDC00"; // "A€𐐀"
        // Method 1: print each char of the String. Incorrect: supplementary characters are split in half.
        System.out.println("Method 1:");
        for (int i = 0; i < testStr.length(); i++) {
            System.out.println(testStr.charAt(i));
        }
        // Method 2: iterate forward over the code points. Correct.
        System.out.println("Method 2:");
        for (int i = 0; i < testStr.length(); ) {
            int cp = testStr.codePointAt(i);
            System.out.printf("%c%n", cp); // %c prints a code point
            if (Character.isSupplementaryCodePoint(cp)) { // a supplementary code point occupies two code units, so advance i by 2
                i = i + 2;
            } else {
                i++;
            }
        }
        // Method 3: iterate backward over the code points. Correct for well-formed UTF-16.
        System.out.println("Method 3:");
        for (int i = testStr.length(); i > 0; ) {
            i--;
            // A surrogate here is the trailing half of a pair, so step back once more to its leading half
            if (i > 0 && Character.isSurrogate(testStr.charAt(i))) {
                i--;
            }
            int cp = testStr.codePointAt(i);
            System.out.printf("%c%n", cp); // %c prints a code point
        }
    }
}
The output is:
Method 1:
A
€
?
?
Method 2:
A
€
𐐀
Method 3:
𐐀
€
A
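On Java 8 and later, forward iteration can also be written with String.codePoints(), which yields each code point as an int and handles surrogate pairs itself. The sketch below is an alternative to the manual index loop, not a replacement prescribed by the article; the class CodePointStreamDemo and the method splitByCodePoint are invented names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the names are illustrative only.
class CodePointStreamDemo {
    // Splits a string into one-code-point strings using the codePoints() stream.
    static List<String> splitByCodePoint(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    public static void main(String[] args) {
        String testStr = "\u0041\u20AC\uD801\uDC00"; // U+0041, U+20AC, U+10400
        List<String> parts = splitByCodePoint(testStr);
        System.out.println(parts.size()); // 3

        // The same walk with an index, advancing by Character.charCount(cp).
        for (int i = 0; i < testStr.length(); ) {
            int cp = testStr.codePointAt(i);
            System.out.printf("U+%04X%n", cp); // U+0041, then U+20AC, then U+10400
            i += Character.charCount(cp);
        }
    }
}
```

Character.charCount(cp) returns 2 for supplementary code points and 1 otherwise, so it does the same job as the isSupplementaryCodePoint branch in the forward loop.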
About the Character.isSurrogate(char ch) method:
The Character.isSurrogate(char ch) java method determines if the given char value is a Unicode surrogate code unit. Such values do not represent characters by themselves, but are used in the representation of supplementary characters in the UTF-16 encoding.
A char value is a surrogate code unit if and only if it is either a low-surrogate code unit or a high-surrogate code unit.
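In other words, whenever the code unit being tested is one half of the pair that encodes a supplementary character, whether it is the high (leading) or the low (trailing) code unit, the method returns true.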
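The sketch below shows how isSurrogate relates to isHighSurrogate and isLowSurrogate, and gives an alternative backward traversal built on String.codePointBefore, which also copes with unpaired surrogates by returning them as single code units. The class ReverseTraversal and the method reverseCodePoints are hypothetical names for this illustration.

```java
// Hypothetical demo class; the names are illustrative only.
class ReverseTraversal {
    // Returns the code points of s in reverse order.
    static int[] reverseCodePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = s.length(); i > 0; ) {
            int cp = s.codePointBefore(i);   // code point ending just before index i
            out[n++] = cp;
            i -= Character.charCount(cp);    // step back one or two code units
        }
        return out;
    }

    public static void main(String[] args) {
        char high = '\uD801', low = '\uDC00';
        System.out.println(Character.isSurrogate(high));      // true
        System.out.println(Character.isSurrogate(low));       // true
        System.out.println(Character.isHighSurrogate(high));  // true
        System.out.println(Character.isLowSurrogate(high));   // false
        System.out.println(Character.isSurrogate('A'));       // false

        for (int cp : reverseCodePoints("\u0041\u20AC\uD801\uDC00")) {
            System.out.printf("U+%04X%n", cp); // U+10400, then U+20AC, then U+0041
        }
    }
}
```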