UnicodeStandard-12.0
Chapter 2 General Structure
This chapter describes the fundamental principles governing the design of the Unicode Standard and presents an informal overview of its main features. The chapter starts by placing the Unicode Standard in an architectural context by discussing the nature of text representation and text processing and its bearing on character encoding decisions. Next, the Unicode Design Principles are introduced: ten basic principles that convey the essence of the standard. The Unicode Design Principles serve as a tutorial framework for understanding the Unicode Standard.
The chapter then moves on to the Unicode character encoding model, introducing the concepts of character, code point, and encoding forms, and diagramming the relationships between them. This provides an explanation of the encoding forms UTF-8, UTF-16, and UTF-32 and some general guidelines regarding the circumstances under which one form would be preferable to another.
The sections on Unicode allocation then describe the overall structure of the Unicode codespace, showing a summary of the code charts and the locations of blocks of characters associated with different scripts or sets of symbols.
Next, the chapter discusses the issue of writing direction and introduces several special types of characters important for understanding the Unicode Standard. In particular, the use of combining characters, the byte order mark, and other special characters is explored in some detail.
The section on equivalent sequences and normalization describes the issue of multiple equivalent representations of Unicode text and explains how text can be transformed to use a unique and preferred representation for each character sequence.
Finally, there is an informal statement of the conformance requirements for the Unicode Standard. This informal statement, with a number of easy-to-understand examples, gives a general sense of what conformance to the Unicode Standard means. The rigorous, formal definition of conformance is given in the subsequent Chapter 3, Conformance.
-
2-1 Architectural Context
A character code standard such as the Unicode Standard enables the implementation of useful processes operating on textual data. The interesting end products are not the character codes but rather the text processes, because these directly serve the needs of a system’s users. Character codes are like nuts and bolts: minor, but essential and ubiquitous components used in many different ways in the construction of computer software systems. No single design of a character set can be optimal for all uses, so the architecture of the Unicode Standard strikes a balance among several competing requirements.
Basic Text Processes
Most computer systems provide low-level functionality for a small number of basic text processes from which more sophisticated text-processing capabilities are built. The following text processes are supported by most computer systems to some degree:
• Rendering characters visible (including ligatures, contextual forms, and so on)
• Breaking lines while rendering (including hyphenation)
• Modifying appearance, such as point size, kerning, underlining, slant, and weight (light, demi, bold, and so on)
• Determining units such as “word” and “sentence”
• Interacting with users in processes such as selecting and highlighting text
• Accepting keyboard input and editing stored text through insertion and deletion
• Comparing text in operations such as searching or determining the sort order of two strings
• Analyzing text content in operations such as spell-checking, hyphenation, and parsing morphology (that is, determining word roots, stems, and affixes)
• Treating text as bulk data for operations such as compressing and decompressing, truncating, transmitting, and receiving
Text Elements, Characters, and Text Processes
One of the more profound challenges in designing a character encoding stems from the fact that there is no universal set of fundamental units of text. Instead, the division of text into text elements necessarily varies by language and text process.
For example, in traditional German orthography, the letter combination “ck” is a text element for the process of hyphenation (where it appears as “k-k”), but not for the process of sorting. In Spanish, the combination “ll” may be a text element for the traditional process of sorting (where it is sorted between “l” and “m”), but not for the process of rendering. In English, the letters “A” and “a” are usually distinct text elements for the process of rendering, but generally not distinct for the process of searching text. The text elements in a given language depend upon the specific text process; a text element for spell-checking may have different boundaries from a text element for sorting purposes. For example, in the phrase “the quick brown fox,” the sequence “fox” is a text element for the purpose of spell-checking.
In contrast, a character encoding standard provides a single set of fundamental units of encoding, to which it uniquely assigns numerical code points. These units, called assigned characters, are the smallest interpretable units of stored text. Text elements are then represented by a sequence of one or more characters.
The design of the character encoding must provide precisely the set of characters that allows programmers to design applications capable of implementing a variety of text processes in the desired languages. Therefore, the text elements encountered in most text processes are represented as sequences of character codes. See Unicode Standard Annex #29, “Unicode Text Segmentation,” for detailed information on how to segment character strings into common types of text elements. Certain text elements correspond to what users perceive as single characters. These are called grapheme clusters.
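The dependence of text elements on the text process can be sketched in a few lines of Python. This is only an illustration: the helper names `search_key`, `contains`, and `spellcheck_units` are hypothetical, and splitting on whitespace is a stand-in for real word segmentation, not the rules of Unicode Standard Annex #29.

```python
def search_key(text: str) -> str:
    """Fold case so that "A" and "a" are the same unit for searching."""
    return text.casefold()


def contains(haystack: str, needle: str) -> bool:
    """Case-insensitive search built on the folded form."""
    return search_key(needle) in search_key(haystack)


def spellcheck_units(text: str) -> list[str]:
    """Assumption: whitespace-delimited words are the units a spell-checker sees."""
    return text.split()


phrase = "The Quick Brown Fox"

# For storage and rendering, "Q" and "q" are distinct characters.
exact_match = "quick" in phrase
# For searching, the distinction is folded away.
folded_match = contains(phrase, "quick")
# For spell-checking, "fox" is one text element made of three characters.
units = spellcheck_units("the quick brown fox")
```

The same stored characters are grouped differently by each process, which is why the encoding itself fixes only the characters and leaves text elements to the algorithms.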
Text Processes and Encoding
In the case of English text using an encoding scheme such as ASCII, the relationships between the encoding and the basic text processes built on it are seemingly straightforward: characters are generally rendered visible one by one in distinct rectangles from left to right in linear order. Thus one character code inside the computer corresponds to one logical character in a process such as simple English rendering.
When designing an international and multilingual text encoding such as the Unicode Standard, the relationship between the encoding and implementation of basic text processes must be considered explicitly, for several reasons:
• Many assumptions about character rendering that hold true for the English alphabet fail for other writing systems. Characters in these other writing systems are not necessarily rendered visible one by one in rectangles from left to right. In many cases, character positioning is quite complex and does not proceed in a linear fashion. See Section 9.2, Arabic, and Section 12.1, Devanagari, for detailed examples of this situation.
• It is not always obvious that one set of text characters is an optimal encoding for a given language. For example, two approaches exist for the encoding of accented characters commonly used in French or Swedish: ISO/IEC 8859 defines letters such as “ä” and “ö” as individual characters, whereas ISO 5426 represents them by composition with diacritics instead. In the Swedish language, both are considered distinct letters of the alphabet, following the letter “z”. In French, the diaeresis on a vowel merely marks it as being pronounced in isolation. In practice, both approaches can be used to implement either language.
• No encoding can support all basic text processes equally well. As a result, some trade-offs are necessary. For example, following common practice, Unicode defines separate codes for uppercase and lowercase letters. This choice causes some text processes, such as rendering, to be carried out more easily, but other processes, such as comparison, to become more difficult. A different encoding design for English, such as case-shift control codes, would have the opposite effect. In designing a new encoding scheme for complex scripts, such trade-offs must be evaluated and decisions made explicitly.
For these reasons, the design of the Unicode Standard is not specific to the design of particular basic text-processing algorithms. Instead, it provides an encoding that can be used with a wide variety of algorithms. In particular, sorting and string comparison algorithms cannot assume that the assignment of Unicode character code numbers provides an alphabetical ordering for lexicographic string comparison. Culturally expected sorting orders require arbitrarily complex sorting algorithms. The expected sort sequence for the same characters differs across languages; thus, in general, no single acceptable lexicographic ordering exists. See Unicode Technical Standard #10, “Unicode Collation Algorithm,” for the standard default mechanism for comparing Unicode strings.
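The point about code point order can be seen with a short Python sketch. The word list is made up for illustration, and `rough_key` is a hypothetical helper that strips accents and case; it is a crude approximation for demonstration only and is not the Unicode Collation Algorithm.

```python
import unicodedata

# Escapes are used so the precomposed forms are unambiguous:
# U+00C4 is "Ä", U+00E9 is "é".
words = ["zebra", "Zulu", "apple", "\u00c4pfel", "\u00e9clair"]

# Python compares strings by code point, so uppercase ASCII sorts before
# lowercase ASCII, and accented letters sort after "z".
code_point_order = sorted(words)


def rough_key(word: str) -> str:
    """Crude sort key: decompose, drop combining marks, fold case."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


rough_order = sorted(words, key=rough_key)
```

Here `code_point_order` is `['Zulu', 'apple', 'zebra', 'Äpfel', 'éclair']`, while `rough_order` is `['Äpfel', 'apple', 'éclair', 'zebra', 'Zulu']`. Neither is correct for every language: as noted above, Swedish places “ä” after “z”, so a real implementation needs language-specific tailoring.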
Text processes supporting many languages are often more complex than they are for English. The character encoding design of the Unicode Standard strives to minimize this additional complexity, enabling modern computer systems to interchange, render, and manipulate text in a user’s own script and language, and possibly in other languages as well.
Character Identity. Whenever Unicode makes statements about the default layout behavior of characters, it is done to ensure that users and implementers face no ambiguities as to which characters or character sequences to use for a given purpose. For bidirectional writing systems, this includes the specification of the sequence in which characters are to be encoded so as to correspond to a specific reading order when displayed. See Section 2.10, Writing Direction.
The actual layout in an implementation may differ in detail. A mathematical layout system, for example, will have many additional, domain-specific rules for layout, but a well-designed system leaves no ambiguities as to which character codes are to be used for a given aspect of the mathematical expression being encoded.
The purpose of defining Unicode default layout behavior is not to enforce a single and specific aesthetic layout for each script, but rather to encourage uniformity in encoding. In that way implementers of layout systems can rely on the fact that users would have chosen a particular character sequence for a given purpose, and users can rely on the fact that implementers will create a layout for a particular character sequence that matches the intent of the user to within the capabilities or technical limitations of the implementation.
In other words, two users who are familiar with the standard and who are presented with the same text ideally will choose the same sequence of character codes to encode the text. In actual practice there are many limitations, so this goal cannot always be realized.
-
2-2 Unicode Design Principles
The design of the Unicode Standard reflects the 10 fundamental principles stated in the table below.
![](https://img.haomeiwen.com/i15359095/da03aecabf4df621.png)
Not all of these principles can be satisfied simultaneously. The design strikes a balance between maintaining consistency for the sake of simplicity and efficiency and maintaining compatibility for interchange with existing standards.
Universality
The Unicode Standard encodes a single, very large set of characters, encompassing all the characters needed for worldwide use. This single repertoire is intended to be universal in coverage, containing all the characters for textual representation in all modern writing systems, in most historic writing systems, and for symbols used in plain text.
The Unicode Standard is designed to meet the needs of diverse user communities within each language, serving business, educational, liturgical and scientific users, and covering the needs of both modern and historical texts.
Despite its aim of universality, the Unicode Standard considers the following to be outside its scope: writing systems for which insufficient information is available to enable reliable encoding of characters, writing systems that have not become standardized through use, and writing systems that are nontextual in nature.
Because the universal repertoire is known and well defined in the standard, it is possible to specify a rich set of character semantics. By relying on those character semantics, implementations can provide detailed support for complex operations on text in a portable way.
Efficiency
The Unicode Standard is designed to make efficient implementation possible. There are no escape characters or shift states in the Unicode character encoding model. Each character code has the same status as any other character code; all codes are equally accessible.
All Unicode encoding forms are self-synchronizing and non-overlapping. This makes randomly accessing and searching inside streams of characters efficient.
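Self-synchronization can be illustrated with UTF-8, where every continuation byte has the bit pattern `10xxxxxx`. The sketch below, in Python, backs up from an arbitrary byte offset to the start of the character containing it. The function names are hypothetical, and the code assumes well-formed UTF-8 and an offset inside the buffer.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which always look like 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


def sync_to_char_start(data: bytes, offset: int) -> int:
    """Step backward from an arbitrary offset to the first byte of its character.

    Assumes data is well-formed UTF-8 and 0 <= offset < len(data).
    """
    while offset > 0 and is_continuation(data[offset]):
        offset -= 1
    return offset


# "a", EURO SIGN (three bytes: E2 82 AC), "b"
encoded = "a\u20acb".encode("utf-8")

start = sync_to_char_start(encoded, 3)  # offset 3 is the last byte of the euro sign
recovered = encoded[start:start + 3].decode("utf-8")

# Non-overlap: the byte for "b" cannot occur inside another character,
# so a plain byte search finds only the real "b".
position_of_b = encoded.find(b"b")
```

No state from earlier in the stream is needed: at most three backward steps locate a character boundary, which is what makes random access and byte-level searching practical.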
By convention, characters of a script are grouped together as far as is practical. Not only is this practice convenient for looking up characters in the code charts, but it makes implementations more compact and compression methods more efficient. The common punctuation characters are shared.
Format characters are given specific and unambiguous functions in the Unicode Standard. This design simplifies the support of subsets. To keep implementations simple and efficient, stateful controls and format characters are avoided wherever possible.
Characters, Not Glyphs
The Unicode Standard draws a distinction between characters and glyphs. Characters are the abstract representations of the smallest components of written language that have semantic value. They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation. The letters used in natural language text are grouped into scripts, which are sets of letters that are used together in writing languages. Letters in different scripts, even when they correspond either semantically or graphically, are represented in Unicode by distinct characters. This is true even in those instances where they correspond in semantics, pronunciation, or appearance.
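A small Python check makes this concrete for three capital letters that usually look identical in print. The choice of letters is illustrative; the names come from Python's bundled copy of the Unicode Character Database.

```python
import unicodedata

# Latin A, Greek Alpha, Cyrillic A: similar glyphs, three distinct characters.
lookalikes = ["\u0041", "\u0391", "\u0410"]

names = [unicodedata.name(ch) for ch in lookalikes]
code_points = [f"U+{ord(ch):04X}" for ch in lookalikes]

# Comparison works on characters, not on appearance.
all_distinct = len(set(lookalikes)) == len(lookalikes)
```

Because each script gets its own character, a search for Latin “A” will not match Greek or Cyrillic text that merely looks the same.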
Characters are represented by code points that reside only in a memory representation, as strings in memory, on disk, or in data transmission. The Unicode Standard deals only with character codes.
Glyphs represent the shapes that characters can have when they are rendered or displayed. In contrast to characters, glyphs appear on the screen or paper as particular representations of one or more characters. A repertoire of glyphs makes up a font. Glyph shape and methods of identifying and selecting glyphs are the responsibility of individual font vendors and of appropriate standards and are not part of the Unicode Standard.
Various relationships may exist between character and glyph: a single glyph may correspond to a single character or to a number of characters, or multiple glyphs may result from a single character. The distinction between characters and glyphs is illustrated in Figure 2-2.
![](https://img.haomeiwen.com/i15359095/086d401804ddf2e1.png)
Even the letter “a” has a wide variety of glyphs that can represent it. A lowercase Cyrillic “п” also has a variety of glyphs; the second glyph for U+043F cyrillic small letter pe shown in Figure 2-2 is customary for italic in Russia, while the third is customary for italic in Serbia. Arabic letters are displayed with different glyphs, depending on their position in a word; the glyphs in Figure 2-2 show independent, final, initial, and medial forms.
Sequences such as “fi” may be displayed with two independent glyphs or with a ligature glyph. What the user thinks of as a single character (which may or may not be represented by a single glyph) may be represented in the Unicode Standard as multiple code points. See Table 2-2 for additional examples.
![](https://img.haomeiwen.com/i15359095/7ebf00e48d8180fc.png)
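The gap between user-perceived characters and code points can be observed directly in Python. The helper `rough_clusters` below is a hypothetical, deliberately simplified grouping that attaches combining marks to the preceding base character; it is not the grapheme cluster algorithm of Unicode Standard Annex #29 and ignores cases such as Hangul syllables and emoji sequences.

```python
import unicodedata


def rough_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous base
        else:
            clusters.append(ch)
    return clusters


precomposed = "\u00e4"   # one code point: LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # two code points: "a" + COMBINING DIAERESIS

code_point_counts = (len(precomposed), len(decomposed))
perceived_counts = (len(rough_clusters(precomposed)), len(rough_clusters(decomposed)))

# Normalization maps the two spellings to the same sequence.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

A user sees one letter in both cases, while the stored text holds one or two code points. Whether “fi” is drawn as a ligature is a separate, font-level decision that leaves the two stored code points unchanged.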
For certain scripts, such as Arabic and the various Indic scripts, the number of glyphs needed to display a given script may be significantly larger than the number of characters encoding the basic units of that script. The number of glyphs may also depend on the orthographic style supported by the font. For example, an Arabic font intended to support the Nastaliq style of Arabic script may possess many thousands of glyphs. However, the character encoding employs the same few dozen letters regardless of the font style used to depict the character data in context.
A font and its associated rendering process define an arbitrary mapping from Unicode characters to glyphs. Some of the glyphs in a font may be independent forms for individual characters; others may be rendering forms that do not directly correspond to any single character.
Text rendering requires that characters in memory be mapped to glyphs. The final appearance of rendered text may depend on context (neighboring characters in the memory representation), variations in typographic design of the fonts used, and formatting information (point size, superscript, subscript, and so on). The results on screen or paper can differ considerably from the prototypical shape of a letter or character, as shown in Figure 2-3.
![](https://img.haomeiwen.com/i15359095/0fc14529f1343f48.png)
For the Latin script, this relationship between character code sequence and glyph is relatively simple and well known; for several other scripts, it is documented in this standard. However, in all cases, fine typography requires a more elaborate set of rules than given here. The Unicode Standard documents the default relationship between character sequences and glyphic appearance for the purpose of ensuring that the same text content can be stored with the same, and therefore interchangeable, sequence of character codes.
Semantics
Characters have well-defined semantics. These semantics are defined by explicitly assigned character properties, rather than implied through the character name or the position of a character in the code tables (see Section 3.5, Properties). The Unicode Character Database provides machine-readable character property tables for use in implementations of parsing, sorting, and other algorithms requiring semantic knowledge about the code points. These properties are supplemented by the description of script and character behavior in this standard. See also Unicode Technical Report #23, “The Unicode Character Property Model.”
The Unicode Standard identifies more than 100 different character properties, including numeric, casing, combination, and directionality properties (see Chapter 4, Character Properties). Additional properties may be defined as needed from time to time. Where characters are used in different ways in different languages, the relevant properties are normally defined outside the Unicode Standard. For example, Unicode Technical Standard #10, “Unicode Collation Algorithm,” defines a set of default collation weights that can be used with a standard algorithm. Tailorings for each language are provided in the Unicode Common Locale Data Repository (CLDR); see Section B.3, Other Unicode Online Resources.
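Python's standard `unicodedata` module exposes a small subset of these properties, which is enough to show that semantics are looked up rather than inferred from a name or position. The `describe` helper is hypothetical, and the values reflect whichever Unicode Character Database version ships with the Python interpreter in use.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few character properties exposed by the standard library."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),       # General_Category
        "bidi": unicodedata.bidirectional(ch),      # Bidi_Class
        "combining": unicodedata.combining(ch),     # Canonical_Combining_Class
        "numeric": unicodedata.numeric(ch, None),   # Numeric_Value, if any
    }


digit_five = describe("5")
one_half = describe("\u00bd")        # VULGAR FRACTION ONE HALF
acute_accent = describe("\u0301")    # COMBINING ACUTE ACCENT

# Casing is also property-driven: the uppercase of U+00DF is two letters.
sharp_s_upper = "\u00df".upper()
```

For instance, `one_half["numeric"]` is `0.5` even though the character is not a digit, and `acute_accent["combining"]` is `230`, which tells a renderer the mark belongs above its base.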
The Unicode Standard, by supplying a universal repertoire associated with well-defined character semantics, does not require the code set independent model of internationalization and text handling. That model abstracts away string handling as manipulation of byte streams of unknown semantics to protect implementations from the details of hundreds of different character encodings and selectively late-binds locale-specific character properties to characters. Of course, it is always possible for code set independent implementations to retain their model and to treat Unicode characters as just another character set in that context. It is not at all unusual for Unix implementations to simply add UTF-8 as another character set, parallel to all the other character sets they support. By contrast, the Unicode approach, because it is associated with a universal repertoire, assumes that characters and their properties are inherently and inextricably associated. If an internationalized application can be structured to work directly in terms of Unicode characters, all levels of the implementation can reliably and efficiently access character storage and be assured of the universal applicability of character property semantics.
Plain Text
Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes. In contrast, styled text, also known as rich text, is any text representation consisting of plain text plus added information such as a language identifier, font size, color, hypertext links, and so on. For example, the text of this specification, a multi-font text as formatted by a book editing system, is rich text.
The simplicity of plain text gives it a natural role as a major structural element of rich text. SGML, RTF, HTML, XML, and TeX are examples of rich text fully represented as plain text streams, interspersing plain text data with sequences of characters that represent the additional data structures. They use special conventions embedded within the plain text file, such as “<p>”, to distinguish the markup or tags from the “real” content. Many popular word processing packages rely on a buffer of plain text to represent the content and implement links to a parallel store of formatting data.
The relative functional roles of both plain text and rich text are well established:
• Plain text is the underlying content stream to which formatting can be applied.
• Rich text carries complex formatting information as well as text context.
• Plain text is public, standardized, and universally readable.
• Rich text representation may be implementation-specific or proprietary.
Although some rich text formats have been standardized or made public, the majority of rich text designs are vehicles for particular implementations and are not necessarily readable by other implementations. Given that rich text equals plain text plus added information, the extra information in rich text can always be stripped away to reveal the “pure” text underneath. This operation is often employed, for example, in word processing systems that use both their own private rich text format and plain text file format as a universal, if limited, means of exchange. Thus, by default, plain text represents the basic, interchangeable content of text.
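Stripping markup to recover the underlying plain text can be sketched with Python's standard `html.parser` module. The `TextExtractor` class, the `strip_markup` function, and the sample HTML fragment are all illustrative assumptions; real rich text formats need format-specific handling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only character data, discarding tags and attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_markup(rich: str) -> str:
    """Return the plain text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(rich)
    parser.close()
    return "".join(parser.parts)


rich = "<p>Plain text is <b>public</b> and <i>universally readable</i>.</p>"
plain = strip_markup(rich)
```

The formatting information (bold, italic, paragraph structure) is lost, but the interchangeable character content survives intact.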
Plain text represents character content only, not its appearance. It can be displayed in a variety of ways and requires a rendering process to make it visible with a particular appearance. If the same plain text sequence is given to disparate rendering processes, there is no expectation that rendered text in each instance should have the same appearance. Instead, the disparate rendering processes are simply required to make the text legible according to the intended reading. This legibility criterion constrains the range of possible appearances.
The relationship between appearance and content of plain text may be summarized as follows:
Plain text must contain enough information to permit the text to be rendered legibly, and nothing more.
The Unicode Standard encodes plain text. The distinction between plain text and other forms of data in the same data stream is the function of a higher-level protocol and is not specified by the Unicode Standard itself.
Logical Order
The order in which Unicode text is stored in the memory representation is called logical order. This order roughly corresponds to the order in which text is typed in via the keyboard; it also roughly corresponds to phonetic order. For decimal numbers, the logical order consistently corresponds to the most significant digit first, which is the order expected by number-parsing software.
When displayed, this logical order often corresponds to a simple linear progression of characters in one direction, such as from left to right, right to left, or top to bottom. In other circumstances, text is displayed or printed in an order that differs from a single linear progression. Some of the clearest examples are situations where a right-to-left script (such as Arabic or Hebrew) is mixed with a left-to-right script (such as Latin or Greek). For example, when the text in Figure 2-4 is ordered for display, the glyph that represents the first character of the English text appears at the left. The logical start character of the Hebrew text, however, is represented by the Hebrew glyph closest to the right margin. The succeeding Hebrew glyphs are laid out to the left.
![](https://img.haomeiwen.com/i15359095/67a1e6196b4c19ec.png)
In logical order, numbers are encoded with most significant digit first, but are displayed in different writing directions. As shown in Figure 2-5, these writing directions do not always correspond to the writing direction of the surrounding text. The first example shows N’Ko, a right-to-left script with digits that also render right to left. Examples 2 and 3 show Hebrew and Arabic, in which the numbers are rendered left to right, resulting in bidirectional layout. In left-to-right scripts, such as Latin and Hiragana and Katakana (for Japanese), numbers follow the predominant left-to-right direction of the script, as shown in Examples 4 and 5. When Japanese is laid out vertically, numbers are either laid out vertically or may be rotated clockwise ninety degrees to follow the layout direction of the lines, as shown in Example 6.
![](https://img.haomeiwen.com/i15359095/2e5fa51ea7635578.png)
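Because digits are stored most significant digit first regardless of display direction, a parser can treat all decimal digit sets alike. The Python sketch below relies on the built-in `int`, which accepts any characters with General_Category Nd; the sample strings are illustrative, and the display direction is only hinted at through each digit's Bidi_Class.

```python
import unicodedata

# The number 123 in three digit sets, all stored in logical order.
samples = {
    "ascii": "123",
    "arabic_indic": "\u0661\u0662\u0663",
    "nko": "\u07c1\u07c2\u07c3",
}

# Parsing ignores display direction: the first stored digit is the most significant.
parsed = {label: int(digits) for label, digits in samples.items()}

# The Bidi_Class of the digits differs, which is what drives the different
# display directions described above.
digit_bidi = {label: unicodedata.bidirectional(digits[0]) for label, digits in samples.items()}
```

All three strings parse to 123. The N’Ko digits carry the right-to-left class `R`, while Arabic-Indic digits are `AN` and ASCII digits are `EN`, both of which are laid out left to right within a number.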
The Unicode Standard precisely defines the conversion of Unicode text from logical order to the order of readable (displayed) text so as to ensure consistent legibility. Properties of directionality inherent in characters generally determine the correct display order of text. The Unicode Bidirectional Algorithm specifies how these properties are used to resolve directional interactions when characters of right-to-left and left-to-right directionality are mixed. (See Unicode Standard Annex #9, “Unicode Bidirectional Algorithm.”) However, when characters of different directionality are mixed, inherent directionality alone is occasionally insufficient to render plain text legibly. The Unicode Standard therefore includes characters to explicitly specify changes in direction when necessary. The Bidirectional Algorithm uses these directional layout control characters together with the inherent directional properties of characters to exert exact control over the display ordering for legible interchange. By requiring the use of this algorithm, the Unicode Standard ensures that plain text used for simple items like file names or labels can always be correctly ordered for display.
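The inherent directionality mentioned here is an ordinary character property and can be inspected in Python. The sketch below only reads the Bidi_Class values; it does not implement the Unicode Bidirectional Algorithm, and `strong_directions` is a hypothetical helper for detecting mixed-direction text.

```python
import unicodedata

bidi_classes = {
    "latin_a": unicodedata.bidirectional("A"),
    "hebrew_alef": unicodedata.bidirectional("\u05d0"),
    "arabic_alef": unicodedata.bidirectional("\u0627"),
    # Directional layout control characters
    "left_to_right_mark": unicodedata.bidirectional("\u200e"),
    "right_to_left_mark": unicodedata.bidirectional("\u200f"),
    "right_to_left_embedding": unicodedata.bidirectional("\u202b"),
}


def strong_directions(text: str) -> set[str]:
    """Report which strong directions occur in the text."""
    found: set[str] = set()
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            found.add("LTR")
        elif bidi_class in ("R", "AL"):
            found.add("RTL")
    return found


mixed = "abc \u05d0\u05d1\u05d2"  # Latin letters followed by Hebrew letters
directions_in_mixed = strong_directions(mixed)
```

A display engine would need the full algorithm from Unicode Standard Annex #9 to reorder such a string; the property lookup shown here is only its input.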
Besides mixing runs of differing overall text direction, there are many other cases where the logical order does not correspond to a linear progression of characters. Combining characters (such as accents) are stored following the base character to which they apply, but are positioned relative to that base character and thus do not follow a simple linear progression in the final rendered text. For example, the Latin letter “x̣” is stored as “x” followed by combining “◌̣”; the accent appears below, not to the right of the base. This position with respect to the base holds even where the overall text progression is from top to bottom (for example, with “x̣” appearing upright within a vertical Japanese line). Characters may also combine into ligatures or conjuncts or otherwise change positions of their components radically, as shown in Figure 2-3 and Figure 2-19.
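The storage order of a base character and its combining mark can be inspected directly. This is a small sketch using Python's `unicodedata` module; the choice of U+0323 COMBINING DOT BELOW as the mark, and the helper name `describe`, are assumptions made for illustration.

```python
import unicodedata

# "x" followed by a combining mark that renders below the base.
text = "x\u0323"

def describe(s):
    """List (code point, name, canonical combining class) for each stored character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in s
    ]

for entry in describe(text):
    print(entry)
```

The string holds two code points in logical order, base first. A nonzero combining class marks the second one as a combining character whose position is resolved relative to the base at rendering time.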
There is one particular exception to the usual practice of logical order paralleling phonetic order. With the Thai, Lao, Tai Viet, and New Tai Lue scripts, users traditionally type in visual order rather than phonetic order, resulting in some vowel letters being stored ahead of consonants, even though they are pronounced after them.
Unification
The Unicode Standard avoids duplicate encoding of characters by unifying them within scripts across languages. Common letters are given one code each, regardless of language, as are common Chinese/Japanese/Korean (CJK) ideographs. (See Section 18.1, Han.)
Punctuation marks, symbols, and diacritics are handled in a similar manner as letters. If they can be clearly identified with a particular script, they are encoded once for that script and are unified across any languages that may use that script. See, for example, U+1362 ethiopic full stop, U+060F arabic sign misra, and U+0592 hebrew accent segol. However, some punctuation or diacritical marks may be shared in common across a number of scripts; the obvious example is Western-style punctuation characters, which are often recently added to the writing systems of scripts other than Latin. In such cases, characters are encoded only once and are intended for use with multiple scripts. Common symbols are also encoded only once and are not associated with any script in particular.
It is quite normal for many characters to have different usages, such as comma “,” for either thousands-separator (English) or decimal-separator (French). The Unicode Standard avoids duplication of characters due to specific usage in different languages; rather, it duplicates characters only to support compatibility with base standards. Avoidance of duplicate encoding of characters is important to avoid visual ambiguity.
There are a few notable instances in the standard where visual ambiguity between different characters is tolerated, however. For example, in most fonts there is little or no distinction visible between Latin “o”, Cyrillic “o”, and Greek “o” (omicron). These are not unified because they are characters from three different scripts, and many legacy character encodings distinguish between them. As another example, there are three characters whose glyph is the same uppercase barred D shape, but they correspond to three distinct lowercase forms. Unifying these uppercase characters would have resulted in unnecessary complications for case mapping.
The Unicode Standard does not attempt to encode features such as language, font, size, positioning, glyphs, and so forth. For example, it does not preserve language as a part of character encoding: just as French i grec, German ypsilon, and English wye are all represented by the same character code, U+0059 “Y”, so too are Chinese zi, Japanese ji, and Korean ja all represented as the same character code, U+5B57 字.
In determining whether to unify variant CJK ideograph forms across standards, the Unicode Standard follows the principles described in Section 18.1, Han. Where these principles determine that two forms constitute a trivial difference, the Unicode Standard assigns a single code. Just as for the Latin and other scripts, typeface distinctions or local preferences in glyph shapes alone are not sufficient grounds for disunification of a character. Figure 2-6 illustrates the well-known example of the CJK ideograph for “bone,” which shows significant shape differences from typeface to typeface, with some forms preferred in China and some in Japan. All of these forms are considered to be the same character, encoded at U+9AA8 in the Unicode Standard.
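Both examples of tolerated visual ambiguity can be checked against the character database. The sketch below assumes Python's `unicodedata` module. The text does not name the three barred D characters, so U+00D0, U+0110, and U+0189 are supplied here as an assumption (they are the commonly cited trio).

```python
import unicodedata

# Three visually similar letters from three different scripts.
lookalikes = ["o", "\u043E", "\u03BF"]
lookalike_names = [unicodedata.name(ch) for ch in lookalikes]

# Three uppercase letters that share a barred D glyph (assumed code points).
barred_d = ["\u00D0", "\u0110", "\u0189"]
barred_d_lower = [ch.lower() for ch in barred_d]

for ch, name in zip(lookalikes, lookalike_names):
    print(f"U+{ord(ch):04X} {name}")
for upper, lower in zip(barred_d, barred_d_lower):
    print(f"U+{ord(upper):04X} lowercases to U+{ord(lower):04X}")
```

Each uppercase barred D maps to a different lowercase code point, which is the case-mapping complication that unification would have introduced.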
![](https://img.haomeiwen.com/i15359095/3f54c67bb17ad402.png)
Many characters in the Unicode Standard could have been unified with existing visually similar Unicode characters or could have been omitted in favor of some other Unicode mechanism for maintaining the kinds of text distinctions for which they were intended. However, considerations of interoperability with other standards and systems often require that such compatibility characters be included in the Unicode Standard. See Section 2.3, Compatibility Characters. In particular, whenever font style, size, positioning or precise glyph shape carry a specific meaning and are used in distinction to the ordinary character (for example, in phonetic or mathematical notation), the characters are not unified.
Dynamic Composition
The Unicode Standard allows for the dynamic composition of accented forms and Hangul syllables. Combining characters used to create composite forms are productive. Because the process of character composition is open-ended, new forms with modifying marks may be created from a combination of base characters followed by combining characters. For example, the diaeresis “ ̈” may be combined with all vowels and a number of consonants in languages using the Latin script and several other scripts, as shown in Figure 2-7.
![](https://img.haomeiwen.com/i15359095/f6093f1c80677a25.png)
Equivalent Sequences. Some text elements can be encoded either as static precomposed forms or by dynamic composition. Common precomposed forms such as U+00DC “Ü” latin capital letter u with diaeresis are included for compatibility with current standards. For static precomposed forms, the standard provides a mapping to an equivalent dynamically composed sequence of characters. (See also Section 3.7, Decomposition.) Thus different sequences of Unicode characters are considered equivalent. A precomposed character may be represented as an equivalent composed character sequence (see Section 2.12, Equivalent Sequences).
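The equivalence between a precomposed form and its dynamically composed sequence can be observed through normalization. This is a minimal sketch using `unicodedata.normalize` from Python's standard library; the variable names and the choice of “n” plus diaeresis as a form with no precomposed character are illustrative.

```python
import unicodedata

precomposed = "\u00DC"  # LATIN CAPITAL LETTER U WITH DIAERESIS

# Canonical decomposition: one code point becomes base + combining mark.
decomposed = unicodedata.normalize("NFD", precomposed)

# Canonical composition: the sequence maps back to the precomposed form.
recomposed = unicodedata.normalize("NFC", "U\u0308")

# Dynamic composition is open-ended: this sequence has no precomposed
# form, so it stays a two-code-point sequence under NFC.
no_precomposed = unicodedata.normalize("NFC", "n\u0308")

print([f"U+{ord(c):04X}" for c in decomposed])
print([f"U+{ord(c):04X}" for c in recomposed])
print([f"U+{ord(c):04X}" for c in no_precomposed])
```

A plain `==` comparison treats the precomposed and decomposed strings as different, so code that needs to compare equivalent text usually normalizes both sides to the same form first.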
Stability
Certain aspects of the Unicode Standard must be absolutely stable between versions, so that implementers and users can be guaranteed that text data, once encoded, retains the same meaning. Most importantly, this means that once Unicode characters are assigned, their code point assignments cannot be changed, nor can characters be removed.
Characters are retained in the standard, so that previously conforming data stay conformant in future versions of the standard. Sometimes characters are deprecated; that is, their use in new documents is strongly discouraged. While implementations should continue to recognize such characters when they are encountered, spell-checkers or editors could warn users of their presence and suggest replacements. For more about deprecated characters, see D13 in Section 3.4, Characters and Encoding.
Unicode character names are also never changed, so that they can be used as identifiers that are valid across versions. See Section 4.8, Name.
Similar stability guarantees exist for certain important properties. For example, the decompositions are kept stable, so that it is possible to normalize a Unicode text once and have it remain normalized in all future versions.
The most current versions of the character encoding stability policies for the Unicode Standard are maintained online at: http://www.unicode.org/policies/stability_policy.html
Convertibility
Character identity is preserved for interchange with a number of different base standards, including national, international, and vendor standards. Where variant forms (or even the same form) are given separate codes within one base standard, they are also kept separate within the Unicode Standard. This choice guarantees the existence of a mapping between the Unicode Standard and base standards.
Accurate convertibility is guaranteed between the Unicode Standard and other standards in wide usage as of May 1993. Characters have also been added to allow convertibility to several important East Asian character sets created after that date, for example, GB 18030. In general, a single code point in another standard will correspond to a single code point in the Unicode Standard. Sometimes, however, a single code point in another standard corresponds to a sequence of code points in the Unicode Standard, or vice versa. Conversion between Unicode text and text in other character codes must, in general, be done by explicit table-mapping processes. (See also Section 5.1, Data Structures for Character Conversion.)
-
2.3 Compatibility Characters
Conceptually, compatibility characters are characters that would not have been encoded in the Unicode Standard except for compatibility and round-trip convertibility with other standards. Such standards include international, national, and vendor character encoding standards. For the most part, these are widely used standards that pre-dated Unicode, but because continued interoperability with new standards and data sources is one of the primary design goals of the Unicode Standard, additional compatibility characters are added as the situation warrants.
Compatibility characters can be contrasted with ordinary (or non-compatibility) characters in the standard: ones that are generally consistent with the Unicode text model and which would have been accepted for encoding to represent various scripts and sets of symbols, regardless of whether those characters also existed in other character encoding standards.
For example, in the Unicode model of Arabic text the logical representation of text uses basic Arabic letters. Rather than being directly represented in the encoded characters, the cursive presentation of Arabic text for display is determined in context by a rendering system. (See Section 9.2, Arabic.) However, some earlier character encodings for Arabic were intended for use with rendering systems that required separate characters for initial, medial, final, and isolated presentation forms of Arabic letters. To allow one-to-one mapping to these character sets, the Unicode Standard includes Arabic presentation forms as compatibility characters.
The purpose for the inclusion of compatibility characters like these is not to implement or emulate alternative text models, nor to encourage the use of plain text distinctions in characters which would otherwise be better represented by higher-level protocols or other mechanisms. Rather, the main function of compatibility characters is to simplify interoperability of Unicode-based systems with other data sources, and to ensure convertibility of data.
Interoperability does not require that all external characters can be mapped to single Unicode characters; encoding a compatibility character is not necessary when a character in another standard can be represented as a sequence of existing Unicode characters. For example, the Shift-JIS encoding 0x839E for JIS X 0213 katakana letter ainu to can simply be mapped to the Unicode character sequence <U+30C8, U+309A>. However, in cases where no appropriate mapping is available, the requirement for interoperability and convertibility may be met by encoding a compatibility character for one-to-one mapping to another standard.
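The one-to-sequence mapping in this example can be reproduced with a table-driven converter. The sketch below assumes that Python's built-in `shift_jis` and `shift_jisx0213` codecs are acceptable stand-ins for the mapping tables discussed in the text; the helper name `decode_to_code_points` is hypothetical.

```python
def decode_to_code_points(data, codec):
    """Decode legacy-encoded bytes and list the resulting Unicode code points."""
    return [f"U+{ord(ch):04X}" for ch in data.decode(codec)]

# The usual case: one legacy code maps to one Unicode code point.
print(decode_to_code_points(b"\x83\x41", "shift_jis"))

# The case from the text: one legacy code maps to a sequence of
# two Unicode code points (base letter plus combining mark).
print(decode_to_code_points(b"\x83\x9e", "shift_jisx0213"))
```

Because one input code can produce more than one output code point, a converter cannot assume that character counts are preserved across the conversion.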
Usage. The fact that a particular character is considered a compatibility character does not mean that that character is deprecated in the standard. The use of most compatibility characters in general text interchange is unproblematic. Some, however, such as the Arabic positional forms or other compatibility characters which assume information about particular layout conventions, such as presentation forms for vertical text, can lead to problems when used in general interchange. Caution is advised for their use. See also the discussion of compatibility characters in the W3C specification, “Unicode in XML and Other Markup Languages.”
Allocation. The Compatibility and Specials Area contains a large number of compatibility characters, but the Unicode Standard also contains many compatibility characters that do not appear in that area. These include examples such as U+2163 “Ⅳ” roman numeral four, U+2007 figure space, U+00B2 “²” superscript two, U+2502 box drawings light vertical, and U+32D0 circled katakana a.
There is no formal listing of all compatibility characters in the Unicode Standard. This follows from the nature of the definition of compatibility characters. It is a judgement call as to whether any particular character would have been accepted for encoding if it had not been required for interoperability with a particular standard. Different participants in character encoding often disagree about the appropriateness of encoding particular characters, and sometimes there are multiple justifications for encoding a given character.
Compatibility Variants
Compatibility variants are a subset of compatibility characters, and have the further characteristic that they represent variants of existing, ordinary, Unicode characters. For example, compatibility variants might represent various presentation or styled forms of basic letters: superscript or subscript forms, variant glyph shapes, or vertical presentation forms. They also include halfwidth or fullwidth characters from East Asian character encoding standards, Arabic contextual form glyphs from preexisting Arabic code pages, Arabic ligatures and ligatures from other scripts, and so on. Compatibility variants also include CJK compatibility ideographs, many of which are minor glyph variants of an encoded unified CJK ideograph.
In contrast to compatibility variants there are the numerous compatibility characters, such as U+2502 box drawings light vertical, U+263A white smiling face, or U+2701 upper blade scissors, which are not variants of ordinary Unicode characters. However, it is not always possible to determine unequivocally whether a compatibility character is a variant or not.
Compatibility Decomposable Characters
The term compatibility is further applied to Unicode characters in a different, strictly defined sense. The concept of a compatibility decomposable character is formally defined as any Unicode character whose compatibility decomposition is not identical to its canonical decomposition. (See Definition D66 in Section 3.7, Decomposition, and the discussion in Section 2.2, Unicode Design Principles.)
The list of compatibility decomposable characters is precisely defined by property values in the Unicode Character Database, and by the rules of Unicode Normalization. (See Section 3.11, Normalization Forms.) Because of their use in Unicode Normalization, compatibility decompositions are stable and cannot be changed once a character has been encoded; the list of compatibility decomposable characters for any version of the Unicode Standard is thus also stable.
Compatibility decomposable characters have also been referred to in earlier versions of the Unicode Standard as compatibility composite characters or compatibility composites for short, but the full term, compatibility decomposable character, is preferred.
Compatibility Character Vs. Compatibility Decomposable Character. In informal discussions of the Unicode Standard, compatibility decomposable characters have also often been referred to simply as “compatibility characters.” This is understandable, in part because the two sets of characters largely overlap, but the concepts are actually distinct. There are compatibility characters which are not compatibility decomposable characters, and there are compatibility decomposable characters which are not compatibility characters.
For example, the deprecated alternate format characters such as U+206C inhibit arabic form shaping are considered compatibility characters, but they have no decomposition mapping, and thus by definition cannot be compatibility decomposable characters. Likewise for such other compatibility characters as U+2502 box drawings light vertical or U+263A white smiling face.
There are also instances of compatibility variants which clearly are variants of other Unicode characters, but which have no decomposition mapping. For example, U+2EAF cjk radical silk is a compatibility variant of U+2F77 kangxi radical silk, as well as being a compatibility variant of U+7CF9 cjk unified ideograph-7cf9, but has no compatibility decomposition. The numerous compatibility variants like this in the CJK Radicals Supplement block were encoded for compatibility with encodings that distinguished and separately encoded various forms of CJK radicals as symbols.
A different case is illustrated by the CJK compatibility ideographs, such as U+FA0C cjk compatibility ideograph-fa0c. Those compatibility characters have a decomposition mapping, but for historical reasons it is always a canonical decomposition, so they are canonical decomposable characters, but not compatibility decomposable characters.
By way of contrast, some compatibility decomposable characters, such as modifier letters used in phonetic orthographies, for example, U+02B0 modifier letter small h, are not considered to be compatibility characters. They would have been accepted for encoding in the standard on their own merits, regardless of their need for mapping to IPA. A large number of compatibility decomposable characters like this are actually distinct symbols used in specialized notations, whether phonetic or mathematical. In such cases, their compatibility mappings express their historical derivation from styled forms of standard letters.
Other compatibility decomposable characters are widely used characters serving essential functions. U+00A0 no-break space is one example. In these and similar cases, such as fixed-width space characters, the compatibility decompositions define possible fallback representations.
The Unicode Character Database supplies identification and mapping information only for compatibility decomposable characters, while compatibility variants are not formally identified or documented. Because the two sets substantially overlap, many specifications are written in terms of compatibility decomposable characters first; if necessary, such specifications may be extended to handle other, non-decomposable compatibility variants as required. (See also the discussion in Section 5.19, Mapping Compatibility Variants.)
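The formal category can be read straight from the decomposition mapping. The sketch below relies on Python's `unicodedata.decomposition`, which returns the mapping as a string and prefixes compatibility mappings with a tag in angle brackets; the helper name `decomposition_kind` and its three labels are hypothetical. It classifies only the strictly defined property and cannot tell whether a character is a compatibility character in the informal sense.

```python
import unicodedata

def decomposition_kind(ch):
    """Classify a character by its decomposition mapping in the character database."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings carry a tag such as <super> or <noBreak>.
    return "compatibility" if mapping.startswith("<") else "canonical"

# Characters cited in this section.
examples = ["\u00B2", "\u00A0", "\u02B0", "\uFA0C", "\u2502", "\u2EAF", "\u206C"]
for ch in examples:
    mapping = unicodedata.decomposition(ch) or "(no mapping)"
    print(f"U+{ord(ch):04X} {decomposition_kind(ch):13} {mapping}")
```

The output reflects the distinctions drawn above: U+02B0 and U+00A0 are compatibility decomposable, U+FA0C is only canonically decomposable, and U+2502, U+2EAF, and U+206C have no mapping at all.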
-
2.4 Code Points and Characters
On a computer, abstract characters are encoded internally as numbers. To create a complete character encoding, it is necessary to define the list of all characters to be encoded and to establish systematic rules for how the numbers represent the characters.
The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.
In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF₁₆, comprising 1,114,112 code points available for assigning the repertoire of abstract characters. There are constraints on how the codespace is organized, and particular areas of the codespace have been set aside for encoding of certain kinds of abstract characters or for other uses in the standard. For more on the allocation of the Unicode codespace, see Section 2.8, Unicode Allocation.
Figure 2-8 illustrates the relationship between abstract characters and code points, which together constitute encoded characters. Note that some abstract characters may be associated with multiple, separately encoded characters (that is, be encoded “twice”). In other instances, an abstract character may be represented by a sequence of two (or more) other encoded characters. The solid arrows connect encoded characters with the abstract characters that they represent and encode.
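The size of the codespace follows directly from its bounds. A short sketch in Python, where `CODESPACE_MAX` and `in_codespace` are names introduced here for illustration:

```python
import sys

CODESPACE_MAX = 0x10FFFF

# Code points run from 0 to 0x10FFFF inclusive.
code_point_count = CODESPACE_MAX + 1

def in_codespace(value):
    """Return True if the integer is a valid Unicode code point."""
    return 0 <= value <= CODESPACE_MAX

print(code_point_count)
print(sys.maxunicode == CODESPACE_MAX)
print(in_codespace(0x10FFFF), in_codespace(0x110000))
```

Python's own string type uses the same bound: `chr()` accepts every integer in the codespace and raises `ValueError` for anything above it.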
[Figure 2-8]
When referring to code points in the Unicode Standard, the usual practice is to refer to them by their numeric value expressed in hexadecimal, with a “U+” prefix. (See Appendix A, Notational Conventions.) Encoded characters can also be referred to by their code points only. To prevent ambiguity, the official Unicode name of the character is often added; this clearly identifies the abstract character that is encoded. For example:

U+0061 latin small letter a
U+10330 gothic letter ahsa
U+201DF cjk unified ideograph-201df

Such citations refer only to the encoded character per se, associating the code point (as an integral value) with the abstract character that is encoded.
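Citations in this style can be generated from a character. The following sketch uses Python's `unicodedata` module; the helper name `cite` is hypothetical, and the names come back in uppercase, which is how the character database records them.

```python
import unicodedata

def cite(ch):
    """Format a character as 'U+XXXX NAME', using at least four hex digits."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for ch in ["a", "\U00010330", "\U000201DF"]:
    print(cite(ch))

# Names also work in the other direction, as stable identifiers.
print(unicodedata.lookup("GOTHIC LETTER AHSA") == "\U00010330")
```

The `:04X` format pads short values to four digits and leaves longer ones unpadded, which matches the citations shown in the text.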
Types of Code Points
There are many ways to categorize code points. Table 2-3 illustrates some of the categorizations and basic terminology used in the Unicode Standard. The seven basic types of code points are formally defined in Section 3.4, Characters and Encoding. (See Definition D10a, Code Point Type.)
[Table 2-3]
Not all assigned code points represent abstract characters; only Graphic, Format, Control and Private-use do. Surrogates and Noncharacters are assigned code points but are not assigned to abstract characters. Reserved code points are assignable: any may be assigned in a future version of the standard. The General Category provides a finer breakdown of Graphic characters and also distinguishes between the other basic types (except between Noncharacter and Reserved). Other properties defined in the Unicode Character Database provide for different categorizations of Unicode code points.
Control Codes. Sixty-five code points (U+0000..U+001F and U+007F..U+009F) are defined specifically as control codes, for compatibility with the C0 and C1 control codes of the ISO/IEC 2022 framework. A few of these control codes are given specific interpretations by the Unicode Standard. (See Section 23.1, Control Codes.)
Noncharacters. Sixty-six code points are not used to encode characters. Noncharacters consist of U+FDD0..U+FDEF and any code point ending in the value FFFE₁₆ or FFFF₁₆, that is, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF. (See Section 23.7, Noncharacters.)
Private Use. Three ranges of code points have been set aside for private use. Characters in these areas will never be defined by the Unicode Standard. These code points can be freely used for characters of any purpose, but successful interchange requires an agreement between sender and receiver on their interpretation. (See Section 23.5, Private-Use Characters.)
Surrogates. Some 2,048 code points have been allocated as surrogate code points, which are used in the UTF-16 encoding form. (See Section 23.6, Surrogates Area.)
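The counts quoted for these code point types can be recomputed by scanning the codespace. This is a rough sketch, assuming Python's `unicodedata.category` as the source of General Category values (Cc for controls, Cs for surrogates, Co for private use). Noncharacters share a General Category with reserved code points, so they are identified by the rule stated in the text; `is_noncharacter` is a helper name introduced here.

```python
import unicodedata

def is_noncharacter(cp):
    """U+FDD0..U+FDEF, or any code point whose low 16 bits are FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

counts = {"control": 0, "surrogate": 0, "private_use": 0, "noncharacter": 0}
for cp in range(0x110000):
    category = unicodedata.category(chr(cp))
    if category == "Cc":
        counts["control"] += 1
    elif category == "Cs":
        counts["surrogate"] += 1
    elif category == "Co":
        counts["private_use"] += 1
    if is_noncharacter(cp):
        counts["noncharacter"] += 1

print(counts)
```

The scan touches all 1,114,112 code points, so it takes a noticeable fraction of a second; a real implementation would more likely use range tables.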
Restricted Interchange. Code points that are not assigned to abstract characters are subject to restrictions in interchange.
• Surrogate code points cannot be conformantly interchanged using Unicode encoding forms. They do not correspond to Unicode scalar values and thus do not have well-formed representations in any Unicode encoding form. (See Section 3.8, Surrogates.)
• Noncharacter code points are reserved for internal use, such as for sentinel values. They have well-formed representations in Unicode encoding forms and survive conversions between encoding forms. This allows sentinel values to be preserved internally across Unicode encoding forms, even though they are not designed to be used in open interchange.
• All implementations need to preserve reserved code points because they may originate in implementations that use a future version of the Unicode Standard. For example, suppose that one person is using a Unicode 12.0 system and a second person is using a Unicode 11.0 system. The first person sends the second person a document containing some code points newly assigned in Unicode 12.0; these code points were unassigned in Unicode 11.0. The second person may edit the document, not changing the reserved codes, and send it on. In that case the second person is interchanging what are, as far as the second person knows, reserved code points.
Code Point Semantics. The semantics of most code points are established by this standard; the exceptions are Controls, Private-use, and Noncharacters. Control codes generally have semantics determined by other standards or protocols (such as ISO/IEC 6429), but there are a small number of control codes for which the Unicode Standard specifies particular semantics. See Table 23-1 in Section 23.1, Control Codes, for the exact list of those control codes. The semantics of private-use characters are outside the scope of the Unicode Standard; their use is determined by private agreement, as, for example, between vendors. Noncharacters have semantics in internal use only.
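The three restrictions listed above can be observed with ordinary encoders. The sketch below uses Python's built-in UTF-8, UTF-16, and UTF-32 codecs as an example of conformant encoding forms; the helper name `survives_all_encoding_forms` is hypothetical, and U+0378 is used as a sample reserved code point on the assumption that it is unassigned in the version at hand (it is unassigned in Unicode 12.0).

```python
def survives_all_encoding_forms(ch):
    """Return True if the code point round-trips through UTF-8, UTF-16, and UTF-32."""
    for codec in ("utf-8", "utf-16-le", "utf-32-le"):
        try:
            if ch.encode(codec).decode(codec) != ch:
                return False
        except UnicodeError:
            # Raised for code points with no well-formed representation.
            return False
    return True

print(survives_all_encoding_forms("\uD800"))  # lone surrogate code point
print(survives_all_encoding_forms("\uFFFF"))  # noncharacter
print(survives_all_encoding_forms("\u0378"))  # reserved (unassigned) code point
```

The encoders reject the surrogate but pass the noncharacter and the reserved code point through unchanged, which is the behavior an implementation needs in order to preserve data from newer versions of the standard.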