An Introduction to Character Encoding in Python

Author: alan787 | Published: 2018-05-27 08:56


Character strings and byte strings

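Python 2 has two related classes, both subclasses of basestring:

unicode: a character string. It comes from a u'中国' literal in source code, from external data such as database query results or text that has been read in, or from decoding a byte string: '\xe4\xb8\xad'.decode('utf8') yields u'中'.
str: a byte string. It comes from a plain '中国' literal in source code, or from encoding a character string, for example: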

>>> u'中'.encode('gbk')
'\xd6\xd0'
>>> u'中'.encode('utf8')
'\xe4\xb8\xad'
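
The same conversions can be checked as a small script. This is a minimal sketch that assumes Python 2.7 or Python 3.3+ (in Python 3 the character type is called str and the byte type bytes). The character is written as the escape \u4e2d so the script needs no coding declaration, and the variable names are illustrative only.

```python
text = u'\u4e2d'                       # the character '中'

gbk_bytes = text.encode('gbk')         # character string -> byte string
utf8_bytes = text.encode('utf-8')

# Decoding only round-trips when the codec matches the one used to encode.
assert gbk_bytes.decode('gbk') == text
assert utf8_bytes.decode('utf-8') == text

# Decoding GBK bytes as UTF-8 is a mismatch and raises UnicodeDecodeError.
try:
    gbk_bytes.decode('utf-8')
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

print('%r %r %s' % (gbk_bytes, utf8_bytes, mismatch_detected))
```

The two encodings produce different byte sequences for the same character, which is why a byte string is only meaningful together with the name of its encoding.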

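Conversion diagram between unicode and str

[Figure: python_stringunicode_encode_decode_3.png]

Note: the green lines in the figure mark the conversions we normally use; the conversions marked in red are also legal in Python 2.x.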

The different encodings explained

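System encoding. On Unix/Linux, check it with the locale command; on Windows ...

The system encoding affects how Python encodes and decodes in the interactive console (REPL). sys.stdin.encoding and sys.stdout.encoding show which default codecs the Python console inherited from the system.
Default string encoding. sys.getdefaultencoding(); the official documentation describes it as: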

Return the name of the current default string encoding used by the Unicode implementation.

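In other words, whenever a byte string is converted to a character string without an explicitly named encoding, the default encoding is used, for example:
- when a source file has no explicit #coding: xxx declaration and its bytes are parsed (strictly, that default is fixed at ASCII by PEP 263 and does not follow sys.getdefaultencoding(); the two simply coincide in Python 2)
- when bytes read from a file need to be converted to a character string
- when the bytes returned by a network request are converted to a character string
  
  It can be changed with reload(sys); sys.setdefaultencoding('utf8').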
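
Relying on the default encoding makes these conversions implicit. One hedged alternative is to name the codec at every byte-to-character boundary. The sketch below assumes Python 2.7 or 3.x; to_text is a hypothetical helper introduced for illustration, not a standard library function.

```python
import sys


def to_text(data, encoding='utf-8'):
    """Return `data` as a character string, decoding bytes explicitly.

    Hypothetical helper: byte strings are decoded with `encoding`,
    character strings are returned unchanged.
    """
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


# 'ascii' on an unmodified Python 2, 'utf-8' on Python 3
print(sys.getdefaultencoding())
print(repr(to_text(b'\xe4\xb8\xad')))
```

Because the codec is passed in by the caller, the result does not change if some other module alters the process-wide default.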
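Source code encoding. Declare it explicitly with #coding: utf-8 within the first two lines of the file to tell the Python parser how the file is encoded. The declared encoding should match the file's actual encoding. If none is given, PEP 263 says ASCII is used: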

Python will default to ASCII as standard encoding if no other encoding hints are given.

PEP 263 describes how Python processes the file: Python's tokenizer/compiler combo will need to be updated to work as follows...

A. read the file

B. decode it into Unicode assuming a fixed per-file encoding

C. convert it into a UTF-8 byte string

D. tokenize the UTF-8 content

E. compile it, creating Unicode objects from the given Unicode data and creating string objects from the Unicode literal data by first reencoding the UTF-8 data into 8-bit string data using the given file encoding
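
Steps A and B can be sketched in a few lines. This is an illustration only, not the interpreter's actual implementation: the regular expression is modelled on the pattern given in PEP 263, the function names are hypothetical, and the fallback of 'ascii' mirrors the Python 2 behaviour described here (Python 3 falls back to UTF-8 instead).

```python
import re

# Modelled on the magic-comment pattern in PEP 263.
CODING_RE = re.compile(br'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def declared_encoding(source_bytes, default='ascii'):
    """Return the encoding named in the first two lines, else `default`."""
    for line in source_bytes.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode('ascii')
    return default


def decode_source(source_bytes):
    """Steps A-B: take the raw file bytes and decode them with the per-file encoding."""
    return source_bytes.decode(declared_encoding(source_bytes))


with_header = b"#!/usr/bin/env python\n# coding: utf-8\ns = u'\xc3\xa9'\n"
print(declared_encoding(with_header))
print(repr(decode_source(with_header)))
```

Without the declaration, declared_encoding falls back to 'ascii' and decode_source raises UnicodeDecodeError on the first non-ASCII byte.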
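File encoding. The encoding of the file itself: the file's contents are stored as bytes according to this encoding.
Input/output encoding. sys.stdin.encoding and sys.stdout.encoding

In the console, the input/output encoding is inherited from the system encoding. On Linux, view it with locale, or set the locale with export.
If a string ...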
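
To see which of these encodings are in effect for a given interpreter, something like the following can be run. It is a small sketch assuming Python 2.7 or 3.x; describe_encodings is a hypothetical helper, and the values printed depend entirely on the machine and locale.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings discussed above into a dict."""
    return {
        'default': sys.getdefaultencoding(),
        # encoding can be None when a stream is not attached to a terminal
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
        'locale': locale.getpreferredencoding(),
    }


for name, value in sorted(describe_encodings().items()):
    print('%-8s %s' % (name, value))
```

Running it once normally and once after export LC_ALL="C" is a quick way to observe how the console encodings follow the locale.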

Summary of common errors

No source encoding declared

Take a file p233.py, saved as UTF-8:

#!/usr/bin/env python
s = u'é'  # the error occurs even if this line is commented out
print s

Running python p233.py fails with:

SyntaxError: Non-ASCII character '\xc3' in file /Users/yuanzhou/code/PycharmProjects/untitled/test2.py on line 3, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Explanation:

>>> u'é'
u'\xe9'
>>> u'é'.encode("utf-8")
'\xc3\xa9'

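A: Python first reads the file as a stream of bytes. Because the file is UTF-8 encoded, the stream contains the two bytes \xc3\xa9. B: Python then uses the default encoding, ASCII, to convert the bytes into its internal Unicode representation. ASCII only covers bytes 0x00-0x7f, so the error is raised.

If you add #coding: utf8 at the top of p233.py (within the first two lines) and run it again, step B uses UTF-8 to convert the file's bytes into Python's internal Unicode representation.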
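
The failure in step B can be reproduced outside the parser by decoding the same bytes by hand. A minimal sketch, assuming Python 2.7 or 3.x; file_bytes stands in for the contents of the source file and the variable names are illustrative.

```python
# Bytes of a one-line UTF-8 source file containing the character 'é'.
file_bytes = u"s = u'\u00e9'\n".encode('utf-8')

failed_offset = None
bad_byte = None
try:
    file_bytes.decode('ascii')          # what happens without a declaration
except UnicodeDecodeError as exc:
    failed_offset = exc.start           # index of the first undecodable byte
    bad_byte = file_bytes[exc.start:exc.start + 1]

decoded = file_bytes.decode('utf-8')    # what happens with #coding: utf8

print('ascii failed at offset %s on byte %r' % (failed_offset, bad_byte))
print(repr(decoded))
```

The byte reported here, \xc3, is the same one named in the SyntaxError message above.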

The locale affects Python console input and output

After export LC_ALL="C", locale prints:

LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"

Now run the following in the console:

>>> import sys
>>> a=unichr(233)    # the Latin character 'é'
>>> a, type(a)
(u'\xe9', <type 'unicode'>)
>>> print sys.stdout.encoding
US-ASCII
>>> print a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>> a.encode("utf-8")
'\xc3\xa9'
>>> print a.encode("utf-8")
é

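The error is a UnicodeEncodeError. After reading the console input, print sends the result to stdout. Because a is a unicode object and stdout.encoding is ASCII, print effectively calls a.encode("ascii") to get a byte string; ASCII only covers 0-127, so the exception is raised. print a.encode("utf-8") instead outputs the bytes \xc3\xa9. They display correctly in the SSH terminal because the terminal's encoding is set to UTF-8, so it can decode the bytes back into the Unicode character.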
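
The role of stdout.encoding can be imitated with an in-memory stream, which makes the behaviour reproducible regardless of the real locale. This is a sketch assuming Python 2.7 or 3.x; write_text is a hypothetical helper, and io.TextIOWrapper stands in for the console stream.

```python
import io


def write_text(text, encoding, errors='strict'):
    """Write `text` to an in-memory stream that encodes like a console would."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


a = u'\xe9'  # the character 'é'

try:
    write_text(a, 'ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False                    # same failure as print a under LC_ALL=C

utf8_output = write_text(a, 'utf-8')
escaped_output = write_text(a, 'ascii', errors='backslashreplace')

print('%s %r %r' % (ascii_ok, utf8_output, escaped_output))
```

The errors argument shows a third option besides failing or switching codec: an ASCII stream can still emit an escaped form of the character.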

Exception when writing a character string to a file

>>> a = u'中'
>>> f = open("test.txt", "w")
>>> f.write(a)     # fails, because sys.getdefaultencoding() is ascii
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u4e2d' in position 0: ordinal not in range(128)
>>> f.write(a.encode("utf-8"))  # succeeds
>>> import sys; reload(sys)
>>> sys.setdefaultencoding("utf-8")
>>> f.write(a)   # succeeds
>>> f.close()
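
Changing the process-wide default affects every implicit conversion in the program, so an alternative worth considering is to open the file with an explicit encoding and let the file object do the encoding. A minimal sketch, assuming Python 2.7 or 3.x, where io.open accepts an encoding argument; the temporary file only keeps the example self-contained.

```python
import io
import os
import tempfile

text = u'\u4e2d'                        # the character '中'

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # The file object encodes on write, so no default encoding is involved.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    with io.open(path, 'rb') as f:      # raw bytes as stored on disk
        raw = f.read()

    with io.open(path, 'r', encoding='utf-8') as f:
        round_trip = f.read()           # decoded back to a character string
finally:
    os.remove(path)

print('%r %r' % (raw, round_trip == text))
```

The bytes on disk are the same ones produced by a.encode("utf-8") in the transcript above.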

TO BE CONTINUED

File name encoding: strings read from a database are unicode, and open(u, "wb") raises an exception.

