How to determine the encoding of

Author: __XY__ | Published: 2020-07-01 12:11

https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text

Correctly detecting the encoding every time is impossible.

(From chardet FAQ:)
However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
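To make the idea of learned "fluency" concrete, here is a toy sketch. It is not chardet's actual algorithm: it simply counts two-character sequences in a tiny sample text and scores new text by how familiar its sequences look. The names `SAMPLE`, `ASSUMED_PAIR_COUNT`, `char_pairs`, and `fluency` are made up for this illustration, and a real detector would train on far more text.

```python
import math
from collections import Counter

# Tiny stand-in for "lots of typical text" (assumption: a real detector
# trains on a far larger corpus).
SAMPLE = (
    "some character sequences pop up all the time while other sequences "
    "make no sense a person fluent in english who opens a newspaper will "
    "instantly recognize that the text is or is not english"
)

# Rough assumption: 26 letters plus space, so 27 * 27 possible pairs.
ASSUMED_PAIR_COUNT = 27 * 27


def char_pairs(text: str) -> list:
    """Return all overlapping two-character sequences, lowercased."""
    lowered = text.lower()
    return [lowered[i:i + 2] for i in range(len(lowered) - 1)]


PAIR_COUNTS = Counter(char_pairs(SAMPLE))
TOTAL_PAIRS = sum(PAIR_COUNTS.values())


def fluency(text: str) -> float:
    """Average log-probability per character pair, with add-one smoothing."""
    pairs = char_pairs(text)
    if not pairs:
        return float("-inf")
    log_probs = [
        math.log((PAIR_COUNTS[p] + 1) / (TOTAL_PAIRS + ASSUMED_PAIR_COUNT))
        for p in pairs
    ]
    return sum(log_probs) / len(log_probs)


if __name__ == "__main__":
    print(fluency("some sequences make no sense"))
    print(fluency("txzqJv 2!dasd0a QqdKjvz"))
```

With this sample, the first phrase should score higher than the gibberish string, because all of its character pairs were seen during "training" while almost none of the gibberish pairs were.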

The chardet library for Python uses that kind of study to try to detect the encoding. chardet is a port of the auto-detection code in Mozilla.
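A minimal sketch of calling chardet might look like the following. It assumes chardet is installed (`pip install chardet`); the wrapper name `guess_encoding` is hypothetical, and the guarded import is only there so the snippet still loads when the library is missing. `chardet.detect` returns a dictionary that includes an `encoding` and a `confidence` entry, and the result is a guess, not a guarantee.

```python
try:
    import chardet  # third-party; install with: pip install chardet
except ImportError:  # keep the sketch importable without the library
    chardet = None


def guess_encoding(raw: bytes):
    """Return (encoding, confidence), or (None, 0.0) if no guess is possible."""
    if chardet is None or not raw:
        return None, 0.0
    result = chardet.detect(raw)  # dict with 'encoding' and 'confidence' keys
    return result["encoding"], float(result["confidence"])


if __name__ == "__main__":
    data = "Привет, мир. Это пример текста.".encode("utf-8")
    encoding, confidence = guess_encoding(data)
    print(encoding, confidence)
```

Short inputs give the detector little to work with, so it is worth checking the confidence value before trusting the reported encoding.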

You can also use UnicodeDammit, which ships with Beautiful Soup. It will try the following methods, in order:

1. An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
2. An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
3. An encoding sniffed by the chardet library, if you have it installed.
4. UTF-8
5. Windows-1252
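The fallback order above can be approximated with the standard library alone. The sketch below is not Beautiful Soup's implementation: the function name `decode_with_fallbacks` and its `declared` parameter are hypothetical, it does not parse XML or HTML declarations itself, it skips the chardet and EBCDIC steps, and its byte sniffing only looks for byte-order marks.

```python
import codecs

# Byte-order marks, with UTF-32 first so that its little-endian mark
# (which begins with the UTF-16 one) is not mistaken for UTF-16.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_fallbacks(data: bytes, declared=None):
    """Try candidate encodings in priority order; return (text, encoding)."""
    candidates = []
    if declared:  # e.g. taken by the caller from an XML declaration
        candidates.append(declared)
    for bom, name in BOMS:  # sniff the first few bytes
        if data.startswith(bom):
            candidates.append(name)
            break
    candidates += ["utf-8", "windows-1252"]  # last-resort guesses

    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    # A few byte values are undefined in Windows-1252, so replace them.
    return data.decode("windows-1252", errors="replace"), "windows-1252"


if __name__ == "__main__":
    print(decode_with_fallbacks("h\u00e9llo".encode("utf-8")))
    print(decode_with_fallbacks("h\u00e9llo".encode("cp1252")))
```

With Beautiful Soup installed, the real thing is available as `from bs4 import UnicodeDammit`; `UnicodeDammit(data).unicode_markup` holds the decoded text and `.original_encoding` reports which encoding was chosen.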

