python--opencc: problems encountered during use

Author: JieJ | Source: published 2015-11-09 19:57, read 2770 times

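I use the python-opencc module to convert between Traditional and Simplified Chinese. Here the conversion goes from Traditional to Simplified. Sample code: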

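import opencc

# fname: path of the input text; target: path of the output file (both defined elsewhere)
cc = opencc.OpenCC('mix2s')     # mix2s - Mixed to Simplified Chinese
f = open(target, 'w')
for line in open(fname).readlines():
    # Python 2: decode the raw bytes to unicode and strip the trailing newline
    l = line.decode('utf8', 'ignore').rstrip(u'\n')
    f.write(cc.convert(l) + u'\n')
f.close()
# Count the lines of the converted file
print len(open(target).readlines())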

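When the converted text is read back with readlines(), the length is only 1, i.e. the line breaks are gone.
I first looked at the source of the convert function in the opencc module.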

def convert(self, text):
    """Convert text """
    proc = subprocess.Popen([self.opencc_path, '-c', self.confg], 
                            cwd=self.data_path,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    proc.stdin.write(text.encode('utf8'))
    proc.stdin.close()
    code = proc.wait()
    if code:
        raise RuntimeError('Failed to call opencc with exit code %s' % code)
    result = proc.stdout.read() 
    return result.decode('utf8')
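The convert source above follows a common pattern: start a process, write text to its stdin, read its stdout. Below is a minimal sketch of the same pattern in Python 3, not the module's actual code. `run_filter` and `ECHO_CMD` are hypothetical names, and a small Python one-liner stands in for the opencc binary so the sketch runs without OpenCC installed. It uses `communicate()` instead of calling `wait()` before `read()`, because waiting first can block when the child produces more output than the pipe buffer holds.

```python
import subprocess
import sys


def run_filter(cmd, text):
    """Send text to cmd on stdin and return its decoded stdout."""
    proc = subprocess.Popen(cmd,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    # communicate() writes stdin, reads stdout until EOF, then waits for exit
    out, _ = proc.communicate(text.encode('utf8'))
    if proc.returncode:
        raise RuntimeError('Command failed with exit code %s' % proc.returncode)
    return out.decode('utf8')


# Stand-in command (an assumption, not opencc): copies stdin to stdout byte for byte
ECHO_CMD = [sys.executable, '-c',
            'import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())']
```

Calling `run_filter(ECHO_CMD, u'漢字\n')` returns the text unchanged, which makes it easy to check that line breaks survive the round trip through the pipe.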

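As the source shows, the input is a unicode string, and after the conversion the function still returns a unicode string. So I first tried encoding the result as utf8 when writing it to the file, instead of writing the unicode string directly: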

l=cc.convert(l).encode('utf8','ignore')
f.write(l+'\n')
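An alternative way to keep the encoding step explicit is to let `io.open` do the decoding and encoding, so the loop only ever handles text. This is a sketch under assumptions rather than the code used in this post: `convert_file` is a hypothetical helper, and its `convert` argument is a stand-in for `cc.convert` (any callable that takes and returns text).

```python
import io


def convert_file(src_path, dst_path, convert):
    """Convert src_path line by line and write the result as UTF-8.

    Returns the number of lines written.
    """
    count = 0
    # io.open decodes on read and encodes on write, so only text is handled here
    with io.open(src_path, encoding='utf-8', errors='ignore') as src, \
            io.open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            text = line.rstrip(u'\n')
            # Re-append the newline that was stripped before conversion
            dst.write(convert(text) + u'\n')
            count += 1
    return count
```

With the module from this post, the call would look like `convert_file(fname, target, cc.convert)`; comparing the returned count with the line count of the output file is a quick check that no line breaks were lost.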

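With that, the problem is basically solved.
However, this conversion approach is very slow. I am not sure whether using OpenCC's C++ code directly would be any better.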
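One possible reason for the slowness is visible in the convert source quoted earlier: every call starts a new opencc process, so converting line by line spawns one process per line. A sketch of a workaround, untested against OpenCC itself, is to join many lines into one string and convert them in a single call. `convert_lines` is a hypothetical helper, `convert` stands in for `cc.convert`, and the approach assumes the converter keeps line breaks intact; when the line count does not match, the sketch falls back to one call per line.

```python
def convert_lines(lines, convert, batch_size=1000):
    """Convert a list of text lines using one convert() call per batch.

    Each line is expected to have its trailing newline already stripped.
    """
    out = []
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        converted = convert(u'\n'.join(batch)).split(u'\n')
        if len(converted) != len(batch):
            # The converter changed the line structure: fall back to per-line calls
            converted = [convert(line) for line in batch]
        out.extend(converted)
    return out
```

Whether this actually helps depends on how much of the time goes into starting processes, so it is worth timing both variants on a sample file before relying on it.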

Article link: https://www.haomeiwen.com/subject/fcoqhttx.html
