Peking University open-sources a new Chinese word segmentation toolkit: accuracy far exceeds THULAC and jieba


Author: 不爱吃饭的小孩怎么办 | Published 2019-12-12 15:27

    https://www.jianshu.com/p/3d9cd356da1a
    https://www.jianshu.com/p/528e46284cbc

    (nlp) spring@ubuntu18:~$ pip install pkuseg
    Looking in indexes: https://mirrors.aliyun.com/pypi/simple
    Collecting pkuseg
      Downloading https://mirrors.aliyun.com/pypi/packages/36/d8/2cd2d21fc960815d4bb521e1e2e2f725c0e4d1ab88cefa4c73520cd84825/pkuseg-0.0.22-cp36-cp36m-manylinux1_x86_64.whl (50.2MB)
         |████████████████████████████████| 50.2MB 1.9MB/s 
    Requirement already satisfied: numpy>=1.16.0 in ./anaconda3/envs/nlp/lib/python3.6/site-packages (from pkuseg) (1.17.4)
    Installing collected packages: pkuseg
    Successfully installed pkuseg-0.0.22
    (nlp) spring@ubuntu18:~$ python
    Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) 
    [GCC 7.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pkuseg
    >>> seg = pkuseg.pkuseg()
    >>> text = seg.cut('我爱杭州西湖')
    >>> print(text)
    ['我', '爱', '杭州', '西湖']
    >>> text = seg.cut('我叫马化腾,我想学区块链,你说好不好啊,天青色等烟雨,而我在 等你,月色被打捞器,晕开了结局')
    >>> text
    ['我', '叫', '马化腾', ',', '我', '想', '学区', '块链', ',', '你', '说', '好不', '好', '啊', ',', '天青色', '等', '烟雨', ',', '而', '我', '在', '等', '你', ',', '月色', '被', '打捞器', ',', '晕开', '了', '结局']
    >>> lexicon = ['区块链','好不好', '天青色']
    >>> seg = pkuseg.pkuseg(user_dict=lexicon)
    >>> text = seg.cut('我叫马化腾,我想学区块链,你说好不好啊,天青色等烟雨,而我在 等你,月色被打捞器,晕开了结局')
    >>> text
    ['我', '叫', '马化腾', ',', '我', '想', '学', '区块链', ',', '你', '说', '好不好', '啊', ',', '天青色', '等', '烟雨', ',', '而', '我', '在', '等', '你', ',', '月色', '被', '打捞器', ',', '晕开', '了', '结局']
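
In the transcript above, the default model splits 区块链 ("blockchain") into 学区/块链, and supplying `user_dict=['区块链', ...]` fixes it to 学/区块链. To build intuition for how a word list can steer segmentation, here is a toy forward-maximum-matching segmenter. This is a sketch for illustration only: pkuseg itself is a CRF-based statistical model, not a dictionary matcher, and `fmm_cut` is a hypothetical helper, not part of pkuseg's API.

```python
def fmm_cut(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word starting there, else a single character."""
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to one character
        # try the longest candidate first, down to length 2
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words

lexicon = {'区块链', '好不好', '天青色'}
print(fmm_cut('我想学区块链', lexicon))
# ['我', '想', '学', '区块链']
```

Because matching is greedy from the left, once 区块链 is in the lexicon the cut at 学 falls out naturally; without it, every character stands alone. Real segmenters weigh context statistically, which is why pkuseg handles words outside the dictionary far better than this sketch.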
    
    


Permalink: https://www.haomeiwen.com/subject/eedtnctx.html