Python Data Analysis and Machine Learning 41: Building Chinese Wikipedia Word Vectors with Gensim

Author: 只是甲 | Published 2022-08-02 17:34

    1. Chinese Wikipedia Data

    1.1 Downloading the data

    Download URL:
    https://dumps.wikimedia.org/zhwiki/20220720/


    Decompression:
    After decompression, the dataset is roughly 10 GB.


    Browsing the data format in a text editor:

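    You can also peek at the raw XML without decompressing the whole dump first by streaming it with Python's built-in bz2 module. A minimal sketch, assuming the file name from the download step above:

    import bz2

    # Stream the first few lines of the compressed dump without full decompression
    with bz2.open('zhwiki-20220720-pages-articles-multistream.xml.bz2', 'rt', encoding='utf-8') as f:
        for i, line in enumerate(f):
            print(line.rstrip())
            if i >= 20:
                break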

    1.2 Data preprocessing

    Convert the xml.bz2 dump into a plain-text file with a Python script.

    Code:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Extract plain text from a Wikipedia xml.bz2 dump with gensim's WikiCorpus.
    import logging
    import os.path
    import sys
    from gensim.corpora import WikiCorpus
    
    if __name__ == '__main__':
    
        program = os.path.basename(sys.argv[0])
        logger = logging.getLogger(program)
        logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
        logging.root.setLevel(level=logging.INFO)
        logger.info("running %s" % ' '.join(sys.argv))
        # check and process input arguments
        if len(sys.argv) < 3:
            print("Usage: python process.py <input xml.bz2> <output text file>")
            sys.exit(1)
        inp, outp = sys.argv[1:3]
        space = ' '
        i = 0
        output = open(outp, 'w', encoding='utf-8')
        # gensim < 4.0 took a lemmatize=False argument here; it has since been removed
        wiki = WikiCorpus(inp, dictionary={})
        for text in wiki.get_texts():
            # get_texts() yields each article as a list of tokens
            s = space.join(text) + "\n"
            output.write(s)
            i = i + 1
            if i % 10000 == 0:
                logger.info("Saved " + str(i) + " articles")
        output.close()
        logger.info("Finished Saved " + str(i) + " articles")
    # python process.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
    

    Test run:

    python process.py E:/python/数据分析/机器学习/word2vec/维基百科中文数据/zhwiki-20220720-pages-articles-multistream.xml.bz2  wiki.zh.text
    
    2022-07-22 11:47:20,200: INFO: Saved 10000 articles
    2022-07-22 11:47:49,804: INFO: Saved 20000 articles
    2022-07-22 11:48:15,985: INFO: Saved 30000 articles
    2022-07-22 11:48:41,948: INFO: Saved 40000 articles
    2022-07-22 11:49:07,655: INFO: Saved 50000 articles
    2022-07-22 11:49:32,707: INFO: Saved 60000 articles
    2022-07-22 11:49:56,776: INFO: Saved 70000 articles
    2022-07-22 11:50:20,577: INFO: Saved 80000 articles
    2022-07-22 11:50:44,223: INFO: Saved 90000 articles
    2022-07-22 11:51:07,352: INFO: Saved 100000 articles
    2022-07-22 11:51:32,792: INFO: Saved 110000 articles
    2022-07-22 11:52:02,830: INFO: Saved 120000 articles
    2022-07-22 11:52:27,015: INFO: Saved 130000 articles
    2022-07-22 11:52:55,681: INFO: Saved 140000 articles
    2022-07-22 11:53:22,406: INFO: Saved 150000 articles
    2022-07-22 11:53:51,108: INFO: Saved 160000 articles
    2022-07-22 11:54:17,588: INFO: Saved 170000 articles
    2022-07-22 11:54:44,145: INFO: Saved 180000 articles
    2022-07-22 11:55:12,876: INFO: Saved 190000 articles
    2022-07-22 11:57:02,681: INFO: Saved 200000 articles
    2022-07-22 11:57:37,999: INFO: Saved 210000 articles
    2022-07-22 11:58:24,656: INFO: Saved 220000 articles
    2022-07-22 11:58:57,855: INFO: Saved 230000 articles
    2022-07-22 11:59:30,406: INFO: Saved 240000 articles
    2022-07-22 12:00:07,523: INFO: Saved 250000 articles
    2022-07-22 12:00:39,435: INFO: Saved 260000 articles
    2022-07-22 12:01:16,402: INFO: Saved 270000 articles
    2022-07-22 12:01:50,524: INFO: Saved 280000 articles
    2022-07-22 12:02:24,565: INFO: Saved 290000 articles
    2022-07-22 12:02:56,571: INFO: Saved 300000 articles
    2022-07-22 12:03:25,060: INFO: Saved 310000 articles
    2022-07-22 12:03:58,827: INFO: Saved 320000 articles
    2022-07-22 12:04:31,788: INFO: Saved 330000 articles
    2022-07-22 12:05:09,589: INFO: Saved 340000 articles
    2022-07-22 12:05:44,559: INFO: Saved 350000 articles
    2022-07-22 12:06:22,483: INFO: Saved 360000 articles
    2022-07-22 12:06:59,661: INFO: Saved 370000 articles
    2022-07-22 12:07:35,899: INFO: Saved 380000 articles
    2022-07-22 12:08:16,372: INFO: Saved 390000 articles
    2022-07-22 12:08:50,098: INFO: Saved 400000 articles
    2022-07-22 12:09:24,153: INFO: Saved 410000 articles
    2022-07-22 12:09:59,611: INFO: Saved 420000 articles
    2022-07-22 12:10:35,669: INFO: Saved 430000 articles
    2022-07-22 12:10:45,356: INFO: finished iterating over Wikipedia corpus of 432882 documents with 100469267 positions (total 4078948 articles, 118486509 positions before pruning articles shorter than 50 words)
    2022-07-22 12:10:45,554: INFO: Finished Saved 432882 articles
    

    Inspecting the parsed data:
    The text is all in Traditional Chinese.


    1.3 Converting Traditional Chinese to Simplified Chinese

    1.3.1 Installing OpenCC

    1.3.1.1 Installing via pip

    Installing opencc directly with pip fails with an error.

    Use the following command instead:

    pip install opencc-python-reimplemented
    

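    Once installed, the pip package exposes an OpenCC class whose configuration string selects the conversion direction (t2s converts Traditional to Simplified). A minimal sketch of the API:

    from opencc import OpenCC

    cc = OpenCC('t2s')  # t2s: Traditional Chinese -> Simplified Chinese
    print(cc.convert('數學是研究數量、結構以及空間等概念的一門學科'))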

    1.3.1.2 Installing on Windows

    Download URL:
    https://github.com/BYVoid/OpenCC/wiki/Download


    Configuration files: the release ships with conversion configs such as t2s.json (Traditional to Simplified), which is used below.

    1.3.2 Using OpenCC on Windows

    Copy the files into the bin directory:

    View the opencc help output:

    E:\python\数据分析\机器学习\word2vec\OpenCC\build\bin>opencc --help
    
    Open Chinese Convert (OpenCC) Command Line Tool
    Author: Carbo Kuo <byvoid@byvoid.com>
    Bug Report: http://github.com/BYVoid/OpenCC/issues
    
    Usage:
    
       opencc  [--noflush <bool>] [-i <file>] [-o <file>] [-c <file>] [--]
               [--version] [-h]
    
    Options:
    
       --noflush <bool>
         Disable flush for every line
    
       -i <file>,  --input <file>
         Read original text from <file>.
    
       -o <file>,  --output <file>
         Write converted text to <file>.
    
       -c <file>,  --config <file>
         Configuration file
    
       --,  --ignore_rest
         Ignores the rest of the labeled arguments following this flag.
    
       --version
         Displays version information and exits.
    
       -h,  --help
         Displays usage information and exits.
    
    
       Open Chinese Convert (OpenCC) Command Line Tool
    
    
    E:\python\数据分析\机器学习\word2vec\OpenCC\build\bin>
    

    Convert the Traditional Chinese text to Simplified Chinese:

    opencc -i E:\python\数据分析\机器学习\word2vec\wiki.zh.text -o E:\python\数据分析\机器学习\word2vec\wiki_简体.zh.text -c t2s.json
    
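    If you installed the pip package instead of the Windows binary, the same conversion can be done directly in Python. A minimal sketch that converts the corpus line by line to keep memory usage low (file names taken from the command above):

    from opencc import OpenCC

    cc = OpenCC('t2s')
    with open('wiki.zh.text', encoding='utf-8') as src, \
            open('wiki_简体.zh.text', 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(cc.convert(line))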

    2. Word Segmentation with jieba

    Code:

    import codecs
    
    import jieba
    
    # Segment the simplified-Chinese corpus line by line and write
    # the space-separated tokens to a new file.
    f = codecs.open('wiki_简体.zh.text', 'r', encoding='utf8')
    target = codecs.open('wiki_简体.zh.分词.seg-1.3g.text', 'w', encoding='utf8')
    print('open files')
    line_num = 1
    line = f.readline()
    while line:
        print('---- processing', line_num, 'article----------------')
        line_seg = ' '.join(jieba.cut(line))
        target.writelines(line_seg)
        line_num = line_num + 1
        line = f.readline()
    f.close()
    target.close()
    
    # python Testjieba.py
    

    Test run:
    The segmented text looks like this:

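    To get a feel for what the segmentation does, here is a minimal sketch running jieba on a single sentence (the example sentence is my own):

    import jieba

    # jieba.cut returns a generator of tokens; join them with spaces as in the script above
    print(' '.join(jieba.cut('数学是研究数量、结构以及空间等概念的一门学科')))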

    3. Generating Word Vectors

    Python code:

    import logging
    import multiprocessing
    import os.path
    import sys
    
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence
    
    if __name__ == '__main__':
    
        program = os.path.basename(sys.argv[0])
        logger = logging.getLogger(program)
        logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
        logging.root.setLevel(level=logging.INFO)
        logger.info("running %s" % ' '.join(sys.argv))
        # check and process input arguments
        if len(sys.argv) < 4:
            print("Usage: python word2vec_model.py <segmented text> <model file> <vector file>")
            sys.exit(1)
        inp, outp1, outp2 = sys.argv[1:4]
        # LineSentence streams the corpus one line (one article) at a time
        model = Word2Vec(LineSentence(inp), window=5, min_count=5,
                         workers=multiprocessing.cpu_count())
        model.save(outp1)                                   # full model, reloadable for further training
        model.wv.save_word2vec_format(outp2, binary=False)  # plain-text word vectors
    # python word2vec_model.py zh.jian.wiki.seg.txt wiki.zh.text.model wiki.zh.text.vector
    # opencc -i wiki_texts.txt -o test.txt -c t2s.json
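
    The script above uses gensim's default embedding dimensionality (100 in gensim 4.x, where the parameter is named vector_size; older releases called it size). A minimal sketch with the main parameters spelled out, the value 200 being just an illustrative choice:

    import multiprocessing

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    model = Word2Vec(
        LineSentence('wiki_简体.zh.分词.seg-1.3g.text'),
        vector_size=200,  # embedding dimensionality (gensim 4.x; 'size' in 3.x)
        window=5,         # context window: 5 words on each side of the target
        min_count=5,      # ignore words that appear fewer than 5 times
        workers=multiprocessing.cpu_count(),
    )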
    

    Run the Python script:

    python word2vec_model.py wiki_简体.zh.分词.seg-1.3g.text wiki_简体.zh.分词.seg-1.3g.text.model wiki_简体.zh.分词.seg-1.3g.text.vector
    

    Test run:
    The new files are generated: the .model file along with the plain-text .vector file.

    4. Testing

    For each of five test words, print the closest words in the model.
    Code:

    from gensim.models import Word2Vec
    
    # Load the model trained in the previous step
    en_wiki_word2vec_model = Word2Vec.load('wiki_简体.zh.分词.seg-1.3g.text.model')
    
    testwords = ['苹果', '数学', '学术', '白痴', '篮球']
    for word in testwords:
        res = en_wiki_word2vec_model.wv.most_similar(word)  # top 10 by default
        print(word)
        print(res)
    

    Test run:

    苹果
    [('洋葱', 0.6776813268661499), ('apple', 0.6557675004005432), ('黑莓', 0.6425948143005371), ('草莓', 0.6342688798904419), ('小米', 0.6270812153816223), ('坚果', 0.6188254952430725), ('苹果公司', 0.6164141893386841), ('果冻', 0.6137404441833496), ('咖啡', 0.604507327079773), ('籽', 0.597305178642273)]
    数学
    [('微积分', 0.8419661521911621), ('算术', 0.8334301710128784), ('数学分析', 0.7725528478622437), ('概率论', 0.7687932252883911), ('数论', 0.763685405254364), ('逻辑学', 0.7549666166305542), ('高等数学', 0.7526171803474426), ('物理', 0.7509859800338745), ('数理逻辑', 0.7463771104812622), ('拓扑学', 0.7362903952598572)]
    学术
    [('学术研究', 0.8562514781951904), ('社会科学', 0.7332450747489929), ('自然科学', 0.7266139388084412), ('学术界', 0.7095916867256165), ('法学', 0.7089691758155823), ('汉学', 0.7063266038894653), ('科学研究', 0.7001381516456604), ('跨学科', 0.6969149708747864), ('学术活动', 0.6920358538627625), ('史学', 0.6917674541473389)]
    白痴
    [('傻子', 0.764008641242981), ('书呆子', 0.709040105342865), ('疯子', 0.7005040049552917), ('笨蛋', 0.674527108669281), ('傻瓜', 0.6599918603897095), ('骗子', 0.6531063914299011), ('爱哭鬼', 0.650740921497345), ('天才', 0.6505906581878662), ('蠢', 0.6501168012619019), ('娘娘腔', 0.6492332816123962)]
    篮球
    [('棒球', 0.7790619730949402), ('美式足球', 0.7700984477996826), ('排球', 0.7593865990638733), ('橄榄球', 0.75135737657547), ('网球', 0.7470680475234985), ('冰球', 0.7441636323928833), ('足球', 0.7276999950408936), ('橄榄球队', 0.7252797484397888), ('男子篮球', 0.7197815179824829), ('曲棍球', 0.7128130793571472)]
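
    Beyond nearest-neighbor lookups, the trained vectors also support pairwise similarity and analogy-style arithmetic. A minimal sketch, assuming the query words appeared at least min_count times and are therefore in the vocabulary:

    from gensim.models import Word2Vec

    model = Word2Vec.load('wiki_简体.zh.分词.seg-1.3g.text.model')

    # Cosine similarity between two words
    print(model.wv.similarity('篮球', '足球'))

    # Analogy-style query: vector('国王') - vector('男人') + vector('女人') ≈ ?
    print(model.wv.most_similar(positive=['国王', '女人'], negative=['男人'], topn=5))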
    
    

