Word2vec in Practice: 3-Minute NLP, Part 5

Author: 十三先 | Published 2021-01-19 21:59


Reference: https://blog.csdn.net/qq_30189255/article/details/103049569

1. English corpus

The corpus used here is text8, about 100 MB in size, loaded into the variable `sentences`.
Download: http://mattmahoney.net/dc/text8.zip
text8 is already tokenized on whitespace with punctuation removed, so no further preprocessing is needed.

2. Training the model

word2vec is implemented here with Python's gensim package:

    pip install gensim
    

Input:

from gensim.models import word2vec

# Gensim is an open-source Python toolkit for unsupervised learning of latent
# semantic representations from raw, unstructured text. It supports TF-IDF,
# LSA, LDA, word2vec and other topic-model algorithms.

# Load the text8 corpus (about 100 MB), downloaded from
# http://mattmahoney.net/dc/text8.zip
sentences = word2vec.Text8Corpus('text8')

# Train the word-vector model (skip-gram, 100-dimensional vectors).
# Note: gensim >= 4.0 renamed the `size` parameter to `vector_size`.
model = word2vec.Word2Vec(sentences, sg=1, size=100, window=5,
                          min_count=5, negative=3, sample=0.001, hs=1, workers=4)

# Save the model; later runs can load it directly instead of retraining.
model.save('text8_word2vec_model')
    

Training takes a while: on an ordinary MacBook Pro it took about 15 minutes.

3. Loading the model

# Load the saved model
model = word2vec.Word2Vec.load('text8_word2vec_model')
    

4. Similarity between two words: similarity()

print('--- similarity between two words ---')
word1 = 'man'
word2 = 'woman'
result1 = model.wv.similarity(word1, word2)
print("similarity between " + word1 + " and " + word2 + ":", result1)
    

Output:

--- similarity between two words ---
similarity between man and woman: 0.6944872
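similarity() is the cosine similarity of the two word vectors. A minimal NumPy sketch of the same computation (the vectors below are toy stand-ins; the real ones would come from model.wv['man'] and model.wv['woman']):

```python
import numpy as np

def cosine_similarity(u, v):
    # The same quantity that model.wv.similarity(w1, w2) computes on the embeddings.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-in vectors, not trained embeddings.
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(u, v))  # parallel vectors give 1.0
```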
    

5. Most related words: most_similar()

Input:

print('\n--- most related words ---')
word = 'cat'
result2 = model.wv.most_similar(word, topn=10)  # the 10 most related words

print("the 10 words most related to " + word + ":")
for item in result2:
    print(item[0], item[1])
    

Output:

--- most related words ---
the 10 words most related to cat:
    prionailurus 0.7491977214813232
    cats 0.7341662049293518
    dog 0.7332097887992859
    dogs 0.7025191783905029
    kitten 0.6987137794494629
    rat 0.6867721676826477
    eared 0.6866066455841064
    felis 0.6811522245407104
    pug 0.678561806678772
    tortoiseshell 0.6764862537384033
    

A dictionary helps with the less familiar neighbors of cat: prionailurus is the leopard-cat genus (Prionailurus), felids roughly the size of a house cat; a kitten is a young cat; eared means having ears.
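most_similar ranks every other vocabulary word by cosine similarity to the query vector and returns the top k. A toy sketch of that ranking with hand-made vectors (the real ones come from model.wv):

```python
import numpy as np

# Hand-made toy embeddings standing in for model.wv values.
emb = {
    "cat":    np.array([1.0, 0.0]),
    "kitten": np.array([0.9, 0.1]),
    "dog":    np.array([0.7, 0.3]),
    "car":    np.array([0.0, 1.0]),
}

def most_similar(word, topn=2):
    # Rank all other words by cosine similarity to the query vector.
    q = emb[word] / np.linalg.norm(emb[word])
    scores = {w: float(np.dot(v / np.linalg.norm(v), q))
              for w, v in emb.items() if w != word}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]

print(most_similar("cat"))  # "kitten" first, then "dog"
```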

Trying another word: the related words for beijing

--- most related words ---
the 10 words most related to beijing:
    guangzhou 0.7843025326728821
    shanghai 0.7154852151870728
    peking 0.6975410580635071
    taipei 0.6882435083389282
    hangzhou 0.6816953420639038
    wuhan 0.6814815998077393
    kaohsiung 0.6703094244003296
    ribao 0.664854109287262
    guangdong 0.6647670269012451
    hong 0.6628706455230713
    

From top to bottom: Guangzhou, Shanghai, Peking (an older romanization of Beijing), Taipei, Hangzhou, Wuhan, Kaohsiung, ribao ("daily", as in Chinese newspaper names), Guangdong, and hong.
The last entry, hong, is presumably the first half of "hong kong" split during tokenization.

6. Word analogies: also most_similar()

Input

print('\n--- word analogies ---')

print('"boy" is to "father" as "girl" is to ?')
result3 = model.wv.most_similar(['girl', 'father'], ['boy'], topn=2)  # top 2 candidates

for item in result3:
    print(item[0], item[1])
    

Output

--- word analogies ---
"boy" is to "father" as "girl" is to ?
mother 0.7658053040504456
wife 0.7323337197303772
    
A few more analogies:

more_examples = ["she her he", "small smaller bad", "going went being"]
for example in more_examples:
    a, b, x = example.split()
    predicted = model.wv.most_similar([x, b], [a])[0][0]
    print("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted))
    
    'she' is to 'her' as 'he' is to 'his'
    'small' is to 'smaller' as 'bad' is to 'worse'
    'going' is to 'went' as 'being' is to 'was'
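most_similar(positive, negative) is driven by vector arithmetic: the candidates are ranked by cosine similarity to (father - boy + girl). A toy sketch of that arithmetic with hand-made 2-d vectors (the real ones come from model.wv):

```python
import numpy as np

# Hand-made toy embeddings chosen to illustrate the analogy arithmetic.
emb = {
    "boy":    np.array([1.0, 0.0]),
    "father": np.array([1.0, 1.0]),
    "girl":   np.array([0.0, 0.0]),
    "mother": np.array([0.0, 1.0]),
    "dog":    np.array([5.0, 5.0]),
}

def unit(v):
    return v / np.linalg.norm(v)

# target = father - boy + girl; rank the remaining words by cosine similarity.
target = emb["father"] - emb["boy"] + emb["girl"]
candidates = [w for w in emb if w not in {"boy", "father", "girl"}]
best = max(candidates, key=lambda w: float(np.dot(unit(emb[w]), unit(target))))
print(best)  # "mother"
```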
    

7. Finding the odd one out: doesnt_match()

Input

print('\n--- odd one out ---')

words = "apple cat banana peach"  # spot the animal among the fruits
result4 = model.wv.doesnt_match(words.split())
print("the odd one out in '" + words + "' is:", result4)
      
    

Output

--- odd one out ---
the odd one out in 'apple cat banana peach' is: cat
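The odd one out can be found as the word farthest from the mean of the (normalized) vectors, which is essentially what doesnt_match does. A toy sketch with hand-made 2-d vectors standing in for model.wv:

```python
import numpy as np

# Hand-made toy embeddings: three "fruit-like" vectors and one outlier.
emb = {
    "apple":  np.array([1.0, 0.1]),
    "banana": np.array([0.9, 0.2]),
    "peach":  np.array([1.0, 0.0]),
    "cat":    np.array([0.0, 1.0]),
}

# Normalize, average, and pick the word with the lowest cosine to the mean.
units = {w: v / np.linalg.norm(v) for w, v in emb.items()}
mean = np.mean(list(units.values()), axis=0)
mean /= np.linalg.norm(mean)
odd = min(units, key=lambda w: float(np.dot(units[w], mean)))
print(odd)  # "cat"
```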
    

8. Inspecting a word vector

Input

    word = 'boy'
    print(word, "\n", model.wv[word])
    

Output

    boy 
     [-0.05925538  0.11277281  0.11228959  0.00941157 -0.29323277  0.3983824
      0.10022594 -0.27772436 -0.0637489   0.21361585 -0.1111148  -0.07992619
      0.19348109 -0.3863782  -0.39820215 -0.5309777   0.3023594   0.09559165
      0.26342046  0.07928758  0.181699    0.69354516  0.06837065 -0.18296044
      0.02820505 -0.2478618   0.02427425  0.05263022  0.4571287  -0.11103037
      0.00101246 -0.27764824 -0.24569483  0.44549158 -0.21713312  0.5335748
      0.14214468  0.11317527  0.19602373  0.2653484  -0.32859662 -0.38938046
      0.25495887 -0.45625678  0.14457951  0.32262853  0.15038528  0.32194614
     -0.08338999 -0.01091572  0.20316067 -0.74805576 -0.08273557 -0.59173554
     -0.12938951 -0.2492775   0.16524307  0.14128453 -0.42496806  0.2531642
      0.01175205  0.24926914 -0.20511891 -0.32925373  0.64965665 -0.2722091
      0.7198772  -0.45331827  0.02247382 -0.44499233  0.46038678  0.099677
     -0.03841541  0.22986875  0.24340023 -0.2364937  -0.22875474 -0.08419312
      0.47897708 -0.2800826   0.36107522 -0.41507873  0.13201733 -0.61776733
      0.08101977 -0.14693528  0.15443248  0.08642672  0.21798083 -0.30605313
      0.09893245 -0.15973178  0.07892659  0.31995687 -0.07135762  0.46047646
     -0.53847355 -0.00333725 -0.03253252  0.20049895]
    

boy, shown as a 100-dimensional vector.

After training, word2vec amounts to a static lookup table from words to vectors; later uses of the model simply read from it.

9. Incremental training

If out-of-vocabulary words turn up when using the vectors, the model can be extended with incremental training.

10. Appendix: a Chinese corpus from WeChat official accounts

The model used in this section comes from:

    苏剑林. (Apr. 03, 2017). 《【不可思议的Word2Vec】 2.训练好的模型 》[Blog post]. Retrieved from https://spaces.ac.cn/archives/4304

Thanks to the author for sharing!

10.1 Finding similar words

import gensim

# Load the pretrained model trained on a WeChat corpus (256-dimensional vectors).
model = gensim.models.Word2Vec.load('word2vec_wx')
word = '公众号'  # "official account" (on WeChat)
result = model.wv.most_similar(word, topn=10)
for item in result:
    print(item[0], item[1])
    

Output

    订阅号 0.7826967239379883
    微信公众号 0.7606396675109863
    微信公众账号 0.7348952293395996
    公众平台 0.7161740064620972
    扫一扫 0.6978365182876587
    微信公众平台 0.6968472003936768
    置顶 0.666775643825531
    公共账号 0.6657419800758362
    微信平台 0.6610353589057922
    菜单栏 0.6523471474647522

From top to bottom: subscription account, WeChat official account, WeChat public account, public platform, "scan the QR code", WeChat public platform, "pin to top", public account, WeChat platform, menu bar.
    

10.2 Inspecting a word vector

word2 = '语言'  # "language"
print(model.wv[word2])
    

Output: a 256-dimensional vector

     [ 3.33861768e-01  1.73384026e-01 -4.85719703e-02 -1.83665797e-01
      2.97123849e-01  2.50801265e-01  2.51928747e-01 -3.17050546e-01
      1.52625665e-01  1.81541026e-01  1.79962456e-01  2.98898131e-01
     -9.67884883e-02 -1.62156686e-01  6.51515201e-02 -1.08960688e-01
     -7.70623609e-02 -2.79988974e-01 -9.49766021e-03  2.07269922e-01
      3.30152921e-02 -1.23128854e-01 -5.47476672e-02  6.98662698e-02
     -3.13813448e-01 -2.39523128e-01 -2.49424770e-01  8.21948871e-02
     -2.28355274e-01 -1.54693067e-01  1.65754601e-01 -4.94239718e-01
      1.17348902e-01 -7.47983456e-02 -4.48045224e-01 -1.86154351e-01
     -1.74458280e-01  3.80349047e-02  1.14527017e-01  1.82628911e-02
      4.37613158e-03  2.90892631e-01 -4.35760915e-02 -9.80213210e-02
      7.60740787e-03 -1.29277661e-01  7.25271106e-02  1.96394786e-01
     -2.30470344e-01  1.26821458e-01 -3.31954837e-01 -1.83349043e-01
      2.70551860e-01 -1.47792444e-01  1.41586214e-01 -1.36383906e-01
      8.69724825e-02 -3.14457536e-01  2.85050809e-01 -5.34303486e-01
      2.05140444e-03 -3.56914908e-01  1.32593617e-01  1.34602815e-01
      8.34424496e-02 -2.73539245e-01 -8.30292553e-02  1.47602037e-01
     -2.76010185e-01  2.20643684e-01  2.27436453e-01 -4.26474325e-02
     -6.24517351e-02  5.66290952e-02  2.52530258e-02 -1.38873294e-01
     -2.31301725e-01 -1.09231465e-01 -1.34342521e-01  9.82840061e-02
      2.76122481e-01 -2.20348358e-01 -3.24411511e-01  3.73846352e-01
      2.25116895e-03  7.37380683e-02  2.37027138e-01 -6.71216100e-02
     -7.34784752e-02 -2.10223243e-01 -1.42640561e-01  1.53249070e-01
      1.13160595e-01 -2.25711882e-01 -7.01921284e-02  1.15349911e-01
     -3.80048126e-01 -2.95185624e-03  3.98775488e-01 -9.73170176e-02
     -4.16714624e-02 -9.56387296e-02  8.42289254e-02 -2.36658439e-01
      6.64465651e-02 -1.31715974e-02 -1.59601197e-01  8.88570100e-02
      1.33331180e-01 -7.15468898e-02  1.95292663e-02 -1.93508156e-02
      1.33383304e-01 -3.06313127e-01  7.45932907e-02 -2.96030585e-02
     -1.72281161e-01  5.93193658e-02  8.34293887e-02  1.39482886e-01
      1.18947007e-01  6.46592006e-02  2.07168125e-02  2.07605623e-02
     -8.70405138e-02  2.21211419e-01 -1.53310314e-01  1.00793295e-01
     -1.92820221e-01 -2.12027013e-01  9.87652317e-02  2.17053652e-01
      4.86631304e-01  1.86503336e-01 -5.58116026e-02 -5.36188930e-02
     -3.45847532e-02  1.55890629e-01  1.08481154e-01  5.94230704e-02
      3.30689073e-01 -8.32900181e-02  2.48705238e-01  5.14308959e-02
      6.58088923e-02 -1.93040535e-01  3.02518234e-02 -3.92529294e-02
      1.89376771e-01 -9.18796659e-02  3.10825408e-01  4.15268168e-02
     -4.59740274e-02 -1.36704236e-01 -2.26371899e-01 -1.88938022e-01
     -9.71311778e-02 -2.66413629e-01  1.49052262e-01 -3.35519582e-01
      1.54610902e-01  5.26955947e-02 -9.51223820e-02  2.34650373e-01
      1.66842282e-01 -4.10355441e-02 -6.46131486e-02  7.39907995e-02
     -7.34125972e-02  8.67785811e-02 -4.19434011e-01  1.49924412e-01
      2.64039114e-02  3.05301458e-01  2.93732643e-01  1.41284674e-01
      8.81637931e-02  1.88074231e-01 -2.17319787e-01 -4.10166413e-01
      4.96431477e-02 -3.15308601e-01 -1.60014797e-02 -1.07530311e-01
      1.28617911e-02  3.53148021e-02 -7.71220773e-02 -6.29130676e-02
     -8.69281590e-02 -5.55020664e-03 -1.84910014e-01 -2.46468764e-02
      2.31978804e-01  2.93507904e-01  8.72691721e-02 -2.26804726e-02
     -2.83576965e-01 -3.23421247e-02 -1.48572177e-01  1.82716042e-01
      1.91754699e-01  7.03820288e-02  4.24286425e-02 -6.50014803e-02
     -9.97320190e-02 -2.15809867e-01  3.14078555e-02 -1.96677875e-02
     -1.10134572e-01  1.19875282e-01  1.54025719e-01  1.01338290e-01
      5.95281012e-02  2.78205037e-01 -9.77627859e-02 -2.41364494e-01
      5.33033311e-01 -2.40000382e-01 -4.84812446e-02  6.93592578e-02
     -8.78402889e-02 -2.56885827e-01 -1.05378618e-02  2.10043699e-01
     -3.02311122e-01 -9.24663767e-02 -9.41163674e-02  8.37587938e-03
      1.84201568e-01 -2.37815425e-01 -6.43651634e-02 -1.07877836e-01
     -2.62131453e-01  1.14236757e-01  4.68298532e-02 -3.03846925e-01
     -8.44435990e-02  1.81959212e-01 -3.58633459e-01  1.99875772e-01
      1.75900802e-01 -3.03341061e-01 -1.05403319e-01  4.98839887e-04
      8.98204967e-02 -2.79990375e-01 -2.72860229e-01 -6.90619275e-02
     -7.85811692e-02 -1.28763422e-01 -1.47669330e-01  2.09599406e-01
      7.18796179e-02  2.34390453e-01  1.15268238e-01 -1.40857920e-01]

Original post: https://www.haomeiwen.com/subject/pfsczktx.html