Implementing the Word2Vec Skip-Gram Model with Gensim

Author: 致Great | Published 2017-09-21 17:55

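Introduction

Gensim is an open-source Python library for extracting semantic topics from documents conveniently and efficiently. It is designed to process raw, unstructured digital text ("plain text"). Several of its algorithms, such as Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Random Projections, discover semantic structure by examining word co-occurrence patterns in the training documents.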

Quick start

import logging
logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',level=logging.INFO)

# Create a small corpus: each document is a list of (term_id, count) pairs
from gensim import corpora,models,similarities

corpus=[[(0,1.0),(1,1.0),(2,1.0)],
        [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
        [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
        [(0, 1.0), (4, 2.0), (7, 1.0)],
        [(3, 1.0), (5, 1.0), (6, 1.0)],
        [(9, 1.0)],
        [(9, 1.0), (10, 1.0)],
        [(9, 1.0), (10, 1.0), (11, 1.0)],
        [(8, 1.0), (10, 1.0), (11, 1.0)]]

# Apply TF-IDF weighting to the vectors
tfidf=models.TfidfModel(corpus)

vec=[(0,1),(4,1)]
print(tfidf[vec])

[(0, 0.8075244024440723), (4, 0.5898341626740045)]
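
The two weights printed above can be reproduced by hand. The sketch below assumes gensim's default `TfidfModel` settings (raw term count, idf = log2(N / df), then scaling to unit length); the helpers `document_frequencies` and `tfidf_weights` are illustrative names, not part of gensim.

```python
import math

# Same toy corpus as above: each document is a list of (term_id, count) pairs.
corpus = [
    [(0, 1.0), (1, 1.0), (2, 1.0)],
    [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
    [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
    [(0, 1.0), (4, 2.0), (7, 1.0)],
    [(3, 1.0), (5, 1.0), (6, 1.0)],
    [(9, 1.0)],
    [(9, 1.0), (10, 1.0)],
    [(9, 1.0), (10, 1.0), (11, 1.0)],
    [(8, 1.0), (10, 1.0), (11, 1.0)],
]


def document_frequencies(docs):
    """Count in how many documents each term id appears."""
    df = {}
    for doc in docs:
        for term_id, _ in doc:
            df[term_id] = df.get(term_id, 0) + 1
    return df


def tfidf_weights(vec, df, num_docs):
    """Weight each term by count * log2(N / df), then scale to unit length."""
    raw = [(t, count * math.log2(num_docs / df[t])) for t, count in vec]
    norm = math.sqrt(sum(w * w for _, w in raw))
    return [(t, w / norm) for t, w in raw]


df = document_frequencies(corpus)
weights = tfidf_weights([(0, 1), (4, 1)], df, len(corpus))
print(weights)  # approximately [(0, 0.8075), (4, 0.5898)]
```

Term 0 occurs in 2 of the 9 documents and term 4 in 3 of them, so term 0 is rarer and receives the larger weight.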

index= similarities.SparseMatrixSimilarity(tfidf[corpus],num_features=12)
sims=index[tfidf[vec]]
print(list(enumerate(sims)))

[(0, 0.4662244), (1, 0.19139354), (2, 0.24600551), (3, 0.82094586), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]

How to read this output: document 0 (the first document) has a similarity score of 0.466 (46.6%), the second document scores 19.1%, and so on.
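
Each score is the cosine similarity between the TF-IDF vector of the query and that of one document. A minimal sketch of that computation in plain Python, assuming the same weighting as gensim's defaults (log2 idf, unit-length vectors); `unit_tfidf` and `cosine` are hypothetical helper names, not gensim functions:

```python
import math

# The toy corpus from above, written as {term_id: count} dictionaries.
corpus = [
    {0: 1, 1: 1, 2: 1},
    {2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 8: 1},
    {1: 1, 3: 1, 4: 1, 7: 1},
    {0: 1, 4: 2, 7: 1},
    {3: 1, 5: 1, 6: 1},
    {9: 1},
    {9: 1, 10: 1},
    {9: 1, 10: 1, 11: 1},
    {8: 1, 10: 1, 11: 1},
]

# Document frequency of every term id.
df = {}
for doc in corpus:
    for term_id in doc:
        df[term_id] = df.get(term_id, 0) + 1


def unit_tfidf(doc):
    """TF-IDF weights (count * log2(N / df)) scaled to unit length."""
    raw = {t: c * math.log2(len(corpus) / df[t]) for t, c in doc.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}


def cosine(a, b):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    return sum(w * b[t] for t, w in a.items() if t in b)


query = unit_tfidf({0: 1, 4: 1})
sims = [cosine(query, unit_tfidf(doc)) for doc in corpus]
for doc_id, score in enumerate(sims):
    print(doc_id, round(score, 4))
```

Documents 4 through 8 share no term with the query, which is why their scores are exactly zero.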

Tokenizing the corpus

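import jieba

# Read raw sentences and write space-separated tokens, one sentence per line.
with open("files/data/python32-sentence.txt", encoding="utf8") as sentences_file, \
        open("files/data/python32-word.txt", "a", encoding="utf8") as word_file:
    for line in sentences_file:
        # str.replace returns a new string, so the result must be assigned
        line = line.replace('\t', '').replace('\n', '').replace(' ', '')
        segment_words = jieba.cut(line, cut_all=False)  # accurate (non-full) mode
        # The trailing newline keeps the last token of one sentence apart from the next
        word_file.write(" ".join(segment_words) + "\n")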
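
One detail in the cleanup step is easy to miss: Python strings are immutable, so `str.replace` returns a new string and leaves the original untouched. The result has to be assigned for the cleanup to take effect. A small illustration (the helper name `strip_whitespace` is made up for this sketch):

```python
def strip_whitespace(line):
    """Return the line with tabs, newlines and spaces removed."""
    return line.replace('\t', '').replace('\n', '').replace(' ', '')


raw = "深度 学习\t很有趣\n"
raw.replace('\n', '')            # result discarded: raw is unchanged
cleaned = strip_whitespace(raw)  # result kept in a new variable

print(repr(raw))
print(repr(cleaned))
```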

Training a model with gensim's word2vec

Reference: python初步实现word2vec
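
As background for the training code in this section: skip-gram learns by predicting the words around each center word within a window. The sketch below is a simplified illustration of how such (center, context) pairs are formed. It is not gensim's implementation (which, among other things, also shrinks the window randomly and subsamples frequent words), and `skipgram_pairs` is a hypothetical helper.

```python
def skipgram_pairs(tokens, window=5):
    """Yield (center, context) pairs for every token and its neighbours."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield center, tokens[j]


tokens = ["公司", "生产", "软件", "产品"]
pairs = list(skipgram_pairs(tokens, window=1))
print(pairs)
```

With `window=1` each word is paired only with its immediate neighbours; a larger window produces more pairs per word and captures broader context.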

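# Imports
from gensim.models import word2vec
import logging

# Setup
logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus("files/data/python32-word.txt")  # load the tokenized corpus
# sg=1 selects skip-gram (gensim's default, sg=0, trains CBOW); window defaults to 5.
# gensim >= 4.0 renames `size` to `vector_size`.
model = word2vec.Word2Vec(sentences, size=200, sg=1)
print("输出模型", model)

# Similarity between two words (word vectors are accessed through model.wv)
try:
    y1 = model.wv.similarity("企业", "公司")
except KeyError:
    y1 = 0  # one of the words is not in the vocabulary
print("【企业】和【公司】的相似度为：{}\n".format(y1))

# Words most related to a given word
y2 = model.wv.most_similar("科技", topn=20)  # the 20 most related words
print("与【科技】最相关的词有：\n")
for word, score in y2:
    print(word, score)
print("*********\n")

# Analogy-style query: positive words 公司 and 产品, negative word 生产
print("公司-产品", "生产")
y3 = model.wv.most_similar(["公司", "产品"], ["生产"], topn=3)
for word, score in y3:
    print(word, score)
print("*********\n")

# Find the word that does not belong with the others
y4 = model.wv.doesnt_match("企业 公司 是 合作伙伴".split())
print("不合群的词：{}".format(y4))
print("***********\n")

# Save the model
model.save("企业关系.model")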

WARNING:gensim.models.word2vec:under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay


输出模型 Word2Vec(vocab=579, size=200, alpha=0.025)
【企业】和【公司】的相似度为：0.9999545757451112

与【科技】最相关的词有：

， 0.9999620318412781
有限公司 0.9999616146087646
产品 0.9999591708183289
是 0.9999580383300781
和 0.9999551773071289
： 0.9999542832374573
成为 0.9999539256095886
软件 0.9999529719352722
经销商 0.9999511241912842
的 0.9999507069587708
年 0.999950110912323
等 0.999950110912323
技术 0.9999500513076782
美国 0.9999497532844543
月 0.9999494552612305
及 0.999949038028717
企业 0.9999480843544006
核心 0.9999477863311768
公司 0.999947726726532
指定 0.9999475479125977
*********

公司-产品 生产
。 0.9998433589935303
等 0.9998431205749512
的 0.9998403787612915
*********

不合群的词：公司
***********
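
The `most_similar(positive, negative)` call used above ranks words by cosine similarity to a combination of the input vectors. A toy sketch of that idea follows. The words, the 2-D numbers, and the function `analogy` are illustrative assumptions; gensim works on the trained vectors and differs in detail.

```python
import math

# Made-up 2-D vectors, for illustration only; real vectors come from training.
vectors = {
    "king": (0.9, 0.8),
    "queen": (0.9, 0.2),
    "man": (0.1, 0.8),
    "woman": (0.1, 0.2),
    "apple": (-0.7, 0.1),
}


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def analogy(positive, negative, topn=3):
    """Rank the other words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for sign, words in ((1.0, positive), (-1.0, negative)):
        for word in words:
            for k, x in enumerate(unit(vectors[word])):
                target[k] += sign * x
    target = unit(target)
    scores = [
        (word, sum(a * b for a, b in zip(target, unit(vec))))
        for word, vec in vectors.items()
        if word not in positive and word not in negative  # skip the query words
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


result = analogy(positive=["king", "woman"], negative=["man"])
print(result)  # "queen" ranks first with these toy vectors
```

Normalizing every vector to unit length before combining them keeps a single long vector from dominating the query.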
