Dictionary
# Create the dictionary (a mapping between words and integer ids)
dictionary = create_dict(document_final)
print('Dictionary created!')
# Save the dictionary
dictionary.save('D:/Python/Projects/NLP/doc2bow/doc2bow.dict')
print('Dictionary saved!')
# Load the dictionary
dictionary_load = corpora.Dictionary.load('D:/Python/Projects/NLP/doc2bow/doc2bow.dict')
print('Dictionary loaded!')
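What the dictionary does — assign each unique token an integer id and turn a token list into sparse (id, count) pairs — can be sketched in plain Python. The sample documents and the `doc2bow` helper below are made up for illustration; in gensim this is `corpora.Dictionary` and its `doc2bow` method.

```python
from collections import Counter

# Toy stand-in for corpora.Dictionary: map each unique token to an integer id.
docs = [["human", "interface", "computer"],
        ["survey", "computer", "system", "computer"]]

token2id = {}
for doc in docs:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

def doc2bow(tokens):
    """Return a sparse bag-of-words vector: sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

print(doc2bow(["computer", "computer", "interface"]))  # → [(1, 1), (2, 2)]
```

Tokens not present in the dictionary are silently dropped, which matches how `doc2bow` handles out-of-vocabulary words by default.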
Model
# Save the model
# Option 1
tfidf.save('D:/Python/Projects/NLP/doc2bow/doc2bow.model')
print('Model saved!')
# Option 2 (applies to Word2Vec models, via the KeyedVectors interface)
# model.wv.save_word2vec_format('word2vec.vector')
# model2.wv.save_word2vec_format('word2vec.bin')
# Load the model
model_load = models.TfidfModel.load('D:/Python/Projects/NLP/doc2bow/doc2bow.model')
print('Model loaded!')
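The weighting that `tfidf` applies can be sketched in plain Python: gensim's default `TfidfModel` scheme uses idf = log2(N/df), drops zero-weight terms, and L2-normalizes each document vector. The tiny bag-of-words corpus below is made up for illustration.

```python
import math

# Bag-of-words corpus: each document is a list of (token_id, count) pairs.
corpus = [[(0, 1), (1, 1)],
          [(0, 1), (2, 2)],
          [(0, 1), (2, 1), (3, 1)]]

# Document frequency of each token id.
dfs = {}
for doc in corpus:
    for token_id, _ in doc:
        dfs[token_id] = dfs.get(token_id, 0) + 1

num_docs = len(corpus)

def tfidf(doc):
    """tf * log2(N/df); zero-weight terms dropped; then L2-normalized."""
    weights = [(tid, tf * math.log2(num_docs / dfs[tid])) for tid, tf in doc]
    weights = [(tid, w) for tid, w in weights if w != 0.0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(tid, w / norm) for tid, w in weights] if norm else weights

print(tfidf(corpus[0]))  # token 0 appears in every document, so it vanishes
```

Note that token 0 (present in all three documents) gets idf = log2(3/3) = 0 and disappears from every transformed vector, just as ubiquitous words carry no TF-IDF weight.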
Index
# output_prefix: the path prefix under which the index shards are stored
index_tmpfile = get_tmpfile("D:/Python/Projects/NLP/doc2bow/")
index = Similarity(index_tmpfile, corpus_tfidf, num_features=len(dictionary))
print('Index created!')
index.save('D:/Python/Projects/NLP/doc2bow/doc2bow.index')
print('Index saved!')
index_load = Similarity.load('D:/Python/Projects/NLP/doc2bow/doc2bow.index')
print('Index loaded!')
Overview of the Similarity class
def __init__(self, output_prefix, corpus, num_features, num_best=None, chunksize=256, shardsize=32768, norm='l2')
Parameters
----------
output_prefix : str
    Prefix for shard filenames. If None, a random filename in the system temp directory will be used.
corpus : iterable of list of (int, number)
    Corpus in streamed Gensim bag-of-words format.
num_features : int
    Size of the dictionary (number of features).
num_best : int, optional
    If set, return only the `num_best` most similar documents, always leaving out documents with similarity = 0.
    Otherwise, return a full vector with one float for every document in the index.
chunksize : int, optional
    Size of query chunks. Used internally when the query is an entire corpus.
shardsize : int, optional
    Maximum shard size, in documents. Choose a value so that a `shardsize x chunksize` matrix of floats fits
    comfortably into RAM.
norm : {'l1', 'l2'}, optional
    Normalization to use.
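The effect of `num_best` can be sketched as a plain sort-and-truncate over a full similarity vector. The scores below are made up, and `top_best` is a hypothetical helper mimicking the behavior described above, not a gensim API.

```python
sims = [0.0, 0.31, 0.0, 0.92, 0.55]  # one score per indexed document

def top_best(scores, num_best):
    """Mimic num_best: (doc_id, score) pairs, best first, zeros excluded."""
    ranked = sorted(((i, s) for i, s in enumerate(scores) if s > 0),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:num_best]

print(top_best(sims, num_best=2))  # → [(3, 0.92), (4, 0.55)]
```

Without `num_best` you get the full five-element vector, zeros included; with `num_best=2` only the two best nonzero matches come back, each paired with its document id.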