Saving and loading during gensim model training: Dictionary, Mod

Author: 旅行小张 | Published: 2020-06-24 16:01

Dictionary

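    from gensim import corpora

    # Build the dictionary (a mapping between words and integer ids).
    # create_dict and document_final are not shown in this post.
    dictionary = create_dict(document_final)
    print('Dictionary created!')
    # Save the dictionary
    dictionary.save('D:/Python/Projects/NLP/doc2bow/doc2bow.dict')
    print('Dictionary saved!')
    # Load the dictionary
    dictionary_load = corpora.Dictionary.load('D:/Python/Projects/NLP/doc2bow/doc2bow.dict')
    print('Dictionary loaded!')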

Model

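    from gensim import models

    # Save the model
    # Option 1: gensim's native format (tfidf is a trained TfidfModel)
    tfidf.save('D:/Python/Projects/NLP/doc2bow/doc2bow.model')
    print('Model saved!')
    # Option 2: export word vectors in word2vec format
    # (only for models that expose a .wv attribute, such as Word2Vec)
    # model.wv.save_word2vec_format('word2vec.vector')
    # model.wv.save_word2vec_format('word2vec.bin', binary=True)
    # Load the model
    model_load = models.TfidfModel.load('D:/Python/Projects/NLP/doc2bow/doc2bow.model')
    print('Model loaded!')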

Index

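    from gensim.similarities import Similarity
    from gensim.test.utils import get_tmpfile

    # Prefix for the shard files that back the index
    index_tmpfile = get_tmpfile("D:/Python/Projects/NLP/doc2bow/")
    index = Similarity(index_tmpfile, corpus_tfidf, num_features=len(dictionary))
    print('Index created!')
    # Save the index
    index.save('D:/Python/Projects/NLP/doc2bow/doc2bow.index')
    print('Index saved!')
    # Load the index
    index_load = Similarity.load('D:/Python/Projects/NLP/doc2bow/doc2bow.index')
    print('Index loaded!')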

Overview of the Similarity class

    def __init__(self, output_prefix, corpus, num_features, num_best=None, chunksize=256, shardsize=32768, norm='l2')

    Parameters
    ----------
    output_prefix : str
        Prefix for shard filename. If None, a random filename in temp will be used.
    corpus : iterable of list of (int, number)
        Corpus in streamed Gensim bag-of-words format, i.e. an already vectorized corpus.
    num_features : int
        Size of the dictionary (number of features).
    num_best : int, optional
        If set, return only the `num_best` most similar documents, always leaving out documents with similarity = 0.
        Otherwise, return a full vector with one float for every document in the index.
    chunksize : int, optional
        Size of query chunks. Used internally when the query is an entire corpus.
    shardsize : int, optional
        Maximum shard size, in documents. Choose a value so that a `shardsize x chunksize` matrix of floats fits
        comfortably into your RAM.
    norm : {'l1', 'l2'}, optional
        Normalization to use.
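To make the `shardsize` advice concrete, here is a small back-of-the-envelope calculation of how much memory a `shardsize x chunksize` matrix of floats would need. It assumes 4-byte (float32) values, which is an assumption of this sketch. The helper name `query_block_bytes` is made up for illustration and is not part of gensim.

```python
def query_block_bytes(shardsize=32768, chunksize=256, bytes_per_float=4):
    """Bytes needed for a shardsize x chunksize matrix of floats.

    bytes_per_float=4 assumes float32 values (an assumption of this sketch).
    """
    return shardsize * chunksize * bytes_per_float


# Defaults from the signature above: 32768 documents x 256 queries.
default_bytes = query_block_bytes()
mib = default_bytes / 2**20
print(f"{default_bytes} bytes = {mib} MiB")  # 33554432 bytes = 32.0 MiB
```

Under that assumption the default values need about 32 MiB per query chunk, and doubling either `shardsize` or `chunksize` doubles the figure.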


      Article link: https://www.haomeiwen.com/subject/fairfktx.html