Indri Search Engine Series: Building an Index (Indexing)

Author: 我就爱思考 | Published 2017-02-23 23:46

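To build an index for a document collection, run IndriBuildIndex path-to-index_parameter_file. Here index_parameter_file is a parameter file in XML format that configures how the index is built.
The annotated example below explains how each parameter is used.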


<parameters>
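    <memory>1G</memory>  <!-- Memory available to the indexing process; the number may take a K/M/G suffix to indicate size, e.g. 100M = 100000000 -->
    <index>/home/PROJECT/Index</index> <!-- Path where the generated index is stored. Note: delete the old index before rebuilding -->
    <stemmer> <!-- Stemming: either krovetz or porter; by default no stemming is applied -->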
        <name>krovetz</name>
    </stemmer> 

    <stopper> <!-- Stopwords to exclude; by default there are no stopwords -->
        <word>stopword</word>
    </stopper> 
    
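    <corpus>  <!-- May be specified multiple times -->
        <path>/home/Collections/Volume1</path> <!-- Path to the corpus to be indexed -->
        <class>trectext</class> <!-- Document format: trectext, trecweb, html, xml, pdf, txt, etc.; see https://sourceforge.net/p/lemur/wiki/Indexer%20File%20Formats/ -->
        <annotations>/path/to/file</annotations> <!-- Path to the file holding the offset annotations for this corpus; see https://sourceforge.net/p/lemur/wiki/Inline%20and%20Offset%20Annotations -->
        <offsetannotationhint>unordered</offsetannotationhint> <!-- Whether the offset annotations are stored in the same order as the documents; the value is ordered or unordered. With ordered, Indri does not need to load the whole annotations file into memory and can read it sequentially as the documents are processed -->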
    </corpus>
    <corpus>
        <path>/home/Collections/Volume2</path> 
        <class>trectext</class> 
    </corpus>
    <corpus>
        <path>/home/Collections/Volume3</path>
        <class>trectext</class>
    </corpus>
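    <field> <!-- An index usually needs indexed fields, which are used for field queries; this one is the title field. May be specified multiple times -->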
        <name>title</name>
    </field>  

    <field>
        <name>date</name>
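        <numeric>true</numeric> <!-- Whether the field contains numeric (integer) data -->
        <parserName>DateFieldAnnotator</parserName> <!-- Parser used to convert the numeric field to an integer value. The default is NumericFieldAnnotator; if the numeric field is supplied via offset annotations, use OffsetAnnotationAnnotator -->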
    </field> 
</parameters>
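
The memory setting accepts a K/M/G suffix, and the example 100M = 100000000 implies decimal multipliers rather than powers of 1024. As a rough illustration of that arithmetic only, here is a small Python sketch. The function name memory_to_bytes is hypothetical and this is not Indri's own parsing code; the K and G multipliers are assumed to follow the same decimal pattern as the M example.

```python
# Decimal multipliers, consistent with the article's example 100M = 100000000.
# K and G are assumed to follow the same pattern.
_SUFFIXES = {"K": 1000, "M": 1000 ** 2, "G": 1000 ** 3}


def memory_to_bytes(value):
    """Convert a string such as '100M' or '1G' to a number of bytes."""
    text = value.strip().upper()
    if text and text[-1] in _SUFFIXES:
        # Split off the suffix and scale the leading integer.
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    return int(text)  # no suffix: the value is taken as a plain byte count
```

Under these assumptions, the 1G used in the parameter file above corresponds to 1000000000 bytes.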

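When the same settings are reused across several collections, it can be convenient to generate the parameter file from a script instead of editing it by hand. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. It emits only elements that appear in the example above; the helper names build_parameters, write_parameter_file and run_indexer are hypothetical, and run_indexer assumes that the IndriBuildIndex binary is on the PATH.

```python
import subprocess
import xml.etree.ElementTree as ET


def build_parameters(index_dir, corpora, memory="1G", stemmer="krovetz", fields=()):
    """Return a <parameters> element for IndriBuildIndex.

    corpora is a list of (path, document_class) pairs and fields is a list
    of field names. Only elements shown in the article are emitted.
    """
    root = ET.Element("parameters")
    ET.SubElement(root, "memory").text = memory
    ET.SubElement(root, "index").text = index_dir
    if stemmer:  # pass stemmer=None to leave <stemmer> out (no stemming)
        ET.SubElement(ET.SubElement(root, "stemmer"), "name").text = stemmer
    for path, doc_class in corpora:  # one <corpus> element per collection
        corpus = ET.SubElement(root, "corpus")
        ET.SubElement(corpus, "path").text = path
        ET.SubElement(corpus, "class").text = doc_class
    for name in fields:  # one <field> element per indexed field
        ET.SubElement(ET.SubElement(root, "field"), "name").text = name
    return root


def write_parameter_file(root, filename):
    """Serialize the parameter tree to filename without an XML declaration."""
    ET.ElementTree(root).write(filename, encoding="utf-8", xml_declaration=False)


def run_indexer(parameter_file):
    """Run IndriBuildIndex on the file; assumes the binary is on the PATH."""
    subprocess.run(["IndriBuildIndex", parameter_file], check=True)
```

A typical use would be to call build_parameters with the index path and the three Volume directories from the example, write the result with write_parameter_file, and pass the resulting file to run_indexer. As noted above, the old index directory should be deleted first when rebuilding.
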
References:

https://sourceforge.net/p/lemur/wiki/IndriBuildIndex%20Parameters/

Article link: https://www.haomeiwen.com/subject/cndiwttx.html