Lucene

Author: 丹之 | Published: 2018-11-29 20:30


Index

One directory holds one index: all the files in the same folder together make up a single Lucene index. This is similar to a table in a database.

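Segment

A Lucene index may be composed of multiple sub-indexes, which are called segments. Each segment is a complete, independent index that can be searched on its own.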

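Document

A document is the basic unit of indexing and search, comparable to a row in a relational database or a document in a document database. Documents live inside segments: an index can contain multiple segments, segments are independent of one another, adding new documents can create a new segment, and different segments can be merged. The segment is the unit in which index data is stored.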

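Field

A document contains different kinds of information that can be indexed separately, such as title, time, body, and author. A field is similar to a column in a database table.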

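Term

A term is the smallest unit of the index: a string produced by lexical analysis and language processing. A Field consists of one or more Terms. For example, if the title is "hello lucene", tokenization yields "hello" and "lucene". These two words are the Terms, and a keyword search for "hello" or "lucene" will find this title.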
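The relationship between fields, terms, and search hits can be illustrated with a toy inverted index. This is a minimal sketch using only the JDK, not Lucene's actual data structures: the class `TermIndexSketch`, its method names, and the whitespace tokenization are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model (hypothetical, not Lucene code): maps "field:term" to document ids.
public class TermIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Lowercase the field value, split it on whitespace, and record each term.
    public void addField(int docId, String field, String value) {
        for (String token : value.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(field + ":" + token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Return the ids of the documents whose field contains exactly this term.
    public Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        TermIndexSketch index = new TermIndexSketch();
        index.addField(0, "title", "hello lucene");
        index.addField(1, "title", "hello world");
        System.out.println(index.search("title", "hello"));  // prints [0, 1]
        System.out.println(index.search("title", "lucene")); // prints [0]
    }
}
```

A search for either term finds document 0, which mirrors the "hello lucene" title example above.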

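Analyzer

A piece of meaningful text must be split into individual words by an Analyzer before it can be searched by keyword. StandardAnalyzer is a commonly used analyzer in Lucene; for Chinese text there are CJKAnalyzer, SmartChineseAnalyzer, and others.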
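To make the idea of analysis concrete, here is a rough JDK-only sketch of what an analyzer does: lowercase the text, split it into tokens, and drop stop words. It is not how StandardAnalyzer is implemented; the class name `AnalyzerSketch` and the stop-word list are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer (hypothetical): lowercases, splits on non-alphanumerics, removes stop words.
public class AnalyzerSketch {
    // Illustrative stop-word list only; real analyzers ship their own lists.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "is", "of", "the", "this"));

    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("This is a test of Sean")); // prints [test, sean]
    }
}
```

The same analysis has to be applied to both the indexed text and the query text; otherwise the query terms will not line up with the terms stored in the index.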

Conceptual diagram of the Lucene index storage structure

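The diagram above can be read roughly as follows. An index is made up of multiple segments. When new documents are added, a new segment is created, and segments can be merged (Segment-0, Segment-1, and Segment-2 are merged into Segment-4). Each segment holds document numbers together with the index information for those documents. Each document has multiple fields that can be indexed, and each field can be given a different type (StringField, TextField).
So, as the diagram shows, the Lucene hierarchy is: Index -> Segment -> Document -> Field -> Term.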
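The segment behaviour described above (each segment searchable on its own, and several segments merged into one) can be sketched with plain JDK collections. This is a simplified model under stated assumptions: `SegmentSketch` and its methods are hypothetical, and document ids are kept global here, whereas a real index numbers documents per segment and renumbers them on merge.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model (hypothetical): a segment is a map from term to document ids.
public class SegmentSketch {
    public static Map<String, Set<Integer>> segment() {
        return new HashMap<>();
    }

    // Record that the document contains each of the given terms.
    public static void add(Map<String, Set<Integer>> seg, int docId, String... terms) {
        for (String term : terms) {
            seg.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // A search consults every segment independently and unions the hits.
    public static Set<Integer> search(List<Map<String, Set<Integer>>> segments, String term) {
        Set<Integer> hits = new TreeSet<>();
        for (Map<String, Set<Integer>> seg : segments) {
            hits.addAll(seg.getOrDefault(term, Collections.<Integer>emptySet()));
        }
        return hits;
    }

    // Merging copies the postings of all segments into one new segment.
    public static Map<String, Set<Integer>> merge(List<Map<String, Set<Integer>>> segments) {
        Map<String, Set<Integer>> merged = segment();
        for (Map<String, Set<Integer>> seg : segments) {
            for (Map.Entry<String, Set<Integer>> e : seg.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> s0 = segment();
        add(s0, 0, "hello", "lucene");
        Map<String, Set<Integer>> s1 = segment();
        add(s1, 1, "hello", "world");
        List<Map<String, Set<Integer>>> segments = Arrays.asList(s0, s1);
        System.out.println(search(segments, "hello"));    // prints [0, 1]
        System.out.println(merge(segments).get("hello")); // prints [0, 1]
    }
}
```

Searching the two segments separately and searching the merged segment give the same hits, which is why merging can happen in the background without changing query results.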

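Building and querying an index

An index must be built before files can be searched, so read the diagram above starting from the "files to be searched" node.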

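Building the index:

1. Create a Document object for each file to be searched, and represent each part of the file's content as a Field object.

2. Use an Analyzer to tokenize the natural-language text in the document, and use an IndexWriter to build the index.

3. Use FSDirectory to specify how and where the index is stored, so that the index is persisted.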

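Searching the index:

4. Use an IndexReader to read the index.

5. Use the Term class to represent the keyword the user is looking for and the field it belongs to, and use QueryParser to turn the user's search input into a query.

6. Use an IndexSearcher to search the index and return the Document objects that match the query.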

The dashed lines in the diagram point to the package each class belongs to. For example, Analyzer is in the org.apache.lucene.analysis package.

Index-building code:

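// Build the index
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.util.Date;
import java.util.LinkedList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CreateTest {

    public static void main(String[] args) throws Exception {
        Path indexPath = FileSystems.getDefault().getPath("d:\\index\\");

        // FSDirectory has three main subclasses; open() picks the most suitable one
        // for the current environment:
        //   MMapDirectory:     Linux, MacOSX, Solaris, and Windows 64-bit JREs
        //   NIOFSDirectory:    other non-Windows JREs
        //   SimpleFSDirectory: other JREs on Windows
        Directory dir = FSDirectory.open(indexPath);

        // Analyzer used to tokenize text fields
        Analyzer analyzer = new StandardAnalyzer();
        boolean create = true;
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        if (create) {
            // Create a new index, overwriting any existing one
            indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        } else {
            // Open the existing index and append to it, or create one if none exists.
            // Lucene has no in-place update: IndexWriter.updateDocument deletes the
            // matching documents and then adds the new one.
            indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        }
        IndexWriter indexWriter = new IndexWriter(dir, indexWriterConfig);

        Document doc = new Document();
        // StringField: the value is indexed but not tokenized, i.e. the whole value is
        // treated as a single token. Typically used for fields such as "country" or "ID".
        // Field.Store controls whether the original value is stored in the index.
        // Store the value if it should be shown in search results.
        // Do not store it if it is large (e.g. a full article body) and never displayed.
        doc.add(new StringField("Title", "sean", Field.Store.YES));
        long time = new Date().getTime();
        // LongPoint indexes the value for range and exact queries but does not store it
        doc.add(new LongPoint("LastModified", time));
        // Doc values would be needed to sort on this field:
        // doc.add(new NumericDocValuesField("LastModified", time));
        // TextField: indexed and tokenized automatically; typically used for article bodies
        doc.add(new TextField("Content", "this is a test of sean", Field.Store.NO));

        List<Document> docs = new LinkedList<>();
        docs.add(doc);

        indexWriter.addDocuments(docs);
        // close() commits pending changes by default
        indexWriter.close();
    }
}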
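The difference between an untokenized field (StringField) and a tokenized field (TextField) decides which query terms can match. The sketch below models only that difference with plain JDK code. `FieldTypeSketch` and its methods are hypothetical, and the whitespace tokenizer is a rough stand-in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy model (hypothetical): which terms end up in the index for each field type.
public class FieldTypeSketch {
    // Untokenized: the entire value becomes one term, unchanged.
    public static List<String> keywordTerms(String value) {
        return Collections.singletonList(value);
    }

    // Tokenized: the value is lowercased and split into separate terms.
    public static List<String> textTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(keywordTerms("New Zealand")); // prints [New Zealand]
        System.out.println(textTerms("New Zealand"));    // prints [new, zealand]
    }
}
```

In this model a lookup for "zealand" can only hit the tokenized field, while the untokenized field matches only the exact value "New Zealand". That is why untokenized fields suit identifiers and fixed values such as country names.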

Corresponding sequence diagram:

Index-querying code:

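// Query the index
import java.nio.file.FileSystems;
import java.nio.file.Path;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class QueryTest {

    public static void main(String[] args) throws Exception {
        Path indexPath = FileSystems.getDefault().getPath("d:\\index\\");
        Directory dir = FSDirectory.open(indexPath);
        // Analyzer (only needed by the query-parser example below)
        Analyzer analyzer = new StandardAnalyzer();

        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);

        // Query several fields at once (needs the lucene-queryparser imports)
//        String[] queryFields = {"Title", "Content", "LastModified"};
//        QueryParser parser = new MultiFieldQueryParser(queryFields, analyzer);
//        Query query = parser.parse("sean");

        // Look up documents by a single term in one field
//        Term term = new Term("Title", "test");
//        Query query = new TermQuery(term);

        // Wildcard query (for edit-distance fuzzy matching, use FuzzyQuery instead)
//        Term term = new Term("Title", "se*");
//        WildcardQuery query = new WildcardQuery(term);

        // Range query; both bounds are inclusive
        Query query1 = LongPoint.newRangeQuery("LastModified", 1L, 1637069693000L);

        // Phrase query over several terms. The slop is the maximum number of position
        // moves allowed between the terms; it defaults to 0 (exact phrase).
        PhraseQuery.Builder phraseQueryBuilder = new PhraseQuery.Builder();
        phraseQueryBuilder.add(new Term("Content", "test"));
        phraseQueryBuilder.add(new Term("Content", "sean"));
        phraseQueryBuilder.setSlop(10);
        PhraseQuery query2 = phraseQueryBuilder.build();

        // Compound query: both clauses must match
        BooleanQuery.Builder booleanQueryBuilder = new BooleanQuery.Builder();
        booleanQueryBuilder.add(query1, BooleanClause.Occur.MUST);
        booleanQueryBuilder.add(query2, BooleanClause.Occur.MUST);
        BooleanQuery query = booleanQueryBuilder.build();

        // Order of the returned documents. With Type.SCORE the hits are sorted by
        // relevance and the field name is ignored. Sorting on an actual field requires
        // that field to be indexed with doc values; otherwise the search fails.
        Sort sort = new Sort();
        SortField sortField = new SortField("Title", SortField.Type.SCORE);
        sort.setSort(sortField);

        TopDocs topDocs = searcher.search(query, 10, sort);
        // totalHits is a long in Lucene 7.x
        if (topDocs.totalHits > 0) {
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                int docNum = scoreDoc.doc;
                Document doc = searcher.doc(docNum);
                System.out.println(doc.toString());
            }
        }
        reader.close();
        dir.close();
    }
}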
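The slop passed to setSlop in the phrase query above is easier to reason about with a small calculation. The sketch below is a deliberately simplified model, not Lucene's matching algorithm: it handles only a two-term phrase whose second term appears after the first, and it assumes every word of the original text, stop words included, occupies one position. The class `SlopSketch` and its method are hypothetical.

```java
// Simplified model (hypothetical) of slop for an in-order two-term phrase.
public class SlopSketch {
    // The phrase expects the second term directly after the first (gap of 1).
    // Every extra position between them costs one unit of slop.
    public static boolean matches(int firstPos, int secondPos, int slop) {
        int gap = secondPos - firstPos;
        return gap >= 1 && gap - 1 <= slop;
    }

    public static void main(String[] args) {
        // "this is a test of sean": "test" is at position 3, "sean" at position 5.
        System.out.println(matches(3, 5, 0));  // prints false: "of" sits in between
        System.out.println(matches(3, 5, 1));  // prints true
        System.out.println(matches(3, 5, 10)); // prints true
    }
}
```

Under these assumptions the sample document needs a slop of at least 1 for the phrase "test sean", so the value 10 used in the query is more generous than strictly required.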

Corresponding sequence diagram:

Maven dependencies:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.4.0</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.4.0</version>
</dependency>

https://mp.weixin.qq.com/s/VRqp9V1ppyxkqf8l7sn_xg