Lucene Source Code Analysis: The Search Process (Part 1)

Author: 尹亮_36cd | Published 2019-04-22 21:24

    1 Search Example

    First, a few documents were written into the Lucene index ahead of time. Each document has two fields (id and name), and both fields are stored and indexed:

    {
        "id": 0,
        "name": "Stephen"
    },{
        "id": 1,
        "name": "Draymond"
    },{
        "id": 2,
        "name": "LeBron"
    },{
        "id": 3,
        "name": "Kevin"
    }
    

    Using the code below, we can search the Lucene index for documents whose name is LeBron. Over this series of articles we will analyze how this code works.

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class IndexSearcherTest {
        public static void main(String[] args) throws IOException, ParseException {
            // FSDirectory.open() takes a File in Lucene 4.x (a Path in 5.x+)
            Directory directory = FSDirectory.open(new File("/lucene/index/path"));
            IndexReader indexReader = DirectoryReader.open(directory);
    
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            
            QueryParser queryParser = new QueryParser("name", new StandardAnalyzer());
            Query query = queryParser.parse("LeBron");
    
            TopDocs topDocs = indexSearcher.search(query, 10);
    
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                System.out.println("doc: " + scoreDoc.doc + ", score: " + scoreDoc.score);
    
                Document document = indexReader.document(scoreDoc.doc);
                if (document == null) {
                    continue;
                }
                System.out.println("id: " + document.get("id") + ", name: " + document.get("name"));
            }
    
            indexReader.close();
            directory.close();
        }
    }
    

    This article covers the first two steps: opening the index path and opening the index directory.

    2 Opening the Index File

    Since the index lives on local disk, FSDirectory is used to open the local index files and obtain a Directory object for the index path:

    Directory directory = FSDirectory.open(new File("/lucene/index/path"));
    

    ① If the current JRE is 64-bit and unmapping is supported (the sun.misc.Cleaner class and the java.nio.DirectByteBuffer.cleaner() method can be loaded), an MMapDirectory is created
    ② Otherwise, if the operating system is Windows (the OS name starts with "Windows"), a SimpleFSDirectory is created
    ③ If neither condition holds, an NIOFSDirectory is created

    public abstract class FSDirectory extends BaseDirectory {
      public static FSDirectory open(File path) throws IOException {
        return open(path, null);
      }
    
      public static FSDirectory open(File path, LockFactory lockFactory) throws IOException {
        if (Constants.JRE_IS_64BIT && MMapDirectory.UNMAP_SUPPORTED) {
          return new MMapDirectory(path, lockFactory);
        } else if (Constants.WINDOWS) {
          return new SimpleFSDirectory(path, lockFactory);
        } else {
          return new NIOFSDirectory(path, lockFactory);
        }
      }
    }
    

    The constructors of all three Directory implementations delegate to the parent class FSDirectory.

    (Figure: FSDirectory class diagram)

    The FSDirectory constructor mainly initializes the lockFactory and the directory field:

    public abstract class FSDirectory extends BaseDirectory {
      protected FSDirectory(File path, LockFactory lockFactory) throws IOException {
        // new ctors use always NativeFSLockFactory as default:
        if (lockFactory == null) {
          lockFactory = new NativeFSLockFactory();
        }
        directory = path.getCanonicalFile();
    
        if (directory.exists() && !directory.isDirectory())
          throw new NoSuchDirectoryException("file '" + directory + "' exists but is not a directory");
    
        setLockFactory(lockFactory);
    
      }
    }
    

    3 Opening the Index Directory

    3.1 Locating the segments file

    After the index path has been opened and the Directory obtained, DirectoryReader is used to open the index directory:

    IndexReader indexReader = DirectoryReader.open(directory);
    

    This process is fairly involved. It mainly locates and opens the index's segments files (segments_N and segments.gen), then reads the segment metadata from them and opens the per-segment index files (.tip, .tim, .doc, .pos, .tvd, .tvx, .si, .nvd, .nvm).

    public abstract class DirectoryReader extends BaseCompositeReader<AtomicReader> {
      public static DirectoryReader open(final Directory directory) throws IOException {
        return StandardDirectoryReader.open(directory, null, DEFAULT_TERMS_INDEX_DIVISOR);
      }
    }
    

    StandardDirectoryReader's open() method creates a SegmentInfos.FindSegmentsFile object with an overridden doBody() method, then calls the object's run() method:

    final class StandardDirectoryReader extends DirectoryReader {
      static DirectoryReader open(final Directory directory, final IndexCommit commit,
                              final int termInfosIndexDivisor) throws IOException {
        return (DirectoryReader) new SegmentInfos.FindSegmentsFile(directory) {
          @Override
          protected Object doBody(String segmentFileName) throws IOException {
            SegmentInfos sis = new SegmentInfos();
            sis.read(directory, segmentFileName);
            final SegmentReader[] readers = new SegmentReader[sis.size()];
            boolean success = false;
            try {
              for (int i = sis.size()-1; i >= 0; i--) {
                readers[i] = new SegmentReader(sis.info(i), termInfosIndexDivisor, IOContext.READ);
              }
    
              DirectoryReader reader = new StandardDirectoryReader(directory, readers, null, sis, termInfosIndexDivisor, false);
              success = true;
    
              return reader;
            } finally {
              if (success == false) {
                IOUtils.closeWhileHandlingException(readers);
              }
            }
          }
        }.run(commit);
      }
    }
    

    SegmentInfos.FindSegmentsFile's run() method is what locates the segments file:
    ① Among files whose names start with "segments" (excluding segments.gen), take the largest generation suffix as genA
    ② Read the segments.gen file; according to the Lucene index file format it is laid out as:

    GenHeader Generation Generation Footer

    Generation is a long that is written twice; if the two copies agree, that value becomes genB
    ③ Take the larger of genA and genB as the final gen; the segments file to open is then named segments_[gen]
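    The two-source generation lookup above can be sketched in plain Java. The generation suffix of a segments_N file name is a long encoded in base 36 (Lucene writes it with Long.toString(gen, Character.MAX_RADIX)); the helper names below are ours for illustration, not Lucene's:

```java
public class GenResolution {
    // Parse the generation out of a "segments_N" file name (N is base-36).
    static long generationFromName(String name) {
        if (name.equals("segments")) return 0;  // legacy pre-generation name
        return Long.parseLong(name.substring("segments_".length()), Character.MAX_RADIX);
    }

    // Method 1 (genA): scan the directory listing for the largest generation.
    static long lastCommitGeneration(String[] files) {
        long max = -1;
        for (String f : files) {
            if (f.startsWith("segments") && !f.equals("segments.gen")) {
                max = Math.max(max, generationFromName(f));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        String[] files = {"_0.cfs", "segments_1", "segments_a", "segments.gen"};
        long genA = lastCommitGeneration(files); // 10, since "a" is 10 in base 36
        long genB = 10;                          // from segments.gen when gen0 == gen1
        long gen = Math.max(genA, genB);
        System.out.println("segments_" + Long.toString(gen, Character.MAX_RADIX));
    }
}
```

    With the file list above, both methods agree on generation 10, so the resolved file name is segments_a.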

    public final class SegmentInfos implements Cloneable, Iterable<SegmentCommitInfo> {
      public abstract static class FindSegmentsFile {
        public Object run(IndexCommit commit) throws IOException {
          if (commit != null) {
            if (directory != commit.getDirectory())
              throw new IOException("the specified commit does not match the specified Directory");
            return doBody(commit.getSegmentsFileName());
          }
    
          String segmentFileName = null;
          long lastGen = -1;
          long gen = 0;
          int retryCount = 0;  // incremented on retries (handling elided here)
    
          boolean useFirstMethod = true;
    
          while (true) {
            if (useFirstMethod) {
              // method 1: list the directory and also consult segments.gen
              String[] files = directory.listAll();
    
              long genA = -1;
              if (files != null) {
                genA = getLastCommitGeneration(files);
              }
    
              long genB = -1;
              ChecksumIndexInput genInput = null;
              try {
                genInput = directory.openChecksumInput(IndexFileNames.SEGMENTS_GEN, IOContext.READONCE);
              } catch (IOException e) {
                // segments.gen may not exist; genB stays -1
              }
    
              if (genInput != null) {
                try {
                  int version = genInput.readInt();
                  if (version == FORMAT_SEGMENTS_GEN_47 || version == FORMAT_SEGMENTS_GEN_CHECKSUM) {
                    long gen0 = genInput.readLong();
                    long gen1 = genInput.readLong();
                    if (gen0 == gen1) {
                      // The file is consistent.
                      genB = gen0;
                    }
                  } else {
                    throw new IndexFormatTooNewException(genInput, version, FORMAT_SEGMENTS_GEN_START, FORMAT_SEGMENTS_GEN_CURRENT);
                  }
                } catch (IOException err2) {
                  // will retry
                } finally {
                  genInput.close();
                }
              }
    
              gen = Math.max(genA, genB);
            }
    
            if (useFirstMethod && lastGen == gen && retryCount >= 2) {
              // third attempt with the same gen: fall back to the second method
              // (probing segments_N files directly, elided here)
              useFirstMethod = false;
            }
            lastGen = gen;
    
            segmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS,
                                                                    "",
                                                                    gen);
            try {
              Object v = doBody(segmentFileName);
              if (infoStream != null) {
                message("success on " + segmentFileName);
              }
              return v;
            } catch (IOException err) {
              // ...
            }
          }
        }
        protected abstract Object doBody(String segmentFileName) throws IOException;
      }
    }
    

    3.2 Opening the index files

    The doBody() method of SegmentInfos.FindSegmentsFile is where the individual index files are read:

    final class StandardDirectoryReader extends DirectoryReader {
      static DirectoryReader open(final Directory directory, final IndexCommit commit,
                              final int termInfosIndexDivisor) throws IOException {
        return (DirectoryReader) new SegmentInfos.FindSegmentsFile(directory) {
          @Override
          protected Object doBody(String segmentFileName) throws IOException {
            SegmentInfos sis = new SegmentInfos();
            sis.read(directory, segmentFileName);
            final SegmentReader[] readers = new SegmentReader[sis.size()];
            boolean success = false;
            try {
              for (int i = sis.size()-1; i >= 0; i--) {
                readers[i] = new SegmentReader(sis.info(i), termInfosIndexDivisor, IOContext.READ);
              }
    
              // This may throw IllegalArgumentException if there are too many docs, so
              // it must be inside try clause so we close readers in that case:
              DirectoryReader reader = new StandardDirectoryReader(directory, readers, null, sis, termInfosIndexDivisor, false);
              success = true;
    
              return reader;
            } finally {
              if (success == false) {
                IOUtils.closeWhileHandlingException(readers);
              }
            }
          }
        }.run(commit);
      }
    }
    

    First a SegmentInfos object is created, then sis.read(directory, segmentFileName) is called to read the segments file.
    The segments file format is:

    Header Version NameCounter SegCount <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles>^SegCount CommitUserData Footer

    Then each segment's metadata in the SegmentInfos is iterated over, and the SegmentReader constructor is invoked to read that segment's index files:

    public final class SegmentReader extends AtomicReader implements Accountable {
      public SegmentReader(SegmentCommitInfo si, int termInfosIndexDivisor, IOContext context) throws IOException {
        this.si = si;
        // read the field infos (via the .cfs compound file if the segment uses one)
        fieldInfos = readFieldInfos(si);
        // read the .tip, .tim, .nvd, .nvm, .fdt, .fdx, .tvf, .tvd and .tvx files
        core = new SegmentCoreReaders(this, si.info.dir, si, context, termInfosIndexDivisor);
        segDocValues = new SegmentDocValues();
        
        boolean success = false;
        final Codec codec = si.info.getCodec();
        try {
          if (si.hasDeletions()) {
            // read the .del file
            liveDocs = codec.liveDocsFormat().readLiveDocs(directory(), si, IOContext.READONCE);
          } else {
            assert si.getDelCount() == 0;
            liveDocs = null;
          }
          numDocs = si.info.getDocCount() - si.getDelCount();
    
          if (fieldInfos.hasDocValues()) {
            initDocValuesProducers(codec);
          }
    
          success = true;
        } finally {
          if (!success) {
            doClose();
          }
        }
      }
    }
    

    ① The readFieldInfos() method reads the .cfs file, a "virtual" file used to access the compound stream:

    public final class SegmentReader extends AtomicReader implements Accountable {
      static FieldInfos readFieldInfos(SegmentCommitInfo info) throws IOException {
        final Directory dir;
        final boolean closeDir;
        if (info.getFieldInfosGen() == -1 && info.info.getUseCompoundFile()) {
          // no fieldInfos gen and segment uses a compound file
          dir = new CompoundFileDirectory(info.info.dir,
              IndexFileNames.segmentFileName(info.info.name, "", IndexFileNames.COMPOUND_FILE_EXTENSION),
              IOContext.READONCE,
              false);
          closeDir = true;
        } else {
          // gen'd FIS are read outside CFS, or the segment doesn't use a compound file
          dir = info.info.dir;
          closeDir = false;
        }
        
        try {
          final String segmentSuffix = info.getFieldInfosGen() == -1 ? "" : Long.toString(info.getFieldInfosGen(), Character.MAX_RADIX);
          Codec codec = info.info.getCodec();
          FieldInfosFormat fisFormat = codec.fieldInfosFormat();
          return fisFormat.getFieldInfosReader().read(dir, info.info.name, segmentSuffix, IOContext.READONCE);
        } finally {
          if (closeDir) {
            dir.close();
          }
        }
      }
    }
    

    ② Constructing the SegmentCoreReaders object mainly reads the .tip, .tim, .nvd, .nvm, .fdt, .fdx, .tvf, .tvd and .tvx files:

    final class SegmentCoreReaders implements Accountable {
      SegmentCoreReaders(SegmentReader owner, Directory dir, SegmentCommitInfo si, IOContext context, int termsIndexDivisor) throws IOException {
    
        if (termsIndexDivisor == 0) {
          throw new IllegalArgumentException("indexDivisor must be < 0 (don't load terms index) or greater than 0 (got 0)");
        }
        
        final Codec codec = si.info.getCodec();
        final Directory cfsDir; // confusing name: if (cfs) its the cfsdir, otherwise its the segment's directory.
    
        boolean success = false;
        
        try {
          if (si.info.getUseCompoundFile()) {
            // open the .cfs compound file
            cfsDir = cfsReader = new CompoundFileDirectory(dir, IndexFileNames.segmentFileName(si.info.name, "", IndexFileNames.COMPOUND_FILE_EXTENSION), context, false);
          } else {
            cfsReader = null;
            cfsDir = dir;
          }
    
          final FieldInfos fieldInfos = owner.fieldInfos;
          
          this.termsIndexDivisor = termsIndexDivisor;
          final PostingsFormat format = codec.postingsFormat();
          final SegmentReadState segmentReadState = new SegmentReadState(cfsDir, si.info, fieldInfos, context, termsIndexDivisor);
          // read the .tip and .tim files
          fields = format.fieldsProducer(segmentReadState);
          assert fields != null;
    
          if (fieldInfos.hasNorms()) {
            // read the .nvd and .nvm files
            normsProducer = codec.normsFormat().normsProducer(segmentReadState);
            assert normsProducer != null;
          } else {
            normsProducer = null;
          }
          // read the .fdx and .fdt files
          fieldsReaderOrig = si.info.getCodec().storedFieldsFormat().fieldsReader(cfsDir, si.info, fieldInfos, context);
          if (fieldInfos.hasVectors()) { 
            // read the .tvf, .tvd and .tvx files
            termVectorsReaderOrig = si.info.getCodec().termVectorsFormat().vectorsReader(cfsDir, si.info, fieldInfos, context);
          } else {
            termVectorsReaderOrig = null;
          }
    
          success = true;
        } finally {
          if (!success) {
            decRef();
          }
        }
      }
    }
    

    The .tim (term dictionary) file format:

    Header PostingsHeader NodeBlock^NumBlocks FieldSummary DirOffset Footer

    The .tip (terms index) file format:

    Header FSTIndex^NumFields <IndexStartFP>^NumFields DirOffset Footer

    public class BlockTreeTermsReader extends FieldsProducer {
      public BlockTreeTermsReader(Directory dir, FieldInfos fieldInfos, SegmentInfo info,
                                  PostingsReaderBase postingsReader, IOContext ioContext,
                                  String segmentSuffix, int indexDivisor)
        throws IOException {
        
        this.postingsReader = postingsReader;
    
        this.segment = info.name;
        // open the .tim (term dictionary) file
        in = dir.openInput(IndexFileNames.segmentFileName(segment, segmentSuffix, BlockTreeTermsWriter.TERMS_EXTENSION),
                           ioContext);
    
        boolean success = false;
        IndexInput indexIn = null;
    
        try {
          version = readHeader(in);
          if (indexDivisor != -1) {
            indexIn = dir.openInput(IndexFileNames.segmentFileName(segment, segmentSuffix, BlockTreeTermsWriter.TERMS_INDEX_EXTENSION),
                                    ioContext);
            int indexVersion = readIndexHeader(indexIn);
            if (indexVersion != version) {
              throw new CorruptIndexException("mixmatched version files: " + in + "=" + version + "," + indexIn + "=" + indexVersion);
            }
          }
          
          // verify
          if (indexIn != null && version >= BlockTreeTermsWriter.VERSION_CHECKSUM) {
            CodecUtil.checksumEntireFile(indexIn);
          }
    
          // Have PostingsReader init itself
          postingsReader.init(in);
    
          // ...
      }
    }
    

    The .fdx (field index) file format:

    Header <FieldValuesPosition>^SegSize

    The .fdt (field data) file format:

    Header <DocFieldData>^SegSize

    public final class Lucene40StoredFieldsReader extends StoredFieldsReader implements Cloneable, Closeable {
      public Lucene40StoredFieldsReader(Directory d, SegmentInfo si, FieldInfos fn, IOContext context) throws IOException {
        final String segment = si.name;
        boolean success = false;
        fieldInfos = fn;
        try {
          fieldsStream = d.openInput(IndexFileNames.segmentFileName(segment, "", FIELDS_EXTENSION), context);
          final String indexStreamFN = IndexFileNames.segmentFileName(segment, "", FIELDS_INDEX_EXTENSION);
          indexStream = d.openInput(indexStreamFN, context);
          
          CodecUtil.checkHeader(indexStream, CODEC_NAME_IDX, VERSION_START, VERSION_CURRENT);
          CodecUtil.checkHeader(fieldsStream, CODEC_NAME_DAT, VERSION_START, VERSION_CURRENT);
          assert HEADER_LENGTH_DAT == fieldsStream.getFilePointer();
          assert HEADER_LENGTH_IDX == indexStream.getFilePointer();
          final long indexSize = indexStream.length() - HEADER_LENGTH_IDX;
          this.size = (int) (indexSize >> 3);
          // Verify two sources of "maxDoc" agree:
          if (this.size != si.getDocCount()) {
            throw new CorruptIndexException("doc counts differ for segment " + segment + ": fieldsReader shows " + this.size + " but segmentInfo shows " + si.getDocCount());
          }
          numTotalDocs = (int) (indexSize >> 3);
          success = true;
        } finally {
          if (!success) {
            try {
              close();
            } catch (Throwable t) {} // ensure we throw our original exception
          }
        }
      }
    }
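
    The size computation above relies on the .fdx layout: after the header, each document contributes exactly one 8-byte long (its pointer into .fdt), so the document count is the remaining byte count divided by 8, which the constructor writes as indexSize >> 3. A standalone sketch of that arithmetic (the header length used below is illustrative, not Lucene's actual constant):

```java
public class FdxSize {
    // Each .fdx entry after the header is one 8-byte long pointing into
    // .fdt, so: docCount == (fileLength - headerLength) / 8.
    static int docCount(long fileLength, long headerLength) {
        long indexSize = fileLength - headerLength;
        return (int) (indexSize >> 3); // shift right by 3 == divide by 8
    }

    public static void main(String[] args) {
        long header = 20;                           // illustrative header length
        long file = header + 4 * 8;                 // a segment with 4 documents
        System.out.println(docCount(file, header)); // prints 4
    }
}
```

    This is also why the constructor can cross-check this value against si.getDocCount(): the two are independent records of the same document count.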
    

    The .tvx file format:

    Header <DocumentPosition,FieldPosition>^NumDocs

    The .tvd file format:

    Header <NumFields, FieldNums, FieldPositions>^NumDocs

    The .tvf file format:

    Header <NumTerms, Flags, TermFreqs>^NumFields

    The call liveDocs = codec.liveDocsFormat().readLiveDocs(directory(), si, IOContext.READONCE); reads the .del file.
    The .del file format:

    Format Header ByteCount BitCount Bits

    public class Lucene40LiveDocsFormat extends LiveDocsFormat {
      public Bits readLiveDocs(Directory dir, SegmentCommitInfo info, IOContext context) throws IOException {
        String filename = IndexFileNames.fileNameFromGeneration(info.info.name, DELETES_EXTENSION, info.getDelGen());
        final BitVector liveDocs = new BitVector(dir, filename, context);
        if (liveDocs.length() != info.info.getDocCount()) {
          throw new CorruptIndexException("liveDocs.length()=" + liveDocs.length() + " info.docCount=" + info.info.getDocCount() + " (filename=" + filename + ")");
        }
        if (liveDocs.count() != info.info.getDocCount() - info.getDelCount()) {
          throw new CorruptIndexException("liveDocs.count()=" + liveDocs.count() + " info.docCount=" + info.info.getDocCount() + " info.getDelCount()=" + info.getDelCount() + " (filename=" + filename + ")");
        }
        return liveDocs;
      }
    }
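
    The two CorruptIndexException checks above encode a simple invariant: the live-docs bit vector carries one bit per document in the segment, and the number of set bits must equal docCount minus the deletion count. A minimal sketch, with java.util.BitSet standing in for Lucene's BitVector:

```java
import java.util.BitSet;

public class LiveDocsCheck {
    // Mirror the cardinality check in readLiveDocs(): the number of live
    // (set) bits must equal docCount - delCount.
    static boolean consistent(BitSet liveDocs, int docCount, int delCount) {
        return liveDocs.cardinality() == docCount - delCount;
    }

    public static void main(String[] args) {
        int docCount = 4, delCount = 1;
        BitSet liveDocs = new BitSet(docCount);
        liveDocs.set(0, docCount); // all docs live...
        liveDocs.clear(2);         // ...except doc 2, which was deleted
        System.out.println(consistent(liveDocs, docCount, delCount)); // prints true
    }
}
```

    Lucene's BitVector additionally records its own length, which is what the first check (liveDocs.length() vs. docCount) verifies; BitSet has no fixed length, so that part is omitted here.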
    

    4 Creating the IndexSearcher

    After the index directory has been opened, the next step is to create the IndexSearcher object:

    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    

    Creating the IndexSearcher mainly initializes the searcher's context and reader:

    public class IndexSearcher {
      public IndexSearcher(IndexReader r) {
        this(r, null);
      }
      public IndexSearcher(IndexReader r, ExecutorService executor) {
        this(r.getContext(), executor);
      }
      public IndexSearcher(IndexReaderContext context, ExecutorService executor) {
        assert context.isTopLevel: "IndexSearcher's ReaderContext must be topLevel for reader" + context.reader();
        reader = context.reader();
        this.executor = executor;
        this.readerContext = context;
        leafContexts = context.leaves();
        this.leafSlices = executor == null ? null : slices(leafContexts);
      }
    }
    


    Original link: https://www.haomeiwen.com/subject/ceyygqtx.html