Lucene Source Code Analysis: The Search Process (Part 1)

Author: 尹亮_36cd | Published 2019-04-22 21:24

    1 Search Example

    First, a few documents were written into the Lucene index ahead of time. Each document has two fields (id and name), and both fields are stored and indexed:

    {
        "id": 0,
        "name": "Stephen"
    },{
        "id": 1,
        "name": "Draymond"
    },{
        "id": 2,
        "name": "LeBron"
    },{
        "id": 3,
        "name": "Kevin"
    }
    

    Using the code below, we can search the Lucene index for documents whose name is LeBron. Over this series of articles we will analyze how this code works.

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class IndexSearcherTest {
        public static void main(String[] args) throws IOException, ParseException {
            // FSDirectory.open() takes a File in Lucene 4.x (a Path in 5.x+)
            Directory directory = FSDirectory.open(new File("/lucene/index/path"));
            IndexReader indexReader = DirectoryReader.open(directory);
    
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            
            QueryParser queryParser = new QueryParser("name", new StandardAnalyzer());
            Query query = queryParser.parse("LeBron");
    
            TopDocs topDocs = indexSearcher.search(query, 10);
    
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                System.out.println("doc: " + scoreDoc.doc + ", score: " + scoreDoc.score);
    
                Document document = indexReader.document(scoreDoc.doc);
                if (document == null) {
                    continue;
                }
                System.out.println("id: " + document.get("id") + ", name: " + document.get("name"));
            }
    
            indexReader.close();
            directory.close();
        }
    }
    

    This article covers the first two steps: opening the index path and opening the index directory.

    2 Opening the Index File

    Since the index lives on local disk, FSDirectory is used to open the local index files and obtain a Directory object for the index path:

    Directory directory = FSDirectory.open(new File("/lucene/index/path"));
    

    ① If the current JRE is 64-bit and unmapping is supported (the sun.misc.Cleaner class and the java.nio.DirectByteBuffer.cleaner() method can be loaded), an MMapDirectory is created
    ② Otherwise, if the operating system is Windows (the OS name starts with "Windows"), a SimpleFSDirectory is created
    ③ If neither condition holds, an NIOFSDirectory is created

    public abstract class FSDirectory extends BaseDirectory {
      public static FSDirectory open(File path) throws IOException {
        return open(path, null);
      }
    
      public static FSDirectory open(File path, LockFactory lockFactory) throws IOException {
        if (Constants.JRE_IS_64BIT && MMapDirectory.UNMAP_SUPPORTED) {
          return new MMapDirectory(path, lockFactory);
        } else if (Constants.WINDOWS) {
          return new SimpleFSDirectory(path, lockFactory);
        } else {
          return new NIOFSDirectory(path, lockFactory);
        }
      }
    }
    

    The constructors of all three Directory implementations delegate to the parent class FSDirectory.

    (Figure: FSDirectory class diagram)

    The FSDirectory constructor mainly initializes the lockFactory and the directory field:

    public abstract class FSDirectory extends BaseDirectory {
      protected FSDirectory(File path, LockFactory lockFactory) throws IOException {
        // new ctors use always NativeFSLockFactory as default:
        if (lockFactory == null) {
          lockFactory = new NativeFSLockFactory();
        }
        directory = path.getCanonicalFile();
    
        if (directory.exists() && !directory.isDirectory())
          throw new NoSuchDirectoryException("file '" + directory + "' exists but is not a directory");
    
        setLockFactory(lockFactory);
    
      }
    }
    

    3 Opening the Index Directory

    3.1 Locating the segments file

    After the index path has been opened and the Directory obtained, DirectoryReader is used to open the index directory:

    IndexReader indexReader = DirectoryReader.open(directory);
    

    This process is fairly involved. It mainly locates and opens the index's segments files (segments_N and segments.gen), then reads the segment metadata from them and opens the per-segment index files (.tip, .tim, .doc, .pos, .tvd, .tvx, .si, .nvd, .nvm).

    public abstract class DirectoryReader extends BaseCompositeReader<AtomicReader> {
      public static DirectoryReader open(final Directory directory) throws IOException {
        return StandardDirectoryReader.open(directory, null, DEFAULT_TERMS_INDEX_DIVISOR);
      }
    }
    

    StandardDirectoryReader's open() method creates a SegmentInfos.FindSegmentsFile object with an overridden doBody() method, then calls the object's run() method:

    final class StandardDirectoryReader extends DirectoryReader {
      static DirectoryReader open(final Directory directory, final IndexCommit commit,
                              final int termInfosIndexDivisor) throws IOException {
        return (DirectoryReader) new SegmentInfos.FindSegmentsFile(directory) {
          @Override
          protected Object doBody(String segmentFileName) throws IOException {
            SegmentInfos sis = new SegmentInfos();
            sis.read(directory, segmentFileName);
            final SegmentReader[] readers = new SegmentReader[sis.size()];
            boolean success = false;
            try {
              for (int i = sis.size()-1; i >= 0; i--) {
                readers[i] = new SegmentReader(sis.info(i), termInfosIndexDivisor, IOContext.READ);
              }
    
              DirectoryReader reader = new StandardDirectoryReader(directory, readers, null, sis, termInfosIndexDivisor, false);
              success = true;
    
              return reader;
            } finally {
              if (success == false) {
                IOUtils.closeWhileHandlingException(readers);
              }
            }
          }
        }.run(commit);
      }
    }
    

    SegmentInfos.FindSegmentsFile's run() method is what locates the segments file:
    ① Among files whose names start with "segments" (excluding segments.gen), take the largest generation suffix as genA
    ② Read the segments.gen file; according to the Lucene index file format it is laid out as:

    GenHeader Generation Generation Footer

    Generation is a long that is written twice; if the two copies agree, that value becomes genB
    ③ Take the larger of genA and genB as the final gen; the segments file to open is then named segments_[gen]
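    The two-source generation lookup above can be sketched in plain Java. The generation suffix of a segments_N file name is a long encoded in base 36 (Lucene writes it with Long.toString(gen, Character.MAX_RADIX)); the helper names below are ours for illustration, not Lucene's:

```java
public class GenResolution {
    // Parse the generation out of a "segments_N" file name (N is base-36).
    static long generationFromName(String name) {
        if (name.equals("segments")) return 0;  // legacy pre-generation name
        return Long.parseLong(name.substring("segments_".length()), Character.MAX_RADIX);
    }

    // Method 1 (genA): scan the directory listing for the largest generation.
    static long lastCommitGeneration(String[] files) {
        long max = -1;
        for (String f : files) {
            if (f.startsWith("segments") && !f.equals("segments.gen")) {
                max = Math.max(max, generationFromName(f));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        String[] files = {"_0.cfs", "segments_1", "segments_a", "segments.gen"};
        long genA = lastCommitGeneration(files); // 10, since "a" is 10 in base 36
        long genB = 10;                          // from segments.gen when gen0 == gen1
        long gen = Math.max(genA, genB);
        System.out.println("segments_" + Long.toString(gen, Character.MAX_RADIX));
    }
}
```

    With the file list above, both methods agree on generation 10, so the resolved file name is segments_a.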

    public final class SegmentInfos implements Cloneable, Iterable<SegmentCommitInfo> {
      public abstract static class FindSegmentsFile {
        public Object run(IndexCommit commit) throws IOException {
          if (commit != null) {
            if (directory != commit.getDirectory())
              throw new IOException("the specified commit does not match the specified Directory");
            return doBody(commit.getSegmentsFileName());
          }
    
          String segmentFileName = null;
          long lastGen = -1;
          long gen = 0;
          int retryCount = 0;  // incremented on retries (handling elided here)
    
          boolean useFirstMethod = true;
    
          while (true) {
            if (useFirstMethod) {
              // method 1: list the directory and also consult segments.gen
              String[] files = directory.listAll();
    
              long genA = -1;
              if (files != null) {
                genA = getLastCommitGeneration(files);
              }
    
              long genB = -1;
              ChecksumIndexInput genInput = null;
              try {
                genInput = directory.openChecksumInput(IndexFileNames.SEGMENTS_GEN, IOContext.READONCE);
              } catch (IOException e) {
                // segments.gen may not exist; genB stays -1
              }
    
              if (genInput != null) {
                try {
                  int version = genInput.readInt();
                  if (version == FORMAT_SEGMENTS_GEN_47 || version == FORMAT_SEGMENTS_GEN_CHECKSUM) {
                    long gen0 = genInput.readLong();
                    long gen1 = genInput.readLong();
                    if (gen0 == gen1) {
                      // The file is consistent.
                      genB = gen0;
                    }
                  } else {
                    throw new IndexFormatTooNewException(genInput, version, FORMAT_SEGMENTS_GEN_START, FORMAT_SEGMENTS_GEN_CURRENT);
                  }
                } catch (IOException err2) {
                  // will retry
                } finally {
                  genInput.close();
                }
              }
    
              gen = Math.max(genA, genB);
            }
    
            if (useFirstMethod && lastGen == gen && retryCount >= 2) {
              // third attempt with the same gen: fall back to the second method
              // (probing segments_N files directly, elided here)
              useFirstMethod = false;
            }
            lastGen = gen;
    
            segmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS,
                                                                    "",
                                                                    gen);
            try {
              Object v = doBody(segmentFileName);
              if (infoStream != null) {
                message("success on " + segmentFileName);
              }
              return v;
            } catch (IOException err) {
              // ...
            }
          }
        }
        protected abstract Object doBody(String segmentFileName) throws IOException;
      }
    }
    

    3.2 Opening the index files

    The doBody() method of SegmentInfos.FindSegmentsFile is where the individual index files are read:

    final class StandardDirectoryReader extends DirectoryReader {
      static DirectoryReader open(final Directory directory, final IndexCommit commit,
                              final int termInfosIndexDivisor) throws IOException {
        return (DirectoryReader) new SegmentInfos.FindSegmentsFile(directory) {
          @Override
          protected Object doBody(String segmentFileName) throws IOException {
            SegmentInfos sis = new SegmentInfos();
            sis.read(directory, segmentFileName);
            final SegmentReader[] readers = new SegmentReader[sis.size()];
            boolean success = false;
            try {
              for (int i = sis.size()-1; i >= 0; i--) {
                readers[i] = new SegmentReader(sis.info(i), termInfosIndexDivisor, IOContext.READ);
              }
    
              // This may throw IllegalArgumentException if there are too many docs, so
              // it must be inside try clause so we close readers in that case:
              DirectoryReader reader = new StandardDirectoryReader(directory, readers, null, sis, termInfosIndexDivisor, false);
              success = true;
    
              return reader;
            } finally {
              if (success == false) {
                IOUtils.closeWhileHandlingException(readers);
              }
            }
          }
        }.run(commit);
      }
    }
    

    First a SegmentInfos object is created, then sis.read(directory, segmentFileName) is called to read the segments file.
    The segments file format is:

    Header Version NameCounter SegCount <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles>^SegCount CommitUserData Footer

    Then each segment's metadata in the SegmentInfos is iterated over, and the SegmentReader constructor is invoked to read that segment's index files:

    public final class SegmentReader extends AtomicReader implements Accountable {
      public SegmentReader(SegmentCommitInfo si, int termInfosIndexDivisor, IOContext context) throws IOException {
        this.si = si;
        // read the field infos (via the .cfs compound file if the segment uses one)
        fieldInfos = readFieldInfos(si);
        // read the .tip, .tim, .nvd, .nvm, .fdt, .fdx, .tvf, .tvd and .tvx files
        core = new SegmentCoreReaders(this, si.info.dir, si, context, termInfosIndexDivisor);
        segDocValues = new SegmentDocValues();
        
        boolean success = false;
        final Codec codec = si.info.getCodec();
        try {
          if (si.hasDeletions()) {
            // read the .del file
            liveDocs = codec.liveDocsFormat().readLiveDocs(directory(), si, IOContext.READONCE);
          } else {
            assert si.getDelCount() == 0;
            liveDocs = null;
          }
          numDocs = si.info.getDocCount() - si.getDelCount();
    
          if (fieldInfos.hasDocValues()) {
            initDocValuesProducers(codec);
          }
    
          success = true;
        } finally {
          if (!success) {
            doClose();
          }
        }
      }
    }
    

    ① The readFieldInfos() method reads the .cfs file, a "virtual" file used to access the compound stream:

    public final class SegmentReader extends AtomicReader implements Accountable {
      static FieldInfos readFieldInfos(SegmentCommitInfo info) throws IOException {
        final Directory dir;
        final boolean closeDir;
        if (info.getFieldInfosGen() == -1 && info.info.getUseCompoundFile()) {
          // no fieldInfos gen and segment uses a compound file
          dir = new CompoundFileDirectory(info.info.dir,
              IndexFileNames.segmentFileName(info.info.name, "", IndexFileNames.COMPOUND_FILE_EXTENSION),
              IOContext.READONCE,
              false);
          closeDir = true;
        } else {
          // gen'd FIS are read outside CFS, or the segment doesn't use a compound file
          dir = info.info.dir;
          closeDir = false;
        }
        
        try {
          final String segmentSuffix = info.getFieldInfosGen() == -1 ? "" : Long.toString(info.getFieldInfosGen(), Character.MAX_RADIX);
          Codec codec = info.info.getCodec();
          FieldInfosFormat fisFormat = codec.fieldInfosFormat();
          return fisFormat.getFieldInfosReader().read(dir, info.info.name, segmentSuffix, IOContext.READONCE);
        } finally {
          if (closeDir) {
            dir.close();
          }
        }
      }
    }
    

    ② Constructing the SegmentCoreReaders object mainly reads the .tip, .tim, .nvd, .nvm, .fdt, .fdx, .tvf, .tvd and .tvx files:

    final class SegmentCoreReaders implements Accountable {
      SegmentCoreReaders(SegmentReader owner, Directory dir, SegmentCommitInfo si, IOContext context, int termsIndexDivisor) throws IOException {
    
        if (termsIndexDivisor == 0) {
          throw new IllegalArgumentException("indexDivisor must be < 0 (don't load terms index) or greater than 0 (got 0)");
        }
        
        final Codec codec = si.info.getCodec();
        final Directory cfsDir; // confusing name: if (cfs) its the cfsdir, otherwise its the segment's directory.
    
        boolean success = false;
        
        try {
          if (si.info.getUseCompoundFile()) {
            // open the .cfs compound file
            cfsDir = cfsReader = new CompoundFileDirectory(dir, IndexFileNames.segmentFileName(si.info.name, "", IndexFileNames.COMPOUND_FILE_EXTENSION), context, false);
          } else {
            cfsReader = null;
            cfsDir = dir;
          }
    
          final FieldInfos fieldInfos = owner.fieldInfos;
          
          this.termsIndexDivisor = termsIndexDivisor;
          final PostingsFormat format = codec.postingsFormat();
          final SegmentReadState segmentReadState = new SegmentReadState(cfsDir, si.info, fieldInfos, context, termsIndexDivisor);
          // read the .tip and .tim files
          fields = format.fieldsProducer(segmentReadState);
          assert fields != null;
    
          if (fieldInfos.hasNorms()) {
            // read the .nvd and .nvm files
            normsProducer = codec.normsFormat().normsProducer(segmentReadState);
            assert normsProducer != null;
          } else {
            normsProducer = null;
          }
          // read the .fdx and .fdt files
          fieldsReaderOrig = si.info.getCodec().storedFieldsFormat().fieldsReader(cfsDir, si.info, fieldInfos, context);
          if (fieldInfos.hasVectors()) { 
            // read the .tvf, .tvd and .tvx files
            termVectorsReaderOrig = si.info.getCodec().termVectorsFormat().vectorsReader(cfsDir, si.info, fieldInfos, context);
          } else {
            termVectorsReaderOrig = null;
          }
    
          success = true;
        } finally {
          if (!success) {
            decRef();
          }
        }
      }
    }
    

    The .tim (term dictionary) file format:

    Header PostingsHeader NodeBlock^NumBlocks FieldSummary DirOffset Footer

    The .tip (terms index) file format:

    Header FSTIndex^NumFields <IndexStartFP>^NumFields DirOffset Footer

    public class BlockTreeTermsReader extends FieldsProducer {
      public BlockTreeTermsReader(Directory dir, FieldInfos fieldInfos, SegmentInfo info,
                                  PostingsReaderBase postingsReader, IOContext ioContext,
                                  String segmentSuffix, int indexDivisor)
        throws IOException {
        
        this.postingsReader = postingsReader;
    
        this.segment = info.name;
        // open the .tim (term dictionary) file
        in = dir.openInput(IndexFileNames.segmentFileName(segment, segmentSuffix, BlockTreeTermsWriter.TERMS_EXTENSION),
                           ioContext);
    
        boolean success = false;
        IndexInput indexIn = null;
    
        try {
          version = readHeader(in);
          if (indexDivisor != -1) {
            indexIn = dir.openInput(IndexFileNames.segmentFileName(segment, segmentSuffix, BlockTreeTermsWriter.TERMS_INDEX_EXTENSION),
                                    ioContext);
            int indexVersion = readIndexHeader(indexIn);
            if (indexVersion != version) {
              throw new CorruptIndexException("mixmatched version files: " + in + "=" + version + "," + indexIn + "=" + indexVersion);
            }
          }
          
          // verify
          if (indexIn != null && version >= BlockTreeTermsWriter.VERSION_CHECKSUM) {
            CodecUtil.checksumEntireFile(indexIn);
          }
    
          // Have PostingsReader init itself
          postingsReader.init(in);
    
          // ...
      }
    }
    

    The .fdx (field index) file format:

    Header <FieldValuesPosition>^SegSize

    The .fdt (field data) file format:

    Header <DocFieldData>^SegSize

    public final class Lucene40StoredFieldsReader extends StoredFieldsReader implements Cloneable, Closeable {
      public Lucene40StoredFieldsReader(Directory d, SegmentInfo si, FieldInfos fn, IOContext context) throws IOException {
        final String segment = si.name;
        boolean success = false;
        fieldInfos = fn;
        try {
          fieldsStream = d.openInput(IndexFileNames.segmentFileName(segment, "", FIELDS_EXTENSION), context);
          final String indexStreamFN = IndexFileNames.segmentFileName(segment, "", FIELDS_INDEX_EXTENSION);
          indexStream = d.openInput(indexStreamFN, context);
          
          CodecUtil.checkHeader(indexStream, CODEC_NAME_IDX, VERSION_START, VERSION_CURRENT);
          CodecUtil.checkHeader(fieldsStream, CODEC_NAME_DAT, VERSION_START, VERSION_CURRENT);
          assert HEADER_LENGTH_DAT == fieldsStream.getFilePointer();
          assert HEADER_LENGTH_IDX == indexStream.getFilePointer();
          final long indexSize = indexStream.length() - HEADER_LENGTH_IDX;
          this.size = (int) (indexSize >> 3);
          // Verify two sources of "maxDoc" agree:
          if (this.size != si.getDocCount()) {
            throw new CorruptIndexException("doc counts differ for segment " + segment + ": fieldsReader shows " + this.size + " but segmentInfo shows " + si.getDocCount());
          }
          numTotalDocs = (int) (indexSize >> 3);
          success = true;
        } finally {
          if (!success) {
            try {
              close();
            } catch (Throwable t) {} // ensure we throw our original exception
          }
        }
      }
    }
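
    The size computation above relies on the .fdx layout: after the header, each document contributes exactly one 8-byte long (its pointer into .fdt), so the document count is the remaining byte count divided by 8, which the constructor writes as indexSize >> 3. A standalone sketch of that arithmetic (the header length used below is illustrative, not Lucene's actual constant):

```java
public class FdxSize {
    // Each .fdx entry after the header is one 8-byte long pointing into
    // .fdt, so: docCount == (fileLength - headerLength) / 8.
    static int docCount(long fileLength, long headerLength) {
        long indexSize = fileLength - headerLength;
        return (int) (indexSize >> 3); // shift right by 3 == divide by 8
    }

    public static void main(String[] args) {
        long header = 20;                           // illustrative header length
        long file = header + 4 * 8;                 // a segment with 4 documents
        System.out.println(docCount(file, header)); // prints 4
    }
}
```

    This is also why the constructor can cross-check this value against si.getDocCount(): the two are independent records of the same document count.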
    

    The .tvx file format:

    Header <DocumentPosition,FieldPosition>^NumDocs

    The .tvd file format:

    Header <NumFields, FieldNums, FieldPositions>^NumDocs

    The .tvf file format:

    Header <NumTerms, Flags, TermFreqs>^NumFields

    The call liveDocs = codec.liveDocsFormat().readLiveDocs(directory(), si, IOContext.READONCE); reads the .del file.
    The .del file format:

    Format Header ByteCount BitCount Bits

    public class Lucene40LiveDocsFormat extends LiveDocsFormat {
      public Bits readLiveDocs(Directory dir, SegmentCommitInfo info, IOContext context) throws IOException {
        String filename = IndexFileNames.fileNameFromGeneration(info.info.name, DELETES_EXTENSION, info.getDelGen());
        final BitVector liveDocs = new BitVector(dir, filename, context);
        if (liveDocs.length() != info.info.getDocCount()) {
          throw new CorruptIndexException("liveDocs.length()=" + liveDocs.length() + " info.docCount=" + info.info.getDocCount() + " (filename=" + filename + ")");
        }
        if (liveDocs.count() != info.info.getDocCount() - info.getDelCount()) {
          throw new CorruptIndexException("liveDocs.count()=" + liveDocs.count() + " info.docCount=" + info.info.getDocCount() + " info.getDelCount()=" + info.getDelCount() + " (filename=" + filename + ")");
        }
        return liveDocs;
      }
    }
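
    The two CorruptIndexException checks above encode a simple invariant: the live-docs bit vector carries one bit per document in the segment, and the number of set bits must equal docCount minus the deletion count. A minimal sketch, with java.util.BitSet standing in for Lucene's BitVector:

```java
import java.util.BitSet;

public class LiveDocsCheck {
    // Mirror the cardinality check in readLiveDocs(): the number of live
    // (set) bits must equal docCount - delCount.
    static boolean consistent(BitSet liveDocs, int docCount, int delCount) {
        return liveDocs.cardinality() == docCount - delCount;
    }

    public static void main(String[] args) {
        int docCount = 4, delCount = 1;
        BitSet liveDocs = new BitSet(docCount);
        liveDocs.set(0, docCount); // all docs live...
        liveDocs.clear(2);         // ...except doc 2, which was deleted
        System.out.println(consistent(liveDocs, docCount, delCount)); // prints true
    }
}
```

    Lucene's BitVector additionally records its own length, which is what the first check (liveDocs.length() vs. docCount) verifies; BitSet has no fixed length, so that part is omitted here.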
    

    4 Creating the IndexSearcher

    After the index directory has been opened, the next step is to create the IndexSearcher object:

    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    

    Creating the IndexSearcher mainly initializes the searcher's context and reader:

    public class IndexSearcher {
      public IndexSearcher(IndexReader r) {
        this(r, null);
      }
      public IndexSearcher(IndexReader r, ExecutorService executor) {
        this(r.getContext(), executor);
      }
      public IndexSearcher(IndexReaderContext context, ExecutorService executor) {
        assert context.isTopLevel: "IndexSearcher's ReaderContext must be topLevel for reader" + context.reader();
        reader = context.reader();
        this.executor = executor;
        this.readerContext = context;
        leafContexts = context.leaves();
        this.leafSlices = executor == null ? null : slices(leafContexts);
      }
    }
    


    Original link: https://www.haomeiwen.com/subject/ceyygqtx.html