美文网首页
IR Chapter1-3

IR Chapter1-3

作者: 马小马_f182 | 来源:发表于2019-05-07 16:50 被阅读0次

    chapter 1 boolean search

    布尔检索是数据库检索最基本的方法,是用逻辑“或”(+、OR)、逻辑"与"(×、AND)、逻辑"非"(-、NOT)等算符在数据库中对相关文献的定性选择的方法。

    1. 单词-文档矩阵,
    2. 索引结构是倒排索引 unit-posting,基于关键词的查找
    • 查找
    • 合并merge:按照Doc Frequency排序,从最小的数组开始合并

    chapter 2 vocabulary list and posting

    更多是关于web search,不是精确检索。在search前需要mapping。

    1.文档处理document delineation and char seguence decoding

    • index granularity索引粒度:确定document unit
      (a message / a file /a message and its contained attachment / a book or a chapter or a sentence)
      PS: 系统需要提供可选的索引granularity

    2. vocabulary

    • tokenization( segment,eliminate punctuation etc.):hyphen、space

    • match:返回内容充分满足expectations,windows->Windows、WINDOWS 各种变换的形式

    • common words(stop words,pressure for time and space):
      big list → small list → statistics
      Web search engines generally do not use stop word lists.More precise with stop words

    • Normalization (helps in queries, not in IR performance)
      standard way:create equivalence classes
      1). mapping【badcase U.S.A match USA,but C.A.T. not match cat】
      2). maintain relations
      a. index unnormalized tokens + a query expansion list
      b. index construction, index 含有token同义词的document

      3). the forms of normalization
      accents /diacritics
      capitalization/case-folding:reduce letters to lower case
      idiosyncratic issue in English:color vs colour

    • Stemming and lemmatization:reduce inflectional forms
      stemming remove derivational affixes
      lemma 还原词根

    3. Faster postings list intersection

    • skip list pointer,而不是原来的逐一遍历
      【Where do we place skips?】 There is a tradeoff.
      heuristic : ` √P evenly-spaced skip pointers.

    4. Positional postings and phrase queries

    • ▼ Biword indexes→phrase index
      single-word term index also needed
      【cons】
      large vocabulary→partial phrase index
    • ▲ Positional postings 位置信息索引
      【cons】
      expands posting size,depending on token
    • ▲+▼ combined scheme【biword index + position index】
      index what biword?
      • query behavior:index frequently queried phrase
      • individual word high frequency, phrase comparatively low

    chapter 3 Dictionaries and tolerant retrieval(容错检索)

    • Search structures :Dictionary

    • Solution :

    1. hashing
    2. search tree【demand prescribed ordering】
      Binary tree
      B-tree
      binary tree
      B-Tree
    • Wildcard queries 通配符检索

    找索引单元/term/单词,字符串查找

    query:mon*
    solution:regular B-Tree
    process: walk down the tree following the symbol m, o, n.

    query:*mon
    solution:reversed B-Tree
    process: walk down the tree following the symbol m, o, n.
    e.g. lemon root->n->o->m->e->l

    query:se*mon
    solution2:permuterm indexes
    solution1:regular B-Tree + reversed B-Tree
    process:
    search regular B-Tree, prefix se-, set A
    search reversed B-Tree, suffix -mon, set B
    intersect A ∩ B

    General wildcard queries

    1. permuterm index
      也就是索引单元包含了首、尾的信息
      cons:the dictionary becomes quite large, lead to blowup


      image.png

      e.g. m*n → n$m → man、moron
      e.g. fi*mo*er → 首 er 尾fi → candidate terms →filter by exhaustive enumeration,检查中间是否包含"mo"
      找完单词以后,进行document retrive

    2. k-gram indexes for wildcard queries
      index k characters, with ¥ denoting beginning and end.
      相比于permuterm index,数量相对少一些
      e.g. k=3,castle →¥ca, cas, ast, stl, tle, le$
      search
      e.g. re*ve → ¥re and ve¥
      filtering demanded string-matching operation
      e.g. red* → ed¥→ retired【invalid,因为query左侧没有其他符号】

    Spelling correction

    1. implementation of spelling correction

    2. 2 steps to solving the problem

    • edit distance
    • k-gram overlap
    1. implementation
    • 选择正确答案的原则
      选择最近的,proximity;
      选择最常见的(出现最多;使用最多)
    • forms
      isolated-term correction
      context-sensitive correction.
    1. 编辑距离的计算【动态规划】
      可以用来计算相似度
      编辑距离其实是编辑操作的最小数目
      edit operation:insert, replace, delete => 带权重的edit operation

    3.k-gram索引 for spelling correction
    【isolated spelling correction】
    【Jaccard coefficient】
    set A query里面的k-gram集合
    set B term里面的k-gram集合
    Jaccard coefficient = |A ∩ B| / |A ∪ B|
    流程:
    for t in term-posting list
       Jaccard( t ) > 阈值
       选择t作为候选

    t 的 k-gram如何获得?
    Jaccard = num(A&B) / num(A+B)-num(A+B)
    num(A+B)是 posting hits
    方法:
    step1、经验方法,直接选出candidate
    step2、计算编辑距离,选出具有最小edit distance的t

    【context sensitive spelling correction】
    query = w1 + w2 +.....wn
    列举出每一个词的candidate
    trim,bi-words,previous queries by users

    相关文章

      网友评论

          本文标题:IR Chapter1-3

          本文链接:https://www.haomeiwen.com/subject/fmmeoqtx.html