美文网首页
IR-chapter1:Boolean retrieval

IR-chapter1:Boolean retrieval

作者: woodsouthmmm | 来源:发表于2017-04-17 20:24 被阅读0次

    Information retrieval

    meaning

    Information retrieval (IR) is finding material (usually documents) of an
    unstructured nature (usually text) that satisfies an information need from
    within large collections (usually stored on computers).

    keywords: unstructured, large scale - provides a more natural and acceptable way of human-machine interaction compared with daunting database-style searching, also gives more challenge to data organization and query processing.(while In fact, no data is truly unstructured)

    IR also covers supporting users in browsing or filtering document
    collections or further processing a set of retrieved documents

    scale

    • web search
      billions of documents stored on millions of computers
      gather documents to indexed
      build efficient system
      exploit hypertext
      protect from being boosted
    • personal information retrieval
      spotlight, instant search
      email program, search and classification
    • enterprise, institutional, and domain-specific search

    An example information retrieval problem

    Shakespeare's collected works, containing the words Brutus and Caesar and not Calpurnia.

    grep

    (How about requiring lager data, more flexible query, ranked retrieval more quickly)

    incidence matrix

    incidence matrix for Shakespeare' collections query processing

    extremely sparse

    terminology

    • boolean retrieval model
      a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms.
    • term
      the smallest unit we treat as the element of the set
    • document
      units we have decided to build a retrieval system over
    • collection/corpus
      the group of documents
    • information need
      the topic about which the user desires to know more.
    • query
      what the user convey to the computer.
    • relevant
      a document is relevant if it is the one that the user perceives as containing information of value with respect to their personal informational need.
    • effectiveness
      the quality of its search results
    ll type of true and false
    • pricision
      TP/(TP+FP)
    • recall
      TP/(TP+FN)

    inverted index/inverted file/index

    part of inverted index for Shakespeare's collections
    • vocabulary/lexicon
      the set of terms
    • dictionary
      the data structure of the items
    • posting
      each item in the list
    • posting list
    • postings
      all posting lists

    a first take at building an inverted index

    1. collect documents to be indexed
    2. tokenize the text, turning each document into a list of tokens
    3. do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
    4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
    4th step
    • storage
      memory - disk(a linked list of fixed length arrays for each term)

    processing boolean queries

    • simple conjunctive query
    merge algorithm
    • query optimization
      process in increasing order of term frequency
    Algorithm for conjunctive queries
    • asymmetric
    • difference is large

    The extended Boolean model versus ranked retrieval

    • ranked retrieval model
      such as the vector space model, in which users freely use free text queries
    • the extended Boolean model
      proximity operator: specify that two terms in a query must occur close to each other in a document

    相关文章

      网友评论

          本文标题:IR-chapter1:Boolean retrieval

          本文链接:https://www.haomeiwen.com/subject/smztzttx.html