美文网首页
IR-chapter1:Boolean retrieval

IR-chapter1:Boolean retrieval

作者: woodsouthmmm | 来源:发表于2017-04-17 20:24 被阅读0次

Information retrieval

meaning

Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).

keywords: unstructured, large scale - provides a more natural and acceptable way of human-machine interaction compared with daunting database-style searching, also gives more challenge to data organization and query processing.(while In fact, no data is truly unstructured)

IR also covers supporting users in browsing or filtering document
collections or further processing a set of retrieved documents

scale

  • web search
    billions of documents stored on millions of computers
    gather documents to indexed
    build efficient system
    exploit hypertext
    protect from being boosted
  • personal information retrieval
    spotlight, instant search
    email program, search and classification
  • enterprise, institutional, and domain-specific search

An example information retrieval problem

Shakespeare's collected works, containing the words Brutus and Caesar and not Calpurnia.

grep

(How about requiring lager data, more flexible query, ranked retrieval more quickly)

incidence matrix

incidence matrix for Shakespeare' collections query processing

extremely sparse

terminology

  • boolean retrieval model
    a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms.
  • term
    the smallest unit we treat as the element of the set
  • document
    units we have decided to build a retrieval system over
  • collection/corpus
    the group of documents
  • information need
    the topic about which the user desires to know more.
  • query
    what the user convey to the computer.
  • relevant
    a document is relevant if it is the one that the user perceives as containing information of value with respect to their personal informational need.
  • effectiveness
    the quality of its search results
ll type of true and false
  • pricision
    TP/(TP+FP)
  • recall
    TP/(TP+FN)

inverted index/inverted file/index

part of inverted index for Shakespeare's collections
  • vocabulary/lexicon
    the set of terms
  • dictionary
    the data structure of the items
  • posting
    each item in the list
  • posting list
  • postings
    all posting lists

a first take at building an inverted index

  1. collect documents to be indexed
  2. tokenize the text, turning each document into a list of tokens
  3. do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
  4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
4th step
  • storage
    memory - disk(a linked list of fixed length arrays for each term)

processing boolean queries

  • simple conjunctive query
merge algorithm
  • query optimization
    process in increasing order of term frequency
Algorithm for conjunctive queries
  • asymmetric
  • difference is large

The extended Boolean model versus ranked retrieval

  • ranked retrieval model
    such as the vector space model, in which users freely use free text queries
  • the extended Boolean model
    proximity operator: specify that two terms in a query must occur close to each other in a document

相关文章

网友评论

      本文标题:IR-chapter1:Boolean retrieval

      本文链接:https://www.haomeiwen.com/subject/smztzttx.html