美文网首页
IR-chapter5: Index compression

IR-chapter5: Index compression

作者: woodsouthmmm | 来源:发表于2017-04-24 19:32 被阅读0次
    • the advantage
      decrease the disk space
      increase the use of cache, thus decrease response time
      decrease the transfer time from disk to memory
    • In this chapter, we define a posting as a docID in a postings list.

    statistical properties of terms in information retrieval

    The effects of preprocessing for Reuters-RCV1
    • lossy compression, when the "lost" information is unlikely to be used by the search system
    • The compression techniques we discussed in the remainder of this chapter is are lossless, that is, all information is preserved

    estimating the number of terms:

    • OED: more than 600,000 words, but ignoring most names of person, locations, products etc.
    • heap's law
      M = k * power(T,b)
      for T>100,000, b=0.49, k=44, the fit is excellent.
      It suggests...

    modeling the distribution of terms

    • Zipf's law
      the collection frequency of the ith commonest item = c * power(i,k)
      log cf = logc + k * log i
      it suggests...

    dictionary compression

    • why is it necessary?
      To decrease the number of disk seeks to shorten the response time.

    dictionary as a string

    • the simplest approach: the dictionary as a fixed-width array.
      Too wasteful!!!
      Can't store term containing more than 20 characters
    • storing the dictionary terms as one long string.
    dictionary as a string

    blocked storage

    • grouping terms in the string into blocks of size k, and keeping one term pointer for the first term in the string, adding the length byte for each term in the string.
    blocked storage
    • trade-off between the compression and the speed of term lookup.
    term lookup time
    • front coding:
    front coding
    • hash function:
      unifiable for dynamic environment, since every new term will create collision.
    dictionary compression with different data structure

    posting file compression

    • variable ecoding
      docID (rare terms) vs gap(frequent terms)
    • two methods
      bytewise, bitwise (encoding the gaps )


      variable encoding

    Variable byte encoding

    VB encoding pseudodcode
    • larger : less effective compression, less bit manipulation.
    • smaller: more effective compression, more bit manipulation.
    • trade-off between compression ratio and depression time.

    γ Codes

    γ Codes
    • universal
    E(L) - the expected length of a code L, H(P)-entropy
    • prefix free, parameter free
    • how much compression does it achieve?

    相关文章

      网友评论

          本文标题:IR-chapter5: Index compression

          本文链接:https://www.haomeiwen.com/subject/nyrszttx.html