美文网首页
IR-chapter5: Index compression

IR-chapter5: Index compression

作者: woodsouthmmm | 来源:发表于2017-04-24 19:32 被阅读0次
  • the advantage
    decrease the disk space
    increase the use of cache, thus decrease response time
    decrease the transfer time from disk to memory
  • In this chapter, we define a posting as a docID in a postings list.

statistical properties of terms in information retrieval

The effects of preprocessing for Reuters-RCV1
  • lossy compression, when the "lost" information is unlikely to be used by the search system
  • The compression techniques we discussed in the remainder of this chapter is are lossless, that is, all information is preserved

estimating the number of terms:

  • OED: more than 600,000 words, but ignoring most names of person, locations, products etc.
  • heap's law
    M = k * power(T,b)
    for T>100,000, b=0.49, k=44, the fit is excellent.
    It suggests...

modeling the distribution of terms

  • Zipf's law
    the collection frequency of the ith commonest item = c * power(i,k)
    log cf = logc + k * log i
    it suggests...

dictionary compression

  • why is it necessary?
    To decrease the number of disk seeks to shorten the response time.

dictionary as a string

  • the simplest approach: the dictionary as a fixed-width array.
    Too wasteful!!!
    Can't store term containing more than 20 characters
  • storing the dictionary terms as one long string.
dictionary as a string

blocked storage

  • grouping terms in the string into blocks of size k, and keeping one term pointer for the first term in the string, adding the length byte for each term in the string.
blocked storage
  • trade-off between the compression and the speed of term lookup.
term lookup time
  • front coding:
front coding
  • hash function:
    unifiable for dynamic environment, since every new term will create collision.
dictionary compression with different data structure

posting file compression

  • variable ecoding
    docID (rare terms) vs gap(frequent terms)
  • two methods
    bytewise, bitwise (encoding the gaps )


    variable encoding

Variable byte encoding

VB encoding pseudodcode
  • larger : less effective compression, less bit manipulation.
  • smaller: more effective compression, more bit manipulation.
  • trade-off between compression ratio and depression time.

γ Codes

γ Codes
  • universal
E(L) - the expected length of a code L, H(P)-entropy
  • prefix free, parameter free
  • how much compression does it achieve?

相关文章

网友评论

      本文标题:IR-chapter5: Index compression

      本文链接:https://www.haomeiwen.com/subject/nyrszttx.html