IR-chapter5: Index compression

Author: woodsouthmmm | Published: 2017-04-24 19:32

advantages of index compression
less disk space is needed
more of the index fits in cache, which decreases response time
data is transferred from disk to memory faster
In this chapter, we define a posting as a docID in a postings list.

statistical properties of terms in information retrieval

The effects of preprocessing for Reuters-RCV1

Lossy compression is acceptable when the "lost" information is unlikely to be used by the search system.
The compression techniques discussed in the remainder of this chapter are lossless, that is, all information is preserved.

estimating the number of terms:

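OED: more than 600,000 words, but it omits most names of people, locations, products, etc.
Heaps' law
M = k * power(T,b), where M is the number of distinct terms (the vocabulary size) and T is the number of tokens in the collection
For T > 100,000, with b = 0.49 and k = 44, the fit on Reuters-RCV1 is excellent.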
It suggests...
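
As a rough illustration of the formula above, the following Python sketch evaluates Heaps' law with the parameters quoted in these notes (k = 44, b = 0.49). The function name `heaps_vocabulary_size` and the sample token counts are assumptions made for this example only.

```python
def heaps_vocabulary_size(num_tokens: int, k: float = 44.0, b: float = 0.49) -> float:
    """Estimate the number of distinct terms M = k * T**b (Heaps' law)."""
    return k * num_tokens ** b


# Hypothetical collection sizes, chosen only to show how slowly M grows with T.
for tokens in (100_000, 1_000_000, 10_000_000):
    estimate = heaps_vocabulary_size(tokens)
    print(f"T = {tokens:>10,}  ->  M ~ {estimate:,.0f}")
```

Because b is close to 0.5, multiplying the number of tokens by 100 multiplies the estimated vocabulary size by roughly 10.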

modeling the distribution of terms

Zipf's law
the collection frequency of the i-th most common term: cf_i = c * power(i,k), with k = -1
log cf_i = log c + k * log i
it suggests...
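
A small Python sketch of the relation above, assuming k = -1. The constant `c` used here is a made-up value for illustration, not a figure from the chapter.

```python
import math


def zipf_cf(rank: int, c: float, k: float = -1.0) -> float:
    """Collection frequency predicted by Zipf's law: cf_i = c * i**k."""
    return c * rank ** k


c = 60_000.0  # hypothetical frequency of the most common term
for rank in (1, 2, 3, 10, 100):
    cf = zipf_cf(rank, c)
    # On a log-log scale the points lie on a straight line with slope k.
    print(f"rank {rank:>3}: cf = {cf:>9.1f}, log10(cf) = {math.log10(cf):.3f}")
```

With k = -1 the second most common term is predicted to occur half as often as the most common one, the third a third as often, and so on.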

dictionary compression

why is it necessary?
The more of the dictionary fits in main memory, the fewer disk seeks a lookup needs, which shortens the response time.

dictionary as a string

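the simplest approach: store the dictionary as a fixed-width array.
Too wasteful: every term occupies the full width, however short it is.
It cannot store a term containing more than 20 characters.
the alternative: store the dictionary terms as one long string, with a term pointer marking where each term starts.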
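
The idea can be sketched in Python as follows. This is a minimal illustration, not the book's exact layout: the helper names (`build_string_dictionary`, `term_at`, `lookup`) are invented here, and the term pointers are kept as a plain list of offsets.

```python
def build_string_dictionary(terms):
    """Concatenate the sorted terms into one string and record each start offset."""
    pieces, offsets, position = [], [], 0
    for term in sorted(set(terms)):
        offsets.append(position)  # term pointer: where this term starts
        pieces.append(term)
        position += len(term)
    return "".join(pieces), offsets


def term_at(big_string, offsets, index):
    """A term ends where the next term starts (or at the end of the string)."""
    start = offsets[index]
    end = offsets[index + 1] if index + 1 < len(offsets) else len(big_string)
    return big_string[start:end]


def lookup(big_string, offsets, term):
    """Binary search over the term pointers; returns the term's index or -1."""
    lo, hi = 0, len(offsets) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        candidate = term_at(big_string, offsets, mid)
        if candidate == term:
            return mid
        if candidate < term:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


big_string, offsets = build_string_dictionary(
    ["automation", "automata", "automatic", "automate"]
)
print(big_string, offsets, lookup(big_string, offsets, "automatic"))
```

No term is padded to a fixed width, and terms of any length can be stored; the price is one pointer per term.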

blocked storage

Group the terms in the string into blocks of size k, keep one term pointer per block (pointing to the block's first term), and store each term's length in an extra byte in front of the term.

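trade-off between compression and the speed of term lookup: a larger k saves more pointers, but lookup gets slower because the terms inside a block must be scanned linearly.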
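
A hedged Python sketch of blocked storage, assuming UTF-8 encoded terms of at most 255 bytes so that one length byte suffices. The function names and the sample terms are assumptions for this example.

```python
def build_blocked_dictionary(terms, k=4):
    """Store <length byte><term> entries in one byte string, one pointer per k terms."""
    data, block_pointers = bytearray(), []
    for i, term in enumerate(sorted({t.encode("utf-8") for t in terms})):
        if len(term) > 255:
            raise ValueError("term does not fit a one-byte length field")
        if i % k == 0:
            block_pointers.append(len(data))  # pointer to the block's first term
        data.append(len(term))                # length byte
        data.extend(term)
    return bytes(data), block_pointers


def read_block(data, start, end):
    """Decode every term stored between two byte offsets."""
    terms, pos = [], start
    while pos < end:
        length = data[pos]
        terms.append(data[pos + 1:pos + 1 + length])
        pos += 1 + length
    return terms


def contains(data, block_pointers, term):
    """Binary search on the first term of each block, then a linear scan in the block."""
    key, lo, hi, block = term.encode("utf-8"), 0, len(block_pointers) - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        start = block_pointers[mid]
        first = data[start + 1:start + 1 + data[start]]
        if first <= key:
            block, lo = mid, mid + 1
        else:
            hi = mid - 1
    if block < 0:
        return False
    start = block_pointers[block]
    end = block_pointers[block + 1] if block + 1 < len(block_pointers) else len(data)
    return key in read_block(data, start, end)


sample = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin", "szomo"]
data, block_pointers = build_blocked_dictionary(sample, k=4)
print(block_pointers, contains(data, block_pointers, "syzygy"))
```

With k = 4 the seven sample terms need only two pointers instead of seven, at the cost of one length byte per term and a short scan inside the selected block.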

term lookup time

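front coding: consecutive terms in the sorted dictionary often share a long prefix, so the common prefix is stored once and each term keeps only its differing suffix.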
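
A simplified Python sketch of the idea, assuming one common prefix per block. It returns the prefix and the suffixes as separate values instead of packing them into the dictionary string with length bytes and marker characters; the function names are invented for this example.

```python
import os


def front_encode_block(terms):
    """Split a block of sorted terms into its common prefix and the remaining suffixes."""
    prefix = os.path.commonprefix(terms)  # longest prefix shared by all terms
    suffixes = [term[len(prefix):] for term in terms]
    return prefix, suffixes


def front_decode_block(prefix, suffixes):
    """Rebuild the original terms from the shared prefix and the suffixes."""
    return [prefix + suffix for suffix in suffixes]


block = ["automata", "automate", "automatic", "automation"]
prefix, suffixes = front_encode_block(block)
print(prefix, suffixes)

stored = len(prefix) + sum(len(s) for s in suffixes)
original = sum(len(t) for t in block)
print(f"{original} characters reduced to {stored}")
```

The saving depends on how similar the terms in a block are; a block whose terms share no prefix gains nothing.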

hash function:
unsuitable for a dynamic environment, since every new term will create a collision.

dictionary compression with different data structure

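postings file compression

variable encoding
store the gaps between consecutive docIDs instead of the docIDs themselves: for frequent terms the gaps are small, for rare terms they are almost as large as the docIDs, so the code length has to vary with the gap
two methods
bytewise and bitwise (both encode the gaps)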

Variable byte encoding

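VB encoding: each byte carries 7 bits of the gap; the first bit is a continuation bit, set to 1 on the last byte of an encoded gap and 0 otherwise.

pseudocode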

larger units (for example 16-bit or 32-bit words): less effective compression, less bit manipulation.
smaller units (for example 4-bit nibbles): more effective compression, more bit manipulation.
trade-off between compression ratio and decompression time.
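
A minimal Python sketch of VB encoding applied to gaps, assuming 8-bit units with 7 payload bits and a continuation bit that is set on the last byte of each number. The helper names (`to_gaps`, `from_gaps`, `vb_encode`, `vb_decode`) and the sample docIDs are choices made for this example.

```python
def to_gaps(doc_ids):
    """Turn a sorted postings list into the first docID followed by gaps."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: running sums restore the docIDs."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vb_encode_number(n):
    """Encode one non-negative integer as a list of byte values."""
    if n < 0:
        raise ValueError("VB encoding is defined for non-negative integers")
    chunks = []
    while True:
        chunks.insert(0, n % 128)  # prepend the next 7-bit group
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128  # set the continuation bit on the last byte
    return chunks


def vb_encode(numbers):
    encoded = bytearray()
    for n in numbers:
        encoded.extend(vb_encode_number(n))
    return bytes(encoded)


def vb_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:            # continuation bit 0: more bytes follow
            n = 128 * n + byte
        else:                     # continuation bit 1: last byte of this number
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers


doc_ids = [824, 829, 215406]
encoded = vb_encode(to_gaps(doc_ids))
print(to_gaps(doc_ids), encoded.hex(" "), from_gaps(vb_decode(encoded)))
```

Changing the constant 128 (and the unit width) is where the unit-size trade-off above would show up: wider units mean fewer shifts per number but more wasted bits for small gaps.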

γ Codes
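
A γ code represents a gap G >= 1 as a length part followed by an offset part: the offset is G in binary with the leading 1 removed, and the length is the number of offset bits written in unary (that many 1s, then a 0). The Python sketch below works on strings of "0" and "1" for readability, which is an assumption of this example; a real index would pack the bits.

```python
def gamma_encode(n):
    """Return the gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    offset = bin(n)[3:]                 # binary digits without '0b' and the leading 1
    length = "1" * len(offset) + "0"    # unary code for the number of offset bits
    return length + offset


def gamma_decode(bits):
    """Decode a concatenation of gamma codes (assumes well-formed input)."""
    numbers, pos = [], 0
    while pos < len(bits):
        length = 0
        while bits[pos] == "1":         # read the unary length
            length += 1
            pos += 1
        pos += 1                        # skip the terminating 0
        offset = bits[pos:pos + length]
        pos += length
        numbers.append(int("1" + offset, 2))  # put the leading 1 back
    return numbers


gaps = [13, 1, 9]
stream = "".join(gamma_encode(g) for g in gaps)
print(stream, gamma_decode(stream))
```

The code for G takes 2 * floor(log2 G) + 1 bits, and no code is a prefix of another, so the stream can be decoded without delimiters.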

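universal

E(L): the expected length of a code L; H(P): the entropy of the gap distribution P
γ codes are universal: E(L) stays within a constant factor of the optimum H(P). For 1 < H(P) < infinity, E(L) / H(P) <= 2 + 1/H(P) <= 3.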
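
To make E(L) and H(P) concrete, the Python sketch below compares the expected γ code length with the entropy for one made-up gap distribution. The distribution (P(G) proportional to 1/G for gaps 1 to 1000) is purely an assumption for illustration; the bound E(L) <= 2 * H(P) + 1 checked here relies on P being non-increasing in G.

```python
import math


def gamma_length(gap: int) -> int:
    """Length in bits of the gamma code for gap >= 1: 2 * floor(log2(gap)) + 1."""
    return 2 * (gap.bit_length() - 1) + 1


def entropy_bits(probabilities):
    """Shannon entropy H(P) in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def expected_gamma_length(probabilities):
    """E(L) when probabilities[i] is the probability of gap i + 1."""
    return sum(p * gamma_length(gap) for gap, p in enumerate(probabilities, start=1))


max_gap = 1000  # hypothetical largest gap
weights = [1.0 / gap for gap in range(1, max_gap + 1)]
total = sum(weights)
probabilities = [w / total for w in weights]

h = entropy_bits(probabilities)
e = expected_gamma_length(probabilities)
print(f"H(P) = {h:.2f} bits, E(L) = {e:.2f} bits, ratio = {e / h:.2f}")
```

The ratio E(L) / H(P) printed here shows how far the γ code is from the optimum for this particular distribution; other distributions give other ratios.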

prefix free, parameter free
how much compression does it achieve?
