RETRO

Author: MatrixOnEarth | Published 2022-07-14 11:18


Yao Weifeng (Matrix Yao)

Info Card

Basic Idea

RETRO is a neural language model.
Compared with existing language models like GPT, it separates memorization from generalization: it memorizes world knowledge with Retrieval, while learning language structure with the Model.

General auto-regressive language model:
L(X|\theta) \triangleq \sum_{i=1}^{n}l_{\theta}(x_i|(x_j)_{j<i})
RETRO's chunked, retrieval-enhanced model:
L(X|\theta, \mathcal D) \triangleq \sum_{u=1}^{l} \sum_{i=1}^{m}l_{\theta}(x_{(u-1)m+i}|(x_j)_{j<(u-1)m+i}, (RET_{\mathcal D}(C_{u'}))_{u'<u})
Here the n-token sequence X is split into l chunks C_1, ..., C_l of length m (n = lm), and RET_{\mathcal D}(C_{u'}) denotes the neighbors retrieved from database \mathcal D for chunk C_{u'}. Note the condition u' < u: a chunk is never conditioned on its own retrieval, which preserves causality.
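
To make the chunked factorization concrete, here is a minimal Python sketch using the paper's default sizes (n = 2048 tokens, m = 64 tokens per chunk, so l = 32); the chunk() helper is an illustrative stand-in, not the paper's code.

```python
# Illustrative sketch of RETRO's chunked factorization; chunk() is a
# hypothetical helper, the sizes match the paper's defaults.
def chunk(tokens, m=64):
    """Split an n-token sequence into l = n // m chunks of size m."""
    return [tokens[i:i + m] for i in range(0, len(tokens), m)]

tokens = list(range(2048))     # n = 2048 tokens
chunks = chunk(tokens)         # l = 32 chunks: C_1 ... C_l

# Predicting a token in chunk u conditions on all previous tokens plus the
# neighbors retrieved for chunks u' < u -- never on chunk u's own retrieval.
for u in range(len(chunks)):
    retrievable = chunks[:u]   # chunks whose RET(.) results are visible
```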

LM Before and After
  • Any benefits?
    Democratization \to fast/cheap and good
    • Fewer parameters: 25x fewer parameters lead to much lower computation requirements for training and serving;
    • SOTA accuracy: better perplexity on language modeling and SOTA accuracy on downstream tasks, e.g., question answering;

The diagram below, from [1], is not the whole picture of RETRO; it shows only the retrieval part.


How Does it Work

Step-1: Retrieve Nearest Neighbors and Encode them

  • Points
    • retrieve the top-k nearest neighbors at chunk granularity, rather than passage granularity as in Sentence-BERT or token granularity as in ColBERT
    • each of the top-k token sequences = concat(neighbor chunk, continuation chunk)
    • each token sequence is encoded w/ a bi-directional transformer encoder, optionally taking the self-attended query activations as k/v (see the retrieval sketch below)
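
To ground this, here is a minimal sketch of chunk-granularity retrieval; embed(), the vocabulary, and the brute-force dot-product search are stand-in assumptions (the paper uses frozen BERT embeddings with a SCaNN index over a trillion-token database):

```python
# A minimal sketch of RETRO-style chunk-granularity retrieval. embed(), the
# vocabulary size, and the brute-force search are illustrative assumptions;
# the paper uses frozen BERT chunk embeddings served by a SCaNN index.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, M = 1000, 128, 64
PROJ = rng.normal(size=(VOCAB, DIM))               # stand-in embedding table

def embed(chunk_tokens):
    """Stand-in chunk embedder (mean of token embeddings)."""
    return PROJ[np.asarray(chunk_tokens)].mean(axis=0)

# Retrieval database: N chunks; chunk i+1 is the continuation of chunk i.
db_chunks = rng.integers(0, VOCAB, size=(10_000, M))
db_embs = np.stack([embed(c) for c in db_chunks])  # (N, DIM)

def retrieve(query_chunk, k=2):
    """Top-k neighbors, each returned as concat(neighbor, continuation)."""
    scores = db_embs @ embed(query_chunk)          # brute-force dot product
    top = np.argsort(-scores)[:k]
    return [np.concatenate([db_chunks[i], db_chunks[min(i + 1, len(db_chunks) - 1)]])
            for i in top]

neighbors = retrieve(rng.integers(0, VOCAB, size=M))  # k sequences of 2*M tokens
```

Returning concat(neighbor, continuation) is what lets the decoder see not only the matched chunk but also the tokens that followed it in the source document.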

Step-2: Decode Causally

  • CCA (Chunked Cross-Attention)

  • Points
    • both self-attention and CCA are causal, which keeps the model auto-regressive (see the sketch below)
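
Below is a minimal PyTorch sketch of CCA under a simplifying assumption: tokens of chunk u cross-attend only to the encoded neighbors retrieved for chunk u-1 (the paper uses a finer-grained shifted alignment, but the causality argument is the same). Shapes and sizes are illustrative.

```python
# A minimal sketch of chunked cross-attention (CCA); the one-chunk-back
# alignment is a simplifying assumption, not the paper's exact shifting.
import torch
import torch.nn as nn

class ChunkedCrossAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4, chunk_size=64):
        super().__init__()
        self.m = chunk_size
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, h, neighbors):
        # h:         (batch, seq_len, d_model)  decoder hidden states
        # neighbors: (batch, n_chunks, k * r, d_model)  encoded neighbors per chunk
        out = h.clone()
        n_chunks = h.shape[1] // self.m
        for u in range(1, n_chunks):                # chunk 0 has no preceding retrieval
            q = h[:, u * self.m:(u + 1) * self.m]   # queries: tokens of chunk u
            kv = neighbors[:, u - 1]                # neighbors retrieved for chunk u-1
            attended, _ = self.attn(q, kv, kv)      # neighbors are wholly in the past,
            out[:, u * self.m:(u + 1) * self.m] = q + attended  # so no mask is needed
        return out

cca = ChunkedCrossAttention()
h = torch.randn(2, 256, 256)            # batch 2, 4 chunks of 64 tokens, d_model 256
nbrs = torch.randn(2, 4, 2 * 128, 256)  # k=2 neighbors of r=128 tokens per chunk
out = cca(h, nbrs)                      # same shape as h
```

In the full model, such CCA layers are interleaved with ordinary causal self-attention layers, so retrieval augments rather than replaces the decoder.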

Results

Language Model

Pretty good bits-per-byte, even at a 23x+ smaller model size.


Downstream Task: QA

Not really that good, considering the 7.5B model size. Accuracy is also inferior to FiD; the authors attribute this to the encoder carrying too little of the model's weight.


Application to the ODQA Domain

Pipeline Comparison

  • dense retriever + neural ranker
    E.g.,

    • Single Retrieval Encoder: SentenceEmbedding Retriever + ColBERT Ranker

    • Dual Retrieval Encoder: DPR Retriever + ColBERT Ranker

  • RETRO

We can see that RETRO fits easily into a dense retriever + neural ranker ODQA pipeline. It can be viewed as a single-encoder dense retriever + neural ranker, where the ranker is compute-heavier than ColBERT, both because of the model size and because the ranker's document encoder cannot be pre-computed (see the sketch below).
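
To make the mapping explicit, here is a minimal sketch of the generic dense retriever + neural ranker + reader flow; retrieve/rank/read are hypothetical stand-ins showing data flow only, not any specific library's API:

```python
# Hypothetical sketch of a dense retriever + neural ranker ODQA pipeline;
# every component here is a stand-in to show data flow, not a real API.
from typing import Callable, List

def odqa_pipeline(question: str,
                  retrieve: Callable[[str, int], List[str]],
                  rank: Callable[[str, List[str]], List[str]],
                  read: Callable[[str, List[str]], str],
                  k: int = 100, top: int = 5) -> str:
    candidates = retrieve(question, k)        # cheap dense retrieval (pre-computed index)
    best = rank(question, candidates)[:top]   # compute-heavy neural re-ranking
    return read(question, best)               # answer generation / extraction
```

In the RETRO view, retrieve() is the frozen chunk embedder plus index, while rank() and read() fold into the decoder: neighbor encoding and CCA act as an implicit ranker, which is exactly why that "ranker's" document encoder cannot be pre-computed.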

To put RETRO on the map of ODQA paradigms:

References

  1. RETRO Is Blazingly Fast
  2. The Illustrated Retrieval Transformer
