Scoring Sentences with a Language Model

Author: VanJordan | Published 2019-05-13 16:47

bert as language model

  • BERT is a masked language model: it predicts a masked position from the bidirectional context, so it does not factorize left-to-right by the chain rule the way a conventional language model does. Still, we can mask the tokens one at a time, score each masked position, sum the per-token losses, and exponentiate the average to get a (pseudo-)perplexity for the sentence:
import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-large-cased')
model.eval()
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-large-cased')

def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sentence_loss = 0.
    with torch.no_grad():
        for i, word in enumerate(tokenize_input):
            # Mask exactly one token per forward pass and restore it afterwards;
            # without the restore, every earlier token would stay masked too.
            tokenize_input[i] = '[MASK]'
            mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
            # masked_lm_labels scores every position, but the unmasked tokens are
            # trivially predicted from themselves, so the masked position dominates.
            word_loss = model(mask_input, masked_lm_labels=tensor_input).data.numpy()
            sentence_loss += word_loss
            tokenize_input[i] = word
            # print("Word: %s : %f" % (word, np.exp(-word_loss)))
    return np.exp(sentence_loss / len(tokenize_input))

score("There is a book on the table")
88.899999
  • bert as language model — I found someone else's implementation based on the same idea. Its biggest drawback is speed: computing the PPL of a single sentence takes about one second, which is hard to accept.

GPT as language model

Method

  • Because training used no special token to mark the start of a sentence, GPT cannot assign a perplexity to a single word: there is nothing to condition the first token on.
import math
import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel

# Load pre-trained model (weights)
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model.eval()
# Load pre-trained model tokenizer (vocabulary)
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    with torch.no_grad():
        # With lm_labels set, the model returns the mean cross-entropy loss.
        loss = model(tensor_input, lm_labels=tensor_input)
    return math.exp(loss.item())


a = ['there is a book on the desk',
     'there is a plane on the desk',
     'there is a book in the desk']
print([score(i) for i in a])
[21.31652459381952, 61.45907380241148, 26.24923942649312]

transformer-xl

    # Added to transformer-xl's Vocab: encode an in-memory list of sentences
    # instead of reading them from a file (cf. vocab.encode_file).
    def encode_text_batch(self, sentences, ordered=False, verbose=False, add_eos=True,
            add_double_eos=False):
        encoded = []
        for idx, line in enumerate(sentences):
            if verbose and idx > 0 and idx % 500000 == 0:
                print('    line {}'.format(idx))
            symbols = self.tokenize(line, add_eos=add_eos,
                add_double_eos=add_double_eos)
            encoded.append(self.convert_to_tensor(symbols))

        if ordered:
            encoded = torch.cat(encoded)

        return encoded

batch_sentences = ["this is a test", "this is a test", "this is a test"]
encoded_text_batch = corpus.vocab.encode_text_batch(batch_sentences, ordered=False, add_double_eos=True)
tmp_iter = LMShuffledIterator(encoded_text_batch, 1, 5, device=device)
evaluate(tmp_iter)
1, ppl, loss : 16906.905848100676 48.67738723754883
2, ppl, loss : 16927.99263942421 48.68361949920654
3, ppl, loss : 16954.343652297874 48.691396713256836
  • As shown above, the author modified corpus.vocab.encode_file ("Note: I modified the corpus.vocab.encode_file to encode the input sentence instead of reading from file") and asked why such abnormally large perplexities are observed ("Any particular reason why this is observed.").
  • To compute the PPL of each sentence in a batch, two things are needed: set mem_len=0, and use a different data iterator that feeds the sentences one at a time, as below.
batch_sentences = ["this is a test", "this is a test", "this is a test"]
encoded_text_batch = corpus.vocab.encode_text_batch(batch_sentences, ordered=False, add_double_eos=True)
for sent in encoded_text_batch:
    # MemTransformerLM expects (seq_len, batch) inputs; score each sentence alone.
    inp = sent[:-1].unsqueeze(1)
    tgt = sent[1:].unsqueeze(1)
    loss, = model(inp, tgt)
    ppl = math.exp(loss.mean().item())
    print(ppl)
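In the official transformer-xl PyTorch code, mem_len can be zeroed on an already-trained model via reset_length; a minimal sketch, assuming the stock MemTransformerLM API:

# Keep tgt_len/ext_len as trained, but disable memory so that
# sentences do not leak context into one another.
model.reset_length(model.tgt_len, model.ext_len, 0)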

Statistics-based n-gram approach: KenLM

Reference

  • A Kneser-Ney smoothed n-gram model. The downside is that you have to train one yourself; suppose we train on the COCA corpus (1.7M). The upside is that scoring is fast.
  • Training: first clean all of the data
cat text/*.txt | python coca/clean.py > text/coca_fulltext.clean.txt
  • Then collect the n-gram counts (this is effectively the training step; -o 3 builds a trigram model).
mosesdecoder/bin/lmplz -o 3 < text/coca_fulltext.clean.txt > text/coca_fulltext.clean.lm.arpa
  • Then query the model:
echo "I am a boy ." | mosesdecoder/bin/query text/coca_fulltext.clean.lm.arpa
  • The result (each line is word=vocab-id, matched n-gram order, and base-10 log probability):
I=486 2 -1.7037368
am=4760 3 -1.4910358
a=27 3 -1.1888235
boy=10140 2 -3.2120245
.=29 3 -0.6548149
</s>=2 2 -1.335156
Total: -9.585592
OOV: 0
  • -9.585592 is the base-10 log probability of the sentence; logs are used because the raw probabilities of some sentences are so small that they would underflow. The sketch below converts this total to a perplexity.
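A minimal sketch of the conversion, using the query output above: perplexity is 10 raised to the negated average log10 probability over all scored tokens (including </s>):

# Total log10 probability reported by query, over 6 scored tokens
# (I, am, a, boy, ., </s>).
total_log10_prob = -9.585592
n_tokens = 6

ppl = 10 ** (-total_log10_prob / n_tokens)
print(ppl)  # ~39.6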
  • Another example:
wget https://gist.githubusercontent.com/alvations/1c1b388456dc3760ffb487ce950712ac/raw/86cdf7de279a2b9bceeb3adb481e42691d12fbba/something.txt
lmplz -o 5 < something.txt > something.arpa
  • Compute the perplexity:
import kenlm

model = kenlm.Model("something.arpa")
per = model.perplexity("your text sentence")

print(per)
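The Python module can also reproduce the per-word table that the query binary printed above; a sketch using kenlm's full_scores, which yields one (log10 probability, matched n-gram length, OOV flag) tuple per token:

import kenlm

model = kenlm.Model("something.arpa")
sentence = "I am a boy ."
# full_scores includes a tuple for the implicit </s> at the end.
for word, (log10_prob, ngram_len, oov) in zip(sentence.split() + ['</s>'],
                                              model.full_scores(sentence)):
    print('{}\t{}\t{:.7f}'.format(word, ngram_len, log10_prob))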

kenlm

GLTR tool by harvard nlp

  • The GLTR tool by harvard nlp can compute conditional probabilities: for "He was going home", it gives the probability of "home" given "he was going".
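GLTR itself wraps GPT-2, but the same conditional probability can be read off the openai-gpt model used above; a minimal sketch (the helper next_word_prob is mine, not part of GLTR, and it only scores the first BPE piece of the target word):

import torch
import torch.nn.functional as F
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel

model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model.eval()
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

def next_word_prob(context, word):
    # P(word | context): feed the context through the model and read the
    # probability of `word` off the next-token distribution.
    context_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(context))
    word_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))
    with torch.no_grad():
        # Without lm_labels, the model returns logits of shape (1, seq_len, vocab).
        logits = model(torch.tensor([context_ids]))
    probs = F.softmax(logits[0, -1], dim=-1)
    return probs[word_ids[0]].item()

print(next_word_prob('he was going', 'home'))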

Using GPT as a language model

  • First, in modeling_openai.py, download the two model files in advance and point the URLs at the local cache paths.
  • Then do the same thing in tokenization_openai.py.
import math
import torch
import time
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel

# Load pre-trained model (weights)
model_load_start_time = time.time()
print('start loading model...')
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt', cache_dir='gpt')
model.eval()
# Load pre-trained model tokenizer (vocabulary)
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt', cache_dir='gpt')
print('model loaded successfully! {0}'.format(time.time() - model_load_start_time))

def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    with torch.no_grad():
        loss = model(tensor_input, lm_labels=tensor_input)
    return math.exp(loss.item())

print('get score...')
score_start_time = time.time()
a = ['there is a book on the desk',
     'there is a plane on the desk',
     'there is a book in the desk']
print([score(i) for i in a])
print('time {0}'.format(time.time()-score_start_time))
  • The runtime is acceptable: scoring the 3 sentences takes only about 0.1 seconds.
