BERT as language model
- First, BERT is a masked language model: it predicts a masked position from the words on both sides, so it does not follow the left-to-right chain rule of a standard language model. Even so, we can mask the words one at a time, score the sentence with each word masked out, and sum these scores (exponentiating the average) to obtain a pseudo-perplexity for the sentence.
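Written out (notation mine, not from the original post), the quantity the code below computes is roughly the pseudo-perplexity, i.e. the exponential of the average masked-token negative log-likelihood:

$$
\mathrm{PPL}_{\mathrm{pseudo}}(w_1,\dots,w_N)=\exp\Big(-\frac{1}{N}\sum_{i=1}^{N}\log P\big(w_i \mid w_1,\dots,w_{i-1},w_{i+1},\dots,w_N\big)\Big)
$$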
import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

# Load pre-trained model (weights)
with torch.no_grad():
    model = BertForMaskedLM.from_pretrained('bert-large-cased')
    model.eval()
    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = BertTokenizer.from_pretrained('bert-large-cased')

def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sentence_loss = 0.
    for i, word in enumerate(tokenize_input):
        # mask one token at a time and score it from the bidirectional context
        tokenize_input[i] = '[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
        word_loss = model(mask_input, masked_lm_labels=tensor_input).data.numpy()
        sentence_loss += word_loss
        # print("Word: %s : %f" % (word, np.exp(-word_loss)))
        tokenize_input[i] = word  # restore the original token before masking the next one
    return np.exp(sentence_loss / len(tokenize_input))
score("There is a book on the table")
88.899999
GPT as language model
Method
- Because GPT was not trained with a token that marks the beginning of a sentence, it cannot compute a perplexity for a single word.
import math
import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel

# Load pre-trained model (weights)
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model.eval()
# Load pre-trained model tokenizer (vocabulary)
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

def score(sentence):
    # perplexity = exp(average next-token cross-entropy over the sentence)
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    loss = model(tensor_input, lm_labels=tensor_input)
    return math.exp(loss.item())
a = ['there is a book on the desk',
     'there is a plane on the desk',
     'there is a book in the desk']
print([score(i) for i in a])
21.31652459381952, 61.45907380241148, 26.24923942649312
transformer-xl
def encode_text_batch(self, sentences, ordered=False, verbose=False, add_eos=True,
                      add_double_eos=False):
    # variant of Vocab.encode_file that encodes an in-memory list of sentences
    encoded = []
    for idx, line in enumerate(sentences):
        if verbose and idx > 0 and idx % 500000 == 0:
            print('    line {}'.format(idx))
        symbols = self.tokenize(line, add_eos=add_eos,
                                add_double_eos=add_double_eos)
        encoded.append(self.convert_to_tensor(symbols))
    if ordered:
        encoded = torch.cat(encoded)
    return encoded
batch_sentences = ["this is a test", "this is a test", "this is a test"]
encoded_text_batch = corpus.vocab.encode_text_batch(batch_sentences, ordered=False, add_double_eos=True)
tmp_iter = LMShuffledIterator(encoded_text_batch, 1, 5, device=device)  # bsz=1, bptt=5
evaluate(tmp_iter)
>> 1, ppl, loss : 16906.905848100676 48.67738723754883
2, ppl, loss : 16927.99263942421 48.68361949920654
3, ppl, loss : 16954.343652297874 48.691396713256836
- As shown above, the author modified corpus.vocab.encode_file to encode the input sentences directly instead of reading them from a file, and asked why such high perplexities are observed ("Note: I modified the corpus.vocab.encode_file to encode the input sentence instead of reading from file. Any particular reason why this is observed?").
- To compute the perplexity of each individual sentence in a batch, two things need to change: set mem_len=0, and use a different data iterator.
# the model is assumed to have been built with mem_len=0, as noted above
batch_sentences = ["this is a test", "this is a test", "this is a test"]
encoded_text_batch = corpus.vocab.encode_text_batch(batch_sentences, ordered=False, add_double_eos=True)
for sent in encoded_text_batch:
    inp = sent[:-1]   # input tokens
    tgt = sent[1:]    # next-token targets
    loss, = model(inp, tgt)
    ppl = math.exp(loss.mean().item())
Statistical n-gram approach: KenLM
- KenLM is an n-gram model with Kneser-Ney smoothing. The drawback is that you have to train one yourself, e.g. on the COCA corpus (about 1.7 million); the advantage is that scoring is fast.
- Training: first clean all of the data, then build the model with lmplz:
cat text/*.txt | python coca/clean.py > text/coca_fulltext.clean.txt
mosesdecoder/bin/lmplz -o 3 < text/coca_fulltext.clean.txt > text/coca_fulltext.clean.lm.arpa
echo "I am a boy ." | mosesdecoder/bin/query text/coca_fulltext.clean.lm.arpa
I=486 2 -1.7037368
am=4760 3 -1.4910358
a=27 3 -1.1888235
boy=10140 2 -3.2120245
.=29 3 -0.6548149
</s>=2 2 -1.335156
Total: -9.585592
OOV: 0
- -9.585592 is the log (base 10) probability of the sentence. Log probabilities are used because some sentences have probabilities so small that they would underflow; in log space they do not.
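As a quick sanity check (arithmetic mine, using the query output above), the total log10 score can be converted back into a sentence probability or a per-token perplexity:

total_log10 = -9.585592                       # "Total" from the query output above
n_tokens = 6                                  # I, am, a, boy, ., </s>

sentence_prob = 10 ** total_log10             # ~2.6e-10: tiny, hence the log representation
perplexity = 10 ** (-total_log10 / n_tokens)  # ~39.6
print(sentence_prob, perplexity)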
- Another example:
wget https://gist.githubusercontent.com/alvations/1c1b388456dc3760ffb487ce950712ac/raw/86cdf7de279a2b9bceeb3adb481e42691d12fbba/something.txt
lmplz -o 5 < something.txt > something.arpa
import kenlm
model = kenlm.Model("something.arpa")
per = model.perplexity("your text sentence")
print(per)
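The Python API can also reproduce the per-word log10 scores that the query tool printed above; a minimal sketch, assuming the same something.arpa model:

import kenlm

model = kenlm.Model("something.arpa")
sentence = "your text sentence"

# total log10 probability of the sentence (with <s>/</s> added by default)
print(model.score(sentence))

# per-word breakdown: (log10 prob, matched n-gram order, is-OOV), one entry per word plus </s>
words = sentence.split() + ["</s>"]
for (log10_prob, ngram_len, oov), word in zip(model.full_scores(sentence), words):
    print(word, ngram_len, log10_prob, "OOV" if oov else "")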
GLTR tool by harvard nlp
- The GLTR tool by Harvard NLP can compute conditional probabilities: for "He was going home", it gives the probability of "home" given "he was going".
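That kind of conditional probability can also be read directly off a language model; below is a minimal sketch using the same pytorch_pretrained_bert GPT model as elsewhere in this post (GLTR itself is a separate web tool, and the helper next_word_prob here is mine, not part of any library):

import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel

model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model.eval()
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

def next_word_prob(context, word):
    # P(word | context): run GPT on the context and read the softmax at the last position
    context_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(context))
    word_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))
    with torch.no_grad():
        logits = model(torch.tensor([context_ids]))   # (1, seq_len, vocab_size)
    probs = torch.softmax(logits[0, -1], dim=-1)
    # only the first sub-token of `word` is scored, which is enough for a single short word
    return probs[word_ids[0]].item()

print(next_word_prob("he was going", "home"))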
Using GPT as a language model
- First, download the two files referenced in modeling_openai.py ahead of time and change their paths there to the local cache location.
- Then do the same thing in tokenization_openai.py.
import math
import torch
import time
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel
# Load pre-trained model (weights)
model_load_start_time = time.time()
print('start loading model...')
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt',cache_dir='gpt')
model.eval()
# Load pre-trained model tokenizer (vocabulary)
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt',cache_dir='gpt')
print('model load successfully! {0}'.format(time.time()-model_load_start_time))
def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    loss = model(tensor_input, lm_labels=tensor_input)
    return math.exp(loss.item())
print('get score...')
score_start_time = time.time()
a = ['there is a book on the desk',
     'there is a plane on the desk',
     'there is a book in the desk']
print([score(i) for i in a])
print('time {0}'.format(time.time()-score_start_time))