torchtext Usage - 简书 (jianshu.com)
Drawback of one-hot encoding: the dimensionality is very large (one dimension per vocabulary word).
Hash encoding: relationships between texts cannot be represented; and since words are mapped to fixed-length hash values, hash collisions happen easily (see the sketch below).
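A minimal sketch of the hashing trick to make the collision problem concrete (this example is not from the original post; the bucket count is hypothetical and deliberately tiny):

import hashlib

NUM_BUCKETS = 8  # hypothetical, deliberately small so collisions show up quickly

def hash_bucket(word):
    # stable hash (md5), truncated to a bucket index
    return int(hashlib.md5(word.encode()).hexdigest(), 16) % NUM_BUCKETS

for w in ['i', 'love', 'you', 'me', 'movie', 'film']:
    print(w, hash_bucket(w))
# With only 8 buckets, different words frequently land in the same bucket,
# and the bucket index carries no information about how similar two words are.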
Tokenization methods
import string
string.punctuation  # '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'  the punctuation characters
s = "I love You"
for c in string.punctuation:
    s = s.replace(c, ' ').lower()  # strip punctuation and lowercase
# Tokenization 1: split into individual characters
# list(s)   # ['i', ' ', 'l', 'o', 'v', 'e', ' ', 'y', 'o', 'u']
# Tokenization 2: split on whitespace, one token per word
# s.split() # ['i', 'love', 'you']
# Tokenization 3: n-grams (see the bigram sketch below)
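A short sketch of tokenization method 3 (not in the original code): building 2-grams (bigrams) from the word-level tokens above.

tokens = s.split()                       # ['i', 'love', 'you']
bigrams = list(zip(tokens, tokens[1:]))  # [('i', 'love'), ('love', 'you')]
print(bigrams)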
Vectorization
import string
import numpy as np  # needed for np.unique and np.zeros below
import torch
import torch.nn as nn
# Vectorization: one-hot encoding
s = "I love You,you love me"
for c in string.punctuation:  # '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'  the punctuation characters
    s = s.replace(c, ' ').lower()
s.split()  # ['i', 'love', 'you', 'you', 'love', 'me']
vocab = dict((word, index) for index, word in enumerate(np.unique(s.split())))  # word -> index vocabulary
# np.unique(s.split()) alone carries no indices, so it is wrapped in enumerate
# vocab  # {'i': 0, 'love': 1, 'me': 2, 'you': 3}
s = [vocab.get(w) for w in s.split()]
# s  # [0, 1, 3, 3, 1, 2]
b = np.zeros((len(s), len(vocab)))
for index, i in enumerate(s):
    b[index, i] = 1  # in row `index`, set column i to 1
# b
# array([[1., 0., 0., 0.],
#        [0., 1., 0., 0.],
#        [0., 0., 0., 1.],
#        [0., 0., 0., 1.],
#        [0., 1., 0., 0.],
#        [0., 0., 1., 0.]])
# Word-embedding representation
# Embeddings are not only for text classification: any categorical feature can be represented
# this way, by mapping each category to a tensor.
em = torch.nn.Embedding(len(vocab), 20)  # map each word to a tensor of length 20
print(len(s))  # 6 -- each of the 6 tokens is mapped to a length-20 tensor
s_em = em(torch.LongTensor(s))
# print(s_em.shape)  # torch.Size([6, 20])
TorchText text-classification dataset
Preprocessing steps for text classification:
1. Tokenize
2. Build the vocabulary
3. Build the word-embedding representation
In a deep-learning model each word is treated like a class: just as every class gets an index starting from 0, every word in the corpus gets its own index. Building the vocabulary means creating this index encoding for each word.
Once the vocabulary exists, the words can be embedded: each word index is mapped to a tensor, and the tensors reflect the relationships between different words. This is done with a torch.nn.Embedding layer.
All word embeddings of a text are then aggregated, usually by averaging. Drawback: the order of the words (the context) is ignored. The approach is fast, but its accuracy is limited.
Use the TorchText library to preprocess the text. Load a dataset from torchtext.datasets, e.g. IMDB; tokenize with a TorchText tokenizer; build the vocabulary (word look-up table) with TorchText's vocabulary tools, which turns each text into a sequence of indices (index1 ... indexn); feed the indices into an EmbeddingBag layer, which looks up an embedding for every index and aggregates (pools) the embeddings; the pooled result goes to a linear layer, and the linear layer produces the classification output.
In deep learning, training is done on mini-batches (batches) of data.
For the movie-review example the reviews are not of equal length, which is a problem when building a batch: normally every sample in a batch must have the same length. How is this handled for text? Here the reviews do not need to be padded to equal length, because the EmbeddingBag layer is used: all texts in a batch are concatenated into one long index sequence. How does the EmbeddingBag layer know which tokens belong to which review? Along with the sequence it receives the offsets, e.g. positions 1-3 are review 1, positions 4-8 are review 2, and so on. The offsets tell the EmbeddingBag layer where each review starts, so the whole batch can be processed very efficiently as a single sequence (see the minimal offsets sketch below).
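A minimal offsets sketch (made-up indices, not from the original post), showing how an EmbeddingBag pools a batch that has been concatenated into one index sequence:

import torch
import torch.nn as nn

bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode='mean')  # pool by averaging
# two "reviews" concatenated into one index sequence: [2, 5, 1] and [7, 3]
text = torch.LongTensor([2, 5, 1, 7, 3])
offsets = torch.LongTensor([0, 3])  # review 1 starts at position 0, review 2 at position 3
out = bag(text, offsets)
print(out.shape)  # torch.Size([2, 4]) -- one pooled embedding per review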
Define the preprocessing functions.
import torch
import torchtext  # may require: pip install portalocker / pip install torchtext
from torchtext.data.utils import get_tokenizer         # tokenizer utility
from torchtext.vocab import build_vocab_from_iterator  # vocabulary-building utility
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data import DataLoader
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torch.utils import data

train_iter, test_iter = torchtext.datasets.IMDB()
train_data, test_data = list(train_iter), list(test_iter)
# all_classes = set([label for (label, text) in train_data])
tokenizer = get_tokenizer('basic_english')
tokenizer('this is a aaaa PYtorch lesson')  # ['this', 'is', 'a', 'aaaa', 'pytorch', 'lesson']
# Text-processing steps:
# 1. tokenize
# 2. build the vocabulary:  he -> 30, her -> 31
# 3. embed:                 30 -> (0.2, 0.4, 0.2, 0.9, 2.1)   (alternatives: one-hot, tf-idf, hashing)
def yield_token(data):
    for (_, text) in data:
        yield tokenizer(text)
vocab = build_vocab_from_iterator(yield_token(train_data),
                                  specials=['<pad>', '<unk>'],  # <pad> is mapped to 0, <unk> to 1
                                  min_freq=3)                   # only words seen at least 3 times enter the vocabulary
vocab.set_default_index(vocab['<unk>'])  # unknown words are mapped to the index of <unk>, i.e. 1
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x == 'pos')  # map the label to 0/1
# note: in some torchtext versions the IMDB labels are the integers 1/2 rather than 'neg'/'pos';
# adjust label_pipeline accordingly if that is the case
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list)
    text_list = torch.cat(text_list)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)  # start position of each text in the concatenated sequence
    return label_list.to(device), text_list.to(device), offsets.to(device)
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
train_dataloader = DataLoader(train_dataset, batch_size=64,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=64,
                             shuffle=True, collate_fn=collate_batch)
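A quick sanity check, not in the original post: pull one batch from train_dataloader to see what collate_batch returns.

label, text, offsets = next(iter(train_dataloader))
print(label.shape)    # e.g. torch.Size([64]): one label per review
print(text.shape)     # 1-D tensor: all 64 reviews concatenated into a single token sequence
print(offsets.shape)  # e.g. torch.Size([64]): start position of each review inside `text`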
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        # EmbeddingBag defaults to mode='mean': it pools the embeddings of each text by averaging
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
num_class = 2
vocab_size = len(vocab)
emsize = 100
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)
loss_fn = nn.CrossEntropyLoss()
from torch.optim import lr_scheduler
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
def train(dataloader):
    total_acc, total_count, total_loss = 0, 0, 0
    model.train()
    for label, text, offsets in dataloader:
        predicted_label = model(text, offsets)
        loss = loss_fn(predicted_label, label)
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
            total_loss += loss.item() * label.size(0)
    return total_loss / total_count, total_acc / total_count
def test(dataloader):
    model.eval()
    total_acc, total_count, total_loss = 0, 0, 0
    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = loss_fn(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
            total_loss += loss.item() * label.size(0)
    return total_loss / total_count, total_acc / total_count
def fit(epochs, train_dl, test_dl):
    train_loss = []
    train_acc = []
    test_loss = []
    test_acc = []
    for epoch in range(epochs):
        epoch_loss, epoch_acc = train(train_dl)
        epoch_test_loss, epoch_test_acc = test(test_dl)
        train_loss.append(epoch_loss)
        train_acc.append(epoch_acc)
        test_loss.append(epoch_test_loss)
        test_acc.append(epoch_test_acc)
        exp_lr_scheduler.step()
        template = ("epoch:{:2d}, train_loss: {:.5f}, train_acc: {:.1f}% ,"
                    "test_loss: {:.5f}, test_acc: {:.1f}%")
        print(template.format(
            epoch, epoch_loss, epoch_acc * 100, epoch_test_loss, epoch_test_acc * 100))
    print("Done!")
    return train_loss, test_loss, train_acc, test_acc

EPOCHS = 30
train_loss, test_loss, train_acc, test_acc = fit(EPOCHS,
                                                 train_dataloader,
                                                 test_dataloader)
epoch: 0, train_loss: 0.05308, train_acc: 99.9% ,test_loss: 0.01097, test_acc: 100.0%
epoch: 1, train_loss: 0.00760, train_acc: 100.0% ,test_loss: 0.00540, test_acc: 100.0%
epoch: 2, train_loss: 0.00440, train_acc: 100.0% ,test_loss: 0.00358, test_acc: 100.0%
epoch: 3, train_loss: 0.00311, train_acc: 100.0% ,test_loss: 0.00268, test_acc: 100.0%
epoch: 4, train_loss: 0.00241, train_acc: 100.0% ,test_loss: 0.00214, test_acc: 100.0%
epoch: 5, train_loss: 0.00197, train_acc: 100.0% ,test_loss: 0.00179, test_acc: 100.0%
epoch: 6, train_loss: 0.00167, train_acc: 100.0% ,test_loss: 0.00153, test_acc: 100.0%
epoch: 7, train_loss: 0.00145, train_acc: 100.0% ,test_loss: 0.00134, test_acc: 100.0%
epoch: 8, train_loss: 0.00128, train_acc: 100.0% ,test_loss: 0.00120, test_acc: 100.0%
epoch: 9, train_loss: 0.00115, train_acc: 100.0% ,test_loss: 0.00108, test_acc: 100.0%
epoch:10, train_loss: 0.00108, train_acc: 100.0% ,test_loss: 0.00107, test_acc: 100.0%
epoch:11, train_loss: 0.00107, train_acc: 100.0% ,test_loss: 0.00106, test_acc: 100.0%
epoch:12, train_loss: 0.00106, train_acc: 100.0% ,test_loss: 0.00105, test_acc: 100.0%
epoch:13, train_loss: 0.00105, train_acc: 100.0% ,test_loss: 0.00104, test_acc: 100.0%
epoch:14, train_loss: 0.00104, train_acc: 100.0% ,test_loss: 0.00103, test_acc: 100.0%
epoch:15, train_loss: 0.00103, train_acc: 100.0% ,test_loss: 0.00102, test_acc: 100.0%
epoch:16, train_loss: 0.00102, train_acc: 100.0% ,test_loss: 0.00101, test_acc: 100.0%
epoch:17, train_loss: 0.00102, train_acc: 100.0% ,test_loss: 0.00100, test_acc: 100.0%
epoch:18, train_loss: 0.00101, train_acc: 100.0% ,test_loss: 0.00099, test_acc: 100.0%
epoch:19, train_loss: 0.00100, train_acc: 100.0% ,test_loss: 0.00098, test_acc: 100.0%
epoch:20, train_loss: 0.00099, train_acc: 100.0% ,test_loss: 0.00098, test_acc: 100.0%
epoch:21, train_loss: 0.00099, train_acc: 100.0% ,test_loss: 0.00098, test_acc: 100.0%
epoch:22, train_loss: 0.00099, train_acc: 100.0% ,test_loss: 0.00098, test_acc: 100.0%
epoch:23, train_loss: 0.00099, train_acc: 100.0% ,test_loss: 0.00098, test_acc: 100.0%
epoch:24, train_loss: 0.00099, train_acc: 100.0% ,test_loss: 0.00098, test_acc: 100.0%
epoch:25, train_loss: 0.00099, train_acc: 100.0% ,test_loss: 0.00098, test_acc: 100.0%
epoch:26, train_loss: 0.00099, train_acc: 100.0% ,test_loss: 0.00098, test_acc: 100.0%
epoch:27, train_loss: 0.00099, train_acc: 100.0% ,test_loss: 0.00098, test_acc: 100.0%
epoch:28, train_loss: 0.00099, train_acc: 100.0% ,test_loss: 0.00098, test_acc: 100.0%
epoch:29, train_loss: 0.00099, train_acc: 100.0% ,test_loss: 0.00098, test_acc: 100.0%
Done!
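A short inference sketch (not part of the original notes), reusing the trained model, vocab, text_pipeline and device from above to score one hypothetical review string:

def predict_sentiment(review_text):
    # tokenize -> look up vocabulary indices -> a single-sample "batch" with offset 0
    with torch.no_grad():
        text = torch.tensor(text_pipeline(review_text), dtype=torch.int64).to(device)
        offsets = torch.tensor([0]).to(device)
        logits = model(text, offsets)
        return logits.argmax(1).item()  # 1 -> positive, 0 -> negative (per label_pipeline above)

predict_sentiment('this movie was surprisingly good')  # hypothetical example review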
# Word-embedding representation
em = torch.nn.Embedding(len(vocab), 20)  # map each word index to a tensor of length 20
s_em = em(torch.LongTensor(s))
s_em
# tensor([[ 0.6799, 0.0031, 1.8743, 0.5423, 0.4416, -0.5952, -0.2199, -0.5157,
# -0.3273, -1.3469, 0.0934, -0.7734, -1.1675, -1.7603, -0.6279, 0.0658,
# -0.0439, -0.7101, -2.8237, 0.5240],
# [ 1.2440, -0.6924, -1.5589, 1.0726, 0.5467, 0.4919, -0.5138, 0.6218,
# -0.5417, 1.8314, 0.7325, 0.3218, 0.1628, 0.0626, 0.3276, 0.2167,
# -0.3692, -0.7034, 0.0705, 0.3935],
# [-1.5574, 1.2716, -0.2661, 1.1207, -2.0768, -1.8648, -0.0386, -0.4261,
# -0.5502, 1.6994, -0.0997, 0.6697, -0.2014, 0.7134, 0.4169, -0.5963,
# 1.0591, 1.2019, 0.5507, 0.1586],
# [-0.2499, 2.1329, 0.0696, -0.7143, 1.9350, 0.3822, -0.6562, 0.5425,
# 0.1180, 2.6066, 2.2274, 1.8885, -0.2532, -1.0749, 2.3597, 0.1946,
# 0.5644, -2.3629, -0.5948, 1.7755],
# [-1.0459, 1.1564, 0.0283, -1.6528, 0.5663, 0.3941, 0.4620, -1.8766,
# 0.0099, 0.9699, 0.1013, -0.4488, 0.5943, -0.7578, -0.5864, 0.5749,
# -0.5252, -0.1428, -0.5080, -0.1126],
# [ 0.7462, -0.7669, -0.6719, -1.4782, 0.6049, 0.0729, 1.1420, -1.2980,
# 0.7976, 0.5524, 0.0607, 1.5890, 0.9569, -1.7878, 0.1119, -0.0818,
# -0.2928, -1.7373, -0.0768, 1.2419],
# [-0.1877, -1.6885, 0.2267, 0.0714, 0.0725, -0.0426, -0.4804, 1.3219,
# -0.9896, 0.6099, -0.1541, -0.1877, -1.5077, -0.6768, 0.5033, 0.3638,
# -0.0308, -2.2888, 1.4341, 1.1249],
# [ 1.9103, 0.4591, 1.5937, -0.6241, -0.8967, -1.3835, 0.6170, 1.4025,
# 0.4851, 1.4262, 0.0310, -0.7734, -1.2927, -0.2866, 0.8493, 0.1133,
# 0.2878, -1.1133, 1.7173, 0.0744],
# [ 0.2675, -0.5814, 0.0361, -1.1336, 0.0780, -1.1488, 0.1985, 1.6851,
# -0.1978, -1.8281, -1.3562, -0.8949, -1.1598, -1.5098, -0.6373, -1.2807,
# -1.7206, 1.2317, -0.3598, 0.0665]], grad_fn=<EmbeddingBackward0>)
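To connect this back to the EmbeddingBag pooling used in the classifier, a small illustration (not from the original post): copying the same weight matrix into an nn.EmbeddingBag with mode='mean' reproduces the average of the rows shown above.

bag = torch.nn.EmbeddingBag(len(vocab), 20, mode='mean')
bag.weight.data.copy_(em.weight.data)  # reuse the weights of `em` so the two layers are comparable
pooled = bag(torch.LongTensor(s), torch.LongTensor([0]))  # treat `s` as one text starting at offset 0
print(torch.allclose(pooled[0], s_em.mean(dim=0), atol=1e-6))  # True: mean pooling of the per-token embeddings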