PyTorch Learning Notes: A Deeper Dive into TorchText (01)

Author: 我的昵称违规了 | Published 2019-04-18 17:18

    After getting a simple torchtext example working, I want to study torchtext in more depth. I found two tutorials.

    1. Introduction to practical-torchtext

    A tutorial on using torchtext effectively, covering two parts:

    • Text classification
    • Word-level language modeling

    1.1 Goal

    The torchtext documentation is still relatively incomplete, and using torchtext effectively currently requires reading a fair amount of its code. This set of tutorials aims to provide working examples of torchtext so that more users can take full advantage of this fantastic library.

    1.2 Usage

    The current pip release of torchtext has some bugs that cause certain code to run incorrectly. At the moment these bugs are only fixed on the master branch of the torchtext GitHub repository, so the tutorial recommends installing torchtext straight from GitHub with the following command:

    pip install --upgrade git+https://github.com/pytorch/text
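
    To confirm which build you ended up with after installing from GitHub, a quick check (this only assumes the package exposes __version__ in the usual way):

    import torchtext

    # a git install typically reports a dev/commit-suffixed version string
    print(torchtext.__version__)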
    

    2. Text classification with torchtext processing

    I had a look: the first lesson is based on the same tutorial I have been working through repeatedly over the past two days, but this author has enriched it with more explanation, so I will run it here once more.

    2.0 Overview

    The overall flow is the same as before: load the data, preprocess it to build a dataset, and feed it into the model.
    The dataset is still the Kaggle toxic comment data used earlier.

    import pandas as pd
    import numpy as np
    import torch
    from torch.nn import init
    from torchtext.data import Field
    

    2.1 Declaring the Fields

    The Field class determines how the data is preprocessed and converted into numeric form.
    The text field is simple enough. The labels are even easier to handle, because they are already binary encoded. All we need to do is tell the Field class that the labels have already been processed, which we do by passing use_vocab=False to the constructor.

    tokenize=lambda x: x.split()
    TEXT=Field(sequential=True, tokenize=tokenize, lower=True)
    LABEL=Field(sequential=False, use_vocab=False)
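
    As a quick sanity check of what the TEXT field will do to a raw string, here is a minimal sketch (preprocess applies the tokenizer and the lower option, but does not map tokens to ids yet):

    # split on whitespace, then lowercase; numericalization happens later via the vocab
    print(TEXT.preprocess("Hello World, this is a TorchText demo"))
    # expected: ['hello', 'world,', 'this', 'is', 'a', 'torchtext', 'demo']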
    

    2.2 Creating the Dataset

    We will use the TabularDataset class to read our data, since it is in CSV format (as of now, TabularDataset handles csv, tsv, and json files).

    For the training and validation data we also need to handle the labels. The fields we pass in must be in the same order as the columns. For a column we do not use, we pass in a tuple whose second element is None.

    %%time
    from torchtext.data import TabularDataset
    
    tv_datafields=[
        ('id',None),
        ('comment_text',TEXT),
        ("toxic", LABEL),
        ("severe_toxic", LABEL),
        ("threat", LABEL),
        ("obscene", LABEL), 
        ("insult", LABEL),
        ("identity_hate", LABEL)
    ]
    trn,vld=TabularDataset.splits(
        path=r'C:\Users\jwc19\Desktop\2001_2018jszyfz\code\data\torchtextdata',
        train='train.csv',
        validation='valid.csv',
        format='csv',
        skip_header=True,
        fields=tv_datafields
    )
    
    Wall time: 4.99 ms
    
    %%time
    tst_datafields=[
        ('id',None),
        ('comment_text',TEXT)
    ]
    tst=TabularDataset(
        path=r'C:\Users\jwc19\Desktop\2001_2018jszyfz\code\data\torchtextdata\test.csv',
        format='csv',
        skip_header=True,
        fields=tst_datafields
    )
    
    Wall time: 3.01 ms
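
    Since the id column is mapped to None and the test file has no label columns, each test Example should only carry comment_text. A quick check, along the same lines as the one in the next section:

    # id was dropped (mapped to None), so comment_text is the only attribute left
    print(tst[0].__dict__.keys())
    # expected: dict_keys(['comment_text'])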
    

    2.3 Building the Vocabulary

    For the TEXT field to convert words into integers, it needs to know the full vocabulary. To build it, we call TEXT.build_vocab and pass in the dataset.

    %%time
    TEXT.build_vocab(trn)
    print(TEXT.vocab.freqs.most_common(10))
    
    [('the', 78), ('to', 41), ('you', 33), ('of', 30), ('and', 26), ('a', 26), ('is', 24), ('that', 22), ('i', 20), ('if', 19)]
    Wall time: 3.99 ms
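
    The vocabulary we just built exposes the usual lookup tables: stoi (string to index) and itos (index to string). By default torchtext reserves '<unk>' and '<pad>' at indices 0 and 1, which is why the padded positions in the batch tensors later in this post show up as 1. A minimal sketch:

    # two-way lookup between tokens and their integer ids
    print(TEXT.vocab.stoi['the'])   # id assigned to the most frequent token
    print(TEXT.vocab.itos[:5])      # first few entries; 0 is '<unk>' and 1 is '<pad>' by default
    print(len(TEXT.vocab))          # total vocabulary size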
    

    Here, every element of the dataset is an Example object whose attributes hold the individual fields.

    # Inspect which fields an Example in trn carries
    print(trn[0].__dict__.keys())
    # Look at the text of one row; as the output shows, it has already been tokenized
    print(trn[10].comment_text)
    
    dict_keys(['comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])
    ['"', 'fair', 'use', 'rationale', 'for', 'image:wonju.jpg', 'thanks', 'for', 'uploading', 'image:wonju.jpg.', 'i', 'notice', 'the', 'image', 'page', 'specifies', 'that', 'the', 'image', 'is', 'being', 'used', 'under', 'fair', 'use', 'but', 'there', 'is', 'no', 'explanation', 'or', 'rationale', 'as', 'to', 'why', 'its', 'use', 'in', 'wikipedia', 'articles', 'constitutes', 'fair', 'use.', 'in', 'addition', 'to', 'the', 'boilerplate', 'fair', 'use', 'template,', 'you', 'must', 'also', 'write', 'out', 'on', 'the', 'image', 'description', 'page', 'a', 'specific', 'explanation', 'or', 'rationale', 'for', 'why', 'using', 'this', 'image', 'in', 'each', 'article', 'is', 'consistent', 'with', 'fair', 'use.', 'please', 'go', 'to', 'the', 'image', 'description', 'page', 'and', 'edit', 'it', 'to', 'include', 'a', 'fair', 'use', 'rationale.', 'if', 'you', 'have', 'uploaded', 'other', 'fair', 'use', 'media,', 'consider', 'checking', 'that', 'you', 'have', 'specified', 'the', 'fair', 'use', 'rationale', 'on', 'those', 'pages', 'too.', 'you', 'can', 'find', 'a', 'list', 'of', "'image'", 'pages', 'you', 'have', 'edited', 'by', 'clicking', 'on', 'the', '""my', 'contributions""', 'link', '(it', 'is', 'located', 'at', 'the', 'very', 'top', 'of', 'any', 'wikipedia', 'page', 'when', 'you', 'are', 'logged', 'in),', 'and', 'then', 'selecting', '""image""', 'from', 'the', 'dropdown', 'box.', 'note', 'that', 'any', 'fair', 'use', 'images', 'uploaded', 'after', '4', 'may,', '2006,', 'and', 'lacking', 'such', 'an', 'explanation', 'will', 'be', 'deleted', 'one', 'week', 'after', 'they', 'have', 'been', 'uploaded,', 'as', 'described', 'on', 'criteria', 'for', 'speedy', 'deletion.', 'if', 'you', 'have', 'any', 'questions', 'please', 'ask', 'them', 'at', 'the', 'media', 'copyright', 'questions', 'page.', 'thank', 'you.', '(talk', '•', 'contribs', '•', ')', 'unspecified', 'source', 'for', 'image:wonju.jpg', 'thanks', 'for', 'uploading', 'image:wonju.jpg.', 'i', 'noticed', 'that', 'the', "file's", 'description', 'page', 'currently', "doesn't", 'specify', 'who', 'created', 'the', 'content,', 'so', 'the', 'copyright', 'status', 'is', 'unclear.', 'if', 'you', 'did', 'not', 'create', 'this', 'file', 'yourself,', 'then', 'you', 'will', 'need', 'to', 'specify', 'the', 'owner', 'of', 'the', 'copyright.', 'if', 'you', 'obtained', 'it', 'from', 'a', 'website,', 'then', 'a', 'link', 'to', 'the', 'website', 'from', 'which', 'it', 'was', 'taken,', 'together', 'with', 'a', 'restatement', 'of', 'that', "website's", 'terms', 'of', 'use', 'of', 'its', 'content,', 'is', 'usually', 'sufficient', 'information.', 'however,', 'if', 'the', 'copyright', 'holder', 'is', 'different', 'from', 'the', "website's", 'publisher,', 'then', 'their', 'copyright', 'should', 'also', 'be', 'acknowledged.', 'as', 'well', 'as', 'adding', 'the', 'source,', 'please', 'add', 'a', 'proper', 'copyright', 'licensing', 'tag', 'if', 'the', 'file', "doesn't", 'have', 'one', 'already.', 'if', 'you', 'created/took', 'the', 'picture,', 'audio,', 'or', 'video', 'then', 'the', 'tag', 'can', 'be', 'used', 'to', 'release', 'it', 'under', 'the', 'gfdl.', 'if', 'you', 'believe', 'the', 'media', 'meets', 'the', 'criteria', 'at', 'wikipedia:fair', 'use,', 'use', 'a', 'tag', 'such', 'as', 'or', 'one', 'of', 'the', 'other', 'tags', 'listed', 'at', 'wikipedia:image', 'copyright', 'tags#fair', 'use.', 'see', 'wikipedia:image', 'copyright', 'tags', 'for', 'the', 'full', 'list', 'of', 'copyright', 'tags', 'that', 'you', 'can', 'use.', 'if', 'you', 'have', 'uploaded', 'other', 
'files,', 'consider', 'checking', 'that', 'you', 'have', 'specified', 'their', 'source', 'and', 'tagged', 'them,', 'too.', 'you', 'can', 'find', 'a', 'list', 'of', 'files', 'you', 'have', 'uploaded', 'by', 'following', '[', 'this', 'link].', 'unsourced', 'and', 'untagged', 'images', 'may', 'be', 'deleted', 'one', 'week', 'after', 'they', 'have', 'been', 'tagged,', 'as', 'described', 'on', 'criteria', 'for', 'speedy', 'deletion.', 'if', 'the', 'image', 'is', 'copyrighted', 'under', 'a', 'non-free', 'license', '(per', 'wikipedia:fair', 'use)', 'then', 'the', 'image', 'will', 'be', 'deleted', '48', 'hours', 'after', '.', 'if', 'you', 'have', 'any', 'questions', 'please', 'ask', 'them', 'at', 'the', 'media', 'copyright', 'questions', 'page.', 'thank', 'you.', '(talk', '•', 'contribs', '•', ')', '"']
    

    2.4 Building the Iterators

    During training we use a special kind of iterator called a BucketIterator. When data is fed to a neural network, we want the sequences in a batch padded to the same length so they can be processed together.
    If sequence lengths vary a lot, the padding wastes a great deal of memory and time. BucketIterator groups sequences of similar length into the same batch to minimize padding.

    from torchtext.data import Iterator, BucketIterator
    # sort_key tells the BucketIterator which attribute to group batches by; here it is clearly comment_text
    # repeat=False because we are going to wrap this iterator later
    # device=-1 means CPU here; newer torchtext prefers a torch.device or a string (hence the warning below)
    train_iter, val_iter=BucketIterator.splits(
        (trn,vld),
        batch_sizes=(64,64),
        device=-1,
        sort_key=lambda x:len(x.comment_text),
        sort_within_batch=False,
        repeat=False
    )
    
    # Now let's look at what a batch from the BucketIterator actually contains.
    batch = next(iter(train_iter)); batch
    
    The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
    The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.

    [torchtext.data.batch.Batch of size 25]
        [.comment_text]:[torch.LongTensor of size 494x25]
        [.toxic]:[torch.LongTensor of size 25]
        [.severe_toxic]:[torch.LongTensor of size 25]
        [.threat]:[torch.LongTensor of size 25]
        [.obscene]:[torch.LongTensor of size 25]
        [.insult]:[torch.LongTensor of size 25]
        [.identity_hate]:[torch.LongTensor of size 25]
    
    batch.__dict__.keys()
    
    dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])
    
    test_iter = Iterator(tst, batch_size=64, device=-1, sort=False, sort_within_batch=False, repeat=False)
    
    The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
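
    The repeated warning comes from passing device=-1, which older torchtext versions accepted as shorthand for the CPU. Here is a sketch of the same iterators written the way the warning asks for, assuming a recent 0.x release where this Iterator API still exists:

    # same iterators, but with the device passed as a torch.device instead of -1
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    train_iter, val_iter = BucketIterator.splits(
        (trn, vld),
        batch_sizes=(64, 64),
        device=device,
        sort_key=lambda x: len(x.comment_text),
        sort_within_batch=False,
        repeat=False
    )
    test_iter = Iterator(tst, batch_size=64, device=device,
                         sort=False, sort_within_batch=False, repeat=False)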
    
    
    

    2.5 Wrapping the Iterators

    At the moment, the iterator returns a custom data type called torchtext.data.Batch. This makes code reuse difficult (the code has to change every time the column names change) and makes torchtext awkward to use with other libraries for some use cases (such as torchsample and fastai).
    Here the tutorial writes a simple wrapper to make the batches easy to consume. Concretely, it converts each batch into a tuple (x, y), where x is the independent variable (the model input) and y is the dependent variable (the supervision target).

    class BatchWrapper:
        def __init__(self, dl, x_var, y_vars):
            self.dl, self.x_var, self.y_vars = dl, x_var, y_vars # we pass in the list of attributes for x and y
        
        def __iter__(self):
            for batch in self.dl:
                x = getattr(batch, self.x_var) # we assume only one input in this wrapper
                
                if self.y_vars is not None: # we will concatenate y into a single tensor
                    y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y_vars], dim=1).float()
                else:
                    y = torch.zeros((1))
    
                yield (x, y)
        
        def __len__(self):
            return len(self.dl)
    
    train_dl = BatchWrapper(train_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
    valid_dl = BatchWrapper(val_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
    test_dl = BatchWrapper(test_iter, "comment_text", None)
    

    Let's verify. My understanding here is that iterating over the wrapper should yield plain tensors, and that does seem to be the case.

    next(iter(train_dl))
    
    (tensor([[ 63,  66, 354,  ..., 334, 453, 778],
             [  4,  82,  63,  ...,  55, 523, 650],
             [664,   2,   4,  ..., 520,  30,  22],
             ...,
             [  1,   1,   1,  ...,   1,   1,   1],
             [  1,   1,   1,  ...,   1,   1,   1],
             [  1,   1,   1,  ...,   1,   1,   1]]),
     tensor([[0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [1., 1., 0., 1., 1., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [1., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [1., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.]]))
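
    The shapes line up with the Batch we printed earlier: x comes out as (sequence length, batch size), because Field defaults to batch_first=False, and y is (batch size, 6) after the concatenation inside BatchWrapper. A quick check:

    # x: LongTensor of token ids, shape (seq_len, batch); y: FloatTensor of labels, shape (batch, 6)
    x, y = next(iter(train_dl))
    print(x.shape, y.shape)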
    

    2.6 Training a Text Classifier

    Still an LSTM, as before.

    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim

    class LSTM(nn.Module):
        def __init__(self, hidden_dim, emb_dim=300, num_linear=1):
            super().__init__()
            self.embedding = nn.Embedding(len(TEXT.vocab), emb_dim)
            self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=1)

            # optional extra fully connected layers between the encoder and the output head
            self.linear_layers = []
            for _ in range(num_linear - 1):
                self.linear_layers.append(nn.Linear(hidden_dim, hidden_dim))
            # wrap them in a ModuleList so their parameters are registered and trained
            self.linear_layers = nn.ModuleList(self.linear_layers)

            # one output unit per label column
            self.predictor = nn.Linear(hidden_dim, 6)

        def forward(self, seq):
            hdn, _ = self.encoder(self.embedding(seq))
            feature = hdn[-1, :, :]  # hidden state at the last time step
            for layer in self.linear_layers:
                feature = layer(feature)
            preds = self.predictor(feature)
            return preds

    em_sz = 100
    nh = 500
    nl = 3  # defined but not used below
    model = LSTM(nh, emb_dim=em_sz)
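
    Before training, a single forward pass on one wrapped batch is a cheap way to confirm the head produces one logit per label column, i.e. an output of shape (batch size, 6). A minimal check:

    # untrained forward pass; no_grad avoids building a graph we do not need
    x, y = next(iter(train_dl))
    with torch.no_grad():
        print(model(x).shape)   # expected: torch.Size([<batch size>, 6])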
    
    %%time
    
    import tqdm
    opt=optim.Adam(model.parameters(),lr=1e-2)
    loss_func=nn.BCEWithLogitsLoss()
    epochs=2
    
    Wall time: 0 ns
    
    for epoch in range(1, epochs + 1):
        running_loss = 0.0
        running_corrects = 0
        model.train()  # training mode
        for x, y in tqdm.tqdm(train_dl):
            opt.zero_grad()
            preds = model(x)
            loss = loss_func(preds, y)  # BCEWithLogitsLoss expects (logits, targets)
            loss.backward()
            opt.step()

            # note: with batch_first=False, x.size(0) is the sequence length, not the batch size
            running_loss += loss.item() * x.size(0)
        epoch_loss = running_loss / len(trn)

        val_loss = 0.0
        model.eval()  # evaluation mode
        for x, y in valid_dl:
            preds = model(x)
            loss = loss_func(preds, y)
            val_loss += loss.item() * x.size(0)

        val_loss /= len(vld)
        print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

    test_preds = []
    for x, y in tqdm.tqdm(test_dl):
        preds = model(x)
        preds = preds.data.numpy()
        # the model outputs logits, so squash them through a sigmoid
        preds = 1 / (1 + np.exp(-preds))
        test_preds.append(preds)
    test_preds = np.hstack(test_preds)

    print(test_preds)
    
    100%|██████████| 1/1 [00:06<00:00,  6.28s/it]
    Epoch: 1, Training Loss: 14.2130, Validation Loss: 4.4170
    100%|██████████| 1/1 [00:04<00:00,  4.20s/it]
    Epoch: 2, Training Loss: 10.5315, Validation Loss: 3.3947
    100%|██████████| 1/1 [00:00<00:00,  2.87it/s]

    [[0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99978834 0.99761593 0.5279695  0.9961003  0.9957486  0.3662841 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]]
    

    2.7 Predicting on the Test Data and Inspecting the Results

    test_preds = []
    for x, y in tqdm.tqdm(test_dl):
        preds = model(x)
        # if your data is on the GPU, you need to move it back to the cpu first
        # preds = preds.data.cpu().numpy()
        preds = preds.data.numpy()
        # the actual outputs of the model are logits, so we pass them through a sigmoid
        preds = 1 / (1 + np.exp(-preds))
        test_preds.append(preds)
    # with a single test batch hstack works; with several batches you would stack rows with np.vstack
    test_preds = np.hstack(test_preds)
    
    100%|██████████| 1/1 [00:00<00:00,  2.77it/s]
    
    df = pd.read_csv("./data/torchtextdata/test.csv")
    for i, col in enumerate(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]):
        df[col] = test_preds[:, i]
    
    df.head(3)
    
    [Output: df.head(3) showing the first three test rows with the six predicted probability columns filled in.]
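
    If you want to keep these predictions around, writing the dataframe back out is a one-liner (a minimal sketch; the output filename is just an example, not part of the original tutorial):

    # save the ids together with the six predicted probabilities
    df.to_csv("./data/torchtextdata/predictions.csv", index=False)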
