PyTorch Learning Notes: A Deeper Dive into TorchText (01)

Author: 我的昵称违规了 | Published 2019-04-18 17:18

    After getting a simple torchtext example working, I want to study torchtext in more depth. I found two tutorials.

    1. Introduction to practical-torchtext

    A tutorial on using torchtext effectively, covering two parts:

    • Text classification
    • Word-level language modeling

    1.1 Goal

    The torchtext documentation is still relatively incomplete, and using torchtext effectively currently requires reading a fair amount of its code. This set of tutorials aims to provide working examples of torchtext so that more users can take full advantage of this fantastic library.

    1.2 Usage

    The current pip release of torchtext has some bugs that cause certain code to run incorrectly. At the moment these bugs are only fixed on the master branch of the torchtext GitHub repository, so the tutorial recommends installing torchtext straight from GitHub with the following command:

    pip install --upgrade git+https://github.com/pytorch/text
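
    To confirm which build you ended up with after installing from GitHub, a quick check (this only assumes the package exposes __version__ in the usual way):

    import torchtext

    # a git install typically reports a dev/commit-suffixed version string
    print(torchtext.__version__)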
    

    2. Text classification with torchtext processing

    I had a look: the first lesson is based on the same tutorial I have been working through repeatedly over the past two days, but this author has enriched it with more explanation, so I will run it here once more.

    2.0 Overview

    The overall flow is the same as before: load the data, preprocess it to build a dataset, and feed it into the model.
    The dataset is still the Kaggle toxic comment data used earlier.

    import pandas as pd
    import numpy as np
    import torch
    from torch.nn import init
    from torchtext.data import Field
    

    2.1 Declaring the Fields

    The Field class determines how the data is preprocessed and converted into numeric form.
    The text field is simple enough. The labels are even easier to handle, because they are already binary encoded. All we need to do is tell the Field class that the labels have already been processed, which we do by passing use_vocab=False to the constructor.

    tokenize=lambda x: x.split()
    TEXT=Field(sequential=True, tokenize=tokenize, lower=True)
    LABEL=Field(sequential=False, use_vocab=False)
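
    As a quick sanity check of what the TEXT field will do to a raw string, here is a minimal sketch (preprocess applies the tokenizer and the lower option, but does not map tokens to ids yet):

    # split on whitespace, then lowercase; numericalization happens later via the vocab
    print(TEXT.preprocess("Hello World, this is a TorchText demo"))
    # expected: ['hello', 'world,', 'this', 'is', 'a', 'torchtext', 'demo']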
    

    2.2 Creating the Dataset

    We will use the TabularDataset class to read our data, since it is in CSV format (as of now, TabularDataset handles csv, tsv, and json files).

    For the training and validation data we also need to handle the labels. The fields we pass in must be in the same order as the columns. For a column we do not use, we pass in a tuple whose second element is None.

    %%time
    from torchtext.data import TabularDataset
    
    tv_datafields=[
        ('id',None),
        ('comment_text',TEXT),
        ("toxic", LABEL),
        ("severe_toxic", LABEL),
        ("threat", LABEL),
        ("obscene", LABEL), 
        ("insult", LABEL),
        ("identity_hate", LABEL)
    ]
    trn,vld=TabularDataset.splits(
        path=r'C:\Users\jwc19\Desktop\2001_2018jszyfz\code\data\torchtextdata',
        train='train.csv',
        validation='valid.csv',
        format='csv',
        skip_header=True,
        fields=tv_datafields
    )
    
    Wall time: 4.99 ms
    
    %%time
    tst_datafields=[
        ('id',None),
        ('comment_text',TEXT)
    ]
    tst=TabularDataset(
        path=r'C:\Users\jwc19\Desktop\2001_2018jszyfz\code\data\torchtextdata\test.csv',
        format='csv',
        skip_header=True,
        fields=tst_datafields
    )
    
    Wall time: 3.01 ms
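
    Since the id column is mapped to None and the test file has no label columns, each test Example should only carry comment_text. A quick check, along the same lines as the one in the next section:

    # id was dropped (mapped to None), so comment_text is the only attribute left
    print(tst[0].__dict__.keys())
    # expected: dict_keys(['comment_text'])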
    

    2.3 Building the Vocabulary

    For the TEXT field to convert words into integers, it needs to know the full vocabulary. To build it, we call TEXT.build_vocab and pass in the dataset.

    %%time
    TEXT.build_vocab(trn)
    print(TEXT.vocab.freqs.most_common(10))
    
    [('the', 78), ('to', 41), ('you', 33), ('of', 30), ('and', 26), ('a', 26), ('is', 24), ('that', 22), ('i', 20), ('if', 19)]
    Wall time: 3.99 ms
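
    The vocabulary we just built exposes the usual lookup tables: stoi (string to index) and itos (index to string). By default torchtext reserves '<unk>' and '<pad>' at indices 0 and 1, which is why the padded positions in the batch tensors later in this post show up as 1. A minimal sketch:

    # two-way lookup between tokens and their integer ids
    print(TEXT.vocab.stoi['the'])   # id assigned to the most frequent token
    print(TEXT.vocab.itos[:5])      # first few entries; 0 is '<unk>' and 1 is '<pad>' by default
    print(len(TEXT.vocab))          # total vocabulary size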
    

    Here, every element of the dataset is an Example object whose attributes hold the individual fields.

    # Inspect which fields an Example in trn carries
    print(trn[0].__dict__.keys())
    # Look at the text of one row; as the output shows, it has already been tokenized
    print(trn[10].comment_text)
    
    dict_keys(['comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])
    ['"', 'fair', 'use', 'rationale', 'for', 'image:wonju.jpg', 'thanks', 'for', 'uploading', 'image:wonju.jpg.', 'i', 'notice', 'the', 'image', 'page', 'specifies', 'that', 'the', 'image', 'is', 'being', 'used', 'under', 'fair', 'use', 'but', 'there', 'is', 'no', 'explanation', 'or', 'rationale', 'as', 'to', 'why', 'its', 'use', 'in', 'wikipedia', 'articles', 'constitutes', 'fair', 'use.', 'in', 'addition', 'to', 'the', 'boilerplate', 'fair', 'use', 'template,', 'you', 'must', 'also', 'write', 'out', 'on', 'the', 'image', 'description', 'page', 'a', 'specific', 'explanation', 'or', 'rationale', 'for', 'why', 'using', 'this', 'image', 'in', 'each', 'article', 'is', 'consistent', 'with', 'fair', 'use.', 'please', 'go', 'to', 'the', 'image', 'description', 'page', 'and', 'edit', 'it', 'to', 'include', 'a', 'fair', 'use', 'rationale.', 'if', 'you', 'have', 'uploaded', 'other', 'fair', 'use', 'media,', 'consider', 'checking', 'that', 'you', 'have', 'specified', 'the', 'fair', 'use', 'rationale', 'on', 'those', 'pages', 'too.', 'you', 'can', 'find', 'a', 'list', 'of', "'image'", 'pages', 'you', 'have', 'edited', 'by', 'clicking', 'on', 'the', '""my', 'contributions""', 'link', '(it', 'is', 'located', 'at', 'the', 'very', 'top', 'of', 'any', 'wikipedia', 'page', 'when', 'you', 'are', 'logged', 'in),', 'and', 'then', 'selecting', '""image""', 'from', 'the', 'dropdown', 'box.', 'note', 'that', 'any', 'fair', 'use', 'images', 'uploaded', 'after', '4', 'may,', '2006,', 'and', 'lacking', 'such', 'an', 'explanation', 'will', 'be', 'deleted', 'one', 'week', 'after', 'they', 'have', 'been', 'uploaded,', 'as', 'described', 'on', 'criteria', 'for', 'speedy', 'deletion.', 'if', 'you', 'have', 'any', 'questions', 'please', 'ask', 'them', 'at', 'the', 'media', 'copyright', 'questions', 'page.', 'thank', 'you.', '(talk', '•', 'contribs', '•', ')', 'unspecified', 'source', 'for', 'image:wonju.jpg', 'thanks', 'for', 'uploading', 'image:wonju.jpg.', 'i', 'noticed', 'that', 'the', "file's", 'description', 'page', 'currently', "doesn't", 'specify', 'who', 'created', 'the', 'content,', 'so', 'the', 'copyright', 'status', 'is', 'unclear.', 'if', 'you', 'did', 'not', 'create', 'this', 'file', 'yourself,', 'then', 'you', 'will', 'need', 'to', 'specify', 'the', 'owner', 'of', 'the', 'copyright.', 'if', 'you', 'obtained', 'it', 'from', 'a', 'website,', 'then', 'a', 'link', 'to', 'the', 'website', 'from', 'which', 'it', 'was', 'taken,', 'together', 'with', 'a', 'restatement', 'of', 'that', "website's", 'terms', 'of', 'use', 'of', 'its', 'content,', 'is', 'usually', 'sufficient', 'information.', 'however,', 'if', 'the', 'copyright', 'holder', 'is', 'different', 'from', 'the', "website's", 'publisher,', 'then', 'their', 'copyright', 'should', 'also', 'be', 'acknowledged.', 'as', 'well', 'as', 'adding', 'the', 'source,', 'please', 'add', 'a', 'proper', 'copyright', 'licensing', 'tag', 'if', 'the', 'file', "doesn't", 'have', 'one', 'already.', 'if', 'you', 'created/took', 'the', 'picture,', 'audio,', 'or', 'video', 'then', 'the', 'tag', 'can', 'be', 'used', 'to', 'release', 'it', 'under', 'the', 'gfdl.', 'if', 'you', 'believe', 'the', 'media', 'meets', 'the', 'criteria', 'at', 'wikipedia:fair', 'use,', 'use', 'a', 'tag', 'such', 'as', 'or', 'one', 'of', 'the', 'other', 'tags', 'listed', 'at', 'wikipedia:image', 'copyright', 'tags#fair', 'use.', 'see', 'wikipedia:image', 'copyright', 'tags', 'for', 'the', 'full', 'list', 'of', 'copyright', 'tags', 'that', 'you', 'can', 'use.', 'if', 'you', 'have', 'uploaded', 'other', 
'files,', 'consider', 'checking', 'that', 'you', 'have', 'specified', 'their', 'source', 'and', 'tagged', 'them,', 'too.', 'you', 'can', 'find', 'a', 'list', 'of', 'files', 'you', 'have', 'uploaded', 'by', 'following', '[', 'this', 'link].', 'unsourced', 'and', 'untagged', 'images', 'may', 'be', 'deleted', 'one', 'week', 'after', 'they', 'have', 'been', 'tagged,', 'as', 'described', 'on', 'criteria', 'for', 'speedy', 'deletion.', 'if', 'the', 'image', 'is', 'copyrighted', 'under', 'a', 'non-free', 'license', '(per', 'wikipedia:fair', 'use)', 'then', 'the', 'image', 'will', 'be', 'deleted', '48', 'hours', 'after', '.', 'if', 'you', 'have', 'any', 'questions', 'please', 'ask', 'them', 'at', 'the', 'media', 'copyright', 'questions', 'page.', 'thank', 'you.', '(talk', '•', 'contribs', '•', ')', '"']
    

    2.4 Building the Iterators

    During training we use a special kind of iterator called a BucketIterator. When data is fed to a neural network, we want the sequences in a batch padded to the same length so they can be processed together.
    If sequence lengths vary a lot, the padding wastes a great deal of memory and time. BucketIterator groups sequences of similar length into the same batch to minimize padding.

    from torchtext.data import Iterator, BucketIterator
    # sort_key tells the BucketIterator which attribute to group batches by; here it is clearly comment_text
    # repeat=False because we are going to wrap this iterator later
    # device=-1 means CPU here; newer torchtext prefers a torch.device or a string (hence the warning below)
    train_iter, val_iter=BucketIterator.splits(
        (trn,vld),
        batch_sizes=(64,64),
        device=-1,
        sort_key=lambda x:len(x.comment_text),
        sort_within_batch=False,
        repeat=False
    )
    
    # Now let's look at what a batch from the BucketIterator actually contains.
    batch = next(iter(train_iter)); batch
    
    The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
    The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.

    [torchtext.data.batch.Batch of size 25]
        [.comment_text]:[torch.LongTensor of size 494x25]
        [.toxic]:[torch.LongTensor of size 25]
        [.severe_toxic]:[torch.LongTensor of size 25]
        [.threat]:[torch.LongTensor of size 25]
        [.obscene]:[torch.LongTensor of size 25]
        [.insult]:[torch.LongTensor of size 25]
        [.identity_hate]:[torch.LongTensor of size 25]
    
    batch.__dict__.keys()
    
    dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])
    
    test_iter = Iterator(tst, batch_size=64, device=-1, sort=False, sort_within_batch=False, repeat=False)
    
    The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
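
    The repeated warning comes from passing device=-1, which older torchtext versions accepted as shorthand for the CPU. Here is a sketch of the same iterators written the way the warning asks for, assuming a recent 0.x release where this Iterator API still exists:

    # same iterators, but with the device passed as a torch.device instead of -1
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    train_iter, val_iter = BucketIterator.splits(
        (trn, vld),
        batch_sizes=(64, 64),
        device=device,
        sort_key=lambda x: len(x.comment_text),
        sort_within_batch=False,
        repeat=False
    )
    test_iter = Iterator(tst, batch_size=64, device=device,
                         sort=False, sort_within_batch=False, repeat=False)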
    
    
    

    2.5 Wrapping the Iterators

    At the moment, the iterator returns a custom data type called torchtext.data.Batch. This makes code reuse difficult (the code has to change every time the column names change) and makes torchtext awkward to use with other libraries for some use cases (such as torchsample and fastai).
    Here the tutorial writes a simple wrapper to make the batches easy to consume. Concretely, it converts each batch into a tuple (x, y), where x is the independent variable (the model input) and y is the dependent variable (the supervision target).

    class BatchWrapper:
        def __init__(self, dl, x_var, y_vars):
            self.dl, self.x_var, self.y_vars = dl, x_var, y_vars # we pass in the list of attributes for x and y
        
        def __iter__(self):
            for batch in self.dl:
                x = getattr(batch, self.x_var) # we assume only one input in this wrapper
                
                if self.y_vars is not None: # we will concatenate y into a single tensor
                    y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y_vars], dim=1).float()
                else:
                    y = torch.zeros((1))
    
                yield (x, y)
        
        def __len__(self):
            return len(self.dl)
    
    train_dl = BatchWrapper(train_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
    valid_dl = BatchWrapper(val_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
    test_dl = BatchWrapper(test_iter, "comment_text", None)
    

    Let's verify. My understanding here is that iterating over the wrapper should yield plain tensors, and that does seem to be the case.

    next(iter(train_dl))
    
    (tensor([[ 63,  66, 354,  ..., 334, 453, 778],
             [  4,  82,  63,  ...,  55, 523, 650],
             [664,   2,   4,  ..., 520,  30,  22],
             ...,
             [  1,   1,   1,  ...,   1,   1,   1],
             [  1,   1,   1,  ...,   1,   1,   1],
             [  1,   1,   1,  ...,   1,   1,   1]]),
     tensor([[0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [1., 1., 0., 1., 1., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [1., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [1., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0.]]))
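
    The shapes line up with the Batch we printed earlier: x comes out as (sequence length, batch size), because Field defaults to batch_first=False, and y is (batch size, 6) after the concatenation inside BatchWrapper. A quick check:

    # x: LongTensor of token ids, shape (seq_len, batch); y: FloatTensor of labels, shape (batch, 6)
    x, y = next(iter(train_dl))
    print(x.shape, y.shape)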
    

    2.6 Training a Text Classifier

    Still an LSTM, as before.

    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim

    class LSTM(nn.Module):
        def __init__(self, hidden_dim, emb_dim=300, num_linear=1):
            super().__init__()
            self.embedding = nn.Embedding(len(TEXT.vocab), emb_dim)
            self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=1)

            # optional extra fully connected layers between the encoder and the output head
            self.linear_layers = []
            for _ in range(num_linear - 1):
                self.linear_layers.append(nn.Linear(hidden_dim, hidden_dim))
            # wrap them in a ModuleList so their parameters are registered and trained
            self.linear_layers = nn.ModuleList(self.linear_layers)

            # one output unit per label column
            self.predictor = nn.Linear(hidden_dim, 6)

        def forward(self, seq):
            hdn, _ = self.encoder(self.embedding(seq))
            feature = hdn[-1, :, :]  # hidden state at the last time step
            for layer in self.linear_layers:
                feature = layer(feature)
            preds = self.predictor(feature)
            return preds

    em_sz = 100
    nh = 500
    nl = 3  # defined but not used below
    model = LSTM(nh, emb_dim=em_sz)
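
    Before training, a single forward pass on one wrapped batch is a cheap way to confirm the head produces one logit per label column, i.e. an output of shape (batch size, 6). A minimal check:

    # untrained forward pass; no_grad avoids building a graph we do not need
    x, y = next(iter(train_dl))
    with torch.no_grad():
        print(model(x).shape)   # expected: torch.Size([<batch size>, 6])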
    
    %%time
    
    import tqdm
    opt=optim.Adam(model.parameters(),lr=1e-2)
    loss_func=nn.BCEWithLogitsLoss()
    epochs=2
    
    Wall time: 0 ns
    
    for epoch in range(1, epochs + 1):
        running_loss = 0.0
        running_corrects = 0
        model.train()  # training mode
        for x, y in tqdm.tqdm(train_dl):
            opt.zero_grad()
            preds = model(x)
            loss = loss_func(preds, y)  # BCEWithLogitsLoss expects (logits, targets)
            loss.backward()
            opt.step()

            # note: with batch_first=False, x.size(0) is the sequence length, not the batch size
            running_loss += loss.item() * x.size(0)
        epoch_loss = running_loss / len(trn)

        val_loss = 0.0
        model.eval()  # evaluation mode
        for x, y in valid_dl:
            preds = model(x)
            loss = loss_func(preds, y)
            val_loss += loss.item() * x.size(0)

        val_loss /= len(vld)
        print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

    test_preds = []
    for x, y in tqdm.tqdm(test_dl):
        preds = model(x)
        preds = preds.data.numpy()
        # the model outputs logits, so squash them through a sigmoid
        preds = 1 / (1 + np.exp(-preds))
        test_preds.append(preds)
    test_preds = np.hstack(test_preds)

    print(test_preds)
    
    100%|██████████| 1/1 [00:06<00:00,  6.28s/it]
    Epoch: 1, Training Loss: 14.2130, Validation Loss: 4.4170
    100%|██████████| 1/1 [00:04<00:00,  4.20s/it]
    Epoch: 2, Training Loss: 10.5315, Validation Loss: 3.3947
    100%|██████████| 1/1 [00:00<00:00,  2.87it/s]

    [[0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99978834 0.99761593 0.5279695  0.9961003  0.9957486  0.3662841 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]
     [0.99982786 0.99812    0.53367174 0.99682033 0.9966144  0.3649216 ]]
    

    2.7 Predicting on the Test Data and Inspecting the Results

    test_preds = []
    for x, y in tqdm.tqdm(test_dl):
        preds = model(x)
        # if your data is on the GPU, you need to move it back to the cpu first
        # preds = preds.data.cpu().numpy()
        preds = preds.data.numpy()
        # the actual outputs of the model are logits, so we pass them through a sigmoid
        preds = 1 / (1 + np.exp(-preds))
        test_preds.append(preds)
    # with a single test batch hstack works; with several batches you would stack rows with np.vstack
    test_preds = np.hstack(test_preds)
    
    100%|██████████| 1/1 [00:00<00:00,  2.77it/s]
    
    df = pd.read_csv("./data/torchtextdata/test.csv")
    for i, col in enumerate(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]):
        df[col] = test_preds[:, i]
    
    df.head(3)
    
    [Output: df.head(3) showing the first three test rows with the six predicted probability columns filled in.]
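
    If you want to keep these predictions around, writing the dataframe back out is a one-liner (a minimal sketch; the output filename is just an example, not part of the original tutorial):

    # save the ids together with the six predicted probabilities
    df.to_csv("./data/torchtextdata/predictions.csv", index=False)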
