TextCNN

Author: Jarkata | Published 2021-05-12 12:21

This article is a repost; the original is at https://wmathor.com/index.php/archives/1445/

This article walks through Convolutional Neural Networks for Sentence Classification, a paper that applies CNNs to NLP, and then gives a PyTorch implementation.

The paper is short and the overall pipeline is simple. The key is the figure below: once you understand it, you know how to write the code. If you are not familiar with CNNs, please first read my article CS231n 笔记:通俗理解 CNN.

The feature map in the figure below is obtained by passing each word of a sentence through a word embedding: the width of the feature map is the embedding dimension, and its length is the number of words in the sentence. In the figure, each word is clearly encoded as a 6-dimensional vector, and the sentence contains 9 words.

The figure shows two copies of the feature map; for this walkthrough you can simply think of this as a batch size of 2 (in the original paper they are actually two input channels: one static and one fine-tuned set of word vectors).

The red boxes are the convolution kernels, and they are clearly not square: their width equals the embedding dimension, while the number of words they span can be viewed as an n-gram size. For example, the kernel in the figure spans 2 words, so it looks at the vectors of "wait" and "for" at the same time and therefore behaves like a bigram model, as the sketch below illustrates.
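As a minimal sketch of this idea (the 9×6 sentence matrix below is just random numbers standing in for the figure, not data from the paper), a kernel of shape (2, embedding_dim) slides over two adjacent word vectors at a time and produces one score per bigram window:

import torch
import torch.nn as nn

sentence = torch.randn(1, 1, 9, 6)       # [batch, channel, num_words(=9), embedding_dim(=6)], as in the figure
bigram_filter = nn.Conv2d(1, 1, (2, 6))  # kernel covers 2 whole word vectors at a time
print(bigram_filter(sentence).shape)     # torch.Size([1, 1, 8, 1]): one score for each of the 8 bigram windows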

What follows is the standard CNN pipeline: activation, pooling, and flattening; nothing special to add.


Code Implementation (PyTorch)

Import the packages and specify the device and data type

import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data
import torch.nn.functional as F

dtype = torch.FloatTensor
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Define some toy data and set a few common parameters

# 3 words sentences (=sequence_length is 3)
sentences = ["i love you", "he loves me", "she likes baseball", "i hate you", "sorry for that", "this is awful"]
labels = [1, 1, 1, 0, 0, 0]  # 1 is good, 0 is not good.

# TextCNN Parameter
embedding_size = 2
sequence_length = len(sentences[0].split()) # every sentence contains sequence_length (=3) words; otherwise padding would be needed
num_classes = len(set(labels))
batch_size = 3

word_list = " ".join(sentences).split()
vocab = list(set(word_list))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}
vocab_size = len(vocab)
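As a quick sanity check (the exact indices are illustrative, since set() orders the vocabulary arbitrarily), you can inspect the mapping just built:

print(vocab_size)                                    # 16 distinct words appear in the 6 sentences
print([word2idx[w] for w in "i love you".split()])   # e.g. [3, 7, 12]; the values depend on the set() ordering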

Data preprocessing

def make_data(sentences, labels):
    inputs = []
    for sen in sentences:
        inputs.append([word2idx[n] for n in sen.split()])

    targets = []
    for out in labels:
        targets.append(out)  # class indices, as expected by nn.CrossEntropyLoss
    return inputs, targets

input_batch, target_batch = make_data(sentences, labels)
input_batch, target_batch = torch.LongTensor(input_batch), torch.LongTensor(target_batch)

dataset = Data.TensorDataset(input_batch, target_batch)
loader = Data.DataLoader(dataset, batch_size, True)
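If you want to confirm what the DataLoader yields (a small sketch; with 6 samples and batch_size = 3 it produces two shuffled batches), iterate over it once:

for batch_x, batch_y in loader:
    print(batch_x.shape, batch_y.shape)  # torch.Size([3, 3]) torch.Size([3]) for each of the 2 batches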

Build the model

class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.W = nn.Embedding(vocab_size, embedding_size)
        output_channel = 3
        self.conv = nn.Sequential(
            # conv:[input_channel(=1), output_channel(=3), (filter_height,filter_width), stride = 1]
            nn.Conv2d(1, output_channel, (2, embedding_size)),
            nn.ReLU(),
            # pool: ((filter_height, filter_width))
            nn.MaxPool2d((2, 1))
        )
        self.fc = nn.Linear(output_channel, num_classes)

    def forward(self, X):
        '''
        :param X: [batch_size, sequence_length]
        :return:
        '''
        batch_size = X.shape[0]
        embedding_X = self.W(X)  # [batch_size, sequence_length, embedding_size]

        embedding_X = embedding_X.unsqueeze(1)  # add channel(=1): [batch_size, channel(=1), sequence_length, embedding_size]
        conved = self.conv(embedding_X)
        flatten = conved.view(batch_size, -1)
        output = self.fc(flatten)
        return output

Let us look in detail at how the tensor shape changes as data flows through the network. The input is a matrix of shape [batch_size, sequence_length], where each number is the index of a word in the vocabulary.

First, the Embedding layer, which is essentially a table lookup, turns each index into a vector; for example, index 12 might become [0.3, 0.6, 0.12, ...]. The data thus gains a dimension and becomes [batch_size, sequence_length, embedding_size].
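A toy check of this lookup (the embedding weights are random at initialization, so only the shapes are meaningful):

emb = nn.Embedding(vocab_size, embedding_size)                         # a [vocab_size, embedding_size] lookup table
idx = torch.LongTensor([[word2idx[w] for w in "i love you".split()]])  # [1, 3] indices
print(emb(idx).shape)                                                  # torch.Size([1, 3, 2]): each index became a 2-dim vector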

Next, unsqueeze(1) adds a channel dimension (=1), giving [batch_size, 1, sequence_length, embedding_size]. Only now can the data be convolved, because a standard CNN expects input of shape [batch_size, in_channel, height, width].

An input of shape [batch_size, 1, 3, 2] passes through nn.Conv2d(1, 3, (2, 2)) and comes out as [batch_size, 3, 2, 1]; ReLU does not change the shape, so it is not drawn in the figure. Finally, nn.MaxPool2d((2, 1)) pooling yields [batch_size, 3, 1, 1].
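You can verify this shape trace with a dummy tensor (a sketch that reuses the same layer settings as the model above; the batch size 4 is arbitrary):

x = torch.randn(4, 1, 3, 2)   # [batch_size(=4), channel(=1), sequence_length(=3), embedding_size(=2)]
conv = nn.Conv2d(1, 3, (2, 2))
pool = nn.MaxPool2d((2, 1))
h = torch.relu(conv(x))
print(h.shape)                # torch.Size([4, 3, 2, 1])
print(pool(h).shape)          # torch.Size([4, 3, 1, 1]), which flattens to [4, 3] before the fully connected layer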

Training

model = TextCNN().to(device)
criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Training
for epoch in range(5000):
    for batch_x, batch_y in loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        pred = model(batch_x)
        loss = criterion(pred, batch_y)
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'loss = ', '{:.6f}'.format(loss))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Testing

# Test
test_text = 'i hate me'
tests = [[word2idx[n] for n in test_text.split()]]
test_batch = torch.LongTensor(tests).to(device)

# Predict
model = model.eval()
predict = model(test_batch).data.max(1, keepdim=True)[1]
if predict[0][0] == 0:
    print(test_text, "is Bad Mean...")
else:
    print(test_text, "is Good Mean!")
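A slightly more defensive way to run inference (a sketch, not part of the original post) is to wrap the forward pass in torch.no_grad() so that no gradients are tracked:

model.eval()
with torch.no_grad():
    logits = model(test_batch)         # [1, num_classes]
    predict = logits.argmax(dim=1)     # predicted class index: 0 = bad, 1 = good
print(test_text, "is Bad Mean..." if predict.item() == 0 else "is Good Mean!")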

Complete code

'''
  code by Tae Hwan Jung(Jeff Jung) @graykode, modify by wmathor
'''
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data
import torch.nn.functional as F

dtype = torch.FloatTensor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 3 words sentences (=sequence_length is 3)
sentences = ["i love you", "he loves me", "she likes baseball", "i hate you", "sorry for that", "this is awful"]
labels = [1, 1, 1, 0, 0, 0]  # 1 is good, 0 is not good.

# TextCNN Parameter
embedding_size = 2
sequence_length = len(sentences[0].split()) # every sentence contains sequence_length (=3) words
num_classes = 2  # 0 or 1
batch_size = 3

word_list = " ".join(sentences).split()
vocab = list(set(word_list))
word2idx = {w: i for i, w in enumerate(vocab)}
vocab_size = len(vocab)

def make_data(sentences, labels):
    inputs = []
    for sen in sentences:
        inputs.append([word2idx[n] for n in sen.split()])

    targets = []
    for out in labels:
        targets.append(out)  # class indices, as expected by nn.CrossEntropyLoss
    return inputs, targets

input_batch, target_batch = make_data(sentences, labels)
input_batch, target_batch = torch.LongTensor(input_batch), torch.LongTensor(target_batch)

dataset = Data.TensorDataset(input_batch, target_batch)
loader = Data.DataLoader(dataset, batch_size, True)

class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.W = nn.Embedding(vocab_size, embedding_size)
        output_channel = 3
        self.conv = nn.Sequential(
            # conv : [input_channel(=1), output_channel, (filter_height, filter_width), stride=1]
            nn.Conv2d(1, output_channel, (2, embedding_size)),
            nn.ReLU(),
            # pool : ((filter_height, filter_width))
            nn.MaxPool2d((2, 1)),
        )
        # fc
        self.fc = nn.Linear(output_channel, num_classes)

    def forward(self, X):
        '''
        X: [batch_size, sequence_length]
        '''
        batch_size = X.shape[0]
        embedding_X = self.W(X)  # [batch_size, sequence_length, embedding_size]
        embedding_X = embedding_X.unsqueeze(1)  # add channel(=1): [batch_size, channel(=1), sequence_length, embedding_size]
        conved = self.conv(embedding_X)  # [batch_size, output_channel, 1, 1]
        flatten = conved.view(batch_size, -1)  # [batch_size, output_channel*1*1]
        output = self.fc(flatten)
        return output

model = TextCNN().to(device)
criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Training
for epoch in range(5000):
    for batch_x, batch_y in loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        pred = model(batch_x)
        loss = criterion(pred, batch_y)
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'loss =', '{:.6f}'.format(loss))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
# Test
test_text = 'i hate me'
tests = [[word2idx[n] for n in test_text.split()]]
test_batch = torch.LongTensor(tests).to(device)
# Predict
model = model.eval()
predict = model(test_batch).data.max(1, keepdim=True)[1]
if predict[0][0] == 0:
    print(test_text,"is Bad Mean...")
else:
    print(test_text,"is Good Mean!!")
