Processing IMDB data with torchtext

Author: 我的昵称违规了 | Published 2019-07-02 00:26

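    Thanks to that blog post. I had been wondering whether torchtext could handle this dataset, and my own attempt did not work. Yesterday a search turned up this tutorial, and it was really useful.
    First, let's look at the preprocessing flow I used before.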

    [Image: the earlier preprocessing flow]

    The word2vec model was already trained earlier, so that step is not repeated here.

    import pandas as pd
    import numpy as np
    import spacy
    
    # Read data from files 
    train_data = pd.read_csv( "./drive/My Drive/NLPdata/train.tsv", header=0, delimiter="\t", quoting=3,encoding='latin-1' )
    test_data = pd.read_csv( "./drive/My Drive/NLPdata/test.tsv", header=0, delimiter="\t", quoting=3,encoding='latin-1')
    # unlabeled_train = pd.read_csv( "./train01.tsv", header=0, delimiter="\t", quoting=3,encoding='latin-1' )
    
    # Report how many phrases were read from each file (the test file has no labels)
    print("Read %d labeled train reviews and %d test reviews" % (train_data["Phrase"].size, test_data["Phrase"].size))
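    The quoting=3 argument in the read_csv calls above is the numeric value of csv.QUOTE_NONE, which tells pandas to treat quote characters as ordinary text. A minimal sketch on an in-memory sample follows; the two rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row sample in the same tab-separated layout as train.tsv
sample = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    '1\t1\tA "good" movie\t3\n'
    "2\t1\tgood\t3\n"
)

# quoting=csv.QUOTE_NONE (the integer 3) keeps quote characters as literal text
df = pd.read_csv(io.StringIO(sample), header=0, delimiter="\t", quoting=csv.QUOTE_NONE)
print("Read %d phrases" % df["Phrase"].size)
print(df["Phrase"][0])
```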
    

    Load the previously generated word2vec model

    import logging
    import gensim
    from gensim.models import word2vec
    model=gensim.models.KeyedVectors.load_word2vec_format("./drive/My Drive/NLPdata/word2Vec03.bin",binary=True)
    
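    # index2word is the gensim < 4.0 attribute (renamed index_to_key in gensim 4.x)
    index2word=model.index2word
    print(len(index2word))
    # A set gives fast membership tests in the averaging step below
    index2word_set=set(index2word)
    print(len(index2word_set))
    print(model)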
    

    Process the corpus data

    This includes sentence splitting, tokenization, lowercasing, and so on.

    # text: a phrase from the corpus that has already been split into words
    # model: the word2vec model loaded above
    # num_features: dimensionality of each word vector in the model (200 here)
    def word2vec(text,model,num_features):
        # Use num_features rather than a hard-coded 200
        featureVec = np.zeros((num_features,),dtype="float32")
        nwords=0
        for word in text:
            if word in index2word_set:
                nwords+=1
                featureVec=np.add(featureVec,model[word])
        # If no word is in the vocabulary, nwords is 0 and this division
        # produces NaN; those entries are set to 0 in a later step
        featureVec = np.divide(featureVec,nwords)
        return featureVec
    # print(word2vec(token))
    def getAvgFeatureVecs(phrases,model,num_features):
        # Build one averaged vector per phrase
        counter=0
        phraseFeatureVecs = np.zeros((len(phrases),num_features),dtype="float32")
        for phrase in phrases:
            if counter % 2000==0:
                print("Phrase %d of %d" % (counter, len(phrases)))
            phraseFeatureVecs[counter]=word2vec(phrase, model, num_features)
            counter = counter+1
        return phraseFeatureVecs
    
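    from nltk.corpus import stopwords
    import re
    def phrase_to_wordlist(phrase, remove_stopwords=False):
        # Replace every non-letter character with a space, then lowercase and split
        phrase_text = re.sub("[^a-zA-Z]"," ", phrase)
        words = phrase_text.lower().split()
        # Stop-word removal is commented out, so remove_stopwords currently has no effect
    #     if remove_stopwords:
    #         stops = set(stopwords.words("english"))
    #         words = [w for w in words if not w in stops]
        return words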
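    To make the cleanup and averaging steps concrete, here is a minimal sketch that uses a tiny made-up embedding table in place of the trained word2vec model. The names toy_vectors and average_vector are hypothetical, and unlike the function above this version returns zeros rather than NaN when no word is known.

```python
import re

import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the word2vec model
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0], dtype="float32"),
    "movie": np.array([3.0, 2.0, 0.0], dtype="float32"),
}

def phrase_to_wordlist(phrase):
    # Same cleanup as in the article: keep letters only, lowercase, split on whitespace
    return re.sub("[^a-zA-Z]", " ", phrase).lower().split()

def average_vector(words, vectors, num_features):
    # Collect the vectors of the known words and average them
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # No known word: return zeros instead of dividing by zero
        return np.zeros((num_features,), dtype="float32")
    return np.mean(known, axis=0)

words = phrase_to_wordlist("A GOOD movie, 10/10!")
print(words)
print(average_vector(words, toy_vectors, 3))
print(average_vector(["unknown"], toy_vectors, 3))
```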
    
    
    

    Process the training and test sets

    clean_train_phrases = []
    for phrase in train_data["Phrase"]:
        clean_train_phrases.append( phrase_to_wordlist( phrase, remove_stopwords=True ))
        
    num_features=200
    trainDataVecs = getAvgFeatureVecs( clean_train_phrases, model, num_features )
    
    clean_test_phrases = []
    for phrase in test_data["Phrase"]:
        clean_test_phrases.append( phrase_to_wordlist( phrase, remove_stopwords=True ))
        
    num_features=200
    testDataVecs = getAvgFeatureVecs( clean_test_phrases, model, num_features )
    
    

    Replace empty (NaN) values in the vectorized data

    # A phrase with no in-vocabulary words averages to NaN (division by zero); set those entries to 0
    # np.isnan(trainDataVecs).any()
    trainDataVecs[np.isnan(trainDataVecs)] = 0
    testDataVecs[np.isnan(testDataVecs)] = 0
    print(trainDataVecs[3])
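    The boolean-mask assignment used above can be tried on a small array. This is only an illustration with made-up values.

```python
import numpy as np

# Row 1 mimics a phrase with no known words: a zero vector divided by a count of 0
vecs = np.zeros((2, 3), dtype="float32")
vecs[0] = [0.5, 1.0, 1.5]
with np.errstate(invalid="ignore"):
    vecs[1] = np.divide(np.zeros(3, dtype="float32"), 0)

print(np.isnan(vecs).any())
# np.isnan builds a boolean mask; assigning through it overwrites only the NaN entries
vecs[np.isnan(vecs)] = 0
print(np.isnan(vecs).any())
```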
    

    Next, let's see how torchtext handles the same data. Comparing the two, I find the torchtext version noticeably more elegant.

    Read the data

    import pandas as pd
    data=pd.read_csv(r'C:\Users\jwc19\Desktop\sentiment-analysis-on-movie-reviews\train.tsv',sep='\t')
    test=pd.read_csv(r'C:\Users\jwc19\Desktop\sentiment-analysis-on-movie-reviews\test.tsv',sep='\t')
    data.head()
    

    Split the dataset with sklearn

    Split the labeled training data into a training set and a validation set at a ratio of 8:2

    from sklearn.model_selection import train_test_split
    train,val=train_test_split(data,test_size=0.2)
    train.to_csv("train.csv",index=False)
    val.to_csv('val.csv',index=False)
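    As a quick check of the 8:2 ratio, the same call can be run on a small made-up DataFrame. The random_state argument is an addition in this sketch, used only to make the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame with the same column names as the real training file
toy = pd.DataFrame({
    "Phrase": ["phrase %d" % i for i in range(10)],
    "Sentiment": [i % 5 for i in range(10)],
})

# test_size=0.2 holds out 20% of the rows for validation
toy_train, toy_val = train_test_split(toy, test_size=0.2, random_state=0)
print(len(toy_train), len(toy_val))
```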
    

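    Build the tokenizer and define the Fields

    Torchtext takes a declarative approach to loading data: you tell torchtext what you want the data to look like, and torchtext handles the rest.
    The declaration is made through a Field, which specifies how a given column of data should be processed.

    By default, a Field expects its input to be a sequence of words and maps each word to an integer.
    This mapping is called the vocab. If a field is already numeric and is not a sequence,
    set use_vocab=False and sequential=False.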

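    import spacy
    import torch
    # Note: this import rebinds the name "data", which held the pandas DataFrame above.
    # Field and TabularDataset moved to torchtext.legacy in torchtext 0.9 and were removed in 0.12.
    from torchtext import data, datasets
    from torchtext.vocab import Vectors
    from torch.nn import init
    
    # Fall back to the CPU when no GPU is available
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # "en" is the spaCy 2.x shortcut; spaCy 3.x needs a full model name such as "en_core_web_sm"
    spacy_en=spacy.load("en")
    def tokenize_en(text):
        return [tok.text for tok in spacy_en.tokenizer(text)]
    
    # The label is already an integer class, so it needs neither tokenization nor a vocab
    label=data.Field(sequential=False, use_vocab=False)
    # The phrase text is tokenized with spaCy and lowercased
    text=data.Field(sequential=True, tokenize=tokenize_en,lower=True)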
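    To illustrate what a sequential Field with a vocab does conceptually, here is a plain-Python sketch. It is not torchtext's implementation: the whitespace tokenizer, the build_vocab and numericalize helpers, and the two special tokens are simplifying assumptions made for this example.

```python
from collections import Counter

def tokenize(text):
    # Stand-in for the spaCy tokenizer: lowercase and split on whitespace
    return text.lower().split()

def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    # Order tokens by descending frequency, ties broken alphabetically
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials) + sorted(counts, key=lambda t: (-counts[t], t))
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    # Unknown tokens fall back to the index of "<unk>"
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

phrases = ["A good movie", "a bad movie"]
tokenized = [tokenize(p) for p in phrases]
itos, stoi = build_vocab(tokenized)
print(itos)
print(numericalize(tokenize("a good film"), stoi))
```

    Here "film" never appeared when the vocab was built, so it maps to the index of the unknown token.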
    

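    Define the Dataset

    The Fields know what to do when given raw data. Now we need to tell the Fields which data to process, and this is what Datasets are for.

    Torchtext has many built-in Datasets for handling various data formats.

    The official documentation describes TabularDataset as follows: Defines a Dataset of columns stored in CSV, TSV, or JSON format.

    TabularDataset handles csv/tsv files easily, so we use it to build the Datasets.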

    train, val=data.TabularDataset.splits(
        path=r'C:\Users\jwc19\Desktop\2001_2018jszyfz\code',
        train='train.csv',
        validation='val.csv',
        format='csv',
        skip_header=True,
        fields=[
            ('PhraseId',None),
            ('SentenceId',None),
            ('Phrase',text),
            ('Sentiment',label)
        ]
    )
    
    # splits() returns a tuple even for a single file, so unpack it with a trailing comma
    test,=data.TabularDataset.splits(
        path=r'C:\Users\jwc19\Desktop\sentiment-analysis-on-movie-reviews',
        test='test.tsv',
        format='tsv',
        skip_header=True,
        fields=[
            ('PhraseId',None),
            ('SentenceId',None),
            ('Phrase',text),
        ]
    )
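    The fields list pairs each column, in file order, with the Field that should process it, and None means the column is skipped. The following plain-Python sketch mimics that mapping with the csv module. The read_tabular helper and the sample rows are hypothetical, and it returns plain dictionaries rather than torchtext Example objects.

```python
import csv
import io

def read_tabular(handle, fields, delimiter=",", skip_header=True):
    # fields is a list of (column_name, processor); a None processor drops the column
    reader = csv.reader(handle, delimiter=delimiter)
    if skip_header:
        next(reader)
    examples = []
    for row in reader:
        example = {}
        for (name, processor), value in zip(fields, row):
            if processor is not None:
                example[name] = processor(value)
        examples.append(example)
    return examples

sample = io.StringIO(
    "PhraseId,SentenceId,Phrase,Sentiment\n"
    "1,1,A good movie,3\n"
    "2,1,good,3\n"
)
fields = [
    ("PhraseId", None),
    ("SentenceId", None),
    ("Phrase", lambda s: s.lower().split()),
    ("Sentiment", int),
]
examples = read_tabular(sample, fields)
print(examples[0])
```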
    

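    Build the vocab

    Torchtext can convert words to integers, but it has to be told the full set of words it should handle. Pretrained GloVe vectors are used here, and the library downloads them for you.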

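    text.build_vocab(train,vectors='glove.6B.100d')
    # Note: the vectors are already built at this point, so this assignment likely has no effect.
    # To control how out-of-vocabulary words are initialized, pass unk_init=... to build_vocab instead.
    text.vocab.vectors.unk_init = init.xavier_uniform
    
    # itos maps index -> token, stoi maps token -> index
    print(text.vocab.itos[1510])
    print(text.vocab.stoi['bore'])
    # Word-vector matrix: text.vocab.vectors
    print(text.vocab.vectors.shape)
    word_vec = text.vocab.vectors[text.vocab.stoi['bore']]
    print(word_vec.shape)
    print(word_vec)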
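    Conceptually, the vocab's vector matrix has one row per entry of itos, so looking up a word means indexing the matrix with stoi[word]. A minimal NumPy sketch of that alignment follows. The tiny pretrained table is made up, and filling missing words with zero rows is an assumption of this example.

```python
import numpy as np

# Hypothetical pretrained table standing in for GloVe, with 4-dimensional vectors
pretrained = {
    "good": np.array([0.1, 0.2, 0.3, 0.4], dtype="float32"),
    "bore": np.array([0.5, 0.6, 0.7, 0.8], dtype="float32"),
}

itos = ["<unk>", "<pad>", "good", "bore", "film"]
stoi = {tok: i for i, tok in enumerate(itos)}

# One row per vocabulary entry; words missing from the table keep a zero row
vectors = np.zeros((len(itos), 4), dtype="float32")
for tok, idx in stoi.items():
    if tok in pretrained:
        vectors[idx] = pretrained[tok]

print(vectors.shape)
word_vec = vectors[stoi["bore"]]
print(word_vec.shape)
print(word_vec)
```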
    


        Article link: https://www.haomeiwen.com/subject/ltnocctx.html