2019-02 Text Preprocessing

Author: Hugo_Ng_7777 | Published 2019-02-26 21:16

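Text preprocessing roughly consists of the following steps: removing stop words, mapping words to indices, padding or truncating sequences, shuffling, and loading pre-trained word vectors.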

1. Stop Words

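# For English, nltk ships a curated stop-word list
# (the corpus has to be fetched once with nltk.download('stopwords'))
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set gives fast membership tests
print(stop)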
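
As a rough illustration of how such a set is used, the sketch below filters stop words out of a tokenized sentence. The tiny `STOP_WORDS` set and the `remove_stop_words` helper are assumptions made for this example so that it runs without downloading the nltk corpus; in practice the `stop` set built above would take its place.

```python
# Hypothetical stand-in for set(stopwords.words('english'))
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens with stop words removed (case-insensitive match)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

sentence = "The cat is a friend of the dog"
filtered = remove_stop_words(sentence.split())
print(filtered)
```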

2. To Word Index

import tensorflow as tf

# Tokenizer
# Only the most frequent words are kept, capped by num_words; all other words
# map to <UNK>, which takes up one slot of the vocabulary
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=20000, oov_token='<UNK>')
tokenizer.fit_on_texts(train_text)
train_idxs = tokenizer.texts_to_sequences(train_text)
test_idxs = tokenizer.texts_to_sequences(test_text)

# Pad or truncate every sequence to MAX_LENGTH
train_padded = tf.keras.preprocessing.sequence.pad_sequences(train_idxs,
                                                             maxlen=MAX_LENGTH,
                                                             padding='post',
                                                             truncating='post')  # which side to pad and which side to truncate
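
To make the two steps above concrete, here is a minimal pure-Python sketch of the same idea: build a frequency-ranked vocabulary, map unknown words to `<UNK>`, then pad or truncate at the end of each sequence. The helpers `build_word_index`, `to_sequences` and `pad_post` are hypothetical names for this illustration; they mirror the concept, not the exact index assignment or tokenization rules of the Keras classes.

```python
from collections import Counter

PAD_ID = 0  # index 0 is reserved for padding in this sketch
UNK = "<UNK>"

def build_word_index(texts, max_words):
    """Map the max_words most frequent words to ids, numbered after <UNK>."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {UNK: 1}
    for word, _ in counts.most_common(max_words):
        word_index[word] = len(word_index) + 1
    return word_index

def to_sequences(texts, word_index):
    """Replace each word by its id; unseen words become the <UNK> id."""
    unk_id = word_index[UNK]
    return [[word_index.get(w, unk_id) for w in text.lower().split()]
            for text in texts]

def pad_post(seqs, maxlen):
    """Truncate at the end, then pad at the end with PAD_ID."""
    return [seq[:maxlen] + [PAD_ID] * (maxlen - len(seq[:maxlen]))
            for seq in seqs]

train_text = ["the cat sat", "the dog sat on the mat"]
word_index = build_word_index(train_text, max_words=3)
padded = pad_post(to_sequences(train_text + ["the bird sat"], word_index), maxlen=4)
print(word_index)
print(padded)
```

Fitting the vocabulary on the training text only, and reusing it for unseen text, is the same discipline the Tokenizer code follows: "bird" never appeared during fitting, so it comes out as the `<UNK>` id.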

The following lines are also very handy:

word2id = tokenizer.word_index  # dict mapping word -> index
id2word = {idx: word for word, idx in word2id.items()}  # inverse mapping, index -> word
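
The inverse dictionary is what turns model inputs back into readable text. A small sketch, assuming a hypothetical hand-written `word2id` in place of `tokenizer.word_index` and a hypothetical `decode` helper:

```python
# Hypothetical mapping standing in for tokenizer.word_index
word2id = {"<UNK>": 1, "the": 2, "cat": 3, "sat": 4}
id2word = {idx: word for word, idx in word2id.items()}

def decode(ids, id2word, pad_id=0):
    """Turn a padded id sequence back into words, skipping the padding id."""
    return [id2word.get(i, "<UNK>") for i in ids if i != pad_id]

print(decode([2, 3, 4, 0, 0], id2word))
```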

3. Shuffle

We start with the small-data case, where the whole dataset fits in memory and can be shuffled there.

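import numpy as np
np.random.shuffle(train_set)  # shuffles in place along the first axis
np.random.shuffle(test_set)

Or, with pandas:

import pandas as pd
# train_df and test_df are assumed to be pandas Series or DataFrames holding the data
train = train_df.sample(frac=0.9)  # shuffles and subsamples at the same time
test = test_df.sample(frac=1.0)  # frac=1.0 shuffles without dropping any rows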
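
One caveat when features and labels live in separate containers: shuffling each one independently breaks their alignment. The sketch below, using only the standard library, applies one shared permutation to both. `shuffle_together` and the toy data are assumptions for illustration.

```python
import random

def shuffle_together(features, labels, seed=None):
    """Shuffle two equal-length lists with the same random permutation."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)  # one permutation, reused for both lists
    return [features[i] for i in order], [labels[i] for i in order]

texts = ["good movie", "bad plot", "great cast", "dull ending"]
labels = [1, 0, 1, 0]
shuffled_texts, shuffled_labels = shuffle_together(texts, labels, seed=42)
print(list(zip(shuffled_texts, shuffled_labels)))
```

Passing a fixed `seed` makes the order reproducible between runs, which helps when comparing experiments.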

4. Load Pre-trained Word Embedding

# First load the pre-trained vector file
def loadGloVe(filename):
    vocab = []
    embd = []
    # Each line of the txt file holds a word followed by its vector components
    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:
            row = line.strip().split(' ')
            vocab.append(row[0])
            embd.append([float(x) for x in row[1:]])  # parse components as floats
    print('Loaded GloVe!')
    return vocab, embd

vocab, embd = loadGloVe(filename)
vocab_size = len(vocab)  # size of the vocabulary
embedding_dim = len(embd[0])  # dimension of the embeddings
print("Vocab size : ", vocab_size)
print("Embedding dimensions : ", embedding_dim)
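
The GloVe text format is one word per line followed by its space-separated vector components. To show the parsing without needing the real file, this sketch reads two made-up lines from an in-memory buffer; `parse_glove_lines` and the sample values are illustrative assumptions, not real GloVe vectors.

```python
import io

def parse_glove_lines(lines):
    """Parse 'word v1 v2 ...' lines into a word list and a list of float vectors."""
    vocab, embd = [], []
    for line in lines:
        row = line.strip().split(' ')
        if len(row) < 2:  # skip blank or malformed lines
            continue
        vocab.append(row[0])
        embd.append([float(x) for x in row[1:]])
    return vocab, embd

# Made-up 3-dimensional vectors, for illustration only
fake_file = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
vocab, embd = parse_glove_lines(fake_file)
print(vocab, len(embd[0]))
```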

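# Convert the text to ids according to vocab
# Note: tf.contrib exists only in TensorFlow 1.x; it was removed in TensorFlow 2.x
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_LENGTH)
pretrain = vocab_processor.fit(vocab)  # fit on our vocab
# transform yields one id array per document, so collect the results into an array
x_transform_train = np.array(list(vocab_processor.transform(x_train)))  # train set
x_transform_test = np.array(list(vocab_processor.transform(x_test)))  # test set

vocab = vocab_processor.vocabulary_
vocab_size_after_process = len(vocab)  # Note: this size differs from the loaded vocab, since non-word symbols are ignored and an <UNK> token is added
print("Vocab size after process:", vocab_size_after_process)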

# TensorFlow embedding step (TensorFlow 1.x graph mode)
embedding_placeholder = tf.placeholder(tf.float32, [vocab_size_after_process, embedding_dim])  # the pre-trained matrix is fed to the graph through this placeholder
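
The matrix fed through the placeholder has to be aligned with the processed vocabulary: row i must hold the vector for the word with id i. The post does not show that step, so the following is only a sketch of one common approach, in which words missing from the pre-trained file (such as `<UNK>`) get an all-zero vector. `build_embedding_matrix` and the toy inputs are assumptions.

```python
def build_embedding_matrix(word2id, pretrained, dim):
    """Return a list of rows where row i is the vector of the word with id i.

    Words absent from `pretrained` keep an all-zero row.
    """
    matrix = [[0.0] * dim for _ in range(max(word2id.values()) + 1)]
    for word, idx in word2id.items():
        if word in pretrained:
            matrix[idx] = list(pretrained[word])
    return matrix

# Toy inputs, for illustration only
word2id = {"<UNK>": 0, "cat": 1, "dog": 2}
pretrained = {"cat": [0.1, 0.2], "dog": [0.3, 0.4], "fish": [0.5, 0.6]}
matrix = build_embedding_matrix(word2id, pretrained, dim=2)
print(matrix)
```

In a TensorFlow 1.x setup like the one above, a matrix of this shape would typically be supplied as the value for `embedding_placeholder` when running the graph.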

Link to this article: https://www.haomeiwen.com/subject/wsdmyqtx.html
