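Text preprocessing roughly consists of the following steps: removing stop words, mapping words to indices, padding or truncating, shuffling, and loading pre-trained word vectors.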
1. Stop Words
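# For English, NLTK ships a ready-made stop-word list
# (download it once with nltk.download('stopwords'))
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set gives fast membership tests
print(stop)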
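Once the stop-word set is available, removing stop words is just a membership filter over the tokens. The following is a minimal sketch, assuming plain whitespace tokenization and a tiny hand-written stop-word set; `STOP_WORDS` and `remove_stop_words` are illustrative names, not NLTK APIs, and in practice the set would come from `stopwords.words('english')`.

```python
# Tiny stand-in for the NLTK list so the sketch runs without downloading corpora.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop tokens found in stop_words."""
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in stop_words]

filtered = remove_stop_words("The quality of the model is a function of the data")
print(filtered)  # ['quality', 'model', 'function', 'data']
```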
2. To Word Index
# Tokenizer
import tensorflow as tf
# Only the most frequent words are kept. Index 0 is reserved for padding and index 1 is <UNK>,
# so every index produced stays below num_words; rarer words are mapped to <UNK>.
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=20000, oov_token='<UNK>')
tokenizer.fit_on_texts(train_text)
train_idxs = tokenizer.texts_to_sequences(train_text)
test_idxs = tokenizer.texts_to_sequences(test_text)
train_padded = tf.keras.preprocessing.sequence.pad_sequences(train_idxs,
                                                             maxlen=MAX_LENGTH,
                                                             padding='post',
                                                             truncating='post')  # side on which to pad and on which to truncate
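To make the effect of `padding='post'` and `truncating='post'` concrete, here is a small pure-Python sketch of the same idea. `pad_or_truncate` is a hypothetical helper written for illustration and is not part of Keras; it assumes 0 is the padding value.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Cut the sequence at the end, then fill it up at the end ('post' / 'post')."""
    trimmed = seq[:maxlen]                      # drop anything beyond maxlen
    return trimmed + [pad_value] * (maxlen - len(trimmed))

sequences = [[5, 2, 9], [7], [3, 1, 4, 1, 5, 9]]
padded = [pad_or_truncate(s, maxlen=4) for s in sequences]
print(padded)  # [[5, 2, 9, 0], [7, 0, 0, 0], [3, 1, 4, 1]]
```

With 'pre' instead of 'post', the zeros would go in front and the truncation would drop the beginning of the sequence instead of the end.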
The following lines are also very handy:
word2id = tokenizer.word_index  # dict mapping word -> index
id2word = {idx: word for word, idx in word2id.items()}  # inverse mapping, index -> word
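The two dictionaries make it easy to go from text to ids and back, which is useful for checking what the model actually sees. Below is a simplified stand-in for the tokenizer, written in plain Python; `build_word_index`, `encode`, and `decode` are hypothetical helpers, and the sketch does not replicate Keras exactly (Keras keeps every word in `word_index` and applies the `num_words` cut only when converting texts).

```python
from collections import Counter

def build_word_index(texts, num_words, oov_token="<UNK>"):
    """Index 0 is left for padding, 1 is the OOV token, then words by frequency."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    word2id = {oov_token: 1}
    for word, _ in counts.most_common(num_words - 2):
        word2id[word] = len(word2id) + 1
    return word2id

def encode(text, word2id, oov_token="<UNK>"):
    """Map each token to its index, falling back to the OOV index."""
    return [word2id.get(tok, word2id[oov_token]) for tok in text.lower().split()]

def decode(ids, id2word):
    """Map indices back to words; index 0 (padding) is skipped."""
    return " ".join(id2word[i] for i in ids if i != 0)

texts = ["the cat sat", "the cat ran", "the dog"]
word2id = build_word_index(texts, num_words=4)
id2word = {idx: word for word, idx in word2id.items()}

encoded = encode("the dog sat", word2id)
decoded = decode(encoded + [0, 0], id2word)
print(word2id)   # {'<UNK>': 1, 'the': 2, 'cat': 3}
print(encoded)   # [2, 1, 1]
print(decoded)   # the <UNK> <UNK>
```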
3. Shuffle
This section covers shuffling for small datasets that can be loaded into memory in full.
import numpy as np
np.random.shuffle(train_set)
np.random.shuffle(test_set)
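Or:
import pandas as pd
# train_data and test_data are placeholder names for pandas Series (or DataFrames) built beforehand
train = train_data.sample(frac=0.9)  # shuffles and subsamples at the same time (keeps 90% of the rows)
test = test_data.sample(frac=1.0)  # frac=1.0 keeps every row, so this is a pure shuffle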
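One thing to watch when features and labels live in separate arrays: shuffling each array independently breaks the pairing between a sample and its label. A common way around this is to draw one permutation and apply it to both. The sketch below assumes NumPy arrays of equal length; `shuffle_together` and the `seed` parameter are illustrative names, not taken from the original code.

```python
import numpy as np

def shuffle_together(features, labels, seed=0):
    """Apply the same random permutation to features and labels."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    perm = np.random.default_rng(seed).permutation(len(features))
    return features[perm], labels[perm]

x = np.arange(10).reshape(5, 2)   # row i is [2*i, 2*i + 1]
y = np.array([0, 1, 2, 3, 4])     # label i belongs to row i
x_shuf, y_shuf = shuffle_together(x, y)

# Every shuffled row still sits next to its original label.
print(np.array_equal(x_shuf[:, 0], 2 * y_shuf))
```

Fixing the seed makes the shuffle reproducible across runs, which helps when comparing experiments.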
4. Load Pre-trained Word Embedding
# First, load the pre-trained vector file
def loadGloVe(filename):
    vocab = []
    embd = []
    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:  # read the txt file line by line
            row = line.strip().split(' ')
            vocab.append(row[0])  # the first field is the word
            embd.append([float(v) for v in row[1:]])  # the remaining fields are its vector
    print('Loaded GloVe!')
    return vocab, embd
vocab, embd = loadGloVe(filename)
vocab_size = len(vocab)  # size of the vocabulary
embedding_dim = len(embd[0])  # dimension of the embeddings
print("Vocab size : ", vocab_size)
print("Embedding dimensions : ", embedding_dim)
# Convert the text into the corresponding ids based on vocab
# (tf.contrib and tf.placeholder are only available in TensorFlow 1.x)
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_LENGTH)
pretrain = vocab_processor.fit(vocab)  # fit on our loaded vocab
x_transform_train = vocab_processor.transform(x_train)  # train set
x_transform_test = vocab_processor.transform(x_test)  # test set
vocab = vocab_processor.vocabulary_
vocab_size_after_process = len(vocab)  # Note: this size differs from that of the vocab loaded earlier, because all non-word symbols are ignored and an <UNK> token is added
print("Vocab size after process:", vocab_size_after_process)
# TensorFlow embedding step
embedding_placeholder = tf.placeholder(tf.float32, [vocab_size_after_process, embedding_dim])  # fed to the graph through a placeholder
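The piece that connects the loaded vectors to the model is an embedding matrix whose row i holds the vector of the word with id i. The sketch below shows one way to build it with NumPy instead of `VocabularyProcessor`; it is an illustration under stated assumptions, not the original author's method. `load_vectors`, `build_embedding_matrix`, the in-memory `fake_glove` data, and the choice of small random vectors for words missing from the file are all assumptions made for the example.

```python
import io
import numpy as np

def load_vectors(handle):
    """Parse GloVe-style lines ('word v1 v2 ...') into a dict word -> vector."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def build_embedding_matrix(word2id, vectors, dim, seed=0):
    """Row i holds the vector for the word with id i; row 0 is the padding row."""
    rng = np.random.default_rng(seed)
    size = max(word2id.values()) + 1
    # Words without a pre-trained vector keep a small random initialization.
    matrix = rng.normal(scale=0.1, size=(size, dim)).astype(np.float32)
    matrix[0] = 0.0
    for word, idx in word2id.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

# Two made-up 3-dimensional vectors standing in for a real GloVe file.
fake_glove = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
vectors = load_vectors(fake_glove)

word2id = {"<UNK>": 1, "cat": 2, "bird": 3}
embedding_matrix = build_embedding_matrix(word2id, vectors, dim=3)
print(embedding_matrix.shape)  # (4, 3)
```

A matrix built this way has the shape `[vocab_size, embedding_dim]` expected by the placeholder above, so it could be passed in through `feed_dict` when initializing the embedding variable.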