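Today I'm sharing a popular article from Analytics Vidhya: Ultimate guide to deal with Text Data (using Python) – for Data Scientists & Engineers.
The article covers different ways to extract features from text data, ranging from basic methods to more advanced NLP techniques. It also covers text pre-processing, which helps us extract better features.
Using the Twitter sentiment dataset as its example, the article performs feature extraction in Python. Its main contents are outlined below; for the details, please follow the link to the article, as I won't reproduce everything here.
Contents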
1. Basic feature extraction using text data
Number of words
Number of characters
Average word length
Number of stopwords
Number of special characters
Number of numerics
Number of uppercase words
2. Basic Text Pre-processing of text data
Lower casing
Punctuation removal
Stopwords removal
Frequent words removal
Rare words removal
Spelling correction
Tokenization
Stemming
Lemmatization
3. Advance Text Processing
N-grams
Term Frequency
Inverse Document Frequency
Term Frequency-Inverse Document Frequency (TF-IDF)
Bag of Words
Sentiment Analysis
Word Embedding
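
To give a feel for what several of these items look like in code, here is a minimal sketch using only the Python standard library. It is not the article's own code: the function names (`basic_features`, `preprocess`, `ngrams`, `tf_idf`), the tiny `STOPWORDS` set, and the three sample tweets are all assumptions made for illustration, and "special characters" is taken here to mean hashtags. The TF-IDF variant shown is the plain `tf * log(N / df)` form; libraries often use smoothed versions that give different numbers. Spelling correction, stemming, lemmatization, sentiment analysis and word embeddings typically rely on external libraries and are left out.

```python
import math
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "for", "this", "i"}


def basic_features(text):
    """Count-based features corresponding to section 1 of the outline."""
    words = text.split()
    return {
        "word_count": len(words),
        "char_count": len(text),  # includes spaces
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "stopwords": sum(w.lower() in STOPWORDS for w in words),
        "hashtags": sum(w.startswith("#") for w in words),  # "special characters"
        "numerics": sum(w.isdigit() for w in words),
        "upper": sum(w.isupper() for w in words),
    }


def preprocess(text):
    """Lower-case, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOPWORDS]


def ngrams(tokens, n=2):
    """Return all contiguous n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per doc.

    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


# Hypothetical sample tweets, not taken from the dataset.
tweets = [
    "I LOVE this phone #happy",
    "The battery died in 2 hours!",
    "This phone is great, love the battery",
]

features = basic_features(tweets[0])
tokens = [preprocess(t) for t in tweets]   # tokens[0] == ['love', 'phone', 'happy']
bigrams = ngrams(tokens[0])                # [('love', 'phone'), ('phone', 'happy')]
weights = tf_idf(tokens)

print(features)
print(bigrams)
print(weights[0])
```

In this toy corpus "happy" appears in only one tweet, so it receives a higher TF-IDF weight in the first tweet than "love" or "phone", which each appear in two.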