> library(pacman)
> p_load(dplyr, caret, chinese.misc, stringr, tidytext)
The naive Bayes classifier is a simple directed graphical model.
Task: predict the sentiment of movie reviews.
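As background: naive Bayes assumes the words in a review are conditionally independent given the class, so the posterior factorizes as

P(sentiment | w1, ..., wn) ∝ P(sentiment) · P(w1 | sentiment) · ... · P(wn | sentiment)

and the predicted class is whichever sentiment yields the larger posterior.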
1. Data preparation
The training set of the Large Movie Review Dataset (aclImdb) contains 25,000 reviews.
> path_pos <- "data_set/aclImdb/train/pos/"
> path_neg <- "data_set/aclImdb/train/neg/"
>
> # list all file names under the folder
> pos_files <- dir_or_file(path_pos)
> head(pos_files)
## [1] "/home/ez/R/data_set/aclImdb/train/pos/0_9.txt" "/home/ez/R/data_set/aclImdb/train/pos/1_7.txt"
## [3] "/home/ez/R/data_set/aclImdb/train/pos/10_9.txt" "/home/ez/R/data_set/aclImdb/train/pos/100_7.txt"
## [5] "/home/ez/R/data_set/aclImdb/train/pos/1000_8.txt" "/home/ez/R/data_set/aclImdb/train/pos/10000_8.txt"
The number at the end of each file name (the digits after the underscore) is the review's rating; ratings fall in the ranges 1~4 (negative) or 7~10 (positive).
> # first extract the underscore and the digits that follow it,
> # then strip the underscore and convert to numeric
> get_score <- function(x) {
+ x <- stringr::str_extract(x, "_\\d+")
+ x <- stringr::str_remove(x, "_")
+ x <- as.numeric(x)
+ return(x)
+ }
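A quick sanity check on one of the file names above:

> get_score("data_set/aclImdb/train/pos/0_9.txt")
## [1] 9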
Next, build a function that reads a file's content and returns a data frame: first column the review text, second column the rating.
> # read one review file and pair its text with its rating
> read_fun <- function(x) {
+ txt <- readLines(x)
+ score <- get_score(x)
+ df <- tibble(content = txt,
+ score = score)
+ return(df)
+ }
Read every review under the pos folder together with its rating:
> nb_pos <- lapply(pos_files, read_fun) %>%
+ do.call(bind_rows, .)
>
> head(nb_pos)
## # A tibble: 6 x 2
## content score
## <chr> <dbl>
## 1 "Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, … 9
## 2 "If you like adult comedy cartoons, like South Park, then this is nearly a similar format about the sm… 7
## 3 "I'm a male, not given to women's movies, but this is really a well done special story. I have no pers… 9
## 4 "Scott Bartlett's 'OffOn' is nine minutes of pure craziness. It is a full-frontal assault of psychedel… 7
## 5 "I liked the film. Some of the action scenes were very interesting, tense and well done. I especially … 8
## 6 "Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan … 8
Read all files under the neg folder in the same way.
> neg_files <- dir_or_file(path_neg)
> head(neg_files)
## [1] "/home/ez/R/data_set/aclImdb/train/neg/0_3.txt" "/home/ez/R/data_set/aclImdb/train/neg/1_1.txt"
## [3] "/home/ez/R/data_set/aclImdb/train/neg/10_2.txt" "/home/ez/R/data_set/aclImdb/train/neg/100_3.txt"
## [5] "/home/ez/R/data_set/aclImdb/train/neg/1000_4.txt" "/home/ez/R/data_set/aclImdb/train/neg/10000_4.txt"
> nb_neg <- lapply(neg_files, read_fun) %>%
+ do.call(bind_rows, .)
>
> head(nb_neg)
## # A tibble: 6 x 2
## content score
## <chr> <dbl>
## 1 "Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrif… 3
## 2 "Robert DeNiro plays the most unbelievably intelligent illiterate of all time. This movie is so wastef… 1
## 3 "This film had a lot of promise, and the plot was relatively interesting, however the actors, director… 2
## 4 "OK its not the best film I've ever seen but at the same time I've been able to sit and watch it TWICE… 3
## 5 "The plot for Descent, if it actually can be called a plot, has two noteworthy events. One near the be… 4
## 6 "Airport '77 starts as a brand new luxury 747 plane is loaded up with valuable paintings & such belong… 4
Combine the two data frames into one and derive the class from the rating: reviews scoring >= 5 are labeled positive, those below 5 negative. (The label could equally be derived from the folder, since every review under pos is positive.)
> nb_all <- bind_rows(nb_pos, nb_neg) %>%
+ mutate(sentiment = ifelse(score < 5, "negative", "positive"))
> str(nb_all)
## tibble [25,000 × 3] (S3: tbl_df/tbl/data.frame)
## $ content : chr [1:25000] "Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as \""| __truncated__ "If you like adult comedy cartoons, like South Park, then this is nearly a similar format about the small advent"| __truncated__ "I'm a male, not given to women's movies, but this is really a well done special story. I have no personal love "| __truncated__ "Scott Bartlett's 'OffOn' is nine minutes of pure craziness. It is a full-frontal assault of psychedelic, pulsat"| __truncated__ ...
## $ score : num [1:25000] 9 7 9 7 8 8 10 7 8 8 ...
## $ sentiment: chr [1:25000] "positive" "positive" "positive" "positive" ...
2. Remove digits, punctuation, and extra whitespace
Remove digits and punctuation, convert everything to lower case, then collapse runs of whitespace into single spaces.
> pre_fun <- function(string) {
+ string <- str_remove_all(string, "[[:digit:]]")
+ string <- str_remove_all(string, "[[:punct:]]")
+ string <- str_to_lower(string)
+ string <- str_replace_all(string, "\\s+", " ")
+ return(string)
+ }
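The stringr functions are vectorized, so pre_fun can also be applied to the whole column directly (the lapply/unlist detour below works but is not required). A spot check on a made-up review:

> pre_fun("Great movie!!! 10/10, would watch again.")
## [1] "great movie would watch again"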
>
> nb_all$content <- nb_all$content %>%
+ lapply(pre_fun) %>%
+ unlist()
>
> # show one randomly chosen processed review
> sample_n(nb_all, 1)[, 1]
## # A tibble: 1 x 1
## content
## <chr>
## 1 this movie was excellent i was not expecting it to live up to all the hype but it did like all the bourne mo…
> # add an id column and move it to the front
> nb_all <- mutate(nb_all, id = 1:nrow(nb_all)) %>%
+ select(id, everything())
3. Create word counts
Count how many times each word occurs in each document (keyed by id).
> nb_words <- nb_all %>%
+ unnest_tokens(words, content) %>%
+ count(words, id, sort = T)
>
> str(nb_words)
## tibble [3,480,049 × 3] (S3: tbl_df/tbl/data.frame)
## $ words: chr [1:3480049] "the" "the" "the" "the" ...
## $ id : int [1:3480049] 54 1948 4350 5921 3083 16037 16035 17204 1948 1014 ...
## $ n : int [1:3480049] 198 198 138 128 108 103 99 99 98 97 ...
4. Remove English stop words
> stop_words <- readr::read_csv("dict/english_stopword.txt", col_names = F) %>%
+ setNames("words")
> str(stop_words)
## tibble [1,435 × 1] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ words: chr [1:1435] "à" "able" "about" "above" ...
## - attr(*, "spec")=
## .. cols(
## .. X1 = col_character()
## .. )
> nb_words <- nb_words %>%
+ anti_join(stop_words, by = "words")
> str(nb_words)
## tibble [1,818,205 × 3] (S3: tbl_df/tbl/data.frame)
## $ words: chr [1:1818205] "br" "br" "br" "br" ...
## $ id : int [1:1818205] 18564 21701 24451 18360 12526 1095 14875 18438 5025 20903 ...
## $ n : int [1:1818205] 92 86 84 72 70 68 60 60 58 54 ...
The word table shrinks from 3,480,049 rows to 1,818,205. "br" is left over from HTML line-break tags, so remove it manually.
> usr_words <- tibble(words = c("br", "aa"))
> nb_words <- nb_words %>%
+ anti_join(usr_words, by = "words")
> str(nb_words)
## tibble [1,803,533 × 3] (S3: tbl_df/tbl/data.frame)
## $ words: chr [1:1803533] "rob" "titanic" "victor" "film" ...
## $ id : int [1:1803533] 6262 54 13037 20332 22218 1948 23221 1948 1948 5323 ...
## $ n : int [1:1803533] 34 34 31 27 27 27 27 26 25 25 ...
5. Create the document-term matrix (DTM)
> nb_dtm <- cast_dtm(data = nb_words,
+ document = id,
+ term = words,
+ value = n)
> dim(nb_dtm)
## [1] 25000 107753
The document-term matrix has 25,000 rows (documents) and 107,753 columns (terms).
> nb_dtm
## <<DocumentTermMatrix (documents: 25000, terms: 107753)>>
## Non-/sparse entries: 1803533/2692021467
## Sparsity : 100%
## Maximal term length: 64
## Weighting : term frequency (tf)
Non-/sparse entries: 1803533/2692021467 is the ratio of non-zero to zero entries: of the 25,000 × 107,753 = 2,693,825,000 cells, only 1,803,533 are non-zero.
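A quick arithmetic check that the two counts account for the whole matrix:

> 25000 * 107753 # total cells
## [1] 2693825000
> 2693825000 - 1803533 # zero (sparse) entries
## [1] 2692021467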
To reduce the number of columns (features), we can drop sparse terms (many terms occur only once or twice in the corpus and carry little information about sentiment) and apply stemming (movie and movies express the same meaning); stemming is sketched below. PCA is yet another option.
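This walkthrough does not actually stem, but as a rough sketch, stemming the word column with the SnowballC package and re-aggregating would look something like this (re-counting is needed because distinct words can collapse to the same stem):

> # hypothetical stemming pass, not applied below
> nb_words_stemmed <- nb_words %>%
+ mutate(words = SnowballC::wordStem(words, language = "english")) %>%
+ group_by(id, words) %>%
+ summarise(n = sum(n), .groups = "drop")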
Use the tm package to drop the terms that appear in fewer than 1% of the documents:
> nb_dtm <- tm::removeSparseTerms(nb_dtm, sparse = 0.99)
> dim(nb_dtm)
## [1] 25000 1227
The number of columns drops sharply to 1,227, a much more reasonable feature count relative to the number of rows.
Inspect a few rows and columns:
> tm::inspect(nb_dtm[10:15, 20:28])
## <<DocumentTermMatrix (documents: 6, terms: 9)>>
## Non-/sparse entries: 13/41
## Sparsity : 76%
## Maximal term length: 9
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs ben book character dr flick guy james la story
## 17466 0 0 0 0 0 1 0 0 6
## 21540 0 1 0 0 0 0 0 0 2
## 22439 0 1 2 0 0 0 0 0 2
## 22476 0 0 2 0 0 3 0 0 0
## 23202 0 0 0 0 0 0 0 0 2
## 9686 0 2 1 0 0 0 0 0 1
6. Convert to a tibble
The training function expects a data frame as input, so convert the DTM into one.
> nb_data <- as_tibble(as.matrix(nb_dtm))
> nb_data[10:15, 20:28]
## # A tibble: 6 x 9
## guy story ben book character dr flick james la
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 2 0 1 2 0 0 0 0
## 2 3 0 0 0 2 0 0 0 0
## 3 0 2 0 0 0 0 0 0 0
## 4 0 2 0 1 0 0 0 0 0
## 5 0 1 0 2 1 0 0 0 0
## 6 1 6 0 0 0 0 0 0 0
Bind the sentiment column back on, converting it to a factor to serve as the response; this yields the data set we need.
> nb_data <- nb_data %>%
+ bind_cols(sentiment = as.factor(nb_all$sentiment))
>
> table(nb_data$sentiment)
##
## negative positive
## 12500 12500
7. Split into training and test sets
> set.seed(123)
> ind <- createDataPartition(nb_data$sentiment, p = 0.8, list = F)
> dtrain <- nb_data[ind, ]
> dtest <- nb_data[-ind, ]
8. Train the naive Bayes model
> fit.nb <- train(sentiment ~ ., data = dtrain, method = "naive_bayes")
> fit.nb
## Naive Bayes
##
## 20000 samples
## 1227 predictor
## 2 classes: 'negative', 'positive'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 20000, 20000, 20000, 20000, 20000, 20000, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.7312269 0.4623403
## TRUE 0.5003438 0.0000000
##
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning parameter 'adjust' was held constant at
## a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = FALSE and adjust = 1.
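Note that laplace was held constant at 0 above. As a sketch (the grid values are illustrative, not from the original run), cross-validation with an explicit tuning grid could search over Laplace smoothing as well:

> grid <- expand.grid(laplace = c(0, 0.5, 1), usekernel = FALSE, adjust = 1)
> fit.nb2 <- train(sentiment ~ ., data = dtrain, method = "naive_bayes",
+ trControl = trainControl(method = "cv", number = 5),
+ tuneGrid = grid)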
Accuracy:
> # training-set accuracy
> mean(predict(fit.nb, newdata = dtrain) == dtrain$sentiment)
## [1] 0.73725
> # test-set accuracy
> mean(predict(fit.nb, newdata = dtest) == dtest$sentiment)
## [1] 0.735
Accuracy is roughly 73.5%.
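For a fuller picture than overall accuracy, caret's confusionMatrix() reports the per-class confusion counts plus sensitivity and specificity:

> confusionMatrix(predict(fit.nb, newdata = dtest), dtest$sentiment)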
Since each review also carries a numeric score, a regression model fitted on score instead of sentiment could predict the rating itself.
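A minimal sketch of that idea, reusing the split index ind from above (glmnet is an illustrative choice of method, not from the original text):

> reg_data <- nb_data %>%
+ select(-sentiment) %>%
+ bind_cols(score = nb_all$score)
> fit.reg <- train(score ~ ., data = reg_data[ind, ], method = "glmnet")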