The main technique for topic modeling is Latent Dirichlet Allocation (LDA), which assumes that the topics and word distributions found in the documents arise from hidden multinomial distributions whose parameters were sampled in advance from Dirichlet distributions. Topic modeling can be viewed as a form of clustering.
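To make that generative story concrete, here is a toy sketch of how LDA assumes a single document is produced; the vocabulary, alpha, and the topic-word matrix beta are all invented for illustration:
> # toy illustration of LDA's generative assumption (all values made up)
> set.seed(1)
> k <- 3                                   # number of topics
> vocab <- c("ball", "vote", "profit", "film", "match")
> alpha <- rep(0.1, k)                     # Dirichlet prior over topics
> # draw the document's topic proportions: normalized gamma draws give
> # a Dirichlet(alpha) sample
> g <- rgamma(k, shape = alpha)
> theta <- g / sum(g)
> # each row of beta is one topic's multinomial distribution over words
> beta <- matrix(c(0.5, 0.0, 0.0, 0.1, 0.4,
+                  0.0, 0.6, 0.2, 0.2, 0.0,
+                  0.1, 0.1, 0.6, 0.1, 0.1), nrow = k, byrow = TRUE)
> # for each word: pick a topic from theta, then a word from that topic
> z <- sample(k, size = 10, replace = TRUE, prob = theta)
> doc <- sapply(z, function(t) sample(vocab, 1, prob = beta[t, ]))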
> library(pacman)
> p_load(dplyr, tm, Matrix, topicmodels)
1. Data Preparation
The bbc dataset contains 2,225 articles grouped into five topics: business, entertainment, politics, sport, and tech.
The bbcsport dataset contains 737 articles exclusively about sport, which can be grouped by the sport they describe into five categories: athletics, cricket, football, rugby, and tennis.
Each folder contains three files: the .mtx file stores the term-document matrix as a sparse matrix; the .terms file contains the actual terms, one per line, and these are the row names of the term-document matrix; similarly, the .docs file contains the column (document) names of the term-document matrix.
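For orientation, a .mtx file is a plain-text MatrixMarket file: a header line, then a line giving the matrix dimensions and the number of non-zero entries, then one `row column value` triplet per entry. Schematically (the triplets below are made up; the dimension line matches the bbc matrix loaded next):
%%MatrixMarket matrix coordinate real general
9635 2225 286774
1 1 1
2 1 5
...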
> bbc_folder <- "data_set/bbc"
> bbcsport_folder <- "data_set/bbc/bbcsport"
The readMM() function from the Matrix package loads the data into a sparse matrix object, and the as.TermDocumentMatrix() function from the tm package converts that object into a term-document matrix.
> bbc_mat <- readMM(paste0(bbc_folder, "/bbc.mtx"))
> # the weighting argument sets the weighting scheme; here, raw term frequency
> bbc_tdm <- as.TermDocumentMatrix(bbc_mat, weighting = weightTf)
> bbc_tdm
## <<TermDocumentMatrix (terms: 9635, documents: 2225)>>
## Non-/sparse entries: 286774/21151101
## Sparsity : 99%
## Maximal term length: NA
## Weighting : term frequency (tf)
> bbcsport_mat <- readMM(paste0(bbcsport_folder, "/bbcsport.mtx"))
> bbcsport_tdm <- as.TermDocumentMatrix(bbcsport_mat, weighting = weightTf)
> bbcsport_tdm
## <<TermDocumentMatrix (terms: 4613, documents: 737)>>
## Non-/sparse entries: 85576/3314205
## Sparsity : 97%
## Maximal term length: NA
## Weighting : term frequency (tf)
Add row and column names to the bbc matrix:
> bbc_rownames <- readr::read_table(paste0(bbc_folder, "/bbc.terms"), col_names = F)
> head(bbc_rownames)
## # A tibble: 6 x 1
## X1
## <chr>
## 1 ad
## 2 sale
## 3 boost
## 4 time
## 5 warner
## 6 profit
> bbc_colnames <- readr::read_table(paste0(bbc_folder, "/bbc.docs"), col_names = F)
> head(bbc_colnames)
## # A tibble: 6 x 1
## X1
## <chr>
## 1 business.001
## 2 business.002
## 3 business.003
## 4 business.004
## 5 business.005
## 6 business.006
> # strip the digits and the dot from the document names
> rm_fun <- function(string) {
+ string <- stringr::str_remove_all(string, "\\d+|\\.")
+ return(string)
+ }
>
> bbc_colnames$X1 <- bbc_colnames$X1 %>%
+ lapply(rm_fun) %>%
+ unlist()
>
> bbc_tdm$dimnames$Terms <- bbc_rownames$X1
> bbc_tdm$dimnames$Docs <- bbc_colnames$X1
>
> # transpose the TDM into a document-term matrix (DTM)
> bbc_dtm <- t(bbc_tdm)
> bbc_dtm
## <<DocumentTermMatrix (documents: 2225, terms: 9635)>>
## Non-/sparse entries: 286774/21151101
## Sparsity : 99%
## Maximal term length: 24
## Weighting : term frequency (tf)
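As an aside, str_remove_all() is vectorized over its string argument, so rm_fun() could be applied to the whole column directly and the lapply()/unlist() detour above is not strictly necessary:
> rm_fun(c("business.001", "tech.042"))
## [1] "business" "tech"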
> bbcsport_rownames <- readr::read_table(paste0(bbcsport_folder, "/bbcsport.terms"), col_names = F)
> head(bbcsport_rownames)
## # A tibble: 6 x 1
## X1
## <chr>
## 1 claxton
## 2 hunt
## 3 first
## 4 major
## 5 medal
## 6 british
> bbcsport_colnames <- readr::read_table(paste0(bbcsport_folder, "/bbcsport.docs"), col_names = F)
> head(bbcsport_colnames)
## # A tibble: 6 x 1
## X1
## <chr>
## 1 athletics.001
## 2 athletics.002
## 3 athletics.003
## 4 athletics.004
## 5 athletics.005
## 6 athletics.006
> bbcsport_colnames$X1 <- bbcsport_colnames$X1 %>%
+ lapply(rm_fun) %>%
+ unlist()
>
> bbcsport_tdm$dimnames$Terms <- bbcsport_rownames$X1
> bbcsport_tdm$dimnames$Docs <- bbcsport_colnames$X1
>
> # transpose the TDM into a document-term matrix (DTM)
> bbcsport_dtm <- t(bbcsport_tdm)
> bbcsport_dtm
## <<DocumentTermMatrix (documents: 737, terms: 4613)>>
## Non-/sparse entries: 85576/3314205
## Sparsity : 97%
## Maximal term length: 17
## Weighting : term frequency (tf)
> table(bbc_colnames$X1)
##
## business entertainment politics sport tech
## 510 386 417 511 401
> table(bbcsport_colnames$X1)
##
## athletics cricket football rugby tennis
## 101 124 265 147 100
The documents are spread fairly evenly across the topics in each dataset, although in bbcsport there are roughly twice as many football articles as articles about any other single sport.
2. Modeling
Build four different topic models:
LDA_VEM: an LDA model trained with variational expectation maximization (VEM); this model estimates the Dirichlet parameter vector automatically.
LDA_VEM_a: an LDA model trained with VEM, except that it does not estimate the Dirichlet parameter vector (alpha is held fixed).
LDA_GIB: an LDA model trained with Gibbs sampling.
CTM_VEM: an implementation of the Correlated Topic Model (CTM), trained with VEM.
> # k is the number of topics; topic_seed seeds the random number generator used in training
> compute_model <- function(k, topic_seed, dtm) {
+ LDA_VEM <- LDA(dtm, k = k, control = list(seed = topic_seed))
+ LDA_VEM_a <- LDA(dtm, k = k, control = list(seed = topic_seed,
+ estimate.alpha = F))
+   # burnin: number of initial Gibbs iterations to discard
+   # thin: number of iterations skipped between saved draws
+   # iter: total number of iterations
+ LDA_GIB <- LDA(dtm, k = k, method = "Gibbs", control = list(
+ seed = topic_seed, burnin = 1000, thin = 100, iter = 1000
+ ))
+ CTM_VEM <- CTM(dtm, k = k, control = list(
+ seed = topic_seed, var = list(tol = 10 ^ -4),
+ em = list(tol = 10 ^ -3)))
+ return(list(LDA_VEM = LDA_VEM, LDA_VEM_a = LDA_VEM_a,
+ LDA_GIB = LDA_GIB, CTM_VEM = CTM_VEM))
+ }
> bbc_models <- compute_model(k = 5, topic_seed = 123, bbc_dtm)
> bbcsport_models <- compute_model(k = 5, topic_seed = 123, bbcsport_dtm)
Inspect the result the LDA_VEM model produces on the BBC dataset:
> # topics() returns a vector of each document's most likely topic
> table(topics(bbc_models$LDA_VEM), bbc_colnames$X1)
##
## business entertainment politics sport tech
## 1 225 13 32 3 116
## 2 260 15 2 0 259
## 3 1 352 2 159 10
## 4 21 6 381 3 5
## 5 3 0 0 346 11
Topics 3, 4, and 5 roughly correspond to entertainment, politics, and sport, but topics 1 and 2 are both mixtures of business and tech, so this model has not managed to separate the categories we want.
Now inspect the LDA_GIB model:
> table(topics(bbc_models$LDA_GIB), bbc_colnames$X1)
##
## business entertainment politics sport tech
## 1 5 364 3 2 13
## 2 9 6 1 0 371
## 3 472 2 8 1 5
## 4 0 0 4 505 3
## 5 24 14 401 3 9
This model does much better than the previous one.
The overall accuracy of a topic model is the sum of the row maxima divided by the total number of documents; the maximum of each row identifies the gold topic assigned to the model topic that the row represents.
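As a quick sanity check against the LDA_GIB table above:
> # sum of the row maxima divided by the total number of documents
> (364 + 371 + 472 + 505 + 401) / 2225   # = 2113 / 2225 ≈ 0.9497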
Build a function that computes the overall accuracy:
> compute_acc <- function(model, gold_type) {
+ model_topics <- topics(model)
+ model_table <- table(model_topics, gold_type)
+   # take the maximum of each row
+ model_match <- apply(model_table, 1, max)
+   # accuracy: sum of row maxima over total documents
+ model_acc <- sum(model_match) / sum(model_table)
+ return(model_acc)
+ }
Compute the accuracy of all the models:
> sapply(bbc_models, FUN = function(x) compute_acc(x, bbc_colnames$X1))
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## 0.7029 0.6854 0.9497 0.7425
> sapply(bbcsport_models, FUN = function(x) compute_acc(x, bbcsport_colnames$X1))
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## 0.7802 0.7720 0.8820 0.7951
On both datasets the LDA_GIB model (Gibbs sampling) outperforms all the other models.
Another way to assess the quality of the model fit is to compute the log-likelihood of the data; larger values are better:
> sapply(bbc_models, logLik)
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## -3205967 -3276286 -3017840 -3235383
> sapply(bbcsport_models, logLik)
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## -864136 -886482 -814866 -870134
This leads to the same conclusion: on both datasets the LDA_GIB model beats the other models.
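A related fit measure is perplexity, essentially the exponentiated negative per-word log-likelihood, so lower values are better. A minimal sketch using the perplexity() method that topicmodels provides for VEM-fitted LDA models:
> # lower perplexity indicates a better fit to the data
> perplexity(bbc_models$LDA_VEM)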
3. Model Stability
The models above all used the same random seed, but a different seed could produce different results. To check which model behaves most stably, set the nstart parameter during training; it specifies the number of random restarts used by the optimization procedure.
Using random restarts increases the training time accordingly.
> compute_model_r <- function(k, topic_seed, dtm, nstart) {
+ seed_range <- topic_seed:(topic_seed + nstart - 1)
+ LDA_VEM <- LDA(dtm, k = k, control = list(seed = seed_range, nstart = nstart))
+ LDA_VEM_a <- LDA(dtm, k = k, control = list(seed = seed_range,
+ estimate.alpha = F,
+ nstart = nstart))
+ LDA_GIB <- LDA(dtm, k = k, method = "Gibbs", control = list(
+ seed = seed_range, burnin = 1000, thin = 100,
+ iter = 1000, nstart = nstart
+ ))
+ CTM_VEM <- CTM(dtm, k = k, control = list(
+ seed = topic_seed, var = list(tol = 10 ^ -4),
+ em = list(tol = 10 ^ -3)))
+ return(list(LDA_VEM = LDA_VEM, LDA_VEM_a = LDA_VEM_a,
+ LDA_GIB = LDA_GIB, CTM_VEM = CTM_VEM))
+ }
>
> bbc_models <- compute_model_r(k = 5, topic_seed = 123, bbc_dtm, nstart = 5)
> bbcsport_models <- compute_model_r(k = 5, topic_seed = 123, bbcsport_dtm, nstart = 5)
>
> sapply(bbc_models, FUN = function(x) compute_acc(x, bbc_colnames$X1))
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## 0.9483 0.7762 0.7627 0.7425
> sapply(bbcsport_models, FUN = function(x) compute_acc(x, bbcsport_colnames$X1))
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## 0.8996 0.8887 0.6499 0.7951
With five random restarts the accuracy of the two VEM-trained LDA models improves markedly, while LDA_GIB happens to land on a worse solution here. Note that in compute_model_r() the CTM_VEM model is still fitted with the single original seed rather than with restarts, which is why its accuracy is exactly the same as before.
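To see the seed sensitivity directly, one can refit a model with a different single seed and cross-tabulate the topic assignments of the two runs. A minimal sketch (the seed 321 and the choice of the LDA_VEM model are arbitrary):
> # topic labels are arbitrary across runs, so a stable model should show
> # one dominant cell per row rather than counts smeared across columns
> lda_alt <- LDA(bbc_dtm, k = 5, control = list(seed = 321))
> table(topics(bbc_models$LDA_VEM), topics(lda_alt))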
4. Topic Distributions
Call the posterior() function to inspect each model's per-document topic distributions.
> options(digits = 4)
> head(posterior(bbc_models[[1]])$topics)
## 1 2 3 4 5
## business 0.0002189 0.0356775 0.7043 0.0002189 0.2595903
## business 0.1073955 0.0002508 0.8797 0.0002508 0.0123840
## business 0.0003209 0.0003209 0.9987 0.0003209 0.0003209
## business 0.0002121 0.0418897 0.9575 0.0002121 0.0002121
## business 0.0004043 0.0192081 0.9796 0.0004043 0.0004043
## business 0.0004669 0.1748144 0.8238 0.0004669 0.0004669
> p_load(ggplot2)
>
> posterior(bbc_models[[1]])$topics %>%
+   # find the maximum topic probability in each row (document)
+   apply(1, max) %>%
+   # wrap in a one-column tibble; the braces stop magrittr from also
+   # inserting the vector as the first argument of tibble()
+   { tibble(topics = as.numeric(.)) } %>%
+   ggplot(aes(topics)) +
+   geom_histogram(binwidth = 0.02) +
+   labs(x = "Probability of the most likely topic", y = "Frequency",
+        title = "Histogram of maximum probabilities, LDA_VEM model") +
+   theme_bw()
[Figure: histogram of maximum topic probabilities for the LDA_VEM model]
Another way to gauge how smooth the topic distributions are is to compute the model's entropy: the average, over all documents, of the entropy -sum(p * log(p)) of each document's topic distribution. A smooth (spread-out) distribution has a higher entropy than a peaked one.
> compute_entropy <- function(x) {
+ return(- sum(x * log(x)))
+ }
>
> # average entropy of the per-document topic distributions
> compute_mean_entropy <- function(model) {
+ topics <- posterior(model)$topics
+ return(mean(apply(topics, 1, compute_entropy)))
+ }
>
> # mean entropy of each model on the two datasets
> sapply(bbc_models, compute_mean_entropy)
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## 0.3061 1.2777 1.3189 0.8350
> sapply(bbcsport_models, compute_mean_entropy)
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## 0.2838 1.2953 1.3699 0.7221
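To put these values on a scale: with k = 5 topics the entropy ranges from 0 (all probability on one topic) to log(5) ≈ 1.609 (a uniform distribution):
> compute_entropy(rep(0.2, 5))                       # uniform over 5 topics
## [1] 1.609
> compute_entropy(c(0.96, 0.01, 0.01, 0.01, 0.01))   # sharply peaked
## [1] 0.2234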
By this measure the LDA_VEM model produces by far the most peaked (lowest-entropy) topic distributions and the Gibbs-trained model the smoothest, with CTM_VEM in between.
5. Word Distributions
Look at the ten most important words in each topic of the LDA_GIB model (the corpus is stemmed, which is why terms such as "plai" and "compani" look truncated).
> terms(bbc_models[[3]], 10)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "govern" "on" "plai" "peopl" "year"
## [2,] "labour" "work" "film" "game" "compani"
## [3,] "parti" "want" "best" "music" "sale"
## [4,] "elect" "includ" "win" "servic" "market"
## [5,] "minist" "world" "game" "technolog" "firm"
## [6,] "plan" "time" "first" "mobil" "2004"
## [7,] "peopl" "london" "award" "phone" "share"
## [8,] "public" "show" "year" "get" "expect"
## [9,] "blair" "week" "england" "on" "month"
## [10,] "sai" "year" "star" "tv" "bank"
Compute each term's total frequency across the corpus from the term-document matrix:
> # convert the sparse TDM to an ordinary matrix
> bbc_tdm_mat <- as.matrix(bbc_tdm)
> bbc_tdm_mat[1:5, 10:16]
## Docs
## Terms business business business business business business business
## ad 0 1 0 0 1 2 0
## sale 0 1 0 5 1 0 1
## boost 0 0 0 2 1 0 0
## time 0 0 1 0 2 0 0
## warner 0 0 0 0 0 0 0
> # sum each row (term) across documents to get total term frequencies
> # convert to a plain numeric vector to drop the names attribute,
> # which would otherwise cause problems when plotting
> numTerms <- as.numeric(rowSums(bbc_tdm_mat))
> head(numTerms)
## [1]  851  736  222 1487   26  315
> # assemble a term-frequency data frame
> bbc_tf <- tibble(Terms = rownames(bbc_tdm_mat),
+ numTerms = numTerms)
> head(bbc_tf)
## # A tibble: 6 x 2
## Terms numTerms
## <chr> <dbl>
## 1 ad 851
## 2 sale 736
## 3 boost 222
## 4 time 1487
## 5 warner 26
## 6 profit 315
Keyword cloud:
> p_load(wordcloud2)
>
> # word cloud of the top 25 terms in topic 3 of the third model (LDA_GIB)
> terms(bbc_models[[3]], 25) %>%
+   as_tibble() %>%
+   .[, 3] %>%
+   setNames("Terms") %>%
+   left_join(bbc_tf, by = "Terms") %>%
+   wordcloud2()
[Figure: keyword cloud of the top 25 terms in topic 3 of the LDA_GIB model]