LDA: Latent Dirichlet Allocation, a topic model commonly used on text (topic classification and clustering). Note that LDA is also the abbreviation for Linear Discriminant Analysis.
Reference article: https://zhuanlan.zhihu.com/p/31470216
from pyspark.ml.clustering import LDA
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("LDAExample") \
    .getOrCreate()
# Loads data.
dataset = spark.read.format("libsvm").load("sample_lda_libsvm_data.txt")
# optimizer: 'online' or 'em'
# k: number of topics
# learningOffset: down-weights early iterations; the larger the value, the less influence early iterations have
# learningDecay: learning-rate decay; set between 0.5 and 1 to guarantee asymptotic convergence
# subsamplingRate: fraction of the corpus sampled in each mini-batch
lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)
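The fit above only sets k and maxIter; the other knobs listed in the comments belong to the online optimizer (the default in spark.ml). A sketch of passing them explicitly; the values here are illustrative, not tuned recommendations:

```python
from pyspark.ml.clustering import LDA

# Illustrative configuration of the online optimizer; values are examples, not recommendations
lda_online = LDA(
    k=10,
    maxIter=10,
    optimizer="online",       # variational online optimizer (alternative: "em")
    learningOffset=1024.0,    # down-weights early iterations
    learningDecay=0.51,       # keep in (0.5, 1] for asymptotic convergence
    subsamplingRate=0.05,     # mini-batch fraction per iteration
)
```

The EM optimizer ignores the three online-specific parameters, so they only matter when optimizer="online".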
ll = model.logLikelihood(dataset)  # log likelihood
lp = model.logPerplexity(dataset)  # log perplexity
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))
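The two metrics are tied together: spark.ml's perplexity bound is the negative log likelihood normalized by the corpus token count. A minimal pure-Python sketch of that relationship, using made-up numbers (not taken from the dataset above):

```python
import math

# Hypothetical values for illustration only
log_likelihood = -806.8   # lower bound on the corpus log likelihood
token_count = 350         # total number of tokens in the corpus

# Upper bound on log perplexity: negative per-token log likelihood
log_perplexity = -log_likelihood / token_count
perplexity = math.exp(log_perplexity)  # lower is better
print(log_perplexity, perplexity)
```

Because of this normalization, log perplexity is comparable across corpora of different sizes, while raw log likelihood is not.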
model.vocabSize()  # vocabulary size
topics = model.describeTopics(maxTermsPerTopic=3)  # top 3 highest-weighted terms per topic
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)
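describeTopics returns term indices and weights, not words; since the libsvm input carries no vocabulary, mapping indices back to terms requires the vocabulary from whatever produced the feature vectors (e.g. a fitted CountVectorizer model). A sketch with a hypothetical vocabulary and hypothetical describeTopics rows:

```python
# Hypothetical vocabulary and describeTopics output, for illustration only
vocabulary = ["spark", "data", "model", "topic", "word"]
topic_rows = [
    {"topic": 0, "termIndices": [2, 0, 4], "termWeights": [0.30, 0.25, 0.10]},
    {"topic": 1, "termIndices": [1, 3, 0], "termWeights": [0.28, 0.22, 0.12]},
]

# Map each topic's term indices back to readable words
readable = {
    row["topic"]: [vocabulary[i] for i in row["termIndices"]]
    for row in topic_rows
}
print(readable)
```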
# The new column added by transform gives each document's weights over the 10 topics; the weights sum to 1
transformed = model.transform(dataset)
transformed.show(truncate=False)
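Each row of the transformed output carries a topicDistribution vector, which is a probability distribution over the 10 topics. A quick pure-Python sanity check on one hypothetical row (values made up):

```python
# Hypothetical topicDistribution for one document (10 topics)
topic_distribution = [0.02, 0.01, 0.05, 0.60, 0.03, 0.04, 0.10, 0.05, 0.06, 0.04]

# The entries should sum to 1; the largest entry is the document's dominant topic
total = sum(topic_distribution)
dominant_topic = max(range(len(topic_distribution)),
                     key=lambda i: topic_distribution[i])
print(total, dominant_topic)
```

Picking the argmax like this is a common way to assign each document a single hard topic label from the soft distribution.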