原文链接:http://tecdat.cn/?p=4815
因为近期在分析数据时用到了EM最大期望估计法这个算法,在参数估计中也用到的比较多。然而,发现国内在R软件上实现高斯混合分布的EM的实例并不多,大多数是关于1到2个高斯混合分布的实现,不易于推广,因此这里分享一下自己编写的k个高斯混合分布的EM算法实现请大神们多多指教。并结合EMCluster包对结果进行验算。
本文使用的密度函数为下面格式:
need-to-insert-img
![](https://img.haomeiwen.com/i2808647/4e6067a148f90463.png)
对应的函数原型为 em.norm(x,means,covariances,mix.prop)
x为原数据,means为初始均值,covariances为数据的协方差矩阵,mix.prop为混合参数初始值。
使用的数据为MASS包里面的synth.te数据的前两列
<- synth.te[,-3]","classes":{"has":1}}" data-cke-widget-upcasted="1" data-cke-widget-keep-attr="0" data-widget="codeSnippet">x <- synth.te[,-3]
need-to-insert-img
首先安装需要的包,并读取原数据。
install.packages("MASS") library(MASS) install.packages("EMCluster") library(EMCluster) install.packages("ggplot2") library(ggplot2) Y=synth.te[,c(1:2)] qplot(x=xs, y=ys, data=Y)
need-to-insert-img
然后绘制相应的变量相关图:
need-to-insert-img
从图上我们可以大概估计出初始的平均点为(-0.7,0.4) (-0.3,0.8)(0.5,0.6)
当然 为了试验的严谨性,我可以从两个初始均值点的情况开始估计
首先输入初始参数:
mustart = rbind(c(-0.5,0.3),c(0.4,0.5))covstart = list(cov(Y), cov(Y)) probs = c(.01, .99)
need-to-insert-img
然后编写em.norm函数,注意其中的clusters值需要根据不同的初始参数进行修改,
< maxits) {\n\nprobsOld = probs\n\n\n\nllOld = ll\n\nriOld = ri\n\n\n\n\n\n# Compute responsibilities\n\nfor (k in 1:clusters){\n\nri[,k] = probs[k] * dmvnorm(X, mu[k,], sigma = var[[k]], log=F)\n\n}\n\nri = ri/rowSums(ri)\n\n\n\n\n\nrk = colSums(ri)\n\nprobs = rk/N\n\nfor (k in 1:clusters){\n\nvarmat = matrix(0, ncol=ncol(X), nrow=ncol(X))\n\nfor (i in 1:N){\n\nvarmat = varmat + ri[i,k] * X[i,]%*%t(X[i,])\n\n}\n\nmu[k,] = (t(X) %*% ri[,k]) / rk[k]\n\nvar[[k]] =varmat/rk[k] - mu[k,]%*%t(mu[k,])\n\nll[k] = -.5 * sum( ri[,k] * dmvnorm(X, mu[k,], sigma = var[[k]], log=T) )\n\n}\n\nll = sum(ll)\n\n\n\n\n\nparmlistold =c(llOld, probsOld)\n\nparmlistcurrent = c(ll, probs)\n\nit = it + 1\n\nif (showits & it == 1 | it%%5 == 0)\n\ncat(paste(format(it), \"...\", \"\\n\", sep = \"\"))\n\nconverged = min(abs(parmlistold - parmlistcurrent)) <= tol\n\n}\n\n\n\nclust = which(round(ri)==1, arr.ind=T)\n\nclust = clust[order(clust[,1]), 2]\n\nout = list(probs=probs, mu=mu, var=var, resp=ri, cluster=clust, ll=ll)\n\n}\n\n","classes":{"has":1}}" data-cke-widget-upcasted="1" data-cke-widget-keep-attr="0" data-widget="codeSnippet">em.norm = function(X,mustart,covstart,probs){params = list(mu=mustart, var=covstart, probs=probs)clusters = 2tol=.00001maxits=100showits=Trequire(mvtnorm)N = nrow(X)mu = params$muvar = params$varprobs = params$probsri = matrix(0, ncol=clusters, nrow=N)ll = 0it = 0converged = FALSEif (showits)cat(paste("Iterations of EM:", "\n"))while (!converged & it < maxits) {probsOld = probsllOld = llriOld = ri# Compute responsibilitiesfor (k in 1:clusters){ri[,k] = probs[k] * dmvnorm(X, mu[k,], sigma = var[[k]], log=F)}ri = ri/rowSums(ri)rk = colSums(ri)probs = rk/Nfor (k in 1:clusters){varmat = matrix(0, ncol=ncol(X), nrow=ncol(X))for (i in 1:N){varmat = varmat + ri[i,k] * X[i,]%*%t(X[i,])}mu[k,] = (t(X) %*% ri[,k]) / rk[k]var[[k]] =varmat/rk[k] - mu[k,]%*%t(mu[k,])ll[k] = -.5 * sum( ri[,k] * dmvnorm(X, mu[k,], sigma = var[[k]], log=T) )}ll = sum(ll)parmlistold =c(llOld, probsOld)parmlistcurrent = c(ll, probs)it = it + 1if (showits & it == 1 | it%%5 == 0)cat(paste(format(it), "...", "\n", sep = ""))converged = min(abs(parmlistold - parmlistcurrent)) <= tol}clust = which(round(ri)==1, arr.ind=T)clust = clust[order(clust[,1]), 2]out = list(probs=probs, mu=mu, var=var, resp=ri, cluster=clust, ll=ll) }
need-to-insert-img
结果,可以用图像化来表示:
分享:
网友评论