[Theory] Principal Component Analysis and Variable Clustering

Author: dataheart | Published 2017-05-21 10:31 | 1668 reads

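Approaches to reducing the number of variables

Before modeling: principal component analysis, factor analysis, or variable clustering
During modeling: stepwise selection or all-subsets selection

Principal component analysis: based on the correlations among the variables, the highly correlated part is extracted as a principal component. In other words, a small number of principal components are derived from the original variables so that they retain as much of the original information as possible while being uncorrelated with each other.

Selection rule: each retained component should have an eigenvalue of at least 1 (it should explain at least as much variance as one standardized variable), and the retained components together should explain 80% to 90% of the total variance.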
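
Both rules can be read off the eigenvalues of the correlation matrix. The sketch below is only an illustration on synthetic data (not the telecom file used later); the column names `a` to `d` and the 80% cut-off are assumptions chosen for the example.

```r
# Synthetic data for illustration only; not the article's data set.
set.seed(1)
n <- 300
f <- rnorm(n)                      # a shared latent factor
X <- data.frame(
  a = f + rnorm(n, sd = 0.3),
  b = f + rnorm(n, sd = 0.3),
  c = f + rnorm(n, sd = 0.3),
  d = rnorm(n)                     # unrelated to the others
)

ev <- eigen(cor(X), symmetric = TRUE)$values   # eigenvalues, largest first
prop <- ev / sum(ev)                           # share of variance per component
cum_prop <- cumsum(prop)

keep_kaiser <- sum(ev >= 1)                    # rule 1: eigenvalue at least 1
keep_cum80  <- which(cum_prop >= 0.80)[1]      # rule 2: first k reaching 80%

print(round(rbind(eigenvalue = ev, proportion = prop, cumulative = cum_prop), 3))
```

The two rules do not always agree, so in practice they are usually considered together with the scree plot.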

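Code
1. Data source: the data set records how telecom customers use each service.
Attributes:
ID: customer ID
cnt_call: number of phone calls
cnt_msg: number of SMS messages sent
cnt_wei: number of WeChat messages sent
cnt_web: number of web page visits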

> library(sqldf)
> setwd('E:\\R数据挖掘实战\\第四周\\data数据')
> orgData <- read.csv("profile_telecom.csv") # read the file into a data frame
> names(orgData) # column names of the data frame
[1] "ID"       "cnt_call" "cnt_msg"  "cnt_wei"  "cnt_web" 
> orgData <- orgData[,2:5] # keep columns 2 to 5 only (drop ID), i.e. the variables used in the analysis
> head(orgData)
  cnt_call cnt_msg cnt_wei cnt_web
1       46      90      36      31
2       53       2       0       2
3       28      24       5       8
4        9       2       0       4
5      145       2       0       1
6      186       4       3       1
> # PCA, method 1
> pr1 <- princomp(orgData,cor = TRUE) # the classic PCA function: princomp
> # cor = TRUE --> PCA on the sample correlation matrix R; cor = FALSE --> PCA on the sample covariance matrix S
> pr1
Call:
princomp(x = orgData, cor = TRUE)

Standard deviations:
    Comp.1     Comp.2     Comp.3     Comp.4 
1.58127090 0.99237512 0.71374657 0.07307394 

 4  variables and  600 observations.

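Interpretation: with cor = TRUE the variables are centered and standardized. The printed values are standard deviations, and their squares are the eigenvalues. Comp.1 has an eigenvalue of 1.5813^2 ≈ 2.50, so it carries the variance of about 2.5 original variables, while Comp.4 has 0.0731^2 ≈ 0.005 and carries almost nothing. At most three components therefore need to be kept.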
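
As a quick check of that arithmetic, the standard deviations printed above can be squared and normalized by hand. This is only a sketch that re-uses the numbers from the transcript; it does not need the data file.

```r
# Standard deviations copied from the princomp() output above.
sdev <- c(Comp.1 = 1.58127090, Comp.2 = 0.99237512,
          Comp.3 = 0.71374657, Comp.4 = 0.07307394)

eigenvalues <- sdev^2                       # variance carried by each component
proportion  <- eigenvalues / sum(eigenvalues)

print(round(eigenvalues, 4))
print(round(proportion, 4))
```

The eigenvalues sum to 4, the number of standardized variables, which is why an eigenvalue can be read as "how many variables' worth of variance" a component explains.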

Cumulative variance explained

> # loadings is a logical argument
> # loadings=TRUE prints the loadings
> # the loadings are the coefficients of the original variables in each principal component, i.e. the matrix Q
> summary(pr1,loadings=TRUE)
Importance of components:
                          Comp.1    Comp.2    Comp.3     Comp.4
Standard deviation     1.5812709 0.9923751 0.7137466 0.07307394
Proportion of Variance 0.6251044 0.2462021 0.1273585 0.00133495
Cumulative Proportion  0.6251044 0.8713065 0.9986650 1.00000000

Loadings:
         Comp.1 Comp.2 Comp.3 Comp.4
cnt_call -0.111  0.990              
cnt_msg  -0.510 -0.127  0.810 -0.262
cnt_wei  -0.579        -0.559 -0.593
cnt_web  -0.627        -0.157  0.762

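Interpretation of the output:
Standard deviation: its square is the variance of the component, which equals the eigenvalue
Proportion of Variance: share of the total variance explained by each component
Cumulative Proportion: cumulative share of variance explained; keep enough components for this to reach 80% to 90%, which determines how many components to retain

The loadings matrix gives the weight of each original variable in each component, which hints at what a component represents:
Comp.1 involves all four variables, most heavily cnt_msg, cnt_wei and cnt_web, so it mainly reflects data services
Comp.2 is dominated by cnt_call and represents voice calls
The remaining components have no obvious meaning, which shows that unrotated components can be hard to interpret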

# Draw the scree plot of the principal components
screeplot(pr1,type="lines") # keep the components before the point where the curve starts to flatten

Fortunately, principal components can be rotated

Without rotation, the result is the same as above

> library(psych)

Attaching package: ‘psych’

The following objects are masked from ‘package:ggplot2’:

    %+%, alpha

> pr2<-principal(orgData,nfactors=3,rotate="none",covar=F,score=TRUE) 
> # nfactors=3 sets the number of components to keep; rotate="none" means no rotation is applied
> pr2
Principal Components Analysis
Call: principal(r = orgData, nfactors = 3, rotate = "none", covar = F, 
    scores = TRUE)
Standardized loadings (pattern matrix) based upon correlation matrix
          PC1   PC2   PC3 h2      u2 com
cnt_call 0.18  0.98  0.06  1 6.5e-08 1.1
cnt_msg  0.81 -0.13  0.58  1 3.7e-04 1.9
cnt_wei  0.92 -0.02 -0.40  1 1.9e-03 1.4
cnt_web  0.99 -0.05 -0.11  1 3.1e-03 1.0

                       PC1  PC2  PC3
SS loadings           2.50 0.98 0.51
Proportion Var        0.63 0.25 0.13
Cumulative Var        0.63 0.87 1.00
Proportion Explained  0.63 0.25 0.13
Cumulative Proportion 0.63 0.87 1.00

Mean item complexity =  1.3
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is  0 
 with the empirical chi square  0.01  with prob <  NA 

Fit based upon off diagonal values = 1
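
The loadings here look different from the princomp() loadings, but they describe the same components. As far as I understand the two functions, princomp() prints unit-length eigenvector coefficients, while principal() scales each eigenvector by the component's standard deviation, and the sign of a component is arbitrary. The sketch below checks this on three entries copied from the transcripts above.

```r
# Eigenvector coefficients and standard deviations copied from the
# summary(pr1, loadings = TRUE) output. Only entries that were actually
# printed there are used.
sdev <- c(1.5812709, 0.9923751, 0.7137466)
web_comp1  <- -0.627   # cnt_web on Comp.1
call_comp2 <-  0.990   # cnt_call on Comp.2
msg_comp3  <-  0.810   # cnt_msg on Comp.3

# Scaling by the standard deviation gives the correlation-style loading.
scaled <- c(web_PC1  = web_comp1  * sdev[1],
            call_PC2 = call_comp2 * sdev[2],
            msg_PC3  = msg_comp3  * sdev[3])

# Compare absolute values, since component signs may be flipped.
print(round(abs(scaled), 2))
```

The absolute values agree with the PC1, PC2 and PC3 entries for cnt_web, cnt_call and cnt_msg in the principal() table.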

With rotation

> library(psych)
> fc1<-principal(orgData,nfactors=2,rotate="varimax",covar=F,score=TRUE) # varimax: variance-maximizing rotation
> fc1
Principal Components Analysis
Call: principal(r = orgData, nfactors = 2, rotate = "varimax", covar = F, 
    scores = TRUE)
Standardized loadings (pattern matrix) based upon correlation matrix
          RC1   RC2   h2     u2 com
cnt_call 0.06  1.00 1.00 0.0037   1
cnt_msg  0.82 -0.03 0.67 0.3343   1
cnt_wei  0.91  0.09 0.84 0.1611   1
cnt_web  0.99  0.06 0.98 0.0156   1

                       RC1  RC2
SS loadings           2.48 1.01
Proportion Var        0.62 0.25
Cumulative Var        0.62 0.87
Proportion Explained  0.71 0.29
Cumulative Proportion 0.71 1.00

Mean item complexity =  1
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.1 
 with the empirical chi square  72.93  with prob <  NA 

Fit based upon off diagonal values = 0.97

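Interpretation:

The first component captures the overall level of data services: SMS, WeChat and web browsing
The second component captures what the first one misses: voice calls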
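
To see what the rotation does, the base R function varimax() can be applied to the first two unrotated loading columns. This is a sketch on the rounded values from the pr2 output, so the result will only approximate the RC1/RC2 table, and column order or signs may differ from what principal() prints.

```r
# Unrotated loadings for PC1 and PC2, copied (rounded) from the pr2 output.
L <- matrix(c(0.18,  0.98,
              0.81, -0.13,
              0.92, -0.02,
              0.99, -0.05),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("cnt_call", "cnt_msg", "cnt_wei", "cnt_web"),
                            c("PC1", "PC2")))

rot <- varimax(L)                 # stats::varimax, Kaiser-normalized by default
R <- unclass(rot$loadings)        # rotated loadings as a plain matrix

# An orthogonal rotation redistributes variance between components but
# leaves each variable's communality (row sum of squares) unchanged.
print(round(R, 2))
print(round(cbind(before = rowSums(L^2), after = rowSums(R^2)), 3))
```

The unchanged row sums correspond to the h2 column in the principal() output: rotation changes how the explained variance is split between components, not how much of each variable is explained.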

2. Variable clustering: first group the variables into a few dimensions, then pick the most representative variable from each dimension

The variables are divided into three groups.

All candidate partitions are listed; choose the one with the highest value.

According to the original figure, the variables fall into three groups.

orgData<-read.csv("profile_telecom.csv")
head(orgData)
orgData<-orgData[,2:5]

library(ClustOfVar) # package dedicated to clustering variables
tree <- hclustvar(orgData) 
plot(tree) # cluster the variables and draw the dendrogram, which shows clearly which variables can be merged
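
For readers without ClustOfVar installed, the general idea can be sketched in base R. Note that this swaps in a different, simpler technique: ordinary hclust() on a dissimilarity of 1 minus the squared correlation, rather than the criterion hclustvar() uses. The data are synthetic and only borrow the article's column names.

```r
# Synthetic data for illustration only; column names mimic the article's.
set.seed(42)
n <- 200
base <- rnorm(n)
X <- data.frame(
  cnt_call = rnorm(n),
  cnt_msg  = rnorm(n),
  cnt_wei  = base + rnorm(n, sd = 0.2),
  cnt_web  = base + rnorm(n, sd = 0.2)
)

# Dissimilarity between variables: 1 minus squared Pearson correlation.
d <- as.dist(1 - cor(X)^2)
hc <- hclust(d, method = "average")

groups <- cutree(hc, k = 3)        # named vector: variable -> cluster id
print(groups)
```

Strongly correlated variables end up in the same cluster, so one variable per cluster can stand in for the rest.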

Are two clusters or three clusters better? The check below helps decide.

stability(tree,B=40) # bootstrap check of the tree: are two or three clusters better? Choose the number with the higher stability
# if they are the same

> part <- cutreevar(tree,3,matsim = T) # matsim = T also returns the similarity matrix between the variables within each cluster
> summary(part)

Call:
cutreevar(obj = tree, k = 3, matsim = T)



Data: 
   number of observations:  600
   number of variables:  4
   number of clusters:  3

Cluster  1 : 
         squared loading
cnt_call               1


Cluster  2 : 
        squared loading
cnt_msg               1


Cluster  3 : 
        squared loading
cnt_wei            0.98
cnt_web            0.98


Gain in cohesion (in %):  96.7
> part$var
$cluster1
         squared loading
cnt_call               1

$cluster2
        squared loading
cnt_msg               1

$cluster3
        squared loading
cnt_wei       0.9752462
cnt_web       0.9752462

> part$sim
$cluster1
         cnt_call
cnt_call        1

$cluster2
        cnt_msg
cnt_msg       1

$cluster3
          cnt_wei   cnt_web
cnt_wei 1.0000000 0.9034359
cnt_web 0.9034359 1.0000000
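
The two numbers reported for cluster 3 are consistent with each other. Assuming the similarity in part$sim is the squared Pearson correlation and the squared loading is the squared correlation with the first principal component of the cluster (my reading of the ClustOfVar output, not something stated in the transcript), one can be derived from the other:

```r
r2 <- 0.9034359            # part$sim entry for cnt_wei and cnt_web
r  <- sqrt(r2)             # absolute Pearson correlation, assuming sim = r^2

# Correlation matrix of the two-variable cluster and its first eigenvalue.
R <- matrix(c(1, r, r, 1), nrow = 2)
lambda1 <- eigen(R, symmetric = TRUE)$values[1]   # equals 1 + r

sq_loading <- lambda1 / 2  # squared correlation of each variable with PC1
print(round(sq_loading, 7))
```

The result matches the squared loading of about 0.9752 shown in part$var for both cnt_wei and cnt_web.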

Reference: CDA "Credit Risk Modeling" micro-specialization course

Article link: https://www.haomeiwen.com/subject/pcftgttx.html
