R_DATACAMP10 Cluster Analysis in

Author: 一条很闲的咸鱼 | Published: 2018-08-23 14:19

    Calculating distance between observations

    lims(x = c(-30, 30), y = c(-20, 20)) is used in ggplot to set the ranges of the plot axes.

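    dist(two_players): dist() applied to a data.frame computes the pairwise distances between all of its rows (observations).

    Calling scale() on the data.frame before dist() removes the effect of variables measured on very different scales, for example one in kilometers and another in milliliters. scale() standardizes each column: it centers it to mean 0 and rescales it to standard deviation 1.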
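
The effect of scaling can be seen on a tiny example. This is a minimal sketch; the data frame obs and its values are hypothetical and only serve to put two variables on very different scales.

```r
# Hypothetical toy data: two variables measured on very different scales.
obs <- data.frame(
  distance_km = c(1, 2, 3),
  volume_ml   = c(1000, 3000, 2000)
)

# Unscaled distances are dominated by volume_ml.
d_raw <- dist(obs)

# scale() centers each column to mean 0 and rescales it to sd 1.
obs_scaled <- scale(obs)
d_scaled <- dist(obs_scaled)

print(d_raw)
print(d_scaled)
```

After scaling, both columns contribute comparably to the distances.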

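    How can dist() be used when the data.frame holds categorical values such as YES/NO or LOW/MIDDLE/HIGH?
    First, library(dummies)
    dummy_survey <- dummy.data.frame(job_survey)  # convert each categorical column into 0/1 dummy columns
    dist_survey <- dist(dummy_survey, method = 'binary')  # then call dist() with the binary method
    The possible values of method are:
    euclidean: Euclidean distance, the square root of the sum of squared differences.
    maximum: Chebyshev distance, the largest absolute difference across coordinates.
    manhattan: Manhattan distance, the sum of absolute differences.
    canberra: Canberra (Lance-Williams) distance.
    minkowski: Minkowski distance; the power p must be specified.
    binary: distance for binary (0/1) variables: among the columns where at least one of the two observations is 1, the proportion where only one of them is 1.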
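
The dummy-coding step can also be sketched in base R. The helper one_hot() below is a hypothetical stand-in for dummy.data.frame(), not the function from the dummies package, and the survey answers are made up for illustration.

```r
# Hypothetical survey with two categorical answers.
job_survey <- data.frame(
  satisfied = factor(c("YES", "NO", "YES")),
  salary    = factor(c("HIGH", "LOW", "HIGH"),
                     levels = c("LOW", "MIDDLE", "HIGH"))
)

# Hypothetical helper: one 0/1 column per level of every column.
one_hot <- function(df) {
  cols <- lapply(names(df), function(nm) {
    f <- factor(df[[nm]])
    m <- sapply(levels(f), function(lv) as.integer(f == lv))
    colnames(m) <- paste0(nm, levels(f))
    m
  })
  do.call(cbind, cols)
}

dummy_survey <- one_hot(job_survey)
dist_survey <- dist(dummy_survey, method = "binary")
print(dummy_survey)
print(dist_survey)
```

Respondents 1 and 3 gave identical answers, so their binary distance is 0; respondents 1 and 2 share no answer, so their distance is 1.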

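    Linkage criteria determine how the distance between an observation and an existing group (or between two groups) is derived from the pairwise distances in the matrix. There are three common choices:

      • Complete: the resulting distance is based on the maximum, max()
      • Single: the resulting distance is based on the minimum, min()
      • Average: the resulting distance is based on the average, mean()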
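
The three criteria above can be computed by hand on a small case. This is a sketch with hypothetical points: observations 1 and 2 form a group, and observation 3 is compared against it.

```r
# Hypothetical points: rows 1 and 2 form a group, row 3 is outside it.
pts <- matrix(c(0, 0,
                3, 4,
                6, 8), ncol = 2, byrow = TRUE)

d <- as.matrix(dist(pts))

# Distances from observation 3 to each member of the group {1, 2}.
to_group <- d[3, c(1, 2)]

complete_link <- max(to_group)   # complete linkage
single_link   <- min(to_group)   # single linkage
average_link  <- mean(to_group)  # average linkage

print(c(complete = complete_link, single = single_link, average = average_link))
```

Here observation 3 is 10 away from observation 1 and 5 away from observation 2, so the three linkage distances are 10, 5 and 7.5.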

    hc_players <- hclust(dist_players, method = "complete")
    clusters_k2 <- cutree(hc_players, k = 2)
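    hclust() performs hierarchical clustering on a distance matrix.
    cutree(k = ) cuts the resulting tree into k clusters and returns a vector giving the cluster number of each observation.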
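
A minimal end-to-end sketch of the two calls, using a hypothetical data frame lineup_toy with two visibly separated groups of points:

```r
# Hypothetical positions: three points near (1, 1), three near (10, 10).
lineup_toy <- data.frame(
  x = c(1, 2, 1.5, 10, 11, 10.5),
  y = c(1, 1.5, 2, 10, 10.5, 11)
)

hc <- hclust(dist(lineup_toy), method = "complete")

# One cluster number per row of lineup_toy.
clusters_k2 <- cutree(hc, k = 2)

print(clusters_k2)
print(table(clusters_k2))
```

The returned vector has the same length and order as the rows of the input, which is why it can be attached to the data as a new column.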

    library(dendextend)
    dist_players <- dist(lineup, method = 'euclidean')
    hc_players <- hclust(dist_players, method = "complete")
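    dend_players <- as.dendrogram(hc_players)  # convert the hclust object to an object of class "dendrogram", which dendextend functions operate on
    plot(dend_players)  # draws the tree diagram (dendrogram)
    dend_20 <- color_branches(dend_players, h = 20)  # color the branches by cluster; h is the height at which the tree is cut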

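    dist_customers <- dist(customers_spend)  # pairwise distances between customers
    hc_customers <- hclust(dist_customers, method = "complete")  # hierarchical clustering with hclust()
    plot(hc_customers)  # plot the resulting dendrogram
    clust_customers <- cutree(hc_customers, h = 15000)  # cut the tree at height h; with complete linkage, no two members of a cluster are more than h apart
    segment_customers <- mutate(customers_spend, cluster = clust_customers)  # dplyr's mutate() adds the cluster assignments to the original data.frame as a new column, cluster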
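
What h means can be checked on a small case. This sketch uses a hypothetical one-column data frame spend_toy; it cuts a complete-linkage tree at h = 5 and then verifies that the largest distance inside every cluster stays at or below 5.

```r
# Hypothetical spending values with three visible groups.
spend_toy <- data.frame(amount = c(0, 1, 2, 10, 11, 30))

d  <- dist(spend_toy)
hc <- hclust(d, method = "complete")

# Cut at height 5 instead of asking for a fixed number of clusters.
cl <- cutree(hc, h = 5)

# Largest pairwise distance inside each cluster.
dm <- as.matrix(d)
max_within <- sapply(split(seq_along(cl), cl), function(idx) max(dm[idx, idx]))

print(cl)
print(max_within)
```

Choosing h therefore sets an upper bound on how far apart two members of the same cluster may be, and the number of clusters follows from that bound.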

    ifelse() in ggplot

    K-means clustering

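    model_km2 <- kmeans(lineup, centers = 2)  # fit a k-means model; centers is k, the number of clusters
    clust_km2 <- model_km2$cluster  # extract the cluster assignment vector from the model
    lineup_km2 <- mutate(lineup, cluster = clust_km2)  # add the assignments to the original data as a new column
    ggplot(lineup_km2, aes(x = x, y = y, color = factor(cluster))) +
      geom_point()
    The scatter plot shows the grouping. The cluster column is wrapped in factor() because it is stored as an integer: ggplot maps a numeric variable to a continuous color gradient, whereas a factor is mapped to discrete colors, one per cluster (1 and 2 rather than a range from 1 to 2).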
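
The integer-versus-factor point can be checked without plotting. A minimal sketch on hypothetical data (lineup_toy is not the course dataset; set.seed() and nstart are added here only to make the toy run stable):

```r
set.seed(42)  # k-means starts from random centers

# Hypothetical positions: two well-separated groups.
lineup_toy <- data.frame(
  x = c(1, 2, 1.5, 10, 11, 10.5),
  y = c(1, 1.5, 2, 10, 10.5, 11)
)

model_km2 <- kmeans(lineup_toy, centers = 2, nstart = 10)
lineup_km2 <- cbind(lineup_toy, cluster = model_km2$cluster)

# The assignments are numeric, which ggplot would treat as continuous.
print(class(lineup_km2$cluster))

# As a factor they become discrete levels, one per cluster.
print(levels(factor(lineup_km2$cluster)))
```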

    library(purrr)
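    tot_withinss <- map_dbl(1:10, function(k){
      model <- kmeans(x = lineup, centers = k)
      model$tot.withinss
    })

    elbow_df <- data.frame(
      k = 1:10,
      tot_withinss = tot_withinss
    )
    Fit the model for a range of k values (1 to 10) and record the total within-cluster sum of squares for each, which gives the data for an elbow plot.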
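
The same loop can be written with base R only and plotted directly. This is a sketch on hypothetical data (three tight groups of five points each); sapply() stands in for map_dbl(), and nstart is an added assumption to reduce the effect of random starts.

```r
set.seed(1)

# Hypothetical data: three tight groups built from fixed offsets.
offsets <- rbind(c(0, 0), c(-0.5, -0.5), c(0.5, -0.5), c(-0.5, 0.5), c(0.5, 0.5))
group_centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
toy <- do.call(rbind, lapply(seq_len(nrow(group_centers)), function(i) {
  sweep(offsets, 2, group_centers[i, ], "+")
}))

ks <- 1:6
tot_withinss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

elbow_df <- data.frame(k = ks, tot_withinss = tot_withinss)

# The "elbow" is the k after which the curve flattens out.
plot(elbow_df$k, elbow_df$tot_withinss, type = "b",
     xlab = "k", ylab = "Total within-cluster sum of squares")
```

On data like this the curve should drop steeply up to k = 3 and then level off, which is the pattern to look for on real data.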

    library(cluster)
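    pam() plays a role similar to kmeans(): both fit a clustering model. pam_k2 <- pam(lineup, k = 2)
    kmeans() partitions around cluster means, so it is sensitive to outliers. pam() (partitioning around medoids) is more robust because each cluster center is a medoid, an actual observation.
    silhouette()
    plot(silhouette(pam_k2))  # draws the silhouette plot, a bar chart of each observation's silhouette width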
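
The difference between a mean and a medoid shows up clearly with a single outlier. A minimal sketch, assuming the cluster package is installed; the values are hypothetical and a single cluster is used only to isolate the effect.

```r
library(cluster)

# Hypothetical values with one extreme outlier.
x <- data.frame(v = c(1, 2, 3, 4, 100))

# k-means center: the mean, pulled toward the outlier.
km_1 <- kmeans(x, centers = 1)
print(km_1$centers)

# PAM center: the medoid, which is always one of the observations.
pam_1 <- pam(x, k = 1)
print(pam_1$medoids)
```

The k-means center lands at 22, a value that matches none of the observations, while the medoid stays at 3.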

    sil_width <- map_dbl(2:10, function(k){
      model <- pam(x = customers_spend, k = k)
      model$silinfo$avg.width
    })
    sil_df <- data.frame(
      k = 2:10,
      sil_width = sil_width
    )
    ggplot(sil_df, aes(x = k, y = sil_width)) +
      geom_line() +
      scale_x_continuous(breaks = 2:10)
    Fit the model for a range of k values, then plot the average silhouette width against k as a line chart to choose k.
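
Besides reading the chart, the k with the highest average silhouette width can be picked programmatically. This is a sketch on hypothetical data with three tight groups, assuming the cluster package is installed; sapply() stands in for map_dbl().

```r
library(cluster)

# Hypothetical data: three tight groups built from fixed offsets.
offsets <- rbind(c(0, 0), c(-0.5, -0.5), c(0.5, -0.5), c(-0.5, 0.5), c(0.5, 0.5))
group_centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
toy <- do.call(rbind, lapply(seq_len(nrow(group_centers)), function(i) {
  sweep(offsets, 2, group_centers[i, ], "+")
}))

ks <- 2:6
sil_width <- sapply(ks, function(k) pam(toy, k = k)$silinfo$avg.width)

sil_df <- data.frame(k = ks, sil_width = sil_width)

# Higher average silhouette width means better-separated clusters.
best_k <- sil_df$k[which.max(sil_df$sil_width)]
print(sil_df)
print(best_k)
```

The range starts at 2 because the silhouette width is not defined for a single cluster.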

    segment_customers %>%
      group_by(cluster) %>%
      summarise_all(mean)
    Summarize each cluster by its column means to review the earlier clustering result.
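
The same per-cluster summary can be sketched in base R with aggregate(), which is swapped in here for the dplyr pipeline. The data frame segment_toy and its spending values are hypothetical.

```r
# Hypothetical segmented customers: two spending columns plus a cluster label.
segment_toy <- data.frame(
  milk    = c(10, 20, 100, 120),
  grocery = c(5, 15, 50, 70),
  cluster = c(1, 1, 2, 2)
)

# Mean of every other column within each cluster.
cluster_means <- aggregate(. ~ cluster, data = segment_toy, FUN = mean)
print(cluster_means)
```

Comparing the rows of the result shows what distinguishes one segment from another.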

    Case Study: National Occupational mean wage

    library(tibble)
    rownames_to_column(as.data.frame(oes), var = 'occupation')  # moves the row names of the data.frame into a new column; var gives the name of that column
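
What the call does can be reproduced in base R. This is only an illustrative equivalent, not the tibble implementation; the matrix oes_toy and its wage figures are hypothetical.

```r
# Hypothetical wage matrix: occupations as row names, years as columns.
oes_toy <- matrix(
  c(50000, 52000,
    30000, 31000),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Management", "Food Preparation"), c("2015", "2016"))
)

oes_df <- as.data.frame(oes_toy)

# Copy the row names into a regular first column, then drop them.
oes_named <- cbind(occupation = rownames(oes_df), oes_df)
rownames(oes_named) <- NULL

print(oes_named)
```

Having the occupation as an ordinary column makes it available to functions that ignore row names.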
