美文网首页
第10章 关联分析和序列挖掘

第10章 关联分析和序列挖掘

作者: zd200572 | 来源:发表于2021-12-10 21:55 被阅读0次

    关联分析是发现交易数据内有趣联系的一种方法,比如著名的“啤酒-尿布”。频繁序列模式挖掘,可以预测购买行为,生物序列等等。

    10.2 数据转换成事务

    链表、矩阵和数据框架转换成事务

    # 数据转换成事务
    install.packages("arules")
    library(arules)
    tr_list <- list(c("Apple", "Bread", "Cake"),
                    c("Apple", "Bread", "Milk"),
                    c("Apple", "Cake", "Milk"))
    names(tr_list) <- paste("Tr",c(1:3), sep = "")
    trans <- as(tr_list, "transactions");trans
    tr_matrix <- matrix(
      c(1,1,1,0,
        1,1,0,1,
        0,1,1,1),ncol = 4
    )
    dimnames(tr_matrix) <- list(
      paste("Tr",c(1:3), sep = ""),
      c("Apple", "Bread", "Cake", "Milk")
    )
    trans2 <- as(tr_matrix, "transactions");trans2
    Tr_df <- data.frame(
      TrID = as.factor(c(1,2,1,1,2,3,2,3,2,3)),
      Item = as.factor(c("Apple", "Milk", "Cake", "Bread",
                         "Cake", "Milk", "Apple", "Cake",
                         "Bread", "Bread"))
    )
    trans3 <- as(split(Tr_df[,"Item"], Tr_df[,"TrID"]),
                 "transactions");trans3
    

    调用as函数给每次事务都加上一个id就完成了数据向事务的类型转换。“transactions"类型来代表规则或频繁项集的事务型数据,是itemMatrix类型的延伸。

    10.3 展示事务及关联

    R的arule包使用自带的transactions类型来存储事务数据类型。

    LIST(trans) # 列表形式展示数据
    $Tr1
    [1] "Apple" "Bread" "Cake" 
    
    $Tr2
    [1] "Apple" "Bread" "Milk" 
    
    $Tr3
    [1] "Apple" "Cake"  "Milk"
    summary(trans)
    transactions as itemMatrix in sparse format with
     3 rows (elements/itemsets/transactions) and
     4 columns (items) and a density of 0.75 
    
    most frequent items:
      Apple   Bread    Cake    Milk (Other) 
          3       2       2       2       0 
    
    element (itemset/transaction) length distribution:
    sizes
    3 
    3 
    
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
          3       3       3       3       3       3 
    
    includes extended item information - examples:
      labels
    1  Apple
    2  Bread
    3   Cake
    
    includes extended transaction information - examples:
      transactionID
    1           Tr1
    2           Tr2
    3           Tr3
    inspect(trans)
        items                transactionID
    [1] {Apple, Bread, Cake} Tr1          
    [2] {Apple, Bread, Milk} Tr2          
    [3] {Apple, Cake, Milk}  Tr3 
    filter_trans <- trans[size(trans) >=3]
    inspect(filter_trans)
        items                transactionID
    [1] {Apple, Bread, Cake} Tr1          
    [2] {Apple, Bread, Milk} Tr2          
    [3] {Apple, Cake, Milk}  Tr3  
    image(trans)
    itemFrequencyPlot(trans)
    itemFrequency(trans) # 支持度的分布
        Apple     Bread      Cake      Milk 
    1.0000000 0.6666667 0.6666667 0.6666667 
    
    事务可视化
    频繁度条形图

    10.4 Apriori规则完成关联挖掘

    首先找到频繁个体项集,然后再通过广度优先搜索策略生成更大的频繁项集。

    data("Groceries")
    summary(Groceries)
    itemFrequencyPlot(Groceries,support=0.1, cex.names=0.8, topN=10)
    rules <- apriori(Groceries, parameter = list(supp=0.001, conf=0.5, 
                                                 target="rules"))
    summary(rules)
    inspect(head(rules))
        lhs                    rhs          support     confidence coverage    lift     count
    [1] {honey}             => {whole milk} 0.001118454 0.7333333  0.001525165 2.870009 11   
    [2] {tidbits}           => {rolls/buns} 0.001220132 0.5217391  0.002338587 2.836542 12   
    [3] {cocoa drinks}      => {whole milk} 0.001321810 0.5909091  0.002236909 2.312611 13   
    [4] {pudding powder}    => {whole milk} 0.001321810 0.5652174  0.002338587 2.212062 13   
    [5] {cooking chocolate} => {whole milk} 0.001321810 0.5200000  0.002541942 2.035097 13   
    [6] {cereals}           => {whole milk} 0.003660397 0.6428571  0.005693950 2.515917 36 
    rules <- sort(rules, by="confidence", decreasing = TRUE)
    inspect(head(rules))
     lhs                      rhs                    support confidence    coverage     lift count
    [1] {rice,                                                                                       
         sugar}               => {whole milk}       0.001220132          1 0.001220132 3.913649    12
    [2] {canned fish,                                                                                
         hygiene articles}    => {whole milk}       0.001118454          1 0.001118454 3.913649    11
    [3] {root vegetables,                                                                            
         butter,                                                                                     
         rice}                => {whole milk}       0.001016777          1 0.001016777 3.913649    10
    [4] {root vegetables,                                                                            
         whipped/sour cream,                                                                         
         flour}               => {whole milk}       0.001728521          1 0.001728521 3.913649    17
    [5] {butter,                                                                                     
         soft cheese,                                                                                
         domestic eggs}       => {whole milk}       0.001016777          1 0.001016777 3.913649    10
    [6] {citrus fruit,                                                                               
         root vegetables,                                                                            
         soft cheese}         => {other vegetables} 0.001016777          1 0.001016777 5.168156    10
    > 
    
    排名前10的项集

    可以通过支持度和关联度两个值来评估规则的强弱,前者表示规则的频率代表两个项集同时出现在一个事务中的概率。这两个指标仅对规则强弱判断有效,一些规则也可能是冗余的,提升度可以评估规则的质量。支持度代表了特定项集地事务数据库中的所占比例,置信度是规则的正确率,提升度是响应目标关联规则与平均响应的比值。
    Apriori是最广为人知的关联规则挖掘算法,依靠逐层地广度优先策略来生成候选项集。
    还可以调用intersectMeasure函数来获得其他有趣的指标。

    head(interestMeasure(rules,c("support", "chiSquare", "confidence",
    +                              "conviction", "cosine", "coverage",
    +                              "leverage", "lift", "oddsRatio"),
    +                      Groceries))
          support chiSquared confidence conviction     cosine    coverage     leverage     lift
    1 0.001220132   35.00650          1        Inf 0.06910260 0.001220132 0.0009083689 3.913649
    2 0.001118454   32.08603          1        Inf 0.06616070 0.001118454 0.0008326715 3.913649
    3 0.001016777   29.16615          1        Inf 0.06308175 0.001016777 0.0007569741 3.913649
    4 0.001728521   49.61780          1        Inf 0.08224854 0.001728521 0.0012868559 3.913649
    5 0.001016777   29.16615          1        Inf 0.06308175 0.001016777 0.0007569741 3.913649
    6 0.001016777   41.72398          1        Inf 0.07249042 0.001016777 0.0008200380 5.168156
      oddsRatio
    1       Inf
    2       Inf
    3       Inf
    4       Inf
    5       Inf
    6       Inf
    

    10.5 去掉冗余规则

    关联规则挖掘的两个主要限制是在支持度和置信度之间的选择,去冗余,发现这些规则中真正有意义的信息。初接触这个领域,不是太懂,先放这。

    # derep
    rules.sorted <- sort(rules, by="lift")
    subset.matrix <- is.subset(rules.sorted, rules.sorted)
    subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
    redundant <- colSums(subset.matrix, na.rm=TRUE) >= 8
    rules.pruned <- rules.sorted[!redundant]
    inspect(head(rules.pruned))
        lhs                   rhs              support confidence    coverage     lift count
    [1] {citrus fruit,                                                                      
         pip fruit,                                                                         
         bottled water}    => {whole milk} 0.001118454        0.5 0.002236909 1.956825    11
    [2] {citrus fruit,                                                                      
         pip fruit,                                                                         
         yogurt}           => {whole milk} 0.001626843        0.5 0.003253686 1.956825    16
    [3] {root vegetables,                                                                   
         rolls/buns,                                                                        
         soda}             => {whole milk} 0.002440264        0.5 0.004880529 1.956825    24
    [4] {tropical fruit,                                                                    
         other vegetables,                                                                  
         rolls/buns,                                                                        
         soda}             => {whole milk} 0.001118454        0.5 0.002236909 1.956825    11
    

    发现按原书>=1就没有规则幸存了,就看了一下,设置成4以上会有,就设置了个8。发现小内存的电脑在这步可能崩溃,因为矩阵很大,比如小云主机,在16G内存的电脑上是能成功运行的。

    10.6 关联规则的可视化

    重新把上面设置成了1800,为了可视化好看

    install.packages("arulesViz")
    library(arulesViz)
    install.packages('Rcpp')
    library(Rcpp) #发现要安装这个包,否则报:Error in stress_major(xinit, W, D, iter, tol) : 
      function 'Rcpp_precious_remove' not provided by package 'Rcpp'
    plot(rules.pruned)
    #避免重叠,抖动下
    plot(rules.pruned,shading='order', control =list(jitter=6))
    # Apriori算法生成左边为soda的新规则
    soda_rule <- apriori(data = Groceries, parameter = list(supp=0.001,conf=0.1,minlen=2),
                         appearance = list(default='rhs',lhs='soda'))
    plot(sort(soda_rule,by="lift"), method = 'graph',
         control = list(type='items'))
    plot(rules.pruned[c(1:66)],method="group") # 这里报:
    Error in stats::hclust(stats::dist(m_clust)) : 
      must have n >= 2 objects to cluster
    于是用上一步的代替啦
    
    精简规则的散点图,还挺好看,颜色深浅代表规则提升度大小,越深,提升度越高
    防重叠拉动,自动配色不错
    关联规则的图示,仅选择soda在左边的
    气球图,气球面积表示支持度,颜色深浅表示提升度
    交互式图形plot(rules.pruned[c(1:66)],interactive = TRUE)

    10.7 使用Eclat挖掘频繁项集

    Apriori算法采用广度优先策略来遍历数据库,整体耗时较长;如果数据库可以整个装入内存中,可以使用深度优先的Eclat算法,效率比前者高。

    # Eclat
    frequentest <- eclat(Groceries, parameter = list(support=0.05,maxlen=10))
    summary(frequentest)
    inspect(sort(frequentest, by="support")[1:10])
         items              support    count
    [1]  {whole milk}       0.25551601 2513 
    [2]  {other vegetables} 0.19349263 1903 
    [3]  {rolls/buns}       0.18393493 1809 
    [4]  {soda}             0.17437722 1715 
    [5]  {yogurt}           0.13950178 1372 
    [6]  {bottled water}    0.11052364 1087 
    [7]  {root vegetables}  0.10899847 1072 
    [8]  {tropical fruit}   0.10493137 1032 
    [9]  {shopping bags}    0.09852567  969 
    [10] {sausage}          0.09395018  924 
    

    Apriori算法直接也易于理解,缺点是需要多遍扫描数据库因而会产生大量候选集,支持度的计算很耗时。Eclat算法采用了等价类、深度优先遍历、求次等策略,支持度计算效率有很大改善。前者采用水平数据结构来存放事务,后者采用垂直数据结构来存放每个事务的交易ID,也从频繁项集中生成关联规则。
    FP-Growth也是应用非常广的一种关联规则挖掘算法,与Eclat算法相似,也是采用深度优先搜索策略来计算项集支持度,暂时没包支持?2021了,或许有了吧。

    10.8 生成时态事务数据

    # 时态
    install.packages("arulesSequences")
    library(arulesSequences)
    tmp_data <- list(
      c("A"),
      c("A","B","C"),
      c("D"),
      c("c","F"),
      c("A","D"),
      c("C"),
      c("B", "C"),
      c("A","E"),
      c("E","F"),
      c("A","B"),
      c("D","F"),
      c("C"),
      c("B"),
      c("E"),
      c("G"),
      c("A","F"),
      c("C"),
      c("B"),
      c("C")
    )
    # 转换为事务
    names(tmp_data) <- paste0("Tr", c(1:19), seq="")
    trans <- as(tmp_data, "transactions")
    transactionInfo(trans)$SequenceID <- c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4)
    transactionInfo(trans)$eventID<- c(10,20,30,40,50,10,20,30,40,10,20,30,40,50,
                                       10,20,30,40,50)
    trans
    transactions in sparse format with
     19 transactions (rows) and
     8 items (columns)
    inspect(head(trans))
        items   transactionID eventID
    [1] {A}     Tr1           10     
    [2] {A,B,C} Tr2           20     
    [3] {D}     Tr3           30     
    [4] {c,F}   Tr4           40     
    [5] {A,D}   Tr5           50     
    [6] {C}     Tr6           10  
    summary(trans)
    zaki <- read_baskets(con = system.file("misc", "zaki.txt",
                                           package = "arulesSequences"),
                                           info = c("sequenceID", 'eventID', "SIZE"))
    as(zaki,"data.frame")
          items sequenceID eventID SIZE
    1      {C,D}          1      10    2
    2    {A,B,C}          1      15    3
    3    {A,B,F}          1      20    3
    4  {A,C,D,F}          1      25    4
    5    {A,B,F}          2      15    3
    6        {E}          2      20    1
    7    {A,B,F}          3      10    3
    8    {D,G,H}          4      10    3
    9      {B,F}          4      20    2
    10   {A,G,H}          4      25    3
    

    arulesSequences提供了两个新的数据结构,sequences和timedsequences,用来表示纯序列和时态序列数据。

    10.9 cSPADE挖掘频繁时序模式

    等价类序列模式挖掘,是广为人知的一种频繁序列模式挖掘算法,利用垂直数据库的特性,通过ID表的交集及有效的搜索策略完成频繁序列模式的挖掘,支持对挖掘到的序列添加约束。

    s_result <- cspade(trans, parameter = list(support=0.75), control = list(verbose=TRUE))
    # Error in cspade(trans, parameter = list(support = 0.75), control = list(verbose = TRUE)) : 
    #   transactionInfo: missing 'sequenceID' and/or 'eventID'
    

    错报复有点奇怪,不得法呀!这章就到这啦!

    相关文章

      网友评论

          本文标题:第10章 关联分析和序列挖掘

          本文链接:https://www.haomeiwen.com/subject/huatfrtx.html