Study notes on "Mastering Machine Learning with R, Second Edition"


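 Itemset: a collection of one or more items in the data set.
 Support: the proportion of all transactions that contain a given itemset.
 Confidence: the conditional probability that someone who buys x (or does x) will also buy y (or do y); x is called the antecedent or left-hand side (lhs), and y the consequent or right-hand side (rhs).
 Lift: a ratio. The numerator is the support of x and y occurring together; the denominator is the probability of x and y occurring together if they were independent, that is, P(x) * P(y). Equivalently, lift equals the confidence divided by the probability of y. For example, if x and y occur together with probability 10%, x occurs with probability 20%, and y occurs with probability 30%, the lift is 10% / (20% * 30%) = 1.667.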
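
The three measures can be checked by hand on a small example. The sketch below uses a made-up list of five transactions; the item names and the `support()` helper are illustrative assumptions and are not part of arules. It also recomputes the numeric lift example from the definition.

```r
# Hypothetical toy transactions (not taken from Groceries)
transactions <- list(
  c("milk", "bread"),
  c("milk", "beer"),
  c("bread", "butter"),
  c("milk", "bread", "butter"),
  c("beer")
)

# Support: share of transactions that contain every item in `items`
support <- function(items, trans) {
  mean(vapply(trans, function(tr) all(items %in% tr), logical(1)))
}

x <- "milk"
y <- "bread"
supp_xy <- support(c(x, y), transactions)       # P(x and y)
conf_xy <- supp_xy / support(x, transactions)   # P(y | x)
lift_xy <- conf_xy / support(y, transactions)   # P(x and y) / (P(x) * P(y))

# The numeric example from the definition: 10% / (20% * 30%)
lift_text <- 0.10 / (0.20 * 0.30)

round(c(support = supp_xy, confidence = conf_xy, lift = lift_xy,
        lift_text = lift_text), 3)
```

A lift above 1 means x and y appear together more often than independence would predict; a lift below 1 means less often.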


> library(pacman)
> p_load(arules, arulesViz)
> data(Groceries)
> str(Groceries)
## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:43367] 13 60 69 78 14 29 98 24 15 29 ...
##   .. .. ..@ p       : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
##   .. .. ..@ Dim     : int [1:2] 169 9835
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':  169 obs. of  3 variables:
##   .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
##   .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
##   .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
##   ..@ itemsetInfo:'data.frame':  0 obs. of  0 variables


> itemFrequencyPlot(Groceries, topN = 10, type = "absolute")

The most frequently purchased item is "whole milk", which appears in roughly 2,500 of the 9,835 transactions.

> itemFrequencyPlot(Groceries, topN = 15)

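Beer ranks only 13th (bottled beer) and 15th (canned beer) among the most frequently purchased grocery items. Fewer than 10% of transactions include bottled beer, and the same holds for canned beer.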


> # supp: minimum support; conf: minimum confidence; maxlen: maximum number of items per rule
> rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.9, maxlen = 4))
## Apriori
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##       4  rules FALSE
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## Absolute minimum support count: 9 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [67 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
> rules
## set of 67 rules


> # Print numbers with 2 significant digits globally
> options(digits = 2)
> rules <- sort(rules, by = "lift", decreasing = T)
> inspect(rules[1:5])
##     lhs                     rhs                support confidence lift count
## [1] {liquor,                                                                
##      red/blush wine}     => {bottled beer}      0.0019       0.90 11.2    19
## [2] {root vegetables,                                                       
##      butter,                                                                
##      cream cheese }      => {yogurt}            0.0010       0.91  6.5    10
## [3] {citrus fruit,                                                          
##      root vegetables,                                                       
##      soft cheese}        => {other vegetables}  0.0010       1.00  5.2    10
## [4] {pip fruit,                                                             
##      whipped/sour cream,                                                    
##      brown bread}        => {other vegetables}  0.0011       1.00  5.2    11
## [5] {butter,                                                                
##      whipped/sour cream,                                                    
##      soda}               => {other vegetables}  0.0013       0.93  4.8    13

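The rule with the highest lift says that customers who buy liquor and red/blush wine are also very likely to buy bottled beer. Its support is only 0.0019 (19 transactions), so this purchase pattern is uncommon.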

> rules <- sort(rules, by = "confidence", decreasing = T)
> inspect(rules[1:5])
##     lhs                     rhs                support confidence lift count
## [1] {citrus fruit,                                                          
##      root vegetables,                                                       
##      soft cheese}        => {other vegetables}  0.0010          1  5.2    10
## [2] {pip fruit,                                                             
##      whipped/sour cream,                                                    
##      brown bread}        => {other vegetables}  0.0011          1  5.2    11
## [3] {rice,                                                                  
##      sugar}              => {whole milk}        0.0012          1  3.9    12
## [4] {canned fish,                                                           
##      hygiene articles}   => {whole milk}        0.0011          1  3.9    11
## [5] {root vegetables,                                                       
##      butter,                                                                
##      rice}               => {whole milk}        0.0010          1  3.9    10


> # Build a cross-tabulation from the data set
> tab <- crossTable(Groceries)
> # Inspect how often items are purchased together
> tab[1:3, 1:3]
##             frankfurter sausage liver loaf
## frankfurter         580      99          7
## sausage              99     924         10
## liver loaf            7      10         50

Across the 9,835 transactions, liver loaf was purchased only 50 times. Of the 924 transactions that contain sausage, 10 also contain liver loaf.
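
The counts in the cross-table are enough to recompute support, confidence, and lift for a pair of items. A minimal sketch using the frankfurter and sausage counts printed above (the variable names are illustrative):

```r
n_trans   <- 9835  # total number of transactions in Groceries
n_frank   <- 580   # tab["frankfurter", "frankfurter"]
n_sausage <- 924   # tab["sausage", "sausage"]
n_both    <- 99    # tab["frankfurter", "sausage"]

supp_both  <- n_both / n_trans    # P(frankfurter and sausage)
confidence <- n_both / n_frank    # P(sausage | frankfurter)
lift <- supp_both / ((n_frank / n_trans) * (n_sausage / n_trans))

round(c(support = supp_both, confidence = confidence, lift = lift), 3)
```

The lift comes out at about 1.8, so the two items are bought together somewhat more often than independence would suggest.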

> tab["bottled beer", "bottled beer"]
## [1] 792

Check how many transactions that contain bottled beer also contain canned beer:

> tab["bottled beer", "canned beer"]
## [1] 26

Generate association rules with bottled beer on the right-hand side:

> beer.rules <- apriori(data = Groceries,
+                       # Tune these parameters as needed
+                       parameter = list(support=0.0015,confidence=0.3),
+                       appearance = list(default="lhs",rhs="bottled beer"))
## Apriori
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5  0.0015      1
##  maxlen target   ext
##      10  rules FALSE
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## Absolute minimum support count: 14 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [153 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [4 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
> beer.rules
## set of 4 rules


> beer.rules <- sort(beer.rules, decreasing = T, by = "lift")
> inspect(beer.rules)
##     lhs                                  rhs            support confidence
## [1] {liquor,red/blush wine}           => {bottled beer} 0.0019  0.90      
## [2] {liquor}                          => {bottled beer} 0.0047  0.42      
## [3] {soda,red/blush wine}             => {bottled beer} 0.0016  0.36      
## [4] {other vegetables,red/blush wine} => {bottled beer} 0.0015  0.31      
##     lift count
## [1] 11.2 19   
## [2]  5.2 46   
## [3]  4.4 16   
## [4]  3.8 15


> plot(beer.rules, method = "graph", measure = "lift", shading = "confidence")



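There are two main approaches to designing a recommender system: collaborative filtering and content-based recommendation. The techniques covered here are:
 User-based collaborative filtering
 Item-based collaborative filtering
 Singular value decomposition
 Principal component analysis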

4.1 User-based collaborative filtering
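
As a rough illustration of the idea only, the sketch below predicts one missing rating from similar users. It assumes a hypothetical 3 x 3 rating matrix, cosine similarity over co-rated items, and a similarity-weighted mean over positively similar neighbours. The UBCF method in recommenderlab additionally normalizes ratings and restricts itself to a fixed number of nearest neighbours, so this is not its implementation.

```r
# Hypothetical ratings: rows are users, columns are jokes, NA = not rated
ratings <- matrix(c( 4,  3, NA,
                     5,  2,  4,
                    -3,  1, -2),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("u1", "u2", "u3"), c("j1", "j2", "j3")))

# Cosine similarity computed over the items both users have rated
cosine <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)
  sum(a[ok] * b[ok]) / (sqrt(sum(a[ok]^2)) * sqrt(sum(b[ok]^2)))
}

target <- "u1"
others <- setdiff(rownames(ratings), target)
sims <- sapply(others, function(u) cosine(ratings[target, ], ratings[u, ]))

# Keep positively similar neighbours and predict u1's rating of j3
# as the similarity-weighted mean of their j3 ratings
nb <- names(sims)[sims > 0]
pred_j3 <- sum(sims[nb] * ratings[nb, "j3"]) / sum(sims[nb])
pred_j3
```

Here only u2 is positively similar to u1, so the prediction for j3 is simply u2's rating.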


4.2 Item-based collaborative filtering


4.3 Singular value decomposition and principal component analysis




> p_load(recommenderlab, ggplot2, magrittr)
> data("Jester5k")
> str(Jester5k)
## Formal class 'realRatingMatrix' [package "recommenderlab"] with 2 slots
##   ..@ data     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. ..@ i       : int [1:362106] 0 1 2 3 4 5 6 7 8 9 ...
##   .. .. ..@ p       : int [1:101] 0 3314 6962 10300 13442 18440 22513 27512 32512 35685 ...
##   .. .. ..@ Dim     : int [1:2] 5000 100
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : chr [1:5000] "u2841" "u15547" "u15221" "u15573" ...
##   .. .. .. ..$ : chr [1:100] "j1" "j2" "j3" "j4" ...
##   .. .. ..@ x       : num [1:362106] 7.91 -3.2 -1.7 -7.38 0.1 0.83 2.91 -2.77 -3.35 -1.99 ...
##   .. .. ..@ factors : list()
##   ..@ normalize: NULL


> # All ratings given by user 10
> as(Jester5k[10, ], "list")
## $u12843
##    j1    j2    j3    j4    j5    j6    j7    j8    j9   j10   j11   j12   j13 
## -1.99 -6.89  2.09 -4.42 -4.90  2.43 -3.06  3.98 -1.46  0.68  3.06  2.28 -2.91 
##   j14   j15   j16   j17   j18   j19   j20   j21   j22   j23   j24   j25   j26 
## -5.44 -3.88 -0.63 -7.96 -1.70 -0.73  1.17 -6.94 -7.33  2.09  4.76 -7.09 -7.14 
##   j27   j28   j29   j30   j31   j32   j33   j34   j35   j36   j37   j38   j39 
##  0.15  1.55 -6.02  2.09 -5.92  1.55 -1.60 -3.59 -5.39  2.14 -3.54 -0.97 -8.11 
##   j40   j41   j42   j43   j44   j45   j46   j47   j48   j49   j50   j51   j52 
##  2.18 -6.26 -7.14 -3.11 -3.40  3.83 -5.78  1.12 -5.68 -5.78 -7.52 -6.50 -7.38 
##   j53   j54   j55   j56   j57   j58   j59   j60   j61   j62   j63   j64   j65 
## -7.52 -2.62 -2.14 -5.92 -3.25  0.15 -1.55 -3.59  1.46  1.70  0.24 -3.88 -7.23 
##   j66   j67   j68   j69   j70   j71   j72   j73   j74   j75   j76   j77   j78 
##  5.05 -3.69  1.60 -1.31 -1.02  1.84  1.80  1.75 -1.17  1.75  1.75  1.65 -3.79 
##   j79   j80   j81   j82   j83   j84   j85   j86   j87   j88   j89   j90   j91 
##  1.99  1.99 -5.58  2.82 -3.30  0.97 -5.24  2.38  4.13  2.43 -0.92  1.80  1.94 
##   j92   j93   j94   j95   j96   j97   j98   j99  j100 
##  2.77  1.75 -1.41  2.67  2.04 -0.29 -6.31  0.24 -6.50
> # Mean rating given by user 10
> rowMeans(Jester5k[10, ])
## u12843 
##   -1.6
> # Mean rating of joke 1
> colMeans(Jester5k[, 1])
##   j1 
## 0.92


> getRatings(Jester5k) %>% tibble::enframe() %>% ggplot(aes(value)) + 
+     geom_histogram(binwidth = 0.3, col = "white") + theme_bw() + 
+     labs(title = "Histogram of getRatings(Jester5k)", x = "", y = "")


> Jester5k %>% normalize() %>% getRatings() %>% 
+     tibble::enframe() %>% ggplot(aes(value)) + 
+     geom_histogram(binwidth = 0.3, col = "white") + 
+     theme_bw() + 
+     labs(title = "Normalized Jester5k", x = "", y = "")
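
The second histogram relies on `normalize()`, which by default is understood to center each user's ratings on that user's own mean. A base-R sketch of that centering step on a hypothetical rating matrix (this mimics the idea and does not call recommenderlab):

```r
# Hypothetical ratings: rows are users, columns are jokes, NA = not rated
ratings <- matrix(c( 5, NA,  1,
                    -2,  4, NA,
                     0,  2,  4),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("u", 1:3), paste0("j", 1:3)))

# Mean rating per user, ignoring missing ratings
user_means <- rowMeans(ratings, na.rm = TRUE)

# Subtract each user's mean from that user's ratings; NA stays NA
centered <- sweep(ratings, 1, user_means)

centered
rowMeans(centered, na.rm = TRUE)  # every user now averages 0
```

Centering removes the difference between generous and harsh raters, which is why the normalized histogram is concentrated around zero.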



> set.seed(123)
> e <- evaluationScheme(Jester5k,method="split",train=0.8,
+                       # Use 15 ratings per user for prediction; the rest are held out to compute the error
+                       given=15,
+                       # Ratings >= 5 count as "good" ratings
+                       goodRating=5)
> e
## Evaluation scheme with 15 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=5.000000
## Data set: 5000 x 100 rating matrix of class 'realRatingMatrix' with 362106 ratings.


> recommenderRegistry$get_entries(dataType = "realRatingMatrix")
## $ALS_realRatingMatrix
## Recommender method: ALS for realRatingMatrix
## Description: Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm.
## Reference: Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan (2008). Large-Scale Parallel Collaborative Filtering for the Netflix Prize, 4th Int'l Conf. Algorithmic Aspects in Information and Management, LNCS 5034.
## Parameters:
##   normalize lambda n_factors n_iterations min_item_nr seed
## 1      NULL    0.1        10           10           1 NULL
## $ALS_implicit_realRatingMatrix
## Recommender method: ALS_implicit for realRatingMatrix
## Description: Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm.
## Reference: Yifan Hu, Yehuda Koren, Chris Volinsky (2008). Collaborative Filtering for Implicit Feedback Datasets, ICDM '08 Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 263-272.
## Parameters:
##   lambda alpha n_factors n_iterations min_item_nr seed
## 1    0.1    10        10           10           1 NULL
## $IBCF_realRatingMatrix
## Recommender method: IBCF for realRatingMatrix
## Description: Recommender based on item-based collaborative filtering.
## Reference: NA
## Parameters:
##    k   method normalize normalize_sim_matrix alpha na_as_zero
## 1 30 "Cosine"  "center"                FALSE   0.5      FALSE
## $LIBMF_realRatingMatrix
## Recommender method: LIBMF for realRatingMatrix
## Description: Matrix factorization with LIBMF via package recosystem (https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html).
## Reference: NA
## Parameters:
##   dim costp_l2 costq_l2 nthread
## 1  10     0.01     0.01       1
## $POPULAR_realRatingMatrix
## Recommender method: POPULAR for realRatingMatrix
## Description: Recommender based on item popularity.
## Reference: NA
## Parameters:
##   normalize
## 1  "center"
##                                                      aggregationRatings
## 1 new("standardGeneric", .Data = function (x, na.rm = FALSE, dims = 1, 
##                                                   aggregationPopularity
## 1 new("standardGeneric", .Data = function (x, na.rm = FALSE, dims = 1, 
## $RANDOM_realRatingMatrix
## Recommender method: RANDOM for realRatingMatrix
## Description: Produce random recommendations (real ratings).
## Reference: NA
## Parameters: None
## $RERECOMMEND_realRatingMatrix
## Recommender method: RERECOMMEND for realRatingMatrix
## Description: Re-recommends highly rated items (real ratings).
## Reference: NA
## Parameters:
##   randomize minRating
## 1         1        NA
## $SVD_realRatingMatrix
## Recommender method: SVD for realRatingMatrix
## Description: Recommender based on SVD approximation with column-mean imputation.
## Reference: NA
## Parameters:
##    k maxiter normalize
## 1 10     100  "center"
## $SVDF_realRatingMatrix
## Recommender method: SVDF for realRatingMatrix
## Description: Recommender based on Funk SVD with gradient descend (https://sifter.org/~simon/journal/20061211.html).
## Reference: NA
## Parameters:
##    k gamma lambda min_epochs max_epochs min_improvement normalize verbose
## 1 10 0.015  0.001         50        200           1e-06  "center"   FALSE
## $UBCF_realRatingMatrix
## Recommender method: UBCF for realRatingMatrix
## Description: Recommender based on user-based collaborative filtering.
## Reference: NA
## Parameters:
##     method nn sample normalize
## 1 "cosine" 25  FALSE  "center"

5.1 Modeling and evaluating recommender systems


> ubcf <- Recommender(getData(e, "train"), "UBCF")
> ibcf <- Recommender(getData(e, "train"), "IBCF")
> svd <- Recommender(getData(e, "train"), "SVD")
> popular <- Recommender(getData(e, "train"), "POPULAR")
> random <- Recommender(getData(e, "train"), "RANDOM")


> user_pred <- predict(ubcf, getData(e, "known"), type = "ratings")
> item_pred <- predict(ibcf, getData(e, "known"), type = "ratings")
> svd_pred <- predict(svd, getData(e, "known"), type = "ratings")
> pop_pred <- predict(popular, getData(e, "known"), type = "ratings")
> rand_pred <- predict(random, getData(e, "known"), type = "ratings")


> error <- data.frame(
+        UBCF = calcPredictionAccuracy(user_pred, getData(e, "unknown")), 
+        IBCF = calcPredictionAccuracy(item_pred, getData(e, "unknown")), 
+        SVD = calcPredictionAccuracy(svd_pred, getData(e, "unknown")), 
+        Popular = calcPredictionAccuracy(pop_pred, getData(e, "unknown")), 
+        Random = calcPredictionAccuracy(rand_pred, getData(e, "unknown")))
> print(error)
##      UBCF IBCF  SVD Popular Random
## RMSE  4.7  5.3  4.7     4.6    6.4
## MSE  21.6 28.1 22.0    20.8   41.0
## MAE   3.7  4.2  3.7     3.6    5.0
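
For reference, the three error measures in the table can be written out in a few lines of base R. The rating vectors below are hypothetical and only illustrate the formulas:

```r
actual    <- c(3.5, -1.0, 7.2, 0.4)  # hypothetical held-out ratings
predicted <- c(2.0,  0.5, 5.0, 1.4)  # hypothetical predicted ratings

err  <- predicted - actual
mse  <- mean(err^2)       # mean squared error
rmse <- sqrt(mse)         # root mean squared error
mae  <- mean(abs(err))    # mean absolute error

round(c(RMSE = rmse, MSE = mse, MAE = mae), 3)
```

Because MSE is the square of RMSE, the two rows of the table always rank the algorithms identically; the small mismatches in the printed values come from rounding to two significant digits.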


> # Build the list of algorithms to compare
> algorithms <- list(POPULAR = list(name = "POPULAR"), UBCF = list(name = "UBCF"), 
+     IBCF = list(name = "IBCF"))
> algorithms
## $POPULAR
## $POPULAR$name
## [1] "POPULAR"
## $UBCF
## $UBCF$name
## [1] "UBCF"
## $IBCF
## $IBCF$name
## [1] "IBCF"
> # Compare top-5, top-10, and top-15 joke recommendations; the log shows each algorithm's run time
> evlist <- evaluate(e, algorithms, n = c(5, 10, 15))
## POPULAR run fold/sample [model time/prediction time]
##   1  [0.06sec/2.2sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.04sec/4.8sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.31sec/0.25sec]
> # Check how each technique performs
> set.seed(123)
> avg(evlist)
## $POPULAR
##     TP  FP FN TN precision recall  TPR   FPR
## 5  2.0 3.0 12 68      0.39   0.19 0.19 0.042
## 10 3.7 6.3 10 65      0.37   0.34 0.34 0.086
## 15 5.2 9.8  9 61      0.34   0.45 0.45 0.135
## $UBCF
##     TP   FP   FN TN precision recall  TPR   FPR
## 5  1.9  3.1 12.3 68      0.38   0.18 0.18 0.043
## 10 3.5  6.5 10.7 64      0.35   0.32 0.32 0.088
## 15 5.0 10.0  9.2 61      0.33   0.43 0.43 0.137
## $IBCF
##      TP   FP FN TN precision recall   TPR  FPR
## 5  0.75  4.2 13 67      0.15  0.055 0.055 0.06
## 10 1.51  8.5 13 62      0.15  0.109 0.109 0.12
## 15 2.29 12.7 12 58      0.15  0.166 0.166 0.18
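
The ratio columns follow from the confusion counts. A small sketch with hypothetical counts for a single user (the `confusion_metrics()` helper is illustrative, not a recommenderlab function):

```r
# Precision, recall (= TPR) and FPR from confusion-matrix counts
confusion_metrics <- function(TP, FP, FN, TN) {
  c(precision = TP / (TP + FP),   # share of recommended items that are good
    recall    = TP / (TP + FN),   # share of good items that were recommended
    FPR       = FP / (FP + TN))   # share of not-good items that were recommended
}

# Hypothetical user: 5 recommendations, 2 of them good, 8 good items missed
m <- confusion_metrics(TP = 2, FP = 3, FN = 8, TN = 72)
round(m, 3)
```

The values printed by `avg()` appear to be averages over users, so a ratio recomputed from the averaged counts in the table need not match the reported precision or recall exactly.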


> par(mfrow = c(1, 2))
> # annotate: whether to print the values next to the curves
> plot(evlist, legend = "topleft", annotate = T)
> plot(evlist, "prec", legend = "bottomright", annotate = T)


> # Build a popularity-based recommender on the full data set
> R1 <- Recommender(Jester5k, method = "POPULAR")
> R1
## Recommender of type 'POPULAR' for 'realRatingMatrix' 
## learned using 5000 users.
> # Top-5 recommendations for the first 2 users
> recommend <- predict(R1, Jester5k[1:2], n = 5) %>% as("list")
> recommend
## $u2841
## [1] "j89" "j72" "j76" "j88" "j83"
## $u15547
## [1] "j89" "j93" "j76" "j88" "j91"
> # Predicted ratings of 3 jokes for 10 users (NA where no prediction is returned)
> rating <- predict(R1, Jester5k[300:309], type = "ratings") %>% as("matrix")
> rating[, 71:73]
##          j71 j72    j73
## u7628  -2.04 1.5 -0.291
## u8714     NA  NA     NA
## u24213 -2.94  NA -1.184
## u13301  2.39 5.9  4.142
## u10959    NA  NA     NA
## u23430 -0.43 3.1     NA
## u11167 -1.72 1.8  0.033
## u4705  -1.20 2.3  0.552
## u24469 -1.58 2.0  0.169
## u13534 -1.55 2.0     NA


> # Binarize the ratings: ratings >= 5 become 1, ratings < 5 become 0
> jester.bin <- binarize(Jester5k, minRating = 5)
> # Keep only users who have more than 10 ratings equal to 1
> jester.bin <- jester.bin[rowCounts(jester.bin) > 10]
> jester.bin
## 3054 x 100 rating matrix of class 'binaryRatingMatrix' with 84722 ratings.
> # Use 5-fold cross-validation
> set.seed(123)
> e.bin <- evaluationScheme(jester.bin, method = "cross-validation", k = 5, given = 10)
> # Compare three techniques
> algorithms.bin <- list(random = list(name = "RANDOM", param = NULL), popular = list(name = "POPULAR", 
+     param = NULL), UBCF = list(name = "UBCF"))
> # Fit and evaluate the models
> result.bin <- evaluate(e.bin, algorithms.bin, n = c(5, 10, 15))
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.1sec] 
##   2  [0sec/0.09sec] 
##   3  [0sec/0.08sec] 
##   4  [0sec/0.09sec] 
##   5  [0sec/0.09sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0sec/0.66sec] 
##   2  [0sec/0.65sec] 
##   3  [0sec/0.67sec] 
##   4  [0sec/0.72sec] 
##   5  [0sec/0.71sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/2.3sec] 
##   2  [0sec/2.5sec] 
##   3  [0sec/2.3sec] 
##   4  [0sec/2.4sec] 
##   5  [0sec/2.9sec]
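
The binarization and filtering steps used to build `jester.bin` can be mimicked in base R. The sketch below works on a hypothetical 2 x 3 matrix, treats missing ratings as 0 (assumed to match what `binarize()` does), and uses a cutoff of more than 1 good rating instead of more than 10 so the toy data stays small:

```r
# Hypothetical ratings: rows are users, columns are jokes, NA = not rated
ratings <- matrix(c(7.9, -3.2,  NA,
                    5.0,  4.9, 6.1),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("u1", "u2"), c("j1", "j2", "j3")))

# TRUE where a rating exists and is at least 5; missing ratings become FALSE
binary <- !is.na(ratings) & ratings >= 5

# Number of "good" ratings per user
rowSums(binary)

# Keep only users with more than 1 good rating
kept <- binary[rowSums(binary) > 1, , drop = FALSE]
kept
```

Filtering on the row counts ensures every remaining user has enough positive ratings to supply the `given` items during evaluation.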


> par(mfrow = c(1, 2))
> plot(result.bin, legend = "topleft")
> plot(result.bin, "prec", legend = "bottomright")



> p_load(TraMineR, dplyr)
> df <- read.csv("./data_set/data-master/sequential.csv")
> str(df)
## 'data.frame':    5000 obs. of  9 variables:
##  $ Cust_Segment: Factor w/ 4 levels "Segment1","Segment2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Purchase1   : Factor w/ 7 levels "Product_A","Product_B",..: 1 2 7 3 1 4 1 4 4 4 ...
##  $ Purchase2   : Factor w/ 8 levels "","Product_A",..: 2 1 3 1 1 1 2 1 4 7 ...
##  $ Purchase3   : Factor w/ 8 levels "","Product_A",..: 1 1 3 1 1 1 1 1 4 4 ...
##  $ Purchase4   : Factor w/ 8 levels "","Product_A",..: 1 1 4 1 1 1 1 1 4 4 ...
##  $ Purchase5   : Factor w/ 8 levels "","Product_A",..: 1 1 3 1 1 1 1 1 5 4 ...
##  $ Purchase6   : Factor w/ 8 levels "","Product_A",..: 1 1 3 1 1 1 1 1 5 7 ...
##  $ Purchase7   : Factor w/ 8 levels "","Product_A",..: 1 1 3 1 1 1 1 1 6 8 ...
##  $ Purchase8   : Factor w/ 8 levels "","Product_A",..: 1 1 8 1 1 1 1 1 6 7 ...

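 Cust_Segment: a factor variable giving the customer segment.
 Eight discrete purchase events, Purchase1 through Purchase8: a customer can record up to eight purchases, and their order matters. Each purchase variable stores the name of the product bought, with an empty string where no purchase was made. There are seven products, Product_A through Product_G.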

> # Number of customers in each segment
> table(df$Cust_Segment)
## Segment1 Segment2 Segment3 Segment4 
##     2900      572      554      974
> # Counts of the first purchase by product
> table(df$Purchase1)
## Product_A Product_B Product_C Product_D Product_E Product_F Product_G 
##      1451       765       659      1060       364       372       329
> # Total purchase counts per product; the unlabeled 22390 counts the empty (no purchase) entries
> table(unlist(df[, -1]))
## Product_A Product_B Product_C Product_D Product_E Product_F Product_G 
##      3855      3193      3564      3122      1688      1273       915 
##     22390
> # Frequency of each first-purchase / second-purchase sequence
> df %>% count(Purchase1, Purchase2) %>% arrange(desc(n)) %>% head()
## # A tibble: 6 x 3
##   Purchase1 Purchase2       n
##   <fct>     <fct>       <int>
## 1 Product_A "Product_A"   548
## 2 Product_D ""            548
## 3 Product_B ""            346
## 4 Product_C "Product_C"   345
## 5 Product_B "Product_B"   291
## 6 Product_D "Product_D"   281


> # Convert the data to a sequence object; xtstep sets the tick-mark spacing used by the plotting functions
> seq <- seqdef(df[, -1], xtstep = 1)
> head(seq)
##   Sequence                                                                       
## 1 Product_A-Product_A------                                                      
## 2 Product_B-------                                                               
## 3 Product_G-Product_B-Product_B-Product_C-Product_B-Product_B-Product_B-Product_G
## 4 Product_C-------                                                               
## 5 Product_A-------                                                               
## 6 Product_D-------


> seqiplot(seq)


> seqdplot(seq)


> seqdplot(seq, group = df$Cust_Segment)


> seqmsplot(seq, group = df$Cust_Segment)


> seqmtplot(seq, group = df$Cust_Segment)


> seqE <- seqecreate(seq)
> sub.seq <- seqefsub(seqE, pMinSupport = 0.05)
> plot(sub.seq[1:10], col = "dodgerblue")

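This plot shows the relative frequency (as a percentage) of event subsequences across the eight transition states. To simplify, for example to use only the first two transitions, index the sequence object passed to seqecreate().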

> # With time.varying = TRUE, a separate transition matrix is built for each time position instead
> seq.mat <- seqtrate(seq)
> options(digits = 2)
> seq.mat[2:4, 1:3]
##                [-> ] [-> Product_A] [-> Product_B]
## [Product_A ->]  0.19          0.417          0.166
## [Product_B ->]  0.26          0.113          0.475
## [Product_C ->]  0.19          0.058          0.041


> seq.mat[, 1]
##          [ ->] [Product_A ->] [Product_B ->] [Product_C ->] [Product_D ->] 
##           1.00           0.19           0.26           0.19           0.33 
## [Product_E ->] [Product_F ->] [Product_G ->] 
##           0.18           0.25           0.41
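
A transition-rate matrix of this kind can be sketched in base R by counting how often state i at one position is followed by state j at the next, then dividing each row by its total. The three sequences below are hypothetical, and this only illustrates the idea behind `seqtrate()` without reproducing its handling of missing states:

```r
# Hypothetical sequences: one row per customer, one column per position
seqs <- rbind(c("A", "A", "A"),
              c("A", "B", "B"),
              c("B", "A", "A"))

# Pair each state with the state that follows it
from <- as.vector(seqs[, -ncol(seqs)])
to   <- as.vector(seqs[, -1])

# Count the transitions and convert each row to proportions
counts <- table(from, to)
rates  <- prop.table(counts, margin = 1)
rates
```

Each row sums to 1, so an entry reads as the probability of moving to the column state given the row state.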




