Apriori算法

作者: Jessiedududu | 来源:发表于2018-11-16 15:59 被阅读0次

相关问题——啤酒与尿布

TID	Items
1	Bread, Milk
2	Bread, Diaper, Beer, Eggs
3	Milk, Diaper, Beer, Coke
4	Bread, Milk, Diaper, Beer
5	Bread, Milk, Diaper, Coke

关联规则是形如 $X{\rightarrow}Y$ 的蕴含表达式，其中X和Y是不想交的项集。关联规则的强度可以用它的支持度(support)和置信度(confidence)来度量。支持度确定规则可以数据集的频繁程度，置信度确定Y在包含X的交易中出现的频繁程度。
支持度形式定义： $s(X{\rightarrow}Y)=\frac{{\sigma}(X{\cup}Y)}{N}$
置信度形式定义： $c(X{\rightarrow}Y)=\frac{{\sigma}(X{\cup}Y)}{{\sigma}(X)}$
关联规则挖掘的目标：已知所有交易的集合T和购物篮数据所有项集合I，找到所有满足以下条件的规则

support ${\ge}$ minsupthreshold
confidence ${\ge}$ minconf threshold

关联规则挖掘任务分解为如下两个主要任务：

频繁项集的产生，目标是发现满足最小支持度阈值的所有项集，称作频繁项集。
规则的产生，目标是从上一步发现的频繁项集中提取所有高置信度的规则，称作强规则。

最简单的方法是通过暴力搜索，枚举所有可能项集。 $I={\{a, b, c, d, e\}}$

Apriori.png

但是其计算量过大，并不现实，为了降低产生频繁项集的计算复杂度，利用支持度对候选项集剪枝。

Apriori定律1：如果一个集合是频繁项集，则它的所有子集都是频繁项集。
Apriori定律2：如果一个集合不是频繁项集，则它的所有超集都不是频繁项集。

当{A, B}是非频繁集时，所有包含它的超集也是非频繁的。

Apriori2.png

Apriori算法

Let $k=1$
Generate frequent itemsets of length $k$
Repeat until no new frequent itemsets are identified
- Generate length $(k+1)$ candidate itemsets from length $k$ frequent itemsets 排好序
- Prune candidate itemsets containing subsets of length $k+1$ that are infrequent 删除掉包含非 $L_k$ 子集的候选集
- Count the support of each candidate by scanning the database
- Eliminate candidates that are infrequent, leaving only those that are frequent