美文网首页
机器学习学习笔记--关联算法

机器学习学习笔记--关联算法

作者: 松爱家的小秦 | 来源:发表于2017-12-09 16:05 被阅读0次

    关联是机器学习中研究的一种问题,研究事物的关系,其中啤酒与尿布的故事是这个问题的典型。这里要引入几个概念:

    1.支持度(Support)

    支持度表示项集{X,Y}在总项集里出现的概率。公式为:

    Support(X→Y) = P(X,Y) / P(I) = P(X∪Y) / P(I) = num(XUY) / num(I)

    其中,I表示总事务集。num()表示求事务集里特定项集出现的次数。

    比如,num(I)表示总事务集的个数

    num(X∪Y)表示含有{X,Y}的事务集的个数(个数也叫次数)。

    2.置信度 (Confidence)

    置信度表示在先决条件X发生的情况下,由关联规则”X→Y“推出Y的概率。即在含有X的项集中,含有Y的可能性,公式为:

    Confidence(X→Y) = P(Y|X)  = P(X,Y) / P(X) = P(XUY) / P(X)

    3.提升度(Lift)

    提升度表示含有X的条件下,同时含有Y的概率,与Y总体发生的概率之比。

    Lift(X→Y) = P(Y|X) / P(Y)

    一、Apriori算法

    Apriori 算法是一种最有影响力的挖掘布尔关联规则的频繁项集的 算法,它是由Rakesh Agrawal 和RamakrishnanSkrikant 提出的。

    def createC1(dataSet):

    C1 = []

    for transaction in dataSet:

    for item in transaction:

    if [item] not in C1:

    C1.append([item])

    C1.sort()

    return map(frozenset, C1)

    def scanD(D, Ck, minSupport):

    ssCnt = {}

    for tid in D:

    for can in Ck:

    if can.issubset(tid):

    ssCnt[can] = ssCnt.get(can, 0) + 1

    numItems = float(len(D))

    retList = []

    supportData = {}

    for key in ssCnt:

    support = ssCnt[key] / numItems

    if support >= minSupport:

    retList.insert(0, key)

    supportData[key] = support

    return retList, supportData

    def aprioriGen(Lk, k):

    retList = []

    lenLk = len(Lk)

    for i in range(lenLk):

    for j in range(i + 1, lenLk):

    L1 = list(Lk[i])[: k - 2];

    L2 = list(Lk[j])[: k - 2];

    L1.sort();

    L2.sort()

    if L1 == L2:

    retList.append(Lk[i] | Lk[j])

    return retList

    def apriori(dataSet, minSupport=0.5):

    C1 = createC1(dataSet)

    D = map(set, dataSet)

    L1, suppData = scanD(D, C1, minSupport)

    L = [L1]

    k = 2

    while (len(L[k - 2]) > 0):

    Ck = aprioriGen(L[k - 2], k)

    Lk, supK = scanD(D, Ck, minSupport)

    suppData.update(supK)

    L.append(Lk)

    k += 1

    return L, suppData

    def calcConf(freqSet, H, supportData, brl, minConf=0.7):

    prunedH = []

    for conseq in H:

    conf = supportData[freqSet] / supportData[freqSet - conseq]

    if conf >= minConf:

    print freqSet - conseq, '-->', conseq, 'conf:', conf

    brl.append((freqSet - conseq, conseq, conf))

    prunedH.append(conseq)

    return prunedH

    def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):

    m = len(H[0])

    if len(freqSet) > m + 1:

    Hmp1 = aprioriGen(H, m + 1)

    Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)

    if len(Hmp1) > 1:

    rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)

    def generateRules(L, supportData, minConf=0.7):

    bigRuleList = []

    for i in range(1, len(L)):

    for freqSet in L[i]:

    H1 = [frozenset([item]) for item in freqSet]

    if i > 1:

    rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)

    else:

    calcConf(freqSet, H1, supportData, bigRuleList, minConf)

    return bigRuleList

    if __name__ == '__main__':

    myDat = [ [ 1, 3, 4 ], [ 2, 3, 5 ], [ 1, 2, 3, 5 ], [ 2, 5 ] ]

    L, suppData = apriori(myDat, 0.5)

    rules = generateRules(L, suppData, minConf=0.7)

    print 'rules:\n', rules

    2.FP-growth算法

    FP-growth算法是在Apriori改造而来的加快了速度

    要使用这个算法需要下载库

    sudo pip install pyfpgrowth

    #-*- coding:utf-8 -*-

    import pyfpgrowth

    transactions = [[1, 2, 5],

    [2, 4],

    [2, 3],

    [1, 2, 4],

    [1, 3],

    [2, 3],

    [1, 3],

    [1, 2, 3, 5],

    [1, 2, 3]]

    patterns = pyfpgrowth.find_frequent_patterns(transactions, 2)#2 这个位置的参数代表支持度

    rules = pyfpgrowth.generate_association_rules(patterns, 0.7)#0.7 这个位置的参数代表置信度

    print rules

    相关文章

      网友评论

          本文标题:机器学习学习笔记--关联算法

          本文链接:https://www.haomeiwen.com/subject/lyzuixtx.html