Computing Support and Confidence for Simple Ranking Rules with Pandas


Author: 一行数师 | Published 2017-11-04 17:53 | 259 reads

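Section 1.3.4 in Chapter 1 of 《Python数据挖掘与实战》 deals with rules of the form "a customer who bought product X is likely to also buy product Y". To compute the support and confidence of such rules in Python, the author uses the numpy library, but the code is long and clunky, and reading it wears me out.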
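To pin down the two quantities before looking at any real code, here is a minimal sketch on a made-up five-row table. The data and the names `toy`, `bought_x` and `bought_y` are invented for illustration and do not come from the book. Following the convention used in the rest of this post, support is a raw count rather than a fraction.

```python
import numpy as np

# Hypothetical toy data, invented for illustration only.
# Rows are transactions; columns are items X and Y (1 = bought, 0 = not bought).
toy = np.array([
    [1, 1],
    [1, 0],
    [1, 1],
    [0, 1],
    [1, 1],
])

bought_x = toy[:, 0] == 1
bought_y = toy[:, 1] == 1

# Support of "X -> Y" as used in this post: the number of transactions
# in which both X and Y were bought.
support_xy = int(np.sum(bought_x & bought_y))

# Confidence of "X -> Y": of the transactions that contain X,
# the fraction that also contain Y.
confidence_xy = support_xy / int(np.sum(bought_x))

print(support_xy)     # 3
print(confidence_xy)  # 0.75
```

X appears in four transactions and Y accompanies it in three of them, so the rule has support 3 and confidence 3/4.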

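The long-winded code is shown below (feel free to skip it):

    # Assumes X (the 0/1 NumPy array of transactions), n_features (the number
    # of items) and features (the item names) are already defined, as in the book.
    from collections import defaultdict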
    valid_rules = defaultdict(int)
    invalid_rules = defaultdict(int)
    num_occurences = defaultdict(int)
    
    for sample in X:
        for premise in range(n_features):
            if sample[premise] == 0: continue
            # Record that the premise was bought in another transaction
            num_occurences[premise] += 1
            for conclusion in range(n_features):
                if premise == conclusion:  # It makes little sense to measure if X -> X.
                    continue
                if sample[conclusion] == 1:
                    # This person also bought the conclusion item
                    valid_rules[(premise, conclusion)] += 1
                else:
                    # This person bought the premise, but not the conclusion
                    invalid_rules[(premise, conclusion)] += 1
    support = valid_rules
    confidence = defaultdict(float)
    for premise, conclusion in valid_rules.keys():
        confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]
    
    for premise, conclusion in confidence:
        premise_name = features[premise]
        conclusion_name = features[conclusion]
        print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
        print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
        print(" - Support: {0}".format(support[(premise, conclusion)]))
        print("")
    

Now let's compute support and confidence with the Pandas library instead.

First, import the libraries and load the data file:

    import numpy as np
    import pandas as pd

    dataset_filename = 'affinity_dataset.txt'
    x = np.loadtxt(dataset_filename)
    # Column names: bread, milk, cheese, apples, bananas
    df = pd.DataFrame(x, columns=['面包', '牛奶', '奶酪', '苹果', '香蕉'])
    df.head()
    

The first five rows of the data file look like this:

[Image: output of df.head()]

A value of 0 means the item was not purchased, and a value of 1 means it was.
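If `affinity_dataset.txt` is not at hand, the rest of the walkthrough can still be followed with a random stand-in of the same shape. This is only a sketch: the row count of 100 and the seed are arbitrary assumptions, and the numbers it produces will not match the tables shown in this post.

```python
import numpy as np
import pandas as pd

# Column names: bread, milk, cheese, apples, bananas
labels = ['面包', '牛奶', '奶酪', '苹果', '香蕉']

# Stand-in data, NOT the book's dataset: 100 random transactions,
# each cell independently 0 (not purchased) or 1 (purchased).
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(100, len(labels))).astype(float)
df = pd.DataFrame(x, columns=labels)

print(df.head())
print(df.sum())  # how many transactions contain each item
```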

Next, prepare two data frames to hold the support and confidence values:

    # Create two 5x5 tables of zeros: df2 will hold support, df3 confidence.
    # Rows are the premise item, columns the conclusion item.
    labels = ['面包', '牛奶', '奶酪', '苹果', '香蕉']
    df2 = pd.DataFrame(np.zeros([5, 5]), index=labels, columns=labels)
    df3 = pd.DataFrame(np.zeros([5, 5]), index=labels, columns=labels)
    df2
    
[Image: the 5x5 table df2, all zeros]

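Step three: compute support (only 7 lines of code)

We loop over every sample (one customer's transaction) and every item in it. Items that were not purchased (value 0) are skipped. When item j was purchased (value 1), we scan the remaining items starting from j+1, so an item is never paired with itself. For every other item k that was also purchased, we add 1 to both the [j,k] and the [k,j] cell of the support table.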

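    for i in x:
        for j in range(5):
            # Skip items that were not purchased (value 0)
            if not i[j]: continue
            # Item j was purchased: scan the items after it and
            # add 1 for each one that was purchased as well
            for k in range(j + 1, 5):
                if not i[k]: continue
                df2.iloc[j, k] += 1
                df2.iloc[k, j] += 1
    # Show the support table
    df2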
    

The resulting support table:

[Image: the support table df2]
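As an aside, the same pairwise counts can be obtained without explicit loops. For a 0/1 matrix, entry (j, k) of the product of its transpose with itself counts the rows in which both item j and item k are 1. The sketch below is an alternative to the loop above, not the method used in this post, and it runs on a small hand-made sample (an assumption) rather than on `affinity_dataset.txt`.

```python
import numpy as np
import pandas as pd

# Column names: bread, milk, cheese, apples, bananas
labels = ['面包', '牛奶', '奶酪', '苹果', '香蕉']

# Hypothetical five-transaction sample, invented for illustration.
x = np.array([
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 1, 1, 0],
], dtype=float)

# (x.T @ x)[j, k] = number of rows where items j and k were both bought.
co_occurrence = x.T @ x

# The diagonal holds each item's own purchase count; the loop version
# never pairs an item with itself, so zero it to get the same table.
np.fill_diagonal(co_occurrence, 0)

support = pd.DataFrame(co_occurrence, index=labels, columns=labels)
print(support)
```

The matrix product touches every pair at once, which matters little for five items but avoids Python-level loops when there are many columns.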

Final step: compute confidence (3 lines of code)

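    # Confidence = support divided by the number of transactions
    # that contain the premise item (the row's item)
    counts = df.sum()
    for j in range(5):
        # .iloc makes the positional lookup explicit
        df3.iloc[j] = df2.iloc[j] / counts.iloc[j]

    df3.round(3)  # show the confidence table rounded to 3 decimals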
    
[Image: the confidence table df3, rounded to 3 decimals]
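The row-by-row division can also be written as a single call. `DataFrame.div` with `axis=0` aligns a Series with the row index, so each row is divided by the purchase count of its own premise item. This is a sketch on a small hand-made sample (an assumption, not the book's data); an item that was never bought would have a count of zero and produce NaN in its row.

```python
import numpy as np
import pandas as pd

# Column names: bread, milk, cheese, apples, bananas
labels = ['面包', '牛奶', '奶酪', '苹果', '香蕉']

# Hypothetical five-transaction sample, invented for illustration.
x = np.array([
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 1, 1, 0],
], dtype=float)
df = pd.DataFrame(x, columns=labels)

# Pairwise co-purchase counts with the diagonal cleared.
co_occurrence = x.T @ x
np.fill_diagonal(co_occurrence, 0)
support = pd.DataFrame(co_occurrence, index=labels, columns=labels)

# Number of transactions containing each item.
counts = df.sum()

# Divide every row by the count of that row's (premise) item.
confidence = support.div(counts, axis=0)
print(confidence.round(3))
```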

All done, hahaha, perfect~
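Since the goal is to rank rules, one possible follow-up is to flatten the confidence table into one row per rule and sort it. The sketch below is not part of the original walkthrough; it again uses a small hand-made sample, and the names `rules`, `premise` and `conclusion` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Column names: bread, milk, cheese, apples, bananas
labels = ['面包', '牛奶', '奶酪', '苹果', '香蕉']

# Hypothetical five-transaction sample, invented for illustration.
x = np.array([
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 1, 1, 0],
], dtype=float)
df = pd.DataFrame(x, columns=labels)

co_occurrence = x.T @ x
np.fill_diagonal(co_occurrence, 0)
support = pd.DataFrame(co_occurrence, index=labels, columns=labels)
confidence = support.div(df.sum(), axis=0)

# One entry per (premise, conclusion) pair.
rules = confidence.stack().rename_axis(['premise', 'conclusion'])

# Drop the meaningless "X -> X" entries on the diagonal.
premises = rules.index.get_level_values('premise')
conclusions = rules.index.get_level_values('conclusion')
rules = rules[premises != conclusions]

# Highest-confidence rules first.
top = rules.sort_values(ascending=False).head(5)
for (premise, conclusion), value in top.items():
    print("{0} -> {1}: confidence {2:.3f}".format(premise, conclusion, value))
```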
