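Section 1.3.4 of Chapter 1 of 《Python数据挖掘与实战》 deals with rules of the form "a customer who buys product X is likely to buy product Y". To compute the support and confidence of such rules in Python, the author uses the numpy library, but the code is long and clunky, and reading it is tiring.
The long code is shown below (skip it if you don't want to read it). It assumes that X, n_features and features have already been defined: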
from collections import defaultdict
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0: continue
        # Record that the premise was bought in another transaction
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:  # It makes little sense to measure if X -> X.
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[(premise, conclusion)] += 1
            else:
                # This person bought the premise, but not the conclusion
                invalid_rules[(premise, conclusion)] += 1
support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]
for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")
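As a minimal sketch of what the loop above is counting, the toy example below computes the two measures for a single rule. The data and the helper `rule_stats` are hypothetical and not from the book. As in the book's code, support is a raw count of transactions where the rule holds, and confidence is that count divided by the number of transactions containing the premise.

```python
import numpy as np

# Hypothetical toy data: 4 transactions, 2 items (column 0 = X, column 1 = Y).
toy = np.array([
    [1, 1],
    [1, 0],
    [1, 1],
    [0, 1],
])

def rule_stats(data, premise, conclusion):
    """Return (support, confidence) for the rule premise -> conclusion."""
    bought_premise = data[:, premise] == 1
    bought_both = bought_premise & (data[:, conclusion] == 1)
    support = int(bought_both.sum())        # transactions where the rule holds
    n_premise = int(bought_premise.sum())   # transactions where the premise applies
    confidence = support / n_premise if n_premise else 0.0
    return support, confidence

support, confidence = rule_stats(toy, premise=0, conclusion=1)
print(support, round(confidence, 3))
```

Here X appears in three transactions and X together with Y in two of them, so the rule X -> Y has support 2 and confidence 2/3.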
Below, the support and confidence are computed with the Pandas library instead.
First, import the libraries and load the data file
import numpy as np
import pandas as pd
dataset_filename = 'affinity_dataset.txt'
x = np.loadtxt(dataset_filename)
df = pd.DataFrame(x,columns=['面包','牛奶','奶酪','苹果','香蕉'])
df.head()
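The first five rows of the data are shown below:
(image.png: first five rows of df) The columns are 面包 (bread), 牛奶 (milk), 奶酪 (cheese), 苹果 (apples) and 香蕉 (bananas). A value of 0 means the item was not bought, and a value of 1 means it was bought.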
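Because the table is purely 0/1, column sums give the number of transactions that include each item, which is the denominator needed later for confidence. The sketch below illustrates this on a hypothetical three-row sample with English column names; it does not use the real data file.

```python
import pandas as pd

# Hypothetical three-transaction sample (not the real affinity dataset).
sample = pd.DataFrame(
    [[1, 0, 1],
     [0, 0, 1],
     [1, 1, 1]],
    columns=["bread", "milk", "cheese"],
)

# Sanity check: every entry should be 0 or 1.
is_binary = bool(sample.isin([0, 1]).all().all())

# Column sums count how many transactions include each item.
purchase_counts = sample.sum()
print(is_binary)
print(purchase_counts)
```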
Next, prepare two DataFrames to hold the support and confidence values
# Create two tables, one for the support and one for the confidence
df2 = pd.DataFrame(np.zeros([5,5]),index=['面包','牛奶','奶酪','苹果','香蕉'],columns=['面包','牛奶','奶酪','苹果','香蕉'])
df3 = pd.DataFrame(np.zeros([5,5]),index=['面包','牛奶','奶酪','苹果','香蕉'],columns=['面包','牛奶','奶酪','苹果','香蕉'])
df2
(image.png: df2, a 5×5 table of zeros)
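Step three: compute the support (only 7 lines of code)
The loop visits every sample and every item in it, skipping items that were not bought (value 0). When item j was bought (value 1), it then scans only the items after j, so an item is never paired with itself and each pair is visited once. If a later item k was also bought, the entries [j,k] and [k,j] of the support DataFrame are each increased by 1.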
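for i in x:
    for j in range(5):
        # Skip the item if it was not bought (value 0)
        if not i[j]: continue
        # Otherwise scan the remaining items and add 1 for each one that was also bought
        for k in range(j+1, 5):
            if not i[k]: continue
            df2.iloc[j,k] += 1
            df2.iloc[k,j] += 1
# Show the support table
df2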
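The resulting support table is shown below:
(image.png: support table)
Final step: compute the confidence (3 lines of code)
# Divide the support by the number of customers who bought the premise item to get the confidence
for j in range(5):
    df3.iloc[j] = df2.iloc[j] / df.sum().iloc[j]
df3.round(3)  # show the confidence table rounded to 3 decimal places
(image.png: confidence table)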
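The same two tables can also be obtained without explicit loops. The following is only a sketch of an alternative, not the approach used above or in the book: it relies on the fact that, for a 0/1 table, the matrix product of the transposed table with itself counts co-purchases for every pair of items. The small `purchases` table and its English column names are hypothetical stand-ins for df.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 0/1 purchase table.
items = ["bread", "milk", "cheese"]
purchases = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1],
     [0, 1, 0]],
    columns=items,
)

# Entry (j, k) counts the transactions in which both item j and item k were bought.
co_counts = purchases.T @ purchases

# The diagonal holds each item's own purchase count; subtract it so that
# rules of the form X -> X get a support of 0, as in the loop version.
support_table = co_counts - np.diag(np.diag(co_counts.to_numpy()))

# Divide row j by the number of transactions containing item j.
item_counts = purchases.sum()
confidence_table = support_table.div(item_counts, axis=0)

print(support_table)
print(confidence_table.round(3))
```

Row j of `confidence_table` holds the confidence of the rules j -> k. An item that was never bought would produce a division by zero in its row, so real data may need that case handled separately.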