Association Rules (Part 2)

Author: 还闹不闹 | Published 2020-06-28 16:16

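References:
Introductory Data Mining Tutorial Series (5): Implementing the Apriori Algorithm in Python
Minimalist Association Analysis in Python (Market Basket Analysis)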

Dataset:
Link: https://pan.baidu.com/s/1V_vxEriCf9ticDj8pWPDcQ
Extraction code: 14yq

#!/usr/bin/python
# coding=utf-8
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
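# Let matplotlib render Chinese characters in plots
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = ['SimHei']
# Render the minus sign correctly when a CJK font is active
mpl.rcParams['axes.unicode_minus'] = False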

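# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
# Set the total display width to 10000 characters (the default is 80)
pd.set_option('display.width', 10000)
# Align columns correctly when cells contain East Asian (wide) characters
pd.set_option('display.unicode.ambiguous_as_wide', True)
pd.set_option('display.unicode.east_asian_width', True)
# Widen NumPy's printed line width so arrays are not wrapped
np.set_printoptions(linewidth=1000)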

# Raw string so that backslashes in the Windows path are not treated as escapes
df = pd.read_excel(r'G:\rnn\relationship analysis\Online Retail.xlsx')
print(df.head())
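'''
InvoiceNo: Invoice number. Nominal, a 6-digit integer uniquely assigned to each transaction. If this code starts with the letter "C", the transaction was cancelled.
StockCode: Product (item) code. Nominal, a 5-digit integer uniquely assigned to each distinct product.
Description: Product (item) name. Nominal.
Quantity: Quantity of each product (item) per transaction. Numeric.
InvoiceDate: Invoice date and time. Numeric, the date and time at which each transaction was generated.
UnitPrice: Unit price. Numeric, the product price per unit in pounds sterling.
CustomerID: Customer number. Nominal, a 5-digit integer uniquely assigned to each customer.
Country: Country name. Nominal, the name of the country where each customer resides.
'''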

df['Description'] = df['Description'].str.strip()  # Strip leading/trailing whitespace from Description
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)  # Drop records whose InvoiceNo is missing
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]  # Drop cancelled orders, i.e. records whose InvoiceNo contains "C"

# Preprocessing: reshape the data into a basket table (one row per invoice, one column per product)
# Method 1: pivot_table
basket = (df[df['Country'] == "France"]
          .pivot_table(columns="Description", index="InvoiceNo",
                       values="Quantity", aggfunc="sum")
          .fillna(0))
print(basket.head(3))
# Method 2: groupby followed by unstack
basket2 = (df[df['Country'] == "Germany"]
           .groupby(['InvoiceNo', 'Description'])['Quantity']
           .sum().unstack().reset_index().fillna(0)
           .set_index('InvoiceNo'))
# basket holds the France data and basket2 the Germany data.
# Remember fillna(0): it turns missing values into 0, which the algorithm package requires.

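# Convert purchase quantities into 0/1 indicators
# 0: the order did not include the product named by the column
# 1: the order included the product named by the column
def encode_units(x):
    # Any quantity below 1 (zero or negative) counts as "not purchased"
    return 1 if x >= 1 else 0

# Note: on pandas >= 2.1 this method is named DataFrame.map; applymap still works but is deprecated
basket_sets = basket.applymap(encode_units)
basket_sets.drop('POSTAGE', inplace=True, axis=1)  # Remove the postage item (POSTAGE) from the baskets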

# Mine frequent itemsets and derive association rules
frequent_itemsets = apriori(basket_sets, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
print('frequent_itemsets holds the frequent itemsets (the support column is the number of orders containing the itemset / total number of orders):\n', frequent_itemsets)
print('rules is the final association rule table (antecedents, consequents, support, confidence, lift):\n', rules)

print('==================================================================================================')

# Inspect the results
# Keep rules with confidence >= 0.8 and lift >= 6, sorted by lift in descending order
head_rules = rules[ (rules['lift'] >= 6) & (rules['confidence'] >= 0.8) ].sort_values("lift",ascending = False)
print(head_rules)
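
To make the support, confidence, and lift columns of `rules` concrete, here is a minimal pure-Python sketch that computes the three metrics by hand for a single rule. The five orders, the item names, and the helper names `support` and `rule_metrics` are made up for illustration. They are not part of the Online Retail data or of the mlxtend API, and mlxtend's own implementation may differ in detail.

```python
# Toy transactions (hypothetical data): one set of items per order
orders = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_a = support(antecedent, transactions)
    s_c = support(consequent, transactions)
    s_ac = support(set(antecedent) | set(consequent), transactions)
    confidence = s_ac / s_a   # P(consequent | antecedent)
    lift = confidence / s_c   # confidence relative to the consequent's base rate
    return s_ac, confidence, lift


sup, conf, lift = rule_metrics({"diapers"}, {"beer"}, orders)
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
# prints: support=0.40 confidence=0.67 lift=1.11
```

A lift above 1 suggests the two items appear together more often than they would if purchased independently. The `lift >= 6` filter applied to `rules` thresholds this same quantity.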