Affinity Analysis

Author: overad | Published: 2018-06-04 10:19

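A common application of data mining: when a customer buys one product, the retailer can learn what else that customer is likely to want, and then place the products that most customers tend to buy together next to each other to increase sales. Once a retailer has collected enough data, affinity analysis can be run on it to determine which products are worth selling together.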

What is affinity:

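Affinity analysis determines how closely related individual samples (objects) are, based on the similarity between them. Typical applications include: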

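Offering varied services or targeted advertising to a website's users;
Selling users related items, as a way of recommending movies or products;
Finding people who are related, based on their genes.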

Product recommendation:

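Let us look at a simple product recommendation service. The idea behind it is easy to follow: if people have often bought two products together in the past, they are likely to buy them together again. Simple as it is, this idea underlies many product recommendation services.
To keep the code simple, we only consider cases where two products are bought together, for example someone who goes to the supermarket and buys both bread and milk. As a data mining example, we want to find rules of the following form: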

If a person buys product X, then they are likely to buy product Y

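Rules involving more items are more complex. For example, customers who buy sausages and burgers are more likely than other customers to buy ketchup. Such rules are not covered here.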

Loading the data:

In [2]: import numpy as np

In [3]: path = 'D:\\books\\affinity_dataset.txt'

In [4]: data = np.loadtxt(path)

In [5]: n_samples,n_features = data.shape

In [6]: n_samples
Out[6]: 100

In [7]: n_features
Out[7]: 5

In [8]: print("This dataset has {0} samples and {1} features".format(n_samples, n_features))
This dataset has 100 samples and 5 features
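
The article does not show the contents of affinity_dataset.txt. As a rough sketch, assuming the file holds one transaction per line with whitespace-separated 0/1 flags, the snippet below writes a tiny made-up file and loads it the same way. The file name, the temporary directory and the three rows are illustrative only and are not the real dataset.

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical file contents: one transaction per line, one 0/1 flag per item.
sample_text = "0 0 1 1 1\n1 1 0 1 0\n1 0 1 1 0\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "tiny_affinity_dataset.txt"
    sample_path.write_text(sample_text)
    # np.loadtxt splits on whitespace by default and returns a float array.
    tiny = np.loadtxt(str(sample_path))

print(tiny.shape)  # (3, 5)
print(tiny.dtype)  # float64
```

Because np.loadtxt returns floats by default, the values print as 0. and 1. rather than 0 and 1.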

# Inspect the data
In [11]: print(data[:5])
[[0. 0. 1. 1. 1.]
 [1. 1. 0. 1. 0.]
 [1. 0. 1. 1. 0.]
 [0. 0. 1. 1. 1.]
 [0. 1. 0. 0. 1.]]

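The output can be read row by row and column by column. Each row is one transaction: the first row (0, 0, 1, 1, 1) lists the items in the first transaction. Each column stands for one product; in this example the five products are, in order, bread, milk, cheese, apples and bananas. The first transaction therefore shows a customer who bought cheese, apples and bananas but no bread or milk.
Each feature takes one of two values, 1 or 0, indicating whether the item was bought, not how many were bought: 1 means at least one unit of the item was purchased, and 0 means the customer did not buy it.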
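
To make the row layout concrete, here is a small sketch that turns one row of flags back into item names. The name item_names is introduced only for this illustration; the row is the first transaction printed above.

```python
import numpy as np

# Item names in column order (illustrative helper list).
item_names = ["bread", "milk", "cheese", "apples", "bananas"]

# The first transaction from the printed output.
first_row = np.array([0., 0., 1., 1., 1.])

# Keep the names whose flag is 1, i.e. the items that were bought.
bought = [name for name, flag in zip(item_names, first_row) if flag == 1]
print(bought)  # ['cheese', 'apples', 'bananas']
```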

Implementing a simple ranking of rules:

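As stated earlier, we want rules of the form "if a customer buys product X, they are likely to buy product Y". The brute-force approach is to find every pair of products that was bought together in the dataset. Once the rules have been found, we still need to judge how good they are, so that we can pick the useful ones.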

There are several ways to measure the quality of a rule; the most common are support and confidence.

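Support is the number of times a rule holds in the dataset, so it measures how often the rule is borne out. In this article it is kept as a raw count rather than divided by a total.
Confidence measures how accurate a rule is: among all transactions that satisfy the premise (the "if" part of the rule), it is the fraction for which the conclusion also holds. To compute it, count the number of times the rule holds and divide by the number of transactions in which the premise holds.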
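
The two definitions can be sketched as a small function. This is only an illustration under stated assumptions: rule_stats is a hypothetical helper, not part of the article's code, and it is run on the five rows printed earlier rather than on the full dataset.

```python
import numpy as np

def rule_stats(data, premise, conclusion):
    """Return (support, confidence) for the rule premise -> conclusion."""
    has_premise = data[:, premise] == 1
    has_both = has_premise & (data[:, conclusion] == 1)
    support = int(has_both.sum())        # times the rule holds
    n_premise = int(has_premise.sum())   # times the premise holds
    confidence = support / n_premise if n_premise else 0.0
    return support, confidence

# The five transactions printed earlier.
rows = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
], dtype=float)

# Column 3 is apples, column 4 is bananas.
print(rule_stats(rows, premise=3, conclusion=4))  # (2, 0.5)
```

In these five rows apples appear four times and apples with bananas twice, hence a support of 2 and a confidence of 0.5.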

The following example shows how support and confidence are computed, using the rule "if a customer buys apples, they will also buy bananas".

In [12]: features = ['bread', 'milk', 'cheese', 'apples', 'bananas']

In [13]: num_apple_purchases = 0

# First, how many rows contain our premise: that a person is buying apples
In [14]: for sample in data:
    ...:     if sample[3] == 1: #this person bought apples
    ...:         num_apple_purchases += 1
    ...:

In [15]: print("{0} people bought Apples".format(num_apple_purchases))
36 people bought Apples

In the same way, checking whether sample[4] equals 1 tells us whether the customer bought bananas.
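
As an alternative to the explicit loop, the same kind of count can be done with NumPy operations on a whole column. The sketch below uses only the five rows printed earlier, so its numbers differ from the full dataset; the variable names are illustrative.

```python
import numpy as np

# The five transactions printed earlier.
rows = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
], dtype=float)

# Count the rows where column 4 (bananas) is 1.
banana_count = int((rows[:, 4] == 1).sum())

# Summing each column gives the number of transactions containing each item.
per_item_counts = rows.sum(axis=0).astype(int)

print(banana_count)     # 3
print(per_item_counts)  # [2 2 3 4 3]
```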

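To gather statistics for every rule in the dataset, we will create one dictionary for the cases where a rule holds and one for the cases where it does not. The keys are (premise, conclusion) tuples whose elements are the indices of the features in the feature list, not the feature names. For the single apples-to-bananas rule, two plain counters are enough: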

In [16]: rule_valid = 0

In [17]: rule_invalid = 0

In [19]: for sample in data:
    ...:     if sample[3] == 1:   #this person bought apples
    ...:         if sample[4] == 1:  #this person bought both apples and bananas
    ...:             rule_valid += 1
    ...:         else:
    ...:             rule_invalid += 1
    ...:

In [20]: print("{0} cases of the rule being valid were discovered".format(rule_valid))
21 cases of the rule being valid were discovered

In [21]: print("{0} cases of the rule being invalid were discovered".format(rule_invalid))
15 cases of the rule being invalid were discovered

We can now compute the support and the confidence:

# Now we have all the information needed to compute Support and Confidence
In [22]: support = rule_valid  # The Support is the number of times the rule is discovered.

In [23]: confidence = rule_valid / num_apple_purchases

In [24]: print("The support is {0} and the confidence is {1:.3f}.".format(support, confidence))
The support is 21 and the confidence is 0.583.
# Confidence can be thought of as a percentage using the following:
In [25]: print("As a percentage, that is {0:.1f}%.".format(100 * confidence))
As a percentage, that is 58.3%.
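
A quick arithmetic check of the numbers printed above: the valid and invalid cases should add up to the number of apple purchases, and their ratio should give the reported confidence. The values are copied from the transcript.

```python
# Counts reported in the transcript.
rule_valid = 21
rule_invalid = 15
num_apple_purchases = 36

# Every apple purchase is either a valid or an invalid case of the rule.
assert rule_valid + rule_invalid == num_apple_purchases

confidence = rule_valid / num_apple_purchases
print(round(confidence, 3))  # 0.583
```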

To compute the confidence and support of every rule, first create a few dictionaries to hold the results. defaultdict is used here.

from collections import defaultdict
# Now compute for all possible rules
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)

for sample in data:
    for premise in range(n_features):
        if sample[premise] == 0: continue
        # Record that the premise was bought in another transaction
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:  # It makes little sense to measure if X -> X.
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[(premise, conclusion)] += 1
            else:
                # This person bought the premise, but not the conclusion
                invalid_rules[(premise, conclusion)] += 1
support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]

for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

Rule: If a person buys bread they will also buy milk
 - Confidence: 0.519
 - Support: 14

Rule: If a person buys milk they will also buy cheese
 - Confidence: 0.152
 - Support: 7

Rule: If a person buys apples they will also buy cheese
 - Confidence: 0.694
 - Support: 25

Rule: If a person buys milk they will also buy apples
 - Confidence: 0.196
 - Support: 9

Rule: If a person buys bread they will also buy apples
 - Confidence: 0.185
 - Support: 5

Rule: If a person buys apples they will also buy bread
 - Confidence: 0.139
 - Support: 5

Rule: If a person buys apples they will also buy bananas
 - Confidence: 0.583
 - Support: 21

Rule: If a person buys apples they will also buy milk
 - Confidence: 0.250
 - Support: 9

Rule: If a person buys milk they will also buy bananas
 - Confidence: 0.413
 - Support: 19

Rule: If a person buys cheese they will also buy bananas
 - Confidence: 0.659
 - Support: 27

Rule: If a person buys cheese they will also buy bread
 - Confidence: 0.098
 - Support: 4

Rule: If a person buys cheese they will also buy apples
 - Confidence: 0.610
 - Support: 25

Rule: If a person buys cheese they will also buy milk
 - Confidence: 0.171
 - Support: 7

Rule: If a person buys bananas they will also buy apples
 - Confidence: 0.356
 - Support: 21

Rule: If a person buys bread they will also buy bananas
 - Confidence: 0.630
 - Support: 17

Rule: If a person buys bananas they will also buy cheese
 - Confidence: 0.458
 - Support: 27

Rule: If a person buys milk they will also buy bread
 - Confidence: 0.304
 - Support: 14

Rule: If a person buys bananas they will also buy milk
 - Confidence: 0.322
 - Support: 19

Rule: If a person buys bread they will also buy cheese
 - Confidence: 0.148
 - Support: 4

Rule: If a person buys bananas they will also buy bread
 - Confidence: 0.288
 - Support: 17
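
The nested loops can also be expressed as a single matrix product. This is a swapped-in technique, not the article's method: for a 0/1 matrix, entry (i, j) of data.T @ data counts the transactions that contain both item i and item j, and the diagonal holds the per-item totals. The sketch runs on the five rows printed earlier and assumes every item was bought at least once, so that the division is safe.

```python
import numpy as np

# The five transactions printed earlier.
rows = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
], dtype=float)

# co_counts[i, j] = number of transactions containing both item i and item j.
co_counts = (rows.T @ rows).astype(int)

# The diagonal is the number of transactions containing each item.
item_totals = np.diag(co_counts)

# Row i is the premise: confidence of i -> j is co_counts[i, j] / item_totals[i].
confidence_matrix = co_counts / item_totals[:, None]

# Rule apples (3) -> bananas (4) on these five rows.
print(int(co_counts[3, 4]), float(confidence_matrix[3, 4]))  # 2 0.5
```

The diagonal of confidence_matrix is always 1 (the trivial rule X -> X) and should be ignored, just as the loop version skips premise == conclusion.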


def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

premise = 1
conclusion = 3
print_rule(premise, conclusion, support, confidence, features)
Rule: If a person buys milk they will also buy apples
 - Confidence: 0.196
 - Support: 9

# Print the support of every rule (not sorted yet)
from pprint import pprint
pprint(list(support.items()))
[((0, 1), 14),
 ((1, 2), 7),
 ((3, 2), 25),
 ((1, 3), 9),
 ((0, 2), 4),
 ((3, 0), 5),
 ((4, 1), 19),
 ((3, 1), 9),
 ((1, 4), 19),
 ((2, 4), 27),
 ((2, 0), 4),
 ((2, 3), 25),
 ((2, 1), 7),
 ((4, 3), 21),
 ((0, 4), 17),
 ((4, 2), 27),
 ((1, 0), 14),
 ((3, 4), 21),
 ((0, 3), 5),
 ((4, 0), 17)]

Sorting the rules by support, highest first:

from operator import itemgetter
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)

Rule #1
Rule: If a person buys cheese they will also buy bananas
 - Confidence: 0.659
 - Support: 27

Rule #2
Rule: If a person buys bananas they will also buy cheese
 - Confidence: 0.458
 - Support: 27

Rule #3
Rule: If a person buys apples they will also buy cheese
 - Confidence: 0.694
 - Support: 25

Rule #4
Rule: If a person buys cheese they will also buy apples
 - Confidence: 0.610
 - Support: 25

Rule #5
Rule: If a person buys bananas they will also buy apples
 - Confidence: 0.356
 - Support: 21
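
The same pattern can rank rules by confidence instead of support. The sketch below re-ranks only the five rules listed above, using the rounded confidence values from the printed output and item names in place of indices; it is an illustration, not a ranking of all twenty rules.

```python
from operator import itemgetter

# (premise, conclusion) -> confidence, copied from the printed output (rounded).
confidence_subset = {
    ("cheese", "bananas"): 0.659,
    ("bananas", "cheese"): 0.458,
    ("apples", "cheese"): 0.694,
    ("cheese", "apples"): 0.610,
    ("bananas", "apples"): 0.356,
}

# Sort by the confidence value, highest first.
sorted_confidence = sorted(confidence_subset.items(), key=itemgetter(1), reverse=True)

for rank, ((premise_name, conclusion_name), value) in enumerate(sorted_confidence, start=1):
    print("Rule #{0}: {1} -> {2} (confidence {3:.3f})".format(
        rank, premise_name, conclusion_name, value))
```

Note that the order changes: apples -> cheese moves to the top even though cheese -> bananas has the higher support.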
