Affinity Analysis

Author: overad | Published: 2018-06-04 10:19

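    A common application of data mining is finding out, when customers buy one product, what else they want to buy, so that products most customers are willing to buy together can be placed together to increase sales. Once a retailer has collected enough data, it can run an affinity analysis on it to determine which products are suitable to sell together.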

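    What is affinity:

    Affinity analysis determines how closely related sample individuals (objects) are, based on the similarity between them. Typical applications include:

    1. Offering website users varied services or targeted advertising;
    2. Recommending movies or products to users and selling them related items;
    3. Finding people who are related by ancestry, based on their genes.

    Product recommendation: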

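    Let's look at a simple product recommendation service. The idea behind it is easy to grasp: if people have often bought two products together in the past, they are likely to buy them together in the future. Simple as it is, this idea is the basis of many product recommendation services.
    To keep the code simple, we only consider cases where two products are bought at once. For example, someone goes to the supermarket and buys both bread and milk. As a data mining exercise, we want to find rules of the following form:

    If a person buys product X, they are likely to buy product Y as well.

    Rules involving more products are more complex, for example that customers who buy sausages and burgers are more likely than other customers to buy ketchup. We do not explore such rules here.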

    Loading the data:
    In [2]: import numpy as np
    
    In [3]: path = 'D:\\books\\affinity_dataset.txt'
    
    In [4]: data = np.loadtxt(path)
    
    In [5]: n_samples,n_features = data.shape
    
    In [6]: n_samples
    Out[6]: 100
    
    In [7]: n_features
    Out[7]: 5
    
    In [8]: print("This dataset has {0} samples and {1} features".format(n_samples, n_features))
    This dataset has 100 samples and 5 features
    
    # Inspect the first five rows
    In [11]: print(data[:5])
    [[0. 0. 1. 1. 1.]
     [1. 1. 0. 1. 0.]
     [1. 0. 1. 1. 0.]
     [0. 0. 1. 1. 1.]
     [0. 1. 0. 0. 1.]]
    

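    The output can be read both across and down. Reading across, one row at a time, the first row (0, 0, 1, 1, 1) lists the products contained in the first transaction. Reading down, each column stands for one product. In our example the five products are bread, milk, cheese, apples and bananas. From the first transaction we can see that the customer bought cheese, apples and bananas, but no bread or milk.
    Each feature takes one of only two values, 1 or 0, indicating whether a product was bought, not how many units were bought: 1 means the customer bought at least one unit of the product, and 0 means the customer did not buy it.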
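Because every feature is binary, summing a column counts the transactions that contain that product. The sketch below illustrates this on the five rows printed above only, not on the full 100-row file; the names `sample_rows`, `item_names`, `purchase_counts` and `first_basket` are chosen here for illustration and do not appear in the original session.

```python
import numpy as np

# The five transactions printed above (not the full dataset).
# Columns: bread, milk, cheese, apples, bananas.
sample_rows = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])
item_names = ["bread", "milk", "cheese", "apples", "bananas"]

# Summing down each column counts the transactions containing that item.
purchase_counts = dict(zip(item_names, sample_rows.sum(axis=0).tolist()))
print(purchase_counts)

# Reading across the first row gives the contents of the first basket.
first_basket = [name for name, bought in zip(item_names, sample_rows[0]) if bought == 1]
print(first_basket)
```

On these five rows, apples appear in four baskets and the first basket contains cheese, apples and bananas, which matches the reading of the first row given above.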

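    Implementing a simple ranking of rules:

    As stated above, we want to find rules of the form "if a customer buys product X, they are likely to buy product Y". The brute-force approach is to find every pair of products bought together in the dataset. Once we have the rules, we still need to judge how good they are so that we can pick the useful ones.

    There are several ways to measure how good a rule is; the most common are support and confidence.

    • Support is the number of times the rule holds in the dataset. It can be normalised into a proportion, but here we simply use the raw count.
    • Confidence measures how accurate the rule is: among all samples that satisfy the premise (the "if" part of the rule), the proportion that also satisfy the conclusion. It is computed by counting how often the rule holds and dividing that count by the number of samples in which the premise holds.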
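The two definitions above can be sketched as a small helper. This is a minimal illustration under the assumption that transactions are stored as a 0/1 NumPy array with one column per product; the function name `rule_stats` and the `toy` array are hypothetical and not part of the original code.

```python
import numpy as np

def rule_stats(transactions, premise, conclusion):
    """Return (support, confidence) for the rule premise -> conclusion.

    support    = number of rows where both items were bought (raw count)
    confidence = support / number of rows where the premise item was bought
    """
    has_premise = transactions[:, premise] == 1
    has_both = has_premise & (transactions[:, conclusion] == 1)
    support = int(has_both.sum())
    n_premise = int(has_premise.sum())
    # Guard against a premise that never occurs in the data.
    confidence = support / n_premise if n_premise else 0.0
    return support, confidence

# Hypothetical toy data: column 0 is the premise item, column 1 the conclusion item.
toy = np.array([
    [1, 1],
    [1, 0],
    [1, 1],
    [0, 1],
])
print(rule_stats(toy, 0, 1))
```

In the toy data the premise item is bought three times and both items twice, so the support is 2 and the confidence is 2/3.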

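    Next we use an example to show how support and confidence are computed. Let's look at the support and confidence of the rule "if a customer buys apples, they will also buy bananas".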

    In [12]: features = ['bread', 'milk', 'cheese', 'apples', 'bananas']
    
    In [13]: num_apple_purchases = 0
    
    # First, how many rows contain our premise: that a person is buying apples
    In [14]: for sample in data:
        ...:     if sample[3] == 1: #this person bought apples
        ...:         num_apple_purchases += 1
        ...:
    
    In [15]: print("{0} people bought Apples".format(num_apple_purchases))
    36 people bought Apples
    
    

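    In the same way, checking whether sample[4] equals 1 tells us whether the customer bought bananas.

    Eventually we need these statistics for every rule in the dataset, stored in two dictionaries, one for the cases where a rule holds and one for the cases where it fails. The dictionary keys are (premise, conclusion) tuples whose elements are the indices of the features in the feature list, not the actual feature names. For the single apples-to-bananas rule, two plain counters are enough: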

    In [16]: rule_valid = 0
    
    In [17]: rule_invalid = 0
    
    In [19]: for sample in data:
        ...:     if sample[3] == 1:   #this person bought apples
        ...:         if sample[4] == 1:  #this person bought both apples and bananas
        ...:             rule_valid += 1
        ...:         else:
        ...:             rule_invalid += 1
        ...:
    
    In [20]: print("{0} cases of the rule being valid were discovered".format(rule_valid))
    21 cases of the rule being valid were discovered
    
    In [21]: print("{0} cases of the rule being invalid were discovered".format(rule_invalid))
    15 cases of the rule being invalid were discovered
    

    We can now compute the support and confidence:

    # Now we have all the information needed to compute Support and Confidence
    In [22]: support = rule_valid  # The Support is the number of times the rule is discovered.
    
    In [23]: confidence = rule_valid / num_apple_purchases
    
    In [24]: print("The support is {0} and the confidence is {1:.3f}.".format(support, confidence))
    The support is 21 and the confidence is 0.583.
    # Confidence can be thought of as a percentage using the following:
    In [25]: print("As a percentage, that is {0:.1f}%.".format(100 * confidence))
    As a percentage, that is 58.3%.
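As a quick sanity check, the figures reported in the session above can be recombined by hand. The counts are copied from the output; `relative_support` is an extra quantity introduced here for illustration (support divided by the number of transactions) and is not computed in the original session.

```python
# Counts copied from the session output above.
n_samples = 100
num_apple_purchases = 36
rule_valid = 21      # bought apples and bananas
rule_invalid = 15    # bought apples but no bananas

# Every apple purchase either confirms or contradicts the rule.
assert rule_valid + rule_invalid == num_apple_purchases

confidence = rule_valid / num_apple_purchases
# Assumption for illustration: support expressed as a share of all transactions.
relative_support = rule_valid / n_samples

print("confidence = {0:.3f}".format(confidence))
print("relative support = {0:.2f}".format(relative_support))
```

The confidence of 21/36 rounds to 0.583, which agrees with the value printed in the session.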
    

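    To compute the confidence and support of every rule, we first create a few dictionaries to hold the results. We use defaultdict here, which supplies a default value when a key that does not exist yet is looked up.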

    from collections import defaultdict
    # Now compute for all possible rules
    valid_rules = defaultdict(int)
    invalid_rules = defaultdict(int)
    num_occurences = defaultdict(int)
    
    for sample in data:
        for premise in range(n_features):
            if sample[premise] == 0: continue
            # Record that the premise was bought in another transaction
            num_occurences[premise] += 1
            for conclusion in range(n_features):
                if premise == conclusion:  # It makes little sense to measure if X -> X.
                    continue
                if sample[conclusion] == 1:
                    # This person also bought the conclusion item
                    valid_rules[(premise, conclusion)] += 1
                else:
                    # This person bought the premise, but not the conclusion
                    invalid_rules[(premise, conclusion)] += 1
    support = valid_rules
    confidence = defaultdict(float)
    for premise, conclusion in valid_rules.keys():
        confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]
    
    for premise, conclusion in confidence:
        premise_name = features[premise]
        conclusion_name = features[conclusion]
        print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
        print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
        print(" - Support: {0}".format(support[(premise, conclusion)]))
        print("")
    
    Rule: If a person buys bread they will also buy milk
     - Confidence: 0.519
     - Support: 14
    
    Rule: If a person buys milk they will also buy cheese
     - Confidence: 0.152
     - Support: 7
    
    Rule: If a person buys apples they will also buy cheese
     - Confidence: 0.694
     - Support: 25
    
    Rule: If a person buys milk they will also buy apples
     - Confidence: 0.196
     - Support: 9
    
    Rule: If a person buys bread they will also buy apples
     - Confidence: 0.185
     - Support: 5
    
    Rule: If a person buys apples they will also buy bread
     - Confidence: 0.139
     - Support: 5
    
    Rule: If a person buys apples they will also buy bananas
     - Confidence: 0.583
     - Support: 21
    
    Rule: If a person buys apples they will also buy milk
     - Confidence: 0.250
     - Support: 9
    
    Rule: If a person buys milk they will also buy bananas
     - Confidence: 0.413
     - Support: 19
    
    Rule: If a person buys cheese they will also buy bananas
     - Confidence: 0.659
     - Support: 27
    
    Rule: If a person buys cheese they will also buy bread
     - Confidence: 0.098
     - Support: 4
    
    Rule: If a person buys cheese they will also buy apples
     - Confidence: 0.610
     - Support: 25
    
    Rule: If a person buys cheese they will also buy milk
     - Confidence: 0.171
     - Support: 7
    
    Rule: If a person buys bananas they will also buy apples
     - Confidence: 0.356
     - Support: 21
    
    Rule: If a person buys bread they will also buy bananas
     - Confidence: 0.630
     - Support: 17
    
    Rule: If a person buys bananas they will also buy cheese
     - Confidence: 0.458
     - Support: 27
    
    Rule: If a person buys milk they will also buy bread
     - Confidence: 0.304
     - Support: 14
    
    Rule: If a person buys bananas they will also buy milk
     - Confidence: 0.322
     - Support: 19
    
    Rule: If a person buys bread they will also buy cheese
     - Confidence: 0.148
     - Support: 4
    
    Rule: If a person buys bananas they will also buy bread
     - Confidence: 0.288
     - Support: 17
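The nested loops above can also be expressed with a matrix product. The sketch below is an alternative formulation, not the code used to produce the output above: for a 0/1 matrix `X`, entry `(i, j)` of `X.T @ X` is the number of transactions containing both items, and the diagonal holds the per-item purchase counts. The function name `all_pair_rules` is hypothetical, and the demo uses only the five sample rows shown earlier.

```python
import numpy as np

def all_pair_rules(transactions):
    """Compute support and confidence for every premise -> conclusion pair."""
    X = np.asarray(transactions, dtype=int)
    co_counts = X.T @ X                 # co_counts[i, j]: rows containing both i and j
    occurrences = np.diag(co_counts)    # occurrences[i]: rows containing i
    support, confidence = {}, {}
    n_features = X.shape[1]
    for premise in range(n_features):
        for conclusion in range(n_features):
            if premise == conclusion:
                continue
            count = int(co_counts[premise, conclusion])
            if count == 0:
                # Mirror the loop version: rules that never hold are not recorded.
                continue
            support[(premise, conclusion)] = count
            confidence[(premise, conclusion)] = count / int(occurrences[premise])
    return support, confidence

# The five sample transactions shown earlier (bread, milk, cheese, apples, bananas).
sample_rows = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])
support_demo, confidence_demo = all_pair_rules(sample_rows)
print(support_demo[(3, 4)], confidence_demo[(3, 4)])  # apples -> bananas on the 5 rows
```

On these five rows, apples and bananas are bought together twice out of four apple purchases. The numbers differ from the full-dataset output above because only five transactions are used.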
    
    
    def print_rule(premise, conclusion, support, confidence, features):
        premise_name = features[premise]
        conclusion_name = features[conclusion]
        print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
        print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
        print(" - Support: {0}".format(support[(premise, conclusion)]))
        print("")
    
    
    premise = 1
    conclusion = 3
    print_rule(premise, conclusion, support, confidence, features)
    Rule: If a person buys milk they will also buy apples
     - Confidence: 0.196
     - Support: 9
    
    # Inspect the raw support counts (not yet sorted)
    from pprint import pprint
    pprint(list(support.items()))
    [((0, 1), 14),
     ((1, 2), 7),
     ((3, 2), 25),
     ((1, 3), 9),
     ((0, 2), 4),
     ((3, 0), 5),
     ((4, 1), 19),
     ((3, 1), 9),
     ((1, 4), 19),
     ((2, 4), 27),
     ((2, 0), 4),
     ((2, 3), 25),
     ((2, 1), 7),
     ((4, 3), 21),
     ((0, 4), 17),
     ((4, 2), 27),
     ((1, 0), 14),
     ((3, 4), 21),
     ((0, 3), 5),
     ((4, 0), 17)]
    

    Sorting the rules by support, highest first:

    from operator import itemgetter
    sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
    for index in range(5):
        print("Rule #{0}".format(index + 1))
        (premise, conclusion) = sorted_support[index][0]
        print_rule(premise, conclusion, support, confidence, features)
    
    Rule #1
    Rule: If a person buys cheese they will also buy bananas
     - Confidence: 0.659
     - Support: 27
    
    Rule #2
    Rule: If a person buys bananas they will also buy cheese
     - Confidence: 0.458
     - Support: 27
    
    Rule #3
    Rule: If a person buys apples they will also buy cheese
     - Confidence: 0.694
     - Support: 25
    
    Rule #4
    Rule: If a person buys cheese they will also buy apples
     - Confidence: 0.610
     - Support: 25
    
    Rule #5
    Rule: If a person buys bananas they will also buy apples
     - Confidence: 0.356
     - Support: 21
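The same `sorted` and `itemgetter` pattern can rank rules by confidence instead of support. The sketch below is illustrative only: `confidence_subset` holds five confidence values copied from the output earlier in this article rather than the full dictionary, and the variable names are chosen here.

```python
from operator import itemgetter

item_names = ["bread", "milk", "cheese", "apples", "bananas"]

# Five confidence values copied from the rule output above,
# keyed by (premise index, conclusion index).
confidence_subset = {
    (4, 3): 0.356,  # bananas -> apples
    (2, 4): 0.659,  # cheese -> bananas
    (3, 2): 0.694,  # apples -> cheese
    (2, 3): 0.610,  # cheese -> apples
    (0, 4): 0.630,  # bread -> bananas
}

# Sort the (key, value) pairs by value, highest confidence first.
sorted_confidence = sorted(confidence_subset.items(), key=itemgetter(1), reverse=True)

for rank, ((premise, conclusion), value) in enumerate(sorted_confidence, start=1):
    print("Rule #{0}: {1} -> {2}, confidence {3:.3f}".format(
        rank, item_names[premise], item_names[conclusion], value))
```

Ranked this way, apples -> cheese comes first among these five rules, whereas the support ranking above puts cheese -> bananas first. The two measures do not always agree, so it can be worth looking at both.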
    
