A Practical Guide to Data Mining: Reading Notes 2

Author: hainingwyx | Published: 2016-11-16 13:38

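    Preface

    The source code and data used in the book are all available at http://guidetodatamining.com/
    The theory in this book is fairly simple, it contains few errors, and it offers plenty of hands-on exercises. Writing every piece of code yourself pays off. In short: a good fit for beginners.
    Reposting is welcome as long as the source is credited. Corrections are welcome too.
    Full collection of notes: https://www.zybuluo.com/hainingwyx/note/559139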

    Item-based collaborative filtering

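    Explicit ratings: the user states the rating directly, e.g. the like and dislike buttons on YouTube.
    Implicit ratings: inferred from behavior, e.g. a user's click trail on a website.
    Neighbor-based (user-based) recommender systems need a very large number of computations, so they suffer from latency. They also suffer from sparsity. This approach is also called memory-based collaborative filtering, because all ratings must be kept in order to make recommendations.
    Item-based filtering: find the most similar items ahead of time and combine them with the user's ratings of those items to generate recommendations. This approach is also called model-based collaborative filtering, because it does not need to keep all ratings. Instead, it builds a model that represents the similarity between items.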
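    To offset grade inflation (users who rate everything high), use the adjusted cosine similarity:

    $$s(i,j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2}\,\sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}$$

    $U$ is the set of all users who have rated both $i$ and $j$,
    $R_{u,i}$ is user $u$'s rating of item $i$, and
    $\bar{R}_u$ is the average of all of user $u$'s ratings. Computing $s(i,j)$ for every pair of items gives the similarity matrix.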
    users3 = {"David": {"Imagine Dragons": 3, "Daft Punk": 5,
                        "Lorde": 4, "Fall Out Boy": 1},
              "Matt":  {"Imagine Dragons": 3, "Daft Punk": 4,
                        "Lorde": 4, "Fall Out Boy": 1},
              "Ben":   {"Kacey Musgraves": 4, "Imagine Dragons": 3,
                        "Lorde": 3, "Fall Out Boy": 1},
              "Chris": {"Kacey Musgraves": 4, "Imagine Dragons": 4,
                        "Daft Punk": 4, "Lorde": 3, "Fall Out Boy": 1},
              "Tori":  {"Kacey Musgraves": 5, "Imagine Dragons": 4,
                        "Daft Punk": 5, "Fall Out Boy": 3}}
    
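    from math import sqrt

    def computeSimilarity(band1, band2, userRatings):
       # adjusted cosine similarity between two items
       # average rating of each user, over all items that user rated
       averages = {}
       for (key, ratings) in userRatings.items():
          averages[key] = (float(sum(ratings.values()))
                          / len(ratings.values()))

       num = 0  # numerator
       dem1 = 0 # first half of denominator
       dem2 = 0 # second half of denominator
       for (user, ratings) in userRatings.items():
          # only users who rated both bands contribute
          if band1 in ratings and band2 in ratings:
             avg = averages[user]
             num += (ratings[band1] - avg) * (ratings[band2] - avg)
             dem1 += (ratings[band1] - avg)**2
             dem2 += (ratings[band2] - avg)**2
       # raises ZeroDivisionError when either half of the denominator is 0,
       # e.g. when no user rated both bands
       return num / (sqrt(dem1) * sqrt(dem2))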
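
    The similarity matrix mentioned above can be sketched by applying the adjusted cosine to every pair of items. The names adjusted_cosine and similarity_matrix are hypothetical helpers, not from the book, and returning 0.0 for an undefined similarity is an assumption made here to keep the sketch from raising.

```python
from itertools import combinations
from math import sqrt

# Ratings copied from the users3 example in these notes.
users3 = {"David": {"Imagine Dragons": 3, "Daft Punk": 5,
                    "Lorde": 4, "Fall Out Boy": 1},
          "Matt":  {"Imagine Dragons": 3, "Daft Punk": 4,
                    "Lorde": 4, "Fall Out Boy": 1},
          "Ben":   {"Kacey Musgraves": 4, "Imagine Dragons": 3,
                    "Lorde": 3, "Fall Out Boy": 1},
          "Chris": {"Kacey Musgraves": 4, "Imagine Dragons": 4,
                    "Daft Punk": 4, "Lorde": 3, "Fall Out Boy": 1},
          "Tori":  {"Kacey Musgraves": 5, "Imagine Dragons": 4,
                    "Daft Punk": 5, "Fall Out Boy": 3}}


def adjusted_cosine(item1, item2, user_ratings):
    """Adjusted cosine similarity of two items (hypothetical helper)."""
    num = dem1 = dem2 = 0.0
    for ratings in user_ratings.values():
        if item1 in ratings and item2 in ratings:
            # Subtract the user's own average to offset grade inflation.
            avg = sum(ratings.values()) / len(ratings)
            d1 = ratings[item1] - avg
            d2 = ratings[item2] - avg
            num += d1 * d2
            dem1 += d1 ** 2
            dem2 += d2 ** 2
    denominator = sqrt(dem1) * sqrt(dem2)
    # Assumption: report 0.0 when the similarity is undefined.
    return num / denominator if denominator else 0.0


def similarity_matrix(user_ratings):
    """Nested dict with the similarity of every pair of distinct items."""
    items = sorted({item for ratings in user_ratings.values()
                    for item in ratings})
    matrix = {item: {} for item in items}
    for a, b in combinations(items, 2):
        sim = adjusted_cosine(a, b, user_ratings)
        matrix[a][b] = sim   # the measure is symmetric,
        matrix[b][a] = sim   # so each pair is computed once
    return matrix


matrix = similarity_matrix(users3)
print(round(matrix["Kacey Musgraves"]["Imagine Dragons"], 4))  # 0.526
```

    Because the measure is symmetric, only one triangle of the matrix has to be computed. The matrix can be built offline and reused for every prediction.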
    

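    Prediction from the similarity matrix:

    $$p(u,i) = \frac{\sum_{N \in similarTo(i)} (S_{i,N} \times NR_{u,N})}{\sum_{N \in similarTo(i)} |S_{i,N}|}$$

    $p(u,i)$ is the predicted rating of user $u$ for item $i$.

    $N$ ranges over the items rated by user $u$ that are similar to $i$.

    $S_{i,N}$ is the similarity between $i$ and $N$.

    $NR_{u,N}$ is $u$'s rating of $N$. It should lie in $[-1, 1]$, so the raw rating $R_{u,N}$ may first need a linear transformation.

    With $Min_R$ and $Max_R$ the lowest and highest values of the rating scale, the rescaled rating is

    $$NR_{u,N} = \frac{2(R_{u,N} - Min_R) - (Max_R - Min_R)}{Max_R - Min_R}$$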
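
    A minimal sketch of the prediction step, assuming a 1 to 5 rating scale and the linear rescaling NR = (2(R - Min) - (Max - Min)) / (Max - Min). The function names, the sample ratings and the sample similarities are all hypothetical. The inverse mapping in denormalize is simply that rescaling solved for R.

```python
def normalize(rating, min_r, max_r):
    """Map a rating from [min_r, max_r] onto [-1, 1]."""
    return (2 * (rating - min_r) - (max_r - min_r)) / (max_r - min_r)


def denormalize(normalized, min_r, max_r):
    """Inverse of normalize: map a value in [-1, 1] back to the scale."""
    return 0.5 * (normalized + 1) * (max_r - min_r) + min_r


def predict(user_ratings, similarities, min_r=1, max_r=5):
    """Similarity-weighted average of the user's normalized ratings.

    similarities maps each item to its similarity with the target item.
    Returns None when none of the user's items has a usable similarity.
    """
    num = 0.0
    den = 0.0
    for item, rating in user_ratings.items():
        if item in similarities:
            sim = similarities[item]
            num += sim * normalize(rating, min_r, max_r)
            den += abs(sim)
    if den == 0:
        return None
    return denormalize(num / den, min_r, max_r)


# Hypothetical user and hypothetical similarities to the target item.
ratings = {"A": 5, "B": 1}
sims_to_target = {"A": 1.0, "B": 0.5}
prediction = predict(ratings, sims_to_target)
print(prediction)
```

    Here the normalized ratings are 1 and -1, so the weighted average is (1.0 * 1 + 0.5 * -1) / 1.5 = 1/3, which maps back to 11/3 (about 3.67) on the 1 to 5 scale.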


    The Slope One algorithm

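    • Computing deviations

      The average deviation of item $i$ with respect to item $j$ is

      $$dev_{i,j} = \sum_{u \in S_{i,j}(X)} \frac{u_i - u_j}{card(S_{i,j}(X))}$$

    $card(S)$ is the number of elements in the set $S$, $X$ is the entire set of ratings, $u_i$ is user $u$'s rating of item $i$, and
    $S_{i,j}(X)$ is the set of all users who rated both $i$ and $j$.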

    def computeDeviations(self):
        # for each person in the data, get their ratings
        for ratings in self.data.values():        # data: users2, ratings: {song: value, ...}
            # for each item & rating in that set of ratings:
            for (item, rating) in ratings.items():
                self.frequencies.setdefault(item, {})   # key is song
                self.deviations.setdefault(item, {})
                # for each item2 & rating2 in that set of ratings:
                for (item2, rating2) in ratings.items():
                    if item != item2:
                        # add the difference between the ratings to our
                        # computation
                        self.frequencies[item].setdefault(item2, 0)
                        self.deviations[item].setdefault(item2, 0.0)
                        # frequencies[item][item2] is card(S_{i,j}(X))
                        self.frequencies[item][item2] += 1
                        # deviations[item][item2] accumulates the rating
                        # differences over all users who rated both items
                        self.deviations[item][item2] += rating - rating2

        # after all users are processed, turn each sum into an average
        for (item, ratings) in self.deviations.items():
            for item2 in ratings:
                ratings[item2] /= self.frequencies[item][item2]
    # test code for ComputeDeviations(self)
    #r = recommender(users2)
    #r.computeDeviations()
    #r.deviations
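
    The deviation step can also be tried outside the recommender class. The sketch below is a standalone variant with a hypothetical function name, and the sample ratings are an assumption for illustration, since users2 is not shown in these notes.

```python
def compute_deviations(data):
    """Return (deviations, frequencies) as nested dicts.

    deviations[i][j] is the average of (rating of i - rating of j) over
    the users who rated both; frequencies[i][j] is how many users did.
    """
    deviations = {}
    frequencies = {}
    for ratings in data.values():
        for item, rating in ratings.items():
            for item2, rating2 in ratings.items():
                if item != item2:
                    frequencies.setdefault(item, {}).setdefault(item2, 0)
                    deviations.setdefault(item, {}).setdefault(item2, 0.0)
                    frequencies[item][item2] += 1
                    deviations[item][item2] += rating - rating2
    # Turn the accumulated sums into averages once all users are seen.
    for item, row in deviations.items():
        for item2 in row:
            row[item2] /= frequencies[item][item2]
    return deviations, frequencies


# Illustrative ratings (an assumption, not taken from these notes).
sample = {"Amy":   {"Taylor Swift": 4, "PSY": 3, "Whitney Houston": 4},
          "Ben":   {"Taylor Swift": 5, "PSY": 2},
          "Clara": {"PSY": 3.5, "Whitney Houston": 4},
          "Daisy": {"Taylor Swift": 5, "Whitney Houston": 3}}

deviations, frequencies = compute_deviations(sample)
print(deviations["Taylor Swift"]["PSY"])      # 2.0
print(deviations["PSY"]["Whitney Houston"])   # -0.75
```

    Amy and Ben both rated Taylor Swift and PSY, with differences 1 and 3, so the average deviation is 2.0 with a frequency of 2. Note that dev[i][j] is always the negative of dev[j][i].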
    

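    • Weighted Slope One prediction

      $$P^{wS1}(u)_j = \frac{\sum_{i \in S(u) - \{j\}} (dev_{j,i} + u_i)\, c_{j,i}}{\sum_{i \in S(u) - \{j\}} c_{j,i}}$$

    $P^{wS1}(u)_j$ is the Weighted Slope One prediction of user $u$'s rating for item $j$, $S(u)$ is the set of items $u$ has rated, and $c_{j,i} = card(S_{j,i}(X))$.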

    def slopeOneRecommendations(self, userRatings):
        recommendations = {}
        frequencies = {}
        # for every item and rating in the user's ratings
        for (userItem, userRating) in userRatings.items():        # userItem: i
            # for every item in our dataset that the user didn't rate
            for (diffItem, diffRatings) in self.deviations.items():    # diffItem: j
                if diffItem not in userRatings and \
                   userItem in self.deviations[diffItem]:
                    freq = self.frequencies[diffItem][userItem]   # freq: c_ji
                    # setdefault adds the key with the default value
                    # if it is not in the dictionary yet
                    recommendations.setdefault(diffItem, 0.0)
                    frequencies.setdefault(diffItem, 0)
                    # add to the running sum representing the numerator
                    # of the formula
                    recommendations[diffItem] += (diffRatings[userItem] +
                                                  userRating) * freq
                    # keep a running sum of the frequency of diffItem
                    frequencies[diffItem] += freq
        # once both loops are done, divide each sum by its total
        # frequency to get p(u)_j
        recommendations =  [(self.convertProductID2name(k),
                             v / frequencies[k])
                            for (k, v) in recommendations.items()]
        # finally sort and return
        recommendations.sort(key=lambda artistTuple: artistTuple[1],
                             reverse = True)
        # I am only going to return the first 50 recommendations
        return recommendations[:50]
               
    # test code for SlopeOneRecommendations
    #r = recommender(users2)
    #r.computeDeviations()
    #g = users2['Ben']
    #r.slopeOneRecommendations(g)
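
    A standalone sketch of the weighted prediction, with a hypothetical function name. The deviations and frequencies below are filled in by hand for a single target item and are assumptions chosen for illustration. In practice they would come from the deviation step.

```python
def weighted_slope_one(user_ratings, deviations, frequencies):
    """Predict ratings for the items the user has not rated yet.

    deviations[j][i] is the average deviation of item j with respect to
    item i, and frequencies[j][i] is the number of users who rated both.
    Returns (item, prediction) pairs, highest prediction first.
    """
    numerators = {}
    denominators = {}
    for target, row in deviations.items():
        if target in user_ratings:
            continue  # only predict items the user has not rated
        for rated, rating in user_ratings.items():
            if rated in row:
                freq = frequencies[target][rated]
                numerators[target] = (numerators.get(target, 0.0)
                                      + (row[rated] + rating) * freq)
                denominators[target] = denominators.get(target, 0) + freq
    predictions = [(item, numerators[item] / denominators[item])
                   for item in numerators]
    predictions.sort(key=lambda pair: pair[1], reverse=True)
    return predictions


# Hand-filled, illustrative values (assumptions, not from these notes).
deviations = {"Whitney Houston": {"Taylor Swift": -1.0, "PSY": 0.75}}
frequencies = {"Whitney Houston": {"Taylor Swift": 2, "PSY": 2}}
ben = {"Taylor Swift": 5, "PSY": 2}

predictions = weighted_slope_one(ben, deviations, frequencies)
print(predictions)  # [('Whitney Houston', 3.375)]
```

    The numerator is (-1.0 + 5) * 2 + (0.75 + 2) * 2 = 13.5 and the denominator is 2 + 2 = 4, which gives 3.375. Weighting by frequency lets deviations backed by more users count for more.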
    
    # requires "import codecs" at the top of the module
    def loadMovieLens(self, path=''):
          self.data = {}
          # i counts the lines read across all three files
          i = 0
          #
          # First load movie ratings into self.data
          #
          f = codecs.open(path + "u.data", 'r', 'ascii')
          for line in f:
             i += 1
             #separate line into fields
             fields = line.split('\t')
             user = fields[0]
             movie = fields[1]
             rating = int(fields[2].strip().strip('"'))
             if user in self.data:
                currentRatings = self.data[user]
             else:
                currentRatings = {}
             currentRatings[movie] = rating
             self.data[user] = currentRatings
          f.close()
          #
          # Now load movie into self.productid2name
          # the file u.item contains movie id, title, release date among
          # other fields
          #
          #f = codecs.open(path + "u.item", 'r', 'utf8')
          f = codecs.open(path + "u.item", 'r', 'iso8859-1', 'ignore')
          #f = open(path + "u.item")
          for line in f:
             i += 1
             #separate line into fields
             fields = line.split('|')
             mid = fields[0].strip()
             title = fields[1].strip()
             self.productid2name[mid] = title
          f.close()
          #
          #  Now load user info into both self.userid2name
          #  and self.username2id
          #
          #f = codecs.open(path + "u.user", 'r', 'utf8')
          f = open(path + "u.user")
          for line in f:
             i += 1
             fields = line.split('|')
             userid = fields[0].strip('"')
             self.userid2name[userid] = line
             self.username2id[line] = userid
          f.close()
          print(i)
    # test code
    #r = recommender(0)
    #r.loadMovieLens('ml-100k/')
    #r.computeDeviations()
    #r.slopeOneRecommendations(r.data['1'])
    #r.slopeOneRecommendations(r.data['25'])
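
    The rating-loading step can be tried without downloading the dataset. The sketch below parses tab-separated lines in the u.data layout (user id, item id, rating, timestamp) from an in-memory buffer. The function name and the sample lines are hypothetical.

```python
import io


def parse_ratings(lines):
    """Build {user: {movie: rating}} from tab-separated rating lines."""
    data = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        user, movie, rating = fields[0], fields[1], int(fields[2])
        # setdefault creates the user's dict on first sight
        data.setdefault(user, {})[movie] = rating
    return data


# Made-up sample lines in the same layout as u.data.
sample = io.StringIO("1\t10\t5\t100\n1\t20\t3\t101\n2\t10\t4\t102\n")
ratings = parse_ratings(sample)
print(ratings)  # {'1': {'10': 5, '20': 3}, '2': {'10': 4}}
```

    User and movie ids stay strings here, which matches the lookups such as r.data['1'] in the test code above.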
    
