Reading Notes on "A Practical Guide to Data Mining", Part 3

Author: hainingwyx | Published: 2016-11-16 14:23 | Read 52 times

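    Before we start

    All source code and data used in the book are available at: http://guidetodatamining.com/
    The theory in this book is fairly simple, it contains few errors, and it offers plenty of hands-on exercises. If you write every piece of code yourself, you will learn a lot. In short: well suited to beginners.
    Reposting is welcome; please credit the source. If you spot a problem, corrections are welcome.
    Full collection of notes: https://www.zybuluo.com/hainingwyx/note/559139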

    Filtering based on item attributes

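    A newly added item has no ratings yet, so a recommender that relies only on user ratings will never recommend it. Pandora instead bases its recommendations on item attributes, using a project called the Music Genome Project.
    When a data mining method computes the distance between two objects from their feature values, and the features are on different scales, the values need to be normalized. The usual approach normalizes with the mean and the standard deviation, but both can be distorted by outliers. The modified standard score addresses this: the median (μ̃) replaces the mean, and the absolute standard deviation (asd) replaces the standard deviation, giving (value - median) / asd, where asd is the average of |value - median| over all values.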
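
    To make the difference concrete, here is a minimal sketch that computes both scores for the same column. The function names and the salary figures are hypothetical and chosen only for illustration; the standard score here uses the population standard deviation, which is an assumption, since the notes do not say which variant is meant.

```python
from statistics import mean, median, pstdev


def standard_scores(values):
    """Standard score: (x - mean) / standard deviation."""
    m = mean(values)
    sd = pstdev(values)  # population standard deviation (an assumption)
    return [(x - m) / sd for x in values]


def modified_standard_scores(values):
    """Modified standard score: (x - median) / absolute standard deviation."""
    med = median(values)
    # asd is the mean absolute deviation from the median
    asd = sum(abs(x - med) for x in values) / len(values)
    return [(x - med) / asd for x in values]


# Hypothetical salaries (in thousands); the last value is an outlier.
salaries = [40, 45, 50, 55, 60, 1000]

print("mean:", mean(salaries), "median:", median(salaries))
print("standard:", [round(s, 3) for s in standard_scores(salaries)])
print("modified:", [round(s, 3) for s in modified_standard_scores(salaries)])
```

    In this sample the mean is pulled far above the five typical salaries, while the median stays between them, so the modified scores of the typical values remain centered on zero.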


    # Compute the median and the absolute standard deviation
    # (both are methods of the Classifier class used in unitTest)
    def getMedian(self, alist):
        """return median of alist"""
        if alist == []:
            return []
        blist = sorted(alist)
        length = len(alist)
        if length % 2 == 1:
            # length of list is odd so return middle element
            return blist[length // 2]
        else:
            # length of list is even so average the two middle elements
            v1 = blist[length // 2]
            v2 = blist[length // 2 - 1]
            return (v1 + v2) / 2.0


    def getAbsoluteStandardDeviation(self, alist, median):
        """given alist and median return absolute standard deviation"""
        total = 0  # named total to avoid shadowing the built-in sum()
        for item in alist:
            total += abs(item - median)
        return total / len(alist)
    
    def unitTest():
        list1 = [54, 72, 78, 49, 65, 63, 75, 67, 54]
        list2 = [54, 72, 78, 49, 65, 63, 75, 67, 54, 68]
        list3 = [69]
        list4 = [69, 72]
        classifier = Classifier('data/athletesTrainingSet.txt')
        m1 = classifier.getMedian(list1)
        m2 = classifier.getMedian(list2)
        m3 = classifier.getMedian(list3)
        m4 = classifier.getMedian(list4)
        asd1 = classifier.getAbsoluteStandardDeviation(list1, m1)
        asd2 = classifier.getAbsoluteStandardDeviation(list2, m2)
        asd3 = classifier.getAbsoluteStandardDeviation(list3, m3)
        asd4 = classifier.getAbsoluteStandardDeviation(list4, m4)
        assert(round(m1, 3) == 65)
        assert(round(m2, 3) == 66)
        assert(round(m3, 3) == 69)
        assert(round(m4, 3) == 70.5)
        assert(round(asd1, 3) == 8)
        assert(round(asd2, 3) == 7.5)
        assert(round(asd3, 3) == 0)
        assert(round(asd4, 3) == 1.5)
        
        print("getMedian and getAbsoluteStandardDeviation work correctly")
    

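    Using assert statements to test software components is a common technique. It is important to split each part of a product into a piece of implementation code plus test code that exercises that implementation.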
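
    The unitTest function above needs the Classifier class and its training data file. As a minimal sketch of the same idea that runs on its own, the version below pairs standalone implementations with assert-based tests. The snake_case function names are hypothetical; the test lists and expected values are the ones used in unitTest.

```python
def get_median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd length: the middle element
        return ordered[mid]
    # even length: average of the two middle elements
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def get_absolute_standard_deviation(values, median):
    """Return the mean absolute deviation of values from median."""
    return sum(abs(x - median) for x in values) / len(values)


def unit_test():
    # (input list, expected median, expected asd)
    cases = [
        ([54, 72, 78, 49, 65, 63, 75, 67, 54], 65, 8),
        ([54, 72, 78, 49, 65, 63, 75, 67, 54, 68], 66, 7.5),
        ([69], 69, 0),
        ([69, 72], 70.5, 1.5),
    ]
    for values, expected_median, expected_asd in cases:
        m = get_median(values)
        asd = get_absolute_standard_deviation(values, m)
        assert round(m, 3) == expected_median
        assert round(asd, 3) == expected_asd
    return True


unit_test()
print("get_median and get_absolute_standard_deviation work correctly")
```

    Keeping the test cases in one table makes it easy to add a new case next to the implementation whenever the code changes.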

    # Normalization
    def normalizeColumn(self, columnNumber):
        """given a column number, normalize that column in self.data"""
        # first extract the column's values into a list; each v in self.data
        # holds its feature vector in v[1], and columnNumber (0 or 1 here)
        # indexes into that vector
        col = [v[1][columnNumber] for v in self.data]
        median = self.getMedian(col)
        asd = self.getAbsoluteStandardDeviation(col, median)
        #print("Median: %f   ASD = %f" % (median, asd))
        # remember (median, asd) for this column
        self.medianAndDeviation.append((median, asd))
        for v in self.data:
            v[1][columnNumber] = (v[1][columnNumber] - median) / asd
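
    The method above divides by asd, which is zero when every value in the column is identical. Below is a minimal standalone sketch of the same normalization with an explicit guard for that case. The function name, the (label, vector) record layout, and the sample records are assumptions made for illustration, and raising an error is only one possible way to handle a constant column.

```python
def normalize_column(data, column_number):
    """Normalize one column of data in place and return (median, asd).

    Each record is assumed to be (label, vector), mirroring the v[1]
    access in normalizeColumn.
    """
    col = sorted(record[1][column_number] for record in data)
    n = len(col)
    if n % 2 == 1:
        median = col[n // 2]
    else:
        median = (col[n // 2 - 1] + col[n // 2]) / 2.0
    asd = sum(abs(x - median) for x in col) / n
    if asd == 0:
        # every value equals the median, so the score is undefined
        raise ValueError("column is constant; absolute standard deviation is 0")
    for record in data:
        record[1][column_number] = (record[1][column_number] - median) / asd
    return median, asd


# Hypothetical records: (label, [height, weight])
data = [
    ("Gymnastics", [54, 66]),
    ("Basketball", [72, 162]),
    ("Track", [65, 99]),
]
median, asd = normalize_column(data, 0)
print("median:", median, "asd:", asd)
print(data)
```

    Returning (median, asd) plays the role of self.medianAndDeviation in the original method: the same pair can later be applied to a new record before computing distances.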
    
