Introduction to Machine Learning with Python

Author: 沧海一粟谦 | Published: 2018-05-04 10:39 | Read 95 times

    Installing Python on Windows

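    Python official site: https://www.python.org/
    My computer is 64-bit. For 3.x the matching download is "Windows x86-64 executable installer". Because 2.x and 3.x are incompatible, and the 2.x code I want to run would have to be modified before it runs on 3.x, I chose the 2.x version: "Windows x86-64 MSI installer".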


    Make sure that both pip and "Add python.exe to Path" are selected, then click "Next" all the way through to finish the installation.

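    By default Python is installed to C:\Python27. Open a Command Prompt window and type python; if the interpreter starts and shows its version banner and the >>> prompt, Python was installed successfully!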


    If you see this instead: 'python' is not recognized as an internal or external command, operable program or batch file.

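    This happens because Windows searches for python.exe in the directories listed in the Path environment variable and reports an error when it cannot find it. If you forgot to tick "Add python.exe to Path" during installation, add C:\Python27, the directory that contains python.exe, to Path by hand.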


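    The installer places its entry at the very front of Path, ahead of the Windows system entries, and in my case the change did not take effect until after a restart. Moving the entry to the end of Path made it work without restarting.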
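
    To check whether a directory such as C:\Python27 really is on Path, its entries can be compared one by one. The sketch below assumes Python 3; path_contains is a hypothetical helper name, not part of any library, and the example Path string is made up. It uses the standard ntpath module so that Windows path rules (case-insensitive, trailing backslash ignored) apply.

```python
import ntpath
import os

def path_contains(directory, path_value=None, sep=";"):
    """Return True if `directory` is one of the entries in a Windows-style Path string."""
    if path_value is None:
        # Fall back to the PATH of the current process.
        path_value = os.environ.get("PATH", "")
    # Normalise case, slashes and trailing separators before comparing.
    wanted = ntpath.normcase(ntpath.normpath(directory))
    entries = [e.strip() for e in path_value.split(sep) if e.strip()]
    return any(ntpath.normcase(ntpath.normpath(e)) == wanted for e in entries)

# Example with an explicit, made-up Path string:
example_path = r"C:\Windows\system32;c:\python27\;C:\Python27\Scripts"
print(path_contains(r"C:\Python27", example_path))          # True
print(path_contains(r"C:\Python27\Scripts", example_path))  # True
print(path_contains(r"C:\Python36", example_path))          # False
```

    Calling path_contains(r"C:\Python27") without a second argument checks the Path of the running process instead.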

    Installing Jupyter Notebook with Python 3

    python3 -m pip install --upgrade pip
    python3 -m pip install jupyter
    

    Installing Jupyter Notebook with Python 2

    python -m pip install --upgrade pip
    python -m pip install jupyter
    

    Starting Jupyter Notebook

    jupyter notebook
    

    Installing numpy

    Matrix computations come up constantly, so install the numpy package first by downloading its wheel (.whl) file.

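    • Pick the package that matches your installed Python version and build. A 32-bit Python build needs the win32 wheel: numpy-1.14.3+mkl-cp27-cp27m-win32.whl
    • Copy the downloaded package into C:\Python27\Scripts under the Python installation directory.
    • Copy the path of the Scripts folder, then go to "right-click Computer > Properties > Advanced system settings > Environment Variables > System variables > Path > Edit" and paste the path in.
    • Open a command prompt and run pip<version> install followed by the full path and file name of the numpy package.
      For example, mine is: pip2.7 install C:\Python27\Scripts\numpy-1.14.3+mkl-cp27-cp27m-win32.whl
    • When the installation succeeds, pip prints "Successfully installed".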

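    Two unexpected errors came up during installation. The second one goes away after upgrading pip as prompted, but what about the first?
    It turned out that the Python I had installed only accepts win32 wheel files. A 64-bit operating system does not mean you should pick the amd64 wheel; what matters is the Python build. Downloading the win32 numpy package again fixed it.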
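
    Whether the installed interpreter is a 32-bit or a 64-bit build can be checked from Python itself before downloading a wheel. This is a minimal sketch using only the standard library; python_bitness is a hypothetical helper name.

```python
import platform
import struct
import sys

def python_bitness():
    """Return 32 or 64 depending on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8

bits = python_bitness()
print("Python", sys.version.split()[0], "is a", bits, "bit build")
print("platform.architecture():", platform.architecture()[0])
# On Windows, a 32-bit build needs wheels tagged win32; a 64-bit build needs win_amd64.
print("wheel platform tag to look for on Windows:", "win_amd64" if bits == 64 else "win32")
```

    Run it with the same interpreter that pip belongs to, because several Python installations can coexist on one machine.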



    Installing Matplotlib

    As with numpy, find the Matplotlib package, download it to C:\Python27\Scripts under the Python installation directory, and install it from cmd: pip2.7 install C:\Python27\Scripts\matplotlib-2.2.2-cp27-cp27m-win32.whl

    Installing pandas

    pip2.7 install C:\Python27\Scripts\pandas-0.23.0-cp27-cp27m-win32.whl

    Installing seaborn

    pip install seaborn

    Installing scipy

    pip2.7 install C:\Python27\Scripts\scipy-1.1.0-cp27-cp27m-win32.whl

    Installing sklearn

    pip2.7 install C:\Python27\Scripts\scikit_learn-0.19.1-cp27-cp27m-win32.whl
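
    After the installs, a quick way to confirm that each package can actually be imported is to try importing it and print its version. The sketch below is one possible check, not part of the original setup; installed_versions is a hypothetical helper name.

```python
import importlib

def installed_versions(names):
    """Map each module name to its __version__ string, or None if it cannot be imported."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Any failure while importing counts as "not usable" for this check.
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions

# Import names can differ from package names: scikit-learn is imported as sklearn.
packages = ["numpy", "matplotlib", "pandas", "seaborn", "scipy", "sklearn"]
for name, version in installed_versions(packages).items():
    print(name, "->", version if version else "NOT INSTALLED")
```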

    Applying Euclidean Distance

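    Sichuan Restaurant Rankings
    
    --------------------------------------------------------------------------------
                       | Braised Pork | Poached Beef | Fuqi Feipian | Mapo Tofu    |
    --------------------------------------------------------------------------------
      Kitchen God      |              |              |              |              |
    --------------------------------------------------------------------------------
      God of Cookery   |              |              |              |              |
    --------------------------------------------------------------------------------
      God of Gamblers  |              |              |              |              |
    --------------------------------------------------------------------------------
      Foodie           |              |              |              |              |
    --------------------------------------------------------------------------------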
    

    Loading the data

    import numpy as np
    
    Restr_1 = [[3.5, 3.0, 3.0, 4.0],
               [2.0, 2.5, 2.5, 3.5],
               [3.0, 3.5, 3.0, 4.5],
               [4.0, 3.0, 3.5, 4.0]]
    
    Restr_2 = [[4.5, 4.0, 4.0, 4.5],
               [3.0, 3.5, 3.5, 4.5],
               [4.0, 3.5, 4.0, 4.0],
               [4.5, 4.0, 4.5, 4.5]]
    
    Restr_3 = [[1.5, 2.0, 2.0, 2.5],
               [1.0, 1.5, 1.5, 1.5],
               [2.0, 2.5, 2.0, 2.0],
               [1.5, 2.0, 2.5, 2.5]]
    

    Euclidean distance formula

    def euclidean_score(param1, param2):
        
        subtracted_diff = np.subtract(param1, param2) 
    
        squared_diff = np.square( subtracted_diff)
        
        eu_dist = np.sqrt(np.sum(squared_diff))
            
        return eu_dist  , 1 / (1 + eu_dist) 
    
    R12, r12= euclidean_score(Restr_1,Restr_2)
    R13, r13= euclidean_score(Restr_1,Restr_3)
    R23, r23= euclidean_score(Restr_2,Restr_3)
    

    R12=3.4641016151377544
    R13=5.916079783099616
    R23=8.717797887081348
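
    euclidean_score returns two values: the distance itself and a similarity 1 / (1 + distance), which approaches 1 as the ratings get closer. The sketch below cross-checks the three distances with np.linalg.norm (for a 2-D array its default is the square root of the summed squares) and prints the similarities as well. The scores dictionary and the pair names are illustrative additions.

```python
import numpy as np

Restr_1 = np.array([[3.5, 3.0, 3.0, 4.0], [2.0, 2.5, 2.5, 3.5],
                    [3.0, 3.5, 3.0, 4.5], [4.0, 3.0, 3.5, 4.0]])
Restr_2 = np.array([[4.5, 4.0, 4.0, 4.5], [3.0, 3.5, 3.5, 4.5],
                    [4.0, 3.5, 4.0, 4.0], [4.5, 4.0, 4.5, 4.5]])
Restr_3 = np.array([[1.5, 2.0, 2.0, 2.5], [1.0, 1.5, 1.5, 1.5],
                    [2.0, 2.5, 2.0, 2.0], [1.5, 2.0, 2.5, 2.5]])

def euclidean_score(a, b):
    """Euclidean distance over all ratings and the similarity 1 / (1 + distance)."""
    dist = np.linalg.norm(np.subtract(a, b))  # sqrt of the sum of squared differences
    return dist, 1.0 / (1.0 + dist)

scores = {
    "1-2": euclidean_score(Restr_1, Restr_2),
    "1-3": euclidean_score(Restr_1, Restr_3),
    "2-3": euclidean_score(Restr_2, Restr_3),
}
for pair, (dist, sim) in scores.items():
    print(pair, "distance = %.4f" % dist, "similarity = %.4f" % sim)

# The pair with the smallest distance has the most similar ratings.
closest = min(scores, key=lambda pair: scores[pair][0])
print("most similar pair:", closest)
```

    The squared differences sum to 12, 35 and 76, so the distances are the square roots of those numbers, which matches the values listed above.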

    KNN

    from numpy import *
    import operator
    import time
    import matplotlib.pyplot as plt
    
    def kNN(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]
        diffMat = tile(inX, (dataSetSize,1)) - dataSet
        sqDiffMat = diffMat**2
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances**0.5
        sortedDistIndicies = distances.argsort() 
        """
        print(distances)
        print(diffMat)
        print(sqDiffMat)
        print(sqDistances)
        print('index')
        print(sortedDistIndicies)
        """
        classCount={}          
        for i in range(k):
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
    
    # kNN Example
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    

    Visualizing the data

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(group[:2,0],group[:2,1], s=70, color='b')
    ax.scatter(group[2:4,0],group[2:4,1], s=70, color='r')
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.show()
    
    kNN([0.3,0.2],group,labels,3)
    #out: 'B', which means the point [0.3,0.2] belongs to class B
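
    To see why the answer is 'B', it helps to look at the distances and the three nearest neighbours directly. The sketch below is a compact variant of the same idea, not the original kNN function: it relies on NumPy broadcasting instead of tile() and on collections.Counter for the vote. knn_classify is a hypothetical name.

```python
from collections import Counter
import numpy as np

def knn_classify(query, data, labels, k):
    """Majority vote among the k training points nearest to `query` (Euclidean distance)."""
    data = np.asarray(data, dtype=float)
    # Broadcasting subtracts the query from every row of data.
    distances = np.sqrt(((data - np.asarray(query, dtype=float)) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]          # indices of the k smallest distances
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0], distances, nearest

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']

label, distances, nearest = knn_classify([0.3, 0.2], group, labels, 3)
print("distances:", np.round(distances, 3))
print("3 nearest labels:", [labels[i] for i in nearest])
print("predicted class:", label)
```

    Two of the three nearest points carry the label 'B', so the vote goes to 'B'.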
    

    Following the previous example, use the kNN algorithm to classify the movie data below:


    group = array([[3.0,104.0],[2.0,100.0],[1,81],[101,10.0],[99,5],[98,2.0]])
    labels = ['Romance','Romance','Romance','Action','Action','Action']
    
    kNN([18,90],group,labels,3)
    
    #out:'Romance'
    

    Analyzing and classifying data from a file


    from numpy import *
    import matplotlib.pyplot as plt
    
    def file2matrix(filename):
        fr = open(filename)
        numberOfLines = len(fr.readlines())         #get the number of lines in the file
        returnMat = zeros((numberOfLines,3))        #prepare matrix to return
        classLabelVector = []                       #prepare labels return   
        fr = open(filename)
        index = 0
        for line in fr.readlines():
            line = line.strip()
            listFromLine = line.split('\t')
            returnMat[index,:] = listFromLine[0:3]
            classLabelVector.append(int(listFromLine[-1]))
            index += 1
        return returnMat,classLabelVector
    
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
    
    plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
    plt.scatter(datingDataMat[:,1], datingDataMat[:,2], 15.0*array(datingLabels), 15.0*array(datingLabels))
    plt.xlabel('Percentage of Time Spent Playing Video Games')
    plt.ylabel('Liters of Ice Cream Consumed Per Week')
    plt.show()
    
    plt.scatter(datingDataMat[:,0], datingDataMat[:,1], 15.0*array(datingLabels), 15.0*array(datingLabels))
    plt.xlabel('Frequent Flyer Miles Earned Per Year')
    plt.ylabel('Percentage of Time Spent Playing Video Games')
    plt.show()
    
    import numpy as np
    import matplotlib.pyplot as plt
    
    from matplotlib.ticker import NullFormatter  # useful for `logit` scale
    
    # Fixing random state for reproducibility
    np.random.seed(19680801)
    
    # make up some data in the interval ]0, 1[
    y = np.random.normal(loc=0.5, scale=0.4, size=1000)
    y = y[(y > 0) & (y < 1)]
    y.sort()
    x = np.arange(len(y))
    
    # plot with various axes scales
    plt.figure(1)
    
    # linear
    plt.subplot(221)
    plt.plot(x, y)
    plt.yscale('linear')
    plt.title('linear')
    plt.grid(True)
    
    
    # log
    plt.subplot(222)
    plt.plot(x, y)
    plt.yscale('log')
    plt.title('log')
    plt.grid(True)
    
    
    # symmetric log
    plt.subplot(223)
    plt.plot(x, y - y.mean())
    # linthreshy is the matplotlib 2.x keyword; matplotlib 3.3 and later call it linthresh
    plt.yscale('symlog', linthreshy=0.01)
    plt.title('symlog')
    plt.grid(True)
    
    # logit
    plt.subplot(224)
    plt.plot(x, y)
    plt.yscale('logit')
    plt.title('logit')
    plt.grid(True)
    # Format the minor tick labels of the y-axis into empty strings with
    # `NullFormatter`, to avoid cumbering the axis with too many labels.
    plt.gca().yaxis.set_minor_formatter(NullFormatter())
    # Adjust the subplot layout, because the logit one may take more space
    # than usual, due to y-tick labels like "1 - 10^{-3}"
    plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95, hspace=0.25,
                        wspace=0.35)
    
    plt.show()
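
    The file2matrix function earlier in this section expects each line to hold three numeric features followed by an integer class label, separated by tabs. The sketch below writes a tiny temporary file in that assumed layout and parses it with a compact variant that closes the file through a with-block. The three sample rows are made up and are not taken from datingTestSet2.txt.

```python
import os
import tempfile
import numpy as np

def file2matrix(filename):
    """Parse tab-separated lines of three numeric features followed by an integer label."""
    with open(filename) as fr:                  # the with-block closes the file
        rows = [line.strip().split('\t') for line in fr if line.strip()]
    features = np.array([row[0:3] for row in rows], dtype=float)
    class_labels = [int(row[-1]) for row in rows]
    return features, class_labels

# Three made-up rows in the assumed layout.
sample = "40000\t8.5\t0.9\t3\n14000\t7.1\t1.6\t2\n26000\t1.4\t0.8\t1\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
try:
    features, class_labels = file2matrix(tmp.name)
    print(features.shape, class_labels)
finally:
    os.remove(tmp.name)                         # clean up the temporary file
```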
    

    Applying the Apriori Algorithm

    Write apriori.py based on the Apriori algorithm

    from numpy import *
    
    def loadDataSet():
        return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
    
    def createC1(dataSet):
        C1 = []
        for transaction in dataSet:
            #print(transaction)
            for item in transaction:
                #print(item)
                if not [item] in C1:
                    #print("C1 before:")
                    #print(C1)
                    C1.append([item])
                    #print("C1 now:")
                    #print(C1)
                    
        C1.sort()
        return list(map(frozenset, C1))#use frozen set so we can use it as a key in a dict;
                                #list() because a Python 3 map object can only be iterated once
    
    def scanD(D, Ck, minSupport):
        ssCnt = {}
        for tid in D:
            for can in Ck:
                if can.issubset(tid):
                    #print("ssCnt before:")
                    #print(ssCnt)
                    if not can in ssCnt: ssCnt[can]=1
                    else: ssCnt[can] += 1
                    #print("ssCnt now:")
                    #print(ssCnt)
        numItems = float(len(list(D)))
        print("numItems:")
        print(numItems)
        retList = []
        supportData = {}
        for key in ssCnt:
            print(key)
            support = ssCnt[key]/numItems
            if support >= minSupport:
                retList.insert(0,key)
            supportData[key] = support
            print(support)
        return retList, supportData
    
    def aprioriGen(Lk, k): #creates Ck
        retList = []
        lenLk = len(Lk)
        for i in range(lenLk):
            for j in range(i+1, lenLk): 
                L1 = sorted(Lk[i])[:k-2]; L2 = sorted(Lk[j])[:k-2] #sort first, then take the first k-2 items
                if L1==L2: #if first k-2 elements are equal
                    retList.append(Lk[i] | Lk[j]) #set union
        return retList
    
    def apriori(dataSet, minSupport = 0.5):
        C1 = createC1(dataSet)
        D = list(map(set, dataSet))
        L1, supportData = scanD(D, C1, minSupport)
        L = [L1]
        k = 2
        while (len(L[k-2]) > 0):
            Ck = aprioriGen(L[k-2], k)
            Lk, supK = scanD(D, Ck, minSupport)#scan DB to get Lk
            supportData.update(supK)
            L.append(Lk)
            k += 1
        return L, supportData
    
    def generateRules(L, supportData, minConf=0.7):  #supportData is a dict coming from scanD
        bigRuleList = []
        for i in range(1, len(L)):#only get the sets with two or more items
            for freqSet in L[i]:
                H1 = [frozenset([item]) for item in freqSet]
                if (i > 1):
                    rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
                else:
                    calcConf(freqSet, H1, supportData, bigRuleList, minConf)
        return bigRuleList         
    
    def calcConf(freqSet, H, supportData, brl, minConf=0.7):
        prunedH = [] #create new list to return
        for conseq in H:
            conf = supportData[freqSet]/supportData[freqSet-conseq] #calc confidence
            if conf >= minConf: 
                print(freqSet-conseq,'-->',conseq,'conf:',conf)
                brl.append((freqSet-conseq, conseq, conf))
                prunedH.append(conseq)
        return prunedH
    
    def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
        m = len(H[0])
        if (len(freqSet) > (m + 1)): #try further merging
            Hmp1 = aprioriGen(H, m+1)#create Hm+1 new candidates
            Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
            if (len(Hmp1) > 1):    #need at least two sets to merge
                rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)
                
    def pntRules(ruleList, itemMeaning):
        for ruleTup in ruleList:
            for item in ruleTup[0]:
                print(itemMeaning[item])
            print("           -------->")
            for item in ruleTup[1]:
                print(itemMeaning[item])
            print("confidence: %f" % ruleTup[2])
            print(" ")      #print a blank line
    

    Loading the data

    import apriori
    
    dataSet = [["cakes", "beer", "bread"],
               ["cakes", "beer", "bread", "donuts"],
               ["beer", "bread", "pizza"], 
               ["cakes", "bread", "donuts", "pizza"],
               ["donuts", "pizza"]]
    
    C1 = apriori.createC1(dataSet)
    list(C1)
    
    C2 = [frozenset({'cakes', 'beer'}),
     frozenset({'cakes', 'beer', 'bread'}),
     frozenset({'cakes', 'beer', 'bread', 'donuts'})]
    
    C3 =[frozenset({'beer', 'bread'}),
     frozenset({'cakes', 'beer', 'bread'}),
     frozenset({'cakes', 'beer', 'bread', 'donuts'}),
     frozenset({'beer', 'bread', 'pizza'})] 
    
    D = list(map(set, dataSet))
    D
    
    

    Computing the support counts

    L2, suppData = apriori.scanD(D, C2, 0)
    L2
    

    numItems:
    5.0
    frozenset({'beer', 'cakes'})
    0.4
    frozenset({'beer', 'bread', 'cakes'})
    0.4
    frozenset({'donuts', 'beer', 'bread', 'cakes'})
    0.2
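
    The numbers printed by scanD are supports: the fraction of transactions that contain every item of a candidate set. The confidence of a rule A --> B, as computed in calcConf, is support(A | B) / support(A). The sketch below recomputes both directly on the same five transactions using only the standard library; support and confidence are illustrative helper names.

```python
dataSet = [["cakes", "beer", "bread"],
           ["cakes", "beer", "bread", "donuts"],
           ["beer", "bread", "pizza"],
           ["cakes", "bread", "donuts", "pizza"],
           ["donuts", "pizza"]]
D = [set(t) for t in dataSet]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in D if itemset <= t) / float(len(D))

def confidence(antecedent, consequent):
    """support(antecedent | consequent) / support(antecedent)."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    return support(antecedent | consequent) / support(antecedent)

print(support({"cakes", "beer"}))                      # 0.4
print(support({"cakes", "beer", "bread", "donuts"}))   # 0.2
print(confidence({"beer"}, {"bread"}))                 # 1.0
```

    Every transaction that contains beer also contains bread, which is why the rule beer --> bread has confidence 1.0.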

    Applying Decision Trees

    Write trees.py based on the decision tree algorithm

    from math import log
    import operator
    
    def calcShannonEnt(dataSet):
        numEntries = len(dataSet)
        labelCounts = {}
        for featVec in dataSet: #count the unique labels and their occurrences
            currentLabel = featVec[-1]
            if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
            labelCounts[currentLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:
            prob = float(labelCounts[key])/numEntries
            shannonEnt -= prob * log(prob,2) #log base 2
        return shannonEnt
        
    def splitDataSet(dataSet, axis, value):
        retDataSet = []
        for featVec in dataSet:
            if featVec[axis] == value:
                reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
                reducedFeatVec.extend(featVec[axis+1:])
                retDataSet.append(reducedFeatVec)
        return retDataSet
        
    def chooseBestFeatureToSplit(dataSet):
        numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
        baseEntropy = calcShannonEnt(dataSet)
        bestInfoGain = 0.0; bestFeature = -1
        for i in range(numFeatures):        #iterate over all the features
            featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
            uniqueVals = set(featList)       #get a set of unique values
            newEntropy = 0.0
            for value in uniqueVals:
                subDataSet = splitDataSet(dataSet, i, value)
                prob = len(subDataSet)/float(len(dataSet))
                newEntropy += prob * calcShannonEnt(subDataSet)     
            infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
            print("#", i)
            print("infoGain: ", infoGain)
            print(" ")
            if (infoGain > bestInfoGain):       #compare this to the best gain so far
                bestInfoGain = infoGain         #if better than current best, set to best
                bestFeature = i
        return bestFeature                      #returns an integer
    
    def majorityCnt(classList):
        classCount={}
        for vote in classList:
            if vote not in classCount.keys(): classCount[vote] = 0
            classCount[vote] += 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
    
    def createTree(dataSet,labels):
        classList = [example[-1] for example in dataSet]
        if classList.count(classList[0]) == len(classList): 
            return classList[0]#stop splitting when all of the classes are equal
        if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
            return majorityCnt(classList)
        bestFeat = chooseBestFeatureToSplit(dataSet)
        bestFeatLabel = labels[bestFeat]
        myTree = {bestFeatLabel:{}}
        del(labels[bestFeat])
        featValues = [example[bestFeat] for example in dataSet]
        uniqueVals = set(featValues)
        for value in uniqueVals:
            subLabels = labels[:]       #copy all of labels, so trees don't mess up existing labels
            myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
        return myTree                            
        
    def classify(inputTree,featLabels,testVec):
        firstStr = list(inputTree.keys())[0]
        secondDict = inputTree[firstStr]
        featIndex = featLabels.index(firstStr)
        key = testVec[featIndex]
        valueOfFeat = secondDict[key]
        if isinstance(valueOfFeat, dict): 
            classLabel = classify(valueOfFeat, featLabels, testVec)
        else: classLabel = valueOfFeat
        return classLabel
    
    def storeTree(inputTree,filename):
        import pickle
        with open(filename,'wb') as fw:   #pickle needs binary mode
            pickle.dump(inputTree,fw)
        
    def grabTree(filename):
        import pickle
        with open(filename,'rb') as fr:
            return pickle.load(fr)
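
    chooseBestFeatureToSplit picks the feature with the largest information gain, that is, the entropy of the class labels minus the weighted entropy of the subsets produced by the split. The sketch below works through that arithmetic on a five-row toy dataset using only the standard library. The rows are illustrative and are not taken from lenses.txt; entropy and info_gain are hypothetical helper names.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the label frequencies."""
    total = float(len(labels))
    return -sum((n / total) * log(n / total, 2) for n in Counter(labels).values())

def info_gain(rows, feature_index):
    """Entropy of the labels minus the weighted entropy after splitting on one feature."""
    base = entropy([row[-1] for row in rows])
    remainder = 0.0
    for value in set(row[feature_index] for row in rows):
        subset = [row[-1] for row in rows if row[feature_index] == value]
        remainder += len(subset) / float(len(rows)) * entropy(subset)
    return base - remainder

# Toy rows: two binary features, then the class label.
rows = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print("entropy:", round(entropy([r[-1] for r in rows]), 4))
print("gain of feature 0:", round(info_gain(rows, 0), 4))
print("gain of feature 1:", round(info_gain(rows, 1), 4))
```

    Feature 0 has the larger gain here, so a tree built on these rows would split on it first.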
        
    

    Read the data file and build a decision tree with the decision tree algorithm

    import trees
    
    fr = open('lenses.txt')
    lenses = [inst.strip().split('\t') for inst in fr.readlines()]
    
    # feature labels
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
    
    # build the decision tree
    lensesTree = trees.createTree(lenses, lensesLabels)
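
    One thing to watch: createTree removes entries from the labels list it receives (del(labels[bestFeat])), so the list passed in is shorter afterwards. If the full label list is needed later, for example by classify, pass a copy. The sketch below shows how a finished tree is walked; it uses a small hand-written tree (the first one listed in retrieveTree in the plotting code of this article) and a compact classify variant so that it runs on its own.

```python
def classify(tree, feat_labels, test_vec):
    """Walk a nested-dict decision tree until a leaf (non-dict value) is reached."""
    feature = next(iter(tree))                  # the single key at this level
    branches = tree[feature]
    value = test_vec[feat_labels.index(feature)]
    subtree = branches[value]                   # raises KeyError for an unseen value
    if isinstance(subtree, dict):
        return classify(subtree, feat_labels, test_vec)
    return subtree

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
feat_labels = ['no surfacing', 'flippers']

print(classify(tree, feat_labels, [1, 0]))   # no
print(classify(tree, feat_labels, [1, 1]))   # yes

# With the trees module from this article, passing a copy keeps lensesLabels intact:
#   lensesTree = trees.createTree(lenses, lensesLabels[:])
```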
    

    Visualizing the decision tree

    import matplotlib.pyplot as plt
    
    decisionNode = dict(boxstyle="sawtooth", fc="0.8")
    leafNode = dict(boxstyle="round4", fc="0.8")
    arrow_args = dict(arrowstyle="<-")
    
    def getNumLeafs(myTree):
        numLeafs = 0
        firstStr = list(myTree.keys())[0] ###
        secondDict = myTree[firstStr]
        for key in secondDict.keys():
            if isinstance(secondDict[key], dict):#test to see if the nodes are dictionaries, if not they are leaf nodes
                numLeafs += getNumLeafs(secondDict[key])
            else:   numLeafs +=1
        return numLeafs
    
    def getTreeDepth(myTree):
        maxDepth = 0
        firstStr = list(myTree.keys())[0] ###
        secondDict = myTree[firstStr]
        for key in secondDict.keys():
            if isinstance(secondDict[key], dict):#test to see if the nodes are dictionaries, if not they are leaf nodes
                thisDepth = 1 + getTreeDepth(secondDict[key])
            else:   thisDepth = 1
            if thisDepth > maxDepth: maxDepth = thisDepth
        return maxDepth
    
    def plotNode(nodeTxt, centerPt, parentPt, nodeType):
        createPlot.ax1.annotate(nodeTxt, xy=parentPt,  xycoords='axes fraction',
                 xytext=centerPt, textcoords='axes fraction',
                 va="center", ha="center", bbox=nodeType, arrowprops=arrow_args )
        
    def plotMidText(cntrPt, parentPt, txtString):
        xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
        yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
        createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)
    
    def plotTree(myTree, parentPt, nodeTxt):#if the first key tells you what feat was split on
        numLeafs = getNumLeafs(myTree)  #this determines the x width of this tree
        depth = getTreeDepth(myTree)
        firstStr = list(myTree.keys())[0]     #the text label for this node should be this
        cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
        plotMidText(cntrPt, parentPt, nodeTxt)
        plotNode(firstStr, cntrPt, parentPt, decisionNode)
        secondDict = myTree[firstStr]
        plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
        for key in secondDict.keys():
            if isinstance(secondDict[key], dict):#test to see if the nodes are dictionaries, if not they are leaf nodes
                plotTree(secondDict[key],cntrPt,str(key))        #recursion
            else:   #it's a leaf node print the leaf node
                plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
                plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
                plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
        plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD
    #if you do get a dictionary you know it's a tree, and the first element will be another dict
    
    def createPlot(inTree):
        fig = plt.figure(1, facecolor='white')
        fig.clf()
        axprops = dict(xticks=[], yticks=[])
        createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)    #no ticks
        #createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo purposes
        plotTree.totalW = float(getNumLeafs(inTree))
        plotTree.totalD = float(getTreeDepth(inTree))
        plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0;
        plotTree(inTree, (0.5,1.0), '')
        plt.show()
    
    #def createPlot():
    #    fig = plt.figure(1, facecolor='white')
    #    fig.clf()
    #    createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo puropses 
    #    plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
    #    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
    #    plt.show()
    
    def retrieveTree(i):
        listOfTrees =[{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                      {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
                      ]
        return listOfTrees[i]
    
    #createPlot(thisTree)
    
    import treePlotter
    treePlotter.createPlot(lensesTree)
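
    The plot layout depends on two numbers: the leaf count sets the width (plotTree.totalW) and the depth sets the height (plotTree.totalD). The sketch below recomputes both for the two sample trees from retrieveTree with compact recursive helpers, so the values can be checked without opening a plot window. count_leaves and tree_depth are illustrative names, not the functions used above.

```python
def count_leaves(tree):
    """Number of leaf (non-dict) values in a nested-dict tree."""
    branches = next(iter(tree.values()))
    return sum(count_leaves(v) if isinstance(v, dict) else 1 for v in branches.values())

def tree_depth(tree):
    """Number of decision levels on the longest path from the root to a leaf."""
    branches = next(iter(tree.values()))
    return max((1 + tree_depth(v)) if isinstance(v, dict) else 1 for v in branches.values())

small = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
deep = {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}

print(count_leaves(small), tree_depth(small))   # 3 2
print(count_leaves(deep), tree_depth(deep))     # 4 3
```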
    
    [Figure: tree.png, the decision tree plotted from the lenses data]

    Applying K-Means and KNN

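    1. Implement the K-Means and KNN algorithms in any programming language;

    2. Use the K-Means algorithm to cluster the first 6 movies in the experimental data above;

    3. Enter the final "movie to be classified" row from Table 2 and assign it to a cluster based on the result of the previous step.


      Shot-count statistics used for movie classification
    1. Write kMeans.py (imported below as kMeans) based on the K-Means algorithm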

    from numpy import *
    
    def loadDataSet(fileName):      #general function to parse tab -delimited floats
        dataMat = []                #assume last column is target value
        fr = open(fileName)
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = list(map(float,curLine)) #map all elements to float()
            dataMat.append(fltLine)
        return dataMat
    
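    def distEclud(vecA, vecB):
        return sqrt(sum(power(vecA - vecB, 2))) #la.norm(vecA-vecB)
    
    def randCent(dataSet, k):   #referenced by kMeans but missing from the listing; assumed helper
        n = shape(dataSet)[1]   #that draws k random centroids inside each feature's range
        centroids = mat(zeros((k,n)))
        for j in range(n):
            minJ = dataSet[:,j].min()
            rangeJ = float(dataSet[:,j].max() - minJ)
            centroids[:,j] = minJ + rangeJ * random.rand(k,1)
        return centroids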
    
    def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
        m = shape(dataSet)[0]
        clusterAssment = mat(zeros((m,2)))#create mat to assign data points 
                                          #to a centroid, also holds SE of each point
        centroids = createCent(dataSet, k)
        clusterChanged = True
        while clusterChanged:
            clusterChanged = False
            for i in range(m):#for each data point assign it to the closest centroid
                minDist = inf; minIndex = -1
                for j in range(k):
                    distJI = distMeas(centroids[j,:],dataSet[i,:])
                    if distJI < minDist:
                        minDist = distJI; minIndex = j
                if clusterAssment[i,0] != minIndex: clusterChanged = True
                clusterAssment[i,:] = minIndex,minDist**2
            print(centroids)
            for cent in range(k):#recalculate centroids
                ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#get all the point in this cluster
                centroids[cent,:] = mean(ptsInClust, axis=0) #assign centroid to mean 
        return centroids, clusterAssment
    

    2. Load the data

    import kMeans
    import numpy as np
    
    dataMat= np.mat([[3,104],[2,100],[1,81],[101,10],[99,5],[98,2],[18,90]])
    
    3. Cluster the experimental data above with the K-Means algorithm
    kMeans.distEclud(dataMat[0],dataMat[1])
    
    myCentroids, clustAssing = kMeans.kMeans(dataMat,2)
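
    For the task of clustering the first 6 movies and then placing the new movie, one option is to fit the centroids on the six known movies only and assign [18, 90] to the nearest centroid afterwards. The sketch below does that with plain NumPy. It is a simplified stand-in for the kMeans module above, not the same code: lloyd_kmeans is a hypothetical name, and the initial centroids are fixed to the first and fourth movie (an assumption made so that the result is reproducible, whereas kMeans.kMeans starts from random centroids).

```python
import numpy as np

def lloyd_kmeans(points, init_centroids, max_iter=100):
    """Alternate nearest-centroid assignment and centroid update until nothing moves."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Distance of every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # New centroid = mean of its points; keep the old one if a cluster is empty.
        new_centroids = np.array([
            points[assign == c].mean(axis=0) if np.any(assign == c) else centroids[c]
            for c in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign

movies = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]], dtype=float)
centroids, assign = lloyd_kmeans(movies, movies[[0, 3]])
print("centroids:\n", centroids)
print("assignments:", assign.tolist())

# Place the movie to be classified in the cluster of its nearest centroid.
new_movie = np.array([18.0, 90.0])
new_cluster = int(np.linalg.norm(centroids - new_movie, axis=1).argmin())
print("new movie goes to cluster", new_cluster)
```

    The new movie lands in the same cluster as the first three movies, which agrees with the kNN result of 'Romance' obtained earlier.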
    

    4. Display the clusters

    A = np.asarray(dataMat[:,0])
    B = np.asarray(dataMat[:,1])
    CX = np.asarray(myCentroids[:,0])
    CY = np.asarray(myCentroids[:,1])
    
    import matplotlib.pyplot as plt
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(A, B, s=50, color='b')
    ax.scatter(CX, CY, s=1000, marker = '+', color='r')
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.show()
    

    5. Write the KNN algorithm to classify the final "movie to be classified"

    from numpy import *
    import operator
    import time
    import matplotlib.pyplot as plt
    
    def kNN(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]
        diffMat = tile(inX, (dataSetSize,1)) - dataSet
        sqDiffMat = diffMat**2
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances**0.5
        sortedDistIndicies = distances.argsort() 
        classCount={}          
        for i in range(k):
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
    
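    group = array([[3.0,104.0],[2.0,100.0],[1,81],[101,10.0],[99,5],[98,2.0]])  #same movie data as in the earlier kNN example
    labels = ['Romance','Romance','Romance','Action','Action','Action']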
    
    kNN([18,90],group,labels,3)
    
    

    Classification result: 'Romance'
