Machine Learning: Classifying Samples

Author: julycan | Published 2018-06-11 18:12

    This lesson works through a classic machine-learning dataset: the Iris dataset.
    The Iris dataset can be imported directly from sklearn, or downloaded from http://archive.ics.uci.edu/ml/datasets/Iris.

    Experiment environment

    • Python 3
    • Libraries used
    from sklearn import datasets
    from matplotlib import pyplot as plt
    import numpy as np
    

    Data visualization

    1. Load the dataset from sklearn
    iris = datasets.load_iris()
    features = iris['data']
    feature_names = iris['feature_names']
    target = iris['target']
    
    2. Plot each class; the code below uses sepal length and sepal width as an example
    for t, marker, c in zip(range(3), ">ox", "rgb"):
        # plot each class with its own marker and color
        plt.scatter(features[target == t, 0],
                    features[target == t, 1],
                    marker=marker,
                    c=c)
    plt.xlabel('sepal length')
    plt.ylabel('sepal width')
    plt.show()
    
    Figure 1: scatter plot of sepal length vs. sepal width
    3. Plot every pairwise combination of features
    dim_desc = {0: 'sepal length', 1: 'sepal width', 2: 'petal length', 3: 'petal width'}

    def scatter_plot(dim1, dim2):
        for t, marker, c in zip(range(3), ">ox", "rgb"):
            # plot each class with its own marker and color
            plt.scatter(features[target == t, dim1],
                        features[target == t, dim2],
                        marker=marker,
                        c=c)
        plt.xlabel(dim_desc.get(dim1))
        plt.ylabel(dim_desc.get(dim2))
        # plt.show() is deliberately omitted here; the caller shows the full grid
    
    # build a 2x3 grid of subplots, one per feature pair
    count = 0
    for j in range(3):
        for k in range(1, 4):
            if k > j:
                plt.subplot(231 + count)
                scatter_plot(j, k)
                count = count + 1
    plt.show()  # note: plt.show() comes after the whole grid is drawn
    
    Figure 2: all six pairwise feature scatter plots

    From these plots it is evident that Iris setosa can be cleanly separated from the other two species by petal length.

    Classification

    1. Separating Iris setosa
    plength = features[:, 2]
    # use numpy to extract the petal-length feature (a 1-D array)

    is_setosa = (target == 0)  # boolean 1-D array

    # boolean indexing
    setosa_plength = plength[is_setosa]
    other_plength = plength[~is_setosa]

    max_setosa = setosa_plength.max()
    min_non_setosa = other_plength.min()

    print('Maximum of setosa: {0}.'.format(max_setosa))
    print('Minimum of others: {0}.'.format(min_non_setosa))
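    Since the maximum setosa petal length printed above falls well below the minimum petal length of the other species, any cutoff between the two values yields a perfect rule. A minimal sketch of such a rule classifier (the cutoff 2.0 and the function name are assumptions, not part of the original code):

```python
from sklearn import datasets

iris = datasets.load_iris()
features = iris['data']
target = iris['target']


def is_setosa_by_petal_length(plength, cutoff=2.0):
    # predict setosa whenever petal length falls below the cutoff;
    # 2.0 is an assumed value chosen between the two printed extremes
    return plength < cutoff


pred = is_setosa_by_petal_length(features[:, 2])
accuracy = (pred == (target == 0)).mean()
print(accuracy)
```

    Because the two ranges do not overlap, this single comparison classifies every setosa sample correctly.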
    
    2. Filtering out the other two species
    # keep only the non-setosa samples
    features = features[~is_setosa]
    labels = target[~is_setosa]

    virginica = (labels == 2)  # boolean labels: True for virginica
    
    print(features.shape)
    best_acc = -1.0
    for fi in range(features.shape[1]):
        thresh = features[:, fi].copy()
        for t in thresh:
            # hypothesis: predict virginica when this feature exceeds t
            pred = (features[:, fi] > t)
            acc = (pred == virginica).mean()
            if acc > best_acc:
                best_acc = acc
                best_fi = fi
                best_t = t

    print('Best Accuracy:', best_acc)
    print('Best Feature Index:', best_fi)
    print('Best Threshold:', best_t)

    For each feature dimension we take every value in that dimension as a candidate threshold, compare the resulting boolean predictions against the actual label booleans, and average the matches to get an accuracy. After all the loops we keep the dimension and threshold with the best accuracy.
    The best model turns out to use the fourth feature, petal width, which gives us the decision boundary.
    

    The boundary obtained here is a cutoff on a single feature's value.
    The best accuracy comes out at 0.96, but note that the training data and the test data have not been kept separate.
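    One way to keep training and evaluation apart is a simple holdout split; the sketch below uses sklearn's train_test_split around the same one-feature threshold search (the 50/50 split ratio and random_state are assumptions chosen for illustration):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
mask = iris['target'] != 0                      # drop setosa
features = iris['data'][mask]
virginica = (iris['target'][mask] == 2)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, virginica, test_size=0.5, random_state=0)

# fit the same single-feature threshold model on the training half only
best_acc, best_fi, best_t = -1.0, 0, 0.0
for fi in range(X_tr.shape[1]):
    for t in X_tr[:, fi]:
        acc = ((X_tr[:, fi] > t) == y_tr).mean()
        if acc > best_acc:
            best_acc, best_fi, best_t = acc, fi, t

# evaluate on the held-out half
test_acc = ((X_te[:, best_fi] > best_t) == y_te).mean()
print(test_acc)
```

    The test accuracy is usually a little lower than the training accuracy, which is exactly the optimism the next section's cross-validation guards against.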

    3. Cross-validation
      Leave-one-out: remove one sample from the training set, train a model on the data without it, then check whether the model classifies that held-out sample correctly:
    def learn_model(features, labels):
        best_acc = -1.0
        for fi in range(features.shape[1]):
            thresh = features[:, fi].copy()
            for t in thresh:
                pred = (features[:, fi] > t)
                acc = (pred == labels).mean()
                if acc > best_acc:
                    best_acc = acc
                    best_fi = fi
                    best_t = t

        print('Best Accuracy:', best_acc)
        print('Best Feature Index:', best_fi)
        print('Best Threshold:', best_t)

        return {'accuracy': best_acc, 'Feature Index': best_fi, 'Threshold': best_t}

    def apply_model(features, labels, model):
        fi = model['Feature Index']
        t = model['Threshold']
        pred = (features[:, fi] > t)
        return pred
    
    
    error = 0.0
    for ei in range(len(features)):
        # select every position except ei:
        training = np.ones(len(features), bool)
        training[ei] = False
        testing = ~training
        model = learn_model(features[training], virginica[training])
        predictions = apply_model(features[testing], virginica[testing], model)
        error += np.sum(predictions != virginica[testing])
    print(error)
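    The same leave-one-out loop can be driven by sklearn's LeaveOneOut splitter; this sketch (the helper name learn_threshold is an assumption) reports an accuracy rather than a raw error count:

```python
from sklearn import datasets
from sklearn.model_selection import LeaveOneOut

iris = datasets.load_iris()
mask = iris['target'] != 0                     # keep versicolor and virginica
features = iris['data'][mask]
virginica = (iris['target'][mask] == 2)


def learn_threshold(X, y):
    # exhaustive search for the best single-feature ">" threshold
    best = (-1.0, 0, 0.0)
    for fi in range(X.shape[1]):
        for t in X[:, fi]:
            acc = ((X[:, fi] > t) == y).mean()
            if acc > best[0]:
                best = (acc, fi, t)
    return best[1], best[2]


errors = 0
for train_idx, test_idx in LeaveOneOut().split(features):
    fi, t = learn_threshold(features[train_idx], virginica[train_idx])
    pred = features[test_idx, fi] > t
    errors += int(pred[0] != virginica[test_idx][0])

print(1.0 - errors / len(features))            # leave-one-out accuracy
```

    Dividing the error count by the number of samples turns the raw total printed above into an estimate of how the model generalizes to unseen samples.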
    

    Source: https://www.haomeiwen.com/subject/senasftx.html