4.3 Training and Test Datasets


Author: 逆风的妞妞 | Published: 2019-06-27 17:55


    1. Evaluating the performance of a machine learning algorithm

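    Testing our algorithm
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    
    # Load the iris dataset (150 samples, 4 features, 3 classes)
    iris = datasets.load_iris()
    
    # Feature matrix X, shape (150, 4); label vector y, shape (150,)
    X = iris.data
    y = iris.target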
    
    train_test_split

    Split the original dataset into two parts: a training dataset and a test dataset.

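    # Shuffle the original data first. The iris labels are sorted by class, so taking the
    # first 20% without shuffling would yield a single class. Each row of X corresponds to
    # a label in y, so the shuffle must keep that pairing intact.
    # Two ways to shuffle: stack X and y into one matrix and shuffle its rows, or shuffle
    # the indices and use them to select from both X and y (the approach used here).
    # Build a random permutation of the 150 indices
    shuffle_indexes = np.random.permutation(len(X))
    # Inspect the permutation
    shuffle_indexes
    # Fraction of the data to hold out as the test dataset
    test_ratio = 0.2
    test_size = int(len(X) * test_ratio)
    # Indices of the test dataset
    test_indexes = shuffle_indexes[:test_size]
    # Indices of the training dataset
    train_indexes = shuffle_indexes[test_size:]
    
    # Select the training and test data
    X_train = X[train_indexes]
    y_train = y[train_indexes]
    
    X_test = X[test_indexes]
    y_test = y[test_indexes]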
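
The other shuffling approach mentioned in the comments, stacking X and y into one matrix and shuffling its rows, might look roughly like the sketch below. The toy arrays and the use of `np.random.default_rng` are assumptions made for illustration; they are not part of the original walkthrough.

```python
import numpy as np

# Illustrative toy data (not the iris set): 5 samples, 2 features
X = np.arange(10).reshape(5, 2)
y = np.array([0, 1, 2, 3, 4])   # label i belongs to row i of X

# Stack the labels onto X as an extra last column
data = np.hstack([X, y.reshape(-1, 1)])

# Shuffle the rows; the fixed seed only makes this sketch repeatable
rng = np.random.default_rng(0)
shuffled = rng.permutation(data)   # permutes along the first axis (rows)

# Split features and labels apart again
X_shuffled = shuffled[:, :-1]
y_shuffled = shuffled[:, -1]
```

Because whole rows move together, every label stays attached to its sample. One drawback is that `np.hstack` casts everything to a common dtype, so with a float feature matrix such as iris the integer labels would come back as floats. Shuffling indices avoids that.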
    

    Create the model_selection.py file

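    # Split the original dataset into a training dataset and a test dataset
    
    import numpy as np
    
    def train_test_split(X, y, test_ratio=0.2, seed=None):
        """Split X and y into X_train, y_train, X_test, y_test according to test_ratio."""
        assert X.shape[0] == y.shape[0], \
            "the size of X must be equal to the size of y"
        assert 0.0 <= test_ratio <= 1.0, \
            "test_ratio must be valid"
    
        # seed=0 is a valid seed, so test against None rather than truthiness
        if seed is not None:
            np.random.seed(seed)
    
        # Random ordering of the row indices; the same ordering is applied to X and y
        shuffled_indexes = np.random.permutation(len(X))
    
        # Number of test samples: the sample count times the ratio, truncated to an int
        test_size = int(len(X) * test_ratio)
        test_indexes = shuffled_indexes[:test_size]
        train_indexes = shuffled_indexes[test_size:]
    
        X_train = X[train_indexes]
        y_train = y[train_indexes]
    
        X_test = X[test_indexes]
        y_test = y[test_indexes]
    
        return X_train, y_train, X_test, y_test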
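
To see what the `seed` parameter and the size calculation are meant to achieve, here is a small standalone sketch. It uses only NumPy and the iris sample count of 150; the variable names are illustrative and not part of playML.

```python
import numpy as np

n_samples = 150      # size of the iris dataset
test_ratio = 0.2

# Seeding before each call makes the permutation repeatable
np.random.seed(666)
first = np.random.permutation(n_samples)
np.random.seed(666)
second = np.random.permutation(n_samples)
same_order = np.array_equal(first, second)

# int() truncates, so the test set gets floor(n_samples * test_ratio) rows
test_size = int(n_samples * test_ratio)
test_indexes = first[:test_size]
train_indexes = first[test_size:]

# Test and training indices never overlap and together cover every sample
no_overlap = len(np.intersect1d(test_indexes, train_indexes)) == 0
```

With these numbers the split is 30 test samples and 120 training samples, and rerunning with the same seed gives the same split.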
    
    Testing with our algorithm
    from playML.model_selection import train_test_split
    X_train, y_train, X_test, y_test = train_test_split(X, y)
    
    from playML.kNN2 import KNNClassifier
    my_knn_clf = KNNClassifier(k=3)
    my_knn_clf.fit(X_train, y_train)
    y_predict = my_knn_clf.predict(X_test)
    # Show the predictions
    y_predict
    # Count how many predictions match the true labels
    sum(y_predict == y_test)
    # Compute the prediction accuracy
    sum(y_predict == y_test)/len(y_test)
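
The accuracy calculation can also be written with NumPy reductions. The sketch below uses small hypothetical label arrays, invented purely for illustration; they are not results from the iris run.

```python
import numpy as np

# Hypothetical labels for illustration only, not results from the iris run
y_test = np.array([0, 1, 2, 2, 1, 0, 1, 2, 0, 1])
y_predict = np.array([0, 1, 2, 1, 1, 0, 1, 2, 0, 2])

matches = y_predict == y_test   # boolean array, True where the prediction is right
n_correct = np.sum(matches)     # number of correct predictions
accuracy = np.mean(matches)     # fraction correct, same as n_correct / len(y_test)
```

Here 8 of the 10 made-up predictions match, so `accuracy` is 0.8. Taking the mean of the boolean array gives the same value as dividing the count by `len(y_test)`.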
    
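    train_test_split in sklearn
    from sklearn.model_selection import train_test_split
    # random_state sets the random seed.
    # Unlike the playML version, sklearn returns X_train, X_test, y_train, y_test in that order.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)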
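
scikit-learn's `train_test_split` returns the two feature arrays first and then the two label arrays, which differs from the playML version above. A quick shape check, assuming scikit-learn is installed, helps confirm the unpacking order:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# sklearn returns both X pieces first, then both y pieces
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666
)

print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)
print(y_train.shape, y_test.shape)   # (120,) (30,)
```

If the names are unpacked in the wrong order, the variable called `y_train` ends up holding a two-dimensional feature array, so checking the shapes catches the mistake early.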
    
