Motion State

Author: 闫_锋 | Published 2018-05-22 09:00

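    Algorithm workflow

    • All data must be loaded into memory from the feature files and label files. Because the data contains missing values, this step also includes some simple preprocessing.
    • Create the corresponding classifiers and train them on the training data.
    • Predict on the test set, then compare the predicted values with the true values to compute the model's overall accuracy and recall, which are used to evaluate the model.
    [Figure: process.png]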
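The evaluation step in the workflow above can be illustrated with a minimal sketch. The label arrays here are made up for illustration and are not taken from the article's dataset; `accuracy_score` and `recall_score` are standard functions in `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical true and predicted labels for six samples (three classes)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Accuracy: fraction of samples whose prediction matches the true label
accuracy = accuracy_score(y_true, y_pred)

# Recall per class: of the samples that truly belong to a class,
# the fraction that the model assigned to that class
per_class_recall = recall_score(y_true, y_pred, average=None)

# Macro recall: unweighted mean of the per-class recalls
macro_recall = recall_score(y_true, y_pred, average='macro')

print(accuracy, per_class_recall, macro_recall)
```

Four of the six predictions are correct, and class 1 is the only class whose samples are all recovered, so recall differs from class to class even though a single accuracy figure hides that.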

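    sklearn preprocessing module: Imputer (replaced by SimpleImputer in sklearn.impute in recent releases)
    Automatic generation of training and test sets: train_test_split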
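As a rough illustration of these two tools, the sketch below fills missing values with the column mean and then splits the result into training and test sets. The tiny array is invented for the example, and `SimpleImputer` is used on the assumption that a recent scikit-learn release is installed.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix with two missing entries
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

# Replace each NaN with the mean of the observed values in its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_filled = imputer.fit_transform(X)

# Hold out 25% of the samples as a test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X_filled, y, test_size=0.25, random_state=0)

print(X_filled)
print(X_train.shape, X_test.shape)
```

In this example the missing entry in the first column becomes 3.0 (the mean of 1, 3 and 5) and the one in the second column becomes 4.0 (the mean of 2, 4 and 6).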

    • K-nearest-neighbors classifier: KNeighborsClassifier
    • Decision tree classifier: DecisionTreeClassifier
    • Gaussian naive Bayes: GaussianNB
    import numpy as np
    import pandas as pd

    # Imputer was removed from sklearn.preprocessing; SimpleImputer replaces it
    from sklearn.impute import SimpleImputer
    from sklearn.metrics import classification_report
    from sklearn.utils import shuffle

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB


    def load_datasets(feature_paths, label_paths):
        # Each sample has 41 feature columns and one label
        feature = np.empty(shape=(0, 41))
        label = np.empty(shape=(0, 1))
        for file in feature_paths:
            # Comma-separated features; '?' marks a missing value
            df = pd.read_csv(file, delimiter=',', na_values='?', header=None)
            # Fill missing values with the column mean
            imp = SimpleImputer(missing_values=np.nan, strategy='mean')
            df = imp.fit_transform(df)
            feature = np.concatenate((feature, df))

        for file in label_paths:
            df = pd.read_table(file, header=None)
            label = np.concatenate((label, df))

        # Flatten the labels into a 1-D array
        label = np.ravel(label)
        return feature, label


    if __name__ == '__main__':
        # Data paths
        featurePaths = ['A/A.feature', 'B/B.feature', 'C/C.feature', 'D/D.feature', 'E/E.feature']
        labelPaths = ['A/A.label', 'B/B.label', 'C/C.label', 'D/D.label', 'E/E.label']
        # Load the data: A-D for training, E for testing
        x_train, y_train = load_datasets(featurePaths[:4], labelPaths[:4])
        x_test, y_test = load_datasets(featurePaths[4:], labelPaths[4:])
        # Shuffle the training data. The original called
        # train_test_split(x_train, y_train, test_size=0.0) for this without
        # importing it; recent scikit-learn releases reject test_size=0.0.
        x_train, y_train = shuffle(x_train, y_train)
    
        print('Start training knn')
        knn = KNeighborsClassifier().fit(x_train, y_train)
        print('Training done')
        answer_knn = knn.predict(x_test)
        print('Prediction done')
    
        print('Start training DT')
        dt = DecisionTreeClassifier().fit(x_train, y_train)
        print('Training done')
        answer_dt = dt.predict(x_test)
        print('Prediction done')
    
        print('Start training Bayes')
        gnb = GaussianNB().fit(x_train, y_train)
        print('Training done')
        answer_gnb = gnb.predict(x_test)
        print('Prediction done')
    
        print('\n\nThe classification report for knn:')
        print(classification_report(y_test, answer_knn))
        print('\n\nThe classification report for DT:')
        print(classification_report(y_test, answer_dt))
        print('\n\nThe classification report for Bayes:')
        print(classification_report(y_test, answer_gnb))
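The printed reports are convenient to read but awkward to compare programmatically. One possible approach, sketched here on synthetic data rather than the article's A-E files, is to ask `classification_report` for a dictionary via `output_dict=True` and collect a few summary numbers per model. The two Gaussian clusters and the `summary` dictionary are assumptions made only for this example.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: two well-separated clusters, one per class
rng = np.random.RandomState(0)
X = np.vstack((rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
               rng.normal(loc=5.0, scale=0.5, size=(50, 2))))
y = np.array([0] * 50 + [1] * 50)

# Alternate rows between the training and test sets
x_train, y_train = X[::2], y[::2]
x_test, y_test = X[1::2], y[1::2]

models = {
    'knn': KNeighborsClassifier(),
    'DT': DecisionTreeClassifier(),
    'Bayes': GaussianNB(),
}

summary = {}
for name, model in models.items():
    pred = model.fit(x_train, y_train).predict(x_test)
    # output_dict=True returns the report as nested dictionaries
    report = classification_report(y_test, pred, output_dict=True)
    summary[name] = {
        'accuracy': report['accuracy'],
        'macro_recall': report['macro avg']['recall'],
    }

for name, scores in summary.items():
    print(name, scores)
```

Keeping the models in a dictionary also avoids repeating the train, predict and report steps once per classifier.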
    
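    • The feature data may contain missing values or redundant features. Feeding such features into later computation without processing can reduce model accuracy and increase the amount of computation.
    • In the feature selection stage, auxiliary software (for example Weka) is usually needed to visualize the data and compute statistics.
    • As further study, consider how to filter out redundant features to make model training more efficient. You can also try other classifiers provided by sklearn for prediction.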
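As one possible starting point for filtering redundant features, the sketch below drops constant columns with `VarianceThreshold` and then removes any column that is almost perfectly correlated with an earlier one. The small matrix and the 0.99 correlation cutoff are assumptions chosen for illustration; the article does not prescribe a particular method.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical features: column 1 is constant, column 3 duplicates column 0
X = np.array([[1.0, 7.0, 0.1, 1.0],
              [2.0, 7.0, 0.4, 2.0],
              [3.0, 7.0, 0.2, 3.0],
              [4.0, 7.0, 0.9, 4.0]])

# Step 1: remove features with zero variance (they carry no information)
selector = VarianceThreshold(threshold=0.0)
X_var = selector.fit_transform(X)
kept = selector.get_support(indices=True)  # indices of surviving columns

# Step 2: remove features almost perfectly correlated with an earlier feature
corr = np.corrcoef(X_var, rowvar=False)
upper = np.triu(np.abs(corr), k=1)  # keep only pairs (i, j) with i < j
redundant = [j for j in range(upper.shape[1]) if np.any(upper[:, j] > 0.99)]
X_reduced = np.delete(X_var, redundant, axis=1)

print(kept, redundant, X_reduced.shape)
```

The reduced matrix can be passed to any sklearn classifier in place of the original features; estimators other than the three used above follow the same `fit` and `predict` pattern.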


          Article link: https://www.haomeiwen.com/subject/lcfcjftx.html