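Algorithm workflow
- Load all data from the feature files and label files into memory. Because the data contains missing values, this step also includes some simple preprocessing.
- Create the classifiers and train each one on the training data.
- Predict on the test set, then evaluate each model by comparing the predicted values with the true values and computing the overall accuracy and recall.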
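To make the evaluation step concrete, here is a minimal sketch of how these metrics are derived from true and predicted labels. The label arrays are hypothetical toy values and not taken from the dataset; `classification_report`, used in the script later on, prints precision, recall and F1 per class in one table.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical ground-truth and predicted labels for a binary problem.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# Accuracy: fraction of all samples that were predicted correctly.
accuracy = accuracy_score(y_true, y_pred)
# Precision: of the samples predicted as 1, the fraction that really are 1.
precision = precision_score(y_true, y_pred)
# Recall: of the samples that really are 1, the fraction that were found.
recall = recall_score(y_true, y_pred)

print(accuracy, precision, recall)
```

Here 3 of the 4 true positives are found and 3 of the 4 positive predictions are correct, so precision and recall are both 0.75.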
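- Preprocessing module from the sklearn library: Imputer
- Automatic generation of training and test sets: train_test_split
- K-nearest neighbors classifier: KNeighborsClassifier
- Decision tree classifier: DecisionTreeClassifier
- Gaussian naive Bayes: GaussianNB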
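As a rough illustration of the first two helpers in the list above, the sketch below fills missing values with column means and then splits the data. It assumes a recent scikit-learn, where `Imputer` has been replaced by `sklearn.impute.SimpleImputer`; the toy matrix and labels are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature matrix (hypothetical values); np.nan marks missing entries.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

# Hold out 25% of the rows as a test set; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X_filled, y, test_size=0.25, random_state=0
)

print(X_filled)
print(X_train.shape, X_test.shape)
```

The column means are 3.0 and 4.0, so those are the values written into the two gaps.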
import numpy as np
import pandas as pd
# Note: Imputer was removed in scikit-learn 0.22; newer releases provide
# sklearn.impute.SimpleImputer(strategy='mean') for the same purpose.
from sklearn.preprocessing import Imputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB


def load_datasets(feature_paths, label_paths):
    """Load and concatenate all feature files and all label files."""
    feature = np.ndarray(shape=(0, 41))  # each sample has 41 feature columns
    label = np.ndarray(shape=(0, 1))
    for file in feature_paths:
        # Comma-separated file without a header; '?' marks a missing value.
        df = pd.read_table(file, delimiter=',', na_values='?', header=None)
        # Fill each missing value with the mean of its column.
        imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
        imp.fit(df)
        df = imp.transform(df)
        feature = np.concatenate((feature, df))
    for file in label_paths:
        df = pd.read_table(file, header=None)
        label = np.concatenate((label, df))
    # Flatten the labels into a 1-D array, as the classifiers expect.
    label = np.ravel(label)
    return feature, label


if __name__ == '__main__':
    # Data paths
    featurePaths = ['A/A.feature', 'B/B.feature', 'C/C.feature', 'D/D.feature', 'E/E.feature']
    labelPaths = ['A/A.label', 'B/B.label', 'C/C.label', 'D/D.label', 'E/E.label']

    # Load the data: the first four datasets for training, the last one for testing.
    x_train, y_train = load_datasets(featurePaths[:4], labelPaths[:4])
    x_test, y_test = load_datasets(featurePaths[4:], labelPaths[4:])
    # test_size=0.0 keeps every sample in the training split, so this call
    # only shuffles the data. Recent scikit-learn releases reject a test_size
    # of 0.0; sklearn.utils.shuffle is an alternative there.
    x_train, x_, y_train, y_ = train_test_split(x_train, y_train, test_size=0.0)

    # K-nearest neighbors
    print('Start training knn')
    knn = KNeighborsClassifier().fit(x_train, y_train)
    print('Training done')
    answer_knn = knn.predict(x_test)
    print('Prediction done')

    # Decision tree
    print('Start training DT')
    dt = DecisionTreeClassifier().fit(x_train, y_train)
    print('Training done')
    answer_dt = dt.predict(x_test)
    print('Prediction done')

    # Gaussian naive Bayes
    print('Start training Bayes')
    gnb = GaussianNB().fit(x_train, y_train)
    print('Training done')
    answer_gnb = gnb.predict(x_test)
    print('Prediction done')

    # Per-class precision, recall and F1 for each classifier
    print('\n\nThe classification report for knn:')
    print(classification_report(y_test, answer_knn))
    print('\n\nThe classification report for DT:')
    print(classification_report(y_test, answer_dt))
    print('\n\nThe classification report for Bayes:')
    print(classification_report(y_test, answer_gnb))
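In the script above, the `train_test_split(..., test_size=0.0)` call appears to serve only to shuffle the training data, and recent scikit-learn versions raise an error for a `test_size` of 0.0. One possible substitute, shown here as a sketch on hypothetical stand-in arrays, is `sklearn.utils.shuffle`, which applies the same permutation to features and labels.

```python
import numpy as np
from sklearn.utils import shuffle

# Hypothetical stand-ins for x_train / y_train: row i is [2*i, 2*i + 1].
x_train = np.arange(20).reshape(10, 2)
y_train = np.arange(10)

# Shuffle features and labels together so each row keeps its own label.
x_shuffled, y_shuffled = shuffle(x_train, y_train, random_state=0)

print(y_shuffled)
print(x_shuffled[:, 0])
```

Because both arrays are permuted together, the first feature column is still exactly twice the label in every row.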
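- The feature data may contain missing values or redundant features. Feeding such features into later computation without any processing can lower model accuracy and increase the computational cost.
- During feature selection, it is common to use an auxiliary tool (for example, Weka) to visualize the data and compute statistics on it.
- As further study, consider how to filter out redundant features to make training more efficient, and try other classifiers provided by sklearn for prediction.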
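As one possible starting point for the exercise above, the sketch below removes a constant (and therefore uninformative) feature with `VarianceThreshold` and then trains a different sklearn classifier, `RandomForestClassifier`. The data is synthetic and purely illustrative. Zero-variance filtering is only one simple way to find redundant features and is not necessarily the approach the course intends.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold

rng = np.random.RandomState(0)

# Synthetic data (hypothetical): 100 samples, 5 features.
X = rng.rand(100, 5)
X[:, 2] = 1.0                      # a constant column carries no information
y = (X[:, 0] > 0.5).astype(int)    # the label depends only on feature 0

# Drop features whose variance is zero (the default threshold).
selector = VarianceThreshold()
X_reduced = selector.fit_transform(X)

# Train another sklearn classifier on the reduced feature set.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_reduced, y)

print(selector.get_support())      # boolean mask of the features that were kept
print(X_reduced.shape)
```

Fewer input columns means less work per training step, which is the efficiency gain the exercise asks about; whether accuracy also improves depends on the data.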