Implementing Decision Tree Modeling with sklearn

Author: 皮皮大 | Published 2020-01-15 16:48

    This article walks through implementing a decision tree with sklearn and the full modeling workflow, covering:

    • Data cleaning and splitting the data with train_test_split
    • Modeling with different criteria, the Gini index or information entropy, using X_train and y_train
      • Instantiate the classifier
      • Fit it with fit
    • Prediction: use the two fitted models above to predict, e.g. y_pred = clf_gini.predict(X_test)
    • Evaluating the results
      • Confusion matrix
      • Accuracy
      • Classification report

    Implementation wrapped in functions

    import numpy as np
    import pandas as pd
    from sklearn.metrics import confusion_matrix  # confusion matrix
    from sklearn.model_selection import train_test_split  # train/test splitting
    from sklearn.tree import DecisionTreeClassifier  # decision tree classifier
    from sklearn.metrics import accuracy_score  # accuracy metric
    from sklearn.metrics import classification_report  # classification report
    
    # Read the data (importing data)
    def load_data():
        balance_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-'+'databases/balance-scale/balance-scale.data',sep=',',header=None)  # load the dataset; the file has no header row
        print("Dataset Length", len(balance_data))
        
        print(balance_data.head())
        return balance_data
    
    # Split the dataset into train and test sets (splitting the dataset into train and test)
    def split_dataset(balance_data):
        
        X = balance_data.values[:, 1:5]  # feature columns
        y = balance_data.values[:, 0]  # label column
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                            random_state=100)  # split the data 70/30
        
        return X, y, X_train, X_test, y_train, y_test
    
    # Train using the Gini index (training with Gini index)
    def train_using_gini(X_train, y_train):
        
        # create the estimator first, then fit it
        clf_gini = DecisionTreeClassifier(criterion="gini"   # instantiate
                                         ,random_state=100
                                         ,max_depth=3
                                         ,min_samples_leaf=5)
        clf_gini.fit(X_train, y_train)  # fit the model
        return clf_gini
    
    # Train using information entropy (training with entropy)
    def train_using_entropy(X_train, y_train):
        
        # instantiate and fit
        clf_entropy = DecisionTreeClassifier(criterion="entropy"
                                         ,random_state=100
                                         ,max_depth=3
                                         ,min_samples_leaf=5)
        clf_entropy.fit(X_train, y_train)
        return clf_entropy
    
    # Make predictions
    def prediction(X_test, clf_object):
        
        y_pred = clf_object.predict(X_test)
        print("Predicted vlaues:")
        print(y_pred)
        return y_pred
    
    # Compute the evaluation metrics (calculate accuracy)
    def cal_accuracy(y_test, y_pred):
        
        print("Confusion Matrix:", confusion_matrix(y_test, y_pred))
        
        print("Accuracy:", accuracy_score(y_test, y_pred)*100)
        
        print("Report:", classification_report(y_test, y_pred))
        
    def main():
        data = load_data()
        X, y, X_train, X_test, y_train, y_test = split_dataset(data)
        clf_gini = train_using_gini(X_train, y_train)
        clf_entropy = train_using_entropy(X_train, y_train)
        
        print("result using gini Index:")
        y_pred_gini = prediction(X_test, clf_gini)
        cal_accuracy(y_test, y_pred_gini)
        
        print("result using Entropy:")
        y_pred_entropy = prediction(X_test, clf_entropy)
        cal_accuracy(y_test, y_pred_entropy)
           
    if __name__ == "__main__":
        main()
    

    Step-by-step implementation in a Jupyter notebook

    Importing the data

    # Load the data from the UCI repository
    data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-'+'databases/balance-scale/balance-scale.data',sep=',',header=None)
    data.head()
    
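    For reference, the balance-scale file has no header row: according to the UCI documentation, column 0 is the class label ('L', 'B', or 'R', meaning the scale tips left, is balanced, or tips right) and columns 1-4 are left-weight, left-distance, right-weight, and right-distance, each an integer from 1 to 5. An optional sanity check (a minimal sketch, not part of the original notebook):

    # Inspect the raw, header-less data; the column meanings above come from the UCI documentation
    print(data.shape)               # expected: (625, 5)
    print(data[0].value_counts())   # class distribution over L, B, R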

    Separating features and labels

    X = data.values[:, 1:5]  # feature columns
    y = data.values[:, 0]  # label column
    

    Splitting the data

    # TTS:train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 100)
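
    A quick check of the resulting split sizes (a minimal sketch, not in the original article; with test_size=0.3 on the 625 rows, train_test_split puts roughly 437 samples in the training set and 188 in the test set):

    # Verify the 70/30 split
    print(X_train.shape, X_test.shape)   # e.g. (437, 4) (188, 4)
    print(y_train.shape, y_test.shape)   # e.g. (437,) (188,)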
    

    Building models with information entropy and the Gini index

    # Using the Gini index
    clf_gini = DecisionTreeClassifier(criterion = "gini",
                                      random_state = 100,
                                      max_depth=3,
                                      min_samples_leaf=5)
    clf_gini.fit(X_train, y_train)
    
    # Using information entropy
    clf_entropy = DecisionTreeClassifier(criterion = "entropy",
                                         random_state = 100,
                                         max_depth = 3, 
                                         min_samples_leaf = 5)
    clf_entropy.fit(X_train, y_train)
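
    As a side note on what the two criteria measure, here is a small sketch (not part of the original article) that computes Gini impurity and information entropy for a toy class distribution; these are the per-node impurity measures the classifier minimizes when choosing splits:

    import numpy as np

    def gini(p):
        """Gini impurity: 1 - sum(p_i ** 2)."""
        p = np.asarray(p, dtype=float)
        return 1 - np.sum(p ** 2)

    def entropy(p):
        """Information entropy: -sum(p_i * log2(p_i))."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]  # skip zero probabilities to avoid log(0)
        return -np.sum(p * np.log2(p))

    # A node that is 80%/20% is much "purer" than a 50%/50% node under both criteria
    print(gini([0.8, 0.2]), entropy([0.8, 0.2]))  # ~0.32, ~0.72
    print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5, 1.0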
    

    Prediction and evaluation

    # Predict with the Gini-based model
    y_pred = clf_gini.predict(X_test)  # predict on X_test, then compare the results with y_test
    
    confusion_matrix(y_test, y_pred)  # confusion matrix
    accuracy_score(y_test, y_pred)  # accuracy
    classification_report(y_test, y_pred)  # classification report
    
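    The entropy-based model can be evaluated in exactly the same way to compare the two criteria (a minimal sketch mirroring the steps above and the function-based version earlier):

    # Predict with the entropy-based model and report the same metrics
    y_pred_entropy = clf_entropy.predict(X_test)

    print(confusion_matrix(y_test, y_pred_entropy))
    print(accuracy_score(y_test, y_pred_entropy) * 100)
    print(classification_report(y_test, y_pred_entropy))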
