Radiomics Study Notes (10): t-test + LASSO + Random Forest

Author: 北欧森林 | Published: 2020-11-24 04:52

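    These notes are based on the radiomics tutorial video series by the Bilibili uploader 有Li.
    This installment (10) covers: t-test + LASSO + random forest.

    Dr. Li uses the example of having dinner with a girlfriend to illustrate that love, like machine learning, is complex, profound, and hard to predict.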

    import pandas as pd
    import numpy as np
    from sklearn.utils import shuffle
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LassoCV
    from sklearn.model_selection import train_test_split, cross_val_score,KFold,RepeatedKFold,GridSearchCV
    from scipy.stats import pearsonr, ttest_ind, levene
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import svm
    
    xlsx1_filePath = 'C:/Users/RONG/Desktop/PythonBasic/data_A.xlsx'
    xlsx2_filePath = 'C:/Users/RONG/Desktop/PythonBasic/data_B.xlsx'
    data_1 = pd.read_excel(xlsx1_filePath)
    data_2 = pd.read_excel(xlsx2_filePath)
    rows_1,__ = data_1.shape
    rows_2,__ = data_2.shape
    data_1.insert(0,'label',[0]*rows_1)
    data_2.insert(0,'label',[1]*rows_2)
    data = pd.concat([data_1,data_2])
    data = shuffle(data)
    data = data.fillna(0)
    X = data[data.columns[1:]]
    y = data['label']
    colNames = X.columns
    X = X.astype(np.float64)
    X = StandardScaler().fit_transform(X)
    X = pd.DataFrame(X)
    X.columns = colNames
    
    # t-test for feature selection
    # Levene's test decides between Student's t-test (equal variances) and Welch's t-test
    index = []
    for colName in data.columns[1:]:
        if levene(data_1[colName],data_2[colName])[1] > 0.05:
            if ttest_ind(data_1[colName],data_2[colName])[1] < 0.05:
                index.append(colName)
        else:
            if ttest_ind(data_1[colName],data_2[colName],equal_var = False)[1] < 0.05:
                index.append(colName)
    print(len(index)) # number of features that passed the t-test
    
    # keep only the features that passed the t-test (plus the label column)
    if 'label' not in index:
        index = ['label'] + index
    data_1 = data_1[index]
    data_2 = data_2[index]
    data = pd.concat([data_1,data_2])
    data = shuffle(data)
    data.index = range(len(data)) # reset the row index after shuffling
    X = data[data.columns[1:]]
    y = data['label']
    X = X.apply(pd.to_numeric,errors = 'coerce') # convert to numeric; unparseable values become NaN and are filled with 0 below
    colNames = X.columns # keep the feature names
    X = X.fillna(0)
    X = X.astype(np.float64)
    X = StandardScaler().fit_transform(X)
    X = pd.DataFrame(X)
    X.columns = colNames
    
    # lasso for further feature selection
    alphas = np.logspace(-3,1,30)
    model_lassoCV = LassoCV(alphas = alphas, cv = 10, max_iter = 100000).fit(X,y)
    
    print(model_lassoCV.alpha_)
    coef = pd.Series(model_lassoCV.coef_,index = X.columns)
    print('Lasso picked ' + str(sum(coef != 0)) + ' variables and eliminated the other ' + str(sum(coef == 0)))
    
    index = coef[coef != 0].index
    X = X[index]
    X.head()
    print(coef[coef !=0])
    
    # Random forest
    X_train, X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3)
    model_rf = RandomForestClassifier(n_estimators = 20).fit(X_train,y_train)
    score_rf = model_rf.score(X_test,y_test)
    print(score_rf)
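
The Levene-then-t-test screening step in the listing above can also be written as a small reusable function. The sketch below is illustrative only: the function name `ttest_screen`, the `alpha` parameter, and the synthetic data are assumptions made for the example and are not part of the original tutorial.

```python
import numpy as np
import pandas as pd
from scipy.stats import levene, ttest_ind


def ttest_screen(group_a, group_b, alpha=0.05):
    """Return the columns whose means differ between the two groups at level alpha."""
    selected = []
    for col in group_a.columns:
        a, b = group_a[col], group_b[col]
        # Levene's test: p > alpha means no evidence against equal variances
        equal_var = levene(a, b).pvalue > alpha
        # Student's t-test if variances look equal, Welch's t-test otherwise
        if ttest_ind(a, b, equal_var=equal_var).pvalue < alpha:
            selected.append(col)
    return selected


# Synthetic stand-in data (hypothetical): 'shifted' differs between groups, 'noise' does not
rng = np.random.default_rng(0)
group_a = pd.DataFrame({"shifted": rng.normal(0.0, 1.0, 200),
                        "noise": rng.normal(0.0, 1.0, 200)})
group_b = pd.DataFrame({"shifted": rng.normal(3.0, 1.0, 200),
                        "noise": rng.normal(0.0, 1.0, 200)})
selected = ttest_screen(group_a, group_b)
print(selected)
```

Passing the Levene result straight into `equal_var` removes the duplicated `if`/`else` branches while keeping the same decision rule as the listing.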
    
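
A single 70/30 split gives a score that can vary noticeably from run to run, and selecting features on the full dataset before splitting lets information from the test samples influence the model. One possible way to address both, sketched here under stated assumptions, is to wrap scaling, LASSO-based selection, and the random forest in one scikit-learn pipeline and score it with repeated cross-validation. The synthetic data from `make_classification` and all parameter values are placeholders, not the tutorial's data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the radiomics feature table (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=30, n_informative=5,
    n_clusters_per_class=1, class_sep=1.5, random_state=0,
)

# Every step is re-fitted on the training folds only, so the held-out fold
# never influences scaling or feature selection
pipeline = make_pipeline(
    StandardScaler(),
    SelectFromModel(LassoCV(cv=5, max_iter=100000)),  # keep features with non-zero LASSO coefficients
    RandomForestClassifier(n_estimators=20, random_state=0),
)

# 5-fold cross-validation repeated 3 times gives 15 accuracy estimates
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv)
print(scores.mean(), scores.std())
```

Reporting the mean and standard deviation over the folds gives a more stable picture of performance than the single `score_rf` value.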


        Article link: https://www.haomeiwen.com/subject/dpcabktx.html