From Linear Regression to Logistic Regression

Author: 异想派 | Published 2016-10-23 00:19

    Binary classification with logistic regression

    • Probability distribution: the response value represents a probability in [0, 1]

    1 . Ordinary linear regression assumes the response variable is normally distributed (the Gaussian distribution, also called the bell curve)
    2 . If the response variable is not normally distributed but instead represents a probability event, that assumption is violated
    3 . Generalized linear models use a link function to describe the relationship between the explanatory variables and the response variable
    4 . Ordinary linear regression is the special case of the generalized linear model that uses the identity link function, relating a linear combination of the explanatory variables to the normally distributed response variable
    5 . In logistic regression, if the response value exceeds a threshold the prediction is positive; otherwise it is negative
    6 . The response variable is modeled as a function of a linear combination of the explanatory variables using the logistic function. The logistic function returns a value between 0 and 1


    7 . For the logistic function, t is equal to a linear combination of the explanatory variables (written out below)
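    Written out, the standard logistic (sigmoid) function and its argument t are:

    F(t) = 1 / (1 + e^(-t))
    t = β0 + β1·x1 + β2·x2 + ... + βn·xn

    F(t) maps any real t to a value in (0, 1), which is interpreted as the probability of the positive class.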

    Spam filtering (SMS spam)

    1 . explore the data and calculate some basic summary statistics using pandas

    import pandas as pd
    # SMSSpamCollection is tab-separated with no header row: column 0 is the label ('spam'/'ham'), column 1 the message text
    df = pd.read_table('/Users/enniu/Desktop/SMSSpamCollection', delimiter='\t', header=None)
    print('Number of spam messages:', df[df[0] == 'spam'][0].count())
    print('Number of ham messages:', df[df[0] == 'ham'][0].count())
    

    2 . create a TfidfVectorizer, then fit it with training messages, and transform both the training and test messages

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model.logistic import LogisticRegression
    from sklearn.cross_validation import train_test_split
    df = pd.read_table('/Users/enniu/Desktop/SMSSpamCollection', delimiter='\t', header=None)
    X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])  # 25% is held out as the test set; the splits are pandas Series

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(X_train_raw)  # fit on the training messages and build the tf-idf matrix
    X_test = vectorizer.transform(X_test_raw)   # a scipy sparse matrix; transform only, no refitting
    

    3 . create an instance of LogisticRegression and train the model

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model.logistic import LogisticRegression
    from sklearn.cross_validation import train_test_split
    df = pd.read_table('/Users/enniu/Desktop/SMSSpamCollection', delimiter='\t', header=None)
    X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])  # 25% is held out as the test set; the splits are pandas Series
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(X_train_raw)  # fit on the training messages and build the tf-idf matrix
    X_test = vectorizer.transform(X_test_raw)   # a scipy sparse matrix
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_test)
    for i, prediction in enumerate(predictions[:5]):
        print('Prediction: %s. True label: %s. Message: %s' % (prediction, y_test.iloc[i], X_test_raw.iloc[i]))
        # iloc (position-based indexing) is required here; X_test_raw[i] would raise an error because
        # train_test_split shuffles the rows, so the original numeric index no longer matches the positions
    

    Binary classification performance metrics

                        Predicted positive     Predicted negative
    Actual positive     True Positive (TP)     False Negative (FN)
    Actual negative     False Positive (FP)    True Negative (TN)

    In practice scikit-learn orders the labels ascending, so with labels 0 (negative) and 1 (positive) the positive class appears in the second row and column:

                    Predicted 0    Predicted 1
    Actual 0        TN             FP
    Actual 1        FN             TP
    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as plt
    y_test = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    y_pred = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
    cm = confusion_matrix(y_test, y_pred)  # renamed to avoid shadowing the confusion_matrix function
    print(cm)  # [[4 1]
               #  [2 3]]  -> 4 TN, 1 FP, 2 FN, 3 TP
    plt.matshow(cm)
    plt.title('Confusion matrix')
    plt.colorbar()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
    

    Accuracy

    • Accuracy measures the fraction of the classifier's predictions that are correct
    from sklearn.metrics import accuracy_score
    y_pred = [0, 1, 1, 0]
    y_true = [1, 1, 1, 1]
    print('Accuracy:', accuracy_score(y_true, y_pred))  # 2 of 4 predictions match, so the outcome is 0.5
    
    • evaluate the classifier's accuracy
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model.logistic import LogisticRegression
    from sklearn.cross_validation import train_test_split, cross_val_score
    df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
    X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(X_train_raw)
    X_test = vectorizer.transform(X_test_raw)
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    scores = cross_val_score(classifier, X_train, y_train, cv=5)  # 5-fold cross-validated accuracy on the training set
    #y_pre = classifier.predict(X_test)
    #for i, pre in enumerate(y_pre[:5]):
    #    print(y_pre[i], y_test.iloc[i], X_test_raw.iloc[i])
    print('Accuracy:', np.mean(scores), scores)
    # Outcome: Accuracy 0.955980861244 [ 0.94976077  0.95933014  0.96052632  0.96291866  0.94736842]
    
    • Drawbacks
      1 . accuracy cannot distinguish between false positive errors and false negative errors
      2 . accuracy is not an informative metric when the class proportions in the population are skewed, as the toy example below shows
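    A quick illustration of the second drawback (a toy example, not from the SMS data): a degenerate classifier that always predicts the majority class still reaches 90% accuracy on a data set with a 9:1 class skew.

    from sklearn.metrics import accuracy_score
    y_true = [0] * 9 + [1]   # 90% negative class, 10% positive
    y_pred = [0] * 10        # degenerate classifier: always predicts the majority class
    print('Accuracy:', accuracy_score(y_true, y_pred))  # 0.9, yet every positive instance is missed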

    Precision and recall

    • definition
    1. precision is the fraction of positive predictions that are correct

    2. recall is the fraction of truly positive instances that the classifier recognizes
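    In terms of the confusion-matrix counts from the previous section:

    Precision = TP / (TP + FP)
    Recall = TP / (TP + FN)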


    • calculate the SMS classifier's precision and recall
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model.logistic import LogisticRegression
    from sklearn.cross_validation import train_test_split, cross_val_score
    df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
    X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(X_train_raw)
    X_test = vectorizer.transform(X_test_raw)
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')  # this raised an error when I ran it; not sure why (possibly the string labels 'ham'/'spam', since the 'precision' scorer expects a 0/1 positive label -- see the sketch below)
    print('Precision:', np.mean(precisions), precisions)
    recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
    print('Recall:', np.mean(recalls), recalls)
    f1s = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
    print('F1:', np.mean(f1s), f1s)
    # Outcome:
    # Precision 0.989910506899 [ 0.98591549  1.          0.98850575  0.98795181  0.98717949]
    # Recall 0.685907046477 [ 0.60344828  0.69565217  0.74782609  0.71304348  0.66956522]
    # F1: 0.806840977066 [ 0.84102564  0.81675393  0.8042328   0.79144385  0.78074866]
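    If the error really does come from the string labels (an assumption; older scikit-learn 'precision'/'recall' scorers default to a numeric positive label of 1), one possible fix is to encode 'ham'/'spam' as 0/1 before cross-validating. A minimal sketch:

    from sklearn.preprocessing import LabelBinarizer
    # LabelBinarizer sorts the classes, so 'ham' -> 0 and 'spam' -> 1 (spam becomes the positive class)
    y_train_bin = LabelBinarizer().fit_transform(y_train).ravel()
    precisions = cross_val_score(classifier, X_train, y_train_bin, cv=5, scoring='precision')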
    

    1 . Precision = 0.9899 means that almost all of the messages the classifier predicted as spam were actually spam
    2 . Recall = 0.686 means that it incorrectly classified approximately 31 percent of the spam messages as ham

    Calculating the F1 measure
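
    F1 is the harmonic mean of precision and recall (the scoring='f1' metric computed above):

    F1 = 2 · Precision · Recall / (Precision + Recall)

    With Precision = 0.9899 and Recall = 0.686 this gives roughly 0.81, matching the cross-validated F1 above.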

    ROC AUC

    • unlike accuracy, the ROC curve is insensitive to data sets with unbalanced class proportions
    • ROC curves plot the classifier's recall against its fall-out
    • Fall-out, or the false positive rate, is the number of false positives divided by the total number of negatives
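    In confusion-matrix terms:

    Fall-out (FPR) = FP / (FP + TN)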


    • AUC (area under the curve) represents the expected performance of the classifier
    • plot the ROC curve for the SMS spam classifier
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model.logistic import LogisticRegression
    from sklearn.cross_validation import train_test_split, cross_val_score
    from sklearn.metrics import roc_curve, auc
    df=pd.read_csv('/Users/enniu/Desktop/sms.csv')
    X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(X_train_raw)
    X_test = vectorizer.transform(X_test_raw)
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    predictions = classifier.predict_proba(X_test)  # predicted probability of each class for each test message
    false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])    # compare y_test with the predicted probabilities of the positive class
    roc_auc = auc(false_positive_rate, recall)     # compute the AUC value
    plt.title('Receiver Operating Characteristic')
    plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)   # 'b' draws a blue line
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.ylabel('Recall')
    plt.xlabel('Fall-out')
    plt.show()
    

    Tuning models with grid search

    Grid search fits and cross-validates the model for every combination of the specified hyperparameter values and keeps the combination with the best score.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model.logistic import LogisticRegression
    from sklearn.grid_search import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.cross_validation import train_test_split
    from sklearn.metrics import precision_score, recall_score, accuracy_score
    pipeline = Pipeline([
        ('vect', TfidfVectorizer(stop_words='english')),
        ('clf', LogisticRegression())
    ])
    # keys follow the <step name>__<parameter name> convention, e.g. vect__max_df sets max_df on the 'vect' step
    parameters = {
        'vect__max_df': (0.25, 0.5, 0.75),
        'vect__stop_words': ('english', None),
        'vect__max_features': (2500, 5000, 10000, None),
        'vect__ngram_range': ((1, 1), (1, 2)),
        'vect__use_idf': (True, False),
        'vect__norm': ('l1', 'l2'),
        'clf__penalty': ('l1', 'l2'),
        'clf__C': (0.01, 0.1, 1, 10),
    }
    if __name__ == "__main__":
        grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
        df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
        X, y = df['message'], df['label']
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        grid_search.fit(X_train, y_train)
        print('Best score: %0.3f' % grid_search.best_score_)
        print('Best parameters set:')
        best_parameters = grid_search.best_estimator_.get_params()
        for param_name in sorted(parameters.keys()):
            print('\t%s: %r' % (param_name, best_parameters[param_name]))
        predictions = grid_search.predict(X_test)
        print('Accuracy:', accuracy_score(y_test, predictions))
        print('Precision:', precision_score(y_test, predictions))
        print('Recall:', recall_score(y_test, predictions))
    # The following is the output of the script:
    Fitting 3 folds for each of 1536 candidates, totalling 4608 fits
    [Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.7s
    [Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   23.8s
    [Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   52.3s
    [Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  1.6min
    [Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  2.5min
    [Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:  3.7min
    [Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:  5.1min
    [Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:  6.8min
    [Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed: 11.2min
    [Parallel(n_jobs=-1)]: Done 4608 out of 4608 | elapsed: 12.4min finished
    Best score: 0.985
    Best parameters set:
        clf__C: 10
        clf__penalty: 'l2'
        vect__max_df: 0.25
        vect__max_features: 2500
        vect__ngram_range: (1, 2)
        vect__norm: 'l2'
        vect__stop_words: None
        vect__use_idf: True
    Accuracy: 0.98493543759
    Precision: 0.983333333333
    Recall: 0.907692307692
    

    Multi-class classification

    • One-vs.-all classification uses one binary classifier for each of the possible classes. The class predicted with the greatest confidence is assigned to the instance, as sketched below
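    A minimal sketch of the idea using scikit-learn's OneVsRestClassifier (the iris data here is just for illustration, not from the original notes):

    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.linear_model import LogisticRegression
    iris = load_iris()
    X, y = iris.data, iris.target                   # three classes: 0, 1, 2
    ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
    print(len(ovr.estimators_))                     # one binary classifier per class -> 3
    print(ovr.predict(X[:3]))                       # the most confident binary classifier decides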
