scikit-learn Machine Learning: Logistic Regression

Author: 简单一点点 | Published 2020-02-06 23:33

    The models discussed in previous sections were all generalized linear models; now let's look at logistic regression.

    Binary Classification with Logistic Regression

    Ordinary linear regression assumes that the response variable is normally distributed. In logistic regression, the response variable describes the probability that the outcome is the positive case. If the response variable equals or exceeds a discrimination threshold, the positive class is predicted; otherwise the negative class is predicted. The response variable is modeled as a function of a linear combination of the features using the logistic function.

    The logistic function always returns a value between 0 and 1, as given below, where e is Euler's number, approximately equal to 2.718.

    F(t) = \frac{1}{1 + e^{-t}}

    For logistic regression, t is a linear combination of the explanatory variables:

    F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta x)}}
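
    To build intuition, here is a minimal sketch of the logistic function in Python (the helper name logistic is ours, not part of scikit-learn):

    import numpy as np

    def logistic(t):
        # Map any real-valued input to a probability in (0, 1).
        return 1.0 / (1.0 + np.exp(-t))

    # Large negative inputs approach 0, large positive inputs approach 1.
    print(logistic(-10), logistic(0), logistic(10))  # ~4.5e-05 0.5 ~0.99995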

    Spam Filtering

    Let's look at a binary classification task using logistic regression: spam filtering. The dataset comes from the UCI Machine Learning Repository, available at http://archive.ics.uci.edu/ml/datasets/sms+spam+collection

    import pandas as pd
    df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
    print(df.head())
    
          0                                                  1
    0   ham  Go until jurong point, crazy.. Available only ...
    1   ham                      Ok lar... Joking wif u oni...
    2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
    3   ham  U dun say so early hor... U c already then say...
    4   ham  Nah I don't think he goes to usf, he lives aro...
    
    print('Number of spam messages: %s' % df[df[0]=='spam'][0].count())
    print('Number of ham messages: %s' % df[df[0]=='ham'][0].count())
    
    Number of spam messages: 747
    Number of ham messages: 4825
    

    Each row of the dataset consists of a binary label (ham for a legitimate message, spam for a spam message) and the message text. The dataset contains 4,825 ham messages and 747 spam messages. Next we use scikit-learn's LogisticRegression class to make some predictions.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, cross_val_score
    
    X = df[1].values
    y = df[0].values
    # First convert the labels to 0 (ham) and 1 (spam)
    y = [1 if yy == 'spam' else 0 for yy in y]
    # Split the data into training and test sets
    X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y)
    # Convert the raw text into TF-IDF feature vectors
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(X_train_raw)
    X_test = vectorizer.transform(X_test_raw)
    # Train the model and make predictions
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_test)
    for i, prediction in enumerate(predictions[:5]):
        print('Predicted: %s, message: %s' % (prediction, X_test_raw[i]))
    
    Predicted: 0, message: R u over scratching it?
    Predicted: 0, message: Babe! How goes that day ? What are you up to ? I miss you already, my Love ... * loving kiss* ... I hope everything goes well.
    Predicted: 0, message: I'm going 2 orchard now laready me reaching soon. U reaching?
    Predicted: 0, message: ... Are you in the pub?
    Predicted: 0, message: I dont thnk its a wrong calling between us
    
    

    Binary Classification Evaluation Metrics

    Metrics for evaluating binary classifiers include accuracy, precision, recall, F1 score, and ROC AUC. All of these measures are based on the notions of true positives, true negatives, false positives, and false negatives. Positive and negative refer to the classes; true and false indicate whether the prediction agrees with the actual label.

    These concepts are defined as follows:

    1. True Positive (TP): the sample's actual class is positive and the model also predicts positive.
    2. True Negative (TN): the sample's actual class is negative and the model also predicts negative.
    3. False Positive (FP): the sample's actual class is negative but the model predicts positive.
    4. False Negative (FN): the sample's actual class is positive but the model predicts negative.

    A confusion matrix visualizes these counts. Here is a simple example:

    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as plt
    
    y_test1 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    y_pred1 = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
    cm = confusion_matrix(y_test1, y_pred1)  # avoid shadowing the imported function
    print(cm)
    plt.matshow(cm)
    plt.title('Confusion matrix')
    plt.colorbar()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
    
    [[4 1]
     [2 3]]
    
    (Figure: confusion matrix heatmap)

    Accuracy

    Accuracy measures the proportion of the classifier's predictions that are correct. The LogisticRegression.score method uses accuracy to score a model's predictions against a test set's labels.
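
    In terms of the counts defined above:

    \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}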

    scores = cross_val_score(classifier, X_train, y_train, cv=5)
    print('Accuracies: %s' % scores)
    print('Mean accuracy: %s' % np.mean(scores))
    
    Accuracies: [0.95101553 0.95221027 0.94850299 0.96167665 0.95449102]
    Mean accuracy: 0.9535792930268496
    
    
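
    Since the text above mentions LogisticRegression.score, here is a quick sketch of calling it directly on the held-out split (reusing classifier, X_test, and y_test from the earlier cell):

    print('Test set accuracy: %s' % classifier.score(X_test, y_test))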

    While accuracy measures the overall correctness of the classifier, it does not distinguish between false positives and false negatives.

    Precision and Recall

    Precision is the proportion of positive predictions that are correct; recall is the proportion of truly positive instances that the classifier identifies.
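
    Formally:

    \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}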

    
    precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
    print('Mean Precision: %s' % np.mean(precisions))
    recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
    print('Mean recall: %s' % np.mean(recalls))
    
    
    Mean Precision: 0.991777693186144
    Mean recall: 0.6476554536187563
    

    F1 Score

    The F1 score is the harmonic mean of precision and recall.
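
    Formally:

    F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}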

    f1s = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
    print('Mean F1 score: %s' % np.mean(f1s))
    
    Mean F1 score: 0.7829760388268829
    
    

    ROC AUC

    The receiver operating characteristic (ROC) curve visualizes a classifier's performance. The ROC curve plots the classifier's recall against its fall-out, the false positive rate. AUC is the area under the ROC curve; it summarizes the ROC curve into a single value representing the expected performance of the classifier.
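
    Here recall is the true positive rate and fall-out is the false positive rate:

    \text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}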

    from sklearn.metrics import roc_curve
    from sklearn.metrics import auc
    predictions = classifier.predict_proba(X_test)
    false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
    roc_auc = auc(false_positive_rate, recall)
    plt.title('Receiver Operating Characteristic')
    plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.ylabel('Recall')
    plt.xlabel('Fall-out')
    plt.show()
    
    (Figure: ROC curve with AUC)

    Fine-Tuning Models with Grid Search

    In scikit-learn, hyperparameters are set through the constructors of estimators and transformers. In the previous examples we did not set any arguments of the LogisticRegression class; we used the default values for all hyperparameters.

    Grid search is a common method for selecting the hyperparameter values that produce the best model. It takes a set of possible values for each hyperparameter that should be tuned and evaluates a model trained on every element of the Cartesian product of those sets. In other words, grid search is an exhaustive search: it trains and evaluates a model for every possible combination of the specified hyperparameter values.

    We can use scikit-learn's GridSearchCV class to find good hyperparameter values. GridSearchCV takes an estimator, a parameter space, and a scoring metric.
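
    The grid defined below has 3 × 2 × 4 × 2 × 2 × 2 × 2 × 4 = 1536 combinations, and with 3-fold cross-validation that means 4608 fits, matching the log output further down. As a quick sanity check, scikit-learn's ParameterGrid can enumerate the grid (a sketch, assuming the parameters dict from the next cell):

    from sklearn.model_selection import ParameterGrid
    # Count the hyperparameter combinations grid search will evaluate.
    print(len(ParameterGrid(parameters)))  # 1536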

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import precision_score, recall_score, accuracy_score
    
    pipeline = Pipeline([
        ('vect', TfidfVectorizer(stop_words='english')),
        ('clf', LogisticRegression())
    ])
    parameters = {
        'vect__max_df': (0.25, 0.5, 0.75),
        'vect__stop_words': ('english', None),
        'vect__max_features': (2500, 5000, 10000, None),
        'vect__ngram_range': ((1, 1), (1, 2)),
        'vect__use_idf': (True, False),
        'vect__norm': ('l1', 'l2'),
        'clf__penalty': ('l1', 'l2'),
        'clf__C': (0.01, 0.1, 1, 10),
    }
    
    df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
    X = df[1].values
    y = df[0].values
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    predictions = grid_search.predict(X_test)
    print('Accuracy: ', accuracy_score(y_test, predictions))
    print('Precision: ', precision_score(y_test, predictions))
    print('Recall: ', recall_score(y_test, predictions))
    
    Fitting 3 folds for each of 1536 candidates, totalling 4608 fits
    
    
    [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
    [Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   10.4s
    [Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   32.6s
    [Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  1.0min
    [Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  1.9min
    [Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  3.2min
    [Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:  4.8min
    [Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:  6.9min
    [Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:  8.5min
    [Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed: 13.8min
    [Parallel(n_jobs=-1)]: Done 4608 out of 4608 | elapsed: 15.0min finished
    
    
    Best score: 0.984
    Best parameters set:
        clf__C: 10
        clf__penalty: 'l2'
        vect__max_df: 0.5
        vect__max_features: 5000
        vect__ngram_range: (1, 2)
        vect__norm: 'l2'
        vect__stop_words: None
        vect__use_idf: True
    Accuracy:  0.9856424982053122
    Precision:  0.9748427672955975
    Recall:  0.9064327485380117
    

    Multi-Class Classification

    In many classification problems there are more than two classes; scikit-learn handles these with the one-vs-rest strategy, which trains one binary classifier per class, treating that class as positive and all remaining classes as negative. The LogisticRegression class supports multi-class classification with the one-vs-rest strategy out of the box, as the sketch below illustrates.
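
    Here is a minimal sketch using scikit-learn's generic OneVsRestClassifier wrapper on made-up data (LogisticRegression applies the same idea internally):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    
    # Toy three-class data, just for illustration
    X_demo, y_demo = make_classification(n_samples=100, n_features=4,
                                         n_informative=3, n_redundant=0,
                                         n_classes=3, random_state=0)
    # One binary logistic regression is trained per class; prediction
    # picks the class whose binary classifier is most confident.
    ovr = OneVsRestClassifier(LogisticRegression())
    ovr.fit(X_demo, y_demo)
    print(len(ovr.estimators_))  # 3, one binary classifier per class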

    Let's look at a Kaggle task: classifying the sentiment of phrases from Rotten Tomatoes movie reviews (sentiment-analysis-on-movie-reviews). Each phrase is labeled with one of the following sentiments: negative, somewhat negative, neutral, somewhat positive, positive.

    df = pd.read_csv('./sentiment-analysis-on-movie-reviews/train.tsv', header=0, delimiter='\t')
    print(df.count())
    
    PhraseId      156060
    SentenceId    156060
    Phrase        156060
    Sentiment     156060
    dtype: int64
    
    print(df.head())
    
       PhraseId  SentenceId                                             Phrase  \
    0         1           1  A series of escapades demonstrating the adage ...   
    1         2           1  A series of escapades demonstrating the adage ...   
    2         3           1                                           A series   
    3         4           1                                                  A   
    4         5           1                                             series   
    
       Sentiment  
    0          1  
    1          2  
    2          2  
    3          2  
    4          2  
    
    print(df['Phrase'].head(10))
    
    0    A series of escapades demonstrating the adage ...
    1    A series of escapades demonstrating the adage ...
    2                                             A series
    3                                                    A
    4                                               series
    5    of escapades demonstrating the adage that what...
    6                                                   of
    7    escapades demonstrating the adage that what is...
    8                                            escapades
    9    demonstrating the adage that what is good for ...
    Name: Phrase, dtype: object
    
    print(df['Sentiment'].describe())
    
    count    156060.000000
    mean          2.063578
    std           0.893832
    min           0.000000
    25%           2.000000
    50%           2.000000
    75%           3.000000
    max           4.000000
    Name: Sentiment, dtype: float64
    
    print(df['Sentiment'].value_counts())
    
    2    79582
    3    32927
    1    27273
    4     9206
    0     7072
    Name: Sentiment, dtype: int64
    
    print(df['Sentiment'].value_counts() / df['Sentiment'].count())
    
    2    0.509945
    3    0.210989
    1    0.174760
    4    0.058990
    0    0.045316
    Name: Sentiment, dtype: float64
    

    We can see that roughly half of the phrases are neutral. Next we train a classifier with scikit-learn.

    X, y = df['Phrase'], df['Sentiment'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
    
    pipeline = Pipeline([
        ('vect', TfidfVectorizer(stop_words='english')),
        ('clf', LogisticRegression())
    ])
    parameters = {
        'vect__max_df': (0.25, 0.5),
        'vect__ngram_range': ((1, 1), (1, 2)),
        'vect__use_idf': (True, False),
        'clf__C': (0.1, 1, 10),
    }
    
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    
    Fitting 3 folds for each of 24 candidates, totalling 72 fits
    
    [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
    [Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.5min
    [Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed:  3.9min finished
    
    
    Best score: 0.620
    Best parameters set:
        clf__C: 10
        vect__max_df: 0.25
        vect__ngram_range: (1, 2)
        vect__use_idf: False
    

    As with binary classification, the confusion matrix is useful for visualizing a classifier's errors. Precision, recall, and F1 score can also be computed per class, and the overall accuracy across all predictions is computed as well.

    from sklearn.metrics import classification_report, confusion_matrix
    predictions = grid_search.predict(X_test)
    print('Accuracy: %s' % accuracy_score(y_test, predictions))
    print('Confusion Matrix:')
    print(confusion_matrix(y_test, predictions))
    print('Classification Report:')
    print(classification_report(y_test, predictions))
    
    Accuracy: 0.6364603357682942
    Confusion Matrix:
    [[ 1136  1734   597    71     1]
     [  904  6027  6070   552    21]
     [  231  3116 32634  3535   160]
     [   28   402  6732  8156  1351]
     [    7    34   549  2272  1710]]
    Classification Report:
                  precision    recall  f1-score   support
    
               0       0.49      0.32      0.39      3539
               1       0.53      0.44      0.48     13574
               2       0.70      0.82      0.76     39676
               3       0.56      0.49      0.52     16669
               4       0.53      0.37      0.44      4572
    
        accuracy                           0.64     78030
       macro avg       0.56      0.49      0.52     78030
    weighted avg       0.62      0.64      0.62     78030
    

    Multi-Label Classification and Problem Transformation

    In the classification tasks above, each instance had to be assigned exactly one class from the set of classes. In multi-label classification, each instance can instead be assigned a subset of the classes; for example, a forum post can carry several tags.

    There are two problem-transformation approaches to multi-label classification:

    1. The first problem-transformation approach converts the original multi-label problem into a series of single-label problems by treating every label set that appears in the training data as a single class.
    2. The second problem-transformation approach trains one binary classifier for each label in the training set; each classifier predicts whether an instance carries that label (see the sketch below).
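
    Here is a minimal sketch of the second approach on made-up forum-post data, using MultiLabelBinarizer to build the label indicator matrix and OneVsRestClassifier to train one LogisticRegression per label:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer
    
    # Hypothetical posts, each carrying a set of tags
    posts = ['how to tune hyperparameters', 'gpu out of memory error',
             'tuning learning rate on gpu', 'dataset download link broken']
    tags = [['ml'], ['hardware'], ['ml', 'hardware'], ['data']]
    
    X = TfidfVectorizer().fit_transform(posts)
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(tags)  # binary indicator matrix, one column per tag
    
    # One binary logistic regression per tag
    clf = OneVsRestClassifier(LogisticRegression())
    clf.fit(X, Y)
    print(mlb.classes_)    # ['data' 'hardware' 'ml']
    print(clf.predict(X))  # a 0/1 prediction per tag for each post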
    
    
