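Let's look at a binary classification task that uses logistic regression: spam filtering. The dataset comes from the UCI Machine Learning Repository and is available at http://archive.ics.uci.edu/ml/datasets/sms+spam+collection .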
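import pandas as pd
# The file is tab-separated and has no header row: column 0 is the label, column 1 the message
df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
print(df.head())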
0 1
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
print('Number of spam messages: %s' % df[df[0]=='spam'][0].count())
print('Number of ham messages: %s' % df[df[0]=='ham'][0].count())
Number of spam messages: 747
Number of ham messages: 4825
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
X = df[1].values
y = df[0].values
# First convert the labels to 0 (ham) and 1 (spam)
y = [1 if yy == 'spam' else 0 for yy in y]
# Split the data into training and test sets
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y)
# Convert the text to TF-IDF features; fit on the training set only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
# Train the model and predict
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
for i, prediction in enumerate(predictions[:5]):
    print('Predicted: %s, message: %s' % (prediction, X_test_raw[i]))
Predicted: 0, message: R u over scratching it?
Predicted: 0, message: Babe! How goes that day ? What are you up to ? I miss you already, my Love ... * loving kiss* ... I hope everything goes well.
Predicted: 0, message: I'm going 2 orchard now laready me reaching soon. U reaching?
Predicted: 0, message: ... Are you in the pub?
Predicted: 0, message: I dont thnk its a wrong calling between us
E:\python\python36\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
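Metrics for evaluating a binary classifier include accuracy, precision, recall, the F1 score, and the ROC AUC score. All of them build on the notions of true positives, true negatives, false positives, and false negatives. Positive and negative refer to the classes; true and false indicate whether the predicted class matches the actual class.
- True positive (TP): the sample's actual class is positive, and the model also predicts positive
- True negative (TN): the sample's actual class is negative, and the model predicts negative
- False positive (FP): the sample's actual class is negative, but the model predicts positive
- False negative (FN): the sample's actual class is positive, but the model predicts negative
A confusion matrix can be used to visualize these four counts. Here is a simple example.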
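from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
y_test1 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred1 = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
# Use a separate name so the confusion_matrix function is not shadowed
cm = confusion_matrix(y_test1, y_pred1)
print(cm)
# Rows are true labels, columns are predicted labels
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()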
[[4 1]
[2 3]]
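The four counts in the matrix above are enough to compute the metrics listed earlier by hand. The following is a minimal sketch in plain Python that recounts TP, TN, FP, and FN from the same toy labels; the variable names are chosen here for illustration and are not part of the original notebook.

```python
# Toy labels from the confusion-matrix example (1 = positive, 0 = negative).
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are truly positive
recall = tp / (tp + fn)             # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print('TP=%d TN=%d FP=%d FN=%d' % (tp, tn, fp, fn))  # TP=3 TN=4 FP=1 FN=2
print('Accuracy:  %.3f' % accuracy)                  # Accuracy:  0.700
print('Precision: %.3f' % precision)                 # Precision: 0.750
print('Recall:    %.3f' % recall)                    # Recall:    0.600
print('F1:        %.3f' % f1)                        # F1:        0.667
```

The counts agree with the printed matrix: the first row holds TN and FP, the second row holds FN and TP.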
scores = cross_val_score(classifier, X_train, y_train, cv=5)
print('Accuracies: %s' % scores)
print('Mean accuracy: %s' % np.mean(scores))
Accuracies: [0.95101553 0.95221027 0.94850299 0.96167665 0.95449102]
Mean accuracy: 0.9535792930268496
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Mean Precision: %s' % np.mean(precisions))
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Mean recall: %s' % np.mean(recalls))
Mean Precision: 0.991777693186144
Mean recall: 0.6476554536187563
f1s = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('Mean F1: %s' % np.mean(f1s))
Mean F1: 0.7829760388268829
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
# Column 1 of predict_proba holds the predicted probability of the positive class (spam)
predictions = classifier.predict_proba(X_test)
false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
roc_auc = auc(false_positive_rate, recall)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
# The dashed diagonal is the curve of a classifier that guesses at random
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
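To see where the points on an ROC curve come from, it can help to sweep the decision threshold by hand. The sketch below is an illustration in plain Python, not the scikit-learn implementation: `roc_points` and `trapezoid_auc` are hypothetical helper names, and the four labels and scores are made-up toy data.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, recall) pairs, one per candidate threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the points (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up toy data: true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
roc_auc = trapezoid_auc(points)
print(points)   # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(roc_auc)  # 0.75
```

Lowering the threshold can only add predicted positives, so both rates grow from (0, 0) to (1, 1); a classifier that ranks every positive above every negative would reach an area of 1.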
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    # liblinear is set explicitly because it supports both the 'l1' and 'l2' penalties
    # searched below (it was the default solver when this output was recorded)
    ('clf', LogisticRegression(solver='liblinear'))
])
# Keys follow the <step name>__<parameter name> convention of Pipeline
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}
df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
X = df[1].values
y = df[0].values
# LabelEncoder sorts the classes, so 'ham' becomes 0 and 'spam' becomes 1
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
grid_search.fit(X_train, y_train)
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
predictions = grid_search.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, predictions))
print('Precision: ', precision_score(y_test, predictions))
print('Recall: ', recall_score(y_test, predictions))
Fitting 3 folds for each of 1536 candidates, totalling 4608 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 10.4s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 32.6s
[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 1.0min
[Parallel(n_jobs=-1)]: Done 792 tasks | elapsed: 1.9min
[Parallel(n_jobs=-1)]: Done 1242 tasks | elapsed: 3.2min
[Parallel(n_jobs=-1)]: Done 1792 tasks | elapsed: 4.8min
[Parallel(n_jobs=-1)]: Done 2442 tasks | elapsed: 6.9min
[Parallel(n_jobs=-1)]: Done 3192 tasks | elapsed: 8.5min
[Parallel(n_jobs=-1)]: Done 4042 tasks | elapsed: 13.8min
[Parallel(n_jobs=-1)]: Done 4608 out of 4608 | elapsed: 15.0min finished
Best score: 0.984
Best parameters set:
    clf__C: 10
    clf__penalty: 'l2'
    vect__max_df: 0.5
    vect__max_features: 5000
    vect__ngram_range: (1, 2)
    vect__norm: 'l2'
    vect__stop_words: None
    vect__use_idf: True
Accuracy: 0.9856424982053122
Precision: 0.9748427672955975
Recall: 0.9064327485380117
df = pd.read_csv('./sentiment-analysis-on-movie-reviews/train.tsv', header=0, delimiter='\t')
print(df.count())
PhraseId 156060
SentenceId 156060
Phrase 156060
Sentiment 156060
dtype: int64
print(df.head())
   PhraseId  SentenceId                                             Phrase  Sentiment
0         1           1  A series of escapades demonstrating the adage ...          1
1         2           1  A series of escapades demonstrating the adage ...          2
2         3           1                                           A series          2
3         4           1                                                  A          2
4         5           1                                             series          2
print(df['Phrase'].head(10))
0 A series of escapades demonstrating the adage ...
1 A series of escapades demonstrating the adage ...
2 A series
3 A
4 series
5 of escapades demonstrating the adage that what...
6 of
7 escapades demonstrating the adage that what is...
8 escapades
9 demonstrating the adage that what is good for ...
Name: Phrase, dtype: object
print(df['Sentiment'].describe())
count 156060.000000
mean 2.063578
std 0.893832
min 0.000000
25% 2.000000
50% 2.000000
75% 3.000000
max 4.000000
Name: Sentiment, dtype: float64
print(df['Sentiment'].value_counts())
2 79582
3 32927
1 27273
4 9206
0 7072
Name: Sentiment, dtype: int64
print(df['Sentiment'].value_counts() / df['Sentiment'].count())
2 0.509945
3 0.210989
1 0.174760
4 0.058990
0 0.045316
Name: Sentiment, dtype: float64
# .values replaces the deprecated .as_matrix() and returns the same NumPy array
X, y = df['Phrase'], df['Sentiment'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}
# cv=3 is passed explicitly to match the 3-fold run recorded below
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
grid_search.fit(X_train, y_train)
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
Fitting 3 folds for each of 24 candidates, totalling 72 fits
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 1.5min
[Parallel(n_jobs=-1)]: Done 72 out of 72 | elapsed: 3.9min finished
E:\python\python36\lib\site-packages\sklearn\linear_model\logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
"this warning.", FutureWarning)
Best score: 0.620
Best parameters set:
clf__C: 10
vect__max_df: 0.25
vect__ngram_range: (1, 2)
vect__use_idf: False
from sklearn.metrics import classification_report, confusion_matrix
predictions = grid_search.predict(X_test)
print('Accuracy: %s' % accuracy_score(y_test, predictions))
print('Confusion Matrix:')
print(confusion_matrix(y_test, predictions))
print('Classification Report:')
print(classification_report(y_test, predictions))
Accuracy: 0.6364603357682942
Confusion Matrix:
[[ 1136 1734 597 71 1]
[ 904 6027 6070 552 21]
[ 231 3116 32634 3535 160]
[ 28 402 6732 8156 1351]
[ 7 34 549 2272 1710]]
Classification Report:
              precision    recall  f1-score   support
           0       0.49      0.32      0.39      3539
           1       0.53      0.44      0.48     13574
           2       0.70      0.82      0.76     39676
           3       0.56      0.49      0.52     16669
           4       0.53      0.37      0.44      4572
    accuracy                           0.64     78030
   macro avg       0.56      0.49      0.52     78030
weighted avg       0.62      0.64      0.62     78030
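- The first problem transformation method recasts the original multi-label problem as single-label classification: each distinct set of labels that appears in the training data is converted into a single label.
- The second problem transformation method trains one binary classifier for each label in the training set. Each classifier predicts whether or not an instance belongs to that label.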
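The two transformations above can be sketched on a tiny example. This is a minimal illustration in plain Python under stated assumptions: the label sets are made-up toy data, and `to_label_powerset` and `to_binary_relevance` are hypothetical helper names (the methods are commonly called label powerset and binary relevance). It only builds the transformed targets; training classifiers on them is left out.

```python
def to_label_powerset(label_sets):
    """First method: map every distinct label set to one class id."""
    class_ids = {}
    y_single = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in class_ids:
            class_ids[key] = len(class_ids)  # next unused class id
        y_single.append(class_ids[key])
    return y_single, class_ids


def to_binary_relevance(label_sets):
    """Second method: build one 0/1 target vector per label."""
    all_labels = sorted(set().union(*label_sets))
    return {label: [int(label in labels) for labels in label_sets]
            for label in all_labels}


# Made-up toy data: each instance carries a set of labels.
y_multilabel = [
    {'sports', 'news'},
    {'news'},
    {'sports', 'news'},
    {'tech'},
]

y_single, class_ids = to_label_powerset(y_multilabel)
print(y_single)  # [0, 1, 0, 2]

y_binary = to_binary_relevance(y_multilabel)
print(y_binary['news'])    # [1, 1, 1, 0]
print(y_binary['sports'])  # [1, 0, 1, 0]
print(y_binary['tech'])    # [0, 0, 0, 1]
```

With the first method a single multi-class classifier is trained on `y_single`, but it can only predict label sets it has seen during training. With the second method one binary classifier is trained per entry of `y_binary`, which treats the labels independently of each other.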