Unit 8: Spam SMS Recognition

Author: 巴拉巴拉_9515 | Published 2018-06-11 17:11

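Chapter 8 of 《Web安全之深度学习实战》 (Web Security: Deep Learning in Practice), "Spam SMS Recognition", presents four or more methods of text feature extraction. Each one builds a dictionary of text features that is then used to train the models.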

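1. Feature Extraction

(1) Word-frequency table

CountVectorizer counts how often each word occurs in each SMS message and builds a word-frequency dictionary for the message texts.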
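As a minimal sketch of this step, assuming scikit-learn is installed. The three messages are made-up placeholders, not rows from the SMS dataset used in the book:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample messages, for illustration only.
messages = [
    "free prize call now",
    "call me when you are free",
    "are you coming home now",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messages)  # sparse matrix: messages x vocabulary

# vocabulary_ maps each word to its column index in the count matrix.
vocab = vectorizer.vocabulary_
print(sorted(vocab))
print(counts.toarray())
```

Each row of the matrix is one message, and each column holds the count of one vocabulary word in that message.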

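(2) Weighting

In a large text corpus, some words occur very often yet carry little meaning (in English, for example, "the", "a", "is"). If the raw counts are fed straight into a classifier, these very frequent terms overshadow the frequencies of rarer but more interesting terms.
TfidfTransformer takes the word counts produced by CountVectorizer and computes TF-IDF weights from them.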
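The effect of the weighting can be seen in a small sketch, again assuming scikit-learn and using made-up messages. With the default TfidfTransformer settings, a word that appears in every message ("the") ends up with a lower weight than a word that appears in only one ("prize"):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical sample messages, for illustration only.
messages = [
    "the prize is free",
    "the meeting is at noon",
    "the bus is late",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messages)

# Re-weight the raw counts: common words are scaled down, rare words up.
weights = TfidfTransformer().fit_transform(counts)

vocab = vectorizer.vocabulary_
row = weights.toarray()[0]  # weights for the first message
print("the:", row[vocab["the"]])
print("prize:", row[vocab["prize"]])
```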

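(3) Adding N-gram mode

Two settings are added to CountVectorizer: a 3-gram range (the ngram_range parameter) and token_pattern=r'\b\w+\b'. Every run of three consecutive words then becomes one token. Part of the resulting vocabulary: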

{'u dun wan': 852,
'customer service representative': 323,
'service representative freephone': 742,
'representative freephone 0808': 720,
'won guaranteed ...m': 667,
'nokia 7250i win': 586,
······
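A vocabulary of this shape can be reproduced with a short sketch. The post only says "3Gram", so ngram_range=(3, 3) is an assumption here, and the two messages are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample messages, for illustration only.
messages = [
    "call customer service now",
    "customer service representative speaking",
]

# ngram_range=(3, 3) keeps only three-word sequences (assumed setting);
# the token pattern also accepts single-character tokens such as "u".
vectorizer = CountVectorizer(ngram_range=(3, 3), token_pattern=r"\b\w+\b")
vectorizer.fit(messages)

trigrams = vectorizer.vocabulary_
print(trigrams)
```

Changing the range to (1, 3) would keep single words and word pairs alongside the trigrams.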

(4) Vocabulary processing

VocabularyProcessor builds a vocabulary and converts each text into a sequence of word IDs.
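The idea can be illustrated without the library itself. The sketch below is a hand-written stand-in, not the VocabularyProcessor implementation: the helper names build_vocab and texts_to_ids, the max_len parameter, and the choice to reserve ID 0 for padding and unknown words are all assumptions made for this example.

```python
def build_vocab(texts):
    """Assign each distinct word an integer ID, starting at 1 (0 is reserved)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def texts_to_ids(texts, vocab, max_len):
    """Map each text to a fixed-length list of word IDs, truncating or zero-padding."""
    sequences = []
    for text in texts:
        ids = [vocab.get(word, 0) for word in text.lower().split()][:max_len]
        ids += [0] * (max_len - len(ids))
        sequences.append(ids)
    return sequences


# Hypothetical sample messages, for illustration only.
messages = ["free prize call now", "call me now"]
vocab = build_vocab(messages)
sequences = texts_to_ids(messages, vocab, max_len=5)
print(sequences)
```

Fixed-length ID sequences like these are the usual input for embedding layers in neural network models.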

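2. SMS Classification

(1) Naive Bayes classification

The naive Bayes model takes only a few lines of code.

from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB

def do_nb_wordbag(x_train, x_test, y_train, y_test):
    # GaussianNB expects dense arrays; sparse count matrices need .toarray() first
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

do_nb_wordbag(x_train, x_test, y_train, y_test)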
             precision  recall  f1-score  support
avg / total       0.90    0.79      0.82     2230
[[1471  453]
 [  22  284]]

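The report gives naive Bayes a weighted-average precision of 0.90. Overall accuracy, computed from the confusion matrix, is lower, at about 79%.
Pros and cons: fast.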
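The confusion matrix printed by the naive Bayes run can be turned into accuracy, precision and recall by hand. The sketch assumes the scikit-learn layout (rows are true labels, columns are predictions) and treats the second class as class 1; the post does not say which label stands for spam.

```python
# Confusion matrix reported for the naive Bayes run
# (scikit-learn convention: rows are true labels, columns are predictions).
matrix = [[1471, 453],
          [22, 284]]

tn, fp = matrix[0]
fn, tp = matrix[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)  # of messages predicted as class 1, the share that were right
recall_1 = tp / (tp + fn)     # of true class-1 messages, the share that were found

print(f"accuracy={accuracy:.3f} precision={precision_1:.3f} recall={recall_1:.3f}")
```

The same arithmetic applies to the confusion matrices of the other classifiers.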

(2) SVM classification

The training code for the SVM (support vector machine) is just as simple.

from sklearn import metrics, svm

def do_svm_doc2vec(x_train, x_test, y_train, y_test):
    print("SVM and doc2vec")
    # SVC with default settings
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

'precision', 'predicted', average, warn_for)
             precision  recall  f1-score  support
avg / total       0.74    0.86      0.80     2230
0.862780269058296

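Pros and cons: slower than naive Bayes. Other sources report that SVM is more effective than naive Bayes at spam SMS recognition, but the fit obtained here is not very good.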
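One quick sanity check on a score like this is to compare it with a majority-class baseline. The sketch assumes the test set holds 1924 messages of class 0 and 306 of class 1. Those counts are the row sums of the confusion matrices reported for the other classifiers, on the assumption that every model was scored on the same split.

```python
# Class counts in the test set, taken from the row sums of the
# confusion matrices reported for the other classifiers.
n_class0 = 1924
n_class1 = 306

# Accuracy of a trivial model that labels every message as class 0.
baseline_accuracy = n_class0 / (n_class0 + n_class1)

svm_accuracy = 0.862780269058296  # value printed by the SVM run

print(baseline_accuracy)
print(abs(baseline_accuracy - svm_accuracy) < 1e-9)
```

The two values coincide. The post does not show the SVM confusion matrix, but this suggests the classifier assigned every message to class 0. The truncated warning mentioning 'precision' and 'predicted' would be consistent with that.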

(3) Random forest classification

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

def do_rf_word2vec(x_train, x_test, y_train, y_test):
    print("rf and word2vec")
    # Forest of 50 decision trees
    clf = RandomForestClassifier(n_estimators=50)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

0.9789237668161435
[[1921 3]
[ 44 262]]

Pros: in this experiment random forest runs quickly (between naive Bayes and SVM) and identifies spam well, with an accuracy of 97.9%.

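(4) XGBoost classification

import xgboost as xgb
from sklearn import metrics
from sklearn.metrics import classification_report

def do_xgboost_word2vec(x_train, x_test, y_train, y_test):
    print("xgboost and word2vec")
    # XGBClassifier with default settings
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))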

             precision  recall  f1-score  support
avg / total       0.96    0.95      0.95     2230
[[1923    1]
 [ 101  205]]

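Pros and cons: relatively slow (slower even than SVM). Weighted-average precision is 0.96 and accuracy about 95%, so the classification is fairly good.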

(5) MLP classification

from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.neural_network import MLPClassifier

def do_dnn_wordbag(x_train, x_test, y_train, y_test):
    print("MLP and wordbag")
    # Small multilayer perceptron with two hidden layers of 5 and 2 units
    clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                        hidden_layer_sizes=(5, 2), random_state=1)
    print(clf)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

             precision  recall  f1-score  support
avg / total       0.97    0.97      0.97     2230
[[1889   35]
 [  27  279]]

Pros and cons: the model is fast (about the same run time as random forest), and accuracy is about 97%.

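(6) Neural network classification

The model takes too long to run, and the computer's fan spins at full speed on every run, so I gave up on it.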


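3. Model Comparison

Naive Bayes: fast; weighted precision 0.90, accuracy about 79%;
SVM: relatively slow; accuracy 86%;
Random forest: fairly fast (between naive Bayes and SVM); accuracy 97.9%;
XGBoost: relatively slow (slower than SVM); weighted precision 0.96, accuracy about 95%;
MLP: fast (about the same run time as random forest); accuracy about 97%;
Neural network: too time-consuming to run.

Random forest or MLP is therefore the more suitable choice for identifying spam SMS at scale.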
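The speed comparison above is qualitative. A timing harness along the following lines could put numbers on it. This is only a sketch: it runs on synthetic data from make_classification rather than the SMS features, leaves out XGBoost to avoid the extra dependency, and the random_state values are arbitrary. Its timings and accuracies say nothing about the results reported in this post.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in data; the post's SMS features are not reproduced here.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

models = {
    "naive_bayes": GaussianNB(),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "mlp": MLPClassifier(solver="lbfgs", alpha=1e-5,
                         hidden_layer_sizes=(5, 2), random_state=1),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(x_train, y_train)        # training time is included
    y_pred = model.predict(x_test)     # so is prediction time
    elapsed = time.perf_counter() - start
    results[name] = (accuracy_score(y_test, y_pred), elapsed)

for name, (acc, secs) in results.items():
    print(f"{name:14s} accuracy={acc:.3f} time={secs:.3f}s")
```

Replacing the synthetic arrays with the real feature matrices would give timings comparable to the ones described here.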

4. Summary

The author has published the code for this case study on GitHub.

Article link: https://www.haomeiwen.com/subject/cyrheftx.html