A Simple NLP Code Exercise

Author: 万州客 | Published: 2022-05-02 08:52

It really is simple.

1. Code

from sklearn.feature_extraction.text import CountVectorizer
import jieba
from sklearn.datasets import load_files
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfTransformer

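'''
# Disabled demo: bigram bag-of-words on an English and a Chinese sentence.

vect = CountVectorizer(ngram_range=(2, 2))
en = ['The quick brown fox jumps over a lazy dog']
vect.fit(en)
print('Vocabulary size: {}'.format(len(vect.vocabulary_)))
print('Tokens: {}'.format(vect.vocabulary_))

# cn = ['那只敏捷的棕色狐狸跳过了一只懒惰的狗']
# Segment the Chinese sentence with jieba, then join the tokens with spaces
# so that CountVectorizer can tokenize the result.
cn = jieba.cut('懒惰的狐狸不如敏捷的狐狸敏捷,敏捷的狐狸不如懒惰的狐狸懒惰')
cn = [' '.join(cn)]
vect.fit(cn)
print('Vocabulary size: {}'.format(len(vect.vocabulary_)))
print('Tokens: {}'.format(vect.vocabulary_))

bag_of_words = vect.transform(cn)
print('Bag-of-words features: {}'.format(repr(bag_of_words)))
print('Dense bag-of-words array: {}'.format(bag_of_words.toarray()))
'''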
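# Load the training reviews and replace HTML line breaks with spaces.
train_set = load_files('D:/tmp/ImdbLite/train')
X_train, y_train = train_set.data, train_set.target
X_train = [doc.replace(b'<br />', b' ') for doc in X_train]

# Load the test reviews the same way.
test_set = load_files('D:/tmp/ImdbLite/test')
X_test, y_test = test_set.data, test_set.target
X_test = [doc.replace(b'<br />', b' ') for doc in X_test]

# Build the bag-of-words vocabulary from the training set only.
vect = CountVectorizer().fit(X_train)
X_train_vect = vect.transform(X_train)

# Cross-validate a linear SVM on the raw count features.
scores = cross_val_score(LinearSVC(), X_train_vect, y_train)
print('Mean cross-validation score: {:.3f}'.format(scores.mean()))

X_test_vect = vect.transform(X_test)
clf = LinearSVC().fit(X_train_vect, y_train)

# Fit TF-IDF weights on the training counts and apply them to both sets.
# Note: clf above is trained on the raw counts, not on these TF-IDF features.
tfidf = TfidfTransformer(smooth_idf=False)
tfidf.fit(X_train_vect)
X_train_tfidf = tfidf.transform(X_train_vect)
X_test_tfidf = tfidf.transform(X_test_vect)

print('Features before TF-IDF:', X_train_vect[:5, :5].toarray())
print('Features after TF-IDF:', X_train_tfidf[:5, :5].toarray())
print('Test set score: {}'.format(clf.score(X_test_vect, y_test)))

# get_feature_names() is deprecated since scikit-learn 1.0 and removed in 1.2;
# newer versions provide get_feature_names_out() instead.
print('Number of training features: {}'.format(len(vect.get_feature_names())))
print('Last 10 training features: {}'.format(vect.get_feature_names()[-10:]))
print('Number of training documents: {}'.format(len(X_train)))
print('Pick one to look at:', X_train[22])
print('Number of test documents: {}'.format(len(X_test)))
print('Pick one to look at:', X_test[22])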

2. Output

C:\Users\ccc\AppData\Local\Programs\Python\Python38\python.exe D:/Code/Metis-Org/app/service/time_series_detector/algorithm/ai_test.py
Mean cross-validation score: 0.810
C:\Users\ccc\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
Features before TF-IDF: [[0 0 0 0 0]
 [0 0 0 0 0]
 [0 1 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]
Features after TF-IDF: [[0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]
 [0.         0.13862307 0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]]
Test set score: 0.6336633663366337
Number of training features: 3941
Last 10 training features: ['young', 'your', 'yourself', 'yuppie', 'zappa', 'zero', 'zombie', 'zoom', 'zooms', 'zsigmond']
Number of training documents: 100
Pick one to look at: b"All I could think of while watching this movie was B-grade slop. Many have spoken about it's redeeming quality is how this film portrays such a realistic representation of the effects of drugs and an individual and their subsequent spiral into a self perpetuation state of unfortunate events. Yet really, the techniques used (as many have already mentioned) were overused and thus unconvincing and irrelevant to the film as a whole.  As far as the plot is concerned, it was lacklustre, unimaginative, implausible and convoluted. You can read most other reports on this film and they will say pretty much the same as I would.  Granted some of the actors and actresses are attractive but when confronted with such boring action... looks can only carry a film so far. The action is poor and intermittent: a few punches thrown here and there, and a final gunfight towards the end. Nothing really to write home about.  As others have said, 'BAD' movies are great to watch for the very reason that they are 'bad', you revel in that fact. This film, however, is a void. It's nothing.  Furthermore, if one is really in need of an educational movie to scare people away from drug use then I would seriously recommend any number of other movies out there that board such issues in a much more effective way. 'Requiem For A Dream', 'Trainspotting', 'Fear and Loathing in Las Vegas' and 'Candy' are just a few examples. Though one should also check out some more lighthearted films on the same subject like 'Go' (overall, both serious and funny) and 'Halfbaked'.  On a final note, the one possibly redeeming line in this movie, delivered by Vinnie Jones was stolen from 'Lock, Stock and Two Smokling Barrels'. To think that a bit of that great movie has been tainted by 'Loaded' is vile.  Overall, I strongly suggest that you save you money and your time by NOT seeing this movie."
Number of test documents: 202
Pick one to look at: b"Alas, another Costner movie that was an hour too long. Credible performances, but the script had no where to go and was in no hurry to get there. First we are offered an unrelated string of events few of which further the story. Will the script center on Randall and his wife? Randall and Fischer? How about Fischer and Thomas? In the end, no real front story ever develops and the characters themselves are artificially propped up by monologues from third parties. The singer explains Randall, Randall explains Fischer, on and on. Finally, long after you don't care anymore, you will learn something about the script meetings. Three endings were no doubt proffered and no one could make a decision. The end result? All three were used, one, after another, after another. If you can hang in past the 100th yawn, you'll be able to pick them out. Despite the transparent attempt to gain points with a dedication to the Coast Guard, this one should have washed out the very first day."

Process finished with exit code 0
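
To see roughly where TF-IDF values like the ones printed above come from, the sketch below recomputes what TfidfTransformer(smooth_idf=False) does on a tiny corpus. The three toy documents are made up for illustration and are not the ImdbLite data. It relies on the default settings otherwise (raw counts as term frequency, L2 row normalization), under which scikit-learn documents the weight as idf(t) = ln(n / df(t)) + 1.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (an assumption for illustration only).
docs = ['good movie', 'bad movie', 'good good plot']

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer(smooth_idf=False).fit_transform(counts).toarray()

# Manual computation: idf(t) = ln(n / df(t)) + 1, multiply by the raw counts,
# then L2-normalize each document row.
tf = counts.toarray().astype(float)
n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)          # number of documents containing each term
idf = np.log(n_docs / df) + 1.0
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

print('Matches scikit-learn:', np.allclose(tfidf, manual))
```

Because each row is normalized to unit length, a term's weight depends on the other terms in the same document, which is why a raw count of 1 can shrink to a small fraction in a long review.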

