A Simple NLP Code Exercise


Author: 万州客 | Published: 2022-05-02 08:52

    It really is simple.

    1. Code

    from sklearn.feature_extraction.text import CountVectorizer
    import jieba
    from sklearn.datasets import load_files
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score
    from sklearn.feature_extraction.text import TfidfTransformer
    
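    # Disabled demo, kept inside a string literal: bigram bag-of-words for an
    # English sentence and for a Chinese sentence tokenized with jieba.
    '''
    
    vect = CountVectorizer(ngram_range=(2, 2))
    en = ['The quick brown fox jumps over a lazy dog']
    vect.fit(en)
    print('Vocabulary size: {}'.format(len(vect.vocabulary_)))
    print('Vocabulary: {}'.format(vect.vocabulary_))
    
    # cn = ['那只敏捷的棕色狐狸跳过了一只懒惰的狗']
    # jieba.cut returns a generator of tokens; joining them with spaces lets
    # CountVectorizer split the Chinese text on whitespace.
    cn = jieba.cut('懒惰的狐狸不如敏捷的狐狸敏捷,敏捷的狐狸不如懒惰的狐狸懒惰')
    cn = [' '.join(cn)]
    vect.fit(cn)
    print('Vocabulary size: {}'.format(len(vect.vocabulary_)))
    print('Vocabulary: {}'.format(vect.vocabulary_))
    
    bag_of_words = vect.transform(cn)
    print('Bag-of-words features: {}'.format(repr(bag_of_words)))
    print('Dense bag-of-words array: {}'.format(bag_of_words.toarray()))
    '''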
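    # Load the training reviews; load_files returns each document as bytes.
    train_set = load_files('D:/tmp/ImdbLite/train')
    X_train, y_train = train_set.data, train_set.target
    # Replace the HTML line breaks embedded in the reviews with spaces.
    X_train = [doc.replace(b'<br />', b' ') for doc in X_train]
    
    test_set = load_files('D:/tmp/ImdbLite/test')
    X_test, y_test = test_set.data, test_set.target
    X_test = [doc.replace(b'<br />', b' ') for doc in X_test]
    
    # Build the vocabulary from the training set only.
    vect = CountVectorizer().fit(X_train)
    X_train_vect = vect.transform(X_train)
    
    # Cross-validate a linear SVM on the raw count features.
    scores = cross_val_score(LinearSVC(), X_train_vect, y_train)
    print('Mean cross-validation score: {:.3f}'.format(scores.mean()))
    
    X_test_vect = vect.transform(X_test)
    clf = LinearSVC().fit(X_train_vect, y_train)
    
    # Reweight the counts with TF-IDF. Note that clf above is trained on the
    # raw counts, so these TF-IDF features are only printed for comparison.
    tfidf = TfidfTransformer(smooth_idf=False)
    tfidf.fit(X_train_vect)
    X_train_tfidf = tfidf.transform(X_train_vect)
    X_test_tfidf = tfidf.transform(X_test_vect)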
    
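    print('Features before TF-IDF:', X_train_vect[:5, :5].toarray())
    print('Features after TF-IDF:', X_train_tfidf[:5, :5].toarray())
    print('Test set score: {}'.format(clf.score(X_test_vect, y_test)))
    
    # get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2
    # (the recorded run still used it, hence the FutureWarning in the output).
    # get_feature_names_out() returns a NumPy array, so convert it to a list.
    feature_names = vect.get_feature_names_out().tolist()
    print('Number of training features: {}'.format(len(feature_names)))
    print('Last 10 training features: {}'.format(feature_names[-10:]))
    print('Number of training documents: {}'.format(len(X_train)))
    print('A sample document:', X_train[22])
    print('Number of test documents: {}'.format(len(X_test)))
    print('A sample document:', X_test[22])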
    
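The disabled bigram demo in the listing above can be tried on its own. The following is a minimal sketch using only the English sentence from that block; the variable names `bigrams` and `bag` are introduced here for illustration. It shows that `ngram_range=(2, 2)` keeps only pairs of adjacent tokens, and that the default token pattern drops the single-letter word "a" before the pairs are formed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Bigram-only vectorizer: ngram_range=(2, 2) keeps pairs of adjacent tokens.
vect = CountVectorizer(ngram_range=(2, 2))
en = ['The quick brown fox jumps over a lazy dog']
vect.fit(en)

# The default token pattern keeps tokens of two or more characters, so "a"
# is discarded and "over lazy" becomes a bigram.
bigrams = sorted(vect.vocabulary_)
print(len(bigrams))  # 7
print(bigrams)

# Each of the 7 bigrams occurs exactly once in the sentence.
bag = vect.transform(en)
print(bag.toarray())  # [[1 1 1 1 1 1 1]]
```

For Chinese text the same vectorizer only works after the sentence has been segmented and joined with spaces, which is what the `jieba.cut` step in the listing does.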

    2. Output

    C:\Users\ccc\AppData\Local\Programs\Python\Python38\python.exe D:/Code/Metis-Org/app/service/time_series_detector/algorithm/ai_test.py
    Mean cross-validation score: 0.810
    C:\Users\ccc\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
      warnings.warn(msg, category=FutureWarning)
    Features before TF-IDF: [[0 0 0 0 0]
     [0 0 0 0 0]
     [0 1 0 0 0]
     [0 0 0 0 0]
     [0 0 0 0 0]]
    Features after TF-IDF: [[0.         0.         0.         0.         0.        ]
     [0.         0.         0.         0.         0.        ]
     [0.         0.13862307 0.         0.         0.        ]
     [0.         0.         0.         0.         0.        ]
     [0.         0.         0.         0.         0.        ]]
    Test set score: 0.6336633663366337
    Number of training features: 3941
    Last 10 training features: ['young', 'your', 'yourself', 'yuppie', 'zappa', 'zero', 'zombie', 'zoom', 'zooms', 'zsigmond']
    Number of training documents: 100
    A sample document: b"All I could think of while watching this movie was B-grade slop. Many have spoken about it's redeeming quality is how this film portrays such a realistic representation of the effects of drugs and an individual and their subsequent spiral into a self perpetuation state of unfortunate events. Yet really, the techniques used (as many have already mentioned) were overused and thus unconvincing and irrelevant to the film as a whole.  As far as the plot is concerned, it was lacklustre, unimaginative, implausible and convoluted. You can read most other reports on this film and they will say pretty much the same as I would.  Granted some of the actors and actresses are attractive but when confronted with such boring action... looks can only carry a film so far. The action is poor and intermittent: a few punches thrown here and there, and a final gunfight towards the end. Nothing really to write home about.  As others have said, 'BAD' movies are great to watch for the very reason that they are 'bad', you revel in that fact. This film, however, is a void. It's nothing.  Furthermore, if one is really in need of an educational movie to scare people away from drug use then I would seriously recommend any number of other movies out there that board such issues in a much more effective way. 'Requiem For A Dream', 'Trainspotting', 'Fear and Loathing in Las Vegas' and 'Candy' are just a few examples. Though one should also check out some more lighthearted films on the same subject like 'Go' (overall, both serious and funny) and 'Halfbaked'.  On a final note, the one possibly redeeming line in this movie, delivered by Vinnie Jones was stolen from 'Lock, Stock and Two Smokling Barrels'. To think that a bit of that great movie has been tainted by 'Loaded' is vile.  Overall, I strongly suggest that you save you money and your time by NOT seeing this movie."
    Number of test documents: 202
    A sample document: b"Alas, another Costner movie that was an hour too long. Credible performances, but the script had no where to go and was in no hurry to get there. First we are offered an unrelated string of events few of which further the story. Will the script center on Randall and his wife? Randall and Fischer? How about Fischer and Thomas? In the end, no real front story ever develops and the characters themselves are artificially propped up by monologues from third parties. The singer explains Randall, Randall explains Fischer, on and on. Finally, long after you don't care anymore, you will learn something about the script meetings. Three endings were no doubt proffered and no one could make a decision. The end result? All three were used, one, after another, after another. If you can hang in past the 100th yawn, you'll be able to pick them out. Despite the transparent attempt to gain points with a dedication to the Coast Guard, this one should have washed out the very first day."
    
    Process finished with exit code 0
    
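In the output above, a raw count of 1 turns into a fractional value such as 0.13862307 after the TF-IDF step. The sketch below reproduces that weighting by hand on a small made-up corpus (`docs` is hypothetical and not part of the ImdbLite data; `manual` and `sk` are names introduced here). It assumes the settings used in the script: `smooth_idf=False`, with scikit-learn's default L2 row normalization, so that idf = ln(n / df) + 1.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, used only to illustrate the arithmetic.
docs = ['good movie', 'bad movie', 'good good plot']
counts = CountVectorizer().fit_transform(docs)
tf = counts.toarray().astype(float)

n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)           # number of documents containing each term
idf = np.log(n_docs / df) + 1.0     # unsmoothed IDF (smooth_idf=False)
raw = tf * idf                      # term frequency times IDF
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # L2-normalize rows

# The same transformation done by scikit-learn.
sk = TfidfTransformer(smooth_idf=False).fit_transform(counts).toarray()
print(np.allclose(manual, sk))  # True
```

Because every row is scaled to unit length, a count of 1 in a long review ends up as a small value, while the same count in a short document would be much larger.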
    
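The script computes `X_train_tfidf` and `X_test_tfidf` but trains and scores `clf` on the raw counts, so the reported scores do not reflect the TF-IDF step. One possible way to actually train on TF-IDF features is a pipeline, sketched below. This is an illustration rather than the author's method: the tiny `train_docs` and `train_labels` lists are hypothetical stand-ins for the ImdbLite reviews, and `TfidfVectorizer` is used in place of `CountVectorizer` followed by `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy reviews standing in for the ImdbLite data (0 = negative, 1 = positive).
train_docs = [
    'a dull and boring film',
    'boring plot and bad acting',
    'a wonderful and moving film',
    'great acting and a wonderful story',
]
train_labels = [0, 0, 1, 1]

# TfidfVectorizer combines counting and TF-IDF weighting in one step;
# smooth_idf=False mirrors the setting used in the script.
pipe = make_pipeline(TfidfVectorizer(smooth_idf=False), LinearSVC())
pipe.fit(train_docs, train_labels)

# The pipeline applies the fitted vocabulary and IDF weights to new text.
pred = pipe.predict(['a boring film', 'a wonderful story'])
print(pred)
```

With the real data, the same pipeline object could be passed to `cross_val_score` in place of `LinearSVC()`, which also keeps the vocabulary and IDF weights from being fitted on the validation folds.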
