Feature Enhancement

Author: lip136 | Published: 2017-09-22 12:01

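    Feature Extraction

    Purpose: convert digitized signal data and symbolic text into feature vectors.

    1. For data stored as dicts, use DictVectorizer to extract and vectorize the features.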
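    from sklearn.feature_extraction import DictVectorizer

    measurements = [
        {'city': 'Dubai', 'temperature': 33},
        {'city': 'London', 'temperature': 12},
        {'city': 'San Francisco', 'temperature': 18},
    ]
    vec = DictVectorizer()
    # String-valued features are one-hot encoded; numeric features are kept as-is
    print(vec.fit_transform(measurements).toarray())
    # scikit-learn >= 1.0; older releases used vec.get_feature_names()
    print(vec.get_feature_names_out())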
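To make the encoding concrete, here is a minimal pure-Python sketch of the same idea: string values become one-hot "key=value" columns and numeric values are passed through. The helper name `dict_vectorize` is hypothetical and this is not scikit-learn's implementation, only an illustration of the mapping.

```python
def dict_vectorize(records):
    """One-hot encode string values and pass numeric values through."""
    # Feature names: "key=value" for strings, plain "key" for numbers.
    names = sorted({
        f"{k}={v}" if isinstance(v, str) else k
        for rec in records
        for k, v in rec.items()
    })
    index = {name: i for i, name in enumerate(names)}

    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for k, v in rec.items():
            if isinstance(v, str):
                row[index[f"{k}={v}"]] = 1.0   # one-hot column for this category
            else:
                row[index[k]] = float(v)       # numeric value copied unchanged
        rows.append(row)
    return names, rows


measurements = [
    {'city': 'Dubai', 'temperature': 33},
    {'city': 'London', 'temperature': 12},
    {'city': 'San Francisco', 'temperature': 18},
]
names, rows = dict_vectorize(measurements)
print(names)
for row in rows:
    print(row)
```

Each city gets its own 0/1 column, while temperature stays a single numeric column, so the three records become three rows of four values.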
    
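    2. Text data
      (1). Representing text features: the bag-of-words model
      Collect the distinct words into a vocabulary; each training text can then be mapped to a feature vector over that vocabulary.
      (2). Computing the feature values
      CountVectorizer
      Uses the number of times each word occurs in the given training text.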
    from sklearn.datasets import fetch_20newsgroups
    # sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import classification_report

    # Downloads the data on first use
    news = fetch_20newsgroups(subset='all')
    x_train, x_test, y_train, y_test = train_test_split(
        news.data, news.target, test_size=0.25, random_state=33)

    # By default, English stop words are not removed
    count_vec = CountVectorizer()
    # Learn the vocabulary on the training texts only, then vectorize both splits
    x_count_train = count_vec.fit_transform(x_train)
    x_count_test = count_vec.transform(x_test)

    # Initialize the naive Bayes classifier
    mnb_count = MultinomialNB()
    # Estimate its parameters from the training samples
    mnb_count.fit(x_count_train, y_train)

    print('20 newsgroups, naive Bayes with CountVectorizer, accuracy:',
          mnb_count.score(x_count_test, y_test))

    y_count_predict = mnb_count.predict(x_count_test)

    print(classification_report(y_test, y_count_predict,
                                target_names=news.target_names))
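The bag-of-words counting described above can be sketched in a few lines of plain Python. This is only an illustration on a toy corpus: the helper names are hypothetical, and the tokenizer (lowercase, words of two or more characters) is an assumption chosen to resemble common defaults rather than a copy of CountVectorizer.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(corpus):
    # The vocabulary is the sorted set of distinct words in the training texts.
    words = sorted({w for doc in corpus for w in tokenize(doc)})
    return {w: i for i, w in enumerate(words)}


def count_vector(text, vocab):
    # Map one text to a vector of word counts over the fixed vocabulary.
    vec = [0] * len(vocab)
    for word, count in Counter(tokenize(text)).items():
        if word in vocab:            # words unseen in training are ignored
            vec[vocab[word]] = count
    return vec


corpus = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(corpus)
train_vectors = [count_vector(doc, vocab) for doc in corpus]
test_vector = count_vector("the bird sat on the cat", vocab)

print(sorted(vocab, key=vocab.get))
print(train_vectors)
print(test_vector)
```

The vocabulary is built from the training texts only, which is why "bird" in the test sentence contributes nothing; this mirrors calling `fit_transform` on the training split and `transform` on the test split.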
    

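    TfidfVectorizer
    Weights each word not only by its frequency in the current training text but also by how many of the other texts contain it (inverse document frequency), so words that appear throughout the corpus count for less.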

    from sklearn.datasets import fetch_20newsgroups
    # sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import classification_report

    news = fetch_20newsgroups(subset='all')
    x_train, x_test, y_train, y_test = train_test_split(
        news.data, news.target, test_size=0.25, random_state=33)

    # Learn vocabulary and IDF weights on the training texts, then vectorize both splits
    tfidf_vec = TfidfVectorizer()
    x_tfidf_train = tfidf_vec.fit_transform(x_train)
    x_tfidf_test = tfidf_vec.transform(x_test)

    mnb_tfidf = MultinomialNB()
    mnb_tfidf.fit(x_tfidf_train, y_train)

    print('20 newsgroups, naive Bayes with TfidfVectorizer, accuracy:',
          mnb_tfidf.score(x_tfidf_test, y_test))
    y_tfidf_predict = mnb_tfidf.predict(x_tfidf_test)
    print(classification_report(y_test, y_tfidf_predict,
                                target_names=news.target_names))
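As a rough illustration of how TF-IDF uses the other texts, the sketch below computes the weights by hand on a toy corpus. It assumes the smoothed form idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row; the function name `tfidf_vectors` and the corpus are hypothetical, and the code is a sketch under those assumptions, not the library's implementation.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    n_docs = len(tokenized_docs)
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    # Document frequency: in how many texts does each word occur?
    df = Counter(w for doc in tokenized_docs for w in set(doc))
    # Assumed smoothed IDF: words found in every text get the minimum weight 1.0.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab}

    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)                          # raw term counts in this text
        vec = [tf[w] * idf[w] for w in vocab]      # term frequency times IDF
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])    # L2-normalize each row
    return vocab, idf, vectors


docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the bird flew".split(),
]
vocab, idf, vectors = tfidf_vectors(docs)

for word in ("the", "sat", "dog"):
    print(word, round(idf[word], 3))
```

In this toy corpus "the" occurs in every text and gets the lowest IDF, "sat" occurs in two texts, and "dog" in only one, so within the second text "dog" carries the largest weight even though all three words occur once there.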
    

    When the amount of training text is large, using TF-IDF weights can help improve model performance.
