Key concepts
- Inverse document frequency (IDF): a per-term weight that decreases as the term appears in more documents of the corpus.
  IDF = log(total number of documents / (number of documents containing the term + 1))
- TF-IDF: a score for judging whether a token is a keyword; the larger the value, the more likely the token is a keyword (see the worked example after this list).
  TF-IDF = TF * IDF
  TF: term frequency, i.e. the number of times the token occurs in the document
- Document vectorization: suppose there are m documents d1, d2, ..., dm, and segmenting them yields n distinct tokens w1, w2, ..., wn. Let Fij be the number of times token wj appears in document di; the corpus can then be represented as the m x n matrix [Fij].
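A worked toy example of these formulas (a minimal sketch; the three-document corpus is made up, and log2 is used to match the IDF code later in this article):

import numpy

# Toy corpus: three already-segmented "documents"
docs = [
    ["苹果", "苹果", "香蕉"],
    ["香蕉", "橙子"],
    ["香蕉", "橙子", "橙子"],
]

word = "苹果"
tf = docs[0].count(word)                  # term frequency in document 0: 2
df = sum(word in doc for doc in docs)     # documents containing the word: 1
idf = numpy.log2(len(docs) / (df + 1.0))  # log2(3 / 2) ≈ 0.585
tfidf = tf * idf                          # ≈ 1.17
print(tf, df, idf, tfidf)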
Notes
- zhPattern = re.compile(u'[\u4e00-\u9fa5]+') matches Chinese tokens (the CJK Unified Ideographs range)
- IDF calculation:
  def handler(x):
      return numpy.log2(len(corpos) / (numpy.sum(x > 0) + 1))
  IDF = TF.apply(handler)
- Cross-tabulation (pivot table) function, sketched right after this list:
  pivot_table(values, index, columns, aggfunc, fill_value)
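A minimal sketch of how pivot_table turns the long (document, token, count) table into a document-term matrix; the rows here are made up, and the column names mirror the segStat table built in the example code below:

import pandas

segStat = pandas.DataFrame({
    'filePath': ['a.txt', 'a.txt', 'b.txt'],
    'segment': ['苹果', '香蕉', '苹果'],
    '计数': [2, 1, 3]
})

# Rows = documents, columns = tokens, values = counts; missing pairs become 0
TF = segStat.pivot_table(
    index='filePath',
    columns='segment',
    values='计数',
    aggfunc='sum',
    fill_value=0
)
print(TF)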
Example code
# -*- coding: utf-8 -*-
import numpy
# Build the corpus: read every file under the sample directory
import os
import os.path
import codecs
filePaths = []
fileContents = []
for root, dirs, files in os.walk(
    "D:\\PDM\\2.7\\SogouC.mini\\Sample"
):
    for name in files:
        filePath = os.path.join(root, name)
        filePaths.append(filePath)
        f = codecs.open(filePath, 'r', 'utf-8')
        fileContent = f.read()
        f.close()
        fileContents.append(fileContent)
import pandas
corpos = pandas.DataFrame({
    'filePath': filePaths,
    'fileContent': fileContents
})
import re
# Pattern that matches Chinese tokens
zhPattern = re.compile(u'[\u4e00-\u9fa5]+')
import jieba
segments = []
filePaths = []
# Segment every document with jieba, keeping only Chinese tokens
for index, row in corpos.iterrows():
    filePath = row['filePath']
    fileContent = row['fileContent']
    segs = jieba.cut(fileContent)
    for seg in segs:
        if zhPattern.search(seg):
            segments.append(seg)
            filePaths.append(filePath)
segmentDF = pandas.DataFrame({
    'filePath': filePaths,
    'segment': segments
})
# Remove stop words
stopwords = pandas.read_csv(
    "D:\\PDM\\2.7\\StopwordsCN.txt",
    encoding='utf8',
    index_col=False,
    quoting=3,
    sep="\t"
)
segmentDF = segmentDF[
    ~segmentDF.segment.isin(
        stopwords.stopword
    )
]
# Count term frequencies per document
segStat = segmentDF.groupby(
    by=["filePath", "segment"]
).size().reset_index(name="计数").sort_values(
    by="计数",
    ascending=False
)
# Drop low-frequency tokens
segStat = segStat[segStat.计数 > 1]
# Build the document-term (TF) matrix
TF = segStat.pivot_table(
    index='filePath',
    columns='segment',
    values='计数',
    fill_value=0
)
TF.index      # document paths
TF.columns    # tokens
# IDF = log2(total number of documents / (documents containing the token + 1))
def handler(x):
    return numpy.log2(len(corpos) / (numpy.sum(x > 0) + 1))
IDF = TF.apply(handler)
TF_IDF = pandas.DataFrame(TF*IDF)
tag1s = []
tag2s = []
tag3s = []
tag4s = []
tag5s = []
# For each document, take the five tokens with the highest TF-IDF scores
for filePath in TF_IDF.index:
    tagis = TF_IDF.loc[filePath].sort_values(
        ascending=False
    )[:5].index
    tag1s.append(tagis[0])
    tag2s.append(tagis[1])
    tag3s.append(tagis[2])
    tag4s.append(tagis[3])
    tag5s.append(tagis[4])
tagDF = pandas.DataFrame({
    'filePath': corpos.filePath,
    'fileContent': corpos.fileContent,
    'tag1': tag1s,
    'tag2': tag2s,
    'tag3': tag3s,
    'tag4': tag4s,
    'tag5': tag5s
})
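A quick check (a usage sketch that assumes the script above has run and reuses its TF_IDF and tagDF variables): the top five keywords of a single document can also be read off directly with pandas' nlargest:

# Five highest-scoring tokens of the first document
firstPath = TF_IDF.index[0]
print(TF_IDF.loc[firstPath].nlargest(5))
# tagDF collects the same information, one row per document
print(tagDF.head())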
Document vectorization with sklearn
For background on data mining with sklearn, see the article 使用sklearn进行数据挖掘 (Data Mining with sklearn).
Concepts used in this example
- Document vectorization:
  sklearn.feature_extraction.text.CountVectorizer
- TF-IDF computation:
  sklearn.feature_extraction.text.TfidfTransformer
Example code
#!/usr/bin/env python
# coding=utf-8
contents = [
'我 是 中国人。',
'你 是 美国人。',
'你 叫 什么 名字?',
'她 是 谁 啊?'
]
from sklearn.feature_extraction.text import CountVectorizer
# Bag-of-words counts with the default settings (single-character tokens are dropped)
countVectorizer = CountVectorizer()
textVector = countVectorizer.fit_transform(contents)
textVector.todense()           # dense document-term count matrix
countVectorizer.vocabulary_    # token -> column index mapping
# Refit with min_df=0 and a token pattern that also keeps single-character tokens
countVectorizer = CountVectorizer(
    min_df=0,
    token_pattern=r"\b\w+\b"
)
textVector = countVectorizer.fit_transform(contents)
textVector.todense()
countVectorizer.vocabulary_
from sklearn.feature_extraction.text import TfidfTransformer
# Turn the count matrix into TF-IDF weights
transformer = TfidfTransformer()
tfid = transformer.fit_transform(textVector)
import pandas as pd
TFIDFDataFrame = pd.DataFrame(tfid.toarray())
TFIDFDataFrame.columns = countVectorizer.get_feature_names()
import numpy as np
# Column indices of the two largest TF-IDF values in each row...
TFIDFSorted = np.argsort(tfid.toarray(), axis=1)[:, -2:]
# ...mapped back to the corresponding vocabulary terms
TFIDFDataFrame.columns[TFIDFSorted].values
# print (TFIDFDataFrame)
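As a side note (a sketch, not part of the original example): sklearn also provides TfidfVectorizer, which combines CountVectorizer and TfidfTransformer in one step; with the same token_pattern it should produce an equivalent weighting:

from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenize, count and apply TF-IDF weighting with a single object
tfidfVectorizer = TfidfVectorizer(token_pattern=r"\b\w+\b")
tfidfMatrix = tfidfVectorizer.fit_transform(contents)
print(tfidfMatrix.toarray())
print(tfidfVectorizer.vocabulary_)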