The training data is 2.5 million travel articles.
After word segmentation, the content comes to roughly 20 GB.
Training uses word2vec with dimension 200 and window size 15.
There are 5,330,282 words with frequency >= 5.
import gensim
import multiprocessing
import os
from time import time

class MySentences(object):
    """Stream space-separated tokens from every file under a directory."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for root, dirs, files in os.walk(self.dirname):
            for filename in files:
                file_path = os.path.join(root, filename)
                with open(file_path) as f:
                    for line in f:
                        try:
                            sline = line.strip()
                            if sline == "":
                                continue
                            ###rline = cleanhtml(sline)
                            # files already hold space-separated tokens
                            yield sline.split()
                        except Exception:
                            print("catch exception")
                            continue
if __name__ == '__main__':
    begin = time()
    sentences = MySentences("traindata")
    # gensim 3.x API; in gensim 4.x `size` became `vector_size`
    model = gensim.models.Word2Vec(sentences, size=200, window=15, min_count=10,
                                   workers=multiprocessing.cpu_count())
    model.save("model/word2vec_gensim")
    model.wv.save_word2vec_format("model/word2vec_org",
                                  "model/vocabulary",
                                  binary=False)
    end = time()
    print("Total processing time: %d seconds" % (end - begin))
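As a quick sanity check on the trained vectors (a minimal sketch; the probe words here are just examples, not from the post), the saved model can be reloaded and queried for nearest neighbors:

import gensim

model = gensim.models.Word2Vec.load("model/word2vec_gensim")
# nearest neighbors in the embedding space; '酒店' is an arbitrary probe word
print(model.wv.most_similar("酒店", topn=10))
# cosine similarity between two words
print(model.wv.similarity("酒店", "住宿"))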
Here is a word cloud of the top 500 words, before any filtering.
The most frequent token is actually the comma ',', at over 600 million occurrences,
but '的' looks bigger thanks to the visual bulk of the character.
Word cloud No. 1
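The post doesn't show how the clouds were drawn; here is a minimal sketch using the wordcloud package (an assumption on my part, not necessarily the tool behind these figures), reusing the MySentences iterator defined above:

from collections import Counter
from wordcloud import WordCloud

word_counts = Counter()
for words in MySentences("traindata"):   # iterator defined above
    word_counts.update(words)

top500 = dict(word_counts.most_common(500))
# font_path must point to a font with CJK glyphs, or Chinese words render as boxes
wc = WordCloud(font_path="simhei.ttf", width=800, height=600)
wc.generate_from_frequencies(top500)
wc.to_file("wordcloud_1.png")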
Of course, the cloud above means very little;
it's just a first look. The head of the distribution carries almost no signal, so let's do a first round of filtering.
After dropping single characters, 5,318,898 words remain.
Not a huge reduction, but the head looks much better:
meaningful words start to appear,
such as 日本, 台湾, 北京, 酒店, 住宿.
500 words are a bit crowded in this frame,
but it already tells us something about how people travel.
For instance, the top-ranked word is '我们' (we),
which suggests most people travel with companions.
The highest-ranked travel-specific word is '酒店' (hotel),
so where to stay seems to be what people care about most.
Next comes '司机' (driver), suggesting that riding in a car is the most common way to get around.
And surprisingly, the first destination name to show up is '日本' (Japan).
Word cloud No. 2
Enough commentary; let's keep trimming.
Filter out punctuation,
filter out pure numbers,
filter out stopwords,
and filter out words longer than 15 characters.
That leaves 5,050,467 words.
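A minimal sketch of these filters, which also covers the earlier single-character cut (the stopword file name and the `vocab` word list are assumptions):

import re

with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = set(line.strip() for line in f)

def keep(word):
    if len(word) < 2 or len(word) > 15:    # single characters and overlong tokens
        return False
    if word.isdigit():                      # pure numbers
        return False
    if re.fullmatch(r"[\W_]+", word):       # punctuation-only tokens
        return False
    if word in stopwords:                   # stopword list
        return False
    return True

vocab = [w for w in vocab if keep(w)]      # vocab: list of candidate words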
Word cloud No. 3
Clearly, plenty of meaningless words remain:
words like 离开 (leave), 网上 (online), 门口 (doorway), 东西 (stuff),
and words like 差不多 (more or less), 算是 (counts as), 确定 (confirm).
So let's try filtering by part of speech.
POS tagging results
Tagging yields 39 POS categories in total; the figure above is sorted by file size,
and the figure below shows the word count per POS.
POS counts per tag
Based on my manual inspection,
the following were all dropped: e (interjections), Dg (adverbial morphemes), then z, w, r, j, l, t, a, b, c, u, ad, Ag, d, f, an, Tg, Ng, vd, Vg, Bg, p, q, k as meaningless, and o, y (modal particles and the like).
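A hedged sketch of this POS filter, assuming tagging with jieba.posseg (jieba's flags are lowercase but closely mirror the PKU-style tags listed above; the actual tagger used in the post is not named):

import jieba.posseg as pseg

# tags to reject, lowercased for comparison with jieba's flags
REJECT_TAGS = {t.lower() for t in (
    "e", "Dg", "z", "w", "r", "j", "l", "t", "a", "b", "c", "u", "ad", "Ag",
    "d", "f", "an", "Tg", "Ng", "vd", "Vg", "Bg", "p", "q", "k", "o", "y")}

def pos_filter(words):
    kept = []
    for word in words:
        pairs = list(pseg.cut(word))        # tag the candidate word
        if pairs and pairs[0].flag.lower() not in REJECT_TAGS:
            kept.append(word)
    return kept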
After that manual POS pruning, the remaining vocabulary is
27,351,268 entries.
Looking at the long tail, there are still a lot of words there,
mostly ones that appear in only a single article, oddities produced by segmentation errors,
or genuinely rare words; these contribute essentially nothing.
So I cut everything with frequency below 50,
leaving 539,231 words.
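The frequency cut can be read straight off the trained model; a sketch using the gensim 3.x vocabulary API (assuming `model` is the Word2Vec model loaded above and `vocab` the assumed word list):

# keep only words the model saw at least 50 times
vocab = [w for w in vocab
         if w in model.wv.vocab and model.wv.vocab[w].count >= 50]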
Word cloud No. 4
The head barely changes anymore,
so let's look at the tail instead.
Word cloud No. 5
Clearly some meaningless junk is still left,
such as '& quot;',
so further cleaning is needed.
I removed every entry starting with a special character,
and every entry starting with a digit or an English letter.
That probably loses a little,
but it should be acceptable
for a platform whose users are native Chinese speakers.
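One way to approximate this cleanup is to keep only entries whose first character is a CJK ideograph, which drops special-character, digit, and English prefixes in a single pass (a rough sketch, slightly stricter than the rule as stated):

import re

def is_clean(word):
    # first character must be a CJK ideograph
    return re.match(r"[\u4e00-\u9fff]", word) is not None

vocab = [w for w in vocab if is_clean(w)]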
Word cloud No. 6
This cloud samples 500 words from the middle of the distribution.
A few still look like segmentation mistakes,
but at this point it hardly matters.
487,560 words remain,
so we're about ready for the next step:
a few comparison tests on the vocabulary.
1. the full vocabulary: 487,560 words
2. the vocabulary minus region and POI terms: 443,982 words
The next three are built on the 5-million-word vocabulary:
3. region terms only (countries, provinces, cities, counties): 12,894 words
4. POI terms only (a POI being a concrete place): 88,723 words
5. region and POI terms together, i.e. the union of 3 and 4
The last three are mainly meant to yield a clustering over regions and POIs; I won't go into them further here.
For now we train on the full vocabulary.
With 1,000 clusters and a maximum of 100 iterations, the KMeans code is as follows:
import gensim
from sklearn.cluster import KMeans
from sklearn.externals import joblib  # in newer scikit-learn: import joblib
from time import time

def load_model():
    model = gensim.models.Word2Vec.load('../word2vec/model/word2vec_gensim')
    return model

def load_filterword():
    # voc.txt holds the filtered vocabulary, one word per line
    filterword = []
    with open("voc.txt", "r") as fd:
        for line in fd:
            filterword.append(line.strip())
    return filterword

if __name__ == "__main__":
    start = time()
    model = load_model()
    filterword = load_filterword()
    print("load filterword")
    wordvector = []
    filterkey = {}
    for word in filterword:
        wordvector.append(model[word])
        filterkey[word] = model[word]
    # n_jobs existed in older scikit-learn; it has since been removed
    clf = KMeans(n_clusters=1000, max_iter=100, n_jobs=10)
    s = clf.fit_predict(wordvector)
    joblib.dump(clf, "kmeans_model.pkl")
How many clusters to use in the end takes several rounds of testing and iteration, plus curation by the operations team; I won't go into how to pick the number here and will just move on.
Once the clustering has run, we compute each word's distance to its cluster center; the code is as follows:
import gensim
import numpy
from sklearn.externals import joblib  # in newer scikit-learn: import joblib
from time import time

def calEuclideanDistance(vec1, vec2):
    dist = numpy.sqrt(numpy.sum(numpy.square(vec1 - vec2)))
    return dist

if __name__ == "__main__":
    model = gensim.models.Word2Vec.load('/mfw_data/algo/hexialong/code/word2vec/model/word2vec_gensim')
    start = time()
    clf = joblib.load("kmeans_model/kmeans_model.pkl")
    print("load model")
    labels = clf.labels_
    print(labels)
    centers = clf.cluster_centers_
    # read the vocabulary in the same order used for clustering
    keys = []
    with open("voc.txt", "r") as fp:
        for line in fp:
            keys.append(line.strip())
    labellist = labels.tolist()
    num = 0
    dists = {}
    labelss = {}
    print(len(labellist))
    print(len(centers))
    for key in keys:
        vector = model[key]
        center = centers[labellist[num]]
        v1 = numpy.array(vector)
        v2 = numpy.array(center)
        dist = calEuclideanDistance(v1, v2)
        dists[key] = dist
        labelss[key] = labellist[num]
        num = num + 1
    # write word <tab> cluster id <tab> distance to center
    fp = open("1000c_result", 'w')
    for key in keys:
        fp.write(key + "\t" + str(labelss[key]) + "\t" + str(dists[key]) + "\n")
    fp.close()
    end = time()
    print("use time")
    print(end - start)
This yields a result set; here's an arbitrary excerpt:
躲避 823 35.07538150098018
摇晃 140 36.60522929361628
崖壁 376 23.22836102338548
无力 805 30.657890351709458
龙头路 884 37.903673395687484
猜测 466 27.275837201838378
长椅 676 33.87635681753719
小妞 997 17.82934232233481
小街 315 20.34906805205499
细细品味 974 33.128149439823794
主楼 154 31.142244843582723
定义 612 37.68801087539801
乌龙寨 62 30.640911350466737
最先 1003 31.885286255996636
来来回回 1144 36.25398384183349
缩影 1027 31.153551423048373
山景 137 25.425809239874074
大阪城 975 27.961891172956033
母女 689 21.706070706906047
叫声 275 33.4757087831385
走动 108 37.69422496550644
热点 169 40.37795870338444
真人 232 30.07144736130747
宰客 672 38.89092352995517
燃烧 419 27.574033962685565
开胃 1121 32.80393048822961
闪耀 494 25.089004413299243
云集 175 37.36147179956325
白河 366 43.03368242374299
摸摸 361 36.31363986732377
免费参观 9 31.104615607714344
地盘 239 33.744949186776694
中华民族 906 32.31989998815137
吧台 676 32.87061575864565
规范 895 37.69395812138509
高于 609 37.97722569832258
奇峰 107 29.24820251169711
麦田 238 32.273416467434785
剧情 232 33.91195672348161
马迭尔 3 32.157627064591544
世俗 521 29.173518246131916
威武 483 28.9615555312789
事宜 685 39.77431007044596
车行 471 27.79744221726671
Next we normalize the clustering result. Note that this part is written in Java: my main program is in Java, so all the remaining steps are Java too. Within each cluster, the normalization below maps a word at distance d from the center to the score d_min/d, where d_min is the cluster's smallest distance, so the word closest to the center gets weight 1.0.
// the Topic data structure
public class Topic {
    // private String name;
    private String id;
    private double weight;
    private String keyword;

    public Topic(String id, double weight, String keyword) {
        // this.name = "Default";
        this.id = id;
        this.weight = weight;
        this.keyword = keyword;
    }

    public Topic(String id, double weight) {
        // this.name = "Default";
        this.id = id;
        this.weight = weight;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public double getWeight() {
        return weight;
    }

    public void setWeight(double weight) {
        this.weight = weight;
    }

    public String getKeyword() {
        return keyword;
    }

    public void setKeyword(String keyword) {
        this.keyword = keyword;
    }

    @Override
    public String toString() {
        return String.format("%s:%.4f:%s", this.id, this.weight, this.keyword);
    }
}
// method that loads the result file into memory;
// expects comma-separated lines (word,cluster,distance), so the tab-separated
// Python output above needs its tabs converted to commas first.
// `path` and `topicmodel` are fields of the enclosing manager class.
public static void loadLocalTopic() {
    try {
        FileReader fr = new FileReader(path);
        BufferedReader br = new BufferedReader(fr);
        String line = null;
        while ((line = br.readLine()) != null) {
            String[] linearr = line.split(",");
            if (linearr.length != 3)
                continue;
            System.out.println(linearr[0] + "\t" + linearr[1] + "\t" + linearr[2]);
            Topic topic = new Topic(linearr[1], Double.valueOf(linearr[2]), linearr[0]);
            if (topicmodel.containsKey(topic.getId())) {
                topicmodel.get(topic.getId()).add(topic);
            } else {
                Set<Topic> topicSet = new HashSet<>();
                topicSet.add(topic);
                topicmodel.put(topic.getId(), topicSet);
            }
        }
        br.close();
        fr.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
package algorithm.topic;

import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;
import java.util.Set;

// normalize the raw distances into per-cluster weights
public class normalizeTopicModel {

    public void normal() throws IOException {
        TopicManager.loadLocalTopic();
        FileWriter fw = new FileWriter("Documents/1000c_result", true);
        for (Entry<String, Set<Topic>> entry : TopicManager.topicmodel.entrySet()) {
            String topicid = entry.getKey();
            Set<Topic> topicSet = entry.getValue();
            List<Topic> topicList = new ArrayList<Topic>();
            topicList.addAll(topicSet);
            // sort ascending by distance (bubble sort), closest word first
            for (int i = 0; i < topicList.size(); i++) {
                for (int j = i + 1; j < topicList.size(); j++) {
                    if (compare(topicList.get(i), topicList.get(j))) {
                        Topic temTopic = topicList.get(i);
                        topicList.set(i, topicList.get(j));
                        topicList.set(j, temTopic);
                    }
                }
            }
            // max holds 1/d_min, the inverse of the cluster's smallest distance
            double max = topicList.get(0).getWeight();
            if (max == 0) {
                max = 1;
            } else {
                max = 1 / max;
            }
            for (int i = 0; i < topicList.size(); i++) {
                Topic temTopic = topicList.get(i);
                if (temTopic.getWeight() == 0) {
                    temTopic.setWeight(1);
                } else {
                    // score = (1/d) / (1/d_min) = d_min/d, capped at 1
                    double score = 1 / temTopic.getWeight() / max;
                    if (score > 1)
                        score = 1;
                    temTopic.setWeight(score);
                }
                topicList.set(i, temTopic);
                fw.write(topicList.get(i).getKeyword() + "," + topicList.get(i).getId()
                        + "," + String.valueOf(topicList.get(i).getWeight()) + "\n");
                System.out.println(topicList.get(i).getWeight());
            }
        }
        fw.flush();
        fw.close();
    }

    // true if t1 is at least as far from the center as t2
    public boolean compare(Topic t1, Topic t2) {
        return t1.getWeight() >= t2.getWeight();
    }

    public static void main(String[] args) throws IOException {
        normalizeTopicModel norTopic = new normalizeTopicModel();
        norTopic.normal();
    }
}
Now let's look at the results. I picked a cluster that came out reasonably well, the one numbered 99. Its center word is '锐化' (sharpen). It's obviously a photography-related topic, and a pretty cohesive one at that.
锐化,99,1.0
暗角,99,0.994166923709541
焦外,99,0.9563906404235352
曝光度,99,0.887396937885799
曝光值,99,0.8817494960102079
偏色,99,0.8670448689624797
滤光镜,99,0.8521734004249849
色阶,99,0.8413752299262939
负片,99,0.8379294006879198
布光,99,0.7861324411282584
高感光度,99,0.7767953538285236
侧逆光,99,0.7464948101323751
眩光,99,0.7444798661565253
偏蓝,99,0.7430276393958444
曝光量,99,0.7345760940444289
锐度,99,0.7272990277275961
移轴,99,0.7253110404519592
反光板,99,0.7215269135780907
暗部,99,0.7208636735199092
调亮,99,0.7207182417276562
校正,99,0.7160718069956977
调光,99,0.7117975800706153
宽容度,99,0.6911525585962236
光效,99,0.6845066039370883
长曝,99,0.650904183083746
畸变,99,0.6466897137567535
光圈值,99,0.6445456196337443
光量,99,0.6242530652823783
侧光,99,0.6229659367270823
偏光镜,99,0.6127629588942928
小光圈,99,0.606456397878928
镀膜,99,0.6022481835019297
人眼,99,0.6014777976918868
顶光,99,0.5908897588986189
弱光,99,0.5705607302698816
高光,99,0.5700382030056783
取景框,99,0.5685954417962578
影调,99,0.5595033799120289
遮光罩,99,0.5563819297539387
仰角,99,0.5467289109387645
镜像,99,0.5388347533266572
边框,99,0.5355165571315302
接片,99,0.5303093604095706
纵深感,99,0.5288313086927584
感光,99,0.52876529723293
焦环,99,0.52747907056666
后期制作,99,0.5184563208017946
微单m,99,0.5128535233095775
过曝,99,0.5127220944867199
白平衡,99,0.5124379791332985
偏暗,99,0.5110589723626944
色差,99,0.496526819365721
变镜,99,0.487022341500059
光比,99,0.48184538386941067
失真,99,0.47753989213368614
成像,99,0.4769627810129018
合成,99,0.47279976084701547
偏振镜,99,0.4548917023656295
裁剪,99,0.45235518514460493
调出,99,0.4508906010219789
剪裁,99,0.44886692408911094
数值,99,0.4472394610199887
补光,99,0.4461187353140359
修片,99,0.43721520053374113
虚化,99,0.4350999809744712
局部,99,0.432974107376291
饱和,99,0.4322429502305436
画幅,99,0.42857023181327153
光学,99,0.4282047677593209
图像,99,0.42467803726264874
景深,99,0.4149333946375465
色温,99,0.4145943587200621
自然光,99,0.4131001646275333
连拍,99,0.4121591777530912
聚焦,99,0.40753583873743443
摄者,99,0.4065014835857994
取景器,99,0.4021978227058686
静态,99,0.3937075050076308
图层,99,0.38335470245546843
分辨率,99,0.3816336025033504
慢门,99,0.3769640046705349
亮度,99,0.37665720653846213
美化,99,0.3694056356038968
大光,99,0.3659599588323689
调成,99,0.3658377188608247
清晰度,99,0.3609106023844095
放大,99,0.352992668827245
透视,99,0.35031360806748657
像素,99,0.3426351978294951
噪点,99,0.3418777083498393
原片,99,0.33455915182538637
参照物,99,0.3309971861029381
焦点,99,0.3268045925664173
饱和度,99,0.3245124673223553
水印,99,0.3235201831201075
反光,99,0.3157924523916027
光板,99,0.3130629201445409
修图,99,0.31278956502774113
底片,99,0.3073610697786377
比度,99,0.30231977280457845
视力,99,0.300582304912615
摄体,99,0.2785162779295052
大家aa平摊,99,0.27656660628966984
滤镜,99,0.27502994565521094
调色,99,0.273747862653945
Next comes the manual curation stage:
for example, killing odd entries like the '大家aa平摊' (splitting the bill) above, and since garbage words tend to clump into whole garbage clusters, removing those clusters outright, and so on.
OK, that's all for now~~