Feature Extraction from Weibo Big Vs (A Simple Crawler plus Data Analysis)

Author: TheMarcMa | Published 2015-09-09 00:51

The idea for this article comes from the chapter on finding independent features in *Programming Collective Intelligence*. I was curious what would happen if the different news sources in that chapter were replaced with the posts of different Weibo big Vs.

1. Content Acquisition

1.1 Simulating the Weibo Login

Each big V's original posts take the place of the news sources. The crawling is done against the WAP version of Weibo: compared with weibo.com, weibo.cn is much simpler to parse and its login flow is less involved. First, simulate the login; the code (Python 2.7) is as follows:

```
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup as bs

weiboUrl = 'http://weibo.cn/pub/'
# the public page links to the mobile login page
loginUrl = bs(requests.get(weiboUrl).content).find("div", {"class": "ut"}).find("a")['href']
origInfo = bs(requests.get(loginUrl).content)
loginInfo = origInfo.find("form")['action']
loginpostUrl = 'http://login.weibo.cn/login/' + loginInfo
headers = { 'Host': 'login.weibo.cn',
            'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; BOIE9;ZHCN)',
            'Referer': 'http://login.weibo.cn/login/?ns=1&revalid=2&backURL=http%3A%2F%2Fweibo.cn%2F&backTitle=%D0%C2%C0%CB%CE%A2%B2%A9&vt=',
          }
postData = { 'mobile': 'my Weibo account',
             origInfo.find("form").find("input", {"type": "password"})['name']: 'my Weibo password',
             'remember': 'on',
             'backURL': origInfo.find("form").find("input", {"name": "backURL"})['value'],
             'backTitle': origInfo.find("form").find("input", {"name": "backTitle"})['value'],
             'tryCount': origInfo.find("form").find("input", {"name": "tryCount"})['value'],
             'vk': origInfo.find("form").find("input", {"name": "vk"})['value'],
             'submit': origInfo.find("form").find("input", {"name": "submit"})['value'],
           }

s = requests.Session()
req = s.post(loginpostUrl, data=postData, headers=headers)
```
With that, the simulated login is complete, and any URL can now be crawled with `s.get(url, cookies=req.cookies)`.
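A minimal sketch (not part of the original code) of using the logged-in session, for example to check that the login actually succeeded; the homepage URL and the logout-link check are assumptions for illustration:

```
# assumes s and req come from the login code above
home = s.get('http://weibo.cn/', cookies=req.cookies)   # any weibo.cn URL is fetched the same way
page = bs(home.text)
# rough, hypothetical check: a logged-in WAP homepage normally contains a logout link
print('login ok' if page.find('a', href=lambda h: h and 'logout' in h) else 'login failed')
```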
Next, a tuple and a dictionary pin down the Weibo IDs of the big Vs to crawl:

```
name_tuple = (u'谷大',u'magasa',u'猴姆',u'中国国家天文',u'耳帝',u'freshboy',u'松鼠可学会',u'人人影视',u'博物杂志',u'noisey音乐',u'网易公开课',u'LifeTime',u'DK在北京',u'电影扒客',u'lonelyplanet')
name_dic = {u'谷大':u'ichthy', u'magasa':u'magasafilm', u'猴姆':u'houson100037', u'中国国家天文':u'64807699',
            u'耳帝':u'eargod', u'freshboy':u'freshboy', u'松鼠可学会':u'songshuhui', u'人人影视':u'yyets',
            u'博物杂志':u'bowu', u'noisey音乐':u'noisey', u'网易公开课':u'163open', u'LifeTime':u'usinvester',
            u'DK在北京':u'dkinbj', u'电影扒客':u'2315579285', u'lonelyplanet':u'lonelyplanet'}
```
The tuple fixes the order of the big Vs: iterating directly over the dictionary's keys would give an arbitrary order, which would make it much harder later to tell which person each feature belongs to.
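A tiny, self-contained illustration of the ordering issue (the toy dictionary is made up; `collections.OrderedDict` is shown only as an alternative and is not used in the original code):

```
from collections import OrderedDict

d = {u'a': 1, u'b': 2, u'c': 3}
print(d.keys())                       # in Python 2.7 the key order is an implementation detail
fixed = (u'a', u'b', u'c')            # a tuple, like name_tuple above, pins the row order
print([d[k] for k in fixed])          # [1, 2, 3], always in the same order
od = OrderedDict([(u'a', 1), (u'b', 2), (u'c', 3)])
print(od.keys())                      # an OrderedDict would be another way to keep insertion order
```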
    
Next, iterate over these accounts and crawl their posts:
    

```
def get_weibos():
    s = requests.Session()
    req = s.post(loginpostUrl, data=postData, headers=headers)
    all_words = {}
    store_weibos = {}
    clean_words = set()  # optional stop-word list used to filter words out
    for person in name_tuple:
        person_id = name_dic[person]
        store_weibos.setdefault(person, {})
        for index in range(1, 7):  # crawl pages 1-6 of each person's posts
            index_added = str(index)
            # the filter=1 parameter restricts the page to original (non-reposted) posts
            person_url = 'http://weibo.cn/'+person_id+'?filter=1&page='+index_added+'&vt=4'
            req2 = s.get(person_url, cookies=req.cookies)
            soup1 = bs(req2.text)
            for op in soup1.find_all('div'):
                # by inspecting the page, div tags with class 'c' hold the post content
                if u'class' in op.attrs and u'id' in op.attrs and u'c' in op.attrs[u'class']:
                    op_weibo = op.span.text
```

1.2 Chinese Word Segmentation

The op_weibo obtained above is the text of a single post. Unlike English, Chinese cannot simply be split on spaces, so the jieba segmenter (https://github.com/fxsjy/jieba) is used here.
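A minimal standalone sketch of how `jieba.cut` is used (the sample sentence is invented for illustration):

```
# -*- coding: utf-8 -*-
import jieba

sample = u'今天天气不错，适合出去拍照'   # any short Chinese sentence
words = [w for w in jieba.cut(sample, cut_all=False) if len(w) > 1]  # drop single-character tokens and punctuation
print(u'/'.join(words))                 # e.g. 今天/天气/不错/适合/出去/拍照
```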
Continuing the code from the previous section:

```
                    op_weibo = op.span.text
                    for word in jieba.cut(op_weibo, cut_all=False):
                        # jieba.cut yields the segmented words; drop very short
                        # tokens/symbols and anything in the clean_words stop list
                        if len(word) > 1 and word not in clean_words:
                            # store_weibos is a dict of dicts: for each person it
                            # records the words in their posts and their counts
                            store_weibos[person].setdefault(word, 0)
                            store_weibos[person][word] += 1
        for word_1 in store_weibos[person].keys():
            all_words.setdefault(word_1, 0)
            all_words[word_1] += 1
        print 'get %s already' % person
    # allwords keeps the words that appear in more than 3 people's posts
    # but in fewer than 90% of the people
    allwords = [w for w, n in all_words.items() if n > 3 and n < len(name_dic.keys())*0.9]
    # l1: for each person, a row as long as allwords holding that person's count of
    # each word, i.e. the [person-words] matrix
    l1 = [[(word in store_weibos[person] and store_weibos[person][word] or 0) for word in allwords] for person in name_tuple]
    return all_words, store_weibos, allwords, l1
```
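To make the data layout concrete, here is a hypothetical toy example of `store_weibos` and the resulting [person-words] rows (the names, words, and counts are made up):

```
# -*- coding: utf-8 -*-
# toy data in the same shape the crawler produces
store_weibos = {u'A': {u'电影': 3, u'音乐': 1}, u'B': {u'音乐': 2}}
allwords = [u'电影', u'音乐']
people = (u'A', u'B')
l1 = [[store_weibos[p].get(w, 0) for w in allwords] for p in people]
print(l1)   # [[3, 1], [0, 2]] -- one row per person, one column per word
```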
    
2. Feature Extraction: Matrix Factorization
The code below comes mostly from *Programming Collective Intelligence* (http://www.amazon.cn/集体智慧编程-西格兰/dp/B001NPDVP2).
    

```
from numpy import shape, matrix, array, transpose
import random

def difcost(a, b):  # cost function used by the matrix factorization
    dif = 0
    for i in range(shape(a)[0]):
        for j in range(shape(a)[1]):
            dif += pow(a[i, j]-b[i, j], 2)   # sum of squared element-wise differences
    return dif
```
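As a side note, `difcost` is just the sum of squared element-wise differences (the squared Frobenius norm of `a - b`), so with numpy it could equivalently be written in vectorized form; `difcost_fast` is an illustrative name, not part of the original code:

```
from numpy import array

def difcost_fast(a, b):
    # same quantity as difcost, computed in one vectorized expression
    return float(((array(a) - array(b)) ** 2).sum())
```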

Factorize the matrix: decompose the [person-word] matrix into a [person-feature] matrix times a [feature-word] matrix.

```
def factorize(v, pc=10, iter=50):
    ic = shape(v)[0]   # v is an ic*fc matrix
    fc = shape(v)[1]
    w = matrix([[random.random() for j in range(pc)] for i in range(ic)])  # ic*pc weight matrix
    h = matrix([[random.random() for j in range(fc)] for i in range(pc)])  # pc*fc feature matrix

    # find v ~ w*h (matrix factorization)
    for i in range(iter):
        wh = w*h
        cost = difcost(wh, v)
        # print the cost every 10 iterations
        if i % 10 == 0: print cost
        if cost == 0: break
        # multiplicative update rules for h and w
        hn = (transpose(w)*v)
        hd = (transpose(w)*w*h)
        h = matrix(array(h)*array(hn)/array(hd))
        wn = (v*transpose(h))
        wd = (w*h*transpose(h))
        w = matrix(array(w)*array(wn)/array(wd))

    return w, h
```
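As an aside, the same non-negative matrix factorization is available from scikit-learn; this is a hedged sketch of the equivalent call (scikit-learn is not used in the original code, and `factorize_sklearn` is just an illustrative name):

```
from numpy import array
from sklearn.decomposition import NMF

def factorize_sklearn(v, pc=10, iter=50):
    # W is the [person-feature] weight matrix, H the [feature-word] matrix, so that v ~ W * H
    model = NMF(n_components=pc, max_iter=iter, init='random', random_state=0)
    w = model.fit_transform(array(v))
    h = model.components_
    return w, h
```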

```
# display the results grouped by feature
def showfeatures(w, h, titles, wordvec, out='features.txt'):
    outfile = file(out, 'w')
    pc, wc = shape(h)   # h is the [feature-word] matrix: pc features, wc words
    toppatterns = [[] for x in range(len(titles))]
    patternnames = []

    # loop over every feature
    for i in range(pc):
        slist = []
        for j in range(wc):
            slist.append((h[i, j], wordvec[j]))
        # sort by the weight h[i,j] from largest to smallest to get the most related words
        slist.sort()
        slist.reverse()
        # write out the six words with the highest weights for this feature
        n = [s[1] for s in slist[0:6]]
        outfile.write(str(n)+'\n')
        patternnames.append(n)

        # w[j,i] is the weight of feature i for person (article) j
        flist = []
        for j in range(len(titles)):
            flist.append((w[j, i], titles[j]))
            toppatterns[j].append((w[j, i], i, titles[j]))

        # write out the three people most strongly associated with this feature
        flist.sort()
        flist.reverse()
        for f in flist[0:3]:
            outfile.write(str(f)+'\n')
        outfile.write('\n')

    outfile.close()
    return toppatterns, patternnames
```

Display the results by article (here, by person):

```
def showarticles(titles, toppatterns, patternnames, out='articles.txt'):
    outfile = file(out, 'w')

    for j in range(len(titles)):
        outfile.write(titles[j].encode('utf8')+'\n')
        # sort w[article, feature] for this person in descending order
        toppatterns[j].sort()
        toppatterns[j].reverse()
        # for the top three features, write w[article, feature] followed by the feature's words
        for i in range(3):
            a = u''.encode('utf8')
            for word in patternnames[toppatterns[j][i][1]]:
                a = a + ' ' + word.encode('utf8')
            outfile.write(str(toppatterns[j][i][0]) + ' ' + a + '\n')
        outfile.write('\n')

    outfile.close()
```

3. Results

```
import weibo_feature   # the module containing the code above
a, b, c, d = weibo_feature.get_weibos()
```
Here d is the person-to-word list we need; after converting it to the matrix m_d, the factorize function decomposes it into the weight matrix weights and the feature matrix feat. c is the full vocabulary of counted words across all the people. With these, the results can be written out.
    

```
from numpy import matrix

m_d = matrix(d)
weights, feat = weibo_feature.factorize(m_d)
topp, pn = weibo_feature.showfeatures(weights, feat, name_tuple, c)
weibo_feature.showarticles(name_tuple, topp, pn)
```

From the output files: showfeatures shows which people each feature is most related to, and showarticles is even more intuitive, showing which features each person is most related to.

    
    
![1.jpg](https://img.haomeiwen.com/i743445/979ccc28ae037792.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
    
    ![2.jpg](https://img.haomeiwen.com/i743445/a7ee693a4ed2c2ab.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
The two figures above show, respectively, the weights between the word-group features and the people, and the weights between the people and the word-group features.
**Which big Vs each feature corresponds to is obvious at a glance, as are the three features most related to each big V, and judging from the results and everyday intuition the factorization and its coefficients look quite reasonable.**
*Known issues: it would be better to add part-of-speech filtering to drop conjunctions and similar words.*
*The amount of crawled data is small and was not put into a database; this is only a demo.*
