Lesson 20 (Appendix): 一点资讯 (Yidian Zixun) – Collecting a Media Account's Latest Articles


Author: 田边女斯基 | Published 2016-10-09 14:14

    Results

    Collection target:
    http://www.yidianzixun.com/home?page=channel&id=m115702

    morelinks()

    Takes the media-account IDs (stored in a list), builds the corresponding content-list URLs in url_list, and passes each response to result(content) for extraction.
    Notes:
    1. Apparently only the latest 100 items can be collected.
    2. The request headers do not seem to be necessary.

    result(content)

    Extracts the article data for each media account.
    Notes:
    1. BeautifulSoup cannot be used on this response, so regular expressions are used for now.
    2. Not every field is guaranteed to be present, so each one needs a check.
    3. The request headers do not seem to be necessary.
    4. Output is stored per channel. Naming currently relies on the channel name found in the content; if no other usable field is present, the account ID may have to be used instead, and the ID then needs to be mapped to the channel name.
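Note 4 can be sketched as a small lookup table; the channel names below are illustrative placeholders, not values taken from the site:

```python
# Hypothetical mapping from media-account ID to a readable channel name
ID_TO_NAME = {'m118533': 'channel-a', 'm115702': 'channel-b'}

def output_path(channel_id, stamp, base='./'):
    # Fall back to the raw ID when the channel has no known name
    name = ID_TO_NAME.get(channel_id, channel_id)
    return '{}{}{}.txt'.format(base, name, stamp)

print(output_path('m115702', '14-14-00'))  # ./channel-b14-14-00.txt
print(output_path('m000000', '14-14-00'))  # unknown ID: falls back to the ID itself
```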

    Code

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import requests  # note the trailing s
    import re
    import time
    time1 = time.strftime("%H:%M:%S").replace(':', '-')
    path = './'
    def morelinks():  # currently only the first 100 items can be collected
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Cookie': 'JSESSIONID=f27f890fbaef32cf20c08ff15d664ebd368d772d21cc0323f12ba1d3b9df031e; ' \
                      'BUILD_VERSION=1475117891169; captcha=s%3A2532aa295d28e9dcc859d5da9b7fd568.FhTz501pKnp4i5QOlhAUnG6tGdkmTcIoZhDnFyV1m3Q; ' \
                      'Hm_lvt_15fafbae2b9b11d280c79eff3b840e45=1475798099; Hm_lpvt_15fafbae2b9b11d280c79eff3b840e45=1475798372; ' \
                      'CNZZDATA1255169715=222495866-1475797495-%7C1475797495; ' \
                      'cn_9a154edda337ag57c050_dplus=%7B%22distinct_id%22%3A%20%221579c6be96b1e3-0f1807e544289-4045022a-1fa400' \
                      '-1579c6be96c318%22%2C%22%24_sessionid%22%3A%200%2C%22%24_sessionTime%22%3A%201475799327%2C%22%24dp%22%3A%200%2C%22%24_' \
                      'sessionPVTime%22%3A%201475799327%2C%22%E6%9D%A5%E6%BA%90%E6%B8%A0%E9%81%93%22%3A%20%22%22%2C%22initial_view_' \
                      'time%22%3A%20%221475797495%22%2C%22initial_referrer%22%3A%20%22%24direct%22%2C%22initial_referrer_domain%22%3A%20%22%24direct%22%2C%22%24' \
                      'recent_outside_referrer%22%3A%20%22%24direct%22%7D',
            'Host': 'www.yidianzixun.com',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
        }  # apparently not needed
        id_list = ['m118533', 'm115702']  # media-account IDs (renamed: 'list' shadowed the builtin)
        url_list = ['http://www.yidianzixun.com/api/q/?path=channel|news-list-for-channel&channel_id={}' \
                    '&fields=docid&fields=category&fields=date&fields=image&fields=image_urls' \
                    '&fields=like&fields=source&fields=title&fields=url' \
                    '&fields=comment_count&fields=summary&fields=up&cstart=00&cend=100&version=999999&infinite=true'.format(i)
                    for i in id_list]
        for url in url_list:  # fetch the content list for each account
            web_data = requests.get(url, headers=headers)
            content = web_data.text.split('}')
            channel_id = re.match(r'^.+&channel_id=(.*?)&.+$', url)
            print(channel_id.group(1))
            result(content)

    def result(content):  # extract the article fields from the raw response
        from_name = re.match(r'^.+"landing_title":"(.*?)-.+$', str(content[2]))
        print(from_name.group(1))
        path_final = path + from_name.group(1) + time1 + '.txt'
        print(path_final)
        with open(path_final, 'a+') as text:
            for i in range(1, len(content) - 3):
                none = '0'
                detail_list = []
                # each field may be absent, so test with the same pattern used
                # for extraction and fall back to '0' when there is no match
                title = re.match(r'^.+"title":"(.*?)".+$', str(content[i])) if re.match(r'^.+"title":"(.*?)".+$', str(content[i])) \
                    else re.match(r'^(.*)$', none)
                detail_list.append(title.group(1))
                date = re.match(r'^.+"date":"(.*?)".+$', str(content[i])) if re.match(r'^.+"date":"(.*?)".+$', str(content[i])) \
                    else re.match(r'^(.*)$', none)
                detail_list.append(date.group(1))
                summary = re.match(r'^.+"summary":"(.*?)".+$', str(content[i])) if re.match(r'^.+"summary":"(.*?)".+$', str(content[i])) \
                    else re.match(r'^(.*)$', none)
                detail_list.append(summary.group(1))
                url = re.match(r'^.+"url":"(.*?)".+$', str(content[i])) if re.match(r'^.+"url":"(.*?)".+$', str(content[i])) \
                    else re.match(r'^(.*)$', none)
                detail_list.append(url.group(1).replace('\\', ''))
                category = re.match(r'^.+"category":"(.*?)".+$', str(content[i])) if re.match(r'^.+"category":"(.*?)".+$', str(content[i])) \
                    else re.match(r'^(.*)$', none)
                detail_list.append(category.group(1))
                comment_count = re.match(r'^.+"comment_count":(.*?),".+$', str(content[i])) if re.match(r'^.+"comment_count":(.*?),".+$', str(content[i])) \
                    else re.match(r'^(.*)$', none)
                detail_list.append(comment_count.group(1))
                up = re.match(r'^.+"up":(.*?),".+$', str(content[i])) if re.match(r'^.+"up":(.*?),".+$', str(content[i])) \
                    else re.match(r'^(.*)$', none)
                detail_list.append(up.group(1))
                like = re.match(r'^.+"like":(.*?),".+$', str(content[i])) if re.match(r'^.+"like":(.*?),".+$', str(content[i])) \
                    else re.match(r'^(.*)$', none)
                detail_list.append(like.group(1))
                print(str(i) + '\n' + 'title: ' + title.group(1) + '\n' + 'date: ' + date.group(1) + '\n' +
                      'summary: ' + summary.group(1) + '\n' + 'url: ' + url.group(1).replace('\\', '') + '\n' +
                      'category: ' + category.group(1) + '\n' + 'comment_count: ' + comment_count.group(1) + '\n' +
                      'up: ' + up.group(1) + '\n' + 'like: ' + like.group(1) + '\n')
                text.write(str(detail_list) + '\n')

    #web_data = requests.get(test_url, headers=headers)
    #content = web_data.text.split('}')
    #result(content)
    morelinks()
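Since the /api/q/ endpoint returns JSON, Python's standard json module would likely be a more robust parser than splitting the response on '}' and regex-matching each fragment. A minimal sketch, using a made-up response shape rather than one captured from the live API:

```python
import json

# Illustrative response body; the real API's field layout may differ
sample = '{"documents": [{"title": "a", "up": 12}, {"title": "b"}]}'
for doc in json.loads(sample)['documents']:
    # dict.get supplies the '0' default without any regex fallback logic
    print(doc.get('title', '0'), doc.get('up', '0'))
```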
    

    Summary

    1. Only the latest 100 items seem to be retrievable, which limits how useful this is.
    2. Note the list-comprehension idiom used to build url_list:

    url_list = ['hehe{}'.format(i) for i in id_list]
    

    3. Keep re.match() patterns minimal and non-greedy.
    Although every field sits in the same string, not all fields are always present and their order can vary, so for easy checking the pattern should be kept as simple as possible:

    up=re.match(r'^.+"up":(.*?),".+$',str(content[i]))
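The greedy/non-greedy difference is easy to see on a sample fragment (the string below is illustrative):

```python
import re

s = '"date":"2016-10-09","up":12,"like":3,"comment_count":5,'
# Non-greedy stops at the first comma after "up":
print(re.match(r'^.+"up":(.*?),.+$', s).group(1))  # 12
# Greedy backtracks from the right and captures too much
print(re.match(r'^.+"up":(.*),.+$', s).group(1))   # 12,"like":3
```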
    

    4. Likewise, each field needs a presence check during extraction. Note the conditional-expression (if/else) form, and what to fall back to when there is no match:

    none = '0'
    summary = re.match(r'^.+"summary":"(.*?)".+$', str(content[i])) if re.match(r'^.+"summary":"(.*?)".+$', str(content[i])) \
                else re.match(r'^(.*)$', none)
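Matching twice (once as the condition, once for the value) can also be folded into a small helper; the function name and default here are this sketch's own, not part of the original script:

```python
import re

def field(pattern, text, default='0'):
    # Match once, and fall back to the default when the field is absent
    m = re.match(pattern, text)
    return m.group(1) if m else default

chunk = '{"title":"hello","date":"2016-10-09"'   # illustrative fragment
print(field(r'^.+"title":"(.*?)".+$', chunk))    # hello
print(field(r'^.+"summary":"(.*?)".+$', chunk))  # 0
```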
    

    5. In the future, consider targeting the mobile site first.
