Python: Scraping a Jianshu User's Article List, View Counts, and Links

Author: 喵呜e喵星人 | Published: 2018-01-29 22:14 | Views: 0

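    1. Open Inspect Element in the 360 Extreme Browser, select "Network" -> "XHR", then scroll the page and watch the requests that appear to work out how the page URLs are built, for example: https://www.jianshu.com/u/55b597320c4e?order_by=shared_at&page=2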

    2. Build the list of page URLs from the total number of articles and the number shown per page.

    urls = ['https://www.jianshu.com/u/55b597320c4e?order_by=shared_at&page={}'.format(i) for i in range(1, 13)]
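The page range in step 2 can be derived rather than hard-coded. The following is a minimal sketch, not the author's code: `build_page_urls` is a hypothetical helper, and the article count (105) and page size (9) are made-up figures chosen only because they yield the 12 pages used in this article.

```python
import math
from urllib.parse import urlencode

BASE_URL = 'https://www.jianshu.com/u/55b597320c4e'


def build_page_urls(article_count, per_page):
    """Return one listing URL per page needed to cover article_count articles."""
    # Round up so a partially filled last page is still requested.
    page_count = math.ceil(article_count / per_page)
    return [
        '{}?{}'.format(BASE_URL, urlencode({'order_by': 'shared_at', 'page': page}))
        for page in range(1, page_count + 1)
    ]


# Hypothetical figures: 105 articles at 9 per page need 12 pages.
urls = build_page_urls(article_count=105, per_page=9)
print(len(urls))
print(urls[-1])
```

Replace the two figures with the values shown on the profile page you are scraping.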

    3. Use the lxml library to extract the fields needed: title, view count, and link URL.

    The code is as follows:

    # -*- coding: utf-8 -*-
    import time
    from multiprocessing import Pool  # process pool: workers are separate processes, not threads

    import pymongo
    import requests
    from lxml import etree

    # MongoDB collection that receives one document per article
    client = pymongo.MongoClient('localhost', 27017)
    mydb = client['mydb']
    jianshu_user_dy = mydb['jianshu_user_dy']

    headers = {
        'X-Requested-With': 'XMLHttpRequest',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
        'Referer': 'https://www.jianshu.com/u/9104ebf5e177'
    }


    def get_infos(url):
        """Fetch one listing page and save each article's title, URL, and view count."""
        try:
            html = requests.get(url, headers=headers)
            selector = etree.HTML(html.text)
            try:
                # Each <li> under #list-container is one article entry
                links = selector.xpath('//*[@id="list-container"]/ul/li')
                for link in links:
                    title = link.xpath('div/a/text()')[0]
                    # The view count is the last text node of the first link in the nested div
                    view = link.xpath('div/div/a[1]/text()')[-1].strip()
                    title_url = 'https://www.jianshu.com' + link.xpath('div/a/@href')[0]
                    print(title, view)
                    infos = {
                        'title': title,
                        'url': title_url,
                        'view': view
                    }
                    jianshu_user_dy.insert_one(infos)
            except Exception:
                # For example, an IndexError when an XPath query matches nothing
                print("Could not extract content from {}".format(url))
        except requests.ConnectionError:
            print("Request failed: {}".format(url))


    # Listing pages 1 to 12
    urls = ['https://www.jianshu.com/u/55b597320c4e?order_by=shared_at&page={}'.format(i) for i in range(1, 13)]

    if __name__ == '__main__':
        start = time.time()
        pool = Pool(processes=4)
        pool.map(get_infos, urls)
        print("Total time: {}".format(time.time() - start))
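The three XPath lookups can be tried offline before any request is sent. The sketch below runs the same expressions against a small hand-written HTML fragment. Both `SAMPLE_HTML` and `parse_listing` are hypothetical: the fragment is only shaped to match the XPath expressions in this article and is not a copy of Jianshu's real markup.

```python
from lxml import etree

# Hypothetical markup, written only to fit the XPath expressions used here.
SAMPLE_HTML = """
<div id="list-container">
  <ul>
    <li>
      <div>
        <a href="/p/abc123">First post</a>
        <div>
          <a href="/p/abc123"><i></i> 42
          </a>
          <a href="/p/abc123#comments"><i></i> 3</a>
        </div>
      </div>
    </li>
  </ul>
</div>
"""


def parse_listing(html_text):
    """Return a list of {'title', 'url', 'view'} dicts, one per <li> entry."""
    selector = etree.HTML(html_text)
    rows = []
    for li in selector.xpath('//*[@id="list-container"]/ul/li'):
        title = li.xpath('div/a/text()')[0]
        # Last text node of the first link in the nested div, whitespace removed.
        view = li.xpath('div/div/a[1]/text()')[-1].strip()
        href = li.xpath('div/a/@href')[0]
        rows.append({
            'title': title,
            'url': 'https://www.jianshu.com' + href,
            'view': view,
        })
    return rows


rows = parse_listing(SAMPLE_HTML)
print(rows)
```

If the site's markup changes, the `[0]` and `[-1]` lookups will raise `IndexError` on empty results, so checking the expressions this way against a saved copy of a real page helps locate which one stopped matching.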

    Article link: https://www.haomeiwen.com/subject/sigfzxtx.html