Scraping Jianshu Article URLs

Author: ZemelZhu | Published 2018-09-01 14:43
    This post scrapes article URLs from the Jianshu home page.
    Jianshu loads additional entries in two different ways:
    1. Scrolling down loads the next batch; the request carries the ids of the articles already shown and a page value (see the sketch after this list).


      [Figure: parameters attached to the pull-down request]
    2. After about three pages, more entries load only when you click the "load more" button, which issues a different kind of request.
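
    In principle the pull-down request could be called directly. The sketch below shows the idea with requests; note that the endpoint path "/trending_notes" and the parameter names "page" and "seen_snote_ids[]" are guesses inferred from the screenshot above, not a documented API.

    import requests

    # Hypothetical reconstruction of the pull-down request; the endpoint
    # and parameter names are assumptions, not a verified Jianshu API.
    seen_ids = [123, 456]  # ids of articles already on the page (made-up values)
    resp = requests.post(
        "https://www.jianshu.com/trending_notes",  # hypothetical endpoint
        data={"page": 2, "seen_snote_ids[]": seen_ids},
        headers={"X-Requested-With": "XMLHttpRequest"})
    print(resp.status_code)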

    Because the second mode needs a real click, it is easiest to scrape with a simulated browser. First download PhantomJS, then point Selenium at its executable:

    driver = webdriver.PhantomJS(
        executable_path=r'G:\MySorf\pythonTool\phantomjs-2.1.1-windows'
                        r'\phantomjs-2.1.1-windows\bin\phantomjs.exe')
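
    Note that newer Selenium releases have deprecated PhantomJS; headless Chrome is a common substitute. A minimal sketch, assuming chromedriver is installed and on the PATH (recent Selenium versions use the keyword options instead of chrome_options):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(chrome_options=opts)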
    

    First, simulate scrolling down.

    <footer class="container"> is the footer of the Jianshu page:

    driver.find_element_by_css_selector("footer.container")\
            .send_keys(Keys.DOWN)
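
    If the footer has not been rendered yet, find_element_by_css_selector raises NoSuchElementException. One way to guard against that is an explicit wait, sketched here with Selenium's standard WebDriverWait:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Block for up to 10 seconds until the footer is present in the DOM
    footer = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "footer.container")))
    footer.send_keys(Keys.DOWN)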
    

    Locate the footer, then send DOWN key presses to scroll.

    Next, simulate clicking "load more".

    [Figure: locating the "load more" link]

    ac = driver.find_element_by_css_selector("a.load-more")
    ActionChains(driver).move_to_element(ac).click(ac).perform()
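
    In the full code below, the click loop simply stops with an exception once a.load-more can no longer be found. A slightly more explicit variant, checking for the element before each click, might look like this:

    import time
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver import ActionChains

    for i in range(1, 9):
        try:
            ac = driver.find_element_by_css_selector("a.load-more")
        except NoSuchElementException:
            break  # no "load more" link left on the page
        ActionChains(driver).move_to_element(ac).click(ac).perform()
        time.sleep(4 + i)  # give the new entries time to load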
    

    Finally, extract the URLs with XPath; every article title is an <a class="title"> element:

    links = driver.find_elements_by_xpath('//a[@class="title"]')
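
    Repeated loads can surface the same article more than once. A small variation that opens the output file once and skips duplicates (the set name seen is mine):

    seen = set()
    with open("articleUrl.txt", "a") as f:
        for link in driver.find_elements_by_xpath('//a[@class="title"]'):
            href = link.get_attribute('href')
            if href and href not in seen:
                seen.add(href)
                f.write(href + "\n")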
    

    The full code:

    # -*- coding:utf-8 -*-
    # Tested under Python 2 (IPython2)

    # Import webdriver
    from selenium import webdriver
    import time
    # Keys is needed for keyboard presses
    from selenium.webdriver.common.keys import Keys
    # ActionChains is needed for simulated mouse actions
    from selenium.webdriver import ActionChains

    # Pass executable_path if PhantomJS is not on the PATH
    driver = webdriver.PhantomJS(
        executable_path=r'G:\MySorf\pythonTool\phantomjs-2.1.1-windows'
                        r'\phantomjs-2.1.1-windows\bin\phantomjs.exe')
    # driver = webdriver.PhantomJS()

    # get() blocks until the page has fully loaded
    driver.get("https://www.jianshu.com/")

    # Jump to the bottom of the page with a JS scroll
    js = "var q=document.documentElement.scrollTop=100000"
    driver.execute_script(js)

    # Press DOWN on the footer repeatedly to trigger pull-down loading
    for i in range(1, 20):
        driver.find_element_by_css_selector("footer.container") \
            .send_keys(Keys.DOWN)
        time.sleep(1)

    try:
        # Simulate clicking "load more"
        for i in range(1, 9):
            # a.load-more is the css class of the "load more" link
            ac = driver.find_element_by_css_selector("a.load-more")
            # Move the mouse to the link and click it
            ActionChains(driver).move_to_element(ac).click(ac).perform()
            # Wait for the new entries to load
            time.sleep(4 + i)
    except Exception:
        print "exception"

    # Every article title is an <a class="title">; collect them all
    links = driver.find_elements_by_xpath('//a[@class="title"]')
    print len(links)
    for link in links:
        with open("articleUrl.txt", "a") as f:
            f.write(link.get_attribute('href') + "\n")
        print link.get_attribute('href')

    # print driver.title
    # driver.save_screenshot("jianshu.png")
    

    Contents of articleUrl.txt:

    https://www.jianshu.com/p/11046c89367d
    https://www.jianshu.com/p/94ba3a429f53
    https://www.jianshu.com/p/e19b62bbdf39
    https://www.jianshu.com/p/98770ea700f5
    https://www.jianshu.com/p/b4d1dd505ed8
    https://www.jianshu.com/p/881f512160c7
    https://www.jianshu.com/p/22c5b6081eac
    https://www.jianshu.com/p/f31e39d3ce41
    https://www.jianshu.com/p/e57940123cc4
    https://www.jianshu.com/p/3ea8262b0927
    

    For parsing the articles themselves, see the sketch below.
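
    A minimal sketch that fetches each saved URL and prints the article title with requests and BeautifulSoup, assuming the title is the page's first <h1> (a guess about Jianshu's markup):

    import requests
    from bs4 import BeautifulSoup

    with open("articleUrl.txt") as f:
        for url in f:
            resp = requests.get(url.strip())
            soup = BeautifulSoup(resp.text, "html.parser")
            h1 = soup.find("h1")  # assumes the title is the first <h1>
            if h1 is not None:
                print(h1.get_text().strip())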
