Scraping Jianshu Article URLs

Author: ZemelZhu | Published 2018-09-01 14:43


This post scrapes article URLs from Jianshu.

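Jianshu loads its home page in two different ways.

Scrolling down loads more articles; the request carries the IDs of the articles already shown above plus a page value.

[Figure: parameters sent with the scroll-down load request]
After page 3, more articles load only when "load more" is clicked, so the loading mechanism changes.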
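
To make the scroll-down request concrete, here is a minimal Python 3 sketch of how such a URL could be assembled. The parameter name `seen_snote_ids[]` and the helper `build_load_more_url` are assumptions for illustration (the original screenshot is not reproduced here), and the IDs are made up.

```python
from urllib.parse import urlencode


def build_load_more_url(base_url, seen_ids, page):
    """Build a URL carrying the already-shown article IDs and a page value.

    The parameter name "seen_snote_ids[]" is an assumption, not confirmed
    by the article text.
    """
    # One repeated query parameter per article ID that is already on the page
    params = [("seen_snote_ids[]", note_id) for note_id in seen_ids]
    # The page value goes last
    params.append(("page", page))
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical IDs, purely for illustration
    print(build_load_more_url("https://www.jianshu.com/", [101, 102], 2))
```

Since this request format stops applying after page 3, a hand-built URL like this only covers the first few pages.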

Because of this, it is best to scrape with a simulated browser. First download PhantomJS, then point the driver at its executable:

driver = webdriver.PhantomJS(
    executable_path=r'G:\MySorf\pythonTool'
                    r'\phantomjs-2.1.1-windows\phantomjs-2.1.1-windows'
                    r'\bin\phantomjs.exe')

First, simulate scrolling down.

<footer class="container"> is the footer of the Jianshu page.

driver.find_element_by_css_selector("footer.container")\
        .send_keys(Keys.DOWN)

Locate the footer, then scroll down.
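
As an alternative to sending a fixed number of key presses, the scrolling step can be sketched as a loop that stops once the page height no longer grows. This is a swapped-in technique, not the article's method. The helper `scroll_to_bottom` and its parameters are hypothetical; it only relies on the driver's `execute_script` method.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll until the page height stops growing; return the number of scrolls.

    `driver` is any object with an execute_script(script) method,
    such as a Selenium WebDriver instance.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the document
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        # Give the page time to append newly loaded articles
        time.sleep(pause)
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            # Nothing new was loaded, so the bottom has been reached
            break
        last_height = new_height
    return rounds
```

The `max_rounds` cap keeps the loop from running forever on a page that never stops growing.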

Simulate clicking "load more"

[Figure: locating the "read more" link]

ac = driver.find_element_by_css_selector("a.load-more")
ActionChains(driver).move_to_element(ac).click(ac).perform()
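
The click step can also be wrapped in a loop that stops as soon as the link is gone, rather than relying on an exception. The sketch below assumes the Selenium 3-era `find_elements_by_css_selector` call, matching the API style used in this article; Selenium 4 removed it in favor of `find_elements(By.CSS_SELECTOR, ...)`. The helper name `click_load_more` and its defaults are hypothetical.

```python
import time


def click_load_more(driver, selector="a.load-more", max_clicks=8, pause=4.0):
    """Click the "load more" link until it disappears or max_clicks is reached.

    Returns the number of clicks actually performed.
    """
    clicks = 0
    while clicks < max_clicks:
        # The plural find_elements_* returns an empty list when nothing matches
        buttons = driver.find_elements_by_css_selector(selector)
        if not buttons:
            break
        buttons[0].click()
        clicks += 1
        # Wait for the next batch of articles to be appended
        time.sleep(pause)
    return clicks
```

Using the plural lookup avoids a try/except around the whole loop, so unrelated errors are not silently swallowed.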

Finally, extract the URLs with XPath.

# find_elements (plural) returns every matching <a class="title"> node
links = driver.find_elements_by_xpath('//a[@class="title"]')
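
To see what the XPath expression selects without launching a browser, here is a small standalone Python 3 sketch using the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The sample markup and the helper `extract_title_urls` are assumptions for illustration; real pages are usually not well-formed XML, so this only demonstrates the selection logic.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified markup; the real page structure may differ
SAMPLE_MARKUP = """
<ul class="note-list">
  <li><a class="title" href="/p/11046c89367d">First article</a></li>
  <li><a class="avatar" href="/u/someone">Author</a></li>
  <li><a class="title" href="/p/94ba3a429f53">Second article</a></li>
</ul>
"""


def extract_title_urls(markup, base="https://www.jianshu.com/"):
    """Return absolute URLs of every <a class="title"> element in the markup."""
    root = ET.fromstring(markup)
    # ".//a[@class='title']" mirrors the '//a[@class="title"]' query
    anchors = root.findall(".//a[@class='title']")
    return [urljoin(base, a.get("href")) for a in anchors]


if __name__ == "__main__":
    for url in extract_title_urls(SAMPLE_MARKUP):
        print(url)
```

Only the two anchors whose class is exactly `title` are returned; the `avatar` link is skipped.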

Full code

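# -*- coding:utf-8 -*-

# Test code, run under IPython 2

# Import webdriver
from selenium import webdriver
import time
# The Keys class is needed to send keyboard key presses
from selenium.webdriver.common.keys import Keys
# If PhantomJS is not on the PATH, its location must be passed explicitly via executable_path

# Import the ActionChains class
from selenium.webdriver import ActionChains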
driver = webdriver.PhantomJS(
    executable_path=r'G:\MySorf\pythonTool'
                    r'\phantomjs-2.1.1-windows\phantomjs-2.1.1-windows'
                    r'\bin\phantomjs.exe')
# driver = webdriver.PhantomJS()
# get() waits until the page has fully loaded before the program continues
driver.get("https://www.jianshu.com/")
js = "var q=document.documentElement.scrollTop=100000"
# Run the JavaScript to jump far down the page
driver.execute_script(js)

# Keep pressing DOWN on the footer to pull the page to the very bottom
for i in range(1, 20):
    driver.find_element_by_css_selector("footer.container") \
        .send_keys(Keys.DOWN)
    time.sleep(1)

# Move the mouse to ac, then click it
try:
    # Simulate clicking "load more"
    for i in range(1, 9):
        # a.load-more is the CSS class of the "read more" link
        ac = driver.find_element_by_css_selector("a.load-more")
        ActionChains(driver).move_to_element(ac).click(ac).perform()
        # Sleep a little longer after each click
        time.sleep(4 + i)
except Exception:
    print("exception")

links = driver.find_elements_by_xpath('//a[@class="title"]')
print(len(links))
# Open the output file once, in append mode, and write one URL per line
with open("articleUrl.txt", "a") as f:
    for link in links:
        href = link.get_attribute('href')
        f.write(href + "\n")
        print(href)

# print(driver.title)
# driver.save_screenshot("jianshu.png")

Contents of articleUrl.txt:

https://www.jianshu.com/p/11046c89367d
https://www.jianshu.com/p/94ba3a429f53
https://www.jianshu.com/p/e19b62bbdf39
https://www.jianshu.com/p/98770ea700f5
https://www.jianshu.com/p/b4d1dd505ed8
https://www.jianshu.com/p/881f512160c7
https://www.jianshu.com/p/22c5b6081eac
https://www.jianshu.com/p/f31e39d3ce41
https://www.jianshu.com/p/e57940123cc4
https://www.jianshu.com/p/3ea8262b0927
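
Because the script opens articleUrl.txt in append mode, running it more than once can write the same URL repeatedly. A minimal sketch of one way to avoid that is shown here; the helper `append_new_urls` is hypothetical and not part of the original script.

```python
import os


def append_new_urls(path, urls):
    """Append only URLs not already present in the file; return those added."""
    seen = set()
    if os.path.exists(path):
        # Load the URLs written by earlier runs
        with open(path) as f:
            seen = {line.strip() for line in f if line.strip()}
    added = []
    with open(path, "a") as f:
        for url in urls:
            if url not in seen:
                f.write(url + "\n")
                seen.add(url)
                added.append(url)
    return added
```

In the script, the list passed in would be the `href` values collected from the `a.title` elements.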

To parse the articles themselves, refer to the linked article.