Web scraping: handling data extracted before and after ajax pagination with selenium

Author: BlueCat2016 | Published 2017-01-18 21:50 | Read 177 times

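Some pages handle pagination with ajax, so the URL does not change when you move from one page to the next. Such pages cannot be crawled with the traditional approach of working out the URL pattern. Instead, consider using selenium + phantomjs to simulate a mouse click on the page-turn button.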

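However, the data extracted after the simulated click may not have changed: the second extraction returns the same values (the titles, for example) as the first. A likely cause is that the program reads the page again before the page turn has actually finished, so it is still reading the first page. The fix for this case is to sleep for a few seconds after the simulated click and only then extract the data, which is the safer option.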

A simple example follows:

from selenium import webdriver
from lxml import etree
import time

current_url = "http://photo.nocutnews.co.kr/news/issue/list"
driver = webdriver.PhantomJS(executable_path="phantomjs")
driver.get(current_url)

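# title = driver.find_element_by_xpath("//div[@class='photoinfo']/a/strong").text
# Parse the first page and extract the titles.
ps = driver.page_source
html = etree.HTML(ps.encode("utf-8"))
title = html.xpath("//div[@class='photoinfo']/a/strong/text()")
print(title)
# Trigger the ajax page turn by calling the pager's postback function.
driver.execute_script("__doPostBack('ctl00$ctl00$cphBody$cphBody$pcPager$ctl03','')")
# Give the page time to finish updating before reading it again.
time.sleep(5)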

# title = driver.find_element_by_xpath("//div[@class='photoinfo']/a/strong").text
# Parse the page again; the titles should now come from the second page.
ps1 = driver.page_source
html1 = etree.HTML(ps1.encode("utf-8"))
title1 = html1.xpath("//div[@class='photoinfo']/a/strong/text()")
print(title1)
driver.quit()
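
A fixed sleep either wastes time or is too short on a slow connection. As a possible alternative (not part of the original example), the sketch below polls until the extracted value differs from the first page's value, and gives up after a timeout. The helper name `wait_for_change` and its parameters are hypothetical, introduced only for illustration. It assumes consecutive pages never yield identical values; if they do, the helper times out.

```python
import time


def wait_for_change(fetch, old_value, timeout=10.0, poll_interval=0.5):
    """Call fetch() repeatedly until it returns something other than old_value.

    fetch is any zero-argument callable (hypothetical helper, not a selenium API).
    Returns the new value, or raises RuntimeError once the timeout has passed.
    """
    deadline = time.time() + timeout
    while True:
        value = fetch()
        if value != old_value:
            return value
        if time.time() >= deadline:
            raise RuntimeError("content did not change within %.1f s" % timeout)
        time.sleep(poll_interval)
```

In the example above, this could replace `time.sleep(5)` and the second extraction by passing a small function that re-parses `driver.page_source` with the same xpath as `fetch`, and `title` as `old_value`.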

Article link: https://www.haomeiwen.com/subject/zopbbttx.html
