Python Web Scraping -- Day 04

Author: 陈small末 | Source: published 2019-01-08 08:52, read 0 times


Selenium, PhantomJS, and headless Chrome

Browser driver downloads

WebDriver for IE11:
    http://dl.pconline.com.cn/download/771640-1.html
    Link: https://pan.baidu.com/s/13TTyXGNaG5cpSNdl1k9ksQ Password: 2n9n

WebDriver (chromedriver) for Chrome 65.0.3325.146:
    All versions: http://chromedriver.storage.googleapis.com/index.html
    or http://npm.taobao.org/mirrors/chromedriver/2.43/

WebDriver for Firefox 58:
    Link: https://pan.baidu.com/s/1RATs8y-9Vige0IxcKdn83w Password: l41g

Using Selenium

get(url): opens a URL

from selenium import webdriver

def openURL():
    driver = webdriver.Chrome()
    driver.get("http://www.baidu.com")
    print(driver.page_source)

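clear(): clears the text if the element is a text entry element.

page_source: returns the HTML source of the current page.

close(): closes the current window.

quit(): closes every window and ends the driver session.

click(): clicks the element.

execute_script(script, *args): executes JavaScript in the current page.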

driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")

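import time

# Scroll down so that the browser loads dynamically loaded content.
# A page may need many scrolls like this, with a suitable delay between them.
# If the posts on the page are long, increase the scroll distance.
for i in range(10):
    driver.execute_script("window.scrollBy(0, 1000)")
    time.sleep(3)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")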
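
The fixed count of ten scrolls above can be replaced by a loop that stops once the page height no longer grows. The following is a minimal sketch, not part of the original post: the helper name scroll_to_bottom and its pause and max_rounds parameters are made up for illustration, and it only assumes that driver has an execute_script method, as a Selenium WebDriver does.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll until the page height stops growing or max_rounds is reached.

    `driver` is assumed to be a Selenium WebDriver (any object with an
    execute_script method works). Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        time.sleep(pause)  # give the page time to load new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so this is the bottom
        last_height = new_height
    return rounds
```

A call such as scroll_to_bottom(driver, pause=3) would take the place of the for loop; max_rounds keeps an endlessly growing feed from scrolling forever.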

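Finding elements

The find_element_by_* and find_elements_by_* helpers listed below belong to the Selenium 3 API. Selenium 4.3 removed them in favor of find_element(By.ID, ...) and find_elements(By.ID, ...).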

find_element(by='id', value=None)

find_element_by_class_name(name)

Finds element within this element’s children by class name.

find_element_by_css_selector(css_selector)

Finds element within this element’s children by CSS selector.

find_element_by_id(id_)

Finds element within this element’s children by ID.

find_element_by_link_text(link_text)

Finds element within this element’s children by visible link text.

find_element_by_name(name)

Finds element within this element’s children by name.

find_element_by_tag_name(name)

Finds element within this element’s children by tag name.

find_element_by_xpath(xpath)

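Finds element by xpath.

The base path is relative to this element's location. This will select the first link under this element.

myelement.find_element_by_xpath(".//a")

However, this will select the first link on the page.

myelement.find_element_by_xpath("//a")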

find_elements(by='id', value=None)

‘Private’ method used by the find_elements_by_* methods.

find_elements_by_class_name(name)

Finds a list of elements within this element’s children by class name.

find_elements_by_css_selector(css_selector)

Finds a list of elements within this element’s children by CSS selector.

find_elements_by_id(id_)

Finds a list of elements within this element’s children by ID. Will return a list of webelements if found, or an empty list if not.

find_elements_by_link_text(link_text)

Finds a list of elements within this element’s children by visible link text.

find_elements_by_name(name)

Finds a list of elements within this element’s children by name.

find_elements_by_tag_name(name)

Finds a list of elements within this element’s children by tag name.

find_elements_by_xpath(xpath)

Finds elements within the element by xpath.

get_attribute(name)

Gets the given attribute or property of the element.

Example:

# Check if the "active" CSS class is applied to an element.
is_active = "active" in target_element.get_attribute("class")

save_screenshot(filename)

Saves a screenshot of the current window to a PNG image file. Returns False if there is any IOError, else returns True.

send_keys(*value)

Simulates typing into the element.

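form_textfield = driver.find_element_by_name('username')
form_textfield.send_keys("admin")

from selenium.webdriver.common.keys import Keys

# `search` is a previously located search box: type the query, then press the down-arrow key
search.send_keys("海贼王", Keys.ARROW_DOWN)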

This can also be used to set file inputs.

file_input = driver.find_element_by_name('profilePic')
file_input.send_keys("path/to/profilepic.gif")

Example: logging in to Zhihu with Selenium

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.zhihu.com/')

# Click the login button
driver.find_element_by_xpath('//*[@id="root"]/div/main/div/div/div/div[2]/div[2]/span').click()
time.sleep(2)

# Enter the username
username = driver.find_element_by_name("username")
username.send_keys('18588403840')
time.sleep(2)

# Enter the password
password = driver.find_element_by_name("password")
password.send_keys('Changeme_123')
time.sleep(8)

# Submit the login form
driver.find_element_by_xpath('//*[@id="root"]/div/main/div/div/div/div[2]/div[1]/form/button').click()

# After logging in, fetch a page that requires the login
driver.get('https://www.zhihu.com/people/zuo-zai-fen-tou-diao-xi-gui-82/activities')
print(driver.page_source)

# The cookies can be read once logged in
# print(driver.get_cookies())


# Newer versions of Zhihu have anti-scraping measures. If the approach above cannot log in, use a third-party login instead.
# Switch to the login page
driver.find_element_by_xpath(".//*[@class='SignContainer-switch']/span").click()

# Click "log in with a social network account"
driver.find_element_by_xpath(".//*[@class='Login-socialLogin']/button").click()
# Click QQ login
driver.find_element_by_xpath(".//*[@class='Login-socialButtonGroup']/button[3]").click()

time.sleep(15)  # increase this if 15 seconds is not enough
driver.refresh()  # refresh after the 15-second wait

# After logging in
# Read the cookies
print(driver.get_cookies())
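
The cookies read from the browser are often handed to a plain HTTP client afterwards. The sketch below shows one possible conversion; the helper names cookies_to_dict and cookie_header and the sample data are hypothetical, and the only assumption about Selenium is that get_cookies() returns a list of dicts with "name" and "value" keys.

```python
def cookies_to_dict(selenium_cookies):
    """Convert the list returned by driver.get_cookies() to {name: value}."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


def cookie_header(selenium_cookies):
    """Build the value of an HTTP Cookie header from the same list."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


# Hypothetical sample shaped like the output of driver.get_cookies().
sample = [
    {"name": "session_id", "value": "abc123", "domain": ".example.com"},
    {"name": "lang", "value": "zh", "domain": ".example.com"},
]

print(cookies_to_dict(sample))
print(cookie_header(sample))
```

With requests, the dict form can be passed as requests.get(url, cookies=cookies_to_dict(driver.get_cookies())).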

Setting a proxy in Selenium

from selenium import webdriver
chromeOptions = webdriver.ChromeOptions()

# Set the proxy
# Note: there must be no spaces around "=", so this is wrong: --proxy-server = http://202.20.16.82:10152
chromeOptions.add_argument("--proxy-server=http://10.3.132.6:808")
browser = webdriver.Chrome(options=chromeOptions)

# Load a page and inspect the response to check whether the proxy is working
browser.get("https://blog.csdn.net/zwq912318834/article/details/78626739")
print(browser.page_source)

# Quit and clear the browser cache
# browser.quit()

Exercise: log in to QQ Zone (Qzone) with Selenium

Hint:
    login = driver.find_element_by_id('login_frame')
    # The login form sits inside an iframe, so switch into it first
    driver.switch_to.frame(login)

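PhantomJS, a browser without a user interface

Development has been discontinued.

headless

Headless Chrome is Chrome running without a user interface. It can run programs that use every feature Chrome supports without opening a browser window. Compared with a regular browser, Headless Chrome is more convenient for testing web applications, taking screenshots of sites, and scraping data. Compared with older tools such as PhantomJS, it is closer to a real browser environment.

Headless Chrome is developed by Google's Chrome team. Unlike PhantomJS, which is built on QtWebKit, it runs the Chrome engine itself. The team has said it will focus on developing this project.

Make sure your Chrome version is 60 or later.

Configuration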

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")  # run Chrome without a user interface
chrome_options.add_argument('--disable-gpu')  # disable the GPU

driver = webdriver.Chrome(options=chrome_options)

XPath

XPath is the XML Path Language, a language for addressing parts of an XML document (XML is a subset of the Standard Generalized Markup Language). XPath is based on the tree structure of XML, which has different node types, including element, attribute, and text nodes, and it provides the ability to locate nodes in that tree.

What is XPath?

XPath uses path expressions to navigate in XML documents
XPath contains a standard function library
XPath is a major element in XSLT
XPath is a W3C standard

Using XPath

pip install lxml

import lxml
from lxml import etree

XPath Helper extension

Chrome extension site: http://www.cnplugins.com/

Add the extension.

Press Ctrl + Shift + X to open or close the extension.

XPath terminology

Node

In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes. XML documents are treated as trees of nodes. The root of the tree is called the document node or root node.

Look at the following XML document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>

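Atomic value

Atomic values are nodes with no children or parent.

Item

Items are atomic values or nodes.

Node relationships

Parent

Each element and attribute has one parent.

Children

Element nodes may have zero, one, or more children.

Sibling

Nodes that have the same parent.

Ancestor

A node's parent, parent's parent, and so on.

Descendant

A node's children, children's children, and so on.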
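
These relationships can be checked on the bookstore document shown earlier. The sketch below is an illustration only: it swaps in Python's standard xml.etree.ElementTree module rather than lxml, and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

xml_text = """
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
"""

root = ET.fromstring(xml_text)  # the bookstore element
book = root.find("book")        # a child of bookstore
title = book.find("title")      # a child of book

children = [child.tag for child in book]                     # children of book
siblings = [c.tag for c in book if c is not title]           # siblings of title
descendants = [e.tag for e in root.iter() if e is not root]  # descendants of bookstore
parent_of_title = root.find(".//title/..")                   # ".." steps up to the parent

print(children)
print(siblings)
print(descendants)
print(parent_of_title.tag)
```

Here book is the parent of title, author, year, and price; those four are siblings of one another; and bookstore is an ancestor of all of them.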

XPath syntax

The examples below use the following XML document:

<?xml version="1.0" encoding="UTF-8"?>

<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>

<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>

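Selecting nodes

XPath uses path expressions to select nodes in an XML document. A node is selected by following a path, or steps. The most useful path expressions are listed below:

Expression	Description
/	Selects from the root node; between steps, selects child nodes.
//	Selects nodes in the document from the current node that match the selection, no matter where they are.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.

The table below lists some path expressions and their results:

Path expression	Result
/bookstore	Selects the root element bookstore. Note: if the path starts with a slash ( / ), it always represents an absolute path to an element.
/bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, no matter where they are in the document.
/bookstore//book	Selects all book elements that are descendants of the bookstore element, no matter where they are under bookstore.
//@lang	Selects all attributes named lang.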
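
A few of the path expressions above can be tried out directly. The sketch below is an illustration under stated assumptions: it uses the standard xml.etree.ElementTree module, which supports only a subset of XPath and evaluates paths relative to an element, so /bookstore/book is written as ./book on the root and //book as .//book. With lxml, the expressions from the table can be passed to xpath() unchanged.

```python
import xml.etree.ElementTree as ET

xml_text = """
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""

root = ET.fromstring(xml_text)  # root is the bookstore element

# /bookstore/book : book elements that are children of bookstore
books = root.findall("./book")

# //book : book elements anywhere below the current node
all_books = root.findall(".//book")

# //title : every title element, read here for its text
titles = [t.text for t in root.findall(".//title")]

# //@lang : ElementTree cannot return attributes directly, so the values
# are read from the elements that carry the attribute.
langs = [e.get("lang") for e in root.findall(".//*[@lang]")]

print(len(books), len(all_books))
print(titles)
print(langs)
```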

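Predicates

Predicates are used to find a specific node or a node that contains a specific value.

Predicates are always embedded in square brackets.

The table below lists some path expressions with predicates and their results:

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of the bookstore element.
/bookstore/book[last()]	Selects the last book element that is a child of the bookstore element.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of the bookstore element.
/bookstore/book[position()<3]	Selects the first two book elements that are children of the bookstore element.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements that have a lang attribute with the value eng.
/bookstore/book[price>35.00]	Selects all book elements of the bookstore element that have a price element with a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects all title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00.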
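
Some of these predicates can be run against the bookstore document. The following is a sketch, again using the standard xml.etree.ElementTree module as a stand-in for lxml. ElementTree understands positional and attribute predicates but not numeric comparisons, so the price>35.00 filter is done in Python here; with lxml the predicate could be written inside the path.

```python
import xml.etree.ElementTree as ET

xml_text = """
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""

root = ET.fromstring(xml_text)  # root is the bookstore element

first = root.find("./book[1]/title").text          # /bookstore/book[1]
last = root.find("./book[last()]/title").text      # /bookstore/book[last()]
with_lang = root.findall(".//title[@lang='eng']")  # //title[@lang='eng']

# Equivalent of /bookstore/book[price>35.00]/title, filtered in Python.
expensive = [
    book.find("title").text
    for book in root.findall("./book")
    if float(book.find("price").text) > 35.00
]

print(first, "|", last)
print(len(with_lang))
print(expensive)
```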

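Selecting unknown nodes

XPath wildcards can be used to select unknown XML elements.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches any node of any kind.

The table below lists some path expressions and their results:

Path expression	Result
/bookstore/*	Selects all the child elements of the bookstore element.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute of any kind.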

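Selecting several paths

By using the "|" operator in a path expression, you can select several paths.

The table below lists some path expressions and their results:

Path expression	Result
//book/title | //book/price	Selects all the title and price elements of all book elements.
//title | //price	Selects all the title and price elements in the document.
/bookstore/book/title | //price	Selects all the title elements of the book elements of the bookstore element, and all the price elements in the document.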

from lxml import etree

htmltext = '''
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
'''

# html = etree.parse("filename.html")  # alternative: parse from a file
html = etree.HTML(htmltext)  # parse directly from a string

print(html.xpath("//li/@class"))  # class attribute of every li
print(html.xpath("//li/@text"))  # empty here: no li has a text attribute
print(html.xpath("//li/a"))  # the five a elements under li, one element each
print(html.xpath("//li/a/@href"))  # href of every a under li
print(html.xpath("//li/a/@href='link3.html'"))  # True if any href equals link3.html
print(html.xpath("//li//span"))  # every span under li
print(html.xpath("//li//span/@class"))  # class of every span under li
print(html.xpath("//li/a//@class"))  # class attributes on a and its descendants
print(html.xpath("//li"))  # every li
print(html.xpath("//li[1]"))  # the first li
print(html.xpath("//li[last()]"))  # the last li
print(html.xpath("//li[last()-1]"))  # the second-to-last li
print(html.xpath("//li[last()-1]/a/@href"))  # href of the a in the second-to-last li
print(html.xpath("//*[@text='3']"))  # elements whose text attribute equals 3
print(html.xpath("//*[@text='3']/@class"))  # class of the elements whose text attribute equals 3
print(html.xpath("//*[@class='nimei']"))  # elements whose class equals nimei
print(html.xpath("//li/a/text()"))  # text content of every a
print(html.xpath("//li[3]/a/span/text()"))  # text of the span inside the third li's a

Example 1: scrape the number of job postings on the 51job (前程无忧) recruitment site

Example 2: scrape nationwide job postings on 51job (前程无忧) https://jobs.51job.com/

Exercise: scrape the Shanghai High People's Court site http://www.hshfy.sh.cn/shfy/gweb2017/ktgg_search.jsp

import requests

url = "http://www.hshfy.sh.cn/shfy/gweb2017/ktgg_search_content.jsp"

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
}
page = requests.get(url=url, headers=header)
print(page.content.decode())

Exercise: scrape Lianjia https://gz.lianjia.com/ershoufang/

Commonly used User-Agent pool

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
    "Opera/8.0 (Windows NT 5.1; U; en)",
    "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
]
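
A pool like this is typically used by picking a different User-Agent for each request. The sketch below shows one way to do that; the helper name random_headers is hypothetical, and ua_pool holds just two entries copied from the list above so that the example stays self-contained.

```python
import random

# Two entries copied from the pool above; in practice pass the full ua_list.
ua_pool = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
]


def random_headers(pool):
    """Return request headers with a User-Agent picked at random from pool."""
    return {"User-Agent": random.choice(pool)}


headers = random_headers(ua_pool)
print(headers)
```

The resulting dict can be passed as the headers argument of requests.get, calling random_headers again before each request.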
