[Study] Learning Web Scraping

Author: X_Ran_0a11 | Published: 2019-06-27 13:28

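Python Web Scraping Study (16): A First Look at Scrapy
Resources
Getting Started with Web Scraping
Python Web Scraping Study: Summary (1)
Python web scraping study, day 7: hands-on practice
Python Basic Web Scraping Index
Python web scraping study, day 5: selenium
Python web scraping study, day 6: IP pool
Python web scraping study, day 3: BeautifulSoup
Python web scraping study, day 4: extracting content with lxml + xpath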

https://zhuanlan.zhihu.com/p/379836932


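1. Fetching data

urllib2: part of the Python 2 standard library (in Python 3 its functionality lives in urllib.request)
requests: must be installed separately, but has a friendlier API
selenium: whereas requests fetches data by sending HTTP requests directly, selenium drives a real browser to fetch it, so it is slower.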

1.1 Commonly used requests features

https://docs.python-requests.org/zh_CN/latest/user/quickstart.html
https://blog.csdn.net/qq_41556318/article/details/86527763

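requests.get
GET and POST are both ways to talk to the server, but they are different HTTP request methods rather than different protocols. GET is the usual choice for fetching a page; POST is needed when the server expects data in the request body, such as a login form. Headers, query parameters and similar settings are passed as dicts; anything left unset falls back to the library's defaults.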

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get("http://httpbin.org/get", params=payload)
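
To see where each method puts the data, a request can be prepared locally without sending it. This is a minimal sketch: the `http://httpbin.org/post` URL and the reuse of the same payload for POST are assumptions made for illustration.

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# GET: params are encoded into the URL's query string; there is no body.
get_req = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()

# POST: data is form-encoded into the request body; the URL is unchanged.
post_req = requests.Request('POST', 'http://httpbin.org/post', data=payload).prepare()

print(get_req.url)    # the query string carries the payload
print(post_req.url)   # no query string here
print(post_req.body)  # the body carries the payload
```

Nothing is sent over the network here; `prepare()` only builds the request that `requests.get` or `requests.post` would send.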

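The headers argument is optional; if it is omitted, the defaults are used.
timeout limits how long to wait for the server to respond, in seconds.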

>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)
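
In a crawler it is usually more practical to catch the timeout than to let it abort the run. The helper below is a sketch only: the name `fetch_text` and the 5-second default are assumptions, not part of requests.

```python
import requests


def fetch_text(url, timeout=5):
    """Return the page text, or None if the server does not answer in time."""
    try:
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both connection and read timeouts.
        return None
    return response.text
```

A caller can then skip or retry a URL when `fetch_text(url)` returns None.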

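response.text
https://blog.csdn.net/qq_38900441/article/details/79946377
response.content holds the raw bytes, while response.text is the string obtained by decoding those bytes automatically. The automatic guess at the encoding is sometimes wrong, so for pages containing Chinese characters and the like you need to set the encoding yourself before the text displays correctly.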

import requests

response = requests.get('https://www.baidu.com')
response.encoding = 'utf-8'  # override the guessed encoding before reading .text
re_text = response.text
print(re_text)
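
What goes wrong with a bad encoding guess can be reproduced with plain bytes, no network needed. In this sketch the sample string is made up, and ISO-8859-1 stands in for the wrong guess.

```python
# The bytes a server might send for a UTF-8 page (what .content would hold).
raw = '你好，世界'.encode('utf-8')

# Decoding with the wrong codec yields mojibake; this mimics a bad guess.
wrong = raw.decode('iso-8859-1')

# Decoding with the right codec recovers the text, which is what setting
# response.encoding = 'utf-8' achieves before reading response.text.
right = raw.decode('utf-8')

print(repr(wrong))
print(right)
```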

JSON decoding
requests ships with a built-in JSON decoder for handling JSON responses

>>> import requests
 
>>> r = requests.get('https://api.github.com/events')
>>> r.json()
[{u'repository': {u'open_issues': 0, u'url': 'https://github.com/...
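
For a JSON body, `r.json()` gives much the same result as parsing the response text with the standard `json` module. The body below is a hypothetical, shortened stand-in for an API response, not real output from the URL above.

```python
import json

# A hypothetical, shortened response body; real API output has more fields.
body = '[{"type": "PushEvent", "repo": {"name": "octocat/hello"}}]'

events = json.loads(body)  # a list of dicts, like r.json() would return
print(events[0]['type'])
print(events[0]['repo']['name'])
```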

headers
Shows the headers of the response returned by the server

>>> r.headers
{
    'content-encoding': 'gzip',
    'transfer-encoding': 'chunked',
    'connection': 'close',
    'server': 'nginx/1.0.4',
    'x-runtime': '148ms',
    'etag': '"e1ca502697e5c9317743dc078f67693f"',
    'content-type': 'application/json'
}
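
The object behind `r.headers` is a case-insensitive dictionary, so a header can be looked up with any capitalization. A small sketch using the same class directly, with a single made-up header:

```python
from requests.structures import CaseInsensitiveDict

# Stand-in for r.headers; a real response carries many more entries.
headers = CaseInsensitiveDict({'content-type': 'application/json'})

print(headers['Content-Type'])
print(headers.get('CONTENT-TYPE'))
```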

status_code
https://www.stubbornhuang.com/555/
A successful request returns status code 200

>>> r = requests.get('http://httpbin.org/get')
>>> r.status_code
200
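
Instead of comparing against 200 by hand, `raise_for_status()` turns 4xx and 5xx codes into exceptions. The sketch below builds a bare `Response` locally so it runs without a network; the helper name `check` is an assumption for illustration.

```python
import requests


def check(status_code):
    """Build a bare Response locally and report how requests treats its status."""
    response = requests.Response()
    response.status_code = status_code
    try:
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx
    except requests.exceptions.HTTPError as exc:
        return f'error: {exc}'
    return 'ok'


print(check(200))
print(check(404))
```

In real code the same call goes on the object returned by `requests.get`.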

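session
Logging in with an account almost always requires a session to keep the login state across requests, since the site only redirects to the logged-in pages after a successful login and later requests must carry that state.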

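import requests
session = requests.Session()  # keeps cookies across requests
# url1, name and password are placeholders for the login URL and credentials
response = session.post(url1, data={"userName": name, "pwd": password})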
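
What "keeping state" means can be seen by preparing a request through a session without sending it. In this sketch the cookie is set by hand to stand in for one a login response would set; the cookie name, the User-Agent string and the example.com URL are all hypothetical.

```python
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'my-crawler/0.1'})  # hypothetical UA string
session.cookies.set('sessionid', 'abc123')  # stands in for a login cookie

# Any later request made through the session carries both automatically.
request = requests.Request('GET', 'http://example.com/profile')
prepared = session.prepare_request(request)

print(prepared.headers['User-Agent'])
print(prepared.headers.get('Cookie'))
```

In the real flow the cookie is stored when `session.post` logs in, and the following `session.get` calls reuse it.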


1.2 The selenium module
https://python-selenium-zh.readthedocs.io/zh_CN/latest/

Set up the browser and open the login URL

from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # Keys provides the special keyboard keys
driver = webdriver.Chrome()
driver.get(url)  # url is the address of the login page

Enter the credentials

name = driver.find_element(by='name',value="userName")
pwd = driver.find_element(by='name',value="pwd")

name.send_keys('yvegmn')
pwd.send_keys('yvegmnaa')

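Clicking
In principle you locate an element first and then act on it. A click can usually be replaced by sending the Enter key (Keys.ENTER), or you can call click() on the element.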

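driver.find_element(by='class name', value='layui-btn.layui-btn-fluid').click()
# With a class-name lookup the value must not contain spaces; replace each space in the class attribute with a dot.
# The older find_element_by_class_name helper was removed in newer Selenium 4 releases, hence the find_element(by=..., value=...) form.
# click() sometimes fails to take effect, in which case sending Enter is more convenient.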