Web Crawler Basics

Author: Donald_32e5 | Published: 2019-04-28 16:48

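I. Definition

A web crawler is a program that automatically collects information from the internet according to a set of rules. It works by imitating a browser: it sends network requests and receives the responses.
In principle, anything a browser (the client) can do, a crawler can do as well.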

II. Workflow

III. Basic usage of requests

1. requests is a Python HTTP library: it sends network requests and returns the response data


import requests

# Target URL
url = 'https://www.baidu.com'

# Send a GET request to the target URL
response = requests.get(url)

# Print the response body as text
print(response.text)

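2. Common attributes of response

response.text: the response body as str
response.content: the response body as bytes
response.status_code: the response status code
response.request.headers: the headers of the request that produced this response
response.headers: the response headers
response.request._cookies: the cookies sent with the request that produced this response
response.cookies: the cookies set by the response (after Set-Cookie has been applied)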
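
The attributes above can be inspected without touching a real site. The following is a minimal sketch, assuming a throwaway local server: DemoHandler, the "session" cookie, and the "demo-crawler" User-Agent are all made up for illustration and are not part of the original article.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a real website."""

    def do_GET(self):
        body = "你好, crawler".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Set-Cookie", "session=abc123")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment
response = session.get(url, headers={"User-Agent": "demo-crawler"})

print(response.status_code)                    # status code
print(type(response.text))                     # body as str
print(type(response.content))                  # body as bytes
print(response.headers["Content-Type"])        # a response header
print(response.request.headers["User-Agent"])  # a header of the request we sent
print(response.cookies.get("session"))         # a cookie set by the response

session.close()
server.shutdown()
server.server_close()
```

Against a real site the same attribute reads apply to the object returned by requests.get(url).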

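3. Difference between response.text and response.content

response.text
- Type: str
- Decoding: requests makes an educated guess at the text encoding based on the HTTP headers
- To change the encoding: response.encoding = 'gbk'
response.content
- Type: bytes
- Decoding: none applied
- To decode it yourself: response.content.decode('utf-8')
Common ways to get the page source:
1. response.content.decode()
2. response.content.decode('GBK')
3. response.text
Trying these three in order handles the decoding of most pages in practice.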
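
The "try UTF-8 first, then GBK" order described above can be wrapped in a small helper. This is only a sketch: decode_body is a hypothetical name, not part of requests, and it works on plain bytes so it can be fed response.content.

```python
def decode_body(raw: bytes, encodings=("utf-8", "gbk")) -> str:
    """Try each candidate encoding in order; return the first clean decode."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and replace bad bytes.
    return raw.decode(encodings[0], errors="replace")


# Simulated response bodies, standing in for response.content.
utf8_page = "爬虫".encode("utf-8")
gbk_page = "爬虫".encode("gbk")

print(decode_body(utf8_page))
print(decode_body(gbk_page))
```

With a real response the call would be decode_body(response.content). Falling back to response.text remains an option when neither encoding fits.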

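4. Sending requests with parameters (two ways)

# Option 1: pass the query parameters through the params argument
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# This is the URL we ultimately want to request
# url = 'https://www.baidu.com/s?wd=python'

# The result is the same with or without the trailing question mark
url = 'https://www.baidu.com/s?'

# The query parameters are a dict, i.e. wd=python
kw = {'wd': 'python'}

# Send the request with the query parameters and get the response
response = requests.get(url, headers=headers, params=kw)

# With several query parameters, params is a dict with several key-value pairs, e.g. '?wd=python&a=c' --> {'wd': 'python', 'a': 'c'}

print(response.content)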

# Option 2: request a URL that already contains the parameters
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

url = 'https://www.baidu.com/s?wd=python'

# kw = {'wd': 'python'}

# The URL already contains the query parameters, so params is not needed here
response = requests.get(url, headers=headers)
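
To see what URL the params argument actually produces, the request can be prepared without being sent. A minimal sketch, reusing the article's example parameters; nothing here goes over the network.

```python
import requests

kw = {"wd": "python", "a": "c"}

# Build (but do not send) the request, then inspect the final URL.
with_mark = requests.Request("GET", "https://www.baidu.com/s?", params=kw).prepare()
without_mark = requests.Request("GET", "https://www.baidu.com/s", params=kw).prepare()

print(with_mark.url)
print(without_mark.url)
```

Both prepared requests end up with the same URL, which matches the note that the trailing question mark makes no difference.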

5. Sending POST requests with requests

Usage:

# data is a dict of form fields
response = requests.post("http://www.baidu.com/", data=data, headers=headers)
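
What the data dict turns into can be checked by preparing the POST request without sending it. A sketch only: the form field "kw" and the short User-Agent are assumptions chosen for illustration.

```python
import requests

data = {"kw": "python"}                  # hypothetical form fields
headers = {"User-Agent": "Mozilla/5.0"}  # shortened for the example

# Build (but do not send) the POST request.
prepared = requests.Request(
    "POST", "http://www.baidu.com/", data=data, headers=headers
).prepare()

print(prepared.method)                   # HTTP method
print(prepared.headers["Content-Type"])  # how the body is encoded
print(prepared.body)                     # the form-encoded body
```

A dict passed as data is form-encoded, so the server receives it the same way it would receive an HTML form submission.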

6. Using proxies

Usage:

# proxies is a dict keyed by URL scheme
proxies = {
    "http": "http://127.0.0.1:12333",
    "https": "https://127.0.0.1:12333",
}
requests.get("http://www.baidu.com", proxies=proxies)
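
The proxies dict is matched against the scheme of the URL being requested. The sketch below uses select_proxy from requests.utils to show which entry would be picked for a given URL. It only performs the lookup and sends nothing, and the proxy addresses are the article's placeholders, not working proxies.

```python
from requests.utils import select_proxy

proxies = {
    "http": "http://127.0.0.1:12333",
    "https": "https://127.0.0.1:12333",
}

# An http:// URL picks the "http" entry, an https:// URL the "https" entry.
http_choice = select_proxy("http://www.baidu.com", proxies)
https_choice = select_proxy("https://www.baidu.com", proxies)

print(http_choice)
print(https_choice)
```

If a scheme has no entry in the dict and there is no "all" key, no proxy is selected for that request.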