Web Scraping Basics
01 Install the requests module
C:\Users\wu-chao>pip install requests
02 Install the lxml module
C:\Users\wu-chao>pip install lxml
Once installation finishes, the corresponding packages (folders) are created under <Python install directory>\Lib\site-packages:
lxml
lxml-4.1.1.dist-info
requests
requests-2.18.4.dist-info
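Instead of browsing site-packages by hand, the installed versions can also be checked from Python. This is a minimal sketch using the standard-library importlib.metadata module (Python 3.8 or later); the helper name installed_version is made up for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the two packages this tutorial relies on.
for name in ("requests", "lxml"):
    version = installed_version(name)
    print(name, version if version is not None else "not installed")
```

The version numbers printed depend on when pip was run, so they will probably differ from the 2.18.4 and 4.1.1 folders listed above.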
03 Create a new Python file, e.g. pachong.py, and import the requests and lxml modules at the top of the file
import requests
from lxml import etree
04 Declare the address and the request headers dictionary (key-value pairs)
addr = "https://www.douban.com"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"}
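To see what the headers dictionary changes, the request can be built without being sent. The sketch below prepares the request through a requests.Session and compares the User-Agent it would send with the library default; nothing is transmitted, so it runs offline.

```python
import requests

addr = "https://www.douban.com"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"}

# Build the request without sending it, so no network access is needed.
session = requests.Session()
prepared = session.prepare_request(requests.Request("GET", addr, headers=header))

# The custom value replaces the library default ("python-requests/<version>").
print("default:", requests.utils.default_user_agent())
print("sent:   ", prepared.headers["User-Agent"])
```

Whether a given site treats the default User-Agent differently from a browser-like one is up to that site; this only shows which value would go out.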
05 Send the request and store the response in a variable, e.g. resp
resp = requests.get(addr, headers=header)
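The call above waits indefinitely if the server never answers, and it does not complain about error status codes. A slightly more defensive variant might look like the sketch below. The helper name fetch_html, the 10-second timeout, and the optional session parameter are assumptions made for this example, not part of the original script.

```python
import requests


def fetch_html(url, headers=None, timeout=10, session=None):
    """Return the response body as bytes, raising on HTTP error statuses.

    `session` is a hypothetical hook: pass a requests.Session to reuse
    connections, or leave it as None to use the module-level requests.get.
    """
    client = session if session is not None else requests
    resp = client.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses.
    resp.raise_for_status()
    return resp.content
```

It would be called as fetch_html(addr, headers=header), and the returned bytes can be passed to etree.HTML in the next step.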
06 Parse the response body into an lxml element object
htmlElement = etree.HTML(resp.content)
07 Extract content with an XPath selector; the result is a list
selectStr = "//ul[@class='time-list']/li/a/img/@src"
srcList = htmlElement.xpath(selectStr)
08 Loop over the list and print each item
for item in srcList:
    print(item)
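Steps 06 to 08 can be tried without touching the network by parsing a small HTML string. The markup below is invented for illustration and only mimics the structure the selector expects; the real page may look different.

```python
from lxml import etree

# Made-up sample markup; the real page structure may differ.
sample_html = """
<html><body>
  <ul class="time-list">
    <li><a href="/a"><img src="https://example.com/a.jpg"></a></li>
    <li><a href="/b"><img src="https://example.com/b.jpg"></a></li>
  </ul>
  <ul class="other">
    <li><a href="/c"><img src="https://example.com/c.jpg"></a></li>
  </ul>
</body></html>
"""

root = etree.HTML(sample_html)

# Only images inside <ul class="time-list"> match; the "other" list is skipped.
srcs = root.xpath("//ul[@class='time-list']/li/a/img/@src")
for src in srcs:
    print(src)

# A selector that matches nothing yields an empty list rather than an error.
empty = root.xpath("//ul[@class='no-such-class']/li/a/img/@src")
print(empty)
```

If the real script prints nothing, an empty result like the second query is the likely cause: the page simply has no element matching the selector.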
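Complete code: pachong.py
import requests
from lxml import etree

# Target address and request headers (a browser-like User-Agent)
addr = "https://www.douban.com"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"}
# Send the GET request and keep the response
resp = requests.get(addr, headers=header)
# Parse the response body into an lxml element object
htmlElement = etree.HTML(resp.content)
# Select the src attribute of every matching image; xpath() returns a list
selectStr = "//ul[@class='time-list']/li/a/img/@src"
srcList = htmlElement.xpath(selectStr)
# Print each extracted value
for item in srcList:
    print(item)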