Web Crawler 3

Author: 冬gua | Published 2018-03-21 21:48


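Using XPath to extract the data you need

W3School XPath tutorial: http://www.w3school.com.cn/xpath/index.asp

XPath uses path expressions to select a node or a set of nodes in an XML document. These path expressions look very much like the paths used in an ordinary computer file system.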
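As a small illustration of the path-expression syntax described above, here is a minimal sketch that runs a few XPath queries with lxml against a made-up HTML snippet. The snippet, its class names, and its URLs are assumptions chosen for illustration and are not taken from any real page.

```python
from lxml import etree

# Hypothetical HTML snippet, invented for this illustration only.
SAMPLE_HTML = """
<html><body>
  <div class="post">
    <a href="/p/1">First thread</a>
    <img class="BDE_Image" src="http://example.com/a.jpg">
  </div>
  <div class="post">
    <a href="/p/2">Second thread</a>
  </div>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# '//' matches nodes at any depth, '/' steps to a direct child,
# and '@href' selects the attribute value instead of the element.
links = tree.xpath('//div[@class="post"]/a/@href')

# text() selects the text content of the matched elements.
titles = tree.xpath('//div[@class="post"]/a/text()')

# A predicate in [...] filters elements by attribute value.
images = tree.xpath('//img[@class="BDE_Image"]/@src')

print(links)
print(titles)
print(images)
```

Each call to `xpath()` returns a plain Python list, so the results can be looped over directly.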

Example

import os
import uuid
from urllib.parse import urljoin

import requests
from lxml import etree

url_base = 'http://tieba.baidu.com/'
url1 = '%sf' % url_base

kw = input('Tieba name: ')
begin_page = int(input('Start page: '))
end_page = int(input('End page: '))

# Make sure the output directory exists before images are written into it
os.makedirs('./images', exist_ok=True)

for page in range(begin_page, end_page + 1):
    # pn is the thread offset; it advances by 50 for each list page
    params = {
        'kw': kw,
        'pn': (page - 1) * 50
    }
    response = requests.get(url=url1, params=params)
    content1 = response.content
    # with open('./tieba.html', 'wb') as file:
    #     file.write(content1)

    # Data processing
    content1 = content1.decode('utf-8')
    html1 = etree.HTML(content1)
    # Collect thread links from either of the two list layouts
    href_list = html1.xpath(
        '(//div[@class="threadlist_title pull_left j_th_tit "]/a|//div[@class="col2_right j_threadlist_li_right "]/a)/@href')

    for href in href_list:
        # Resolve href against the base URL (avoids a doubled slash)
        url2 = urljoin(url_base, href)
        print(url2)
        response2 = requests.get(url=url2)
        content2 = response2.content
        html2 = etree.HTML(content2)
        # URLs of the images embedded in the thread
        src_list = html2.xpath('//div/img[@class="BDE_Image"]/@src')

        for src in src_list:
            # Unique file name that keeps the extension of the source URL
            file_name = str(uuid.uuid1()) + src[src.rfind('.'):]
            response3 = requests.get(url=src)
            content3 = response3.content
            with open('./images/%s' % file_name, 'wb') as file:
                file.write(content3)
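To make the paging arithmetic in the script easier to follow, the sketch below builds the list-page URL for a few page numbers without sending any request. The helper name `list_page_url` and the constant `THREADS_PER_PAGE` are hypothetical, introduced only for this illustration. The step of 50 is taken from the `(page - 1) * 50` expression in the script, and the resulting string should be roughly what `requests` produces from the same `params` dictionary.

```python
from urllib.parse import urlencode

URL_BASE = 'http://tieba.baidu.com/'
THREADS_PER_PAGE = 50  # step implied by (page - 1) * 50 in the script


def list_page_url(kw, page):
    """Return the list-page URL for a 1-based page number (hypothetical helper)."""
    params = {'kw': kw, 'pn': (page - 1) * THREADS_PER_PAGE}
    # urlencode turns the dictionary into a query string such as kw=python&pn=0
    return '%sf?%s' % (URL_BASE, urlencode(params))


for page in range(1, 4):
    print(list_page_url('python', page))
```

Page 1 maps to an offset of 0, page 2 to 50, and page 3 to 100, which is why the script subtracts one from the page number before multiplying.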


Article title: Web Crawler 3

Article link: https://www.haomeiwen.com/subject/chinqftx.html
