General Steps of a Web Crawler and a Pseudocode Implementation

Author: 八神苍月 | Source: published 2017-06-05 01:18, read 98 times

Taking the scraping of pictures from an image site as an example, the pseudocode looks like this:

start
set baseurl
request and parse
get_page_num
for 1:page_num    # iterate over every listing page
    request and parse
    get_each_page_link_url_list
    get_each_page_link_url_num
    get_each_page_title_list
    makedir(title)
    for 1:link_url_num    # iterate over every album on the page
        request and parse
        get_image_num
        for 1:image_num    # iterate over every image in the album
            get_image_show_page_url
            request and parse
            get_image_real_href
            get_image_real_filename
            save(image, filename)
        end
    end
end
close
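
The outline above maps onto three nested Python loops. The sketch below is one way to lay that out, with the network and parsing steps passed in as callables so the loop structure can be tried without touching a real site. The names crawl, list_pages, list_albums, and list_images are made up for this illustration, and the in-memory "site" is fake data.

```python
from typing import Callable, Iterator, List, Tuple


def crawl(
    list_pages: Callable[[], List[str]],
    list_albums: Callable[[str], List[Tuple[str, str]]],
    list_images: Callable[[str], List[Tuple[str, str]]],
) -> Iterator[Tuple[str, str, str]]:
    """Yield (album_title, image_url, filename) for every image found.

    Each callable stands in for one "request and parse" + "get_*" step
    of the outline; all three are hypothetical placeholders.
    """
    for page_url in list_pages():  # every listing page
        for album_title, album_url in list_albums(page_url):  # every album
            for image_url, filename in list_images(album_url):  # every image
                yield album_title, image_url, filename


# A fake in-memory site, used only to exercise the loop structure.
site = {
    "pages": ["p1", "p2"],
    "albums": {"p1": [("cats", "a1")], "p2": [("dogs", "a2")]},
    "images": {
        "a1": [("u1", "1.jpg"), ("u2", "2.jpg")],
        "a2": [("u3", "1.jpg")],
    },
}

results = list(
    crawl(
        lambda: site["pages"],
        lambda page: site["albums"][page],
        lambda album: site["images"][album],
    )
)
for row in results:
    print(row)
```

Keeping the loops in a generator separates "what to visit" from "what to do with each image", so the saving step can be swapped or tested on its own.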

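That is essentially all of the core code, and it can be written in about half an hour. The key is to stay clear-headed while analyzing the page structure and to draw a flowchart beforehand. Three nested for loops are usually enough; occasionally a more complex site needs four levels of looping.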

The variables in bold are the values we have to extract from the web pages.
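
Some of these extracted values, such as the image count behind get_image_num, arrive as a text label rather than a bare number. As a small sketch, assuming the label contains the count as its first run of digits, a regular expression is less fragile than slicing the string at fixed positions. The function name parse_count and the sample labels are hypothetical; the real wording depends on the site.

```python
import re


def parse_count(label: str) -> int:
    """Return the first integer found in a text label."""
    match = re.search(r"\d+", label)
    if match is None:
        raise ValueError(f"no number found in {label!r}")
    return int(match.group())


# Hypothetical label texts, not taken from any real page.
print(parse_count("共 25 张"))
print(parse_count("Page 1 of 12"))
```

Raising an error when no number is present makes a layout change on the site show up immediately instead of silently producing an empty loop.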

Core code:

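import requests
from bs4 import BeautifulSoup

# Not shown in the original; a browser-like User-Agent is assumed here.
headers = {'User-Agent': 'Mozilla/5.0'}

baseurl = 'http://www.66bb.org/ArtDD/'
# This is the first listing page. Still to work out from the site itself: how
# the second and third pages are addressed, and how to find the page count.

html = requests.get(baseurl, headers=headers)
html.encoding = 'gb2312'
soup = BeautifulSoup(html.text, 'lxml')
total_pages = get_total_pages(soup)  # helper not shown here

for page in range(1, total_pages + 1):  # + 1 so the last page is included
    # The request still targets baseurl on every pass; substitute the URL of
    # listing page `page` once the site's pagination pattern is known.
    html = requests.get(baseurl, headers=headers)
    html.encoding = 'gb2312'
    soup = BeautifulSoup(html.text, 'lxml')

    # Find every album link and album title on this page.
    all_url = soup.find(class_='fzltp').find_all('li')

    # Check that the href and title lookups return what is expected.
    for item in all_url:
        # Extract the album link and title; the title is used as the folder name.
        href = item.find('a')['href']
        dirname = item.find('img')['alt']

        baseurl_1 = 'http://www.66bb.org' + href
        html_1 = requests.get(baseurl_1, headers=headers)
        html_1.encoding = 'gb2312'
        soup_1 = BeautifulSoup(html_1.text, 'lxml')

        # Read the photo count directly (instead of a get_total_pages_1(soup_1)
        # helper); with it, the URL of every photo page can be computed.
        count_text = soup_1.find('div', class_='tpm01').find('font', color='blue').get_text()
        total_pages_1 = int(count_text[2:-2])

        for page_1 in range(1, total_pages_1 + 1):
            # Build the URL of the next level down, for example
            # 'http://www.66xx.org/ArtDD/1894/' + str(page_1) + '.html'
            baseurl_2 = baseurl_1 + str(page_1) + '.html'
            html_2 = requests.get(baseurl_2, headers=headers)
            html_2.encoding = 'gb2312'
            soup_2 = BeautifulSoup(html_2.text, 'lxml')

            # Extract the address and name of the main image.
            img = soup_2.find('div', class_='imgbox').find('img')
            image_href = img['src']
            image_name = img['alt']
            saveimage(image_href, image_name)  # helper not shown here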
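
The saveimage helper is called above but its body is not shown. The following is only a guess at the kind of work such a helper has to do once the image bytes have been downloaded: clean the name so it is a legal file name, make sure the folder exists, and write the bytes. The names safe_filename and save_bytes, the ".jpg" default extension, and the "image" fallback name are all assumptions made for this sketch.

```python
import os
import re


def safe_filename(name: str, default: str = "image") -> str:
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", name).strip("_.")
    return cleaned or default


def save_bytes(data: bytes, name: str, dirname: str = ".", ext: str = ".jpg") -> str:
    """Write image bytes to dirname/<cleaned name><ext> and return the path."""
    os.makedirs(dirname, exist_ok=True)  # create the album folder if needed
    path = os.path.join(dirname, safe_filename(name) + ext)
    with open(path, "wb") as f:
        f.write(data)
    return path
```

With requests, the bytes would come from the response body, for example save_bytes(requests.get(image_href, headers=headers).content, image_name, dirname). Keeping the download outside the helper means the file-writing part can be checked without any network access.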

Article link: https://www.haomeiwen.com/subject/npobfxtx.html
