The overall structure of the crawler:
1. parser (HTML parser)
2. downloader (page downloader)
3. url_manager (URL manager)
4. outputer (result writer)
5. spider_main (the "engine" that drives the other four)
How the crawler runs:
1. root_url is the root URL, the very first URL to crawl.
(If you want to crawl the Baidu Baike Python page, then root_url = "http://baike.baidu.com/item/python".)
2.1 Given the root_url, call the url_manager's add_new_url method to add it to the new_urls set.
2.1.1 url_manager
The url_manager holds two sets, a new one and an old one; we keep crawling from the new set. It acts as one big URL pool: we keep taking URLs out of it, crawling each page, and putting every URL discovered on that page back into the pool.
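Below is a minimal sketch of such a url_manager. The method names (add_new_url, add_new_urls, has_new_url, get_new_url) come from the spider_main code later in this post; the duplicate check against both sets is my reading of the two-set design, so treat the details as assumptions:

class UrlManager(object):
    def __init__(self):
        self.new_urls = set()  # URLs still waiting to be crawled
        self.old_urls = set()  # URLs that have already been crawled

    def add_new_url(self, url):
        # skip URLs we have already queued or crawled, so nothing is fetched twice
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # pop an arbitrary URL from the new set and move it to the old set
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url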
2.2 A while loop drives the crawl; the termination condition is whether new_urls still contains any URL.
2.2.1 Get a URL from the url_manager with get_new_url (see the code above).
2.2.2 Use that URL to download the page with the downloader.
The downloader uses urllib (Python 3 has no urllib2) and simply returns the downloaded page.
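A minimal sketch of such a downloader, using Python 3's urllib.request; the class name DownLoader matches the spider_main code below, while the status-code check is a common safeguard rather than something spelled out here:

import urllib.request

class DownLoader(object):
    def download(self, url):
        if url is None:
            return None
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:
            return None  # the request failed
        return response.read()  # the raw page content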
2.2.3 Feed the downloaded content together with its URL to the parser, which parses out new URLs plus the data we need
(for example, the current URL and the parsed title and summary).
The parser's imports (urljoin is needed to turn the page's relative links into full URLs):

from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin
2.2.4 The new URLs are added to the new_urls set; the new data is appended to a datas list, where each element is a dict.
A sketch of such a parser follows.
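Here is a minimal sketch of the parser, written against the Baidu Baike layout this tutorial targeted; the /item/ link pattern and the 'lemmaWgt-lemmaTitle-title' and 'lemma-summary' class names are assumptions about that layout and may need adjusting if the page has changed:

from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin

class Parser(object):
    def _get_new_urls(self, page_url, soup):
        # entry links on the page look like /item/xxx (assumed pattern)
        new_urls = set()
        for link in soup.find_all('a', href=re.compile(r'/item/')):
            new_urls.add(urljoin(page_url, link['href']))
        return new_urls

    def _get_new_data(self, page_url, soup):
        # one dict per page: the current URL, its title, and its summary
        res_data = {'url': page_url}
        title_node = soup.find('dd', class_='lemmaWgt-lemmaTitle-title')
        if title_node is not None and title_node.h1 is not None:
            res_data['title'] = title_node.h1.get_text()
        summary_node = soup.find('div', class_='lemma-summary')
        if summary_node is not None:
            res_data['summary'] = summary_node.get_text()
        return res_data

    def parse(self, page_url, content):
        if page_url is None or content is None:
            return None, None
        soup = BeautifulSoup(content, 'html.parser')
        return self._get_new_urls(page_url, soup), self._get_new_data(page_url, soup)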
outputer: while crawling, each page's dict is collected, and after the crawl finishes the craw method calls output_info as its last step.
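A minimal sketch of such an outputer; the class and method names (OutPuter, collect_data, output_info) come from the spider_main code below, but writing the dicts into an output.html table is an assumption about the output format:

class OutPuter(object):
    def __init__(self):
        self.datas = []  # each element is a dict like {'url': ..., 'title': ..., 'summary': ...}

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_info(self):
        # dump everything collected into a simple HTML table (assumed format)
        with open('output.html', 'w', encoding='utf-8') as fout:
            fout.write('<html><body><table>')
            for data in self.datas:
                fout.write('<tr>')
                fout.write('<td>%s</td>' % data.get('url', ''))
                fout.write('<td>%s</td>' % data.get('title', ''))
                fout.write('<td>%s</td>' % data.get('summary', ''))
                fout.write('</tr>')
            fout.write('</table></body></html>')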
Here is the full spider_main (the "engine"):
from baidubaike_spider import url_manager, downloader, parser, outputer

class SpiderMain(object):
    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.downloader = downloader.DownLoader()
        self.parser = parser.Parser()
        self.outputer = outputer.OutPuter()

    def craw(self, root_url):
        self.urls.add_new_url(root_url)
        count = 1
        while self.urls.has_new_url():  # keep crawling while the url_manager still has new URLs
            try:  # some URLs may have changed or fail to download, so catch the exception
                new_url = self.urls.get_new_url()  # take one URL out of the pool
                print('craw %d: %s' % (count, new_url))  # report progress: which URL, how many so far
                downloaded_content = self.downloader.download(new_url)  # download the page content
                # parse out the newly found URLs and the data we want
                new_urls, new_data = self.parser.parse(new_url, downloaded_content)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                count += 1
                if count == 10:  # stop after a handful of pages for this demo
                    break
            except Exception:
                print('craw fail')
        self.outputer.output_info()  # write out everything collected

if __name__ == "__main__":
    root_url = "http://baike.baidu.com/item/python"
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)