For a simple website like this one, if you want to crawl data quickly, Scrapy deserves a mention. It is a crawler framework built on Twisted and implemented in pure Python; by customizing just a few modules you can easily put together a crawler that scrapes page content and images of all kinds.
1. Scrapy Architecture
Scrapy uses the Twisted asynchronous networking library to handle network communication. The architecture is clean and exposes a range of middleware interfaces, so it can flexibly accommodate all kinds of requirements. The overall architecture is shown in the figure below:
(Figure: Scrapy architecture)
2. Workflow
<ul>
<li>Create a Scrapy project</li>
<li>Define the Items to extract (a minimal sketch follows this list)</li>
<li>Write a spider that crawls the site and extracts the Items</li>
<li>Write an Item Pipeline to store the extracted Items (i.e., the data)</li>
</ul>
The details of each step are covered in the official Scrapy tutorial and are not repeated here.
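The spider below assigns to a Zz91Item whose definition is not shown in this post. Here is a minimal sketch, reconstructed from the field names used in parse() and placed, by Scrapy convention, in the project's items.py:
<pre><code># items.py -- a sketch of the Item the spider below assumes;
# the field names are copied from the assignments in parse().
import scrapy

class Zz91Item(scrapy.Item):
    fdp1_url = scrapy.Field()          # listing detail-page URL
    fdp1_title = scrapy.Field()        # listing title
    fdp1_content = scrapy.Field()      # listing body text
    fdp1_contact_url = scrapy.Field()  # contact-page URL
    fdp1_company = scrapy.Field()      # company name
</code></pre>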
3. Implementation
The requirement is to crawl the ZZ91 data shown in the figure below (the parts marked in red). Looking at the URLs, the pattern is not hard to spot: the trailing number is the page index. Assuming the earlier setup steps are already done, we now write the spider. Start by inspecting the page's markup, then parse it with BeautifulSoup, XPath, Scrapy selectors, or regular expressions, whichever you prefer.
(Figure: URL pattern across pages)
The core spider code is as follows:
<pre><code># -*- coding: utf-8 -*-
# Imports added for completeness; adjust the Zz91Item import path
# to your project's items module.
import urllib
import scrapy
from bs4 import BeautifulSoup
from zz91.items import Zz91Item


class DemoSpider(scrapy.spiders.Spider):
    name = "fdp1"
    start_urls = [
        "http://jiage.zz91.com/s/e5ba9fe794b5e793b6-1/1.html",
    ]
    # The trailing number in the URL is the page index, so the
    # remaining start URLs can be generated from the pattern.
    base = "http://jiage.zz91.com/s/e5ba9fe794b5e793b6-1/"
    for i in xrange(2, 4):
        start_urls.append(base + str(i) + '.html')

    def parse(self, response):
        soup1 = BeautifulSoup(response.body, "lxml")
        divs = soup1.findAll('div', {'class': 'l-main'})
        for div in divs:
            items1 = div.findAll('div', {'class': 'm-item'})    # listing cells
            items2 = div.findAll('div', {'class': 'm-item_2'})  # contact cells
            for i in xrange(len(items1)):
                item = Zz91Item()
                link = items1[i].find('a')['href']
                item['fdp1_url'] = link
                item['fdp1_title'] = items1[i].find('a').get_text()
                # Fetch the detail page and pull out the listing body.
                response2 = urllib.urlopen(link)
                soup2 = BeautifulSoup(response2, "lxml")
                fdp1_content = soup2.find('div', {'class': 'p_content p_contentA'})
                fdp1_content = fdp1_content.get_text()
                # Collapse runs of whitespace into single spaces.
                fdp1_content = ' '.join(fdp1_content.split())
                item['fdp1_content'] = fdp1_content
                item['fdp1_contact_url'] = items2[i].find('a')['href']
                item['fdp1_company'] = items2[i].find('a').get_text()
</code></pre>
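As noted above, Scrapy's built-in selectors would work just as well as BeautifulSoup. For comparison, here is a sketch of the same extraction using response.css; the class names are copied from the code above, the follow-up request for fdp1_content is omitted, and the snippet is untested against the live page:
<pre><code>    # The same extraction with Scrapy selectors instead of BeautifulSoup
    # (a sketch; would replace the parse() above).
    def parse(self, response):
        for main in response.css('div.l-main'):
            items1 = main.css('div.m-item')    # listing cells
            items2 = main.css('div.m-item_2')  # contact cells
            for it1, it2 in zip(items1, items2):
                item = Zz91Item()
                item['fdp1_url'] = it1.css('a::attr(href)').extract_first()
                item['fdp1_title'] = it1.css('a::text').extract_first()
                item['fdp1_contact_url'] = it2.css('a::attr(href)').extract_first()
                item['fdp1_company'] = it2.css('a::text').extract_first()
                yield item
</code></pre>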
<p>I store the data in MySQL. To avoid scraping duplicates, we add a check function to the spider class that tests whether a URL has already been crawled. The code is as follows:</p>
<pre><code>    def check(self, article_url):
        # Returns True when the URL has not been crawled before.
        # (Requires "import traceback" and the Database helper at module top.)
        self.database = Database()
        self.database.connect('crawl_data')
        sql = ("SELECT * FROM zz91_feidianping_1 "
               "WHERE fdp1_url=%s ORDER BY fdp1_url")
        data = (article_url.encode('utf-8'),)
        try:
            search_result = self.database.query(sql, data)
            if search_result == ():   # no matching row: URL is new
                self.database.close()
                return True
        except Exception as e:
            print e
            traceback.print_exc()
        # A duplicate row was found (or the query failed): skip this item.
        self.database.close()
        return False
</code></pre>
<p>Then add the following two lines at the end of the inner loop in parse:</p>
<pre><code>                if self.check(item['fdp1_url']):
                    yield item
</code></pre>
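The Database class that check() relies on is not shown in the post either. A minimal sketch of what such a wrapper might look like, assuming the MySQLdb driver (any DB-API-compatible driver would do; the connection parameters are placeholders):
<pre><code># Sketch of the Database wrapper used by check(); not from the post.
import MySQLdb

class Database(object):
    def connect(self, db_name):
        # Placeholder connection parameters.
        self.conn = MySQLdb.connect(host='localhost', user='root',
                                    passwd='XXXX', db=db_name,
                                    charset='utf8')

    def query(self, sql, data):
        cursor = self.conn.cursor()
        cursor.execute(sql, data)  # parameterized query
        return cursor.fetchall()   # returns () when no rows match

    def close(self):
        self.conn.close()
</code></pre>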
<p>The captured data is shown in the figures below:</p>
(Figures: Fiddler capture analysis 1 and 2)
<p>It is not hard to see that each company's detailed contact information lives in this JSON file:
http://apptest.zz91.com/detail/?id=%s&appsystem=%s&company_id=%s&datatype=%s&usertoken=%s
However, the id in that URL is the pdt_id field from the response of
http://apptest.zz91.com/offerlist/?clientid=867450021846562&company_id=%s&appsystem=%s&keywords=%s&page=%s&orderflag=&datatype=%s&usertoken=%s
so the offerlist JSON has to be fetched first; only then can the detail JSON be requested.</p>
The concrete implementation:
<pre><code># coding: utf-8
import requests
import urllib.parse

# The actual values (taken from the Fiddler capture) are masked here.
appsystem = 'XXXXXXXXXXXXXXXX'
company_id = 'XXXXXXXXXXXXXXXX'
datatype = 'XXXXXXXXXXXX'
usertoken = 'XXXXXXXXXXXXXX'
keystr = "XXXXXXXXXXXXXXXX"
keywords = urllib.parse.quote(keystr)

# Request headers reconstructed from the Fiddler capture.
headers = {
    'Host': 'apptest.zz91.com',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 4.4.4; 2014112 Build/KTU84P) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0.0 Mobile Safari/537.36',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip',
    'Connection': 'Keep-Alive',
    'Charset': 'UTF-8',
    'Cookie': 'sessionid=XXXXXXXXXXXXXXXXX',
}

r = requests.session()
count = 1
for i in range(1, 100):
    url = 'http://apptest.zz91.com/offerlist/?clientid=867450021846562' \
          '&company_id=%s&appsystem=%s&keywords=%s' \
          '&page=%s&orderflag=&datatype=%s&usertoken=%s' \
          % (company_id, appsystem, keywords, str(i), datatype, usertoken)
    response = r.post(url=url, headers=headers)
    rjson = response.json()
    productList = rjson['productList']
    # print(productList)
    for product in productList:
        print('Listing no. %s' % str(count))
        count += 1
        com_name = product['com_name']          # company name
        pdt_price = product['pdt_price']        # price
        pdt_id = product['pdt_id']              # listing id (feeds the detail URL)
        isshowcontact = product['isshowcontact']
        isbuycontact = product['isbuycontact']
        com_id = product['com_id']              # company id
        pdt_time_en = product['pdt_time_en']    # publication date
        pdt_name = product['pdt_name']          # listing title
        com_province = product['com_province']  # location
        pdt_detail = product['pdt_detail']      # listing details (fairly rough)
        ldbtel = product['ldbtel']
        pdt_kind = product['pdt_kind']
        kindtxt = pdt_kind['kindtxt']           # "supply" or "buy"
        kindclass = pdt_kind['kindclass']       # e.g. buy
        com_subname = product['com_subname']
        pdt_images = product['pdt_images']
        phone_rate = product['phone_rate']
        phone_level = product['phone_level']
        vippaibian = product['vippaibian']
        pdt_name1 = product['pdt_name1']
        wordsrandom = product['wordsrandom']
        # The pdt_id from the offerlist response is the id parameter
        # of the detail endpoint.
        detail_url = 'http://apptest.zz91.com/detail/?' \
                     'id=%s&appsystem=%s&company_id=%s' \
                     '&datatype=%s&usertoken=%s' \
                     % (pdt_id, appsystem, company_id, datatype, usertoken)
        detail_pdt = r.post(url=detail_url, headers=headers)
        detail_json = detail_pdt.json()
        detailList = detail_json['list']
        address = detailList['address']
        compname = detailList['compname']
        business = detailList['business']
        contact = detailList['contact']
        details = detailList['details']
        email = detailList['email']
        mobile = detailList['mobile']
        mobile1 = detailList['mobile1']
        expire_time = detailList['expire_time']  # listing expiry date
        title = detailList['title']
        price = detailList['price']
        price_unit = detailList['price_unit']
        quantity = detailList['quantity']
        quantity_unit = detailList['quantity_unit']
        print(title)
        print(details)
        print(address)
        print(compname)
        print(contact)
        print(mobile)
        print('Price: ' + str(price) + str(price_unit))
        print('Quantity: ' + str(quantity) + str(quantity_unit))
        print('Published: ' + pdt_time_en)
        print('Valid until: ' + expire_time)
</code></pre>
The results are shown below:
(Figure: scraped results)
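One tweak worth considering before wrapping up: the script above unconditionally requests pages 1 through 99, but the loop could stop as soon as the API returns an empty productList. A sketch of that loop skeleton, where offerlist_url() and handle_product() are hypothetical stand-ins for the inline code above:
<pre><code># Sketch: stop paging once the API runs out of results.
# offerlist_url() and handle_product() are hypothetical helpers
# standing in for the inline code above.
page = 1
while True:
    response = r.post(url=offerlist_url(page), headers=headers)
    products = response.json().get('productList') or []
    if not products:
        break                    # empty page: no more listings
    for product in products:
        handle_product(product)  # extract fields, fetch the detail JSON
    page += 1
</code></pre>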
4. Summary
That completes all of the functionality. One closing remark: using Fiddler to capture traffic from a mobile client turns out to be very convenient; see the linked tutorial for the configuration details. The next goal is to recognize the phone numbers embedded in images.
So annoying /(ㄒoㄒ)/~~