Writing a Python Crawler for the First Time

Author: 魏兆华 | Published: 2018-07-27 21:35 | Read 138 times

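Today a friend asked me for help collecting supplier information from a commercial website. Since it was urgent, I put my own work aside and started writing a Python 3 crawler. This was actually my first real Python 3 program, and also my first crawler. Fortunately, it went more smoothly than I expected.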

Below is the main crawler code, lightly edited and abridged:

#!/usr/bin/env python3

import requests
import json
from lxml import etree

# The URL has been masked
url_main = 'https://xxxx.com/data/ajax/get_offer_list.json?beginpage='
url_params = '&keywords=%XX%XX%XX%XX%XX%XX%XX&sortType=&descendOrder=....'

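page_count = 50  # The source data is paginated, and an out-of-range page number still returns the last page, so just hard-code the total page count
for pi in range(page_count):
  url_req = url_main + str(pi) + url_params
  resp = requests.get(url_req).content.decode('utf-8')
  idx_from = resp.find('(') + 1
  json_obj = resp[idx_from:-1]  # Take the text between the first '(' and the final character: the returned JSON object

  hjson = json.loads(json_obj)
  count = hjson['data']['offerCount']  # Number of entries on this page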

  for i in range(count):
    idx = pi*20+i+1  # Give each entry a sequence number
    cname = hjson['data']['content']['offerResult'][i]['attr']['company']['name']  # Look up the supplier name in the JSON
    eurl = hjson['data']['content']['offerResult'][i]['eurl']   # Supplier URL
    if eurl.startswith('//'):
      eurl = "https:" + eurl  # Some links are protocol-relative and need the scheme prepended

    res_detail = requests.get(eurl)  # Fetch the supplier's detail page
    res_detail = res_detail.content.decode('GBK')  # The page is GBK-encoded and needs decoding
    
    selector = etree.HTML(res_detail)
    names = selector.xpath('//a[@class="membername"]')  # Find the contact name
    name = ''
    if len(names) > 0:
      name = names[0].text or ''  # .text is None for an empty element
    else:
      names = selector.xpath('//span[@class="disc"]/a')  # The contact is sometimes placed elsewhere
      if len(names) > 0 and names[0].text and not names[0].text.isspace():
        name = names[0].text
      else:
        names = selector.xpath('//div[@class="contactSeller"]/a')  # The contact may also be placed here
        if len(names) > 0 and names[0].text and not names[0].text.isspace():
          name = names[0].text
    cells = selector.xpath('//dl[@class="m-mobilephone"]/@data-no')  # Find the contact phone number
    cell = ''
    if len(cells) > 0:
      cell = cells[0]

    print(idx, cname, name, cell)  # Found it; print the record
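
The crawler above recovers the JSON by taking everything after the first `(` and dropping the final character. A slightly more defensive sketch of that one step follows. The helper name `unwrap_jsonp` and the sample payload are hypothetical, and the sketch assumes the response looks like `callback({...})`, optionally followed by a semicolon:

```python
import json


def unwrap_jsonp(text):
    """Return the parsed JSON payload of a JSONP-style response."""
    text = text.strip()
    if text.startswith(('{', '[')):
        # No callback wrapper: the body is already plain JSON.
        return json.loads(text)
    start = text.find('(')
    end = text.rfind(')')
    if start == -1 or end < start:
        raise ValueError('not a JSONP response')
    # Keep only what sits between the outermost parentheses.
    return json.loads(text[start + 1:end])


# Made-up payload shaped like the response the crawler handles.
sample = 'jsonp_cb({"data": {"offerCount": 2}});'
print(unwrap_jsonp(sample)['data']['offerCount'])
```

Using `rfind(')')` rather than a fixed `-1` offset means a trailing semicolon or newline does not end up inside the string handed to `json.loads`.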

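For convenience, the crawler's output is saved in test.txt. Checking the file showed that the supplier data is not filled in consistently: some entries have no contact phone number, and some embed the phone number in an image. To keep things simple, every entry that does not follow the expected format is ignored.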
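
If a company or contact name happens to contain a space, splitting the saved lines on whitespace later will misalign the fields. One possible alternative, sketched here and not what the original script does, is to store tab-separated rows and filter out incomplete ones when reading them back. The helpers `save_rows` and `load_rows`, the sample rows, and the file name are all hypothetical:

```python
import csv
import os
import tempfile


def save_rows(rows, path):
    """Write (idx, company, contact, phone) tuples as tab-separated lines."""
    with open(path, 'w', encoding='utf-8', newline='') as f:
        csv.writer(f, delimiter='\t').writerows(rows)


def load_rows(path):
    """Read the rows back, skipping any without a purely numeric phone field."""
    with open(path, encoding='utf-8', newline='') as f:
        return [row for row in csv.reader(f, delimiter='\t')
                if len(row) == 4 and row[3].isdigit()]


# Made-up sample data; the second row has no phone number.
rows = [(1, 'Acme Trading Co', 'Li Lei', '13800000000'),
        (2, 'Foo Ltd', 'Han Meimei', '')]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'crawl_sample.tsv')
    save_rows(rows, path)
    loaded = load_rows(path)

print(loaded)
```

Only the first sample row survives the round trip, because the second has an empty phone field.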

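As it happened, I also had an Excel contact list a friend had given me earlier, which I had since converted to CSV with a little processing. It made sense to merge it with the crawler results and generate a VCF file directly, which is also easier to use. So I got straight to it: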

#!/usr/bin/env python3

# Contact dictionary
mans = {}

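# This is the crawler output file
txtFile = open('test.txt', 'r')
for line in txtFile:
  segs = line.split()
  if len(segs) < 4 or not segs[3].isdigit():  # Ignore anything that doesn't match the expected format
    continue
  mobile = int(segs[3])  # Use the phone number as the dictionary key to prevent duplicates
  if mobile not in mans:
    mans[mobile] = segs[1] + ' ' + segs[2]
txtFile.close()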

# This is the existing CSV contact list
csvFile = open('test.csv', 'r')
for line in csvFile:
  segs = line.split(',')
  if len(segs) < 3:  # Skip lines with too few fields
    continue
  if (segs[2].endswith('\n')):  # Generated on Windows; strip the trailing newline
    segs[2] = segs[2][:-1]
  if not segs[2].isdigit():  # Ignore irregular entries
    continue
  mobile = int(segs[2])
  if mobile not in mans:
    mans[mobile] = segs[1] + ' ' + segs[0]
csvFile.close()

# The VCF file to be written
vcfFile = open('final.vcf', 'w')
# Generate the vCard contact entries one by one
for mobile, addr_name in mans.items():
  idx = addr_name.find(' ')
  addr = addr_name[:idx]
  name = addr_name[idx+1:]
  vcfFile.write('BEGIN:VCARD\n')
  vcfFile.write('VERSION:3.0\n')
  vcfFile.write('TEL;type=CELL;type=VOICE;type=pref:' + str(mobile) + '\n')
  vcfFile.write('N:' + name + '\n')
  vcfFile.write('ORG:' + addr + '\n')
  vcfFile.write('END:VCARD\n')
vcfFile.close()
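
Building each card as a string first makes the output easy to inspect before it is written to disk. The following is a sketch under assumptions, not the original method: `make_vcard` is a hypothetical helper, the sample contact is made up, and the `FN` line is an addition, included because vCard 3.0 (RFC 2426) lists `FN` as a required property alongside `N` and `VERSION`.

```python
def make_vcard(mobile, org, name):
    """Return one vCard 3.0 entry as a string."""
    lines = [
        'BEGIN:VCARD',
        'VERSION:3.0',
        'FN:' + name,  # formatted name; an addition not in the original script
        'N:' + name,
        'ORG:' + org,
        'TEL;type=CELL;type=VOICE;type=pref:' + str(mobile),
        'END:VCARD',
    ]
    return '\n'.join(lines) + '\n'


# Made-up contact in the same "company contact" shape as the mans dictionary.
mans = {13800000000: 'AcmeTrading LiLei'}

cards = []
for mobile, addr_name in mans.items():
    org, _, name = addr_name.partition(' ')  # split at the first space only
    cards.append(make_vcard(mobile, org, name))

print(''.join(cards))
```

The joined `cards` list can then be written to `final.vcf` in a single `write` call.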

After trying it out, my friend was delighted: work that used to take at least a week, I had finished in half a day.
