Using BeautifulSoup and the json Library in a Web Scraping Project


Author: 科斯莫耗子 | Published on 2016-08-22 16:02

While refactoring the Renrendai (人人贷) crawler, I found that the main data to be scraped is delivered as JSON. The HTML content to extract looks like this:

<script id="credit-info-data" type="text/x-json">
{
    "data": {
        "creditInfo": {
            "account": "INVALID", 
            "album": "INVALID", 
            "borrowStudy": "VALID", 
            "car": "INVALID", 
            "child": "INVALID", 
            "credit": "FAILED", 
            "creditInfoId": 499250, 
            "detailInformation": "VALID", 
            "fieldAudit": "INVALID", 
            "graduation": "PENDING", 
            "house": "INVALID", 
            "identification": "VALID", 
            "identificationScanning": "VALID", 
            "incomeDuty": "PENDING", 
            "kaixin": "INVALID", 
            "lastUpdateTime": "Aug 1, 2014 12:00:00 AM", 
            "marriage": "VALID", 
            "mobile": "VALID", 
            "mobileAuth": "INVALID", 
            "mobileReceipt": "INVALID", 
            "other": "INVALID", 
            "renren": "INVALID", 
            "residence": "VALID", 
            "titles": "INVALID", 
            "user": 503971, 
            "version": 24, 
            "video": "PENDING", 
            "work": "OVERDUE"
        }, 
        "creditPassedTime": {
            "creditPassedTimeId": 499214, 
            "detailInfomation": "Nov 19, 2013 10:57:21 PM", 
            "identification": "Nov 19, 2013 3:14:27 PM", 
            "identificationScanning": "Nov 21, 2013 11:36:55 AM", 
            "lastUpdateTime": "Aug 1, 2014 12:00:00 AM", 
            "marriage": "Nov 21, 2013 11:37:32 AM", 
            "mobile": "Nov 19, 2013 3:10:53 PM", 
            "residence": "Nov 21, 2013 11:37:44 AM", 
            "user": 503971, 
            "work": "Nov 21, 2013 11:37:25 AM"
        }, 
        "loan": {
            "address": "\u5c71\u4e1c", 
            "allProtected": false, 
            "allowAccess": true, 
            "amount": 30000.0, 
            "amountPerShare": 50.0, 
            "avatar": "", 
            "borrowType": "\u8d2d\u8f66\u501f\u6b3e", 
            "borrowerId": 503971, 
            "borrowerLevel": "HR", 
            "currentIsRepaid": false, 
            "description": "\u672c\u4eba\u662f\u9ad8\u4e2d\u6559\u5e08\uff0c\u5de5\u8d44\u7a33\u5b9a\uff0c\u73b0\u5728\u4e70\u8f66\u5411\u5927\u5bb6\u501f\u6b3e\uff0c\u6bcf\u6708\u53d1\u5de5\u8d44\u6309\u65f6\u5f52\u8fd8\u3002", 
            "displayLoanType": "XYRZ", 
            "finishedRatio": 0.0, 
            "forbidComment": false, 
            "interest": 22.0, 
            "interestPerShare": 0.0, 
            "jobType": "\u5de5\u85aa\u9636\u5c42", 
            "leftMonths": 24, 
            "loanId": 123456, 
            "loanType": "DEBX", 
            "monthlyMinInterest": "[{\"month\":\"3\",\"minInterest\":\"10\",\"maxInterest\":\"24\",\"mgmtFee\":\"0.3\",\"tradeFee\":\"0\",\"guaranteeFee\":\"5\",\"inRepayPenalFee\":\"1\",\"divideFee\":0.0},{\"month\":\"6\",\"minInterest\":\"11\",\"maxInterest\":\"24\",\"mgmtFee\":\"0.3\",\"tradeFee\":\"0\",\"guaranteeFee\":\"5\",\"inRepayPenalFee\":\"1\",\"divideFee\":0.0},{\"month\":\"9\",\"minInterest\":\"12\",\"maxInterest\":\"24\",\"mgmtFee\":\"0.3\",\"tradeFee\":\"0\",\"guaranteeFee\":\"5\",\"inRepayPenalFee\":\"1\",\"divideFee\":0.0},{\"month\":\"12\",\"minInterest\":\"12\",\"maxInterest\":\"24\",\"mgmtFee\":\"0.3\",\"tradeFee\":\"0\",\"guaranteeFee\":\"5\",\"inRepayPenalFee\":\"1\",\"divideFee\":0.0},{\"month\":\"15\",\"minInterest\":\"13\",\"maxInterest\":\"24\",\"mgmtFee\":\"0.3\",\"tradeFee\":\"0\",\"guaranteeFee\":\"5\",\"inRepayPenalFee\":\"1\",\"divideFee\":0.0},{\"month\":\"18\",\"minInterest\":\"13\",\"maxInterest\":\"24\",\"mgmtFee\":\"0.3\",\"tradeFee\":\"0\",\"guaranteeFee\":\"5\",\"inRepayPenalFee\":\"1\",\"divideFee\":0.0},{\"month\":\"24\",\"minInterest\":\"13\",\"maxInterest\":\"24\",\"mgmtFee\":\"0.3\",\"tradeFee\":\"0\",\"guaranteeFee\":\"5\",\"inRepayPenalFee\":\"1\",\"divideFee\":0.0},{\"month\":\"36\",\"minInterest\":\"13\",\"maxInterest\":\"24\",\"mgmtFee\":\"0.3\",\"tradeFee\":\"0\",\"guaranteeFee\":\"5\",\"inRepayPenalFee\":\"1\",\"divideFee\":0.0}]", 
            "months": 24, 
            "nickName": "sdcsqk", 
            "oldLoan": false, 
            "openTime": "Nov 19, 2013 9:11:48 PM", 
            "overDued": false, 
            "picture": "", 
            "principal": 0.0, 
            "productId": 7, 
            "productName": "HR", 
            "repaidByGuarantor": false, 
            "repayType": "MONTH", 
            "startTime": "Dec 19, 2013 9:11:48 PM", 
            "status": "FAILED", 
            "surplusAmount": 30000.0, 
            "title": "\u9ad8\u4e2d\u6559\u5e08\uff0c\u5de5\u4f5c\u7a33\u5b9a\u6309\u65f6\u5f52\u8fd8!", 
            "utmSource": "from-website", 
            "verifyState": "CANCEL"
        }
    }, 
    "status": 0
}</script>

The previous version used the re module for crude regex matching, which was inefficient. In the refactored version, BeautifulSoup (bs4) extracts the tag, and the json library then converts the string into a dict, which makes the data much easier to access and output later.
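For comparison, the original regex is not shown in the article, but a crude extraction along those lines might have looked like the sketch below (the pattern and variable names are illustrative only):

import re

# Illustrative sketch only; the article does not show the original pattern.
# Grab everything between the target <script> tags in one pass.
pattern = re.compile(
    r'<script id="credit-info-data" type="text/x-json">(.*?)</script>',
    re.S)
match = pattern.search(html)
raw_json = match.group(1) if match else None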

Here is a brief walk-through of the methods used:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

__author__ = 'Gao Yuhao'

# Unify input() across Python 2 and 3
try:
    input = raw_input
except NameError:
    pass

import requests
from bs4 import BeautifulSoup
import json

# Ask for the loan id of the page to test against
page_index = input('Pls input the page_index you want to try:')
surl = 'http://www.we.com/lend/detailPage.action?loanId=' + page_index

# Fetch the page with requests (requests already decodes the body to text)
req = requests.get(url=surl)
html = req.text

# Extract the target <script> tag with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
res = soup('script', id='credit-info-data')[0].text

# Convert the JSON string into a dict
res_json = json.loads(res)
print(json.dumps(res_json, indent=4))
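Once the data is a dict, the nested fields can be read directly. One detail worth noting from the sample above: monthlyMinInterest is itself a JSON-encoded string inside the JSON, so it needs a second json.loads. A short usage sketch, with field names taken from the sample data:

# Read nested fields straight from the dict
loan = res_json['data']['loan']
print(loan['amount'])         # 30000.0
print(loan['interest'])       # 22.0
print(loan['borrowerLevel'])  # HR

# monthlyMinInterest is a JSON string embedded in the JSON, so decode it again
monthly = json.loads(loan['monthlyMinInterest'])
print(monthly[0]['minInterest'])  # 10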


Reader comments

  • 芦苇和远方: I've been studying BeautifulSoup lately and finally figured it out. You could go into even more detail! Hehe...
  • 使我不得开心颜: I ran into exactly this situation when scraping Huaban (花瓣网).
  • Garfield_Liang: I recently hit a similar problem while scraping Taobao, but I used pandas' read_json to convert the data directly into a DataFrame. For data analysis that feels more convenient.
  • 掂吾掂: Python today is mainly split between 3.0+ and 2.7... Newcomers can't tell which version code is written for, so I hope the author will keep that in mind. This looks like Python 2.7, right?
    科斯莫耗子: @掂吾掂 Thanks for the suggestion. Some of my projects depend on libraries that don't support Python 3 well, so my production environment is still 2.7. That said, as you say, the move from 2 to 3 is the trend, so I'm gradually making projects compatible with 3; one goal of this refactor is to support both versions. I'll update the code in the article shortly.
    掂吾掂: @科斯莫耗子 Still, I'd suggest doing future projects in Python 3. We all have to move with the new version; it would be a real shame if developers kept writing programs only for the old one. When I started learning Python I also hesitated between 2 and 3 and ended up going with 3. I originally did iOS work, and step by step I've learned scraping and Selenium automated testing. Many third-party open-source frameworks now treat Python 3 as the standard, so I hope the author drops Python 2 and keeps steering newcomers toward Python 3.
    科斯莫耗子: @掂吾掂 Sure. This post is just a quick introduction to these two libraries; in real projects I generally try to keep the code compatible with both 2 and 3.
