15 - General Crawler Modules - Data Extraction

Author: 努力爬行中的蜗牛 | Published 2019-03-12 15:44

Data extraction

Put simply, data extraction is the process of pulling the data we want out of a response.

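Types of data

Unstructured data: HTML and similar
How to handle it: regular expressions, XPath

Structured data: JSON, XML, etc.
How to handle it: convert it into Python data types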
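
To make the two categories concrete, here is a minimal sketch that extracts the same titles from an HTML fragment with a regular expression and from a JSON string by parsing it. Both response bodies and the field names (`subjects`, `title`) are made up for illustration.

```python
import json
import re

# Hypothetical response bodies, invented for this example.
html_body = '<ul><li class="title">Movie A</li><li class="title">Movie B</li></ul>'
json_body = '{"subjects": [{"title": "Movie A"}, {"title": "Movie B"}]}'

# Unstructured HTML: pull the text out with a regular expression.
html_titles = re.findall(r'<li class="title">(.*?)</li>', html_body)

# Structured JSON: parse it into Python types, then index it like any dict/list.
json_titles = [item["title"] for item in json.loads(json_body)["subjects"]]

print(html_titles)
print(json_titles)
```

The regular expression is tied to the exact markup, while the JSON version only depends on the key names, which is one reason a JSON endpoint is usually easier to work with.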

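Data extraction - JSON

Because converting JSON into Python built-in types is straightforward, a crawler should prefer a URL that returns JSON whenever one can be found.

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for people to read and write, and easy for machines to parse and generate. It suits scenarios where data is exchanged between systems, such as between a website's front end and back end.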

[Figure: converting between JSON and Python objects]
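
As a rough illustration of the conversion, the sketch below parses a made-up JSON string that contains each JSON value type and prints the Python type each value becomes. The keys and values are invented for the example.

```python
import json

# Made-up JSON text covering each JSON value type.
text = '{"name": "snail", "age": 3, "score": 9.5, "tags": ["a", "b"], "vip": true, "extra": null}'

# JSON text -> Python objects.
data = json.loads(text)
for key, value in data.items():
    print(key, type(value).__name__)

# Python objects -> JSON text. True is written as true and None as null.
print(json.dumps(data))
```

With the standard `json` module, a JSON object becomes a `dict`, an array a `list`, a string a `str`, numbers an `int` or `float`, `true`/`false` a `bool`, and `null` becomes `None`.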

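Notes on using JSON

Strings in JSON are always wrapped in double quotes.
- If the response uses anything else (single quotes, for example), json.loads will reject it. Two workarounds:
- eval: handles simple conversions from a string to Python types; it executes whatever the string contains, so use it only on input you trust
- replace: swap the single quotes for double quotes, then parse as usual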
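
The two workarounds can be sketched as follows. The response text is hypothetical, and `ast.literal_eval` is swapped in for plain `eval` here because it only accepts Python literals and does not run arbitrary code.

```python
import ast
import json

# Hypothetical response text that uses single quotes, so it is not valid JSON.
raw = "{'title': 'Movie A', 'rate': '8.5'}"

# Option 1: treat the text as a Python literal.
# ast.literal_eval stands in for eval; it refuses anything that is not a literal.
data_from_eval = ast.literal_eval(raw)

# Option 2: swap the quotes, then parse as JSON.
# This breaks if a value itself contains a quote character,
# so it only suits simple payloads.
data_from_replace = json.loads(raw.replace("'", '"'))

print(data_from_eval)
print(data_from_replace)
```

Neither approach is robust for arbitrary input: the first fails on JSON-only tokens such as `true` or `null`, and the second fails on strings with embedded quotes.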

import json
# parse_url is a local helper module that is not shown here;
# it is assumed to fetch the URL and return the response body as a str
from parse_url import parse_url
# from pprint import pprint

url = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&page_limit=50&page_start=0"
html_str = parse_url(url)

# json.loads converts a JSON string into Python types
ret1 = json.loads(html_str)
# pprint(ret1)
# print(type(ret1))

# json.dumps converts Python types into a JSON string
# ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes
with open("douban.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(ret1, ensure_ascii=False, indent=4))

# with open("douban.json", "r", encoding="utf-8") as f:
#     ret2 = f.read()
#     ret3 = json.loads(ret2)
#     print(ret3)

# json.load reads data straight from a file-like object
# with open("douban.json", "r", encoding="utf-8") as f:
#     ret4 = json.load(f)
#     print(ret4)
#     print(type(ret4))

# json.dump writes Python types into a file-like object
with open("douban1.json", "w", encoding="utf-8") as f:
    json.dump(ret1, f, ensure_ascii=False, indent=2)
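
For trying the same four functions without a network request or the parse_url helper, here is a small offline sketch. The data is made up, and `io.StringIO` stands in for an opened file so nothing is written to disk.

```python
import io
import json

# Made-up data standing in for a parsed response.
data = {"subjects": [{"title": "电影A", "rate": "8.0"}]}

# With the default ensure_ascii=True, non-ASCII characters are escaped as \uXXXX;
# ensure_ascii=False keeps them readable.
escaped = json.dumps(data)
readable = json.dumps(data, ensure_ascii=False)
print(escaped)
print(readable)

# json.dump and json.load work on file-like objects.
buffer = io.StringIO()
json.dump(data, buffer, ensure_ascii=False, indent=2)
buffer.seek(0)  # rewind before reading back
restored = json.load(buffer)
print(restored == data)
```

The pairing is easy to remember: `loads`/`dumps` work with strings, `load`/`dump` work with file-like objects.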