Python Web Crawler Study Notes, CH3: Collecting Data

Author: 硼酸滴耳液 | Published 2017-09-24 15:46

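They are called web crawlers because they crawl along the web. At its core, a crawler is a recursive process: to find URLs it must first fetch a page and inspect its content for another URL, then fetch the page behind that URL, and keep repeating the cycle.

Extracting the links on a page: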

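from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
# Name the parser explicitly so the result does not depend on what is installed
bsObj = BeautifulSoup(html, "html.parser")

# Print the href of every <a> tag that has one
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])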
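The same "find every a tag, keep its href" step can be tried offline. The sketch below is not the notes' method: it swaps BeautifulSoup for the standard library's html.parser so that it runs without any third-party package or network access. The class name LinkCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Hypothetical sample markup, standing in for a downloaded page
SAMPLE_HTML = """
<html><body>
  <a href="/wiki/Kevin_Bacon">Kevin Bacon</a>
  <a name="top">anchor without href</a>
  <a href="https://example.com/">external</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.hrefs)
```

The anchor without an href is skipped, which mirrors the `'href' in link.attrs` check in the BeautifulSoup version.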

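Filtering out unwanted links:

Take extracting only "article links" as an example. Compared with the "other links" on the page, article links:

• all sit inside the div tag whose id is bodyContent

• have URLs that contain no colon

• have URLs that start with /wiki/

Use the find() method and a regular expression to filter out the "other links":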

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# Seed with a float: recent Python versions reject a datetime object here
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Only links inside div#bodyContent that start with /wiki/ and contain no colon
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")

# Random walk: jump to a random article link until a page has none
while len(links) > 0:
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
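To see what the regular expression in getLinks accepts and rejects, it can be run against a few href strings without touching the network. This is a small sketch; the list of candidate hrefs is invented for illustration and only resembles what a Wikipedia page contains.

```python
import re

# Same pattern as in getLinks above
ARTICLE_LINK = re.compile("^(/wiki/)((?!:).)*$")

# Hypothetical href values of the kind found on a Wikipedia page
candidates = [
    "/wiki/Kevin_Bacon",
    "/wiki/Category:American_actors",
    "/wiki/Talk:Kevin_Bacon",
    "//wikimediafoundation.org/",
    "#cite_note-1",
    "/wiki/Footloose_(1984_film)",
]

# Keep only hrefs that start with /wiki/ and contain no colon
article_links = [href for href in candidates if ARTICLE_LINK.match(href)]
print(article_links)
```

The negative lookahead `(?!:)` is tested before every character after /wiki/, so a single colon anywhere in the rest of the path makes the whole match fail. That is how namespace pages such as Category: and Talk: are dropped.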

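Deduplicating links:

Deduplicating links matters because it keeps a page from being collected twice. While the program runs, keep every link discovered so far in one place, in a structure that is cheap to query (the example below uses Python's set type). Only "new" links are fetched, and each fetched page is then searched for further links:

The program walks through every link on the home page and checks whether it is already in the global set pages (the set of pages already collected). If it is not, the link is printed to the screen and added to pages, and getLinks is called recursively on it.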

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Every link seen so far; a set gives fast membership checks
pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

# Start from the home page (empty path)
getLinks("")
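The deduplication idea can be exercised without the network. The sketch below is a variation, not the program above: it replaces the recursive call with an explicit queue, since CPython's default recursion limit is 1000 and a long chain of new pages could exceed it. FAKE_SITE, crawl and get_links are hypothetical names, and the tiny link graph stands in for real pages.

```python
from collections import deque

# Hypothetical in-memory "site": each path maps to the paths it links to
FAKE_SITE = {
    "": ["/wiki/A", "/wiki/B"],
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A"],
    "/wiki/C": ["/wiki/A", "/wiki/C"],
}


def crawl(start, get_links):
    """Return every link discovered from start, each once, in discovery order."""
    seen = set()            # plays the role of the global pages set
    order = []
    queue = deque([start])  # pages whose links still have to be read
    while queue:
        page = queue.popleft()
        for link in get_links(page):
            if link not in seen:
                # New link: record it and schedule its page for a later visit
                seen.add(link)
                order.append(link)
                queue.append(link)
    return order


discovered = crawl("", lambda page: FAKE_SITE.get(page, []))
print(discovered)
```

Even though the fake pages link back to each other and to themselves, each link is recorded once and the loop terminates. To use this against a real site, get_links would be replaced by a function that downloads the page and returns its href values.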

A combined program that collects data from the whole site:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        # Title, first paragraph of the body, and the edit link
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except (AttributeError, IndexError):
        # A missing tag gives AttributeError; an empty <p> list gives IndexError
        print("This page is missing something! No worries though!")

    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")