Chapter 1: A First Look at Web Scrapers

Author: VB过得VB | Source: published 2017-02-04 22:12, read 8 times

1.1 Network Connection

# scrapetest.py
from urllib.request import urlopen  # import only the urlopen function from the request module of the urllib package
html = urlopen("http://pythonscraping.com/pages/page1.html").read()  # urlopen opens a remote object over the network; read() returns its content as bytes
print(html)
-------------------------------------------------------------------------
# Output
b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
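
The b'...' prefix in the output shows that read() returns bytes rather than a str. As a minimal sketch of turning those bytes into text, the snippet below decodes a short byte string modeled on the start of that output. The helper name to_text and the choice of UTF-8 are assumptions for illustration, not part of the original script.

```python
# A byte string shaped like the beginning of the response printed above.
raw = b"<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n"


def to_text(data, encoding="utf-8"):
    """Decode raw response bytes into a str (hypothetical helper).

    errors="replace" swaps undecodable bytes for U+FFFD instead of raising,
    which is an assumption that suits quick inspection, not strict parsing.
    """
    return data.decode(encoding, errors="replace")


text = to_text(raw)
print(type(raw).__name__)   # the undecoded value is bytes
print(type(text).__name__)  # the decoded value is str
print(text)                 # the \n escapes now show up as real line breaks
```

With a live response, the same helper could be applied to the value returned by urlopen(...).read().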

1.2 Running BeautifulSoup

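from urllib.request import urlopen
from bs4 import BeautifulSoup  # third-party package (beautifulsoup4)
html = urlopen("http://pythonscraping.com/pages/page1.html").read()
bs_obj = BeautifulSoup(html, 'lxml')  # parse the bytes with the lxml parser (requires the lxml package)
print(bs_obj)     # the whole parsed document
print(bs_obj.h1)  # the first <h1> tag in the document
-------------------------------------------------------------------------
# Output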
<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

<h1>An Interesting Title</h1>
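
To try tag access without a network connection, here is a small sketch that parses an inline HTML string shaped like the page above. It uses Python's built-in 'html.parser' instead of 'lxml' so that nothing beyond bs4 has to be installed; the sample string and variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the downloaded page (shortened body text).
sample_html = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1><div>Lorem ipsum</div></body></html>"
)

bs_obj = BeautifulSoup(sample_html, "html.parser")

h1_tag = bs_obj.h1                    # first <h1> anywhere in the document
h1_text = h1_tag.get_text()           # text inside the tag, without markup
title_text = bs_obj.title.get_text()  # text of the <title> tag
missing = bs_obj.h2                   # a tag that does not exist yields None

print(h1_tag)
print(h1_text)
print(title_text)
print(missing)
```

Note that asking for a tag that is absent returns None rather than raising, which is why chained access such as bs_obj.h2.get_text() would fail with an AttributeError.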

1.3 Exception Handling

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
def getTitle(url):
    # Return the first <h1> inside <body>, or None if it cannot be retrieved
    try:
        html = urlopen(url).read()
    except HTTPError:
        # the server returned an error status (e.g. 404 or 500)
        return None
    try:
        bsobj = BeautifulSoup(html, 'lxml')
        title = bsobj.body.h1
    except AttributeError:
        # bsobj.body is None, so accessing .h1 on it fails
        return None
    return title
title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)
---------------------------------------------------------------------------
# Output
<h1>An Interesting Title</h1>
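
The getTitle function above catches only HTTPError, which covers the case where the server answers with an error status. When no server can be reached at all, urlopen raises URLError instead, and HTTPError is a subclass of it. The sketch below shows one possible way to handle both; the fetch function, its opener parameter, and the two fake openers are hypothetical and exist only so the failure paths can be exercised offline.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the request fails.

    `opener` is a hypothetical hook: it defaults to urlopen, but a stand-in
    can be passed to simulate failures without touching the network.
    """
    try:
        return opener(url).read()
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        print("HTTP error:", e.code)
        return None
    except URLError as e:
        # No usable answer at all: bad host name, refused connection, etc.
        print("URL error:", e.reason)
        return None


def raise_404(url):
    # Simulates a server that answers with "404 Not Found".
    raise HTTPError(url, 404, "Not Found", None, None)


def raise_unreachable(url):
    # Simulates a server that cannot be reached.
    raise URLError("connection refused")


result_404 = fetch("http://example.invalid/missing", opener=raise_404)
result_down = fetch("http://example.invalid/", opener=raise_unreachable)
print(result_404, result_down)
```

Swapping the two except clauses would send every HTTPError into the URLError branch, so the more specific exception is listed first.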

Article link: https://www.haomeiwen.com/subject/gwlsittx.html