Chapter 1: A First Look at Web Scrapers

Author: VB过得VB | Published: 2017-02-04 22:12

    1.1 Network Connection

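    # scrapetest.py
    # Look up Python's request module (inside the urllib library) and import only the urlopen function
    from urllib.request import urlopen
    # urlopen opens a remote object over the network; read() returns its content as bytes
    html = urlopen("http://pythonscraping.com/pages/page1.html").read()
    print(html)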
    -------------------------------------------------------------------------
    # Printed output
    b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
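
The output above starts with `b'` and shows literal `\n` sequences because `read()` returns a `bytes` object rather than text. A minimal sketch of turning it into a string, assuming the page is UTF-8 encoded (the listing does not state the encoding) and using a shortened copy of the bytes so it runs without a network connection:

```python
# Assumption: the page is UTF-8 encoded. A shortened copy of the bytes
# printed above stands in for the real response so this runs offline.
raw = b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n'

# bytes -> str; errors="replace" substitutes a marker for any byte
# sequence that is not valid UTF-8 instead of raising an exception.
text = raw.decode("utf-8", errors="replace")

print(type(raw).__name__)   # the object returned by read()
print(type(text).__name__)  # the decoded text
print(text)                 # printed with real line breaks
```

Decoding is optional when the result goes straight into BeautifulSoup, which accepts bytes as well as strings.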
    

    1.2 Running BeautifulSoup

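    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    html = urlopen("http://pythonscraping.com/pages/page1.html").read()
    # Parse the downloaded bytes with the lxml parser
    bs_obj = BeautifulSoup(html, 'lxml')
    print(bs_obj)     # the whole parsed document
    print(bs_obj.h1)  # the first <h1> tag in the document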
    -------------------------------------------------------------------------
    # Printed output
    <html>
    <head>
    <title>A Useful Page</title>
    </head>
    <body>
    <h1>An Interesting Title</h1>
    <div>
    Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    </div>
    </body>
    </html>
    
    <h1>An Interesting Title</h1>
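
Attribute access such as `.h1` returns the first matching tag anywhere in the document, so the short form and the full path reach the same element. The sketch below illustrates this on an inlined, shortened copy of the page. It swaps in Python's built-in `"html.parser"` (the listing above uses `'lxml'`) so that only `bs4` needs to be installed, and it also shows what comes back for a tag that does not exist.

```python
from bs4 import BeautifulSoup

# Shortened copy of the page above, inlined so no network access is needed.
html = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1><div>Lorem ipsum</div></body></html>"
)

# Assumption: "html.parser" ships with Python and is used here only to
# avoid the extra lxml dependency; the article's listing uses 'lxml'.
bs_obj = BeautifulSoup(html, "html.parser")

h1_short = bs_obj.h1            # first <h1> anywhere in the document
h1_full = bs_obj.html.body.h1   # the same tag, reached by its full path

print(h1_short == h1_full)      # True
print(h1_short.get_text())      # An Interesting Title
print(bs_obj.h2)                # None, because the page has no <h2>
```

Because a missing tag yields `None`, chaining another attribute onto it (for example `bs_obj.h2.span`) raises `AttributeError`.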
    

    1.3 Exception Handling

    from urllib.request import urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup
    def getTitle(url):
        # Return the first <h1> inside <body>, or None if it cannot be retrieved
        try:
            html = urlopen(url).read()
        except HTTPError:
            # The server answered with an error status, for example 404 or 500
            return None
        try:
            bs_obj = BeautifulSoup(html, 'lxml')
            title = bs_obj.body.h1
        except AttributeError:
            # Raised when bs_obj.body is None, i.e. the page has no <body> tag
            return None
        return title
    title = getTitle("http://www.pythonscraping.com/pages/page1.html")
    if title is None:
        print("Title could not be found")
    else:
        print(title)
    ---------------------------------------------------------------------------
    # Printed output
    <h1>An Interesting Title</h1>
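
The listing above catches only `HTTPError`, which covers error responses from a server that was reached. When the server cannot be reached at all, `urlopen` raises `URLError`, the parent class of `HTTPError`. The following is a rough sketch of handling both cases. The names `fetch`, `opener`, `fake_404` and `fake_unreachable` are hypothetical and not part of the original script; the `opener` parameter exists only so both error paths can be exercised without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the page bytes, or None if the request fails.

    `fetch` and `opener` are hypothetical names used for illustration.
    """
    try:
        return opener(url).read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        # This clause must come first because HTTPError subclasses URLError.
        print("HTTP error:", e.code)
    except URLError as e:
        # No usable answer at all: unknown host, refused connection, etc.
        print("URL error:", e.reason)
    return None


def fake_404(url):
    # Stand-in for urlopen that simulates a "404 Not Found" response.
    raise HTTPError(url, 404, "Not Found", None, None)


def fake_unreachable(url):
    # Stand-in for urlopen that simulates an unreachable server.
    raise URLError("server not found")


print(fetch("http://example.invalid/a", opener=fake_404))
print(fetch("http://example.invalid/b", opener=fake_unreachable))
```

In real use `fetch(url)` would be called without the `opener` argument, and `getTitle` could call it in place of its first `try` block.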
    
