Python Crawler (1): Fetching Web Pages


Author: 行走的老者 | Published 2017-06-21 23:50

    ### Analyzing a Website

    1. Identifying the technologies a site uses: the builtwith module
    Install: pip install builtwith
    Usage:
    >>> import builtwith 
    >>> builtwith.parse("http://127.0.0.1:8000/examples/default/index")
    {u'javascript-frameworks': [u'jQuery'], u'font-scripts': [u'Font Awesome'], u'web-frameworks': [u'Web2py'], u'programming-languages': [u'Python']}
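
    The parse result is an ordinary dict keyed by category, so individual entries can be read directly; for example (same local Web2py app as above):

    >>> tech = builtwith.parse("http://127.0.0.1:8000/examples/default/index")
    >>> tech.get("web-frameworks")
    [u'Web2py']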
    
    2. Finding the site owner
    Install: pip install python-whois

    Usage:
    

    import whois
    print whois.whois("appspot.com")
    {
    "updated_date": [
    "2017-02-06 00:00:00",
    "2017-02-06 02:26:49"
    ],
    "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://icann.org/epp#serverTransferProhibited",
    "serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited",
    "clientUpdateProhibited (https://www.icann.org/epp#clientUpdateProhibited)",
    "clientTransferProhibited (https://www.icann.org/epp#clientTransferProhibited)",
    "clientDeleteProhibited (https://www.icann.org/epp#clientDeleteProhibited)",
    "serverUpdateProhibited (https://www.icann.org/epp#serverUpdateProhibited)",
    "serverTransferProhibited (https://www.icann.org/epp#serverTransferProhibited)",
    "serverDeleteProhibited (https://www.icann.org/epp#serverDeleteProhibited)"
    ],
    "name": "DNS Admin",
    "dnssec": "unsigned",
    "city": "Mountain View",
    "expiration_date": [
    "2018-03-10 00:00:00",
    "2018-03-09 00:00:00"
    ],
    "zipcode": "94043",
    "domain_name": [
    "APPSPOT.COM",
    "appspot.com"
    ],
    "country": "US",
    "whois_server": "whois.markmonitor.com",
    "state": "CA",
    "registrar": "MarkMonitor, Inc.",
    "referral_url": "http://www.markmonitor.com",
    "address": "2400 E. Bayshore Pkwy",
    "name_servers": [
    "NS1.GOOGLE.COM",
    "NS2.GOOGLE.COM",
    "NS3.GOOGLE.COM",
    "NS4.GOOGLE.COM",
    "ns1.google.com",
    "ns4.google.com",
    "ns2.google.com",
    "ns3.google.com"
    ],
    "org": "Google Inc.",
    "creation_date": [
    "2005-03-10 00:00:00",
    "2005-03-09 18:27:55"
    ],
    "emails": [
    "abusecomplaints@markmonitor.com",
    "dns-admin@google.com"
    ]
    }
    As you can see, this domain belongs to Google.
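
    If you only need a particular field, the returned record also supports attribute access; a small sketch re-using the appspot.com query above (field names taken from the output shown):

    import whois
    record = whois.whois("appspot.com")
    print record.registrar      # MarkMonitor, Inc.
    print record.name_servers   # the Google name servers listed above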

    ### Writing Your First Crawler
    #### Downloading a Page
    To crawl a web page we first have to download it. The examples below use Python's urllib2 module to download a URL.
    1. Basic version
    ```python
    import urllib2

    def download(url):
        print 'Downloading:', url
        try:
            html = urllib2.urlopen(url).read()
        except urllib2.URLError as e:
            print 'Downloading error:', e.reason
            html = None
        return html
    ```
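
    A quick usage sketch of the basic version (http://httpstat.us/200 is just an example endpoint that returns a short "200 OK" body):

    if __name__ == '__main__':
        html = download("http://httpstat.us/200")
        print html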
    
    2. Retrying downloads
      When crawling, the target server may return a server-side error such as 500. Since the problem lies with the server rather than with our request, the same download may well succeed if we simply try again, so it is worth retrying on 5xx errors.
      Example:
    import urllib2
    def download(url, num_retries=3):
        print 'Downloading:', url
        try:
            html = urllib2.urlopen(url).read()
        except urllib2.URLError as e:
            print 'Downloading error:', e.reason
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    return download(url, num_retries - 1)
        return html
    

    Let's try accessing http://httpstat.us/500, which always returns a 500 error. The code:

    if __name__ == '__main__':
        download("http://httpstat.us/500")
        pass
    

    Execution result:

    Downloading: http://httpstat.us/500
    Downloading error: Internal Server Error
    Downloading: http://httpstat.us/500
    Downloading error: Internal Server Error
    Downloading: http://httpstat.us/500
    Downloading error: Internal Server Error
    Downloading: http://httpstat.us/500
    Downloading error: Internal Server Error
    

    As you can see, the crawler retried 3 times (4 attempts in total) before giving up, so the retry logic works as intended.

    3. Setting a user agent
      When Python accesses a website it identifies itself with the default user agent Python-urllib/2.7 (2.7 being the Python version). Some sites refuse downloads from this default agent, so to crawl them normally we need to set our own user agent.
    import urllib2
    def download(url, num_retries=3, user_agent="wswp"):
        print 'Downloading:', url
        headers = {'User-agent': user_agent}
        request = urllib2.Request(url, headers=headers)
        try:
            html = urllib2.urlopen(request).read()
        except urllib2.URLError as e:
            print 'Downloading error:', e.reason
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    return download(url, num_retries - 1, user_agent)
        return html
    

    Using the custom user agent:

    if __name__ == '__main__':
        user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
        html = download("http://www.meetup.com", user_agent=user_agent)
        print html
        pass
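
    To check that the header is really being sent, you can point the downloader at an echo service; a small sketch assuming httpbin.org is reachable (its /user-agent endpoint echoes the request's User-Agent header back as JSON):

    print download("http://httpbin.org/user-agent", user_agent="wswp")
    # expected to print something like: {"user-agent": "wswp"}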
    

    #### Link Crawling

    A link crawler can follow every link on a site, but usually we only want to follow the links we are interested in, so we use a regular expression to filter them. The code is as follows:

    import urllib2
    import re
    import urlparse
    def download(url, num_retries=3, user_agent="wswp"):
        print 'Downloading:', url
        headers = {'User-agent': user_agent}
        request = urllib2.Request(url, headers=headers)
        try:
            html = urllib2.urlopen(request).read()
        except urllib2.URLError as e:
            print 'Downloading error:', e.reason
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    return download(url, num_retries - 1, user_agent)
        return html
    
    def link_crawler(seed_url, link_regex):
        crawl_queue = [seed_url]
        seen = set(crawl_queue)
        while crawl_queue:
            url = crawl_queue.pop()
            html = download(url)
            for link in get_links(html):
                if re.match(link_regex, link):
                    print link
                    # check if have already seen this link
                    link = urlparse.urljoin(seed_url, link)
                    if link not in seen:
                        seen.add(link)
                        crawl_queue.append(link)
    
    def get_links(html):
        webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
        return webpage_regex.findall(html)
    
    if __name__ == '__main__':
        user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
        
        link_crawler('http://baozoumanhua.com/video_channels/1745', '/(videos)')
        pass
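
    As a quick sanity check on the regex in get_links, here is a tiny sketch run against a hand-written HTML snippet (the snippet is made up purely for illustration):

    sample_html = '<a href="/videos/1">one</a> <a class="nav" href="/videos/2">two</a>'
    print get_links(sample_html)   # prints ['/videos/1', '/videos/2']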
    

    #### Advanced Features

    1. Proxy support

    Some sites block visitors from many countries, so we may need to go through a proxy to reach them. The snippet below adds proxy support using urllib2:

    proxy = ...
    opener = urllib2.build_opener()
    proxy_params = {urlparse.urlparse(url).scheme: proxy}
    opener.add_handler(urllib2.ProxyHandler(proxy_params))
    response = opener.open(request)
    

    Integrating the code above into our download function:

    def download(url, num_retries=3, user_agent="wswp", proxy=None):
        print 'Downloading:', url
        headers = {'User-agent': user_agent}
        request = urllib2.Request(url, headers=headers)
    
        # add proxy
        opener = urllib2.build_opener()
        if proxy:
            proxy_param = {urlparse.urlparse(url).scheme: proxy}
            opener.add_handler(urllib2.ProxyHandler(proxy_param))
    
        try:
            html = opener.open(request).read()
        except urllib2.URLError as e:
            print 'Downloading error:', e.reason
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    return download(url, num_retries - 1, user_agent, proxy)
        return html
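
    A usage sketch for the proxy parameter (the address below is just a placeholder, not a working proxy; substitute one you actually have access to):

    if __name__ == '__main__':
        html = download("http://www.meetup.com", proxy="127.0.0.1:8118")
        print html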
    
    2. Throttling downloads

    Crawling too fast risks getting the crawler banned or overloading the target server. To behave more like a normal user and avoid both risks, we add a delay between two consecutive downloads, throttling the crawler. An example implementation:

    
    import time
    import datetime
    import urlparse


    class Throttle:
        """Add a delay between downloads to the same domain """
        def __init__(self, delay):
            # amount of delay between downloads for each domain
            self.delay = delay
            # timestamp of the last access to each domain
            self.domains = {}

        def wait(self, url):
            domain = urlparse.urlparse(url).netloc
            last_accessed = self.domains.get(domain)

            if self.delay > 0 and last_accessed is not None:
                sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).seconds
                if sleep_secs > 0:
                    # sleep so that the full delay elapses before the next access
                    time.sleep(sleep_secs)
            # record the time of this access to the domain
            self.domains[domain] = datetime.datetime.now()
    

    The Throttle class records the last time each domain was accessed; if the time since that access is shorter than the specified delay, it sleeps. By calling the Throttle object before every download we rate-limit the crawler. Integrated with the earlier download code:

    # add before each download
    throttle = Throttle(delay)
    ...
    throttle.wait(url)
    result = download(url, num_retries=num_retries, user_agent=user_agent, proxy=proxy)
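
    Put together, a minimal sketch of a throttled crawl loop might look like this (the two-second delay and the URL list are purely illustrative):

    throttle = Throttle(2)   # wait at least 2 seconds between hits to the same domain
    for url in ["http://httpstat.us/200", "http://httpstat.us/201"]:
        throttle.wait(url)   # sleeps if this domain was accessed too recently
        html = download(url)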
    
    
    3. Avoiding crawler traps

    So far the crawler follows whatever links it can find on the site. But sometimes the current page links to a next page, which links to the next, and so on without end, so the crawler could keep following links forever. This situation is called a crawler trap.

    A simple way to avoid it is to record how many links were followed to reach the current page, i.e. its depth. Once the maximum depth is reached, the crawler stops adding links found on that page to the queue. We add this feature to the link-following code from before:

    import urllib2
    import re
    import urlparse
    
    # adds the ability to limit the crawl depth
    def download(url, num_retries=3, user_agent="wswp"):
        print 'Downloading:', url
        headers = {'User-agent': user_agent}
        request = urllib2.Request(url, headers=headers)
        try:
            html = urllib2.urlopen(request).read()
        except urllib2.URLError as e:
            print 'Downloading error:', e.reason
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    return download(url, num_retries - 1, user_agent)
        return html
    
    def link_crawler(seed_url, link_regex, max_depth=2):
        crawl_queue = [seed_url]
        # seen is now a dict so that we can also record each page's depth
        seen = {}
        seen[seed_url] = 0
        
        while crawl_queue:
            url = crawl_queue.pop()
            html = download(url)
            # look up this page's depth and check whether the maximum has been reached
            depth = seen[url]
            if depth != max_depth:
                for link in get_links(html):
                    if re.match(link_regex, link):
                        print link
                        # check if have already seen this link
                        link = urlparse.urljoin(seed_url, link)
                        if link not in seen:
                            seen[link] = depth + 1
                            crawl_queue.append(link)
    
    def get_links(html):
        webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
        return webpage_regex.findall(html)
    
    if __name__ == '__main__':
        user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
        
        link_crawler('http://baozoumanhua.com/video_channels/1745', '/(videos)')
        pass
    

    Of course, if you want to disable this check, just set max_depth to a negative number.
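
    For example, a call like this sketch would crawl with no depth limit (same seed URL and link pattern as above):

    link_crawler('http://baozoumanhua.com/video_channels/1745', '/(videos)', max_depth=-1)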

    4. Final version
    import re
    import urlparse
    import urllib2
    import time
    from datetime import datetime
    import robotparser
    import Queue
    
    
    def link_crawler(seed_url, link_regex=None, delay=5, max_depth=-1, max_urls=-1, headers=None, user_agent='wswp', proxy=None, num_retries=1):
        """Crawl from the given seed URL following links matched by link_regex
        """
        # the queue of URL's that still need to be crawled
        crawl_queue = Queue.deque([seed_url])
        # the URL's that have been seen and at what depth
        seen = {seed_url: 0}
        # track how many URL's have been downloaded
        num_urls = 0
        rp = get_robots(seed_url)
        throttle = Throttle(delay)
        headers = headers or {}
        if user_agent:
            headers['User-agent'] = user_agent
    
        while crawl_queue:
            url = crawl_queue.pop()
            # check url passes robots.txt restrictions
            if rp.can_fetch(user_agent, url):
                throttle.wait(url)
                html = download(url, headers, proxy=proxy, num_retries=num_retries)
                links = []
    
                depth = seen[url]
                if depth != max_depth:
                    # can still crawl further
                    if link_regex:
                        # filter for links matching our regular expression
                        links.extend(link for link in get_links(html) if re.match(link_regex, link))
    
                    for link in links:
                        link = normalize(seed_url, link)
                        # check whether already crawled this link
                        if link not in seen:
                            seen[link] = depth + 1
                            # check link is within same domain
                            if same_domain(seed_url, link):
                                # success! add this new link to queue
                                crawl_queue.append(link)
    
                # check whether have reached downloaded maximum
                num_urls += 1
                if num_urls == max_urls:
                    break
            else:
                print 'Blocked by robots.txt:', url
    
    
    class Throttle:
        """Throttle downloading by sleeping between requests to same domain
        """
        def __init__(self, delay):
            # amount of delay between downloads for each domain
            self.delay = delay
            # timestamp of when a domain was last accessed
            self.domains = {}
            
        def wait(self, url):
            domain = urlparse.urlparse(url).netloc
            last_accessed = self.domains.get(domain)
    
            if self.delay > 0 and last_accessed is not None:
                sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
                if sleep_secs > 0:
                    time.sleep(sleep_secs)
            self.domains[domain] = datetime.now()
    
    
    def download(url, headers, proxy, num_retries, data=None):
        print 'Downloading:', url
        request = urllib2.Request(url, data, headers)
        opener = urllib2.build_opener()
        if proxy:
            proxy_params = {urlparse.urlparse(url).scheme: proxy}
            opener.add_handler(urllib2.ProxyHandler(proxy_params))
        try:
            response = opener.open(request)
            html = response.read()
            code = response.code
        except urllib2.URLError as e:
            print 'Download error:', e.reason
            html = ''
            if hasattr(e, 'code'):
                code = e.code
                if num_retries > 0 and 500 <= code < 600:
                    # retry 5XX HTTP errors
                    return download(url, headers, proxy, num_retries-1, data)
            else:
                code = None
        return html
    
    
    def normalize(seed_url, link):
        """Normalize this URL by removing hash and adding domain
        """
        link, _ = urlparse.urldefrag(link) # remove hash to avoid duplicates
        return urlparse.urljoin(seed_url, link)
    
    
    def same_domain(url1, url2):
        """Return True if both URL's belong to same domain
        """
        return urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc
    
    
    def get_robots(url):
        """Initialize robots parser for this domain
        """
        rp = robotparser.RobotFileParser()
        rp.set_url(urlparse.urljoin(url, '/robots.txt'))
        rp.read()
        return rp
            
    
    def get_links(html):
        """Return a list of links from html 
        """
        # a regular expression to extract all links from the webpage
        webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
        # list of all links from the webpage
        return webpage_regex.findall(html)
    
    
    if __name__ == '__main__':
        user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
        
        link_crawler('http://baozoumanhua.com/video_channels/1745', '/(videos)', delay=0, num_retries=1, user_agent=user_agent)
    

    That is the version with all of the features above combined. Now we can run the crawler and see how it behaves: in a terminal, run python xxx.py (where xxx.py is the filename you saved the code as).
