Constraints

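Crawlers can be written in Python, Java, PHP, C#, Go, and other languages, but crawling a website is subject to some constraints. Laws and regulations on web data collection are still evolving, both in China and abroad, and they encourage strictly limiting the crawl rate to reduce the load on the target site's servers.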
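One way to limit the crawl rate is to pause between consecutive requests to the same domain. The following is a minimal Python 3 sketch of that idea. The class name Throttle, its wait method, and the delay value are illustrative assumptions, not something defined in these notes.

```python
import time
from urllib.parse import urlparse


class Throttle:
    """Pause between consecutive requests to the same domain."""

    def __init__(self, delay):
        self.delay = delay      # minimum number of seconds between requests per domain
        self.last_access = {}   # domain -> time.monotonic() value of the last request

    def wait(self, url):
        """Sleep if this domain was accessed less than `delay` seconds ago.

        Returns the number of seconds slept.
        """
        domain = urlparse(url).netloc
        last = self.last_access.get(domain)
        slept = 0.0
        if self.delay > 0 and last is not None:
            remaining = self.delay - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        # record the access time after any sleep
        self.last_access[domain] = time.monotonic()
        return slept
```

A crawler would create one Throttle and call its wait(url) method immediately before each download, so that requests to different domains are not delayed by each other.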

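There are three common ways to crawl a website, and each is illustrated below with an example. The code uses Python 2.7; later updates to these notes may use either version. Learning needs some output, and the palest ink is better than the best memory, so publishing these scattered notes also serves as a summary and as practice.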

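Note: these are study notes from my early reading of a web-scraping book, written for Python 2.7. After upgrading to 3.5, note that in Python 3 the 2.x urllib2 library became part of urllib, which is split into submodules such as urllib.request and urllib.error.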
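The examples in these notes call a download(url) helper without showing it. Below is a minimal Python 3 sketch of such a helper built on urllib.request, assuming it should return the page text, or None when the download fails. The user agent string 'my-crawler' and the timeout value are illustrative assumptions.

```python
import urllib.request


def download(url, user_agent='my-crawler', timeout=10):
    """Return the body of `url` as text, or None if the download fails."""
    # 'my-crawler' is a placeholder user agent, not a value from the notes
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # fall back to UTF-8 when the response does not declare a charset
            charset = response.headers.get_content_charset() or 'utf-8'
            return response.read().decode(charset, errors='replace')
    except OSError:
        # urllib.error.URLError and HTTPError are subclasses of OSError,
        # as are socket timeouts
        return None
```

In Python 2.7 the equivalent calls live in urllib2 (urllib2.Request and urllib2.urlopen).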

1. Crawling the sitemap

import re


def crawl_sitemap(url):
    # 1. Sitemap crawler
    # Download every page listed in the sitemap found in the example site's robots.txt file.
    # A simple regular expression extracts the URLs from the <loc> tags
    # (a CSS selector would be more robust).

    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...


crawl_sitemap(url_sitemap)
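The notes mention that a regular expression is not the most robust way to read the sitemap. As one alternative, here is a Python 3 sketch that uses the standard library XML parser instead of the CSS selector the notes suggest. The function name extract_sitemap_links and the sample sitemap are made up for illustration.

```python
import xml.etree.ElementTree as ET


def extract_sitemap_links(sitemap_xml):
    """Return the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(sitemap_xml)
    links = []
    for element in root.iter():
        # tags look like '{namespace}loc' when the document declares a namespace
        if element.tag.rsplit('}', 1)[-1] == 'loc' and element.text:
            links.append(element.text.strip())
    return links


# Hypothetical sitemap content used only to demonstrate the function.
sample_sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>'
    '<url><loc> http://example.webscraping.com/view/Australia-2 </loc></url>'
    '</urlset>'
)
sample_links = extract_sitemap_links(sample_sitemap)
```

Unlike the regular expression, the parser tolerates whitespace and line breaks inside the <loc> tags and decodes XML entities such as &amp; in the URLs.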

2. Iterating over each page's database ID

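import itertools

# Exploit a weakness in the site's structure to reach all of the content more easily.
# Below are the URLs of some sample countries. They differ only at the end, in the country name and the ID.
# Web servers usually ignore the name string and use only the ID to look up the record in the database, so the page still loads without it.
# http://example.webscraping.com/view/Afghanistan-1
# http://example.webscraping.com/view/Australia-2
# http://example.webscraping.com/view/Brazil-3

# The code below uses this trick.
# itertools.count(start, step)
# start defaults to 0
# step defaults to 1
# Returns evenly spaced values beginning at start and incremented by step
for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        break
    else:
        # success - can scrape the result
        pass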

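# This code iterates over the IDs until a download fails, assuming that the last country page has been reached at that point.
# The flaw in this approach is that some records may have been deleted, so the database IDs are not necessarily consecutive.
# In that case the crawler exits as soon as it hits a gap. The improved code below exits only after several consecutive download errors.
# Even so, this is not an efficient way to crawl.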

# maximum number of consecutive download errors allowed
max_errors = 5
# current number of consecutive download errors
num_errors = 0
for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        # received an error trying to download this webpage
        num_errors += 1
        if num_errors == max_errors:
            # reached maximum number of
            # consecutive errors so exit
            break
    else:
        # success - can scrape the result
        # ...
        # reset the counter, since the errors are no longer consecutive
        num_errors = 0
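To see how the consecutive-error counter behaves when IDs are missing, the same loop can be wrapped in a generator and run against a fake downloader. This is a Python 3 sketch. The names iterate_ids and fake_download, and the set of existing IDs, are hypothetical and exist only for the demonstration.

```python
import itertools


def iterate_ids(download, url_template, max_errors=5):
    """Yield (page_id, html) pairs until max_errors consecutive downloads fail."""
    num_errors = 0
    for page in itertools.count(1):
        html = download(url_template % page)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                break
        else:
            # a successful download resets the consecutive-error counter
            num_errors = 0
            yield page, html


# Hypothetical stand-in for a real downloader: only these IDs "exist".
EXISTING_IDS = {1, 2, 4, 7}


def fake_download(url):
    page_id = int(url.rsplit('-', 1)[-1])
    return 'page %d' % page_id if page_id in EXISTING_IDS else None


template = 'http://example.webscraping.com/view/-%d'
found_with_2 = [page for page, _ in iterate_ids(fake_download, template, max_errors=2)]
found_with_3 = [page for page, _ in iterate_ids(fake_download, template, max_errors=3)]
```

With max_errors=2 the gap at IDs 5 and 6 stops the crawl before ID 7 is reached, while max_errors=3 gets past it. The threshold has to be larger than the longest run of deleted records, which is generally not known in advance.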

3. Following page links

Link crawler

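# The two techniques above should be used whenever they are available, because they minimize the number of pages that need to be downloaded.
# For other websites, the crawler has to behave like a user, following links to reach the content of interest.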


import re
import urlparse  # Python 2 module; in Python 3 use urllib.parse


def link_crawler(seed_url, link_regex):
    # Crawl from the given seed URL, following links matched by link_regex
    crawl_queue = [seed_url]
    # keep track of which URLs have been seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            # skip pages that failed to download
            continue
        for link in get_links(html):
            # check if link matches expected regex
            if re.match(link_regex, link):
                # form absolute link
                link = urlparse.urljoin(seed_url, link)
                # check if have already seen this link
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)


def get_links(html):
    # Return a list of links from html
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)
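The link crawler can be tried without touching a real site by passing the downloader in as an argument. The Python 3 sketch below does that with a small in-memory site. The SITE dictionary and its pages are invented for illustration, and the extra download parameter is an assumption, not part of the function above.

```python
import re
from collections import deque
from urllib.parse import urljoin

LINK_RE = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)


def link_crawler(seed_url, link_regex, download):
    """Crawl breadth-first from seed_url, following links that match link_regex.

    Returns the URLs in the order they were downloaded.
    """
    crawl_queue = deque([seed_url])
    seen = {seed_url}
    visited = []
    while crawl_queue:
        url = crawl_queue.popleft()
        html = download(url)
        if html is None:
            continue  # skip pages that failed to download
        visited.append(url)
        for link in LINK_RE.findall(html):
            if re.match(link_regex, link):
                # resolve relative links against the page they were found on
                link = urljoin(url, link)
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)
    return visited


# Hypothetical in-memory site used in place of real HTTP requests.
SITE = {
    'http://example.webscraping.com/':
        '<a href="/index/1">next</a> <a href="/about">about</a>',
    'http://example.webscraping.com/index/1':
        '<a href="/view/Afghanistan-1">A</a> <a href="/index/1">self</a>',
    'http://example.webscraping.com/view/Afghanistan-1':
        '<a href="/">home</a>',
}

visited = link_crawler('http://example.webscraping.com/', r'/(index|view)', SITE.get)
```

This sketch differs from the Python 2 version in two ways: it uses a deque and popleft() so pages are visited breadth-first rather than last-in-first-out, and it joins each link against the current page URL instead of the seed URL, which matters once relative links without a leading slash appear. The seen set is what keeps the self-link on /index/1 from being queued again.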