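Page analysis

Analyzing Bing's homepage is fairly simple: use Chrome's developer tools to locate the element, then look up its background image link. The analysis also shows that the background is set dynamically: the style is actually applied by JavaScript.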

(Screenshot: Bing homepage)
The DOM node is named bgDiv. Searching for that name in the Sources tab gives the following result:

(Screenshot: bgDiv in the HTML source)

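Given that, we extract the image link with a regular expression.

re.compile(r'g_img={(.*?),.*?};'): the non-greedy group captures everything up to the first comma. The rest of the pattern could actually be dropped, but I matched it once more just in case.
(Screenshot: the regular expression)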
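
To make the regex step concrete, here is a minimal sketch that runs the same pattern against a made-up page fragment. SAMPLE_HTML and the helper extract_image_path are assumptions invented for illustration; the real Bing markup may look different and may have changed since this was written.

```python
import re

# Same pattern as in the crawler: capture everything between "g_img={" and the first comma.
PAT = re.compile(r'g_img={(.*?),.*?};')

# Hypothetical page fragment, invented for illustration only.
SAMPLE_HTML = (
    'var x=1;'
    'g_img={url: "/images/Example_1920x1080.jpg",id:\'bgDiv\',d:\'200\'};'
    'var y=2;'
)


def extract_image_path(html):
    """Return the image path from the first g_img block, or None if there is none."""
    matches = PAT.findall(html)
    if not matches:
        return None
    # The captured group looks like: url: "/images/Example_1920x1080.jpg"
    return matches[0].replace('"', '').replace('url:', '').strip()


if __name__ == '__main__':
    print(extract_image_path(SAMPLE_HTML))  # /images/Example_1920x1080.jpg
```

Returning None when nothing matches lets the caller skip the download instead of failing on a missing URL.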

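Writing the code

I wanted to get familiar with aiohttp, so this time I wrote the code with it. It's fairly simple.

pip install aiohttp: install the aiohttp library first.
pip install apscheduler: install apscheduler, which triggers tasks on a schedule. See its documentation; the official examples are quite good.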

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
 @File       : bing_crawler.py
 @Time       : 2017/9/28 0028 20:52
 @Author     : Empty Chan
 @Contact    : chen19941018@gmail.com
 @Description:
"""
import re
import asyncio
import aiohttp
import click
import os
import time
from apscheduler.schedulers.asyncio import AsyncIOScheduler

BASE_URL = 'http://cn.bing.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
PAT = re.compile(r'g_img={(.*?),.*?};')
PATH = os.path.abspath('.')
TODAY = time.strftime('%Y-%m-%d', time.localtime(time.time()))  # today's date as YYYY-MM-DD, computed once at import time


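async def run():  # scheduler entry point; async def replaces the generator-based @asyncio.coroutine, which was removed in Python 3.11
    click.echo('crawl %s bing picture....' % TODAY)
    await crawl()  # suspend run() here until crawl() finishes, handing control back to the event loop
    click.echo('crawl finished')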


async def crawl():  # native coroutine syntax from Python 3.5: async def replaces @asyncio.coroutine, await replaces yield from
    async with aiohttp.ClientSession(headers=headers) as session:  # create the session
        async with session.get(BASE_URL) as resp:
            text = await resp.text()  # suspend crawl() until the body is read; the loop can run other tasks meanwhile
            click.echo(resp.status)
            if resp.status != 200:
                return
        pat = PAT.findall(text)
        if not pat:  # no g_img block found, so there is nothing to download
            click.echo('picture url not found')
            return
        img = pat[0].replace('"', '').replace('url:', '').strip()
        url = BASE_URL + img
        click.echo(url)
        click.echo(PATH)
        save_dir = os.path.join(PATH, 'Bing')  # portable replacement for the Windows-only '.\\Bing'
        os.makedirs(save_dir, exist_ok=True)
        async with session.get(url) as res:
            with open(os.path.join(save_dir, '%s.jpg' % TODAY), 'wb') as f:
                while True:
                    chunk = await res.content.read(512)  # stream the image in 512-byte chunks
                    if not chunk:
                        break
                    f.write(chunk)
                click.echo('save picture ok!')


if __name__ == '__main__':
    loop = asyncio.new_event_loop()  # create the loop explicitly; newer Python versions deprecate implicit creation via get_event_loop()
    asyncio.set_event_loop(loop)
    scheduler = AsyncIOScheduler(event_loop=loop)  # asyncio-based scheduler, bound to our event loop
    job = scheduler.add_job(run, 'cron', second='*/5')  # register run() as a cron job that fires every 5 seconds
    try:
        scheduler.start()
    except (KeyboardInterrupt, SystemExit):
        scheduler.shutdown()
    print('Press Ctrl+{0} to exit'.format('Break' if os.name == 'nt' else 'C'))

    # Execution will block here until Ctrl+C (Ctrl+Break on Windows) is pressed.
    try:
        loop.run_forever()  # keep the event loop running so scheduled jobs can fire
    except (KeyboardInterrupt, SystemExit):
        pass
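
The comments in the code describe await as handing control back to the event loop. As a rough, standard-library-only sketch of that behavior (worker and main are hypothetical names, not part of the crawler), two coroutines can wait at the same time on one loop:

```python
import asyncio


async def worker(name, delay, log):
    log.append('%s start' % name)
    await asyncio.sleep(delay)  # suspends this coroutine; the loop is free to run the other one
    log.append('%s done' % name)


async def main():
    log = []
    # Run both coroutines concurrently on the same event loop.
    await asyncio.gather(worker('slow', 0.2, log), worker('fast', 0.1, log))
    return log


if __name__ == '__main__':
    print(asyncio.run(main()))
    # ['slow start', 'fast start', 'fast done', 'slow done']
```

Both workers start before either finishes, which is the same reason the crawler can wait on a network response without blocking the scheduler.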

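Not much text this week, and I won't bother with a GitHub repo either; just run the code directly, haha. Happy National Day! If you like my posts, your support is very welcome. I hope to keep writing more fun stuff!
A big shout-out to our great motherland!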