Approach

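I originally wanted to crawl WeChat articles, but then found that Sogou's WeChat search is a static page, which took all the fun out of it.

Still, if you want to build something, there is always something new to learn along the way. For example, saving the article content as Markdown. I also noticed that Sogou search now shows only the first 10 articles of an official account, so crawling WeChat articles through Sogou looks like a dead end. From a quick look, the articles are still reachable through the WeChat client but not through the web version; the PC client works, and the mobile client should too. Logging in is fairly easy to handle; the rest will come down to packet capture, which I'll try when I get the chance.

I've heard that both WeChat and Zhihu ban IPs, so my next step is to build a proxy pool first.

Back to the topic. I won't go into how to crawl the articles; this post is mainly about converting the article content to Markdown and saving it locally.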

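HTML to Markdown

I found two HTML-to-Markdown libraries online: tomd and html2text. I tried both and went with the former, which was written by a developer from Chengdu. Whichever one you pick, though, the converted output has some rough edges: images and videos are not handled, stray HTML tags are left behind, and so on.

So before passing the content to tomd, I preprocess it with BeautifulSoup. The steps are: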

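Extract the article title, and tidy up the author, official account name, and publish time fields.
For <img> tags, extract the image name (or make one up if there is none) and the image address, replace the tag with a structure like ![image name](image address), and save the image to the _images directory under the current location.
For iframe tags, which are mostly used for videos, replace the tag with a structure like [video name](video address), which is later converted to a hyperlink.
Remove all <br/> and <br></br> tags.
Remove the reward section at the end of the article.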
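The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration rather than the full script: it works on a made-up HTML snippet, skips the actual image download, and uses the built-in `html.parser` so that only BeautifulSoup is needed. The function name `preprocess`, its parameters, and the sample markup are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Made-up sample that mimics the structure of an article page
SAMPLE_HTML = """
<div id="img-content">
  <p>Intro<br/>text</p>
  <img data-src="http://example.com/a.png" data-type="png"/>
  <iframe data-src="http://example.com/v"></iframe>
  <div class="reward_area">Tip the author</div>
</div>
"""


def preprocess(html, image_dir="_images", base_name="article"):
    """Rewrite images, videos, <br> tags and the reward area before conversion."""
    soup = BeautifulSoup(html, "html.parser")
    root = soup.find(id="img-content")

    # Images: replace each <img> with a <span> holding Markdown image syntax
    for index, image in enumerate(root.find_all("img"), start=1):
        extension = image.get("data-type") or "png"
        name = "{0}_{1}.{2}".format(base_name, index, extension)
        span = soup.new_tag("span")
        span.string = "![{0}]({1}/{0})".format(name, image_dir)
        image.replace_with(span)

    # Videos: replace each <iframe> with a <span> holding a Markdown link
    for video in root.find_all("iframe"):
        source = video.get("data-src")
        if not source:
            continue
        span = soup.new_tag("span")
        span.string = "[video]({0})".format(source)
        video.replace_with(span)

    # Line breaks: drop every <br>
    for br in root.find_all("br"):
        br.decompose()

    # Reward area: remove it only if the article has one
    for class_name in ("reward_area", "reward_qrcode_area"):
        node = root.find(class_=class_name)
        if node is not None:
            node.decompose()

    return str(root)


cleaned = preprocess(SAMPLE_HTML)
print(cleaned)
```

The returned string is what would then be handed to the converter (tomd in this post).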

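The reason images and videos are replaced with span tags is that BeautifulSoup does not let you drop a raw string of markup straight into the tree (special characters get escaped into entities such as &lt; and &amp;). Fortunately, bs also provides methods for modifying the HTML structure, such as insert_after() and decompose(); see the official documentation for details.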
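A small sketch of the behavior described above, on a made-up snippet: text assigned through `.string` is escaped when the tree is serialized, while `insert_after()` and `decompose()` change the structure itself. The sample markup and variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><h2>Title</h2><p id="meta">old meta</p></div>', "html.parser"
)

# Text set through .string is treated as text, so markup characters are escaped
paragraph = soup.new_tag("p")
paragraph.string = "<b>bold & raw</b>"

# insert_after() places the new tag right after the heading
soup.h2.insert_after(paragraph)

# decompose() removes a tag and its contents from the tree
soup.find(id="meta").decompose()

result = str(soup)
print(result)
```

Because Markdown image and link syntax contains none of the escaped characters, wrapping it in a new tag's `.string` passes through serialization unchanged.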

Later on, I plan to write my own HTML-to-Markdown library, to be called ohhtml2md.

Code

Result

After downloading, it looks like this:

wechat-download.jpg

With an editor that renders Markdown directly, such as Typora, you can read the articles right away:

wechat-article.gif

However, the line breaks in some of the content are still wrong.

wechat.py

import os
from random import choice

import requests
import tomd
from bs4 import BeautifulSoup

import Configure  # local module; expected to define the FakeUserAgents list

# Pick a random User-Agent once per run
header = {'user-agent': choice(Configure.FakeUserAgents)}

def getOnePageURLs():
    payload = {
        'type':2,
        's_from':'input',
        'query':'garmin fenix 5',
        'ie':'utf8'
    }

    url = "http://weixin.sogou.com/weixin"
    content = None
    try:
        response = requests.get(url, headers=header, params=payload)
        if response.status_code == requests.codes.ok:
            content = response.text
    except Exception as e:
        print(e)

    # Nothing to parse if the request failed
    if content is None:
        return

    soup = BeautifulSoup(content, 'lxml')

    # The result list is missing when the search page is not returned as expected
    container = soup.find(class_='news-list')
    if container is None:
        return

    for new in container.find_all('li'):
        getContent(new.a.get('href').strip())

def getContent(url):
    content = None
    try:
        response = requests.get(url, headers=header)
        if response.status_code == requests.codes.ok:
            content = response.text
    except Exception as e:
        print(e)

    # Skip this article if the request failed
    if content is None:
        return

    soup = BeautifulSoup(content, 'lxml')

    os.makedirs("Download", exist_ok=True)

    html_content_parser(soup)
    
def html_content_parser(soup):
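    html_content = soup.find(id='img-content')
    if html_content is None:
        # No article body on this page; nothing to convert
        return

    # Get the title and build the file name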
    title = html_content.find(class_='rich_media_title')
    title.string = title.text.strip()

    filename = title.text.strip().replace('|','-') + '.md'
    
    # Extract the account name, publish time, and author
    profile_name = html_content.find(class_='rich_media_meta').a.text.strip()
    publish_time = html_content.find(id='meta_content').em.text.strip()

    author = html_content.find(id='meta_content').find('p',class_='rich_media_meta_primary')
    author = author.text.strip()[3:] if author else None

    # Insert account name ("公众号"), publish time ("发布时间") and author ("作者") after the title
    new_tag = soup.new_tag('p')
    new_tag.string = "公众号: " + profile_name
    title.insert_after(new_tag)

    new_tag2 = soup.new_tag('p')
    new_tag2.string = "发布时间: " + publish_time
    new_tag.insert_after(new_tag2)

    if author:
        new_tag3 = soup.new_tag('p')
        new_tag3.string = "作者: " + author
        new_tag2.insert_after(new_tag3)

    html_content.find(id='meta_content').decompose()
    
    # Remove the reward section (not every article has one)
    for class_name in ('reward_area', 'reward_qrcode_area'):
        reward = html_content.find(class_=class_name)
        if reward is not None:
            reward.decompose()

    # Handle images
    ## Target directory
    dir_name = "Download/_images"
    os.makedirs(dir_name, exist_ok=True)

    ## Download each image and point the Markdown at the local copy
    cnt = 1
    
    images = html_content.find_all('img')

    for image in images:
        img_src = image.get('data-src')
        if not img_src:
            # No download address; leave this tag untouched
            continue
        img_type = image.get('data-type')
        img_name = "{0:s}_{1:d}.{2:s}".format(filename[:-3], cnt, img_type if img_type else 'png')
        cnt += 1

        file_path = "Download/_images/{0:s}".format(img_name)
        # Request first so a failed request does not leave an empty file behind
        response = requests.get(img_src, stream=True)
        with open(file_path, 'wb') as file:
            for block in response.iter_content(1024):
                if block:
                    file.write(block)

        # Wrap the Markdown image syntax in a <span> so it survives the conversion
        tag = soup.new_tag('span')
        tag.string = "![](_images/{0:s})".format(img_name)
        image.replace_with(tag)

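    # Handle videos
    videos = html_content.find_all('iframe')
    for video in videos:
        video_src = video.get('data-src')
        if not video_src:
            # No video address; leave this tag untouched
            continue

        # The link text "此处是视频" means "video here"
        tag = soup.new_tag('span')
        tag.string = "[此处是视频]({0:s})".format(video_src)
        video.replace_with(tag)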
    

    # Remove <br/> and <br></br>
    for br in html_content.find_all('br'):
        br.decompose()

    mdText = str(html_content).replace('<br/>','')

    mdText = tomd.Tomd(mdText).markdown

    with open("Download/"+filename, 'w', encoding='utf8') as file:
        file.write(mdText)


if __name__ == '__main__':
    getOnePageURLs()
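
The script above only replaces `|` when it builds the file name from the article title. Titles can contain other characters that are not valid in file names on some systems, such as `/` or `:`. Here is a minimal sketch of a more defensive helper using only the standard library; `safe_filename` and its parameters are hypothetical names, not part of the original script.

```python
import re

# Characters that are commonly rejected in file names, plus control whitespace
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]+')


def safe_filename(title, extension=".md", fallback="untitled"):
    """Turn an article title into a file name that most file systems accept."""
    # Replace each run of invalid characters with a single hyphen
    cleaned = _INVALID_CHARS.sub("-", title)
    # Trim spaces, dots and hyphens left over at either end
    cleaned = cleaned.strip(" .-")
    return (cleaned or fallback) + extension


print(safe_filename("Garmin | fenix 5: review?"))
```

In the script, this could stand in for the `replace('|','-')` call when `filename` is built.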