[Reinventing the Wheel] Scraping WeChat Official Account Articles from Sogou

Author: Charles__Jiang | Published: 2016-11-21 13:16


Blog link: http://www.charlesjiang.com/archives/9.html

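Background: I wanted to build an app that collects WeChat official account articles, found that Sogou can search those articles, and got straight to work. The result is http://wxiread.com (a simple site built on a CMS).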

Step 1. Analyze the categories and endpoints

Sogou's WeChat search lives at http://weixin.sogou.com/ (the original post shows a screenshot of the page here).

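Sogou splits the articles into 20 categories: 热门 (Hot), 推荐 (Recommended), 段子手 (Jokesters), 养生堂 (Wellness), 私房话 (Private Talk), and so on. Their paths run from /pcindex/pc/pc_0 to /pcindex/pc/pc_19, for example http://weixin.sogou.com/pcindex/pc/pc_0/1.html, where the 1 in 1.html is the page number. The original post lists the full category mapping in a table.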
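The URL pattern described above can be sketched as a small helper. This is only an illustration: sogou_category_url() is a hypothetical name, and the assumption that page numbers start at 1 comes from the single example URL, not from any documentation.

```php
<?php
// Sketch: build list-page URLs for the Sogou WeChat categories.
// sogou_category_url() is a hypothetical helper, not part of the original code.

function sogou_category_url($category, $page = 1)
{
    // The post describes 20 categories, pc_0 through pc_19.
    if ($category < 0 || $category > 19) {
        throw new InvalidArgumentException("category must be between 0 and 19");
    }
    // Assumption: pages are numbered from 1, as in the example URL.
    if ($page < 1) {
        throw new InvalidArgumentException("page must be 1 or greater");
    }
    return sprintf('http://weixin.sogou.com/pcindex/pc/pc_%d/%d.html', $category, $page);
}

// Print the first page of every category.
for ($c = 0; $c <= 19; $c++) {
    echo sogou_category_url($c), "\n";
}
```

A crawler would loop over these URLs and hand each one to the list parser.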

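Step 2. Analyze the list structure

The article list page is built from li nodes. Each li's id can be treated as the article ID, and its child nodes hold the article title, description, author, author avatar, and so on.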
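To make that structure concrete, here is a minimal sketch that reads the id and title of each li with PHP's built-in DOM extension instead of QueryList. The markup is a made-up stand-in for a list page, not real Sogou output, and the tag and class names in it are assumptions.

```php
<?php
// Sketch: read the id and title of each <li> with PHP's built-in DOM extension.
// The markup below is a made-up stand-in for a list page, not real Sogou output.

$html = <<<HTML
<ul>
  <li id="article_1"><h4>First title</h4><p class="desc">First summary</p></li>
  <li id="article_2"><h4>Second title</h4><p class="desc">Second summary</p></li>
</ul>
HTML;

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$articles = array();
foreach ($xpath->query('//li[@id]') as $li) {
    // Look for the title inside the current li only.
    $titleNode = $xpath->query('.//h4', $li)->item(0);
    $articles[] = array(
        'article_id'    => $li->getAttribute('id'),
        'article_title' => $titleNode ? trim($titleNode->textContent) : '',
    );
}

print_r($articles);
```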

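Step 3. Collect basic article info with QueryList

QueryList is a PHP-based DOM parsing tool. It is powerful and its syntax resembles jQuery; see the official documentation for detailed usage.

The code is as follows: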

protected function get_article_list($url)
{
    // Outer rules: capture each li's id and its inner HTML
    $rules = array(
        'article_id' => array('li', 'id'), // article ID
        'inner_html' => array('li', 'html')
    );

    // Parse each li's inner HTML with a second set of rules
    $data = QueryList::Query($url, $rules)->getData(function($li) {
        $id   = $li['article_id'];
        $info = QueryList::Query($li['inner_html'], array(
            'article_url'   => array(".wx-img-box > a", "href"), // article URL
            'author_url'    => array(".pos-wxrw > a", "href"), // author URL
            'author_avatar' => array(".pos-wxrw > a > p > img", "src"), // author avatar
            'article_thumb' => array(".wx-img-box > a > img", "src"), // article thumbnail
            'author_name'   => array(".pos-wxrw > a > p:eq(1)", "text"), // author name
            'article_title' => array(".wx-news-info2 > h4", "text"), // article title
            'article_des'   => array(".wx-news-info", "text"), // article summary
            'article_create_at' => array(".wx-news-info2 [v]", "v"), // creation time
            // hit count: keep the first run of digits, or 0 when there is none
            'article_hits' => array(".wx-news-info2 > .s-p", "text", "", function($i){ return preg_match('/\d+/', $i, $ms) ? (int)$ms[0] : 0; }),
        ))->data;

        $info[0]['article_id']   = $id;
        $info[0]['article_hits'] = intval($info[0]['article_hits']);

        return $info[0];
    });

    return $data;
}
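The hit-count callback is the one piece of this function that is easy to try in isolation. The sketch below pulls it out into a standalone helper; parse_hits() is a hypothetical name and the sample labels are made up.

```php
<?php
// Sketch: pull the first integer out of a text label, defaulting to 0.
// parse_hits() is a hypothetical name; the sample labels are made up.

function parse_hits($text)
{
    // preg_match returns 1 only when a run of digits was found.
    if (preg_match('/\d+/', $text, $matches) === 1) {
        return (int) $matches[0];
    }
    return 0;
}

var_dump(parse_hits('阅读 1234'));  // int(1234)
var_dump(parse_hits('no digits'));  // int(0)
```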

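Step 4. Fetch article details

Step 3 only collects basic article info (title, author, summary, and so on). To get the details, you have to open the original article URL and scrape its content. The code is as follows: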

/**
* Fetch article details (also returns basic article info used to update the stored record)
* @param string $url
* @return array
*/
protected function get_content($url)
{
    $option = array(
        'http' => array(
            'header' => "Referer:" . self::SET_REFER),
    );
    $html = file_get_contents($url, FALSE, stream_context_create($option));
    if ($html === FALSE) {
        return array(); // request failed
    }

    // Strip WeChat's decoy markup!!! Otherwise the parsed content is garbled
    $html = str_replace("<!--headTrap<body></body><head></head><html></html>-->", "", $html);
    $rules = array(
        'content' => array('#js_content', 'html'), // article body
    );
    $content = QueryList::Query($html, $rules)->getData();

    // Return the text after `var <name> = "` on the matching line, or '' when the variable is missing
    $jsVar = function ($name, $decode = FALSE) use ($html) {
        if (!preg_match("/var {$name} = \".*\"/", $html, $matches)) {
            return '';
        }
        $line = $decode ? html_entity_decode(urldecode($matches[0])) : $matches[0];
        return explode("var {$name} = \"", $line)[1];
    };

    // Original article link: drop the last 4 characters (the closing quote and, presumably, a trailing #rd)
    $orUrl = substr($jsVar('msg_link', TRUE), 0, -4);
    // Original title, read from the article page because the list-page title may be truncated
    $orTitle = substr($jsVar('msg_title'), 0, -1);
    // Original author avatar
    $orAuthAvatar = substr($jsVar('round_head_img'), 0, -1);
    // Original thumbnail
    $orImgUrl = substr($jsVar('msg_cdn_url'), 0, -1);

    return array(
        'content'        => isset($content[0]['content']) ? $content[0]['content'] : '',
        'article_url'    => urldecode($orUrl),
        'article_title'  => html_entity_decode($orTitle),
        'author_avatar'  => $orAuthAvatar,
        'article_thumb'  => $orImgUrl
    );
}
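The function above reads values out of inline JavaScript by matching a whole line and trimming characters off the end. An alternative, sketched here, is a capture group that stops at the first closing quote. extract_js_var() is a hypothetical helper and $sample is made-up page source; the pattern assumes the value contains no escaped double quotes.

```php
<?php
// Sketch: extract a quoted JavaScript variable from page source.
// extract_js_var() is a hypothetical helper; $sample is made-up page source.

function extract_js_var($html, $name)
{
    // [^"]* stops at the first closing quote, so nothing after it is captured.
    $pattern = '/var\s+' . preg_quote($name, '/') . '\s*=\s*"([^"]*)"/';
    if (preg_match($pattern, $html, $m) === 1) {
        return $m[1];
    }
    return null; // variable not present
}

$sample = 'var msg_title = "Hello &amp; welcome"; var msg_link = "http://example.com/s?a=1&amp;b=2#rd";';

$title = html_entity_decode(extract_js_var($sample, 'msg_title'));
$link  = html_entity_decode(extract_js_var($sample, 'msg_link'));

echo $title, "\n"; // Hello & welcome
echo $link, "\n";  // http://example.com/s?a=1&b=2#rd
```

Because the group captures only the value, no substr() arithmetic is needed, and a missing variable shows up as null rather than as an undefined-index notice.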

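Step 5. Store the data

The database is laid out roughly as follows:

wechat_article: basic article info
wechat_article_content: article details
wechat_category: category info
wechat_article_ids: IDs of articles that have already been imported, to avoid duplicate imports (Redis or similar would also work)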
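The role of wechat_article_ids can be sketched without a database. In the example below a plain PHP array stands in for that table (or for a Redis set); filter_new_articles() is a hypothetical helper and the article rows are made up.

```php
<?php
// Sketch: skip articles whose IDs were already imported.
// A plain PHP array stands in for the wechat_article_ids table (or a Redis set).

function filter_new_articles(array $articles, array &$importedIds)
{
    $fresh = array();
    foreach ($articles as $article) {
        $id = $article['article_id'];
        if (isset($importedIds[$id])) {
            continue; // already imported earlier
        }
        $importedIds[$id] = true; // remember it for later batches
        $fresh[] = $article;
    }
    return $fresh;
}

$imported = array('a1' => true);
$batch = array(
    array('article_id' => 'a1', 'article_title' => 'seen before'),
    array('article_id' => 'a2', 'article_title' => 'new one'),
    array('article_id' => 'a2', 'article_title' => 'duplicate in the same batch'),
);

$fresh = filter_new_articles($batch, $imported);
echo count($fresh), "\n"; // 1
```

Keying the array by article ID makes each lookup a single isset() call, which is the same idea as a unique index in a table or a membership check on a Redis set.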

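Step 6. Sync the articles to the CMS

For convenience I chose PHPCMS: create the categories in the admin backend, write an import script, and run it as a scheduled task. The original post shows a screenshot of the resulting site here.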

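Other notes

1. WeChat image hotlink protection:

WeChat protects the images in original articles against hotlinking. When syncing to the CMS, I replace every image link with a relay URL such as: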
http://www.wxiread.com/api.php?op=ref_control&url=http://mmbiz.qpic.cn/mmbiz_gif/jxateR9eXe1x9yPwA89Rm8mtjZYCgMuiauGKMMOtsEVAyCrsicJVhNv5ON4QOfLJHXRdsUyj8kklDwzicIrNSRyNw/0?wx_fmt=gif
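The replacement step itself is not shown in the post, so the following is only a sketch of how it might be done. proxy_image_urls() is a hypothetical helper and the sample HTML is made up. Unlike the example URL above, it URL-encodes the image address so that the image's own ? and & are not read as parameters of the relay; PHP decodes the value again when it populates $_REQUEST.

```php
<?php
// Sketch: point every <img src> in article HTML at a relay endpoint.
// proxy_image_urls() is a hypothetical helper; the sample HTML is made up.

function proxy_image_urls($html, $relayPrefix)
{
    return preg_replace_callback(
        // \ssrc=" requires whitespace before src, so attributes like data-src are left alone.
        '/(<img\b[^>]*?\ssrc=")([^"]+)(")/i',
        function ($m) use ($relayPrefix) {
            // urlencode keeps the image URL's own ? and & out of the relay's query string.
            return $m[1] . $relayPrefix . urlencode($m[2]) . $m[3];
        },
        $html
    );
}

$relay = 'http://www.wxiread.com/api.php?op=ref_control&url=';
$html  = '<p><img alt="x" src="http://mmbiz.qpic.cn/mmbiz_gif/abc/0?wx_fmt=gif"></p>';

echo proxy_image_urls($html, $relay), "\n";
```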

The code for api.php is as follows:

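$sogouPre = "http://img02.store.sogou.com/net/a/05/link?appid=100520091&url=";
/**
* Hotlink-protection workaround: fetch the image through Sogou's image relay.
* isUrl() and getImgType() are helper functions that are not shown in this post.
*/
$url = isset($_REQUEST['url']) ? trim($_REQUEST['url']) : '';
if (empty($url) || !isUrl($url)) {
    return;
}

$imgType = getImgType($url);
$opts = array(
    'http'=>array(
        'method'=>"GET",
        // HTTP header lines are separated by CRLF
        'header'=>"Referer: http://weixin.sogou.com/\r\n" .
            "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36\r\n" .
            "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\r\n"
    )
);
$context = stream_context_create($opts);
$file    = file_get_contents($sogouPre . $url, FALSE, $context);
if ($file === FALSE) {
    // The upstream fetch failed, so do not send an image body
    http_response_code(502);
    return;
}
header("Content-Type: image/{$imgType}");
echo $file;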
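The script calls isUrl() and getImgType(), but the post does not include them. The sketch below is a guess at what such helpers could look like, not the author's code. It assumes the image format can be read from the wx_fmt query parameter, as in the example URL earlier, and falls back to jpeg otherwise.

```php
<?php
// Sketch of the two helpers that api.php calls but the post does not show.
// Both bodies are guesses at reasonable behavior, not the author's code.

function isUrl($url)
{
    // Accept only absolute http(s) URLs.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return $scheme === 'http' || $scheme === 'https';
}

function getImgType($url)
{
    // Assumption: WeChat image URLs carry the format in the wx_fmt parameter.
    parse_str((string) parse_url($url, PHP_URL_QUERY), $query);
    $fmt = (isset($query['wx_fmt']) && is_string($query['wx_fmt']))
        ? strtolower($query['wx_fmt'])
        : '';
    if ($fmt === 'jpg') {
        return 'jpeg'; // the MIME subtype is jpeg, not jpg
    }
    // Fall back to jpeg when the format is missing or unexpected.
    $allowed = array('gif', 'png', 'jpeg', 'webp');
    return in_array($fmt, $allowed, true) ? $fmt : 'jpeg';
}

var_dump(isUrl('http://mmbiz.qpic.cn/mmbiz_gif/abc/0?wx_fmt=gif'));      // bool(true)
var_dump(getImgType('http://mmbiz.qpic.cn/mmbiz_gif/abc/0?wx_fmt=gif')); // string(3) "gif"
```

Restricting the type to a short allow-list matters because the value ends up in a response header. A relay like this is also an open proxy, so limiting accepted hosts would be a sensible further check.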