WeChat Crawler in Practice

Author: misspass | Published: 2020-08-16 13:01

    https://nuozhilin.site/2020/02/20/2020-02-20-weixin-crawler-practice/
    https://www.lizenghai.com/archives/31687.html

    Proxy

    macOS

    • Install mitmproxy

    <pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">brew install mitmproxy

    mitmdump
    # mitmdump prints: Proxy server listening at http://*:8080
    </pre>

    mitmproxy is a free, open-source, programmable HTTPS proxy; AnyProxy is an alternative.
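To illustrate what "programmable" means here, the following is a minimal sketch of a mitmproxy addon script that logs the `__biz` parameter of WeChat requests passing through the proxy. It is not part of the crawler used later in this article. The host name `mp.weixin.qq.com`, the helper `extract_biz`, and the file name `biz_logger.py` are assumptions made for the example.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: official-account pages are served from this host.
WECHAT_HOST = "mp.weixin.qq.com"


def extract_biz(url):
    """Return the __biz query parameter of a WeChat URL, or None."""
    parts = urlsplit(url)
    if parts.hostname != WECHAT_HOST:
        return None
    values = parse_qs(parts.query).get("__biz")
    return values[0] if values else None


def response(flow):
    """mitmproxy event hook, called once for every completed response."""
    biz = extract_biz(flow.request.pretty_url)
    if biz is not None:
        print(f"{flow.response.status_code} __biz={biz}")
```

Saved as `biz_logger.py` (a hypothetical name), the script could be loaded with `mitmdump -s biz_logger.py`.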

    • Install the certificate

    <pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">Double-click ~/.mitmproxy/mitmproxy-ca-cert.pem

    Set the mitmproxy certificate to "Always Trust"
    </pre>

    • Configure the proxy

    <pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">System Preferences => Network => Advanced => Proxies

    Web Proxy (HTTP) => 127.0.0.1:8080
    Secure Web Proxy (HTTPS) => 127.0.0.1:8080
    </pre>
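One way to check that requests really pass through the proxy configured above is to send one from Python. The sketch below builds a `urllib` opener that routes HTTP and HTTPS through 127.0.0.1:8080 and, if the mitmproxy CA file exists, trusts it for HTTPS. The function name `build_proxy_opener` is made up for this example; the address and certificate path are the ones used in the steps above.

```python
import ssl
import urllib.request
from pathlib import Path

PROXY = "http://127.0.0.1:8080"  # address entered in the proxy settings
CA_CERT = Path.home() / ".mitmproxy" / "mitmproxy-ca-cert.pem"


def build_proxy_opener(proxy=PROXY, ca_cert=CA_CERT):
    """Return an opener that sends HTTP and HTTPS requests via the proxy."""
    handlers = [urllib.request.ProxyHandler({"http": proxy, "https": proxy})]
    if Path(ca_cert).exists():
        # Trust the mitmproxy CA so intercepted HTTPS responses verify.
        context = ssl.create_default_context(cafile=str(ca_cert))
        handlers.append(urllib.request.HTTPSHandler(context=context))
    return urllib.request.build_opener(*handlers)


# Example use while mitmdump is running:
#   with build_proxy_opener().open("https://example.com", timeout=10) as resp:
#       print(resp.status)
```

If the request shows up in the mitmdump output, both the proxy and the certificate are working.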

    Android

    • Copy the certificate

    <pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">Copy the macOS certificate ~/.mitmproxy/mitmproxy-ca-cert.pem to the phone
    </pre>

    • Install the certificate

    <pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">MIUI 11 => Settings => Encryption & credentials => Install from SD card
    </pre>

    • Configure the proxy

    <pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">Network => Proxy => macos_ip:8080
    </pre>

    macOS and Android must be on the same network.
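The `macos_ip` value and the same-network requirement can be checked with a few lines of Python. This is a rough sketch: `local_ip` makes a best-effort guess at the Mac's LAN address, and `same_network` assumes a /24 subnet, which is common for home Wi-Fi but not guaranteed. Both function names are made up for this example.

```python
import ipaddress
import socket


def local_ip():
    """Best-effort LAN address of this machine (falls back to loopback)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only picks a route.
        sock.connect(("192.0.2.1", 9))
        return sock.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        sock.close()


def same_network(ip_a, ip_b, prefix=24):
    """True if both addresses fall in the same subnet (assumed /24)."""
    network = ipaddress.ip_network(f"{ip_a}/{prefix}", strict=False)
    return ipaddress.ip_address(ip_b) in network
```

For example, `same_network(local_ip(), "192.168.1.77")` compares the Mac with a phone whose Wi-Fi details show the (hypothetical) address 192.168.1.77.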

    Crawler

    Database

    • MySQL

    <pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">docker run --name mysql-weixin -p 3306:3306 -e MYSQL_ROOT_PASSWORD=123456 -d mysql:5.7.17

    docker exec -i mysql-weixin mysql -uroot -p123456 <<< "CREATE DATABASE IF NOT EXISTS wechat DEFAULT CHARSET utf8mb4 COLLATE utf8mb4_general_ci;"
    </pre>

    • Redis

    <pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">docker run --name redis-weixin -p 6379:6379 -d redis
    </pre>
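A freshly started MySQL container usually needs some time before it accepts queries, so the `CREATE DATABASE` command above may fail if it runs immediately after `docker run`. A minimal sketch of a wait loop follows. The helpers `retry` and `mysql_is_ready` are made up for this example; the container name and password are the ones used above.

```python
import subprocess
import time


def retry(action, attempts=30, delay=1.0):
    """Call action() until it returns True; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if action():
            return attempt
        if attempt < attempts:
            time.sleep(delay)
    return None


def mysql_is_ready(container="mysql-weixin", password="123456"):
    """True once the server inside the container answers a trivial query."""
    cmd = ["docker", "exec", container, "mysql", "-uroot",
           f"-p{password}", "-e", "SELECT 1"]
    try:
        return subprocess.run(cmd, capture_output=True).returncode == 0
    except FileNotFoundError:  # docker CLI is not installed
        return False
```

Calling `retry(mysql_is_ready)` before creating the database polls once per second for up to 30 attempts.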

    Crawler

    • Run the crawler

    <pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">wget https://zbkj-service.oss-cn-beijing.aliyuncs.com/wechat/wechat_spider.zip

    unzip wechat_spider.zip && rm -rf __MACOSX

    cd wechat_spider

    chmod +x wechat-spider-mac

    # Quit the mitmdump process started earlier

    ./wechat-spider-mac
    </pre>

    The crawler service's database settings are in config.yaml.

    • Add an official account (crawl task)

    <pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">docker exec -i mysql-weixin mysql -uroot -p123456 <<< "USE wechat; INSERT INTO wechat_account_task (__biz) VALUES('MzIyNzk1MTU2OQ==');"
    </pre>

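    The __biz value above is the unique identifier of the official account "机械指挥官" on the WeChat platform. The WeChat account on the phone must also follow this official account; otherwise not all of its data can be crawled.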
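When several official accounts need to be queued, the INSERT statement can be generated instead of typed by hand. The sketch below assumes that `__biz` values are base64-style strings like the one above and rejects anything else, so the values can be placed in the SQL text safely. The function names are made up for this example; the database, table, and column names come from the command above.

```python
import re
import subprocess

# Assumption: __biz values look like base64 (letters, digits, +, /, = padding).
BIZ_PATTERN = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def build_task_sql(biz_values, database="wechat", table="wechat_account_task"):
    """Build one INSERT statement that queues every given __biz value."""
    if not biz_values:
        raise ValueError("at least one __biz value is required")
    for biz in biz_values:
        if not BIZ_PATTERN.fullmatch(biz):
            raise ValueError(f"not a base64-looking __biz value: {biz!r}")
    rows = ", ".join(f"('{biz}')" for biz in biz_values)
    return f"USE {database}; INSERT INTO {table} (__biz) VALUES {rows};"


def submit(sql, container="mysql-weixin", password="123456"):
    """Pipe the SQL into the mysql client inside the container."""
    cmd = ["docker", "exec", "-i", container, "mysql", "-uroot", f"-p{password}"]
    return subprocess.run(cmd, input=sql, text=True, check=True)
```

For example, `submit(build_task_sql(["MzIyNzk1MTU2OQ=="]))` has the same effect as the shell command above.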

    • Open the official account's message history

    <pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">MIUI 11 => WeChat => Contacts => Official Accounts =>

    "机械指挥官" => 新闻资讯 (News) => "机械指挥官" (message history)
    </pre>

    Add the official account to crawl first, then open the "message history" page of any official account; only then is the crawl triggered.

    The crawler now starts working, and ./logs/wechat_spider.log shows entries like the following:

    <pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">MainThread|2020-02-20 14:48:17,877|deal_data.py|deal_article_list|line:290|INFO| 抓取到列表底部 无更多文章,公众号 MzIyNzk1MTU2OQ== 抓取完毕
    MainThread|2020-02-20 15:00:40,828|deal_data.py|__parse_article_list|line:153|INFO| 采集到上次发布时间 公众号 MzIyNzk1MTU2OQ== 采集完成
    </pre>
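The log lines above are pipe-delimited, which makes it easy to detect from a script when an account has finished. The following is a minimal sketch that relies only on the format visible in the two sample lines. The field names and the two completion phrases (`抓取完毕`, `采集完成`) are inferred from those samples, not from the crawler's documentation.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    thread: str
    time: datetime
    file: str
    function: str
    line: int
    level: str
    message: str


BIZ_IN_MESSAGE = re.compile(r"公众号 (\S+)")
DONE_PHRASES = ("抓取完毕", "采集完成")  # phrases seen in the sample lines


def parse_line(raw: str) -> Optional[LogRecord]:
    """Split one log line into its seven fields; None if it does not match."""
    parts = raw.rstrip("\n").split("|", 6)
    if len(parts) != 7 or not parts[4].startswith("line:"):
        return None
    thread, stamp, file, function, line, level, message = parts
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
    return LogRecord(thread, when, file, function,
                     int(line[len("line:"):]), level, message.strip())


def finished_biz(record: LogRecord) -> Optional[str]:
    """Return the __biz named in a completion message, else None."""
    if any(phrase in record.message for phrase in DONE_PHRASES):
        match = BIZ_IN_MESSAGE.search(record.message)
        return match.group(1) if match else None
    return None
```

Applying `parse_line` and then `finished_biz` to each line of `./logs/wechat_spider.log` yields the accounts whose crawl has completed.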
