I. Analyzing the Page

First, start mitmweb from the command line:

mitmweb

Next, manually set the phone's proxy server to the PC running mitmweb.

For the configuration steps, see the earlier article: https://www.cnblogs.com/xyztank/articles/12362470.html
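The phone's proxy setting needs the PC's LAN address and the proxy port. The following is a minimal sketch, not part of the original setup: the helper `lan_ip` is a hypothetical name, and port 8080 is assumed because it is mitmproxy's default listening port (it differs if `--listen-port` was used).

```python
import socket


def lan_ip(probe_host="8.8.8.8", probe_port=80):
    """Best-effort guess of this PC's LAN IP address (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packet; it only makes the OS
        # pick the outgoing interface, whose address is then read back.
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, offline): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


PROXY_PORT = 8080  # assumption: mitmproxy's default listening port

if __name__ == "__main__":
    print(f"Set the phone's proxy to {lan_ip()}:{PROXY_PORT}")
```

The phone and the PC must be on the same network for this address to be reachable.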

Then, on the phone, open the app to be scraped: the Dedao App (得到App).

Analyzing the traffic captured by the proxy eventually leads to the URL of the article list:

https://entree.igetget.com/bauhinia/v1/class/purchase/article_list
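A captured request may carry query parameters after this path, so the match has to ignore them. As a hedged sketch, the hypothetical helper `is_article_list` compares only the host and path. The query string and the extra path in the usage lines are made-up examples, not values observed in the app's traffic.

```python
from urllib.parse import urlsplit

ARTICLE_LIST_HOST = "entree.igetget.com"
ARTICLE_LIST_PATH = "/bauhinia/v1/class/purchase/article_list"


def is_article_list(url):
    """Return True if url points at the article list endpoint."""
    parts = urlsplit(url)
    return parts.hostname == ARTICLE_LIST_HOST and parts.path == ARTICLE_LIST_PATH


# Hypothetical URLs for illustration only
print(is_article_list("https://" + ARTICLE_LIST_HOST + ARTICLE_LIST_PATH + "?count=20"))
print(is_article_list("https://" + ARTICLE_LIST_HOST + ARTICLE_LIST_PATH + "_other"))
```

A plain prefix check would accept both URLs; comparing the parsed path accepts only the first.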

Next, inspect the JSON returned by the server; it contains each article's title and URL.
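The extraction can be sketched on a small sample. The key names (`c`, `article_list`, `share_title`, `share_url`) are the ones used by the script in section II; the sample payload, its titles and URLs, and the function name `extract_articles` are placeholders, and the real response likely carries more fields.

```python
import json

# Hypothetical payload: only the key names come from the article's script.
sample = """
{"c": {"article_list": [
    {"share_title": "Example article 1", "share_url": "https://example.com/a1"},
    {"share_title": "Example article 2", "share_url": "https://example.com/a2"}
]}}
"""


def extract_articles(text):
    """Return (title, url) pairs from the article list JSON."""
    data = json.loads(text)
    # Fall back to empty containers so a missing key yields an empty list
    articles = (data.get("c") or {}).get("article_list") or []
    return [(a.get("share_title"), a.get("share_url")) for a in articles]


for title, url in extract_articles(sample):
    print(title, url)
```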

II. Code Implementation

from mitmproxy import ctx
import json
from lxml import etree
from selenium import webdriver

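def response(flow):
    """
    mitmproxy hook, called for every response that passes through the proxy.
    The article list URL below was identified with the mitmweb UI.
    """
    start_url = "https://entree.igetget.com/bauhinia/v1/class/purchase/article_list"
    if flow.request.url.startswith(start_url):
        text = flow.response.text
        data = json.loads(text)
        # Fall back to empty containers so a missing key does not raise
        talks = (data.get('c') or {}).get('article_list') or []
        for talk in talks:
            title = talk.get('share_title')
            url = talk.get('share_url')
            ctx.log.info(str(title))
            # Skip entries that carry no share URL
            if url:
                parse_page(url)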


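def parse_page(url):
    """
    The page behind this URL cannot be parsed from its raw HTML,
    because its content is rendered by JavaScript.
    A browser displays it correctly, so Selenium is used to fetch the rendered page.
    """
    chrome_options = webdriver.ChromeOptions()
    # Headless mode: run Chrome without opening a visible window
    chrome_options.add_argument("--headless")
    # `options=` replaces the deprecated `chrome_options=` keyword
    driver = webdriver.Chrome(options=chrome_options)
    try:
        driver.get(url)
        resource = driver.page_source
    finally:
        # Always close the browser so headless Chrome processes do not pile up
        driver.quit()
    html = etree.HTML(resource)
    title = html.xpath('//h1[@class="title"]/text()')[0]
    time = html.xpath('//span[@class="time"]/text()')[0]
    content = html.xpath('//div[@class="text"]//p//text()')
    content = "".join(content)
    print(title, time)
    save(title, time, content)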


def save(title, time, content):
    """
    Append the article to a text file.
    """
    with open('dedao.txt', 'a', encoding='utf-8') as fp:
        fp.write('\n'.join([title, content, time]))
        # Separator line between articles
        fp.write('\n' + '=' * 50 + '\n')
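One fragile spot in parse_page is the `[0]` lookup on each XPath result: if an element is missing, for example because the page had not finished rendering, `xpath()` returns an empty list and the lookup raises IndexError. A minimal sketch of a guard follows; the helper `first_or_default` is a hypothetical name, not part of the original script.

```python
def first_or_default(nodes, default=""):
    """Return the first XPath text match, stripped, or default if there is none."""
    return nodes[0].strip() if nodes else default


# Plain lists stand in for xpath() results here; an empty list is what
# xpath() returns when nothing matches.
print(first_or_default(["  Example title  "]))
print(repr(first_or_default([])))
```

In parse_page this could be used as `title = first_or_default(html.xpath('//h1[@class="title"]/text()'))`, so a missing element yields an empty string instead of stopping the script.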

Run the script from the command line on Windows: