Fiddler+Python+UiPath Packet Capture


Author: 旷kevin | Published 2020-05-08 14:14

Scenario: intercept the data packets of a target application.

Two kinds of request responses need to be captured:

1. https://xxxx.xxxx.com/static/pc/api/v1/documents/1594/group-correlations?type=wrong&page={page number}&size=20

     This request returns all cross-check relation pairs in list mode, including the quadruple information and the index information.

2. https://xxxx.xxxx.com/static/pc/api/v1/document/1594/html_segment?entity_type={type of the quadruple's original source}&entity_index={id of the quadruple}

     This request returns the original source of a quadruple in the document; this capture only takes sources with type=paragraph.
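As a reference for how the two endpoints fit together, here is a minimal Python sketch that builds the paged endpoint-1 URL and the endpoint-2 URL for a given entity. The host is the placeholder from the examples above and 1594 is the example document id; the helper names are my own.

```python
# Sketch of the two URL patterns above; "xxxx.xxxx.com" is the placeholder
# host from the examples and 1594 the example document id.
BASE = "https://xxxx.xxxx.com/static/pc/api/v1"

def correlations_url(page, size=20):
    # endpoint 1: one page of cross-check relation pairs
    return f"{BASE}/documents/1594/group-correlations?type=wrong&page={page}&size={size}"

def segment_url(entity_type, entity_index):
    # endpoint 2: the raw source of one quadruple
    return f"{BASE}/document/1594/html_segment?entity_type={entity_type}&entity_index={entity_index}"
```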

Difficulties with pure page scraping:

    1. The page does not fully display the quadruples.

    2. The related entries have no fixed format.

Difficulties with a Python-based capture:

    1. Due to network restrictions, requests-based calls never got through.

    2. With a mitmproxy proxy in place, all kinds of exceptions blocked the traffic.

Final approach:

    1. Use Fiddler as a proxy between the client and the server, and write the packets to disk from JScript.NET as each response arrives.

    2. Read the locally saved packets (JSON strings) with Python and parse them (this could also be done inside the JScript.NET script, but being unfamiliar with it I chose Python).

    3. Design a UiPath flow that keeps triggering the page to send requests until every cross-check relation request has been exhausted, running the Python script once per page turn.

Notes:

    1. Fiddler's proxy setup is flexible: it starts proxying as soon as it is opened. The packet-saving code goes into the OnBeforeResponse method of the CustomRules.js script (opened via Rules > Customize Rules):

In CustomRules.js:

static function OnBeforeResponse(oSession: Session) {
    if (m_Hide304s && oSession.responseCode == 304) {
        oSession["ui-hide"] = "true";
    }
    if (oSession.HostnameIs("autodoc_mvip.paodingai.com") && oSession.url.Contains("group-correlations")) {
        var jsonString = oSession.GetResponseBodyAsString();
        var responseJSON = Fiddler.WebFormats.JSON.JsonDecode(jsonString); // decoded copy, used only for debugging
        // save the response body to a local file
        var fso = new ActiveXObject("Scripting.FileSystemObject");
        // OpenTextFile: 2 = ForWriting, create if missing, -2 = system default encoding
        var file = fso.OpenTextFile("D:\\Users\\{username}\\data\\json.txt", 2, true, -2);
        file.writeLine(jsonString);
        file.writeLine("\n");
        file.close();
    }
    if (oSession.HostnameIs("autodoc_mvip.paodingai.com") && oSession.url.Contains("html_segment") && oSession.url.Contains("PARAGRAPH")) {
        var requestString = oSession.url;
        FiddlerApplication.Log.LogString("Intercepted request: " + requestString);
        // the second query parameter is entity_index={id}
        var vars = requestString.split("&");
        var entity_id = vars[1].split("=")[1];
        FiddlerApplication.Log.LogString("Captured packet for entity_id=" + entity_id);
        var jsonString = oSession.GetResponseBodyAsString();
        // save each segment response under its entity id
        var fso = new ActiveXObject("Scripting.FileSystemObject");
        var file = fso.OpenTextFile("D:\\Users\\{username}\\data\\" + entity_id + ".txt", 2, true, -2);
        file.writeLine(jsonString);
        file.writeLine("\n");
        file.close();
    }
}
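For reference, the entity_index extraction in the second branch (split on `&`, then on `=`) can be replayed in Python against a sample URL (the entity_index value 123 here is made up):

```python
# Mirrors the JScript extraction above: the second query parameter is
# entity_index, and its value becomes the saved file's name.
url = ("https://autodoc_mvip.paodingai.com/static/pc/api/v1/document/1594/"
       "html_segment?entity_type=PARAGRAPH&entity_index=123")
entity_id = url.split("&")[1].split("=")[1]  # "123"
```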

  2. A very simple Python parsing script; its logic is based entirely on my reading of the response packet format:

extract.py

import json
import csv

def formateQuadruple(q, t):
    # '公式不予处理' ("formula, not processed") is the sentinel the caller
    # passes for formulas; those are returned untouched
    if t == '公式不予处理':
        return q

    def joinField(label, key):
        # build "label:text1,text2,..."; the field stays empty when missing
        field = label + ":"
        try:
            field += ','.join(a['text'] for a in q[key])
        except (KeyError, TypeError):
            print('no ' + label + ' detected')
        return field

    parts = [joinField('attributes', 'attributes')]
    # paragraphs carry preattributes, tables carry head_attributes
    if t == 'paragraph':
        parts.append(joinField('preattributes', 'preattributes'))
    if t == 'table':
        parts.append(joinField('head_attributes', 'head_attributes'))
    parts.append(joinField('value', 'value'))
    parts.append(joinField('time', 'time'))
    return ' '.join(parts)
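The comma-joining pattern that formateQuadruple applies to every field can be shown in isolation (the sample texts below are made up):

```python
def join_field(label, items):
    # "label:text1,text2,..." — the per-field format formateQuadruple emits
    return label + ":" + ",".join(a["text"] for a in items)

sample = [{"text": "net profit"}, {"text": "2019"}]
formatted = join_field("attributes", sample)  # "attributes:net profit,2019"
```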

def findRawContent(entity_id):
    # read the packet Fiddler saved for this entity and pull out the raw source
    content = "not captured"
    try:
        with open("./data/" + str(entity_id) + ".txt", "r") as fh:
            obj = json.loads(fh.read())
        content = obj['data']['entity']
    except (OSError, KeyError, ValueError):
        print('no txt file captured for entity=' + str(entity_id))
    return content
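The files findRawContent reads are the html_segment responses Fiddler saved. A self-contained check against a synthetic packet (the nested data.entity layout is taken from the code above; the field values are made up):

```python
import json
import os
import tempfile

# write a synthetic packet in the data.entity layout findRawContent expects
packet = {"data": {"entity": {"id": 123, "type": "paragraph"}}}
path = os.path.join(tempfile.mkdtemp(), "123.txt")
with open(path, "w") as fh:
    fh.write(json.dumps(packet))

# read it back the same way findRawContent does
with open(path) as fh:
    entity = json.loads(fh.read())["data"]["entity"]
```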

if __name__ == '__main__':
    print('begin to extract data packages from current page...')
    data = open("./data/json.txt", "r").read()
    f = open('q.csv', 'a+', encoding='gbk', newline='')
    csv_writer = csv.writer(f)
    obj = json.loads(data)
    # all cross-check relations on the current page
    keys = list(obj['data']['data'].keys())
    for key in keys:
        # iterate the relations inside one cross-check group:
        # main_correlation_item is identical within each relation
        for relation in obj['data']['data'][key]:
            main_entity = relation['main_correlation_item']['data']['entity']
            main_page = relation['main_correlation_item']['page']
            main_q = main_entity['quadruple']
            if main_entity['type'] == 'paragraph':
                main_origin_content = findRawContent(main_entity['id'])
            else:
                # "no raw source captured for tables and formulas"
                main_origin_content = "表格和公式不予捕捉原始来源"
            matching_degree = relation['matching_degree']
            # more than one correlation item means a formula; skip those
            if len(relation['correlation_items']) > 1:
                print('公式不予处理')  # "formula, not processed"
                correlate_entity = {'id': '公式不予处理', 'type': '公式不予处理'}
                correlate_page = '公式不予处理'
                correlate_q = '公式不予处理'
                correlate_origin_content = '公式不予处理'
            elif len(relation['correlation_items']) == 1:
                correlate_entity = relation['correlation_items'][0]['data']['entity']
                correlate_page = relation['correlation_items'][0]['page']
                correlate_q = correlate_entity['quadruple']
                if correlate_entity['type'] == 'paragraph':
                    correlate_origin_content = findRawContent(correlate_entity['id'])
                else:
                    correlate_origin_content = "表格和公式不予捕捉原始来源"
            else:
                # "no correlation"
                correlate_entity = {'id': '无关联', 'type': '无关联'}
                correlate_page = '无关联'
                correlate_q = '无关联'
                correlate_origin_content = '无关联'
            csv_writer.writerow([main_entity['id'], main_entity['type'],
                                 formateQuadruple(main_q, main_entity['type']),
                                 main_page,
                                 main_origin_content,
                                 matching_degree,
                                 correlate_entity['id'], correlate_entity['type'],
                                 formateQuadruple(correlate_q, correlate_entity['type']),
                                 correlate_page,
                                 correlate_origin_content])
    f.close()
    print('done.')

3. The UiPath flow: (start the Fiddler proxy before running; during execution, click each cross-check pair one by one, and call the Python script above once before each page turn)



      Original link: https://www.haomeiwen.com/subject/kgdfnhtx.html