Packet Capture with Fiddler + Python + UiPath

Author: 旷kevin | Published 2020-05-08 14:14

    Scenario: intercept the data packets of a certain web application.

    Two kinds of request responses are intercepted:

    1. https://xxxx.xxxx.com/static/pc/api/v1/documents/1594/group-correlations?type=wrong&page={page number}&size=20

         This request returns, in list mode, all cross-check (钩稽) relation pairs, including the quadruple information and the index information.

    2. https://xxxx.xxxx.com/static/pc/api/v1/document/1594/html_segment?entity_type={source type of the recognized quadruple}&entity_index={id of the specific quadruple}

         This request returns the original source of a quadruple in the document; this capture only takes sources with type=paragraph.
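Based on how the parsing script later navigates the packets, the group-correlations response appears to have the nested shape sketched below. The field names come from the parsing code; the sample values are invented for illustration.

```python
import json

# Made-up sample with the nested shape the parsing script expects:
# data.data maps each cross-check relation key to a list of correlations.
sample = json.loads("""
{
  "data": {
    "data": {
      "relation-1": [
        {
          "main_correlation_item": {
            "page": 3,
            "data": {"entity": {"id": 101, "type": "paragraph", "quadruple": {}}}
          },
          "correlation_items": [
            {"page": 7, "data": {"entity": {"id": 202, "type": "table", "quadruple": {}}}}
          ],
          "matching_degree": 0.98
        }
      ]
    }
  }
}
""")

relation = sample["data"]["data"]["relation-1"][0]
print(relation["main_correlation_item"]["data"]["entity"]["id"])  # prints 101
print(relation["matching_degree"])                                # prints 0.98
```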

    Difficulties with pure page scraping:

        1. The page does not fully display the quadruples.

        2. Related entries have no fixed format.

    Difficulties with Python scraping:

        1. Due to network restrictions, requests made with the requests library never got through.

        2. After setting up a mitmproxy proxy, various exceptions blocked the capture.

    Final approach:

        1. Use Fiddler as a proxy between client and server, and use JScript.NET to write each packet to disk when the response arrives.

        2. Use Python to read and parse the locally saved packets (JSON strings). (This could also be done inside the JScript.NET script, but since I am not familiar with it I chose Python.)

        3. Design a UiPath workflow that keeps triggering the page to send requests until all cross-check relation requests have been traversed, running the Python script once per page turn.

    Notes:

        1. Fiddler's proxy setup is flexible: proxying starts as soon as the tool is opened. The packet-dumping code goes into the OnBeforeResponse method of the CustomRules.js script (opened via Rules > Customize Rules), as follows:

    In CustomRules.js:

    static function OnBeforeResponse(oSession: Session) {
        if (m_Hide304s && oSession.responseCode == 304) {
            oSession["ui-hide"] = "true";
        }
        if (oSession.HostnameIs("autodoc_mvip.paodingai.com") && oSession.url.Contains("group-correlations")) {
            var jsonString = oSession.GetResponseBodyAsString();
            var responseJSON = Fiddler.WebFormats.JSON.JsonDecode(jsonString);
            // save the packet to a local file
            var fso = new ActiveXObject("Scripting.FileSystemObject");
            var file = fso.OpenTextFile("D:\\Users\\{my name}\\data\\json.txt", 2, true, -2);
            file.writeLine(jsonString);
            file.writeLine("\n");
            file.close();
        }
        if (oSession.HostnameIs("autodoc_mvip.paodingai.com") && oSession.url.Contains("html_segment") && oSession.url.Contains("PARAGRAPH")) {
            var requestString = oSession.url;
            FiddlerApplication.Log.LogString("intercepted request: " + requestString);
            // the second query parameter is entity_index, e.g. "...&entity_index=42"
            var vars = requestString.split("&");
            var entity_id = vars[1].split("=")[1];
            FiddlerApplication.Log.LogString("intercepted packet for entity_id=" + entity_id);
            var jsonString = oSession.GetResponseBodyAsString();
            // save the packet to a local file, named after the entity id
            var fso = new ActiveXObject("Scripting.FileSystemObject");
            var file = fso.OpenTextFile("D:\\Users\\{my name}\\data\\" + entity_id + ".txt", 2, true, -2);
            file.writeLine(jsonString);
            file.writeLine("\n");
            file.close();
        }
    }
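For reference, the entity_index extraction that the script above does with split("&") can be written more robustly in Python with the standard library. The URL below is a stand-in with the same query parameters as the html_segment request.

```python
from urllib.parse import urlparse, parse_qs

# Stand-in URL with the same query parameters as the html_segment request
url = ("https://xxxx.xxxx.com/static/pc/api/v1/document/1594/html_segment"
       "?entity_type=PARAGRAPH&entity_index=42")

params = parse_qs(urlparse(url).query)
entity_index = params["entity_index"][0]  # parse_qs returns lists of values
print(entity_index)  # prints 42
```

Unlike the split("&") approach, this does not depend on the parameter order in the URL.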

      2. A very simple Python parsing script; its logic is based entirely on my understanding of the returned packet format:

    extract.py

    import json
    import csv


    def formateQuadruple(q, t):
        """Render a quadruple as a single 'field:v1,v2 ...' string."""
        # sentinel strings (formula / no correlation) pass straight through
        if t in ('formula (not processed)', 'no correlation'):
            return q

        def join_texts(field):
            try:
                return ",".join(a['text'] for a in q[field])
            except (KeyError, TypeError):
                print('no ' + field + ' detected')
                return ""

        parts = ['attributes:' + join_texts('attributes')]
        # paragraphs carry preattributes, tables carry head_attributes
        if t == 'paragraph':
            parts.append('preattributes:' + join_texts('preattributes'))
        if t == 'table':
            parts.append('head_attributes:' + join_texts('head_attributes'))
        parts.append('value:' + join_texts('value'))
        parts.append('time:' + join_texts('time'))
        return " ".join(parts)


    def findRawContent(entity_id):
        """Look up the dumped html_segment packet for this entity, if captured."""
        try:
            with open("./data/" + str(entity_id) + ".txt", "r") as fp:
                obj = json.loads(fp.read())
            return obj['data']['entity']
        except (OSError, ValueError, KeyError):
            print('no txt file captured for entity=' + str(entity_id))
            return "not captured"


    if __name__ == '__main__':
        print('begin to extract data packages from current page...')
        with open("./data/json.txt", "r") as fp:
            obj = json.loads(fp.read())
        NO_SOURCE = "original source not captured for tables and formulas"
        with open('q.csv', 'a+', encoding='gbk', newline='') as f:
            csv_writer = csv.writer(f)
            # all cross-check relations on the current page
            for key in obj['data']['data']:
                # iterate the correlations inside one cross-check relation;
                # main_correlation_item is identical for every correlation
                for relation in obj['data']['data'][key]:
                    main_entity = relation['main_correlation_item']['data']['entity']
                    main_page = relation['main_correlation_item']['page']
                    main_q = main_entity['quadruple']
                    if main_entity['type'] == 'paragraph':
                        main_origin_content = findRawContent(main_entity['id'])
                    else:
                        main_origin_content = NO_SOURCE
                    matching_degree = relation['matching_degree']
                    items = relation['correlation_items']
                    if len(items) > 1:
                        # more than one correlation item means a formula; skipped for now
                        print('formula (not processed)')
                        correlate_entity = {'id': 'formula (not processed)',
                                            'type': 'formula (not processed)'}
                        correlate_page = 'formula (not processed)'
                        correlate_q = 'formula (not processed)'
                        correlate_origin_content = 'formula (not processed)'
                    elif len(items) == 1:
                        correlate_entity = items[0]['data']['entity']
                        correlate_page = items[0]['page']
                        correlate_q = correlate_entity['quadruple']
                        if correlate_entity['type'] == 'paragraph':
                            correlate_origin_content = findRawContent(correlate_entity['id'])
                        else:
                            correlate_origin_content = NO_SOURCE
                    else:
                        correlate_entity = {'id': 'no correlation', 'type': 'no correlation'}
                        correlate_page = 'no correlation'
                        correlate_q = 'no correlation'
                        correlate_origin_content = 'no correlation'
                    csv_writer.writerow([main_entity['id'], main_entity['type'],
                                         formateQuadruple(main_q, main_entity['type']),
                                         main_page, main_origin_content, matching_degree,
                                         correlate_entity['id'], correlate_entity['type'],
                                         formateQuadruple(correlate_q, correlate_entity['type']),
                                         correlate_page, correlate_origin_content])
        print('done.')
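The quadruple formatting can be checked in isolation. Below is a compact stand-alone equivalent of the paragraph-mode join logic, run on a made-up sample quadruple (the field values are invented; the field names follow the script above).

```python
# Compact stand-alone version of the quadruple join logic (paragraph case);
# the sample quadruple below is made up for illustration.
def format_quadruple(q):
    fields = ("attributes", "preattributes", "value", "time")
    return " ".join(
        f + ":" + ",".join(a["text"] for a in q.get(f, [])) for f in fields
    )

sample = {
    "attributes": [{"text": "revenue"}],
    "preattributes": [{"text": "consolidated"}],
    "value": [{"text": "1,234.56"}],
    "time": [{"text": "FY2019"}],
}
print(format_quadruple(sample))
# attributes:revenue preattributes:consolidated value:1,234.56 time:FY2019
```

A missing field simply yields an empty list after the colon, mirroring the try/except fallback in the full script.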

    3. The UiPath workflow file: (open the Fiddler proxy before execution; during execution, click each cross-check relation pair in turn, and invoke the Python script above once before each page turn)
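The per-page step of the UiPath workflow amounts to launching the extraction script as an external process. A minimal sketch of that call, assuming the script file is named extract.py:

```python
import subprocess
import sys

def run_extract(script="extract.py"):
    """Run the extraction script once; raises CalledProcessError on failure."""
    # sys.executable keeps the subprocess on the same Python interpreter
    return subprocess.run([sys.executable, script], check=True)
```

In UiPath this corresponds to something like a Start Process or Invoke Python activity placed just before the page-turn click.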


          Original link: https://www.haomeiwen.com/subject/kgdfnhtx.html