Scenario: intercepting a certain application's network packets
Responses from two kinds of requests need to be captured:
1. The list request (its URL is not recorded here; it is the request containing group-correlations that the Fiddler script below matches). It returns all correlation pairs in list view, including the quadruple information and the index information.
2. https://xxxx.xxxx.com/static/pc/api/v1/document/1594/html_segment?entity_type={type of the identified quadruple's original source}&entity_index={id of the specific quadruple}
This request returns a quadruple's original source in the document; this capture only takes sources with type=paragraph.
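For reference, the entity_type and entity_index parameters can be pulled out of such a URL's query string with the standard library. A minimal sketch (the URL below is a made-up placeholder following the html_segment pattern above):

```python
from urllib.parse import urlparse, parse_qs

# placeholder URL following the html_segment pattern above
url = ("https://xxxx.xxxx.com/static/pc/api/v1/document/1594/html_segment"
       "?entity_type=PARAGRAPH&entity_index=123")
params = parse_qs(urlparse(url).query)
entity_type = params["entity_type"][0]
entity_index = params["entity_index"][0]
print(entity_type, entity_index)  # PARAGRAPH 123
```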
Difficulties with pure page scraping:
1. The page does not fully display the quadruples.
2. The correlated entries have no fixed format.
Difficulties with Python scraping:
1. Due to network restrictions, requests could never get through.
2. After setting up a mitmproxy proxy, it was blocked by all kinds of exceptions.
Final approach:
1. Use Fiddler as a man-in-the-middle proxy between the client and the server, and use JScript.NET to write each response body to a local file as it arrives.
2. Use Python to read the locally saved packets (JSON strings) and parse them. (The parsing could also be done in the JScript.NET script, but since I am not familiar with it I chose Python.)
3. Design a UiPath workflow that keeps triggering the page to send requests until all correlation-pair requests have been covered, running the Python script once on each page turn.
Notes:
1. Fiddler's proxy setup is flexible: it starts proxying as soon as it is launched. The packet-saving code goes into the OnBeforeResponse method of the CustomRules.js script opened via Rules > Customize Rules, as follows:
In CustomRules.js:
static function OnBeforeResponse(oSession: Session) {
    if (m_Hide304s && oSession.responseCode == 304) {
        oSession["ui-hide"] = "true";
    }
    // List request: dump the whole group-correlations response to json.txt
    if (oSession.HostnameIs("autodoc_mvip.paodingai.com") && oSession.url.Contains("group-correlations")) {
        var jsonString = oSession.GetResponseBodyAsString();
        var responseJSON = Fiddler.WebFormats.JSON.JsonDecode(jsonString);
        //FiddlerApplication.Log.LogString(" message from OnBeforeResponse: " + responseJSON.JSONObject["data"].JSONObject["data"].length);
        // Save the file locally
        var fso;
        var file;
        fso = new ActiveXObject("Scripting.FileSystemObject");
        // 2 = ForWriting, true = create if missing, -2 = system default encoding
        file = fso.OpenTextFile("D:\\Users\\{my name}\\data\\json.txt", 2, true, -2);
        file.writeLine(jsonString);
        file.writeLine("\n");
        file.close();
    }
    // Segment request: save each PARAGRAPH html_segment response as <entity_id>.txt
    if (oSession.HostnameIs("autodoc_mvip.paodingai.com") && oSession.url.Contains("html_segment") && oSession.url.Contains("PARAGRAPH")) {
        var requestString = oSession.url;
        FiddlerApplication.Log.LogString("Intercepted request: " + requestString);
        // entity_index is the second query parameter: ...&entity_index=<id>
        var vars = requestString.split("&");
        var entity_id = vars[1].split("=")[1];
        FiddlerApplication.Log.LogString("Intercepted packet for entity_id=" + entity_id);
        var jsonString = oSession.GetResponseBodyAsString();
        // Save the file locally
        var fso;
        var file;
        fso = new ActiveXObject("Scripting.FileSystemObject");
        file = fso.OpenTextFile("D:\\Users\\{my name}\\data\\" + entity_id + ".txt", 2, true, -2);
        file.writeLine(jsonString);
        file.writeLine("\n");
        file.close();
    }
}
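The hook above leaves captures in a flat layout that the Python step later reads back: one json.txt holding the latest group-correlations response, plus one <entity_id>.txt per paragraph segment. A sketch of that file contract, with made-up payloads (only the file names come from the scripts; the JSON bodies are placeholders):

```python
import json
import os
import tempfile

def write_capture(data_dir, list_payload, segments):
    # json.txt holds the list response; each segment goes to <entity_id>.txt
    os.makedirs(data_dir, exist_ok=True)
    with open(os.path.join(data_dir, "json.txt"), "w", encoding="utf-8") as f:
        f.write(json.dumps(list_payload))
    for entity_id, seg in segments.items():
        with open(os.path.join(data_dir, str(entity_id) + ".txt"), "w", encoding="utf-8") as f:
            f.write(json.dumps(seg))

data_dir = tempfile.mkdtemp()
write_capture(data_dir,
              {"data": {"data": {}}},
              {123: {"data": {"entity": "sample paragraph text"}}})
print(sorted(os.listdir(data_dir)))  # ['123.txt', 'json.txt']
```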
2. A very simple Python parsing script; the logic is based entirely on my reading of the returned packet format:
extract.py
import json
import csv

def formateQuadruple(q, t):
    # Formulas are skipped; q is just the sentinel string in that case
    if t == 'formula (skipped)':
        return q
    output = ""
    # attributes
    attributes = "attributes:"
    isFirst = True
    try:
        for a in q['attributes']:
            if isFirst:
                attributes += a['text']
                isFirst = False
            else:
                attributes += ',' + a['text']
    except Exception:
        print('no attributes detected')
    output += attributes
    # preattributes / head_attributes
    if t == 'paragraph':
        preattributes = "preattributes:"
        isFirst = True
        try:
            for a in q['preattributes']:
                if isFirst:
                    preattributes += a['text']
                    isFirst = False
                else:
                    preattributes += ',' + a['text']
        except Exception:
            print('no preattributes detected')
        output += " " + preattributes
    if t == 'table':
        head_attributes = "head_attributes:"
        isFirst = True
        try:
            for a in q['head_attributes']:
                if isFirst:
                    head_attributes += a['text']
                    isFirst = False
                else:
                    head_attributes += ',' + a['text']
        except Exception:
            print('no head_attributes detected')
        output += " " + head_attributes
    # value
    value = "value:"
    isFirst = True
    try:
        for a in q['value']:
            if isFirst:
                value += a['text']
                isFirst = False
            else:
                value += ',' + a['text']
    except Exception:
        print('no value detected')
    output += " " + value
    # time
    time = "time:"
    isFirst = True
    try:
        for a in q['time']:
            if isFirst:
                time += a['text']
                isFirst = False
            else:
                time += ',' + a['text']
    except Exception:
        print('no time detected')
    output += " " + time
    return output

def findRawContent(entity_id):
    content = "not captured"
    try:
        data = open("./data/" + str(entity_id) + ".txt", "r").read()
        obj = json.loads(data)
        content = obj['data']['entity']
    except Exception:
        print('no txt file captured for entity=' + str(entity_id))
    return content

if __name__ == '__main__':
    print('begin to extract data packages from current page...')
    data = open("./data/json.txt", "r").read()
    f = open('q.csv', 'a+', encoding='gbk', newline='')
    csv_writer = csv.writer(f)
    obj = json.loads(data)
    # all correlation pairs on the current page
    keys = list(obj['data']['data'].keys())
    for key in keys:
        # iterate the correlations within one pair; main_correlation_item is the same in each of them
        for relation in obj['data']['data'][key]:
            main_entity = relation['main_correlation_item']['data']['entity']
            main_page = relation['main_correlation_item']['page']
            main_q = relation['main_correlation_item']['data']['entity']['quadruple']
            main_origin_content = ""
            if main_entity['type'] == 'paragraph':
                main_origin_content = findRawContent(main_entity['id'])
            else:
                main_origin_content = "original source not captured for tables/formulas"
            matching_degree = relation['matching_degree']
            # multiple correlation_items mean a formula; skipped for now
            if len(relation['correlation_items']) > 1:
                print('formula, skipped')
                correlate_entity = {}
                correlate_entity['id'] = 'formula (skipped)'
                correlate_entity['type'] = 'formula (skipped)'
                correlate_page = 'formula (skipped)'
                correlate_q = 'formula (skipped)'
                correlate_origin_content = 'formula (skipped)'
            else:
                if len(relation['correlation_items']) == 1:
                    correlate_entity = relation['correlation_items'][0]['data']['entity']
                    correlate_page = relation['correlation_items'][0]['page']
                    correlate_q = relation['correlation_items'][0]['data']['entity']['quadruple']
                    correlate_origin_content = ""
                    if correlate_entity['type'] == 'paragraph':
                        correlate_origin_content = findRawContent(correlate_entity['id'])
                    else:
                        correlate_origin_content = "original source not captured for tables/formulas"
                else:
                    correlate_entity = {}
                    correlate_entity['id'] = 'no correlation'
                    correlate_entity['type'] = 'no correlation'
                    correlate_page = 'no correlation'
                    correlate_q = 'no correlation'
                    correlate_origin_content = 'no correlation'
            csv_writer.writerow([main_entity['id'], main_entity['type'],
                                 formateQuadruple(main_q, main_entity['type']),
                                 main_page,
                                 main_origin_content,
                                 matching_degree,
                                 correlate_entity['id'], correlate_entity['type'],
                                 formateQuadruple(correlate_q, correlate_entity['type']),
                                 correlate_page,
                                 correlate_origin_content])
    f.close()
    print('done.')
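To illustrate what the quadruple formatter produces, here is a self-contained rewrite of its inner loop applied to a made-up quadruple (the field names mirror the script above; all values are invented):

```python
def format_field(name, items):
    # join the 'text' of every entry under one field, comma-separated
    return name + ":" + ",".join(a["text"] for a in items)

quadruple = {
    "attributes": [{"text": "net profit"}, {"text": "consolidated"}],
    "value": [{"text": "1,234"}],
    "time": [{"text": "2020"}],
}
row = " ".join(format_field(k, quadruple[k]) for k in ("attributes", "value", "time"))
print(row)  # attributes:net profit,consolidated value:1,234 time:2020
```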
3. UiPath workflow file: (start the Fiddler proxy before running; during execution it clicks each correlation pair one by one, and calls the Python script above once before each page turn)