First, add the jsoup dependency:
implementation 'org.jsoup:jsoup:1.14.3'
jsoup is very convenient for crawler-style applications. When fetching the Document for a URL, it is worth adding a few safeguards, such as a request timeout.
The code:
public String gethtmldoc(String rpcaddress) {
    try {
        // urlweb (the page URL) and token (the access id) are assumed to be
        // fields of the enclosing class.
        Document doc = Jsoup.connect(urlweb)
                .header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:49.0) Gecko/20100101 Firefox/49.0")
                .header("Connection", "close") // be sure to include this header when connecting this way
                .timeout(8000)                 // timeout in milliseconds
                .get();
        // Build the injected tag once; note the space after each attribute.
        String script = "<script type=\"text/javascript\" " +
                // To inject the network copy instead, use:
                // "src=\"https://scan.rupt.vip/lib/inpage.js?tx=" + TimeUtils.getUTCMillstime() + "\" " +
                "src=\"file:///android_asset/dist/inpage.js\" " +
                "id=\"inpage\" " +
                "accessid=\"" + token + "\" " +
                "endpoint=\"" + rpcaddress + "\"></script>";
        Element head = doc.head(); // jsoup normalizes the parsed document, so head() is never null
        if (head.childNodeSize() > 0) {
            head.child(0).before(script); // insert before the first element in <head>
        } else {
            head.append(script);          // empty <head>: append directly
        }
        return doc.outerHtml();
    } catch (IOException e) { // SocketTimeoutException can be caught separately for finer handling
        // timeout or other network failure
        return null;
    }
}
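The catch block above can distinguish a timeout from other I/O failures by catching java.net.SocketTimeoutException before the general IOException. A minimal, self-contained sketch of that pattern; the fetch method here is a stub standing in for the jsoup call:

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

public class TimeoutDemo {
    // Stub standing in for Jsoup.connect(...).get(); throws to simulate a slow server.
    static String fetch(boolean slow) throws IOException {
        if (slow) throw new SocketTimeoutException("Read timed out");
        return "<html></html>";
    }

    // Returns the page, or a marker describing why the fetch failed.
    static String fetchOrExplain(boolean slow) {
        try {
            return fetch(slow);
        } catch (SocketTimeoutException e) {
            return "timeout";       // the server was too slow: could retry with a longer timeout
        } catch (IOException e) {
            return "network-error"; // DNS failure, connection refused, etc.
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchOrExplain(true));  // timeout
        System.out.println(fetchOrExplain(false)); // <html></html>
    }
}
```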
The only difference between the two is the script source:
Network js reference: "src="https://scan.genesischain.io/lib/inpage.js?tx=343451312""
Local js reference: "src="file:///android_asset/dist/inpage.js""
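Since the two injected tags differ only in their src attribute (the id, accessid, and endpoint parts are identical), the tag can be built by one helper. A small sketch; the accessid and endpoint values below are placeholders, not real credentials:

```java
public class ScriptTag {
    // Builds the injected <script> tag; only `src` varies between the
    // network and local cases.
    static String buildScriptTag(String src, String accessid, String endpoint) {
        return "<script type=\"text/javascript\" " +
                "src=\"" + src + "\" " +
                "id=\"inpage\" " +
                "accessid=\"" + accessid + "\" " +
                "endpoint=\"" + endpoint + "\"></script>";
    }

    public static void main(String[] args) {
        // Network copy (cache-busted with a timestamp query) vs. the bundled asset.
        String network = buildScriptTag(
                "https://scan.genesischain.io/lib/inpage.js?tx=343451312",
                "demo-token", "https://rpc.example.com");
        String local = buildScriptTag(
                "file:///android_asset/dist/inpage.js",
                "demo-token", "https://rpc.example.com");
        System.out.println(network);
        System.out.println(local);
    }
}
```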
Then reload the page:
String dochtml = gethtmldoc("https://www.XXXXX.com");
shopWeb.loadDataWithBaseURL(urlweb, dochtml, "text/html", "utf-8", urlweb);
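One caveat: on Android, the jsoup fetch performs blocking network I/O and must not run on the main thread (it would throw NetworkOnMainThreadException), while loadDataWithBaseURL must run on the main thread. A hedged sketch of the threading pattern, with the fetch and the WebView call stubbed out so it runs outside Android:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LoadPageOffMainThread {
    // Stub standing in for gethtmldoc(url); the real version does network I/O.
    static String fetchHtml(String url) {
        return "<html><head></head><body>stub for " + url + "</body></html>";
    }

    // Submits the blocking fetch to a background thread so the Android main
    // thread (which forbids network I/O) is never blocked.
    static Future<String> loadPageAsync(ExecutorService pool, String url) {
        return pool.submit(() -> fetchHtml(url));
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> page = loadPageAsync(pool, "https://example.com");
        // In an Activity this would happen in a callback posted to the UI thread:
        // shopWeb.loadDataWithBaseURL(url, page.get(), "text/html", "utf-8", url);
        System.out.println(page.get());
        pool.shutdown();
    }
}
```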