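When scraping a website with a Java crawler, we usually rely on HttpClient to send the same headers a browser would send, so that the request looks like an ordinary browser visit. This gets past anti-crawling checks that inspect browser information. HttpClient is also handy for checking how a server responds to different browsers, without installing each of those browsers just to observe the responses.
Main features of HttpClient:
Implements all HTTP methods (GET, POST, PUT, DELETE, HEAD, OPTIONS, etc.)
Supports HTTPS
Supports proxy servers
Supports automatic redirects
Fetching page data with HttpClient
Add the httpclient dependency to pom.xml: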
<!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.8</version>
</dependency>
Example: fetching data from China Talent Network (cjol.com)

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.junit.Test; // assumes JUnit 4 for @Test

@Test
public void test2() {
    // Create the HttpGet instance
    HttpGet httpget = new HttpGet("http://s.cjol.com/service/joblistjson.aspx?KeywordType=3&KeyWord=java%E5%BC%80%E5%8F%91%E5%B7%A5%E7%A8%8B%E5%B8%88&SearchType=3&ListType=2&page=1");
    // Imitate a browser
    httpget.setHeader("User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36");
    httpget.setHeader("Host", "s.cjol.com");
    httpget.setHeader("Origin", "http://s.cjol.com");
    httpget.setHeader("Referer", "http://s.cjol.com/kw-java%E5%BC%80%E5%8F%91%E5%B7%A5%E7%A8%8B%E5%B8%88/?SearchType=3&KeywordType=3");
    httpget.setHeader("X-Requested-With", "XMLHttpRequest");
    // try-with-resources closes the response and the client, releasing the connection
    try (CloseableHttpClient httpclient = HttpClients.createDefault();
         CloseableHttpResponse response = httpclient.execute(httpget)) {
        // Read the response entity as a UTF-8 string
        HttpEntity entity = response.getEntity();
        String content = EntityUtils.toString(entity, "utf-8");
        System.out.println(content);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Output:

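The code above uses HttpClient to send a GET request that presents itself as Chrome and retrieves the data. (Sites such as Zhaopin reject the request if it does not look like it came from a browser.)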
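The effect of the User-Agent header can be tried out locally without touching a real site. The following is a minimal sketch using only the JDK (HttpURLConnection and the built-in com.sun.net.httpserver.HttpServer) rather than HttpClient. The class name UserAgentCheckDemo, the /jobs path, and the "must contain Mozilla" rule are assumptions made up for illustration; real sites apply their own checks.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

final class UserAgentCheckDemo {

    // Starts a local server with a naive, made-up anti-crawler rule:
    // a request whose User-Agent does not contain "Mozilla" gets 403.
    static HttpServer startServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/jobs", exchange -> {
                String ua = exchange.getRequestHeaders().getFirst("User-Agent");
                boolean looksLikeBrowser = ua != null && ua.contains("Mozilla");
                // -1 means the response carries no body
                exchange.sendResponseHeaders(looksLikeBrowser ? 200 : 403, -1);
                exchange.close();
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sends a GET with the given User-Agent and returns the HTTP status code.
    static int statusFor(int port, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI
                    .create("http://127.0.0.1:" + port + "/jobs")
                    .toURL()
                    .openConnection(Proxy.NO_PROXY);
            conn.setRequestProperty("User-Agent", userAgent);
            int status = conn.getResponseCode();
            conn.disconnect();
            return status;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startServer();
        try {
            int port = server.getAddress().getPort();
            System.out.println("plain client:   " + statusFor(port, "Java-crawler"));
            System.out.println("browser header: "
                    + statusFor(port, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"));
        } finally {
            server.stop(0);
        }
    }
}
```

The same request succeeds or fails depending only on one header, which is why the HttpClient example sets User-Agent before executing the request.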
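Spring also has a class, RestTemplate, that can do what HttpClient does and send HTTP requests to a website.
RestTemplate
RestTemplate is the client that Spring provides for accessing REST services. It offers a range of convenient methods for calling remote HTTP services and makes client code much quicker to write, which is why many clients, such as Android apps and third-party service providers, use RestTemplate to call RESTful services.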

The execution flow is:
- 1. RestTemplate's doExecute(URI url, @Nullable HttpMethod method, @Nullable RequestCallback requestCallback, @Nullable ResponseExtractor<T> responseExtractor) method is called.
- 1.1 It calls the createRequest(url, method) method of HttpAccessor.
- 2. That calls the createRequest(url, method) method of SimpleClientHttpRequestFactory.
- 2.1 openConnection(url, proxy) of SimpleClientHttpRequestFactory obtains the connection conn.
- 2.2 prepareConnection(conn, method) of SimpleClientHttpRequestFactory configures the connection.
- 2.3 If method is POST, PUT, PATCH or DELETE, output is enabled so that a request body can be sent.
- 1.2 requestCallback.doWithRequest(request) is executed.
- 1.3 request.execute() is executed.
- 1.4 handleResponse(url, method, response) is executed.
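The steps above can be sketched with plain JDK classes. This is not Spring's implementation, only a reduced imitation of the same order of operations (create the request, let a callback fill it, execute, handle the response). The class MiniRequestFlow, the /echo path and the local echo server are hypothetical, the body rule is cut down to POST and PUT, and Java 9 or later is assumed for readAllBytes.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class MiniRequestFlow {

    // Steps 2, 2.1 and 2.2: open the connection, then prepare it for the method.
    static HttpURLConnection createRequest(URI uri, String method) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection(Proxy.NO_PROXY);
        conn.setDoInput(true);
        // Step 2.3, reduced for this sketch: only POST and PUT may carry a body.
        conn.setDoOutput("POST".equals(method) || "PUT".equals(method));
        conn.setRequestMethod(method);
        return conn;
    }

    // Steps 1.1 to 1.4 in order; returns the response body as text.
    static String doExecute(URI uri, String method, String body) {
        try {
            HttpURLConnection conn = createRequest(uri, method);       // 1.1
            byte[] buffered = body.getBytes(StandardCharsets.UTF_8);   // 1.2 the "callback" buffers the body
            if (conn.getDoOutput()) {                                  // 1.3 execute: send the body
                try (OutputStream out = conn.getOutputStream()) {
                    out.write(buffered);
                }
            }
            int status = conn.getResponseCode();                       // 1.4 handle the response
            if (status >= 400) {
                throw new IllegalStateException("HTTP error " + status);
            }
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical local server that echoes the request body back.
    static HttpServer startEchoServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/echo", exchange -> {
                byte[] received = exchange.getRequestBody().readAllBytes();
                // -1 signals "no response body" when nothing was received
                exchange.sendResponseHeaders(200, received.length == 0 ? -1 : received.length);
                if (received.length > 0) {
                    exchange.getResponseBody().write(received);
                }
                exchange.close();
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as MiniRequestFlow.doExecute(uri, "POST", "hello") against the echo server walks through all four numbered steps once.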
Execution starts with a call to the doExecute method:

@Nullable
protected <T> T doExecute(URI url, @Nullable HttpMethod method, @Nullable RequestCallback requestCallback, @Nullable ResponseExtractor<T> responseExtractor) throws RestClientException {
    Assert.notNull(url, "'url' must not be null");
    Assert.notNull(method, "'method' must not be null");
    ClientHttpResponse response = null;
    String resource;
    try {
        // 1. Inside doExecute, RestTemplate first obtains a ClientHttpRequest, which wraps the request information.
        ClientHttpRequest request = this.createRequest(url, method);
        if (requestCallback != null) {
            requestCallback.doWithRequest(request);
        }
        response = request.execute();
        this.handleResponse(url, method, response);
        if (responseExtractor != null) {
            Object var14 = responseExtractor.extractData(response);
            return var14;
        }
        resource = null;
    } catch (IOException var12) {
        resource = url.toString();
        String query = url.getRawQuery();
        // 63 is the character code of '?'
        resource = query != null ? resource.substring(0, resource.indexOf(63)) : resource;
        throw new ResourceAccessException("I/O error on " + method.name() + " request for \"" + resource + "\": " + var12.getMessage(), var12);
    } finally {
        if (response != null) {
            response.close();
        }
    }
    return resource;
}

// createRequest in HttpAccessor
protected ClientHttpRequest createRequest(URI url, HttpMethod method) throws IOException {
    ClientHttpRequest request = this.getRequestFactory().createRequest(url, method);
    if (this.logger.isDebugEnabled()) {
        this.logger.debug("Created " + method.name() + " request for \"" + url + "\"");
    }
    return request;
}

// createRequest in SimpleClientHttpRequestFactory
public ClientHttpRequest createRequest(URI uri, HttpMethod httpMethod) throws IOException {
    HttpURLConnection connection = this.openConnection(uri.toURL(), this.proxy);
    this.prepareConnection(connection, httpMethod.name());
    return (ClientHttpRequest)(this.bufferRequestBody ? new SimpleBufferingClientHttpRequest(connection, this.outputStreaming) : new SimpleStreamingClientHttpRequest(connection, this.chunkSize, this.outputStreaming));
}
The openConnection method opens the connection:
protected HttpURLConnection openConnection(URL url, @Nullable Proxy proxy) throws IOException {
    URLConnection urlConnection = proxy != null ? url.openConnection(proxy) : url.openConnection();
    if (!HttpURLConnection.class.isInstance(urlConnection)) {
        throw new IllegalStateException("HttpURLConnection required for [" + url + "] but got: " + urlConnection);
    } else {
        return (HttpURLConnection)urlConnection;
    }
}
The prepareConnection method configures the connection, including whether it may send a request body:
// The doOutput property makes conn.getOutputStream().write() usable, which is how a request body gets sent
protected void prepareConnection(HttpURLConnection connection, String httpMethod) throws IOException {
    if (this.connectTimeout >= 0) {
        connection.setConnectTimeout(this.connectTimeout);
    }
    if (this.readTimeout >= 0) {
        connection.setReadTimeout(this.readTimeout);
    }
    connection.setDoInput(true);
    if ("GET".equals(httpMethod)) {
        connection.setInstanceFollowRedirects(true);
    } else {
        connection.setInstanceFollowRedirects(false);
    }
    // If the method is not POST, PUT, PATCH or DELETE, doOutput is set to false
    if (!"POST".equals(httpMethod) && !"PUT".equals(httpMethod) && !"PATCH".equals(httpMethod) && !"DELETE".equals(httpMethod)) {
        connection.setDoOutput(false);
    } else {
        // With doOutput enabled, conn.getOutputStream().write() can be used to send the request body
        connection.setDoOutput(true);
    }
    connection.setRequestMethod(httpMethod);
}
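The method-based rules in prepareConnection can be observed on a connection that has not been connected yet, since URL.openConnection only creates the object. The sketch below applies similar rules and reads the flags back. PrepareConnectionSketch and its two methods are hypothetical names, and PATCH is left out on purpose because the JDK's HttpURLConnection.setRequestMethod does not accept it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URI;

final class PrepareConnectionSketch {

    // Applies method-based rules to a fresh, unconnected connection.
    static HttpURLConnection prepare(String httpMethod) {
        try {
            // No network traffic happens here: the connection is only created.
            HttpURLConnection conn = (HttpURLConnection) URI
                    .create("http://localhost/")
                    .toURL()
                    .openConnection(Proxy.NO_PROXY);
            conn.setDoInput(true);
            // Redirects are followed automatically for GET only.
            conn.setInstanceFollowRedirects("GET".equals(httpMethod));
            // Output (a request body) is enabled only for methods that carry one.
            boolean mayHaveBody = "POST".equals(httpMethod)
                    || "PUT".equals(httpMethod)
                    || "DELETE".equals(httpMethod);
            conn.setDoOutput(mayHaveBody);
            conn.setRequestMethod(httpMethod);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Summarises the resulting flags as one line of text.
    static String describe(String httpMethod) {
        HttpURLConnection conn = prepare(httpMethod);
        return conn.getRequestMethod()
                + " doOutput=" + conn.getDoOutput()
                + " followRedirects=" + conn.getInstanceFollowRedirects();
    }

    public static void main(String[] args) {
        for (String method : new String[] {"GET", "POST", "PUT", "DELETE", "HEAD"}) {
            System.out.println(describe(method));
        }
    }
}
```

Running main prints one line per method, which makes it easy to see that GET is the only method here that follows redirects and that GET and HEAD never get doOutput.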

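A URL connection can be used for input, output, or both.
setDoOutput sets the doOutput flag: set it to true if you intend to use the URL connection for output, and to false if not. The default is false.
public void setDoOutput(boolean dooutput) {
    if (connected)
        throw new IllegalStateException("Already connected");
    doOutput = dooutput;
}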
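What the flag means in practice can be checked without any server. In the sketch below, a connection is created with doOutput left at its default, and asking it for an output stream is refused with a ProtocolException before any network access takes place. DoOutputDemo and its method names are hypothetical; the target URL is never contacted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.ProtocolException;
import java.net.Proxy;
import java.net.URI;

final class DoOutputDemo {

    // Creates a connection object without connecting it.
    private static HttpURLConnection newConnection() {
        try {
            return (HttpURLConnection) URI
                    .create("http://localhost/")
                    .toURL()
                    .openConnection(Proxy.NO_PROXY);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the flag on a fresh connection.
    static boolean defaultDoOutput() {
        return newConnection().getDoOutput();
    }

    // Returns true when getOutputStream() is refused because doOutput is false.
    static boolean outputStreamRejected() {
        HttpURLConnection conn = newConnection();
        try {
            conn.getOutputStream();
            return false;
        } catch (ProtocolException e) {
            // The flag is checked before the connection is opened.
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("default doOutput: " + defaultDoOutput());
        System.out.println("getOutputStream rejected: " + outputStreamRejected());
    }
}
```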
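RequestCallback wraps the request body and request headers
This object gives access to the request parameters we need. One step in doWithRequest is especially important and is closely tied to how the connection sends the request body, as described earlier. The request headers are set by the SimpleBufferingClientHttpRequest.addHeaders method, so how does the request body, bufferedOutput, get its value? That happens inside doWithRequest.
Next, response = request.execute(); runs.
The SimpleBufferingClientHttpRequest instance then takes the request body and headers and copies the body to the connection's output stream:
FileCopyUtils.copy(bufferedOutput, this.connection.getOutputStream());
Writing to the output stream is only possible when doOutput is true. When prepareConnection has set doOutput to false, as it does for GET, the body is never written, so the server cannot read a request body and reports HttpMessageNotReadableException: Required request body is missing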
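The consequence can be reproduced with the JDK alone. In the sketch below, a hypothetical local server reports how many body bytes it received, and the client buffers a body for every request but writes it only when doOutput is enabled, as in the copy step above. MissingBodyDemo, the /len path and the reduced POST/PUT rule are assumptions for illustration; Java 9 or later is assumed for readAllBytes.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class MissingBodyDemo {

    // Local server that answers with the number of request-body bytes it saw.
    static HttpServer startServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/len", exchange -> {
                int received = exchange.getRequestBody().readAllBytes().length;
                byte[] reply = Integer.toString(received).getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, reply.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(reply);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Buffers a body for any method, but sends it only if doOutput is enabled.
    static String bytesSeenByServer(int port, String method, String body) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI
                    .create("http://127.0.0.1:" + port + "/len")
                    .toURL()
                    .openConnection(Proxy.NO_PROXY);
            conn.setRequestMethod(method);
            conn.setDoOutput("POST".equals(method) || "PUT".equals(method));
            byte[] bufferedOutput = body.getBytes(StandardCharsets.UTF_8);
            if (conn.getDoOutput()) {
                try (OutputStream out = conn.getOutputStream()) {
                    out.write(bufferedOutput);
                }
            }
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startServer();
        try {
            int port = server.getAddress().getPort();
            System.out.println("GET:  " + bytesSeenByServer(port, "GET", "hello"));
            System.out.println("POST: " + bytesSeenByServer(port, "POST", "hello"));
        } finally {
            server.stop(0);
        }
    }
}
```

With the same buffered body, the server sees nothing for GET and the full body for POST, which is the situation a Spring controller reports as a missing request body.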
Finally, parsing the response
The last step is parsing the response, which mainly means handling errors.
handleResponseError(method, url, response);
Usage