Using HttpClient with jsoup for web scraping: a summary

Author: 春天还没到 | Published: 2017-08-09 10:29

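For a project I needed to learn how to pull data from web pages for analysis. jsoup can fetch and parse a page on its own, but during testing its page fetches sometimes failed with connection timeouts, so I use HttpClient to fetch the page and jsoup to parse it.
The Maven dependencies for HttpClient and jsoup are: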

        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.3.6</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.3</version>
        </dependency>

The target page is loaded through a POST request, so the POST is wrapped with HttpClient. Here is the code:

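    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import org.apache.http.HttpEntity;
    import org.apache.http.NameValuePair;
    import org.apache.http.client.entity.UrlEncodedFormEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.message.BasicNameValuePair;
    import org.apache.http.util.EntityUtils;

    /**
     * Sends a form-encoded POST request and returns the response body.
     * @param url     the URL to request
     * @param map     form parameters
     * @param charset character encoding for the request and response bodies
     * @return the response body, or null if the request failed or had no body
     */
    public static String doPost(String url, Map<String, String> map, String charset) {
        String result = null;
        // DefaultHttpClient is deprecated as of HttpClient 4.3; HttpClients.createDefault()
        // returns a CloseableHttpClient that try-with-resources closes for us.
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            HttpPost httpPost = new HttpPost(url);
            // Build the form parameters
            List<NameValuePair> list = new ArrayList<NameValuePair>();
            for (Map.Entry<String, String> elem : map.entrySet()) {
                list.add(new BasicNameValuePair(elem.getKey(), elem.getValue()));
            }
            if (!list.isEmpty()) {
                httpPost.setEntity(new UrlEncodedFormEntity(list, charset));
            }
            // Closing the response releases the underlying connection
            try (CloseableHttpResponse response = httpClient.execute(httpPost)) {
                HttpEntity resEntity = response.getEntity();
                if (resEntity != null) {
                    result = EntityUtils.toString(resEntity, charset);
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return result;
    }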
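
To show roughly what the UrlEncodedFormEntity in doPost sends as the request body, here is a minimal JDK-only sketch of application/x-www-form-urlencoded encoding. The class FormBodyDemo, the helper encodeForm, and the sample parameter names are hypothetical and are not part of HttpClient; the library's own encoding may differ in small details.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBodyDemo {
    // Hypothetical helper: joins parameters as name=value pairs separated by '&',
    // percent-encoding each name and value in the given charset.
    static String encodeForm(Map<String, String> params, String charset) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(e.getKey(), charset))
                    .append('=')
                    .append(URLEncoder.encode(e.getValue(), charset));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalArgumentException("Unknown charset: " + charset, ex);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("page", "1");          // made-up parameter names
        params.put("keyword", "a b&c");
        System.out.println(encodeForm(params, "UTF-8")); // prints page=1&keyword=a+b%26c
    }
}
```

The charset matters here: non-ASCII parameter values are percent-encoded byte by byte, so the same charset should be passed that the target site expects.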

The content returned above is then parsed with jsoup through the Jsoup.parse method. The wrapper method is:

    import java.util.ArrayList;
    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    public static List<String> getElement(String content) {
        // Jsoup.connect(url).get() would fetch and parse a URL directly;
        // here we parse page content that has already been downloaded.
        Document document = Jsoup.parse(content);
        List<String> list = new ArrayList<>();
        Elements tableElements = document.getElementsByClass("viewTable");
        if (tableElements.isEmpty()) {
            return list; // nothing on the page has class "viewTable"
        }
        Elements trElements = tableElements.get(0).getElementsByTag("tr");
        // Start at index 1 to skip the first row (presumably the header)
        for (int i = 1; i < trElements.size(); i++) {
            // Turn the row text into a comma-separated line
            list.add(trElements.get(i).text().replace(" ", ","));
        }
        return list;
    }
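
Replacing every space in a row's text with a comma also splits any cell whose own text contains a space. As a hedged alternative, the sketch below joins the text of each td cell individually. The class ViewTableDemo, the method rowsAsCsv, and the sample HTML are made up for illustration; only the viewTable class name comes from the code above. It skips rows that have no td cells (such as a th header row) rather than skipping the first row by index, which is an assumption about the page layout.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class ViewTableDemo {
    // Hypothetical helper: one comma-separated line per data row.
    static List<String> rowsAsCsv(String html) {
        Document document = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        // CSS selector: every <tr> inside an element with class "viewTable"
        for (Element tr : document.select(".viewTable tr")) {
            Elements cells = tr.select("td");
            if (cells.isEmpty()) {
                continue; // rows made only of <th> cells have no <td>
            }
            StringBuilder line = new StringBuilder();
            for (Element cell : cells) {
                if (line.length() > 0) {
                    line.append(',');
                }
                line.append(cell.text()); // spaces inside a cell are kept
            }
            rows.add(line.toString());
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table class=\"viewTable\">"
                + "<tr><th>Name</th><th>City</th></tr>"
                + "<tr><td>Li Lei</td><td>New York</td></tr>"
                + "</table>";
        System.out.println(rowsAsCsv(html)); // prints [Li Lei,New York]
    }
}
```

A cell that itself contains a comma would still need quoting before the line is treated as CSV.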

A test run produced the following output:

[Image: image.png]

The results can then be processed, stored in a database, analyzed, queried, and displayed, as your own goal requires.
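
As one possible first processing step before storage, each comma-joined row can be split back into its fields. This is only a sketch: the class RowFieldsDemo, the method splitRows, and the sample rows are hypothetical, and it assumes that no field contains a comma.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowFieldsDemo {
    // Hypothetical helper: splits each comma-joined row into its fields.
    static List<List<String>> splitRows(List<String> rows) {
        List<List<String>> table = new ArrayList<>();
        for (String row : rows) {
            // A limit of -1 keeps trailing empty fields instead of dropping them
            table.add(Arrays.asList(row.split(",", -1)));
        }
        return table;
    }

    public static void main(String[] args) {
        // Made-up sample rows in the shape getElement returns
        List<String> rows = Arrays.asList("2017-08-01,A,10", "2017-08-02,B,");
        for (List<String> fields : splitRows(rows)) {
            System.out.println(fields.size() + " fields: " + fields);
        }
    }
}
```

Each inner list can then be bound to the columns of an insert statement or mapped onto a domain object.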

Article link: https://www.haomeiwen.com/subject/vfefrxtx.html
