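Recently my company has been working on a news-scraping project that uses Jsoup. Since I had just started blogging on CSDN, I wanted to test whether opening a CSDN article link with Jsoup would increase its view count. The answer is yes!
When scraping web pages without an IP proxy, your IP may get banned, so we need an IP proxy pool and send requests through proxy IPs.
Without further ado, here is the code.
- Scraping utility class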
package com.lyc.cn.ipProxy;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.springframework.util.StringUtils;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
public class JsoupUtils {
private static Log logger = LogFactory.getLog(JsoupUtils.class);
private static List<String> list = new ArrayList<String>();
/**
* Randomly picks one proxy address from the pool.
* @return the address split on ":" into host and port
*/
public static String[] getRandomIp() {
if (list.size() == 0) {
list.add("212.77.130.65:30493");
list.add("168.228.166.116:47059");
list.add("54.38.202.253:54321");
list.add("47.105.137.4:80");
list.add("47.105.84.52:80");
list.add("47.105.129.220:80");
list.add("47.105.137.51:80");
list.add("47.105.84.67:80");
list.add("103.11.99.66:80");
list.add("50.226.134.50:80");
list.add("39.135.11.97:8080");
list.add("111.7.130.101:8080");
list.add("122.117.165.51:8080");
list.add("47.105.131.35:80");
list.add("47.105.137.135:80");
list.add("47.105.115.176:80");
list.add("157.65.28.91:3128");
list.add("153.149.169.215:3128");
list.add("140.143.105.229:80");
list.add("117.127.0.201:8080");
}
Random random = new Random();
int n = random.nextInt(list.size());
return list.get(n).split(":");
}
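/**
* Opens the URL with Jsoup through a random proxy and returns the Document.
* @param url the page to fetch
* @return the parsed Document, or null if the request failed
*/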
public static Document getDocument(String url) {
try {
Connection conn = Jsoup.connect(url).ignoreContentType(true).ignoreHttpErrors(true).userAgent("Mozilla");
// Set the proxy
String ip[] = JsoupUtils.getRandomIp();
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(ip[0].trim(), Integer.parseInt(ip[1].trim())));
conn.proxy(proxy);
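// Set the timeout and fetch the Document
Document document = conn.timeout(8000).get();
// A non-empty document means the request went through; otherwise the proxy IP was probably blocked or something else went wrong
if (null != document && !StringUtils.isEmpty(document.toString())) {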
System.out.println(proxy.toString());
return document;
}
} catch (Exception e) {
logger.error("Fetch failed: " + url, e);
}
return null;
}
}
- Test class
package com.lyc.cn.ipProxy;
import org.jsoup.nodes.Document;
import org.junit.Test;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
public class MyTest {
List<String> list = new ArrayList<>();
@Test
public void testDetail1() throws InterruptedException {
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82383245");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82383422");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82383797");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82384128");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82384247");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82384376");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82384794");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82384899");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82388043");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82388479");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82391647");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82424122");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82428726");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82432993");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82464980");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82493058");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82585752");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82591822");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82630434");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82633936");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82691306");
list.add("https://blog.csdn.net/lyc_liyanchao/article/details/82696236");
// Visit a randomly chosen blog post on each iteration
for (int i = 0; i <= 1000; i++) {
Random random = new Random();
int n = random.nextInt(list.size());
String url = list.get(n);
Document doc = JsoupUtils.getDocument(url);
if (null != doc) {
System.out.println("Fetch #" + i + ", url: " + url);
}
}
}
}
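First, we use a List to simulate an IP proxy pool and randomly pick one entry as the proxy for each request. Second, we put the blog URLs we want to visit into another List and randomly pick one each time. By adjusting the bounds of the for loop, we can then drive up the blog's view count.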
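The hard-coded list above never drops a proxy that has stopped working, so dead entries keep getting picked. Below is a minimal sketch of a slightly more careful pool, using only the JDK. The class name `ProxyPool` and its methods `pick`, `markDead`, and `toProxy` are hypothetical names of my own, not part of Jsoup or of the code above; the sketch assumes entries stay in the same "host:port" format.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: a proxy pool that can evict entries that fail.
final class ProxyPool {
    // Entries are kept as "host:port" strings, the same format used in the post.
    private final List<String> entries = new CopyOnWriteArrayList<>();

    ProxyPool(List<String> hostPorts) {
        for (String entry : hostPorts) {
            entries.add(entry.trim()); // normalize stray whitespace once, up front
        }
    }

    /** Returns a randomly chosen entry, or null when the pool is empty. */
    String pick() {
        // Work on a snapshot so a concurrent markDead() cannot invalidate the index.
        Object[] snapshot = entries.toArray();
        if (snapshot.length == 0) {
            return null;
        }
        return (String) snapshot[ThreadLocalRandom.current().nextInt(snapshot.length)];
    }

    /** Removes an entry after a request through it has failed. */
    void markDead(String entry) {
        entries.remove(entry);
    }

    int size() {
        return entries.size();
    }

    /** Converts a "host:port" entry into a java.net.Proxy of type HTTP. */
    static Proxy toProxy(String entry) {
        int colon = entry.lastIndexOf(':');
        if (colon <= 0 || colon == entry.length() - 1) {
            throw new IllegalArgumentException("Expected host:port but got: " + entry);
        }
        String host = entry.substring(0, colon).trim();
        int port = Integer.parseInt(entry.substring(colon + 1).trim());
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port));
    }
}
```

In getDocument, the idea would be to call pick(), pass toProxy(entry) to conn.proxy(...), and call markDead(entry) inside the catch block, so that a proxy that times out is not chosen again.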
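This post is only an experiment and is not meant to encourage anyone to inflate their own blog's view count. Good articles will find readers anyway.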
You can also use Jsoup in your own projects to scrape data from other websites. It is simple to use; a basic grasp of CSS and CSS selector rules is all you need.
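To give a feel for the selector side, here is a small sketch that parses an HTML string and pulls out article titles and links. The HTML structure (`div.article-list`, `h4`), the class name `ArticleLinkExtractor`, and the example.com base URI are all made up for illustration; a real site needs selectors that match its own markup. The sketch needs the Jsoup jar on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing Jsoup's CSS selector support.
final class ArticleLinkExtractor {

    /** Returns "title -> absolute URL" for every link inside the article list. */
    static List<String> extractLinks(String html, String baseUri) {
        // The base URI lets Jsoup resolve relative hrefs into absolute URLs.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> links = new ArrayList<>();
        // Descendant selector: <a> elements with an href, under an <h4>,
        // under a <div class="article-list">.
        for (Element a : doc.select("div.article-list h4 a[href]")) {
            links.add(a.text() + " -> " + a.absUrl("href"));
        }
        return links;
    }
}
```

The same select call works on the Document returned by JsoupUtils.getDocument, so the fetching and extracting steps can be combined.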
Finally, here is a site that lists IP proxies: http://www.goubanjia.com/