Requirement: crawl the email addresses found on a given web page.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawTest
{
    public static void Craw() throws Exception
    {
        URL url = new URL("https://tieba.baidu.com/p/2702208078?pid=109728187052&cid=109728324464#109728324464");
        URLConnection conn = url.openConnection();  // open a connection to the page
        BufferedReader bufin = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line = null;
        String mailregx = "\\w+@\\w+(\\.\\w+)+";    // email regex; \w matches letters, digits, and underscore
        Pattern p = Pattern.compile(mailregx);      // compile the regex into a Pattern object
        Set<String> set = new HashSet<String>();
        // An ArrayList was used at first, but it kept duplicate addresses;
        // a Set guarantees uniqueness, so a HashSet is used instead.
        while ((line = bufin.readLine()) != null)   // read the page line by line
        {
            Matcher m = p.matcher(line);            // bind the line to the pattern to get a Matcher
            while (m.find())                        // loop over every substring that matches the pattern
            {
                set.add(m.group());
            }
        }
        bufin.close();
        for (String string : set)
        {
            System.out.println(string);             // print the collected addresses
        }
    }

    public static void main(String[] args) throws Exception
    {
        Craw();
    }
}
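As a quick sanity check of the regular expression itself, the same Pattern/Matcher logic can be pulled into a small helper and run against a hard-coded string, with no network access needed. The class and method names below (MailRegexDemo, extract) are only illustrative and are not part of the original code; this is a minimal sketch of the matching step on its own.

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MailRegexDemo
{
    // Same pattern as above: word characters, '@', then one or more dot-separated word groups.
    private static final Pattern MAIL = Pattern.compile("\\w+@\\w+(\\.\\w+)+");

    // Returns every distinct substring of the text that matches the mail pattern.
    public static Set<String> extract(String text)
    {
        Set<String> result = new HashSet<String>();
        Matcher m = MAIL.matcher(text);
        while (m.find())
        {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args)
    {
        // Offline test of the pattern; "not-an-email" is correctly ignored.
        String sample = "contact: 123456@qq.com, admin@example.co.uk, not-an-email";
        System.out.println(extract(sample));  // prints both addresses (set order may vary)
    }
}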
Output:
648258994@QQ.COM
154947908@qq.com
u4e86308599300@qq.com
308599300@qq.com
u542c976522243@qq.com
976522243@qq.com
...