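I recently started learning web scraping, beginning with about the simplest target there is: Qiushibaike (糗事百科). The goal is to scrape its 24-hour hot list. The site looks like this:
[Screenshot: the Qiushibaike 24-hour hot list page]
Now look at the page source:
[Screenshot: the page source]
You can see where the jokes sit in the markup; we will write a regular expression for that spot in a moment.
OK, time to open Python. We need the urllib, urllib2, and re packages (this is Python 2; urllib2 does not exist in Python 3).
Request builds the request for the page, urlopen opens it, and read pulls down the page content.
Then a regular expression matches the content we are looking for.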
# -*- coding: utf-8 -*-
# Python 2 script: scrape one page of the hot list into a text file
import urllib
import urllib2
import re
#import sys
#reload(sys)
#sys.setdefaultencoding('utf-8')
page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    # Group 1 is the joke text, group 2 the vote count; re.S lets "." span newlines
    pattern = re.compile(r'<span>([^<].*?)</span>.*?<div.*?>\s<span.*?>\s*?<span.*?>(\d*?)</i>', re.S)
    items = re.findall(pattern, content)
    # Save the matches to a file (raw string so the backslashes in the path stay literal)
    a = open(r"E:\edx 6.001\qsbk.txt", "w")
    for item in items:
        a.write(item[0].encode('utf-8'))
        a.write('\n')
        a.write(item[1].encode('utf-8'))
        a.write('\n')
    a.close()
except urllib2.URLError as e:
    # HTTP errors carry a status code; lower-level failures carry a reason
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
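If you are on Python 3, the script above will not run as written, because urllib2 was merged into urllib.request. Here is a minimal sketch of the same Request / urlopen / read / regex steps for Python 3. The helper names (build_request, fetch, extract), SAMPLE_HTML, and ITEM_PATTERN are made up for illustration, and the sample markup is a simplified assumption, not the site's real HTML, so the pattern would need adjusting against the actual page source.

```python
import re
import urllib.request

BASE_URL = 'http://www.qiushibaike.com/hot/page/'  # URL taken from the post
USER_AGENT = 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)'

# Hypothetical, simplified markup used only to demonstrate the regex step.
SAMPLE_HTML = """
<div class="content">
<span>First joke text</span>
</div>
<div class="stats"><span class="stats-vote"><i class="number">1234</i> funny</span></div>
<div class="content">
<span>Second joke,
on two lines</span>
</div>
<div class="stats"><span class="stats-vote"><i class="number">56</i> funny</span></div>
"""

# Group 1: joke text, group 2: vote count. re.S lets "." match newlines.
ITEM_PATTERN = re.compile(
    r'<div class="content">\s*<span>(.*?)</span>.*?<i class="number">(\d+)</i>',
    re.S,
)


def build_request(page):
    """Build the request for one page of the hot list, with a browser-like User-Agent."""
    return urllib.request.Request(BASE_URL + str(page),
                                  headers={'User-Agent': USER_AGENT})


def fetch(page):
    """Download one page and decode it as UTF-8 (needs network access)."""
    with urllib.request.urlopen(build_request(page), timeout=10) as response:
        return response.read().decode('utf-8')


def extract(html):
    """Return a list of (joke_text, vote_count) tuples found in the HTML."""
    return [(text.strip(), int(votes)) for text, votes in ITEM_PATTERN.findall(html)]


if __name__ == '__main__':
    # Runs offline against the sample; swap in fetch(1) to try a live page.
    for text, votes in extract(SAMPLE_HTML):
        print(votes, text)
```

Keeping the download and the parsing in separate functions makes it possible to tune the regex against a saved copy of the page without hitting the site on every run.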
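Here are the scraped jokes:
[Screenshot: the scraped jokes]
What to do with them once they are scraped? Why not email them to a friend. And who writes emails by hand? Let's write some Python;
the Python standard library has packages for talking to a mail server.
A few things to watch out for:
First, enable the SMTP service for the mailbox you use; log in to the mailbox and turn it on in the settings.
Next, you will be asked to set an authorization code. The password in the code below must be this authorization code, not your mailbox password.
After that I ran the script and the mail would not go out: it failed with a 554 error, because NetEase was blocking it as spam. After searching around online I finally found the fix: add two lines of code that set the 'from' and 'to' headers to the email addresses.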
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.image import MIMEImage

sender = '*****@126.com'
receiver = '*******@qq.com'
subject = 'python email test'
smtpserver = 'smtp.126.com'
username = '********'
password = '1111111'  # the SMTP authorization code, not the mailbox password

msgRoot = MIMEMultipart('related')
msgRoot['Subject'] = 'test mail'
# Set From/To so the message is not blocked as spam
msgRoot['from'] = '*******@126.com'
msgRoot['to'] = '********@qq.com'

# Build the attachment (run this from the directory that holds qsbk.txt)
att = MIMEText(open('qsbk.txt', 'rb').read(), 'base64', 'utf-8')
att["Content-Type"] = 'application/octet-stream'
# Outer double quotes so the apostrophe in the file name does not end the string
att["Content-Disposition"] = "attachment; filename=\"it's funny.txt\""
msgRoot.attach(att)

smtp = smtplib.SMTP()
smtp.connect('smtp.126.com')
smtp.login(username, password)
smtp.sendmail(sender, receiver, msgRoot.as_string())
smtp.quit()
print ("ok")
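For readers on Python 3, the same idea (explicit From and To headers plus a file attachment) can be sketched with the newer email.message.EmailMessage API. This is an alternative to the script above, not a description of it; build_message and send are hypothetical helper names, and the example.com addresses are placeholders.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, receiver, subject, body, attachment_name, attachment_bytes):
    """Build a message with explicit From/To headers and one binary attachment."""
    msg = EmailMessage()
    msg['Subject'] = subject
    # The post found that leaving these two headers out got the mail rejected (554).
    msg['From'] = sender
    msg['To'] = receiver
    msg.set_content(body)
    msg.add_attachment(attachment_bytes,
                       maintype='application', subtype='octet-stream',
                       filename=attachment_name)
    return msg


def send(msg, host, username, auth_code):
    """Log in with the SMTP authorization code and send (needs network access)."""
    with smtplib.SMTP(host) as smtp:
        smtp.login(username, auth_code)
        smtp.send_message(msg)


if __name__ == '__main__':
    # Placeholder addresses and content; nothing is sent here.
    demo = build_message('me@example.com', 'friend@example.com', 'test mail',
                         'Jokes attached.', 'qsbk.txt', b'joke text')
    print(demo['From'], '->', demo['To'])
```

Building the message in its own function means the headers and attachment can be checked locally before any login attempt against the mail server.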
What next? You could create a .bat batch file and schedule it so the mail goes out every day, hehe.
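The post goes the .bat route. If you would rather stay in Python, one alternative is a small loop that sleeps until a fixed time each day and then runs the job. This is a rough Python 3 sketch of that swapped-in approach, not the batch-file method; seconds_until_next and run_daily are hypothetical names, and the script has to stay running for it to work.

```python
import time
from datetime import datetime, timedelta


def seconds_until_next(hour, minute, now):
    """Seconds from `now` until the next occurrence of hour:minute."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # Today's slot has already passed (or is right now), so aim for tomorrow.
        target += timedelta(days=1)
    return (target - now).total_seconds()


def run_daily(job, hour, minute):
    """Call job() once a day at hour:minute, forever."""
    while True:
        time.sleep(seconds_until_next(hour, minute, datetime.now()))
        job()


if __name__ == '__main__':
    # Only shows the wait time; call run_daily(your_job, 9, 0) to actually schedule.
    print(seconds_until_next(9, 0, datetime.now()))
```

A scheduled batch file survives reboots without a long-running process, which is a fair reason to prefer the approach the post suggests.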