Crawler, Part One: The Eve of Valentine's Day


Author: 莫比乌斯的小纸带 | Published: 2018-05-20 00:58

    I recently picked up a bit of Python and wanted to write a crawler for fun, and Valentine's Day happened to be coming up.


    Date: 2018.5.19
    Place: Classroom 208
    Tools: Chrome, an Alibaba Cloud server


    First, a list of what to scrape:

    (1) the weather
    (2) articles
    (3) sentences


    Scraping the weather

    # -*- coding: utf-8 -*-
    import requests
    from pyquery import PyQuery as pq

    def get_response(url):
        # Send a browser-like User-Agent header with the request
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.10 Safari/537.36'
        }
        res = requests.get(url, headers=headers)
        res.encoding = 'utf-8'  # decode the response body as UTF-8
        return res

    def get_weather(doc):
        # Today's forecast entry (date, sky condition, temperature, wind)
        wea = doc('li.sky.skyid.lv3.on').text()
        # Living-index entries: UV, weight loss, clothing, air pollution diffusion
        one = doc('.hide.show > .clearfix > li.li1').text()
        two = doc('.hide.show > .clearfix > li.li2').text()
        three = doc('.hide.show > .clearfix > li.li3').text()
        four = doc('.hide.show > .clearfix > li.li6').text()
        print(wea, '\n')
        print(one, '\n')
        print(two, '\n')
        print(three, '\n')
        print(four, '\n')

    if __name__ == '__main__':
        url = 'http://www.weather.com.cn/weather/101120206.shtml'
        res = get_response(url)
        doc = pq(res.text)  # parse the HTML into a PyQuery document
        get_weather(doc)
    
    

    The output looks like this (the text is in Chinese because it comes straight from the site):

    19日(今天)
    阵雨
    14℃
    4-5级 
    
    弱 紫外线指数
    辐射较弱,涂擦SPF12-15、PA+护肤品。 
    
    减肥指数
    风雨相伴,坚持室内运动吧。 
    
    较冷 穿衣指数
    建议着厚外套加毛衣等服装。 
    
    良 空气污染扩散指数
    气象条件有利于空气污染物扩散。
    

    Collecting article links

    # -*- coding: utf-8 -*-
    import requests
    from pyquery import PyQuery as pq

    def get_response(url):
        # Send a browser-like User-Agent header with the request
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.10 Safari/537.36'
        }
        res = requests.get(url, headers=headers)
        res.encoding = 'utf-8'
        return res

    def get_article(doc):
        items = doc('.list-base-article a').items()         # ---①
        for item in items:
            href = item.attr('href')                        # ---②
            string = 'https://www.duanwenxue.com' + href    # ---③
            with open('article.txt', 'a') as f:             # ---④
                f.write(string + '\n')
            print(string)

    if __name__ == '__main__':
        page_num = 1
        url = 'https://www.duanwenxue.com/jingdian/lizhi/list_{}.html'
        while page_num < 8:                                 # list pages 1 to 7
            res = get_response(url.format(page_num))        # ---⑤
            doc = pq(res.text)
            get_article(doc)
            page_num = page_num + 1
    
    
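    Key points:
    ①: A pyquery selection may match several nodes. The result is still a PyQuery object rather than a list; calling items() returns a generator, and iterating over it yields each node in turn.
    ②: attr() returns the value of an attribute.
    ③: string is the full URL after concatenation; it can be used later to fetch the article body.
    ④: The file is opened in append mode ('a'); otherwise each write would overwrite the previous content.
    ⑤: Paging through the site by hand shows that the trailing number in the URL is the page number. format() replaces the '{}' placeholder in the string, which is how the script moves from page to page.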
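
The URL handling described above (filling the page number in with format(), then joining the site root with each href) can be tried without touching the network. This is only a minimal sketch, not the author's code: `build_page_urls` and `absolute_url` are hypothetical helper names, the sample href is made up, and the plain string concatenation is swapped for the standard library's `urllib.parse.urljoin`, which also copes with hrefs that are already absolute.

```python
from urllib.parse import urljoin

BASE = 'https://www.duanwenxue.com'
LIST_URL = 'https://www.duanwenxue.com/jingdian/lizhi/list_{}.html'


def build_page_urls(first, last):
    """Return the list-page URLs for pages first..last (inclusive)."""
    return [LIST_URL.format(n) for n in range(first, last + 1)]


def absolute_url(href):
    """Turn a relative href into a full URL; absolute hrefs pass through."""
    return urljoin(BASE, href)


if __name__ == '__main__':
    # Pages 1 to 7, the same range as `while page_num < 8`.
    for page_url in build_page_urls(1, 7):
        print(page_url)
    # '/article/12345.html' is a made-up href used only for illustration.
    print(absolute_url('/article/12345.html'))
```

Building the URL list up front also makes it easy to check the page range before any request is sent.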

    Scraping sentences

    import requests
    from pyquery import PyQuery as pq

    def get_response(url):
        # Send a browser-like User-Agent header with the request
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.10 Safari/537.36'
        }
        res = requests.get(url, headers=headers)
        res.encoding = 'utf-8'  # same site as the article list, so decode as UTF-8 too
        return res

    def get_article(doc):
        # Each matching link holds one short sentence
        items = doc('.list-short-article p a').items()
        for item in items:
            content = item.text()
            # Write as UTF-8 so linecache can read the file back later
            with open('letters.txt', 'a', encoding='utf-8') as f:
                f.write(content + '\n')
            print(content)

    if __name__ == '__main__':
        page_num = 1
        url = 'https://www.duanwenxue.com/duanwen/lizhi/list_{}.html'
        while page_num < 8:  # list pages 1 to 7
            res = get_response(url.format(page_num))
            doc = pq(res.text)
            get_article(doc)
            page_num = page_num + 1
    
    

    Picking a random entry

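    # -*- coding: utf-8 -*-
    import linecache
    import random

    # Pick a line number in 1..98 (randrange excludes the upper bound)
    a = random.randrange(1, 99)
    print(a)
    # linecache.getline is 1-indexed and returns '' if the line does not exist.
    # Note: the scraper above writes to letters.txt, while this reads letters_text.txt.
    theline = linecache.getline('letters_text.txt', a)
    print(theline)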
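
The range passed to randrange has to match the number of lines in the file; if the number drawn is past the end, linecache.getline quietly returns an empty string. One way around that is sketched below, assuming the file is small enough to read into memory. `pick_random_line` is a hypothetical helper name, and the demo writes a throwaway file so it runs without the scraped data.

```python
import os
import random
import tempfile


def pick_random_line(path, rng=random):
    """Return one non-empty line from `path`, chosen uniformly at random."""
    with open(path, encoding='utf-8') as f:
        lines = [line.strip() for line in f if line.strip()]
    if not lines:
        raise ValueError('no non-empty lines in ' + path)
    return rng.choice(lines)


if __name__ == '__main__':
    # Demo on a temporary file; the three sentences are placeholders.
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                     encoding='utf-8') as tmp:
        tmp.write('first sentence\n\nsecond sentence\nthird sentence\n')
    try:
        print(pick_random_line(tmp.name))
    finally:
        os.remove(tmp.name)
```

Because the candidate list is built from the file itself, nothing needs updating when more sentences are scraped.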
    

    Sending QQ mail with Python

    # -*- coding: utf-8 -*-
    from email.mime.text import MIMEText
    from email.header import Header
    from smtplib import SMTP_SSL

    def send_message():
        host_server = 'smtp.qq.com'            # QQ Mail SMTP server
        sender_qq = '1394604132'               # login name (QQ number)
        pwd = '***********'                    # masked
        sender_qq_mail = '**********@qq.com'   # sender address, masked
        receiver = '**********@qq.com'         # recipient address, masked
        mail_content = "Hello World!"
        mail_title = 'The Mail from ***'
        # Connect over SSL and log in
        smtp = SMTP_SSL(host_server)
        smtp.set_debuglevel(1)                 # print the SMTP conversation
        smtp.ehlo(host_server)
        smtp.login(sender_qq, pwd)
        # Build a UTF-8 plain-text message
        msg = MIMEText(mail_content, 'plain', 'utf-8')
        msg['Subject'] = Header(mail_title, 'utf-8')
        msg['From'] = sender_qq_mail
        msg['To'] = receiver
        smtp.sendmail(sender_qq_mail, receiver, msg.as_string())
        smtp.quit()

    if __name__ == '__main__':
        send_message()
    
    The mail-sending code follows this reference: https://zhuanlan.zhihu.com/p/25565454
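
Before logging in to the SMTP server, it can help to build the message on its own and look at what will be sent. The sketch below does only that and sends nothing. It rests on a few assumptions: `build_message` is a hypothetical helper, the addresses are placeholders, and `QQ_SMTP_AUTH_CODE` is an invented environment variable name, shown as one possible way to keep the login secret out of the source file.

```python
import os
from email.header import Header
from email.mime.text import MIMEText


def build_message(sender, receiver, title, content):
    """Build a UTF-8 plain-text message with Subject/From/To headers set."""
    msg = MIMEText(content, 'plain', 'utf-8')
    msg['Subject'] = Header(title, 'utf-8')
    msg['From'] = sender
    msg['To'] = receiver
    return msg


if __name__ == '__main__':
    # QQ_SMTP_AUTH_CODE is a hypothetical variable name, not a QQ convention.
    auth_code = os.environ.get('QQ_SMTP_AUTH_CODE', '')
    msg = build_message('sender@example.com', 'receiver@example.com',
                        'The Mail from ***', 'Hello World!')
    print(msg.as_string())                    # raw message, headers included
    print('auth code set:', bool(auth_code))  # never print the secret itself
```

The returned object can be passed to smtp.sendmail(...) via msg.as_string(), exactly as in the script above.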

    Finally done. Sending it to myself first as a test~
    [Image: mail.jpg]

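    It is actually already 5.20 by now, but the weather site has not updated yet, so it still shows yesterday's weather ~ The article links and sentences were scraped and saved in advance, and one of each is picked at random.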

    The code:
    # -*- coding: utf-8 -*-
    from email.mime.text import MIMEText
    from email.header import Header
    from smtplib import SMTP_SSL
    import linecache
    import random
    import datetime
    import requests
    from pyquery import PyQuery as pq

    def get_time():
        # Time elapsed since 2017-05-20, returned as the string form of a timedelta
        start = datetime.datetime(2017, 5, 20)
        now = datetime.datetime.now()
        elapsed = now - start
        return str(elapsed)

    def send_message(content):
        host_server = 'smtp.qq.com'
        sender_qq = '1394604132'
        pwd = '***********'                    # masked
        sender_qq_mail = '**********@qq.com'   # masked
        receiver = '**********@qq.com'         # masked
        mail_content = content
        mail_title = 'The Mail from lover'
        smtp = SMTP_SSL(host_server)
        smtp.set_debuglevel(1)
        smtp.ehlo(host_server)
        smtp.login(sender_qq, pwd)
        msg = MIMEText(mail_content, 'plain', 'utf-8')
        msg['Subject'] = Header(mail_title, 'utf-8')
        msg['From'] = sender_qq_mail
        msg['To'] = receiver
        smtp.sendmail(sender_qq_mail, receiver, msg.as_string())
        smtp.quit()

    if __name__ == '__main__':
        # Weather: same page and selectors as the weather script
        res = requests.get('http://www.weather.com.cn/weather/101120206.shtml')
        res.encoding = 'utf-8'
        doc = pq(res.text)
        wea = doc('li.sky.skyid.lv3.on').text()
        one = doc('.hide.show > .clearfix > li.li1').text()
        two = doc('.hide.show > .clearfix > li.li2').text()
        three = doc('.hide.show > .clearfix > li.li3').text()
        four = doc('.hide.show > .clearfix > li.li6').text()

        # Random line numbers; each range must not exceed the line count of its file
        a = random.randrange(1, 350)
        b = random.randrange(1, 100)
        theline1 = linecache.getline('article.txt', a)       # one article link
        theline2 = linecache.getline('letters_text.txt', b)  # one sentence
        # '来自小哥哥的文章推荐:' = 'An article recommendation from your guy:'
        # '爱你的日子:' = 'Days I have loved you:'
        article = wea + '\n\n' + one + '\n\n' + two + '\n\n' + three + '\n\n' + four + '\n\n' + '来自小哥哥的文章推荐:' + '\n' + theline1 + '\n' + theline2 + '\n' + '爱你的日子:' + get_time()
        send_message(article)
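
get_time returns the string form of a timedelta, which includes hours, minutes, seconds and microseconds. If only whole days are wanted in the mail, the timedelta's `days` attribute gives them directly. A small sketch follows; `days_together` is a hypothetical name, and the fixed sample time is there only to make the run reproducible.

```python
import datetime

START = datetime.datetime(2017, 5, 20)


def days_together(now=None):
    """Whole days elapsed between START and `now` (defaults to the current time)."""
    if now is None:
        now = datetime.datetime.now()
    return (now - START).days


if __name__ == '__main__':
    sample = datetime.datetime(2018, 5, 20, 0, 53)
    print(sample - START)         # prints: 365 days, 0:53:00
    print(days_together(sample))  # prints: 365
```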
    
    

    PS:

    Place: 7#409
    Time: 2018.5.20 0:53
    So sleepy, off to bed
    YZ, watch for my email~
