Web scraping: saving Linux web tutorials as PDF

Author: 停下浮躁的心 | Published: 2017-04-22 19:00

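    Install wkhtmltopdf by downloading the installer from the official site. Running pip install wkhtmltopdf in a Windows terminal installs a Python wrapper package rather than the wkhtmltopdf executable that pdfkit calls, so the download is still needed.

    Add the directory that contains wkhtmltopdf.exe to the system Path variable, then restart the computer.

    Install pdfkit with pip install pdfkit.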
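Before running the scraper it can help to confirm that the executable is actually reachable through Path. The following is a minimal sketch, assuming Python 3; `find_wkhtmltopdf` is a hypothetical helper name and is not part of pdfkit.

```python
import shutil


def find_wkhtmltopdf(binary="wkhtmltopdf"):
    """Return the full path of the binary if it is on PATH, else None."""
    return shutil.which(binary)


if __name__ == "__main__":
    path = find_wkhtmltopdf()
    if path is None:
        print("wkhtmltopdf not found on PATH; check the Path entry")
    else:
        print("wkhtmltopdf found at", path)
```

If the lookup keeps failing, pdfkit can also be pointed at the executable directly: build a configuration with `pdfkit.configuration(wkhtmltopdf=...)` and pass it as the `configuration=` argument of `pdfkit.from_url`.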

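    The script below uses wkhtmltopdf (through pdfkit) to save the Linux tutorial pages at http://www.linuxprobe.com/ as PDF files. One known problem:

    1. Encoding: naming the saved PDF files with Chinese titles fails at names = unicode(name, encoding='utf-8') with TypeError: decoding Unicode is not supported. In Python 2 this error is raised when name is already a unicode object, which is the case for text taken from html.text in requests, so the script names the files by index instead.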
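One possible way around the naming problem, sketched here for Python 3 where every `str` is already Unicode and needs no `unicode()` call: use the chapter title directly and only replace the characters Windows forbids in file names. `safe_pdf_name` and the fallback to the chapter index are assumptions made for illustration, not part of the original script.

```python
import re

# Characters that Windows does not allow in file names.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def safe_pdf_name(title, index):
    """Build a PDF file name from a chapter title, falling back to the index."""
    cleaned = _FORBIDDEN.sub("_", title)
    # Collapse runs of whitespace (including newlines) into single spaces.
    cleaned = " ".join(cleaned.split())
    # Windows also rejects names that end with a dot or a space.
    cleaned = cleaned.rstrip(". ")
    if not cleaned:
        cleaned = str(index)
    return cleaned + ".pdf"
```

With this helper, a title such as `a/b:c?` becomes `a_b_c_.pdf`, and an empty or whitespace-only title falls back to a name like `3.pdf`.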
    # -*- coding: utf-8 -*-
    # Python 2 script: saves each tutorial chapter linked from the site menu as a PDF.

    import os
    import re
    import sys

    import pdfkit
    import requests

    # Python 2-only workaround so implicit str/unicode conversions use UTF-8.
    reload(sys)
    sys.setdefaultencoding('utf-8')

    OUT_DIR = 'E:/linux_pdf'


    class Spider:
        def get_pdf(self, url, name):
            """Render one page to OUT_DIR/<name>.pdf with wkhtmltopdf."""
            if not os.path.exists(OUT_DIR):
                os.mkdir(OUT_DIR)
            print 'url = ' + url
            # unicode(name, encoding='utf-8') raises TypeError when name is already
            # a unicode object, so the caller passes an ASCII index instead.
            pdfkit.from_url(url, os.path.join(OUT_DIR, name + '.pdf'))
            print 'saved ' + name + '.pdf'

        def get_main_page(self, url):
            """Collect chapter links from the dropdown menu and save each one."""
            header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}

            # The original built this header but never sent it.
            html = requests.get(url, headers=header)
            # The first <ul class="dropdown-menu"> is assumed to hold the chapter list.
            url_field = re.findall('<ul class="dropdown-menu">(.*?)</ul>', html.text, re.S)[0]
            url_lists = re.findall('class="menu-item.*?<a href="(.*?)">', url_field, re.S)
            name_lists = re.findall('<a href=".*?">(.*?)</a></li>', url_field, re.S)
            print url_lists
            print '-' * 100

            # Titles serve only as a sanity check; files are named 1.pdf, 2.pdf, ...
            if len(url_lists) == len(name_lists):
                for i, link in enumerate(url_lists, 1):
                    self.get_pdf(link, str(i))
            else:
                print 'crawl lists is wrong!!!'


    if __name__ == '__main__':
        url = 'http://www.linuxprobe.com/'
        spider = Spider()
        spider.get_main_page(url)
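The link extraction in `get_main_page` can be tried without touching the network. The sketch below, written for Python 3, runs the same three regular expressions against a small made-up HTML fragment. The fragment, the example.com URLs and the `extract_chapters` name are all assumptions for illustration and are not taken from the real site.

```python
import re

# Made-up fragment shaped like the menu the script expects (an assumption).
SAMPLE_HTML = """
<ul class="dropdown-menu">
<li class="menu-item"><a href="http://example.com/chapter-01.html">Chapter 1</a></li>
<li class="menu-item"><a href="http://example.com/chapter-02.html">Chapter 2</a></li>
</ul>
"""


def extract_chapters(html):
    """Return (url, title) pairs from the first dropdown menu in html."""
    fields = re.findall(r'<ul class="dropdown-menu">(.*?)</ul>', html, re.S)
    if not fields:
        # No menu found: return nothing instead of failing on fields[0].
        return []
    menu = fields[0]
    urls = re.findall(r'class="menu-item.*?<a href="(.*?)">', menu, re.S)
    names = re.findall(r'<a href=".*?">(.*?)</a></li>', menu, re.S)
    if len(urls) != len(names):
        raise ValueError("link and title counts differ")
    return list(zip(urls, names))


if __name__ == "__main__":
    for url, name in extract_chapters(SAMPLE_HTML):
        print(url, name)
```

Returning an empty list when the menu is missing is a deliberate difference from the script, where indexing `[0]` on an empty result would raise `IndexError` if the page layout changed.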
    

