Crawling Files from a Website and Batch-Parsing PDFs

Author: estate47 | Published: 2021-01-08 09:53

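We often need to download files from a website and extract their data for comparison. These files are usually PDFs that have to be converted into Excel spreadsheets. Python can handle the whole workflow, from collecting the files to extracting the data.
I. Crawl the web pages and download the PDF files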

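import itertools
import os
import time
from urllib.parse import urljoin

import requests
from lxml import etree

BASE_URL = "http://www.innocom.gov.cn/"
SAVE_DIR = "名单"  # folder that receives the downloaded PDFs
counter = itertools.count()  # running file index shared by all pages


def main(page):
    # The first page has no numeric suffix; later pages are list_2.shtml, list_3.shtml, ...
    if page == 1:
        list_url = BASE_URL + "gxjsqyrdw/gswj/list.shtml"
    else:
        list_url = BASE_URL + "gxjsqyrdw/gswj/list_" + str(page) + ".shtml"
    resp = requests.get(list_url)
    time.sleep(60)  # pause between requests to go easy on the server
    tree = etree.HTML(resp.content.decode("utf-8"))
    # Select the announcement links whose text contains "拟认定"
    hrefs = tree.xpath('/html/body/div[2]/div[1]/div[3]/ul/li/a[contains(text(), "拟认定")]/@href')
    os.makedirs(SAVE_DIR, exist_ok=True)  # create the output folder if needed
    for href in hrefs:
        # Build the absolute URL of the announcement page
        detail_url = urljoin(BASE_URL, href)
        resp = requests.get(detail_url)
        time.sleep(60)
        tree = etree.HTML(resp.content.decode("utf-8"))
        links = tree.xpath('//*[@id="content"]//@href')
        if not links:
            continue  # this announcement has no attachment link
        # The attachment link is relative to the announcement page
        pdf_url = urljoin(detail_url, links[0])
        print(pdf_url)
        # Download each PDF exactly once, under a name that is unique across pages
        r = requests.get(pdf_url, stream=True)
        with open(os.path.join(SAVE_DIR, f"{next(counter)}.pdf"), "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:  # filter out keep-alive chunks
                    f.write(chunk)


if __name__ == "__main__":
    for page in range(1, 15):
        main(page)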
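
The pagination rule and the attachment-link joining used by the crawler can be checked offline before any request is sent. The following is a minimal sketch, assuming the listing pages follow the list.shtml / list_N.shtml pattern shown above; the helper names `list_page_url` and `attachment_url` and the detail-page path in the last line are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.innocom.gov.cn/"


def list_page_url(page):
    """Return the listing URL for a 1-based page number."""
    if page == 1:
        return urljoin(BASE_URL, "gxjsqyrdw/gswj/list.shtml")
    return urljoin(BASE_URL, f"gxjsqyrdw/gswj/list_{page}.shtml")


def attachment_url(detail_url, href):
    """Resolve an attachment href relative to the page that links to it."""
    return urljoin(detail_url, href)


print(list_page_url(1))
print(list_page_url(3))
# The detail-page path below is made up purely for illustration.
print(attachment_url("http://www.innocom.gov.cn/a/b/page.shtml", "./file.pdf"))
```

Using `urljoin` instead of slicing the string by hand covers both relative links such as `./file.pdf` and root-relative links such as `/dir/file.pdf`.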

II. Convert the PDFs to Excel files
1. Parsing with the tabula module

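import tabula
import pandas as pd

# read_pdf returns a list of DataFrames, one per table it detects
tables = tabula.read_pdf("1.pdf", encoding='utf-8', pages='all')
# Stack the tables into a single DataFrame
df = pd.concat(tables, ignore_index=True)
print(df)
# Save the result as an Excel file (requires openpyxl)
df.to_excel("1.xlsx", index=False)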
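
The same tabula call can be applied to every PDF in the download folder. The following is a minimal sketch, assuming tabula-py, pandas, and openpyxl are installed and that the tables in one file can reasonably be stacked; `list_pdfs` and `pdf_to_excel` are hypothetical helper names, not part of tabula.

```python
import os


def list_pdfs(folder):
    """Return the PDF paths in `folder`, sorted by file name."""
    names = sorted(n for n in os.listdir(folder) if n.lower().endswith(".pdf"))
    return [os.path.join(folder, n) for n in names]


def pdf_to_excel(pdf_path):
    """Extract all tables from one PDF and save them next to it as .xlsx."""
    # Imported here so the path helper above works without these packages.
    import pandas as pd
    import tabula

    tables = tabula.read_pdf(pdf_path, pages="all")  # list of DataFrames
    if not tables:
        return None  # no table detected in this file
    out_path = os.path.splitext(pdf_path)[0] + ".xlsx"
    pd.concat(tables, ignore_index=True).to_excel(out_path, index=False)
    return out_path
```

Calling `pdf_to_excel(path)` for each path returned by `list_pdfs("名单")` would convert the whole folder. Tables with different columns are stacked with empty cells where a column is missing, so the output may need manual cleanup.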

2. Using Adobe Acrobat to batch-convert every PDF in a folder

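import os
import winerror
from win32com.client.dynamic import Dispatch, ERRORS_BAD_CONTEXT

# Treat E_NOTIMPL as a recoverable error so dynamic dispatch works with Acrobat
ERRORS_BAD_CONTEXT.append(winerror.E_NOTIMPL)
my_dir = r"C:\Users\sq\Desktop\名单"
# Only convert PDFs, so a rerun skips the .xlsx files already produced
file_list = [name for name in os.listdir(my_dir) if name.lower().endswith(".pdf")]
print(file_list)
for name in file_list:
    src = os.path.join(my_dir, name)
    stem = os.path.splitext(name)[0]  # file name without the .pdf suffix
    AvDoc = None
    try:
        AvDoc = Dispatch("AcroExch.AVDoc")

        if AvDoc.Open(src, ""):
            pdDoc = AvDoc.GetPDDoc()
            jsObject = pdDoc.GetJSObject()
            # Other conversion IDs can be used to export to other formats
            jsObject.SaveAs(os.path.join(my_dir, f'{stem}.xlsx'), "com.adobe.acrobat.xlsx")

    except Exception as e:
        print(str(e))

    finally:
        # Close the document (if it was created) and release the COM objects
        if AvDoc is not None:
            AvDoc.Close(True)
        jsObject = None
        pdDoc = None
        AvDoc = None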
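
Because the loop above only prints conversion errors and moves on, it can help to check afterwards which PDFs still have no Excel counterpart. The following is a minimal sketch that uses only the standard library; `missing_conversions` is a hypothetical helper, not part of the Acrobat API.

```python
import os


def missing_conversions(folder):
    """Return names of PDFs in `folder` that have no matching .xlsx file."""
    names = os.listdir(folder)
    # Base names (without suffix) of the Excel files that already exist
    done = {os.path.splitext(n)[0] for n in names if n.lower().endswith(".xlsx")}
    return sorted(
        n for n in names
        if n.lower().endswith(".pdf") and os.path.splitext(n)[0] not in done
    )
```

Running `missing_conversions(my_dir)` after the batch job lists the files that may need to be retried or converted by hand.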