S01E07. Smart Pagination and Batch File Download [Jingkelong Supermarket (京客隆超市)]

Author: 布衣夜行人 | Published: 2021-12-22 22:59

This program did not run successfully. It failed with the following error:

ConnectionError: HTTPConnectionPool(host='wwww.jkl.com.cn', port=80): Max retries exceeded with url: /newsList.aspx?TypeId=10009 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001C43C1628E0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

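My first guess was that the server port was occupied, so no new connection could be opened. That is most likely not the cause: "getaddrinfo failed" (Errno 11001) is a DNS lookup failure, meaning the hostname could not be resolved. The hostname in the error, 'wwww.jkl.com.cn', has four w's, a typo that came from the base URL used to build the section links in step 3.
I tried the following change, which did not help: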

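# Drop existing connections
# (the session ss is never used by the requests.get calls, so this has no effect)
ss=requests.session()
ss.keep_alive=False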
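
Since the failure is in name resolution (getaddrinfo), a check on the hostnames of the generated links can catch a bad base URL before any request is sent. The following is a minimal sketch, not part of the original program: build_links and off_host_links are hypothetical helper names, and the sample href is an assumed example of a relative link from the page.

```python
from urllib.parse import urljoin, urlsplit

# Start page taken from the program; relative links are resolved against it.
BASE_URL = 'http://www.jkl.com.cn/newsList.aspx?TypeId=10009'


def build_links(base_url, hrefs):
    """Resolve relative hrefs against base_url instead of concatenating strings."""
    return [urljoin(base_url, href) for href in hrefs]


def off_host_links(base_url, links):
    """Return the links whose hostname differs from the hostname of base_url."""
    base_host = urlsplit(base_url).hostname
    return [link for link in links if urlsplit(link).hostname != base_host]


if __name__ == '__main__':
    # Hypothetical relative href, as it might appear in the left-hand menu.
    links = build_links(BASE_URL, ['newsList.aspx?TypeId=10010'])
    print(links)
    # A hand-typed base with a misspelled host is reported here.
    print(off_host_links(BASE_URL, ['http://wwww.jkl.com.cn/newsList.aspx?TypeId=10010']))
```

Resolving links with urljoin reuses the host of the page that was actually fetched, so there is no second hand-typed copy of the hostname to get wrong.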

The current program is as follows:

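# -*- coding: utf-8 -*-
"""
Purpose: crawl the file-download section of the Jingkelong (京客隆) website and save
each file into a separate folder per category, keeping the site's existing order.
"""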

import requests
from lxml import etree
import re
import os

web_adress='http://www.jkl.com.cn/newsList.aspx?TypeId=10009'
My_agent={'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

# 1 - Fetch the page of the 财务资料 (financial information) section
xiangying_date=requests.get(url=web_adress,headers=My_agent).text
jiexi_date=etree.HTML(xiangying_date)
# 2 - Get the name and link of each section listed in the left-hand menu
project_name=jiexi_date.xpath('//div[@class="infoLis"]//a/text()')
project_adress=jiexi_date.xpath('//div[@class="infoLis"]//@href')
#print(project_adress)
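# Drop existing connections
# (the session ss is never used by the requests.get calls, so this has no effect)
ss=requests.session()
ss.keep_alive=False
# 3 - Pair each section name with its link as key-value pairs, to be opened one by one
# Fixed: the host was misspelled 'wwww.jkl.com.cn', which caused the getaddrinfo failure
project_adress=['http://www.jkl.com.cn/'+href for href in project_adress]
jiangzhidui=dict(zip(project_name,project_adress))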
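for project_name,project_adress in jiangzhidui.items():
    #print(project_name)
    # Replace parts of the section name that could interfere with later string handling
    project_name=project_name.replace('/','.')
    # '报表' means "report"
    project_name=project_name.replace('...','报表')
    print(project_name)
    print(project_adress)
    # 4 - Create the folder that will hold this section's files
    lujing='D:/'+project_name
    if not os.path.exists(lujing):
        os.mkdir(lujing)
    # The error first appeared after the following request was added;
    # it is the first request that uses the URLs built in step 3.
    xiangying_date=requests.get(url=project_adress,headers=My_agent).text
    jiexi_date=etree.HTML(xiangying_date)
    print(jiexi_date)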
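    # 5 - Each section has several pages; read the "尾页" (last page) link to work out how many pages the section has.
    # Still disabled: kept inside a raw string literal so it does not run.
    r'''weiye=jiexi_date.xpath('//a[text()="尾页"]//@href')
    if weiye!=[]:
        zhengze=re.search(r"(\d+)'\)",weiye[0])
        page_number=zhengze.group(1)
    else:
        page_number=1
    for page in range(1,int(page_number)+1):
        print(project_adress)
        # str() is required: concatenating an int to a str raises TypeError
        # NOTE: TypeId is still hard-coded to 10009; each section presumably needs its own value
        new_project_adress='http://www.jkl.com.cn/newsList.aspx?current='+str(page)+'&TypeId=10009'
        # Request the page URL just built, not the section URL
        xiangying_date1=requests.get(url=new_project_adress,headers=My_agent).text
        jiexi_date=etree.HTML(xiangying_date1)
        weiye=jiexi_date.xpath('//a[text()="尾页"]//@href')
        
'''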
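
The page-counting logic of step 5 can be tried out without any network access by splitting it into two small functions. This is only a sketch under assumptions: last_page_number and page_urls are hypothetical names, the sample href in the usage is a guess at what the "尾页" link looks like (the program's regex only implies that it ends with a number followed by `')`), and the URL pattern is copied from the program above rather than confirmed against the site.

```python
import re


def last_page_number(hrefs):
    """Return the page count parsed from the first last-page href, or 1 if none is found."""
    if not hrefs:
        return 1
    match = re.search(r"(\d+)'\)", hrefs[0])
    return int(match.group(1)) if match else 1


def page_urls(type_id, page_count):
    """Build one list URL per page, following the URL pattern used in the program."""
    template = 'http://www.jkl.com.cn/newsList.aspx?current={page}&TypeId={type_id}'
    return [template.format(page=page, type_id=type_id)
            for page in range(1, page_count + 1)]


if __name__ == '__main__':
    # Assumed shape of the last-page href; the real site may differ.
    sample = ["javascript:__doPostBack('AspNetPager1','5')"]
    for url in page_urls(10009, last_page_number(sample)):
        print(url)
```

Passing type_id as a parameter avoids hard-coding 10009 for every section, and falling back to 1 when the regex does not match avoids calling group() on None.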

Article link: https://www.haomeiwen.com/subject/xbucqrtx.html