Downloading the smtebooks IT book list
smtebooks is an easy-to-use site for English-language IT books. Its catalog covers Programming & IT, Business, Magazines, Ebooks, History, Medical, Art, Non-fiction, Academic, Textbooks, Cooking, SEO, Science & Math, Travel & Tourism, and more.
Task: crawl all the books listed at https://smtebooks.net/category/programming-it.
- Reference answer
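#!/usr/bin/python3
# -*- coding: utf-8 -*-
# Discussion: DingTalk free group 21745728; QQ groups 144081101, 567351477
# CreateDate: 2018-10-22
import time
import re

from selenium import webdriver
import pandas as pd


def find_address(browser):
    """Collect (href, title) pairs for the book links on each category page."""
    valids = []
    url_base = 'https://smtebooks.net/Category/programming-it?page='
    # Only pages 1 and 2 are fetched here; raise the upper bound to crawl more.
    for page in range(1, 3):
        time.sleep(2)  # pause between page loads
        browser.get(url_base + str(page))
        books = re.findall(r'<a\s+href="(/book/\S+)">(.*?)</a>', browser.page_source)
        print(books)
        if not books:
            # Stop at the first page that yields no book links.
            break
        valids.extend(books)
    return valids


browser = webdriver.Chrome()
try:
    results = find_address(browser)
finally:
    browser.quit()  # close the browser even if the crawl fails
df = pd.DataFrame(results)
df.to_csv('address.csv')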
The book list is fairly large. A copy has already been downloaded and saved as: smtebooks IT类书籍列表 2018-10-22.csv
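To see what the regular expression in the script above captures, it can be run against a small HTML string without starting a browser. The markup below is made up for illustration (the real page source may be structured differently), and `BOOK_LINK` and `sample_html` are hypothetical names. This is only a rough sketch of how `re.findall` returns one `(href, title)` tuple per match when the pattern has two groups.

```python
import re

# Same pattern as in the script above.
BOOK_LINK = re.compile(r'<a\s+href="(/book/\S+)">(.*?)</a>')

# Hypothetical markup, invented for this example; the real page may differ.
sample_html = """
<div><a href="/book/101/learning-python">Learning Python</a></div>
<div><a href="/about">About</a></div>
<div><a href="/book/102/go-in-action">Go in Action</a></div>
"""

# With two capture groups, findall returns a list of (href, title) tuples.
books = BOOK_LINK.findall(sample_html)
for href, title in books:
    print(href, '|', title)
```

Links whose `href` does not start with `/book/`, such as `/about` here, are skipped. That is why the result can be passed straight to `pd.DataFrame`.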
Downloading the smtebooks IT books
- Using the book list downloaded above, download the books from smtebooks
- Reference answer
#!/usr/bin/python3
# -*- coding: utf-8 -*-
# Discussion: DingTalk free group 21745728; QQ groups 144081101, 567351477
# CreateDate: 2018-10-20
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd


def download(url, driver):
    """Open the getfile page and click its 'Download Book' link."""
    driver.get(url)
    print(url)
    # Locate the link by its visible text, using the driver passed in.
    driver.find_element(By.LINK_TEXT, 'Download Book').click()


output = r"d:\down"
options = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2,  # 2 = block images
         "download.default_directory": output}  # save downloads here
options.add_argument(r"user-data-dir=C:\Users\andrew\AppData\Local\Google\Chrome\User Data\Default")
options.add_experimental_option("prefs", prefs)
browser = webdriver.Chrome(options=options)
browser.maximize_window()
browser.implicitly_wait(25)  # wait up to 25 s for elements to appear

df = pd.read_csv('address2.csv', index_col=0)
print(df.head())
for i in range(len(df)):
    row = df.iloc[i]
    # The first column holds the book path captured by the crawler.
    url = 'https://smtebooks.us' + row.iloc[0]
    url2 = 'https://smtebooks.net/getfile/' + row.iloc[0].split('/')[2]
    print(url)
    download(url2, browser)
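The loop above builds the download URL by taking the path segment that follows `/book/`. The sketch below isolates that step in a small helper so it can be checked without a browser. The name `getfile_url` is hypothetical, and the assumed path shape (`/book/<id>/<slug>`) is inferred from the `split('/')[2]` call, not confirmed against the site.

```python
def getfile_url(book_path, base='https://smtebooks.net/getfile/'):
    """Build the download page URL from a path such as '/book/<id>/<slug>'."""
    parts = book_path.split('/')
    # parts[0] is '' (leading slash), parts[1] is 'book',
    # parts[2] is the segment the script appends to the getfile URL.
    if len(parts) < 3 or parts[1] != 'book' or not parts[2]:
        raise ValueError('unexpected book path: %r' % book_path)
    return base + parts[2]


# Example path invented for illustration.
print(getfile_url('/book/101/learning-python'))
# prints https://smtebooks.net/getfile/101
```

Raising `ValueError` on a malformed path makes a bad row in the CSV show up immediately, instead of as an `IndexError` in the middle of the download loop.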
Note: failure cases such as the Drive quota being exceeded, oversized files, and files flagged for abuse are not handled.
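One possible way to keep a single failed book from stopping the whole run is to wrap each download in a `try`/`except` and record the failures for a later retry. The sketch below is not part of the original answer: `download_all` and `fake_download` are hypothetical names, and the stand-in download function replaces the Selenium call so the example runs on its own. It only catches exceptions that are actually raised. A quota or abuse page that loads normally would still need its own check.

```python
def download_all(paths, download_one):
    """Call download_one(path) for each path and return the paths that failed."""
    failed = []
    for path in paths:
        try:
            download_one(path)
        except Exception as exc:
            # Broad on purpose: the possible failure modes are not enumerated here.
            print('failed:', path, exc)
            failed.append(path)
    return failed


def fake_download(path):
    """Stand-in for the real download; fails for paths containing 'bad'."""
    if 'bad' in path:
        raise RuntimeError('no download link')


failed = download_all(['/book/1/ok', '/book/2/bad', '/book/3/ok'], fake_download)
print(failed)
```

In the real script, `download_one` could be a small function that builds the getfile URL and calls `download(url2, browser)`. The returned list can then be written to a CSV and retried later.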