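The goal of this small task is to type keywords on the command line and have the top few Google search results open automatically (a simple version of "I'm Feeling Lucky"). Along the way I found that the page returned by requests.get() differs from the HTML elements shown in the browser's inspector. For example, the redirect link I needed to scrape, the element with class="iUh30" (see figure), could not be found with bs4 no matter what I tried...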
Using Chrome Developer Tools to inspect the elements
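In this situation you need to set the User-Agent (UA) in your request headers, and here I use urllib.request in place of requests. In Python 3, urllib2 has been merged into urllib, so simply import urllib.request. For the "SSL: CERTIFICATE_VERIFY_FAILED" error, import ssl and pass gcontext = ssl.SSLContext() to urlopen; this works around the error by skipping certificate verification. The code is as follows: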
#! python3
# Opens several Google result pages at once
import urllib.request
import requests, sys, webbrowser
from bs4 import BeautifulSoup
import ssl

print("googling...")
if len(sys.argv) > 1:
    # Join the command-line keywords into a single query string
    kw = "+".join(sys.argv[1:])
    user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
    link = "https://www.google.com/search?q=" + kw
    print("link:", link)
    headers = {'User-Agent': user_agent}
    try:
        # res = requests.get(link)  # the HTML returned differs from what the browser gets!
        request = urllib.request.Request(url=link, headers=headers)
        gcontext = ssl.SSLContext()  # bypass the "SSL: CERTIFICATE_VERIFY_FAILED" error (no certificate verification)
        html = urllib.request.urlopen(request, context=gcontext).read()
        google_soup = BeautifulSoup(html, "html.parser")
        # Each search result shows its target URL in a <cite class="iUh30"> element
        g_blocks = google_soup.select("cite.iUh30")
        for block in g_blocks:
            target_link = block.get_text()
            # webbrowser.open(target_link)  # open directly in the browser
            print(target_link)
    except Exception as err:
        print("something wrong:", err)
else:
    print("Please type search keywords as arguments: python3 xx.py keyword")
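A note on the ssl.SSLContext() call in the script: it makes the error go away by not verifying the server certificate at all, so the connection is no longer protected against tampering. Below is a minimal sketch of a more explicit alternative. The helper name make_context is hypothetical (it is not in the script above); it keeps verification on by default and only turns it off when asked to.

```python
import ssl


def make_context(verify: bool = True) -> ssl.SSLContext:
    """Return a TLS context; `make_context` is a hypothetical helper name."""
    # create_default_context() loads the system CA certificates and enables
    # both certificate verification and hostname checking.
    context = ssl.create_default_context()
    if not verify:
        # Opt out explicitly. check_hostname has to be disabled first,
        # otherwise setting verify_mode to CERT_NONE raises ValueError.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

In the script this would be passed as urllib.request.urlopen(request, context=make_context()). On macOS with a python.org installer, CERTIFICATE_VERIFY_FAILED is often caused by a missing CA bundle, and running the "Install Certificates.command" that ships with the installer may fix it without disabling verification.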
Output:
c18pxxx:Py4e Nick$ python3 webscraping-feelinglucky.py python
googling...
link: http://www.google.com/search?q=python
https://www.python.org/
https://en.wikipedia.org/wiki/Python_(programming_language)
https://sv.wikipedia.org/wiki/Python_(programspråk)
https://www.w3schools.com/python/
https://www.codecademy.com/learn/learn-python
https://www.tutorialspoint.com/python/
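The link printed in the run above is built by joining the keywords with "+", which works for plain words but breaks for keywords containing characters such as "+", "&" or non-ASCII letters. A minimal sketch of building the same request with proper URL encoding follows; build_search_request and USER_AGENT are names assumed here for illustration, while the User-Agent string itself is the one from the script.

```python
import urllib.request
from urllib.parse import urlencode

# User-Agent string copied from the script; the constant name is an assumption.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
)


def build_search_request(keywords):
    """Build a Google search request (hypothetical helper, not in the script)."""
    # urlencode() percent-encodes special characters and turns spaces into "+".
    query = urlencode({"q": " ".join(keywords)})
    url = "https://www.google.com/search?" + query
    # Send a browser-like User-Agent so the returned HTML matches the browser's.
    return urllib.request.Request(url=url, headers={"User-Agent": USER_AGENT})


request = build_search_request(["c++", "tutorial"])
print(request.full_url)  # https://www.google.com/search?q=c%2B%2B+tutorial
```

The returned object can be passed to urllib.request.urlopen() exactly like the request in the script; for a plain keyword such as python the resulting URL is the same as before.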