requests.get(url) returns different elements from the page actually loaded in the browser

Author: realnickman | Published 2019-03-17 00:11

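The goal of this small task is to type keywords on the command line and have the top few Google search results open automatically (a simple version of "I'm Feeling Lucky"). Along the way I found that the page returned by requests.get() differs from the HTML elements shown in the browser's inspector. For example, the result link I need to scrape, the element with class="iUh30" (see figure), could not be found with bs4 no matter what I tried.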


Figure: inspecting the elements with Chrome Developer Tools
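
One likely contributor to this mismatch (an assumption, since the cause is not diagnosed here) is the User-Agent header: Python HTTP clients identify themselves as scripts by default, and a server may return different markup to them than to a desktop browser. The sketch below uses only the standard library to show the default identifier urllib would send and how a custom header is attached to a request. Nothing is sent over the network, and BROWSER_UA is simply the Chrome string used in the script in this post.

```python
import urllib.request

# Browser-like identifier (the same Chrome string used in the script in this post).
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
)

# Headers an opener adds by default when the request carries none of its own.
default_headers = dict(urllib.request.build_opener().addheaders)
print("default:", default_headers["User-agent"])  # starts with "Python-urllib/"

# A Request object with an explicit User-Agent; it is only built here, not sent.
req = urllib.request.Request(
    "https://www.google.com/search?q=python",
    headers={"User-Agent": BROWSER_UA},
)
print("custom:", req.get_header("User-agent"))
```

With requests, the corresponding step would be passing the same dictionary through the headers argument of requests.get.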

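This is where you need to set the User-Agent in your request headers; I also replaced requests with urllib.request. In Python 3, urllib2 has been merged into urllib, so you can import urllib.request directly. To get past the "SSL: CERTIFICATE_VERIFY_FAILED" error, import ssl and pass gcontext = ssl.SSLContext() to urlopen. Note that this context does not verify the server's certificate. The code is as follows: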

#! python3
# Prints (and optionally opens) the top Google search results for the given keywords.

import ssl
import sys
import urllib.parse
import urllib.request
import webbrowser

from bs4 import BeautifulSoup

print("googling...")

if len(sys.argv) > 1:
    # URL-encode the keywords so spaces and special characters survive.
    kw = urllib.parse.quote_plus(" ".join(sys.argv[1:]))
    user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
    link = "https://www.google.com/search?q=" + kw
    print("link:", link)
    headers = {'User-Agent': user_agent}
    try:
        #res = requests.get(link) # returns HTML that differs from what the browser shows!
        request = urllib.request.Request(url=link, headers=headers)
        # Unverified context: works around the "SSL: CERTIFICATE_VERIFY_FAILED" error.
        gcontext = ssl.SSLContext()
        html = urllib.request.urlopen(request, context=gcontext).read()

        google_soup = BeautifulSoup(html, "html.parser")
        g_blocks = google_soup.select("cite.iUh30")

        for block in g_blocks:
            target_link = block.get_text()
            #webbrowser.open(target_link) # uncomment to open each result directly
            print(target_link)

    except Exception as err:
        print("something wrong:", err)
else:
    print("Please type search keywords as arguments: python3 xx.py keyword")

Output:

c18pxxx:Py4e Nick$ python3 webscraping-feelinglucky.py python
googling...
link: https://www.google.com/search?q=python
https://www.python.org/
https://en.wikipedia.org/wiki/Python_(programming_language)
https://sv.wikipedia.org/wiki/Python_(programspråk)
https://www.w3schools.com/python/
https://www.codecademy.com/learn/learn-python
https://www.tutorialspoint.com/python/
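
Two steps of the script can be pulled into small helpers: building the search URL and opening only the first few results. The sketch below is one possible shape, not part of the original script. The names build_search_url and open_top_results are hypothetical, and the opener parameter exists only so the demo can record links instead of launching a browser. urlencode takes care of characters such as "+" that a plain "+".join would leave ambiguous.

```python
import urllib.parse
import webbrowser


def build_search_url(keywords):
    """Return a Google search URL with the keywords URL-encoded (hypothetical helper)."""
    query = urllib.parse.urlencode({"q": " ".join(keywords)})
    return "https://www.google.com/search?" + query


def open_top_results(links, limit=3, opener=webbrowser.open):
    """Pass at most `limit` links to `opener`; return the links that were passed."""
    opened = []
    for link in links[:limit]:
        opener(link)
        opened.append(link)
    return opened


# "+" is escaped as %2B and the space becomes "+".
demo_url = build_search_url(["c++", "tutorial"])
print(demo_url)

# Demo with a recording opener so that no browser window is launched.
recorded = []
sample_links = ["https://www.python.org/", "https://www.w3schools.com/python/"]
open_top_results(sample_links, limit=1, opener=recorded.append)
print(recorded)
```

Calling open_top_results(links) without the opener argument would hand each link to webbrowser.open, which is what the commented-out line in the script does.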

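A note on the ssl.SSLContext() line in the script: a bare context of this kind does no certificate checking by default, which is why the error disappears, but it also means the server's identity is never confirmed. The sketch below contrasts a verifying context from ssl.create_default_context() with an explicitly unverified one. It only builds the contexts and makes no connection. If the original error comes from a missing or outdated CA bundle on the machine (an assumption), repairing the certificate store and using the verifying context would be the safer route.

```python
import ssl

# Verifying context: checks the certificate chain and the hostname.
verified = ssl.create_default_context()
print(verified.verify_mode == ssl.CERT_REQUIRED)
print(verified.check_hostname)

# Explicitly unverified context: traffic is encrypted, but the server's
# identity is not checked. check_hostname must be turned off before
# verify_mode can be set to CERT_NONE.
unverified = ssl.create_default_context()
unverified.check_hostname = False
unverified.verify_mode = ssl.CERT_NONE
print(unverified.verify_mode == ssl.CERT_NONE)
```

Either context can be passed to urllib.request.urlopen through its context argument, as the script does with gcontext.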