Python: Crawling encyclopedia.thefreedic

Author: 树懒吃糖_ | Published: 2020-03-04 10:46

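    The page source shown in the browser's developer tools differs from the HTML that the crawler downloads.
    Many <p> tags that are present in the page source are missing from the crawled HTML.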

    Is the page static text?

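    The script currently retrieves about 10,000 search terms per 24 hours. Given the time constraint, I trimmed the term list and kept only 13,000 search terms.
    Reviewing the results of the first run showed that many pages had been saved without all of their <p> tags. The cause turned out to be the User-Agent request header (headers["User-Agent"]): after removing some of the entries in user_agents, the problem went away. I still do not understand why this happens.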
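    As a quick sanity check on the figures above, the arithmetic can be sketched as follows. The function names are hypothetical and not part of the original script; the estimate assumes the observed rate of 10,000 terms per 24 hours stays constant.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_term(rate_per_day):
    """Average time budget per search term at the given daily rate."""
    return SECONDS_PER_DAY / rate_per_day


def estimate_hours(total_terms, rate_per_day):
    """Hours needed to process total_terms at a constant daily rate."""
    return total_terms * 24 / rate_per_day


if __name__ == "__main__":
    print(f"{seconds_per_term(10_000):.2f} s per term")
    print(f"{estimate_hours(13_000, 10_000):.1f} h for 13,000 terms")
```

    At that rate each term takes roughly 8.64 seconds, so 13,000 terms need a little over 31 hours.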

    import random

    # Pool of User-Agent strings; one is chosen at random for each request.
    user_agents = ['Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
                   'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0',
                   'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
                   'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
                   'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',]

    user_agent = random.choice(user_agents)
    # Request headers sent with every query.
    headers = {"Host": 'encyclopedia2.thefreedictionary.com',
               "User-Agent": user_agent,
               "Connection": 'Keep-alive',
               "Referer": 'https://encyclopedia2.thefreedictionary.com',
               "Cookie": '_ga=GA1.2.458496055.1563949704; _pubcid=4A31DAFF-C3BC-4277-9E5E-64DD665D9979; c11=guid=07/24/2019 02:28|cn.bing.com%252f|07/24/2019 02:28|02/27/2020 21:12; _ga=GA1.3.458496055.1563949704; c01=track=1&brain=60&2.1=0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0&3.1=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1&6.1=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0&2.0=0,0,0,2,1,2,2,1,1,1,1,1,1,2,2,2,2,2,2&3.0=0,0,0,0,2,2,2,2,2,2,2,1,2,2,0,0,0,0,0,2,2,2&5.0=0,0,3,2,2,2,2,2,2,0,0,0,0,0,0,0,0,0,3,3,3&6.0=0,0,0,2,1,2,0,0,0,0,0,2,2,2,2,2,2,2,2,2,2,0,0,0,0; __gads=ID=753af11087b3965b:T=1582855970:S=ALNI_MaBuflXrpTwFl_eZZD39mZvbPB-IQ; _gid=GA1.3.1548116447.1583039917; _gid=GA1.2.587480635.1583289807',
               }
    
    # After removing some of the User-Agent strings, the list becomes:
    user_agents = ['Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0',
                   'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
                   ]
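    One way to narrow down which User-Agent strings trigger the missing <p> tags is to fetch the same page once per string and compare the tag counts. The following is a minimal sketch, not the original script: it uses only the standard library (urllib.request and html.parser), and the names ParagraphCounter, count_paragraphs, fetch_html and compare_user_agents are hypothetical helpers introduced for illustration. It makes no claim about why the server responds differently.

```python
import urllib.request
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Counts opening <p> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <P> is counted as well.
        if tag == "p":
            self.count += 1


def count_paragraphs(html):
    """Return the number of <p> start tags found in an HTML string."""
    parser = ParagraphCounter()
    parser.feed(html)
    parser.close()
    return parser.count


def fetch_html(url, user_agent, timeout=30):
    """Download url with the given User-Agent and return the decoded body."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def compare_user_agents(url, user_agents):
    """Map each User-Agent string to the <p> count of the page it receives."""
    return {ua: count_paragraphs(fetch_html(url, ua)) for ua in user_agents}
```

    Calling compare_user_agents(url, user_agents) with the URL of one affected article and the full five-entry list should show which entries come back with fewer <p> tags; those are the candidates to drop from the pool.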
    


          Article link: https://www.haomeiwen.com/subject/gitmlhtx.html