Learning Python Web Scraping (3)

Author: rrrwx | Published 2019-06-12 14:21

Scraping Examples
(1) Changing the user-agent
When crawling, requests sends 'python-requests/2.18.4' as the default user-agent:

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        print(r.request.headers)          # show the headers actually sent
        r.raise_for_status()              # raise an exception for non-200 responses
        r.encoding = r.apparent_encoding  # guess the encoding from the content
        return r.text[500:800]            # return a slice to keep the output short
    except requests.RequestException:
        return "wrong connection..."

if __name__ == "__main__":
    this_url = "http://www.amazon.cn/gp/product/B01M8L5Z3Y"
    print(getHTMLText(this_url))

Output with the default headers:

{'User-Agent': 'python-requests/2.18.4', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

The site may recognize from the user-agent that the request comes from a scraper rather than an ordinary user. In that case we can change the user-agent and crawl again:

        kv = {'user-agent': 'Mozilla/5.0'}
        r = requests.get(url, headers=kv)
        print(r.request.headers)
{'user-agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

followed by a slice of the returned page:

ue = {};
(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a=c.shift();)b(a[0],a[1],a[2])};b[a].

Here the user-agent is replaced with Mozilla/5.0 (or any other browser string), and the replacement is made through the headers argument of get.
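For completeness, here is the earlier function with the header change applied; this is simply the two snippets above combined, not new behavior:

import requests

def getHTMLText(url):
    kv = {'user-agent': 'Mozilla/5.0'}   # pretend to be a regular browser
    try:
        r = requests.get(url, headers=kv, timeout=30)
        print(r.request.headers)          # confirm the user-agent was replaced
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text[500:800]
    except requests.RequestException:
        return "wrong connection..."

if __name__ == "__main__":
    this_url = "http://www.amazon.cn/gp/product/B01M8L5Z3Y"
    print(getHTMLText(this_url))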

(2) Searching by keyword
Baidu's keyword interface:
http://www.baidu.com/s?wd=keyword
360's keyword interface:
http://www.so.com/s?q=keyword

        kv = {'wd': 'python'}
        r = requests.get(url, params=kv)

Note that here the keyword is substituted through params. When printing the result, also check the length of the returned data first; it can be very large and may need further filtering, as in the sketch below.
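Putting the pieces together, a minimal keyword-search sketch (the helper name baidu_search and the printed fields are my own choices, not from the original post):

import requests

def baidu_search(keyword):
    kv = {'wd': keyword}                  # Baidu's keyword parameter
    headers = {'user-agent': 'Mozilla/5.0'}
    try:
        r = requests.get("http://www.baidu.com/s", params=kv,
                         headers=headers, timeout=30)
        r.raise_for_status()
        print(r.request.url)              # final URL with the keyword encoded
        print(len(r.text))                # check the size before printing everything
        return r.text
    except requests.RequestException:
        return "wrong connection..."

if __name__ == "__main__":
    baidu_search("python")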

(3) Downloading and saving images
A web image link has the form: http://www.example.com/picture.jpg

import requests
import os

url = "http://i0.hdslb.com/bfs/article/7cdff66e4d44de434f7096fcda11f05505a8f831.jpg"
root = "./pics/"
kv = {'user-agent': 'Mozilla/5.0'}
path = root + url.split('/')[-1]   # name the file after the last URL segment
print(path)
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url, headers=kv)
        print(r.status_code)
        r.raise_for_status()
        with open(path, 'wb') as f:   # binary mode; the with block closes the file
            f.write(r.content)
        print("File saved successfully.")
    else:
        print("File already exists.")
except Exception:
    print("Failed.")

(4) IP address geolocation lookup
ip138 can be used to look up the location an IP address belongs to:

url = 'http://m.ip138.com/ip.asp?ip='+ipaddress
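A runnable sketch built on that interface (the function name, the example IP, and the [-500:] slice are assumptions of mine; ip138's page layout may change, and the lookup result often sits near the end of the page):

import requests

def lookup_ip(ipaddress):
    url = 'http://m.ip138.com/ip.asp?ip=' + ipaddress
    headers = {'user-agent': 'Mozilla/5.0'}
    try:
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text[-500:]              # the result is usually near the end of the page
    except requests.RequestException:
        return "wrong connection..."

if __name__ == "__main__":
    print(lookup_ip("8.8.8.8"))           # any IP address works here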
