A Python-based crawler script


Author: jerrylee529 | Published 2018-09-13 17:25

    For work reasons I recently took over crawler development, with Python as the implementation language. To probe the request flow of certain websites, especially ones that require login, before writing crawlers against them, I decided to build a small script as a tool for exploring the crawl flow.

    Preparation:

    You need to install the requests library (for example via pip); the re module is part of the Python standard library and needs no installation.

    Overview:

    An HTTP request is built from a few key elements: method, headers, data, cookies, and allow_redirects. In detail:

    1. method is either GET or POST;

    2. headers are the HTTP request headers, for example:

    X-Requested-With: XMLHttpRequest

    Accept: application/json, text/javascript, */*; q=0.01

    Referer: https://xxx.cn/html/login/login.html

    Accept-Language: zh-CN

    Accept-Encoding: gzip, deflate

    User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko

    Content-Length: 0

    Connection: Keep-Alive

    3. data is the POST payload, for example:

    userName=34567890

    4. cookies are the cookies sent with the request, for example:

    Cookie: CaptchaCode=abcde; rdmdmd5=3CD2F62D7935C4BFB24495821462D153; lgToken=1e364d3d891846bd9cd65f2550cd62a4

    5. allow_redirects controls whether requests is allowed to follow redirects in the HTTP response automatically.

    With the basic concepts covered, let's work through each piece:

    1. First, a function that selects the method:

    
    def get_command():
        # Prompt until the user enters a valid command
        while True:
            command = input("Please input command [g/p/q], g as get, p as post, q as quit: ")
            if command not in ["g", "p", "q"]:
                print("command could be g, p or q")
                continue
            return command
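
To exercise the validation loop above without sitting at an interactive prompt, `input` can be stubbed out (test scaffolding only; the stubbed answers below are made up):

```python
import builtins

# Same validation loop as get_command above
def get_command():
    while True:
        command = input("Please input command [g/p/q]: ")
        if command not in ["g", "p", "q"]:
            print("command could be g, p or q")
            continue
        return command

# Feed one invalid answer, then a valid one
answers = iter(["x", "g"])
builtins.input = lambda prompt="": next(answers)
result = get_command()
print(result)  # g
```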
    
    

    2. A function that sets the headers. The input format is Accept: image/png, image/svg+xml, image/*;q=0.8, */*;q=0.5|Accept-Language: zh-CN|User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko, using "|" as the separator between key:value pairs:

    
    import re

    # Default headers
    HEADERS = {
        "Accept": "text/html, application/xhtml+xml, */*",
        "Accept-Language": "zh-CN",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "Close"
    }

    def get_headers():
        # Copy the defaults so user input never mutates the global HEADERS
        headers = dict(HEADERS)
        header_text = input("Please input headers: ").strip()
        if len(header_text) > 0:
            key_value_list = header_text.split('|')
            headers = {}
            for key_value in key_value_list:
                # Split only on the first run of colons, so values such as
                # "Referer: https://..." keep their embedded colons intact
                items = re.split(r':+', key_value, maxsplit=1)
                if len(items) == 2:
                    headers[items[0]] = items[1].strip()
        headers["Connection"] = "close"
        return headers
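
As a standalone sanity check of the same "|"-separated parsing (the sample header values are just examples from earlier in the post):

```python
import re

header_text = "Accept-Language: zh-CN|Referer: https://xxx.cn/html/login/login.html"
headers = {}
for key_value in header_text.split('|'):
    # Same split as get_headers: first run of colons only
    items = re.split(r':+', key_value, maxsplit=1)
    if len(items) == 2:
        headers[items[0]] = items[1].strip()
print(headers)
# {'Accept-Language': 'zh-CN', 'Referer': 'https://xxx.cn/html/login/login.html'}
```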
    
    

    3. A function that sets the data. The format is a=b&c=d, using "&" as the separator between key=value pairs:

    
    def get_data():
        data = {}
        data_text = input("Please input data: ").strip()
        if len(data_text) > 0:
            key_value_list = data_text.split('&')
            for key_value in key_value_list:
                # Split on the first '=' only, so values may themselves contain '='
                items = key_value.split('=', 1)
                if len(items) == 2:
                    data[items[0]] = items[1]
        return data
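
For comparison, the standard library can parse the same a=b&c=d format and additionally decodes URL-encoded values (an alternative to the hand-rolled split above; the sample values are hypothetical):

```python
from urllib.parse import parse_qsl

data_text = "userName=34567890&passWord=abc%21"
data = dict(parse_qsl(data_text))
print(data)
# {'userName': '34567890', 'passWord': 'abc!'}
```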
    
    

    4. A function that sets the cookies. The format is, for example, CaptchaCode=bacd; rdmdmd5=3CD2F62D7935C4BFB24495821462D153; lgToken=1e364d3d891846bd9cd65f2550cd62a4, using ";" as the separator between key=value pairs:

    
    def get_cookies():
        cookies = {}
        cookie_text = input("Please input cookies: ").strip()
        if len(cookie_text) > 0:
            key_value_list = cookie_text.split(';')
            for key_value in key_value_list:
                # Split on the first '=' only, and strip the space after each ';'
                items = key_value.split('=', 1)
                if len(items) == 2:
                    cookies[items[0].strip()] = items[1]
        return cookies
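
The standard library also ships a cookie parser that understands the "; "-separated header format directly (an alternative sketch; the token values are taken from the example above):

```python
from http.cookies import SimpleCookie

cookie_text = "CaptchaCode=abcde; lgToken=1e364d3d891846bd9cd65f2550cd62a4"
jar = SimpleCookie(cookie_text)
cookies = {name: morsel.value for name, morsel in jar.items()}
print(cookies)
# {'CaptchaCode': 'abcde', 'lgToken': '1e364d3d891846bd9cd65f2550cd62a4'}
```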
    
    

    5. Send the request and get the response:

    
    def get_response(session, command, headers, data, cookies, allow_redirects):
        rsp = None
        commands = {"g": "GET", "p": "POST"}
        while True:
            url = input("Please input url: ")
            print("request headers: ", headers)
            try:
                if len(cookies) > 0:
                    rsp = session.request(method=commands[command], url=url, data=data,
                                          headers=headers, cookies=cookies,
                                          allow_redirects=allow_redirects)
                else:
                    rsp = session.request(method=commands[command], url=url, data=data,
                                          headers=headers, allow_redirects=allow_redirects)
                break
            except Exception as e:
                print(e)
                next_step = input("Retry [y/n]: ")
                if next_step in ["y", "Y"]:
                    continue
                break
        return rsp
    
    

    6. Run the script:

    
    from requests import Session

    def print_cookies(cookies):
        # Small helper (not shown in the original post): dump a cookie jar as name=value pairs
        for cookie in cookies:
            print("%s=%s" % (cookie.name, cookie.value))

    def run():
        print("start crawler")
        session = Session()
        session.keep_alive = False
        while True:
            command = get_command()
            if command == "q":
                break
            headers = get_headers()
            data = get_data()
            cookies = get_cookies()
            rsp = get_response(session=session, command=command, headers=headers,
                               data=data, cookies=cookies, allow_redirects=True)
            # If the response is empty, ask whether to quit
            if rsp is None:
                is_continue = input("response is none, continue [y/n]: ")
                if is_continue in ["y", "Y"]:
                    continue
                break
            print("--- request cookies:")
            print_cookies(session.cookies)
            print("--- response status code: ", rsp.status_code)
            print("--- response headers: ", rsp.headers)
            print("--- response cookies:")
            print_cookies(rsp.cookies)
            if rsp.history:
                print("--- response history:")
                for item in rsp.history:
                    print("redirect location: ", item.headers['Location'])
                    print("redirect cookies:")
                    print_cookies(item.cookies)
                    # Carry cookies set during redirects into the session
                    for cookie in item.cookies:
                        session.cookies.set(cookie.name, cookie.value)
            print("--- response html: ", rsp.text)
            # If the response is an image (e.g. a captcha), save it
            if rsp.headers.get('Content-Type') in ['image/jpeg', 'image/png']:
                # 'wb' rather than 'ab': overwrite any previously saved image
                with open('image.jpg', 'wb') as f:
                    f.write(rsp.content)
        session.close()
        print("quit crawler")

    if __name__ == "__main__":
        run()
    
    

    Sample run: (screenshot from the original post not reproduced here)


    Original title: A Python-based crawler script

    Original link: https://www.haomeiwen.com/subject/tkcggftx.html