The HTTP Protocol (1): Format and Encoding

Author: Gascognya | Published: 2020-08-30 15:42

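Whether you use AJAX or the Requests library, sending an HTTP request to a server takes very little effort. But suppose you try doing it by hand over a raw socket, like this: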

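from urllib.parse import urlparse
import socket

def sock_http(url: str, port: int = 80) -> bytes:
    """Send a bare GET request over a TCP socket and return the raw response bytes."""
    parsed = urlparse(url)
    # hostname drops any ":port" suffix that netloc would keep
    host, path = parsed.hostname, parsed.path or "/"

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
        client.connect((host, port))
        # Request line, header fields, then a blank line that ends the request
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        client.sendall(request.encode("utf-8"))  # sendall keeps writing until every byte is sent

        data = b""
        while True:
            chunk = client.recv(1024)
            if not chunk:  # an empty read means the server closed the connection
                break
            data += chunk

    return data

print(sock_http("http://www.baidu.com/"))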

Without any further processing, you get back a result like this:

b'HTTP/1.1 200 OK\r\nConnection: close\r\nSet-Cookie: _cnzz_zz=1101;domain=.baidu.com;path=/;max-age=600\r\n\r\n'

After adding .decode('utf-8'):

HTTP/1.1 200 OK
Connection: close
Set-Cookie: _cnzz_zz=1101;domain=.baidu.com;path=/;max-age=600

When we run it a second time:

HTTP/1.1 200 OK
Accept-Ranges: bytes
Cache-Control: no-cache
Content-Length: 14615
Content-Type: text/html
Date: Sun, 30 Aug 2020 06:47:21 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Pragma: no-cache
Server: BWS/1.1
Set-Cookie: BAIDUID=BC099AFFC0885DCEA56DBAFA5CB7D4A3:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=BC099AFFC0885DCEA56DBAFA5CB7D4A3; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1598770041; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BAIDUID=BC099AFFC0885DCE67646AD3C1DF151B:FG=1; max-age=31536000; expires=Mon, 30-Aug-21 06:47:21 GMT; domain=.baidu.com; path=/; version=1; comment=bd
Traceid: 1598770041071592193012343510593275444080
Vary: Accept-Encoding
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close

<!DOCTYPE html><!--STATUS OK-->
<html>
<head>
    <meta http-equiv="content-type" content="text/html;charset=utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <link rel="dns-prefetch" href="//s1.bdstatic.com"/>
    <link rel="dns-prefetch" href="//t1.baidu.com"/>
    <link rel="dns-prefetch" href="//t2.baidu.com"/>
    <link rel="dns-prefetch" href="//t3.baidu.com"/>
    <link rel="dns-prefetch" href="//t10.baidu.com"/>
    <link rel="dns-prefetch" href="//t11.baidu.com"/>
    <link rel="dns-prefetch" href="//t12.baidu.com"/>
    <link rel="dns-prefetch" href="//b1.bdstatic.com"/>
    <title>百度一下，你就知道</title>
    <link href="http://s1.bdstatic.com/r/www/cache/static/home/css/index.css" rel="stylesheet" type="text/css" />
    <!--[if lte IE 8]><style index="index" >#content{height:480px\9}#m{top:260px\9}</style><![endif]-->
    <!--[if IE 8]><style index="index" >#u1 a.mnav,#u1 a.mnav:visited{font-family:simsun}</style><![endif]-->
    <script>var hashMatch = document.location.href.match(/#+(.*wd=[^&].+)/);if (hashMatch && hashMatch[0] && hashMatch[1]) {document.location.replace("http://"+location.host+"/s?"+hashMatch[1]);}var ns_c = function(){};</script>
    <script>function h(obj){obj.style.behavior='url(#default#homepage)';var a = obj.setHomePage('//www.baidu.com/');}</script>
    <noscript><meta http-equiv="refresh" content="0; url=/baidu.html?from=noscript"/></noscript>
    <script>window._ASYNC_START=new Date().getTime();</script>
</head>
<body link="#0000cc"><div id="wrapper" style="display:none;"><div id="u"><a href="//www.baidu.com/gaoji/preferences.html"  onmousedown="return user_c({'fm':'set','tab':'setting','login':'0'})">搜索设置</a>

(remaining content omitted)

As you can see, the HTML source is returned directly.

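This looks a lot like the web crawlers we are familiar with. In fact, HTTP is built on top of TCP/IP sockets.
In other words, an HTTP exchange like the one above (sent with Connection: close) is simply a short-lived socket connection that follows an agreed message format.
It has two phases. What we send on the socket is the Request message; the server receives it and sends back the corresponding Response message, which we then read with socket.recv. It is a question followed by an answer.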
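To make the two phases concrete, here is a minimal sketch of both sides of one exchange on the loopback interface. The names serve_once and demo are made up for illustration, and the "server" is only an assumption of what the simplest possible responder might look like: it reads until the blank line that ends the request, writes one response, and hangs up.

```python
import socket
import threading

def serve_once(server: socket.socket) -> None:
    """Accept one connection, read one request, send one response, then close."""
    conn, _ = server.accept()
    with conn:
        request = b""
        # Keep reading until the blank line that ends the request headers
        while b"\r\n\r\n" not in request:
            chunk = conn.recv(1024)
            if not chunk:
                break
            request += chunk
        body = b"hello"
        head = (
            "HTTP/1.1 200 OK\r\n"
            "Content-Type: text/plain\r\n"
            f"Content-Length: {len(body)}\r\n"
            "Connection: close\r\n"
            "\r\n"
        )
        conn.sendall(head.encode("ascii") + body)

def demo() -> bytes:
    """Run one request-response round trip and return the raw response."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        worker = threading.Thread(target=serve_once, args=(server,), daemon=True)
        worker.start()

        with socket.create_connection(("127.0.0.1", port)) as client:
            # Phase 1: the Request message
            client.sendall(b"GET / HTTP/1.1\r\nHost: 127.0.0.1\r\nConnection: close\r\n\r\n")
            # Phase 2: the Response message, read until the server closes
            data = b""
            while True:
                chunk = client.recv(1024)
                if not chunk:
                    break
                data += chunk
        worker.join()
    return data
```

Calling demo() returns the raw bytes of a single response; nothing here goes beyond plain TCP sockets plus text that follows the HTTP layout.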

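Let's go back and look at the message format and encoding.

As a website, Baidu returns far too much content. This time we'll stand up a simple API endpoint of our own to test against.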

from fastapi import FastAPI
app = FastAPI()

@app.get('/')
def main():
    return {'msg': 'hello'}

if __name__ == '__main__':
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=8000)

Let's try print(sock_http("http://127.0.0.1/", 8000)) to fetch the Response:

b'HTTP/1.1 200 OK\r\ndate: Sun, 30 Aug 2020 07:14:54 GMT\r\nserver: uvicorn\r\ncontent-length: 15\r\ncontent-type: application/json\r\nconnection: close\r\n\r\n{"msg":"hello"}'

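{"msg":"hello"} is the payload, the Body. Everything before it is the status line plus what we usually call the Header, which records a good deal of information about the message. This short response reveals several details: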

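The Header and the Body are separated by \r\n\r\n, that is, two consecutive CRLF sequences.
Each field in the Header ends with a single \r\n.
The first line carries the protocol version and the status code, followed by fields such as date, server, content-length, and so on.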
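The three observations above are enough to take a raw response apart. The following is a minimal sketch; parse_response is a hypothetical helper, not part of any library, and it assumes the whole response is already in memory. It also stores headers in a plain dict, so a repeated field such as Set-Cookie would keep only its last value.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into (version, status, reason, headers, body)."""
    # Rule 1: the header section and the body are separated by \r\n\r\n
    head, _, body = raw.partition(b"\r\n\r\n")
    # Rule 2: each line of the header section ends with a single \r\n
    lines = head.decode("iso-8859-1").split("\r\n")

    # Rule 3: the first line holds the protocol version and the status code
    parts = lines[0].split(" ", 2)
    version, status = parts[0], int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""

    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, since values such as dates contain colons
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, status, reason, headers, body
```

Fed the bytes returned by the FastAPI test server, this yields status 200, a headers dict containing server and content-length, and the body b'{"msg":"hello"}'.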

Decoded, the format looks like this:

HTTP/1.1 200 OK
date: Sun, 30 Aug 2020 07:23:27 GMT
server: uvicorn
content-length: 15
content-type: application/json
connection: close

{"msg":"hello"}
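The header fields also tell the receiver how to read the body: content-length gives its size in bytes and content-type names its format. As a rough sketch, the hypothetical helper decode_json_body below checks both before decoding. Treating the JSON body as UTF-8 is an assumption that holds for this test server, not something the response states explicitly.

```python
import json

def decode_json_body(raw: bytes):
    """Check content-length and content-type, then decode the body as JSON."""
    head, _, body = raw.partition(b"\r\n\r\n")
    headers = {}
    # ISO-8859-1 maps every byte to a character, so decoding the headers cannot fail
    for line in head.decode("iso-8859-1").split("\r\n")[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    declared = int(headers["content-length"])
    if len(body) != declared:
        raise ValueError(f"expected {declared} body bytes, got {len(body)}")
    if not headers.get("content-type", "").startswith("application/json"):
        raise ValueError("body is not declared as JSON")
    # Assumption: the JSON payload is UTF-8 encoded
    return json.loads(body.decode("utf-8"))
```

For the response shown earlier, the 15 bytes of {"msg":"hello"} match content-length: 15, and the result is an ordinary Python dict.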

Characteristics of HTTP

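Connectionless: a short connection is closed as soon as the exchange finishes and is not kept around (HTTP/1.1 can reuse a connection, but here we asked for Connection: close).
Stateless: the protocol has no memory; in every exchange the two sides meet as if for the first time.
Request-response: each Request that is sent receives exactly one Response.
The client is the side that actively asks and the server is the side that passively answers; in this model the server cannot send anything to the client on its own initiative.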
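Because the protocol itself remembers nothing, any state has to travel inside the messages. One common way is for the client to send back the values it received in Set-Cookie fields. The sketch below illustrates the idea only; cookies_from_response and build_request are hypothetical helpers, and they deliberately ignore the expiry, domain, and path rules that a real client would apply.

```python
def cookies_from_response(raw: bytes) -> dict:
    """Collect the name=value pair from every Set-Cookie header field."""
    head, _, _ = raw.partition(b"\r\n\r\n")
    cookies = {}
    for line in head.decode("iso-8859-1").split("\r\n")[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "set-cookie":
            # Only the first segment is the cookie itself; the rest are attributes
            pair = value.split(";", 1)[0].strip()
            key, _, val = pair.partition("=")
            cookies[key] = val
    return cookies

def build_request(host: str, path: str, cookies: dict) -> bytes:
    """Build a GET request that carries earlier cookies back to the server."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}", "Connection: close"]
    if cookies:
        lines.append("Cookie: " + "; ".join(f"{k}={v}" for k, v in cookies.items()))
    # Every line ends with \r\n, and a blank line closes the header section
    return ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8")
```

With the first Baidu response shown earlier, cookies_from_response yields {'_cnzz_zz': '1101'}, and build_request adds a Cookie: _cnzz_zz=1101 line to the next request. The server sees that value only because the client repeated it.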
