URI
For the URI standard, see RFC 3986, "Uniform Resource Identifier (URI): Generic Syntax", January 2005
Definition
A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource
A URI is a sequence of characters from a very limited set: the letters of the basic Latin alphabet, digits, and a few special characters.
Format
A URI reference representation:
[scheme:][//[userinfo@]host[:port]][/]path[?query][#fragment]
The following is an example URI and its component parts (from Wikipedia):
                    hierarchical part
        ┌───────────────────┴─────────────────────┐
                    authority               path
        ┌───────────────┴───────────────┐┌───┴────┐
  abc://username:password@example.com:123/path/data?key=value&key2=value2#fragid1
  └┬┘   └───────┬───────┘ └────┬────┘ └┬┘           └─────────┬─────────┘ └──┬──┘
scheme  user information     host     port                  query        fragment
  urn:example:mammal:monotreme:echidna
  └┬┘ └──────────────┬───────────────┘
scheme              path
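As a rough illustration of the component breakdown above, the two example URIs can be split with Go's standard net/url package. The uriParts struct and the splitURI helper are names invented for this sketch; they are not part of RFC 3986 or of net/url.

```go
package main

import (
	"fmt"
	"net/url"
)

// uriParts is a hypothetical struct for this sketch; it holds the
// components named in the diagram above.
type uriParts struct {
	Scheme, UserInfo, Host, Port, Path, Opaque, Query, Fragment string
}

// splitURI parses raw with net/url and copies the components out.
// For URIs without "//" (such as urn:), net/url stores the part after
// the scheme in Opaque instead of Path.
func splitURI(raw string) (uriParts, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return uriParts{}, err
	}
	return uriParts{
		Scheme:   u.Scheme,
		UserInfo: u.User.String(), // empty when there is no userinfo
		Host:     u.Hostname(),
		Port:     u.Port(),
		Path:     u.Path,
		Opaque:   u.Opaque,
		Query:    u.RawQuery,
		Fragment: u.Fragment,
	}, nil
}

func main() {
	examples := []string{
		"abc://username:password@example.com:123/path/data?key=value&key2=value2#fragid1",
		"urn:example:mammal:monotreme:echidna",
	}
	for _, raw := range examples {
		parts, err := splitURI(raw)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%+v\n", parts)
	}
}
```

Note that net/url reports the urn example's remainder as Opaque, whereas the diagram labels it as the path; the difference is one of library naming only.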
Some example URIs given in the standard
The following example URIs illustrate several URI schemes and
variations in their common syntax components:
ftp://ftp.is.co.za/rfc/rfc1808.txt
http://www.ietf.org/rfc/rfc2396.txt
ldap://[2001:db8::7]/c=GB?objectClass?one
mailto:John.Doe@example.com
news:comp.infosystems.www.servers.unix
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:oasis:names:specification:docbook:dtd:xml:4.1.2
URLs are a subset of URIs
The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location").
A URL identifies a resource by giving its "location", for example ftp://ftp.is.co.za/rfc/rfc1808.txt or http://www.ietf.org/rfc/rfc2396.txt
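Percent-encoding
When a URI has to contain characters outside the set the standard allows (such as Chinese characters), or has to use reserved characters such as / and # as plain data, those characters must be encoded, that is, represented using characters from the allowed set
Percent-encoding, also known as URL encoding, is the encoding mechanism used for Uniform Resource Locators (URLs) in specific contexts; it applies equally to Uniform Resource Identifiers (URIs) in general
Each data byte is split into two halves, the high 4 bits and the low 4 bits; each half is written as one hexadecimal digit, and the pair is prefixed with %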
pct-encoded = "%" HEXDIG HEXDIG
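The byte-splitting rule above can be sketched in a few lines of Go. This is a minimal illustration that encodes every byte outside RFC 3986's unreserved set; the names percentEncode and isUnreserved are invented here, and real code would normally use url.PathEscape or url.QueryEscape from net/url, which apply context-specific rules.

```go
package main

import "fmt"

const upperHex = "0123456789ABCDEF"

// isUnreserved reports whether b belongs to RFC 3986's "unreserved" set
// (ALPHA / DIGIT / "-" / "." / "_" / "~"), which never needs encoding.
func isUnreserved(b byte) bool {
	switch {
	case 'a' <= b && b <= 'z', 'A' <= b && b <= 'Z', '0' <= b && b <= '9':
		return true
	}
	return b == '-' || b == '.' || b == '_' || b == '~'
}

// percentEncode replaces every byte outside the unreserved set with "%"
// followed by the hex digit of its high 4 bits and of its low 4 bits.
func percentEncode(s string) string {
	out := make([]byte, 0, len(s)*3)
	for i := 0; i < len(s); i++ {
		b := s[i]
		if isUnreserved(b) {
			out = append(out, b)
			continue
		}
		out = append(out, '%', upperHex[b>>4], upperHex[b&0x0F])
	}
	return string(out)
}

func main() {
	fmt.Println(percentEncode("/ #"))  // %2F%20%23
	fmt.Println(percentEncode("中文")) // %E4%B8%AD%E6%96%87 (the UTF-8 bytes of the two characters)
}
```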
HTTP/1.1
See RFC 7230
Reference: https://duoani.github.io/HTTP-RFCs.zh-cn/RFC7230.html (note that the Chinese translation there is inaccurate in places)
HTTP relies upon the Uniform Resource Identifier (URI) standard [RFC3986] to indicate the target resource (Section 5.1) and relationships between resources.
Messages are passed in a format similar to that used by Internet mail [RFC5322] and the Multipurpose Internet Mail Extensions (MIME) [RFC2045]
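HTTP uses URIs to identify the target resource and supports sending MIME-typed messages
A MIME type is declared in the form type/subtype, for example application/json or image/jpeg, and is specified by the Content-Type header field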
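To make the type/subtype structure concrete, here is a small Go sketch that parses a Content-Type value with the standard mime package. The helper name splitContentType is an assumption made for this example.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// splitContentType parses a Content-Type field value and returns the
// type, the subtype, and any parameters (such as charset).
func splitContentType(value string) (string, string, map[string]string, error) {
	mediaType, params, err := mime.ParseMediaType(value)
	if err != nil {
		return "", "", nil, err
	}
	typ, subtype, ok := strings.Cut(mediaType, "/")
	if !ok {
		return "", "", nil, fmt.Errorf("media type %q is not of the form type/subtype", mediaType)
	}
	return typ, subtype, params, nil
}

func main() {
	for _, v := range []string{"application/json; charset=utf-8", "image/jpeg"} {
		typ, subtype, params, err := splitContentType(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(typ, subtype, params)
	}
}
```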
http message format
http request format
HTTP-request = request-line
               *( header-field CRLF )   ──┐
               CRLF                       ├── MIME-like message
               [ message-body ]         ──┘
1. beginning with a request-line that includes a method, URI, and protocol version
request-line = method SP request-target SP HTTP-version CRLF
(SP is a single space character)
While parsing the request-line, the recipient splits it on SP into its three components (method, request-target, and protocol version), so none of the three may contain whitespace
2. followed by header fields containing request modifiers, client information, and representation metadata
3. an empty line to indicate the end of the header section
4. finally an optional message body containing the payload body
Items 2-4 together form a MIME-like message
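The SP-splitting rule for the request-line can be sketched as follows. This is a simplified illustration rather than a full RFC 7230 parser: the requestLine type and parseRequestLine function are hypothetical names, and the method and version tokens are not validated further.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// requestLine holds the three components of a request-line.
type requestLine struct {
	Method, Target, Version string
}

// parseRequestLine splits a request-line on single spaces. Any extra
// space (for example inside the request-target) yields more than three
// parts and is rejected, which is why no component may contain whitespace.
func parseRequestLine(line string) (requestLine, error) {
	line = strings.TrimSuffix(line, "\r\n")
	parts := strings.Split(line, " ")
	if len(parts) != 3 || parts[0] == "" || parts[1] == "" || parts[2] == "" {
		return requestLine{}, errors.New("malformed request-line: want method SP request-target SP HTTP-version")
	}
	return requestLine{Method: parts[0], Target: parts[1], Version: parts[2]}, nil
}

func main() {
	rl, err := parseRequestLine("GET /pub/WWW/TheProject.html HTTP/1.1\r\n")
	fmt.Println(rl, err)

	_, err = parseRequestLine("GET /a b HTTP/1.1\r\n") // space inside the target
	fmt.Println(err)
}
```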
request-target
The request-target is the part of the request-line that indicates the target resource
request-target is unmodified as sent by the client to a server
A request message sent to a proxy uses a different request-target form from one sent directly to the origin server
request-target = origin-form ; the most common form, widely used when making a request directly to an origin server
/ absolute-form ; starts with a scheme (e.g., http:, https:, telnet:, mailto:) and conforms to scheme-specific syntax and semantics; widely used when making a request to a proxy
/ authority-form ; only used for CONNECT requests
/ asterisk-form ; only used for a server-wide OPTIONS request
For example, when connecting directly to the server, the request-line is
GET /pub/WWW/TheProject.html HTTP/1.1
When connecting to the server through a proxy, the request-line is
GET http://www.example.org/pub/WWW/TheProject.html HTTP/1.1
See RFC 7230, Section 5.3 for details
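The server reconstructs an "effective request URI"
The effective request URI is a URI in absolute-form
As described above, the request-target has four possible forms, so the server has to reconstruct the effective request URI from it
- If the request-target is in absolute-form, the effective request URI is the request-target itself
- If the request-target is not in absolute-form, the scheme and authority still have to be determined
- Determining the scheme: if the server's configuration (or an outbound gateway) provides a fixed URI scheme, that scheme is used. Otherwise, if the request was received over a TLS-secured TCP connection, the scheme is "https"; if not, the scheme is "http"
- Determining the authority: in short, it comes from a fixed authority in the server's configuration, from the request-target itself when it is in authority-form, from the HTTP/1.1 Host header field, or from the server's own default name, in that order
Finally, the scheme, "://", the authority, and the combined path and query are concatenated into an absolute-URI, which gives the effective request URI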
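The reconstruction steps above can be sketched in Go roughly as follows. This is a simplified reading of RFC 7230 Section 5.5, not a complete implementation: the serverConfig struct and effectiveRequestURI function are hypothetical names, a fixed authority from configuration is left out, and the default authority is assumed to already include any non-default port.

```go
package main

import (
	"fmt"
	"strings"
)

// serverConfig holds hypothetical server-side settings for this sketch.
type serverConfig struct {
	FixedScheme      string // optional; empty means "derive from the connection"
	DefaultAuthority string // used when the request carries no Host value
}

// effectiveRequestURI rebuilds an absolute URI from the request-target.
func effectiveRequestURI(method, target, host string, overTLS bool, cfg serverConfig) string {
	isAuthorityForm := method == "CONNECT"
	isAsteriskForm := target == "*"
	isOriginForm := strings.HasPrefix(target, "/")

	// absolute-form: the request-target already is the effective request URI.
	if !isAuthorityForm && !isAsteriskForm && !isOriginForm {
		return target
	}

	// Scheme: fixed by configuration, else derived from the connection.
	scheme := cfg.FixedScheme
	if scheme == "" {
		scheme = "http"
		if overTLS {
			scheme = "https"
		}
	}

	// Authority: authority-form target, else Host, else the server default.
	authority := host
	if isAuthorityForm {
		authority = target
	}
	if authority == "" {
		authority = cfg.DefaultAuthority
	}

	// Combined path and query: empty for authority-form and asterisk-form.
	pathAndQuery := target
	if isAuthorityForm || isAsteriskForm {
		pathAndQuery = ""
	}
	return scheme + "://" + authority + pathAndQuery
}

func main() {
	cfg := serverConfig{DefaultAuthority: "default.example"}
	fmt.Println(effectiveRequestURI("GET", "/pub/WWW/TheProject.html", "www.example.org:8080", false, cfg))
	fmt.Println(effectiveRequestURI("OPTIONS", "*", "www.example.org", true, cfg))
}
```

The two calls in main mirror the examples given in RFC 7230 Section 5.5: an origin-form request over plain TCP and an asterisk-form request over TLS.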
http response format
HTTP-response = status-line
                *( header-field CRLF )
                CRLF
                [ message-body ]
1. beginning with a status-line that includes the protocol version, a success or error code, and textual reason phrase
status-line = HTTP-version SP status-code SP reason-phrase CRLF
The recipient should interpret the rest of the message (everything after the status-line) according to the semantics defined for the status code
2. possibly followed by header fields containing server information, resource metadata, and representation metadata
3. an empty line to indicate the end of the header section
4. finally an optional message body containing the payload body
Items 2-4 together form a MIME-like message
In terms of message structure, a response differs from an HTTP request only in the start-line
The following example illustrates a typical message exchange for a GET request on the URI
"http://www.example.com/hello.txt":
Client request:
GET /hello.txt HTTP/1.1 # -> request-line
User-Agent: curl/7.16.3 libcurl/7.16.3 OpenSSL/0.9.7l zlib/1.2.3
Host: www.example.com
Accept-Language: en, mi
Server response:
HTTP/1.1 200 OK # -> status line
Date: Mon, 27 Jul 2009 12:28:53 GMT
Server: Apache
Last-Modified: Wed, 22 Jul 2009 19:15:56 GMT
ETag: "34aa387-d-1568eb00"
Accept-Ranges: bytes
Content-Length: 51
Vary: Accept-Encoding
Content-Type: text/plain

Hello World! My payload includes a trailing CRLF.
Parsing an HTTP message
- read the start-line into a structure
- read each header field into a hash table by field name until the empty line
- and then use the parsed data to determine if a message body is expected.
- If a message body has been indicated, then it is read as a stream until an amount of octets (bytes) equal to the message body length is read or the connection is closed.
In a request, the presence of a message body is signaled by a Content-Length or Transfer-Encoding header field; in a response, it also depends on the request method and the status code
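The parsing procedure above might look like this in Go, using the standard net/textproto package for the start-line and header section. It is only a sketch under simplifying assumptions: the message type and the readMessage and parseString helpers are invented names, only Content-Length framing is handled, and Transfer-Encoding and read-until-close are ignored.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/textproto"
	"strconv"
	"strings"
)

// message is a hypothetical container for a parsed HTTP message.
type message struct {
	StartLine string
	Header    textproto.MIMEHeader // header fields keyed by canonical field name
	Body      []byte
}

// readMessage reads the start-line, then the header fields up to the
// empty line, then a body of exactly Content-Length bytes if indicated.
func readMessage(r io.Reader) (*message, error) {
	tp := textproto.NewReader(bufio.NewReader(r))
	startLine, err := tp.ReadLine()
	if err != nil {
		return nil, err
	}
	header, err := tp.ReadMIMEHeader()
	if err != nil {
		return nil, err
	}
	msg := &message{StartLine: startLine, Header: header}
	if cl := header.Get("Content-Length"); cl != "" {
		n, err := strconv.Atoi(cl)
		if err != nil || n < 0 {
			return nil, fmt.Errorf("bad Content-Length %q", cl)
		}
		msg.Body = make([]byte, n)
		if _, err := io.ReadFull(tp.R, msg.Body); err != nil {
			return nil, err
		}
	}
	return msg, nil
}

// parseString is a convenience wrapper for in-memory input.
func parseString(raw string) (*message, error) {
	return readMessage(strings.NewReader(raw))
}

func main() {
	raw := "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 5\r\n\r\nhello"
	msg, err := parseString(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(msg.StartLine, msg.Header.Get("Content-Type"), string(msg.Body))
}
```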
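How does the recipient know that an HTTP message has been completely received?
- After parsing the start-line and the header fields, if there is a Content-Length header field, its value gives the body length in bytes, and the recipient reads exactly that many more bytes
- Chunked transfer encoding is a data transfer mechanism in the Hypertext Transfer Protocol (HTTP) that allows data sent from a server to a client application (typically data that is large or whose length is not announced in advance, such as video, file documents, or dynamically generated content) to be split into multiple parts. The body ends at the zero-length last chunk. Chunked transfer encoding is only available in HTTP/1.1
A sender MUST NOT send a Content-Length header field in any message that contains a Transfer-Encoding header field
Introducing chunked transfer encoding in HTTP/1.1 brings the following benefits (from Wikipedia):
- Chunked transfer encoding lets a server keep an HTTP persistent connection open for dynamically generated content. Normally, a persistent connection requires the server to send a Content-Length header field before it starts sending the message body, but for dynamically generated content the length is not known until the content has been fully created.
- Chunked transfer encoding lets the server send header fields at the end of the message. This matters when the value of a header field cannot be known before the content has been generated, for example when the content is to be signed with a hash and the hash is carried in an HTTP header field. Without chunked transfer encoding, the server would have to buffer the content until it is complete, compute the header field values, and send them before the content.
- HTTP servers sometimes use compression (gzip or deflate) to reduce transfer time. Chunked transfer encoding can be used to delimit parts of the compressed object. In this case the chunks are not compressed individually; the whole payload is compressed, and the compressed output is sent in chunks using the scheme described here. With compression, chunked encoding makes it possible to send data while it is still being compressed, instead of finishing compression first to learn the compressed size.
A Transfer-Encoding value of chunked indicates that the data is sent in chunks; per the rule quoted above, Content-Length must not be sent in that case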
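// Go: a raw TCP server that answers every request with a chunked HTTP response
package main

import (
	"fmt"
	"log"
	"net"
)

func handleChunkedHttpResp(conn net.Conn) {
	defer conn.Close() // release the connection once the response is written
	buffer := make([]byte, 1024)
	n, err := conn.Read(buffer) // read (up to 1024 bytes of) the request
	if err != nil {
		log.Println(err) // log and drop this connection; log.Fatalln would stop the whole server
		return
	}
	fmt.Println(n, string(buffer[:n])) // print only the bytes actually read

	conn.Write([]byte("HTTP/1.1 200 OK\r\n"))            // status-line
	conn.Write([]byte("Transfer-Encoding: chunked\r\n")) // header
	conn.Write([]byte("\r\n"))                           // empty line: end of header section
	conn.Write([]byte("6\r\n"))                          // chunk size in hex; body follows in chunks
	conn.Write([]byte("hello,\r\n"))
	conn.Write([]byte("8\r\n"))
	conn.Write([]byte("chunked!\r\n"))
	conn.Write([]byte("0\r\n")) // zero-size last chunk
	conn.Write([]byte("\r\n"))  // end of the (empty) trailer section
}

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatalln(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Println(err)
			continue
		}
		go handleChunkedHttpResp(conn) // one goroutine per connection
	}
}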
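Note that although HTTP splits the data into chunks at the application layer, TCP may coalesce several small chunks into a single segment when sending; see https://www.bwangel.me/2018/11/01/http-chunked/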
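On the receiving side, a chunked body is decoded by reading a hexadecimal size line, then that many bytes of data, and repeating until the zero-size last chunk. The sketch below is a minimal illustration of that loop: decodeChunked, decodeChunkedString, and the maxChunk limit are assumptions made for this example, chunk extensions are dropped, and trailer fields are skipped rather than returned. Because it reads from a buffered stream, it does not matter how TCP happened to segment the chunks.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

const maxChunk = 1 << 20 // arbitrary safety limit chosen for this sketch

// decodeChunked reads a chunked message body from r and returns the
// concatenated chunk data.
func decodeChunked(r *bufio.Reader) ([]byte, error) {
	var body []byte
	for {
		line, err := r.ReadString('\n') // chunk-size line, e.g. "6\r\n"
		if err != nil {
			return nil, err
		}
		sizeField := strings.TrimSpace(line)
		if i := strings.IndexByte(sizeField, ';'); i >= 0 {
			sizeField = sizeField[:i] // drop any chunk extension
		}
		size, err := strconv.ParseUint(sizeField, 16, 32)
		if err != nil || size > maxChunk {
			return nil, fmt.Errorf("bad chunk size %q", sizeField)
		}
		if size == 0 {
			// Last chunk: skip optional trailer fields up to the empty line.
			for {
				t, err := r.ReadString('\n')
				if err != nil {
					return nil, err
				}
				if strings.TrimRight(t, "\r\n") == "" {
					return body, nil
				}
			}
		}
		chunk := make([]byte, size+2) // chunk data plus its trailing CRLF
		if _, err := io.ReadFull(r, chunk); err != nil {
			return nil, err
		}
		if string(chunk[size:]) != "\r\n" {
			return nil, fmt.Errorf("chunk data not terminated by CRLF")
		}
		body = append(body, chunk[:size]...)
	}
}

// decodeChunkedString is a convenience wrapper for in-memory input.
func decodeChunkedString(s string) (string, error) {
	b, err := decodeChunked(bufio.NewReader(strings.NewReader(s)))
	return string(b), err
}

func main() {
	body, err := decodeChunkedString("6\r\nhello,\r\n8\r\nchunked!\r\n0\r\n\r\n")
	fmt.Println(body, err)
}
```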
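Message routing: three common HTTP intermediaries (proxy, gateway, and tunnel)
In some cases, a single intermediary might act as an origin server, proxy, gateway, or tunnel, switching behavior based on the nature of each request.
         >             >             >             >
    UA =========== A =========== B =========== C =========== O
               <             <             <             <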
The figure above shows three intermediaries (A, B, and C) between the user agent and origin server. A request or response message that travels the whole chain will pass through four separate connections
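Because a request or response that travels the whole chain passes through four separate connections, some HTTP communication options may apply only to the connection with the nearest non-tunnel neighbor, only to the endpoints of the chain, or to all connections along the chain.
As a message travels, "upstream" and "downstream" describe directions relative to the message flow: every message flows from upstream to downstream, like water
proxy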
A "proxy" is a message-forwarding agent that is selected by the client
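The changes a proxy makes to a request can be small or large: when proxying requests for http URIs the changes are usually minor, while at the other extreme a proxy can even translate an HTTP request into a request in another application-level protocol
A complete proxied request proceeds as follows:
- The client first connects to the proxy server, using the proxy protocol that the proxy server speaks;
- The proxy server then connects to the target server as the request asks and fetches the specified resource (e.g., a file) from it
In the latter step, the proxy server may download the target server's resource into a local cache; if the resource the client wants is already in the proxy's cache, the proxy does not send a request to the target server and instead returns the cached copy directly (this is essentially what a CDN does). Some proxy protocols allow the proxy server to modify the client's original request and the target server's original response to suit the needs of the proxy protocol
How a proxy handles the Host header field:
HTTP/1.0 headers carry no Host field; HTTP/1.1 adds Host and requires clients to send it in every request
The "Host" header field in a request provides the host and port information from the target URI, enabling the origin server to distinguish among resources while servicing requests for multiple host names on a single IP address.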
Host = uri-host [ ":" port ]
When a proxy receives a request with an absolute-form of request-target, the proxy MUST ignore the received Host header field (if any) and instead replace it with the host information of the request-target. A proxy that forwards such a request MUST generate a new Host field-value based on the received request-target rather than forward the received Host field-value.
When a proxy receives a request whose request-target is in absolute-form, it must generate a new Host field-value from the received request-target instead of forwarding the original Host field-value
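The Host rule above could be sketched in Go like this, using net/url to detect an absolute-form request-target. The function name hostForForwarding is hypothetical, and a real proxy would also validate the target and handle the other request-target forms.

```go
package main

import (
	"fmt"
	"net/url"
)

// hostForForwarding returns the Host field-value a proxy should send on.
// For an absolute-form request-target, the value is taken from the
// target's authority and the received Host is ignored; otherwise the
// received Host is kept.
func hostForForwarding(requestTarget, receivedHost string) string {
	u, err := url.Parse(requestTarget)
	if err == nil && u.IsAbs() && u.Host != "" {
		return u.Host // includes ":port" when the target carries one
	}
	return receivedHost
}

func main() {
	// absolute-form: the mismatching received Host is replaced
	fmt.Println(hostForForwarding("http://www.example.org:8080/pub/WWW/TheProject.html", "other.example"))
	// origin-form: the received Host is kept
	fmt.Println(hostForForwarding("/pub/WWW/TheProject.html", "www.example.org"))
}
```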
gateway (essentially a reverse proxy)
A "gateway" (a.k.a. "reverse proxy") is an intermediary that acts as an origin server for the outbound connection but translates received requests and forwards them inbound to another server or servers.
A gateway accepts requests from outside the network and forwards them to hosts inside the network
tunnel
A "tunnel" acts as a blind relay between two connections without changing the messages. Once active, a tunnel is not considered a party to the HTTP communication
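Tunnels generally use two IP headers
HTTP can also be used as an intermediation protocol for translating communication between non-HTTP information systems. HTTP proxies and gateways can provide access to alternative information services by translating their diverse protocols into a hypertext format that clients can view and manipulate in the same way as HTTP services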
client =====(http)====== proxy =========(other protocol)===== server