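ZIP encryption and ZIP64 are out of scope for now.
Introduction
- Common ZIP-based file formats include .JAR, .WAR, .DOCX, .XLSX, .PPTX, .ODT, .ODS, and .ODP.
- Several compression methods are supported. Deflate (compression method 8) is the most widely used and the usual default. Stored (compression method 0) keeps the data as-is, with no compression.
- Each file has a CRC32 field used for integrity checking.
- Every ZIP file must contain exactly one end of central directory record.
- Each compressed file is preceded by a local file header, and each local file header has a corresponding central directory record.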
- File data MAY be followed by a "data descriptor" for the file. Data descriptors are used to facilitate ZIP file streaming.
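To see the two compression methods and the CRC32 field in practice, here is a small sketch using Python's standard `zipfile` module. The entry names and payload are made up for illustration; the archive is built in memory and then read back.

```python
import io
import zipfile
import zlib

payload = b"hello zip " * 100  # made-up sample data

# Build an archive in memory with one Stored and one Deflate entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("stored.txt", payload, compress_type=zipfile.ZIP_STORED)
    zf.writestr("deflated.txt", payload, compress_type=zipfile.ZIP_DEFLATED)

# Read it back and collect the per-entry metadata.
with zipfile.ZipFile(buf) as zf:
    entries = {info.filename: info for info in zf.infolist()}
    # testzip() re-reads every entry and returns the first name whose
    # CRC check fails, or None when all entries are intact.
    first_bad = zf.testzip()

for name, info in entries.items():
    print(name, info.compress_type, info.compress_size, info.file_size, hex(info.CRC))

# The stored CRC is the CRC-32 of the uncompressed data.
print(entries["stored.txt"].CRC == zlib.crc32(payload))
```

`compress_type` comes back as 0 for Stored and 8 for Deflate, matching the method numbers above.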
ZIP vs. ZIP64
The local file header and the central directory entry have the same layout in ZIP and ZIP64. In ZIP64, however, the 32-bit size fields are stored as 0xffffffff, and the real 64-bit values are carried in a ZIP64 extended information extra field (header ID 0x0001).
The EOCD for ZIP64, on the other hand, differs slightly from the normal ZIP version.
Encryption
TODO
general purpose bit flag
Bit 0: If set, indicates that the file is encrypted.
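A minimal sketch of testing flag bits with a mask. Bit 0 is the one described above; bits 3 (sizes and CRC deferred to a data descriptor) and 11 (UTF-8 file names) are taken from the ZIP specification rather than from these notes. The function name `decode_flags` is made up for this example.

```python
def decode_flags(flags: int) -> dict:
    """Decode a few well-known bits of the general purpose bit flag."""
    return {
        "encrypted": bool(flags & 0x0001),            # bit 0
        "has_data_descriptor": bool(flags & 0x0008),  # bit 3 (from the spec)
        "utf8_names": bool(flags & 0x0800),           # bit 11 (from the spec)
    }

# 0x0808 has bits 3 and 11 set and bit 0 clear.
print(decode_flags(0x0808))
```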
File structure
Overall layout
[local file header 1]
[encryption header 1]
[file data 1]
[data descriptor 1]
.
.
.
[local file header n]
[encryption header n]
[file data n]
[data descriptor n]
[archive decryption header]
[archive extra data record]
[central directory header 1]
.
.
.
[central directory header n]
[zip64 end of central directory record]
[zip64 end of central directory locator]
[end of central directory record]
local file header
offset | description |
---|---|
0 | Local file header signature = 0x04034b50 (read as a little-endian number) |
4 | Version needed to extract (minimum) |
6 | General purpose bit flag |
8 | Compression method |
10 | File last modification time |
12 | File last modification date |
14 | CRC-32 of uncompressed data |
18 | Compressed size (or 0xffffffff for ZIP64) |
22 | Uncompressed size (or 0xffffffff for ZIP64) |
26 | File name length (n) |
28 | Extra field length (m) |
30 | File name |
30+n | Extra field |
30+n+m | the end |
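The fixed 30-byte part of the table maps directly onto a little-endian `struct` format. The following is a sketch only, assuming a non-ZIP64 header; `parse_local_file_header` and the dictionary keys are names chosen for this example.

```python
import struct

LFH_SIGNATURE = 0x04034B50
# signature, version, flags, method, time, date, crc32,
# compressed size, uncompressed size, name length, extra length
LFH_FORMAT = "<IHHHHHIIIHH"
LFH_SIZE = struct.calcsize(LFH_FORMAT)  # 30 bytes

def parse_local_file_header(data: bytes, offset: int = 0) -> dict:
    (sig, version_needed, flags, method, mod_time, mod_date, crc32,
     comp_size, uncomp_size, name_len, extra_len) = struct.unpack_from(
        LFH_FORMAT, data, offset)
    if sig != LFH_SIGNATURE:
        raise ValueError("not a local file header")
    name_start = offset + LFH_SIZE          # offset 30
    extra_start = name_start + name_len     # offset 30+n
    data_start = extra_start + extra_len    # offset 30+n+m
    return {
        "version_needed": version_needed,
        "flags": flags,
        "method": method,
        "crc32": crc32,
        "compressed_size": comp_size,
        "uncompressed_size": uncomp_size,
        "name": data[name_start:extra_start],
        "extra": data[extra_start:data_start],
        "data_offset": data_start,
    }

# Hand-built sample: a Stored entry named a.txt (the CRC value is arbitrary).
sample = struct.pack(LFH_FORMAT, LFH_SIGNATURE, 20, 0, 0, 0, 0,
                     0x12345678, 5, 5, 5, 0) + b"a.txt" + b"hello"
header = parse_local_file_header(sample)
print(header["name"], header["data_offset"])
```

The file data starts at `data_offset` and runs for `compressed_size` bytes.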
encryption header
TODO
data descriptor
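The layout is not tabulated in these notes. According to the ZIP specification, a data descriptor is present when bit 3 of the general purpose bit flag is set, and it holds the CRC-32, compressed size and uncompressed size (4 bytes each outside ZIP64), optionally preceded by the signature 0x08074b50. A sketch under those assumptions, with `parse_data_descriptor` as a made-up name:

```python
import struct

DD_SIGNATURE = 0x08074B50

def parse_data_descriptor(data: bytes, offset: int = 0) -> dict:
    """Parse a non-ZIP64 data descriptor; the leading signature is optional."""
    (first,) = struct.unpack_from("<I", data, offset)
    # Caveat: a CRC-32 that happens to equal the signature would be
    # misread here; this sketch does not try to resolve that ambiguity.
    if first == DD_SIGNATURE:
        offset += 4
    crc32, comp_size, uncomp_size = struct.unpack_from("<III", data, offset)
    return {
        "crc32": crc32,
        "compressed_size": comp_size,
        "uncompressed_size": uncomp_size,
        "end": offset + 12,  # first byte after the descriptor
    }

with_sig = struct.pack("<IIII", DD_SIGNATURE, 0xCAFEBABE, 10, 100)
without_sig = struct.pack("<III", 0xCAFEBABE, 10, 100)
print(parse_data_descriptor(with_sig))
print(parse_data_descriptor(without_sig))
```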
central directory header
offset | description |
---|---|
0 | Central directory file header signature = 0x02014b50 |
4 | Version made by |
6 | Version needed to extract (minimum) |
8 | General purpose bit flag |
10 | Compression method |
12 | File last modification time |
14 | File last modification date |
16 | CRC-32 of uncompressed data |
20 | Compressed size (or 0xffffffff for ZIP64) |
24 | Uncompressed size (or 0xffffffff for ZIP64) |
28 | File name length (n) |
30 | Extra field length (m) |
32 | File comment length (k) |
34 | Disk number where file starts |
36 | Internal file attributes |
38 | External file attributes |
42 | Relative offset of local file header. This is the number of bytes between the start of the first disk on which the file occurs, and the start of the local file header. This allows software reading the central directory to locate the position of the file inside the ZIP file. |
46 | File name |
46+n | Extra field |
46+n+m | File comment |
46+n+m+k | the end |
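As with the local file header, the fixed 46-byte part can be unpacked in one call. This is a sketch assuming non-ZIP64 values; `parse_central_directory_header` and the dictionary keys are names chosen for this example.

```python
import struct

CDH_SIGNATURE = 0x02014B50
# signature, version made by, version needed, flags, method, time, date,
# crc32, compressed size, uncompressed size, name/extra/comment lengths,
# disk number, internal attrs, external attrs, local header offset
CDH_FORMAT = "<IHHHHHHIIIHHHHHII"
CDH_SIZE = struct.calcsize(CDH_FORMAT)  # 46 bytes

def parse_central_directory_header(data: bytes, offset: int = 0) -> dict:
    (sig, made_by, version_needed, flags, method, mod_time, mod_date,
     crc32, comp_size, uncomp_size, name_len, extra_len, comment_len,
     disk_start, internal_attrs, external_attrs, lfh_offset) = struct.unpack_from(
        CDH_FORMAT, data, offset)
    if sig != CDH_SIGNATURE:
        raise ValueError("not a central directory header")
    name_start = offset + CDH_SIZE
    extra_start = name_start + name_len
    comment_start = extra_start + extra_len
    end = comment_start + comment_len
    return {
        "flags": flags,
        "method": method,
        "crc32": crc32,
        "compressed_size": comp_size,
        "uncompressed_size": uncomp_size,
        "name": data[name_start:extra_start],
        "comment": data[comment_start:end],
        "local_header_offset": lfh_offset,
        "next_offset": end,  # where the next record starts
    }

# Hand-built sample record with a 5-byte name and a 2-byte comment.
sample = struct.pack(CDH_FORMAT, CDH_SIGNATURE, 20, 20, 0, 8, 0, 0,
                     0xDEADBEEF, 10, 100, 5, 0, 2, 0, 0, 0, 1234) + b"a.txt" + b"hi"
entry = parse_central_directory_header(sample)
print(entry["name"], entry["local_header_offset"], entry["next_offset"])
```

Calling the function again at `next_offset` walks the directory one record at a time.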
end of central directory record
EOCD
offset | description |
---|---|
0 | End of central directory signature = 0x06054b50 |
4 | Number of this disk |
6 | Disk where central directory starts |
8 | Number of central directory records on this disk |
10 | Total number of central directory records |
12 | Size of central directory (bytes) |
16 | Offset of start of central directory, relative to start of archive |
20 | Comment length (n) |
22 | Comment |
22+n | the end |
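Because the comment has variable length, the EOCD has to be searched for from the end of the file. The sketch below scans backwards for the signature and accepts a candidate only if its comment length reaches exactly to the end of the data. It assumes a non-ZIP64 archive with no trailing junk; `find_eocd` and `parse_eocd` are names made up for this example.

```python
import io
import struct
import zipfile

EOCD_MAGIC = b"PK\x05\x06"      # 0x06054b50 written little-endian
EOCD_FORMAT = "<IHHHHIIH"
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)  # 22 bytes

def find_eocd(data: bytes) -> int:
    """Return the offset of the EOCD record, searching from the end."""
    # The comment is at most 0xFFFF bytes, so the record cannot start earlier.
    lowest = max(0, len(data) - (EOCD_SIZE + 0xFFFF))
    pos = data.rfind(EOCD_MAGIC, lowest)
    while pos >= 0:
        if pos + EOCD_SIZE <= len(data):
            (comment_len,) = struct.unpack_from("<H", data, pos + 20)
            if pos + EOCD_SIZE + comment_len == len(data):
                return pos
        # The match may have been inside the comment; keep looking earlier.
        pos = data.rfind(EOCD_MAGIC, lowest, pos)
    raise ValueError("end of central directory record not found")

def parse_eocd(data: bytes, pos: int) -> dict:
    (_sig, disk, cd_disk, records_on_disk, total_records,
     cd_size, cd_offset, comment_len) = struct.unpack_from(EOCD_FORMAT, data, pos)
    return {
        "disk": disk,
        "cd_disk": cd_disk,
        "records_on_disk": records_on_disk,
        "total_records": total_records,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
        "comment": data[pos + EOCD_SIZE:pos + EOCD_SIZE + comment_len],
    }

# Demo on an archive built in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"aaa")
    zf.writestr("b.txt", b"bbb")
data = buf.getvalue()
print(parse_eocd(data, find_eocd(data)))
```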
EOCD64
offset | description |
---|---|
0 | Zip64 end of central directory record signature = 0x06064b50 |
4 | Size of the EOCD64 minus 12 (8-byte field; the signature and this field itself are not counted) |
12 | Version made by |
14 | Version needed to extract (minimum) |
16 | Number of this disk |
20 | Disk where central directory starts |
24 | Number of central directory records on this disk |
32 | Total number of central directory records |
40 | Size of central directory (bytes) |
48 | Offset of start of central directory, relative to start of archive |
56 | Zip64 extensible data sector, n bytes (up to the size of EOCD64) |
56+n | the end |
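A sketch of unpacking the fixed part of the EOCD64, following the field sizes in the ZIP specification: a 4-byte signature, an 8-byte record size, two 2-byte versions, two 4-byte disk numbers and four 8-byte counters and offsets, 56 bytes in total. `parse_eocd64` is a name made up for this example, and the sample values are arbitrary.

```python
import struct

EOCD64_SIGNATURE = 0x06064B50
EOCD64_FORMAT = "<IQHHIIQQQQ"
EOCD64_FIXED_SIZE = struct.calcsize(EOCD64_FORMAT)  # 56 bytes

def parse_eocd64(data: bytes, offset: int = 0) -> dict:
    (sig, record_size, made_by, version_needed, disk, cd_disk,
     records_on_disk, total_records, cd_size, cd_offset) = struct.unpack_from(
        EOCD64_FORMAT, data, offset)
    if sig != EOCD64_SIGNATURE:
        raise ValueError("not a ZIP64 end of central directory record")
    # record_size excludes the 4-byte signature and the 8-byte size field.
    total_len = record_size + 12
    return {
        "total_len": total_len,
        "records_on_disk": records_on_disk,
        "total_records": total_records,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
        "extensible_data": data[offset + EOCD64_FIXED_SIZE:offset + total_len],
    }

# Hand-built sample with no extensible data: size field = 56 - 12 = 44.
sample = struct.pack(EOCD64_FORMAT, EOCD64_SIGNATURE, 44, 45, 45, 0, 0,
                     3, 3, 150, 1000)
print(parse_eocd64(sample))
```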
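Steps for parsing a ZIP file
General parsing, back to front
Typically, software first locates the EOCD (which records the offset of the start of the central directory), uses it to find the central directory records (each of which records the offset of its local file header), then follows those offsets to the local file headers (the file's content comes right after its local file header), and finally decompresses each file.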
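The back-to-front walk can be sketched end to end in a few dozen lines. This is an illustration under simplifying assumptions, not a complete reader: single-disk, non-ZIP64, unencrypted archives, only the Stored and Deflate methods, and an archive comment that does not itself contain the EOCD signature. `extract_all` is a name made up for this example.

```python
import io
import struct
import zipfile
import zlib

def extract_all(data: bytes) -> dict:
    # 1. Locate the EOCD and read the central directory position.
    eocd = data.rfind(b"PK\x05\x06")
    if eocd < 0:
        raise ValueError("EOCD not found")
    total, _cd_size, cd_offset = struct.unpack_from("<HII", data, eocd + 10)

    files = {}
    pos = cd_offset
    for _ in range(total):
        # 2. Read one central directory record.
        (sig,) = struct.unpack_from("<I", data, pos)
        if sig != 0x02014B50:
            raise ValueError("bad central directory signature")
        flags, method = struct.unpack_from("<HH", data, pos + 8)
        crc32, comp_size, _uncomp_size, n, m, k = struct.unpack_from(
            "<IIIHHH", data, pos + 16)
        (lfh_offset,) = struct.unpack_from("<I", data, pos + 42)
        raw_name = data[pos + 46:pos + 46 + n]
        pos += 46 + n + m + k

        # 3. Jump to the local file header; its own name and extra lengths
        #    decide where the file data begins.
        lfh_n, lfh_m = struct.unpack_from("<HH", data, lfh_offset + 26)
        start = lfh_offset + 30 + lfh_n + lfh_m
        raw = data[start:start + comp_size]

        # 4. Decompress and verify the CRC-32.
        if method == 0:
            content = raw
        elif method == 8:
            content = zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate
        else:
            raise NotImplementedError(f"compression method {method}")
        if zlib.crc32(content) != crc32:
            raise ValueError("CRC mismatch")
        # Bit 11 marks UTF-8 names; otherwise the spec's default is CP437.
        name = raw_name.decode("utf-8" if flags & 0x0800 else "cp437")
        files[name] = content
    return files

# Demo on an archive built in memory with the standard zipfile module.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", b"alpha " * 20)
    zf.writestr("b.txt", b"beta")
files = extract_all(buf.getvalue())
print(sorted(files))
```

The sizes and CRC are taken from the central directory rather than the local header, because the local header fields may be zero when a data descriptor is used.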
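Streaming parsing, front to back
Sometimes ZIP data can only be read front to back, so it is not possible to locate the EOCD at the end first and then go back to parse the rest. In that case parsing can start directly from the local file headers. The deflate algorithm normally used for compression also supports this kind of streaming operation.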
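A sketch of the front-to-back approach. It reads local file headers in order, lets the deflate stream announce its own end (`eof` on the zlib decompress object), and pushes any over-read bytes back for the next record. Assumptions: non-ZIP64, unencrypted, Deflate or Stored only, and Stored entries must carry their size in the header. `ForwardReader` and `stream_entries` are names made up for this example.

```python
import io
import struct
import zipfile
import zlib

class ForwardReader:
    """Forward-only reader with a small push-back buffer."""
    def __init__(self, stream):
        self.stream = stream
        self.pending = b""

    def read(self, n: int) -> bytes:
        while len(self.pending) < n:
            chunk = self.stream.read(n - len(self.pending))
            if not chunk:
                break
            self.pending += chunk
        out, self.pending = self.pending[:n], self.pending[n:]
        return out

    def unread(self, data: bytes) -> None:
        self.pending = data + self.pending

def stream_entries(stream, chunk_size: int = 4096):
    reader = ForwardReader(stream)
    while True:
        header = reader.read(30)
        if len(header) < 30 or header[:4] != b"PK\x03\x04":
            return  # central directory or end of stream reached
        flags, method = struct.unpack_from("<HH", header, 6)
        crc32, comp_size, _uncomp_size, n, m = struct.unpack_from("<IIIHH", header, 14)
        name = reader.read(n)
        reader.read(m)  # skip the extra field
        if method == 8:
            inflater = zlib.decompressobj(-zlib.MAX_WBITS)
            parts = []
            while not inflater.eof:
                chunk = reader.read(chunk_size)
                if not chunk:
                    raise ValueError("truncated deflate stream")
                parts.append(inflater.decompress(chunk))
            reader.unread(inflater.unused_data)  # bytes past the stream end
            content = b"".join(parts)
        elif method == 0 and not flags & 0x0008:
            content = reader.read(comp_size)
        else:
            raise NotImplementedError("unsupported entry for this sketch")
        if flags & 0x0008:
            # Data descriptor: optional signature, then CRC and two sizes.
            # (A CRC equal to the signature bytes would be misread here.)
            word = reader.read(4)
            if word == b"PK\x07\x08":
                word = reader.read(4)
            (crc32,) = struct.unpack("<I", word)
            reader.read(8)  # compressed and uncompressed sizes, 4 bytes each
        if zlib.crc32(content) != crc32:
            raise ValueError("CRC mismatch")
        yield name, content

# Demo on an archive built in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", b"alpha " * 20)
    zf.writestr("b.txt", b"beta", compress_type=zipfile.ZIP_STORED)
for name, content in stream_entries(io.BytesIO(buf.getvalue())):
    print(name, len(content))
```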
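Decompressing deflate
Decompressing raw deflate data with Python. The deflate bytes can be cut straight out of a ZIP file.
import zlib
# Negative wbits selects a raw deflate stream (no zlib header or trailer).
inflator = zlib.decompressobj(-zlib.MAX_WBITS)
# deflate_data: the compressed bytes of one entry, sliced out of the ZIP file
x = inflator.decompress(deflate_data) + inflator.flush()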
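Decompressing raw deflate data with C/C++
#include <stdio.h>
#include <zlib.h>

/* Inflate a raw deflate stream from src into dst, then write the result to 123.txt. */
int inflate_raw(unsigned char *src, unsigned int src_len,
                unsigned char *dst, unsigned int dst_len)
{
    z_stream strm;
    strm.zalloc = Z_NULL;
    strm.zfree = Z_NULL;
    strm.opaque = Z_NULL;
    strm.avail_out = dst_len;
    strm.next_out = dst;
    strm.next_in = src;
    strm.avail_in = src_len;
    /* Negative window bits select raw deflate (no zlib header or trailer). */
    if (inflateInit2(&strm, -MAX_WBITS) != Z_OK) {
        fprintf(stderr, "init fail!\n");
        return -1;
    }
    int ret = inflate(&strm, Z_NO_FLUSH);
    printf("ret: %d\n", ret);
    switch (ret) {
    case Z_NEED_DICT:
        ret = Z_DATA_ERROR; /* and fall through */
    case Z_DATA_ERROR:
    case Z_MEM_ERROR:
        (void)inflateEnd(&strm);
        return -2;
    }
    /* avail_out is an unsigned int, so print it with %u. */
    printf("avail_out: %u\n", strm.avail_out);
    unsigned int produced = dst_len - strm.avail_out;
    (void)inflateEnd(&strm);
    FILE *fp = fopen("123.txt", "wb");
    if (fp == NULL) {
        return -3;
    }
    /* Write only the bytes actually produced, not the whole buffer. */
    fwrite(dst, sizeof(unsigned char), produced, fp);
    fclose(fp);
    return 0;
}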
Signature summary
sig | description |
---|---|
0x02014b50 | Central directory file header signature |
0x04034b50 | Local file header signature |
0x06054b50 | End of central directory signature |
0x06064b50 | Zip64 End of central directory record |
0x07064b50 | Zip64 end of central directory locator |
0x08074b50 | Optional data descriptor signature |
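Since every signature is read as a little-endian number, each one appears in the file as the bytes `PK` followed by two more bytes. A small sketch that packs the values from the table and identifies a record by its first four bytes; `signature_bytes` and `identify` are names made up for this example.

```python
import struct

SIGNATURES = {
    0x02014B50: "Central directory file header",
    0x04034B50: "Local file header",
    0x06054B50: "End of central directory",
    0x06064B50: "Zip64 end of central directory record",
    0x07064B50: "Zip64 end of central directory locator",
    0x08074B50: "Optional data descriptor",
}

def signature_bytes(sig: int) -> bytes:
    """Return the four bytes as they appear on disk (little-endian)."""
    return struct.pack("<I", sig)

def identify(record: bytes) -> str:
    """Name the record type from its first four bytes."""
    (sig,) = struct.unpack_from("<I", record, 0)
    return SIGNATURES.get(sig, "unknown")

for sig, description in SIGNATURES.items():
    print(signature_bytes(sig), description)
```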
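References
- zlib/contrib/minizip
- zlib manual
- Reference for the parsing steps
- ZIP file format description: tabular and clear; does not cover ZIP64 or encryption
- home/compression file formats/zip
- Authoritative ZIP specification
- Source of the deflate code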
- rfc1950 (zlib format)
- rfc1951 (deflate format)
- rfc1952 (gzip format)