Parsing ZIP files


Author: devilisdevil | Published: 2021-02-02 23:34

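    ZIP encryption and ZIP64 are not covered in detail for now.

    Introduction

    • Common ZIP-based file formats include .JAR, .WAR, .DOCX, .XLSX, .PPTX, .ODT, .ODS and .ODP.
    • Several compression methods are supported. Deflate (compression method 8) is the most widely used and is the default. Stored (compression method 0) keeps the data as-is, without compression.
    • Each file has a CRC-32 field used for integrity checking.
    • Every ZIP file must contain exactly one end of central directory record.
    • Each archived file is preceded by a local file header, and each local file header has a corresponding central directory record.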
    • File data MAY be followed by a "data descriptor" for the file. Data descriptors are used to facilitate ZIP file streaming.
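
    The points above can be observed with Python's standard `zipfile` module. The following is a minimal sketch: the helper names `build_sample_zip` and `describe` are made up for illustration, and the sample archive is built in memory rather than read from disk.

```python
import io
import zipfile
import zlib


def build_sample_zip() -> bytes:
    """Build a small in-memory archive: one Deflate entry, one Stored entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("deflated.txt", b"hello " * 100, compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("stored.txt", b"hello", compress_type=zipfile.ZIP_STORED)
    return buf.getvalue()


def describe(archive: bytes) -> list:
    """Return (name, compression method, CRC matches) for every entry."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for info in zf.infolist():
            data = zf.read(info)
            # info.CRC is the CRC-32 of the uncompressed data stored in the archive.
            crc_ok = (zlib.crc32(data) & 0xFFFFFFFF) == info.CRC
            rows.append((info.filename, info.compress_type, crc_ok))
    return rows


if __name__ == "__main__":
    for row in describe(build_sample_zip()):
        print(row)
```

    The compression method is reported as 8 for the Deflate entry and 0 for the Stored entry.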

    ZIP and ZIP64

    The Local file header and Central directory entry have the same format in ZIP and ZIP64. In ZIP64, however, a size field that does not fit in 32 bits is stored as 0xffffffff, and the real value is kept in a ZIP64 extended information extra field (header ID 0x0001).

    The EOCD format for ZIP64, on the other hand, differs slightly from the normal ZIP version.

    Encryption

    TODO

    general purpose bit flag

    Bit 0: If set, indicates that the file is encrypted.
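
    As a small illustration, the flag can be tested with bit masks. This is a sketch only: bit 0 comes from the text above, while bit 3 (data descriptor present) and bit 11 (UTF-8 names) are taken from the ZIP specification and are not discussed in this article. The name `decode_flags` is hypothetical.

```python
FLAG_ENCRYPTED = 1 << 0         # bit 0: the file is encrypted
FLAG_DATA_DESCRIPTOR = 1 << 3   # bit 3: CRC and sizes follow the data in a data descriptor
FLAG_UTF8 = 1 << 11             # bit 11: file name and comment are UTF-8 encoded


def decode_flags(flags: int) -> dict:
    """Decode a few bits of the 16-bit general purpose bit flag."""
    return {
        "encrypted": bool(flags & FLAG_ENCRYPTED),
        "data_descriptor": bool(flags & FLAG_DATA_DESCRIPTOR),
        "utf8": bool(flags & FLAG_UTF8),
    }


if __name__ == "__main__":
    print(decode_flags(0x0808))
```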

    File structure

    Overall layout

    [local file header 1]
    [encryption header 1]
    [file data 1]
    [data descriptor 1]
    . 
    .
    .
    [local file header n]
    [encryption header n]
    [file data n]
    [data descriptor n]
    [archive decryption header] 
    [archive extra data record] 
    [central directory header 1]
    .
    .
    .
    [central directory header n]
    [zip64 end of central directory record]
    [zip64 end of central directory locator] 
    [end of central directory record]
    

    local file header

    offset description
    0 Local file header signature = 0x04034b50 (read as a little-endian number)
    4 Version needed to extract (minimum)
    6 General purpose bit flag
    8 Compression method
    10 File last modification time
    12 File last modification date
    14 CRC-32 of uncompressed data
    18 Compressed size (or 0xffffffff for ZIP64)
    22 Uncompressed size (or 0xffffffff for ZIP64)
    26 File name length (n)
    28 Extra field length (m)
    30 File name
    30+n Extra field
    30+n+m the end
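
    The offsets above map directly onto a `struct` format string. The sketch below parses the first local file header of a sample archive created with the standard `zipfile` module. `parse_local_header` and `build_sample_zip` are illustrative names, and ZIP64, encryption and data descriptors are ignored.

```python
import io
import struct
import zipfile
import zlib

# signature, version, flags, method, time, date, crc32, csize, usize, n, m
LOCAL_HEADER = struct.Struct("<I5H3I2H")  # 30 fixed bytes, little-endian
LOCAL_SIGNATURE = 0x04034B50


def parse_local_header(data: bytes, offset: int = 0) -> dict:
    """Parse the local file header that starts at `offset`."""
    (sig, version, flags, method, mtime, mdate,
     crc, csize, usize, n, m) = LOCAL_HEADER.unpack_from(data, offset)
    if sig != LOCAL_SIGNATURE:
        raise ValueError("not a local file header")
    name_start = offset + LOCAL_HEADER.size
    return {
        "flags": flags,
        "method": method,
        "crc32": crc,
        "compressed_size": csize,
        "uncompressed_size": usize,
        "name": data[name_start:name_start + n],
        # File data begins after the name (n bytes) and the extra field (m bytes).
        "data_offset": name_start + n + m,
    }


def build_sample_zip() -> bytes:
    """One Stored entry, so the file data is readable without decompression."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("a.txt", b"abc")
    return buf.getvalue()


if __name__ == "__main__":
    archive = build_sample_zip()
    header = parse_local_header(archive)
    start = header["data_offset"]
    content = archive[start:start + header["compressed_size"]]
    print(header["name"], content, zlib.crc32(content) == header["crc32"])
```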

    encryption header

    TODO

    data descriptor

    central directory header

    offset description
    0 Central directory file header signature = 0x02014b50
    4 Version made by
    6 Version needed to extract (minimum)
    8 General purpose bit flag
    10 Compression method
    12 File last modification time
    14 File last modification date
    16 CRC-32 of uncompressed data
    20 Compressed size (or 0xffffffff for ZIP64)
    24 Uncompressed size (or 0xffffffff for ZIP64)
    28 File name length (n)
    30 Extra field length (m)
    32 File comment length (k)
    34 Disk number where file starts
    36 Internal file attributes
    38 External file attributes
    42 Relative offset of local file header. This is the number of bytes between the start of the first disk on which the file occurs, and the start of the local file header. This allows software reading the central directory to locate the position of the file inside the ZIP file.
    46 File name
    46+n Extra field
    46+n+m File comment
    46+n+m+k the end
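
    A central directory header can be read the same way. In this sketch the function names are hypothetical, the sample archive comes from the standard `zipfile` module, and the demo relies on the sample having no archive comment, so that its EOCD is exactly the last 22 bytes.

```python
import io
import struct
import zipfile

# signature, made by, needed, flags, method, time, date, crc32, csize, usize,
# n, m, k, disk, internal attrs, external attrs, local header offset
CENTRAL_HEADER = struct.Struct("<I6H3I5H2I")  # 46 fixed bytes
CENTRAL_SIGNATURE = 0x02014B50


def parse_central_header(data: bytes, offset: int) -> dict:
    """Parse one central directory header starting at `offset`."""
    (sig, made_by, needed, flags, method, mtime, mdate,
     crc, csize, usize, n, m, k, disk, internal, external,
     lfh_offset) = CENTRAL_HEADER.unpack_from(data, offset)
    if sig != CENTRAL_SIGNATURE:
        raise ValueError("not a central directory header")
    name_start = offset + CENTRAL_HEADER.size
    return {
        "name": data[name_start:name_start + n],
        "method": method,
        "crc32": crc,
        "compressed_size": csize,
        "uncompressed_size": usize,
        "local_header_offset": lfh_offset,
        # The next header follows the name, extra field and comment.
        "next_offset": name_start + n + m + k,
    }


def read_entries(data: bytes, cd_offset: int, count: int) -> list:
    """Read `count` consecutive central directory headers."""
    entries, pos = [], cd_offset
    for _ in range(count):
        entry = parse_central_header(data, pos)
        entries.append(entry)
        pos = entry["next_offset"]
    return entries


def build_sample_zip() -> bytes:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("a.txt", b"abc")
        zf.writestr("b.txt", b"xyz")
    return buf.getvalue()


if __name__ == "__main__":
    archive = build_sample_zip()
    # No archive comment in the sample: the EOCD is the last 22 bytes,
    # and the central directory offset sits at EOCD offset 16.
    (cd_offset,) = struct.unpack_from("<I", archive, len(archive) - 22 + 16)
    for entry in read_entries(archive, cd_offset, 2):
        print(entry["name"], entry["local_header_offset"])
```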

    end of central directory record

    EOCD

    offset description
    0 End of central directory signature = 0x06054b50
    4 Number of this disk
    6 Disk where central directory starts
    8 Number of central directory records on this disk
    10 Total number of central directory records
    12 Size of central directory (bytes)
    16 Offset of start of central directory, relative to start of archive
    20 Comment length (n)
    22 Comment
    22+n the end
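
    Because the comment has variable length, the EOCD is usually located by searching backwards for its signature. The sketch below assumes a single-disk, non-ZIP64 archive. `find_eocd` and `parse_eocd` are illustrative names, and the plain `rfind` could be fooled by a comment that itself contains the signature bytes.

```python
import io
import struct
import zipfile

# signature, this disk, cd start disk, records on disk, total records,
# cd size, cd offset, comment length
EOCD = struct.Struct("<I4H2IH")  # 22 fixed bytes
EOCD_MAGIC = b"PK\x05\x06"       # 0x06054b50 in little-endian byte order


def find_eocd(data: bytes) -> int:
    """Return the offset of the EOCD, searching from the end of the data."""
    # The comment length is a 16-bit field, so the EOCD starts within the
    # last 22 + 65535 bytes of the archive.
    start = max(0, len(data) - (EOCD.size + 0xFFFF))
    pos = data.rfind(EOCD_MAGIC, start)
    if pos < 0:
        raise ValueError("end of central directory record not found")
    return pos


def parse_eocd(data: bytes) -> dict:
    pos = find_eocd(data)
    (_, disk, cd_disk, on_disk, total,
     cd_size, cd_offset, comment_len) = EOCD.unpack_from(data, pos)
    comment_start = pos + EOCD.size
    return {
        "eocd_offset": pos,
        "total_entries": total,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
        "comment": data[comment_start:comment_start + comment_len],
    }


def build_sample_zip() -> bytes:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.txt", b"abc")
        zf.writestr("b.txt", b"xyz")
        zf.comment = b"demo"
    return buf.getvalue()


if __name__ == "__main__":
    print(parse_eocd(build_sample_zip()))
```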

    EOCD64

    offset description
    0 Zip64 end of central directory record signature = 0x06064b50
    4 Size of the EOCD64 minus 12 (8 bytes)
    12 Version made by
    14 Version needed to extract (minimum)
    16 Number of this disk
    20 Disk where central directory starts
    24 Number of central directory records on this disk
    32 Total number of central directory records
    40 Size of central directory (bytes)
    48 Offset of start of central directory, relative to start of archive
    56 Zip64 extensible data sector (up to the size of EOCD64)
    56+n the end

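    Steps for parsing a ZIP file

    Normal parsing: back to front

    Typically, software first locates the EOCD (which records the offset at which the central directory starts), then uses it to find the central directory (each entry records the offset of the corresponding local file header), then follows each entry to its local file header (the file's content comes right after the local file header), and finally decompresses each file.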
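
    The back-to-front walk can be sketched end to end in a few dozen lines. This is an illustration under stated assumptions, not a complete reader: it handles only single-disk, unencrypted, non-ZIP64 archives with Stored or Deflate entries, and `extract_all` and `build_sample_zip` are made-up names.

```python
import io
import struct
import zipfile
import zlib


def extract_all(data: bytes) -> dict:
    """EOCD -> central directory -> local file header -> file content."""
    eocd = data.rfind(b"PK\x05\x06")
    if eocd < 0:
        raise ValueError("EOCD not found")
    # EOCD offsets 10, 12, 16: total records, directory size, directory offset.
    total, cd_size, cd_offset = struct.unpack_from("<HII", data, eocd + 10)

    files = {}
    pos = cd_offset
    for _ in range(total):
        if data[pos:pos + 4] != b"PK\x01\x02":
            raise ValueError("bad central directory header")
        flags, method = struct.unpack_from("<HH", data, pos + 8)
        crc, csize, usize = struct.unpack_from("<III", data, pos + 16)
        n, m, k = struct.unpack_from("<HHH", data, pos + 28)
        (lfh,) = struct.unpack_from("<I", data, pos + 42)
        raw_name = data[pos + 46:pos + 46 + n]
        # Bit 11 marks UTF-8 names; otherwise the legacy encoding is CP437.
        name = raw_name.decode("utf-8" if flags & 0x0800 else "cp437")

        # The local header has its own name and extra lengths, which may
        # differ from those in the central directory.
        if data[lfh:lfh + 4] != b"PK\x03\x04":
            raise ValueError("bad local file header")
        ln, lm = struct.unpack_from("<HH", data, lfh + 26)
        start = lfh + 30 + ln + lm
        raw = data[start:start + csize]

        if method == 0:      # Stored
            content = raw
        elif method == 8:    # Deflate (raw stream, hence negative wbits)
            content = zlib.decompressobj(-zlib.MAX_WBITS).decompress(raw)
        else:
            raise NotImplementedError(f"compression method {method}")
        if zlib.crc32(content) & 0xFFFFFFFF != crc:
            raise ValueError(f"CRC mismatch for {name}")
        files[name] = content
        pos += 46 + n + m + k
    return files


def build_sample_zip() -> bytes:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.txt", b"hello " * 50, compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("b.bin", b"\x00\x01", compress_type=zipfile.ZIP_STORED)
    return buf.getvalue()


if __name__ == "__main__":
    for name, content in extract_all(build_sample_zip()).items():
        print(name, len(content))
```

    Sizes and CRC are taken from the central directory rather than the local header, since the local header values may be zero when a data descriptor is used.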

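    Streaming parsing: front to back

    Sometimes ZIP data can only be read front to back, so the EOCD at the end cannot be located first. In that case parsing can start directly from the local file headers. The deflate algorithm normally used for compression also supports this kind of streaming operation.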
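
    A front-to-back walk might look like the sketch below. It works on a bytes object for brevity instead of a real stream, and all names are hypothetical. When bit 3 of the general purpose flag is set, the sizes in the local header are zero, so the sketch lets the deflate stream signal its own end and then skips the data descriptor. Stored entries with a data descriptor, encryption and ZIP64 are not handled. The sample archive is produced by writing through an unseekable sink, which makes the standard `zipfile` module emit data descriptors.

```python
import struct
import zipfile
import zlib

LOCAL_HEADER = struct.Struct("<I5H3I2H")  # 30 fixed bytes


def iter_entries_forward(data: bytes):
    """Yield (name, content) by walking local file headers from offset 0."""
    pos = 0
    while data[pos:pos + 4] == b"PK\x03\x04":
        (_, _, flags, method, _, _,
         crc, csize, _, n, m) = LOCAL_HEADER.unpack_from(data, pos)
        name = data[pos + 30:pos + 30 + n]
        pos += 30 + n + m
        has_descriptor = bool(flags & 0x0008)

        if method == 8:
            inflater = zlib.decompressobj(-zlib.MAX_WBITS)
            if has_descriptor:
                # Feed everything that remains; bytes after the end of the
                # deflate stream are returned in unused_data.
                content = inflater.decompress(data[pos:])
                consumed = len(data) - pos - len(inflater.unused_data)
            else:
                content = inflater.decompress(data[pos:pos + csize])
                consumed = csize
        elif method == 0 and not has_descriptor:
            content = data[pos:pos + csize]
            consumed = csize
        else:
            raise NotImplementedError("unsupported entry for this sketch")
        pos += consumed

        if has_descriptor:
            if data[pos:pos + 4] == b"PK\x07\x08":  # optional signature
                pos += 4
            (crc,) = struct.unpack_from("<I", data, pos)
            pos += 12  # crc32, compressed size, uncompressed size
        if zlib.crc32(content) & 0xFFFFFFFF != crc:
            raise ValueError("CRC mismatch")
        yield name, content


class WriteOnlySink:
    """Unseekable sink used only to build the sample archive."""

    def __init__(self):
        self.chunks = []

    def write(self, b):
        self.chunks.append(bytes(b))
        return len(b)

    def flush(self):
        pass


def build_streamed_zip() -> bytes:
    sink = WriteOnlySink()
    with zipfile.ZipFile(sink, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("a.txt", b"hello " * 50)
        zf.writestr("b.txt", b"world " * 50)
    return b"".join(sink.chunks)


if __name__ == "__main__":
    for name, content in iter_entries_forward(build_streamed_zip()):
        print(name, len(content))
```

    The loop stops at the first signature that is not a local file header, which in a well-formed archive is the start of the central directory.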

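    Decompressing deflate

    Decompressing deflate data with Python. The deflate data can be cut out of a ZIP file.

    import zlib
    # deflate_data: the raw compressed bytes of one entry, cut out of the ZIP file
    # Negative wbits selects raw deflate (no zlib header or trailer)
    inflator = zlib.decompressobj(-zlib.MAX_WBITS)
    x = inflator.decompress(deflate_data)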
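
    To try this without cutting bytes out of an archive, a raw deflate stream can be produced with `zlib` itself. The round trip below is a small sketch with made-up helper names. It also shows why the negative `wbits` value matters: the default `zlib.decompress` expects a zlib header and rejects the raw stream.

```python
import zlib


def raw_deflate(data: bytes) -> bytes:
    """Compress to a raw deflate stream, as stored inside ZIP entries."""
    compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
    return compressor.compress(data) + compressor.flush()


def raw_inflate(raw: bytes) -> bytes:
    """Decompress a raw deflate stream."""
    inflater = zlib.decompressobj(-zlib.MAX_WBITS)
    return inflater.decompress(raw) + inflater.flush()


def accepted_as_zlib_stream(raw: bytes) -> bool:
    """Check whether the default zlib-wrapped decoder accepts the bytes."""
    try:
        zlib.decompress(raw)
    except zlib.error:
        return False
    return True


if __name__ == "__main__":
    original = b"hello " * 100
    raw = raw_deflate(original)
    print(raw_inflate(raw) == original)
    print(accepted_as_zlib_stream(raw))
```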
    

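    Decompressing deflate data with C/C++

    #include <stdio.h>
    #include <zlib.h>

    /* Inflate raw deflate data from src into dst, then write the result to 123.txt.
       Returns 0 on success and a negative value on failure. */
    int inflate_to_file(unsigned char *src, unsigned int src_len,
                        unsigned char *dst, unsigned int dst_len) {
        z_stream strm;
        strm.zalloc = Z_NULL;
        strm.zfree = Z_NULL;
        strm.opaque = Z_NULL;
        strm.avail_out = dst_len;
        strm.next_out = dst;
        strm.next_in = src;
        strm.avail_in = src_len;
        /* Negative window bits select raw deflate (no zlib or gzip header). */
        if (inflateInit2(&strm, -MAX_WBITS) != Z_OK) {
            fprintf(stderr, "init fail!\n");
            return -1;
        }
        int ret = inflate(&strm, Z_NO_FLUSH);
        printf("ret: %d\n", ret);
        switch (ret) {
            case Z_NEED_DICT:
                ret = Z_DATA_ERROR;     /* and fall through */
            case Z_DATA_ERROR:
            case Z_MEM_ERROR:
                (void)inflateEnd(&strm);
                return -2;
        }
        printf("avail_out: %u\n", strm.avail_out);
        /* Only the bytes that inflate() actually produced are valid output. */
        size_t out_len = dst_len - strm.avail_out;
        (void)inflateEnd(&strm);
        FILE *fp = fopen("123.txt", "wb");
        if (fp == NULL) {
            return -3;
        }
        fwrite(dst, sizeof(unsigned char), out_len, fp);
        fclose(fp);
        return 0;
    }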
    

    Signature summary

    sig description
    0x02014b50 Central directory file header signature
    0x04034b50 Local file header signature
    0x06054b50 End of central directory signature
    0x06064b50 Zip64 End of central directory record
    0x07064b50 Zip64 end of central directory locator
    0x08074b50 Optional data descriptor signature
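
    Since the signatures are read as little-endian numbers, each one appears in the file as the bytes "PK" followed by two more bytes. The sketch below shows the mapping. `signature_bytes` and `identify` are illustrative names.

```python
import struct

SIGNATURES = {
    0x02014B50: "central directory file header",
    0x04034B50: "local file header",
    0x06054B50: "end of central directory record",
    0x06064B50: "zip64 end of central directory record",
    0x07064B50: "zip64 end of central directory locator",
    0x08074B50: "data descriptor (optional signature)",
}


def signature_bytes(sig: int) -> bytes:
    """Return the four bytes as they appear in the file (little-endian)."""
    return struct.pack("<I", sig)


def identify(data: bytes, offset: int = 0) -> str:
    """Name the record whose signature starts at `offset`."""
    (sig,) = struct.unpack_from("<I", data, offset)
    return SIGNATURES.get(sig, "unknown")


if __name__ == "__main__":
    for sig, label in SIGNATURES.items():
        print(hex(sig), signature_bytes(sig), label)
```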

