Contents:

    Basic-Facts about ZIP
    ZIP APPNOTE.TXT Released by PKWARE
        http://www.pkware.com/documents/casestudies/APPNOTE.TXT
    Tar-Format:
        http://en.wikipedia.org/wiki/Tar_%28file_format%29
    ZipPseudo-Code :
    PseudoCode-libzip: From NiH: http://nih.at/libzip/README.html
    Encoding-Binary:
        http://www.javaworld.com/javatips/jw-javatip117.html
    libzip is under BSD license:
        http://www.nih.at/listarchive/libzip-discuss/msg00013.html
    Local_downloaded_locations: See below.
    Zlib data compression library.
        -Official Location: http://zlib.net
        -Local Directory:   file:/home/thava/download/zlib-1.2.5
    zlib_synopsis:
    Whos_who.
    Difference_between_zlib_and_gzip
    Tar_format

==============================================================================

Basic-Facts:

  - ZIP is for archive+compression. Tar is for archive only.
  - Info-ZIP is a collection of individuals (including the gzip author) who
    provide zip source code under a BSD-based license.
  - The ZIP format was created in 1989 and first implemented by PKWARE.
    It is an open format.
  - JAR and OpenDocument adopted ZIP-based formats, but ZIP itself is not
    defined in any RFC.

More Information:

  - The zip program (by Info-ZIP) can write the archive to stdout and can
    read input from stdin.
  - Info-Zip : http://sourceforge.net/projects/infozip/files/
               http://www.info-zip.org
  - libzip provides a better interface, but does not allow non-seekable
    output!

==============================================================================

ZIP APPNOTE.TXT Released by PKWARE :

  - The ZIP format was defined by Phil Katz of PKWARE in 1989.
  - JAR, OpenDocument, etc. have adopted formats based on the PKWARE format.
  - PKWARE says it is an open format, but is still trying to patent the
    authentication mechanism used in the ZIP format.

File Format:

  V. General Format of a .ZIP file
  --------------------------------

  Files stored in arbitrary order. Large .ZIP files can span multiple
  volumes or be split into user-defined segment sizes. All values
  are stored in little-endian byte order unless otherwise specified.
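
The little-endian rule above can be sketched in C as two small readers.
This is only an illustration: the names rd16 and rd32 are hypothetical
helpers, not part of Info-ZIP, libzip or zlib. Reading byte by byte keeps
the result independent of the host's own byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 16-bit little-endian value (hypothetical helper name). */
static uint16_t rd16(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a 32-bit little-endian value (hypothetical helper name). */
static uint32_t rd32(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

For example, the four bytes 50 4b 03 04 ("PK" 3 4) read this way give
0x04034b50, the local file header signature.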

  Overall .ZIP file format:

    [local file header 1]
    [file data 1]
    [data descriptor 1]
    .
    .
    .
    [local file header n]
    [file data n]
    [data descriptor n]
    [archive decryption header]
    [archive extra data record]
    [central directory]
    [zip64 end of central directory record]
    [zip64 end of central directory locator]
    [end of central directory record]

  A.  Local file header:

      Offset:
      00   local file header signature     4 bytes  (0x04034b50)
      04   version needed to extract       2 bytes
      06   general purpose bit flag        2 bytes
      08   compression method              2 bytes
      10   last mod file time              2 bytes
      12   last mod file date              2 bytes
      14   crc-32                          4 bytes
      18   compressed size                 4 bytes
      22   uncompressed size               4 bytes
      26   file name length                2 bytes
      28   extra field length              2 bytes
      30   file name (variable size)
      30+  extra field (variable size)

  B.  File data
      [ Available at offset: 30+len(filename)+len(extrafield) ]

      Immediately following the local header for a file
      is the compressed or stored data for the file.
      The series of [local file header][file data][data
      descriptor] repeats for each file in the .ZIP archive.

  C.  Data descriptor:
      [ Available at offset:
        30+len(filename)+len(extrafield)+compressed size ]

        crc-32                          4 bytes
        compressed size                 4 bytes
        uncompressed size               4 bytes

      This descriptor exists only if bit 3 of the general
      purpose bit flag is set (see below). It is byte aligned
      and immediately follows the last byte of compressed data.
      This descriptor is used only when it was not possible to
      seek in the output .ZIP file, e.g., when the output .ZIP file
      was standard output or a non-seekable device. For ZIP64(tm) format
      archives, the compressed and uncompressed sizes are 8 bytes each.

      Q: What if the output file is seekable ?

      Although not originally assigned a signature, the value
      0x08074b50 has commonly been adopted as a signature value
      for the data descriptor record.

      When the Central Directory Encryption method is used, the data
      descriptor record is not required, but may be used.
      If present, and bit 3 of the general purpose bit field is set to
      indicate its presence, the values in fields of the data descriptor
      record should be set to binary zeros.

  F.  Central directory structure:

      [file header 1]
      .
      .
      .
      [file header n]
      [digital signature]

      File header:

        central file header signature   4 bytes  (0x02014b50)
        version made by                 2 bytes
        version needed to extract       2 bytes
        general purpose bit flag        2 bytes
        compression method              2 bytes
        last mod file time              2 bytes
        last mod file date              2 bytes
        crc-32                          4 bytes
        compressed size                 4 bytes
        uncompressed size               4 bytes
        file name length                2 bytes
        extra field length              2 bytes
        file comment length             2 bytes
        [offset 34 bytes i.e. 0x22 from central file header beginning: ]
        disk number start               2 bytes  [ central-dir-specific ]
        internal file attributes        2 bytes  [ central-dir-specific ]
        external file attributes        4 bytes  [ central-dir-specific ]
        relative offset of local header 4 bytes  [ central-dir-specific ]
        [offset 46 bytes i.e. 0x2e from central file header beginning: ]
        file name (variable size)
        extra field (variable size)
        file comment (variable size)             [ central-dir-specific ]

      Digital signature:

        header signature                4 bytes  (0x05054b50)
        size of data                    2 bytes
        signature data (variable size)

  G.  Zip64 end of central directory record     [ Only for Zip64 ]

        zip64 end of central dir
        signature                       4 bytes  (0x06064b50)
        size of zip64 end of central
        directory record                8 bytes
        ........
        ........

  H.  Zip64 end of central directory locator    [ Only for Zip64 ]

        zip64 end of central dir locator
        signature                       4 bytes  (0x07064b50)
        number of the disk with the
        start of the zip64 end of
        central directory               4 bytes
        relative offset of the zip64
        end of central directory record 8 bytes
        total number of disks           4 bytes

  I.  End of central directory record:

        end of central dir signature    4 bytes  (0x06054b50)
        number of this disk             2 bytes
        number of the disk with the
        start of the central directory  2 bytes
        total number of entries in the
        central directory on this disk  2 bytes
        total number of entries in
        the central directory           2 bytes
        size of the central directory   4 bytes
        offset of start of central
        directory with respect to
        the starting disk number        4 bytes
        .ZIP file comment length        2 bytes
        .ZIP file comment       (variable size)

==============================================================================

  Zip can take input from stdin. The entry is then named "-", so the
  file name length is 1 (see "0100 : (for stdin - file)" in the hex dump
  notes further down).

  disk number start: (2 bytes)

      The number of the disk on which this file begins. If an
      archive is in ZIP64 format and the value in this field is
      0xFFFF, the size will be in the corresponding 4 byte zip64
      extended information extra field.

  internal file attributes: (2 bytes)

      Bits 1 and 2 are reserved for use by PKWARE.

      The lowest bit of this field indicates, if set, that
      the file is apparently an ASCII or text file. If not
      set, that the file apparently contains binary data.
      The remaining bits are unused in version 1.0.

  external file attributes: (4 bytes)

      The mapping of the external attributes is
      host-system dependent (see 'version made by'). For
      MS-DOS, the low order byte is the MS-DOS directory
      attribute byte. If input came from standard input, this
      field is set to zero.

Tar-Format:

  http://en.wikipedia.org/wiki/Tar_%28file_format%29

  The format was created in the early days of Unix and standardized by
  POSIX.1-1988 and later POSIX.1-2001.

ZipPseudo-Code

  zip3.0 code from Info-Zip:

  readzipfile:
      check if it is not stdin and can be opened for reading.

  zip.c : 4339 : filetime(z->name, ...);   platform dependent routine.
      on unix:
        Make sure you nullify the last "/" character.
        If file is "-" do fstat(fileno(stdin) ... ); or do lstat();
        set the attributes and preserve the time.
  if encryption enabled: crc_32_tab = get_crc_table();
  open zip file for writing;
  get stat info of the zipfile that we just created;
  create a temp zip file (in case of update of existing zip)
  ifdef _IOFBF, setvbuf(8K buffer);
  check if output is seekable: fseek(fd, current-pos, SEEK_SET) == 0 ?
  write contents;
  write central directory:
    For each entry:
      putcentral(z):  // write to memblock, then will write to file.
        only if the size > 4GB, set the zip64 extra field flag;
        Write:
          Signature = 0x02014b50L;            4 bytes
          Version-made-by = 798;              2 bytes
          version needed to extract           2 bytes   10
          general purpose bit flag            2 bytes   0
          compression method                  2 bytes   0
          last mod file time                  2 bytes   xx
          last mod file date                  2 bytes   xx
          crc-32                              4 bytes   0x96170874
          compressed size                     4 bytes   4 (if >4GB, write 4GB here)
          uncompressed size                   4 bytes   for zip64, =4GB;
                                                        else size>4G? 4GB : size;
          file name length                    2 bytes   10 for thava.txt (includes 0)
          extra field length                  2 bytes   24 ???
          file comment length                 2 bytes   0
          disk number start                   2 bytes   0
          internal file attributes            2 bytes   1  flag to indicate ASCII file
          external file attributes            4 bytes   filetime value. can be 0.
          relative offset of local header     4 bytes   72 first-file 72bytes+
                                                        second file 72bytes.
    putcentral oem comment:
      current_disk ;                                       0
      number_of_disk for start of central dir;             0
      total number of entries ;                            2
      size of central directory.                           160
          (if >4GB, write 4GB here)   4 bytes
      offset of start of central dir from starting
          disk number;                                     144
      write comment if any;
    write out the whole of central dir from memory buf;
  rename temp file as original zip file.
  setfileattr: chmod the file to reasonable mode;

  struct zlist {
      contains local header info:
        flg, disknumber, offset, extra field, filename, unicode name,
        size, that's it.
  }

Examine an example zipped file (without compression) :
  one.txt and two.txt: (small files)

  Hex dump of zip file:

  Start: Local Header:

  Offset 0: (local signature)
      50 4b 03 04 : i.e. PK 3 4 : Local File Header Signature (4 bytes)

  Offset 4: (version_needed)
      0a 00 - Version needed to extract. 10 => i.e. 1.0 ZIP format.
      14 00 - Version needed to extract. 20 => i.e. 2.0 ZIP format.
      From version 2.0 i.e. 20 (0x14), files can be Deflate compressed.

  Offset 6: (general purpose bit flag - 2 bytes)
      00 00 -
      Note: The meaning of some of these 16 bits depends on the
            compression method in the next 2 bytes (deflate, LZMA, etc).
            For the Deflate method, bit1-bit2 are:
              00 - normal compression.
              11 - superfast compression, etc.
      Note: Useful for streaming!
            bit3: If this bit is set, the fields crc-32, compressed
            size and uncompressed size are set to zero in the
            local header. The correct values are put in the
            data descriptor immediately following the compressed
            data.
      Note: In our example, it is always set to 0, even for a big file
            which is streamed from stdin. !

  Offset 8: (Compression method - 2 bytes)
      08 00 : The value 8 is for "deflate" method.

  Offset 10: time and date (2 bytes + 2 bytes)
      c0 11 6f 3d : MS-DOS packed time (0x11c0) and date (0x3d6f),
                    not seconds since Jan 1, 1970.
                    0x11c0 => 02:14:00, 0x3d6f => 2010 Nov 15, which
                    matches the DOS date/time in the zipinfo output
                    below.

  Offset 14: CRC-32 (4 bytes): (lower-to-higher):
      8f 1d c9 25: (i.e 25 c9 1d 8f ):
      It is not the CRC-32 seed. It is not the CRC of the compressed
      content. It is the CRC of the original content.

      The 'magic number' for the CRC is 0xdebb20e3.
      The proper CRC pre and post conditioning
      is used, meaning that the CRC register is
      pre-conditioned with all ones (a starting value
      of 0xffffffff) and the value is post-conditioned by
      taking the one's complement of the CRC residual.
      If bit 3 of the general purpose flag is set, this
      field is set to zero in the local header and the correct
      value is put in the data descriptor and in the central
      directory.

      This field is set even if stdin is being streamed! how?!!!
      (Presumably because the output file is seekable: see the
      "check if output is seekable" step in the pseudo-code above, which
      would let zip go back and fill in the local header afterwards.)

      It is a per-file CRC. It is the same for the 7z and zip programs
      for the given one.txt file -- the value is the same whether
      compression is on or off.
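
The pre- and post-conditioning described above can be sketched as a small
bitwise CRC-32 in C. This is an illustration, not the table-driven code zip
or zlib actually use; the function name crc32_update is hypothetical, and
0xEDB88320 is the bit-reflected form of the standard CRC-32 polynomial.
It chains the same way as zlib's crc32(): start with 0 and feed successive
buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Update a running CRC-32 with len bytes from buf (hypothetical name).
 * Start with crc = 0 and pass the previous result for each new chunk. */
static uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    crc ^= 0xFFFFFFFFu;                 /* pre-condition: undo last complement */
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++) {
            /* shift right; XOR in the polynomial when the low bit was set */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;           /* post-condition: one's complement */
}
```

The standard check input "123456789" gives 0xCBF43926. Running the CRC over
the data followed by its own CRC (stored little-endian, as in ZIP) always
gives 0x2144DF1C, which is the one's complement of the magic number
0xdebb20e3 quoted above.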

  Offset 18: compressed size (4 bytes):
      33 00 00 00: (for the small 78-byte file, it is 51 bytes)
      78 3c 12 04: (It is 68MB for the video file, lsb-msb)

  Offset 22: uncompressed size (4 bytes)
      4e00 0000 (for the small 78-byte file)
      8ca7 2404 (for the video file it is exactly: 69511052 bytes
                 original size)

  Offset 26: File name length (2 bytes)
      0700 : (for one.txt file)
      0100 : (for stdin - file)

  Offset 28: extra field length (2 bytes):
      1c00 : zip30 uses 28 bytes (for compressed data of small files)
             Data starts with UT (time related) info.
      0000 : for minizip, 7zip and for stdin compressed file.
      1400 : zip2.32 uses 20 bytes for stdin zip. Filled mostly with zeros.
      1500 : zip2.32 uses 21 bytes for single zip of one.txt file.
             And it starts with UT... meaning Universal time is saved.

      The central-directory extra field contains:
      - A subfield with ID 0x5455 (universal time) and 5 data bytes.
        The local extra field has UTC/GMT modification/access times.
      - A subfield with ID 0x7855 (Unix UID/GID (16-bit)) and 0 data bytes.

ZIPINFO Output:

  /home/thava/bin/zipinfo3 -v small.bin says:

  Archive:  comp.bin
  There is no zipfile comment.

  End-of-central-directory record:
  -------------------------------

    Zip archive file size:                       381 (000000000000017Dh)
    Actual end-cent-dir record offset:           359 (0000000000000167h)
    Expected end-cent-dir record offset:         359 (0000000000000167h)
    (based on the length of the central directory and its expected offset)

    This zipfile constitutes the sole disk of a single-part archive; its
    central directory contains 2 entries.
    The central directory is 132 (0000000000000084h) bytes long,
    and its (expected) offset in bytes from the beginning of the zipfile
    is 227 (00000000000000E3h).
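
The numbers zipinfo reports above are consistent with the end of central
directory record layout: 227 + 132 = 359, and 381 - 359 = 22, the size of
the record when there is no archive comment. A minimal sketch of locating
and reading that record in a memory buffer follows. The struct and function
names (eocd, find_eocd, le16, le32) are hypothetical, and Zip64 and
multi-disk archives are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EOCD_SIG      0x06054b50u
#define EOCD_MIN_SIZE 22            /* fixed part, without archive comment */

struct eocd {                       /* hypothetical name, for illustration */
    uint16_t entries_total;
    uint32_t cd_size;
    uint32_t cd_offset;
    uint16_t comment_len;
};

static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Scan backwards from the end of the archive for the EOCD signature.
 * Returns the record's offset, or -1 if none is found. */
static long find_eocd(const unsigned char *buf, size_t len, struct eocd *out)
{
    assert(buf != NULL && out != NULL);
    if (len < EOCD_MIN_SIZE)
        return -1;
    for (size_t pos = len - EOCD_MIN_SIZE + 1; pos-- > 0; ) {
        const unsigned char *p = buf + pos;
        if (le32(p) != EOCD_SIG)
            continue;
        out->entries_total = le16(p + 10);
        out->cd_size       = le32(p + 12);
        out->cd_offset     = le32(p + 16);
        out->comment_len   = le16(p + 20);
        /* accept only if the comment exactly fills the rest of the file */
        if (pos + EOCD_MIN_SIZE + out->comment_len == len)
            return (long)pos;
    }
    return -1;
}
```

The backward scan is needed because the archive comment is variable in
size, so the record is not at a fixed distance from the end of the file.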

  Central directory entry #1:
  ---------------------------

    one.txt

    offset of local header from start of archive:   0
                                                    (0000000000000000h) bytes
    file system or operating system of origin:      Unix
    version of encoding software:                   2.3
    minimum file system compatibility required:     MS-DOS, OS/2 or NT FAT
    minimum software version required to extract:   2.0
    compression method:                             deflated
    compression sub-type (deflation):               normal
    file security status:                           not encrypted
    extended local header:                          no
    file last modified on (DOS date/time):          2010 Nov 15 02:14:00
    file last modified on (UT extra field modtime): 2010 Nov 15 02:14:00 local
    file last modified on (UT extra field modtime): 2010 Nov 14 20:44:00 UTC
    32-bit CRC value (hex):                         25c91d8f
    compressed size:                                51 bytes
    uncompressed size:                              78 bytes
    length of filename:                             7 characters
    length of extra field:                          13 bytes
    length of file comment:                         0 characters
    disk number on which file begins:               disk 1
    apparent file type:                             text
    Unix file attributes (100644 octal):            -rw-r--r--
    MS-DOS file attributes (00 hex):                none

    The central-directory extra field contains:
    - A subfield with ID 0x5455 (universal time) and 5 data bytes.
      The local extra field has UTC/GMT modification/access times.
    - A subfield with ID 0x7855 (Unix UID/GID (16-bit)) and 0 data bytes.

    There is no file comment.
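
The "DOS date/time" that zipinfo prints for one.txt comes from the two
16-bit "last mod file time" and "last mod file date" fields. A minimal
sketch of unpacking them is below. The names dos_stamp and dos_decode are
hypothetical; the input is assumed to be the four raw bytes at offset 10 of
the local header (time first, then date, both little-endian).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct dos_stamp {                  /* hypothetical name, for illustration */
    int year, month, day, hour, minute, second;
};

/* Decode the 4 raw header bytes: 2 bytes DOS time, then 2 bytes DOS date. */
static struct dos_stamp dos_decode(const unsigned char *raw)
{
    struct dos_stamp s;
    uint16_t t, d;

    assert(raw != NULL);
    t = (uint16_t)(raw[0] | (raw[1] << 8));
    d = (uint16_t)(raw[2] | (raw[3] << 8));

    s.hour   = t >> 11;             /* bits 15..11 */
    s.minute = (t >> 5) & 0x3f;     /* bits 10..5  */
    s.second = (t & 0x1f) * 2;      /* bits 4..0, 2-second resolution */
    s.year   = (d >> 9) + 1980;     /* bits 15..9, years since 1980 */
    s.month  = (d >> 5) & 0x0f;     /* bits 8..5   */
    s.day    = d & 0x1f;            /* bits 4..0   */
    return s;
}
```

Fed the bytes c0 11 6f 3d from the hex dump, this gives 2010-11-15
02:14:00, the same value zipinfo shows for entry #1.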

  Central directory entry #2:
  ---------------------------

    two.txt

    offset of local header from start of archive:   109
                                                    (000000000000006Dh) bytes
    file system or operating system of origin:      Unix
    version of encoding software:                   2.3
    minimum file system compatibility required:     MS-DOS, OS/2 or NT FAT
    minimum software version required to extract:   2.0
    compression method:                             deflated
    compression sub-type (deflation):               normal
    file security status:                           not encrypted
    extended local header:                          no
    file last modified on (DOS date/time):          2010 Nov 15 02:14:42
    file last modified on (UT extra field modtime): 2010 Nov 15 02:14:42 local
    file last modified on (UT extra field modtime): 2010 Nov 14 20:44:42 UTC
    32-bit CRC value (hex):                         5a92ece0
    compressed size:                                60 bytes
    uncompressed size:                              100 bytes
    length of filename:                             7 characters
    length of extra field:                          13 bytes
    length of file comment:                         0 characters
    disk number on which file begins:               disk 1
    apparent file type:                             text
    Unix file attributes (100644 octal):            -rw-r--r--
    MS-DOS file attributes (00 hex):                none

    The central-directory extra field contains:
    - A subfield with ID 0x5455 (universal time) and 5 data bytes.
      The local extra field has UTC/GMT modification/access times.
    - A subfield with ID 0x7855 (Unix UID/GID (16-bit)) and 0 data bytes.

    There is no file comment.

ZIP Version-History:

  A summary of key advances in various versions of the PKWARE specification:

    * 2.0:   File entries can be compressed with DEFLATE.
    * 4.5:   Documented 64-bit ZIP format.
    * 5.0:   DES, Triple DES, RC2, RC4 supported for encryption.
    * 5.2:   RC2-64 supported for encryption.
    * 6.1:   Documented certificate storage.
    * 6.2.0: Documented Central Directory Encryption.
    * 6.3.0: Documented Unicode (UTF-8) filename storage. Expanded list of
             supported hash, compression, encryption algorithms.
    * 6.3.1: Corrected standard hash values for SHA-256/384/512.
    * 6.3.2: Documented compression method 97 (WavPack).

How to use libzip?

  READING ZIP ARCHIVES
    open archive
        zip_open(3)

    add/change files and directories
        zip_add(3)
        zip_add_dir(3)
        zip_replace(3)
        zip_set_file_comment(3)
        zip_source_buffer(3)
        zip_source_file(3)
        zip_source_filep(3)
        zip_source_function(3)
        zip_source_zip(3)
        zip_source_free(3)

==============================================================================

Encoding-Binary:

  http://www.javaworld.com/javatips/jw-javatip117.html

  Text in an international character set uses non-ASCII values.
  Unless an editor or other software "understands" the non-ASCII
  encoding, it will get confused; e.g. a compiler cannot compile a C
  program written in Hindi.

  However, if a file is encoded in UTF-8, every Hindi string constant
  becomes a sequence of bytes above 127, which byte-oriented software can
  pass through unchanged. (Another way to carry arbitrary binary data as
  text: change every byte into a 2-digit hexadecimal string.)

  UTF-8 has the property that all existing 7-bit ASCII strings are still
  valid. UTF-8 only affects the meaning of bytes greater than 127, which
  it uses to represent higher Unicode characters. A character might
  require 1, 2, 3, or 4 bytes of storage depending on its value; more
  bytes are needed as values get larger. To store the full range of
  possible 32-bit characters, UTF-8 would require a whopping 6 bytes. But
  again, Unicode only defines characters up to 0x10FFFF, so this should
  never happen in practice.

  UTF-8 is a specific scheme for mapping a sequence of 1-4 bytes to a
  number from 0x000000 to 0x10FFFF:

    00000000 -- 0000007F:   0xxxxxxx
    00000080 -- 000007FF:   110xxxxx 10xxxxxx
    00000800 -- 0000FFFF:   1110xxxx 10xxxxxx 10xxxxxx
    00010000 -- 001FFFFF:   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

  If a string is UTF-8 encoded, then strcmp(), strlen() etc. still work,
  since a NUL byte cannot appear inside the encoded string! (strlen()
  then counts bytes, not characters.)

  Open a file with a Unicode file name:

    #include <stdio.h>

    int main()
    {
        /* UTF-8 bytes for 明天.txt (two Chinese characters + ".txt") */
        unsigned char fname[20] = {0xE6, 0x98, 0x8E, 0xE5, 0xA4, 0xA9,
                                   0x2E, 0x74, 0x78, 0x74, 0x00};
        FILE *fp;

        fp = fopen((char *)fname, "w");     /* create the file */
        if (fp) fclose(fp);
        fp = fopen((char *)fname, "r");     /* reopen it for reading */
        if (fp) fclose(fp);
        return 0;
    }

==============================================================================

For writing unicode strings, use:

  WPRINTF(3)          Linux Programmer's Manual          WPRINTF(3)

  NAME
       wprintf, fwprintf, swprintf, vwprintf, vfwprintf, vswprintf -
       formatted wide character output conversion

  SYNOPSIS
       #include <stdio.h>
       #include <wchar.h>

       int wprintf(const wchar_t *format, ...);
       int fwprintf(FILE *stream, const wchar_t *format, ...);
       int swprintf(wchar_t *wcs, size_t maxlen,
                    const wchar_t *format, ...);

==============================================================================

http://www.linuxdocs.org/HOWTOs/Unicode-HOWTO-3.html

  You will need a program to convert your locally (probably ISO-8859-1)
  encoded texts to UTF-8. (The alternative would be to keep using texts in
  different encodings on the same machine; this is not fun in the long
  run.) One such program is `iconv', which comes with glibc-2.1. Simply use

    $ iconv --from-code=ISO-8859-1 --to-code=UTF-8 < old_file > new_file

==============================================================================

Local_downloaded_locations:

  /home/thava/download/zip/
      - Info zip; Use zip30 and unzip 60. Based on BSD license.
  /home/thava/download/zip/zlib-1.2.5/contrib/untgz
      - Simple unzip/untar program.

==============================================================================

Whos_who:

  zlib authors and copyright holders:
      Jean-loup Gailly
      Mark Adler
  The deflate format was defined by:  Phil Katz.
  The deflate and zlib specifications were written by:  L. Peter Deutsch.
  Windows DLL version & minizip written by:  Gilles Vollant.

Difference_between_zlib_and_gzip:

  zlib is a library that mainly provides the deflate() and inflate()
  interfaces. gzip is a file format (and program) that wraps
  deflate-compressed data in its own header and trailer. However, zlib
  also provides gzopen() etc. interfaces which can be used to read/create
  gzip format files.
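
One visible difference between the two wrappers is the first two bytes of
the stream. The sketch below assumes the header rules from RFC 1950 (zlib
format) and RFC 1952 (gzip format), which are not quoted in these notes.
The enum and function names are hypothetical, and the check is only a
heuristic for telling the formats apart, not a validation of the stream.

```c
#include <assert.h>
#include <stddef.h>

enum stream_kind { KIND_UNKNOWN, KIND_GZIP, KIND_ZLIB };  /* hypothetical */

/* Guess the wrapper format from the first two bytes of a stream. */
static enum stream_kind sniff_stream(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len < 2)
        return KIND_UNKNOWN;

    /* gzip: fixed magic bytes 0x1f 0x8b */
    if (buf[0] == 0x1f && buf[1] == 0x8b)
        return KIND_GZIP;

    /* zlib: low nibble of the first byte is 8 (deflate), and the two
     * header bytes read as a big-endian number are a multiple of 31 */
    if ((buf[0] & 0x0f) == 8 && ((buf[0] << 8) | buf[1]) % 31 == 0)
        return KIND_ZLIB;

    return KIND_UNKNOWN;
}
```

A ZIP archive matches neither: it starts with "PK" (50 4b) and stores raw
deflate data inside its own local headers.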
==============================================================================

zlib_synopsis:

  NAME
       zlib -- general purpose compression library

  SYNOPSIS
       #include <zlib.h>

     Basic functions

       int deflateInit(z_streamp strm, int level);
       int deflate(z_streamp strm, int flush);
       int deflateEnd(z_streamp strm);
       int inflateInit(z_streamp strm);
       int inflate(z_streamp strm, int flush);
       int inflateEnd(z_streamp strm);

     Utility functions

       typedef voidp gzFile ;

       int compress(Bytef *dest, uLongf *destLen,
                    const Bytef *source, uLong sourceLen);
       int compress2(Bytef *dest, uLongf *destLen,
                     const Bytef *source, uLong sourceLen, int level);
       int uncompress(Bytef *dest, uLongf *destLen,
                      const Bytef *source, uLong sourceLen);

       gzFile gzopen(const char *path, const char *mode);
       gzFile gzdopen(int fd, const char *mode);
       int gzsetparams(gzFile file, int level, int strategy);
       int gzread(gzFile file, voidp buf, unsigned len);
       int gzwrite(gzFile file, const voidp buf, unsigned len);
       z_off_t gzseek(gzFile file, z_off_t offset, int whence);

     Checksum functions

       uLong adler32(uLong adler, const Bytef *buf, uInt len);
       uLong crc32(uLong crc, const Bytef *buf, uInt len);

  uLong adler32(uLong adler, const Bytef *buf, uInt len);

       The adler32() function updates a running Adler-32 checksum with the
       bytes buf[0..len-1] and returns the updated checksum. If buf is
       NULL, this function returns the required initial value for the
       checksum.

       An Adler-32 checksum is almost as reliable as a CRC32 but can be
       computed much faster. Usage example:

             uLong adler = adler32(0L, Z_NULL, 0);

             while (read_buffer(buffer, length) != EOF) {
                 adler = adler32(adler, buffer, length);
             }
             if (adler != original_adler) error();

  uLong crc32(uLong crc, const Bytef *buf, uInt len);

       The crc32() function updates a running CRC with the bytes
       buf[0..len-1] and returns the updated CRC. If buf is NULL, this
       function returns the required initial value for the CRC.
       Pre- and post-conditioning (one's complement) is performed within
       this function so it shouldn't be done by the application. Usage
       example:

             uLong crc = crc32(0L, Z_NULL, 0);

             while (read_buffer(buffer, length) != EOF) {
                 crc = crc32(crc, buffer, length);
             }
             if (crc != original_crc) error();

==============================================================================

Tar_format

    /* tar header */

    #define BLOCKSIZE     512
    #define SHORTNAMESIZE 100

    struct tar_header
    {                               /* byte offset */
      char name[100];               /*   0 */
      char mode[8];                 /* 100 */
      char uid[8];                  /* 108 */
      char gid[8];                  /* 116 */
      char size[12];                /* 124 */
      char mtime[12];               /* 136 */
      char chksum[8];               /* 148 */
      char typeflag;                /* 156 */
      char linkname[100];           /* 157 */
      char magic[6];                /* 257 */
      char version[2];              /* 263 */
      char uname[32];               /* 265 */
      char gname[32];               /* 297 */
      char devmajor[8];             /* 329 */
      char devminor[8];             /* 337 */
      char prefix[155];             /* 345 */
                                    /* 500 */
    };

    union tar_buffer
    {
      char               buffer[BLOCKSIZE];
      struct tar_header  header;
    };

  Note: A 512-byte block of zeros is the EOF marker (POSIX specifies two
        consecutive zero blocks at the end of the archive).
        The header still needs the file size in advance.
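
The numeric fields in the header above are stored as octal ASCII text, and
the file data follows the header in whole 512-byte blocks, which is why the
size must be known before the header is written. A minimal sketch of
reading such fields is below. The function names are hypothetical; the
checksum rule used (sum of all 512 header bytes with the chksum field
counted as eight spaces) is the usual ustar convention and is stated here
as an assumption, since these notes do not spell it out.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCKSIZE 512

/* Parse an octal ASCII field such as size[12]; stops at NUL or space. */
static unsigned long tar_octal(const char *field, size_t width)
{
    unsigned long value = 0;
    size_t i = 0;

    assert(field != NULL);
    while (i < width && field[i] == ' ')        /* skip leading spaces */
        i++;
    for (; i < width && field[i] >= '0' && field[i] <= '7'; i++)
        value = value * 8 + (unsigned long)(field[i] - '0');
    return value;
}

/* Sum of the 512 header bytes, with chksum[8] (offset 148) as spaces. */
static unsigned long tar_checksum(const unsigned char *block)
{
    unsigned long sum = 0;

    assert(block != NULL);
    for (size_t i = 0; i < BLOCKSIZE; i++)
        sum += (i >= 148 && i < 156) ? (unsigned long)' ' : block[i];
    return sum;
}

/* Number of 512-byte data blocks following a header for `size` bytes. */
static unsigned long tar_data_blocks(unsigned long size)
{
    return (size + BLOCKSIZE - 1) / BLOCKSIZE;
}
```

For the 78-byte one.txt used earlier, the size field would read
"00000000116" (octal 116 = 78) and the data would occupy one block.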