Contents:

Basic-Facts about ZIP

ZIP APPNOTE.TXT Released by PKWARE
            http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Tar-Format: http://en.wikipedia.org/wiki/Tar_%28file_format%29

ZipPseudo-Code  :
PseudoCode-libzip:   From NiH: http://nih.at/libzip/README.html

Encoding-Binary: http://www.javaworld.com/javatips/jw-javatip117.html

libzip is under BSD license: 
               http://www.nih.at/listarchive/libzip-discuss/msg00013.html

Local_downloaded_locations: See below.

Zlib data compression library.
     -Official Location:  http://zlib.net
     -Local Directory:    file:/home/thava/download/zlib-1.2.5 

zlib_synopsis:

Whos_who.

Difference_between_zlib_and_gzip

Tar_format

==============================================================================

Basic-Facts:
  - ZIP is for archive+compression. Tar is for archive only.
  - Info-ZIP is a collection of individuals (including the gzip author) who
    provide zip source code under a BSD-based license.
  - ZIP format was created in 1989, first implemented by PKWARE. Open Format.
  - JAR and OpenDocument adopted ZIP-based formats, but ZIP itself is not
    defined in any RFC.

More Information:

  - zip program (by Info-Zip) can write the archive to stdout and can read
    input from stdin.
  - Info-Zip : http://sourceforge.net/projects/infozip/files/
               http://www.info-zip.org
  - libzip provides a better interface, but does not allow non-seekable output!


==============================================================================


ZIP APPNOTE.TXT Released by PKWARE :
 - ZIP format was defined by Phil Katz of PKWARE in 1989.
 - JAR, OpenDocument and other formats have adopted formats based on the
   PKWARE ZIP format.
 - Although PKWARE says it is an open format, it is still trying to patent the
   use of an authentication mechanism in the ZIP format.

File Format:
V. General Format of a .ZIP file
--------------------------------

  Files stored in arbitrary order.  Large .ZIP files can span multiple
  volumes or be split into user-defined segment sizes. All values
  are stored in little-endian byte order unless otherwise specified. 

  Overall .ZIP file format:

    [local file header 1]
    [file data 1]
    [data descriptor 1]
    . 
    .
    .
    [local file header n]
    [file data n]
    [data descriptor n]
    [archive decryption header] 
    [archive extra data record] 
    [central directory]
    [zip64 end of central directory record]
    [zip64 end of central directory locator] 
    [end of central directory record]


  A.  Local file header:

Offset:
 00     local file header signature     4 bytes  (0x04034b50)
 04     version needed to extract       2 bytes
 06     general purpose bit flag        2 bytes
 08     compression method              2 bytes
 10     last mod file time              2 bytes
 12     last mod file date              2 bytes
 14     crc-32                          4 bytes
 18     compressed size                 4 bytes
 22     uncompressed size               4 bytes
 26     file name length                2 bytes
 28     extra field length              2 bytes

 30     file name (variable size)
 30+    extra field (variable size)

  B.  File data  
      [ Available at offset: 30+len(filename)+len(extrafield) ]

      Immediately following the local header for a file
      is the compressed or stored data for the file. 
      The series of [local file header][file data][data
      descriptor] repeats for each file in the .ZIP archive. 
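
The local file header layout and the file data offset above can be read from a
byte buffer roughly as follows. This is only a sketch: the names
zip_local_header, parse_local_header, rd16 and rd32 are made up for these notes
and are not taken from Info-ZIP or libzip. It assumes the fixed 30-byte part of
the header is already in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZIP_LOCAL_SIG       0x04034b50u
#define ZIP_LOCAL_FIXED_LEN 30u   /* bytes before the file name */

/* Hypothetical struct for illustration; not an Info-ZIP or libzip type. */
struct zip_local_header {
    uint16_t version_needed, flags, method, mod_time, mod_date;
    uint32_t crc32, comp_size, uncomp_size;
    uint16_t name_len, extra_len;
    size_t   data_offset;   /* where the file data starts, from header start */
};

/* Assemble little-endian values byte by byte, independent of host order. */
static uint16_t rd16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if buf is too short or the signature is wrong. */
int parse_local_header(const unsigned char *buf, size_t len,
                       struct zip_local_header *h)
{
    assert(buf != NULL && h != NULL);
    if (len < ZIP_LOCAL_FIXED_LEN || rd32(buf) != ZIP_LOCAL_SIG)
        return -1;
    h->version_needed = rd16(buf + 4);
    h->flags          = rd16(buf + 6);
    h->method         = rd16(buf + 8);
    h->mod_time       = rd16(buf + 10);
    h->mod_date       = rd16(buf + 12);
    h->crc32          = rd32(buf + 14);
    h->comp_size      = rd32(buf + 18);
    h->uncomp_size    = rd32(buf + 22);
    h->name_len       = rd16(buf + 26);
    h->extra_len      = rd16(buf + 28);
    /* File data follows the variable-size name and extra field. */
    h->data_offset    = (size_t)ZIP_LOCAL_FIXED_LEN + h->name_len + h->extra_len;
    return 0;
}
```

Reading byte by byte avoids casting the buffer to a packed struct, which would
depend on host byte order and struct padding.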

  C.  Data descriptor:
      [ Available at offset: 30+len(filename)+len(extrafield) + compressed size ]

        crc-32                          4 bytes
        compressed size                 4 bytes
        uncompressed size               4 bytes

      This descriptor exists only if bit 3 of the general
      purpose bit flag is set (see below).  It is byte aligned
      and immediately follows the last byte of compressed data.
      This descriptor is used only when it was not possible to
      seek in the output .ZIP file, e.g., when the output .ZIP file
      was standard output or a non-seekable device.
      For ZIP64(tm) format archives, the compressed and uncompressed
      sizes are 8 bytes each.


Q: What if the output file is seekable ? 

Although not originally assigned a signature, the value 
      0x08074b50 has commonly been adopted as a signature value 
      for the data descriptor record.

 When the Central Directory Encryption method is used, the data
      descriptor record is not required, but may be used.  If present,
      and bit 3 of the general purpose bit field is set to indicate
      its presence, the values in fields of the data descriptor
      record should be set to binary zeros.

  F.  Central directory structure:

      [file header 1]
      .
      .
      . 
      [file header n]
      [digital signature] 

      File header:

        central file header signature   4 bytes  (0x02014b50)
        version made by                 2 bytes
        version needed to extract       2 bytes
        general purpose bit flag        2 bytes
        compression method              2 bytes
        last mod file time              2 bytes
        last mod file date              2 bytes
        crc-32                          4 bytes
        compressed size                 4 bytes
        uncompressed size               4 bytes
        file name length                2 bytes
        extra field length              2 bytes
        file comment length             2 bytes

[offset 34 bytes i.e. 0x22 from central file header beginning: ]

        disk number start               2 bytes     [ central-dir-specific ]
        internal file attributes        2 bytes     [ central-dir-specific ]
        external file attributes        4 bytes     [ central-dir-specific ]
        relative offset of local header 4 bytes     [ central-dir-specific ]

[offset 46 bytes i.e. 0x2e from central file header beginning: ]
        file name (variable size)
        extra field (variable size)
        file comment (variable size)                [ central-dir-specific ]

      Digital signature:

        header signature                4 bytes  (0x05054b50)
        size of data                    2 bytes
        signature data (variable size)

G.  Zip64 end of central directory record      [ Only for Zip64 ]

        zip64 end of central dir 
        signature                       4 bytes  (0x06064b50)
        size of zip64 end of central
        directory record                8 bytes
        ........
        ........

H.  Zip64 end of central directory locator      [ Only for Zip64 ]

        zip64 end of central dir locator 
        signature                       4 bytes  (0x07064b50)
        number of the disk with the
        start of the zip64 end of 
        central directory               4 bytes
        relative offset of the zip64
        end of central directory record 8 bytes
        total number of disks           4 bytes
       
I.  End of central directory record:

        end of central dir signature    4 bytes  (0x06054b50)
        number of this disk             2 bytes
        number of the disk with the
        start of the central directory  2 bytes
        total number of entries in the
        central directory on this disk  2 bytes
        total number of entries in
        the central directory           2 bytes
        size of the central directory   4 bytes
        offset of start of central
        directory with respect to
        the starting disk number        4 bytes
        .ZIP file comment length        2 bytes
        .ZIP file comment       (variable size)
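
Since the end of central directory record sits at the very end of the archive
and only its comment is variable-sized, a reader can find it by scanning
backwards for the signature. The sketch below does that on an in-memory
buffer. The names zip_eocd and find_eocd are invented for these notes (they
are not Info-ZIP or libzip functions), and ZIP64 and multi-disk archives are
deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZIP_EOCD_SIG       0x06054b50u
#define ZIP_EOCD_FIXED_LEN 22u      /* record size without the comment */
#define ZIP_MAX_COMMENT    65535u   /* comment length is a 2-byte field */

/* Hypothetical struct holding the fields needed to reach the directory. */
struct zip_eocd {
    uint16_t total_entries;
    uint32_t cd_size;
    uint32_t cd_offset;
};

static uint16_t rd16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the offset of the record within buf, or -1 if none is found. */
long find_eocd(const unsigned char *buf, size_t len, struct zip_eocd *out)
{
    size_t pos, lowest;

    assert(buf != NULL && out != NULL);
    if (len < ZIP_EOCD_FIXED_LEN)
        return -1;
    pos = len - ZIP_EOCD_FIXED_LEN;
    lowest = (pos > ZIP_MAX_COMMENT) ? pos - ZIP_MAX_COMMENT : 0;
    for (;;) {
        /* Accept a signature only if its comment length reaches the end. */
        if (rd32(buf + pos) == ZIP_EOCD_SIG &&
            pos + ZIP_EOCD_FIXED_LEN + rd16(buf + pos + 20) == len) {
            out->total_entries = rd16(buf + pos + 10);
            out->cd_size       = rd32(buf + pos + 12);
            out->cd_offset     = rd32(buf + pos + 16);
            return (long)pos;
        }
        if (pos == lowest)
            break;
        pos--;
    }
    return -1;
}
```

For an archive without a comment, the record is simply the last 22 bytes of
the file, and cd_offset + cd_size should equal the offset of the record.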


==============================================================================


Zip can take input from standard input; in the hex dump further below, such an
entry is stored with file name length 1.

      disk number start: (2 bytes)

          The number of the disk on which this file begins.  If an 
          archive is in ZIP64 format and the value in this field is 
          0xFFFF, the size will be in the corresponding 4 byte zip64 
          extended information extra field.

    internal file attributes: (2 bytes)

          Bits 1 and 2 are reserved for use by PKWARE.

          The lowest bit of this field indicates, if set, that
          the file is apparently an ASCII or text file.  If not
          set, that the file apparently contains binary data.
          The remaining bits are unused in version 1.0.

      external file attributes: (4 bytes)

          The mapping of the external attributes is
          host-system dependent (see 'version made by').  For
          MS-DOS, the low order byte is the MS-DOS directory
          attribute byte.  If input came from standard input, this
          field is set to zero.

Tar-Format: http://en.wikipedia.org/wiki/Tar_%28file_format%29

The format was created in the early days of Unix and standardized by 
POSIX.1-1988 and later POSIX.1-2001.


ZipPseudo-Code

  zip3.0 code from Info-Zip:

  readzipfile:
    check if it is not stdin and can be opened for reading.

zip.c : 4339 :    
    filetime(z->name, ...); platform dependent routine. 
      on unix:  Make sure you nullify the last "/" character.
        If file is "-" do fstat(fileno(stdin) ... );
        or do lstat();
      set the attributes and preserve the time.

      if encryption enabled:  crc_32_tab = get_crc_table();
      open zip file for writing;
      get stat info of the zipfile that we just created;
      create a temp zip file (in case of update of existing zip)
      ifdef _IOFBF,  setvbuf(8K buffer);
      check if output is seekable:  fseek(fd, current-pos, SEEK_SET) == 0 ?

      write contents;
      write central directory:
        For each entry:
        putcentral(z): 
        // write to memblock, then will write to file.
        only if the size > 4GB, set the zip64 extra field flag;
        Write:
        Signature = 0x02014b50L;  4bytes;
        Version-made-by = 798;    2bytes;
        version needed to extract       2 bytes   10 
        general purpose bit flag        2 bytes   0 
        compression method              2 bytes   0
        last mod file time              2 bytes   xx
        last mod file date              2 bytes   xx
        crc-32                          4 bytes   0x96170874
        compressed size                 4 bytes   4 (if >4GB, write 4GB here)   
        uncompressed size               4 bytes   for zip64, =4GB; else 
                                                      size>4G? 4GB : size;
        file name length                2 bytes   10 for thava.txt(includes 0)
        extra field length              2 bytes   24 ??? 
        file comment length             2 bytes   0
        disk number start               2 bytes   0
        internal file attributes        2 bytes   1  flag to indicate ASCII File
        external file attributes        4 bytes   filetime value. can be 0. 
        relative offset of local header 4 bytes   72 first-file 72bytes+
                                                      second file 72bytes.

        putcentral oem comment:
        current_disk  ;   0
        number_of_disk for start of central dir; 0
        total number of entries ; 2
        size of central directory. 160  (if >4GB, write 4GB here)
        4 bytes offset of start of central dir from starting disk number; 144
        write comment if any;
        write out the whole of central dir from memory buf; 
        rename temp file as original zip file.
        setfileattr: chmod the file to reasonable mode;


struct zlist { contains local header info:
    flg, disknumber, offset, extra field,
    filename, unicode name, size, that's it.
    }


Examine an example zipped file (without compression) :
  one.txt and two.txt: (small files)

Hex dump of zip file:

Start:  Local Header:  

Offset 0: (local signature)
  50 4b 03 04 : i.e. PK 3 4 : Local File Header Signature (4 bytes)
Offset 4: (version_needed)
  0a 00  - Version needed to extract.  10 => i.e. 1.0 ZIP format.
  14 00  - Version needed to extract.  20 => i.e. 2.0 ZIP format.
              From version 2.0 i.e. 20 (0x14), files can be Deflate compressed.
Offset 6: (general purpose bit flag - 2bytes)
  00 00  - Note: The meaning of these 16 bits generally depends on the
           compression method given in the next 2 bytes (deflate, LZMA, etc).
           For the Deflate method, bits 1-2 are:
             00 - normal compression. 11 - super fast compression, etc.
           Note: Useful for streaming!
           bit3: If this bit is set, the fields crc-32, compressed 
                 size and uncompressed size are set to zero in the 
                 local header.  The correct values are put in the 
                 data descriptor immediately following the compressed
                 data. 
           Note: In our example, it is always 0, even for a big file
                 that is streamed from stdin!
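
The two flag fields discussed above can be pulled out of the 16-bit value with
shifts and masks. A small sketch follows; zip_flag_info and zip_decode_flags
are names invented for these notes. The two middle sub-type names ("maximum"
and "fast") come from the PKWARE APPNOTE table for Deflate, not from the text
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct for illustration. */
struct zip_flag_info {
    int has_data_descriptor;      /* bit 3 */
    const char *deflate_subtype;  /* bits 1-2; only meaningful for method 8 */
};

void zip_decode_flags(uint16_t flags, struct zip_flag_info *out)
{
    /* Index is (bit 2, bit 1) read as a two-bit number. */
    static const char *const subtype[4] = {
        "normal", "maximum", "fast", "super fast"
    };

    assert(out != NULL);
    out->has_data_descriptor = (flags >> 3) & 1;
    out->deflate_subtype = subtype[(flags >> 1) & 3];
}
```

A streaming reader would check has_data_descriptor first: when it is set, the
sizes in the local header are zero and cannot be used to skip the file data.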
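Offset 8: (Compression method - 2 bytes)
  08 00 : The value 8 is for "deflate" method.

Offset 10:  time and date (4 bytes)
  c0 11 6f 3d  : Not a time_t. Two little-endian 16-bit words in MS-DOS
                 format: last mod file time (0x11c0), then last mod file
                 date (0x3d6f). Here: 2010 Nov 15 02:14:00, the same value
                 zipinfo reports as "DOS date/time".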
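
Read as little-endian MS-DOS time and date words, the four bytes c0 11 6f 3d
decode to 2010 Nov 15 02:14:00, which matches the "DOS date/time" line in the
zipinfo output later in these notes. A sketch of the bit layout
(dos_datetime and decode_dos_datetime are names invented here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct for illustration. */
struct dos_datetime {
    int year, month, day, hour, minute, second;
};

/*
 * time word: bits 15-11 hour, bits 10-5 minute, bits 4-0 seconds / 2
 * date word: bits 15-9 years since 1980, bits 8-5 month, bits 4-0 day
 */
void decode_dos_datetime(uint16_t t, uint16_t d, struct dos_datetime *out)
{
    assert(out != NULL);
    out->hour   = (t >> 11) & 0x1f;
    out->minute = (t >> 5) & 0x3f;
    out->second = (t & 0x1f) * 2;     /* stored with 2-second resolution */
    out->year   = ((d >> 9) & 0x7f) + 1980;
    out->month  = (d >> 5) & 0x0f;
    out->day    = d & 0x1f;
}
```

The value is local time without a time zone, which is why zip also stores a
"UT" extra field with the UTC modification time.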

Offset 14: CRC-32 (4bytes): (lower-to-higher):
   8f 1d c9 25:  (i.e 25 c9 1d 8f ): It is not CRC-32 seed. It is not CRC of
                 compressed content. It is CRC of original content.
          The 'magic number' for the CRC is 0xdebb20e3.  The proper
          CRC pre and post conditioning is used, meaning that the CRC
          register is pre-conditioned with all ones (a starting value
          of 0xffffffff) and the value is post-conditioned by taking
          the one's complement of the CRC residual.
          If bit 3 of the general purpose flag is set, this
          field is set to zero in the local header and the correct
          value is put in the data descriptor and in the central
          directory.
          This field is set even if stdin is being streamed! how?!!!
          It is a per-file CRC. The 7z and zip programs produce the same
          value for a given one.txt file, and the value is the same with
          compression on or off.
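
The pre- and post-conditioning described above can be seen in a plain bitwise
CRC-32. This is a slow reference sketch, not the table-driven code that zip or
zlib actually use; crc32_update is a name invented here, chosen so it does not
clash with zlib's crc32(). Like zlib's function, it is called first with 0 and
then with the previous return value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 using the reflected polynomial 0xEDB88320. */
uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len)
{
    size_t i;
    int k;

    assert(buf != NULL || len == 0);
    crc = ~crc;                 /* pre-condition: 0 becomes 0xffffffff */
    for (i = 0; i < len; i++) {
        crc ^= buf[i];
        for (k = 0; k < 8; k++) {
            /* Shift right; XOR in the polynomial when the low bit was set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;                /* post-condition: one's complement */
}
```

Because the value depends only on the uncompressed bytes, every tool and every
compression setting gives the same CRC for the same input file.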

Offset 18: compressed size (4 bytes):
   33 00 00 00: (for small 78 bytes file, it is 51 bytes)
   78 3c 12 04: (It is 68MB for video file lsb-msb)

Offset 22: uncompressed size (4 bytes)
   4e00 0000  (for small 78 bytes file)
   8ca7 2404  (for video file it is exactly: 69511052 bytes original size)

Offset 26: File name length (2bytes)
  0700   : (for one.txt file)
  0100   : (for stdin - file)

Offset 28

  extra field length : 2 bytes:

  1c00   : zip30 uses 28 bytes (for compressed data of small files)
           Data starts with UT (time related) info.
  0000   : for minizip, 7zip and for stdin compressed file.
  1400   : zip2.32 uses 20 bytes for stdin zip. Filled mostly with zeros.
  1500   : zip2.32 uses 21 bytes for single zip of one.txt file.
           And it starts with UT... meaning Universal time is saved.

  The central-directory extra field contains:
  - A subfield with ID 0x5455 (universal time) and 5 data bytes.
    The local extra field has UTC/GMT modification/access times.
  - A subfield with ID 0x7855 (Unix UID/GID (16-bit)) and 0 data bytes.


ZIPINFO Output:

/home/thava/bin/zipinfo3 -v small.bin says:
Archive:  comp.bin
There is no zipfile comment.

End-of-central-directory record:
-------------------------------

  Zip archive file size:                       381 (000000000000017Dh)
  Actual end-cent-dir record offset:           359 (0000000000000167h)
  Expected end-cent-dir record offset:         359 (0000000000000167h)
  (based on the length of the central directory and its expected offset)

  This zipfile constitutes the sole disk of a single-part archive; its
  central directory contains 2 entries.
  The central directory is 132 (0000000000000084h) bytes long,
  and its (expected) offset in bytes from the beginning of the zipfile
  is 227 (00000000000000E3h).


Central directory entry #1:
---------------------------

  one.txt

  offset of local header from start of archive:   0
                                                  (0000000000000000h) bytes
  file system or operating system of origin:      Unix
  version of encoding software:                   2.3
  minimum file system compatibility required:     MS-DOS, OS/2 or NT FAT
  minimum software version required to extract:   2.0
  compression method:                             deflated
  compression sub-type (deflation):               normal
  file security status:                           not encrypted
  extended local header:                          no
  file last modified on (DOS date/time):          2010 Nov 15 02:14:00
  file last modified on (UT extra field modtime): 2010 Nov 15 02:14:00 local
  file last modified on (UT extra field modtime): 2010 Nov 14 20:44:00 UTC
  32-bit CRC value (hex):                         25c91d8f
  compressed size:                                51 bytes
  uncompressed size:                              78 bytes
  length of filename:                             7 characters
  length of extra field:                          13 bytes
  length of file comment:                         0 characters
  disk number on which file begins:               disk 1
  apparent file type:                             text
  Unix file attributes (100644 octal):            -rw-r--r--
  MS-DOS file attributes (00 hex):                none

  The central-directory extra field contains:
  - A subfield with ID 0x5455 (universal time) and 5 data bytes.
    The local extra field has UTC/GMT modification/access times.
  - A subfield with ID 0x7855 (Unix UID/GID (16-bit)) and 0 data bytes.

  There is no file comment.

Central directory entry #2:
---------------------------

  two.txt

  offset of local header from start of archive:   109
                                                  (000000000000006Dh) bytes
  file system or operating system of origin:      Unix
  version of encoding software:                   2.3
  minimum file system compatibility required:     MS-DOS, OS/2 or NT FAT
  minimum software version required to extract:   2.0
  compression method:                             deflated
  compression sub-type (deflation):               normal
  file security status:                           not encrypted
  extended local header:                          no
  file last modified on (DOS date/time):          2010 Nov 15 02:14:42
  file last modified on (UT extra field modtime): 2010 Nov 15 02:14:42 local
  file last modified on (UT extra field modtime): 2010 Nov 14 20:44:42 UTC
  32-bit CRC value (hex):                         5a92ece0
  compressed size:                                60 bytes
  uncompressed size:                              100 bytes
  length of filename:                             7 characters
  length of extra field:                          13 bytes
  length of file comment:                         0 characters
  disk number on which file begins:               disk 1
  apparent file type:                             text
  Unix file attributes (100644 octal):            -rw-r--r--
  MS-DOS file attributes (00 hex):                none

  The central-directory extra field contains:
  - A subfield with ID 0x5455 (universal time) and 5 data bytes.
    The local extra field has UTC/GMT modification/access times.
  - A subfield with ID 0x7855 (Unix UID/GID (16-bit)) and 0 data bytes.

  There is no file comment.


ZIP Version-History:
A summary of key advances in various versions of the PKWARE specification:

  * 2.0: File entries can be compressed with DEFLATE.
  * 4.5: Documented 64-bit ZIP format.
  * 5.0: DES, Triple DES, RC2, RC4 supported for encryption
  * 5.2: RC2-64 supported for Encryption.
  * 6.1: Documented certificate storage.
  * 6.2.0: Documented Central Directory Encryption.
  * 6.3.0: Documented Unicode (UTF-8) filename storage. Expanded list of
    supported hash, compression, encryption algorithms.
  * 6.3.1: Corrected standard hash values for SHA-256/384/512.
  * 6.3.2: Documented compression method 97 (WavPack).


How to use libzip?

READING ZIP ARCHIVES
open archive
       zip_open(3)

add/change files and directories
   zip_add(3)     zip_add_dir(3)     zip_replace(3)     zip_set_file_comment(3)
   zip_source_buffer(3) zip_source_file(3) zip_source_filep(3)
   zip_source_function(3) zip_source_zip(3) zip_source_free(3)

==============================================================================

Encoding-Binary: http://www.javaworld.com/javatips/jw-javatip117.html

Text in an international character set uses non-ASCII byte values.
Unless an editor or other software "understands" the encoding, it will get
confused, e.g. a compiler cannot compile a C program written in Hindi.
However, if a file is encoded in UTF-8, constant Hindi strings are stored as
sequences of bytes above 127 and every ASCII byte keeps its meaning, so
software that treats strings as plain bytes still works.

Change every byte into 2-digit hexadecimal byte: And this


UTF-8 has the property that all existing 7-bit ASCII strings are still valid.
UTF-8 only affects the meaning of bytes greater than 127, which it uses to
represent higher Unicode characters. A character might require 1, 2, 3, or 4
bytes of storage depending on its value; more bytes are needed as values get
larger. To store the full range of possible 32-bit characters, UTF-8 would
require a whopping 6 bytes. But again, Unicode only defines characters up to
0x10FFFF, so this should never happen in practice.

UTF-8 is a specific scheme for mapping a sequence of 1-4 bytes to a number
from 0x000000 to 0x10FFFF:

00000000 -- 0000007F: 	0xxxxxxx
00000080 -- 000007FF: 	110xxxxx 10xxxxxx
00000800 -- 0000FFFF: 	1110xxxx 10xxxxxx 10xxxxxx
00010000 -- 001FFFFF: 	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
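
The table above translates directly into an encoder. A sketch, with
utf8_encode being a name invented for these notes; it rejects values above
0x10FFFF and the surrogate range 0xD800-0xDFFF, which the table does not
mention but which are not valid characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes 1-4 bytes to out; returns the count, or 0 if cp is not encodable. */
int utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx x3 */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+660E encodes to E6 98 8E, the first three bytes of the file
name used in the fopen() example below.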

If a string is UTF-8 encoded, then strcmp(), strlen() etc. still work
(strlen() counts bytes, not characters), since a NUL byte cannot appear
inside an encoded character!

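Open a file whose name contains Unicode characters:

#include <stdio.h>

int main(void)
{
    /* UTF-8 bytes of the file name 明天.txt (two Chinese characters) */
    unsigned char fname[20] = {0xE6, 0x98, 0x8E, 0xE5, 0xA4, 0xA9, 0x2E, 0x74, 0x78, 0x74, 0x00};
    FILE *fp;

    fp = fopen((char *)fname, "w");   /* create the file */
    if (fp == NULL)
        return 1;
    fclose(fp);

    fp = fopen((char *)fname, "r");   /* open it again for reading */
    if (fp == NULL)
        return 1;
    fclose(fp);
    return 0;
}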
==============================================================================

For writing unicode strings, use:
WPRINTF(3) Linux Programmer's Manual WPRINTF(3)

NAME
wprintf, fwprintf, swprintf, vwprintf, vfwprintf, vswprintf -
formatted wide character output conversion

SYNOPSIS
#include <stdio.h>
#include <wchar.h>

int wprintf(const wchar_t *format, ...);
int fwprintf(FILE *stream, const wchar_t *format, ...);
int swprintf(wchar_t *wcs, size_t maxlen,
const wchar_t *format, ...);

==============================================================================
http://www.linuxdocs.org/HOWTOs/Unicode-HOWTO-3.html

You will need a program to convert your locally (probably ISO-8859-1) encoded
texts to UTF-8. (The alternative would be to keep using texts in different
encodings on the same machine; this is not fun in the long run.) One such
program is `iconv', which comes with glibc-2.1. Simply use

    $ iconv --from-code=ISO-8859-1 --to-code=UTF-8 < old_file > new_file


==============================================================================

Local_downloaded_locations:

   /home/thava/download/zip/   - Info zip; Use zip30 and unzip 60
                                 Based on BSD license.

   /home/thava/download/zip/zlib-1.2.5/contrib/untgz 
                               - Simple unzip/untar program.

==============================================================================

Whos_who:

   zlib authors and copyright holders:
        Jean-loup Gailly <jloup@gzip.org>
        Mark Adler <madler@alumni.caltech.edu>

    The deflate format was defined by: Phil Katz.
    The deflate and zlib specifications written by: L. Peter Deutsch.

    Windows DLL Version & minizip written by: Gilles Vollant.

Difference_between_zlib_and_gzip:

   zlib mainly provides the deflate() and inflate() interfaces.
   gzip is a file format that wraps deflate-compressed data.
   However, zlib also provides gzopen() etc. interfaces, which can be used
   to read/create gzip format files.

==============================================================================

zlib_synopsis:
NAME
     zlib -- general purpose compression library

SYNOPSIS
     #include <zlib.h>

   Basic functions

     int deflateInit(z_streamp strm, int level);
     int deflate(z_streamp strm, int flush);
     int deflateEnd(z_streamp strm);

     int inflateInit(z_streamp strm);
     int inflate(z_streamp strm, int flush);
     int inflateEnd(z_streamp strm);

   Utility functions
     typedef voidp gzFile ;

     int compress(Bytef *dest, uLongf *destLen, const Bytef *source,
         uLong sourceLen);

     int compress2(Bytef *dest, uLongf *destLen, const Bytef *source,
         uLong sourceLen, int level);

     int uncompress(Bytef *dest, uLongf *destLen, const Bytef *source,
         uLong sourceLen);

     gzFile gzopen(const char *path, const char *mode);
     gzFile gzdopen(int fd, const char *mode);
     int gzsetparams(gzFile file, int level, int strategy);
     int gzread(gzFile file, voidp buf, unsigned len);
     int gzwrite(gzFile file, const voidp buf, unsigned len);
     z_off_t gzseek(gzFile file, z_off_t offset, int whence);

   Checksum functions

     uLong adler32(uLong adler, const Bytef *buf, uInt len);
     uLong crc32(uLong crc, const Bytef *buf, uInt len);

    uLong adler32(uLong adler, const Bytef *buf, uInt len);
             The adler32() function updates a running Adler-32 checksum with
             the bytes buf[0..len-1] and returns the updated checksum.  If buf
             is NULL, this function returns the required initial value for the
             checksum.

             An Adler-32 checksum is almost as reliable as a CRC32 but can be
             computed much faster.  Usage example:

                   uLong adler = adler32(0L, Z_NULL, 0);

                   while (read_buffer(buffer, length) != EOF) {
                   adler = adler32(adler, buffer, length);
                   }
                   if (adler != original_adler) error();

     uLong crc32(uLong crc, const Bytef *buf, uInt len);
             The crc32() function updates a running CRC with the bytes
             buf[0..len-1] and returns the updated CRC.  If buf is NULL, this
             function returns the required initial value for the CRC.  Pre-
             and post-conditioning (one's complement) is performed within this
             function so it shouldn't be done by the application.  Usage
             example:

                   uLong crc = crc32(0L, Z_NULL, 0);

                   while (read_buffer(buffer, length) != EOF) {
                   crc = crc32(crc, buffer, length);
                   }
                   if (crc != original_crc) error();

==============================================================================

Tar_format

/* tar header */

#define BLOCKSIZE     512
#define SHORTNAMESIZE 100

struct tar_header
{                               /* byte offset */
  char name[100];               /*   0 */
  char mode[8];                 /* 100 */
  char uid[8];                  /* 108 */
  char gid[8];                  /* 116 */
  char size[12];                /* 124 */
  char mtime[12];               /* 136 */
  char chksum[8];               /* 148 */
  char typeflag;                /* 156 */
  char linkname[100];           /* 157 */
  char magic[6];                /* 257 */
  char version[2];              /* 263 */
  char uname[32];               /* 265 */
  char gname[32];               /* 297 */
  char devmajor[8];             /* 329 */
  char devminor[8];             /* 337 */
  char prefix[155];             /* 345 */
                                /* 500 */
};

union tar_buffer
{
  char               buffer[BLOCKSIZE];
  struct tar_header  header;
};

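Note: The end of the archive is marked by two consecutive 512-byte blocks of
zeros. A writer still needs each file's size in advance, because the size is
stored in the header that precedes the file data.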
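
The numeric header fields (mode, size, chksum, ...) are octal ASCII text, and
the checksum is the byte sum of the whole header with the chksum field counted
as spaces. A sketch that works on a raw 512-byte block, using the offsets from
the struct above; tar_octal, tar_checksum and tar_is_zero_block are names
invented for these notes.

```c
#include <assert.h>
#include <stddef.h>

#define TAR_BLOCKSIZE  512
#define TAR_CHKSUM_OFF 148   /* offset of chksum[8] in the header */
#define TAR_CHKSUM_LEN 8

/* Parse an octal ASCII field such as size[12]; stops at NUL or space. */
unsigned long tar_octal(const char *field, size_t width)
{
    unsigned long v = 0;
    size_t i = 0;

    assert(field != NULL);
    while (i < width && field[i] == ' ')
        i++;                                  /* skip leading padding */
    for (; i < width && field[i] >= '0' && field[i] <= '7'; i++)
        v = v * 8 + (unsigned long)(field[i] - '0');
    return v;
}

/* Sum of all header bytes, with the chksum field itself taken as spaces. */
unsigned long tar_checksum(const unsigned char *block)
{
    unsigned long sum = 0;
    size_t i;

    assert(block != NULL);
    for (i = 0; i < TAR_BLOCKSIZE; i++) {
        if (i >= TAR_CHKSUM_OFF && i < TAR_CHKSUM_OFF + TAR_CHKSUM_LEN)
            sum += ' ';
        else
            sum += block[i];
    }
    return sum;
}

/* Returns 1 if the block is all zeros (end-of-archive marker block). */
int tar_is_zero_block(const unsigned char *block)
{
    size_t i;

    assert(block != NULL);
    for (i = 0; i < TAR_BLOCKSIZE; i++)
        if (block[i] != 0)
            return 0;
    return 1;
}
```

A reader would compare tar_checksum(block) with tar_octal() of the chksum
field, then use the parsed size to skip ahead in whole 512-byte blocks.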