Dpkg7Zip

Revision 5 as of 2005-12-01 00:45:31

Summary

Evaluate 7zip compression for use in debs, as an alternative to gzip or bzip2.

Rationale

7zip is a newer compression format that boasts smaller file sizes than the existing gzip and bzip2 schemes. If it can be used to reduce the size of packages, it frees up space on the CD for more packages.

Use cases

  • Colin is an Ubuntu Developer who constructs the CD images. He notices that the amd64 CD is too large; instead of removing another language pack, he'd like to increase the compression of existing packages to fit them.
  • Matthias is an Ubuntu user who only has dial-up internet access. Smaller packages for security updates would let him get them more quickly.

Design

First, an evaluation of the compression needs to be performed. A selection of different packages should be collated and recompressed with 7zip; if a significant size benefit is gained without incurring a significant time or other cost for creation or unpacking, then we could consider compressing packages of that type with 7zip instead.

Particular types:

  • Small binary packages such as dpkg, coreutils, etc.
  • Large binary packages such as firefox and openoffice.
  • Documentation packages.
  • Language packs.

Smurf has performed an entirely unscientific test by recompressing /var/cache/apt/archives/[a-g]* on his laptop and achieved the following results:

 size   directory  decompress in
58176   repo.7z       19 sec
68032   repo.bz       25 sec
75872   repo.gz        4 sec

Further test

Matt has run the following test scenarios on sample data: the Debian package pool from his Ubuntu 5.10 ("Breezy Badger", i386 release) installation CD.

  1. Unpack all deb packages on the CD (a mix of gzip and bzip2 formats) and repackage them using only tar, with no compression. That is 1488 files adding up to 1.5GB uncompressed.
  2. Repackage these tar-only deb packages using entirely:
       a. maximum gzip compression "-9"
       b. maximum bzip2 compression "-9"
       c. 7zip compression
            i. with default compression "-mx5"
            ii. with high compression "-mx7"
            iii. with maximum compression "-mx9"
  3. Decompress each of these compressed archives.
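Steps 2 and 3 can be approximated with Python's standard library. This is only a sketch under assumptions: Python's lzma module uses the same LZMA algorithm family as 7zip but writes .xz streams rather than 7z archives, and treating its presets 5, 7 and 9 as stand-ins for "-mx5", "-mx7" and "-mx9" is an assumption, not an exact equivalence. The function name benchmark and its parameters are illustrative.

```python
import bz2
import gzip
import lzma
import time


def benchmark(tar_bytes, lzma_presets=(5, 7, 9)):
    """Compress and decompress tar_bytes with each scheme, timing both steps.

    Returns {name: (compressed_size, compress_seconds, decompress_seconds)}.
    """
    # Each entry pairs a compress callable with its matching decompress callable.
    schemes = {
        "gzip -9": (lambda d: gzip.compress(d, compresslevel=9), gzip.decompress),
        "bzip2 -9": (lambda d: bz2.compress(d, compresslevel=9), bz2.decompress),
    }
    for preset in lzma_presets:
        # Bind preset as a default argument so each lambda keeps its own value.
        schemes[f"lzma preset {preset}"] = (
            lambda d, p=preset: lzma.compress(d, preset=p),
            lzma.decompress,
        )

    results = {}
    for name, (compress, decompress) in schemes.items():
        start = time.perf_counter()
        packed = compress(tar_bytes)
        middle = time.perf_counter()
        unpacked = decompress(packed)
        end = time.perf_counter()
        if unpacked != tar_bytes:
            raise ValueError(f"{name} did not round-trip")
        results[name] = (len(packed), middle - start, end - middle)
    return results
```

Calling benchmark() on the bytes of an uncompressed data.tar gives one row per scheme, comparable in shape (though not in absolute numbers) to the results reported for the CD pool.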

The Results

  • Original package size is 579.9MB.

  • Tar only is 1612.5MB.

  • Gzip is 596.9MB and took 8:34 to compress and 02:25 to decompress.

  • Bzip2 is 539MB and took 16:19 to compress and 02:11 to decompress.

  • 7zip (default) is 445.6MB and took 39:12 to compress and 06:35 to decompress.

  • 7zip (high) is 424.3MB and took 1:09:47 to compress and 06:29 to decompress.

  • 7zip (max) is 423.2MB and took 1:12:39 to compress and 06:25 to decompress.
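The space savings implied by these figures can be worked out directly. The sketch below uses only the sizes reported above and takes the original package size as the baseline; the names SIZES_MB and saving are illustrative.

```python
# Sizes in MB, as reported in the results list.
ORIGINAL_MB = 579.9
SIZES_MB = {
    "gzip": 596.9,
    "bzip2": 539.0,
    "7zip default": 445.6,
    "7zip high": 424.3,
    "7zip max": 423.2,
}


def saving(size_mb, baseline_mb=ORIGINAL_MB):
    """Return (MB saved, fraction saved) relative to the baseline size."""
    saved = baseline_mb - size_mb
    return saved, saved / baseline_mb


for name, size in SIZES_MB.items():
    saved, fraction = saving(size)
    print(f"{name:13s} {saved:7.1f} MB saved ({fraction:6.1%})")
```

Against the original 579.9MB, 7zip at its default level saves about 134MB (roughly 23%) and at maximum about 157MB (roughly 27%), while all-bzip2 saves about 41MB and all-gzip is about 17MB larger.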

Summary

7zip offers a 23%-27% saving in storage space over the original packages, but at its default level it is roughly 3 times slower to compress and decompress. At the default level the saving is 134MB, which would allow a massive 134MB of additional packages to be added to the standard CD. Smaller packages would also improve the download speed of individual packages. A threefold increase in compression time is considerable but is still pretty fast for individual packages on my machine. Decompression time concerns me more, but I believe the additional 4 minutes may not be very noticeable to end users considering the time required for the rest of the install process.
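The slowdown can be checked against the reported timings. This sketch assumes the first time given for each scheme is the compression time and the second the decompression time; the names TIMES, to_seconds and slowdown are illustrative.

```python
def to_seconds(clock):
    """Convert 'M:SS' or 'H:MM:SS' into a number of seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# (compress, decompress) times as reported in the results list.
TIMES = {
    "gzip": ("8:34", "02:25"),
    "bzip2": ("16:19", "02:11"),
    "7zip default": ("39:12", "06:35"),
    "7zip high": ("1:09:47", "06:29"),
    "7zip max": ("1:12:39", "06:25"),
}


def slowdown(name, baseline):
    """Return (compress ratio, decompress ratio) of name relative to baseline."""
    compress, decompress = (to_seconds(t) for t in TIMES[name])
    base_compress, base_decompress = (to_seconds(t) for t in TIMES[baseline])
    return compress / base_compress, decompress / base_decompress


for baseline in ("gzip", "bzip2"):
    c, d = slowdown("7zip default", baseline)
    print(f"7zip default vs {baseline}: {c:.1f}x compress, {d:.1f}x decompress")
```

At the default level 7zip comes out at about 4.6 times gzip and 2.4 times bzip2 for compression, and about 2.7 times gzip and 3.0 times bzip2 for decompression. The extra decompression time against gzip is 250 seconds, a little over 4 minutes.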

It is also worth mentioning that packages can currently be in either gzip or bzip2 format. Most packages in the sample data were in gzip format; if all developers were to use bzip2, it would free up 40MB more space on the CD.

Observations

  • As with bzip2, it would not be necessary for all packages to use this compression format.
  • It may be reasonable to recompress packages automatically only for CD release builds (to squeeze more onto a single CD).
  • A 25% increase in the number of packages would be good for marketing.
  • 7zip is a newer format, and the Linux version is described as a quick port, so it may be possible to further optimise the compression times. It also has many command-line optimization options.
  • 7zip uses a plugin-based architecture, which may improve compression in the future.
  • Some of 7zip's compression gains may be specific to x86 32-bit packages. (This needs testing.)

Notes

Test host: Athlon XP 2700 / 1024Mb 2700mz RAM with a 60GB 7200rpm hard disk. These factors will affect the test results, which are meant as a practical example of what is currently possible and not as a benchmark.

I used the standard available packages for Ubuntu Breezy: p7zip 4.20, gzip 1.3.5, bzip2 1.0.2.

Implementation

Code

The inclusion of bzip2 support into dpkg introduced a generic compression layer, in lib/compression.c. 7zip support can be added in a similar way:

  • Add the location of the 7zip support as a ZIP7 macro in lib/dpkg.h.

  • Add ZIP7 to the compression_type enum in lib/dpkg.h

  • Define data member name macros in dpkg-deb/dpkg-deb.h.

  • Handle the new data member in dpkg-deb/build.c and dpkg-deb/extract.c.

  • Add both "by exec" and "by library" support to lib/compression.c

  • Add format selection options in dpkg-deb/main.c
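The shape of such a generic compression layer can be illustrated outside dpkg. The sketch below is Python, not dpkg's C, and is not the dpkg implementation: the member name "data.tar.7z" is hypothetical (this specification does not name the new data member), and Python's lzma module stands in for real 7zip support.

```python
import bz2
import enum
import gzip
import lzma


class CompressionType(enum.Enum):
    GZIP = "gzip"
    BZIP2 = "bzip2"
    ZIP7 = "7zip"  # mirrors the ZIP7 enum entry proposed above


# Data member name for each type; "data.tar.7z" is a hypothetical name.
DATA_MEMBER = {
    CompressionType.GZIP: "data.tar.gz",
    CompressionType.BZIP2: "data.tar.bz2",
    CompressionType.ZIP7: "data.tar.7z",
}

# (compress, decompress) pair per type; lzma is a stand-in for 7zip here.
_CODECS = {
    CompressionType.GZIP: (gzip.compress, gzip.decompress),
    CompressionType.BZIP2: (bz2.compress, bz2.decompress),
    CompressionType.ZIP7: (lzma.compress, lzma.decompress),
}


def compress_data(tar_bytes, ctype):
    """Build side: return (member name, compressed payload) for the chosen type."""
    compress, _ = _CODECS[ctype]
    return DATA_MEMBER[ctype], compress(tar_bytes)


def decompress_data(member_name, payload):
    """Extract side: pick the decompressor from the data member's name."""
    for ctype, name in DATA_MEMBER.items():
        if name == member_name:
            _, decompress = _CODECS[ctype]
            return decompress(payload)
    raise ValueError(f"unknown data member: {member_name}")
```

The point of the design is that the build and extract paths only deal with a compression type and a member name, so adding a format means adding one enum entry, one member name and one codec pair.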

Data preservation and migration

Packages that would benefit from the conversion would select it when building in their debian/rules as we did for the bzip2 change; they would also Pre-Depend on the appropriate version of dpkg.

Outstanding issues

Reproduction of the test results, possibly on packages for another architecture. Feedback from developers about the practicality of the slow compression time. Feedback about the practicality of the slow decompression times.

Is it possible to recompress all the packages for the final release?