Dpkg7Zip

Summary

Evaluate 7zip compression for use in debs, as an alternative to gzip or bzip2.

Rationale

7zip is a newer compression format (using the LZMA algorithm by default) that boasts smaller file sizes than the existing gzip and bzip2 schemes. If this can be used to reduce the size of packages, it frees up space on the CD for more packages.

Use cases

  • Colin is an Ubuntu Developer who constructs the CD images. He notices that the amd64 CD is too large; instead of removing another language pack, he'd like to increase the compression of existing packages to fit them.
  • Matthias is an Ubuntu user who only has dial-up internet access; smaller packages for security updates would let him get them more quickly.

Design

First, an evaluation of the compression needs to be performed. A selection of different packages should be collated and recompressed with 7zip; if a significant size benefit is gained without incurring a significant time or cost penalty for creation or unpacking, then we could consider compressing packages of that type with 7zip instead.

Particular types:

  • Small binary packages such as dpkg, coreutils, etc.
  • Large binary packages such as firefox and openoffice.
  • Documentation packages.
  • Language packs.

Smurf has performed an entirely unscientific test by recompressing /var/cache/apt/archives/[a-g]* on his laptop and achieved the following results:

 size   directory  decompress in
58176   repo.7z       19 sec
68032   repo.bz       25 sec
75872   repo.gz        4 sec

Test Case

Matt has run the following test scenarios on sample data: the Debian package pool from his Ubuntu 5.10 ("Breezy Badger", i386 release) installation CD.

  1. Unpack all deb packages on the CD (a mix of gzip and bzip2 formats) and repackage them using only tar format with no compression. That is 1488 files adding up to 1.5Gb uncompressed.
  2. Repackage these tar-only deb packages using only:
    • a.) Maximum gzip compression
    • b.) Maximum bzip2 compression
    • c.) 7zip with default compression
    • d.) 7zip with high level compression
    • e.) 7zip with maximum compression
  3. Decompress each of these compressed archives.
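
The procedure above can be approximated with a small timing harness. This is only a sketch under stated assumptions: it uses Python's standard-library lzma module (which writes the .xz container, not .7z) as a stand-in for 7zip, the preset numbers are illustrative choices rather than the settings Matt used, and the sample data is synthetic.

```python
import bz2
import gzip
import lzma
import time

# Label -> (compress, decompress). The lzma presets are assumptions chosen
# for illustration; they are not the 7zip switches used in the test above.
CODECS = {
    "gzip (max)": (lambda d: gzip.compress(d, compresslevel=9), gzip.decompress),
    "bzip2 (max)": (lambda d: bz2.compress(d, compresslevel=9), bz2.decompress),
    "lzma (preset 6)": (lambda d: lzma.compress(d, preset=6), lzma.decompress),
    "lzma (preset 7)": (lambda d: lzma.compress(d, preset=7), lzma.decompress),
}


def benchmark(data):
    """Return {label: (compressed_size, compress_seconds, decompress_seconds)}."""
    results = {}
    for label, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        middle = time.perf_counter()
        unpacked = decompress(packed)
        end = time.perf_counter()
        if unpacked != data:
            raise ValueError(f"{label}: round trip failed")
        results[label] = (len(packed), middle - start, end - middle)
    return results


if __name__ == "__main__":
    # Synthetic stand-in for an uncompressed data.tar; use real package
    # contents for meaningful numbers.
    sample = b"".join(b"line %d of a sample changelog entry\n" % i for i in range(20000))
    for label, (size, c_time, d_time) in benchmark(sample).items():
        print(f"{label:16} {size:8d} bytes  compress {c_time:.3f}s  decompress {d_time:.3f}s")
```

Timings from such a harness depend heavily on the machine and the input, so they are only comparable within a single run.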

The Results

  • Original package size is 579.9Mb

  • Tar only is 1612.5Mb

  • Gzip is 596.9Mb and took 8:34 to compress and 02:25 to decompress.

  • Bzip2 is 539Mb and took 16:19 to compress and 02:11 to decompress.

  • 7zip(default) is 445.6Mb and took 39:12 to compress and 06:35 to decompress.

  • 7zip(high) is 424.3Mb and took 1:09:47 to compress and 06:29 to decompress.

  • 7zip(max) is 423.2Mb and took 1:12:39 to compress and 06:25 to decompress.

Summary

7zip offers a 23%-27% saving in storage space but is roughly 3 times slower to compress and decompress. A 23% saving would allow a massive 134Mb of additional packages to be added to the standard CD. This would improve package availability and the download speed of individual packages. A 3-fold increase in compression time is considerable but is still pretty fast for individual packages on my machine. Decompression time concerns me more, but I believe this additional 4 minutes may not be so noticeable to end users considering the time required for the rest of the install process.
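
The savings can be rechecked directly from the sizes listed in the results. The following is a small sketch; the function name is made up for illustration, and the sizes are copied from the list above.

```python
# Sizes in Mb, copied from the results above.
ORIGINAL = 579.9
SIZES = {
    "bzip2": 539.0,
    "7zip(default)": 445.6,
    "7zip(high)": 424.3,
    "7zip(max)": 423.2,
}


def saving(size, baseline=ORIGINAL):
    """Return (Mb saved, percent saved) relative to the baseline size."""
    saved = baseline - size
    return saved, 100.0 * saved / baseline


for name, size in SIZES.items():
    mb, pct = saving(size)
    print(f"{name:14} saves {mb:6.1f} Mb ({pct:4.1f}%)")
```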

It is also worth mentioning that at present packages can be in either Gzip or Bzip2 format. Most packages in the sample data were in Gzip format; if all developers were to use Bzip2 format it would free up 40Mb more space on the CD.

Notes

I used a test host with an Athlon XP 2700, 1Gb (2700) RAM, and a 60Gb 7200rpm disk. This specification will affect the results, which are intended as practical guidelines, not theoretical benchmarks.

I'm running Ubuntu Breezy with standard packages, P7zip 4.20, Bzip 1.02, Gzip 1.3.5.

Implementation

Code

The inclusion of bzip2 support into dpkg introduced a generic compression layer, in lib/compression.c. 7zip support can be added in a similar way:

  • Add the location of the 7zip support as a ZIP7 macro in lib/dpkg.h.

  • Add ZIP7 to the compression_type enum in lib/dpkg.h

  • Define data member name macros in dpkg-deb/dpkg-deb.h.

  • Handle the new data member in dpkg-deb/build.c and dpkg-deb/extract.c.

  • Add both "by exec" and "by library" support to lib/compression.c

  • Add format selection options in dpkg-deb/main.c
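
As a rough illustration of what a generic compression layer looks like, here is a sketch in Python rather than dpkg's C. It is not dpkg's code: the enum, the function names, and the "data.tar.lzma" member name for the new type are all assumptions made for this example, and it covers only the "by library" path, using the standard-library lzma module in place of 7zip.

```python
import bz2
import enum
import gzip
import lzma


class CompressionType(enum.Enum):
    """Illustrative counterpart of a compression_type enum (names assumed)."""
    GZIP = "gz"
    BZIP2 = "bz2"
    ZIP7 = "lzma"  # new entry; the member suffix is an assumption


# One (compress, decompress) pair per type: the "by library" path only.
_CODECS = {
    CompressionType.GZIP: (gzip.compress, gzip.decompress),
    CompressionType.BZIP2: (bz2.compress, bz2.decompress),
    CompressionType.ZIP7: (lzma.compress, lzma.decompress),
}


def data_member_name(ctype):
    """Name of the archive member that would hold the compressed data tarball."""
    return "data.tar." + ctype.value


def type_from_member(name):
    """Map a data member name back to its compression type."""
    for ctype in CompressionType:
        if name == data_member_name(ctype):
            return ctype
    raise ValueError(f"unknown data member: {name}")


def compress_data(ctype, data):
    return _CODECS[ctype][0](data)


def decompress_data(ctype, data):
    return _CODECS[ctype][1](data)
```

With this shape, adding a format means adding one enum entry and one codec pair; the build and extract paths only ever call compress_data and decompress_data.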

Data preservation and migration

Packages that would benefit from the conversion would select it when building in their debian/rules as we did for the bzip2 change; they would also Pre-Depend on the appropriate version of dpkg.

Outstanding issues

Reproduction of test results, possibly on packages for another architecture.

Feedback from developers about the practicability of the slow compression time. Feedback about the practicability of the slow decompression times.

Is it possible to re-compress all the packages for the final release?

Comments

  • PhillipSusi: I think that it is important to note that 7zip benefits significantly from large data sets, and as such, it is a good idea to compress groups of packages as one unit, rather than each package individually. To quote myself from the mailing list:

Compression algorithms generally perform better the more data you give them, especially 7-zip.  Rather than compress each language pack individually, I decided to try compressing them all as one unit.  I extracted all of the language packs, then tarred up and compressed the resulting directory tree:

.tar:     184,934,400
.tar.gz:  61,478,589
.tar.bz2: 49,982,949
.tar.7z:  23,081,869


As you can see, 7-zip's compression REALLY improved with the combined data set giving a space savings of 54% over the original .debs, and 43% over individually 7-zipping each package.  Are all of these language packs on the setup/live cds?  If so then compressing them this way would free up 27 MB of space.  I wonder what other packages this could be applied to? 

I think it will require more work to compress multiple packages as one unit ( modifying apt instead of dpkg ), but when you are talking about a 54% space savings instead of ~25%, I think the work is well worth it. I wonder if Matt could do that test again with all the packages on the CD, but instead of 7zipping each tar individually, add them all to one big solid 7z archive? I believe the results would be rather impressive.
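
The effect described in this comment can be reproduced on synthetic data. The sketch below assumes Python's standard-library lzma module as a stand-in for 7-zip, and its "packs" are made-up byte strings that share most of their content; it only shows the direction of the effect, not the sizes quoted above.

```python
import lzma
import random


def make_packs(count=8, shared_lines=1000, unique_lines=50, seed=0):
    """Build synthetic 'language packs' that share most of their content."""
    rng = random.Random(seed)
    words = ["file", "open", "save", "close", "edit", "view", "help", "print"]

    def line():
        return " ".join(rng.choice(words) for _ in range(6))

    shared = "\n".join(line() for _ in range(shared_lines))
    packs = []
    for n in range(count):
        unique = "\n".join(line() for _ in range(unique_lines))
        packs.append(f"Language: {n}\n{shared}\n{unique}\n".encode("utf-8"))
    return packs


def individual_size(packs):
    """Total size when every pack is compressed on its own."""
    return sum(len(lzma.compress(p)) for p in packs)


def solid_size(packs):
    """Size when all packs are compressed together as one stream."""
    return len(lzma.compress(b"".join(packs)))


if __name__ == "__main__":
    packs = make_packs()
    print("individual:", individual_size(packs), "bytes")
    print("solid:     ", solid_size(packs), "bytes")
```

The solid stream comes out smaller because the compressor can refer back to content it already saw in an earlier pack, which is impossible when each pack is compressed separately.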

  • MattZimmerman: this would require that users download all language packs even if they only want one, which would more than offset the space savings of compressing them all together

    • soc: I agree with it. Maybe we could leave the "server"-packages out of this big 7z archive so we won't discriminate against users with older hardware?
  • soc: In my opinion we should use 7z only for packages (openoffice, kde ...) which it wouldn't make sense to use on very old computers. This way we could avoid discriminating against users with old hardware, because we would require more performance only from those users who actually have it. <soc äT krg-nw d0T de>

  • shish: is decompression time really significant compared to everything else in the install process? When I'm installing a single package it's only a couple of seconds, when I'm installing multiple packages I go off and get something to eat; it's going to take a while no matter what compression you use (based on my experiences with installing from scratch on a few 200MHz boxes)
  • maxim: I think that if we are adding support for this we should add support for an rsyncable format too. It will need changes to the same files (archive methods of dpkg). Adding rsyncable support will be helpful for delta updates using bsdiff and zsync. But lzma (which 7z uses) is harder to make rsyncable (it could lower the compression ratio), so additional testing is needed.

Another possibility is that we could use the different 7z settings (especially word-size!) as well. An interesting approach would be a developer/packager/community-driven database where people could rate packages like this:

  • 1 => package is only used on high-end systems => 7z(max)

  • 2 => package is used on common systems => 7z(default/high)

  • 3 => package is essential on old systems => gz/bz2(max)

I'm sure that when we start to use 7z the 7zip developers will take care of our needs soon.

  • The reduction in speed comes from using 7zip in the first place and not its varying levels of compression. The only difference between high and low compression is the time it takes to CREATE the package. If you notice from the above tests, maximum 7zip compression actually took the shortest time to decompress out of all three 7zips used. Therefore using weaker compression for common systems and maximum for high end systems is rather useless. If 7zip is used at all then it may as well be used at maximum every time (from a user's viewpoint, that is). Individual developers may want to cut back on compression time, but perhaps some developers with powerful, yet mostly idle, computers could offer to compress packages with 7zip which are sent to them (gzipped to reduce download time of course)?
  • jacek2v: check the speed mode in 7zip compression; it is faster than bzip and has a better ratio than bzip.

My simple test (tar file 123699200 Bytes (openoffice bin 1.1.5) -> (123699200/1024)/1024 = 117.96875 ~ 118MB):

method              compression  decompression  size (bytes)
gzip(default)          12.324s       2.090s       42140523
bzip(default)          40.870s      11.123s       39331698
7zip(default)        1m45.295s       5.478s       28887391
7zip(speed:-mx1)       21.428s       6.186s       34746350

Machine: AMD 64 3200+; 3 disks 80G softRAID5; 2GB RAM

  • Leandro: Do you mean Mb or MiB?
  • The .7z archive format & the 7-zip archiver support several different compression algorithms, but they are not a compression algorithm themselves. Currently the default compression algorithm used is LZMA, but zip deflate (aka gzip or RFC-1951), bzip2, PPMD and some others are supported too (PPMD should be better than LZMA for text files).

Many aspects of this spec are superseded by dpkg-lzma


CategorySpec

Dpkg7Zip (last edited 2008-08-06 17:00:44 by localhost)