LangpackCompression

paul@katu:~/ubuntu/fix/bzip2-compression/original/tmp$ ls -l data*bz2 ../language-pack-gnome-en-base_20050826_all.deb | sort -k 5
-rw-r--r--  1 paul paul 1216424 2005-09-16 14:10 data-repacked-by-alpha-r.tar.bz2
-rw-r--r--  1 paul paul 1216878 2005-09-16 13:50 data-repacked-by-percentage.tar.bz2
-rw-r--r--  1 paul paul 1228604 2005-09-16 14:09 data-repacked-by-alpha.tar.bz2
-rw-r--r--  1 paul paul 1237817 2005-09-16 14:06 data-repacked-by-percentage-r.tar.bz2
-rw-r--r--  1 paul paul 1502554 2005-09-16 13:39 data.tar.bz2
-rw-r--r--  1 paul paul 1508720 2005-08-26 17:40 ../language-pack-gnome-en-base_20050826_all.deb
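
The sizes above are easier to compare as relative savings. A minimal sketch of the arithmetic follows; `saving_pct` is a hypothetical helper name, not something used in the original session.

```shell
# saving_pct OLD NEW: percentage by which NEW is smaller than OLD.
# (Hypothetical helper, introduced only to illustrate the arithmetic.)
saving_pct() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (old - new) * 100 / old }'
}

# Plain data.tar.bz2 versus the smallest reordered archive in the listing.
saving_pct 1502554 1216424
```

The same helper applies to any other pair of sizes in these notes.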

mkdir -p original
cp -a /var/cache/apt/archives/language-pack-* original/
mv original/ bzip2-compression/
cd bzip2-compression/original/
mkdir -p tmp
cd tmp

ar x ../language-pack-gnome-en-base_20050826_all.deb data.tar.bz2
mkdir -p data ; cd data
tar jxvf ../data.tar.bz2
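# bzip2 -v writes per-file statistics to stderr; keep the report, discard the compressed data
find usr/ -type f -print0 | xargs -0 bzip2 -cv > /dev/null 2>repack-results
# Repack with the files ordered by bits/byte (field 3); -T avoids xargs splitting the list across several tar runs
sort -n -k 3 repack-results | awk '{print $1}' | sed -e 's/:$//' | tar jcvf ../data-repacked-by-percentage.tar.bz2 -T -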
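
To make the ordering step concrete, here is a small sketch that runs the same sort on two made-up report lines. It assumes the usual `bzip2 -v` line layout (name, ratio, bits/byte, percentage saved, sizes); the function name and the sample lines are illustrative, not taken from the real run.

```shell
# order_by_bits_per_byte: read `bzip2 -v` style report lines on stdin and
# print the file names, fewest bits/byte (most compressible) first.
# Hypothetical helper; assumes file names contain no whitespace.
order_by_bits_per_byte() {
    LC_ALL=C sort -k3,3n | awk '{ print $1 }' | sed -e 's/:$//'
}

# Illustrative input only; these are not real measurements.
printf '%s\n' \
    '  usr/share/a.mo:  2.000:1,  4.000 bits/byte, 50.00% saved, 1000 in, 500 out.' \
    '  usr/share/b.mo:  4.000:1,  2.000 bits/byte, 75.00% saved, 1000 in, 250 out.' \
    | order_by_bits_per_byte
```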

# Sort on everything from the fifth path component onwards, forwards and then reversed
find usr/ -type f | sort -t '/' -k 5 > last-part-alpha
tar jcvf ../data-repacked-by-alpha.tar.bz2 -T last-part-alpha
find usr/ -type f | sort -r -t '/' -k 5 > last-part-alpha-r
tar jcvf ../data-repacked-by-alpha-r.tar.bz2 -T last-part-alpha-r

Top 20 packages on the install CD come to 200MB:

$ find pool/ -type f -ls | awk '{print $7, $11}' | sort -n > cd-contents
$ tail -20 cd-contents | awk '{x = x + $1; print } END {print x, "Total"}'
4807608 pool/main/r/rss-glx/rss-glx_0.7.5-4ubuntu1_i386.deb
4886658 pool/main/g/glibc/libc6_2.3.5-1ubuntu11_i386.deb
5094342 pool/main/g/gcc-4.0/libgcj6_4.0.1-4ubuntu6_i386.deb
5167552 pool/restricted/l/linux-restricted-modules-2.6.12/linux-restricted-modules-2.6.12-8-386_2.6.12.2-12_i386.deb
5381934 pool/main/t/ttf-kochi/ttf-kochi-mincho_1.0.20030809-3_all.deb
5602128 pool/main/t/ttf-arphic-bsmi00lp/ttf-arphic-bsmi00lp_2.10-6_all.deb
5646236 pool/main/x/xfonts-core/xfonts-base_6.8.2.1-3_all.deb
5918140 pool/main/l/linux-source-2.6.12/linux-headers-2.6.12-8_2.6.12-8.12_i386.deb
5984092 pool/main/t/ttf-arphic-bkai00mp/ttf-arphic-bkai00mp_2.10-6_all.deb
6522278 pool/main/h/hplip/hplip-data_0.9.4-3ubuntu1_all.deb
7280072 pool/main/f/foomatic-filters-ppds/foomatic-filters-ppds_20050720-1ubuntu1_all.deb
8440374 pool/main/f/firefox/firefox_1.0.6-1ubuntu10_i386.deb
8962800 pool/main/c/cupsys/cupsys_1.1.23-10ubuntu3_i386.deb
10334574 pool/main/m/mozilla-thunderbird/mozilla-thunderbird_1.0.6-0ubuntu6_i386.deb
10981938 pool/main/e/emacs21/emacs21-common_21.4a-1ubuntu1_all.deb
11504944 pool/main/t/ttf-baekmuk/ttf-baekmuk_2.2-1ubuntu1_all.deb
12661874 pool/main/m/mesa/libgl1-mesa-dri_6.3.2-0ubuntu5_i386.deb
17786728 pool/main/l/linux-source-2.6.12/linux-image-2.6.12-8-386_2.6.12-8.12_i386.deb
25643634 pool/main/o/openoffice.org2/openoffice.org2-common_1.9.125+2.0beta2-1ubuntu1_all.deb
31338372 pool/main/o/openoffice.org2/openoffice.org2-core_1.9.125+2.0beta2-1ubuntu1_i386.deb
199946278 Total

OOo2, Linux (+binary modules, +headers), Mesa, Emacs, 4x intriguing TrueType fonts, core X fonts, Thunderbird, Firefox, CUPS (+Foomatic PPDs), GCJ/Java and lots of OpenGL screensavers (rss-glx).

At a guess, Foomatic and linux-headers would reduce fairly well, although it might be simpler to move everything not in the default install seed to bzip2 compression (since install speed is less important there).

Foomatic

Foomatic contains PPDs (PostScript Printer Description files) for 2599 individual printers. These are all incredibly similar, but each file is individually gzipped.

$ find -name '*.gz' | wc -l
2599
$ find -name '*.gz' | xargs du -c -h | tail -1
1.2M    total

Yet these files take up 13MB on disk because each one occupies at least one whole 4kB filesystem block. After unpacking each PPD and then tarring the whole lot up again with gzip or bzip2, the sizes are:

-rw-r--r--  1 paul paul  929986 2005-09-16 17:16 ../data-ppd-no-gz.tar.bz2
-rw-r--r--  1 paul paul 2976299 2005-09-16 17:15 ../data-ppd-no-gz.tar.gz
-rw-r--r--  1 paul paul 7210215 2005-09-16 17:03 ../data.tar.gz
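
The on-disk figure follows from block rounding: 2599 files at one 4096-byte block each already come to 10645504 bytes, before counting any file that spills into a second block. A rough sketch of that calculation, assuming a 4kB block size and no tail packing (`blocks_total` is a hypothetical helper):

```shell
# blocks_total BLOCKSIZE: read one file size in bytes per line on stdin and
# print "<apparent bytes> <allocated bytes>", rounding each file up to a
# whole number of blocks. Hypothetical helper; BLOCKSIZE is an assumption
# about the filesystem, not something measured here.
blocks_total() {
    awk -v bs="$1" '
        { apparent += $1; allocated += int(($1 + bs - 1) / bs) * bs }
        END { printf "%d %d\n", apparent, allocated }'
}

# Three illustrative sizes: well under, exactly, and just over one block.
printf '100\n4096\n4097\n' | blocks_total 4096
```

With GNU find, real sizes could be fed in with something like `find . -name '*.gz' -printf '%s\n' | blocks_total 4096`.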

I think these PPDs should really be stored on disk as a tarball or .zip; the inode/block wastage is huge. ZIP has the advantage that it is directly addressable, but lacks cross-file compression savings. However, if the PPDs are simply left uncompressed, the raw on-disk usage goes up from 13MB to 120MB.

$ time find -type f -name \*.ppd | xargs gzip
real    0m2.912s  user    0m2.543s
$ time find -type f -name \*.gz | xargs gunzip
real    0m1.680s  user    0m1.192s

Even the bzip2 archive is amazingly fast to decompress, because of the huge compression ratio:

$ time tar tjf ../data-ppd-no-gz.tar.bz2 | grep HP | head -1
usr/share/ppd/HP/
real    0m1.446s  user    0m0.042s

Solution: store the PPDs as a single .tar.bz2 and teach CUPS how to fetch PPDs from it. Preferably abstract the fetching behind a script, so that PPDs can also be transparently downloaded from the web or from the printer itself. It would be possible to remain 100% compatible by using a post-install script to unpack the .tar.bz2 and recompress the PPDs as individual files.

Saving: 7.2MB -> 0.9MB on CD and 13MB -> 0.9MB on disk. 87% saving on CD.
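
As a rough idea of what the fetching script could look like, the sketch below pulls a single PPD out of a combined tarball and writes it to stdout. The function name, archive name and member path are all hypothetical; nothing here is an existing CUPS interface. It relies on tar detecting the compression itself when reading, as GNU tar does.

```shell
# fetch_ppd ARCHIVE MEMBER: write one member of the combined PPD tarball
# to stdout. Hypothetical helper, not an existing CUPS hook.
# -O sends the extracted file to stdout instead of writing it to disk.
fetch_ppd() {
    tar -xOf "$1" "$2"
}

# Example call (both paths are made up for illustration):
#   fetch_ppd /usr/share/ppd/all-ppds.tar.bz2 usr/share/ppd/HP/some-printer.ppd
```

tar has to decompress the stream sequentially until it reaches the requested member, so the cost per lookup should be in the same region as the listing timed above.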

Linux Headers

$ find usr/ -type f -name '*.h' | xargs du -sch | tail -1
9.9M    Header Files
$ find usr/ -type f \! -name '*.h' | xargs du -sch | tail -1
6.4M    Not Header Files
$ find usr/ -type f -name 'Makefile' | xargs du -sch | tail -1
3.3M    Of Which, Makefiles
$ find usr/ -type f -name 'Kconfig*' | xargs du -sch | tail -1
2.5M    Of Which Kconfigs
$ find usr/ -type f \! -name 'Kconfig*' \! -name Makefile \! -name '*.h' | xargs du -sch | tail -1
684K    Of Which Other/Unknown

9.9MB of headers and 6.4MB of other files, of which 5.8MB are Makefiles and Kconfig files. Of the 684kB of unknowns, 176kB are ARM assembly (.S) files for every particular variation of embedded board/device, and 160kB are files such as MAINTAINERS.gz and README.gz that have already been compressed; these unpack to 440kB on disk and compress back to 118kB using bzip2.

3.3MB of the total is under include/asm-arm/, so this probably warrants special attention.

So far, the best I have is:

  • Unpacking all *.gz files (280kB on-disk increase, which could be reclaimed by the postinst) and then repacking with the files sorted by filename:

find usr/ -type f -name \*.gz | xargs gunzip
find usr/ -type f | LANG=C sed -e 's-^.*/\([^/][^/]*\)-\1 &-' | sort | awk '{print $2}' | tar jcf data-by-filename.tar.bz2 -T -

-rw-r--r--  1 paul paul 4555509 2005-09-16 19:21 data-by-filename.tar.bz2
-rw-r--r--  1 paul paul 4823926 2005-09-16 17:45 data.tar.bzip2
-rw-r--r--  1 paul paul 5916271 2005-09-16 17:43 data.tar.gz

Saving: 1.3MB on CD (18% with bzip2, 23% with bzip2 + fiddling), 280kB extra on disk.
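
The "fiddling" is the sort by filename, which puts files with the same name (every Makefile, every Kconfig) next to each other in the tarball regardless of directory. A small sketch of that ordering on made-up paths follows; it uses awk and cut rather than the sed expression above, so unlike the original it also copes with spaces in names.

```shell
# sort_by_basename: read paths on stdin and print them ordered by their
# last path component, then by full path. Hypothetical helper; assumes
# no tabs or newlines in file names.
sort_by_basename() {
    awk -F/ '{ print $NF "\t" $0 }' | LC_ALL=C sort | cut -f2-
}

# Illustrative paths only.
printf '%s\n' \
    usr/src/b/zeta.h \
    usr/src/a/Makefile \
    usr/src/b/Makefile \
    usr/src/a/alpha.h \
    | sort_by_basename
```

The output could be passed to `tar jcf data-by-filename.tar.bz2 -T -` in the same way as the original pipeline.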

PaulSladen/LangpackCompression (last edited 2008-08-06 16:15:19 by localhost)