LangpackCompression
paul@katu:~/ubuntu/fix/bzip2-compression/original/tmp$ ls -l data*bz2 ../language-pack-gnome-en-base_20050826_all.deb | sort -k 5
-rw-r--r-- 1 paul paul 1216424 2005-09-16 14:10 data-repacked-by-alpha-r.tar.bz2
-rw-r--r-- 1 paul paul 1216878 2005-09-16 13:50 data-repacked-by-percentage.tar.bz2
-rw-r--r-- 1 paul paul 1228604 2005-09-16 14:09 data-repacked-by-alpha.tar.bz2
-rw-r--r-- 1 paul paul 1237817 2005-09-16 14:06 data-repacked-by-percentage-r.tar.bz2
-rw-r--r-- 1 paul paul 1502554 2005-09-16 13:39 data.tar.bz2
-rw-r--r-- 1 paul paul 1508720 2005-08-26 17:40 ../language-pack-gnome-en-base_20050826_all.deb
mkdir -p original
cp -a /var/cache/apt/archives/language-pack-* original/
mv original/ bzip2-compression/
cd bzip2-compression/original/
mkdir -p tmp
cd tmp
ar x ../language-pack-gnome-en-base_20050826_all.deb data.tar.bz2
mkdir -p data ; cd data
tar jxvf ../data.tar.bz2
find usr/ -type f -print0 | xargs -0 bzip2 -cv > /dev/null 2>repack-results
sort -n -k 3 repack-results | awk '{print $1}' | sed -e 's/:$//' | xargs tar jcvf ../data-repacked-by-percentage.tar.bz2
find usr/ -type f | sort -t '/' -k 5 > last-part-alpha
cat last-part-alpha | xargs tar jcvf ../data-repacked-by-alpha.tar.bz2
find usr/ -type f | sort -r -t '/' -k 5 > last-part-alpha-r
cat last-part-alpha-r | xargs tar jcvf ../data-repacked-by-alpha-r.tar.bz2
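The experiment above amounts to feeding the same file list to tar in different orders and comparing the compressed sizes. A minimal sketch of a helper for that comparison follows. The function name `packed_size` is made up for illustration, and it assumes GNU tar (for `-T -`) and file names without embedded newlines.

```shell
# Hypothetical helper (not part of the original notes).
# Reads file names on stdin, one per line, tars them in exactly that
# order, compresses the stream and prints the compressed size in bytes.
#   $1 = compressor command, defaults to bzip2
packed_size() {
    compressor="${1:-bzip2}"
    tar -cf - -T - | "$compressor" -9 -c | wc -c | tr -d ' '
}

# Possible usage, run from the unpacked data/ directory:
#   find usr/ -type f | sort -t '/' -k 5    | packed_size
#   find usr/ -type f | sort -r -t '/' -k 5 | packed_size
```

Nothing is written to disk, so many orderings can be tried quickly before committing to one.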
Top 20 packages on the install CD come to 200MB:
$ find pool/ -type f -ls | awk '{print $7, $11}' | sort -n > cd-contents
$ tail -20 cd-contents | awk '{x = x + $1; print } END {print x, "Total"}'
4807608 pool/main/r/rss-glx/rss-glx_0.7.5-4ubuntu1_i386.deb
4886658 pool/main/g/glibc/libc6_2.3.5-1ubuntu11_i386.deb
5094342 pool/main/g/gcc-4.0/libgcj6_4.0.1-4ubuntu6_i386.deb
5167552 pool/restricted/l/linux-restricted-modules-2.6.12/linux-restricted-modules-2.6.12-8-386_2.6.12.2-12_i386.deb
5381934 pool/main/t/ttf-kochi/ttf-kochi-mincho_1.0.20030809-3_all.deb
5602128 pool/main/t/ttf-arphic-bsmi00lp/ttf-arphic-bsmi00lp_2.10-6_all.deb
5646236 pool/main/x/xfonts-core/xfonts-base_6.8.2.1-3_all.deb
5918140 pool/main/l/linux-source-2.6.12/linux-headers-2.6.12-8_2.6.12-8.12_i386.deb
5984092 pool/main/t/ttf-arphic-bkai00mp/ttf-arphic-bkai00mp_2.10-6_all.deb
6522278 pool/main/h/hplip/hplip-data_0.9.4-3ubuntu1_all.deb
7280072 pool/main/f/foomatic-filters-ppds/foomatic-filters-ppds_20050720-1ubuntu1_all.deb
8440374 pool/main/f/firefox/firefox_1.0.6-1ubuntu10_i386.deb
8962800 pool/main/c/cupsys/cupsys_1.1.23-10ubuntu3_i386.deb
10334574 pool/main/m/mozilla-thunderbird/mozilla-thunderbird_1.0.6-0ubuntu6_i386.deb
10981938 pool/main/e/emacs21/emacs21-common_21.4a-1ubuntu1_all.deb
11504944 pool/main/t/ttf-baekmuk/ttf-baekmuk_2.2-1ubuntu1_all.deb
12661874 pool/main/m/mesa/libgl1-mesa-dri_6.3.2-0ubuntu5_i386.deb
17786728 pool/main/l/linux-source-2.6.12/linux-image-2.6.12-8-386_2.6.12-8.12_i386.deb
25643634 pool/main/o/openoffice.org2/openoffice.org2-common_1.9.125+2.0beta2-1ubuntu1_all.deb
31338372 pool/main/o/openoffice.org2/openoffice.org2-core_1.9.125+2.0beta2-1ubuntu1_i386.deb
199946278 Total
That is: OOo2, Linux (+binary modules, +headers), Mesa, Emacs, 4x intriguing TrueType fonts, core X fonts, Thunderbird, Firefox, CUPS (+Foomatic PPDs, +HPLIP data), glibc, GCJ/Java and lots of OpenGL screensavers (rss-glx).
At a guess, Foomatic and linux-headers would reduce fairly well, although it might be simpler to move everything not in the default install seed to bzip2 compression (since install speed is less important there).
Foomatic
Foomatic contains PPDs (PostScript Printer Descriptions) for 2599 individual printers. These files are extremely similar to one another, yet each one is gzipped individually.
$ find -name '*.gz' | wc -l
2599
$ find -name '*.gz' | xargs du -c -h | tail -1
1.2M total
Yet these files take up 13MB on disk because each one occupies at least one whole 4kB filesystem block. After unpacking each PPD and then tar'ing the whole lot up again with gzip or bzip2, the sizes are:
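The block overhead is easy to estimate: 2599 files cannot occupy less than 2599 x 4096 bytes, roughly 10.6MB, however small each one is. As a rough illustration, the sketch below sums a list of file sizes after rounding each one up to a block boundary. The name `on_disk_bytes` is hypothetical, and the 4096-byte default is an assumption about the filesystem.

```shell
# Hypothetical helper (not part of the original notes).
# Reads one file size in bytes per line on stdin and prints the total
# after rounding every size up to a multiple of the block size.
#   $1 = block size in bytes, defaults to 4096 (an assumption)
on_disk_bytes() {
    awk -v bs="${1:-4096}" '
        { total += int(($1 + bs - 1) / bs) * bs }
        END { printf "%.0f\n", total }
    '
}

# Possible usage with GNU find, which can print sizes directly:
#   find -name '*.gz' -printf '%s\n' | on_disk_bytes
```

This ignores inode and directory overhead, so it gives a lower bound rather than the exact figure du reports.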
-rw-r--r-- 1 paul paul 929986 2005-09-16 17:16 ../data-ppd-no-gz.tar.bz2
-rw-r--r-- 1 paul paul 2976299 2005-09-16 17:15 ../data-ppd-no-gz.tar.gz
-rw-r--r-- 1 paul paul 7210215 2005-09-16 17:03 ../data.tar.gz
I think these PPDs should really be stored on disk as a tarball or .zip; the inode/block count wastage is huge. ZIP has the advantage that it is directly addressable, but it lacks cross-file compression savings. Simply uncompressing the PPDs is not an option either: the raw on-disk usage goes up from 13MB to 120MB.
$ time find -type f -name \*.ppd | xargs gzip
real 0m2.912s
user 0m2.543s
$ time find -type f -name \*.gz | xargs gunzip
real 0m1.680s
user 0m1.192s
Even the Bzip2 archive (because of the huge compression) is amazingly fast to decompress:
$ time tar tjf ../data-ppd-no-gz.tar.bz2 | grep HP | head -1
usr/share/ppd/HP/
real 0m1.446s
user 0m0.042s
Solution: Store as a .tar.bz2 and teach CUPS how to go and fetch PPDs from it. Preferably abstract the fetching behind a script, so that PPDs can be transparently downloaded off the web or from the printer itself. It would be possible to remain 100% compatible by using a post-install script to unpack the .tar.bz2 and recompress the PPDs as individual files.
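The fetching script could start out as little more than a wrapper around tar. The sketch below is purely illustrative: the name `fetch_ppd`, its arguments and the archive layout are assumptions, not an existing CUPS interface, and it relies on tar detecting the compression itself when reading (as GNU tar does).

```shell
# Hypothetical fetch helper (not an existing CUPS interface).
# Prints a single PPD from an archive to stdout without unpacking
# anything to disk.
#   $1 = archive, e.g. a .tar.bz2 containing usr/share/ppd/...
#   $2 = path of the wanted PPD inside the archive
fetch_ppd() {
    tar -xOf "$1" "$2"
}

# Possible usage (the member name is a placeholder):
#   fetch_ppd data-ppd-no-gz.tar.bz2 usr/share/ppd/HP/some-printer.ppd
```

Because a tarball is not directly addressable, each lookup is a linear scan through the archive. A web or printer back end could later be added behind the same function without CUPS noticing.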
Saving: 7.2MB -> 0.9MB on CD and 13MB -> 0.9MB on disk. 87% saving on CD.
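The CD percentage follows directly from the archive sizes listed earlier: 7210215 bytes for data.tar.gz against 929986 bytes for data-ppd-no-gz.tar.bz2. A small helper, with the made-up name `saving_pct`, makes the arithmetic explicit.

```shell
# Hypothetical helper (not part of the original notes).
# Prints the size reduction from $1 (old size) to $2 (new size)
# as a percentage of the old size, rounded to a whole number.
saving_pct() {
    awk -v old="$1" -v new="$2" \
        'BEGIN { printf "%.0f\n", (old - new) * 100 / old }'
}

# Possible usage with the Foomatic figures:
#   saving_pct 7210215 929986    # prints 87
```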
Linux Headers
$ find usr/ -type f -name '*.h' | xargs du -sch | tail -1
9.9M Header Files
$ find usr/ -type f \! -name '*.h' | xargs du -sch | tail -1
6.4M Not Header Files
$ find usr/ -type f -name 'Makefile' | xargs du -sch | tail -1
3.3M Of Which, Makefiles
$ find usr/ -type f -name 'Kconfig*' | xargs du -sch | tail -1
2.5M Of Which, Kconfigs
$ find usr/ -type f \! -name 'Kconfig*' \! -name Makefile \! -name '*.h' | xargs du -sch | tail -1
684K Of Which, Other/Unknown
9.9MB of headers and 6.4MB of everything else, mostly Makefiles and Kconfigs. Of the 684kB of unknowns, 176kB are ARM .S assembler files for every particular variation of embedded board/device, and 160kB are files such as MAINTAINERS.gz and README.gz that have already been compressed; these unpack to 440kB on disk and compress back to 118kB using bzip2.
3.3MB of it is include/asm-arm/ so this probably warrants special attention.
So far, the best I have is:
Unpacking all *.gz files (280kB on-disk increase; this could be reclaimed by the post-inst).
find usr/ -type f -name \*.gz | xargs gunzip
find usr/ -type f | LANG=C sed -e 's-^.*/\([^/][^/]*\)-\1 &-' | sort | awk '{print $2}' | tar jcf data-by-filename.tar.bz2 -T -
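The sed/sort/awk pipeline above orders the file list by the last path component, so that all files with the same name (every Makefile, every Kconfig) sit next to each other in the tarball; presumably this helps because similar content then falls within the same bzip2 block. The same ordering can be sketched as a reusable filter. The name `sort_by_basename` is made up here. Unlike the awk '{print $2}' step, it uses a tab as separator, so paths containing spaces survive (paths containing tabs or newlines would not).

```shell
# Hypothetical filter (not part of the original notes).
# Reads paths on stdin, one per line, and prints them ordered by their
# last path component, with ties broken by the full path.
sort_by_basename() {
    awk -F/ '{ print $NF "\t" $0 }' | LC_ALL=C sort | cut -f2-
}

# Possible usage, equivalent in spirit to the pipeline above:
#   find usr/ -type f | sort_by_basename | tar jcf data-by-filename.tar.bz2 -T -
```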
-rw-r--r-- 1 paul paul 4555509 2005-09-16 19:21 data-by-filename.tar.bz2
-rw-r--r-- 1 paul paul 4823926 2005-09-16 17:45 data.tar.bzip2
-rw-r--r-- 1 paul paul 5916271 2005-09-16 17:43 data.tar.gz
Saving: 1.3MB on CD (18% with bzip2, 23% with bzip2 + fiddling), 280kB extra on disk.
PaulSladen/LangpackCompression (last edited 2008-08-06 16:15:19 by localhost)