OpenOfficeL10nCompression
OpenOffice is too Big!
Okay, OpenOffice is quite nice and it's good to ship it. It would be even better to be able to ship lots of localized language packages "out-of-the-box" too (that is, on the install CD). The problem is that the install CD has a limited capacity.
Skip to the Conclusion below.
PaulSladen 2004-09-18
Files
- ~3MB each, times 23 = ~70MB
openoffice.org-help-xx_1.1+20030814-3_all.deb (http://archive.ubuntulinux.org/ubuntu/pool/main/o/openoffice.org-help-en/)
- ~12MB each, times 11 = ~130MB
Format
deb archive > gz > tar > zip > zip > gif (outermost to innermost)
Fundamentals: Localized Help Files
With the exception of domain-specific compression schemes (e.g. JPEG DCTDecode), it is better to have one single layer of compression at the highest level. Compressing at the highest level maximises aggregation between separate (and likely similar) files, and it also allows choosing the most appropriate compressor available (e.g. bzip2 over gzip).
In the stack above, on the 12MB Japanese locale:
- uncompressing the .jar (zip) files provided a 10% saving when the whole file was recompressed with gzip (11MB), and a ~20% saving when recompressed with bzip2 (10MB).
- The tar of uncompressed .jar/.zip files came to 30MB.
- Hunting through, approximately 6MB of this 30MB (and 6MB of the 10MB result!) was taken up by a single jar called 'shared81.zip', which in turn contains the file 'pictures.zip', holding 5MB of GIF images for icons and what appears to be an interactive guide.
- decompressing every GIF in 'pictures.zip' produced a new 'pictures.zip' of approximately 60MB.
- when 'pictures.jar' was compressed with gzip it came to the original 5MB (the GIF LZW algorithm is very close to the gzip libz algorithm); however, bzip2 reduced this by 10% to ~4.5MB.
This in itself isn't much use. What is important is:
Most of the images in pictures.jar appear unchanged across locales; by separating off this data, it may be possible to get each locale's help reduced from ~12MB to between 5MB and 6MB, plus a shared 'pictures' set of around 4MB to 5MB.
- It may be possible to reduce the help files from ~130MB to around ~65MB for 11 locales.
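The estimate above can be checked with a little arithmetic. This is only a sketch: it assumes the midpoints of the quoted ranges (5.5MB per slimmed locale, 4.5MB for the shared pictures), and the helper name `help_total_mb` is made up for illustration.

```python
# Rough size estimate for the help packages, using the figures quoted above.
LOCALES = 11
CURRENT_MB_PER_LOCALE = 12      # ~12MB per help package today

# Assumption: take the midpoints of the quoted ranges.
SLIMMED_MB_PER_LOCALE = 5.5     # 5MB to 6MB once the shared images are split off
SHARED_PICTURES_MB = 4.5        # 4MB to 5MB for the common picture set


def help_total_mb(locales, per_locale_mb, shared_mb=0.0):
    """Total size of `locales` help packages plus one optional shared package."""
    return locales * per_locale_mb + shared_mb


before = help_total_mb(LOCALES, CURRENT_MB_PER_LOCALE)
after = help_total_mb(LOCALES, SLIMMED_MB_PER_LOCALE, SHARED_PICTURES_MB)
print(f"before: ~{before:.0f}MB, after: ~{after:.0f}MB")
```

With these assumed midpoints the totals come out at roughly 132MB before and 65MB after, in line with the figures above.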
Rather than hardlinking or separate files, it may be possible to achieve the cross-locale optimisation by striping files from each locale, for instance:
en/foo
ja/foo
en/foo/01.gif
ja/foo/01.gif
en/foo/02.gif
ja/foo/02.gif
bzip2 uses a compression and sorting window of ~900kB which should be enough to achieve both vertical (cross-locale) and horizontal (intra-locale) compression.
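One way such a striped listing could be produced is sketched below. It assumes the first path component is the locale directory, as in the en/ja example; the function name `striped_order` is hypothetical.

```python
from pathlib import PurePosixPath


def striped_order(paths):
    """Order paths so the same file from every locale sits side by side.

    Assumes the first path component is the locale ("en", "ja", ...).
    Sorting on (rest-of-path, locale) yields en/x, ja/x, en/y, ja/y, ...
    """
    def key(path):
        parts = PurePosixPath(path).parts
        return (parts[1:], parts[0])

    return sorted(paths, key=key)


files = ["en/foo/01.gif", "en/foo/02.gif", "ja/foo/01.gif", "ja/foo/02.gif"]
for name in striped_order(files):
    print(name)
```

The resulting list can then be fed to tar in that order, so that near-identical files from different locales fall inside the same bzip2 block where their size allows.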
The translations are in XML files, but there are also a large number of Berkeley DB index/tree/key/hash files. These appear to be generated from the contents of the .jar files (the XML documents). I can't work out how they are generated, but it may be possible to generate them at install time and avoid shipping them precalculated.
Figure out how much space the db files are taking in the final tarball
Localized (translations)
These are in a different format from the Help files above (which are XML-based). The translations mainly contain a huge number of '.res' files, presumably meaning 'Resource'. I'm unsure what format these are in. The size varies from ~1kB to ~3MB, but the contents are similar in a high-level sense. The start of each file contains UTF-8 encoded strings separated (mostly) by NUL (zero) bytes. The latter half of the file contains much more binary/random-looking data; knowing exactly what this is is probably the key to getting better compression ratios.
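To get a feel for how much of a '.res' file is plain strings, something like the following could be used. It is a crude sketch based only on the observation above (NUL-separated UTF-8 at the start, binary data later); it does not parse the real format, which is unknown here, and `leading_strings` is a made-up helper name.

```python
def leading_strings(data, min_len=2):
    """Return the UTF-8 strings found in NUL-separated chunks of `data`.

    Chunks shorter than `min_len` bytes, chunks that fail to decode, and
    chunks with non-printable characters are skipped, which crudely
    filters out the binary part of the file.
    """
    found = []
    for chunk in data.split(b"\x00"):
        if len(chunk) < min_len:
            continue
        try:
            text = chunk.decode("utf-8")
        except UnicodeDecodeError:
            continue
        if text.isprintable():
            found.append(text)
    return found


# Toy input standing in for a .res file: three strings, then binary bytes.
sample = b"File\x00Edit\x00Ansicht\x00\xff\xfe\x01\x02\x00"
print(leading_strings(sample))
```

Running this over a real file (read with `open(path, "rb").read()`) and comparing the total length of the strings with the file size would show how large the unexplained binary portion is.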
-rw-r--r-- 1 root root 2.1M Sep 18 13:58 resource.tar.bz2
-rw-r--r-- 1 root root 2.5M Sep 18 13:58 resource.tar.gz
-rw-r--r-- 1 root root 16M Sep 18 13:58 resource.tar
There isn't that much to be saved by using bzip2, and both are already managing roughly 7:1 ratios.
/usr/lib/openoffice/
-rw-r--r-- 1 root root 2.5M Sep 18 14:56 program.tar.gz (.res toolbars)
-rw-r--r-- 1 root root 2.1M Sep 18 14:56 program.tar.bz2
-rw-r--r-- 1 root root 918k Sep 18 14:56 share.tar.gz (XML etc)
-rw-r--r-- 1 root root 926k Sep 18 14:56 share.tar.bz2 (bzip2 actually loses here!)
So, one third of the problem is the XML files...
/usr/lib/openoffice/share/
-rw-r--r-- 1 root root 722 Sep 18 15:02 wordbook.tar.gz
-rw-r--r-- 1 root root 7.8k Sep 18 15:02 autocorr.tar.gz
-rw-r--r-- 1 root root 69k Sep 18 15:02 registry.tar.gz
-rw-r--r-- 1 root root 94k Sep 18 15:02 autotext.tar.gz
-rw-r--r-- 1 root root 746k Sep 18 15:02 template.tar.gz
Ah ha! A good proportion of the space is taken up by template files. These (like other OpenOffice documents) are Zip files containing XML data.
Recompressing each of these with zero compression actually produces a larger file when tarred and then gzipped. But with bzip2...:
-rw-r--r-- 1 paul paul 848k Sep 18 15:31 share.tar.gz (was 918k)
-rw-r--r-- 1 paul paul 468k Sep 18 15:31 share.tar.bz2 (40%-50% saving, though not in all locales)
-rw-r--r-- 1 paul paul 2.5M Sep 18 15:36 program.tar.gz (no difference)
-rw-r--r-- 1 paul paul 2.1M Sep 18 15:36 program.tar.bz2 (no difference)
There is a 40%-50% reduction (remember that this comes completely off the final file size!).
Going through the templates (diminishing returns now...), the largest templates are those containing WMF vector pictures, which don't compress that well (binary format).
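The zero-compression step described above can be sketched with Python's standard zipfile module. This is an illustration, not the procedure actually used for the figures here: the helper names `make_zip` and `restore_uncompressed` are made up, and the XML payload is a toy stand-in for a real template.

```python
import bz2
import gzip
import io
import zipfile


def make_zip(members, compression):
    """Build an in-memory zip from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression) as archive:
        for name, data in members.items():
            archive.writestr(name, data)
    return buf.getvalue()


def restore_uncompressed(zip_bytes):
    """Rewrite a zip archive with every member stored (zero compression)."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as dst:
        for info in src.infolist():
            dst.writestr(info.filename, src.read(info), zipfile.ZIP_STORED)
    return out.getvalue()


# Toy stand-in for a template: a zip holding one XML member.
xml = b"".join(
    b"<text:p>Paragraph %d of the template</text:p>\n" % n for n in range(500)
)
original = make_zip({"content.xml": xml}, zipfile.ZIP_DEFLATED)
stored = restore_uncompressed(original)

# Compare how each variant fares under a single outer compressor.
for label, blob in (("deflated zip", original), ("stored zip", stored)):
    print(label, "raw:", len(blob),
          "gzip:", len(gzip.compress(blob)),
          "bz2:", len(bz2.compress(blob)))
```

The printed sizes depend entirely on the input; the point is only that the stored variant leaves the XML visible to the outer compressor instead of hiding it behind a first layer of deflate.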
Interlacing
Try and interlace similar files from each translation:
2.6M openoffice.org-l10n-af_1.1.2-2ubuntu5_all.deb
2.6M openoffice.org-l10n-ar_1.1.2-2ubuntu5_all.deb
2.6M openoffice.org-l10n-ca_1.1.2-2ubuntu5_all.deb
3.5M openoffice.org-l10n-cs_1.1.2-2ubuntu5_all.deb
2.5M openoffice.org-l10n-cy_1.1.2-2ubuntu5_all.deb
3.5M openoffice.org-l10n-da_1.1.2-2ubuntu5_all.deb
3.3M openoffice.org-l10n-de_1.1.2-2ubuntu5_all.deb
2.7M openoffice.org-l10n-el_1.1.2-2ubuntu5_all.deb
3.4M openoffice.org-l10n-en_1.1.2-2ubuntu5_all.deb
3.4M openoffice.org-l10n-es_1.1.2-2ubuntu5_all.deb
30M total
Eleven Languages, which unpackaged are:
177M root
177M total
Decompressed (ZIP files recompressed with zero compression)
200M root
200M total
Non-interlaced tar generated with a listing such as:
find -type f | rev | sort | rev | xargs tar rvf ../root.sort.tar
-rw-r--r-- 1 paul paul 195M Sep 18 17:26 root.sort.tar
-rw-r--r-- 1 paul paul 30M Sep 18 17:26 root.sort.tar.gz
real 7m41.155s
user 7m16.920s
sys 0m1.370s
-rw-r--r-- 1 paul paul 22M Sep 18 17:26 root.sort.tar.bz2
A ~25% saving, in exchange for about 7 minutes spent packing the thing... For 11 languages, that is ~2MB per language.
This works for all the files, except the resource (toolbar) files which are named in the format:
fooNNNNN.res
foo=toolbar
NNNNN=language code
These ideally want to be sorted with the 'foo' as a key, rather than the number.
Interlaced tarball:
time (find -type f \! -name \*.res | rev | sort | rev ; find -type f -name \*.res | sort ) | xargs tar rf ../root.sort.tar ; ls -lhd ../root.sort*
real 1m18.848s
time bzip2 ../root.sort.tar ; ls -lh ../root.sort.tar*
real 8m2.507s
-rw-r--r-- 1 paul paul 20M Sep 18 17:48 ../root.sort.tar.bz2
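The ordering used for the interlaced tarball can be sketched in Python to show why the '.res' files need separate treatment. The file names below are invented for illustration, and plain string comparison is used, which may differ slightly from the collation that `sort` applies under a given locale.

```python
def interlaced_order(paths):
    """Mimic the listing order: `rev | sort | rev` for ordinary files,
    followed by a plain sort for the .res files."""
    others = sorted(
        (p for p in paths if not p.endswith(".res")),
        key=lambda p: p[::-1],  # sort on the reversed string, like rev|sort|rev
    )
    resources = sorted(p for p in paths if p.endswith(".res"))
    return others + resources


# Hypothetical paths: two locales in directories, two toolbars in two languages.
paths = [
    "./en/help/b.xml", "./de/help/a.xml", "./x/abc00049.res",
    "./x/xyz00001.res", "./en/help/a.xml", "./x/abc00001.res",
    "./de/help/b.xml", "./x/xyz00049.res",
]
for name in interlaced_order(paths):
    print(name)
```

Sorting on the reversed name puts a.xml from each locale directory next to each other, but applied to the '.res' files it would group them by language number instead of by toolbar; a plain sort keeps all the 'abc' files adjacent.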
BTW, decompression is fairly quick, and that is what matters at the users' end.
time bunzip2 ../root.sort.tar* ; ls -lh ../root.sort.tar*
real 0m44.022s
Conclusion
It looks like it's possible to achieve a 33% reduction by using various roundabout techniques, without actually dropping anything. This may make it easier to justify shipping slightly more OOo locales on the install CD, although a complete set may not be possible yet.
The CD packages could be built such that they are:
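The percentages quoted on this page follow from the measured sizes. A small sketch of the arithmetic, using only the rounded figures reported in the Interlacing section (the helper name `saving` is made up):

```python
def saving(before_mb, after_mb):
    """Fractional reduction going from before_mb to after_mb."""
    return 1 - after_mb / before_mb


BASELINE_MB = 30        # total of the original l10n .debs
SORTED_BZ2_MB = 22      # root.sort.tar.bz2 from the non-interlaced listing
INTERLACED_BZ2_MB = 20  # root.sort.tar.bz2 from the interlaced listing

print(f"sorted:     {saving(BASELINE_MB, SORTED_BZ2_MB):.0%}")
print(f"interlaced: {saving(BASELINE_MB, INTERLACED_BZ2_MB):.0%}")
```

Because the inputs are rounded to whole megabytes, the results are only good to a few percentage points.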
openoffice.org-l10n-af-ar-ca-cr-cy-de_1.1.2-2ubuntu5_all.deb
Provides: openoffice.org-l10n-af_1.1.2-2ubuntu5_all.deb, openoffice.org-l10n-ar_1.1.2-2ubuntu5_all.deb, openoffice.org-l10n-ca_1.1.2-2ubuntu5_all.deb, ...
These should then easily get replaced during an update.
Notes
Email regarding Mozilla Firefox locales squish:
Date: Sat, 18 Sep 2004 00:24:20 +0100 (BST)
From: Paul Sladen <ubuntu@paul.sladen.org>
To: ubuntu-devel
Subject: Re: adding mozilla-firefox-locale-* to the desktop ?

On Fri, 17 Sep 2004, Matt Zimmerman wrote:
> On Sat, Sep 18, 2004 at 12:43:47AM +0200, Sebastien Bacher wrote:
> > consider adding the mozilla-firefox-locale-* to the desktop seed ?
>
> On i386, this would add 2M to the CD and 5M to the desktop install, so I
> don't see a problem with the size (unlike with openoffice.org).

If this type of thing (lots of locales) becomes an issue, I've just done a test which has got the above packages down by 50%. The Mozilla locales are:

deb archive gz tar zip zip

In this case, I extracted the zip (.jar) files and recompressed them with zero compression (`-0'). Having only one level of compression and using bzip2 instead took the final Ukrainian Mozilla locale from 169kB -> 82kB.

While this might be a contrived and small example, aggregating compression between locales may well introduce similar efficiencies and enable shipping {,more of} them than would be possible otherwise.

-Paul
(Yes, This is a Hoary type thing...)
--
Is there no safe way to travel? Nottingham, GB
Scripts
Quick and dirty scripts to use for proof of concept:
redecompress.sh
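# Recursively re-store every zip archive below the current directory with
# zero compression (-0). Run it via an absolute path: $SELF is re-invoked
# from inside $TMP, so a relative $0 would no longer resolve.
SELF=$0
TMP=./decompress.$$
#ZIPS=`find ./ -type f -a \( -name '*.zip' -or -name '*.jar' \) -print`
# Detect zips by content rather than by name, and strip the leading "./".
ZIPS=`find . -type f | xargs file | grep 'Zip archive data, at least v2.0 to extract' | sed -e 's/^\.\/\(.*\):.*$/\1/'`
for ZIP in $ZIPS ; do
    mkdir -p $TMP && (
        cd $TMP && (
            echo Doing $ZIP $*
            # Unpack, recurse into any nested zips, then repack uncompressed.
            unzip -q ../$ZIP && rm ../$ZIP && $SELF RECURSIVE && zip -q -r -0 ../$ZIP ./
        ) && cd ..
    ) && rm -rf $TMP
done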
derecompress-gif.sh
giffix < $1 > $1.$$ && mv $1.$$ $1
Ideally a better program than 'giffix' (found in 'libungif-bin' in Debian) should be found, one that simply decompresses the LZW-compressed sections of an existing GIF and replaces them in situ. NB: 'giffix' appears to screw up transparency. 'convert'/'mogrify' from imagemagick might also do for the moment.