OpenOfficeL10nCompression

OpenOffice is too Big!

Okay, OpenOffice is quite nice and it's good to ship it. It would be even better to ship lots of localized language packages "out-of-the-box" too (that is, on the install CD itself). The problem is that the install CD has limited capacity.

Skip to the Conclusion below.

PaulSladen 2004-09-18

Files

Format

  deb 
    archive
      gz
        tar
          zip
            zip
              gif

Fundamentals: Localized Help Files

With the exception of domain-specific compression schemes (e.g. JPEG DCTDecode), it is better to have one single layer of compression at the highest level. Compressing at the highest level gives the maximum aggregation between separate (and likely similar) files, and it also means being able to choose the most appropriate compressor available (e.g. bzip2 over gzip).
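A minimal sketch of the principle, assuming only tar, gzip and awk are available. The locale names and the generated "help" text are invented for the demonstration; the script compresses each file and then the tarball (two layers), and compares that with compressing the tarball of raw files once.

```shell
#!/bin/sh
# Sketch: nested compression versus one outer layer of compression.
# Locale names and file contents are made up for this demo.
set -e
work=$(mktemp -d)
cd "$work"

n=0
for loc in en ja de fr es; do
    n=$((n + 1))
    mkdir -p "raw/$loc" "nested/$loc"
    # Mostly shared text, with a locale-specific line every 20 lines.
    awk -v seed="$n" 'BEGIN {
        x = seed
        for (i = 1; i <= 1000; i++) {
            print "help string", i
            if (i % 20 == 0) { x = (x * 75 + 74) % 65537; print "translated", x }
        }
    }' > "raw/$loc/help.txt"
    gzip -c "raw/$loc/help.txt" > "nested/$loc/help.txt.gz"
done

tar cf - nested | gzip -c > nested.tar.gz   # inner gzip, then outer gzip
tar cf - raw    | gzip -c > single.tar.gz   # one layer, applied last

nested_size=$(($(wc -c < nested.tar.gz)))
single_size=$(($(wc -c < single.tar.gz)))
echo "nested: $nested_size bytes, single layer: $single_size bytes"

cd /
rm -rf "$work"
```

The single-layer archive should come out smaller because the outer compressor can match text across locales, which it cannot do once each file has already been turned into its own compressed stream.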

In the stack above, on the 12MB Japanese locale:

  • Uncompressing the .jar (zip) files gave a ~10% saving when the whole file was recompressed with gzip (11MB), and a ~20% saving when recompressed with bzip2 (10MB).
  • The tar of uncompressed .jar/.zip files came to 30MB.
  • Hunting through, approximately 6MB of this 30MB (and 6MB of the 10MB result!) was taken up by a single jar called 'shared81.zip', which in turn contains the file 'pictures.zip' holding 5MB of GIF images for icons and what appears to be an interactive guide.
  • Decompressing every GIF in 'pictures.zip' produced a new 'pictures.zip' of approximately 60MB.
  • When this 'pictures.zip' was compressed with gzip it came back to the original 5MB (GIF's LZW does about as well here as gzip's zlib deflate); however bzip2 reduced this by 10% to ~4.5MB.

This in itself isn't much use. What is important is:

  • Most of the images in 'pictures.zip' appear unchanged across locales; by separating off this data, it may be possible to reduce each locale's help from ~12MB to 5-6MB, plus a shared 'pictures' set of around 4-5MB.

  • It may be possible to reduce the help files from ~130MB to around ~65MB for 11 locales.
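The estimate can be checked with a little integer arithmetic. This is only a back-of-envelope sketch using the rounded figures quoted in the bullets; the variable names are made up here.

```shell
#!/bin/sh
# Back-of-envelope check of the help-file estimate, in whole megabytes.
locales=11
per_locale_now=12                       # current size of one locale's help
before=$((locales * per_locale_now))

# Low and high ends: 5 to 6MB per locale, plus one shared 4 to 5MB picture set.
after_low=$((locales * 5 + 4))
after_high=$((locales * 6 + 5))
after_mid=$(( (after_low + after_high) / 2 ))

echo "before: ${before}MB"
echo "after:  ${after_low}MB to ${after_high}MB (midpoint ${after_mid}MB)"
```

This gives 132MB before and a range of 59MB to 71MB afterwards, with a midpoint of 65MB, in line with the rounded ~130MB and ~65MB above.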

Rather than hardlinking or separate files, it may be possible to achieve the cross-locale optimisation by striping files from each locale, for instance:

en/foo
ja/foo
en/foo/01.gif
ja/foo/01.gif
en/foo/02.gif
ja/foo/02.gif

bzip2 sorts and compresses in blocks of up to ~900kB, which should be enough to achieve both vertical (cross-locale) and horizontal (intra-locale) compression, provided the matching files from each locale land in the same block.
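One way to produce such a striped ordering is to sort on the path with the leading locale directory removed. The following is a sketch under that assumption; the function name 'interleave' is hypothetical and the paths are the illustrative ones from the listing above.

```shell
#!/bin/sh
# Sketch: order files so the same relative path from every locale is adjacent.
interleave() {
    # Prefix each path with itself minus the leading locale directory,
    # sort on that key, then drop the key again.
    awk -F/ '{ print substr($0, length($1) + 2) "\t" $0 }' |
        LC_ALL=C sort | cut -f2-
}

striped=$(printf '%s\n' en/foo/01.gif en/foo/02.gif ja/foo/01.gif ja/foo/02.gif | interleave)
echo "$striped"
```

With GNU tar, a list in this order could then be passed to tar via '-T -' so that the archive members follow the striped order before bzip2 sees them.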

The translations are in XML files, but there are also a large number of Berkeley DB index/tree/key/hash files. These appear to be generated from the contents of the .jar files (the XML documents). I can't work out how these are generated, but it may be possible to generate them at install time and save transporting them precalculated.

Warning: figure out how much space the db files are taking in the final tarball.

Localized (translations)

These are in a different format from the Help files above (which are XML based). The translations mainly consist of a huge number of '.res' files, presumably meaning 'Resource'. I'm unsure what format these are in. The size varies from ~1kB to ~3MB but the contents are similar in a high-level sense. The start of the file contains UTF-8 encoded strings separated (mostly) by NUL (zero) bytes. The latter half of the file contains much more binary/random data; knowing exactly what this is is probably the key to getting better compression ratios.

-rw-r--r--    1 root     root         2.1M Sep 18 13:58 resource.tar.bz2
-rw-r--r--    1 root     root         2.5M Sep 18 13:58 resource.tar.gz
-rw-r--r--    1 root     root          16M Sep 18 13:58 resource.tar

There isn't much more to be saved by using bzip2; both are already managing roughly 7:1 ratios.
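For poking at the string table at the start of a '.res' file, translating the NUL separators into newlines is usually enough. This is a sketch only: 'sample.res' is a stand-in built by the script itself, since the real layout of the files is not known here.

```shell
#!/bin/sh
# Sketch: list the NUL-separated strings at the start of a resource file.
# "sample.res" is a made-up stand-in, not a real OpenOffice resource.
work=$(mktemp -d)
printf 'File\000Edit\000View\000' > "$work/sample.res"

strings_found=$(tr '\000' '\n' < "$work/sample.res")
echo "$strings_found"

rm -rf "$work"
```

On a real file the binary second half would show up as noise, which at least makes it easy to see where the string table ends.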

                                    /usr/lib/openoffice/
-rw-r--r--    1 root     root         2.5M Sep 18 14:56 program.tar.gz  (.res toolbars)
-rw-r--r--    1 root     root         2.1M Sep 18 14:56 program.tar.bz2
-rw-r--r--    1 root     root         918k Sep 18 14:56 share.tar.gz    (XML etc)
-rw-r--r--    1 root     root         926k Sep 18 14:56 share.tar.bz2   (bzip2 actually loses here!)

So, one third of the problem is the XML files...

                              /usr/lib/openoffice/share/
-rw-r--r--    1 root     root          722 Sep 18 15:02 wordbook.tar.gz
-rw-r--r--    1 root     root         7.8k Sep 18 15:02 autocorr.tar.gz
-rw-r--r--    1 root     root          69k Sep 18 15:02 registry.tar.gz
-rw-r--r--    1 root     root          94k Sep 18 15:02 autotext.tar.gz
-rw-r--r--    1 root     root         746k Sep 18 15:02 template.tar.gz

Ah ha! A good proportion of the space is taken up with template files. These (like other OpenOffice documents) are Zip files containing XML data.

Recompressing each of these with zero compression actually produces a larger file when tarred and then gzipped. But with bzip2...:

-rw-r--r--    1 paul     paul         848k Sep 18 15:31 share.tar.gz    (was 918k)
-rw-r--r--    1 paul     paul         468k Sep 18 15:31 share.tar.bz2   (40%-50% saving, though not in all locales)
-rw-r--r--    1 paul     paul         2.5M Sep 18 15:36 program.tar.gz  (no difference)
-rw-r--r--    1 paul     paul         2.1M Sep 18 15:36 program.tar.bz2 (no difference)

There is a 40%-50% reduction (remember that this comes completely off the final file size!).

Going through the templates (diminishing returns now...), the largest templates are those containing WMF vector pictures, which don't compress that well (binary format).

Interlacing

Try to interlace similar files from each translation:

2.6M    openoffice.org-l10n-af_1.1.2-2ubuntu5_all.deb
2.6M    openoffice.org-l10n-ar_1.1.2-2ubuntu5_all.deb
2.6M    openoffice.org-l10n-ca_1.1.2-2ubuntu5_all.deb
3.5M    openoffice.org-l10n-cs_1.1.2-2ubuntu5_all.deb
2.5M    openoffice.org-l10n-cy_1.1.2-2ubuntu5_all.deb
3.5M    openoffice.org-l10n-da_1.1.2-2ubuntu5_all.deb
3.3M    openoffice.org-l10n-de_1.1.2-2ubuntu5_all.deb
2.7M    openoffice.org-l10n-el_1.1.2-2ubuntu5_all.deb
3.4M    openoffice.org-l10n-en_1.1.2-2ubuntu5_all.deb
3.4M    openoffice.org-l10n-es_1.1.2-2ubuntu5_all.deb
30M     total

Eleven languages, which unpackaged are:

177M    root
177M    total

Decompressed (ZIP files recompressed with zero compression)

200M    root
200M    total

Non-interlaced tar generated with a listing such as:

find  -type f | rev | sort | rev | xargs tar rvf ../root.sort.tar
-rw-r--r--    1 paul     paul         195M Sep 18 17:26 root.sort.tar
-rw-r--r--    1 paul     paul          30M Sep 18 17:26 root.sort.tar.gz
real    7m41.155s
user    7m16.920s
sys     0m1.370s
-rw-r--r--    1 paul     paul          22M Sep 18 17:26 root.sort.tar.bz2

A 25% saving, in exchange for seven minutes spent packing the thing... For 11 languages that is 2MB/language.

This works for all the files, except the resource (toolbar) files which are named in the format:

fooNNNNN.res   foo=toolbar  NNNNN=language code

These ideally want to be sorted with the 'foo' as a key, rather than the number.
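The difference between the two orderings can be seen on a handful of names. This is a sketch with invented file names that follow the fooNNNNN.res pattern; it assumes the 'rev' utility is installed.

```shell
#!/bin/sh
# Sketch: how the two sort orders group resource files.
# The names below are invented and only follow the fooNNNNN.res pattern.
names='resource/bar00001.res
resource/bar00049.res
resource/foo00001.res
resource/foo00049.res'

# Reversed-name sort: the trailing language code becomes the leading key.
by_language=$(printf '%s\n' "$names" | rev | LC_ALL=C sort | rev)
# Plain sort: the toolbar name at the front is the leading key.
by_toolbar=$(printf '%s\n' "$names" | LC_ALL=C sort)

echo "rev | sort | rev:"; echo "$by_language"
echo "sort:";             echo "$by_toolbar"
```

The reversed sort puts all the 00001 files together and then all the 00049 files, whereas the plain sort keeps each toolbar's translations next to each other, which is the grouping wanted for the .res files.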

Interlaced tarball:

time (find  -type f \! -name \*.res | rev | sort | rev ; find -type f -name \*.res | sort ) | xargs tar rf ../root.sort.tar ; ls -lhd ../root.sort*
real    1m18.848s
time bzip2 ../root.sort.tar ; ls -lh ../root.sort.tar*
real    8m2.507s
-rw-r--r--    1 paul     paul          20M Sep 18 17:48 ../root.sort.tar.bz2

BTW, decompression is fairly quick, and that's what matters at the users' end.

time bunzip2 ../root.sort.tar* ; ls -lh ../root.sort.tar*
real    0m44.022s

Conclusion

It looks like it's possible to achieve a 33% reduction by using various roundabout techniques and without actually dropping anything. This may make it easier to justify shipping slightly more OOo locales on the install CD, although a complete set may not be possible yet.

The CD packages could be built as one combined package that provides the individual ones:

openoffice.org-l10n-af-ar-ca-cr-cy-de_1.1.2-2ubuntu5_all.deb
Provides: openoffice.org-l10n-af,
  openoffice.org-l10n-ar,
  openoffice.org-l10n-ca,
  ...

This should then easily get replaced by the individual packages during an update.

Notes

Email regarding Mozilla Firefox locales squish:

Date: Sat, 18 Sep 2004 00:24:20 +0100 (BST)
From: Paul Sladen <ubuntu@paul.sladen.org>
To: ubuntu-devel
Subject: Re: adding mozilla-firefox-locale-* to the desktop ?

On Fri, 17 Sep 2004, Matt Zimmerman wrote:
> On Sat, Sep 18, 2004 at 12:43:47AM +0200, Sebastien Bacher wrote:
> > consider adding the mozilla-firefox-locale-* to the desktop seed ?
>
> On i386, this would add 2M to the CD and 5M to the desktop install, so I
> don't see a problem with the size (unlike with openoffice.org).

If this type of thing (lots of locales) becomes an issue, I've just done a
test which has got the above packages down by 50%.

The Mozilla locales are:

  deb 
    archive
      gz
        tar
          zip
            zip

In this case, I extracted the zip (.jar) files and recompressed them with
zero compression (`-0').  Having only one level of compression and using
bzip2 instead took the final Ukrainian Mozilla locale from 169kB -> 82kB.

While this might be a contrived and small example, aggregating compression
between locales may well introduce similar efficiencies and enable shipping
{,more of} them than would be possible otherwise.

        -Paul

(Yes, This is a Hoary type thing...)
-- 
Is there no safe way to travel?  Nottingham, GB

Scripts

Quick and dirty scripts to use for proof of concept:

redecompress.sh

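#!/bin/sh
# Re-store every deflated Zip archive below the current directory with zero
# compression, recursing into nested archives. Names with spaces are not handled.
case $0 in /*) SELF=$0 ;; *) SELF=$PWD/$0 ;; esac   # absolute, since we cd before recursing
TMP=./decompress.$$
#ZIPS=`find ./ -type f -a \( -name '*.zip' -or -name '*.jar' \) -print`
# Detect archives by their file(1) signature rather than by extension.
ZIPS=`find . -type f | xargs file | grep 'Zip archive data, at least v2.0 to extract' | sed -e 's/^\.\/\(.*\):.*$/\1/'`

for ZIP in $ZIPS ; do
        mkdir -p "$TMP" && (
            cd "$TMP" && (
                echo "Doing $ZIP $*"
                unzip -q "../$ZIP" && rm "../$ZIP" && "$SELF" RECURSIVE && zip -q -r -0 "../$ZIP" ./
            )
        ) && rm -rf "$TMP"
done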

derecompress-gif.sh

#!/bin/sh
giffix < "$1" > "$1.$$" && mv "$1.$$" "$1"

Ideally a better program than 'giffix' (found in 'libungif-bin' in Debian) should be found, one that simply decompresses the LZW-compressed sections of an existing GIF and replaces them in situ. NB: 'giffix' appears to screw up transparency. 'convert'/'mogrify' from ImageMagick might also do for the moment.

PaulSladen/OpenOfficeL10nCompression (last edited 2008-08-06 16:15:36 by localhost)