Like cloop, but using multiple files
The UbuntuExpress LiveCD is going to contain a copying-based installer based on casper. This will provide a graphical option for users wanting a simple install now that they've "tried Ubuntu out".
To reduce shipping costs, it's likely that the UbuntuExpress-enabled LiveCD will be the only CD shipped to users by post for Breezy. To make it more widely usable, the LiveCD will also include enough .debs for a base-server system to be installed using the classic text-installer route. These extra .debs (containing duplicated information) will add an estimated 70MB to the CD size. This goes over the 650MB limit and will likely mean that some software (eg. Win FLOSS) will need to be dropped or the CD size increased to 700MB.
This increase in size comes from duplicated data (once in the LiveCD filesystem, and once in the .deb).
It would be really good to avoid this data duplication and keep both the LiveCD and installer options without wasting space:
- .debs are archives of gzip-compressed tar files.
- Multiple gzip blocks act as one gzip stream.
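The second property is what makes the whole scheme possible, and it can be checked with a few lines of Python. This is only a minimal sketch using the standard `gzip` module; the payload strings are made up for illustration.

```python
import gzip

# Two independently compressed gzip members, written back to back.
part_a = gzip.compress(b"first block of tar data\n")
part_b = gzip.compress(b"second block of tar data\n")
stream = part_a + part_b

# A standard gzip reader treats the concatenation as one stream...
whole = gzip.decompress(stream)

# ...while the second member can also be decoded on its own,
# provided its byte offset within the stream was recorded.
offset_b = len(part_a)
second = gzip.decompress(stream[offset_b:])
```

An ordinary tool sees one valid gzip stream, while a reader that knows `offset_b` can jump straight to the second block without decoding the first.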
To create a dloop you start off with:
- A huge LiveCD ext3 image (3GB in size; partimage -e is used to zero unused blocks).
- A pile of .debs likely to contain information duplicated in the big filesystem image (eg. they were used to create it!).
This input is identical to a cloop setup, except for the provision of the extra .debs as starting points.
1. Explode each .deb, extract the tarball and take the md5sum of each 64kB block of file data, realigning at the start of each file within the tarball (after the 512-byte header blocks).
2. Walk the LiveCD filesystem image and take the md5sum of each 64kB block.
3. Compare the md5sums of the uncompressed 64kB blocks to find ones that appear both in a .deb and in the filesystem image.
4. Compress all 64kB blocks not found in a .deb and append them to a new dloop/base file. This will contain mostly filesystem inode data (not part of real files) and any files created during the installation process. To reduce complexity, this would include any files <64kB in size, where the half-used block would not match against a .deb part.
5. Open a dloop/index binary file and write cloop-style offsets to this file, along with an additional file-number.
6. Open a dloop/files file and append the name of the base file, such that it matches the index of the file-number noted in 5.
7. Repack each .deb:
   a. Extract the data.tar.gz.
   b. Make a note of which 64kB runs are referenced from the filesystem image.
   c. Compress the tar stream up until the point that a 64kB block is needed.
   d. Compress each referenced 64kB block separately and record the offset from the start of the output stream.
   e. Reinsert the new data.tar.gz into the .deb.
   f. Re-sign the .deb with the new dloop-cd-repacker-key (if required).
   g. Add the offset of data.tar.gz within the .deb to each recorded offset from step d.
   h. Append the name of the .deb to dloop/files.
   i. Append the offsets recorded during recompression to dloop/index.
There are now a handful of files that together form the equivalent of a classical cloop image.
Create updated Packages.gz with a list of the .deb files contained on the CD.
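The first three steps of the process (hashing the .deb file data, hashing the image, and comparing the two) could be sketched roughly as follows. This is an illustration only, working on in-memory byte strings; the function names are hypothetical, and a real tool would stream the 3GB image rather than load it.

```python
import hashlib
import io
import tarfile

BLOCK = 64 * 1024  # 64kB block size used throughout the text


def hash_blocks(data):
    """Yield (offset, md5 hex digest) for each full 64kB block in data."""
    for off in range(0, len(data) - BLOCK + 1, BLOCK):
        yield off, hashlib.md5(data[off:off + BLOCK]).hexdigest()


def hash_tar_file_blocks(tar_bytes):
    """Hash 64kB blocks of file data, realigning at the start of each file."""
    found = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar:
            if not member.isfile() or member.size < BLOCK:
                continue  # files under 64kB are left to the base file
            data = tar.extractfile(member).read()
            for off, digest in hash_blocks(data):
                # offset_data is where this file's data starts in the tar
                # stream, i.e. just after its 512-byte header block.
                found.setdefault(digest, (member.name, member.offset_data + off))
    return found


def match_image(image_bytes, deb_hashes):
    """Split image block offsets into those found in a .deb and the rest."""
    matched, unmatched = {}, []
    for off, digest in hash_blocks(image_bytes):
        if digest in deb_hashes:
            matched[off] = deb_hashes[digest]
        else:
            unmatched.append(off)
    return matched, unmatched
```

Blocks listed in `unmatched` are the ones that would be compressed into dloop/base; blocks in `matched` carry the tar-stream position that the repacking step needs.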
- A kernel driver is required, based on the existing kernel cloop driver. This existing driver can be modified slightly to accept reading from multiple input files, instead of just one. Small adjustments will have to be made to its structure handling so that it selects the correct source file to find a given block in, before seek()ing to it.
- A tool is required that can md5sum each 64kB block within the big filesystem image.
- A tool is required that can extract data.tar.gz from a .deb, gunzip it and parse the tar stream to calculate the md5sum of each 64kB block that is part of a file.
- A tool is required that can compare matching blocks.
- A tool is required that can compress each unmatched block from the filesystem image and record the offset.
- A tool is required that can recompress a tar-stream such that the gzip-stream/zlib data is restarted on each 64kB block that is matched to the big file, recording the offsets.
- A tool is required that can repack the .deb.
- A tool is required that can adjust the recorded offsets based on the new location of the data.tar.gz within the .deb.
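To make the recompression and offset-adjustment tools more concrete, here is a user-space sketch. It assumes each restart point is simply a new gzip member, and that the offset of data.tar.gz is found by walking the ar container of the .deb; the function names are hypothetical, and the real driver would perform the final read in the kernel.

```python
import gzip
import zlib

BLOCK = 64 * 1024  # block size used throughout the text


def recompress(tar_bytes, wanted):
    """Recompress tar_bytes as concatenated gzip members.

    wanted: sorted, non-overlapping tar offsets of the 64kB blocks that
    the filesystem image references.
    Returns (gzip_bytes, {tar_offset: offset_in_gzip_bytes}).
    """
    out = bytearray()
    offsets = {}
    pos = 0
    for start in wanted:
        if start > pos:
            # Ordinary run of tar data up to the next referenced block.
            out += gzip.compress(tar_bytes[pos:start])
        offsets[start] = len(out)  # restart point for this block
        out += gzip.compress(tar_bytes[start:start + BLOCK])
        pos = start + BLOCK
    if pos < len(tar_bytes):
        out += gzip.compress(tar_bytes[pos:])
    return bytes(out), offsets


def read_block(blob, offset):
    """Decode only the single gzip member that starts at offset."""
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS: gzip framing
    return d.decompress(blob[offset:])


def ar_member_offset(deb_bytes, name):
    """Return the byte offset of a member's data inside an ar archive."""
    if deb_bytes[:8] != b"!<arch>\n":
        raise ValueError("not an ar archive")
    pos = 8
    while pos + 60 <= len(deb_bytes):
        header = deb_bytes[pos:pos + 60]
        member = header[:16].rstrip(b" /").decode()
        size = int(header[48:58].decode().strip())
        if member == name:
            return pos + 60
        pos += 60 + size + (size & 1)  # member data is padded to even length
    raise KeyError(name)
```

The offset stored in the index would then be `ar_member_offset(deb, "data.tar.gz")` plus the value recorded by `recompress`, so that `read_block` can be pointed directly at the repacked .deb.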
A nice side-effect of dloop is that duplicate file data would be combined, but in a way still compatible with the dm-snapshot capability used to allow a writeable LiveCD.
Only the .deb files presented would be searched for duplicates. For example, the following options are possible:
- Put only base .deb packages on the CD. The smaller amount of duplicate data will be used where possible, but the main dloop/base will still end up containing most of the filesystem and be greater than 400MB in size.
- Show the dloop creation program a list of .deb packages known to actually be in the filesystem (eg. only those in ubuntu-desktop, but not ship-seed).
It would be best to keep dloop simple at first; the following would be sensible additions:
- Support for uncompressed 64kB blocks (direct mapping onto an existing file).
- Gzip dloop/index. This is a binary file mostly full of zeros and self-similar data. It's likely to compress down from 4MB to less than 1MB. This would be uncompressed into kernel memory as the base data-structure.
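As a rough illustration of why the binary offset file should compress well, the sketch below packs index entries and gzips them. The entry layout (a 64-bit offset followed by a 32-bit file-number, big-endian) is purely an assumption for the example; the text above does not fix an on-disk format.

```python
import gzip
import struct

# Hypothetical entry layout: 64-bit offset + 32-bit file-number, big-endian.
ENTRY = struct.Struct(">QI")


def pack_index(entries):
    """Serialise (offset, file_number) pairs into a flat binary index."""
    return b"".join(ENTRY.pack(off, fileno) for off, fileno in entries)


def unpack_index(raw):
    """Inverse of pack_index: recover the list of (offset, file_number)."""
    return [ENTRY.unpack_from(raw, i) for i in range(0, len(raw), ENTRY.size)]


def save_index(entries):
    """Return the gzip-compressed form that would be stored on the CD."""
    return gzip.compress(pack_index(entries))


def load_index(blob):
    """Decompress and parse the index, as would happen once at mount time."""
    return unpack_index(gzip.decompress(blob))
```

Because most offsets share their high-order zero bytes and the file-numbers repeat, the packed table is highly redundant, which is what the gzip step exploits.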
Medium-level ideas that might be useful, even if the .debs weren't shipped:
- If the WinFLOSS installation software uses a zlib-compatible algorithm for compressing the installation files, teach the dloop creation software about this format, including how to find the start of files and how to repack them. This would bring massive gains, particularly with OpenOffice, where a lot of the file data is duplicated between Win32 and Unix.
- In the case of Firefox and OpenOffice, both use .jar files (aka .zip files) to store most of their internal files. These are unlikely to compress further and are likely to be stored. These could be mapped directly between the installer .cab and the copy in the filesystem image. Therefore it may just be worth presenting these as additional search files to the dloop packer and seeing if it finds the matches by itself.
- .zip files within a self-executing file may be compatible.
- Using a .tar.gz-compatible installer under Windows may be possible.
Complicated ideas with diminishing returns:
- Add support for longer compressed blocks than 64kB (eg. up to 512kB)
- Add support for bzip2-compressed blocks (also supported within the .deb format).
- Investigate "start decompressing at" and "skip" offsets to allow the use of non-repacked .deb files, at the cost of having to decode much longer runs of data (the whole data.tar.gz in the worst case, just for a small amount of information).
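The last idea can be sketched in user space as follows. The parameter names `start_at` and `skip` are illustrative only: `start_at` is the byte offset where a gzip stream begins in the unmodified file, and `skip` is how much decoded output must be thrown away before the wanted data is reached.

```python
import zlib


def read_range(gz_bytes, start_at, skip, length):
    """Decode a gzip stream from start_at, discard skip bytes of output,
    and return the next length bytes.

    Everything before the wanted range still has to be decompressed,
    which is the cost mentioned in the text.
    """
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS: gzip framing
    # max_length bounds the output, so decoding stops once enough is available.
    out = d.decompress(gz_bytes[start_at:], skip + length)
    return out[skip:skip + length]
```

With a non-repacked data.tar.gz, `start_at` would always be the start of the member, so a block near the end of a large package costs a decode of almost the whole stream.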
Created: PaulSladen, 2005-05-11. The idea has been swimming around in my head for several months.