|Line 4:||Line 4:|
|* '''Packages affected''': apt, apt-sync (to be created)||* '''Packages affected''': apt, apt-sync (to be added to Ubuntu)|
|Line 47:||Line 47:|
|however, deployed in Debian or Ubuntu at the current time, for reasons
unclear to LarsWirzenius (drafter of spec) at the time of writing.
|however, deployed in Debian or Ubuntu at the current time.|
|Line 53:||Line 52:|
|the server, various kinds of client systems, and how much benefit
it has as far as the amount of time downloaded.
|the archive mirrors, various kinds of client systems, and how much benefit
it has as far as the amount of bandwidth spent downloading.
|Line 65:||Line 64:|
| since the hardy ISO was created. There is several hundred megabytes of
them (FIXME: check this).
| since the hardy ISO was created. There are several hundreds or thousands
of megabytes of such updates.
|Line 72:||Line 71:|
| * We are mainly interested in optimizing the actual transfer and avoiding
big impacts on the mirrors. The client system is assumed to have sufficient
processing power. (If it turns out to be inefficient at the client end,
we'll optimize that later.)
|Line 74:||Line 77:|
The RPM world has presto, which relies on the package archive generating
delta-RPM packages. These are then downloaded normally, and applied by
the client differently from normal RPMs. This requires upgrades to happen
from a specific version, which the rsync/zsync/apt-sync approach avoids.
|Line 82:||Line 90:|
|For every package.|
|Line 83:||Line 92:|
|to the actual download.||to the actual download. For every package.|
|Line 88:||Line 97:|
| * How much data is transferred when updating Packages files daily, for
a development branch of Ubuntu.
|Line 96:||Line 107:|
| * Create a KVM image with hardy, without any security updates, but with
all packages installed that have security updates.
* Set up way to measure bandwidth usage in/out from the KVM guest.
* Install security upgrades in various ways, measure bandwidth use.
| * Find a way to measure the bandwidth used to update a package with apt-sync.
Preferably without actually transferring files. If necessary, just use zsync
* Measure the bandwidth to download all security updates, both plain HTTP
and via apt-sync.
* If necessary, re-pack the .debs so that apt-sync can deal with them, by
using gzip --rsyncable. Test also bzip2.
|Line 121:||Line 136:|
| * Generate .aptsync files for every deb. Count their total size.
* Also tabulate combined size of .aptsync files per source package.
| * Generate .aptsync files for every deb. Count their total size, both
actual file size (st_size) and disk usage (st_blocks).
* Also tabulate combined size of .aptsync files per source package:
cat them to one file, check both st_size and st_blocks.
* Find snapshots of Packages files for as long a period of time as possible,
from karmic. At least ten days' worth of them.
* Measure the amount of data transferred when updating from oldest to newest
version of the Packages file, both for plain HTTP and for zsync.
|Line 126:||Line 150:|
|There should be no UI changes.||There should be no UI changes, except that apt-get and Synaptic and
update-manager might be modified to tell how much data apt-sync managed
to NOT transfer. This would allow users to see that it is working.
|Line 133:||Line 159:|
|The apt-sync program should need fairly little changes to get it usable,
if at all.
|The apt-sync program should need fairly little changes to get it usable.|
|Line 138:||Line 163:|
|be a useful thing.||be a useful thing. For testing, the .aptsync files can be provided by
different servers than the usual mirrors.
|Line 157:||Line 183:|
|# It's important that we are able to test new features, and demonstrate
# them to users. Use this section to describe a short plan that anybody
# can follow that demonstrates the feature is working. This can then be
# used during testing, and to show off after release. Please add an entry
# to http://testcases.qa.ubuntu.com/Coverage/NewFeatures for tracking test
# This need not be added or completed until the specification is nearing beta.
== Unresolved issues ==
== BoF agenda and discussion ==
* deltadebs: between specific versions
* apt-sync (https://code.launchpad.net/apt-sync): from any installed version to whatever is to be downloaded
* consider xdeltas (see comments from Rusty below)
delta rpms are kinda screwy:
* have to make an assumption about what version you have, gets quickly unmanageable with large numbers of changes
* requires pushing many deltas onto the archive
* zsync (don't need an rsync server, works with plain HTTP)
- generates a file with checksums of every block of the file on the server; the size of that file depends on the block size (will be about .3% of original file)
- in theory we could put zsync files on every server
- debs are compressed though, so a small change at beginning might fall apart
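The dependence of the checksum file's size on the block size can be illustrated with a little arithmetic. This is only a sketch: the 6 bytes of checksum data per block is an assumed, illustrative figure and not zsync's actual on-disk format, and any fixed header is ignored.

```python
def signature_overhead(block_size: int, bytes_per_block: int) -> float:
    """Fraction of the original file size taken up by per-block checksums.

    bytes_per_block is a hypothetical figure chosen for illustration.
    """
    return bytes_per_block / block_size


if __name__ == "__main__":
    for block_size in (1024, 2048, 4096):
        overhead = signature_overhead(block_size, bytes_per_block=6)
        print(f"block size {block_size}: {overhead:.2%} of the original file")
```

Under these assumptions a 2048-byte block gives roughly the .3% mentioned in the notes; halving the block size doubles the overhead.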
Proof that this helps is needed:
* google summer of code project 3 years ago, unused
* prototype works
* we need benchmarks (bandwidth saved, how big impact on web server)
* mirror admins would not like lots of bandwidth dumped on them
* we're prioritizing bandwidth for user rather than mirror, although it's not necessarily a tradeoff
* even with free bandwidth you wait less
* mostly downloading .debs and upgrades
If the user doesn't have the original .deb, we can still rebuild a deb from what is installed on the system with dpkg-repack:
* takes files you have installed and creates new .deb (doesn't have to be identical, rsync can fix it)
Can also rsync packages.gz (preferably with zsync on client side - not all mirrors support rsync (also some mirrors have rsyncd client limits))
* security updates benefit more from this since they can be pushed faster
gzip by default isn't very friendly with rsync, but they have a gzip --rsyncable option
* gzip has this option, but lzma doesn't
* question then becomes are we favoring initial download or people doing upgrade
Things to research:
* does bzip2 rsync well?
* reprepro and (other apt repo software) won't be extended initially (though we'll file bugs once we get this working in launchpad)
* does apt-sync sync packages.gz, or just package files?
* bandwidth for archive mirror
* bandwidth for client
* decompress time for client
* cpu/memory impact on mirror httpd
* resource requirements for archive to generate .aptsync files
* size impact of .aptsync files for whole mirror
* if it is too expensive in memory/cpu for eg netbooks, could make it optional
* optionally have apt-sync report back benchmark data so we can measure things
for all users
* James is concerned about number of files in archive
Package description translations file should be synced as well
If we wanted to go really wild, could have mirrors provide unpacked debs
- would be useful for user to have multiple mirrors listed; if identical then could sync from both (especially if one is slow)
- also will be better for user when one mirror is failing/really slow
- James wants this
- apt needs graceful fallback, but this is a general problem, not specific to apt-sync
Author of apt-sync, zsync authors are "around and helpful"
- reach out, maybe they'll help with benchmarks
- zsync may need to be modified to handle .deb (which are tars in ar)
We should package up apt-sync into Ubuntu and get people to try this in real
situations, but make it optional until it's proven to work.
Provide both Packages.gz and Packages.lzma
* user can download Packages.lzma first time, then zsync to the Packages.gz subsequently
* start with Packages file as it's clear and easy for everyone (michael's suggestion)
try zsync on cdimage
zsync.ubuntu.com to provide .zsync files (only)
- james is concerned about cpu impact and number of files in the archive
- we could group .aptsync files into single files per source package
 IRC Transcript w/Rusty
<rusty> hughhalf: pong!
<hughhalf> rusty: did you happen to look at whether bzip could be given a --rsyncable flag back in the day ?
<rusty> hughhalf: tridge and I discussed it, the problem is that bzip uses a fixed (900k, default) block size. We didn't follow it further.
<rusty> hughhalf: with a centralized system like ubuntu, I think that xdeltas (or something like) makes more sense.
<hughhalf> rusty: Indeed, however for first pass think desire is to run with something that won't affect mirrors overmuch
and there's some package/ssytem that allows us to do this (sounds reminiscent of rproxy)
<rusty> hughhalf: well, I think given the scale, deltas are going to be more optimal.
<rusty> hughhalf: given that most mirrors don't have rsync access, esp.
<hughhalf> rusty: Yep
Lars et. al. say hello and thanks :)
<rusty> hughhalf: nw!
<hughhalf> rusty: Took liberty of cut and paste salient bits of above in the gobby doc we're using for this
<rusty> hughhalf: of course, not a problem!
When an older version of the deb is not in the cache, do we want to use dpkg-repack? We could just tar | gzip the files installed by the package and some of the control files, saving a few seconds.
How about using zsync in a mode that looks inside gzipped data instead of handling it as a raw bitstream? In that case, we could just tar (not gzip) the installed files, or perhaps even pass the list of installed files as separate inputfiles to zsync. It would probably have to gzip the data in the end, though.
Supporting an apt method outputting debs in an unpacked form would be interesting. Zsync is able to download gzipped data, unpack it, merge with local unpacked data and output unpacked data. Gzipping it for apt, only to have it gunzipped, could be avoided. We could add signed checksums for the unpacked data.
(fastest of seven runs, [[http://heh.fi/tmp/apt-sync/|script and raw data]])
|| |||| '''devicekit''' |||| '''coreutils''' |||| '''openoffice.org-core''' ||
|| '''dpkg-repack''' ||<)> 0.870 s ||<)> 100 % ||<)> 7.100 s ||<)> 100 % ||<)> 115.236 s ||<)> 100 % ||
|| '''tar | gzip''' ||<)> 0.313 s ||<)> 36 % ||<)> 5.470 s ||<)> 77 % ||<)> 98.316 s ||<)> 85 % ||
|| '''tar''' ||<)> 0.228 s ||<)> 26 % ||<)> 0.820 s ||<)> 12 % ||<)> 11.990 s ||<)> 10 % ||
Another thing: if foo-1.6 Replaces: bar-1.5, give bar-1.5 as an inputfile to zsync. We know the Replaces in advance from the Packages file. What’s the overhead of parsing it, though?
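To get a feel for what looking up Replaces involves, here is a minimal sketch of a stanza parser. The sample stanzas and file names are made up for illustration, and a real implementation would more likely reuse apt's own parser than roll its own.

```python
def parse_stanzas(text: str) -> list[dict[str, str]]:
    """Parse control-file style stanzas separated by blank lines."""
    stanzas = []
    for block in text.strip().split("\n\n"):
        fields: dict[str, str] = {}
        key = None
        for line in block.splitlines():
            if line[:1] in (" ", "\t") and key:
                fields[key] += "\n" + line.strip()  # continuation line
            elif ":" in line:
                key, _, value = line.partition(":")
                fields[key] = value.strip()
        stanzas.append(fields)
    return stanzas


def replaced_packages(stanza: dict[str, str]) -> list[str]:
    """Package names listed in Replaces, with version constraints dropped."""
    value = stanza.get("Replaces", "")
    return [item.split("(")[0].strip() for item in value.split(",") if item.strip()]


# Hypothetical sample data, not taken from a real archive.
SAMPLE = """\
Package: foo
Version: 1.6
Replaces: bar (<< 1.6), baz
Filename: pool/main/f/foo/foo_1.6_all.deb

Package: bar
Version: 1.5
Filename: pool/main/b/bar/bar_1.5_all.deb
"""

if __name__ == "__main__":
    for stanza in parse_stanzas(SAMPLE):
        print(stanza["Package"], "replaces", replaced_packages(stanza))
```

Timing this over a full Packages file would give a first estimate of the parsing overhead asked about above.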
Some additional benchmarking:
I generated foo-0.tar:
-rw-r--r-- ion/users 18476682 2009-06-03 18:56 part0
-rw-r--r-- ion/users 374325 2009-06-03 18:56 part1a
-rw-r--r-- ion/users 11813697 2009-06-03 18:56 part2
and foo-1.tar:
-rw-r--r-- ion/users 464163 2009-06-03 18:57 part1b
-rw-r--r-- ion/users 11813697 2009-06-03 18:56 part2
-rw-r--r-- ion/users 18476682 2009-06-03 18:56 part0
part0, part1a, part1b and part2 were all generated from /dev/urandom. The change from foo-0.tar to foo-1.tar is that part1a was replaced with part1b and the order of the members was changed.
I benchmarked zsyncing foo-1 with foo-0 as the inputfile:
(fastest of seven runs, [[http://heh.fi/tmp/zsynctest/|script and raw data]])
|| '''get''' || '''process''' || '''output''' |||| '''time''' |||| '''fetched''' || '''checksum''' ||
|| tar.gz || unpacked || tar ||<)> 1.596 s ||<)> 100 % ||<)> 466684 ||<)> 100 % || only results in unpacked tar, must checksum it ||
|| tar.gz || unpacked || tar.gz ||<)> 6.125 s ||<)> 384 % ||<)> 466684 ||<)> 100 % || checksum of resulting tar.gz may not match, must checksum unpacked tar inside ||
|| tar.gz || raw data || tar.gz ||<)> 2.447 s ||<)> 153 % ||<)> 4958684 ||<)> 1063 % || checksum of tar.gz guaranteed to match ||
|| tar.gz --rsyncable || raw data || tar.gz --rsyncable ||<)> 1.793 s ||<)> 112 % ||<)> 479358 ||<)> 103 % || checksum guaranteed to match, but if inputfile and outputfile not gzipped exactly the same way, must fetch big amount of data ||
||||||...with inputfile gzipped without --rsyncable ||<)> 2.445 s ||<)> 153 % ||<)> 16319311 ||<)> 3497 % || ||
* Method 0: Infrastructure changes: must checksum data inside gzip. Also must make apt accept the unpacked deb.
* Method 1: Infrastructure changes: must checksum data inside gzip. Also overhead from zsync having to gzip data before passing it to apt.
* Method 2: We lose much of the benefit of zsync, defeating the very point of this spec. Also overhead from gzipping local package if the user doesn’t have the inputfile in cache.
* Method 3: Overhead from gzipping local package if the user doesn’t have the inputfile in cache. Also must be very careful to have all the debs gzipped in exactly the same way, otherwise we lose much of the benefit of zsync.
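The percentage columns in the zsync table are each value relative to the first row (method 0). A short sketch that recomputes them from the raw numbers in the table; the function and variable names are only for illustration.

```python
def relative_percent(value: float, baseline: float) -> int:
    """Value as a whole-number percentage of the baseline."""
    return round(100 * value / baseline)


# (time in seconds, bytes fetched), copied from the table rows in order.
ROWS = [
    (1.596, 466684),
    (6.125, 466684),
    (2.447, 4958684),
    (1.793, 479358),
    (2.445, 16319311),
]

if __name__ == "__main__":
    base_time, base_fetched = ROWS[0]
    for seconds, fetched in ROWS:
        print(relative_percent(seconds, base_time),
              relative_percent(fetched, base_fetched))
```

Read this way, method 2 fetches more than ten times as much data as methods 0 and 1, and a mismatch in gzip options under method 3 costs about 35 times as much.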
How method 0 could be implemented:
* Generate such debs that contain debian-binary, control.'''tar''', data.'''tar'''; i.e. do not gzip the members. It seems dpkg already supports the format (since 2004).
* Put deb.'''gz''' files into archive. Thus, the archive size would not change substantially.
* Put “Filename: pool/...deb.gz” into Packages.
* Keep the Packages checksums as the checksums of the deb within the deb.gz.
* Make it possible for apt methods to gunzip the deb.gz on the fly, outputting just debs.
* Have apt gunzip the deb if the method did not do that.
* ''/var/cache/apt/archives would gain size.''
* Along with foo.deb.gz, put foo.deb.zsync into archive.
* Have the zsync method use zsync’s existing functionality to use /var/cache/apt/archives/foo-old.deb or equivalent as inputfile, fetch the needed parts of foo-new.deb.gz from the archive and output the gunzipped deb.
Launchpad Entry: rsync-based-deb-downloads
Packages affected: apt, apt-sync (to be added to Ubuntu)
When updating the list of available packages, or upgrading to a new version of a package, most of the data is already on the system. Using an rsync-based download method should significantly speed up the download. The apt-sync program already does this, or at least most of it. apt-sync should be packaged and uploaded to Ubuntu karmic, and benchmarked to evaluate its impact on the archive, mirrors, and various kinds of systems running Ubuntu.
The apt-sync package is now included in Ubuntu and will make upgrades faster.
Bandwidth is scarce or expensive in large parts of the world. Users may have Internet access only via a dial-up modem, GPRS, or 3G. Even users on fast connections may be paying by the byte, or have monthly download limits. Avoiding needless downloads is a worthy goal and would benefit a lot of Ubuntu users.
When downloading a new version of the Packages file, it will almost always be only slightly different from the one on the system already. By using the rsync algorithm, it is possible to download only the changed parts.
Likewise, when upgrading packages to newer versions, the new version is often mostly identical to the old one, with parts of it changed. Again, it would be good to only download the changed parts.
The rsync program itself is unsuitable for this, primarily because it requires a lot of server-side resources. However, the zsync implementation of the rsync algorithm works around this: the server is a standard HTTP server, and a new data file is added next to each file (containing the rsync "signature" data). All the rest of the computation then happens on the client, which uses HTTP Range requests to fetch only the needed parts of the file from the server.
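The client-side idea can be sketched in a few lines. This is a simplified illustration, not zsync's or apt-sync's actual code: the tiny block size, the use of SHA-256, and all function names are assumptions made for the example. Real tools pair a cheap rolling checksum with a stronger one instead of hashing every window.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, so the example is easy to follow


def block_signatures(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Server side: one checksum per fixed-size block of the new file."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def missing_ranges(old: bytes, signatures: list[str], new_len: int,
                   block_size: int = BLOCK_SIZE) -> list[tuple[int, int]]:
    """Client side: byte ranges of the new file that are not in the old one."""
    # Checksum a block-sized window at every offset of the old file, so
    # blocks that merely moved are still found.
    have = {
        hashlib.sha256(old[i:i + block_size]).hexdigest()
        for i in range(0, max(len(old) - block_size, 0) + 1)
    }
    ranges: list[tuple[int, int]] = []
    for index, sig in enumerate(signatures):
        if sig in have:
            continue
        start = index * block_size
        end = min(start + block_size, new_len) - 1  # inclusive, as in HTTP
        if ranges and ranges[-1][1] + 1 == start:
            ranges[-1] = (ranges[-1][0], end)  # merge adjacent ranges
        else:
            ranges.append((start, end))
    return ranges


def range_header(ranges: list[tuple[int, int]]) -> str:
    """Format the ranges as the value of an HTTP Range request header."""
    return "bytes=" + ",".join(f"{a}-{b}" for a, b in ranges)


if __name__ == "__main__":
    old = b"aaaabbbbccccdddd"
    new = b"aaaaXXXXYYYYdddd"
    needed = missing_ranges(old, block_signatures(new), len(new))
    print(range_header(needed))  # bytes=4-11
```

Only the two changed blocks are requested; the unchanged first and last blocks are copied from the local file.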
The apt-sync program was written in 2006 to implement this. It is not, however, deployed in Debian or Ubuntu at the current time.
apt-sync should be packaged for Debian and Ubuntu, and deployed in an experimental manner, to gather feedback from users. Additionally, some benchmarks should be run to see how big an impact it has on the archive mirrors and on various kinds of client systems, and how much download bandwidth it saves.
apt-sync may require changes to combine .aptsync files per source package: otherwise the number of files in the archive will increase by the number of .deb files, i.e., probably about a quarter million, and this is a big impact on the archive and mirror network.
- Alfred is connected to the Internet over GPRS. He has just installed hardy and wants to install all security updates that have happened since the hardy ISO was created. There are several hundreds or thousands of megabytes of such updates.
- Alfred's system has installed all packages for which there are security updates.
- We are mainly interested in optimizing the actual transfer and avoiding big impacts on the mirrors. The client system is assumed to have sufficient processing power. (If it turns out to be inefficient at the client end, we'll optimize that later.)
The RPM world has presto, which relies on the package archive generating delta-RPM packages. These are then downloaded normally, and applied by the client differently from normal RPMs. This requires upgrades to happen from a specific version, which the rsync/zsync/apt-sync approach avoids.
At least the following results need to be produced by the benchmarks:
- How much data is transferred by plain HTTP (full download) and by apt-sync, for every package. This should be compared to the size of the delta produced by xdelta or similar tools (compressed delta of uncompressed data).
- Time it takes the client to reconstruct the new .deb, in addition to the actual download, for every package.
- CPU/memory impact on httpd resulting from a lot of clients doing HTTP Range requests.
- Size impact on mirror from .aptsync files: both number of files, and their size.
- How much data is transferred when updating Packages files daily, for a development branch of Ubuntu.
For bandwidth use:
- Take snapshot of hardy and hardy-security archives.
- Find a way to measure the bandwidth used to update a package with apt-sync. Preferably without actually transferring files. If necessary, just use zsync for this.
- Measure the bandwidth to download all security updates, both plain HTTP and via apt-sync.
- If necessary, re-pack the .debs so that apt-sync can deal with them, by using gzip --rsyncable. Test also bzip2.
For .deb reconstruction:
- Instrument apt-sync to measure the amount of time and CPU (and memory, if feasible) it takes to get the updated .deb onto the system, measuring separately pre-download, download, and post-download phases, as well as totals.
- Run instrumented apt-sync on various clients: a high-end desktop machine, a "normal" laptop, a netbook.
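A per-phase measurement of this kind could look roughly like the following. This is a generic sketch, not a patch to apt-sync: the phase names follow the list above, and the workload is a placeholder. Note that `time.process_time()` only covers the current process, so CPU used by child processes would need separate accounting.

```python
import time
from contextlib import contextmanager


@contextmanager
def phase(name, results):
    """Record wall-clock and CPU time spent inside the with-block."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of this process only
    try:
        yield
    finally:
        results[name] = {
            "wall_s": time.perf_counter() - wall_start,
            "cpu_s": time.process_time() - cpu_start,
        }


def totals(results):
    """Sum each measurement over all recorded phases."""
    return {
        key: sum(r[key] for r in results.values())
        for key in ("wall_s", "cpu_s")
    }


if __name__ == "__main__":
    results = {}
    for name in ("pre-download", "download", "post-download"):
        with phase(name, results):
            sum(range(100_000))  # placeholder for the real work
    print(results)
    print(totals(results))
```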
Impact on server resources:
- Set up a machine with httpd and an Ubuntu mirror of hardy and hardy-security.
- Set up as many client machines as possible to run upgrades, with and without apt-sync, preferably enough to saturate the server machine's LAN connection.
- Measure CPU and memory load on server.
Archive size impact:
- Generate .aptsync files for every deb. Count their total size, both actual file size (st_size) and disk usage (st_blocks).
- Also tabulate combined size of .aptsync files per source package: cat them to one file, check both st_size and st_blocks.
- Find snapshots of Packages files for as long a period of time as possible, from karmic. At least ten days' worth of them.
- Measure the amount of data transferred when updating from oldest to newest version of the Packages file, both for plain HTTP and for zsync.
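The difference between st_size and st_blocks in the archive size measurements can be checked with a small script along these lines. It is a sketch for POSIX systems only, and the helper names are illustrative.

```python
import os
import shutil
import tempfile


def size_and_usage(path):
    """Return (apparent size in bytes, allocated bytes) for one file."""
    st = os.stat(path)
    # POSIX defines st_blocks in 512-byte units, regardless of the
    # filesystem's own block size. The field is not available on Windows.
    return st.st_size, st.st_blocks * 512


def totals(paths):
    """Sum apparent size and disk usage over many files."""
    measured = [size_and_usage(p) for p in paths]
    return sum(s for s, _ in measured), sum(u for _, u in measured)


def combined(paths):
    """Concatenate the files into one temporary file and measure that."""
    with tempfile.NamedTemporaryFile(delete=False) as out:
        for p in paths:
            with open(p, "rb") as src:
                shutil.copyfileobj(src, out)
        out.flush()
        os.fsync(out.fileno())  # make sure blocks are actually allocated
    try:
        return size_and_usage(out.name)
    finally:
        os.unlink(out.name)
```

Many small files typically each occupy at least one filesystem block, so comparing `totals(paths)` with `combined(paths)` for the files of one source package shows how much disk usage grouping them would save.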
There should be no UI changes, except that apt-get, Synaptic, and update-manager might be modified to report how much data apt-sync managed to NOT transfer. This would allow users to see that it is working.
The apt package may require minor changes to support apt-sync as a download method, but probably not.
The apt-sync program should need fairly few changes to get it usable.
Launchpad will require code to generate .aptsync files for the entire archive, but this can happen later, after apt-sync has been proven to be a useful thing. For testing, the .aptsync files can be provided by different servers than the usual mirrors.
If the user is using a mirror without .aptsync files, everything should work just fine, the way it works now. Once .aptsync files get added to the mirrors, upgrade downloads should happen faster. The archives can add only those .aptsync files that are deemed necessary (e.g., only for -security).
If apt-sync turns out to work well, it should be installed and enabled by default, and users should not have to change anything.
Suggestion: we modify apt and update-manager to report how much data they managed to NOT transfer, thanks to apt-sync. This will subtly tell users that things work well.