AptSyncInKarmicSpec

Differences between revisions 5 and 8 (spanning 3 versions)
Revision 5 as of 2009-06-03 17:04:27
Size: 15742
Editor: xdsl-237-224
Comment: A summary of the last benchmark
Revision 8 as of 2009-06-03 17:40:53
Size: 15930
Editor: xdsl-237-224
Comment: Update benchmark with new test case

Summary

When updating the list of available packages, or upgrading to a new version of a package, most of the data is already on the system. Using an rsync-based download method should significantly speed up the download. The apt-sync program already does this, or at least most of it. apt-sync should be packaged and uploaded to Ubuntu karmic, and benchmarked to evaluate its impact on the archive, mirrors, and various kinds of systems running Ubuntu.

Release Note

The apt-sync package is now included in Ubuntu and will make upgrades faster.

Rationale

Bandwidth is scarce or expensive in large parts of the world. Users may have Internet access only via a dial-up modem, GPRS, or 3G. Even users on fast connections may be paying by the byte, or have monthly download limits. Avoiding needless downloads is a worthy goal and would benefit a lot of Ubuntu users.

When downloading a new version of the Packages file, it will almost always be only slightly different from the one on the system already. By using the rsync algorithm, it is possible to download only the changed parts.

Likewise, when upgrading packages to newer versions, the new version is often mostly identical to the old one, with parts of it changed. Again, it would be good to only download the changed parts.

The rsync program itself is unsuitable for this, primarily because it requires a lot of server-side resources. However, the zsync implementation of the rsync algorithm works around this: the server is a standard HTTP server, and a new data file is added (containing the rsync "signature" data). All the rest of the computation then happens on the client, which uses HTTP Range requests to fetch parts of the file from the server.
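The division of labour described above can be sketched as follows. This is a minimal illustration of the idea, not zsync's actual code or file format: the function names, the tiny block size, and the choice of Adler-32 plus SHA-256 as weak and strong checksums are all assumptions made for the example.

```python
import hashlib
import zlib

BLOCK = 4  # tiny block size for illustration only; real tools use much larger blocks


def make_signature(data, block=BLOCK):
    """Server side, done once: a (weak, strong) checksum pair per fixed-size block."""
    return [
        (zlib.adler32(data[i:i + block]), hashlib.sha256(data[i:i + block]).digest())
        for i in range(0, len(data), block)
    ]


def plan_download(signature, local, target_len, block=BLOCK):
    """Client side: find target blocks that already exist anywhere in `local`.

    Returns (have, ranges). `have` maps target block index -> offset in `local`;
    `ranges` lists inclusive (start, end) byte ranges that must still be fetched.
    """
    wanted = {}
    for idx, (weak, strong) in enumerate(signature):
        wanted.setdefault(weak, []).append((idx, strong))

    have = {}
    # A real implementation rolls the weak checksum from one offset to the
    # next; recomputing it at every offset keeps this sketch short.
    for off in range(0, len(local) - block + 1):
        chunk = local[off:off + block]
        for idx, strong in wanted.get(zlib.adler32(chunk), []):
            if idx not in have and hashlib.sha256(chunk).digest() == strong:
                have[idx] = off

    ranges = []
    for idx in range(len(signature)):
        if idx in have:
            continue
        start = idx * block
        end = min(start + block, target_len) - 1
        if ranges and ranges[-1][1] + 1 == start:
            ranges[-1] = (ranges[-1][0], end)  # merge adjacent missing blocks
        else:
            ranges.append((start, end))
    return have, ranges


def range_header(ranges):
    """Format the missing ranges as the value of an HTTP Range request header."""
    return "bytes=" + ",".join(f"{start}-{end}" for start, end in ranges)


old = b"AAAABBBBCCCCDDDD"   # what the client already has
new = b"AAAAXXXXCCCCDDDD"   # what the server offers
have, ranges = plan_download(make_signature(new), old, len(new))
print(have, range_header(ranges))
```

Only the signature is computed on the server side, and only once per file; the scan of the local file and the choice of ranges happen on the client, which is why a plain HTTP server suffices.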

The apt-sync program was written in 2006 to implement this. It is not, however, deployed in Debian or Ubuntu at the current time, for reasons unclear to LarsWirzenius (drafter of spec) at the time of writing.

apt-sync should be packaged for Debian and Ubuntu, and deployed in an experimental manner, to gather feedback from users. Additionally, some benchmarks should be run to see how big an impact it has on the server and on various kinds of client systems, and how much it reduces download time and the amount of data downloaded.

apt-sync may require changes to combine .aptsync files per source package: otherwise the number of files in the archive will increase by the number of .deb files, i.e., probably about a quarter million, which would have a big impact on the archive and mirror network.

User stories

  • Alfred is connected to the Internet over GPRS. He has just installed hardy and wants to install all security updates that have happened since the hardy ISO was created. There are several hundred megabytes of them (FIXME: check this).

Assumptions

  • Alfred's system has installed all packages for which there are security updates.

Design

Benchmarks

At least the following results need to be produced by the benchmarks:

  • How much data is transferred by plain HTTP (full download) and by apt-sync. This should be compared to the size of the delta produced by xdelta or similar tools (compressed delta of uncompressed data).
  • Time the client takes to reconstruct the new .deb, in addition to the actual download.
  • CPU/memory impact on httpd resulting from a lot of clients doing HTTP Range requests.
  • Size impact on mirror from .aptsync files: both number of files, and their size.

Implementation

Benchmarks

For bandwidth use:

  • Take snapshot of hardy and hardy-security archives.
  • Create a KVM image with hardy, without any security updates, but with all packages installed that have security updates.
  • Set up way to measure bandwidth usage in/out from the KVM guest.
  • Install security upgrades in various ways, measure bandwidth use.

For .deb reconstruction:

  • Instrument apt-sync to measure the amount of time and CPU (and memory, if feasible) it takes to get the updated .deb onto the system, measuring separately pre-download, download, and post-download phases, as well as totals.
  • Run instrumented apt-sync on various clients: a high-end desktop machine, a "normal" laptop, a netbook.
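One way the per-phase instrumentation could look is sketched below. It assumes the measurement is done from Python; the `PhaseTimer` class and the phase names are made up for this example and are not part of apt-sync. Memory is left out; on Unix, `resource.getrusage` could supply a peak figure.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock and CPU time per named phase."""

    def __init__(self):
        self.phases = {}  # name -> (wall seconds, CPU seconds)

    @contextmanager
    def phase(self, name):
        wall0, cpu0 = time.perf_counter(), time.process_time()
        try:
            yield
        finally:
            wall = time.perf_counter() - wall0
            cpu = time.process_time() - cpu0
            prev_wall, prev_cpu = self.phases.get(name, (0.0, 0.0))
            self.phases[name] = (prev_wall + wall, prev_cpu + cpu)

    def totals(self):
        """Return (total wall seconds, total CPU seconds) over all phases."""
        return (sum(w for w, _ in self.phases.values()),
                sum(c for _, c in self.phases.values()))


timer = PhaseTimer()
with timer.phase("pre-download"):
    sum(range(1000))          # stand-in for checksumming the local .deb
with timer.phase("download"):
    time.sleep(0.01)          # stand-in for the HTTP Range requests
with timer.phase("post-download"):
    sorted(range(1000))       # stand-in for reassembling the new .deb

for name, (wall, cpu) in timer.phases.items():
    print(f"{name}: {wall:.3f} s wall, {cpu:.3f} s CPU")
```

Note that `time.process_time` only counts CPU time of the current process, so work done in child processes would need to be measured separately.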

Impact on server resources:

  • Set up a machine with httpd and an Ubuntu mirror of hardy and hardy-security.
  • Set up as many client machines as possible to run upgrades, with and without apt-sync, preferably enough to saturate the server machine's LAN connection.
  • Measure CPU and memory load on server.

Archive size impact:

  • Generate .aptsync files for every deb. Count their total size.
  • Also tabulate combined size of .aptsync files per source package.
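The tabulation could be as simple as the sketch below. It assumes that something else has already produced (source package, .aptsync size in bytes) pairs, for example from the Packages file and a directory listing; the function name and the sample numbers are invented for illustration and are not measurements.

```python
from collections import defaultdict


def tabulate(entries):
    """entries: iterable of (source_package, aptsync_size_in_bytes) pairs.

    Returns (per_source, total_files, total_bytes), where per_source maps
    each source package to (number of .aptsync files, combined size).
    """
    per_source = defaultdict(lambda: [0, 0])
    for source, size in entries:
        per_source[source][0] += 1
        per_source[source][1] += size
    total_files = sum(count for count, _ in per_source.values())
    total_bytes = sum(size for _, size in per_source.values())
    return {s: tuple(v) for s, v in per_source.items()}, total_files, total_bytes


# Hypothetical input, not real archive data.
sample = [("coreutils", 1200), ("openoffice.org", 90000), ("openoffice.org", 45000)]
per_source, total_files, total_bytes = tabulate(sample)
print(per_source, total_files, total_bytes)
```

Comparing `total_files` with `len(per_source)` shows how many files would be saved by grouping .aptsync files per source package.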

UI Changes

There should be no UI changes.

Code Changes

The apt package may require minor changes to support apt-sync as a download method, but probably not.

The apt-sync program should need few changes, if any, to become usable.

Launchpad will require code to generate .aptsync files for the entire archive, but this can happen later, after apt-sync has been proven to be a useful thing.

Migration

If the user is using a mirror without .aptsync files, everything should work just fine, the way it works now. Once .aptsync files get added to the mirrors, upgrade downloads should happen faster. The archives can add only those .aptsync files that are deemed necessary (e.g., only for -security).

If apt-sync turns out to work well, it should be installed and enabled by default, and users should not have to change anything.

Test/Demo Plan

Suggestion: we modify apt and update-manager to report how much data they managed to NOT transfer, thanks to apt-sync. This will subtly tell users that things work well.


Unresolved issues

TBD.

BoF agenda and discussion

* deltadebs: between specific versions
* apt-sync (https://code.launchpad.net/apt-sync): from any installed version to whatever is to be downloaded
* consider xdeltas ([1] see comments from Rusty below)

Benchmark
=========

delta rpms are kinda screwy:
 * have to make an assumption about what version you have, gets quickly unmanageable with large numbers of changes
 * requires pushing many deltas onto the archive

rsync suggestion:
 * zsync (don't need an rsync server, works with plain HTTP)
   - generates a file with checksums of the blocks of each file on the server; size of that file depends on block size (will be about .3% of original file)
   - in theory we could put zsync files on every server
   - debs are compressed though, so a small change at beginning might fall apart

Proof that this helps is needed:


apt-sync possibility
 * google summer of code project 3 years ago, unused
 * prototype works
 * we need benchmarks (bandwidth saved, how big impact on web server)
   * mirror admins would not like lots of bandwidth dumped on them
   * we're prioritizing bandwidth for user rather than mirror, although it's not necessarily a tradeoff
 * even with free bandwidth you wait less
 * mostly downloading .debs and upgrades

If user doesn't have original .deb, we can still use original deb on system: dpkg-repack:
 * takes files you have installed and creates new .deb (doesn't have to be identical, rsync can fix it)

Can also rsync packages.gz (preferably with zsync on client side - not all mirrors support rsync (also some mirrors have rsyncd client limits))
 * security updates benefit more from this since they can be pushed faster


gzip by default isn't very friendly with rsync, but they have a gzip --rsyncable option
 * gzip has this option, but lzma doesn't
 * question then becomes are we favoring initial download or people doing upgrade


Things to research:
 * does bzip2 rsync well?
 * reprepro and (other apt repo software) won't be extended initially (though we'll file bugs once we get this working in launchpad)
 * does apt-sync sync packages.gz, or just package files?

Benchmark:
 * bandwidth for archive mirror
 * bandwidth for client
 * decompress time for client
 * cpu/memory impact on mirror httpd
 * resource requirements for archive to generate .aptsync files
 * size impact of .aptsync files for whole mirror
 * if it is too expensive in memory/cpu for eg netbooks, could make it optional
 * optionally have apt-sync report back benchmark data so we can measure things
   for all users
 * James is concerned about number of files in archive

Package description translations file should be synced as well

If we wanted to go really wild, could have mirrors provide unpacked debs

Multiple mirrors:
 - would be useful for user to have multiple mirrors listed; if identical then could sync from both (especially if one is slow)
 - also will be better for user when one mirror is failing/really slow
 - James wants this
 - apt needs graceful fallback, but this is a general problem, not specific to apt-sync

Author of apt-sync, zsync authors are "around and helpful"
 - reach out, maybe they'll help with benchmarks
 - zsync may need to be modified to handle .deb (which are tars in ar)

We should package up apt-sync into Ubuntu and get people to try this in real
situations, but make it optional until it's proven to work.

Provide both Packages.gz and Packages.lzma
 * user can download Packages.lzma first time, then zsync to the Packages.gz subsequently
 * start with Packages file as it's clear and easy for everyone (michael's suggestion)


try zsync on cdimage

zsync.ubuntu.com to provide .zsync files (only)
- james is concerned about cpu impact and number of files in the archive
- we could group .aptsync files into single files per source package

 [1] IRC Transcript w/Rusty
 <rusty> hughhalf: pong!
<hughhalf> rusty: did you happen to look at whether bzip could be given a --rsyncable flag back in the day ?
<rusty> hughhalf: tridge and I discussed it, the problem is that bzip uses a fixed (900k, default) block size.  We didn't follow it further.
<hughhalf> ok
 https://blueprints.edge.launchpad.net/ubuntu/+spec/rsync-based-deb-downloads
<rusty> hughhalf: with a centralized system like ubuntu, I think that xdeltas (or something like) makes more sense.
<hughhalf> rusty: Indeed, however for first pass think desire is to run with something that won't affect mirrors overmuch
 and there's some package/ssytem that allows us to do this (sounds reminiscent of rproxy)
<rusty> hughhalf: well, I think given the scale, deltas are going to be more optimal.
<hughhalf> 'k
<rusty> hughhalf: given that most mirrors don't have rsync access, esp.
<hughhalf> rusty: Yep
 Lars et. al. say hello and thanks :)
<rusty> hughhalf: nw!
<hughhalf> rusty: Took liberty of cut and paste salient bits of above in the gobby doc we're using for this
<rusty> hughhalf: of course, not a problem!
<hughhalf> tnx

JohanKiviniemi’s ramblings:

When an older version of the deb is not in the cache, do we want to use dpkg-repack? We could just tar | gzip the files installed by the package and some of the control files, saving a few seconds.
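A rough sketch of the "just tar (and maybe gzip) the installed files" idea follows. It assumes the list of paths has already been obtained, for example from `dpkg -L`, and it ignores the control files; the function name is made up for this example.

```python
import os
import tarfile


def pack_installed_files(paths, out_path, gzip=True):
    """Write the given installed files into a tar (optionally gzipped) archive.

    Paths that no longer exist are skipped. Entries are added one by one,
    without recursion, because the list is expected to name every file.
    """
    mode = "w:gz" if gzip else "w"
    with tarfile.open(out_path, mode) as tar:
        for path in sorted(paths):
            if os.path.lexists(path):
                tar.add(path, recursive=False)
    return out_path
```

The result does not have to be byte-identical to the original package contents: it only serves as an inputfile from which matching blocks can be reused.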

How about using zsync in a mode that looks inside gzipped data instead of handling it as a raw bitstream? In that case, we could just tar (not gzip) the installed files, or perhaps even pass the list of installed files as separate inputfiles to zsync. It would probably have to gzip the data in the end, though.

Supporting an apt method outputting debs in an unpacked form would be interesting. Zsync is able to download gzipped data, unpack it, merge with local unpacked data and output unpacked data. Gzipping it for apt, only to have it gunzipped, could be avoided. We could add signed checksums for the unpacked data.

Some benchmarking:

(fastest of seven runs, script and raw data)

|| |||| devicekit |||| coreutils |||| openoffice.org-core ||
|| dpkg-repack || 0.870 s || 100 % || 7.100 s || 100 % || 115.236 s || 100 % ||
|| tar | gzip || 0.313 s || 36 % || 5.470 s || 77 % || 98.316 s || 85 % ||
|| tar || 0.228 s || 26 % || 0.820 s || 12 % || 11.990 s || 10 % ||

Another thing: if foo-1.6 Replaces: bar-1.5, give bar-1.5 as an inputfile to zsync. We know the Replaces in advance from the Packages file. What’s the overhead of parsing it, though?
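Extracting the Replaces names from one stanza of the Packages file could look roughly like this. It is a simplified sketch: the helper names are invented, field names are matched case-sensitively, and the sample stanza is hypothetical.

```python
import re


def parse_stanza(text):
    """Parse one control-file stanza into a dict of field name -> value."""
    fields = {}
    key = None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and key:
            fields[key] += "\n" + line.strip()   # continuation line
        elif ":" in line:
            key, value = line.split(":", 1)
            fields[key] = value.strip()
    return fields


def replaced_packages(stanza_text):
    """Return the package names listed in the Replaces field, without versions."""
    replaces = parse_stanza(stanza_text).get("Replaces", "")
    names = []
    for item in replaces.split(","):
        item = item.strip()
        if item:
            names.append(re.split(r"[\s(]", item, maxsplit=1)[0])
    return names


# Hypothetical stanza for illustration only.
stanza = """Package: foo
Version: 1.6
Replaces: bar (<< 1.6), baz
Description: example package
 with a continuation line
"""
print(replaced_packages(stanza))
```

Each returned name is a candidate whose cached or installed files could be passed to zsync as an additional inputfile.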

Some additional benchmarking:

I generated foo-0.tar:

-rw-r--r-- ion/users  18476682 2009-06-03 18:56 part0
-rw-r--r-- ion/users    374325 2009-06-03 18:56 part1a
-rw-r--r-- ion/users  11813697 2009-06-03 18:56 part2

and foo-1.tar:

-rw-r--r-- ion/users    464163 2009-06-03 18:57 part1b
-rw-r--r-- ion/users  11813697 2009-06-03 18:56 part2
-rw-r--r-- ion/users  18476682 2009-06-03 18:56 part0

part0, part1a, part1b and part2 all from /dev/urandom. The change from foo-0.tar to foo-1.tar is that part1a was substituted with part1b and the order was changed.
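A comparable pair of test tarballs can be produced with a sketch like the one below. It is not the script that was used for the benchmark; the function names and the `scale` parameter (which shrinks the part sizes so the example runs quickly) are assumptions.

```python
import io
import os
import tarfile


def make_tar(path, members):
    """Write (name, bytes) pairs into an uncompressed tar, in the given order."""
    with tarfile.open(path, "w") as tar:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def build_test_pair(directory, scale=1024):
    """Create foo-0.tar and foo-1.tar from random parts.

    The part sizes are those listed above, divided by `scale`.
    """
    part0 = os.urandom(18476682 // scale)
    part1a = os.urandom(374325 // scale)
    part1b = os.urandom(464163 // scale)
    part2 = os.urandom(11813697 // scale)
    old = os.path.join(directory, "foo-0.tar")
    new = os.path.join(directory, "foo-1.tar")
    make_tar(old, [("part0", part0), ("part1a", part1a), ("part2", part2)])
    # Same part0 and part2, part1a replaced by part1b, and the order changed.
    make_tar(new, [("part1b", part1b), ("part2", part2), ("part0", part0)])
    return old, new
```

Calling `build_test_pair(directory, scale=1)` would use the full sizes from the listings.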

I benchmarked zsyncing foo-1 with foo-0 as the inputfile:

(fastest of seven runs, script and raw data)

|| get || process || output |||| time |||| fetched (bytes) || checksum ||
|| tar.gz || unpacked || tar || 1.596 s || 100 % || 466684 || 100 % || only results in unpacked tar, must checksum it ||
|| tar.gz || unpacked || tar.gz || 6.125 s || 384 % || 466684 || 100 % || checksum of resulting tar.gz may not match, must checksum unpacked tar inside ||
|| tar.gz || raw data || tar.gz || 2.447 s || 153 % || 4958684 || 1063 % || checksum of tar.gz guaranteed to match ||
|| tar.gz --rsyncable || raw data || tar.gz --rsyncable || 1.793 s || 112 % || 479358 || 103 % || checksum guaranteed to match, but if inputfile and outputfile not gzipped exactly the same way, must fetch big amount of data ||
|||||| as above, but with differently gzipped inputfile || 2.445 s || 153 % || 16319311 || 3497 % || ||
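The percentage columns are relative to the first row (get tar.gz, process unpacked, output tar). As a small worked check, with the figures copied from the table above and only the helper name invented:

```python
def relative_percent(value, baseline):
    """Return value as a whole-number percentage of baseline."""
    return round(100 * value / baseline)


# (label, time in seconds, amount fetched), in table order.
rows = [
    ("tar.gz, unpacked, tar", 1.596, 466684),
    ("tar.gz, unpacked, tar.gz", 6.125, 466684),
    ("tar.gz, raw data, tar.gz", 2.447, 4958684),
    ("tar.gz --rsyncable, raw data", 1.793, 479358),
    ("as above, differently gzipped inputfile", 2.445, 16319311),
]

base_time, base_fetched = rows[0][1], rows[0][2]
for label, seconds, fetched in rows:
    print(f"{label}: time {relative_percent(seconds, base_time)} %, "
          f"fetched {relative_percent(fetched, base_fetched)} %")
```

Handling the gzipped data as a raw bitstream fetches roughly ten times as much as the unpacked methods, and a differently gzipped inputfile roughly 35 times as much.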

The cons:

  • Method 0: Infrastructure changes: must checksum data inside gzip. Also must make apt accept the unpacked deb.
  • Method 1: Infrastructure changes: must checksum data inside gzip. Also overhead from zsync having to gzip data before passing it to apt.
  • Method 2: We lose much of the benefit of zsync, defeating the very point of this spec. Also overhead from gzipping local package if the user doesn’t have the inputfile in cache.
  • Method 3: Overhead from gzipping local package if the user doesn’t have the inputfile in cache. Also must be very careful to have all the debs gzipped in exactly the same way, otherwise we lose much of the benefit of zsync.


CategorySpec

AptSyncInKarmicSpec (last edited 2009-10-21 14:39:46 by xdsl-237-224)