AptSyncInKarmicSpec

Differences between revisions 12 and 20 (spanning 8 versions)
Revision 12 as of 2009-06-08 12:33:49
Size: 16854
Editor: xdsl-237-224
Comment:
Revision 20 as of 2009-10-21 14:39:46
Size: 10778
Editor: xdsl-237-224
Comment: Add comment
Line 4: Line 4:
 * '''Packages affected''': apt, apt-sync (to be created)  * '''Packages affected''': apt, apt-sync (to be added to Ubuntu)
Line 47: Line 47:
however, deployed in Debian or Ubuntu at the current time, for reasons
unclear to LarsWirzenius (drafter of spec) at the time of writing
.
however, deployed in Debian or Ubuntu at the current time.
Line 53: Line 52:
the server, various kinds of client systems, and how much benefit
it has as far as the amount of time downloaded.
the archive mirrors, various kinds of client systems, and how much benefit
it has as far as the amount of bandwidth spent downloading.
Line 65: Line 64:
 since the hardy ISO was created. There is several hundred megabytes of
them (FIXME: check this).
 since the hardy ISO was created. There are several hundreds or thousands
 of
megabytes of such updates.
Line 72: Line 71:
 * We are mainly interested in optimizing the actual transfer and avoiding
 big impacts on the mirrors. The client system is assumed to have sufficient
 processing power. (If it turns out to be inefficient at the client end,
 we'll optimize that later.)
Line 74: Line 77:

The RPM world has presto, which relies on the package archive generating
delta-RPM packages. These are then downloaded normally, and applied by
the client differently from normal RPMs. This requires upgrades to happen
from a specific version, which the rsync/zsync/apt-sync approach avoids.
Line 82: Line 90:
 For every package.
Line 83: Line 92:
 to the actual download.  to the actual download. For every package.
Line 88: Line 97:
 * How much data is transferred when updating Packages files daily, for
 a development branch of Ubuntu.
Line 96: Line 107:
 * Create a KVM image with hardy, without any security updates, but with
 all packages installed that have security updates.
 * Set up way to measure bandwidth usage in/out from the KVM guest.
 * Install security upgrades in various ways, measure bandwidth use.
 * Find a way to measure the bandwidth used to update a package with apt-sync.
 Preferably without actually transferring files. If necessary, just use zsync
 for this.
 * Measure the bandwidth to download all security updates, both plain HTTP
 and via apt-sync.
 * If necessary, re-pack the .debs so that apt-sync can deal with them, by
 using gzip --rsyncable. Test also bzip2.
  * ''Note that if this proves useful then dpkg-deb will need to be changed to use --rsyncable by default. This is a little bit non-trivial due to [[http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=290049|Debian bug 290049]], although it ought to be possible to arrange for dpkg-deb simply not to use zlib for compression. --ColinWatson''
  * ''If --rsyncable is required (which seems likely to me), then presumably bzip2 and lzma packages will just be out of luck? I don't think they have equivalents. --ColinWatson''
 * ''Yeah, my gut feeling right now is that --rsyncable is going to be required and that bzip2 and lzma are going to be out of luck. But perhaps we can find some solution. --LarsWirzenius''
  * ''Actually --rsyncable is not currently used by aptsync, instead zsync --look-inside is used and the correct checksum is reconstructed by storing the gzip header and recompressing the data. I'm not sure how to deal with bzip2 or lzma, they are not supported at the moment but it might be possible to integrate them into zsync. --FelixFeyertag''
Line 115: Line 133:
 with and without apt-sync, preferably enough to sature server machine's  with and without apt-sync, preferably enough to saturate server machine's
Line 121: Line 139:
 * Generate .aptsync files for every deb. Count their total size.
 * Also tabulate combined size of .aptsync files per source package.
 * Generate .aptsync files for every deb. Count their total size, both
 actual file size (st_size) and disk usage (st_blocks).
 * Also tabulate combined size of .aptsync files per source package:
 cat them to one file, check both st_size and st_blocks.
  * ''I strongly recommend considering whether a single aptsync file is possible: it could be stored under dists/ and could itself be updated using zsync. It seems to me that this would tend to have minimal impact on mirrors. rsync (which will still be used for inter-mirror syncs) has scaling properties related to the number of files, and an extra 28000 files will probably not do them any good. I think this would also be the politically easiest option to get implemented, since dists/ is where index files have always lived. --ColinWatson''
 * ''A single .aptsync file under dists/ would certainly be possible. I'll add that to the benchmark plan. --LarsWirzenius''
  * ''The mean *.aptsync file size is ~8kB (from a sample of 1508 files), but space on disk is often higher because many files are below the minimum block size. I support the idea of merging them, for the entire archive the size would be 200-300MB, and it could be compressed. --FelixFeyertag''
   * ''The merged file itself could be fetched with zsync. :-) A better way might be to generate an index for it and have clients only fetch the parts they need based on the index. --JohanKiviniemi''

Packages files:
 
 * Find snapshots of Packages files for as long a period of time as possible,
 from karmic. At least ten days' worth of them.
 * Measure the amount of data transferred when updating from oldest to newest
 version of the Packages file, both for plain HTTP and for zsync.

==== Benchmarks ====

 * ''Benchmarks for a [[AptSyncInKarmicSpec/JauntyBenchmark|Jaunty upgrade]] and [[AptSyncInKarmicSpec/KarmicBenchmark|Karmic dist-upgrade]] are available --FelixFeyertag''
Line 126: Line 161:
There should be no UI changes. There should be no UI changes, except that apt-get and Synaptic and
update-manager might be modified to tell how much data apt-sync managed
to NOT transfer. This would allow users to see that it is working.
Line 133: Line 170:
The apt-sync program should need fairly little changes to get it usable,
if at all
.
The apt-sync program should need fairly little changes to get it usable.
Line 138: Line 174:
be a useful thing. be a useful thing. For testing, the .aptsync files can be provided by
different servers than the usual mirrors.
 * ''It obviously makes sense to defer this until after benchmarks are complete, but once that's done I suggest prioritising this part. Launchpad feature changes usually have some lead time, and it would be best to make sure that things like file locations are agreed upon as early as possible. --ColinWatson''
Line 157: Line 195:
# It's important that we are able to test new features, and demonstrate
# them to users. Use this section to describe a short plan that anybody
# can follow that demonstrates the feature is working. This can then be
# used during testing, and to show off after release. Please add an entry
# to http://testcases.qa.ubuntu.com/Coverage/NewFeatures for tracking test
# coverage.
#
# This need not be added or completed until the specification is nearing beta.
Line 168: Line 197:
TBD.

== BoF agenda and discussion ==

{{{

* deltadebs: between specific versions
* apt-sync (https://code.launchpad.net/apt-sync): from any installed version to whatever is to be downloaded
* consider xdeltas ([1] see comments from Rusty below)

Benchmark
=========

delta rpms are kinda screwy:
 * have to make an assumption about what version you have, gets quickly unmanageable with large numbers of changes
 * requires pushing many deltas onto the archive

rsync suggestion:
 * zsync (don't need an rsync server, works with plain HTTP)
   - generates file with checksum of every file on server; size of file depends on block size (will be about .3% of original file)
   - in theory we could put zsync files on every server
   - debs are compressed though, so a small change at beginning might fall apart

Proof that this helps is needed:


apt-sync possibility
 * google summer of code project 3 years ago, unused
 * prototype works
 * we need benchmarks (bandwidth saved, how big impact on web server)
   * mirror admins would not like lots of bandwidth dumped on them
   * we're prioritizing bandwidth for user rather than mirror, although it's not necessarily a tradeoff
 * even with free bandwidth you wait less
 * mostly downloading .debs and upgrades

If user doesn't have original .deb, we can still use original deb on system: dpkg-repack:
 * takes files you have installed and creates new .deb (doesn't have to be identical, rsync can fix it)

Can also rsync packages.gz (preferably with zsync on client side - not all mirrors support rsync (also some mirrors have rsyncd client limits))
 * security updates benefit more from this since they can be pushed faster


gzip by default isn't very friendly with rsync, but they have a gzip --rsyncable option
 * gzip has this option, but lzma doesn't
 * question then becomes are we favoring initial download or people doing upgrade


Things to research:
 * does bzip2 rsync well?
 * reprepro and (other apt repo software) won't be extended initially (though we'll file bugs once we get this working in launchpad)
 * does apt-sync sync packages.gz, or just package files?

Benchmark:
 * bandwidth for archive mirror
 * bandwidth for client
 * decompress time for client
 * cpu/memory impact on mirror httpd
 * resource requirements for archive to generate .aptsync files
 * size impact of .aptsync files for whole mirror
 * if it is too expensive in memory/cpu for eg netbooks, could make it optional
 * optionally have apt-sync report back benchmark data so we can measure things
   for all users
 * James is concerned about number of files in archive

Package description translations file should be synced as well

If we wanted to go really wild, could have mirrors provide unpacked debs

Multiple mirrors:
 - would be useful for user to have multiple mirrors listed; if identical then could sync from both (especially if one is slow)
 - also will be better for user when one mirror is failing/really slow
 - James wants this
 - apt needs graceful fallback, but this is a general problem, not specific to apt-sync

Author of apt-sync, zsync authors are "around and helpful"
 - reach out, maybe they'll help with benchmarks
 - zsync may need to be modified to handle .deb (which are tars in ar)

We should package up apt-sync into Ubuntu and get people to try this in real
situations, but make it optional until it's proven to work.

Provide both Packages.gz and Packages.lzma
 * user can download Packages.lzma first time, then zsync to the Packages.gz subsequently
 * start with Packages file as it's clear and easy for everyone (michael's suggestion)


try zsync on cdimage

zsync.ubuntu.com to provide .zsync files (only)
- james is concerned about cpu impact and number of files in the archive
- we could group .aptsync files into single files per source package

 [1] IRC Transcript w/Rusty
 <rusty> hughhalf: pong!
<hughhalf> rusty: did you happen to look at whether bzip could be given a --rsyncable flag back in the day ?
<rusty> hughhalf: tridge and I discussed it, the problem is that bzip uses a fixed (900k, default) block size. We didn't follow it further.
<hughhalf> ok
 https://blueprints.edge.launchpad.net/ubuntu/+spec/rsync-based-deb-downloads
<rusty> hughhalf: with a centralized system like ubuntu, I think that xdeltas (or something like) makes more sense.
<hughhalf> rusty: Indeed, however for first pass think desire is to run with something that won't affect mirrors overmuch
 and there's some package/ssytem that allows us to do this (sounds reminiscent of rproxy)
<rusty> hughhalf: well, I think given the scale, deltas are going to be more optimal.
<hughhalf> 'k
<rusty> hughhalf: given that most mirrors don't have rsync access, esp.
<hughhalf> rusty: Yep
 Lars et. al. say hello and thanks :)
<rusty> hughhalf: nw!
<hughhalf> rusty: Took liberty of cut and paste salient bits of above in the gobby doc we're using for this
<rusty> hughhalf: of course, not a problem!
<hughhalf> tnx

}}}

JohanKiviniemi’s ramblings:

When an older version of the deb is not in the cache, do we want to use dpkg-repack? We could just tar | gzip the files installed by the package and some of the control files, saving a few seconds.

How about using zsync in a mode that looks inside gzipped data instead of handling it as a raw bitstream? In that case, we could just tar (not gzip) the installed files, or perhaps even pass the list of installed files as separate inputfiles to zsync. It would probably have to gzip the data in the end, though.

Supporting an apt method outputting debs in an unpacked form would be interesting. Zsync is able to download gzipped data, unpack it, merge with local unpacked data and output unpacked data. Gzipping it for apt, only to have it gunzipped, could be avoided. We could add signed checksums for the unpacked data.

Some benchmarking:

(fastest of seven runs, [[http://heh.fi/tmp/apt-sync/|script and raw data]])

|| |||| '''devicekit''' |||| '''coreutils''' |||| '''openoffice.org-core''' ||
|| '''dpkg-repack''' ||<)> 0.870&#xa0;s ||<)> 100&#xa0;% ||<)> 7.100&#xa0;s ||<)> 100&#xa0;% ||<)> 115.236&#xa0;s ||<)> 100&#xa0;% ||
|| '''tar | gzip''' ||<)> 0.313&#xa0;s ||<)> 36&#xa0;% ||<)> 5.470&#xa0;s ||<)> 77&#xa0;% ||<)> 98.316&#xa0;s ||<)> 85&#xa0;% ||
|| '''tar''' ||<)> 0.228&#xa0;s ||<)> 26&#xa0;% ||<)> 0.820&#xa0;s ||<)> 12&#xa0;% ||<)> 11.990&#xa0;s ||<)> 10&#xa0;% ||

Another thing: if foo-1.6 Replaces: bar-1.5, give bar-1.5 as an inputfile to zsync. We know the Replaces in advance from the Packages file. What’s the overhead of parsing it, though?
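
To get a feel for the parsing cost asked about above, here is a minimal Python sketch that pulls the Replaces field out of Packages-style stanzas. It assumes the usual layout of blank-line-separated stanzas with "Field: value" lines; the function name `replaces_map` is made up for this example, and version constraints such as "(<< 1.6)" are simply dropped.

```python
def replaces_map(packages_text):
    """Map each package name to the list of package names it Replaces."""
    result = {}
    for stanza in packages_text.split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            # Continuation lines (e.g. long descriptions) start with
            # whitespace; this sketch ignores them.
            if line[:1] in (" ", "\t") or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        name = fields.get("Package")
        if name and "Replaces" in fields:
            # "bar (<< 1.6), baz" -> ["bar", "baz"]
            result[name] = [
                item.split("(")[0].strip()
                for item in fields["Replaces"].split(",")
                if item.strip()
            ]
    return result
```

Each name returned for the package being upgraded would be a candidate inputfile, provided an old .deb (or repacked copy) of that package is available locally. Timing this function over a real Packages file would give a first estimate of the overhead.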

Some additional benchmarking:

I generated foo-0.tar:
{{{
-rw-r--r-- ion/users 18476682 2009-06-03 18:56 part0
-rw-r--r-- ion/users 374325 2009-06-03 18:56 part1a
-rw-r--r-- ion/users 11813697 2009-06-03 18:56 part2
}}}

and foo-1.tar:
{{{
-rw-r--r-- ion/users 464163 2009-06-03 18:57 part1b
-rw-r--r-- ion/users 11813697 2009-06-03 18:56 part2
-rw-r--r-- ion/users 18476682 2009-06-03 18:56 part0
}}}

part0, part1a, part1b and part2 all from /dev/urandom. The change from foo-0.tar to foo-1.tar is that part1a was substituted with part1b and the order was changed.

I benchmarked zsyncing foo-1 with foo-0 as the inputfile:

(fastest of seven runs, [[http://heh.fi/tmp/zsynctest/|script and raw data]])

|| '''get''' || '''process''' || '''output''' |||| '''time''' |||| '''fetched''' || '''checksum''' ||
|| tar.gz || unpacked || tar ||<)> 1.596&#xa0;s ||<)> 100&#xa0;% ||<)> 466684 ||<)> 100&#xa0;% || only results in unpacked tar, must checksum it ||
|| tar.gz || unpacked || tar.gz ||<)> 6.125&#xa0;s ||<)> 384&#xa0;% ||<)> 466684 ||<)> 100&#xa0;% || checksum of resulting tar.gz may not match, must checksum unpacked tar inside ||
|| tar.gz || raw data || tar.gz ||<)> 2.447&#xa0;s ||<)> 153&#xa0;% ||<)> 4958684 ||<)> 1063&#xa0;% || checksum of tar.gz guaranteed to match ||
|| tar.gz --rsyncable || raw data || tar.gz --rsyncable ||<)> 1.793&#xa0;s ||<)> 112&#xa0;% ||<)> 479358 ||<)> 103&#xa0;% || checksum guaranteed to match, but if inputfile and outputfile not gzipped exactly the same way, must fetch big amount of data ||
||||||...with inputfile gzipped without --rsyncable ||<)> 2.445&#xa0;s ||<)> 153&#xa0;% ||<)> 16319311 ||<)> 3497&#xa0;% || ||

The cons:
 * Method 0: Infrastructure changes: must checksum data inside gzip. Also must make apt accept the unpacked deb.
 * Method 1: Infrastructure changes: must checksum data inside gzip. Also overhead from zsync having to gzip data before passing it to apt.
 * Method 2: We lose much of the benefit of zsync, defeating the very point of this spec. Also overhead from gzipping local package if the user doesn’t have the inputfile in cache.
 * Method 3: Overhead from gzipping local package if the user doesn’t have the inputfile in cache. Also must be very careful to have all the debs gzipped in exactly the same way, otherwise we lose much of the benefit of zsync.

How method 0 could be implemented:
 * Generate such debs that contain debian-binary, control.'''tar''', data.'''tar'''; i.e. do not gzip the members. It seems dpkg already supports the format (since 2004).
 * Put deb.'''gz''' files into archive. Thus, the archive size would not change substantially.
  * Put “Filename: pool/...deb.gz” into Packages.
  * Keep the Packages checksums as the checksums of the deb within the deb.gz.
 * Make it possible for apt methods to gunzip the deb.gz on the fly, outputting just debs.
  * Have apt gunzip the deb if the method did not do that.
  * ''/var/cache/apt/archives would gain size.''
 * Along with foo.deb.gz, put foo.deb.zsync into archive.
 * Have the zsync method use zsync’s existing functionality to use /var/cache/apt/archives/foo-old.deb or equivalent as inputfile, fetch the needed parts of foo-new.deb.gz from the archive and output the gunzipped deb.
Using HTTP content negotiation would allow another way to hide the large number
of extra files from mirrors that don't want the impact. The apt-sync client
would specifically tell the HTTP server it can handle apt-sync files, and
the server would give it if necessary. However, this would complicate things
for all mirrors who do want to support apt-sync, so the single apt-sync file
per release, suggested by Colin, would probably work better.

Summary

When updating the list of available packages, or upgrading to a new version of a package, most of the data is already on the system. Using an rsync-based download method should significantly speed up the download. The apt-sync program already does this, or at least most of it. apt-sync should be packaged and uploaded to Ubuntu karmic, and benchmarked to evaluate its impact on the archive, mirrors, and various kinds of systems running Ubuntu.

Release Note

The apt-sync package is now included in Ubuntu and will make upgrades faster.

Rationale

Bandwidth is scarce or expensive in large parts of the world. Users may have Internet access only via a dial-up modem, GPRS, or 3G. Even users on fast connections may be paying by the byte, or have monthly download limits. Avoiding needless downloads is a worthy goal and would benefit a lot of Ubuntu users.

When downloading a new version of the Packages file, it will almost always be only slightly different from the one on the system already. By using the rsync algorithm, it is possible to download only the changed parts.

Likewise, when upgrading packages to newer versions, the new version is often mostly identical to the old one, with parts of it changed. Again, it would be good to only download the changed parts.

The rsync program itself is unsuitable for this, primarily because it requires a lot of server-side resources. The zsync implementation of the rsync algorithm works around this: the server is a standard HTTP server, and a new data file is added (containing the rsync "signature" data). All the rest of the computation then happens on the client, which uses HTTP Range requests to fetch parts of the file from the server.
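
As a rough illustration of the idea (not zsync's actual implementation or file format), the following Python sketch has the "server" publish one checksum pair per block of the new file, while the client scans its old copy to work out which blocks it still has to fetch. The block size, the function names `signature` and `missing_blocks`, and the choice of Adler-32 plus MD5 are assumptions made for this example.

```python
import hashlib
import zlib

BLOCK = 4  # tiny block size for illustration; real tools use much larger blocks


def signature(new_data, block=BLOCK):
    """Server side: one (weak, strong) checksum pair per block of the new file."""
    sig = []
    for off in range(0, len(new_data), block):
        chunk = new_data[off:off + block]
        sig.append((zlib.adler32(chunk), hashlib.md5(chunk).hexdigest()))
    return sig


def missing_blocks(old_data, sig, block=BLOCK):
    """Client side: indexes of new-file blocks not found anywhere in old_data.

    The old data is scanned at every byte offset, so blocks that merely
    moved are still found. A short final block never matches a full
    window and is therefore always reported as missing in this sketch.
    """
    wanted = {}
    for idx, (weak, strong) in enumerate(sig):
        wanted.setdefault(weak, {}).setdefault(strong, []).append(idx)
    found = set()
    for off in range(0, max(len(old_data) - block + 1, 0)):
        window = old_data[off:off + block]
        by_strong = wanted.get(zlib.adler32(window))  # cheap check first
        if by_strong:
            idxs = by_strong.get(hashlib.md5(window).hexdigest())
            if idxs:
                found.update(idxs)
    return [i for i in range(len(sig)) if i not in found]
```

Only the blocks returned by `missing_blocks` would need to be requested from the server; everything else can be copied from the local file. A real implementation would use a rolling weak checksum instead of recomputing it at each offset.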

The apt-sync program was written in 2006 to implement this. It is not, however, deployed in Debian or Ubuntu at the current time.

apt-sync should be packaged for Debian and Ubuntu, and deployed in an experimental manner, to gather feedback from users. Additionally, benchmarks should be run to measure its impact on the archive mirrors and on various kinds of client systems, and how much download bandwidth it saves.

apt-sync may require changes to combine .aptsync files per source package: otherwise the number of files in the archive will increase by the number of .deb files, i.e., probably about a quarter million, and this is a big impact on the archive and mirror network.

User stories

  • Alfred is connected to the Internet over GPRS. He has just installed hardy and wants to install all security updates that have happened since the hardy ISO was created. There are hundreds or even thousands of megabytes of such updates.

Assumptions

  • Alfred's system has installed all packages for which there are security updates.
  • We are mainly interested in optimizing the actual transfer and avoiding big impacts on the mirrors. The client system is assumed to have sufficient processing power. (If it turns out to be inefficient at the client end, we'll optimize that later.)

Design

The RPM world has presto, which relies on the package archive generating delta-RPM packages. These are then downloaded normally, and applied by the client differently from normal RPMs. This requires upgrades to happen from a specific version, which the rsync/zsync/apt-sync approach avoids.

Benchmarks

At least the following results need to be produced by the benchmarks:

  • How much data is transferred by plain HTTP (full download) and by apt-sync, for every package. This should be compared to the size of the delta produced by xdelta or similar tools (compressed delta of uncompressed data).
  • Time the client takes to reconstruct the new .deb, in addition to the actual download, for every package.
  • CPU/memory impact on httpd resulting from a lot of clients doing HTTP Range requests.
  • Size impact on mirror from .aptsync files: both number of files, and their size.
  • How much data is transferred when updating Packages files daily, for a development branch of Ubuntu.
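
The load that Range requests put on httpd depends partly on how many separate ranges each client asks for. A minimal sketch, assuming the client already knows which fixed-size blocks it lacks, of how adjacent blocks might be merged into fewer byte ranges; `byte_ranges` and `range_header` are hypothetical helper names, not part of apt-sync or zsync.

```python
def byte_ranges(missing, block, total_len):
    """Turn block indexes into inclusive [start, end] byte ranges,
    merging blocks that are directly adjacent."""
    ranges = []
    for idx in sorted(missing):
        start = idx * block
        end = min(start + block, total_len) - 1  # last block may be short
        if ranges and ranges[-1][1] + 1 == start:
            ranges[-1][1] = end  # extend the previous range
        else:
            ranges.append([start, end])
    return ranges


def range_header(ranges):
    """Format ranges as the value of an HTTP Range request header."""
    return "bytes=" + ",".join(f"{start}-{end}" for start, end in ranges)
```

For example, blocks 1, 2 and 5 of a 22-byte file with 4-byte blocks collapse into two ranges, giving the header value `bytes=4-11,20-21`. Counting the ranges per package in the benchmark would help relate bandwidth savings to request volume on the mirror.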

Implementation

Benchmarks

For bandwidth use:

  • Take snapshot of hardy and hardy-security archives.
  • Find a way to measure the bandwidth used to update a package with apt-sync. Preferably without actually transferring files. If necessary, just use zsync for this.
  • Measure the bandwidth to download all security updates, both plain HTTP and via apt-sync.
  • If necessary, re-pack the .debs so that apt-sync can deal with them, by using gzip --rsyncable. Test also bzip2.
    • Note that if this proves useful then dpkg-deb will need to be changed to use --rsyncable by default. This is a little bit non-trivial due to Debian bug 290049, although it ought to be possible to arrange for dpkg-deb simply not to use zlib for compression. --ColinWatson

    • If --rsyncable is required (which seems likely to me), then presumably bzip2 and lzma packages will just be out of luck? I don't think they have equivalents. --ColinWatson

  • Yeah, my gut feeling right now is that --rsyncable is going to be required and that bzip2 and lzma are going to be out of luck. But perhaps we can find some solution. --LarsWirzenius

    • Actually --rsyncable is not currently used by aptsync, instead zsync --look-inside is used and the correct checksum is reconstructed by storing the gzip header and recompressing the data. I'm not sure how to deal with bzip2 or lzma, they are not supported at the moment but it might be possible to integrate them into zsync. --FelixFeyertag

For .deb reconstruction:

  • Instrument apt-sync to measure the amount of time and CPU (and memory, if feasible) it takes to get the updated .deb onto the system, measuring separately pre-download, download, and post-download phases, as well as totals.
  • Run instrumented apt-sync on various clients: a high-end desktop machine, a "normal" laptop, a netbook.

Impact on server resources:

  • Set up a machine with httpd and an Ubuntu mirror of hardy and hardy-security.
  • Set up as many client machines as possible to run upgrades, with and without apt-sync, preferably enough to saturate the server machine's LAN connection.
  • Measure CPU and memory load on server.

Archive size impact:

  • Generate .aptsync files for every deb. Count their total size, both actual file size (st_size) and disk usage (st_blocks).
  • Also tabulate combined size of .aptsync files per source package: cat them to one file, check both st_size and st_blocks.
    • I strongly recommend considering whether a single aptsync file is possible: it could be stored under dists/ and could itself be updated using zsync. It seems to me that this would tend to have minimal impact on mirrors. rsync (which will still be used for inter-mirror syncs) has scaling properties related to the number of files, and an extra 28000 files will probably not do them any good. I think this would also be the politically easiest option to get implemented, since dists/ is where index files have always lived. --ColinWatson

  • A single .aptsync file under dists/ would certainly be possible. I'll add that to the benchmark plan. --LarsWirzenius

    • The mean *.aptsync file size is ~8kB (from a sample of 1508 files), but space on disk is often higher because many files are below the minimum block size. I support the idea of merging them, for the entire archive the size would be 200-300MB, and it could be compressed. --FelixFeyertag

      • The merged file itself could be fetched with zsync. :-) A better way might be to generate an index for it and have clients only fetch the parts they need based on the index. --JohanKiviniemi
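
The difference between st_size and st_blocks mentioned above can be measured with a few lines of Python. This is only a sketch: the function name `size_and_disk_usage` is invented here, and it relies on Python's documented meaning of `st_blocks` (number of 512-byte blocks allocated), which is available on Unix-like systems only.

```python
import os


def size_and_disk_usage(paths):
    """Return (sum of st_size, sum of allocated bytes) for the given files."""
    total_size = 0
    total_allocated = 0
    for path in paths:
        st = os.stat(path)
        total_size += st.st_size
        # st_blocks counts 512-byte units, regardless of the
        # filesystem's own block size.
        total_allocated += st.st_blocks * 512
    return total_size, total_allocated
```

Running it once over all individual .aptsync files and once over the concatenated per-source-package files would show how much of the disk usage comes from small files being rounded up to whole blocks.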

Packages files:

  • Find snapshots of Packages files for as long a period of time as possible, from karmic. At least ten days' worth of them.
  • Measure the amount of data transferred when updating from oldest to newest version of the Packages file, both for plain HTTP and for zsync.

Benchmarks

  • Benchmarks for a Jaunty upgrade (AptSyncInKarmicSpec/JauntyBenchmark) and Karmic dist-upgrade (AptSyncInKarmicSpec/KarmicBenchmark) are available. --FelixFeyertag

UI Changes

There should be no UI changes, except that apt-get, Synaptic, and update-manager might be modified to report how much data apt-sync managed to NOT transfer. This would allow users to see that it is working.

Code Changes

The apt package may require minor changes to support apt-sync as a download method, but probably not.

The apt-sync program should need fairly few changes to make it usable.

Launchpad will require code to generate .aptsync files for the entire archive, but this can happen later, after apt-sync has been proven to be a useful thing. For testing, the .aptsync files can be provided by different servers than the usual mirrors.

  • It obviously makes sense to defer this until after benchmarks are complete, but once that's done I suggest prioritising this part. Launchpad feature changes usually have some lead time, and it would be best to make sure that things like file locations are agreed upon as early as possible. --ColinWatson

Migration

If the user is using a mirror without .aptsync files, everything should work just fine, the way it works now. Once .aptsync files get added to the mirrors, upgrade downloads should happen faster. The archives can add only those .aptsync files that are deemed necessary (e.g., only for -security).

If apt-sync turns out to work well, it should be installed and enabled by default, and users should not have to change anything.

Test/Demo Plan

Suggestion: we modify apt and update-manager to report how much data they managed to NOT transfer, thanks to apt-sync. This will subtly tell users that things work well.

Unresolved issues

Using HTTP content negotiation would allow another way to hide the large number of extra files from mirrors that don't want the impact. The apt-sync client would specifically tell the HTTP server it can handle apt-sync files, and the server would provide them if necessary. However, this would complicate things for all mirrors that do want to support apt-sync, so the single apt-sync file per release, suggested by Colin, would probably work better.


CategorySpec

AptSyncInKarmicSpec (last edited 2009-10-21 14:39:46 by xdsl-237-224)