BazaarImports
Bazaar Imports
Status
Created: 23/04/05 by RobertCollins
Priority: HighPriority
People: DavidAlloucheLead, ScottJamesRemnantSecond
- Contributors:
Interested: MarkShuttleworth, RobertCollins, JamesBlackwell, FabioDiNitto
Status: BrainDump, UduBof, BazaarSpecification, BazaarDevel
- Branch:
- Malone Bug:
- Packages:
Depends: UbuntuDevel/BazaarLaunchpadStrategy
UduSessions: 3
Introduction
This specification discusses improving and accelerating the Bazaar CVS, SVN and BitKeeper import process so that HCT can leverage upstream changes on an ongoing basis.
Rationale
Baz is emerging as an excellent tool for distributed software development. To accelerate the adoption of baz, we are making available imports synchronised daily with upstream non-Bazaar repositories. The current software works, but the process takes too long. We need to identify what needs to be corrected and which pieces are missing, and plan that work.
Scope and Use Cases
By August 1st 2005, 1000 upstream version control archives will have been imported as Bazaar archives, and new upstream commits will be translated daily into new Bazaar revisions.
ScottJamesRemnant has also committed that by this date, both release tarball and Ubuntu source package imports will have been performed and made available in the same system.
Bazaar imports will enable the full functionality of HCT, improving the productivity of the Ubuntu team and derived distributions.
Regularly updated Bazaar archives tracking upstream repositories in centralised version control systems will make it easy for community members to benefit from version control even if they do not have commit access to the repository.
"Ahead of time" archive conversion will make it easy for established projects to evaluate Bazaar as a replacement for CVS or Subversion.
Implementation Plan
To meet this target we need to import, on average, 15 new source repositories per working day. The entire Bazaar team will be employed to reduce the workload on each member.
In order to meet the target for release tarballs, we will need:
A process to examine upstream FTP/HTTP sites and populate the Product, ProductSeries and ProductRelease tables; placing the source files themselves in the Librarian.
A job to take the manifest-less ProductRelease records and import them using Sourcerer.
In order to meet the target for source package imports, we will need:
Regular runs of gina to populate the SourcePackageRelease, publishing and related tables.
A job to take the manifest-less SourcePackageRelease records and import them using Sourcerer
Computing Bottlenecks
SSH connection caching
The current archive publication scheme creates one SSH session between the machine performing the sync and arch.ubuntu.com for each revision, even if that revision was already imported.
That bottleneck can be removed by changing the SSH configuration to perform session caching.
Huge logs
Buildbot has a performance bug in handling big text logs. Combined with the high volume of logging produced by CSCVS, this makes it impossible to import big repositories like OpenOffice or XOrg.
This performance problem can be fixed in two ways:
- Do not accumulate logs in memory, but use file storage instead.
- Truncate logs to keep only the last few thousand lines.
The latter solution would be simple to implement.
Although it discards information, that should not be a problem, since failures are generally diagnosed by examining the tail of the conversion log. More difficult conversion problems will be diagnosed by driving imports with a command line tool instead of Buildbot.
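The truncation approach can be sketched as follows. This is a minimal illustration, not Buildbot code: the function name `tail_lines` and the default limit of 5000 lines are assumptions chosen to match "the last few thousand lines".

```python
from collections import deque
from typing import Iterable, List


def tail_lines(lines: Iterable[str], max_lines: int = 5000) -> List[str]:
    """Return at most the last `max_lines` lines of a log stream.

    Hypothetical helper: memory use is bounded by `max_lines`,
    no matter how long the conversion log grows.
    """
    # A deque with maxlen drops its oldest entry whenever a new
    # line is appended to a full buffer.
    return list(deque(lines, maxlen=max_lines))


# Simulate a long conversion log without holding it all in memory.
log = (f"revision {n} converted" for n in range(1, 100001))
kept = tail_lines(log, max_lines=3)
print(kept)
```

Only the tail survives, which matches the way failures are usually diagnosed from the end of the conversion log.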
Hardware
Each repository must be updated daily. To keep 1000 repositories in sync with only one machine, we need an average sync time under 90 seconds (86,400 seconds per day divided by 1000 repositories is 86.4 seconds), and that does not account for the cost of new imports.
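The sync time budget is simple arithmetic, sketched here so it can be recomputed for other repository counts. The function name and the `machines` parameter are illustrative assumptions; the sketch also assumes syncs run one after another and are spread evenly across machines.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def sync_budget_seconds(repositories: int, machines: int = 1) -> float:
    """Average time available per sync if every repository is synced daily.

    Assumes sequential syncs, evenly divided among `machines`
    (a hypothetical generalisation of the single-machine case).
    """
    if repositories <= 0 or machines <= 0:
        raise ValueError("repositories and machines must be positive")
    return SECONDS_PER_DAY * machines / repositories


# The single-machine case from the text: 1000 repositories.
print(sync_budget_seconds(1000))
```

Any time spent on initial imports on the same machine comes out of this budget.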
The initial import of a repository is an expensive operation in terms of CPU and I/O bandwidth. To achieve the target of 15 imports per working day (on average), additional hardware will be provided.
- One system to run the botmaster.
- Three systems to run imports and syncs.
Buildbot does not provide load balancing functionality. Instead, each job is assigned statically to a slave. In the short term, one system will be used mainly for syncs so the import load will not interfere with syncs.
TODO: specify association of jobs with slaves
Baz vs. Buildbot
Bazaar development will be deprioritized in favour of source imports.
However, a level of activity must be maintained to animate the user community.
- Integration of community contributions.
- Bug fixing.
That is especially important since the Arch/Bazaar community has been repeatedly damaged by a lack of responsiveness from the former main developer. The current level of activity has been crucial in building community trust and enticing users to switch from GNU Arch to Bazaar. If the Bazaar project became unresponsive, that would also damage the company's image in the community.
Product Data
Each version control repository stored in Launchpad is associated with a Product, which includes a human-written product description. Writing original, high-quality product descriptions for projects one does not know about is a time-consuming and unrewarding activity.
To allow source imports to proceed, the description of newly created Products will be copied from the Debian package description.
A product is associated with source packages, but Debian source packages do not have a description. Only binary packages do. Here is a simple algorithm to pick one description:
- if the source package only creates one binary package, use the description of that binary package.
- if there is a binary package with the same name as the source package, use the description of that binary package.
- otherwise, use the description of the first binary package defined in the control file.
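The three rules above can be sketched as a small function. This is an illustration only: the function name, the `(name, description)` tuple representation, and the example package names are assumptions, and binary packages are assumed to be listed in control file order.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (binary package name, description),
# in the order the packages appear in the control file.
BinaryPackage = Tuple[str, str]


def pick_description(source_name: str,
                     binaries: List[BinaryPackage]) -> Optional[str]:
    """Pick a Product description from a source package's binary packages."""
    if not binaries:
        return None  # nothing to copy a description from
    # Rule 1: a single binary package supplies the description.
    if len(binaries) == 1:
        return binaries[0][1]
    # Rule 2: prefer the binary package named after the source package.
    for name, description in binaries:
        if name == source_name:
            return description
    # Rule 3: fall back to the first binary package in the control file.
    return binaries[0][1]


# Made-up package names, for illustration only.
example = [
    ("libfoo1", "Shared library"),
    ("foo", "Command line tool"),
    ("foo-doc", "Documentation"),
]
print(pick_description("foo", example))  # Command line tool
```

The empty-list case is not covered by the rules above; returning `None` is one possible choice that leaves the decision to the caller.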
Certifying Imports
For this volume of imports we cannot afford manual sanity checking of each import.
The current conversion system does some sanity checks during the import process and compares the end result of the import to the HEAD in the upstream repository.
However, that does not guarantee that the annotation is correct. It would be possible, but non-trivial, to compare the annotated form of HEAD. That would provide a stronger guarantee of correctness than a simple comparison.
But well... it does not seem all that important...
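For reference, the simple comparison of the import result against the upstream HEAD could look like the sketch below. It is not the code used by the conversion system: it assumes both trees are checked out into local directories, and the function names and the list of ignored metadata directories are illustrative assumptions.

```python
import hashlib
import os
from typing import Dict

# Assumption: version control metadata directories to skip on both sides.
IGNORED_DIRS = {"CVS", ".svn", "{arch}", ".arch-ids"}


def tree_digest(root: str) -> Dict[str, str]:
    """Map each file path (relative to `root`) to a SHA-256 of its content."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune metadata directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d not in IGNORED_DIRS]
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            relative = os.path.relpath(path, root)
            with open(path, "rb") as handle:
                digests[relative] = hashlib.sha256(handle.read()).hexdigest()
    return digests


def trees_match(upstream_checkout: str, bazaar_checkout: str) -> bool:
    """True when both trees hold the same files with identical content."""
    return tree_digest(upstream_checkout) == tree_digest(bazaar_checkout)
```

This only checks the final file contents. It says nothing about per-line annotation, and it ignores empty directories, permissions and symlink targets.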
Ongoing Operation issues
A small percentage of syncs fail, because
- upstream modified the repository in a destructive manner
- CSCVS bugs
Need to implement better error reporting to identify sync failures.
Import initiation process
The process from "this repository has full RCS information" to "this repository is in sync" still requires many manual operations. A new process must be implemented.
Data Preservation and Migration
In-place conversion of Bazaar archives to newer archive formats as Bazaar converges with Bazaar-NG.
It is important to provide a visible migration path for users, where milestones are new revisions of the archive format.
Packages Affected
The 1,000 source packages selected are those in the Ubuntu main repository which have upstream revision control systems.
Eventually all of universe will also be imported, but that is outside this specification.
User Interface Requirements
Requirement for a user-interface for the Bazaar team to manage imports.
Outstanding Issues
Manual handling of CVS module aliases.
Telling server outages from conversion failures.
Implementation of error notification.
Sync scheduling, broken by slave bouncing.
Testing environment is messy.
Fire and Forget imports.
Hitting a wall (e.g. out of memory) in an import breaks the slave in a way that needs a manual restart.
Updating master from mirror when migrating a job across slaves.
Mark dangerous things (e.g. Xorg) DONTSYNC.
Scheduling issues with three people operating the same system.
Do not create archives on arch.ubuntu.com at import time, but only at initial sync time.
Cannot schedule autotesting in spare cycles; it should probably be short-circuited entirely so it does not cause contention with syncs or new imports.
Associating ProductSeries to different slaves.
UbuntuDownUnder/BOFs/BazaarImports (last edited 2008-08-06 16:30:07 by localhost)