StronglyConnectedSetImport

Importing people into the Launchpad using the Strongly Connected Set as a trust metric

Status

Introduction

This spec describes how we intend to import the GPG keys and UIDs available on the keyservers into Launchpad, using the Strongly Connected Set of GPG keys in the Debian and Ubuntu keyrings as a reasonable trust metric. The script will look through all the keys that are available from the primary rotation keyservers. We estimate that the script will take about 24 hours to run to completion. The script can be run several times, with a decreasing trust threshold, as necessary to get a good import.

Rationale

The Launchpad system defines a Person type, which is expected to be a unique element that identifies an existing real-world person. External sources, however, such as the Debian changelogs, Debian package information, freshmeat and sourceforge, mostly use email addresses to identify ownership. Because scripts such as Gina and Nicole cannot automatically associate two or more email addresses with a single user, they create large numbers of Person instances.

In order to reduce the need to merge people as the launchpad application begins to import the large amounts of data available in the freshmeat and sourceforge universes, we need a mechanism to merge the primary identifiers we use to locate people in the open source world.

Fortunately a possible solution exists. Many people in the open source arena use PGP (or GPG, its common open source relative) keys, which map a set of email addresses to a cryptographically unique key. These keys are signed and kept as part of a keyring (essentially a collection of keys).

The PGP/GPG keyring gives us a useful way to associate keys and userids. By importing that data into Launchpad we will effectively get a jumpstart on the user names and email addresses and keys that make up the Launchpad userbase. This will, we hope, greatly reduce the amount of merging of accounts that will be required in future.

By using a set of keys we consider to be trustworthy (e.g. the Debian keyring, the Ubuntu keyring and a trusted set of other keys such as that of MarkShuttleworth), we can count the trusted signatures on each key/email tuple and use that count to determine a level of trust for each UID.
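As an illustration of the counting step, here is a minimal sketch. It assumes that the signatures on a UID are already available as a collection of signing key IDs; the function signature, the key IDs, and the choice to ignore a key's self-signature are assumptions made for the example, not part of this spec.

```python
def trusted_sig_count(signer_ids, own_key_id, trusted_key_ids):
    """Count the distinct trusted keys that signed one key/email tuple.

    signer_ids      -- IDs of the keys that signed the UID (assumed input)
    own_key_id      -- ID of the key the UID belongs to; its self-signature
                       is ignored (an assumption, not stated in the spec)
    trusted_key_ids -- IDs of the keys in the trusted set
    """
    external_signers = set(signer_ids) - {own_key_id}
    return len(external_signers & set(trusted_key_ids))


# Hypothetical key IDs: two trusted signers, one self-signature, one unknown.
trusted = {"DEB1", "DEB2", "UBU1"}
count = trusted_sig_count({"K1", "DEB1", "UBU1", "XXXX"}, "K1", trusted)
```

A UID would then be accepted when this count reaches the configured trust threshold.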

Implementation Plan

The result of this specification should be a commandline tool able to interact with the defined Strongly Connected Set keyring and the current Launchpad Database (LPDB).

The proposed tool will accept the following arguments:

  • Strongly Connected Set: path to a customized GPGHOME holding the keyring that denotes the set.
  • Level of Trust: number of signatures required to consider the key information "trustable".
  • Keyserver: the keyserver to source keydata from.
  • Dry-run: optionally do a trial run instead of storing the data directly in LPDB.
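The argument list above could be handled roughly as follows. This is a sketch using Python's standard argparse module; the option names (--gpghome, --threshold, --keyserver, --dry-run) and the example values are hypothetical, since the spec does not fix them.

```python
import argparse


def build_parser():
    """Build a parser for the four arguments listed in this spec.

    All option names here are illustrative assumptions.
    """
    parser = argparse.ArgumentParser(
        description="Import keyserver keys and UIDs into Launchpad (sketch).")
    parser.add_argument("--gpghome", required=True,
                        help="path to the customized GPGHOME holding the "
                             "Strongly Connected Set keyring")
    parser.add_argument("--threshold", type=int, required=True,
                        help="number of trusted signatures required")
    parser.add_argument("--keyserver", required=True,
                        help="keyserver to source key data from")
    parser.add_argument("--dry-run", action="store_true",
                        help="do a trial run without storing data in LPDB")
    return parser


# Example invocation with made-up values.
args = build_parser().parse_args([
    "--gpghome", "/tmp/scs-gpghome",
    "--threshold", "3",
    "--keyserver", "keyserver.example.org",
    "--dry-run",
])
```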

The tool will then execute the following basic algorithm:

    # considered maps each key that has at least one sufficiently
    # trusted uid to the list of those uids; these are the keys we
    # will consider for inclusion in Launchpad.
    considered = {}

    # keyserverset is the set of all keys available from
    # the keyservers.
    for key in keyserverset:
        if key in LPDB:
            continue
        for uid in key:
            if trusted_sig_count(uid) >= threshold:
                considered.setdefault(key, []).append(uid)

    for key, uidlist in considered.items():
        LPDB.addkey(key)
        # P is the person based on the nominated primary uid for
        # the key (here, the first sufficiently trusted uid).
        P = LPDB.findperson(uidlist[0])
        if P is None:
            P = LPDB.create_person(uidlist[0])
        for uid in uidlist:
            if not LPDB.find_email(uid.email):
                LPDB.attach_email(P, uid.email)
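To make the algorithm above concrete, the following is a runnable sketch against an in-memory stand-in for LPDB. Everything here is an assumption made for illustration: the FakeLPDB class and its method names, the Uid and Key tuples, and the simplified trust count (trusted signers only, with no special handling of self-signatures). The real Launchpad database interface is not defined by this spec.

```python
from collections import namedtuple

Uid = namedtuple("Uid", "email signers")  # signers: frozenset of key IDs
Key = namedtuple("Key", "key_id uids")    # uids: tuple of Uid


class FakeLPDB:
    """In-memory stand-in for the Launchpad database (hypothetical)."""

    def __init__(self):
        self.keys = {}    # key_id -> person id
        self.emails = {}  # email -> person id
        self._next_person = 1

    def has_key(self, key):
        return key.key_id in self.keys

    def find_person(self, email):
        return self.emails.get(email)

    def create_person(self, email):
        person = self._next_person
        self._next_person += 1
        self.emails[email] = person
        return person

    def attach_email(self, person, email):
        self.emails[email] = person

    def add_key(self, key, person):
        self.keys[key.key_id] = person


def import_keys(keyserverset, lpdb, trusted_ids, threshold, dry_run=False):
    """Collect sufficiently trusted UIDs, then store them unless dry_run."""
    considered = {}  # key_id -> (key, list of trusted uids)
    for key in keyserverset:
        if lpdb.has_key(key):
            continue
        for uid in key.uids:
            if len(uid.signers & trusted_ids) >= threshold:
                considered.setdefault(key.key_id, (key, []))[1].append(uid)
    if dry_run:
        return considered
    for key, uidlist in considered.values():
        # The first trusted UID nominates the person for this key.
        person = lpdb.find_person(uidlist[0].email)
        if person is None:
            person = lpdb.create_person(uidlist[0].email)
        lpdb.add_key(key, person)
        for uid in uidlist:
            if lpdb.find_person(uid.email) is None:
                lpdb.attach_email(person, uid.email)
    return considered


# Two passes with a decreasing threshold, using made-up keys.
trusted = frozenset({"T1", "T2", "T3"})
keys = [
    Key("A", (Uid("alice@example.org", frozenset({"T1", "T2"})),
              Uid("alice@example.net", frozenset({"T1"})))),
    Key("B", (Uid("bob@example.org", frozenset({"T3"})),)),
]
lpdb = FakeLPDB()
for threshold in (2, 1):
    import_keys(keys, lpdb, trusted, threshold)
```

In this sketch, as in the algorithm above, a key that is already in LPDB is skipped on later passes, so a lower-trust UID on an already imported key (alice@example.net here) is not picked up when the threshold is lowered.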

The expected result is the information relating persons and email addresses within the Strongly Connected Set, ready to be inserted into our FOAF model. This means creating new people with a set of email addresses, adding new email addresses to existing people, and possibly identifying duplicated person entries (see Outstanding Issues).

The advantage of running this process just before Gina/Nicole is that it prevents those tools from creating new people based on "untrustable" information (basically project/package email addresses): the otherwise unknown people are created beforehand from a certified source.

Data Preservation and Migration

The existing data should not be affected when the new information is added, because the new information follows the same model as before. No migration process is required either.

User Interface Requirements

There is no User Interface requirement other than good commandline options and output.

Outstanding Issues

  • Can we sanely guess first and family names? (Is there already code in Nicole or Gina for generating people sanely that we might be able to reuse?)

    • CelsoProvidelo: there is no sane way to guess them from the displayname, and I wonder whether it matters at all, since we do not identify users by displayname, firstname and familyname in the LP system. The only one who cares about them is the user, who can go to the edit page and modify them at any time.

  • Can we identify any attacks that might be used against Launchpad if we do this? (For example, poisoning of the keyservers - We believe this particular attack vector is rendered moot by means of the trust tests).
  • Can we sanely identify duplicated person entries in this process? If so, PaulSladen suggested a nice mechanism for informing users about them: present, probably on the personal page, an invitation to verify whether the proposed duplicate is genuine (this would require some LP infrastructure work). Once the users have confirmed it, the persons could be merged automatically.

UbuntuDownUnder/BOFs/StronglyConnectedSetImport (last edited 2008-08-06 16:24:10 by localhost)