LaunchpadGooglification

Differences between revisions 4 and 5
Revision 4 as of 2005-04-25 00:54:09
Size: 2911
Editor: intern146
Comment: Formatting tweak.
Revision 5 as of 2005-04-26 06:33:28
Size: 4648
Editor: intern146
Comment: update with BoF outcome
Making Launchpad Google-friendly

Status

Introduction

This spec identifies issues related to search engines crawling the Launchpad web site, and to making sure that the entire site is discoverable from the home page.

Rationale

We have a ton of very interesting content in Launchpad, and we also have a very neat URL scheme. We need to make sure that Google and other search engines can crawl the entire web site, starting from the home page, without depending on outside links to interesting pages.

Scope and Use Cases

Google starts with the home page. From there, it should be possible to walk a list of every product, every project, every distro, every package, every branch, every bug, every bounty and every translation.

An example of how this might be useful to a user: a user might spot a bad translation in a program, and Google for it. If the translation in Launchpad turns up in the Google search results, the user can quickly go to it and fix it.

Implementation Plan

Currently, there are a few bottlenecks for anyone crawling our site. For example, we don't publish a list of every product with links to the individual product pages. We only have a search interface for "products", which returns a list of matching products. A search engine can't get past that search box, because it has no idea what to enter and submit. In fact, an "empty" search would produce a list of all products, but search engines will almost certainly never simulate a form post.

We need to identify all such bottlenecks and make sure that we have a way to navigate past them. For example, if the product search page had a link saying "Show All Products" that took one to a list of all products, linked to their product pages in Launchpad, then Google could bypass the form, follow that link, then proceed to index each of the product pages individually.
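The bottleneck can be illustrated with a minimal sketch of what a simple crawler sees. It assumes the crawler only follows `<a href>` links and never submits forms. The markup and the `/products/+all` URL are hypothetical and are not Launchpad's actual pages.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href targets of <a> tags, roughly as a simple crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Forms and inputs are ignored: the crawler never fills them in.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawlable_links(html):
    """Return the list of URLs a link-following crawler could reach."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup for a product search page (assumption, for illustration).
SEARCH_ONLY = (
    '<form action="/products" method="get">'
    '<input name="text"><input type="submit"></form>'
)
# The same page with a hypothetical "Show All Products" link added.
WITH_SHOW_ALL = SEARCH_ONLY + '<a href="/products/+all">Show All Products</a>'
```

With only the search form, `crawlable_links(SEARCH_ONLY)` finds nothing to follow. Once the link is present, the crawler has a path to the full listing.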

User Interface Requirements

The following pages are bottlenecks:

  1. /projects

  2. /products

  3. /distros

For each of these pages, we should implement a "show all X" link which does just that. The links will point to pages which list all projects/products/etc. with links to the individual project/product/etc. pages.

Arguably, these "show all" pages will not be particularly useful for users, so there isn't much use in batching them. However, there are reasons why batching the pages is preferable:

  • A single long page will take longer to (a) generate and (b) send over the network. Note that each HTTP request locks up a Zope/DB thread for its duration.
  • The single long page might turn up in the search engine results page, meaning that a user might go to it.

Note that the /people page already has "show all people" and "show all teams" links, both of which lead to batched lists.
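A batched "show all products" listing could look roughly like the sketch below. It uses an in-memory SQLite table as a stand-in for the real database. The table name, the batch size and the `/products/<name>` URL pattern are assumptions for illustration only.

```python
import sqlite3

BATCH_SIZE = 3  # hypothetical; a real page would use a larger batch


def make_demo_db():
    """Create an in-memory database with a few made-up product names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (name TEXT PRIMARY KEY)")
    names = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
    conn.executemany(
        "INSERT INTO product (name) VALUES (?)", [(n,) for n in names]
    )
    return conn


def product_batch(conn, start, size=BATCH_SIZE):
    """Return (links, next_start) for one batch of the listing.

    The database does the slicing via LIMIT/OFFSET, so only one batch
    (plus one lookahead row) is fetched per request.
    """
    rows = conn.execute(
        "SELECT name FROM product ORDER BY name LIMIT ? OFFSET ?",
        (size + 1, start),
    ).fetchall()
    has_more = len(rows) > size
    links = ["/products/" + name for (name,) in rows[:size]]
    # next_start is None on the last batch, so no "next" link is rendered.
    next_start = start + size if has_more else None
    return links, next_start
```

Each batch page would render its links plus a "next" link built from `next_start`, so a crawler can walk the whole list while each request stays short.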

Outstanding Issues

  • Is there a performance issue with a web crawler hitting every single page in Launchpad? Are we ready for that load?
    • We should perform tests using a tool such as wget -r or linkchecker on the Dogfood server to get some idea of how severe the load would be.

  • Things we should do to make Launchpad behave better when it's being crawled:
    • Block robots which are harmful using a robots.txt. There are lists available of robots which cause problems.
    • Consistently use batch navigation infrastructure so that all our batched pages are efficient. (E.g. SQL queries use OFFSET/LIMIT rather than Python slicing.)
    • Use the rel="nofollow" convention for certain links:

      • Links in bug comments, to help avoid bug comment spam.
      • Links to things which take a long time to generate and for which search engine indexing is not useful. For example, PO file exports should not be indexed.
      • Possibly: links to product homepages, to avoid Launchpad's Google juice being used to promote things.
      • Anything else that's not moderated.
  • We might need to set <meta> tags on pages to declare that they should not be indexed. We would need to work out how to do this with page templates.

  • External link portlets could conceivably be misconstrued as link farms.
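As a sketch of the robots.txt idea above, the rules can be checked with Python's standard `urllib.robotparser` before they are deployed. The robot name `BadBot` and the `/exports/` path are hypothetical placeholders, not actual Launchpad URLs or a real blocklist entry.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one harmful robot entirely, and keep all
# robots away from expensive export pages (both names are assumptions).
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /exports/
"""


def build_parser(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def allowed(parser, user_agent, path):
    """Return True if the given robot may fetch the given path."""
    return parser.can_fetch(user_agent, path)
```

For example, `allowed(build_parser(ROBOTS_TXT), "Googlebot", "/products/foo")` is true, while the same check for `BadBot`, or for any robot on a path under `/exports/`, is false. Note that robots.txt is only advisory, so robots that ignore it would still need to be blocked by other means.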

UbuntuDownUnder/BOFs/LaunchpadGooglification (last edited 2008-08-06 16:39:12 by localhost)