LaunchpadGooglification

Making Launchpad Google-friendly

Status

Introduction

This spec identifies issues related to search engines crawling the Launchpad web site, and to making sure that the entire site is discoverable from the home page.

Rationale

We have a ton of very interesting content in Launchpad, and we also have a very neat URL schema. We need to make sure that Google and other search engines can crawl the entire web site, starting from the home page, without depending on outside links to interesting pages.

Scope and Use Cases

Google starts with the home page. From there, it should be possible to walk a list of every product, every project, every distro, every package, every branch, every bug, every bounty and every translation.

An example of how this might be useful: a user spots a bad translation in a program and Googles for it. If the translation in Launchpad turns up in the Google search results, the user can quickly go to it and fix it.

Implementation Plan

Currently, we have a few bottlenecks in the process for anyone crawling our site. For example, we don't publish a list of every product, with links to the individual product pages. We only have a search interface for "products", and then we give a list of matching products. So the search engine can't penetrate past that search box, because it has no idea what to put in there and "submit". In fact, doing an "empty" search would produce a list of all products, but search engines will almost certainly never simulate a form post.

We need to identify all such bottlenecks and make sure that we have a way to navigate past them. For example, if the product search page had a link saying "Show All Products" that took one to a list of all products, linked to their product pages in Launchpad, then Google could bypass the form, follow that link, then proceed to index each of the product pages individually.
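
For illustration only (the URLs and markup below are hypothetical, not Launchpad's actual templates or paths), the crawler-friendly escape hatch is an ordinary hyperlink rather than a form submission, and the listing page it leads to links straight to each product page:

    <!-- On the product search page: a plain link a crawler can follow. -->
    <a href="/products/all">Show All Products</a>

    <!-- On the listing page: direct links to the individual product pages. -->
    <ul>
      <li><a href="/products/firefox">Mozilla Firefox</a></li>
      <li><a href="/products/bzr">Bazaar-NG</a></li>
      ...
    </ul>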

User Interface Requirements

The following pages are bottlenecks:

  1. /projects

  2. /products

  3. /distros

For each of these pages, we should implement a "show all X" link which does just that. The links will point to pages which list all projects/products/etc., with links to the individual project/product/etc. pages.

Arguably, these "show all" pages will not be particularly useful for users, so there isn't much use in batching them. However, there are reasons why batching the pages is preferable:

  • A single long page will take longer to (a) generate and (b) send over the network. Note that each HTTP request locks up a Zope/DB thread for its duration.
  • The single long page might turn up in the search engine results page, meaning that a user might go to it.

Note that the /people page already has "show all people" and "show all teams" links, both of which lead to batched lists.
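
To make the batching point above concrete, here is a minimal Python sketch of the difference between slicing in Python and limiting in the database. The table and column names, and the DB-API parameter style, are assumptions for illustration, not Launchpad's actual schema or code:

    def product_batch(cur, start, size):
        """Fetch one batch of products for a "Show All Products" page.

        `cur` is any DB-API cursor; the database returns only the rows
        for the requested batch, so large tables stay cheap to page through.
        """
        cur.execute(
            "SELECT id, name, title FROM product"
            " ORDER BY name LIMIT %s OFFSET %s",
            (size, start))
        return cur.fetchall()

    def product_batch_slow(cur, start, size):
        """The pattern to avoid: fetch every row, then slice in Python."""
        cur.execute("SELECT id, name, title FROM product ORDER BY name")
        return cur.fetchall()[start:start + size]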

Outstanding Issues

  • Is there a performance issue with a web crawler hitting every single page in Launchpad? Are we ready for that load?
    • We should perform tests using a tool such as wget -r or linkchecker on the Dogfood server to get some idea of how severe the load would be (see the example command after this list).

  • Things we should do to make Launchpad behave better when it's being crawled:
    • Block harmful robots using a robots.txt file (see the sketch after this list). Lists of robots known to cause problems are available.
    • Consistently use batch navigation infrastructure so that all our batched pages are efficient. (E.g. SQL queries use OFFSET/LIMIT rather than Python slicing.)
    • Use the rel="nofollow" convention for certain links:

      • Links in bug comments, to help avoid bug comment spam.
      • Links to things which take a long time to generate and for which search engine indexing is not useful. For example, PO file exports should not be indexed.
      • Possibly: links to product homepages, to avoid Launchpad's Google juice being used to promote things.
      • Anything else that's not moderated.
  • We might need to set <meta> tags on pages to declare that they should not be indexed (see the example after this list). We would need to work out how to do this with page templates.

  • External link portlets could conceivably be misconstrued as link farms.
  • Set Last-Modified headers on generated pages to help web crawlers (see the sketch after this list). (Do any really care?)
  • Perhaps statically generate and cache such pages?
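
For the load-testing question above, a possible crawl simulation against the Dogfood server (the host name is a placeholder, and the flags are only a polite starting point):

    # Recursively fetch pages like a crawler would, pausing between requests
    # and discarding the downloaded files afterwards.
    wget --recursive --level=3 --wait=1 --delete-after http://dogfood.example/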
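
The robots.txt, rel="nofollow" and <meta> items above could look roughly like this; the robot name and the export path are illustrative placeholders, not real Launchpad URLs:

    # robots.txt: keep a known-troublesome robot out entirely, and keep
    # every crawler away from expensive pages that are useless to index.
    User-agent: BadBot
    Disallow: /

    User-agent: *
    # placeholder path standing in for PO file exports
    Disallow: /po-export

and in the page templates:

    <!-- A link in a bug comment that should not pass on any Google juice: -->
    <a href="http://example.com/" rel="nofollow">link posted by a user</a>

    <!-- A page that should not be indexed at all: -->
    <meta name="robots" content="noindex, nofollow" />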
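
For the Last-Modified item, a minimal Python sketch of setting the header on a Zope-style response; where the modification timestamp comes from is an assumption, not something the current code necessarily exposes:

    from email.utils import formatdate

    def set_last_modified(response, modified_timestamp):
        """Set Last-Modified so crawlers can use If-Modified-Since.

        `modified_timestamp` is seconds since the epoch; `response` is
        anything with a Zope-like setHeader(name, value) method.
        """
        response.setHeader(
            'Last-Modified', formatdate(modified_timestamp, usegmt=True))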
