LaunchpadGooglification

Differences between revisions 7 and 8
Revision 7 as of 2005-04-27 09:07:26
Size: 4809
Editor: intern146
Comment: draftify
Revision 8 as of 2005-04-27 09:35:34
Size: 814
Editor: intern146
Comment: spec on Launchpad wiki
  * Status: BrainDump, UduBof, LaunchpadSpecification, SpecElsewhere[[BR]]
  * Branch: [[BR]]
  * Malone Bug: [[BR]]
  * Packages: [[BR]]
  * Depends: [[BR]]

== Summary ==

Making Launchpad maximally accessible to search engines will increase the exposure of Launchpad pages to developer eyeballs. There are four main issues: getting useful pages indexed, keeping useless pages out of the index, preventing link spam, and avoiding the appearance of link spam. All of these problems can be solved or minimized by technical means.

== Spec elsewhere ==

https://wiki.launchpad.canonical.com/LaunchpadGooglification

== Rationale ==

We have a ton of very interesting content in Launchpad, and we also have a
very neat URL schema. We need to make sure that Google and other search
engines can crawl the entire web site, starting from the home page, without
depending on outside links to interesting pages.

== Scope and Use Cases ==

Google starts with the home page. From there, it should be possible to walk
a list of every product, every project, every distro, every package, every
branch, every bug, every bounty and every translation.

An example of how this might be useful to a user: a user might spot a bad translation in a program, and Google for it. If the translation in Launchpad turns up in the Google search results, the user can quickly go to it and fix it.

== Implementation Plan ==

Currently, we have a few bottlenecks in the process for anyone crawling our
site. For example, we don't publish a list of every product, with links to
the individual product pages. We only have a search interface for
"products", and then we give a list of matching products. So the search
engine can't penetrate past that search box, because it has no idea what to
put in there and "submit". In fact, doing an "empty" search would produce a
list of all products, but search engines will almost certainly never
simulate a form post.

We need to identify all such bottlenecks and make sure that we have a way to
navigate past them. For example, if the product search page had a link
saying "Show All Products" that took one to a list of all products, linked
to their product pages in Launchpad, then Google could bypass the form,
follow that link, then proceed to index each of the product pages
individually.
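To illustrate why a plain link defeats this bottleneck while a form does not, here is a sketch of what a crawler's link extractor does, using Python's stdlib `html.parser` (the markup and URLs are hypothetical, not Launchpad's actual pages): it collects `<a href>` targets and ignores `<form>` elements entirely.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect <a href> targets the way a crawler does; forms are ignored."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
        # A <form> never yields a URL to follow: the crawler would have
        # to invent query values and submit the form, which crawlers
        # will almost certainly never do.

# Hypothetical markup for the /products search page:
page = """
<form action="/products/+search"><input name="text"><input type="submit"></form>
<a href="/products/+all">Show All Products</a>
"""

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # only the "Show All Products" link is discoverable
```

The search form contributes nothing to the crawl frontier; the single "Show All Products" link is what lets the crawler proceed to the individual product pages.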

=== User Interface Requirements ===

The following pages are bottlenecks:

 1. `/projects`

 1. `/products`

 1. `/distros`

For each of these pages, we should implement a "show all X" link which does just that. The links will point to pages which list all projects/products/etc. with links to the individual project/product/etc. pages.

Arguably, these "show all" pages will not be particularly useful to users, so batching them might seem unnecessary. However, there are reasons why batching is preferable:

 * A single long page will take longer to (a) generate and (b) send over the network. Note that each HTTP request locks up a Zope/DB thread for its duration.

 * The single long page might turn up in the search engine results page, meaning that a user might go to it.

Note that the `/people` page already has "show all people" and "show all teams" links, both of which lead to batched lists.

== Outstanding Issues ==

 * Is there a performance issue with a web crawler hitting every single page in Launchpad? Are we ready for that load?

   * We should perform tests using a tool such as `wget -r` or `linkchecker` on the Dogfood server to get some idea of how severe the load would be.

 * Things we should do to make Launchpad behave better when it's being crawled:

   * Block robots which are harmful using a robots.txt. There are lists available of robots which cause problems.

   * Consistently use batch navigation infrastructure so that all our batched pages are efficient. (E.g. SQL queries use OFFSET/LIMIT rather than Python slicing.)

   * Use the `rel="nofollow"` convention for certain links:

     * Links in bug comments, to help avoid bug comment spam.

     * Links to things which take a long time to generate and for which search engine indexing is not useful. For example, PO file exports should not be indexed.

     * Possibly: links to product homepages, to avoid Launchpad's Google juice being used to promote things.

     * Anything else that's not moderated.

 * We might need to set `<meta>` tags on pages to declare that they should not be indexed. We would need to work out how to do this with page templates.

 * External link portlets could conceivably be misconstrued as link farms.

 * Set `Last-Modified` headers on generated pages to help web crawlers. (Do any really care?)

 * Perhaps statically generate and cache such pages?
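
The robots blocking mentioned above could be sketched as a `robots.txt` along these lines; the user-agent name is purely an example, and the real entries should come from the published lists of problematic robots:

```
# Block a robot known to cause problems (example name only)
User-agent: BadBot
Disallow: /

# Everyone else may crawl the whole site
User-agent: *
Disallow:
```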
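
The OFFSET/LIMIT point above matters because Python slicing fetches every row from the database before discarding most of them on each request. A hypothetical sketch (this helper is illustrative, not Launchpad code) of pushing the batching into the query instead:

```python
def batch_query(table, batch_size, start):
    """Build a batched listing query; hypothetical helper for illustration.

    LIMIT/OFFSET makes the database return only the requested batch,
    whereas fetching everything and slicing in Python transfers and
    discards all the other rows on every page view.
    """
    return (f"SELECT name FROM {table} ORDER BY name "
            f"LIMIT {batch_size} OFFSET {start}")

# Third batch of 50 products:
print(batch_query("Product", 50, 100))
```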
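
The `rel="nofollow"` treatment of bug comment links could look roughly like this; the function name and the regex-based approach are only an illustration, not Launchpad's actual comment renderer:

```python
import re

def nofollow_links(html):
    """Add rel="nofollow" to every <a> tag in untrusted HTML.

    Hypothetical helper for illustration. rel="nofollow" tells search
    engines not to count the link towards the target's ranking, which
    removes the incentive for comment spam. (A real implementation
    would also need to handle anchors that already carry a rel
    attribute.)
    """
    return re.sub(r"<a\b", '<a rel="nofollow"', html)

comment = 'See <a href="http://example.com/">my site</a> for details.'
print(nofollow_links(comment))
```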
UbuntuDownUnder/BOFs/LaunchpadGooglification (last edited 2008-08-06 16:39:12 by localhost)