Spell Checking

Status

Introduction

Ubuntu main currently contains at least three spell checking engines (ispell, myspell, aspell). We should standardize on one.

Rationale

  • Conserve system resources (e.g., space on CD and on system)
  • Provide a better user experience (a unified supplemental dictionary, best available engine)
  • Focus localization efforts (a single word list per language)

Scope and Use Cases

There are three spell checking programs currently shipped in main:

  • ispell
  • aspell (designed as a replacement for ispell)
  • myspell (a library-based version of ispell)

Another library, libenchant1, provides a single interface to myspell, aspell, or ispell and is used by a number of programs in Ubuntu. Because it can interface with any of these backends, programs that use enchant are less problematic.

Each of these spell checkers comes with a large set of word lists (up to 30-40 packages). Some of these are generated automatically from a single source, while many others have separate sources.

Problems that we have to deal with here include:

  • The formats of the shipped language-specific word lists differ;
  • The formats of the user-specific personal word lists differ;
  • A wide range of packages depend on each of the spell checking programs and libraries.

Some people think that aspell is much better than ispell or myspell, although it is unclear whether this holds only for English. Enchant lets you select (via a configuration file) whichever backend is best for a given language.
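As a rough illustration of per-language backend selection, the sketch below parses an ordering file of the form `tag:provider1,provider2`, which is how enchant's ordering file is commonly written. The function names, the skipping of `#` lines, and the fallback from `en_GB` to `en` to `*` are assumptions of this sketch, not documented enchant behaviour.

```python
def parse_ordering(text):
    """Parse 'tag:provider1,provider2' lines into a dict of provider lists."""
    ordering = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skipping blank lines and '#' comments is a convenience of this sketch.
        if not line or line.startswith("#"):
            continue
        tag, sep, providers = line.partition(":")
        if not sep:
            continue  # not a 'tag:providers' line, ignore it
        ordering[tag.strip()] = [p.strip() for p in providers.split(",") if p.strip()]
    return ordering


def backends_for(ordering, language):
    """Return the provider order for a language tag.

    Tries the full tag ('en_GB'), then the base language ('en'),
    then the wildcard '*'. This fallback order is an assumption.
    """
    base = language.split("_")[0]
    for key in (language, base, "*"):
        if key in ordering:
            return ordering[key]
    return []


# Hypothetical ordering: aspell first for English, myspell first elsewhere.
EXAMPLE = """\
*:myspell,aspell,ispell
en:aspell,myspell
"""
```

With this example, English locales would try aspell first while every other language would fall back to the wildcard order.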

Implementation

Implementation Plans

There are several ways to simplify the situation. They are described as separate plans below.

Move Ispell Into Universe

ispell alone has 68 reverse dependencies, of which 30-40 are dictionaries. Of the remainder, only mutt is in main and only kate is in Kubuntu. Both list it only as a Suggests and accept either aspell or ispell. aspell also includes an option to run in ispell command-line compatibility mode.

We should be able to remove ispell with little impact on, and few changes to, the rest of the system, and it should be trivial to write a wrapper that lets aspell provide ispell.
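A minimal sketch of what such a wrapper could look like is given below. It only translates an ispell-style argument list into an aspell command line, using the aspell sub-commands `pipe`, `list`, and `check`. The `MODE_FLAGS` table and the `translate` name are hypothetical, and any flag not in the table is passed through unchanged, which a real wrapper would need to verify flag by flag.

```python
# Hypothetical mapping from ispell mode flags to aspell sub-commands.
MODE_FLAGS = {
    "-a": "pipe",  # ispell -a: read words on stdin, answer on stdout
    "-l": "list",  # ispell -l: list misspelled words from stdin
}


def translate(ispell_args):
    """Map an ispell-style argument list onto an aspell command line.

    With no mode flag, ispell checks a file interactively, which
    corresponds to 'aspell check FILE'.
    """
    mode = "check"
    rest = []
    for arg in ispell_args:
        if arg in MODE_FLAGS:
            mode = MODE_FLAGS[arg]
        else:
            # Assumption: remaining arguments are passed through untouched.
            rest.append(arg)
    return ["aspell", mode] + rest
```

For example, `translate(["-a"])` gives `["aspell", "pipe"]` and `translate(["notes.txt"])` gives `["aspell", "check", "notes.txt"]`. An installed wrapper would then execute the resulting command in place of ispell.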

Build Wordlist From Common Sources

Many of the aspell, myspell, and ispell dictionaries in Debian are already built from a common source. Moving the rest to common sources can be done easily, without modifying the spell checkers. The data will still be duplicated across binary packages in the archive, on the CD, and on the user's disk, but this is already the case. A unified source dictionary would be a major step forward in maintainability.
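To illustrate building several outputs from one source, the sketch below takes a single list of words and emits two layouts: a plain list with one word per line, and a MySpell-style `.dic` body whose first line is the word count. The function names are hypothetical, and affix flags and engine-specific compilation steps are deliberately left out.

```python
def normalise(words):
    """Strip whitespace, drop empty entries, deduplicate, and sort."""
    return sorted({w.strip() for w in words if w.strip()})


def to_plain_list(words):
    """Plain word list: one word per line."""
    return "\n".join(normalise(words)) + "\n"


def to_myspell_dic(words):
    """MySpell-style .dic body: word count first, then one word per line.

    Affix flags (the '/XYZ' suffixes) are omitted in this sketch.
    """
    unique = normalise(words)
    return "\n".join([str(len(unique))] + unique) + "\n"
```

Both outputs come from the same normalised list, so a correction made in the common source reaches every engine's package on the next build.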

We should keep in mind that license issues often prevent different word lists from being combined, and it will not always be obvious to us which word list is the better or more complete one. This is probably a job we should suggest and delegate to the LoCoTeams.

Aspell and Myspell Share Dictionaries

Ideally, aspell and myspell (the two spell checkers we would keep in main after removing ispell) will be able to share a single dictionary. At the moment, each has one format for personal dictionaries and another for the main language-specific word list, and none of these formats match between the two checkers. The first priority should be a shared personal word list, so that a word a user adds to their personal dictionary is available in programs that use either myspell or aspell.

The formats differ only trivially. We should patch OpenOffice or myspell to detect and read a single preferred format (the simple aspell format of one word per line).

We should not break the ability to read other word list formats, because OpenOffice can download word lists from within the program and these must remain readable.
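The detect-then-read idea can be sketched as follows. The reader looks at the first line to decide which layout it has, and falls back to treating the file as a plain list with one word per line. The header strings are assumptions about how aspell personal word lists (`personal_ws-...`) and OpenOffice.org user dictionaries (`OOoUserDict...`, with words after a `---` line) begin, and `read_personal_words` is a hypothetical name.

```python
ASPELL_HEADER = "personal_ws-"  # assumed first line of an aspell personal list
OOO_HEADER = "OOoUserDict"      # assumed first line of an OpenOffice.org user dictionary
OOO_SEPARATOR = "---"           # assumed line separating the header from the words


def read_personal_words(text):
    """Return the words in a personal word list, whatever its layout."""
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line]
    if not lines:
        return []
    first = lines[0]
    if first.startswith(ASPELL_HEADER):
        # Single header line, then one word per line.
        return lines[1:]
    if first.startswith(OOO_HEADER):
        # Words follow the separator line; no separator means no words.
        if OOO_SEPARATOR in lines:
            return lines[lines.index(OOO_SEPARATOR) + 1:]
        return []
    # No recognised header: treat as one word per line.
    return lines
```

Because unrecognised input is still read as a plain list, adding support for the preferred format this way does not remove the ability to read the existing ones.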

We should investigate similar solutions for the standard system-wide language-specific word lists as well.

One Spell Checking Library!

The long-term solution should be to standardize on a single library. That library might be libenchant, although the ideal situation would be to either:

  • Change myspell applications to communicate directly with libaspell
  • Build a single myspell interface layer to libaspell so all myspell programs can communicate directly with libaspell
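The second option is essentially an adapter: code written against a myspell-style interface keeps calling `spell()` and `suggest()`, and the layer forwards those calls to another engine. The sketch below shows the shape of such a layer. `SetBackend` is a stand-in for libaspell backed by an in-memory set, and all class and method names here are hypothetical, not the real myspell or libaspell APIs.

```python
class SetBackend:
    """Stand-in for the real engine, backed by an in-memory word set."""

    def __init__(self, words):
        self.words = set(words)

    def check(self, word):
        """Return True if the word is known."""
        return word in self.words

    def suggest(self, word):
        """Naive suggestions: known words sharing the first letter, sorted."""
        return sorted(w for w in self.words if w[:1] == word[:1])


class MySpellCompat:
    """Offers myspell-style spell()/suggest() on top of a check()/suggest() backend."""

    def __init__(self, backend):
        self._backend = backend

    def spell(self, word):
        # Forward the myspell-style call to the backend's check().
        return self._backend.check(word)

    def suggest(self, word):
        # Copy into a list so callers do not depend on the backend's return type.
        return list(self._backend.suggest(word))
```

An application would construct `MySpellCompat` once and use it wherever it previously used its myspell object, so only the construction site needs to change.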

Packages Affected

Removing ispell affects 68 packages. However, if we can modify aspell to provide ispell compatibility, we can reduce the number of packages that need changes to a minimum. All ispell dictionaries, and ispell itself, will be moved out of main into universe (~40 packages).

Building word lists from common sources will affect a wide variety of different word list source packages.

Moving to a single spell checking library will involve providing a new package with the new interface to libaspell and will result in moving all of the myspell packages and word lists into universe as well.

Outstanding Issues

We should evaluate the options listed above and choose which one(s) we want to follow up on.

We should investigate how Mozilla currently does spell checking, as this was not clear at the BoF. It may be another version of myspell.

We have not investigated issues with build dependencies on spell checking libraries.

UDU BOF Agenda

  • Mozilla and OpenOffice.org use myspell

    • Convert them to aspell?
    • Add myspell API compatibility to aspell?

UbuntuDownUnder/BOFs/SpellChecking (last edited 2008-08-06 16:29:21 by localhost)