SpeechRecognition

Revision 3 as of 2007-04-27 21:46:58

Summary

A roadmap for providing speech recognition on Ubuntu.

Rationale

Robust speech recognition would be useful for many groups for both dictation and navigation. There are currently no workable solutions available on Linux.

Background

Speech recognition technology has slowly evolved over the past two decades, from recognition of a few pre-recorded words to dictation of general text without prior voice training. [http://www.nuance.com/naturallyspeaking/ Dragon NaturallySpeaking] by Nuance is the current market leader, but Microsoft looks poised to undercut it, much as it did Netscape, by bundling speech recognition as a feature in Vista. The accuracy and responsiveness of speech recognition depend in part on the available processing power. With the advent of multi-core CPUs we could see an explosion of speech recognition usage, giving Vista an edge over Linux and OS X.

Some efforts are being made at creating free speech recognition technology, but it is a large and complex problem. [http://www.voxforge.org/ VoxForge] is a promising initiative set up to provide the acoustic models needed by open source speech recognition engines such as [http://cmusphinx.sourceforge.net/html/cmusphinx.php Sphinx], [http://www.ece.msstate.edu/research/isip/projects/speech/index.html ISIP], [http://htk.eng.cam.ac.uk/ HTK], and [http://julius.sourceforge.jp/en_index.php?q=en/index.html Julius]. Currently, these free alternatives are only usable in limited applications, as they fail with larger vocabularies.

Use Cases

  • Professionals who perform dictation
  • Non-Latin language input
  • Mobility impaired
  • Sufferers of RSI

Scope

This is an informational spec, charting the options for speech recognition on Ubuntu. The best long-term solution is a native port or new implementation of a speech recognition engine on Linux by an ISV or a research institution. However, the feasibility of and demand for such a solution must be established before the implementation becomes realistic. This spec charts the steps the open source community can take to move the process forward.

Design

Front end

Technically, the front end is the easiest part of the puzzle, and traditionally it would be left until last. In this case, however, there are good reasons to develop a good GUI early, as it can act as a catalyst for the lower-level work. See: ["/GUI"], ["/SpeechMaker"].

Speech recognition engines

Teams like Julius and Sphinx are working on open source solutions, but are largely held back by the lack of good free voice models, which in turn require a large body of free, high-quality voice data. The VoxForge project has been set up to provide this through community contributions, but the project needs a larger volunteer base and better end-user tools.

The front end should provide a simple way to record voice data and submit it to the VoxForge site directly. This will facilitate a distributed effort to improve recognition results. It should also be able to work with proprietary engines to be more immediately useful, speeding up general uptake of speech recognition on Linux.
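
As a rough illustration of the record-and-submit step, the sketch below bundles recorded prompts into a single archive that a front end could later upload. It is only a sketch under stated assumptions: the archive layout (`wav/<id>.wav` plus `prompts.txt`), the 16 kHz 16-bit mono format, and the function names `wav_bytes`, `add_member` and `package_submission` are hypothetical and are not taken from VoxForge's documentation. Audio capture and the actual upload are deliberately left out.

```python
import io
import tarfile
import wave


def wav_bytes(pcm, sample_rate=16000):
    """Wrap raw 16-bit mono PCM data in a WAV container (assumed format)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()


def add_member(tar, name, data):
    """Add an in-memory byte string to an open tar archive under `name`."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def package_submission(recordings, sample_rate=16000):
    """Bundle (prompt_id, prompt_text, pcm) tuples into a .tar.gz held in memory.

    The layout used here is an assumption made for illustration only.
    """
    archive = io.BytesIO()
    with tarfile.open(fileobj=archive, mode="w:gz") as tar:
        prompt_lines = []
        for prompt_id, text, pcm in recordings:
            prompt_lines.append(f"{prompt_id} {text}\n")
            add_member(tar, f"wav/{prompt_id}.wav", wav_bytes(pcm, sample_rate))
        add_member(tar, "prompts.txt", "".join(prompt_lines).encode("utf-8"))
    return archive.getvalue()
```

A front end would call `package_submission` with the prompts the user has just read aloud and hand the resulting bytes to whatever upload mechanism the receiving site accepts. Keeping the prompt text next to each recording matters because acoustic model training needs to know what was said in each file.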

Outstanding Issues

  • Front end design details

Comments


CategorySpec