SpeechRecognition

Summary

A roadmap for providing speech recognition on Ubuntu.

Rationale

Robust speech recognition would be useful for many groups for both dictation and navigation. There are currently no workable solutions available on Linux.

Background

Speech recognition technology has slowly evolved over the past two decades, from recognition of a few pre-recorded words to dictation of general text without prior voice training. [http://www.nuance.com/naturallyspeaking/ Dragon NaturallySpeaking] by Nuance is the current market leader, but Microsoft looks poised to do a Netscape job on speech recognition by including it as a feature in Vista. The accuracy and responsiveness of speech recognition depends in part on the available processing power. With the advent of multi-core CPUs we could see an explosion of speech recognition usage, giving an edge to Vista over Linux and OS X.

Some efforts are being made at creating free speech recognition technology, but it is a large and complex problem. [http://www.voxforge.org/ VoxForge] is a promising initiative set up to provide the acoustic models needed by open source speech recognition engines such as [http://cmusphinx.sourceforge.net/html/cmusphinx.php Sphinx], [http://www.ece.msstate.edu/research/isip/projects/speech/index.html ISIP], [http://htk.eng.cam.ac.uk/ HTK], and [http://julius.sourceforge.jp/en_index.php?q=en/index.html Julius]. Currently, these free alternatives are only usable in limited applications, as they fail with larger vocabularies.

Use Cases

  • Professionals who perform dictation
  • Non-Latin language input
  • Mobility-impaired users
  • Sufferers of RSI

Scope

This is an informational spec, charting the options for speech recognition on Ubuntu. The best long-term solution is a native port or new implementation of a speech recognition engine on Linux by an ISV or a research institution. However, the feasibility of and demand for such a solution must be established before the implementation becomes realistic. This spec charts the steps the open source community can take to move the process forward.

Design

Front end

Technically, the front end is the easiest part of the puzzle, and traditionally it would be left for the end. However, there are good reasons for developing a good GUI early in this case, as it can act as a catalyst for the more low-level work. See: ["/GUI"], ["/SpeechMaker"].

Speech recognition engines

Teams like Julius and Sphinx are working on open source solutions, but are largely held back by the lack of good free voice models, which in turn requires a large body of free, high quality voice data. The VoxForge project has been set up to provide this through community contributions, but the project needs a larger volunteer base and better end-user tools.

The front end should provide a simple way to record voice data and submit it to the VoxForge site directly. This will facilitate a distributed effort to improve recognition results. It should also be able to work with proprietary engines to be more immediately useful, speeding up general uptake of speech recognition on Linux.
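The record-and-submit flow described above could be sketched as follows, using only the Python standard library. This is an illustration, not a specification: the upload URL, the form of the metadata header, and the sample format are all assumptions here; a real client would follow whatever submission process VoxForge actually defines.

```python
# Sketch of the front end's "record and submit" flow.
# ASSUMPTIONS: the upload endpoint and headers below are placeholders,
# and one second of silence stands in for real microphone capture.
import io
import wave
import urllib.request

SAMPLE_RATE = 16000   # 16 kHz mono is a common rate for acoustic-model training


def pcm_to_wav(samples: bytes, sample_rate: int = SAMPLE_RATE) -> bytes:
    """Wrap raw 16-bit mono PCM samples in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)       # mono
        w.setsampwidth(2)       # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(samples)
    return buf.getvalue()


def build_upload(wav_bytes: bytes, prompt_text: str,
                 url: str = "https://example.org/voxforge/submit"):
    """Build (but do not send) an HTTP POST carrying the recording.

    The URL is a placeholder; the prompt text identifies what the
    speaker was asked to read, which the acoustic-model training needs.
    """
    return urllib.request.Request(
        url, data=wav_bytes,
        headers={"Content-Type": "audio/wav",
                 "X-Prompt-Text": prompt_text})


# One second of silence stands in for microphone capture here.
wav = pcm_to_wav(b"\x00\x00" * SAMPLE_RATE)
req = build_upload(wav, "the quick brown fox")
```

Keeping the recording and the prompt text together is the important part: community-contributed audio is only useful for model training if it is paired with an exact transcript.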

Outstanding Issues

  • Front end design details

Comments

The ability to record voice data is only a tiny fraction of what's needed to turn current open-source speech recognition engines into something usable. In addition to the basic recognition engine, you also need a vocabulary of 80,000 to 120,000 words, language modeling, adaptation processes, correction processes, and accessibility features for injection and correction of dictated text. In a nutshell, expect to spend between $5 million and $10 million worth of effort to make a usable speech recognition environment with current open-source tools. While this is a laudable goal, disabled people can't afford to wait. Today, we go with what we have, which means a commercial product (i.e. NaturallySpeaking) on Windows, and cobble together a variety of tools which enable us to interact with Linux applications. The end result is a somewhat usable environment that could use some significant improvements.

I believe that there are two short-term solutions. The obvious one is NaturallySpeaking in WINE. This can work, but it will probably be very limited, as it won't be able to interact with Linux applications without a significant amount of work. The second is a bridge between NaturallySpeaking on Windows and a remote Linux environment. The goal behind this model is to speed up the delivery of speech recognition driving the Linux environment: all the development effort could be focused on making Linux more accessible using speech recognition.
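The "bridge" model described here can be sketched as a very small wire protocol: the recognition host (Windows or WINE) forwards each recognized utterance as one UTF-8 line over TCP, and a listener on the Linux side receives it for injection into applications. Everything below is illustrative, not an existing standard; the listener just collects text, where a real one might feed an accessibility layer such as AT-SPI or a virtual keyboard driver.

```python
# Minimal sketch of the Windows-to-Linux recognition bridge.
# ASSUMPTIONS: the newline-delimited protocol is invented for this
# example; both ends run on loopback so the sketch is self-contained.
import socket
import threading

received = []


def linux_listener(server: socket.socket) -> None:
    """Accept one connection and collect newline-delimited utterances."""
    conn, _ = server.accept()
    with conn, conn.makefile("r", encoding="utf-8") as lines:
        for line in lines:
            # A real listener would inject this text into the focused
            # application; here we just record it.
            received.append(line.rstrip("\n"))


# The "Linux desktop" side: listen on an ephemeral loopback port.
server = socket.create_server(("127.0.0.1", 0))
port = server.getsockname()[1]
t = threading.Thread(target=linux_listener, args=(server,))
t.start()

# The "Windows" side: forward recognized text to the listener.
with socket.create_connection(("127.0.0.1", port)) as s:
    s.sendall("open terminal\n".encode("utf-8"))
    s.sendall("dictate hello world\n".encode("utf-8"))

t.join()
server.close()
```

The point of such a thin protocol is exactly the one made above: it decouples the recognition engine (closed-source, on Windows or WINE) from the Linux-side accessibility work, so the two can be developed and replaced independently.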

While I wouldn't exactly say that handicapped people don't care about ideology, the goal should be making handicapped people's lives easier and nothing else. If that means using a closed-source product at the core in order to make people's lives better, then do so. Once handicapped people are working and playing with Linux, then start the effort to backfill the closed-source parts of the solution.

  • ["Warbo"]: I don't know much about the technical implementation of such a system, but if WINE could run such software well enough (I personally have an old version of ViaVoice and a cut-down version of Dragon NaturallySpeaking lying around somewhere, but have never tested them since moving to Linux), then surely a "Windows running voice recognition > Linux applications" tool would be pretty similar to a "WINE running voice recognition > Linux applications" tool, so there would be no need for Windows. I agree that the current approach to improving voice recognition just involves pouring masses and masses of research data at the problem, which would be incredibly costly, financially or chronologically, for a free program (I have had discussions with various professors working in and around this field), but I just don't see why the cost and resource (and freedom?) overhead of running Windows is needed just to send data through some protocol that WINE could also use. Integrating a WINE solution into every application would be hard work, but so would integrating a remote/virtualised Windows system; using a common accessibility protocol between Windows and a Linux interpreter would be easier, but why not use that with WINE instead of integrating it directly?


CategorySpec

SpeechRecognition (last edited 2011-03-19 15:16:56 by D9784B24)