SpeechRecognition

Summary

A roadmap for providing speech recognition on Ubuntu.

Rationale

Robust speech recognition would be useful for many groups for both dictation and navigation. There are currently no workable solutions available on Linux.

Background

Speech recognition technology has slowly evolved over the past two decades, from recognition of a few pre-recorded words to dictation of general text without prior voice training. [http://www.nuance.com/naturallyspeaking/ Dragon NaturallySpeaking] by Nuance is the current market leader, but Microsoft looks poised to do a Netscape job on speech recognition by including it as a feature in Vista. The accuracy and responsiveness of speech recognition depend in part on the available processing power. With the advent of multi-core CPUs we could see an explosion of speech recognition usage, giving an edge to Vista over Linux and OS X.

Some efforts are being made at creating free speech recognition technology, but it is a large and complex problem. [http://www.voxforge.org/ VoxForge] is a promising initiative set up to provide the acoustic models needed by open source speech recognition engines such as [http://cmusphinx.sourceforge.net/html/cmusphinx.php Sphinx], [http://www.ece.msstate.edu/research/isip/projects/speech/index.html ISIP], [http://htk.eng.cam.ac.uk/ HTK], and [http://julius.sourceforge.jp/en_index.php?q=en/index.html Julius]. Currently, these free alternatives are only usable in limited applications, as they fail with larger vocabularies.

Use Cases

  • Professionals who perform dictation
  • Non-Latin language input
  • Mobility-impaired users
  • Sufferers of RSI

Scope

This is an informational spec, charting the options for speech recognition on Ubuntu. The best long-term solution is a native port or new implementation of a speech recognition engine on Linux by an ISV or a research institution. However, the feasibility of and demand for such a solution must be established before the implementation becomes realistic. This spec charts the steps the open source community can take to move the process forward.

Design

Front end

Technically, the front end is the easiest part of the puzzle, and traditionally it would be left for the end. However, there are good reasons to develop a good GUI early in this case, as it can act as a catalyst for the more low-level work. See: ["/GUI"], ["/SpeechMaker"].

Speech recognition engines

Teams like Julius and Sphinx are working on open source solutions, but are largely held back by the lack of good free voice models, which in turn require a large body of free, high-quality voice data. The VoxForge project has been set up to provide this through community contributions, but the project needs a larger volunteer base and better end-user tools.

The front end should provide a simple way to record voice data and submit it to the VoxForge site directly. This will facilitate a distributed effort to improve recognition results. It should also be able to work with proprietary engines to be more immediately useful, speeding up the general uptake of speech recognition on Linux.
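
As a very rough sketch of what such a record-and-submit helper might look like, the snippet below captures a spoken prompt with ALSA's arecord and posts it over HTTP. The upload URL, header, and submission format are placeholders for illustration only, not a real VoxForge interface.

{{{#!python
#!/usr/bin/env python
"""Record a spoken prompt and submit it for acoustic-model training.

Sketch only: the upload URL, header and submission format below are
hypothetical placeholders, not a real VoxForge interface.
"""
import subprocess
import urllib.request

PROMPT = "the quick brown fox jumps over the lazy dog"
WAV_FILE = "prompt_001.wav"
UPLOAD_URL = "https://example.org/voxforge-upload"  # placeholder endpoint


def record_prompt(path, seconds=5):
    """Capture mono 16 kHz, 16-bit audio using ALSA's arecord."""
    subprocess.run(
        ["arecord", "-f", "S16_LE", "-r", "16000", "-c", "1",
         "-d", str(seconds), path],
        check=True,
    )


def submit(path, prompt_text):
    """POST the raw WAV data; real metadata handling is omitted here."""
    with open(path, "rb") as wav:
        data = wav.read()
    request = urllib.request.Request(
        UPLOAD_URL,
        data=data,
        headers={"Content-Type": "audio/wav",
                 "X-Prompt-Text": prompt_text},  # hypothetical header
    )
    with urllib.request.urlopen(request) as response:
        print("Server replied:", response.status)


if __name__ == "__main__":
    print("Please read aloud:", PROMPT)
    record_prompt(WAV_FILE)
    submit(WAV_FILE, PROMPT)
}}}

The actual front end would wrap something like this in a GUI, manage prompt lists, and follow whatever submission mechanism VoxForge actually provides.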

Outstanding Issues

  • Front end design details

Comments

  • ["EricSJohansson"] The ability to record voice data is only a tiny fraction of what's needed to turn current open-source speech recognition engines into something usable. In addition to the basic recognition engine, you also need a vocabulary of 80,000 to 120,000 words, language modeling, adaptation processes, correction processes, and accessibility features for injection and correction of dictated text. In a nutshell, expect to spend between $5 million and $10 million worth of effort to make a usable speech recognition environment with current open-source tools. While this is a laudable goal, disabled people can't afford to wait. Today, we go with what we have, which means a commercial product (i.e. NaturallySpeaking) on Windows, and cobble together a variety of tools which enable us to interact with Linux applications. The end result is a somewhat usable environment that could use some significant improvements.

    • Comment: I wonder whether you have looked into the current state of the open-source projects mentioned here? Yes, vocabulary is lacking, but the other features you have listed (language modeling, adaptation/correction etc.) are not absent from current open-source solutions. I understand that you would like to use NaturallySpeaking, as you've probably been using it for a long time, but your words sound a bit rash towards projects like Sphinx-4, which is in many ways quite an advanced speech recognition project; I tested it with some degree of success years ago. With VoxForge and, of course, an easy UI (if such doesn't yet exist), usable open source speech recognition might not be that far away. Of course, that doesn't mean good instructions for using Wine + NaturallySpeaking shouldn't be written, but that isn't really related to this specification. Anyone is free to start writing e.g. scripts with which to enable NaturallySpeaking usage in some way, but it's always going to be an ugly hack. Anyway, NaturallySpeaking version 7 is apparently possible to get working under Wine, so that would be a starting point for anyone interested. It might still prove quite a task to get any reasonable interaction between the Wine application and the intended target application.

    I believe that there are two short-term solutions. The obvious solution is NaturallySpeaking in Wine. This can work, but it will probably be very limited, as it won't be able to interact with Linux applications without a significant amount of work. The second solution is a bridge between NaturallySpeaking on Windows and a remote Linux environment (a rough sketch of the Linux side of such a bridge appears after these comments). The goal behind this model is to speed up the delivery of speech recognition driving the Linux environment: all the development effort could be focused on making Linux more accessible using speech recognition. While I wouldn't exactly say that handicapped people don't care about ideology, the goal should be making handicapped people's lives easier and nothing else. If that means using a closed-source product at the core in order to make people's lives better, then do so. Once handicapped people are working and playing with Linux, then start the effort to backfill the closed-source parts of the solution.

  • ["Warbo"]: I don't know much about the technical implementation of such a system, but if WINE would run such software well enough (I personally have an old version of ViaVoice and a cut-down version of Dragon NaturallySpeaking lying around somewhere, but have never tested them since moving to Linux), then surely a "Windows running voice recognition > Linux applications" tool would be pretty similar to a "WINE running voice recognition > Linux applications" tool, so there would be no need for Windows. I agree that the current approach to improving voice recognition just involves throwing masses and masses of research data at the problem, which would be incredibly costly in money or time for a free program (I have had discussions with various professors working in and around this field), but I just don't see why the cost and resource (and freedom?) overhead of Windows is needed just to send data through some protocol that WINE could also use (i.e. integrating a WINE solution into every application would be hard work, but so would integrating a remote/virtualised Windows system; using a common accessibility protocol between Windows and a Linux interpreter would be easier, but why not use that with WINE instead of integrating it directly?).

  • ["EricSJohansson"] You will need a bridge between NaturallySpeaking and Linux whether it runs natively on Windows or in Wine. Bridging between Windows and Linux has some advantages: it simplifies the development process by eliminating additional dependencies and potential problems with Wine, it eliminates licensing problems and potential sabotage by Nuance, it enables the use of different speech recognition packages, and it provides support for users who must work in both Windows and Linux. The development of a bridge is going to be difficult enough without adding Wine into the mix. Wine introduces instability and may leave the user or developer wondering what failed: the bridge or Wine? Wine will be good in the future, but in the short term it will only complicate things.

    Nuance has taken an adversarial stance towards its customers: charging fees for bug reports, persistent problems with the subsystems used to interface with many non-supported applications, aggressive DRM, using the update tool to advertise its own products, and licensing changes forbidding the use of third-party macro packages with any version of NaturallySpeaking except the NaturallySpeaking Professional-based products, i.e. its most expensive products. Therefore, I wouldn't be surprised if Nuance sabotages any attempt to run NaturallySpeaking on Wine unless it can find some way to get more money from the customer for that feature.

    NaturallySpeaking on Wine will create further dependence on Nuance, whereas a system using both Windows and Linux will allow the user to choose between NaturallySpeaking and Microsoft speech recognition. While it's not much of a choice, at least it's a choice that keeps the user minimally free of dependency on a monopoly supplier. Most people, myself included, can't totally walk away from Windows. I need to use Windows applications occasionally, either because the application isn't available on Linux or because the Windows version works better. An outgrowth of this ability to switch should be the ability to switch between multiple Linux instances, either on virtual machines or across the network. I've always thought the philosophy for speech recognition, or indeed any accessibility interface, should be that you have your own box with all your accessibility capabilities, and that box can drive any other system, thereby making it accessible to you without forcing the remote machine to have all of your accessibility aids as well.
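
As a very rough illustration of the bridge idea discussed in the comments above, the Linux side could be little more than a small service that accepts dictated text from the machine running the recognizer and types it into the currently focused application. The following sketch is only an assumption-laden illustration, not part of any existing project: it assumes a newline-delimited plain-text protocol over TCP (the port number is arbitrary) and uses the xdotool utility for keystroke injection; the Windows-side sender (e.g. a NaturallySpeaking macro) is not shown.

{{{#!python
#!/usr/bin/env python
"""Linux end of a hypothetical dictation bridge.

Listens for newline-delimited UTF-8 text from the machine running the
speech recognizer and types each line into the currently focused X
application with xdotool.  Protocol and port are illustrative only.
"""
import socket
import subprocess

HOST = "0.0.0.0"   # listen on all interfaces
PORT = 7474        # arbitrary choice, for illustration


def type_text(text):
    """Inject text into the focused window via xdotool."""
    subprocess.run(["xdotool", "type", "--clearmodifiers", text], check=True)


def serve():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(1)
        print("Waiting for the dictation client on port", PORT)
        conn, addr = srv.accept()
        with conn, conn.makefile("r", encoding="utf-8") as lines:
            print("Connected:", addr)
            for line in lines:
                text = line.rstrip("\n")
                if text:
                    type_text(text)


if __name__ == "__main__":
    serve()
}}}

A real bridge would also need to carry command, correction, and feedback traffic in both directions, which is where most of the difficulty described in the comments above lies.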


CategorySpec
