SpeechRecognition/GUI

Launchpad Entry: speech-recognition
Created: 2007-04-27
Contributors: HenrikOmma

Summary

A front-end GUI for speech recognition suitable for any engine, including those running remotely on a separate computer.

Rationale

Speech recognition is a large and complex project, but the front end is a relatively simple component that could be written now, and may help provide a user base and momentum for the recognition engine work.

Use Cases

Anders has advanced RSI and has been advised not to use a keyboard for any sustained typing for six months. He runs a commercial speech recognition engine in a Windows virtual machine on Ubuntu and pipes the text into the GNOME desktop. He can dictate the bulk of his text this way and even make some corrections by voice. He can also supplement with limited use of mouse and keyboard.
Daniela has set up an Ubuntu box as a home entertainment system that she controls via a wireless microphone. She uses a simple open source speech recognition engine to interpret the control commands and uses the GUI front end to configure their effect in the host applications via hotkeys.

Scope

A local Ubuntu client for collecting text output from speech recognition engines
A simple Windows client that can receive text input from a commercial engine and feed it across the network
A protocol for transmitting text and control commands over IP

Design

The GUI

Simple yet flexible ...

Windows client

A simple text-input window that can collect text from the speech engine and send it to the Ubuntu application, converting commands and macros as needed.
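As a rough sketch of the conversion step, the client could escape ordinary dictated text while passing command markup through untouched. Everything named here is an assumption for illustration: the function to_wire, the pattern COMMAND_RE, the restriction of command names to lowercase letters and hyphens, and the choice of UTF-8 on the wire. None of it is part of an agreed design.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pattern: command names are lowercase words joined by hyphens.
COMMAND_RE = re.compile(r"(<command>[a-z-]+</command>)")


def to_wire(engine_text):
    """Turn text collected from the speech engine into bytes for the network.

    Command markup is kept as-is; everything else is XML-escaped so that
    dictated characters such as '&' or '<' cannot break the feed.
    """
    pieces = []
    # Splitting on a capturing group keeps the matched commands in the result.
    for part in COMMAND_RE.split(engine_text):
        if COMMAND_RE.fullmatch(part):
            pieces.append(part)          # embedded command, sent unchanged
        else:
            pieces.append(escape(part))  # plain dictation, escaped
    return "".join(pieces).encode("utf-8")
```

The resulting bytes could then be written to an ordinary TCP socket, for example with sock.sendall(to_wire(text)), where sock is connected to whatever address the Ubuntu client listens on.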

Text transfer protocol

A simple XML protocol to transmit the text feed plus some embedded command statements.

NaturallySpeaking and ViaVoice both support user-defined macros. The user records a voice sequence such as 'Computer: delete that sentence' or 'insert my address block' and defines the corresponding action or text block. We can use this to create a rich set of editing commands on the Ubuntu end as well. The phrase 'Computer: delete that sentence' could be linked with the macro text <command>delete-sentence</command>. This would be transferred from the Windows client like any other text, but the receiving application would give it special meaning and invoke the appropriate action in the host editor.

This scheme does require some configuration, such as creating the macros in the commercial system, but yields a highly configurable solution.
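On the receiving side, separating dictated text from embedded commands might look something like the sketch below. It assumes the feed arrives as a well-formed XML fragment, and it wraps that fragment in a made-up <feed> root element so that the standard parser accepts mixed text and elements. The function name parse_feed and the event tuples are illustrative only.

```python
import xml.etree.ElementTree as ET


def parse_feed(fragment):
    """Split a feed fragment into ("text", ...) and ("command", ...) events."""
    # Assumption: a hypothetical <feed> wrapper gives the fragment a single root.
    root = ET.fromstring("<feed>" + fragment + "</feed>")
    events = []
    if root.text:
        events.append(("text", root.text))
    for child in root:
        if child.tag == "command":
            events.append(("command", (child.text or "").strip()))
        # Text that follows an element is stored on that element's tail.
        if child.tail:
            events.append(("text", child.tail))
    return events
```

For example, parse_feed("Hello world. <command>delete-sentence</command>Next.") yields a text event, a command event named delete-sentence, and a second text event. The receiving application would then map each command name to an action in the host editor.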

Implementation

Outstanding Issues

Is there a preferred existing protocol that can be used to pipe simple text and metadata, such as RSS or some streaming XML format? (The Jabber protocol may be a good example to work from; it is effectively a streaming XML format, using short message updates for IM messages, presence, etc.)
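To give a feel for what a streaming XML format would mean for the receiver, here is a minimal sketch that uses Python's incremental XML parser. It is not Jabber and does not follow any existing protocol. The <stream> root and the <text> element are invented for the example; only <command> comes from the design above. The root element is opened once and stays open for the life of the connection, with each short update arriving as a child element.

```python
import xml.etree.ElementTree as ET


def read_stream(chunks):
    """Yield ("text", ...) and ("command", ...) events as chunks arrive.

    chunks is any iterable of strings, e.g. data read from a socket.
    Element boundaries do not need to line up with chunk boundaries.
    """
    parser = ET.XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        # Only elements that have been fully received are reported here.
        for _event, elem in parser.read_events():
            if elem.tag == "text":          # hypothetical element name
                yield ("text", elem.text or "")
            elif elem.tag == "command":
                yield ("command", (elem.text or "").strip())
```

Because events are produced as soon as each child element closes, the Ubuntu client could act on a command without waiting for the connection to end.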

BoF agenda and discussion

CategorySpec