The following is an account of my experiments with using speech recognition to dictate into Ubuntu (indirectly). Ideas for improving the approach are welcome.

Speech recognition engines

Linux: ViaVoice, etc.

You could get hold of the old IBM ViaVoice binaries for Linux (Mandrake supplied them once upon a time), but that speech engine isn't really very good. (If someone has tried this recently with good results, please share your experiences here.) Note that the ViaVoice engine is also closed source. Some open-source speech recognition engines are being worked on, but they have a long way to go.

Dragon Naturally Speaking

The best speech recognition engine seems to be Dragon Naturally Speaking, which is only available for Windows. It would be cool to write a Linux wrapper for it or to get a native Linux build, but in the meantime we need to cheat.

The basic idea is to have two physical computers running, one with Windows and one with Ubuntu. You run the dictation program on Windows along with a VNC client, and feed the text into Ubuntu over the network.

Hardware

I run two fast, beefy systems in order to get the highest possible quality out of the speech recognition system, and also so that the Linux system performs well while acting as a VNC server (though with speed also comes noise; more on that below).

Dictation System:

Ubuntu system:

The basic setup

You have the choice of which system you want as your local environment. This is likely to be a matter of taste. Since I mainly use Ubuntu for my daily work, I prefer to have it as my native system so that it responds well and I can perform various operations directly on it. Alternatively, you could use Windows as your local display and operate Ubuntu through the VNC client. I've run both Gnome and KDE (with both VNC and NX) with similar results.

Ubuntu-as-local

Windows-as-local

You can also set up a hybrid of these two setups by using a KVM switch to toggle between the native displays at will. It may also be worth trying a Gigabit network between the two machines, especially if you can then also use low compression levels to reduce CPU usage, as sketched below.
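
For illustration, the TightVNC Unix viewer exposes these settings as command-line flags (the Windows viewer has equivalent knobs in its options dialog). This is a sketch only, with the hostname as a placeholder:

    # Low Tight compression plus high JPEG quality trades bandwidth for
    # CPU time, which is an easy trade to make on a Gigabit LAN.
    vncviewer -compresslevel 0 -quality 9 ubuntu-host:0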

VNC vs. NX vs. X11

I have tried running the system with both VNC and NX. Both seem to have some peculiar issues, but these could probably be fixed quite easily (in the VNC and FreeNX code). The NX setup performs better in terms of CPU usage on the Ubuntu machine than VNC, which uses 40% of the CPU while idle and saturates it as soon as you start moving windows around.

VNC

Setup: I chose x11vnc as the local VNC server because I wanted to share display :0, whereas Xvnc serves an alternate display by default (e.g. display :1). Vino, Gnome's native server, also shares display :0, but its performance was so bad that it wasn't usable.
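
A minimal sketch of the invocation, assuming the x11vnc package from the Ubuntu repositories; the polling flags are my guess at taming the idle CPU usage mentioned above, not a tested recipe:

    # Share the existing desktop (:0) rather than creating a virtual one.
    # -forever : keep serving after the first viewer disconnects
    # -nap     : poll the screen less aggressively when nothing is changing
    # -wait 50 : pause 50 ms between screen polls (the default is shorter)
    x11vnc -display :0 -forever -nap -wait 50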

Issues:

NX

The NX system does some strange things with capitalization: every other phrase comes out in UPPERCASE. I was also only able to run a secondary (display :1) session, while ideally I would like to run the same session on the Windows machine as the local native session (display :0); at the very least I want the flexibility to choose. There may well be a way of doing this in NX that I haven't come across yet.

X11

You could also run a native X session using an X server such as Cygwin/X or Hummingbird Exceed on the Windows machine. This needs further testing.
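
A sketch of what such a test might look like, assuming Cygwin/X and its OpenSSH package on the Windows machine (hostnames and usernames are placeholders):

    # From a Cygwin shell with the X server running: pull a single Ubuntu
    # application over SSH with X11 forwarding.
    ssh -X user@ubuntu-host gnome-terminal

    # Or request a full remote session via XDMCP (XDMCP has to be enabled
    # in gdm on the Ubuntu side first).
    XWin -query ubuntu-host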

Overall results

The speech recognition itself is quite good. This is partly due to the present maturity of Dragon Naturally Speaking, but also to the choice of good hardware, including a good microphone, a good sound card, and a fast PC with 2 GB of RAM. I'll want to do more tweaking of the hardware setup, though, and test the effect this has on the audio input quality.

I have two power-hungry PCs sitting in my office that would normally produce a fair bit of noise, which the microphone would pick up (just record and play back to hear it). To combat this I am employing sound-reducing measures. The most recently built PC (the 4000+) is purpose-built to be a silent PC, using silent components and sound dampening, while the older one is a Shuttle mini system to which I've added some sound-reduction padding (AcoustiPack foam), with fairly good results.
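
A quick way to do the record-and-playback test, assuming the stock alsa-utils tools:

    # Record five seconds of 'silence' from the microphone, then play it
    # back to hear how much machine noise the mic actually picks up.
    arecord -d 5 -f cd /tmp/noise-test.wav
    aplay /tmp/noise-test.wav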

They also seem to produce a fair bit of electronic noise, which might interfere with the sound card, the audio cabling, or even the wireless microphone system that I'm using. (Does anyone know how to test for electronic noise other than by switching on an FM radio in the room?)

Future developments (wish list)

We cannot do much to change the proprietary speech recognition engine, other than hope that we will soon get a native Linux version, but we can make improvements to the VNC servers and clients to make them more suitable for this purpose.

The only thing we really need from VNC in the Ubuntu-as-local setup is to have text piped in from the Win32 box. We don't actually need to send the picture back, though we probably do need an active window on the Win32 system for the voice engine to dump its text into. Transmitting the screen image is obviously the part of the VNC connection that requires the most bandwidth and CPU, so it ought to be a fairly simple hack to the VNC protocol (or to a server/client pair) to allow for text piping only.
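
Pending such a hack, here is a crude approximation that skips VNC altogether, using tools that already exist. This is a sketch only: it assumes the netcat and xdotool packages on Ubuntu and a netcat build (e.g. Cygwin's) on Windows, with the netcat console serving as the window the voice engine dictates into. Console input is line-buffered, so text only arrives once a newline is dictated.

    # On the Ubuntu machine: listen for text and replay it as synthetic
    # keystrokes into whatever window has focus (traditional netcat
    # syntax; port 5555 is arbitrary).
    nc -l -p 5555 | xdotool type --file -

    # On the Windows machine (Cygwin shell): everything dictated into this
    # console window is sent to the Ubuntu box, one line at a time.
    nc ubuntu-host 5555

Typing into the focused window is crude compared to a proper text channel in the VNC protocol, but it demonstrates that the picture never needs to travel back over the wire.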


CategoryAccessibility CategoryDocumentation
