The following is an account of my experiments with using speech recognition to dictate into Ubuntu (indirectly). Ideas for improving the approach are welcome.

Speech recognition engines

Linux: ViaVoice, etc.

You could get hold of the old IBM ViaVoice binaries for Linux (Mandrake supplied them once upon a time), but that speech engine isn't really very good. (If someone has tried this recently with good results, please share your experiences here.) Note that the ViaVoice engine is also closed source. Some open-source speech recognition engines are being worked on, but they have a long way to go.

Dragon Naturally Speaking

The best speech recognition engine seems to be Dragon Naturally Speaking, which is only available for Windows. It would be cool to write a Linux wrapper for it or to get a native Linux build, but in the meantime we need to cheat.

The basic idea is to have two physical computers running, one with Windows and one with Ubuntu. You run the dictation program on Windows along with a VNC client, and feed the text into Ubuntu over the network.

Hardware

I run two fast, beefy systems in order to get the highest possible quality out of the speech recognition system, and also so that the Linux system performs well while acting as a VNC server (though with speed also comes noise; more on that below).

Dictation System:

Ubuntu system:

The basic setup

You have the choice of which system you want as your local environment. This is likely to be a matter of taste. Since I mainly use Ubuntu for my daily work, I prefer to have it as my native system so that it responds well and I can perform various operations directly on it. Alternatively, you could use Windows as your local display and operate Ubuntu through the VNC client. I've run both Gnome and KDE (with both VNC and NX) with similar results.

Ubuntu-as-local

Windows-as-local

You can also set up a hybrid of these two setups by using a KVM switch to toggle between the native displays at will. It may also be worth trying a Gigabit network between the two machines, especially if you can then also use low compression levels to reduce CPU usage, as sketched below.
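
For illustration, the TightVNC Unix viewer exposes these settings as command-line flags (the Windows viewer has equivalent knobs in its options dialog). This is a sketch only, with the hostname as a placeholder:

    # Low Tight compression plus high JPEG quality trades bandwidth for
    # CPU time, which is an easy trade to make on a Gigabit LAN.
    vncviewer -compresslevel 0 -quality 9 ubuntu-host:0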

VNC vs. NX vs. X11

I have tried running the system with both VNC and NX. Both seem to have some peculiar issues, but these could probably be fixed quite easily (in the VNC and FreeNX code). The NX setup performs better in terms of CPU usage on the Ubuntu machine than VNC, which uses 40% of the CPU while idle and saturates it as soon as you start moving windows around.

VNC

Setup: I chose x11vnc as the local VNC server because I wanted to share display :0, whereas Xvnc serves an alternate display by default (e.g. display :1). Vino, Gnome's native server, also shares display :0, but its performance was so bad that it wasn't usable.
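
A minimal sketch of the invocation, assuming the x11vnc package from the Ubuntu repositories; the polling flags are my guess at taming the idle CPU usage mentioned above, not a tested recipe:

    # Share the existing desktop (:0) rather than creating a virtual one.
    # -forever : keep serving after the first viewer disconnects
    # -nap     : poll the screen less aggressively when nothing is changing
    # -wait 50 : pause 50 ms between screen polls (the default is shorter)
    x11vnc -display :0 -forever -nap -wait 50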

Issues:

NX

The NX system does some strange things with capitalization: every other phrase comes out in UPPERCASE. I was also only able to run a secondary (display :1) session, while ideally I would like to run the same session on the Windows machine as the local native session (display :0); at the very least I want the flexibility to choose. There may well be a way of doing this in NX that I haven't come across yet.

X11

You could also run a native X session using an X server such as Cygwin/X or Hummingbird Exceed on the Windows machine. This needs further testing.
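
A sketch of what such a test might look like, assuming Cygwin/X and its OpenSSH package on the Windows machine (hostnames and usernames are placeholders):

    # From a Cygwin shell with the X server running: pull a single Ubuntu
    # application over SSH with X11 forwarding.
    ssh -X user@ubuntu-host gnome-terminal

    # Or request a full remote session via XDMCP (XDMCP has to be enabled
    # in gdm on the Ubuntu side first).
    XWin -query ubuntu-host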

Overall results

The speech recognition itself is quite good. This is partly due to the present maturity of Dragon Naturally Speaking, but also to the choice of good hardware, including a good microphone, a good sound card, and a fast PC with 2 GB of RAM. I'll want to do more tweaking of the hardware setup, though, and test the effect this has on the audio input quality.

I have two power-hungry PCs sitting in my office that would normally produce a fair bit of noise, which the microphone would pick up (just record and play back to hear it). To combat this I am employing sound-reducing measures. The most recently built PC (the 4000+) is purpose-built to be a silent PC, using silent components and sound dampening, while the older one is a Shuttle mini system to which I've added some sound-reduction padding (AcoustiPack foam), with fairly good results.
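
A quick way to do the record-and-playback test, assuming the stock alsa-utils tools:

    # Record five seconds of 'silence' from the microphone, then play it
    # back to hear how much machine noise the mic actually picks up.
    arecord -d 5 -f cd /tmp/noise-test.wav
    aplay /tmp/noise-test.wav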

They also seem to produce a fair bit of electronic noise, which might interfere with the sound card, the audio cabling, or even the wireless microphone system that I'm using. (Does anyone know how to test for electronic noise other than by switching on an FM radio in the room?)

Future developments (wish list)

We cannot do much to change the proprietary speech recognition engine, other than hope that we will soon get a native Linux version, but we can make improvements to the VNC servers and clients to make them more suitable for this purpose.

The only thing we really need from VNC in the Ubuntu-as-local setup is to have text piped in from the Win32 box. We don't actually need to send the picture back, though we probably do need an active window on the Win32 system for the voice engine to dump its text into. Transmitting the screen image is obviously the part of the VNC connection that requires the most bandwidth and CPU, so it ought to be a fairly simple hack to the VNC protocol (or to a server/client pair) to allow for text piping only.
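
Pending such a hack, here is a crude approximation that skips VNC altogether, using tools that already exist. This is a sketch only: it assumes the netcat and xdotool packages on Ubuntu and a netcat build (e.g. Cygwin's) on Windows, with the netcat console serving as the window the voice engine dictates into. Console input is line-buffered, so text only arrives once a newline is dictated.

    # On the Ubuntu machine: listen for text and replay it as synthetic
    # keystrokes into whatever window has focus (traditional netcat
    # syntax; port 5555 is arbitrary).
    nc -l -p 5555 | xdotool type --file -

    # On the Windows machine (Cygwin shell): everything dictated into this
    # console window is sent to the Ubuntu box, one line at a time.
    nc ubuntu-host 5555

Typing into the focused window is crude compared to a proper text channel in the VNC protocol, but it demonstrates that the picture never needs to travel back over the wire.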


CategoryAccessibility CategoryDocumentation
