Planning, challenges, and roadmap for voice driven user interfaces in Ubuntu, with a focus on mobile platform needs.

Release Note

Voice driven user interface work is meant to complement other aspects of user interface development that make Ubuntu a preferred distribution. This effort strongly complements and utilizes work already done in accessibility using atk and gnome speech, and relates loosely to gnome-voice-control. However, this spec is meant to extend this work further into more complete use of voice driven user interfaces, particularly for mobile devices.


The assumption for any "input challenged" user device, such as a MID, is that users may wish to have alternate means of interacting with the device. Voice offers one such means, whether we speak of MIDs or of other specialty or even headless voice-driven applications such as automotive applications. This differs from accessibility, where voice (primarily tts) is used to assist in and complement existing navigation of an already GUI driven desktop environment. This spec describes changes needed in Ubuntu to achieve voice driven user interfaces, and to synchronize work in other groups with the needs of mobile.

User stories

Broadly speaking, a number of user stories make up this spec, and it will likely be broken down further. Core user stories include voice driven application selection (speak and execute); use of audio to synchronize and speak notification events from multiple sources; seamless transition between the audio ui, media players, and voip clients; and application specific interactive voice dialogs.

These broadly fall under a number of use cases identified at UDS, which revolve around current and future uses of Ubuntu. These use cases include smartphones, which in some places require hands-free voice operation, automotive applications, home media and control applications, as well as some assisted uses.

Speech recognition in itself involves a separate set of use cases. In fact, there are several approaches to speech recognition to consider. The first is "limited domain" applications, where the entire use case of the deployment as a whole can be defined as requiring a limited and known or "closed" vocabulary for all interactions. For example, a theoretical (at least so far) "Ubuntu Automotive Remix" would likely have a limited and well-known set of applications and speech domains that could be tested very well in advance. On the other hand, a true general purpose speech ui will have different limited domains based on the application running, but the set and combination of applications that may be used is not known in advance.

"Limited domain" systems come in two forms: those that require training to a given speaker, and those that are speaker independent. For most use cases, speaker independent limited domain speech recognition should provide both accuracy and simplicity, since we avoid the training requirement. There is a special case where users may give their own "tag" for a given item. This occurs, for example, in a smartphone, where a user may give their own names for calling people by voice. Other forms of speech recognition include open-ended systems, such as those used in voice dictation. These latter will be treated separately for now.


We already have free software available for speaker independent limited domain recognition of sufficient quality to be acceptable. We can build upon atk, gnome speech (or its replacement), and other existing tools which already provide a foundation for creating voice driven ui in Ubuntu.


There are speaker independent voice recognition engines available, such as Sphinx. Most of these work best when one can define a narrow vocabulary (domain specific recognition). Hence, what I am proposing is one vocabulary of recognized words for navigating the device and selecting applications by voice, plus specific vocabularies of recognized words and phrases for use with specific applications. The ui state must hence be aware of which application is currently active and set the ASR to the appropriate vocabulary.
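The modal idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `VocabularyRouter`, the application ids, and the example phrases are all hypothetical, and the code does not call Sphinx or any other engine. It only computes which phrases the recognizer should be restricted to for the currently active application.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class VocabularyRouter:
    """Tracks the active application and the phrases valid in that state."""

    # Phrases that are always recognized, e.g. for launching applications.
    navigation: Set[str]
    # Per-application phrase sets, keyed by a hypothetical application id.
    per_app: Dict[str, Set[str]] = field(default_factory=dict)
    active_app: Optional[str] = None

    def set_active(self, app_id: Optional[str]) -> None:
        """Record a focus change (where this event comes from is assumed)."""
        self.active_app = app_id

    def active_vocabulary(self) -> Set[str]:
        """Phrases the ASR engine should be limited to right now."""
        return self.navigation | self.per_app.get(self.active_app, set())

    def accepts(self, phrase: str) -> bool:
        """True if the phrase belongs to the currently active vocabulary."""
        return phrase.strip().lower() in self.active_vocabulary()


router = VocabularyRouter(
    navigation={"open rhythmbox", "open writer"},
    per_app={"rhythmbox": {"play", "pause", "next track"}},
)
router.set_active("rhythmbox")
```

With Rhythmbox active, `router.accepts("pause")` is true; once no application is active, only the navigation phrases remain valid. A real implementation would hand `active_vocabulary()` to the recognizer instead of filtering text afterwards.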

Since we are talking about having speech recognition ui domains by application, presumably we would have "speech-enabled" applications that have a subpackage which defines the application's speech domain and some configuration information for the voice ui. Hence, for example, we might have an "openoffice-writer-speech" package, etc. This means that the task of defining and maintaining a speech ui for a specific application would happen at the discretion and control of individual package and upstream maintainers.
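As a rough illustration of what such a subpackage might ship, the sketch below loads a speech domain from a small JSON document. The file format, the field names `application` and `phrases`, and the function `load_speech_domain` are all assumptions made for this example; the spec does not define a format.

```python
import json

# Hypothetical contents of a file that an "<app>-speech" package might install.
RHYTHMBOX_DOMAIN = """
{
  "application": "rhythmbox",
  "phrases": {
    "play": "play",
    "pause": "pause",
    "next track": "next",
    "previous track": "previous"
  }
}
"""


def load_speech_domain(text):
    """Parse a speech domain and return (application id, phrase -> action)."""
    data = json.loads(text)
    app = data["application"]
    # Normalize spoken phrases to lower case so lookups are case-insensitive.
    phrases = {spoken.lower(): action for spoken, action in data["phrases"].items()}
    if not phrases:
        raise ValueError("speech domain for %r defines no phrases" % app)
    return app, phrases
```

Keeping the domain in a data file rather than in code would let package and upstream maintainers adjust the vocabulary without touching the voice ui itself.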

Text to speech is a simpler problem, and there are several good free software tts systems already out there. However, there is a challenge in deciding how tts should behave when multiple programs each wish to use it for notification (presumably queued and presented serially to the user), or when tts must fade in over media playback, etc. This suggests that there may be a need for a tts manager daemon and some work relating tts to the existing notification framework.
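A minimal sketch of the queueing behaviour described above, assuming a single manager that owns all speech output. The class `SpeechQueue`, its callbacks `speak` and `set_media_volume`, and the volume levels are hypothetical; a real daemon would plug in an actual tts engine and the audio system in their place.

```python
from collections import deque


class SpeechQueue:
    """Serializes tts requests from many sources and ducks media while speaking."""

    def __init__(self, speak, set_media_volume, ducked=0.2, normal=1.0):
        self._speak = speak                        # callable(text): blocks until spoken
        self._set_media_volume = set_media_volume  # callable(level between 0 and 1)
        self._ducked = ducked
        self._normal = normal
        self._pending = deque()

    def submit(self, source, text):
        """Queue a notification; requests are spoken in arrival order."""
        self._pending.append((source, text))

    def drain(self):
        """Speak everything queued, one item at a time. Returns the count."""
        if not self._pending:
            return 0
        spoken = 0
        self._set_media_volume(self._ducked)  # lower media playback once
        try:
            while self._pending:
                source, text = self._pending.popleft()
                self._speak("%s: %s" % (source, text))
                spoken += 1
        finally:
            self._set_media_volume(self._normal)  # always restore playback
        return spoken
```

For example, two programs calling `submit("mail", "new message")` and `submit("chat", "Ian is online")` would be heard one after the other on the next `drain()`, with media volume lowered once for the whole batch rather than per message.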


  • Initial identification and verification of selected tts/asr packages on Ubuntu.
  • Selection of user interface model, initially proposed as "modal", where the voice ui state (that is, the phrases that will be recognized) is driven by the application running.
  • Identification of what packages need to be changed to support this.
  • Do we need to create a new daemon to manage tts notification?
  • Do we need to create a new daemon to manage asr state and context?

For Karmic: Packaging of pocketsphinx and other voice tools, at minimum in universe. Some patches for Rhythmbox for voice driven ui.

For Karmic+1: Modify notification system (notifyosd) for voice driven notifications.

For Karmic+?: Listening daemon with ability to switch vocabulary sets per active app.


Karmic: Most effort would be related to exploring and testing. Actual work items, such as packaging, would be minor. Examining and implementing changes for Rhythmbox as a model for voice driven ui apps may take one to two weeks.

Karmic+1: This is perhaps a several-week to month-long effort working with notify, and any other apps that might be similarly extended for voice driven ui.

Karmic+?: This implies a new (upstream) project to support the effort of creating a listening daemon with vocabulary context switching, correct integration with PulseAudio sources, etc.

UI Changes

This is a new UI proposal related to voice. This will require some applications to make more and better use of the existing ATK to effect better voice interaction and voice driven dialogs rather than just relying on orca-like desktop readers.

Code Changes

We may at some point make use of Launchpad or a similar resource to identify and define application specific vocabularies. We do not anticipate mass migrating applications to voice driven ui in Karmic, so this is noted as something for potential future work.

For Karmic, our goal is to make sure the base tools required for future development are present and available at minimum in Universe. We may also choose to select a very specific application, and I am proposing Rhythmbox, as a model reference application to modify for true voice interaction and as a guideline for future work.

It is suggested that we may need an option to voice enable, and optionally support purely audio, notifications in the new notification system, notifyosd. If this is done, it will initially be a set of experimental patches, likely not introduced in Karmic but rather proposed for formal introduction in Karmic+1.

It is suggested that we may need a new voice "listening" daemon that maintains vocabulary sets based on the "active" application and that can act as an input source presented through ATK. This is an entirely new package and will be fleshed out with the community.


Test/Demo Plan

Unresolved issues

Seamless audio. Does Ubuntu do more with pure ALSA or migrate to embrace PulseAudio in the future? Can we get all existing core applications that have audio capability (media players, voip clients, etc) to work and transition correctly and seamlessly with whichever choice is made?

BoF agenda and discussion

INTRO Relationship to accessibility - but also looking at possibility of voice becoming the primary interface. We already have most of what's needed (atk). Need to look at it in this context - related to accessibility - rather than as a new...


  • Smart-phones?
  • automobiles?
  • book-readers - might be intuitive to respond with voice also (ie. "go back one paragraph").
  • smart-homes (MythTV, or computer in kitchen used while cooking etc.)


  • Review the available apps - and ensure packages are available
  • Define scope/limit - ie. does it make sense to enable everything via voice.
  • Set up discussion/ml to
    • @ubuntu, @launchpad, what?


  • Mythtv control
  • media player control
  • Ekiga/Skype (ie. "Call Ian")



  • Example of how to control Rhythmbox with voice using Julius with VoxForge speech corpora (speaker-independent, limited dictionary):

  • Long term: Improve voxforge (upstream is open to ideas)
  • Anyone got gnome-voice-control running?
    • The version in Ubuntu is outdated and doesn't work on amd64 (hangs on calibration).
  • voice activation daemon -- activate app input by saying "Computer, ..."
    • Shouldn't be that difficult to have a daemon to which applications can connect (via D-Bus or whatever), but I don't know if Sphinx/Julius can change the words to recognize on the fly -- if not, just fork a small process for each domain when needed.
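The "Computer, ..." activation idea can be illustrated with a small text-level sketch. It assumes the recognizer has already produced a transcript; the function name `parse_activation` and the default wake word are hypothetical, and no Sphinx, Julius, or D-Bus calls are involved.

```python
def parse_activation(utterance, wake_word="computer"):
    """Return the command that follows the wake word, or None if not addressed.

    "Computer, call Ian" yields "call Ian"; speech that does not start with
    the wake word is ignored so ordinary conversation triggers nothing.
    """
    words = utterance.strip().split(None, 1)
    if not words:
        return None
    # Drop punctuation the recognizer may attach to the first word.
    first = words[0].rstrip(",.!?:").lower()
    if first != wake_word:
        return None
    return words[1].strip() if len(words) > 1 else ""
```

A daemon built along these lines would pass the returned command on to whichever application domain is active, and would do nothing at all when the result is `None`.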


  • Packaging pocketsphinx (noodles)
  • Investigate voice creation (sil)
  • Create a voice mailing list (dyfet)
  • look at activation daemon vs. at-spi


Specs/MobileVoiceDrivenUserInterface (last edited 2009-06-15 17:57:03 by dyfet)