cmusphinx-train

Summary

This specification covers the initial draft, layout and eventual goals of the uVRT.

Release Note

Formerly named CMUSphinx Trainer, the uVRT [Ubuntu Voice Recognition Toolkit] is an application that automates the process of adapting voice models, uploading training results to VoxForge, configuring voice models for speech recognition engines, and calibrating a system to best fit the user's voice recognition needs.

Rationale

Such an application would not only improve voice recognition for the community; it would also make voice recognition more approachable for other applications, provided that they implement or use a compatible speech recognition engine.

User stories

Roseanne is a user who wants to contribute to the open source community. She also wants to see whether she can control her desktop with her voice, much like the voice commands on her phone. After trying a few solutions such as GNOME Voice Control, she realizes that voice recognition on Ubuntu isn't very good compared to Naturally Speaking for Windows and Macs. She searches online for a way to help out and stumbles upon VoxForge. Unfortunately, for its Java applet to work correctly, it needs Sun's Java, not the OpenJDK that usually ships with Ubuntu. So she searches for another way to help out, and she finds VRT. VRT lets her record from her own customized text documents or select from a collection of free documents online. It records her voice and saves it into a special session so she can resume at a later time. She then uploads her progress to VoxForge to share with the community. When she feels that it's time to adapt her local model, she just clicks 'Adapt' and lets it do the work. She then tries GNOME Voice Control once more and finds it working better than ever.

Assumptions

So far, the HTK toolkit is one of the best ways to adapt and enhance voice models for voice recognition engines. Unfortunately, its license is not free, which puts it outside the licensing that Ubuntu typically accepts. Also, such a toolkit is intended for an intermediate Linux developer and not the casual user. Implementing a simple, efficient and effective means of improving voice recognition on Ubuntu would be a benefit all around.

Design

The main idea of VRT is to improve voice recognition on Ubuntu systems and share the training results with the open source community. As with most online services, users will want the ability to "opt out" of the uploading part of training, so uploading is entirely voluntary. The application is designed with accessibility in mind, as are all of SpeechControl's projects. The main UI will make its components readily accessible by enlarging certain controls and adding full translation and ATK support to the application.

Implementation

Effort

Lucid: Most likely a good testing ground for LTS support.

Maverick: Primary development ground and testing base.

Maverick+: Development will be targeted at Maverick for backwards compatibility.

UI Changes

Traditionally, adapting ARPA voice models for PocketSphinx or Julius was done primarily on the command line. VRT aims to remove the need for the terminal, and with it a source of human error during training, so that the text and the user's accent are the main factors contributing to the training process.

Code Changes

This application was originally written in Qt, but a total infrastructural overhaul occurred when Manuela, the team's leader, came and reiterated that it must be accessible and easy to use for all. We ported the application to GTKmm, which has better ATK support than Qt.

Migration

Test/Demo Plan

We'll have 10 individuals from a group of 6 languages, and 5 from English, record approximately 4 hours' worth of audio. After this audio session, they'll be prompted to use the testing component of VRT, which will use either of the trained clients to transcribe the spoken words to the screen. Each test will consist of a 200-word reading session, and the test results will be returned to SpeechControl so we can analyze them and tweak whatever is necessary.
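The plan above does not name a scoring metric for the 200-word reading session. One common choice for this kind of test is word error rate (WER): the word-level edit distance between the text that was read and the text that was recognized, divided by the number of words in the text that was read. The following is a minimal sketch under that assumption; the function name `word_error_rate` and the sample sentences are hypothetical and are not part of VRT.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate of `hypothesis` against `reference`.

    Hypothetical helper, not part of VRT: counts the substitutions,
    deletions and insertions needed to turn the reference words into
    the recognized words, divided by the number of reference words.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty text
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word was dropped
                curr[j - 1] + 1,     # extra word was recognized
                prev[j - 1] + cost,  # words match or were substituted
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of four reference words.
    print(word_error_rate("open the file menu", "open a file menu"))  # 0.25
```

A score of 0.0 means the recognized text matched the reading exactly. Because inserted words count as errors, the value can exceed 1.0 when the recognizer outputs far more words than were read.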

Unresolved issues

Julius tends to prefer the HTK toolkit for voice model adaptation, as it's one of the best voice model adaptation tools around. We might incorporate code from the maker of SHoUT (unless he collaborates with us directly), combine its ASR adaptation methods with Bill Cox's sonic voice utilities, and see if we can create a GNU toolkit that anyone and everyone can use.

BoF agenda and discussion

*** Will be filled in after the Developers' Meeting on 2011.12.02 20:00 UTC.


CategorySpec

SpeechControl/Blueprints/cmusphinx-train (last edited 2011-02-09 09:43:03 by jackyalcine)