##title Speech Control - Blueprints
##master-page:HomepageTemplate

||{{../attachment:gplv3-127x51.png}}||

 * '''Launchpad Entry''': [[https://www.launchpad.net/cmusphinx-train]]
 * '''Created''': Monday, February 07 2011, 03:22:40 AM ETC
 * '''Contributors''': Jacky Alcine
 * '''Packages affected''': N/A

== Summary ==
The initial draft, layout, and eventual goals of the uVRT.

== Release Note ==
Formerly named CMUSphinx Trainer, the uVRT (Ubuntu Voice Recognition Toolkit) is an application that automates the process of adapting voice models, uploading training results to VoxForge, configuring voice models for speech recognition engines, and calibrating a system to best fit the user's voice recognition needs.

== Rationale ==
Such an application would not only improve voice recognition for the community; it would also make voice recognition more approachable for other applications, provided they implement or rely on a compatible speech recognition engine.

== User stories ==
Roseanne is a user who wants to contribute to the open source community. She also wants to see whether she can control her desktop with her voice, like the voice commands on her phone. After trying a few solutions such as GNOME Voice Control, she realizes that voice recognition on Ubuntu isn't great compared to NaturallySpeaking on Windows and Mac. She searches online for a way to help out and stumbles upon [[http://www.voxforge.org|VoxForge]]. Unfortunately, for the Java applet to work correctly, it needs Sun's Java, not the OpenJDK runtime that usually ships with Ubuntu. So she searches for another way to help out. She finds VRT. VRT lets her record text documents she has customized herself, or select from a collection of free documents online. It records her voice and saves it into a session so she can resume it at a later time. She then uploads her progress to VoxForge to share with the community.
After she feels it's time to adapt her local model, she simply clicks 'Adapt' and lets the application do the work. She then tries GNOME Voice Control once more and finds it working better than ever.

== Assumptions ==
So far, the HTK toolkit is one of the best ways to adapt and enhance voice models for voice recognition engines. Unfortunately, its license is not free, which places it outside Ubuntu's typical licensing umbrella. Such a toolkit is also aimed at intermediate Linux developers rather than casual users. Implementing a simple, efficient, and effective means of improving voice recognition on Ubuntu would be a benefit all around.

== Design ==
The main idea of VRT is to improve voice recognition on Ubuntu systems and redistribute the training results to the open source community. As with most online services, users will want the ability to "opt out" of the uploading part of training; such a feature is entirely voluntary. The application is designed with accessibility in mind, as with all of SpeechControl's projects. The main UI will make its components readily accessible, implemented by enlarging certain controls and adding full translation and ATK support to the application.

== Implementation ==
 * Search for possible alternatives:
  * SHoUT (http://shout-toolkit.sourceforge.net/user_manual.html)
 * Identify voice recognition engines packaged with Ubuntu (to provide immediate adaptation support):
  * CMUSphinx
  * Julius
 * Identify adaptation methods for the specified engines, in respective order:
  * http://cmusphinx.sourceforge.net/wiki/tutorialadapt#creating_an_adaptation_corpus
  * UPDATE: Akinobu Lee, one of the developers of Julius, informed me (JackyAlcine) of a method of adapting models for Julius using ARPA.

== Effort ==
 * Lucid: most likely a good testing ground for LTS support.
 * Maverick: primary development ground and testing base.
 * Maverick+: development will target Maverick for backwards compatibility.

=== UI Changes ===
Traditionally, adapting ARPA voice models for PocketSphinx or Julius was done primarily on the command line. VRT aims to remove the need for the terminal, reducing human error during training so that only the text and the user's accent remain as the main contributing factors in the training process.

=== Code Changes ===
This application was originally written in Qt, but a total infrastructural overhaul occurred when [[hajour|Manuela]], the team's leader, arrived and insisted that it '''must''' be accessible and easy to use for all. We ported the application to GTKmm, which has better ATK support than Qt.

=== Migration ===

== Test/Demo Plan ==
We'll have 10 individuals for each of 6 language groups, plus 5 for English, record approximately 4 hours' worth of audio. After this recording session, they'll be prompted to use the testing component of VRT, which will use either of the trained clients to transcribe the spoken words to the screen. A testing session will then occur with a 200-word reading, and the test results will be returned to SpeechControl so we can analyze and tweak whatever is necessary.

== Unresolved issues ==
Julius tends to prefer the HTK toolkit for voice model adaptation, as it's one of the best voice model adaptation tools around. We might incorporate code from the maker of SHoUT (if he does not collaborate with us directly) and combine those ASR adaptation methods with Bill Cox's sonic audio utilities, to see if we can create a GNU toolkit that anyone and everyone can use.

== BoF agenda and discussion ==
 * Will be filled in after the Developers' Meeting on 2011.12.02 20:00 UTC.

----
CategorySpec
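As a supplement to the adaptation-corpus step referenced in the Implementation section: the CMUSphinx adaptation tutorial linked there expects a `.transcription` file and a matching `.fileids` file built from the prompt sentences the user records. The helper below is a hypothetical sketch (not part of VRT's codebase); the function name and arguments are assumptions, but the two file layouts follow the tutorial's format.

```python
import os

def write_adaptation_corpus(sentences, basename="corpus"):
    """Write <basename>.transcription and <basename>.fileids for the given
    prompt sentences, in the layout the CMUSphinx adaptation tools expect.
    Returns the two file paths."""
    prefix = os.path.basename(basename)  # utterance ids should not include directories
    trans_path = basename + ".transcription"
    ids_path = basename + ".fileids"
    with open(trans_path, "w") as trans, open(ids_path, "w") as ids:
        for i, sentence in enumerate(sentences, start=1):
            utt_id = "%s_%04d" % (prefix, i)
            # transcription line, e.g.:  <s> hello world </s> (corpus_0001)
            trans.write("<s> %s </s> (%s)\n" % (sentence.lower(), utt_id))
            # fileids line holds just the utterance id, one per line
            ids.write(utt_id + "\n")
    return trans_path, ids_path
```

Each utterance id must match the name of the corresponding recorded WAV file, which is why the session that records the user's voice and the corpus writer need to agree on the numbering scheme.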