
SpeechMaker

Launchpad Entry: speech-recognition
Created: 2007-04-27
Contributors: HenrikOmma

Summary

A dual-purpose desktop client that collects recordings of spoken text for the VoxForge project and feedback on speech synthesis for eSpeak.

Rationale

Both VoxForge and eSpeak need to collect large amounts of data from users, in a range of languages, to improve speech recognition and speech synthesis respectively. Ubuntu has a large user community that is generally happy to help with such tasks, provided the tools are user-friendly.

Use Cases

Jonathan is an eSpeak developer. He creates rudimentary speech models for different languages based on linguistic rules, but needs detailed feedback from native speakers to fine-tune them.
The VoxForge project needs a simple way to collect recorded voice data from a large number of users.
Ingvild uses Ubuntu. She has a cousin with low vision and is happy to find an easy way to contribute to improved text-to-speech in her native Norwegian within the Ubuntu project.

Scope

A desktop client for Ubuntu and a server setup to receive the data.

Design

The user can select the language to work in (defaulting to the desktop language). The application downloads a collection of texts for that language.
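One way the "default to the desktop language" behaviour could work is sketched below. It is a minimal illustration, not a chosen design: the function name default_language and the "en" fallback are assumptions, and it simply reads the usual POSIX locale environment variables.

```python
import os


def default_language(environ=None, fallback="en"):
    """Return a lowercase language code derived from the desktop locale.

    `default_language` and the "en" fallback are hypothetical names and
    values chosen for this sketch; they are not part of the spec.
    """
    env = os.environ if environ is None else environ
    # Check the common locale variables, most specific first.
    for var in ("LANGUAGE", "LC_ALL", "LC_MESSAGES", "LANG"):
        value = env.get(var, "")
        if not value:
            continue
        # LANGUAGE may hold a colon-separated priority list, e.g. "nb:en".
        first = value.split(":")[0]
        # Strip the encoding (".UTF-8"), modifier ("@euro") and territory ("_NO").
        code = first.split(".")[0].split("@")[0].split("_")[0].lower()
        if code and code not in ("c", "posix"):
            return code
    return fallback
```

The user's explicit choice in the language selector would override this value; the sketch only covers the initial default.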

In speech recording mode, the text is displayed 3-4 lines at a time. The user reads the text into her microphone and clicks Save when done. The speech is stored in an audio file corresponding to the text excerpt, either as high-fidelity Ogg or using a lossless audio codec such as FLAC.
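Splitting a downloaded text into prompts of a few lines each could look roughly like the sketch below. The function name make_prompts, the 60-character line width and the default of 4 lines per prompt are assumptions made for illustration.

```python
import textwrap


def make_prompts(text, width=60, lines_per_prompt=4):
    """Split `text` into prompts of at most `lines_per_prompt` display lines.

    The width and line count are illustrative defaults, not values
    taken from the spec.
    """
    # Wrap the text to the display width, then group the wrapped lines.
    lines = textwrap.wrap(text, width=width)
    return [
        "\n".join(lines[i:i + lines_per_prompt])
        for i in range(0, len(lines), lines_per_prompt)
    ]
```

Each returned prompt would then be paired with exactly one saved audio file, so that the recording and its transcript stay aligned.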

The user should be able to select the sampling rate (16 kHz, 32 kHz, 48 kHz, ...) and the sample format (8-bit, 16-bit or 32-bit float), depending on what their audio hardware supports. To make things as easy as possible for the user, the application should query the user's audio hardware (audio card, audio components on the motherboard, USB pod/microphone, ...) to determine which sampling rates and sample formats it supports.
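The hardware query could be structured as a loop over candidate settings with a pluggable probe, as in the sketch below. All names are hypothetical. The is_supported callable stands in for whatever the audio backend offers; if PortAudio were used, it could wrap a call such as Pa_IsFormatSupported, but that binding is not shown here.

```python
from itertools import product

# Candidate values taken from the design text; the format labels are
# hypothetical names for 8-bit, 16-bit and 32-bit float samples.
CANDIDATE_RATES = (16000, 32000, 48000)
CANDIDATE_FORMATS = ("int8", "int16", "float32")


def supported_settings(is_supported, rates=CANDIDATE_RATES,
                       formats=CANDIDATE_FORMATS):
    """Return every (rate, format) pair the probe reports as usable.

    `is_supported(rate, fmt)` is an assumed callable supplied by the
    audio backend; it must return True or False.
    """
    return [(r, f) for r, f in product(rates, formats) if is_supported(r, f)]


def best_setting(settings):
    """Pick the highest rate, then the widest format, or None if empty."""
    if not settings:
        return None
    return max(settings, key=lambda s: (s[0], CANDIDATE_FORMATS.index(s[1])))
```

The list from supported_settings could populate the selection widgets, with best_setting providing the preselected default.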

The application should have a waveform display to give users feedback on their recording. Users should be able to replay and re-record easily.
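A waveform display usually has far fewer pixel columns than the recording has samples, so the samples need to be reduced before drawing. The sketch below shows one common approach, a min/max envelope per column. The function name peak_envelope is an assumption, and drawing itself is left to the GUI toolkit.

```python
def peak_envelope(samples, columns):
    """Reduce `samples` to one (min, max) pair per display column.

    `samples` is a sequence of numbers (one audio channel) and
    `columns` is the width of the display in pixels.
    """
    if columns <= 0:
        raise ValueError("columns must be positive")
    n = len(samples)
    if n == 0:
        return []
    # Never use more columns than there are samples.
    cols = min(columns, n)
    envelope = []
    for col in range(cols):
        # Integer arithmetic spreads the samples evenly over the columns.
        start = col * n // cols
        end = (col + 1) * n // cols
        chunk = samples[start:end]
        envelope.append((min(chunk), max(chunk)))
    return envelope
```

Each pair can be drawn as a vertical line from the minimum to the maximum value, which keeps clipping and silence visible even for long recordings.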

Users should be able to upload audio to an online Speech Corpus repository (such as VoxForge, or other ...) with one click, once the client has been properly configured with URL and login credentials. Users should be required to confirm that they understand that their submission will be made under the GPL. They should be given the option to assign their copyright to the Free Software Foundation if they prefer to submit their speech audio anonymously.
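The licensing requirement can be enforced in the client before anything is sent. The sketch below builds the metadata for one submission and refuses to proceed without confirmation. The function name, the field names and the dictionary layout are all hypothetical; the actual upload protocol and server format are not defined in this spec.

```python
def build_submission(audio_path, prompt_text, language, license_confirmed,
                     assign_to_fsf=False, username=None):
    """Return metadata for one upload, or raise if the licence is unconfirmed.

    All field names are assumptions made for this sketch.
    """
    if not license_confirmed:
        raise ValueError("the user must confirm submission under the GPL")
    if assign_to_fsf:
        # Anonymous submission: copyright goes to the FSF, no contributor name.
        contributor = None
        copyright_holder = "Free Software Foundation"
    else:
        if not username:
            raise ValueError("a username is required unless copyright is assigned")
        contributor = username
        copyright_holder = username
    return {
        "audio": audio_path,
        "prompt": prompt_text,
        "language": language,
        "license": "GPL",
        "contributor": contributor,
        "copyright_holder": copyright_holder,
    }
```

The one-click upload would then send this metadata together with the audio file to the configured repository URL.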

It should be possible to reuse elements of SpeechMaker for the speech recognition GUI.

Implementation

PyGTK, Mono?
PortAudio? - portable cross-platform Audio API: http://www.portaudio.com/

Outstanding Issues

BoF agenda and discussion

VoxForge has created an initial version of a speech submission application (it is a modified version of the MoodleSpeex Java applet). It allows users to read prompts, record their speech, and click one button to upload the audio to VoxForge.

CategorySpec

SpeechRecognition/SpeechMaker (last edited 2008-08-06 16:23:02 by localhost)
