How Speech Recognition Works (Microsoft.Speech)

Microsoft Speech Platform SDK 11

A speech recognition engine (or speech recognizer) takes an audio stream as input and turns it into a text transcription. The speech recognition process can be thought of as having a front end and a back end.

Convert Audio Input

The front end processes the audio stream, isolating segments of sound that are probably speech and converting them into a series of numeric values that characterize the vocal sounds in the signal.
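To give a concrete sense of the kind of processing a front end performs, the short Python sketch below slices a signal into short, overlapping frames and computes a handful of log-energy values for each frame. It is a generic illustration only, not the Speech Platform's actual signal processing; the sampling rate, frame length, overlap, and number of bands are assumptions chosen for the example.

import numpy as np

# A generic sketch of front-end feature extraction (illustrative only; not
# the Speech Platform's processing). Frame and band sizes are assumptions.
def extract_features(samples, sample_rate=16000, frame_ms=25, step_ms=10):
    """Slice audio into overlapping frames and compute log band energies."""
    frame_len = int(sample_rate * frame_ms / 1000)    # samples per frame
    step = int(sample_rate * step_ms / 1000)          # samples between frames
    features = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frame = samples[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame))         # magnitude spectrum
        # Pool the spectrum into a few bands and take log energies, giving a
        # short vector of numbers that characterizes the sound in this frame.
        bands = np.array_split(spectrum, 12)
        features.append([np.log(np.sum(b ** 2) + 1e-10) for b in bands])
    return np.array(features)

# Half a second of silence followed by half a second of a 440 Hz tone.
t = np.linspace(0, 0.5, 8000, endpoint=False)
signal = np.concatenate([np.zeros(8000), 0.5 * np.sin(2 * np.pi * 440 * t)])
print(extract_features(signal).shape)                 # (frames, 12)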

Match Input to Speech Models

The back end is a specialized search engine that takes the output produced by the front end and searches across three databases: an acoustic model, a lexicon, and a language model.

  • The acoustic model represents the acoustic sounds of a language, and can be trained to recognize the characteristics of a particular user's speech patterns and acoustic environments.

  • The lexicon lists a large number of the words in the language, and provides information on how to pronounce each word.

  • The language model represents the ways in which the words of a language are combined.

For any given segment of sound, there are many things the speaker could potentially be saying. The quality of a recognizer is determined by how good it is at refining its search, eliminating the poor matches, and selecting the more likely matches. This depends in large part on the quality of its language and acoustic models and the effectiveness of its algorithms, both for processing sound and for searching across the models.
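A deliberately tiny example can make the interplay of the three databases concrete. The Python sketch below scores a few candidate word sequences by combining made-up per-phone acoustic scores, a toy lexicon, and a toy bigram language model, then keeps the highest-scoring candidate. Every value and the scoring scheme itself are assumptions invented for this illustration; a real recognizer searches vastly larger models with far more sophisticated algorithms.

import math

# Lexicon stand-in: how each word is pronounced, as a sequence of phones.
lexicon = {
    "recognize": ["r", "eh", "k", "ah", "g", "n", "ay", "z"],
    "wreck":     ["r", "eh", "k"],
    "a":         ["ah"],
    "nice":      ["n", "ay", "s"],
    "speech":    ["s", "p", "iy", "ch"],
    "beach":     ["b", "iy", "ch"],
}

# Acoustic model stand-in: how well each phone matches the audio (invented).
acoustic = {"r": 0.9, "eh": 0.8, "k": 0.85, "ah": 0.7, "g": 0.75,
            "n": 0.8, "ay": 0.8, "z": 0.6, "s": 0.9, "p": 0.85,
            "iy": 0.8, "ch": 0.9, "b": 0.3}

# Language model stand-in: probability of a word given the previous word.
bigram = {("<s>", "recognize"): 0.4, ("recognize", "speech"): 0.5,
          ("<s>", "wreck"): 0.1, ("wreck", "a"): 0.3, ("a", "nice"): 0.4,
          ("nice", "speech"): 0.2, ("nice", "beach"): 0.3}

def score(words):
    """Log-probability of a candidate word sequence under all three models."""
    total = 0.0
    prev = "<s>"
    for word in words:
        total += math.log(bigram.get((prev, word), 1e-4))   # language model
        for phone in lexicon[word]:                          # lexicon
            total += math.log(acoustic.get(phone, 1e-4))     # acoustic model
        prev = word
    return total

candidates = [["recognize", "speech"], ["wreck", "a", "nice", "speech"],
              ["wreck", "a", "nice", "beach"]]
best = max(candidates, key=score)
print(best, score(best))    # the highest-scoring candidate wins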

Grammars

While the built-in language model of a recognizer is intended to represent a comprehensive language domain (such as everyday spoken English), a speech application often needs to process only certain utterances that have particular semantic meaning to that application. Rather than using the general-purpose language model, an application should use a grammar that constrains the recognizer to listen only for speech that is meaningful to the application. This provides the following benefits:

  • Increases the accuracy of recognition

  • Guarantees that all recognition results are meaningful to the application

  • Enables the recognition engine to specify the semantic values inherent in the recognized text

The Microsoft Speech Platform SDK 11 lets you author grammars programmatically, and also supports grammars authored in the industry-standard Speech Recognition Grammar Specification (SRGS) markup language. See Create Grammars (Microsoft.Speech).
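To illustrate what a grammar contributes, independent of how it is authored, the Python sketch below limits recognition to a handful of application phrases and attaches a semantic value to each one. The phrases, semantic values, and fuzzy matching are assumptions invented for the example; in a real application the grammar is authored with the SDK or in SRGS markup, and the constraint is enforced by the recognition engine itself.

import difflib

# Illustrative only: a hand-rolled stand-in for a grammar, not the SDK's
# grammar API. Phrases, semantics, and matching are invented for the example.
grammar = {
    "turn the lights on":  {"device": "lights", "state": "on"},
    "turn the lights off": {"device": "lights", "state": "off"},
    "what time is it":     {"query": "time"},
}

def recognize_with_grammar(hypothesis):
    """Map a recognizer hypothesis onto the closest in-grammar phrase."""
    match = difflib.get_close_matches(hypothesis, list(grammar), n=1, cutoff=0.6)
    if not match:
        return None                        # out-of-grammar speech is rejected
    phrase = match[0]
    return phrase, grammar[phrase]         # recognized text plus its semantics

# A slightly noisy hypothesis still snaps to a meaningful, in-grammar result.
print(recognize_with_grammar("turn the light on"))
# An utterance outside the grammar is simply not recognized.
print(recognize_with_grammar("order a pizza"))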
