Using Wav File Input with Speech Recognition Engines
Overview
This document is intended to help developers of speech recognition (SR) applications use the Microsoft Speech API's speech recognition and audio APIs to connect a wav file to an SR engine. The topics covered include the typical file input scenario, wav file input behavior specific to SAPI, and sample source code.
Typical file input scenario
SR applications use many different audio input configurations, including:
- The shared desktop microphone scenario, which uses the default SR engine and the default audio input. The user selects each in Speech properties in Control Panel, and each is hosted in the shared speech server.
- The telephony scenario, which can use either the SAPI 5 standard multimedia audio input object or a custom audio object, combined with an InProc SR engine.
- The wav file input scenario, which is special because it uses controlled, reproducible audio input and requires a dedicated SR engine that is free from interference by other applications (as can happen with a shared desktop microphone). The file input scenario should use a generic SAPI audio stream connected to the input wav file and an InProc SR engine.
Typical scenarios that use the wav file audio input configuration include transcription of recorded audio, which is discussed later in this document.
Follow these basic steps to perform SR on a wav file:
1. Create and configure a basic SAPI audio stream object for wav file input
2. Create an InProc SR engine using the code samples in this document
3. Set the audio stream object from step 1 as the SR engine's input
4. Activate grammars and begin SR
5. Respond to recognition events until the end of the audio stream is reached
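Relevant wav audio file input APIs for COM/C/C++ developers, as used in the sample code in this document: ISpStream::BindToFile, ISpRecognizer::SetInput, and the SPEI_SR_END_STREAM event.
Relevant wav audio file input APIs for Automation/Visual Basic/Scripting developers, as used in the sample code in this document: SpFileStream.Open, the recognizer's AudioInputStream property, and the ISpeechRecoContext EndStream event.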
Wav audio file input outcome specific to SAPI
Finite-length audio input stream
Microphone input has no predetermined stream length. A wav file, by contrast, is a finite-length audio input stream: its length is known before recognition begins. The applications that use each kind of input also behave differently. Applications that use microphone input typically toggle between listening and not-listening states until the application is closed, whereas transcription applications are typically designed to listen to one continuous audio stream and then close when the stream ends. Consequently, a transcription application must explicitly handle the end-of-stream event (SPEI_SR_END_STREAM for C/C++, the ISpeechRecoContext EndStream event for Automation). A single audio stream can produce multiple recognitions if the speaker pauses or breaks between sections of audio. If the transcription application exits after the first recognition event is received, it will miss any recognizable audio that remains.
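Because the length of a wav file is known up front, an application can work it out before recognition starts, for example to choose an event timeout that comfortably exceeds the stream length. The following is a minimal sketch in portable C++ that does not use SAPI at all: it reads the "fmt " and "data" chunks of a PCM wav header and derives the duration. The names WavInfo, ParseWavHeader, DurationMs and MakePcmWavHeader are hypothetical helpers invented for this illustration, and the sketch assumes a plain uncompressed PCM file whose "fmt " chunk precedes its "data" chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Fields of interest from a PCM wav file's "fmt " and "data" chunks.
// (Hypothetical helper type for this sketch; not part of SAPI.)
struct WavInfo {
    std::uint16_t channels = 0;
    std::uint32_t sampleRate = 0;     // samples per second
    std::uint32_t byteRate = 0;       // bytes of audio per second
    std::uint16_t bitsPerSample = 0;
    std::uint32_t dataBytes = 0;      // size of the "data" chunk in bytes
};

static std::uint16_t ReadLE16(const unsigned char* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

static std::uint32_t ReadLE32(const unsigned char* p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Walks the RIFF chunks at the start of a wav file image and fills `out`.
// Returns false if the bytes are not a RIFF/WAVE image with "fmt " then "data".
bool ParseWavHeader(const std::vector<unsigned char>& bytes, WavInfo& out) {
    if (bytes.size() < 12 ||
        std::memcmp(bytes.data(), "RIFF", 4) != 0 ||
        std::memcmp(bytes.data() + 8, "WAVE", 4) != 0) {
        return false;
    }
    WavInfo info;
    bool haveFmt = false;
    std::size_t pos = 12;  // first chunk follows the 12-byte RIFF header
    while (pos + 8 <= bytes.size()) {
        const unsigned char* chunk = bytes.data() + pos;
        const std::uint32_t size = ReadLE32(chunk + 4);
        if (std::memcmp(chunk, "fmt ", 4) == 0) {
            if (size < 16 || pos + 24 > bytes.size()) return false;
            info.channels = ReadLE16(chunk + 10);
            info.sampleRate = ReadLE32(chunk + 12);
            info.byteRate = ReadLE32(chunk + 16);
            info.bitsPerSample = ReadLE16(chunk + 22);
            haveFmt = true;
        } else if (std::memcmp(chunk, "data", 4) == 0) {
            if (!haveFmt || info.byteRate == 0) return false;
            info.dataBytes = size;  // only the declared size is needed
            out = info;
            return true;
        }
        // Skip to the next chunk; chunk bodies are padded to an even size.
        pos += 8 + static_cast<std::size_t>(size) + (size & 1u);
    }
    return false;
}

// Length of the audio in milliseconds, derived from the header alone.
std::uint64_t DurationMs(const WavInfo& info) {
    assert(info.byteRate != 0);
    return static_cast<std::uint64_t>(info.dataBytes) * 1000u / info.byteRate;
}

// Builds the canonical 44-byte header of a PCM wav file, for demonstration.
std::vector<unsigned char> MakePcmWavHeader(std::uint16_t channels,
                                            std::uint32_t sampleRate,
                                            std::uint16_t bitsPerSample,
                                            std::uint32_t dataBytes) {
    std::vector<unsigned char> h;
    auto tag = [&h](const char* s) { h.insert(h.end(), s, s + 4); };
    auto le16 = [&h](std::uint16_t v) {
        h.push_back(static_cast<unsigned char>(v & 0xFF));
        h.push_back(static_cast<unsigned char>(v >> 8));
    };
    auto le32 = [&h](std::uint32_t v) {
        for (int i = 0; i < 4; ++i)
            h.push_back(static_cast<unsigned char>((v >> (8 * i)) & 0xFF));
    };
    const std::uint16_t blockAlign =
        static_cast<std::uint16_t>(channels * bitsPerSample / 8);
    tag("RIFF"); le32(36 + dataBytes); tag("WAVE");
    tag("fmt "); le32(16); le16(1); le16(channels); le32(sampleRate);
    le32(sampleRate * blockAlign); le16(blockAlign); le16(bitsPerSample);
    tag("data"); le32(dataBytes);
    return h;
}
```

A caller would read the first bytes of the wav file into a vector, pass them to ParseWavHeader, and add a safety margin to DurationMs when choosing a timeout. The duration is only a guide, since the engine is not obliged to consume a file at playback speed.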
Non-real-time audio input
Microphone input and networked audio streams are typically real-time audio objects. This means that the audio object is designed to support audio buffering and dynamic state manipulation (e.g., stop -> play -> pause -> play -> stop) to handle delays and latency in the audio source, the SR engine's processing, or both. A wav file is not a real-time source: the complete audio data is already available when recognition begins.
Sample wav audio file input source code
COM/C++ Developers
C-style code is very similar to the C++/COM code below.
{
HRESULT hr = S_OK; // receives the result of each SAPI call below
CComPtr<ISpStream> cpInputStream;
CComPtr<ISpRecognizer> cpRecognizer;
CComPtr<ISpRecoContext> cpRecoContext;
CComPtr<ISpRecoGrammar> cpRecoGrammar;
// Create basic SAPI stream object
// NOTE: The helper SPBindToFile (sphelper.h) can be used to perform the following operations
hr = cpInputStream.CoCreateInstance(CLSID_SpStream);
// Check hr
CSpStreamFormat sInputFormat;
// generate WaveFormatEx structure, assuming the wav format is 22kHz, 16-bit, Stereo
hr = sInputFormat.AssignFormat(SPSF_22kHz16BitStereo);
// Check hr
// set up the stream object with wav file MY_WAVE_AUDIO_FILENAME
// for read-only access, since it will only be accessed by the SR engine
// NOTE: BindToFile takes a pointer to the format GUID
hr = cpInputStream->BindToFile(MY_WAVE_AUDIO_FILENAME,
    SPFM_OPEN_READONLY,
    &sInputFormat.FormatId(),
    sInputFormat.WaveFormatExPtr(),
    SPFEI_ALL_EVENTS);
// Check hr
// Create in-process speech recognition engine
hr = cpRecognizer.CoCreateInstance(CLSID_SpInprocRecognizer);
// Check hr
// connect wav input to recognizer
// SAPI will negotiate mismatched engine/input audio formats using system audio codecs, so second parameter is not important - use default of TRUE
hr = cpRecognizer->SetInput(cpInputStream, TRUE);
// Check hr
// Create recognition context to receive events
hr = cpRecognizer->CreateRecoContext(&cpRecoContext);
// Check hr
// Create grammar on the recognition context, and load dictation
// use grammar ID 0 for simplicity's sake
// NOTE: Voice command apps would load CFG here
hr = cpRecoContext->CreateGrammar(0, &cpRecoGrammar);
// Check hr
hr = cpRecoGrammar->LoadDictation(NULL, SPLO_STATIC);
// Check hr
// check for recognitions and end of stream event
const ULONGLONG ullInterest = SPFEI(SPEI_RECOGNITION) | SPFEI(SPEI_SR_END_STREAM);
hr = cpRecoContext->SetInterest(ullInterest, ullInterest);
// Check hr
// use Win32 events for command-line style application
hr = cpRecoContext->SetNotifyWin32Event();
// Check hr
// activate dictation, and begin recognition
hr = cpRecoGrammar->SetDictationState(SPRS_ACTIVE);
// Check hr
// while events occur, continue processing
// timeout should be greater than the audio stream length, or a reasonable
// amount of time expected to pass before no more recognitions are expected
// in an audio stream
BOOL fEndStreamReached = FALSE;
while (!fEndStreamReached && S_OK == cpRecoContext->WaitForNotifyEvent(MY_REASONABLE_TIMEOUT))
{
    CSpEvent spEvent;
    // pull all queued events from the reco context's event queue
    while (!fEndStreamReached && S_OK == spEvent.GetFrom(cpRecoContext))
    {
        // Check event type
        switch (spEvent.eEventId)
        {
            // speech recognition engine recognized some audio
            case SPEI_RECOGNITION:
                // TODO: log/report recognized text
                break;
            // end of the wav file was reached by the speech recognition engine
            case SPEI_SR_END_STREAM:
                fEndStreamReached = TRUE;
                break;
        }
        // clear any event data/object references
        spEvent.Clear();
    }// END event pulling loop - break on empty event queue OR end stream
}// END event polling loop - break on event timeout OR end stream
// deactivate dictation
hr = cpRecoGrammar->SetDictationState(SPRS_INACTIVE);
// Check hr
// unload dictation topic
hr = cpRecoGrammar->UnloadDictation();
// Check hr
// close the input stream, since we're done with it
// NOTE: smart pointer will call SpStream's destructor, and consequently ::Close, but code may want to check for errors on ::Close operation
hr = cpInputStream->Close();
// Check hr
}
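The sample above assumes that the wav file is 22kHz, 16-bit, stereo. As a rough aid for checking that assumption, the following sketch maps the PCM parameters found in a wav header to the name of the matching SPSTREAMFORMAT enumerator, relying on the SPSF_<rate>kHz<bits>Bit<Mono|Stereo> naming pattern of the PCM entries. It is written in portable C++ and returns the name as a string so that it does not depend on the SAPI headers. The function name SapiPcmFormatName is hypothetical and is not part of SAPI.

```cpp
#include <cstdint>
#include <string>

// Returns the name of the SPSTREAMFORMAT enumerator that describes the given
// PCM format, or an empty string when no standard PCM entry matches.
// (Hypothetical helper for illustration; not a SAPI function.)
std::string SapiPcmFormatName(std::uint32_t sampleRate,
                              std::uint16_t bitsPerSample,
                              std::uint16_t channels) {
    const char* rate = nullptr;
    switch (sampleRate) {
        case 8000:  rate = "8kHz";  break;
        case 11025: rate = "11kHz"; break;
        case 12000: rate = "12kHz"; break;
        case 16000: rate = "16kHz"; break;
        case 22050: rate = "22kHz"; break;
        case 24000: rate = "24kHz"; break;
        case 32000: rate = "32kHz"; break;
        case 44100: rate = "44kHz"; break;
        case 48000: rate = "48kHz"; break;
        default:    return "";      // not one of the standard PCM rates
    }
    if (bitsPerSample != 8 && bitsPerSample != 16) return "";
    if (channels != 1 && channels != 2) return "";
    return std::string("SPSF_") + rate + std::to_string(bitsPerSample) +
           "Bit" + (channels == 1 ? "Mono" : "Stereo");
}
```

For a 22050 Hz, 16-bit, two-channel file this yields SPSF_22kHz16BitStereo, the value passed to AssignFormat in the sample. A different result suggests that the hard-coded format should be changed to match the file.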
Automation/Visual Basic 6.0 Developers
Scripting code is similar to Visual Basic.
Option Explicit
Dim WithEvents RecoContext As SpInProcRecoContext ' context for receiving SR events
Dim Grammar as ISpeechRecoGrammar ' grammar
Dim InputFile as SpeechLib.SpFileStream ' wav audio input file stream
' Set up InProc reco context and wav audio input file
Private Sub Form_Load()
' Create new recognizer
Dim Recognizer as New SpInprocRecognizer
' create input file stream
Set InputFile = New SpFileStream
' Defaults to open for read-only, and DoEvents false
InputFile.Open MY_WAVE_AUDIO_FILENAME
' connect wav audio input to speech recognition engine
Set Recognizer.AudioInputStream = InputFile
' create recognition context
Set RecoContext = Recognizer.CreateRecoContext
' create grammar
Set Grammar = RecoContext.CreateGrammar
' ... and load dictation
Grammar.DictationLoad
' start dictating
Grammar.DictationSetState SGDSActive
End Sub
' Event fired on app shutdown
Private Sub Form_Unload(Cancel As Integer)
InputFile.Close ' close audio input file
End Sub
' Event fired when speech recognition engine recognizes audio
Private Sub RecoContext_Recognition(ByVal StreamNumber As Long, ByVal StreamPosition As Variant, ByVal RecognitionType As SpeechRecognitionType, ByVal Result As ISpeechRecoResult)
' Log/Report recognized phrase/information
End Sub
' End of wav Input Stream reached by speech recognition engine
Private Sub RecoContext_EndStream(ByVal StreamNumber As Long, ByVal StreamPosition As Variant, ByVal StreamReleased As Boolean)
' Disable dictation and unload grammars on app close
Grammar.DictationSetState SGDSInactive
Grammar.DictationUnload
Unload Me ' shutdown app on end of stream
End Sub