TTS Events Explanation
Events are structures that pass information from the TTS engine back to the application. As audio data is output, SAPI fires the corresponding events, so applications can react to audio output as it occurs. Examples include animating a face as viseme events are received, or highlighting text as it is spoken. See the sample application, TTSApp, for an example of each.
Applications call ISpEventSource::SetInterest to tell SAPI which types of events they want to receive. Applications can also call this method through ISpVoice, because ISpVoice inherits from ISpEventSource. Applications can then call ISpEventSource::GetEvents to retrieve fired events from SAPI.
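To make the idea of an event interest more concrete, the sketch below models it as a 64-bit mask with one bit per event ordinal. It is written in portable C++ so that it compiles without the SAPI headers. The names `InterestBit`, `WordAndVisemeInterest`, `WantsEvent`, and `kFlagCheck` are hypothetical. In a real application the `SPFEI` macro from the SAPI headers builds these masks, and the flag-check bits used here (ordinals 30 and 33) are an assumption that should be verified against the sapi.h in your SDK.

```cpp
#include <cstdint>

// Event ordinals copied from the SPEVENTENUM subset in this document.
enum TtsEventOrdinal : unsigned {
    kWordBoundary = 5,
    kPhoneme = 6,
    kViseme = 8
};

// Assumption: mirrors the flag-check bits that the SPFEI macro adds
// (reserved ordinals 30 and 33). Verify against your SDK's sapi.h.
constexpr std::uint64_t kFlagCheck = (1ull << 30) | (1ull << 33);

// Hypothetical stand-in for SPFEI: one bit per event ordinal,
// combined with the flag-check bits.
constexpr std::uint64_t InterestBit(unsigned ordinal) {
    return (1ull << ordinal) | kFlagCheck;
}

// Example mask for an application that highlights words and animates a face.
constexpr std::uint64_t WordAndVisemeInterest() {
    return InterestBit(kWordBoundary) | InterestBit(kViseme);
}

// True if the mask requests the given event ordinal.
constexpr bool WantsEvent(std::uint64_t mask, unsigned ordinal) {
    return (mask & (1ull << ordinal)) != 0;
}
```

With the real headers, a mask like this would typically be passed as both arguments of SetInterest (the event interest and the queued interest), after which GetEvents can be called to drain the queue.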
The following is a set of event types generated by TTS engines (this is a subset of the SPEVENTENUM enumeration):
typedef enum SPEVENTENUM
{
    //--- TTS engine
    SPEI_START_INPUT_STREAM = 1,
    SPEI_END_INPUT_STREAM = 2,
    SPEI_VOICE_CHANGE = 3,      // LPARAM_IS_TOKEN
    SPEI_TTS_BOOKMARK = 4,      // LPARAM_IS_STRING
    SPEI_WORD_BOUNDARY = 5,
    SPEI_PHONEME = 6,
    SPEI_SENTENCE_BOUNDARY = 7,
    SPEI_VISEME = 8,
    SPEI_TTS_AUDIO_LEVEL = 9
} SPEVENTENUM;
The SPEVENT structure contains varying information depending on which of these event types it represents.
typedef struct SPEVENT
{
    WORD eEventId;
    WORD elParamType;
    ULONG ulStreamNum;
    ULONGLONG ullAudioStreamOffset;
    WPARAM wParam;
    LPARAM lParam;
} SPEVENT;
The sections below describe how to interpret the fields of the SPEVENT structure for each event type. For all event types, ulStreamNum is the stream number returned by ISpVoice::Speak or ISpVoice::SpeakStream.
The SPEI_START_INPUT_STREAM event indicates that the output object has begun receiving output for a specific stream number. The rest of the fields are not of interest to this event type.
The SPEI_END_INPUT_STREAM event indicates that the output object has finished receiving output for a specific stream number. The rest of the fields are not of interest to this event type.
The SPEI_VOICE_CHANGE event indicates that the voice responsible for speaking the input text (or stream) has changed because of a <Voice> XML tag. It is also fired at the beginning of each Speak call. For more information on using object tokens, see the Object Tokens and Registry Settings white paper.
| SPEVENT Field | Voice Change event |
| eEventId | SPEI_VOICE_CHANGE |
| elParamType | SPET_LPARAM_IS_TOKEN |
| wParam | |
| lParam | Object token of the new voice. |
The SPEI_TTS_BOOKMARK event indicates that the speak stream has reached a bookmark. Bookmarks can be inserted into the input text using the <Bookmark> XML tag.
| SPEVENT Field | Bookmark event |
| eEventId | SPEI_TTS_BOOKMARK |
| elParamType | SPET_LPARAM_IS_STRING |
| wParam | Value of the bookmark string when converted to a long (_wtol(...) can be used). |
| lParam | Null-terminated copy of the bookmark string. |
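The relationship between the bookmark string and its numeric value can be sketched as follows. This is a minimal, portable illustration that uses the standard `std::wcstol` in place of the Microsoft-specific `_wtol`; the helper name `BookmarkValue` is hypothetical and is not part of SAPI.

```cpp
#include <cwchar>

// Hypothetical helper: converts a bookmark string to a long.
// Like _wtol, this yields 0 when the string does not start with a number.
long BookmarkValue(const wchar_t* bookmark) {
    if (bookmark == nullptr) {
        return 0;  // no string available, treat as zero
    }
    return std::wcstol(bookmark, nullptr, 10);
}
```

A numeric bookmark such as "42" converts to 42, while a purely textual bookmark such as "intro" converts to 0, so applications that use named bookmarks should compare the string itself instead of the converted value.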
The SPEI_WORD_BOUNDARY event indicates that the speak stream has reached the beginning of a word.
| SPEVENT Field | Word Boundary event |
| eEventId | SPEI_WORD_BOUNDARY |
| elParamType | SPET_LPARAM_IS_UNDEFINED |
| wParam | Character offset at the beginning of the word being synthesized. |
| lParam | Character length of the word in the current input stream being synthesized. |
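As a rough sketch of how an application might highlight text as it is spoken, the function below takes the character offset and length reported by a word boundary event and marks that span in the input text. The helper name `HighlightWord` and the use of square brackets as the highlight are assumptions made for illustration; a real application would typically set a selection in a text control instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: returns a copy of the text with the word at
// [offset, offset + length) wrapped in square brackets.
std::wstring HighlightWord(const std::wstring& text,
                           std::size_t offset,
                           std::size_t length) {
    if (offset > text.size()) {
        return text;  // offset is out of range, leave the text unchanged
    }
    // Clamp the length so the span never runs past the end of the text.
    length = std::min(length, text.size() - offset);
    return text.substr(0, offset) + L"[" + text.substr(offset, length) +
           L"]" + text.substr(offset + length);
}
```

The range checks matter because the offset and length refer to the original input stream, which may differ from the text the application is displaying.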
The SPEI_SENTENCE_BOUNDARY event indicates that the speak stream has reached the beginning of a sentence.
| SPEVENT Field | Sentence Boundary event |
| eEventId | SPEI_SENTENCE_BOUNDARY |
| elParamType | SPET_LPARAM_IS_UNDEFINED |
| wParam | Character offset at the beginning of the sentence being synthesized. |
| lParam | Character length of the sentence in the current input stream being synthesized. |
The SPEI_PHONEME event indicates that the speak stream has reached a phoneme.
| SPEVENT Field | Phoneme event |
| eEventId | SPEI_PHONEME |
| elParamType | SPET_LPARAM_IS_UNDEFINED |
| wParam | The high word is the duration, in milliseconds, of the current phoneme. The low word is the PhoneID of the next phoneme. |
| lParam | The low word is the PhoneID of the current phoneme. The high word is the SPVFEATURE value associated with the current phoneme. |
Separate references list the SAPI 5 phoneme sets for American English, Chinese, and Japanese.
SPVFEATURE contains two flags: SPVFEATURE_STRESSED and SPVFEATURE_EMPHASIS. SPVFEATURE_STRESSED means that the phoneme is stressed relative to the other phonemes of a word (stress is usually associated with the vowel of a stressed syllable). SPVFEATURE_EMPHASIS means that the phoneme is part of an emphasized word. That is, stress is a syllabic phenomenon within a word, and emphasis is a word-level phenomenon within a sentence.
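The packing of the phoneme event parameters described above can be unpacked as in the following sketch. It uses plain 64-bit integers in place of WPARAM and LPARAM so that it compiles without the Windows headers. The struct `PhonemeInfo`, the helpers `LowWord`, `HighWord`, and `DecodePhoneme`, and the constants `kFeatureStressed` and `kFeatureEmphasis` are hypothetical names; the flag values (bit 0 and bit 1) are assumed to match SPVFEATURE_STRESSED and SPVFEATURE_EMPHASIS and should be checked against the SAPI headers.

```cpp
#include <cstdint>

// Decoded view of a phoneme event (hypothetical type, not part of SAPI).
struct PhonemeInfo {
    std::uint16_t durationMs;      // high word of wParam
    std::uint16_t nextPhoneId;     // low word of wParam
    std::uint16_t currentPhoneId;  // low word of lParam
    bool stressed;                 // from the high word of lParam
    bool emphasized;               // from the high word of lParam
};

// Assumption: these mirror SPVFEATURE_STRESSED and SPVFEATURE_EMPHASIS.
constexpr std::uint16_t kFeatureStressed = 1u << 0;
constexpr std::uint16_t kFeatureEmphasis = 1u << 1;

constexpr std::uint16_t LowWord(std::uint64_t value) {
    return static_cast<std::uint16_t>(value & 0xFFFF);
}

constexpr std::uint16_t HighWord(std::uint64_t value) {
    return static_cast<std::uint16_t>((value >> 16) & 0xFFFF);
}

// Splits the two event parameters into their documented parts.
PhonemeInfo DecodePhoneme(std::uint64_t wParam, std::uint64_t lParam) {
    PhonemeInfo info{};
    info.durationMs = HighWord(wParam);
    info.nextPhoneId = LowWord(wParam);
    info.currentPhoneId = LowWord(lParam);
    const std::uint16_t features = HighWord(lParam);
    info.stressed = (features & kFeatureStressed) != 0;
    info.emphasized = (features & kFeatureEmphasis) != 0;
    return info;
}
```

In Windows code, the LOWORD and HIWORD macros perform the same word extraction directly on wParam and lParam.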
The SPEI_VISEME event indicates that the speak stream has reached a viseme.
| SPEVENT Field | Viseme event |
| eEventId | SPEI_VISEME |
| elParamType | SPET_LPARAM_IS_UNDEFINED |
| wParam | The high word is the duration, in milliseconds, of the current viseme. The low word is the code for the next viseme. |
| lParam | The low word is the code of the current viseme. The high word is the SPVFEATURE value associated with the current viseme (and phoneme). |
See SPVISEMES for a listing of the SAPI 5 viseme set.
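For face animation, the viseme durations can be turned into a simple keyframe timeline, as in the sketch below. It assumes that visemes follow one another without gaps, so each keyframe starts where the previous one ends; an application that needs exact synchronization would more likely rely on ullAudioStreamOffset. The types `RawVisemeEvent` and `VisemeKeyframe` and the function `BuildVisemeTimeline` are hypothetical and stand in for the real SPEVENT structure.

```cpp
#include <cstdint>
#include <vector>

// Minimal stand-in for the two parameters of a viseme event.
struct RawVisemeEvent {
    std::uint64_t wParam;  // high word: duration in ms, low word: next viseme
    std::uint64_t lParam;  // low word: current viseme, high word: features
};

// One mouth shape and the time span during which it is shown.
struct VisemeKeyframe {
    std::uint32_t startMs;
    std::uint16_t visemeCode;
    std::uint16_t durationMs;
};

// Assumption: visemes are contiguous, so start times are running sums
// of the preceding durations.
std::vector<VisemeKeyframe> BuildVisemeTimeline(
        const std::vector<RawVisemeEvent>& events) {
    std::vector<VisemeKeyframe> timeline;
    std::uint32_t clockMs = 0;
    for (const RawVisemeEvent& ev : events) {
        const auto duration =
            static_cast<std::uint16_t>((ev.wParam >> 16) & 0xFFFF);
        const auto code = static_cast<std::uint16_t>(ev.lParam & 0xFFFF);
        timeline.push_back({clockMs, code, duration});
        clockMs += duration;
    }
    return timeline;
}
```

The low word of wParam (the next viseme) is not used here, but an animation system could use it to start blending toward the upcoming mouth shape before the current one ends.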
The SPEI_TTS_AUDIO_LEVEL event reports the audio level of the synthesized output at a given point in the stream.
| SPEVENT Field | Audio Level event |
| eEventId | SPEI_TTS_AUDIO_LEVEL |
| elParamType | SPET_LPARAM_IS_UNDEFINED |
| wParam | TTS audio level (ULONG). |
| lParam | NULL |
For an example of how to use TTS events in an application, see the Text-to-Speech Tutorial.