SpVoice Interface (Microsoft Speech Platform)

Microsoft Speech Platform SDK 11

SpVoice

The SpVoice object brings the text-to-speech (TTS) engine capabilities to applications using SAPI automation. An application can create numerous SpVoice objects, each independent of and capable of interacting with the others. An SpVoice object, usually referred to simply as a voice, is created with default property settings so that it is ready to speak immediately.

Voice Characteristics and UI Support

The fundamental characteristics of a voice are its Voice property, which can be thought of as the persona that speaks, its Rate property, and its Volume property. "Microsoft Mary" and "Microsoft Mike" are examples of Voices. Use the GetVoices method to determine which voices are available to the voice object.
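As an illustration of these properties, the following Python sketch lists the available voices and selects one by name. It assumes `voice` is an SpVoice automation object obtained elsewhere (for instance through a COM bridge; that choice is an assumption and not part of this page). The helper names `list_voice_names` and `select_voice` are hypothetical.

```python
def list_voice_names(voice):
    """Return the description of every voice token available to `voice`."""
    tokens = voice.GetVoices()  # collection of voice tokens
    return [tokens.Item(i).GetDescription() for i in range(tokens.Count)]


def select_voice(voice, name_fragment, rate=0, volume=100):
    """Select the first voice whose description contains `name_fragment`.

    Also applies the given Rate and Volume. Returns True if a matching
    voice was found, False otherwise (the voice is then left unchanged).
    """
    tokens = voice.GetVoices()
    for i in range(tokens.Count):
        token = tokens.Item(i)
        if name_fragment.lower() in token.GetDescription().lower():
            voice.Voice = token    # the persona that speaks
            voice.Rate = rate      # speaking rate
            voice.Volume = volume  # base loudness
            return True
    return False
```

A call such as `select_voice(voice, "Mike", rate=2, volume=80)` would switch to a voice whose description contains "Mike", if one is installed.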

These properties can be modified with a User Interface (UI). The IsUISupported method determines if a specific UI is supported. Use the DisplayUI method to display a supported UI. The TTS tab of Speech properties in Control Panel, which enables users to modify the characteristics of the default system voice, is an example of a voice UI.

Speaking and Queueing

The Speak method places a text stream in the TTS engine's input queue and returns a stream number. It can be called synchronously or asynchronously. When called synchronously, the Speak method does not return until the text has been spoken; when called asynchronously, it returns immediately, and the voice speaks as a background process.

When synchronous speech is used in an application, the application's execution is blocked while the voice speaks, and the user is effectively locked out. This may be acceptable for simple applications, or those with no graphical user interface (GUI), but when sophisticated user interaction is intended, asynchronous speaking will generally be more appropriate.

Asynchronous speaking can place numerous text streams into the input queue. These streams are also referred to as speech requests. The stream number returned by an asynchronous Speak call is the stream's index in the voice queue. The WaitUntilDone method blocks execution until the voice finishes speaking, enabling an application to speak a text stream asynchronously and determine when it finishes. The hidden SpeakCompleteEvent method is similar to WaitUntilDone, except that it returns an event handle for the background speaking process, and does not block application execution.
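The queueing pattern described above can be sketched in Python as follows. The sketch assumes `voice` is an SpVoice automation object obtained elsewhere. The flag value comes from the SpeechVoiceSpeakFlags enumeration, which this page does not list, so it should be checked against the SDK. The helper name `speak_all` is hypothetical.

```python
# Assumed value from the SpeechVoiceSpeakFlags enumeration (not listed on this page).
SVSF_ASYNC = 1  # SVSFlagsAsync: Speak returns immediately


def speak_all(voice, texts, timeout_ms=-1):
    """Queue every text asynchronously, then wait for the voice to finish.

    Returns (stream_numbers, finished). `finished` is False if the timeout
    elapsed first. A timeout of -1 is assumed to mean "wait indefinitely".
    """
    # Each asynchronous Speak call returns the stream number of its request.
    stream_numbers = [voice.Speak(text, SVSF_ASYNC) for text in texts]
    # Block only once, after all requests are in the queue.
    finished = bool(voice.WaitUntilDone(timeout_ms))
    return stream_numbers, finished
```

Queueing everything first and waiting once keeps the application free to do other work between the Speak calls and the final wait.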

The SpeakStream method operates like the Speak method, except that it takes a stream object, such as a text stream or a sound file, rather than a text string.

Voice Output

An SpVoice object is created with its audio output set to the system default audio output. Use the GetAudioOutputs method to determine what other outputs are available to the voice, and use the AudioOutput property to set its audio output to one of them.

Use the AudioOutputStream property with other Speech automation objects to store audio output in memory (see SpMemoryStream) or in files (see SpFileStream).
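A minimal Python sketch of redirecting output to a file follows. It assumes `voice` is an SpVoice object and `file_stream` is an SpFileStream object, both created elsewhere. It also assumes that SpFileStream exposes Open and Close and that the file mode value matches the SpeechStreamFileMode enumeration; neither is documented on this page. The helper name `speak_to_file` is hypothetical.

```python
# Assumed value from the SpeechStreamFileMode enumeration (not listed on this page).
SSFM_CREATE_FOR_WRITE = 3  # SSFMCreateForWrite: create or overwrite the file


def speak_to_file(voice, file_stream, path, text):
    """Speak `text` synchronously into the file at `path`.

    Returns the stream number reported by Speak. The file stream is closed
    even if speaking fails. The voice is left pointing at the closed file
    stream, so the caller should assign a new output before speaking again.
    """
    file_stream.Open(path, SSFM_CREATE_FOR_WRITE)
    try:
        voice.AudioOutputStream = file_stream
        return voice.Speak(text)  # default flags: synchronous
    finally:
        file_stream.Close()
```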

Voice Events

As a voice speaks text, it can generate events when it detects certain conditions in the input stream. These events are contained in the SpeechVoiceEvents enumeration. Examples of these events are completion of phonemes, words, or sentences, as well as changes of voice or the presence of bookmarks. The range of conditions which can be reported by SpeechVoiceEvents is wide enough that most applications will use only a few of them. To prevent the TTS engine from generating events that will be ignored by the application, use the EventInterests property to specify the events of interest. Only these events will be raised.

The point in the input text stream at which a potential event has been completed is referred to as an event boundary. At each event boundary, the event type is compared with the current EventInterests. If the event type is of interest, an event of that type is raised. Voice events return the input stream number in order to associate them with the appropriate stream.
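The comparison made at each event boundary amounts to a bitmask test, which the Python sketch below illustrates. The numeric values are taken from the SpeechVoiceEvents enumeration as commonly documented and are an assumption here, since this page does not list them. The names `VoiceEvents`, `should_raise`, and `apply_interests` are hypothetical.

```python
from enum import IntFlag


class VoiceEvents(IntFlag):
    """Assumed subset of the SpeechVoiceEvents enumeration values."""
    START_INPUT_STREAM = 2
    END_INPUT_STREAM = 4
    VOICE_CHANGE = 8
    BOOKMARK = 16
    WORD_BOUNDARY = 32
    PHONEME = 64
    SENTENCE_BOUNDARY = 128
    VISEME = 256
    AUDIO_LEVEL = 512


def should_raise(event_type, interests):
    """Mirror the check at an event boundary: is this type of interest?"""
    return bool(event_type & interests)


def apply_interests(voice, interests):
    """Tell the voice which events to raise; all others are suppressed."""
    voice.EventInterests = int(interests)
```

For example, `apply_interests(voice, VoiceEvents.WORD_BOUNDARY | VoiceEvents.BOOKMARK)` would request only word-boundary and bookmark events.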

Voice Priorities and Alerts

Application error handling has traditionally interrupted a UI with message boxes or alert boxes describing error states. Because a TTS application might operate with no graphical UI at all, it is able to implement error handling with a TTS voice. This voice is referred to as an alert, because its purpose is identical to that of an alert box or message box. To create an alert voice, create a new SpVoice object and set its Priority property appropriately. The alert voice should also use a different Voice property from the normal voice, so that users can easily distinguish the two.

When a speaking voice detects a pending alert, it continues speaking until it arrives at a specific application-defined stopping point, such as a sentence or a word. This stopping point is called the alert boundary because it is an event boundary at which alerts can be processed. When the alert has finished speaking, the interrupted voice resumes. Get and set the alert boundary with the AlertBoundary property.
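Setting up an alert voice might look like the following Python sketch. It assumes both arguments are SpVoice automation objects created elsewhere. The numeric values come from the SpeechVoicePriority and SpeechVoiceEvents enumerations, which this page does not list, so they are assumptions. The helper name `configure_alert` is hypothetical.

```python
# Assumed enumeration values (not listed on this page).
SVP_ALERT = 1                # SpeechVoicePriority: alert voice
SVE_SENTENCE_BOUNDARY = 128  # SpeechVoiceEvents: end of a sentence


def configure_alert(normal_voice, alert_voice, alert_token=None,
                    boundary=SVE_SENTENCE_BOUNDARY):
    """Make `alert_voice` an alert and choose where `normal_voice` yields.

    `alert_token` is an optional voice token (for example, one returned by
    GetVoices) so that the alert sounds different from the normal voice.
    """
    alert_voice.Priority = SVP_ALERT
    if alert_token is not None:
        alert_voice.Voice = alert_token
    # The normal voice stops for pending alerts at this kind of boundary.
    normal_voice.AlertBoundary = boundary
```

Choosing a sentence boundary lets the normal voice finish its current sentence before the alert speaks; a word boundary would interrupt sooner.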

Status and Control

The Status property returns an ISpeechVoiceStatus object, which contains several types of information about the state of the voice. Some ISpeechVoiceStatus properties are equivalent to parameters returned by voice events; some applications may find it advantageous to read Status occasionally rather than receive events constantly.
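Occasional polling of the status could be sketched in Python as follows. It assumes `voice` is an SpVoice automation object obtained elsewhere. The ISpeechVoiceStatus property names and the SpeechRunState values are not listed on this page and are assumptions to verify against the SDK. The helper name `snapshot` is hypothetical.

```python
# Assumed values from the SpeechRunState enumeration (not listed on this page).
SRSE_DONE = 1         # the voice has finished its queue
SRSE_IS_SPEAKING = 2  # the voice currently holds the audio device


def snapshot(voice):
    """Read the voice status once and return the fields of interest."""
    status = voice.Status  # one ISpeechVoiceStatus object per read
    state = status.RunningState
    return {
        "stream": status.CurrentStreamNumber,
        "last_queued": status.LastStreamNumberQueued,
        "speaking": bool(state & SRSE_IS_SPEAKING),
        "done": bool(state & SRSE_DONE),
        "word_position": status.InputWordPosition,
        "word_length": status.InputWordLength,
    }
```

An application might call `snapshot(voice)` from a timer to update a progress display instead of handling every word-boundary event.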

Voice status and voice events are closely associated with the status of the audio output device. A voice speaking to a file stream produces no audio output, generates no events, and has no audio output status. As a result, the ISpeechVoiceStatus data returned by that voice will always show it to be inactive.

A speaking voice can be paused at the next alert boundary with the Pause method. A paused voice can be resumed with the Resume method. The Skip method causes the voice to skip forward or backward in the input stream.
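These control methods can be wrapped as in the Python sketch below, which pairs every Pause with a Resume. It assumes `voice` is an SpVoice automation object obtained elsewhere, and that Skip accepts an item type of "Sentence" followed by a signed count; that signature is not documented on this page. The helper names `paused` and `skip_sentences` are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def paused(voice):
    """Pause the voice for the duration of a with-block, then resume it."""
    voice.Pause()
    try:
        yield voice
    finally:
        # Resume even if the block raises, so the voice is never left paused.
        voice.Resume()


def skip_sentences(voice, count):
    """Skip forward (positive count) or backward (negative count).

    Returns whatever Skip reports, assumed to be the number of items skipped.
    """
    return voice.Skip("Sentence", count)
```

Usage would look like `with paused(voice): ...` around any work that needs the output device released.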

Automation Interface Elements

The SpVoice automation object has the following elements:

Properties                              Description
AlertBoundary                           Gets and sets the alert boundary, which specifies how a speaking voice pauses itself for alerts.
AllowAudioOutputFormatChangesOnNextSet  Gets and sets the flag that specifies whether the voice is allowed to adjust its audio output format automatically.
AudioOutput                             Gets and sets the current audio output object used by the voice.
AudioOutputStream                       Gets and sets the current audio stream object used by the voice.
EventInterests                          Gets and sets the types of events received by the voice.
Priority                                Gets and sets the priority level of the voice.
Rate                                    Gets and sets the speaking rate of the voice.
Status                                  Returns the current speaking and event status of the voice in an ISpeechVoiceStatus object.
SynchronousSpeakTimeout                 Gets and sets the interval, in milliseconds, after which the voice's synchronous Speak and SpeakStream calls will time out when its output device is unavailable.
Voice                                   Gets and sets the currently active member of the Voices collection.
Volume                                  Gets and sets the base volume (loudness) level of the voice.

Methods                                 Description
DisplayUI                               Initiates the display of the specified UI.
GetAudioOutputs                         Returns a selection of available audio output tokens.
GetVoices                               Returns a selection of voices available to the voice.
IsUISupported                           Determines if the specified UI is supported.
Pause                                   Pauses the voice at the nearest alert boundary and closes the output device, allowing it to be used by other voices.
Resume                                  Causes the voice to resume speaking when paused.
Skip                                    Causes the voice to skip forward or backward by the specified number of items within the current input text stream.
Speak                                   Initiates the speaking of a text string, text file or wave file by the voice.
SpeakCompleteEvent                      Gets an event handle from the voice that will be signaled when the voice finishes speaking.
SpeakStream                             Initiates the speaking of a text stream or sound file by the voice.
WaitUntilDone                           Blocks the caller until either the voice has finished speaking or the specified time interval has elapsed.