Synthesis markup (Microsoft Speech Platform)

Microsoft Speech Platform SDK 11

Synthesis Markup

SAPI 5 synthesis markup is the collection of XML tags inserted into text to modify how that text is synthesized. These tags, which provide functionality such as volume control and word emphasis, can be inserted into text passed to ISpVoice::Speak and into text streams of format SPDFID_XML that are passed to ISpVoice::SpeakStream. By default, the SAPI XML parser auto-detects XML. If the XML structure is invalid, a speak error may be returned to the application. SAPI is not intended to validate XML structure; it is the responsibility of the developer to validate the XML with an XML validation tool. See ISpVoice for more information.
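Because SAPI does not validate markup, an application may want to check that a string is well-formed before passing it to ISpVoice::Speak. The following is a minimal sketch in Python, assuming the standard library XML parser is an acceptable stand-in for a dedicated validation tool. The helper name is_well_formed and the wrapper element name sapi_fragment are hypothetical. A generic XML parser treats tag names as case-sensitive, so this check is stricter than SAPI in that respect.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the markup fragment parses as well-formed XML."""
    # A fragment may hold several top-level elements and bare text, so wrap it
    # in a single (hypothetical) root element before parsing.
    wrapped = "<sapi_fragment>" + markup + "</sapi_fragment>"
    try:
        ET.fromstring(wrapped)
    except ET.ParseError:
        return False
    return True

print(is_well_formed('<LANG LANGID="409">Hello</LANG> world'))  # True
print(is_well_formed('<LANG LANGID="409">Hello'))               # False: missing end tag
```

This only checks well-formedness. It does not check that the tag and attribute names are ones the speech engine understands.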

SAPI 5 synthesis markup is an application of XML. Every XML element consists of a start tag <Some_tag> and an end tag </Some_tag>, with a case-insensitive tag name, and contents between these tags. An empty element has no contents and can be written either as <Some_tag></Some_tag> or as a single combined tag <Some_tag/>. More information about XML and the XML specification is available at: http://www.w3.org/TR/1998/REC-xml-19980210.html.
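The equivalence of the two empty-element forms can be checked with any XML parser. The sketch below uses the Python standard library parser, not SAPI itself, and the helper name describe is hypothetical.

```python
import xml.etree.ElementTree as ET

def describe(source: str):
    """Parse one element and return its name, text content, and child count."""
    element = ET.fromstring(source)
    return element.tag, element.text or "", len(element)

# Both spellings of the empty element describe the same thing:
# same tag name, no text, no sub-elements.
print(describe("<Some_tag></Some_tag>"))
print(describe("<Some_tag/>"))
```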

The following section covers:

SAPI 5 XML tags

XML tags in SAPI 5 follow a defined structure, scope, and implementation. Each SAPI 5 XML tag has a specific purpose and affects the input text in a predetermined manner.

The SAPI 5 XML tags are divided into four different scope categories.

  1. Non-scoped
  2. Scoped
  3. Global
  4. Scoped/Global

Voice modifications and properties can be controlled through the use of XML tags.

Attributes
Attributes of an XML element appear inside the start tag. Each attribute takes the form of a name, followed by an equals sign, followed by a quoted string value. An attribute of a given name may appear only once in a start tag. Exact details on which characters may appear between the quotes can be found at http://www.w3.org/TR/REC-xml#NT-AttValue.

Briefly, the literal string cannot contain a less-than character (<). If the string is surrounded by single quotation marks, it cannot contain a single quotation mark. If the string is surrounded by double quotation marks, it cannot contain a double quotation mark. Furthermore, an ampersand (&) can be used only as part of an entity reference such as "&amp;" or "&gt;". When a literal string is parsed, the resulting replacement text has all entity references resolved into their corresponding text; for example, "&gt;" becomes ">".
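When markup is generated by a program, these quoting rules are easiest to satisfy by escaping values automatically. The following is a minimal sketch in Python, assuming the standard library helpers escape and quoteattr from xml.sax.saxutils. The function name make_element is hypothetical, and the tag and attribute names are passed through unchecked.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

def make_element(tag: str, attributes: dict, text: str) -> str:
    """Build one element, escaping attribute values and text content."""
    # quoteattr chooses the surrounding quote character and escapes &, <, and >
    # so the value is a legal attribute literal.
    attrs = "".join(f" {name}={quoteattr(value)}" for name, value in attributes.items())
    # escape replaces &, <, and > in the element contents with entity references.
    return f"<{tag}{attrs}>{escape(text)}</{tag}>"

markup = make_element("VOICE", {"REQUIRED": "language=409"}, "Q&A: is 2 < 3?")
print(markup)
# <VOICE REQUIRED="language=409">Q&amp;A: is 2 &lt; 3?</VOICE>

# Parsing the result resolves the entity references back to the original text.
print(ET.fromstring(markup).text)
# Q&A: is 2 < 3?
```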

In this specification, only the resulting replacement text needs to be defined for attribute value strings. The XML specification defines the exact file syntax details. Character references make it possible to specify, using only ASCII characters, replacement text that contains unprintable characters such as extended Unicode characters. For example, the character reference "&#x0259;" specifies the single Unicode character for the International Phonetic Alphabet symbol for a mid-central unrounded vowel. See http://www.w3.org/TR/1998/REC-xml-19980210#sec-references for details.
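The resolution of references into replacement text can be observed with a generic XML parser. This sketch uses the Python standard library parser rather than SAPI. The helper name resolve and the element and attribute names inside it are hypothetical, and the literal passed in is assumed to contain no double quotation marks.

```python
import xml.etree.ElementTree as ET

def resolve(attribute_literal: str) -> str:
    """Return the replacement text for an attribute literal (no double quotes)."""
    # Place the literal in a throwaway element and let the parser resolve
    # character and entity references.
    element = ET.fromstring(f'<Some_tag Some_attribute="{attribute_literal}"/>')
    return element.attrib["Some_attribute"]

value = resolve("&#x0259; &gt; &amp;")
# Show the code point of each character in the replacement text.
print([hex(ord(ch)) for ch in value])
# ['0x259', '0x20', '0x3e', '0x20', '0x26']
```

The first entry, 0x259, is the single Unicode character named by the character reference; the remaining entries are the space, ">", space, and "&" characters.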

The <LANG> and <VOICE> XML tags are specific to the Microsoft engines and provide support for language and dialect attributes for a given voice.

The following example shows what the 409 and 9 values in 409;9 refer to and how to use them correctly in XML tags:

<LANG LANGID="409">This is the US English language</LANG>
<LANG LANGID="9">This is the English Language</LANG>
<VOICE REQUIRED="language=409">This is the required voice for the US English language</VOICE>
<VOICE REQUIRED="language=9">This is the required voice that speaks in any dialect of the English language</VOICE>

A speak error occurs if the voice attribute information is entered exactly as it appears in the Windows Registry, for example:

<LANG LANGID="409;9">Speak this text with the US English language</LANG>
<VOICE REQUIRED="language=409;9">Require a voice be used that speaks the US English language</VOICE>
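A small guard in the calling code can catch this mistake before the markup reaches the engine. The sketch below, in Python, splits a registry-style value into individual language IDs and refuses to build a <LANG> tag from a combined value. Both function names are hypothetical, and the text argument is assumed to contain no markup characters.

```python
from typing import List

def registry_language_to_ids(registry_value: str) -> List[str]:
    """Split a registry-style language value such as '409;9' into single IDs."""
    return [part.strip() for part in registry_value.split(";") if part.strip()]

def lang_tag(langid: str, text: str) -> str:
    """Build a <LANG> element, rejecting values that hold more than one ID."""
    if ";" in langid:
        raise ValueError("LANGID must be a single language ID, not a registry list")
    return f'<LANG LANGID="{langid}">{text}</LANG>'

ids = registry_language_to_ids("409;9")
print(ids)
# ['409', '9']
print(lang_tag(ids[0], "This is the US English language"))
# <LANG LANGID="409">This is the US English language</LANG>
```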

In the Windows Registry, the language attribute for the Microsoft SAPI 5 English voices is labeled as '409;9'. The '409' value indicates that the voice is specifically US English, and '9' refers to the English language in general. This language labeling convention may not be followed by all engine manufacturers. For example, the LH voices may use '409' to indicate an English voice, while Microsoft uses '409;9' to specify that the voice is specifically US English.

For example:

409;9 = US English
809;9 = British English
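These values are hexadecimal Windows language identifiers. In a Windows language identifier, the low 10 bits hold the primary language and the upper 6 bits hold the sublanguage, which is why 409 and 809 share the primary language 9. The following Python sketch decodes the values; the function name split_langid is hypothetical.

```python
def split_langid(hex_id: str):
    """Split a hexadecimal language ID into (primary language, sublanguage)."""
    value = int(hex_id, 16)
    primary = value & 0x3FF   # low 10 bits: primary language
    sublanguage = value >> 10  # upper 6 bits: sublanguage (dialect)
    return primary, sublanguage

for hex_id in ("409", "809", "9"):
    primary, sublanguage = split_langid(hex_id)
    print(hex_id, "-> primary", format(primary, "x"), "sublanguage", sublanguage)
# 409 -> primary 9 sublanguage 1
# 809 -> primary 9 sublanguage 2
# 9 -> primary 9 sublanguage 0
```

A sublanguage of 0 means that no particular dialect is specified, which matches the use of '9' for any dialect of English.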

To view the attribute information for an installed voice, start RegEdit and expand the tree view pane to the following registry key location:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\Voices\Tokens

Select one of the available voices and view the corresponding attribute information.

The following is an example of the MSMary voice attributes:

Contents
The contents of an element consist of text or sub-elements. With these definitions, the XML specification defines the exact file syntax details.

Relationship to HTML web pages and SABLE

The XML format that SAPI 5 uses is not placed inside web pages. Web page authors who want to mark up sections of HTML text so that they are synthesized correctly should use the W3C Aural Cascading Style Sheets (ACSS). More information is available at: http://www.w3.org/TR/WD-acss

SAPI applications that are synthesizing text from a web page will "render" HTML+ACSS into SAPI's synthesis markup format. Programs apply a default ACSS file when synthesizing web pages that do not have an associated ACSS file.

The SAPI 5 synthesis markup format is similar to the format published by the SABLE Consortium. However, this format and SABLE version 1.0 are not interoperable. At this time, it has not been determined whether they will become partially interoperable in the future. More information about the SABLE specification is available at: http://www.bell-labs.com/project/tts/sable.html.
