Text Normalization

Microsoft Speech SDK

SAPI 5.1

You can perform simple text normalization for voice training using the text buffer provided to the engine. Text normalization is the process of changing the input buffer so that the engine can use its preferred word units. These word units affect how words are expected to be pronounced as well as how they appear in the voice training wizard.

The text provided to the engine is called an article. An article is composed of multiple phrases, each separated by a new-line character. The voice training wizard displays one phrase at a time.

<article> ::= { <phrase> "\n" }
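As an illustration of the article structure, here is a minimal Python sketch that splits an article buffer into its phrases. The function name split_article is hypothetical and not part of SAPI; the sketch assumes the buffer is already available as a string and tolerates a missing final new-line.

```python
def split_article(article):
    """Return the phrases of an article, one per new-line-terminated line."""
    phrases = article.split("\n")
    # Every phrase ends with "\n", so splitting leaves one empty
    # trailing element that is not a phrase; drop it.
    if phrases and phrases[-1] == "":
        phrases.pop()
    return phrases


if __name__ == "__main__":
    for phrase in split_article("This is the first phrase\nThis is the second\n"):
        print(phrase)
```

Each returned string corresponds to one phrase that the voice training wizard would display on its own.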

A phrase is a sequence of word units separated by white space characters. In this context, white space characters are all characters for which the C run-time function iswspace() returns a nonzero value.

<phrase> ::= { <word> | <literal_symbols> | <numeric_expression> }
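A phrase can be broken into its units by splitting on white space. The sketch below is a rough Python approximation: str.split() with no arguments splits on the characters for which Python's str.isspace() is true, which is assumed here to be close enough to iswspace() for ordinary text. The name split_units is hypothetical.

```python
def split_units(phrase):
    """Split one phrase into its white-space-separated units."""
    # With no separator argument, split() treats any run of white space
    # as a single delimiter and ignores leading and trailing white space.
    return phrase.split()


if __name__ == "__main__":
    print(split_units("It costs 1,000 $\\dollar .\\period"))
```

Each resulting unit is then a candidate word, literal symbol, or numeric expression.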

Word units

Literals

The following symbols are recognized as units. They should be separated from adjacent text with white space; they will "snuggle" to the words appropriately when presented to the user.

<literal_symbols> ::=
"!\exclamation-point" | "\"\end-quote" | "\"\quote" | "#\pound-sign" | "$\dollar" | "%\percent" | "&\ampersand" | "'\end-quote" | "'\quote" | "(\paren" | ")\close-paren" | "*\asterisk" | "+\plus" | ",\comma" | "--\double-dash" | "-\hyphen" | "...\ellipsis" | ".\dot" | ".\period" | "/\slash" | ":\colon" | ";\semicolon" | "<\less-than" | "=\equals" | ">\greater-than" | "?\question-mark" | "@\at-sign" | "[\bracket" | "\\back-slash" | "]\close-bracket" | "^\circumflex" | "_\underscore" | "`\back-quote" | "{\left-brace" | "|\vertical-bar" | "}\right-brace" | "~\tilde"
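The entries in this list all appear to follow one pattern: the symbol itself, a backslash, and a name for the symbol. The text does not state this rule explicitly, so the Python sketch below treats it as an assumption. It separates a literal unit into those two parts by splitting at the last backslash, which also handles "\\back-slash". The name split_literal is hypothetical.

```python
def split_literal(unit):
    """Split a literal unit into (symbol, name) at its last backslash."""
    symbol, sep, name = unit.rpartition("\\")
    if not sep or not symbol or not name:
        raise ValueError("not of the form symbol\\name: %r" % unit)
    return symbol, name


if __name__ == "__main__":
    # Literal units are separated from adjacent words by white space.
    phrase = r"Wait ,\comma what ?\question-mark"
    for unit in phrase.split():
        if "\\" in unit:
            print(split_literal(unit))
```

This only checks the shape of a unit; it does not verify that the unit is one of the literals listed above.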

Numerics

Numbers can take the following forms:

<digit> ::= "0"-"9"

<non_zero_digit> ::= "1"-"9"

<numeric_expression> ::= <integer_expression> | <integer_expression> <cardinal_suffix> | <floating_expression>

<integer_expression> ::= ["-"] <non_zero_digit>[<digit>[<digit>]] { [","] <digit><digit><digit> }

<floating_expression> ::= <integer_expression> "." <digit> [{ <digit> }]

<cardinal_suffix> ::= "st" | "nd" | "rd" | "th"
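The numeric grammar can be transcribed into a regular expression for checking a single unit. The following Python sketch is one such transcription, not the engine's own implementation; it assumes that the suffix follows the integer with no intervening white space (as in "21st"), and the name is_numeric_expression is hypothetical.

```python
import re

# <integer_expression>: optional "-", one to three digits starting with a
# non-zero digit, then any number of three-digit groups, each optionally
# preceded by a comma.
INTEGER = r"-?[1-9][0-9]{0,2}(?:,?[0-9]{3})*"

# <numeric_expression>: an integer, optionally followed by either a
# cardinal suffix or a "." and one or more digits.
NUMERIC = re.compile("(?:" + INTEGER + r")(?:st|nd|rd|th|\.[0-9]+)?")


def is_numeric_expression(unit):
    """Return True if the whole unit matches <numeric_expression>."""
    return NUMERIC.fullmatch(unit) is not None


if __name__ == "__main__":
    for unit in ["1,000", "1000", "-42", "3.14", "21st", "0", "1,00"]:
        print(unit, is_numeric_expression(unit))
```

Because <integer_expression> must begin with a non-zero digit, the grammar as written does not match "0" or values such as "0.5".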

Collections

The remainder of the buffer will be treated as a collection of words:

<alpha_char> ::= "a"-"z" | "A"-"Z"

<word_char> ::= <alpha_char> | "-" | "_" | "0"-"9"

<word> ::= <word0> | <word1> | <word2> | <word3>

<word0> ::= <alpha_char> [{<word_char>}]

<word1> ::= <alpha_char> [{<word_char>}] "s'"|"in'"

<word2> ::= <alpha_char> [{<word_char>}] "." <word2>

<word3> ::= <abbreviation_string> "."

<abbreviation_string> ::=
"al" | "apr" | "assn" | "assoc" | "atty" | "aug" | "bef" | "bldg" | "ch" | "chg" | "co" | "com" | "cont" | "corp" | "dec" | "def" | "det" | "dev" | "div" | "doc" | "etc" | "ext" | "feb" | "gov" | "in" | "ins" | "int" | "intl" | "jan" | "jr" | "jul" | "jun" | "mar" | "messrs" | "mos" | "mph" | "mr" | "mrs" | "ms" | "mt" | "no" | "nov" | "oct" | "oz" | "par" | "pct" | "pfc" | "pp" | "pres" | "prov" | "pt" | "qtr" | "ref" | "reg" | "rep" | "rev" | "sdn" | "sec" | "sep" | "sq" | "sr" | "tech" | "vol" | "wm"
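The word rules can be approximated with a small classifier. The Python sketch below makes several assumptions that the grammar does not settle: the second word rule is read as a word stem followed by "s'" or "in'", abbreviations are matched without regard to case, and the dotted, self-referencing third rule is left out because it has no terminating case as written. The name classify_word is hypothetical.

```python
import re

ABBREVIATIONS = set("""
al apr assn assoc atty aug bef bldg ch chg co com cont corp dec def det dev
div doc etc ext feb gov in ins int intl jan jr jul jun mar messrs mos mph mr
mrs ms mt no nov oct oz par pct pfc pp pres prov pt qtr ref reg rep rev sdn
sec sep sq sr tech vol wm
""".split())

# An alphabetic character followed by letters, digits, "-" or "_".
PLAIN_WORD = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")
# The same stem followed by "s'" or "in'" (assumed reading of the rule).
APOSTROPHE_WORD = re.compile(r"[A-Za-z][A-Za-z0-9_-]*(?:s'|in')")


def classify_word(unit):
    """Return 'abbreviation', 'apostrophe', 'plain', or None for a unit."""
    # Assumption: abbreviations are compared case-insensitively.
    if unit.endswith(".") and unit[:-1].lower() in ABBREVIATIONS:
        return "abbreviation"
    if APOSTROPHE_WORD.fullmatch(unit):
        return "apostrophe"
    if PLAIN_WORD.fullmatch(unit):
        return "plain"
    return None


if __name__ == "__main__":
    for unit in ["hello", "co-op", "dogs'", "runnin'", "Mr.", "hello."]:
        print(unit, classify_word(unit))
```

A unit that ends in a period but is not in the abbreviation list, such as "hello.", is not matched here; under the grammar above, the period would instead be written as a separate literal unit.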