Grammar Rules and State Graphs (Microsoft Speech Platform)

Microsoft Speech Platform SDK 11

Grammar rules are elements that SAPI 5-compliant speech recognition (SR) engines use to restrict the possible word or sentence choices during the SR process. SR engines employ grammar rules to control the elements of sentence construction using a predetermined list of recognized word or phrase choices. The word and phrase choices contained in the grammar rules form the basis of the SR engine vocabulary.

Each grammar rule element that a phrase or sentence matches determines the recognition path. For example, consider a phrase describing travel plans, "I would like to drive from Seattle to New York," and note the elements that determine the resulting information. In this example, a person is planning to drive from Seattle to New York. This is a very simple illustration of what could be a very complex problem. Determining the same travel plans without limiting the method, direction, and destination would involve an effectively unlimited number of travel options.

The resulting information can be determined by restricting the available choices for a given sentence. Using this method, the resulting information can be composed only from certain choices, thus eliminating the possibility of an infinite number of travel plan combinations.

 I would like to drive from Seattle to New York.
                  |      |     |     |     |
               [Method]  |     |     |     |
               /    \    |     |     |     |
            Fly   Drive  |     |     |     |
                         |     |     |     |
                  [Direction]  |     |     |
                    /     \    |     |     |
                  From    To   |     |     |
                               |     |     |
                            [City]   |     |
                           Seattle   |     |
                          New York   |     |
                       Los Angeles   |     |
                       Albuquerque   |     |
                                     |     |
                              [Direction]  |
                                 /    \    |
                                To   From  |
                                           |
                                        [City]
                                        Seattle
                                        New York
                                        Los Angeles
                                        Albuquerque

The elements of interest in the example phrase are as follows:

  • Method of travel (fly or drive), specifically "drive"
  • Travel direction (from or to), specifically "from"
  • The city of origin for the travel plan (from), specifically "Seattle"
  • Travel direction complement (from or to), specifically "to"
  • The city of destination for the travel plan (to), specifically "New York"
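
As a minimal sketch of how restricting each element to a fixed list keeps the set of phrases finite, the Python below enumerates every phrase that the choice lists in the diagram allow. The names PREFIX, SLOTS, and enumerate_phrases are illustrative assumptions and are not part of SAPI or the Speech Platform API.

```python
from itertools import product

# Choice lists copied from the example diagram; all names here are illustrative only.
PREFIX = "I would like to"
SLOTS = [
    ("METHOD", ["fly", "drive"]),
    ("DIRECTION", ["from", "to"]),
    ("CITY_1", ["Seattle", "New York", "Los Angeles", "Albuquerque"]),
    ("DIRECTION", ["to", "from"]),
    ("CITY_2", ["Seattle", "New York", "Los Angeles", "Albuquerque"]),
]

def enumerate_phrases():
    """Yield every phrase that the restricted choice lists allow."""
    for combo in product(*(choices for _, choices in SLOTS)):
        yield f"{PREFIX} {' '.join(combo)}"

phrases = list(enumerate_phrases())
print(len(phrases))  # 2 * 2 * 4 * 2 * 4 = 128
print("I would like to drive from Seattle to New York" in phrases)  # True
```

The lists restrict the form of a phrase, not its sense: combinations such as "fly to Seattle to Seattle" are also among the 128 phrases.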

The information can also be displayed as a graph of states and arcs, where each arc can have text (or semantic tags/properties) attached. The valid phrases are the unique paths through the graph, starting at the root and ending at a terminal state. Each state is denoted by a term in parentheses: (root node), (interim node), or, for the terminal node, (NULL). The spoken text is denoted by words surrounded by quotation marks. The semantic property names are denoted by capitalized words in square brackets.

             (root node)
                  |
                  |"I would like to"
                  |
                  |
           (interim node)
                  /\
                 /  \
         "drive"/    \"fly" [METHOD]
                \    /
                 \  /
                  \/
           (interim node)
                  /\
           "from"/  \"to"   [DIRECTION]
                 \  /
                  \/
           (interim node)
                  /\
            _____/  \_____
           /   \       /  \
          /     \     /    \  [CITY_1]
         /       |   /      \
         |       |  |        \
"Seattle"|   "New|  |"Los     |"Albuquerque"
         |  York"|  |Angeles" |
         |       |  |         /
         |       |  |        /
          \      /  \       /
           \    /    \     /
            \___\     \___/
                 \    /
                  \  /
                   \/
           (interim node)
                   /\
            "from"/  \"to"   [DIRECTION]
                  \  /
                   \/
           (interim node)
                   /\
             _____/  \_____
            /   \       /  \
           /     \     /    \
          /       |   /      \
          |       |  |        \
 "Seattle"|   "New|  |"Los     |"Albuquerque"
          |  York"|  |Angeles" |
          |       |  |         /
          |       |  |        /  [CITY_2]
           \      /  \       /
            \    /    \     /
             \___\     \___/
                  \    /
                   \  /
                    \/
                  (NULL)
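
The state graph above could be modeled along the following lines. This is a hedged sketch, not how an SR engine is implemented: the state names (s1 through s5), the GRAPH table, and the walk helper are hypothetical, and the input is assumed to be plain text with punctuation already removed.

```python
# Each state maps to a list of arcs: (spoken text, semantic property or None, next state).
# State names are hypothetical labels for the (interim node) states in the diagram.
CITIES = ["Seattle", "New York", "Los Angeles", "Albuquerque"]
GRAPH = {
    "root": [("I would like to", None, "s1")],
    "s1": [("drive", "METHOD", "s2"), ("fly", "METHOD", "s2")],
    "s2": [("from", "DIRECTION", "s3"), ("to", "DIRECTION", "s3")],
    "s3": [(city, "CITY_1", "s4") for city in CITIES],
    "s4": [("from", "DIRECTION", "s5"), ("to", "DIRECTION", "s5")],
    "s5": [(city, "CITY_2", "NULL") for city in CITIES],
}

def walk(phrase, state="root"):
    """Return ordered (property, value) pairs for a valid phrase, or None if no path matches."""
    if state == "NULL":
        # The terminal state is valid only when the whole phrase has been consumed.
        return [] if phrase == "" else None
    for text, prop, next_state in GRAPH[state]:
        if phrase == text or phrase.startswith(text + " "):
            rest = phrase[len(text):].lstrip()
            tail = walk(rest, next_state)
            if tail is not None:
                return ([(prop, text)] if prop else []) + tail
    return None

result = walk("I would like to drive from Seattle to New York")
print(result)
rejected = walk("I would like to travel from Seattle to New York")
print(rejected)  # None: "travel" is not an arc in the graph
```

A phrase is accepted only if it follows arcs from the root to (NULL); the semantic properties are collected from the arcs along that path.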

Suppose the user speaks the following phrase:
I would like to drive from Seattle to New York.

The grammar rules are matched as concatenated phrase elements, and each phrase element is limited to the choices defined in the grammar. Restricting the input to a limited set of possibilities gives significantly more control over the resulting information. Otherwise, obtaining the travel plan information from the same sample phrase, "I would like to drive from Seattle to New York," would be considerably more ambiguous.

Without a defined set of choices, the complexity of parsing the same sentence grows rapidly. Imagine the number of possible combinations in a sentence that is not restricted to a finite list of choices. With every choice left open, the example phrase reduces to the following:

"I want to (unknown travel method) (unknown travel direction) (unknown city) (unknown travel direction) (unknown city)."

The amount of predictable information is significantly reduced without the ability to constrain the available choices within a sentence.

For the example phrase "I would like to drive from Seattle to New York," the semantic structure (using name/value pairs) is:
[METHOD="drive"], [DIRECTION="from"], [CITY_1="Seattle"], [DIRECTION="to"], [CITY_2="New York"]

By parsing the semantic structure, the application can easily and accurately analyze the content of the original phrase, without parsing or analyzing individual words. The application developer can then write application logic to perform specific actions based on the previously mentioned semantic names, and specialize the action based on the values of each semantic property. The grammar author can add to or delete from the lists of words, without breaking the application logic.
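
To illustrate application logic driven by semantic names rather than by individual words, the sketch below turns the name/value pairs into a travel plan. Representing the pairs as a Python list of tuples and the interpret function itself are assumptions made for illustration; an actual engine returns semantic properties through its own API.

```python
# Maps a DIRECTION value to the role of the city that follows it (illustrative only).
ROLE_FOR_DIRECTION = {"from": "origin", "to": "destination"}

def interpret(pairs):
    """Turn ordered (name, value) semantic pairs into a travel plan dictionary."""
    plan = {}
    direction = None
    for name, value in pairs:
        if name == "METHOD":
            plan["method"] = value
        elif name == "DIRECTION":
            direction = value  # applies to the next city property
        elif name.startswith("CITY_") and direction in ROLE_FOR_DIRECTION:
            plan[ROLE_FOR_DIRECTION[direction]] = value
    return plan

pairs = [("METHOD", "drive"), ("DIRECTION", "from"), ("CITY_1", "Seattle"),
         ("DIRECTION", "to"), ("CITY_2", "New York")]
plan = interpret(pairs)
print(plan)  # {'method': 'drive', 'origin': 'Seattle', 'destination': 'New York'}
```

Because the logic keys on property names, adding a city to the grammar's word lists requires no change to this code.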

The following terms apply to grammar rules:

TOPLEVEL versus non-TOPLEVEL
A rule tagged as TOPLEVEL can be in an active or inactive state. The rules that import a grammar can override the activation state of a rule. This conditional state can be configured dynamically at run time. If an inactive grammar is included in another grammar or grammar rule, its inactive state is ignored. When a rule is activated, an SR engine accepts only speech that satisfies at least one of the active rules contained in the loaded grammar. A rule that is not marked TOPLEVEL is a component rule and is not directly accessible; that is, the user can speak only TOPLEVEL rules for valid recognition.
Non-terminal
A grammar node is considered to be non-terminal if it is the beginning of a choice selection or a group of choice selections. For example, the grammar node Dog is non-terminal when the subsequent choice selections are types of dogs. This type of grammar node is defined as non-terminal because of its choice selections.
Terminal
A grammar node is considered to be terminal if it consists only of a word or phrase in the recognized vocabulary that can be spoken, with no further choice selections beneath it. In the Dog example above, the terminal grammar nodes are the types of dogs.
                      --------------------+
               Animal                     +--- Non-terminal node
                 |    --------------------+
                 |
              /--+--\          -----------+
  Cat--------/       \------Dog           +--- Non-terminal node
   |                       |   -----------+
   |                       |
   |                       |   -----------+
   +-- Burmese             +-- Airedale   |
   +-- Himalayan           +-- Poodle     +--- Terminal nodes
   +-- Persian             +-- Schnauzer  |
   +-- Siamese             +-- Whippet    |
                               -----------+
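
The distinction in the diagram can be sketched with a nested dictionary in which a node with children is non-terminal and a node without children is terminal. The TREE structure and the classify helper are illustrative assumptions, not part of any grammar format.

```python
# Nested dictionary mirroring the Animal diagram; an empty dict means "no choices below".
TREE = {
    "Animal": {
        "Cat": {"Burmese": {}, "Himalayan": {}, "Persian": {}, "Siamese": {}},
        "Dog": {"Airedale": {}, "Poodle": {}, "Schnauzer": {}, "Whippet": {}},
    }
}

def classify(tree):
    """Return (non_terminals, terminals) as two lists of node names."""
    non_terminals, terminals = [], []
    for name, children in tree.items():
        if children:
            # A node that introduces further choices is non-terminal.
            non_terminals.append(name)
            sub_non_terminals, sub_terminals = classify(children)
            non_terminals += sub_non_terminals
            terminals += sub_terminals
        else:
            terminals.append(name)
    return non_terminals, terminals

non_terminals, terminals = classify(TREE)
print(non_terminals)  # ['Animal', 'Cat', 'Dog']
print(terminals)
```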

The XML tags of the text format grammar follow block scoping rules similar to those of HTML tags. That is, each opening tag has a corresponding closing tag.

XML tag syntax                                 Contents
<sometag NAME="some_name" VAL="some_value">    Start of the "sometag" tag scope, which includes the name and value information.
</sometag>                                     End of the "sometag" scope.
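
As a small illustration of tag scope, the sketch below parses the placeholder "sometag" element with Python's standard xml.etree.ElementTree module and shows that an unclosed tag is rejected. The element and attribute names are the placeholders from the table above, not real grammar tags, and the text inside the scope is made up for the example.

```python
import xml.etree.ElementTree as ET

# The opening tag starts the scope; the closing tag ends it.
snippet = '<sometag NAME="some_name" VAL="some_value">text inside the scope</sometag>'
element = ET.fromstring(snippet)
print(element.tag, element.attrib["NAME"], element.attrib["VAL"], element.text)

# Without the closing tag the scope never ends, so the XML is not well-formed.
try:
    ET.fromstring('<sometag NAME="some_name" VAL="some_value">unclosed')
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)  # False
```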
