Designing grammar rules (Microsoft Speech Platform)

Microsoft Speech Platform SDK 11

Designing Grammar Rules

Speech applications often use context-free grammars (CFGs) to parse the recognizer output and, in some instances, to act as the recognizer's language model. A speech recognition engine uses a CFG to constrain the user's words to those that it will recognize. If the CFG is augmented with semantic information (property names and property values, as explained below), a SAPI component converts the recognized word string into a meaning representation made up of name/value pairs. The application then uses the meaning representation to control its part of the conversation with the user.

The following section covers:

Semantic properties or tags

For example, the phrase "Please schedule a meeting with Amy Anderson," could be annotated as follows:

Phrase element            Grammar element           Contents
-------------------------------------------------------------------------
"schedule a meeting"      "request: meeting"        // attribute and value
"with"                    "participants:"           // only attribute
"Amy Anderson"            "<e-mail alias>"          // value type

Defining the different grammar element components could result in the following:

Please schedule a meeting with Amy Anderson.
         |         |       |       |
         |         |       |       |
         |         |       |       |
     request: meeting      |       |
                           |       |
                   participants: AmyAnd

The example sentence "Please schedule a meeting with Amy Anderson," can be recognized with the following SAPI 5 grammar rule:

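<RULE NAME="ScheduleMeeting" TOPLEVEL="ACTIVE"><!-- the rule name is illustrative; a rule needs a NAME or an ID -->
    <O>please</O><!-- optional word, so the sentence matches with or without it -->
    <P PROPNAME="request" VALSTR="meeting">schedule a meeting</P>
    <P>with</P>
    <L PROPNAME="participants"><!-- VALSTR carries a string value; VAL is for numeric values -->
        <P VALSTR="AmyAnd">Amy Anderson</P>
        <P VALSTR="tbremer">Ted Bremer</P>
        <P VALSTR="fralee">Frank Lee</P>
        <P VALSTR="crandall">Cynthia Randall</P>
        <P VALSTR="swhite">Suki White</P>
        <P VALSTR="kyoshida">Kim Yoshida</P>
    </L>
</RULE>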

The result of saying the example sentence "Please schedule a meeting with Amy Anderson," would be as follows:

request:meeting

participants:AmyAnd
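
The conversion from a recognized word string to name/value pairs can be pictured with a small stand-alone C++ sketch. It does not call SAPI: Interpret, Properties, and the shortened alias table are hypothetical names introduced only for illustration, and the matching is a plain string comparison rather than real grammar parsing.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the name/value pairs an application receives.
typedef std::vector<std::pair<std::string, std::string> > Properties;

// Illustrative only: mimics the rule above with string matching, no SAPI involved.
// Returns true and fills `out` when the utterance fits
// "[Please] schedule a meeting with <name>"; returns false otherwise.
bool Interpret(const std::string& utterance, Properties& out) {
    static const std::string kOptional = "Please ";
    static const std::string kStatic = "schedule a meeting with ";

    std::map<std::string, std::string> aliases;  // spoken name -> e-mail alias
    aliases["Amy Anderson"] = "AmyAnd";
    aliases["Ted Bremer"] = "tbremer";
    aliases["Frank Lee"] = "fralee";

    std::string text = utterance;
    if (text.compare(0, kOptional.size(), kOptional) == 0) {
        text.erase(0, kOptional.size());  // the polite prefix carries no property
    }
    if (text.compare(0, kStatic.size(), kStatic) != 0) {
        return false;  // static part of the phrase did not match
    }
    assert(text.size() >= kStatic.size());  // guaranteed by the match above
    std::map<std::string, std::string>::const_iterator it =
        aliases.find(text.substr(kStatic.size()));
    if (it == aliases.end()) {
        return false;  // name is not in the list
    }
    out.clear();
    out.push_back(std::make_pair(std::string("request"), std::string("meeting")));
    out.push_back(std::make_pair(std::string("participants"), it->second));
    return true;
}
```

Calling Interpret with "Please schedule a meeting with Amy Anderson" fills the output with the two pairs listed above, request/meeting and participants/AmyAnd.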

Separation of dynamic and static content

Applications should separate dynamic rule content from static rule content to implement good grammar design and to improve initial SAPI grammar compiler performance. For example, using the above grammar that uses a list of names, the application could create a separate rule (isolated in its own grammar) that contained only the names. The list of names, based on an address book or past user data, can be updated at run time. The static grammar would then contain a rule reference (e.g., RULEREF) to the dynamic content. When the application starts up, it can quickly load the static content, without loading the SAPI grammar compiler, to prevent delay in the startup sequence. Then, the application could load the dynamic content, which requires SAPI to initialize the backend grammar compiler.

Use dynamic rules for language flexibility

Suppose an application needs to support a phrase such as "send new e-mail to NAME." The phrase "send new e-mail to" is static, and known by the application at design time, well before run time. The application could use the following static XML grammar to support these phrases.

<GRAMMAR LANGID="409"><!-- american english grammar -->
   <RULE NAME="E-MAIL" TOPLEVEL="INACTIVE"><!-- inactive by default, to prevent premature recognitions -->
      <PHRASE>send new e-mail to</PHRASE>
      <RULEREF NAME="ADDRESS_BOOK" PROPNAME="NAME"/><!-- add NAME semantic property tag for easy information retrieval -->
   </RULE>
   <RULE NAME="ADDRESS_BOOK" DYNAMIC="TRUE">
      <PHRASE>placeholder</PHRASE><!-- we'll stick placeholder text here that we'll replace immediately at runtime -->
   </RULE>
</GRAMMAR>

The source code that manipulates the dynamic rule "ADDRESS_BOOK" follows:

    HRESULT hr = S_OK;

    // create a new grammar object
    hr = cpRecoContext->CreateGrammar(GRAM_ID, &cpRecoGrammar);
    // Check hr

    // deactivate the grammar to prevent premature recognitions to an "under-construction" grammar
    hr = cpRecoGrammar->SetGrammarState(SPGS_DISABLED);
    // Check hr

    // load the email grammar dynamically, so changes can be made at runtime
    hr = cpRecoGrammar->LoadCmdFromFile(L"email.xml", SPLO_DYNAMIC);
    // Check hr

    SPSTATEHANDLE hRule;

    // first retrieve the dynamic rule ADDRESS_BOOK
    hr = cpRecoGrammar->GetRule(L"ADDRESS_BOOK", NULL, SPRAF_Dynamic, FALSE, &hRule);
    // Check hr

    // clear the placeholder text, and everything else in the dynamic ADDRESS_BOOK rule
    hr = cpRecoGrammar->ClearRule(hRule);
    // Check hr

    // add the real address book (e.g. "Frank Lee", "self", "SAPI beta", etc.).
    // Note that ISpRecoGrammar inherits from ISpGrammarBuilder,
    // so the application gets the grammar compiler and ::AddWordTransition for free!
    // The L" " separator splits multi-word entries such as "Frank Lee" into words;
    // with a NULL separator the whole string is treated as a single word.

    hr = cpRecoGrammar->AddWordTransition(hRule, NULL, L"Frank Lee", L" ", SPWT_LEXICAL, 1.0f, NULL);
    // Check hr
    hr = cpRecoGrammar->AddWordTransition(hRule, NULL, L"self", L" ", SPWT_LEXICAL, 1.0f, NULL);
    // Check hr
    hr = cpRecoGrammar->AddWordTransition(hRule, NULL, L"SAPI beta", L" ", SPWT_LEXICAL, 1.0f, NULL);
    // Check hr
    // ... add rest of address book

    // commit the grammar changes, which updates the grammar inside SAPI,
    //    and notifies the SR Engine about the rule change (i.e. "ADDRESS_BOOK")
    hr = cpRecoGrammar->Commit(NULL);
    // Check hr

    // activate the grammar since "construction" is finished,
    //    and ready for receiving recognitions
    hr = cpRecoGrammar->SetGrammarState(SPGS_ENABLED);
    // Check hr

Retrieving semantic tags or properties from recognition results

Note that the static XML grammar used a semantic property tag, NAME. The property enables the application to retrieve the dynamic phrase easily at run time. Whenever a recognition is received with the rule name "E-MAIL," search the property tree (see SPPHRASE.pProperties) for the property named "NAME." Then call ISpRecoResult::GetText with that property's ulFirstElement and ulCountOfElements values (both members of SPPHRASEPROPERTY) to retrieve the exact text that the user spoke into the dynamic rule. For example, if the user says "send new e-mail to Frank Lee," the application retrieves "Frank Lee"; if the user says "send new e-mail to self," it retrieves "self."

    // activate the e-mail rule to begin receiving recognitions
    hr = cpRecoGrammar->SetRuleState(L"E-MAIL", NULL, SPRS_ACTIVE);
    // Check hr

    PWCHAR pwszEmailName = NULL;

    // default event interest is recognition, so wait for recognition event
    // NOTE: this could be placed in a loop to process multiple recognitions
    hr = cpRecoContext->WaitForNotifyEvent(MY_REASONABLE_TIMEOUT);
    // Check hr

    // event notification fired
    if (S_OK == hr) {

        CSpEvent spEvent;
        // if event retrieved and it is a recognition
        if (S_OK == spEvent.GetFrom(cpRecoContext) && SPEI_RECOGNITION == spEvent.eEventId) {

            // get the recognition result
            CComPtr<ISpRecoResult> cpRecoResult = spEvent.RecoResult();

            if (cpRecoResult) {
                SPPHRASE* pPhrase = NULL;

                // get the phrase object from the recognition result
                hr = cpRecoResult->GetPhrase(&pPhrase);
                if (SUCCEEDED(hr) && pPhrase) {

                    // if "E-MAIL" rule was recognized ...
                    if (pPhrase->Rule.pszName && 0 == wcscmp(L"E-MAIL", pPhrase->Rule.pszName)) {

                        // ... ensure that a first property exists and is "NAME"
                        const SPPHRASEPROPERTY* pNameProp = pPhrase->pProperties;
                        if (pNameProp && pNameProp->pszName && 0 == wcscmp(L"NAME", pNameProp->pszName)) {

                            // store the user's spoken "send-to" name
                            //    in a variable for later processing
                            // (GetText belongs to the result object, not to the SPPHRASE structure)
                            hr = cpRecoResult->GetText(pNameProp->ulFirstElement,
                                                       pNameProp->ulCountOfElements,
                                                       FALSE,
                                                       &pwszEmailName,
                                                       NULL);
                            // Check hr; free pwszEmailName with ::CoTaskMemFree when finished with it

                        }
                    }
                    // free the phrase object
                    ::CoTaskMemFree(pPhrase);

                }
            }
        }
    }
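
The listing above checks only the first property. When a grammar carries several properties, the tree reachable from SPPHRASE.pProperties can be searched by following the pFirstChild and pNextSibling links. The following is a minimal sketch of such a search. To compile without the SAPI headers it uses a hypothetical stand-in structure, PhraseProperty, that mirrors only the SPPHRASEPROPERTY members needed here; FindProperty is an illustrative helper, not a SAPI function.

```cpp
#include <cassert>
#include <cwchar>

// Hypothetical stand-in that mirrors a subset of SPPHRASEPROPERTY.
struct PhraseProperty {
    const wchar_t* pszName;              // property name, may be NULL
    unsigned long ulFirstElement;        // index of the first spoken element
    unsigned long ulCountOfElements;     // number of spoken elements covered
    const PhraseProperty* pNextSibling;  // next property at the same level
    const PhraseProperty* pFirstChild;   // first property one level down
};

// Depth-first search for the first property called `name`.
// Returns nullptr when no property in the tree has that name.
const PhraseProperty* FindProperty(const PhraseProperty* node, const wchar_t* name) {
    assert(name != nullptr);
    for (; node != nullptr; node = node->pNextSibling) {
        if (node->pszName != nullptr && std::wcscmp(node->pszName, name) == 0) {
            return node;
        }
        // Descend into the children before moving on to the next sibling.
        const PhraseProperty* hit = FindProperty(node->pFirstChild, name);
        if (hit != nullptr) {
            return hit;
        }
    }
    return nullptr;
}
```

With the real structures, the same traversal would start at pPhrase->pProperties, and the ulFirstElement and ulCountOfElements members of the property it returns would be passed to ISpRecoResult::GetText.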

Using semantic properties, hypotheses, and "property pushing"

SAPI supports a feature called "semantic property pushing," which enables applications to detect the semantic property structure more accurately at recognition time. SAPI performs property pushing at compile time: the compiler moves each semantic property forward to the first terminal node in the rule at which the phrase is no longer ambiguous.

For example, the phrases "a b c d" and "a b e f g" share the prefix "a b". The compiler automatically splits them into three separate phrases, "a b", "c d", and "e f g", where the first phrase is the prefix common to both recognizable phrases.

The purpose of this feature is to enable applications that place properties on the phrases to detect which branch is being hypothesized as soon as the first unambiguous (non-common) portion of the phrase is spoken. When the user speaks "a b" it is not clear if the user will say "a b c d" or "a b e f g". If the user then says "e", the application can obviously eliminate the "a b c d" option. If the grammar author attached properties to the end of both phrases, the semantic property would be returned as soon as the user spoke the first unambiguous portion of the text (e.g., "c" or "e").
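
The prefix reasoning can be illustrated with a short stand-alone C++ sketch. It is not the SAPI compiler's algorithm, only a model of the idea: it computes the shared prefix of two phrases and reports which phrase, if any, is the only one still consistent with the words heard so far. All names in it (Words, SplitWords, CommonPrefixLength, IsPrefixOf, UnambiguousBranch) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::string> Words;

// Splits a phrase on whitespace.
Words SplitWords(const std::string& phrase) {
    std::istringstream in(phrase);
    Words words;
    std::string word;
    while (in >> word) words.push_back(word);
    return words;
}

// Number of leading words that the two phrases share.
std::size_t CommonPrefixLength(const Words& a, const Words& b) {
    std::size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) ++n;
    assert(n <= a.size() && n <= b.size());
    return n;
}

// True when `heard` matches the beginning of `phrase`, word for word.
bool IsPrefixOf(const Words& heard, const Words& phrase) {
    return heard.size() <= phrase.size() &&
           CommonPrefixLength(heard, phrase) == heard.size();
}

// Index of the only phrase still consistent with the words heard so far,
// or -1 while zero or several phrases remain possible.
int UnambiguousBranch(const std::vector<Words>& phrases, const Words& heard) {
    int match = -1;
    for (std::size_t i = 0; i < phrases.size(); ++i) {
        if (!IsPrefixOf(heard, phrases[i])) continue;
        if (match != -1) return -1;  // a second candidate: still ambiguous
        match = static_cast<int>(i);
    }
    return match;
}
```

For the two phrases in the text, CommonPrefixLength reports two shared words, UnambiguousBranch stays at -1 after "a b", and it identifies the second phrase as soon as "e" follows. That is the point at which a pushed property would become available.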

Note that the compiler reports an error ("Ambiguous Semantic Property") if different properties are pushed to the same node, which happens when two phrases that carry different properties are identical. For example, the following grammar fails with "ambiguous semantic property" because two of its phrases are the same and the compiler cannot determine which property to assign to that phrase.

    <RULE NAME="AmbiguousProperty" TOPLEVEL="ACTIVE">
        <L>
            <P PROPID="42">this is a test</P>
            <P PROPID="3">different sentence</P>
            <P PROPID="75">this is a test</P>
        </L>
    </RULE>

The first and third phrases are the same. Note that these results are by design and are meant to prevent creating grammars that have multiple phrases with conflicting semantic properties.
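
An application that builds such lists at run time could check for this condition before handing the grammar to the compiler. The following sketch is a simplified, hypothetical pre-check rather than the compiler's own test: it only flags phrases whose text is identical but whose property IDs differ. Alternative and FindAmbiguousPhrases are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// One alternative in a list: the phrase text and its PROPID.
typedef std::pair<std::string, int> Alternative;

// Returns each phrase that occurs more than once with different property IDs.
std::set<std::string> FindAmbiguousPhrases(const std::vector<Alternative>& alternatives) {
    std::map<std::string, int> firstIdSeen;  // phrase text -> first PROPID encountered
    std::set<std::string> ambiguous;
    for (std::size_t i = 0; i < alternatives.size(); ++i) {
        const Alternative& alt = alternatives[i];
        assert(!alt.first.empty());  // an alternative needs some text
        std::map<std::string, int>::iterator it = firstIdSeen.find(alt.first);
        if (it == firstIdSeen.end()) {
            firstIdSeen[alt.first] = alt.second;
        } else if (it->second != alt.second) {
            ambiguous.insert(alt.first);  // same words, conflicting property
        }
    }
    return ambiguous;
}
```

Given the three alternatives from the grammar above (IDs 42, 3, and 75), the function returns a set containing only "this is a test".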

There are a number of scenarios where property pushing can be helpful for an application.

One possibility is an application that wants to detect failures more intelligently. When a false recognition occurs, the application can detect the last semantic property returned and display an error message relevant to the attempted voice command.

Another scenario might be that of high-performance applications that wish to increase responsiveness of the user interface when a long voice command is spoken. The application can wait for the first unambiguous semantic property to be received (using hypothesis) and then fire the response action without waiting for the voice command to complete. This has the added benefit of allowing users to speak partial voice commands (e.g., instead of "go to website w w w Microsoft com" the user can say the slightly shorter "go to website w w w Microsoft"). The drawback is that the application must guard against performing critical, unrecoverable actions before completing the phrase (e.g., "delete hard drive" might fire after only "delete" if there are no other "delete" commands). Careful application design should enable the application to appear quicker and easier to use, without sacrificing robustness. By performing user studies, the application designer can decide which commands are capable of short circuiting and which are more critical.
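
One way to guard against premature critical actions is to keep an explicit list of commands that must never fire from a hypothesis. The following is a minimal sketch of that policy under stated assumptions: ResultKind, ShouldActNow, and the set of critical commands are hypothetical application-side names, not part of SAPI, and the mapping from a pushed property to a command string is left to the application.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical result kinds: a hypothesis arrives while the user is still speaking.
enum ResultKind { kHypothesis, kFinalRecognition };

// Decides whether the application may act on `command` now.
bool ShouldActNow(const std::string& command, ResultKind kind,
                  const std::set<std::string>& criticalCommands) {
    assert(!command.empty());
    if (kind == kFinalRecognition) return true;   // complete phrase: always allowed
    return criticalCommands.count(command) == 0;  // early firing only for safe commands
}
```

With "delete hard drive" in the critical set, a hypothesis for that command is held back until the final recognition arrives, while a hypothesis for "go to website" can be acted on immediately.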
