SISR

There are many ways a user can respond to the prompt “What would you like to drink?” While some of us might want a triple martini or an intergalactic gargle blaster, let’s suppose that the user only wants a Coke. The developer specifies a grammar containing the words and phrases Coke, Coca-Cola, and that fizzy brown drink. The speech recognition system compares the user utterance with each word and phrase in the grammar and chooses the one that most closely matches.

How does the speech application know that Coke, Coca-Cola, and that fizzy brown drink all mean the same drink? One approach is to have the speech application look up these words in a translation table. A better approach is to embed the translation of each word within the grammar, so that when the user says either Coke or that fizzy brown drink, the speech recognition engine translates the words to Coca-Cola.
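
As a rough illustration of embedding the translation in the grammar, the following JavaScript sketch models each grammar alternative together with its interpretation. The names drinkGrammar and interpretDrink are hypothetical, and exact string matching stands in for the recognizer, which in practice picks the closest match rather than an identical one.

```javascript
// Hypothetical model of a grammar: each alternative carries its own interpretation.
const drinkGrammar = [
  { phrase: "coke", interpretation: "Coca-Cola" },
  { phrase: "coca-cola", interpretation: "Coca-Cola" },
  { phrase: "that fizzy brown drink", interpretation: "Coca-Cola" },
];

// Return the interpretation attached to the matched phrase,
// or null when the utterance is not covered by the grammar.
function interpretDrink(utterance) {
  const spoken = utterance.trim().toLowerCase();
  for (const alternative of drinkGrammar) {
    if (alternative.phrase === spoken) {
      return alternative.interpretation;
    }
  }
  return null;
}

console.log(interpretDrink("that fizzy brown drink")); // prints "Coca-Cola"
console.log(interpretDrink("lemonade"));               // prints null
```

Because the interpretation travels with the phrase, the application never needs a separate translation table; it only sees the preferred word.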

Just as the World Wide Web Consortium (W3C) made the Speech Recognition Grammar Specification (SRGS) the standard for defining the grammars used by a speech engine, the W3C has specified Semantic Interpretation for Speech Recognition (SISR) as the standard for developers to interpret the words recognized by the speech engine.

SISR uses the ECMAScript Compact Profile, a strict subset of ECMAScript designed to meet the needs of resource-constrained environments. The profile restricts ECMAScript features that require large amounts of system memory and processing power. Thus, it fits snugly within the grammar rules, where it extracts semantic information from the words recognized by the speech engine.

In addition to translating word aliases to the preferred word, as in the Coca-Cola example above, developers can also use SISR to perform the following tasks:
• Encode text strings into codes used by the speech application. For example, the phrases Coke, Coca-Cola, or that fizzy brown drink are all translated to code 15, while triple martini and intergalactic gargle blaster are translated, respectively, to codes 23 and 415. If the semantic interpretation instructions fail to encode the word recognized by the speech engine, then the dialogue manager prompts the user for a better answer.
• Produce simple word and phrase translations into another language. For example, translate English names of cities to their Italian equivalents, such as Florence to Firenze and Milan to Milano. Word-for-word translations do not constitute a perfect translation, but they can provide the approximate meaning of the text being translated. These simple translations could also form the basis for more advanced machine translation of one language to another, in which grammars constrain the source language and SISR instructions determine the corresponding words and phrases in the target language.
• Create an ECMAScript structure for easy processing by the speech application. For example, travel from New York to Seattle could be translated to the ECMAScript object
{travel:
{departure: "New York",
destination: "Seattle"}}
Using this structure, the back-end application quickly and easily determines the departure and destination cities when querying a database of airline flights.
• Semantically validate the user’s input. While the grammar constrains user input to predefined words and phrases, grammars cannot easily implement constraints in which one value depends upon other values. For example, if a user accidentally uttered February 30, semantic interpretation instructions would detect that the day is not valid for that month.
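
Three of the tasks listed above (encoding, structure building, and validation) can be sketched in ordinary JavaScript. This is only an illustration under stated assumptions: the function and table names are hypothetical, the drink codes are the ones used in the example above, and the code runs as plain application-side JavaScript rather than as semantic interpretation instructions inside a grammar.

```javascript
// Hypothetical code table built from the drink example in the text.
const DRINK_CODES = {
  "coke": 15,
  "coca-cola": 15,
  "that fizzy brown drink": 15,
  "triple martini": 23,
  "intergalactic gargle blaster": 415,
};

// Encode a recognized phrase. A null result stands for the case in which
// the dialogue manager would prompt the user for a better answer.
function encodeDrink(phrase) {
  const key = phrase.trim().toLowerCase();
  const known = Object.prototype.hasOwnProperty.call(DRINK_CODES, key);
  return known ? DRINK_CODES[key] : null;
}

// Build the nested travel structure described in the text.
function buildTravel(departure, destination) {
  return { travel: { departure: departure, destination: destination } };
}

// Assumption: no year is spoken, so February is allowed 29 days.
const DAYS_IN_MONTH = {
  January: 31, February: 29, March: 31, April: 30,
  May: 31, June: 30, July: 31, August: 31,
  September: 30, October: 31, November: 30, December: 31,
};

// A day is valid only if it fits the month it was spoken with,
// a dependency between two values that word lists alone cannot express.
function isValidDate(month, day) {
  if (!Object.prototype.hasOwnProperty.call(DAYS_IN_MONTH, month)) {
    return false;
  }
  return Number.isInteger(day) && day >= 1 && day <= DAYS_IN_MONTH[month];
}

console.log(encodeDrink("Triple martini"));                      // prints 23
console.log(buildTravel("New York", "Seattle").travel.departure); // prints "New York"
console.log(isValidDate("February", 30));                        // prints false
```

In each case the speech application receives a value it can use directly: a code, a structure with named fields, or a yes/no verdict on the input.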

Most vendors have adopted VoiceXML 2.0, making it possible to port applications between competing speech platforms. Now that SISR is a W3C standard, vendors should support SISR in addition to their own proprietary languages. The VoiceXML Forum plans to update its VoiceXML certification program to include testing SISR. Make sure that your speech vendor supports SISR so that your grammars and applications can be ported between platforms more easily.

Jim Larson is an independent consultant and VoiceXML trainer. He is the author of The VXMLGuide [www.vxmlguide.com]. He can be reached at jim@larson-tech.com.