Microsoft Joins the VoiceXML Express

Issue Date: STM-Newsblast - 04/05/2006, Posted On: 4/5/2006

The First Generation: The “Little Engine That Could”

Back in 1999, speech language experts at IBM, Motorola, AT&T, and Lucent were independently developing XML languages for speech applications when they realized that together they could develop a single language that would be portable across their respective platforms. They formed the VoiceXML Forum, which gave birth to VoiceXML 1.0, a single language for specifying dialog control, grammars for speech recognition, and rendering of text as speech by a speech synthesizer.

The Second Generation: Working Trains

In 2000, the W3C Voice Browser Working Group took over the specification of VoiceXML. The group partitioned the language into five separate languages, collectively known as the W3C Speech Interface Framework.[1] To date, the W3C has standardized the following five languages:[2]

VoiceXML 2.0—a dialog language controlling the conversation between a computer and human
Speech Synthesis Markup Language 1.0—specifies how to render text as human-like speech
Speech Recognition Grammar Specification 1.0—specifies the words and phrases that the speech recognition engine can hear
Semantic Interpretation for Speech Recognition 1.0—specifies how to extract and translate recognized words into meaningful tokens
CCXML 1.0—manages incoming and outgoing calls
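
To make the division of labor concrete, here is a minimal sketch of how three of these languages can appear together in one document: VoiceXML elements drive the dialog, an inline speech recognition grammar lists what the caller may say, semantic interpretation tags turn the recognized words into tokens, and a speech synthesis element shapes the prompt. The form id, field name, and drink vocabulary are hypothetical and chosen only for illustration; the Python wrapper simply checks that the markup is well-formed and lists its elements. CCXML is left out because call control lives in a separate document.

```python
import xml.etree.ElementTree as ET

# Hypothetical VoiceXML 2.0 document; ids, names, and vocabulary are made up.
DOCUMENT = """
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="order">
    <field name="drink">
      <!-- Speech synthesis markup inside the prompt -->
      <prompt>Would you like <emphasis>coffee</emphasis> or tea?</prompt>
      <!-- Inline speech recognition grammar with semantic interpretation tags -->
      <grammar version="1.0" root="drink" tag-format="semantics/1.0">
        <rule id="drink">
          <one-of>
            <item>coffee <tag>out = "coffee";</tag></item>
            <item>tea <tag>out = "tea";</tag></item>
          </one-of>
        </rule>
      </grammar>
      <!-- Dialog control: runs once the field has a value -->
      <filled>
        <prompt>You chose <value expr="drink"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
"""


def local_names(xml_text):
    """Parse the markup and return element names in document order, without namespaces."""
    root = ET.fromstring(xml_text.strip())
    return [element.tag.split("}", 1)[-1] for element in root.iter()]


names = local_names(DOCUMENT)
print(names)
```

A parse like this only confirms that the document is well-formed XML. Conformance to the W3C specifications is a separate question that a certified VoiceXML platform or a schema validator would have to answer.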

The growth in use of these languages has far exceeded anyone’s expectations. With Microsoft’s announcement that its speech server will support VoiceXML, every major speech technology vendor now supports the W3C Speech Interface Framework languages. Each day, hundreds of thousands of phone calls are processed by applications written in these languages.

The VoiceXML Forum[3] drives support for the W3C Speech Interface Framework with the following activities:

VoiceXML Platform Certification Program—validates conformance of VoiceXML processors to the W3C specification
VoiceXML Developer Certification—enables developers to demonstrate their knowledge and skill in developing speech applications
VoiceXML Community—online e-zine, tutorials, webinars, and other community events

The W3C is extending the Speech Interface Framework. Enhancements are being made to two of the languages: (1) Speech Synthesis Markup Language 1.1, with new features primarily for Asian languages; and (2) VoiceXML 2.1, with several new features that greatly expand the use of VoiceXML 2.0. A new language, Pronunciation Lexicon Specification 1.0, will let both the speech recognition and speech synthesis engines share a specification of words and their pronunciations. The W3C is also working on the next generation of the language, VoiceXML 3.0.

The Third Generation: The VoiceXML Express

VoiceXML 3.0 will support State Chart XML[4]—a powerful language for controlling the flow of speech applications, plus many new capabilities, including improved prompt queue management for voice and video, external event processing, and improved resource management.
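
The idea behind state-chart flow control can be sketched without any SCXML syntax: the dialog is a set of named states, and events move the application from one state to the next. The table below is a plain Python illustration of that idea under assumed names. The states and events are hypothetical and are not taken from the SCXML working draft, which also covers features this sketch omits, such as nested and parallel states.

```python
# Hypothetical dialog states and events, for illustration only.
TRANSITIONS = {
    ("greeting", "caller_spoke"): "collect_order",
    ("collect_order", "recognized"): "confirm",
    ("collect_order", "no_match"): "collect_order",  # reprompt in the same state
    ("confirm", "yes"): "done",
    ("confirm", "no"): "collect_order",
}


def run(events, state="greeting"):
    """Feed events through the table; an event with no transition leaves the state unchanged."""
    trace = [state]
    for event in events:
        state = TRANSITIONS.get((state, event), state)
        trace.append(state)
    return trace


trace = run(["caller_spoke", "no_match", "recognized", "yes"])
print(trace)
# ['greeting', 'collect_order', 'collect_order', 'confirm', 'done']
```

Keeping the flow in one table, separate from the prompts and grammars, is the design point: the call flow can be reviewed or changed without touching the speech markup.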

Choosing and Using Your VoiceXML Engine

With several VoiceXML 2.1 implementations available, which is best for you? Criteria for evaluating alternative VoiceXML engines include:

Growth to VoiceXML 3.0—platform vendors should commit to support VoiceXML 3.0 when the W3C completes its specification.
Expandable platforms—platforms should grow as the number of calls handled by your application increases. Incrementally adding new ports and processing power conserves your initial investment.
Tools—development tools may improve programmer productivity. Reusable code in the form of grammars, subdialogs, and applications decreases the time to market. Testing and monitoring tools support ongoing tuning and refinement so that your applications reach their full potential.
Vendor stability—the vendor should have a firm financial standing so it won’t suddenly disappear.
Company infrastructure -- VoiceXML remains completely independent of the choice of application server. Whether J2EE, .NET, or any other web infrastructure, regardless of operating system, VoiceXML and the W3C Speech Interface Framework provide a standard, portable, web-based environment for developing and deploying speech and telephony applications.
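
The last criterion rests on the fact that a VoiceXML document is ordinary web content: any server that can answer an HTTP request can generate one. The sketch below uses Python's standard library in place of a J2EE or .NET application server to make that point. The function and class names, the prompt text, and the port are assumptions made for illustration; the media type shown is the one registered for VoiceXML documents.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from xml.sax.saxutils import escape


def render_vxml(prompt_text):
    """Build a one-prompt VoiceXML 2.0 document (hypothetical helper)."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">\n'
        "  <form>\n"
        "    <block><prompt>" + escape(prompt_text) + "</prompt></block>\n"
        "  </form>\n"
        "</vxml>\n"
    )


class VxmlHandler(BaseHTTPRequestHandler):
    """Answer every GET with a generated VoiceXML page."""

    def do_GET(self):
        body = render_vxml("Thank you for calling.").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/voicexml+xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run until interrupted; point a VoiceXML platform at http://<host>:<port>/."""
    HTTPServer(("", port), VxmlHandler).serve_forever()


doc = render_vxml("Billing & support")
print(doc)
```

Calling serve() starts the server. The voice platform fetches the page over HTTP exactly as a browser would fetch HTML, which is why the choice of server technology does not matter to the VoiceXML engine.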

With Microsoft’s announcement that it will support VoiceXML, the W3C Speech Interface Framework languages gain additional traction in the marketplace. A vibrant community of platform vendors, hosting specialists, tool developers, and application developers continues its strong growth. VoiceXML 3.0 promises new functionality and features, enabling applications not previously possible.

The “little engine that could” has grown up into a powerful and universal engine that is used to support telephony speech applications worldwide.

James A. Larson is manager of advanced human input/output at Intel Corporation and is the author of the home study guide and reference “The VXMLGuide” (http://www.vxmlguide.com/). His Web site is http://www.larson-tech.com/.

[1] http://www.w3.org/TR/2000/WD-voice-intro-20001204/

[2] http://www.w3.org/Voice/

[3] http://www.voicexmlforum.org/

[4] http://www.w3.org/TR/2006/WD-scxml-20060124/