March/April 2003

The W3C Speech Interface Framework

By Dr. James A. Larson

Figure 1

In the past three years, the World Wide Web Consortium Voice Browser Working Group has produced several reports that define languages in the W3C Speech Interface Framework (Figure 1). Developers use the W3C Speech Interface Framework languages to create speech applications. A simple speech application fragment using four of these languages is illustrated in Figure 2.

Figure 2. Simple speech application fragment specified using W3C Speech Interface Framework languages

The Voice Extensible Markup Language, VoiceXML 2.0 (illustrated in black in Figure 2), is a dialog language that controls the exchange of information between the user and the application. VoiceXML 2.0 is based on VoiceXML 1.0, which was contributed to the W3C by members of the VoiceXML Forum. Figure 2 illustrates a fragment of a speech application containing a form with a single field in which the user is asked to speak the name of a destination city.
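As a rough illustration of the form-and-field structure described above, the Python sketch below embeds a small VoiceXML 2.0 fragment and parses it with the standard library. It is not a reproduction of Figure 2: the form id "trip", the field name "city", the prompt wording and the grammar file name "city.grxml" are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical fragment: the form id, the field name "city", the prompt
# wording and the grammar file "city.grxml" are assumptions for illustration.
VXML_FRAGMENT = """\
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="trip">
    <field name="city">
      <prompt>Where would you like to travel?</prompt>
      <grammar src="city.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>
"""

def field_names(vxml_text):
    """Return the name of every field declared in a VoiceXML document."""
    root = ET.fromstring(vxml_text)
    return [field.get("name") for field in root.iter("{%s}field" % VXML_NS)]

print(field_names(VXML_FRAGMENT))
```

The form holds one field, the field holds the prompt played to the user and a reference to the grammar that constrains the reply. A voice browser, not Python, would normally interpret this markup; the parsing step here only confirms that the fragment is well formed.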

The Speech Synthesis Markup Language, SSML (illustrated in green in Figure 2), describes how text is presented as audio to the user. Developers use SSML to specify speech formatting, including word emphasis, prosody, speed, pitch and other voice characteristics.

The Speech Recognition Grammar Specification, SRGS (illustrated in red in Figure 2), specifies the words and phrases that a user may speak in response to a prompt. Figure 2 describes a simple grammar consisting of the names for New York and Washington. Grammars specify the words and phrases that the speech recognition engine recognizes at each point in the dialog.

The Semantic Interpretation Language (illustrated in blue in Figure 2) extracts words and phrases that have been recognized by the speech recognition engine and translates them to semantically meaningful tokens for processing by the speech application. For example, in Figure 2, “Big Apple” is translated to “New York” and “The Capital” is translated to “Washington”.
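The relationship between a grammar and its semantic interpretation can be sketched in a few lines of Python. This is only an analogy for what the SRGS and Semantic Interpretation markup express, not their actual syntax: the names GRAMMAR and interpret are hypothetical, and the phrase list simply mirrors the examples in the text.

```python
# Hypothetical grammar: each phrase the recognizer may accept is paired with
# the semantic token the application should receive. GRAMMAR and interpret
# are illustrative names, not part of any W3C specification.
GRAMMAR = {
    "new york": "New York",
    "big apple": "New York",
    "washington": "Washington",
    "the capital": "Washington",
}

def interpret(utterance):
    """Return the semantic token for a recognized phrase, or None if the
    phrase is not covered by the grammar."""
    key = " ".join(utterance.lower().split())  # normalize case and spacing
    return GRAMMAR.get(key)

print(interpret("Big Apple"))
print(interpret("The Capital"))
print(interpret("Boston"))
```

The keys play the role of the grammar (what may be said) and the values play the role of the semantic interpretation (what the application sees). A phrase outside the grammar, such as "Boston" here, yields no result, much as a recognizer would report no match.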

The Call Control Extensible Markup Language, CCXML (not illustrated in Figure 2), is used to manage incoming and outgoing telephone and conference calls.

The Extensible MultiModal Annotation markup language, EMMA, is being developed by the W3C Multimodal Interaction Working Group, which took over development of an earlier language from the Voice Browser Working Group. EMMA is used to annotate the output of any recognition engine (including speech recognition engines) with the sources of data, confidence levels, timing and other information required by the application logic.
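To make the idea of annotated recognizer output concrete, here is a minimal Python sketch of the kind of information the paragraph above lists. It does not use EMMA syntax. The class AnnotatedResult, its field names, the 0-to-1 confidence scale and the 0.6 threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedResult:
    """Hypothetical container for one recognition result and its annotations."""
    value: str         # interpreted value, e.g. "New York"
    source: str        # which kind of recognizer produced it, e.g. "speech"
    confidence: float  # recognizer confidence, assumed to lie in [0.0, 1.0]
    start_ms: int      # when the input began, in milliseconds
    end_ms: int        # when the input ended, in milliseconds

    def duration_ms(self):
        """Length of the input in milliseconds."""
        return self.end_ms - self.start_ms

def needs_confirmation(result, threshold=0.6):
    """Application logic: confirm with the user when confidence is low."""
    return result.confidence < threshold

result = AnnotatedResult("New York", "speech", 0.45, 1200, 2050)
print(needs_confirmation(result))
print(result.duration_ms())
```

Because the annotations travel with the result, the application logic can decide how to react (for example, by asking the user to confirm) without knowing which recognition engine produced the value.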

The W3C recommendation track sidebar summarizes the W3C process for evolving each language specification and achieving consensus with interested parties both inside and outside of the W3C. The VoiceXML 2.0, SRGS and SSML languages should achieve recommendation status this year.

The Voice Browser Working Group has begun to specify the requirements for the follow-on to VoiceXML 2.0. These requirements are being collected from deferred change requests submitted for VoiceXML 2.0, from other groups within and outside of the W3C, and from other interested parties. We have received proposals from IBM, Motorola and Opera (XHTML + Voice Modules), and from the SALT Forum (Speech Application Language Tags). Anyone may submit suggestions and comments to the public mailing list at www-voice@w3.org. To subscribe to this public mailing list, to review the Working Group’s charter, or to review other W3C Speech Interface Framework documents, see http://www.w3.org/Voice/.

Work on the languages of the W3C Speech Interface Framework is proceeding systematically through the W3C recommendation track. After reaching full recommendation, VoiceXML 2.0 will evolve into a new language in response to developer needs.

The W3C Recommendation Track
The W3C recommendation track is the process that the W3C follows to build consensus around a Web technology, both within the W3C and in the Web community as a whole. The labels for each level of maturity and consensus are described below, from most to least mature.

W3C Recommendation—Appropriate for widespread deployment and promotion of the W3C’s mission.

Proposed Recommendation—Reviewed and approved by the W3C Advisory Committee (all W3C members).

Candidate Recommendation—Tests for required parts of the specification are created and conducted to verify that the specification can be implemented.

Last Call Working Draft—A special instance of a Working Draft that is considered by the Working Group to be complete. All suggestions for changes from interested parties are considered and resolved.

Working Draft—Work in progress and call for comments.

Requirements—Description of what the language should support.

The current status of the W3C Speech Interface Framework languages is shown in Figure 3.

Dr. Jim A. Larson is an adjunct professor at Portland State University and Oregon Health Sciences University. He can be reached at jim@larson-tech.com and his Web site is http://www.larson-tech.com