July/August 2003

State of Speech Standards

By Dr. James A. Larson

Speech standards include terminology, languages and protocols specified by committees of speech experts for widespread use in the speech industry. Speech standards have both advantages and disadvantages. Advantages include the following: developers can create applications using the standard languages that are portable across a variety of platforms; products from different vendors are able to interact with each other; and a community of experts evolves around the standard and is available to develop products and services based on the standard. On the other hand, some developers feel that standards may inhibit creativity and stall the introduction of superior technology. However, in the area of speech, vendors are enthusiastic about standards and frequently complain that standards are not developed fast enough.

The Big Picture

Several standards organizations have been active in the speech area, including the World Wide Web Consortium (W3C), the Internet Engineering Task Force (IETF) and the European Telecommunications Standards Institute (ETSI). Table 1 (see below) summarizes several important speech-related standard languages. For a summary of additional speech-related standards, see http://www.larson-tech.com/speech.htm.

To show how these standards relate, Figure 1 illustrates a conceptual framework for a distributed speech system consisting of:

• A user device such as a telephone or cell phone

• A document server storing the scripts and files required for the speech application

• An application server containing a voice browser which downloads and interprets documents

• A speech server containing various technology modules for recognizing and generating speech

• A telephone connection device containing the call control manager

There are many possible architectures that contain the components from Figure 1. The italicized acronyms inside each component refer to standard languages used by the component.

Figure 1: Conceptual framework for a distributed speech system

A user engages in a dialog via the user device by speaking into a microphone, pressing the keys on a 12-key keypad and listening to the device’s speaker. The dialog is controlled by a voice browser, which interprets dialog instructions written in VoiceXML 2.0 or another dialog scripting language. The voice browser downloads dialog scripts from a document server. Dialog scripts contain instructions for various speech technology processors residing on a speech server. A dialog script directs the voice browser to use the:

• BioAPI to access a speaker verification engine which verifies the user’s identity by analyzing the user’s voice features.

• Speech recognition system to convert the user’s speech into text. The speech recognition system uses a grammar, written in the Speech Recognition Grammar Specification (SRGS) format or some other grammar format, to define the words and phrases that it should listen for and recognize. Grammars may reside within the dialog script or in a separate library on the document server. Some grammars contain scripts for extracting and translating recognized words; these scripts are expressed using the Semantic Interpretation language. Some systems use the Distributed Speech Recognition (DSR) protocol, in which a module on the user device extracts speech features and transmits them to the speech server, where the rest of the speech recognition system converts the features to text.

• Touchtone recognition to convert touchtone sounds to digits. The touchtone recognition engine uses a grammar, written in SRGS or another grammar format, to define sequences of acceptable digits.

• Speech synthesis engine to convert text to human-like speech. The text contains Speech Synthesis Markup Language (SSML) tags that describe word pronunciations and voice characteristics, as well as the speaking rate and voice inflections.

• Audio system to download and play prerecorded audio to the user. Audio files reside on the document server and are downloaded to the audio system for presentation to the user via the user device’s speaker.
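As a rough illustration of how these pieces fit together in a single dialog script, the Python sketch below holds a small VoiceXML 2.0 document containing an inline SRGS grammar and a prompt marked up with SSML tags, then reads it back with the standard library XML parser. The form, the field name "answer", the prompt wording and the helper function summarize are invented for this example; a real voice browser would interpret the script rather than merely inspect it.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical dialog script: one field with an SSML-tagged prompt
# and an inline SRGS grammar that accepts "yes" or "no".
DIALOG = """<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="confirm">
    <field name="answer">
      <prompt>
        Your balance is <emphasis>forty</emphasis> dollars.
        <break time="300ms"/>
        Would you like to hear it again?
      </prompt>
      <grammar version="1.0" root="yesno" mode="voice">
        <rule id="yesno">
          <one-of>
            <item>yes</item>
            <item>no</item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <prompt>You said <value expr="answer"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
"""


def summarize(vxml_text):
    """Return the field name, the words the grammar accepts, and the
    SSML tags used directly inside the field's prompt."""
    ns = {"v": VXML_NS}
    root = ET.fromstring(vxml_text)
    field = root.find("v:form/v:field", ns)
    words = [item.text
             for item in field.findall("v:grammar/v:rule/v:one-of/v:item", ns)]
    prompt = field.find("v:prompt", ns)
    # Strip the "{namespace}" prefix that ElementTree adds to tag names.
    markup = sorted({child.tag.split("}")[1] for child in prompt})
    return {"field": field.get("name"),
            "grammar_words": words,
            "prompt_markup": markup}


if __name__ == "__main__":
    print(summarize(DIALOG))
```

The grammar tells the recognizer what to listen for, while the prompt markup tells the synthesizer how to speak, which is the division of labor described in the list above.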

A dialog script contains commands which may be spoken by the user. Many of these commands have been standardized in English, French, German, Italian and Spanish using the ETSI command vocabulary standard.

Telephones and cell phones may require a gateway for conversion between telephone communication protocols and internet protocols, such as HTTP, SIP and RTP. A call control manager uses CCXML scripts to control the creation, management and destruction of connections among conversational participants.

World Wide Web Consortium (W3C)

In March 2000, the VoiceXML Forum, founded by AT&T, IBM, Lucent and Motorola, submitted Version 1.0 of the Voice Extensible Markup Language (VoiceXML 1.0), a dialog language for writing speech applications, to the World Wide Web Consortium (W3C). The Voice Browser Working Group within the W3C extracted and refined the Speech Recognition Grammar Specification (SRGS) and the Speech Synthesis Markup Language (SSML) as separate specifications, and refined VoiceXML 1.0 into VoiceXML 2.0. The Voice Browser Working Group is also defining the Semantic Interpretation Language for extracting and translating words recognized by a speech recognition engine into a structure suitable for backend semantic processing of the user’s input.
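The idea behind semantic interpretation can be sketched in a few lines of Python. This is not the W3C Semantic Interpretation syntax, which attaches scripts to grammar rules; the table SEMANTIC_TAGS, the slot names and the function interpret are hypothetical and only show the kind of translation involved, from recognized words to a structure a backend can process.

```python
# Hypothetical mapping from recognized words to (slot, value) pairs.
SEMANTIC_TAGS = {
    "small": ("size", "S"),
    "medium": ("size", "M"),
    "large": ("size", "L"),
    "coke": ("drink", "cola"),
    "cola": ("drink", "cola"),
    "lemonade": ("drink", "lemonade"),
}


def interpret(recognized_text):
    """Translate recognized words into a slot/value structure,
    ignoring words that carry no meaning for the application."""
    result = {}
    for word in recognized_text.lower().split():
        if word in SEMANTIC_TAGS:
            slot, value = SEMANTIC_TAGS[word]
            result[slot] = value
    return result


if __name__ == "__main__":
    print(interpret("I would like a large Coke"))
```

Note that two different spoken words ("coke" and "cola") produce the same value, so the backend does not need to know every phrasing the grammar allows.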

Intellectual Property

Intellectual property issues can be a concern with every new technology area, including speech. Speech technologies have a rich history of intellectual development, so no one should be surprised to learn that there are several patents in the area of speech. Currently, the W3C Voice Browser Working Group is chartered as a Royalty-Free Working Group under the W3C’s Current Patent Practice. Most members of the W3C Voice Browser Working Group have, in effect, pledged a royalty-free license. At this time, only a few identified patents that may be essential to VoiceXML 2.0 are not royalty-free. (This is a dramatic change from a year ago, when many members reserved the right to ask for a licensing fee.) In accordance with the “Current Patent Practice,” a patent advisory group has been formed to resolve the remaining patent issue.

Submissions to the W3C

On Nov. 30, 2001, IBM, Motorola and Opera submitted XHTML plus Voice Modules (X+V), which partitions VoiceXML 2.0 into modules that can be embedded into host programming languages. On July 31, 2002, the SALT Forum (founded by Cisco, Comverse, Intel, Microsoft, Philips and SpeechWorks) submitted Speech Application Language Tags (SALT) to the W3C. The SALT specification is a collection of speech tags that can be embedded into a host programming language, such as HTML or Java. Both SALT and X+V are designed to be embedded within XHTML so developers can create user interfaces that support both visual and verbal components. The developers of both languages are members of the W3C and have years of experience in developing multimodal applications. The W3C will sort out features of SALT and X+V for possible inclusion within a follow-on language for VoiceXML 2.0.
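To give a feel for what embedding speech tags in a host page looks like, the Python sketch below holds a small XHTML page with a few SALT-style tags and separates the visual elements from the speech elements by namespace. The tag and attribute names (prompt, listen, grammar, bind, targetelement) and the SALT namespace URI are written from memory of the SALT 1.0 specification and should be checked against it; the page content, the file name cities.grxml and the function split_by_modality are invented for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Assumption: namespace URI as recalled from the SALT 1.0 specification.
SALT_NS = "http://www.saltforum.org/2002/SALT"

# Hypothetical page: a visual text box plus speech tags that fill it.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:salt="http://www.saltforum.org/2002/SALT">
  <body>
    <form id="travel">
      <input name="city" type="text"/>
    </form>
    <salt:prompt id="askCity">Which city are you leaving from?</salt:prompt>
    <salt:listen id="recoCity">
      <salt:grammar src="cities.grxml"/>
      <salt:bind targetelement="city" value="//city"/>
    </salt:listen>
  </body>
</html>
"""


def split_by_modality(page_text):
    """Return (visual_tags, speech_tags) in document order."""
    visual, speech = [], []
    for element in ET.fromstring(page_text).iter():
        # ElementTree names look like "{namespace}local-name".
        namespace, _, name = element.tag[1:].partition("}")
        if namespace == SALT_NS:
            speech.append(name)
        elif namespace == XHTML_NS:
            visual.append(name)
    return visual, speech


if __name__ == "__main__":
    print(split_by_modality(PAGE))
```

The same text box can be filled by typing or by speaking, which is the sense in which such a page supports both visual and verbal components.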

Multimodal Interaction Working Group

The W3C formed a new working group on multimodal interaction in February 2002. This group will soon publish working drafts of:

• Extensible MultiModal Annotation (EMMA): A language for describing the semantic structure of user input from multiple information sources, including keyboard and mouse, pen and speech

• The Ink Markup Language: A language for capturing and presenting ink information

• Multimodal Framework Note: A high-level description of how an interaction manager coordinates input from multiple modalities (sources of input) and output to multiple media (output to the user) in a multimodal Web application
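The coordinating role of an interaction manager can be sketched in Python under some simplifying assumptions. Nothing here is EMMA syntax or part of the W3C framework: the Interpretation record, the 1.5-second time window and the rule of keeping the most confident value per slot are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Interpretation:
    mode: str          # input modality, e.g. "speech", "pen", "keyboard"
    start_ms: int      # when the input began, in milliseconds
    semantics: dict    # slot/value pairs extracted from the input
    confidence: float  # recognizer confidence between 0 and 1


def integrate(inputs, window_ms=1500):
    """Merge inputs that begin within window_ms of the earliest input.
    When two modalities fill the same slot, keep the more confident value."""
    if not inputs:
        return {}
    ordered = sorted(inputs, key=lambda interp: interp.start_ms)
    first_start = ordered[0].start_ms
    merged, best_confidence = {}, {}
    for interp in ordered:
        if interp.start_ms - first_start > window_ms:
            continue  # too late to belong to the same user turn
        for slot, value in interp.semantics.items():
            if slot not in merged or interp.confidence > best_confidence[slot]:
                merged[slot] = value
                best_confidence[slot] = interp.confidence
    return merged


if __name__ == "__main__":
    turn = [
        Interpretation("speech", 0, {"action": "zoom_in"}, 0.80),
        Interpretation("pen", 400, {"x": 120, "y": 45}, 0.95),
        Interpretation("keyboard", 5000, {"action": "quit"}, 1.00),
    ]
    print(integrate(turn))
```

Here a spoken "zoom in" and a pen tap are combined into one request, while the keystroke that arrives five seconds later is left for a separate turn.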

Internet Engineering Task Force (IETF)

The Speech Services Control (SpeechSC) Working Group of the IETF is chartered to develop protocols to support distributed media processing of audio streams. The working group concentrates on protocols that support speech recognition, speech synthesis and speaker validation, and it focuses only on the secure distributed control of these servers. To date, the SpeechSC group has produced two documents:

• Requirements for Distributed Control of ASR, SI/SV and TTS Resources (http://www.ietf.org/internet-drafts/draft-ietf-speechsc-reqts-04.txt)

• Protocol Evaluation (http://www.ietf.org/internet-drafts/draft-ietf-speechsc-protocol-eval-02.txt)

European Telecommunications Standards Institute (ETSI)

The European Telecommunications Standards Institute (ETSI) has two efforts related to speech applications. The Aurora project has created a standard for distributed speech recognition in which speech features are extracted at the user device and transmitted to the server, where the remainder of the speech recognition engine converts the features to text. This protocol acts as a compression algorithm: it requires minimal bandwidth to communicate speech to the server and avoids noise introduced during transmission. In addition, the Aurora group is investigating additional protocols for use with tonal languages and for reconstructing acoustical speech from the transmitted data.
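The split between device and server can be illustrated with a deliberately simplified Python sketch of the device side. It is not the Aurora front end, which computes a richer feature set and then compresses it; the 8 kHz sample rate, the 25 ms frames, the 10 ms frame shift and the single log-energy value per frame are assumptions chosen to keep the example short.

```python
import math

SAMPLE_RATE = 8000   # samples per second (assumed telephone-quality audio)
FRAME_LEN = 200      # 25 ms of audio per frame
FRAME_SHIFT = 80     # a new frame starts every 10 ms


def extract_features(samples):
    """Device side: reduce raw audio samples to one log-energy value
    per frame. Only these values would be sent to the speech server."""
    features = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_SHIFT):
        frame = samples[start:start + FRAME_LEN]
        energy = sum(sample * sample for sample in frame)
        features.append(math.log(energy + 1.0))  # +1 avoids log(0) on silence
    return features


if __name__ == "__main__":
    # One second of a 440 Hz test tone standing in for speech.
    tone = [int(1000 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
            for n in range(SAMPLE_RATE)]
    features = extract_features(tone)
    print(len(tone), "samples reduced to", len(features), "feature values")
```

Even this crude front end sends far fewer numbers than there are audio samples, which is why the approach behaves like a compression scheme.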

ETSI has also created a list of standard telephony commands in each of five European languages: English, French, German, Italian and Spanish. Because the commands are the same from one device to the next, users of one device will be able to use another easily. Device vendors may also implement the commands for each of the five languages so their devices can be marketed in the major European markets.

Other Standards

BioAPI enables the design of multi-vendor and multi-biometric applications. It supports enrollment, verification and identification of individual users using any of several biometric technologies, including voice. BioAPI is now an ANSI/INCITS standard and is being proposed to the International Organization for Standardization (ISO).

Standards groups are busy specifying languages and protocols enabling interoperability, connectivity and portability of speech applications on a variety of devices. The results of these activities will enable developers to reach larger audiences and positively affect their revenue.

Dr. James A. Larson is an adjunct professor at Portland State University and Oregon Health Sciences University, as well as the conference chair of SpeechTEK 2003. He can be reached at jim@larson-tech.com and his Web site is http://www.larson-tech.com.

Table 1: Overview of speech-related standards