Technology Trends: MRCP Enables New Speech Applications
By Dr. James Larson
Have you ever wished you could change your VoiceXML platform to use a speech synthesizer or speech recognizer from a different vendor? Have you ever wanted to move your speech synthesizer or speech recognizer to a different server? The Internet Engineering Task Force is proposing a new standard that will provide this flexibility.
Media Resource Control Protocol Version 2 (MRCPv2) is a network protocol that provides a vendor-independent interface between speech media servers and speech application platforms. MRCPv2 is being developed by the Internet Engineering Task Force's Speech Services Control (SpeechSC) working group and is based on an earlier version developed jointly by Cisco, Nuance, and SpeechWorks (now ScanSoft).
The MRCPv2 protocol controls media service resources over a network. This protocol depends upon a session management protocol, such as the Session Initiation Protocol (SIP), to establish a separate MRCPv2 control session between the client and the media server. MRCPv2 defines the following types of media processing resources:
- Basic synthesizer — A speech synthesizer resource with very limited capabilities that plays concatenated audio clips.
- Speech synthesizer — A full capability speech synthesizer that produces human-like speech using the Speech Synthesis Markup Language specifications.
- Recorder — A resource with end-pointing capabilities for detecting the beginning and end of speech and saving the audio to a URI.
- DTMF recognizer — A limited DTMF-only recognizer that can match telephone touchtones to a grammar and perform semantic interpretation based on semantic tags in the grammar.
- Speech recognizer — A full speech recognizer that converts speech to text and interprets the results based on semantic tags in the grammar.
- Speaker verification — A resource that authenticates a voice as belonging to a particular person by matching it against one or more saved voiceprints.
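To make the control protocol concrete: an MRCPv2 request is a plain-text message consisting of a start line (protocol version, total message length, method name, and request ID), a set of headers, a blank line, and an optional body. The Python sketch below builds a hypothetical SPEAK request for the speech synthesizer resource; the helper name, channel identifier, and SSML body are illustrative, and a real client would send the result over the control channel negotiated via SIP.

```python
def build_speak_request(request_id: int, channel_id: str, ssml: str) -> str:
    """Sketch of an MRCPv2 SPEAK request (illustrative, not a full client)."""
    headers = (
        f"Channel-Identifier: {channel_id}\r\n"
        f"Content-Type: application/ssml+xml\r\n"
        f"Content-Length: {len(ssml.encode())}\r\n"
    )
    # The message-length field covers the entire message, including the
    # start line itself, so compute it by iterating until the digit
    # count of the length no longer changes the total.
    length = 0
    while True:
        start_line = f"MRCP/2.0 {length} SPEAK {request_id}\r\n"
        total = len((start_line + headers + "\r\n" + ssml).encode())
        if total == length:
            break
        length = total
    return start_line + headers + "\r\n" + ssml

# Example: a synthesizer request carrying a small SSML body
msg = build_speak_request(1, "32AECB23433802@speechsynth", "<speak>Hello</speak>")
print(msg.split("\r\n")[0])
```

Because the interface is just text messages like this one, any synthesizer that speaks the protocol can service the request, which is what makes the vendor independence described below possible.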
MRCP is designed to support two important capabilities to make speech platforms more flexible:
- Service provider independence — Developers can switch between service providers. For example, a developer might switch from a public-domain speaker recognition engine to a higher-quality (and more expensive) proprietary one.
- Service location independence — Developers can move services among servers. For example, if a server becomes saturated, another server can be installed and some of the services moved from the first server to the second.
Developers can leverage the benefits of MRCP to provide this flexibility to any application or platform that uses speech recognition, speech synthesis, and speaker authentication. For example:
- VoiceXML — MRCP provides media services to the VoiceXML platform. Developers continue to use speech application development methodology and tool kits to create VoiceXML 2.0/2.1 applications. The VoiceXML platform uses MRCP to provide speech resource services.
- SALT — The Speech Application Language Tags (SALT) specification could be implemented using MRCP. SALT developers would then be free to choose ASR and TTS services from any technology vendor.
- W3C’s aural CSS — The W3C’s specification for aural Cascading Style Sheets (CSS) supports audio rendering of (X)HTML and XML pages using an aural style sheet. This would enable sight-impaired individuals to browse and interact with (X)HTML and XML pages on browsers supporting aural CSS. Currently, neither Microsoft Internet Explorer™ nor Netscape Navigator™ supports aural CSS, but it is possible to build aural CSS plugins for either browser that use MRCP to provide speech recognition and speech synthesis services.
- Animated visual agents — Animated icons change their appearance, move around the screen, and talk to users. Popular visual agents from Microsoft include Peedy (a parrot), Merlin (a wizard), Robby (a robot) and Genie.6 Animated visual agents will be used in entertainment applications on PCs, mobile devices, kiosks and Internet-enabled televisions. Users will interact with a variety of artificial newscasters, program hosts, artificial characters and cartoons. Generic animated agent software could use MRCP to provide the ASR and TTS services, while artists create the appearance and animation and developers create the dialogs and applications.
- Small mobile devices — Mobile devices are becoming so small that QWERTY keypads are impractical. Users will use a stylus to point and write and a microphone to speak to these devices. MRCP can be used to support the speech resources on remote servers accessed by mobile devices.
MRCP can support remote media services for the applications listed above, as well as others that have not yet been invented. MRCP can do for the entire speech industry what VoiceXML did for the telephony industry—provide a standard platform on which to write applications, one that enables media resources to be accessed remotely and lets developers choose the technology vendors that best support their applications within their budgets.
6: For a catalog of personal agents from Microsoft and other vendors, see http://www.iva-user-center.com/. For a short tutorial on various implementations of animated agents, see http://www.speechtechmag.com/issues/4_2/cover/298-1.html.
Dr. James A. Larson is manager of Advanced Human Input/Output at Intel Corporation and author of the book, Voice XML - Introduction to Developing Speech Applications. He can be reached at firstname.lastname@example.org and his Web site is www.larson-tech.com.