SALT

Speech Application Language Tags (SALT)

James A. Larson

jim@larson-tech.com

Intel Corporation

Abstract

Enabling users to speak and listen to a computer will greatly enhance their ability to access computers at any time from nearly any place. Speech Application Language Tags (SALT) is a small set of XML elements that may be embedded into host programming languages to speech-enable applications. SALT may be used to develop telephony applications (speech input and output only) and multimodal applications (speech input and output, as well as keyboard and mouse input and display output). SALT and the host programming language provide control structures not available in VoiceXML, the current standard language for developing speech applications.

Keywords

SALT, VoiceXML, SRGS, Speech Recognition Grammar Specification, SSML, Speech Synthesis Markup Language, Semantic Interpretation, XML, telephony applications, multimodal applications

Introduction

Speaking and listening are so fundamental that people take them for granted. Every day, people ask questions. They give instructions. Speaking and listening are necessary for learning and training, for selling and buying, for persuading and agreeing, and for most social interactions. For the majority of people, speaking and understanding speech is simply the most convenient and natural way of interacting with other people.

So, is it possible to speak and listen to a computer?

Yes.

Emerging technology enables users to speak and listen to the computer now. Speech recognition converts spoken words and phrases into text, and speech synthesis converts text to human-like spoken words and phrases. While speech recognition and synthesis have long been in the research stage, three recent advances have enabled speech recognition and synthesis technologies to be used in real products and services: (1) faster, more powerful computer technology, (2) improved algorithms using speech data captured from the real world, and (3) improved strategies for using speech recognition and speech synthesis in conversational dialogs enabling users to speak and listen to the computer.

Motivation for Speaking and Listening to a Computer

Speech applications enable users to speak and listen to a computer…

…despite physical impairments such as blindness or poor physical dexterity. Speaking enables impaired callers to access computers. Callers with poor physical dexterity (who cannot type) can use speech to enter requests to the computer. The sight-impaired can listen to the computer as it speaks. When visual and/or mechanical interfaces are not an option, callers can perform transactions by saying what they want done and supplying the appropriate information. If a person with impairments can speak and listen, that person can use a computer.

…to bypass the limitations of small keyboards and screens. As devices become smaller, our fingers do not. Keys on the keypad shrink—often to the point where people with thick fingers press two or more keys with one finger stroke. The small screens on some cell phones may be difficult to see, especially in extreme lighting conditions. Even PDAs with QWERTY keyboards are awkward. (QWERTY is a sequence of six keys found on traditional keyboards used by most English and Western-European language speakers.) Users hold the device with one hand and “hunt and peck” with the forefinger of the other hand. It is impossible to use both hands to touch-type and hold the device at the same time. By speaking, callers can bypass the keypad (except possibly for entering private data in crowded or noisy environments). By speaking and listening, callers can bypass the small screen of many handheld electronic devices.

…if the device has no keyboard. Many devices have no keypad or keyboard. For example, stoves, refrigerators, and heating and air conditioning thermostats have no keyboards. These appliances may have a small control panel with a couple of buttons and a dial. The physical controls are good for turning the appliance on and off and adjusting its temperature and time. Without speech, a user cannot specify complex instructions such as, “turn the temperature in the oven to 350 degrees for 30 minutes, then change the temperature to 250 degrees for 15 minutes, and finally leave the oven on warm.” Without speech, the appliance cannot ask questions such as, “When on Saturday morning do you turn the heat on?” Any sophisticated dialog with these appliances will require speech input. And speech can be used with rotary phones, which do not have a keypad.

…while callers work with their hands and eyes. Speaking and listening are especially useful in situations where the caller’s eyes and/or hands are busy. Drivers need to keep their eyes on the road and their hands on the steering wheel. If they must use a computer when driving, the interface should be speech only. (It is not recommended, however, that you hold and use a cell phone while driving a car.) Machine operators, whose hands must operate controls and whose eyes must focus on the machine’s activities, can also use speech to communicate with a computer. Mothers and caregivers with children in their arms may also appreciate speaking and listening to a doctor’s Web page or medical service. If a person can speak and listen to others while they work, they can speak and listen to a computer while they work.

…at any time during the day. Many telephone help lines and receptionists are available only during working hours. Computers can automate much of this activity, such as accepting messages, providing information, and answering callers’ questions. Callers can access these automated services 24 hours a day, 7 days a week via a telephone by speaking and listening to a computer. If a person can speak and listen, they can interact with a computer at any time during the day or night.

…with instant connection without being placed on “hold.” Callers become frustrated when they hear “your call is very important to us” because this message means they must wait. “Thanks for waiting, all of our operators are busy” means more waiting. When using speech to interact with an application, there are no hold times. The computer responds quickly. (Computers can become saturated, which results in delays, but such delays occur less frequently than waits for a human operator.) Because many callers can be serviced by voice-enabled applications, the human operators are freed to resolve more difficult caller problems.

…using languages that do not lend themselves to keyboarding. Some languages do not lend themselves to data entry using the traditional QWERTY keyboard. Rather than force Asian language users to mentally translate their words and phrases to phonetic sounds and then press the corresponding keys on the QWERTY keyboard, a much better solution is to speak and listen. Speech and handwriting recognition will be the key to enabling Asian language speakers to gain full use of computers. If a person can speak and listen in an Asian language, they can interact with a computer using that language.

…to convey emotion. To convey emotion in written text, people frequently add emoticons (keyboard symbols that stand for emotions) to their text messages. Example emoticons include :) for happy or a joke and :( for sad. With speech, these emotions can be conveyed naturally by changing the inflection, speed, and volume of the speaking voice.

…to use multiple channels of communication between user and computer. Speech enhances traditional graphical user interfaces (GUIs) by enabling users to speak as well as click and type, and hear as well as read. Multimodal user interfaces will improve the exchange of information between users and computers by transferring information in the most appropriate mode: speech for simple requests and simple answers, and GUIs for complex requests and graphical and pictorial answers.

Languages for Speech Applications

This new environment led to the creation of VoiceXML, an XML-based declarative language for describing the exchange of spoken information between users and computers, and of several related languages. The related languages include the Speech Recognition Grammar Specification (SRGS), which describes the words and phrases the computer should listen for, and the Speech Synthesis Markup Language (SSML), which describes how text should be rendered as speech. VoiceXML is widely used to develop voice-only user interfaces for telephone and cell phone users.

VoiceXML uses predefined control structures, enabling developers to specify what should be spoken and heard, but not the low-level details of how those operations occur. As is the case with many special-purpose declarative languages, developers sometimes prefer to write their own procedural instructions. Speech Application Language Tags (SALT) was developed to enable Web developers to use traditional Web development languages to specify control flow and a small number of XML elements to manage speech. In addition to its use in telephony applications, SALT can be used for multimodal applications, where people use multiple modes of input: speaking, as well as typing and selecting (pointing).

SALT

The SALT Forum [http://www.saltforum.org/], originally consisting of Cisco, Comverse, Intel, Microsoft, Philips, and SpeechWorks (now ScanSoft), published the initial specification in June 2002. This specification was contributed to the World Wide Web Consortium (W3C) in August of that year. Later, in June 2003, the SALT Forum contributed a SALT profile for Scalable Vector Graphics (SVG) to the W3C.

The SALT specification contains a small number of XML elements enabling speech output to the user, called prompts, and speech input from the user, called responses. SALT elements include:

<prompt>—presents audio recordings and synthesized speech to the user. SALT also contains a prompt queue and commands for managing the presentation of prompts on the queue to the user.

<listen>—recognizes spoken words and phrases. There are three listen modes:

Automatic—used for recognition in telephony or hands-free scenarios. The speech platform rather than the application controls when to stop the recognition facility.
Single—used for push-to-talk applications. An explicit stop from the application returns the recognition result.
Multiple—used for “open-microphone” or dictation applications. Recognition results are returned at intervals until the application makes an explicit stop.

<grammar>—specifies the words and phrases a user might speak
<dtmf>—recognizes DTMF (telephone touch-tones)
<record>—captures spoken speech, music, and other sounds
<bind>—integrates recognized words and phrases with application logic
<smex>—communicates with other platform components
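The way these elements combine can be sketched with a small fragment. The following is an illustrative sketch, not an excerpt from the SALT specification or from the figures of this article: the element names come from the list above, while the attribute names (id, mode, src, targetelement, value) and the namespace URI are recalled from SALT 1.0 and should be checked against the specification. The element ids, the grammar file name cities.grxml, and the XPath //city are hypothetical. Python is used here only to confirm that the markup is well formed and to list the SALT elements it contains.

```python
import xml.etree.ElementTree as ET

# Namespace URI as recalled from SALT 1.0; verify against the specification.
SALT_NS = "http://www.saltforum.org/2002/SALT"

# Hypothetical XHTML page: ask for a city, listen for the answer,
# and copy the recognized value into a text box.
PAGE = """
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:salt="http://www.saltforum.org/2002/SALT">
  <body>
    <input type="text" id="txtCity"/>
    <salt:prompt id="askCity">Which city are you leaving from?</salt:prompt>
    <salt:listen id="recoCity" mode="automatic">
      <salt:grammar src="cities.grxml"/>
      <salt:bind targetelement="txtCity" value="//city"/>
    </salt:listen>
  </body>
</html>
"""

def salt_elements(page: str) -> list[str]:
    """Return the local names of all SALT elements in document order."""
    root = ET.fromstring(page.strip())
    prefix = "{" + SALT_NS + "}"
    return [el.tag[len(prefix):] for el in root.iter() if el.tag.startswith(prefix)]

print(salt_elements(PAGE))
```

In this sketch the host page supplies the text box, and the SALT elements only add speech: the prompt asks the question, the listen element names the grammar to use, and the bind element says where the recognized value should go.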

SALT designers partitioned SALT functionality into multiple profiles that can be implemented and used independently of the remaining SALT modules. Various devices may use different combinations of profiles. Devices with limited processor power or memory need not support all features (for example, mobile devices do not need to support dictation). Devices may be tailored to particular environments (for example, telephony support may not be necessary for television set-top boxes). While full application portability is possible across devices using the same profile, there is limited portability across devices with different profiles.

SALT has no control elements, such as <for> or <goto>, so developers embed SALT elements into other languages, called host languages. For example, SALT elements may be embedded into languages such as XHTML, SVG, and JavaScript. Developers use the host language to specify application functions and execution control while the SALT elements provide advanced input and output using speech recognition and speech synthesis.

Architectures for SALT Applications

Users interact with telephony applications using a telephone, cell phone, or other mobile device with a microphone and speaker. The hardware architecture for telephony applications, illustrated in Figure 1, contains:

Web server—contains HTML, SALT and embedded scripts. The scripts control the dialog flow, such as the order for playing audio prompts to the caller.
Telephony server—connects the IP network (and the speech server) to the telephone network
Speech server—contains a speech recognition engine which converts spoken speech into text, a speech synthesis engine which converts text to human-sounding speech, and an audio subsystem for playing prompts and responses back to the user.
Client devices—devices through which the user listens and speaks, such as mobile telephones and telephony-enabled PDAs.

There are numerous variations of the architecture shown in Figure 1. A small speech recognition engine could reside in the user device (for example, to recognize a small number of command and control instructions), or recognition may be distributed across the device and the speech server (the device performs DSP functions on spoken speech, extracting “speech features” that are transmitted to the speech server, which completes the speech recognition processing). The various servers may be combined or replicated depending upon the workload. And the telephony server could be replaced by Internet connections to speech-enabled desktop devices, bypassing the telephone communication system entirely.

Some mobile devices—and most desktop devices—have screens and input devices such as keyboard, mouse, and stylus. These devices support multimodal applications, which support more than one mode of input from the user, including keyed text, handwriting and pen gestures, and spoken speech.

Telephony and Multimodal Applications Using SALT

Figure 2 illustrates a sample telephony application written with SALT elements embedded in HTML. The bolded code in Figure 2 will be replaced by the bolded code in Figure 3, which illustrates the same application as a multimodal application.

Figure 3 illustrates a typical multimodal application written with SALT embedded in HTML. In this application, the user may either speak or type to enter values into the text boxes. Note that the code in Figure 3 is somewhat different from the code in Figure 2. This is because many telephony applications are system-directed (the system guides the user by asking questions which the user answers), whereas multimodal applications, like visual-only applications, are often user-directed (the user indicates which data will be entered by clicking a mouse or pointing with a stylus, and then enters the data).

Programming with SALT is different from programming traditional visual applications in the following ways:

If the developer does not like how the speech synthesizer renders text as human-understandable voice, the developer may add Speech Synthesis Markup Language (SSML) elements to the text to provide hints for the speech synthesis system. For example, the developer could insert a <break time="500ms"/> element to instruct the speech synthesizer to remain silent for 500 milliseconds. SSML is a W3C standard and is used by both SALT and VoiceXML 2.0/2.1.
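As a minimal sketch of the pause described above, the fragment below places a 500-millisecond break between two sentences of a stand-alone SSML document. The prompt wording is hypothetical, and how the SSML is placed inside a SALT prompt is not shown here. The Python wrapper is an assumption added for illustration: it only parses the markup and reports the requested pauses, and does not synthesize speech.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # SSML 1.0 namespace

# Hypothetical prompt text with a half-second pause between the two sentences.
PROMPT = """
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  Your flight leaves at nine.
  <break time="500ms"/>
  Would you like to hear the gate number?
</speak>
"""

def pause_lengths_ms(ssml: str) -> list[int]:
    """Return the duration of every <break> given in milliseconds."""
    root = ET.fromstring(ssml.strip())
    pauses = []
    for br in root.iter("{%s}break" % SSML_NS):
        time = br.get("time", "")
        if time.endswith("ms"):          # other units (e.g. "1s") are ignored here
            pauses.append(int(time[:-2]))
    return pauses

print(pause_lengths_ms(PROMPT))
```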

The developer must supply a grammar to describe the words and phrases users are likely to say. Grammars help the speech recognition system recognize words faster and more accurately. SALT (and VoiceXML 2.0/2.1) developers specify grammars using the Speech Recognition Grammar Specification (SRGS), another W3C standard. An example grammar is illustrated in Figure 2, lines 44–54. Application developers should spend effort to fine-tune the specification of grammars to recognize words frequently spoken by the user at each point in the dialog, as well as fine-tune the wording of the prompts to encourage users to speak those words and phrases.
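A small grammar in the XML form of SRGS might look like the sketch below. It is not the grammar of Figure 2; the rule name city and the three city names are hypothetical. The Python helper functions are likewise assumptions for illustration: they read the alternatives out of a flat one-of list and compare text against them, which is far simpler than what a real speech recognition engine does with a grammar.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"  # SRGS 1.0 namespace

# Hypothetical grammar: the caller may say one of three city names.
GRAMMAR = """
<grammar version="1.0" xml:lang="en-US" mode="voice" root="city"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="city" scope="public">
    <one-of>
      <item>Portland</item>
      <item>Seattle</item>
      <item>San Francisco</item>
    </one-of>
  </rule>
</grammar>
"""

def phrases(grammar_xml: str, rule_id: str) -> list[str]:
    """List the literal alternatives of one rule (flat <one-of> lists only)."""
    root = ET.fromstring(grammar_xml.strip())
    ns = {"g": SRGS_NS}
    rule = root.find("g:rule[@id='%s']" % rule_id, ns)
    if rule is None:
        return []
    return [(item.text or "").strip() for item in rule.findall("g:one-of/g:item", ns)]

def in_grammar(utterance: str, grammar_xml: str = GRAMMAR, rule_id: str = "city") -> bool:
    """Toy check: does the text match one alternative, ignoring case?"""
    allowed = {p.lower() for p in phrases(grammar_xml, rule_id)}
    return utterance.strip().lower() in allowed

print(phrases(GRAMMAR, "city"))
```

Listing the alternatives this way also makes it easy to compare the grammar with the wording of the prompt, so that the prompt encourages exactly the words the grammar expects.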

Speech recognition systems do not understand spoken speech perfectly. (Even humans occasionally misunderstand what others say.) Even in the best circumstances, speech recognition engines fail to accurately recognize three to five percent of spoken words. Developers compensate by writing event handlers that help users overcome recognition problems, typically by prompting the user to speak again and often rephrasing the question so that the user responds with different words. Example event handlers are illustrated in Figure 2, lines 35–37 and lines 38–40. Developers may spend as much as 30 to 40 percent of their time writing event handlers, which are needed only occasionally but are essential when the speech recognition system fails.
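The reprompting strategy can be sketched as a small simulation. This is not SALT code and not the handlers of Figure 2: in a SALT page the same logic would live in host-language handlers attached to the listen element, whereas here recognition outcomes are simply supplied as a list. The prompt wording, the function name ask_with_retries, and the limit of three attempts are all hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical prompts; each retry rephrases the question to elicit different words.
PROMPTS = [
    "Which city are you leaving from?",
    "Sorry, I did not understand. Please say the name of your departure city.",
    "For example, say Portland or Seattle.",
]

NORECO = None   # stands in for an attempt where speech was heard but not recognized
SILENCE = ""    # stands in for an attempt where the caller said nothing

def ask_with_retries(outcomes: Iterable[Optional[str]],
                     max_attempts: int = 3) -> Tuple[Optional[str], List[str]]:
    """Play a prompt, consume one simulated outcome, and reprompt on failure.

    Returns the recognized text (or None if every attempt failed)
    together with the list of prompts that were played.
    """
    played: List[str] = []
    for attempt, outcome in zip(range(max_attempts), outcomes):
        played.append(PROMPTS[min(attempt, len(PROMPTS) - 1)])
        if outcome:                 # non-empty text: recognition succeeded
            return outcome, played
        # NORECO or SILENCE: loop again with the next, rephrased prompt
    return None, played             # give up, e.g. hand the call to an operator

result, played = ask_with_retries([NORECO, SILENCE, "Portland"])
print(result, len(played))
```

Keeping the retry wording in a list separate from the control logic lets the prompts be tuned during usability testing without touching the handler code.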

Comparison of SALT with VoiceXML

SALT and VoiceXML enable very different approaches for developing speech applications. SALT tags control the speech medium (speech synthesis, speech recognition, audio capture, audio replay, and DTMF recognition). SALT tags are often embedded into another language that specifies flow control and turn-taking. VoiceXML, on the other hand, is a stand-alone language which controls the speech medium as well as flow control and turn-taking.

In VoiceXML, the details of flow control are managed by a special algorithm called the Form Interpretation Algorithm. For this reason, many developers consider VoiceXML a declarative language. SALT, on the other hand, is frequently embedded into a host programming language that many developers consider procedural. It should be noted, however, that SALT can be used as a stand-alone declarative language by using the assignment and conditional features of the <bind> element. Thus, SALT can be used on resource-scarce platforms such as cell phones that cannot support a host language. For details, see section 2.6.1.3 in the SALT specification.

While SALT and VoiceXML make it easy to implement speech-enabled applications, it is difficult to design a quality speech application. An HTML programmer easily learns how to write SALT applications, but designing a usable speech or multimodal application is still more of an art than a science. [Balentine and Morgan; Cohen et al.] present guidelines and heuristics for designing effective speech dialogs. A series of iterative designs and usability tests is necessary to produce speech applications that users both enjoy and can use efficiently to perform their desired computer tasks.

Conclusion

At the time this article was written, it was not clear whether SALT would overtake and replace VoiceXML as the most widely used language for writing telephony applications. It was also not clear whether SALT or some other language would become the preferred language for developing multimodal applications. The availability of high-level design tools, code generators, and system development environments that hide the choice of development language from the speech application developer may minimize the importance of programming language choice.

Further Reading

Balentine, B., and Morgan, D. P. (2004). How to Build a Speech Recognition Application: A Style Guide for Telephony Dialogues (2nd edition), 1999. San Ramon, CA: Enterprise Integration Group.

Cohen, M. H., Giangola, J. P., and Balogh, J. (2004). Voice User Interface Design. Addison-Wesley.

Speech Application Language Tags Specification, Version 1.0, 15 July 2002, http://www.saltforum.org/

Speech Recognition Grammar Specification (SRGS), Version 1.0, W3C Recommendation, 16 March 2004, http://www.w3.org/TR/2004/REC-speech-grammar-20040316/

Speech Synthesis Markup Language (SSML), Version 1.0, W3C Proposed Recommendation, 15 July 2004, http://www.w3.org/TR/2004/PR-speech-synthesis-20040715/

Voice Extensible Markup Language (VoiceXML), Version 2.0, W3C Recommendation, 16 March 2004, http://www.w3.org/TR/2004/REC-voicexml20-20040316/

Biography

Dr. James A. Larson is co-chair of the W3C Voice Browser Working Group that is standardizing VoiceXML and related markup languages for developing speech applications. Jim is Manager of Advanced Human Input/Output at Intel. Jim also teaches courses in developing speech applications at Portland State University and the Oregon Graduate School at Oregon Health & Science University. He is a columnist for SpeechTEK Magazine and was named one of the top ten leaders in Speech by SpeechTEK Magazine for 2002, 2003 and 2004.

Figure 1: Hardware Architecture for Telephony Applications