VoiceXML: Introduction to Developing Speech Applications

 

James A Larson

 

Chapter 8. Application-directed styles and dialog documents

8.1 Introduction to dialogs

Everyone speaks in hundreds of dialogs every day. Each of us would like to believe that designing a dialog document is easy, with nothing to it. However, just as being able to draw does not make someone a great artist, and being able to write does not make one a great author, being able to speak does not make one a good dialog document designer.

Designing a dialog document is an art. There are rules and guidelines that a dialog designer should follow, but following them does not by itself produce a good dialog document.

This chapter, and the next two chapters, describe various styles of dialogs and suggest how to implement each of these styles in VoiceXML. Chapters 8-10 also present human factor guidelines and hints for constructing applications using each type of dialog. Reading this chapter is just the first step in designing good dialogs. Designing good dialogs requires more than good intentions. It includes:

A designer’s first few dialog documents will likely perform less than perfectly during usability testing. Do not feel bad; this happens to most dialog designers, and it usually happens with the initial versions of dialog documents. Developers need to “bootstrap” their initial dialog documents into world-class dialog documents with repeated user testing.

A dialog document is a specification of prompts and responses, including possible caller utterances and computer actions. Just as the script of a theatrical play guides an actor in what to do and say, a dialog document guides the voice application caller in what to do and say. However, unlike a theatrical script, dialog documents are much more interactive; actors usually have a single plot to follow during a play. With a dialog document, the caller may select from many alternative paths through a dialog document.

As discussed in Chapter 3, an interpreter is a software engine that processes dialog documents to produce a dialog that the caller experiences as he or she responds to prompts. A VoiceXML interpreter is usually bundled with a VoiceXML browser, which also fetches dialog documents from a Web server. A dialog is a sequence of actual requests and information exchanges between a caller and an application that conform to a dialog document.

As an analogy, consider a roadmap. It corresponds to the dialog document—it presents alternative routes and destinations. The interpreter corresponds to a car—it takes people to their destination along a route that they choose. The trip itself corresponds to the dialog—the experience of choosing roads to travel, seeing landmarks, and arriving at the destination.

A quality dialog document results in dialogs that are useful and usable. A useful dialog document contains application functions that help callers accomplish their tasks at hand or that provide enjoyable entertainment. A usable dialog requires little training to use effectively. It is predictable, clear, well organized, uncluttered, and comprehensible.


Seven important classes of dialog styles are shown in Figure 8.1. These classes can be grouped into three categories—application-directed, user-directed, and mixed-initiative dialogs. The simpler dialog styles are positioned towards the bottom and the more complex dialogs towards the top of the illustration. Table 8.1 summarizes the differences among the seven main dialog styles, which are described in greater detail in this and the following two chapters.

Application-directed dialogs (also called application-initiative, machine-directed, system-directed, or directed dialogs). The application “drives” the dialog by asking the caller questions, to which the caller responds. Figure 8.2 illustrates the typical flow of an application-directed dialog. The application prompts the caller by asking a question or giving instructions and then waits for the caller to respond. The caller responds by speaking or pressing the buttons on a touchtone phone. The application then performs the appropriate action, and the cycle begins again.

 


Three types of application-directed dialog styles are touchtone-based menus, ASR-based menus, and form fill-in. Touchtone-based menus are verbal menus to which the caller responds by pressing the keys on the telephone keypad to produce touch-tones. ASR-based menus are verbal menus to which the caller responds by speaking one of a small number of words or phrases. A verbal form fill-in is the verbal equivalent of a paper form in which the caller verbally enters values into slots.

Application-directed dialogs are easy to code and maintain. Many callers like application-directed dialogs because they guide the caller through the application. The caller is not required to remember any commands or options (although with experience, callers do remember commands and options and use barge-in to speed up the dialog). Many callers have experienced application-directed dialogs when using the touch-tone buttons on telephones. Most callers feel comfortable with application-directed dialogs. On the negative side, callers may complain that the dialog is too structured, too rigid, and takes too much time to complete. Some callers feel that the computer becomes their master, and they become mere slaves to the computer.

User-directed dialogs (also called user-initiative dialogs). The caller speaks to the application, instructing the application what to do. Figure 8.3 illustrates the typical flow of a user-directed dialog. The caller speaks a request, the application performs the appropriate action and confirms the result to the caller, and then waits for the caller to speak the next request. Then, the cycle starts again.


The caller “drives” the dialog by initiating each dialog segment without explicit prompts. Three types of user-directed dialogs are command and control, in which the caller speaks a small number of specific commands; query, in which the caller asks a well-defined question; and dictation, in which the caller speaks sentences and paragraphs that the application transcribes to text.

User-directed dialogs generally require the caller to remember the names of commands and parameters. While this is seldom a problem for an experienced caller, it may be problematic for a novice caller. This problem can be minimized with carefully designed help messages or a cue card listing the commands that the caller carries in his wallet or her purse. After learning the command set, callers generally like user-directed dialogs because callers do not have to listen to lengthy menus. They feel in control of the application. This type of dialog style will be discussed in Chapter 9.

Mixed-initiative. This dialog style is a mixture of user-directed and application-directed dialogs. Figure 8.4 illustrates an application-directed flow at the top of the illustration, a user-directed flow at the bottom, and switching between these two dialog styles in the center.


Some callers become confused when they first encounter a mixed-initiative dialog because they do not know what to do or when to barge-in. Perhaps this is because many callers are familiar with application-directed dialogs, especially touch-tone based applications, and have little experience with mixed-initiative dialogs. Mixed-initiative feels more natural to most callers after they pass through the initial learning period. On the negative side, mixed-initiative dialogs can be more time-consuming to build, test, and maintain. Frequently they require a great deal of user testing and maintenance. This type of dialog will be discussed in Chapter 10.

| Type of dialog | Problems solved | Usage model | Requirements | Example application |
|---|---|---|---|---|
| Touchtone menus | Numeric data entry over the telephone | Nested sequences of system prompts and caller responses | Touchtone recognition | Telephone answering assistant, telephone transactions |
| ASR menus | Alphanumeric data entry | Nested sequences of prompts and caller verbal responses | Conversational speech recognition engine | Telephone transactions |
| Forms | Data entry | Sequences of system prompts and caller responses | Conversational speech recognition engine | Data entry, e.g., patient history, credit card application |
| Command and control | Caller does not have access to a keyboard. | Caller utters simple commands, to which the computer responds. “Undo” commands reverse process actions immediately. | Small vocabulary ASR | Robot control, environment control, application control |
| Query | Obtain information | 1. Caller formulates the query. 2. Application may ask the caller clarifying questions. 3. Application generates and presents the results. 4. Caller reviews results and, possibly, modifies the query. | Large vocabulary speech recognition engine, speech synthesis engine | Query a relational database; search files or Web sites; search for help topics. |
| Dictation | Caller does not have access to, want to use, or cannot use a keyboard. | Caller dictates sentences and paragraphs. | Large vocabulary, continuous dictation engine | Create, edit, and format documents; create and edit e-mail messages |
| Mixed-initiative | Inflexible dialogs | Caller and system take turns “driving” the dialog. | Small vocabulary speech recognition engine | Any speech application |

Table 8.1: Summary of dialog types

Most applications should support application-directed dialogs for reasons including the following:

Experienced users in the manipulation stage of using an application may prefer mixed initiative because they can take advantage of shortcuts such as barge-in to perform their tasks more quickly. It is desirable for novice callers to gradually begin using mixed initiative as they progress from the orientation phase through the exploration phase, and then to use mixed initiative fully when they finally reach the manipulation phase. It is also helpful if a caller can drop back to a system-directed dialog on entering a part of the application that is unfamiliar or has not been visited in some time.

8.2. Touchtone menus

Touchtone menus are currently the most widely used dialog style for telephone-computer systems. They are widely implemented and are used to capture data from callers and route callers to the appropriate person in the company called.

| Problems solved | Usage model | Requirements | Example application |
|---|---|---|---|
| Numeric data entry over the telephone | Nested sequences of system prompts and caller responses | Touchtone recognition | Telephone answering assistant, telephone transactions |

Table 8.2: Summary of touchtone menus

Description

Dual-Tone Multi-Frequency signaling (better known as DTMF or touchtones) refers to the sounds a telephone makes when its keys are pressed. Callers press buttons on the telephone keypad to dial numbers; callers can also press these buttons to answer questions and select options from verbal menus.

Touchtone menus consist of a series of verbal menus where callers hear a set of verbal options and respond by pressing keys on a touchtone phone. Callers also may hear a series of spoken prompts and enter values into a VoiceXML form. A DTMF tone interpreter hears the tones and recognizes the digit pressed by the caller. The dialog manager performs the task indicated by the digit. Then the application presents another set of verbal menu options to the caller.

Use

Callers love to hate touchtone menus: they require careful listening to match each option to the appropriate key, and they are time-consuming, inflexible, and sometimes confusing. Callers need time to listen to menu options before deciding which key to press. Callers are forced to navigate the menu tree hierarchy, even if they know which leaf they want to access. Callers find it difficult to recover when they accidentally choose the wrong option and proceed down the wrong branch of the menu tree.

Despite these problems, touchtone menu dialogs are widely used. They represent the first reliable technology for automating telephone answering, call routing, and simple data collection. Touchtone menus are necessary in situations such as the following:

Example

Figure 8.5 illustrates an example of a two-level touchtone menu hierarchy expressed using VoiceXML. The main menu (lines 3-13) asks the caller to select from among three options: thermostat, lights, and security. The caller responds by pressing one of the keys on the touchtone phone. If the caller presses key 1, a second-level menu for the thermostat (lines 14-23) asks the caller what adjustment to the temperature is desired. If the caller presses key 2 or 3, then the dialog manager presents menus about lights (lines 24-33) or security (lines 34-43). Figure 8.5 also illustrates “stubs” for processing each of the options (lines 44-79). These stubs illustrate how a message is presented to the user, but do not illustrate how the actual function is performed. In a real application, these stubs would be replaced by the actual functions, or invocations of the functions.

<?xml version="1.0"?> <!-- 1 -->

<vxml version="2.0"> <!-- 2 -->

<menu id="main"> <!-- 3 -->

<prompt bargein="true"> <!-- 4 -->

Main menu. Choose the environmental control you want to modify. <!-- 5 -->

To change the temperature, press one. <!-- 6 -->

To change lights, press two. <!-- 7 -->

To change security, press three. <!-- 8 -->

</prompt> <!-- 9 -->

<choice dtmf="1" next="#thermostat"/> <!-- 10 -->

<choice dtmf="2" next="#lights"/> <!-- 11 -->

<choice dtmf="3" next="#security"/> <!-- 12 -->

</menu> <!-- 13 -->

 

<menu id= "thermostat"> <!-- 14 -->

<prompt bargein="true"> Thermostat. <!-- 15 -->

To make the temperature warmer, press one. <!-- 16 -->

To make the temperature cooler, press two. <!-- 17 -->

To return to the main menu, press nine. <!-- 18 -->

</prompt> <!-- 19 -->

<choice dtmf = "1" next="#warmer"/> <!-- 20 -->

<choice dtmf = "2" next="#cooler"/> <!-- 21 -->

<choice dtmf = "9" next="#main"/> <!-- 22 -->

</menu> <!-- 23 -->

 

<menu id= "lights"> <!-- 24 -->

<prompt bargein="true"> Lights. <!-- 25 -->

To adjust the yard lights, press one. <!-- 26 -->

To adjust the porch lights, press two. <!-- 27 -->

To return to the main menu, press nine. <!-- 28 -->

</prompt> <!-- 29 -->

<choice dtmf = "1" next="#lights_yard"/> <!-- 30 -->

<choice dtmf = "2" next="#lights_porch"/> <!-- 31 -->

<choice dtmf = "9" next="#main"/> <!-- 32 -->

</menu> <!-- 33 -->

 

<menu id= "security"> <!-- 34 -->

<prompt bargein="true"> Security. <!-- 35 -->

To adjust the garage security, press one. <!-- 36 -->

To adjust the house security, press two. <!-- 37 -->

To return to the main menu, press nine. <!-- 38 -->

</prompt> <!-- 39 -->

<choice dtmf = "1" next="#security_garage"/> <!-- 40 -->

<choice dtmf = "2" next="#security_house"/> <!-- 41 -->

<choice dtmf = "9" next="#main"/> <!-- 42 -->

</menu> <!-- 43 -->

 

<form id="warmer"> <!-- 44 -->

<block> <!-- 45 -->

<prompt> processing warmer </prompt> <!-- 46 -->

<goto next="#main"/> <!-- 47 -->

</block> <!-- 48 -->

</form> <!-- 49 -->

 

<form id="cooler"> <!-- 50 -->

<block> <!-- 51 -->

<prompt> processing cooler </prompt> <!-- 52 -->

<goto next="#main"/> <!-- 53 -->

</block> <!-- 54 -->

</form> <!-- 55 -->

 

<form id="lights_yard"> <!-- 56 -->

<block> <!-- 57 -->

<prompt> processing yard lights </prompt> <!-- 58 -->

<goto next="#main"/> <!-- 59 -->

</block> <!-- 60 -->

</form> <!-- 61 -->

 

<form id="lights_porch"> <!-- 62 -->

<block> <!-- 63 -->

<prompt> processing porch lights </prompt> <!-- 64 -->

<goto next="#main"/> <!-- 65 -->

</block> <!-- 66 -->

</form> <!-- 67 -->

 

<form id="security_garage"> <!-- 68 -->

<block> <!-- 69 -->

<prompt> processing garage security </prompt> <!-- 70 -->

<goto next="#main"/> <!-- 71 -->

</block> <!-- 72 -->

</form> <!-- 73 -->

 

<form id="security_house"> <!-- 74 -->

<block> <!-- 75 -->

<prompt> processing house security </prompt> <!-- 76 -->

<goto next="#main"/> <!-- 77 -->

</block> <!-- 78 -->

</form> <!-- 79 -->

 

</vxml> <!-- 80 -->

Figure 8.5: Example two-level DTMF menu

If the VoiceXML browser supports type-ahead, then an experienced caller who is familiar with the menu structure is not forced to listen to the long prompts, but may press keys while the prompt is being presented. Type-ahead is desirable because it enables experienced callers to navigate through menus quickly. As non-experienced callers listen to full prompts, they will learn the menu structures and gradually enhance their knowledge of the system to become experienced callers. Callers are given the ability to type-ahead by specifying the bargein attribute to be true:

<prompt bargein="true">

 

within the <prompt> element. The developer specifies

 

<prompt bargein= "false">

 

if the caller is required to listen to the prompt (for example, if the prompt contains a warning, legal notice, or paid advertisement). If the bargein attribute is not present, the prompt uses the value of the bargein property, which defaults to true.
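As a sketch of how the two settings can be combined, the following document plays a notice that cannot be interrupted and then a menu prompt that can. The form name notice, the wording of the notice, and the reduced two-option menu are assumptions made for illustration; they are not part of Figure 8.5.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Hypothetical entry form: the caller must hear this notice in full. -->
  <form id="notice">
    <block>
      <prompt bargein="false">
        This call may be recorded.
      </prompt>
      <goto next="#main"/>
    </block>
  </form>

  <!-- An experienced caller may press a key at any point during this prompt. -->
  <menu id="main">
    <prompt bargein="true">
      To make the temperature warmer, press one.
      To make the temperature cooler, press two.
    </prompt>
    <choice dtmf="1" next="#warmer"/>
    <choice dtmf="2" next="#cooler"/>
  </menu>

  <!-- Stub forms in the style of Figure 8.5. -->
  <form id="warmer">
    <block>
      <prompt>processing warmer</prompt>
      <goto next="#main"/>
    </block>
  </form>

  <form id="cooler">
    <block>
      <prompt>processing cooler</prompt>
      <goto next="#main"/>
    </block>
  </form>

</vxml>
```

Placing the notice in its own form keeps it from being replayed each time a stub returns the caller to the main menu.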

The following human factor guidelines are useful when designing dialogs that are both easy and quick to use.

Guidelines

Menu organization matches caller’s perspective. Minimize callers’ effort to map their request onto the menu hierarchy. Identify the types of requests that callers make and organize the menu hierarchy to reflect those requests rather than the organization’s structure. Present the most frequently selected choices early so callers can avoid listening to seldom-selected choices.

Short prompts. Short prompts have one big advantage—callers can hear the prompt and respond faster. In general, long prompts are time-consuming. Most callers dislike listening to long prompts. Short prompts make more efficient use of the caller’s time.

Long prompts may cause the caller to believe that the system is more intelligent than it really is. When the system fails to live up to expectations, the caller may be disappointed and unhappy with the system.

Tapered prompts. When presenting a sequence of prompts to the caller, make successive prompts shorter by removing hints and unnecessary words. This makes the conversation seem less repetitive and also shortens it. Figure 8.6 illustrates the use of tapered prompts. Note how the prompt message becomes simpler and shorter in each successive prompt. The first prompt (lines 5-12) both introduces the form to the caller and presents a detailed explanation of what the caller should do for the first field. The second prompt (lines 16-19) is shorter. The remaining prompts (lines 24-26, 30-32, and 36-38) are each very short. After the first prompt, repeating the same information would be boring.

<?xml version="1.0"?> <!-- 1 -->

<vxml version="2.0"> <!-- 2 -->

 

<form id= "evaluation"> <!-- 3 -->

 

<field name= "easy_to_learn" > <!-- 4 -->

<prompt> <!-- 5 -->

Thanks for testing our voice application. <!-- 6 -->

We need your opinion about how this application worked for you. <!-- 7 -->

Rank this application on a scale of one to five, where <!-- 8 -->

one represents the worst and five represents the best. <!-- 9 -->

How easy was this system to learn? <!-- 11 -->

</prompt> <!-- 12 -->

<grammar type="application/grammar+xml" src = "scale.grammar"/> <!-- 13 -->

</field> <!-- 14 -->

 

<field name= "easy_to_use" > <!-- 15 -->

<prompt> <!-- 16 -->

On a scale of one to five, how easy did you find <!-- 17 -->

this application to use? <!-- 18 -->

</prompt> <!-- 19 -->

<grammar type="application/grammar+xml" src = "scale.grammar"/> <!-- 20 -->

</field> <!-- 21 -->

 

<field name= "help" > <!-- 23 -->

<prompt> <!-- 24 -->

How helpful were the help messages? <!-- 25 -->

</prompt> <!-- 26 -->

<grammar type="application/grammar+xml" src = "scale.grammar"/> <!-- 27 -->

</field> <!-- 28 -->

 

<field name= "understandability" > <!-- 29 -->

<prompt> <!-- 30 -->

Understandability? <!-- 31 -->

</prompt> <!-- 32 -->

<grammar type="application/grammar+xml" src = "scale.grammar"/> <!-- 33 -->

</field> <!-- 34 -->

 

<field name= "overall" > <!-- 35 -->

<prompt> <!-- 36 -->

Overall impression? <!-- 37 -->

</prompt> <!-- 38 -->

<grammar type="application/grammar+xml" src = "scale.grammar"/> <!-- 39 -->

</field> <!-- 40 -->

 

<!-- Save the scores onto a file for further analysis--> <!-- 41 -->

 

<block> <!-- 42 -->

<prompt> <!-- 43 -->

Your opinion will help us to decide this application's future. <!-- 44 -->

Thank you very much. <!-- 45 -->

</prompt> <!-- 46 -->

</block> <!-- 47 -->

</form> <!-- 48 -->

 

</vxml> <!-- 49 -->

 

using the following grammar at city.grammar:

<grammar type="application/grammar+xml" <!-- 1 -->

xml:lang = "en" root="destination"> <!-- 2 -->

<rule id = "destination" scope = "public"> <!-- 3 -->

<one-of> <!-- 4 -->

<item> <tag>"new_york"</tag> new york</item> <!-- 5 -->

<item> <tag>"new_york"</tag>big apple</item> <!-- 6 -->

<item> washington</item> <!-- 7 -->

<item> <tag>"washington"</tag>the capital</item> <!-- 8 -->

</one-of> <!-- 9 -->

</rule> <!-- 10 -->

</grammar> <!-- 11 -->

 

Figure 8.6: Tapered prompts
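The form in Figure 8.6 refers to a grammar file for the one-to-five scale, which the figure does not show. A minimal sketch of such a grammar in the W3C XML grammar format might look like the following. The rule name score and the exact vocabulary are assumptions, and the namespace and version attributes follow the W3C grammar specification rather than the abbreviated style of the listings in this chapter.

```xml
<?xml version="1.0"?>
<!-- Hypothetical scale grammar: accepts a single spoken rating from one to five. -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" mode="voice" root="score">
  <rule id="score" scope="public">
    <one-of>
      <item>one</item>
      <item>two</item>
      <item>three</item>
      <item>four</item>
      <item>five</item>
    </one-of>
  </rule>
</grammar>
```

Because every field in the form uses the same scale, one grammar file can serve all five fields, which keeps the tapered prompts consistent with what the recognizer accepts.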

Progressive assistance. Reveal additional information and instruction to the caller each time the caller is prompted for the same information. For example:

Figure 8.7 illustrates progressive prompts. If the caller fails to respond to the first-level prompt (lines 4-8), responds with an invalid key, or asks for help, the system presents a second-level prompt (lines 9-15). If the caller again fails to respond, responds with an invalid key, or asks for help, the system presents a third-level prompt (lines 16-23). On the next failure, the system transfers the caller to an operator. A <catch> element (lines 27-30) is used instead of a <prompt> element for level 4 because a <catch> element may contain executable content that leads to a transfer (represented here by <exit/>), while a <prompt> element may contain only content to present to the caller.

 

<?xml version="1.0"?> <!-- 1 -->

<vxml version="2.0"> <!-- 2 -->

 

<menu id="main-menu"> <!-- 3 -->

<prompt count = "1"> Main menu, which device? <!-- 4 -->

For temperature, press one. <!-- 5 -->

For lights, press two. <!-- 6 -->

For security, press three. <!-- 7 -->

</prompt> <!-- 8 -->

 

<prompt count = "2"> Main menu, <!-- 9 -->

To change your environment, select a device and then <!-- 10 -->

press the corresponding button on your telephone. <!-- 11 -->

To change the temperature, press one. <!-- 12 -->

To change lights, press two. <!-- 13 -->

To change security, press three. <!-- 14 -->

</prompt> <!-- 15 -->

<prompt count = "3"> This is the main menu. <!-- 16 -->

You may press a button on your telephone to select a device <!-- 17 -->

that affects your environment. Listen to the options and <!-- 18 -->

then press the corresponding button on your phone. <!-- 19 -->

To change the temperature, press one. <!-- 20 -->

To change lights, press two. <!-- 21 -->

To change security, press three. <!-- 22 -->

</prompt> <!-- 23 -->

 

<choice dtmf="1" next="#thermostat"/> <!-- 24 -->

<choice dtmf="2" next="#lights"/> <!-- 25 -->

<choice dtmf="3" next="#security"/> <!-- 26 -->

 

 

<catch event="nomatch noinput help" count = "4"> <!-- 27 -->

<prompt>Transferring you to an operator</prompt> <!-- 28 -->

 

<exit/> <!-- 29 -->

</catch> <!-- 30 -->

</menu> <!-- 31 -->

 

<!-- forms are the same as in figure 8.5 --> <!-- 32 -->

Figure 8.7: Progressive assistance
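Figure 8.7 ends the session with <exit/> where a real application would hand the caller to an operator. One way to sketch that hand-off, assuming a blind transfer is acceptable, is for the <catch> element to jump to a separate form that holds a <transfer> element. The form name operator, the variable name operator_call, and the telephone number are placeholders, and the menu is shortened to two options and two prompt levels.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <menu id="main-menu">
    <prompt count="1">
      Main menu, which device?
      For temperature, press one. For lights, press two.
    </prompt>
    <prompt count="2">
      Main menu. Select a device and then press the corresponding
      button on your telephone.
      To change the temperature, press one. To change lights, press two.
    </prompt>

    <choice dtmf="1" next="#thermostat"/>
    <choice dtmf="2" next="#lights"/>

    <!-- Fourth occurrence of one of these events: give up and hand off. -->
    <catch event="nomatch noinput help" count="4">
      <prompt>Transferring you to an operator.</prompt>
      <goto next="#operator"/>
    </catch>
  </menu>

  <!-- Hypothetical operator form; the destination number is a placeholder. -->
  <form id="operator">
    <transfer name="operator_call" dest="tel:+15555550100" bridge="false"/>
  </form>

  <!-- Stub forms standing in for the second-level menus. -->
  <form id="thermostat">
    <block>
      <prompt>processing thermostat</prompt>
      <goto next="#main-menu"/>
    </block>
  </form>

  <form id="lights">
    <block>
      <prompt>processing lights</prompt>
      <goto next="#main-menu"/>
    </block>
  </form>

</vxml>
```

Event counters are kept separately for each event name, so a handler written this way fires on the fourth noinput, the fourth nomatch, or the fourth help request rather than on the fourth failure of any kind.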

State what is expected. State what is expected from the caller, followed by specific options. In the VoiceXML dialog illustrated in Figure 8.5, the caller is told, in general terms, to choose an environmental control (line 5). Then, the caller is requested to select a specific option (lines 6-8). Novice callers will listen to the entire prompt. As they learn what the options are, experienced callers will barge in and select the desired option when they hear the first part of the prompt.

Wording of prompts. There are many guidelines for formulating prompts [Bruce Balentine and David P. Morgan, How to Build a Speech Recognition Application, San Ramon, CA: Enterprise Integration Group]. For example,

Many of the guidelines for wording prompts can be summarized by the following:

If people do not say a phrase in natural, day-to-day conversation, then the computer should not say the phrase during a dialog with a caller.

Focus the caller at the end of computer messages. Callers tend to focus on the last words of a sentence. Consider these two sentences:

Here comes the car.

The car comes here.

In the first sentence, the emphasis is on “car,” implying that the car is coming. In the second sentence, the emphasis is on “here,” implying the location to which the car is coming. In general, place the most important word (the word on which the caller should focus) at the end of the message. For example, suppose the caller enters a date like June 31. The error message

June has only thirty days.

is preferable to

There are only thirty days in June.

Small number of options. Because of the limitations of human short-term memory, G. A. Miller [G. A. Miller, “The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information,” Psychological Review, Vol. 63, 1956, pp. 81-97.] suggests that a person remembers only 7±2 chunks of information. Dialog designers should accommodate short-term memory by never exceeding nine options in a verbal menu. Many developers use prompts containing no more than five options.

Consistent use of keys. Developers should use the same telephone keypad button for similar options throughout the dialog document. For example, in Figure 8.5, key 9 is always used to return to the main menu (lines 22, 32, and 42).

Status sound. If the computer is taking more than a few seconds to perform a task, it should inform the caller that the system is working and has not abnormally ended or locked up. An audio icon such as a ticking clock, the grinding of gears, or some other appropriate sound or melody that indicates, “I’m busy” should be presented to the caller.
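In VoiceXML, one way to present a status sound is the fetchaudio attribute, which names audio to play while the browser waits for a document to be fetched. The sketch below assumes that the slow task is a request to a server; both URLs are placeholders, and the two-second delay is an arbitrary choice.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Wait two seconds before starting the status sound, so that
       fast responses are not preceded by a brief burst of audio. -->
  <property name="fetchaudiodelay" value="2s"/>

  <form id="warmer">
    <block>
      <prompt>Raising the temperature.</prompt>
      <!-- Placeholder URLs: the ticking clock plays while the request runs. -->
      <submit next="http://example.com/thermostat/warmer"
              fetchaudio="http://example.com/audio/ticking.wav"/>
    </block>
  </form>

</vxml>
```

The same audio can be applied to every fetch in a document by setting the fetchaudio property once instead of repeating the attribute on each element.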

Touchtone menus have been widely used to automate many of the tasks of a real-life telephone attendant or operator, enabling human attendants to spend more time dealing with complex or difficult calls. For many years, touchtones were the only available technology for automating telephone attendant tasks. This is changing with the improved performance of speech recognition.

8.3. ASR Menus

Menus with ASR—automatic speech recognition—overcome many of the problems with touchtone menus, but introduce new problems with speech recognition.

| Problems solved | Usage model | Requirements | Example application |
|---|---|---|---|
| Alphanumeric data entry | Nested sequences of system prompts and caller verbal responses | Conversational speech recognition engine | Telephone transactions |

Table 8.3: Summary of menus with ASR

Description

Menus with ASR enable the caller to speak rather than press touchtone buttons. The caller listens to a menu prompt and selects a choice by speaking. The cell phone or telephone can be held to the caller’s ear for the entire process; the caller does not need to move the phone away from his or her ear in order to select an option by pressing a key. Also, the caller does not need to translate his or her request into a sequence of key presses.

Example

With speech recognition, the main menu of Figure 8.5 could be rewritten as illustrated in Figure 8.8. Here, the caller can determine whether to speak or press a key during each prompt. This type of menu enables callers not familiar with voice-enabled applications to gradually switch from touch-tones to speech. It also provides a touchtone fallback for when callers cannot or will not speak into the telephone.

<?xml version="1.0"?> <!-- 1 -->

<vxml version="2.0"> <!-- 2 -->

<!-- 3 -->

<!-- Figure 8.8 Speech input version of Figure 8.5 --> <!-- 4 -->

<!-- 5 -->

<menu id="main"> <!-- 6 -->

<prompt bargein="true"> <!-- 7 -->

Main menu. Choose the environmental control you want to modify. <!-- 8 -->

You can change the temperature, lights, or security. <!-- 9 -->

</prompt> <!-- 10 -->

<choice next="#thermostat"> <!-- 11 -->

<grammar type="application/grammar+xml" root="temp"> <!-- 12 -->

<rule id = "temp" scope = "public"> <!-- 13 -->

<item>temperature</item> <!-- 14 -->

</rule> <!-- 15 -->

</grammar> <!-- 16 -->

</choice> <!-- 17 -->

<!-- 18 -->

<choice next="#lights"> <!-- 19 -->

<grammar type="application/grammar+xml" root="lighting"> <!-- 15 -->

<rule id = "lighting" scope = "public"> <!-- 20 -->

<item>lights</item> <!-- 21 -->

</rule> <!-- 22 -->

</grammar> <!-- 23 -->

</choice> <!-- 24 -->

<choice next="#security_control"> <!-- 25 -->

<grammar type="application/grammar+xml" <!-- 26 -->

root="security_system" > <!-- 27 -->

<rule id = "security_system" scope = "public"> <!-- 28 -->

<item>security</item> <!-- 29 -->

</rule> <!-- 30 -->

</grammar> <!-- 31 -->

</choice> <!-- 32 -->

</menu> <!-- 33 -->

<!-- 34 -->

<menu id= "thermostat"> <!-- 35 -->

<prompt bargein="true"> Thermostat. <!-- 36 -->

Do you want to make the temperature warmer or cooler? <!-- 37 -->

</prompt> <!-- 38 -->

<choice next="#warmer"> <!-- 39 -->

<grammar type="application/grammar+xml" root="temp_up" > <!-- 40 -->

<rule id = "temp_up" scope = "public"> <!-- 41 -->

<item>warmer</item> <!-- 42 -->

</rule> <!-- 43 -->

</grammar> <!-- 44 -->

</choice> <!-- 45 -->

<choice next="#cooler"> <!-- 46 -->

<grammar type="application/grammar+xml" root="temp_down"> <!-- 47 -->

<rule id = "temp_down" scope = "public"> <!-- 48 -->

<item>cooler</item> <!-- 49 -->

</rule> <!-- 50 -->

</grammar> <!-- 51 -->

</choice> <!-- 52 -->

<choice next="#main"> <!-- 53 -->

<grammar type="application/grammar+xml" root="again"> <!-- 54 -->

<rule id = "again" scope = "public" > <!-- 55 -->

<item>main menu</item> <!-- 56 -->

</rule> <!-- 57 -->

</grammar> <!-- 58 -->

</choice> <!-- 59 -->

<!-- 60 -->

</menu> <!-- 61 -->

<!-- 62 -->

<menu id= "lights"> <!-- 63 -->

<prompt bargein="true"> Lights. <!-- 64 -->

Do you want to turn the lights on or off? <!-- 65 -->

</prompt> <!-- 66 -->

<choice next="#lights_on"> <!-- 67 -->

<grammar type="application/grammar+xml" root="on" > <!-- 68 -->

<rule id = "on" scope = "public"> <!-- 69 -->

<item>on</item> <!-- 70 -->

</rule> <!-- 71 -->

</grammar> <!-- 72 -->

</choice> <!-- 73 -->

<choice next="#lights_off"> <!-- 74 -->

<grammar type="application/grammar+xml" root="off"> <!-- 75 -->

<rule id = "off" scope = "public"> <!-- 76 -->

<item>off</item> <!-- 77 -->

</rule> <!-- 78 -->

</grammar> <!-- 79 -->

</choice> <!-- 80 -->

<choice next="#main"> <!-- 81 -->

<grammar type="application/grammar+xml" root="main2" > <!-- 82 -->

<rule id = "main2" scope = "public"> <!-- 83 -->

<item>main menu</item> <!-- 84 -->

</rule> <!-- 85 -->

</grammar> <!-- 86 -->

</choice> <!-- 87 -->

</menu> <!-- 88 -->

<!-- 89 -->

<menu id= "security_control"> <!-- 90 -->

<prompt bargein="true"> Security. <!-- 91 -->

Do you want to turn the security system on or off? <!-- 92 -->

</prompt> <!-- 93 -->

<choice next="#security_on"> <!-- 94 -->

<grammar type="application/grammar+xml" root="turn_on"> <!-- 95 -->

<rule id = "turn_on" scope = "public"> <!-- 96 -->

<item>on</item> <!-- 97 -->

</rule> <!-- 98 -->

</grammar> <!-- 99 -->

</choice> <!-- 100 -->

<choice next="#security_off"> <!-- 101 -->

<grammar type="application/grammar+xml" root="turn_off"> <!-- 102 -->

<rule id = "turn_off" scope = "public"> <!-- 103 -->

<item>off</item> <!-- 104 -->

</rule> <!-- 105 -->

</grammar> <!-- 106 -->

</choice> <!-- 107 -->

<choice next="#main"> <!-- 108 -->

<grammar type="application/grammar+xml" root="to_main"> <!-- 109 -->

<rule id = "to_main" scope = "public"> <!-- 110 -->

<item>main menu</item> <!-- 111 -->

</rule> <!-- 112 -->

</grammar> <!-- 113 -->

</choice> <!-- 114 -->

</menu> <!-- 115 -->

<!-- forms are the same as in Figure 8.5 -->

Figure 8.8: Menu with speech recognition

The dialog document illustrated in Figure 8.8 has the same dialog structure as the touchtone dialog document illustrated in Figure 8.5, and it has the same problems: it is time-consuming, inflexible, and confusing. These problems are overcome with the revised dialog document illustrated in Figure 8.9.

<?xml version="1.0"?> <!-- 1 -->

<vxml version="2.0"> <!-- 2 -->

<!-- 3 -->

<!-- Figure 8.9 Voice menu that compresses menu hierarchy of Figure 8.5 --> <!-- 4 -->

<!-- 5 -->

<menu id="main"> <!-- 6 -->

<prompt bargein="true">Main menu. <!-- 7 -->

What action? Warmer, cooler, lights on, lights off, security on, <!-- 8 -->

or security off? <!-- 9 -->

</prompt> <!-- 10 -->

<help> <!-- 11 -->

To change temperature, say warmer or cooler. <!-- 12 -->

To change the lighting, say lights on or lights off. <!-- 13 -->

To change the security, say security on or security off. <!-- 14 -->

</help> <!-- 15 -->

<choice next="#warmer"> <!-- 16 -->

<grammar type="application/grammar+xml" root="temp_up" > <!-- 17 -->

<rule id = "temp_up" scope = "public"> <!-- 18 -->

<item>warmer</item> <!-- 19 -->

</rule> <!-- 20 -->

</grammar> <!-- 21 -->

</choice> <!-- 22 -->

<choice next="#cooler"> <!-- 23 -->

<grammar type="application/grammar+xml" root="temp_down"> <!-- 24 -->

<rule id = "temp_down" scope = "public"> <!-- 25 -->

<item>cooler</item> <!-- 26 -->

</rule> <!-- 27 -->

</grammar> <!-- 28 -->

</choice> <!-- 29 -->

<choice next="#lights_on"> <!-- 30 -->

<grammar type="application/grammar+xml" root="on" > <!-- 31 -->

<rule id = "on" scope = "public"> <!-- 32 -->

<item>lights on</item> <!-- 33 -->

</rule> <!-- 34 -->

</grammar> <!-- 35 -->

</choice> <!-- 36 -->

<choice next="#lights_off"> <!-- 37 -->

<grammar type="application/grammar+xml" root="off"> <!-- 38 -->

<rule id = "off" scope = "public"> <!-- 39 -->

<item>lights off</item> <!-- 40 -->

</rule> <!-- 41 -->

</grammar> <!-- 42 -->

</choice> <!-- 43 -->

<choice next="#security_on"> <!-- 44 -->

<grammar type="application/grammar+xml" root="turn_on"> <!-- 45 -->

<rule id = "turn_on" scope = "public"> <!-- 46 -->

<item>security on</item> <!-- 47 -->

</rule> <!-- 48 -->

</grammar> <!-- 49 -->

</choice> <!-- 50 -->

<choice next="#security_off"> <!-- 51 -->

<grammar type="application/grammar+xml" root="turn_off"> <!-- 52 -->

<rule id = "turn_off" scope = "public"> <!-- 53 -->

<item>security off</item> <!-- 54 -->

</rule> <!-- 55 -->

</grammar> <!-- 56 -->

</choice> <!-- 57 -->

</menu> <!-- 58 -->

 

<!-- forms are the same as in Figure 8.5 -->

Figure 8.9: Voice menu that compresses the menu hierarchy of Figure 8.5

The voice menu of Figure 8.9 compresses the two levels of menus from Figure 8.5 into a single menu. This is possible because callers no longer need to listen to the menu options and translate their choices into key presses. Instead, callers just say a word or two to instruct the voice-enabled application what to do.

This menu avoids the rigid structure of the touchtone menu in Figure 8.5. Rather than listening to the multiple levels of menus, the caller listens to a single menu and then instructs the system what to do by speaking one or two words. By using barge-in, experienced callers can bypass the single prompt to say the option desired.

If the user has trouble, and asks for “help,” the VoiceXML browser executes the help event handler (lines 11-15) to present additional instruction (lines 12-14) to the user.

Guidelines

Designing menus for voice response systems is different from designing menus for graphical user interfaces or for touchtone responses. Here are some guidelines.

Compress multilevel menus. While it is always a good idea to minimize the number of options at each point in the dialog, menu compression decreases the number of menus to which a caller must respond. Figure 8.5 illustrates a menu with two levels that are compressed into one level in Figure 8.9. While the number of options in the compressed menu is large, knowledgeable callers can barge in without having to listen to the complete set of verbal options.

Take advantage of symmetry. A general principle called symmetry suggests that callers will respond by speaking in the same style as the prompt. People are natural mimics. If the prompt is long, then the caller’s response tends to be long. If the prompt contains a specific word as an option, the caller will likely speak the word if the caller wants to select the corresponding option. And, most importantly, a short prompt will result in a short response. Short responses consisting of a word or a short phrase are much easier for the speech recognition engine to recognize than a long phrase or a sentence.

Allow alternate wording. Alternate wording should be included in the grammar for callers who can remember the option, but not the exact option name. For example, callers could say “quit” or “stop” rather than the option name “exit” to exit an application. However, only the most frequently used word from the set of alternative words needs to be included in the prompt.
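As a minimal sketch of this guideline, the fragment below lets one choice accept three wordings while the prompt mentions only the most common one. The menu id `leave`, the target `#goodbye`, and the rule name `leave_words` are hypothetical names chosen for illustration; the grammar follows the same inline style as the figures in this chapter.

```xml
<menu id="leave">
  <!-- The prompt names only the most frequently used word -->
  <prompt>Say exit when you are finished.</prompt>
  <choice next="#goodbye">
    <!-- The grammar also accepts alternates the caller may remember instead -->
    <grammar type="application/grammar+xml" version="1.0" root="leave_words">
      <rule id="leave_words" scope="public">
        <one-of>
          <item>exit</item>
          <item>quit</item>
          <item>stop</item>
        </one-of>
      </rule>
    </grammar>
  </choice>
</menu>
```

Keeping the alternates in the grammar but out of the prompt keeps the prompt short, which matches the symmetry guideline above.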

Guide the caller. Dialog document designers must create documents that guide the caller towards responses that the speech recognition engine can understand and to which the dialog manager can respond.

Avoid “open-ended prompts” such as “what would you like to do?” Open-ended prompts raise the caller’s expectations beyond what VoiceXML interpreters can handle. It is much better to use “suggestive” prompts that guide the caller to speak words that the recognition engine can recognize. Many callers, especially inexperienced callers, feel more comfortable being told what to do rather than guessing how to respond to a vague prompt. Experienced callers, who are more confident, will barge in to speed up the dialog.

Inform the caller of the words the caller must speak at each point in the dialog. This aids the novice caller by enumerating the options. Experienced callers may barge in before this prompt finishes, so the list of options does not slow them down. Either enumerate the words that the user may speak (“How do you want to travel? By air, boat, or car?”) or identify a class of words known to the user (e.g., days of the week).

Align prompts and grammars. Words enumerated or suggested by the prompt should be specified in the grammar. For example, if the prompt is “What do you want to do?” then “listen to my messages” should be covered by the grammar. However, if the prompt is “What can I do for you?” then “play my messages” should be covered by the grammar.

Avoid similar-sounding options. Developers should try to choose options that sound different from each other, yet still represent the action they invoke. For example, the second syllables of “Delete” and “Repeat” sound very similar, as do “Backup” and “Hang-up.” Avoid using pairs of commands that sound similar and that share many vowels or syllables. This will help the speech recognition engine differentiate among possible caller responses and reduce the number of word recognition errors.

General format for a prompt in a menu. Using a consistent format for prompts will help callers select options faster. A menu prompt may consist of the following sequence:

  1. Speak the menu name. For important menus, the dialog designer includes the menu name in the menu’s prompt. The menu name serves as a landmark. A landmark is a speech or non-speech cue that marks a specific location within the dialog structure. By providing the menu name such as “main menu,” or “thermostat,” callers may jump to this menu by speaking the menu name, or they may return to this menu if they get confused or lost. Repeating the menu name to the caller confirms that the caller has reached the menu.
  2. Ask a question. Often this can be achieved with two or three words. In Figure 8.9, the question is “What action?” This should be enough for experienced callers to say one of the commands without listening to the enumerated options. Novice callers will listen to the enumerated options before speaking their selection.
  3. Enumerate options. List the options so novice callers can hear and select the desired options.
  4. Make additional help available. Make sure additional details are available through event handlers if the caller asks for help.

Use error handlers to help the caller. When errors occur, the application responds according to instructions specified by the dialog designer in event handlers, which are snippets of VoiceXML code designed for dealing with errors. Error handling is extremely important. Some estimate that dialog designers spend more than half of their coding effort developing effective error handlers.

VoiceXML detects several types of caller errors. It is possible to write progressive assistance for each of these types of errors:

Callers will tolerate only a limited number of failures before they give up and abandon the application. The system should connect the caller to a human operator before this happens. Only user testing will determine the approximate number of caller failures before the caller gives up. Adjust the prompts so the caller is transferred to the operator before the caller experiences this number of failures.
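Progressive assistance and an operator hand-off can be sketched with the `count` attribute of the standard `nomatch` and `noinput` event handlers. The fragment below is an illustration under stated assumptions: the field name `action`, the threshold of three failures, and the `#operator` target (a hypothetical form that would perform the transfer) are not from the figures above, and the right threshold should come from user testing.

```xml
<field name="action">
  <prompt>What action? Warmer or cooler?</prompt>
  <grammar type="application/grammar+xml" version="1.0" root="temp">
    <rule id="temp" scope="public">
      <one-of>
        <item>warmer</item>
        <item>cooler</item>
      </one-of>
    </rule>
  </grammar>
  <!-- First recognition failure: short apology, then repeat the prompt -->
  <nomatch count="1">
    <prompt>Sorry, I did not understand.</prompt>
    <reprompt/>
  </nomatch>
  <!-- Second failure: spell out exactly what the caller can say -->
  <nomatch count="2">
    <prompt>Please say either warmer or cooler.</prompt>
  </nomatch>
  <!-- Third failure (assumed threshold): hand off before the caller gives up -->
  <nomatch count="3">
    <prompt>Let me connect you to an operator.</prompt>
    <goto next="#operator"/>
  </nomatch>
  <!-- Silence is handled separately from misrecognition -->
  <noinput count="1">
    <prompt>I did not hear anything.</prompt>
    <reprompt/>
  </noinput>
  <noinput count="3">
    <prompt>Let me connect you to an operator.</prompt>
    <goto next="#operator"/>
  </noinput>
</field>
```

Because the interpreter selects the handler with the highest `count` that does not exceed the current event count, the second silence reuses the `count="1"` handler here.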

Avoid options that contain the same word. By using words that sound different, the speech recognition engine will be able to differentiate between spoken words. This reduces the number of recognition errors. For example, replace the following prompt:

<prompt>

choose from among the following:

salmon fish

trout fish

roast beef

ground beef

</prompt>

 

with:

 

<prompt>

choose from among the following:

salmon

trout

roast beef

hamburger

</prompt>

 

Avoid noisy and low-energy words. Developers try to avoid noisy options, that is, options containing noise-like sounds such as the unvoiced fricatives /f/ and /s/. Commands containing these sounds are difficult for many speech recognizers to recognize. Developers also try to avoid low-energy commands, whose sounds are quieter than other spoken sounds. The nasals /m/ and /n/ and many other consonants are low energy. Developers generally try to avoid commands that contain a large percentage of low-energy sounds.

Avoid hyperarticulation. When the application repeatedly fails to understand the caller’s responses, the caller may become annoyed, frustrated, and tense. Unfortunately, this often causes the caller’s facial muscles to tighten and changes the characteristics of the caller’s speech. In turn, this makes it more difficult for the ASR to recognize what the caller is saying, causing even more frustration and additional change in the caller’s speech pattern. The caller may slowly and carefully pronounce each syllable, an effect called hyperarticulation. Hyperarticulation may help people to better understand what is said, but it does not help a speech recognition engine. In fact, it makes it more difficult for the engine because the caller’s pronunciation differs from the acoustic model. Ways of dealing with hyperarticulation and other changes to the caller’s voice include:

8.4. Forms

Voice-enabled forms are a variation on the well-established paper and GUI-based forms. Forms are used to collect and validate data. They also may be used to specify the parameters for a transaction or specify the constraints of a query.

Problems solved: Forms: data entry

Usage model: Sequences of system prompts and caller responses

Requirements: Conversational speech recognition engine

Example application: Data entry, e.g., patient history, credit card application

Table 8.4: Summary of forms

Description

A form is a collection of fields with prompts, grammars, and event handlers. The prompts encourage callers to speak a value into a field. The grammar describes the valid values that callers may speak into a field. Event handlers specify what to do when an error occurs.

Menus and forms serve different purposes. Voice-enabled application developers use menus to select from a prescribed set of options. Callers must select from the menu before continuing with the task. However, forms solicit values for multiple fields. The caller supplies values for each field, one field at a time, in the sequence listed.

Example

Figure 5.4 illustrates an example form used to solicit parameter values for a transaction. Lines 13-24 constitute the “recipient” field. Lines 26-43 constitute the “amount” field. Lines 45-59 make up the “validation” field. The prompt in lines 55-58 uses the values entered by the caller into the “recipient” and “amount” fields. The filled element (lines 61-75) contains instructions that are executed when the caller has supplied values for all three fields. If the caller does not confirm the prompt in lines 55-58 and sets the validation field to “false,” then the variables are cleared (line 66) of any values. If the caller does confirm the prompt in lines 55-58 by setting the validation field to “true,” then the values for “recipient” and “amount” are submitted to the transaction system (line 75) and the caller is notified that the payment has been e-mailed (lines 70-73).

Guidelines

It is always good practice to apply human factor guidelines to improve the usability of forms.

Landmark. Inform the caller of the form name by including it in a prompt within a block as the first element of a form. Blocks are typically executed just once per form invocation. Landmarks reassure the caller that he or she has arrived at the desired form. The landmark also sets the stage for the actions the caller will perform while within the form. In some circumstances, callers can speak the landmark name when they want to jump to the form from elsewhere in the speech application.

Field sequence. Order the sequence of fields within a form to be convenient for callers who will enter values into the form. For example, if callers frequently obtain information from paper forms, sequence verbal form fields to follow the same sequence of fields as the corresponding paper form.

Simple fields. Break complex fields into multiple simple fields. For example, break a date field into three fields: day, month, and year. This avoids problems such as whether the caller should enter the values in month-day-year, day-month-year, or year-month-day sequence. (Note that because many applications solicit dates from their callers, a special built-in grammar for dates has been made part of VoiceXML.) However, many small fields can result in very tedious dialogs, which should be avoided.
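The built-in date grammar mentioned above can be requested through the `type` attribute of a field. The sketch below assumes a platform that supports the built-in types; the field name `when` and the prompt wording are hypothetical.

```xml
<!-- One field collects the whole date through the built-in date grammar -->
<field name="when" type="date">
  <prompt>On what date should the payment be made?</prompt>
  <filled>
    <!-- The built-in date type returns a string of the form yyyymmdd -->
    <prompt>Scheduling the payment for <value expr="when"/>.</prompt>
  </filled>
</field>
```

A single date field like this is one way to avoid the tedium of three separate day, month, and year fields while still leaving the ordering question to the platform's grammar.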

Simple, closed questions. Phrase each question to be simple to answer. The answers are closed: either

Confirmation. If confidence is low, the result is critical, or the request will trigger irreversible actions, explicitly ask the caller to confirm entered values. For example, in Figure 8.10, it is critical that the application correctly understands both the amount and recipient. Thus, the developer used an explicit confirmation (lines 14-15) before proceeding with the transaction. If confidence is low and the result is not critical, then an implicit confirmation is appropriate. As illustrated in Figure 8.11, an implicit or forward-feeding confirmation informs the caller of the value understood by the speech recognition engine. The amount field prompt (lines 30-32) informs the caller who the recipient is as part of the prompt for amount. The caller implicitly confirms the recipient when answering the question “How much do you want to pay?” Likewise, the validate field prompt (lines 62-65) informs the caller of both the amount and the recipient. The caller implicitly confirms the amount by answering the question. If the caller responds by saying “no,” the values are reset to empty (lines 53 and 77) and the fields will be executed again by the VoiceXML interpreter.

<?xml version="1.0"?> <!-- 1 -->

<vxml version="2.0"> <!-- 2 -->

<!-- 3 -->

<!-- Figure 8.11 Forward-feeding prompts --> <!-- 4 -->

<!-- 5 -->

<form> <!-- 6 -->

<!-- 7 -->

<block> <!-- 8 -->

<prompt> <!-- 9 -->

You can say no at any time to change what you have spoken <!-- 10 -->

</prompt> <!-- 11 -->

</block> <!-- 12 -->

<!-- 13 -->

<field name="recipient"> <!-- 14 -->

<prompt> <!-- 15 -->

Whom do you want to pay? <!-- 16 -->

</prompt> <!-- 17 -->

<grammar type="application/grammar+xml" version="1.0" root="payee"> <!-- 18-19 -->

<rule id = "payee" scope = "public"> <!-- 20 -->

<one-of> <!-- 21 -->

<item> ajax</item> <!-- 22 -->

<item>superstore</item> <!-- 23 -->

</one-of> <!-- 24 -->

</rule> <!-- 25 -->

</grammar> <!-- 26 -->

</field> <!-- 27 -->

<!-- 28 -->

<field name="amount"> <!-- 29 -->

<prompt> <!-- 30 -->

How much do you want to pay <value expr="recipient"/>? <!-- 31 -->

</prompt> <!-- 32 -->

<grammar type="application/grammar+xml" version="1.0" root="amount"> <!-- 33-34 -->

<rule id = "amount" scope = "public"> <!-- 35 -->

<one-of> <!-- 36 -->

<item>ten</item> <!-- 37 -->

<item>twenty</item> <!-- 38 -->

<item>thirty</item> <!-- 39 -->

<item>forty</item> <!-- 40 -->

<item>fifty</item> <!-- 41 -->

<item>sixty</item> <!-- 42 -->

<item>seventy</item> <!-- 43 -->

<item>eighty</item> <!-- 44 -->

<item>ninety</item> <!-- 45 -->

<item>one hundred</item> <!-- 46 -->

<item>no</item> <!-- 47 -->

</one-of> <!-- 48 -->

</rule> <!-- 49 -->

</grammar> <!-- 50 -->

<filled> <!-- 51 -->

<if cond="amount == 'no'"> <!-- 52 -->

<clear namelist = "recipient amount"/> <!-- 53 -->

<prompt> <!-- 54 -->

Please say the recipient's name again <!-- 55 -->

</prompt> <!-- 56 -->

</if> <!-- 57 -->

</filled> <!-- 58 -->

</field> <!-- 59 -->

<!-- 60 -->

<field name="validate"> <!-- 61 -->

<prompt> <!-- 62 -->

Do you want to pay <value expr="amount"/> <!-- 63 -->

to <value expr="recipient"/>? <!-- 64 -->

</prompt> <!-- 65 -->

<grammar type="application/grammar+xml" version="1.0" root="yes_no"> <!-- 66-67 -->

<rule id = "yes_no" scope = "public"> <!-- 68 -->

<one-of> <!-- 69 -->

<item>yes</item> <!-- 70 -->

<item>no</item> <!-- 71 -->

</one-of> <!-- 72 -->

</rule> <!-- 73 -->

</grammar> <!-- 74 -->

<filled> <!-- 75 -->

<if cond = "validate == 'no'"> <!-- 76 -->

<clear namelist="recipient amount validate"/> <!-- 77 -->

<prompt> <!-- 78 -->

Sorry, please say the recipient and amount again. <!-- 79 -->

</prompt> <!-- 80 -->

<else/> <!-- 81 -->

<prompt> <!-- 82 -->

I will send a check to <value expr="recipient"/>, <!-- 83 -->

in the amount of <value expr="amount"/> <!-- 84 -->

</prompt> <!-- 85 -->

</if> <!-- 86 -->

</filled> <!-- 87 -->

</field> <!-- 88 -->

</form> <!-- 89 -->

<!-- 90 -->

</vxml> <!-- 91 -->

Figure 8.11: Forward-feeding prompts

Sometimes it is possible to verify a value without repeating a possibly erroneous value to the caller and asking the caller to confirm it. Instead, ask the caller a different question whose answer confirms the value. For example, after soliciting a street address, ask the caller for the ZIP code. Then look up the street address and validate that the area covered by the ZIP code includes the street address. Similar checking can be done to verify that a city name uttered by the caller is located in the state or country uttered by the caller.

Error detection. Try to write documents so errors are detected and resolved as soon as possible after the caller enters a value. It is much easier for a caller to correct the error at the time the error is made, rather than hours or days afterwards when the caller may not remember the context or have supporting documents or artifacts available.

The first line of error detection is the use of grammars. For example, a grammar specifies digits 1 through 31 for days of a month. The second line of defense is executable code. For example, “if then” logic rejects a day value of 31 for months with only 30 days and rejects 29, 30, and 31 as values of days for the month of February except for leap years when it rejects day values of 30 and 31.
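The two lines of defense can be sketched as follows. This is an illustration only: the form id `get_date`, the field names, and the helper function `isValidDay` are hypothetical, and the built-in `number` type (assumed to be supported by the platform) stands in for a hand-written grammar that would restrict days to 1 through 31.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="get_date">
    <script>
      <![CDATA[
        // Hypothetical helper: true when the day exists in that month and year.
        function isValidDay(day, month, year) {
          var leap = (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
          var daysIn = [31, leap ? 29 : 28, 31, 30, 31, 30,
                        31, 31, 30, 31, 30, 31];
          return month >= 1 && month <= 12 &&
                 day >= 1 && day <= daysIn[month - 1];
        }
      ]]>
    </script>
    <!-- First line of defense: each field's grammar limits what is accepted -->
    <field name="month" type="number">
      <prompt>What month? Say a number from one to twelve.</prompt>
    </field>
    <field name="day" type="number">
      <prompt>What day of the month?</prompt>
    </field>
    <field name="year" type="number">
      <prompt>What year?</prompt>
    </field>
    <!-- Second line of defense: runs once all three fields hold values -->
    <filled namelist="month day year" mode="all">
      <if cond="!isValidDay(Number(day), Number(month), Number(year))">
        <prompt>That date is not on the calendar. Please say the day again.</prompt>
        <!-- Clearing only the day makes the interpreter revisit that field -->
        <clear namelist="day"/>
      </if>
    </filled>
  </form>
</vxml>
```

Checking the date inside the form, immediately after the last field is filled, lets the caller correct the value while the context is still fresh.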

Error recovery. Avoid blaming the caller, which can make the caller feel discouraged or begin to resent the system. For example, rather than saying:

<prompt count = "2">

You said February 30. This is not a valid date. Please restate the date.

</prompt>

 

Use the following:

<prompt count = "2">

February 30 is not a valid date. Please restate the date.

</prompt>

Application-directed dialog documents are suitable for novice callers who need to be guided and prompted. Experienced callers may use another class of dialogs called user-directed dialog documents. These dialog documents are used by experienced callers who know precisely how to verbalize what they want to do. User-directed dialog documents are discussed in the next chapter.

8.5. Key Concepts

Critical factors in dialog design include designer experience and usability testing.

Application-directed dialogs are ideal for novice callers, who need to be directed and led through the application. Application-directed dialogs include touchtone menus, ASR menus, and forms.

Touchtone menus are widely used in IVR systems, but are gradually being replaced by ASR menus. However, touchtone input is still used in ASR menus when the speech recognition engine fails or when the caller desires privacy.

Forms are a variation of the well-established paper and GUI-based forms used to collect and validate data.