Should You Speech-enable Your Web Site?

Consider speech-enabling your Web site so that callers can access your Web site almost instantly at any time from wherever they are without being placed on hold. Explore this new technology in this first of a five-part series.

You are on the patio grilling steaks for some friends, when the conversation turns to baseball and the recent success of your home team. Someone wonders when the next home game is. How do you find out? Access the Internet, of course. Here are two possible scenarios:

Scenario 1: You leave your steaks on the grill, walk upstairs to your den, turn on your PC, watch the operating system start up, invoke your browser, search for the local team's home page, browse the page, and learn that your home team is playing a home game that evening. Then, you turn off the PC, walk back to the patio only to see a bright yellow flame consume your crispy-black steaks.

Scenario 2: You pull your cell phone out of your pocket, dial the number of a voice portal, answer a couple of simple questions, become connected to your home team's verbal Web site, answer several more questions, and learn that the team is playing a home game that evening. Then, you dish up your medium-done steak, enjoy the meal, and discuss attending the game in a few hours.

Even disregarding the steak, most people prefer Scenario 2. The telephone can be used to access the Internet—from anywhere, at any time the caller needs information, and without being placed on "hold." With the Internet just a phone call away, the cell phone, telephone, and new devices resulting from the convergence of cell phones with PDAs will become the Internet terminals of choice.

Most users currently use a PC to access the World Wide Web. And most businesses have Web sites designed to be accessed by the two most popular PC Web browsers: Netscape Navigator and Microsoft Internet Explorer. But there are problems with using the PC to access the Web:

There is an alternative to the "point and click" of a PC's GUI browser...voice.

Use Your Phone to Talk and Listen to a Computer

Instead of pointing and clicking, the caller speaks and listens to a Web site using a telephone or a cell phone. The caller dials the number of a "gateway" that connects the telephony world with the IP world of the Internet. The gateway connects the caller to a speech server that interprets speech applications, using speech recognition technology to understand what the caller says, and uses speech synthesis to produce synthetic voices that the caller hears.

Why voice? We spent the first two years of our lives learning how to talk and understand what others say. Now, it's second nature for us to speak and listen.

Should you speech-enable your Web site? Your Web site is just a phone call away. Callers can access your Web site almost instantly at any time from wherever they are without being placed on hold.

What Callers Can Do with a Single Phone Call

This is what callers can do with a single phone call:

Speaking and listening on a telephone can do all these.

However, speech has its drawbacks. Speech is not easy to edit. Listening to speech for long periods can be tiring. Callers have trouble remembering more than a few details delivered by voice. Speech is difficult for callers to hear in a noisy environment, and noise also makes it difficult for speech-recognition technology to understand what callers say. In addition, because most people can read faster than they can listen, a speech interface has a lower bandwidth than a graphical user interface.

Approaches for Speech-enabling Your Web Site

To speech-enable your Web site, you can follow any of three approaches:

  1. Use a speech synthesis engine to read the contents of an existing Web page to the caller. However, listening to a Web page that's designed to be viewed on a screen can be time-consuming, tedious, and boring. Even if the caller can "fast forward" and "skip backward," the experience is similar to hunting for a segment of a show recorded on videotape. Not fun.

  2. Add speech to an existing Web page. This enables the page to speak to the user and listen to the user speak. This approach is said to be multimodal because the user can interact with a visual display, as well as speak and listen. PCs can support multimodal user interfaces, but current telephones and most cell phones cannot. Not yet. New devices that integrate the functions of both cell phones and PDAs are starting to appear.


    Article 4 of this series will address multimodal user interfaces, and their advantages and disadvantages.

  3. Develop a speech-only user interface to your Web site. Users can call your Web site from their existing telephones and cell phones. VoiceXML, a language implemented by more than 50 software and platform vendors, is designed specifically for developing voice interfaces to Web sites. A new collection of tags, Speech Application Language Tags (SALT), is available to implement either speech-only or multimodal applications.


    Article 5 of this series will address the advantages and disadvantages of the VoiceXML and SALT approaches for developing speech applications.
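To give a feel for the third approach, here is a minimal sketch in Python that generates a one-question VoiceXML 2.0 document. The element names (form, field, prompt, grammar, noinput, nomatch, reprompt, filled) come from the VoiceXML 2.0 specification. The form id, the field name, the prompt wording, and the grammar file name teams.grxml are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def build_schedule_dialog(grammar_url="teams.grxml"):
    """Build a one-question VoiceXML 2.0 document and return it as a string.

    The form id, field name, prompt wording, and grammar file name are
    hypothetical; they are not taken from any real voice portal.
    """
    vxml = ET.Element("vxml", {
        "version": "2.0",
        "xmlns": "http://www.w3.org/2001/vxml",
    })
    form = ET.SubElement(vxml, "form", {"id": "schedule"})
    field = ET.SubElement(form, "field", {"name": "team"})

    # Prompt: text that the speech synthesizer reads to the caller.
    ET.SubElement(field, "prompt").text = "Which team are you asking about?"

    # Grammar: tells the recognizer which replies to listen for.
    ET.SubElement(field, "grammar", {
        "src": grammar_url,
        "type": "application/srgs+xml",
    })

    # Event handlers: run when the caller is silent or is not understood.
    noinput = ET.SubElement(field, "noinput")
    noinput.text = "Sorry, I did not hear you."
    ET.SubElement(noinput, "reprompt")
    nomatch = ET.SubElement(field, "nomatch")
    nomatch.text = "Sorry, I did not understand."
    ET.SubElement(nomatch, "reprompt")

    # Filled: runs once the recognizer has matched a reply.
    filled = ET.SubElement(field, "filled")
    ET.SubElement(filled, "prompt").text = "Looking up the next home game."

    return ET.tostring(vxml, encoding="unicode")


document = build_schedule_dialog()
print(document)
```

A real application would also need the grammar file itself and server-side logic to look up the schedule. Both are omitted here.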

A speech interface is very different from a graphical user interface. New skills are required to specify prompts that encourage callers to speak, design grammars that describe how the caller may respond, and specify event handlers in case the caller fails to respond to prompts appropriately. New hardware is needed to connect the telephone system to your server, as is new software for speaking, listening, and managing dialogs between callers and applications.


Articles 2 and 3 of this series will deal with designing, building, and testing voice interfaces.
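The roles of prompts, grammars, and event handlers can be illustrated with a small simulation. The following is a rough sketch only: it uses typed text in place of audio, treats the grammar as a plain set of accepted replies, and uses None to stand for silence. A real recognizer works on audio and scores likely matches rather than testing for exact equality. The function name run_field, the wording of the messages, and the three-attempt limit are assumptions made for this example.

```python
def run_field(prompt, grammar, caller_replies, max_attempts=3):
    """Simulate one question of a voice dialog, with text standing in for audio.

    prompt         -- what the synthesized voice would say
    grammar        -- set of replies the recognizer accepts (lowercase)
    caller_replies -- scripted caller replies; None stands for silence
    Returns (recognized value or None, list of everything the system said).
    """
    said = [prompt]
    replies = iter(caller_replies)
    for _ in range(max_attempts):
        reply = next(replies, None)
        if reply is None:
            # "No input" event: the caller said nothing, so ask again.
            said.append("Sorry, I did not hear you. " + prompt)
            continue
        normalized = reply.strip().lower()
        if normalized in grammar:
            # The reply matches the grammar, so the question is answered.
            return normalized, said
        # "No match" event: the caller said something outside the grammar.
        said.append("Sorry, I did not understand. " + prompt)
    # Too many failed attempts: give up gracefully.
    said.append("Let me transfer you to an operator.")
    return None, said


choice, said = run_field(
    "Say schedule, scores, or tickets.",
    {"schedule", "scores", "tickets"},
    [None, "weather", "Scores"],
)
print(choice)
for line in said:
    print(line)
```

In this scripted call the caller is silent once, then says something outside the grammar, and is understood on the third attempt.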

Speaking and listening are not the only ways that callers access the Internet using a phone. Many businesses already use interactive voice response (IVR) systems that replay files of prerecorded questions to which callers respond by pressing the keys on their touchtone phones. These systems were the first to enable callers to interact with a computer by using the telephone without being placed on hold. However, these systems can be awkward for the caller, who must move the handset back and forth between the ear and the front of the face. The caller must listen to the options in a verbal menu, translate the chosen option to a digit, and push the corresponding key on the touchtone phone, which is a cognitive overload for many callers. Because callers can hold only about 7±2 chunks of information in short-term memory, each menu can offer only a few options. Thus, these applications use long narrow menu hierarchies rather than short fat menus, and callers sometimes get lost traversing the long menu hierarchies. Callers also find it difficult to spell character strings by pressing the 12 buttons on a typical telephone keypad. Most importantly, the dialog typically is rigidly structured and does not enable the caller to reach the needed information quickly.
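One reason spelling on a keypad is hard is that each key stands for three or four letters, so different words produce identical key presses. The short sketch below uses the standard letter layout of a telephone keypad. The function names and the sample words are chosen only for illustration.

```python
# Standard letter assignment on a telephone keypad.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def to_digits(text):
    """Translate letters to keypad digits; other characters pass through."""
    return "".join(LETTER_TO_DIGIT.get(ch.upper(), ch) for ch in text)


def group_by_keys(words):
    """Group words that a caller would enter with the same key presses."""
    groups = {}
    for word in words:
        groups.setdefault(to_digits(word), []).append(word)
    return groups


print(to_digits("38-AUDIO"))  # 38-28346
print(group_by_keys(["home", "good", "gone", "game"]))
# {'4663': ['home', 'good', 'gone'], '4263': ['game']}
```

A touchtone application that receives 4663 cannot tell whether the caller meant "home," "good," or "gone" without asking a follow-up question.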

Speech overcomes many of these disadvantages of touchtone systems. A caller can hold his handset next to his ear without having to move it. Short fat menus frequently replace long narrow menu hierarchies. Users can speak words and phrases instead of trying to spell character strings.
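The difference between long narrow hierarchies and short fat menus comes down to simple arithmetic. The sketch below assumes an evenly balanced menu tree in which every menu offers the same number of options. The figures of 200 destinations, 4 options per touchtone menu, and 15 options per spoken menu are illustrative assumptions, not measurements.

```python
def menu_levels(total_choices, choices_per_menu):
    """Return the fewest menu levels needed to reach every choice.

    Assumes a balanced tree in which each menu offers choices_per_menu options.
    """
    if total_choices < 1 or choices_per_menu < 2:
        raise ValueError("need at least 1 choice and 2 options per menu")
    levels, reachable = 0, 1
    while reachable < total_choices:
        reachable *= choices_per_menu  # one more level multiplies the reach
        levels += 1
    return levels


print(menu_levels(200, 4))    # 4 levels when each menu lists only 4 options
print(menu_levels(200, 15))   # 2 levels when each menu accepts 15 replies
print(menu_levels(200, 200))  # 1 level when one question accepts any reply
```

Under these assumptions, a caller who would press four keys in sequence on a touchtone system could reach the same destination with one or two spoken replies.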

Speech is often worth the time, effort, and expense if your customers will be better served. There are 10 times as many telephones as connected PCs in the world, and the number of cell phones is increasing dramatically. Nearly everyone has access to a phone, so nearly everyone can reach your Web site quickly.

Do Your Homework Before Deciding to Speech-enable

But before constructing a voice interface to your Web site, you should do some sleuthing to determine whether speech will be useful for your applications:

  1. Telephone some voice Web sites, and listen to their voice user interfaces. Here are some telephone numbers of interesting voice applications you can call for free.


    Phone number


    Tell Me



    1–888–38–AUDIO (1-888-382-8346)



    Hey Anita



    1-888-NUANCE-8 or 1-650-847-7656

  2. Try your hand at developing a simple voice interface by using one of these online VoiceXML tools.



    BeVocal Café

    Hey Anita FreeSpeech

    TellMe Studio

    VoiceGenie Developer Workshop

    Voxeo Community

  3. Or download one of these software development kits and develop a voice interface on your own PC.



    IBM WebSphere Voice Server SDK



  4. Microsoft has released a beta version of the Speech Application Language Tags (SALT) that you can download and experiment with.

  5. Determine whether a voice interface will actually improve relationships with your customers and improve sales. There are a variety of techniques to test the effect of voice user interfaces. With one popular technique, called a Wizard of Oz experiment, the developer plays the part of the computer, speaking to the caller on its behalf, and records the caller's responses. Not only is this an effective tool for designing dialogs, but follow-up surveys also indicate what callers think and feel about speaking and listening to a computer. Other techniques include demonstrating a voice user interface to a focus group and listening to the group's comments.

  6. Get your feet wet by prototyping a noncritical application and actually measure its impact on selected users.

After you have finished your investigation, you will be ready to make an informed decision about speech-enabling your Web site.


For additional suggestions on how to improve the quality and experience of using telephony applications, see the author's book, VoiceXML: An Introduction to Building Speech Applications (Prentice Hall, 2002, ISBN: 0130092622).