Entering Data by Speaking on the Telephone: Three Problems and What to Do about Them

Data entry on the phone? Enable your system to accept voice information and learn some of the problems which accompany this new technology — like what to do with tongue-tied callers.


Telephones and cell phones are everywhere. Users call from wherever they are, whenever they want to. But callers to businesses are frequently put on hold. To shorten long hold queues, several businesses have installed interactive voice response (IVR) systems that collect data by asking callers carefully formatted questions. Some systems require callers to answer questions by pressing the buttons on touchtone phones. These systems have several disadvantages:

Using speech-recognition technology solves these problems but creates some new ones. New technology always brings problems, and speech recognition is no exception. This article discusses three problems with using telephones to enter data by speaking, and what you can do about them.

Problem 1: Callers Don't Know When to Speak and What to Say

Currently, many callers do not have experience using a telephony application with automatic speech recognition. They don't know that they can speak. Often, they are "tongue-tied" about what to say. Here are some hints to help callers say the right thing at the right time:

Problem 2: Speech Recognition Errors Break the Dialog's Rhythm and Slow Down the Dialog Between the Caller and the Computer

Occasionally, automatic speech-recognition systems make errors, just as humans make speech-recognition errors in daily conversations. However, the number of errors made by a speech-recognition engine can be minimized:

  • Avoid words in the grammar that the speech-recognition engine confuses easily. If the engine frequently confuses two words, modify the grammar to use different words or phrases that are easily distinguished. For example, change the vocabulary from {"green", "gray"} to {"forest green", "charcoal gray"}.

  • Include words that callers frequently speak. Wizard of Oz tests and usability tests are necessary to determine which words callers frequently speak. (During a Wizard of Oz test, the developer pretends to be the computer, asks the caller questions, and records the caller's responses.)

  • Keep the grammar as small as possible. If the grammar is too large, the speech-recognition system may become confused about which of several words matches the caller's utterance. This confusion may result in a mismatch event, with the computer asking the user clarifying questions. Smaller grammars enable the speech-recognition engine to return accurate results more quickly.

  • Validate low-confidence recognitions. When the speech recognizer returns a result with a low confidence rating, confirm the result by asking the caller a yes/no question. For example, if the speech-recognition engine returns a low confidence score for the word "Austin," ask the caller to confirm: "Did you say Austin?"

    If the confidence scores are similar for two words in the vocabulary, then ask a yes/no question about the most frequently used word. For example, suppose the most frequent answer to the question "Which color? Blue or green?" is green, and the confidence scores for blue and green are about the same. Validate by prompting the caller with the yes/no question: "Did you say green?"

    Frequently, people ask these types of questions in daily conversations, and think nothing of answering these questions with a simple "yes" or "no." Most callers are never aware that a real or suspected speech-recognition error occurred. Also, callers often say the correct value after responding to a yes/no question.

    • If the caller mumbles or speaks a word not in the vocabulary, prompt the caller again, but use different wording to encourage the caller to say one of the words in the vocabulary. For example:

      Prompt: Which account? Savings or checking?

      Caller: Hmmm.

      Prompt: Do you want to access your savings or checking account?
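The confirmation and reprompting hints above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Hypothesis class, the three threshold constants, the 0-to-1 confidence scale, and the second reprompt wording are all hypothetical and do not come from any particular speech-recognition product. Real thresholds would have to be tuned through usability testing.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    word: str          # vocabulary word matched by the recognizer
    confidence: float  # assumed to lie between 0.0 and 1.0


# Hypothetical thresholds, chosen only for illustration.
ACCEPT_THRESHOLD = 0.80  # at or above this, accept without confirming
REJECT_THRESHOLD = 0.30  # below this, treat the result as a mismatch
TIE_MARGIN = 0.05        # scores this close count as "about the same"

# The first wording comes from the example above; the second is illustrative.
REPROMPTS = [
    "Which account? Savings or checking?",
    "Please say savings or checking.",
]


def next_action(hypotheses, usage_counts):
    """Return (action, word), where action is accept, confirm, or reprompt."""
    if not hypotheses:
        return ("reprompt", None)
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    best = ranked[0]
    if best.confidence < REJECT_THRESHOLD:
        return ("reprompt", None)
    # When several words score about the same, confirm the one that
    # callers have historically said most often.
    tied = [h for h in ranked if best.confidence - h.confidence <= TIE_MARGIN]
    if len(tied) > 1:
        likely = max(tied, key=lambda h: usage_counts.get(h.word, 0))
        return ("confirm", likely.word)
    if best.confidence >= ACCEPT_THRESHOLD:
        return ("accept", best.word)
    return ("confirm", best.word)


def confirmation_prompt(word):
    """Build the yes/no question used to validate a doubtful result."""
    return f"Did you say {word}?"


def reprompt(attempt):
    """Return a differently worded prompt for each failed attempt."""
    return REPROMPTS[min(attempt, len(REPROMPTS) - 1)]
```

For example, with confidence scores of 0.52 for "blue" and 0.50 for "green," and usage counts that favor green, next_action returns ("confirm", "green"), and confirmation_prompt then yields "Did you say green?"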

Problem 3: Callers Cannot Easily Correct Misunderstandings Once They Are Detected

Callers cannot correct an incorrect value spoken earlier in the dialog. Form fill-in dialogs prompt callers for values for a sequence of fields, summarize the values in a final prompt, and then ask, "Is that correct?" For example, after the dialog solicits values for departure city, arrival city, and travel date, it summarizes them as follows:

"You want to travel from Boston to Chicago on July 15; is that correct?"

What happens if the caller says no? The summary contains an invalid value, either because the caller spoke the wrong answer or because the speech-recognition engine misunderstood what the caller said. The following techniques help the caller recover:

  • Accept verbal corrections. Listen for keywords to identify the corrected values. For example, the caller could say the following:

    "No, from Austin."

    Note that the caller frequently responds with the same phrasing that the system used. In the example above, prepositions such as "from," "to," and "on" identify each value. Callers will speak these words frequently to indicate which values should be corrected.

    Sometimes, the caller will repeat the entire summary and emphasize the corrected word and its preposition. For example, the caller could say the following:

    "No, I want to travel from Austin to Chicago on July 15."

    Here the caller emphasizes the words "from Austin." By comparing the volume of each word and its corresponding preposition, it may be possible to guess which word is being corrected.

  • Escape. Provide callers with rescue navigational commands. Enable the caller to say "back up" or "go home" rather than force the caller to "reboot" by hanging up and redialing.
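The keyword approach to verbal corrections, together with the escape commands, can be sketched as follows. This is a minimal sketch, assuming the recognizer hands back the caller's utterance as text. The field names, the tiny city vocabulary, and the function names are hypothetical. Candidate values are checked against the vocabulary so that a phrase such as "to travel" is not mistaken for an arrival city. The sketch does not attempt the volume comparison mentioned above.

```python
import re

# Prepositions from the summary prompt, mapped to hypothetical field names.
FIELD_BY_PREPOSITION = {
    "from": "departure_city",
    "to": "arrival_city",
    "on": "travel_date",
}

# Hypothetical vocabularies; a real grammar would be far larger.
CITIES = {"austin", "boston", "chicago"}
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}
ESCAPE_COMMANDS = {"back up", "go home"}

# A preposition followed by the shortest text that ends at the next
# preposition, at punctuation, or at the end of the utterance.
PATTERN = re.compile(
    r"\b(from|to|on)\s+(.+?)(?=\s+(?:from|to|on)\b|[.?!,]|$)",
    re.IGNORECASE,
)


def is_valid(field, value):
    """Accept only values that belong to the vocabulary for the field."""
    if field == "travel_date":
        parts = value.lower().split()
        return len(parts) == 2 and parts[0] in MONTHS and parts[1].isdigit()
    return value.lower() in CITIES


def apply_correction(utterance, slots):
    """Return a copy of slots updated with the values the caller corrected."""
    updated = dict(slots)
    for match in PATTERN.finditer(utterance):
        field = FIELD_BY_PREPOSITION[match.group(1).lower()]
        value = match.group(2).strip()
        if is_valid(field, value):
            updated[field] = value
    return updated


def escape_command(utterance):
    """Return the navigational command the caller spoke, or None."""
    text = utterance.strip().lower().rstrip(".!?")
    return text if text in ESCAPE_COMMANDS else None
```

Given the summary values Boston, Chicago, and July 15, calling apply_correction with "No, from Austin." changes only the departure city and leaves the other two fields untouched.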

Other Problems

Sometimes, callers say too much. They forget that they are talking to a machine, and speak as if they are talking to a human who can understand their every word. Although dictation systems can translate long sentences to text, the system may not "understand" the meaning of the text. Callers may need to be encouraged to answer the question and to not volunteer additional information. For example:

System: "Color?" (pause) "Say the color you want." (pause) "Green, red, or blue?"

Caller: "I think I like green better than red or blue."

This phrasing confuses the speech-recognition system, which hears the words "green," "red," and "blue," and has no idea which color the user wants. Encourage the user to answer the question simply and directly:

System: "I'm sorry, I didn't understand you. Just say the color you want: green, red, or blue."
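The color exchange above can be sketched as a simple check on the recognized text. This is only an illustration, assuming the recognizer returns the utterance as a string; the function names are hypothetical. The answer is accepted only when exactly one vocabulary word is heard, and otherwise the caller hears the simpler, more direct prompt.

```python
import re

COLORS = ("green", "red", "blue")

REPROMPT = (
    "I'm sorry, I didn't understand you. "
    "Just say the color you want: green, red, or blue."
)


def colors_heard(utterance, vocabulary=COLORS):
    """List the vocabulary words that occur in the utterance."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return [color for color in vocabulary if color in words]


def respond(utterance):
    """Accept a single unambiguous color; otherwise reprompt the caller."""
    heard = colors_heard(utterance)
    if len(heard) == 1:
        return ("accept", heard[0])
    return ("reprompt", REPROMPT)
```

The utterance "I think I like green better than red or blue." contains all three colors, so respond returns the reprompt rather than guessing.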

The key to successful speech data entry programs is careful dialog design and iterative usability testing. The best practice techniques for dialog design can both accelerate the data entry process and make entering data by speaking into a phone more enjoyable. But there are no guarantees. Developers must test with a vengeance, iteratively modifying the dialog, and verifying that the modification does indeed improve the process.


For additional suggestions on how to improve the quality and experience of using telephony applications, see the author's book, VoiceXML: An Introduction to Building Voice Applications (Prentice Hall, 2002, ISBN: 0130092622).