The Evolution of IVR Systems

By James A. Larson - Posted Jun 1, 2008

Over the years, IVR technology has evolved in four major phases:

Generation 1: Touchtone input and voice output. Systems presented prerecorded voice prompts to callers, who responded by pressing keys on a touchtone phone. While this simple technology was widely deployed, callers complained about the awkward form factor (moving the handset back and forth between ear and eyes), getting lost in large menus without being able to back out, and time-consuming traversal of numerous options.

Generation 2: Speech input and output. Systems resolved many first-generation problems by supporting the automatic recognition of user speech and responding with prerecorded verbal messages or dynamically generated messages using synthesized speech. Call routing technology replaced long menu hierarchies. Clever error-handling dialogues helped users overcome confusion when problems arose. Callers no longer needed to move handsets between ear and eyes; they simply responded by voice.

Generation 3: Speech input and output and visual output. We are now on the edge of third-generation IVR systems. IVRs will use the small displays available on today’s phones and handheld mobile devices in two ways:
• Media viewer. Screens will present illustrations, animation, and video to callers and support more than just TV applications on a mobile device; they will involve personalized interaction with artificial agents. Callers will observe and internalize information using visual components. These visual elements will support a wide variety of new applications not previously possible for phones, including entertainment (games, video clips, and shows), training, and shopping applications.
• Scratchpad. Callers will no longer need to wait while verbal menus are read to them. Instead, they can scan the screen and select the appropriate option by speaking or pressing buttons on the phone’s keypad. A software agent will guide callers in the construction of queries. Partial queries will be presented on the display, along with options to complete the construction of the query. In effect, the display extends the callers’ memories—both short-term (by displaying partially constructed queries) and long-term (by presenting options and alternatives that callers no longer need to memorize).
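
The scratchpad idea can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not how any particular IVR platform works: the flight-query slots, the `Scratchpad` class, and its method names are all hypothetical. The sketch keeps the partially constructed query (the caller's short-term memory) and lists the options that would complete it (the long-term memory) as text that a small display could show.

```python
from dataclasses import dataclass, field

# Hypothetical slots and options for a flight query; not taken from the article.
SLOT_OPTIONS = {
    "origin": ["Boston", "Chicago", "Denver"],
    "destination": ["Boston", "Chicago", "Denver"],
    "day": ["today", "tomorrow"],
}


@dataclass
class Scratchpad:
    """Holds a partially built query and what is still needed to finish it."""

    filled: dict = field(default_factory=dict)

    def fill(self, slot: str, value: str) -> None:
        # Accept only values that appear among the displayed options for the slot.
        if slot not in SLOT_OPTIONS:
            raise KeyError(f"unknown slot: {slot}")
        if value not in SLOT_OPTIONS[slot]:
            raise ValueError(f"{value!r} is not an option for {slot}")
        self.filled[slot] = value

    def remaining(self) -> dict:
        # Options still to show: every slot the caller has not filled yet.
        return {s: opts for s, opts in SLOT_OPTIONS.items() if s not in self.filled}

    def is_complete(self) -> bool:
        return not self.remaining()

    def render(self) -> str:
        # Text for the display: the query so far, then the choices that complete it.
        lines = ["Query so far:"]
        for slot in SLOT_OPTIONS:
            lines.append(f"  {slot}: {self.filled.get(slot, '?')}")
        for slot, opts in self.remaining().items():
            lines.append(f"Choose {slot}: " + ", ".join(opts))
        return "\n".join(lines)


pad = Scratchpad()
pad.fill("origin", "Boston")  # the caller spoke or keyed in one option
print(pad.render())
```

In a real system, the value passed to `fill` would come from a speech recognizer or the keypad; here it is hard-coded to keep the sketch self-contained.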

Developers need not wait for VoiceXML 3.0 to create third-generation IVR applications. Several voice platform vendors have extended VoiceXML 2.1 to include graphics and video. In addition, several VoiceXML platform vendors have built upon two VoiceXML elements:
• <audio>, originally used to replay audio files, has been extended to replay videos and present image files; and
• <record>, originally used to capture audio files, has been extended to capture video and image files.

Although widely available in Europe, 3G communications technology is new to the United States. This technology has the bandwidth to support the dynamic uploading and downloading of video and image files. However, 3G devices will frequently use non-3G networks, so slow networks will remain a reality for a few years and may hinder applications requiring high volumes of data, such as video.

Generation 4: Multimodal input and multimedia output. IVRs will support multiple modes of input, including speech and handwriting recognition and keyboard input. Alternative input modes enable one mode to back up another, meaning that if speech recognition fails, the user can press buttons. Users can also select the appropriate input mode (e.g., speech recognition while walking and handwriting or key input during a business meeting). Other technologies could include GPS to identify the mobile device’s location and sensors to detect the device’s orientation.
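
The way one input mode can back up another can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not a description of any actual multimodal platform: the `collect_input` function, the `Recognizer` type, and the 0.6 confidence threshold are hypothetical, and the recognizers are stand-ins that return fixed values.

```python
from typing import Callable, Optional, Sequence, Tuple

# A recognizer returns (value, confidence in [0, 1]), or None if it captured nothing.
Recognizer = Callable[[], Optional[Tuple[str, float]]]


def collect_input(
    modes: Sequence[Tuple[str, Recognizer]],
    threshold: float = 0.6,  # hypothetical cutoff for a usable result
) -> Tuple[str, str]:
    """Try each input mode in order; return (mode_name, value) for the
    first mode whose result meets the confidence threshold."""
    for name, recognize in modes:
        result = recognize()
        if result is None:
            continue  # this mode captured nothing; fall through to the next
        value, confidence = result
        if confidence >= threshold:
            return name, value
    raise LookupError("no input mode produced a usable result")


# Stand-in recognizers: speech is unsure, so the keypad backs it up.
speech = lambda: ("austin", 0.35)
keypad = lambda: ("2", 1.0)
print(collect_input([("speech", speech), ("keypad", keypad)]))  # ('keypad', '2')
```

Ordering the list by the user's preferred mode (for example, handwriting first during a meeting) would cover the mode-selection case with the same function.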

VoiceXML already supports touchtone and speech recognition. While the World Wide Web Consortium’s Multimodal Interaction Working Group is specifying a distributed multimodal architecture, researchers are investigating extending VoiceXML to support handwriting recognition (using the Speech Recognition Grammar Specification to indicate grammars describing words) and keyboard input. They are also researching ways to detect a user’s emotions based on various biometric techniques.

Each IVR generation has brought new types of mobile devices. IVR-G1 and IVR-G2 devices are cell phones with push-button keypads. IVR-G3 devices will also contain small display screens. IVR-G4 will likely be a Swiss Army knife device with all types of attachments for specialized functions. And each IVR generation will enable new and useful applications that help mobile users access the Web and stay in contact with friends and family from anywhere.

James A. Larson, Ph.D., is co-program chair for the SpeechTEK 2008 Conference, co-chair of the World Wide Web Consortium’s Voice Browser Working Group, and author of the home-study guide The VoiceXML Guide ( He can be reached at