February/March 1999

SPEECH FROM THE WEB:
Conversational Agents Let Users Treat Computers as People

By James A. Larson

In their book, The Media Equation, Byron Reeves and Clifford Nass suggest that people treat computers as if they were real people. When people interact, they normally see and hear, as well as speak with, each other. It is natural, then, for users to want to see and hear their computers, as well as speak with them. For users to hear a computer, it must have synthesized speech.

Synthesized speech can be categorized as realistic or fantasy; the category depends on the goal of the application with which it is used. Realistic synthesized speech attempts to be as human-like as possible, while fantasy synthesized speech often sounds non-human but is still understandable.

Two widely different approaches for creating realistic synthesized speech are model-based and concatenated.

In the model-based approach, a model of the human vocal tract is constructed to produce synthesized speech using the electronic equivalents of the lips, mouth, tongue, and throat. Changing the model's parameters produces different voices, so a single model provides a wide variety of voices for different situations. The model dynamically varies the tone and speed of speech, or prosody, to convey emotion. However, the speech still sounds artificial. Various users have likened model-based speech to a Swedish moose, an intoxicated Martian, or a radio with a loose paper clip in its speaker. Researchers believe that improving the simulated model will improve the quality of the speech.

With the concatenated approach, words and phrases are extracted from a person's recorded speech and spliced together to replay the message for the listener. For example, telephone users frequently hear concatenated speech after they dial an invalid telephone number. The prerecorded words and phrases are arranged to present a specific message to the telephone user: "The number you have reached - nine, five, seven, four, four, four, three - is not a working number."
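The splicing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the unit inventory DIGIT_WORDS and the function build_message are hypothetical names, and each unit is shown as a text label where a real system would hold a prerecorded audio clip.

```python
# Hypothetical inventory of prerecorded units. Each value stands in for
# an audio clip recorded once by a human speaker.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def build_message(number):
    """Return the ordered list of prerecorded units to play back."""
    units = ["the number you have reached"]
    # Skip punctuation such as hyphens; keep only digits we have a clip for.
    units += [DIGIT_WORDS[d] for d in number if d in DIGIT_WORDS]
    units.append("is not a working number")
    return units


if __name__ == "__main__":
    print(" | ".join(build_message("957-4443")))
```

Only the digit clips change from call to call; the opening and closing phrases are reused unchanged, which is why this style of message is cheap to produce.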

To improve concatenated synthesized speech, researchers have identified and extracted small units of speech, called phonemes, which can be joined and replayed to produce words the original speaker never uttered. To create a new voice, it is necessary to record words spoken by another human speaker. Large amounts of disk space are required to store the phonemes and combinations of phonemes to be stitched together.
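As a rough sketch of how phoneme-sized units can yield a word the speaker never recorded, consider the Python fragment below. Everything in it is an assumption made for illustration: the phoneme labels, the tiny sample values in PHONEME_SAMPLES, and the LEXICON entries are invented, and a practical system would also smooth the joins between units.

```python
# Hypothetical phoneme inventory: each phoneme maps to a short run of
# audio samples taken from one speaker (made-up values for illustration).
PHONEME_SAMPLES = {
    "K": [0.1, 0.3],
    "AE": [0.5, 0.4, 0.2],
    "T": [0.0, -0.2],
    "B": [0.2, 0.1],
}

# Hypothetical pronunciation lexicon mapping words to phoneme sequences.
LEXICON = {
    "cat": ["K", "AE", "T"],
    "bat": ["B", "AE", "T"],
    "tab": ["T", "AE", "B"],
}


def synthesize(word):
    """Join the stored samples of each phoneme in the word, in order."""
    samples = []
    for phoneme in LEXICON[word]:
        samples.extend(PHONEME_SAMPLES[phoneme])
    return samples


if __name__ == "__main__":
    # "tab" is assembled purely from units that also serve "cat" and "bat".
    print(synthesize("tab"))
```

The inventory belongs to one speaker, which is why a new voice requires new recordings, and why storage grows as more phonemes and phoneme combinations are kept.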

While concatenated synthesized speech sounds more human-like than synthesized speech produced by the model-based approach, concatenated speech often has little prosody. Some users may tire of listening to it over long periods of time.

Three approaches are used to create fantasy speech. In addition to the model-based and concatenated approaches, human speech can be altered by passing it through various types of filters. Voxware (http://www.voxware.com) has software filters that produce a variety of voices called VoiceFonts™. For example, the pitch of a speaker's voice can be changed to sound like a mechanized robot voice.

Figure 1 - UCSC's Baldy is a model-based talking head with realistic mouth and tongue movements, shown with six different expressions.

The model-based and concatenated approaches are also used to produce images of talking human heads. The model-based approach is exemplified by the Baldy talking heads from UCSC, illustrated in Figure 1. Scientists at UCSC carefully measured various movements of human vocal tracts and faces as people spoke. They developed a sophisticated 3-D graphical model of the human head and software to synthesize a talking human face. The jaw, lips, mouth, and facial expressions are controlled by a set of parameters. By varying the parameters, the 3-D graphical model appears to speak much as a live person does and can be synchronized with either prerecorded or synthesized speech.

Figure 2 - MIT's MikeTalk is a concatenated talking head.

MikeTalk, developed by Tony Ezzat and Tomaso Poggio at MIT, is an example of the concatenated approach. Ezzat and Poggio videotaped a person (Figure 2) enunciating a set of key words containing a collection of visemes, the visual equivalent of phonemes. Images for each viseme are identified and extracted from the sequence. The extracted images may then be sequenced and morphed together to produce a string of images that look like a person saying a word that he/she never really uttered. Combined with realistic synthetic speech, the result is a video-realistic text-to-audiovisual speech synthesizer.
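The idea of sequencing viseme images and filling in the frames between them can be sketched as follows. This is not MikeTalk's method: a plain linear cross-dissolve is swapped in as a simplified stand-in for morphing, images are reduced to short lists of pixel intensities, and the names cross_dissolve and viseme_sequence are hypothetical.

```python
def cross_dissolve(image_a, image_b, steps):
    """Return `steps` in-between frames blending image_a into image_b."""
    frames = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)  # blend weight, strictly between 0 and 1
        frames.append([(1 - t) * a + t * b for a, b in zip(image_a, image_b)])
    return frames


def viseme_sequence(visemes, images, steps=1):
    """Build a frame list: each viseme image plus in-between frames."""
    if not visemes:
        return []
    frames = [images[visemes[0]]]
    for prev, nxt in zip(visemes, visemes[1:]):
        frames.extend(cross_dissolve(images[prev], images[nxt], steps))
        frames.append(images[nxt])
    return frames


if __name__ == "__main__":
    # Two made-up 2-pixel "images": a closed mouth and an open mouth.
    images = {"m": [0.0, 0.0], "a": [1.0, 1.0]}
    for frame in viseme_sequence(["m", "a", "m"], images, steps=1):
        print(frame)
```

Because the frame list is driven entirely by the viseme sequence, the same small set of extracted images can be reordered to show words that were never recorded.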

Figure 3 - Interval's Video Rewrite has been used to synthesize John Kennedy saying, "I never met Forrest Gump."

Researchers at Interval Research Corporation have developed an experimental system called Video Rewrite. This system uses existing video to create a new video of a person mouthing words not spoken in the original footage. In Figure 3, Video Rewrite reorders the mouth images to match the phoneme sequence of a new audio track, which consists of a human voice or realistic synthesized voice generated either by the model-based or concatenated approach.

Figure 4 - Microsoft's fanciful visual agents are easily programmed to speak, perform, and understand the user's speech.

If realistic human talking heads are not the goal, then animation techniques can be used to produce fantasy talking heads and bodies. Cartoons can be more expressive than human heads and may be more appropriate in some situations. Microsoft has created several cartoon-like animated agents, two of which are shown in Figure 4. In addition to synchronizing the animated cartoon's mouth to synthesized speech, Microsoft provides software for controlling the cartoon's actions and positions.

Haptek, Inc. has developed a collection of animated talking heads. Developers embed commands in the text to control the placement, orientation, size, and emotion of the talking head. It is even possible to cause one talking head to morph into another talking head.

Figure 5 - Haptek's fanciful VirtualFriend Roswell has well-developed facial expressions and can be morphed into other talking heads.

Figure 5 is a screen shot of Haptek's VirtualFriend™ Roswell. Haptek has created several fanciful 3-D VirtualFriends that morph, speak, move, and emote.

Animated Conversational Agents

Adding speech recognition to a talking head with speech synthesis enables a new input-output paradigm, the animated conversational agent. There are many applications that can be improved with a carefully designed animated conversational agent.

These include:

Instructional software, in which an instructor smiles or frowns and occasionally speaks to provide feedback as a student performs training activities. A visual speaking agent is especially useful in language training applications where the agent illustrates how to pronounce new or difficult words. For example, Baldy is being used to teach deaf students how to position their mouths, tongues, and throats, so the students can pronounce words correctly.
Games, where the user battles a visual opponent with expressions and words to make the game more realistic. An animated conversational agent also makes board games, such as chess or checkers, more interesting by enabling a player to see, hear, and talk to the opponent.
Guides, which explain and demonstrate the functions, organization, and layout of an application or Web site. For example, an animated guide not only reads help messages, but also points to objects and demonstrates the operations as it describes the objects to users.
Helpers, which offer suggestions and assistance at appropriate times. The Microsoft Paper Clip is an example of a helper. Paper Clip comes with every copy of Microsoft Office and pops up whenever help is requested.
Security cops, where images of Mom or Dad appear when children attempt to access inappropriate Web sites or have been playing games for more than the prescribed time limit. So rather than displaying a message saying "Access denied," an animated conversational agent appears on the screen to say, "Your parents don't want you to see this because they feel that it is not appropriate for kids your age."
Remote audio conferencing, where the animated head of the current speaker moves with the user's speech. The animated head simulates the facial expressions and head movements of the speaker. Visual agents provide a low-bandwidth alternative to video conferencing.
Chatterbots, which are artificial entities with which users may converse. The simplest chatterbots consist of a speech recognition system to convert a user's speech into textual statements, a response generation algorithm to convert textual statements into questions, and a speech synthesis system with a talking head to present questions to the user.
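The three-stage chatterbot described in the last item above can be sketched as a small Python pipeline. This is a minimal illustration under stated assumptions: recognize and speak are hypothetical stand-ins for real speech recognition and talking-head synthesis, and the response generator uses a simple pronoun-swapping rule invented here to turn a statement into a question.

```python
# Hypothetical word swaps so the reply addresses the user directly.
PRONOUN_SWAPS = {"i": "you", "my": "your", "am": "are", "me": "you"}


def recognize(utterance):
    """Stand-in for speech recognition: the 'audio' is already text."""
    return utterance


def statement_to_question(statement):
    """Convert a textual statement into a question for the user."""
    words = statement.strip().rstrip(".!").lower().split()
    swapped = [PRONOUN_SWAPS.get(word, word) for word in words]
    return "Why do you say " + " ".join(swapped) + "?"


def speak(text):
    """Stand-in for speech synthesis with a talking head."""
    return "[talking head says] " + text


def chatterbot_turn(utterance):
    """Run one turn: recognize, generate a response, then present it."""
    return speak(statement_to_question(recognize(utterance)))


if __name__ == "__main__":
    print(chatterbot_turn("I am tired of my computer."))
```

Keeping the three stages as separate functions mirrors the description: any one of them could be replaced by a more capable component without touching the other two.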

Cutesy Creatures

When fonts were first introduced, users went wild, overusing multiple fonts and producing gaudy documents with less emphasis on content and more emphasis on layout. When color was introduced, documents were suddenly filled with inappropriate colors. Eventually, users learned the appropriate use of fonts and colors.

It follows that initially there will be a plethora of visual agents, with many annoying, "cutesy" creatures saying inappropriate things at inappropriate times. New technology is often unwieldy and easily misused. Visual agents require that developers not only be computer savvy, but also serve as author, artistic director, and content provider, all difficult tasks that even experts sometimes fail to do well.

Animated conversational agents are a natural extension of the Reeves and Nass observation that users treat computers as if they are people.

Jim Larson is the Manager of Advanced Human I/O at the Intel Architecture Labs in Hillsboro, OR. He can be reached at jim@larson-tech.com
