By James A. Larson
Speech-enabled appliances, including handheld computers, promise to be everywhere in the near future - in the office, at home and on the road - enabling users to easily interact with people, to control consumer appliances and to access personal and public information. To be effective, these appliances will support speech interfaces to intelligent software agents that perform various searching and computational tasks on the user's behalf. The sidebar titled "A Day in Jack's Life" illustrates how pervasive speech will become. The components of the speech interface will likely be distributed among various hardware components connected via a communication network. This article outlines the types of applications that speech-enabled appliances will support, the architectural environments in which they will work and the types of speech-enabled appliances users can expect to use.
Intelligent speech interfaces promise to enable users to communicate with people, control consumer appliances and access personal and public information. Figure 1 illustrates some of the applications in each of these categories.
Communicate with people
The telephone has been the standard appliance for communication among people at different locations. If the called person is not available, the caller leaves a voice-mail message on a telephone answering machine that can be accessed at a later time. People also exchange e-mail messages and faxes.
Speech is the most natural way for people to communicate with each other, so adding speech to these traditional forms of communication will make them easier to use. For example, saying, "Call Fred Lewis," is easier than looking up and dialing Fred's telephone number. Speech dictation enables users to leave the keyboard behind and create e-mail messages from wherever they are - in the office, at home or on the road.
Control consumer appliances
Designers have developed very convenient user interfaces to consumer appliances. What could be easier than pressing buttons on a remote control to select television channels or flipping a switch to turn on a light? These types of direct manipulation user interfaces will continue to be widely used. However, because current buttons and switches are not intelligent, you cannot ask your remote control when "Star Trek" is on, and you must walk to the light switch before turning the light on. Speech enables consumer appliances to act intelligently, responding to spoken commands such as "Turn on the light" and answering spoken questions such as "When is 'Star Trek' on?"
Access personal information
Handheld devices, such as the Palm computer, demonstrate that users can create and access personal data such as to-do lists, shopping lists, calendar appointments, birthday lists and address lists by writing or pressing buttons on a hand-held device carried with the user. However, users must learn a special handwriting technique, such as Graffiti, to enter data, and must press sequences of buttons to access that data. Would it not be easier simply to say, "Add toothpaste to my shopping list"?
ARCHITECTURE OF SPEECH-ENABLED APPLIANCES
Much more than automatic speech recognition and text-to-speech synthesis is necessary to support natural, intelligent speech interfaces. Sophisticated natural language processing and dialog management provide the "naturalness," while software agents provide the "intelligence" users expect when they communicate using speech.
Figure 2 shows the control flow of a typical intelligent speech system. Feature extraction distills the critical audio characteristics from the acoustic input the user speaks into a microphone. The classification module converts these features into text. The natural language understanding module extracts semantic information from the text and converts it to a form acceptable to the dialog manager. The dialog manager executes a dialog script, which determines what to do with each piece of semantic information. An intelligent agent develops plans, accesses a database or Website and generates information and/or actions. The natural language generator accepts this information and converts it to text that a user can understand. The prosody generator inserts tags into the text wherever it should be emphasized, sped up or slowed down. The waveform generator applies phonetic mappings to convert the marked-up text into acoustic speech that the user hears via a speaker.
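The control flow above can be sketched as a simple chain of stages. The toy implementations below are illustrative placeholders for real components (recognizer, parser, dialog manager, synthesizer), not the modules of any actual system:

```python
# Toy pipeline mirroring the control flow of Figure 2: each stage is a
# stub standing in for a real speech-processing component.

def extract_features(audio):
    # Stand-in for feature extraction: a real front end would compute
    # spectral features from the acoustic signal.
    return ["feat:" + audio]

def classify(features):
    # Stand-in for the classification module (features -> text).
    return "what is the weather"

def understand(text):
    # Stand-in for natural language understanding (text -> semantics).
    return {"intent": "query", "topic": "weather"}

def dialog_manager(semantics, agent):
    # The dialog script decides what to do with each piece of semantics.
    if semantics["intent"] == "query":
        return agent(semantics["topic"])
    return {"answer": None}

def weather_agent(topic):
    # Stand-in for an intelligent agent that consults a database or Website.
    return {"answer": "partly sunny"}

def generate_text(result):
    # Natural language generation (information -> text).
    return "The forecast is " + result["answer"] + "."

def add_prosody(text):
    # Prosody generation: mark up where the speech should be emphasized.
    return "<emphasis>" + text + "</emphasis>"

def pipeline(audio):
    features = extract_features(audio)
    text = classify(features)
    semantics = understand(text)
    result = dialog_manager(semantics, weather_agent)
    return add_prosody(generate_text(result))

print(pipeline("raw-audio"))
```

The marked-up text that falls out of the last stage would then be handed to the waveform generator for conversion to audible speech.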
Figure 3 illustrates the software architecture of intelligent speech systems being developed for the DARPA Communicator program (http://fofoca.mitre.org). In this architecture, the hub is a traffic cop that examines messages from components, possibly changes message formats and routes the modified message to the appropriate component. The hub gives the communication network intelligence, directing messages among the appropriate software components irrespective of which hardware processor they reside on. The various software components may reside on the same processor or may be distributed across multiple processors. A new software component can be added by modifying the hub to process its messages and to route messages to and from the appropriate existing components. Likewise, a new hardware processor can be added, and software components migrated to it, by modifying the hub's program.
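A hub of this kind can be approximated in a few lines. The message format, component names and routing table below are invented for illustration and are not the Communicator's actual protocol:

```python
class Hub:
    """Toy message router in the spirit of the Communicator hub:
    components register by name, and a routing table decides, per
    message type, which component receives the (possibly
    reformatted) message."""

    def __init__(self):
        self.components = {}   # component name -> handler callable
        self.routes = {}       # message type -> (target name, reformatter)

    def register(self, name, handler):
        self.components[name] = handler

    def add_route(self, msg_type, target, reformat=lambda m: m):
        # Adding a component (or moving one to a new processor) only
        # requires new routing entries, not changes to other components.
        self.routes[msg_type] = (target, reformat)

    def send(self, msg):
        target, reformat = self.routes[msg["type"]]
        return self.components[target](reformat(msg))

hub = Hub()
hub.register("recognizer", lambda m: {"type": "text", "text": "call fred"})
hub.register("dialog", lambda m: "dialing Fred Lewis")
hub.add_route("audio", "recognizer")
hub.add_route("text", "dialog")

# Audio is routed to the recognizer; its textual result is routed onward.
text_msg = hub.send({"type": "audio", "data": b"..."})
reply = hub.send(text_msg)
```

Because every message passes through the hub, the components never address each other directly, which is what lets them migrate freely among processors.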
The DARPA Communicator architecture enables a distributed approach to processing speech. Distributed processing is desirable because the computationally demanding recognition and semantic processing can run on powerful, shared backend processors, keeping the appliances themselves simple and inexpensive.
TYPES OF SPEECH-ENABLED APPLIANCES
Speech will enable users to communicate with people, control consumer appliances and access personal and public information. A large variety of appliances will serve the diverse needs of users in the office, in the home or on the road. Speech can be embedded into traditional home appliances - such as the furnace, air conditioner, washing machine, refrigerator, coffeepot, stove, microwave, television and VCR - enabling these appliances to accept speech commands. Because appliances will be networked together, they can be coordinated. For example, at bedtime the user will instruct the coffeepot to turn on and the VCR to record the morning news at 6:30 a.m. The user will also instruct the clock to sound a get-out-of-bed reminder at 6:45 a.m. By 7 a.m., the user will be listening to interesting snippets from the morning news while drinking a freshly brewed cup of coffee.
Other appliances will be portable. Portable handsets, headsets, remote controls and wearable jewelry and clothing will contain microphones and speakers for capturing and presenting speech. These devices may communicate with each other and with computational processors within a 30-foot range via Bluetooth RF communication technology, within a house via a home-area network and outside of the home via long-distance RF communications technology. Some appliances will only support one of these forms of communication, while others may support two or all three.
Appliances can be categorized by how much processing is done within the appliance itself versus the amount of processing performed by other processors connected to the appliance. The distribution of software modules between an appliance and other hardware processors is illustrated in Figure 2 for each of three generations of speech-enabled appliances.
First-generation appliances
Speech, encoded as an analog or digital electronic signal, is transmitted to and from the appliance. No significant computation takes place in the appliance itself. The device contains a microphone, a speaker and a transmitter so that speech can be relayed to remote components that perform all of the speech processing. Today's cordless telephones and cell phones are first-generation appliances. Websites will provide private information (e-mail, calendar, telephone lists), community information (traffic conditions, school closures, weather conditions, location and direction information, local events) and national and international information (news, financial information, entertainment), all of which can be accessed with speech using first-generation appliances. Examples of speech-enabled Websites include @Motion (www.atmotion_inc.com/home.htm) and Pipebeach (www.pipebeach.se). XML-based mark-up languages, such as VoxML (www.voxml.com/voxml.html) and VoiceXML (www.vxmlforum.org), will be used to mark up text on Websites for presentation as speech to users and to accept spoken utterances from users.
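As a rough illustration of such markup, the fragment below is a minimal VoiceXML-style document (simplified for illustration; the element names follow the VoiceXML drafts, but this is not a complete, deployable document), parsed here with Python's standard XML library:

```python
import xml.etree.ElementTree as ET

# A minimal VoiceXML-style fragment: the <prompt> text is synthesized
# as speech to the caller, and the <field> collects a spoken reply.
# Real documents also carry grammars and event handlers.
doc = """
<vxml version="1.0">
  <form id="weather">
    <field name="city">
      <prompt>For which city would you like the forecast?</prompt>
    </field>
  </form>
</vxml>
"""

root = ET.fromstring(doc)
prompts = [p.text.strip() for p in root.iter("prompt")]
print(prompts[0])
```

A voice browser running on a backend server interprets such a document, speaking the prompts and feeding the user's utterances back into the form's fields.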
Second-generation appliances
Speech features are transmitted to and from the appliance. The appliance contains an inexpensive digital signal processor that extracts speech features from the user's speech for transmittal to remote hardware processors, which perform the speech recognition and semantic processing. Extracting the speech features not only acts as a compression technique that decreases the bandwidth required between the appliance and the backend hardware, it also prevents noise from entering the signal during transmission. The Aurora project (www.drago.cselt.it/fipa/yorktown/nywso29.htm) within ETSI in Europe is investigating algorithms and setting standards for extracting and representing speech features for this purpose. Responses are returned as text with tags describing prosodic control (pitch, timing, pausing, speaking rate, emphasis and pronunciation), which is converted to acoustic sound by the waveform generation module.
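The bandwidth saving can be seen even with a crude front end. The two per-frame features below (log energy and zero-crossing count) are toy stand-ins for the cepstral features a real recognizer front end, such as the one standardized by the Aurora work, would extract:

```python
import math

def toy_features(samples, frame_len=160):
    """Reduce each 160-sample frame (20 ms at 8 kHz) to two numbers:
    log energy and a zero-crossing count.  Real front ends extract
    richer cepstral features, but the principle is the same: far
    fewer values cross the network than raw samples would."""
    feats = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame)
        log_e = math.log(energy + 1e-10)      # log frame energy
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
        feats.append((log_e, crossings))
    return feats

# One second of a 100 Hz tone sampled at 8 kHz: 8,000 samples go in,
# 50 frames x 2 feature values come out.
tone = [math.sin(2 * math.pi * 100 * t / 8000) for t in range(8000)]
feats = toy_features(tone)
print(len(tone), "samples ->", len(feats) * 2, "feature values")
```

The frame size and sampling rate here are conventional telephony values chosen for the example; the actual Aurora parameterization differs.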
Third-generation appliances
Data is transmitted to and from the appliance. The appliance performs most of the speech processing itself, transmitting only data and commands to other components. Users can also speak to third-generation appliances that are not networked; such an appliance stores data and commands to be transmitted to the backend server once a network connection becomes available. Because these appliances contain more computational power than first- or second-generation appliances, they are currently more expensive. Some third-generation appliances may be just around the corner. Microsoft is working on a demonstration project called MiPad (pronounced "my pad"), a handheld personal information manager controlled by human speech (http://www.research.microsoft.com/srg/drwho.htm). Lernout & Hauspie has demonstrated a hand-held device that enables users to search the Web, compose e-mail and organize files using voice commands (http://www.lhs.com/news/releases/20000207_linuxdevice.asp).
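The store-and-forward behavior of an unnetworked third-generation appliance can be sketched as a simple queue. The class and method names here are invented for illustration:

```python
class OfflineAppliance:
    """Toy model of a third-generation appliance: commands spoken
    while disconnected are queued locally and flushed to the backend
    server, in order, once a network connection is available."""

    def __init__(self):
        self.pending = []      # locally stored commands
        self.connected = False

    def speak(self, command, server):
        if self.connected:
            server.append(command)   # networked: send immediately
        else:
            self.pending.append(command)  # stand-alone: store locally

    def connect(self, server):
        self.connected = True
        while self.pending:          # flush the backlog in order
            server.append(self.pending.pop(0))

server = []                          # stands in for the backend server
pad = OfflineAppliance()
pad.speak("add toothpaste to my shopping list", server)   # queued
pad.speak("remind me to go shopping at 6:30", server)     # queued
pad.connect(server)                                       # flushed
print(server)
```

Because the appliance recognizes the speech locally, only the compact recognized commands, not audio, need to be stored and later transmitted.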
Speech will enable users to communicate with people, control appliances and access personal and public information. Speech software components may be distributed among a variety of hardware processors. By connecting to and sharing remote hardware processors, first-generation appliances enable users to speak to software agents from home, at work or on the road. Second-generation appliances will improve the accuracy of speech recognition by avoiding the introduction of noise during transmission and by applying sophisticated speech recognition and semantic processing algorithms that execute on powerful remote hardware processors. Third-generation appliances can be used in stand-alone mode, but are more effective when networked. In the near future, however, second-generation appliances will be more cost effective than third-generation appliances.
A DAY IN JACK'S LIFE
The following hypothetical scenario illustrates how speech processing will enable users to interact with a variety of appliances within the next five years: As Jack crawled under the covers of his bed, he spoke to the microphone in his radio: "Computer, wake me up at 6:30 in the morning by turning on the radio to my favorite radio station."
"At 6:30 in the morning, I'll turn on your radio tuned to 95.5 FM. Is that correct?" replied the synthesized voice from the radio's speaker.
"Yes, and turn the coffeepot on at 6:15 in the morning," added Jack.
"At 6:15 in the morning, I'll turn on the coffeepot. Is that correct?" replied the voice.
"Check to see if 'The Trouble with Tribbles' episode of 'Star Trek' is on TV during the night. If it is, record it," instructed Jack.
"The 'Web TV Guide' lists three episodes of Star Trek, but none of them is titled, 'The Trouble with Tribbles,' " replied the voice.
"Turn the furnace setting to 60 degrees now, then turn up setting to 70 degrees at 6:15 in the morning," commanded Jack.
"The heat has been turned down to 60 degrees. I'll turn the heat up to 70 degrees at 6:15 in the morning."
"And turn the lights off," Jack interrupted before the voice could ask if what it understood was correct.
The lights were immediately turned out and Jack fell asleep.
When the radio started playing music at 6:30, Jack woke up and smelled the aroma of freshly brewed coffee drifting through the warm air of the house. He got out of bed and squeezed the toothpaste tube to force the last of the toothpaste onto his toothbrush.
Jack spoke into the microphone of a small handheld appliance set beneath the bathroom mirror. "Add toothpaste to my shopping list."
"Toothpaste has been added to your shopping list. There are 18 items on your shopping list."
"Remind me to go shopping at 6 this evening," instructed Jack.
"I'll remind you to go shopping at 6 this evening, is that correct?" said the voice.
"No, make that 6:30 tonight," corrected Jack.
"I'll remind you to go shopping tonight at 6:30, is that correct?" said the voice.
"Yes," replied Jack.
While drinking a hot cup of coffee, Jack spoke into a microphone in the refrigerator door.
"What's Intel stock at?" Jack asked. "Intel opened at 90 and a half, up one and a half," replied the voice through a speaker in the refrigerator door.
"What's the weather forecast?"
"Partly sunny, with a 50 percent chance of light showers this evening," replied the voice.
"What's on my calendar at work for today?"
"Nine o'clock appointment with Fred Lewis, 11 o'clock staff meeting, and 2 o'clock presentation to the budget review committee," replied the voice.
Realizing that he still needed to complete the budget presentation, Jack gulped down the rest of his coffee and walked toward the garage.
"Do you realize that the battery in the fire alarm is due to be changed today?" asked the voice.
"No. Add the correct type of batteries to my shopping list," said Jack.
"Batteries have been added to your shopping list. There are 19 items..."
"I know," interrupted Jack.
"Open the garage door," Jack spoke into the microphone embedded in his tie clasp as he arrived in the garage.
The garage door opened, Jack got into the car and backed out of the garage.
"Close the garage door," he spoke into a microphone embedded into the steering wheel.
As the garage door closed, Jack asked, "What are the traffic conditions on Highway 26?"
"Traffic on Highway 26 is moving slowly. Stalls at the 217 exit and at the entrance to the tunnel," replied the voice from the speaker in the car's radio.
"Place a call to Fred Lewis," commanded Jack.
In a few seconds, Jack heard Fred's voice from the speaker in the car's radio.
"Hi, Fred," said Jack, "I'm stalled in traffic on Highway 26, and I doubt that I'll be there to meet with you at 9 o'clock. Can we reschedule?"
"Sure," replied Fred, "How about tomorrow at 3 o'clock?"
"Calendar, schedule Fred for tomorrow at 3 o'clock," said Jack.
"Tomorrow at 3 o'clock is not available. The next available time slot is 4 o'clock tomorrow," said the voice coming from the car radio.
"Fred, I can't do it at 3 o'clock, how about 4 o'clock tomorrow?" asked Jack. "Four o'clock it is," said Fred. "See you then. Bye now."
"That gives me another hour to prepare for the budget presentation," thought Jack.
"Make a note to remind me when I get to work to add a summary section to the budget presentation," said Jack.
"So noted," replied the voice.
"Did I remember to turn off the coffeepot before I left this morning," asked Jack.
"No, you didn't. But I turned it off when you left the house at 7:15 this morning," replied the voice.
James Larson works for Intel Architecture Labs and is co-chairman of the W3C Voice Browser Working Group.