Multimodal Mobile Apps Will Thrive
Posted May 1, 2007    

Windows Vista will introduce users to applications that speak and listen. Users will be able to speak and listen to Word, Excel, PowerPoint, and other Microsoft Office applications. However, people everywhere are embracing electronic companions (ECs)—portable devices such as smartphones or PDAs with a cellular or other wireless connection—that also speak and listen.

Like Swiss Army knives, ECs perform many tasks with different kinds of easy-to-use, compact, and lightweight hardware, such as microphones, speakers, cameras, displays, keypads, and/or writing pads. Users communicate by typing on a small keypad, pointing and writing on the display using a stylus, speaking into a built-in microphone, or through a combination of these input modes. Users read and listen to responses to their requests. They can connect with powerful servers to search and save data. More importantly, they can connect with family and friends and share content with them. The major advantages of these devices are portability, connectivity, and versatility. Like cell phones and PDAs, ECs can be used almost anywhere and at any time.

Through connectivity, ECs can leverage the power of computers anywhere on the Internet. Users can not only enter data while their hands and eyes are busy, but also access information while they drive, walk, or exercise. The real utility of an EC comes from its applications. While some applications reside permanently on an EC, others can be downloaded on demand, or executed on a server wirelessly connected to the EC. The popularity of ECs will exceed the popularity of cell phones and MP3 players because ECs will be able to do everything that cell phones and MP3 players can, and more.

New classes of multimodal applications for ECs will emerge. These applications can be classified as follows:

Active listening
While radios and TVs enable users to listen passively, ECs will enable active listening. EC users can speak commands to start, stop, fast forward, and rewind; to select content; and to increase or decrease the speed, volume, and other characteristics of content presentation. Users will be able to navigate the content using voice menus, pick lists, and voice-invoked hyperlinks.
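To make the idea of active listening concrete, the spoken playback commands described above could be dispatched roughly as follows. This is only a minimal sketch: it assumes a speech recognizer has already turned the user's speech into text, and the `Player` class, the `handle_command` function, the command words "louder", "softer", "faster", and "slower", and the 10-second skip size are all hypothetical choices made for illustration, not part of any real EC platform.

```python
from dataclasses import dataclass


@dataclass
class Player:
    """Hypothetical playback state for an EC audio application."""
    playing: bool = False
    position: float = 0.0  # seconds into the content
    speed: float = 1.0     # playback rate multiplier
    volume: int = 5        # 0 (mute) to 10 (loudest)


SKIP_SECONDS = 10.0  # assumed jump size for fast forward and rewind


def handle_command(player: Player, utterance: str) -> bool:
    """Apply a recognized spoken command; return False if it is unknown."""
    # Normalize case and whitespace so "Fast  Forward" matches "fast forward".
    command = " ".join(utterance.lower().split())
    if command == "start":
        player.playing = True
    elif command == "stop":
        player.playing = False
    elif command == "fast forward":
        player.position += SKIP_SECONDS
    elif command == "rewind":
        player.position = max(0.0, player.position - SKIP_SECONDS)
    elif command == "louder":
        player.volume = min(10, player.volume + 1)
    elif command == "softer":
        player.volume = max(0, player.volume - 1)
    elif command == "faster":
        player.speed = min(2.0, player.speed + 0.25)
    elif command == "slower":
        player.speed = max(0.5, player.speed - 0.25)
    else:
        return False  # not an active listening command
    return True
```

Returning `False` for unrecognized input leaves room for an application to hand the utterance to something else, such as a voice menu or an application-specific command set.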

Virtual assistants
EC applications called virtual assistants will support application-specific commands beyond the active listening commands. A virtual assistant listens for and acts upon user requests. Examples include:

• a violin tuner, which presents the audio tone after a user says the name of the note, leaving both hands free to tune the violin;
• a TV controller, which changes channels, volume, and TV display characteristics;
• an environmental controller, which adjusts the temperature, lighting, and security system; and
• a family-activity coordinator, which enables family members to coordinate their individual activities.
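As an illustration of the first example in the list above, the core of a violin tuner could look something like the sketch below. It assumes a recognizer has already produced the spoken note name as text, that the tuner only needs the four open violin strings (G, D, A, E), and that pitches follow equal temperament with A tuned to 440 Hz. The function names and the 8,000-sample-per-second rate are hypothetical, and sending the samples to the device speaker is left out.

```python
import math

# Semitone offsets of the four open violin strings relative to the A string.
# Assumes standard tuning: G3, D4, A4, E5.
VIOLIN_STRINGS = {"G": -14, "D": -7, "A": 0, "E": 7}


def note_frequency(note: str, a4_hz: float = 440.0) -> float:
    """Return the equal-tempered frequency in Hz for a spoken string name."""
    offset = VIOLIN_STRINGS[note.strip().upper()]
    # Each semitone multiplies the frequency by the twelfth root of two.
    return a4_hz * 2 ** (offset / 12)


def tone_samples(freq_hz: float, seconds: float = 1.0,
                 sample_rate: int = 8000) -> list[float]:
    """Generate a sine wave as floats in [-1, 1] for the reference tone."""
    count = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(count)]
```

A caller would pass the recognized word to `note_frequency` and then play the result of `tone_samples`, which keeps both of the user's hands on the instrument.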

Synthetic experiences
Synthetic interviews enable users to converse with artificial agents representing real or imaginary people, much as in an interview or question-and-answer session. For example, users can ask Albert Einstein about his life and work, Britney Spears about her latest song, or a fictitious 2020 secretary general of the United Nations about future world events. Speech-enabled video games allow users to speak with other human and artificial players to affect the outcome of games. The same technology will be used for training and educational purposes.

Authoring content
As Lev Grossman noted in the Time magazine article “Time’s Person of the Year: You,” during the past few years the World Wide Web became “a tool for bringing together the small contributions of millions of people and making them matter.” Users will continue this trend by creating multimedia and multimodal applications. Students will no longer write essays and papers, but instead create multimodal activities that explore alternative viewpoints, such as opposing opinions of the westward movement in America as expressed by avatars representing a frontiersman, a Native American, and a settler.

ECs are the wave of the future for consumer electronics. They are both computers and communicators. The market for applications that actively listen to users, support virtual assistants, and provide synthetic experiences is expanding, as is the market for EC authoring tools, so that anyone and everyone can create content for them.

Jim Larson is co-program chair for the SpeechTEK East Conference, co-chair of the W3C Voice Browser Working Group, and the author of the home-study guide and reference The VXML Guide.