The Expanding Speech Tech Universe

By James A. Larson - Posted Apr 1, 2008

The universe of speech technologies is coalescing into two giant galaxies: the .NET galaxy and the VoiceXML galaxy.

Microsoft’s .NET framework provides a large body of precoded solutions to common programming requirements and manages the execution of programs written specifically for the framework. It is a key Microsoft offering intended for use by applications, including many speech applications, created for the Windows platform. Microsoft’s Visual Studio gives developers a drag-and-drop interface for creating diagrams that specify the speech and dialogue interface of .NET applications.

The VoiceXML galaxy contains many commercial platforms based on the World Wide Web Consortium’s Voice Browser Working Group specifications. At the center of this galaxy is Nuance Communications, whose lineage includes many speech companies, among them Kurzweil Computer Products, Lernout & Hauspie, Dragon Systems, SpeechWorks, BeVocal, and VoiceSignal. Nuance speech recognition engines are widely used by other companies.

The VoiceXML galaxy has started to overlap the .NET galaxy since Microsoft began offering a VoiceXML platform, both as part of its speech server and through its Tellme Networks division. Even so, most IVR customers belong to the VoiceXML galaxy, while most Microsoft shops belong to the .NET galaxy. Few customers move between galaxies, and both continue to grow: New IVR applications are appearing in the VoiceXML galaxy, and new desktop voice applications in the .NET galaxy. But some new stars are beginning to appear in the speech universe. They include:

Gaming

Gamers have long wished for a "third hand" to supplement the two hands managing the game controller. Speech is an ideal third hand, enabling gamers to request information (How much fuel is left?) and issue commands (Transport me to the beta quadrant!). Speech allows users to talk with game avatars, making the game experience both interactive and natural. With the Wii’s new interaction techniques and the Xbox’s and PS3’s graphics, games have risen to a new level of excitement and interest. Can speech be far behind?

Multiplayer games involve some type of chat. Most players prefer audio to text chat because it’s faster and they don’t need to stop entering game gestures to type text messages. Especially with Voice over Internet Protocol (VoIP), voice chat enables players to interact with one another naturally and easily. Voice chat could also enable players to interact with artificial entities: issue commands to your droids, plot strategy with virtual ball players, and speak answers in games like Jeopardy! and Wheel of Fortune. Voice is the next frontier in computer games.

Voice Search

Many of us already use voice search to dial a person by speaking his or her name. Soon we will be able to speak a question and hear an answer derived from one or more Web sites. New standards such as Resource Description Framework (RDF) and Web Ontology Language (OWL), a pair of labeling formats that identify a file’s contents, will improve search accuracy, and improved speech synthesis engines will produce speech that is both enjoyable and understandable.

Voice-Enabled Mobile Devices

Two forces are driving speech technology in the car: Automobile manufacturers are embedding speech in the dashboard, and independent vendors provide car add-ons such as cell-phone holders and interactive direction finders. Voice-controlled cameras, MP3 players, video players, and phones are available now or soon will be. Many of the functions of these devices may be rolled together into a single "Swiss army knife" mobile device. As keyboards on mobile devices shrink, users will need either a pencil sharpener to point their fingernails for typing or the ability to speak requests. Speech technologies will either be embedded in the device or be available over a communication channel to one or more servers.

A consortium of more than 30 technology and mobile companies, including Nuance and Google, is developing an open and free Android mobile platform that will soon provide the power for new G-phones. With Microsoft’s recent moves to buy Yahoo!, Google could be in a position to drive the mobile phone market.

These new stars in the speech universe present many opportunities for using speech to make users more productive and to provide enjoyable entertainment. No one knows whether they will form their own galaxies or join the two existing ones, but it will be fun to apply speech in these new domains and perhaps improve users’ lives.

James Larson, Ph.D., is the co-program chair for the SpeechTEK 2008 Conference, co-chair of the W3C Voice Browser Working Group, and author of the home-study guide and reference The VoiceXML Guide (www.vxmlguide.com). He can be reached at jim@larson-tech.com.