November/December 2003

EMMA: W3C’s Extensible Multimodal Annotation Markup Language

By Dr. James A. Larson

Recently, the W3C Multimodal Interaction Working Group published a first working draft of EMMA, the Extensible MultiModal Annotation markup language (www.w3.org/TR/emma/). EMMA is intended to represent the semantics of information entered via various input modalities, as well as the integrated information that results from combining those inputs.

Using EMMA

EMMA’s use is illustrated in Figure 1.

It shows a part of the W3C multimodal framework and how users can enter information using speech, ink, keyboard and mouse, and other modalities. Normally, EMMA is not written by humans. It is generated by software components for use by other software components. User input is recognized and interpreted by modality-specific components, including speech recognizers, handwriting recognizers, and keyboard and mouse device drivers. The information entered using each modality is represented in a common language; this is where EMMA comes in. Each of the modality-specific recognizers/interpreters converts the user-supplied information into an EMMA representation. Examples of the EMMA representation include:

The speech utterance “Zoom in here” is represented as:
The area circled by the user’s pen is represented as a series of points:
The integration of information from both the speech and pen modalities is represented as:

This EMMA representation would then be processed by a dialog manager that would respond to the user by zooming into the area contained by the series of points.
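
As a rough illustration of the flow described above, the following Python sketch models the spoken command, the pen gesture, and their integration as plain data structures, then computes the region a dialog manager might zoom into. It is a sketch under stated assumptions, not EMMA syntax: the names Interpretation, integrate, and zoom_region, the field names, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical names throughout; none of these are defined by the EMMA draft.


@dataclass
class Interpretation:
    mode: str                      # e.g. "speech" or "ink"
    data: dict                     # application-specific instance data
    derived_from: list = field(default_factory=list)


def integrate(speech: Interpretation, ink: Interpretation) -> Interpretation:
    """Combine a spoken command with the pen gesture it refers to."""
    combined = {
        "command": speech.data["command"],
        "points": ink.data["points"],
    }
    return Interpretation("multimodal", combined, derived_from=[speech, ink])


def zoom_region(interp: Interpretation) -> tuple:
    """Return the bounding box (min_x, min_y, max_x, max_y) of the circled area."""
    xs = [x for x, _ in interp.data["points"]]
    ys = [y for _, y in interp.data["points"]]
    return (min(xs), min(ys), max(xs), max(ys))


# "Zoom in here" as recognized speech, plus the points traced by the pen.
speech = Interpretation("speech", {"command": "zoom_in", "location": "here"})
ink = Interpretation("ink", {"points": [(12, 40), (30, 22), (48, 41), (29, 60)]})

combined = integrate(speech, ink)
print(combined.data["command"], zoom_region(combined))  # zoom_in (12, 22, 48, 60)
```

The integrated result keeps references to the two inputs it was built from, which mirrors the idea that an integrated description is derived from the modality-specific ones.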

EMMA Descriptions

A typical EMMA description consists of three types of information that are useful to describe user-entered information:

1. Data model — A schema to describe the names and structure of data entered by the user, such as:

2. Instance data — The information entered by the user via various input modalities:

3. Meta data — The annotation of instance data. This includes information generated by speech and handwriting recognizers, integration processors, and other information that may be useful to backend information processors. For example, the confidence factor assigned by a recognizer and the natural language being used by the user:
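
To make the three kinds of information concrete, here is a minimal Python sketch that builds and reads back an EMMA-style XML description. The namespace URI, the element name interpretation, and the attribute names confidence and lang are assumptions chosen for illustration; the working draft discussed here may spell them differently. The set DATA_MODEL is a hypothetical stand-in for a real schema.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"  # assumed namespace URI
ET.register_namespace("emma", EMMA_NS)


def q(name: str) -> str:
    """Qualify a name with the (assumed) EMMA namespace."""
    return f"{{{EMMA_NS}}}{name}"


# 1. Data model: a hypothetical stand-in for a schema (allowed field names).
DATA_MODEL = {"command"}

# 2. Instance data and 3. metadata: the annotations sit on the wrapper element,
#    the user's input sits inside it.
root = ET.Element(q("emma"))
interp = ET.SubElement(root, q("interpretation"), {
    q("confidence"): "0.8",   # confidence assigned by the recognizer
    q("lang"): "en-US",       # natural language used by the user
})
command = ET.SubElement(interp, "command")
command.text = "zoom_in"

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# A consuming component parses the text and reads annotations and instance data.
parsed = ET.fromstring(xml_text)
first = parsed.find(q("interpretation"))
instance_fields = {child.tag for child in first}
print(first.get(q("confidence")), first.get(q("lang")), first.findtext("command"))
print(instance_fields <= DATA_MODEL)  # True when the instance fits the model
```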

EMMA Concepts

EMMA defines a number of concepts, including:

Interpretation — One (of possibly several) meanings of user input.
Model — The schema (data model) describing the EMMA information.
Derived-from — The interpretation(s) from which this EMMA information is derived.
Process — The interpretation process used to derive this EMMA information.
Confidence — The quality of input.
No-input — Input is not available for this EMMA information.
Uninterpreted — This EMMA information is not interpreted (e.g., recorded speech or ink will be presented to the user at a later time).
Lang — The natural language used to express the information.
Group — Group several EMMA descriptions together.
Group-info — Annotation for a group of EMMA descriptions.
Source — The raw source from which this EMMA information is derived.
Sequence — Relative position of EMMA descriptions.
Timestamps — The start and stop timestamps for raw input.
One-of — One of several mutually exclusive EMMA descriptions.
Medium, mode and function — Data about the source and use of the information.
Medium — acoustic, tactile, or visual.
Mode — speech, DTMF, keypad, ink, GUI, keys, video, photograph.
Function — recording, transcription, dialog, verification, and so on.
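
Several of these concepts can be sketched together. The Python example below treats a one-of as a list of mutually exclusive interpretations, each carrying a confidence plus medium and mode annotations, and selects the most confident usable one while skipping no-input and uninterpreted entries. The class, the function best_of, and the sample values are hypothetical; how a consumer should choose among alternatives is an assumption made here for illustration, not something the draft prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interpretation:
    value: Optional[str]          # interpreted meaning, if any
    confidence: float = 0.0       # score assigned by the recognizer
    medium: str = "acoustic"      # acoustic, tactile, or visual
    mode: str = "speech"          # speech, DTMF, ink, ...
    no_input: bool = False        # nothing was entered
    uninterpreted: bool = False   # raw input kept without interpretation


def best_of(one_of: list) -> Optional[Interpretation]:
    """Pick the highest-confidence alternative that carries a usable meaning."""
    candidates = [i for i in one_of if not (i.no_input or i.uninterpreted)]
    return max(candidates, key=lambda i: i.confidence, default=None)


# Three mutually exclusive readings of one utterance (sample values).
one_of = [
    Interpretation("Boston", 0.6),
    Interpretation("Austin", 0.3),
    Interpretation(None, no_input=True),
]

best = best_of(one_of)
print(best.value, best.confidence)  # Boston 0.6
```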

The EMMA language is still evolving. The Multimodal Interaction Working Group solicits your feedback about the above concepts and how they should be represented in the EMMA language. The group is also evaluating whether annotations should be integrated tightly with instance data or kept separate from it. The language will likely change based on feedback from practitioners. However, with the publication of the first working drafts for EMMA and InkML, the Ink Markup Language (www.w3.org/TR/InkML/), the W3C Multimodal Interaction Working Group has taken an important step toward languages that enable multimodal applications.

If EMMA is widely adopted, components from different vendors will be able to interoperate. For example, a speech recognizer from Vendor A, a handwriting recognizer from Vendor B, an integration component from Vendor C, and an interaction manager from Vendor D will be able to transfer EMMA statements among themselves. EMMA will become the interlingua among components of multimodal systems. This will enable developers to create multimodal platforms by choosing the “best of breed” or “least expensive” option for each component type.
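
The interoperability argument can be sketched as components that exchange nothing but serialized descriptions. In the Python example below, recognizer_output stands in for any vendor's recognizer and integrate for any vendor's integration component; both are hypothetical, as are the namespace URI and the element and attribute names, which are assumed for illustration rather than taken from the draft.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"  # assumed namespace URI
ET.register_namespace("emma", EMMA_NS)


def q(name: str) -> str:
    """Qualify a name with the (assumed) EMMA namespace."""
    return f"{{{EMMA_NS}}}{name}"


def recognizer_output(mode: str, payload: str) -> str:
    """Stand-in for a vendor's recognizer: emits one interpretation as XML text."""
    root = ET.Element(q("emma"))
    interp = ET.SubElement(root, q("interpretation"), {q("mode"): mode})
    interp.text = payload
    return ET.tostring(root, encoding="unicode")


def integrate(*documents: str) -> str:
    """Stand-in for an integration component: groups incoming interpretations."""
    root = ET.Element(q("emma"))
    group = ET.SubElement(root, q("group"))
    for doc in documents:
        group.extend(ET.fromstring(doc).findall(q("interpretation")))
    return ET.tostring(root, encoding="unicode")


# The components share only the XML text, never each other's internals.
merged = integrate(
    recognizer_output("speech", "zoom in here"),
    recognizer_output("ink", "12,40 30,22 48,41 29,60"),
)
modes = [i.get(q("mode")) for i in ET.fromstring(merged).iter(q("interpretation"))]
print(modes)  # ['speech', 'ink']
```

Because each function accepts and returns only text in the shared format, any one of them could be replaced by another vendor's implementation without changing the others.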

EMMA will become an important language for integrating user input entered via different modalities. Because information is represented in a common format, input from different devices can be integrated into a single representation for processing by dialog managers, inference engines, or other advanced information processing components.