IEEE

Shared Task Evaluation Challenges in Natural Language Generation

Anja Belz and Albert Gatt

SLTC Newsletter, April 2009

Shared Task Evaluation Challenges (STECs) have been common in many areas of NLP for some time, but have only recently started in the field of Natural Language Generation (NLG). This article gives an overview of developments in Shared Tasks for NLG over the past three years.

The move towards comparative evaluation in NLG

The field of Natural Language Generation (NLG) has a strong evaluation tradition, in particular in user-based and task-oriented evaluation, but until recently it had never evaluated alternative, independently developed approaches and techniques by comparing their performance on the same tasks.

Evidence from other NLP fields had shown that shared-task evaluation can lead to rapid technological progress and substantially increased participation. In 2005, a groundswell of interest in comparative evaluation began among NLG researchers, starting with heated discussions at the UCNLG and ENLG workshops that year, and continuing with a special session on the topic at INLG'06 and an NSF-funded workshop on Shared Tasks and Comparative Evaluation in NLG in April 2007.

There was initially a lot of skepticism, especially among more senior NLG researchers, about the feasibility and utility of shared-task evaluation in NLG, and the NSF workshop in particular provided a forum for discussing concerns about the impact that such tasks might have on the field. Chief among these were the risk of narrowing the field's focus to a small set of problems to the detriment of others, and the risk of relying on a restricted range of evaluation methods, as exemplified by developments in MT.

The NSF workshop also provided an opportunity to present and develop specific ideas for shared tasks in NLG. Two proposals seemed particularly feasible in the short to medium term. One of these focused on Referring Expressions Generation, an area of NLG which has been researched intensively since the 1980s; the second proposal was about instruction giving in virtual environments. These proposals have since become a reality and have resulted in a shared-task evaluation initiative with yearly events. The most recent of these is Generation Challenges 2009, the first results session of which has just been held at this year's European Workshop on Natural Language Generation (ENLG'09), in Athens, Greece.

Generation Challenges 2007–2009

The first NLG STEC event was the Attribute Selection for Generating Referring Expressions Challenge (ASGRE'07); the results session was held just five months after the NSF workshop, in September 2007 at the UCNLG+MT workshop at MT Summit XI in Copenhagen. The task involved selecting the attributes that would identify an object in a visual context, and made use of the TUNA Corpus, a collection of referring expressions written by humans to identify objects in visual domains. ASGRE'07 was organized as a pilot event to gauge community interest in STECs. The fact that six teams were able to participate despite the short time available demonstrated sufficient community interest to continue with a second STEC the following year.
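As a rough illustration of what an attribute selection task of this kind involves, the following Python sketch greedily picks attributes of a target object until every other object in the scene has been ruled out. It is only a toy sketch under stated assumptions: the function name, the preference order and the three-object scene are hypothetical, the data format is not that of the TUNA Corpus, and the strategy shown is a generic greedy one rather than the method of any ASGRE'07 participant.

```python
from typing import Dict, List

# An entity is a flat mapping from attribute name to value (a hypothetical
# format chosen for this sketch, not the TUNA Corpus representation).
Entity = Dict[str, str]


def select_attributes(
    target: Entity,
    distractors: List[Entity],
    preference: List[str],
) -> Dict[str, str]:
    """Greedily select attributes of `target` that rule out distractors.

    Attributes are tried in `preference` order. An attribute is kept only
    if it excludes at least one distractor that is still compatible with
    the description built so far.
    """
    remaining = list(distractors)
    selected: Dict[str, str] = {}
    for attr in preference:
        if not remaining:
            break  # the target is already uniquely identified
        value = target.get(attr)
        if value is None:
            continue  # the target has no value for this attribute
        survivors = [d for d in remaining if d.get(attr) == value]
        if len(survivors) < len(remaining):
            selected[attr] = value
            remaining = survivors
    return selected


# Hypothetical toy scene: the first object is the intended referent.
scene = [
    {"type": "chair", "colour": "red", "size": "large"},
    {"type": "chair", "colour": "blue", "size": "large"},
    {"type": "desk", "colour": "red", "size": "small"},
]
description = select_attributes(scene[0], scene[1:], ["type", "colour", "size"])
print(description)
```

For this scene the selected attributes are the type and the colour of the target: the type excludes the desk, the colour excludes the blue chair, and size is never considered because no distractors remain.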

Referring Expression Generation Challenges 2008 (REG'08) had four individual STECs. Three of these used the TUNA Corpus: one task was attribute selection, another was realisation of selected attributes and the third was end-to-end referring expression generation. The fourth task was based on the GREC (Generating Referring Expressions in Context) Corpus, a collection of introduction sections from Wikipedia articles about people and geographic entities. The task was to select suitable referring expressions for the main subject of an article within the context of the article.

This year has seen the third NLG STEC event, Generation Challenges 2009, with four individual STECs: the (end-to-end) TUNA Referring Expression Generation Task (TUNA-REG); the GREC Main Subject Reference Generation (GREC-MSR) Task; the GREC Named Entity Generation (GREC-NEG) Task; and the Giving Instructions in Virtual Environments (GIVE) Challenge. The GREC-NEG Task, new this year, used the new GREC-People corpus, and the task was to generate chains of referring expressions for all people entities mentioned in an article. In the GIVE Challenge, participating teams developed systems which generate instructions for users navigating a virtual 3D environment and performing computer-game-like tasks.

In addition to the shared tasks, all three STEC events offered (i) an open submission track in which participants could submit any work involving the data from any of the shared tasks, while opting out of the competitive element, and (ii) an evaluation track, in which proposals for new evaluation methods for the shared task could be submitted. Generation Challenges 2009 additionally offered a task proposal track in which proposals for new shared tasks could be submitted.

Preparations are underway for a fourth NLG shared-task evaluation event next year, Generation Challenges 2010, which is likely to include a further run of the GREC-NEG Task with an extended training/development corpus, a new task which links GREC-NEG to a named-entity recognition preprocessing stage, and a second run of the GIVE Challenge.

Taking stock

Our general aims and objectives in the NLG STECs have been (i) the creation of new data resources, made freely available for research purposes; (ii) innovation in evaluation (new task-performance experiments, automatic extrinsic techniques, etc.); (iii) application and testing of a wide range of different intrinsic and extrinsic evaluation techniques and diverse test data sets; (iv) bridging to other research areas where language is automatically generated (MT, summarization and dialogue) and drawing new researchers into NLG; and last but not least (v) engendering technological progress in the specific tasks addressed.

Among the tangible results of the three NLG STECs so far are (i) a range of publicly available data resources and evaluation tools; (ii) new methodologies for extrinsic evaluation; (iii) correlation results for a wide range of intrinsic and extrinsic evaluation measures; and (iv) a diverse range of follow-on research. STECs and comparative evaluation are now firmly established in NLG.

The NLG STEC events have created a real buzz and excitement in the community. Participation is steadily increasing and now extends beyond the core NLG community. In order to put the Generation Challenges initiative on a more permanent footing, we have recently set up a Generation Challenges Steering Committee whose role will be to evaluate new proposals for Shared Tasks, as well as to provide an organizational backbone. This committee is made up of past STEC organizers and STEC review committee members, and includes researchers from MT and summarization. We hope that this will ensure the continuation of what many researchers already expect to be an annual shared-task evaluation event in language generation.

More information

For more information (including links to data sets and task documentation), see: