Text Analysis Conference 2009: An Overview
Annie Louis
SLTC Newsletter, February 2009
The annual Text Analysis Conference (TAC) provides a setting where participant systems are evaluated on common test sets in order to make direct comparisons, develop and test evaluation methods, and assess system progress. There were three main tracks at this year’s evaluation: Knowledge Base Population, Update Summarization and Recognizing Textual Entailment. More than 50 teams from countries around the world, including Australia, China, Japan, Israel, India and several parts of Europe, took part in these tracks. The results of this year’s evaluation were presented at a workshop held in November 2009. The workshop featured technical sessions with presentations and posters from the participants, as well as overviews of system performance analyzed by NIST.
The Knowledge Base Population (KBP) track was conducted for the first time at TAC this year. The focus was on systems that can mine specific pieces of information from a large text collection with a view to adding them to an existing knowledge base. Over a million newswire documents were assembled by the Linguistic Data Consortium (LDC) to serve as a corpus for the various search tasks. The knowledge base was compiled from Wikipedia “infoboxes”, the section of each Wikipedia article that lists the most important pieces of information about the entity discussed. An example snapshot from the page for ‘Max Planck’ is below.

One of the tasks for the KBP systems was “entity linking”. Given the name of an entity and a document that mentions the name, systems had to determine whether the entity being referred to existed in the knowledge base. To do so, systems had to disambiguate the mention using the provided document, distinguishing it from other entities with the same name. For example, the name ‘Colorado’ could refer to a state, a river or a desert. If the entity already existed in the knowledge base, a link to its entry had to be provided. Systems performed well overall on the entity linking task. There were, however, several clearly ambiguous mentions that most systems had problems disambiguating. In the KBP “slot filling” task, systems had to use the collection of newswire documents to obtain missing pieces of information for an entity in the knowledge base. The entities could be people, organizations or locations, and a list of desirable attributes was defined for each entity type.
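The entity linking decision described above can be illustrated with a minimal sketch. It is not how any TAC system worked; it assumes a toy knowledge base held in a dictionary and scores candidates by simple word overlap between the document and each entry's description. The function name `link_entity`, the entry ids, the descriptions and the stopword list are all hypothetical.

```python
import re

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is"}

def content_words(text):
    """Return the set of lowercase word tokens, minus stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def link_entity(name, document, knowledge_base, min_overlap=1):
    """Return the id of the best-matching entry, or None if nothing fits.

    knowledge_base maps an entry id to {"name": str, "description": str}.
    """
    context = content_words(document)
    best_id, best_score = None, 0
    for entry_id, entry in knowledge_base.items():
        if entry["name"].lower() != name.lower():
            continue  # only entries sharing the mention's name are candidates
        score = len(context & content_words(entry["description"]))
        if score > best_score:
            best_id, best_score = entry_id, score
    # Returning None signals that the entity is judged absent from the knowledge base.
    return best_id if best_score >= min_overlap else None

# Hypothetical two-entry knowledge base for the ambiguous name 'Colorado'.
kb = {
    "E1": {"name": "Colorado",
           "description": "state in the western United States, capital Denver"},
    "E2": {"name": "Colorado",
           "description": "river flowing through the Grand Canyon to the Gulf of California"},
}
doc = "Rafting trips on the Colorado pass through the Grand Canyon every summer."
print(link_entity("Colorado", doc, kb))
```

Here the document shares "through", "grand" and "canyon" with the river entry and nothing with the state entry, so the river entry is chosen. A mention with no overlapping context yields None, which corresponds to deciding that the entity is not in the knowledge base.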
Continuing along the lines of the previous few years, this year’s summarization track focused on the creation of update summaries. Participants were given two sets of 10 newswire documents each, with the second set containing articles published later than those in the first. Systems were to assume that the user had already read the first set of documents and had to create a single summary of the second set that contained only the new information, as an update for the user. This year’s evaluations showed that selecting good content for update summaries remains difficult: most systems obtained scores significantly below human performance on this task.
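To make the update setting concrete, here is a rough sketch of one naive way to favor new information: rank the sentences of the later set by the fraction of their words that never appear in the earlier set. This is an illustrative baseline under simplifying assumptions (regex tokenization, punctuation-based sentence splitting), not a method reported at TAC; the function name `update_summary` and the example texts are made up.

```python
import re

def words(text):
    """Return the set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def update_summary(earlier_docs, later_docs, max_sentences=2):
    """Pick the sentences from later_docs with the most unseen vocabulary."""
    seen = set()
    for doc in earlier_docs:
        seen |= words(doc)

    candidates = []
    for doc in later_docs:
        # Naive sentence split on whitespace that follows ., ! or ?
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            sentence_words = words(sentence)
            if sentence_words:
                novelty = len(sentence_words - seen) / len(sentence_words)
                candidates.append((novelty, sentence))

    # The sort is stable, so ties keep their original document order.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in candidates[:max_sentences]]

earlier = ["A storm hit the coast on Monday."]
later = ["A storm hit the coast on Monday. "
         "Officials reported power outages in three towns."]
print(update_summary(earlier, later, max_sentences=1))
```

The repeated first sentence has a novelty of zero and is skipped, while the sentence about power outages is entirely new and is selected. Real systems must also judge importance and readability, which this sketch ignores.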
This year NIST also introduced an evaluation track in which participants could submit their own automatic metrics for summary evaluation. The automatic metric currently in use, ROUGE, is a suite of n-gram statistics that compare a system summary with gold-standard human summaries written for the same source. The degree of overlap gives a measure of content quality that has been shown to correlate highly with human judgments. Track participants were given the output of systems from the update task to score. The scores they submitted were then checked for correlation with those assigned manually by NIST assessors. Interestingly, some of the participants’ metrics outperformed or performed comparably to some of the ROUGE metrics, obtaining very good agreement with human judgments.
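The two steps described above, n-gram overlap scoring and correlation with human scores, can be sketched as follows. This is a simplified illustration of ROUGE-N recall with whitespace tokenization, not the official ROUGE toolkit (which adds options such as stemming and stopword removal), and the Pearson correlation is written out by hand. The function names and example sentences are illustrative.

```python
from collections import Counter
from math import sqrt

def ngrams(text, n):
    """Count the n-grams of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(system_summary, reference_summaries, n=1):
    """Clipped n-gram matches divided by the total n-grams in the references."""
    system_counts = ngrams(system_summary, n)
    matched = total = 0
    for reference in reference_summaries:
        reference_counts = ngrams(reference, n)
        # A reference n-gram is matched at most as often as the system produced it.
        matched += sum(min(count, system_counts[gram])
                       for gram, count in reference_counts.items())
        total += sum(reference_counts.values())
    return matched / total if total else 0.0

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

system = "the cat sat on the mat"
references = ["the cat lay on the mat"]
print(rouge_n_recall(system, references, n=1))
print(rouge_n_recall(system, references, n=2))
```

A candidate metric would be judged by computing its score for each system and passing those scores, together with the corresponding manual scores, to a correlation function such as `pearson`.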
The third track was Recognizing Textual Entailment (RTE). Given two pieces of text, the task of an RTE engine is to predict whether one entails the other. Recognizing entailment relationships between texts is useful for a variety of tasks. For example, in summarization, an RTE system could be used to help remove redundant facts. Systems are usually given two self-contained pieces of text for predicting entailment, but this year a new setup was also introduced in which the candidates for entailment judgments included the full collection of sentences in a set of documents. This setup poses two challenges. First, systems need to work with a realistic distribution of entailment examples, in which positive examples are considerably fewer than non-entailing ones. Second, individual sentences in the collection may not contain all the information needed for prediction. Indeed, runs of this pilot task brought out several issues that systems must address to do better on the task. For example, both within-document and cross-document references to people, dates and places need to be resolved successfully for entailment prediction.
In addition, this year all RTE participants were required to present ablation results for the knowledge resources they used. Participants were asked to submit results obtained by leaving out one resource at a time and running their system with the remaining modules. This made it possible to analyze which resources had the greatest impact on performance when removed. The results showed that the impact of resources varied greatly depending on how systems used them. For example, removing WordNet, one of the commonly used resources, hurt performance for some systems but improved it for others. Still, such analyses help to identify which resources are beneficial overall, so that they can be shared to speed up system development.
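The leave-one-out procedure described above can be sketched in a few lines. The `evaluate` callback stands in for running a full RTE system with a given set of resources enabled; the resource names other than WordNet and all of the scores below are invented purely for illustration and are not TAC results.

```python
def ablation_study(resources, evaluate):
    """Score the full system, then re-score it with each resource removed.

    Returns the baseline score and a dict mapping each resource to the
    change in score caused by removing it (negative means removal hurt).
    """
    full = frozenset(resources)
    baseline = evaluate(full)
    impact = {}
    for resource in resources:
        impact[resource] = evaluate(full - {resource}) - baseline
    return baseline, impact

# Made-up contribution of each resource to a toy accuracy score.
HYPOTHETICAL_GAIN = {"WordNet": 0.04, "paraphrase_table": -0.01, "gazetteer": 0.02}

def toy_evaluate(enabled):
    """Stand-in for a real system run: base accuracy plus per-resource gains."""
    return 0.50 + sum(HYPOTHETICAL_GAIN[r] for r in enabled)

baseline, impact = ablation_study(list(HYPOTHETICAL_GAIN), toy_evaluate)
for resource, delta in sorted(impact.items(), key=lambda item: item[1]):
    print(f"{resource}: {delta:+.3f}")
```

In this toy setup, removing a resource with a negative contribution raises the score, mirroring the observation that the same resource can help one system and hurt another. Real resources interact with one another, so their effects are rarely additive as they are here.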
In sum, the newly introduced tasks and the evaluations brought out several relevant issues in task design and evaluation for text analysis systems. Participants shared their experiences with one another at the talk and poster sessions. More information is available at: http://www.nist.gov/tac/
If you have comments, corrections, or additions to this article, please contact the author: Annie Louis, lannie [at] seas [dot] upenn [dot] edu.

