IEEE

The Penn Discourse Treebank Corpus

Annie Louis

SLTC Newsletter, April 2009

Large-scale annotated corpora of linguistic phenomena play an important role in the development of language technologies. A new resource for studying discourse properties of the English language has been developed at the University of Pennsylvania by a team led by Prof. Aravind Joshi. The Penn Discourse Treebank provides annotations of discourse relations over the 1-million-word Wall Street Journal part of the Penn Treebank (PTB) Corpus. The corpus has enabled NLP researchers to gain a better understanding of discourse structure, and several efforts are underway to use its annotations as a basis for automatically annotating other texts. Meanwhile, recent studies based on the corpus have demonstrated the usefulness of discourse information for applications such as text quality assessment and question generation.

The constituent units of coherent text have a structure that is key to its well-formedness and central to understanding its meaning. Relations between text units (sentences, clauses or utterances) in coherent discourse are called rhetorical relations or discourse relations. For example, a contrast relation holds between the two sentences below.

  • I went to the library to borrow a book. But it had closed early.

Discourse relations are often signaled by the presence of words or phrases, such as 'But' in the previous example. In English, these cue words (discourse connectives) can be realized as subordinating conjunctions (e.g. 'because', 'although'), coordinating conjunctions (e.g. 'and', 'or', 'but') or discourse adverbials such as 'however' and 'then'. But not all discourse relations are marked explicitly. Relations conveyed implicitly are also understood by readers; a causal relation can be inferred between the following two sentences even in the absence of an explicit connective.

  • I took a cab to reach the venue on time. I had missed the train.

Several NLP applications require access to such discourse information. A question answering system presented with a query such as “What are the causes of global warming?” can make more informed choices given the causal relationships that exist in documents containing the keywords “global warming”. A natural language generation system must make discourse-related choices to produce coherent text that satisfies the communicative goal, e.g. constructing a description, making a comparison or presenting a viewpoint. Similarly, in a summarization system, after the content for a summary is selected, the chosen units must be organized to produce a coherent and readable summary. Such tasks would benefit greatly from discourse information. Moreover, when speech is involved, obtaining a full syntactic parse can become difficult because of speech disfluencies. In such cases, discourse relations between clauses and utterances represent invaluable information at the right level of granularity. Discourse structure can also help predict turn-taking behavior between dialog participants.

Automatically identifying discourse relations in text and speech is thus a significant task, for which large-scale annotations of discourse can provide a useful basis for machine learning and tool building. The Rhetorical Structure Theory (RST) corpus [15] and Graphbank [14] are two existing corpora that provide such information for English language texts. The RST annotations are available for 385 Wall Street Journal (WSJ) articles. The Graphbank corpus contains 135 documents: 30 articles from the WSJ and others from the Associated Press. The Penn Discourse Treebank (PDTB) [1] provides annotations of discourse relations over 1 million words of the WSJ portion of the PTB Corpus (2161 articles).

In the PDTB, a discourse relation is defined to hold between abstract objects (facts, propositions and beliefs) which are realized as clauses. This definition serves to distinguish between the two sentences shown below. The first sentence contains a conjunction relation between the facts “John went to the store” and “John bought some fruit”. The 'and' in the second sentence only connects entities, hence does not signal a discourse relation.

  • John went to the store and bought some fruit.
  • John bought some fruit and milk.

The PDTB annotations follow a lexically grounded approach. Annotators first identify the discourse connectives in the text. Next, they mark the spans of the abstract entities that participate in the relation associated with the connective as the relation's arguments, Arg1 and Arg2. The argument syntactically bound to the connective is called Arg2; the other argument is Arg1. In the example annotation shown below, a causal relation is signaled by the connective 'because'. The subordinate clause that 'because' introduces forms Arg2, and the main clause preceding it is Arg1.

  • Solo woodwind players have to be creative if they want to work a lot, because their repertoire and audience appeal are limited.

Relation instances with a discourse connective are categorized as explicit discourse relations.
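
As a rough illustration of the annotation just described, an explicit relation can be pictured as a connective plus two argument spans. The Python sketch below uses a hypothetical in-memory record: the class name, field names and the hand-written string split are assumptions made for this example, and do not reflect the PDTB file format, its Java API, or how annotators actually select spans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExplicitRelation:
    """Hypothetical record for one explicit relation (not the PDTB file format)."""
    connective: str  # the discourse connective found in the text
    arg1: str        # the other argument
    arg2: str        # the argument syntactically bound to the connective


text = ("Solo woodwind players have to be creative if they want to work a lot, "
        "because their repertoire and audience appeal are limited.")

# Split the example sentence around its connective by hand; in the corpus
# the spans are chosen by human annotators, not by string matching.
before, connective, after = text.partition("because")

relation = ExplicitRelation(
    connective=connective,
    arg1=before.rstrip(" ,"),
    arg2=after.strip().rstrip("."),
)

print("connective:", relation.connective)
print("Arg1:", relation.arg1)
print("Arg2:", relation.arg2)
```

Keeping the connective and both arguments in one record mirrors the lexically grounded view: the connective is the anchor, and the arguments are defined relative to it.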

Discourse relations can also be triggered by adjacency, without any explicit markers. For adjacent sentences within the same paragraph, annotators choose the discourse connective that, if inserted between the sentences, would best represent the intended discourse relation. Following connective selection, an appropriate sense is assigned to the relation. The argument spans are also annotated, as in the case of explicit relations. The preliminary step of choosing a connective was designed to guide annotators in making decisions regarding the implicit sense. For the two sentences below, the annotator chose the connective 'however' as most suitable for conveying the implicit contrastive relation.

  • The recent explosion of country funds mirrors the "closed-end fund mania" of the 1920s, Mr. Foot says, when narrowly focused funds grew wildly popular. [However] They fell into oblivion after the 1929 crash.

For both implicit and explicit relations, senses have been assigned from a sense hierarchy that has four categories at the topmost level: "Comparison", "Contingency", "Temporal" and "Expansion". These are further subdivided into types and subtypes. For example, the implicit relation shown above is assigned the sense "Comparison.contrast". Another example sense annotation is "Temporal.Asynchronous.precedence", as in the following sentence.

  • Back downtown, the execs squeezed in a few meetings at the hotel before boarding the buses again.

A temporal relation is present between the events realized in the two arguments. Further, the events do not occur simultaneously; rather, they can be temporally ordered (asynchronous), with the meetings at the hotel preceding the boarding of the buses.
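
The dotted sense labels quoted above can be read as a path through the hierarchy, from top-level category to type to subtype. The following is a minimal sketch of that reading; the `Sense` tuple and the `parse_sense` function are hypothetical names introduced for this example, and only the four top-level categories named in this article are checked.

```python
from typing import NamedTuple, Optional

# The four top-level categories named in the article.
TOP_LEVEL_CLASSES = {"Comparison", "Contingency", "Temporal", "Expansion"}


class Sense(NamedTuple):
    """Hypothetical three-level view of a dotted sense label."""
    sense_class: str
    sense_type: Optional[str] = None
    subtype: Optional[str] = None


def parse_sense(label: str) -> Sense:
    """Split a dotted label such as 'Temporal.Asynchronous.precedence'."""
    parts = label.split(".")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"expected 1 to 3 levels, got {len(parts)}: {label!r}")
    if parts[0] not in TOP_LEVEL_CLASSES:
        raise ValueError(f"unknown top-level category: {parts[0]!r}")
    return Sense(*parts)


print(parse_sense("Comparison.contrast"))
print(parse_sense("Temporal.Asynchronous.precedence"))
```

Leaving the lower levels optional lets the same structure hold labels of different depths, such as the two-level and three-level examples given here.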

The corpus also contains instances of three other relation types, 'Altlex', 'Entity Relation' and 'No Relation', which hold between adjacent sentences and are characterized by the non-insertability of a connective, in contrast to the implicit relations described above. In some cases, other words or phrases in Arg2 convey the relation sense, making the insertion of a connective redundant. This is called 'Altlex', and is illustrated by the pair of sentences below, where inserting a connective like "But" to signal the contrast between them is redundant:

  • This legislation was not drafted by a handful of Democratic "do-gooders." [But] Quite the contrary-- it results from years of work by members of the National Council on the Handicapped, all appointed by President Reagan. --An 'Altlex' relation

When the relation between two sentences is based solely on a common entity shared across them, the spans are annotated as belonging to an Entity Relation.

  • Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. -- An 'entity relation'

A 'No Relation' is assigned when neither a discourse relation nor an entity relation is present. 'Entity Relations' and 'No Relations' are not defined as discourse relations, and hence no sense annotation is provided for them.
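
The five relation types described so far, and the rule about which of them receive a sense, can be summarized in a few lines of Python. This is only a sketch of the article's description: the enum and function names are hypothetical, and the string values are labels chosen for this example rather than the tags used in the corpus files.

```python
from enum import Enum


class RelationType(Enum):
    """Relation types described in the article (labels are illustrative)."""
    EXPLICIT = "Explicit"
    IMPLICIT = "Implicit"
    ALTLEX = "Altlex"
    ENTITY_RELATION = "Entity Relation"
    NO_RELATION = "No Relation"


# Types treated as discourse relations, which therefore carry a sense.
SENSE_BEARING = {RelationType.EXPLICIT, RelationType.IMPLICIT, RelationType.ALTLEX}


def carries_sense(relation_type: RelationType) -> bool:
    """Return True if relations of this type are given a sense annotation."""
    return relation_type in SENSE_BEARING


for relation_type in RelationType:
    print(relation_type.value, carries_sense(relation_type))
```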

In addition to relation senses, attribution features are also annotated in the corpus. Attribution features capture ownership information about abstract entities involved in discourse relations.

  • When Mr. Green won a $ 240,000 verdict in a land condemnation case against the state in June 1983, [he says], Judge O'Kicki unexpectedly awarded him an additional $ 100,000.

In the above example, the temporal relation (indicated by the connective ‘when’) and Arg2 (Mr. Green winning the verdict) are attributed to the writer while Arg1 is attributed to Mr. Green. This difference is captured by a 'source' feature associated with the relation and each of its arguments. For the above example, the source of the connective and Arg2 is the writer; the source of Arg1 is annotated as ‘other’ implying attribution to an agent different from the writer.
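
To make the 'source' feature concrete, the attribution in the example above can be written down as one value for the relation and one for each argument. The sketch below is an assumption-laden illustration: the `Attribution` class, its field names and the plain strings "writer" and "other" are chosen for readability and are not the encoding used in the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """Hypothetical container for the 'source' of a relation and its arguments."""
    relation_source: str
    arg1_source: str
    arg2_source: str

    def sources_differ(self) -> bool:
        """True when the relation and its arguments are not all attributed alike."""
        return len({self.relation_source, self.arg1_source, self.arg2_source}) > 1


# The 'when' example: the relation and Arg2 belong to the writer,
# while Arg1 is attributed to another agent (Mr. Green).
green_example = Attribution(
    relation_source="writer",
    arg1_source="other",
    arg2_source="writer",
)

print(green_example.sources_differ())  # True
```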

The PDTB is theory-neutral, and its annotation is markedly different from other annotations of discourse structure. For example, Rhetorical Structure Theory (RST) [13] posits a single tree structure over a document, and Graphbank [14] hypothesizes that text structure is better captured as a graph of text units connected by discourse relations. Hence, in RST, the text is divided into non-overlapping elementary discourse units. Adjacent units are combined recursively into composite units by annotating the rhetorical relation between them. In contrast, the PDTB makes no assumptions about higher-level structures over a full document. Annotations are guided by the corpus itself: annotators identify connectives and then select the spans that they connect. Current annotations in the PDTB provide only low-level discourse information.

Release of the PDTB has been followed by several studies of discourse phenomena found in the corpus [5, 6, 7], and its annotations are being used in the development of automatic discourse parsers which can identify discourse relations in free text. The necessary tasks involve building classifiers for connective and argument identification [5, 8, 9] and sense disambiguation [2, 3, 4].

Experiments using the PDTB annotations have shown that such discourse information is immensely useful for language tasks. The likelihood of discourse relations in a text is found to be a good predictor of text quality as perceived by its readers [11]. Arguments of causal relations, both explicit and implicit, as annotated in the PDTB have been shown to correlate well with the text units needed for generating 'why' questions [10].

The PDTB has also paved the way for similar annotation efforts in Czech (Charles University, Prague), Turkish (Middle East Technical University, Ankara), Hindi (IIIT, Hyderabad) and Arabic (Leeds University, UK). Preliminary studies have shown that discourse properties vary across different languages. For instance, while implicit and explicit relations are almost equally distributed in the PDTB, explicit relations are less frequent in Hindi. In addition, analyses of discourse connectivity in English texts from other domains, e.g. biomedical texts [12], are also being carried out.

The latest version is PDTB 2.0, which can be obtained from the Linguistic Data Consortium. A link to the catalog entry is provided here.

Complete information regarding the annotation process is present in the annotation manual.

Publications related to the PDTB are available from the project webpage.

A browser tool has been developed for viewing the discourse annotations and their mapping to syntactic parse trees from the Penn Treebank. The browser and Java API for using the PDTB annotations can be downloaded from the API support page.

References:

[1] Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber, "The Penn Discourse Treebank 2.0", Proceedings of LREC 2008.

[2] Eleni Miltsakaki, Nikhil Dinesh, Rashmi Prasad, Aravind Joshi and Bonnie Webber, "Experiments on Sense Annotations and Sense Disambiguation of Discourse Connectives", In Proceedings of the Fourth Workshop on Treebanks and Linguistic Theories (TLT2005), 2005.

[3] Emily Pitler, Mridhula Raghupathy, Hena Mehta, Ani Nenkova, Alan Lee, Aravind Joshi, "Easily Identifiable Discourse Relations", Proceedings of COLING, 2008.

[4] Sasha Blair-Goldensohn, Kathleen R. McKeown and Owen Rambow, "Building and Refining Rhetorical-Semantic Relation Models", In Proceedings of NAACL-HLT, 2007.

[5] Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi and Bonnie Webber, "Attribution and the (Non)-Alignment of Syntactic and Discourse Arguments of Connectives", In Proceedings of the ACL Workshop on Frontiers in Corpus Annotation II, 2005.

[6] Alan Lee, Rashmi Prasad, Aravind Joshi and Bonnie Webber, "Departures from Tree Structures in Discourse: Shared Arguments in the Penn Discourse Treebank", In Proceedings of the Constraints in Discourse III Workshop, 2008.

[7] Bonnie Webber and Rashmi Prasad, "Sentence-Initial Discourse Connectives, Discourse Structure and Semantics", In Proceedings of the Workshop on Formal and Experimental Approaches to Discourse Particles and Modal Adverbs, 2008.

[8] Ben Wellner and James Pustejovsky, "Automatically Identifying the Arguments of Discourse Connectives", In Proceedings of EMNLP-CoNLL, 2007.

[9] Robert Elwell and Jason Baldridge, "Discourse connective argument identification with connective specific rankers", In Proceedings of ICSC-2008.

[10] Rashmi Prasad and Aravind Joshi, "A Discourse-based Approach to Generating Why-Questions from Texts", Proceedings of the Workshop on the Question Generation Shared Task and Evaluation Challenge, 2008.

[11] Emily Pitler and Ani Nenkova, "Revisiting Readability: A Unified Framework for Predicting Text Quality", Proceedings of EMNLP, 2008.

[12] Hong Yu, Nadya Frid, Susan McRoy, Rashmi Prasad, Alan Lee and Aravind Joshi, "A Pilot Annotation to Investigate Discourse Connectivity in Biomedical Text", In Proceedings of the ACL:HLT 2008 BioNLP Workshop, 2008.

[13] William C. Mann and Sandra A. Thompson, "Rhetorical structure theory: Towards a functional theory of text organization", Text, 8, 1988.

[14] Florian Wolf and Edward Gibson, "Representing discourse coherence: A corpus-based study", Computational Linguistics, 31(2):249-288, 2005.

[15] Lynn Carlson, Daniel Marcu and Mary Okurowski, "Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory", In Proceedings of the Second SIGdial Workshop on Discourse and Dialog, 2001.