Data-driven Models of Text Structure and their Applications
Annie Louis
SLTC Newsletter, October 2009
Linguistic theories describe the structure of coherent text in several different ways. Some relate text units through discourse relations such as "contrast", "elaboration", "cause" and "temporal". Other accounts relate adjacent sentences by a common topic when those sentences refer to the same entities. There has been considerable interest in discourse parsing, that is, automatically identifying such discourse relations in free text.
Independently, computational models that learn text structure in an unsupervised manner have been developed. Some salient properties of these approaches are:
- They aim to discover structure information by empirically examining large corpora of coherent texts.
- They exploit regularities present in naturally occurring discourse. Texts on the same topic tend to have similar organization. Their word distributions exhibit certain fixed patterns. Similarly, information that pertains to the same subtopic often appears contiguously. For example, sentences in the same paragraph have a common focus.
- Models have been developed that can learn local (between adjacent sentences) as well as global document structure.
Empirical models of text structure have had considerable success in several NLP applications. Some of these approaches are described below.
Domain dependent word co-occurrence models:
Lexical co-occurrence forms one set of cues for text structure. Words tend to appear in regular patterns in documents from the same domain or those sharing the same topic. For example, in news reports about any disaster, information about casualties typically precedes descriptions of relief efforts. Examining a large number of documents on a topic can reveal such patterns. Different models based on this idea have shown that both relationships between adjacent sentences and global document structure can be induced from word co-occurrence information.
For instance, one can learn which words are likely to appear in adjacent sentences from a large corpus of documents [1, 3]. Global models can be built by extending the same idea. Barzilay and Lee [2], for example, used documents on the same topic to obtain clusters of related sentences and an HMM to learn likely transitions between the clusters. In a different approach by Chen et al. [4], global structure was learnt directly as a permutation of subtopics or word distributions rather than as Markovian transitions between consecutive topics.
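The local version of this idea can be illustrated with a small sketch. It is not the method of any of the cited papers: the function names, the add-one smoothing, the averaging over word pairs and the toy corpus are all assumptions made for illustration. It counts which words follow which across adjacent sentences, then picks the ordering of a few unseen sentences whose transitions look most like the training data.

```python
import math
from collections import Counter
from itertools import permutations, product

def train(docs):
    """Count (word in sentence i, word in sentence i+1) pairs over a corpus."""
    pair_counts, prev_totals, vocab = Counter(), Counter(), set()
    for doc in docs:
        for sent in doc:
            vocab.update(sent)
        for prev, nxt in zip(doc, doc[1:]):
            for a, b in product(set(prev), set(nxt)):
                pair_counts[a, b] += 1
                prev_totals[a] += 1
    return pair_counts, prev_totals, len(vocab)

def transition_score(model, prev, nxt):
    """Average add-one-smoothed log P(b | a) over word pairs of two sentences.

    Assumes both sentences are non-empty.
    """
    pair_counts, prev_totals, vocab_size = model
    logps = [
        math.log((pair_counts[a, b] + 1) / (prev_totals[a] + vocab_size))
        for a, b in product(set(prev), set(nxt))
    ]
    return sum(logps) / len(logps)

def best_order(model, sentences):
    """Brute-force search over orderings; only feasible for a few sentences."""
    def score(order):
        return sum(transition_score(model, p, n) for p, n in zip(order, order[1:]))
    return list(max(permutations(sentences), key=score))

# Toy, hand-made corpus (hypothetical): casualties precede relief efforts.
toy_docs = [
    [["quake", "hits", "city"], ["casualties", "reported"], ["relief", "arrives"]],
    [["flood", "hits", "town"], ["casualties", "rise"], ["relief", "teams", "arrive"]],
]
model = train(toy_docs)
shuffled = [["relief", "arrives"], ["casualties", "reported"], ["storm", "hits", "coast"]]
ordered = best_order(model, shuffled)
```

On this toy corpus the search places the "hits" sentence first, the casualties sentence second and the relief sentence last. Exhaustive search over permutations grows factorially, so a practical system would need an approximate search or, for global structure, a model such as the HMM described above.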
Topic independent entity coherence model:
Document structure can also be described as text units interconnected by the presence of common entities. Mentions of the same entity across adjacent sentences are indicative of common focus and local coherence. Corpus-based methods using this view aim to learn different entity overlap patterns commonly present in texts. These patterns encode preferences that are characteristic of coherent organization.
Under this perspective, the actual lexical identity of words can be abstracted away, resulting in a topic-independent framework. In other words, local coherence can be computed based on whether entities are shared across sentences, without using the actual words themselves. The entity grid [7] is one model driven entirely by entity overlap across adjacent sentences. While traditional theories like Centering [8] predict some patterns of entity sharing to be more coherent than others, data-driven approaches such as these learn preferences as they exist in large collections.
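A minimal sketch of a grid-style representation is given below. It assumes each sentence has already been reduced to a mapping from entity to grammatical role (S for subject, O for object, X for other), which in practice would require parsing and coreference resolution; the function names and the toy sentences are hypothetical. The sketch records each entity's role per sentence, with "-" for absence, and then estimates how often each role transition occurs between adjacent sentences.

```python
from collections import Counter

def build_grid(sentences):
    """Map each entity to its sequence of roles, one per sentence ('-' if absent)."""
    entities = sorted({entity for sent in sentences for entity in sent})
    return {entity: [sent.get(entity, "-") for sent in sentences] for entity in entities}

def transition_probs(grid, n=2):
    """Relative frequency of each length-n role transition over all entities."""
    counts = Counter()
    for roles in grid.values():
        for i in range(len(roles) - n + 1):
            counts[tuple(roles[i:i + n])] += 1
    total = sum(counts.values())
    return {transition: c / total for transition, c in counts.items()}

# Toy input (hypothetical): entity -> role for three consecutive sentences.
toy_sentences = [
    {"company": "S", "market": "X"},
    {"company": "S", "suit": "O"},
    {"suit": "S", "government": "X"},
]
grid = build_grid(toy_sentences)
probs = transition_probs(grid)
```

Note that the resulting distribution mentions only roles, never the entity names, which is what makes the representation independent of topic. Such transition frequencies could then serve as features for comparing alternative orderings of a text.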
Temporal model of narrative structure:
While the models described above learn general word distribution patterns, learning fine-grained relationships may be suitable for some domains. For example, narratives tend to have an event-driven structure, and a temporal structure could describe such discourse well.
In Chambers and Jurafsky [5], related events (verbs) and their participants are learnt from free text. Whereas the generic models for local coherence rely broadly on adjacency or co-occurrence, this work limits the space of possibilities by considering only events in the same document that also share a participant: events with a common "protagonist" are likely to be semantically related. Corpus co-occurrence counts are then used to obtain confidence scores for a candidate pair of verbs. Narrative structure is built as a chain of verbs together with precedence relationships obtained using a temporal classifier.
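The counting step can be sketched as follows. The text above only says that co-occurrence counts yield confidence scores; the sketch assumes pointwise mutual information (PMI) as the score, and it assumes events arrive as (verb, participant) pairs with coreference already resolved. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pair_counts(docs):
    """Count verb pairs that share a participant within the same document."""
    counts = Counter()
    for events in docs:
        verbs_by_entity = {}
        for verb, entity in events:
            verbs_by_entity.setdefault(entity, set()).add(verb)
        for verbs in verbs_by_entity.values():
            for a, b in combinations(sorted(verbs), 2):
                counts[a, b] += 1
    return counts

def pmi(counts, a, b):
    """PMI of two verbs, treating each counted pair as one observation."""
    a, b = sorted((a, b))
    joint = counts[a, b]
    if joint == 0:
        return float("-inf")
    total = sum(counts.values())

    def marginal(verb):
        return sum(c for pair, c in counts.items() if verb in pair)

    return math.log(joint * total / (marginal(a) * marginal(b)))

# Toy documents (hypothetical): each event is (verb, participant).
toy_docs = [
    [("arrest", "suspect"), ("charge", "suspect"), ("convict", "suspect"),
     ("report", "journalist")],
    [("arrest", "thief"), ("charge", "thief"), ("eat", "thief")],
    [("eat", "child"), ("sleep", "child")],
]
counts = pair_counts(toy_docs)
```

On this toy data, "arrest" and "charge" share a protagonist in two documents and score higher than "arrest" and "eat", which co-occur once. Ordering the verbs of a chain in time would need a separate temporal classifier, which is outside this sketch.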
Applications of data-driven models:
The models described above have been successfully used for text-ordering tasks, topic segmentation and for detecting important text units for summarization. Recent work by McIntyre and Lapata [6] is another typical example of a data-driven approach. As a first step in their story generation process, information about entities is learnt from a large collection of stories, for example "dogs_bark" and "dogs_bark_at_cats". A sentence is produced by choosing an entity and a likely event. Consecutive sentences borrow entities from the previous ones, and events are chosen based on both the previous event and the entity currently in focus. The generated stories are then ranked by entity coherence and story likelihood to select the best one.
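The event-selection step of such a generator can be illustrated with a greedy toy sketch. It is a simplification and not the system of [6]: the count tables, the scoring formula and the function names are invented for illustration, and the sketch produces a single story instead of generating many candidates and ranking them.

```python
from collections import Counter

# Hypothetical counts, standing in for statistics learnt from a story corpus.
entity_event = Counter({
    ("dog", "bark"): 5, ("dog", "run"): 3,
    ("cat", "run"): 4, ("cat", "hide"): 2,
})
event_follow = Counter({
    ("bark", "run"): 4, ("bark", "hide"): 1,
    ("run", "hide"): 3, ("run", "bark"): 1,
})

def next_event(entity, prev_event=None):
    """Pick the event that best fits the entity in focus and the previous event."""
    candidates = [ev for (en, ev) in entity_event if en == entity and ev != prev_event]

    def score(event):
        s = entity_event[entity, event]
        if prev_event is not None:
            # Add-one so unseen event sequences are penalized, not ruled out.
            s *= event_follow[prev_event, event] + 1
        return s

    # Raises ValueError if the entity has no candidate events.
    return max(candidates, key=score)

def generate(entities):
    """Greedily build a story as a list of (entity, event) pairs."""
    story, prev = [], None
    for entity in entities:
        prev = next_event(entity, prev)
        story.append((entity, prev))
    return story

story = generate(["dog", "cat", "cat"])
```

Each step conditions on both the entity currently in focus and the preceding event, mirroring the description above. A fuller sketch would keep several candidate stories and rank them by coherence and likelihood.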
References:
[1] Mirella Lapata, "Probabilistic Text Structuring: Experiments with Sentence Ordering", ACL 2003.
[2] Regina Barzilay and Lillian Lee, "Catching the drift: Probabilistic Content Models with applications to generation and summarization", NAACL-HLT 2004.
[3] Radu Soricut and Daniel Marcu, "Discourse Generation using Utility-Trained Coherence Models", ACL 2006.
[4] Harr Chen, S.R.K. Branavan, Regina Barzilay and David Karger, "Global Models of Document Structure Using Latent Permutations", NAACL 2009.
[5] Nate Chambers and Dan Jurafsky, "Unsupervised Learning of Narrative Schemas and their Participants", ACL-IJCNLP 2009.
[6] Neil McIntyre and Mirella Lapata, "Learning to Tell Tales: A Data-driven Approach to Story Generation", ACL-IJCNLP 2009.
[7] Regina Barzilay and Mirella Lapata, "Modeling Local Coherence: An Entity-based Approach", Computational Linguistics, 2008.
[8] Barbara Grosz, Aravind Joshi and Scott Weinstein, "Centering: A Framework for modeling the local coherence of discourse", Computational Linguistics, 1995.

