Digital Humanities and Language Technology

Antonio Roque

SLTC Newsletter, July 2010

The rise of computer technology has influenced research in all disciplines, including the humanities. Examples of humanities research that uses language technology include classical lexicography supported by computational linguistics, and authorship attribution using machine learning methods.

Classical Lexicography

One of the first applications of language technology in humanities research was the development of dictionaries, concordances, and similar research tools. Bamman and Crane (2009) provide a recent example in the Classics, describing the development of lexical resources for Latin and Greek literature. They describe traditional methods of reference work development, emphasizing the manual labor involved and the level of expertise required of the developers, as well as tasks such as finding word examples in texts, clustering them by 'sense', and labeling those senses. They note a key difference between classical text collections and modern corpora: "The meaning of the word child in a single sentence from the Wall Street Journal is hardly a research question worth asking, except for the newspaper's significance in being representative of the language at large; but this same question when asked of Vergil's fourth Eclogue has been at the center of scholarly debate since the time of the emperor Constantine". However, although certain texts and certain time periods (such as the Golden and Silver Ages of Latin literature) have been extensively studied to provide lexical resources such as definitions with "word in context" text excerpts and references, there are several centuries' worth of texts that have not been so analyzed. In large part this is because of the expertise required in the time-consuming process of collecting citations in classical languages and merging them into distinct senses.

To complement the centuries' worth of expertly-crafted resources that currently exist, computational lexicography emerged in the late 1980s. Computational lexicography has produced resources from increasingly large corpora, and has used more complex algorithms to extract information regarding, for example, grammatical features. Bamman and Crane describe their vision of a lexical resource that includes a word's possible senses, a list of usages of each sense in the source texts, subcategorization frames, and selectional preferences by author. They then describe their work in developing such a resource. They use parallel texts (translations in different languages of the same text) to do word sense induction; this involves automated sentence and word alignments, which are then used to calculate translation probabilities. They also use parallel texts to train Bayesian classifiers for word sense disambiguation, and are creating treebanks in classical languages to develop parsers for studying lexical subcategorization and selectional preference. In this way, they plan to produce tools that give students a richer learning resource, and that allow scholars in the Classics to interact with the texts in more powerful ways.
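As a rough illustration of how word alignments can yield translation probabilities, the following minimal Python sketch estimates P(target word | source word) by relative frequency over aligned word pairs. It is a sketch under stated assumptions, not Bamman and Crane's implementation: the function name `translation_probabilities` and the toy Latin-English pairs are invented for this example, and the word alignment step itself is assumed to have been done already.

```python
from collections import Counter, defaultdict


def translation_probabilities(aligned_pairs):
    """Estimate P(target | source) by relative frequency.

    aligned_pairs is a list of (source_word, target_word) tuples,
    assumed to come from a prior automatic word alignment step.
    """
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(src for src, _ in aligned_pairs)
    probs = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        probs[src][tgt] = n / source_counts[src]
    return dict(probs)


# Toy, invented alignments (Latin word, English word); not data from the paper.
toy_alignments = [
    ("oratio", "speech"),
    ("oratio", "speech"),
    ("oratio", "speech"),
    ("oratio", "prayer"),
    ("puer", "boy"),
    ("puer", "child"),
]

probs = translation_probabilities(toy_alignments)
for source_word, targets in probs.items():
    print(source_word, targets)
```

In this toy setting, the distinct English translations of a Latin word act as candidate senses, and the estimated probabilities indicate how often each one occurs in the aligned texts.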

Authorship Attribution

Another humanities research area with a (relatively) long history is authorship attribution, in which a text's author is identified using a classifier and a training corpus of texts with known authors. Jockers and Witten (2010) note that, unlike typical text classification tasks, classification here involves the style, rather than the content, of a document. They survey the field, note a consensus on high-frequency words and/or n-grams as the choice of features, and decry the lack of thorough comparisons of the relative effectiveness of the various possible classification algorithms. To fill this void, they compare a number of classification methods (k-nearest neighbors, support vector machines, nearest shrunken centroids, regularized discriminant analysis, and Delta) on a prototypical authorship attribution task: identifying the authorship of various disputed essays in the Federalist Papers, written in the late 1780s. On cross-validation and test data, they find that nearest shrunken centroids and regularized discriminant analysis perform best, and suggest that these methods are worth applying further, as they had not previously been used for this task.
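To make the idea of style-based classification on high-frequency words more concrete, here is a minimal Python sketch of a simplified form of Delta, one of the methods in the comparison above. It is an illustrative sketch, not the setup used by Jockers and Witten: the function names, the choice of three features, and the toy texts for authors "A" and "B" are all assumptions made for this example. Each text is represented by the relative frequencies of the most frequent words, these are standardized as z-scores, and the disputed text is assigned to the author whose profile has the smallest mean absolute z-score difference.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]


def delta_scores(known, disputed, n_features=3):
    """Return {author: Delta distance to the disputed text}.

    known maps each author name to a token list; disputed is a token list.
    """
    # Features: the most frequent words in the pooled known texts.
    pooled = Counter(tok for toks in known.values() for tok in toks)
    vocab = [w for w, _ in pooled.most_common(n_features)]
    profiles = {a: relative_freqs(t, vocab) for a, t in known.items()}

    # Per-feature mean and standard deviation over the known profiles.
    columns = list(zip(*profiles.values()))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]  # guard against zero

    def z(freqs):
        return [(f - m) / s for f, m, s in zip(freqs, mus, sigmas)]

    z_disputed = z(relative_freqs(disputed, vocab))
    return {
        author: mean([abs(x - y) for x, y in zip(z(p), z_disputed)])
        for author, p in profiles.items()
    }


# Toy, invented texts; not the Federalist Papers.
known = {
    "A": "upon the whole the plan upon the table upon".split(),
    "B": "while the plan is on the table while we wait while".split(),
}
disputed = "upon the table the plan upon".split()

scores = delta_scores(known, disputed)
best_author = min(scores, key=scores.get)
print(scores, best_author)
```

A realistic experiment would use many more features and several texts per author; the sketch keeps one short profile per author only to keep the arithmetic easy to follow.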

In these and other ways, humanities research continues to benefit from applying techniques from engineering and the sciences.

For more information, see: