Book announcements

To post a book announcement, please email speechnewseds [at] listserv (dot) ieee [dot] org.

Data-Intensive Text Processing with MapReduce

Jimmy Lin and Chris Dyer
University of Maryland
Synthesis Lectures on Human Language Technologies #7 (Morgan & Claypool Publishers), 2010, 177 pages

Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book not only aims to help the reader "think in MapReduce" but also discusses the limitations of the programming model.

Table of Contents: Introduction / MapReduce Basics / MapReduce Algorithm Design / Inverted Indexing for Text Retrieval / Graph Algorithms / EM Algorithms for Text Processing / Closing Remarks

http://dx.doi.org/10.2200/S00274ED1V01Y201006HLT007

This title is available online without charge to members of institutions that have licensed the Synthesis Digital Library of Engineering and Computer Science. Members of licensing institutions have unlimited access to download, save, and print the PDF without restriction; use of the book as a course text is encouraged. To find out whether your institution is a subscriber, click on the book's URL above from an institutional IP address and attempt to download the PDF. Others may purchase the book from this URL as a PDF download for US$30 or in print for US$40. Printed copies are also available from Amazon and from booksellers worldwide at approximately US$40 or local currency equivalent.

Automated Grammatical Error Detection for Language Learners

Claudia Leacock, Martin Chodorow, Michael Gamon, Joel Tetreault
Butler Hill Group, City University of New York, Microsoft Research, and Educational Testing Service
Synthesis Lectures on Human Language Technologies #9 (Morgan & Claypool Publishers), 2010, xi+122 pages

It has been estimated that over a billion people are using or learning English as a second or foreign language, and the numbers are growing not only for English but for other languages as well. These language learners provide a burgeoning market for tools that help identify and correct learners' writing errors. Unfortunately, the errors targeted by typical commercial proofreading tools do not include those aspects of a second language that are hardest to learn. This volume describes the types of constructions English language learners find most difficult -- constructions containing prepositions, articles, and collocations. It provides an overview of the automated approaches that have been developed to identify and correct these and other classes of learner errors in a number of languages.

Error annotation and system evaluation are particularly important topics in grammatical error detection because there are no commonly accepted standards. Chapters in the book describe the options available to researchers, recommend best practices for reporting results, and present annotation and evaluation schemes.

The final chapters explore recent innovative work that opens new directions for research. It is the authors' hope that this volume will contribute to the growing interest in grammatical error detection by encouraging researchers to take a closer look at the field and its many challenging problems.

Table of Contents: Introduction / History of Automated Grammatical Error Detection / Special Problems of Language Learners / Language Learner Data / Evaluating Error Detection Systems / Article and Preposition Errors / Collocation Errors / Different Approaches for Different Errors / Annotating Learner Errors / New Directions / Conclusion

http://dx.doi.org/10.2200/S00275ED1V01Y201006HLT009

This title is available online without charge to members of institutions that have licensed the Synthesis Digital Library of Engineering and Computer Science. Members of licensing institutions have unlimited access to download, save, and print the PDF without restriction; use of the book as a course text is encouraged. To find out whether your institution is a subscriber, click on the book's URL above from an institutional IP address and attempt to download the PDF. Others may purchase the book from this URL as a PDF download for US$30 or in print for US$40. Printed copies are also available from Amazon and from booksellers worldwide at approximately US$40 or local currency equivalent.

Cross-Language Information Retrieval

Jian-Yun Nie
University of Montreal
Synthesis Lectures on Human Language Technologies #8 (Morgan & Claypool Publishers), 2010, 125 pages

The search for information is no longer limited to the user's native language but increasingly extends to other languages. This gives rise to the problem of cross-language information retrieval (CLIR), whose goal is to find relevant information written in a language different from that of the query. In addition to the problems of monolingual information retrieval (IR), translation is the key problem in CLIR: one should translate either the query or the documents from one language to another. However, this translation problem is not identical to full-text machine translation (MT): the goal is not to produce a human-readable translation, but a translation suitable for finding relevant documents. Specific translation methods are thus required.

The goal of this book is to provide a comprehensive description of the specific problems arising in CLIR, the solutions proposed in this area, as well as the remaining problems. The book starts with a general description of the monolingual IR and CLIR problems. Different classes of approaches to translation are then presented: approaches using an MT system, dictionary-based translation and approaches based on parallel and comparable corpora. In addition, the typical retrieval effectiveness using different approaches is compared. It will be shown that translation approaches specifically designed for CLIR can rival and outperform high-quality MT systems. Finally, the book offers a look into the future that draws a strong parallel between query expansion in monolingual IR and query translation in CLIR, suggesting that many approaches developed in monolingual IR can be adapted to CLIR.

The book can be used as an introduction to CLIR. Advanced readers can also find more technical details and discussions of the research challenges that remain for the future. It is suitable for new researchers who intend to carry out research on CLIR.

Table of Contents: Preface / Introduction / Using Manually Constructed Translation Systems and Resources for CLIR / Translation Based on Parallel and Comparable Corpora / Other Methods to Improve CLIR / A Look into the Future: Toward a Unified View of Monolingual IR and CLIR? / References

http://dx.doi.org/10.2200/S00266ED1V01Y201005HLT008

This title is available online without charge to members of institutions that have licensed the Synthesis Digital Library of Engineering and Computer Science. Members of licensing institutions have unlimited access to download, save, and print the PDF without restriction; use of the book as a course text is encouraged. To find out whether your institution is a subscriber, click on the book's URL above from an institutional IP address and attempt to download the PDF. Others may purchase the book from this URL as a PDF download for US$30 or in print for US$40. Printed copies are also available from Amazon and from booksellers worldwide at approximately US$40 or local currency equivalent.

Semantic Role Labeling

Martha Palmer, Daniel Gildea, Nianwen Xue
University of Colorado Boulder, University of Rochester, Brandeis University
Synthesis Lectures on Human Language Technologies #6 (Morgan & Claypool Publishers), 2010, 103 pages

This book is aimed at providing an overview of several aspects of semantic role labeling. Chapter 1 begins with linguistic background on the definition of semantic roles and the controversies surrounding them. Chapter 2 describes how the theories have led to structured lexicons such as FrameNet, VerbNet and the PropBank Frame Files that in turn provide the basis for large scale semantic annotation of corpora. This data has facilitated the development of automatic semantic role labeling systems based on supervised machine learning techniques. Chapter 3 presents the general principles of applying both supervised and unsupervised machine learning to this task, with a description of the standard stages and feature choices, as well as giving details of several specific systems. Recent advances include the use of joint inference to take advantage of context sensitivities, and attempts to improve performance by closer integration of the syntactic parsing task with semantic role labeling. Chapter 3 also discusses the impact the granularity of the semantic roles has on system performance. Having outlined the basic approach with respect to English, Chapter 4 goes on to discuss applying the same techniques to other languages, using Chinese as the primary example. Although substantial training data is available for Chinese, this is not the case for many other languages, and techniques for projecting English role labels onto parallel corpora are also presented.

Table of Contents: Preface / Semantic Roles / Available Lexical Resources / Machine Learning for Semantic Role Labeling / A Cross-Lingual Perspective / Summary

http://dx.doi.org/10.2200/S00239ED1V01Y200912HLT006

This title is available online without charge to members of institutions that have licensed the Synthesis Digital Library of Engineering and Computer Science. Members of licensing institutions have unlimited access to download, save, and print the PDF without restriction; use of the book as a course text is encouraged. To find out whether your institution is a subscriber, click on the book's URL above from an institutional IP address and attempt to download the PDF. Others may purchase the book from this URL as a PDF download for US$30 or in print for US$40. Printed copies are also available from Amazon and from booksellers worldwide at approximately US$40 or local currency equivalent.

Supertagging: Using Complex Lexical Descriptions in Natural Language Processing

Edited by Srinivas Bangalore and Aravind K. Joshi

The last decade has seen computational implementations of large hand-crafted natural language grammars in formal frameworks such as Tree-Adjoining Grammar (TAG), Combinatory Categorial Grammar (CCG), Head-driven Phrase Structure Grammar (HPSG), and Lexical Functional Grammar (LFG). Grammars in these frameworks typically associate linguistically motivated rich descriptions (Supertags) with words. With the availability of parse-annotated corpora, grammars in the TAG and CCG frameworks have also been automatically extracted while maintaining the linguistic relevance of the extracted Supertags. In these frameworks, Supertags are designed so that complex linguistic constraints are localized to operate within the domain of those descriptions. While this localization increases local ambiguity, the process of disambiguation (Supertagging) provides a unique way of combining linguistic and statistical information.

This volume investigates the theme of employing statistical approaches with linguistically motivated representations and its impact on Natural Language Processing tasks. In particular, the contributors describe research in which words are associated with Supertags that are the primitives of different grammar formalisms including Lexicalized Tree-Adjoining Grammar (LTAG).

Contributors: Jens Bäcker, Srinivas Bangalore, Akshar Bharati, Pierre Boullier, Tomas By, John Chen, Stephen Clark, Berthold Crysmann, James R. Curran, Kilian Foth, Robert Frank, Karin Harbusch, Mary Harper, Saša Hasan, Aravind Joshi, Vincenzo Lombardo, Takuya Matsuzaki, Alessandro Mazzei, Wolfgang Menzel, Yusuke Miyao, Richard Moot, Alexis Nasr, Günter Neumann, Martha Palmer, Owen Rambow, Rajeev Sangal, Anoop Sarkar, Giorgio Satta, Libin Shen, Patrick Sturt, Jun’ichi Tsujii, K. Vijay-Shanker, Wen Wang, Fei Xia

About the Editors

Srinivas Bangalore is Principal Technical Staff Member at AT&T Labs-Research. He was awarded the AT&T Science and Technology Medal in 2009 for technical leadership and innovative contributions in Spoken Language Technology and Services.

Aravind K. Joshi is Henry Salvatori Professor of Computer and Cognitive Science at the University of Pennsylvania. He received the David Rumelhart Prize for fundamental theoretical contributions to the cognitive sciences in 2003.

http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12149

Opinion Mining and Sentiment Analysis

Bo Pang and Lillian Lee
Now Publishers, 2008, 135 pages.
Also Foundations and Trends in Information Retrieval 2(1-2), 2008.

Official publisher website: http://www.nowpublishers.com/product.aspx?product=INR&doi=1500000011

Author-maintained book homepage, including the entire PDF posted for free download with no access restrictions, by publisher permission: http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html

An important part of our information-gathering behavior has always been to find out what other people think. With the growing availability and popularity of opinion-rich resources such as online review sites and personal blogs, new opportunities and challenges arise as people can, and do, actively use information technologies to seek out and understand the opinions of others. The sudden eruption of activity in the area of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text, has thus occurred at least in part as a direct response to the surge of interest in new systems that deal directly with opinions as a first-class object.

This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems. Our focus is on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis. We include material on summarization of evaluative text and on broader issues regarding privacy, vulnerability to manipulation, and economic impact that the development of opinion-oriented information-access services gives rise to. To facilitate future work, a discussion of available resources, benchmark datasets, and evaluation campaigns is also provided.