Multilingual Models for NLP
Annie Louis
SLTC Newsletter, July 2009
Ambiguity in language is problematic for NLP systems. However, what is ambiguous in one language might be unambiguous in another. This benefit forms the motivation for a multilingual framework to perform disambiguation in several language tasks. Multilingual models use information about equivalent words or sentences in two or more languages and have often been found to yield considerable improvements over results from monolingual features. In addition, the feasibility of this approach has greatly increased in recent years with the availability of large amounts of parallel texts and translations in different languages. This article surveys a suite of multilingual models that have been discussed in some of the most recent work presented at the NAACL and EACL conferences this year. These models cover a variety of topics, ranging from ambiguity resolution for parsing and part-of-speech tagging, to better methods for model design and preprocessing in statistical machine translation systems, to findings about language similarities that enable systems built for one language to be used for another.
Snyder et al. [2] present a multilingual approach to resolving ambiguity during part-of-speech (POS) tagging. For example, a word like “fish” can be used in English as either a noun (in “I like fish”) or a verb (in “I love to fish”). But in French, the words for the noun (“poisson”) and the verb (“pêcher”) are different. Therefore, once the tag for the French word is known, which in this example can be assigned unambiguously, the correct tag for the English equivalent can be chosen. In this work, the basic structure consists of a Hidden Markov Model (HMM) for each language, designed to capture properties of tag sequences that are specific to that language. Such a model structure is standard in monolingual setups. Word alignments across the languages are then used to build a superstructure that learns multilingual patterns. The model can thus learn the POS tags for a particular language using both the patterns of POS transitions in that language and information about the tags for the same word in a different language. When used to tag monolingual texts in 8 European languages, the multilingual model obtained better performance than monolingual models in a majority of the cases. Another experiment showed that performance improved continuously as the number of languages in the model was increased.
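The intuition behind the “fish” example can be sketched in a few lines of Python. This is only a toy illustration of cross-lingual disambiguation, not the unsupervised HMM-based model of Snyder et al. [2]; the tag dictionaries and the function name `disambiguate` are hypothetical.

```python
# Toy tag dictionaries (hypothetical, for illustration only).
ENGLISH_TAGS = {"fish": {"NOUN", "VERB"}, "like": {"VERB"}}
FRENCH_TAGS = {"poisson": {"NOUN"}, "pêcher": {"VERB"}}


def disambiguate(english_word, aligned_french_word):
    """Return the English tags compatible with the aligned French word."""
    candidates = ENGLISH_TAGS[english_word]
    evidence = FRENCH_TAGS.get(aligned_french_word, set())
    narrowed = candidates & evidence
    # Fall back to the monolingual candidates when the alignment
    # provides no usable evidence.
    return narrowed or candidates
```

Calling `disambiguate("fish", "poisson")` leaves only the noun reading, while an unknown French word leaves the English ambiguity untouched.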
A similar intuition can be employed to disambiguate between different senses of the same word. A word with multiple senses in one language may have different lexical forms for the senses in another language. Work by Apidianaki [10] presents an analysis of and a new approach to multilingual word sense disambiguation.
Multilingual models are also attractive for learning larger syntactic structures, where several problems related to ambiguity are often encountered. Fraser et al. [1] show that parse reranking can be improved by comparing the candidate parses for a sentence in one language to parse trees for the sentence in a different language. Their experiments use English sentences and their German translations. The set of parses generated for an English sentence is compared to the 1-best parse for the equivalent German sentence. The divergence between the parse trees is examined and used to rank the candidate English parses by closeness to their German counterpart. For example, two features found to be very indicative of syntactic divergence are the difference in spans and the correspondence in depth of the two parse trees. The model using these bilingual features resulted in improved performance over a monolingually trained parser. The German parse trees and word alignments are also obtained automatically, showing that considerable benefits can be had from a multilingual model even with slightly noisy information.
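As a rough sketch of reranking by divergence, the Python below orders candidate English parses by how closely their depth matches that of a German parse. It assumes trees are nested tuples of the form (label, child, ...) with strings as leaves, and it uses depth as the only feature; the system of Fraser et al. [1] combines many bilingual features with word alignments, so the names `depth` and `rerank` and the tree encoding are illustrative assumptions.

```python
def depth(tree):
    """Depth of a tree given as nested tuples (label, child, ...).

    Leaves are plain strings and have depth 0.
    """
    if isinstance(tree, str):
        return 0
    return 1 + max((depth(child) for child in tree[1:]), default=0)


def rerank(english_candidates, german_parse):
    """Sort candidate English parses by depth divergence from the German parse."""
    target = depth(german_parse)
    return sorted(english_candidates, key=lambda t: abs(depth(t) - target))


german = ("S", ("NP", "ich"), ("VP", ("V", "sah"), ("NP", "ihn")))
flat = ("S", ("NP", "I"), ("V", "saw"), ("NP", "him"))
nested = ("S", ("NP", "I"), ("VP", ("V", "saw"), ("NP", "him")))
best = rerank([flat, nested], german)[0]  # the nested parse ranks first
```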
In another recent piece of work, Cohen and Smith [8] show that their new set of priors for probabilistic grammars can be used to jointly learn grammars for English and Chinese, with improved performance over the monolingual settings.
In addition to resolving ambiguity for monolingual texts, models trained on information from more than one language can be particularly advantageous when the final task or application involves both languages. For example, when word boundaries are not explicitly marked, as in Chinese, the standard technique is to use a segmenter trained on annotated monolingual corpora. However, the segmentations obtained may not always be optimal, such as when inducing alignments for the purpose of training Machine Translation (MT) systems. Ma and Way [3] present an approach to word segmentation for MT where decisions about boundaries are made by considering properties of both languages involved. Following this intuition, the system learns segmentation boundaries for Chinese by examining the frequencies of alignments of English words to blocks of Chinese characters. Since the model is designed with the end task in mind, an extrinsic evaluation approach is adopted and the value of the segmentation is tested in an MT task. When the bilingual segmentation model is used, consistently good performance is obtained on translation for three different data sets.
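The counting step behind this idea might look like the following Python sketch. It assumes each sentence is already given as a list of (Chinese character, aligned English word) pairs and simply counts blocks of consecutive characters that share an English word; the input format and the function name `count_candidate_words` are assumptions made for illustration, and the scoring and filtering used by Ma and Way [3] are not reproduced here.

```python
from collections import Counter
from itertools import groupby


def count_candidate_words(aligned_sentences):
    """Count blocks of consecutive characters aligned to the same English word.

    Each sentence is a list of (character, english_word) pairs.
    """
    counts = Counter()
    for sentence in aligned_sentences:
        # groupby merges only adjacent pairs with the same English word.
        for _, group in groupby(sentence, key=lambda pair: pair[1]):
            block = "".join(char for char, _ in group)
            counts[block] += 1
    return counts


sentences = [
    [("中", "China"), ("国", "China"), ("人", "people")],
    [("中", "China"), ("国", "China")],
]
counts = count_candidate_words(sentences)
```

Blocks that recur often, such as the two-character block aligned to "China" here, would be natural candidates for words in a bilingually motivated segmentation.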
Ma and Way’s work [3] outlined above shows that for tasks like MT that involve more than one language, preprocessing of source and target languages can be better informed using the properties of both languages. Two other studies, also in MT, use multilingual information from an entirely different perspective. These show that a third language, other than source and target, can be used to improve MT performance through an intermediate step involving information from this language.
An example of this approach is the use of a "pivot" language to aid translation from a source to a target language when parallel resources are scarce for the language pair. For instance, the translation process could be cascaded and performed in two stages: source to pivot language translation, followed by translation from pivot to target language. English has commonly been used as the pivot in prior work because parallel texts with English are often available and plentiful. Paul et al. [5], however, seek to answer the question of whether English is always the best choice. Their experiments involve pair-wise translation with 12 languages. The results confirm their intuitions. For a majority of the pairs, the best performance was obtained using a pivot that was not English. Interestingly, the optimal pivot for a language pair is also found to be different for translations in the two directions. Based on their results, Paul et al. [5] suggest that in addition to the amount of training data available with the pivot language, translation quality from source to pivot and from pivot to target is another important factor to consider when choosing a pivot. Relatedness of the pivot language to source and target is a further factor because it indirectly affects the quality of the associated translations.
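The two-stage cascade can be illustrated with a deliberately simple Python sketch in which each "translation system" is just a word-for-word lexicon. Real pivot translation chains full statistical MT systems; the lexicons and the function names `translate` and `pivot_translate` are hypothetical.

```python
def translate(sentence, lexicon):
    """Word-for-word lookup; unknown words are passed through unchanged."""
    return [lexicon.get(word, word) for word in sentence]


def pivot_translate(sentence, source_to_pivot, pivot_to_target):
    """Cascade: translate source to pivot, then pivot to target."""
    pivot_sentence = translate(sentence, source_to_pivot)
    return translate(pivot_sentence, pivot_to_target)


# Hypothetical German -> English -> French lexicons.
german_to_english = {"die": "the", "katze": "cat"}
english_to_french = {"the": "le", "cat": "chat"}
result = pivot_translate(["die", "katze"], german_to_english, english_to_french)
```

Because errors in the first stage are carried into the second, the quality of both source-to-pivot and pivot-to-target translation matters, which is consistent with the factors Paul et al. [5] highlight.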
Chen et al. [4] provide additional work in this direction. Their focus is on phrase-based MT, where a table of mappings is maintained for phrases from the source and target languages. In this framework, a large amount of training data is needed for good performance, but with it come problems of model size and increased time for searching the table. In addition, tables can contain noise from wrong alignments that could adversely affect translation quality. This work presents a novel approach to solving these problems using information from a third “bridging” language. A mapping from source to target is kept in the phrase table only if there is also a corresponding mapping from source to target through a common bridge phrase. A direct effect of this constraint is a reduction in the size of the phrase tables. The reduction also eliminates noisy entries, and the table can be loaded and translations performed considerably faster than when using the full table. The filtering works best with a bridging language closely related to source and target. In that case, matching phrases between source, target and bridging languages are easily found and can be used to refine the source-target mappings. With very unrelated languages, on the other hand, heavy filtering can drastically reduce the information in the table.
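The filtering constraint can be sketched in Python as follows. The sketch assumes phrase tables are dictionaries mapping a phrase to a set of phrases and ignores translation probabilities entirely; the data layout and the function name `filter_phrase_table` are illustrative assumptions, not the implementation of Chen et al. [4].

```python
def filter_phrase_table(source_target, source_bridge, target_bridge):
    """Keep (source, target) pairs that share at least one bridge phrase."""
    kept = {}
    for source, targets in source_target.items():
        bridges_for_source = source_bridge.get(source, set())
        supported = {
            target
            for target in targets
            if bridges_for_source & target_bridge.get(target, set())
        }
        if supported:
            kept[source] = supported
    return kept


# Hypothetical French -> English table with German as the bridging language.
source_target = {"maison": {"house", "horse"}}
source_bridge = {"maison": {"Haus"}}
target_bridge = {"house": {"Haus"}, "horse": {"Pferd"}}
filtered = filter_phrase_table(source_target, source_bridge, target_bridge)
```

Here the noisy entry pairing "maison" with "horse" is dropped because no German phrase links the two, while the entry for "house" survives.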
While related languages can be used to improve performance by joint learning or in a multilingual model, Harbusch et al. [6] show that sometimes similarities are high enough to enable complete system transfer from one language to another. Central to this work is the development of a system that can generate ellipsis in coordination-based constructions. The underlying framework is a set of rules that warrant elision in appropriate situations. An example of such elision in English would be the reduced construction "I wanted a cat and my sister did too" in place of "I wanted a cat and my sister wanted a cat too". In this work, the authors demonstrate that the rules for generating elliptical structures in Estonian are extremely similar to those in German, although the two languages come from different families. As a result, an ellipsis generation system written for German could be completely transferred and used for Estonian sentences with no modification.
A number of other recent studies also note improvements from multilingual models. In multisource MT, the model is built to take advantage of existing translations of a text in different languages to improve performance for translation into a new language. Schroeder et al. [9] present a novel method for multisource translation and observe steady improvements as the number of languages in the input set increases. Bouchard-Côté et al. [7] show that reconstruction of ancient languages is also done better using a larger set of languages that constitute their modern counterparts.
References:
[1] Alexander Fraser, Renjing Wang, and Hinrich Schütze. 2009. Rich bitext projection features for parse reranking. In Proceedings of EACL 2009, pages 282-290.
[2] Benjamin Snyder, Tahira Naseem, Jacob Eisenstein, and Regina Barzilay. 2009. Adding more languages improves unsupervised multilingual part-of-speech tagging: a Bayesian non-parametric approach. In Proceedings of HLT-NAACL 2009, pages 83-91.
[3] Yanjun Ma and Andy Way. 2009. Bilingually motivated domain-adapted word segmentation for statistical machine translation. In Proceedings of EACL 2009, pages 549-557.
[4] Yu Chen, Martin Kay, and Andreas Eisele. 2009. Intersecting multilingual data for faster and better statistical translations. In Proceedings of HLT-NAACL 2009, pages 128-136.
[5] Michael Paul, Hirofumi Yamamoto, Eiichiro Sumita, and Satoshi Nakamura. 2009. On the importance of pivot language selection for statistical machine translation. In Proceedings of NAACL-HLT 2009, Short Papers, pages 221-224.
[6] Karin Harbusch, Mare Koit, and Haldur Õim. 2009. A comparison of clausal coordinate ellipsis in Estonian and German: Remarkably similar elision rules allow a language-independent ellipsis-generation module. In Proceedings of the Demonstrations Session at EACL 2009, pages 25-28.
[7] Alexandre Bouchard-Côté, Thomas L. Griffiths, and Dan Klein. 2009. Improved reconstruction of protolanguage word forms. In Proceedings of HLT-NAACL 2009, pages 65-73.
[8] Shay Cohen and Noah A. Smith. 2009. Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction. In Proceedings of HLT-NAACL 2009, pages 74-82.
[9] Josh Schroeder, Trevor Cohn, and Philipp Koehn. 2009. Word lattices for multi-source translation. In Proceedings of EACL 2009, pages 719-727.
[10] Marianna Apidianaki. 2009. Data-Driven Semantic Analysis for Multilingual WSD and Lexical Selection in Translation. In Proceedings of EACL 2009, pages 77-85.
If you have comments, corrections, or additions to this article, please contact the author: Annie Louis, lannie [at] seas [dot] upenn [dot] edu.

