Introduction
What is machine translation?
Machine Translation [MT] refers to computerised systems that translate natural languages without any human assistance. The goal of sentence-level MT is to find the most probable target sentence $\hat{y}$ given a source sentence $x$ such that the target conveys the same meaning as the source sentence. Mathematically, this can be expressed as:

$$\hat{y} = \operatorname*{arg\,max}_{y} \; p(y \mid x)$$

Modelling the conditional probability $p(y \mid x; \theta)$ with learnable parameters $\theta$ is done using various MT models and techniques ranging from rule-based and statistical models to neural machine translation [NMT] models. Most existing NMT models are auto-regressive, i.e. they define a probability distribution over target sentences by factorising it into individual conditionals as

$$p(y \mid x; \theta) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x; \theta),$$

where $y_t$ is the current target word and $y_{<t}$ are the previously generated words. Once $\theta$ is learned by a translation model, a source sentence is translated by searching for the sentence that maximises the conditional probability [1, 2].
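To make the factorisation concrete, the following minimal Python sketch (not from any cited system) scores a candidate translation by summing per-token log-probabilities; `cond_prob` is a hypothetical stand-in for whatever model supplies $p(y_t \mid y_{<t}, x)$.

```python
import math

def sentence_log_prob(source, target_tokens, cond_prob):
    """Score a candidate translation under the auto-regressive factorisation:
    log p(y | x) = sum_t log p(y_t | y_<t, x)."""
    log_p = 0.0
    for t, token in enumerate(target_tokens):
        prefix = target_tokens[:t]                        # previously generated words y_<t
        log_p += math.log(cond_prob(source, prefix, token))
    return log_p
```

Decoding then amounts to searching for the target sequence that maximises this score, a problem revisited in the Decoding section below.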
Why is machine translation a problem worth solving?
The ability to communicate effectively is essential for human interaction and development, particularly in fields such as science, medicine, and technology, where collaboration between people from different countries drives progress. However, language barriers can often hinder communication, especially in a globalised world where people from different cultures and countries interact frequently. A study by Lee, 2020 [3] exploring the impact of MT on English-as-a-Foreign-Language students' writing skills in Korea showed that the group with access to machine translation tools produced essays with significantly higher accuracy and complexity scores than the control group. Improvements in data collection and model training allowed Google to add 24 new languages to Google Translate in one go, benefitting under-represented speaker populations in Africa and South Asia [4]. In a recent paper, Khoong and Rodriguez, 2022 [5] argue that MT has the potential to improve communication between healthcare providers and non-native speakers, leading to better and more equitable healthcare outcomes. However, the field still faces many challenges, from the need for large-scale datasets for low-resource languages to adapting systems to specialised domains.
Brief Overview of Approaches
Machine translation has come a long way since its formal inception in the late 1940s and its first public demonstration by the Georgetown-IBM research group in 1954. An overview of the ancient arts of rule-based and statistical MT systems can be found in Hutchins, 1997 [6]. This section focuses on neural-network-based machine translation systems, and specifically on attention-based approaches, which are expanded upon in a later section. Neural models have become the de-facto standard and are approaching human-level performance in some settings [7]. NMT systems are also widely adopted in industry and have been deployed in many large production systems [8, 4].
The Encoder-Decoder Framework
The encoder-decoder structure, first proposed by Neco and Forcada, 1997 [10], is the current de-facto standard for NMT models. These systems are characterised by an encoder network, which computes a latent representation of the source sentence, followed by a decoder network, which generates the translated sentence from that representation. Different encoder-decoder architectures model the individual conditionals $p(y_t \mid y_{<t}, x)$ from the factorisation above differently. Recurrent neural networks [RNNs] were introduced first, modelling each conditional as a function of the previously generated words, a recurrent hidden state, and a fixed-length representation of the input.
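As an illustration, here is a minimal PyTorch sketch of such an encoder-decoder, assuming GRU cells and toy dimensions; it is not any specific published architecture, but it shows how the source is compressed into a single fixed-length vector that conditions every decoding step.

```python
import torch
import torch.nn as nn

class RNNEncoderDecoder(nn.Module):
    """Toy GRU encoder-decoder: the encoder's final hidden state is the
    fixed-length representation of the source, which initialises the decoder."""

    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, context = self.encoder(self.src_emb(src_ids))        # fixed-length context vector
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(dec_states)                              # logits for p(y_t | y_<t, x)

model = RNNEncoderDecoder(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
print(logits.shape)   # (batch=2, target length=5, target vocabulary=1000)
```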
Before Transformers
Kalchbrenner and Blunsom, 2013 [11] were among the first to present a standalone NMT system without components from statistical MT [SMT]. They used a convolutional neural network [CNN] based encoder to model sentence pairs, capturing syntactic and lexical features of the input sentences. Following this line of research, Sutskever et al., 2014 [1] and Cho et al., 2014b [13] explored the use of stacked LSTMs and GRUs in the encoder, respectively, to generate a fixed-length encoding of the source sequence. However, fixed-length source encodings have been shown to lead to poor translations for long input sentences, as reported by Cho et al., 2014a [14]. To address the performance bottleneck of fixed encodings, Bahdanau et al., 2015 [15] proposed the attention mechanism. This approach allows the model to attend to specific parts of the input sequence while generating the output, removing the need for a fixed input representation.
The Transformer era
Sequential models provided a significant increase in performance compared to traditional SMT techniques. However, their use in large-scale machine translation was, and continues to be, limited by the inherently sequential nature of recurrence, which prevents parallelisation within training examples and becomes a bottleneck for longer sentences. Vaswani et al., 2017 [16] proposed the Transformer architecture to replace traditional recurrent and convolutional neural network layers. The authors presented an improvement over the vanilla attention mechanism [15] with 'self-attention', which allows the Transformer to learn global dependencies between the words in the sequence, enabling the generation of more informative and context-sensitive word embeddings. These embeddings, called 'contextualised embeddings' because they are generated by considering the entire input sequence, have been shown to significantly outperform traditional fixed source encodings and to improve performance on various natural language processing tasks, including machine translation. The paper also described another novel mechanism called 'multi-head attention', which stacks multiple self-attention 'heads' in parallel, enabling the model to attend to different positions in the input sequence simultaneously, improving the quality of the learned representations while also making the model parallelisable.
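The sketch below shows the core of scaled dot-product attention in plain NumPy, a simplified single-head version of the mechanism described in [16]; in practice `Q`, `K` and `V` are learned linear projections of the same token embeddings, and multi-head attention runs several such projections in parallel before concatenating the results.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each position attends to every position: the output for a token is a
    context-weighted mixture of the value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over positions
    return weights @ V                                      # contextualised representations

x = np.random.randn(5, 16)   # 5 tokens with 16-dimensional embeddings
out = scaled_dot_product_attention(x, x, x)   # self-attention: Q = K = V = x
print(out.shape)             # (5, 16)
```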
Every aspect of the vanilla Transformer has since been modified in various ways to improve its performance, from the attention mechanism [18, 19] and positional encodings [20, 21] to the activation functions of the feed-forward networks [22, 23]. Devlin et al., 2019 [20] introduced a novel language representation model, Bidirectional Encoder Representations from Transformers [BERT], which pre-trains deep bidirectional representations from unlabelled text by jointly conditioning on both left and right contexts in all layers, allowing it to capture a deeper understanding of language. The authors also proposed a novel pre-training objective called 'Masked Language Modeling', which involves randomly masking some input tokens and training the model to predict the masked tokens. BERT achieved new state-of-the-art results on 11 NLP tasks and has become the basis for many subsequent advances in the field [25].
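As a rough illustration of the Masked Language Modeling objective, the sketch below randomly hides input tokens and records the originals as prediction targets. It is deliberately simplified: the actual BERT recipe replaces a chosen token with [MASK] only 80% of the time, substitutes a random token 10% of the time, and leaves it unchanged otherwise.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15):
    """Simplified MLM data preparation: hide a fraction of tokens; the model
    is trained to recover them from the surrounding left and right context."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)        # target: the original token
        else:
            inputs.append(tok)
            labels.append(None)       # no prediction loss at this position
    return inputs, labels

print(mask_tokens("the cat sat on the mat".split()))
```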
Key Challenges and Current Work
Datasets
One of the main challenges in machine translation is the availability of large, high-quality datasets for training and evaluating models. Over the years, several datasets have been developed specifically for machine translation research. The Workshop on Machine Translation [WMT] has been running an annual evaluation campaign since 2006 [26, 27, 28, 29], which includes a shared task for machine translation. The datasets used in this task are typically parallel corpora of news articles covering a range of languages, including English, German, French, and Chinese. The International Workshop on Spoken Language Translation [IWSLT] is a yearly workshop focusing on spoken language translation [30, 31, 32, 33]. The datasets used in this workshop include audio recordings of speeches, as well as transcripts and translations in various languages. In recent years, datasets have grown more extensive and diverse, enabling more complex translation tasks and models. XTREME [34] is a benchmark for evaluating the cross-lingual generalisation capabilities of pre-trained multilingual models, covering 40 typologically diverse languages and 9 tasks. Flores-101 [35] is a benchmark dataset for low-resource machine translation, which consists of parallel sentences in 101 languages, making it one of the most extensive multilingual machine translation datasets available.
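For readers who want to experiment, parallel corpora like these are commonly loaded through the Hugging Face `datasets` library; the sketch below assumes the WMT 2016 news data is published on the Hub under the `wmt16` identifier with a `de-en` configuration, which may vary by library version.

```python
from datasets import load_dataset

# Load a small slice of the German-English WMT'16 training data.
wmt = load_dataset("wmt16", "de-en", split="train[:1000]")

# Each example is a dict with a parallel sentence pair.
print(wmt[0]["translation"])   # {'de': '...', 'en': '...'}
```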
Evaluation
Numerous evaluation metrics have been proposed to assess the quality of generated translations. The most popular of them, BLEU, short for Bilingual Evaluation Understudy, has been the de-facto standard for evaluating translation outputs since it was first proposed by Papineni et al., 2002 [36]. The core idea of BLEU is to aggregate the counts of words and phrases that overlap between the machine output and a reference translation. BLEU scores range from 0 to 1, where 1 means a perfect translation, although scores are often reported scaled to 0-100. However, using BLEU directly is suboptimal because it relies on n-gram overlap, which is heavily dependent on the specific tokenisation used. Tokenising aggressively can artificially raise the score and make comparing results across different studies difficult. SacreBLEU [37] addresses this challenge by providing hassle-free computation of shareable, comparable and reproducible BLEU scores. Human evaluation, however, is still considered the gold standard in this field as it takes into account the nuances of language that can be difficult for machines to capture. Human evaluators can assess not only the translation's accuracy but also the output's fluency and naturalness. In addition, human evaluation can provide valuable insights into the text's cultural context, which can be crucial for producing high-quality translations. MT evaluation is an active research area and was also a WMT shared task in 2022 [26], where participants had to predict the quality of generated translations without access to references.
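In practice, a reproducible corpus-level score can be computed with the `sacrebleu` package, which applies its own canonical tokenisation; a small example:

```python
import sacrebleu

hypotheses = ["the cat sat on the mat", "it is raining today"]
references = [["the cat is sitting on the mat", "it rains today"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # corpus BLEU on the usual 0-100 scale
print(bleu)         # full summary string, suitable for reporting
```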
Low-resource languages
The vast majority of improvements made in machine translation in the last decades have been for high-resource languages, i.e. languages with large quantities of digitally available training data [39]. High-resource languages like English, French and Japanese rarely have dataset size concerns. For instance, the English-French corpus used by Cho et al., 2014a [14] as early as 2014 contained 348 million words of parallel text. Low-resource languages, despite being widely spoken around the world, have not received comparable attention from the NLP community, for a multitude of reasons: lack of state investment, no codified research norms, lax organisational priorities, Western-centrism, and logistical challenges in procuring training data, to name a few [40]. While NMT systems have demonstrated remarkable performance in high-resource data scenarios, research has indicated that these models exhibit low data efficiency and perform worse than unsupervised methods or phrase-based statistical machine translation in low-resource conditions [41]. However, recent research has demonstrated that NMT is suitable in low-data settings but is very sensitive to hyperparameters such as vocabulary size and word dropout [42]. A recent initiative towards rectifying the lack of resources for low-resource languages is the FLORES-101 benchmark by Goyal et al., 2022 [35], which consists of the same set of English sentences translated into 100 other languages. It has the limitation, however, that for non-English pairs the two sides are "translationese" and not mutual translations of each other.
Domain adaptation
NMT systems struggle in scenarios where words have different translations and meaning is expressed in different styles across domains (here, a domain is defined by a corpus from a specific source and may differ from other domains in topic, genre, or style). For example, a model trained exclusively on law reports is unlikely to perform well in clinical medicine [44]. It has been shown that NMT systems drop in performance when training and test domains do not match and when in-domain training data is scarce [41]. This is of particular concern when machine translation is used to summarise information: users are likely to be misled by hallucinated content in the generated translation. A naive solution is to tailor an NMT model to each specific domain. Besides being highly impractical, this approach is limited by the fact that high-quality parallel data exists for only some domains, and large amounts of training data are often available only out of domain. Luong et al., 2015 [46] demonstrated that a pre-trained system can be adapted to a new domain more quickly than training a new model from scratch, and often performs better on the new domain.
Decoding
The task of finding the most likely translation for a given source sentence is known as the decoding problem. Decoding in MT is challenging because the search space grows exponentially with sequence length, making a complete enumeration of the search space impossible [1]. The most widely adopted training method for sequence-to-sequence models is maximum likelihood estimation [MLE], where decoding is done by predicting the output to which the model assigns maximum likelihood. However, as the models predict tokens one by one, exact search is not feasible in the general case, and the community has resorted to using heuristics instead. The most popular of these heuristics is beam search, which has been shown over the years to have severe flaws. Stahlberg and Byrne, 2019 [48] showed that the model assigns the highest score to the empty sentence in more than 50% of cases and that search errors are more frequent than model errors, in addition to being more difficult to diagnose and fix. Welleck et al., 2020 [49] found that a sequence which receives zero probability under a recurrent language model's distribution can receive non-zero probability under the distribution induced by the decoding algorithm. Stahlberg and Byrne, 2019 [48] provide a possible explanation for the MT community's continuing use of beam search despite its flaws: search errors in beam search decoding, paradoxically, prevent the decoder from choosing the empty hypothesis, which often gets the globally best model score as a side-effect of using maximum likelihood estimation.
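For reference, the sketch below shows the beam-search heuristic in simplified form; `step_fn` is a hypothetical stand-in for the model's next-token distribution, and real implementations add details such as length normalisation that are omitted here.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=4, max_len=20):
    """Minimal beam search: keep the `beam_size` highest-scoring partial
    hypotheses at each step. `step_fn(prefix)` returns (token, probability)
    pairs for the next position given the partial hypothesis `prefix`."""
    beams = [([bos], 0.0)]                                  # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:
                candidates.append((tokens, score))          # keep finished hypotheses
                continue
            for tok, p in step_fn(tokens):
                candidates.append((tokens + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(tokens[-1] == eos for tokens, _ in beams):
            break
    return max(beams, key=lambda c: c[1])
```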
Robustness and adversarial attacks
Like most other deep learning models, NMT models have been found to be sensitive to synthetic and natural noise [51], distributional shift and adversarial examples [52]. Real-world MT systems need to deal with increasingly non-standard and noisy text found on the internet but absent from many standard benchmark datasets. Machine translation robustness featured as a shared task in the WMT 2020 challenge [27], where MT systems were evaluated in zero-shot and few-shot scenarios. All accepted submissions trained their systems using big Transformer models, boosted performance with tagged back-translation, continued training with filtered and in-domain data, and used ensembles of different models to improve performance.
The increasing body of work on adversarial examples has shown the potential hazards of employing brittle machine learning systems so widely in practical applications [54, 55, 56]. Anastasopoulos et al., 2019 [57] focus on the grammatical errors made by non-native speakers and show that augmenting training data with sentences containing artificially introduced grammatical errors can make the system more robust to such errors. Belinkov and Bisk, 2018 [51] show that character-based NMT models break down when presented with both natural and synthetic noise. They also demonstrate that synthetic noise does not capture a lot of the variation present in natural noise resulting in models that perform poorly while translating natural noise. Heigold et al., 2018 [52] evaluate the robustness of NMT systems against perturbed word forms that do not pose a challenge to humans and corroborate the finding that training on noisy data can help models achieve improved performance on noisy data.
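The kind of synthetic character noise studied in this line of work can be generated very simply; the sketch below implements only adjacent-character swaps that keep the first and last letters intact, one of several perturbation types used in these papers.

```python
import random

def swap_noise(word, prob=0.5):
    """Swap one interior adjacent character pair, a simple synthetic-noise
    model akin to the perturbations used to probe NMT robustness."""
    if len(word) > 3 and random.random() < prob:
        i = random.randint(1, len(word) - 3)
        chars = list(word)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)
    return word

def noisy_sentence(sentence, prob=0.5):
    return " ".join(swap_noise(w, prob) for w in sentence.split())

print(noisy_sentence("neural machine translation is sensitive to noise"))
```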
Bias
Natural language training data inevitably reflects the biases and stereotypes present in our society. Systems trained on this biased data often reflect or even amplify these biases and their harmful stereotypes. Prates et al., 2020 [60] showed that translating sentences from gender-neutral languages to English using Google Translate exhibited gender biases and a strong tendency toward male defaults. Google Translate now offers feminine and masculine forms for some translated sentences, partially addressing the shortcomings mentioned in the paper. Saunders and Byrne, 2020 [61] proposed treating gender debiasing as a domain adaptation problem, making use of the extensive literature on domain adaptation for NMT systems. They demonstrate improved debiasing without degradation in overall translation quality by transfer learning on a small set of trusted, gender-balanced examples.
Possible Areas of Future Work
Large Language Models
Transformers have changed the zeitgeist of MT research from fully-supervised learning to 'pre-train and fine-tune', and now to 'pre-train and prompt'. Large language models [LLMs] can now be prompted to perform very high-quality machine translation, even though they were not explicitly trained for this task. Ghazvininejad et al., 2023 [62] propose using a dictionary to identify rare words or phrases in the source language and then generating prompts that provide additional context for these words or phrases, which are used to guide the LLM towards more accurate translations. The authors demonstrate the effectiveness of this approach by evaluating it on several language pairs and showing significant improvements in machine translation performance.
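A dictionary-hinted prompt of this flavour can be assembled with ordinary string formatting; the template below is purely illustrative and is not the exact prompt format used by Ghazvininejad et al.

```python
def build_prompt(source, hints, src_lang="German", tgt_lang="English"):
    """Assemble a translation prompt with dictionary hints for rare words.
    The wording of this template is illustrative, not taken from the paper."""
    hint_lines = "\n".join(f'- "{w}" can be translated as "{t}"' for w, t in hints.items())
    return (
        f"Translate the following {src_lang} sentence into {tgt_lang}.\n"
        f"Useful dictionary entries:\n{hint_lines}\n"
        f"{src_lang}: {source}\n{tgt_lang}:"
    )

prompt = build_prompt(
    "Der Graureiher stand reglos am Ufer.",
    {"Graureiher": "grey heron", "reglos": "motionless"},
)
print(prompt)
```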
Despite its great potential, prompt-based learning faces several challenges. Zhang et al., 2023 [63] demonstrate that prompting sometimes results in rejection of the input, where the LLM responds in the wrong target language, under-translates the input, mistranslates entities such as dates, or simply copies source phrases. In addition to the general limitations of LLMs, such as hallucination, the authors also observed a phenomenon specific to prompting, which they call the 'prompt trap': translations become heavily influenced by the prompt or the prefix of the source template, leading to suboptimal or incorrect translations. Empirical evidence suggests that the performance of an LLM depends on both the templates being used and the answers being considered. However, finding the best combination of template and answer simultaneously through search or learning remains a challenging research question [64].
Multilingual
Achieving human-level universal translation between all possible natural language pairs is the holy grail of machine translation research. Multilingual NMT [MNMT] systems are highly desirable as they can be trained with data from various language pairs, which can aid resource-poor languages in acquiring extra knowledge from other languages [65]. Furthermore, MNMT systems tend to exhibit better generalisation capabilities due to their exposure to diverse languages resulting in improved translation quality compared with bilingual NMT systems in a phenomenon referred to as 'translation knowledge transfer' [66]. Fan et al., 2021 [39] proposed M2M-100, a Many-to-Many multilingual translation model capable of translating between the 9,900 directions of 100 languages. The authors employed both dense and sparse scaling techniques by introducing language-specific parameters trained with a novel random re-routing scheme. Their model outperforms an English-centric baseline by more than 10 BLEU points on average when translating directly between non-English directions.
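M2M-100 checkpoints are publicly available, and a translation can be run in a few lines; the sketch below assumes the `facebook/m2m100_418M` checkpoint and the Hugging Face `transformers` M2M-100 classes, whose details may vary across library versions.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"   # smallest released M2M-100 checkpoint
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

tokenizer.src_lang = "de"                                   # German source
encoded = tokenizer("Das Leben ist wie eine Schachtel Pralinen.", return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("en"),        # force English output
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```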
Current MNMT approaches experience difficulties incorporating over 100 language pairs without sacrificing translation quality; incremental learning and knowledge distillation show promise in addressing this issue. Translating multilingualism within a sentence, such as code-mixed input and output, creoles, and pidgins, is an exciting research direction: compact MNMT models can already handle code-mixed input, but code-mixed output remains an open problem [68].
Document-level
Despite its success, machine translation has been based mainly on strong independence and locality assumptions: sentences are translated in isolation, independently of their document-level inter-dependencies. However, text is made up of collocated and structured groups of sentences bound together by complex linguistic elements, referred to as 'discourse' [69]. Moreover, ambiguous words in a sentence can often only be disambiguated by their surrounding context. A recent paper by Liu et al., 2020 [70] illustrates this research direction. The authors corrupt input documents by masking phrases and permuting sentences, producing input sequences of up to 512 tokens, and then train a single Transformer model to recover the original monolingual document segments. By using document fragments, the model is able to learn long-range dependencies between sentences and outperform sentence-level NMT models. However, it was also observed that without pre-training, document-level NMT models perform much worse than their sentence-level counterparts, suggesting that pre-training is a crucial step and a promising strategy for improving document-level NMT performance.
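A simplified version of this document-level corruption (span masking plus sentence permutation) can be sketched as follows; the span lengths and masking rate here are arbitrary choices, not the ones used by Liu et al.

```python
import random

def corrupt_document(sentences, mask_token="<mask>", mask_prob=0.3):
    """Document-level denoising corruption (simplified): mask random word
    spans and shuffle sentence order; the model is then trained to
    reconstruct the original document."""
    noisy = []
    for sent in sentences:
        words = sent.split()
        if words and random.random() < mask_prob:
            start = random.randrange(len(words))
            length = random.randint(1, min(3, len(words) - start))
            words[start:start + length] = [mask_token]      # replace a span with one mask
        noisy.append(" ".join(words))
    random.shuffle(noisy)                                    # permute sentence order
    return noisy

doc = ["The heron stood by the river.", "It waited for fish.", "Then it flew away."]
print(corrupt_document(doc))
```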
Despite promising results, document-level NMT systems face multiple challenges [2]. Existing metrics like BLEU and METEOR do not account for specific discourse phenomena in the translation, which can lead to failures in evaluating the quality of longer pieces of generated text. Most methods use only a small context beyond a single sentence, typically a few neighbouring sentences, and do not incorporate context from the whole document. Additionally, more research is required to determine whether global context is truly beneficial for improving translation performance.
Conclusion
The field of machine translation is rapidly evolving, with many exciting developments in areas such as large language models, multilingual translation, and document-level translation. While many challenges remain to be addressed, including robustness, bias and lack of data for under-represented languages, the potential for machine translation to bridge language barriers and facilitate communication between people worldwide is immense. Continued research and innovation in the field will be crucial to unlocking this potential and creating more effective and accurate machine translation systems.
To have another language is to possess a second soul.
-- Charlemagne
References
- Sequence to sequence learning with neural networks
Sutskever, I., Vinyals, O. and Le, Q.V., 2014. Advances in neural information processing systems, Vol 27. - A survey on document-level neural machine translation: Methods and evaluation
Maruf, S., Saleh, F. and Haffari, G., 2021. ACM Computing Surveys (CSUR), Vol 54(2), pp. 1--36. ACM New York, NY, USA. - The impact of using machine translation on EFL students’ writing
Lee, S., 2020. Computer assisted language learning, Vol 33(3), pp. 157--175. Taylor \& Francis. - Building Machine Translation Systems for the Next Thousand Languages
Bapna, A., Caswell, I., Kreutzer, J., Firat, O., Esch, D.v., Siddhant, A., Niu, M., Baljekar, P.N., Garcia, X., Macherey, W., Breiner, T., Axelrod, V.S., Riesa, J., Cao, Y., Chen, M., Macherey, K., Krikun, M., Wang, P., Gutkin, A., Shah, A., Huang, Y., Chen, Z., Wu, Y. and Hughes, M.R., 2022. - A research agenda for using machine translation in clinical medicine
Khoong, E.C. and Rodriguez, J.A., 2022. Journal of General Internal Medicine, Vol 37(5), pp. 1275--1277. Springer. - From first conception to first demonstration: the nascent years of machine translation, 1947--1954. a chronology
Hutchins, J., 1997. Machine Translation, Vol 12, pp. 195--252. Springer. - Achieving human parity on automatic chinese to english news translation [link]
Hassan, H., Aue, A., Chen, C., Chowdhary, V., Clark, J., Federmann, C., Huang, X., Junczys-Dowmunt, M., Lewis, W., Li, M. and others, ., 2018. - How to move to neural machine translation for enterprise-scale programs—an early adoption case study [link]
Schmidt, T. and Marg, L., 2018. Proceedings of the 21st Annual Conference of the European Association for Machine Translation, pp. 309--313. European Association for Machine Translation. - Building Machine Translation Systems for the Next Thousand Languages
Bapna, A., Caswell, I., Kreutzer, J., Firat, O., Esch, D.v., Siddhant, A., Niu, M., Baljekar, P.N., Garcia, X., Macherey, W., Breiner, T., Axelrod, V.S., Riesa, J., Cao, Y., Chen, M., Macherey, K., Krikun, M., Wang, P., Gutkin, A., Shah, A., Huang, Y., Chen, Z., Wu, Y. and Hughes, M.R., 2022. - Asynchronous translations with recurrent neural nets
Neco, R.P. and Forcada, M.L., 1997. Proceedings of International Conference on Neural Networks (ICNN'97), Vol 4, pp. 2535--2540. - Recurrent Convolutional Neural Networks for Discourse Compositionality [link]
Kalchbrenner, N. and Blunsom, P., 2013. Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pp. 119--126. Association for Computational Linguistics. - Sequence to sequence learning with neural networks
Sutskever, I., Vinyals, O. and Le, Q.V., 2014. Advances in neural information processing systems, Vol 27. - Learning Phrase Representations using {RNN} Encoder{--}Decoder for Statistical Machine Translation [link]
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. and Bengio, Y., 2014. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ({EMNLP}), pp. 1724--1734. Association for Computational Linguistics. DOI: 10.3115/v1/D14-1179 - On the Properties of Neural Machine Translation: Encoder{--}Decoder Approaches [link]
Cho, K., van Merrienboer, B., Bahdanau, D. and Bengio, Y., 2014. Proceedings of {SSST}-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103--111. Association for Computational Linguistics. DOI: 10.3115/v1/W14-4012 - Neural Machine Translation by Jointly Learning to Align and Translate [PDF]
Bahdanau, D., Cho, K. and Bengio, Y., 2015. 3rd International Conference on Learning Representations, {ICLR} 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. - Attention is all you need
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. and Polosukhin, I., 2017. Advances in neural information processing systems, Vol 30. - Neural Machine Translation by Jointly Learning to Align and Translate [PDF]
Bahdanau, D., Cho, K. and Bengio, Y., 2015. 3rd International Conference on Learning Representations, {ICLR} 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. - Low-Rank and Locality Constrained Self-Attention for Sequence Modeling
Guo, Q., Qiu, X., Xue, X. and Zhang, Z., 2019. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol 27(12), pp. 2213-2222. DOI: 10.1109/TASLP.2019.2944078 - Multi-scale self-attention for text classification
Guo, Q., Qiu, X., Liu, P., Xue, X. and Zhang, Z., 2020. Proceedings of the AAAI Conference on Artificial Intelligence, Vol 34, pp. 7847--7854. - {BERT}: Pre-training of Deep Bidirectional Transformers for Language Understanding [link]
Devlin, J., Chang, M., Lee, K. and Toutanova, K., 2019. Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171--4186. Association for Computational Linguistics. DOI: 10.18653/v1/N19-1423 - Transformer-{XL}: Attentive Language Models beyond a Fixed-Length Context [link]
Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. and Salakhutdinov, R., 2019. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978--2988. Association for Computational Linguistics. DOI: 10.18653/v1/P19-1285 - Searching for Activation Functions [link]
Ramachandran, P., Zoph, B. and Le, Q.V., 2018. 6th International Conference on Learning Representations, {ICLR} 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings. OpenReview.net. - Generative pretraining from pixels
Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D. and Sutskever, I., 2020. International conference on machine learning, pp. 1691--1703. - {BERT}: Pre-training of Deep Bidirectional Transformers for Language Understanding [link]
Devlin, J., Chang, M., Lee, K. and Toutanova, K., 2019. Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171--4186. Association for Computational Linguistics. DOI: 10.18653/v1/N19-1423 - {ALBERT:} {A} Lite {BERT} for Self-supervised Learning of Language Representations [link]
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P. and Soricut, R., 2020. 8th International Conference on Learning Representations, {ICLR} 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. - Findings of the {WMT} 2022 Shared Task on Quality Estimation [link]
Zerva, C., Blain, F., Rei, R., Lertvittayakumjorn, P., C. de Souza, J.G., Eger, S., Kanojia, D., Alves, D., Or{\u{a}}san, C., Fomicheva, M., Martins, A.F.T. and Specia, L., 2022. Proceedings of the Seventh Conference on Machine Translation (WMT), pp. 69--99. Association for Computational Linguistics. - Findings of the WMT 2020 shared task on machine translation robustness
Specia, L., Li, Z., Pino, J., Chaudhary, V., Guzman, F., Neubig, G., Durrani, N., Belinkov, Y., Koehn, P., Sajjad, H. and others, ., 2020. Proceedings of the Fifth Conference on Machine Translation, pp. 76--91. - Findings of the wmt 2018 shared task on parallel corpus filtering
Koehn, P., Khayrallah, H., Heafield, K. and Forcada, M.L., 2018. Proceedings of the third conference on machine translation: shared task papers, pp. 726--739. - Findings of the wmt 2016 bilingual document alignment shared task
Buck, C. and Koehn, P., 2016. Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pp. 554--563. - Findings of the IWSLT 2022 Evaluation Campaign.
Antonios, A., Loc, B., Bentivogli, L., Boito, M.Z., Ond{\v{r}}ej, B., Cattoni, R., Anna, C., Georgiana, D., Kevin, D., Maha, E. and others, ., 2022. Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pp. 98--157. - Findings of the IWSLT 2020 evaluation campaign
Ansari, E., Axelrod, A., Bach, N., Bojar, O., Cattoni, R., Dalvi, F., Durrani, N., Federico, M., Federmann, C., Gu, J. and others, ., 2020. Proceedings of the 17th International Conference on Spoken Language Translation, pp. 1--34. - Proceedings of the 15th International Conference on Spoken Language Translation [link]
, 2018. International Conference on Spoken Language Translation. - The IWSLT 2016 evaluation campaign
Cettolo, M., Niehues, J., Stuker, S., Bentivogli, L., Cattoni, R. and Federico, M., 2016. Proceedings of the 13th International Conference on Spoken Language Translation. - Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation
Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O. and Johnson, M., 2020. International Conference on Machine Learning, pp. 4411--4421. - The flores-101 evaluation benchmark for low-resource and multilingual machine translation
Goyal, N., Gao, C., Chaudhary, V., Chen, P., Wenzek, G., Ju, D., Krishnan, S., Ranzato, M., Guzman, F. and Fan, A., 2022. Transactions of the Association for Computational Linguistics, Vol 10, pp. 522--538. MIT Press. - Bleu: a method for automatic evaluation of machine translation
Papineni, K., Roukos, S., Ward, T. and Zhu, W., 2002. Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311--318. - A Call for Clarity in Reporting {BLEU} Scores [link]
Post, M., 2018. Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186--191. Association for Computational Linguistics. DOI: 10.18653/v1/W18-6319 - Findings of the {WMT} 2022 Shared Task on Quality Estimation [link]
Zerva, C., Blain, F., Rei, R., Lertvittayakumjorn, P., C. de Souza, J.G., Eger, S., Kanojia, D., Alves, D., Or{\u{a}}san, C., Fomicheva, M., Martins, A.F.T. and Specia, L., 2022. Proceedings of the Seventh Conference on Machine Translation (WMT), pp. 69--99. Association for Computational Linguistics. - Beyond english-centric multilingual machine translation
Fan, A., Bhosale, S., Schwenk, H., Ma, Z., El-Kishky, A., Goyal, S., Baines, M., Celebi, O., Wenzek, G., Chaudhary, V. and others, ., 2021. The Journal of Machine Learning Research, Vol 22(1), pp. 4839--4886. JMLRORG. - No language left behind: Scaling human-centered machine translation [link]
Costa-jussa, M.R., Cross, J., {\c{C}}elebi, O., Elbayad, M., Heafield, K., Heffernan, K., Kalbassi, E., Lam, J., Licht, D., Maillard, J. and others, ., 2022. - Six Challenges for Neural Machine Translation [link]
Koehn, P. and Knowles, R., 2017. Proceedings of the First Workshop on Neural Machine Translation, pp. 28--39. Association for Computational Linguistics. DOI: 10.18653/v1/W17-3204 - Revisiting Low-Resource Neural Machine Translation: A Case Study [link]
Sennrich, R. and Zhang, B., 2019. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 211--221. Association for Computational Linguistics. DOI: 10.18653/v1/P19-1021 - The flores-101 evaluation benchmark for low-resource and multilingual machine translation
Goyal, N., Gao, C., Chaudhary, V., Chen, P., Wenzek, G., Ju, D., Krishnan, S., Ranzato, M., Guzman, F. and Fan, A., 2022. Transactions of the Association for Computational Linguistics, Vol 10, pp. 522--538. MIT Press. - Curriculum Learning for Domain Adaptation in Neural Machine Translation [link]
Zhang, X., Shapiro, P., Kumar, G., McNamee, P., Carpuat, M. and Duh, K., 2019. Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1903--1915. Association for Computational Linguistics. DOI: 10.18653/v1/N19-1189 - Six Challenges for Neural Machine Translation [link]
Koehn, P. and Knowles, R., 2017. Proceedings of the First Workshop on Neural Machine Translation, pp. 28--39. Association for Computational Linguistics. DOI: 10.18653/v1/W17-3204 - Effective Approaches to Attention-based Neural Machine Translation [link]
Luong, T., Pham, H. and Manning, C.D., 2015. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412--1421. Association for Computational Linguistics. DOI: 10.18653/v1/D15-1166 - Sequence to sequence learning with neural networks
Sutskever, I., Vinyals, O. and Le, Q.V., 2014. Advances in neural information processing systems, Vol 27. - On {NMT} Search Errors and Model Errors: Cat Got Your Tongue? [link]
Stahlberg, F. and Byrne, B., 2019. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3356--3362. Association for Computational Linguistics. DOI: 10.18653/v1/D19-1331 - Consistency of a Recurrent Language Model With Respect to Incomplete Decoding [link]
Welleck, S., Kulikov, I., Kim, J., Pang, R.Y. and Cho, K., 2020. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5553--5568. Association for Computational Linguistics. DOI: 10.18653/v1/2020.emnlp-main.448 - On {NMT} Search Errors and Model Errors: Cat Got Your Tongue? [link]
Stahlberg, F. and Byrne, B., 2019. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3356--3362. Association for Computational Linguistics. DOI: 10.18653/v1/D19-1331 - Synthetic and Natural Noise Both Break Neural Machine Translation [link]
Belinkov, Y. and Bisk, Y., 2018. 6th International Conference on Learning Representations, {ICLR} 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. - How Robust Are Character-Based Word Embeddings in Tagging and {MT} Against Wrod Scramlbing or Randdm Nouse? [link]
Heigold, G., Varanasi, S., Neumann, G. and van Genabith, J., 2018. Proceedings of the 13th Conference of the Association for Machine Translation in the {A}mericas (Volume 1: Research Track), pp. 68--80. Association for Machine Translation in the Americas. - Findings of the WMT 2020 shared task on machine translation robustness
Specia, L., Li, Z., Pino, J., Chaudhary, V., Guzman, F., Neubig, G., Durrani, N., Belinkov, Y., Koehn, P., Sajjad, H. and others, ., 2020. Proceedings of the Fifth Conference on Machine Translation, pp. 76--91. - Explaining and Harnessing Adversarial Examples [PDF]
Goodfellow, I.J., Shlens, J. and Szegedy, C., 2015. 3rd International Conference on Learning Representations, {ICLR} 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. - Simple Black-Box Adversarial Attacks on Deep Neural Networks.
Narodytska, N. and Kasiviswanathan, S.P., 2017. CVPR Workshops, Vol 2, pp. 2. - Robsut wrod reocginiton via semi-character recurrent neural network
Sakaguchi, K., Duh, K., Post, M. and Van Durme, B., 2017. Proceedings of the AAAI Conference on Artificial Intelligence, Vol 31. - Neural Machine Translation of Text from Non-Native Speakers [link]
Anastasopoulos, A., Lui, A., Nguyen, T.Q. and Chiang, D., 2019. Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3070--3080. Association for Computational Linguistics. DOI: 10.18653/v1/N19-1311 - Synthetic and Natural Noise Both Break Neural Machine Translation [link]
Belinkov, Y. and Bisk, Y., 2018. 6th International Conference on Learning Representations, {ICLR} 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. - How Robust Are Character-Based Word Embeddings in Tagging and {MT} Against Wrod Scramlbing or Randdm Nouse? [link]
Heigold, G., Varanasi, S., Neumann, G. and van Genabith, J., 2018. Proceedings of the 13th Conference of the Association for Machine Translation in the {A}mericas (Volume 1: Research Track), pp. 68--80. Association for Machine Translation in the Americas. - Assessing gender bias in machine translation: a case study with google translate
Prates, M.O., Avelar, P.H. and Lamb, L.C., 2020. Neural Computing and Applications, Vol 32, pp. 6363--6381. Springer. - Reducing Gender Bias in Neural Machine Translation as a Domain Adaptation Problem [link]
Saunders, D. and Byrne, B., 2020. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7724--7736. Association for Computational Linguistics. DOI: 10.18653/v1/2020.acl-main.690 - Dictionary-based Phrase-level Prompting of Large Language Models for Machine Translation
Ghazvininejad, M., Gonen, H. and Zettlemoyer, L., 2023. - Prompting Large Language Model for Machine Translation: A Case Study
Zhang, B., Haddow, B. and Birch, A., 2023. arXiv preprint arXiv:2301.07069. - Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H. and Neubig, G., 2023. ACM Computing Surveys, Vol 55(9), pp. 1--35. ACM New York, NY. - Native language influence during second language acquisition: A large-scale learner corpus analysis
Shatz, I., 2017. Proceedings of the Pacific Second Language Research Forum (PacSLRF 2016), pp. 175--180. - A Survey on Transfer Learning
Pan, S.J. and Yang, Q., 2010. IEEE Transactions on Knowledge and Data Engineering, Vol 22(10), pp. 1345-1359. DOI: 10.1109/TKDE.2009.191 - Beyond english-centric multilingual machine translation
Fan, A., Bhosale, S., Schwenk, H., Ma, Z., El-Kishky, A., Goyal, S., Baines, M., Celebi, O., Wenzek, G., Chaudhary, V. and others, ., 2021. The Journal of Machine Learning Research, Vol 22(1), pp. 4839--4886. JMLRORG. - {G}oogle{'}s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation [link]
Johnson, M., Schuster, M., Le, Q.V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viegas, F., Wattenberg, M., Corrado, G., Hughes, M. and Dean, J., 2017. Transactions of the Association for Computational Linguistics, Vol 5, pp. 339--351. MIT Press. DOI: 10.1162/tacl_a_00065 - Speech and Language Processing (2nd Edition)
Jurafsky, D. and Martin, J.H., 2009. Prentice-Hall, Inc. - Multilingual Denoising Pre-training for Neural Machine Translation [link]
Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M. and Zettlemoyer, L., 2020. Transactions of the Association for Computational Linguistics, Vol 8, pp. 726--742. MIT Press. DOI: 10.1162/tacl_a_00343 - A survey on document-level neural machine translation: Methods and evaluation
Maruf, S., Saleh, F. and Haffari, G., 2021. ACM Computing Surveys (CSUR), Vol 54(2), pp. 1--36. ACM New York, NY, USA.