UPC researcher Marta Ruiz Costa-Jussà has been awarded a Starting Grant from the European Research Council (ERC) to explore new automatic translation methods for text and voice. With the LUNAR project, she will research an automatic translation system that is more efficient than existing ones and offers similar quality for majority and minority languages.
Traditional dictionaries have been obsolete for years. New technologies have taken over and improved on the function of dictionaries to allow much faster, more complete consultations. Similarly, thousands of translators have seen their profession transformed in recent years. The slowest, most tedious process of translating long texts has been automated and human talent is focused on more technical, abstract aspects of language, which are still difficult for machines to understand.
Automatic translation software is everywhere and brings cultures closer together in a way that has not been achieved before. Almost instantaneously, it can translate text and voice between hundreds of languages. However, there is still considerable room for improvement. Since 2002, the Universitat Politècnica de Catalunya · BarcelonaTech (UPC) has been a benchmark in the field, led by professors in the Signal Theory and Communication and Computer Science departments such as José B. Mariño, José A. R. Fonollosa and Lluís Màrquez.
The LUNAR project
Recently, researcher Marta Ruiz Costa-Jussà, from the Computer Science Department, was awarded a grant of 1.5 million euros from the European Research Council (ERC) to carry out research in this area. Ruiz Costa-Jussà has successfully coordinated national and international projects and has received several distinctions such as the Google Faculty Research Awards of 2018 and 2019.
The Lifelong UNiversal lAnguage Representation (LUNAR) project will investigate various improvements to the neural system that has been used in automatic translation since 2014. This system, based on deep learning, has left behind the rule-based and statistical systems previously used in automatic translation. Rule-based systems required thousands of rules and enormous dictionaries. Statistical systems needed banks of translations for each pair of languages, so the number of banks grew quadratically with the number of languages. Although neural systems also depend on banks of translations, they provide an alternative in which the neural system establishes an intermediate language (a type of mathematical Esperanto) to and from which all translations pass. Thus, the entire process is more efficient (the dependence becomes linear) and of higher quality, as all efforts can be concentrated on coding and decoding the intermediate language.
However, this method, which is used by the giants of automatic translation, has some disadvantages. For example, as one universal coder and decoder is used, languages that have fed the system with fewer resources cannot obtain translations as rich as those obtained by languages with more resources. These are minority languages, or languages from remote areas where the language has not been computerised as fully.
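The difference between quadratic and linear dependence can be illustrated with a simple count. The sketch below is only an illustration under stated assumptions: it assumes that a pairwise approach needs one system per translation direction, and that an approach built around an intermediate language needs one coder and one decoder per language. The function names are hypothetical and are not taken from the LUNAR project.

```python
def pairwise_systems(n_languages: int) -> int:
    """Systems needed if every translation direction has its own system.

    Assumption: source -> target and target -> source are counted separately.
    """
    return n_languages * (n_languages - 1)


def intermediate_modules(n_languages: int) -> int:
    """Modules needed if every language has one coder and one decoder
    that map to and from a shared intermediate language (assumption).
    """
    return 2 * n_languages


if __name__ == "__main__":
    for n in (3, 10, 100):
        print(n, pairwise_systems(n), intermediate_modules(n))
    # prints:
    # 3 6 6
    # 10 90 20
    # 100 9900 200
```

Under these assumptions, 100 languages would require 9,900 direct systems but only 200 modules around an intermediate language, which is why the gap widens as more languages are added.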
Language inclusivity and voice translation
The LUNAR project will study a solution for languages that are underrepresented in text databases and in the audio material that feeds the system. By establishing specific coders and decoders for each language, the aim is for translation from the intermediate language to be as rich and complete as possible in all languages.
In addition, the LUNAR project will make it possible for this system to work for automatic voice translation. This aspect is one of the most notable, as achieving it would represent great progress in this research field (it is a functionality that even the giants of automatic translation have not been able to apply).
Ethics in automatic translation systems
Finally, the LUNAR project will raise awareness of the biases in current automatic translations. One is a geopolitical bias, which consists of the underrepresentation of African and Asian languages, among others, and results in worse translations of these languages. Another is a gender bias, which is inevitably absorbed from the texts and audio material that feed the system and can cause, for example, the neutral English word “nurse” to always be translated as feminine and “doctor” as masculine. The last is a corporate bias, as much of this data comes from large corporations that in some way influence the range of vocabulary and the types of information that the system uses. LUNAR will not ignore these biases in its results; it will report and mitigate them.