Today, we are thrilled to announce that the most recent addition to our Turing Universal Language Representation Model family (T-ULRv6) has achieved the 1st position on both the Google XTREME and GLUE leaderboards, demonstrating that a single multilingual model can achieve state-of-the-art capabilities in both English and multilingual understanding tasks. This marks the first time that a multilingual language understanding model has achieved the top position on both leaderboards, outperforming models that are specialized for either English or multilingual tasks and thereby helping lift the curse of multilinguality. Resulting from a collaboration between the Microsoft Turing team and Microsoft Research, the T-ULRv6 XXL model, based on “XY-LENT”, outperforms the current second-best model by an average score of 0.5 points on the XTREME leaderboard and achieves 1st place on the GLUE leaderboard.
Figure 1: XTREME leaderboard showing T-ULRv6 XXL at the top
Figure 2: GLUE leaderboard showing T-ULRv6 XXL at the top
This achievement builds on our XY-LENT work by leveraging X-Y bitexts as well as incorporating the key innovations of T-ULRv5, such as the XLM-E architecture, the novel pretraining tasks of multilingual replaced token detection (MRTD) and translation replaced token detection (TRTD), the improved training data and vocabulary, and the advanced fine-tuning technique of XTune.
Furthermore, to enable scaling to our XXL sized models, we leverage the memory optimization benefits afforded by ZeRO.
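To give a rough sense of where those memory savings come from, here is a back-of-the-envelope sketch of per-GPU model-state memory under the three ZeRO stages. It assumes mixed-precision Adam training (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states) and ignores activations and buffers. The parameter count, GPU count and the function name `zero_bytes_per_gpu` are hypothetical; this post does not state which ZeRO stage or hardware setup was used for T-ULRv6.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state bytes per GPU for mixed-precision Adam.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer states partitioned across GPUs
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients and parameters partitioned
    """
    params = 2.0 * num_params   # fp16 weights
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 weights copy, momentum, variance
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


# Hypothetical setup: a 1B-parameter model trained on 8 GPUs.
NUM_PARAMS = 1_000_000_000
NUM_GPUS = 8
memory_by_stage = {
    stage: zero_bytes_per_gpu(NUM_PARAMS, NUM_GPUS, stage) for stage in range(4)
}

for stage, num_bytes in memory_by_stage.items():
    print(f"stage {stage}: {num_bytes / 1e9:.2f} GB of model states per GPU")
```

Under these assumptions, the replicated baseline costs 16 bytes per parameter on every GPU, while full partitioning divides that cost by the number of GPUs.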
Beyond English-centric bitexts for better multilingual language representation learning
The key advancement of T-ULRv6 comes from going beyond English-centric (EN-X) bitext pairs to leverage multi-directional (X-Y) bitext pairs in different languages, such as French-German, Hindi-Urdu or Swahili-Arabic. Leveraging such bitexts is common in multilingual machine translation, where the nature of the problem necessitates it, but we show that leveraging them for multilingual encoder training also brings surprising performance gains. EN-X bitexts are useful for learning cross-lingual alignment and shared representations, but they are limited in their coverage and diversity of languages and domains. X-Y bitexts, on the other hand, can provide richer and more balanced information for learning multilingual representations that generalize better to a wider range of languages and tasks.
To use X-Y bitexts effectively, we adopted a novel sampling strategy that distributes data efficiently across languages while keeping the marginal distribution of languages consistent. This in turn ensures that the model retains strong English performance, as demonstrated above. We also reconstructed our vocabulary using VoCap, which allocates more tokens to low-resource languages and reduces the vocabulary overlap between languages. This helps mitigate the data sparsity that often plagues multilingual corpora and models.
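To make the idea of sampling X-Y pairs while keeping language marginals fixed more concrete, here is a minimal sketch. It is not the sampling strategy used for T-ULRv6, whose details are not given in this post. Instead it uses a simple two-stage stand-in: first pick a language from a target marginal distribution, then pick its partner language in proportion to the available bitext. The target marginals come from exponentially smoothed corpus sizes, a common heuristic in multilingual pretraining. All counts, the smoothing exponent `alpha` and the function names are hypothetical.

```python
from collections import defaultdict


def smoothed_marginals(mono_counts, alpha=0.3):
    """Target language distribution: p(lang) proportional to count ** alpha."""
    weights = {lang: count ** alpha for lang, count in mono_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


def pair_distribution(marginals, bitext_counts):
    """Joint distribution over (first language, partner language).

    q(a, b) = p(a) * count(a, b) / sum over b' of count(a, b'),
    so summing q(a, b) over b returns exactly p(a).
    """
    partners = defaultdict(dict)
    for (a, b), count in bitext_counts.items():
        # Bitext is treated as symmetric: usable from either side.
        partners[a][b] = count
        partners[b][a] = count
    joint = {}
    for lang, p_lang in marginals.items():
        if lang not in partners:
            raise ValueError(f"no bitext available for language {lang!r}")
        row_total = sum(partners[lang].values())
        for other, count in partners[lang].items():
            joint[(lang, other)] = p_lang * count / row_total
    return joint


def first_language_marginals(joint):
    """Recover the per-language marginal from the joint distribution."""
    marginals = defaultdict(float)
    for (lang, _), prob in joint.items():
        marginals[lang] += prob
    return dict(marginals)


# Hypothetical corpus sizes (sentences) and bitext sizes (sentence pairs).
mono_counts = {"en": 1_000_000, "fr": 200_000, "de": 150_000, "sw": 5_000}
bitext_counts = {
    ("en", "fr"): 50_000,
    ("en", "de"): 40_000,
    ("fr", "de"): 10_000,
    ("en", "sw"): 1_000,
    ("sw", "fr"): 500,
}

target = smoothed_marginals(mono_counts)
joint = pair_distribution(target, bitext_counts)
recovered = first_language_marginals(joint)

for lang in target:
    print(f"{lang}: target={target[lang]:.4f} recovered={recovered[lang]:.4f}")
```

In this sketch, adding non-English pairs such as French-German changes which partner a language is paired with, but the share of samples drawn for each first language stays at its target value, so English keeps a fixed share of the sampled data.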
A noteworthy property of these encoders is the parameter efficiency afforded by our approach. Our XY-LENT XXL variant outperforms both XLM-R XXL and mT5 XXL while being ~2x and ~3x smaller, respectively. The models in the base, large and XL categories are also state-of-the-art within their categories, and are often competitive across categories. Strong performance combined with a lower parameter count is especially useful for product scenarios.
Figure 3: T-ULRv6 (XY-LENT) is SoTA within model size bands while being parameter efficient
A single model for English and multilingual tasks
Another remarkable aspect of T-ULRv6 XXL is that it achieves state-of-the-art performance on both English and multilingual tasks with a single model, without sacrificing quality or efficiency. This means that users no longer need to choose which pretrained model to use based on the language of the NLP task, as T-ULRv6 XXL can handle both scenarios well. This simplifies the model selection and deployment process and reduces the computational and storage costs of maintaining multiple models.
To accomplish this, T-ULRv6 leverages the power of scaling and the benefits of non-English bitexts to overcome the curse of multilinguality, i.e., the trade-off between English and multilingual performance that often plagues multilingual models. T-ULRv6 not only outperforms specialized English models on the GLUE benchmark, which covers a range of English natural language understanding tasks, but also surpasses specialized multilingual models on the XTREME benchmark, which covers 40 typologically diverse languages and nine cross-lingual tasks. Moreover, T-ULRv6 does so with a much smaller model size than its competitors, demonstrating its parameter efficiency and scalability.
Figure 4: T-ULRv6 (XY-LENT) demonstrating strong performance across language families for a variety of tasks.
Applications and release information
We are proud to announce that T-ULRv6 is already powering the language universalization of Microsoft Bing, enabling users to search and discover information across languages and domains. T-ULRv6 will also soon enhance other Microsoft products with its state-of-the-art multilingual capabilities. Microsoft’s mission is to empower every person and every organization on the planet to achieve more. Working across countries and languages, T-ULRv6 is a key illustration of this mission in action, and we hope to make it available to more customers in the future.
We also strongly believe in sharing our AI technology with the research community and fostering collaboration and innovation. That is why we have launched the Microsoft Turing Academic Program (MS-TAP), which allows researchers to submit proposals and gain access to T-ULRv6 and other Turing models to study them in greater detail. We invite you to join us in exploring the potential and challenges of multilingual language understanding and generation, and to provide us with your valuable feedback and insights. We are also working on releasing the Base and Large model checkpoints to further promote research in this direction.
Towards more inclusive AI with multilingual technology
Multilingual technology is not only a technical challenge, but also a social responsibility. We are committed to democratizing AI by addressing the barriers that limit its accessibility and inclusivity, such as the lack of training data, the high cost of language modeling, and the complexity of multilingual systems. T-ULRv6 is a significant step in this direction, as it offers a more efficient and scalable framework for developing cross-lingual systems that can handle both English and multilingual tasks with a single model. We are inspired by the opportunity to further advance the state of the art and develop new multilingual capabilities that can benefit more people and organizations around the world. We hope that our work will contribute to the community's progress towards making AI more inclusive and accessible to all.