Meta Develops AI Model for Multilingual Translation and Transcription

Meta just announced SeamlessM4T, an AI model that translates and transcribes nearly 100 languages across text and speech. The model is available as open source, and Meta presents its speech-to-speech and speech-to-text capabilities as a significant leap in AI-powered language understanding.

Meta, the parent company of Facebook, has unveiled an AI model called SeamlessM4T, designed to handle a diverse range of dialects and to translate and transcribe nearly 100 languages across both text and speech.

In a significant leap forward in the field of AI-driven speech-to-speech and speech-to-text capabilities, Meta's SeamlessM4T introduces a single model for on-demand translations that can foster efficient communication among individuals speaking different languages. According to Meta, this innovative model eliminates the need for a separate language identification system.

SeamlessM4T builds upon Meta's prior AI work, including No Language Left Behind, a text-to-text machine translation model, and Universal Speech Translator, a direct speech-to-speech translation system that supports Hokkien. It also draws on the Massively Multilingual Speech framework, which provides speech recognition, language identification, and speech synthesis across more than 1,100 languages.

Meta isn't the sole contender in the AI translation and transcription arena. Industry leaders like Amazon, Microsoft, OpenAI, and several startups offer diverse commercial services and open source models. Google, for instance, is working on the Universal Speech Model, a facet of its comprehensive initiative to comprehend the world's 1,000 most widely spoken languages. Additionally, Mozilla has spearheaded Common Voice, a vast collection of multi-language voices aimed at training automatic speech recognition algorithms.

However, SeamlessM4T stands out as one of the most ambitious attempts to merge translation and transcription capabilities into a unified model.

To build SeamlessM4T, Meta scraped publicly available text (tens of billions of sentences) and speech (4 million hours) from various sources on the web. Juan Pino, a research scientist in Meta's AI research division, said the data came from a variety of sources but did not disclose which ones.

Yet the practice of training commercial AI on public data has raised concerns and led to lawsuits against companies. Critics contend that these companies should acknowledge the data's sources, offer compensation, and provide opt-out options. Meta, for its part, asserts that the data it mined, which could contain personal identifiers, isn't copyrighted and is predominantly sourced from open or licensed platforms.

From this data, Meta built the SeamlessAlign training dataset used to train the SeamlessM4T model. The researchers aligned 443,000 hours of speech with corresponding texts and created 29,000 hours of "speech-to-speech" alignments. This process equipped SeamlessM4T to transcribe speech to text, translate text, generate speech from text, and even convert spoken words from one language to another.
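
As a rough illustration, the four capabilities listed above can be laid out by input and output modality. This is only a sketch for readers: the `Task` class, the `TASKS` table, and the `tasks_for` helper are hypothetical names invented here and are not part of Meta's released code or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One capability described in the article (hypothetical structure)."""
    name: str
    source: str  # input modality: "speech" or "text"
    target: str  # output modality: "speech" or "text"


# The four capabilities the article lists, keyed by modality.
TASKS = [
    Task("speech transcription", "speech", "text"),
    Task("text translation", "text", "text"),
    Task("speech generation from text", "text", "speech"),
    Task("speech-to-speech translation", "speech", "speech"),
]


def tasks_for(source: str, target: str) -> list[str]:
    """Return the names of the listed tasks with the given modalities."""
    return [t.name for t in TASKS if t.source == source and t.target == target]


print(tasks_for("speech", "speech"))
```

The point of the layout is that a single model covers every combination of speech and text input and output, where earlier systems typically chained separate models for each step.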

Meta contends that in an internal benchmark, SeamlessM4T handled background noise and speaker variations in speech-to-text tasks better than current state-of-the-art speech transcription models. Meta attributes this success to the rich combination of speech and text in the training dataset, which it says gives SeamlessM4T an edge over models trained on speech or text data alone.

Meta's blog post states, "With state-of-the-art results, we believe SeamlessM4T is an important breakthrough in the AI community’s quest toward creating universal multitask systems."

However, concerns linger about the model's potential biases.

An analysis by The Conversation underscores the biases present in AI-driven translations, including gender bias. For instance, Google Translate once assumed doctors were male and nurses were female in certain languages. Bing's translator showed a related error, rendering "the table is soft" in German with the feminine "die Tabelle," which refers to a table of figures rather than a piece of furniture.

Speech recognition algorithms are not exempt from biases either. A study revealed that prominent speech recognition systems were twice as likely to inaccurately transcribe audio from Black speakers as opposed to white speakers.
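
Transcription accuracy of this kind is commonly measured with word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The article does not say which metric the study used, so the following is only a minimal sketch of the standard calculation, with a hypothetical function name and made-up example sentences.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of four gives a WER of 0.25.
print(word_error_rate("the table is soft", "the cable is soft"))
```

Comparing the average WER for two groups of speakers on the same material is one way a disparity like the one described above could be quantified.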

Unsurprisingly, SeamlessM4T shares these biases.

Meta's accompanying whitepaper reveals that the model tends to "overgeneralize to masculine forms when translating from neutral terms." Additionally, for most languages it performs better when translating from masculine references, such as the pronoun "he" in English. In cases where gender information is absent, SeamlessM4T prefers translating the masculine form around 10% of the time.

Meta contends that while SeamlessM4T doesn't produce excessive toxic text in its translations, a common issue with AI text models, it's not flawless. In certain languages like Bengali and Kyrgyz, it produces more toxic translations, particularly regarding socioeconomic status and culture. Furthermore, translations involving sexual orientation and religion tend to be more toxic.

The public demonstration of SeamlessM4T incorporates filters to identify toxicity in both input and output speech. However, these filters aren't enabled by default in the open source release of the model.
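
To make the idea concrete, here is a minimal sketch of a filter that checks both the input and the output of a translation step and is switched off unless the caller enables it. Everything in it is an assumption made for illustration: the `BLOCKLIST` contents, the `is_toxic` and `translate_with_filter` names, and the simple word-list check are hypothetical and do not reflect how Meta's demo or open source release actually works. It also operates on text, so speech would first need to be transcribed.

```python
from typing import Callable

# Hypothetical placeholder word list; a real filter would be per-language.
BLOCKLIST = {"badword"}


def is_toxic(text: str) -> bool:
    """Flag text containing any blocklisted word (case-insensitive)."""
    words = {w.strip(".,!?\"'").lower() for w in text.split()}
    return not words.isdisjoint(BLOCKLIST)


def translate_with_filter(
    text: str,
    translate: Callable[[str], str],
    filter_enabled: bool = False,  # off by default, as in the open release
) -> str:
    """Run `translate`, optionally screening its input and output."""
    if not filter_enabled:
        return translate(text)
    if is_toxic(text):
        return "[input withheld]"
    output = translate(text)
    if is_toxic(output):
        return "[output withheld]"
    return output


# Stand-in for a real translation function, for demonstration only.
print(translate_with_filter("hello there", str.upper, filter_enabled=True))
```

Checking the output separately matters because a translation can introduce toxic wording that was not present in the input, which is the failure mode the whitepaper describes.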

An unaddressed concern with AI translators is the potential loss of linguistic diversity that could come with their overuse. Unlike AI, human interpreters make choices unique to them when translating, and those choices leave distinctive fingerprints informally known as "translationese." While AI might provide more accurate translations, it could come at the cost of variety and diversity in translation.

Hence, Meta advises against employing SeamlessM4T for long-form or certified translations recognized by government agencies and translation authorities. The company also discourages its use in medical or legal contexts to mitigate potential mistranslations.

This precaution is warranted, considering instances where AI mistranslations have led to significant consequences. Misinterpretations have resulted in mistaken law enforcement actions and legal disputes.

Juan Pino commented, "This single system approach reduces errors and delays, increasing the efficiency and quality of the translation process, bringing us closer to making seamless translation possible." He envisions a future where this foundational model enhances communication capabilities and fosters a world where understanding is universal.

Nevertheless, it remains to be seen what role human translators will continue to play in ensuring accurate and nuanced translations and in avoiding unintended consequences as AI continues to evolve.