Global Big Data Conference


Microsoft Challenges AI Rivals with Three New Foundational Models

Posted on: Apr 02, 2026

Microsoft’s AI research division has unveiled three new foundational models capable of generating text, voice, and images, marking another step in the company’s push to expand its multimodal AI capabilities and compete with other leading labs—while maintaining its partnership with OpenAI.

The newly introduced models are MAI-Transcribe-1, which converts speech into text across 25 languages and is significantly faster than the company’s Azure Fast service; MAI-Voice-1, an audio generation model that can produce up to 60 seconds of sound in just one second and supports custom voice creation; and MAI-Image-2, designed for image generation.

MAI-Image-2 was first released on MAI Playground, a testing platform for large language models, in March, and all three models are now available through Microsoft Foundry. The transcription and voice models can also be accessed via the Playground.

These models were developed by Microsoft’s MAI Superintelligence team, which was established in late 2025 and is led by Mustafa Suleyman. Suleyman emphasized a “human-centered” approach to AI development, focusing on real-world communication and practical use cases, while hinting at more models to come.

In a highly competitive AI market, Microsoft is positioning these models as cost-effective alternatives to offerings from competitors like Google and OpenAI. Pricing starts at $0.36 per hour for transcription, $22 per million characters for voice generation, and $5 per million text tokens (with $33 per million image tokens) for image generation.
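To make the quoted rates concrete, here is a minimal Python sketch that turns them into cost estimates. The four prices are the ones stated above. The function names and the usage figures are hypothetical, and the sketch assumes billing scales linearly with usage, which the article does not confirm. It does not call any Microsoft API.

```python
# Rates quoted in the article (USD). Linear billing is an assumption.
TRANSCRIBE_USD_PER_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_TEXT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_IMAGE_TOKENS = 33.0


def transcription_cost(audio_hours: float) -> float:
    """Estimated cost of transcribing the given hours of audio."""
    return audio_hours * TRANSCRIBE_USD_PER_HOUR


def voice_cost(characters: int) -> float:
    """Estimated cost of generating speech for the given character count."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(text_tokens: int, image_tokens: int) -> float:
    """Estimated image-generation cost from text and image token counts."""
    return (
        text_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_TEXT_TOKENS
        + image_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_IMAGE_TOKENS
    )


if __name__ == "__main__":
    # Hypothetical workload, chosen only for illustration.
    print(f"100 h of transcription: ${transcription_cost(100):.2f}")
    print(f"500,000 characters of speech: ${voice_cost(500_000):.2f}")
    print(f"1M text + 1M image tokens: ${image_cost(1_000_000, 1_000_000):.2f}")
```

At these rates, transcription is by far the cheapest line item per unit of content, while image tokens cost more than six times as much as text tokens.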

Even as it builds its own AI ecosystem, Microsoft reaffirmed its ongoing collaboration with OpenAI, a partnership backed by more than $13 billion in investment. The company is continuing its strategy of both developing in-house technologies and working with external providers.
