Microsoft Launches MAI-Voice-2, Image-2.5, Transcribe-1.5 at Build
Krasa AI
2026-06-02
Microsoft expanded its in-house multimodal lineup at Build 2026 on Tuesday with three new models: MAI-Voice-2 for text-to-speech with cross-lingual voice cloning, MAI-Image-2.5 for image generation and editing, and MAI-Transcribe-1.5 for speech recognition across 43 languages.
All three are available in Microsoft Foundry as of Tuesday and are the same models already powering Copilot, Bing, PowerPoint, and Azure Speech in production.
Why this matters
Microsoft has spent the past year building first-party alternatives to nearly every modality where it previously relied on OpenAI or third-party vendors. Tuesday's announcements fill in the multimodal gaps: voice, image, transcription.
The strategic point isn't that any single one of these models is the best in its class. The point is that Microsoft now ships its own credible option in every modality enterprises use. Developers can build full multimodal applications inside Foundry without ever calling out to a non-Microsoft API. For customers worried about data flow, vendor concentration, or pricing leverage, that's a meaningful change.
What was announced
MAI-Voice-2 is the headline. It's a multilingual text-to-speech model that supports voice cloning and voice prompting in more than 15 languages. The most striking feature is what Microsoft calls "identity preservation": the model can recreate a specific speaker's vocal identity, then speak as that person in any of the supported languages. A CEO can record a message in English and have it delivered in their own voice in Japanese, German, or Hindi.
Microsoft is shipping two variants: MAI-Voice-2 for highest fidelity and a Flash variant tuned for low-latency real-time applications like call centers and live captioning.
MAI-Image-2.5 is an update to Microsoft's image generation lineup that adds image-to-image editing alongside text-to-image generation. The new "control with preservation" feature lets users edit specific elements of an image while keeping the rest of the composition intact. Localized editing of this kind has been a long-standing weakness of diffusion models, one that has limited their use in professional design workflows.
Microsoft claims MAI-Image-2.5 debuted at No. 3 on the Arena.ai image generation leaderboard among model families. A Flash variant is available for high-volume production workloads. Microsoft also flagged improvements in text rendering inside generated images, which has been the most-requested fix from PowerPoint and Designer users.
MAI-Transcribe-1.5 is the quietest of the three announcements but arguably the most operationally significant. The model supports 43 languages with content biasing — the ability to tell the model in advance which industry jargon, proper nouns, or domain terms to expect, dramatically improving accuracy on specialized recordings. Microsoft says the model retains its No. 1 spot on the FLEURS multilingual speech recognition benchmark.
Industry impact
For ISVs building consumer products, MAI-Voice-2 changes what's economically feasible. Real-time voice cloning across more than 15 languages, billed through Azure consumption, puts localized customer service, dubbing, and accessibility products within reach of teams of any size. The competitive pressure on standalone voice AI vendors, ElevenLabs and OpenAI's voice models among them, just intensified.
For enterprise transcription, MAI-Transcribe-1.5's content biasing feature targets the workflows that have stubbornly resisted automation: legal depositions, medical dictation, technical meetings full of acronyms. A 43-language footprint with the No. 1 FLEURS score gives Microsoft a credible pitch against Otter, Rev, AssemblyAI, and the in-house transcription stacks at Google and AWS.
For design and marketing teams, MAI-Image-2.5's editing controls slot into the workflow gap between Adobe Firefly and standalone tools like Midjourney. Image editing inside the PowerPoint and Designer surfaces most knowledge workers already use is the kind of distribution advantage that's hard for standalone vendors to counter.
Expert perspectives
Microsoft's framing emphasizes that these aren't research demos. MAI-Voice and MAI-Image have been running production traffic across Copilot, Bing, and PowerPoint for months. The Foundry availability simply opens those same models to third-party developers.
Mustafa Suleyman, CEO of Microsoft AI, described the seven-model launch — including MAI-Thinking-1 and MAI-Code-1-Flash — as part of an incremental "hill-climbing" strategy. The pitch is that each model is shipped, used, measured, and improved against real workloads rather than benchmarked in a lab.
Independent voice AI developers reacting on social media noted that the cross-lingual identity preservation feature in MAI-Voice-2 sets a new bar. Voice cloning in a single language is largely solved; doing it credibly across more than 15 languages while preserving the speaker's identity is a meaningfully harder problem.
What's next
All three models are available in Microsoft Foundry today. Pricing is consumption-based, with no upfront commitment required. Microsoft published documentation, code samples, and a model catalog with regional availability listed.
Watch for two follow-ups. First, the on-device variants: Microsoft has signaled that distilled versions of MAI-Voice and MAI-Image will land in Foundry Local later this summer, enabling fully offline voice and image generation. Second, language expansion: Microsoft committed to growing MAI-Voice-2's language support beyond the current 15+ over the next year, with priority on Indian languages and Southeast Asian markets.
Ethical scrutiny is coming too. Voice cloning at this quality level intensifies the deepfake problem. Microsoft says it has built in audio watermarking and consent-verification flows for the identity preservation feature, but enterprise customers should expect compliance reviews to take longer than usual before deploying voice cloning into production.
Bottom line
Microsoft now ships first-party voice, image, and transcription models, all available in Foundry today and already running in production. If you've been picking vendors per modality, the single-vendor option is real. Watch the consent and watermarking story closely as voice cloning at this fidelity rolls into the enterprise.