Microsoft Research


New Future of Work: AI is driving rapid change, uneven benefits

Thu, 04/09/2026 - 18:11
At a glance
  • AI is driving rapid change in the workplace, more sharply than in previous editions of the New Future of Work report
  • AI is changing how people work together, not just enabling them to work faster or from remote locations. Organizations that treat AI as a collaborative partner are seeing the biggest benefits.
  • The benefits of AI are not yet evenly distributed, underscoring the need for industry leaders to build AI that expands opportunity. The future is not predetermined. It will be shaped by the choices we make today.
  • Human expertise matters more, not less, in an AI-powered world. People are shifting from merely doing work to guiding, critiquing, and improving the work of AI.

For the past five years, the New Future of Work report has captured how work is changing. This year, the shift feels especially sharp. Previous editions have focused on technology’s role in increasing productivity by automating tasks, accelerating communication, and expanding access to information, as well as the rise of remote work. Today, generative AI has put this transformation on fast forward. Instead of simply speeding up existing workflows, AI increasingly participates in them, shaping how people create, decide, collaborate, and learn.

For decades, researchers across Microsoft have studied these changes not as abstract trends but as lived experiences. Across organizations and occupations, people are experimenting with AI in uneven, creative, and sometimes surprising ways. Many are saving time, expanding their capabilities, and taking on more complex work, but the real opportunity ahead is to use AI to help us work better, together.

Publication: New Future of Work Report 2025

The New Future of Work report brings together research from inside and outside of Microsoft to understand what is happening as AI enters workplaces. Through the efforts of dozens of authors and editors, it draws on evidence from large‑scale data analyses, field and lab studies, and theory to look at who is using AI, why they are using it, and how it is reshaping productivity, collaboration, learning, and judgment. It highlights professions where changes are unfolding especially quickly, as well as the broader societal impact of these technologies.

Taken together, these findings point to a central insight: The future of work is not something that will simply happen to us. We are actively constructing it, through the choices individuals make, the norms teams build, the systems organizations adopt, and the discoveries researchers uncover. At the same time, AI’s role is still evolving, and it is driving a range of impact—some of which may be viewed as positive or negative. What follows is a research-backed snapshot of this moment in time and what it can teach us about how to collectively create a new and better future of work with AI.

Adoption and usage

Generative AI is entering workplaces quickly, likely faster than most earlier technologies. But the patterns of who uses it, and how, will shape who benefits. Reports on early adoption appear to show significant penetration: in one German survey, 38% of employed respondents reported using AI at work. But usage and confidence vary widely across sectors, and men report using AI at work more often than women. It’s not yet clear whether that variability is driven by occupational distributions, relative comfort with new tools, or something else. The concern is that uneven adoption is likely to translate into uneven productivity gains, learning opportunities, and downstream career paths between those who adopt and those who do not.

A look at generative AI adoption globally reveals further differences. High-income countries still lead overall usage, but the fastest growth is happening in low- and middle-income regions. When local languages are poorly served, people switch to English simply to get reliable results. Without investment in infrastructure and multilingual model development, AI risks reinforcing existing divides rather than narrowing them.

Inside organizations, the decision to use or not use AI is shaped less by strategy decks and more by culture. People try new tools when they trust their employer and feel safe experimenting. They stick with tools that make their work better, but might reject tools that seem designed to replace them—which is a common concern among workers. And many of the most useful applications don’t come from top-down initiatives at all but from employees trying things, discovering what actually helps, and sharing those insights with colleagues. Research has shown that involving workers’ perspectives in the design of workplace technologies promotes sustainable improvements in productivity and well-being.

We are also starting to see what people actually do with AI. At Anthropic, an analysis of millions of user conversations found that 37% of Claude usage was tied to software and mathematical occupations. A study of Microsoft Copilot conversations found high applicability to the activities of information workers across sales, media, tech, and administrative roles. But the broader point is simpler: most occupations include at least some tasks where AI is useful.

These shifts come with social side effects. Several studies show that employees who use AI can be perceived as less capable, even when their output is identical to that of people who didn’t use AI. Whether these perception penalties fall unevenly across groups is still an open question. However, managers who have used AI tend to evaluate AI-assisted work more fairly. This suggests that AI may require broad exposure before it can be used openly and without judgment.

Impact on work and labor markets

Understanding who uses AI and why they use it can help assess its value, but the harder question is how it affects productivity and labor markets. Productivity can increase through time saved, higher-quality work, or simply feeling more capable. Surveyed enterprise users of AI report saving 40–60 minutes a day, while model-based evaluations show frontier systems can approach quality levels comparable to those of experts on a growing range of tasks. But AI may also reduce productivity. In one U.S. survey, 40% of employees said they had received “workslop,” AI-generated content that looks polished but isn’t accurate or useful, in the past month. When that happens, any time savings can quickly disappear, and quality can actually suffer.

We still don’t have the full picture of what this means for jobs and labor markets more broadly. Large-scale empirical work finds no clear aggregate effects on unemployment, hours worked, or job openings. However, AI does seem to be reducing opportunities for younger, inexperienced workers. Entry-level roles rely less on experience and knowledge and are easier to automate. Empirical evidence suggests employment for workers aged 22–25 in highly AI-exposed jobs declined by 16% relative to similar but less-exposed roles, and hiring into junior positions appears to slow after firms adopt AI. This pattern raises a longer-term concern: automating jobs that enable workers to learn skills may undermine how expertise is built over time. This point is reinforced by research using theoretical models as well as empirical evidence.

Meanwhile, AI is also changing which skills matter. Roles that mention AI skills in their job postings are nearly twice as likely to also emphasize analytical thinking, resilience, and digital literacy. Demand for work that can be outsourced to AI models more easily, including data-related tasks or routine translation, continues to fall. Even where overall employment remains stable, AI is already reshaping how jobs are structured and this trend will continue.

As more empirical evidence comes in, theoretical work helps frame what might lie ahead. One recurring theme is that human judgment – spotting opportunities, working under ambiguity or choosing from outputs – becomes more valuable as AI improves. And organizations that use AI to augment what people can do often end up creating new kinds of work, rather than simply eliminating existing ones. If AI is meant to deliver on its potential to support broad prosperity gains, the path forward is less about replacing tasks and more about expanding what people are able to do.

Human-AI collaboration

As AI becomes more capable, the nature of human-AI interaction is changing. AI systems are increasingly playing a role in decision-making, creativity, and communication, and are often positioned as “collaborators.” This raises questions about how to support “collaboration” between people and AI, what we can learn from how people interact with each other, and where the capabilities of AI systems create different opportunities and requirements.

At the heart of effective collaboration is common ground: the shared understanding that allows people to coordinate and communicate. In human conversation, we constantly check for alignment – through clarifications, acknowledgements, and follow-up questions. Yet current AI systems often skip these steps, generating responses that assume understanding rather than building it. Research shows that this lack of conversational grounding can lead to breakdowns in human-AI interaction. Encouragingly, systems like CollabLLM (opens in new tab), which prompt AI to ask clarifying questions and respond over multiple turns, have shown improved task performance and more interactive exchanges.

Trust is another essential aspect of collaboration. Although AI can process vast amounts of information, its usefulness in decision-making depends on how well it grasps human goals, and how well people understand its capabilities. Using AI that doesn’t understand a person’s objectives can lead to worse outcomes than using no AI at all. Yet people often overestimate AI’s abilities, which distorts their judgment on when and how to use it. Systems that support selective delegation can improve these decisions, especially when the AI is programmed to account for this selective approach in its responses.

AI’s advancing capabilities are fueling a shift in people’s roles. This includes software production, where developers who once wrote code from beginning to end are increasingly reviewing and refining AI-generated suggestions. Writers and designers are acting more as curators and editors, guiding AI outputs rather than producing everything from scratch. This shift demands new skills – like crafting effective prompts, vetting AI responses, and maintaining quality oversight – and new tools to support them.

Current chat-based interfaces are often too limited for these evolving workflows. Effective oversight requires observability of an AI system’s activity, decisions, and outputs, alongside knowledge of its capabilities, limitations, and workings, as well as the domain expertise and situational awareness needed to intervene. New interface designs are emerging to address this, including visualizations of AI reasoning, shared editing spaces, and mixed-initiative systems that allow humans and AI to take turns leading a task. These innovations aim to preserve human agency while making AI more transparent and responsive.

Ultimately, the future of work is about building complementary interactions between people and AI: drawing on knowledge of how people collaborate, acknowledging the unique challenges of human-AI interaction, and taking advantage of what AI systems can do.

AI for teamwork

AI systems have been designed from the ground up to work best for individuals, not for teams of people. It is no surprise, then, that when people use AI as a team, they often underperform, even relative to an individual using AI.

The good news is that a growing body of research is dedicated to AI that supports team and group interaction. Researchers are using two broad approaches: (1) process-focused strategies, i.e., building AI to facilitate specific team processes like information sharing, and (2) outcome-focused strategies, i.e., training end-to-end AI systems that attempt to learn from short- and long-range team outcomes.

Some examples of the former include systems that provide a devil’s advocate perspective in a group discussion or help amplify minority perspectives. Examples of the latter include systems that try to help teams make good decisions or drive meetings towards achieving goals.

Theory from fields like collective intelligence would suggest that both approaches have great potential: AI can unlock new models of collaboration that are wildly different and more productive than we’ve had before. One notable example is AI enabling much more ephemeral teams, where a precise group of people in a given organization (or even beyond) can come together to solve a specific problem, then disband when the problem is solved.

More philosophically, it can be useful to understand even individual interaction with a large language model (LLM) as a type of teamwork. In fact, “collective intelligence” is perhaps a more accurate term for technologies like LLMs than “artificial intelligence”. LLMs take knowledge from millions of people who have written web content or posted in places like Reddit and Wikipedia, interacted with chatbots, and generated other types of data, and make that available to individuals on demand. Every time you interact with an LLM, you’re interacting with the work of millions of people, without the impossible overhead of that scale of collaboration.

Thinking, learning and psychological influences

Generative AI is changing cognition and learning while also introducing new psychological dynamics. This is making design choices about agency, effort, and well-being increasingly consequential. 

A central pattern emerging in generative AI is a shift from ‘thinking by doing’ (e.g. writing a document) toward ‘choosing from outputs’ (e.g. prompting AI to write a document). This may weaken the judgment and practices that sustain human expertise unless it is paired with user experiences that keep people cognitively engaged, and with upskilling and reskilling to accommodate changes in available work. AI can also be designed to support thinking rather than substitute for it, for example by provoking reflection, scaffolding reasoning, and offering workflows that help people ‘decide how to decide’ through alternatives and critiques. For ideation and creativity, benefits can be fragile. Using LLMs at the wrong time can reduce originality and self-efficacy, and repeated cognitive offloading can carry over even when AI is removed. To avoid trading short-term accuracy for long-term capability, AI experiences should help users practice the judgment needed to challenge and refine AI outputs.

AI use in education is already widespread, but much of this activity runs through general-purpose tools rather than education-specific products, while training and policy are still catching up. In learning contexts, the speed and ease with which AI is being designed to meet workplace tasks may conflict with the needs of education. Learning often benefits from ‘desirable difficulties,’ and heavy reliance on summaries and syntheses may make learning shallower without thoughtful support. Such support may involve trying problems before turning to AI for help, and question-driven tutoring that requires students to justify and check outputs. Coding education remains essential, but needs to shift its focus from memorizing syntax to abstraction and accountability, such as problem framing and critical review. Workplace training can counter overreliance and ‘workslop’ productivity problems by helping workers reframe AI as a thought partner, prompting reflective interaction, and strengthening calibration and verification habits so workers retain responsibility for final decisions.

Finally, conversational AI is increasingly being used for social and emotional support, making empathy and psychological well-being core design and governance concerns, especially because effects can vary sharply by user context and interaction patterns. That variability also raises the stakes for anthropomorphic behaviors. Clearer definitions and measurement are needed to understand when systems appear human-like and what consequences follow. Broader mapping of the design space can help designers anticipate implications and choose alternatives.

Specific roles & industries

While much of the NFW report highlights broad work patterns such as collaboration, communication, and decision-making, we also examined specific professions that are seeing especially rapid disruption. Among those that stand out in this year’s edition are software engineering and science. To counter some of the misunderstandings around these fields, we address several myths, including:

  • Counting AI-generated lines of code is a meaningful productivity metric
  • Current tools will instantly turn every developer into a “10× engineer”
  • Adoption primarily depends on model capability

Beyond myth-busting, we see real shifts in the software lifecycle. Historically, PMs (product/program/project managers) focused on customer needs, telemetry, design, and feedback, while developers wrote the code. With generative AI, these boundaries are blurring. PMs report doing more technical work and writing more code, while developers increasingly engage in higher-level planning and conceptual thinking as they interact with AI agents.

This shift is illustrated by the rise of vibe coding—developing software through iterative prompting rather than directly writing and editing code. Studies show that experienced computer science students are better at vibe coding than novices, able to steer models with a smaller number of targeted prompts. As humans build trust with AI assistants, work becomes more co-creative, enabling engineers to stay “in flow” through continuous iteration.

Together, these changes point to a deeper transformation in how software is built—both the mechanics of code production and the ways teams coordinate, plan, and collaborate.

Science is also seeing significant AI-driven acceleration. AI is helping researchers identify promising ideas, retrace known results, and surface cross-field connections. Foundation models also make it easier to work with diverse data types and enable experiments at a previously impossible scale.

Benefits of increased research productivity and moderate quality gains appear to be most pronounced for early career researchers and non-English speaking scientists, for whom AI can act as both a collaborator and a form of access to advanced tooling.

However, AI introduces new risks. Issues of data provenance, accountability, and replication become more complex when generative systems are involved. Small variations in prompts can significantly change outcomes, making results harder to verify. Models may reproduce ideas without attribution or hallucinate entirely, increasing the burden of source-checking. And because many models tend toward sycophantic responses, scientists may overestimate the novelty or correctness of AI-generated insights.

Closing

Generative AI will not arrive in some distant future; it is reshaping work right now. Here are a few things to take away:

  1. AI isn’t just speeding up work—it’s changing how we work together.
    This year’s research shows a real shift: AI is moving from automating tasks to actively shaping how people create, decide, collaborate, and learn. The organizations seeing the biggest gains are the ones treating AI as a collaborative partner—not a bolt‑on tool—and building the culture, norms, and confidence to experiment.
  2. The benefits of AI are real, but they’re not evenly distributed—yet.
    Adoption is rising fast across countries, professions, and industries, but the gaps in access, confidence, and usage are widening. Early evidence shows that who uses AI (and how) will determine who benefits. Industry leaders need to ensure AI expands opportunity rather than reinforces divides.
  3. Human expertise matters more—not less—in an AI‑powered world.
    Across software engineering, science, and knowledge work, AI is transforming roles: people are shifting from doing the work to guiding, critiquing, and improving it. The organizations that thrive will be the ones that invest in judgment, critical thinking, and responsible oversight—and design AI experiences that keep people thoughtfully engaged.

The research in this year’s New Future of Work report points to both opportunity and responsibility. The future is not predetermined. It will be shaped by the choices we make today—in how we build AI systems, how organizations adopt them, and how individuals learn to work alongside them. Microsoft remains committed to studying these changes as they unfold, grounding our understanding in evidence, and ensuring that the future we are collectively building is one where AI helps us all work better, together.


ADeLe: Predicting and explaining AI performance across tasks

Wed, 04/01/2026 - 18:00
At a glance
  • AI benchmarks report performance on specific tasks but provide limited insight into underlying capabilities; ADeLe evaluates models by scoring both tasks and models across 18 core abilities, enabling direct comparison between task demands and model capabilities.
  • Using these ability scores, the method predicts performance on new tasks with ~88% accuracy, including for models such as GPT-4o and Llama-3.1.
  • It builds ability profiles and identifies where models are likely to succeed or fail, highlighting strengths and limitations across tasks.
  • By linking outcomes to task demands, ADeLe explains differences in performance, showing how it changes as task complexity increases.

AI benchmarks report how large language models (LLMs) perform on specific tasks but provide little insight into the underlying capabilities that drive that performance. They do not explain failures or reliably predict outcomes on new tasks. To address this, Microsoft researchers in collaboration with Princeton University and Universitat Politècnica de València introduce ADeLe (opens in new tab) (AI Evaluation with Demand Levels), a method that characterizes both models and tasks using a broad set of capabilities, such as reasoning and domain knowledge, so performance on new tasks can be predicted and linked to specific strengths and weaknesses in a model.

In a paper published in Nature, “General Scales Unlock AI Evaluation with Explanatory and Predictive Power (opens in new tab),” the team describes how ADeLe moves beyond aggregate benchmark scores. Rather than treating evaluation as a collection of isolated tests, it represents both benchmarks and LLMs using the same set of capability scores. These scores can then be used to estimate how a model will perform on tasks it has not encountered before. The research was supported by Microsoft’s Accelerating Foundation Models Research (AFMR) grant program.

ADeLe-based evaluation

ADeLe scores tasks across 18 core abilities, such as attention, reasoning, and domain knowledge, and assigns each task a value from 0 to 5 based on how much it requires each ability. For example, a basic arithmetic problem might score low on quantitative reasoning, but an Olympiad-level proof would score much higher.

Evaluating a model across many such tasks produces an ability profile—a structured view of where the model performs and where it breaks down. Comparing this profile to the demands of a new task makes it possible to identify the specific gaps that lead to failure. The process is illustrated in Figure 1.

Figure 1. Top: (1) Model performance on the ADeLe benchmark and (2) the resulting ability profiles, showing each model’s strengths and limitations across core abilities. Bottom: (1) Application of 18 scoring criteria to each task and (2) the resulting task profiles, showing the abilities each task requires.

Evaluating ADeLe

Using ADeLe, the team evaluated a range of AI benchmarks and model behaviors to understand what current evaluations capture and what they miss. The results show that many widely used benchmarks provide an incomplete and sometimes misleading picture of model capabilities and that a more structured approach can clarify those gaps and help predict how models will behave in new settings.

ADeLe shows that many benchmarks do not isolate the abilities they are intended to measure or only cover a limited range of difficulty levels. For example, a test designed to evaluate logical reasoning may also depend heavily on specialized knowledge or metacognition. Others focus on a narrow range of difficulty, omitting both simpler and more complex cases. By scoring tasks based on the abilities they require, ADeLe makes these mismatches visible and provides a way to diagnose existing benchmarks and design better ones.

Applying this framework to 15 LLMs, the team constructed ability profiles using 0–5 scores for each of 18 abilities. For each ability, the team measured how performance changes with task difficulty and used the difficulty level at which the model has a 50% chance of success as its ability score. Figure 2 illustrates these results as radial plots that show where the model performs well and where it breaks down.

Figure 2. Ability profiles for 15 LLMs across 18 abilities. Left: OpenAI models. Middle: Llama models. Right: DeepSeek-R1 distilled models.

This analysis shows that models differ in their strengths and weaknesses across abilities. Newer models generally outperform older ones, but not consistently across all abilities. Performance on knowledge-heavy tasks depends strongly on model size and training, while reasoning-oriented models show clear gains on tasks requiring logic, learning, abstraction, and social inference. These patterns typically require multiple, separate analyses across different benchmarks and can still produce conflicting conclusions when task demands are not carefully controlled. ADeLe surfaces them within a single framework.

ADeLe also enables prediction. By comparing a model’s ability profile to the demands of a task, it can forecast whether the model will succeed, even on tasks that are unfamiliar. In experiments, this approach achieved approximately 88% accuracy for models like GPT-4o and LLaMA-3.1-405B, outperforming traditional methods. This makes it possible to both explain and anticipate potential failures before deployment, improving the reliability and predictability of AI model assessment.
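To make the mechanics concrete, here is a minimal Python sketch of the two ideas above: approximating an ability score as the highest demand level at which a model still succeeds at least half the time (a simplification of the 50%-success point described earlier), and predicting success on a new task by comparing the profile to the task’s demands. The data structures and the all-abilities-must-meet-demand rule are illustrative assumptions, not the released ADeLe implementation.

    # Minimal sketch (not the released ADeLe code): build an ability profile from
    # per-task observations, then predict success on a new task from its demands.
    from collections import defaultdict

    def ability_profile(observations):
        """observations: list of (demands, success) pairs, where demands maps each
        ability to the 0-5 level the task requires. The ability score is approximated
        as the highest level at which the model still succeeds at least half the time."""
        outcomes = defaultdict(lambda: defaultdict(list))  # ability -> level -> [bool]
        for demands, success in observations:
            for ability, level in demands.items():
                outcomes[ability][level].append(success)
        profile = {}
        for ability, levels in outcomes.items():
            passing = [lvl for lvl, results in levels.items()
                       if sum(results) / len(results) >= 0.5]
            profile[ability] = max(passing) if passing else 0
        return profile

    def predict_success(profile, task_demands):
        # Predict success when the model's score meets or exceeds every demand
        # the task places on it (an illustrative simplification).
        return all(profile.get(ability, 0) >= level for ability, level in task_demands.items())

    observations = [
        ({"quantitative_reasoning": 1, "domain_knowledge": 2}, True),
        ({"quantitative_reasoning": 2, "domain_knowledge": 2}, True),
        ({"quantitative_reasoning": 4, "domain_knowledge": 3}, False),
    ]
    profile = ability_profile(observations)
    print(profile)                                                  # {'quantitative_reasoning': 2, 'domain_knowledge': 2}
    print(predict_success(profile, {"quantitative_reasoning": 2}))  # True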

Whether AI systems can truly reason is a central debate in the field. Some studies report strong reasoning performance, while others show performance breaking down at scale. These results reflect differences in task difficulty. ADeLe shows that benchmarks labeled as measuring “reasoning” vary in what they require, from basic problem-solving to tasks that combine the need for advanced logic, abstraction, and domain knowledge. The same model can score above 90% on lower-demand tests and below 15% on more demanding ones, reflecting differences in task requirements rather than a change in capability.

Reasoning-oriented models like OpenAI’s o1 and GPT-5 show measurable gains over standard models—not only in logic and mathematics but also in interpreting user intent. However, performance declines as task demands increase. AI systems can reason, but only up to a point, and ADeLe identifies where that point is for each model.

Looking ahead

ADeLe is designed to evolve alongside advances in AI and can be extended to multimodal and embodied AI systems. It also has the potential to serve as a standardized framework for AI research, policymaking, and security auditing.

More broadly, it advances a more systematic approach to AI evaluation—one that explains system behavior and predicts performance. This work builds on earlier efforts, including Microsoft research on applying psychometrics to AI evaluation and recent work on Societal AI, emphasizing the importance of AI evaluation.

As general-purpose AI systems continue to outpace existing evaluation methods, approaches like ADeLe offer a path toward more rigorous and transparent assessment in real-world use. The research team is working to expand this effort through a broader community. Additional experiments, benchmark annotations, and resources are available on GitHub (opens in new tab).


AsgardBench: A benchmark for visually grounded interactive planning

Thu, 03/26/2026 - 21:02
At a glance
  • To successfully complete tasks, embodied AI agents must ground and update their plans based on visual feedback.
  • AsgardBench isolates whether agents can use visual observations to revise their plans as tasks unfold.
  • Spanning 108 controlled task instances across 12 task types, the benchmark requires agents to adapt their plans based on what they observe.
  • Because objects can be in different positions and states (e.g., clean or dirty), the same instruction can require different action sequences, even in the same environment.

Imagine a robot tasked with cleaning a kitchen. It needs to observe its environment, decide what to do, and adjust when things don’t go as expected, for example, when the mug it was tasked to wash is already clean, or the sink is full of other items. This is the domain of embodied AI: systems that perceive their environment and act within it.

The field has made rapid progress, but evaluating these systems is harder than it looks. Many benchmarks test perception, navigation, and physical control all at once, making it difficult to isolate whether an AI agent is actually using what it perceives to make better decisions or just getting lucky because the environment is predictable enough to script around.

To address this, we created AsgardBench. In the paper, “AsgardBench — Evaluating Visually Grounded Interactive Planning Under Minimal Feedback,” we describe how this benchmark poses a simple but demanding challenge: give an AI agent a household task, let it observe the environment through images, and see whether it can adjust its plan when what it perceives contradicts what it anticipated. Can it notice that the mug it needs to clean is already in the sink, or that it isn’t, and behave accordingly? That is the core question AsgardBench is designed to answer.

Built on AI2-THOR, an interactive 3D simulation environment used to train and evaluate AI agents on household tasks, AsgardBench positions agents near objects and gives them a small, fixed set of actions, such as find, pickup, put, clean, and toggle_on/off. At each turn, the agent proposes a full sequence of steps to complete the task, but only the first step executes. Throughout, the focus is squarely on plan adaptation, not whether an agent can navigate a room or manipulate an object, but whether it can use what it perceives to revise its next step.

For example, the agent may discover a mug to be clean, dirty, or filled with coffee, or it may observe that a sink contains many other items, so the same instruction can require different action sequences as the task unfolds. This process is illustrated in Figure 1.

Figure 1: Agent observations and corresponding action plans in AsgardBench. Each image is paired with the plan generated from that observation. This illustrates how AsgardBench requires agents to update or change their plans based on new visual evidence rather than following a fixed sequence.

How it works

Agents start in interaction-ready positions, so navigation and viewpoint selection are not factors. A find action brings objects into view, and the environment handles the details of container sizing and placement, so the agent does not need to reason about which cabinet or countertop to use. The only inputs are color images, a history of attempted actions with simple success or failure signals, and the agent’s own record of what it plans to do next.

At each turn, the agent proposes a complete sequence of steps to finish the task, but only the first step proceeds. It then receives new images and a simple signal—did that action succeed or fail? This prevents the agent from scripting everything upfront and forces it to re-evaluate and revise its plan at every step. Built-in limits on total steps and repeated actions prevent endless loops. Because the environment provides only simple feedback, the agent must be able to notice what it perceives (e.g., whether a mug is dirty, whether a faucet is running) and keep track of where it is in the task from one step to the next.
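As a rough illustration of this interaction protocol, the sketch below shows the turn-by-turn loop in Python. The environment and agent interfaces (env.reset, env.step, agent.propose_plan) and the step cap are hypothetical placeholders for exposition, not the released AsgardBench API.

    # Illustrative turn-by-turn loop (the env/agent interfaces are hypothetical
    # placeholders, not the released AsgardBench API).
    MAX_STEPS = 30  # assumed cap; the benchmark enforces limits on total and repeated steps

    def run_episode(env, agent, task):
        images = env.reset(task)              # agent starts in an interaction-ready position
        history = []                          # (action, succeeded) pairs observed so far
        for _ in range(MAX_STEPS):
            # The agent proposes a full plan, e.g. ["find mug", "pickup mug", "put mug sink"]
            plan = agent.propose_plan(task, images, history)
            if not plan:
                break
            action = plan[0]                  # only the first step is executed
            images, succeeded, task_done = env.step(action)
            history.append((action, succeeded))  # minimal feedback: success or failure only
            if task_done:
                return True
        return False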

Evaluating AsgardBench

We tested several leading vision-capable models on AsgardBench and observed that high-performing models require visual grounding to consistently succeed. Across the models, visual input substantially improved performance: most models more than doubled success rates when given images versus text-only descriptions of the scene. This is in contrast to some prior benchmarks where agents could perform reasonably well without vision by relying on textual feedback on what went wrong.

Providing that kind of detailed failure information raises performance for all models in AsgardBench, too, but it can mask the real problem. The strongest vision-capable models still outperform text-only agents even when those agents are given detailed feedback, demonstrating that the benchmark requires visual grounding that text alone cannot replicate. Model performance on AsgardBench is illustrated in Figure 2.

Figure 2. Success rates for image-based and text-only conditions. Visual input substantially improves performance for all but the weakest agents, while text-only performance remains low, indicating that AsgardBench requires perception-based reasoning.

The results also revealed where today’s agents consistently fall short. Across all models, the same problems kept appearing: agents attempted infeasible actions (e.g., trying to clean a mug that was not in the sink), got stuck in repeated action loops, misinterpreted subtle visual cues (on/off, clean/dirty), and lost track of task progress from one step to the next. This points to three weaknesses: the inability to distinguish subtle visual details in cluttered scenes, the inability to maintain an accurate picture of task progress across multiple steps, and the inability to consistently translate what the agent sees into timely updates to its plan. Taken together, these point to where the next generation of embodied agents will need to improve.

Implications and looking ahead

AsgardBench is useful as both a diagnostic and development tool. By varying what feedback agents receive (none, minimal, or detailed), researchers can isolate whether performance gains come from better perception, better memory, or better planning. Promising directions include systems that combine stronger visual understanding with better state tracking, training approaches that emphasize learning to repair plans mid-task, and evaluation methods that measure not just whether an agent succeeds but how well it adapted along the way.

The failure patterns AsgardBench surfaces point toward a concrete next step: building systems that can make finer visual distinctions, keep track of what changed more reliably across steps, and learn to revise plans mid-task rather than plowing ahead on a script. Agents that make progress on these challenges should be meaningfully better equipped for the messiness of real-world environments: unexpected object states, cluttered scenes, and the constant need to adapt.

AsgardBench is open source and available on GitHub (opens in new tab), providing a foundation for advancing research in visually grounded planning.

Acknowledgements

We thank the AI2-THOR community for building the simulation platform and making reproducible embodied evaluation possible.


GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation

Thu, 03/26/2026 - 18:03
At a glance
  • VLM-based robot planners struggle with long, complex tasks because natural-language plans can be ambiguous, especially when specifying both actions and locations.
  • GroundedPlanBench evaluates whether models can plan actions and determine where they should occur across diverse, real-world robot scenarios.
  • Video-to-Spatially Grounded Planning (V2GP) is a framework that converts robot demonstration videos into spatially grounded training data, enabling models to learn planning and grounding jointly.
  • Grounded planning improves both task success and action accuracy, outperforming decoupled approaches in benchmark and real-world evaluations.

Vision-language models (VLMs) use images and text to plan robot actions, but they still struggle to decide what actions to take and where to take them. Most systems split these decisions into two steps: a VLM generates a plan in natural language, and a separate model translates it into executable actions. This approach often breaks down for long, complex tasks because natural-language plans can be ambiguous or even hallucinated when specifying actions and locations (Figure 1). Because planning and spatial reasoning are handled separately, errors in one stage can propagate to the next. This raises a key question: can a VLM determine both what to do and where to do it simultaneously?

Figure 1. Failures in VLM-based task planners, where ambiguous language leads to non-executable actions.

Planning with spatial grounding

To address this problem, we developed GroundedPlanBench (opens in new tab). In our paper, “Spatially Grounded Long-Horizon Task Planning in the Wild,” we describe how this new benchmark evaluates whether VLMs can plan actions and determine where those actions should occur across diverse real-world environments. We also built Video-to-Spatially Grounded Planning (V2GP), a framework that converts robot demonstration videos into training data to help VLMs learn this capability.

Evaluating these with both open- and closed-source VLMs, we found that grounded planning for long, complex tasks is challenging. At the same time, V2GP improves both planning and grounding, with gains validated on our benchmark and in real-world experiments using robots.

How GroundedPlanBench works

To create realistic robot scenarios, we built our benchmark from 308 robot manipulation scenes in the Distributed Robot Interaction Dataset (DROID) (opens in new tab), a large collection of recordings of robots performing tasks. We worked with experts to review each scene and define tasks that a robot could perform. Each task was written in two styles: explicit instructions that clearly describe the actions (e.g., “put a spoon on the white plate”) and implicit instructions that describe the goal more generally (e.g., “tidy up the table”).

For each task, the plan was broken down into four basic actions—grasp, place, open, and close—each tied to a specific location in the image. Grasp, open, and close actions were linked to a box drawn around the target object, while place actions were linked to a box showing where the object should be placed.
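To make this concrete, the sketch below shows one way a spatially grounded plan could be represented in Python. The class and field names, and the coordinates in the example, are illustrative assumptions rather than the released GroundedPlanBench schema.

    # Illustrative data structure (field names and coordinates are assumptions,
    # not the released GroundedPlanBench schema).
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class GroundedStep:
        action: str                     # one of "grasp", "place", "open", "close"
        target: str                     # natural-language description of the object or region
        box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in image pixels

    # "Put a spoon on the white plate" as a two-step grounded plan (boxes are made up)
    plan: List[GroundedStep] = [
        GroundedStep(action="grasp", target="spoon", box=(312, 208, 356, 290)),
        GroundedStep(action="place", target="white plate", box=(120, 180, 260, 300)),
    ]
    print(plan)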

Figure 2 illustrates medium- and long-duration tasks, along with their explicit and implicit instructions. In total, GroundedPlanBench contains 1,009 tasks, ranging from 1–4 actions (345 tasks) to 5–8 (381) and 9–26 (283).

Figure 2. Examples of tasks in GroundedPlanBench.

How V2GP works

The V2GP framework first detects moments when the robot interacts with objects using the recorded gripper signals. It then generates a text description of the manipulated object with a multimodal language model. Guided by this description, the system tracks the object across the video using Meta’s advanced open-vocabulary image and video segmentation model, SAM3. The system then constructs grounded plans from the tracking results, identifying the object’s location at the moment it is grasped and where it is placed.

This process is illustrated in Figure 3. It yielded 43K grounded plans with varying lengths: 34,646 plans with 1–4 actions, 4,368 with 5–8 actions, and 4,448 with 9–26 actions.

Figure 3. The V2GP framework converts robot videos into spatially grounded plans.

Evaluating decoupled versus grounded planning

To evaluate grounded planning in real-world robotic settings, we used Qwen3-VL (opens in new tab) as our base model. Qwen3-VL is a vision-language model that processes text, images, and video to support multimodal reasoning. It performs well on standard multimodal reasoning benchmarks without additional training. We first evaluated it, along with several proprietary models, on GroundedPlanBench without any task-specific training (Table 1). We then fine-tuned it on V2GP training data and compared it with a decoupled approach, in which planning and grounding are handled separately.

In this setup, a VLM first generated a plan describing what the robot should do. We used GPT-5.2 or Qwen3-VL-4B for this step. The plan was then passed to a spatial grounding model, Embodied-R1 (opens in new tab), which converted the plans into executable signals. Embodied-R1 is a large vision-language model trained for embodied reasoning and pointing, where the model identifies specific locations in the image to guide the robot’s actions. We selected it for spatial grounding because its training targets embodied spatial reasoning and point-based localization, making it well suited for grounding model outputs to specific locations in an image.

Figure 4 highlights a key limitation of this approach: ambiguity in natural language. For example, Qwen3-VL-4B generated grasp actions by referring to “napkin on the table” for all four napkins in the scene, leading Embodied-R1 to ground each action to the same napkin. GPT-5.2 produced more descriptive phrases, such as “top-left napkin” or “upper-center napkin,” but these were still too imprecise for the model to reliably distinguish between them and were again grounded to the same object.

Figure 4. Decoupled vs. grounded planning, illustrating how ambiguous language causes actions to be grounded to the wrong objects.

This limitation becomes more pronounced in real-world robot manipulation, where environments are often cluttered and complex. As a result, decoupled approaches struggle to work reliably. In contrast, our approach, grounded planning, performs planning and grounding jointly within a single model and improves both planning and grounding performance.

Table 1 presents evaluation results for open- and closed-source VLMs on GroundedPlanBench. Multi-step planning and handling of implicit instructions were challenging for all models, while training Qwen3-VL-4B and Qwen3-VL-32B with V2GP led to significant improvements in grounded planning.

Table 1. Evaluation results on GroundedPlanBench. Task Success Rate (TSR) measures the percentage of tasks completed correctly, requiring all actions to be both correctly planned and spatially grounded. Action Recall Rate (ARR) measures the proportion of generated actions that match the sub-actions defined in the dataset, regardless of order. The V2GP approach improves performance on both metrics and achieves the best results (shown in bold).

Implications and looking forward

Integrating planning and grounding within a single model offers a path to more reliable robot manipulation in real-world settings. Rather than relying on separate stages, this approach keeps decisions about what to do and where to act tightly coupled. Even so, models still struggle with longer, multi-step tasks and implicit instructions: they must reason over longer sequences of actions, maintain consistency across many steps, and handle goals described indirectly, as in everyday language.

Looking ahead, a promising direction combines grounded planning with world models, which enable robots to predict the outcomes of actions before executing them. Together, these capabilities could allow robots to decide what to do, where to act, and what will happen next, bringing us closer to systems that can plan and act reliably in the real world.

Acknowledgements

This research was conducted in collaboration with Korea University, Microsoft Research, and the University of Wisconsin-Madison, and was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (No. RS-2025-25439490) funded by the Korea government (MSIT).


Systematic debugging for AI agents: Introducing the AgentRx framework

Thu, 03/12/2026 - 18:38
At a glance
  • Problem: Debugging AI agent failures is hard because trajectories are long, stochastic, and often multi-agent, so the true root cause gets buried.
  • Solution: AgentRx (opens in new tab) pinpoints the first unrecoverable (“critical failure”) step by synthesizing guarded, executable constraints from tool schemas and domain policies, then logging evidence-backed violations step-by-step.
  • Benchmark + taxonomy: We release AgentRx Benchmark (opens in new tab) with 115 manually annotated failed trajectories across τ-bench, Flash, and Magentic-One, plus a grounded nine-category failure taxonomy.
  • Results + release: AgentRx improves failure localization (+23.6%) and root-cause attribution (+22.9%) over prompting baselines, and we are open-sourcing the framework and dataset.

As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency.

When a human makes a mistake, we can usually trace the logic. But when an AI agent fails, perhaps by hallucinating a tool output or deviating from a security policy ten steps into a fifty-step task, identifying exactly where and why things went wrong is an arduous, manual process.

Today, we are excited to announce the open-source release of AgentRx (opens in new tab), an automated, domain-agnostic framework designed to pinpoint the “critical failure step” in agent trajectories. Alongside the framework, we are releasing the AgentRx Benchmark (opens in new tab), a dataset of 115 manually annotated failed trajectories to help the community build more transparent, resilient agentic systems.

The challenge: Why AI agents are hard to debug

Modern AI agents are often:

  • Long-horizon: They perform dozens of actions over extended periods.
  • Probabilistic: The same input might lead to different outputs, making reproduction difficult.
  • Multi-agent: Failures can be “passed” between agents, masking the original root cause.

Traditional success metrics (like “Did the task finish?”) don’t tell us enough. To build safe agents, we need to identify the exact moment a trajectory becomes unrecoverable and capture evidence for what went wrong at that step.

Introducing AgentRx: An automated diagnostic “prescription”

AgentRx (short for “Agent Diagnosis”) treats agent execution like a system trace that needs validation. Instead of relying on a single LLM to “guess” the error, AgentRx uses a structured, multi-stage pipeline:

  1. Trajectory normalization: Heterogeneous logs from different domains are converted into a common intermediate representation.
  2. Constraint synthesis: The framework automatically generates executable constraints based on tool schemas (e.g., “The API must return a valid JSON response”) and domain policies (e.g., “Do not delete data without user confirmation”).
  3. Guarded evaluation: AgentRx evaluates constraints step-by-step, checking each constraint only when its guard condition applies, and produces an auditable validation log of evidence-backed violations (see the sketch after this list).
  4. LLM-based judging: Finally, an LLM judge uses the validation log and a grounded failure taxonomy to identify the Critical Failure Step—the first unrecoverable error.
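The sketch below illustrates what guarded, executable constraints and step-by-step evaluation could look like in Python. The constraint representation, trajectory fields, and example policy are illustrative assumptions, not the actual AgentRx format; see the open-source repository for the real implementation.

    # Illustrative guarded-constraint check (the representation and trajectory fields
    # are assumptions; see the AgentRx repository for the actual format).
    from dataclasses import dataclass
    from typing import Callable, Dict, List

    Step = Dict  # one normalized trajectory step, e.g. {"tool": "delete_record", "confirmed": False}

    @dataclass
    class GuardedConstraint:
        name: str
        guard: Callable[[Step], bool]   # when does this constraint apply?
        check: Callable[[Step], bool]   # does the step satisfy it?

    def evaluate(trajectory: List[Step], constraints: List[GuardedConstraint]) -> List[Dict]:
        violations = []
        for i, step in enumerate(trajectory):
            for c in constraints:
                if c.guard(step) and not c.check(step):
                    violations.append({"step": i, "constraint": c.name, "evidence": step})
        return violations  # the validation log handed to the LLM judge

    # Example policy constraint: "Do not delete data without user confirmation."
    confirm_before_delete = GuardedConstraint(
        name="confirm_before_delete",
        guard=lambda s: s.get("tool") == "delete_record",
        check=lambda s: s.get("confirmed", False),
    )
    trace = [{"tool": "lookup_record"}, {"tool": "delete_record", "confirmed": False}]
    print(evaluate(trace, [confirm_before_delete]))  # flags step 1 as a violation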
The AgentRx workflow: Given a failed trajectory, tool schemas, and domain policy, AgentRx synthesizes guarded constraints, evaluates them step-by-step to produce an auditable violation log with evidence, and uses an LLM judge to predict the critical failure step and root-cause category.

A New Benchmark for Agent Failures

To evaluate AgentRx, we developed a manually annotated benchmark consisting of 115 failed trajectories across three complex domains:

  • τ-bench: Structured API workflows for retail and service tasks.
  • Flash: Real-world incident management and system troubleshooting.
  • Magentic-One: Open-ended web and file tasks using a generalist multi-agent system.

Using a grounded-theory approach, we derived a nine-category failure taxonomy that generalizes across these domains. This taxonomy helps developers distinguish between a “Plan Adherence Failure” (where the agent ignored its own steps) and an “Invention of New Information” (hallucination).

  • Plan Adherence Failure: Ignored required steps / did extra unplanned actions
  • Invention of New Information: Altered facts not grounded in trace/tool output
  • Invalid Invocation: Tool call malformed / missing args / schema-invalid
  • Misinterpretation of Tool Output: Read tool output incorrectly; acted on wrong assumptions
  • Intent–Plan Misalignment: Misread user goal/constraints and planned wrongly
  • Under-specified User Intent: Could not proceed because required info wasn’t available
  • Intent Not Supported: No available tool can do what’s being asked
  • Guardrails Triggered: Execution blocked by safety/access restrictions
  • System Failure: Connectivity/tool endpoint failures

Analysis of failure density across domains. In multi-agent systems like Magentic-One, trajectories often contain multiple errors, but AgentRx focuses on identifying the first critical breach.

Key Results

In our experiments, AgentRx demonstrated significant improvements over existing LLM-based prompting baselines:

  • +23.6% absolute improvement in failure localization accuracy.
  • +22.9% improvement in root-cause attribution.

By providing the “why” behind a failure through an auditable log, AgentRx allows developers to move beyond trial-and-error prompting and toward systematic agentic engineering.

Join the Community: Open Source Release

We believe that agent reliability is a prerequisite for real-world deployment. To support this, we are open sourcing the AgentRx framework and the complete annotated benchmark.

We invite researchers and developers to use AgentRx to diagnose their own agentic workflows and contribute to the growing library of failure constraints. Together, we can build AI agents that are not just powerful, but auditable and reliable.

Acknowledgements

We would like to thank Avaljot Singh and Suman Nath for contributing to this project.


PlugMem: Transforming raw agent interactions into reusable knowledge

Tue, 03/10/2026 - 18:00
At a glance
  • Today’s AI agents store long interaction histories but struggle to reuse them effectively.
  • Raw memory retrieval can overwhelm agents with lengthy, low-value context.
  • PlugMem transforms interaction history into structured, reusable knowledge.
  • A single, general-purpose memory module improves performance across diverse agent benchmarks while using fewer memory tokens.

It seems counterintuitive: giving AI agents more memory can make them less effective. As interaction logs accumulate, they grow large, fill with irrelevant content, and become increasingly difficult to use.

More memory means that agents must search through larger volumes of past interactions to find information relevant to the current task. Without structure, these records mix useful experiences with irrelevant details, making retrieval slower and less reliable. The challenge is not storing more experiences, but organizing them so that agents can quickly identify what matters in the moment.

In our recent paper “PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents,” we introduce a plug-and-play memory system that transforms raw agent interactions into reusable knowledge. Rather than treating memory as text to retrieve, PlugMem organizes that history into structured knowledge designed to support decisions as the agent acts.

Cognitive science offers a useful framework here. It distinguishes between remembering events, knowing facts, and knowing how to perform tasks. Past events provide context, but effective decisions rely on the facts and skills extracted from those events.

This distinction motivated a shift in how we decided to design memory for AI agents. PlugMem implements this shift by converting the agent’s interaction history, such as dialogues, documents, and web sessions, into structured, compact knowledge units that can be reused across tasks.

How PlugMem works

A key difference between PlugMem and conventional AI memory systems is what gets stored. Traditional approaches store text chunks or named entities (references to people, places, and concepts). PlugMem uses facts and reusable skills as the fundamental building blocks of memory. This design reduces redundancy, increases information density, and improves retrieval precision. It’s built around three core components:

Structure. Raw interactions are standardized and transformed into propositional knowledge (facts) and prescriptive knowledge (reusable skills). These knowledge units are organized into a structured memory graph, enabling knowledge to be stored in a form designed for reuse.

Retrieval. Rather than retrieving long passages of text, PlugMem retrieves knowledge units that are aligned with the current task. High-level concepts and inferred intents serve as routing signals, surfacing the most relevant information for the decision at hand.

Reasoning. Retrieved knowledge is distilled into concise, task-ready guidance before being passed to the base agent, ensuring that only decision-relevant knowledge enters the agent’s context window.
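As a rough illustration of these three components, the sketch below models knowledge units (facts and skills) with concept-based routing in Python. The class names, fields, and overlap-based retrieval are assumptions for exposition; PlugMem's actual memory graph, retrieval, and reasoning stages are richer than this.

    # Illustrative knowledge-centric memory store (names and structure are assumptions
    # for exposition; PlugMem's actual memory graph and reasoning stage are richer).
    from dataclasses import dataclass, field
    from typing import List, Set

    @dataclass
    class KnowledgeUnit:
        kind: str           # "fact" (propositional) or "skill" (prescriptive)
        content: str        # compact, reusable statement extracted from raw interactions
        concepts: Set[str]  # high-level concepts used as routing signals

    @dataclass
    class Memory:
        units: List[KnowledgeUnit] = field(default_factory=list)

        def add(self, unit: KnowledgeUnit) -> None:
            self.units.append(unit)

        def retrieve(self, task_concepts: Set[str], k: int = 3) -> List[KnowledgeUnit]:
            # Route by concept overlap with the current task instead of searching raw logs.
            scored = sorted(self.units, key=lambda u: len(u.concepts & task_concepts), reverse=True)
            return [u for u in scored[:k] if u.concepts & task_concepts]

    memory = Memory()
    memory.add(KnowledgeUnit("fact", "Refunds over $100 need manager approval", {"refunds", "policy"}))
    memory.add(KnowledgeUnit("skill", "To change a booking: verify identity, then update the record", {"bookings"}))
    print(memory.retrieve({"refunds"}))  # returns only the refund-policy fact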

Figure 1 illustrates how these components work together.

Figure 1. PlugMem organizes different types of agent interactions into a knowledge-centric memory graph, enabling structured retrieval and reasoning.

One memory, any task

Most AI memory systems are built for one job. A conversational memory module is designed around dialogue. A knowledge-retrieval system is tuned to look up facts. A web agent’s memory is optimized for navigating pages. Each performs well in its target setting but rarely transfers without significant redesign.

PlugMem takes a different approach. It is a foundational memory layer that can be attached to any AI agent without needing to modify it for a specific task.

Evaluating PlugMem

To test PlugMem, we evaluated the same memory module on three benchmarks that each make different demands on memory:

  • Answering questions across long multi-turn conversations
  • Finding facts that span multiple Wikipedia articles
  • Making decisions while browsing the web

Across all three, PlugMem consistently outperformed both generic retrieval methods and task-specific memory designs while consuming significantly fewer memory tokens in the process.

Measuring memory by utility, not size

We wanted to evaluate whether the right information was reaching the agent at the right moment, without overwhelming the model’s context window, which has limited capacity. To do this, we introduced a metric that measures how much useful, decision-relevant information a memory module contributes relative to how much context it consumes.
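As a rough reading of that measure, one could compute utility as the amount of decision-relevant content delivered per unit of context consumed, as in the minimal sketch below (the token counts and the ratio form are assumptions for illustration, not the paper's exact formula).

    # One plausible reading of the measure (the paper's exact definition may differ):
    # decision-relevant tokens delivered, divided by the context tokens the memory consumed.
    def memory_utility(relevant_tokens: int, context_tokens_used: int) -> float:
        if context_tokens_used == 0:
            return 0.0
        return relevant_tokens / context_tokens_used

    # e.g. 120 decision-relevant tokens surfaced in a 400-token memory prompt
    print(memory_utility(120, 400))  # 0.3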

When we plotted utility against context consumption, PlugMem consistently came out ahead: it delivered more decision-relevant information while consuming less of the AI agent’s context than other approaches, as shown in Figure 2. These results suggest that transforming experience into knowledge—rather than storing and retrieving raw logs—produces memory that is more useful and efficient.

Figure 2. Across all three benchmarks, PlugMem delivered more useful memory with less of the agent’s context window.

Why general-purpose memory can outperform task-specific designs

General-purpose memory modules can outperform systems tailored to specific tasks because the decisive factor is not specialization but whether memory can surface the right knowledge precisely when the agent needs it. Structure, retrieval, and reasoning each play a distinct role, and getting all three right matters more than optimizing for a single use case.

PlugMem is not meant to replace task-specific approaches. It provides a general memory foundation upon which task adaptations can be layered. Our experiments show that combining PlugMem with task-specific techniques yields further gains.

Toward reusable memory for agents

As AI agents take on longer and more complex tasks, their memory needs to evolve from storing past interactions to actively supplying reusable knowledge. The goal is for agents to carry useful facts and strategies from one task to the next rather than starting from scratch each time.

PlugMem represents a step in that direction, grounding memory design in cognitive principles and treating knowledge as the primary unit of reuse. As agent capabilities expand, knowledge-centric memory may prove to be a critical building block for the next generation of intelligent agents.

Code and experimental results are publicly available on GitHub (opens in new tab) so that others can reproduce the results and conduct their own research.


The post PlugMem: Transforming raw agent interactions into reusable knowledge appeared first on Microsoft Research.

Categories: Microsoft

Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model

Wed, 03/04/2026 - 20:05
At a glance
  • Phi-4-reasoning-vision-15B is a compact open‑weight multimodal reasoning model that balances reasoning power, efficiency, and training data needs. It supports natural interaction across a wide array of vision-language tasks and excels at math and science reasoning and at understanding user interfaces.
  • We share lessons learned and best practices for training a multimodal reasoning model, showing the benefits of careful architecture choices, rigorous data curation, and a mixture of reasoning and non-reasoning data.

We are pleased to announce Phi-4-reasoning-vision-15B, a 15 billion parameter open‑weight multimodal reasoning model, available through Microsoft Foundry (opens in new tab), HuggingFace (opens in new tab) and GitHub (opens in new tab). Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks such as captioning images, answering questions about them, reading documents and receipts, helping with homework, reasoning about changes across sequences of images, and much more. Beyond these general capabilities, it excels at math and science reasoning and at understanding and grounding elements on computer and mobile screens. In particular, our model offers appealing value relative to popular open-weight models, pushing the Pareto frontier of the tradeoff between accuracy and compute cost: it is competitive with much slower models that require ten times or more compute time and tokens, and more accurate than similarly fast models, particularly for math and science reasoning.

Figure 1: Phi-4-reasoning-vision-15B presents a compelling option compared to existing models, pushing the Pareto frontier of the tradeoff between accuracy and compute costs. It is competitive with much slower models that require more time and tokens, and more accurate than similarly fast models. These values were computed by averaging accuracy, time, and output token counts for a subset of four benchmarks where we had logged these values: ChartQA_TEST, MathVista_MINI, MMMU_VAL, and ScreenSpot_v2.

In this post, we share the motivations, design choices, experiments, and learnings that informed its development, as well as an evaluation of the model’s performance and guidance on how to use it. Our goal is to contribute practical insight to the community on building smaller, efficient multimodal reasoning models and to share an open-weight model that is competitive with models of similar size at general vision-language tasks, excels at computer use, and excels on scientific and mathematical multimodal reasoning.

A focus on smaller and faster vision–language models

Many popular vision-language models (VLMs) have trended toward larger parameter counts and, in particular, larger numbers of tokens consumed and generated. This increases training and inference-time cost and latency and impedes usability for downstream deployment, especially in resource‑constrained or interactive settings.

A growing countertrend towards smaller (opens in new tab) models aims to boost efficiency, enabled by careful model design and data curation – a goal pioneered by the Phi family of models (opens in new tab) and furthered by Phi-4-reasoning-vision-15B. We specifically build on learnings from the Phi-4 and Phi-4-Reasoning language models and show how a multimodal model can be trained to cover a wide range of vision and language tasks without relying on extremely large training datasets, architectures, or excessive inference‑time token generation. Our model is intended to be lightweight enough to run on modest hardware while remaining capable of structured reasoning when it is beneficial. It was trained with far less compute than many recent open-weight VLMs of similar size: we used just 200 billion tokens of multimodal data, building on Phi-4-reasoning (trained with 16 billion tokens), which is itself based on the core Phi-4 model (400 billion unique tokens), compared to more than 1 trillion tokens used for training multimodal models like Qwen 2.5 VL (opens in new tab) and 3 VL (opens in new tab), Kimi-VL (opens in new tab), and Gemma3 (opens in new tab). We can therefore present a compelling option compared to existing models, pushing the Pareto frontier of the tradeoff between accuracy and compute cost.

Figure 2: Phi-4-Reasoning-Vision can help with a wide range of everyday tasks.

Lessons from training a multimodal model

Training a multimodal reasoning model raises numerous questions and requires many nuanced design choices around model architecture, dataset quality and composition, and the interaction between reasoning‑heavy and non-reasoning perception‑focused tasks.

Model architecture: Early- vs mid-fusion

Model architectures for VLMs differ primarily in how visual and textual information is fused. Mid-fusion models use a pretrained vision encoder to convert images into visual tokens that are projected into a pretrained LLM’s embedding space, enabling cross-modal reasoning while leveraging components already trained on trillions of tokens. Early-fusion models process image patches and text tokens in a single transformer, yielding richer joint representations but at significantly higher compute, memory, and data cost. We adopted a mid-fusion architecture because it offers a practical trade-off for building a performant model with modest resources.

Model architecture: Vision encoder and image processing

We build on the SigLIP-2 (opens in new tab) vision encoder and the Phi-4-Reasoning backbone. In previous research, we found that multimodal language models sometimes struggled to solve tasks not because they lacked reasoning proficiency, but because they could not extract and select the relevant perceptual information from the image. An example is a high-resolution screenshot that is information-dense with relatively small interactive elements.

Several open-source multimodal language models have adapted their methodologies accordingly, e.g., Gemma3 (opens in new tab) uses pan-and-scan and NVILA (opens in new tab) uses Dynamic S2. However, their trade-offs are difficult to understand across different datasets and hyperparameters. To this end, we conducted an ablation study of several techniques. We trained a smaller 5 billion parameter Phi-4 based proxy model on a dataset of 10 million image-text pairs, primarily composed of computer-use and GUI grounding data. We compared:

  • Dynamic S2, which resizes images to a rectangular resolution that minimizes distortion while admitting a tiling by 384×384 squares
  • Multi-crop, which splits the image into potentially overlapping 384×384 squares and concatenates their encoded features on the token dimension
  • Multi-crop with S2, which broadens the receptive field by cropping into 1536×1536 squares before applying S2
  • Dynamic resolution, using the Naflex variant of SigLIP-2, a natively dynamic-resolution encoder with adjustable patch counts

Our primary finding is that dynamic resolution vision encoders perform best, especially on high-resolution data. It is particularly interesting to compare dynamic resolution with 2048 vs 3600 maximum tokens: the latter roughly corresponds to native HD 720p resolution and enjoys a substantial boost on high-resolution benchmarks, particularly ScreenSpot-Pro. Reinforcing the high-resolution trend, we find that multi-crop with S2 outperforms standard multi-crop despite using fewer visual tokens (i.e., fewer crops overall). The dynamic resolution technique produces the most tokens on average; due to their tiling subroutine, S2-based methods are constrained by the original image resolution and often use only about half the maximum tokens. From these experiments, we chose the SigLIP-2 Naflex variant as our vision encoder.
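To illustrate how a dynamic-resolution token budget constrains image size, here is a small sketch that rescales an image so its patch grid fits a token cap. The patch size and rounding policy are illustrative assumptions, not the exact SigLIP-2 Naflex preprocessing; with 16-pixel patches, a 3600-token cap corresponds exactly to 1280×720.

```python
import math

def dynamic_resolution_size(width, height, patch=16, max_tokens=3600):
    """Pick a resized (width, height) whose patch grid fits a visual-token
    budget while roughly preserving aspect ratio. The patch size and rounding
    policy here are assumptions for illustration."""
    tokens = math.ceil(width / patch) * math.ceil(height / patch)
    if tokens > max_tokens:
        scale = math.sqrt(max_tokens / tokens)
        width, height = int(width * scale), int(height * scale)
    # Snap to patch multiples so the image tiles cleanly into patches.
    width = max(patch, (width // patch) * patch)
    height = max(patch, (height // patch) * patch)
    return width, height, (width // patch) * (height // patch)

# Example: a 2560x1440 screenshot under a 3600-token budget -> (1280, 720, 3600).
print(dynamic_resolution_size(2560, 1440, max_tokens=3600))
```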

| Method | Max Tokens | MathVista | ScreenSpot | ScreenSpot-Pro | V*Bench |
| --- | --- | --- | --- | --- | --- |
| Dynamic-S2 | 3096 | 42.9 | 78.4 | 9.4 | 52.9 |
| Multi-crop | 3096 | 43.4 | 67.8 | 5.4 | 51.8 |
| Multi-crop with S2 | 2048 | 43.4 | 79.1 | **10.6** | **57.1** |
| Dynamic resolution | 2048 | **45.2** | **81.5** | 9.2 | 51.3 |
| Dynamic resolution | 3600 | **44.9** | **79.7** | **17.5** | **56.0** |

Table 1: Results with different resolution handling approaches. The top two configurations on each benchmark are in bold.

Data: Quality and composition

As with its language backbone Phi-4-Reasoning, Phi-4-reasoning-vision-15B was trained with a deliberate focus on data quality. Our final dataset consists primarily of data from three sources: open-source datasets which were meticulously filtered and improved; high-quality domain-specific internal data; and high-quality data from targeted acquisitions. The overwhelming majority of our data lies in the first category: data which originated as open-source data, which were significantly filtered and improved, whether by removing low-quality datasets or records, programmatically fixing errors in data formatting, or using open-source images as seeds to synthetically generate higher-quality accompanying text.

The process of improving open-source data began by manually reviewing samples from each dataset. Typically, 5 to 10 minutes were sufficient to classify data as excellent-quality, good questions with wrong answers, low-quality questions or images, or high-quality with formatting errors. Excellent data was kept largely unchanged. For data with incorrect answers or poor-quality captions, we re-generated responses using GPT-4o and o4-mini, excluding datasets where error rates remained too high. Low-quality questions proved difficult to salvage, but when the images themselves were high quality, we repurposed them as seeds for new caption or visual question answering (VQA) data. Datasets with fundamentally flawed images were excluded entirely. We also fixed a surprisingly large number of formatting and logical errors across widely used open-source datasets.

We extracted additional value from existing datasets through reformatting, diversification, and using images as seeds for new data generation. We generated detailed image descriptions alongside original QA pairs for math and science data, had data perform “double-duty” by embedding instruction-following requirements directly into domain-specific QA, created “scrambled,” “caption-matching,” and “what’s changed?” records to improve multi-image reasoning and sequential navigation for CUA scenarios, and diversified prompt styles to encourage robustness beyond perfectly structured questions.

To supplement the improved open-source data, we utilize high-quality internal datasets, several math-specific datasets which were acquired during training of the Phi-4 language model, and also some domain-specific curated data; for example, latex-OCR data generated by processing and rendering equations from arXiv documents.

Figure 3: Phi-4-reasoning-vision-15B training data composition and examples. One example returns bounding box coordinates for a UI grounding task; another uses step-by-step reasoning to answer a chart question about expatriate populations, concluding with “Dubai.”

Data: Mathematics vs. computer-use data proportion

One of our goals was to train a model that performs well across general vision-language tasks, while excelling at mathematical and scientific reasoning and computer-use scenarios. How to structure datasets for generalizable reasoning remains an open question—particularly because the relationship between data scale and reasoning performance can lead to starkly different design decisions, such as training a single model on a large dataset versus multiple specialized models with targeted post-training.

Research on long-tailed classification robustness has suggested that balancing or removing data from overrepresented tasks or subgroups (opens in new tab) is an effective method for ensuring good performance. Nevertheless, these insights are not fully utilized or explored when it comes to training VLMs, which at times have favored scale over careful data balancing. To achieve our goals, we conducted a set of experiments to analyze a range of data ratios between our focus domains.

Using the same 5 billion parameter proxy model as for previous experiments, we trained while varying the amount of mathematics and science vs. computer-use data for each run. Each dataset included the same subset of 1 million general image-text pairs as a baseline. For mathematics and science data, we used a subsample of 150,000 records, optionally duplicating each one up to three times. Next, we included up to 450,000 computer-use records, and optionally an additional 400,000 from Phi-Ground.

We found that multimodal mathematics and science performance was not harmed by additional computer-use data, and vice versa. Interestingly, we found that increasing mathematics data by 3x while keeping computer-use data constant improved math, science, and computer-use benchmarks.

| General | Math and Science | CUA | Total | MMMU | MathVista | ScreenSpot-V2 |
| --- | --- | --- | --- | --- | --- | --- |
| 1M | 150K | 450K | 1.6M | 44.0 | 37.4 | 48.2 |
| 1M | 150K | 850K | 2.0M | 44.1 | 37.3 | 60.0 |
| 1M | 450K | 450K | 1.9M | 45.3 | 36.0 | 48.3 |
| 1M | 450K | 850K | 2.3M | 43.4 | 38.9 | 63.1 |
| 1M | 150K | 150K | 1.3M | 44.2 | 36.9 | 29.8 |
| 1M | 150K | 250K | 1.4M | 45.4 | 37.4 | 37.7 |

Table 2: Varying the ratios of math and CUA data. Increasing math data by 3x while keeping computer-use data constant improves both math and computer-use benchmarks.

Data: Synthetic data for text-rich visual reasoning

Recent work (opens in new tab) suggests that targeted synthetic data can materially improve multimodal reasoning, particularly for text-rich visual domains such as charts, documents, diagrams, and rendered mathematics. Using images, questions, and answers that are programmatically generated and grounded in the visual structure enables precise control over visual content and supervision quality, resulting in data that avoids many annotation errors, ambiguities, and distributional biases common in scraped datasets. This enables cleaner alignment between visual perception and multi-step inference, which has been shown to translate into measurable gains on reasoning-heavy benchmarks.

Synthetic text-rich images expand coverage of long-tail visual formats that are underrepresented in real data but disproportionately impact reasoning accuracy, improving not only visual grounding but also downstream reasoning by ensuring that failures are less often caused by perceptual errors. We found that programmatically generated synthetic data is a useful augmentation to high-quality real datasets — not a replacement, but a scalable mechanism for strengthening both perception and reasoning that complements the training objectives in compact multimodal models such as Phi-4-reasoning-vision-15B.
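As a hedged illustration of what programmatic, grounded generation can look like, the sketch below renders a simple synthetic bar chart and emits a question whose answer is known exactly from the values used to draw it. It is a toy example, not the data pipeline used for Phi-4-reasoning-vision-15B.

```python
import json
import random
import matplotlib.pyplot as plt

def make_chart_qa(path="synthetic_chart.png"):
    """Render a synthetic bar chart and return a QA pair grounded in the
    exact values used to draw it (illustrative only)."""
    categories = ["Q1", "Q2", "Q3", "Q4"]
    values = [random.randint(10, 100) for _ in categories]
    plt.figure(figsize=(4, 3))
    plt.bar(categories, values)
    plt.title("Quarterly revenue (synthetic)")
    plt.savefig(path, dpi=150)
    plt.close()
    best = categories[values.index(max(values))]
    return {"image": path,
            "question": "Which quarter has the highest revenue?",
            "answer": best,
            "values": dict(zip(categories, values))}

print(json.dumps(make_chart_qa(), indent=2))
```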

Mixing non-reasoning and reasoning as a design objective

In language-only settings, reasoning traces have improved performance on many tasks, but they require additional compute, which adds undesired latency. In multimodal settings, this tradeoff is less clear-cut: for tasks such as image captioning and optical character recognition (OCR), reasoning is often unnecessary and can even be harmful (opens in new tab), while mathematical and scientific problem-solving benefits from multi-step reasoning. Thus, the choice of when to reason can be quite nuanced.

Training approaches for multimodal reasoning models

Language-only reasoning models are typically created through supervised fine-tuning (SFT) or reinforcement learning (RL): SFT is simpler but requires large amounts of expensive reasoning trace data, while RL reduces data requirements at the cost of significantly increased training complexity and compute. Multimodal reasoning models follow a similar process, but the design space is more complex. With a mid-fusion architecture, the first decision is whether the base language model is itself a reasoning or non-reasoning model. This leads to several possible training pipelines:

  • Non-reasoning LLM → reasoning multimodal training: Reasoning and multimodal capabilities are trained together.
  • Non-reasoning LLM → non-reasoning multimodal → reasoning multimodal training: Multimodal capabilities are learned first, then reasoning is added.
  • Reasoning LLM → reasoning multimodal training: A reasoning base is used, but all multimodal data must include reasoning traces.
  • Our approach: Reasoning LLM → mixed non-reasoning / reasoning multimodal training. A reasoning-capable base is trained on a hybrid data mixture, learning when to reason and when to respond directly.

Approaches 1 and 2 offer flexibility in designing multimodal reasoning behavior from scratch using widely available non-reasoning LLM checkpoints but place a heavy burden on multimodal training. Approach 1 must teach visual understanding and reasoning simultaneously and requires a large amount of multimodal reasoning data, while Approach 2 can be trained with less reasoning data but risks catastrophic forgetting, as reasoning training may degrade previously learned visual capabilities. Both risk weaker reasoning than starting from a reasoning-capable base. Approach 3 inherits strong reasoning foundations, but like Approach 1, it requires reasoning traces for all training data and produces reasoning traces for all queries, even when not beneficial.

Our approach: A mixed reasoning and non-reasoning model

Phi-4-reasoning-vision-15B adopts the 4th approach listed previously, as it balances reasoning capability, inference efficiency, and data requirements. It inherits a strong reasoning foundation but uses a hybrid approach to combine the strengths of alternatives while mitigating their drawbacks. Our model defaults to direct inference for perception-focused domains where reasoning adds latency without improving accuracy, avoiding unnecessary verbosity and reducing inference costs, and it invokes longer reasoning paths for domains, such as math and science, that benefit from structured multi-step reasoning (opens in new tab).

Our model is trained with SFT, where reasoning samples wrap chain-of-thought reasoning in dedicated thinking sections before the final answer and cover domains like math and science. Non-reasoning samples begin with a token that signals a direct response and cover perception-focused tasks such as captioning, grounding, OCR, and simple VQA. Reasoning data comprises approximately 20% of the total mix. Starting from a reasoning-capable backbone means this data grounds existing reasoning in visual contexts rather than teaching the model to reason from scratch.
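The sketch below shows one way such a mixed SFT target could be assembled. The tag strings are hypothetical placeholders; the model’s actual control tokens are defined by its chat template and model card.

```python
def format_sft_target(answer: str, reasoning: str | None,
                      think_open: str = "<think>",      # hypothetical placeholder tag
                      think_close: str = "</think>",    # hypothetical placeholder tag
                      no_think: str = "<no_think>") -> str:  # hypothetical placeholder tag
    """Assemble a mixed reasoning / non-reasoning SFT target (illustrative only)."""
    if reasoning is not None:
        # Reasoning sample: chain-of-thought precedes the final answer.
        return f"{think_open}\n{reasoning}\n{think_close}\n{answer}"
    # Non-reasoning sample: a control token signals a direct response.
    return f"{no_think}{answer}"

# Roughly 20% of samples carry reasoning; the rest are direct responses.
print(format_sft_target("42", reasoning="Add the two bars: 18 + 24 = 42."))
print(format_sft_target("A red bicycle leaning against a wall.", reasoning=None))
```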

This approach is not without limitations. The balance between modes is a direct function of design choices we made, informed by recent literature (opens in new tab) and model behavior observed during training, though the boundary between modes can be imprecise because it is learned implicitly from the data distribution. Our model allows control through explicit prompting with the thinking and no-thinking control tokens when the user wants to override the default reasoning behavior. The 20/80 reasoning-to-non-reasoning data split may not be optimal for all domains or deployment contexts. Evaluating the ideal balance of data and the model’s ability to switch appropriately between modes remains an open problem.

We view this mixed approach not as a definitive solution, but as one practical and well-motivated point in the design space for balancing latency, accuracy, and flexibility in multimodal systems.

Applications

Figure 4: Phi-4-Reasoning-Vision can interpret sequences of images

Phi-4-reasoning-vision-15B is a high-performing model across many vision-language tasks. It makes sense of the world by looking at a photo, document, chart, or screen. In practice that covers an enormous range of applications; just a few examples include describing images and answering questions about them, interpreting changes and trends in image sequences, recognizing objects and landmarks, and transcribing text.

Highlights: Scientific and mathematical reasoning and supporting computer-using agents (CUA)

In addition to general vision and language tasks, Phi-4-reasoning-vision-15B was designed to excel at tasks that combine visual input with structured inference: solving math problems presented in visual form, including handwritten or diagram-based questions; extracting and reasoning over quantitative information in documents and charts; and supporting multi-step reasoning in educational or scientific analysis contexts.

Figure 5: Phi-4-reasoning-vision-15B is great at math and science

Figure 6: Phi-4-reasoning-vision-15B can help with written math problems

In addition, we trained Phi-4-reasoning-vision-15B with skills that enable agents to interact with graphical user interfaces by interpreting screen content and selecting actions. With strong high-resolution perception and fine-grained grounding capabilities, Phi-4-reasoning-vision-15B is a compelling base model for training agentic models, such as ones that navigate desktop, web, and mobile interfaces by identifying and localizing interactive elements such as buttons, menus, and text fields. Thanks to its low inference-time compute requirements, it is well suited to interactive environments where low latency and compact model size are essential.

Figure 7: Phi-4-reasoning-vision-15B can help navigate computer UIs

Evaluation

Phi-4-reasoning-vision-15B was evaluated for accuracy and timing using two complementary open-source frameworks to ensure both rigorous and standardized analysis: Eureka ML Insights (opens in new tab) and VLMEvalKit (opens in new tab).

| Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B – force nothink | Phi-4-mm-instruct | Kimi-VL-A3B-Instruct | gemma-3-12b-it | Qwen3-VL-8B-Instruct-4K | Qwen3-VL-8B-Instruct-32K | Qwen3-VL-32B-Instruct-4K | Qwen3-VL-32B-Instruct-32K |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AI2D_TEST | 84.8 | 84.7 | 68.6 | 84.6 | 80.4 | 82.7 | 83 | 84.8 | 85 |
| ChartQA_TEST | 83.3 | 76.5 | 23.5 | 87 | 39 | 83.1 | 83.2 | 84.3 | 84 |
| HallusionBench | 64.4 | 63.1 | 56 | 65.2 | 65.3 | 73.5 | 74.1 | 74.4 | 74.9 |
| MathVerse_MINI | 44.9 | 43.8 | 32.4 | 41.7 | 29.8 | 54.5 | 57.4 | 64.2 | 64.2 |
| MathVision_MINI | 36.2 | 34.2 | 20 | 28.3 | 31.9 | 45.7 | 50 | 54.3 | 60.5 |
| MathVista_MINI | 75.2 | 68.7 | 50.5 | 67.1 | 57.4 | 77.1 | 76.4 | 82.5 | 81.8 |
| MMMU_VAL | 54.3 | 52 | 42.3 | 52 | 50 | 60.7 | 64.6 | 68.6 | 70.6 |
| MMStar | 64.5 | 63.3 | 45.9 | 60 | 59.4 | 68.9 | 69.9 | 73.7 | 74.3 |
| OCRBench | 76 | 75.6 | 62.6 | 86.5 | 75.3 | 89.2 | 90 | 88.5 | 88.5 |
| ScreenSpot_v2 | 88.2 | 88.3 | 28.5 | 89.8 | 3.5 | 91.5 | 91.5 | 93.7 | 93.9 |

Table 3: Accuracy comparisons relative to popular open-weight, non-thinking models

| Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B – force thinking | Kimi-VL-A3B-Thinking | gemma-3-12b-it | Qwen3-VL-8B-Thinking-4K | Qwen3-VL-8B-Thinking-40K | Qwen3-VL-32B-Thinking-4K | Qwen3-VL-32B-Thinking-40K |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AI2D_TEST | 84.8 | 79.7 | 81.2 | 80.4 | 83.5 | 83.9 | 86.9 | 87.2 |
| ChartQA_TEST | 83.3 | 82.9 | 73.3 | 39 | 78 | 78.6 | 78.5 | 79.1 |
| HallusionBench | 64.4 | 63.9 | 70.6 | 65.3 | 71.6 | 73 | 76.4 | 76.6 |
| MathVerse_MINI | 44.9 | 53.1 | 61 | 29.8 | 67.3 | 73.3 | 78.3 | 78.2 |
| MathVision_MINI | 36.2 | 36.2 | 50.3 | 31.9 | 43.1 | 50.7 | 60.9 | 58.6 |
| MathVista_MINI | 75.2 | 74.1 | 78.6 | 57.4 | 77.7 | 79.5 | 83.9 | 83.8 |
| MMMU_VAL | 54.3 | 55 | 60.2 | 50 | 59.3 | 65.3 | 72 | 72.2 |
| MMStar | 64.5 | 63.9 | 69.6 | 59.4 | 69.3 | 72.3 | 75.5 | 75.7 |
| OCRBench | 76 | 73.7 | 79.9 | 75.3 | 81.2 | 82 | 83.7 | 85 |
| ScreenSpot_v2 | 88.2 | 88.1 | 81.8 | 3.5 | 93.3 | 92.7 | 83.1 | 83.1 |

Table 4: Accuracy comparisons relative to popular open-weight, thinking models

Our model balances thinking and non-thinking performance, on average showing better accuracy in the default “mixed-reasoning” behavior than when forcing thinking or non-thinking. Only in a few cases does forcing a specific mode improve performance (MathVerse_MINI and MMMU_VAL for thinking, and ScreenSpot_v2 for non-thinking). Compared to recent popular open-weight models, our model provides a desirable trade-off between accuracy and cost (as a function of inference-time compute and output tokens), as discussed previously.

Note: All numbers here are the result of running benchmarks ourselves and may be lower than other previously shared numbers. Instead of quoting leaderboards, we performed our own benchmarking, so we could understand scaling performance as a function of output token counts for related models. We made our best effort to run fair evaluations and used recommended evaluation platforms with model-specific recommended settings and prompts provided for all third-party models. For Qwen models we use the recommended token counts and also ran evaluations matching our max output token count of 4096. For Phi-4-reasoning-vision-15B, we used our system prompt and chat template but did not do any custom user-prompting or parameter tuning, and we ran all evaluations with temperature=0.0, greedy decoding, and 4096 max output tokens. These numbers are provided for comparison and analysis rather than as leaderboard claims. For maximum transparency and fairness, we will release all our evaluation logs publicly. For more details on our evaluation methodology, please see our technical report (opens in new tab).

Safety

As with other Phi models, Phi-4-reasoning-vision-15B was developed with safety as a core consideration throughout training and evaluation. The model was trained on a mixture of public safety datasets and internally generated examples designed to elicit behaviors the model should appropriately refuse, in alignment with Microsoft’s Responsible AI Principles. For further details, check out our technical report (opens in new tab).

Open release and community engagement

Phi-4-reasoning-vision-15B is available on Microsoft Foundry (opens in new tab) and HuggingFace (opens in new tab) with additional examples and details on GitHub (opens in new tab). For additional guidance on how to use our model properly and safely, please refer to our Model card (opens in new tab). For further details on the technical aspects of the model, training, and evaluation, see our technical report (opens in new tab).
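For readers who want to try the model from Python, the sketch below follows the general transformers usage pattern of earlier Phi multimodal releases. The repository id, image placeholder token, and generation settings here are assumptions; consult the model card for the exact identifiers, chat template, and recommended system prompt.

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_id = "microsoft/Phi-4-reasoning-vision"  # placeholder id; use the id from the model card
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

image = Image.open("chart.png")
# "<|image_1|>" follows the convention of earlier Phi vision models; it may differ here.
messages = [{"role": "user", "content": "<|image_1|>\nWhat trend does this chart show?"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
))
```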

In line with our goal of supporting future AI development in the community, Phi-4-reasoning-vision-15B is released under a permissive license with model weights, fine‑tuning code, and benchmark logs. We intend this release to complement existing work by providing concrete artifacts that help close gaps in understanding how compact multimodal reasoning models can be built and studied.

Looking forward

Smaller vision–language models with selective, task‑aware reasoning offer one promising direction for making multimodal systems more practical and accessible. We present our model and the lessons learned from building it to inform ongoing research in multimodal modeling, computer‑using agents, and mathematical and scientific reasoning. We hope these details are useful to researchers exploring similar tradeoffs and invite critical evaluation, replication, and extension by the community. If you’d like to join us and help shape the future of multimodal models, please apply for one of our open roles.

Acknowledgements

We thank Rachel Ward for her extensive work on data collection and curation. We thank the GenDatasets, PhiGround, SimCity, and Fara-7B efforts for invaluable training data. We thank Harkirat Behl, Mojan Javaheripi, and Suriya Gunasekar for providing us with Phi-4 checkpoints and guidance on training with Phi models. We additionally thank Sahaj Agarwal, Ahmed Awadallah, Qi Dai, Gustavo de Rosa, Rafah Hosn, Ece Kamar, Piero Kauffmann, Yash Lara, Chong Luo, Caio César Teodoro Mendes, Akshay Nambi, Craig Presti, Matthew Rosoff, Corby Rosset, Marco Rossi, Kashyap Patel, Adil Salim, Sidhartha Sen, Shital Shah, Pratyusha Sharma, Alexey Taymanov, Vibhav Vineet, John Weiss, Spencer Whitehead, the AI Frontiers Team and Leadership, and Microsoft Research Leadership, for their valuable help, insightful discussions, and continued support throughout this work.


The post Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model appeared first on Microsoft Research.

Categories: Microsoft

CORPGEN advances AI agents for real work

Thu, 02/26/2026 - 19:06
At a glance
  • Today’s AI agent benchmarks test one task at a time, while real workplace productivity requires managing dozens of interdependent tasks at once. To reflect this, we created a setting called Multi-Horizon Task Environments (MHTEs).
  • Under multi-task loads, leading computer-using agents degrade sharply, with completion rates dropping from 16.7% to 8.7%.
  • CORPGEN introduces digital employees, with hierarchical planning, memory isolation, and experiential learning, delivering up to 3.5 times higher completion rates than baselines across three independent agent backends.
  • Because CORPGEN is architecture-agnostic and modular, its gains come from system design rather than any single base model, and it benefits directly as underlying models improve.

By mid-morning, a typical knowledge worker is already juggling a client report, a budget spreadsheet, a slide deck, and an email backlog, all interdependent and all demanding attention at once. For AI agents to be genuinely useful in that environment, they will need to operate the same way, but today’s best models are evaluated one task at a time, not dozens at once.

In our paper, “CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments,” we propose an agent framework that equips AI with the memory, planning, and learning capabilities to close that gap.

Introducing Multi-Horizon Task Environments

Replicating the reality of workplace multitasking requires a new kind of evaluation environment. In response, we developed Multi-Horizon Task Environments (MHTEs), settings where an agent must manage multiple complex tasks simultaneously. Each task requires 10 to 30 dependent steps within a single session spanning five hours.

To determine what a benchmark would need to test, we ran MHTEs at scale on some of today’s leading AI agents, exposing four weaknesses. First, memory fills up. An agent cannot hold details for multiple active tasks at once. Second, information from one task interferes with reasoning about another. Third, tasks don’t depend on each other in simple sequences. They form complex webs where an agent must constantly check whether upstream work is finished before it can move forward on anything downstream. Fourth, every action cycle requires reprioritizing across all active tasks, not simply resuming where the agent left off.

We also tested three independent agent systems under increasing loads. As the number of concurrent tasks rose from 12 to 46, completion rates fell from 16.7% to 8.7% across all systems.

CORPGEN’s architecture

CORPGEN introduces digital employees: LLM-powered AI agents with persistent identities, role-specific expertise, and realistic work schedules. They operate Microsoft Office applications through GUI automation and perform consistently within MHTEs over hours of continuous activity. Figure 1 illustrates how a digital employee moves through a full workday.

Figure 1. Each day begins with a structured plan and memory loaded from previous sessions. The agent then works through overlapping tasks in repeated cycles, storing key outcomes at day’s end to inform the next session.

CORPGEN addresses each of the four weaknesses of concurrent task execution—memory overload, cross-task interference, dependency complexity, and reprioritization—in a targeted way. Hierarchical planning breaks objectives into daily goals and then into moment-to-moment decisions, allowing the agent to act from a structured plan instead of reviewing all available tasks before each step.

Subagents perform complex operations like web research in isolated contexts, preventing cross-task contamination. A tiered memory system enables selective recall of task-related information rather than retaining everything in active context. Adaptive summarization compresses routine observations while preserving critical information, keeping memory growth controlled.
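A rough sketch of how a tiered memory with adaptive summarization might look, loosely modeled on the mechanisms described above; the class names, tier sizes, and string-joining summarizer are assumptions, not CORPGEN’s implementation.

```python
from collections import deque

class TieredMemorySketch:
    """Illustrative tiered memory: a small working tier of raw observations
    plus per-task episodic summaries, with adaptive summarization to keep
    memory growth controlled. `summarize` stands in for an LLM call."""

    def __init__(self, working_limit=20, summarize=lambda items: " | ".join(items)):
        self.working = deque(maxlen=working_limit)   # recent, raw observations
        self.episodic = {}                           # per-task summaries
        self.summarize = summarize

    def observe(self, task_id: str, observation: str):
        self.working.append((task_id, observation))
        # Adaptive summarization: when the working tier is full, compress
        # this task's routine observations into its episodic summary.
        if len(self.working) == self.working.maxlen:
            task_obs = [o for t, o in self.working if t == task_id]
            self.episodic[task_id] = self.summarize(task_obs)

    def recall(self, task_id: str):
        # Selective recall: only this task's summary and recent raw
        # observations enter the active context, not everything stored.
        recent = [o for t, o in self.working if t == task_id]
        return {"summary": self.episodic.get(task_id, ""), "recent": recent}
```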

Because these mechanisms are not tied to a specific base model, we tested CORPGEN across three different agents. In each case, we observed consistent gains. The improvements came from the architecture, not from the strength of any particular model. Figure 2 shows how they fit together within CORPGEN’s architecture.

Figure 2. Four mechanisms support concurrent task execution in CORPGEN: hierarchical planning, isolated subagents, tiered memory, and adaptive summarization.

How digital employees collaborate

When multiple digital employees operate in the same environment, collaboration takes shape through standard communication channels, without predefined coordination rules. One employee sends an email requesting data; another picks it up in the next cycle, uses its memory to process it, and responds. This exchange mirrors real workplace communication.

There is no shared internal state between agents. Coordination occurs entirely through email and Microsoft Teams, the same channels many workers use. Over time, these independent exchanges form recognizable organizational patterns. Some agents take on leadership roles; others provide support; shared documents become the connective tissue.

When a communication path breaks, such as an email delivery error, agents reroute messages through alternate channels to keep work moving. The result is a virtual organization that behaves like a real one without being explicitly programmed to do so.

Evaluating CORPGEN

We evaluated CORPGEN on a multi-task benchmark that combined up to 46 tasks into a single six-hour session. Three findings stood out.

Baselines degrade as load increases; CORPGEN does not. All three baseline agent systems showed steady performance declines as task load rose. CORPGEN, by contrast, maintained or improved its completion rates at higher loads. At 46 tasks, CORPGEN completed 15.2% of tasks, compared with 4.3% for the baselines, roughly 3.5 times more.

Experiential learning drives the largest gains. We introduced CORPGEN’s components sequentially: first the orchestration layer, then cognitive tools, and finally experiential learning. The first two produced moderate improvements. Experiential learning, in which agents store records of completed tasks and reuse them when they encounter structurally similar work, produced the largest increase, raising completion rates from 8.7% to 15.2%.

Evaluation methodology changes the picture. When we inspected the actual output files produced by agents, the results agreed with human judgements roughly 90% of the time. Evaluation based on screenshots and action logs agreed only about 40% of the time. This gap suggests that common evaluation approaches may underestimate what agents actually accomplish in practice.

Implications and looking forward

The results suggest that memory and retrieval, not just raw model capability, may be a key bottleneck in getting agents to work in the real world. The largest gains came from experiential learning. Agents that learn from prior successes and apply those patterns to structurally similar tasks build an advantage over systems that respond to each task in isolation.
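To illustrate the idea of experiential learning in a few lines, the sketch below stores completed-task records and suggests a prior plan when a new task looks structurally similar. The string-similarity matching and threshold are placeholder assumptions; a real system would use embeddings or structured task signatures.

```python
import difflib

class ExperienceStoreSketch:
    """Hedged illustration of experiential learning: keep records of
    completed tasks and reuse them for structurally similar new tasks."""

    def __init__(self):
        self.records = []  # (task_description, successful_plan)

    def add(self, task: str, plan: list[str]):
        self.records.append((task, plan))

    def suggest(self, new_task: str, threshold: float = 0.6):
        # Find the most similar prior task by plain string similarity.
        best = max(
            self.records,
            key=lambda r: difflib.SequenceMatcher(None, r[0], new_task).ratio(),
            default=None,
        )
        if best and difflib.SequenceMatcher(None, best[0], new_task).ratio() >= threshold:
            return best[1]  # reuse the prior plan as a starting point
        return None
```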

CORPGEN also opens a new lens on how AI agents collaborate. Next steps include testing whether agents can maintain memory across multiple workdays and how they coordinate when working in teams. We are also exploring ways to make agents faster and more reliable by combining different methods of interacting with software.

Acknowledgments

This work is a result of a collaboration between the Office of the CTO at Microsoft and the Microsoft AI Development Accelerator Program (MAIDAP). We would like to thank the Microsoft Security Research team for providing resources that supported this research. We also thank the members of the Microsoft UFO2 (opens in new tab) team and the Mem0 (opens in new tab) project for their open-source contributions, which enabled key components of the CORPGEN architecture, and the OSWorld team for the benchmark that served as the foundation for our multi-task evaluation.

Finally, we thank the many contributors to this research: Charlotte Siska, Manuel Raúl Meléndez Luján, Anthony Twum-Barimah, and Mauricio Velazco.


The post CORPGEN advances AI agents for real work appeared first on Microsoft Research.

Categories: Microsoft

Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions

Thu, 02/19/2026 - 18:00

Insights from Microsoft’s Media Integrity and Authentication: Status, Directions, and Futures report

It has become increasingly difficult to distinguish fact from fiction when viewing online images and videos. Resilient, trustworthy technologies can help people determine whether the content they are viewing was captured by a camera or microphone—or generated or modified by AI tools. 

We refer to technologies aimed at helping viewers verify the source and history—that is, the provenance—of digital content as media integrity and authentication (MIA) methods. These techniques, including content provenance driven by the Coalition for Content Provenance and Authenticity (opens in new tab) (C2PA), a standards body dedicated to scaling these capabilities, as well as complementary methods such as watermarking and fingerprinting, have become critically important with the rapid advance of AI systems capable of creating realistic imagery, video, and audio at scale.

A convergence of forces

Our team recognized an inflection point in the evolution of online content integrity, driven by the convergence of four forces:

  • Growing saturation of synthetic media, driven by proliferation of high-fidelity content-generation tools and the explosion of AI generated or modified media online
  • Forthcoming legislation both nationally and internationally seeking to define what “verifiable” provenance should mean in practice
  • Mounting pressure on implementers to ensure authentication signals are clear and helpful, especially as signals increase when legislation goes into effect in 2026
  • Heightened awareness of potential adversarial attacks that attempt to exploit weaknesses in authenticity systems

The usefulness and trustworthiness of provenance signals, whether certifying content as synthetic or as an authentic capture of real-world scenes, will depend not only on advances in technology, but also on how the broader digital ecosystem adopts, implements, and governs these tools. Aligning around implementation choices that promote consistency and clarity is essential to ensure transparency signals strengthen, rather than erode, public confidence.

To address these challenges, we launched a comprehensive evaluation of the real-world limits, edge cases, and emerging “attack surfaces” for MIA methods. Today, we are publishing our findings in the Media Integrity & Authentication: Status, Directions & Futures report. The report distills lessons learned and outlines practical directions for strengthening media integrity in the years ahead.

Findings and directions forward

Our research recognizes that different media integrity and authenticity methods serve differing purposes and offer distinct levels of protection. After defining each method in detail, we focused on secure provenance (C2PA), imperceptible watermarking, and soft hash fingerprinting across images, audio, and video.

Grounded in our evaluation of these MIA methods across modalities, attack categories, and real-world workflows, several findings emerged, including two new concepts:

  • High-Confidence Provenance Authentication: a critical capability for verifying, under defined conditions, whether claims about the origin of and modifications made to an asset can be validated with high certainty.
  • Sociotechnical Provenance Attacks: attacks aimed at deception and capable of inverting signals, making authentic content appear synthetic, and synthetic content appear authentic.

Drawing on our findings, we identified four promising directions for further strengthening media authentication, along with suggestions to support more effective implementation strategies and future decisions. We’ve summarized the findings and directions below, with additional detail available in the report.

Promising direction: Delivering high-confidence provenance authentication

High-level findings:

  • Implementation and display choices may affect the reliability of provenance indicators and how they are interpreted by the public.
  • Using a C2PA provenance manifest for media created and signed in a high security environment enables high-confidence validation.
  • High-confidence validation is also possible across a broader volume of images, audio, and video when an imperceptible watermark is linked to a C2PA provenance manifest as an additional layer to recover the provenance information if removed.
  • Fingerprinting is not an enabler for high-confidence validation and can involve significant costs when expected at scale. However, it can support manual forensics.

Promising direction: Mitigating confusion from sociotechnical provenance attacks

High-level findings:

  • MIA methods are susceptible to sociotechnical attacks on provenance that may mislead the public, resulting in confusion and misplaced trust about an asset’s provenance if there is an overreliance on low-quality signals.
  • Layering and linking secure provenance and imperceptible watermarking methods to achieve high-confidence validation also offers a promising option to both deter and mitigate the impact of attacks.
  • Unintended consequences may result from the use of methods lacking authentication, such as the use of perceptible watermarks in the absence of secure provenance. Perceptible watermarks may cause confusion in cases of forgery or discourage people from consulting high-confidence provenance information via a validation tool, if such perceptible disclosures are taken at face value.
  • UX design that enables users to explore manifest details—such as where edits occurred or region of interest—has the potential to reduce confusion and support forensics and fact checking efforts.

Promising direction: Enabling more trusted provenance on edge devices

High-level findings:

  • High-confidence results aren’t feasible when provenance is added by a conventional offline device (e.g., camera or recording device without connectivity).
  • Implementing a secure enclave within the hardware layer of offline devices is essential to make the provenance of captured images, audio, and video more trustworthy.

Promising direction: Investing in ongoing research and policy development

High-level findings:

  • All three methods offer organizations valuable tools for addressing operational challenges such as fraud prevention, risk management, and digital accountability.
  • UX and display are promising directions for research. Important directions include in-stream tools that display provenance information where people are and distinguish between high- and lower-confidence provenance signals.
  • Stakeholders should conduct ongoing analysis and red teaming to identify and mitigate weaknesses through technical approaches, policies, and laws.

The journey continues

This report marks the beginning of a new chapter in our media provenance journey (opens in new tab), building on years of foundational work, from developing the very first prototype in 2019 to co-founding the C2PA in 2021 and helping catalyze an ecosystem that has since grown to more than 6,000 members and affiliates (opens in new tab) supporting C2PA Content Credentials. This research represents the next evolution of that long‑standing commitment.

We hope that sharing our learnings will help others prepare for this important wave, especially as generative technologies accelerate and provenance signals multiply. This work is already underway across our products at Microsoft. Together, these directions highlight opportunities for the ecosystem to align, harden, and innovate, so authentication signals are not merely visible, but robust, meaningful, and resilient throughout the content lifecycle.


The post Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions appeared first on Microsoft Research.

Categories: Microsoft

Project Silica’s advances in glass storage technology

Wed, 02/18/2026 - 18:11
At a glance
  • Microsoft Research publishes breakthrough in Nature on glass-based data storage that could preserve information for 10,000 years. 
  • New technique extends technology from expensive fused silica to ordinary borosilicate glass found in kitchen cookware. 
  • Innovations enable faster parallel writing, simplified readers (one camera instead of three), and easier manufacturing. 
  • Phase voxel method requires only a single laser pulse, significantly reducing complexity and cost.

Long-term preservation of digital information has long challenged archivists and datacenters, as magnetic tapes and hard drives degrade within decades. Existing archival storage solutions have limited media lifespans that make them less than ideal for preserving information for future generations.

Now, we are excited to report significant progress on Project Silica (opens in new tab), our effort to encode data in glass using femtosecond lasers, a technology that could preserve information for 10,000 years. Glass is a permanent data storage material that is resistant to water, heat, and dust.

In findings published in Nature (opens in new tab), we describe a breakthrough that extends the technology beyond expensive fused silica to ordinary borosilicate glass. A readily available and lower-cost medium, this is the same material found in kitchen cookware and oven doors. This advance addresses key barriers to commercialization: cost and availability of storage media. We have unlocked the science for parallel high-speed writing and developed a technique to permit accelerated aging tests on the written glass, suggesting that the data should remain intact for at least 10,000 years.

Storing data inside glass with femtosecond (opens in new tab) laser pulses is one of the few technologies on the horizon with the potential for durable, immutable, and long-lived storage. Although we have been leading innovation in this type of storage for years, prior to this research the technique only worked with pure fused silica glass, a type of glass that is relatively difficult to manufacture and available from only a few sources.

In the paper, we show how data can be stored in borosilicate glass. The new technique stores hundreds of layers of data in glass only 2 mm thick, as with previous methods, but with important improvements. The reader for the glass now needs only one camera, not three or four, reducing cost and size. In addition, the writing devices require fewer parts, making them easier to manufacture and calibrate, and enabling them to encode data more quickly.

Key scientific discoveries

The Nature paper details several key new scientific discoveries:

Advances in birefringent voxel (opens in new tab) writing: For the previous type of data storage in fused silica glass using birefringent (i.e., polarization) voxels, we developed a technique to reduce the number of pulses used to form the voxel from many to only two, critically showing that the polarization of the first pulse is not important to the polarization of the voxel formed. We further developed this to enable pseudo-single-pulse writing, in which a single pulse can be split after its polarization is set to simultaneously form the first pulse for one voxel (where the polarization doesn’t matter) and the second pulse of another (where the set polarization is essential). We demonstrated how to use this pseudo-single-pulse writing to enable fast writing with beam scanning across the media.

Phase voxels, a new storage method: We invented a new type of data storage in glass called phase voxels, in which the phase change of the glass is modified instead of its polarization, showing that only a single pulse is necessary to make a phase voxel. We demonstrated that these phase voxels can also be formed in borosilicate glass and devised a technique to read the phase information from phase voxels encoded in this material. We showed that the much higher levels of three-dimensional inter-symbol interference in phase voxels can be mitigated with a machine learning classification model.

Parallel writing capabilities: By combining a mathematical model of pre-heating and post-heating within the glass with the invention of a multi-beam delivery system, we showed that many data voxels can be written in proximity in the glass at the same time, significantly increasing writing speed. We explained a method for using light emissions (a side effect of voxel formation) for both static calibration and dynamic control to fully support automatic writing operations.

Optimization and longevity testing: We developed a new way to optimize symbol encodings using machine learning and a better way to understand the tradeoff between error rates, error protection, and error recovery when evaluating new digital storage systems. We also created a new nondestructive optical method (opens in new tab) to identify the aging of data storage voxels within the glass, using this and standard accelerated aging techniques to support data lasting 10,000 years. We extended the industry-standard Gray codes to apply to non-power-of-two numbers of symbols.
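The paper’s exact construction is not reproduced here, but the sketch below shows the general idea of a Gray-style ordering over an alphabet whose size is not a power of two: an n-ary reflected Gray code in which successive codewords differ in exactly one digit, by ±1.

```python
def nary_gray(length, base):
    """Generate all base**length codewords of an n-ary reflected Gray code.
    Successive codewords differ in exactly one digit, and only by +/-1,
    which works for any base, including non-powers of two."""
    if length == 0:
        return [[]]
    shorter = nary_gray(length - 1, base)
    codes = []
    for digit in range(base):
        # Alternate traversal direction so adjacent blocks join smoothly.
        block = shorter if digit % 2 == 0 else list(reversed(shorter))
        codes.extend([digit] + rest for rest in block)
    return codes

# Example: 2-digit codewords over a 3-symbol (non-power-of-two) alphabet.
print(nary_gray(2, 3))
# [[0, 0], [0, 1], [0, 2], [1, 2], [1, 1], [1, 0], [2, 0], [2, 1], [2, 2]]
```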

[Slideshow] Project Silica hardware and media: a piece of Project Silica media written with data; a research-grade Writer used to set the record for high-speed data writing into glass; a research-grade Reader for retrieving data from glass; and a close-up of the Writer showing high-speed multi-beam data encoding on laser pulses.

Demonstrating the technology

As a research initiative, Project Silica has demonstrated these advances through several proofs of concept, including storing Warner Bros.’ “Superman” movie on quartz glass (opens in new tab), partnering with Global Music Vault (opens in new tab) to preserve music under ice for 10,000 years (opens in new tab), and working with students on a “Golden Record 2.0” project (opens in new tab), a digitally curated archive of images, sounds, music, and spoken language, crowdsourced to represent and preserve humanity’s diversity for millennia.

Looking ahead

The research phase is now complete, and we are continuing to apply learnings from Project Silica as we explore the ongoing need for sustainable, long-term preservation of digital information. We have added this paper to our published works so that others can build on it.

Related work

Project Silica has made scientific advances across multiple areas beyond laser direct writing (LDW) in glass, including archival storage systems design, archival workload analysis, datacenter robotics, erasure coding, free-space optical components, and machine learning-based methods for symbol decoding in storage systems. Many of these innovations were described in our ACM Transactions on Storage publication (opens in new tab) in 2025.


The post Project Silica’s advances in glass storage technology appeared first on Microsoft Research.

Categories: Microsoft

Rethinking imitation learning with Predictive Inverse Dynamics Models

Thu, 02/05/2026 - 19:00
At a glance
  • Imitation learning becomes easier when an AI agent understands why an action is taken.
  • Predictive Inverse Dynamics Models (PIDMs) predict plausible future states, clarifying the direction of behavior during imitation learning.
  • Even imperfect predictions reduce ambiguity, making it clearer which action makes sense in the moment.
  • This makes PIDMs far more data‑efficient than traditional approaches.

Imitation learning teaches AI agents by example: show the agent recordings of how people perform a task and let it infer what to do. The most common approach, Behavior Cloning (BC), frames this as a simple question: “Given the current state of the environment, what action would an expert take?”

In practice, this is done through supervised learning, where the states serve as inputs and expert actions as outputs. While simple in principle, BC often requires large demonstration datasets to account for the natural variability in human behavior, but collecting such datasets can be costly and difficult in real-world settings.

Predictive Inverse Dynamics Models (PIDMs) offer a different take on imitation learning by changing how agents interpret human behavior. Instead of directly mapping states to actions, PIDMs break down the problem into two subproblems: predicting what should happen next and inferring an appropriate action to go from the current state to the predicted future state. While PIDMs often outperform BC, it has not been clear why they work so well, motivating a closer look at the mechanisms behind their performance.

In the paper, “When does predictive inverse dynamics outperform behavior cloning?” we show how this two-stage approach enables PIDMs to learn effective policies from far fewer demonstrations than BC. By grounding the selection process in a plausible future, PIDMs provide a clearer basis for choosing an action during inference. In practice, this can mean achieving comparable performance with as few as one-fifth the demonstrations required by BC, even when predictions are imperfect.

Figure 1. BC vs. PIDM architectures. (Top) Behavior Cloning learns how to perform a direct mapping from the current state to an action. (Bottom) PIDMs add a state predictor that predicts future states. They then use an inverse dynamics model to predict the action required to move from the current state towards that future state. Both approaches share a common latent representation through a shared state encoder.

How PIDMs rethink imitation

PIDMs’ approach to imitation learning consists of two core elements: a model that forecasts plausible future states, and an inverse dynamics model (IDM) that predicts the action needed to move from the present state toward that future. Instead of asking, “What action would an expert take?” PIDMs effectively ask, “What would an expert try to achieve, and what action would lead to it?” This shift turns the information in the current observation (e.g., video frame) into a coherent sense of direction, reducing ambiguity about intent and making action prediction easier.
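A minimal sketch of this two-stage structure, assuming simple fully connected networks and illustrative layer sizes; it mirrors the shared encoder, state predictor, and inverse dynamics model of Figure 1 rather than the authors’ exact architecture.

```python
import torch
import torch.nn as nn

class PIDMSketch(nn.Module):
    """Illustrative PIDM: a shared state encoder, a predictor for a plausible
    future latent state, and an inverse dynamics model that maps
    (current, predicted future) to an action. Sizes are assumptions."""

    def __init__(self, obs_dim: int, latent_dim: int, action_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.ReLU())
        self.state_predictor = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim))
        self.inverse_dynamics = nn.Sequential(
            nn.Linear(2 * latent_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, action_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        z_t = self.encoder(obs)               # encode the current state
        z_future = self.state_predictor(z_t)  # predict a plausible future state
        return self.inverse_dynamics(torch.cat([z_t, z_future], dim=-1))

# Training would supervise the predictor with the demonstrated future state
# and the inverse dynamics head with the demonstrated action.
```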

Real-world validation in a 3D gameplay environment

To evaluate PIDMs under realistic conditions, we trained agents on human gameplay demonstrations in a visually rich video game. These conditions include operating directly from raw video input, interacting with a complex 3D environment in real time at 30 frames per second, and handling visual artifacts and unpredictable system delays.  

The agents ran the tasks end to end, taking video frames as input and continuously deciding which buttons to press and how to move the joysticks. Instead of relying on a hand-coded set of game variables and rules, the model worked directly from visual input, using past examples to predict what comes next and choosing actions that moved play in that direction.

We ran all experiments on a cloud gaming platform, which introduced additional delays and visual distortions. Despite these challenges, the PIDM agents consistently matched human patterns of play and achieved high success rates across tasks, as shown in Video 1 below and Videos 2 and 3 in the appendix.

Video 1. A player (left) and a PIDM agent (right) side by side playing the game Bleeding Edge. Both navigate the same trajectory, jumping over obstacles and engaging with nonplayer characters. Despite network delays, the agent closely matches the player’s timing and movement in real time.

Why and when PIDMs outperform BC

Of course, AI agents do not have access to future outcomes. They can only generate predictions based on available data, and those predictions are sometimes wrong. This creates a central trade‑off for PIDMs.

On one hand, anticipating where the agent should be heading can clarify what action makes sense in the present. Knowing the intended direction helps narrow an otherwise ambiguous choice. On the other hand, inaccurate predictions can occasionally steer the model toward the wrong action.

The key insight is that these effects are not symmetric. While prediction errors introduce some risk, reducing ambiguity in the present often matters more. Our theoretical analysis shows that even with imperfect predictions, PIDMs outperform BC as long as the prediction error remains modest. If future states were known perfectly, PIDMs would outperform BC outright.

In practice, this means that clarifying intent often matters more than accurately predicting the future. That advantage is most evident in the situations where BC struggles: where human behavior varies and actions are driven by underlying goals rather than by what is immediately visible on the screen.

BC requires many demonstrations because each example is noisy and open to multiple interpretations. PIDMs, by contrast, sharpen each demonstration by linking actions to the future states they aim to reach. As a result, PIDMs can learn effective action strategies from far fewer examples.

Evaluation

To test these ideas under realistic conditions, we designed a sequence of experiments that begins with a simple, interpretable 2D environment (Video 4 in the appendix) and culminates in a complex 3D video game. We trained both BC and PIDM on very small datasets, ranging from one to fifty demonstrations in the 2D environment and from five to thirty for the 3D video game. Across all tasks, PIDM reached high success rates with far fewer demonstrations than BC.

In the 2D setting, BC needed two to five times more data to match PIDM’s performance (Figure 2). In the 3D game, BC needed 66% more data to achieve comparable results (Video 5 in the appendix).

Figure 2. Performance gains in the 2D environment. As the number of training demonstrations increases, PIDM consistently achieves higher success rates than BC across all four tasks. Curves show mean performance, with shading indicating variability across 20 experiments for reproducibility.

Takeaway: Intent matters in imitation learning

The main message of our investigation is simple: imitation becomes easier when intent is made explicit. Predicting a plausible future, even an imperfect one, helps resolve ambiguity about which action makes sense right now, much like driving more confidently in the fog when the driver already knows where the road is headed. PIDM shifts imitation learning from pure copying toward goal-oriented action.

This approach has limits. If predictions of future states become too unreliable, they can mislead the model about the intended next move. In those cases, the added uncertainty can outweigh the benefit of reduced ambiguity, causing PIDM to underperform BC.

But when predictions are reasonably accurate, reframing action prediction as “How do I get there from here?” helps explain why learning from small, messy human datasets can be surprisingly effective. In settings where data is expensive and demonstrations are limited, that shift in perspective can make a meaningful difference.

Appendix: Visualizations and results (videos)

A player, a naïve action-replay baseline, and a PIDM agent playing Bleeding Edge

Video 2. (Left) The player completes the task under normal conditions. (Middle) The baseline replays the recorded actions at their original timestamps, which initially appears to work. Because the game runs on a cloud gaming platform, however, random network delays quickly push the replay out of sync, causing the trajectory to fail. (Right) Under the same conditions, the PIDM agent behaves differently. Instead of naively replaying actions, it continuously interprets visual input, predicts how the behavior is likely to unfold, and adapts its actions in real time. This allows it to correct delays, recover from deviations, and successfully reproduce the task in settings where naïve replay inevitably fails.

A player and a PIDM agent performing a complex task in Bleeding Edge

Video 3. In this video, the task exhibits strong partial observability: correct behavior depends on whether a location is being visited for the first or second time. For example, in the first encounter, the agent proceeds straight up the ramp; on the second, it turns right toward the bridge. Similarly, it may jump over a box on the first pass but walk around it on the second. The PIDM agent reproduces this trajectory reliably, using coarse future guidance to select actions in the correct direction.

Visualization of the 2D navigation environment

Video 4. These videos show ten demonstrations for each of four tasks: Four Room, Zigzag, Maze, and Multiroom. In all cases, the setup is the same: the character (blue box) moves through the environment and must reach a sequence of goals (red squares). The overlaid trajectories visualize the paths the player took; the models never see these paths. Instead, they observe only their character’s current location, the position of all goals, and whether each goal has already been reached. Because these demonstrations come from real players, no two paths are identical: players pause, take detours, or correct small mistakes along the way. That natural variability is exactly what the models must learn to handle.

PIDM vs. BC in a 3D environment

Video 5. The PIDM agent achieves an 85% success rate with only fifteen demonstrations used in training. The BC agent struggles to stay on track and levels off around 60%. The contrast illustrates how differently the two approaches perform when training data is limited.

The post Rethinking imitation learning with Predictive Inverse Dynamics Models appeared first on Microsoft Research.

Categories: Microsoft

Paza: Introducing automatic speech recognition benchmarks and models for low resource languages

Thu, 02/05/2026 - 07:07
At a glance
  • Microsoft Research releases PazaBench and Paza automatic speech recognition models, advancing speech technology for low resource languages.
  • Human-centered pipeline for low-resource languages: Built for and tested by communities, Paza is an end-to-end, continuous pipeline that elevates historically under-represented languages and makes speech models usable in real-world, low-resource contexts.
  • First-of-its-kind ASR leaderboard, starting with African languages: PazaBench is the first automatic speech recognition (ASR) leaderboard for low-resource languages. Launching with 39 African languages and 51 state-of-the-art models, it tracks three key metrics across leading public and community datasets.
  • Human-centered Paza ASR models: ASR models fine-tuned with minimal data and grounded in real-world testing with farmers on everyday mobile devices, covering six Kenyan languages: Swahili, Dholuo, Kalenjin, Kikuyu, Maasai, and Somali.

According to the 2025 Microsoft AI Diffusion Report, approximately one in six people globally had used a generative AI product. Yet for billions of people, the promise of voice interaction still falls short, and whilst AI is becoming increasingly multilingual, a key question remains: Do these models actually work for all languages and the people who rely on them? This challenge is one we first confronted through Project Gecko—a collaboration between Microsoft Research and Digital Green (opens in new tab), where field teams across Africa and India focused on building usable AI tools for farmers.

Gecko revealed how often speech systems fail in real‑world, low‑resource environments—where many languages go unrecognized and non‑Western accents are frequently misunderstood. Yet speech remains the primary medium of communication globally. For communities across Kenya, Africa, and beyond, this mismatch creates cascading challenges: without foundational data representing their languages and cultures, innovation stalls, and the digital and AI divides widen. 

Paza addresses this with a human-centered speech model pipeline. Through PazaBench, it benchmarks low-resource languages using both public and community-sourced data, and through the Paza models, it fine-tunes speech models to deliver outsized gains in mid- and low-resource languages, evaluating them with community testers on real devices in real contexts. Upcoming playbooks complement this work with practical guidance on dataset creation, fine-tuning with minimal data, and evaluation considerations, introducing a continuous pipeline that enables researchers and practitioners to build and evaluate systems grounded in real human use.

How Project Gecko informed Paza’s design

In addition to building cost-effective, adaptable AI systems, the extensive fieldwork on Project Gecko highlighted an important lesson: Building usable speech models in low‑resource settings is not only a data problem, but also a design and evaluation problem. For AI systems to be useful, they must work in local languages, support hands‑free interaction through voice, text, and video, and deliver information in formats that fit real-world environments, that is, on low-bandwidth mobile devices, in noisy settings, and for varying literacy levels.  

These insights shaped the design of Paza, named for the Swahili phrase paza sauti, meaning “to project” or “to raise your voice.” The name reflects our intent: rather than simply adding more languages to existing systems, Paza is about co-creating speech technologies in partnership with the communities who use them. Guided by this principle, Paza puts human use first, which enables model improvement.

PazaBench: The first ASR leaderboard for low-resource languages

PazaBench is the first automatic speech recognition (ASR) leaderboard dedicated to low-resource languages. It launches with initial coverage of 39 African languages and benchmarks 52 state-of-the-art ASR and language models, including the newly released Paza ASR models for six Kenyan languages. The platform aggregates leading public and community datasets spanning diverse styles of speech, including conversational, scripted read-aloud, unscripted, broadcast news, and domain-specific data, into one easy-to-explore platform per language. This makes it easier for researchers, developers, and product teams to assess which models perform best across underserved languages and diverse regions, understand trade-offs between speed and accuracy, and identify where gaps persist.

PazaBench tracks three core metrics:

  1. Character Error Rate (CER), which is important for languages with rich word forms, where meaning is built by combining word parts and errors at the character level can significantly change meaning
  2. Word Error Rate (WER), for word-level transcript accuracy
  3. RTFx (Inverse Real-Time Factor), which measures how fast transcription runs relative to real-time audio duration (a minimal sketch of all three follows below)
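As a rough illustration, the sketch below computes all three metrics for a single utterance, using the jiwer library for WER and CER. The transcribe() callable and audio inputs are placeholders for whatever ASR model is being evaluated.

import time
import jiwer

def evaluate_utterance(reference: str, audio, audio_seconds: float, transcribe):
    # Time the model under test and score its hypothesis against the reference transcript.
    start = time.perf_counter()
    hypothesis = transcribe(audio)
    elapsed = time.perf_counter() - start
    return {
        "cer": jiwer.cer(reference, hypothesis),   # character-level error rate
        "wer": jiwer.wer(reference, hypothesis),   # word-level error rate
        "rtfx": audio_seconds / elapsed,           # >1 means faster than real time
    }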

More than scores, PazaBench standardizes evaluation to prioritize dataset gaps, identify underperforming languages, and highlight where localized models beat wider coverage ASR models—offering early evidence of the value of African‑centric innovation.

Explore PazaBench

To contribute to the benchmark, request additional language evaluation on the leaderboard.

Paza ASR Models: Built with and for Kenyan languages

The Paza ASR models consist of three fine-tuned ASR models built on top of state-of-the-art model architectures. Each model targets Swahili, a mid-resource language, and five low-resource Kenyan languages: Dholuo, Kalenjin, Kikuyu, Maasai, and Somali. The models are fine-tuned on public and curated proprietary datasets.

Fine-tuning the three models allowed us to explore complementary approaches toward a shared goal: building speech recognition systems that are usable in local contexts, starting with the six Kenyan languages, and bridging the gaps in multilingual, multimodal video question answering through the MMCT agent (opens in new tab).

See the MMCT agent in action in the field

Early versions of two models in Kikuyu and Swahili were deployed on mobile devices and tested directly with farmers in real‑world settings, enabling the team to observe how the models performed with everyday use. Farmers provided in‑the‑moment feedback on accuracy, usability, and relevance, highlighting where transcripts broke down, which errors were most disruptive, and what improvements would make the models more helpful in practice. This feedback loop directly informed subsequent fine‑tuning, ensuring model improvements were driven not only by benchmark scores, but by the needs and expectations of the communities they are intended to serve.

Explore Paza Collection Here

Here is how Paza models compare to three state-of-the-art ASR models today:

Figure 1: Character Error Rate (CER) comparison across the Kenyan languages for several state‑of‑the‑art ASR models including the Paza models. Lower CER indicates better transcription performance.

Figure 2: Word Error Rate (WER) comparison across the Kenyan languages for several state‑of‑the‑art ASR models including the Paza models. Lower WER indicates better transcription performance.

1) Paza‑Phi‑4‑Multimodal‑Instruct

Microsoft’s Phi‑4 multimodal‑instruct (opens in new tab) is a next‑generation small language model built to reason across audio, text, and vision. With Paza, we extend its audio capabilities, adapting a powerful multimodal architecture into a high‑quality automatic speech recognition (ASR) system for low‑resource African languages.

Fine-tuned on unified multilingual speech datasets, the model was optimized specifically for transcription in the six languages. We preserved the underlying transformer architecture and multimodal capabilities while selectively fine-tuning only the audio-specific components, enabling strong cross-lingual generalization.
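A hedged sketch of what such selective fine-tuning can look like with Hugging Face Transformers appears below: freeze the full model, then re-enable gradients only for audio-related parameters. The "audio" name filter is an assumption for illustration; the actual module names depend on the released checkpoint and on the Paza training code, which is not shown here.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct", trust_remote_code=True
)

for param in model.parameters():
    param.requires_grad = False                # keep text and vision weights frozen

trainable = 0
for name, param in model.named_parameters():
    if "audio" in name.lower():                # assumption: audio components carry "audio" in their names
        param.requires_grad = True
        trainable += param.numel()

print(f"Trainable audio parameters: {trainable:,}")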

As the results below show, this model delivers consistent improvements in transcription quality across all six languages.

Figure 3: Character Error Rate (CER) comparison across the six languages for the base model versus the finetuned Paza model. Lower CER indicates better transcription performance.

Figure 4: Word Error Rate (WER) comparison across the six languages for the base model versus the finetuned Paza model. Lower WER indicates better transcription performance.

Test the model here

2) Paza‑MMS‑1B‑All

This model is fine-tuned on Meta’s mms-1b-all model, which employs a large-scale Wav2Vec2.0-style encoder with lightweight language-specific adapters to enable efficient multilingual specialization. For this release, each of the six language adapters was fine‑tuned independently on curated low‑resource datasets, allowing targeted adaptation while keeping the shared encoder largely frozen.
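For illustration, the sketch below shows one way to fine-tune a single MMS language adapter while leaving the shared encoder frozen. The "luo" language code (used here for Dholuo) and the "adapter" parameter-name filter are assumptions; the exact identifiers depend on the checkpoint and tokenizer.

from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/mms-1b-all",
    target_lang="luo",                  # load the adapter and vocabulary for one language
    ignore_mismatched_sizes=True,
)

for param in model.parameters():        # freeze the large shared encoder
    param.requires_grad = False

for name, param in model.named_parameters():
    if "adapter" in name:               # train only the lightweight language-specific adapter
        param.requires_grad = True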

As shown in the figures below, this model improves transcription accuracy while maintaining the model’s strong cross‑lingual generalization.

Figure 5: Character Error Rate (CER) comparison across the six languages for the base model versus the finetuned Paza model. Lower CER indicates better transcription performance.

Figure 6: Word Error Rate (WER) comparison across the six languages for the base model versus the finetuned Paza model. Lower WER indicates better transcription performance.

Join the Research Early Access Program

3) Paza‑Whisper‑Large‑v3‑Turbo

This model is fine-tuned on OpenAI’s whisper-large-v3-turbo base model. Whisper is a transformer-based encoder–decoder model that delivers robust automatic speech recognition (ASR) capabilities. The model was fine-tuned on the entire unified multilingual ASR dataset, covering the six languages, to encourage cross-lingual generalization. In addition, an extra post-processing step was applied to address known Whisper hallucination failure modes, improving transcription reliability.
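The specific post-processing is not detailed here, but the sketch below shows one common style of check against Whisper's repetition-type hallucinations: collapsing a short phrase that repeats many times in a row. It is an illustrative assumption, not the Paza pipeline.

import re

def collapse_repetitions(text: str, max_repeats: int = 2) -> str:
    # Collapse any phrase of one to five words that repeats more than max_repeats times.
    pattern = re.compile(
        r"\b((?:\w+\s+){0,4}\w+)(?:\s+\1){" + str(max_repeats) + r",}",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: " ".join([m.group(1)] * max_repeats), text)

print(collapse_repetitions("thank you thank you thank you thank you for listening"))
# -> "thank you thank you for listening"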

As shown below, this release achieves improved transcription accuracy while retaining Whisper’s robustness.

Figure 7: Character Error Rate (CER) comparison across the six languages for the base model versus the finetuned Paza model. Lower CER indicates better transcription performance.

Figure 8: Word Error Rate (WER) comparison across the six languages for the base model versus the finetuned Paza model. Lower WER indicates better transcription performance.

Test the model here

Where do we go from here

AI is reshaping how the world communicates. Designing with people, not just for them, means looking beyond the languages that are already well‑served. We plan to expand PazaBench beyond African languages and evaluate state‑of‑the‑art ASR models across more low‑resource languages globally. The Paza ASR models are an early step; truly supporting small and under‑represented languages requires dedicated datasets, strong local partnerships, and rigorous evaluation. Meaningful progress depends on sustained collaboration with the communities who speak these languages, and expanding responsibly means prioritizing depth and quality over broad but shallow coverage. 

As we continue this work, we’re distilling our methods into a forthcoming playbook to help the broader ecosystem curate datasets, fine‑tune responsibly, and evaluate models in real‑world conditions. And we’re not stopping at speech—additional playbooks will guide teams building AI tools and applications for multilingual, multicultural contexts, and give them practical recommendations for deploying across diverse communities. 

Together, these guides—grounded in technical advances and community‑driven design—share our learnings to help researchers, engineers, and designers build more human‑centered AI systems. 

Acknowledgements

The following researchers played an integral role in this work: Najeeb Abdulhamid, Felermino Ali, Liz Ankrah, Kevin Chege, Ogbemi Ekwejunor-Etchie, Ignatius Ezeani, Tanuja Ganu, Antonis Krasakis, Mercy Kwambai, Samuel Maina, Muchai Mercy, Danlami Mohammed, Nick Mumero, Martin Mwiti, Stephanie Nyairo, Millicent Ochieng and Jacki O’Neill.

We would like to thank the Digital Green (opens in new tab) team—Rikin Gandhi, Alex Mwaura, Jacqueline Wang’ombe, Kevin Mugambi, Lorraine Nyambura, Juan Pablo, Nereah Okanga, Ramaskanda R.S, Vineet Singh, Nafhtari Wanjiku, Kista Ogot, Samuel Owinya and the community evaluators in Nyeri and Nandi, Kenya — for their valuable contributions to this work.

We extend our gratitude to the creators, community contributors, and maintainers of African Next Voices Kenya (opens in new tab), African Next Voices South Africa (opens in new tab), ALFFA (opens in new tab), Digigreen (opens in new tab), Google FLEURS (opens in new tab), Mozilla Common Voice (opens in new tab) and Naija Voices (opens in new tab) whose efforts have been invaluable in advancing African languages speech data.


The post Paza: Introducing automatic speech recognition benchmarks and models for low resource languages appeared first on Microsoft Research.

Categories: Microsoft

UniRG: Scaling medical imaging report generation with multimodal reinforcement learning

Tue, 01/27/2026 - 19:00
At a glance
  • AI-driven medical image report generation can help medical providers become more efficient and productive.
  • Current models are difficult to train because reporting practices vary widely among providers.
  • Universal Report Generation (UniRG) uses reinforcement learning to align model training with real-world radiology practice rather than proxy text-generation objectives.
  • UniRG has achieved state-of-the-art performance across datasets, metrics, diagnostic tasks, longitudinal settings, and demographic subgroups.
  • Test results show that reinforcement learning, guided by clinically meaningful reward signals, can substantially improve the reliability and generality of medical vision–language models.

AI can be used to produce clinically meaningful radiology reports using medical images like chest x-rays. Medical image report generation can reduce reporting burden while improving workflow efficiency for healthcare professionals. Beyond the real-world benefits, report generation has also become a critical benchmark for evaluating multimodal reasoning in healthcare AI.

Despite recent advances driven by large vision–language models, current systems still face major limitations in real-world clinical settings. One challenge stems from the wide variation in radiology reporting practices across institutions, departments, and patient populations. A model trained with supervised fine-tuning on one set of data may learn its specific phrasing and conventions instead of more general patterns—a problem known as overfitting. As a result, the model performs well on that data but delivers poor results when evaluated on unseen institutions or external datasets. Moreover, since model training is often aimed at producing text that looks similar to existing reports, some well-written but clinically inaccurate reports can slip through.

In this blog, we introduce Universal Report Generation (UniRG) (opens in new tab), a reinforcement learning–based framework for medical imaging report generation. This work is a research prototype intended to advance medical AI research and is not validated for clinical use. UniRG uses reinforcement learning as a unifying mechanism to directly optimize clinically grounded evaluation signals, aligning model training with real-world radiology practice rather than proxy text-generation objectives. Using this framework, we train UniRG-CXR (opens in new tab), a state-of-the-art chest x-ray report generation model at scale, spanning over 560,000 studies, 780,000 images, and 226,000 patients from more than 80 medical institutions.

To our knowledge, this is the first report generation model to achieve consistent state-of-the-art performance across report-level metrics, disease-level diagnostic accuracy, cross-institution generalization, longitudinal report generation, and demographic subgroups. These results demonstrate that reinforcement learning, when guided by clinically meaningful reward signals, can substantially improve both the reliability and generality of medical vision–language models.

A unified framework for scaling medical image report generation

UniRG builds state-of-the-art report generation models by combining supervised fine-tuning with reinforcement learning, which optimizes a composite reward that integrates rule-based metrics, model-based semantic metrics, and LLM-based clinical error signals. This approach allows the resulting model UniRG-CXR to learn from diverse data sources, move beyond dataset-specific reporting patterns, and learn representations that generalize across institutions, metrics, and clinical contexts. Notably, UniRG-CXR sets a new state of the art on the authoritative ReXrank leaderboard (opens in new tab), a public leaderboard for chest X-ray image interpretation, as of 01/22/2026, surpassing previous best models by substantial margins (Figure 1).
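A hedged sketch of such a composite reward appears below; the component scorers and weights are placeholders and do not reflect the UniRG implementation.

def composite_reward(generated: str, reference: str,
                     rule_score, semantic_score, clinical_error_count,
                     weights=(0.3, 0.4, 0.3), max_errors=5.0):
    # Rule-based signal, e.g., n-gram overlap normalized to [0, 1].
    r_rule = rule_score(generated, reference)
    # Model-based semantic signal, e.g., embedding similarity in [0, 1].
    r_semantic = semantic_score(generated, reference)
    # LLM-based clinical-error signal: fewer judged errors means higher reward.
    errors = clinical_error_count(generated, reference)
    r_clinical = max(0.0, 1.0 - errors / max_errors)

    w_rule, w_sem, w_clin = weights
    return w_rule * r_rule + w_sem * r_semantic + w_clin * r_clinical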

Figure 1. Overview of UniRG-CXR. (a) Training Data: UniRG-CXR is trained on the training splits of MIMIC-CXR, CheXpert Plus, and ReXGradient-160k, covering diverse institutions and patient demographics. (b) Training and Rewards: Taking input from the current image, clinical context (e.g., indication), and optionally prior studies, UniRG-CXR uses GRPO reinforcement learning to optimize composite rewards that combine rule-based, model-based, and LLM-based metrics. (c) Evaluation: We assess UniRG-CXR on held-out test sets (MIMIC-CXR, CheXpert Plus, ReXGradient) and unseen datasets (IU Xray and proprietary data). Report quality is measured using ReXrank metrics and an LLM-based clinical-error metric, while diagnostic ability is evaluated via F1-based disease classification from generated reports. (d) ReXrank Results: UniRG-CXR achieves SOTA performance across four datasets and two generation settings (findings only and findings + impression), showing substantial gains over prior state-of-the-art systems.

Universal improvements across metrics and clinical errors

Rather than excelling on one metric at the expense of others, UniRG-CXR delivers balanced improvements across many different measures of report quality. More importantly, it produces reports with substantially fewer clinically significant errors. This indicates that the model is not just learning how to sound like a radiology report, but is better capturing the underlying clinical facts. Explicitly optimizing for clinical correctness helps the model avoid common failure modes where fluent language masks incorrect or missing findings (Figure 2).

Figure 2. UniRG-CXR achieves state-of-the-art performance, delivering consistent and comprehensive performance gains across metrics. (a) On the ReXrank leaderboard, UniRG-CXR (green) shows robust, universal improvement across all evaluation metrics. (b) Starting from the same SFT checkpoint, RL with our combined reward achieves more balanced gains across metrics and the highest RadCliQ-v1 score compared to RL on single metrics; this ablation study is trained and tested on MIMIC. (c) An ablation study on the training dynamics shows that RL full (UniRG-CXR) achieves a significantly better RadCliQ-v1 score than RL on BLEU only. (d) During training, RL full (UniRG-CXR) shows a steady decrease in clinical errors per report, compared with a fluctuating trajectory without consistent improvement from an ablation run without error awareness (i.e., removing CheXprompt metric optimization). Both (c) and (d) show results on a 1,024-sample MIMIC validation set from ablations trained on MIMIC. (e) Case studies illustrate that UniRG-CXR can produce error-free reports, unlike MedVersa and MedGemma. (f) UniRG-CXR yields a substantially higher proportion of reports with ≤ 1 error and fewer with ≥ 4 errors than prior models.

Strong performance in longitudinal report generation

In clinical practice, radiologists often compare current images with prior exams to determine whether a condition is improving, worsening, or unchanged. UniRG-CXR is able to incorporate this historical information effectively, generating reports that reflect meaningful changes over time. This allows the model to describe new findings, progression, or resolution of disease more accurately, moving closer to how radiologists reason across patient histories rather than treating each exam in isolation (Figure 3).

Figure 3. UniRG-CXR enhances longitudinal report generation. (a) Comparing UniRG-CXR and its non-longitudinal ablation with prior models on longitudinal report generation, UniRG-CXR shows the best performance, and the longitudinal information contributes to that performance. (b) UniRG-CXR achieves the best performance across longitudinal encounter points ranging from the first encounter to the more complex 5th+ encounters, showing that its improvements hold across the board. In comparison, prior models such as GPT-5, GPT-4o, and MedGemma barely surpass the copy-prior-report baseline (grey lines). (c) Compared with prior models, which barely improve over the copy-prior baseline (dashed line), UniRG-CXR significantly and consistently improves performance across temporal disease change categories including new development, no change, progression, and regression (categorized by GPT-5 on ground-truth reports). Qualitative examples are shown for each category where UniRG-CXR correctly predicts the temporal change based on the input. All results in this figure are on the MIMIC test set with prior information where available.

Robust generalization across institutions and populations

UniRG-CXR maintains strong performance even when applied to data from institutions it has never seen before. This suggests that the model is learning general clinical patterns rather than memorizing institution-specific reporting styles. In addition, its performance remains stable across different patient subgroups, including age, gender, and race. This robustness is critical for real-world deployment, where models must perform reliably across diverse populations and healthcare environments (Figure 4).

Figure 4. Generalization and robustness of UniRG-CXR. (a) We evaluate UniRG-CXR in a zero-shot setting on two datasets from previously unseen institutions: IU-Xray and PD (proprietary data). UniRG-CXR consistently outperforms prior models, maintaining substantial performance gains in this challenging setup. (b) and (c) present condition-level F1 scores on MIMIC-CXR and PD and highlight that UniRG-CXR remains the overall top-performing model in condition-level diagnostic accuracy. (d) UniRG-CXR demonstrates stable and robust performance across gender, age, and race subgroups, all of which exceed the performance of the second-best model (the dashed lines).

UniRG is a promising step toward scaling medical imaging report generation

UniRG introduces a reinforcement learning–based framework that rethinks how medical imaging report generation models are trained and evaluated. By directly optimizing clinically grounded reward signals, UniRG-CXR achieves state-of-the-art performance across datasets, metrics, diagnostic tasks, longitudinal settings, and demographic subgroups, addressing longstanding limitations of supervised-only approaches.

Looking ahead, this framework can be extended to additional imaging modalities and clinical tasks, and combined with richer multimodal patient data such as prior imaging, laboratory results, and clinical notes. More broadly, UniRG highlights the promise of reinforcement learning as a core component of next-generation medical foundation models that are robust, generalizable, and clinically aligned.

UniRG reflects Microsoft’s larger commitment to advancing multimodal generative AI for precision health (opens in new tab), with other exciting progress such as GigaPath, BiomedCLIP, LLaVA-Rad (opens in new tab), BiomedJourney, BiomedParse, TrialScope, Curiosity.

Paper co-authors: Qianchu Liu, Sheng Zhang, Guanghui Qin, Yu Gu, Ying Jin, Sam Preston, Yanbo Xu, Sid Kiblawi, Wen-wai Yim, Tim Ossowski, Tristan Naumann, Mu Wei, Hoifung Poon


The post UniRG: Scaling medical imaging report generation with multimodal reinforcement learning appeared first on Microsoft Research.

Categories: Microsoft

Argos: Multimodal reinforcement learning with agentic verifier for AI agents

Tue, 01/20/2026 - 19:00
At a glance
  • Today’s multimodal AI systems can give answers that sound right but may not be grounded in what they actually observe over time, leading to unpredictable errors and safety risks in real-world settings.
  • Argos is a verification framework for multimodal reinforcement learning that trains models by rewarding not just correct answers, but correct answers grounded in visual and temporal evidence, using automated verification rather than human labeling. It selects the appropriate specialized tools for each answer based on what needs to be verified. 
  • Models trained with Argos show stronger spatial reasoning, far fewer visual hallucinations, more stable learning dynamics, and better performance on robotics and real-world tasks while requiring fewer training samples.

Over the past few years, AI systems have become much better at interpreting images, generating language, and performing tasks within physical and virtual environments. Yet they still fail in ways that are hard to predict and even harder to fix. A robot might try to grasp a tool when the object is visibly blocked, or a visual assistant integrated into smart glasses might describe objects that aren’t actually present.

These errors often arise because today’s multimodal agents are trained to generate outputs that are plausible rather than grounded in the actual information they receive from their environment. As a result, a model’s output can seem correct while relying on incorrect information. As AI systems are increasingly used to navigate 3D spaces and make decisions in real-world settings, this gap can be a safety and reliability concern.

To tackle this challenge, we posed the question: How can we train AI agents to generate correct answers and take appropriate actions for the right reasons so that their behavior is reliable even as the environment or tasks change?

Argos represents a novel answer to this challenge. It’s an agentic verification framework designed to improve the reliability of reinforcement learning in multimodal models. Reinforcement learning is a training method where AI models learn by receiving rewards for desired behaviors and penalties for undesired ones, gradually improving their performance through trial and error.

Rather than rewarding only correct behaviors, Argos evaluates how those behaviors were produced. It draws on a pool of larger, more capable teacher models and rule-based checks to verify two things: first, that the objects and events a model references actually exist in its input, and second, that the model’s reasoning aligns with what it observes. Argos rewards the model when both conditions are met. In practice, these rewards help curate high-quality training data and guide the model’s further training.

How Argos works

Argos functions as a verification layer on top of an existing multimodal model. Given an image or video, a task or query, and information about the model’s reasoning and output, Argos identifies where the model indicates objects are located in the image, when it indicates events occur in a video, and what action or answer it produces.

Argos then applies specialized tools tailored to the specific content to evaluate and score three aspects of the model’s output. It checks whether the answer is correct, whether referenced objects and events appear at the indicated locations and times, and whether the reasoning is consistent with the visual evidence and the answer (Figure 1).

These scores are combined using a gated aggregation function, a method that dynamically adjusts the importance of different scores. It emphasizes reasoning checks only when the final output is correct. This design prevents unreliable feedback from dominating training and produces a stable reward signal for reinforcement learning.
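As a rough illustration, the sketch below shows one possible gated aggregation: the grounding and reasoning scores contribute to the reward only when the final answer is correct. The weights and score names are assumptions for illustration, not the Argos implementation.

def gated_reward(answer_correct: bool,
                 grounding_score: float,
                 reasoning_score: float,
                 w_grounding: float = 0.3,
                 w_reasoning: float = 0.2) -> float:
    if not answer_correct:
        return 0.0                      # gate closed: reasoning checks are ignored
    # Gate open: add credit for grounded references and reasoning consistent with the evidence.
    return 1.0 + w_grounding * grounding_score + w_reasoning * reasoning_score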

Figure 1. Argos selects different specialized tools to verify and score the accuracy of referenced points and events in the agent’s reasoning.

Using Argos to curate data for supervised fine-tuning

Argos also helps curate high-quality training data to provide the model with a strong foundation in grounded reasoning. Before the reinforcement learning stage begins, Argos uses a multi-stage process to generate data that is explicitly tied to visual locations and time intervals.

In the first stage, Argos identifies the objects, actions, and events that are relevant to a task and links them to specific locations in images or specific moments in videos. These references are overlaid on images and selected video frames. Next, a reasoning model generates step-by-step explanations that refer to these visual locations and time spans.

Finally, Argos evaluates each generated example for accuracy and visual grounding, filtering out low-quality training data and retaining only data that is both correct and well-grounded in visual input. The resulting dataset is then used in an initial training phase, where the model learns to generate reasoning steps before producing its final output. This process is illustrated in Figure 2.

Figure 2. Argos generates step-by-step reasoning grounded in image locations and video timestamps, then filters out low-quality training data.

Evaluation

Building on this foundation in grounded reasoning, we further trained the model using reinforcement learning guided by Argos and evaluated its performance across a range of benchmarks. On spatial reasoning tasks, the Argos-trained model outperformed both the base model Qwen2.5-VL-7B and the stronger Video-R1 baseline across challenging 3D scenarios and multi-view tasks. Models trained with Argos also showed a substantial reduction of hallucinations compared with both standard chain-of-thought prompting and reinforcement learning baselines.

Finally, we evaluated the model in robotics and other real-world task settings, focusing on high-level planning and fine-grained control. Models trained with Argos performed better on complex, multi-step tasks. Notably, these improvements were achieved using fewer training samples than existing approaches, highlighting the importance of reward design in producing more capable and data-efficient agents. Figure 3 illustrates some of these findings.

Figure 3. Performance of Argos compared with baseline models on the task of visual hallucination detection (left) and embodied task planning and completion (right).

How Argos shapes reinforcement learning

To understand how Argos affects learning, we took the same vision-language model that had been trained on our curated dataset and fine-tuned it using reinforcement learning in two different ways. In one approach, Argos was an agentic verifier, checking the correctness of outputs and the quality of reasoning. In the other, the model received feedback only on whether its answers were correct.

We evaluated both versions on 1,500 samples from a new dataset and tracked their performance throughout the learning process (Figure 4). Although they started at similar levels, the model without Argos quickly got worse. Its accuracy steadily declined, and it increasingly gave answers that ignored what was in the videos. It learned to game the system by producing answers that seemed correct without grounding them in visual evidence.

The model trained with Argos showed the opposite pattern. Accuracy improved steadily, and the model became better at linking its reasoning to what appeared in the videos. This difference highlights the value of verification: when training rewards both correct outputs and sound reasoning based on visual and temporal evidence, models learn to be more reliable rather than simply finding shortcuts to high scores.

Figure 4. Comparison of response accuracy changes with and without Argos across two model versions (left) and differences in visual grounding accuracy over training for both versions (right).

Potential impact and looking forward

This research points toward a different way of building AI agents for real-world applications. Rather than fixing errors after they occur, it focuses on training agents to systematically anchor their reasoning in what they actually receive as input throughout the training process.

The potential applications span many domains. A visual assistant for a self-driving car that verifies what’s actually in an image is less likely to report phantom obstacles. A system that automates digital tasks and checks each action against what’s displayed on the screen is less likely to click the wrong button.

As AI systems move beyond research labs into homes, factories, and offices, reliable reasoning becomes essential for safety and trust. Argos represents an early example of verification systems that evolve alongside the AI models they supervise. Future verifiers could be tailored for specific fields like medical imaging, industrial simulations, and business analytics. As more advanced models and richer data sources become available, researchers can use them to improve these verification systems, providing even better guidance during training and further reducing hallucinations.

We hope that this research helps move the field toward AI systems that are both capable and interpretable: agents that can explain their decisions, point to the evidence behind them, and be trained to adhere to real-world requirements and values.


The post Argos: Multimodal reinforcement learning with agentic verifier for AI agents appeared first on Microsoft Research.

Categories: Microsoft

OptiMind: A small language model with optimization expertise

Thu, 01/15/2026 - 16:00
At a glance
  • Many real-world business problems can benefit from optimization, but translating decisions, constraints, and goals from natural language into optimization algorithms is slow.
  • OptiMind is a small language model designed to convert business problems described in natural language into the mathematical formulations needed by optimization software.
  • OptiMind is trained on a carefully curated, expert-aligned dataset and applies domain-specific hints and self-checks at inference time, improving its accuracy.
  • OptiMind matches or exceeds the performance of much larger systems, can run locally to protect sensitive data, produces more reliable formulations, and reduces the time and expertise needed to prepare optimization models.

Enterprises across industries, from energy to finance, use optimization models to plan complex operations like supply chains and logistics. These models work by defining three elements: the choices that can be made (such as production quantities or delivery routes), the rules and limits those choices must follow, and the goal, whether that’s minimizing costs, meeting customer demand, or improving efficiency.
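To make those three elements concrete, here is a minimal sketch using the PuLP library: decision variables, constraints, and an objective. The products, costs, demands, and capacity are made-up numbers for illustration.

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpStatus

products = ["A", "B"]
unit_cost = {"A": 4.0, "B": 6.0}
demand = {"A": 100, "B": 150}
capacity = 300                                      # total units the plant can produce

model = LpProblem("production_plan", LpMinimize)

# Decisions: how much of each product to make
qty = {p: LpVariable(f"qty_{p}", lowBound=0) for p in products}

# Goal: minimize total production cost
model += lpSum(unit_cost[p] * qty[p] for p in products)

# Rules and limits: meet demand without exceeding capacity
for p in products:
    model += qty[p] >= demand[p]
model += lpSum(qty[p] for p in products) <= capacity

model.solve()
print(LpStatus[model.status], {p: qty[p].value() for p in products})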

Over the past few decades, many businesses have shifted from judgment-based decision-making to data-driven approaches, leading to major efficiency gains and cost savings. Advances in AI promise to accelerate this shift even further, potentially cutting decision times from days to minutes while delivering better results.

In practice, however, turning real-world business problems into a form that optimization software can understand is challenging. This translation process requires expressing decisions, constraints, and objectives in mathematical terms. The work demands specialized expertise, and it can take anywhere from one day to several weeks to solve complex problems. 

To address this challenge, we’re introducing OptiMind, a small language model designed to convert problems described in plain language into the mathematical formulations that optimization software needs. Built on a 20-billion parameter model, OptiMind is compact by today’s standards yet matches the performance of larger, more complex systems. Its modest size means it can run locally on users’ devices, enabling fast iteration while keeping sensitive business data on the device rather than transmitting it to external servers.

How it works

OptiMind incorporates knowledge from optimization experts both during training and when it’s being used to improve formulation accuracy at scale. Three stages enable this: domain-specific hints improve training data quality, the model undergoes fine-tuning, and expert reasoning guides the model as it works.

Figure 1. From problem description to solution 

One of the central challenges in developing OptiMind was the poor quality of existing public datasets for optimization problems. Many examples were incomplete or contained incorrect solutions. To address this, we developed a systematic approach that combines automation with expert review. It organizes problems into well-known categories, such as scheduling or routing, and identifies common error patterns within each. Using these insights, we generated expert-verified “hints” to guide the process, enabling the system to regenerate higher-quality solutions and filter out unsolvable examples (Figure 2). The result is a training dataset that more accurately reflects how optimization experts structure problems.

Figure 2. Process for correcting training data

Using this refined dataset, we applied supervised fine-tuning to the base model. Rather than simply generating code, we trained OptiMind to produce structured mathematical formulations alongside intermediate reasoning steps, helping it avoid the common mistakes found in earlier datasets.

When in use, the model’s reliability further improves. When given a new problem, OptiMind first classifies it into a category, such as scheduling or network design. It then applies expert hints relevant to that type of problem, which act as reminders to check for errors before generating a solution. For particularly challenging problems, the system generates multiple solutions and either selects the most frequently occurring one or uses feedback to refine its response. This approach increases accuracy without requiring a larger model, as illustrated in Figure 3.
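The sketch below illustrates that inference flow: classify the problem, attach category-specific hints, sample several candidate formulations, and keep the most frequent one. The classify(), hints_for(), and generate() callables are placeholders for the actual model calls, not OptiMind's API.

from collections import Counter

def formulate(problem_text: str, classify, hints_for, generate, num_samples: int = 5) -> str:
    category = classify(problem_text)                        # e.g., "scheduling" or "network design"
    prompt = f"{hints_for(category)}\n\n{problem_text}"      # prepend expert hints for that category
    candidates = [generate(prompt) for _ in range(num_samples)]
    formulation, _ = Counter(candidates).most_common(1)[0]   # majority vote over sampled formulations
    return formulation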

Figure 3. OptiMind’s inference process

Evaluation

To test the system, we turned to three widely used public benchmarks that represent some of the most complex formulation tasks in the field. On closer inspection, we discovered that 30 to 50 percent of the original test data was flawed. After we manually corrected these issues, OptiMind improved accuracy by approximately 10 percent over the base model. Figure 4 and Table 1 show detailed comparisons: OptiMind outperformed other open-source models under 32 billion parameters and, when combined with expert hints and correction strategies, matched or exceeded the performance of current leading models.

Figure 4. Average accuracy percentages over all models.

Table 1. Performance of all models on corrected benchmark datasets

OptiMind is more reliable than other models because it learns from higher-quality, domain-aligned data. And by correcting errors and inconsistencies in standard datasets, we significantly reduced the model’s tendency to hallucinate relative to the base and comparison models.

Looking forward

While supervised fine-tuning has provided a strong foundation, we are exploring reinforcement learning to further refine OptiMind’s reasoning capabilities. We’re also investigating automated frameworks that would allow LLMs to generate their own expert hints, enabling continuous autonomous improvement. Additionally, we are working with Microsoft product teams and industry collaborators to expand OptiMind’s utility, adding support for more programming languages and a variety of input formats, including Excel and other widely used tools.

We’re releasing OptiMind as an experimental model to gather community feedback and inform future development. The model is available through Microsoft Foundry (opens in new tab) and Hugging Face (opens in new tab), and we’ve open-sourced the benchmarks and data-processing procedures on GitHub (opens in new tab) to support more reliable evaluation across the field. We welcome feedback through GitHub (opens in new tab), and invite those interested in shaping the future of optimization to apply for one of our open roles.


The post OptiMind: A small language model with optimization expertise appeared first on Microsoft Research.

Categories: Microsoft
