
ADeLe: Predicting and explaining AI performance across tasks

Microsoft Research - Wed, 04/01/2026 - 18:00
At a glance
  • AI benchmarks report performance on specific tasks but provide limited insight into underlying capabilities; ADeLe evaluates models by scoring both tasks and models across 18 core abilities, enabling direct comparison between task demands and model capabilities.
  • Using these ability scores, the method predicts performance on new tasks with ~88% accuracy, including for models such as GPT-4o and Llama-3.1.
  • It builds ability profiles and identifies where models are likely to succeed or fail, highlighting strengths and limitations across tasks.
  • By linking outcomes to task demands, ADeLe explains differences in performance, showing how it changes as task complexity increases.

AI benchmarks report how large language models (LLMs) perform on specific tasks but provide little insight into the underlying capabilities that drive that performance. They do not explain failures or reliably predict outcomes on new tasks. To address this, Microsoft researchers, in collaboration with Princeton University and Universitat Politècnica de València, introduce ADeLe (AI Evaluation with Demand Levels), a method that characterizes both models and tasks using a broad set of capabilities, such as reasoning and domain knowledge, so performance on new tasks can be predicted and linked to specific strengths and weaknesses in a model.

In a paper published in Nature, “General Scales Unlock AI Evaluation with Explanatory and Predictive Power,” the team describes how ADeLe moves beyond aggregate benchmark scores. Rather than treating evaluation as a collection of isolated tests, it represents both benchmarks and LLMs using the same set of capability scores. These scores can then be used to estimate how a model will perform on tasks it has not encountered before. The research was supported by Microsoft’s Accelerating Foundation Models Research (AFMR) grant program.

ADeLe-based evaluation

ADeLe scores tasks across 18 core abilities, such as attention, reasoning, and domain knowledge, assigning each task a value from 0 to 5 based on how much it requires each ability. For example, a basic arithmetic problem might score low on quantitative reasoning, but an Olympiad-level proof would score much higher.

Evaluating a model across many such tasks produces an ability profile—a structured view of where the model performs and where it breaks down. Comparing this profile to the demands of a new task makes it possible to identify the specific gaps that lead to failure. The process is illustrated in Figure 1.

Figure 1. Top: (1) Model performance on the ADeLe benchmark and (2) the resulting ability profiles, showing each model’s strengths and limitations across core abilities. Bottom: (1) Application of 18 scoring criteria to each task and (2) the resulting task profiles, showing the abilities each task requires.

Evaluating ADeLe

Using ADeLe, the team evaluated a range of AI benchmarks and model behaviors to understand what current evaluations capture and what they miss. The results show that many widely used benchmarks provide an incomplete and sometimes misleading picture of model capabilities and that a more structured approach can clarify those gaps and help predict how models will behave in new settings.

ADeLe shows that many benchmarks do not isolate the abilities they are intended to measure or only cover a limited range of difficulty levels. For example, a test designed to evaluate logical reasoning may also depend heavily on specialized knowledge or metacognition. Others focus on a narrow range of difficulty, omitting both simpler and more complex cases. By scoring tasks based on the abilities they require, ADeLe makes these mismatches visible and provides a way to diagnose existing benchmarks and design better ones.

Applying this framework to 15 LLMs, the team constructed ability profiles using 0–5 scores for each of 18 abilities. For each ability, the team measured how performance changes with task difficulty and used the difficulty level at which the model has a 50% chance of success as its ability score. Figure 2 illustrates these results as radial plots that show where the model performs well and where it breaks down.
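To make the scoring concrete, here is a minimal sketch of how an ability score could be derived from per-difficulty success rates, using linear interpolation to find the level where success crosses 50%; the curve values below are illustrative, not taken from the paper.

```python
def ability_score(success_by_level):
    """success_by_level: (difficulty, success_rate) pairs sorted by difficulty.
    Returns the interpolated difficulty at which success crosses 50%."""
    for (d0, s0), (d1, s1) in zip(success_by_level, success_by_level[1:]):
        if s0 >= 0.5 >= s1:  # success falls through 50% in this interval
            return d0 + (s0 - 0.5) * (d1 - d0) / (s0 - s1)
    # Never crosses 50%: cap at the highest level tested, or bottom out at 0.
    return success_by_level[-1][0] if success_by_level[-1][1] >= 0.5 else 0.0

# Illustrative success curve for one ability, over demand levels 0-5.
curve = [(0, 0.95), (1, 0.90), (2, 0.70), (3, 0.40), (4, 0.15), (5, 0.05)]
score = ability_score(curve)  # falls between levels 2 and 3
```

Repeating this fit for each of the 18 abilities yields one radial profile per model.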

Figure 2. Ability profiles for 15 LLMs across 18 abilities. Left: OpenAI models. Middle: Llama models. Right: DeepSeek-R1 distilled models.

This analysis shows that models differ in their strengths and weaknesses across abilities. Newer models generally outperform older ones, but not consistently across all abilities. Performance on knowledge-heavy tasks depends strongly on model size and training, while reasoning-oriented models show clear gains on tasks requiring logic, learning, abstraction, and social inference. These patterns typically require multiple, separate analyses across different benchmarks and can still produce conflicting conclusions when task demands are not carefully controlled. ADeLe surfaces them within a single framework.

ADeLe also enables prediction. By comparing a model’s ability profile to the demands of a task, it can forecast whether the model will succeed, even on tasks it has not seen before. In experiments, this approach achieved approximately 88% accuracy for models like GPT-4o and Llama-3.1-405B, outperforming traditional methods. This makes it possible to both explain and anticipate potential failures before deployment, improving the reliability and predictability of AI model assessment.
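As a toy illustration of profile-versus-demand prediction: the paper fits calibrated predictors over all 18 dimensions, whereas the hard threshold rule and the ability values below are simplifying assumptions for the sketch.

```python
def predict_success(ability, demand):
    """Predict success when the model's ability meets or exceeds the task's
    demand on every dimension the task requires (a simplifying hard rule)."""
    return all(ability.get(k, 0.0) >= v for k, v in demand.items())

# Hypothetical 0-5 scores on three of the 18 dimensions.
profile = {"quantitative_reasoning": 3.2, "domain_knowledge": 2.8, "attention": 4.0}
easy    = {"quantitative_reasoning": 2.0, "attention": 1.0}
hard    = {"quantitative_reasoning": 2.0, "domain_knowledge": 3.5}

will_pass = predict_success(profile, easy)   # ability covers every demand
will_fail = predict_success(profile, hard)   # domain_knowledge 2.8 < 3.5
```

A failed prediction also localizes the gap: here the shortfall is specifically in domain knowledge, not reasoning.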

Whether AI systems can truly reason is a central debate in the field. Some studies report strong reasoning performance, while others show that performance breaks down at scale. These conflicting results largely reflect differences in task difficulty. ADeLe shows that benchmarks labeled as measuring “reasoning” vary in what they require, from basic problem-solving to tasks that combine advanced logic, abstraction, and domain knowledge. The same model can score above 90% on lower-demand tests and below 15% on more demanding ones, reflecting differences in task requirements rather than a change in capability.

Reasoning-oriented models like OpenAI’s o1 and GPT-5 show measurable gains over standard models—not only in logic and mathematics but also in interpreting user intent. However, performance declines as task demands increase. AI systems can reason, but only up to a point, and ADeLe identifies where that point is for each model.

Looking ahead

ADeLe is designed to evolve alongside advances in AI and can be extended to multimodal and embodied AI systems. It also has the potential to serve as a standardized framework for AI research, policymaking, and security auditing.

More broadly, it advances a more systematic approach to AI evaluation—one that explains system behavior and predicts performance. This work builds on earlier efforts, including Microsoft research on applying psychometrics to AI evaluation and recent work on Societal AI, emphasizing the importance of AI evaluation.

As general-purpose AI systems continue to outpace existing evaluation methods, approaches like ADeLe offer a path toward more rigorous and transparent assessment in real-world use. The research team is working to expand this effort through a broader community. Additional experiments, benchmark annotations, and resources are available on GitHub.


The post ADeLe: Predicting and explaining AI performance across tasks appeared first on Microsoft Research.

Categories: Microsoft

AsgardBench: A benchmark for visually grounded interactive planning

Microsoft Research - Thu, 03/26/2026 - 21:02
At a glance
  • To successfully complete tasks, embodied AI agents must ground and update their plans based on visual feedback.
  • AsgardBench isolates whether agents can use visual observations to revise their plans as tasks unfold.
  • Spanning 108 controlled task instances across 12 task types, the benchmark requires agents to adapt their plans based on what they observe.
  • Because objects can be in different positions and states (e.g., clean or dirty), the same instruction can require different action sequences, even in the same environment.

Imagine a robot tasked with cleaning a kitchen. It needs to observe its environment, decide what to do, and adjust when things don’t go as expected, for example, when the mug it was tasked to wash is already clean, or the sink is full of other items. This is the domain of embodied AI: systems that perceive their environment and act within it.

The field has made rapid progress, but evaluating these systems is harder than it looks. Many benchmarks test perception, navigation, and physical control all at once, making it difficult to isolate whether an AI agent is actually using what it perceives to make better decisions or just getting lucky because the environment is predictable enough to script around.

To address this, we created AsgardBench. In the paper, “AsgardBench — Evaluating Visually Grounded Interactive Planning Under Minimal Feedback,” we describe how this benchmark poses a simple but demanding challenge: give an AI agent a household task, let it observe the environment through images, and see whether it can adjust its plan when what it perceives contradicts what it anticipated. Can it notice that the mug it needs to clean is already in the sink, or that it isn’t, and behave accordingly? That is the core question AsgardBench is designed to answer.

Built on AI2-THOR, an interactive 3D simulation environment used to train and evaluate AI agents on household tasks, AsgardBench positions agents near objects and gives them a small, fixed set of actions, such as find, pickup, put, clean, and toggle_on/off. At each turn, the agent proposes a full sequence of steps to complete the task, but only the first step executes. Throughout, the focus is squarely on plan adaptation, not whether an agent can navigate a room or manipulate an object, but whether it can use what it perceives to revise its next step.

For example, the agent may discover a mug to be clean, dirty, or filled with coffee, or it may observe that a sink contains many other items, so the same instruction can require different action sequences as the task unfolds. This process is illustrated in Figure 1.

Figure 1: Agent observations and corresponding action plans in AsgardBench. Each image is paired with the plan generated from that observation. This illustrates how AsgardBench requires agents to update or change their plans based on new visual evidence rather than following a fixed sequence.

How it works

Agents start in interaction-ready positions, so navigation and viewpoint selection are not factors. A find action brings objects into view, and the environment handles the details of container sizing and placement, so the agent does not need to reason about which cabinet or countertop to use. The only inputs are color images, a history of attempted actions with simple success or failure signals, and the agent’s own record of what it plans to do next.

At each turn, the agent proposes a complete sequence of steps to finish the task, but only the first step proceeds. It then receives new images and a simple signal—did that action succeed or fail? This prevents the agent from scripting everything upfront and forces it to re-evaluate and revise its plan at every step. Built-in limits on total steps and repeated actions prevent endless loops. Because the environment provides only simple feedback, the agent must be able to notice what it perceives (e.g., whether a mug is dirty, whether a faucet is running) and keep track of where it is in the task from one step to the next.
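The interaction pattern described above can be sketched as a simple loop. The environment and agent below are toy stubs written for illustration; none of these names come from the AsgardBench API.

```python
def run_episode(agent, env, max_steps=20):
    """One AsgardBench-style episode: the agent proposes a full plan each
    turn, but only the first step executes, and the only feedback is a
    success/failure flag plus a fresh observation."""
    history = []                                  # (action, succeeded) pairs
    obs = env.observe()
    for _ in range(max_steps):
        plan = agent.propose_plan(obs, history)   # full sequence of steps
        if not plan:
            break
        succeeded = env.execute(plan[0])          # only the first step runs
        history.append((plan[0], succeeded))
        obs = env.observe()                       # new images after acting
        if env.task_done():
            return True
    return False

class ToyEnv:
    """Toy stand-in: actions succeed only in the required order."""
    def __init__(self, required):
        self.required, self.done = list(required), 0
    def observe(self):
        return f"state:{self.done}"
    def execute(self, action):
        if self.done < len(self.required) and action == self.required[self.done]:
            self.done += 1
            return True
        return False
    def task_done(self):
        return self.done == len(self.required)

class ScriptedAgent:
    """Toy agent that replans the remaining steps from its success history."""
    def __init__(self, script):
        self.script = script
    def propose_plan(self, obs, history):
        completed = sum(ok for _, ok in history)
        return self.script[completed:]

finished = run_episode(ScriptedAgent(["find mug", "clean mug"]),
                       ToyEnv(["find mug", "clean mug"]))
```

The step cap in `max_steps` plays the role of the benchmark's built-in limits: an agent that keeps retrying a doomed first step simply runs out of turns.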

Evaluating AsgardBench

We tested several leading vision-capable models on AsgardBench and observed that high-performing models require visual grounding to consistently succeed. Across the models, visual input substantially improved performance: most models more than doubled success rates when given images versus text-only descriptions of the scene. This is in contrast to some prior benchmarks where agents could perform reasonably well without vision by relying on textual feedback on what went wrong.

Providing that kind of detailed failure information raises performance for all models in AsgardBench, too, but it can mask the real problem. The strongest vision-capable models still outperform text-only agents even when those agents are given detailed feedback, demonstrating that the benchmark requires visual grounding that text alone cannot replicate. These results are illustrated in Figure 2.

Figure 2. Success rates for image-based and text-only conditions. Visual input substantially improves performance for all but the weakest agents, while text-only performance remains low, indicating that AsgardBench requires perception-based reasoning.

The results also revealed where today’s agents consistently fall short. Across all models, the same problems kept appearing: agents attempted undoable actions (e.g., trying to clean a mug that was not in the sink), got stuck in repeated action loops, misinterpreted subtle visual cues (on/off, clean/dirty), and lost track of where they were in the task progress from one step to the next. This points to three weaknesses: the inability to distinguish subtle visual details in cluttered scenes, the inability to maintain an accurate picture of task progress across multiple steps, and the inability to consistently translate what the agent sees into timely updates to its plan. Taken together, these point to where the next generation of embodied agents will need to improve.

Implications and looking ahead

AsgardBench is useful as both a diagnostic and development tool. By varying what feedback agents receive (none, minimal, or detailed), researchers can isolate whether performance gains come from better perception, better memory, or better planning. Promising directions include systems that combine stronger visual understanding with better state tracking, training approaches that emphasize learning to repair plans mid-task, and evaluation methods that measure not just whether an agent succeeds but how well it adapted along the way.

The failure patterns AsgardBench surfaces point toward a concrete next step: building systems that can make finer visual distinctions, keep track of what changed more reliably across steps, and learn to revise plans mid-task rather than plowing ahead on a script. Agents that make progress on these challenges should be meaningfully better equipped for the messiness of real-world environments: unexpected object states, cluttered scenes, and the constant need to adapt.

AsgardBench is open source and available on GitHub, providing a foundation for advancing research in visually grounded planning.

Acknowledgements

We thank the AI2-THOR community for building the simulation platform and making reproducible embodied evaluation possible.




GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation

Microsoft Research - Thu, 03/26/2026 - 18:03
At a glance
  • VLM-based robot planners struggle with long, complex tasks because natural-language plans can be ambiguous, especially when specifying both actions and locations.
  • GroundedPlanBench evaluates whether models can plan actions and determine where they should occur across diverse, real-world robot scenarios.
  • Video-to-Spatially Grounded Planning (V2GP) is a framework that converts robot demonstration videos into spatially grounded training data, enabling models to learn planning and grounding jointly.
  • Grounded planning improves both task success and action accuracy, outperforming decoupled approaches in benchmark and real-world evaluations.

Vision-language models (VLMs) use images and text to plan robot actions, but they still struggle to decide what actions to take and where to take them. Most systems split these decisions into two steps: a VLM generates a plan in natural language, and a separate model translates it into executable actions. This approach often breaks down for long, complex tasks because natural-language plans can be ambiguous or even hallucinated when specifying actions and locations (Figure 1). Because planning and spatial reasoning are handled separately, errors in one stage can propagate to the next. This raises a key question: can a VLM determine both what to do and where to do it simultaneously?

Figure 1. Failures in VLM-based task planners, where ambiguous language leads to non-executable actions.

Planning with spatial grounding

To address this problem, we developed GroundedPlanBench. In our paper, “Spatially Grounded Long-Horizon Task Planning in the Wild,” we describe how this new benchmark evaluates whether VLMs can plan actions and determine where those actions should occur across diverse real-world environments. We also built Video-to-Spatially Grounded Planning (V2GP), a framework that converts robot demonstration videos into training data to help VLMs learn this capability.

Evaluating these with both open- and closed-source VLMs, we found that grounded planning for long, complex tasks is challenging. At the same time, V2GP improves both planning and grounding, with gains validated on our benchmark and in real-world experiments using robots.

How GroundedPlanBench works

To create realistic robot scenarios, we built our benchmark from 308 robot manipulation scenes in the Distributed Robot Interaction Dataset (DROID), a large collection of recordings of robots performing tasks. We worked with experts to review each scene and define tasks that a robot could perform. Each task was written in two styles: explicit instructions that clearly describe the actions (e.g., “put a spoon on the white plate”) and implicit instructions that describe the goal more generally (e.g., “tidy up the table”).

For each task, the plan was broken down into four basic actions—grasp, place, open, and close—each tied to a specific location in the image. Grasp, open, and close actions were linked to a box drawn around the target object, while place actions were linked to a box showing where the object should be placed.

Figure 2 illustrates medium- and long-duration tasks, along with their explicit and implicit instructions. In total, GroundedPlanBench contains 1,009 tasks, ranging from 1–4 actions (345 tasks) to 5–8 (381) and 9–26 (283).
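As an illustration, a grounded task in this style might be represented as follows; the field names and pixel coordinates are assumptions for the sketch, not the dataset's actual schema.

```python
# Hypothetical record for one grounded task: each action carries a bounding
# box [x1, y1, x2, y2] in the image, per the description above.
grounded_plan = {
    "instruction": "put a spoon on the white plate",          # explicit style
    "actions": [
        {"type": "grasp", "object": "spoon",       "box": [412, 233, 470, 301]},
        {"type": "place", "object": "white plate", "box": [120, 180, 260, 320]},
    ],
}

def valid_boxes(plan):
    """Check that every action carries a well-formed [x1, y1, x2, y2] box."""
    return all(x1 < x2 and y1 < y2
               for a in plan["actions"]
               for x1, y1, x2, y2 in [a["box"]])
```

Tying each action to a box like this is what removes the ambiguity of a purely textual plan: two identical spoons get two distinct grasp targets.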

Figure 2. Examples of tasks in GroundedPlanBench.

How V2GP works

The V2GP framework first detects moments when the robot interacts with objects using the recorded gripper signals. It then generates a text description of the manipulated object with a multimodal language model. Guided by this description, the system tracks the object across the video using Meta’s advanced open-vocabulary image and video segmentation model, SAM3. The system then constructs grounded plans from the tracking results, identifying the object’s location at the moment it is grasped and where it is placed.

This process is illustrated in Figure 3. It yielded 43K grounded plans with varying lengths: 34,646 plans with 1–4 actions, 4,368 with 5–8 actions, and 4,448 with 9–26 actions.

Figure 3. The V2GP framework converts robot videos into spatially grounded plans.

Evaluating decoupled versus grounded planning

To evaluate GroundedPlanBench in real-world robotic settings, we used Qwen3-VL as our base model. Qwen3-VL is a vision-language model that processes text, images, and video to support multimodal reasoning. It performs well on standard multimodal reasoning benchmarks without additional training. We first evaluated it, along with other proprietary models, on GroundedPlanBench without any task-specific training (Table 1). We then fine-tuned it on V2GP training data and compared it with a decoupled approach, in which planning and grounding are handled separately.

In this setup, a VLM first generated a plan describing what the robot should do. We used GPT-5.2 or Qwen3-VL-4B for this step. The plan was then passed to a spatial grounding model, Embodied-R1, which converted the plans into executable signals. Embodied-R1 is a large vision-language model trained for embodied reasoning and pointing, where the model identifies specific locations in the image to guide the robot’s actions. We selected it for spatial grounding because its training targets embodied spatial reasoning and point-based localization, making it well suited for grounding model outputs to specific locations in an image.

Figure 4 highlights a key limitation of this approach: ambiguity in natural language. For example, Qwen3-VL-4B generated grasp actions by referring to “napkin on the table” for all four napkins in the scene, leading Embodied-R1 to ground each action to the same napkin. GPT-5.2 produced more descriptive phrases, such as “top-left napkin” or “upper-center napkin,” but these were still too imprecise for the model to reliably distinguish between them and were again grounded to the same object.

Figure 4. Decoupled vs. grounded planning, illustrating how ambiguous language causes actions to be grounded to the wrong objects.

This limitation becomes more pronounced in real-world robot manipulation, where environments are often cluttered and complex. As a result, decoupled approaches struggle to work reliably. In contrast, our approach, grounded planning, performs planning and grounding jointly within a single model and improves both planning and grounding performance.

Table 1 presents evaluation results for open- and closed-source VLMs on GroundedPlanBench. Multi-step planning and handling of implicit instructions were challenging for all models, while training Qwen3-VL-4B and Qwen3-VL-32B with V2GP led to significant improvements in grounded planning.

Table 1. Evaluation results on GroundedPlanBench. Task Success Rate (TSR) measures the percentage of tasks completed correctly, requiring all actions to be both correctly planned and spatially grounded. Action Recall Rate (ARR) measures the proportion of generated actions that match the sub-actions defined in the dataset, regardless of order. The V2GP approach improves performance on both metrics and achieves the best results (shown in bold).
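Under simplifying assumptions, the two metrics defined above can be sketched as follows: each task is reduced to lists of gold and predicted action strings, with spatial-grounding checks folded into string equality, and ARR is computed as one plausible reading (gold sub-actions recovered, order ignored).

```python
def tsr(tasks):
    """Task Success Rate: fraction of tasks whose predicted actions exactly
    match the gold actions (all-or-nothing)."""
    return sum(pred == gold for gold, pred in tasks) / len(tasks)

def arr(tasks):
    """Action Recall Rate: fraction of gold sub-actions recovered by the
    predictions, regardless of order. Each prediction matches at most once."""
    hits = total = 0
    for gold, pred in tasks:
        remaining = list(pred)
        for action in gold:
            total += 1
            if action in remaining:
                hits += 1
                remaining.remove(action)
    return hits / total

# Two hypothetical tasks: one full success, one partial.
tasks = [
    (["grasp spoon", "place plate"], ["grasp spoon", "place plate"]),
    (["open drawer", "grasp cup"],   ["grasp cup"]),
]
```

The gap between the two numbers is informative: a model can recover most individual sub-actions (high ARR) while still failing most complete tasks (low TSR) on long-horizon plans.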

Implications and looking forward

Integrating planning and grounding within a single model offers a path to more reliable robot manipulation in real-world settings. Rather than relying on separate stages, this approach keeps decisions about what to do and where to act tightly coupled. Even so, models still struggle with longer, multi-step tasks and implicit instructions: they must reason over longer sequences of actions, maintain consistency across many steps, and handle goals described indirectly, as in everyday language.

Looking ahead, a promising direction combines grounded planning with world models, which enable robots to predict the outcomes of actions before executing them. Together, these capabilities could allow robots to decide what to do, where to act, and what will happen next, bringing us closer to systems that can plan and act reliably in the real world.

Acknowledgements

This research was conducted in collaboration with Korea University, Microsoft Research, and the University of Wisconsin-Madison, and was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (No. RS-2025-25439490) funded by the Korea government (MSIT).




Systematic debugging for AI agents: Introducing the AgentRx framework

Microsoft Research - Thu, 03/12/2026 - 18:38
At a glance
  • Problem: Debugging AI agent failures is hard because trajectories are long, stochastic, and often multi-agent, so the true root cause gets buried.
  • Solution: AgentRx pinpoints the first unrecoverable (“critical failure”) step by synthesizing guarded, executable constraints from tool schemas and domain policies, then logging evidence-backed violations step-by-step.
  • Benchmark + taxonomy: We release AgentRx Benchmark with 115 manually annotated failed trajectories across τ-bench, Flash, and Magentic-One, plus a grounded nine-category failure taxonomy.
  • Results + release: AgentRx improves failure localization (+23.6%) and root-cause attribution (+22.9%) over prompting baselines, and we are open-sourcing the framework and dataset.

As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency.

When a human makes a mistake, we can usually trace the logic. But when an AI agent fails, perhaps by hallucinating a tool output or deviating from a security policy ten steps into a fifty-step task, identifying exactly where and why things went wrong is an arduous, manual process.

Today, we are excited to announce the open-source release of AgentRx, an automated, domain-agnostic framework designed to pinpoint the “critical failure step” in agent trajectories. Alongside the framework, we are releasing the AgentRx Benchmark, a dataset of 115 manually annotated failed trajectories to help the community build more transparent, resilient agentic systems.

The challenge: Why AI agents are hard to debug

Modern AI agents are often:

  • Long-horizon: They perform dozens of actions over extended periods.
  • Probabilistic: The same input might lead to different outputs, making reproduction difficult.
  • Multi-agent: Failures can be “passed” between agents, masking the original root cause.

Traditional success metrics (like “Did the task finish?”) don’t tell us enough. To build safe agents, we need to identify the exact moment a trajectory becomes unrecoverable and capture evidence for what went wrong at that step.

Introducing AgentRx: An automated diagnostic “prescription”

AgentRx (short for “Agent Diagnosis”) treats agent execution like a system trace that needs validation. Instead of relying on a single LLM to “guess” the error, AgentRx uses a structured, multi-stage pipeline:

  1. Trajectory normalization: Heterogeneous logs from different domains are converted into a common intermediate representation.
  2. Constraint synthesis: The framework automatically generates executable constraints based on tool schemas (e.g., “The API must return a valid JSON response”) and domain policies (e.g., “Do not delete data without user confirmation”).
  3. Guarded evaluation: AgentRx evaluates constraints step-by-step, checking each constraint only when its guard condition applies, and produces an auditable validation log of evidence-backed violations.
  4. LLM-based judging: Finally, an LLM judge uses the validation log and a grounded failure taxonomy to identify the Critical Failure Step—the first unrecoverable error.
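Steps 2 and 3 of the pipeline above can be sketched as follows, with assumed data shapes rather than the actual AgentRx API: each constraint pairs a guard (does it apply to this step?) with a check, and violations are logged with the offending step as evidence.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    name: str
    guard: Callable[[dict], bool]   # does this constraint apply to the step?
    check: Callable[[dict], bool]   # True means the step satisfies it

def evaluate(trajectory, constraints):
    """Evaluate guarded constraints step-by-step, producing an auditable
    log of evidence-backed violations."""
    log = []
    for i, step in enumerate(trajectory):
        for c in constraints:
            if c.guard(step) and not c.check(step):
                log.append({"step": i, "constraint": c.name, "evidence": step})
    return log

# Example constraint: tool outputs must parse as structured data, checked
# only on tool-call steps (the guard skips thoughts and messages).
constraints = [
    Constraint(
        name="tool_output_is_structured",
        guard=lambda s: s["kind"] == "tool_call",
        check=lambda s: isinstance(s.get("output"), dict),
    ),
]
trajectory = [
    {"kind": "thought",   "text": "I should look up the order"},
    {"kind": "tool_call", "tool": "get_order", "output": "<html>error</html>"},
]
violations = evaluate(trajectory, constraints)
```

In the full pipeline, a log like `violations` is what the LLM judge consumes to pick the critical failure step, rather than re-reading the raw trajectory.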
The AgentRx workflow: Given a failed trajectory, tool schemas, and domain policy, AgentRx synthesizes guarded constraints, evaluates them step-by-step to produce an auditable violation log with evidence, and uses an LLM judge to predict the critical failure step and root-cause category.

A New Benchmark for Agent Failures

To evaluate AgentRx, we developed a manually annotated benchmark consisting of 115 failed trajectories across three complex domains:

  • τ-bench: Structured API workflows for retail and service tasks.
  • Flash: Real-world incident management and system troubleshooting.
  • Magentic-One: Open-ended web and file tasks using a generalist multi-agent system.

Using a grounded-theory approach, we derived a nine-category failure taxonomy that generalizes across these domains. This taxonomy helps developers distinguish between a “Plan Adherence Failure” (where the agent ignored its own steps) and an “Invention of New Information” (hallucination).

  • Plan Adherence Failure: Ignored required steps / did extra unplanned actions
  • Invention of New Information: Altered facts not grounded in trace/tool output
  • Invalid Invocation: Tool call malformed / missing args / schema-invalid
  • Misinterpretation of Tool Output: Read tool output incorrectly; acted on wrong assumptions
  • Intent–Plan Misalignment: Misread user goal/constraints and planned wrongly
  • Under-specified User Intent: Could not proceed because required info wasn’t available
  • Intent Not Supported: No available tool can do what’s being asked
  • Guardrails Triggered: Execution blocked by safety/access restrictions
  • System Failure: Connectivity/tool endpoint failures

Analysis of failure density across domains. In multi-agent systems like Magentic-One, trajectories often contain multiple errors, but AgentRx focuses on identifying the first critical breach.

Key Results

In our experiments, AgentRx demonstrated significant improvements over existing LLM-based prompting baselines:

  • +23.6% absolute improvement in failure localization accuracy.
  • +22.9% improvement in root-cause attribution.

By providing the “why” behind a failure through an auditable log, AgentRx allows developers to move beyond trial-and-error prompting and toward systematic agentic engineering.

Join the Community: Open Source Release

We believe that agent reliability is a prerequisite for real-world deployment. To support this, we are open sourcing the AgentRx framework and the complete annotated benchmark.

We invite researchers and developers to use AgentRx to diagnose their own agentic workflows and contribute to the growing library of failure constraints. Together, we can build AI agents that are not just powerful but also auditable and reliable.

Acknowledgements

We would like to thank Avaljot Singh and Suman Nath for contributing to this project.




PlugMem: Transforming raw agent interactions into reusable knowledge

Microsoft Research - Tue, 03/10/2026 - 18:00
At a glance
  • Today’s AI agents store long interaction histories but struggle to reuse them effectively.
  • Raw memory retrieval can overwhelm agents with lengthy, low-value context.
  • PlugMem transforms interaction history into structured, reusable knowledge.
  • A single, general-purpose memory module improves performance across diverse agent benchmarks while using fewer memory tokens.

It seems counterintuitive: giving AI agents more memory can make them less effective. As interaction logs accumulate, they grow large, fill with irrelevant content, and become increasingly difficult to use.

More memory means that agents must search through larger volumes of past interactions to find information relevant to the current task. Without structure, these records mix useful experiences with irrelevant details, making retrieval slower and less reliable. The challenge is not storing more experiences, but organizing them so that agents can quickly identify what matters in the moment.

In our recent paper “PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents,” we introduce a plug-and-play memory system that transforms raw agent interactions into reusable knowledge. Rather than treating memory as text to retrieve, PlugMem organizes that history into structured knowledge designed to support decisions as the agent acts.

Cognitive science offers a useful framework here. It distinguishes between remembering events, knowing facts, and knowing how to perform tasks. Past events provide context, but effective decisions rely on the facts and skills extracted from those events.

This distinction motivated a shift in how we decided to design memory for AI agents. PlugMem implements this shift by converting the agent’s interaction history, such as dialogues, documents, and web sessions, into structured, compact knowledge units that can be reused across tasks.

How PlugMem works

A key difference between PlugMem and conventional AI memory systems is what gets stored. Traditional approaches store text chunks or named entities (references to people, places, and concepts). PlugMem uses facts and reusable skills as the fundamental building blocks of memory. This design reduces redundancy, increases information density, and improves retrieval precision. It’s built around three core components:

Structure. Raw interactions are standardized and transformed into propositional knowledge (facts) and prescriptive knowledge (reusable skills). These knowledge units are organized into a structured memory graph, enabling knowledge to be stored in a form designed for reuse.

Retrieval. Rather than retrieving long passages of text, PlugMem retrieves knowledge units that are aligned with the current task. High-level concepts and inferred intents serve as routing signals, surfacing the most relevant information for the decision at hand.

Reasoning. Retrieved knowledge is distilled into concise, task-ready guidance before being passed to the base agent, ensuring that only decision-relevant knowledge enters the agent’s context window.
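The three components above can be sketched in code. This is an illustrative toy only: PlugMem's actual data model and retrieval logic are described in the paper, and every class, field, and function name here is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeUnit:
    kind: str          # "fact" (propositional) or "skill" (prescriptive)
    content: str       # compact, reusable statement extracted from raw interactions
    concepts: set = field(default_factory=set)  # high-level routing signals

class MemoryGraph:
    def __init__(self):
        self.units: list[KnowledgeUnit] = []

    def ingest(self, unit: KnowledgeUnit):
        """Structure: store standardized knowledge units, not raw logs."""
        self.units.append(unit)

    def retrieve(self, task_concepts: set, k: int = 3):
        """Retrieval: rank units by overlap with the task's inferred concepts."""
        return sorted(self.units,
                      key=lambda u: len(u.concepts & task_concepts),
                      reverse=True)[:k]

    def distill(self, task_concepts: set) -> str:
        """Reasoning: compress retrieved units into concise, task-ready guidance."""
        return "\n".join(u.content for u in self.retrieve(task_concepts))
```

The point of the sketch is the shift in unit of storage: the agent's context receives only the distilled guidance string, never the raw interaction history.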

Figure 1 illustrates how these components work together.

Figure 1. PlugMem organizes different types of agent interactions into a knowledge-centric memory graph, enabling structured retrieval and reasoning.

One memory, any task

Most AI memory systems are built for one job. A conversational memory module is designed around dialogue. A knowledge-retrieval system is tuned to look up facts. A web agent’s memory is optimized for navigating pages. Each performs well in its target setting but rarely transfers without significant redesign.

PlugMem takes a different approach. It is a foundational memory layer that can be attached to any AI agent without needing to modify it for a specific task.

Evaluating PlugMem

To test PlugMem, we evaluated the same memory module on three benchmarks that each make different demands on memory:

  • Answering questions across long multi-turn conversations
  • Finding facts that span multiple Wikipedia articles
  • Making decisions while browsing the web

Across all three, PlugMem consistently outperformed both generic retrieval methods and task-specific memory designs while allowing the AI agent to use a significantly smaller memory token budget.

Measuring memory by utility, not size

We wanted to evaluate whether the right information was reaching the agent at the right moment, without overwhelming the model’s context window, which has limited capacity. To do this, we introduced a metric that measures how much useful, decision-relevant information a memory module contributes relative to how much context it consumes.

When we plotted utility against context consumption, PlugMem consistently came out ahead: it delivered more decision-relevant information while consuming less of the AI agent’s context than other approaches, as shown in Figure 2. These results suggest that transforming experience into knowledge—rather than storing and retrieving raw logs—produces memory that is more useful and efficient.
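The paper defines the metric precisely; a toy version of the underlying idea, with a hypothetical function name, might look like this:

```python
# Illustrative sketch, not the paper's actual metric: score a memory module
# by how much decision-relevant information it delivers per token of the
# agent's context window it consumes.

def memory_utility(relevant_tokens: int, injected_tokens: int) -> float:
    """Fraction of injected memory context that was decision-relevant."""
    if injected_tokens == 0:
        return 0.0
    return relevant_tokens / injected_tokens

# A knowledge-centric module injecting 200 tokens (150 relevant) beats a
# raw-log module injecting 2,000 tokens (300 relevant), even though the
# raw-log module delivered more relevant tokens in absolute terms.
assert memory_utility(150, 200) > memory_utility(300, 2000)
```

Under a metric of this shape, shipping fewer, denser knowledge units wins over retrieving long raw passages, which is the pattern Figure 2 shows.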

Figure 2. Across all three benchmarks, PlugMem delivered more useful memory with less of the agent’s context window.

Why general-purpose memory can outperform task-specific designs

General-purpose memory modules can outperform systems tailored to specific tasks because the decisive factor is not specialization but whether memory can surface the right knowledge precisely when the agent needs it. Structure, retrieval, and reasoning each play a distinct role, and getting all three right matters more than optimizing for a single use case.

PlugMem is not meant to replace task-specific approaches. It provides a general memory foundation upon which task adaptations can be layered. Our experiments show that combining PlugMem with task-specific techniques yields further gains.

Toward reusable memory for agents

As AI agents take on longer and more complex tasks, their memory needs to evolve from storing past interactions to actively supplying reusable knowledge. The goal is for agents to carry useful facts and strategies from one task to the next rather than starting from scratch each time.

PlugMem represents a step in that direction, grounding memory design in cognitive principles and treating knowledge as the primary unit of reuse. As agent capabilities expand, knowledge-centric memory may prove to be a critical building block for the next generation of intelligent agents.

Code and experimental results are publicly available on GitHub (opens in new tab) so that others can reproduce the results and conduct their own research.

The post PlugMem: Transforming raw agent interactions into reusable knowledge appeared first on Microsoft Research.

Categories: Microsoft

Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model

Microsoft Research - Wed, 03/04/2026 - 20:05
At a glance
  • Phi-4-reasoning-vision-15B is a compact and smart open‑weight multimodal reasoning model that balances reasoning power, efficiency, and training data needs. It is a broadly capable model that allows for natural interaction for a wide array of vision-language tasks and excels at math and science reasoning and understanding user-interfaces.
  • We share lessons learned and best practices for training a multimodal reasoning model—showing the benefit of careful architecture choices, rigorous data curation, and the benefits of using a mixture of reasoning and non-reasoning data.

We are pleased to announce Phi-4-reasoning-vision-15B, a 15 billion parameter open‑weight multimodal reasoning model, available through Microsoft Foundry (opens in new tab), HuggingFace (opens in new tab) and GitHub (opens in new tab). Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks such as captioning images, answering questions about them, reading documents and receipts, helping with homework, reasoning about changes in sequences of images, and much more. Beyond these general capabilities, it excels at math and science reasoning and at understanding and grounding elements on computer and mobile screens. In particular, our model offers appealing value relative to popular open-weight models, pushing the Pareto frontier of the tradeoff between accuracy and compute cost: it performs competitively with much slower models that require ten times or more compute time and tokens, and achieves better accuracy than similarly fast models, particularly on math and science reasoning.

Figure 1: Phi-4-reasoning-vision-15B presents a compelling option compared to existing models, pushing the Pareto frontier of the tradeoff between accuracy and compute costs. It performs competitively with much slower models that require more time and tokens, and achieves higher accuracy than similarly fast models. These values were computed by averaging accuracy, time, and output token counts over a subset of 4 benchmarks where we had logged these values: ChartQA_TEST, MathVista_MINI, MMMU_VAL, and ScreenSpot_v2.

In this post, we share the motivations, design choices, experiments, and learnings that informed its development, as well as an evaluation of the model’s performance and guidance on how to use it. Our goal is to contribute practical insight to the community on building smaller, efficient multimodal reasoning models and to share an open-weight model that is competitive with models of similar size at general vision-language tasks, excels at computer use, and excels on scientific and mathematical multimodal reasoning.

A focus on smaller and faster vision–language models

Many popular vision-language models (VLMs) have trended toward larger parameter counts and, in particular, larger numbers of tokens consumed and generated. This increases training and inference-time cost and latency, and impedes usability for downstream deployment, especially in resource‑constrained or interactive settings.

A growing countertrend towards smaller (opens in new tab) models aims to boost efficiency, enabled by careful model design and data curation – a goal pioneered by the Phi family of models (opens in new tab) and furthered by Phi-4-reasoning-vision-15B. We specifically build on learnings from the Phi-4 and Phi-4-Reasoning language models and show how a multimodal model can be trained to cover a wide range of vision and language tasks without relying on extremely large training datasets, architectures, or excessive inference‑time token generation. Our model is intended to be lightweight enough to run on modest hardware while remaining capable of structured reasoning when it is beneficial, and it was trained with far less compute than many recent open-weight VLMs of similar size. We used just 200 billion tokens of multimodal data, building on Phi-4-reasoning (trained with 16 billion tokens), itself based on the core Phi-4 model (400 billion unique tokens), compared to more than 1 trillion tokens used to train multimodal models such as Qwen 2.5 VL (opens in new tab) and 3 VL (opens in new tab), Kimi-VL (opens in new tab), and Gemma3 (opens in new tab). The result is a compelling option compared to existing models, pushing the Pareto frontier of the tradeoff between accuracy and compute cost.

Figure 2: Phi-4-Reasoning-Vision can help with a wide range of everyday tasks.

Lessons from training a multimodal model

Training a multimodal reasoning model raises numerous questions and requires many nuanced design choices around model architecture, dataset quality and composition, and the interaction between reasoning‑heavy and non-reasoning perception‑focused tasks.

Model architecture: Early- vs mid-fusion

Model architectures for VLMs differ primarily in how visual and textual information is fused. Mid-fusion models use a pretrained vision encoder to convert images into visual tokens that are projected into a pretrained LLM’s embedding space, enabling cross-modal reasoning while leveraging components already trained on trillions of tokens. Early-fusion models process image patches and text tokens in a single transformer, yielding richer joint representations but at significantly higher compute, memory, and data cost. We adopted a mid-fusion architecture as it offers a practical trade-off for building a performant model with modest resources.

Model architecture: Vision encoder and image processing

We build on the SigLIP-2 (opens in new tab) vision encoder and the Phi-4-Reasoning backbone. In previous research, we found that multimodal language models sometimes struggled to solve tasks, not because of a lack of reasoning proficiency, but rather an inability to extract and select relevant perceptual information from the image. An example would be a high-resolution screenshot that is information-dense with relatively small interactive elements.

Several open-source multimodal language models have adapted their methodologies accordingly, e.g., Gemma3 (opens in new tab) uses pan-and-scan and NVILA (opens in new tab) uses Dynamic S2. However, their trade-offs are difficult to understand across different datasets and hyperparameters. To this end, we conducted an ablation study of several techniques, training a smaller 5 billion parameter Phi-4-based proxy model on a dataset of 10 million image-text pairs, primarily composed of computer-use and GUI grounding data. We compared:

  • Dynamic S2, which resizes images to a rectangular resolution that minimizes distortion while admitting a tiling by 384×384 squares;
  • Multi-crop, which splits the image into potentially overlapping 384×384 squares and concatenates their encoded features on the token dimension;
  • Multi-crop with S2, which broadens the receptive field by cropping into 1536×1536 squares before applying S2; and
  • Dynamic resolution, using the NaFlex variant of SigLIP-2, a natively dynamic-resolution encoder with adjustable patch counts.

Our primary finding is that dynamic resolution vision encoders perform best, especially on high-resolution data. It is particularly interesting to compare dynamic resolution with 2048 vs 3600 maximum tokens: the latter roughly corresponds to native HD 720p resolution and enjoys a substantial boost on high-resolution benchmarks, particularly ScreenSpot-Pro. Reinforcing the high-resolution trend, we find that multi-crop with S2 outperforms standard multi-crop despite using fewer visual tokens (i.e., fewer crops overall). The dynamic resolution technique produces the most tokens on average; due to their tiling subroutine, S2-based methods are constrained by the original image resolution and often use only about half the maximum tokens. Based on these experiments, we chose the SigLIP-2 NaFlex variant as our vision encoder.
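The difference in token behavior between the two families can be made concrete. The helper functions below are a hypothetical sketch, not the models' actual preprocessing code; the 384-pixel tile size, 3600-token budget, and patch-based accounting follow the descriptions above.

```python
import math

def s2_tiles(width: int, height: int, tile: int = 384) -> int:
    """S2-style tiling: cover the (resized) image with fixed 384x384 squares,
    so token count is quantized by the tiling and capped by image resolution."""
    return math.ceil(width / tile) * math.ceil(height / tile)

def dynamic_resolution_patches(width, height, patch=16, max_tokens=3600):
    """NaFlex-style dynamic resolution: spend the patch budget directly on the
    native image, downscaling only when the native patch count exceeds it.
    (16-pixel patches are an assumption for illustration.)"""
    native = math.ceil(width / patch) * math.ceil(height / patch)
    return min(native, max_tokens)

# A 1280x720 (HD 720p) screenshot: with 16px patches the 3600-token budget
# exactly matches the native patch count (80 * 45), so no detail is discarded.
print(dynamic_resolution_patches(1280, 720))  # 3600
```

This is why the dynamic-resolution configuration uses its full budget on average while the tile-based methods, gated by original image resolution, often do not.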

Method | Max Tokens | MathVista | ScreenSpot | ScreenSpot-Pro | V*Bench
Dynamic-S2 | 3096 | 42.9 | 78.4 | 9.4 | 52.9
Multi-crop | 3096 | 43.4 | 67.8 | 5.4 | 51.8
Multi-crop with S2 | 2048 | 43.4 | 79.1 | 10.6 | 57.1
Dynamic resolution | 2048 | 45.2 | 81.5 | 9.2 | 51.3
Dynamic resolution | 3600 | 44.9 | 79.7 | 17.5 | 56.0

Table 1: Results with different resolution handling approaches. The top two configurations on each benchmark are in bold.

Data: Quality and composition

As with its language backbone Phi-4-Reasoning, Phi-4-reasoning-vision-15B was trained with a deliberate focus on data quality. Our final dataset consists primarily of data from three sources: open-source datasets which were meticulously filtered and improved; high-quality domain-specific internal data; and high-quality data from targeted acquisitions. The overwhelming majority of our data lies in the first category: data which originated as open-source data, which were significantly filtered and improved, whether by removing low-quality datasets or records, programmatically fixing errors in data formatting, or using open-source images as seeds to synthetically generate higher-quality accompanying text.

The process of improving open-source data began by manually reviewing samples from each dataset. Typically, 5 to 10 minutes were sufficient to classify data as excellent-quality, good questions with wrong answers, low-quality questions or images, or high-quality with formatting errors. Excellent data was kept largely unchanged. For data with incorrect answers or poor-quality captions, we re-generated responses using GPT-4o and o4-mini, excluding datasets where error rates remained too high. Low-quality questions proved difficult to salvage, but when the images themselves were high quality, we repurposed them as seeds for new caption or visual question answering (VQA) data. Datasets with fundamentally flawed images were excluded entirely. We also fixed a surprisingly large number of formatting and logical errors across widely used open-source datasets.

We extracted additional value from existing datasets through reformatting, diversification, and using images as seeds for new data generation. We generated detailed image descriptions alongside original QA pairs for math and science data, had data perform “double-duty” by embedding instruction-following requirements directly into domain-specific QA, created “scrambled,” “caption-matching,” and “what’s changed?” records to improve multi-image reasoning and sequential navigation for CUA scenarios, and diversified prompt styles to encourage robustness beyond perfectly structured questions.

To supplement the improved open-source data, we used high-quality internal datasets, several math-specific datasets acquired during training of the Phi-4 language model, and some domain-specific curated data; for example, LaTeX-OCR data generated by processing and rendering equations from arXiv documents.

Figure 3: Phi-4-reasoning-vision-15B training data composition and examples

Data: Mathematics vs. computer-use data proportion

One of our goals was to train a model that performs well across general vision-language tasks, while excelling at mathematical and scientific reasoning and computer-use scenarios. How to structure datasets for generalizable reasoning remains an open question—particularly because the relationship between data scale and reasoning performance can lead to starkly different design decisions, such as training a single model on a large dataset versus multiple specialized models with targeted post-training.

Research on long-tailed classification robustness has suggested that balancing or removing data from overrepresented tasks or subgroups (opens in new tab) is an effective method for ensuring good performance. Nevertheless, these insights are not fully utilized or explored when it comes to training VLMs, which at times have favored scale over careful data balancing. To achieve our goals, we conducted a set of experiments to analyze a range of data ratios between our focus domains.

Using the same 5 billion parameter proxy model as for previous experiments, we trained while varying the amount of mathematics and science vs. computer-use data for each run. Each dataset included the same subset of 1 million general image-text pairs as a baseline. For mathematics and science data, we used a subsample of 150,000 records, optionally duplicating each one up to three times. Next, we included up to 450,000 computer-use records, and optionally an additional 400,000 from Phi-Ground.

We found that multimodal mathematics and science performance was not harmed by additional computer-use data, and vice versa. Interestingly, we found that increasing mathematics data by 3x while keeping computer-use data constant improved math, science, and computer-use benchmarks.
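One run of this ablation can be sketched as a simple mixture builder. The function name is hypothetical; the sizes mirror the experiment described above (a shared 1M general baseline, a 150K math/science subsample optionally duplicated up to three times, and a computer-use subset).

```python
def build_mixture(general, math_science, cua, math_dup=1):
    """Assemble one training run: shared general baseline, math/science
    records repeated math_dup times, plus the computer-use records."""
    return general + math_science * math_dup + cua

# Placeholder records standing in for image-text pairs.
general = ["general"] * 1_000_000
math_sci = ["math"] * 150_000
cua = ["cua"] * 450_000

# The 3x-math configuration from Table 2: 1M + 450K + 450K = 1.9M records.
run = build_mixture(general, math_sci, cua, math_dup=3)
print(len(run))  # 1900000
```

Duplication is the balancing lever here: it raises the effective proportion of the underrepresented domain without collecting new data.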

General | Math and Science | CUA | Total | MMMU | MathVista | ScreenSpot-V2
1M | 150K | 450K | 1.6M | 44.0 | 37.4 | 48.2
1M | 150K | 850K | 2.0M | 44.1 | 37.3 | 60.0
1M | 450K | 450K | 1.9M | 45.3 | 36.0 | 48.3
1M | 450K | 850K | 2.3M | 43.4 | 38.9 | 63.1
1M | 150K | 150K | 1.3M | 44.2 | 36.9 | 29.8
1M | 150K | 250K | 1.4M | 45.4 | 37.4 | 37.7

Table 2: Varying the ratios of math and CUA data. Increasing math data by 3x while keeping computer-use data constant improves both math and computer-use benchmarks.

Data: Synthetic data for text-rich visual reasoning

Recent work (opens in new tab) suggests that targeted synthetic data can materially improve multimodal reasoning, particularly for text-rich visual domains such as charts, documents, diagrams, and rendered mathematics. Using images, questions, and answers that are programmatically generated and grounded in the visual structure enables precise control over visual content and supervision quality, resulting in data that avoids many annotation errors, ambiguities, and distributional biases common in scraped datasets. This enables cleaner alignment between visual perception and multi-step inference, which has been shown to translate into measurable gains on reasoning-heavy benchmarks.

Synthetic text-rich images expand coverage of long-tail visual formats that are underrepresented in real data but disproportionately impact reasoning accuracy, improving not only visual grounding but also downstream reasoning by ensuring that failures are less often caused by perceptual errors. We found that programmatically generated synthetic data is a useful augmentation to high-quality real datasets — not a replacement, but a scalable mechanism for strengthening both perception and reasoning that complements the training objectives in compact multimodal models such as Phi-4-reasoning-vision-15B.

Mixing non-reasoning and reasoning as a design objective

In language-only settings, reasoning traces have improved performance on many tasks, but they require additional compute, which adds undesired latency. In multimodal settings, this tradeoff is less clear-cut: for tasks such as image captioning and optical character recognition (OCR), reasoning is often unnecessary and can even be harmful (opens in new tab), while mathematical and scientific problem-solving benefits from multi-step reasoning. Thus, the choice of when to reason can be quite nuanced.

Training approaches for multimodal reasoning models

Language-only reasoning models are typically created through supervised fine-tuning (SFT) or reinforcement learning (RL): SFT is simpler but requires large amounts of expensive reasoning trace data, while RL reduces data requirements at the cost of significantly increased training complexity and compute. Multimodal reasoning models follow a similar process, but the design space is more complex. With a mid-fusion architecture, the first decision is whether the base language model is itself a reasoning or non-reasoning model. This leads to several possible training pipelines:

  • Non-reasoning LLM → reasoning multimodal training: Reasoning and multimodal capabilities are trained together.
  • Non-reasoning LLM → non-reasoning multimodal → reasoning multimodal training: Multimodal capabilities are learned first, then reasoning is added.
  • Reasoning LLM → reasoning multimodal training: A reasoning base is used, but all multimodal data must include reasoning traces.
  • Our approach: Reasoning LLM → mixed non-reasoning / reasoning multimodal training. A reasoning-capable base is trained on a hybrid data mixture, learning when to reason and when to respond directly.

Approaches 1 and 2 offer flexibility in designing multimodal reasoning behavior from scratch using widely available non-reasoning LLM checkpoints but place a heavy burden on multimodal training. Approach 1 must teach visual understanding and reasoning simultaneously and requires a large amount of multimodal reasoning data, while Approach 2 can be trained with less reasoning data but risks catastrophic forgetting, as reasoning training may degrade previously learned visual capabilities. Both risk weaker reasoning than starting from a reasoning-capable base. Approach 3 inherits strong reasoning foundations, but like Approach 1, it requires reasoning traces for all training data and produces reasoning traces for all queries, even when not beneficial.

Our approach: A mixed reasoning and non-reasoning model

Phi-4-reasoning-vision-15B adopts the 4th approach listed previously, as it balances reasoning capability, inference efficiency, and data requirements. It inherits a strong reasoning foundation but uses a hybrid approach to combine the strengths of alternatives while mitigating their drawbacks. Our model defaults to direct inference for perception-focused domains where reasoning adds latency without improving accuracy, avoiding unnecessary verbosity and reducing inference costs, and it invokes longer reasoning paths for domains, such as math and science, that benefit from structured multi-step reasoning (opens in new tab).

Our model is trained with SFT, where reasoning samples wrap a chain-of-thought section in dedicated reasoning tags before the final answer, covering domains like math and science. Non-reasoning samples are tagged with a token signaling a direct response, and cover perception-focused tasks such as captioning, grounding, OCR, and simple VQA. Reasoning data comprises approximately 20% of the total mix. Starting from a reasoning-capable backbone means this data grounds existing reasoning in visual contexts rather than teaching the model to reason from scratch.

This approach is not without limitations. The balance between modes is a direct function of design choices we made, informed by recent literature (opens in new tab) and observed model behavior during training—though the boundary between modes can be imprecise as it is learned implicitly from the data distribution. Our model allows control through explicit mode-control tokens in the prompt when the user wants to override the default reasoning behavior. The 20/80 reasoning-to-non-reasoning data split may not be optimal for all domains or deployment contexts. Evaluating the ideal balance of data and the model’s ability to switch appropriately between modes remains an open problem.

We view this mixed approach not as a definitive solution, but as one practical and well-motivated point in the design space for balancing latency, accuracy, and flexibility in multimodal systems.
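The mode-override mechanism can be sketched as a small prompt-building helper. The `<think>` and `<no_think>` literals below are placeholders for illustration only; the model's actual control-token strings are defined by its chat template and model card, not here.

```python
# Hypothetical control tokens; the real strings come from the model's
# chat template, not from this sketch.
THINK, NO_THINK = "<think>", "<no_think>"

def build_prompt(question, mode=None):
    """Default (mode=None): let the model choose a mode from the task.
    Otherwise, prepend an explicit control token to force reasoning
    or a direct response."""
    prefix = {"think": THINK, "no_think": NO_THINK}.get(mode, "")
    return f"{prefix}{question}"

# Force step-by-step reasoning for a chart question:
print(build_prompt("What is the y-intercept in this chart?", mode="think"))
```

In the default mode the model decides per query, which is where the mixed-training data split pays off: perception queries stay cheap while math queries still get a reasoning trace.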

Applications

Figure 4: Phi-4-Reasoning-Vision can interpret sequences of images

Phi-4-reasoning-vision-15B is a high-performing model across many vision-language tasks. It sees and understands the world by looking at a photo, document, chart, or screen and making sense of it. In practice that covers an enormous range of applications; just a few examples include describing images and answering questions about them, interpreting changes and trends in image sequences, recognizing objects and landmarks, and transcribing text.

Highlights: Scientific and mathematical reasoning and supporting computer-using agents (CUA)

In addition to general vision and language tasks, Phi-4-reasoning-vision-15B was designed to excel at tasks that combine visual input with structured inference: solving math problems presented in visual form, such as handwritten or diagram-based questions; extracting and reasoning over quantitative information in documents and charts; and supporting multi-step reasoning in educational or scientific analysis contexts.

Figure 5: Phi-4-reasoning-vision-15B is great at math and science

Figure 6: Phi-4-reasoning-vision-15B can help with written math problems

In addition, we trained Phi-4-reasoning-vision-15B with skills that enable agents to interact with graphical user interfaces by interpreting screen content and selecting actions. With strong high-resolution perception and fine-grained grounding capabilities, Phi-4-reasoning-vision-15B is a compelling base model for training agentic models, such as ones that navigate desktop, web, and mobile interfaces by identifying and localizing interactive elements like buttons, menus, and text fields. Due to its low inference-time requirements, it is well suited to interactive environments where low latency and compact model size are essential.

Figure 7: Phi-4-reasoning-vision-15B can help navigate computer UIs

Evaluation

Phi-4-reasoning-vision-15B was evaluated for accuracy and timing using two complementary open-source frameworks to ensure both rigorous and standardized analysis: Eureka ML Insights (opens in new tab) and VLMEvalKit (opens in new tab).

Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B – force nothink | Phi-4-mm-instruct | Kimi-VL-A3B-Instruct | gemma-3-12b-it | Qwen3-VL-8B-Instruct-4K | Qwen3-VL-8B-Instruct-32K | Qwen3-VL-32B-Instruct-4K | Qwen3-VL-32B-Instruct-32K
AI2D_TEST | 84.8 | 84.7 | 68.6 | 84.6 | 80.4 | 82.7 | 83 | 84.8 | 85
ChartQA_TEST | 83.3 | 76.5 | 23.5 | 87 | 39 | 83.1 | 83.2 | 84.3 | 84
HallusionBench | 64.4 | 63.1 | 56 | 65.2 | 65.3 | 73.5 | 74.1 | 74.4 | 74.9
MathVerse_MINI | 44.9 | 43.8 | 32.4 | 41.7 | 29.8 | 54.5 | 57.4 | 64.2 | 64.2
MathVision_MINI | 36.2 | 34.2 | 20 | 28.3 | 31.9 | 45.7 | 50 | 54.3 | 60.5
MathVista_MINI | 75.2 | 68.7 | 50.5 | 67.1 | 57.4 | 77.1 | 76.4 | 82.5 | 81.8
MMMU_VAL | 54.3 | 52 | 42.3 | 52 | 50 | 60.7 | 64.6 | 68.6 | 70.6
MMStar | 64.5 | 63.3 | 45.9 | 60 | 59.4 | 68.9 | 69.9 | 73.7 | 74.3
OCRBench | 76 | 75.6 | 62.6 | 86.5 | 75.3 | 89.2 | 90 | 88.5 | 88.5
ScreenSpot_v2 | 88.2 | 88.3 | 28.5 | 89.8 | 3.5 | 91.5 | 91.5 | 93.7 | 93.9

Table 3: Accuracy comparisons relative to popular open-weight, non-thinking models

Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B – force thinking | Kimi-VL-A3B-Thinking | gemma-3-12b-it | Qwen3-VL-8B-Thinking-4K | Qwen3-VL-8B-Thinking-40K | Qwen3-VL-32B-Thinking-4K | Qwen3-VL-32B-Thinking-40K
AI2D_TEST | 84.8 | 79.7 | 81.2 | 80.4 | 83.5 | 83.9 | 86.9 | 87.2
ChartQA_TEST | 83.3 | 82.9 | 73.3 | 39 | 78 | 78.6 | 78.5 | 79.1
HallusionBench | 64.4 | 63.9 | 70.6 | 65.3 | 71.6 | 73 | 76.4 | 76.6
MathVerse_MINI | 44.9 | 53.1 | 61 | 29.8 | 67.3 | 73.3 | 78.3 | 78.2
MathVision_MINI | 36.2 | 36.2 | 50.3 | 31.9 | 43.1 | 50.7 | 60.9 | 58.6
MathVista_MINI | 75.2 | 74.1 | 78.6 | 57.4 | 77.7 | 79.5 | 83.9 | 83.8
MMMU_VAL | 54.3 | 55 | 60.2 | 50 | 59.3 | 65.3 | 72 | 72.2
MMStar | 64.5 | 63.9 | 69.6 | 59.4 | 69.3 | 72.3 | 75.5 | 75.7
OCRBench | 76 | 73.7 | 79.9 | 75.3 | 81.2 | 82 | 83.7 | 85
ScreenSpot_v2 | 88.2 | 88.1 | 81.8 | 3.5 | 93.3 | 92.7 | 83.1 | 83.1

Table 4: Accuracy comparisons relative to popular open-weight, thinking models

Our model balances thinking and non-thinking performance: on average, it shows better accuracy in its default “mixed-reasoning” behavior than when forced into thinking or non-thinking mode. Only in a few cases does forcing a specific mode improve performance (MathVerse_MINI and MMMU_VAL for thinking, and ScreenSpot_v2 for non-thinking). Compared to recent popular open-weight models, our model provides a desirable trade-off between accuracy and cost (as a function of inference-time compute and output tokens), as discussed previously.

Note: All numbers here are the result of running benchmarks ourselves and may be lower than other previously shared numbers. Instead of quoting leaderboards, we performed our own benchmarking, so we could understand scaling performance as a function of output token counts for related models. We made our best effort to run fair evaluations and used recommended evaluation platforms with model-specific recommended settings and prompts provided for all third-party models. For Qwen models we use the recommended token counts and also ran evaluations matching our max output token count of 4096. For Phi-4-reasoning-vision-15B, we used our system prompt and chat template but did not do any custom user-prompting or parameter tuning, and we ran all evaluations with temperature=0.0, greedy decoding, and 4096 max output tokens. These numbers are provided for comparison and analysis rather than as leaderboard claims. For maximum transparency and fairness, we will release all our evaluation logs publicly. For more details on our evaluation methodology, please see our technical report (opens in new tab).

Safety

As with other Phi models, Phi-4-reasoning-vision-15B was developed with safety as a core consideration throughout training and evaluation. The model was trained on a mixture of public safety datasets and internally generated examples designed to elicit behaviors the model should appropriately refuse, in alignment with Microsoft’s Responsible AI Principles. For further details, check out our technical report (opens in new tab).

Open release and community engagement

Phi-4-reasoning-vision-15B is available on Microsoft Foundry (opens in new tab) and HuggingFace (opens in new tab) with additional examples and details on GitHub (opens in new tab). For additional guidance on how to use our model properly and safely, please refer to our Model card (opens in new tab). For further details on the technical aspects of the model, training, and evaluation, see our technical report (opens in new tab).

In line with our goal of supporting future AI development in the community, Phi-4-reasoning-vision-15B is released under a permissive license with model weights, fine‑tuning code, and benchmark logs. We intend this release to complement existing work by providing concrete artifacts that help close gaps in understanding how compact multimodal reasoning models can be built and studied.

Looking forward

Smaller vision–language models with selective, task‑aware reasoning offer one promising direction for making multimodal systems more practical and accessible. We present our model and its learnings to inform ongoing research in multimodal modeling, computer‑using agents, and mathematical and scientific reasoning. We hope these details are useful to researchers exploring similar tradeoffs and invite critical evaluation, replication, and extension by the community. If you’d like to join us and help shape the future of multimodal models, please apply for one of our open roles.

Acknowledgements

We thank Rachel Ward for her extensive work on data collection and curation. We thank the GenDatasets, PhiGround, SimCity, and Fara-7B efforts for invaluable training data. We thank Harkirat Behl, Mojan Javaheripi, and Suriya Gunasekar for providing us with Phi-4 checkpoints and guidance on training with Phi models. We additionally thank Sahaj Agarwal, Ahmed Awadallah, Qi Dai, Gustavo de Rosa, Rafah Hosn, Ece Kamar, Piero Kauffmann, Yash Lara, Chong Luo, Caio César Teodoro Mendes, Akshay Nambi, Craig Presti, Matthew Rosoff, Corby Rosset, Marco Rossi, Kashyap Patel, Adil Salim, Sidhartha Sen, Shital Shah, Pratyusha Sharma, Alexey Taymanov, Vibhav Vineet, John Weiss, Spencer Whitehead, the AI Frontiers Team and Leadership, and Microsoft Research Leadership, for their valuable help, insightful discussions, and continued support throughout this work.


The post Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model appeared first on Microsoft Research.

Categories: Microsoft

CORPGEN advances AI agents for real work

Microsoft Research - Thu, 02/26/2026 - 19:06
At a glance
  • Today’s AI agent benchmarks test one task at a time, while real workplace productivity requires managing dozens of interdependent tasks at once. To reflect this, we created a setting called Multi-Horizon Task Environments (MHTEs).
  • Under multi-task loads, leading computer-using agents degrade sharply, with completion rates dropping from 16.7% to 8.7%.
  • CORPGEN introduces digital employees, with hierarchical planning, memory isolation, and experiential learning, delivering up to 3.5 times higher completion rates than baselines across three independent agent backends.
  • Because CORPGEN is architecture-agnostic and modular, its gains come from system design rather than any single base model, and it benefits directly as underlying models improve.

By mid-morning, a typical knowledge worker is already juggling a client report, a budget spreadsheet, a slide deck, and an email backlog, all interdependent and all demanding attention at once. For AI agents to be genuinely useful in that environment, they will need to operate the same way, but today’s best models are evaluated one task at a time, not dozens at once.

In our paper, “CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments,” we propose an agent framework that equips AI with the memory, planning, and learning capabilities to close that gap.

Introducing Multi-Horizon Task Environments

Replicating the reality of workplace multitasking requires a new kind of evaluation environment. In response, we developed Multi-Horizon Task Environments (MHTEs), settings where an agent must manage multiple complex tasks simultaneously. Each task requires 10 to 30 dependent steps within a single session spanning five hours.

To determine what a benchmark would need to test, we ran MHTEs at scale on some of today’s leading AI agents, exposing four weaknesses. First, memory fills up. An agent cannot hold details for multiple active tasks at once. Second, information from one task interferes with reasoning about another. Third, tasks don’t depend on each other in simple sequences. They form complex webs where an agent must constantly check whether upstream work is finished before it can move forward on anything downstream. Fourth, every action cycle requires reprioritizing across all active tasks, not simply resuming where the agent left off.

We also tested three independent agent systems under increasing loads. As the number of concurrent tasks rose from 12 to 46, completion rates fell from 16.7% to 8.7% across all systems.

CORPGEN’s architecture

CORPGEN introduces digital employees: LLM-powered AI agents with persistent identities, role-specific expertise, and realistic work schedules. They operate Microsoft Office applications through GUI automation and perform consistently within MHTEs over hours of continuous activity. Figure 1 illustrates how a digital employee moves through a full workday.

Figure 1. Each day begins with a structured plan and memory loaded from previous sessions. The agent then works through overlapping tasks in repeated cycles, storing key outcomes at day’s end to inform the next session.

CORPGEN addresses each of the four weaknesses of concurrent task execution—memory overload, cross-task interference, dependency complexity, and reprioritization—in a targeted way. Hierarchical planning breaks objectives into daily goals and then into moment-to-moment decisions, allowing the agent to act from a structured plan instead of reviewing all available tasks before each step.

Subagents perform complex operations like web research in isolated contexts, preventing cross-task contamination. A tiered memory system enables selective recall of task-related information rather than retaining everything in active context. Adaptive summarization compresses routine observations while preserving critical information, keeping memory growth controlled.
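The tiered-memory idea can be illustrated with a minimal sketch. CORPGEN's actual implementation is not published here; the class below and its keyword-based summarizer are hypothetical stand-ins that only show the shape of the mechanism: a small always-in-context buffer, a per-task archive, and compressed, selective recall.

```python
from collections import deque

def summarize(observations):
    """Stand-in for adaptive summarization: keep only items flagged
    as critical (here, lines starting with '!'), dropping routine ones."""
    return [o for o in observations if o.startswith("!")]

class TieredMemory:
    """Toy tiered memory: a bounded active buffer plus a per-task archive."""

    def __init__(self, active_limit=4):
        self.active = deque(maxlen=active_limit)  # small, always in context
        self.archive = {}                         # per-task long-term store

    def observe(self, task_id, observation):
        self.active.append((task_id, observation))
        self.archive.setdefault(task_id, []).append(observation)

    def recall(self, task_id):
        """Selective recall: a compressed history for one task only,
        rather than everything the agent has ever seen."""
        return summarize(self.archive.get(task_id, []))

mem = TieredMemory()
mem.observe("report", "opened draft")
mem.observe("report", "!client wants Q3 numbers")
mem.observe("budget", "opened spreadsheet")
print(mem.recall("report"))  # only the flagged, task-relevant item survives
```

The bounded `active` buffer caps context growth, while `recall` pulls in only what a single task needs, which is the property that limits cross-task interference.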

Because these mechanisms are not tied to a specific base model, we tested CORPGEN across three different agents. In each case, we observed consistent gains. The improvements came from the architecture, not from the strength of any particular model. Figure 2 shows how they fit together within CORPGEN’s architecture.

Figure 2. Four mechanisms support concurrent task execution in CORPGEN: hierarchical planning, isolated subagents, tiered memory, and adaptive summarization.

How digital employees collaborate

When multiple digital employees operate in the same environment, collaboration takes shape through standard communication channels, without predefined coordination rules. One employee sends an email requesting data; another picks it up in the next cycle, uses its memory to process it, and responds. This exchange mirrors real workplace communication.

There is no shared internal state between agents. Coordination occurs entirely through email and Microsoft Teams, the same channels many workers use. Over time, these independent exchanges form recognizable organizational patterns. Some agents take on leadership roles; others provide support; shared documents become the connective tissue.

When a communication path breaks, such as an email delivery error, agents reroute messages through alternate channels to keep work moving. The result is a virtual organization that behaves like a real one without being explicitly programmed to do so.

Evaluating CORPGEN

We evaluated CORPGEN on a multi-task benchmark that combined up to 46 tasks into a single six-hour session. Three findings stood out.

Baselines degrade as load increases; CORPGEN does not. All three baseline agent systems showed steady performance declines as task load rose. CORPGEN, by contrast, maintained or improved its completion rates at higher loads. At 46 tasks, CORPGEN completed 15.2% of tasks, compared with 4.3% for the baselines, roughly 3.5 times more.

Experiential learning drives the largest gains. We introduced CORPGEN’s components sequentially: first the orchestration layer, then cognitive tools, and finally experiential learning. The first two produced moderate improvements. Experiential learning, in which agents store records of completed tasks and reuse them when they encounter structurally similar work, produced the largest increase, raising completion rates from 8.7% to 15.2%.
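The experiential-learning loop can be sketched in a few lines. This is an illustrative assumption, not CORPGEN's implementation: task records are reduced to sets of step keywords, and "structurally similar" is approximated by Jaccard overlap.

```python
def similarity(a, b):
    """Jaccard overlap between the step keywords of two task records."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

class ExperienceStore:
    """Toy experience store: keep records of completed tasks and
    retrieve the most structurally similar one for reuse."""

    def __init__(self):
        self.records = []  # (task_steps, outcome_notes)

    def add(self, steps, notes):
        self.records.append((steps, notes))

    def most_similar(self, steps):
        return max(self.records,
                   key=lambda r: similarity(r[0], steps),
                   default=None)

store = ExperienceStore()
store.add(["open", "edit", "email"], "draft report workflow")
store.add(["open", "chart", "save"], "spreadsheet workflow")

# A new task sharing two of three steps with the report workflow
match = store.most_similar(["open", "edit", "send"])
print(match[1])  # the structurally closer record is reused
```

A real system would use richer task representations, but the retrieval-and-reuse pattern is the same: each completed task becomes an asset for the next structurally similar one.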

Evaluation methodology changes the picture. When we inspected the actual output files produced by agents, the results agreed with human judgements roughly 90% of the time. Evaluation based on screenshots and action logs agreed only about 40% of the time. This gap suggests that common evaluation approaches may underestimate what agents actually accomplish in practice.

Implications and looking forward

The results suggest that memory and retrieval, not just raw model capability, may be a key bottleneck in getting agents to work in the real world. The largest gains came from experiential learning. Agents that learn from prior successes and apply those patterns to structurally similar tasks build an advantage over systems that respond to each task in isolation.

CORPGEN also opens a new lens on how AI agents collaborate. Next steps include testing whether agents can maintain memory across multiple workdays and how they coordinate when working in teams. We are also exploring ways to make agents faster and more reliable by combining different methods of interacting with software.

Acknowledgments

This work is a result of a collaboration between the Office of the CTO at Microsoft and the Microsoft AI Development Accelerator Program (MAIDAP). We would like to thank the Microsoft Security Research team for providing resources that supported this research. We also thank the members of the Microsoft UFO2 (opens in new tab) team and the Mem0 (opens in new tab) project for their open-source contributions, which enabled key components of the CORPGEN architecture, and the OSWorld team for the benchmark that served as the foundation for our multi-task evaluation.

Finally, we thank the many contributors to this research: Charlotte Siska, Manuel Raúl Meléndez Luján, Anthony Twum-Barimah, and Mauricio Velazco.


The post CORPGEN advances AI agents for real work appeared first on Microsoft Research.

Categories: Microsoft

Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions

Microsoft Research - Thu, 02/19/2026 - 18:00

Insights from Microsoft’s Media Integrity and Authentication: Status, Directions, and Futures report

It has become increasingly difficult to distinguish fact from fiction when viewing online images and videos. Resilient, trustworthy technologies can help people determine whether the content they are viewing was captured by a camera or microphone—or generated or modified by AI tools. 

We refer to technologies aimed at helping viewers verify the source and history—that is, the provenance—of digital content as media integrity and authentication (MIA) methods. These techniques, driven by the Coalition for Content Provenance and Authenticity (opens in new tab) (C2PA), a standards body dedicated to scaling these capabilities, along with complementary methods such as watermarking and fingerprinting, have become critically important with the rapid advance of AI systems capable of creating realistic imagery, video, and audio at scale.

A convergence of forces

Our team recognized an inflection point in the evolution of online content integrity, driven by the convergence of four forces:

  • Growing saturation of synthetic media, driven by the proliferation of high-fidelity content-generation tools and the explosion of AI-generated or AI-modified media online
  • Forthcoming legislation both nationally and internationally seeking to define what “verifiable” provenance should mean in practice
  • Mounting pressure on implementers to ensure authentication signals are clear and helpful, especially as signals increase when legislation goes into effect in 2026
  • Heightened awareness of potential adversarial attacks that attempt to exploit weaknesses in authenticity systems

The usefulness and trustworthiness of provenance signals, whether certifying content as synthetic or as an authentic capture of real-world scenes, will depend not only on advances in technology, but also on how the broader digital ecosystem adopts, implements, and governs these tools. Aligning around implementation choices that promote consistency and clarity is essential to ensure transparency signals strengthen, rather than erode, public confidence.

To address these challenges, we launched a comprehensive evaluation of the real-world limits, edge cases, and emerging “attack surfaces” for MIA methods. Today, we are publishing our findings in the Media Integrity & Authentication: Status, Directions & Futures report. The report distills lessons learned and outlines practical directions for strengthening media integrity in the years ahead.

Findings and directions forward

Our research recognizes that different media integrity and authenticity methods serve differing purposes and offer distinct levels of protection. After defining each method in detail, we focused on secure provenance (C2PA), imperceptible watermarking, and soft hash fingerprinting across images, audio, and video.

Grounded in our evaluation of these MIA methods across modalities, attack categories, and real-world workflows, several new findings emerged including two new concepts:

  • High-Confidence Provenance Authentication: a critical capability for verifying, under defined conditions, whether claims about the origin of and modifications made to an asset can be validated with high certainty.
  • Sociotechnical Provenance Attacks: attacks aimed at deception and capable of inverting signals, making authentic content appear synthetic, and synthetic content appear authentic.

Drawing on our findings, we identified four promising directions for further strengthening media authentication, along with suggestions to support more effective implementation strategies and future decisions. We’ve summarized the findings and directions below, with additional detail available in the report.

Promising directions and high-level findings

Delivering high-confidence provenance authentication

  • Implementation and display choices may affect the reliability of provenance indicators and how they are interpreted by the public.
  • Using a C2PA provenance manifest for media created and signed in a high-security environment enables high-confidence validation.
  • High-confidence validation is also possible across a broader volume of images, audio, and video when an imperceptible watermark is linked to a C2PA provenance manifest as an additional layer to recover the provenance information if removed.
  • Fingerprinting is not an enabler for high-confidence validation and can involve significant costs at scale. However, it can support manual forensics.

Mitigating confusion from sociotechnical provenance attacks

  • MIA methods are susceptible to sociotechnical attacks on provenance that may mislead the public, resulting in confusion and misplaced trust about an asset’s provenance if there is an overreliance on low-quality signals.
  • Layering and linking secure provenance and imperceptible watermarking methods to achieve high-confidence validation also offers a promising option to both deter and mitigate the impact of attacks.
  • Unintended consequences may result from the use of methods lacking authentication, such as the use of perceptible watermarks in the absence of secure provenance. Perceptible watermarks may cause confusion in cases of forgery or discourage people from consulting high-confidence provenance information via a validation tool, if such perceptible disclosures are taken at face value.
  • UX design that enables users to explore manifest details—such as where edits occurred or regions of interest—has the potential to reduce confusion and support forensics and fact-checking efforts.

Enabling more trusted provenance on edge devices

  • High-confidence results aren’t feasible when provenance is added by a conventional offline device (e.g., a camera or recording device without connectivity).
  • Implementing a secure enclave within the hardware layer of offline devices is essential to make the provenance of captured images, audio, and video more trustworthy.

Investing in ongoing research and policy development

  • All three methods offer organizations valuable tools for addressing operational challenges such as fraud prevention, risk management, and digital accountability.
  • UX and display are promising directions for research. Important directions include in-stream tools that display provenance information where people are and distinguish between high- and lower-confidence provenance signals.
  • Stakeholders should conduct ongoing analysis and red teaming to identify and mitigate weaknesses through technical approaches, policies, and laws.

The journey continues

This report marks the beginning of a new chapter in our media provenance journey (opens in new tab), building on years of foundational work, from developing the very first prototype in 2019 to co-founding the C2PA in 2021 and helping catalyze an ecosystem that has since grown to more than 6,000 members and affiliates (opens in new tab) supporting C2PA Content Credentials. This research represents the next evolution of that long‑standing commitment.

We hope that sharing our learnings will help others prepare for an important wave, especially as generative technologies accelerate and provenance signals multiply. This work is already underway across our products at Microsoft. Together, these directions highlight opportunities for the ecosystem to align, harden, and innovate, so authentication signals are not merely visible, but robust, meaningful, and resilient throughout the content lifecycle.


The post Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions appeared first on Microsoft Research.

Categories: Microsoft

Project Silica’s advances in glass storage technology

Microsoft Research - Wed, 02/18/2026 - 18:11
At a glance
  • Microsoft Research publishes breakthrough in Nature on glass-based data storage that could preserve information for 10,000 years. 
  • New technique extends technology from expensive fused silica to ordinary borosilicate glass found in kitchen cookware. 
  • Innovations enable faster parallel writing, simplified readers (one camera instead of three), and easier manufacturing. 
  • Phase voxel method requires only a single laser pulse, significantly reducing complexity and cost.

Long-term preservation of digital information has long challenged archivists and datacenters, as magnetic tapes and hard drives degrade within decades. Existing archival storage solutions have limited media lifespans that make them less than ideal for preserving information for future generations.

Now, we are excited to report significant progress on Project Silica (opens in new tab), our effort to encode data in glass using femtosecond lasers, a technology that could preserve information for 10,000 years. Glass is a permanent data storage material that is resistant to water, heat, and dust.

In findings published in Nature (opens in new tab), we describe a breakthrough that extends the technology beyond expensive fused silica to ordinary borosilicate glass, a readily available, lower-cost medium and the same material found in kitchen cookware and oven doors. This advance addresses key barriers to commercialization: the cost and availability of storage media. We have unlocked the science for parallel high-speed writing and developed a technique to permit accelerated aging tests on the written glass, suggesting that the data should remain intact for at least 10,000 years.

Storing data inside glass with femtosecond (opens in new tab) laser pulses is one of the few technologies on the horizon with the potential for durable, immutable, and long-lived storage. Although we have been leading innovation in this type of storage for years, prior to this research the technique only worked with pure fused silica glass, a type of glass that is relatively difficult to manufacture and available from only a few sources.

In the paper, we show how data can be stored in borosilicate glass. The new technique stores hundreds of layers of data in glass only 2 mm thick, as with previous methods, but with important improvements. The reader for the glass now needs only one camera, not three or four, reducing cost and size. In addition, the writing devices require fewer parts, making them easier to manufacture and calibrate, and enabling them to encode data more quickly.

Key scientific discoveries

The Nature paper details several key new scientific discoveries:

Advances in birefringent voxel (opens in new tab) writing: For the previous type of data storage in fused silica glass using birefringent (i.e., polarization) voxels, we developed a technique to reduce the number of pulses used to form the voxel from many to only two, critically showing that the polarization of the first pulse is not important to the polarization of the voxel formed. We further developed this to enable pseudo-single-pulse writing, in which a single pulse can be split after its polarization is set to simultaneously form the first pulse for one voxel (where the polarization doesn’t matter) and the second pulse of another (where the set polarization is essential). We demonstrated how to use this pseudo-single-pulse writing to enable fast writing with beam scanning across the media.

Phase voxels, a new storage method: We invented a new type of data storage in glass called phase voxels, in which the phase change of the glass is modified instead of its polarization, showing that only a single pulse is necessary to make a phase voxel. We demonstrated that these phase voxels can also be formed in borosilicate glass and devised a technique to read the phase information from phase voxels encoded in this material. We showed that the much higher levels of three-dimensional inter-symbol interference in phase voxels can be mitigated with a machine learning classification model.

Parallel writing capabilities: By combining a mathematical model of pre-heating and post-heating within the glass with the invention of a multi-beam delivery system, we showed that many data voxels can be written in proximity in the glass at the same time, significantly increasing writing speed. We explained a method for using light emissions (a side effect of voxel formation) for both static calibration and dynamic control to fully support automatic writing operations.

Optimization and longevity testing: We developed a new way to optimize symbol encodings using machine learning and a better way to understand the tradeoff between error rates, error protection, and error recovery when evaluating new digital storage systems. We also created a new nondestructive optical method (opens in new tab) to identify the aging of data storage voxels within the glass, using this and standard accelerated aging techniques to support data lasting 10,000 years. We extended the industry-standard Gray codes to apply to non-power-of-two numbers of symbols.
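As background on the last point, Gray codes generalize beyond binary: a reflected m-ary Gray code enumerates all length-n codewords over m symbols, for any m (power of two or not), so that consecutive codewords differ in exactly one digit, by exactly 1. The sketch below is the textbook reflected construction, offered as an illustration of the idea rather than the paper's specific encoding.

```python
def mary_gray(m, n):
    """Reflected m-ary Gray code: all m**n codewords of n digits over
    symbols 0..m-1, with adjacent codewords differing in one digit by 1.
    Built by appending each leading digit to the previous sequence,
    alternating its direction so block boundaries change only the digit."""
    seq = [[]]
    for _ in range(n):
        out = []
        for d in range(m):
            block = seq if d % 2 == 0 else list(reversed(seq))
            out.extend([d] + w for w in block)
        seq = out
    return seq

codes = mary_gray(3, 2)  # 9 two-digit codewords over 3 symbols
print(codes[:4])         # [[0, 0], [0, 1], [0, 2], [1, 2]]
```

Because a voxel that decodes one symbol off its neighbor corrupts only one digit by one level, such codes limit the bit damage caused by small readout errors, which is why they matter for storage systems with non-binary symbol alphabets.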


A piece of Project Silica media written with data.

A research-grade Writer used to set the record for high speed data writing into glass.

A research-grade Reader for retrieving data from glass.

Close up of Writer showing high-speed multi-beam data encoding on laser pulses.

Demonstrating the technology

As a research initiative, Project Silica has demonstrated these advances through several proofs of concept, including storing Warner Bros.’ “Superman” movie on quartz glass (opens in new tab), partnering with Global Music Vault (opens in new tab) to preserve music under ice for 10,000 years (opens in new tab), and working with students on a “Golden Record 2.0” project (opens in new tab), a digitally curated archive of images, sounds, music, and spoken language, crowdsourced to represent and preserve humanity’s diversity for millennia.

Looking ahead

The research phase is now complete, and we are continuing to consider learnings from Project Silica as we explore the ongoing need for sustainable, long-term preservation of digital information. We have added this paper to our published works so that others can build on this research.

Related work

Project Silica has made scientific advances across multiple areas beyond laser direct writing (LDW) in glass, including archival storage systems design, archival workload analysis, datacenter robotics, erasure coding, free-space optical components, and machine learning-based methods for symbol decoding in storage systems. Many of these innovations were described in our ACM Transactions on Storage publication (opens in new tab) in 2025.


The post Project Silica’s advances in glass storage technology appeared first on Microsoft Research.

Categories: Microsoft

Rethinking imitation learning with Predictive Inverse Dynamics Models

Microsoft Research - Thu, 02/05/2026 - 19:00
At a glance
  • Imitation learning becomes easier when an AI agent understands why an action is taken.
  • Predictive Inverse Dynamics Models (PIDMs) predict plausible future states, clarifying the direction of behavior during imitation learning.
  • Even imperfect predictions reduce ambiguity, making it clearer which action makes sense in the moment.
  • This makes PIDMs far more data‑efficient than traditional approaches.

Imitation learning teaches AI agents by example: show the agent recordings of how people perform a task and let it infer what to do. The most common approach, Behavior Cloning (BC), frames this as a simple question: “Given the current state of the environment, what action would an expert take?”

In practice, this is done through supervised learning, where the states serve as inputs and expert actions as outputs. While simple in principle, BC often requires large demonstration datasets to account for the natural variability in human behavior, but collecting such datasets can be costly and difficult in real-world settings.

Predictive Inverse Dynamics Models (PIDMs) offer a different take on imitation learning by changing how agents interpret human behavior. Instead of directly mapping states to actions, PIDMs break down the problem into two subproblems: predicting what should happen next and inferring an appropriate action to go from the current state to the predicted future state. While PIDMs often outperform BC, it has not been clear why they work so well, motivating a closer look at the mechanisms behind their performance.

In the paper, “When does predictive inverse dynamics outperform behavior cloning?” we show how this two-stage approach enables PIDMs to learn effective policies from far fewer demonstrations than BC. By grounding the selection process in a plausible future, PIDMs provide a clearer basis for choosing an action during inference. In practice, this can mean achieving comparable performance with as few as one-fifth the demonstrations required by BC, even when predictions are imperfect.

Figure 1. BC vs. PIDM architectures. (Top) Behavior Cloning learns how to perform a direct mapping from the current state to an action. (Bottom) PIDMs add a state predictor that predicts future states. They then use an inverse dynamics model to predict the action required to move from the current state towards that future state. Both approaches share a common latent representation through a shared state encoder.

How PIDMs rethink imitation

PIDMs’ approach to imitation learning consists of two core elements: a model that forecasts plausible future states, and an inverse dynamics model (IDM) that predicts the action needed to move from the present state toward that future. Instead of asking, “What action would an expert take?” PIDMs effectively ask, “What would an expert try to achieve, and what action would lead to it?” This shift turns the information in the current observation (e.g., video frame) into a coherent sense of direction, reducing ambiguity about intent and making action prediction easier.
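The two-stage decomposition can be made concrete with a toy example. Everything below is an illustrative assumption rather than the paper's learned models: a hand-written, noisy "state predictor" plays the role of the forecaster, and the inverse dynamics model is trivial because actions in this 2-D world are simply displacements.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_future(state, goal, step=0.5):
    """Stage 1: forecast a plausible next state. Here, a noisy step
    toward the goal stands in for a learned state predictor."""
    delta = goal - state
    dist = np.linalg.norm(delta)
    move = min(step, dist)
    return state + move * delta / (dist + 1e-8) + rng.normal(scale=0.05, size=2)

def inverse_dynamics(state, future_state):
    """Stage 2: infer the action that moves from `state` toward the
    forecast. In this toy world an action is just a displacement."""
    return future_state - state

def pidm_policy(state, goal):
    # "What would an expert try to achieve, and what action leads to it?"
    return inverse_dynamics(state, predict_future(state, goal))

state, goal = np.zeros(2), np.array([3.0, 4.0])
for _ in range(20):
    state = state + pidm_policy(state, goal)
print(np.linalg.norm(goal - state) < 0.5)  # the agent has reached the goal
```

Note that even though every forecast is perturbed by noise, the agent still converges: the predicted future only needs to indicate a direction, which mirrors the paper's point that imperfect predictions can still resolve ambiguity about the right action.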

Real-world validation in a 3D gameplay environment

To evaluate PIDMs under realistic conditions, we trained agents on human gameplay demonstrations in a visually rich video game. These conditions include operating directly from raw video input, interacting with a complex 3D environment in real time at 30 frames per second, and handling visual artifacts and unpredictable system delays.  

The agents ran from beginning to end, taking video frames as input and continuously deciding which buttons to press and how to move the joysticks. Instead of relying on a hand-coded set of game variables and rules, the model worked directly from visual input, using past examples to predict what comes next and choosing actions that moved play in that direction.

We ran all experiments on a cloud gaming platform, which introduced additional delays and visual distortions. Despite these challenges, the PIDM agents consistently matched human patterns of play and achieved high success rates across tasks, as shown in Video 1 below and Videos 2 and 3 in the appendix.

Video 1. A player (left) and a PIDM agent (right) side by side playing the game Bleeding Edge. Both navigate the same trajectory, jumping over obstacles and engaging with nonplayer characters. Despite network delays, the agent closely matches the player’s timing and movement in real time.

Why and when PIDMs outperform BC

Of course, AI agents do not have access to future outcomes. They can only generate predictions based on available data, and those predictions are sometimes wrong. This creates a central trade‑off for PIDMs.

On one hand, anticipating where the agent should be heading can clarify what action makes sense in the present. Knowing the intended direction helps narrow an otherwise ambiguous choice. On the other hand, inaccurate predictions can occasionally steer the model toward the wrong action.

The key insight is that these effects are not symmetric. While prediction errors introduce some risk, reducing ambiguity in the present often matters more. Our theoretical analysis shows that even with imperfect predictions, PIDMs outperform BC as long as the prediction error remains modest. If future states were known perfectly, PIDMs would outperform BC outright.

In practice, this means that clarifying intent often matters more than accurately predicting the future. That advantage is most evident in the situations where BC struggles: where human behavior varies and actions are driven by underlying goals rather than by what is immediately visible on the screen.

BC requires many demonstrations because each example is noisy and open to multiple interpretations. PIDMs, by contrast, sharpen each demonstration by linking actions to the future states they aim to reach. As a result, PIDMs can learn effective action strategies from far fewer examples.

Evaluation

To test these ideas under realistic conditions, we designed a sequence of experiments that begins with a simple, interpretable 2D environment (Video 4 in the appendix) and culminates in a complex 3D video game. We trained both BC and PIDM on very small datasets, ranging from one to fifty demonstrations in the 2D environment and from five to thirty for the 3D video game. Across all tasks, PIDM reached high success rates with far fewer demonstrations than BC.

In the 2D setting, BC needed two to five times more data to match PIDM’s performance (Figure 2). In the 3D game, BC needed 66% more data to achieve comparable results (Video 5 in the appendix).

Figure 2. Performance gains in the 2D environment. As the number of training demonstrations increases, PIDM consistently achieves higher success rates than BC across all four tasks. Curves show mean performance, with shading indicating variability across 20 experiments for reproducibility.

Takeaway: Intent matters in imitation learning

The main message of our investigation is simple: imitation becomes easier when intent is made explicit. Predicting a plausible future, even an imperfect one, helps resolve ambiguity about which action makes sense right now, much like driving more confidently in the fog when the driver already knows where the road is headed. PIDM shifts imitation learning from pure copying toward goal-oriented action.

This approach has limits. If predictions of future states become too unreliable, they can mislead the model about the intended next move. In those cases, the added uncertainty can outweigh the benefit of reduced ambiguity, causing PIDM to underperform BC.

But when predictions are reasonably accurate, reframing action prediction as “How do I get there from here?” helps explain why learning from small, messy human datasets can be surprisingly effective. In settings where data is expensive and demonstrations are limited, that shift in perspective can make a meaningful difference.

Appendix: Visualizations and results (videos)

Video 2. A player, a naïve action-replay baseline, and a PIDM agent playing Bleeding Edge. (Left) The player completes the task under normal conditions. (Middle) The baseline replays the recorded actions at their original timestamps, which initially appears to work. Because the game runs on a cloud gaming platform, however, random network delays quickly push the replay out of sync, causing the trajectory to fail. (Right) Under the same conditions, the PIDM agent behaves differently. Instead of naively replaying actions, it continuously interprets visual input, predicts how the behavior is likely to unfold, and adapts its actions in real time. This allows it to correct delays, recover from deviations, and successfully reproduce the task in settings where naïve replay inevitably fails.

Video 3. A player and a PIDM agent performing a complex task in Bleeding Edge. In this video, the task exhibits strong partial observability: correct behavior depends on whether a location is being visited for the first or second time. For example, on the first encounter, the agent proceeds straight up the ramp; on the second, it turns right toward the bridge. Similarly, it may jump over a box on the first pass but walk around it on the second. The PIDM agent reproduces this trajectory reliably, using coarse future guidance to select actions in the correct direction.

Video 4. Visualization of the 2D navigation environment. These videos show ten demonstrations for each of four tasks: Four Room, Zigzag, Maze, and Multiroom. In all cases, the setup is the same: the character (blue box) moves through the environment and must reach a sequence of goals (red squares). The overlaid trajectories visualize the paths the player took; the models never see these paths. Instead, they observe only their character’s current location, the position of all goals, and whether each goal has already been reached. Because these demonstrations come from real players, no two paths are identical: players pause, take detours, or correct small mistakes along the way. That natural variability is exactly what the models must learn to handle.

Video 5. PIDM vs. BC in a 3D environment. The PIDM agent achieves an 85% success rate with only fifteen demonstrations used in training. The BC agent struggles to stay on track and levels off around 60%. The contrast illustrates how differently the two approaches perform when training data is limited.

The post Rethinking imitation learning with Predictive Inverse Dynamics Models appeared first on Microsoft Research.

Categories: Microsoft

Paza: Introducing automatic speech recognition benchmarks and models for low resource languages

Microsoft Research - Thu, 02/05/2026 - 07:07
At a glance
  • Microsoft Research releases PazaBench and Paza automatic speech recognition models, advancing speech technology for low resource languages.
  • Human-centered pipeline for low-resource languages: Built for and tested by communities, Paza is an end-to-end, continuous pipeline that elevates historically under-represented languages and makes speech models usable in real-world, low-resource contexts.
  • First-of-its-kind ASR leaderboard, starting with African languages: PazaBench is the first automatic speech recognition (ASR) leaderboard for low-resource languages. Launching with 39 African languages and 51 state-of-the-art models, it tracks three key metrics across leading public and community datasets.
  • Human-centered Paza ASR models: Fine-tuned ASR models built with minimal data and grounded in real-world testing with farmers on everyday mobile devices, covering six Kenyan languages: Swahili, Dholuo, Kalenjin, Kikuyu, Maasai, and Somali.

According to the 2025 Microsoft AI Diffusion Report, approximately one in six people globally had used a generative AI product. Yet for billions of people, the promise of voice interaction still falls short, and whilst AI is becoming increasingly multilingual, a key question remains: Do these models actually work for all languages and the people who rely on them? This challenge is one we first confronted through Project Gecko—a collaboration between Microsoft Research and Digital Green (opens in new tab), where field teams across Africa and India focused on building usable AI tools for farmers.

Gecko revealed how often speech systems fail in real‑world, low‑resource environments—where many languages go unrecognized and non‑Western accents are frequently misunderstood. Yet speech remains the primary medium of communication globally. For communities across Kenya, the wider African continent, and beyond, this mismatch creates cascading challenges: without foundational data representing their languages and cultures, innovation stalls, and the digital and AI divides widen.

Paza addresses this with a human-centered speech model pipeline. Through PazaBench, it benchmarks low-resource languages using both public and community-sourced data, and through the Paza models, it fine-tunes speech models to deliver outsized gains in mid- and low-resource languages, evaluating with community testers using real devices in real contexts. Upcoming playbooks complement this work by sharing practical guidance on dataset creation, fine-tuning approaches with minimal data, and evaluation considerations, introducing a continuous pipeline that enables researchers and practitioners to build and evaluate systems grounded in real human use.

How Project Gecko informed Paza’s design

In addition to building cost-effective, adaptable AI systems, the extensive fieldwork on Project Gecko highlighted an important lesson: Building usable speech models in low‑resource settings is not only a data problem, but also a design and evaluation problem. For AI systems to be useful, they must work in local languages, support hands‑free interaction through voice, text, and video, and deliver information in formats that fit real-world environments, that is, on low-bandwidth mobile devices, in noisy settings, and for varying literacy levels.  

These insights shaped the design of Paza, named for the Swahili phrase paza sauti, meaning “to project” or “to raise your voice.” The name reflects our intent: rather than simply adding more languages to existing systems, Paza is about co-creating speech technologies in partnership with the communities who use them. Guided by this principle, Paza puts human use first, which in turn drives model improvement.

Azure AI Foundry Labs

Get a glimpse of potential future directions for AI, with these experimental technologies from Microsoft Research.

Azure AI Foundry

PazaBench: The first ASR leaderboard for low-resource languages

PazaBench is the first automatic speech recognition (ASR) leaderboard dedicated to low‑resource languages. It launches with initial coverage for 39 African languages and benchmarks 52 state‑of‑the‑art ASR and language models, including the newly released Paza ASR models for six Kenyan languages. The platform aggregates leading public and community datasets spanning diverse styles of speech, including conversational, scripted read-aloud, unscripted, broadcast news, and domain-specific recordings, into one easy‑to‑explore view per language. This makes it easier for researchers, developers, and product teams to assess which models perform best across underserved languages and regions, understand trade-offs between speed and accuracy, and identify where gaps persist.

PazaBench tracks three core metrics:

  1. Character Error Rate (CER), which is important for languages with rich word forms, where meaning is built by combining word parts; errors at the character level can significantly change meaning.
  2. Word Error Rate (WER), which measures word-level transcript accuracy.
  3. RTFx (Inverse Real‑Time Factor), which measures how fast transcription runs relative to real‑time audio duration.
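As a concrete illustration, WER and CER can both be computed from the Levenshtein edit distance between a reference transcript and a hypothesis. The sketch below is a minimal, self-contained version of these standard formulas, not PazaBench's own evaluation code:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (dynamic programming)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def wer(ref, hyp):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

ref = "habari ya asubuhi"
hyp = "habari za asubuhi"
print(round(wer(ref, hyp), 3))  # one substituted word out of three
print(round(cer(ref, hyp), 3))  # one substituted character out of seventeen
```

The example shows why CER matters for morphologically rich languages: a single wrong character is one third of the words but only a small fraction of the characters, so the two metrics can disagree sharply.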

Beyond raw scores, PazaBench standardizes evaluation to surface dataset gaps, identify underperforming languages, and highlight where localized models beat wider-coverage ASR models—offering early evidence of the value of African‑centric innovation.

Explore PazaBench

To contribute to the benchmark, request additional language evaluation on the leaderboard.

Paza ASR Models: Built with and for Kenyan languages

The Paza ASR models consist of three fine-tuned ASR models built on top of state‑of‑the‑art model architectures. Each model targets Swahili, a mid-resource language, and five low‑resource Kenyan languages: Dholuo, Kalenjin, Kikuyu, Maasai, and Somali. The models are fine-tuned on public and curated proprietary datasets.

Fine‑tuning the three models allowed us to explore complementary approaches toward a shared goal: building speech recognition systems that are usable in local contexts, starting with the six Kenyan languages, and bridging gaps in multilingual, multimodal video question answering through the MMCT agent. (opens in new tab)

See the MMCT agent in action in the field

Early versions of two models in Kikuyu and Swahili were deployed on mobile devices and tested directly with farmers in real‑world settings, enabling the team to observe how the models performed with everyday use. Farmers provided in‑the‑moment feedback on accuracy, usability, and relevance, highlighting where transcripts broke down, which errors were most disruptive, and what improvements would make the models more helpful in practice. This feedback loop directly informed subsequent fine‑tuning, ensuring model improvements were driven not only by benchmark scores, but by the needs and expectations of the communities they are intended to serve.

Explore Paza Collection Here

Here is how Paza models compare to three state-of-the-art ASR models today:

Figure 1: Character Error Rate (CER) comparison across the Kenyan languages for several state‑of‑the‑art ASR models, including the Paza models. Lower CER indicates better transcription performance.

Figure 2: Word Error Rate (WER) comparison across the Kenyan languages for several state‑of‑the‑art ASR models, including the Paza models. Lower WER indicates better transcription performance.

1) Paza‑Phi‑4‑Multimodal‑Instruct

Microsoft’s Phi‑4 multimodal‑instruct (opens in new tab) is a next‑generation small language model built to reason across audio, text, and vision. With Paza, we extend its audio capabilities, adapting a powerful multimodal architecture into a high‑quality automatic speech recognition (ASR) system for low‑resource African languages.

Fine‑tuned on unified multilingual speech datasets, the model was optimized specifically for transcription in the six languages. It preserves the underlying transformer architecture and multimodal capabilities; only the audio‑specific components were fine-tuned, enabling strong cross‑lingual generalization.

As the results below show, this model delivers consistent improvements in transcription quality across all six languages.

Figure 3: Character Error Rate (CER) comparison across the six languages for the base model versus the fine-tuned Paza model. Lower CER indicates better transcription performance.

Figure 4: Word Error Rate (WER) comparison across the six languages for the base model versus the fine-tuned Paza model. Lower WER indicates better transcription performance.

Test the model here

2) Paza‑MMS‑1B‑All

This model is fine-tuned on Meta’s mms-1b-all model, which employs a large-scale Wav2Vec2.0-style encoder with lightweight language-specific adapters to enable efficient multilingual specialization. For this release, each of the six language adapters was fine‑tuned independently on curated low‑resource datasets, allowing targeted adaptation while keeping the shared encoder largely frozen.
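As a rough illustration of this adapter-style fine-tuning, the sketch below freezes a shared encoder and leaves only a small adapter (and output head) trainable. The tiny model and its parameter names are illustrative stand-ins, not the released training code or the actual MMS architecture:

```python
import torch
import torch.nn as nn

# Illustrative sketch (not the released Paza training code): freeze a shared
# encoder and fine-tune only a lightweight language-specific adapter, in the
# spirit of MMS-style per-language adaptation.

class TinyASRModel(nn.Module):
    def __init__(self, dim=16, vocab=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.adapter = nn.Linear(dim, dim)  # lightweight per-language module
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        return self.head(self.adapter(self.encoder(x)))

model = TinyASRModel()

# Freeze everything except the adapter and the output head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("adapter", "head"))

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only adapter/head parameters will receive gradients
```

Passing only the trainable parameters to the optimizer then gives targeted per-language adaptation while the shared encoder, and its cross-lingual representations, stay intact.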

As shown in the figures below, this model improves transcription accuracy while maintaining the model’s strong cross‑lingual generalization.

Figure 5: Character Error Rate (CER) comparison across the six languages for the base model versus the fine-tuned Paza model. Lower CER indicates better transcription performance.

Figure 6: Word Error Rate (WER) comparison across the six languages for the base model versus the fine-tuned Paza model. Lower WER indicates better transcription performance.

Join the Research Early Access Program

3) Paza‑Whisper‑Large‑v3‑Turbo

This model is fine-tuned from OpenAI’s whisper-large-v3-turbo base model. Whisper is a transformer-based encoder–decoder model that delivers robust automatic speech recognition (ASR) capabilities. This model was fine‑tuned on the entire unified multilingual ASR dataset, covering the six languages mentioned above, to encourage cross-lingual generalization. In addition, an extra post‑processing step was applied to address known Whisper hallucination failure modes, improving transcription reliability.
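The exact post-processing used here is not described, but one well-known Whisper failure mode is looping output, where a phrase repeats implausibly many times. A hypothetical guard against this, shown purely as an illustration of the kind of check such a step might perform, collapses long runs of a repeated phrase:

```python
# Hypothetical post-processing sketch (not the actual Paza pipeline):
# collapse phrases that repeat more than a plausible number of times
# in a row, a common symptom of Whisper "looping" hallucinations.

def collapse_repeats(text, max_repeats=2):
    """Collapse any phrase repeated more than max_repeats times consecutively."""
    words = text.split()
    for size in range(1, len(words) // 2 + 1):  # candidate phrase lengths
        out, i = [], 0
        while i < len(words):
            phrase = words[i:i + size]
            count = 1
            while words[i + count * size:i + (count + 1) * size] == phrase:
                count += 1
            out.extend(phrase * min(count, max_repeats))
            i += count * size
        words = out
    return " ".join(words)

print(collapse_repeats("thank you thank you thank you thank you for watching"))
# -> "thank you thank you for watching"
```

Real systems often combine heuristics like this with decoder-level signals (for example, no-speech probability or compression ratio) to decide when a segment is untrustworthy.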

As shown below, this release achieves improved transcription accuracy while retaining Whisper’s robustness.

Figure 7: Character Error Rate (CER) comparison across the six languages for the base model versus the fine-tuned Paza model. Lower CER indicates better transcription performance.

Figure 8: Word Error Rate (WER) comparison across the six languages for the base model versus the fine-tuned Paza model. Lower WER indicates better transcription performance.

Test the model here

Where do we go from here

AI is reshaping how the world communicates. Designing with people, not just for them, means looking beyond the languages that are already well‑served. We plan to expand PazaBench beyond African languages and evaluate state‑of‑the‑art ASR models across more low‑resource languages globally. The Paza ASR models are an early step; truly supporting small and under‑represented languages requires dedicated datasets, strong local partnerships, and rigorous evaluation. Meaningful progress depends on sustained collaboration with the communities who speak these languages, and expanding responsibly means prioritizing depth and quality over broad but shallow coverage. 

As we continue this work, we’re distilling our methods into a forthcoming playbook to help the broader ecosystem curate datasets, fine‑tune responsibly, and evaluate models in real‑world conditions. And we’re not stopping at speech—additional playbooks will guide teams building AI tools and applications for multilingual, multicultural contexts, and give them practical recommendations for deploying across diverse communities. 

Together, these guides—grounded in technical advances and community‑driven design—share our learnings to help researchers, engineers, and designers build more human‑centered AI systems. 

Acknowledgements

The following researchers played an integral role in this work: Najeeb Abdulhamid, Felermino Ali, Liz Ankrah, Kevin Chege, Ogbemi Ekwejunor-Etchie, Ignatius Ezeani, Tanuja Ganu, Antonis Krasakis, Mercy Kwambai, Samuel Maina, Muchai Mercy, Danlami Mohammed, Nick Mumero, Martin Mwiti, Stephanie Nyairo, Millicent Ochieng and Jacki O’Neill.

We would like to thank the Digital Green (opens in new tab) team—Rikin Gandhi, Alex Mwaura, Jacqueline Wang’ombe, Kevin Mugambi, Lorraine Nyambura, Juan Pablo, Nereah Okanga, Ramaskanda R.S, Vineet Singh, Nafhtari Wanjiku, Kista Ogot, Samuel Owinya and the community evaluators in Nyeri and Nandi, Kenya — for their valuable contributions to this work.

We extend our gratitude to the creators, community contributors, and maintainers of African Next Voices Kenya (opens in new tab), African Next Voices South Africa (opens in new tab), ALFFA (opens in new tab), Digigreen (opens in new tab), Google FLEURS (opens in new tab), Mozilla Common Voice (opens in new tab) and Naija Voices (opens in new tab) whose efforts have been invaluable in advancing African languages speech data.


The post Paza: Introducing automatic speech recognition benchmarks and models for low resource languages appeared first on Microsoft Research.


UniRG: Scaling medical imaging report generation with multimodal reinforcement learning

Microsoft Research - Tue, 01/27/2026 - 19:00
At a glance
  • AI-driven medical image report generation can help medical providers become more efficient and productive.
  • Current models are difficult to train because reporting practices vary widely among providers.
  • Universal Report Generation (UniRG) uses reinforcement learning to align model training with real-world radiology practice rather than proxy text-generation objectives.
  • UniRG has achieved state-of-the-art performance across datasets, metrics, diagnostic tasks, longitudinal settings, and demographic subgroups.
  • Test results show that reinforcement learning, guided by clinically meaningful reward signals, can substantially improve the reliability and generality of medical vision–language models.

AI can produce clinically meaningful radiology reports from medical images such as chest X-rays. Medical image report generation can reduce reporting burden while improving workflow efficiency for healthcare professionals. Beyond these real-world benefits, report generation has also become a critical benchmark for evaluating multimodal reasoning in healthcare AI.

Despite recent advances driven by large vision–language models, current systems still face major limitations in real-world clinical settings. One challenge stems from the wide variation in radiology reporting practices across institutions, departments, and patient populations. A model trained with supervised fine-tuning on one set of data may learn its specific phrasing and conventions instead of more general patterns—a problem known as overfitting. As a result, the model performs well on that data but delivers poor results when evaluated on unseen institutions or external datasets. Moreover, since model training is often aimed at producing text that looks similar to existing reports, some well written but clinically inaccurate reports can slip through.

In this blog, we introduce Universal Report Generation (UniRG) (opens in new tab), a reinforcement learning–based framework for medical imaging report generation. This work is a research prototype intended to advance medical AI research and is not validated for clinical use. UniRG uses reinforcement learning as a unifying mechanism to directly optimize clinically grounded evaluation signals, aligning model training with real-world radiology practice rather than proxy text-generation objectives. Using this framework, we train UniRG-CXR (opens in new tab), a state-of-the-art chest x-ray report generation model at scale, spanning over 560,000 studies, 780,000 images, and 226,000 patients from more than 80 medical institutions.

To our knowledge, this is the first report generation model to achieve consistent state-of-the-art performance across report-level metrics, disease-level diagnostic accuracy, cross-institution generalization, longitudinal report generation, and demographic subgroups. These results demonstrate that reinforcement learning, when guided by clinically meaningful reward signals, can substantially improve both the reliability and generality of medical vision–language models.

PODCAST SERIES

The AI Revolution in Medicine, Revisited

Join Microsoft’s Peter Lee on a journey to discover how AI is impacting healthcare and what it means for the future of medicine.

Listen now

A unified framework for scaling medical image report generation

UniRG builds state-of-the-art report generation models by combining supervised fine-tuning with reinforcement learning, which optimizes a composite reward that integrates rule-based metrics, model-based semantic metrics, and LLM-based clinical error signals. This approach allows the resulting model UniRG-CXR to learn from diverse data sources, move beyond dataset-specific reporting patterns, and learn representations that generalize across institutions, metrics, and clinical contexts. Notably, UniRG-CXR sets a new state of the art on the authoritative ReXrank leaderboard (opens in new tab), a public leaderboard for chest X-ray image interpretation, as of 01/22/2026, surpassing previous best models by substantial margins (Figure 1).
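As a simplified illustration of such a composite reward, the sketch below combines a rule-based score, a model-based semantic score, and an LLM-derived clinical-error count into a single scalar. The weights, normalization, and component names are illustrative assumptions, not the published UniRG configuration:

```python
# Hedged sketch of a composite reward in the spirit of UniRG
# (weights and normalization here are illustrative assumptions).

def composite_reward(rule_score, semantic_score, clinical_errors,
                     weights=(0.3, 0.3, 0.4), max_errors=5):
    """Combine rule-based, model-based, and LLM-based signals into one reward.

    rule_score, semantic_score: in [0, 1] (e.g., n-gram overlap, embedding similarity)
    clinical_errors: count of LLM-detected clinical errors in the report
    """
    # Map the error count onto [0, 1]: fewer errors -> higher reward.
    error_score = max(0.0, 1.0 - clinical_errors / max_errors)
    w_rule, w_sem, w_err = weights
    return w_rule * rule_score + w_sem * semantic_score + w_err * error_score

# A fluent but clinically wrong report scores lower than an accurate one.
print(round(composite_reward(0.9, 0.8, clinical_errors=4), 2))  # -> 0.59
print(round(composite_reward(0.7, 0.8, clinical_errors=0), 2))  # -> 0.85
```

The point of weighting the error term heavily is exactly the failure mode described above: a report that merely sounds like a radiology report can no longer outscore one that gets the clinical facts right.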

Figure 1. Overview of UniRG-CXR. (a) Training Data: UniRG-CXR is trained on the training splits of MIMIC-CXR, CheXpert Plus, and ReXGradient-160k, covering diverse institutions and patient demographics. (b) Training and Rewards: Taking input from the current image, clinical context (e.g., indication), and optionally prior studies, UniRG-CXR uses GRPO reinforcement learning to optimize composite rewards that combine rule-based, model-based, and LLM-based metrics. (c) Evaluation: We assess UniRG-CXR on held-out test sets (MIMIC-CXR, CheXpert Plus, ReXGradient) and unseen datasets (IU Xray and proprietary data). Report quality is measured using ReXrank metrics and an LLM-based clinical-error metric, while diagnostic ability is evaluated via F1-based disease classification from generated reports. (d) ReXrank Results: UniRG-CXR achieves SOTA performance across four datasets and two generation settings (findings only and findings + impression), showing substantial gains over prior state-of-the-art systems.

Universal improvements across metrics and clinical errors

Rather than excelling on one metric at the expense of others, UniRG-CXR delivers balanced improvements across many different measures of report quality. More importantly, it produces reports with substantially fewer clinically significant errors. This indicates that the model is not just learning how to sound like a radiology report, but is better capturing the underlying clinical facts. Explicitly optimizing for clinical correctness helps the model avoid common failure modes where fluent language masks incorrect or missing findings (Figure 2).

Figure 2. UniRG-CXR achieves state-of-the-art performance, delivering consistent and comprehensive gains across metrics. (a) On the ReXrank leaderboard, UniRG-CXR (green) shows robust, universal improvement across all evaluation metrics. (b) Starting from the same SFT checkpoint, RL with our combined reward achieves more balanced gains across metrics and the highest RadCliQ-v1 score compared to RL on single metrics; this ablation is trained and tested on MIMIC. (c) An ablation study on training dynamics shows that full RL (UniRG-CXR) achieves a significantly better RadCliQ-v1 score than RL on BLEU alone. (d) During training, full RL (UniRG-CXR) shows a steady decrease in clinical errors per report, compared with a fluctuating trajectory without consistent improvement from an ablation run without error awareness (i.e., removing CheXprompt metric optimization). Both (c) and (d) show results on the 1,024-study MIMIC validation set from ablations trained on MIMIC. (e) Case studies illustrate that UniRG-CXR can produce error-free reports, unlike MedVersa and MedGemma. (f) UniRG-CXR yields a substantially higher proportion of reports with ≤1 error and fewer with ≥4 errors than prior models.

Strong performance in longitudinal report generation

In clinical practice, radiologists often compare current images with prior exams to determine whether a condition is improving, worsening, or unchanged. UniRG-CXR is able to incorporate this historical information effectively, generating reports that reflect meaningful changes over time. This allows the model to describe new findings, progression, or resolution of disease more accurately, moving closer to how radiologists reason across patient histories rather than treating each exam in isolation (Figure 3).

Figure 3. UniRG-CXR enhances longitudinal report generation. (a) Comparing UniRG-CXR and its non-longitudinal ablation with prior models on longitudinal report generation, we show that UniRG-CXR exhibits the best performance and that longitudinal information is beneficial. (b) UniRG-CXR achieves the best performance across longitudinal encounter points, ranging from the first encounter to the more complex 5th+ encounters, showing that its improvements are across the board. In comparison, prior models such as GPT-5, GPT-4o, and MedGemma barely surpass the copy-prior-report baseline (grey lines). (c) Compared with prior models, which barely improve over the copy-prior baseline (dashed line), UniRG-CXR significantly and consistently improves performance across temporal disease change categories, including new development, no change, progression, and regression (categorized by GPT-5 on the ground truth report). Qualitative examples are shown for each category where UniRG-CXR correctly predicts the temporal change from the input. All results in this figure are on the MIMIC test set, with prior information where available.

Robust generalization across institutions and populations

UniRG-CXR maintains strong performance even when applied to data from institutions it has never seen before. This suggests that the model is learning general clinical patterns rather than memorizing institution-specific reporting styles. In addition, its performance remains stable across different patient subgroups, including age, gender, and race. This robustness is critical for real-world deployment, where models must perform reliably across diverse populations and healthcare environments (Figure 4).

Figure 4. Generalization and robustness of UniRG-CXR. (a) We evaluate UniRG-CXR in a zero-shot setting on two datasets from previously unseen institutions: IU-Xray and PD (proprietary data). UniRG-CXR consistently outperforms prior models, maintaining substantial performance gains in this challenging setup. (b) and (c) present condition-level F1 scores on MIMIC-CXR and PD and highlight that UniRG-CXR remains the overall top-performing model in condition-level diagnostic accuracy. (d) UniRG-CXR demonstrates stable and robust performance across gender, age, and race subgroups, all of which exceed the performance of the second-best model (dashed lines).

UniRG is a promising step toward scaling medical imaging report generation

UniRG introduces a reinforcement learning–based framework that rethinks how medical imaging report generation models are trained and evaluated. By directly optimizing clinically grounded reward signals, UniRG-CXR achieves state-of-the-art performance across datasets, metrics, diagnostic tasks, longitudinal settings, and demographic subgroups, addressing longstanding limitations of supervised-only approaches.

Looking ahead, this framework can be extended to additional imaging modalities and clinical tasks, and combined with richer multimodal patient data such as prior imaging, laboratory results, and clinical notes. More broadly, UniRG highlights the promise of reinforcement learning as a core component of next-generation medical foundation models that are robust, generalizable, and clinically aligned.

UniRG reflects Microsoft’s larger commitment to advancing multimodal generative AI for precision health (opens in new tab), with other exciting progress such as GigaPath, BiomedCLIP, LLaVA-Rad (opens in new tab), BiomedJourney, BiomedParse, TrialScope, and Curiosity.

Paper co-authors: Qianchu Liu, Sheng Zhang, Guanghui Qin, Yu Gu, Ying Jin, Sam Preston, Yanbo Xu, Sid Kiblawi, Wen-wai Yim, Tim Ossowski, Tristan Naumann, Mu Wei, Hoifung Poon


The post UniRG: Scaling medical imaging report generation with multimodal reinforcement learning appeared first on Microsoft Research.


Argos: Multimodal reinforcement learning with agentic verifier for AI agents

Microsoft Research - Tue, 01/20/2026 - 19:00
At a glance
  • Today’s multimodal AI systems can give answers that sound right but may not be grounded in what they actually observe over time, leading to unpredictable errors and safety risks in real-world settings.
  • Argos is a verification framework for multimodal reinforcement learning that trains models by rewarding not just correct answers, but correct answers grounded in visual and temporal evidence, using automated verification rather than human labeling. It selects the appropriate specialized tools for each answer based on what needs to be verified. 
  • Models trained with Argos show stronger spatial reasoning, far fewer visual hallucinations, more stable learning dynamics, and better performance on robotics and real-world tasks while requiring fewer training samples.

Over the past few years, AI systems have become much better at interpreting images, generating language, and performing tasks in physical and virtual environments. Yet they still fail in ways that are hard to predict and even harder to fix. A robot might try to grasp a tool when the object is visibly blocked, or a visual assistant integrated into smart glasses might describe objects that aren’t actually present.

These errors often arise because today’s multimodal agents are trained to generate outputs that are plausible rather than grounded in the actual information they receive from their environment. As a result, a model’s output can seem correct while relying on incorrect information. As AI systems are increasingly used to navigate 3D spaces and make decisions in real-world settings, this gap can be a safety and reliability concern.

To tackle this challenge, we posed the question: How can we train AI agents to generate correct answers and take appropriate actions for the right reasons so that their behavior is reliable even as the environment or tasks change?

Argos represents a novel answer to this challenge. It’s an agentic verification framework designed to improve the reliability of reinforcement learning in multimodal models. Reinforcement learning is a training method where AI models learn by receiving rewards for desired behaviors and penalties for undesired ones, gradually improving their performance through trial and error.

Rather than rewarding only correct behaviors, Argos evaluates how those behaviors were produced. It draws on a pool of larger, more capable teacher models and rule-based checks to verify two things: first, that the objects and events a model references actually exist in its input, and second, that the model’s reasoning aligns with what it observes. Argos rewards the model when both conditions are met. In practice, these rewards help curate high-quality training data and guide the model’s further training.

How Argos works

Argos functions as a verification layer on top of an existing multimodal model. Given an image or video, a task or query, and information about the model’s reasoning and output, Argos identifies where the model indicates objects are located in the image, when it indicates events occur in a video, and what action or answer it produces.

Argos then applies specialized tools tailored to the specific content to evaluate and score three aspects of the model’s output. It checks whether the answer is correct, whether referenced objects and events appear at the indicated locations and times, and whether the reasoning is consistent with the visual evidence and the answer (Figure 1).

These scores are combined using a gated aggregation function, a method that dynamically adjusts the importance of different scores. It emphasizes reasoning checks only when the final output is correct. This design prevents unreliable feedback from dominating training and produces a stable reward signal for reinforcement learning.
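The exact aggregation formula is not spelled out here. A minimal sketch of such a gated reward, with illustrative weights and threshold (not taken from the paper), might look like this:

```python
def gated_reward(answer_score: float, grounding_score: float,
                 reasoning_score: float, gate_threshold: float = 1.0) -> float:
    """Combine verifier scores into a single reward.

    Grounding and reasoning scores contribute only when the final
    answer is correct (answer_score >= gate_threshold), so a model
    cannot be rewarded for fluent reasoning attached to a wrong
    answer. All weights here are illustrative.
    """
    if answer_score >= gate_threshold:
        # Gate open: blend in the process-level checks.
        return 0.5 * answer_score + 0.25 * grounding_score + 0.25 * reasoning_score
    # Gate closed: potentially unreliable reasoning feedback is ignored.
    return 0.5 * answer_score
```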

Figure 1. Argos selects different specialized tools to verify and score the accuracy of referenced points and events in the agent’s reasoning.

Using Argos to curate data for supervised fine-tuning

Argos also helps curate high-quality training data to provide the model with a strong foundation in grounded reasoning. Before the reinforcement learning stage begins, Argos uses a multi-stage process to generate data that is explicitly tied to visual locations and time intervals.

In the first stage, Argos identifies the objects, actions, and events that are relevant to a task and links them to specific locations in images or specific moments in videos. These references are overlaid on images and selected video frames. Next, a reasoning model generates step-by-step explanations that refer to these visual locations and time spans.

Finally, Argos evaluates each generated example for accuracy and visual grounding, filtering out low-quality training data and retaining only data that is both correct and well-grounded in visual input. The resulting dataset is then used in an initial training phase, where the model learns to generate reasoning steps before producing its final output. This process is illustrated in Figure 2.
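The filtering step above can be sketched as a simple predicate over candidate examples: keep only those whose answer is verified correct and whose grounding score clears a threshold. The field names and threshold below are hypothetical, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    reasoning: str
    answer: str
    answer_correct: bool    # verified against ground truth
    grounding_score: float  # fraction of referenced objects/events verified in the input

def curate(examples: list[Example], min_grounding: float = 0.8) -> list[Example]:
    """Keep only examples that are both correct and well grounded.
    The 0.8 threshold is illustrative."""
    return [ex for ex in examples
            if ex.answer_correct and ex.grounding_score >= min_grounding]
```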

Figure 2. Argos generates step-by-step reasoning grounded in image locations and video timestamps, then filters out low-quality training data.

Evaluation

Building on this foundation in grounded reasoning, we further trained the model using reinforcement learning guided by Argos and evaluated its performance across a range of benchmarks. On spatial reasoning tasks, the Argos-trained model outperformed both the base model Qwen2.5-VL-7B and the stronger Video-R1 baseline across challenging 3D scenarios and multi-view tasks. Models trained with Argos also showed a substantial reduction of hallucinations compared with both standard chain-of-thought prompting and reinforcement learning baselines.

Finally, we evaluated the model in robotics and other real-world task settings, focusing on high-level planning and fine-grained control. Models trained with Argos performed better on complex, multi-step tasks. Notably, these improvements were achieved using fewer training samples than existing approaches, highlighting the importance of reward design in producing more capable and data-efficient agents. Figure 3 illustrates some of these findings.

Figure 3. Performance of Argos compared with baseline models on the task of visual hallucination detection (left) and embodied task planning and completion (right).

How Argos shapes reinforcement learning

To understand how Argos affects learning, we took the same vision-language model that had been trained on our curated dataset and fine-tuned it using reinforcement learning in two different ways. In one approach, Argos was an agentic verifier, checking the correctness of outputs and the quality of reasoning. In the other, the model received feedback only on whether its answers were correct.

We evaluated both versions on 1,500 samples from a new dataset and tracked their performance throughout the learning process (Figure 4). Although they started at similar levels, the model trained without Argos quickly deteriorated: its accuracy steadily declined, and it increasingly gave answers that ignored what was in the videos. It learned to game the reward by producing answers that seemed correct without grounding them in visual evidence.

The model trained with Argos showed the opposite pattern. Accuracy improved steadily, and the model became better at linking its reasoning to what appeared in the videos. This difference highlights the value of verification: when training rewards both correct outputs and sound reasoning based on visual and temporal evidence, models learn to be more reliable rather than simply finding shortcuts to high scores.

Figure 4. Comparison of response accuracy changes with and without Argos across two model versions (left) and differences in visual grounding accuracy over training for both versions (right).

Potential impact and looking forward

This research points toward a different way of building AI agents for real-world applications. Rather than fixing errors after they occur, this approach trains agents throughout the learning process to systematically anchor their reasoning in the input they actually receive.

The potential applications span many domains. A visual assistant for a self-driving car that verifies what’s actually in an image is less likely to report phantom obstacles. A system that automates digital tasks and checks each action against what’s displayed on the screen is less likely to click the wrong button.

As AI systems move beyond research labs into homes, factories, and offices, reliable reasoning becomes essential for safety and trust. Argos represents an early example of verification systems that evolve alongside the AI models they supervise. Future verifiers could be tailored for specific fields like medical imaging, industrial simulations, and business analytics. As more advanced models and richer data sources become available, researchers can use them to improve these verification systems, providing even better guidance during training and further reducing hallucinations.

We hope that this research helps move the field toward AI systems that are both capable and interpretable: agents that can explain their decisions, point to the evidence behind them, and be trained to adhere to real-world requirements and values.


The post Argos: Multimodal reinforcement learning with agentic verifier for AI agents appeared first on Microsoft Research.

Categories: Microsoft

OptiMind: A small language model with optimization expertise

Microsoft Research - Thu, 01/15/2026 - 16:00
At a glance
  • Many real-world business problems can benefit from optimization, but translating decisions, constraints, and goals from natural language into optimization algorithms is slow.
  • OptiMind is a small language model designed to convert business problems described in natural language into the mathematical formulations needed by optimization software.
  • OptiMind is trained on a carefully curated, expert-aligned dataset and applies domain-specific hints and self-checks at inference time, improving its accuracy.
  • OptiMind matches or exceeds the performance of much larger systems, can run locally to protect sensitive data, produces more reliable formulations, and reduces the time and expertise needed to prepare optimization models.

Enterprises across industries, from energy to finance, use optimization models to plan complex operations like supply chains and logistics. These models work by defining three elements: the choices that can be made (such as production quantities or delivery routes), the rules and limits those choices must follow, and the goal, whether that’s minimizing costs, meeting customer demand, or improving efficiency.
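The three elements above — decisions, constraints, and an objective — can be made concrete with a toy production-planning problem. The numbers are invented for illustration, and a brute-force search stands in for a real optimization solver:

```python
def solve_production_plan(max_hours: int = 40, max_units: int = 100):
    """Toy optimization problem: choose production quantities for two
    products (the decisions) subject to a labor-hours limit (the
    constraint), maximizing profit (the objective). Brute force stands
    in for a real solver; all numbers are illustrative."""
    best = (0, 0, 0)  # (units_a, units_b, profit)
    for a in range(max_units + 1):
        for b in range(max_units + 1):
            hours = 2 * a + 3 * b       # product A takes 2 hours, B takes 3
            if hours > max_hours:
                continue                # violates the labor constraint
            profit = 30 * a + 40 * b    # objective to maximize
            if profit > best[2]:
                best = (a, b, profit)
    return best
```

With 40 labor hours, product A yields more profit per hour, so the search allocates all hours to it.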

Over the past few decades, many businesses have shifted from judgment-based decision-making to data-driven approaches, leading to major efficiency gains and cost savings. Advances in AI promise to accelerate this shift even further, potentially cutting decision times from days to minutes while delivering better results.

In practice, however, turning real-world business problems into a form that optimization software can understand is challenging. This translation process requires expressing decisions, constraints, and objectives in mathematical terms. The work demands specialized expertise, and it can take anywhere from one day to several weeks to solve complex problems. 

To address this challenge, we’re introducing OptiMind, a small language model designed to convert problems described in plain language into the mathematical formulations that optimization software needs. Built on a 20-billion parameter model, OptiMind is compact by today’s standards yet matches the performance of larger, more complex systems. Its modest size means it can run locally, enabling fast iteration while keeping sensitive business data on users’ devices rather than transmitting it to external servers.

How it works

OptiMind incorporates knowledge from optimization experts both during training and at inference time to improve formulation accuracy at scale. Three stages enable this: domain-specific hints improve training-data quality, supervised fine-tuning adapts the model, and expert reasoning guides the model as it works.

Figure 1. From problem description to solution 

One of the central challenges in developing OptiMind was the poor quality of existing public datasets for optimization problems. Many examples were incomplete or contained incorrect solutions. To address this, we developed a systematic approach that combines automation with expert review. It organizes problems into well-known categories, such as scheduling or routing, and identifies common error patterns within each. Using these insights, we generated expert-verified “hints” to guide the process, enabling the system to regenerate higher-quality solutions and filter out unsolvable examples (Figure 2). The result is a training dataset that more accurately reflects how optimization experts structure problems.
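The correction loop described above can be sketched as: verify each example, regenerate flawed ones with category-specific expert hints, and drop whatever still fails as unsolvable. The `generate`, `verify`, and `hints` interfaces below are hypothetical stand-ins for the paper's actual pipeline.

```python
def clean_dataset(examples, generate, verify, hints):
    """Sketch of the data-correction loop: keep verified examples,
    retry flawed ones with expert hints, filter out what still fails.
    `generate`, `verify`, and `hints` are hypothetical interfaces."""
    cleaned = []
    for ex in examples:
        if verify(ex):
            cleaned.append(ex)
            continue
        retried = generate(ex, hints.get(ex["category"], []))
        if verify(retried):
            cleaned.append(retried)
        # still failing -> treated as unsolvable and filtered out
    return cleaned
```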

Figure 2. Process for correcting training data

Using this refined dataset, we applied supervised fine-tuning to the base model. Rather than simply generating code, we trained OptiMind to produce structured mathematical formulations alongside intermediate reasoning steps, helping it avoid the common mistakes found in earlier datasets.

When in use, the model’s reliability further improves. When given a new problem, OptiMind first classifies it into a category, such as scheduling or network design. It then applies expert hints relevant to that type of problem, which act as reminders to check for errors before generating a solution. For particularly challenging problems, the system generates multiple solutions and either selects the most frequently occurring one or uses feedback to refine its response. This approach increases accuracy without requiring a larger model, as illustrated in Figure 3.
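The classify-then-hint-then-vote strategy above can be sketched as follows. The keyword classifier, hint text, and sampling interface are all hypothetical; only the overall shape (categorize, prepend hints, sample several formulations, take the majority) follows the description:

```python
from collections import Counter

EXPERT_HINTS = {  # hypothetical hint text, for illustration only
    "scheduling": "Ensure each job is assigned to exactly one machine.",
    "routing": "Remember subtour-elimination and vehicle-capacity constraints.",
}

def classify(problem: str) -> str:
    """Naive keyword classifier standing in for the model's category step."""
    keywords = ("route", "vehicle", "tour")
    return "routing" if any(w in problem.lower() for w in keywords) else "scheduling"

def solve(problem: str, generate, n_samples: int = 5) -> str:
    """Prepend the category hint, sample several candidate formulations,
    and return the most frequent one (majority vote)."""
    prompt = EXPERT_HINTS[classify(problem)] + "\n" + problem
    candidates = [generate(prompt) for _ in range(n_samples)]
    return Counter(candidates).most_common(1)[0][0]
```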

Figure 3. OptiMind’s inference process

Evaluation

To test the system, we turned to three widely used public benchmarks that represent some of the most complex formulation tasks in the field. On closer inspection, we discovered that 30 to 50 percent of the original test data was flawed. After manually correcting the issues, OptiMind improved accuracy by approximately 10 percent over the base model. Figure 4 and Table 1 show detailed comparisons: OptiMind outperformed other open-source models under 32 billion parameters and, when combined with expert hints and correction strategies, matched or exceeded the performance of current leading models.

Figure 4. Average accuracy percentages over all models.

Table 1. Performance of all models on corrected benchmark datasets

OptiMind is more reliable than other models because it learns from higher-quality, domain-aligned data. And by correcting errors and inconsistencies in standard datasets, we significantly reduced the model’s tendency to hallucinate relative to the base and comparison models.

Looking forward

While supervised fine-tuning has provided a strong foundation, we are exploring reinforcement learning to further refine OptiMind’s reasoning capabilities. We’re also investigating automated frameworks that would allow LLMs to generate their own expert hints, enabling continuous autonomous improvement. Additionally, we are working with Microsoft product teams and industry collaborators to expand OptiMind’s utility, adding support for more programming languages and a variety of input formats, including Excel and other widely used tools.

We’re releasing OptiMind as an experimental model to gather community feedback and inform future development. The model is available through Microsoft Foundry (opens in new tab) and Hugging Face (opens in new tab), and we’ve open-sourced the benchmarks and data-processing procedures on GitHub (opens in new tab) to support more reliable evaluation across the field. We welcome feedback through GitHub (opens in new tab), and invite those interested in shaping the future of optimization to apply for one of our open roles.


The post OptiMind: A small language model with optimization expertise appeared first on Microsoft Research.

Categories: Microsoft


Originally posted 2022-04-10 00:30:54. …

Tricks for Creating an Online Poker Account That Beginners Should Learn

Microsoft Kitchen - Sun, 09/10/2023 - 15:13

Creating a member account at an idnplay online poker agent is something every beginner player in Indonesia will need to know. Beginners need a few tricks so the account-creation process runs smoothly and comfortably, and they should learn those tricks early. So if you are a beginner interested in poker, read through the account-creation tricks below.

Tricks for Players Who Want to Create an Online Poker Account

Everyone who has entered the betting world wants to try poker, which can now be played online. Online and offline poker differ in many ways, so if you have never tried the online version, you should read this article first: there is a lot worth knowing, including how to create a poker account. Here are the account-creation tricks beginners should learn:

  • Play on a Site That People Recommend

First, play poker on a site that many people already recommend. This is the first step a player should take: when you play poker, you must not choose the wrong gambling site. The site you pick must be trustworthy. One way to choose is to check the site's official license, so look for a poker site that has already been certified as official.

You should also bet on a site that offers the most complete facilities and services. So before playing, check whether the site you have chosen provides comfortable service, because the best gambling sites always give their players the best service.

  • Prepare Funds

Next, prepare funds so you can make a deposit on the online poker site. To play poker you need money and a bank account in your own name. If you do not yet have a bank account, open one under your own name, because the agent will not process transactions where the bank account name differs from the name on the gambling account.

  • Fill In the Registration Form

The third step in creating a poker account is filling in the registration form. Once you have found a site and prepared enough funds, this is what comes next. Fill in the form accurately: the fields you should complete with your own information include your username, your bank account number, the bank you use, and so on.

If you want the form to go easily, prepare the required information in advance. Gathering your personal data before registering is a step every player should take, so use this trick when filling in the form.

  • Start Betting

The final step is to start betting. This is the last step in setting up a gambling account; once you have started betting online, you can place bets at any time. But first, check whether you properly understand the bets you are placing. If not, learn the rules of the game beforehand.

Those are the tricks for creating an online poker account that every beginner in Indonesia should learn and understand. Once you have, you will be able to create a member account right away; it will only take a few minutes if every step is done correctly.

Originally posted 2022-03-25 00:07:14. …
