Systematic debugging for AI agents: Introducing the AgentRx framework
- Problem: Debugging AI agent failures is hard because trajectories are long, stochastic, and often multi-agent, so the true root cause gets buried.
- Solution: AgentRx pinpoints the first unrecoverable (“critical failure”) step by synthesizing guarded, executable constraints from tool schemas and domain policies, then logging evidence-backed violations step-by-step.
- Benchmark + taxonomy: We release AgentRx Benchmark with 115 manually annotated failed trajectories across τ-bench, Flash, and Magentic-One, plus a grounded nine-category failure taxonomy.
- Results + release: AgentRx improves failure localization (+23.6%) and root-cause attribution (+22.9%) over prompting baselines, and we are open-sourcing the framework and dataset.
As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency.
When a human makes a mistake, we can usually trace the logic. But when an AI agent fails, perhaps by hallucinating a tool output or deviating from a security policy ten steps into a fifty-step task, identifying exactly where and why things went wrong is an arduous, manual process.
Today, we are excited to announce the open-source release of AgentRx, an automated, domain-agnostic framework designed to pinpoint the “critical failure step” in agent trajectories. Alongside the framework, we are releasing the AgentRx Benchmark, a dataset of 115 manually annotated failed trajectories to help the community build more transparent, resilient agentic systems.
The challenge: Why AI agents are hard to debug
Modern AI agents are often:
- Long-horizon: They perform dozens of actions over extended periods.
- Probabilistic: The same input might lead to different outputs, making reproduction difficult.
- Multi-agent: Failures can be “passed” between agents, masking the original root cause.
Traditional success metrics (like “Did the task finish?”) don’t tell us enough. To build safe agents, we need to identify the exact moment a trajectory becomes unrecoverable and capture evidence for what went wrong at that step.
Introducing AgentRx: An automated diagnostic “prescription”
AgentRx (short for “Agent Diagnosis”) treats agent execution like a system trace that needs validation. Instead of relying on a single LLM to “guess” the error, AgentRx uses a structured, multi-stage pipeline:
- Trajectory normalization: Heterogeneous logs from different domains are converted into a common intermediate representation.
- Constraint synthesis: The framework automatically generates executable constraints based on tool schemas (e.g., “The API must return a valid JSON response”) and domain policies (e.g., “Do not delete data without user confirmation”).
- Guarded evaluation: AgentRx evaluates constraints step-by-step, checking each constraint only when its guard condition applies, and produces an auditable validation log of evidence-backed violations.
- LLM-based judging: Finally, an LLM judge uses the validation log and a grounded failure taxonomy to identify the Critical Failure Step—the first unrecoverable error.
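The guarded-evaluation stage can be illustrated with a minimal sketch. The constraint shape, field names, and the `no_unconfirmed_delete` policy below are hypothetical examples invented for illustration, not AgentRx’s actual API; the point is how a guard limits when a check fires and how violations are logged with evidence:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    name: str
    guard: Callable[[dict], bool]     # should this constraint be checked at this step?
    check: Callable[[dict], bool]     # does the step satisfy the constraint?
    evidence: Callable[[dict], str]   # human-readable evidence on violation

def evaluate_trajectory(steps: list, constraints: list) -> list:
    """Walk a normalized trajectory and log evidence-backed violations."""
    log = []
    for i, step in enumerate(steps):
        for c in constraints:
            # Guarded evaluation: only check a constraint when its guard applies.
            if c.guard(step) and not c.check(step):
                log.append({"step": i, "constraint": c.name,
                            "evidence": c.evidence(step)})
    return log

# Hypothetical domain policy: "Do not delete data without user confirmation."
no_unconfirmed_delete = Constraint(
    name="no_unconfirmed_delete",
    guard=lambda s: s.get("tool") == "delete_record",
    check=lambda s: s.get("user_confirmed", False),
    evidence=lambda s: f"delete_record called without confirmation: args={s.get('args')}",
)

trajectory = [
    {"tool": "lookup_record", "args": {"id": 7}},
    {"tool": "delete_record", "args": {"id": 7}, "user_confirmed": False},
]
violations = evaluate_trajectory(trajectory, [no_unconfirmed_delete])
# The earliest violation is a candidate for the critical failure step.
```

The validation log produced this way is what the downstream LLM judge consumes, together with the failure taxonomy.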
To evaluate AgentRx, we developed a manually annotated benchmark consisting of 115 failed trajectories across three complex domains:
- τ-bench: Structured API workflows for retail and service tasks.
- Flash: Real-world incident management and system troubleshooting.
- Magentic-One: Open-ended web and file tasks using a generalist multi-agent system.
Using a grounded-theory approach, we derived a nine-category failure taxonomy that generalizes across these domains. This taxonomy helps developers distinguish between a “Plan Adherence Failure” (where the agent ignored its own steps) and an “Invention of New Information” (hallucination).
| Taxonomy category | Description |
|---|---|
| Plan Adherence Failure | Ignored required steps / did extra unplanned actions |
| Invention of New Information | Altered facts not grounded in trace/tool output |
| Invalid Invocation | Tool call malformed / missing args / schema-invalid |
| Misinterpretation of Tool Output | Read tool output incorrectly; acted on wrong assumptions |
| Intent–Plan Misalignment | Misread user goal/constraints and planned wrongly |
| Under-specified User Intent | Could not proceed because required info wasn’t available |
| Intent Not Supported | No available tool can do what’s being asked |
| Guardrails Triggered | Execution blocked by safety/access restrictions |
| System Failure | Connectivity/tool endpoint failures |

Analysis of failure density across domains. In multi-agent systems like Magentic-One, trajectories often contain multiple errors, but AgentRx focuses on identifying the first critical breach.

Key Results
In our experiments, AgentRx demonstrated significant improvements over existing LLM-based prompting baselines:
- +23.6% absolute improvement in failure localization accuracy.
- +22.9% improvement in root-cause attribution.
By providing the “why” behind a failure through an auditable log, AgentRx allows developers to move beyond trial-and-error prompting and toward systematic agentic engineering.
Join the Community: Open Source Release
We believe that agent reliability is a prerequisite for real-world deployment. To support this, we are open sourcing the AgentRx framework and the complete annotated benchmark.
- Read the Paper: AgentRx: Diagnosing AI Agent Failures from Execution Trajectories
- Explore the Code & Data: https://aka.ms/AgentRx/Code
We invite researchers and developers to use AgentRx to diagnose their own agentic workflows and contribute to the growing library of failure constraints. Together, we can build AI agents that are not just powerful but also auditable and reliable.
Acknowledgements
We would like to thank Avaljot Singh and Suman Nath for contributing to this project.
PlugMem: Transforming raw agent interactions into reusable knowledge
- Today’s AI agents store long interaction histories but struggle to reuse them effectively.
- Raw memory retrieval can overwhelm agents with lengthy, low-value context.
- PlugMem transforms interaction history into structured, reusable knowledge.
- A single, general-purpose memory module improves performance across diverse agent benchmarks while using fewer memory tokens.
It seems counterintuitive: giving AI agents more memory can make them less effective. As interaction logs accumulate, they grow large, fill with irrelevant content, and become increasingly difficult to use.
More memory means that agents must search through larger volumes of past interactions to find information relevant to the current task. Without structure, these records mix useful experiences with irrelevant details, making retrieval slower and less reliable. The challenge is not storing more experiences, but organizing them so that agents can quickly identify what matters in the moment.
In our recent paper “PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents,” we introduce a plug-and-play memory system that transforms raw agent interactions into reusable knowledge. Rather than treating memory as text to retrieve, PlugMem organizes that history into structured knowledge designed to support decisions as the agent acts.
Cognitive science offers a useful framework here. It distinguishes between remembering events, knowing facts, and knowing how to perform tasks. Past events provide context, but effective decisions rely on the facts and skills extracted from those events.
This distinction motivated a shift in how we decided to design memory for AI agents. PlugMem implements this shift by converting the agent’s interaction history, such as dialogues, documents, and web sessions, into structured, compact knowledge units that can be reused across tasks.
How PlugMem works
A key difference between PlugMem and conventional AI memory systems is what gets stored. Traditional approaches store text chunks or named entities (references to people, places, and concepts). PlugMem uses facts and reusable skills as the fundamental building blocks of memory. This design reduces redundancy, increases information density, and improves retrieval precision. It’s built around three core components:
Structure. Raw interactions are standardized and transformed into propositional knowledge (facts) and prescriptive knowledge (reusable skills). These knowledge units are organized into a structured memory graph, enabling knowledge to be stored in a form designed for reuse.
Retrieval. Rather than retrieving long passages of text, PlugMem retrieves knowledge units that are aligned with the current task. High-level concepts and inferred intents serve as routing signals, surfacing the most relevant information for the decision at hand.
Reasoning. Retrieved knowledge is distilled into concise, task-ready guidance before being passed to the base agent, ensuring that only decision-relevant knowledge enters the agent’s context window.
Figure 1 illustrates how these components work together.
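As a rough illustration of the structure–retrieval–reasoning loop, here is a toy sketch in Python. The class names, fields, and example knowledge units are invented for illustration and do not reflect PlugMem’s actual interfaces:

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeUnit:
    kind: str            # "fact" (propositional) or "skill" (prescriptive)
    concepts: frozenset  # high-level concepts used as routing signals
    content: str

@dataclass
class PluginMemory:
    units: list = field(default_factory=list)

    def structure(self, extracted):
        """Store each (kind, concepts, content) triple as a compact unit."""
        for kind, concepts, content in extracted:
            self.units.append(KnowledgeUnit(kind, frozenset(concepts), content))

    def retrieve(self, task_concepts, k=3):
        """Route by concept overlap rather than raw-text similarity."""
        task = set(task_concepts)
        hits = [u for u in self.units if u.concepts & task]
        hits.sort(key=lambda u: len(u.concepts & task), reverse=True)
        return hits[:k]

    def reason(self, retrieved):
        """Distill retrieved units into concise, task-ready guidance."""
        return "; ".join(f"[{u.kind}] {u.content}" for u in retrieved)

mem = PluginMemory()
mem.structure([
    ("fact", {"checkout", "shipping"}, "Free shipping applies over $50."),
    ("skill", {"checkout"}, "Apply coupon before selecting payment."),
    ("fact", {"login"}, "Password resets expire after 24 hours."),
])
guidance = mem.reason(mem.retrieve({"checkout"}))
```

Only the distilled guidance string, not the raw interaction history, would enter the base agent’s context window.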
Most AI memory systems are built for one job. A conversational memory module is designed around dialogue. A knowledge-retrieval system is tuned to look up facts. A web agent’s memory is optimized for navigating pages. Each performs well in its target setting but rarely transfers without significant redesign.
PlugMem takes a different approach. It is a foundational memory layer that can be attached to any AI agent without needing to modify it for a specific task.
Evaluating PlugMem
To test PlugMem, we evaluated the same memory module on three benchmarks that each make different demands on memory:
- Answering questions across long multi-turn conversations
- Finding facts that span multiple Wikipedia articles
- Making decisions while browsing the web
Across all three, PlugMem consistently outperformed both generic retrieval methods and task-specific memory designs while using significantly fewer memory tokens in the process.
Measuring memory by utility, not size
We wanted to evaluate whether the right information was reaching the agent at the right moment, without overwhelming the model’s context window, which has limited capacity. To do this, we introduced a metric that measures how much useful, decision-relevant information a memory module contributes relative to how much context it consumes.
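A minimal version of such a utility metric might look like the following. This is an illustrative formulation with hypothetical token counts; the paper’s exact metric may differ:

```python
def memory_utility(relevant_tokens, consumed_tokens):
    """Decision-relevant tokens delivered per token of context consumed."""
    return relevant_tokens / consumed_tokens if consumed_tokens else 0.0

# Hypothetical numbers for one retrieval step: a raw-log retriever pulls in
# long passages, while a structured module surfaces compact knowledge units.
raw_log    = memory_utility(relevant_tokens=120, consumed_tokens=4000)  # 0.03
structured = memory_utility(relevant_tokens=110, consumed_tokens=500)   # 0.22
```

Under this kind of measure, a module can deliver slightly less relevant information yet score far higher because it consumes a fraction of the context budget.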
When we plotted utility against context consumption, PlugMem consistently came out ahead: it delivered more decision-relevant information while consuming less of the AI agent’s context than other approaches, as shown in Figure 2. These results suggest that transforming experience into knowledge—rather than storing and retrieving raw logs—produces memory that is more useful and efficient.
Figure 2. Across all three benchmarks, PlugMem delivered more useful memory with less of the agent’s context window.

Why general-purpose memory can outperform task-specific designs
General-purpose memory modules can outperform systems tailored to specific tasks because the decisive factor is not specialization but whether memory can surface the right knowledge precisely when the agent needs it. Structure, retrieval, and reasoning each play a distinct role, and getting all three right matters more than optimizing for a single use case.
PlugMem is not meant to replace task-specific approaches. It provides a general memory foundation upon which task adaptations can be layered. Our experiments show that combining PlugMem with task-specific techniques yields further gains.
Toward reusable memory for agents
As AI agents take on longer and more complex tasks, their memory needs to evolve from storing past interactions to actively supplying reusable knowledge. The goal is for agents to carry useful facts and strategies from one task to the next rather than starting from scratch each time.
PlugMem represents a step in that direction, grounding memory design in cognitive principles and treating knowledge as the primary unit of reuse. As agent capabilities expand, knowledge-centric memory may prove to be a critical building block for the next generation of intelligent agents.
Code and experimental results are publicly available on GitHub so that others can reproduce the results and conduct their own research.
Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model
- Phi-4-reasoning-vision-15B is a compact and smart open‑weight multimodal reasoning model that balances reasoning power, efficiency, and training data needs. It is a broadly capable model that allows for natural interaction across a wide array of vision-language tasks and excels at math and science reasoning and at understanding user interfaces.
- We share lessons learned and best practices for training a multimodal reasoning model, showing the benefits of careful architecture choices, rigorous data curation, and a mixture of reasoning and non-reasoning data.
We are pleased to announce Phi-4-reasoning-vision-15B, a 15 billion parameter open‑weight multimodal reasoning model, available through Microsoft Foundry, HuggingFace, and GitHub. Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks such as image captioning, answering questions about images, reading documents and receipts, helping with homework, inferring changes across sequences of images, and much more. Beyond these general capabilities, it excels at math and science reasoning and at understanding and grounding elements on computer and mobile screens. In particular, our model offers appealing value relative to popular open-weight models, pushing the Pareto frontier of the tradeoff between accuracy and compute cost. It performs competitively with much slower models that require ten times or more compute time and tokens, and achieves better accuracy than similarly fast models, particularly on math and science reasoning.
Figure 1: Phi-4-reasoning-vision-15B presents a compelling option compared to existing models, pushing the Pareto frontier of the tradeoff between accuracy and compute cost. It performs competitively with much slower models that require more time and tokens, and achieves higher accuracy than similarly fast models. These values were computed by averaging accuracy, time, and output token counts for a subset of four benchmarks where we logged these values: ChartQA_TEST, MathVista_MINI, MMMU_VAL, and ScreenSpot_v2.

In this post, we share the motivations, design choices, experiments, and learnings that informed its development, as well as an evaluation of the model’s performance and guidance on how to use it. Our goal is to contribute practical insight to the community on building smaller, efficient multimodal reasoning models and to share an open-weight model that is competitive with models of similar size at general vision-language tasks, excels at computer use, and excels at scientific and mathematical multimodal reasoning.
A focus on smaller and faster vision–language models
Many popular vision-language models (VLMs) have trended toward growing parameter counts and, in particular, growing numbers of tokens consumed and generated. This increases training and inference-time cost and latency, and impedes usability for downstream deployment, especially in resource‑constrained or interactive settings.
A growing countertrend toward smaller models aims to boost efficiency, enabled by careful model design and data curation – a goal pioneered by the Phi family of models and furthered by Phi-4-reasoning-vision-15B. We specifically build on learnings from the Phi-4 and Phi-4-Reasoning language models and show how a multimodal model can be trained to cover a wide range of vision and language tasks without relying on extremely large training datasets, architectures, or excessive inference‑time token generation. Our model is intended to be lightweight enough to run on modest hardware while remaining capable of structured reasoning when it is beneficial. It was trained with far less compute than many recent open-weight VLMs of similar size: we used just 200 billion tokens of multimodal data, leveraging Phi-4-reasoning (trained with 16 billion tokens) built on the core Phi-4 model (400 billion unique tokens), compared to more than 1 trillion tokens used for training multimodal models like Qwen 2.5 VL and 3 VL, Kimi-VL, and Gemma3. It therefore presents a compelling option compared to existing models, pushing the Pareto frontier of the tradeoff between accuracy and compute cost.
Figure 2: Phi-4-Reasoning-Vision can help with a wide range of everyday tasks.

Lessons from training a multimodal model
Training a multimodal reasoning model raises numerous questions and requires many nuanced design choices around model architecture, dataset quality and composition, and the interaction between reasoning‑heavy and non-reasoning, perception‑focused tasks.
Model architecture: Early- vs. mid-fusion
Model architectures for VLMs differ primarily in how visual and textual information is fused. Mid-fusion models use a pretrained vision encoder to convert images into visual tokens that are projected into a pretrained LLM’s embedding space, enabling cross-modal reasoning while leveraging components already trained on trillions of tokens. Early-fusion models process image patches and text tokens in a single transformer, yielding richer joint representations but at significantly higher compute, memory, and data cost. We adopted a mid-fusion architecture, as it offers a practical trade-off for building a performant model with modest resources.
Model architecture: Vision encoder and image processing
We build on the SigLIP-2 vision encoder and the Phi-4-Reasoning backbone. In previous research, we found that multimodal language models sometimes struggled to solve tasks not because of a lack of reasoning proficiency, but because of an inability to extract and select relevant perceptual information from the image. An example would be a high-resolution screenshot that is information-dense with relatively small interactive elements.
Several open-source multimodal language models have adapted their methodologies accordingly, e.g., Gemma3 uses pan-and-scan and NVILA uses Dynamic S2. However, their trade-offs are difficult to understand across different datasets and hyperparameters. To this end, we conducted an ablation study of several techniques. We trained a smaller 5 billion parameter Phi-4 based proxy model on a dataset of 10 million image-text pairs, primarily composed of computer-use and GUI grounding data. We compared with Dynamic S2, which resizes images to a rectangular resolution that minimizes distortion while admitting a tiling by 384×384 squares; Multi-crop, which splits the image into potentially overlapping 384×384 squares and concatenates their encoded features on the token dimension; Multi-crop with S2, which broadens the receptive field by cropping into 1536×1536 squares before applying S2; and Dynamic resolution using the Naflex variant of SigLIP-2, a natively dynamic-resolution encoder with adjustable patch counts.
Our primary finding is that dynamic resolution vision encoders perform the best and especially well on high-resolution data. It is particularly interesting to compare dynamic resolution with 2048 vs 3600 maximum tokens: the latter roughly corresponds to native HD 720p resolution and enjoys a substantial boost on high-resolution benchmarks, particularly ScreenSpot-Pro. Reinforcing the high-resolution trend, we find that multi-crop with S2 outperforms standard multi-crop despite using fewer visual tokens (i.e., fewer crops overall). The dynamic resolution technique produces the most tokens on average; due to their tiling subroutine, S2-based methods are constrained by the original image resolution and often only use about half the maximum tokens. From these experiments we choose the SigLIP-2 Naflex variant as our vision encoder.
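The token budgets above can be made concrete with some back-of-the-envelope arithmetic. This sketch is illustrative, not the exact NaFlex or S2 algorithm; the 16-pixel patch size and the downscaling rule are assumptions we make here to show why a 3600-token budget roughly matches an HD 720p screenshot:

```python
import math

CROP = 384  # crop size used in the multi-crop / S2 ablations

def multicrop_grid(width, height, crop=CROP):
    """Number of crops along each axis when tiling an image without overlap."""
    return math.ceil(width / crop), math.ceil(height / crop)

def dynamic_resolution_tokens(width, height, patch=16, max_tokens=3600):
    """Patch count for a dynamic-resolution encoder under a token budget:
    if the image exceeds the budget, downscale it (preserving aspect ratio)
    until the patch count fits."""
    tokens = (width // patch) * (height // patch)
    if tokens <= max_tokens:
        return tokens
    scale = (max_tokens / tokens) ** 0.5
    return (int(width * scale) // patch) * (int(height * scale) // patch)

# A 1280x720 (HD 720p) screenshot:
cols, rows = multicrop_grid(1280, 720)        # 4 columns x 2 rows = 8 crops
tokens = dynamic_resolution_tokens(1280, 720)  # 80 * 45 = 3600 patch tokens
```

At 16-pixel patches, 1280×720 works out to exactly 80×45 = 3600 patches, which is why the 3600-token configuration can ingest 720p content at native resolution while the 2048-token configuration must downscale it.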
| Method | Max tokens | MathVista | ScreenSpot | ScreenSpot-Pro | V*Bench |
|---|---|---|---|---|---|
| Dynamic-S2 | 3096 | 42.9 | 78.4 | 9.4 | 52.9 |
| Multi-crop | 3096 | 43.4 | 67.8 | 5.4 | 51.8 |
| Multi-crop with S2 | 2048 | 43.4 | 79.1 | **10.6** | **57.1** |
| Dynamic resolution | 2048 | **45.2** | **81.5** | 9.2 | 51.3 |
| Dynamic resolution | 3600 | **44.9** | **79.7** | **17.5** | **56.0** |

Table 1: Results with different resolution handling approaches. The top two configurations on each benchmark are in bold.

Data: Quality and composition
As with its language backbone Phi-4-Reasoning, Phi-4-reasoning-vision-15B was trained with a deliberate focus on data quality. Our final dataset consists primarily of data from three sources: open-source datasets which were meticulously filtered and improved; high-quality domain-specific internal data; and high-quality data from targeted acquisitions. The overwhelming majority of our data lies in the first category: data that originated as open-source data and was significantly filtered and improved, whether by removing low-quality datasets or records, programmatically fixing errors in data formatting, or using open-source images as seeds to synthetically generate higher-quality accompanying text.
The process of improving open-source data began by manually reviewing samples from each dataset. Typically, 5 to 10 minutes were sufficient to classify data as excellent-quality, good questions with wrong answers, low-quality questions or images, or high-quality with formatting errors. Excellent data was kept largely unchanged. For data with incorrect answers or poor-quality captions, we re-generated responses using GPT-4o and o4-mini, excluding datasets where error rates remained too high. Low-quality questions proved difficult to salvage, but when the images themselves were high quality, we repurposed them as seeds for new caption or visual question answering (VQA) data. Datasets with fundamentally flawed images were excluded entirely. We also fixed a surprisingly large number of formatting and logical errors across widely used open-source datasets.
We extracted additional value from existing datasets through reformatting, diversification, and using images as seeds for new data generation. We generated detailed image descriptions alongside original QA pairs for math and science data, had data perform “double-duty” by embedding instruction-following requirements directly into domain-specific QA, created “scrambled,” “caption-matching,” and “what’s changed?” records to improve multi-image reasoning and sequential navigation for CUA scenarios, and diversified prompt styles to encourage robustness beyond perfectly structured questions.
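As one concrete illustration of this kind of augmentation, a “scrambled” multi-image record can be generated programmatically from an ordered sequence of screenshots. The record format below is invented for illustration and is not the exact schema we used:

```python
import random

def make_scrambled_record(frames, rng=random.Random(0)):
    """Build a 'scrambled' multi-image sample: shuffle ordered frames and ask
    the model to recover the original order. The answer lists, for each
    original frame, its position in the shuffled sequence."""
    order = list(range(len(frames)))
    rng.shuffle(order)  # shuffled position j now shows original frame order[j]
    return {
        "images": [frames[j] for j in order],
        "question": "These screenshots are out of order. Recover the original order.",
        "answer": [order.index(i) for i in range(len(frames))],
    }

record = make_scrambled_record(["step1.png", "step2.png", "step3.png", "step4.png"])
```

Because the permutation is known at generation time, the supervision is exact, which is what makes programmatic records like this cheap to produce at scale.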
To supplement the improved open-source data, we utilized high-quality internal datasets, several math-specific datasets acquired during training of the Phi-4 language model, and some domain-specific curated data; for example, LaTeX-OCR data generated by processing and rendering equations from arXiv documents.
Figure 3: Phi-4-reasoning-vision-15B training data composition and examples

Data: Mathematics vs. computer-use data proportion
One of our goals was to train a model that performs well across general vision-language tasks, while excelling at mathematical and scientific reasoning and computer-use scenarios. How to structure datasets for generalizable reasoning remains an open question—particularly because the relationship between data scale and reasoning performance can lead to starkly different design decisions, such as training a single model on a large dataset versus multiple specialized models with targeted post-training.
Research on long-tailed classification robustness has suggested that balancing or removing data from overrepresented tasks or subgroups is an effective method for ensuring good performance. Nevertheless, these insights are not fully utilized or explored when it comes to training VLMs, which at times have favored scale over careful data balancing. To achieve our goals, we conducted a set of experiments to analyze a range of data ratios between our focus domains.
Using the same 5 billion parameter proxy model as for previous experiments, we trained while varying the amount of mathematics and science vs. computer-use data for each run. Each dataset included the same subset of 1 million general image-text pairs as a baseline. For mathematics and science data, we used a subsample of 150,000 records, optionally duplicating each one up to three times. Next, we included up to 450,000 computer-use records, and optionally an additional 400,000 from Phi-Ground.
We found that multimodal mathematics and science performance was not harmed by additional computer-use data, and vice versa. Interestingly, we found that increasing mathematics data by 3x while keeping computer-use data constant improved math, science, and computer-use benchmarks.
| General | Math and science | CUA | Total | MMMU | MathVista | ScreenSpot-V2 |
|---|---|---|---|---|---|---|
| 1M | 150K | 450K | 1.6M | 44.0 | 37.4 | 48.2 |
| 1M | 150K | 850K | 2.0M | 44.1 | 37.3 | 60.0 |
| 1M | 450K | 450K | 1.9M | 45.3 | 36.0 | 48.3 |
| 1M | 450K | 850K | 2.3M | 43.4 | 38.9 | 63.1 |
| 1M | 150K | 150K | 1.3M | 44.2 | 36.9 | 29.8 |
| 1M | 150K | 250K | 1.4M | 45.4 | 37.4 | 37.7 |

Table 2: Varying the ratios of math and CUA data. Increasing math data by 3x while keeping computer-use data constant improves both math and computer-use benchmarks.

Data: Synthetic data for text-rich visual reasoning
Recent work suggests that targeted synthetic data can materially improve multimodal reasoning, particularly for text-rich visual domains such as charts, documents, diagrams, and rendered mathematics. Using images, questions, and answers that are programmatically generated and grounded in the visual structure enables precise control over visual content and supervision quality, resulting in data that avoids many annotation errors, ambiguities, and distributional biases common in scraped datasets. This enables cleaner alignment between visual perception and multi-step inference, which has been shown to translate into measurable gains on reasoning-heavy benchmarks.
Synthetic text-rich images expand coverage of long-tail visual formats that are underrepresented in real data but disproportionately impact reasoning accuracy, improving not only visual grounding but also downstream reasoning by ensuring that failures are less often caused by perceptual errors. We found that programmatically generated synthetic data is a useful augmentation to high-quality real datasets — not a replacement, but a scalable mechanism for strengthening both perception and reasoning that complements the training objectives in compact multimodal models such as Phi-4-reasoning-vision-15B.
Mixing non-reasoning and reasoning as a design objective
In language-only settings, reasoning traces have improved performance on many tasks, but they require additional compute, which adds undesired latency. In multimodal settings, this tradeoff is less clear-cut: for tasks such as image captioning and optical character recognition (OCR), reasoning is often unnecessary and can even be harmful, while mathematical and scientific problem-solving benefits from multi-step reasoning. Thus, the choice of when to reason can be quite nuanced.
Training approaches for multimodal reasoning models
Language-only reasoning models are typically created through supervised fine-tuning (SFT) or reinforcement learning (RL): SFT is simpler but requires large amounts of expensive reasoning trace data, while RL reduces data requirements at the cost of significantly increased training complexity and compute. Multimodal reasoning models follow a similar process, but the design space is more complex. With a mid-fusion architecture, the first decision is whether the base language model is itself a reasoning or non-reasoning model. This leads to several possible training pipelines:
- Non-reasoning LLM → reasoning multimodal training: Reasoning and multimodal capabilities are trained together.
- Non-reasoning LLM → non-reasoning multimodal → reasoning multimodal training: Multimodal capabilities are learned first, then reasoning is added.
- Reasoning LLM → reasoning multimodal training: A reasoning base is used, but all multimodal data must include reasoning traces.
- Our approach: Reasoning LLM → mixed non-reasoning / reasoning multimodal training. A reasoning-capable base is trained on a hybrid data mixture, learning when to reason and when to respond directly.
Approaches 1 and 2 offer flexibility in designing multimodal reasoning behavior from scratch using widely available non-reasoning LLM checkpoints but place a heavy burden on multimodal training. Approach 1 must teach visual understanding and reasoning simultaneously and requires a large amount of multimodal reasoning data, while Approach 2 can be trained with less reasoning data but risks catastrophic forgetting, as reasoning training may degrade previously learned visual capabilities. Both risk weaker reasoning than starting from a reasoning-capable base. Approach 3 inherits strong reasoning foundations, but like Approach 1, it requires reasoning traces for all training data and produces reasoning traces for all queries, even when not beneficial.
Our approach: A mixed reasoning and non-reasoning model
Phi-4-reasoning-vision-15B adopts the fourth approach listed previously, as it balances reasoning capability, inference efficiency, and data requirements. It inherits a strong reasoning foundation but uses a hybrid approach to combine the strengths of the alternatives while mitigating their drawbacks. Our model defaults to direct inference for perception-focused domains where reasoning adds latency without improving accuracy, avoiding unnecessary verbosity and reducing inference costs, and it invokes longer reasoning paths for domains, such as math and science, that benefit from structured multi-step reasoning.
Our model is trained with SFT, where reasoning samples wrap chain-of-thought reasoning in dedicated reasoning-tag sections before the final answer, covering domains like math and science. Non-reasoning samples begin with a special token signaling a direct response, and cover perception-focused tasks such as captioning, grounding, OCR, and simple VQA. Reasoning data comprises approximately 20% of the total mix. Starting from a reasoning-capable backbone means this data grounds existing reasoning in visual contexts rather than teaching the model to reason from scratch.
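To make the hybrid data format concrete, here is a sketch of how such SFT targets could be assembled. The tag strings are placeholders we invented for illustration; the released model defines its own special tokens in its tokenizer and chat template:

```python
# Placeholder tag strings invented for illustration; the actual model
# defines its own special tokens for reasoning and direct responses.
REASONING_OPEN, REASONING_CLOSE = "<think>", "</think>"
DIRECT = "<no_think>"

def format_target(answer, chain_of_thought=None):
    """SFT target: tagged chain-of-thought before the final answer for
    reasoning samples, or a tagged direct response otherwise."""
    if chain_of_thought is not None:
        return f"{REASONING_OPEN}{chain_of_thought}{REASONING_CLOSE}{answer}"
    return f"{DIRECT}{answer}"

def reasoning_fraction(samples):
    """Fraction of the mix carrying reasoning traces (target here: ~20%)."""
    return sum(1 for s in samples if s.get("cot") is not None) / len(samples)
```

Trained on such a mixture, the model learns to emit the reasoning tags itself for problems that benefit from deliberation and to answer directly otherwise.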
This approach is not without limitations. The balance between modes is a direct function of design choices we made, informed by recent literature and observed model behavior during training, though the boundary between modes can be imprecise, as it is learned implicitly from the data distribution. Our model allows control through explicit prompting with the reasoning or direct-response control tokens when the user wants to override the default behavior. The 20/80 reasoning-to-non-reasoning data split may not be optimal for all domains or deployment contexts. Evaluating the ideal balance of data and the model’s ability to switch appropriately between modes remains an open problem.
We view this mixed approach not as a definitive solution, but as one practical and well-motivated point in the design space for balancing latency, accuracy, and flexibility in multimodal systems.
Applications

Figure 4: Phi-4-Reasoning-Vision can interpret sequences of images

Phi-4-reasoning-vision-15B is a high-performing model across many vision-language tasks. It sees and understands the world by looking at a photo, document, chart, or screen and making sense of it. In practice that covers an enormous range of applications — just a few examples include: describing images and answering questions about them, interpreting changes and trends in image sequences, recognizing objects and landmarks, and transcribing text.
Highlights: Scientific and mathematical reasoning and supporting computer-using agents (CUA)

In addition to general vision and language tasks, Phi-4-reasoning-vision-15B was designed to excel at tasks that combine visual input with structured inference: solving math problems presented in visual form (for example, handwritten or diagram-based questions), extracting and reasoning over quantitative information in documents and charts, and supporting multi-step reasoning in educational or scientific analysis contexts.
Figure 5: Phi-4-reasoning-vision-15B is great at math and science
Figure 6: Phi-4-reasoning-vision-15B can help with written math problems

In addition, we trained Phi-4-reasoning-vision-15B to have skills that can enable agents to interact with graphical user interfaces by interpreting screen content and selecting actions. With strong high-resolution perception and fine-grained grounding capabilities, Phi-4-reasoning-vision-15B is a compelling option as a base model for training agentic models, such as those that navigate desktop, web, and mobile interfaces by identifying and localizing interactive elements such as buttons, menus, and text fields. Its modest inference-time requirements make it a strong fit for interactive environments where low latency and compact model size are essential.
Figure 7: Phi-4-reasoning-vision-15B can help navigate computer UIs

Evaluation

Phi-4-reasoning-vision-15B was evaluated for accuracy and timing using two complementary open-source frameworks, Eureka ML Insights (opens in new tab) and VLMEvalKit (opens in new tab), to ensure rigorous and standardized analysis.
| Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B (force nothink) | Phi-4-mm-instruct | Kimi-VL-A3B-Instruct | gemma-3-12b-it | Qwen3-VL-8B-Instruct-4K | Qwen3-VL-8B-Instruct-32K | Qwen3-VL-32B-Instruct-4K | Qwen3-VL-32B-Instruct-32K |
|---|---|---|---|---|---|---|---|---|---|
| AI2D_TEST | 84.8 | 84.7 | 68.6 | 84.6 | 80.4 | 82.7 | 83 | 84.8 | 85 |
| ChartQA_TEST | 83.3 | 76.5 | 23.5 | 87 | 39 | 83.1 | 83.2 | 84.3 | 84 |
| HallusionBench | 64.4 | 63.1 | 56 | 65.2 | 65.3 | 73.5 | 74.1 | 74.4 | 74.9 |
| MathVerse_MINI | 44.9 | 43.8 | 32.4 | 41.7 | 29.8 | 54.5 | 57.4 | 64.2 | 64.2 |
| MathVision_MINI | 36.2 | 34.2 | 20 | 28.3 | 31.9 | 45.7 | 50 | 54.3 | 60.5 |
| MathVista_MINI | 75.2 | 68.7 | 50.5 | 67.1 | 57.4 | 77.1 | 76.4 | 82.5 | 81.8 |
| MMMU_VAL | 54.3 | 52 | 42.3 | 52 | 50 | 60.7 | 64.6 | 68.6 | 70.6 |
| MMStar | 64.5 | 63.3 | 45.9 | 60 | 59.4 | 68.9 | 69.9 | 73.7 | 74.3 |
| OCRBench | 76 | 75.6 | 62.6 | 86.5 | 75.3 | 89.2 | 90 | 88.5 | 88.5 |
| ScreenSpot_v2 | 88.2 | 88.3 | 28.5 | 89.8 | 3.5 | 91.5 | 91.5 | 93.7 | 93.9 |

Table 3: Accuracy comparisons relative to popular open-weight, non-thinking models

| Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B (force thinking) | Kimi-VL-A3B-Thinking | gemma-3-12b-it | Qwen3-VL-8B-Thinking-4K | Qwen3-VL-8B-Thinking-40K | Qwen3-VL-32B-Thinking-4K | Qwen3-VL-32B-Thinking-40K |
|---|---|---|---|---|---|---|---|---|
| AI2D_TEST | 84.8 | 79.7 | 81.2 | 80.4 | 83.5 | 83.9 | 86.9 | 87.2 |
| ChartQA_TEST | 83.3 | 82.9 | 73.3 | 39 | 78 | 78.6 | 78.5 | 79.1 |
| HallusionBench | 64.4 | 63.9 | 70.6 | 65.3 | 71.6 | 73 | 76.4 | 76.6 |
| MathVerse_MINI | 44.9 | 53.1 | 61 | 29.8 | 67.3 | 73.3 | 78.3 | 78.2 |
| MathVision_MINI | 36.2 | 36.2 | 50.3 | 31.9 | 43.1 | 50.7 | 60.9 | 58.6 |
| MathVista_MINI | 75.2 | 74.1 | 78.6 | 57.4 | 77.7 | 79.5 | 83.9 | 83.8 |
| MMMU_VAL | 54.3 | 55 | 60.2 | 50 | 59.3 | 65.3 | 72 | 72.2 |
| MMStar | 64.5 | 63.9 | 69.6 | 59.4 | 69.3 | 72.3 | 75.5 | 75.7 |
| OCRBench | 76 | 73.7 | 79.9 | 75.3 | 81.2 | 82 | 83.7 | 85 |
| ScreenSpot_v2 | 88.2 | 88.1 | 81.8 | 3.5 | 93.3 | 92.7 | 83.1 | 83.1 |

Table 4: Accuracy comparisons relative to popular open-weight, thinking models

Our model balances thinking and non-thinking performance: on average, it shows better accuracy in its default mixed-reasoning behavior than when forced into either thinking or non-thinking mode. Only in a few cases does forcing a specific mode improve performance (MathVerse_MINI and MMMU_VAL for thinking, ScreenSpot_v2 for non-thinking).
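As a quick sanity check of the "mixed beats forced" claim, the three Phi columns from Tables 3 and 4 can be averaged directly (values transcribed from the tables, in benchmark order; the other models' columns are omitted):

```python
# Phi-4-reasoning-vision-15B scores across the 10 benchmarks, from Tables 3-4.
mixed         = [84.8, 83.3, 64.4, 44.9, 36.2, 75.2, 54.3, 64.5, 76.0, 88.2]
force_nothink = [84.7, 76.5, 63.1, 43.8, 34.2, 68.7, 52.0, 63.3, 75.6, 88.3]
force_think   = [79.7, 82.9, 63.9, 53.1, 36.2, 74.1, 55.0, 63.9, 73.7, 88.1]

def mean(xs):
    return sum(xs) / len(xs)

# The default mixed mode has the highest mean of the three.
averages = {name: round(mean(xs), 2)
            for name, xs in [("mixed", mixed),
                             ("force_nothink", force_nothink),
                             ("force_think", force_think)]}
```

This unweighted mean treats all ten benchmarks equally, which is one reasonable summary but not the only one.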
Compared to recent popular, open-weight models, our model provides a desirable trade-off between accuracy and cost (as a function of inference time compute and output tokens), as discussed previously.
Note: All numbers here are the result of running benchmarks ourselves and may be lower than other previously shared numbers. Instead of quoting leaderboards, we performed our own benchmarking, so we could understand scaling performance as a function of output token counts for related models. We made our best effort to run fair evaluations and used recommended evaluation platforms with model-specific recommended settings and prompts provided for all third-party models. For Qwen models we use the recommended token counts and also ran evaluations matching our max output token count of 4096. For Phi-4-reasoning-vision-15B, we used our system prompt and chat template but did not do any custom user-prompting or parameter tuning, and we ran all evaluations with temperature=0.0, greedy decoding, and 4096 max output tokens. These numbers are provided for comparison and analysis rather than as leaderboard claims. For maximum transparency and fairness, we will release all our evaluation logs publicly. For more details on our evaluation methodology, please see our technical report (opens in new tab).
Safety

As with other Phi models, Phi-4-reasoning-vision-15B was developed with safety as a core consideration throughout training and evaluation. The model was trained on a mixture of public safety datasets and internally generated examples designed to elicit behaviors the model should appropriately refuse, in alignment with Microsoft’s Responsible AI Principles. For further details, check out our technical report (opens in new tab).
Open release and community engagement

Phi-4-reasoning-vision-15B is available on Microsoft Foundry (opens in new tab) and HuggingFace (opens in new tab) with additional examples and details on GitHub (opens in new tab). For additional guidance on how to use our model properly and safely, please refer to our Model card (opens in new tab). For further details on the technical aspects of the model, training, and evaluation, see our technical report (opens in new tab).
In line with our goal of supporting future AI development in the community, Phi-4-reasoning-vision-15B is released under a permissive license with model weights, fine‑tuning code, and benchmark logs. We intend this release to complement existing work by providing concrete artifacts that help close gaps in understanding how compact multimodal reasoning models can be built and studied.
Looking forward

Smaller vision–language models with selective, task‑aware reasoning offer one promising direction for making multimodal systems more practical and accessible. We present our model and its learnings to inform ongoing research in multimodal modeling, computer‑using agents, and mathematical and scientific reasoning. We hope these details are useful to researchers exploring similar tradeoffs and invite critical evaluation, replication, and extension by the community. If you’d like to join us and help shape the future of multimodal models, please apply for one of our open roles.
Acknowledgements

We thank Rachel Ward for her extensive work on data collection and curation. We thank the GenDatasets, PhiGround, SimCity, and Fara-7B efforts for invaluable training data. We thank Harkirat Behl, Mojan Javaheripi, and Suriya Gunasekar for providing us with Phi-4 checkpoints and guidance on training with Phi models. We additionally thank Sahaj Agarwal, Ahmed Awadallah, Qi Dai, Gustavo de Rosa, Rafah Hosn, Ece Kamar, Piero Kauffmann, Yash Lara, Chong Luo, Caio César Teodoro Mendes, Akshay Nambi, Craig Presti, Matthew Rosoff, Corby Rosset, Marco Rossi, Kashyap Patel, Adil Salim, Sidhartha Sen, Shital Shah, Pratyusha Sharma, Alexey Taymanov, Vibhav Vineet, John Weiss, Spencer Whitehead, the AI Frontiers Team and Leadership, and Microsoft Research Leadership, for their valuable help, insightful discussions, and continued support throughout this work.
The post Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model appeared first on Microsoft Research.
CORPGEN advances AI agents for real work
- Today’s AI agent benchmarks test one task at a time, while real workplace productivity requires managing dozens of interdependent tasks at once. To reflect this, we created a setting called Multi-Horizon Task Environments (MHTEs).
- Under multi-task loads, leading computer-using agents degrade sharply, with completion rates dropping from 16.7% to 8.7%.
- CORPGEN introduces digital employees, with hierarchical planning, memory isolation, and experiential learning, delivering up to 3.5 times higher completion rates than baselines across three independent agent backends.
- Because CORPGEN is architecture-agnostic and modular, its gains come from system design rather than any single base model, and it benefits directly as underlying models improve.
By mid-morning, a typical knowledge worker is already juggling a client report, a budget spreadsheet, a slide deck, and an email backlog, all interdependent and all demanding attention at once. For AI agents to be genuinely useful in that environment, they will need to operate the same way, but today’s best models are evaluated one task at a time, not dozens at once.
In our paper, “CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments,” we propose an agent framework that equips AI with the memory, planning, and learning capabilities to close that gap.
Introducing Multi-Horizon Task Environments

Replicating the reality of workplace multitasking requires a new kind of evaluation environment. In response, we developed Multi-Horizon Task Environments (MHTEs), settings where an agent must manage multiple complex tasks simultaneously. Each task requires 10 to 30 dependent steps within a single session spanning five hours.
To determine what a benchmark would need to test, we ran MHTEs at scale on some of today’s leading AI agents, exposing four weaknesses. First, memory fills up. An agent cannot hold details for multiple active tasks at once. Second, information from one task interferes with reasoning about another. Third, tasks don’t depend on each other in simple sequences. They form complex webs where an agent must constantly check whether upstream work is finished before it can move forward on anything downstream. Fourth, every action cycle requires reprioritizing across all active tasks, not simply resuming where the agent left off.
We also tested three independent agent systems under increasing loads. As the number of concurrent tasks rose from 12 to 46, completion rates fell from 16.7% to 8.7% across all systems.
CORPGEN’s architecture

CORPGEN introduces digital employees: LLM-powered AI agents with persistent identities, role-specific expertise, and realistic work schedules. They operate Microsoft Office applications through GUI automation and perform consistently within MHTEs over hours of continuous activity. Figure 1 illustrates how a digital employee moves through a full workday.
Figure 1. Each day begins with a structured plan and memory loaded from previous sessions. The agent then works through overlapping tasks in repeated cycles, storing key outcomes at day’s end to inform the next session.

CORPGEN addresses each of the four weaknesses of concurrent task execution—memory overload, cross-task interference, dependency complexity, and reprioritization—in a targeted way. Hierarchical planning breaks objectives into daily goals and then into moment-to-moment decisions, allowing the agent to act from a structured plan instead of reviewing all available tasks before each step.
Subagents perform complex operations like web research in isolated contexts, preventing cross-task contamination. A tiered memory system enables selective recall of task-related information rather than retaining everything in active context. Adaptive summarization compresses routine observations while preserving critical information, keeping memory growth controlled.
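Two of these mechanisms, tiered memory with selective recall and adaptive summarization, can be sketched as follows. This is an illustrative toy, not CORPGEN's actual implementation: the class name, budget parameter, and compression rule are all hypothetical.

```python
class TieredMemory:
    """Toy sketch of tiered memory: a small active context plus a
    long-term archive, with routine observations compressed on overflow."""

    def __init__(self, active_budget=4):
        self.active = []        # small working context the agent reasons over
        self.archive = {}       # long-term store, keyed by task id
        self.active_budget = active_budget

    def observe(self, task_id, note, critical=False):
        # Everything goes to the archive; only recent items stay active.
        self.archive.setdefault(task_id, []).append(note)
        self.active.append((task_id, note, critical))
        if len(self.active) > self.active_budget:
            self._summarize()

    def _summarize(self):
        # Adaptive summarization: keep critical notes verbatim,
        # collapse routine ones into a single compressed entry.
        critical = [e for e in self.active if e[2]]
        routine = len(self.active) - len(critical)
        self.active = critical + [
            (None, f"[{routine} routine observations compressed]", False)]

    def recall(self, task_id):
        # Selective recall: pull only notes for the task at hand.
        return self.archive.get(task_id, [])
```

The key property is that the active context stays bounded while nothing is lost from the per-task archive, mirroring the "selective recall rather than retaining everything" idea above.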
Because these mechanisms are not tied to a specific base model, we tested CORPGEN across three different agents. In each case, we observed consistent gains. The improvements came from the architecture, not from the strength of any particular model. Figure 2 shows how they fit together within CORPGEN’s architecture.
Figure 2. Four mechanisms support concurrent task execution in CORPGEN: hierarchical planning, isolated subagents, tiered memory, and adaptive summarization.

How digital employees collaborate

When multiple digital employees operate in the same environment, collaboration takes shape through standard communication channels, without predefined coordination rules. One employee sends an email requesting data; another picks it up in the next cycle, uses its memory to process it, and responds. This exchange mirrors real workplace communication.
There is no shared internal state between agents. Coordination occurs entirely through email and Microsoft Teams, the same channels many workers use. Over time, these independent exchanges form recognizable organizational patterns. Some agents take on leadership roles; others provide support; shared documents become the connective tissue.
When a communication path breaks, such as an email delivery error, agents reroute messages through alternate channels to keep work moving. The result is a virtual organization that behaves like a real one without being explicitly programmed to do so.
Evaluating CORPGEN

We evaluated CORPGEN on a multi-task benchmark that combined up to 46 tasks into a single six-hour session. Three findings stood out.
Baselines degrade as load increases; CORPGEN does not. All three baseline agent systems showed steady performance declines as task load rose. CORPGEN, by contrast, maintained or improved its completion rates at higher loads. At 46 tasks, CORPGEN completed 15.2% of tasks, compared with 4.3% for the baselines, roughly 3.5 times more.
Experiential learning drives the largest gains. We introduced CORPGEN’s components sequentially: first the orchestration layer, then cognitive tools, and finally experiential learning. The first two produced moderate improvements. Experiential learning, in which agents store records of completed tasks and reuse them when they encounter structurally similar work, produced the largest increase, raising completion rates from 8.7% to 15.2%.
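The experiential-learning idea described above can be sketched as a store of completed-task records keyed by task structure. This is an illustrative sketch under assumed data shapes (the `app`/`steps` fields and the signature function are hypothetical, not CORPGEN's actual schema):

```python
class ExperienceStore:
    """Toy sketch: save outcomes of completed tasks under a structural
    signature, and reuse them when new work matches that structure."""

    def __init__(self):
        self.records = {}

    @staticmethod
    def signature(task):
        # Hypothetical structural key: the application used plus the
        # ordered sequence of step types, ignoring task-specific details.
        return (task["app"], tuple(step["type"] for step in task["steps"]))

    def save(self, task, outcome):
        self.records[self.signature(task)] = outcome

    def lookup(self, task):
        # Returns a prior outcome for structurally similar work, if any.
        return self.records.get(self.signature(task))
```

A real system would need fuzzier matching than an exact signature, but the core point survives: structurally similar tasks hit the same record, so prior successes inform new work.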
Evaluation methodology changes the picture. When we inspected the actual output files produced by agents, the results agreed with human judgments roughly 90% of the time. Evaluation based on screenshots and action logs agreed only about 40% of the time. This gap suggests that common evaluation approaches may underestimate what agents actually accomplish in practice.
Implications and looking forward

The results suggest that memory and retrieval, not just raw model capability, may be a key bottleneck in getting agents to work in the real world. The largest gains came from experiential learning. Agents that learn from prior successes and apply those patterns to structurally similar tasks build an advantage over systems that respond to each task in isolation.
CORPGEN also opens a new lens on how AI agents collaborate. Next steps include testing whether agents can maintain memory across multiple workdays and how they coordinate when working in teams. We are also exploring ways to make agents faster and more reliable by combining different methods of interacting with software.
AcknowledgmentsThis work is a result of a collaboration between the Office of the CTO at Microsoft and the Microsoft AI Development Accelerator Program (MAIDAP). We would like to thank the Microsoft Security Research team for providing resources that supported this research. We also thank the members of the Microsoft UFO2 (opens in new tab) team and the Mem0 (opens in new tab) project for their open-source contributions, which enabled key components of the CORPGEN architecture, and the OSWorld team for the benchmark that served as the foundation for our multi-task evaluation.
Finally, we thank the many contributors to this research: Charlotte Siska, Manuel Raúl Meléndez Luján, Anthony Twum-Barimah, and Mauricio Velazco.
The post CORPGEN advances AI agents for real work appeared first on Microsoft Research.
Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions
Insights from Microsoft’s Media Integrity and Authentication: Status, Directions, and Futures report
It has become increasingly difficult to distinguish fact from fiction when viewing online images and videos. Resilient, trustworthy technologies can help people determine whether the content they are viewing was captured by a camera or microphone—or generated or modified by AI tools.
We refer to technologies aimed at helping viewers verify the source and history—that is, the provenance—of digital content as media integrity and authentication (MIA) methods. These techniques—secure provenance, driven by the Coalition for Content Provenance and Authenticity (opens in new tab) (C2PA), a standards body dedicated to scaling these capabilities, along with complementary methods such as watermarking and fingerprinting—have become critically important with the rapid advance of AI systems capable of creating realistic imagery, video, and audio at scale.
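The way these three signal types complement one another can be sketched as a layered check. This is a deliberately simplified illustration, not the C2PA SDK or any real validator: the function, its inputs, and the confidence tiers are all hypothetical.

```python
def validate(asset_id, manifest_ok, watermark_id, fingerprint_hit):
    """Toy layered MIA check: prefer the strongest available signal.

    manifest_ok:     cryptographic validation of a signed provenance manifest
    watermark_id:    payload recovered from an imperceptible watermark, or None
    fingerprint_hit: soft-hash match against an index of known content
    """
    signals = []
    if manifest_ok:                  # signed manifest: strongest signal
        signals.append(("provenance", "high"))
    if watermark_id is not None:     # watermark can survive metadata stripping
        signals.append(("watermark", "medium"))
    if fingerprint_hit:              # fingerprint supports manual forensics
        signals.append(("fingerprint", "low"))
    rank = {"low": 0, "medium": 1, "high": 2}
    return max(signals, key=lambda s: rank[s[1]], default=("none", "none"))
```

The point of the sketch is the fallback ordering: when a manifest is stripped in transit, a surviving watermark or fingerprint can still anchor a (lower-confidence) provenance claim, which is the layering idea the report develops.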
A convergence of forces

Our team recognized an inflection point in the evolution of online content integrity, driven by the convergence of four forces:
- Growing saturation of synthetic media, driven by proliferation of high-fidelity content-generation tools and the explosion of AI generated or modified media online
- Forthcoming legislation both nationally and internationally seeking to define what “verifiable” provenance should mean in practice
- Mounting pressure on implementers to ensure authentication signals are clear and helpful, especially as signals increase when legislation goes into effect in 2026
- Heightened awareness of potential adversarial attacks that attempt to exploit weaknesses in authenticity systems
The usefulness and trustworthiness of provenance signals, whether certifying content as synthetic or as an authentic capture of real-world scenes, will depend not only on advances in technology, but also on how the broader digital ecosystem adopts, implements, and governs these tools. Aligning around implementation choices that promote consistency and clarity is essential to ensure transparency signals strengthen, rather than erode, public confidence.
To address these challenges, we launched a comprehensive evaluation of the real-world limits, edge cases, and emerging “attack surfaces” for MIA methods. Today, we are publishing our findings in the Media Integrity & Authentication: Status, Directions & Futures report. The report distills lessons learned and outlines practical directions for strengthening media integrity in the years ahead.
Findings and directions forward

Our research recognizes that different media integrity and authenticity methods serve different purposes and offer distinct levels of protection. After defining each method in detail, we focused on secure provenance (C2PA), imperceptible watermarking, and soft hash fingerprinting across images, audio, and video.
Grounded in our evaluation of these MIA methods across modalities, attack categories, and real-world workflows, several new findings emerged, including two new concepts:
- High-Confidence Provenance Authentication: a critical capability for verifying, under defined conditions, whether claims about the origin of and modifications made to an asset can be validated with high certainty.
- Sociotechnical Provenance Attacks: attacks aimed at deception and capable of inverting signals, making authentic content appear synthetic, and synthetic content appear authentic.
Drawing on our findings, we identified four promising directions for further strengthening media authentication, along with suggestions to support more effective implementation strategies and future decisions. We’ve summarized the findings and directions below, with additional detail available in the report.
Promising directions and high-level findings

Delivering high-confidence provenance authentication
- Implementation and display choices may affect the reliability of provenance indicators and how they are interpreted by the public.
- Using a C2PA provenance manifest for media created and signed in a high-security environment enables high-confidence validation.
- High-confidence validation is also possible across a broader volume of images, audio, and video when an imperceptible watermark is linked to a C2PA provenance manifest as an additional layer to recover the provenance information if it is removed.
- Fingerprinting is not an enabler for high-confidence validation and can involve significant costs at scale. However, it can support manual forensics.

Mitigating confusion from sociotechnical provenance attacks
- MIA methods are susceptible to sociotechnical attacks on provenance that may mislead the public, resulting in confusion and misplaced trust about an asset’s provenance if there is overreliance on low-quality signals.
- Layering and linking secure provenance and imperceptible watermarking methods to achieve high-confidence validation also offers a promising option to both deter and mitigate the impact of attacks.
- Unintended consequences may result from the use of methods lacking authentication, such as the use of perceptible watermarks in the absence of secure provenance. Perceptible watermarks may cause confusion in cases of forgery, or discourage people from consulting high-confidence provenance information via a validation tool if such perceptible disclosures are taken at face value.
- UX design that enables users to explore manifest details—such as where edits occurred or regions of interest—has the potential to reduce confusion and support forensics and fact-checking efforts.

Enabling more trusted provenance on edge devices
- High-confidence results aren’t feasible when provenance is added by a conventional offline device (e.g., a camera or recording device without connectivity).
- Implementing a secure enclave within the hardware layer of offline devices is essential to make the provenance of captured images, audio, and video more trustworthy.

Investing in ongoing research and policy development
- All three methods offer organizations valuable tools for addressing operational challenges such as fraud prevention, risk management, and digital accountability.
- UX and display are promising directions for research. Important directions include in-stream tools that display provenance information where people are and distinguish between high- and lower-confidence provenance signals.
- Stakeholders should conduct ongoing analysis and red teaming to identify and mitigate weaknesses through technical approaches, policies, and laws.

The journey continues
This report marks the beginning of a new chapter in our media provenance journey (opens in new tab), building on years of foundational work, from developing the very first prototype in 2019 to co-founding the C2PA in 2021 and helping catalyze an ecosystem that has since grown to more than 6,000 members and affiliates (opens in new tab) supporting C2PA Content Credentials. This research represents the next evolution of that long‑standing commitment.
We hope that sharing our learnings will help others prepare for an important wave, especially as generative technologies accelerate and provenance signals multiply. This work is already underway across our products at Microsoft. Together, these directions highlight opportunities for the ecosystem to align, harden, and innovate, so authentication signals are not merely visible, but robust, meaningful, and resilient throughout the content lifecycle.
The post Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions appeared first on Microsoft Research.
Project Silica’s advances in glass storage technology
- Microsoft Research publishes breakthrough in Nature on glass-based data storage that could preserve information for 10,000 years.
- New technique extends technology from expensive fused silica to ordinary borosilicate glass found in kitchen cookware.
- Innovations enable faster parallel writing, simplified readers (one camera instead of three), and easier manufacturing.
- Phase voxel method requires only a single laser pulse, significantly reducing complexity and cost.
Long-term preservation of digital information has long challenged archivists and datacenters, as magnetic tapes and hard drives degrade within decades. Existing archival storage solutions have limited media lifespans that make them less than ideal for preserving information for future generations.
Now, we are excited to report significant progress on Project Silica (opens in new tab), our effort to encode data in glass using femtosecond lasers, a technology that could preserve information for 10,000 years. Glass is a permanent data storage material that is resistant to water, heat, and dust.
In findings published in Nature (opens in new tab), we describe a breakthrough that extends the technology beyond expensive fused silica to ordinary borosilicate glass, a readily available, lower-cost medium that is the same material found in kitchen cookware and oven doors. This advance addresses key barriers to commercialization: cost and availability of storage media. We have unlocked the science for parallel high-speed writing and developed a technique to permit accelerated aging tests on the written glass, suggesting that the data should remain intact for at least 10,000 years.
Storing data inside glass with femtosecond (opens in new tab) laser pulses is one of the few technologies on the horizon with the potential for durable, immutable, and long-lived storage. Although we have been leading innovation in this type of storage for years, prior to this research the technique only worked with pure fused silica glass, a type of glass that is relatively difficult to manufacture and available from only a few sources.
In the paper, we show how data can be stored in borosilicate glass. The new technique stores hundreds of layers of data in glass only 2 mm thick, as with previous methods, but with important improvements. The reader for the glass now needs only one camera, not three or four, reducing cost and size. In addition, the writing devices require fewer parts, making them easier to manufacture and calibrate, and enabling them to encode data more quickly.
Key scientific discoveries

The Nature paper details several key new scientific discoveries:
Advances in birefringent voxel (opens in new tab) writing: For the previous type of data storage in fused silica glass using birefringent (i.e., polarization) voxels, we developed a technique to reduce the number of pulses used to form the voxel from many to only two, critically showing that the polarization of the first pulse is not important to the polarization of the voxel formed. We further developed this to enable pseudo-single-pulse writing, in which a single pulse can be split after its polarization is set to simultaneously form the first pulse for one voxel (where the polarization doesn’t matter) and the second pulse of another (where the set polarization is essential). We demonstrated how to use this pseudo-single-pulse writing to enable fast writing with beam scanning across the media.
Phase voxels, a new storage method: We invented a new type of data storage in glass called phase voxels, in which the phase change of the glass is modified instead of its polarization, showing that only a single pulse is necessary to make a phase voxel. We demonstrated that these phase voxels can also be formed in borosilicate glass and devised a technique to read the phase information from phase voxels encoded in this material. We showed that the much higher levels of three-dimensional inter-symbol interference in phase voxels can be mitigated with a machine learning classification model.
Parallel writing capabilities: By combining a mathematical model of pre-heating and post-heating within the glass with the invention of a multi-beam delivery system, we showed that many data voxels can be written in proximity in the glass at the same time, significantly increasing writing speed. We explained a method for using light emissions (a side effect of voxel formation) for both static calibration and dynamic control to fully support automatic writing operations.
Optimization and longevity testing: We developed a new way to optimize symbol encodings using machine learning and a better way to understand the tradeoff between error rates, error protection, and error recovery when evaluating new digital storage systems. We also created a new nondestructive optical method (opens in new tab) to identify the aging of data storage voxels within the glass, using this and standard accelerated aging techniques to support data lasting 10,000 years. We extended industry-standard Gray codes to apply to non-power-of-two numbers of symbols.
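To make the Gray-code extension concrete, here is the standard reflected construction of a k-ary Gray code, which works for any symbol count k, including non-powers of two: adjacent codewords differ in exactly one digit, by exactly 1. This is the textbook construction, shown for illustration; the paper's specific extension may differ.

```python
def kary_gray(k, digits):
    """Reflected k-ary Gray code over `digits` digit positions.

    Works for any base k (e.g., k=3 or k=5 symbols per voxel), not
    just powers of two. Adjacent codewords differ in one digit by 1.
    """
    codes = [[]]
    for _ in range(digits):
        nxt = []
        for i, code in enumerate(codes):
            # Alternate digit direction per parent code ("reflection"),
            # so codewords at run boundaries share their last digit.
            order = range(k) if i % 2 == 0 else range(k - 1, -1, -1)
            nxt.extend(code + [d] for d in order)
        codes = nxt
    return codes
```

For storage, the single-digit-step property matters because a small physical misread of one voxel then corrupts at most one digit of the codeword by the smallest possible amount.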
Slideshow: Project Silica hardware
- A piece of Project Silica media written with data.
- A research-grade Writer used to set the record for high-speed data writing into glass.
- A research-grade Reader for retrieving data from glass.
- Close-up of the Writer showing high-speed multi-beam data encoding on laser pulses.
Demonstrating the technology

As a research initiative, Project Silica has demonstrated these advances through several proofs of concept, including storing Warner Bros.’ “Superman” movie on quartz glass (opens in new tab), partnering with Global Music Vault (opens in new tab) to preserve music under ice for 10,000 years (opens in new tab), and working with students on a “Golden Record 2.0” project (opens in new tab), a digitally curated archive of images, sounds, music, and spoken language, crowdsourced to represent and preserve humanity’s diversity for millennia.
Looking ahead

The research phase is now complete, and we are continuing to consider learnings from Project Silica as we explore the ongoing need for sustainable, long-term preservation of digital information. We have added this paper to our published works so that others can build on this work.
Related workProject Silica has made scientific advances across multiple areas beyond laser direct writing (LDW) in glass, including archival storage systems design, archival workload analysis, datacenter robotics, erasure coding, free-space optical components, and machine learning-based methods for symbol decoding in storage systems. Many of these innovations were described in our ACM Transactions on Storage publication (opens in new tab) in 2025.
The post Project Silica’s advances in glass storage technology appeared first on Microsoft Research.
Rethinking imitation learning with Predictive Inverse Dynamics Models
- Imitation learning becomes easier when an AI agent understands why an action is taken.
- Predictive Inverse Dynamics Models (PIDMs) predict plausible future states, clarifying the direction of behavior during imitation learning.
- Even imperfect predictions reduce ambiguity, making it clearer which action makes sense in the moment.
- This makes PIDMs far more data‑efficient than traditional approaches.
Imitation learning teaches AI agents by example: show the agent recordings of how people perform a task and let it infer what to do. The most common approach, Behavior Cloning (BC), frames this as a simple question: “Given the current state of the environment, what action would an expert take?”
In practice, this is done through supervised learning, where the states serve as inputs and expert actions as outputs. While simple in principle, BC often requires large demonstration datasets to account for the natural variability in human behavior, but collecting such datasets can be costly and difficult in real-world settings.
Predictive Inverse Dynamics Models (PIDMs) offer a different take on imitation learning by changing how agents interpret human behavior. Instead of directly mapping states to actions, PIDMs break down the problem into two subproblems: predicting what should happen next and inferring an appropriate action to go from the current state to the predicted future state. While PIDMs often outperform BC, it has not been clear why they work so well, motivating a closer look at the mechanisms behind their performance.
In the paper, “When does predictive inverse dynamics outperform behavior cloning?” we show how this two-stage approach enables PIDMs to learn effective policies from far fewer demonstrations than BC. By grounding the selection process in a plausible future, PIDMs provide a clearer basis for choosing an action during inference. In practice, this can mean achieving comparable performance with as few as one-fifth the demonstrations required by BC, even when predictions are imperfect.
Figure 1. BC vs. PIDM architectures. (Top) Behavior Cloning learns how to perform a direct mapping from the current state to an action. (Bottom) PIDMs add a state predictor that predicts future states. They then use an inverse dynamics model to predict the action required to move from the current state towards that future state. Both approaches share a common latent representation through a shared state encoder.

How PIDMs rethink imitation

PIDMs’ approach to imitation learning consists of two core elements: a model that forecasts plausible future states, and an inverse dynamics model (IDM) that predicts the action needed to move from the present state toward that future. Instead of asking, “What action would an expert take?” PIDMs effectively ask, “What would an expert try to achieve, and what action would lead to it?” This shift turns the information in the current observation (e.g., video frame) into a coherent sense of direction, reducing ambiguity about intent and making action prediction easier.
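The ambiguity-reduction argument can be made concrete with a toy experiment. The sketch below is purely illustrative (a counting model in a one-dimensional corridor, not the paper's neural architecture): from the middle state, expert demonstrations look contradictory to BC because the goal is hidden, while an inverse dynamics model conditioned on a plausible next state recovers a single consistent action.

```python
import random
from collections import Counter

random.seed(0)

# Toy 1-D corridor: states 0..4, actions -1 (left) / +1 (right).
# The expert's goal (0 or 4) is hidden from the observation, so
# demonstrations starting from the middle state look contradictory.
demos = []
for _ in range(100):
    goal = random.choice([0, 4])
    state = 2
    while state != goal:
        action = 1 if goal > state else -1
        demos.append((state, action, state + action))  # (s, a, s')
        state += action

# Behavior Cloning: estimate P(a | s) by counting.
bc_counts = Counter((s, a) for s, a, _ in demos)
left, right = bc_counts[(2, -1)], bc_counts[(2, 1)]
print(f"BC at s=2: {left} left vs {right} right")  # roughly 50/50: ambiguous

# PIDM's inverse dynamics: estimate P(a | s, s') by counting.
idm_counts = Counter(((s, s2), a) for s, a, s2 in demos)
# Conditioned on a predicted future state s'=3, the action is unambiguous.
print(f"IDM at (s=2, s'=3): {idm_counts[((2, 3), 1)]} right vs "
      f"{idm_counts[((2, 3), -1)]} left")
```

Conditioning on where the trajectory is headed collapses the 50/50 action split into a deterministic choice, which is the same mechanism that lets PIDMs extract more signal per demonstration.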
Real-world validation in a 3D gameplay environment

To evaluate PIDMs under realistic conditions, we trained agents on human gameplay demonstrations in a visually rich video game. These conditions include operating directly from raw video input, interacting with a complex 3D environment in real time at 30 frames per second, and handling visual artifacts and unpredictable system delays.
The agents ran from beginning to end, taking video frames as input and continuously deciding which buttons to press and how to move the joysticks. Instead of relying on a hand-coded set of game variables and rules, the model worked directly from visual input, using past examples to predict what comes next and choosing actions that moved play in that direction.
We ran all experiments on a cloud gaming platform, which introduced additional delays and visual distortions. Despite these challenges, the PIDM agents consistently matched human patterns of play and achieved high success rates across tasks, as shown in Video 1 below and Videos 2 and 3 in the appendix.
Video 1. A player (left) and a PIDM agent (right) side by side playing the game Bleeding Edge. Both navigate the same trajectory, jumping over obstacles and engaging with nonplayer characters. Despite network delays, the agent closely matches the player’s timing and movement in real time.

Why and when PIDMs outperform BC

Of course, AI agents do not have access to future outcomes. They can only generate predictions based on available data, and those predictions are sometimes wrong. This creates a central trade‑off for PIDMs.
On one hand, anticipating where the agent should be heading can clarify what action makes sense in the present. Knowing the intended direction helps narrow an otherwise ambiguous choice. On the other hand, inaccurate predictions can occasionally steer the model toward the wrong action.
The key insight is that these effects are not symmetric. While prediction errors introduce some risk, reducing ambiguity in the present often matters more. Our theoretical analysis shows that even with imperfect predictions, PIDMs outperform BC as long as the prediction error remains modest. If future states were known perfectly, PIDMs would outperform BC outright.
In practice, this means that clarifying intent often matters more than accurately predicting the future. That advantage is most evident in the situations where BC struggles: where human behavior varies and actions are driven by underlying goals rather than by what is immediately visible on the screen.
BC requires many demonstrations because each example is noisy and open to multiple interpretations. PIDMs, by contrast, sharpen each demonstration by linking actions to the future states they aim to reach. As a result, PIDMs can learn effective action strategies from far fewer examples.
Evaluation

To test these ideas under realistic conditions, we designed a sequence of experiments that begins with a simple, interpretable 2D environment (Video 4 in the appendix) and culminates in a complex 3D video game. We trained both BC and PIDM on very small datasets, ranging from one to fifty demonstrations in the 2D environment and from five to thirty for the 3D video game. Across all tasks, PIDM reached high success rates with far fewer demonstrations than BC.
In the 2D setting, BC needed two to five times more data to match PIDM’s performance (Figure 2). In the 3D game, BC needed 66% more data to achieve comparable results (Video 5 in the appendix).
Figure 2. Performance gains in the 2D environment. As the number of training demonstrations increases, PIDM consistently achieves higher success rates than BC across all four tasks. Curves show mean performance, with shading indicating variability across 20 experiments for reproducibility.

Takeaway: Intent matters in imitation learning

The main message of our investigation is simple: imitation becomes easier when intent is made explicit. Predicting a plausible future, even an imperfect one, helps resolve ambiguity about which action makes sense right now, much like driving more confidently in the fog when the driver already knows where the road is headed. PIDM shifts imitation learning from pure copying toward goal-oriented action.
This approach has limits. If predictions of future states become too unreliable, they can mislead the model about the intended next move. In those cases, the added uncertainty can outweigh the benefit of reduced ambiguity, causing PIDM to underperform BC.
But when predictions are reasonably accurate, reframing action prediction as “How do I get there from here?” helps explain why learning from small, messy human datasets can be surprisingly effective. In settings where data is expensive and demonstrations are limited, that shift in perspective can make a meaningful difference.
Appendix: Visualizations and results (videos)

Video 2. A player, a naïve action-replay baseline, and a PIDM agent playing Bleeding Edge. (Left) The player completes the task under normal conditions. (Middle) The baseline replays the recorded actions at their original timestamps, which initially appears to work. Because the game runs on a cloud gaming platform, however, random network delays quickly push the replay out of sync, causing the trajectory to fail. (Right) Under the same conditions, the PIDM agent behaves differently. Instead of naively replaying actions, it continuously interprets visual input, predicts how the behavior is likely to unfold, and adapts its actions in real time. This allows it to correct delays, recover from deviations, and successfully reproduce the task in settings where naïve replay inevitably fails.

Video 3. A player and a PIDM agent performing a complex task in Bleeding Edge. In this video, the task exhibits strong partial observability: correct behavior depends on whether a location is being visited for the first or second time. For example, in the first encounter, the agent proceeds straight up the ramp; on the second, it turns right toward the bridge. Similarly, it may jump over a box on the first pass but walk around it on the second. The PIDM agent reproduces this trajectory reliably, using coarse future guidance to select actions in the correct direction.

Video 4. Visualization of the 2D navigation environment. These videos show ten demonstrations for each of four tasks: Four Room, Zigzag, Maze, and Multiroom. In all cases, the setup is the same: the character (blue box) moves through the environment and must reach a sequence of goals (red squares). The overlaid trajectories visualize the paths the player took; the models never see these paths. Instead, they observe only their character’s current location, the position of all goals, and whether each goal has already been reached.
Because these demonstrations come from real players, no two paths are identical: players pause, take detours, or correct small mistakes along the way. That natural variability is exactly what the models must learn to handle.

Video 5. PIDM vs. BC in a 3D environment. The PIDM agent achieves an 85% success rate with only fifteen demonstrations used in training. The BC agent struggles to stay on track and levels off around 60%. The contrast illustrates how differently the two approaches perform when training data is limited.

The post Rethinking imitation learning with Predictive Inverse Dynamics Models appeared first on Microsoft Research.
Paza: Introducing automatic speech recognition benchmarks and models for low resource languages
- Microsoft Research releases PazaBench and Paza automatic speech recognition models, advancing speech technology for low resource languages.
- Human-centered pipeline for low-resource languages: Built for and tested by communities, Paza is an end-to-end, continuous pipeline that elevates historically under-represented languages and makes speech models usable in real-world, low-resource contexts.
- First-of-its-kind ASR leaderboard, starting with African languages: PazaBench is the first automatic speech recognition (ASR) leaderboard for low-resource languages. Launching with 39 African languages and 51 state-of-the-art models, it tracks three key metrics across leading public and community datasets.
- Human-centered Paza ASR models: Minimal data, fine-tuned ASR models grounded in real-world testing with farmers on everyday mobile devices, covering six Kenyan languages: Swahili, Dholuo, Kalenjin, Kikuyu, Maasai, and Somali.
According to the 2025 Microsoft AI Diffusion Report, approximately one in six people globally had used a generative AI product. Yet for billions of people, the promise of voice interaction still falls short, and whilst AI is becoming increasingly multilingual, a key question remains: Do these models actually work for all languages and the people who rely on them? This challenge is one we first confronted through Project Gecko—a collaboration between Microsoft Research and Digital Green (opens in new tab), where field teams across Africa and India focused on building usable AI tools for farmers.
Gecko revealed how often speech systems fail in real‑world, low‑resource environments—where many languages go unrecognized and non‑Western accents are frequently misunderstood. Yet speech remains the primary medium of communication globally. For communities across Kenya, Africa, and beyond, this mismatch creates cascading challenges: without foundational data representing their languages and cultures, innovation stalls, and the digital and AI divides widen.
Paza addresses this with a human-centered speech model pipeline. Through PazaBench, it benchmarks low-resource languages using both public and community-sourced data; through the Paza models, it fine-tunes speech models to deliver outsized gains in mid- and low-resource languages, evaluating with community testers using real devices in real contexts. Upcoming playbooks complement this work by sharing practical guidance on dataset creation, fine-tuning with minimal data, and evaluation considerations, introducing a continuous pipeline that enables researchers and practitioners to build and evaluate systems grounded in real human use.
How Project Gecko informed Paza’s design

In addition to building cost-effective, adaptable AI systems, the extensive fieldwork on Project Gecko highlighted an important lesson: building usable speech models in low‑resource settings is not only a data problem, but also a design and evaluation problem. For AI systems to be useful, they must work in local languages, support hands‑free interaction through voice, text, and video, and deliver information in formats that fit real-world environments, that is, on low-bandwidth mobile devices, in noisy settings, and for varying literacy levels.
These insights shaped the design of Paza, named for the Swahili phrase paza sauti, meaning “to project” or “to raise your voice.” The name reflects our intent: rather than simply adding more languages to existing systems, Paza is about co-creating speech technologies in partnership with the communities who use them. Guided by this principle, Paza puts human use first, letting real-world feedback drive model improvement.
PazaBench: The first ASR leaderboard for low-resource languages

PazaBench is the first automatic speech recognition (ASR) leaderboard dedicated to low‑resource languages. It launches with initial coverage for 39 African languages and benchmarks 52 state‑of‑the‑art ASR and language models, including the newly released Paza ASR models for six Kenyan languages. The platform aggregates leading public and community datasets, spanning diverse styles of speech (conversational, scripted read-aloud, unscripted, broadcast news, and domain-specific data), into one easy‑to‑explore view per language. This makes it easier for researchers, developers, and product teams to assess which models perform best across underserved languages and regions, understand trade-offs between speed and accuracy, and identify where gaps persist.
PazaBench tracks three core metrics:
- Character Error Rate (CER), which is important for languages with rich word forms, where meaning is built by combining word parts, so errors at the character level can significantly change meaning
- Word Error Rate (WER) for word-level transcript accuracy
- RTFx (Inverse Real‑Time Factor) which measures how fast transcription runs relative to real‑time audio duration.
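All three metrics have simple definitions. The following is a minimal reference implementation using the standard edit-distance formulation; it illustrates the definitions and is not PazaBench's actual scoring code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def wer(ref, hyp):
    """Word Error Rate: word-level edits / reference word count."""
    words = ref.split()
    return edit_distance(words, hyp.split()) / len(words)

def cer(ref, hyp):
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: values above 1 are faster than real time."""
    return audio_seconds / processing_seconds

print(wer("habari ya asubuhi", "habari za asubuhi"))  # one substituted word out of three
print(cer("habari", "habart"))                        # one substituted character out of six
print(rtfx(60.0, 3.0))                                # 60 s of audio in 3 s of compute
```

For morphologically rich languages, a single wrong affix flips an entire word to an error under WER, while CER registers only a small character-level penalty, which is why PazaBench reports both.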
More than scores, PazaBench standardizes evaluation to prioritize dataset gaps, identify underperforming languages, and highlight where localized models beat wider coverage ASR models—offering early evidence of the value of African‑centric innovation.
Explore PazaBench

To contribute to the benchmark, request additional language evaluation on the leaderboard.
Paza ASR Models: Built with and for Kenyan languages

The Paza ASR models consist of three fine-tuned ASR models built on top of state‑of‑the‑art model architectures. Each model targets Swahili, a mid-resource language, and five low‑resource Kenyan languages: Dholuo, Kalenjin, Kikuyu, Maasai, and Somali. The models are fine-tuned on public and curated proprietary datasets.
Fine‑tuning the three models allowed us to explore complementary approaches toward a shared goal: building speech recognition systems that are usable in local contexts, starting with the six Kenyan languages, and bridging gaps in multilingual, multimodal video question answering through the MMCT agent (opens in new tab).
See the MMCT agent in action in the field

Early versions of two models in Kikuyu and Swahili were deployed on mobile devices and tested directly with farmers in real‑world settings, enabling the team to observe how the models performed with everyday use. Farmers provided in‑the‑moment feedback on accuracy, usability, and relevance, highlighting where transcripts broke down, which errors were most disruptive, and what improvements would make the models more helpful in practice. This feedback loop directly informed subsequent fine‑tuning, ensuring model improvements were driven not only by benchmark scores, but by the needs and expectations of the communities they are intended to serve.
Explore the Paza collection

Here is how the Paza models compare to three state-of-the-art ASR models today:
Figure 1: Character Error Rate (CER) comparison across the Kenyan languages for several state‑of‑the‑art ASR models, including the Paza models. Lower CER indicates better transcription performance.

Figure 2: Word Error Rate (WER) comparison across the Kenyan languages for several state‑of‑the‑art ASR models, including the Paza models. Lower WER indicates better transcription performance.

1) Paza‑Phi‑4‑Multimodal‑Instruct
Microsoft’s Phi‑4 multimodal‑instruct (opens in new tab) is a next‑generation small language model built to reason across audio, text, and vision. With Paza, we extend its audio capabilities, adapting a powerful multimodal architecture into a high‑quality automatic speech recognition (ASR) system for low‑resource African languages.
Fine‑tuned on unified multilingual speech datasets, the model was optimized specifically for transcription in the six languages. It preserves its underlying transformer architecture and multimodal capabilities while selectively fine-tuning only the audio‑specific components, enabling strong cross‑lingual generalization.
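The selective fine-tuning described above can be sketched as a parameter-selection rule. The parameter-name prefixes below are hypothetical stand-ins, not Phi-4-multimodal's real module names.

```python
# Assumed (hypothetical) prefixes marking audio-specific components.
AUDIO_PREFIXES = ("audio_encoder.", "audio_projector.")

def select_trainable(param_names, prefixes=AUDIO_PREFIXES):
    """Map each parameter name to True (fine-tune) or False (freeze)."""
    return {name: name.startswith(prefixes) for name in param_names}

params = [
    "vision_encoder.layer0.weight",
    "audio_encoder.conv0.weight",
    "audio_projector.linear.weight",
    "language_model.layer0.attn.weight",
]
flags = select_trainable(params)
for name, trainable in flags.items():
    print(("train " if trainable else "freeze"), name)
```

In a PyTorch training loop, a mapping like this would typically be applied as `param.requires_grad = trainable` over `model.named_parameters()`, so the optimizer updates only the audio pathway while the vision and language components stay frozen.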
As the results below show, this model delivers consistent improvements in transcription quality across all six languages.
Figure 3: Character Error Rate (CER) comparison across the six languages for the base model versus the fine-tuned Paza model. Lower CER indicates better transcription performance.

Figure 4: Word Error Rate (WER) comparison across the six languages for the base model versus the fine-tuned Paza model. Lower WER indicates better transcription performance.

Test the model here

2) Paza‑MMS‑1B‑All
This model is fine-tuned on Meta’s mms-1b-all model, which employs a large-scale Wav2Vec2.0-style encoder with lightweight language-specific adapters to enable efficient multilingual specialization. For this release, each of the six language adapters was fine‑tuned independently on curated low‑resource datasets, allowing targeted adaptation while keeping the shared encoder largely frozen.
As shown in the figures below, this model improves transcription accuracy while maintaining the model’s strong cross‑lingual generalization.
Figure 5: Character Error Rate (CER) comparison across the six languages for the base model versus the fine-tuned Paza model. Lower CER indicates better transcription performance.

Figure 6: Word Error Rate (WER) comparison across the six languages for the base model versus the fine-tuned Paza model. Lower WER indicates better transcription performance.

Join the Research Early Access Program

3) Paza‑Whisper‑Large‑v3‑Turbo
This model is fine-tuned on OpenAI’s whisper-large-v3-turbo base model. Whisper is a transformer-based encoder–decoder model that delivers robust automatic speech recognition (ASR) capabilities. This model was fine‑tuned on the entire unified multilingual ASR dataset, covering the six languages mentioned above, to encourage cross-lingual generalization. In addition, an extra post‑processing step was applied to address known Whisper hallucination failure modes, improving transcription reliability.
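The blog does not detail the post-processing step. One well-known Whisper failure mode is a phrase looping many times, so a frequently used mitigation, sketched here purely as an illustration rather than Paza's actual implementation, is to collapse long runs of repeated n-grams in the transcript.

```python
def dedupe_ngram(words, n, max_repeats):
    """Collapse runs of a repeated n-gram down to max_repeats copies."""
    out, i = [], 0
    while i < len(words):
        gram = words[i:i + n]
        reps = 1
        # Count how many consecutive copies of this n-gram follow.
        while len(gram) == n and words[i + reps * n:i + (reps + 1) * n] == gram:
            reps += 1
        out.extend(gram * min(reps, max_repeats))
        i += reps * n
    return out

def collapse_repeats(text, max_ngram=3, max_repeats=2):
    """Apply the n-gram collapse for every n-gram size up to max_ngram."""
    words = text.split()
    for n in range(1, max_ngram + 1):
        words = dedupe_ngram(words, n, max_repeats)
    return " ".join(words)

print(collapse_repeats("asante sana sana sana sana kwa kuja"))
print(collapse_repeats("thank you thank you thank you thank you so much"))
```

A filter like this leaves legitimate doubled words intact (up to the max_repeats threshold) while truncating the pathological loops that would otherwise inflate WER and CER.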
As shown below, this release achieves improved transcription accuracy while retaining Whisper’s robustness.
Figure 7: Character Error Rate (CER) comparison across the six languages for the base model versus the fine-tuned Paza model. Lower CER indicates better transcription performance.

Figure 8: Word Error Rate (WER) comparison across the six languages for the base model versus the fine-tuned Paza model. Lower WER indicates better transcription performance.

Test the model here

Where do we go from here

AI is reshaping how the world communicates. Designing with people, not just for them, means looking beyond the languages that are already well‑served. We plan to expand PazaBench beyond African languages and evaluate state‑of‑the‑art ASR models across more low‑resource languages globally. The Paza ASR models are an early step; truly supporting small and under‑represented languages requires dedicated datasets, strong local partnerships, and rigorous evaluation. Meaningful progress depends on sustained collaboration with the communities who speak these languages, and expanding responsibly means prioritizing depth and quality over broad but shallow coverage.
As we continue this work, we’re distilling our methods into a forthcoming playbook to help the broader ecosystem curate datasets, fine‑tune responsibly, and evaluate models in real‑world conditions. And we’re not stopping at speech—additional playbooks will guide teams building AI tools and applications for multilingual, multicultural contexts, and give them practical recommendations for deploying across diverse communities.
Together, these guides—grounded in technical advances and community‑driven design—share our learnings to help researchers, engineers, and designers build more human‑centered AI systems.
Acknowledgements

The following researchers played an integral role in this work: Najeeb Abdulhamid, Felermino Ali, Liz Ankrah, Kevin Chege, Ogbemi Ekwejunor-Etchie, Ignatius Ezeani, Tanuja Ganu, Antonis Krasakis, Mercy Kwambai, Samuel Maina, Muchai Mercy, Danlami Mohammed, Nick Mumero, Martin Mwiti, Stephanie Nyairo, Millicent Ochieng and Jacki O’Neill.
We would like to thank the Digital Green (opens in new tab) team—Rikin Gandhi, Alex Mwaura, Jacqueline Wang’ombe, Kevin Mugambi, Lorraine Nyambura, Juan Pablo, Nereah Okanga, Ramaskanda R.S, Vineet Singh, Nafhtari Wanjiku, Kista Ogot, Samuel Owinya and the community evaluators in Nyeri and Nandi, Kenya — for their valuable contributions to this work.
We extend our gratitude to the creators, community contributors, and maintainers of African Next Voices Kenya (opens in new tab), African Next Voices South Africa (opens in new tab), ALFFA (opens in new tab), Digigreen (opens in new tab), Google FLEURS (opens in new tab), Mozilla Common Voice (opens in new tab) and Naija Voices (opens in new tab) whose efforts have been invaluable in advancing African languages speech data.
The post Paza: Introducing automatic speech recognition benchmarks and models for low resource languages appeared first on Microsoft Research.
UniRG: Scaling medical imaging report generation with multimodal reinforcement learning
- AI-driven medical image report generation can help medical providers become more efficient and productive.
- Current models are difficult to train because reporting practices vary widely among providers.
- Universal Report Generation (UniRG) uses reinforcement learning to align model training with real-world radiology practice rather than proxy text-generation objectives.
- UniRG has achieved state-of-the-art performance across datasets, metrics, diagnostic tasks, longitudinal settings, and demographic subgroups.
- Test results show that reinforcement learning, guided by clinically meaningful reward signals, can substantially improve the reliability and generality of medical vision–language models.
AI can be used to produce clinically meaningful radiology reports using medical images like chest x-rays. Medical image report generation can reduce reporting burden while improving workflow efficiency for healthcare professionals. Beyond the real-world benefits, report generation has also become a critical benchmark for evaluating multimodal reasoning in healthcare AI.
Despite recent advances driven by large vision–language models, current systems still face major limitations in real-world clinical settings. One challenge stems from the wide variation in radiology reporting practices across institutions, departments, and patient populations. A model trained with supervised fine-tuning on one set of data may learn its specific phrasing and conventions instead of more general patterns—a problem known as overfitting. As a result, the model performs well on that data but delivers poor results when evaluated on unseen institutions or external datasets. Moreover, since model training is often aimed at producing text that looks similar to existing reports, some well written but clinically inaccurate reports can slip through.
In this blog, we introduce Universal Report Generation (UniRG) (opens in new tab), a reinforcement learning–based framework for medical imaging report generation. This work is a research prototype intended to advance medical AI research and is not validated for clinical use. UniRG uses reinforcement learning as a unifying mechanism to directly optimize clinically grounded evaluation signals, aligning model training with real-world radiology practice rather than proxy text-generation objectives. Using this framework, we train UniRG-CXR (opens in new tab), a state-of-the-art chest x-ray report generation model at scale, spanning over 560,000 studies, 780,000 images, and 226,000 patients from more than 80 medical institutions.
To our knowledge, this is the first report generation model to achieve consistent state-of-the-art performance across report-level metrics, disease-level diagnostic accuracy, cross-institution generalization, longitudinal report generation, and demographic subgroups. These results demonstrate that reinforcement learning, when guided by clinically meaningful reward signals, can substantially improve both the reliability and generality of medical vision–language models.
A unified framework for scaling medical image report generation

UniRG builds state-of-the-art report generation models by combining supervised fine-tuning with reinforcement learning, which optimizes a composite reward that integrates rule-based metrics, model-based semantic metrics, and LLM-based clinical error signals. This approach allows the resulting model UniRG-CXR to learn from diverse data sources, move beyond dataset-specific reporting patterns, and learn representations that generalize across institutions, metrics, and clinical contexts. Notably, UniRG-CXR sets a new state of the art on the authoritative ReXrank leaderboard (opens in new tab), a public leaderboard for chest X-ray image interpretation, as of 01/22/2026, surpassing previous best models by substantial margins (Figure 1).
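The composite reward can be sketched as a weighted combination of the three signal families. Everything below is a hypothetical stand-in: the component scorers and weights are illustrative, not the paper's actual metrics (such as BLEU, RadCliQ, or CheXprompt) or their weighting.

```python
def rule_based_score(ref, gen):
    """Stand-in lexical-overlap score (unigram set F1) for a rule-based metric."""
    r, g = set(ref.split()), set(gen.split())
    if not r or not g:
        return 0.0
    overlap = len(r & g)
    p, rec = overlap / len(g), overlap / len(r)
    return 0.0 if p + rec == 0 else 2 * p * rec / (p + rec)

def composite_reward(ref, gen, semantic_score, clinical_errors,
                     weights=(0.3, 0.4, 0.3), max_errors=5):
    """Weighted sum of rule-based, model-based, and error-based terms.

    semantic_score in [0, 1] stands in for a model-based semantic metric;
    clinical_errors is an LLM-judged error count mapped into [0, 1].
    """
    w_rule, w_sem, w_err = weights
    error_term = max(0.0, 1.0 - clinical_errors / max_errors)
    return (w_rule * rule_based_score(ref, gen)
            + w_sem * semantic_score
            + w_err * error_term)

reward = composite_reward("no acute cardiopulmonary process",
                          "no acute cardiopulmonary abnormality",
                          semantic_score=0.9, clinical_errors=0)
print(round(reward, 3))
```

In a GRPO-style setup, a scalar like this would be computed for each sampled report in a group and the policy updated toward the higher-reward samples; the key design point is that the error term penalizes clinically significant mistakes directly rather than through surface text similarity.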
Figure 1. Overview of UniRG-CXR. (a) Training data: UniRG-CXR is trained on the training splits of MIMIC-CXR, CheXpert Plus, and ReXGradient-160k, covering diverse institutions and patient demographics. (b) Training and rewards: Taking input from the current image, clinical context (e.g., indication), and optionally prior studies, UniRG-CXR uses GRPO reinforcement learning to optimize composite rewards that combine rule-based, model-based, and LLM-based metrics. (c) Evaluation: We assess UniRG-CXR on held-out test sets (MIMIC-CXR, CheXpert Plus, ReXGradient) and unseen datasets (IU Xray and proprietary data). Report quality is measured using ReXrank metrics and an LLM-based clinical-error metric, while diagnostic ability is evaluated via F1-based disease classification from generated reports. (d) ReXrank results: UniRG-CXR achieves SOTA performance across four datasets and two generation settings (findings only and findings + impression), showing substantial gains over prior state-of-the-art systems.

Universal improvements across metrics and clinical errors

Rather than excelling on one metric at the expense of others, UniRG-CXR delivers balanced improvements across many different measures of report quality. More importantly, it produces reports with substantially fewer clinically significant errors. This indicates that the model is not just learning how to sound like a radiology report, but is better capturing the underlying clinical facts. Explicitly optimizing for clinical correctness helps the model avoid common failure modes where fluent language masks incorrect or missing findings (Figure 2).
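Figure 1's evaluation protocol scores diagnostic ability via F1-based disease classification from generated reports. A minimal sketch of that style of evaluation is below; the keyword labeler is a hypothetical stand-in for the learned labelers (e.g., CheXbert-style models) that real pipelines use, and it ignores negation.

```python
# Hypothetical keyword lexicon; real systems use learned labelers.
CONDITIONS = {
    "pneumonia": ("pneumonia",),
    "pleural_effusion": ("pleural effusion", "effusion"),
    "cardiomegaly": ("cardiomegaly", "enlarged heart"),
}

def label_report(text):
    """Extract binary condition labels from report text."""
    text = text.lower()
    return {c: any(k in text for k in kws) for c, kws in CONDITIONS.items()}

def f1(pairs):
    """F1 over (predicted, reference) boolean pairs for one condition."""
    tp = sum(p and r for p, r in pairs)
    fp = sum(p and not r for p, r in pairs)
    fn = sum(r and not p for p, r in pairs)
    return 0.0 if 2 * tp + fp + fn == 0 else 2 * tp / (2 * tp + fp + fn)

generated = ["Small left pleural effusion. Heart size normal.",
             "Cardiomegaly with no focal consolidation."]
reference = ["Left pleural effusion.", "Cardiomegaly. Lungs clear."]

pred = [label_report(t) for t in generated]
ref = [label_report(t) for t in reference]
for c in CONDITIONS:
    score = f1([(p[c], r[c]) for p, r in zip(pred, ref)])
    print(f"{c}: F1 = {score:.2f}")
```

Scoring at the condition level, rather than on text overlap alone, is what surfaces the failure mode where a fluent report omits or invents findings.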
Figure 2. UniRG-CXR achieves state-of-the-art performance, delivering consistent and comprehensive gains across metrics. (a) On the ReXrank leaderboard, UniRG-CXR (green) shows robust, universal improvement across all evaluation metrics. (b) Starting from the same SFT checkpoint, RL with our combined reward achieves more balanced gains across metrics and the highest RadCliQ-v1 score compared to RL on single metrics; this ablation study is trained and tested on MIMIC. (c) An ablation study on the training dynamics shows that RL full (UniRG-CXR) achieves a significantly better RadCliQ-v1 score than RL only on BLEU. (d) During training, RL full (UniRG-CXR) shows a steady decrease in clinical errors per report, compared with a fluctuating trajectory without consistent improvement from an ablation run without error awareness (i.e., removing CheXprompt metric optimization). Both (c) and (d) show results on the 1024-study MIMIC validation set from ablations trained on MIMIC. (e) Case studies illustrate that UniRG-CXR can produce error-free reports, unlike MedVersa and MedGemma. (f) UniRG-CXR yields a substantially higher proportion of reports with ≤1 error and fewer with ≥4 errors than prior models.

Strong performance in longitudinal report generation

In clinical practice, radiologists often compare current images with prior exams to determine whether a condition is improving, worsening, or unchanged. UniRG-CXR is able to incorporate this historical information effectively, generating reports that reflect meaningful changes over time. This allows the model to describe new findings, progression, or resolution of disease more accurately, moving closer to how radiologists reason across patient histories rather than treating each exam in isolation (Figure 3).
Figure 3. UniRG-CXR enhances longitudinal report generation. (a) Comparing UniRG-CXR and its non-longitudinal ablation with prior models on longitudinal report generation, UniRG-CXR exhibits the best performance, and the longitudinal information is beneficial to performance. (b) UniRG-CXR achieves the best performance across longitudinal encounter points ranging from the first encounter to the more complex 5th+ encounters, showing that its improvements hold across the board. In comparison, prior models such as GPT-5, GPT-4o, and MedGemma barely surpass the copy-prior-report baseline (grey lines). (c) Compared with prior models, which barely improve over the copy-prior baseline (dashed line), UniRG-CXR significantly and consistently improves performance across temporal disease change categories, including new development, no change, progression, and regression (categorized by GPT-5 on the ground-truth report). Qualitative examples are shown for each category where UniRG-CXR correctly predicts the temporal change based on the input. All results in this figure are on the MIMIC test set with prior information where available.

Robust generalization across institutions and populations

UniRG-CXR maintains strong performance even when applied to data from institutions it has never seen before. This suggests that the model is learning general clinical patterns rather than memorizing institution-specific reporting styles. In addition, its performance remains stable across different patient subgroups, including age, gender, and race. This robustness is critical for real-world deployment, where models must perform reliably across diverse populations and healthcare environments (Figure 4).
Figure 4. Generalization and robustness of UniRG-CXR. (a) We evaluate UniRG-CXR in a zero-shot setting on two datasets from previously unseen institutions: IU-Xray and PD (proprietary data). UniRG-CXR consistently outperforms prior models, maintaining substantial performance gains in this challenging setup. (b) and (c) present condition-level F1 scores on MIMIC-CXR and PD and highlight that UniRG-CXR remains the overall top-performing model in condition-level diagnostic accuracy. (d) UniRG-CXR demonstrates stable and robust performance across gender, age, and race subgroups, all of which exceed the performance of the second-best model (dashed lines).

UniRG is a promising step toward scaling medical imaging report generation

UniRG introduces a reinforcement learning–based framework that rethinks how medical imaging report generation models are trained and evaluated. By directly optimizing clinically grounded reward signals, UniRG-CXR achieves state-of-the-art performance across datasets, metrics, diagnostic tasks, longitudinal settings, and demographic subgroups, addressing longstanding limitations of supervised-only approaches.
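The "combined reward" mentioned above can be pictured as a weighted blend of per-metric scores. The sketch below is a minimal illustration under assumed metric names and weights; it is not the actual UniRG-CXR reward configuration.

```python
# Sketch: blend several clinically grounded metric scores into one RL reward.
# Metric names ("bleu", "radcliq", "error_awareness") and the weights are
# illustrative assumptions, not the published UniRG-CXR setup.

def combined_reward(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-metric scores, each normalized to [0, 1].

    Weighting several metrics together discourages over-optimizing any
    single metric (e.g. BLEU) at the expense of clinical correctness.
    """
    total_weight = sum(weights.values())
    return sum(weights[name] * scores.get(name, 0.0) for name in weights) / total_weight

# A report that scores well clinically but only moderately on BLEU
# still earns a solid reward under this blend.
reward = combined_reward(
    scores={"bleu": 0.4, "radcliq": 0.8, "error_awareness": 0.9},
    weights={"bleu": 1.0, "radcliq": 2.0, "error_awareness": 2.0},
)
```

Giving higher weight to the error-awareness term mirrors the ablation in Figure 2(d), where removing error awareness led to fluctuating clinical-error counts during training.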
Looking ahead, this framework can be extended to additional imaging modalities and clinical tasks, and combined with richer multimodal patient data such as prior imaging, laboratory results, and clinical notes. More broadly, UniRG highlights the promise of reinforcement learning as a core component of next-generation medical foundation models that are robust, generalizable, and clinically aligned.
UniRG reflects Microsoft’s larger commitment to advancing multimodal generative AI for precision health (opens in new tab), with other exciting progress such as GigaPath, BiomedCLIP, LLaVA-Rad (opens in new tab), BiomedJourney, BiomedParse, TrialScope, Curiosity.
Paper co-authors: Qianchu Liu, Sheng Zhang, Guanghui Qin, Yu Gu, Ying Jin, Sam Preston, Yanbo Xu, Sid Kiblawi, Wen-wai Yim, Tim Ossowski, Tristan Naumann, Mu Wei, Hoifung Poon
The post UniRG: Scaling medical imaging report generation with multimodal reinforcement learning appeared first on Microsoft Research.
Multimodal reinforcement learning with agentic verifier for AI agents
- Today’s multimodal AI systems can give answers that sound right but may not be grounded in what they actually observe over time, leading to unpredictable errors and safety risks in real-world settings.
- Argos is a verification framework for multimodal reinforcement learning that trains models by rewarding not just correct answers, but correct answers grounded in visual and temporal evidence, using automated verification rather than human labeling. It selects the appropriate specialized tools for each answer based on what needs to be verified.
- Models trained with Argos show stronger spatial reasoning, far fewer visual hallucinations, more stable learning dynamics, and better performance on robotics and real-world tasks while requiring fewer training samples.
Over the past few years, AI systems have become much better at interpreting images, generating language, and performing tasks within physical and virtual environments. Yet they still fail in ways that are hard to predict and even harder to fix. A robot might try to grasp a tool when the object is visibly blocked, or a visual assistant integrated into smart glasses might describe objects that aren’t actually present.
These errors often arise because today’s multimodal agents are trained to generate outputs that are plausible rather than grounded in the actual information they receive from their environment. As a result, a model’s output can seem correct while relying on incorrect information. As AI systems are increasingly used to navigate 3D spaces and make decisions in real-world settings, this gap can be a safety and reliability concern.
To tackle this challenge, we posed the question: How can we train AI agents to generate correct answers and take appropriate actions for the right reasons so that their behavior is reliable even as the environment or tasks change?
Argos represents a novel answer to this challenge. It’s an agentic verification framework designed to improve the reliability of reinforcement learning in multimodal models. Reinforcement learning is a training method where AI models learn by receiving rewards for desired behaviors and penalties for undesired ones, gradually improving their performance through trial and error.
Rather than rewarding only correct behaviors, Argos evaluates how those behaviors were produced. It draws on a pool of larger, more capable teacher models and rule-based checks to verify two things: first, that the objects and events a model references actually exist in its input, and second, that the model’s reasoning aligns with what it observes. Argos rewards the model when both conditions are met. In practice, these rewards help curate high-quality training data and guide the model’s further training.
How Argos works

Argos functions as a verification layer on top of an existing multimodal model. Given an image or video, a task or query, and information about the model’s reasoning and output, Argos identifies where the model indicates objects are located in the image, when it indicates events occur in a video, and what action or answer it produces.
Argos then applies specialized tools tailored to the specific content to evaluate and score three aspects of the model’s output. It checks whether the answer is correct, whether referenced objects and events appear at the indicated locations and times, and whether the reasoning is consistent with the visual evidence and the answer (Figure 1).
These scores are combined using a gated aggregation function, a method that dynamically adjusts the importance of different scores. It emphasizes reasoning checks only when the final output is correct. This design prevents unreliable feedback from dominating training and produces a stable reward signal for reinforcement learning.
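The gated aggregation described above can be sketched in a few lines. The weights and score names below are illustrative assumptions; the point is the gate itself, which lets reasoning and grounding scores contribute only when the final answer is correct.

```python
# Sketch of gated reward aggregation: reasoning/grounding checks count
# only when the final answer is correct. Weights are assumed, not Argos's
# actual values.

def gated_reward(answer_correct: bool, grounding: float, reasoning: float,
                 w_ground: float = 0.3, w_reason: float = 0.2) -> float:
    """All scores lie in [0, 1]."""
    if not answer_correct:
        return 0.0  # gate closed: a wrong answer earns nothing, however well argued
    w_answer = 1.0 - w_ground - w_reason
    return w_answer * 1.0 + w_ground * grounding + w_reason * reasoning
```

The gate prevents a wrong-but-plausible trajectory from collecting partial reward, which is one way unreliable feedback can be kept from dominating training.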
Figure 1. Argos selects different specialized tools to verify and score the accuracy of referenced points and events in the agent’s reasoning.

Using Argos to curate data for supervised fine-tuning

Argos also helps curate high-quality training data to provide the model with a strong foundation in grounded reasoning. Before the reinforcement learning stage begins, Argos uses a multi-stage process to generate data that is explicitly tied to visual locations and time intervals.
In the first stage, Argos identifies the objects, actions, and events that are relevant to a task and links them to specific locations in images or specific moments in videos. These references are overlaid on images and selected video frames. Next, a reasoning model generates step-by-step explanations that refer to these visual locations and time spans.
Finally, Argos evaluates each generated example for accuracy and visual grounding, filtering out low-quality training data and retaining only data that is both correct and well-grounded in visual input. The resulting dataset is then used in an initial training phase, where the model learns to generate reasoning steps before producing its final output. This process is illustrated in Figure 2.
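The final filtering step described above amounts to keeping only examples that pass both checks. The field names and threshold below are assumptions for illustration.

```python
# Sketch of the data-curation filter: retain only generated examples that
# are both correct and well grounded in the visual input. Field names and
# the grounding threshold are illustrative assumptions.

def filter_examples(examples: list[dict], min_grounding: float = 0.8) -> list[dict]:
    """Keep examples whose answer is verified correct and whose referenced
    objects/events were confirmed against the image or video."""
    return [ex for ex in examples
            if ex["answer_correct"] and ex["grounding_score"] >= min_grounding]

raw = [
    {"id": 1, "answer_correct": True,  "grounding_score": 0.95},  # kept
    {"id": 2, "answer_correct": True,  "grounding_score": 0.40},  # weakly grounded
    {"id": 3, "answer_correct": False, "grounding_score": 0.99},  # wrong answer
]
curated = filter_examples(raw)
```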
Figure 2. Argos generates step-by-step reasoning grounded in image locations and video timestamps, then filters out low-quality training data.

Evaluation

Building on this foundation in grounded reasoning, we further trained the model using reinforcement learning guided by Argos and evaluated its performance across a range of benchmarks. On spatial reasoning tasks, the Argos-trained model outperformed both the base model Qwen2.5-VL-7B and the stronger Video-R1 baseline across challenging 3D scenarios and multi-view tasks. Models trained with Argos also showed a substantial reduction of hallucinations compared with both standard chain-of-thought prompting and reinforcement learning baselines.
Finally, we evaluated the model in robotics and other real-world task settings, focusing on high-level planning and fine-grained control. Models trained with Argos performed better on complex, multi-step tasks. Notably, these improvements were achieved using fewer training samples than existing approaches, highlighting the importance of reward design in producing more capable and data-efficient agents. Figure 3 illustrates some of these findings.
Figure 3. Performance of Argos compared with baseline models on visual hallucination detection (left) and embodied task planning and completion (right).

How Argos shapes reinforcement learning

To understand how Argos affects learning, we took the same vision-language model that had been trained on our curated dataset and fine-tuned it using reinforcement learning in two different ways. In one approach, Argos acted as an agentic verifier, checking the correctness of outputs and the quality of reasoning. In the other, the model received feedback only on whether its answers were correct.
We evaluated both versions on 1,500 samples from a new dataset and tracked their performance throughout the learning process (Figure 4). Although they started at similar levels, the model without Argos quickly got worse. Its accuracy steadily declined, and it increasingly gave answers that ignored what was in the videos. It learned to game the system by producing answers that seemed correct without grounding them in visual evidence.
The model trained with Argos showed the opposite pattern. Accuracy improved steadily, and the model became better at linking its reasoning to what appeared in the videos. This difference highlights the value of verification: when training rewards both correct outputs and sound reasoning based on visual and temporal evidence, models learn to be more reliable rather than simply finding shortcuts to high scores.
Figure 4. Comparison of response accuracy with and without Argos across two model versions (left) and differences in visual grounding accuracy over training for both versions (right).

Potential impact and looking forward

This research points toward a different way of building AI agents for real-world applications. Rather than fixing errors after they occur, it focuses on training agents to systematically anchor their reasoning in what they actually receive as input throughout the training process.
The potential applications span many domains. A visual assistant for a self-driving car that verifies what’s actually in an image is less likely to report phantom obstacles. A system that automates digital tasks and checks each action against what’s displayed on the screen is less likely to click the wrong button.
As AI systems move beyond research labs into homes, factories, and offices, reliable reasoning becomes essential for safety and trust. Argos represents an early example of verification systems that evolve alongside the AI models they supervise. Future verifiers could be tailored for specific fields like medical imaging, industrial simulations, and business analytics. As more advanced models and richer data sources become available, researchers can use them to improve these verification systems, providing even better guidance during training and further reducing hallucinations.
We hope that this research helps move the field toward AI systems that are both capable and interpretable: agents that can explain their decisions, point to the evidence behind them, and be trained to adhere to real-world requirements and values.
The post Multimodal reinforcement learning with agentic verifier for AI agents appeared first on Microsoft Research.
OptiMind: A small language model with optimization expertise
- Many real-world business problems can benefit from optimization, but translating decisions, constraints, and goals from natural language into optimization algorithms is slow.
- OptiMind is a small language model designed to convert business problems described in natural language into the mathematical formulations needed by optimization software.
- OptiMind is trained on a carefully curated, expert-aligned dataset and applies domain-specific hints and self-checks at inference time, improving its accuracy.
- OptiMind matches or exceeds the performance of much larger systems, can run locally to protect sensitive data, produces more reliable formulations, and reduces the time and expertise needed to prepare optimization models.
Enterprises across industries, from energy to finance, use optimization models to plan complex operations like supply chains and logistics. These models work by defining three elements: the choices that can be made (such as production quantities or delivery routes), the rules and limits those choices must follow, and the goal, whether that’s minimizing costs, meeting customer demand, or improving efficiency.
Over the past few decades, many businesses have shifted from judgment-based decision-making to data-driven approaches, leading to major efficiency gains and cost savings. Advances in AI promise to accelerate this shift even further, potentially cutting decision times from days to minutes while delivering better results.
In practice, however, turning real-world business problems into a form that optimization software can understand is challenging. This translation process requires expressing decisions, constraints, and objectives in mathematical terms. The work demands specialized expertise, and it can take anywhere from one day to several weeks to solve complex problems.
To address this challenge, we’re introducing OptiMind, a small language model designed to convert problems described in plain language into the mathematical formulations that optimization software needs. Built on a 20-billion-parameter model, OptiMind is compact by today’s standards yet matches the performance of larger, more complex systems. Its modest size means it can run locally, enabling fast iteration while keeping sensitive business data on users’ devices rather than transmitting it to external servers.
How it works

OptiMind incorporates knowledge from optimization experts both during training and at inference time to improve formulation accuracy at scale. Three stages enable this: domain-specific hints improve training data quality, the model undergoes fine-tuning, and expert reasoning guides the model as it works.
Figure 1. From problem description to solution

One of the central challenges in developing OptiMind was the poor quality of existing public datasets for optimization problems. Many examples were incomplete or contained incorrect solutions. To address this, we developed a systematic approach that combines automation with expert review. It organizes problems into well-known categories, such as scheduling or routing, and identifies common error patterns within each. Using these insights, we generated expert-verified “hints” to guide the process, enabling the system to regenerate higher-quality solutions and filter out unsolvable examples (Figure 2). The result is a training dataset that more accurately reflects how optimization experts structure problems.
Figure 2. Process for correcting training data

Using this refined dataset, we applied supervised fine-tuning to the base model. Rather than simply generating code, we trained OptiMind to produce structured mathematical formulations alongside intermediate reasoning steps, helping it avoid the common mistakes found in earlier datasets.
At inference time, the model’s reliability improves further. When given a new problem, OptiMind first classifies it into a category, such as scheduling or network design. It then applies expert hints relevant to that type of problem, which act as reminders to check for errors before generating a solution. For particularly challenging problems, the system generates multiple solutions and either selects the most frequently occurring one or uses feedback to refine its response. This approach increases accuracy without requiring a larger model, as illustrated in Figure 3.
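The classify-hint-sample-vote loop just described can be sketched concretely. The `classify` and `generate` calls stand in for model invocations, and the hint texts are hypothetical; only the selection logic (majority vote over sampled candidates) is shown in full.

```python
# Sketch of inference-time hinting plus majority voting. HINTS, classify,
# and generate are hypothetical stand-ins for the real model components.
from collections import Counter

HINTS = {
    "scheduling": "Check for overlapping time-slot constraints.",
    "routing": "Verify flow-conservation constraints at each node.",
}

def solve(problem: str, classify, generate, n_samples: int = 5) -> str:
    category = classify(problem)                      # e.g. "scheduling"
    hint = HINTS.get(category, "")
    prompt = f"{problem}\n\nExpert hint: {hint}"
    # Sample several candidate formulations from the same hinted prompt.
    candidates = [generate(prompt) for _ in range(n_samples)]
    # Self-consistency: keep the most frequently produced formulation.
    return Counter(candidates).most_common(1)[0][0]
```

Majority voting raises accuracy without any change to the underlying model, which matches the post's claim that these strategies avoid the need for a larger model.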
Figure 3. OptiMind’s inference process

Evaluation

To test the system, we turned to three widely used public benchmarks that represent some of the most complex formulation tasks in the field. On closer inspection, we discovered that 30 to 50 percent of the original test data was flawed. After manually correcting the issues, OptiMind improved accuracy by approximately 10 percent over the base model. Figure 4 and Table 1 show detailed comparisons: OptiMind outperformed other open-source models under 32 billion parameters and, when combined with expert hints and correction strategies, matched or exceeded the performance of current leading models.
Figure 4. Average accuracy percentages over all models.

Table 1. Performance of all models on corrected benchmark datasets

OptiMind is more reliable than other models because it learns from higher-quality, domain-aligned data. And by correcting errors and inconsistencies in standard datasets, we significantly reduced the model’s tendency to hallucinate relative to the base and comparison models.
Looking forward

While supervised fine-tuning has provided a strong foundation, we are exploring reinforcement learning to further refine OptiMind’s reasoning capabilities. We’re also investigating automated frameworks that would allow LLMs to generate their own expert hints, enabling continuous autonomous improvement. Additionally, we are working with Microsoft product teams and industry collaborators to expand OptiMind’s utility, adding support for more programming languages and a variety of input formats, including Excel and other widely used tools.
We’re releasing OptiMind as an experimental model to gather community feedback and inform future development. The model is available through Microsoft Foundry (opens in new tab) and Hugging Face (opens in new tab), and we’ve open-sourced the benchmarks and data-processing procedures on GitHub (opens in new tab) to support more reliable evaluation across the field. We welcome feedback through GitHub (opens in new tab), and invite those interested in shaping the future of optimization to apply for one of our open roles.
The post OptiMind: A small language model with optimization expertise appeared first on Microsoft Research.
Agent Lightning: Adding reinforcement learning to AI agents without code rewrites
AI agents are reshaping software development, from writing code to carrying out complex instructions. Yet LLM-based agents are prone to errors and often perform poorly on complicated, multi-step tasks. Reinforcement learning (RL) is an approach where AI systems learn to make optimal decisions by receiving rewards or penalties for their actions, improving through trial and error. RL can help agents improve, but it typically requires developers to extensively rewrite their code. This discourages adoption, even though the data these agents generate could significantly boost performance through RL training.
To address this, a research team from Microsoft Research Asia – Shanghai has introduced Agent Lightning. This open-source (opens in new tab) framework makes AI agents trainable through RL by separating how agents execute tasks from model training, allowing developers to add RL capabilities with virtually no code modification.
Capturing agent behavior for training

Agent Lightning converts an agent’s experience into a format that RL can use by treating the agent’s execution as a sequence of states and actions, where each state captures the agent’s status and each LLM call is an action that moves the agent to a new state.
This approach works for any workflow, no matter how complex. Whether it involves multiple collaborating agents or dynamic tool use, Agent Lightning breaks it down into a sequence of transitions. Each transition captures the LLM’s input, output, and reward (Figure 1). This standardized format means the data can be used for training without any additional steps.
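The standardized transition format described above is easy to picture as a small record type. The exact field names in Agent Lightning's schema may differ; this is a sketch of the idea that every LLM step reduces to a (prompt, result, reward) triple.

```python
# Sketch of the standardized transition format: one record per LLM call.
# Field names are assumptions; the real Agent Lightning schema may differ.
from dataclasses import dataclass

@dataclass
class Transition:
    prompt: str      # full LLM input at this step (the state as the model saw it)
    result: str      # the LLM's output (the action)
    reward: float    # immediate reward assigned to this step

# Any workflow (multi-agent, tool use, etc.) flattens into a list of these:
trajectory = [
    Transition(prompt="User asks: total Q3 sales?", result="SELECT ...", reward=0.0),
    Transition(prompt="Query failed; rewrite it.", result="SELECT SUM(...) ...", reward=1.0),
]
```

Because each record is self-contained, the list can feed a trainer directly, which is what makes the format usable "without any additional steps."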
Figure 1. An illustration of Agent Lightning’s standardized format using a retrieval-augmented generation (RAG) agent. Left: the full agent workflow, where the agent’s state updates after each component step; green blocks show assigned variables, and gray blocks indicate variables without content. Right: the collected transitions follow the standardized format for RL training, with each transition corresponding to one LLM step and containing its prompt, result, and immediate reward.

Hierarchical reinforcement learning

Traditional RL training for agents that make multiple LLM requests involves stitching all the content together into one long sequence and then identifying which parts should be learned and which ignored during training. This approach is difficult to implement and can create excessively long sequences that degrade model performance.
Instead, Agent Lightning’s LightningRL algorithm takes a hierarchical approach. After a task completes, a credit assignment module determines how much each LLM request contributed to the outcome and assigns it a corresponding reward. These independent steps, now paired with their own reward scores, can be used with any existing single-step RL algorithm, such as Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO) (Figure 2).
Figure 2. (a) Single-step GRPO: the LLM completes the task in one call; multiple responses for the same task are compared to determine how strongly each should be reinforced. (b) Previous multi-step GRPO: the task involves multiple LLM calls; multiple multi-step runs of the same task are compared, with non-LLM-generated tokens (grey boxes) ignored during training. (c) LightningRL: the multi-step run is divided into individual LLM calls; calls from the same task are compared to determine how strongly each should be reinforced. Each call includes its input, context, output, and reward, assigned by the credit assignment module.

This design offers several benefits. It remains fully compatible with widely used single-step RL algorithms, allowing existing training methods to be applied without modification. Organizing data as a sequence of independent transitions lets developers flexibly construct the LLM input as needed, supporting complex behaviors like agents that use multiple tools or work with other agents. Additionally, by keeping sequences short, the approach scales cleanly and keeps training efficient.
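The credit-assignment step can be sketched with the simplest possible policy: spreading the task's final reward evenly over the LLM calls. This uniform split is an assumption for illustration; LightningRL's actual module may weight calls differently.

```python
# Sketch of credit assignment: split an episode's final reward across its
# LLM calls so each becomes an independent single-step training example.
# The uniform split is an illustrative assumption, not LightningRL's
# actual policy.

def assign_credit(num_calls: int, final_reward: float) -> list[float]:
    """Give each LLM call an equal share of the episode's outcome reward."""
    return [final_reward / num_calls] * num_calls

# A 4-call trajectory that succeeded (reward 1.0):
per_call_rewards = assign_credit(num_calls=4, final_reward=1.0)
# Each (prompt, result, reward) triple can now feed PPO or GRPO directly.
```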
Agent Lightning as middleware

Agent Lightning serves as middleware between RL algorithms and agent environments, providing modular components that enable scalable RL through standardized protocols and well-defined interfaces.
An agent runner manages the agents as they complete tasks. It distributes work and collects and stores the results and progress data. It operates separately from the LLMs, enabling them to run on different resources and scale to support multiple agents running concurrently.
An algorithm trains the models and hosts the LLMs used for inference and training. It orchestrates the overall RL cycle, managing which tasks are assigned, how agents complete them, and how models are updated based on what the agents learn. It typically runs on GPU resources and communicates with the agent runner through shared protocols.
The LightningStore (opens in new tab) serves as the central repository for all data exchanges within the system. It provides standardized interfaces and a shared format, ensuring that the different components can work together and enabling the algorithm and agent runner to communicate effectively.
Figure 3. The Agent Lightning framework

All RL cycles follow two steps: (1) Agent Lightning collects agent execution data (called “spans”) and stores it in the data store; (2) it then retrieves the required data and sends it to the algorithm for training. Through this design, the algorithm can delegate tasks asynchronously to the agent runner, which completes them and reports the results back (Figure 4).
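The two-step cycle just described can be sketched with a toy store object. The class and method names here are hypothetical stand-ins, not the actual LightningStore API; the point is the collect-then-drain pattern that decouples the runner from the trainer.

```python
# Sketch of the two-step RL cycle: the runner writes spans into a store,
# and the algorithm later drains them for a training update. Names are
# hypothetical, not the real Agent Lightning / LightningStore API.

class SpanStoreSketch:
    def __init__(self) -> None:
        self._spans: list[dict] = []

    def add_spans(self, spans: list[dict]) -> None:
        # Step 1: the agent runner reports execution data asynchronously.
        self._spans.extend(spans)

    def drain(self) -> list[dict]:
        # Step 2: the algorithm retrieves everything collected so far.
        spans, self._spans = self._spans, []
        return spans

store = SpanStoreSketch()
store.add_spans([{"prompt": "q1", "result": "a1", "reward": 1.0}])
batch = store.drain()   # handed to the RL algorithm for a model update
```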
Figure 4. Agent Lightning’s RL cycle

One key advantage of this approach is its algorithmic flexibility. The system makes it easy for developers to customize how agents learn, whether they’re defining different rewards, capturing intermediate data, or experimenting with different training approaches.
Another advantage is resource efficiency. Agentic RL systems are complex, integrating agentic systems, LLM inference engines, and training frameworks. By separating these components, Agent Lightning makes this complexity manageable and allows each part to be optimized independently.
A decoupled design allows each component to use the hardware that suits it best. The agent runner can use CPUs while model training uses GPUs. Each component can also scale independently, improving efficiency and making the system easier to maintain. In practice, developers can keep their existing agent frameworks and switch model calls to the Agent Lightning API without changing their agent code (Figure 5).
Figure 5. On the left, the developer implements the agent code. On the bottom right is the code required for Agent Lightning. The main body of the agent code is unchanged.

Evaluation across three real-world scenarios

Agent Lightning was tested on three distinct tasks, achieving consistent performance improvements across all scenarios (Figure 6):
Text-to-SQL (LangChain): In a system with three agents handling SQL generation, checking, and rewriting, Agent Lightning simultaneously optimized two of them, significantly improving the accuracy of generating executable SQL from natural language queries.
Retrieval-augmented generation (OpenAI Agents SDK implementation): On the multi-hop question-answering dataset MuSiQue, which requires querying a large Wikipedia database, Agent Lightning helped the agent generate more effective search queries and reason better from retrieved content.
Mathematical QA and tool use (AutoGen implementation): For complex math problems, Agent Lightning trained LLMs to more accurately determine when and how to call the tool and integrate the results into its reasoning, increasing accuracy.
Figure 6. Reward curves across the three evaluation scenarios Enabling continuous agent improvementBy simplifying RL integration, Agent Lightning can make it easier for developers to build, iterate, and deploy high-performance agents. We plan to expand Agent Lightning’s capabilities to include automatic prompt optimization and additional RL algorithms.
The framework is designed to serve as an open platform where any AI agent can improve through real-world practice. By bridging existing agentic systems with reinforcement learning, Agent Lightning aims to help create AI systems that learn from experience and improve over time.
The post Agent Lightning: Adding reinforcement learning to AI agents without code rewrites appeared first on Microsoft Research.
Promptions helps make AI prompting more precise with dynamic UI controls
Anyone who uses AI systems knows the frustration: a prompt is given, the response misses the mark, and the cycle repeats. This trial-and-error loop can feel unpredictable and discouraging. To address this, we are excited to introduce Promptions (prompt + options), a UI framework that helps developers build AI interfaces with more precise user control.
Its simple design makes it easy to integrate into any setting that relies on added context, including customer support, education, and medicine. Promptions is available under the MIT license on Microsoft Foundry Labs (opens in new tab) and GitHub.
Background

Promptions builds on our research, “Dynamic Prompt Middleware: Contextual Prompt Refinement Controls for Comprehension Tasks.” This project examined how knowledge workers use generative AI when their goal is to understand rather than create. While much public discussion centers on AI producing text or images, understanding involves asking AI to explain, clarify, or teach—a task that can quickly become complex. Consider a spreadsheet formula: one user may want a simple syntax breakdown, another a debugging guide, and another an explanation suitable for teaching colleagues. The same formula can require entirely different explanations depending on the user’s role, expertise, and goals.
A great deal of complexity sits beneath these seemingly simple requests. Users often find that the way they phrase a question doesn’t match the level of detail the AI needs. Clarifying what they really want can require long, carefully worded prompts that are tiring to produce. And because the connection between natural language and system behavior isn’t always transparent, it can be difficult to predict how the AI will interpret a given request. In the end, users spend more time managing the interaction itself than understanding the material they hoped to learn.
Identifying how users want to guide AI outputsTo explore why these challenges persist and how people can better steer AI toward customized results, we conducted two studies with knowledge workers across technical and nontechnical roles. Their experiences highlighted important gaps that guided Promptions’ design.
Our first study involved 38 professionals across engineering, research, marketing, and program management. Participants reviewed design mock-ups that provided static prompt-refinement options, such as length, tone, or “start with,” for shaping AI responses.
Although these static options were helpful, they couldn’t adapt to the specific formula, code snippets, or text the participant was trying to understand. Participants also wanted direct ways to customize the tone, detail, or format of the response without having to type instructions.
Why dynamic refinement matters

The second study tested prototypes in a controlled experiment. We compared the static design from the first study, called the “Static Prompt Refinement Control” (Static PRC), against a “Dynamic Prompt Refinement Control” (Dynamic PRC) with features that responded to participants’ feedback. Sixteen technical professionals familiar with generative AI completed six tasks, spanning code explanation, understanding a complex topic, and learning a new skill. Each participant tested both systems, with task assignments balanced to ensure fair comparison.
Comparing Dynamic PRC to Static PRC revealed key insights into how dynamic prompt-refinement options change users’ sense of control and exploration and how those options help them reflect on their understanding.
Static prompt refinement

Static PRC offered a set of pre-selected controls (Figure 1) identified in the initial study. We expected these options to be useful across many types of explanation-seeking prompts.
Figure 1. The static PRC interface

Dynamic prompt refinement

We built the Dynamic PRC system to automatically produce prompt options and refinements based on the user’s input, presenting them in real time so that users could adjust these controls and guide the AI’s responses more precisely (Figure 2).
Figure 2. Interaction flow in the Dynamic PRC system. (1) The user asks the system to explain a long Excel formula. (2) Dynamic PRC generates refinement options: Explanation Detail Level, Focus Areas, and Learning Objectives. (3) The user modifies these options. (4) The AI returns an explanation based on the selected options. (5) In the session chat panel, the user adds a request to control the structure or format of the response. (6) Dynamic PRC generates new option sets based on this input. (7) The AI produces an updated explanation reflecting the newly applied options.
Findings

Participants consistently reported that dynamic controls made it easier to express the nuances of their tasks without repeatedly rephrasing their prompts. This reduced the effort of prompt engineering and allowed users to focus more on understanding content than on managing the mechanics of phrasing.
Figure 3. Comparison of user preferences for Static PRC versus Dynamic PRC across key evaluation criteria.

Contextual options prompted users to try refinements they might not have considered on their own. This behavior suggests that Dynamic PRC can broaden how users engage with AI explanations, helping them uncover new ways to approach tasks beyond their initial intent. Beyond exploration, the dynamic controls prompted participants to think more deliberately about their goals. Options like “Learning Objective” and “Response Format” helped them clarify what they needed, whether guidance on applying a concept or step-by-step troubleshooting help.
Figure 4. Participant ratings comparing the effectiveness of Static PRC and Dynamic PRC.

While participants valued Dynamic PRC’s adaptability, they also found it more difficult to interpret. Some struggled to anticipate how a selected option would influence the response, noting that the controls seemed opaque because their effect became clear only after the output appeared.
However, the overall positive response to Dynamic PRC showed us that Promptions could be broadly useful, leading us to share it with the developer community.
Technical design

Promptions works as a lightweight middleware layer that sits between the user and the underlying language model (Figure 5). It has two main components:
Option Module. This module reviews the user’s prompt and conversation history, then generates a set of refinement options. These are presented as interactive UI elements (radio buttons, checkboxes, text fields) that directly shape how the AI interprets the prompt.
Chat Module. This module produces the AI’s response based on the refined prompt. When a user changes an option, the response immediately updates, making the interaction feel more like an evolving conversation than a cycle of repeated prompts.
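The two modules described above can be sketched as a minimal loop. This is an illustrative sketch only: all names (`Option`, `option_module`, `chat_module`, `fake_llm`) are hypothetical stand-ins, not the actual Foundry Labs API, and `call_llm` represents whatever model client an application already uses.

```python
from dataclasses import dataclass

@dataclass
class Option:
    label: str           # e.g. "Explanation Detail Level"
    choices: list        # e.g. ["brief", "detailed"]
    selected: str = ""   # empty until the user picks a choice

def option_module(prompt, history, call_llm):
    """Ask the model to propose refinement options for this prompt."""
    spec = call_llm(
        f"Given the conversation {history} and the new prompt {prompt!r}, "
        "list refinement options as 'label: choice1|choice2' lines."
    )
    options = []
    for line in spec.splitlines():
        label, _, choices = line.partition(":")
        if choices:
            options.append(Option(label.strip(), [c.strip() for c in choices.split("|")]))
    return options

def chat_module(prompt, history, options, call_llm):
    """Regenerate the response with the currently selected refinements applied."""
    constraints = "; ".join(f"{o.label}={o.selected}" for o in options if o.selected)
    return call_llm(f"{prompt}\n[Apply refinements: {constraints}]")

# Tiny fake model so the sketch runs without a real LLM backend.
def fake_llm(text):
    if "list refinement options" in text:
        return "Detail Level: brief|detailed\nFormat: prose|steps"
    return f"RESPONSE<{text.splitlines()[-1]}>"

opts = option_module("Explain this Excel formula", [], fake_llm)
opts[0].selected = "detailed"          # user clicks a radio button
reply = chat_module("Explain this Excel formula", [], opts, fake_llm)
```

The key design point the sketch captures is statelessness: nothing persists between turns except the conversation history, which is what keeps integration into an existing chat interface simple.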
Figure 5. Promptions middleware workflow. (1) The Option Module reads the user’s prompt and conversation history and (2) generates prompt options. (3) These options are rendered inline by a dedicated component. (4) The Chat Module incorporates these refined options alongside the original prompt and history to produce a response. (5) When the user adjusts the controls, the refinements update and the Chat Module regenerates the response accordingly.

Adding Promptions to an application

Promptions easily integrates into any conversational chat interface. Developers only need to add a component to display the options and connect it to the AI system. There’s no need to store data between sessions, which keeps implementation simple. The Microsoft Foundry Labs (opens in new tab) repository includes two sample applications, a generic chatbot and an image generator, that demonstrate this design in practice.
Promptions is well-suited for interfaces where users need to provide context but don’t want to write it all out. Instead of typing lengthy explanations, they can adjust the controls that guide the AI’s response to match their preferences.
Questions for further exploration

Promptions raises important questions for future research. Key usability challenges include clarifying how dynamic options affect AI output and managing the complexity of multiple controls. Other questions involve balancing immediate adjustments with persistent settings and enabling users to share options collaboratively.
On the technical side, questions focus on generating more effective options, validating and customizing dynamic interfaces, gathering relevant context automatically, and supporting the ability to save and share option sets across sessions.
These questions, along with broader considerations of collaboration, ethics, security, and scalability, are guiding our ongoing work on Promptions and related systems.
By making Promptions open source, we hope to help developers create smarter, more responsive AI experiences.
Explore Promptions on Microsoft Foundry Labs (opens in new tab)
The post Promptions helps make AI prompting more precise with dynamic UI controls appeared first on Microsoft Research.
GigaTIME: Scaling tumor microenvironment modeling using virtual population generated by multimodal AI
The convergence of digital transformation and the GenAI revolution creates an unprecedented opportunity for accelerating progress in precision health. Precision immunotherapy is a poster child for this transformation. Emerging technologies such as multiplex immunofluorescence (mIF) can assess internal states of individual cells along with their spatial locations, which is critical for deciphering how tumors interact with the immune system. The resulting insights, often referred to as the “grammar” of the tumor microenvironment, can help predict whether a tumor will respond to immunotherapy. If it is unlikely to respond, these insights can also inform strategies to reprogram the tumor from “cold” to “hot,” increasing its susceptibility to treatment.
This is exciting, but progress is hindered by the high cost and limited scalability of current technology. For example, obtaining mIF data of a couple dozen protein channels for a tissue sample can cost thousands of dollars, and even the most advanced labs can barely scale it to a tiny fraction of their available tissue samples.
In our paper published in Cell on December 9, “Multimodal AI generates virtual population for tumor microenvironment modeling (opens in new tab),” we present GigaTIME (opens in new tab), a multimodal AI model for translating routinely available hematoxylin and eosin (H&E) pathology slides to virtual mIF images. Developed in collaboration with Providence and the University of Washington, GigaTIME was trained on a Providence dataset of 40 million cells with paired H&E and mIF images across 21 protein channels. We applied GigaTIME to 14,256 cancer patients from 51 hospitals and over a thousand clinics within the Providence system. This effort generated a virtual population of around 300,000 mIF images spanning 24 cancer types and 306 cancer subtypes. This virtual population uncovered 1,234 statistically significant associations linking mIF protein activations with key clinical attributes such as biomarkers, staging, and patient survival. Independent external validation on 10,200 Cancer Genome Atlas (TCGA) patients further corroborated our findings.
To our knowledge, this is the first population-scale study of tumor immune microenvironment (TIME) based on spatial proteomics. Such studies were previously infeasible due to the scarcity of mIF data. By translating readily available H&E pathology slides into high-resolution virtual mIF data, GigaTIME provides a novel research framework for exploring precision immuno-oncology through population-scale TIME analysis and discovery. We have made our GigaTIME model publicly available at Microsoft Foundry Labs (opens in new tab) and on Hugging Face (opens in new tab) to help accelerate clinical research in precision oncology.
“GigaTIME is about unlocking insights that were previously out of reach,” explained Carlo Bifulco, MD, chief medical officer of Providence Genomics and medical director of cancer genomics and precision oncology at the Providence Cancer Institute. “By analyzing the tumor microenvironment of thousands of patients, GigaTIME has the potential to accelerate discoveries that will shape the future of precision oncology and improve patient outcomes.”
GigaTIME generates a virtual population for tumor microenvironment modeling

Digital pathology transforms a microscopy slide of stained tumor tissue into a high-resolution digital image, revealing details of cell morphology such as the nucleus and cytoplasm. Such a slide costs only $5 to $10 per image and has become routinely available in cancer care. It is well known that H&E-based cell morphology contains information about cellular states. Last year, we released GigaPath, the first digital pathology foundation model to scale transformer architectures to gigapixel H&E slides. Afterward, researchers at Mount Sinai Hospital and Memorial Sloan Kettering Cancer Center showed in a global prospective trial that it can reliably predict a key biomarker from H&E slides for precision oncology triaging. However, such prior work is generally limited to average biomarker status across the entire tissue. GigaTIME thus represents a major step forward by learning to predict the spatially resolved, single-cell states essential for tumor microenvironment modeling. In turn, this enables us to generate a virtual population of mIF images for large-scale TIME analysis (Figure 1).
Figure 1. GigaTIME enables population-scale tumor immune microenvironment (TIME) analysis. A, GigaTIME inputs a hematoxylin and eosin (H&E) whole-slide image and outputs multiplex immunofluorescence (mIF) across 21 protein channels. By applying GigaTIME to 14,256 patients, we generated a virtual population with mIF information, leading to population-scale discovery on clinical biomarkers and patient stratification, with independent validation on TCGA. B, Circular plot visualizing a TIME spectrum encompassing the GigaTIME-translated virtual mIF activation scores across different protein channels at the population scale, where each channel is represented as an individual circular bar chart segment. The inner circle encodes OncoTree, which classifies 14,256 patients into 306 subtypes across 24 cancer types. The outer circle groups these activations by cancer type, allowing visual comparison across major categories. C, Scatter plot comparing the subtype-level GigaTIME-translated virtual mIF activations between the TCGA and Providence virtual populations. Each dot denotes the average activation score of a protein channel among all tumors of a cancer subtype.

GigaTIME learns a multimodal AI model to translate pathology slides into spatial proteomics images, bridging cell morphology and cell states

Figure 2. GigaTIME enables translation from hematoxylin and eosin (H&E) to multiplex immunofluorescence (mIF) images. A,B, Bar plots comparing GigaTIME and CycleGAN on translation performance in terms of Dice score (A) and Pearson correlation (B). C, Scatter plots comparing the activation density of the translated mIF and the ground-truth mIF across four channels.
D, Qualitative results for a sample H&E whole-slide image from our held-out test set, with zoomed-in visualizations of the measured mIF and GigaTIME-translated mIF for the DAPI, PD-L1, and CD68 channels.

GigaTIME learned a cross-modal AI translator from digital pathology to spatial multiplex proteomics by training on 40 million cells with paired H&E slides and mIF images from Providence. To our knowledge, this is the first large-scale study exploring multimodal AI for scaling virtual mIF generation. The high-quality paired data enabled much more accurate cross-modal translation compared to prior state-of-the-art methods (Figure 2).
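The two translation metrics reported in Figure 2 can be illustrated with a minimal NumPy sketch: Dice overlap on binarized activation masks and Pearson correlation on raw intensities. The arrays and threshold below are synthetic stand-ins, not the paper's actual evaluation pipeline.

```python
import numpy as np

def dice_score(pred, truth, threshold=0.5):
    """Dice overlap between binarized predicted and measured channels."""
    p, t = pred >= threshold, truth >= threshold
    inter = np.logical_and(p, t).sum()
    return 2.0 * inter / (p.sum() + t.sum())

def pearson(pred, truth):
    """Pearson correlation between predicted and measured intensities."""
    p, t = pred.ravel(), truth.ravel()
    p, t = p - p.mean(), t - t.mean()
    return float((p @ t) / np.sqrt((p @ p) * (t @ t)))

rng = np.random.default_rng(0)
truth = rng.random((64, 64))                      # stand-in for a measured mIF channel
pred = 0.8 * truth + 0.2 * rng.random((64, 64))   # a well-correlated "translation"

d = dice_score(pred, truth)
r = pearson(pred, truth)
```

Dice rewards spatial agreement of where a channel is active, while Pearson rewards agreement in how strongly it is active, which is why the paper reports both.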
Virtual population enables population-scale discovery of associations between cell states and key biomarkers

Figure 3. GigaTIME identifies novel TIME protein–biomarker associations at the pan-cancer, cancer-type, and cancer-subtype levels. A, GigaTIME generates a virtual population of 14,256 patients with virtual mIF by translating available H&E images to mIF images, enabling pan-cancer, cancer-type, and cancer-subtype levels of biomedical discovery. B-G, Correlation analysis between protein channels in virtual mIF and patient biomarkers reveals TIME protein–biomarker associations at the pan-cancer level (B), cancer-type level (C-E), and cancer-subtype level (F,G). Circle size denotes significance strength. Circle color denotes the direction of the correlation. Channel color denotes high, medium, and low confidence based on Pearson correlations evaluated on the test set. H, A case study showing the activation maps across different virtual mIF channels for an H&E slide in our virtual population, and virtual mIF of sample patches from this slide.

By applying GigaTIME to Providence real-world data, we generated a virtual population of 14,256 patients with virtual mIF and key clinical attributes. After correcting for multiple hypothesis testing, we identified 1,234 statistically significant associations between tumor immune cell states (CD138, CD20, CD4) and clinical biomarkers (tumor mutation burden, KRAS, KMT2D), from the pan-cancer to the cancer-subtype level (Figure 3). Many of these findings are supported by existing literature. For example, MSI-high and TMB-high status are associated with increased activation of TIME-related channels such as CD138. The virtual population also uncovered previously unknown associations, such as pan-cancer associations between immune activations and key tumor biomarkers, including the tumor suppressor KMT2D and the oncogene KRAS.
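The multiple-hypothesis correction behind an association screen of this kind can be sketched with Benjamini–Hochberg FDR control over a grid of channel-vs-biomarker p-values. The p-values below are synthetic, and the paper's actual statistical tests and threshold are not reproduced here.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of p-values significant at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    shape = p.shape
    p = p.ravel()
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Step-up rule: find the largest k with p_(k) <= (k/m) * alpha,
    # then reject the k smallest hypotheses.
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    mask = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        mask[order[: k + 1]] = True
    return mask.reshape(shape)

# Toy grid: 21 protein channels x 10 biomarkers, with a few planted signals.
rng = np.random.default_rng(1)
pvals = rng.uniform(size=(21, 10))
pvals[0, 0] = 1e-8   # e.g. a strong channel-biomarker association
pvals[3, 2] = 1e-6
sig = benjamini_hochberg(pvals, alpha=0.05)
```

FDR control is the standard choice here because testing 21 channels against many biomarkers across three levels of granularity would otherwise flood the results with false positives.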
Virtual population enables population-scale discovery of tumor immune signatures for patient stratification

Figure 4. GigaTIME enables effective patient stratification across pathological stages and survival groups. A-C, Correlation analysis between virtual mIF and pathological stages at the pan-cancer level (A), cancer-type level (B), and cancer-subtype level (C). Circle size denotes significance strength. Circle color denotes the direction of the correlation. Channel color denotes high, medium, and low confidence based on Pearson correlations evaluated on the test set. D-F, Survival analysis using virtual CD3, virtual CD8, and the virtual GigaTIME signature (all 21 GigaTIME protein channels) to stratify patients at the pan-cancer level (D) and the cancer-type level: lung (E), brain (F). G, Bar plot comparing pan-cancer patient stratification performance in terms of survival log-rank p-values between the virtual GigaTIME signature and individual virtual protein channels.

The virtual population also uncovered GigaTIME signatures for effective patient stratification across staging and survival profiles (Figure 4), from the pan-cancer to the cancer-subtype level. Prior studies have explored patient stratification based on individual immune proteins such as CD3 and CD8. We found that GigaTIME-simulated CD3 and CD8 are similarly effective. Moreover, the combined GigaTIME signature across all 21 protein channels attained even better patient stratification than individual channels.
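Stratification analyses like those in Figure 4 rest on the two-group log-rank test. The following is a self-contained sketch on synthetic survival data (not the paper's cohort or pipeline): patients are split by a hypothetical signature score, and the test asks whether the two groups' survival curves differ.

```python
import numpy as np

def logrank_test(time, event, group):
    """Two-group log-rank chi-square statistic.

    time: follow-up time; event: 1 if death observed, 0 if censored;
    group: 0/1 stratum label (e.g. low/high signature score).
    """
    time, event, group = map(np.asarray, (time, event, group))
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):       # each distinct event time
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        o_minus_e += d1 - d * n1 / n            # observed minus expected in group 1
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e**2 / var                   # ~ chi-square with 1 d.o.f.

# Synthetic cohort: "high-signature" patients (group 1) die sooner.
rng = np.random.default_rng(2)
n = 200
group = np.repeat([0, 1], n // 2)
time = np.where(group == 1,
                rng.exponential(5.0, n),        # shorter survival
                rng.exponential(15.0, n))
event = rng.random(n) < 0.8                     # roughly 20% censored

chi2 = logrank_test(time, event, group)
```

A large chi-square (small p-value) means the signature separates survival groups, which is the sense in which the combined 21-channel signature outperforms individual channels.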
Virtual population uncovers interesting spatial and combinatorial interactions

Figure 5. GigaTIME uncovers interesting spatial and combinatorial virtual mIF patterns. A,B,C, Bar plots comparing virtual mIF activation density with spatial metrics for identifying TIME protein–biomarker correlations. We investigated three spatial metrics based on entropy (A), signal-to-noise ratio (SNR) (B), and sharpness (C). D,E, Bar plots comparing single-channel and combinatorial-channel (using the OR logical operation) biomarker associations for two GigaTIME virtual protein pairs: CD138/CD68 (D) and PD-L1/Caspase 3 (E), demonstrating substantially improved associations for the combination. F, Case studies visualizing the virtual mIF activation maps of individual channels (CD138, CD68; PD-L1, Caspase 3) and their combinations.

The virtual population uncovered interesting non-linear interactions across the GigaTIME virtual protein channels, revealing associations with spatial features such as sharpness and entropy, as well as with key clinical biomarkers like APC and KMT2D (Figure 5). Such combinatorial studies were previously out of reach given the scarcity of mIF data.
Independent external validation on TCGA

Figure 6. Independent validation on a virtual population from TCGA. A, Grid charts showing significantly correlated pan-cancer GigaTIME protein–biomarker pairs in Providence (left), TCGA (middle), and both (right). B, Grid charts showing significantly correlated GigaTIME protein–biomarker pairs for lung cancer in Providence and TCGA. C, Grid chart showing significantly correlated GigaTIME protein–biomarker pairs for LUAD in Providence. Channel color denotes high, medium, and low confidence based on Pearson correlations evaluated on the test set. D, Case studies with visualizations of H&E slides and the corresponding virtual mIF activations for a GigaTIME protein channel paired with a biomarker (mutated/non-mutated), where the patient with the given mutation shows much higher activation scores for that channel.

We conducted an independent external validation by applying GigaTIME to 10,200 patients in The Cancer Genome Atlas (TCGA) dataset and studied associations between GigaTIME-simulated virtual mIF and clinical biomarkers available in TCGA. We observed significant concordance between the virtual populations from Providence and TCGA, with a Spearman correlation of 0.88 for virtual protein activations across cancer subtypes. The two populations also showed a significant overlap of associations between GigaTIME-simulated protein activations and clinical biomarkers (Fisher’s exact test p < 2 × 10−9). On the other hand, the Providence virtual population yielded 33% more significant associations than TCGA, highlighting the value of large and diverse real-world data for clinical discovery.
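A concordance check of this kind can be illustrated as a Spearman correlation between per-subtype activation scores from two cohorts, computed here as Pearson on ranks so only NumPy is needed (and assuming no tied scores). The values below are synthetic stand-ins, not the reported 0.88 analysis.

```python
import numpy as np

def spearman(x, y):
    """Spearman rho via Pearson on ranks (assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)   # rank of each element
    ry = np.argsort(np.argsort(y)).astype(float)
    rx, ry = rx - rx.mean(), ry - ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

rng = np.random.default_rng(3)
providence = rng.random(306)                           # per-subtype activation means
tcga = providence + 0.05 * rng.standard_normal(306)    # a concordant second cohort

rho = spearman(providence, tcga)
```

Spearman is the natural choice here: it asks whether the two populations rank cancer subtypes the same way by activation, without assuming the absolute scores are calibrated across cohorts.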
GigaTIME is a promising step toward the moonshot of a “virtual patient”

By learning to translate across modalities, GigaTIME is a promising step toward “learning the language of patients” in pursuit of the ultimate goal of a “virtual patient”: a high-fidelity digital twin that could one day accurately forecast disease progression and counterfactual treatment response. By converting routinely available cell morphology data into otherwise scarce, high-resolution cell-state signals, GigaTIME demonstrates the potential of harnessing multimodal AI to scale real-world evidence (RWE) generation.
Going forward, growth opportunities abound. GigaTIME can be extended to handle more spatial modalities and cell-state channels. It can be integrated into advanced multimodal frameworks such as LLaVA-Med to facilitate conversational image analysis by “talking to the data.” To facilitate research in tumor microenvironment modeling, we have made GigaTIME open-source (opens in new tab) on Foundry Labs (opens in new tab) and Hugging Face (opens in new tab).
GigaTIME is a joint work with Providence and the University of Washington’s Paul G. Allen School of Computer Science & Engineering. It reflects Microsoft’s larger commitment to advancing multimodal generative AI for precision health (opens in new tab), with other exciting progress such as GigaPath, BiomedCLIP, LLaVA-Rad (opens in new tab), BiomedJourney, BiomedParse, TrialScope, Curiosity.
Learn more at the Microsoft Signal blog

Paper co-authors: Jeya Maria Jose Valanarasu, Hanwen Xu, Naoto Usuyama, Chanwoo Kim, Cliff Wong, Peniel Argaw, Racheli Ben Shimol, Angela Crabtree, Kevin Matlock, Alexandra Q. Bartlett, Jaspreet Bagga, Yu Gu, Sheng Zhang, Tristan Naumann, Bernard A. Fox, Bill Wright, Ari Robicsek, Brian Piening, Carlo Bifulco, Sheng Wang, Hoifung Poon
The post GigaTIME: Scaling tumor microenvironment modeling using virtual population generated by multimodal AI appeared first on Microsoft Research.
Inilah Keunggulan Yang Ditawarkan Situs Sabung Ayam Online Resmi Di Indonesia
Situs judi IDN Slot online yang resmi dan terbaik adalah tempat untuk player yang ingin melakukan taruhan dengan cara online. di dalamnya kamu akan menemukan permainan sabung ayam yang sudah terkenal di Indonesia. game sabung ayam sendiri adalah permainan yang sangat disukai oleh para pecinta ayam aduan tidak hanya di Indonesia saja tapi juga di berbagai belahan dunia. Dikarenakan adanya larangan perjudian, sekarang seluruh pecinta ayam aduan melakuka taruhan dengan sistem online. karena itu kamu bisa mencoba game ini di agen judi resmi dan terpercaya untuk dapatkan keseruan tanpa batas di dalamnya.
Beragam Keunggulan Yang Ditawarkan Situs Sabung Ayam Online Resmi IndonesiaKebanyakan petaruh di Indonesia yang melakukan taruhan sabung ayam diwajibkan untuk memilih agen atau situs judi sabung ayam terbaik terlebih dahulu. Karena ketika pemilihan agen dapat dilakukan oleh player, tentu saja hal ini akan memudahkan jalannya dalam mendapatkan keuntungan dengan mudah. agen judi sabung ayam terbaik sendiri menawarkan beberapa keungguln yang membuat petaruh suka dan jatuh hati saat bermain di dalamnya. berikut ini sudah ada keunggulan yang akan kamu temukan di dalam situs judi sabung ayam resmi untuk para player di indonesia:
- Fitur Live Streaming
Untuk keunggulan yang akan kamu dapatkan pertama kali adalah live streaming. Jadi perlu diketahui, lewat fitur yang satu ini, kamu akan menemukan sebuah perlombaan secara langsung. Fitur live streaming memungkinkan para player untuk merasakan sensasi bermain yang sangat mirip seperti pada bandar darat langsung. Karena itu kebanyakan player akan lebih memilih bermain game sabung ayam bersama agen judi yang menyediakan ftur live streaming di dalamnya supaya taruhan lebih menyenangkan.
Para player yang ingin bermain dapat masuk ke dalam pertandingan lewat stus atau aplikasi. Jadi cobalah untuk temukan agen-agen yang memiliki fitur ini di dalamnya. karena ketika kamu ada di dalam sebuah agen judi sabung ayam dengan fitur seperti ini, itu artinya kamu sudah berhasil dapatkan agen terbaik. disini kamu bisa melakukan taruhan dengan aman dan nyaman serta mendapatkan hasil yang begitu menggiurkan.
- Hadir untuk semua kalangan
Kemudian, kamu juga akan menemukan banyak sekali player yang ikut bermain di dalam agen judi seperti ini. karena itu, game ini hadir untuk semua kalangan player yang membuatnya semakin populer. Game ini bisa diakses dengan mudah oleh player karena alat main yang digunakan hanyalah sebuah smartphone yang dihubungkan ke jaringan internet saja.
Jadi apabila kamu sudah menemukan jaringan internet di dalam smartphone milik kamu, kamu bisa akses sabung ayam online kapan saja dan dimana saja. kamu juga dapat menikmati permainan ini dengan penawaran tanpa batas yang membuat game ini sangat sayang bila dilewatkan begitu saja. jadi cobalah untuk melakukan pemilihan situs sabung ayam sampai menemukan agen seperti ini.
- Terjamin kEamanannya
Dan yang terakhir adalah mendapatkan game sabung ayam yang sudah terjamin keamanannya. Ini adalah salah satu keunggulan yang juga akan kamu dapatkan dari situs judi sabung ayam. Jadi apabila saat ini kamu mengikuti taruhan sabung ayam secara online, keamanan yang ada di dalam agen patut untuk kamu perhatikan dengan benar.
Pasalnya ketika kamu berada di dalam sebuah agen yang keamanannya tidak begitu terjamin, tentu saja kamu harus memperhatikan sistemnya dulu di dalam agen. Karena semua player yang bermain berhak mendapatkan keamanan pada saat berada di dalam agen. Keamanan dan kenyamanan adalah dua hal penting yang akan membantu player untuk bisa dapatkan keuntungan di setiap harinya. player yang bermain game taruhan online juga tidak perlu khawatir jika nanti tidak bisa mendapatkan keseruan pada game yang dimainkan.
Itulah beberapa keunggulan yang akan kamu dapatkan saat berada di dalam agen judi sabung ayam online resmi dan terpercaya. jadi apabila saat ini kamu tertarik dengan game ini, kamu harus temukan situs-situs dengan semua daftar keunggulan di atas untuk dapatkan keuntungan di setiap harinya.
Originally posted 2022-07-12 00:42:47. …
Strategi Main Sabung Ayam Online Yang Jarang Diketahui Oleh Player
Bermain game judi Joker123 apk online adalah salah satu aktivitas yang saat ini sedang banyak dilakukan oleh player. aktivitas ini disukai oleh player karena bisa mendatangkan penghasilan dalam jumlah yang besar. karena itu, apabila saat ini kamu suka dengan taruhan sabung ayam, pastikan kamu bertaruh dengan strategi. Jika kamu punya strategi untuk bermain game sabung ayam, kesempatan kamu dalam mendapatkan kemenangan akan jauh lebih besar. kamu juga bisa menikmati hasil yang menggiurkan lewat kemenangan yang sudah berhasil diraih.
Berikut Ini Beberapa Strategi Main Sabung Ayam Online Yang Jarang Diketahui Oleh PlayerBanyak petaruh mendambakan kemenangan dalam game sabung ayam yang dimainkan. Karena itu, jika kamu salah satunya, maka strategi dalam permainan harus kamu ketahui sejak awal. Jika kamu tahu strategi apa saja yang mesti dilakukan pada saat betting, hal ini akan membantu kamu dalam mendapatkan penghasilan yang besar. nah berikut ini sudah kami rangkum beberapa strategi untuk yang ingin bermain game sabung ayam dengan sistem online:
- Memilih Pertandingan yang Tepat
Dikarenakan ada banyak pertandingan sabung ayam yang akan ditemukan di agen judi terpercaya, maka kamu perlu mencari pertandingan yang memang sudah diketahui dengan baik. Banyanya pertandingan sabung ayam membantu para petaruh untuk memilih yang benar-benar memguntungkan. Jangan pernah berpikir jika semua pertandingan bisa kamu nikmati. jadi sebaiknya cari informasi yang banyak dan lengkap terkait pertandingan yang akan diikuti nanti. Jika sudah mengetahui pertandinganya, barulah kamu bisa dapatkan kemenangan dalam game dengan mudah.
- Amati Hasil Riwayat Pertandingan Terdahulu
Kemudian, strategi kedua untuk player yang ingin bermain game judi sabung ayam adalah mengamati hasil riwayat pertandingan dari kedua ayam yang diadu. Nantinya, kamu akan bertemu dengan ayam berwarna merah dan biru. Disini kamu harus pandai dalam memilih ayam yang dirasa bisa memenangkan pertarungan. Tapi untuk melakukan analisa, dibutuhkan informasi yang lengkap. Kamu bisa perhatikan hasil riwayat dari kedua ayam yang akan diadu.
Biasanya di agen judi sabung ayam online, player bisa menemukan hasil riwayat tersebut dengan mudah. informasi seperti ini tentu saja dibutuhkan oleh player. apalagi yang baru saja masuk ke dalam dunia taruhan adu ayam online itu sendiri. jadi bagi para pecinta ayam aduan, lakukan strategi yang kedua ini dan kamu bisa dapatkan kemenangan dengan mudah.
- Modal Harus Dikelola dengan Baik
Strategi main game sabung ayam yang ketiga adalah modal harus dikelola dengan baik. Jadi buat yang ingin bermain taruhan sabung ayam, kamu harus pastikan jika modal yang akan dikeluarkan sudah melalui perhitungan yang matang. Jangan pernah berpikir jika uang yang kamu punya saat ini bisa kamu jadikan chip. Kamu harus perhatikan dulu berapa jumlah chip yang dibuthkan supaya nanti memudahkan proses deposit yang akan kamu lakukan.
Kebanyakan petaruh pemula langsung bertaruh dengan modal yang banyak.padahal jika hal ini dilakukan akan membuat taruhan yang dilakukan player justru tidak bisa memberikan keuntungan ataupun penghasilan. Maka dari itu, kamu tetap harus membatasi penggunaan modal yang akan dikeluarkan di setiap harinya. karena ini adalah bagian dari strategi yang perlu dilakukan oleh player yang bertaruh. Jika sudah mengaturnya, kerugian besar pasti tidak akan pernah kamu rasakan.
- Bermain dI Situs Terbaik
Dan yang terakhir adalah bermain game sabung ayam di situs judi terbaik. ini merupakan strtegi bermain game judi sabung ayam ketiga yang mesti dilakukan player. jadi untuk yang ingin bermain game sabung ayam, coba pilih dan pilah situsnya dulu. Jika kamu sudah menemukan situs judi terbaik, kamu pasti akan mendapatkan tempat yang bisa berikan kenyamanan untuk playernya.
Itulah beberapa strategi main game judi sabung ayam online yang jarang diketahui oleh player. jadi untuk petaruh yang ingin bermain harus mengikuti strategi di atas untuk bisa dapatkan peluang menang yang besar. jika kamu bisa dapatkan kemenangan dalam permainan sabung ayam, silahkan tarik dananya untuk dapatkan untung menjanjikan. Selamat mencoba dan semoga bermanfaat.
Originally posted 2022-06-08 00:34:09. …
Apa Yang Harus Dilakukan Saat Main Poker Online Modal Kecil?
Memang sekarang ini banyak sekali game judi joker123 online yang beredar di internet atau dunia maya dengan begitu bebasnya. Meski game judi dimainkan via onine, tetap saja harus ada modal untuk bisa mengakses dan menikmati keseruan pada game tersebut. begitu pun dengan game judi poker online, semua yang bermain game poker pastinya harus mempelajari dan memahami bagaimana caranya agar modal yang dibawa bisa memberikan hasil yang luar biasa. Karena itu, coba simak beberapa cara di bawah ini untuk pemula yang ingin bermain game poker tapi membawa modal dalamjumlah sedikit.
Hal-Hal Yang Perlu Dilakukan Saat Main Poker Online Memakai Modal KecilPermainan poker tidak dapat dipungkiri adalah game judi online yang membutuhkan modal bermain di dalamnya. Modal yang diperlukan pada saat bermain game poker adalah uang asli. Karena itu, ketika kamu berhasil dapatkan kemenangan, maka kemenangan tersebut akan membantu kamu untuk dapatkan penghasilan dalam jumlah yang sangat besar. jadi sudah tidak perlu heran lagi mengapa saat ini banyak petaruh yang bermain game poker dengan modal kecil. Jika kamubisa melakukan taruhan dengan modal kecil, kamu pasti akan bertaruh dengan aman. berikut ada beberapa hal yang sebaiknya dilakukan saat main poker dengan modal kecil:
- MEnguasai Permainannya Dulu
Hal pertama yang mesti dilakukan oleh player pada saat bermain game poker memakai modal kecil adalah menguasai permainannya terlebih dahulu. Jadi disini kamu harus tahu jika penguasaan terhadap permainan judi poker sangat diperlukan oleh player. karena ketika kamu menguasai permainannya dengan baik, akan ada banyak hal positif yang bisa kamu dapatkan nanti.
Jika kamu termasuk salah seorang pemain baru atau pemula, mungkin kamu perlu waktu yang cukup banyak agar bisa mempelajari dan memahami aturan dalam game poker dengan baik. Jika kamu sudah melakukannya, barulah kamu boleh melakukan taruhan dengan uang asli dengan pemahaman yang kamu miliki. Karena kamu pasti bisa mengolah kartu yang didapatkan dengan benar jika penguasaan terhadap permainan sudah kamu dapatkan.
- Memakai Konsentrasi Tingkat Tinggi
Kemudian, kamu juga perlu memakai konsentrasi tingkat tinggi pada saat bermain game poker. Ini adalah hal kedua yang perlu dilakukan oleh player. jangan pernah berpikir jika segala kondsii bisa kamu pakai untuk bermain game judi poker online. pasalnya kamu hanya bisa memenangkan permainan poker jika berada dalam konsentrasi. Kamu harus berkonsentrasi penuh pada taruhan dan fokus dengan segala tahapan yang kamu lalui untuk dapatkan kemenangan dengan mudah.
Kebanyakan player di Indonesia yang bermain tanpa konsentrasi justru akan mengalami kerugian dalam jumlah yang sangat besar. karena itu, kamu harus pastikan jika waktu dan tempat yang dipergunakan untuk bermain sudah tepat. pasalnya hanya dengan cara itu sja, kamu pasti bisa dapatkan taruhan yang lebih gampang untuk dimenangkan.
- Memanfaatkan Trik Jitu
Trik dibutuhkan oleh player pada saat bermain game judi poker. Salah satu trik yang tidakboleh sampai kamu lewatkan adalah trik bluffing atau menggertak. Karena disini kamu harus tahu jika trik bluffing akan sangat membantu kamu untuk mengalahkan player laiin yang duduk di meja taruhan online. jadi trik ini harus kamu lakukan dengan penuh keberanian agar player lain percaya dan segera keluar.
Jika kamu memakai trik yang satu ini, pastikan kamu melakukannya di moment yang tepat. Tidak masalah meski saat ini kartu yang kamu miliki tidak begitu bagus. Jika kamu punya kartu yang tidak terlalu baik nilainya, kamu hanya perlu mengolahnya saja dan berani untuk bluffing. Karena tidak ada satupun player yang bisa mengetahui nilai kombinasi kartu yang kamu dapatkan saat ini. jadi coba untuk melakukan trik yang ketiga ini agar bisa memenangkan permainan denga mudah.
Itulah beberapa hal yang harus dilakukan oleh player bila bermain game judi poker online memakai modal dalamjumlah yang kecil. Jadi apabila saat ini kamu sedang tertarik untuk bermain taruhan poker, kamu boleh melakukan taruhan dengan sejumlah trik di atas. Selamat mencoba dan semoga bermanfaat.
Originally posted 2022-05-23 00:16:33. …
Here Is How to Join Online Cockfighting Betting Safely
There are a few steps you should follow if you want to play idnplay download gambling games safely and comfortably. So if you are joining a cockfighting (sabung ayam) game, make sure you place your bets the right way. This game can now be accessed and enjoyed online, which makes it easy to get into: all it takes is a smartphone and an internet connection, so you can play wherever you like.
Some Safe Ways to Join Online Cockfighting Betting
Unlike cockfighting matches attended in person, cockfighting games accessed online are far safer, because you can reach the game from anywhere. With just a smartphone and an internet connection, you can access the game wherever you happen to be, which is why most players now prefer to play cockfighting online. So if you are interested, check out the following ways to join cockfighting betting so the process runs safely and smoothly:
- Register on an Official Gambling Site
First, you must register on an official gambling site. This is the first step to take if you want to join cockfighting betting online. Registering with an official agent helps you obtain a member account quickly. The details you submit to the agent must be genuine; never assume you can use someone else's data when signing up with a cockfighting agent.
Prepare all the personal details that will be required during registration. If you are interested in playing later, it does not hurt to prepare thoroughly, because careful preparation will help you complete the registration process easily and get a member account without a long wait.
- Make Your First Deposit
Next, you need to make your first deposit. Depositing into an online cockfighting site is the second step a player must take. Before making a deposit transaction with a cockfighting agent, first ask the customer service staff on duty for the agent's account number. Don't worry: customer service will help you get the site's latest account number so that no mistakes are made when you bet.
Deposits to a cockfighting agent should also be made at the right time. Never make a transaction when you do not know whether the bank is online; deposit while the bank is online, and the transaction will go smoothly, leaving you with chips to bet with every day.
- Start Betting with Small Stakes
The final step for anyone who wants to join cockfighting betting is to start with small stakes. So if you want to play, place small bets first. Don't rush into large bets, because jumping straight into high-stakes betting will cost you dearly.
Cockfighting can be played with either large or small stakes. If you play with small stakes, your chances of winning are much better. That differs from high-stakes betting, where most bettors focus only on winning and on the money they have left, forgetting the losses this game so often deals its players.
Those are some safe ways for beginners to join online cockfighting betting. So that your betting goes well later, try following each of the steps above one by one to earn a big profit.
Originally posted 2022-05-07 00:40:10. …
Check Out These Tips If You Want to Play at an Online Poker Agent
When playing poker88 online, you should first learn a number of tips that will help you run your bets well. Tips for playing poker are needed by every player, especially beginners: once a player knows them, the betting process becomes much easier. So read through the tips below if you want to bet easily and comfortably.
Various Tips You Need If You Want to Play at an Online Poker Agent
When playing poker, every player hopes to earn large profits. Unfortunately, as a beginner, there are many things you need to learn first. The more you know about the game you are playing, the greater your chances of winning. So check out the following tips for playing poker to open up your chances of a big profit:
- Read Up on the Rules of Poker
First, read up on the correct rules of poker. If you enjoy poker, make sure you gather information about the game from the very start. There is a lot a player must know, and one key item is the card combinations in poker itself; learn these combinations so you can build the best possible hand when you bet.
Another part of the rules a player should know is which playing strategies will be needed or useful at the table. Strategy matters in poker, and one of the most popular strategies in the betting world is bluffing: you can bluff to pressure other players into folding out of the hand.
- Playing Capital
Next, when playing poker you must also have playing capital. Anyone who wants to play poker should make a deposit first: the money you bring into the game is transferred to the agent's account, and with it you can bet on poker every day, joining games without having to wait for particular times.
In poker, chips are needed by every player who bets. So if you are interested in playing online poker, never assume the game can be accessed or played without chips or capital. Without capital, no game can be accessed, including poker itself.
- Play Patiently
Third, play poker with plenty of patience. This is the next tip players in Indonesia must not forget. If you hope to dive into poker betting, patience is one of the things you will need most, and you can collect many wins and plenty of profit by being more patient when betting online.
Many bettors in Indonesia currently place their bets in a hurry. Rushing into betting does not just cause losses in the moment; it risks piling up further large losses later on. Patience is one of the most important playing techniques for Indonesian players.
Those are some tips to follow if you want to play with the best online poker agent. If you apply all the tips above, your chances of making a profit will be much greater; you may even enjoy success in this game every day. Good luck.
Originally posted 2022-04-10 00:30:54. …
Tricks for Creating an Online Poker Account That Beginners Should Learn
How to create a member account with an online idnplay poker agent is information every beginner player in Indonesia will certainly need, because beginners need tricks to make the account-creation process run easily and smoothly. These tricks are well worth studying, so if you are a beginner currently interested in poker, first check out the tricks for creating a gambling account below.
Various Tricks for Players Who Want to Create an Online Poker Account
Everyone who has entered the world of betting wants to try poker, which can now be accessed online. There are many differences between offline and online poker, so if you have never tried the online version, you should read this overview first. There is a great deal worth knowing, including a guide to creating a poker account. Here are the tricks for creating a poker account that beginners should learn:
- Play on a Site Recommended by Others
First, play poker on a site that many people already recommend. This is the first step a player needs to take. When playing poker, you must not choose the wrong gambling site; the site you pick must be trustworthy. One way to choose a poker site is to check for an official license, so look for a poker site that has already been certified as official.
You should also bet on a site that offers the most complete facilities and services. So before playing poker, check whether the site you have chosen provides convenient service, because the best gambling sites always deliver the best service to their players.
- Prepare Funds
Next, prepare funds so you can deposit into the online poker site. To play poker, you need money and a bank account in your own name. If you have not opened a bank account yet, open one under your own name, because the agent will not process a transaction where the bank account name differs from the name on the gambling account you created.
- Fill In the Data Form
The third step for players who want to create a poker account is filling in the data form. Once you have found a gambling site and prepared sufficient funds, this is the next step to take. Fill in the details correctly: the fields you should complete with your own data include your account name or username, the bank account number you use, the type of bank, and more.
If you want the data-entry process to go smoothly, try to prepare the required details in advance. Preparing your personal data before registering is a step every player should take, so fill in the data form using this trick.
- Start Betting
The final step is to start betting. This is the last step in creating a gambling account. Once you start betting online, you can place bets whenever you like; but first check whether you fully understand the bet you are about to make, and if not, take the time to learn the rules of the game first.
Those are the tricks for creating an online poker account that every beginner in Indonesia should learn and understand. Once you have learned them, you can create a member account right away; it may take only a few minutes if every step is done correctly.
Originally posted 2022-03-25 00:07:14. …


