Microsoft Research

Syndicate content
Updated: 1 hour 50 min ago

Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model

Wed, 03/04/2026 - 20:05
At a glance
  • Phi-4-reasoning-vision-15B is a compact and smart open‑weight multimodal reasoning model that balances reasoning power, efficiency, and training data needs. It is a broadly capable model that allows for natural interaction for a wide array of vision-language tasks and excels at math and science reasoning and understanding user-interfaces.
  • We share lessons learned and best practices for training a multimodal reasoning model—showing the benefit of careful architecture choices, rigorous data curation, and the benefits of using a mixture of reasoning and non-reasoning data.

We are pleased to announce Phi-4-reasoning-vision-15B, a 15 billion parameter open‑weight multimodal reasoning model, available through Microsoft Foundry (opens in new tab), HuggingFace (opens in new tab) and GitHub (opens in new tab). Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks such as image captioning, asking questions about images, reading documents and receipts, helping with homework, inferring about changes in sequences of images, and much more. Beyond these general capabilities, it excels at math and science reasoning and at understanding and grounding elements on computer and mobile screens. In particular, our model presents an appealing value relative to popular open-weight models, pushing the pareto-frontier of the tradeoff between accuracy and compute costs. We have competitive performance to much slower models that require ten times or more compute-time and tokens and better accuracy than similarly fast models, particularly when it comes to math and science reasoning.

Figure 1: Phi-4-reasoning-vision-15B presents a compelling option compared to existing models, pushing the pareto-frontier of the tradeoff between accuracy and compute costs. We have competitive performance to much slower models that require more time and tokens and higher accuracy than similarly fast models. These values were computed by averaging accuracy, time, and output token-counts for a subset of 4 benchmarks: ChartQA_TEST, MathVista_MINI, MMMU_VAL, and ScreenSpot_v2, where we had logged these values. 

In this post, we share the motivations, design choices, experiments, and learnings that informed its development, as well as an evaluation of the model’s performance and guidance on how to use it. Our goal is to contribute practical insight to the community on building smaller, efficient multimodal reasoning models and to share an open-weight model that is competitive with models of similar size at general vision-language tasks, excels at computer use, and excels on scientific and mathematical multimodal reasoning.

A focus on smaller and faster vision–language models

Many popular vision-language models (VLMs) have trended towards growing in parameter count and, in particular, the number of tokens they consume and generate. This leads to increase in training and inference-time cost and latency, and impedes their usability for downstream deployment, especially in resource‑constrained or interactive settings.

A growing countertrend towards smaller (opens in new tab) models aims to boost efficiency, enabled by careful model design and data curation – a goal pioneered by the Phi family of models (opens in new tab) and furthered by Phi-4-reasoning-vision-15B. We specifically build on learnings from the Phi-4 and Phi-4-Reasoning language models and show how a multimodal model can be trained to cover a wide range of vision and language tasks without relying on extremely large training datasets, architectures, or excessive inference‑time token generation. Our model is intended to be lightweight enough to run on modest hardware while remaining capable of structured reasoning when it is beneficial. Our model was trained with far less compute than many recent open-weight VLMs of similar size. We used just 200 billion tokens of multimodal data leveraging Phi-4-reasoning (trained with 16 billion tokens) based on a core model Phi-4 (400 billion unique tokens), compared to more than 1 trillion tokens used for training multimodal models like Qwen 2.5 VL (opens in new tab) and 3 VL (opens in new tab), Kimi-VL (opens in new tab), and Gemma3 (opens in new tab). We can therefore present a compelling option compared to existing models pushing the pareto-frontier of the tradeoff between accuracy and compute costs.

Figure 2: Phi-4-Reasoning-Vision can help with a wide range of everyday tasks. Lessons from training a multimodal model

Training a multimodal reasoning model raises numerous questions and requires many nuanced design choices around model architecture, dataset quality and composition, and the interaction between reasoning‑heavy and non-reasoning perception‑focused tasks.

Model architecture: Early- vs mid-fusion

Model architectures for VLMs differ primarily in how visual and textual information is fused. Mid-fusion models use a pretrained vision encoder to convert images into visual tokens that are projected into a pretrained LLM’s embedding space, enabling cross-modal reasoning while leveraging components already trained on trillions of tokens. Early-fusion models process image patches and text tokens in a single model transformer, yielding richer joint representations but at significantly higher compute, memory, and data cost. We adopted a mid-fusion architecture as it offers a practical trade-off for building a performant model with modest resources.

Model architecture: Vision encoder and image processing

We build on the SigLIP-2 (opens in new tab) vision encoder and the Phi-4-Reasoning backbone. In previous research, we found that multimodal language models sometimes struggled to solve tasks, not because of a lack of reasoning proficiency, but rather an inability to extract and select relevant perceptual information from the image. An example would be a high-resolution screenshot that is information-dense with relatively small interactive elements.

Several open-source multimodal language models have adapted their methodologies accordingly, e.g., Gemma3 (opens in new tab) uses pan-and-scan and NVILA (opens in new tab) uses Dynamic S2. However, their trade-offs are difficult to understand across different datasets and hyperparameters. To this end, we conducted an ablation study of several techniques. We trained a smaller 5 billion parameter Phi-4 based proxy model on a dataset of 10 million image-text pairs, primarily composed of computer-use and GUI grounding data. We compared with Dynamic S2, which resizes images to a rectangular resolution that minimizes distortion while admitting a tiling by 384×384 squares; Multi-crop, which splits the image into potentially overlapping 384×384 squares and concatenates their encoded features on the token dimension; Multi-crop with S2, which broadens the receptive field by cropping into 1536×1536 squares before applying S2; and Dynamic resolution using the Naflex variant of SigLIP-2, a natively dynamic-resolution encoder with adjustable patch counts.

Our primary finding is that dynamic resolution vision encoders perform the best and especially well on high-resolution data. It is particularly interesting to compare dynamic resolution with 2048 vs 3600 maximum tokens: the latter roughly corresponds to native HD 720p resolution and enjoys a substantial boost on high-resolution benchmarks, particularly ScreenSpot-Pro. Reinforcing the high-resolution trend, we find that multi-crop with S2 outperforms standard multi-crop despite using fewer visual tokens (i.e., fewer crops overall). The dynamic resolution technique produces the most tokens on average; due to their tiling subroutine, S2-based methods are constrained by the original image resolution and often only use about half the maximum tokens. From these experiments we choose the SigLIP-2 Naflex variant as our vision encoder.

MethodMax TokensMathVistaScreenSpotScreenSpot-ProV*BenchDynamic-S2309642.978.49.452.9Multi-crop309643.467.85.451.8Multi-crop with S2204843.479.110.657.1Dynamic resolution204845.281.59.251.3Dynamic resolution360044.979.717.556.0Table 1: Results with different resolution handling approaches. The top two configurations on each benchmark are in bold. Data: Quality and composition

As with its language backbone Phi-4-Reasoning, Phi-4-reasoning-vision-15B was trained with a deliberate focus on data quality. Our final dataset consists primarily of data from three sources: open-source datasets which were meticulously filtered and improved; high-quality domain-specific internal data; and high-quality data from targeted acquisitions. The overwhelming majority of our data lies in the first category: data which originated as open-source data, which were significantly filtered and improved, whether by removing low-quality datasets or records, programmatically fixing errors in data formatting, or using open-source images as seeds to synthetically generate higher-quality accompanying text.

The process of improving open-source data began by manually reviewing samples from each dataset. Typically, 5 to 10 minutes were sufficient to classify data as excellent-quality, good questions with wrong answers, low-quality questions or images, or high-quality with formatting errors. Excellent data was kept largely unchanged. For data with incorrect answers or poor-quality captions, we re-generated responses using GPT-4o and o4-mini, excluding datasets where error rates remained too high. Low-quality questions proved difficult to salvage, but when the images themselves were high quality, we repurposed them as seeds for new caption or visual question answering (VQA) data. Datasets with fundamentally flawed images were excluded entirely. We also fixed a surprisingly large number of formatting and logical errors across widely used open-source datasets.

We extracted additional value from existing datasets through reformatting, diversification, and using images as seeds for new data generation. We generated detailed image descriptions alongside original QA pairs for math and science data, had data perform “double-duty” by embedding instruction-following requirements directly into domain-specific QA, created “scrambled,” “caption-matching,” and “what’s changed?” records to improve multi-image reasoning and sequential navigation for CUA scenarios, and diversifying prompt styles to encourage robustness beyond perfectly structured questions.

To supplement the improved open-source data, we utilize high-quality internal datasets, several math-specific datasets which were acquired during training of the Phi-4 language model, and also some domain-specific curated data; for example, latex-OCR data generated by processing and rendering equations from arXiv documents.

before returning a bounding box coordinates for a UI grounding task, and the other uses a tag with step-by-step reasoning to answer a chart question about expatriate populations, concluding with "Dubai." " class="wp-image-1163336"/> Figure 3: Phi-4-reasoning-vision-15B training data composition and examples Data: Mathematics vs. computer-use data proportion

One of our goals was to train a model that performs well across general vision-language tasks, while excelling at mathematical and scientific reasoning and computer-use scenarios. How to structure datasets for generalizable reasoning remains an open question—particularly because the relationship between data scale and reasoning performance can lead to starkly different design decisions, such as training a single model on a large dataset versus multiple specialized models with targeted post-training.

Research on long-tailed classification robustness has suggested that balancing or removing data from overrepresented tasks or subgroups (opens in new tab) is an effective method for ensuring good performance. Nevertheless, these insights are not fully utilized or explored when it comes to training VLMs, which at times have favored scale over careful data balancing. To achieve our goals, we conducted a set of experiments to analyze a range of data ratios between our focus domains.

Using the same 5 billion parameter proxy model as for previous experiments, we trained while varying the amount of mathematics and science vs. computer-use data for each run. Each dataset included the same subset of 1 million general image-text pairs as a baseline. For mathematics and science data, we used a subsample of 150,000 records, optionally duplicating each one up to three times. Next, we included up to 450,000 computer-use records, and optionally an additional 400,000 from Phi-Ground.

We found that that multimodal mathematics and science performance were not harmed by additional computer-use data, and vice versa. Interestingly, we found that increasing mathematics data by 3x while keeping computer-use data constant improved math, science, and computer-use benchmarks.

GeneralMath and ScienceCUATotalMMMUMathVistaScreenSpot-V21M150K450K1.6M44.037.448.21M150K850K2.0M44.137.360.01M450K450K1.9M45.336.048.31M450K850K2.3M43.438.963.11M150K150K1.3M44.236.929.81M150K250K1.4M45.437.437.7Table 2: Varying the ratios of math and CUA data. Increasing math data by 3x while keeping computer-use data constant improves both math and computer-use benchmarks.  Data: Synthetic data for text-rich visual reasoning

Recent work (opens in new tab) suggests that targeted synthetic data can materially improve multimodal reasoning, particularly for text-rich visual domains such as charts, documents, diagrams, and rendered mathematics. Using images, questions, and answers that are programmatically generated and grounded in the visual structure enables precise control over visual content and supervision quality, resulting in data that avoids many annotation errors, ambiguities, and distributional biases common in scraped datasets. This enables cleaner alignment between visual perception and multi-step inference, which has been shown to translate into measurable gains on reasoning-heavy benchmarks.

Synthetic text-rich images expand coverage of long-tail visual formats that are underrepresented in real data but disproportionately impact reasoning accuracy, improving not only visual grounding but also downstream reasoning by ensuring that failures are less often caused by perceptual errors. We found that programmatically generated synthetic data is a useful augmentation to high-quality real datasets — not a replacement, but a scalable mechanism for strengthening both perception and reasoning that complements the training objectives in compact multimodal models such as Phi-4-reasoning-vision-15B.

Mixing non-reasoning and reasoning as a design objective

In language-only settings, reasoning traces have improved performance on many tasks, but they require additional compute which adds undesired latency. In multimodal settings, this tradeoff is less clear-cut, for tasks such as image captioning and optical character recognition (OCR), reasoning is often unnecessary and can even be harmful (opens in new tab), while mathematical and scientific problem-solving benefit from multi-step reasoning. Thus, the choice of when to reason or not can be quite nuanced.

Training approaches for multimodal reasoning models

Language-only reasoning models are typically created through supervised fine-tuning (SFT) or reinforcement learning (RL): SFT is simpler but requires large amounts of expensive reasoning trace data, while RL reduces data requirements at the cost of significantly increased training complexity and compute. Multimodal reasoning models follow a similar process, but the design space is more complex. With a mid-fusion architecture, the first decision is whether the base language model is itself a reasoning or non-reasoning model. This leads to several possible training pipelines:

  • Non-reasoning LLM → reasoning multimodal training: Reasoning and multimodal capabilities are trained together.
  • Non-reasoning LLM → non-reasoning multimodal → reasoning multimodal training: Multimodal capabilities are learned first, then reasoning is added.
  • Reasoning LLM → reasoning multimodal training: A reasoning base is used, but all multimodal data must include reasoning traces.
  • Our approach: Reasoning LLM → mixed non-reasoning / reasoning multimodal training. A reasoning-capable base is trained on a hybrid data mixture, learning when to reason and when to respond directly.

Approaches 1 and 2 offer flexibility in designing multimodal reasoning behavior from scratch using widely available non-reasoning LLM checkpoints but place a heavy burden on multimodal training. Approach 1 must teach visual understanding and reasoning simultaneously and requires a large amount of multimodal reasoning data, while Approach 2 can be trained with less reasoning data but risks catastrophic forgetting, as reasoning training may degrade previously learned visual capabilities. Both risk weaker reasoning than starting from a reasoning-capable base. Approach 3 inherits strong reasoning foundations, but like Approach 1, it requires reasoning traces for all training data and produces reasoning traces for all queries, even when not beneficial.

Our approach: A mixed reasoning and non-reasoning model

Phi-4-reasoning-vision-15B adopts the 4th approach listed previously, as it balances reasoning capability, inference efficiency, and data requirements. It inherits a strong reasoning foundation but uses a hybrid approach to combine the strengths of alternatives while mitigating their drawbacks. Our model defaults to direct inference for perception-focused domains where reasoning adds latency without improving accuracy, avoiding unnecessary verbosity and reducing inference costs, and it invokes longer reasoning paths for domains, such as math and science, that benefit from structured multi-step reasoning (opens in new tab).

Our model is trained with SFT, where reasoning samples include “…” sections with chain-of-thought reasoning before the final answer, covering domains like math and science. Non-reasoning samples are tagged to start with a “” token, signaling a direct response, and cover perception-focused tasks such as captioning, grounding, OCR, and simple VQA. Reasoning data comprises approximately 20% of the total mix. Starting from a reasoning-capable backbone means this data grounds existing reasoning in visual contexts rather than teaching it to reason from scratch.

This approach is not without limitations. The balance between modes is a direct function of design choices we made, informed by recent literature (opens in new tab) and observed model behavior during training—though the boundary between modes can be imprecise as it is learned implicitly from the data distribution. Our model allows control through explicit prompting with “” or “” tokens when the user wants to override the default reasoning behavior. The 20/80 reasoning-to-non-reasoning data split may not be optimal for all domains or deployment contexts. Evaluating the ideal balance of data and the model’s ability to switch appropriately between modes remains an open problem.

We view this mixed approach not as a definitive solution, but as one practical and well-motivated point in the design space for balancing latency, accuracy, and flexibility in multimodal systems.

Applications Figure 4: Phi-4-Reasoning-Vision can interpret sequences of images 

Phi-4-reasoning-vision-15B is a high-performing model across many vision-language tasks. It sees and understands the world by looking at a photo, document, chart, or screen and making sense of it. In practice that covers an enormous range of applications — just a few examples include: describing images and answering questions about them, interpreting changes and trends in images sequences, and recognizing objects, landmarks, and transcribing text.

Highlights: Scientific and mathematical reasoning and supporting computer-using agents (CUA)

In addition to general vision and language tasks, Phi-4-reasoning-vision-15B was designed to excel at tasks that combine visual input with structured inference, such as solving math problems presented in visual form, such as handwritten or diagram-based questions, extracting and reasoning over quantitative information in documents and charts, and supporting multi-step reasoning in educational or scientific analysis contexts.

Figure 5: Phi-4-reasoning-vision-15B is great at math and science  Figure 6: Phi-4-reasoning-vision-15B can help with written math problems 

In addition, we trained Phi-4-reasoning-vision-15B to have skills that can enable agents to interact with graphical user interfaces by interpreting screen content and selecting actions. With strong high-resolution perception and fine-grained grounding capabilities, Phi-4-reasoning-vision-15B is a compelling option as a base-model for training agentic models such as ones that navigate desktop, web, and mobile interfaces by identifying and localizing interactive elements such as buttons, menus, and text fields. Due to its low inference-time needs it is great for interactive environments where low latency and compact model size are essential.

Figure 7: Phi-4-reasoning-vision-15B can help navigate computer UIs Evaluation

Phi-4-reasoning-vision-15B was evaluated for accuracy and timing using two complementary open-source frameworks to ensure both rigorous and standardized analysis: Eureka ML Insights (opens in new tab) and VLMEvalKit (opens in new tab).

BenchmarkPhi-4-reasoning-vision-15BPhi-4-reasoning-vision-15B – force nothinkPhi-4-mm-instructKimi-VL-A3B-Instructgemma-3-12b-itQwen3-VL-8B-Instruct-4KQwen3-VL-8B-Instruct-32KQwen3-VL-32B-Instruct-4KQwen3-VL-32B-Instruct-32KAI2D_TEST 84.8 84.7 68.6 84.6 80.4 82.7 83 84.8 85 ChartQA_TEST 83.3 76.5 23.5 87 39 83.1 83.2 84.3 84 HallusionBench64.4 63.1 56 65.2 65.3 73.5 74.1 74.4 74.9 MathVerse_MINI 44.9 43.8 32.4 41.7 29.8 54.5 57.4 64.2 64.2 MathVision_MINI 36.2 34.2 20 28.3 31.9 45.7 50 54.3 60.5 MathVista_MINI 75.2 68.7 50.5 67.1 57.4 77.1 76.4 82.5 81.8 MMMU_VAL 54.3 52 42.3 52 50 60.7 64.6 68.6 70.6 MMStar 64.5 63.3 45.9 60 59.4 68.9 69.9 73.7 74.3 OCRBench 76 75.6 62.6 86.5 75.3 89.2 90 88.5 88.5 ScreenSpot_v2 88.2 88.3 28.5 89.8 3.5 91.5 91.5 93.7 93.9 Table 3: Accuracy comparisons relative to popular open-weight, non-thinking models  BenchmarkPhi-4-reasoning-vision-15BPhi-4-reasoning-vision-15B – force thinkingKimi-VL-A3B-Thinkinggemma-3-12b-itQwen3-VL-8B-Thinking-4KQwen3-VL-8B-Thinking-40KQwen3-VL-32B-Thiking-4KQwen3-VL-32B-Thinking-40KAI2D_TEST 84.8 79.7 81.2 80.4 83.5 83.9 86.9 87.2 ChartQA_TEST 83.3 82.9 73.3 39 78 78.6 78.5 79.1 HallusionBench64.4 63.9 70.6 65.3 71.6 73 76.4 76.6 MathVerse_MINI 44.9 53.1 61 29.8 67.3 73.3 78.3 78.2 MathVision_MINI 36.2 36.2 50.3 31.9 43.1 50.7 60.9 58.6 MathVista_MINI 75.2 74.1 78.6 57.4 77.7 79.5 83.9 83.8 MMMU_VAL 54.3 55 60.2 50 59.3 65.3 72 72.2 MMStar 64.5 63.9 69.6 59.4 69.3 72.3 75.5 75.7 OCRBench 76 73.7 79.9 75.3 81.2 82 83.7 85 ScreenSpot_v2 88.2 88.1 81.8 3.5 93.3 92.7 83.1 83.1 Table 4: Accuracy comparisons relative to popular open-weight, thinking models 

Our model balances thinking and non-thinking performance – on average showing better accuracy in the default “mixed-reasoning” behavior than when forcing thinking vs. non-thinking. Only in a few cases does forcing a specific mode improve performance (MathVerse and MMU_val for thinking and ScreenSpot_v2 for non-thinking). Compared to recent popular, open-weight models, our model provides a desirable trade-off between accuracy and cost (as a function of inference time compute and output tokens), as discussed previously.

Note: All numbers here are the result of running benchmarks ourselves and may be lower than other previously shared numbers. Instead of quoting leaderboards, we performed our own benchmarking, so we could understand scaling performance as a function of output token counts for related models. We made our best effort to run fair evaluations and used recommended evaluation platforms with model-specific recommended settings and prompts provided for all third-party models. For Qwen models we use the recommended token counts and also ran evaluations matching our max output token count of 4096. For Phi-4-reasoning-vision-15B, we used our system prompt and chat template but did not do any custom user-prompting or parameter tuning, and we ran all evaluations with temperature=0.0, greedy decoding, and 4096 max output tokens. These numbers are provided for comparison and analysis rather than as leaderboard claims. For maximum transparency and fairness, we will release all our evaluation logs publicly. For more details on our evaluation methodology, please see our technical report (opens in new tab).

Safety

As with other Phi models, Phi-4-reasoning-vision-15B was developed with safety as a core consideration throughout training and evaluation. The model was trained on a mixture of public safety datasets and internally generated examples designed to elicit behaviors the model should appropriately refuse, in alignment with Microsoft’s Responsible AI Principles. For further details, check out our technical report (opens in new tab).

Open release and community engagement

Phi-4-reasoning-vision-15B is available on Microsoft Foundry (opens in new tab) and HuggingFace (opens in new tab) with additional examples and details on GitHub (opens in new tab). For additional guidance on how to use our model properly and safely, please refer to our Model card (opens in new tab). For further details on the technical aspects of the model, training, and evaluation, see our technical report (opens in new tab).

In line with our goal of supporting future AI development in the community, Phi-4-reasoning-vision-15B is released under a permissive license with model weights, fine‑tuning code, and benchmark logs. We intend this release to complement existing work by providing concrete artifacts that help close gaps in understanding how compact multimodal reasoning models can be built and studied.

Looking forward

Smaller vision–language models with selective, task‑aware reasoning offer one promising direction for making multimodal systems more practical and accessible. We present our model and its learnings to inform ongoing research in multimodal modeling, computer‑using agents, and mathematical scientific reasoning. We hope these details are useful to researchers exploring similar tradeoffs and invite critical evaluation, replication, and extension by the community. If you’d like to join us and help shape the future of multimodal models, please apply for one of our open roles.

Acknowledgements

We thank Rachel Ward for her extensive work on data collection and curation. We thank the GenDatasets, PhiGround, SimCity, and Fara-7B efforts for invaluable training data. We thank Harkirat Behl, Mojan Javaheripi, and Suriya Gunasekar for providing us with Phi-4 checkpoints and guidance on training with Phi models. We additionally thank Sahaj Agarwal, Ahmed Awadallah, Qi Dai, Gustavo de Rosa, Rafah Hosn, Ece Kamar, Piero Kauffmann, Yash Lara, Chong Luo, Caio César Teodoro Mendes, Akshay Nambi, Craig Presti, Matthew Rosoff, Corby Rosset, Marco Rossi, Kashyap Patel, Adil Salim, Sidhartha Sen, Shital Shah, Pratyusha Sharma, Alexey Taymanov, Vibhav Vineet, John Weiss, Spencer Whitehead, the AI Frontiers Team and Leadership, and Microsoft Research Leadership, for their valuable help, insightful discussions, and continued support throughout this work.

Opens in a new tab

The post Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model appeared first on Microsoft Research.

Categories: Microsoft

CORPGEN advances AI agents for real work

Thu, 02/26/2026 - 19:06
At a glance
  • Today’s AI agent benchmarks test one task at a time, while real workplace productivity requires managing dozens of interdependent tasks at once. To reflect this, we created a setting called Multi-Horizon Task Environments (MHTEs).
  • Under multi-task loads, leading computer-using agents degrade sharply, with completion rates dropping from 16.7% to 8.7%.
  • CORPGEN introduces digital employees, with hierarchical planning, memory isolation, and experiential learning, delivering up to 3.5 times higher completion rates than baselines across three independent agent backends.
  • Because CORPGEN is architecture-agnostic and modular, its gains come from system design rather than any single base model, and it benefits directly as underlying models improve.

By mid-morning, a typical knowledge worker is already juggling a client report, a budget spreadsheet, a slide deck, and an email backlog, all interdependent and all demanding attention at once. For AI agents to be genuinely useful in that environment, they will need to operate the same way, but today’s best models are evaluated one task at a time, not dozens at once.

In our paper, “CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments,” we propose an agent framework that equips AI with the memory, planning, and learning capabilities to close that gap.

Introducing Multi-Horizon Task Environments

Replicating the reality of workplace multitasking requires a new kind of evaluation environment. In response, we developed Multi-Horizon Task Environments (MHTEs), settings where an agent must manage multiple complex tasks simultaneously. Each task requires 10 to 30 dependent steps within a single session spanning five hours.

To determine what a benchmark would need to test, we ran MHTEs at scale on some of today’s leading AI agents, exposing four weaknesses. First, memory fills up. An agent cannot hold details for multiple active tasks at once. Second, information from one task interferes with reasoning about another. Third, tasks don’t depend on each other in simple sequences. They form complex webs where an agent must constantly check whether upstream work is finished before it can move forward on anything downstream. Fourth, every action cycle requires reprioritizing across all active tasks, not simply resuming where the agent left off.

We also tested three independent agent systems under increasing loads. As the number of concurrent tasks rose from 12 to 46, completion rates fell from 16.7% to 8.7% across all systems.

CORPGEN’s architecture

CORPGEN introduces digital employees: LLM-powered AI agents with persistent identities, role-specific expertise, and realistic work schedules. They operate Microsoft Office applications through GUI automation and perform consistently within MHTEs over hours of continuous activity. Figure 1 illustrates how a digital employee moves through a full workday.

Figure 1. Each day begins with a structured plan and memory loaded from previous sessions. The agent then works through overlapping tasks in repeated cycles, storing key outcomes at day’s end to inform the next session.

CORPGEN addresses each of the four weaknesses of concurrent task execution—memory overload, cross-task interference, dependency complexity, and reprioritization—in a targeted way. Hierarchical planning breaks objectives into daily goals and then into moment-to-moment decisions, allowing the agent to act from a structured plan instead of reviewing all available tasks before each step.

Subagents perform complex operations like web research in isolated contexts, preventing cross-task contamination. A tiered memory system enables selective recall of task-related information rather than retaining everything in active context. Adaptive summarization compresses routine observations while preserving critical information, keeping memory growth controlled.

Because these mechanisms are not tied to a specific base model, we tested CORPGEN across three different agents. In each case, we observed consistent gains. The improvements came from the architecture, not from the strength of any particular model. Figure 2 shows how they fit together within CORPGEN’s architecture.

Figure 2. Four mechanisms support concurrent task execution in CORPGEN: hierarchical planning, isolated subagents, tiered memory, and adaptive summarization. How digital employees collaborate

When multiple digital employees operate in the same environment, collaboration takes shape through standard communication channels, without predefined coordination rules. One employee sends an email requesting data; another picks it up in the next cycle, uses its memory to process it, and responds. This exchange mirrors real workplace communication.

There is no shared internal state between agents. Coordination occurs entirely through email and Microsoft Teams, the same channels many workers use. Over time, these independent exchanges form recognizable organizational patterns. Some agents take on leadership roles; others provide support; shared documents become the connective tissue.

When a communication path breaks, such as an email delivery error, agents reroute messages through alternate channels to keep work moving. The result is a virtual organization that behaves like a real one without being explicitly programmed to do so.

Evaluating CORPGEN

We evaluated CORPGEN on a multi-task benchmark that combined up to 46 tasks into a single six-hour session. Three findings stood out.

Baselines degrade as load increases; CORPGEN does not. All three baseline agent systems showed steady performance declines as task load rose. CORPGEN, by contrast, maintained or improved its completion rates at higher loads. At 46 tasks, CORPGEN completed 15.2% of tasks, compared with 4.3% for the baselines, roughly 3.5 times more.

Experiential learning drives the largest gains. We introduced CORPGEN’s components sequentially: first the orchestration layer, then cognitive tools, and finally experiential learning. The first two produced moderate improvements. Experiential learning, in which agents store records of completed tasks and reuse them when they encounter structurally similar work, produced the largest increase, raising completion rates from 8.7% to 15.2%.

Evaluation methodology changes the picture. When we inspected the actual output files produced by agents, the results agreed with human judgements roughly 90% of the time. Evaluation based on screenshots and action logs agreed only about 40% of the time. This gap suggests that common evaluation approaches may underestimate what agents actually accomplish in practice.

video series

On Second Thought

A video series with Sinead Bovell built around the questions everyone’s asking about AI. With expert voices from across Microsoft, we break down the tension and promise of this rapidly changing technology, exploring what’s evolving and what’s possible.

Explore the series Opens in a new tab Implications and looking forward

The results suggest that memory and retrieval, not just raw model capability, may be a key bottleneck in getting agents to work in the real world. The largest gains came from experiential learning. Agents that learn from prior successes and apply those patterns to structurally similar tasks build an advantage over systems that respond to each task in isolation.

CORPGEN also opens a new lens on how AI agents collaborate. Next steps include testing whether agents can maintain memory across multiple workdays and how they coordinate when working in teams. We are also exploring ways to make agents faster and more reliable by combining different methods of interacting with software.

Acknowledgments

This work is a result of a collaboration between the Office of the CTO at Microsoft and the Microsoft AI Development Accelerator Program (MAIDAP). We would like to thank the Microsoft Security Research team for providing resources that supported this research. We also thank the members of the Microsoft UFO2 (opens in new tab) team and the Mem0 (opens in new tab) project for their open-source contributions, which enabled key components of the CORPGEN architecture, and the OSWorld team for the benchmark that served as the foundation for our multi-task evaluation.

Finally, we thank the many contributors to this research: Charlotte Siska, Manuel Raúl Meléndez Luján, Anthony Twum-Barimah, and Mauricio Velazco.

Opens in a new tab
Categories: Microsoft

Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions

Thu, 02/19/2026 - 18:00

Insights from Microsoft’s Media Integrity and Authentication: Status, Directions, and Futures report

It has become increasingly difficult to distinguish fact from fiction when viewing online images and videos. Resilient, trustworthy technologies can help people determine whether the content they are viewing was captured by a camera or microphone—or generated or modified by AI tools. 

We refer to technologies aimed at helping viewers verify the source and history—that is, the provenance—of digital content as media integrity and authentication (MIA) methods. This technique, driven by the Coalition for Content Provenance and Authenticity (opens in new tab) (C2PA), a standards body dedicated to scaling these capabilities, as well as complementary methods such as watermarks and fingerprinting, have become critically important with the rapid advance of AI systems capable of creating, realistic imagery, video, and audio at scale.

A convergence of forces

Our team recognized an inflection point in the evolution of online content integrity, driven by the convergence of four forces:

  • Growing saturation of synthetic media, driven by proliferation of high-fidelity content-generation tools and the explosion of AI generated or modified media online
  • Forthcoming legislation both nationally and internationally seeking to define what “verifiable” provenance should mean in practice
  • Mounting pressure on implementers to ensure authentication signals are clear and helpful, especially as signals increase when legislation goes into effect in 2026
  • Heightened awareness of potential adversarial attacks that attempt to exploit weaknesses in authenticity systems

The usefulness and trustworthiness of provenance signals, whether certifying content as synthetic or as an authentic capture of real-world scenes, will depend not only on advances in technology, but also on how the broader digital ecosystem adopts, implements, and governs these tools. Aligning around implementation choices that promote consistency and clarity is essential to ensure transparency signals strengthen, rather than erode, public confidence.

To address these challenges, we launched a comprehensive evaluation of the real-world limits, edge cases, and emerging “attack surfaces” for MIA methods. Today, we are publishing our findings in the Media Integrity & Authentication: Status, Directions & Futures report. The report distills lessons learned and outlines practical directions for strengthening media integrity in the years ahead.

video series

On Second Thought

A video series with Sinead Bovell built around the questions everyone’s asking about AI. With expert voices from across Microsoft, we break down the tension and promise of this rapidly changing technology, exploring what’s evolving and what’s possible.

Explore the series Opens in a new tab Findings and directions forward

Our research recognizes that different media integrity and authenticity methods serve differing purposes and offer distinct levels of protection. After defining each method in detail, we focused on secure provenance (C2PA), imperceptible watermarking, and soft hash fingerprinting across images, audio, and video.

Grounded in our evaluation of these MIA methods across modalities, attack categories, and real-world workflows, several new findings emerged including two new concepts:

  • High-Confidence Provenance Authentication: a critical capability for verifying, under defined conditions, whether claims about the origin of and modifications made to an asset can be validated with high certainty.
  • Sociotechnical Provenance Attacks: attacks aimed at deception and capable of inverting signals, making authentic content appear synthetic, and synthetic content appear authentic.

Drawing on our findings, we identified four promising directions for further strengthening media authentication, along with suggestions to support more effective implementation strategies and future decisions. We’ve summarized the findings and directions below, with additional detail available in the report.

Promising directionsHigh-level findingsDelivering high-confidence provenance authentication– Implementation and display choices may affect the reliability of provenance indicators and how they are interpreted by the public.

– Using a C2PA provenance manifest for media created and signed in a high security environment enables high-confidence validation.

– High-confidence validation is also possible across a broader volume of images, audio, and video when an imperceptible watermark is linked to C2PA provenance manifest as an additional layer to recover the provenance information if removed.

– Fingerprinting is not an enabler for high-confidence validation and can involve significant costs when expected at scale. However, it can support manual forensics.Mitigating confusion from sociotechnical provenance attacks– MIA methods are susceptible to sociotechnical attacks on provenance that may mislead the public, resulting in confusion and misplaced trust about an asset’s provenance if there is an overreliance on low-quality signals.

– Layering and linking secure provenance and imperceptible watermarking methods to achieve high confidence validation also offers a promising option to both deter and mitigate the impact of attacks.

– Unintended consequences may result from the use of methods lacking authentication, such as the use of perceptible watermarks in the absence of secure provenance. Perceptible watermarks may cause confusion in cases of forgery or discourage people from consulting high-confidence provenance information via a validation tool, if such perceptible disclosures are taken at face value.

UX design that enables users to explore manifest details—such as where edits occurred or region of interest—has the potential to reduce confusion and support forensics and fact checking efforts.  Enabling more trusted provenance on edge devices– High-confidence results aren’t feasible when provenance is added by a conventional offline device (e.g., camera or recording device without connectivity).

Implementing a secure enclave within the hardware layer of offline devices is essential to make the provenance of captured images, audio, and video more trustworthy.Investing in ongoing research and policy development– All three methods offer organizations valuable tools for addressing operational challenges such as fraud prevention, risk management, and digital accountability. 

UX and display are promising directions for research. Important directions include in-stream tools that display provenance information where people are and distinguish between high- and lower-confidence provenance signals.

Stakeholders should conduct ongoing analysis and red teaming to identify and mitigate weaknesses through technical approaches, policies, and laws.    The journey continues

This report marks the beginning of a new chapter in our media provenance journey (opens in new tab), building on years of foundational work, from developing the very first prototype in 2019 to co-founding the C2PA in 2021 and helping catalyze an ecosystem that has since grown to more than 6,000 members and affiliates (opens in new tab) supporting C2PA Content Credentials. This research represents the next evolution of that long‑standing commitment.

We hope that by sharing our learnings will help others prepare for an important wave, especially as generative technologies accelerate and provenance signals multiply. This work is already underway across our products at Microsoft. Together, these directions highlight opportunities for the ecosystem to align, harden, and innovate, so authentication signals are not merely visible, but robust, meaningful, and resilient throughout the content lifecycle.

Opens in a new tab

The post Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions appeared first on Microsoft Research.

Categories: Microsoft

Project Silica’s advances in glass storage technology

Wed, 02/18/2026 - 18:11
At a glance
  • Microsoft Research publishes breakthrough in Nature on glass-based data storage that could preserve information for 10,000 years. 
  • New technique extends technology from expensive fused silica to ordinary borosilicate glass found in kitchen cookware. 
  • Innovations enable faster parallel writing, simplified readers (one camera instead of three), and easier manufacturing. 
  • Phase voxel method requires only a single laser pulse, significantly reducing complexity and cost.

Long-term preservation of digital information has long challenged archivists and datacenters, as magnetic tapes and hard drives degrade within decades. Existing archival storage solutions have limited media lifespans that make them less than ideal for preserving information for future generations.

Now, we are excited to report significant progress on Project Silica (opens in new tab), our effort to encode data in glass using femtosecond lasers, a technology that could preserve information for 10,000 years. Glass is a permanent data storage material that is resistant to water, heat, and dust.

In findings published in Nature (opens in new tab), we describe a breakthrough that extends the technology beyond expensive fused silica to ordinary borosilicate glass. A readily available and lower-cost medium, this is the same material found in kitchen cookware and oven doors. This advance addresses key barriers to commercialization: cost and availability of storage media. We have unlocked the science for parallel high-speed writing and developed a technique to permit accelerated aging tests on the written glass, suggesting that the data should remain intact for at least 10,000 years.

Storing data inside glass with femtosecond (opens in new tab) laser pulses is one of the few technologies on the horizon with the potential for durable, immutable, and long-lived storage. Although we have been leading innovation in this type of storage for years, prior to this research the technique only worked with pure fused silica glass, a type of glass that is relatively difficult to manufacture and available from only a few sources.

In the paper, we show how data can be stored in borosilicate glass. The new technique stores hundreds of layers of data in glass only 2mm thin, as with previous methods, but with important improvements. The reader for the glass now needs only one camera, not three or four, reducing cost and size. In addition, the writing devices require fewer parts, making them easier to manufacture and calibrate, and enabling them to encode data more quickly.

PODCAST SERIES

AI Testing and Evaluation: Learnings from Science and Industry

Discover how Microsoft is learning from other domains to advance evaluation and testing as a pillar of AI governance.

Listen now Opens in a new tab Key scientific discoveries

The Nature paper details several key new scientific discoveries:

Advances in birefringent voxel (opens in new tab) writing: For the previous type of data storage in fused silica glass using birefringent (i.e., polarization) voxels, we developed a technique to reduce the number of pulses used to form the voxel from many to only two, critically showing that the polarization of the first pulse is not important to the polarization of the voxel formed. We further developed this to enable pseudo-single-pulse writing, in which a single pulse can be split after its polarization is set to simultaneously form the first pulse for one voxel (where the polarization doesn’t matter) and the second pulse of another (where the set polarization is essential). We demonstrated how to use this pseudo-single-pulse writing to enable fast writing with beam scanning across the media.

Phase voxels, a new storage method: We invented a new type of data storage in glass called phase voxels, in which the phase change of the glass is modified instead of its polarization, showing that only a single pulse is necessary to make a phase voxel. We demonstrated that these phase voxels can also be formed in borosilicate glass and devised a technique to read the phase information from phase voxels encoded in this material. We showed that the much higher levels of three-dimensional inter-symbol interference in phase voxels can be mitigated with a machine learning classification model.

Parallel writing capabilities: By combining a mathematical model of pre-heating and post-heating within the glass with the invention of a multi-beam delivery system, we showed that many data voxels can be written in proximity in the glass at the same time, significantly increasing writing speed. We explained a method for using light emissions (a side effect of voxel formation) for both static calibration and dynamic control to fully support automatic writing operations.

Optimization and longevity testing: We developed a new way to optimize symbol encodings using machine learning and a better way to understand the tradeoff between error rates, error protection, and error recovery when evaluating new digital storage systems. We also created a new nondestructive optical method (opens in new tab) to identify the aging of data storage voxels within the glass, using this and standard accelerated aging techniques to support data lasting 10,000 years. We extended the industry standard Gray codes to apply to nonpower-of-two numbers of symbols.

Skip slideshow for: Previous slide Previous slide

A piece of Project Silica media written with data.

A research-grade Writer used to set the record for high speed data writing into glass.

A research-grade Reader for retrieving data from glass.

Close up of Writer showing high-speed multi-beam data encoding on laser pulses.

End of slideshow for: Demonstrating the technology

As a research initiative, Project Silica has demonstrated these advances through several proofs of concept, including storing Warner Bros.’ “Superman” movie on quartz glass (opens in new tab), partnering with Global Music Vault (opens in new tab) to preserve music under ice for 10,000 years (opens in new tab), and working with students on a “Golden Record 2.0” project (opens in new tab), a digitally curated archive of images, sounds, music, and spoken language, crowdsourced to represent and preserve humanity’s diversity for millennia.

Looking ahead

The research phase is now complete, and we are continuing to consider learnings from Project Silica as we explore the ongoing need for sustainable, long-term preservation of digital information. We have added this paper to our published works so that others can build on them.

Related work

Project Silica has made scientific advances across multiple areas beyond laser direct writing (LDW) in glass, including archival storage systems design, archival workload analysis, datacenter robotics, erasure coding, free-space optical components, and machine learning-based methods for symbol decoding in storage systems. Many of these innovations were described in our ACM Transactions on Storage publication (opens in new tab) in 2025.

Opens in a new tab

The post Project Silica’s advances in glass storage technology appeared first on Microsoft Research.

Categories: Microsoft

eXTReMe Tracker