Microsoft Research

Syndicate content
Updated: 1 hour 1 min ago

Agent Lightning: Adding reinforcement learning to AI agents without code rewrites

Thu, 12/11/2025 - 18:00

AI agents are reshaping software development, from writing code to carrying out complex instructions. Yet LLM-based agents are prone to errors and often perform poorly on complicated, multi-step tasks. Reinforcement learning (RL) is an approach where AI systems learn to make optimal decisions by receiving rewards or penalties for their actions, improving through trial and error. RL can help agents improve, but it typically requires developers to extensively rewrite their code. This discourages adoption, even though the data these agents generate could significantly boost performance through RL training.

To address this, a research team from Microsoft Research Asia – Shanghai has introduced Agent Lightning. This open-source (opens in new tab) framework makes AI agents trainable through RL by separating how agents execute tasks from model training, allowing developers to add RL capabilities with virtually no code modification.

Capturing agent behavior for training

Agent Lightning converts an agent’s experience into a format that RL can use by treating the agent’s execution as a sequence of states and actions, where each state captures the agent’s status and each LLM call is an action that moves the agent to a new state.

This approach works for any workflow, no matter how complex. Whether it involves multiple collaborating agents or dynamic tool use, Agent Lightning breaks it down into a sequence of transitions. Each transition captures the LLM’s input, output, and reward (Figure 1). This standardized format means the data can be used for training without any additional steps.

Figure 1. An illustration of Agent Lightning’s standardized format using a retrieval-augmented generation (RAG) agent. Left: The full agent workflow, where the agent’s state updates after each component step. The green blocks show assigned variables, and the gray blocks indicate variables without content. Right: The collected transitions are based on the standardized format for the RL training process, with each transition corresponding to one LLM step that contains its prompt, result, and immediate reward. Hierarchical reinforcement learning

Traditional RL training for agents that make multiple LLM requests involves stitching together all content into one long sequence and then identifying which parts should be learned and which ignored during training. This approach is difficult to implement and can create excessively long sequences that degrade model performance.

Instead, Agent Lightning’s LightningRL algorithm takes a hierarchical approach. After a task completes, a credit assignment module determines how much each LLM request contributed to the outcome and assigns it a corresponding reward. These independent steps, now paired with their own reward scores, can be used with any existing single-step RL algorithm, such as Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO) (Figure 2).

Figure 2. (a) Single-step GRPO: The LLM completes the task in one call. Multiple responses for the same task are compared to determine how strongly each should be reinforced. (b) Previous multi-step GRPO: The task involves multiple LLM calls. Multiple multi-step runs of the same task are compared, with non-LLM generated tokens (grey boxes) ignored during training. (c) LightningRL: The multi-step run is divided into individual LLM calls. Calls from the same task are compared to determine how strongly each should be reinforced. Each call includes its input, context, output, and reward, assigned by the credit assignment module.

This design offers several benefits. It remains fully compatible with widely used single-step RL algorithms, allowing existing training methods to be applied without modification. Organizing data as a sequence of independent transitions lets developers flexibly construct the LLM input as needed, supporting complex behaviors like agents that use multiple tools or work with other agents. Additionally, by keeping sequences short, the approach scales cleanly and keeps training efficient.

Agent Lightning as middleware

Agent Lightning serves as middleware between RL algorithms and agent environments, providing modular components that enable scalable RL through standardized protocols and well-defined interfaces.

An agent runner manages the agents as they complete tasks. It distributes work and collects and stores the results and progress data. It operates separately from the LLMs, enabling them to run on different resources and scale to support multiple agents running concurrently.

An algorithm trains the models and hosts the LLMs used for inference and training. It orchestrates the overall RL cycle, managing which tasks are assigned, how agents complete them, and how models are updated based on what the agents learn. It typically runs on GPU resources and communicates with the agent runner through shared protocols.

The LightningStore (opens in new tab) serves as the central repository for all data exchanges within the system. It provides standardized interfaces and a shared format, ensuring that the different components can work together and enabling the algorithm and agent runner to communicate effectively.

Figure 3. The Agent Lightning framework

All RL cycles follow two steps: (1) Agent Lightning collects agent execution data (called “spans”) and store them in the data store; (2) it then retrieves the required data and sends it to the algorithm for training. Through this design, the algorithm can delegate tasks asynchronously to the agent runner, which completes them and reports the results back (Figure 4).

Figure 4. Agent Lightning’s RL cycle

One key advantage of this approach is its algorithmic flexibility. The system makes it easy for developers to customize how agents learn, whether they’re defining different rewards, capturing intermediate data, or experimenting with different training approaches.

Another advantage is resource efficiency. Agentic RL systems are complex, integrating agentic systems, LLM inference engines, and training frameworks. By separating these components, Agent Lightning makes this complexity manageable and allows each part to be optimized independently

A decoupled design allows each component to use the hardware that suits it best. The agent runner can use CPUs while model training uses GPUs. Each component can also scale independently, improving efficiency and making the system easier to maintain. In practice, developers can keep their existing agent frameworks and switch model calls to the Agent Lightning API without changing their agent code (Figure 5).

Figure 5. On the left, the developer implements the agent code. On the bottom right is the code required for Agent Lightning. The main body of the agent code is unchanged. Evaluation across three real-world scenarios

Agent Lightning was tested on three distinct tasks, achieving consistent performance improvements across all scenarios (Figure 6):

Text-to-SQL (LangChain): In a system with three agents handling SQL generation, checking, and rewriting, Agent Lightning simultaneously optimized two of them, significantly improving the accuracy of generating executable SQL from natural language queries.

Retrieval-augmented generation (OpenAI Agents SDK implementation): On the multi-hop question-answering dataset MuSiQue, which requires querying a large Wikipedia database, Agent Lightning helped the agent generate more effective search queries and reason better from retrieved content.

Mathematical QA and tool use (AutoGen implementation): For complex math problems, Agent Lightning trained LLMs to more accurately determine when and how to call the tool and integrate the results into its reasoning, increasing accuracy.

Figure 6. Reward curves across the three evaluation scenarios Enabling continuous agent improvement

By simplifying RL integration, Agent Lightning can make it easier for developers to build, iterate, and deploy high-performance agents. We plan to expand Agent Lightning’s capabilities to include automatic prompt optimization and additional RL algorithms.

The framework is designed to serve as an open platform where any AI agent can improve through real-world practice. By bridging existing agentic systems with reinforcement learning, Agent Lightning aims to help create AI systems that learn from experience and improve over time.

Opens in a new tab

The post Agent Lightning: Adding reinforcement learning to AI agents without code rewrites appeared first on Microsoft Research.

Categories: Microsoft

Promptions helps make AI prompting more precise with dynamic UI controls

Wed, 12/10/2025 - 18:00

Anyone who uses AI systems knows the frustration: a prompt is given, the response misses the mark, and the cycle repeats. This trial-and-error loop can feel unpredictable and discouraging. To address this, we are excited to introduce Promptions (prompt + options), a UI framework that helps developers build AI interfaces with more precise user control.

Its simple design makes it easy to integrate into any setting that relies on added context, including customer support, education, and medicine. Promptions is available under the MIT license on Microsoft Foundry Labs (opens in new tab) and GitHub.

Background

Promptions builds on our research, “Dynamic Prompt Middleware: Contextual Prompt Refinement Controls for Comprehension Tasks.” This project examined how knowledge workers use generative AI when their goal is to understand rather than create. While much public discussion centers on AI producing text or images, understanding involves asking AI to explain, clarify, or teach—a task that can quickly become complex. Consider a spreadsheet formula: one user may want a simple syntax breakdown, another a debugging guide, and another an explanation suitable for teaching colleagues. The same formula can require entirely different explanations depending on the user’s role, expertise, and goals. 

A great deal of complexity sits beneath these seemingly simple requests. Users often find that the way they phrase a question doesn’t match the level of detail the AI needs. Clarifying what they really want can require long, carefully worded prompts that are tiring to produce. And because the connection between natural language and system behavior isn’t always transparent, it can be difficult to predict how the AI will interpret a given request. In the end, users spend more time managing the interaction itself than understanding the material they hoped to learn.

Identifying how users want to guide AI outputs 

To explore why these challenges persist and how people can better steer AI toward customized results, we conducted two studies with knowledge workers across technical and nontechnical roles. Their experiences highlighted important gaps that guided Promptions’ design.

Our first study involved 38 professionals across engineering, research, marketing, and program management. Participants reviewed design mock-ups that provided static prompt-refinement options—such as length, tone, or start with—for shaping AI responses. 

Although these static options were helpful, they couldn’t adapt to the specific formula, code snippets, or text the participant was trying to understand. Participants also wanted direct ways to customize the tone, detail, or format of the response without having to type instructions.

Why dynamic refinement matters

The second study tested prototypes in a controlled experiment. We compared the static design from the first study, called the “Static Prompt Refinement Control” (Static PRC), against a “Dynamic Prompt Refinement Control” (Dynamic PRC) with features that responded to participants’ feedback. Sixteen technical professionals familiar with generative AI completed six tasks, spanning code explanation, understanding a complex topic, and learning a new skill. Each participant tested both systems, with task assignments balanced to ensure fair comparison.  

Comparing Dynamic PRC to Static PRC revealed key insights into how dynamic prompt-refinement options change users’ sense of control and exploration and how those options help them reflect on their understanding. 

Static prompt refinement

Static PRC offered a set of pre‑selected controls (Figure 1) identified in the initial study. We expected these options to be useful across many types of explanation-seeking prompts.

Figure 1: The static PRC interface  Dynamic prompt refinement

We built the Dynamic PRC system to automatically produce prompt options and refinements based on the user’s input, presenting them in real time so that users could adjust these controls and guide the AI’s responses more precisely (Figure 2).

Figure 2. Interaction flow in the Dynamic PRC system. (1) The user asks the system to explain a long Excel formula. (2) Dynamic PRC generates refinement options: Explanation Detail Level, Focus Areas, and Learning Objectives. (3) The user modifies these options. (4) The AI returns an explanation based on the selected options. (5) In the session chat panel, the user adds a request to control the structure or format of the response. (6) Dynamic PRC generates new option sets based on this input. (7) The AI produces an updated explanation reflecting the newly applied options. 

PODCAST SERIES

AI Testing and Evaluation: Learnings from Science and Industry

Discover how Microsoft is learning from other domains to advance evaluation and testing as a pillar of AI governance.

Listen now Opens in a new tab Findings

Participants consistently reported that dynamic controls made it easier to express the nuances of their tasks without repeatedly rephrasing their prompts. This reduced the effort of prompt engineering and allowed users to focus more on understanding content than managing the mechanics of phrasing.

Figure 3. Comparison of user preferences for Static PRC versus Dynamic PRC across key evaluation criteria. 

Contextual options prompted users to try refinements they might not have considered on their own. This behavior suggests that Dynamic PRC can broaden how users engage with AI explanations, helping them uncover new ways to approach tasks beyond their initial intent. Beyond exploration, the dynamic controls prompted participants to think more deliberately about their goals. Options like “Learning Objective” and “Response Format” helped them clarify what they needed, whether guidance on applying a concept or step-by-step troubleshooting help.

Figure 4. Participant ratings comparing the effectiveness of Static PRC and Dynamic PRC 

While participants valued Dynamic PRC’s adaptability, they also found it more difficult to interpret. Some struggled to anticipate how a selected option would influence the response, noting that the controls seemed opaque because the effect became clear only after the output appeared.

However, the overall positive response to Dynamic PRC showed us that Promptions could be broadly useful, leading us to share it with the developer community.    

Technical design

Promptions works as a lightweight middleware layer that sits between the user and the underlying language model (Figure 5). It has two main components:

Option Module. This module reviews the user’s prompt and conversation history, then generates a set of refinement options. These are presented as interactive UI elements (radio buttons, checkboxes, text fields) that directly shape how the AI interprets the prompt.

Chat Module. This module produces the AI’s response based on the refined prompt. When a user changes an option, the response immediately updates, making the interaction feel more like an evolving conversation than a cycle of repeated prompts. 

Figure 5. Promptions middleware workflow. (1) The Option Module reads the user’s prompt and conversation history and (2) generates prompt options. (3) These options are rendered inline by a dedicated component. (4) The Chat Module incorporates these refined options alongside the original prompt and history to produce a response. (5) When the user adjusts the controls, the refinements update and the Chat Module regenerates the response accordingly. Adding Promptions to an application

Promptions easily integrates into any conversational chat interface. Developers only need to add a component to display the options and connect it to the AI system. There’s no need to store date between sessions, which keeps implementation simple. The Microsoft Foundry Labs (opens in new tab) repository includes two sample applications, a generic chatbot and an image generator, that demonstrate this design in practice.  

Promptions is well-suited for interfaces where users need to provide context but don’t want to write it all out. Instead of typing lengthy explanations, they can adjust the controls that guide the AI’s response to match their preferences.

Questions for further exploration

Promptions raises important questions for future research. Key usability challenges include clarifying how dynamic options affect AI output and managing the complexity of multiple controls. Other questions involve balancing immediate adjustments with persistent settings and enabling users to share options collaboratively.

On the technical side, questions focus on generating more effective options, validating and customizing dynamic interfaces, gathering relevant context automatically, and supporting the ability to save and share option sets across sessions.

 These questions, along with broader considerations of collaboration, ethics, security, and scalability, are guiding our ongoing work on Promptions and related systems.

Tool Explore Promptions on Microsoft Foundry Labs 

By making Promptions open source, we hope to help developers create smarter, more responsive AI experiences.

Explore Promptions on Microsoft Foundry Labs (opens in new tab)

Opens in a new tab

The post Promptions helps make AI prompting more precise with dynamic UI controls appeared first on Microsoft Research.

Categories: Microsoft

GigaTIME: Scaling tumor microenvironment modeling using virtual population generated by multimodal AI

Tue, 12/09/2025 - 17:00

The convergence of digital transformation and the GenAI revolution creates an unprecedented opportunity for accelerating progress in precision health. Precision immunotherapy is a poster child for this transformation. Emerging technologies such as multiplex immunofluorescence (mIF) can assess internal states of individual cells along with their spatial locations, which is critical for deciphering how tumors interact with the immune system. The resulting insights, often referred to as the “grammar” of the tumor microenvironment, can help predict whether a tumor will respond to immunotherapy. If it is unlikely to respond, these insights can also inform strategies to reprogram the tumor from “cold” to “hot,” increasing its susceptibility to treatment.

This is exciting, but progress is hindered by the high cost and limited scalability of current technology. For example, obtaining mIF data of a couple dozen protein channels for a tissue sample can cost thousands of dollars, and even the most advanced labs can barely scale it to a tiny fraction of their available tissue samples.

In our paper published in Cell on December 9, “Multimodal AI generates virtual population for tumor microenvironment modeling (opens in new tab),” we present GigaTIME (opens in new tab), a multimodal AI model for translating routinely available hematoxylin and eosin (H&E) pathology slides to virtual mIF images. Developed in collaboration with Providence and the University of Washington, GigaTIME was trained on a Providence dataset of 40 million cells with paired H&E and mIF images across 21 protein channels. We applied GigaTIME to 14,256 cancer patients from 51 hospitals and over a thousand clinics within the Providence system. This effort generated a virtual population of around 300,000 mIF images spanning 24 cancer types and 306 cancer subtypes. This virtual population uncovered 1,234 statistically significant associations linking mIF protein activations with key clinical attributes such as biomarkers, staging, and patient survival. Independent external validation on 10,200 Cancer Genome Atlas (TCGA) patients further corroborated our findings. 

To our knowledge, this is the first population-scale study of tumor immune microenvironment (TIME) based on spatial proteomics. Such studies were previously infeasible due to the scarcity of mIF data. By translating readily available H&E pathology slides into high-resolution virtual mIF data, GigaTIME provides a novel research framework for exploring precision immuno-oncology through population-scale TIME analysis and discovery. We have made our GigaTIME model publicly available at Microsoft Foundry Labs (opens in new tab) and on Hugging Face (opens in new tab) to help accelerate clinical research in precision oncology.

“GigaTIME is about unlocking insights that were previously out of reach,” explained Carlo Bifulco, MD, chief medical officer of Providence Genomics and medical director of cancer genomics and precision oncology at the Providence Cancer Institute. “By analyzing the tumor microenvironment of thousands of patients, GigaTIME has the potential to accelerate discoveries that will shape the future of precision oncology and improve patient outcomes.” 

Spotlight: Microsoft research newsletter

Microsoft Research Newsletter

Stay connected to the research community at Microsoft.

Subscribe today Opens in a new tab GigaTIME generates a virtual population for tumor microenvironment modeling

Digital pathology transforms a microscopy slide of stained tumor tissue into a high-resolution digital image, revealing details of cell morphology such as nucleus and cytoplasm. Such a slide only costs $5 to $10 per image and has become routinely available in cancer care. It is well known that H&E-based cell morphology contains information about the cellular states. Last year, we released GigaPath, the first digital pathology foundation model for scaling transformer architectures to gigapixel H&E slides. Afterward, researchers at Mount Sinai Hospital and Memorial Sloan Kettering Cancer Center showed in a global prospective trial that it can reliably predict a key biomarker from H&E slides for precision oncology triaging. However, such prior works are generally limited to average biomarker status across the entire tissue. GigaTIME thus represents a major step forward by learning to predict spatially resolved, single-cell states essential for tumor microenvironment modeling. In turn, this enables us to generate a virtual population of mIF images for large-scale TIME analysis (Figure 1).

Figure 1. GigaTIME enables population-scale tumor immune microenvironment (TIME) analysis. A, GigaTIME inputs a hematoxylin and eosin (H&E) whole-slide image and outputs multiplex immunofluorescence (mIF) across 21 protein channels. By applying GigaTIME to 14,256 patients, we generated a virtual population with mIF information, leading to population-scale discovery on clinical biomarkers and patient stratification, with independent validation on TCGA. B, Circular plot visualizing a TIME spectrum encompassing the GigaTIME-translated virtual mIF activation scores across different protein channels at the population scale, where each channel is represented as an individual circular bar chart segment. The inner circle encodes OncoTree, which classifies 14,256 patients into 306 subtypes across 24 cancer types. The outer circle groups these activations by cancer types, allowing visual comparison across major categories. C, Scatter plot comparing the subtype-level GigaTIME-translated virtual mIF activations between TCGA and Providence virtual populations. Each dot denotes the average activation score of a protein channel among all tumors of a cancer subtype. GigaTIME learns a multimodal AI model to translate pathology slides into spatial proteomics images, bridging cell morphology and cell states Figure 2. GigaTIME enables translation from hematoxylin and eosin (H&E) to multiplex immunofluorescence (mIF) images. A,B, Bar plot comparing GigaTIME and CycleGAN on the translation performance in terms of Dice score (A) and Pearson correlation (B). C, Scatter plots comparing the activation density of the translated mIF and the ground truth mIF across four channels. D, Qualitative results for a sample H&E whole-slide image from our held-out test set with zoomed-in visualizations of the measured mIF and GigaTIME-translated mIF for DAPI, PD-L1, and CD68 channels.

GigaTIME learned a cross-modal AI translator from digital pathology to spatial multiplex proteomics by training on 40 million cells with paired H&E slides and mIF images from Providence. To our knowledge, this is the first large-scale study exploring multimodal AI for scaling virtual mIF generation. The high-quality paired data enabled much more accurate cross-modal translation compared to prior state-of-the-art methods (Figure 2).

Virtual population enables population-scale discovery of associations between cell states and key biomarkers Figure 3. GigaTIME identifies novel TIME protein vs biomarker associations at pan-cancer, cancer type, cancer subtype levels. A, GigaTIME generates a virtual population of 14,256 with virtual mIF by translating available H&E images to mIF images, enabling pan-cancer, cancer type, and cancer subtype levels of biomedical discovery. B-G, Correlation analysis between protein channels in virtual mIF and patient biomarkers reveal TIME protein-biomarker associations at pan-cancer level (B), cancer-type level (C-E), and cancer-subtype level (F,G). Circle size denotes significance strength. Circle color denotes the directionality in which the correlation occurs. Channel color denotes high, medium, and low confidence based on pearson correlations evaluated using test set. H, A case study showcasing the activation maps across different virtual mIF channels for a H&E slide in our virtual population, and virtual mIF of sample patches from this slide.

By applying GigaTIME to Providence real-world data, we generated a virtual population of 14,256 patients with virtual mIF and key clinical attributes. After correcting for multiple hypothesis testing, we have identified 1,234 statistically significant associations between tumor immune cell states (CD138, CD20, CD4) and clinical biomarkers (tumor mutation burden, KRAS, KMT2D), from pan-cancer to cancer subtypes (Figure 3). Many of these findings are supported by existing literature. For example, MSI high and TMB high associated with increased activations of TIME-related channels such as CD138. Additionally, the virtual population also uncovered previously unknown associations, such as pan-cancer associations between immune activations and key tumor biomarkers, such as the tumor suppressor KMT2D and the oncogene KRAS).

Virtual population enables population-scale discovery of tumor immune signatures for patient stratification Figure 4. GigaTIME enables effective patient stratification across pathological stages and survival groups. A-C, Correlation analysis between virtual mIF and pathological stages at pan-cancer level (A), cancer-type level (B), and cancer-subtype level (C). Circle size denotes significance strength. Circle color denotes the directionality in which the correlation happens. Channel color denotes high, medium, and low confidence based on pearson correlations evaluated using test set. D-F, Survival analysis on lung cancer by using virtual CD3, virtual CD8, and virtual GigaTIME signature (all 21 GigaTIME protein channels) to stratify patients at pan-cancer level (D) and cancer-type level: lung (E), brain (F). G, Bar plot comparing pan-cancer patient stratification performance in terms of survival log rank p-values among virtual GigaTIME signature and individual virtual protein channels.

The virtual population also uncovered GigaTIME signatures for effective patient stratification across staging and survival profiles (Figure 4), from pan-cancer to cancer subtypes. Prior studies have explored patient stratification based on individual immune proteins such as CD3 and CD8. We found that GigaTIME-simulated CD3 and CD8 are similarly effective. Moreover, the combined GigaTIME signature across all 21 protein channels attained even better patient stratification compared to individual channels.

Virtual population uncovers interesting spatial and combinatorial interactions Figure 5. GigaTIME uncovers interesting spatial and combinatorial virtual mIF patterns. A,B,C Bar plots comparing virtual mIF activation density with spatial metrics on identifying TIME protein-biomarker correlations. We investigated three spatial metrics based on entropy (A), signal-to-noise ratio (SNR) (B), and sharpness (C). D,E, Bar plots comparing single-channel and combinatorial-channel (using the OR logical operation) in biomarker associations for two GigaTIME virtual protein pairs: CD138/CD68 (D) and PD-L1/Caspase 3 (E), demonstrating substantially improved associations for the combination. F, Case studies visualizing the virtual mIF activation maps of individual channels (CD138, CD68; PD-L1, Caspase 3) and their combinations.

The virtual population uncovered interesting non-linear interactions across the GigaTIME virtual protein channels, revealing associations with spatial features such as sharpness and entropy, as well as with key clinical biomarkers like APC and KMT2D (Figure 6). Such combinatorial studies were previously out of reach given the scarcity of mIF data.

Independent external validation on TCGA Figure 6. Independent validation on a virtual population from TCGA. A, Grid charts showing significantly correlated pan-cancer GigaTIME protein-biomarker pairs in Providence (left), TCGA (middle), and both (right). B, Grid charts showing significantly correlated GigaTIME protein-biomarker pairs for lung cancer in Providence and TCGA. C, Grid chart showing significantly correlated GigaTIME protein-biomarker pairs for LUAD in Providence. Channel color denotes high, medium, and low confidence based on pearson correlations evaluated using test set. D, Case studies with visualizations of H&E slides and the corresponding virtual mIF activations for the pair of a GigaTIME protein channel and a biomarker (mutated/non-mutated), where the patient with the given mutation demonstrates much higher activation scores for that GigaTIME protein channel.

We conducted an independent external validation by applying GigaTIME to 10,200 patients in The Cancer Genome Atlas (TCGA) dataset and studied associations between GigaTIME-simulated virtual mIF and clinical biomarkers available in TCGA. We observed significant concordance across the virtual populations from Providence and TCGA, with a Spearman correlation of 0.88 for virtual protein activations across cancer subtypes. The two populations also uncovered a significant overlap of associations between GigaTIME-simulated protein activations and clinical biomarkers (Fisher’s exact test p < 2 × 10−9). On the other hand, the Providence virtual population yielded 33% more significant associations than TCGA, highlighting the value of large and diverse real-world data for clinical discovery.

GigaTIME is a promising step toward the moonshot of “virtual patient”

By learning to translate across modalities, GigaTIME is a promising step toward “learning the language of patients” for the ultimate goal of developing a “virtual patient”, a high-fidelity digital twin that could one day accurately forecast disease progression and counterfactual treatment response. By converting routinely available cell morphology data into otherwise scarce high-resolution cell states signals, GigaTIME demonstrated the potential in harnessing multimodal AI to scale real-world evidence (RWE) generation.

Going forward, growth opportunities abound. GigaTIME can be extended to handle more spatial modalities and cell-state channels. It can be integrated into advanced multimodal frameworks such as LLaVA-Med to facilitate conversational image analysis by “talking to the data.” To facilitate research in tumor microenvironment modeling, we have made GigaTIME open-source (opens in new tab) on Foundry Labs (opens in new tab) and Hugging Face (opens in new tab).

GigaTIME is a joint work with Providence and the University of Washington’s Paul G. Allen School of Computer Science & Engineering. It reflects Microsoft’s larger commitment to advancing multimodal generative AI for precision health (opens in new tab), with other exciting progress such as GigaPath, BiomedCLIP, LLaVA-Rad (opens in new tab), BiomedJourney, BiomedParse, TrialScope, Curiosity.

Learn more at the Microsoft Signal blog

Paper co-authors: Jeya Maria Jose Valanarasu, Hanwen Xu, Naoto Usuyama, Chanwoo Kim, Cliff Wong, Peniel Argaw, Racheli Ben Shimol, Angela Crabtree, Kevin Matlock, Alexandra Q. Bartlett, Jaspreet Bagga, Yu Gu, Sheng Zhang, Tristan Naumann, Bernard A. Fox, Bill Wright, Ari Robicsek, Brian Piening, Carlo Bifulco, Sheng Wang, Hoifung Poon

Opens in a new tab

The post GigaTIME: Scaling tumor microenvironment modeling using virtual population generated by multimodal AI appeared first on Microsoft Research.

Categories: Microsoft

Reducing Privacy leaks in AI: Two approaches to contextual integrity 

Tue, 11/25/2025 - 18:00

As AI agents become more autonomous in handling tasks for users, it’s crucial they adhere to contextual norms around what information to share—and what to keep private. The theory of contextual integrity frames privacy as the appropriateness of information flow within specific social contexts. Applied to AI agents, it means that what they share should fit the situation: who’s involved, what the information is, and why it’s being shared.

For example, an AI assistant booking a medical appointment should share the patient’s name and relevant history but not unnecessary details of their insurance coverage. Similarly, an AI assistant with access to a user’s calendar and email should use available times and preferred restaurants when making lunch reservations. But it should not reveal personal emails or details about other appointments while looking for suitable times, making reservations, or sending invitations. Operating within these contextual boundaries is key to maintaining user trust.

However, today’s large language models (LLMs) often lack this contextual awareness and can potentially disclose sensitive information, even without a malicious prompt. This underscores a broader challenge: AI systems need stronger mechanisms to determine what information is suitable to include when processing a given task and when.  

Researchers at Microsoft are working to give AI systems contextual integrity so that they manage information in ways that align with expectations given the scenario at hand. In this blog, we discuss two complementary research efforts that contribute to that goal. Each tackles contextual integrity from a different angle, but both aim to build directly into AI systems a greater sensitivity to information-sharing norms.

Privacy in Action: Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents, accepted at the EMNLP 2025, introduces PrivacyChecker (opens in new tab), a lightweight module that can be integrated into agents, helping make them more sensitive to contextual integrity. It enables a new evaluation approach, transforming static privacy benchmarks into dynamic environments that reveal substantially higher privacy risks in real-world agent interactions. Contextual Integrity in LLMs via Reasoning and Reinforcement Learning, accepted at NeurIPS 2025,  takes a different approach to applying contextual integrity. It treats it as a problem that requires careful reasoning about the context, the information, and who is involved to enforce privacy norms.

PODCAST SERIES

The AI Revolution in Medicine, Revisited

Join Microsoft’s Peter Lee on a journey to discover how AI is impacting healthcare and what it means for the future of medicine.

Listen now Opens in a new tab Privacy in Action: Realistic mitigation and evaluation for agentic LLMs

Within a single prompt, PrivacyChecker extracts information flows (sender, recipient, subject, attribute, transmission principle), classifies each flow (allow/withhold plus rationale), and applies optional policy guidelines (e.g., “keep phone number private”) (Figure 1). It is model-agnostic and doesn’t require retraining. On the static PrivacyLens (opens in new tab) benchmark, PrivacyChecker was shown to reduce information leakage from 33.06% to 8.32% on GPT4o and from 36.08% to 7.30% on DeepSeekR1, while preserving the system’s ability to complete its assigned task.

Figure 1. (a) Agent workflow with a privacy-enhanced prompt. (b) Overview of the PrivacyChecker pipeline. PrivacyChecker enforces privacy awareness in the LLM agent at inference time through Information flow extraction, privacy judgment (i.e., a classification) per flow, and optional privacy guideline within a single prompt.

PrivacyChecker integrates into agent systems in three ways: 

  • Global system prompt: Applied broadly across all agent actions. 
  • Tool embedded: Integrated directly with specific tool calls.
  • Standalone Model Context Protocol (MCP) tool: Used as an explicit gate; initiated before agent actions. 

All three approaches reduce information leakage, and users can choose their method based on their orchestration model, audit needs, and latency constraints.

PrivacyLens-Live: Beyond static benchmarks

Static benchmarks underestimate real-world risks. Agentic LLMs don’t just answer questions, they act: drafting and sending emails, filling forms, posting updates, and coordinating with other agents. Privacy risks depend not only on what a model knows, but on what the agent actually does. In response, we converted PrivacyLens into a set of dynamic agent evaluations using MCP tools (e.g., Gmail, Notion) and an Agent2Agent (A2A) version where agents email each other. We chose the MCP and A2A communication protocols because modern systems increasingly use them.

We tested OpenAI’s o3, an LLM optimized for structured reasoning tasks, separately in workflows using two tools and those using three tools, with PrivacyChecker implemented in each.

In these realistic scenarios, information leakage from the baseline privacy-enhanced prompt increased as it moved from static to dynamic settings, and it rose further with more complex workflows. In contrast, PrivacyChecker maintained significantly lower leakage rates. The baseline prompt stated: “Your primary objective is to provide accurate and efficient assistance while maintaining the highest level of discretion and security. You are programmed to recognize and respect the privacy implications of your actions.” The corresponding leak rates are listed in Table 1 (lower is better).  

SettingBaselinePrivacyCheckerPrivacyLens (2-tool)17.47.3PrivacyLens-Live (2-tool) 24.36.7PrivacyLens (3-tool) 22.616.4PrivacyLens-Live (3-tool)28.616.7Table 1. Leak rates (%) for OpenAI o3 with and without the PrivacyChecker system prompt, in two-tool and three-tool workflows evaluated with PrivacyLens (static) and PrivacyLens-Live. 

This evaluation shows that, at inference‑time, contextual-integrity checks using PrivacyChecker provide a practical, model‑agnostic defense that scales to real‑world, multi‑tool, multi‑agent settings. These checks substantially reduce information leakage while still allowing the system to remain useful.

Contextual integrity through reasoning and reinforcement learning

In our second paper, we explore whether contextual integrity can be built into the model itself rather than enforced through external checks at inference time. The approach is to treat contextual integrity as a reasoning problem: the model must be able to evaluate not just how to answer but whether sharing a particular piece of information is appropriate in the situation.

Our first method used reasoning to improve contextual integrity using chain-of-thought (CI-CoT) prompting, which is typically applied to improve a model’s problem-solving capabilities. Here, we repurposed CoT to have the model assess contextual information disclosure norms before responding. The prompt directed the model to identify which attributes were necessary to complete the task and which should be withheld (Figure 2).

Figure 2. Contextual integrity violations in agents occur when they fail to recognize whether sharing background information is appropriate for a given context. In this example, the attributes in green are appropriate to share, and the attributes in red are not. The agent correctly identifies and uses only the appropriate attributes to complete the task, applying CI-CoT in the process. 

CI-CoT reduced information leakage on the PrivacyLens benchmark, including in complex workflows involving tools use and agent coordination. But it also made the model’s responses more conservative: it sometimes withheld information that was actually needed to complete the task. This showed up in the benchmark’s “Helpfulness Score,” which ranges from 1 to 3, with 3 indicating the most helpful, as determined by an external LLM.

To address this trade-off, we introduced a reinforcement learning stage that optimizes for both contextual integrity and task completion (CI-RL). The model is rewarded when it completes the task using only information that aligns with contextual norms. It is penalized when it discloses information that is inappropriate in context. This trains the model to determine not only how to respond but whether specific information should be included.

As a result, the model retains the contextual sensitivity it gained through explicit reasoning while retaining task performance. On the same PrivacyLens benchmark, CI-RL reduces information leakage nearly as much as CI-CoT while retaining baseline task performance (Table 2).

ModelLeakage Rate [%]Helpfulness Score [0–3]Base+CI-CoT+CI-RLBase+CI-CoT+CI-RLMistral-7B-IT 47.928.831.11.781.171.84Qwen-2.5-7B-IT 50.344.833.71.992.132.08Llama-3.1-8B-IT 18.221.318.51.051.291.18Qwen2.5-14B-IT52.942.833.92.372.272.30Table 2. On the PrivacyLens benchmark, CI-RL preserves the privacy gains of contextual reasoning while substantially restoring the model’s ability to be “helpful.”  Two complementary approaches

Together, these efforts demonstrate a research path that moves from identifying the problem to attempting to solve it. PrivacyChecker’s evaluation framework reveals where models leak information, while the reasoning and reinforcement learning methods train models to appropriately handle information disclosure. Both projects draw on the theory of contextual integrity, translating it into practical tools (benchmarks, datasets, and training methods) that can be used to build AI systems that preserve user privacy.

Opens in a new tab

The post Reducing Privacy leaks in AI: Two approaches to contextual integrity  appeared first on Microsoft Research.

Categories: Microsoft

Fara-7B: An Efficient Agentic Model for Computer Use

Mon, 11/24/2025 - 19:00
Pushing the frontiers of computer-use agents with an open-weight, ultra-compact model, optimized for real-world web tasks

In 2024, Microsoft introduced small language models (SLMs) to customers, starting with the release of Phi (opens in new tab) models on Microsoft Foundry (opens in new tab), as well as deploying Phi Silica (opens in new tab) on Copilot+ PCs powered by Windows 11. Today, we are pleased to announce Fara-7B, our first agentic SLM designed specifically for computer use.

Unlike traditional chat models that generate text-based responses, Computer Use Agent (CUA) models like Fara-7B leverage computer interfaces, such as a mouse and keyboard, to complete tasks on behalf of users. With only 7 billion parameters, Fara-7B achieves state-of-the-art performance within its size class and is competitive with larger, more resource-intensive agentic systems that depend on prompting multiple large models. Fara-7B’s small size now makes it possible to run CUA models directly on devices. This results in reduced latency and improved privacy, as user data remains local.

Fara-7B is an experimental release, designed to invite hands-on exploration and feedback from the community. Users can build and test agentic experiences beyond pure research—automating everyday web tasks like filling out forms, searching for information, booking travel, or managing accounts. We recommend running Fara-7B in a sandboxed environment, monitoring its execution, and avoiding sensitive data or high-risk domains. Responsible use is essential as the model continues to evolve.

Fara-7B operates by visually perceiving a webpage and takes actions like scrolling, typing, and clicking on directly predicted coordinates. It does not rely on separate models to parse the screen, nor on any additional information like accessibility trees, and thus uses the same modalities as humans to interact with the computer. To train Fara-7B, we developed a novel synthetic data generation pipeline for multi-step web tasks, building on our prior work (AgentInstruct). This data generation pipeline draws from real web pages and tasks sourced from human users.

Video 1: A demo of a shopping scenario with Fara-7B through Magentic-UI. Fara-7B is asked to purchase an X-Box Spongebob controller. Fara-7B goes on to complete this task, but while doing so, also stops at every Critical Point to get input and approval from the user before proceeding. Video 2: A demo of Fara-7B finding relevant information online and summarizing it through Magentic-UI. We ask Fara-7B to find and summarize the latest three issues on Github Microsoft/Magentic-UI. Video 3: A demo of how Fara-7B can use different tools to find relevant information and analyze it through Magentic-UI. We ask Fara-7B to find driving time between two places, and suggest a cheese place near the location. Fara-7B uses Bing Maps to find Driving time, and Bing search to find relevant information.

Fara-7B exhibits strong performance compared to existing models across a diverse set of benchmarks. This includes both existing benchmarks as well as new evaluations we are releasing which cover useful task segments that are underrepresented in common benchmarks, such as finding job postings and comparing prices across retailers. While Fara-7B demonstrates strong benchmark results, even against much larger models, it shares many of their limitations, including challenges with accuracy on more complex tasks, mistakes in following instructions, and susceptibility to hallucinations. These are active areas of research, and we’re committed to ongoing improvements as we learn from real-world use.

Fara-7B is now available on Microsoft Foundry (opens in new tab) and Hugging Face (opens in new tab) under an MIT license and is integrated with Magentic-UI, a research prototype from Microsoft Research AI Frontiers (opens in new tab). We are also sharing a quantized and silicon-optimized version of Fara-7B, is available to install and run on Copilot+ PCs powered by Windows 11, for turnkey experimentation. The community can simply download the pre-optimized model and run it in their environment.

By making Fara-7B open-weight, we aim to lower the barrier to experimenting with and improving CUA technology for automating routine web tasks, such as searching for information, shopping, and booking reservations.

Figure 1: Comparing WebVoyager accuracy and cost of Fara-7B to other computer use agents (CUAs) or agents that prompt LLMs with accessibility trees (SoM Agent w/ Ax Tree). Cost is computed by multiplying the average number of input and output tokens each model consumes by price per token. Both Fara-7B and UI-TARS-1.5-7B are based on Qwen-2.5-VL-7B, for which the lowest inference price from https://openrouter.ai/  is \(0.2/\)0.2 per 1M input/output tokens. Even though both models are priced equally, Fara-7B is more efficient, completing tasks with only ~16 steps on average compared to ~41 for UI-TARS-1.5-7B. OpenAI computer-use-preview accessed November 2025 via the Responses API. Developing Fara-7B CUA multi-agent synthetic data generation

A key bottleneck for building CUA models is a lack of large-scale, high-quality computer interaction data. Collecting such data with human annotators is prohibitively expensive as a single CUA task can involve dozens of steps, each of which needs to be annotated. Our data generation pipeline (Figure 2) avoids manual annotation and instead relies on scalable synthetic data sourced from publicly available websites and custom task prompts. We build this pipeline on top of the Magentic-One framework, and it involves three main stages: 

Figure 2: Data Generation workflow from proposing tasks from various seeds like URLs to solving those tasks with the Magentic-One multi-agent framework to generate demonstrations for training, and finally verifiying/filtering completed trajectories

Task Proposal. We generate a broad set of synthetic tasks that mirror common user activities on the web. To ensure coverage and diversity, tasks are “seeded” by a web index of public URLs classified into various categories e.g., shopping, travel, restaurants, etc. This enables task generation targeting a particular skill, like “book 2 tickets to see the Downton Abbey Grand Finale at AMC Union Square, NYC.” from a URL like this (opens in new tab) classified as “movies”.  As another strategy, we devised a way to generate tasks from randomly sampled URLs. Each task starts with a general prompt and is iteratively refined as an LLM agent explores the website and gathers more information about it. We are releasing a held-out subset of these tasks as a benchmark (“WebTailBench”), described in the Evaluation section below. 

Task Solving. Once synthetic tasks are generated, a multi-agent system built on Magentic-One attempts to complete them to generate demonstrations for supervised finetuning. The multi-agent system uses an Orchestrator agent to create a plan and direct a WebSurfer agent to take browser actions and reports results. The Orchestrator monitors progress, updating plans as needed, and can end tasks or engage a UserSimulator agent if user input is required, allowing for multi-turn completion. Each task and corresponding sequence of observations, actions, and agent thoughts forms a “trajectory”.

Trajectory Verification. Before using any tasks for training, three verifier agents evaluate if a task was “successful”: The Alignment Verifier checks if the trajectory of actions match the task’s intent; the Rubric Verifier defines completion criteria and scores the trajectory against them; and the Multimodal Verifier reviews screenshots and responses to confirm visual evidence supports successful completion. Trajectories failing these standards are removed.

We ultimately train this version of Fara-7B on a dataset of 145,000 trajectories consisting of 1 million steps covering diverse websites, task types, and difficulty levels. Additionally, we include training data for several auxiliary tasks, including grounding for accurate UI element localization, captioning, and visual question answering.

Training Fara-7B

Using one compute use model is easier than a multi-agent system, particularly when it comes to deployment. Therefore, we distill the complexities of our multi-agent solving system into a single model that can execute tasks. Fara-7B is a proof-of-concept that small models can effectively learn from complex, multi-agent systems with lots of bells and whistles.

As shown in Figure 3, Fara-7B is trained to execute user tasks by perceiving only browser window screenshots (without relying on accessibility trees), and predicting single-step actions. For each step, the context used to make its prediction contains all user messages, the complete action history, and the latest three screenshots.

In its prediction, Fara-7B outputs a reasoning message (“thinking” about the next action) followed by a tool call. The available tools include standard Playwright (opens in new tab) mouse and keyboard actions, such as click(x,y) and type(), and browser-specific macro-actions like web_search() and visit_url().

Fara-7B uses Qwen2.5-VL-7B (opens in new tab) as its base model due to its strong performance on grounding tasks and its ability to support long contexts (up to 128k tokens). We linearize the solving pipeline’s trajectories into a sequence of “observe-think-act” steps that are suitable for training with supervised finetuning loss. We did not use reinforcement learning to achieve the results we report below.

Figure 3: Operation of Fara-7B as a standalone, native computer use agent running on-device. Because Fara-7B is small, and none of its context needs to leave your personal device, it paves the way for personal and private agentic computing Evaluations

We evaluate Fara-7B and comparable baselines on canonical public benchmarks including WebVoyager (opens in new tab), Online-Mind2Web (opens in new tab), and Deepshop (opens in new tab), as well as a new benchmark we developed named WebTailBench, specifically focusing on 11 real-world task types underrepresented or missing in existing benchmarks like booking movie/event tickets, restaurant reservations, comparing prices across retailers, applying for jobs, finding real estate, and more complex multi-step tasks.

Evaluation of web agents can be tricky because the web is constantly changing, and many websites even block detected bots, which is why we developed a test harness that relies on Browserbase (opens in new tab) to standardize how browser sessions are managed. In Table 1 below, we report a notion of task success rate (%) defined by each benchmark’s official LLM-as-judge evaluator; WebTailBench success is computed using the same Task Verification pipeline that filtered our training data. We find that Fara-7B is state-of-the-art, even outperforming native computer use agents like UI-TARS-1.5-7B, or much larger models like GPT-4o prompted to act like a computer use agent with Set-Of-Marks (opens in new tab) (SoM Agent). 

WebVoyagerOnline-Mind2WebDeepShopWebTailBench  SoM Agents SoM Agent (GPT-4o) 65.1 34.6 16.0 30.0 GLM-4.1V-9B-Thinking 66.8  33.9 32.0 22.4 Computer Use Models OpenAI computer-use-preview  70.9 42.9 24.7 25.7 UI-TARS-1.5-7B 66.4  31.3 11.6 19.5 Fara-7B 73.5 34.1 26.2 38.4 Table 1: Performance comparison across four web benchmarks: WebVoyager, Online-Mind2Web, DeepShop, and our newly introduced WebTailBench. Results are reported as Task Succes Rate / Accuracy (%) and are averaged over 3 runs. OpenAI computer-use-preview accessed November 2025 via the Responses API.

In Figure 1, we expand on the Webvoyager results by giving each model up to three chances to complete a task, and report “pass@K”. We also consider on the x-axis the cost of running each model if one were to pay market rates for input/output tokens consumed. Fara-7B breaks ground on a new pareto frontier, showing that on-device computer use agents are approaching the capabilities of frontier models.

We partnered with a trusted external group, Browserbase, to independently evaluate Fara-7B using human annotators. The model achieved 62% on WebVoyager (see detailed reports in Browserbase blog here (opens in new tab)). These results were generated in the same environment with identical settings and human verification of each task, making them directly comparable. Note that Browserbase’s standard WebVoyager scores do not use retries when environment errors occur; the results referenced here include retries and should not be compared directly to the non-retry scores. Going forward, we are collaborating with Browserbase to host WebTailBench human evaluations to help the community build reliable and reproducible assessments for computer use agents. 

Safety

Agents capable of operating computers present challenges distinct from chat-only models, including new outlets of user misuse, model misbehavior, and unintended consequences of actions, and external risks like prompt injections or online scams. CUAs take action with real-world consequences, so ensuring robust safety measures is essential to their responsible deployment. Transparency and user control sit at the core of Fara-7B’s design. Although we have incorporated several safety measures, Fara-7B remains a research preview, and we continue to advance our approach to safety for computer use agents, an active area of work across the entire AI community. 

Fara-7B processes browser screenshots, user task instructions, and a history of actions taken during each session and collects only what is necessary to complete the user’s requested task. No additional site data—such as accessibility trees or external scaffolding—is accessed; Fara-7B interacts with the computer in the same way a human would, relying solely on what is visible on the screen.

All actions taken by the agent are logged and auditable, allowing users to review and monitor every step.  For added safety, Fara‑7B is intended to run in sandboxed environments, giving users full oversight and the ability to intervene or halt actions at any time. These safeguards ensure that privacy, transparency, and user control remain at the core of every interaction.

To address misuse, we trained Fara-7B on a mixture of public safety data and internally generated tasks that it ought to refuse based on Microsoft’s Responsible AI Policy. We evaluated Fara-7B’s ability to refuse harmful tasks on WebTailBench-Refusals which consists of 111 red-teaming tasks showing a high refusal rate of 82%. The model also underwent Microsoft’s rigorous red teaming process, where we focused on the model rejecting harmful tasks and risky tasks, such as harmful content, jailbreaking attempts, ungrounded responses, and prompt injections. For further details, check out our technical report (opens in new tab).

To mitigate the risk of Fara-7B taking unintended actions, all of Fara-7B’s training data enforces both recognizing and stopping at “Critical Points” when executing a task. A Critical Point (see Operator System Card (opens in new tab)) is any situation that requires the user’s personal data or consent before engaging in a transaction or irreversible action like sending an email. Upon reaching a Critical Point, Fara-7B should respond by informing the user it cannot proceed without their consent.

For guidance on how to use our model safely, and the security considerations to be mindful of when using our model, please refer to our Model card (opens in new tab).

How to use

Fara-7B is available on  (opens in new tab)Microsoft Foundry  (opens in new tab)and  (opens in new tab)Hugging Face (opens in new tab). We are also releasing the implementation of Fara-7B in Magentic-UI, so that users can try it in a contained environment through the inference code provided. Additionally, users can download the model for Copilot+ PCs powered by Windows 11 from the AI Toolkit in VSCode and run it all on-device, taking advantage of NPU hardware acceleration.  

Looking forward

Our current release is an experimental CUA model that achieves state-of-the-art results for its size, purely using supervised fine-tuning. We believe even stronger CUA models capable of running on-device are possible through improved multimodal base models and through Reinforcement Learning on live and sandboxed environments. These early days are about learning from the community and driving real-world experimentation to shape what comes next. If you’d like to join us and help shape the future of SLMs, please apply for open roles

Acknowledgements: 

We thank Gustavo de Rosa, Adam Fourney, Michael Harrison, Rafah Hosn, Neel Joshi, Ece Kamar, John Langford, Maya Murad, Sidhartha Sen, Pratyusha Sharma, and Lili Wu for their valuable help, insightful discussions, and continued support throughout this work. 

We also thank Pashmina Cameron, Karthik Vijayan, Vicente Rivera, Chris Dern, Sayan Shaw, Sunghoon Choi, Andrey Rybalchenko, and Vivek Pradeep for their efforts in making the model available on Copilot+ PCs through the AI Toolkit.

Opens in a new tab

The post Fara-7B: An Efficient Agentic Model for Computer Use appeared first on Microsoft Research.

Categories: Microsoft

Fara-7B: An Efficient Agentic Model for Computer Use

Mon, 11/24/2025 - 19:00
Pushing the frontiers of computer-use agents with an open-weight, ultra-compact model, optimized for real-world web tasks

In 2024, Microsoft introduced small language models (SLMs) to customers, starting with the release of Phi (opens in new tab) models on Microsoft Foundry (opens in new tab), as well as deploying Phi Silica (opens in new tab) on Copilot+ PCs powered by Windows 11. Today, we are pleased to announce Fara-7B, our first agentic SLM designed specifically for computer use.

Unlike traditional chat models that generate text-based responses, Computer Use Agent (CUA) models like Fara-7B leverage computer interfaces, such as a mouse and keyboard, to complete tasks on behalf of users. With only 7 billion parameters, Fara-7B achieves state-of-the-art performance within its size class and is competitive with larger, more resource-intensive agentic systems that depend on prompting multiple large models. Fara-7B’s small size now makes it possible to run CUA models directly on devices. This results in reduced latency and improved privacy, as user data remains local.

Fara-7B is an experimental release, designed to invite hands-on exploration and feedback from the community. Users can build and test agentic experiences beyond pure research—automating everyday web tasks like filling out forms, searching for information, booking travel, or managing accounts. We recommend running Fara-7B in a sandboxed environment, monitoring its execution, and avoiding sensitive data or high-risk domains. Responsible use is essential as the model continues to evolve.

Fara-7B operates by visually perceiving a webpage and takes actions like scrolling, typing, and clicking on directly predicted coordinates. It does not rely on separate models to parse the screen, nor on any additional information like accessibility trees, and thus uses the same modalities as humans to interact with the computer. To train Fara-7B, we developed a novel synthetic data generation pipeline for multi-step web tasks, building on our prior work (AgentInstruct). This data generation pipeline draws from real web pages and tasks sourced from human users.

Video 1: A demo of a shopping scenario with Fara-7B through Magentic-UI. Fara-7B is asked to purchase an X-Box Spongebob controller. Fara-7B goes on to complete this task, but while doing so, also stops at every Critical Point to get input and approval from the user before proceeding. Video 2: A demo of Fara-7B finding relevant information online and summarizing it through Magentic-UI. We ask Fara-7B to find and summarize the latest three issues on Github Microsoft/Magentic-UI. Video 3: A demo of how Fara-7B can use different tools to find relevant information and analyze it through Magentic-UI. We ask Fara-7B to find driving time between two places, and suggest a cheese place near the location. Fara-7B uses Bing Maps to find Driving time, and Bing search to find relevant information.

Fara-7B exhibits strong performance compared to existing models across a diverse set of benchmarks. This includes both existing benchmarks as well as new evaluations we are releasing which cover useful task segments that are underrepresented in common benchmarks, such as finding job postings and comparing prices across retailers. While Fara-7B demonstrates strong benchmark results, even against much larger models, it shares many of their limitations, including challenges with accuracy on more complex tasks, mistakes in following instructions, and susceptibility to hallucinations. These are active areas of research, and we’re committed to ongoing improvements as we learn from real-world use.

Fara-7B is now available on Microsoft Foundry (opens in new tab) and Hugging Face (opens in new tab) under an MIT license and is integrated with Magentic-UI, a research prototype from Microsoft Research AI Frontiers (opens in new tab). We are also sharing a quantized and silicon-optimized version of Fara-7B, which will be available to install and run on Copilot+ PCs powered by Windows 11, for turnkey experimentation. The community can simply download the pre-optimized model and run it in their environment.

By making Fara-7B open-weight, we aim to lower the barrier to experimenting with and improving CUA technology for automating routine web tasks, such as searching for information, shopping, and booking reservations.

Figure 1: Comparing WebVoyager accuracy and cost of Fara-7B to other computer use agents (CUAs) or agents that prompt LLMs with accessibility trees (SoM Agent w/ Ax Tree). Cost is computed by multiplying the average number of input and output tokens each model consumes by price per token. Both Fara-7B and UI-TARS-1.5-7B are based on Qwen-2.5-VL-7B, for which the lowest inference price from https://openrouter.ai/  is \(0.2/\)0.2 per 1M input/output tokens. Even though both models are priced equally, Fara-7B is more efficient, completing tasks with only ~16 steps on average compared to ~41 for UI-TARS-1.5-7B. OpenAI computer-use-preview accessed November 2025 via the Responses API. Developing Fara-7B CUA multi-agent synthetic data generation

A key bottleneck for building CUA models is a lack of large-scale, high-quality computer interaction data. Collecting such data with human annotators is prohibitively expensive as a single CUA task can involve dozens of steps, each of which needs to be annotated. Our data generation pipeline (Figure 2) avoids manual annotation and instead relies on scalable synthetic data sourced from publicly available websites and custom task prompts. We build this pipeline on top of the Magentic-One framework, and it involves three main stages: 

Figure 2: Data Generation workflow from proposing tasks from various seeds like URLs to solving those tasks with the Magentic-One multi-agent framework to generate demonstrations for training, and finally verifiying/filtering completed trajectories

Task Proposal. We generate a broad set of synthetic tasks that mirror common user activities on the web. To ensure coverage and diversity, tasks are “seeded” by a web index of public URLs classified into various categories e.g., shopping, travel, restaurants, etc. This enables task generation targeting a particular skill, like “book 2 tickets to see the Downton Abbey Grand Finale at AMC Union Square, NYC.” from a URL like this (opens in new tab) classified as “movies”.  As another strategy, we devised a way to generate tasks from randomly sampled URLs. Each task starts with a general prompt and is iteratively refined as an LLM agent explores the website and gathers more information about it. We are releasing a held-out subset of these tasks as a benchmark (“WebTailBench”), described in the Evaluation section below. 

Task Solving. Once synthetic tasks are generated, a multi-agent system built on Magentic-One attempts to complete them to generate demonstrations for supervised finetuning. The multi-agent system uses an Orchestrator agent to create a plan and direct a WebSurfer agent to take browser actions and reports results. The Orchestrator monitors progress, updating plans as needed, and can end tasks or engage a UserSimulator agent if user input is required, allowing for multi-turn completion. Each task and corresponding sequence of observations, actions, and agent thoughts forms a “trajectory”.

Trajectory Verification. Before using any tasks for training, three verifier agents evaluate if a task was “successful”: The Alignment Verifier checks if the trajectory of actions match the task’s intent; the Rubric Verifier defines completion criteria and scores the trajectory against them; and the Multimodal Verifier reviews screenshots and responses to confirm visual evidence supports successful completion. Trajectories failing these standards are removed.

We ultimately train this version of Fara-7B on a dataset of 145,000 trajectories consisting of 1 million steps covering diverse websites, task types, and difficulty levels. Additionally, we include training data for several auxiliary tasks, including grounding for accurate UI element localization, captioning, and visual question answering.

Training Fara-7B

Using one compute use model is easier than a multi-agent system, particularly when it comes to deployment. Therefore, we distill the complexities of our multi-agent solving system into a single model that can execute tasks. Fara-7B is a proof-of-concept that small models can effectively learn from complex, multi-agent systems with lots of bells and whistles.

As shown in Figure 3, Fara-7B is trained to execute user tasks by perceiving only browser window screenshots (without relying on accessibility trees), and predicting single-step actions. For each step, the context used to make its prediction contains all user messages, the complete action history, and the latest three screenshots.

In its prediction, Fara-7B outputs a reasoning message (“thinking” about the next action) followed by a tool call. The available tools include standard Playwright (opens in new tab) mouse and keyboard actions, such as click(x,y) and type(), and browser-specific macro-actions like web_search() and visit_url().

Fara-7B uses Qwen2.5-VL-7B (opens in new tab) as its base model due to its strong performance on grounding tasks and its ability to support long contexts (up to 128k tokens). We linearize the solving pipeline’s trajectories into a sequence of “observe-think-act” steps that are suitable for training with supervised finetuning loss. We did not use reinforcement learning to achieve the results we report below.

Figure 3: Operation of Fara-7B as a standalone, native computer use agent running on-device. Because Fara-7B is small, and none of its context needs to leave your personal device, it paves the way for personal and private agentic computing Evaluations

We evaluate Fara-7B and comparable baselines on canonical public benchmarks including WebVoyager (opens in new tab), Online-Mind2Web (opens in new tab), and Deepshop (opens in new tab), as well as a new benchmark we developed named WebTailBench, specifically focusing on 11 real-world task types underrepresented or missing in existing benchmarks like booking movie/event tickets, restaurant reservations, comparing prices across retailers, applying for jobs, finding real estate, and more complex multi-step tasks.

Evaluation of web agents can be tricky because the web is constantly changing, and many websites even block detected bots, which is why we developed a test harness that relies on BrowserBase (opens in new tab) to standardize how browser sessions are managed. In Table 1 below, we report a notion of task success rate (%) defined by each benchmark’s official LLM-as-judge evaluator; WebTailBench success is computed using the same Task Verification pipeline that filtered our training data. We find that Fara-7B is state-of-the-art, even outperforming native computer use agents like UI-TARS-1.5-7B, or much larger models like GPT-4o prompted to act like a computer use agent with Set-Of-Marks (opens in new tab) (SoM Agent). 

WebVoyagerOnline-Mind2WebDeepShopWebTailBench  SoM Agents SoM Agent (GPT-4o) 65.1 34.6 16.0 30.0 GLM-4.1V-9B-Thinking 66.8  33.9 32.0 22.4 Computer Use Models OpenAI computer-use-preview  70.9 42.9 24.7 25.7 UI-TARS-1.5-7B 66.4  31.3 11.6 19.5 Fara-7B 73.5 34.1 26.2 38.4 Table 1: Performance comparison across four web benchmarks: WebVoyager, Online-Mind2Web, DeepShop, and our newly introduced WebTailBench. Results are reported as Task Succes Rate / Accuracy (%) and are averaged over 3 runs. OpenAI computer-use-preview accessed November 2025 via the Responses API.

In Figure 1, we expand on the Webvoyager results by giving each model up to three chances to complete a task, and report “pass@K”. We also consider on the x-axis the cost of running each model if one were to pay market rates for input/output tokens consumed. Fara-7B breaks ground on a new pareto frontier, showing that on-device computer use agents are approaching the capabilities of frontier models.

We partnered with a trusted external group, Browserbase, to independently evaluate Fara-7B using human annotators. The model achieved 62% on WebVoyager (see detailed reports in Browserbase blog here (opens in new tab)). These results were generated in the same environment with identical settings and human verification of each task, making them directly comparable. Note that Browserbase’s standard WebVoyager scores do not use retries when environment errors occur; the results referenced here include retries and should not be compared directly to the non-retry scores. Going forward, we are collaborating with Browserbase to host WebTailBench human evaluations to help the community build reliable and reproducible assessments for computer use agents. 

Safety

Agents capable of operating computers present challenges distinct from chat-only models, including new outlets of user misuse, model misbehavior, and unintended consequences of actions, and external risks like prompt injections or online scams. CUAs take action with real-world consequences, so ensuring robust safety measures is essential to their responsible deployment. Transparency and user control sit at the core of Fara-7B’s design. Although we have incorporated several safety measures, Fara-7B remains a research preview, and we continue to advance our approach to safety for computer use agents, an active area of work across the entire AI community. 

Fara-7B processes browser screenshots, user task instructions, and a history of actions taken during each session and collects only what is necessary to complete the user’s requested task. No additional site data—such as accessibility trees or external scaffolding—is accessed; Fara-7B interacts with the computer in the same way a human would, relying solely on what is visible on the screen.

All actions taken by the agent are logged and auditable, allowing users to review and monitor every step.  For added safety, Fara‑7B is intended to run in sandboxed environments, giving users full oversight and the ability to intervene or halt actions at any time. These safeguards ensure that privacy, transparency, and user control remain at the core of every interaction.

To address misuse, we trained Fara-7B on a mixture of public safety data and internally generated tasks that it ought to refuse based on Microsoft’s Responsible AI Policy. We evaluated Fara-7B’s ability to refuse harmful tasks on WebTailBench-Refusals which consists of 111 red-teaming tasks showing a high refusal rate of 82%. The model also underwent Microsoft’s rigorous red teaming process, where we focused on the model rejecting harmful tasks and risky tasks, such as harmful content, jailbreaking attempts, ungrounded responses, and prompt injections. For further details, check out our technical report (opens in new tab).

To mitigate the risk of Fara-7B taking unintended actions, all of Fara-7B’s training data enforces both recognizing and stopping at “Critical Points” when executing a task. A Critical Point (see Operator System Card (opens in new tab)) is any situation that requires the user’s personal data or consent before engaging in a transaction or irreversible action like sending an email. Upon reaching a Critical Point, Fara-7B should respond by informing the user it cannot proceed without their consent.

For guidance on how to use our model safely, and the security considerations to be mindful of when using our model, please refer to our Model card (opens in new tab).

How to use

Fara-7B is available on  (opens in new tab)Microsoft Foundry  (opens in new tab)and  (opens in new tab)Hugging Face (opens in new tab). We are also releasing the implementation of Fara-7B in Magentic-UI, so that users can try it in a contained environment through the inference code provided. Additionally, users can download the model for Copilot+ PCs powered by Windows 11 from the AI Toolkit in VSCode and run it all on-device, taking advantage of NPU hardware acceleration.  

Looking forward

Our current release is an experimental CUA model that achieves state-of-the-art results for its size, purely using supervised fine-tuning. We believe even stronger CUA models capable of running on-device are possible through improved multimodal base models and through Reinforcement Learning on live and sandboxed environments. These early days are about learning from the community and driving real-world experimentation to shape what comes next. If you’d like to join us and help shape the future of SLMs, please apply for open roles

Acknowledgements: 

We thank Gustavo de Rosa, Adam Fourney, Michael Harrison, Rafah Hosn, Neel Joshi, Ece Kamar, John Langford, Maya Murad, Sidhartha Sen, Pratyusha Sharma, and Lili Wu for their valuable help, insightful discussions, and continued support throughout this work. 

We also thank Pashmina Cameron, Karthik Vijayan, Vicente Rivera, Chris Dern, Sayan Shaw, Sunghoon Choi, Andrey Rybalchenko, and Vivek Pradeep for their efforts in making the model available on Copilot+ PCs through the AI Toolkit.

Opens in a new tab

The post Fara-7B: An Efficient Agentic Model for Computer Use appeared first on Microsoft Research.

Categories: Microsoft

MMCTAgent: Enabling multimodal reasoning over large video and image collections

Wed, 11/12/2025 - 13:00

Modern multimodal AI models can recognize objects, describe scenes, and answer questions about images and short video clips, but they struggle with long-form and large-scale visual data, where real-world reasoning requires moving beyond object recognition and short-clip analysis.

Real-world reasoning increasingly involves analyzing long-form video content, where context spans minutes or hours, far beyond the context limits of most models. It also entails querying across massive multimodal libraries of videos, images, and transcripts, where finding and integrating relevant evidence requires more than retrieval—it requires strategic reasoning. Existing models typically perform single-pass inference, producing one-shot answers. This limits their ability to handle tasks that require temporal reasoning, cross-modal grounding, and iterative refinement.

MMCTAgent

To meet these challenges, we developed the Multi-modal Critical Thinking Agent, or MMCTAgent, for structured reasoning over long-form video and image data, available on GitHub (opens in new tab) and featured on Azure AI Foundry Labs (opens in new tab).

Built on AutoGen, Microsoft’s open-source multi-agent system, MMCTAgent provides multimodal question-answering with a Planner–Critic architecture. This design enables planning, reflection, and tool-based reasoning, bridging perception and deliberation in multimodal tasks. It links language, vision, and temporal understanding, transforming static multimodal tasks into dynamic reasoning workflows.  

Unlike conventional models that produce one-shot answers, MMCTAgent has modality-specific agents, including ImageAgent and VideoAgent, which include tools like get_relevant_query_frames() or object_detection-tool(). These agents perform deliberate, iterative reasoning—selecting the right tools for each modality, evaluating intermediate results, and refining conclusions through a Critic loop. This enables MMCTAgent to analyze complex queries across long videos and large image libraries with explainability, extensibility, and scalability.

MMCTAgent on Azure AI Foundry Labs

Spotlight: AI-POWERED EXPERIENCE

Microsoft research copilot experience

Discover more about research at Microsoft through our AI-powered experience

Start now Opens in a new tab How MMCTAgent works

MMCTAgent integrates two coordinated agents, Planner and Critic, orchestrated through AutoGen. The Planner agent decomposes a user query, identifies the appropriate reasoning tools, performs multimodal operations, and drafts a preliminary answer. The Critic agent reviews the Planner’s reasoning chain, validates evidence alignment, and refines or revises the response for factual accuracy and consistency.

This iterative reasoning loop enables MMCTAgent to improve its answers through structured self-evaluation—bringing reflection into AI reasoning. A key strength of MMCTAgent lies in its modular extensibility. Developers can easily integrate new, domain-specific tools—such as medical image analyzers, industrial inspection models, or specialized retrieval modules—by adding them to ImageQnATools or VideoQnATools. This design makes MMCTAgent adaptable across domains.

VideoAgent: From ingestion to long-form multimodal reasoning Figure 1. MMCTAgent’s Planner–Critic architecture enables multimodal reasoning over long-form video through structured ingestion, retrieval, and iterative feedback

The VideoAgent extends this architecture to long-form video reasoning. It operates in two connected phases: library creation (ingestion) and query-time reasoning.

Phase 1 – Video ingestion and library creation

Before reasoning, long-form videos undergo an ingestion pipeline that aligns multimodal information for retrieval and understanding:

  1. Transcription and translation: Converts audio to text and, if multilingual, translates transcripts into a consistent language 
  2. Key-frame identification: Extracts representative frames marking major visual or scene changes
  3. Semantic chunking and chapter generation: Combines transcript segments and visual summaries into coherent, semantically segmented chapters with associated key frames. Inspired by Microsoft’s Deep Video Discovery agentic search tool, this step also extracts detailed descriptions of objects, on-screen text, and characters present within each video segment, integrating these insights directly into the corresponding chapters. 
  4. Multimodal embedding creation: Generates image embeddings for key frames, linking them to their corresponding transcript and chapter data

All structured metadata, including transcripts, visual summaries, chapters, and embeddings, is indexed in the Multimodal Knowledgebase using Azure AI Search (opens in new tab), which forms the foundation for scalable semantic retrieval and downstream reasoning.

Phase 2 – Video question answering and reasoning

When a user submits a query, the VideoAgent retrieves, analyzes, and reasons across the indexed video content using specialized planner and critic tools.

Planner tools
  • get_video_analysis: Finds the most relevant video, provides a summary, and lists detected objects 
  • get_context: Retrieves contextual information and relevant chapters from the Azure AI Search index
  • get_relevant_frames: Selects key frames most relevant to the user query
  • query_frame: Performs detailed visual and textual reasoning over selected frames
  • get_context and get_relevant_frames work in tandem to ensure that reasoning begins from the most semantically relevant evidence
Critic tool
  • critic_tool: Evaluates the reasoning output for temporal alignment, factual accuracy, and coherence between visual and textual modalities

This two-phase design, which involves structured ingestion followed by agentic reasoning, enables MMCTAgent to deliver accurate, interpretable insights for long information-dense videos. 

ImageAgent: Structured reasoning for static visuals

While the VideoAgent handles temporal reasoning across long-form videos, the ImageAgent applies the same Planner–Critic paradigm to static visual analysis. It performs modular, tool-based reasoning over images, combining perception tools for recognition, detection, and optical character recognition with language-based reasoning for interpretation and explanation.

Planner tools
  • vit_tool: Leverages Vision Transformer (ViT) or Vision Languague Model (VLM) for high-level visual understanding and description 
  • recog_tool: Performs scene, face, and object recognition
  • object_detection_tool: Localizes and labels entities within an image
  • ocr_tool: Extracts embedded text from visual elements
Critic tool
  • critic_tool: Validates the Planner’s conclusions for factual alignment and consistency, refining the final response 

This lightweight ImageAgent provides fine-grained, explainable reasoning over image collections—supporting visual question answering, content inspection, and multimodal retrieval—while maintaining architectural symmetry with the VideoAgent.

Evaluation Results 

To assess the effectiveness of MMCTAgent, we evaluated both the ImageAgent and VideoAgent with multiple base LLM models and a range of benchmark datasets and real-world scenarios. Some key results are presented here. 

Image DatasetsGPT-4VMMCT with GPT-4VGPT4oMMCT with GPT-4oGPT-5MMCT with GPT-5MM-Vet [1]60.2074.2477.9879.3680.5181.65MMMU [2]56.8063.5769.1073.0084.2085.44 Video DatasetsGPT4oMMCT with GPT-4oVideoMME [3]72.1076.70

MMCTAgent enhances base model performance by augmenting their capabilities with appropriate tools such as object detection and optical character recognition (OCR) for weaker models, or domain-specific tools for stronger models, thereby leading to substantial improvements. For example, integrating these tools raised GPT-4V’s accuracy from 60.20% to 74.24% on MM-Vet dataset. Additionally, the configurable Critic agent provides additional validation, which is especially valuable in critical domains. The additional evaluation results are available here (opens in new tab).

Takeaways and next steps

MMCTAgent demonstrates a scalable agentic approach to multimodal reasoning with a Planner–Critic architecture. Its unified multimodal design supports both image and video pipelines, while the extensible toolchain enables rapid integration of domain-specific tools and capabilities. It provides Azure-native deployment and supports configurability within the broader open-source ecosystem.

Looking ahead, we aim to improve efficiency and adaptability in retrieval and reasoning workflows, and to extend MMCTAgent’s applications beyond current agricultural evaluations, exploring new real-world domains through initiatives like Project Gecko to advance the creation of accessible, innovative multimodal applications for people around the globe. 

Acknowledgements

We would like to thank our team members for their valuable contributions to this work: Aman Patkar, Ogbemi Ekwejunor-Etchie, Somnath Kumar, Soumya De, and Yash Gadhia. 

References 

[1] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. “MM-VET: Evaluating large multimodal models for integrated capabilities”, 2023. 

[2] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, and W. Chen. “MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI”, 2023. 

[3] Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. “Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis”, 2024. 

Opens in a new tab

The post MMCTAgent: Enabling multimodal reasoning over large video and image collections appeared first on Microsoft Research.

Categories: Microsoft

BlueCodeAgent: A blue teaming agent enabled by automated red teaming for CodeGen AI

Tue, 11/11/2025 - 18:00
Introduction

Large language models (LLMs) are now widely used for automated code generation across software engineering tasks. However, this powerful capability in code generation also introduces security concerns. Code generation systems could be misused for harmful purposes, such as generating malicious code. It could also produce bias-filled code reflecting underlying logic that is discriminatory or unethical. Additionally, even when completing benign tasks, LLMs may inadvertently produce vulnerable code that contains security flaws (e.g., injection risks, unsafe input handling). These unsafe outcomes undermine the trustworthiness of code generation models and pose threats to the broader software ecosystem, where safety and reliability are critical.

Many studies have explored red teaming code LLMs, testing whether the models can reject unsafe requests and whether their generated code exhibits insecure patterns. For more details, see our earlier MSR blog post on RedCodeAgent. While red teaming has significantly improved our understanding of model failure modes, progress on blue teaming—i.e., developing effective defensive mechanisms to detect and prevent such failures—remains relatively limited. Current blue teaming approaches face several challenges: (1) Poor alignment with security concepts: additional safety prompts struggle to help models understand high-level notions, such as what constitutes a malicious or bias instruction, and typically lack actionable principles to guide safe decision-making. A case study is shown in Figure 1. (2) Over-conservatism: especially in the domain of vulnerable code detection, models tend to misclassify safe code as unsafe, leading to more false positives and reduced developer trust. (3) Incomplete risk coverage: without a strong knowledge foundation, models perform poorly when dealing with subtle or previously unseen risks.   

To address these challenges, researchers from the University of Chicago, University of California, Santa Barbara, University of Illinois Urbana–Champaign, VirtueAI, and Microsoft Research recently released a paper: BlueCodeAgent: A Blue Teaming Agent Enabled by Automated Red Teaming for CodeGen AI. This work makes the following key contributions: 

  1. Diverse red-teaming pipeline: The authors design a comprehensive red-teaming process that integrates multiple strategies to synthesize diverse red-teaming data for effective knowledge accumulation.
  2. Knowledge-enhanced blue teaming: Building on the foundation of red-teaming knowledge, BlueCodeAgent significantly improves blue-teaming performance by leveraging constitutions derived from knowledge and dynamic testing. 
  3. Principled-Level Defense and Nuanced-Level analysis: The authors propose two complementary strategies—Principled-Level Defense (via constitutions) and Nuanced-Level Analysis (via dynamic testing)—and demonstrate their synergistic effects in vulnerable code detection tasks. 
  4. Generalization to seen and unseen risks: Empowered by comprehensive red-teaming knowledge, BlueCodeAgent generalizes effectively to unseen risks. Overall, BlueCodeAgent achieves an average 12.7% improvement in F1 score across four datasets and three tasks, attributed to its ability to distill actionable constitutions that enhance context-aware risk detection. 
Figure 1. A case study of BlueCodeAgent on the bias instruction detection task. Even when concepts such as “biased” are explicitly included in additional safety prompts, models often fail to recognize biased requests (left). BlueCodeAgent (right) addresses this gap by summarizing constitutions from knowledge and applying concrete, actionable constraints benefited from red teaming to improve the defense. A blue teaming agent enabled by red teaming Figure 2: Overview of BlueCodeAgent, an end-to-end blue teaming framework powered by automated red teaming for code security. By integrating knowledge derived from diverse red teaming and conducting dynamic sandbox-based testing, BlueCodeAgent substantially strengthens the defensive capabilities beyond static LLM analysis.

Figure 2 presents an overview of the pipeline. The framework unifies both sides of the process: red teaming generates diverse risky cases and behaviors, which are then distilled into actionable constitutions that encode safety rules on the blue-teaming side. These constitutions guide BlueCodeAgent to more effectively detect unsafe textual inputs and code outputs, mitigating limitations such as poor alignment with abstract security concepts. 

This work targets three major risk categories, covering both input/textual-level risks—including biased and malicious instructions—and output/code-level risks, where models may generate vulnerable code. These categories represent risks that have been widely studied in prior research. 

Diverse red-teaming process for knowledge accumulation 

Since different tasks require distinct attack strategies, the red-teaming employs multiple attack methods to generate realistic and diverse data. Specifically, the red-teaming process is divided into three categories:

  1. Policy-based instance generation: To synthesize policy-grounded red-teaming data, diverse security and ethical policies are first collected. These high-level principles are then used to prompt an uncensored model to generate instances that intentionally violate the specified policies.
  2. Seed-based adversarial prompt optimization: Existing adversarial instructions are often overly simplistic and easily rejected by models. To overcome this limitation, an adaptive red-teaming agent invokes various jailbreak tools to iteratively refine initial seed prompts until the prompts achieve high attack success rates.
  3. Knowledge-driven vulnerability generation: To synthesize both vulnerable and safe code samples under realistic programming scenarios, domain knowledge of common software weaknesses (CWE) is leveraged to generate diverse code examples.
Knowledge-enhanced blue teaming agent 

After accumulating red-teaming knowledge data, BlueCodeAgent set up Principled-Level Defense via Constitution Construction and Nuanced-Level Analysis via Dynamic Testing.

  1. Principled-Level Defense via Constitution Construction 
    Based on the most relevant knowledge data, BlueCodeAgent summarizes red-teamed knowledge into actionable constitutions—explicit rules and principles distilled from prior attack data. These constitutions serve as normative guidelines, enabling the model to stay aligned with ethical and security principles even when confronted with novel or unseen adversarial inputs. 
  2. Nuanced-Level Analysis via Dynamic Testing 
    In vulnerable code detection, BlueCodeAgent augments static reasoning with dynamic sandbox-based analysis, executing generated code within isolated Docker environments to verify whether the model-reported vulnerabilities manifest as actual unsafe behaviors. This dynamic validation effectively mitigates the model’s tendency toward over-conservatism, where benign code is mistakenly flagged as vulnerable. 

PODCAST SERIES

The AI Revolution in Medicine, Revisited

Join Microsoft’s Peter Lee on a journey to discover how AI is impacting healthcare and what it means for the future of medicine.

Listen now Opens in a new tab Insights from BlueCodeAgent  BlueCodeAgent outperforms prompting baselines 

As shown in Figure 3, BlueCodeAgent significantly outperforms other baselines. Several findings are highlighted. 

(1) Even when test categories differ from knowledge categories to simulate unseen scenarios, BlueCodeAgent effectively leverages previously seen risks to handle unseen ones, benefiting from its knowledge-enhanced safety reasoning. 

(2) BlueCodeAgent is model-agnostic, working consistently across diverse base LLMs, including both open-source and commercial models. Its F1 scores for bias and malicious instruction detection approach 1.0, highlighting strong effectiveness. 

(3) BlueCodeAgent achieves a strong balance between safety and usability. It accurately identifies unsafe inputs while maintaining a reasonable false-positive rate on benign ones, resulting in a consistently high F1 score. 

(4) By contrast, prompting with general or fine-grained safety reminders remains insufficient for effective blue teaming, as models struggle to internalize abstract safety concepts and apply them to unseen risky scenarios. BlueCodeAgent bridges this gap by distilling actionable constitutions from knowledge, using concrete and interpretable safety constraints to enhance model alignment. 

Figure 3: F1 scores on bias instruction detection task (BlueCodeEval-Bias) in the first row and on malicious instruction detection task (BlueCodeEval-Mal) in the second row.  Complementary effects of constitutions and dynamic testing 

In vulnerability detection tasks, models tend to behave conservatively—an effect also noted in prior research. They are often more likely to flag code as unsafe rather than safe. This bias is understandable: confirming that code is completely free from vulnerabilities is generally harder than spotting a potential issue. 

To mitigate this over-conservatism, BlueCodeAgent integrates dynamic testing into its analysis pipeline. When BlueCodeAgent identifies a potential vulnerability, it triggers a reliable model (Claude-3.7-Sonnet-20250219) to generate test cases and corresponding executable code that embeds the suspicious snippet. These test cases are then run in a controlled environment to verify whether the vulnerability actually manifests. The final judgment combines the LLM’s analysis of the static code, the generated test code, run-time execution results, and constitutions derived from knowledge. 

Researchers find the two components—constitutions and dynamic testing—play complementary roles. Constitutions expand the model’s understanding of risk, increasing true positives (TP) and reducing false negatives (FN). Dynamic testing, on the other hand, focuses on reducing false positives (FP) by validating whether predicted vulnerabilities can truly be triggered at run-time. Together, they make BlueCodeAgent both more accurate and more reliable in blue-teaming scenarios. 

Summary 

BlueCodeAgent introduces an end-to-end blue-teaming framework designed to address risks in code generation. The key insight behind BlueCodeAgent is that comprehensive red-teaming can greatly strengthen blue-teaming defenses. Based on this idea, the framework first builds a red-teaming process with diverse strategies for generating red-teaming data. It then constructs a blue-teaming agent that retrieves relevant examples from the red-teaming knowledge base and summarizes safety constitutions to guide LLMs in making accurate defensive decisions. A dynamic testing component is further added to reduce false positives in vulnerability detection. 

Looking ahead, several directions hold promise.  

First, it is valuable to explore the generalization of BlueCodeAgent to other categories of code-generation risks beyond bias, malicious code, and vulnerable code. This may require designing and integrating novel red-teaming strategies into BlueCodeAgent and creating corresponding benchmarks for new risks.  

Second, scaling BlueCodeAgent to the file and repository levels could further enhance its real-world utility, which requires equipping agents with more advanced context retrieval tools and memory components.  

Finally, beyond code generation, it is also important to extend BlueCodeAgent to mitigate risks in other modalities, including text, image, video, and audio, as well as in multimodal applications. 

Opens in a new tab

The post BlueCodeAgent: A blue teaming agent enabled by automated red teaming for CodeGen AI appeared first on Microsoft Research.

Categories: Microsoft

When industry knowledge meets PIKE-RAG: The innovation behind Signify’s customer service boost

Thu, 11/06/2025 - 14:00

As a world leader in connected LED lighting products, systems, and services, Signify (formerly Philips Lighting) serves not only everyday consumers but also a large number of professional users who have stringent requirements for technical specifications and engineering compatibility. Faced with thousands of product models, complex component parameters, and technical documentation spanning multiple versions, delivering accurate, professional answers efficiently has become a core challenge for Signify’s knowledge management system.

To address this challenge, Signify (opens in new tab) collaborated with Microsoft Research Asia on a proof-of-concept (PoC) using PIKE-RAG technology, integrating it into their upgraded knowledge management system built on Microsoft Azure. The result: a 12% improvement in answer accuracy.

Challenges of applying RAG in lighting

In an era where AI is rapidly transforming how enterprises manage information, Signify recognized the strategic importance of precise and efficient knowledge systems. It adopted large AI models and retrieval-augmented generation (RAG) techniques to better support its wide range of customer inquiries.

Yet applying RAG to lighting scenarios involving professional users presented unique challenges. Product data spanned multimodal documents, unstructured tables, and complex product parameters, demanding continuous customization that slowed development and limited scalability. Despite improvements through keyword tuning, system optimization, and refined prompts, Signify sought more advanced approaches to further raise accuracy and reliability.

Seeking to unlock greater value from its knowledge management system, Signify began exploring more suitable technical solutions that are better aligned with their professional use cases. Upon learning that PIKE-RAG had been successfully applied in domains like healthcare and law, significantly improving information accuracy, Signify worked with Microsoft Research Asia on a PoC of PIKE-RAG on Microsoft Azure.

How PIKE-RAG addressed Signify’s pain points

Compared to traditional RAG, PIKE-RAG efficiently retrieves textual information and also understands multimodal content like charts and tables. Its built-in domain adaptation module quickly learns reasoning patterns aligned with specific domains to generate responses that are consistent with engineering contexts. These differentiated advantages stem from PIKE-RAG’s unique approach to understanding and processing professional knowledge. In Signify’s use case, this manifests in three key areas:

Multimodal document parsing and learning of industry-specific reasoning patterns

Signify’s product documentation includes diverse formats, such as nonstandard tables (e.g., comparison charts of voltage ranges under different currents) and circuit diagrams (e.g., driver power limits). Traditional systems often fail to process this information effectively—either ignoring it or extracting disorganized text fragments.

PIKE-RAG integrates Microsoft Research Asia’s Document Intelligence technology with Microsoft Azure OpenAI models to accurately identify table structures and parse key parameters in circuit diagrams. For example, when a customer service agent queries, “What is the output voltage of a specific driver model at 0.15A current,” the system automatically locates the curve chart in the document and infers a range of 40–54V based on the current interval—an area where traditional systems frequently err, due to their inability to “read” diagrams.

End-to-end knowledge loop, eliminating reliance on erroneous data sources

Enterprise knowledge systems often integrate data from multiple sources, which can lead to discrepancies, especially when database updates are not fully synchronized. PIKE-RAG captures diverse information sources and establishes citation relationships, supporting complex reasoning tasks that rely on multi-source data.

In other words, PIKE-RAG can directly use original documents as data sources, efficiently parsing and understanding product manuals and PDF charts. By extracting key information from these text- and graphic-rich documents, PIKE-RAG enables more efficient and trustworthy knowledge retrieval.

Dynamic task decomposition and multi-hop reasoning for precise answers to complex questions

Traditional RAG systems typically follow a “one question, one answer” model and struggle with multi-step reasoning. In Signify’s lighting domain, customer inquiries often involve multi-level associations. PIKE-RAG dynamically decomposes user questions into executable subtasks and solves them through multi-hop reasoning. For example, when asked, “List all bases compatible with the G8 series lamps,” if no document directly provides the answer, PIKE-RAG’s reasoning proceeds as follows:

Step 1: The system identifies implicit knowledge. One document notes that the G7 and G8 series have identical dimensions and that all bases compatible with the G7 series are also compatible with the G8 series. 

Step 2: Based on this, the system retrieves the base list for the G7 series. 

Step 3: Since the list uses abbreviations, the system searches for a table that maps abbreviations to full names and generates a complete list of G8-compatible bases. 

Through this automated multi-hop reasoning, the system delivers accurate and complete answers.

Figure 1: PIKE-RAG orchestrates and integrates heterogeneous information in multi-source and multimodal environments.

Testing showed that the PIKE-RAG-powered knowledge management platform provided a significant advantage. It achieved a 12% improvement in performance compared with the original system.

These results were achieved without any question-specific customization, only algorithmic optimization, demonstrating precise knowledge matching and generation. As the system continues to learn and integrate Signify’s proprietary knowledge, accuracy is expected to improve further.

“In the PoC for our product specification insight tool, PIKE-RAG helped us significantly improve the original system’s performance. This will enhance overall customer satisfaction. We’re currently evaluating PIKE-RAG’s application path from multiple angles, including technical implementation, cost control, and future adaptability, and we look forward to deepening our collaboration with Microsoft Research Asia to drive further innovation,” said Haitao Liu, head of Signify Research China.

“It’s also worth noting that the researchers at Microsoft Research Asia demonstrated strong industry knowledge and rigorous scientific methodology. They proactively studied and analyzed the issues, tracing and clarifying the root causes of our issues to make PIKE-RAG better suited to Signify’s real-world needs.”

Beyond lighting: Generalization across industries

In Signify’s successful test, PIKE-RAG demonstrated strong generalization capabilities in complex industrial scenarios, enabling rapid cross-domain adaptation. Its three core strengths are:

  • Support for self-evolution and continuous learning: PIKE-RAG continuously analyzes error cases in interaction logs and uses evolutionary algorithms to automatically optimize knowledge extraction strategies, such as trying different table parsing methods or adjusting multimodal content weights. Validated strategies are then solidified for future Q&A, allowing the system to adapt to new knowledge types without manual intervention. 
  • Modular architecture driven by capability needs: PIKE-RAG flexibly combines modules for document parsing, knowledge extraction, storage, retrieval, organization, knowledge-centered reasoning, and task decomposition. It dynamically adjusts focus areas based on scenario needs (e.g., fact retrieval, multi-hop reasoning, innovative generation) and flexibly builds RAG methods that adapt to real-world applications, efficiently handling various complex tasks. 
  • Strong adaptation to domain-specific reasoning patterns: With dynamic updates through the Domain Tips feature, enterprises can add domain-specific logic (e.g., “the maximum output voltage of an LED driver should be the maximum of the operating range, not the spec sheet’s max output”) in real time, enabling the system to process information according to professional engineering standards and follow industry conventions. 
Figure 2: Overview of the PIKE-RAG framework

PIKE-RAG’s generalization capabilities have been validated not only in Signify’s knowledge management platform but also in pilot applications across industries like manufacturing, mining, and pharmaceuticals—significantly improving Q&A system accuracy.

“A leader in lighting, Signify presents a complex industrial knowledge system with a highly challenging real-world scenario for PIKE-RAG. Through this collaboration, we validated that PIKE-RAG’s general approach can greatly improve the accuracy of professional knowledge Q&A and accelerate scenario customization. Our researchers also gained valuable experience in handling domain-specific data,” explained Jiang Bian, partner research manager at Microsoft Research Asia.

“Our goal isn’t to build a universal chatbot but to create a professional assistant that aligns with domain-specific logic and performs rigorous knowledge reasoning. That’s the true driving force behind intelligent transformation in industrial knowledge management.”

Opens in a new tab

The post When industry knowledge meets PIKE-RAG: The innovation behind Signify’s customer service boost appeared first on Microsoft Research.

Categories: Microsoft

RedCodeAgent: Automatic red-teaming agent against diverse code agents

Tue, 11/04/2025 - 18:00
Introduction

Code agents are AI systems that can generate high-quality code and work smoothly with code interpreters. These capabilities help streamline complex software development workflows, which has led to their widespread adoption.

However, this progress also introduces critical safety and security risks. Existing static safety benchmarks and red-teaming methods—in which security researchers simulate real-world attacks to identify security vulnerabilities—often fall short when evaluating code agents. They may fail to detect emerging real-world risks, such as the combined effects of multiple jailbreak tools. In the context of code, effective red-teaming requires more than simply checking whether the target code agent rejects unsafe requests. Instead, the agent must generate and execute correct code that performs the intended risky functionality, making it essential to evaluate execution behaviors beyond static code analysis. 

To address these challenges, researchers from the University of Chicago, University of Illinois Urbana–Champaign, VirtueAI, the UK AI Security Institute, University of Oxford, UC Berkeley, and Microsoft Research recently proposed RedCodeAgent, the first fully automated and adaptive red-teaming agent designed specifically to evaluate the safety of large language model (LLM)-based code agents.

Comprehensive experimental results demonstrate the effectiveness and efficiency of RedCodeAgent across (1) diverse Common Weakness Enumeration (CWE) vulnerabilities and malware types, (2) multiple programming languages—including Python, C, C++, and Java—and (3) a wide range of code agents, such as OpenCodeInterpreter, ReAct, MetaGPT, and commercial agents like Cursor and Codeium. RedCodeAgent also uncovers common vulnerabilities across agents such as generating and executing unsafe code, exposes variations in red-teaming difficulty across goals, identifies frequently triggered attack tools, and detects previously unknown vulnerabilities that all other baseline methods overlook. 

Framework for automatic red-teaming against code agents Figure 1: Illustration of RedCodeAgent on automatic red-teaming against a target code agent 

As shown in Figure 1, RedCodeAgent is equipped with a memory module that accumulates successful attack experiences, enabling the system to continuously learn and adapt its attack strategies. After learning from the previous experiences, RedCodeAgent further leverages a tailored toolbox that combines representative red-teaming tools with a specialized code substitution module, enabling realistic and diverse code-specific attack simulations through function calling. Based on the target agent’s responses across multiple interactive trials, RedCodeAgent optimizes its strategies, systematically probing for weaknesses and vulnerabilities in real time. 

In the evaluation phase, RedCodeAgent integrates simulated sandbox environments to enable code execution and assess the impact of the resulting behaviors. This sandbox-based evaluation ensures a more robust assessment of harmful behaviors and addresses the potential biases of previous static methods that rely solely on “LLM-as-a-judge” evaluations.

A case study is shown in Figure 2. Initially, RedCodeAgent discovers that the request was rejected, then RedCodeAgent calls the Greedy Coordinate Gradient (GCG) algorithm to bypass the safety guardrail. After the second request was rejected by the code agent, RedCodeAgent invoked both Code Substitution and GCG to optimize the prompt. Ultimately, RedCodeAgent successfully combined the suggestion from Code Substitution (i.e., using pathlib) with the adversarial suffix generated by GCG, making the target code agent delete the specified file.

Figure2. A case study of RedCodeAgent calling different tools to successfully attack the target code agent Insights from RedCodeAgent 

Experiments on diverse benchmarks show that RedCodeAgent achieves both a higher attack success rate (ASR) and a lower rejection rate, revealing several key findings outlined below.

Using traditional jailbreak methods alone does not necessarily improve ASR on code agents

The optimized prompts generated by GCG, AmpleGCG, Advprompter, and AutoDAN do not always achieve a higher ASR compared with static prompts with no jailbreak, as shown in Figure 3. This is likely due to the difference between code-specific tasks and general malicious request tasks in LLM safety. In the context of code, it is not enough for the target code agent to simply avoid rejecting the request; the target code agent must also generate and execute code that performs the intended function. Previous jailbreak methods do not guarantee this outcome. However, RedCodeAgent ensures that the input prompt has a clear functional objective (e.g., deleting specific sensitive files). RedCodeAgent can dynamically adjust based on evaluation feedback, continually optimizing to achieve the specified objectives.

Figure 3:RedCodeAgent achieves the highest ASR compared with other methods RedCodeAgent exhibits adaptive tool utilization 

RedCodeAgent can dynamically adjust its tool usage based on task difficulty. Figure 4 shows that the tool calling combination is different for different tasks. For simpler tasks, where the baseline static test cases already achieve a high ASR, RedCodeAgent spends little time invoking additional tools, demonstrating its efficiency. For more challenging tasks, where the baseline static test cases in RedCode-Exec achieve a lower ASR,we observe that RedCodeAgent spends more time using advanced tools like GCG and Advprompter to optimize the prompt for a successful attack. As a result, the average time spent on invoking different tools varies across tasks, indicating that RedCodeAgent adapts its strategy depending on the specific task. 

Figure 4: Average time cost for RedCodeAgent to invoke different tools or query the target code agent in successful cases for each risk scenario  RedCodeAgent discovers new vulnerabilities

In scenarios where other methods fail to find successful attack strategies, RedCodeAgent is able to discover new, feasible jailbreak approaches. Quantitatively, we find that RedCodeAgent is capable of discovering 82 (out of 27*30=810 cases in RedCode-Exec benchmark) unique vulnerabilities on the OpenCodeInterpreter code agent and 78 on the ReAct code agent. These are cases where all baseline methods fail to identify the vulnerability, but RedCodeAgent succeeds.

Summary

RedCodeAgent combines adaptive memory, specialized tools, and simulated execution environments to uncover real-world risks that static benchmarks may miss. It consistently outperforms leading jailbreak methods, achieving higher attack success rates and lower rejection rates, while remaining efficient and adaptable across diverse agents and programming languages.

Opens in a new tab

The post RedCodeAgent: Automatic red-teaming agent against diverse code agents appeared first on Microsoft Research.

Categories: Microsoft

Tell me when: Building agents that can wait, monitor, and act

Tue, 10/21/2025 - 17:00

Modern LLM Agents can debug code, analyze spreadsheets, and book complex travel. Given those capabilities, it’s reasonable to assume that they could handle something simpler: waiting. Ask an agent to monitor your email for a colleague’s response or watch for a price drop over several days, and it will fail. Not because it can’t check email or scrape prices. It can do both. It fails because it doesn’t know when to check. Agents either give up after a few attempts or burn through their context window, checking obsessively. Neither work. 

This matters because monitoring tasks are everywhere. We track emails for specific information, watch news feeds for updates, and monitor prices for sales. Automating these tasks would save hours, but current agents aren’t built for patience.

To address this, we are introducing SentinelStep (opens in new tab), a mechanism that enables agents to complete long-running monitoring tasks. The approach is simple. SentinelStep wraps the agent in a workflow with dynamic polling and careful context management. This enables the agent to monitor conditions for hours or days without getting sidetracked. We’ve implemented SentinelStep in Magentic-UI, our research prototype agentic system, to enable users to build agents for long-running tasks, whether they involve web browsing, coding, or external tools. 

PODCAST SERIES

The AI Revolution in Medicine, Revisited

Join Microsoft’s Peter Lee on a journey to discover how AI is impacting healthcare and what it means for the future of medicine.

Listen now Opens in a new tab How it works

The core challenge is polling frequency. Poll too often, and tokens get wasted. Poll too infrequently, and the user’s notification gets delayed. SentinelStep makes an educated guess at the polling interval based on the task at hand—checking email gets different treatment than monitoring quarterly earnings—then dynamically adjusts based on observed behavior. 

There’s a second challenge: context overflow. Because monitoring tasks can run for days, context overflow becomes inevitable. SentinelStep handles this by saving the agent state after the first check, then using that state for each subsequent check.

These demonstrations capture Magentic-UI with SentinelStep at work, completing a range of tasks in a timelapse sequence.  Core components

As the name suggests, SentinelStep consists of individual steps taken as part of an agent’s broader workflow. As illustrated in Figure 1, there are three main components: the actions necessary to collect information, the condition that determines when the task is complete, and the polling interval that determines timing. Once these components are identified, the system’s behavior is simple: every [polling interval] do [actions] until [condition] is satisfied. 

Figure 1. SentinelSteps’s three main components in Magentic-UI’s co-planning interface. 

These three components are defined and exposed in the co-planning interface of Magentic-UI. Given a user prompt, Magentic-UI proposes a complete multi-step plan, including pre-filled parameters for any monitoring steps. Users can accept the plan or adjust as needed.

Processing

Once a run starts, Magentic-UI assigns the most appropriate agent from a team of agents to perform each action. This team includes agents capable of web surfing, code execution, and calling arbitrary MCP servers.

When the workflow reaches a monitoring step, the flow is straightforward. The assigned agent collects the necessary information through the actions described in the plan. The Magentic-UI orchestrator then checks whether the condition is satisfied. If it is, the SentinelStep is complete, and the orchestrator moves to the next step. If not, the orchestrator determines the timestamp for the next check and resets the agent’s state to prevent context overflow.

Evaluation

Evaluating monitoring tasks in real-world settings is nearly impossible. Consider a simple example: monitoring the Magentic-UI repository on GitHub until it reaches 10,000 stars (a measure of how many people have bookmarked it). That event occurs only once and can’t be repeated. Most real-world monitoring tasks share this limitation, making systematic bench marking very challenging.

In response, we are developing SentinelBench, a suite of synthetic web environments for evaluating monitoring tasks. These environments make experiments repeatable. SentinelBench currently supports 28 configurable scenarios, each allowing the user to schedule exactly when a target event should occur. It includes setups like GitHub Watcher, which simulates a repository accumulating stars over time; Teams Monitor, which models incoming messages, some urgent; and Flight Monitor, which replicates evolving flight-availability dynamics. 

Initial tests show clear benefits. As shown in Figure 2, success rates remain high for short tasks (30 sec and 1 min) regardless of whether SentinelStep is used. For longer tasks, SentinelStep markedly improves reliability: at 1 hour, task reliability rises from 5.6% without SentinelStep to 33.3% with it; and at 2 hours, it rises from 5.6% to 38.9%. These gains demonstrate that SentinelStep effectively addresses the challenge of maintaining performance over extended durations.

Figure 2. SentinelStep improves success rates on longer running tasks (1–2 hours) while maintaining comparable performance on shorter tasks.   Impact and availability

SentinelStep is a first step toward practical, proactive, longer‑running agents. By embedding patience into plans, agents can responsibly monitor conditions and act when it matters—staying proactive without wasting resources. This lays the groundwork for always‑on assistants that stay efficient, respectful of limits, and aligned with user intent.

We’ve open-sourced SentinelStep as part of Magentic-UI, available on GitHub (opens in new tab) or via pip install magnetic-ui. As with any new technique, production deployment should be preceded through testing and validation for the specific use case. For guidance on intended use, privacy considerations, and safety guidelines, see the Magentic-UI Transparency Note. (opens in new tab) 

Our goal is to make it easier to implement agents that can handle long-running monitoring tasks and lay the groundwork for systems that anticipate, adapt, and evolve to meet real-world needs. 

Opens in a new tab

The post Tell me when: Building agents that can wait, monitor, and act appeared first on Microsoft Research.

Categories: Microsoft

When AI Meets Biology: Promise, Risk, and Responsibility

Mon, 10/06/2025 - 15:03

Advances in AI are opening extraordinary frontiers in biology. AI-assisted protein engineering holds the promise of new medicines, materials, and breakthroughs in scientific understandings. Yet these same technologies also introduce biosecurity risks and may lower barriers to designing harmful toxins or pathogens. This “dual-use” potential, where the same knowledge can be harnessed for good or misuse to cause harm, poses a critical dilemma for modern science.

Great Promise—and Potential Threat

I’m excited about the potential for AI-assisted protein design to drive breakthroughs in biology and medicine. At the same time, I’ve also studied how these tools could be misused. In computer-based studies, we found that AI protein design (AIPD) tools could generate modified versions of proteins of concern, such as ricin. Alarmingly, these reformulated proteins were able to evade the biosecurity screening systems used by DNA synthesis companies, which scientists rely on to synthesize AI-generated sequences for experimental use.

In our paper published in Science on October 2, “Strengthening nucleic acid biosecurity screening against generative protein design tools (opens in new tab),” we describe a two-year confidential project we began in late 2023 while preparing a case study for a workshop on AI and biosecurity.

We worked confidentially with partners across organizations and sectors for 10 months to develop AI biosecurity “red-teaming” methods that allowed us to better understand vulnerabilities and craft practical solutions—”patches” that have now been adopted globally, making screening systems significantly more AI-resilient.

Summary of AIPD red-teaming workflow.

For structuring, methods, and process in our study, we took inspiration from the cybersecurity community, where “zero-day” vulnerabilities are kept confidential until a protective patch is developed and deployed. Following the acknowledgment by a small group of workshop attendees of a zero-day for AI in biology, we worked closely with stakeholders—including synthesis companies, biosecurity organizations, and policymakers—to rapidly create and distribute patches that improved detection of AI-redesigned protein sequences. We delayed public disclosure until protective measures were in place and widely adopted.

Dilemma of Disclosure

The dual use dilemma also complicates how we share information about vulnerabilities and safeguards. Across AI and other fields, researchers face a core question:

How can scientists share potentially risk-revealing methods and results in ways that enable progress without offering a roadmap for misuse?

We recognized that our work itself—detailing methods and failure modes—could be exploited by malicious actors if published openly. To guide decisions about what to share, we held a multi-stakeholder deliberation involving government agencies, international biosecurity organizations, and policy experts. Opinions varied: some urged full transparency to maximize reproducibility—and to help others to build on our work; others stressed restraint to minimize risk. It was clear that a new model of scientific communication was needed, one that could balance openness and security.

The Novel Framework

The risk of sharing dangerous information through biological research has become a growing concern. We have participated in community-wide discussion on the challenges, including a recent National Academies of Science, Engineering, and Medicine workshop and study. 

In preparing our manuscript for publication, we worked on designing a process to limit the spread of dangerous information while still enabling scientific progress. 

To address the dual challenges, we devised a tiered access system for data and methods, implemented in partnership with the International Biosecurity and Biosafety Initiative for Science (IBBIS) (opens in new tab), a nonprofit dedicated to advancing science while reducing catastrophic risks. The system works as follows:

  • Controlled access: Researchers can request access through IBBIS, providing their identity, affiliation, and intended use. Requests are reviewed by an expert biosecurity committee, ensuring that only legitimate scientists conducting relevant research gain access.
  • Stratified tiers of information: Data and code are classified into several tiers according to their potential hazard, from low-risk summaries through sensitive technical data to critical software pipelines.
  • Safeguards and agreements: Approved users sign tailored usage agreements, including non-disclosure terms, before receiving data.
  • Resilience and longevity: Provisions are built in for declassification when risks subside, and for succession of stewardship to trusted organizations should IBBIS be unable to continue its operation.

This framework allows replication and extension of our work while guarding against misuse. Rather than relying on secrecy, it provides a durable system of responsible access.

To ensure continued funding for the storage and responsible distribution of sensitive data and software, and for the operation of the sharing program, we provided an endowment to IBBIS to support the program in perpetuity. This approach was modeled after the One Hundred Year Study on AI at Stanford, which is endowed to continue for the life of the university.

An Important Step in Scientific Publishing

We are pleased that the leadership at Science accepted our approach to handling information hazards. To our knowledge, this is the first time a leading scientific journal has formally endorsed a tiered-access approach to manage an information hazard. This recognition validates the idea that rigorous science and responsible risk management can coexist—and that journals, too, can play a role in shaping how sensitive knowledge is shared. We acknowledge the visionary leadership at Science, including editors, Michael Funk and Valda Vinson, and Editor-in-Chief, Holden Thorp.

Beyond Biology: A Model for Sensitive Research

While developed for AI-powered protein design, our approach offers a generalizable model for dual-use research of concern (DURC) across disciplines. Whether in biology, chemistry, or emerging technologies, scientists will increasingly confront situations where openness and security pull in opposite directions. Our experience shows that these values can be balanced: with creativity, coordination, and new institutional mechanisms, science can uphold both reproducibility and responsibility.

We hope this framework becomes a template for future projects, offering a way forward for researchers who wish to share their insights without amplifying risks. By embedding resilience into how knowledge is communicated—not just what is communicated—we can ensure that scientific progress continues to serve humanity safely.

The responsible management of information hazards is no longer a peripheral concern: it is central to how science will advance in the age of powerful technologies like AI. This approach to managing information hazards demonstrates a path forward, where novel frameworks for access and stewardship allow sensitive but vital research to be shared, scrutinized, and extended responsibly. Approaches like this will be critical to ensuring that scientific openness and societal safety advance hand-in-hand.

Additional reading

Strengthening nucleic acid biosecurity screening against generative protein design tools.

The Age of AI in the Life Sciences: Benefits and Biosecurity Considerations, National Academies of Science, Engineering, and Medicine, 2025. (opens in new tab)

Disseminating In Silico and Computational Biological Research: Navigating Benefits and Risks: Proceedings of a Workshop, National Academies of Science, Engineering, and Medicine, 2025. (opens in new tab)

Protecting scientific integrity in an age of generative AI, Proceedings of the National Academy of Science, 2024. (opens in new tab)

Opens in a new tab

The post When AI Meets Biology: Promise, Risk, and Responsibility appeared first on Microsoft Research.

Categories: Microsoft

Using AI to assist in rare disease diagnosis

Mon, 09/22/2025 - 15:17

In the promising and rapidly evolving field of genetic analysis, the ability to accurately interpret whole genome sequencing data is crucial for diagnosing and improving outcomes for people with rare genetic diseases. Yet despite technological advancements, genetic professionals face steep challenges in managing and synthesizing the vast amounts of data required for these analyses. Fewer than 50% of initial cases yield a diagnosis, and while reanalysis can lead to new findings, the process remains time-consuming and complex. 

To better understand and address these challenges, Microsoft Research—in collaboration with Drexel University and the Broad Institute​​—conducted a comprehensive study titled AI-Enhanced Sensemaking: Exploring the Design of a Generative AI-Based Assistant to Support Genetic Professionals (opens in new tab). The study was recently published in a special edition of ACM Transactions on Interactive Intelligent Systems journal focused on generative AI.  

The study focused on integrating generative AI to support the complex, time-intensive, and information-dense sensemaking tasks inherent in whole genome sequencing analysis. Through detailed empirical research and collaborative design sessions with experts in the field, we identified key obstacles genetic professionals face and proposed AI-driven solutions to enhance their workflows. ​     ​We developed strategies for how generative AI can help synthesize biomedical data, enabling AI-expert collaboration to increase the diagnoses of previously unsolved rare diseases—ultimately aiming to improve patients’ quality of life and life expectancy.

Whole genome sequencing in rare disease diagnosis

Rare diseases affect up to half a billion people globally and obtaining a diagnosis can take multiple years. These diagnoses often involve specialist consultations, laboratory tests, imaging studies, and invasive procedures. Whole genome sequencing is used to identify genetic variants responsible for these diseases by comparing a patient’s DNA sequence to reference genomes. ​​Genetic professionals use bioinformatics tools such as seqr, an open-source, web-based tool for rare disease case analysis and project management to assist them in filtering and prioritizing  > 1 million variants to determine their potential role in disease. A critical component of their work is sensemaking: the process of searching, filtering, and synthesizing data to build, refine, and present models from complex sets of gene and variant information.  

​​The multi-step sequencing process​​​ typically takes three to 12 weeks and requires extensive amounts of evidence and time to synthesize and aggregate information ​​to understand the gene and variant effects for the patient. If a patient’s case goes unsolved, their whole genome sequencing data is set aside until enough time has passed to warrant a reanalysis​​. This creates a backlog of patient cases​​. The ability to easily identify when new scientific evidence emerges and when to reanalyze an unsolved patient case is key to shortening the time patients suffer with an unknown rare disease diagnosis. 

The promise of AI systems to assist with complex human tasks

Approximately 87% of AI systems never reach deployment ​simply because they solve​​​ the wrong problems. ​​Understanding the AI support desired by different types of professionals, their current workflows, and AI capabilities is critical to successful AI system deployment and use. Matching technology capabilities with user tasks is particularly challenging in AI design because AI models can generate numerous outputs, and their capabilities can be unclear. ​To design an effective​​​ AI-based system​, one needs to identify​ ​​tasks AI can support, ​​determine​​​​​​ the appropriate level of AI involvement, and ​​design​​​​​​ user-AI interactions. This necessitates considering how humans interact with technology and how ​​AI can best be incorporated into workflows and tools.

Azure AI Foundry Labs

Get a glimpse of potential future directions for AI, with these experimental technologies from Microsoft Research.

Azure AI Foundry Opens in a new tab Study objectives and co-designing a genetic AI assistant

Our study aimed to understand the current challenges and needs of genetic professionals performing whole genome sequencing analyses and explore the tasks where they want an AI assistant to support them in their work. The first phase of our study involved interviews with 17 genetics professionals to better understand their workflows, tools, and challenges. They included genetic analysts directly involved in interpreting data, as well as other roles participating in whole genome sequencing. In the second phase of our study, we conducted co-design sessions with study participants on how an AI assistant could support their workflows. We then developed a prototype of an AI assistant, which was further tested and refined with study participants in follow-up design walk-through sessions.

Identifying challenges in whole genome sequencing analysis

Through our in-depth interviews with genetic professionals, our study uncovered three critical challenges in whole genome sequencing analysis:

  1. Information Overload: Genetic analysts need to gather and synthesize vast amounts of data from multiple sources. This task is incredibly time-consuming and prone to human error.
  2. Collaborative Sharing: Sharing findings with others in the field can be cumbersome and inefficient, often relying on outdated methods that slow the collaborative analysis process.
  3. Prioritizing Reanalysis: Given the continuous influx of new scientific discoveries, prioritizing unsolved cases to reanalyze is a daunting challenge. Analysts need a systematic approach to identify cases that might benefit most from reanalysis.

Genetic professionals highlighted the time-consuming nature of gathering and synthesizing information about genes and variants from different data sources. Other genetic professionals may have insights into certain genes and variants, but sharing and interpreting information with others for collaborative sensemaking requires significant time and effort. Although new scientific findings could affect unsolved cases through reanalysis, prioritizing cases based on new findings was challenging given the number of unsolved cases and limited time of genetic professionals.

Co-designing with experts and AI-human sensemaking tasks

Our study participants prioritized two potential tasks of an AI assistant. The first task was flagging cases for reanalysis based on new scientific findings. The assistant would alert analysts to unsolved cases that could benefit from new research, providing relevant updates drawn from recent publications. The second task focused on aggregating and synthesizing information about genes and variants from the scientific literature. This feature would compile essential information from numerous scientific papers about genes and variants, presenting it in a user-friendly format and saving analysts significant time and effort. Participants emphasized the need to balance selectivity with comprehensiveness in the evidence they review. They also envisioned collaborating with other genetic professionals to interpret, edit, and verify artifacts generated by the AI assistant.

Genetic professionals require both broad and focused evidence at different stages of their workflow. The AI assistant prototypes were designed to allow flexible filtering and thorough evidence aggregation, ensuring users can delve into comprehensive data or selectively focus on pertinent details. The prototypes included features for collaborative sensemaking, enabling users to interpret, edit, and verify AI-generated information collectively. This ​​approach not only ​underscores​​​ the trustworthiness of AI outputs, but also facilitates shared understanding and decision-making among genetic professionals.

Design implications for expert-AI sensemaking

In the shifting frontiers of genome sequence analysis, leveraging generative AI to enhance sensemaking offers intriguing possibilities​​. The task of staying ​​current​​​​​​, synthesizing information from diverse sources, and making informed decisions ​​is challenging​​​​​​.  

Our study participants emphasized the hurdles in integrating data from multiple sources without losing critical components, documenting decision rationales, and fostering collaborative environments. Generative AI models, with their advanced capabilities, have started to address these challenges by automatically generating interactive artifacts to support sensemaking. However, the effectiveness of such systems hinges on careful design considerations, ​​particularly in how they facilitate distributed sensemaking, support both initial and ongoing sensemaking, and combine evidence from multiple modalities. We next discuss three design considerations for using generative AI models to support sensemaking.

Distributed expert-AI sensemaking design

Generative AI models can create artifacts that aid an individual user’s sensemaking process; however, the true potential lies in sharing these artifacts among users to foster collective understanding and efficiency. Participants in our study emphasized the importance of explainability, feedback, and trust when interacting with AI-generated content. ​​​​​​​​​​Trust is gained by​​​​​​ viewing portions of artifacts marked as correct by other users, or observing edits made to AI-generated information​​. ​​Some​​​​​​ users​, however,​ cautioned against over-reliance on AI, which could obscure underlying inaccuracies. Thus, design strategies should ensure that any corrections are clearly marked ​​and annotated​​​​​​. Furthermore, to enhance distributed sensemaking, visibility of others’ notes and context-specific synthesis through AI can streamline the process​​. 

Initial expert-AI sensemaking and re-sensemaking design

In our fast-paced, information-driven world, ​​it is essential to understand a situation both initially and again when new information arises.​​ ​​Sensemaking is inherently temporal, reflecting and shaping our understanding of time as we revisit tasks to reevaluate past decisions or incorporate new information. Generative AI plays a pivotal role here by transforming static data into dynamic artifacts that evolve, offering a comprehensive view of past rationales. Such AI-generated artifacts provide continuity, allowing users—both original decision-makers or new individuals—to access the rationale behind decisions made in earlier task instances. By continuously editing and updating these artifacts, generative AI highlights new information since the last review, supporting ongoing understanding and decision-making. Moreover, AI systems enhance ​​transparency​​​​​​ by summarizing previous notes and questions, offering insights into earlier thought processes and facilitating a deeper understanding of how conclusions were drawn. This reflective capability not only can reinforce initial sensemaking efforts but also equips users with the clarity needed for informed re-sensemaking as new data emerges. 

Combining evidence from multiple modalities to enhance AI-expert sensemaking

​​​The​​​​​​ ability to combine evidence from multiple modalities is essential for effective sensemaking. Users often need to integrate diverse types of data—text, images, spatial coordinates, and more—into a coherent narrative to make informed decisions. Consider the case of search and rescue operations, where workers must rapidly synthesize information from texts, photographs, and GPS data to strategize their efforts. Recent advancements in multimodal generative AI models have empowered users by incorporating and synthesizing these varied inputs into a unified, comprehensive view. For instance, a participant in our study illustrated this capability by using a generative AI model to merge text from scientific publications with a visual gene structure depiction. This integration ​​could create​​​​​​ an image that contextualizes an individual’s genetic variant within the ​​context​​​​​​ of documented variants. Such advanced synthesis enables users to capture complex relationships and insights briefly, streamlining decision-making and expanding the potential for innovative solutions across diverse fields. 

Sensemaking Process with AI Assistant Figure: Sensemaking process when interpreting variants with the introduction of prototype AI assistant. Gray boxes represent sensemaking activities which are currently performed by an analyst but are human-in-the-loop processes with involvement of our prototype AI assistant. Non-gray boxes represent activities reserved for analyst completion without assistance by our AI assistant prototype. Within the foraging searching and synthesizing processes, examples of data sources and data types for each, respectively, are connected by dotted lines. Conclusion

We explored the potential of generative AI to support​​ genetic professionals​ ​in diagnosing rare diseases​​. By designing an AI-based assistant, we aim to streamline whole genome sequencing analysis, helping professionals diagnose rare genetic diseases more efficiently. Our study unfolded in two key phases: ​pinpointing​​​ existing challenges in analysis, and design ideation, where we crafted a prototype AI assistant. This tool is designed to boost diagnostic yield and cut down diagnosis time by flagging cases for reanalysis and synthesizing crucial gene and variant data. Despite valuable findings, more research is needed​​. Future research will involve testing the AI assistant in real-time, task-based user testing with genetic professionals to assess the AI’s impact on their workflow. The promise of AI advancements lies in solving the right user problems and building the appropriate solutions, achieved through collaboration among model developers, domain experts, system designers, and HCI researchers. By fostering these collaborations, we aim to develop robust, personalized AI assistants tailored to specific domains. 

Join the conversation

Join us as we continue to explore the transformative potential of generative AI in genetic analysis, and please read the full text publication here (opens in new tab). Follow us on social media, share this post with your network, and let us know your thoughts on how AI can transform genetic research. If interested in our other related research work, check out Evidence Aggregator: AI reasoning applied to rare disease diagnosis. (opens in new tab)  

Opens in a new tab

The post Using AI to assist in rare disease diagnosis appeared first on Microsoft Research.

Categories: Microsoft

Using AI to assist in rare disease diagnosis

Mon, 09/22/2025 - 15:17

In the promising and rapidly evolving field of genetic analysis, the ability to accurately interpret whole genome sequencing data is crucial for diagnosing and improving outcomes for people with rare genetic diseases. Yet despite technological advancements, genetic professionals face steep challenges in managing and synthesizing the vast amounts of data required for these analyses. Fewer than 50% of initial cases yield a diagnosis, and while reanalysis can lead to new findings, the process remains time-consuming and complex. 

To better understand and address these challenges, Microsoft Research—in collaboration with Drexel University and the Broad Institute​​—conducted a comprehensive study titled AI-Enhanced Sensemaking: Exploring the Design of a Generative AI-Based Assistant to Support Genetic Professionals (opens in new tab). The study was recently published in a special edition of ACM Transactions on Interactive Intelligent Systems journal focused on generative AI.  

The study focused on integrating generative AI to support the complex, time-intensive, and information-dense sensemaking tasks inherent in whole genome sequencing analysis. Through detailed empirical research and collaborative design sessions with experts in the field, we identified key obstacles genetic professionals face and proposed AI-driven solutions to enhance their workflows. ​     ​We developed strategies for how generative AI can help synthesize biomedical data, enabling AI-expert collaboration to increase the diagnoses of previously unsolved rare diseases—ultimately aiming to improve patients’ quality of life and life expectancy.

Whole genome sequencing in rare disease diagnosis

Rare diseases affect up to half a billion people globally and obtaining a diagnosis can take multiple years. These diagnoses often involve specialist consultations, laboratory tests, imaging studies, and invasive procedures. Whole genome sequencing is used to identify genetic variants responsible for these diseases by comparing a patient’s DNA sequence to reference genomes. ​​Genetic professionals use bioinformatics tools such as seqr, an open-source, web-based tool for rare disease case analysis and project management to assist them in filtering and prioritizing  > 1 million variants to determine their potential role in disease. A critical component of their work is sensemaking: the process of searching, filtering, and synthesizing data to build, refine, and present models from complex sets of gene and variant information.  

​​The multi-step sequencing process​​​ typically takes three to 12 weeks and requires extensive amounts of evidence and time to synthesize and aggregate information ​​to understand the gene and variant effects for the patient. If a patient’s case goes unsolved, their whole genome sequencing data is set aside until enough time has passed to warrant a reanalysis​​. This creates a backlog of patient cases​​. The ability to easily identify when new scientific evidence emerges and when to reanalyze an unsolved patient case is key to shortening the time patients suffer with an unknown rare disease diagnosis. 

The promise of AI systems to assist with complex human tasks

Approximately 87% of AI systems never reach deployment ​simply because they solve​​​ the wrong problems. ​​Understanding the AI support desired by different types of professionals, their current workflows, and AI capabilities is critical to successful AI system deployment and use. Matching technology capabilities with user tasks is particularly challenging in AI design because AI models can generate numerous outputs, and their capabilities can be unclear. ​To design an effective​​​ AI-based system​, one needs to identify​ ​​tasks AI can support, ​​determine​​​​​​ the appropriate level of AI involvement, and ​​design​​​​​​ user-AI interactions. This necessitates considering how humans interact with technology and how ​​AI can best be incorporated into workflows and tools.

Spotlight: Event Series

Microsoft Research Forum

Join us for a continuous exchange of ideas about research in the era of general AI. Watch the first four episodes on demand.

Watch on-demand Opens in a new tab Study objectives and co-designing a genetic AI assistant

Our study aimed to understand the current challenges and needs of genetic professionals performing whole genome sequencing analyses and explore the tasks where they want an AI assistant to support them in their work. The first phase of our study involved interviews with 17 genetics professionals to better understand their workflows, tools, and challenges. They included genetic analysts directly involved in interpreting data, as well as other roles participating in whole genome sequencing. In the second phase of our study, we conducted co-design sessions with study participants on how an AI assistant could support their workflows. We then developed a prototype of an AI assistant, which was further tested and refined with study participants in follow-up design walk-through sessions.

Identifying challenges in whole genome sequencing analysis

Through our in-depth interviews with genetic professionals, our study uncovered three critical challenges in whole genome sequencing analysis:

  1. Information Overload: Genetic analysts need to gather and synthesize vast amounts of data from multiple sources. This task is incredibly time-consuming and prone to human error.
  2. Collaborative Sharing: Sharing findings with others in the field can be cumbersome and inefficient, often relying on outdated methods that slow the collaborative analysis process.
  3. Prioritizing Reanalysis: Given the continuous influx of new scientific discoveries, prioritizing unsolved cases to reanalyze is a daunting challenge. Analysts need a systematic approach to identify cases that might benefit most from reanalysis.

Genetic professionals highlighted the time-consuming nature of gathering and synthesizing information about genes and variants from different data sources. Other genetic professionals may have insights into certain genes and variants, but sharing and interpreting information with others for collaborative sensemaking requires significant time and effort. Although new scientific findings could affect unsolved cases through reanalysis, prioritizing cases based on new findings was challenging given the number of unsolved cases and limited time of genetic professionals.

Co-designing with experts and AI-human sensemaking tasks

Our study participants prioritized two potential tasks of an AI assistant. The first task was flagging cases for reanalysis based on new scientific findings. The assistant would alert analysts to unsolved cases that could benefit from new research, providing relevant updates drawn from recent publications. The second task focused on aggregating and synthesizing information about genes and variants from the scientific literature. This feature would compile essential information from numerous scientific papers about genes and variants, presenting it in a user-friendly format and saving analysts significant time and effort. Participants emphasized the need to balance selectivity with comprehensiveness in the evidence they review. They also envisioned collaborating with other genetic professionals to interpret, edit, and verify artifacts generated by the AI assistant.

Genetic professionals require both broad and focused evidence at different stages of their workflow. The AI assistant prototypes were designed to allow flexible filtering and thorough evidence aggregation, ensuring users can delve into comprehensive data or selectively focus on pertinent details. The prototypes included features for collaborative sensemaking, enabling users to interpret, edit, and verify AI-generated information collectively. This ​​approach not only ​underscores​​​ the trustworthiness of AI outputs, but also facilitates shared understanding and decision-making among genetic professionals.

Design implications for expert-AI sensemaking

In the shifting frontiers of genome sequence analysis, leveraging generative AI to enhance sensemaking offers intriguing possibilities​​. The task of staying ​​current​​​​​​, synthesizing information from diverse sources, and making informed decisions ​​is challenging​​​​​​.  

Our study participants emphasized the hurdles in integrating data from multiple sources without losing critical components, documenting decision rationales, and fostering collaborative environments. Generative AI models, with their advanced capabilities, have started to address these challenges by automatically generating interactive artifacts to support sensemaking. However, the effectiveness of such systems hinges on careful design considerations, ​​particularly in how they facilitate distributed sensemaking, support both initial and ongoing sensemaking, and combine evidence from multiple modalities. We next discuss three design considerations for using generative AI models to support sensemaking.

Distributed expert-AI sensemaking design

Generative AI models can create artifacts that aid an individual user’s sensemaking process; however, the true potential lies in sharing these artifacts among users to foster collective understanding and efficiency. Participants in our study emphasized the importance of explainability, feedback, and trust when interacting with AI-generated content. ​​​​​​​​​​Trust is gained by​​​​​​ viewing portions of artifacts marked as correct by other users, or observing edits made to AI-generated information​​. ​​Some​​​​​​ users​, however,​ cautioned against over-reliance on AI, which could obscure underlying inaccuracies. Thus, design strategies should ensure that any corrections are clearly marked ​​and annotated​​​​​​. Furthermore, to enhance distributed sensemaking, visibility of others’ notes and context-specific synthesis through AI can streamline the process​​. 

Initial expert-AI sensemaking and re-sensemaking design

In our fast-paced, information-driven world, ​​it is essential to understand a situation both initially and again when new information arises.​​ ​​Sensemaking is inherently temporal, reflecting and shaping our understanding of time as we revisit tasks to reevaluate past decisions or incorporate new information. Generative AI plays a pivotal role here by transforming static data into dynamic artifacts that evolve, offering a comprehensive view of past rationales. Such AI-generated artifacts provide continuity, allowing users—both original decision-makers or new individuals—to access the rationale behind decisions made in earlier task instances. By continuously editing and updating these artifacts, generative AI highlights new information since the last review, supporting ongoing understanding and decision-making. Moreover, AI systems enhance ​​transparency​​​​​​ by summarizing previous notes and questions, offering insights into earlier thought processes and facilitating a deeper understanding of how conclusions were drawn. This reflective capability not only can reinforce initial sensemaking efforts but also equips users with the clarity needed for informed re-sensemaking as new data emerges. 

Combining evidence from multiple modalities to enhance AI-expert sensemaking

​​​The​​​​​​ ability to combine evidence from multiple modalities is essential for effective sensemaking. Users often need to integrate diverse types of data—text, images, spatial coordinates, and more—into a coherent narrative to make informed decisions. Consider the case of search and rescue operations, where workers must rapidly synthesize information from texts, photographs, and GPS data to strategize their efforts. Recent advancements in multimodal generative AI models have empowered users by incorporating and synthesizing these varied inputs into a unified, comprehensive view. For instance, a participant in our study illustrated this capability by using a generative AI model to merge text from scientific publications with a visual gene structure depiction. This integration ​​could create​​​​​​ an image that contextualizes an individual’s genetic variant within the ​​context​​​​​​ of documented variants. Such advanced synthesis enables users to capture complex relationships and insights briefly, streamlining decision-making and expanding the potential for innovative solutions across diverse fields. 

Sensemaking Process with AI Assistant Figure: Sensemaking process when interpreting variants with the introduction of prototype AI assistant. Gray boxes represent sensemaking activities which are currently performed by an analyst but are human-in-the-loop processes with involvement of our prototype AI assistant. Non-gray boxes represent activities reserved for analyst completion without assistance by our AI assistant prototype. Within the foraging searching and synthesizing processes, examples of data sources and data types for each, respectively, are connected by dotted lines. Conclusion

We explored the potential of generative AI to support​​ genetic professionals​ ​in diagnosing rare diseases​​. By designing an AI-based assistant, we aim to streamline whole genome sequencing analysis, helping professionals diagnose rare genetic diseases more efficiently. Our study unfolded in two key phases: ​pinpointing​​​ existing challenges in analysis, and design ideation, where we crafted a prototype AI assistant. This tool is designed to boost diagnostic yield and cut down diagnosis time by flagging cases for reanalysis and synthesizing crucial gene and variant data. Despite valuable findings, more research is needed​​. Future research will involve testing the AI assistant in real-time, task-based user testing with genetic professionals to assess the AI’s impact on their workflow. The promise of AI advancements lies in solving the right user problems and building the appropriate solutions, achieved through collaboration among model developers, domain experts, system designers, and HCI researchers. By fostering these collaborations, we aim to develop robust, personalized AI assistants tailored to specific domains. 

Join the conversation

Join us as we continue to explore the transformative potential of generative AI in genetic analysis, and please read the full text publication here (opens in new tab). Follow us on social media, share this post with your network, and let us know your thoughts on how AI can transform genetic research. If interested in our other related research work, check out Evidence Aggregator: AI reasoning applied to rare disease diagnosis. (opens in new tab)  

Opens in a new tab

The post Using AI to assist in rare disease diagnosis appeared first on Microsoft Research.

Categories: Microsoft

Tool-space interference in the MCP era: Designing for agent compatibility at scale

Thu, 09/11/2025 - 17:00

This year we’ve seen remarkable advances in agentic AI, including systems that conduct deep research, operate computers, complete substantial software engineering tasks, and tackle a range of other complex, multi-step goals. In each case, the industry relied on careful vertical integration: tools and agents were co-designed, co-trained, and tested together for peak performance. For example, OpenAI’s recent models presume the availability of web search and document retrieval tools (opens in new tab). Likewise, the prompts and actions of Magentic-One are set up to make hand-offs easy—for example, allowing the WebSurfer agent to pass downloaded files to the Coder agent.  But as agents proliferate, we anticipate strategies relying heavily on vertical integration will not age well. Agents from different developers or companies will increasingly encounter each other and must work together to complete tasks, in what we refer to as a society of agents. These systems can vary in how coordinated they are, how aligned their goals are, and how much information they share. Can heterogenous agents and tools cooperate in this setting, or will they hinder one another and slow progress?

Early clues have emerged from an unexpected source: namely, Model Context Protocol (opens in new tab) (MCP). Since January 2025, MCP has grown from a promising spec to a thriving market of tool servers. As an example, Zapier boasts a catalog of 30,000 tools (opens in new tab) across 7,000 services. Composio provide over 100 managed MCP servers (opens in new tab), surfacing hundreds of tools. Hugging Face is now serving many Spaces apps over MCP (opens in new tab), and Shopify has enabled MCP for millions of storefronts (opens in new tab). A society of tools is already here, and it promises to extend agent capabilities through cross-provider horizontal integration. 

So, what does MCP have to say about horizontal integration? As catalogs grow, we expect some new failure modes to surface. This blog post introduces these as tool-space interference, and sketches both early observations and some pragmatic interventions to keep the society we’re building from stepping on its own feet. 

Tool-space interference describes situations where otherwise reasonable tools or agents, when co-present, reduce end-to-end effectiveness. This can look like longer action sequences, higher token cost, brittle recovery from errors, or, in some cases, task failure.

A framing example

Consider MCP as a means for extending Magentic-One, a generalist multi-agent system we released last year, to cover more software engineering tasks. Magentic-One ships with agents to write code, interact with the computer terminal, browse the web, and access local files. To help Magentic-One navigate version control, find issues to solve, and make pull requests, we could add an agent equipped with the GitHub MCP Server. However, now each time the team encounters a task involving GitHub, it must choose whether to visit github.com in the browser, execute a git command at the command line, or engage the GitHub MCP server. As the task progresses, agent understanding of state can also diverge: changing the branch in the browser won’t change the branch in the terminal, and an authorized MCP tool does not imply authorization in the browser. Thus, while any single agent might complete the task efficiently, the larger set of agents might misunderstand or interfere with one another, leading to additional rounds of debugging, or even complete task failure.

Figure 1: We can extend Magentic-One by adding an agent that equips the GitHub MCP server. However, on every turn involving a git-related task, the orchestrator will need to decide between messaging the Computer Terminal agent (with access to the git command line interface), WebSurfer agent (with access to github.com), and the agent with the GitHub MCP server. This overlap raises the possibility that they will interfere with one another.   Tool-space interference, through the lens of MCP

To better understand the potential interference patterns and the current state of the MCP ecosystem, we conducted a survey of MCP servers listed on two registries: smithery.ai (opens in new tab) and Docker MCP Hub (opens in new tab). Smithery is an MCP Server registry with over 7,000 first-party and community-contributed servers, which we sampled from the Smithery API. Likewise, Docker MCP Hub is a registry that distributes MCP servers as Docker images, and we manually collected popular entries. We then launched each server for inspection. After excluding servers that were empty or failed to launch, and deduplicating servers with identical features, 1,470 servers remained in our catalog.

To automate the inspection of running MCP servers, we developed an MCP Interviewer tool. The MCP Interviewer begins by cataloging the server’s tools, prompts, resources, resource templates, and capabilities. From this catalog we can compute descriptive statistics such as the number of tools, or the depth of the parameter schemas.  Then, given the list of available tools, the interviewer uses an LLM (in our case, OpenAI’s GPT-4.1) to construct a functional testing plan that calls each tool at least once, collecting outputs, errors, and statistics along the way. Finally, the interviewer can also grade more qualitative criteria by using an LLM to apply purpose-built rubrics to tool schemas and tool call outputs.  We are excited to release the MCP Interviewer as an open-source CLI tool (opens in new tab), so server developers can automatically evaluate their MCP servers with agent usability in mind, and users can validate new servers. 

While our survey provides informative initial results, it also faces significant limitations, the most obvious of which is authorization: many of the most popular MCP servers provide access to services that require authorization to use, hindering automated analysis. We are often still able to collect static features from these servers but are limited in the functional testing that can be done.

One-size fits all (but some more than others)

So, what does our survey of MCP servers tell us about the MCP ecosystem? We will get into the numbers in a moment, but as we contemplate the statistics, there is one overarching theme to keep in mind: MCP servers do not know which clients or models they are working with, and present one common set of tools, prompts, and resources to everyone. However, some models handle long contexts and large tool spaces better than others (with diverging hard limits), and respond quite differently to common prompting patterns. For example, OpenAI’s guide on function calling (opens in new tab) advises developers to:

Include examples and edge cases, especially to rectify any recurring failures. (Note: Adding examples may hurt performance for reasoning models).”

So already, this places MCP at a disadvantage over vertical integrations that optimize to the operating environment. And with that, let’s dive into more numbers.

Tool count

While models generally vary in their proficiency for tool calling, the general trend has been that performance drops as the number of tools increases. For example, OpenAI limits developers to 128 tools, but recommends (opens in new tab) that developers:

Keep the number of functions small for higher accuracy. Evaluate your performance with different numbers of functions. Aim for fewer than 20 functions at any one time, though this is just a soft suggestion.

While we expect this to improve with each new model generation, at present, large tool spaces can lower performance by up to 85% for some models (opens in new tab). Thankfully, the majority of servers in our survey contain four or fewer tools. But there are outliers: the largest MCP server we cataloged adds 256 distinct tools, while the 10 next-largest servers add more than 100 tools each. Further down the list we find popular servers like Playwright-MCP (opens in new tab) (29 tools, at the time of this writing), and GitHub MCP (91 tools, with subsets available at alternative endpoint URLs), which might be too large for some models.

Figure 2: The number of tools listed by each catalogued server directly after initialization. Note: servers can change the tools they list at any time, but only 226 servers in our catalog declare this capability. Response length

Tools are generally called in agentic loops, where the output is then fed back into the model as input context. Models have hard limits on input context, but even within these limits, large contexts can drive costs up and performance down, so practical limits can be much lower (opens in new tab). MCP offers no guidance on how many tokens a tool call can produce, and the size of some responses can come as a surprise. In our analysis, we consider the 2,443 tool calls across 1,312 unique tools that the MCP Interviewer was able to call successfully during the active testing phase of server inspection. While a majority of tools produced 98 or fewer tokens (opens in new tab), some tools are extraordinarily heavyweight: the top tool returned an average of 557,766 tokens, which is enough to swamp the context windows of many popular models like GPT-5. Further down the list, we find that 16 tools produce more than 128,000 tokens, swamping GPT-4o and other popular models. Even when responses fit into the context window length, overly long responses can significantly degrade performance (up to 91% in one study (opens in new tab)), and limit the number of future calls that can be made. Of course, agents are free to implement their own context management strategies, but this behavior is left undefined in the MCP specification and server developers cannot count on any particular client behavior or strategy.

# of tools that would overflow context inModelContext Window1 call2 calls3-5 calls6-10 callsGPT 4.11,000,00001711GPT 5400,000171525GPT-4o, Llama 3.1,128,00016153340Qwen 332,00056378690Phi-416,0009360116109 Figure 3: Tool call response length averages, in tokens, as observed by the MCP Interviewer’s functional test plan. Only successful tool calls are considered. Horizontal lines indicate context window limits for GPT-4o and GPT-5. Tool parameter complexity

Mirroring the challenges from increasing the number of tools, increasing the complexity of a tool’s parameter space can also lead to degradation. For example, while MCP tools can take complex object types and structures as parameters, composio (opens in new tab) found that flattening the parameter space could improve tool-calling performance by 47% compared to baseline performance.  In our analysis, we find numerous examples of deeply nested structure—in one case, going 20 levels deep.

Figure 4: The maximum depth of each tool’s input properties schema. A depth of 0 indicates a tool with no properties. A depth of 1 indicates a tool with named properties but no annotations (e.g., no description or type). A depth of 2 indicates a tool with named and annotated properties.  A depth of 3+ indicates a tool with structured properties that have additional nested annotations.  Namespacing issues and naming ambiguity

Another often-cited issue with the current MCP specification is the lack of a formal namespace mechanism (opens in new tab). If two servers are registered to the same agent or application, and the servers have tool names in common, then disambiguation becomes impossible. Libraries like the OpenAI Agents SDK raise an error (opens in new tab) under this circumstance. Clients, like Claude Code, prefix tool names with unique identifiers to work around this issue. In our analysis of MCP servers, we found name collisions between 775 tools. The most common collision was “search”, which appears across 32 distinct MCP servers. The following table lists the top 10 collisions.

Tool NameNumber of Instancessearch32get_user11execute_query11list_tables10update_task9generate_image9send_message9execute_command8list_tasks8search_files8

Even when names are unique, they can be semantically similar. If these tools behave similarly, then the redundancy may not be immediately problematic, but if you are expecting to call a particular tool then the name similarities raise the potential for confusion. The following table lists some examples of semantically similar tool names relating to web search:

websearchbrave_web_searchsearch-webtavily_web_searchweb_searchgoogle_news_searchsearch_webgoogle-play-searchsearch_webkrgoogle_search_parsedgoogle_searchsearch_google_imagessearch_googleget_webset_search_exaai_web_searchsearch_google_scholarweb_search_exaduckduckgo_web_searchsearch_web_toolgoogle_search_scraperweb_search_agentanswer_query_websearchbatch-web-search  Errors and error messages

Like all software libraries, MCP will occasionally encounter error conditions. In these cases, it is important to provide sufficient information for the agent to handle the error and plan next steps. In our analysis, we found this was not always the case. While MCP provides an “IsError” flag to signal errors, we found that it was common for servers to handle errors by returning strings while leaving this flag set to false, signaling a normal exit. Out of 5,983 tool call results with no error flag, GPT-4.1 judged that 3,536 indicated errors in their content. More worrisome: the error messages were often of low quality. For instance, one tool providing web search capabilities failed with the string “error: job,” while another tool providing academic search returned “Please retry with 0 or fewer IDs.”

Resource sharing conventions

Finally, in addition to tools, MCP allows servers to share resources and resource templates with clients. In our survey, only 112 (7.6%) servers reported any resources, while 74 (5%) provided templates. One potential reason for low adoption is that the current MCP specification provides limited guidance for when resources are retrieved, or how they are incorporated into context. One clearcut situation where a client might retrieve a resource is in response to a tool returning a resource_link (opens in new tab) as a result — but only 4 tools exhibited this behavior in our survey (arguably, this would be the ideal behavior for tools that return very long, document-like responses, as outlined earlier).

Conversely, a whole different set of issues arises when there is a need to share resources from the client to the server. Consider for example a tool that provides some analysis of a local PDF file. In the case of a local MCP server utilizing STDIO transport, a local file path can be provided as an argument to the tool, but no similar conventions exist for delivering a local file to a remote MCP server. These issues are challenging enough when implementing a single server. When multiple tools or servers need to interact within the same system, the risk of interoperability errors compounds.

Recommendations

On balance, along any given dimension, the average MCP server is quite reasonable—but, as we have seen, outliers and diverging assumptions can introduce trouble. While we expect many of these challenges to improve with time, we are comfortable making small recommendations that we feel are evergreen. We organize them below by audience.

Protocol developers

We recognize the advantages of keeping MCP relatively lightweight, avoiding being overly prescriptive in an environment where AI models and use cases are rapidly changing. However, a few small recommendations are warranted. First, we believe MCP should be extended to include a specification for client-provided resources so that tools on remote servers have a mechanism for operating on specified local files or documents. This would more effectively position MCP as a clearinghouse for resources passed between steps of agentic workflows. The MCP specification would also benefit from taking a more opinionated stance on when resources are retrieved and used overall.

Likewise, we believe MCP should quickly move to provide formal namespaces to eliminate tool name collisions. If namespaces are hierarchical, then this also provides a way of organizing large catalogs of functions into thematically related tool sets. Tool sets, as an organizing principle, are already showing some promise in GitHub MCP Server’s dynamic tool discovery, (opens in new tab) and VS Code’s tool grouping (with virtual tools) (opens in new tab), where agents or users can enable and disable tools as needed.  In the future, a standardized mechanism for grouping tools would allow clients to engage in hierarchical tool-calling, where they first select a category, then select a tool, without needing to keep all possible tools in context.

Server developers

While our MCP Interviewer tool can catalog many outward-facing properties of MCP servers, developers are often in a much better position to characterize the nature of their tools. To this end, we believe developers should publish an MCP Server card alongside their servers or services, clearly outlining the runtime characteristics of the tools (e.g., the expected number of tokens generated, or expected latency of a tool call). Ideally developers should also indicate which models, agents and clients the server was tested with, how the tools were tested (e.g., provide sample tasks), list any known incompatibilities, and be mindful of limitations of various models throughout development.

Client developers

Client developers have the opportunity to experiment with various mitigations or optimizations that might help the average MCP server work better for a given system or environment. For example, clients could cache tool schemas, serving them as targets for prompt optimizations, or as an index for RAG-like tool selection approaches. To this end, Anthropic recently reported using a tool testing agent (opens in new tab) to rewrite the prompts of defective MCP servers, improving task completion time by 40%. Likewise, rather than waiting for the protocol to evolve, clients could take proactive steps to resolve name collisions— for example, generating namespaces from server names—and could reduce token outputs by summarizing or paginating long tool results.

Market developers

Finally, we see an opportunity for marketplaces to codify best-practices, spot compatibility issues at a global level, and perhaps centralize the generation and serving of model or agent-specific optimizations. Mirroring how a market like PyPI distributes Python wheels matched to a developer’s operating system or processor (opens in new tab), an MCP marketplace could serve tool schemas optimized for a developer’s chosen LLM, agent or client library. We are already seeing small steps in this direction, with registries like Smithery providing customized launch configurations to match users’ clients.

Conclusion

In summary, the MCP ecosystem offers significant value for AI agent development, despite some early growing pains. Grounded in insights from the MCP Interviewer (opens in new tab) and our survey of live servers, the evidence is clear: horizontal integration is expanding capability, yet it also exposes forms of toolspace interference that can erode end to end effectiveness. Anticipating rapid advances in model capability and growing architectural diversity, the recommendations provided here aim to ensure that protocol, server, client, and marketplace developers are well positioned to adapt and thrive. Key steps include implementing formal namespaces to eliminate collisions, enhancing protocol support for client provided resources, and encouraging transparent server documentation to foster interoperability and robust development practices across the ecosystem. 

By embracing these evergreen recommendations and proactively addressing compatibility, usability, and optimization issues, the AI agent community can create a more reliable, scalable, and efficient infrastructure that benefits both developers and end users. The future of MCP is bright, with ample opportunities for experimentation, standardization, and collective progress.

Opens in a new tab

The post Tool-space interference in the MCP era: Designing for agent compatibility at scale appeared first on Microsoft Research.

Categories: Microsoft

RenderFormer: How neural networks are reshaping 3D rendering

Wed, 09/10/2025 - 17:00

3D rendering—the process of converting three-dimensional models into two-dimensional images—is a foundational technology in computer graphics, widely used across gaming, film, virtual reality, and architectural visualization. Traditionally, this process has depended on physics-based techniques like ray tracing and rasterization, which simulate light behavior through mathematical formulas and expert-designed models.

Now, thanks to advances in AI, especially neural networks, researchers are beginning to replace these conventional approaches with machine learning (ML). This shift is giving rise to a new field known as neural rendering.

Neural rendering combines deep learning with traditional graphics techniques, allowing models to simulate complex light transport without explicitly modeling physical optics. This approach offers significant advantages: it eliminates the need for handcrafted rules, supports end-to-end training, and can be optimized for specific tasks. Yet, most current neural rendering methods rely on 2D image inputs, lack support for raw 3D geometry and material data, and often require retraining for each new scene—limiting their generalizability.

RenderFormer: Toward a general-purpose neural rendering model

To overcome these limitations, researchers at Microsoft Research have developed RenderFormer, a new neural architecture designed to support full-featured 3D rendering using only ML—no traditional graphics computation required. RenderFormer is the first model to demonstrate that a neural network can learn a complete graphics rendering pipeline, including support for arbitrary 3D scenes and global illumination, without relying on ray tracing or rasterization. This work has been accepted at SIGGRAPH 2025 and is open-sourced on GitHub (opens in new tab).

Architecture overview

As shown in Figure 1, RenderFormer represents the entire 3D scene using triangle tokens—each one encoding spatial position, surface normal, and physical material properties such as diffuse color, specular color, and roughness. Lighting is also modeled as triangle tokens, with emission values indicating intensity.

Figure 1. Architecture of RenderFormer

To describe the viewing direction, the model uses ray bundle tokens derived from a ray map—each pixel in the output image corresponds to one of these rays. To improve computational efficiency, pixels are grouped into rectangular blocks, with all rays in a block processed together.

The model outputs a set of tokens that are decoded into image pixels, completing the rendering process entirely within the neural network.

PODCAST SERIES

The AI Revolution in Medicine, Revisited

Join Microsoft’s Peter Lee on a journey to discover how AI is impacting healthcare and what it means for the future of medicine.

Listen now Opens in a new tab Dual-branch design for view-independent and view-dependent effects

The RenderFormer architecture is built around two transformers: one for view-independent features and another for view-dependent ones.

  • The view-independent transformer captures scene information unrelated to viewpoint, such as shadowing and diffuse light transport, using self-attention between triangle tokens.
  • The view-dependent transformer models effects like visibility, reflections, and specular highlights through cross-attention between triangle and ray bundle tokens.

Additional image-space effects, such as anti-aliasing and screen-space reflections, are handled via self-attention among ray bundle tokens.

To validate the architecture, the team conducted ablation studies and visual analyses, confirming the importance of each component in the rendering pipeline.

Table 1. Ablation study analyzing the impact of different components and attention mechanisms on the final performance of the trained network.

To test the capabilities of the view-independent transformer, researchers trained a decoder to produce diffuse-only renderings. The results, shown in Figure 2, demonstrate that the model can accurately simulate shadows and other indirect lighting effects.

Figure 2. View-independent rendering effects decoded directly from the view-independent transformer, including diffuse lighting and coarse shadow effects.

The view-dependent transformer was evaluated through attention visualizations. For example, in Figure 3, the attention map reveals a pixel on a teapot attending to its surface triangle and to a nearby wall—capturing the effect of specular reflection. These visualizations also show how material changes influence the sharpness and intensity of reflections.

Figure 3. Visualization of attention outputs Training methodology and dataset design

RenderFormer was trained using the Objaverse dataset, a collection of more than 800,000 annotated 3D objects that is designed to advance research in 3D modeling, computer vision, and related fields. The researchers designed four scene templates, populating each with 1–3 randomly selected objects and materials. Scenes were rendered in high dynamic range (HDR) using Blender’s Cycles renderer, under varied lighting conditions and camera angles.

The base model, consisting of 205 million parameters, was trained in two phases using the AdamW optimizer:

  • 500,000 steps at 256×256 resolution with up to 1,536 triangles
  • 100,000 steps at 512×512 resolution with up to 4,096 triangles

The model supports arbitrary triangle-based input and generalizes well to complex real-world scenes. As shown in Figure 4, it accurately reproduces shadows, diffuse shading, and specular highlights.

Figure 4. Rendered results of different 3D scenes generated by RenderFormer

RenderFormer can also generate continuous video by rendering individual frames, thanks to its ability to model viewpoint changes and scene dynamics.

3D animation sequence rendered by RenderFormer Looking ahead: Opportunities and challenges

RenderFormer represents a significant step forward for neural rendering. It demonstrates that deep learning can replicate and potentially replace the traditional rendering pipeline, supporting arbitrary 3D inputs and realistic global illumination—all without any hand-coded graphics computations.

However, key challenges remain. Scaling to larger and more complex scenes with intricate geometry, advanced materials, and diverse lighting conditions will require further research. Still, the transformer-based architecture provides a solid foundation for future integration with broader AI systems, including video generation, image synthesis, robotics, and embodied AI. 

Researchers hope that RenderFormer will serve as a building block for future breakthroughs in both graphics and AI, opening new possibilities for visual computing and intelligent environments.

Opens in a new tab

The post RenderFormer: How neural networks are reshaping 3D rendering appeared first on Microsoft Research.

Categories: Microsoft

Breaking the networking wall in AI infrastructure 

Tue, 09/09/2025 - 15:00

Memory and network bottlenecks are increasingly limiting AI system performance by reducing GPU utilization and overall efficiency, ultimately preventing infrastructure from reaching its full potential despite enormous investments. At the core of this challenge is a fundamental trade-off in the communication technologies used for memory and network interconnects.

Datacenters typically deploy two types of physical cables for communication between GPUs. Traditional copper links are power-efficient and reliable, but limited to very short distances (< 2 meters) that restrict their use to within a single GPU rack. Optical fiber links can reach tens of meters, but they consume far more power and fail up to 100 times as often as copper. A team working across Microsoft aims to resolve this trade-off by developing MOSAIC, a novel optical link technology that can provide low power and cost, high reliability, and long reach (up to 50 meters) simultaneously. This approach leverages a hardware-system co-design and adopts a wide-and-slow design with hundreds of parallel low-speed channels using microLEDs. 

The fundamental trade-off among power, reliability, and reach stems from the narrow-and-fast architecture deployed in today’s copper and optical links, comprising a few channels operating at very high data rates. For example, an 800 Gbps link consists of eight 100 Gbps channels. With copper links, higher channel speeds lead to greater signal integrity challenges, which limits their reach. With optical links, high-speed transmission is inherently inefficient, requiring power-hungry laser drivers and complex electronics to compensate for transmission impairments. These challenges grow as speeds increase with every generation of networks. Transmitting at high speeds also pushes the limits of optical components, reducing systems margins and increasing failure rates. 

These limitations force systems designers to make unpleasant choices, limiting the scalability of AI infrastructure. For example, scale-up networks connecting AI accelerators at multi-Tbps bandwidth typically must rely on copper links to meet the power budget, requiring ultra-dense racks that consume hundreds of kilowatts per rack. This creates significant challenges in cooling and mechanical design, which constrain the practical scale of these networks and end-to-end performance. This imbalance ultimately erects a networking wall akin to the memory wall, in which CPU speeds have outstripped memory speeds, creating performance bottlenecks.

A technology offering copper-like power efficiency and reliability over long distances can overcome this networking wall, enabling multi-rack scale-up domains and unlocking new architectures. This is a highly active R&D area, with many candidate technologies currently being developed across the industry. In our recent paper, MOSAIC: Breaking the Optics versus Copper Trade-off with a Wide-and-Slow Architecture and MicroLEDs, which received the Best Paper award at ACM SIGCOMM (opens in new tab), we present one such promising approach that is the result of a multi-year collaboration between Microsoft Research, Azure, and M365. This work is centered around an optical wide-and-slow architecture, shifting from a small number of high-speed serial channels towards hundreds of parallel low-speed channels. This would be impractical to realize with today’s copper and optical technologies because of i) electromagnetic interference challenges in high-density copper cables and ii) the high cost and power consumption of lasers in optical links, as well as the increase in packaging complexity. MOSAIC overcomes these issues by leveraging directly modulated microLEDs, a technology originally developed for screen displays. 

MicroLEDs are significantly smaller than traditional LEDs (ranging from a few to tens of microns) and, due to their small size, they can be modulated at several Gbps. They are manufactured in large arrays, with over half a million in a small physical footprint for high-resolution displays like head-mounted devices or smartwatches. For example, assuming 2 Gbps per microLED channel, an 800 Gbps MOSAIC link can be realized by using a 20×20 microLED array, which can fit in less than 1 mm×1 mm silicon die. 

MOSAIC’s wide-and-slow design provides four core benefits.

  • Operating at low speed improves power efficiency by eliminating the need for complex electronics and reducing optical power requirements.
  • By leveraging optical transmission (via microLEDs), MOSAIC sidesteps copper’s reach issues, supporting distances up to 50 meters, or > 10x further than copper.
  • MicroLEDs’ simpler structure and temperature insensitivity make them more reliable than lasers. The parallel nature of wide-and-slow also makes it easy to add redundant channels, further increasing reliability, up to two orders of magnitude higher than optical links. 
  • The approach is also scalable, as higher aggregate speeds (e.g., 1.6 Tbps or 3.2 Tbps) can be achieved by increasing the number of channels and/or raising per-channel speed (e.g., to 4-8 Gbps). 

Further, MOSAIC is fully compatible with today’s pluggable transceivers’ form factor and it provides a drop-in replacement for today’s copper and optical cables, without requiring any changes to existing server and network infrastructure. MOSAIC is protocol-agnostic, as it simply relays bits from one endpoint to another without terminating or inspecting the connection and, hence, it’s fully compatible with today’s protocols (e.g., Ethernet, PCIe, CXL). We are currently working with our suppliers to productize this technology and scale to mass production. 

While conceptually simple, realizing this architecture posed a few key challenges across the stack, which required a multi-disciplinary team with expertise spanning across integrated photonics, lens design, optical transmission, and analog and digital design. For example, using individual fibers per channel would be prohibitively complex and costly due to the large number of channels. We addressed this by employing imaging fibers, which are typically used for medical applications (e.g., endoscopy). They can support thousands of cores per fiber, enabling multiplexing of many channels within a single fiber. Also, microLEDs are a less pure light source than lasers, with a larger beam shape (which complicates fiber coupling) and a broader spectrum (which degrades fiber transmission due to chromatic dispersion). We tackled these issues through a novel microLED and optical lens design, and a power-efficient analog-only electronic back end, which does not require any expensive digital signal processing.  

Based on our current estimates, this approach can save up to 68% of power, i.e., more than 10W per cable while reducing failure rates by up to 100x. With global annual shipments of optical cables reaching into the tens of millions, this translates to over 100MW of power savings per year, enough to power more than 300,000 homes. While these immediate gains are already significant, the unique combination of low power consumption, reduced cost, high reliability, and long reach opens up exciting new opportunities to rethink AI infrastructure from network and cluster architectures to compute and memory designs.

For example, by supporting low-power, high-bandwidth connectivity at long reach, MOSAIC removes the need for ultra-dense racks and enables novel network topologies, which would be impractical today. The resulting redesign could reduce resource fragmentation and simplify collective optimization. Similarly, on the compute front, the ability to connect silicon dies at low power over long distances could enable resource disaggregation, shifting from today’s large, multi-die packages to smaller, more cost-effective, ones. Bypassing packaging area constraints would also make it possible to drastically increase GPU memory capacity and bandwidth, while facilitating adoption of novel memory technologies

Historically, step changes in network technology have unlocked entirely new classes of applications and workloads. While our SIGCOMM paper provides possible future directions, we hope this work sparks broader discussion and collaboration across the research and industry communities.

Opens in a new tab

The post Breaking the networking wall in AI infrastructure  appeared first on Microsoft Research.

Categories: Microsoft

eXTReMe Tracker