Tool-space interference in the MCP era: Designing for agent compatibility at scale

Microsoft Research - Thu, 09/11/2025 - 18:00

This year we’ve seen remarkable advances in agentic AI, including systems that conduct deep research, operate computers, complete substantial software engineering tasks, and tackle a range of other complex, multi-step goals. In each case, the industry relied on careful vertical integration: tools and agents were co-designed, co-trained, and tested together for peak performance. For example, OpenAI’s recent models presume the availability of web search and document retrieval tools. Likewise, the prompts and actions of Magentic-One are set up to make hand-offs easy—for example, allowing the WebSurfer agent to pass downloaded files to the Coder agent. But as agents proliferate, we anticipate that strategies relying heavily on vertical integration will not age well. Agents from different developers or companies will increasingly encounter each other and must work together to complete tasks, in what we refer to as a society of agents. These systems can vary in how coordinated they are, how aligned their goals are, and how much information they share. Can heterogeneous agents and tools cooperate in this setting, or will they hinder one another and slow progress?

Early clues have emerged from an unexpected source: namely, Model Context Protocol (MCP). Since January 2025, MCP has grown from a promising spec to a thriving market of tool servers. As an example, Zapier boasts a catalog of 30,000 tools across 7,000 services. Composio provides over 100 managed MCP servers, surfacing hundreds of tools. Hugging Face is now serving many Spaces apps over MCP, and Shopify has enabled MCP for millions of storefronts. A society of tools is already here, and it promises to extend agent capabilities through cross-provider horizontal integration.

So, what does MCP have to say about horizontal integration? As catalogs grow, we expect some new failure modes to surface. This blog post introduces these failure modes as tool-space interference and sketches both early observations and some pragmatic interventions to keep the society we’re building from tripping over its own feet.

Tool-space interference describes situations where otherwise reasonable tools or agents, when co-present, reduce end-to-end effectiveness. This can look like longer action sequences, higher token cost, brittle recovery from errors, or, in some cases, task failure.

A framing example

Consider MCP as a means for extending Magentic-One, a generalist multi-agent system we released last year, to cover more software engineering tasks. Magentic-One ships with agents to write code, interact with the computer terminal, browse the web, and access local files. To help Magentic-One navigate version control, find issues to solve, and make pull requests, we could add an agent equipped with the GitHub MCP Server. However, now each time the team encounters a task involving GitHub, it must choose whether to visit github.com in the browser, execute a git command at the command line, or engage the GitHub MCP server. As the task progresses, agent understanding of state can also diverge: changing the branch in the browser won’t change the branch in the terminal, and an authorized MCP tool does not imply authorization in the browser. Thus, while any single agent might complete the task efficiently, the larger set of agents might misunderstand or interfere with one another, leading to additional rounds of debugging, or even complete task failure.

Figure 1: We can extend Magentic-One by adding an agent that equips the GitHub MCP server. However, on every turn involving a git-related task, the orchestrator will need to decide between messaging the Computer Terminal agent (with access to the git command line interface), the WebSurfer agent (with access to github.com), and the agent with the GitHub MCP server. This overlap raises the possibility that they will interfere with one another.

Tool-space interference, through the lens of MCP

To better understand the potential interference patterns and the current state of the MCP ecosystem, we conducted a survey of MCP servers listed on two registries: smithery.ai and Docker MCP Hub. Smithery is an MCP Server registry with over 7,000 first-party and community-contributed servers, which we sampled from the Smithery API. Likewise, Docker MCP Hub is a registry that distributes MCP servers as Docker images, and we manually collected popular entries. We then launched each server for inspection. After excluding servers that were empty or failed to launch, and deduplicating servers with identical features, 1,470 servers remained in our catalog.

To automate the inspection of running MCP servers, we developed an MCP Interviewer tool. The MCP Interviewer begins by cataloging the server’s tools, prompts, resources, resource templates, and capabilities. From this catalog we can compute descriptive statistics such as the number of tools, or the depth of the parameter schemas. Then, given the list of available tools, the interviewer uses an LLM (in our case, OpenAI’s GPT-4.1) to construct a functional testing plan that calls each tool at least once, collecting outputs, errors, and statistics along the way. Finally, the interviewer can also grade more qualitative criteria by using an LLM to apply purpose-built rubrics to tool schemas and tool call outputs. We are excited to release the MCP Interviewer as an open-source CLI tool, so server developers can automatically evaluate their MCP servers with agent usability in mind, and users can validate new servers.
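For readers who want to try a lighter-weight version of this first cataloging step themselves, the sketch below connects to a local MCP server and lists its tools using the official MCP Python SDK (the mcp package). The server command is a placeholder, and the SDK surface may shift between versions, so treat this as a starting point rather than a piece of the MCP Interviewer itself.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def catalog_server(command: str, args: list[str]) -> None:
    """Connect to a local MCP server over stdio and print a simple tool catalog."""
    params = StdioServerParameters(command=command, args=args)
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            listing = await session.list_tools()
            for tool in listing.tools:
                # inputSchema is a JSON Schema dict; descriptive statistics such as
                # parameter counts or schema depth can be computed from it.
                n_params = len((tool.inputSchema or {}).get("properties", {}))
                print(f"{tool.name}: {n_params} parameter(s) - {tool.description}")

# Placeholder command: point this at any stdio MCP server installed locally.
asyncio.run(catalog_server("npx", ["-y", "@modelcontextprotocol/server-everything"]))
```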

While our survey provides informative initial results, it also faces significant limitations, the most obvious of which is authorization: many of the most popular MCP servers provide access to services that require authorization to use, hindering automated analysis. We are often still able to collect static features from these servers but are limited in the functional testing that can be done.

One size fits all (but some more than others)

So, what does our survey of MCP servers tell us about the MCP ecosystem? We will get into the numbers in a moment, but as we contemplate the statistics, there is one overarching theme to keep in mind: MCP servers do not know which clients or models they are working with, and present one common set of tools, prompts, and resources to everyone. However, some models handle long contexts and large tool spaces better than others (with diverging hard limits), and respond quite differently to common prompting patterns. For example, OpenAI’s guide on function calling advises developers to:

“Include examples and edge cases, especially to rectify any recurring failures. (Note: Adding examples may hurt performance for reasoning models).”

So, already, this places MCP at a disadvantage relative to vertical integrations that optimize for their operating environment. And with that, let’s dive into more numbers.

Tool count

Models vary in their proficiency at tool calling, but the general trend has been that performance drops as the number of tools increases. For example, OpenAI limits developers to 128 tools, but recommends that developers:

Keep the number of functions small for higher accuracy. Evaluate your performance with different numbers of functions. Aim for fewer than 20 functions at any one time, though this is just a soft suggestion.

While we expect this to improve with each new model generation, at present, large tool spaces can lower performance by up to 85% for some models. Thankfully, the majority of servers in our survey contain four or fewer tools. But there are outliers: the largest MCP server we cataloged adds 256 distinct tools, while the 10 next-largest servers add more than 100 tools each. Further down the list we find popular servers like Playwright-MCP (29 tools, at the time of this writing), and GitHub MCP (91 tools, with subsets available at alternative endpoint URLs), which might be too large for some models.

Figure 2: The number of tools listed by each catalogued server directly after initialization. Note: servers can change the tools they list at any time, but only 226 servers in our catalog declare this capability.

Response length

Tools are generally called in agentic loops, where the output is then fed back into the model as input context. Models have hard limits on input context, but even within these limits, large contexts can drive costs up and performance down, so practical limits can be much lower. MCP offers no guidance on how many tokens a tool call can produce, and the size of some responses can come as a surprise. In our analysis, we consider the 2,443 tool calls across 1,312 unique tools that the MCP Interviewer was able to call successfully during the active testing phase of server inspection. While a majority of tools produced 98 or fewer tokens, some tools are extraordinarily heavyweight: the top tool returned an average of 557,766 tokens, which is enough to swamp the context windows of many popular models like GPT-5. Further down the list, we find that 16 tools produce more than 128,000 tokens, swamping GPT-4o and other popular models. Even when responses fit into the context window, overly long responses can significantly degrade performance (up to 91% in one study), and limit the number of future calls that can be made. Of course, agents are free to implement their own context management strategies, but this behavior is left undefined in the MCP specification and server developers cannot count on any particular client behavior or strategy.

Number of tools whose average response would overflow the context window in a given number of calls:

| Model | Context window | 1 call | 2 calls | 3-5 calls | 6-10 calls |
| --- | --- | --- | --- | --- | --- |
| GPT-4.1 | 1,000,000 | 0 | 1 | 7 | 11 |
| GPT-5 | 400,000 | 1 | 7 | 15 | 25 |
| GPT-4o, Llama 3.1 | 128,000 | 16 | 15 | 33 | 40 |
| Qwen 3 | 32,000 | 56 | 37 | 86 | 90 |
| Phi-4 | 16,000 | 93 | 60 | 116 | 109 |

Figure 3: Tool call response length averages, in tokens, as observed by the MCP Interviewer’s functional test plan. Only successful tool calls are considered. Horizontal lines indicate context window limits for GPT-4o and GPT-5.

Tool parameter complexity

Mirroring the challenges from increasing the number of tools, increasing the complexity of a tool’s parameter space can also lead to degradation. For example, while MCP tools can take complex object types and structures as parameters, Composio found that flattening the parameter space could improve tool-calling performance by 47% compared to baseline performance. In our analysis, we find numerous examples of deeply nested structure—in one case, going 20 levels deep.

Figure 4: The maximum depth of each tool’s input properties schema. A depth of 0 indicates a tool with no properties. A depth of 1 indicates a tool with named properties but no annotations (e.g., no description or type). A depth of 2 indicates a tool with named and annotated properties. A depth of 3+ indicates a tool with structured properties that have additional nested annotations.

Namespacing issues and naming ambiguity

Another often-cited issue with the current MCP specification is the lack of a formal namespace mechanism. If two servers are registered to the same agent or application, and the servers have tool names in common, then disambiguation becomes impossible. Libraries like the OpenAI Agents SDK raise an error under this circumstance. Clients like Claude Code prefix tool names with unique identifiers to work around this issue. In our analysis of MCP servers, we found name collisions involving 775 tools. The most common collision was “search”, which appears across 32 distinct MCP servers. The following table lists the top 10 collisions.

| Tool Name | Number of Instances |
| --- | --- |
| search | 32 |
| get_user | 11 |
| execute_query | 11 |
| list_tables | 10 |
| update_task | 9 |
| generate_image | 9 |
| send_message | 9 |
| execute_command | 8 |
| list_tasks | 8 |
| search_files | 8 |
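As a rough illustration of the client-side workaround mentioned above, the snippet below prefixes each tool name with the server it came from so that identically named tools stay distinguishable. The prefix format is ours, not any particular client's convention.

```python
def qualify_tool_names(servers: dict[str, list[str]]) -> dict[str, tuple[str, str]]:
    """Map a prefixed, collision-free name back to (server, original tool name)."""
    qualified = {}
    for server, tools in servers.items():
        for tool in tools:
            qualified[f"{server}__{tool}"] = (server, tool)
    return qualified

# Both servers expose a tool named "search"; prefixing keeps them distinct.
print(qualify_tool_names({"docs": ["search"], "web": ["search", "fetch"]}))
# {'docs__search': ('docs', 'search'), 'web__search': ('web', 'search'), 'web__fetch': ('web', 'fetch')}
```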

Even when names are unique, they can be semantically similar. If these tools behave similarly, then the redundancy may not be immediately problematic, but if you are expecting to call a particular tool, then the name similarities raise the potential for confusion. The following are some examples of semantically similar tool names relating to web search:

websearch, brave_web_search, search-web, tavily_web_search, web_search, google_news_search, search_web, google-play-search, search_webkr, google_search_parsed, google_search, search_google_images, search_google, get_webset_search_exa, ai_web_search, search_google_scholar, web_search_exa, duckduckgo_web_search, search_web_tool, google_search_scraper, web_search_agent, answer_query_websearch, batch-web-search

Errors and error messages

Like all software, MCP servers will occasionally encounter error conditions. In these cases, it is important to provide sufficient information for the agent to handle the error and plan next steps. In our analysis, we found this was not always the case. While MCP provides an isError flag to signal errors, we found that it was common for servers to handle errors by returning strings while leaving this flag set to false, signaling a normal exit. Out of 5,983 tool call results with no error flag, GPT-4.1 judged that 3,536 indicated errors in their content. More worrisome: the error messages were often of low quality. For instance, one tool providing web search capabilities failed with the string “error: job,” while another tool providing academic search returned “Please retry with 0 or fewer IDs.”
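Until servers report errors more consistently, clients may want to triage results defensively. The sketch below is a crude stand-in for the LLM-based judgment described above: it trusts the isError flag when set and otherwise scans the returned text for common failure markers.

```python
ERROR_MARKERS = ("error", "exception", "failed", "traceback", "not authorized")

def looks_like_error(result) -> bool:
    """Defensive triage of an MCP tool result: trust isError when set, otherwise
    fall back to scanning the text content for common failure markers (a crude
    proxy for the LLM-based judgment used in the survey)."""
    if getattr(result, "isError", False):
        return True
    text = " ".join(getattr(block, "text", "") or "" for block in result.content)
    return any(marker in text.lower() for marker in ERROR_MARKERS)
```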

Resource sharing conventions

Finally, in addition to tools, MCP allows servers to share resources and resource templates with clients. In our survey, only 112 servers (7.6%) reported any resources, while 74 (5%) provided templates. One potential reason for low adoption is that the current MCP specification provides limited guidance on when resources are retrieved, or how they are incorporated into context. One clear-cut situation where a client might retrieve a resource is in response to a tool returning a resource_link as a result—but only 4 tools exhibited this behavior in our survey (arguably, this would be the ideal behavior for tools that return very long, document-like responses, as outlined earlier).

Conversely, a whole different set of issues arises when there is a need to share resources from the client to the server. Consider for example a tool that provides some analysis of a local PDF file. In the case of a local MCP server utilizing STDIO transport, a local file path can be provided as an argument to the tool, but no similar conventions exist for delivering a local file to a remote MCP server. These issues are challenging enough when implementing a single server. When multiple tools or servers need to interact within the same system, the risk of interoperability errors compounds.

Recommendations

On balance, along any given dimension, the average MCP server is quite reasonable—but, as we have seen, outliers and diverging assumptions can introduce trouble. While we expect many of these challenges to improve with time, we are comfortable making small recommendations that we feel are evergreen. We organize them below by audience.

Protocol developers

We recognize the advantages of keeping MCP relatively lightweight, avoiding being overly prescriptive in an environment where AI models and use cases are rapidly changing. However, a few small recommendations are warranted. First, we believe MCP should be extended to include a specification for client-provided resources so that tools on remote servers have a mechanism for operating on specified local files or documents. This would more effectively position MCP as a clearinghouse for resources passed between steps of agentic workflows. The MCP specification would also benefit from taking a more opinionated stance on when resources are retrieved and used overall.

Likewise, we believe MCP should quickly move to provide formal namespaces to eliminate tool name collisions. If namespaces are hierarchical, then this also provides a way of organizing large catalogs of functions into thematically related tool sets. Tool sets, as an organizing principle, are already showing some promise in GitHub MCP Server’s dynamic tool discovery and VS Code’s tool grouping (with virtual tools), where agents or users can enable and disable tools as needed. In the future, a standardized mechanism for grouping tools would allow clients to engage in hierarchical tool-calling, where they first select a category, then select a tool, without needing to keep all possible tools in context.

Server developers

While our MCP Interviewer tool can catalog many outward-facing properties of MCP servers, developers are often in a much better position to characterize the nature of their tools. To this end, we believe developers should publish an MCP Server card alongside their servers or services, clearly outlining the runtime characteristics of the tools (e.g., the expected number of tokens generated, or expected latency of a tool call). Ideally developers should also indicate which models, agents and clients the server was tested with, how the tools were tested (e.g., provide sample tasks), list any known incompatibilities, and be mindful of limitations of various models throughout development.
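No standard format for such a card exists yet; the dictionary below is a purely hypothetical sketch of the kinds of fields it might carry, with made-up names and numbers.

```python
# Hypothetical MCP "server card". The field names and values below are
# illustrative only and are not part of any published specification.
SERVER_CARD = {
    "server": "example-github-tools",
    "version": "1.2.0",
    "tools": {
        "search_issues": {
            "expected_output_tokens": {"p50": 450, "p95": 4_000},
            "expected_latency_ms": {"p50": 800, "p95": 3_500},
            "requires_auth": True,
        },
    },
    "tested_with_models": ["GPT-4.1"],
    "sample_tasks": ["Find open issues labeled 'bug' and summarize the top three"],
    "known_incompatibilities": ["clients that cap tool counts below 20"],
}
```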

Client developers

Client developers have the opportunity to experiment with various mitigations or optimizations that might help the average MCP server work better for a given system or environment. For example, clients could cache tool schemas, serving them as targets for prompt optimizations, or as an index for RAG-like tool selection approaches. To this end, Anthropic recently reported using a tool testing agent to rewrite the prompts of defective MCP servers, improving task completion time by 40%. Likewise, rather than waiting for the protocol to evolve, clients could take proactive steps to resolve name collisions—for example, generating namespaces from server names—and could reduce token outputs by summarizing or paginating long tool results.
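As a concrete, if simplistic, example of that last mitigation, a client can cap how many tokens any single tool result contributes to context. The sketch below uses the tiktoken tokenizer; the budget is an arbitrary example value, not a recommendation.

```python
import tiktoken

ENC = tiktoken.get_encoding("o200k_base")  # tokenizer used by recent OpenAI models

def clip_tool_result(text: str, max_tokens: int = 4_000) -> str:
    """Truncate an overly long tool result before it enters the model's context,
    noting how much was dropped so the model can ask for more if needed."""
    tokens = ENC.encode(text)
    if len(tokens) <= max_tokens:
        return text
    clipped = ENC.decode(tokens[:max_tokens])
    return f"{clipped}\n[... truncated {len(tokens) - max_tokens} tokens ...]"
```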

Market developers

Finally, we see an opportunity for marketplaces to codify best practices, spot compatibility issues at a global level, and perhaps centralize the generation and serving of model- or agent-specific optimizations. Mirroring how a market like PyPI distributes Python wheels matched to a developer’s operating system or processor, an MCP marketplace could serve tool schemas optimized for a developer’s chosen LLM, agent, or client library. We are already seeing small steps in this direction, with registries like Smithery providing customized launch configurations to match users’ clients.

Conclusion

In summary, the MCP ecosystem offers significant value for AI agent development, despite some early growing pains. Grounded in insights from the MCP Interviewer and our survey of live servers, the evidence is clear: horizontal integration is expanding capability, yet it also exposes forms of tool-space interference that can erode end-to-end effectiveness. Anticipating rapid advances in model capability and growing architectural diversity, the recommendations provided here aim to ensure that protocol, server, client, and marketplace developers are well positioned to adapt and thrive. Key steps include implementing formal namespaces to eliminate collisions, enhancing protocol support for client-provided resources, and encouraging transparent server documentation to foster interoperability and robust development practices across the ecosystem.

By embracing these evergreen recommendations and proactively addressing compatibility, usability, and optimization issues, the AI agent community can create a more reliable, scalable, and efficient infrastructure that benefits both developers and end users. The future of MCP is bright, with ample opportunities for experimentation, standardization, and collective progress.


RenderFormer: How neural networks are reshaping 3D rendering

Microsoft Research - Wed, 09/10/2025 - 18:00

3D rendering—the process of converting three-dimensional models into two-dimensional images—is a foundational technology in computer graphics, widely used across gaming, film, virtual reality, and architectural visualization. Traditionally, this process has depended on physics-based techniques like ray tracing and rasterization, which simulate light behavior through mathematical formulas and expert-designed models.

Now, thanks to advances in AI, especially neural networks, researchers are beginning to replace these conventional approaches with machine learning (ML). This shift is giving rise to a new field known as neural rendering.

Neural rendering combines deep learning with traditional graphics techniques, allowing models to simulate complex light transport without explicitly modeling physical optics. This approach offers significant advantages: it eliminates the need for handcrafted rules, supports end-to-end training, and can be optimized for specific tasks. Yet, most current neural rendering methods rely on 2D image inputs, lack support for raw 3D geometry and material data, and often require retraining for each new scene—limiting their generalizability.

RenderFormer: Toward a general-purpose neural rendering model

To overcome these limitations, researchers at Microsoft Research have developed RenderFormer, a new neural architecture designed to support full-featured 3D rendering using only ML—no traditional graphics computation required. RenderFormer is the first model to demonstrate that a neural network can learn a complete graphics rendering pipeline, including support for arbitrary 3D scenes and global illumination, without relying on ray tracing or rasterization. This work has been accepted at SIGGRAPH 2025 and is open-sourced on GitHub.

Architecture overview

As shown in Figure 1, RenderFormer represents the entire 3D scene using triangle tokens—each one encoding spatial position, surface normal, and physical material properties such as diffuse color, specular color, and roughness. Lighting is also modeled as triangle tokens, with emission values indicating intensity.

Figure 1. Architecture of RenderFormer

To describe the viewing direction, the model uses ray bundle tokens derived from a ray map—each pixel in the output image corresponds to one of these rays. To improve computational efficiency, pixels are grouped into rectangular blocks, with all rays in a block processed together.

The model outputs a set of tokens that are decoded into image pixels, completing the rendering process entirely within the neural network.

Dual-branch design for view-independent and view-dependent effects

The RenderFormer architecture is built around two transformers: one for view-independent features and another for view-dependent ones.

  • The view-independent transformer captures scene information unrelated to viewpoint, such as shadowing and diffuse light transport, using self-attention between triangle tokens.
  • The view-dependent transformer models effects like visibility, reflections, and specular highlights through cross-attention between triangle and ray bundle tokens.

Additional image-space effects, such as anti-aliasing and screen-space reflections, are handled via self-attention among ray bundle tokens.
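To make the division of labor concrete, here is a minimal PyTorch sketch of that attention pattern. The dimensions, token counts, and single-layer structure are illustrative only and are not the published RenderFormer configuration.

```python
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
tri_tokens = torch.rand(1, 1536, d_model)   # triangle tokens (geometry, materials, lights)
ray_tokens = torch.rand(1, 1024, d_model)   # ray-bundle tokens derived from the ray map

tri_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
ray_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
ray_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# View-independent branch: triangle tokens attend to one another
# (shadowing, diffuse light transport).
tri_tokens, _ = tri_self_attn(tri_tokens, tri_tokens, tri_tokens)

# View-dependent branch: ray-bundle tokens query the scene tokens via
# cross-attention (visibility, reflections, specular highlights).
ray_tokens, _ = ray_cross_attn(ray_tokens, tri_tokens, tri_tokens)

# Image-space effects: ray-bundle tokens attend to one another before
# being decoded into pixel blocks.
ray_tokens, _ = ray_self_attn(ray_tokens, ray_tokens, ray_tokens)
```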

To validate the architecture, the team conducted ablation studies and visual analyses, confirming the importance of each component in the rendering pipeline.

Table 1. Ablation study analyzing the impact of different components and attention mechanisms on the final performance of the trained network.

To test the capabilities of the view-independent transformer, researchers trained a decoder to produce diffuse-only renderings. The results, shown in Figure 2, demonstrate that the model can accurately simulate shadows and other indirect lighting effects.

Figure 2. View-independent rendering effects decoded directly from the view-independent transformer, including diffuse lighting and coarse shadow effects.

The view-dependent transformer was evaluated through attention visualizations. For example, in Figure 3, the attention map reveals a pixel on a teapot attending to its surface triangle and to a nearby wall—capturing the effect of specular reflection. These visualizations also show how material changes influence the sharpness and intensity of reflections.

Figure 3. Visualization of attention outputs

Training methodology and dataset design

RenderFormer was trained using the Objaverse dataset, a collection of more than 800,000 annotated 3D objects that is designed to advance research in 3D modeling, computer vision, and related fields. The researchers designed four scene templates, populating each with 1–3 randomly selected objects and materials. Scenes were rendered in high dynamic range (HDR) using Blender’s Cycles renderer, under varied lighting conditions and camera angles.

The base model, consisting of 205 million parameters, was trained in two phases using the AdamW optimizer:

  • 500,000 steps at 256×256 resolution with up to 1,536 triangles
  • 100,000 steps at 512×512 resolution with up to 4,096 triangles

The model supports arbitrary triangle-based input and generalizes well to complex real-world scenes. As shown in Figure 4, it accurately reproduces shadows, diffuse shading, and specular highlights.

Figure 4. Rendered results of different 3D scenes generated by RenderFormer

RenderFormer can also generate continuous video by rendering individual frames, thanks to its ability to model viewpoint changes and scene dynamics.

3D animation sequence rendered by RenderFormer

Looking ahead: Opportunities and challenges

RenderFormer represents a significant step forward for neural rendering. It demonstrates that deep learning can replicate and potentially replace the traditional rendering pipeline, supporting arbitrary 3D inputs and realistic global illumination—all without any hand-coded graphics computations.

However, key challenges remain. Scaling to larger and more complex scenes with intricate geometry, advanced materials, and diverse lighting conditions will require further research. Still, the transformer-based architecture provides a solid foundation for future integration with broader AI systems, including video generation, image synthesis, robotics, and embodied AI. 

Researchers hope that RenderFormer will serve as a building block for future breakthroughs in both graphics and AI, opening new possibilities for visual computing and intelligent environments.


Breaking the networking wall in AI infrastructure 

Microsoft Research - Tue, 09/09/2025 - 16:00

Memory and network bottlenecks are increasingly limiting AI system performance by reducing GPU utilization and overall efficiency, ultimately preventing infrastructure from reaching its full potential despite enormous investments. At the core of this challenge is a fundamental trade-off in the communication technologies used for memory and network interconnects.

Datacenters typically deploy two types of physical cables for communication between GPUs. Traditional copper links are power-efficient and reliable, but limited to very short distances (< 2 meters) that restrict their use to within a single GPU rack. Optical fiber links can reach tens of meters, but they consume far more power and fail up to 100 times as often as copper. A team working across Microsoft aims to resolve this trade-off by developing MOSAIC, a novel optical link technology that can provide low power and cost, high reliability, and long reach (up to 50 meters) simultaneously. This approach leverages a hardware-system co-design and adopts a wide-and-slow design with hundreds of parallel low-speed channels using microLEDs. 

The fundamental trade-off among power, reliability, and reach stems from the narrow-and-fast architecture deployed in today’s copper and optical links, comprising a few channels operating at very high data rates. For example, an 800 Gbps link consists of eight 100 Gbps channels. With copper links, higher channel speeds lead to greater signal integrity challenges, which limits their reach. With optical links, high-speed transmission is inherently inefficient, requiring power-hungry laser drivers and complex electronics to compensate for transmission impairments. These challenges grow as speeds increase with every generation of networks. Transmitting at high speeds also pushes the limits of optical components, reducing systems margins and increasing failure rates. 

These limitations force systems designers to make unpleasant choices, limiting the scalability of AI infrastructure. For example, scale-up networks connecting AI accelerators at multi-Tbps bandwidth typically must rely on copper links to meet the power budget, requiring ultra-dense racks that consume hundreds of kilowatts per rack. This creates significant challenges in cooling and mechanical design, which constrain the practical scale of these networks and end-to-end performance. This imbalance ultimately erects a networking wall akin to the memory wall, in which CPU speeds have outstripped memory speeds, creating performance bottlenecks.

A technology offering copper-like power efficiency and reliability over long distances can overcome this networking wall, enabling multi-rack scale-up domains and unlocking new architectures. This is a highly active R&D area, with many candidate technologies currently being developed across the industry. In our recent paper, MOSAIC: Breaking the Optics versus Copper Trade-off with a Wide-and-Slow Architecture and MicroLEDs, which received the Best Paper award at ACM SIGCOMM, we present one such promising approach that is the result of a multi-year collaboration between Microsoft Research, Azure, and M365. This work is centered around an optical wide-and-slow architecture, shifting from a small number of high-speed serial channels towards hundreds of parallel low-speed channels. This would be impractical to realize with today’s copper and optical technologies because of i) electromagnetic interference challenges in high-density copper cables and ii) the high cost and power consumption of lasers in optical links, as well as the increase in packaging complexity. MOSAIC overcomes these issues by leveraging directly modulated microLEDs, a technology originally developed for screen displays.

MicroLEDs are significantly smaller than traditional LEDs (ranging from a few to tens of microns) and, due to their small size, they can be modulated at several Gbps. They are manufactured in large arrays, with over half a million packed into a small physical footprint for high-resolution displays like head-mounted devices or smartwatches. For example, assuming 2 Gbps per microLED channel, an 800 Gbps MOSAIC link can be realized with a 20×20 microLED array, which fits in a silicon die smaller than 1 mm × 1 mm.
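The channel count in that example follows directly from the per-channel data rate:

$$
N_{\text{channels}} = \frac{800\,\text{Gbps}}{2\,\text{Gbps per channel}} = 400 = 20 \times 20 .
$$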

MOSAIC’s wide-and-slow design provides four core benefits.

  • Operating at low speed improves power efficiency by eliminating the need for complex electronics and reducing optical power requirements.
  • By leveraging optical transmission (via microLEDs), MOSAIC sidesteps copper’s reach issues, supporting distances up to 50 meters, or > 10x further than copper.
  • MicroLEDs’ simpler structure and temperature insensitivity make them more reliable than lasers. The parallel nature of wide-and-slow also makes it easy to add redundant channels, further increasing reliability, up to two orders of magnitude higher than optical links. 
  • The approach is also scalable, as higher aggregate speeds (e.g., 1.6 Tbps or 3.2 Tbps) can be achieved by increasing the number of channels and/or raising per-channel speed (e.g., to 4-8 Gbps). 

Further, MOSAIC is fully compatible with today’s pluggable transceivers’ form factor and it provides a drop-in replacement for today’s copper and optical cables, without requiring any changes to existing server and network infrastructure. MOSAIC is protocol-agnostic, as it simply relays bits from one endpoint to another without terminating or inspecting the connection and, hence, it’s fully compatible with today’s protocols (e.g., Ethernet, PCIe, CXL). We are currently working with our suppliers to productize this technology and scale to mass production. 

While conceptually simple, realizing this architecture posed a few key challenges across the stack, which required a multi-disciplinary team with expertise spanning across integrated photonics, lens design, optical transmission, and analog and digital design. For example, using individual fibers per channel would be prohibitively complex and costly due to the large number of channels. We addressed this by employing imaging fibers, which are typically used for medical applications (e.g., endoscopy). They can support thousands of cores per fiber, enabling multiplexing of many channels within a single fiber. Also, microLEDs are a less pure light source than lasers, with a larger beam shape (which complicates fiber coupling) and a broader spectrum (which degrades fiber transmission due to chromatic dispersion). We tackled these issues through a novel microLED and optical lens design, and a power-efficient analog-only electronic back end, which does not require any expensive digital signal processing.  

Based on our current estimates, this approach can save up to 68% of power, i.e., more than 10 W per cable, while reducing failure rates by up to 100x. With global annual shipments of optical cables reaching into the tens of millions, this translates to over 100 MW of power savings per year, enough to power more than 300,000 homes. While these immediate gains are already significant, the unique combination of low power consumption, reduced cost, high reliability, and long reach opens up exciting new opportunities to rethink AI infrastructure, from network and cluster architectures to compute and memory designs.
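The headline figure is a straightforward back-of-the-envelope product; taking ten million cables per year as a round stand-in for "tens of millions":

$$
10\,\text{W per cable} \times 10^{7}\,\text{cables} = 10^{8}\,\text{W} = 100\,\text{MW}.
$$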

For example, by supporting low-power, high-bandwidth connectivity at long reach, MOSAIC removes the need for ultra-dense racks and enables novel network topologies, which would be impractical today. The resulting redesign could reduce resource fragmentation and simplify collective optimization. Similarly, on the compute front, the ability to connect silicon dies at low power over long distances could enable resource disaggregation, shifting from today’s large, multi-die packages to smaller, more cost-effective ones. Bypassing packaging area constraints would also make it possible to drastically increase GPU memory capacity and bandwidth, while facilitating adoption of novel memory technologies.

Historically, step changes in network technology have unlocked entirely new classes of applications and workloads. While our SIGCOMM paper provides possible future directions, we hope this work sparks broader discussion and collaboration across the research and industry communities.


Crescent library brings privacy to digital identity systems

Microsoft Research - Tue, 08/26/2025 - 18:00

Digital identities, the electronic credentials embedded in phone wallets, workplace logins, and other apps, are becoming ubiquitous. While they offer unprecedented convenience, they also create new privacy risks, particularly around tracking and surveillance. 

One of these risks is linkability, the ability to associate one or more uses of a credential to a specific person. Currently, when people use their mobile driver’s license or log into various apps, hidden identifiers can link these separate activities together, building detailed profiles of user behavior.  

To address this, we have released Crescent, a cryptographic library that adds unlinkability to widely used identity formats, protecting privacy. These include JSON Web Tokens (the authentication standard behind many app logins) and mobile driver’s licenses. Crescent also works without requiring the organizations that issue these credentials to update their systems.

The protection goes beyond existing privacy features. Some digital identity systems already offer selective disclosure, allowing users to share only specific pieces of information in each interaction.  

But even with selective disclosure, credentials can still be linked through serial numbers, cryptographic signatures, or embedded identifiers. Crescent’s unlinkability feature is designed to prevent anything in the credential, beyond what a user explicitly chooses to reveal, from being used to connect their separate digital interactions.

Figure 1: Unlinkability between a credential issuance and presentation

Two paths to unlinkability

To understand how Crescent works, it helps to examine the two main approaches researchers have developed for adding unlinkability to identity systems: 

  1. Specialized cryptographic signature schemes. These schemes can provide unlinkability but require extensive changes to existing infrastructure. New algorithms must be standardized, implemented, and integrated into software and hardware platforms. For example, the BBS signature scheme is currently being standardized by the Internet Engineering Task Force (IETF), but even after completion, adoption may be slow.
  2. Zero-knowledge proofs with existing credentials. This approach, used by Crescent, allows users to prove specific facts about their credentials without revealing the underlying data that could enable tracking. For example, someone could prove they hold a valid driver’s license and live in a particular ZIP code without exposing any other personal information or identifiers that could link this interaction to future ones.

Zero-knowledge proofs have become more practical since they were first developed 40 years ago, but they are not as efficient as the cryptographic algorithms used in today’s credentials. Crescent addresses this computational challenge through preprocessing, performing the most complex calculations once in advance so that later proof generation is quick and efficient on mobile devices.

Beyond unlinkability, Crescent supports selective disclosure, allowing users to prove specific facts without revealing unnecessary details. For example, it can confirm that a credential is valid and unexpired without disclosing the exact expiration date, which might otherwise serve as a unique identifier. These privacy protections work even when credentials are stored in a phone’s secure hardware, which keeps them tied to the device and prevents unauthorized access.

Behind the cryptographic curtain

At its core, Crescent uses a sophisticated form of cryptographic proof called a zero-knowledge SNARK (Zero-Knowledge Succinct Noninteractive Argument of Knowledge). This method allows one party to prove possession of information or credentials without revealing the underlying data itself. 

Crescent specifically uses the Groth16 proof system, one of the first practical implementations of this technology. What makes Groth16 particularly useful is that its proofs are small in size, quick to verify, and can be shared in a single step without back-and-forth communication between the user and verifier. 

The system works by first establishing shared cryptographic parameters based on a credential template. Multiple organizations issuing similar credentials, such as different state motor vehicle departments issuing mobile driver’s licenses, can use the same parameters as long as they follow compatible data formats and security standards. 

The mathematical rules that define what each proof will verify are written using specialized programming tools that convert them into a Rank-1 Constraint System (R1CS), a mathematical framework that describes exactly what needs to be proven about a credential. 

To make the system fast enough for real-world use, Crescent splits the proof generation into two distinct stages: 

  1. Prepare stage. This step runs once and generates cryptographic values that can be stored on the user’s device for repeated use.
  2. Show stage. When a user needs to present their credential, this quicker step takes the stored values and randomizes them to prevent any connection to previous presentations. It also creates a compact cryptographic summary that reveals only the specific information needed for that particular interaction. (A conceptual sketch of this split follows.)
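The toy sketch below illustrates only this split between a one-time preparation and a re-randomized, per-presentation show step. It is not real cryptography and does not reflect the Crescent (Rust) API; every name and operation here is a placeholder.

```python
import hashlib
import secrets
from dataclasses import dataclass

@dataclass
class PreparedCredential:
    credential_id: str
    cached: bytes  # placeholder for the expensive, reusable precomputation

def prepare(credential_id: str) -> PreparedCredential:
    # Run once per credential (the slow step); cache the result on the device.
    return PreparedCredential(credential_id, hashlib.sha256(credential_id.encode()).digest())

def show(prepared: PreparedCredential, disclosed_claim: str) -> bytes:
    # Run per presentation (the fast step): fold in fresh randomness so two
    # presentations of the same credential cannot be linked to each other.
    nonce = secrets.token_bytes(32)
    return hashlib.sha256(prepared.cached + nonce + disclosed_claim.encode()).digest()

cred = prepare("contoso-employment-jwt")
p1 = show(cred, "employer=Contoso")
p2 = show(cred, "employer=Contoso")
print(p1 != p2)  # True: each presentation looks fresh to a verifier
```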

Figures 2 and 3 illustrate this credential-proving workflow and the division between the prepare and show steps.

Figure 2: Crescent’s credential-proving workflow includes a compilation of a circuit to R1CS, followed by the prepare and show steps. The output zero-knowledge proof is sent to the verifier.

Figure 3: The Crescent presentation steps show the division between prepare and show steps.

A sample application

To demonstrate how Crescent works, we created a sample application covering two real-world scenarios: verifying employment and proving age for online access. The application includes sample code for setting up fictional issuers and verifiers as Rust servers, along with a browser-extension wallet for the user. The step numbers correspond to the steps in Figure 4. 

Setup
  1. A Crescent service pre-generates the zero-knowledge parameters for creating and verifying proofs from JSON Web Tokens and mobile driver’s licenses.
  2. The user obtains a mobile driver’s license from their Department of Motor Vehicles.
  3. The user obtains a proof-of-employment JSON Web Token from their employer, Contoso.
  4. These credentials and their private keys are stored in the Crescent wallet.
Scenarios
  1. Employment verification: The user presents their JSON Web Token to Fabrikam, an online health clinic, to prove they are employed at Contoso and eligible for workplace benefits. Fabrikam learns that the user works at Contoso but not the user’s identity, while Contoso remains unaware of the interaction.
  2. Age verification: The user presents their mobile driver’s license to a social network, proving they are over 18. The proof confirms eligibility without revealing their age or identity.

Across both scenarios, Crescent ensures that credential presentations remain unlinkable, preventing any party from connecting them to the user. 

For simplicity, the sample defines its own issuance and presentation protocol, but it could be integrated into higher-level identity frameworks such as OpenID/OAuth, Verifiable Credentials, or the mobile driver’s license ecosystem.

Figure 4. The sample architecture, from credential issuance to presentation.

To learn more about the project, visit the Crescent project GitHub page, or check out our recent presentations given at the Real World Crypto 2025 and North Sec 2025 conferences.


Applicability vs. job displacement: further notes on our recent research on AI and occupations

Microsoft Research - Thu, 08/21/2025 - 19:00

Recently, we released a paper (Working with AI: Measuring the Occupational Implications of Generative AI) that studied what occupations might find AI chatbots useful, and to what degree. The paper sparked significant discussion, which is no surprise since people care deeply about the future of AI and jobs–that’s part of why we think it’s important to study these topics.

Unfortunately, not all the discussion was accurate in its portrayal of the study’s scope or conclusions. Specifically, our study does not draw any conclusions about jobs being eliminated; in the paper, we explicitly cautioned against using our findings to make that conclusion. 

Given the importance of this topic, we want to clarify any misunderstandings and provide a more digestible summary of the paper, our methodology, and its limitations. 

What did our research find?

We set out to better understand how people are using AI, highlighting where AI might be useful in different occupations. To do this, we analyzed how people currently use generative AI—specifically Microsoft Bing Copilot (now Microsoft Copilot)—to assist with tasks. We then compared these sets of tasks against the O*NET database, a widely used occupational classification system, to understand potential applicability to various occupations.

We found that AI is most useful for tasks related to knowledge work and communication, particularly tasks such as writing, gathering information, and learning.

Those in occupations with these tasks may benefit by considering how AI can be used as a tool to help improve their workflows. On the flip side, it’s not surprising that physical tasks like performing surgeries or moving objects had less direct AI chatbot applicability.

So, to summarize, our paper is about identifying the occupations where AI may be most useful, by assisting or performing subtasks.  Our data do not indicate, nor did we suggest, that certain jobs will be replaced by AI.

Methodological limitations are acknowledged—and important

The paper is transparent about the limitations of our approach.  

We analyzed anonymized Bing Copilot conversations to see what activities users are seeking AI assistance with and what activities AI can perform when mapped to the O*NET database. While O*NET provides a structured list of activities associated with various occupations, it does not capture the full spectrum of skills, context, and nuance required in the real world.  A job is far more than the collection of tasks that make it up.

For example, a task might involve “writing reports,” but O*NET won’t reflect the interpersonal judgment, domain expertise, or ethical considerations that go into doing that well. The paper acknowledges this gap and warns against over-interpreting the AI applicability scores as measures of AI’s ability to perform an occupation.

Additionally, the dataset is based on user queries from Bing Copilot (from January to September 2024), which may be influenced by factors like awareness, access, or comfort with AI tools. Different people use different LLMs for different purposes, and it is also very difficult (or nearly impossible) to determine which conversations take place in a work context versus for leisure.

Finally, we only evaluated AI chatbot usage, so this study does not evaluate the impact or applicability of other forms of AI.

Where do we go from here?

Given the intense interest in how AI will shape our collective future, it’s important we continue to study and better understand its societal and economic impact. As with all research on this topic, the findings are nuanced, and it’s important to pay attention to this nuance. 

The public interest in our research is based, in large part, on the topic of AI and job displacement. However, our current methodology for this study is unlikely to lead to firm conclusions about this.  AI may prove to be a useful tool for many occupations, and we believe the right balance lies in finding how to use the technology in a way that leverages its abilities while complementing human strengths and accounting for people’s preferences.    

For more information from Microsoft on the future of work and AI skilling, check out Microsoft’s Annual Work Trend Index and Microsoft Elevate.


MindJourney enables AI to explore simulated 3D worlds to improve spatial interpretation

Microsoft Research - Wed, 08/20/2025 - 18:00

A new research framework helps AI agents explore three-dimensional spaces they can’t directly detect. Called MindJourney, the approach addresses a key limitation in vision-language models (VLMs), which give AI agents their ability to interpret and describe visual scenes.  

While VLMs are strong at identifying objects in static images, they struggle to interpret the interactive 3D world behind 2D images. This gap shows up in spatial questions like “If I sit on the couch that is on my right and face the chairs, will the kitchen be to my right or left?”—tasks that require an agent to interpret its position and movement through space. 

People overcome this challenge by mentally exploring a space, imagining moving through it and combining those mental snapshots to work out where objects are. MindJourney applies the same process to AI agents, letting them roam a virtual space before answering spatial questions. 

How MindJourney navigates 3D space

To perform this type of spatial navigation, MindJourney uses a world model—in this case, a video generation system trained on a large collection of videos captured from a single moving viewpoint, showing actions such as going forward and turning left or right, much like a 3D cinematographer. From this, it learns to predict how a new scene would appear from different perspectives.

At inference time, the model can generate photo-realistic images of a scene based on possible movements from the agent’s current position. It generates multiple possible views of a scene while the VLM acts as a filter, selecting the constructed perspectives that are most likely to answer the user’s question.

These are kept and expanded in the next iteration, while less promising paths are discarded. This process, shown in Figure 1, avoids the need to generate and evaluate thousands of possible movement sequences by focusing only on the most informative perspectives.

Figure 1. Given a spatial reasoning query, MindJourney searches through the imagined 3D space using a world model and improves the VLM’s spatial interpretation through generated observations when encountering new challenges. 

 

To make its search through a simulated space both effective and efficient, MindJourney uses a spatial beam search—an algorithm that prioritizes the most promising paths. It works within a fixed number of steps, each representing a movement. By balancing breadth with depth, spatial beam search enables MindJourney to gather strong supporting evidence. This process is illustrated in Figure 2.

Figure 2. The MindJourney workflow starts with a spatial beam search for a set number of steps before answering the query. The world model interactively generates new observations, while a VLM interprets the generated images, guiding the search throughout the process.
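In rough Python terms, the search loop looks something like the sketch below. The world model and the VLM scorer are stand-in callables, and the beam width, step count, and action set are illustrative defaults, not MindJourney's actual settings.

```python
from dataclasses import dataclass
from typing import Callable, List

ACTIONS = ["forward", "turn_left", "turn_right"]

@dataclass
class Path:
    actions: List[str]
    view: object           # imagined observation after following the actions
    score: float           # scorer's estimate of how informative this view is

def spatial_beam_search(start_view, imagine: Callable, score: Callable,
                        beam_width: int = 3, max_steps: int = 4) -> List[Path]:
    """Expand imagined viewpoints step by step, keeping only the most promising."""
    beam = [Path([], start_view, score(start_view))]
    for _ in range(max_steps):
        candidates = []
        for path in beam:
            for action in ACTIONS:
                view = imagine(path.view, action)   # world-model rollout of one movement
                candidates.append(Path(path.actions + [action], view, score(view)))
        # Keep only the top-scoring imagined viewpoints for the next step.
        beam = sorted(candidates, key=lambda p: p.score, reverse=True)[:beam_width]
    return beam
```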

By iterating through simulation, evaluation, and integration, MindJourney can reason about spatial relationships far beyond what any single 2D image can convey, all without the need for additional training. On the Spatial Aptitude Training (SAT) benchmark, it improved the accuracy of VLMs by 8% over their baseline performance.

Building smarter agents

MindJourney showed strong performance on multiple 3D spatial-reasoning benchmarks, and even advanced VLMs improved when paired with its imagination loop. This suggests that the spatial patterns that world models learn from raw images, combined with the symbolic capabilities of VLMs, create a more complete spatial capability for agents. Together, they enable agents to infer what lies beyond the visible frame and interpret the physical world more accurately. 

It also demonstrates that pretrained VLMs and trainable world models can work together in 3D without retraining either one—pointing toward general-purpose agents capable of interpreting and acting in real-world environments. This opens the way to possible applications in autonomous robotics, smart home technologies, and accessibility tools for people with visual impairments. 

By converting systems that simply describe static images into active agents that continually evaluate where to look next, MindJourney connects computer vision with planning. Because exploration occurs entirely within the model’s latent space—its internal representation of the scene—robots would be able to test multiple viewpoints before determining their next move, potentially reducing wear, energy use, and collision risk. 

Looking ahead, we plan to extend the framework to use world models that not only predict new viewpoints but also forecast how the scene might change over time. We envision MindJourney working alongside VLMs that interpret those predictions and use them to plan what to do next. This enhancement could enable agents to more accurately interpret spatial relationships and physical dynamics, helping them operate effectively in changing environments.


Dion: the distributed orthonormal update revolution is here

Microsoft Research - Tue, 08/12/2025 - 22:09

Training AI models requires choosing an optimizer, and for nearly a decade, AdamW has been the optimizer of choice. Given that durability and success, it was fair to doubt that any further improvement was possible. And yet, last December, a new optimizer called Muon showed serious promise by powering a nanoGPT speedrun. This proved out, with multiple AI labs (e.g., Kimi-AI and Essential-AI) reporting 2x scale improvements and the release of the 1T-parameter Kimi K2 model. Restated: you can train a model to similar performance with half as many GPUs.

There’s one fly in the ointment: Muon requires large matrix multiplications in the optimizer, which require heavy communication in large models at the scale where FSDP and TP parallelization becomes desirable. Going back to the inspiration for Muon, the key idea is an orthonormal update, which sparked the search for more scalable alternative linear algebras realizing the same goal. That’s exactly what Dion is. We have open-sourced this new optimizer to enable anyone to train large models more efficiently at scale.

What’s an orthonormal update?

Figure 1. Illustration of matrix parameters

At the core of Transformers, a set of input activations is multiplied by a learned weight matrix to produce a new set of output activations. When the weight matrix is updated during training, the resulting change in the output activations generally depends on the direction of the input activations. As a result, the learning rate must be chosen conservatively to accommodate the input direction that induces the largest change. Orthonormalized updates alter this behavior by (approximately) making the change in output activations invariant to the direction of the input. This is achieved by enforcing orthonormality on the update matrix, thereby equalizing its effect across all input directions.
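In matrix terms, if the update (for example, the momentum matrix) has singular value decomposition M = UΣVᵀ, an orthonormalized update keeps only the orthogonal factor, applying every retained direction with equal strength. This is our shorthand for the idea, not notation taken from the post:

$$
M = U \Sigma V^{\top} \;\Longrightarrow\; \Delta W \propto U V^{\top},
$$

so the nonzero singular values of the applied update are all equal to one; Dion, described next, applies the same idea to only the top-r singular directions.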

What is Dion?

While Muon has shown strong empirical results, scaling it to very large models poses challenges. As reported by Essential AI, applying Muon to large architectures like LLaMA-3 becomes compute-bound—and potentially communication-bound—due to the cost of the Newton–Schulz orthonormalization steps.

Figure 2. Pseudocode of the centralized version of Dion

This is where Dion enters. At a high level, Dion introduces a new axis for scalability: the rank. Specifically, for a given rank r, Dion orthonormalizes only the top r singular directions, reducing communication and compute overhead while preserving performance. Empirically, we observe that the rank necessary for good performance grows much more slowly than the number of parameters in larger models.

Download Dion optimizer 

Dion implements orthonormalization using amortized power iteration (opens in new tab). Power iteration typically pulls out the largest singular value by repeated matrix multiplication. By amortizing this process over optimization steps—applied to the slowly evolving momentum matrix—we reduce the cost to just two matrix multiplications per step. Incorporating a QR decomposition allows us to extract an approximate orthonormal basis spanning the top singular directions, rather than just the leading one. This amortized power iteration is fully compatible with standard distributed training techniques such as FSDP and tensor parallelism. Here, we show a simple centralized version, but the technique works for more complex forms of parallelization, as presented in the paper. In other words, we can orthogonalize a matrix without ever seeing a full row or column of it.

Low-rank approximation would ordinarily introduce error, but Dion overcomes this through an error feedback mechanism. It keeps the residual of the low-rank approximation in the momentum matrix, so any systematic gradient structure not captured initially accumulates and is eventually applied in a future update.
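To make the two-matmuls-plus-QR idea and the error feedback concrete, here is a rough PyTorch sketch. It is not the released Dion implementation; the variable names, update scaling, and omitted details (distributed sharding, column normalization, and so on) are our own simplifications.

```python
import torch

def dion_like_step(M: torch.Tensor, Q: torch.Tensor, lr: float = 0.01):
    """Rough sketch of an amortized power-iteration update with error feedback.

    NOT the released Dion implementation; it only illustrates the idea above.
    M is the momentum matrix (m x n); Q (n x r) carries the approximate top-r
    right singular basis across optimizer steps.
    """
    P, _ = torch.linalg.qr(M @ Q)        # matmul 1 + QR: orthonormal left basis (m x r)
    R = M.T @ P                          # matmul 2: project momentum onto that basis (n x r)
    Q_next, _ = torch.linalg.qr(R)       # refreshed right basis for the next step

    M_lowrank = P @ R.T                  # the rank-r part actually orthonormalized
    M_residual = M - M_lowrank           # error feedback: what was missed stays in momentum

    update = P @ Q_next.T                # approximately orthonormal rank-r update (m x n)
    return -lr * update, M_residual, Q_next
```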

How does it work?

Something very strange happened in our experiments. Usually, adding an extra constraint on the way an algorithm works can be expected to decrease overall performance. And indeed, at the 120M parameter scale of the speedrun, we see Dion’s update taking more time than Muon, while not yielding any significant gains. But at larger scales, we observed a different trend: Dion began to outperform Muon.

Figure 3. Wall-clock time speedup of Dion for 3B model training

Why would adding a constraint improve the update rule? The answer lies in what the constraint enforces. Dion achieves a much closer approximation to true orthonormalization than Muon. This precision, initially subtle, becomes increasingly important as the number of singular vectors grows. Over increasing model scale and training steps, this small advantage accumulates—leading to a measurable improvement in performance.

This edge further grows with batch size—with larger batches the update quality tends to degrade, but notably more slowly with Dion than Muon (and Muon is already a significant improvement over AdamW).

Figure 4. Scaling of Dion across different batch sizes

Here you can see how the number of steps needed to reach a given pretraining loss, relative to AdamW, varies as batch size grows, for full-rank and ¼-rank Dion (in orange) and Muon (in blue).

In our experiments, these benefits extend to various post-training regimes as well.

We also experimented with rank, discovering empirically that larger models tolerate smaller rank well.

Figure 5. Low-rank Dion across different model sizes

Projecting this trend out to the scale of the LLaMA-3 (opens in new tab) 405B parameter model suggests that Dion remains fully effective for large dense models of that size even with rank fractions as low as 1/16 or 1/64.

Using hardware timings of the individual update steps suggests a story that looks like this:

Figure 6. Estimated wall-clock time of each optimizer step for Llama 3 405B. Lower is better. Muon is highlighted in orange as our baseline, next to Dion with varying rank fractions. Suggested rank fractions for a 405B parameter model are shown in blue. Using Dion with rank fraction 1/16 or lower offers an order-of-magnitude speedup over Muon.

We’ve open-sourced a PyTorch FSDP2 + Tensor Parallel (TP) implementation of Dion, available via a simple pip install. Our goal is to make faster training with Dion accessible to everyone. As a bonus, the repository also includes a PyTorch FSDP2 implementation of Muon.

Dion optimizer

Acknowledgements

We thank Riashat Islam and Pratyusha Sharma for their helpful feedback on the writing and presentation.


The post Dion: the distributed orthonormal update revolution is here appeared first on Microsoft Research.

Categories: Microsoft

Self-adaptive reasoning for science

Microsoft Research - Wed, 08/06/2025 - 18:00
Unlocking self-adaptive cognitive behavior that is more controllable and explainable than reasoning models in challenging scientific domains

Long-running LLM agents equipped with strong reasoning, planning, and execution skills have the potential to transform scientific discovery with high-impact advancements, such as developing new materials or pharmaceuticals. As these agents become more autonomous, ensuring effective human oversight and clear accountability becomes increasingly important, presenting challenges that must be addressed to unlock their full transformative power. Today’s approaches to long-term reasoning are established during the post-training phase, prior to end-user deployment and typically by the model provider. As a result, the expected actions of these agents are pre-baked by the model developer, offering little to no control from the end user.

At Microsoft, we are pioneering a vision for a continually steerable virtual scientist. In line with this vision, we created the ability to have a non-reasoning model develop thought patterns that allow for control and customizability by scientists. Our approach, a cognitive loop via in-situ optimization (CLIO), does not rely on reinforcement learning post-training to develop reasoning patterns yet still yields equivalent performance as demonstrated through our evaluation on Humanity’s Last Exam (HLE). Notably, we increased OpenAI GPT-4.1’s base model accuracy on text-only biology and medicine from 8.55% to 22.37%, an absolute increase of 13.82% (161.64% relative), surpassing o3 (high). This demonstrates that an optimization-based, self-adaptive AI system developed without further post-training can rival post-trained models in domains where adaptability, explainability, and control matter most.

Figure 1. Head-to-head comparison of OpenAI’s GPT-4.1 with CLIO, o3, and GPT-4.1 with no tools on HLE biology and medicine questions

In-situ optimization with internal self-reflection to enable self-adaptive reasoning

Model development has advanced from reinforcement learning from human feedback (RLHF) for answer alignment to external grading with reinforcement learning from verifiable rewards (RLVR). Recent approaches show promise in utilizing intrinsic rewards for training reasoning models (RLIR). Traditionally, these reasoning processes are learned during post-training, before any user interaction. While today’s reasoning models require additional data in the training phase and limit user control during the reasoning generation process, CLIO’s approach enables users to steer reasoning from scratch without additional data. Rather, CLIO generates the data it needs by creating reflection loops at runtime. These reflection loops are used for a wide array of activities that CLIO self-defines, encompassing idea exploration, memory management, and behavior control. Most interesting is CLIO’s ability to leverage prior inferences to adjust future behaviors, handling uncertainties and raising flags for correction when necessary. Through this open architecture approach to reasoning, we alleviate the need for further model post-training to achieve the desired reasoning behavior. Novel scientific discovery often has no established reasoning patterns to learn from, much less a large enough corpus of high-quality data to train on. 

CLIO reasons by continuously reflecting on progress, generating hypotheses, and evaluating multiple discovery strategies. For the HLE test, CLIO was specifically steered to follow the scientific method as a guiding framework. Our research shows that equipping language models with self-adapting reasoning enhances their problem-solving ability. It provides a net benefit in quality for science questions, as well as providing exposure and control to the end user.
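To make the idea of runtime reflection loops concrete, here is a minimal, hypothetical sketch of an in-situ reflection loop wrapped around a standard (non-reasoning) chat model. This is not CLIO’s implementation; the `chat` helper, the prompts, and the READY stopping convention are assumptions for illustration only.

```python
def reflective_answer(question: str, chat, max_loops: int = 3) -> str:
    """Minimal in-situ reflection loop (illustrative sketch, not CLIO itself).

    `chat(prompt)` stands in for a call to a standard, non-reasoning completion
    model; the prompts and the READY stopping convention are assumptions.
    """
    answer = chat(f"Propose an initial hypothesis and plan for: {question}")
    for _ in range(max_loops):
        critique = chat(
            "Reflect on the working answer below. List uncertainties and gaps, "
            "and reply READY if it is ready to finalize.\n"
            f"Question: {question}\nAnswer: {answer}"
        )
        if "READY" in critique.upper():   # assumed stopping signal
            return answer
        answer = chat(
            "Revise the answer to address this critique.\n"
            f"Critique: {critique}\nQuestion: {question}\nAnswer: {answer}"
        )
    return answer
```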

Figure 2. CLIO can raise key areas of uncertainty within its self-formulated reasoning process, balancing multiple different viewpoints using graph structures.

Control over uncertainty: Building trust in AI 

Orchestrated reasoning systems like CLIO are valuable for scientific discovery because they provide features beyond accuracy alone. Capabilities such as explaining the outcomes of internal reasoning are standard in the scientific field and are present in current reasoning model approaches. However, elements like displaying complete work (final outcomes, internal thought processes, and uncertainty thresholds that support reproducibility or correction) and indicating uncertainty are not yet universally implemented. Current models and systems do not have this innate humility. Rather, we are left with models that produce confident results, whether correct or incorrect. When correct, that is valuable. When incorrect, it is dangerous to the scientific process. Hence, understanding a model or system’s uncertainty is a crucial capability that we have built natively into CLIO.

On the other end of the spectrum, orchestrated reasoning systems tend to oversaturate the user by raising too many flags. We enable prompt-free control knobs within CLIO to set thresholds for raising uncertainty flags. This allows CLIO to flag uncertainty for itself and the end user at the proper point in time. It also enables scientists to revisit CLIO’s reasoning path with critiques, edit beliefs during the reasoning process, and re-execute it from the desired point in time. Ultimately, this builds a foundational level of trust with scientists, allowing such systems to be used in a scientifically defensible and rigorous way. 

How does CLIO perform? 

We evaluate CLIO against text-based biology and medicine questions from HLE. In this domain, we demonstrate a 61.98% relative increase, or an 8.56% net increase, in accuracy over OpenAI’s o3 and substantially outperform base completion models like OpenAI’s GPT-4.1, while enabling the requisite explainability and control. This technique applies to all models, showing similar increases for OpenAI’s GPT-4o model, which we observe performs poorly on HLE-level questions. On average, GPT-4.1 is not considered competent for HLE-scale questions (<9%), and GPT-4o is natively at less than 2%. By utilizing CLIO, we bring these models to near state-of-the-art performance against top reasoning models. CLIO’s recursive nature enables the system to think more broadly and deeply, ensuring coverage of the question when answered. In GPT-4.1, we see an increase of 5.92% in overall accuracy using just the cognitive loop recursion. To think more deeply, we allow CLIO to ensemble different evolutions and intelligently choose the best approach using GraphRAG. This extension of the cognition pattern provides a further 7.90% over a non-ensembled approach.  

Figure 3. The impact of thinking effort on CLIO’s effectiveness.

Furthermore, CLIO’s design offers different knobs of control, for example, how much time to think and which technique to use for a given problem. In Figure 3, we demonstrate these knobs of control and their effect on GPT-4.1 and GPT-4o’s performance. In this case, we analyze performance for a subset of biomedical questions focused on immunology. CLIO increases GPT-4o’s base performance to be on par with the best reasoning models for immunology questions. We observe a 13.60% improvement over the base model, GPT-4o. This result shows CLIO to be model agnostic, similar to the Microsoft AI Diagnostic Orchestrator (MAI-DxO) (opens in new tab) approach and its corresponding performance boost. 

Implications for science and trustworthy discovery

The future of scientific discovery demands more than reasoning over knowledge and raw computational power alone. Here, we demonstrate how CLIO not only increases model performance but also establishes new layers of control for scientists. In upcoming work, we will demonstrate how CLIO increases tool utility for highly valuable scientific questions in the drug discovery space, which requires precise tools designed for the language of science. While our experiments focus on scientific discovery, we believe CLIO can be applied in a domain-agnostic fashion. Experts tackling problems in domains such as financial analysis, engineering, and legal services could potentially benefit from AI systems with a transparent, steerable reasoning approach. Ultimately, we envision CLIO as an enduring control layer in hybrid AI stacks that combine traditional completion and reasoning models with external memory systems and advanced tool calling. The continuous checks and balances that CLIO enables will remain valuable even as components within these AI stacks evolve. This combination of intelligent, steerable scientific decision making and tool optimization is the basis of the recently announced Microsoft Discovery platform (opens in new tab).

At Microsoft, we’re committed to advancing AI research that earns the trust of scientists, empowering them to discover new frontiers of knowledge. Our work is a testament to what’s possible when we blend innovation with trustworthiness and a human-centered vision for the future of AI-assisted scientific discovery. We invite the research and scientific community to join us in shaping that future.

Further information:

To learn more about our approach, please read our preprint paper published alongside this blog. We are in the process of submitting this work for external peer review and encourage partners to explore the use of CLIO in Microsoft Discovery. To learn more about Microsoft’s research on this topic or to contact our team, please reach out to discoverylabs@microsoft.com.

Acknowledgements

We are grateful for Jason Zander and Nadia Karim’s support. We extend our thanks to colleagues both inside and outside Microsoft Discovery and Quantum for sharing their insights and feedback, including Allen Stewart, Yasser Asmi, David Marvin, Harsha Nori, Scott Lundberg, and Phil Waymouth. 


The post Self-adaptive reasoning for science appeared first on Microsoft Research.

Categories: Microsoft

Project Ire autonomously identifies malware at scale

Microsoft Research - Tue, 08/05/2025 - 18:00

Today, we are excited to introduce an autonomous AI agent that can analyze and classify software without assistance, a step forward in cybersecurity and malware detection. The prototype, Project Ire, automates what is considered the gold standard in malware classification: fully reverse engineering a software file without any clues about its origin or purpose. It uses decompilers and other tools, reviews their output, and determines whether the software is malicious or benign.

Project Ire emerged from a collaboration between Microsoft Research, Microsoft Defender Research, and Microsoft Discovery & Quantum, bringing together security expertise, operational knowledge, data from global malware telemetry, and AI research. It is built on the same collaborative and agentic foundation behind GraphRAG and Microsoft Discovery (opens in new tab). The system uses advanced language models and a suite of callable reverse engineering and binary analysis tools to drive investigation and adjudication.

As of this writing, Project Ire has achieved a precision (opens in new tab) of 0.98 and a recall (opens in new tab) of 0.83 using public datasets of Windows drivers. It was the first reverse engineer at Microsoft, human or machine, to author a conviction case—a detection strong enough to justify automatic blocking—for a specific advanced persistent threat (APT) malware sample, which has since been identified and blocked by Microsoft Defender. 

Malware classification at a global scale

Microsoft’s Defender platform scans more than one billion monthly active devices (opens in new tab) through the company’s Defender suite of products, work that routinely requires manual review of software by experts.

This kind of work is challenging. Analysts often face error and alert fatigue, and there’s no easy way to compare and standardize how different people review and classify threats over time. For both of these reasons, today’s overloaded experts are vulnerable to burnout, a well-documented issue in the field.

Unlike other AI applications in security, malware classification lacks a computable validator (opens in new tab). The AI must make judgment calls without definitive validation beyond expert review. Many behaviors found in software, like reverse engineering protections, don’t clearly indicate whether a sample is malicious or benign. 

This ambiguity requires analysts to investigate each sample incrementally, building enough evidence to determine whether it’s malicious or benign despite opposition from adaptive, active adversaries. This has long made it difficult to automate and scale what is inherently a complex and expensive process.

Technical foundation

Project Ire attempts to address these challenges by acting as an autonomous system that uses specialized tools to reverse engineer software. The system’s architecture allows for reasoning at multiple levels, from low-level binary analysis to control flow reconstruction and high-level interpretation of code behavior.

Its tool-use API enables the system to update its understanding of a file using a wide range of reverse engineering tools, including Microsoft memory analysis sandboxes based on Project Freta (opens in new tab), custom and open-source tools, documentation search, and multiple decompilers.  

Reaching a verdict 

The evaluation process begins with a triage, where automated reverse engineering tools identify the file type, its structure, and potential areas of interest. From there, the system reconstructs the software’s control flow graph using frameworks such as angr (opens in new tab) and Ghidra (opens in new tab), building a graph that forms the backbone of Project Ire’s memory model and guides the rest of the analysis.  
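As a point of reference (not Project Ire’s internal tooling), recovering a control flow graph with angr looks roughly like this; the binary path is a placeholder:

```python
import angr

# Load the binary without auto-loading shared libraries to keep the analysis focused.
# "sample.exe" is a placeholder path, not a sample from this post.
project = angr.Project("sample.exe", auto_load_libs=False)

# Statically recover the control flow graph; CFGFast is angr's standard CFG analysis.
cfg = project.analyses.CFGFast()

# The recovered functions are a natural starting point for per-function analysis.
for func in project.kb.functions.values():
    print(hex(func.addr), func.name)
```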

Through iterative function analysis, the LLM calls specialized tools through an API to identify and summarize key functions. Each result feeds into a “chain of evidence,” a detailed, auditable trail that shows how the system reached its conclusion. This traceable evidence log supports secondary review by security teams and helps refine the system in cases of misclassification.  

To verify its findings, Project Ire can invoke a validator tool that cross-checks claims in the report against the chain of evidence. This tool draws on expert statements from malware reverse engineers on the Project Ire team. Using this evidence and its internal model, the system creates a final report and classifies the sample as malicious or benign.

Preliminary testing shows promise 

Two early evaluations tested Project Ire’s effectiveness as an autonomous malware classifier. In the first, we assessed Project Ire on a dataset of publicly accessible Windows drivers, some known to be malicious, others benign. Malicious samples came from the Living off the Land Drivers (opens in new tab) database, which includes a collection of Windows drivers used by attackers to bypass security controls, while known benign drivers were sourced from Windows Update. 

This classifier performed well, correctly identifying 90% of all files and flagging only 2% of benign files as threats. It achieved a precision of 0.98 and a recall of 0.83. This low false-positive rate suggests clear potential for deployment in security operations, alongside expert reverse engineering reviews. 

For each file it analyzes, Project Ire generates a report that includes an evidence section, summaries of all examined code functions, and other technical artifacts.  

Figures 1 and 2 present reports for two successful malware classification cases generated during testing. The first involves a kernel-level rootkit, Trojan:Win64/Rootkit.EH!MTB (opens in new tab). The system identified several key features, including jump-hooking, process termination, and web-based command and control. It then correctly flagged the sample as malicious.

Figure 1 Analysis

The binary contains a function named ‘MonitorAndTerminateExplorerThread_16f64’ that runs an infinite loop waiting on synchronization objects and terminates system threads upon certain conditions. It queries system or process information, iterates over processes comparing their names case-insensitively to ‘Explorer.exe’, and manipulates registry values related to ‘Explorer.exe’. This function appears to monitor and potentially terminate or manipulate the ‘Explorer.exe’ process, a critical Windows shell process. Such behavior is suspicious and consistent with malware that aims to disrupt or control system processes.

Another function, ‘HttpGetRequestAndResponse_174a4’, performs HTTP GET requests by parsing URLs, resolving hostnames, opening sockets, sending requests, and reading responses. This network communication capability could be leveraged for command and control or data exfiltration, common in malware.

The binary also includes a function ‘PatchProcessEntryPointWithHook_12b5c’ that patches the entry point of a process by writing a hook or trampoline that redirects execution to a specified address. This technique is commonly used for process injection or hooking, allowing malware to alter process behavior or inject malicious code.

Other functions related to sending IOCTL requests to device drivers were identified, but their maliciousness could not be conclusively determined without additional context.

Overall, the binary exhibits multiple indicators of malicious behavior, including process manipulation, network communication, and code injection techniques, suggesting it is likely malware designed to interfere with system processes and communicate with remote servers.

Figure 1. Project Ire report, sample with SHA256: 86047bb1969d1db455493955fd450d18c62a3f36294d0a6c3732c88dfbcc4f62 (opens in new tab)

The second sample, HackTool:Win64/KillAV!MTB (opens in new tab), was designed to disable antivirus software. Project Ire correctly identified the code that locates and disables antivirus programs, providing evidence that the file was malicious.  

In one section of the code, however, the system misidentified a function as anti-debugging behavior. To maintain accuracy, the system used the validator tool to flag the claim as unsupported. The issue was later resolved by updating decompiler rules, but this example illustrates how Project Ire navigates uncertainty during analysis. Figure 2 shows the corresponding report. 

Figure 2 Analysis

The binary contains several functions indicative of malicious intent. The function register_and_log_known_processes_140001000 logs and registers process names associated with antivirus and security software, such as ‘avp.exe’, ‘avpui.exe’, and ‘360Tray.exe’. It calls another function, TerminateProcessesByNameSubstring_1400010f4, which enumerates system processes and terminates those whose names contain specified substrings. This behavior is typical of malware attempting to disable or evade security software by killing their processes.

Another function, check_and_handle_special_state_14000502c, performs checks on a global variable and triggers software interrupts if certain conditions are not met. While the exact purpose of these interrupts (int 0x29 and int 0x3) is unclear, they could represent an anti-debug or anti-analysis mechanism to detect or interfere with debugging or tampering attempts. However, this assumption could not be fully validated against expert statements.

Other functions include initialization routines and simple logging wrappers, but the core malicious behavior centers on process termination targeting security software. This indicates the binary is designed to compromise system security by disabling protective processes, a hallmark of malware such as trojans or rootkits.

Figure 2. Project Ire report, sample with SHA256: b6cb163089f665c05d607a465f1b6272cdd5c949772ab9ce7227120cf61f971a (opens in new tab)

Real-world evaluation with Microsoft Defender 

The more demanding test involved nearly 4,000 “hard-target” files not classified by automated systems and slated for manual review by expert reverse engineers.

In this real-world scenario, Project Ire operated fully autonomously on files created after the language models’ training cutoff, files that no other automated tools at Microsoft could classify at the time.

The system achieved a high precision score of 0.89, meaning nearly 9 out of 10 files flagged as malicious were correctly identified. Recall was 0.26, indicating that under these challenging conditions, the system detected roughly a quarter of all actual malware.

The system correctly identified many of the malicious files, with few false alarms, just a 4% false positive rate. While overall performance was moderate, this combination of accuracy and a low error rate suggests real potential for future deployment.

Looking ahead 

Based on these early successes, the Project Ire prototype will be leveraged inside Microsoft’s Defender organization as Binary Analyzer for threat detection and software classification.

Our goal is to scale the system’s speed and accuracy so that it can correctly classify files from any source, even on first encounter. Ultimately, our vision is to detect novel malware directly in memory, at scale.

Acknowledgements 

Project Ire acknowledges the following additional developers who contributed to the results in this publication: Dayenne de Souza, Raghav Pande, Ryan Terry, Shauharda Khadka, and Bob Fleck for their independent review of the system.

The system incorporates multiple tools, including the angr framework developed by Emotion Labs (opens in new tab). Microsoft has collaborated extensively with Emotion Labs, a pioneer in cyber autonomy, throughout the development of Project Ire, and thanks them for the innovations and insights that contributed to the successes reported here. 


The post Project Ire autonomously identifies malware at scale appeared first on Microsoft Research.

Categories: Microsoft

VeriTrail: Detecting hallucination and tracing provenance in multi-step AI workflows

Microsoft Research - Tue, 08/05/2025 - 18:00
Watch VeriTrail Explainer

Many applications of language models (LMs) involve generating content based on source material, such as answering questions, summarizing information, and drafting documents. A critical challenge for these applications is that LMs may produce content that is not supported by the source text – a phenomenon known as “closed-domain hallucination.”1

Existing methods for detecting closed-domain hallucination typically compare a given LM output to the source text, implicitly assuming that there is only a single output to evaluate. However, applications of LMs increasingly involve processes with multiple generative steps: LMs generate intermediate outputs that serve as inputs to subsequent steps and culminate in a final output. Many agentic workflows follow this paradigm (e.g., each agent is responsible for a specific document or sub-task, and their outputs are synthesized into a final response).  

In our paper “VeriTrail: Closed-Domain Hallucination Detection with Traceability,” we argue that, given the complexity of processes with multiple generative steps, detecting hallucination in the final output is necessary but not sufficient. We also need traceability, which has two components: 

  1. Provenance: if the final output is supported by the source text, we should be able to trace its path through the intermediate outputs to the source. 
  2. Error Localization: if the final output is not supported by the source text, we should be able to trace where the error was likely introduced.

Our paper presents VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for processes with any number of generative steps. We also demonstrate that VeriTrail outperforms baseline methods commonly used for hallucination detection. In this blog post, we provide an overview of VeriTrail’s design and performance.2

VeriTrail’s hallucination detection process

A key idea leveraged by VeriTrail is that a wide range of generative processes can be represented as a directed acyclic graph (DAG). Each node in the DAG represents a piece of text (i.e., source material, an intermediate output, or the final output) and each edge from node A to node B indicates that A was used as an input to produce B. Each node is assigned a unique ID, as well as a stage reflecting its position in the generative process.  
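As a minimal illustration (the field names are ours, not VeriTrail’s API), such a DAG can be represented as nodes that carry text, a stage, and the IDs of their input nodes:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One piece of text in the generative process (illustrative field names)."""
    node_id: int
    stage: int                 # position in the generative process (1 = source text chunks)
    text: str
    inputs: list[int] = field(default_factory=list)   # IDs of nodes used to produce this node

# Toy DAG: a source chunk feeds an intermediate summary, which feeds the final output.
dag = {
    1: Node(1, 1, "source chunk text"),
    2: Node(2, 2, "intermediate summary text", inputs=[1]),
    3: Node(3, 3, "final output text", inputs=[2]),
}
```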

An example of a process with multiple generative steps is GraphRAG. A DAG representing a GraphRAG run is illustrated in Figure 1, where the boxes and arrows correspond to nodes and edges, respectively.3

Figure 1: GraphRAG splits the source text into chunks (Stage 1). For each chunk, an LM extracts entities and relationships (the latter are denoted by “⭤ “), along with short descriptions (Stage 2). If an entity or a relationship was extracted from multiple chunks, an LM summarizes the descriptions (Stage 3). A knowledge graph is constructed from the final set of entities and relationships, and a community detection algorithm, such as Leiden clustering, groups entities into communities. For each community, an LM generates a “community report” that summarizes the entities and relationships (Stage 4). To answer a user’s question, an LM generates “map-level answers” based on groups of community reports (Stage 5), then synthesizes them into a final answer (Stage 6).

VeriTrail takes as input a DAG representing a completed generative process and aims to determine whether the final output is fully supported by the source text. It begins by extracting claims (i.e., self-contained, verifiable statements) from the final output using Claimify. VeriTrail verifies claims in the reverse order of the generative process: it starts from the final output and moves toward the source text. Each claim is verified separately. Below, we include two case studies that illustrate how VeriTrail works, using the DAG from Figure 1. 

Case study 1: A “Fully Supported” claim

Figure 2: Left: GraphRAG as a DAG. Right: VeriTrail’s hallucination detection process for a “Fully Supported” claim.

Figure 2 shows an example of a claim that VeriTrail determined was not hallucinated: 

  • In Iteration 1, VeriTrail identified the nodes that were used as inputs for the final answer: Nodes 15 and 16. Each identified node was split into sentences, and each sentence was programmatically assigned a unique ID.
    • An LM then performed Evidence Selection, selecting all sentence IDs that strongly implied the truth or falsehood of the claim. The LM also generated a summary of the selected sentences (not shown in Figure 2). In this example, a sentence was selected from Node 15.
    • Next, an LM performed Verdict Generation. If no sentences had been selected in the Evidence Selection step, the claim would have been assigned a “Not Fully Supported” verdict. Instead, an LM was prompted to classify the claim as “Fully Supported,” “Not Fully Supported,” or “Inconclusive” based on the evidence. In this case, the verdict was “Fully Supported.”
  • Since the verdict in Iteration 1 was “Fully Supported,” VeriTrail proceeded to Iteration 2. It considered the nodes from which at least one sentence was selected in the latest Evidence Selection step (Node 15) and identified their input nodes (Nodes 12 and 13). VeriTrail repeated Evidence Selection and Verdict Generation for the identified nodes. Once again, the verdict was “Fully Supported.” This process – identifying candidate nodes, performing Evidence Selection and Verdict Generation – was repeated in Iteration 3, where the verdict was still “Fully Supported,” and likewise in Iteration 4. 
  • In Iteration 4, a single source text chunk was verified. Since the source text, by definition, does not have any inputs, verification terminated and the verdict was deemed final.
Case study 2: A “Not Fully Supported” claim

Figure 3: Left: GraphRAG as a DAG. Right: VeriTrail’s hallucination detection process for a “Not Fully Supported” claim, where the maximum number of consecutive “Not Fully Supported” verdicts was set to 2.

Figure 3 provides an example of a claim where VeriTrail identified hallucination:

  • In Iteration 1, VeriTrail identified the nodes used as inputs for the final answer: Nodes 15 and 16. After Evidence Selection and Verdict Generation, the verdict was “Not Fully Supported.” Users can configure the maximum number of consecutive “Not Fully Supported” verdicts permitted. If the maximum had been set to 1, verification would have terminated here, and the verdict would have been deemed final. Let’s assume the maximum was set to 2, meaning that VeriTrail had to perform at least one more iteration.
  • Even though evidence was selected only from Node 15 in Iteration 1, VeriTrail checked the input nodes for both Node 15 and Node 16 (i.e., Nodes 12, 13, and 14) in Iteration 2. Recall that in Case Study 1 where the verdict was “Fully Supported,” VeriTrail only checked the input nodes for Node 15. Why was the “Not Fully Supported” claim handled differently? If the Evidence Selection step overlooked relevant evidence, the “Not Fully Supported” verdict might be incorrect. In this case, continuing verification based solely on the selected evidence (i.e., Node 15) would propagate the mistake, defeating the purpose of repeated verification.
  • In Iteration 2, Evidence Selection and Verdict Generation were repeated for Nodes 12, 13, and 14. Once again, the verdict was “Not Fully Supported.” Since this was the second consecutive “Not Fully Supported” verdict, verification terminated and the verdict was deemed final.
Providing traceability

In addition to assigning a final “Fully Supported,” “Not Fully Supported,” or “Inconclusive” verdict to each claim, VeriTrail returns (a) all Verdict Generation results and (b) an evidence trail composed of all Evidence Selection results: the selected sentences, their corresponding node IDs, and the generated summaries. Collectively, these outputs provide traceability: 

  1. Provenance: For “Fully Supported” and “Inconclusive” claims, the evidence trail traces a path from the source material to the final output, helping users understand how the output may have been derived. For example, in Case Study 1, the evidence trail consists of Sentence 8 from Node 15, Sentence 11 from Node 13, Sentence 26 from Node 4, and Sentence 79 from Node 1.
  2. Error Localization: For “Not Fully Supported” claims, VeriTrail uses the Verdict Generation results to identify the stage(s) of the process where the unsupported content was likely introduced. For instance, in Case Study 2, where none of the verified intermediate outputs supported the claim, VeriTrail would indicate that the hallucination occurred in the final answer (Stage 6). Error stage identification helps users address hallucinations and understand where in the process they are most likely to occur. 

The evidence trail also helps users verify the verdict: instead of reading through all nodes – which may be infeasible for processes that generate large amounts of text – users can simply review the evidence sentences and summaries. 

Key design features

VeriTrail’s design prioritizes reliability, efficiency, scalability, and user agency. Notable features include: 

  • During Evidence Selection (introduced in Case Study 1), the sentence IDs returned by the LM are checked against the programmatically assigned IDs. If a returned ID does not match an assigned ID, it is discarded; otherwise, it is mapped to its corresponding sentence. This approach guarantees that the sentences included in the evidence trail are not hallucinated (a minimal sketch of this check follows the list).
  • After a claim is assigned an interim “Fully Supported” or “Inconclusive” verdict (as in Case Study 1), VeriTrail verifies the input nodes of only the nodes from which evidence was previously selected – not all possible input nodes. By progressively narrowing the search space, VeriTrail limits the number of nodes the LM must evaluate. In particular, since VeriTrail starts from the final output and moves toward the source text, it tends to verify a smaller proportion of nodes as it approaches the source text. Nodes closer to the source text tend to be larger (e.g., a book chapter should be larger than its summary), so verifying fewer of them helps reduce computational cost.
  • VeriTrail is designed to handle input graphs with any number of nodes, regardless of whether they fit in a single prompt. Users can specify an input size limit per prompt. For Evidence Selection, inputs that exceed the limit are split across multiple prompts. If the resulting evidence exceeds the input size limit for Verdict Generation, VeriTrail reruns Evidence Selection to compress the evidence further. Users can configure the maximum number of Evidence Selection reruns.  
  • The configurable maximum number of consecutive “Not Fully Supported” verdicts (introduced in Case Study 2) allows the user to find their desired balance between computational cost and how conservative VeriTrail is in flagging hallucinations. A lower maximum reduces cost by limiting the number of checks. A higher maximum increases confidence that a flagged claim is truly hallucinated since it requires repeated confirmation of the “Not Fully Supported” verdict. 
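As referenced in the first bullet above, the sentence-ID check is straightforward; here is a minimal sketch (hypothetical names, not VeriTrail’s code):

```python
def validate_selected_ids(selected_ids, assigned_sentences):
    """Keep only evidence IDs that the pipeline actually assigned (illustrative sketch).

    `assigned_sentences` maps programmatically assigned sentence IDs to sentence text.
    Any ID the LM returns that is not in this map is discarded, so the evidence trail
    can never contain a hallucinated sentence.
    """
    return {sid: assigned_sentences[sid] for sid in selected_ids if sid in assigned_sentences}

# Example: the LM returned one valid ID (8) and one hallucinated ID (99).
evidence = validate_selected_ids([8, 99], {8: "sentence eight", 11: "sentence eleven"})
# -> {8: "sentence eight"}
```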
Evaluating VeriTrail’s performance

We tested VeriTrail on two datasets covering distinct generative processes (hierarchical summarization4 and GraphRAG), tasks (summarization and question-answering), and types of source material (fiction novels and news articles). For the source material, we focused on long documents and large collections of documents (i.e., >100K tokens), where hallucination detection is especially challenging and processes with multiple generative steps are typically most valuable. The resulting DAGs were much more complex than the examples provided above (e.g., in one of the datasets, the average number of nodes was 114,368).

We compared VeriTrail to three types of baseline methods commonly used for closed-domain hallucination detection: Natural Language Inference models (AlignScore (opens in new tab) and INFUSE (opens in new tab)); Retrieval-Augmented Generation; and long-context models (Gemini 1.5 Pro and GPT-4.1 mini). Across both datasets and all language models tested, VeriTrail outperformed the baseline methods in detecting hallucination.5

Most importantly, VeriTrail traces claims through intermediate outputs – unlike the baseline methods, which directly compare the final output to the source material. As a result, it can identify where hallucinated content was likely introduced and how faithful content may have been derived from the source. By providing traceability, VeriTrail brings transparency to generative processes, helping users understand, verify, debug, and, ultimately, trust their outputs.  

For an in-depth discussion of VeriTrail, please see our paper “VeriTrail: Closed-Domain Hallucination Detection with Traceability.”

1 The term “closed-domain hallucination” was introduced by OpenAI in the GPT-4 Technical Report (opens in new tab).

2 VeriTrail is currently used for research purposes only and is not available commercially.

3 We focus on GraphRAG’s global search method.

4 In hierarchical summarization, an LM summarizes each source text chunk individually, then the resulting summaries are repeatedly grouped and summarized until a final summary is produced (Wu et al., 2021 (opens in new tab); Chang et al., 2023 (opens in new tab)).

5 The only exception was the mistral-large-2411 model, where VeriTrail had the highest balanced accuracy, but not the highest macro F1 score.


The post VeriTrail: Detecting hallucination and tracing provenance in multi-step AI workflows appeared first on Microsoft Research.

Categories: Microsoft

Xinxing Xu bridges AI research and real-world impact at Microsoft Research Asia – Singapore

Microsoft Research - Thu, 07/24/2025 - 03:30

AI has made remarkable progress in recent years, but turning experimental models into tools that work in the real world is still a major challenge. Bridging this gap between innovation and application has shaped the career of Xinxing Xu, principal researcher at Microsoft Research Asia – Singapore (opens in new tab), and underpins the mission of the lab’s newly established presence in the region.

Xinxing Xu, Principal Researcher, Microsoft Research Asia – Singapore

“Innovative algorithms can only demonstrate their true value when tested with real-world data and in actual scenarios, where they can be continuously optimized through iteration,” he says.

Xu’s commitment to balancing algorithmic innovation with practical application has shaped his entire career. During his PhD studies at Nanyang Technological University, Singapore, Xu focused on emerging technologies like multiple kernel learning methods and multimodal machine learning. Today he’s applying these techniques to real-world use cases like image recognition and video classification.

After completing his doctorate, he joined the Institute of High Performance Computing at Singapore’s Agency for Science, Technology and Research (A*STAR), where he worked on interdisciplinary projects ranging from medical image recognition to AI systems for detecting defects on building facades. These experiences broadened his perspective and deepened his passion for translating AI into real-world impact.

In 2024, Xu joined Microsoft Research Asia where he began a new chapter focused on bridging between academic research and real-world AI applications.

“Microsoft Research Asia is committed to integrating scientific exploration with real-world applications, which creates a unique research environment,” Xu says. “It brings together top talent and resources, and Microsoft’s engineering and product ecosystem strongly supports turning research into impactful technology. The lab’s open and inclusive culture encourages innovation with broader societal impact. It reflects the approach to research I’ve always hoped to contribute to.”

Bringing cross-domain expertise to AI’s real-world frontiers

As a key hub in Microsoft Research’s network across Asia, the Singapore lab is guided by a three-part mission: to drive industry-transforming AI deployment, pursue fundamental breakthroughs in the field, and promote responsible, socially beneficial applications of the technology.

To reach these goals, Xu and his colleagues are working closely with local collaborators, combining cross-disciplinary expertise to tackle complex, real-world challenges.

One key focus is healthcare, where Xu leads a collaboration with Singapore’s SingHealth to explore how AI can support precision medicine. By combining SingHealth’s clinical data with advanced AI models, the team aims to deliver more personalized analyses and sharper diagnostic tools—laying the groundwork for improved patient outcomes. 

Beyond healthcare, the team is also targeting key sectors like finance and logistics. By developing domain-specific foundation models and AI agents, they aim to support smarter decision-making and accelerate digital transformation across industries. “Singapore has a strong foundation in these sectors,” Xu notes, “making it an ideal environment for technology validation and iteration.”

The team is also partnering with leading academic institutions, including the National University of Singapore (NUS) and Nanyang Technological University, Singapore (NTU Singapore), to advance the field of spatial intelligence. Their goal is to develop embodied intelligence systems capable of carrying out complex tasks in smart environments.

As AI becomes more deeply embedded in everyday life, researchers at the Singapore lab are also increasingly focused on what they call “societal AI”—building AI systems that are culturally relevant and trustworthy within Southeast Asia’s unique cultural and social contexts. In collaboration with global colleagues, they’re helping to advance a more culturally grounded and responsible approach to AI research in the region.

Microsoft Research Asia – Singapore: Expanding global reach, connecting regional innovation 

Realizing AI’s full potential requires more than technical breakthroughs. It also depends on collaboration—across industries, academia, and policy. Only through this intersection of forces can AI move beyond the lab to deliver meaningful societal value. 

Singapore’s strengths in science, engineering, and digital governance make it an ideal setting for this kind of work. Its collaborative culture, robust infrastructure, international talent pool, and strong policy support for science and technology make it fertile ground for interdisciplinary research. 

This is why Microsoft Research Asia continues to collaborate closely with Singapore’s top universities, research institutions, and industry partners. These partnerships support joint research, talent development, and technical exchange. Building on this foundation, Microsoft Research Asia – Singapore will further deepen its collaboration with NUS, NTU Singapore, and Singapore Management University (SMU) to advance both fundamental and applied research, while equipping the next generation of researchers with real-world experience. In addition, Microsoft Research Asia is fostering academic exchange and strengthening the research ecosystem through summer schools and joint workshops with NUS, NTU Singapore, and SMU. 

The launch of the Singapore lab further marks an important step in expanding the company’s global research footprint, serving as a bridge between regional innovation and Microsoft’s global ecosystem. Through its integrated lab network, Microsoft Research fosters the sharing of technologies, methods, and real-world insights, creating a virtuous cycle of innovation.

“We aim to build a research hub in Singapore that is globally connected and deeply rooted in the local ecosystem,” Xu says. “Many breakthroughs come from interdisciplinary and cross-regional collaboration. By breaking boundaries—across disciplines, industries, and geographies—we can drive research that has lasting impact.”

As AI becomes more deeply woven into industry and everyday life, Xu believes that meaningful research must be closely connected to regional development and social well-being. “Microsoft Research Asia – Singapore is a future-facing lab,” he says. “While we push technological frontiers, we’re equally committed to the responsibility of technology—ensuring AI can help address society’s most pressing challenges.”

In a world shaped by global challenges, Xu sees collaboration and innovation as essential to real progress. With Singapore as a launchpad, he and his team are working to extend AI’s impact and value across Southeast Asia and beyond.

Xinxing Xu (center) with colleagues at Microsoft Research Asia – Singapore

Three essential strengths for the next generation of AI researchers

AI’s progress depends not only on technical breakthroughs but also on the growth and dedication of talent. At Microsoft Research Asia, there is a strong belief that bringing research into the real world requires more than technical coordination—it depends on unlocking the full creativity and potential of researchers.

In Singapore—a regional innovation hub that connects Southeast Asia—Xu and his colleagues are working to push AI beyond the lab and into fields like healthcare, finance, and manufacturing. For young researchers hoping to shape the future of AI, this is a uniquely powerful stage.

To help guide the next generation, Xu shares three pieces of advice:

  • Build a strong foundation – “Core knowledge in machine learning, linear algebra, and probability and statistics is the bedrock of AI research,” Xu says. “A solid theoretical base is essential to remain competitive in a rapidly evolving field. Even today’s hottest trends in generative AI rely on longstanding principles of optimization and model architecture design.” While code generation tools are on the rise, Xu emphasizes that mathematical fundamentals remain essential for understanding and innovating in AI.
  • Understand real-world applications – Technical skills alone aren’t enough. Xu encourages young researchers to deeply engage with the problems they’re trying to solve. Only by tightly integrating technology with its context can researchers create truly valuable solutions.

    “In healthcare, for example, researchers may need to follow doctors in clinics to gain a true understanding of clinical workflows. That context helps identify the best entry points for AI deployment. Framing research problems around real-world needs is often more impactful than just tuning model parameters,” Xu says.
  • Develop interdisciplinary thinking – Cross-disciplinary collaboration is becoming essential to AI innovation. Xu advises young researchers to learn how to work with experts from other fields to explore new directions together. “These kinds of interactions often spark fresh, creative ideas,” he says.

    Maintaining curiosity is just as important. “Being open to new technologies and fields is what enables researchers to continually break new ground and produce original results.”

Xu extends an open invitation to aspiring researchers from all backgrounds to join Microsoft Research Asia – Singapore. “We offer a unique platform that blends cutting-edge research with real-world impact,” he says. “It’s a place where you can work on the frontiers of AI—and see how your work can help transform industries and improve lives.”

To learn more about current openings at the Singapore lab, please visit our careers page (opens in new tab)


The post Xinxing Xu bridges AI research and real-world impact at Microsoft Research Asia – Singapore appeared first on Microsoft Research.

Categories: Microsoft

Technical approach for classifying human-AI interactions at scale

Microsoft Research - Wed, 07/23/2025 - 18:00

As large language models (LLMs) become foundational to modern AI systems, the ability to run them at scale—efficiently, reliably, and in near real-time—is no longer a nice-to-have. It’s essential. The Semantic Telemetry project tackles this challenge by applying LLM-based classifiers to hundreds of millions of sampled, anonymized Bing Chat conversations each week. These classifiers extract signals like user expertise, primary topic, and satisfaction, enabling deeper insight into human-AI interactions and driving continuous system improvement.

But building a pipeline that can handle this volume isn’t just about plugging into an API. It requires a high-throughput, high-performance architecture that can orchestrate distributed processing, manage token and prompt complexity, and gracefully handle the unpredictability of remote LLM endpoints.

In this latest post in our series on Semantic Telemetry, we’ll walk through the engineering behind that system—how we designed for scale from the start, the trade-offs we made, and the lessons we learned along the way. From batching strategies to token optimization and orchestration, we’ll share what it takes to build a real-time LLM classification pipeline.

For additional project background: Semantic Telemetry: Understanding how users interact with AI systems and Engagement, user expertise, and satisfaction: Key insights from the Semantic Telemetry Project.

System architecture highlights

The Semantic Telemetry pipeline (opens in new tab) is a highly-scalable, highly-configurable, data transformation pipeline. While it follows a familiar ETL structure, several architectural innovations make it uniquely suited for high-throughput LLM integration:

  • Hybrid compute engine
    The pipeline combines the distributed power of PySpark with the speed and simplicity of Polars, enabling it to scale across large datasets or run lightweight jobs in Spark-less environments—without code changes.
  • LLM-centric transformation layer
    At the core of the pipeline is a multi-stage transformation process tailored for running across multiple LLM endpoints such that:
    • Runs model agnostic: it provides a generic interface for LLMs, with model-specific interfaces built on top of the generic one.
    • Prompt templates are defined using the Prompty language specification for consistency and reuse, with options for users to include custom prompts.
    • Parsing and cleaning logic ensures structured, schema-aligned outputs even when LLM responses are imperfect, for example by removing extra characters from the output, resolving inexact label matches (e.g., “create” versus “created”), and relabeling invalid classifications (a minimal cleaning sketch appears below).
Figure 1. Architecture diagram
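As a minimal illustration of that parsing and cleaning step (the label set and rules here are hypothetical, not the project’s actual schema):

```python
import re

VALID_LABELS = {"novice", "intermediate", "expert"}  # hypothetical label set

def clean_label(raw: str, default: str = "unknown") -> str:
    """Normalize an imperfect LLM response into a schema-aligned label (sketch)."""
    # Strip punctuation and extra characters the model may have added around the label
    label = re.sub(r"[^a-z ]", "", raw.strip().lower())
    if label in VALID_LABELS:
        return label
    # Resolve inexact matches, e.g. "experts" or "expert user" -> "expert"
    for valid in VALID_LABELS:
        if valid in label:
            return valid
    # Relabel anything still invalid to a safe default
    return default
```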

The pipeline supports multiple classification tasks (e.g., user expertise, topic, satisfaction) through modular prompt templates and configurable execution paths—making it easy to adapt to new use cases or environments.

Engineering challenges & solutions

Building a high-throughput, LLM-powered classification pipeline at scale introduced a range of engineering challenges—from managing latency and token limits to ensuring system resilience. Below are the key hurdles we encountered and how we addressed them.

LLM endpoint latency & variability

Challenge: LLM endpoints, especially those hosted remotely (e.g., Azure OpenAI), introduce unpredictable latency due to model load, prompt complexity, and network variability. This made it difficult to maintain consistent throughput across the pipeline.

Solution: We implemented a combination of:

  • Multiple Azure OpenAI endpoints in rotation to increase throughput and distribute workload. We can analyze throughput and redistribute as needed.
  • Saving output in intervals to write data asynchronously in case of network errors.
  • Utilizing models with higher tokens per minute (TPM), such as OpenAI’s GPT-4o mini. GPT-4o mini had a 2M TPM limit, a 25x throughput increase over GPT-4 (80K TPM -> 2M TPM).
  • Timeouts and retries with exponential backoff (a simplified sketch follows this list).
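For the last item, here is a minimal sketch of timeouts plus exponential backoff with jitter; the `call` function stands in for whatever client issues the LLM request, so its signature is an assumption:

```python
import random
import time

def call_with_retries(call, max_retries: int = 5, base_delay: float = 1.0, timeout: float = 60.0):
    """Invoke an LLM endpoint with a timeout and exponential backoff (illustrative sketch).

    `call` is any function that issues one LLM request and accepts a `timeout` argument;
    the endpoint and client library are left abstract here.
    """
    for attempt in range(max_retries):
        try:
            return call(timeout=timeout)
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter: ~1s, 2s, 4s, ... plus a random offset
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```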
Evolving LLM models & prompt alignment

Challenge: Each new LLM release—such as Phi, Mistral, DeepSeek, and successive generations of GPT (e.g., GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o)—brings improvements, but also subtle behavioral shifts. These changes can affect classification consistency, output formatting, and even the interpretation of prompts. Maintaining alignment with baseline expectations across models became a moving target.

Solution: We developed a model evaluation workflow to test prompt alignment across LLM versions:

  • Small-sample testing: We ran the pipeline on a representative sample using the new model and compared the output distribution to a known baseline.
  • Distribution analysis: If the new model’s output aligned closely, we scaled up testing. If not, we iteratively tuned the prompts and re-ran comparisons.
  • Interpretation flexibility: We also recognized that a shift in distribution isn’t always a regression. Sometimes it reflects a more accurate or nuanced classification, especially as models improve.

To support this process, we used tools like Sammo (opens in new tab), which allowed us to compare outputs across multiple models and prompt variants. This helped us quantify the impact of prompt changes and model upgrades and make informed decisions about when to adopt a new model or adjust our classification schema.
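A minimal sketch of the small-sample comparison step might look like the following; the labels, sample data, and the 0.1 threshold are purely illustrative rather than values used in the project.

```python
# Hypothetical sketch: compare the label distribution from a new model against a
# known baseline before deciding whether to scale up testing or tune prompts.
from collections import Counter

def label_distribution(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def total_variation_distance(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

baseline = label_distribution(["expert", "novice", "novice", "intermediate"])
candidate = label_distribution(["expert", "novice", "intermediate", "intermediate"])

if total_variation_distance(baseline, candidate) > 0.1:   # threshold is illustrative
    print("Distribution shift detected: tune prompts and re-run the comparison")
else:
    print("Distributions align closely: scale up testing")
```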

Dynamic concurrency scaling for LLM calls

Challenge: LLM endpoints frequently encounter rate limits and inconsistent response times under heavy usage. The models’ speeds can also vary, complicating the selection of optimal concurrency levels. Furthermore, users may choose suboptimal settings due to lack of familiarity, and default concurrency configurations are rarely ideal for every situation. Dynamic adjustments based on throughput, measured in various ways, can assist in determining optimal concurrency levels.

Solution: We implemented a dynamic concurrency control mechanism that proactively adjusts the number of parallel LLM calls based on real-time system behavior:

  • External task awareness: The system monitors the number of parallel tasks running across the pipeline (e.g., Spark executors or async workers) and uses this to inform the initial concurrency level.
  • Success/failure rate monitoring: The system tracks the rolling success and failure rates of LLM calls. A spike in failures triggers a temporary reduction in concurrency, while sustained success allows for gradual ramp-up.
  • Latency-based feedback loop: Instead of waiting for rate-limit errors, the system measures the response time of LLM calls. If latency increases, it reduces concurrency; if latency decreases and success rates remain high, it cautiously scales up (see the sketch below).
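The sketch below captures the core of this feedback loop under simplified assumptions; the thresholds, window size, and adjustment rules are illustrative and not the values used in the pipeline.

```python
# Simplified sketch of dynamic concurrency control driven by rolling failure
# rate and latency. The real pipeline tracks these signals per endpoint and
# per Spark/async worker pool.
from collections import deque

class ConcurrencyController:
    def __init__(self, initial=8, minimum=1, maximum=64, window=50):
        self.limit = initial
        self.minimum, self.maximum = minimum, maximum
        self.failures = deque(maxlen=window)    # rolling success/failure history
        self.latencies = deque(maxlen=window)   # rolling response times (seconds)

    def record(self, latency_s: float, success: bool):
        self.latencies.append(latency_s)
        self.failures.append(0 if success else 1)
        failure_rate = sum(self.failures) / len(self.failures)
        avg_latency = sum(self.latencies) / len(self.latencies)
        if failure_rate > 0.10 or avg_latency > 10.0:
            self.limit = max(self.minimum, self.limit // 2)   # back off quickly
        elif failure_rate < 0.02 and avg_latency < 5.0:
            self.limit = min(self.maximum, self.limit + 1)    # ramp up cautiously
```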

Optimization experiments

To further improve throughput and efficiency, we ran a series of optimization experiments. Each approach came with trade-offs that we carefully measured.

Batch endpoints (Azure/OpenAI)

Batch endpoints are a cost-effective, moderately high-throughput way of executing LLM requests. Batch endpoints process large lists of LLM prompts over a 24-hour period, recording responses in a file. They are about 50% cheaper than non-batch endpoints and have separate token limits, enabling increased throughput when used alongside regular endpoints. However, they require at least 24 hours to complete requests and provide lower overall throughput compared to non-batch endpoints, making them unsuitable for situations needing quick results.

Conversation batching in prompts during pipeline runtime

Batching multiple conversations for classification at once can significantly increase throughput and reduce token usage, but it may impact the accuracy of results. In our experiment with a domain classifier, classifying 10 conversations simultaneously led to an average of 15-20% of domain assignments changing between repeated runs of the same prompt. To address this, one mitigation approach is to use a grader LLM prompt: first classify the batch, then have the LLM identify any incorrectly classified conversations, and finally re-classify those as needed. While batching offers efficiency gains, it is important to monitor for potential drops in classification quality.
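A rough sketch of this batching-plus-grader pattern is shown below; the llm function stands in for the pipeline’s completion call, and the prompt wording is purely illustrative.

```python
# Illustrative only: batch several conversations into one classification prompt,
# then ask a grader prompt to flag entries worth re-classifying individually.
def classify_batch(conversations, llm):
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(conversations))
    labels = llm(f"Assign a domain label to each numbered conversation:\n{numbered}")
    flagged = llm(
        "Review these conversations and their labels. "
        f"Return the indices of any labels that look wrong:\n{numbered}\n{labels}"
    )
    return labels, flagged   # flagged items would be re-classified one at a time
```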

Combining classifiers in a single prompt

Combining multiple classifiers into a single prompt increases throughput by allowing one call to the LLM instead of multiple calls. This not only multiplies the overall throughput by the number of classifiers processed but also reduces the total number of tokens used, since the conversation text is only passed in once. However, this approach may compromise classification accuracy, so results should be closely monitored.
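For example, a combined prompt might request a single JSON object carrying every label, as in the hypothetical sketch below; the field names and the llm function are placeholders, not the project’s actual schema.

```python
# Sketch of combining classifiers in one call: one prompt, one JSON response that
# carries every label, so the conversation text is only sent once.
import json

def classify_all(conversation, llm):
    prompt = (
        "Return JSON with keys 'topic', 'expertise', and 'satisfaction' "
        f"for the following conversation:\n{conversation}"
    )
    return json.loads(llm(prompt))   # parsing/cleaning logic would handle malformed JSON
```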

Classification using text embeddings

An alternative approach is to train custom neural network models for each classifier using only the text embeddings of conversations. This method delivers both cost and time savings by avoiding making multiple LLM requests for every classifier and conversation—instead, the system only needs to request conversation text embeddings once and can reuse these embeddings across all classifier models.

For example, start with a set of conversations for validating and testing the new model, run them through the original prompt-based classifier to generate a set of golden classifications, and then obtain text embeddings (using a model such as text-embedding-3-large) for each conversation. These embeddings and their corresponding classifications are used to train a model such as a multi-layer perceptron. In production, the workflow involves retrieving the text embedding for each conversation and passing it through the trained model; if there is a model for each classifier, a single embedding retrieval per conversation suffices for all classifiers.

The benefits of this approach include significantly increased throughput and cost savings—since it’s not necessary to call the LLM for every classifier and conversation. However, this setup can require GPU compute which can increase costs and infrastructure complexity, and the resulting models may not achieve the same accuracy as prompt-based classification methods.
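A minimal sketch of this setup is shown below, assuming embeddings have already been retrieved (for example, from text-embedding-3-large, which produces 3,072-dimensional vectors) and golden labels generated by the prompt-based classifier; scikit-learn and the random placeholder data are used only for illustration.

```python
# Sketch: train a small MLP on conversation embeddings labeled by the
# prompt-based classifier, then reuse one embedding per conversation at inference.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3072))                             # one embedding per conversation
y_train = rng.choice(["work", "leisure", "education"], size=1000)   # golden labels (placeholder)

clf = MLPClassifier(hidden_layer_sizes=(256, 64), max_iter=200)
clf.fit(X_train, y_train)

new_embedding = rng.normal(size=(1, 3072))   # retrieved once, reusable across all classifier models
print(clf.predict(new_embedding))
```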

Prompt compression

Compressing prompts by eliminating unnecessary tokens or by using a tool such as LLMLingua (opens in new tab) to automate prompt compression can optimize classification prompts either ahead of time or in real-time. This approach increases overall throughput and results in cost savings due to a reduced number of tokens, but there are risks: changes to the classifier prompt or conversation text may impact classification accuracy, and depending on the compression technique, it could even decrease throughput if the compression process takes longer than simply sending uncompressed text to the LLM.

Text truncation

Truncating conversations to a specific length limits the overall number of tokens sent through an endpoint, offering cost savings and increased throughput similar to prompt compression. By reducing the number of tokens per request, throughput rises because more requests can be made before reaching the endpoint’s tokens-per-minute (TPM) limit, and costs decrease due to fewer tokens being processed. However, the ideal truncation length depends on both the classifiers and the conversation content, so it’s important to assess how truncation affects output quality before implementation. While this approach brings clear efficiency benefits, it also poses a risk: long conversations may have their most important content cut off, which can reduce classification accuracy.
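A simple token-based truncation sketch follows; tiktoken is one possible tokenizer choice, and the 2,000-token budget is illustrative rather than a recommendation.

```python
# Token-based truncation sketch: keep at most `max_tokens` tokens per conversation.
import tiktoken

def truncate_conversation(text: str, max_tokens: int = 2000) -> str:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])   # note: may drop important later content
```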

Conclusion

Building a scalable, high-throughput pipeline for LLM-based classification is far from trivial. It requires navigating a constantly shifting landscape of model capabilities, prompt behaviors, and infrastructure constraints. As LLMs become faster, cheaper, and more capable, they’re unlocking new possibilities for real-time understanding of human-AI interactions at scale. The techniques we’ve shared represent a snapshot of what’s working today. But more importantly, they offer a foundation for what’s possible tomorrow.


The post Technical approach for classifying human-AI interactions at scale appeared first on Microsoft Research.

Categories: Microsoft

CollabLLM: Teaching LLMs to collaborate with users

Microsoft Research - Tue, 07/15/2025 - 20:00

Large language models (LLMs) can solve complex puzzles in seconds, yet they sometimes struggle over simple conversations. When these AI tools make assumptions, overlook key details, or neglect to ask clarifying questions, the result can erode trust and derail real-world interactions, where nuance is everything.

A key reason these models behave this way lies in how they’re trained and evaluated. Most benchmarks use isolated, single-turn prompts with clear instructions. Training methods tend to optimize for the model’s next response, not its contribution to a successful, multi-turn exchange. But real-world interaction is dynamic and collaborative. It relies on context, clarification, and shared understanding.

User-centric approach to training 

To address this, we’re exploring ways to train LLMs with users in mind. Our approach places models in simulated environments that reflect the back-and-forth nature of real conversations. Through reinforcement learning, these models improve through trial and error, for example, learning when to ask questions and how to adapt tone and communication style to different situations. This user-centric approach helps bridge the gap between how LLMs are typically trained and how people actually use them.  

This is the concept behind CollabLLM, recipient of an ICML Outstanding Paper Award (opens in new tab). This training framework helps LLMs improve through simulated multi-turn interactions, as illustrated in Figure 1. The core insight behind CollabLLM is simple: in a constructive collaboration, the value of a response isn’t just in its immediate usefulness, but in how it contributes to the overall success of the conversation. A clarifying question might seem like a delay but often leads to better outcomes. A quick answer might appear useful but can create confusion or derail the interaction.

Figure 1. Diagram comparing two training approaches for LLMs. (a) The standard method lacks user-agent collaboration and uses single-turn rewards, leading to an inefficient conversation. (b) In contrast, CollabLLM simulates multi-turn user-agent interactions during training, enabling it to learn effective collaboration strategies and produce more efficient dialogues.

CollabLLM puts this collaborative approach into practice with a simulation-based training loop, illustrated in Figure 2. At any point in a conversation, the model generates multiple possible next turns by engaging in a dialogue with a simulated user.

Figure 2: Simulation-based training process used in CollabLLM

The system uses a sampling method to extend conversations turn by turn, choosing likely responses for each participant (the AI agent or the simulated user), while adding some randomness to vary the conversational paths. The goal is to expose the model to a wide variety of conversational scenarios, helping it learn more effective collaboration strategies.


To each simulated conversation, we applied multiturn-aware reward (MR) functions, which assess how the model’s response at the given turn influences the entire trajectory of the conversation. We sampled multiple conversational follow-ups from the model, such as statements, suggestions, and questions, and used MR to assign a reward to each based on how well the conversation performed in later turns. We based these scores on automated metrics that reflect key factors like goal completion, conversational efficiency, and user engagement.

To score the sampled conversations, we used task-specific metrics and metrics from an LLM-as-a-judge framework, which supports efficient and scalable evaluation. For metrics like engagement, a judge model rates each sampled conversation on a scale from 0 to 1.

The MR of each model response was computed by averaging the scores of the sampled conversations that originate from that response. Based on this score, the model updates its parameters using established reinforcement learning algorithms such as Proximal Policy Optimization (PPO) or Direct Preference Optimization (DPO).
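Schematically, the MR computation can be summarized as in the sketch below; all functions here are placeholders used to convey the idea, not the released CollabLLM implementation.

```python
# Schematic sketch of a multiturn-aware reward (MR): roll out several simulated
# continuations after a candidate response, score each resulting conversation
# (e.g., goal completion, efficiency, engagement via an LLM judge), and average.
def multiturn_aware_reward(history, response, simulate_continuation, score, k=4):
    rollouts = [simulate_continuation(history + [response]) for _ in range(k)]
    return sum(score(rollout) for rollout in rollouts) / k
```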

We tested CollabLLM through a combination of automated and human evaluations, detailed in the paper. One highlight is a user study involving 201 participants in a document co-creation task, shown in Figure 3. We compared CollabLLM to a baseline trained with single-turn rewards and to a second, more proactive baseline prompted to ask clarifying questions and take other proactive steps. CollabLLM outperformed both, producing higher-quality documents, better interaction ratings, and faster task completion times.

Figure 3: Results of the user study in a document co-creation task comparing CollabLLM to a baseline trained with single-turn rewards.

Designing for real-world collaboration

Much of today’s AI research focuses on fully automated tasks, models working without input from or interaction with users. But many real-world applications depend on people in the loop: as users, collaborators, or decision-makers. Designing AI systems that treat user input not as a constraint, but as essential, leads to systems that are more accurate, more helpful, and ultimately more trustworthy.

This work is driven by a core belief: the future of AI depends not just on intelligence, but on the ability to collaborate effectively. And that means confronting the communication breakdowns in today’s systems.

We see CollabLLM as a step in that direction, training models to engage in meaningful multi-turn interactions, ask clarifying questions, and adapt to context. In doing so, we can build systems designed to work with people—not around them.


The post CollabLLM: Teaching LLMs to collaborate with users appeared first on Microsoft Research.

Categories: Microsoft

PadChest-GR: A bilingual grounded radiology reporting benchmark for chest X-rays

Microsoft Research - Thu, 06/26/2025 - 18:08

In our ever-evolving journey to enhance healthcare through technology, we’re announcing a unique new benchmark for grounded radiology report generation—PadChest-GR (opens in new tab). The world’s first multimodal, bilingual sentence-level radiology report dataset, developed by the University of Alicante with Microsoft Research, University Hospital Sant Joan d’Alacant and MedBravo, is set to redefine how AI and radiologists interpret radiological images. Our work demonstrates how collaboration between humans and AI can create powerful feedback loops—where new datasets drive better AI models, and those models, in turn, inspire richer datasets. We’re excited to share this progress in NEJM AI, highlighting both the clinical relevance and research excellence of this initiative. 

A new frontier in radiology report generation 

It is estimated that over half of people visiting hospitals have radiology scans that must be interpreted by a clinical professional. Traditional radiology reports often condense multiple findings into unstructured narratives. In contrast, grounded radiology reporting demands that each finding be described and localized individually.

This can mitigate the risk of AI fabrications and enable new interactive capabilities that enhance clinical and patient interpretability. PadChest-GR is the first bilingual dataset to address this need with 4,555 chest X-ray studies complete with Spanish and English sentence-level descriptions and precise spatial (bounding box) annotations for both positive and negative findings. It is the first public benchmark that enables us to evaluate generation of fully grounded radiology reports in chest X-rays. 

Figure 1. Example of a grounded report from PadChest-GR. The original free-text report in Spanish was ”Motivo de consulta: Preoperatorio. Rx PA tórax: Impresión diagnóstica: Ateromatosis aórtica calcificada. Engrosamiento pleural biapical. Atelectasia laminar basal izquierda. Elongación aórtica. Sin otros hallazgos radiológicos significativos.” (“Reason for consultation: Preoperative. PA chest X-ray: Diagnostic impression: Calcified aortic atheromatosis. Biapical pleural thickening. Left basal laminar atelectasis. Aortic elongation. No other significant radiological findings.”)

This benchmark isn’t standing alone—it plays a critical role in powering our state-of-the-art multimodal report generation model, MAIRA-2. Leveraging the detailed annotations of PadChest-GR, MAIRA-2 represents our commitment to building more interpretable and clinically useful AI systems. You can explore our work on MAIRA-2 on our project web page, including recent user research conducted with clinicians in healthcare settings.

PadChest-GR is a testament to the power of collaboration. Aurelia Bustos at MedBravo and Antonio Pertusa at the University of Alicante published the original PadChest dataset (opens in new tab) in 2020, with the help of Jose María Salinas from Hospital San Juan de Alicante and María de la Iglesia Vayá from the Center of Excellence in Biomedical Imaging at the Ministry of Health in Valencia, Spain. We started to look at PadChest and were deeply impressed by the scale, depth, and diversity of the data.

As we worked more closely with the dataset, we realized the opportunity to develop this for grounded radiology reporting research and worked with the team at the University of Alicante to determine how to approach this together. Our complementary expertise was a nice fit. At Microsoft Research, our mission is to push the boundaries of medical AI through innovative, data-driven solutions. The University of Alicante, with its deep clinical expertise, provided critical insights that greatly enriched the dataset’s relevance and utility. The result of this collaboration is the PadChest-GR dataset.

A significant enabler of our annotation process was Centaur Labs. The team of senior and junior radiologists from the University Hospital Sant Joan d’Alacant, coordinated by Joaquin Galant, used this HIPAA-compliant labeling platform to perform rigorous study-level quality control and bounding box annotations. The annotation protocol implemented ensured that each annotation was accurate and consistent, forming the backbone of a dataset designed for the next generation of grounded radiology report generation models. 

Accelerating PadChest-GR dataset annotation with AI 

Our approach integrates advanced large language models with comprehensive manual annotation: 

Data Selection & Processing: Leveraging Microsoft Azure OpenAI Service (opens in new tab) with GPT-4, we extracted sentences describing individual positive and negative findings from raw radiology reports, translated them from Spanish to English, and linked each sentence to the existing expert labels from PadChest. This was done for a selected subset of the full PadChest dataset, carefully curated to reflect a realistic distribution of clinically relevant findings. 

Manual Quality Control & Annotation: The processed studies underwent meticulous quality checks on the Centaur Labs platform by radiologists from Hospital San Juan de Alicante. Each positive finding was then annotated with bounding boxes to capture critical spatial information. 

Standardization & Integration: All annotations were harmonized into coherent grounded reports, preserving the structure and context of the original findings while enhancing interpretability. 

Figure 2. Overview of the data curation pipeline.

Impact and future directions 

PadChest-GR not only sets a new benchmark for grounded radiology reporting, but also serves as the foundation for our MAIRA-2 model, which already showcases the potential of highly interpretable AI in clinical settings. While we developed PadChest-GR to help train and validate our own models, we believe the research community will greatly benefit from this dataset for many years to come. We look forward to seeing the broader research community build on this—improving grounded reporting AI models and using PadChest-GR as a standard for evaluation. We believe that by fostering open collaboration and sharing our resources, we can accelerate progress in medical imaging AI and ultimately improve patient care together with the community.

The collaboration between Microsoft Research and the University of Alicante highlights the transformative power of working together across disciplines. With our publication in NEJM-AI and the integral role of PadChest-GR in the development of MAIRA-2 (opens in new tab) and RadFact (opens in new tab), we are excited about the future of AI-empowered radiology. We invite researchers and industry experts to explore PadChest-GR and MAIRA-2, contribute innovative ideas, and join us in advancing the field of grounded radiology reporting. 


For further details or to download PadChest-GR, please visit the BIMCV PadChest-GR Project (opens in new tab).



The post PadChest-GR: A bilingual grounded radiology reporting benchmark for chest X-rays appeared first on Microsoft Research.

Categories: Microsoft

Learning from other domains to advance AI evaluation and testing

Microsoft Research - Mon, 06/23/2025 - 18:35

As generative AI becomes more capable and widely deployed, familiar questions from the governance of other transformative technologies have resurfaced. Which opportunities, capabilities, risks, and impacts should be evaluated? Who should conduct evaluations, and at what stages of the technology lifecycle? What tests or measurements should be used? And how can we know if the results are reliable?  

Recent research and reports from Microsoft (opens in new tab), the UK AI Security Institute (opens in new tab), The New York Times (opens in new tab), and MIT Technology Review (opens in new tab) have highlighted gaps in how we evaluate AI models and systems. These gaps also form foundational context for recent international expert consensus reports: the inaugural International AI Safety Report (opens in new tab) (2025) and the Singapore Consensus (opens in new tab) (2025). Closing these gaps at a pace that matches AI innovation will lead to more reliable evaluations that can help guide deployment decisions, inform policy, and deepen trust. 

Today, we’re launching a limited-series podcast, AI Testing and Evaluation: Learnings from Science and Industry, to share insights from domains that have grappled with testing and measurement questions. Across four episodes, host Kathleen Sullivan speaks with academic experts in genome editing, cybersecurity, pharmaceuticals, and medical devices to find out which technical and regulatory steps have helped to close evaluation gaps and earn public trust.

We’re also sharing written case studies from experts, along with top-level lessons we’re applying to AI. At the close of the podcast series, we’ll offer Microsoft’s deeper reflections on next steps toward more reliable and trustworthy approaches to AI evaluation. 

Lessons from eight case studies 

Our research on risk evaluation, testing, and assurance models in other domains began in December 2024, when Microsoft’s Office of Responsible AI (opens in new tab) gathered independent experts from the fields of civil aviation, cybersecurity, financial services, genome editing, medical devices, nanoscience, nuclear energy, and pharmaceuticals. In bringing this group together, we drew on our own learnings and feedback received on our e-book, Global Governance: Goals and Lessons for AI (opens in new tab), in which we studied the higher-level goals and institutional approaches that had been leveraged for cross-border governance in the past. 

While approaches to risk evaluation and testing vary significantly across the case studies, there was one consistent, top-level takeaway: evaluation frameworks always reflect trade-offs among different policy objectives, such as safety, efficiency, and innovation.  

Experts across all eight fields noted that policymakers have had to weigh trade-offs in designing evaluation frameworks. These frameworks must account for both the limits of current science and the need for agility in the face of uncertainty. They likewise agreed that early design choices, often reflecting the “DNA” of the historical moment in which they’re made, as cybersecurity expert Stewart Baker described it, are important as they are difficult to scale down or undo later. 

Strict, pre-deployment testing regimes—such as those used in civil aviation, medical devices, nuclear energy, and pharmaceuticals—offer strong safety assurances but can be resource-intensive and slow to adapt. These regimes often emerged in response to well-documented failures and are backed by decades of regulatory infrastructure and detailed technical standards.  

In contrast, fields marked by dynamic and complex interdependencies between the tested system and its external environment—such as cybersecurity and bank stress testing—rely on more adaptive governance frameworks, where testing may be used to generate actionable insights about risk rather than primarily serve as a trigger for regulatory enforcement.  

Moreover, in pharmaceuticals, where interdependencies are at play and there is emphasis on pre-deployment testing, experts highlighted a potential trade-off with post-market monitoring of downstream risks and efficacy evaluation. 

These variations in approaches across domains—stemming from differences in risk profiles, types of technologies, maturity of the evaluation science, placement of expertise in the assessor ecosystem, and context in which technologies are deployed, among other factors—also inform takeaways for AI.

Applying risk evaluation and governance lessons to AI 

While no analogy perfectly fits the AI context, the genome editing and nanoscience cases offer interesting insights for general-purpose technologies like AI, where risks vary widely depending on how the technology is applied.  

Experts highlighted the benefits of governance frameworks that are more flexible and tailored to specific use cases and application contexts. In these fields, it is challenging to define risk thresholds and design evaluation frameworks in the abstract. Risks become more visible and assessable once the technology is applied to a particular use case and context-specific variables are known.  

These and other insights also helped us distill qualities essential to ensuring that testing is a reliable governance tool across domains, including: 

  1. Rigor in defining what is being examined and why it matters. This requires detailed specification of what is being measured and understanding how the deployment context may affect outcomes.
  2. Standardization of how tests should be conducted to achieve valid, reliable results. This requires establishing technical standards that provide methodological guidance and ensure quality and consistency. 
  3. Interpretability of test results and how they inform risk decisions. This requires establishing expectations for evidence and improving literacy in how to understand, contextualize, and use test results—while remaining aware of their limitations. 
Toward stronger foundations for AI testing 

Establishing robust foundations for AI evaluation and testing requires effort to improve rigor, standardization, and interpretability—and to ensure that methods keep pace with rapid technological progress and evolving scientific understanding.  

Taking lessons from other general-purpose technologies, this foundational work must also be pursued for both AI models and systems. While testing models will continue to be important, reliable evaluation tools that provide assurance for system performance will enable broad adoption of AI, including in high-risk scenarios. A strong feedback loop on evaluations of AI models and systems could not only accelerate progress on methodological challenges but also bring focus to which opportunities, capabilities, risks, and impacts are most appropriate and efficient to evaluate at what points along the AI development and deployment lifecycle.

Acknowledgements 

We would like to thank the following external experts who have contributed to our research program on lessons for AI testing and evaluation: Mateo Aboy, Paul Alp, Gerónimo Poletto Antonacci, Stewart Baker, Daniel Benamouzig, Pablo Cantero, Daniel Carpenter, Alta Charo, Jennifer Dionne, Andy Greenfield, Kathryn Judge, Ciaran Martin, and Timo Minssen.  

Case studies 

Civil aviation: Testing in Aircraft Design and Manufacturing, by Paul Alp 

Cybersecurity: Cybersecurity Standards and Testing—Lessons for AI Safety and Security, by Stewart Baker 

Financial services (bank stress testing): The Evolving Use of Bank Stress Tests, by Kathryn Judge 

Genome editing: Governance of Genome Editing in Human Therapeutics and Agricultural Applications, by Alta Charo and Andy Greenfield 

Medical devices: Medical Device Testing: Regulatory Requirements, Evolution and Lessons for AI Governance, by Mateo Aboy and Timo Minssen 

Nanoscience: The regulatory landscape of nanoscience and nanotechnology, and applications to future AI regulation, by Jennifer Dionne 

Nuclear energy: Testing in the Nuclear Industry, by Pablo Cantero and Gerónimo Poletto Antonacci 

Pharmaceuticals: The History and Evolution of Testing in Pharmaceutical Regulation, by Daniel Benamouzig and Daniel Carpenter


The post Learning from other domains to advance AI evaluation and testing appeared first on Microsoft Research.

Categories: Microsoft

Breaking bonds, breaking ground: Advancing the accuracy of computational chemistry with deep learning

Microsoft Research - Wed, 06/18/2025 - 12:01

We are excited to share our first big milestone in solving a grand challenge that has hampered the predictive power of computational chemistry, biochemistry, and materials science for decades. By using a scalable deep-learning approach and generating an unprecedented quantity of diverse, highly accurate data, we have achieved a breakthrough in the accuracy of density functional theory (DFT), the workhorse method that thousands of scientists use every year to simulate matter at the atomistic level. Within the region of chemical space represented in our large training dataset, our model reaches the accuracy required to reliably predict experimental outcomes, as assessed on the well-known benchmark dataset W4-17 (opens in new tab). This removes a fundamental barrier to shifting the balance of molecule and material design from being driven by laboratory experiments to being driven by computational simulations. The implications for accelerating scientific discovery are far reaching, spanning applications from drugs to batteries and green fertilizers.

What is DFT?

Molecules and materials are made of atoms, which are held together by their electrons. These electrons act as a glue, determining the stability and properties of the chemical structure. Accurately computing the strength and properties of the electron glue is essential for predicting whether a chemical reaction will proceed, whether a candidate drug molecule will bind to its target protein, whether a material is suitable for carbon capture, or if a flow battery can be optimized for renewable energy storage. Unfortunately, a brute-force approach amounts to solving the many-electron Schrödinger equation, which requires computation that scales exponentially with the number of electrons. Considering that an atom has dozens of electrons, and that molecules and materials have large numbers of atoms, we could easily end up waiting the age of the universe to complete our computation unless we restrict our attention to small systems with only a few atoms.

DFT, introduced by Walter Kohn and collaborators in 1964-1965, was a true scientific breakthrough, earning Kohn the Nobel Prize in Chemistry in 1998. DFT provides an extraordinary reduction in the computational cost of calculating the electron glue in an exact manner, from exponential to cubic, making it possible to perform calculations of practical value within seconds to hours.

Figure: DFT timeline.

What is the grand challenge in DFT? 

But there is a catch: the exact reformulation has a small but crucial term—the exchange-correlation (XC) functional—which Kohn proved is universal (i.e., the same for all molecules and materials), but for which no explicit expression is known. For 60 years, people have designed practical approximations for the XC functional. The magazine Science dubbed the gold rush to design better XC models the “pursuit of the Divine Functional (opens in new tab)”. With time, these approximations have grown into a zoo of hundreds of different XC functionals from which users must choose, often using experimental data as a guide. Owing to the uniquely favorable computational cost of DFT, existing functionals have enabled scientists to gain extremely useful insight into a huge variety of chemical problems. However, the limited accuracy and scope of current XC functionals mean that DFT is still mostly used to interpret experimental results rather than predict them.
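As a reminder of where this term enters, the standard Kohn-Sham decomposition writes the total electronic energy as

```latex
E[\rho] \;=\; T_s[\rho]
        \;+\; \int v_{\mathrm{ext}}(\mathbf{r})\,\rho(\mathbf{r})\,\mathrm{d}\mathbf{r}
        \;+\; E_{\mathrm{H}}[\rho]
        \;+\; E_{\mathrm{xc}}[\rho],
```

where T_s[ρ] is the kinetic energy of the non-interacting reference system, the second term is the interaction with the external (nuclear) potential, E_H[ρ] is the classical Hartree (Coulomb) energy, and E_xc[ρ] is the exchange-correlation functional that collects all remaining many-body effects—the only term that must be approximated.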

Why is it important to increase the accuracy of DFT? 

We can contrast the present state of computational chemistry with the state of aircraft engineering and design. Thanks to predictive simulations, aeronautical engineers no longer need to build and test thousands of prototypes to identify one viable design. However, this is exactly what we currently must do in molecular and materials sciences. We send thousands of potential candidates to the lab, because the accuracy of the computational methods is not sufficient to predict the experiments. To make a significant shift in the balance from laboratory to in silico experiments, we need to remove the fundamental bottleneck of the insufficient accuracy of present XC functionals. This amounts to bringing the error of DFT calculations with respect to experiments within chemical accuracy, which is around 1 kcal/mol for most chemical processes. Present approximations typically have errors that are 3 to 30 times larger.

How can AI make a difference? 

AI can transform how we model molecules and materials with DFT by learning the XC functional directly from highly accurate data. The goal is to learn how the XC functional captures the complex relationship between its input, the electron density, and its output, the XC energy. You can think of the density like a glue, with regions of space where there is a lot of it and other regions with less of it. Traditionally, researchers have built XC functional approximations using the concept of the so-called Jacob’s ladder: a hierarchy of increasingly complex, hand-designed descriptors of the electron density. Including density descriptors from higher rungs of this ladder aims to improve accuracy, but it comes at the price of increased computational cost. Even the few attempts that use machine learning have stayed within this traditional paradigm, thereby taking an approach that is akin to what people were doing in computer vision and speech recognition before the deep-learning era. Progress toward better accuracy has stagnated for at least two decades with this approach. 

Our project is driven by the intuition that a true deep learning approach—where relevant representations of the electron density are learned directly from data in a computationally scalable way—has the potential to revolutionize the accuracy of DFT, much like deep learning has transformed other fields. A significant challenge with going down this path, however, is that feature or representation learning is very data-hungry, and there is very little data around—too little to test this hypothesis reliably.

What have we done in this milestone?

The first step was generating data—a lot of it. This posed a major challenge, since the data must come from accurate solutions of the many-electron Schrödinger equation, which is precisely the prohibitively expensive problem that DFT is designed to replace. Fortunately, decades of progress in the scientific community have led to smarter, more efficient variants of brute-force methods, making it possible to compute reference data for small molecules at experimental accuracy. While these high-accuracy methods, also referred to as wavefunction methods, are far too costly for routine use in applications, we made a deliberate investment in them for this project. The reason? The upfront cost of generating high-quality training data is offset by the long-term benefit of enabling vast numbers of industrially relevant applications with cost effective DFT using the trained XC functional. Crucially, we rely on the ability of DFT—and our learned XC functional—to generalize from high-accuracy data for small systems to larger, more complex molecules. 

There are many different high-accuracy wavefunction methods, each tailored to different regions of chemical space. However, their use at scale is not well established, as they require extensive expertise—small methodological choices can significantly affect accuracy at the level that we target. We therefore joined forces with Prof. Amir Karton (opens in new tab) from the University of New England, Australia, a world-leading expert who developed widely recognized benchmark datasets for a fundamental thermochemical property: atomization energy—the energy required to break all bonds in a molecule and separate it into individual atoms. To create a training dataset of atomization energies at unprecedented scale, our team at Microsoft built a scalable pipeline to produce highly diverse molecular structures. Using these structures and substantial Azure compute resources via Microsoft’s Accelerating Foundation Models Research program (opens in new tab), Prof. Karton applied a high-accuracy wavefunction method to compute the corresponding energy labels. The result is a dataset (opens in new tab) two orders of magnitude larger than previous efforts. We are releasing a large part of this dataset (opens in new tab) to the scientific community.

Data generation was only half of the challenge. We also needed to design a dedicated deep-learning architecture for the XC functional—one that is both computationally scalable and capable of learning meaningful representations from electron densities to accurately predict the XC energy. Our team of machine learning specialists, assisted by DFT experts, introduced a series of innovations that solve these and other challenges inherent to this complex learning problem. The result is Skala, an XC functional that generalizes to unseen molecules, reaching the accuracy needed to predict experiments. This demonstrates for the first time that deep learning can truly disrupt DFT: reaching experimental accuracy does not require the computationally expensive hand-designed features of Jacob’s ladder. Instead, we can retain the original computational complexity of DFT while allowing the XC functional to learn how to extract meaningful features and predict accurate energies.

We compare the accuracy of Skala against the best existing functionals of varying computational cost. The prediction errors are evaluated on two well-known public benchmark datasets: the W4-17 dataset for atomization energies (y axis, mean absolute error) and the GMTKN55 dataset for general main-group chemistry (x axis, weighted total mean absolute deviation, or WTMAD-2 for short). Skala achieves near “chemical accuracy” (1 kcal/mol) on atomization energies. This is the accuracy required for predictive modeling of laboratory experiments, which, to date, no existing functional has reached. Skala works especially well on the “single reference” subset of this dataset, reaching a groundbreaking 0.85 kcal/mol. On the GMTKN55 dataset, Skala shows competitive accuracy to the best-performing hybrid functionals, at a lower cost.

“Skala is a new density functional for the exchange-correlation energy that employs meta-GGA ingredients plus D3 dispersion and machine-learned nonlocal features of the electron density. Some exact constraints were imposed, and some others “emerge” from the fitting to about 150,000 accurate energy differences for sp molecules and atoms. Skala achieves high, hybrid-like accuracy on a large and diverse data set of properties of main group molecules, which has no overlap with its training set. The computational cost of Skala is higher than that of the r2SCAN meta-GGA for small molecules, but about the same for systems with 1,000 or more occupied orbitals. Its cost seems to be only 10% of the cost of standard hybrids and 1% of the cost of local hybrids. Developed by a Microsoft team of density functional theorists and deep-learning experts, Skala could be the first machine-learned density functional to compete with existing functionals for wide use in computational chemistry, and a sign of things to come in that and related fields. Skala learned from big data and was taught by insightful human scientists.”

— John P. Perdew, Professor of Physics, School of Science and Engineering, Tulane University

This first milestone was achieved for a challenging property in a specific region of chemical space—atomization energies of main group molecules—for which we generated our initial large batch of high-accuracy training data. Building on this foundation, we have started to expand our training dataset to cover a broader range of general chemistry, using our scalable in-house data generation pipeline. With the first small batch of training data beyond atomization energies, we have already extended the accuracy of our model, making it competitive with the best existing XC functionals across a wider spectrum of main group chemistry. This motivates us to continue growing our high-accuracy data generation campaign, engaging with external experts such as Prof. Amir Karton, who noted, “After years of benchmarking DFT methods against experimental accuracy, this is the first time I’ve witnessed such an unprecedented leap in the accuracy–cost trade-off. It is genuinely exciting to see how the creation of our new dataset has enabled these groundbreaking results — opening up a path for transformative advances across chemical, biochemical, and materials research.”

Advancing computational chemistry together

We are excited to work closely with the global computational chemistry community to accelerate progress for all and look forward to openly releasing our first XC functional in the near future. 

“Density Functional Theory (DFT) and related technologies are a core Digital Chemistry technology supporting advancements in Merck’s diverse Life Science, Healthcare and Electronics businesses. However, the limitations of traditional DFT methods, which have persisted for the last 50 years, have hindered its full potential. Microsoft Research’s innovative approach to integrating deep learning represents a substantial leap, enhancing its accuracy, robustness, and scalability. We are looking forward to exploring how this can advance Digital Chemistry workflows and unlock new possibilities for the future, aligning with our commitment to developing advanced algorithms and technologies that propel scientific innovation at Merck.”

— Jan Gerit Brandenburg, Director for Digital Chemistry at Merck 

“We are entering a golden age for predictive and realistic simulations: very accurate electronic-structure calculations provide vast amounts of consistent data that can be used to train novel machine-learning architectures, delivering the holy grail of precision and computational efficiency.”

— Professor Nicola Marzari, Chair of Theory and Simulation of Materials, EPFL and PSI

We believe that our new functional can help unlock new opportunities for businesses and are eager to work together on real-world applications. Today, we are delighted to launch the DFT Research Early Access Program (DFT REAP) and welcome Flagship Pioneering as the first participant. This program is for companies and research labs to collaborate with us to accelerate innovation across many industries. To find out more about how to join this program please visit: https://aka.ms/DFT-REAP (opens in new tab) 

“Microsoft’s effort to enhance the predictive power of computational chemistry reflects a bold but thoughtful step toward a simulation-first future. At Flagship, we believe that openly shared, foundational advances in science – like this leap forward in DFT accuracy – can serve as powerful enablers of innovation. These next-generation tools promise to accelerate discovery across a wide range of sectors, from therapeutics to materials science, by helping researchers navigate chemical and biological space with far greater precision and speed.”

— Junaid Bajwa, M.D., Senior Partner at Flagship Pioneering and Science Partner at Pioneering Intelligence

By making our work available to the scientific community, we hope to enable widespread testing and gather valuable feedback that will guide future improvements. For the first time, deep learning offers a clear and computationally scalable path to building an accurate, efficient, and broadly applicable model of the universal XC functional—one that could transform the computational design of molecules and materials.

Acknowledgement

This work is the product of a highly collaborative and interdisciplinary effort led by Microsoft Research AI for Science, in partnership with colleagues from Microsoft Research Accelerator, Microsoft Quantum and the University of New England. The full author list includes Giulia Luise, Chin-Wei Huang, Thijs Vogels, Derk P. Kooi, Sebastian Ehlert, Stephanie Lanius, Klaas J. H. Giesbertz, Amir Karton, Deniz Gunceler, Megan Stanley, Wessel P. Bruinsma, Victor Garcia Satorras, Marwin Segler, Kenji Takeda, Lin Huang, Xinran Wei, José Garrido Torres, Albert Katbashev, Rodrigo Chavez Zavaleta, Bálint Máté, Sékou-Oumar Kaba, Roberto Sordillo, Yingrong Chen, David B. Williams-Young, Christopher M. Bishop, Jan Hermann, Rianne van den Berg and Paola Gori Giorgi


The post Breaking bonds, breaking ground: Advancing the accuracy of computational chemistry with deep learning appeared first on Microsoft Research.

Categories: Microsoft

New methods boost reasoning in small and large language models

Microsoft Research - Tue, 06/17/2025 - 18:00

Artificial intelligence is advancing across a wide range of fields, with one of the most important developments being its growing capacity for reasoning. This capability could help AI become a reliable partner in critical domains like scientific research and healthcare.

To support this progress, we’ve identified three primary strategies to strengthen reasoning capabilities in both small and large language models: improve architectural design to boost performance in smaller models; incorporate mathematical reasoning techniques to increase reliability; and build stronger generalization capabilities to enable reasoning across a variety of fields.

Smarter reasoning in smaller models

While language models trained on broad world knowledge hold great potential, they lack the ability to learn continuously and refine their understanding. This limitation becomes especially pronounced in smaller models, where limited capacity makes strong reasoning even harder.

The problem stems from how current language models operate. They rely on fast, pattern recognition-based responses that break down in complex scenarios. In contrast, people use deliberate, step-by-step reasoning, test different approaches, and evaluate outcomes. To address this gap, we’re building methods to enable stronger reasoning in smaller systems.

rStar-Math is a method that uses Monte Carlo Tree Search (MCTS) to simulate deeper, more methodical reasoning in smaller models. It uses a three-step, self-improving cycle: 

  • Problem decomposition breaks down complex mathematical problems into manageable steps, creating a thorough and accurate course of reasoning.
  • Process preference model (PPM) trains small models to predict reward labels for each step, improving process-level supervision.
  • Iterative refinement applies a four-round, self-improvement cycle in which updated strategy models and PPMs guide MCTS to improve performance. 

When tested on four small language models ranging from 1.5 billion to 7 billion parameters, rStar-Math achieved an average accuracy of 53% on the American Invitational Mathematics Examination (AIME)—performance that places it among the top 20% of high school competitors in the US.

Figure 1. The rStar-Math framework

Logic-RL is a reinforcement learning framework that strengthens logical reasoning through a practical system prompt and a structured reward function. By training models on logic puzzles, Logic-RL grants rewards only when both the reasoning process and the final answer meet strict formatting requirements. This prevents shortcuts and promotes analytical rigor.
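A hedged sketch of such a structured reward is shown below; the tags, scores, and checks are illustrative stand-ins rather than the exact Logic-RL recipe.

```python
# Sketch of a structured reward: credit is granted only when the response follows
# the required reasoning/answer format AND the final answer matches the gold label.
import re

def structured_reward(response: str, gold_answer: str) -> float:
    match = re.search(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", response, re.S)
    if not match:
        return -1.0                       # malformed output: no shortcut credit
    answer = match.group(2).strip()
    return 1.0 if answer == gold_answer else -0.5
```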

Language models trained with Logic-RL demonstrate strong performance beyond logic puzzles, generalizing effectively to mathematical competition problems. On the AIME and AMC (American Mathematics Competitions) datasets, 7-billion-parameter models improved accuracy by 125% and 38%, respectively, compared with baseline models.

Building reliable mathematical reasoning 

Mathematics poses a unique challenge for language models, which often struggle to meet its precision and rigor using natural language. To address this, we’re creating formal and symbolic methods to enable language models to adopt structured mathematical tools. The goal is to convert language model outputs into code based on the fundamental rules of arithmetic, like 1 + 1 = 2, allowing us to systematically verify accuracy. 

LIPS (LLM-based Inequality Prover with Symbolic Reasoning) is a system that combines LLMs’ pattern recognition capabilities with symbolic reasoning. LIPS draws on the strategies participants in math competitions use in order to distinguish between tasks best suited to symbolic solvers (e.g., scaling) and those better handled by language models (e.g., rewriting). On 161 Olympiad-level problems, LIPS achieved state-of-the-art results without additional training data.

Figure 2. An overview of LIPS

However, translating natural-language math problems into precise, machine-readable formats is a challenge. Our goal is to bridge the gap between the one-pass success rate, where the top-ranked generated result is correct, and the k-pass success rate, where at least one of the top k generated results is correct.

We developed a new framework using two evaluation methods. Symbolic equivalence checks whether outputs are logically identical, while semantic consistency uses embedding similarity to detect subtle differences missed by symbolic checks.
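Under simplifying assumptions (the formalized outputs parse with sympy, and embeddings come from some embedding model, represented here by precomputed vectors), the two checks might be sketched as follows.

```python
# Sketch of the two evaluation checks: symbolic equivalence via sympy, and
# semantic consistency via cosine similarity of embeddings.
import numpy as np
import sympy

def symbolically_equivalent(expr_a: str, expr_b: str) -> bool:
    a, b = sympy.sympify(expr_a), sympy.sympify(expr_b)
    return sympy.simplify(a - b) == 0

def semantically_consistent(vec_a, vec_b, threshold=0.95):
    cos = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
    return cos >= threshold               # threshold is illustrative

print(symbolically_equivalent("2*(x + 1)", "2*x + 2"))   # True
```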

When we evaluated this approach on the MATH and miniF2F datasets, which include problems from various math competitions, it improved accuracy by up to 1.35 times over baseline methods.

Figure 3. An overview of the auto-formalization framework

To address the shortage of high-quality training data, we developed a neuro-symbolic framework that automatically generates diverse, well-structured math problems. Symbolic solvers create the problems, while language models translate them into natural language. This approach not only broadens training resources but also supports more effective instruction and evaluation of mathematical reasoning in language models.

Figure 4. An overview of the neuro-symbolic data generation framework

Boosting generalization across domains 

A key indicator of advanced AI is its ability to generalize—the ability to transfer reasoning skills across different domains. We found that training language models on math data significantly improved performance in coding, science, and other areas, revealing unexpected cross-domain benefits. 

This discovery motivated us to develop Chain-of-Reasoning (CoR), an approach that unifies reasoning across natural language, code, and symbolic forms. CoR lets models blend these formats using natural language to frame context, code for precise calculations, and symbolic representations for abstraction. By adjusting prompts, CoR adapts both reasoning depth and paradigm diversity to match specific problem requirements. 

Tests of CoR across five math datasets showed its ability to tackle both computational and proof-based problems, demonstrating strong general mathematical problem-solving skills.

Figure 5. CoR’s reasoning process under different types of methods

Current language models often rely on domain-specific solutions, limiting their flexibility across different types of problems. To move beyond this constraint, we developed Critical Plan Step Learning (CPL), an approach focused on high-level abstract planning that teaches models to identify key knowledge, break down problems, and make strategic decisions. 

The technique draws on how people solve problems, by breaking them down, identifying key information, and recalling relevant knowledge—strategies we want language models to learn. 

CPL combines two key components: plan-based MCTS, which searches multi-step solution paths and constructs planning trees, and step-APO, which learns preferences for strong intermediate steps while filtering out weak ones. This combination enhances reasoning and improves generalization across tasks, moving AI systems closer to the flexible thinking that characterizes human intelligence.

Figure 6. Overview of the CPL framework

Looking ahead: Next steps in AI reasoning

From building reliable math solvers to unifying reasoning approaches, researchers are redefining how language models approach complex tasks. Their work sets the stage for more capable and versatile AI systems—applicable to education, science, healthcare, and beyond. Despite these advances, hallucinations and imprecise logic continue to pose risks in critical fields like medicine and scientific research, where accuracy is essential.

These challenges are driving the team’s exploration of additional tools and frameworks to improve language model reasoning. This includes AutoVerus for automated proof generation in Rust code, SAFE for addressing data scarcity in Rust formal verification, and Alchemy, which uses symbolic mutation to improve neural theorem proving.

Together, these technologies represent important progress toward building trustworthy, high-performing reasoning models and signal a broader shift toward addressing some of AI’s current limitations.


The post New methods boost reasoning in small and large language models appeared first on Microsoft Research.

Categories: Microsoft

Rewriting SymCrypt in Rust to modernize Microsoft’s cryptographic library 

Microsoft Research - Tue, 06/10/2025 - 18:00

Outdated coding practices and memory-unsafe languages like C are putting software, including cryptographic libraries, at risk. Fortunately, memory-safe languages like Rust, along with formal verification tools, are now mature enough to be used at scale, helping prevent issues like crashes, data corruption, flawed implementation, and side-channel attacks.

To address these vulnerabilities and improve memory safety, we’re rewriting SymCrypt (opens in new tab)—Microsoft’s open-source cryptographic library—in Rust. We’re also incorporating formal verification methods. SymCrypt is used in Windows, Azure Linux, Xbox, and other platforms.

Currently, SymCrypt is primarily written in cross-platform C, with limited use of hardware-specific optimizations through intrinsics (compiler-provided low-level functions) and assembly language (direct processor instructions). It provides a wide range of algorithms, including AES-GCM, SHA, ECDSA, and the more recent post-quantum algorithms ML-KEM and ML-DSA. 

Formal verification will confirm that implementations behave as intended and don’t deviate from algorithm specifications, critical for preventing attacks. We’ll also analyze compiled code to detect side-channel leaks caused by timing or hardware-level behavior.

Proving Rust program properties with Aeneas

Program verification is the process of proving that a piece of code will always satisfy a given property, no matter the input. Rust’s type system profoundly improves the prospects for program verification by providing strong ownership guarantees, by construction, using a discipline known as “aliasing xor mutability”.

For example, reasoning about C code often requires proving that two non-const pointers are live and non-overlapping, a property that can depend on external client code. In contrast, Rust’s type system guarantees this property for any two mutably borrowed references.

As a result, new tools have emerged specifically for verifying Rust code. We chose Aeneas (opens in new tab) because it helps provide a clean separation between code and proofs.

Developed by Microsoft Azure Research in partnership with Inria, the French National Institute for Research in Digital Science and Technology, Aeneas connects to proof assistants like Lean (opens in new tab), allowing us to draw on a large body of mathematical proofs—especially valuable given the mathematical nature of cryptographic algorithms—and benefit from Lean’s active user community.

Compiling Rust to C supports backward compatibility  

We recognize that switching to Rust isn’t feasible for all use cases, so we’ll continue to support, extend, and certify C-based APIs as long as users need them. Users won’t see any changes, as Rust runs underneath the existing C APIs.

Some users compile our C code directly and may rely on specific toolchains or compiler features that complicate the adoption of Rust code. To address this, we will use Eurydice (opens in new tab), a Rust-to-C compiler developed by Microsoft Azure Research, to replace handwritten C code with C generated from formally verified Rust. Eurydice (opens in new tab) compiles directly from Rust’s MIR intermediate language, and the resulting C code will be checked into the SymCrypt repository alongside the original Rust source code.

As more users adopt Rust, we’ll continue supporting this compilation path for those who build SymCrypt from source code but aren’t ready to use the Rust compiler. In the long term, we hope to transition users to either use precompiled SymCrypt binaries (via C or Rust APIs), or compile from source code in Rust, at which point the Rust-to-C compilation path will no longer be needed.


Timing analysis with Revizor

Even software that has been verified for functional correctness can remain vulnerable to low-level security threats, such as side channels caused by timing leaks or speculative execution. These threats operate at the hardware level and can leak private information, such as memory load addresses, branch targets, or division operands, even when the source code is provably correct. 

To address this, we’re extending Revizor (opens in new tab), a tool developed by Microsoft Azure Research, to more effectively analyze SymCrypt binaries. Revizor models microarchitectural leakage and uses fuzzing techniques to systematically uncover instructions that may expose private information through known hardware-level effects.  

Earlier cryptographic libraries relied on constant-time programming, avoiding secret-dependent branches and memory accesses so that execution time does not depend on secret data. However, recent research has shown that this alone is insufficient on today’s CPUs, where every new optimization may open a new side channel.
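As an illustration, here is a hedged sketch of the constant-time style (not SymCrypt code): a byte-string comparison with no secret-dependent branching, so its running time does not reveal where a mismatch occurs.

```rust
// Hedged sketch of constant-time comparison (illustrative only, not SymCrypt).
// There is no early exit on mismatch, so timing does not depend on the secrets.
// As noted above, on modern CPUs this discipline alone is no longer sufficient.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false; // lengths are typically public, so branching here is fine
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulate differences with no secret-dependent branch
    }
    diff == 0
}

fn main() {
    assert!(ct_eq(b"mac-tag-1234", b"mac-tag-1234"));
    assert!(!ct_eq(b"mac-tag-1234", b"mac-tag-1235"));
}
```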

By analyzing binary code for specific compilers and platforms, our extended Revizor tool enables deeper scrutiny of vulnerabilities that aren’t visible in the source code.

Verified Rust implementations begin with ML-KEM

This long-term effort aligns with the Microsoft Secure Future Initiative and brings together experts across Microsoft, building on decades of Microsoft Research investment in program verification and security tooling.

A preliminary version of ML-KEM in Rust is now available on the preview feature/verifiedcrypto (opens in new tab) branch of the SymCrypt repository. We encourage users to try the Rust build and share feedback (opens in new tab). Looking ahead, we plan to support direct use of the same cryptographic library in Rust without requiring C bindings. 

Over the coming months, we plan to rewrite, verify, and ship several algorithms in Rust as part of SymCrypt. As our investment in Rust deepens, we expect to gain new insights into how to best leverage the language for high-assurance cryptographic implementations with low-level optimizations. 

As performance is key to scalability and sustainability, we’re holding new implementations to a high bar, using our benchmarking tools to ensure they match or exceed the performance of existing implementations.

Looking forward 

This is a pivotal moment for high-assurance software. Microsoft’s investment in Rust and formal verification presents a rare opportunity to advance one of our key libraries. We’re excited to scale this work and ultimately deliver an industrial-grade, Rust-based, FIPS-certified cryptographic library.


The post Rewriting SymCrypt in Rust to modernize Microsoft’s cryptographic library  appeared first on Microsoft Research.

Categories: Microsoft

BenchmarkQED: Automated benchmarking of RAG systems

Microsoft Research - Thu, 06/05/2025 - 18:00

One of the key use cases for generative AI involves answering questions over private datasets, with retrieval-augmented generation (RAG) as the go-to framework. As new RAG techniques emerge, there’s a growing need to benchmark their performance across diverse datasets and metrics. 

To meet this need, we’re introducing BenchmarkQED, a new suite of tools that automates RAG benchmarking at scale, available on GitHub (opens in new tab). It includes components for query generation, evaluation, and dataset preparation, each designed to support rigorous, reproducible testing.  

BenchmarkQED complements the RAG methods in our open-source GraphRAG library, enabling users to run a GraphRAG-style evaluation across models, metrics, and datasets. GraphRAG uses a large language model (LLM) to generate and summarize entity-based knowledge graphs, producing more comprehensive and diverse answers than standard RAG for large-scale tasks. 

In this post, we walk through the core components of BenchmarkQED that contribute to the overall benchmarking process. We also share some of the latest benchmark results comparing our LazyGraphRAG system to competing methods, including a vector-based RAG with a 1M-token context window, where the leading LazyGraphRAG configuration showed significant win rates across all combinations of quality metrics and query classes.

In the GraphRAG paper, we distinguish between local queries, where answers are found in a small number of text regions (sometimes even a single region), and global queries, which require reasoning over large portions of the dataset, or even the entire dataset.

Conventional vector-based RAG excels at local queries because the regions containing the answer resemble the query itself and can be retrieved as nearest neighbors in the vector space of text embeddings. However, it struggles with global questions, such as “What are the main themes of the dataset?”, which require understanding dataset qualities not explicitly stated in the text.
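The sketch below illustrates this retrieval step under simplifying assumptions; it is not GraphRAG or BenchmarkQED code. Given precomputed embeddings, the chunks nearest to the query embedding are returned as candidate answer regions.

```rust
// Hedged sketch of nearest-neighbor retrieval over precomputed embeddings.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Indices of the `k` chunk embeddings most similar to the query embedding.
fn top_k(query: &[f32], chunk_embeddings: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = chunk_embeddings
        .iter()
        .enumerate()
        .map(|(i, c)| (i, cosine(query, c)))
        .collect();
    scored.sort_by(|l, r| r.1.total_cmp(&l.1)); // highest similarity first
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}

fn main() {
    let chunks = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![0.7, 0.7]];
    println!("{:?}", top_k(&[1.0, 0.1], &chunks, 2)); // prints [0, 2]
}
```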

AutoQ: Automated query synthesis

This limitation motivated the development of GraphRAG, a system designed to answer global queries. GraphRAG’s evaluation requirements subsequently led to the creation of AutoQ, a method for synthesizing these global queries for any dataset.

AutoQ extends this approach by generating synthetic queries across the full local-to-global spectrum. It defines four distinct classes based on the source and scope of the query (Figure 1, top), forming a logical progression along the spectrum (Figure 1, bottom).

Figure 1. Construction of a 2×2 design space for synthetic query generation with AutoQ, showing how the four resulting query classes map onto the local-global query spectrum. 

AutoQ can be configured to generate any number and distribution of synthetic queries along these classes, enabling consistent benchmarking across datasets without requiring user customization. Figure 2 shows the synthesis process and sample queries from each class, using an AP News dataset.

Figure 2. Synthesis process and example query for each of the four AutoQ query classes.


AutoE: Automated evaluation framework

Our evaluation of GraphRAG focused on analyzing key qualities of answers to global questions. The following qualities were used for the current evaluation:

  • Comprehensiveness: Does the answer address all relevant aspects of the question? 
  • Diversity: Does it present varied perspectives or insights? 
  • Empowerment: Does it help the reader understand and make informed judgments? 
  • Relevance: Does it address what the question is specifically asking?  

The AutoE component scales evaluation of these qualities using the LLM-as-a-Judge method. It presents pairs of answers to an LLM, along with the query and target metric, in counterbalanced order. The model determines whether the first answer wins, loses, or ties with the second. Over a set of queries, whether from AutoQ or elsewhere, this produces win rates between competing methods. When ground truth is available, AutoE can also score answers on correctness, completeness, and related metrics.

An illustrative evaluation is shown in Figure 3. Using a dataset of 1,397 AP News articles on health and healthcare, AutoQ generated 50 queries per class (200 total). AutoE then compared LazyGraphRAG to a competing RAG method, running six trials per query across four metrics, using GPT-4.1 as a judge.

These trial-level results were aggregated using metric-based win rates, where each trial is scored 1 for a win, 0.5 for a tie, and 0 for a loss, and then averaged to calculate the overall win rate for each RAG method.
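The sketch below shows this aggregation under the stated scoring rule; it is illustrative and not the BenchmarkQED implementation.

```rust
// Minimal sketch of the win-rate aggregation described above: each judged
// trial scores 1.0 for a win, 0.5 for a tie, and 0.0 for a loss; scores are
// then averaged. Illustrative only, not the BenchmarkQED code.
#[derive(Clone, Copy)]
enum Outcome {
    Win,
    Tie,
    Loss,
}

fn win_rate(trials: &[Outcome]) -> f64 {
    if trials.is_empty() {
        return 0.0;
    }
    let total: f64 = trials
        .iter()
        .map(|o| match o {
            Outcome::Win => 1.0,
            Outcome::Tie => 0.5,
            Outcome::Loss => 0.0,
        })
        .sum();
    total / trials.len() as f64
}

fn main() {
    use Outcome::*;
    // Hypothetical judgments from six trials of one query on one metric.
    let trials = [Win, Win, Tie, Win, Loss, Win];
    println!("win rate = {:.3}", win_rate(&trials)); // 0.750
}
```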

Figure 3. Win rates of four LazyGraphRAG (LGR) configurations across methods, broken down by the AutoQ query class and averaged across AutoE’s four metrics: comprehensiveness, diversity, empowerment, and relevance. LazyGraphRAG outperforms comparison conditions where the bar is above 50%.

The four LazyGraphRAG conditions (LGR_b200_c200, LGR_b50_c200, LGR_b50_c600, LGR_b200_c200_mini) differ by query budget (b50, b200) and chunk size (c200, c600). All used GPT-4o mini for relevance tests and GPT-4o for query expansion (to five subqueries) and answer generation, except for LGR_b200_c200_mini, which used GPT-4o mini throughout.

Comparison systems were GraphRAG (Local, Global, and Drift Search), Vector RAG with 8k- and 120k-token windows, and three published methods: LightRAG (opens in new tab), RAPTOR (opens in new tab), and TREX (opens in new tab). All methods were limited to the same 8k tokens for answer generation. GraphRAG Global Search used level 2 of the community hierarchy.

LazyGraphRAG outperformed every comparison condition using the same generative model (GPT-4o), winning all 96 comparisons, with all but one reaching statistical significance. The best overall performance came from the larger budget, smaller chunk size configuration (LGR_b200_c200). For DataLocal queries, the smaller budget (LGR_b50_c200) performed slightly better, likely because fewer chunks were relevant. For ActivityLocal queries, the larger chunk size (LGR_b50_c600) had a slight edge, likely because longer chunks provide a more coherent context.

Competing methods performed relatively better on the query classes for which they were designed: GraphRAG Global Search on global queries and Vector RAG on local queries. GraphRAG Drift Search, which combines both strategies, posed the strongest challenge overall.

Increasing Vector RAG’s context window from 8k to 120k tokens did not improve its performance compared to LazyGraphRAG. This raised the question of how LazyGraphRAG would perform against Vector RAG with a 1M-token context window containing most of the dataset.

Figure 4 shows the follow-up experiment comparing LazyGraphRAG to Vector RAG using GPT-4.1, whose 1M-token context window enabled this comparison. Even against the 1M-token window, LazyGraphRAG achieved higher win rates across all comparisons, failing to reach significance only for the relevance of answers to DataLocal queries. These queries tend to benefit most from Vector RAG’s ranking of directly relevant chunks, making it hard for LazyGraphRAG to generate answers that have greater relevance to the query, even though these answers may be dramatically more comprehensive, diverse, and empowering overall.

Figure 4. Win rates of LazyGraphRAG (LGR) over Vector RAG across different context window sizes, broken down by the four AutoQ query classes and four AutoE metrics: comprehensiveness, diversity, empowerment, and relevance. Bars above 50% indicate that LazyGraphRAG outperformed the comparison condition.

AutoD: Automated data sampling and summarization

Text datasets have an underlying topical structure, but the depth, breadth, and connectivity of that structure can vary widely. This variability makes it difficult to evaluate RAG systems consistently, as results may reflect the idiosyncrasies of the dataset rather than the system’s general capabilities.

The AutoD component addresses this by sampling datasets to meet a target specification, defined by the number of topic clusters (breadth) and the number of samples per cluster (depth). This creates consistency across datasets, enabling more meaningful comparisons, as structurally aligned datasets lead to comparable AutoQ queries, which in turn support consistent AutoE evaluations.
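A minimal sketch of this sampling step, assuming documents have already been grouped into topic clusters; this is illustrative, not the AutoD implementation.

```rust
use std::collections::BTreeMap;

// Hedged sketch of sampling a dataset to a target specification:
// `breadth` topic clusters and `depth` documents per cluster.
fn sample_to_spec(
    clusters: &BTreeMap<String, Vec<String>>, // topic -> document identifiers
    breadth: usize,
    depth: usize,
) -> Vec<(String, Vec<String>)> {
    clusters
        .iter()
        .filter(|(_, docs)| docs.len() >= depth) // keep only sufficiently deep clusters
        .take(breadth)
        .map(|(topic, docs)| (topic.clone(), docs[..depth].to_vec()))
        .collect()
}

fn main() {
    let mut clusters = BTreeMap::new();
    clusters.insert(
        "vaccines".to_string(),
        vec!["doc1".to_string(), "doc2".to_string(), "doc3".to_string()],
    );
    clusters.insert(
        "insurance".to_string(),
        vec!["doc4".to_string(), "doc5".to_string()],
    );
    // Sample two clusters with two documents each.
    println!("{:?}", sample_to_spec(&clusters, 2, 2));
}
```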

AutoD also includes tools for summarizing input or output datasets in a way that reflects their topical coverage. These summaries play an important role in the AutoQ query synthesis process, but they can also be used more broadly, such as in prompts where context space is limited.

Supporting the community with open data and tools 

Since the release of the GraphRAG paper, we’ve received many requests to share the dataset of the Behind the Tech (opens in new tab) podcast transcripts we used in our evaluation. An updated version of this dataset is now available in the BenchmarkQED repository (opens in new tab), alongside the AP News dataset containing 1,397 health-related articles, licensed for open release.  

We hope these datasets, together with the BenchmarkQED tools (opens in new tab), help accelerate benchmark-driven development of RAG systems and AI question-answering. We invite the community to try them on GitHub (opens in new tab).


The post BenchmarkQED: Automated benchmarking of RAG systems appeared first on Microsoft Research.

Categories: Microsoft

FrodoKEM: A conservative quantum-safe cryptographic algorithm

Microsoft Research - Tue, 05/27/2025 - 18:00

In this post, we describe FrodoKEM, a key encapsulation protocol that offers a simple design and provides strong security guarantees even in a future with powerful quantum computers.

The quantum threat to cryptography

For decades, modern cryptography has relied on mathematical problems that are practically impossible for classical computers to solve without a secret key. Cryptosystems like RSA, Diffie-Hellman key-exchange, and elliptic curve-based schemes—which rely on the hardness of the integer factorization and (elliptic curve) discrete logarithm problems—secure communications on the internet, banking transactions, and even national security systems. However, the emergence of quantum computing poses a significant threat to these cryptographic schemes.

Quantum computers leverage the principles of quantum mechanics to perform certain calculations exponentially faster than classical computers. Their ability to solve complex problems, such as simulating molecular interactions, optimizing large-scale systems, and accelerating machine learning, is expected to have profound and beneficial implications for fields ranging from chemistry and material science to artificial intelligence.

At the same time, quantum computing is poised to disrupt cryptography. In particular, Shor’s algorithm, a quantum algorithm developed in 1994, can efficiently factor large numbers and compute discrete logarithms—the very problems that underpin the security of RSA, Diffie-Hellman, and elliptic curve cryptography. This means that once large-scale, fault-tolerant quantum computers become available, public-key protocols based on RSA, ECC, and Diffie-Hellman will become insecure, breaking a sizable portion of the cryptographic backbone of today’s digital world. Recent advances in quantum computing, such as Microsoft’s Majorana 1 (opens in new tab), the first quantum processor powered by topological qubits, represent major steps toward practical quantum computing and underscore the urgency of transitioning to quantum-resistant cryptographic systems.

To address this looming security crisis, cryptographers and government agencies have been working on post-quantum cryptography (PQC)—new cryptographic algorithms that can resist attacks from both classical and quantum computers.

The NIST Post-Quantum Cryptography Standardization effort

In 2017, the U.S. National Institute of Standards and Technology (NIST) launched the Post-Quantum Cryptography Standardization project (opens in new tab) to evaluate and select cryptographic algorithms capable of withstanding quantum attacks. As part of this initiative, NIST sought proposals for two types of cryptographic primitives: key encapsulation mechanisms (KEMs)—which enable two parties to securely derive a shared key to establish an encrypted connection, similar to traditional key exchange schemes—and digital signature schemes.

This initiative attracted submissions from cryptographers worldwide, and after multiple evaluation rounds, NIST selected CRYSTALS-Kyber, a KEM based on structured lattices, and standardized it as ML-KEM (opens in new tab). Additionally, NIST selected three digital signature schemes: CRYSTALS-Dilithium, now called ML-DSA; SPHINCS+, now called SLH-DSA; and Falcon, now called FN-DSA.

While ML-KEM provides great overall security and efficiency, some governments and cryptographic researchers advocate for the inclusion and standardization of alternative algorithms that minimize reliance on algebraic structure. Reducing algebraic structure might prevent potential vulnerabilities and, hence, can be considered a more conservative design choice. One such algorithm is FrodoKEM.

International standardization of post-quantum cryptography

Beyond NIST, other international standardization bodies have been actively working on quantum-resistant cryptographic solutions. The International Organization for Standardization (ISO) is leading a global effort to standardize additional PQC algorithms. Notably, European government agencies—including Germany’s BSI (opens in new tab), the Netherlands’ NLNCSA and AIVD (opens in new tab), and France’s ANSSI (opens in new tab)—have shown strong support for FrodoKEM, recognizing it as a conservative alternative to structured lattice-based schemes.

As a result, FrodoKEM is undergoing standardization at ISO. Additionally, ISO is standardizing ML-KEM and a conservative code-based KEM called Classic McEliece. These three algorithms are planned for inclusion in ISO/IEC 18033-2:2006 as Amendment 2 (opens in new tab).

What is FrodoKEM?

FrodoKEM is a key encapsulation mechanism (KEM) based on the Learning with Errors (LWE) problem, a cornerstone of lattice-based cryptography. Unlike structured lattice-based schemes such as ML-KEM, FrodoKEM is built on generic, unstructured lattices, i.e., it is based on the plain LWE problem.

Why unstructured lattices?

Structured lattice-based schemes introduce additional algebraic properties that could potentially be exploited in future cryptanalytic attacks. By using unstructured lattices, FrodoKEM eliminates these concerns, making it a safer choice in the long run, albeit at the cost of larger key sizes and lower efficiency.

It is important to emphasize that no particular cryptanalytic weaknesses are currently known for recommended parameterizations of structured lattice schemes in comparison to plain LWE. However, our current understanding of the security of these schemes could potentially change in the future with cryptanalytic advances.

Lattices and the Learning with Errors (LWE) problem

Lattice-based cryptography relies on the mathematical structure of lattices, which are regular arrangements of points in multidimensional space. A lattice is defined as the set of all integer linear combinations of a set of basis vectors. The difficulty of certain computational problems on lattices, such as the Shortest Vector Problem (SVP) and the Learning with Errors (LWE) problem, forms the basis of lattice-based schemes.

The Learning with Errors (LWE) problem

The LWE problem is a fundamental hard problem in lattice-based cryptography. It involves solving a system of linear equations where some small random error has been added to each equation, making it extremely difficult to recover the original secret values. This added error ensures that the problem remains computationally infeasible, even for quantum computers. Figure 1 below illustrates the LWE problem, specifically, the search version of the problem.

As can be seen in Figure 1, for the setup of the problem we need a dimension \(n\) that defines the size of matrices, a modulus \(q\) that defines the value range of the matrix coefficients, and a certain error distribution \(\chi\) from which we sample \(\textit{“small”}\) matrices. We sample two matrices from \(\chi\), a small matrix \(\text{s}\) and an error matrix \(\text{e}\) (for simplicity in the explanation, we assume that both have only one column); sample an \(n \times n\) matrix \(\text{A}\) uniformly at random; and compute \(\text{b} = \text{A} \times \text{s} + \text{e}\). In the illustration, each matrix coefficient is represented by a colored square, and the “legend of coefficients” gives an idea of the size of the respective coefficients, e.g., orange squares represent the small coefficients of matrix \(\text{s}\) (small relative to the modulus \(q\)). Finally, given \(\text{A}\) and \(\text{b}\), the search LWE problem consists in finding \(\text{s}\). This problem is believed to be hard for suitably chosen parameters (e.g., for dimension \(n\) sufficiently large) and is used at the core of FrodoKEM.

In comparison, the LWE variant used in ML-KEM—called Module-LWE (M-LWE)—has additional symmetries, adding mathematical structure that helps improve efficiency. In a setting similar to that of the search LWE problem above, the matrix \(\text{A}\) can be represented by just a single row of coefficients.

FIGURE 1: Visualization of the (search) LWE problem.

LWE is conjectured to be quantum-resistant, and FrodoKEM’s security is directly tied to its hardness. In other words, cryptanalysts and quantum researchers have not been able to devise an efficient quantum algorithm capable of solving the LWE problem and, hence, of breaking FrodoKEM. In cryptography, absolute security can never be guaranteed; instead, confidence in a problem’s hardness comes from extensive scrutiny and its resilience against attacks over time.
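The toy sketch below generates an LWE instance of the form described in Figure 1. It assumes the Rust `rand` crate (0.8-style API), and its parameters are deliberately tiny and insecure; FrodoKEM uses much larger dimensions and a carefully tuned error distribution.

```rust
// Toy sketch of the (search) LWE relation b = A*s + e mod q from Figure 1.
// Assumes the `rand` crate (0.8-style API). Parameters are insecure toy values.
use rand::Rng;

const N: usize = 8;      // dimension (toy; FrodoKEM-640 uses n = 640)
const Q: u32 = 1 << 15;  // modulus, a power of two as in FrodoKEM

fn small(rng: &mut impl Rng) -> u32 {
    // "Small" coefficient in {-2, ..., 2}, represented mod q.
    let v: i32 = rng.gen_range(-2..=2);
    v.rem_euclid(Q as i32) as u32
}

fn main() {
    let mut rng = rand::thread_rng();

    // Uniformly random public matrix A (n x n).
    let a: Vec<Vec<u32>> = (0..N)
        .map(|_| (0..N).map(|_| rng.gen_range(0..Q)).collect())
        .collect();

    // Small secret s and small error e (single column each).
    let s: Vec<u32> = (0..N).map(|_| small(&mut rng)).collect();
    let e: Vec<u32> = (0..N).map(|_| small(&mut rng)).collect();

    // b = A * s + e (mod q). Given (A, b), recovering s is the search LWE problem.
    let b: Vec<u32> = (0..N)
        .map(|i| {
            let dot: u64 = (0..N).map(|j| a[i][j] as u64 * s[j] as u64).sum();
            ((dot + e[i] as u64) % Q as u64) as u32
        })
        .collect();

    println!("b = {:?}", b);
}
```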

How FrodoKEM Works

FrodoKEM follows the standard paradigm of a KEM, which consists of three main operations—key generation, encapsulation, and decapsulation—performed interactively between a sender and a recipient with the goal of establishing a shared secret key:

  1. Key generation (KeyGen), computed by the recipient
    • Generates a public key and a secret key.
    • The public key is sent to the sender, while the private key remains secret.
  2. Encapsulation (Encapsulate), computed by the sender
    • Generates a random session key.
    • Encrypts the session key using the recipient’s public key to produce a ciphertext.
    • Produces a shared key using the session key and the ciphertext.
    • The ciphertext is sent to the recipient.
  3. Decapsulation (Decapsulate), computed by the recipient
    • Decrypts the ciphertext using their secret key to recover the original session key.
    • Reproduces the shared key using the decrypted session key and the ciphertext.

The shared key generated by the sender and reconstructed by the recipient can then be used to establish secure symmetric-key encryption for further communication between the two parties.
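The interface sketch below captures this three-operation flow in Rust; the trait and type names are illustrative assumptions, not the FrodoKEM or SymCrypt API. A concrete scheme such as FrodoKEM would supply the key, ciphertext, and shared-key types.

```rust
// Hedged sketch of a generic KEM interface; names are illustrative only.
pub trait Kem {
    type PublicKey;
    type SecretKey;
    type Ciphertext;
    type SharedKey: PartialEq;

    /// Recipient: generate a key pair; the public key is sent to the sender.
    fn keygen() -> (Self::PublicKey, Self::SecretKey);

    /// Sender: derive a shared key and a ciphertext from the recipient's
    /// public key; the ciphertext is sent back to the recipient.
    fn encapsulate(pk: &Self::PublicKey) -> (Self::Ciphertext, Self::SharedKey);

    /// Recipient: recover the same shared key from the ciphertext using the
    /// secret key.
    fn decapsulate(sk: &Self::SecretKey, ct: &Self::Ciphertext) -> Self::SharedKey;
}

/// Both parties should end up with the same symmetric key.
fn establish_shared_key<K: Kem>() -> bool {
    let (pk, sk) = K::keygen();                    // recipient
    let (ct, key_sender) = K::encapsulate(&pk);    // sender
    let key_recipient = K::decapsulate(&sk, &ct);  // recipient
    key_sender == key_recipient
}
```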

Figure 2 below shows a simplified view of the FrodoKEM protocol. As highlighted in red, FrodoKEM uses at its core LWE operations of the form “\(\text{b} = \text{A} \times \text{s} + \text{e}\)”, which are directly applied within the KEM paradigm.

FIGURE 2: Simplified overview of FrodoKEM.

Performance: Strong security has a cost

Not relying on additional algebraic structure certainly comes at a cost for FrodoKEM in the form of increased protocol runtime and bandwidth. The table below compares the performance and key sizes corresponding to the FrodoKEM level 1 parameter set (variant called “FrodoKEM-640-AES”) and the respective parameter set of ML-KEM (variant called “ML-KEM-512”). These parameter sets are intended to match or exceed the brute force security of AES-128. As can be seen, the difference in speed and key sizes between FrodoKEM and ML-KEM is more than an order of magnitude. Nevertheless, the runtime of the FrodoKEM protocol remains reasonable for most applications. For example, on our benchmarking platform clocked at 3.2GHz, the measured runtimes are 0.97 ms, 1.9 ms, and 3.2 ms for security levels 1, 2, and 3, respectively.

For security-sensitive applications, a more relevant comparison is with Classic McEliece, a post-quantum code-based scheme also considered for standardization. In this case, FrodoKEM offers several efficiency advantages. Classic McEliece’s public keys are significantly larger—well over an order of magnitude greater than FrodoKEM’s—and its key generation is substantially more computationally expensive. Nonetheless, Classic McEliece provides an advantage in certain static key-exchange scenarios, where its high key generation cost can be amortized across multiple key encapsulation executions.

TABLE 1: Comparison of key sizes and performance on an x86-64 processor for NIST level 1 parameter sets.

A holistic design made with security in mind

FrodoKEM’s design supports security beyond its reliance on generic, unstructured lattices, which minimizes the attack surface for potential future cryptanalytic threats. Its parameters have been carefully chosen with additional security margins to withstand advancements in known attacks. Furthermore, FrodoKEM is designed with simplicity in mind: its internal operations are based on straightforward matrix-vector arithmetic using integer coefficients reduced modulo a power of two. These design decisions facilitate simple, compact, and secure implementations that are also easier to maintain and to protect against side-channel attacks.

Conclusion

After years of research and analysis, the next generation of post-quantum cryptographic algorithms has arrived. NIST has chosen strong PQC protocols that we believe will serve Microsoft and its customers well in many applications. For security-sensitive applications, FrodoKEM offers a secure yet practical approach for post-quantum cryptography. While its reliance on unstructured lattices results in larger key sizes and higher computational overhead compared to structured lattice-based alternatives, it provides strong security assurances against potential future attacks. Given the ongoing standardization efforts and its endorsement by multiple governmental agencies, FrodoKEM is well-positioned as a viable alternative for organizations seeking long-term cryptographic resilience in a post-quantum world.

Further Reading

For those interested in learning more about FrodoKEM, post-quantum cryptography, and lattice-based cryptography, the following resources provide valuable insights:


The post FrodoKEM: A conservative quantum-safe cryptographic algorithm appeared first on Microsoft Research.

Categories: Microsoft