Microsoft Research


When AI Meets Biology: Promise, Risk, and Responsibility

Mon, 10/06/2025 - 16:03

Advances in AI are opening extraordinary frontiers in biology. AI-assisted protein engineering holds the promise of new medicines, materials, and breakthroughs in scientific understanding. Yet these same technologies also introduce biosecurity risks and may lower barriers to designing harmful toxins or pathogens. This “dual-use” potential, where the same knowledge can be harnessed for good or misused to cause harm, poses a critical dilemma for modern science.

Great Promise—and Potential Threat

I’m excited about the potential for AI-assisted protein design to drive breakthroughs in biology and medicine. At the same time, I’ve also studied how these tools could be misused. In computer-based studies, we found that AI protein design (AIPD) tools could generate modified versions of proteins of concern, such as ricin. Alarmingly, these reformulated proteins were able to evade the biosecurity screening systems used by DNA synthesis companies, which scientists rely on to synthesize AI-generated sequences for experimental use.

In our paper published in Science on October 2, “Strengthening nucleic acid biosecurity screening against generative protein design tools,” we describe a two-year confidential project we began in late 2023 while preparing a case study for a workshop on AI and biosecurity.

We worked confidentially with partners across organizations and sectors for 10 months to develop AI biosecurity “red-teaming” methods that allowed us to better understand vulnerabilities and craft practical solutions—“patches” that have now been adopted globally, making screening systems significantly more AI-resilient.

Figure: Summary of the AIPD red-teaming workflow.

We modeled our study’s structure, methods, and process on practices from the cybersecurity community, where “zero-day” vulnerabilities are kept confidential until a protective patch is developed and deployed. After a small group of workshop attendees acknowledged the existence of a zero-day for AI in biology, we worked closely with stakeholders—including synthesis companies, biosecurity organizations, and policymakers—to rapidly create and distribute patches that improved detection of AI-redesigned protein sequences. We delayed public disclosure until protective measures were in place and widely adopted.

Dilemma of Disclosure

The dual-use dilemma also complicates how we share information about vulnerabilities and safeguards. Across AI and other fields, researchers face a core question:

How can scientists share potentially risk-revealing methods and results in ways that enable progress without offering a roadmap for misuse?

We recognized that our work itself—detailing methods and failure modes—could be exploited by malicious actors if published openly. To guide decisions about what to share, we held a multi-stakeholder deliberation involving government agencies, international biosecurity organizations, and policy experts. Opinions varied: some urged full transparency to maximize reproducibility and help others build on our work; others stressed restraint to minimize risk. It was clear that a new model of scientific communication was needed, one that could balance openness and security.

The Novel Framework

The risk of sharing dangerous information through biological research has become a growing concern. We have participated in community-wide discussions of these challenges, including a recent National Academies of Sciences, Engineering, and Medicine workshop and study.

In preparing our manuscript for publication, we worked on designing a process to limit the spread of dangerous information while still enabling scientific progress. 

To address the dual challenges, we devised a tiered access system for data and methods, implemented in partnership with the International Biosecurity and Biosafety Initiative for Science (IBBIS), a nonprofit dedicated to advancing science while reducing catastrophic risks. The system works as follows:

  • Controlled access: Researchers can request access through IBBIS, providing their identity, affiliation, and intended use. Requests are reviewed by an expert biosecurity committee, ensuring that only legitimate scientists conducting relevant research gain access.
  • Stratified tiers of information: Data and code are classified into several tiers according to their potential hazard, from low-risk summaries through sensitive technical data to critical software pipelines.
  • Safeguards and agreements: Approved users sign tailored usage agreements, including non-disclosure terms, before receiving data.
  • Resilience and longevity: Provisions are built in for declassification when risks subside, and for succession of stewardship to trusted organizations should IBBIS be unable to continue its operation.
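To make the moving parts concrete, here is a minimal, purely illustrative sketch of how such a tiered-access request flow might be modeled. The tier names, fields, and checks are hypothetical; they are not IBBIS’s actual implementation.

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    """Hypothetical hazard tiers, ordered from least to most sensitive."""
    LOW_RISK_SUMMARY = 1
    SENSITIVE_TECHNICAL_DATA = 2
    CRITICAL_SOFTWARE_PIPELINE = 3

@dataclass
class AccessRequest:
    researcher: str
    affiliation: str
    intended_use: str
    requested_tier: Tier
    committee_approved: bool = False
    agreement_signed: bool = False

def may_release(request: AccessRequest) -> bool:
    """Data is released only after committee review and a signed usage agreement."""
    return request.committee_approved and request.agreement_signed

# Example: a request for the most sensitive tier, still awaiting a signed agreement.
req = AccessRequest("Dr. A. Researcher", "Example University",
                    "Replicate screening-evasion benchmarks",
                    Tier.CRITICAL_SOFTWARE_PIPELINE,
                    committee_approved=True, agreement_signed=False)
print(may_release(req))  # False until the tailored usage agreement is signed
```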

This framework allows replication and extension of our work while guarding against misuse. Rather than relying on secrecy, it provides a durable system of responsible access.

To ensure continued funding for the storage and responsible distribution of sensitive data and software, and for the operation of the sharing program, we provided an endowment to IBBIS to support the program in perpetuity. This approach was modeled after the One Hundred Year Study on AI at Stanford, which is endowed to continue for the life of the university.

An Important Step in Scientific Publishing

We are pleased that the leadership at Science accepted our approach to handling information hazards. To our knowledge, this is the first time a leading scientific journal has formally endorsed a tiered-access approach to manage an information hazard. This recognition validates the idea that rigorous science and responsible risk management can coexist—and that journals, too, can play a role in shaping how sensitive knowledge is shared. We acknowledge the visionary leadership at Science, including editors Michael Funk and Valda Vinson and Editor-in-Chief Holden Thorp.

Beyond Biology: A Model for Sensitive Research

While developed for AI-powered protein design, our approach offers a generalizable model for dual-use research of concern (DURC) across disciplines. Whether in biology, chemistry, or emerging technologies, scientists will increasingly confront situations where openness and security pull in opposite directions. Our experience shows that these values can be balanced: with creativity, coordination, and new institutional mechanisms, science can uphold both reproducibility and responsibility.

We hope this framework becomes a template for future projects, offering a way forward for researchers who wish to share their insights without amplifying risks. By embedding resilience into how knowledge is communicated—not just what is communicated—we can ensure that scientific progress continues to serve humanity safely.

The responsible management of information hazards is no longer a peripheral concern: it is central to how science will advance in the age of powerful technologies like AI. This approach to managing information hazards demonstrates a path forward, where novel frameworks for access and stewardship allow sensitive but vital research to be shared, scrutinized, and extended responsibly. Approaches like this will be critical to ensuring that scientific openness and societal safety advance hand-in-hand.

Additional reading

Strengthening nucleic acid biosecurity screening against generative protein design tools, Science, 2025.

The Age of AI in the Life Sciences: Benefits and Biosecurity Considerations, National Academies of Sciences, Engineering, and Medicine, 2025.

Disseminating In Silico and Computational Biological Research: Navigating Benefits and Risks: Proceedings of a Workshop, National Academies of Sciences, Engineering, and Medicine, 2025.

Protecting scientific integrity in an age of generative AI, Proceedings of the National Academy of Sciences, 2024.


Using AI to assist in rare disease diagnosis

Mon, 09/22/2025 - 16:17

In the promising and rapidly evolving field of genetic analysis, the ability to accurately interpret whole genome sequencing data is crucial for diagnosing and improving outcomes for people with rare genetic diseases. Yet despite technological advancements, genetic professionals face steep challenges in managing and synthesizing the vast amounts of data required for these analyses. Fewer than 50% of initial cases yield a diagnosis, and while reanalysis can lead to new findings, the process remains time-consuming and complex. 

To better understand and address these challenges, Microsoft Research—in collaboration with Drexel University and the Broad Institute—conducted a comprehensive study titled AI-Enhanced Sensemaking: Exploring the Design of a Generative AI-Based Assistant to Support Genetic Professionals. The study was recently published in a special issue of the journal ACM Transactions on Interactive Intelligent Systems focused on generative AI.

The study focused on integrating generative AI to support the complex, time-intensive, and information-dense sensemaking tasks inherent in whole genome sequencing analysis. Through detailed empirical research and collaborative design sessions with experts in the field, we identified key obstacles genetic professionals face and proposed AI-driven solutions to enhance their workflows. We developed strategies for how generative AI can help synthesize biomedical data, enabling AI-expert collaboration to increase the diagnoses of previously unsolved rare diseases—ultimately aiming to improve patients’ quality of life and life expectancy.

Whole genome sequencing in rare disease diagnosis

Rare diseases affect up to half a billion people globally, and obtaining a diagnosis can take multiple years. These diagnoses often involve specialist consultations, laboratory tests, imaging studies, and invasive procedures. Whole genome sequencing is used to identify genetic variants responsible for these diseases by comparing a patient’s DNA sequence to reference genomes. Genetic professionals use bioinformatics tools such as seqr, an open-source, web-based tool for rare disease case analysis and project management, to help them filter and prioritize more than 1 million variants and determine their potential role in disease. A critical component of their work is sensemaking: the process of searching, filtering, and synthesizing data to build, refine, and present models from complex sets of gene and variant information.
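To give a flavor of the filtering and prioritization step described above, here is a simplified, illustrative sketch. The fields and thresholds are hypothetical and do not reflect seqr’s actual logic.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    gene: str
    population_allele_frequency: float  # e.g., from a population database
    predicted_impact: str               # "HIGH", "MODERATE", "LOW"
    matches_patient_phenotype: bool

def prioritize(variants, max_frequency=0.001):
    """Keep rare, higher-impact variants in genes consistent with the phenotype,
    then rank the rarest first. Thresholds here are illustrative only."""
    candidates = [
        v for v in variants
        if v.population_allele_frequency <= max_frequency
        and v.predicted_impact in ("HIGH", "MODERATE")
        and v.matches_patient_phenotype
    ]
    return sorted(candidates, key=lambda v: v.population_allele_frequency)

# A run over millions of variants typically leaves a short list for expert review.
```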

The multi-step sequencing process typically takes 3 to 12 weeks and requires synthesizing and aggregating extensive evidence to understand gene and variant effects for the patient. If a patient’s case goes unsolved, their whole genome sequencing data is set aside until enough time has passed to warrant a reanalysis. This creates a backlog of patient cases. The ability to easily identify when new scientific evidence emerges, and when to reanalyze an unsolved patient case, is key to shortening the time patients live with an undiagnosed rare disease.

The promise of AI systems to assist with complex human tasks

Approximately 87% of AI systems never reach deployment simply because they solve the wrong problems. Understanding the AI support desired by different types of professionals, their current workflows, and AI capabilities is critical to successful AI system deployment and use. Matching technology capabilities with user tasks is particularly challenging in AI design because AI models can generate numerous outputs, and their capabilities can be unclear. To design an effective AI-based system, one needs to identify the tasks AI can support, determine the appropriate level of AI involvement, and design the user-AI interactions. This necessitates considering how humans interact with technology and how AI can best be incorporated into workflows and tools.

Study objectives and co-designing a genetic AI assistant

Our study aimed to understand the current challenges and needs of genetic professionals performing whole genome sequencing analyses and explore the tasks where they want an AI assistant to support them in their work. The first phase of our study involved interviews with 17 genetics professionals to better understand their workflows, tools, and challenges. They included genetic analysts directly involved in interpreting data, as well as other roles participating in whole genome sequencing. In the second phase of our study, we conducted co-design sessions with study participants on how an AI assistant could support their workflows. We then developed a prototype of an AI assistant, which was further tested and refined with study participants in follow-up design walk-through sessions.

Identifying challenges in whole genome sequencing analysis

Through our in-depth interviews with genetic professionals, our study uncovered three critical challenges in whole genome sequencing analysis:

  1. Information Overload: Genetic analysts need to gather and synthesize vast amounts of data from multiple sources. This task is incredibly time-consuming and prone to human error.
  2. Collaborative Sharing: Sharing findings with others in the field can be cumbersome and inefficient, often relying on outdated methods that slow the collaborative analysis process.
  3. Prioritizing Reanalysis: Given the continuous influx of new scientific discoveries, prioritizing unsolved cases to reanalyze is a daunting challenge. Analysts need a systematic approach to identify cases that might benefit most from reanalysis.

Genetic professionals highlighted the time-consuming nature of gathering and synthesizing information about genes and variants from different data sources. Other genetic professionals may have insights into certain genes and variants, but sharing and interpreting information with others for collaborative sensemaking requires significant time and effort. Although new scientific findings could affect unsolved cases through reanalysis, prioritizing cases based on new findings was challenging given the number of unsolved cases and limited time of genetic professionals.

Co-designing with experts and AI-human sensemaking tasks

Our study participants prioritized two potential tasks of an AI assistant. The first task was flagging cases for reanalysis based on new scientific findings. The assistant would alert analysts to unsolved cases that could benefit from new research, providing relevant updates drawn from recent publications. The second task focused on aggregating and synthesizing information about genes and variants from the scientific literature. This feature would compile essential information from numerous scientific papers about genes and variants, presenting it in a user-friendly format and saving analysts significant time and effort. Participants emphasized the need to balance selectivity with comprehensiveness in the evidence they review. They also envisioned collaborating with other genetic professionals to interpret, edit, and verify artifacts generated by the AI assistant.
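The first task can be pictured with a small, hypothetical sketch: compare each unsolved case’s candidate genes against newly published findings and flag cases worth revisiting. This illustrates the concept only; it is not the prototype’s implementation.

```python
from dataclasses import dataclass

@dataclass
class UnsolvedCase:
    case_id: str
    candidate_genes: set  # genes of uncertain significance from prior analysis

@dataclass
class Publication:
    title: str
    genes_mentioned: set
    reports_disease_association: bool

def flag_for_reanalysis(cases, new_publications):
    """Return (case, supporting publications) pairs where new literature touches
    a case's candidate genes and reports a disease association."""
    flagged = []
    for case in cases:
        support = [
            pub for pub in new_publications
            if pub.reports_disease_association
            and case.candidate_genes & pub.genes_mentioned
        ]
        if support:
            flagged.append((case, support))
    return flagged
```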

Genetic professionals require both broad and focused evidence at different stages of their workflow. The AI assistant prototypes were designed to allow flexible filtering and thorough evidence aggregation, ensuring users can delve into comprehensive data or selectively focus on pertinent details. The prototypes included features for collaborative sensemaking, enabling users to interpret, edit, and verify AI-generated information collectively. This approach not only underscores the trustworthiness of AI outputs, but also facilitates shared understanding and decision-making among genetic professionals.

Design implications for expert-AI sensemaking

In the shifting frontiers of genome sequence analysis, leveraging generative AI to enhance sensemaking offers intriguing possibilities. The tasks of staying current, synthesizing information from diverse sources, and making informed decisions are challenging.

Our study participants emphasized the hurdles in integrating data from multiple sources without losing critical components, documenting decision rationales, and fostering collaborative environments. Generative AI models, with their advanced capabilities, have started to address these challenges by automatically generating interactive artifacts to support sensemaking. However, the effectiveness of such systems hinges on careful design considerations, particularly in how they facilitate distributed sensemaking, support both initial and ongoing sensemaking, and combine evidence from multiple modalities. We next discuss three design considerations for using generative AI models to support sensemaking.

Distributed expert-AI sensemaking design

Generative AI models can create artifacts that aid an individual user’s sensemaking process; however, the true potential lies in sharing these artifacts among users to foster collective understanding and efficiency. Participants in our study emphasized the importance of explainability, feedback, and trust when interacting with AI-generated content. Trust is gained by viewing portions of artifacts marked as correct by other users, or by observing edits made to AI-generated information. Some users, however, cautioned against over-reliance on AI, which could obscure underlying inaccuracies. Thus, design strategies should ensure that any corrections are clearly marked and annotated. Furthermore, to enhance distributed sensemaking, visibility of others’ notes and context-specific synthesis through AI can streamline the process.

Initial expert-AI sensemaking and re-sensemaking design

In our fast-paced, information-driven world, it is essential to understand a situation both initially and again when new information arises. Sensemaking is inherently temporal, reflecting and shaping our understanding of time as we revisit tasks to reevaluate past decisions or incorporate new information. Generative AI plays a pivotal role here by transforming static data into dynamic artifacts that evolve, offering a comprehensive view of past rationales. Such AI-generated artifacts provide continuity, allowing users—whether original decision-makers or new individuals—to access the rationale behind decisions made in earlier task instances. By continuously editing and updating these artifacts, generative AI highlights new information since the last review, supporting ongoing understanding and decision-making. Moreover, AI systems enhance transparency by summarizing previous notes and questions, offering insights into earlier thought processes and facilitating a deeper understanding of how conclusions were drawn. This reflective capability not only reinforces initial sensemaking efforts but also equips users with the clarity needed for informed re-sensemaking as new data emerges.

Combining evidence from multiple modalities to enhance AI-expert sensemaking

The ability to combine evidence from multiple modalities is essential for effective sensemaking. Users often need to integrate diverse types of data—text, images, spatial coordinates, and more—into a coherent narrative to make informed decisions. Consider the case of search and rescue operations, where workers must rapidly synthesize information from texts, photographs, and GPS data to strategize their efforts. Recent advancements in multimodal generative AI models have empowered users by incorporating and synthesizing these varied inputs into a unified, comprehensive view. For instance, a participant in our study illustrated this capability by using a generative AI model to merge text from scientific publications with a visual depiction of gene structure. This integration could create an image that situates an individual’s genetic variant among documented variants. Such advanced synthesis enables users to capture complex relationships and insights concisely, streamlining decision-making and expanding the potential for innovative solutions across diverse fields.

Figure: Sensemaking process when interpreting variants with the introduction of the prototype AI assistant. Gray boxes represent sensemaking activities that are currently performed by an analyst but become human-in-the-loop processes with the involvement of our prototype AI assistant. Non-gray boxes represent activities reserved for analyst completion without assistance from our AI assistant prototype. Within the foraging, searching, and synthesizing processes, examples of data sources and data types for each, respectively, are connected by dotted lines.

Conclusion

We explored the potential of generative AI to support genetic professionals in diagnosing rare diseases. By designing an AI-based assistant, we aim to streamline whole genome sequencing analysis, helping professionals diagnose rare genetic diseases more efficiently. Our study unfolded in two key phases: pinpointing existing challenges in analysis, and design ideation, where we crafted a prototype AI assistant. This tool is designed to boost diagnostic yield and cut down diagnosis time by flagging cases for reanalysis and synthesizing crucial gene and variant data. Despite valuable findings, more research is needed. Future work will involve real-time, task-based user testing with genetic professionals to assess the AI assistant’s impact on their workflow. The promise of AI advancements lies in solving the right user problems and building the appropriate solutions, achieved through collaboration among model developers, domain experts, system designers, and HCI researchers. By fostering these collaborations, we aim to develop robust, personalized AI assistants tailored to specific domains.

Join the conversation

Join us as we continue to explore the transformative potential of generative AI in genetic analysis, and please read the full publication here. Follow us on social media, share this post with your network, and let us know your thoughts on how AI can transform genetic research. If you’re interested in related research, check out Evidence Aggregator: AI reasoning applied to rare disease diagnosis.


Tool-space interference in the MCP era: Designing for agent compatibility at scale

Thu, 09/11/2025 - 18:00

This year we’ve seen remarkable advances in agentic AI, including systems that conduct deep research, operate computers, complete substantial software engineering tasks, and tackle a range of other complex, multi-step goals. In each case, the industry relied on careful vertical integration: tools and agents were co-designed, co-trained, and tested together for peak performance. For example, OpenAI’s recent models presume the availability of web search and document retrieval tools. Likewise, the prompts and actions of Magentic-One are set up to make hand-offs easy—for example, allowing the WebSurfer agent to pass downloaded files to the Coder agent. But as agents proliferate, we anticipate strategies relying heavily on vertical integration will not age well. Agents from different developers or companies will increasingly encounter each other and must work together to complete tasks, in what we refer to as a society of agents. These systems can vary in how coordinated they are, how aligned their goals are, and how much information they share. Can heterogeneous agents and tools cooperate in this setting, or will they hinder one another and slow progress?

Early clues have emerged from an unexpected source: the Model Context Protocol (MCP). Since January 2025, MCP has grown from a promising spec to a thriving market of tool servers. As an example, Zapier boasts a catalog of 30,000 tools across 7,000 services. Composio provides over 100 managed MCP servers, surfacing hundreds of tools. Hugging Face is now serving many Spaces apps over MCP, and Shopify has enabled MCP for millions of storefronts. A society of tools is already here, and it promises to extend agent capabilities through cross-provider horizontal integration.

So, what does MCP have to say about horizontal integration? As catalogs grow, we expect some new failure modes to surface. This blog post introduces the term tool-space interference for these failure modes, and sketches both early observations and some pragmatic interventions to keep the society we’re building from stepping on its own feet.

Tool-space interference describes situations where otherwise reasonable tools or agents, when co-present, reduce end-to-end effectiveness. This can look like longer action sequences, higher token cost, brittle recovery from errors, or, in some cases, task failure.

A framing example

Consider MCP as a means for extending Magentic-One, a generalist multi-agent system we released last year, to cover more software engineering tasks. Magentic-One ships with agents to write code, interact with the computer terminal, browse the web, and access local files. To help Magentic-One navigate version control, find issues to solve, and make pull requests, we could add an agent equipped with the GitHub MCP Server. However, now each time the team encounters a task involving GitHub, it must choose whether to visit github.com in the browser, execute a git command at the command line, or engage the GitHub MCP server. As the task progresses, agent understanding of state can also diverge: changing the branch in the browser won’t change the branch in the terminal, and an authorized MCP tool does not imply authorization in the browser. Thus, while any single agent might complete the task efficiently, the larger set of agents might misunderstand or interfere with one another, leading to additional rounds of debugging, or even complete task failure.

Figure 1: We can extend Magentic-One by adding an agent that equips the GitHub MCP server. However, on every turn involving a git-related task, the orchestrator will need to decide between messaging the Computer Terminal agent (with access to the git command line interface), the WebSurfer agent (with access to github.com), and the agent with the GitHub MCP server. This overlap raises the possibility that they will interfere with one another.

Tool-space interference, through the lens of MCP

To better understand the potential interference patterns and the current state of the MCP ecosystem, we conducted a survey of MCP servers listed on two registries: smithery.ai and Docker MCP Hub. Smithery is an MCP server registry with over 7,000 first-party and community-contributed servers, which we sampled from the Smithery API. Likewise, Docker MCP Hub is a registry that distributes MCP servers as Docker images, and we manually collected popular entries. We then launched each server for inspection. After excluding servers that were empty or failed to launch, and deduplicating servers with identical features, 1,470 servers remained in our catalog.

To automate the inspection of running MCP servers, we developed an MCP Interviewer tool. The MCP Interviewer begins by cataloging the server’s tools, prompts, resources, resource templates, and capabilities. From this catalog we can compute descriptive statistics such as the number of tools, or the depth of the parameter schemas. Then, given the list of available tools, the interviewer uses an LLM (in our case, OpenAI’s GPT-4.1) to construct a functional testing plan that calls each tool at least once, collecting outputs, errors, and statistics along the way. Finally, the interviewer can also grade more qualitative criteria by using an LLM to apply purpose-built rubrics to tool schemas and tool call outputs. We are excited to release the MCP Interviewer as an open-source CLI tool, so server developers can automatically evaluate their MCP servers with agent usability in mind, and users can validate new servers.
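As a rough illustration of the static-cataloging step (not the MCP Interviewer itself), the sketch below uses the reference MCP Python SDK to connect to a local server over stdio and count its tools and parameters. The server command shown is only a placeholder.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def catalog(command: str, args: list[str]) -> None:
    """Connect to a local MCP server and print basic static statistics."""
    params = StdioServerParameters(command=command, args=args)
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = (await session.list_tools()).tools
            print(f"{len(tools)} tools exposed")
            for tool in tools:
                n_params = len((tool.inputSchema or {}).get("properties", {}))
                print(f"  {tool.name}: {n_params} parameters")

# Placeholder example server; substitute the server you want to inspect.
asyncio.run(catalog("npx", ["-y", "@modelcontextprotocol/server-everything"]))
```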

While our survey provides informative initial results, it also faces significant limitations, the most obvious of which is authorization: many of the most popular MCP servers provide access to services that require authorization to use, hindering automated analysis. We are often still able to collect static features from these servers but are limited in the functional testing that can be done.

One-size fits all (but some more than others)

So, what does our survey of MCP servers tell us about the MCP ecosystem? We will get into the numbers in a moment, but as we contemplate the statistics, there is one overarching theme to keep in mind: MCP servers do not know which clients or models they are working with, and present one common set of tools, prompts, and resources to everyone. However, some models handle long contexts and large tool spaces better than others (with diverging hard limits), and respond quite differently to common prompting patterns. For example, OpenAI’s guide on function calling advises developers to:

“Include examples and edge cases, especially to rectify any recurring failures. (Note: Adding examples may hurt performance for reasoning models.)”

So already, this places MCP at a disadvantage relative to vertical integrations that optimize for the operating environment. And with that, let’s dive into more numbers.

Tool count

While models vary in their proficiency at tool calling, the general trend has been that performance drops as the number of tools increases. For example, OpenAI limits developers to 128 tools, but recommends that developers:

Keep the number of functions small for higher accuracy. Evaluate your performance with different numbers of functions. Aim for fewer than 20 functions at any one time, though this is just a soft suggestion.

While we expect this to improve with each new model generation, at present, large tool spaces can lower performance by up to 85% for some models. Thankfully, the majority of servers in our survey contain four or fewer tools. But there are outliers: the largest MCP server we cataloged adds 256 distinct tools, while the 10 next-largest servers add more than 100 tools each. Further down the list we find popular servers like Playwright-MCP (29 tools, at the time of this writing) and GitHub MCP (91 tools, with subsets available at alternative endpoint URLs), which might be too large for some models.

Figure 2: The number of tools listed by each catalogued server directly after initialization. Note: servers can change the tools they list at any time, but only 226 servers in our catalog declare this capability.

Response length

Tools are generally called in agentic loops, where the output is then fed back into the model as input context. Models have hard limits on input context, but even within these limits, large contexts can drive costs up and performance down, so practical limits can be much lower. MCP offers no guidance on how many tokens a tool call can produce, and the size of some responses can come as a surprise. In our analysis, we consider the 2,443 tool calls across 1,312 unique tools that the MCP Interviewer was able to call successfully during the active testing phase of server inspection. While a majority of tools produced 98 or fewer tokens, some tools are extraordinarily heavyweight: the top tool returned an average of 557,766 tokens, which is enough to swamp the context windows of many popular models like GPT-5. Further down the list, we find that 16 tools produce more than 128,000 tokens, swamping GPT-4o and other popular models. Even when responses fit into the context window length, overly long responses can significantly degrade performance (up to 91% in one study), and limit the number of future calls that can be made. Of course, agents are free to implement their own context management strategies, but this behavior is left undefined in the MCP specification and server developers cannot count on any particular client behavior or strategy.
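A back-of-the-envelope way to reason about these numbers is to count a tool response’s tokens against a model’s context window, as Figure 3 below does across the survey. The sketch assumes the tiktoken tokenizer package; the window sizes are the published limits cited in the figure.

```python
import tiktoken

CONTEXT_WINDOWS = {"gpt-4o": 128_000, "gpt-5": 400_000}  # tokens

def calls_until_overflow(tool_output: str, model: str = "gpt-4o") -> int:
    """How many identical tool calls fit before the responses alone exceed the window."""
    enc = tiktoken.get_encoding("o200k_base")  # tokenizer used by recent OpenAI models
    tokens_per_call = len(enc.encode(tool_output))
    return CONTEXT_WINDOWS[model] // max(tokens_per_call, 1)
```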

Number of tools whose responses would overflow the context window, by number of calls:

Model                Context window   1 call   2 calls   3-5 calls   6-10 calls
GPT 4.1              1,000,000        0        1         7           11
GPT 5                400,000          1        7         15          25
GPT-4o, Llama 3.1    128,000          16       15        33          40
Qwen 3               32,000           56       37        86          90
Phi-4                16,000           93       60        116         109

Figure 3: Tool call response length averages, in tokens, as observed by the MCP Interviewer’s functional test plan. Only successful tool calls are considered. Horizontal lines indicate context window limits for GPT-4o and GPT-5.

Tool parameter complexity

Mirroring the challenges from increasing the number of tools, increasing the complexity of a tool’s parameter space can also lead to degradation. For example, while MCP tools can take complex object types and structures as parameters, Composio found that flattening the parameter space could improve tool-calling performance by 47% compared to baseline performance. In our analysis, we find numerous examples of deeply nested structure—in one case, going 20 levels deep.
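One way to approximate the depth statistic reported in Figure 4 below is a simple recursive walk over a tool’s JSON input schema; the MCP Interviewer’s exact metric may differ.

```python
def input_properties_depth(input_schema: dict) -> int:
    """Rough nesting depth of a tool's input-properties schema (0 = no properties;
    larger values mean annotated and nested property structures)."""
    def depth(node) -> int:
        if isinstance(node, dict):
            return 1 + max((depth(v) for v in node.values()), default=0)
        if isinstance(node, list):
            return 1 + max((depth(v) for v in node), default=0)
        return 0
    props = (input_schema or {}).get("properties")
    return depth(props) if props else 0

# A flat, annotated parameter scores 2; deeply nested object parameters score higher.
print(input_properties_depth({"properties": {"query": {"type": "string"}}}))  # 2
```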

Figure 4: The maximum depth of each tool’s input properties schema. A depth of 0 indicates a tool with no properties. A depth of 1 indicates a tool with named properties but no annotations (e.g., no description or type). A depth of 2 indicates a tool with named and annotated properties. A depth of 3+ indicates a tool with structured properties that have additional nested annotations.

Namespacing issues and naming ambiguity

Another often-cited issue with the current MCP specification is the lack of a formal namespace mechanism. If two servers are registered to the same agent or application, and the servers have tool names in common, then disambiguation becomes impossible. Libraries like the OpenAI Agents SDK raise an error under this circumstance. Clients, like Claude Code, prefix tool names with unique identifiers to work around this issue. In our analysis of MCP servers, we found name collisions between 775 tools. The most common collision was “search”, which appears across 32 distinct MCP servers. The following table lists the top 10 collisions.

Tool name           Number of instances
search              32
get_user            11
execute_query       11
list_tables         10
update_task          9
generate_image       9
send_message         9
execute_command      8
list_tasks           8
search_files         8
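A minimal sketch of the client-side prefixing workaround mentioned above, which keeps colliding names such as “search” distinguishable. The separator and naming scheme here are arbitrary choices, not a standard.

```python
def merge_tool_catalogs(catalogs: dict[str, list[str]]) -> dict[str, tuple[str, str]]:
    """Prefix each tool with its server name so colliding names stay distinguishable.
    Returns prefixed_name -> (server, original_name)."""
    merged = {}
    for server, tool_names in catalogs.items():
        for name in tool_names:
            merged[f"{server}__{name}"] = (server, name)
    return merged

# Example: two servers that both expose "search" no longer collide.
print(merge_tool_catalogs({"brave": ["search"], "github": ["search", "get_user"]}))
# {'brave__search': ('brave', 'search'), 'github__search': ('github', 'search'),
#  'github__get_user': ('github', 'get_user')}
```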

Even when names are unique, they can be semantically similar. If these tools behave similarly, then the redundancy may not be immediately problematic, but if you are expecting to call a particular tool then the name similarities raise the potential for confusion. The following table lists some examples of semantically similar tool names relating to web search:

websearch, brave_web_search, search-web, tavily_web_search, web_search, google_news_search, search_web, google-play-search, search_webkr, google_search_parsed, google_search, search_google_images, search_google, get_webset_search_exa, ai_web_search, search_google_scholar, web_search_exa, duckduckgo_web_search, search_web_tool, google_search_scraper, web_search_agent, answer_query_websearch, batch-web-search

Errors and error messages

Like all software libraries, MCP will occasionally encounter error conditions. In these cases, it is important to provide sufficient information for the agent to handle the error and plan next steps. In our analysis, we found this was not always the case. While MCP provides an “isError” flag to signal errors, we found that it was common for servers to handle errors by returning strings while leaving this flag set to false, signaling a normal exit. Out of 5,983 tool call results with no error flag, GPT-4.1 judged that 3,536 indicated errors in their content. More worrisome: the error messages were often of low quality. For instance, one tool providing web search capabilities failed with the string “error: job,” while another tool providing academic search returned “Please retry with 0 or fewer IDs.”
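A client or analysis pipeline can at least flag suspicious cases with a crude heuristic like the sketch below. Our survey used GPT-4.1 as a judge, so this regex-based stand-in is only illustrative.

```python
import re

ERROR_PATTERN = re.compile(
    r"\b(error|exception|failed|traceback|unauthorized|retry)\b", re.IGNORECASE
)

def looks_like_error(result_text: str, is_error_flag: bool) -> bool:
    """Flag tool results that signal success (isError == False) but whose text
    reads like an error message."""
    return (not is_error_flag) and bool(ERROR_PATTERN.search(result_text))

print(looks_like_error("error: job", is_error_flag=False))                         # True
print(looks_like_error("Please retry with 0 or fewer IDs.", is_error_flag=False))  # True
```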

Resource sharing conventions

Finally, in addition to tools, MCP allows servers to share resources and resource templates with clients. In our survey, only 112 (7.6%) servers reported any resources, while 74 (5%) provided templates. One potential reason for low adoption is that the current MCP specification provides limited guidance for when resources are retrieved, or how they are incorporated into context. One clear-cut situation where a client might retrieve a resource is in response to a tool returning a resource_link as a result—but only 4 tools exhibited this behavior in our survey (arguably, this would be the ideal behavior for tools that return very long, document-like responses, as outlined earlier).

Conversely, a whole different set of issues arises when there is a need to share resources from the client to the server. Consider for example a tool that provides some analysis of a local PDF file. In the case of a local MCP server utilizing STDIO transport, a local file path can be provided as an argument to the tool, but no similar conventions exist for delivering a local file to a remote MCP server. These issues are challenging enough when implementing a single server. When multiple tools or servers need to interact within the same system, the risk of interoperability errors compounds.

Recommendations

On balance, along any given dimension, the average MCP server is quite reasonable—but, as we have seen, outliers and diverging assumptions can introduce trouble. While we expect many of these challenges to improve with time, we are comfortable making small recommendations that we feel are evergreen. We organize them below by audience.

Protocol developers

We recognize the advantages of keeping MCP relatively lightweight, avoiding being overly prescriptive in an environment where AI models and use cases are rapidly changing. However, a few small recommendations are warranted. First, we believe MCP should be extended to include a specification for client-provided resources so that tools on remote servers have a mechanism for operating on specified local files or documents. This would more effectively position MCP as a clearinghouse for resources passed between steps of agentic workflows. The MCP specification would also benefit from taking a more opinionated stance on when resources are retrieved and used overall.

Likewise, we believe MCP should quickly move to provide formal namespaces to eliminate tool name collisions. If namespaces are hierarchical, then this also provides a way of organizing large catalogs of functions into thematically related tool sets. Tool sets, as an organizing principle, are already showing some promise in GitHub MCP Server’s dynamic tool discovery and VS Code’s tool grouping (with virtual tools), where agents or users can enable and disable tools as needed. In the future, a standardized mechanism for grouping tools would allow clients to engage in hierarchical tool-calling, where they first select a category, then select a tool, without needing to keep all possible tools in context.
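The hierarchical idea can be sketched in a few lines: the client keeps a category-to-tools map and only surfaces the tools of the selected categories to the model. The catalog below is hypothetical.

```python
TOOL_SETS = {  # hypothetical hierarchical catalog: namespace -> tools
    "github.issues": ["list_issues", "create_issue", "add_comment"],
    "github.pulls": ["list_pull_requests", "create_pull_request"],
    "web.search": ["search", "fetch_page"],
}

def tools_in_context(selected_sets: list[str]) -> list[str]:
    """Surface only the tools of the selected categories to the model,
    keeping the tool list (and its token cost) small."""
    return [f"{ns}.{tool}" for ns in selected_sets for tool in TOOL_SETS[ns]]

print(tools_in_context(["github.issues"]))
# ['github.issues.list_issues', 'github.issues.create_issue', 'github.issues.add_comment']
```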

Server developers

While our MCP Interviewer tool can catalog many outward-facing properties of MCP servers, developers are often in a much better position to characterize the nature of their tools. To this end, we believe developers should publish an MCP Server card alongside their servers or services, clearly outlining the runtime characteristics of the tools (e.g., the expected number of tokens generated, or the expected latency of a tool call). Ideally, developers should also indicate which models, agents, and clients the server was tested with, how the tools were tested (e.g., provide sample tasks), list any known incompatibilities, and be mindful of the limitations of various models throughout development.

Client developers

Client developers have the opportunity to experiment with various mitigations or optimizations that might help the average MCP server work better for a given system or environment. For example, clients could cache tool schemas, serving them as targets for prompt optimizations, or as an index for RAG-like tool selection approaches. To this end, Anthropic recently reported using a tool testing agent to rewrite the prompts of defective MCP servers, improving task completion time by 40%. Likewise, rather than waiting for the protocol to evolve, clients could take proactive steps to resolve name collisions—for example, by generating namespaces from server names—and could reduce token outputs by summarizing or paginating long tool results.
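One such client-side mitigation, paginating oversized tool results, might look like the following sketch (assuming the tiktoken package; the page size is arbitrary):

```python
import tiktoken

def paginate_tool_result(text: str, max_tokens: int = 4_000) -> list[str]:
    """Split an oversized tool response into pages the client can feed to the model
    on demand, instead of injecting hundreds of thousands of tokens at once."""
    enc = tiktoken.get_encoding("o200k_base")
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]
```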

Market developers

Finally, we see an opportunity for marketplaces to codify best practices, spot compatibility issues at a global level, and perhaps centralize the generation and serving of model- or agent-specific optimizations. Mirroring how a market like PyPI distributes Python wheels matched to a developer’s operating system or processor, an MCP marketplace could serve tool schemas optimized for a developer’s chosen LLM, agent, or client library. We are already seeing small steps in this direction, with registries like Smithery providing customized launch configurations to match users’ clients.

Conclusion

In summary, the MCP ecosystem offers significant value for AI agent development, despite some early growing pains. Grounded in insights from the MCP Interviewer and our survey of live servers, the evidence is clear: horizontal integration is expanding capability, yet it also exposes forms of tool-space interference that can erode end-to-end effectiveness. Anticipating rapid advances in model capability and growing architectural diversity, the recommendations provided here aim to ensure that protocol, server, client, and marketplace developers are well positioned to adapt and thrive. Key steps include implementing formal namespaces to eliminate collisions, enhancing protocol support for client-provided resources, and encouraging transparent server documentation to foster interoperability and robust development practices across the ecosystem.

By embracing these evergreen recommendations and proactively addressing compatibility, usability, and optimization issues, the AI agent community can create a more reliable, scalable, and efficient infrastructure that benefits both developers and end users. The future of MCP is bright, with ample opportunities for experimentation, standardization, and collective progress.


RenderFormer: How neural networks are reshaping 3D rendering

Wed, 09/10/2025 - 18:00

3D rendering—the process of converting three-dimensional models into two-dimensional images—is a foundational technology in computer graphics, widely used across gaming, film, virtual reality, and architectural visualization. Traditionally, this process has depended on physics-based techniques like ray tracing and rasterization, which simulate light behavior through mathematical formulas and expert-designed models.

Now, thanks to advances in AI, especially neural networks, researchers are beginning to replace these conventional approaches with machine learning (ML). This shift is giving rise to a new field known as neural rendering.

Neural rendering combines deep learning with traditional graphics techniques, allowing models to simulate complex light transport without explicitly modeling physical optics. This approach offers significant advantages: it eliminates the need for handcrafted rules, supports end-to-end training, and can be optimized for specific tasks. Yet, most current neural rendering methods rely on 2D image inputs, lack support for raw 3D geometry and material data, and often require retraining for each new scene—limiting their generalizability.

RenderFormer: Toward a general-purpose neural rendering model

To overcome these limitations, researchers at Microsoft Research have developed RenderFormer, a new neural architecture designed to support full-featured 3D rendering using only ML—no traditional graphics computation required. RenderFormer is the first model to demonstrate that a neural network can learn a complete graphics rendering pipeline, including support for arbitrary 3D scenes and global illumination, without relying on ray tracing or rasterization. This work has been accepted at SIGGRAPH 2025 and is open-sourced on GitHub.

Architecture overview

As shown in Figure 1, RenderFormer represents the entire 3D scene using triangle tokens—each one encoding spatial position, surface normal, and physical material properties such as diffuse color, specular color, and roughness. Lighting is also modeled as triangle tokens, with emission values indicating intensity.

Figure 1. Architecture of RenderFormer

To describe the viewing direction, the model uses ray bundle tokens derived from a ray map—each pixel in the output image corresponds to one of these rays. To improve computational efficiency, pixels are grouped into rectangular blocks, with all rays in a block processed together.

The model outputs a set of tokens that are decoded into image pixels, completing the rendering process entirely within the neural network.
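The tokenization described above can be pictured with a rough NumPy sketch. The feature layout, block size, and shapes are assumptions for illustration only and differ from the released RenderFormer code.

```python
import numpy as np

def triangle_tokens(vertices, normals, diffuse, specular, roughness, emission):
    """Pack per-triangle geometry and material attributes into one token per triangle.
    Assumed shapes: vertices (T,3,3), normals (T,3), colors (T,3), roughness (T,1),
    emission (T,3). The real model's embedding differs; this only illustrates the idea."""
    return np.concatenate([
        vertices.reshape(len(vertices), -1),  # 9 values: the 3 vertex positions
        normals, diffuse, specular, roughness, emission,
    ], axis=1)

def ray_bundle_tokens(ray_map, block=8):
    """Group the per-pixel ray map (H, W, 6: origin + direction) into rectangular
    blocks so all rays in a block are processed together."""
    H, W, C = ray_map.shape
    blocks = ray_map.reshape(H // block, block, W // block, block, C)
    return blocks.transpose(0, 2, 1, 3, 4).reshape(-1, block * block * C)
```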

Dual-branch design for view-independent and view-dependent effects

The RenderFormer architecture is built around two transformers: one for view-independent features and another for view-dependent ones.

  • The view-independent transformer captures scene information unrelated to viewpoint, such as shadowing and diffuse light transport, using self-attention between triangle tokens.
  • The view-dependent transformer models effects like visibility, reflections, and specular highlights through cross-attention between triangle and ray bundle tokens.

Additional image-space effects, such as anti-aliasing and screen-space reflections, are handled via self-attention among ray bundle tokens.
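A schematic of this two-branch attention flow, written as a toy PyTorch module. The dimensions, layer counts, and decoding step are placeholders; the open-source release is the authoritative reference.

```python
import torch.nn as nn

class DualBranchSketch(nn.Module):
    """Toy illustration of RenderFormer's attention structure, not the real model."""
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.tri_self = nn.MultiheadAttention(d, heads, batch_first=True)   # view-independent
        self.ray_cross = nn.MultiheadAttention(d, heads, batch_first=True)  # view-dependent
        self.ray_self = nn.MultiheadAttention(d, heads, batch_first=True)   # image-space effects
        self.decode = nn.Linear(d, 3)  # token -> RGB (greatly simplified)

    def forward(self, tri_tokens, ray_tokens):
        tri, _ = self.tri_self(tri_tokens, tri_tokens, tri_tokens)  # shadows, diffuse transport
        ray, _ = self.ray_cross(ray_tokens, tri, tri)               # visibility, reflections, highlights
        ray, _ = self.ray_self(ray, ray, ray)                       # anti-aliasing, screen-space effects
        return self.decode(ray)
```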

To validate the architecture, the team conducted ablation studies and visual analyses, confirming the importance of each component in the rendering pipeline.

Table 1. Ablation study analyzing the impact of different components and attention mechanisms on the final performance of the trained network.

To test the capabilities of the view-independent transformer, researchers trained a decoder to produce diffuse-only renderings. The results, shown in Figure 2, demonstrate that the model can accurately simulate shadows and other indirect lighting effects.

Figure 2. View-independent rendering effects decoded directly from the view-independent transformer, including diffuse lighting and coarse shadow effects.

The view-dependent transformer was evaluated through attention visualizations. For example, in Figure 3, the attention map reveals a pixel on a teapot attending to its surface triangle and to a nearby wall—capturing the effect of specular reflection. These visualizations also show how material changes influence the sharpness and intensity of reflections.

Figure 3. Visualization of attention outputs

Training methodology and dataset design

RenderFormer was trained using the Objaverse dataset, a collection of more than 800,000 annotated 3D objects that is designed to advance research in 3D modeling, computer vision, and related fields. The researchers designed four scene templates, populating each with 1–3 randomly selected objects and materials. Scenes were rendered in high dynamic range (HDR) using Blender’s Cycles renderer, under varied lighting conditions and camera angles.

The base model, consisting of 205 million parameters, was trained in two phases using the AdamW optimizer:

  • 500,000 steps at 256×256 resolution with up to 1,536 triangles
  • 100,000 steps at 512×512 resolution with up to 4,096 triangles

The model supports arbitrary triangle-based input and generalizes well to complex real-world scenes. As shown in Figure 4, it accurately reproduces shadows, diffuse shading, and specular highlights.

Figure 4. Rendered results of different 3D scenes generated by RenderFormer

RenderFormer can also generate continuous video by rendering individual frames, thanks to its ability to model viewpoint changes and scene dynamics.

3D animation sequence rendered by RenderFormer

Looking ahead: Opportunities and challenges

RenderFormer represents a significant step forward for neural rendering. It demonstrates that deep learning can replicate and potentially replace the traditional rendering pipeline, supporting arbitrary 3D inputs and realistic global illumination—all without any hand-coded graphics computations.

However, key challenges remain. Scaling to larger and more complex scenes with intricate geometry, advanced materials, and diverse lighting conditions will require further research. Still, the transformer-based architecture provides a solid foundation for future integration with broader AI systems, including video generation, image synthesis, robotics, and embodied AI. 

Researchers hope that RenderFormer will serve as a building block for future breakthroughs in both graphics and AI, opening new possibilities for visual computing and intelligent environments.


Breaking the networking wall in AI infrastructure 

Tue, 09/09/2025 - 16:00

Memory and network bottlenecks are increasingly limiting AI system performance by reducing GPU utilization and overall efficiency, ultimately preventing infrastructure from reaching its full potential despite enormous investments. At the core of this challenge is a fundamental trade-off in the communication technologies used for memory and network interconnects.

Datacenters typically deploy two types of physical cables for communication between GPUs. Traditional copper links are power-efficient and reliable, but limited to very short distances (< 2 meters) that restrict their use to within a single GPU rack. Optical fiber links can reach tens of meters, but they consume far more power and fail up to 100 times as often as copper. A team working across Microsoft aims to resolve this trade-off by developing MOSAIC, a novel optical link technology that can provide low power and cost, high reliability, and long reach (up to 50 meters) simultaneously. This approach leverages a hardware-system co-design and adopts a wide-and-slow design with hundreds of parallel low-speed channels using microLEDs. 

The fundamental trade-off among power, reliability, and reach stems from the narrow-and-fast architecture deployed in today’s copper and optical links, comprising a few channels operating at very high data rates. For example, an 800 Gbps link consists of eight 100 Gbps channels. With copper links, higher channel speeds lead to greater signal integrity challenges, which limits their reach. With optical links, high-speed transmission is inherently inefficient, requiring power-hungry laser drivers and complex electronics to compensate for transmission impairments. These challenges grow as speeds increase with every generation of networks. Transmitting at high speeds also pushes the limits of optical components, reducing systems margins and increasing failure rates. 

These limitations force system designers to make unpleasant choices, limiting the scalability of AI infrastructure. For example, scale-up networks connecting AI accelerators at multi-Tbps bandwidth typically must rely on copper links to meet the power budget, requiring ultra-dense racks that consume hundreds of kilowatts per rack. This creates significant challenges in cooling and mechanical design, which constrain the practical scale of these networks and end-to-end performance. This imbalance ultimately erects a networking wall akin to the memory wall, in which CPU speeds have outstripped memory speeds, creating performance bottlenecks.

A technology offering copper-like power efficiency and reliability over long distances can overcome this networking wall, enabling multi-rack scale-up domains and unlocking new architectures. This is a highly active R&D area, with many candidate technologies currently being developed across the industry. In our recent paper, MOSAIC: Breaking the Optics versus Copper Trade-off with a Wide-and-Slow Architecture and MicroLEDs, which received the Best Paper award at ACM SIGCOMM, we present one such promising approach that is the result of a multi-year collaboration between Microsoft Research, Azure, and M365. This work is centered around an optical wide-and-slow architecture, shifting from a small number of high-speed serial channels towards hundreds of parallel low-speed channels. This would be impractical to realize with today’s copper and optical technologies because of i) electromagnetic interference challenges in high-density copper cables and ii) the high cost and power consumption of lasers in optical links, as well as the increase in packaging complexity. MOSAIC overcomes these issues by leveraging directly modulated microLEDs, a technology originally developed for screen displays.

MicroLEDs are significantly smaller than traditional LEDs (ranging from a few to tens of microns) and, due to their small size, they can be modulated at several Gbps. They are manufactured in large arrays, with over half a million in a small physical footprint for high-resolution displays like head-mounted devices or smartwatches. For example, assuming 2 Gbps per microLED channel, an 800 Gbps MOSAIC link can be realized by using a 20×20 microLED array, which can fit in less than 1 mm×1 mm silicon die. 
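The sizing arithmetic in that example is simple enough to write down directly; the per-channel rate is the stated assumption.

```python
import math

def microled_array_side(link_gbps: float, per_channel_gbps: float = 2.0) -> int:
    """Side length of a square microLED array needed to reach the target link rate,
    e.g., 800 Gbps at 2 Gbps per channel -> 400 channels -> a 20 x 20 array."""
    channels = math.ceil(link_gbps / per_channel_gbps)
    return math.ceil(math.sqrt(channels))

print(microled_array_side(800))        # 20
print(microled_array_side(1600, 4.0))  # 20 (raising per-channel speed keeps the array small)
```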

MOSAIC’s wide-and-slow design provides four core benefits.

  • Operating at low speed improves power efficiency by eliminating the need for complex electronics and reducing optical power requirements.
  • By leveraging optical transmission (via microLEDs), MOSAIC sidesteps copper’s reach issues, supporting distances up to 50 meters, or > 10x further than copper.
  • MicroLEDs’ simpler structure and temperature insensitivity make them more reliable than lasers. The parallel nature of wide-and-slow also makes it easy to add redundant channels, further increasing reliability, up to two orders of magnitude higher than optical links. 
  • The approach is also scalable, as higher aggregate speeds (e.g., 1.6 Tbps or 3.2 Tbps) can be achieved by increasing the number of channels and/or raising per-channel speed (e.g., to 4-8 Gbps). 

Further, MOSAIC is fully compatible with today’s pluggable transceivers’ form factor and provides a drop-in replacement for today’s copper and optical cables, without requiring any changes to existing server and network infrastructure. MOSAIC is protocol-agnostic: it simply relays bits from one endpoint to another without terminating or inspecting the connection and, hence, is fully compatible with today’s protocols (e.g., Ethernet, PCIe, CXL). We are currently working with our suppliers to productize this technology and scale to mass production.

While conceptually simple, realizing this architecture posed a few key challenges across the stack, which required a multi-disciplinary team with expertise spanning integrated photonics, lens design, optical transmission, and analog and digital design. For example, using individual fibers per channel would be prohibitively complex and costly due to the large number of channels. We addressed this by employing imaging fibers, which are typically used for medical applications (e.g., endoscopy). They can support thousands of cores per fiber, enabling multiplexing of many channels within a single fiber. Also, microLEDs are a less pure light source than lasers, with a larger beam shape (which complicates fiber coupling) and a broader spectrum (which degrades fiber transmission due to chromatic dispersion). We tackled these issues through a novel microLED and optical lens design, and a power-efficient analog-only electronic back end, which does not require any expensive digital signal processing.

Based on our current estimates, this approach can save up to 68% of power, i.e., more than 10 W per cable, while reducing failure rates by up to 100x. With global annual shipments of optical cables reaching into the tens of millions, this translates to over 100 MW of power savings per year, enough to power more than 300,000 homes. While these immediate gains are already significant, the unique combination of low power consumption, reduced cost, high reliability, and long reach opens up exciting new opportunities to rethink AI infrastructure, from network and cluster architectures to compute and memory designs.

For example, by supporting low-power, high-bandwidth connectivity at long reach, MOSAIC removes the need for ultra-dense racks and enables novel network topologies, which would be impractical today. The resulting redesign could reduce resource fragmentation and simplify collective optimization. Similarly, on the compute front, the ability to connect silicon dies at low power over long distances could enable resource disaggregation, shifting from today’s large, multi-die packages to smaller, more cost-effective ones. Bypassing packaging area constraints would also make it possible to drastically increase GPU memory capacity and bandwidth, while facilitating adoption of novel memory technologies.

Historically, step changes in network technology have unlocked entirely new classes of applications and workloads. While our SIGCOMM paper provides possible future directions, we hope this work sparks broader discussion and collaboration across the research and industry communities.


The post Breaking the networking wall in AI infrastructure  appeared first on Microsoft Research.

Categories: Microsoft

Crescent library brings privacy to digital identity systems

Tue, 08/26/2025 - 18:00

Digital identities, the electronic credentials embedded in phone wallets, workplace logins, and other apps, are becoming ubiquitous. While they offer unprecedented convenience, they also create new privacy risks, particularly around tracking and surveillance. 

One of these risks is linkability, the ability to associate one or more uses of a credential to a specific person. Currently, when people use their mobile driver’s license or log into various apps, hidden identifiers can link these separate activities together, building detailed profiles of user behavior.  

To address this, we have released Crescent (opens in new tab), a cryptographic library that adds unlinkability to widely used identity formats, protecting privacy. These include JSON Web Tokens (the authentication standard behind many app logins) and mobile driver’s licenses. Crescent also works without requiring the organizations that issue these credentials to update their systems.  

The protection goes beyond existing privacy features. Some digital identity systems already offer selective disclosure, allowing users to share only specific pieces of information in each interaction.  

But even with selective disclosure, credentials can still be linked through serial numbers, cryptographic signatures, or embedded identifiers. Crescent’s unlinkability feature is designed to prevent anything in the credential, beyond what a user explicitly chooses to reveal, from being used to connect their separate digital interactions.

Figure 1: Unlinkability between a credential issuance and presentation

Two paths to unlinkability 

To understand how Crescent works, it helps to examine the two main approaches researchers have developed for adding unlinkability to identity systems: 

  1. Specialized cryptographic signature schemes. These schemes can provide unlinkability but require extensive changes to existing infrastructure. New algorithms must be standardized, implemented, and integrated into software and hardware platforms. For example, the BBS (opens in new tab) signature scheme is currently being standardized by the Internet Engineering Task Force (IETF), but even after completion, adoption may be slow.   
  2. Zero-knowledge proofs with existing credentials. This approach, used by Crescent (opens in new tab), allows users to prove specific facts about their credentials without revealing the underlying data that could enable tracking. For example, someone could prove they hold a valid driver’s license and live in a particular ZIP code without exposing any other personal information or identifiers that could link this interaction to future ones. 

Zero-knowledge proofs have become more practical since they were first developed 40 years ago, but they are not as efficient as the cryptographic algorithms used in today’s credentials. Crescent addresses this computational challenge through preprocessing, performing the most complex calculations once in advance so that later proof generation is quick and efficient for mobile devices. 

Beyond unlinkability, Crescent supports selective disclosure, allowing users to prove specific facts without revealing unnecessary details. For example, it can confirm that a credential is valid and unexpired without disclosing the exact expiration date, which might otherwise serve as a unique identifier. These privacy protections work even when credentials are stored in a phone’s secure hardware, which keeps them tied to the device and prevents unauthorized access.

Behind the cryptographic curtain 

At its core, Crescent uses a sophisticated form of cryptographic proof called a zero-knowledge SNARK (Zero-Knowledge Succinct Noninteractive Argument of Knowledge). This method allows one party to prove possession of information or credentials without revealing the underlying data itself. 

Crescent specifically uses the Groth16 proof system, one of the first practical implementations of this technology. What makes Groth16 particularly useful is that its proofs are small in size, quick to verify, and can be shared in a single step without back-and-forth communication between the user and verifier. 

The system works by first establishing shared cryptographic parameters based on a credential template. Multiple organizations issuing similar credentials, such as different state motor vehicle departments issuing mobile driver’s licenses, can use the same parameters as long as they follow compatible data formats and security standards. 

The mathematical rules that define what each proof will verify are written using specialized programming tools that convert them into a Rank-1 Constraint System (R1CS), a mathematical framework that describes exactly what needs to be proven about a credential. 
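To make the R1CS idea concrete, here is a tiny, generic example, not Crescent’s actual circuit: each constraint checks that (A·s)·(B·s) = C·s for a witness vector s (real systems work over a finite field rather than the plain integers used here).

```python
# Minimal generic R1CS example: prove knowledge of x such that x * x = y,
# with witness vector s = [1, x, y]. Each constraint i requires
# (A_i . s) * (B_i . s) == (C_i . s). Real systems do this over a finite field.

def dot(row, s):
    return sum(a * b for a, b in zip(row, s))

A = [[0, 1, 0]]  # selects x
B = [[0, 1, 0]]  # selects x
C = [[0, 0, 1]]  # selects y

def satisfies(witness):
    return all(dot(a, witness) * dot(b, witness) == dot(c, witness)
               for a, b, c in zip(A, B, C))

print(satisfies([1, 3, 9]))   # True: 3 * 3 == 9
print(satisfies([1, 3, 10]))  # False
```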

To make the system fast enough for real-world use, Crescent splits the proof generation into two distinct stages: 

  1. Prepare stage. This step runs once and generates cryptographic values that can be stored on the user’s device for repeated use. 
  2. Show stage. When a user needs to present their credential, this quicker step takes the stored values and randomizes them to prevent any connection to previous presentations. It also creates a compact cryptographic summary that reveals only the specific information needed for that particular interaction (a conceptual sketch of this two-stage flow follows below). 
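Here is that sketch in Python. The helpers are trivial placeholders that only mimic the caching and re-randomization structure; they are hypothetical and do not reflect Crescent’s actual Rust API or real cryptography.

```python
# Structural sketch of the prepare/show split. All helpers are hypothetical
# stand-ins, not Crescent's API.
import hashlib
import os

def _precompute(credential: bytes, params: bytes) -> bytes:
    # Stand-in for the expensive one-time proof precomputation (prepare stage).
    return hashlib.sha256(credential + params).digest()

def _rerandomize(prepared: bytes) -> bytes:
    # Stand-in for blinding the cached values so presentations stay unlinkable.
    return hashlib.sha256(prepared + os.urandom(16)).digest()

class WalletSketch:
    def __init__(self, credential: bytes, params: bytes):
        self.credential, self.params = credential, params
        self._prepared = None

    def prepare(self) -> None:
        """Run once per credential; the result is cached on the device."""
        self._prepared = _precompute(self.credential, self.params)

    def show(self, disclosed_claims: dict) -> tuple:
        """Run per presentation: quick, and freshly randomized each time."""
        if self._prepared is None:
            self.prepare()
        return _rerandomize(self._prepared), disclosed_claims
```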

Figures 2 and 3 illustrate this credential-proving workflow and the division between the prepare and show steps.

Figure 2: Crescent’s credential-proving workflow includes a compilation of a circuit to R1CS, followed by the prepare and show steps. The output zero-knowledge proof is sent to the verifier.

Figure 3: The Crescent presentation steps show the division between prepare and show steps.

A sample application 

To demonstrate how Crescent works, we created a sample application covering two real-world scenarios: verifying employment and proving age for online access. The application includes sample code for setting up fictional issuers and verifiers as Rust servers, along with a browser-extension wallet for the user. The step numbers correspond to the steps in Figure 4. 

Setup 
  1. A Crescent service pre-generates the zero-knowledge parameters for creating and verifying proofs from JSON Web Tokens and mobile driver’s licenses. 
  2. The user obtains a mobile driver’s license from their Department of Motor Vehicles. 
  3. The user obtains a proof-of-employment JSON Web Token from their employer, Contoso. 
  4. These credentials and their private keys are stored in the Crescent wallet. 
Scenarios 
  5. Employment verification: The user presents their JSON Web Token to Fabrikam, an online health clinic, to prove they are employed at Contoso and eligible for workplace benefits. Fabrikam learns that the user works at Contoso but not the user’s identity, while Contoso remains unaware of the interaction. 
  6. Age verification: The user presents their mobile driver’s license to a social network, proving they are over 18. The proof confirms eligibility without revealing their age or identity. 

Across both scenarios, Crescent ensures that credential presentations remain unlinkable, preventing any party from connecting them to the user. 

For simplicity, the sample defines its own issuance and presentation protocol, but it could be integrated into higher-level identity frameworks such as OpenID/OAuth, Verifiable Credentials, or the mobile driver’s license ecosystem.

Figure 4. The sample architecture, from credential issuance to presentation.

To learn more about the project, visit the Crescent project GitHub (opens in new tab) page, or check out our recent presentations given at the Real World Crypto 2025 (opens in new tab) and North Sec 2025 (opens in new tab) conferences. 


The post Crescent library brings privacy to digital identity systems appeared first on Microsoft Research.

Categories: Microsoft

Applicability vs. job displacement: further notes on our recent research on AI and occupations

Thu, 08/21/2025 - 19:00

Recently, we released a paper (Working with AI: Measuring the Occupational Implications of Generative AI) that studied what occupations might find AI chatbots useful, and to what degree. The paper sparked significant discussion, which is no surprise since people care deeply about the future of AI and jobs–that’s part of why we think it’s important to study these topics.

Unfortunately, not all the discussion was accurate in its portrayal of the study’s scope or conclusions. Specifically, our study does not draw any conclusions about jobs being eliminated; in the paper, we explicitly cautioned against using our findings to make that conclusion. 

Given the importance of this topic, we want to clarify any misunderstandings and provide a more digestible summary of the paper, our methodology, and its limitations. 

What did our research find?

We set out to better understand how people are using AI, highlighting where AI might be useful in different occupations. To do this, we analyzed how people currently use generative AI—specifically Microsoft Bing Copilot (now Microsoft Copilot)—to assist with tasks. We then compared these sets of tasks against the O*NET database (opens in new tab), a widely used occupational classification system, to understand potential applicability to various occupations.

We found that AI is most useful for tasks related to knowledge work and communication, particularly tasks such as writing, gathering information, and learning.

Those in occupations with these tasks may benefit by considering how AI can be used as a tool to help improve their workflows. On the flip side, it’s not surprising that physical tasks like performing surgeries or moving objects had less direct AI chatbot applicability.

So, to summarize, our paper is about identifying the occupations where AI may be most useful, by assisting or performing subtasks.  Our data do not indicate, nor did we suggest, that certain jobs will be replaced by AI.

Methodological limitations are acknowledged—and important

The paper is transparent about the limitations of our approach.  

We analyzed anonymized Bing Copilot conversations to see what activities users are seeking AI assistance with and what activities AI can perform when mapped to the O*NET database. While O*NET provides a structured list of activities associated with various occupations, it does not capture the full spectrum of skills, context, and nuance required in the real world.  A job is far more than the collection of tasks that make it up.

For example, a task might involve “writing reports,” but O*NET won’t reflect the interpersonal judgment, domain expertise, or ethical considerations that go into doing that well. The paper acknowledges this gap and warns against over-interpreting the AI applicability scores as measures of AI’s ability to perform an occupation.

Additionally, the dataset is based on user queries from Bing Copilot (from January – September 2024), which may be influenced by factors like awareness, access, or comfort with AI tools. Different people use different LLMs for different purposes, and it is also very difficult (often nearly impossible) to determine whether a conversation took place in a work context or for leisure. 

Finally, we only evaluated AI chatbot usage, so this study does not evaluate the impact or applicability of other forms of AI.

Where do we go from here?

Given the intense interest in how AI will shape our collective future, it’s important we continue to study and better understand its societal and economic impact. As with all research on this topic, the findings are nuanced, and it’s important to pay attention to this nuance. 

The public interest in our research is based, in large part, on the topic of AI and job displacement. However, our current methodology for this study is unlikely to lead to firm conclusions about this.  AI may prove to be a useful tool for many occupations, and we believe the right balance lies in finding how to use the technology in a way that leverages its abilities while complementing human strengths and accounting for people’s preferences.    

For more information from Microsoft on the future of work and AI skilling, check out Microsoft’s Annual Work Trend Index (opens in new tab) and Microsoft Elevate (opens in new tab).


The post Applicability vs. job displacement: further notes on our recent research on AI and occupations appeared first on Microsoft Research.

Categories: Microsoft

MindJourney enables AI to explore simulated 3D worlds to improve spatial interpretation

Wed, 08/20/2025 - 18:00

A new research framework helps AI agents explore three-dimensional spaces they can’t directly detect. Called MindJourney, the approach addresses a key limitation in vision-language models (VLMs), which give AI agents their ability to interpret and describe visual scenes.  

While VLMs are strong at identifying objects in static images, they struggle to interpret the interactive 3D world behind 2D images. This gap shows up in spatial questions like “If I sit on the couch that is on my right and face the chairs, will the kitchen be to my right or left?”—tasks that require an agent to interpret its position and movement through space. 

People overcome this challenge by mentally exploring a space, imagining moving through it and combining those mental snapshots to work out where objects are. MindJourney applies the same process to AI agents, letting them roam a virtual space before answering spatial questions. 

How MindJourney navigates 3D space

To perform this type of spatial navigation, MindJourney uses a world model—in this case, a video generation system trained on a large collection of videos captured from a single moving viewpoint, showing actions such as going forward and turning left or right, much like a 3D cinematographer. From this, it learns to predict how a new scene would appear from different perspectives.

At inference time, the model can generate photo-realistic images of a scene based on possible movements from the agent’s current position. It generates multiple possible views of a scene while the VLM acts as a filter, selecting the constructed perspectives that are most likely to answer the user’s question.

These are kept and expanded in the next iteration, while less promising paths are discarded. This process, shown in Figure 1, avoids the need to generate and evaluate thousands of possible movement sequences by focusing only on the most informative perspectives.

Figure 1. Given a spatial reasoning query, MindJourney searches through the imagined 3D space using a world model and improves the VLM’s spatial interpretation through generated observations when encountering new challenges. 

 

To make its search through a simulated space both effective and efficient, MindJourney uses a spatial beam search—an algorithm that prioritizes the most promising paths. It works within a fixed number of steps, each representing a movement. By balancing breadth with depth, spatial beam search enables MindJourney to gather strong supporting evidence. This process is illustrated in Figure 2.

Figure 2. The MindJourney workflow starts with a spatial beam search for a set number of steps before answering the query. The world model interactively generates new observations, while a VLM interprets the generated images, guiding the search throughout the process.

By iterating through simulation, evaluation, and integration, MindJourney can reason about spatial relationships far beyond what any single 2D image can convey, all without the need for additional training. On the Spatial Aptitude Training (SAT) benchmark, it improved the accuracy of VLMs by 8% over their baseline performance.
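The spatial beam search described above can be sketched in a few lines. This is an illustrative outline only: `world_model.imagine`, `vlm.score`, and `vlm.answer` are hypothetical stand-ins for the video-generation world model and the VLM filter, not MindJourney’s actual interfaces.

```python
# Illustrative sketch of a spatial beam search over imagined viewpoints.
ACTIONS = ["forward", "turn_left", "turn_right"]

def spatial_beam_search(world_model, vlm, question, initial_view,
                        beam_width=3, max_steps=4):
    # Each beam entry: (imagined view, action history, usefulness score).
    beam = [(initial_view, [], vlm.score(question, initial_view))]
    for _ in range(max_steps):
        candidates = []
        for view, history, _ in beam:
            for action in ACTIONS:
                new_view = world_model.imagine(view, action)  # generate the new perspective
                score = vlm.score(question, new_view)         # how informative is it?
                candidates.append((new_view, history + [action], score))
        # Keep only the most promising imagined viewpoints for the next step.
        beam = sorted(candidates, key=lambda c: c[2], reverse=True)[:beam_width]
    best_views = [view for view, _, _ in beam]
    return vlm.answer(question, [initial_view] + best_views)
```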

Building smarter agents  

MindJourney showed strong performance on multiple 3D spatial-reasoning benchmarks, and even advanced VLMs improved when paired with its imagination loop. This suggests that the spatial patterns that world models learn from raw images, combined with the symbolic capabilities of VLMs, create a more complete spatial capability for agents. Together, they enable agents to infer what lies beyond the visible frame and interpret the physical world more accurately. 

It also demonstrates that pretrained VLMs and trainable world models can work together in 3D without retraining either one—pointing toward general-purpose agents capable of interpreting and acting in real-world environments. This opens the way to possible applications in autonomous robotics, smart home technologies, and accessibility tools for people with visual impairments. 

By converting systems that simply describe static images into active agents that continually evaluate where to look next, MindJourney connects computer vision with planning. Because exploration occurs entirely within the model’s latent space—its internal representation of the scene—robots would be able to test multiple viewpoints before determining their next move, potentially reducing wear, energy use, and collision risk. 

Looking ahead, we plan to extend the framework to use world models that not only predict new viewpoints but also forecast how the scene might change over time. We envision MindJourney working alongside VLMs that interpret those predictions and use them to plan what to do next. This enhancement could enable agents to more accurately interpret spatial relationships and physical dynamics, helping them to operate effectively in changing environments.


The post MindJourney enables AI to explore simulated 3D worlds to improve spatial interpretation appeared first on Microsoft Research.

Categories: Microsoft

Dion: the distributed orthonormal update revolution is here

Tue, 08/12/2025 - 22:09

Training AI models requires choosing an optimizer, and for nearly a decade, AdamW (opens in new tab) has been the optimizer of choice. Given that durability and success, it was fair to doubt that any further improvement was possible. And yet, last December, a new optimizer called Muon (opens in new tab) showed serious promise by powering a nanoGPT speedrun (opens in new tab). This proved out, with multiple AI labs (e.g., Kimi-AI (opens in new tab) and Essential-AI (opens in new tab)) reporting 2x scale improvements and the release of the 1T parameter Kimi K2 (opens in new tab) model. Restated: you can train a model to similar performance with half as many GPUs.

There’s one fly in the ointment: Muon requires large matrix multiplications in the optimizer, which require heavy communication in large models at the scale where FSDP (opens in new tab) and TP (opens in new tab) parallelization becomes desirable. Going back to the inspiration for Muon (opens in new tab), the key idea is an orthonormal update, which sparked the search for more scalable alternative linear algebras realizing the same goal. That’s exactly what Dion is. We have open-sourced this new optimizer to enable anyone to train large models more efficiently at scale.  

What’s an orthonormal update?

Figure 1. Illustration of matrix parameters

At the core of Transformers, a set of input activations is multiplied by a learned weight matrix to produce a new set of output activations. When the weight matrix is updated during training, the resulting change in the output activations generally depends on the direction of the input activations. As a result, the learning rate must be chosen conservatively to accommodate the input direction that induces the largest change. Orthonormalized updates alter this behavior by (approximately) making the change in output activations invariant to the direction of the input. This is achieved by enforcing orthonormality (opens in new tab) on the update matrix, thereby equalizing its effect across all input directions.
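As a schematic statement of the idea (our notation, and ignoring scaling factors): if the update matrix M has singular value decomposition

\[ M = U \Sigma V^\top, \]

then the orthonormalized update applies

\[ \Delta W \propto -\, U V^\top \]

instead of −M, discarding the singular values so that every input direction sees a change of comparable magnitude. Muon approximates U Vᵀ with Newton–Schulz iterations; as described below, Dion targets only the top singular directions.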

What is Dion?

While Muon has shown strong empirical results, scaling it to very large models poses challenges. As reported by Essential AI (opens in new tab), applying Muon to large architectures like LLaMA-3 becomes compute-bound—and potentially communication-bound—due to the cost of the Newton–Schulz orthonormalization steps (opens in new tab).

Figure 2. Pseudocode of the centralized version of Dion

This is where Dion enters. At a high level, Dion introduces a new axis for scalability: the rank. Specifically, for a given rank r, Dion orthonormalizes only the top r of the singular vector space, reducing communication and compute overhead while preserving performance. Empirically, we observe that the necessary rank for good performance grows much more slowly than the number of parameters in larger models.


Dion implements orthonormalization using amortized power iteration (opens in new tab). Power iteration typically pulls out the largest singular value by repeated matrix multiplication. By amortizing this process over optimization steps—applied to the slowly-evolving momentum matrix—we reduce the cost to just two matrix multiplications per step. Incorporating a QR decomposition allows us to extract an approximate orthonormal basis spanning the top singular directions, rather than just the leading one. This amortized power iteration is fully compatible with standard distributed training techniques such as FSDP and tensor parallelism. Here, we show a simple centralized version, but the technique works for more complex forms of parallelization as presented in the paper. In other words, we can orthogonalize a matrix without ever seeing a full row or column of it.

Low-rank approximation would ordinarily introduce error, but Dion overcomes this through an error feedback mechanism. This keeps the residual of low rank approximation in the momentum matrix so that any systematic gradient structure not initially captured accumulates to eventually be applied in a future update.
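To make the mechanics concrete, here is a rough single-device sketch in PyTorch of amortized power iteration with a QR step and error feedback. It is illustrative only: the function name and structure are ours, and it omits the distributed sharding, scaling, and momentum-decay details of the actual open-source implementation.

```python
import torch

def dion_like_update(M, Q, lr):
    """Simplified sketch of a rank-r orthonormalized update (not the official Dion code).

    M: momentum matrix of shape (m, n), already accumulated with the latest gradient.
    Q: approximate right basis from the previous step, shape (n, r).
    Returns (weight_update, momentum_with_error_feedback, next_Q).
    """
    # Amortized power iteration: two matrix multiplications per step,
    # applied to the slowly evolving momentum matrix.
    P = M @ Q                       # (m, r) candidate left directions
    P, _ = torch.linalg.qr(P)       # orthonormal basis for the top-r left directions
    R = M.t() @ P                   # (n, r) corresponding right factor
    Q_next, _ = torch.linalg.qr(R)  # refreshed right basis for the next step

    # Error feedback: keep whatever the rank-r approximation missed inside the
    # momentum buffer, so systematic structure accumulates and is applied later.
    M_feedback = M - P @ R.t()

    # Orthonormal, rank-r update built from the extracted directions.
    update = -lr * (P @ Q_next.t())
    return update, M_feedback, Q_next
```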

How does it work?

Something very strange happened in our experiments. Usually, adding an extra constraint on the way an algorithm works can be expected to decrease overall performance. And indeed, at the 120M parameter scale of the speedrun, we see Dion’s update taking more time than Muon, while not yielding any significant gains. But at larger scales, we observed a different trend: Dion began to outperform Muon.

Figure 3. Wall-clock time speedup of Dion for 3B model training

Why would adding a constraint improve the update rule? The answer lies in what the constraint enforces. Dion achieves a much closer approximation to true orthonormalization than Muon. This precision, initially subtle, becomes increasingly important as the number of singular vectors grows. Over increasing model scale and training steps, this small advantage accumulates—leading to a measurable improvement in performance.

This edge further grows with batch size—with larger batches the update quality tends to degrade, but notably more slowly with Dion than Muon (and Muon is already a significant improvement over AdamW).

Figure 4. Scaling of Dion across different batch sizes

Here you can see how the number of steps needed to reach a given pretraining loss, relative to AdamW, varies as batch size grows, for full-rank and ¼-rank Dion (in orange) and Muon (in blue).   

In our experiments, these benefits extend to various post-training regimes as well.

We also experimented with rank, discovering empirically that larger models tolerate smaller rank well.

Figure 5. Low-rank Dion across different model sizes

Projecting this trend out to the scale of the LLaMA-3 (opens in new tab) 405B parameter models suggests that Dion is fully effective even with rank fractions as low as 1/16 or 1/64 for large dense models like LLaMA-3.    

Using hardware timings of the individual update steps suggests a story that looks like this:

Figure 6. Estimated wall-clock time of each optimizer step for Llama 3 405B. Lower is better. Muon is highlighted in orange as our baseline, next to Dion with varying rank fractions. Suggested rank fractions for a 405B parameter model are shown in blue. Using Dion with rank fraction 1/16 or lower offers an order-of-magnitude speedup over Muon.

We’ve open-sourced a PyTorch FSDP2 + Tensor Parallel (TP) implementation of Dion, available via a simple pip install. Our goal is to make faster training with Dion accessible to everyone. As a bonus, the repository also includes a PyTorch FSDP2 implementation of Muon.

Acknowledgements

We thank Riashat Islam and Pratyusha Sharma for their helpful feedback on the writing and presentation.


The post Dion: the distributed orthonormal update revolution is here appeared first on Microsoft Research.

Categories: Microsoft

Self-adaptive reasoning for science

Wed, 08/06/2025 - 18:00
Unlocking self-adaptive cognitive behavior that is more controllable and explainable than reasoning models in challenging scientific domains

Long-running LLM agents equipped with strong reasoning, planning, and execution skills have the potential to transform scientific discovery with high-impact advancements, such as developing new materials or pharmaceuticals. As these agents become more autonomous, ensuring effective human oversight and clear accountability becomes increasingly important, presenting challenges that must be addressed to unlock their full transformative power. Today’s approaches to long-term reasoning are established during the post-training phase, prior to end-user deployment and typically by the model provider. As a result, the expected actions of these agents are pre-baked by the model developer, offering little to no control from the end user.

At Microsoft, we are pioneering a vision for a continually steerable virtual scientist. In line with this vision, we created the ability to have a non-reasoning model develop thought patterns that allow for control and customizability by scientists. Our approach, a cognitive loop via in-situ optimization (CLIO), does not rely on reinforcement learning post-training to develop reasoning patterns yet still yields equivalent performance as demonstrated through our evaluation on Humanity’s Last Exam (HLE). Notably, we increased OpenAI GPT-4.1’s base model accuracy on text-only biology and medicine from 8.55% to 22.37%, an absolute increase of 13.82% (161.64% relative), surpassing o3 (high). This demonstrates that an optimization-based, self-adaptive AI system developed without further post-training can rival post-trained models in domains where adaptability, explainability, and control matter most.

Figure 1. Head-to-head comparison of OpenAI’s GPT-4.1 with CLIO, o3, and GPT-4.1 with no tools on HLE biology and medicine questions

In-situ optimization with internal self-reflection to enable self-adaptive reasoning

Model development has advanced from using reinforcement learning from human feedback (RLHF) for answer alignment to reinforcement learning with verifiable rewards (RLVR), which relies on external grading. Recent approaches show promise in the utilization of intrinsic rewards for training reasoning models (RLIR). Traditionally, these reasoning processes are learned during the post-training process, before any user interaction. While today’s reasoning models require additional data in the training phase and limit user control during the reasoning generation process, CLIO’s approach enables users to steer reasoning from scratch without additional data. Rather, CLIO generates its own necessary data by creating reflection loops at runtime. These reflection loops are used for a wide array of activities that CLIO self-defines, encompassing idea exploration, memory management, and behavior control. Most interesting is CLIO’s ability to leverage prior inferences to adjust future behaviors, handling uncertainties and raising flags for correction when necessary. Through this open-architecture approach to reasoning, we alleviate the need for further model post-training to achieve desired reasoning behavior. Novel scientific discovery often has no established reasoning patterns to follow, much less a large enough corpus of high-quality data to train on. 

CLIO reasons by continuously reflecting on progress, generating hypotheses, and evaluating multiple discovery strategies. For the HLE test, CLIO was specifically steered to follow the scientific method as a guiding framework. Our research shows that equipping language models with self-adapting reasoning enhances their problem-solving ability. It provides a net benefit in quality for science questions, as well as providing exposure and control to the end user.

Figure 2. CLIO can raise key areas of uncertainty within its self-formulated reasoning process, balancing multiple different viewpoints using graph structures.

Control over uncertainty: Building trust in AI 

Orchestrated reasoning systems like CLIO are valuable for scientific discovery, as they provide features beyond accuracy alone. Capabilities such as explaining the outcomes of internal reasoning are standard in the scientific field and are present in current reasoning model approaches. However, elements like displaying complete work, including final outcomes, internal thought processes, and uncertainty thresholds to support reproducibility or correction, as well as indicating uncertainty, are not yet universally implemented. Current models and systems do not have this same innate humility.  Rather, we are left with models that produce confident results, whether correct or incorrect. When correct, it is valuable. When incorrect, it is dangerous to the scientific process. Hence, understanding a model or system’s uncertainty is a crucial aspect that we have developed natively into CLIO.

On the other end of the spectrum, orchestrated reasoning systems tend to oversaturate the user by raising too many flags. We enable prompt-free control knobs within CLIO to set thresholds for raising uncertainty flags. This allows CLIO to flag uncertainty for itself and the end user at the proper point in time. This also enables scientists to revisit CLIO’s reasoning path with critiques, edit beliefs during the reasoning process, and re-execute them from the desired point in time. Ultimately, this builds a foundational level of trust with scientists to use them in a scientifically defensible and rigorous way. 
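To illustrate the shape of such a loop, here is a highly simplified sketch. The method names and the single scalar uncertainty value are hypothetical stand-ins; this is not CLIO’s implementation, which self-defines its reflection activities rather than following a fixed recipe.

```python
# Illustrative sketch of a runtime reflection loop with an uncertainty threshold.
def cognitive_loop(llm, question, max_rounds=5, uncertainty_threshold=0.3):
    memory = []  # self-generated reflections, hypotheses, and critiques
    answer, uncertainty = llm.attempt(question, memory)   # hypothetical call
    for _ in range(max_rounds):
        if uncertainty <= uncertainty_threshold:
            break
        # In-situ optimization: critique the current attempt, record it, and
        # revise, with no post-training or additional data required.
        memory.append(llm.reflect(question, answer, memory))
        answer, uncertainty = llm.attempt(question, memory)
    # If uncertainty stays above the threshold, raise a flag for the scientist,
    # who can edit beliefs in memory and re-execute from this point.
    flag = uncertainty > uncertainty_threshold
    return answer, uncertainty, flag
```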

How does CLIO perform? 

We evaluate CLIO against text-based biology and medicine questions from HLE. For this domain, we demonstrate a 61.98% relative increase or an 8.56% net increase in accuracy over OpenAI’s o3 and substantially outperform base completion models like OpenAI’s GPT-4.1, while enabling the requisite explainability and control. This technique applies to all models, showing similar increases in OpenAI’s GPT-4o model, which we observe performs poorly on HLE-level questions. On average, GPT-4.1 is not considered competent for HLE-scale questions (<9%), and GPT-4o is natively at less than 2%. By utilizing CLIO, we bring these to near state-of-the-art performance against top reasoning models. CLIO’s recursive nature enables the system to think more broadly and deeply, ensuring coverage of the question when answered. In GPT-4.1, we see an increase of 5.92% in accuracy for overall performance using just the cognitive loop recursion. To think more deeply, we allow CLIO to ensemble different evolutions and intelligently choose the best approach using GraphRAG. This extension of the cognition pattern provides a further 7.90% over a non-ensembled approach.  

Figure 3. The impact of thinking effort on CLIO’s effectiveness.

Furthermore, CLIO’s design offers different knobs of control, for example, how much time to think and which technique to utilize for a given problem. In Figure 3, we demonstrate these knobs of control and their effect on GPT-4.1’s and GPT-4o’s performance. In this case, we analyze performance for a subset of biomedical questions, those focused on immunology. CLIO increases GPT-4o’s base performance to be at par with the best reasoning models for immunology questions. We observe a 13.60% improvement over the base model, GPT-4o. This result shows CLIO to be model agnostic, similar to the Microsoft AI Diagnostic Orchestrator (MAI-DxO) (opens in new tab) approach and its corresponding performance boost. 

Implications for science and trustworthy discovery

The future of scientific discovery demands more than reasoning over knowledge and raw computational power alone. Here, we demonstrate how CLIO not only increases model performance but establishes new layers of control for scientists. In our upcoming work, we will demonstrate how CLIO increases tool utility for highly valuable scientific questions in the drug discovery space, which requires precise tools designed for the language of science. While our experiments focus on scientific discovery, we believe CLIO can apply in a domain-agnostic fashion. Experts tackling problems in domains such as financial analysis, engineering, and legal services could potentially benefit from AI systems with a transparent, steerable reasoning approach. Ultimately, we envision CLIO as an enduring control layer in hybrid AI stacks that combine traditional completion and reasoning models with external memory systems and advanced tool calling. The continuous checks and balances that CLIO enables will remain valuable even as components within the AI stack evolve. This combination of intelligent and steerable scientific decision making and tool optimization is the basis of the recently announced Microsoft Discovery platform (opens in new tab).

At Microsoft, we’re committed to advancing AI research that earns the trust of scientists, empowering them to discover new frontiers of knowledge. Our work is a testament to what’s possible when we blend innovation with trustworthiness and a human-centered vision for the future of AI-assisted scientific discovery. We invite the research and scientific community to join us in shaping that future.

Further information:

To learn more details about our approach, please read our pre-print paper published alongside this blog. We are in the process of submitting this work for external peer review and encourage partners to explore the utilization of CLIO in Microsoft Discovery. To learn more about Microsoft’s research on this or to contact our team, please reach out to discoverylabs@microsoft.com.

Acknowledgements

We are grateful for Jason Zander and Nadia Karim’s support. We extend our thanks to colleagues both inside and outside Microsoft Discovery and Quantum for sharing their insights and feedback, including Allen Stewart, Yasser Asmi, David Marvin, Harsha Nori, Scott Lundberg, and Phil Waymouth. 


The post Self-adaptive reasoning for science appeared first on Microsoft Research.

Categories: Microsoft

Project Ire autonomously identifies malware at scale

Tue, 08/05/2025 - 18:00

Today, we are excited to introduce an autonomous AI agent that can analyze and classify software without assistance, a step forward in cybersecurity and malware detection. The prototype, Project Ire, automates what is considered the gold standard in malware classification: fully reverse engineering a software file without any clues about its origin or purpose. It uses decompilers and other tools, reviews their output, and determines whether the software is malicious or benign.

Project Ire emerged from a collaboration between Microsoft Research, Microsoft Defender Research, and Microsoft Discovery & Quantum, bringing together security expertise, operational knowledge, data from global malware telemetry, and AI research. It is built on the same collaborative and agentic foundation behind GraphRAG and Microsoft Discovery (opens in new tab). The system uses advanced language models and a suite of callable reverse engineering and binary analysis tools to drive investigation and adjudication.

As of this writing, Project Ire has achieved a precision (opens in new tab) of 0.98 and a recall (opens in new tab) of 0.83 using public datasets of Windows drivers. It was the first reverse engineer at Microsoft, human or machine, to author a conviction case—a detection strong enough to justify automatic blocking—for a specific advanced persistent threat (APT) malware sample, which has since been identified and blocked by Microsoft Defender. 

Malware classification at a global scale

Microsoft’s Defender platform scans more than one billion monthly (opens in new tab) active devices through the company’s Defender suite of products, which routinely require manual review of software by experts.

This kind of work is challenging. Analysts often face error and alert fatigue, and there’s no easy way to compare and standardize how different people review and classify threats over time. For both of these reasons, today’s overloaded experts are vulnerable to burnout, a well-documented issue in the field.

Unlike other AI applications in security, malware classification lacks a computable validator (opens in new tab). The AI must make judgment calls without definitive validation beyond expert review. Many behaviors found in software, like reverse engineering protections, don’t clearly indicate whether a sample is malicious or benign. 

This ambiguity requires analysts to investigate each sample incrementally, building enough evidence to determine whether it’s malicious or benign despite opposition from adaptive, active adversaries. This has long made it difficult to automate and scale what is inherently a complex and expensive process.

Technical foundation

Project Ire attempts to address these challenges by acting as an autonomous system that uses specialized tools to reverse engineer software. The system’s architecture allows for reasoning at multiple levels, from low-level binary analysis to control flow reconstruction and high-level interpretation of code behavior.

Its tool-use API enables the system to update its understanding of a file using a wide range of reverse engineering tools, including Microsoft memory analysis sandboxes based on Project Freta (opens in new tab), custom and open-source tools, documentation search, and multiple decompilers.  

Reaching a verdict 

The evaluation process begins with a triage, where automated reverse engineering tools identify the file type, its structure, and potential areas of interest. From there, the system reconstructs the software’s control flow graph using frameworks such as angr (opens in new tab) and Ghidra (opens in new tab), building a graph that forms the backbone of Project Ire’s memory model and guides the rest of the analysis.  

Through iterative function analysis, the LLM calls specialized tools through an API to identify and summarize key functions. Each result feeds into a “chain of evidence,” a detailed, auditable trail that shows how the system reached its conclusion. This traceable evidence log supports secondary review by security teams and helps refine the system in cases of misclassification.  

To verify its findings, Project Ire can invoke a validator tool that cross-checks claims in the report against the chain of evidence. This tool draws on expert statements from malware reverse engineers on the Project Ire team. Drawing on this evidence and its internal model, the system creates a final report and classifies the sample as malicious or benign.
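To summarize the flow in one place, here is a purely illustrative sketch of the analyze-and-adjudicate loop described above. Every name here (the tools dictionary, the llm helper methods, the data classes) is a hypothetical stand-in, not Project Ire’s actual tooling or API.

```python
# Illustrative outline of the analysis loop; all names are hypothetical stand-ins.
from dataclasses import dataclass, field

@dataclass
class Evidence:
    tool: str      # which reverse-engineering tool produced the finding
    target: str    # e.g., a function name or binary region
    finding: str   # the tool's summarized output

@dataclass
class CaseFile:
    sample: str
    chain_of_evidence: list = field(default_factory=list)

def analyze(sample, tools, llm):
    case = CaseFile(sample)
    # Triage and control-flow reconstruction guide the rest of the analysis.
    cfg = tools["control_flow"](sample)
    # Iterative function analysis: each tool result is appended to the
    # auditable chain of evidence.
    for fn in llm.select_functions(cfg):
        summary = tools["decompile_and_summarize"](sample, fn)
        case.chain_of_evidence.append(Evidence("decompiler", fn, summary))
    # Draft a report, cross-check its claims against the evidence with the
    # validator tool, then classify the sample as malicious or benign.
    report = llm.draft_report(case.chain_of_evidence)
    checked = tools["validator"](report, case.chain_of_evidence)
    verdict = llm.classify(checked)
    return verdict, report, case
```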

Preliminary testing shows promise 

Two early evaluations tested Project Ire’s effectiveness as an autonomous malware classifier. In the first, we assessed Project Ire on a dataset of publicly accessible Windows drivers, some known to be malicious, others benign. Malicious samples came from the Living off the Land Drivers (opens in new tab) database, which includes a collection of Windows drivers used by attackers to bypass security controls, while known benign drivers were sourced from Windows Update. 

This classifier performed well, correctly identifying 90% of all files and flagging only 2% of benign files as threats. It achieved a precision of 0.98 and a recall of 0.83. This low false-positive rate suggests clear potential for deployment in security operations, alongside expert reverse engineering reviews. 
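For reference, these metrics follow the standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP, and FN are true positives, false positives, and false negatives. A precision of 0.98 therefore means that 98% of the files flagged as malicious really were malicious, and a recall of 0.83 means that 83% of the truly malicious files were caught.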

For each file it analyzes, Project Ire generates a report that includes an evidence section, summaries of all examined code functions, and other technical artifacts.  

Figures 1 and 2 present reports for two successful malware classification cases generated during testing. The first involves a kernel-level rootkit, Trojan:Win64/Rootkit.EH!MTB (opens in new tab). The system identified several key features, including jump-hooking, process termination, and web-based command and control. It then correctly flagged the sample as malicious.

Figure 1 Analysis

The binary contains a function named ‘MonitorAndTerminateExplorerThread_16f64’ that runs an infinite loop waiting on synchronization objects and terminates system threads upon certain conditions. It queries system or process information, iterates over processes comparing their names case-insensitively to ‘Explorer.exe’, and manipulates registry values related to ‘Explorer.exe’. This function appears to monitor and potentially terminate or manipulate the ‘Explorer.exe’ process, a critical Windows shell process. Such behavior is suspicious and consistent with malware that aims to disrupt or control system processes.

Another function, ‘HttpGetRequestAndResponse_174a4’, performs HTTP GET requests by parsing URLs, resolving hostnames, opening sockets, sending requests, and reading responses. This network communication capability could be leveraged for command and control or data exfiltration, common in malware.

The binary also includes a function ‘PatchProcessEntryPointWithHook_12b5c’ that patches the entry point of a process by writing a hook or trampoline that redirects execution to a specified address. This technique is commonly used for process injection or hooking, allowing malware to alter process behavior or inject malicious code.

Other functions related to sending IOCTL requests to device drivers were identified, but their maliciousness could not be conclusively determined without additional context.

Overall, the binary exhibits multiple indicators of malicious behavior, including process manipulation, network communication, and code injection techniques, suggesting it is likely malware designed to interfere with system processes and communicate with remote servers.

Figure 1. Project Ire report, sample with SHA256: 86047bb1969d1db455493955fd450d18c62a3f36294d0a6c3732c88dfbcc4f62 (opens in new tab)

The second sample, HackTool:Win64/KillAV!MTB (opens in new tab), was designed to disable antivirus software. Project Ire correctly identified the code that locates and disables antivirus programs, providing evidence that the file was malicious.  

In one section of the code, however, the system misidentified a function as anti-debugging behavior. To maintain accuracy, the system used the validator tool to flag the claim as unsupported. The issue was later resolved by updating decompiler rules, but this example illustrates how Project Ire navigates uncertainty during analysis. Figure 2 shows the corresponding report. 

Figure 2 Analysis

The binary contains several functions indicative of malicious intent. The function register_and_log_known_processes_140001000 logs and registers process names associated with antivirus and security software, such as ‘avp.exe’, ‘avpui.exe’, and ‘360Tray.exe’. It calls another function, TerminateProcessesByNameSubstring_1400010f4, which enumerates system processes and terminates those whose names contain specified substrings. This behavior is typical of malware attempting to disable or evade security software by killing their processes.

Another function, check_and_handle_special_state_14000502c, performs checks on a global variable and triggers software interrupts if certain conditions are not met. While the exact purpose of these interrupts (int 0x29 and int 0x3) is unclear, they could represent an anti-debug or anti-analysis mechanism to detect or interfere with debugging or tampering attempts. However, this assumption could not be fully validated against expert statements.

Other functions include initialization routines and simple logging wrappers, but the core malicious behavior centers on process termination targeting security software. This indicates the binary is designed to compromise system security by disabling protective processes, a hallmark of malware such as trojans or rootkits.

Figure 2. Project Ire report, sample with SHA256: b6cb163089f665c05d607a465f1b6272cdd5c949772ab9ce7227120cf61f971a (opens in new tab)

Real-world evaluation with Microsoft Defender 

The more demanding test involved nearly 4,000 “hard-target” files not classified by automated systems and slated for manual review by expert reverse engineers.

In this real-world scenario, Project Ire operated fully autonomously on files created after the language models’ training cutoff, files that no other automated tools at Microsoft could classify at the time.

The system achieved a high precision score of 0.89, meaning nearly 9 out of 10 files flagged as malicious were correctly identified as malicious. Recall was 0.26, indicating that under these challenging conditions, the system detected roughly a quarter of all actual malware.

The system correctly identified many of the malicious files, with few false alarms, just a 4% false positive rate. While overall performance was moderate, this combination of accuracy and a low error rate suggests real potential for future deployment.

Looking ahead 

Based on these early successes, the Project Ire prototype will be leveraged inside Microsoft’s Defender organization as Binary Analyzer for threat detection and software classification.

Our goal is to scale the system’s speed and accuracy so that it can correctly classify files from any source, even on first encounter. Ultimately, our vision is to detect novel malware directly in memory, at scale.

Acknowledgements 

Project Ire acknowledges the following additional developers who contributed to the results in this publication: Dayenne de Souza, Raghav Pande, Ryan Terry, Shauharda Khadka, and Bob Fleck for their independent review of the system.

The system incorporates multiple tools, including the angr framework developed by Emotion Labs (opens in new tab). Microsoft has collaborated extensively with Emotion Labs, a pioneer in cyber autonomy, throughout the development of Project Ire, and thanks them for the innovations and insights that contributed to the successes reported here. 


The post Project Ire autonomously identifies malware at scale appeared first on Microsoft Research.

Categories: Microsoft

VeriTrail: Detecting hallucination and tracing provenance in multi-step AI workflows

Tue, 08/05/2025 - 18:00

Many applications of language models (LMs) involve generating content based on source material, such as answering questions, summarizing information, and drafting documents. A critical challenge for these applications is that LMs may produce content that is not supported by the source text – a phenomenon known as “closed-domain hallucination.”1

Existing methods for detecting closed-domain hallucination typically compare a given LM output to the source text, implicitly assuming that there is only a single output to evaluate. However, applications of LMs increasingly involve processes with multiple generative steps: LMs generate intermediate outputs that serve as inputs to subsequent steps and culminate in a final output. Many agentic workflows follow this paradigm (e.g., each agent is responsible for a specific document or sub-task, and their outputs are synthesized into a final response).  

In our paper “VeriTrail: Closed-Domain Hallucination Detection with Traceability,” we argue that, given the complexity of processes with multiple generative steps, detecting hallucination in the final output is necessary but not sufficient. We also need traceability, which has two components: 

  1. Provenance: if the final output is supported by the source text, we should be able to trace its path through the intermediate outputs to the source. 
  2. Error Localization: if the final output is not supported by the source text, we should be able to trace where the error was likely introduced.

Our paper presents VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for processes with any number of generative steps. We also demonstrate that VeriTrail outperforms baseline methods commonly used for hallucination detection. In this blog post, we provide an overview of VeriTrail’s design and performance.2

VeriTrail’s hallucination detection process

A key idea leveraged by VeriTrail is that a wide range of generative processes can be represented as a directed acyclic graph (DAG). Each node in the DAG represents a piece of text (i.e., source material, an intermediate output, or the final output) and each edge from node A to node B indicates that A was used as an input to produce B. Each node is assigned a unique ID, as well as a stage reflecting its position in the generative process.  

An example of a process with multiple generative steps is GraphRAG. A DAG representing a GraphRAG run is illustrated in Figure 1, where the boxes and arrows correspond to nodes and edges, respectively.3

Figure 1: GraphRAG splits the source text into chunks (Stage 1). For each chunk, an LM extracts entities and relationships (the latter are denoted by “⭤ “), along with short descriptions (Stage 2). If an entity or a relationship was extracted from multiple chunks, an LM summarizes the descriptions (Stage 3). A knowledge graph is constructed from the final set of entities and relationships, and a community detection algorithm, such as Leiden clustering, groups entities into communities. For each community, an LM generates a “community report” that summarizes the entities and relationships (Stage 4). To answer a user’s question, an LM generates “map-level answers” based on groups of community reports (Stage 5), then synthesizes them into a final answer (Stage 6).

VeriTrail takes as input a DAG representing a completed generative process and aims to determine whether the final output is fully supported by the source text. It begins by extracting claims (i.e., self-contained, verifiable statements) from the final output using Claimify. VeriTrail verifies claims in the reverse order of the generative process: it starts from the final output and moves toward the source text. Each claim is verified separately. Below, we include two case studies that illustrate how VeriTrail works, using the DAG from Figure 1. 
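As a minimal illustration of this representation (the field names are ours, not VeriTrail’s data model), a node can be modeled as follows, with verification walking from the final output back toward its inputs:

```python
# Tiny sketch of the DAG that VeriTrail consumes; field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    stage: int   # position in the generative process (1 = source text chunks)
    text: str    # source material, intermediate output, or final output
    inputs: list = field(default_factory=list)  # IDs of nodes used to produce this node

# A much-simplified GraphRAG-like run: one source chunk -> one summary -> final answer.
dag = {
    1: Node(1, 1, "source text chunk ..."),
    2: Node(2, 2, "entity/relationship summary ...", inputs=[1]),
    3: Node(3, 3, "final answer ...", inputs=[2]),
}

def inputs_of(node_ids, dag):
    """Verification moves in reverse order: from a set of nodes to their inputs."""
    return sorted({i for n in node_ids for i in dag[n].inputs})

print(inputs_of([3], dag))  # [2] -- the next node to verify after the final answer
```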

Case study 1: A “Fully Supported” claim

Figure 2: Left: GraphRAG as a DAG. Right: VeriTrail’s hallucination detection process for a “Fully Supported” claim.

Figure 2 shows an example of a claim that VeriTrail determined was not hallucinated: 

  • In Iteration 1, VeriTrail identified the nodes that were used as inputs for the final answer: Nodes 15 and 16. Each identified node was split into sentences, and each sentence was programmatically assigned a unique ID.
    • An LM then performed Evidence Selection, selecting all sentence IDs that strongly implied the truth or falsehood of the claim. The LM also generated a summary of the selected sentences (not shown in Figure 2). In this example, a sentence was selected from Node 15.
    • Next, an LM performed Verdict Generation. If no sentences had been selected in the Evidence Selection step, the claim would have been assigned a “Not Fully Supported” verdict. Instead, an LM was prompted to classify the claim as “Fully Supported,” “Not Fully Supported,” or “Inconclusive” based on the evidence. In this case, the verdict was “Fully Supported.”
  • Since the verdict in Iteration 1 was “Fully Supported,” VeriTrail proceeded to Iteration 2. It considered the nodes from which at least one sentence was selected in the latest Evidence Selection step (Node 15) and identified their input nodes (Nodes 12 and 13). VeriTrail repeated Evidence Selection and Verdict Generation for the identified nodes. Once again, the verdict was “Fully Supported.” This process – identifying candidate nodes, performing Evidence Selection and Verdict Generation – was repeated in Iteration 3, where the verdict was still “Fully Supported,” and likewise in Iteration 4. 
  • In Iteration 4, a single source text chunk was verified. Since the source text, by definition, does not have any inputs, verification terminated and the verdict was deemed final.
Case study 2: A “Not Fully Supported” claim

Figure 3: Left: GraphRAG as a DAG. Right: VeriTrail’s hallucination detection process for a “Not Fully Supported” claim, where the maximum number of consecutive “Not Fully Supported” verdicts was set to 2.

Figure 3 provides an example of a claim where VeriTrail identified hallucination:

  • In Iteration 1, VeriTrail identified the nodes used as inputs for the final answer: Nodes 15 and 16. After Evidence Selection and Verdict Generation, the verdict was “Not Fully Supported.” Users can configure the maximum number of consecutive “Not Fully Supported” verdicts permitted. If the maximum had been set to 1, verification would have terminated here, and the verdict would have been deemed final. Let’s assume the maximum was set to 2, meaning that VeriTrail had to perform at least one more iteration.
  • Even though evidence was selected only from Node 15 in Iteration 1, VeriTrail checked the input nodes for both Node 15 and Node 16 (i.e., Nodes 12, 13, and 14) in Iteration 2. Recall that in Case Study 1 where the verdict was “Fully Supported,” VeriTrail only checked the input nodes for Node 15. Why was the “Not Fully Supported” claim handled differently? If the Evidence Selection step overlooked relevant evidence, the “Not Fully Supported” verdict might be incorrect. In this case, continuing verification based solely on the selected evidence (i.e., Node 15) would propagate the mistake, defeating the purpose of repeated verification.
  • In Iteration 2, Evidence Selection and Verdict Generation were repeated for Nodes 12, 13, and 14. Once again, the verdict was “Not Fully Supported.” Since this was the second consecutive “Not Fully Supported” verdict, verification terminated and the verdict was deemed final.
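
Putting the two case studies together, the overall control flow can be sketched roughly as follows. This is an illustrative outline, not VeriTrail’s reference implementation: select_evidence and generate_verdict stand in for the Evidence Selection and Verdict Generation LM calls, and the evidence objects are assumed to carry the ID of the node they came from.

    def verify_claim(claim, dag, final_node_id, select_evidence, generate_verdict,
                     max_consecutive_nfs=2):
        # Verify one claim by walking the DAG from the final output toward the
        # source text, one iteration per level.
        candidates = list(dag[final_node_id].inputs)
        consecutive_nfs = 0

        while candidates:
            evidence = select_evidence(claim, [dag[i] for i in candidates])
            if not evidence:
                verdict = "Not Fully Supported"
            else:
                verdict = generate_verdict(claim, evidence)

            if verdict == "Not Fully Supported":
                consecutive_nfs += 1
                if consecutive_nfs >= max_consecutive_nfs:
                    return verdict  # cap on consecutive "Not Fully Supported" verdicts reached
                # Re-check the inputs of *all* candidates, in case evidence was missed.
                next_ids = {i for c in candidates for i in dag[c].inputs}
            else:
                consecutive_nfs = 0
                # Follow only the nodes that actually contributed evidence.
                contributing = {e.node_id for e in evidence}
                next_ids = {i for c in contributing for i in dag[c].inputs}

            if not next_ids:  # reached source text chunks, which have no inputs
                return verdict
            candidates = sorted(next_ids)

        return "Inconclusive"

In Case Study 1, a loop like this follows the evidence from Nodes 15 and 16 down to a single source chunk; in Case Study 2, it stops after two consecutive “Not Fully Supported” verdicts.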

Providing traceability

In addition to assigning a final “Fully Supported,” “Not Fully Supported,” or “Inconclusive” verdict to each claim, VeriTrail returns (a) all Verdict Generation results and (b) an evidence trail composed of all Evidence Selection results: the selected sentences, their corresponding node IDs, and the generated summaries. Collectively, these outputs provide traceability: 

  1. Provenance: For “Fully Supported” and “Inconclusive” claims, the evidence trail traces a path from the source material to the final output, helping users understand how the output may have been derived. For example, in Case Study 1, the evidence trail consists of Sentence 8 from Node 15, Sentence 11 from Node 13, Sentence 26 from Node 4, and Sentence 79 from Node 1.
  2. Error Localization: For “Not Fully Supported” claims, VeriTrail uses the Verdict Generation results to identify the stage(s) of the process where the unsupported content was likely introduced. For instance, in Case Study 2, where none of the verified intermediate outputs supported the claim, VeriTrail would indicate that the hallucination occurred in the final answer (Stage 6). Error stage identification helps users address hallucinations and understand where in the process they are most likely to occur. 

The evidence trail also helps users verify the verdict: instead of reading through all nodes – which may be infeasible for processes that generate large amounts of text – users can simply review the evidence sentences and summaries. 
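
For a concrete picture of these outputs, the records below show one plausible shape for a claim’s result; the field names and structure are assumptions for illustration, not the schema used in the paper.

    from dataclasses import dataclass

    @dataclass
    class EvidenceEntry:
        iteration: int    # which verification pass selected this sentence
        node_id: int      # node the sentence was selected from
        sentence_id: int  # programmatically assigned sentence ID
        sentence: str     # the selected evidence sentence

    @dataclass
    class ClaimResult:
        claim: str
        final_verdict: str    # "Fully Supported", "Not Fully Supported", or "Inconclusive"
        verdicts: list[str]   # Verdict Generation result for each iteration
        summaries: list[str]  # Evidence Selection summary for each iteration
        evidence_trail: list[EvidenceEntry]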

Key design features

VeriTrail’s design prioritizes reliability, efficiency, scalability, and user agency. Notable features include: 

  • During Evidence Selection (introduced in Case Study 1), the sentence IDs returned by the LM are checked against the programmatically assigned IDs. If a returned ID does not match an assigned ID, it is discarded; otherwise, it is mapped to its corresponding sentence. This approach guarantees that the sentences included in the evidence trail are not hallucinated. (A minimal sketch of this check appears after this list.)
  • After a claim is assigned an interim “Fully Supported” or “Inconclusive” verdict (as in Case Study 1), VeriTrail verifies the input nodes of only the nodes from which evidence was previously selected – not all possible input nodes. By progressively narrowing the search space, VeriTrail limits the number of nodes the LM must evaluate. In particular, since VeriTrail starts from the final output and moves toward the source text, it tends to verify a smaller proportion of nodes as it approaches the source text. Nodes closer to the source text tend to be larger (e.g., a book chapter should be larger than its summary), so verifying fewer of them helps reduce computational cost.
  • VeriTrail is designed to handle input graphs with any number of nodes, regardless of whether they fit in a single prompt. Users can specify an input size limit per prompt. For Evidence Selection, inputs that exceed the limit are split across multiple prompts. If the resulting evidence exceeds the input size limit for Verdict Generation, VeriTrail reruns Evidence Selection to compress the evidence further. Users can configure the maximum number of Evidence Selection reruns.  
  • The configurable maximum number of consecutive “Not Fully Supported” verdicts (introduced in Case Study 2) allows the user to find their desired balance between computational cost and how conservative VeriTrail is in flagging hallucinations. A lower maximum reduces cost by limiting the number of checks. A higher maximum increases confidence that a flagged claim is truly hallucinated since it requires repeated confirmation of the “Not Fully Supported” verdict. 
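
As noted in the first bullet above, the ID check during Evidence Selection can be sketched as follows. This is a simplified illustration; the function name and ID format are assumptions.

    def keep_valid_evidence(returned_ids, id_to_sentence):
        # id_to_sentence maps each programmatically assigned sentence ID to its
        # sentence text. Any ID returned by the LM that was never assigned is
        # discarded, so the evidence trail can only contain real sentences.
        evidence = []
        for sid in returned_ids:
            if sid in id_to_sentence:
                evidence.append((sid, id_to_sentence[sid]))
        return evidence
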
Evaluating VeriTrail’s performance

We tested VeriTrail on two datasets covering distinct generative processes (hierarchical summarization4 and GraphRAG), tasks (summarization and question-answering), and types of source material (fiction novels and news articles). For the source material, we focused on long documents and large collections of documents (i.e., >100K tokens), where hallucination detection is especially challenging and processes with multiple generative steps are typically most valuable. The resulting DAGs were much more complex than the examples provided above (e.g., in one of the datasets, the average number of nodes was 114,368).

We compared VeriTrail to three types of baseline methods commonly used for closed-domain hallucination detection: Natural Language Inference models (AlignScore (opens in new tab) and INFUSE (opens in new tab)); Retrieval-Augmented Generation; and long-context models (Gemini 1.5 Pro and GPT-4.1 mini). Across both datasets and all language models tested, VeriTrail outperformed the baseline methods in detecting hallucination.5

Most importantly, VeriTrail traces claims through intermediate outputs – unlike the baseline methods, which directly compare the final output to the source material. As a result, it can identify where hallucinated content was likely introduced and how faithful content may have been derived from the source. By providing traceability, VeriTrail brings transparency to generative processes, helping users understand, verify, debug, and, ultimately, trust their outputs.  

For an in-depth discussion of VeriTrail, please see our paper “VeriTrail: Closed-Domain Hallucination Detection with Traceability.”

1 The term “closed-domain hallucination” was introduced by OpenAI in the GPT-4 Technical Report (opens in new tab).

2 VeriTrail is currently used for research purposes only and is not available commercially.

3 We focus on GraphRAG’s global search method.

4 In hierarchical summarization, an LM summarizes each source text chunk individually, then the resulting summaries are repeatedly grouped and summarized until a final summary is produced (Wu et al., 2021 (opens in new tab); Chang et al., 2023 (opens in new tab)).

5 The only exception was the mistral-large-2411 model, where VeriTrail had the highest balanced accuracy, but not the highest macro F1 score.


The post VeriTrail: Detecting hallucination and tracing provenance in multi-step AI workflows appeared first on Microsoft Research.

Categories: Microsoft

Xinxing Xu bridges AI research and real-world impact at Microsoft Research Asia – Singapore

Thu, 07/24/2025 - 03:30

AI has made remarkable progress in recent years, but turning experimental models into tools that work in the real world is still a major challenge. Bridging this gap between innovation and application has shaped the career of Xinxing Xu, principal researcher at Microsoft Research Asia – Singapore (opens in new tab), and underpins the mission of the lab’s newly established presence in the region.

Xinxing Xu, Principal Researcher, Microsoft Research Asia – Singapore

“Innovative algorithms can only demonstrate their true value when tested with real-world data and in actual scenarios, where they can be continuously optimized through iteration,” he says.

Xu’s commitment to balancing algorithmic innovation with practical application has shaped his entire career. During his PhD studies at Nanyang Technological University, Singapore, Xu focused on emerging technologies like multiple kernel learning methods and multimodal machine learning. Today he’s applying these techniques to real-world use cases like image recognition and video classification.

After completing his doctorate, he joined the Institute of High Performance Computing at Singapore’s Agency for Science, Technology and Research (A*STAR), where he worked on interdisciplinary projects ranging from medical image recognition to AI systems for detecting defects on building facades. These experiences broadened his perspective and deepened his passion for translating AI into real-world impact.

In 2024, Xu joined Microsoft Research Asia, where he began a new chapter focused on bridging academic research and real-world AI applications.

“Microsoft Research Asia is committed to integrating scientific exploration with real-world applications, which creates a unique research environment,” Xu says. “It brings together top talent and resources, and Microsoft’s engineering and product ecosystem strongly supports turning research into impactful technology. The lab’s open and inclusive culture encourages innovation with broader societal impact. It reflects the approach to research I’ve always hoped to contribute to.”

Bringing cross-domain expertise to AI’s real-world frontiers

As a key hub in Microsoft Research’s network across Asia, the Singapore lab is guided by a three-part mission: to drive industry-transforming AI deployment, pursue fundamental breakthroughs in the field, and promote responsible, socially beneficial applications of the technology.

To reach these goals, Xu and his colleagues are working closely with local collaborators, combining cross-disciplinary expertise to tackle complex, real-world challenges.

One key focus is healthcare, where Xu leads a collaboration with Singapore’s SingHealth to explore how AI can support precision medicine. By combining SingHealth’s clinical data with advanced AI models, the team aims to deliver more personalized analyses and sharper diagnostic tools—laying the groundwork for improved patient outcomes. 

Beyond healthcare, the team is also targeting key sectors like finance and logistics. By developing domain-specific foundation models and AI agents, they aim to support smarter decision-making and accelerate digital transformation across industries. “Singapore has a strong foundation in these sectors,” Xu notes, “making it an ideal environment for technology validation and iteration.”

The team is also partnering with leading academic institutions, including the National University of Singapore (NUS) and Nanyang Technological University, Singapore (NTU Singapore), to advance the field of spatial intelligence. Their goal is to develop embodied intelligence systems capable of carrying out complex tasks in smart environments.

As AI becomes more deeply embedded in everyday life, researchers at the Singapore lab are also increasingly focused on what they call “societal AI”—building AI systems that are culturally relevant and trustworthy within Southeast Asia’s unique cultural and social contexts. In collaboration with global colleagues, they’re helping to advance a more culturally grounded and responsible approach to AI research in the region.

Microsoft Research Asia – Singapore: Expanding global reach, connecting regional innovation 

Realizing AI’s full potential requires more than technical breakthroughs. It also depends on collaboration—across industries, academia, and policy. Only through this intersection of forces can AI move beyond the lab to deliver meaningful societal value. 

Singapore’s strengths in science, engineering, and digital governance make it an ideal setting for this kind of work. Its collaborative culture, robust infrastructure, international talent pool, and strong policy support for science and technology make it fertile ground for interdisciplinary research. 

This is why Microsoft Research Asia continues to collaborate closely with Singapore’s top universities, research institutions, and industry partners. These partnerships support joint research, talent development, and technical exchange. Building on this foundation, Microsoft Research Asia – Singapore will further deepen its collaboration with NUS, NTU Singapore, and Singapore Management University (SMU) to advance both fundamental and applied research, while equipping the next generation of researchers with real-world experience. In addition, Microsoft Research Asia is fostering academic exchange and strengthening the research ecosystem through summer schools and joint workshops with NUS, NTU Singapore, and SMU. 

The launch of the Singapore lab further marks an important step in expanding the company’s global research footprint, serving as a bridge between regional innovation and Microsoft’s global ecosystem. Through its integrated lab network, Microsoft Research fosters the sharing of technologies, methods, and real-world insights, creating a virtuous cycle of innovation.

“We aim to build a research hub in Singapore that is globally connected and deeply rooted in the local ecosystem,” Xu says. “Many breakthroughs come from interdisciplinary and cross-regional collaboration. By breaking boundaries—across disciplines, industries, and geographies—we can drive research that has lasting impact.”

As AI becomes more deeply woven into industry and everyday life, Xu believes that meaningful research must be closely connected to regional development and social well-being. “Microsoft Research Asia – Singapore is a future-facing lab,” he says. “While we push technological frontiers, we’re equally committed to the responsibility of technology—ensuring AI can help address society’s most pressing challenges.”

In a world shaped by global challenges, Xu sees collaboration and innovation as essential to real progress. With Singapore as a launchpad, he and his team are working to extend AI’s impact and value across Southeast Asia and beyond.

Xinxing Xu (center) with colleagues at Microsoft Research Asia – Singapore

Three essential strengths for the next generation of AI researchers

AI’s progress depends not only on technical breakthroughs but also on the growth and dedication of talent. At Microsoft Research Asia, there is a strong belief that bringing research into the real world requires more than technical coordination—it depends on unlocking the full creativity and potential of researchers.

In Singapore—a regional innovation hub that connects Southeast Asia—Xu and his colleagues are working to push AI beyond the lab and into fields like healthcare, finance, and manufacturing. For young researchers hoping to shape the future of AI, this is a uniquely powerful stage.

To help guide the next generation, Xu shares three pieces of advice:

  • Build a strong foundation – “Core knowledge in machine learning, linear algebra, and probability and statistics is the bedrock of AI research,” Xu says. “A solid theoretical base is essential to remain competitive in a rapidly evolving field. Even today’s hottest trends in generative AI rely on longstanding principles of optimization and model architecture design.” While code generation tools are on the rise, Xu emphasizes that mathematical fundamentals remain essential for understanding and innovating in AI.
  • Understand real-world applications – Technical skills alone aren’t enough. Xu encourages young researchers to deeply engage with the problems they’re trying to solve. Only by tightly integrating technology with its context can researchers create truly valuable solutions.

    “In healthcare, for example, researchers may need to follow doctors in clinics to gain a true understanding of clinical workflows. That context helps identify the best entry points for AI deployment. Framing research problems around real-world needs is often more impactful than just tuning model parameters,” Xu says.
  • Develop interdisciplinary thinking – Cross-disciplinary collaboration is becoming essential to AI innovation. Xu advises young researchers to learn how to work with experts from other fields to explore new directions together. “These kinds of interactions often spark fresh, creative ideas,” he says.

    Maintaining curiosity is just as important. “Being open to new technologies and fields is what enables researchers to continually break new ground and produce original results.”

Xu extends an open invitation to aspiring researchers from all backgrounds to join Microsoft Research Asia – Singapore. “We offer a unique platform that blends cutting-edge research with real-world impact,” he says. “It’s a place where you can work on the frontiers of AI—and see how your work can help transform industries and improve lives.”

To learn more about current openings at the Singapore lab, please visit our careers page (opens in new tab).


The post Xinxing Xu bridges AI research and real-world impact at Microsoft Research Asia – Singapore appeared first on Microsoft Research.

Categories: Microsoft

Technical approach for classifying human-AI interactions at scale

Wed, 07/23/2025 - 18:00

As large language models (LLMs) become foundational to modern AI systems, the ability to run them at scale—efficiently, reliably, and in near real-time—is no longer a nice-to-have. It’s essential. The Semantic Telemetry project tackles this challenge by applying LLM-based classifiers to hundreds of millions of sampled, anonymized Bing Chat conversations each week. These classifiers extract signals like user expertise, primary topic, and satisfaction, enabling deeper insight into human-AI interactions and driving continuous system improvement.

But building a pipeline that can handle this volume isn’t just about plugging into an API. It requires a high-throughput, high-performance architecture that can orchestrate distributed processing, manage token and prompt complexity, and gracefully handle the unpredictability of remote LLM endpoints.

In this latest post in our series on Semantic Telemetry, we’ll walk through the engineering behind that system—how we designed for scale from the start, the trade-offs we made, and the lessons we learned along the way. From batching strategies to token optimization and orchestration, we’ll share what it takes to build a real-time LLM classification pipeline.

For additional project background, see Semantic Telemetry: Understanding how users interact with AI systems and Engagement, user expertise, and satisfaction: Key insights from the Semantic Telemetry Project.

System architecture highlights

The Semantic Telemetry pipeline (opens in new tab) is a highly scalable, highly configurable data transformation pipeline. While it follows a familiar ETL structure, several architectural innovations make it uniquely suited for high-throughput LLM integration:

  • Hybrid compute engine
    The pipeline combines the distributed power of PySpark with the speed and simplicity of Polars, enabling it to scale across large datasets or run lightweight jobs in Spark-less environments—without code changes.
  • LLM-centric transformation layer
    At the core of the pipeline is a multi-stage transformation process tailored for running across multiple LLM endpoints:
    • Model-agnostic execution: a generic LLM interface, with model-specific adapters built on top of it.
    • Prompt templates are defined using the Prompty language specification for consistency and reuse, with options for users to include custom prompts.
    • Parsing and cleaning logic ensures structured, schema-aligned outputs even when LLM responses are imperfect: for example, removing extra characters from output, resolving not-exact label matches (i.e., “create” versus “created”), and relabeling invalid classifications.
Figure 1. Architecture diagram

The pipeline supports multiple classification tasks (e.g., user expertise, topic, satisfaction) through modular prompt templates and configurable execution paths—making it easy to adapt to new use cases or environments.
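
To make the parsing and cleaning step above concrete, a simplified label-normalization routine might look like the sketch below. The label set, matching threshold, and fallback label are illustrative assumptions, not the pipeline’s actual logic.

    import re
    from difflib import get_close_matches

    VALID_LABELS = ["novice", "intermediate", "expert"]  # example classifier schema

    def normalize_label(raw_output: str, fallback: str = "invalid") -> str:
        # Strip extra characters the LLM may add (quotes, punctuation, whitespace).
        cleaned = re.sub(r"[^a-z0-9 ]", "", raw_output.strip().lower())
        if cleaned in VALID_LABELS:
            return cleaned
        # Resolve not-exact label matches (e.g., "experts" -> "expert").
        close = get_close_matches(cleaned, VALID_LABELS, n=1, cutoff=0.8)
        if close:
            return close[0]
        # Relabel anything that still doesn't fit the schema.
        return fallback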

Engineering challenges & solutions

Building a high-throughput, LLM-powered classification pipeline at scale introduced a range of engineering challenges—from managing latency and token limits to ensuring system resilience. Below are the key hurdles we encountered and how we addressed them.

LLM endpoint latency & variability

Challenge: LLM endpoints, especially those hosted remotely (e.g., Azure OpenAI), introduce unpredictable latency due to model load, prompt complexity, and network variability. This made it difficult to maintain consistent throughput across the pipeline.

Solution: We implemented a combination of:

  • Multiple Azure OpenAI endpoints in rotation to increase throughput and distribute workload. We can analyze throughput and redistribute as needed.
  • Saving output at intervals, writing data asynchronously so partial results survive network errors.
  • Utilizing models with higher tokens-per-minute (TPM) limits, such as OpenAI’s GPT-4o mini. GPT-4o mini had a 2M TPM limit, a 25x throughput increase over GPT-4 (80K TPM -> 2M TPM).
  • Timeouts and retries with exponential backoff (a minimal sketch follows this list).
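
As referenced in the last bullet, a minimal retry wrapper with exponential backoff might look like this. The helper below is a sketch; the request callable and parameter values are placeholders rather than the pipeline’s actual client code.

    import random
    import time

    def call_with_retries(send_request, max_retries=5, base_delay=1.0, timeout=60):
        # send_request is any callable that issues one LLM request with the given
        # timeout (e.g., an Azure OpenAI chat completion) and raises on failure.
        for attempt in range(max_retries):
            try:
                return send_request(timeout=timeout)
            except Exception:
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff with jitter: ~1s, 2s, 4s, ... plus noise.
                delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                time.sleep(delay)
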
Evolving LLM models & prompt alignment

Challenge: Each new LLM release—such as Phi, Mistral, DeepSeek, and successive generations of GPT (e.g., GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o)—brings improvements, but also subtle behavioral shifts. These changes can affect classification consistency, output formatting, and even the interpretation of prompts. Maintaining alignment with baseline expectations across models became a moving target.

Solution: We developed a model evaluation workflow to test prompt alignment across LLM versions:

  • Small-sample testing: We ran the pipeline on a representative sample using the new model and compared the output distribution to a known baseline.
  • Distribution analysis: If the new model’s output aligned closely, we scaled up testing. If not, we iteratively tuned the prompts and re-ran comparisons.
  • Interpretation flexibility: We also recognized that a shift in distribution isn’t always a regression. Sometimes it reflects a more accurate or nuanced classification, especially as models improve.

To support this process, we used tools like Sammo (opens in new tab), which allowed us to compare outputs across multiple models and prompt variants. This helped us quantify the impact of prompt changes and model upgrades and make informed decisions about when to adopt a new model or adjust our classification schema.

Dynamic concurrency scaling for LLM calls

Challenge: LLM endpoints frequently encounter rate limits and inconsistent response times under heavy usage, and model speeds vary, which complicates choosing an optimal concurrency level. Users may also pick suboptimal settings out of unfamiliarity, and default concurrency configurations are rarely ideal for every situation. Dynamically adjusting concurrency based on observed throughput helps the pipeline converge on an appropriate level.

Solution: We implemented a dynamic concurrency control mechanism that proactively adjusts the number of parallel LLM calls based on real-time system behavior:

  • External task awareness: The system monitors the number of parallel tasks running across the pipeline (e.g., Spark executors or async workers) and uses this to inform the initial concurrency level.
  • Success/failure rate monitoring: The system tracks the rolling success and failure rates of LLM calls. A spike in failures triggers a temporary reduction in concurrency, while sustained success allows for gradual ramp-up.
  • Latency-based feedback loop: Instead of waiting for rate-limit errors, the system measures the response time of LLM calls. If latency increases, it reduces concurrency; if latency decreases and success rates remain high, it cautiously scales up (a minimal sketch follows this list).
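
A stripped-down version of that feedback loop is sketched below; the thresholds, step sizes, and worker bounds are illustrative assumptions.

    def adjust_concurrency(current_workers, failure_rate, p95_latency_s,
                           min_workers=2, max_workers=64,
                           max_failure_rate=0.05, max_latency_s=20.0):
        # Back off quickly when the endpoint shows signs of stress (rising failure
        # rate or latency); otherwise ramp up one worker at a time.
        if failure_rate > max_failure_rate or p95_latency_s > max_latency_s:
            return max(min_workers, current_workers // 2)
        return min(max_workers, current_workers + 1)

Each batch of LLM calls reports its rolling failure rate and latency, and the next batch is scheduled with the adjusted worker count.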

Optimization experiments

To further improve throughput and efficiency, we ran a series of optimization experiments. Each approach came with trade-offs that we carefully measured.

Batch endpoints (Azure/OpenAI)

Batch endpoints are a cost-effective, moderately high-throughput way of executing LLM requests. Batch endpoints process large lists of LLM prompts over a 24-hour period, recording responses in a file. They are about 50% cheaper than non-batch endpoints and have separate token limits, enabling increased throughput when used alongside regular endpoints. However, they require at least 24 hours to complete requests and provide lower overall throughput compared to non-batch endpoints, making them unsuitable for situations needing quick results.

Conversation batching in prompts during pipeline runtime

Batching multiple conversations for classification at once can significantly increase throughput and reduce token usage, but it may impact the accuracy of results. In our experiment with a domain classifier, classifying 10 conversations simultaneously led to an average of 15-20% of domain assignments changing between repeated runs of the same prompt. To address this, one mitigation approach is to use a grader LLM prompt: first classify the batch, then have the LLM identify any incorrectly classified conversations, and finally re-classify those as needed. While batching offers efficiency gains, it is important to monitor for potential drops in classification quality.

Combining classifiers in a single prompt

Combining multiple classifiers into a single prompt increases throughput by allowing one call to the LLM instead of multiple calls. This not only multiplies the overall throughput by the number of classifiers processed but also reduces the total number of tokens used, since the conversation text is only passed in once. However, this approach may compromise classification accuracy, so results should be closely monitored.

Classification using text embeddings

An alternative approach is to train custom neural network models for each classifier using only the text embeddings of conversations. This method delivers both cost and time savings by avoiding making multiple LLM requests for every classifier and conversation—instead, the system only needs to request conversation text embeddings once and can reuse these embeddings across all classifier models.

For example, starting with a set of conversations to validate and test the new model, run these conversations through the original prompt-based classifier to generate a set of golden classifications, then obtain text embeddings (using a tool like text-embedding-3-large) for each conversation. These embeddings and their corresponding classifications are used to train a model such as a multi-layer perceptron. In production, the workflow involves retrieving the text embedding for each conversation and passing it through the trained model; if there is a model for each classifier, a single embedding retrieval per conversation suffices for all classifiers.

The benefits of this approach include significantly increased throughput and cost savings—since it’s not necessary to call the LLM for every classifier and conversation. However, this setup can require GPU compute which can increase costs and infrastructure complexity, and the resulting models may not achieve the same accuracy as prompt-based classification methods.
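
A stripped-down version of this workflow, using a scikit-learn multi-layer perceptron on precomputed embeddings, might look like the following. The hidden-layer sizes and label names are assumptions, and the embedding retrieval (e.g., via text-embedding-3-large) is left to the caller.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def train_embedding_classifier(embeddings: np.ndarray, golden_labels: list[str]) -> MLPClassifier:
        # embeddings: one precomputed text embedding per conversation, reused
        # across all classifiers; golden_labels: classifications produced by the
        # original prompt-based classifier for the same conversations.
        model = MLPClassifier(hidden_layer_sizes=(256, 64), max_iter=500)
        model.fit(embeddings, golden_labels)
        return model

    # In production, retrieve each conversation's embedding once, then pass it
    # through every classifier's trained model:
    # label = model.predict(new_embedding.reshape(1, -1))[0]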

Prompt compression

Compressing prompts by eliminating unnecessary tokens or by using a tool such as LLMLingua (opens in new tab) to automate prompt compression can optimize classification prompts either ahead of time or in real-time. This approach increases overall throughput and results in cost savings due to a reduced number of tokens, but there are risks: changes to the classifier prompt or conversation text may impact classification accuracy, and depending on the compression technique, it could even decrease throughput if the compression process takes longer than simply sending uncompressed text to the LLM.

Text truncation

Truncating conversations to a specific length limits the overall number of tokens sent through an endpoint, offering cost savings and increased throughput like prompt compression. By reducing the number of tokens per request, throughput rises because more requests can be made before reaching the endpoint’s tokens-per-minute (TPM) limit, and costs decrease due to fewer tokens being processed. However, the ideal truncation length depends on both the classifiers and the conversation content, so it’s important to assess how truncation affects output quality before implementation. While this approach brings clear efficiency benefits, it also poses a risk: long conversations may have their most important content cut off, which can reduce classification accuracy.
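
As a rough sketch of token-based truncation using the tiktoken library (the encoding name and length are illustrative; pick the encoding that matches your model):

    import tiktoken

    def truncate_conversation(text: str, max_tokens: int = 2000) -> str:
        # Keep only the first max_tokens tokens of the conversation before
        # classification, trading tail context for throughput and cost.
        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode(text)
        return enc.decode(tokens[:max_tokens])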

Conclusion

Building a scalable, high-throughput pipeline for LLM-based classification is far from trivial. It requires navigating a constantly shifting landscape of model capabilities, prompt behaviors, and infrastructure constraints. As LLMs become faster, cheaper, and more capable, they’re unlocking new possibilities for real-time understanding of human-AI interactions at scale. The techniques we’ve shared represent a snapshot of what’s working today. But more importantly, they offer a foundation for what’s possible tomorrow.


The post Technical approach for classifying human-AI interactions at scale appeared first on Microsoft Research.

Categories: Microsoft

CollabLLM: Teaching LLMs to collaborate with users

Tue, 07/15/2025 - 20:00

Large language models (LLMs) can solve complex puzzles in seconds, yet they sometimes stumble over simple conversations. When these AI tools make assumptions, overlook key details, or neglect to ask clarifying questions, the result can erode trust and derail real-world interactions, where nuance is everything.

A key reason these models behave this way lies in how they’re trained and evaluated. Most benchmarks use isolated, single-turn prompts with clear instructions. Training methods tend to optimize for the model’s next response, not its contribution to a successful, multi-turn exchange. But real-world interaction is dynamic and collaborative. It relies on context, clarification, and shared understanding.

User-centric approach to training 

To address this, we’re exploring ways to train LLMs with users in mind. Our approach places models in simulated environments that reflect the back-and-forth nature of real conversations. Through reinforcement learning, these models improve through trial and error, for example, learning when to ask questions and how to adapt tone and communication style to different situations. This user-centric approach helps bridge the gap between how LLMs are typically trained and how people actually use them.  

This is the concept behind CollabLLM, recipient of an ICML Outstanding Paper Award (opens in new tab). This training framework helps LLMs improve through simulated multi-turn interactions, as illustrated in Figure 1. The core insight behind CollabLLM is simple: in a constructive collaboration, the value of a response isn’t just in its immediate usefulness, but in how it contributes to the overall success of the conversation. A clarifying question might seem like a delay but often leads to better outcomes. A quick answer might appear useful but can create confusion or derail the interaction.

Figure 1. Diagram comparing two training approaches for LLMs. (a) The standard method lacks user-agent collaboration and uses single-turn rewards, leading to an inefficient conversation. (b) In contrast, CollabLLM simulates multi-turn user-agent interactions during training, enabling it to learn effective collaboration strategies and produce more efficient dialogues.

CollabLLM puts this collaborative approach into practice with a simulation-based training loop, illustrated in Figure 2. At any point in a conversation, the model generates multiple possible next turns by engaging in a dialogue with a simulated user.

Figure 2: Simulation-based training process used in CollabLLM

The system uses a sampling method to extend conversations turn by turn, choosing likely responses for each participant (the AI agent or the simulated user), while adding some randomness to vary the conversational paths. The goal is to expose the model to a wide variety of conversational scenarios, helping it learn more effective collaboration strategies.


To each simulated conversation, we applied multiturn-aware reward (MR) functions, which assess how the model’s response at a given turn influences the entire trajectory of the conversation. We sampled multiple conversational follow-ups from the model, such as statements, suggestions, and questions, and used MR to assign a reward to each based on how well the conversation performed in later turns. We based these scores on automated metrics that reflect key factors like goal completion, conversational efficiency, and user engagement.

To score the sampled conversations, we used task-specific metrics and metrics from an LLM-as-a-judge framework, which supports efficient and scalable evaluation. For metrics like engagement, a judge model rates each sampled conversation on a scale from 0 to 1.

The MR of each model response was computed by averaging the scores of the sampled conversations that originate from that response. Based on this score, the model updates its parameters using established reinforcement learning algorithms such as Proximal Policy Optimization (PPO) or Direct Preference Optimization (DPO).
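
As a simplified sketch of how such a reward could be computed, the snippet below averages a weighted combination of goal completion, efficiency, and engagement over the sampled continuations. The weights, the efficiency formula, and the scoring callables are illustrative assumptions, not the paper’s exact formulation.

    def conversation_score(conv_turns, goal_completed, judge_engagement,
                           max_turns=20, weights=(0.5, 0.3, 0.2)):
        # Score one simulated continuation of the conversation.
        # goal_completed(conv_turns) and judge_engagement(conv_turns) return values
        # in [0, 1]; shorter conversations that finish the task score higher on efficiency.
        w_goal, w_eff, w_eng = weights
        efficiency = 1.0 - min(len(conv_turns), max_turns) / max_turns
        return (w_goal * goal_completed(conv_turns)
                + w_eff * efficiency
                + w_eng * judge_engagement(conv_turns))

    def multiturn_aware_reward(sampled_continuations, goal_completed, judge_engagement):
        # The reward for a candidate response is the average score of the
        # simulated conversations that continue from it.
        scores = [conversation_score(c, goal_completed, judge_engagement)
                  for c in sampled_continuations]
        return sum(scores) / len(scores)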

We tested CollabLLM through a combination of automated and human evaluations, detailed in the paper. One highlight is a user study involving 201 participants in a document co-creation task, shown in Figure 3. We compared CollabLLM to a baseline trained with single-turn rewards and to a second, more proactive baseline prompted to ask clarifying questions and take other proactive steps. CollabLLM outperformed both, producing higher-quality documents, better interaction ratings, and faster task completion times.

Figure 3: Results of the user study in a document co-creation task comparing CollabLLM to a baseline trained with single-turn rewards.

Designing for real-world collaboration

Much of today’s AI research focuses on fully automated tasks, with models working without input from or interaction with users. But many real-world applications depend on people in the loop: as users, collaborators, or decision-makers. Designing AI systems that treat user input not as a constraint but as essential leads to systems that are more accurate, more helpful, and ultimately more trustworthy.

This work is driven by a core belief: the future of AI depends not just on intelligence, but on the ability to collaborate effectively. And that means confronting the communication breakdowns in today’s systems.

We see CollabLLM as a step in that direction, training models to engage in meaningful multi-turn interactions, ask clarifying questions, and adapt to context. In doing so, we can build systems designed to work with people—not around them.


The post CollabLLM: Teaching LLMs to collaborate with users appeared first on Microsoft Research.

Categories: Microsoft
