Applicability vs. job displacement: further notes on our recent research on AI and occupations
Recently, we released a paper (Working with AI: Measuring the Occupational Implications of Generative AI) that studied what occupations might find AI chatbots useful, and to what degree. The paper sparked significant discussion, which is no surprise since people care deeply about the future of AI and jobs–that’s part of why we think it’s important to study these topics.
Unfortunately, not all the discussion was accurate in its portrayal of the study’s scope or conclusions. Specifically, our study does not draw any conclusions about jobs being eliminated; in the paper, we explicitly cautioned against using our findings to make that conclusion.
Given the importance of this topic, we want to clarify any misunderstandings and provide a more digestible summary of the paper, our methodology, and its limitations.
What did our research find?

We set out to better understand how people are using AI, highlighting where AI might be useful in different occupations. To do this, we analyzed how people currently use generative AI—specifically Microsoft Bing Copilot (now Microsoft Copilot)—to assist with tasks. We then compared these sets of tasks against the O*NET database, a widely used occupational classification system, to understand potential applicability to various occupations.
We found that AI is most useful for tasks related to knowledge work and communication, particularly tasks such as writing, gathering information, and learning.
Those in occupations with these tasks may benefit by considering how AI can be used as a tool to help improve their workflows. On the flip side, it’s not surprising that physical tasks like performing surgeries or moving objects had less direct AI chatbot applicability.
So, to summarize, our paper is about identifying the occupations where AI may be most useful, by assisting or performing subtasks. Our data do not indicate, nor did we suggest, that certain jobs will be replaced by AI.
Methodological limitations are acknowledged—and important

The paper is transparent about the limitations of our approach.
We analyzed anonymized Bing Copilot conversations to see what activities users are seeking AI assistance with and what activities AI can perform when mapped to the O*NET database. While O*NET provides a structured list of activities associated with various occupations, it does not capture the full spectrum of skills, context, and nuance required in the real world. A job is far more than the collection of tasks that make it up.
For example, a task might involve “writing reports,” but O*NET won’t reflect the interpersonal judgment, domain expertise, or ethical considerations that go into doing that well. The paper acknowledges this gap and warns against over-interpreting the AI applicability scores as measures of AI’s ability to perform an occupation.
Additionally, the dataset is based on user queries from Bing Copilot (from January – September 2024), which may be influenced by factors like awareness, access, or comfort with AI tools. Different people use different LLMs for different purposes, and it is also very difficult (often nearly impossible) to determine whether a given conversation takes place in a work context or for leisure.
Finally, we only evaluated AI chatbot usage, so this study does not evaluate the impact or applicability of other forms of AI.
Where do we go from here?

Given the intense interest in how AI will shape our collective future, it’s important we continue to study and better understand its societal and economic impact. As with all research on this topic, the findings are nuanced, and it’s important to pay attention to this nuance.
The public interest in our research is based, in large part, on the topic of AI and job displacement. However, the methodology of this study is not suited to drawing firm conclusions about displacement. AI may prove to be a useful tool for many occupations, and we believe the right balance lies in finding how to use the technology in a way that leverages its abilities while complementing human strengths and accounting for people’s preferences.
For more information from Microsoft on the future of work and AI skilling, check out Microsoft’s Annual Work Trend Index and Microsoft Elevate.
MindJourney enables AI to explore simulated 3D worlds to improve spatial interpretation
A new research framework helps AI agents explore three-dimensional spaces they can’t directly detect. Called MindJourney, the approach addresses a key limitation in vision-language models (VLMs), which give AI agents their ability to interpret and describe visual scenes.
While VLMs are strong at identifying objects in static images, they struggle to interpret the interactive 3D world behind 2D images. This gap shows up in spatial questions like “If I sit on the couch that is on my right and face the chairs, will the kitchen be to my right or left?”—tasks that require an agent to interpret its position and movement through space.
People overcome this challenge by mentally exploring a space, imagining moving through it and combining those mental snapshots to work out where objects are. MindJourney applies the same process to AI agents, letting them roam a virtual space before answering spatial questions.
How MindJourney navigates 3D space

To perform this type of spatial navigation, MindJourney uses a world model—in this case, a video generation system trained on a large collection of videos captured from a single moving viewpoint, showing actions such as going forward and turning left or right, much like a 3D cinematographer. From this, it learns to predict how a new scene would appear from different perspectives.
At inference time, the model can generate photo-realistic images of a scene based on possible movements from the agent’s current position. It generates multiple possible views of a scene while the VLM acts as a filter, selecting the constructed perspectives that are most likely to answer the user’s question.
These are kept and expanded in the next iteration, while less promising paths are discarded. This process, shown in Figure 1, avoids the need to generate and evaluate thousands of possible movement sequences by focusing only on the most informative perspectives.
Figure 1. Given a spatial reasoning query, MindJourney searches through the imagined 3D space using a world model and improves the VLM’s spatial interpretation through generated observations when encountering new challenges.
To make its search through a simulated space both effective and efficient, MindJourney uses a spatial beam search—an algorithm that prioritizes the most promising paths. It works within a fixed number of steps, each representing a movement. By balancing breadth with depth, spatial beam search enables MindJourney to gather strong supporting evidence. This process is illustrated in Figure 2.
Figure 2. The MindJourney workflow starts with a spatial beam search for a set number of steps before answering the query. The world model interactively generates new observations, while a VLM interprets the generated images, guiding the search throughout the process.

By iterating through simulation, evaluation, and integration, MindJourney can reason about spatial relationships far beyond what any single 2D image can convey, all without the need for additional training. On the Spatial Aptitude Training (SAT) benchmark, it improved the accuracy of VLMs by 8% over their baseline performance.
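To make the idea concrete, here is a minimal sketch of a spatial beam search over imagined viewpoints. The world_model and vlm callables, the action set, and their interfaces are assumptions made for illustration; they are not MindJourney's actual API.

```python
ACTIONS = ["forward", "turn_left", "turn_right"]

def spatial_beam_search(question, start_view, world_model, vlm, beam_width=3, max_steps=4):
    # Each beam entry: (score, action_sequence, imagined_view)
    beams = [(vlm.score(question, start_view), [], start_view)]
    for _ in range(max_steps):
        candidates = []
        for score, actions, view in beams:
            for action in ACTIONS:
                new_view = world_model.render(view, action)   # imagined observation
                new_score = vlm.score(question, new_view)     # how informative is this view?
                candidates.append((new_score, actions + [action], new_view))
        # Keep only the most promising imagined paths; discard the rest.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    best_views = [view for _, _, view in beams]
    return vlm.answer(question, [start_view] + best_views)    # answer from gathered evidence
```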
Building smarter agents

MindJourney showed strong performance on multiple 3D spatial-reasoning benchmarks, and even advanced VLMs improved when paired with its imagination loop. This suggests that the spatial patterns that world models learn from raw images, combined with the symbolic capabilities of VLMs, create a more complete spatial capability for agents. Together, they enable agents to infer what lies beyond the visible frame and interpret the physical world more accurately.
It also demonstrates that pretrained VLMs and trainable world models can work together in 3D without retraining either one—pointing toward general-purpose agents capable of interpreting and acting in real-world environments. This opens the way to possible applications in autonomous robotics, smart home technologies, and accessibility tools for people with visual impairments.
By converting systems that simply describe static images into active agents that continually evaluate where to look next, MindJourney connects computer vision with planning. Because exploration occurs entirely within the model’s latent space—its internal representation of the scene—robots would be able to test multiple viewpoints before determining their next move, potentially reducing wear, energy use, and collision risk.
Looking ahead, we plan to extend the framework to use world models that not only predict new viewpoints but also forecast how the scene might change over time. We envision MindJourney working alongside VLMs that interpret those predictions and use them to plan what to do next. This enhancement could enable agents to interpret spatial relationships and physical dynamics more accurately, helping them operate effectively in changing environments.
Dion: the distributed orthonormal update revolution is here
Training AI models requires choosing an optimizer, and for nearly a decade, Adam(W) has been the optimizer of choice. Given that durability and success, it was fair to doubt that any further improvement was possible. And yet, last December, a new optimizer called Muon showed serious promise by powering a nanoGPT speedrun. This proved out, with multiple AI labs (e.g., Kimi-AI and Essential-AI) reporting 2x scale improvements and the release of the 1T parameter Kimi K2 model. Restated: you can train a model to similar performance with half as many GPUs.
There’s one fly in the ointment: Muon requires large matrix multiplications in the optimizer, which demand heavy communication in large models at the scale where FSDP and TP parallelization become desirable. Going back to the inspiration for Muon, the key idea is an orthonormal update, which sparked the search for more scalable alternative linear algebras realizing the same goal. That’s exactly what Dion is. We have open-sourced this new optimizer to enable anyone to train large models more efficiently at scale.
What’s an orthonormal update?

Figure 1. Illustration of matrix parameters

At the core of Transformers, a set of input activations is multiplied by a learned weight matrix to produce a new set of output activations. When the weight matrix is updated during training, the resulting change in the output activations generally depends on the direction of the input activations. As a result, the learning rate must be chosen conservatively to accommodate the input direction that induces the largest change. Orthonormalized updates alter this behavior by (approximately) making the change in output activations invariant to the direction of the input. This is achieved by enforcing orthonormality on the update matrix, thereby equalizing its effect across all input directions.
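As a rough illustration of the concept (not the Muon or Dion implementation), an exactly orthonormalized update can be obtained by replacing the update matrix with the product of its singular vectors, which sets all of its singular values to one:

```python
import torch

def orthonormalize(update: torch.Tensor) -> torch.Tensor:
    # Replace the update M = U S V^T with U V^T, so every singular value becomes 1
    # and all input directions are scaled equally. Muon approximates this with
    # Newton-Schulz iterations; Dion restricts it to the top-r singular directions.
    u, _, vh = torch.linalg.svd(update, full_matrices=False)
    return u @ vh

# Example: take an orthonormalized step on a weight matrix.
w = torch.randn(512, 256)
grad = torch.randn_like(w)
w = w - 0.02 * orthonormalize(grad)
```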
What is Dion?

While Muon has shown strong empirical results, scaling it to very large models poses challenges. As reported by Essential AI, applying Muon to large architectures like LLaMA-3 becomes compute-bound—and potentially communication-bound—due to the cost of the Newton–Schulz orthonormalization steps.
Figure 2. Pseudocode of the centralized version of Dion

This is where Dion enters. At a high level, Dion introduces a new axis for scalability: the rank. Specifically, for a given rank r, Dion orthonormalizes only the top r directions of the singular vector space, reducing communication and compute overhead while preserving performance. Empirically, we observe that the rank needed for good performance grows much more slowly than the number of parameters in larger models.
Dion implements orthonormalization using amortized power iteration. Power iteration typically pulls out the largest singular value by repeated matrix multiplication. By amortizing this process over optimization steps—applied to the slowly evolving momentum matrix—we reduce the cost to just two matrix multiplications per step. Incorporating a QR decomposition allows us to extract an approximate orthonormal basis spanning the top singular directions, rather than just the leading one. This amortized power iteration is fully compatible with standard distributed training techniques such as FSDP and tensor parallelism. Here, we show a simple centralized version, but the technique works for more complex forms of parallelization as presented in the paper. In other words, we can orthogonalize a matrix without ever seeing a full row or column of it.
Low-rank approximation would ordinarily introduce error, but Dion overcomes this through an error feedback mechanism. This keeps the residual of the low-rank approximation in the momentum matrix, so that any systematic gradient structure not captured initially accumulates and is eventually applied in a future update.
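A simplified, single-device sketch of these two ideas, amortized power iteration with a QR step plus error feedback, is shown below. Variable names, scaling, and momentum handling are illustrative assumptions and differ from the actual Dion algorithm and its distributed implementation.

```python
import torch

def dion_like_update(grad, state, lr=0.01, mu=0.95):
    # Illustrative sketch of a rank-r orthonormalized update via amortized power
    # iteration plus error feedback. Not the actual Dion algorithm: scaling,
    # momentum decay, and all distributed logic are omitted or simplified.
    M, Q = state["momentum"], state["Q"]      # M: (m, n) momentum, Q: (n, r) right basis
    M = M + grad                              # fold the new gradient into momentum

    # Amortized power iteration: two matmuls against the slowly evolving momentum.
    P = M @ Q                                 # (m, r) estimate of the top left directions
    P, _ = torch.linalg.qr(P)                 # QR gives an orthonormal basis for them
    R = M.T @ P                               # (n, r) corresponding right factor

    # Error feedback: keep what the low-rank update did not capture, so systematic
    # structure accumulates in M and is applied in later steps.
    M = M - mu * (P @ R.T)

    Q_new, _ = torch.linalg.qr(R)             # orthonormal right basis for the next step
    state["momentum"], state["Q"] = M, Q_new
    return lr * (P @ Q_new.T)                 # low-rank, approximately orthonormal step

# Toy usage with a hypothetical (m x n) weight matrix and rank r.
m, n, r = 64, 32, 8
state = {"momentum": torch.zeros(m, n), "Q": torch.linalg.qr(torch.randn(n, r))[0]}
W = torch.randn(m, n)
W = W - dion_like_update(torch.randn(m, n), state)
```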
How does it work?

Something very strange happened in our experiments. Usually, adding an extra constraint on the way an algorithm works can be expected to decrease overall performance. And indeed, at the 120M parameter scale of the speedrun, we see Dion’s update taking more time than Muon, while not yielding any significant gains. But at larger scales, we observed a different trend: Dion began to outperform Muon.
Figure 3. Wall-clock time speedup of Dion for 3B model training

Why would adding a constraint improve the update rule? The answer lies in what the constraint enforces. Dion achieves a much closer approximation to true orthonormalization than Muon. This precision, initially subtle, becomes increasingly important as the number of singular vectors grows. Over increasing model scale and training steps, this small advantage accumulates—leading to a measurable improvement in performance.
This edge further grows with batch size—with larger batches the update quality tends to degrade, but notably more slowly with Dion than Muon (and Muon is already a significant improvement over AdamW).
Figure 4. Scaling of Dion across different batch sizes

Here you can see how the number of steps needed to reach a given pretraining loss, relative to AdamW, varies as batch size grows, for full-rank and ¼-rank Dion (in orange) and Muon (in blue).
In our experiments, these benefits extend to various post-training regimes as well.
We also experimented with rank, discovering empirically that larger models tolerate smaller rank well.
Figure 5. Low-rank Dion across different model sizes

Projecting this trend out to the scale of the LLaMA-3 405B parameter models suggests that Dion is fully effective even with rank fractions as low as 1/16 or 1/64 for large dense models like LLaMA-3.
Using hardware timings of the individual update steps suggests a story that looks like this:
Figure 6. Estimated wall-clock time of each optimizer step for Llama 3 405B. Lower is better. Muon is highlighted in orange as our baseline, next to Dion with varying rank fractions. Suggested rank fractions for a 405B parameter model are shown in blue. Using Dion with rank fraction 1/16 or lower offers an order-of-magnitude speedup over Muon.

We’ve open-sourced a PyTorch FSDP2 + Tensor Parallel (TP) implementation of Dion, available via a simple pip install. Our goal is to make faster training with Dion accessible to everyone. As a bonus, the repository also includes a PyTorch FSDP2 implementation of Muon.
Acknowledgements

We thank Riashat Islam and Pratyusha Sharma for their helpful feedback on the writing and presentation.
Self-adaptive reasoning for science
Long-running LLM agents equipped with strong reasoning, planning, and execution skills have the potential to transform scientific discovery with high-impact advancements, such as developing new materials or pharmaceuticals. As these agents become more autonomous, ensuring effective human oversight and clear accountability becomes increasingly important, presenting challenges that must be addressed to unlock their full transformative power. Today’s approaches to long-term reasoning are established during the post-training phase, prior to end-user deployment and typically by the model provider. As a result, the expected actions of these agents are pre-baked by the model developer, offering little to no control from the end user.
At Microsoft, we are pioneering a vision for a continually steerable virtual scientist. In line with this vision, we created the ability to have a non-reasoning model develop thought patterns that allow for control and customizability by scientists. Our approach, a cognitive loop via in-situ optimization (CLIO), does not rely on reinforcement learning post-training to develop reasoning patterns yet still yields equivalent performance as demonstrated through our evaluation on Humanity’s Last Exam (HLE). Notably, we increased OpenAI GPT-4.1’s base model accuracy on text-only biology and medicine from 8.55% to 22.37%, an absolute increase of 13.82% (161.64% relative), surpassing o3 (high). This demonstrates that an optimization-based, self-adaptive AI system developed without further post-training can rival post-trained models in domains where adaptability, explainability, and control matter most.
Figure 1. Head-to-head comparison of OpenAI’s GPT-4.1 with CLIO, o3, and GPT-4.1 with no tools on HLE biology and medicine questions

In-situ optimization with internal self-reflection to enable self-adaptive reasoning

Model development has advanced from reinforcement learning from human feedback (RLHF) for answer alignment to external grading via reinforcement learning with verifiable rewards (RLVR). Recent approaches show promise in the utilization of intrinsic rewards for training reasoning models (RLIR). Traditionally, these reasoning processes are learned during the post-training process before any user interaction. While today’s reasoning models require additional data in the training phase and limit user control during the reasoning generation process, CLIO’s approach enables users to steer reasoning from scratch without additional data. Rather, CLIO generates its own necessary data by creating reflection loops at runtime. These reflection loops are utilized for a wide array of activities that CLIO self-defines, encompassing idea exploration, memory management, and behavior control. Most interesting is CLIO’s ability to leverage prior inferences to adjust future behaviors, handling uncertainties and raising flags for correction when necessary. Through this open architecture approach to reasoning, we alleviate the necessity for further model post-training to achieve desired reasoning behavior. Novel scientific discovery often has no prior established patterns for reasoning, much less a large enough corpus of high-quality data to train on.
CLIO reasons by continuously reflecting on progress, generating hypotheses, and evaluating multiple discovery strategies. For the HLE test, CLIO was specifically steered to follow the scientific method as a guiding framework. Our research shows that equipping language models with self-adapting reasoning enhances their problem-solving ability. It provides a net benefit in quality for science questions, as well as providing exposure and control to the end user.
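To make the idea of runtime reflection loops more concrete, the sketch below shows one way such a loop could be structured. The function names, prompts, uncertainty parsing, and stopping convention are hypothetical illustrations, not CLIO's actual implementation.

```python
import re

def parse_uncertainty(text: str) -> float:
    # Pull the first 0-1 number out of the reflection text (illustrative helper).
    match = re.search(r"\b(0(?:\.\d+)?|1(?:\.0+)?)\b", text)
    return float(match.group(1)) if match else 1.0

def cognitive_loop(question, llm, max_steps=5, uncertainty_threshold=0.3):
    # llm is any callable mapping a prompt string to generated text.
    state = {"hypotheses": [], "memory": [], "uncertainty_flags": []}
    for step in range(max_steps):
        # Propose hypotheses and next actions, steered by the scientific method.
        hypotheses = llm(
            f"Question: {question}\nMemory so far: {state['memory']}\n"
            "Propose hypotheses and next steps, following the scientific method."
        )
        # Reflect in place: critique the proposal and estimate uncertainty (0-1).
        reflection = llm(f"Critique these hypotheses and state an uncertainty in [0, 1]:\n{hypotheses}")
        state["hypotheses"].append(hypotheses)
        state["memory"].append(reflection)
        if parse_uncertainty(reflection) > uncertainty_threshold:
            # Raise a flag for the user instead of silently continuing.
            state["uncertainty_flags"].append((step, reflection))
        if "FINAL" in reflection:   # assumed stopping convention
            break
    answer = llm(f"Synthesize a final answer to '{question}' from:\n{state['hypotheses']}")
    return answer, state
```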
Figure 2. CLIO can raise key areas of uncertainty within its self-formulated reasoning process, balancing multiple different viewpoints using graph structures.

Control over uncertainty: Building trust in AI

Orchestrated reasoning systems like CLIO are valuable for scientific discovery, as they provide features beyond accuracy alone. Capabilities such as explaining the outcomes of internal reasoning are standard in the scientific field and are present in current reasoning model approaches. However, elements like displaying complete work (final outcomes, internal thought processes, and uncertainty thresholds that support reproducibility or correction) and indicating uncertainty are not yet universally implemented. Current models and systems do not have this same innate humility. Rather, we are left with models that produce confident results, whether correct or incorrect. When correct, it is valuable. When incorrect, it is dangerous to the scientific process. Hence, understanding a model or system’s uncertainty is a crucial aspect that we have developed natively into CLIO.
On the other end of the spectrum, orchestrated reasoning systems tend to oversaturate the user by raising too many flags. We enable prompt-free control knobs within CLIO to set thresholds for raising uncertainty flags. This allows CLIO to flag uncertainty for itself and the end user at the proper point in time. This also enables scientists to revisit CLIO’s reasoning path with critiques, edit beliefs during the reasoning process, and re-execute them from the desired point in time. Ultimately, this builds a foundational level of trust with scientists to use them in a scientifically defensible and rigorous way.
How does CLIO perform?

We evaluate CLIO against text-based biology and medicine questions from HLE. For this domain, we demonstrate a 61.98% relative increase or an 8.56% net increase in accuracy over OpenAI’s o3 and substantially outperform base completion models like OpenAI’s GPT-4.1, while enabling the requisite explainability and control. This technique applies to all models, showing similar increases in OpenAI’s GPT-4o model, which we observe performs poorly on HLE-level questions. On average, GPT-4.1 is not considered competent for HLE-scale questions (<9%), and GPT-4o is natively at less than 2%. By utilizing CLIO, we bring these to near state-of-the-art performance against top reasoning models. CLIO’s recursive nature enables the system to think more broadly and deeply, ensuring coverage of the question when answered. In GPT-4.1, we see an increase of 5.92% in accuracy for overall performance using just the cognitive loop recursion. To think more deeply, we allow CLIO to ensemble different evolutions and intelligently choose from the best approach using GraphRAG. This extension of the cognition pattern provides a further 7.90% over a non-ensembled approach.
Figure 3. The impact of thinking effort on CLIO’s effectiveness.

Furthermore, CLIO’s design offers different knobs of control, for example, how much time to think and which technique to utilize for a given problem. In Figure 3, we demonstrate these knobs of control and their effect on GPT-4.1 and GPT-4o’s performance. In this case, we analyze performance for a subset of biomedical questions, those focused on immunology. CLIO increases GPT-4o’s base performance to be at par with the best reasoning models for immunology questions. We observe a 13.60% improvement over the base model, GPT-4o. This result shows CLIO to be model agnostic, similar to the Microsoft AI Diagnostic Orchestrator (MAI-DxO) approach and its corresponding performance boost.
Implications for science and trustworthy discovery

The future of scientific discovery demands more than reasoning over knowledge and raw computational power alone. Here, we demonstrate how CLIO not only increases model performance but establishes new layers of control for scientists. In our upcoming work, we will demonstrate how CLIO increases tool utility for highly valuable scientific questions in the drug discovery space, which requires precise tools designed for the language of science. While our experiments focus on scientific discovery, we believe CLIO can be applied in a domain-agnostic fashion. Experts tackling problems in domains such as financial analysis, engineering, and legal services could potentially benefit from AI systems with a transparent, steerable reasoning approach. Ultimately, we envision CLIO as an enduring control layer in hybrid AI stacks that combine traditional completion and reasoning models with external memory systems and advanced tool calling. The continuous checks and balances that CLIO enables will remain valuable even as components within the AI stack evolve. This combination of intelligent, steerable scientific decision making and tool optimization is the basis of the recently announced Microsoft Discovery platform.
At Microsoft, we’re committed to advancing AI research that earns the trust of scientists, empowering them to discover new frontiers of knowledge. Our work is a testament to what’s possible when we blend innovation with trustworthiness and a human-centered vision for the future of AI-assisted scientific discovery. We invite the research and scientific community to join us in shaping that future.
Further information:
To learn more details about our approach, please read our pre-print paper published alongside this blog. We are in the process of submitting this work for external peer review and encourage partners to explore the utilization of CLIO in Microsoft Discovery. To learn more about Microsoft’s research on this or contact our team, please reach out to discoverylabs@microsoft.com.
Acknowledgements

We are grateful for Jason Zander and Nadia Karim’s support. We extend our thanks to colleagues both inside and outside Microsoft Discovery and Quantum for sharing their insights and feedback, including Allen Stewart, Yasser Asmi, David Marvin, Harsha Nori, Scott Lundberg, and Phil Waymouth.
Project Ire autonomously identifies malware at scale
Today, we are excited to introduce an autonomous AI agent that can analyze and classify software without assistance, a step forward in cybersecurity and malware detection. The prototype, Project Ire, automates what is considered the gold standard in malware classification: fully reverse engineering a software file without any clues about its origin or purpose. It uses decompilers and other tools, reviews their output, and determines whether the software is malicious or benign.
Project Ire emerged from a collaboration between Microsoft Research, Microsoft Defender Research, and Microsoft Discovery & Quantum, bringing together security expertise, operational knowledge, data from global malware telemetry, and AI research. It is built on the same collaborative and agentic foundation behind GraphRAG and Microsoft Discovery. The system uses advanced language models and a suite of callable reverse engineering and binary analysis tools to drive investigation and adjudication.
As of this writing, Project Ire has achieved a precision of 0.98 and a recall of 0.83 using public datasets of Windows drivers. It was the first reverse engineer at Microsoft, human or machine, to author a conviction case—a detection strong enough to justify automatic blocking—for a specific advanced persistent threat (APT) malware sample, which has since been identified and blocked by Microsoft Defender.
Malware classification at a global scale

Microsoft’s Defender platform scans more than one billion monthly active devices through the company’s Defender suite of products, which routinely require manual review of software by experts.
This kind of work is challenging. Analysts often face error and alert fatigue, and there’s no easy way to compare and standardize how different people review and classify threats over time. For both of these reasons, today’s overloaded experts are vulnerable to burnout, a well-documented issue in the field.
Unlike other AI applications in security, malware classification lacks a computable validator. The AI must make judgment calls without definitive validation beyond expert review. Many behaviors found in software, like reverse engineering protections, don’t clearly indicate whether a sample is malicious or benign.
This ambiguity requires analysts to investigate each sample incrementally, building enough evidence to determine whether it’s malicious or benign despite opposition from adaptive, active adversaries. This has long made it difficult to automate and scale what is inherently a complex and expensive process.
Technical foundation

Project Ire attempts to address these challenges by acting as an autonomous system that uses specialized tools to reverse engineer software. The system’s architecture allows for reasoning at multiple levels, from low-level binary analysis to control flow reconstruction and high-level interpretation of code behavior.
Its tool-use API enables the system to update its understanding of a file using a wide range of reverse engineering tools, including Microsoft memory analysis sandboxes based on Project Freta, custom and open-source tools, documentation search, and multiple decompilers.
Reaching a verdict

The evaluation process begins with triage, where automated reverse engineering tools identify the file type, its structure, and potential areas of interest. From there, the system reconstructs the software’s control flow graph using frameworks such as angr and Ghidra, building a graph that forms the backbone of Project Ire’s memory model and guides the rest of the analysis.
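As a rough illustration of this kind of control-flow recovery (not Project Ire's internal tooling), a CFG can be reconstructed with the open-source angr framework along these lines; the sample path is hypothetical.

```python
import angr

# Hypothetical sample path; analysis options are illustrative.
proj = angr.Project("sample.bin", auto_load_libs=False)

# Statically recover a control flow graph over the binary's basic blocks.
cfg = proj.analyses.CFGFast()
print("basic blocks:", cfg.graph.number_of_nodes())

# Recovered functions live in the knowledge base and can be enumerated
# for downstream, per-function analysis and summarization.
for func in proj.kb.functions.values():
    print(hex(func.addr), func.name)
```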
Through iterative function analysis, the LLM calls specialized tools through an API to identify and summarize key functions. Each result feeds into a “chain of evidence,” a detailed, auditable trail that shows how the system reached its conclusion. This traceable evidence log supports secondary review by security teams and helps refine the system in cases of misclassification.
To verify its findings, Project Ire can invoke a validator tool that cross-checks claims in the report against the chain of evidence. This tool draws on expert statements from malware reverse engineers on the Project Ire team. Drawing on this evidence and its internal model, the system creates a final report and classifies the sample as malicious or benign.
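To illustrate what a chain of evidence and a validator check might look like in code, here is a hypothetical sketch; the record fields, matching logic, and expert-statement store are assumptions, not Project Ire's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceEntry:
    function_name: str   # decompiled function that was analyzed
    tool: str            # reverse engineering tool that produced the finding
    finding: str         # summarized behavior, e.g. "terminates antivirus processes"

@dataclass
class AnalysisReport:
    sha256: str
    verdict: str                     # "malicious" or "benign"
    claims: list[str] = field(default_factory=list)
    evidence: list[EvidenceEntry] = field(default_factory=list)

def validate(report: AnalysisReport, expert_statements: list[str]) -> list[str]:
    """Return claims that are not backed by the evidence chain or expert statements."""
    def supported(claim: str) -> bool:
        in_evidence = any(e.finding.lower() in claim.lower() for e in report.evidence)
        endorsed = any(s.lower() in claim.lower() for s in expert_statements)
        return in_evidence or endorsed
    return [c for c in report.claims if not supported(c)]
```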
Preliminary testing shows promise

Two early evaluations tested Project Ire’s effectiveness as an autonomous malware classifier. In the first, we assessed Project Ire on a dataset of publicly accessible Windows drivers, some known to be malicious, others benign. Malicious samples came from the Living off the Land Drivers database, which includes a collection of Windows drivers used by attackers to bypass security controls, while known benign drivers were sourced from Windows Update.
This classifier performed well, correctly identifying 90% of all files and flagging only 2% of benign files as threats. It achieved a precision of 0.98 and a recall of 0.83. This low false-positive rate suggests clear potential for deployment in security operations, alongside expert reverse engineering reviews.
For each file it analyzes, Project Ire generates a report that includes an evidence section, summaries of all examined code functions, and other technical artifacts.
Figures 1 and 2 present reports for two successful malware classification cases generated during testing. The first involves a kernel-level rootkit, Trojan:Win64/Rootkit.EH!MTB. The system identified several key features, including jump-hooking, process termination, and web-based command and control. It then correctly flagged the sample as malicious.
Figure 1 Analysis

The binary contains a function named ‘MonitorAndTerminateExplorerThread_16f64’ that runs an infinite loop waiting on synchronization objects and terminates system threads upon certain conditions. It queries system or process information, iterates over processes comparing their names case-insensitively to ‘Explorer.exe’, and manipulates registry values related to ‘Explorer.exe’. This function appears to monitor and potentially terminate or manipulate the ‘Explorer.exe’ process, a critical Windows shell process. Such behavior is suspicious and consistent with malware that aims to disrupt or control system processes.
Another function, ‘HttpGetRequestAndResponse_174a4’, performs HTTP GET requests by parsing URLs, resolving hostnames, opening sockets, sending requests, and reading responses. This network communication capability could be leveraged for command and control or data exfiltration, common in malware.
The binary also includes a function ‘PatchProcessEntryPointWithHook_12b5c’ that patches the entry point of a process by writing a hook or trampoline that redirects execution to a specified address. This technique is commonly used for process injection or hooking, allowing malware to alter process behavior or inject malicious code.
Other functions related to sending IOCTL requests to device drivers were identified, but their maliciousness could not be conclusively determined without additional context.
Overall, the binary exhibits multiple indicators of malicious behavior, including process manipulation, network communication, and code injection techniques, suggesting it is likely malware designed to interfere with system processes and communicate with remote servers.
Figure 1. Project Ire report, sample with SHA256: 86047bb1969d1db455493955fd450d18c62a3f36294d0a6c3732c88dfbcc4f62

The second sample, HackTool:Win64/KillAV!MTB, was designed to disable antivirus software. Project Ire correctly identified the code that locates and disables antivirus programs, providing evidence that the file was malicious.
In one section of the code, however, the system misidentified a function as anti-debugging behavior. To maintain accuracy, the system used the validator tool to flag the claim as unsupported. The issue was later resolved by updating decompiler rules, but this example illustrates how Project Ire navigates uncertainty during analysis. Figure 2 shows the corresponding report.
Figure 2 Analysis

The binary contains several functions indicative of malicious intent. The function register_and_log_known_processes_140001000 logs and registers process names associated with antivirus and security software, such as ‘avp.exe’, ‘avpui.exe’, and ‘360Tray.exe’. It calls another function, TerminateProcessesByNameSubstring_1400010f4, which enumerates system processes and terminates those whose names contain specified substrings. This behavior is typical of malware attempting to disable or evade security software by killing their processes.
Another function, check_and_handle_special_state_14000502c, performs checks on a global variable and triggers software interrupts if certain conditions are not met. While the exact purpose of these interrupts (int 0x29 and int 0x3) is unclear, they could represent an anti-debug or anti-analysis mechanism to detect or interfere with debugging or tampering attempts. However, this assumption could not be fully validated against expert statements.
Other functions include initialization routines and simple logging wrappers, but the core malicious behavior centers on process termination targeting security software. This indicates the binary is designed to compromise system security by disabling protective processes, a hallmark of malware such as trojans or rootkits.
Figure 2. Project Ire report, sample with SHA256: b6cb163089f665c05d607a465f1b6272cdd5c949772ab9ce7227120cf61f971a

Real-world evaluation with Microsoft Defender

The more demanding test involved nearly 4,000 “hard-target” files not classified by automated systems and slated for manual review by expert reverse engineers.
In this real-world scenario, Project Ire operated fully autonomously on files created after the language models’ training cutoff, files that no other automated tools at Microsoft could classify at the time.
The system achieved a high precision score of 0.89, meaning nearly 9 out of 10 files flagged malicious were correctly identified as malicious. Recall was 0.26, indicating that under these challenging conditions, the system detected roughly a quarter of all actual malware.
The system correctly identified many of the malicious files, with few false alarms, just a 4% false positive rate. While overall performance was moderate, this combination of accuracy and a low error rate suggests real potential for future deployment.
Looking ahead

Based on these early successes, the Project Ire prototype will be leveraged inside Microsoft’s Defender organization as Binary Analyzer for threat detection and software classification.
Our goal is to scale the system’s speed and accuracy so that it can correctly classify files from any source, even on first encounter. Ultimately, our vision is to detect novel malware directly in memory, at scale.
Acknowledgements

Project Ire acknowledges the following additional developers who contributed to the results in this publication: Dayenne de Souza, Raghav Pande, Ryan Terry, Shauharda Khadka, and Bob Fleck, for their independent review of the system.
The system incorporates multiple tools, including the angr framework developed by Emotion Labs. Microsoft has collaborated extensively with Emotion Labs, a pioneer in cyber autonomy, throughout the development of Project Ire, and thanks them for the innovations and insights that contributed to the successes reported here.
VeriTrail: Detecting hallucination and tracing provenance in multi-step AI workflows
Many applications of language models (LMs) involve generating content based on source material, such as answering questions, summarizing information, and drafting documents. A critical challenge for these applications is that LMs may produce content that is not supported by the source text – a phenomenon known as “closed-domain hallucination.”1
Existing methods for detecting closed-domain hallucination typically compare a given LM output to the source text, implicitly assuming that there is only a single output to evaluate. However, applications of LMs increasingly involve processes with multiple generative steps: LMs generate intermediate outputs that serve as inputs to subsequent steps and culminate in a final output. Many agentic workflows follow this paradigm (e.g., each agent is responsible for a specific document or sub-task, and their outputs are synthesized into a final response).
In our paper “VeriTrail: Closed-Domain Hallucination Detection with Traceability,” we argue that, given the complexity of processes with multiple generative steps, detecting hallucination in the final output is necessary but not sufficient. We also need traceability, which has two components:
- Provenance: if the final output is supported by the source text, we should be able to trace its path through the intermediate outputs to the source.
- Error Localization: if the final output is not supported by the source text, we should be able to trace where the error was likely introduced.
Our paper presents VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for processes with any number of generative steps. We also demonstrate that VeriTrail outperforms baseline methods commonly used for hallucination detection. In this blog post, we provide an overview of VeriTrail’s design and performance.2
VeriTrail’s hallucination detection process

A key idea leveraged by VeriTrail is that a wide range of generative processes can be represented as a directed acyclic graph (DAG). Each node in the DAG represents a piece of text (i.e., source material, an intermediate output, or the final output) and each edge from node A to node B indicates that A was used as an input to produce B. Each node is assigned a unique ID, as well as a stage reflecting its position in the generative process.
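For illustration, such a DAG could be represented with a structure like the one below; the class and field names are assumptions, not VeriTrail's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    stage: int                 # position in the generative process (1 = source text)
    text: str                  # source chunk, intermediate output, or final output
    inputs: list[int] = field(default_factory=list)   # IDs of nodes used to produce this one

# Example: a tiny three-stage process (source chunk -> summary -> final answer).
nodes = {
    1: Node(1, stage=1, text="Source text chunk..."),
    2: Node(2, stage=2, text="Intermediate summary...", inputs=[1]),
    3: Node(3, stage=3, text="Final answer...", inputs=[2]),
}

def input_nodes(node_id: int) -> list[Node]:
    # Edges point from inputs to outputs; verification walks them in reverse.
    return [nodes[i] for i in nodes[node_id].inputs]
```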
An example of a process with multiple generative steps is GraphRAG. A DAG representing a GraphRAG run is illustrated in Figure 1, where the boxes and arrows correspond to nodes and edges, respectively.3
Figure 1: GraphRAG splits the source text into chunks (Stage 1). For each chunk, an LM extracts entities and relationships (the latter are denoted by “⭤ “), along with short descriptions (Stage 2). If an entity or a relationship was extracted from multiple chunks, an LM summarizes the descriptions (Stage 3). A knowledge graph is constructed from the final set of entities and relationships, and a community detection algorithm, such as Leiden clustering, groups entities into communities. For each community, an LM generates a “community report” that summarizes the entities and relationships (Stage 4). To answer a user’s question, an LM generates “map-level answers” based on groups of community reports (Stage 5), then synthesizes them into a final answer (Stage 6).

VeriTrail takes as input a DAG representing a completed generative process and aims to determine whether the final output is fully supported by the source text. It begins by extracting claims (i.e., self-contained, verifiable statements) from the final output using Claimify. VeriTrail verifies claims in the reverse order of the generative process: it starts from the final output and moves toward the source text. Each claim is verified separately. Below, we include two case studies that illustrate how VeriTrail works, using the DAG from Figure 1.
Case study 1: A “Fully Supported” claim

Figure 2: Left: GraphRAG as a DAG. Right: VeriTrail’s hallucination detection process for a “Fully Supported” claim.

Figure 2 shows an example of a claim that VeriTrail determined was not hallucinated:
- In Iteration 1, VeriTrail identified the nodes that were used as inputs for the final answer: Nodes 15 and 16. Each identified node was split into sentences, and each sentence was programmatically assigned a unique ID.
- An LM then performed Evidence Selection, selecting all sentence IDs that strongly implied the truth or falsehood of the claim. The LM also generated a summary of the selected sentences (not shown in Figure 2). In this example, a sentence was selected from Node 15.
- Next, an LM performed Verdict Generation. If no sentences had been selected in the Evidence Selection step, the claim would have been assigned a “Not Fully Supported” verdict. Instead, an LM was prompted to classify the claim as “Fully Supported,” “Not Fully Supported,” or “Inconclusive” based on the evidence. In this case, the verdict was “Fully Supported.”
- Since the verdict in Iteration 1 was “Fully Supported,” VeriTrail proceeded to Iteration 2. It considered the nodes from which at least one sentence was selected in the latest Evidence Selection step (Node 15) and identified their input nodes (Nodes 12 and 13). VeriTrail repeated Evidence Selection and Verdict Generation for the identified nodes. Once again, the verdict was “Fully Supported.” This process – identifying candidate nodes, performing Evidence Selection and Verdict Generation – was repeated in Iteration 3, where the verdict was still “Fully Supported,” and likewise in Iteration 4.
- In Iteration 4, a single source text chunk was verified. Since the source text, by definition, does not have any inputs, verification terminated and the verdict was deemed final.
Case study 2: A “Not Fully Supported” claim

Figure 3 provides an example of a claim where VeriTrail identified hallucination (a simplified sketch of the overall verification loop follows this case study):
- In Iteration 1, VeriTrail identified the nodes used as inputs for the final answer: Nodes 15 and 16. After Evidence Selection and Verdict Generation, the verdict was “Not Fully Supported.” Users can configure the maximum number of consecutive “Not Fully Supported” verdicts permitted. If the maximum had been set to 1, verification would have terminated here, and the verdict would have been deemed final. Let’s assume the maximum was set to 2, meaning that VeriTrail had to perform at least one more iteration.
- Even though evidence was selected only from Node 15 in Iteration 1, VeriTrail checked the input nodes for both Node 15 and Node 16 (i.e., Nodes 12, 13, and 14) in Iteration 2. Recall that in Case Study 1 where the verdict was “Fully Supported,” VeriTrail only checked the input nodes for Node 15. Why was the “Not Fully Supported” claim handled differently? If the Evidence Selection step overlooked relevant evidence, the “Not Fully Supported” verdict might be incorrect. In this case, continuing verification based solely on the selected evidence (i.e., Node 15) would propagate the mistake, defeating the purpose of repeated verification.
- In Iteration 2, Evidence Selection and Verdict Generation were repeated for Nodes 12, 13, and 14. Once again, the verdict was “Not Fully Supported.” Since this was the second consecutive “Not Fully Supported” verdict, verification terminated and the verdict was deemed final.
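Putting the two case studies together, the verification loop can be sketched roughly as follows, reusing the Node structure from the earlier sketch. The select_evidence and generate_verdict helpers stand in for LM prompts and are illustrative, not VeriTrail's implementation.

```python
def verify_claim(claim, nodes, final_output_id, select_evidence, generate_verdict,
                 max_not_supported=2):
    # Verify one claim by walking the DAG from the final output toward the source text.
    verdict = "Inconclusive"
    frontier = [final_output_id]
    consecutive_not_supported = 0
    while frontier:
        # Candidates for this iteration: the inputs of the current frontier nodes.
        candidates = sorted({i for nid in frontier for i in nodes[nid].inputs})
        if not candidates:                       # reached the source text; verdict is final
            return verdict
        selected = select_evidence(claim, [nodes[c] for c in candidates])  # [(node_id, sentence), ...]
        verdict = generate_verdict(claim, selected)  # "Fully Supported" / "Not Fully Supported" / "Inconclusive"
        if verdict == "Not Fully Supported":
            consecutive_not_supported += 1
            if consecutive_not_supported >= max_not_supported:
                return verdict
            # Evidence may have been overlooked, so re-check ALL candidates' inputs next.
            frontier = candidates
        else:
            consecutive_not_supported = 0
            # Narrow the search: only follow nodes that actually contributed evidence.
            frontier = sorted({node_id for node_id, _ in selected})
    return verdict
```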
Providing traceability

In addition to assigning a final “Fully Supported,” “Not Fully Supported,” or “Inconclusive” verdict to each claim, VeriTrail returns (a) all Verdict Generation results and (b) an evidence trail composed of all Evidence Selection results: the selected sentences, their corresponding node IDs, and the generated summaries. Collectively, these outputs provide traceability:
- Provenance: For “Fully Supported” and “Inconclusive” claims, the evidence trail traces a path from the source material to the final output, helping users understand how the output may have been derived. For example, in Case Study 1, the evidence trail consists of Sentence 8 from Node 15, Sentence 11 from Node 13, Sentence 26 from Node 4, and Sentence 79 from Node 1.
- Error Localization: For “Not Fully Supported” claims, VeriTrail uses the Verdict Generation results to identify the stage(s) of the process where the unsupported content was likely introduced. For instance, in Case Study 2, where none of the verified intermediate outputs supported the claim, VeriTrail would indicate that the hallucination occurred in the final answer (Stage 6). Error stage identification helps users address hallucinations and understand where in the process they are most likely to occur.
The evidence trail also helps users verify the verdict: instead of reading through all nodes – which may be infeasible for processes that generate large amounts of text – users can simply review the evidence sentences and summaries.
Key design features

VeriTrail’s design prioritizes reliability, efficiency, scalability, and user agency. Notable features include:
- During Evidence Selection (introduced in Case Study 1), the sentence IDs returned by the LM are checked against the programmatically assigned IDs. If a returned ID does not match an assigned ID, it is discarded; otherwise, it is mapped to its corresponding sentence. This approach guarantees that the sentences included in the evidence trail are not hallucinated.
- After a claim is assigned an interim “Fully Supported” or “Inconclusive” verdict (as in Case Study 1), VeriTrail verifies the input nodes of only the nodes from which evidence was previously selected – not all possible input nodes. By progressively narrowing the search space, VeriTrail limits the number of nodes the LM must evaluate. In particular, since VeriTrail starts from the final output and moves toward the source text, it tends to verify a smaller proportion of nodes as it approaches the source text. Nodes closer to the source text tend to be larger (e.g., a book chapter should be larger than its summary), so verifying fewer of them helps reduce computational cost.
- VeriTrail is designed to handle input graphs with any number of nodes, regardless of whether they fit in a single prompt. Users can specify an input size limit per prompt. For Evidence Selection, inputs that exceed the limit are split across multiple prompts. If the resulting evidence exceeds the input size limit for Verdict Generation, VeriTrail reruns Evidence Selection to compress the evidence further. Users can configure the maximum number of Evidence Selection reruns.
- The configurable maximum number of consecutive “Not Fully Supported” verdicts (introduced in Case Study 2) allows the user to find their desired balance between computational cost and how conservative VeriTrail is in flagging hallucinations. A lower maximum reduces cost by limiting the number of checks. A higher maximum increases confidence that a flagged claim is truly hallucinated since it requires repeated confirmation of the “Not Fully Supported” verdict.
We tested VeriTrail on two datasets covering distinct generative processes (hierarchical summarization4 and GraphRAG), tasks (summarization and question-answering), and types of source material (fiction novels and news articles). For the source material, we focused on long documents and large collections of documents (i.e., >100K tokens), where hallucination detection is especially challenging and processes with multiple generative steps are typically most valuable. The resulting DAGs were much more complex than the examples provided above (e.g., in one of the datasets, the average number of nodes was 114,368).
We compared VeriTrail to three types of baseline methods commonly used for closed-domain hallucination detection: Natural Language Inference models (AlignScore and INFUSE); Retrieval-Augmented Generation; and long-context models (Gemini 1.5 Pro and GPT-4.1 mini). Across both datasets and all language models tested, VeriTrail outperformed the baseline methods in detecting hallucination.5
Most importantly, VeriTrail traces claims through intermediate outputs – unlike the baseline methods, which directly compare the final output to the source material. As a result, it can identify where hallucinated content was likely introduced and how faithful content may have been derived from the source. By providing traceability, VeriTrail brings transparency to generative processes, helping users understand, verify, debug, and, ultimately, trust their outputs.
For an in-depth discussion of VeriTrail, please see our paper “VeriTrail: Closed-Domain Hallucination Detection with Traceability.”
1 The term “closed-domain hallucination” was introduced by OpenAI in the GPT-4 Technical Report.
2 VeriTrail is currently used for research purposes only and is not available commercially.
3 We focus on GraphRAG’s global search method.
4 In hierarchical summarization, an LM summarizes each source text chunk individually, then the resulting summaries are repeatedly grouped and summarized until a final summary is produced (Wu et al., 2021; Chang et al., 2023).
5 The only exception was the mistral-large-2411 model, where VeriTrail had the highest balanced accuracy, but not the highest macro F1 score.
Xinxing Xu bridges AI research and real-world impact at Microsoft Research Asia – Singapore
AI has made remarkable progress in recent years, but turning experimental models into tools that work in the real world is still a major challenge. Bridging this gap between innovation and application has shaped the career of Xinxing Xu, principal researcher at Microsoft Research Asia – Singapore, and underpins the mission of the lab’s newly established presence in the region.
Xinxing Xu, Principal Researcher, Microsoft Research Asia – Singapore

“Innovative algorithms can only demonstrate their true value when tested with real-world data and in actual scenarios, where they can be continuously optimized through iteration,” he says.
Xu’s commitment to balancing algorithmic innovation with practical application has shaped his entire career. During his PhD studies at Nanyang Technological University, Singapore, Xu focused on emerging technologies like multiple kernel learning methods and multimodal machine learning. Today he’s applying these techniques to real-world use cases like image recognition and video classification.
After completing his doctorate, he joined the Institute of High Performance Computing at Singapore’s Agency for Science, Technology and Research (A*STAR), where he worked on interdisciplinary projects ranging from medical image recognition to AI systems for detecting defects on building facades. These experiences broadened his perspective and deepened his passion for translating AI into real-world impact.
In 2024, Xu joined Microsoft Research Asia where he began a new chapter focused on bridging between academic research and real-world AI applications.
“Microsoft Research Asia is committed to integrating scientific exploration with real-world applications, which creates a unique research environment,” Xu says. “It brings together top talent and resources, and Microsoft’s engineering and product ecosystem strongly supports turning research into impactful technology. The lab’s open and inclusive culture encourages innovation with broader societal impact. It reflects the approach to research I’ve always hoped to contribute to.”
Bringing cross-domain expertise to AI’s real-world frontiers

As a key hub in Microsoft Research’s network across Asia, the Singapore lab is guided by a three-part mission: to drive industry-transforming AI deployment, pursue fundamental breakthroughs in the field, and promote responsible, socially beneficial applications of the technology.
To reach these goals, Xu and his colleagues are working closely with local collaborators, combining cross-disciplinary expertise to tackle complex, real-world challenges. One key focus is healthcare, where Xu leads a collaboration with Singapore’s SingHealth to explore how AI can support precision medicine. By combining SingHealth’s clinical data with advanced AI models, the team aims to deliver more personalized analyses and sharper diagnostic tools—laying the groundwork for improved patient outcomes.
Beyond healthcare, the team is also targeting key sectors like finance and logistics. By developing domain-specific foundation models and AI agents, they aim to support smarter decision-making and accelerate digital transformation across industries. “Singapore has a strong foundation in these sectors,” Xu notes, “making it an ideal environment for technology validation and iteration.”
The team is also partnering with leading academic institutions, including the National University of Singapore (NUS) and Nanyang Technological University, Singapore (NTU Singapore), to advance the field of spatial intelligence. Their goal is to develop embodied intelligence systems capable of carrying out complex tasks in smart environments.
As AI becomes more deeply embedded in everyday life, researchers at the Singapore lab are also increasingly focused on what they call “societal AI”—building AI systems that are culturally relevant and trustworthy within Southeast Asia’s unique cultural and social contexts. In collaboration with global colleagues, they’re helping to advance a more culturally grounded and responsible approach to AI research in the region.
Microsoft Research Asia – Singapore: Expanding global reach, connecting regional innovationRealizing AI’s full potential requires more than technical breakthroughs. It also depends on collaboration—across industries, academia, and policy. Only through this intersection of forces can AI move beyond the lab to deliver meaningful societal value.
Singapore’s strengths in science, engineering, and digital governance make it an ideal setting for this kind of work. Its collaborative culture, robust infrastructure, international talent pool, and strong policy support for science and technology make it fertile ground for interdisciplinary research.
This is why Microsoft Research Asia continues to collaborate closely with Singapore’s top universities, research institutions, and industry partners. These partnerships support joint research, talent development, and technical exchange. Building on this foundation, Microsoft Research Asia – Singapore will further deepen its collaboration with NUS, NTU Singapore, and Singapore Management University (SMU) to advance both fundamental and applied research, while equipping the next generation of researchers with real-world experience. In addition, Microsoft Research Asia is fostering academic exchange and strengthening the research ecosystem through summer schools and joint workshops with NUS, NTU Singapore, and SMU.
The launch of the Singapore lab further marks an important step in expanding the company’s global research footprint, serving as a bridge between regional innovation and Microsoft’s global ecosystem. Through its integrated lab network, Microsoft Research fosters the sharing of technologies, methods, and real-world insights, creating a virtuous cycle of innovation.
“We aim to build a research hub in Singapore that is globally connected and deeply rooted in the local ecosystem,” Xu says. “Many breakthroughs come from interdisciplinary and cross-regional collaboration. By breaking boundaries—across disciplines, industries, and geographies—we can drive research that has lasting impact.”
As AI becomes more deeply woven into industry and everyday life, Xu believes that meaningful research must be closely connected to regional development and social well-being. “Microsoft Research Asia – Singapore is a future-facing lab,” he says. “While we push technological frontiers, we’re equally committed to the responsibility of technology—ensuring AI can help address society’s most pressing challenges.”
In a world shaped by global challenges, Xu sees collaboration and innovation as essential to real progress. With Singapore as a launchpad, he and his team are working to extend AI’s impact and value across Southeast Asia and beyond.
Xinxing Xu (center) with colleagues at Microsoft Research Asia – Singapore
Three essential strengths for the next generation of AI researchersAI’s progress depends not only on technical breakthroughs but also on the growth and dedication of talent. At Microsoft Research Asia, there is a strong belief that bringing research into the real world requires more than technical coordination—it depends on unlocking the full creativity and potential of researchers.
In Singapore—a regional innovation hub that connects Southeast Asia—Xu and his colleagues are working to push AI beyond the lab and into fields like healthcare, finance, and manufacturing. For young researchers hoping to shape the future of AI, this is a uniquely powerful stage.
To help guide the next generation, Xu shares three pieces of advice:
- Build a strong foundation – “Core knowledge in machine learning, linear algebra, and probability and statistics is the bedrock of AI research,” Xu says. “A solid theoretical base is essential to remain competitive in a rapidly evolving field. Even today’s hottest trends in generative AI rely on longstanding principles of optimization and model architecture design.” While code generation tools are on the rise, Xu emphasizes that mathematical fundamentals remain essential for understanding and innovating in AI.
- Understand real-world applications – Technical skills alone aren’t enough. Xu encourages young researchers to deeply engage with the problems they’re trying to solve. Only by tightly integrating technology with its context can researchers create truly valuable solutions.
“In healthcare, for example, researchers may need to follow doctors in clinics to gain a true understanding of clinical workflows. That context helps identify the best entry points for AI deployment. Framing research problems around real-world needs is often more impactful than just tuning model parameters,” Xu says.
- Develop interdisciplinary thinking – Cross-disciplinary collaboration is becoming essential to AI innovation. Xu advises young researchers to learn how to work with experts from other fields to explore new directions together. “These kinds of interactions often spark fresh, creative ideas,” he says.
Maintaining curiosity is just as important. “Being open to new technologies and fields is what enables researchers to continually break new ground and produce original results.”
Xu extends an open invitation to aspiring researchers from all backgrounds to join Microsoft Research Asia – Singapore. “We offer a unique platform that blends cutting-edge research with real-world impact,” he says. “It’s a place where you can work on the frontiers of AI—and see how your work can help transform industries and improve lives.”
To learn more about current openings at the Singapore lab, please visit our careers page (opens in new tab).
Technical approach for classifying human-AI interactions at scale
As large language models (LLMs) become foundational to modern AI systems, the ability to run them at scale—efficiently, reliably, and in near real-time—is no longer a nice-to-have. It’s essential. The Semantic Telemetry project tackles this challenge by applying LLM-based classifiers to hundreds of millions of sampled, anonymized Bing Chat conversations each week. These classifiers extract signals like user expertise, primary topic, and satisfaction, enabling deeper insight into human-AI interactions and driving continuous system improvement.
But building a pipeline that can handle this volume isn’t just about plugging into an API. It requires a high-throughput, high-performance architecture that can orchestrate distributed processing, manage token and prompt complexity, and gracefully handle the unpredictability of remote LLM endpoints.
In this latest post in our series on Semantic Telemetry, we’ll walk through the engineering behind that system—how we designed for scale from the start, the trade-offs we made, and the lessons we learned along the way. From batching strategies to token optimization and orchestration, we’ll share what it takes to build a real-time LLM classification pipeline.
For additional project background: Semantic Telemetry: Understanding how users interact with AI systems and Engagement, user expertise, and satisfaction: Key insights from the Semantic Telemetry Project.
System architecture highlightsThe Semantic Telemetry pipeline (opens in new tab) is a highly scalable, highly configurable data transformation pipeline. While it follows a familiar ETL structure, several architectural innovations make it uniquely suited for high-throughput LLM integration:
- Hybrid compute engine
The pipeline combines the distributed power of PySpark with the speed and simplicity of Polars, enabling it to scale across large datasets or run lightweight jobs in Spark-less environments—without code changes.
- LLM-centric transformation layer
At the core of the pipeline is a multi-stage transformation process tailored for running across multiple LLM endpoints, in which:
- The transformation is model-agnostic: a generic LLM interface is provided, with model-specific adapters built on top of it.
- Prompt templates are defined using the Prompty language specification for consistency and reuse, with options for users to include custom prompts.
- Parsing and cleaning logic ensures structured, schema-aligned outputs even when LLM responses are imperfect, for example by removing extra characters from the output, resolving inexact label matches (e.g., “create” versus “created”), and relabeling invalid classifications.
The pipeline supports multiple classification tasks (e.g., user expertise, topic, satisfaction) through modular prompt templates and configurable execution paths—making it easy to adapt to new use cases or environments.
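To make the parsing-and-cleaning step concrete, here is a minimal sketch of label normalization in Python. The label sets, similarity cutoff, and function names are illustrative assumptions; the production pipeline’s schemas are not described in this post.

```python
import difflib
from typing import Optional

# Illustrative label schemas; the real pipeline's taxonomies are not described in this post.
ALLOWED_LABELS = {
    "expertise": ["novice", "intermediate", "expert"],
    "satisfaction": ["satisfied", "neutral", "dissatisfied"],
}

def normalize_label(raw_output: str, task: str) -> Optional[str]:
    """Map an imperfect LLM response onto the task's label schema."""
    allowed = ALLOWED_LABELS[task]
    # Strip extra characters the model sometimes adds (quotes, punctuation, whitespace).
    cleaned = raw_output.strip().strip("\"'` .").lower()
    if cleaned in allowed:
        return cleaned
    # Resolve inexact matches (e.g., "experts" or "expert level" vs. "expert").
    close = difflib.get_close_matches(cleaned, allowed, n=1, cutoff=0.6)
    if close:
        return close[0]
    # Anything still unmatched is relabeled as invalid by the caller.
    return None

print(normalize_label(' "Expert." ', "expertise"))  # -> expert
```

The same pattern extends to any classifier added through the modular prompt templates: each task simply supplies its own allowed label set.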
Engineering challenges & solutionsBuilding a high-throughput, LLM-powered classification pipeline at scale introduced a range of engineering challenges—from managing latency and token limits to ensuring system resilience. Below are the key hurdles we encountered and how we addressed them.
LLM endpoint latency & variabilityChallenge: LLM endpoints, especially those hosted remotely (e.g., Azure OpenAI), introduce unpredictable latency due to model load, prompt complexity, and network variability. This made it difficult to maintain consistent throughput across the pipeline.
Solution: We implemented a combination of:
- Multiple Azure OpenAI endpoints in rotation to increase throughput and distribute workload. We can analyze throughput and redistribute as needed.
- Saving output in intervals to write data asynchronously in case of network errors.
- Utilizing models with higher tokens-per-minute (TPM) limits, such as OpenAI’s GPT-4o mini. GPT-4o mini has a 2M TPM limit, a 25x throughput increase over GPT-4 (80K TPM -> 2M TPM).
- Timeouts and retries with exponential backoff (a minimal sketch of the rotation-and-retry pattern follows this list).
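As referenced above, endpoint rotation, timeouts, and exponential backoff can be combined in one small helper. This is a minimal sketch under stated assumptions: the endpoint URLs and the `call_endpoint` placeholder stand in for the pipeline’s actual Azure OpenAI client code.

```python
import asyncio
import itertools
import random

# Hypothetical deployment endpoints; the production pipeline rotates across several
# Azure OpenAI deployments and rebalances based on observed throughput.
ENDPOINTS = ["https://east.example.openai.azure.com", "https://west.example.openai.azure.com"]
_endpoint_cycle = itertools.cycle(ENDPOINTS)

async def call_endpoint(endpoint: str, prompt: str) -> str:
    """Placeholder for the actual chat-completion request to one deployment."""
    raise NotImplementedError

async def classify_with_retries(prompt: str, max_retries: int = 5, timeout_s: float = 60.0) -> str:
    """Rotate endpoints and back off exponentially on timeouts or transient failures."""
    delay = 1.0
    for attempt in range(max_retries):
        endpoint = next(_endpoint_cycle)  # spread load across deployments
        try:
            return await asyncio.wait_for(call_endpoint(endpoint, prompt), timeout=timeout_s)
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter so concurrent retries don't synchronize.
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
            delay *= 2
    raise RuntimeError("unreachable")
```

Writing partial results to storage at regular intervals (the second bullet) then ensures that a transient network failure never discards an entire batch of completed classifications.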
Model version changesChallenge: Each new LLM release—such as Phi, Mistral, DeepSeek, and successive generations of GPT (e.g., GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o)—brings improvements, but also subtle behavioral shifts. These changes can affect classification consistency, output formatting, and even the interpretation of prompts. Maintaining alignment with baseline expectations across models became a moving target.
Solution: We developed a model evaluation workflow to test prompt alignment across LLM versions:
- Small-sample testing: We ran the pipeline on a representative sample using the new model and compared the output distribution to a known baseline.
- Distribution analysis: If the new model’s output aligned closely, we scaled up testing. If not, we iteratively tuned the prompts and re-ran comparisons.
- Interpretation flexibility: We also recognized that a shift in distribution isn’t always a regression. Sometimes it reflects a more accurate or nuanced classification, especially as models improve.
To support this process, we used tools like Sammo (opens in new tab), which allowed us to compare outputs across multiple models and prompt variants. This helped us quantify the impact of prompt changes and model upgrades and make informed decisions about when to adopt a new model or adjust our classification schema.
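A minimal sketch of the small-sample distribution check might look like the following; the labels, counts, and the 0.10 threshold are illustrative assumptions rather than values from the project.

```python
from collections import Counter

def label_distribution(labels: list[str]) -> dict[str, float]:
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in counts}

def total_variation_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the L1 distance between two label distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Labels produced by the baseline model and a candidate model on the same sample (illustrative).
baseline = ["novice"] * 60 + ["intermediate"] * 30 + ["expert"] * 10
candidate = ["novice"] * 55 + ["intermediate"] * 33 + ["expert"] * 12

tvd = total_variation_distance(label_distribution(baseline), label_distribution(candidate))
if tvd > 0.10:  # threshold is a judgment call, tuned per classifier
    print(f"Distribution shift {tvd:.2f}: review prompts before scaling up")
else:
    print(f"Distribution shift {tvd:.2f}: proceed to larger-sample testing")
```

As noted above, a shift that exceeds the threshold is a prompt for investigation, not automatically a regression.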
Dynamic concurrency scaling for LLM callsChallenge: LLM endpoints frequently encounter rate limits and inconsistent response times under heavy usage. The models’ speeds can also vary, complicating the selection of optimal concurrency levels. Furthermore, users may choose suboptimal settings due to lack of familiarity, and default concurrency configurations are rarely ideal for every situation. Dynamic adjustments based on throughput, measured in various ways, can assist in determining optimal concurrency levels.
Solution: We implemented a dynamic concurrency control mechanism that proactively adjusts the number of parallel LLM calls based on real-time system behavior:
- External task awareness: The system monitors the number of parallel tasks running across the pipeline (e.g., Spark executors or async workers) and uses this to inform the initial concurrency level.
- Success/failure rate monitoring: The system tracks the rolling success and failure rates of LLM calls. A spike in failures triggers a temporary reduction in concurrency, while sustained success allows for gradual ramp-up.
- Latency-based feedback loop: Rather than waiting for rate-limit errors, the system measures the response time of each LLM call. If latency increases, it reduces concurrency; if latency decreases and success rates remain high, it cautiously scales up. (A simplified version of this controller is sketched below.)
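Below is a simplified sketch of such a controller. The window size, success-rate thresholds, and latency rule are illustrative assumptions, not the production settings.

```python
from collections import deque

class ConcurrencyController:
    """Adjust the number of parallel LLM calls from rolling success and latency signals."""

    def __init__(self, initial: int = 8, minimum: int = 1, maximum: int = 64, window: int = 50):
        self.limit = initial
        self.minimum, self.maximum = minimum, maximum
        self.results = deque(maxlen=window)    # True/False per completed call
        self.latencies = deque(maxlen=window)  # seconds per completed call

    def record(self, success: bool, latency_s: float) -> None:
        self.results.append(success)
        self.latencies.append(latency_s)

    def adjust(self) -> int:
        if len(self.results) < self.results.maxlen:
            return self.limit  # not enough signal yet
        success_rate = sum(self.results) / len(self.results)
        half = len(self.latencies) // 2
        earlier = sum(list(self.latencies)[:half]) / half
        recent = sum(list(self.latencies)[half:]) / half
        if success_rate < 0.9 or recent > 1.5 * earlier:
            # Failures or rising latency: back off before rate limits hit.
            self.limit = max(self.minimum, self.limit // 2)
        elif success_rate > 0.98 and recent <= earlier:
            # Healthy endpoint: cautiously ramp up.
            self.limit = min(self.maximum, self.limit + 1)
        return self.limit
```

In practice the initial limit would also be seeded from the external task count mentioned in the first bullet.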
Optimization experimentsTo further improve throughput and efficiency, we ran a series of optimization experiments. Each approach came with trade-offs that we carefully measured.
Batch endpoints (Azure/OpenAI)Batch endpoints are a cost-effective, moderately high-throughput way of executing LLM requests. Batch endpoints process large lists of LLM prompts over a 24-hour period, recording responses in a file. They are about 50% cheaper than non-batch endpoints and have separate token limits, enabling increased throughput when used alongside regular endpoints. However, they require at least 24 hours to complete requests and provide lower overall throughput compared to non-batch endpoints, making them unsuitable for situations needing quick results.
Conversation batching in prompts during pipeline runtimeBatching multiple conversations for classification at once can significantly increase throughput and reduce token usage, but it may impact the accuracy of results. In our experiment with a domain classifier, classifying 10 conversations simultaneously led to an average of 15-20% of domain assignments changing between repeated runs of the same prompt. To address this, one mitigation approach is to use a grader LLM prompt: first classify the batch, then have the LLM identify any incorrectly classified conversations, and finally re-classify those as needed. While batching offers efficiency gains, it is important to monitor for potential drops in classification quality.
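A rough sketch of the batch-then-grade pattern is shown below; the prompt wording and the `call_llm` placeholder are assumptions for illustration, not the prompts used in the experiment.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a single chat-completion call."""
    raise NotImplementedError

def classify_batch(conversations: list[str], labels: list[str]) -> str:
    """Classify several conversations in one prompt instead of one call each."""
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(conversations))
    prompt = (
        f"Assign one domain from {labels} to each conversation below. "
        "Answer with one line per conversation in the form '<index>: <domain>'.\n\n" + numbered
    )
    return call_llm(prompt)

def grade_batch(conversations: list[str], assignments: str) -> str:
    """Ask a grader prompt to flag assignments that look wrong; those get re-classified individually."""
    prompt = (
        "Review the domain assignments below against the conversations and "
        "list the indices that appear misclassified (or 'none').\n\n"
        f"Assignments:\n{assignments}\n\nConversations:\n" + "\n\n".join(conversations)
    )
    return call_llm(prompt)
```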
Combining classifiers in a single promptCombining multiple classifiers into a single prompt increases throughput by allowing one call to the LLM instead of multiple calls. This not only multiplies the overall throughput by the number of classifiers processed but also reduces the total number of tokens used, since the conversation text is only passed in once. However, this approach may compromise classification accuracy, so results should be closely monitored.
Classification using text embeddingsAn alternative approach is to train custom neural network models for each classifier using only the text embeddings of conversations. This method delivers both cost and time savings by avoiding making multiple LLM requests for every classifier and conversation—instead, the system only needs to request conversation text embeddings once and can reuse these embeddings across all classifier models.
For example, starting with a set of conversations to validate and test the new model, run these conversations through the original prompt-based classifier to generate a set of golden classifications, then obtain text embeddings (using a tool like text-embedding-3-large) for each conversation. These embeddings and their corresponding classifications are used to train a model such as a multi-layer perceptron. In production, the workflow involves retrieving the text embedding for each conversation and passing it through the trained model; if there is a model for each classifier, a single embedding retrieval per conversation suffices for all classifiers.
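A minimal sketch of that workflow, assuming scikit-learn for the downstream model and a placeholder `embed` function standing in for the embedding service, might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder: return one embedding per conversation (e.g., from text-embedding-3-large)."""
    raise NotImplementedError

def train_embedding_classifier(conversations: list[str], golden_labels: list[str]) -> MLPClassifier:
    """Fit a small MLP on conversation embeddings against prompt-classifier 'golden' labels."""
    X = embed(conversations)  # embeddings are retrieved once and reused across all classifiers
    X_train, X_val, y_train, y_val = train_test_split(X, golden_labels, test_size=0.2, random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(256, 64), max_iter=300, random_state=0)
    clf.fit(X_train, y_train)
    print("agreement with prompt-based classifier:", clf.score(X_val, y_val))
    return clf
```

At inference time, one embedding lookup per conversation feeds every per-classifier model, which is where the throughput and cost savings come from.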
The benefits of this approach include significantly increased throughput and cost savings—since it’s not necessary to call the LLM for every classifier and conversation. However, this setup can require GPU compute which can increase costs and infrastructure complexity, and the resulting models may not achieve the same accuracy as prompt-based classification methods.
Prompt compressionCompressing prompts by eliminating unnecessary tokens or by using a tool such as LLMLingua (opens in new tab) to automate prompt compression can optimize classification prompts either ahead of time or in real-time. This approach increases overall throughput and results in cost savings due to a reduced number of tokens, but there are risks: changes to the classifier prompt or conversation text may impact classification accuracy, and depending on the compression technique, it could even decrease throughput if the compression process takes longer than simply sending uncompressed text to the LLM.
Text truncationTruncating conversations to a specific length limits the overall number of tokens sent through an endpoint, offering cost savings and increased throughput like prompt compression. By reducing the number of tokens per request, throughput rises because more requests can be made before reaching the endpoint’s tokens-per-minute (TPM) limit, and costs decrease due to fewer tokens being processed. However, the ideal truncation length depends on both the classifiers and the conversation content, so it’s important to assess how truncation affects output quality before implementation. While this approach brings clear efficiency benefits, it also poses a risk: long conversations may have their most important content cut off, which can reduce classification accuracy.
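As a concrete illustration, truncation can be done on token counts rather than characters. The sketch below assumes the tiktoken tokenizer and an arbitrary 2,000-token budget; as noted above, the right budget should be validated per classifier.

```python
import tiktoken

ENCODER = tiktoken.get_encoding("cl100k_base")  # assumption: a GPT-4-class tokenizer

def truncate_conversation(text: str, max_tokens: int = 2000) -> str:
    """Keep only the first max_tokens tokens of a conversation before classification."""
    tokens = ENCODER.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return ENCODER.decode(tokens[:max_tokens])
```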
ConclusionBuilding a scalable, high-throughput pipeline for LLM-based classification is far from trivial. It requires navigating a constantly shifting landscape of model capabilities, prompt behaviors, and infrastructure constraints. As LLMs become faster, cheaper, and more capable, they’re unlocking new possibilities for real-time understanding of human-AI interactions at scale. The techniques we’ve shared represent a snapshot of what’s working today. But more importantly, they offer a foundation for what’s possible tomorrow.
CollabLLM: Teaching LLMs to collaborate with users
Large language models (LLMs) can solve complex puzzles in seconds, yet they sometimes struggle with simple conversations. When these AI tools make assumptions, overlook key details, or neglect to ask clarifying questions, the result can erode trust and derail real-world interactions, where nuance is everything.
A key reason these models behave this way lies in how they’re trained and evaluated. Most benchmarks use isolated, single-turn prompts with clear instructions. Training methods tend to optimize for the model’s next response, not its contribution to a successful, multi-turn exchange. But real-world interaction is dynamic and collaborative. It relies on context, clarification, and shared understanding.
User-centric approach to trainingTo address this, we’re exploring ways to train LLMs with users in mind. Our approach places models in simulated environments that reflect the back-and-forth nature of real conversations. Through reinforcement learning, these models improve through trial and error, for example, learning when to ask questions and how to adapt tone and communication style to different situations. This user-centric approach helps bridge the gap between how LLMs are typically trained and how people actually use them.
This is the concept behind CollabLLM (opens in new tab), recipient of an ICML (opens in new tab) Outstanding Paper Award (opens in new tab). This training framework helps LLMs improve through simulated multi-turn interactions, as illustrated in Figure 1. The core insight behind CollabLLM is simple: in a constructive collaboration, the value of a response isn’t just in its immediate usefulness, but in how it contributes to the overall success of the conversation. A clarifying question might seem like a delay but often leads to better outcomes. A quick answer might appear useful but can create confusion or derail the interaction.
Figure 1. Diagram comparing two training approaches for LLMs. (a) The standard method lacks user-agent collaboration and uses single-turn rewards, leading to an inefficient conversation. (b) In contrast, CollabLLM simulates multi-turn user-agent interactions during training, enabling it to learn effective collaboration strategies and produce more efficient dialogues.
CollabLLM puts this collaborative approach into practice with a simulation-based training loop, illustrated in Figure 2. At any point in a conversation, the model generates multiple possible next turns by engaging in a dialogue with a simulated user.
Figure 2: Simulation-based training process used in CollabLLM
The system uses a sampling method to extend conversations turn by turn, choosing likely responses for each participant (the AI agent or the simulated user), while adding some randomness to vary the conversational paths. The goal is to expose the model to a wide variety of conversational scenarios, helping it learn more effective collaboration strategies.
To each simulated conversation, we applied multiturn-aware reward (MR) functions, which assess how the model’s response at the given turn influences the entire trajectory of the conversation. We sampled multiple conversational follow-ups from the model, such as statements, suggestions, and questions, and used MR to assign a reward to each based on how well the conversation performed in later turns. We based these scores on automated metrics that reflect key factors like goal completion, conversational efficiency, and user engagement.
To score the sampled conversations, we used task-specific metrics and metrics from an LLM-as-a-judge framework, which supports efficient and scalable evaluation. For metrics like engagement, a judge model rates each sampled conversation on a scale from 0 to 1.
The MR of each model response was computed by averaging the scores of the sampled conversations that originated from that response. Based on this score, the model updates its parameters using established reinforcement learning algorithms like Proximal Policy Optimization (PPO) or Direct Preference Optimization (DPO).
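A highly simplified sketch of the multiturn-aware reward computation is shown below. The `simulate_continuation` and `score_conversation` placeholders stand in for the simulated-user rollouts and the metric/judge scoring described above; this is not CollabLLM’s actual implementation.

```python
def simulate_continuation(history: list[str], candidate_response: str) -> list[str]:
    """Placeholder: roll the conversation forward with a simulated user and the model."""
    raise NotImplementedError

def score_conversation(conversation: list[str]) -> float:
    """Placeholder: combine goal completion, efficiency, and judge-rated engagement into [0, 1]."""
    raise NotImplementedError

def multiturn_aware_reward(history: list[str], candidate_response: str, num_samples: int = 4) -> float:
    """Average forward-looking scores over sampled continuations of one candidate response."""
    rollouts = [simulate_continuation(history, candidate_response) for _ in range(num_samples)]
    return sum(score_conversation(r) for r in rollouts) / num_samples

# The resulting reward is then fed to PPO or DPO to update the model's parameters.
```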
We tested CollabLLM through a combination of automated and human evaluations, detailed in the paper. One highlight is a user study involving 201 participants in a document co-creation task, shown in Figure 3. We compared CollabLLM to a baseline trained with single-turn rewards and to a second, more proactive baseline prompted to ask clarifying questions and take other proactive steps. CollabLLM outperformed both, producing higher-quality documents, better interaction ratings, and faster task completion times.
Figure 3: Results of the user study in a document co-creation task comparing CollabLLM to a baseline trained with single-turn rewards.
Designing for real-world collaborationMuch of today’s AI research focuses on fully automated tasks, with models working without input from or interaction with users. But many real-world applications depend on people in the loop: as users, collaborators, or decision-makers. Designing AI systems that treat user input not as a constraint, but as essential, leads to systems that are more accurate, more helpful, and ultimately more trustworthy.
This work is driven by a core belief: the future of AI depends not just on intelligence, but on the ability to collaborate effectively. And that means confronting the communication breakdowns in today’s systems.
We see CollabLLM as a step in that direction, training models to engage in meaningful multi-turn interactions, ask clarifying questions, and adapt to context. In doing so, we can build systems designed to work with people—not around them.
PadChest-GR: A bilingual grounded radiology reporting benchmark for chest X-rays
In our ever-evolving journey to enhance healthcare through technology, we’re announcing a unique new benchmark for grounded radiology report generation—PadChest-GR (opens in new tab). The world’s first multimodal, bilingual sentence-level radiology report dataset, developed by the University of Alicante with Microsoft Research, University Hospital Sant Joan d’Alacant and MedBravo, is set to redefine how AI and radiologists interpret radiological images. Our work demonstrates how collaboration between humans and AI can create powerful feedback loops—where new datasets drive better AI models, and those models, in turn, inspire richer datasets. We’re excited to share this progress in NEJM AI, highlighting both the clinical relevance and research excellence of this initiative.
A new frontier in radiology report generationIt is estimated that over half of people visiting hospitals have radiology scans that must be interpreted by a clinical professional. Traditional radiology reports often condense multiple findings into unstructured narratives. In contrast, grounded radiology reporting demands that each finding be described and localized individually.
This can mitigate the risk of AI fabrications and enable new interactive capabilities that enhance clinical and patient interpretability. PadChest-GR is the first bilingual dataset to address this need with 4,555 chest X-ray studies complete with Spanish and English sentence-level descriptions and precise spatial (bounding box) annotations for both positive and negative findings. It is the first public benchmark that enables us to evaluate generation of fully grounded radiology reports in chest X-rays.
Figure 1. Example of a grounded report from PadChest-GR. The original free-text report in Spanish was “Motivo de consulta: Preoperatorio. Rx PA tórax: Impresión diagnóstica: Ateromatosis aórtica calcificada. Engrosamiento pleural biapical. Atelectasia laminar basal izquierda. Elongación aórtica. Sin otros hallazgos radiológicos significativos.” (In English: “Reason for consultation: Preoperative. PA chest X-ray: Diagnostic impression: Calcified aortic atheromatosis. Biapical pleural thickening. Left basal laminar atelectasis. Aortic elongation. No other significant radiological findings.”)
This benchmark isn’t standing alone—it plays a critical role in powering our state-of-the-art multimodal report generation model, MAIRA-2. Leveraging the detailed annotations of PadChest-GR, MAIRA-2 represents our commitment to building more interpretable and clinically useful AI systems. You can explore our work on MAIRA-2 on our project web page, including recent user research conducted with clinicians in healthcare settings.
PadChest-GR is a testament to the power of collaboration. Aurelia Bustos at MedBravo and Antonio Pertusa at the University of Alicante published the original PadChest dataset (opens in new tab) in 2020, with the help of Jose María Salinas from Hospital San Juan de Alicante and María de la Iglesia Vayá from the Center of Excellence in Biomedical Imaging at the Ministry of Health in Valencia, Spain. We started to look at PadChest and were deeply impressed by the scale, depth, and diversity of the data.
As we worked more closely with the dataset, we realized the opportunity to develop this for grounded radiology reporting research and worked with the team at the University of Alicante to determine how to approach this together. Our complementary expertise was a nice fit. At Microsoft Research, our mission is to push the boundaries of medical AI through innovative, data-driven solutions. The University of Alicante, with its deep clinical expertise, provided critical insights that greatly enriched the dataset’s relevance and utility. The result of this collaboration is the PadChest-GR dataset.
A significant enabler of our annotation process was Centaur Labs. The team of senior and junior radiologists from the University Hospital Sant Joan d’Alacant, coordinated by Joaquin Galant, used this HIPAA-compliant labeling platform to perform rigorous study-level quality control and bounding box annotations. The annotation protocol implemented ensured that each annotation was accurate and consistent, forming the backbone of a dataset designed for the next generation of grounded radiology report generation models.
Accelerating PadChest-GR dataset annotation with AIOur approach integrates advanced large language models with comprehensive manual annotation:
- Data Selection & Processing: Leveraging Microsoft Azure OpenAI Service (opens in new tab) with GPT-4, we extracted sentences describing individual positive and negative findings from raw radiology reports, translated them from Spanish to English, and linked each sentence to the existing expert labels from PadChest. This was done for a selected subset of the full PadChest dataset, carefully curated to reflect a realistic distribution of clinically relevant findings (see the sketch after this list).
- Manual Quality Control & Annotation: The processed studies underwent meticulous quality checks on the Centaur Labs platform by radiologists from Hospital San Juan de Alicante. Each positive finding was then annotated with bounding boxes to capture critical spatial information.
- Standardization & Integration: All annotations were harmonized into coherent grounded reports, preserving the structure and context of the original findings while enhancing interpretability.
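As referenced in the first step above, the extraction-and-translation pass can be driven by a chat-completion call. The sketch below assumes the openai Python SDK against Azure OpenAI; the deployment name, prompt wording, and JSON schema are illustrative, not the project’s actual pipeline.

```python
import json
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<key>",                                            # placeholder
    api_version="2024-06-01",                                   # placeholder API version
)

def extract_findings(report_es: str) -> list[dict]:
    """Split a Spanish report into per-finding sentences, translate them, and tag polarity."""
    response = client.chat.completions.create(
        model="gpt-4",  # Azure deployment name (placeholder)
        temperature=0,
        messages=[
            {"role": "system", "content": "You extract individual radiological findings."},
            {"role": "user", "content": (
                "Split this Spanish chest X-ray report into one sentence per finding, "
                "translate each sentence to English, and mark it positive or negative. "
                "Return a JSON list of objects with keys 'es', 'en', 'polarity'.\n\n" + report_es
            )},
        ],
    )
    # In practice the JSON would be validated and repaired, and outputs linked to PadChest labels.
    return json.loads(response.choices[0].message.content)
```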
Figure 2. Overview of the data curation pipeline.
Impact and future directionsPadChest-GR not only sets a new benchmark for grounded radiology reporting, but also serves as the foundation for our MAIRA-2 model, which already showcases the potential of highly interpretable AI in clinical settings. While we developed PadChest-GR to help train and validate our own models, we believe the research community will greatly benefit from this dataset for many years to come. We look forward to seeing the broader research community build on this—improving grounded reporting AI models and using PadChest-GR as a standard for evaluation. We believe that by fostering open collaboration and sharing our resources, we can accelerate progress in medical imaging AI and ultimately improve patient care together with the community.
The collaboration between Microsoft Research and the University of Alicante highlights the transformative power of working together across disciplines. With our publication in NEJM-AI and the integral role of PadChest-GR in the development of MAIRA-2 (opens in new tab) and RadFact (opens in new tab), we are excited about the future of AI-empowered radiology. We invite researchers and industry experts to explore PadChest-GR and MAIRA-2, contribute innovative ideas, and join us in advancing the field of grounded radiology reporting.
Papers already using PadChest-GR:
- [2406.04449] MAIRA-2: Grounded Radiology Report Generation (opens in new tab)
- RadVLM: A Multitask Conversational Vision-Language Model for Radiology (opens in new tab)
- Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions (opens in new tab)
- Visual Prompt Engineering for Vision Language Models in Radiology (opens in new tab)
For further details or to download PadChest-GR, please visit the BIMCV PadChest-GR Project (opens in new tab).
Models in Azure AI Foundry that support grounded reporting:
- How to deploy and use CXRReportGen healthcare AI model with Azure AI Foundry – Azure AI Foundry | Microsoft Learn (opens in new tab)
- Healthcare Orchestrator – Healthcare agent service | Microsoft Learn (opens in new tab)
- Authors: Daniel C. Castro (opens in new tab), Aurelia Bustos (opens in new tab), Shruthi Bannur (opens in new tab), Stephanie L. Hyland (opens in new tab), Kenza Bouzid (opens in new tab), Maria Teodora Wetscherek (opens in new tab), Maria Dolores Sánchez-Valverde (opens in new tab), Lara Jaques-Pérez (opens in new tab), Lourdes Pérez-Rodríguez (opens in new tab), Kenji Takeda (opens in new tab), José María Salinas (opens in new tab), Javier Alvarez-Valle (opens in new tab), Joaquín Galant Herrero (opens in new tab), Antonio Pertusa (opens in new tab)
- MSR Health Futures UK: Hannah Richardson, Valentina Salvatelli, Harshita Sharma, Sam Bond-Taylor, Max Ilse, Fernando Perez-Garcia, Anton Schwaighofer, Jonathan Carlson
- MSR Flow: Kenji Takeda, Evelyn Viegas, Ashley Llorens
- HLS: Matthew Lungren, Naiteek Sangani, Shrey Jain, Ivan Tarapov, Will Guyman, Mert Oez, Chris Burt, David Ardman
Learning from other domains to advance AI evaluation and testing
As generative AI becomes more capable and widely deployed, familiar questions from the governance of other transformative technologies have resurfaced. Which opportunities, capabilities, risks, and impacts should be evaluated? Who should conduct evaluations, and at what stages of the technology lifecycle? What tests or measurements should be used? And how can we know if the results are reliable?
Recent research and reports from Microsoft (opens in new tab), the UK AI Security Institute (opens in new tab), The New York Times (opens in new tab), and MIT Technology Review (opens in new tab) have highlighted gaps in how we evaluate AI models and systems. These gaps also form foundational context for recent international expert consensus reports: the inaugural International AI Safety Report (opens in new tab) (2025) and the Singapore Consensus (opens in new tab) (2025). Closing these gaps at a pace that matches AI innovation will lead to more reliable evaluations that can help guide deployment decisions, inform policy, and deepen trust.
Today, we’re launching a limited-series podcast, AI Testing and Evaluation: Learnings from Science and Industry, to share insights from domains that have grappled with testing and measurement questions. Across four episodes, host Kathleen Sullivan speaks with academic experts in genome editing, cybersecurity, pharmaceuticals, and medical devices to find out which technical and regulatory steps have helped to close evaluation gaps and earn public trust.
We’re also sharing written case studies from experts, along with top-level lessons we’re applying to AI. At the close of the podcast series, we’ll offer Microsoft’s deeper reflections on next steps toward more reliable and trustworthy approaches to AI evaluation.
Lessons from eight case studiesOur research on risk evaluation, testing, and assurance models in other domains began in December 2024, when Microsoft’s Office of Responsible AI (opens in new tab) gathered independent experts from the fields of civil aviation, cybersecurity, financial services, genome editing, medical devices, nanoscience, nuclear energy, and pharmaceuticals. In bringing this group together, we drew on our own learnings and feedback received on our e-book, Global Governance: Goals and Lessons for AI (opens in new tab), in which we studied the higher-level goals and institutional approaches that had been leveraged for cross-border governance in the past.
While approaches to risk evaluation and testing vary significantly across the case studies, there was one consistent, top-level takeaway: evaluation frameworks always reflect trade-offs among different policy objectives, such as safety, efficiency, and innovation.
Experts across all eight fields noted that policymakers have had to weigh trade-offs in designing evaluation frameworks. These frameworks must account for both the limits of current science and the need for agility in the face of uncertainty. They likewise agreed that early design choices, often reflecting the “DNA” of the historical moment in which they’re made, as cybersecurity expert Stewart Baker described it, are important because they are difficult to scale down or undo later.
Strict, pre-deployment testing regimes—such as those used in civil aviation, medical devices, nuclear energy, and pharmaceuticals—offer strong safety assurances but can be resource-intensive and slow to adapt. These regimes often emerged in response to well-documented failures and are backed by decades of regulatory infrastructure and detailed technical standards.
In contrast, fields marked by dynamic and complex interdependencies between the tested system and its external environment—such as cybersecurity and bank stress testing—rely on more adaptive governance frameworks, where testing may be used to generate actionable insights about risk rather than primarily serve as a trigger for regulatory enforcement.
Moreover, in pharmaceuticals, where interdependencies are at play and there is emphasis on pre-deployment testing, experts highlighted a potential trade-off with post-market monitoring of downstream risks and efficacy evaluation.
These variations in approaches across domains—stemming from differences in risk profiles, types of technologies, maturity of the evaluation science, placement of expertise in the assessor ecosystem, and context in which technologies are deployed, among other factors—also inform takeaways for AI.
Applying risk evaluation and governance lessons to AIWhile no analogy perfectly fits the AI context, the genome editing and nanoscience cases offer interesting insights for general-purpose technologies like AI, where risks vary widely depending on how the technology is applied.
Experts highlighted the benefits of governance frameworks that are more flexible and tailored to specific use cases and application contexts. In these fields, it is challenging to define risk thresholds and design evaluation frameworks in the abstract. Risks become more visible and assessable once the technology is applied to a particular use case and context-specific variables are known.
These and other insights also helped us distill qualities essential to ensuring that testing is a reliable governance tool across domains, including:
- Rigor in defining what is being examined and why it matters. This requires detailed specification of what is being measured and understanding how the deployment context may affect outcomes.
- Standardization of how tests should be conducted to achieve valid, reliable results. This requires establishing technical standards that provide methodological guidance and ensure quality and consistency.
- Interpretability of test results and how they inform risk decisions. This requires establishing expectations for evidence and improving literacy in how to understand, contextualize, and use test results—while remaining aware of their limitations.
Establishing robust foundations for AI evaluation and testing requires effort to improve rigor, standardization, and interpretability—and to ensure that methods keep pace with rapid technological progress and evolving scientific understanding.
Taking lessons from other general-purpose technologies, this foundational work must also be pursued for both AI models and systems. While testing models will continue to be important, reliable evaluation tools that provide assurance for system performance will enable broad adoption of AI, including in high-risk scenarios. A strong feedback loop on evaluations of AI models and systems could not only accelerate progress on methodological challenges but also bring focus to which opportunities, capabilities, risks, and impacts are most appropriate and efficient to evaluate at what points along the AI development and deployment lifecycle.
AcknowledgementsWe would like to thank the following external experts who have contributed to our research program on lessons for AI testing and evaluation: Mateo Aboy, Paul Alp, Gerónimo Poletto Antonacci, Stewart Baker, Daniel Benamouzig, Pablo Cantero, Daniel Carpenter, Alta Charo, Jennifer Dionne, Andy Greenfield, Kathryn Judge, Ciaran Martin, and Timo Minssen.
Case studiesCivil aviation: Testing in Aircraft Design and Manufacturing, by Paul Alp
Cybersecurity: Cybersecurity Standards and Testing—Lessons for AI Safety and Security, by Stewart Baker
Financial services (bank stress testing): The Evolving Use of Bank Stress Tests, by Kathryn Judge
Genome editing: Governance of Genome Editing in Human Therapeutics and Agricultural Applications, by Alta Charo and Andy Greenfield
Medical devices: Medical Device Testing: Regulatory Requirements, Evolution and Lessons for AI Governance, by Mateo Aboy and Timo Minssen
Nanoscience: The regulatory landscape of nanoscience and nanotechnology, and applications to future AI regulation, by Jennifer Dionne
Nuclear energy: Testing in the Nuclear Industry, by Pablo Cantero and Gerónimo Poletto Antonacci
Pharmaceuticals: The History and Evolution of Testing in Pharmaceutical Regulation, by Daniel Benamouzig and Daniel Carpenter
Breaking bonds, breaking ground: Advancing the accuracy of computational chemistry with deep learning
We are excited to share our first big milestone in solving a grand challenge that has hampered the predictive power of computational chemistry, biochemistry, and materials science for decades. By using a scalable deep-learning approach and generating an unprecedented quantity of diverse, highly accurate data, we have achieved a breakthrough in the accuracy of density functional theory (DFT), the workhorse method that thousands of scientists use every year to simulate matter at the atomistic level. Within the region of chemical space represented in our large training dataset, our model reaches the accuracy required to reliably predict experimental outcomes, as assessed on the well-known benchmark dataset W4-17 (opens in new tab). This removes a fundamental barrier to shifting the balance of molecule and material design from being driven by laboratory experiments to being driven by computational simulations. The implications for accelerating scientific discovery are far reaching, spanning applications from drugs to batteries and green fertilizers.
What is DFT?Molecules and materials are made of atoms, which are held together by their electrons. These electrons act as a glue, determining the stability and properties of the chemical structure. Accurately computing the strength and properties of the electron glue is essential for predicting whether a chemical reaction will proceed, whether a candidate drug molecule will bind to its target protein, whether a material is suitable for carbon capture, or if a flow battery can be optimized for renewable energy storage. Unfortunately, a brute-force approach amounts to solving the many-electron Schrödinger equation, which requires computation that scales exponentially with the number of electrons. Considering that an atom has dozens of electrons, and that molecules and materials have large numbers of atoms, we could easily end up waiting the age of the universe to complete our computation unless we restrict our attention to small systems with only a few atoms.
DFT, introduced by Walter Kohn and collaborators in 1964-1965, was a true scientific breakthrough, earning Kohn the Nobel Prize in Chemistry in 1998. DFT provides an extraordinary reduction in the computational cost of calculating the electron glue in an exact manner, from exponential to cubic, making it possible to perform calculations of practical value within seconds to hours.
What is the grand challenge in DFT?But there is a catch: the exact reformulation has a small but crucial term—the exchange-correlation (XC) functional—which Kohn proved is universal (i.e., the same for all molecules and materials), but for which no explicit expression is known. For 60 years, people have designed practical approximations for the XC functional. The magazine Science dubbed the gold rush to design better XC models the “pursuit of the Divine Functional (opens in new tab)”. With time, these approximations have grown into a zoo of hundreds of different XC functionals from which users must choose, often using experimental data as a guide. Owing to the uniquely favorable computational cost of DFT, existing functionals have enabled scientists to gain extremely useful insight into a huge variety of chemical problems. However, the limited accuracy and scope of current XC functionals mean that DFT is still mostly used to interpret experimental results rather than predict them.
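For orientation, the standard Kohn-Sham decomposition (textbook form in atomic units, added here for context rather than taken from the paper) writes the total electronic energy as a functional of the density n(r):

```latex
E[n] = T_s[n]
     + \int v_{\mathrm{ext}}(\mathbf{r})\, n(\mathbf{r})\, \mathrm{d}\mathbf{r}
     + \frac{1}{2} \iint \frac{n(\mathbf{r})\, n(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|}\, \mathrm{d}\mathbf{r}\, \mathrm{d}\mathbf{r}'
     + E_{\mathrm{xc}}[n]
```

The first three terms (the non-interacting kinetic energy, the external-potential energy, and the classical Hartree repulsion) can be evaluated exactly; all remaining many-body effects are folded into the exchange-correlation term, which is the piece the approximations discussed here try to capture.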
Why is it important to increase the accuracy of DFT?We can contrast the present state of computational chemistry with the state of aircraft engineering and design. Thanks to predictive simulations, aeronautical engineers no longer need to build and test thousands of prototypes to identify one viable design. However, this is exactly what we currently must do in molecular and materials sciences. We send thousands of potential candidates to the lab, because the accuracy of the computational methods is not sufficient to predict the experiments. To make a significant shift in the balance from laboratory to in silico experiments, we need to remove the fundamental bottleneck of the insufficient accuracy of present XC functionals. This amounts to bringing the error of DFT calculations with respect to experiments within chemical accuracy, which is around 1 kcal/mol for most chemical processes. Present approximations typically have errors that are 3 to 30 times larger.
How can AI make a difference?AI can transform how we model molecules and materials with DFT by learning the XC functional directly from highly accurate data. The goal is to learn how the XC functional captures the complex relationship between its input, the electron density, and its output, the XC energy. You can think of the density like a glue, with regions of space where there is a lot of it and other regions with less of it. Traditionally, researchers have built XC functional approximations using the concept of the so-called Jacob’s ladder: a hierarchy of increasingly complex, hand-designed descriptors of the electron density. Including density descriptors from higher rungs of this ladder aims to improve accuracy, but it comes at the price of increased computational cost. Even the few attempts that use machine learning have stayed within this traditional paradigm, thereby taking an approach that is akin to what people were doing in computer vision and speech recognition before the deep-learning era. Progress toward better accuracy has stagnated for at least two decades with this approach.
Our project is driven by the intuition that a true deep learning approach—where relevant representations of the electron density are learned directly from data in a computationally scalable way—has the potential to revolutionize the accuracy of DFT, much like deep learning has transformed other fields. A significant challenge with going down this path, however, is that feature or representation learning is very data-hungry, and there is very little data around—too little to test this hypothesis reliably.
What have we done in this milestone?The first step was generating data—a lot of it. This posed a major challenge, since the data must come from accurate solutions of the many-electron Schrödinger equation, which is precisely the prohibitively expensive problem that DFT is designed to replace. Fortunately, decades of progress in the scientific community have led to smarter, more efficient variants of brute-force methods, making it possible to compute reference data for small molecules at experimental accuracy. While these high-accuracy methods, also referred to as wavefunction methods, are far too costly for routine use in applications, we made a deliberate investment in them for this project. The reason? The upfront cost of generating high-quality training data is offset by the long-term benefit of enabling vast numbers of industrially relevant applications with cost effective DFT using the trained XC functional. Crucially, we rely on the ability of DFT—and our learned XC functional—to generalize from high-accuracy data for small systems to larger, more complex molecules.
There are many different high-accuracy wavefunction methods, each tailored to different regions of chemical space. However, their use at scale is not well established, as they require extensive expertise—small methodological choices can significantly affect accuracy at the level that we target. We therefore joined forces with Prof. Amir Karton (opens in new tab) from the University of New England, Australia, a world-leading expert who developed widely recognized benchmark datasets for a fundamental thermochemical property: atomization energy—the energy required to break all bonds in a molecule and separate it into individual atoms. To create a training dataset of atomization energies at unprecedented scale, our team at Microsoft built a scalable pipeline to produce highly diverse molecular structures. Using these structures and substantial Azure compute resources via Microsoft’s Accelerating Foundation Models Research program (opens in new tab), Prof. Karton applied a high-accuracy wavefunction method to compute the corresponding energy labels. The result is a dataset (opens in new tab) two orders of magnitude larger than previous efforts. We are releasing a large part of this dataset (opens in new tab) to the scientific community.
Data generation was only half of the challenge. We also needed to design a dedicated deep-learning architecture for the XC functional—one that is both computationally scalable and capable of learning meaningful representations from electron densities to accurately predict the XC energy. Our team of machine learning specialists, assisted by DFT experts, introduced a series of innovations that solve these and other challenges inherent to this complex learning problem. The result is Skala, an XC functional that generalizes to unseen molecules, reaching the accuracy needed to predict experiments. This demonstrates for the first time that deep learning can truly disrupt DFT: reaching experimental accuracy does not require the computationally expensive hand-designed features of Jacob’s ladder. Instead, we can retain the original computational complexity of DFT while allowing the XC functional to learn how to extract meaningful features and predict accurate energies.
We compare the accuracy of Skala against the best existing functionals of varying computational cost. The prediction errors are evaluated on two well-known public benchmark datasets: the W4-17 dataset for atomization energies (y axis, mean absolute error) and the GMTKN55 dataset for general main-group chemistry (x axis, weighted total mean absolute deviation, or WTMAD-2 for short). Skala achieves near “chemical accuracy” (1 kcal/mol) on atomization energies. This is the accuracy required for predictive modeling of laboratory experiments, which, to date, no existing functional has reached. Skala works especially well on the “single reference” subset of this dataset, reaching a groundbreaking 0.85 kcal/mol. On the GMTKN55 dataset, Skala shows competitive accuracy to the best-performing hybrid functionals, at a lower cost.
“Skala is a new density functional for the exchange-correlation energy that employs meta-GGA ingredients plus D3 dispersion and machine-learned nonlocal features of the electron density. Some exact constraints were imposed, and some others “emerge” from the fitting to about 150,000 accurate energy differences for sp molecules and atoms. Skala achieves high, hybrid-like accuracy on a large and diverse data set of properties of main group molecules, which has no overlap with its training set. The computational cost of Skala is higher than that of the r2SCAN meta-GGA for small molecules, but about the same for systems with 1,000 or more occupied orbitals. Its cost seems to be only 10% of the cost of standard hybrids and 1% of the cost of local hybrids. Developed by a Microsoft team of density functional theorists and deep-learning experts, Skala could be the first machine-learned density functional to compete with existing functionals for wide use in computational chemistry, and a sign of things to come in that and related fields. Skala learned from big data and was taught by insightful human scientists.”
— John P. Perdew, Professor of Physics, School of Science and Engineering, Tulane University
This first milestone was achieved for a challenging property in a specific region of chemical space—atomization energies of main group molecules—for which we generated our initial large batch of high-accuracy training data. Building on this foundation, we have started to expand our training dataset to cover a broader range of general chemistry, using our scalable in-house data generation pipeline. With the first small batch of training data beyond atomization energies, we have already extended the accuracy of our model, making it competitive with the best existing XC functionals across a wider spectrum of main group chemistry. This motivates us to continue growing our high-accuracy data generation campaign, engaging with external experts such as Prof. Amir Karton, who noted, “After years of benchmarking DFT methods against experimental accuracy, this is the first time I’ve witnessed such an unprecedented leap in the accuracy–cost trade-off. It is genuinely exciting to see how the creation of our new dataset has enabled these groundbreaking results — opening up a path for transformative advances across chemical, biochemical, and materials research.”
Advancing computational chemistry togetherWe are excited to work closely with the global computational chemistry community to accelerate progress for all and look forward to openly releasing our first XC functional in the near future.
“Density Functional Theory (DFT) and related technologies are a core Digital Chemistry technology supporting advancements in Merck’s diverse Life Science, Healthcare and Electronics businesses. However, the limitations of traditional DFT methods, which have persisted for the last 50 years, have hindered its full potential. Microsoft Research’s innovative approach to integrating deep learning represents a substantial leap, enhancing its accuracy, robustness, and scalability. We are looking forward to exploring how this can advance Digital Chemistry workflows and unlock new possibilities for the future, aligning with our commitment to developing advanced algorithms and technologies that propel scientific innovation at Merck.”
— Jan Gerit Brandenburg, Director for Digital Chemistry at Merck
“We are entering a golden age for predictive and realistic simulations: very accurate electronic-structure calculations provide vast amounts of consistent data that can be used to train novel machine-learning architectures, delivering the holy grail of precision and computational efficiency.”
— Professor Nicola Marzari, Chair of Theory and Simulation of Materials, EPFL and PSI
We believe that our new functional can help unlock new opportunities for businesses and are eager to work together on real-world applications. Today, we are delighted to launch the DFT Research Early Access Program (DFT REAP) and welcome Flagship Pioneering as the first participant. This program is for companies and research labs to collaborate with us to accelerate innovation across many industries. To find out more about how to join this program, please visit: https://aka.ms/DFT-REAP (opens in new tab)
“Microsoft’s effort to enhance the predictive power of computational chemistry reflects a bold but thoughtful step toward a simulation-first future. At Flagship, we believe that openly shared, foundational advances in science – like this leap forward in DFT accuracy – can serve as powerful enablers of innovation. These next-generation tools promise to accelerate discovery across a wide range of sectors, from therapeutics to materials science, by helping researchers navigate chemical and biological space with far greater precision and speed.”
— Junaid Bajwa, M.D., Senior Partner at Flagship Pioneering and Science Partner at Pioneering Intelligence
By making our work available to the scientific community, we hope to enable widespread testing and gather valuable feedback that will guide future improvements. For the first time, deep learning offers a clear and computationally scalable path to building an accurate, efficient, and broadly applicable model of the universal XC functional—one that could transform the computational design of molecules and materials.
Acknowledgement
This work is the product of a highly collaborative and interdisciplinary effort led by Microsoft Research AI for Science, in partnership with colleagues from Microsoft Research Accelerator, Microsoft Quantum and the University of New England. The full author list includes Giulia Luise, Chin-Wei Huang, Thijs Vogels, Derk P. Kooi, Sebastian Ehlert, Stephanie Lanius, Klaas J. H. Giesbertz, Amir Karton, Deniz Gunceler, Megan Stanley, Wessel P. Bruinsma, Victor Garcia Satorras, Marwin Segler, Kenji Takeda, Lin Huang, Xinran Wei, José Garrido Torres, Albert Katbashev, Rodrigo Chavez Zavaleta, Bálint Máté, Sékou-Oumar Kaba, Roberto Sordillo, Yingrong Chen, David B. Williams-Young, Christopher M. Bishop, Jan Hermann, Rianne van den Berg and Paola Gori Giorgi.
New methods boost reasoning in small and large language models
Artificial intelligence is advancing across a wide range of fields, with one of the most important developments being its growing capacity for reasoning. This capability could help AI become a reliable partner in critical domains like scientific research and healthcare.
To support this progress, we’ve identified three primary strategies to strengthen reasoning capabilities in both small and large language models: improve architectural design to boost performance in smaller models; incorporate mathematical reasoning techniques to increase reliability; and build stronger generalization capabilities to enable reasoning across a variety of fields.
Smarter reasoning in smaller models
While language models trained on broad world knowledge hold great potential, they lack the ability to learn continuously and refine their understanding. This limitation becomes especially pronounced in smaller models, where limited capacity makes strong reasoning even harder.
The problem stems from how current language models operate. They rely on fast, pattern recognition-based responses that break down in complex scenarios. In contrast, people use deliberate, step-by-step reasoning, test different approaches, and evaluate outcomes. To address this gap, we’re building methods to enable stronger reasoning in smaller systems.
rStar-Math is a method that uses Monte Carlo Tree Search (MCTS) to simulate deeper, more methodical reasoning in smaller models. It uses a three-step, self-improving cycle:
- Problem decomposition breaks down complex mathematical problems into manageable steps, creating a thorough and accurate course of reasoning.
- Process preference model (PPM) trains small models to predict reward labels for each step, improving process-level supervision.
- Iterative refinement applies a four-round, self-improvement cycle in which updated strategy models and PPMs guide MCTS to improve performance.
When tested on four small language models ranging from 1.5 billion to 7 billion parameters, rStar-Math achieved an average accuracy of 53% on the American Invitational Mathematics Examination (AIME)—performance that places it among the top 20% of high school competitors in the US.
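To make the search loop concrete, here is a minimal sketch of a step-level search guided by a process reward model. The `propose` and `score` functions are hypothetical stand-ins for the step generator and the PPM described above, and the plain beam search is a deliberate simplification of the full MCTS with rollouts and visit statistics that rStar-Math uses.

```python
from typing import Callable, List

# Hypothetical stand-ins: `propose(problem, steps)` would call a small language model to
# suggest candidate next reasoning steps; `score(problem, steps)` would call a trained
# process preference model (PPM) to rate the partial trace.
ProposeFn = Callable[[str, List[str]], List[str]]
ScoreFn = Callable[[str, List[str]], float]

def guided_step_search(problem: str, propose: ProposeFn, score: ScoreFn,
                       max_depth: int = 8, beam_width: int = 4) -> List[str]:
    """Grow reasoning traces one step at a time, keeping only the partial traces
    the reward model rates highest (a simplified stand-in for MCTS)."""
    beams: List[List[str]] = [[]]
    for _ in range(max_depth):
        candidates = [steps + [nxt] for steps in beams for nxt in propose(problem, steps)]
        if not candidates:
            break  # no further steps proposed; keep the current traces
        candidates.sort(key=lambda steps: score(problem, steps), reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # highest-scoring reasoning trace found
```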
Figure 1. The rStar-Math framework
Logic-RL is a reinforcement learning framework that strengthens logical reasoning through a practical system prompt and a structured reward function. By training models on logic puzzles, Logic-RL grants rewards only when both the reasoning process and the final answer meet strict formatting requirements. This prevents shortcuts and promotes analytical rigor.
Language models trained with Logic-RL demonstrate strong performance beyond logic puzzles, generalizing effectively to mathematical competition problems. On the AIME and AMC (American Mathematics Competitions) datasets, 7-billion-parameter models improved accuracy by 125% and 38%, respectively, compared with baseline models.
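A structured reward of this kind can be sketched in a few lines. The `<think>`/`<answer>` tag convention and the specific reward values below are illustrative assumptions rather than the exact ones used by Logic-RL; the point is that credit is granted only when both the reasoning format and the final answer check out.

```python
import re

def logic_rl_reward(response: str, gold_answer: str) -> float:
    """Rule-based reward in the spirit of Logic-RL: the response earns credit only
    when it follows the required format AND the final answer is correct."""
    pattern = re.compile(r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL)
    match = pattern.match(response.strip())
    if match is None:
        return -1.0                  # malformed output: penalized, which discourages shortcuts
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
    if not reasoning:
        return -1.0                  # empty reasoning also counts as a format violation
    return 1.0 if answer.lower() == gold_answer.lower() else -0.5

# Example (illustrative values):
# logic_rl_reward("<think>A lies, so ...</think><answer>B is the knave</answer>", "B is the knave") -> 1.0
```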
Building reliable mathematical reasoning
Mathematics poses a unique challenge for language models, which often struggle to meet its precision and rigor using natural language. To address this, we’re creating formal and symbolic methods to enable language models to adopt structured mathematical tools. The goal is to convert language model outputs into code based on the fundamental rules of arithmetic, like 1 + 1 = 2, allowing us to systematically verify accuracy.
LIPS (LLM-based Inequality Prover with Symbolic Reasoning) is a system that combines LLMs’ pattern recognition capabilities with symbolic reasoning. LIPS draws on the strategies that math competition participants use to distinguish between tasks best suited to symbolic solvers (e.g., scaling) and those better handled by language models (e.g., rewriting). On 161 Olympiad-level problems, LIPS achieved state-of-the-art results without additional training data.
Figure 2. An overview of LIPS
However, translating natural-language math problems into precise, machine-readable formats is a challenge. Our goal is to bridge the gap between the one-pass success rate, where the top-ranked generated result is correct, and the k-pass success rate, where at least one of the top k generated results is correct.
We developed a new framework using two evaluation methods. Symbolic equivalence checks whether outputs are logically identical, while semantic consistency uses embedding similarity to detect subtle differences missed by symbolic checks.
When we evaluated this approach on the MATH and miniF2F datasets, which include problems from various math competitions, it improved accuracy by up to 1.35 times over baseline methods.
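A rough sketch of the two checks, assuming SymPy for the symbolic side and pre-computed sentence embeddings for the semantic side (the helper names and the 0.95 threshold are illustrative choices, not the paper's implementation):

```python
import numpy as np
import sympy as sp

def symbolically_equivalent(expr_a: str, expr_b: str) -> bool:
    """Check whether two formalized expressions are logically identical by asking
    SymPy whether their difference simplifies to zero."""
    try:
        diff = sp.simplify(sp.sympify(expr_a) - sp.sympify(expr_b))
        return diff == 0
    except (sp.SympifyError, TypeError):
        return False

def semantically_consistent(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.95) -> bool:
    """Fall back on embedding similarity to catch subtle differences a symbolic check
    may miss or fail to parse; the embeddings would come from any sentence-embedding model."""
    cos = float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
    return cos >= threshold

# Equivalent formalizations written differently pass the symbolic check:
assert symbolically_equivalent("(x + 1)**2", "x**2 + 2*x + 1")
```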
Figure 3. An overview of the auto-formalization framework
To address the shortage of high-quality training data, we developed a neuro-symbolic framework that automatically generates diverse, well-structured math problems. Symbolic solvers create the problems, while language models translate them into natural language. This approach not only broadens training resources but also supports more effective instruction and evaluation of mathematical reasoning in language models.
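As a toy illustration of the symbolic-solver-first pattern, the sketch below constructs a linear-equation problem with a known answer and renders it through a fixed template. In the actual framework the problem families are far richer and a language model performs the natural-language translation; the template here is only a stand-in.

```python
import random
import sympy as sp

def generate_linear_problem(rng: random.Random):
    """Symbolic side: pick an integer solution and coefficients, verify the equation
    with SymPy, then render the problem with a simple natural-language template."""
    a, b, s = rng.randint(2, 9), rng.randint(1, 20), rng.randint(1, 12)
    result = a * s + b
    x = sp.symbols("x")
    assert sp.solve(sp.Eq(a * x + b, result), x) == [s]   # the problem provably has answer s
    question = (f"A number is multiplied by {a} and then increased by {b}; "
                f"the result is {result}. What is the number?")
    return question, s

rng = random.Random(0)
print(generate_linear_problem(rng))
```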
Figure 4. An overview of the neuro-symbolic data generation framework
Boosting generalization across domains
A key indicator of advanced AI is its ability to generalize—to transfer reasoning skills across different domains. We found that training language models on math data significantly improved performance in coding, science, and other areas, revealing unexpected cross-domain benefits.
This discovery motivated us to develop Chain-of-Reasoning (CoR), an approach that unifies reasoning across natural language, code, and symbolic forms. CoR lets models blend these formats using natural language to frame context, code for precise calculations, and symbolic representations for abstraction. By adjusting prompts, CoR adapts both reasoning depth and paradigm diversity to match specific problem requirements.
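One way to picture this is as a prompt that walks the model through the three paradigms in sequence. The wording, section markers, and function below are our own illustration, not the prompts used in the CoR work:

```python
def build_cor_prompt(problem: str,
                     paradigms=("natural language", "code", "symbolic"),
                     depth: str = "detailed") -> str:
    """Assemble a Chain-of-Reasoning-style prompt that moves through several
    reasoning paradigms in sequence; the text is an illustrative assumption."""
    sections = {
        "natural language": "First, restate the problem and outline a plan in plain prose.",
        "code": "Next, write Python code for any exact computation the plan needs and state its result.",
        "symbolic": "Then, express the key relationships symbolically (equations, inequalities) and manipulate them.",
    }
    lines = [f"Solve the following problem with {depth} reasoning.", f"Problem: {problem}", ""]
    for i, paradigm in enumerate(paradigms, start=1):
        lines.append(f"Step {i} ({paradigm}): {sections[paradigm]}")
    lines.append("Finally, combine the results above and state the final answer after 'Answer:'.")
    return "\n".join(lines)

print(build_cor_prompt("Prove that x**2 - 2x + 1 >= 0 for all real x."))
```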
Tests of CoR across five math datasets showed its ability to tackle both computational and proof-based problems, demonstrating strong general mathematical problem-solving skills.
Figure 5. CoR’s reasoning process under different types of methods
Current language models often rely on domain-specific solutions, limiting their flexibility across different types of problems. To move beyond this constraint, we developed Critical Plan Step Learning (CPL), an approach focused on high-level abstract planning that teaches models to identify key knowledge, break down problems, and make strategic decisions.
The technique draws on how people solve problems, by breaking them down, identifying key information, and recalling relevant knowledge—strategies we want language models to learn.
CPL combines two key components: plan-based MCTS, which searches multi-step solution paths and constructs planning trees, and step-APO, which learns preferences for strong intermediate steps while filtering out weak ones. This combination enhances reasoning and improves generalization across tasks, moving AI systems closer to the flexible thinking that characterizes human intelligence.
Figure 6. Overview of the CPL framework
Looking ahead: Next steps in AI reasoning
From building reliable math solvers to unifying reasoning approaches, researchers are redefining how language models approach complex tasks. Their work sets the stage for more capable and versatile AI systems—applicable to education, science, healthcare, and beyond. Despite these advances, hallucinations and imprecise logic continue to pose risks in critical fields like medicine and scientific research, where accuracy is essential.
These challenges are driving the team’s exploration of additional tools and frameworks to improve language model reasoning. This includes AutoVerus for automated proof generation in Rust code, SAFE for addressing data scarcity in Rust formal verification, and Alchemy, which uses symbolic mutation to improve neural theorem proving.
Together, these technologies represent important progress toward building trustworthy, high-performing reasoning models and signal a broader shift toward addressing some of AI’s current limitations.
Rewriting SymCrypt in Rust to modernize Microsoft’s cryptographic library
Outdated coding practices and memory-unsafe languages like C are putting software, including cryptographic libraries, at risk. Fortunately, memory-safe languages like Rust, along with formal verification tools, are now mature enough to be used at scale, helping prevent issues like crashes, data corruption, flawed implementation, and side-channel attacks.
To address these vulnerabilities and improve memory safety, we’re rewriting SymCrypt (opens in new tab)—Microsoft’s open-source cryptographic library—in Rust. We’re also incorporating formal verification methods. SymCrypt is used in Windows, Azure Linux, Xbox, and other platforms.
Currently, SymCrypt is primarily written in cross-platform C, with limited use of hardware-specific optimizations through intrinsics (compiler-provided low-level functions) and assembly language (direct processor instructions). It provides a wide range of algorithms, including AES-GCM, SHA, ECDSA, and the more recent post-quantum algorithms ML-KEM and ML-DSA.
Formal verification will confirm that implementations behave as intended and don’t deviate from algorithm specifications, critical for preventing attacks. We’ll also analyze compiled code to detect side-channel leaks caused by timing or hardware-level behavior.
Proving Rust program properties with Aeneas
Program verification is the process of proving that a piece of code will always satisfy a given property, no matter the input. Rust’s type system profoundly improves the prospects for program verification by providing strong ownership guarantees, by construction, using a discipline known as “aliasing xor mutability”.
For example, reasoning about C code often requires proving that two non-const pointers are live and non-overlapping, a property that can depend on external client code. In contrast, Rust’s type system guarantees this property for any two mutably borrowed references.
As a result, new tools have emerged specifically for verifying Rust code. We chose Aeneas (opens in new tab) because it helps provide a clean separation between code and proofs.
Developed by Microsoft Azure Research in partnership with Inria, the French National Institute for Research in Digital Science and Technology, Aeneas connects to proof assistants like Lean (opens in new tab), allowing us to draw on a large body of mathematical proofs—especially valuable given the mathematical nature of cryptographic algorithms—and benefit from Lean’s active user community.
Compiling Rust to C supports backward compatibility
We recognize that switching to Rust isn’t feasible for all use cases, so we’ll continue to support, extend, and certify C-based APIs as long as users need them. Users won’t see any changes, as Rust runs underneath the existing C APIs.
Some users compile our C code directly and may rely on specific toolchains or compiler features that complicate the adoption of Rust code. To address this, we will use Eurydice (opens in new tab), a Rust-to-C compiler developed by Microsoft Azure Research, to replace handwritten C code with C generated from formally verified Rust. Eurydice compiles directly from Rust’s MIR intermediate language, and the resulting C code will be checked into the SymCrypt repository alongside the original Rust source code.
As more users adopt Rust, we’ll continue supporting this compilation path for those who build SymCrypt from source code but aren’t ready to use the Rust compiler. In the long term, we hope to transition users to either use precompiled SymCrypt binaries (via C or Rust APIs), or compile from source code in Rust, at which point the Rust-to-C compilation path will no longer be needed.
Timing analysis with Revizor
Even software that has been verified for functional correctness can remain vulnerable to low-level security threats, such as side channels caused by timing leaks or speculative execution. These threats operate at the hardware level and can leak private information, such as memory load addresses, branch targets, or division operands, even when the source code is provably correct.
To address this, we’re extending Revizor (opens in new tab), a tool developed by Microsoft Azure Research, to more effectively analyze SymCrypt binaries. Revizor models microarchitectural leakage and uses fuzzing techniques to systematically uncover instructions that may expose private information through known hardware-level effects.
Earlier cryptographic libraries relied on constant-time programming to avoid operations on secret data. However, recent research has shown that this alone is insufficient with today’s CPUs, where every new optimization may open a new side channel.
By analyzing binary code for specific compilers and platforms, our extended Revizor tool enables deeper scrutiny of vulnerabilities that aren’t visible in the source code.
Verified Rust implementations begin with ML-KEM
This long-term effort is in alignment with the Microsoft Secure Future Initiative and brings together experts across Microsoft, building on decades of Microsoft Research investment in program verification and security tooling.
A preliminary version of ML-KEM in Rust is now available on the preview feature/verifiedcrypto (opens in new tab) branch of the SymCrypt repository. We encourage users to try the Rust build and share feedback (opens in new tab). Looking ahead, we plan to support direct use of the same cryptographic library in Rust without requiring C bindings.
Over the coming months, we plan to rewrite, verify, and ship several algorithms in Rust as part of SymCrypt. As our investment in Rust deepens, we expect to gain new insights into how to best leverage the language for high-assurance cryptographic implementations with low-level optimizations.
As performance is key to scalability and sustainability, we’re holding new implementations to a high bar using our benchmarking tools to match or exceed existing systems.
Looking forward
This is a pivotal moment for high-assurance software. Microsoft’s investment in Rust and formal verification presents a rare opportunity to advance one of our key libraries. We’re excited to scale this work and ultimately deliver an industrial-grade, Rust-based, FIPS-certified cryptographic library.
BenchmarkQED: Automated benchmarking of RAG systems
One of the key use cases for generative AI involves answering questions over private datasets, with retrieval-augmented generation (RAG) as the go-to framework. As new RAG techniques emerge, there’s a growing need to benchmark their performance across diverse datasets and metrics.
To meet this need, we’re introducing BenchmarkQED, a new suite of tools that automates RAG benchmarking at scale, available on GitHub (opens in new tab). It includes components for query generation, evaluation, and dataset preparation, each designed to support rigorous, reproducible testing.
BenchmarkQED complements the RAG methods in our open-source GraphRAG library, enabling users to run a GraphRAG-style evaluation across models, metrics, and datasets. GraphRAG uses a large language model (LLM) to generate and summarize entity-based knowledge graphs, producing more comprehensive and diverse answers than standard RAG for large-scale tasks.
In this post, we walk through the core components of BenchmarkQED that contribute to the overall benchmarking process. We also share some of the latest benchmark results comparing our LazyGraphRAG system to competing methods, including a vector-based RAG with a 1M-token context window, where the leading LazyGraphRAG configuration showed significant win rates across all combinations of quality metrics and query classes.
In the paper, we distinguish between local queries, where answers are found in a small number of text regions, and sometimes even a single region, and global queries, which require reasoning over large portions of or even the entire dataset.
Conventional vector-based RAG excels at local queries because the regions containing the answer to the query resemble the query itself and can be retrieved as the nearest neighbor in the vector space of text embeddings. However, it struggles with global questions, such as, “What are the main themes of the dataset?” which require understanding dataset qualities not explicitly stated in the text.
AutoQ: Automated query synthesis
This limitation motivated the development of GraphRAG, a system designed to answer global queries. GraphRAG’s evaluation requirements subsequently led to the creation of AutoQ, a method for synthesizing these global queries for any dataset.
AutoQ extends this approach by generating synthetic queries across the local-to-global spectrum. It defines four distinct classes based on the source and scope of the query (Figure 1, top), forming a logical progression along the spectrum (Figure 1, bottom).
Figure 1. Construction of a 2×2 design space for synthetic query generation with AutoQ, showing how the four resulting query classes map onto the local-global query spectrum.
AutoQ can be configured to generate any number and distribution of synthetic queries along these classes, enabling consistent benchmarking across datasets without requiring user customization. Figure 2 shows the synthesis process and sample queries from each class, using an AP News dataset.
Figure 2. Synthesis process and example query for each of the four AutoQ query classes.
AutoE: Automated evaluation framework
Our evaluation of GraphRAG focused on analyzing key qualities of answers to global questions. The following qualities were used for the current evaluation:
- Comprehensiveness: Does the answer address all relevant aspects of the question?
- Diversity: Does it present varied perspectives or insights?
- Empowerment: Does it help the reader understand and make informed judgments?
- Relevance: Does it address what the question is specifically asking?
The AutoE component scales evaluation of these qualities using the LLM-as-a-Judge method. It presents pairs of answers to an LLM, along with the query and target metric, in counterbalanced order. The model determines whether the first answer wins, loses, or ties with the second. Over a set of queries, whether from AutoQ or elsewhere, this produces win rates between competing methods. When ground truth is available, AutoE can also score answers on correctness, completeness, and related metrics.
An illustrative evaluation is shown in Figure 3. Using a dataset of 1,397 AP News articles on health and healthcare, AutoQ generated 50 queries per class (200 total). AutoE then compared LazyGraphRAG to a competing RAG method, running six trials per query across four metrics, using GPT-4.1 as a judge.
These trial-level results were aggregated using metric-based win rates, where each trial is scored 1 for a win, 0.5 for a tie, and 0 for a loss, and then averaged to calculate the overall win rate for each RAG method.
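The aggregation itself is straightforward; a small sketch with made-up trial data shows how per-trial judgments become win rates:

```python
from collections import defaultdict

def win_rates(trials):
    """Aggregate LLM-as-a-judge trial outcomes into per-method win rates.
    `trials` is an iterable of (method, outcome) pairs, where outcome is
    'win', 'tie', or 'loss' for that method in one judged comparison."""
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    totals, counts = defaultdict(float), defaultdict(int)
    for method, outcome in trials:
        totals[method] += points[outcome]
        counts[method] += 1
    return {method: totals[method] / counts[method] for method in totals}

# Toy example: six trials for one query/metric pair (data invented for illustration).
example = [("LazyGraphRAG", "win"), ("LazyGraphRAG", "win"), ("LazyGraphRAG", "tie"),
           ("VectorRAG", "loss"), ("VectorRAG", "loss"), ("VectorRAG", "tie")]
print(win_rates(example))   # {'LazyGraphRAG': 0.833..., 'VectorRAG': 0.166...}
```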
Figure 3. Win rates of four LazyGraphRAG (LGR) configurations across methods, broken down by the AutoQ query class and averaged across AutoE’s four metrics: comprehensiveness, diversity, empowerment, and relevance. LazyGraphRAG outperforms comparison conditions where the bar is above 50%.
The four LazyGraphRAG conditions (LGR_b200_c200, LGR_b50_c200, LGR_b50_c600, LGR_b200_c200_mini) differ by query budget (b50, b200) and chunk size (c200, c600). All used GPT-4o mini for relevance tests and GPT-4o for query expansion (to five subqueries) and answer generation, except for LGR_b200_c200_mini, which used GPT-4o mini throughout.
Comparison systems were GraphRAG (Local, Global, and Drift Search), Vector RAG with 8k- and 120k-token windows, and three published methods: LightRAG (opens in new tab), RAPTOR (opens in new tab), and TREX (opens in new tab). All methods were limited to the same 8k tokens for answer generation. GraphRAG Global Search used level 2 of the community hierarchy.
LazyGraphRAG outperformed every comparison condition using the same generative model (GPT-4o), winning all 96 comparisons, with all but one reaching statistical significance. The best overall performance came from the larger budget, smaller chunk size configuration (LGR_b200_c200). For DataLocal queries, the smaller budget (LGR_b50_c200) performed slightly better, likely because fewer chunks were relevant. For ActivityLocal queries, the larger chunk size (LGR_b50_c600) had a slight edge, likely because longer chunks provide a more coherent context.
Competing methods performed relatively better on the query classes for which they were designed: GraphRAG Global for global queries, Vector RAG for local queries, and GraphRAG Drift Search, which combines both strategies, posed the strongest challenge overall.
Increasing Vector RAG’s context window from 8k to 120k tokens did not improve its performance compared to LazyGraphRAG. This raised the question of how LazyGraphRAG would perform against Vector RAG with a 1M-token context window containing most of the dataset.
Figure 4 shows the follow-up experiment comparing LazyGraphRAG to Vector RAG using GPT-4.1, whose long context window enabled this comparison. Even against the 1M-token window, LazyGraphRAG achieved higher win rates across all comparisons, failing to reach significance only for the relevance of answers to DataLocal queries. These queries tend to benefit most from Vector RAG’s ranking of directly relevant chunks, making it hard for LazyGraphRAG to generate answers with greater relevance to the query, even though its answers may be dramatically more comprehensive, diverse, and empowering overall.
Figure 4. Win rates of LazyGraphRAG (LGR) over Vector RAG across different context window sizes, broken down by the four AutoQ query classes and four AutoE metrics: comprehensiveness, diversity, empowerment, and relevance. Bars above 50% indicate that LazyGraphRAG outperformed the comparison condition.
AutoD: Automated data sampling and summarization
Text datasets have an underlying topical structure, but the depth, breadth, and connectivity of that structure can vary widely. This variability makes it difficult to evaluate RAG systems consistently, as results may reflect the idiosyncrasies of the dataset rather than the system’s general capabilities.
The AutoD component addresses this by sampling datasets to meet a target specification, defined by the number of topic clusters (breadth) and the number of samples per cluster (depth). This creates consistency across datasets, enabling more meaningful comparisons, as structurally aligned datasets lead to comparable AutoQ queries, which in turn support consistent AutoE evaluations.
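A minimal sketch of that balanced sampling step, assuming cluster labels have already been assigned by some upstream topic model; the function name and interface are illustrative, not the AutoD API:

```python
import random
from collections import defaultdict

def sample_to_spec(docs, labels, breadth: int, depth: int, seed: int = 0):
    """Sample a dataset to a target specification: `breadth` topic clusters and
    `depth` documents per cluster, so that different corpora end up structurally aligned."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_cluster[label].append(doc)
    # Keep only clusters large enough to supply `depth` samples, then pick `breadth` of them.
    eligible = [c for c, members in by_cluster.items() if len(members) >= depth]
    chosen = rng.sample(eligible, k=min(breadth, len(eligible)))
    return {c: rng.sample(by_cluster[c], k=depth) for c in chosen}
```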
AutoD also includes tools for summarizing input or output datasets in a way that reflects their topical coverage. These summaries play an important role in the AutoQ query synthesis process, but they can also be used more broadly, such as in prompts where context space is limited.
Supporting the community with open data and tools
Since the release of the GraphRAG paper, we’ve received many requests to share the dataset of the Behind the Tech (opens in new tab) podcast transcripts we used in our evaluation. An updated version of this dataset is now available in the BenchmarkQED repository (opens in new tab), alongside the AP News dataset containing 1,397 health-related articles, licensed for open release.
We hope these datasets, together with the BenchmarkQED tools (opens in new tab), help accelerate benchmark-driven development of RAG systems and AI question-answering. We invite the community to try them on GitHub (opens in new tab).
FrodoKEM: A conservative quantum-safe cryptographic algorithm
In this post, we describe FrodoKEM, a key encapsulation protocol that offers a simple design and provides strong security guarantees even in a future with powerful quantum computers.
The quantum threat to cryptography
For decades, modern cryptography has relied on mathematical problems that are practically impossible for classical computers to solve without a secret key. Cryptosystems like RSA, Diffie-Hellman key-exchange, and elliptic curve-based schemes—which rely on the hardness of the integer factorization and (elliptic curve) discrete logarithm problems—secure communications on the internet, banking transactions, and even national security systems. However, the emergence of quantum computing poses a significant threat to these cryptographic schemes.
Quantum computers leverage the principles of quantum mechanics to perform certain calculations exponentially faster than classical computers. Their ability to solve complex problems, such as simulating molecular interactions, optimizing large-scale systems, and accelerating machine learning, is expected to have profound and beneficial implications for fields ranging from chemistry and material science to artificial intelligence.
At the same time, quantum computing is poised to disrupt cryptography. In particular, Shor’s algorithm, a quantum algorithm developed in 1994, can efficiently factor large numbers and compute discrete logarithms—the very problems that underpin the security of RSA, Diffie-Hellman, and elliptic curve cryptography. This means that once large-scale, fault-tolerant quantum computers become available, public-key protocols based on RSA, ECC, and Diffie-Hellman will become insecure, breaking a sizable portion of the cryptographic backbone of today’s digital world. Recent advances in quantum computing, such as Microsoft’s Majorana 1 (opens in new tab), the first quantum processor powered by topological qubits, represent major steps toward practical quantum computing and underscore the urgency of transitioning to quantum-resistant cryptographic systems.
To address this looming security crisis, cryptographers and government agencies have been working on post-quantum cryptography (PQC)—new cryptographic algorithms that can resist attacks from both classical and quantum computers.
The NIST Post-Quantum Cryptography Standardization effort
In 2017, the U.S. National Institute of Standards and Technology (NIST) launched the Post-Quantum Cryptography Standardization project (opens in new tab) to evaluate and select cryptographic algorithms capable of withstanding quantum attacks. As part of this initiative, NIST sought proposals for two types of cryptographic primitives: key encapsulation mechanisms (KEMs)—which enable two parties to securely derive a shared key to establish an encrypted connection, similar to traditional key exchange schemes—and digital signature schemes.
This initiative attracted submissions from cryptographers worldwide, and after multiple evaluation rounds, NIST selected CRYSTALS-Kyber, a KEM based on structured lattices, and standardized it as ML-KEM (opens in new tab). Additionally, NIST selected three digital signature schemes: CRYSTALS-Dilithium, now called ML-DSA; SPHINCS+, now called SLH-DSA; and Falcon, now called FN-DSA.
While ML-KEM provides great overall security and efficiency, some governments and cryptographic researchers advocate for the inclusion and standardization of alternative algorithms that minimize reliance on algebraic structure. Reducing algebraic structure might prevent potential vulnerabilities and, hence, can be considered a more conservative design choice. One such algorithm is FrodoKEM.
International standardization of post-quantum cryptography
Beyond NIST, other international standardization bodies have been actively working on quantum-resistant cryptographic solutions. The International Organization for Standardization (ISO) is leading a global effort to standardize additional PQC algorithms. Notably, European government agencies—including Germany’s BSI (opens in new tab), the Netherlands’ NLNCSA and AIVD (opens in new tab), and France’s ANSSI (opens in new tab)—have shown strong support for FrodoKEM, recognizing it as a conservative alternative to structured lattice-based schemes.
As a result, FrodoKEM is undergoing standardization at ISO. Additionally, ISO is standardizing ML-KEM and a conservative code-based KEM called Classic McEliece. These three algorithms are planned for inclusion in ISO/IEC 18033-2:2006 as Amendment 2 (opens in new tab).
What is FrodoKEM?
FrodoKEM is a key encapsulation mechanism (KEM) based on the Learning with Errors (LWE) problem, a cornerstone of lattice-based cryptography. Unlike structured lattice-based schemes such as ML-KEM, FrodoKEM is built on generic, unstructured lattices, i.e., it is based on the plain LWE problem.
Why unstructured lattices?
Structured lattice-based schemes introduce additional algebraic properties that could potentially be exploited in future cryptanalytic attacks. By using unstructured lattices, FrodoKEM eliminates these concerns, making it a safer choice in the long run, albeit at the cost of larger key sizes and lower efficiency.
It is important to emphasize that no particular cryptanalytic weaknesses are currently known for recommended parameterizations of structured lattice schemes in comparison to plain LWE. However, our current understanding of the security of these schemes could potentially change in the future with cryptanalytic advances.
Lattices and the Learning with Errors (LWE) problem
Lattice-based cryptography relies on the mathematical structure of lattices, which are regular arrangements of points in multidimensional space. A lattice is defined as the set of all integer linear combinations of a set of basis vectors. The difficulty of certain computational problems on lattices, such as the Shortest Vector Problem (SVP) and the Learning with Errors (LWE) problem, forms the basis of lattice-based schemes.
The Learning with Errors (LWE) problem
The LWE problem is a fundamental hard problem in lattice-based cryptography. It involves solving a system of linear equations where some small random error has been added to each equation, making it extremely difficult to recover the original secret values. This added error ensures that the problem remains computationally infeasible, even for quantum computers. Figure 1 below illustrates the LWE problem, specifically, the search version of the problem.
As can be seen in Figure 1, for the setup of the problem we need a dimension \(n\) that defines the size of matrices, a modulus \(q\) that defines the value range of the matrix coefficients, and a certain error distribution \(\chi\) from which we sample \(\textit{“small”}\) matrices. We sample two matrices from \(\chi\), a small matrix \(\text{s}\) and an error matrix \(\text{e}\) (for simplicity in the explanation, we assume that both have only one column); sample an \(n \times n\) matrix \(\text{A}\) uniformly at random; and compute \(\text{b} = \text{A} \times \text{s} + \text{e}\). In the illustration, each matrix coefficient is represented by a colored square, and the “legend of coefficients” gives an idea of the size of the respective coefficients, e.g., orange squares represent the small coefficients of matrix \(\text{s}\) (small relative to the modulus \(q\)). Finally, given \(\text{A}\) and \(\text{b}\), the search LWE problem consists in finding \(\text{s}\). This problem is believed to be hard for suitably chosen parameters (e.g., for dimension \(n\) sufficiently large) and is used at the core of FrodoKEM.
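For readers who prefer code to pictures, the construction can be sketched in a few lines of NumPy. The dimension, modulus, and Gaussian error width below are illustrative (loosely echoing FrodoKEM-640’s n = 640 and q = 2^15) and do not reproduce FrodoKEM’s exact error distribution:

```python
import numpy as np

def lwe_sample(n: int = 640, q: int = 2**15, sigma: float = 2.8, seed: int = 0):
    """Generate one (search-)LWE instance b = A @ s + e (mod q)."""
    rng = np.random.default_rng(seed)
    A = rng.integers(0, q, size=(n, n), dtype=np.int64)                 # uniform public matrix
    s = np.rint(rng.normal(0, sigma, size=(n, 1))).astype(np.int64)     # small secret
    e = np.rint(rng.normal(0, sigma, size=(n, 1))).astype(np.int64)     # small error
    b = (A @ s + e) % q
    return A, b, s, e

A, b, s, e = lwe_sample()
# Recovering s from (A, b) alone is the (conjectured hard) search-LWE problem.
```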
In comparison, the LWE variant used in ML-KEM—called Module-LWE (M-LWE)—has additional symmetries, adding mathematical structure that helps improve efficiency. In a setting similar to that of the search LWE problem above, the matrix \(\text{A}\) can be represented by just a single row of coefficients.
FIGURE 1: Visualization of the (search) LWE problem.
LWE is conjectured to be quantum-resistant, and FrodoKEM’s security is directly tied to its hardness. In other words, cryptanalysts and quantum researchers have not been able to devise an efficient quantum algorithm capable of solving the LWE problem and, hence, of breaking FrodoKEM. In cryptography, absolute security can never be guaranteed; instead, confidence in a problem’s hardness comes from extensive scrutiny and its resilience against attacks over time.
How FrodoKEM works
FrodoKEM follows the standard paradigm of a KEM, which consists of three main operations—key generation, encapsulation, and decapsulation—performed interactively between a sender and a recipient with the goal of establishing a shared secret key (a schematic code sketch follows the list below):
- Key generation (KeyGen), computed by the recipient
- Generates a public key and a secret key.
- The public key is sent to the sender, while the secret key remains private.
- Encapsulation (Encapsulate), computed by the sender
- Generates a random session key.
- Encrypts the session key using the recipient’s public key to produce a ciphertext.
- Produces a shared key using the session key and the ciphertext.
- The ciphertext is sent to the recipient.
- Decapsulation (Decapsulate), computed by the recipient
- Decrypts the ciphertext using their secret key to recover the original session key.
- Reproduces the shared key using the decrypted session key and the ciphertext.
The shared key generated by the sender and reconstructed by the recipient can then be used to establish secure symmetric-key encryption for further communication between the two parties.
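The whole exchange can be summarized as a short protocol sketch. The `keygen`, `encapsulate`, and `decapsulate` callables are hypothetical placeholders for any KEM implementation; this shows the shape of the flow, not the API of the official FrodoKEM library.

```python
def establish_shared_key(keygen, encapsulate, decapsulate):
    """Run one KEM exchange and confirm both parties derive the same shared key."""
    # Recipient
    public_key, secret_key = keygen()
    # Sender (holds only the public key)
    ciphertext, shared_key_sender = encapsulate(public_key)
    # Recipient (applies the secret key to the received ciphertext)
    shared_key_recipient = decapsulate(secret_key, ciphertext)
    assert shared_key_sender == shared_key_recipient
    return shared_key_sender  # now usable as a symmetric key, e.g., for AES-GCM
```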
Figure 2 below shows a simplified view of the FrodoKEM protocol. As highlighted in red, FrodoKEM uses at its core LWE operations of the form “\(\text{b} = \text{A} \times \text{s} + \text{e}\)”, which are directly applied within the KEM paradigm.
FIGURE 2: Simplified overview of FrodoKEM.
Performance: Strong security has a cost
Not relying on additional algebraic structure certainly comes at a cost for FrodoKEM in the form of increased protocol runtime and bandwidth. The table below compares the performance and key sizes corresponding to the FrodoKEM level 1 parameter set (variant called “FrodoKEM-640-AES”) and the respective parameter set of ML-KEM (variant called “ML-KEM-512”). These parameter sets are intended to match or exceed the brute force security of AES-128. As can be seen, the difference in speed and key sizes between FrodoKEM and ML-KEM is more than an order of magnitude. Nevertheless, the runtime of the FrodoKEM protocol remains reasonable for most applications. For example, on our benchmarking platform clocked at 3.2GHz, the measured runtimes are 0.97 ms, 1.9 ms, and 3.2 ms for security levels 1, 2, and 3, respectively.
For security-sensitive applications, a more relevant comparison is with Classic McEliece, a post-quantum code-based scheme also considered for standardization. In this case, FrodoKEM offers several efficiency advantages. Classic McEliece’s public keys are significantly larger—well over an order of magnitude greater than FrodoKEM’s—and its key generation is substantially more computationally expensive. Nonetheless, Classic McEliece provides an advantage in certain static key-exchange scenarios, where its high key generation cost can be amortized across multiple key encapsulation executions.
TABLE 1: Comparison of key sizes and performance on an x86-64 processor for NIST level 1 parameter sets.
A holistic design made with security in mind
FrodoKEM’s design supports security beyond its reliance on generic, unstructured lattices, which minimizes the attack surface for potential future cryptanalytic threats. Its parameters have been carefully chosen with additional security margins to withstand advancements in known attacks. Furthermore, FrodoKEM is designed with simplicity in mind—its internal operations are based on straightforward matrix-vector arithmetic using integer coefficients reduced modulo a power of two. These design decisions facilitate simple, compact, and secure implementations that are also easier to maintain and to protect against side-channel attacks.
Conclusion
After years of research and analysis, the next generation of post-quantum cryptographic algorithms has arrived. NIST has chosen strong PQC protocols that we believe will serve Microsoft and its customers well in many applications. For security-sensitive applications, FrodoKEM offers a secure yet practical approach for post-quantum cryptography. While its reliance on unstructured lattices results in larger key sizes and higher computational overhead compared to structured lattice-based alternatives, it provides strong security assurances against potential future attacks. Given the ongoing standardization efforts and its endorsement by multiple governmental agencies, FrodoKEM is well-positioned as a viable alternative for organizations seeking long-term cryptographic resilience in a post-quantum world.
Further Reading
For those interested in learning more about FrodoKEM, post-quantum cryptography, and lattice-based cryptography, the following resources provide valuable insights:
- The official FrodoKEM website: https://frodokem.org/ (opens in new tab), which contains, among several other resources, FrodoKEM’s specification document.
- The official FrodoKEM software library: https://github.com/Microsoft/PQCrypto-LWEKE (opens in new tab), which contains reference and optimized implementations of FrodoKEM written in C and Python.
- NIST’s Post-Quantum Cryptography Project: https://csrc.nist.gov/projects/post-quantum-cryptography (opens in new tab).
- Microsoft’s blogpost on its transition plan for PQC: https://techcommunity.microsoft.com/blog/microsoft-security-blog/microsofts-quantum-resistant-cryptography-is-here/4238780 (opens in new tab).
- A comprehensive survey on lattice-based cryptography: Peikert, C. “A Decade of Lattice Cryptography.” Foundations and Trends in Theoretical Computer Science. (2016)
- A comprehensive tutorial on modern lattice-based schemes, including ML-KEM and ML-DSA: Lyubashevsky, V. “Basic Lattice Cryptography: The concepts behind Kyber (ML-KEM) and Dilithium (ML-DSA).” https://eprint.iacr.org/2024/1287 (opens in new tab). (2024)
Magentic-UI, an experimental human-centered web agent
Modern productivity is rooted in the web—from searching for information and filling in forms to navigating dashboards. Yet, many of these tasks remain manual and repetitive. Today, we are introducing Magentic-UI, a new open-source research prototype of a human-centered agent that is meant to help researchers study open questions on human-in-the-loop approaches and oversight mechanisms for AI agents. This prototype collaborates with users on web-based tasks and operates in real time over a web browser. Unlike other computer use agents that aim for full autonomy, Magentic-UI offers a transparent and controllable experience for tasks that are action-oriented and require activities beyond just performing simple web searches.
Magentic-UI builds on Magentic-One (opens in new tab), a powerful multi-agent team we released last year, and is powered by AutoGen (opens in new tab), our leading agent framework. It is available under MIT license at https://github.com/microsoft/Magentic-UI (opens in new tab) and on Azure AI Foundry Labs (opens in new tab), the hub where developers, startups, and enterprises can explore groundbreaking innovations from Microsoft Research. Magentic-UI is integrated with Azure AI Foundry models and agents. Learn more about how to integrate Azure AI agents into the Magentic-UI multi-agent architecture by following this code sample (opens in new tab).
Magentic-UI can perform tasks that require browsing the web, writing and executing Python and shell code, and understanding files. Its key features include:
- Collaborative planning with users (co-planning). Magentic-UI allows users to directly modify its plan through a plan editor or by providing textual feedback before Magentic-UI executes any actions.
- Collaborative execution with users (co-tasking). Users can pause the system and give feedback in natural language or demonstrate it by directly taking control of the browser.
- Safety with human-in-the-loop (action guards). Magentic-UI seeks user approval before executing potentially irreversible actions, and the user can specify how often Magentic-UI needs approvals. Furthermore, Magentic-UI is sandboxed for the safe operation of tools such as browsers and code executors.
- Learning from experience (plan learning). Magentic-UI can learn and save plans from previous interactions to improve task completion for future tasks.
While many web agents promise full autonomy, in practice users can be left unsure of what the agent can do, what it is currently doing, and whether they have enough control to intervene when something goes wrong or doesn’t occur as expected. By contrast, Magentic-UI considers user needs at every stage of interaction. We followed a human-centered design methodology in building Magentic-UI by prototyping and obtaining feedback from pilot users during its design.
Figure 2: Co-planning – Users can collaboratively plan with Magentic-UI.
For example, after a person specifies a task and before Magentic-UI even begins to execute, it creates a clear step-by-step plan that outlines what it would do to accomplish the task. People can collaborate with Magentic-UI to modify this plan and then give final approval for Magentic-UI to begin execution. This is crucial as users may have expectations of how the task should be completed; communicating that information could significantly improve agent performance. We call this feature co-planning.
During execution, Magentic-UI shows in real time what specific actions it’s about to take. For example, whether it is about to click on a button or input a search query. It also shows in real time what it observed on the web pages it is visiting. Users can take control of the action at any point in time and give control back to the agent. We call this feature co-tasking.
Figure 3: Co-tasking – Magentic-UI provides real-time updates about what it is about to do and what it already did, allowing users to collaboratively complete tasks with the agent.
Figure 4: Action-guards – Magentic-UI will ask users for permission before executing actions that it deems consequential or important.
Additionally, Magentic-UI asks for user permission before performing actions that are deemed irreversible, such as closing a tab or clicking a button with side effects. We call these “action guards”. The user can also configure Magentic-UI’s action guards to always ask for permission before performing any action. If the user deems an action risky (e.g., paying for an item), they can reject it.
Figure 5: Plan learning – Once a task is successfully completed, users can request Magentic-UI to learn a step-by-step plan from this experience.
After execution, the user can ask Magentic-UI to reflect on the conversation and infer and save a step-by-step plan for future similar tasks. Users can view and modify saved plans for Magentic-UI to reuse in the future in a saved-plans gallery. In a future session, users can launch Magentic-UI with the saved plan to either execute the same task again, like checking the price of a specific flight, or use the plan as a guide to help complete similar tasks, such as checking the price of a different type of flight.
Combined, these four features—co-planning, co-tasking, action guards, and plan learning—enable users to collaborate effectively with Magentic-UI.
Architecture
Magentic-UI’s underlying system is a team of specialized agents adapted from AutoGen’s Magentic-One system. The agents work together to create a modular system:
- Orchestrator is the lead agent, powered by a large language model (LLM), that performs co-planning with the user, decides when to ask the user for feedback, and delegates sub-tasks to the remaining agents to complete.
- WebSurfer is an LLM agent equipped with a web browser that it can control. Given a request by the Orchestrator, it can click, type, scroll, and visit pages in multiple rounds to complete the request from the Orchestrator.
- Coder is an LLM agent equipped with a Docker code-execution container. It can write and execute Python and shell commands and provide a response back to the Orchestrator.
- FileSurfer is an LLM agent equipped with a Docker code-execution container and file-conversion tools from the MarkItDown (opens in new tab) package. It can locate files in the directory controlled by Magentic-UI, convert files to markdown, and answer questions about them.
To interact with Magentic-UI, users can enter a text message and attach images. In response, Magentic-UI creates a natural-language step-by-step plan with which users can interact through a plan-editing interface. Users can add, delete, edit, or regenerate steps, and write follow-up messages to iterate on the plan. While editing the plan adds an upfront cost to the interaction, it can potentially save a significant amount of time during execution and increase the agent’s chance of success.
The plan is stored inside the Orchestrator and is used to execute the task. For each step of the plan, the Orchestrator determines which of the agents (WebSurfer, Coder, FileSurfer) or the user should complete the step. Once that decision is made, the Orchestrator sends a request to one of the agents or the user and waits for a response. After the response is received, the Orchestrator decides whether that step is complete. If it is, the Orchestrator moves on to the following step.
Once all steps are completed, the Orchestrator generates a final answer that is presented to the user. If, while executing any of the steps, the Orchestrator decides that the plan is inadequate (for example, because a certain website is unreachable), the Orchestrator can replan with user permission and start executing a new plan.
All intermediate progress steps are clearly displayed to the user. Furthermore, the user can pause the execution of the plan and send additional requests or feedback. The user can also configure through the interface whether agent actions (e.g., clicking a button) require approval.
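The control flow described above can be pictured as a simple loop. The agent objects, helper functions, and error handling here are hypothetical simplifications for illustration, not the actual Magentic-UI or AutoGen implementation:

```python
def run_plan(plan_steps, agents, choose_agent, is_step_complete, ask_user,
             max_rounds_per_step: int = 5):
    """Execute an approved plan step by step, delegating each step to an agent
    (or the user) and checking completion before moving on."""
    transcript = []
    for step in plan_steps:
        for _ in range(max_rounds_per_step):
            worker = choose_agent(step, agents)   # e.g., "WebSurfer", "Coder", "FileSurfer", or "user"
            response = ask_user(step) if worker == "user" else agents[worker].run(step)
            transcript.append((step, worker, response))
            if is_step_complete(step, response):
                break
        else:
            # Step never completed: a real Orchestrator would replan here, with user permission.
            raise RuntimeError(f"Could not complete step: {step}")
    return transcript
```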
Evaluating Magentic-UI
Magentic-UI innovates through its ability to integrate human feedback in its planning and execution of tasks. We performed a preliminary automated evaluation to showcase this ability on the GAIA benchmark (opens in new tab) for agents with a user-simulation experiment.
Evaluation with simulated users
Figure 7: Comparison on the GAIA validation set of the accuracy of Magentic-One, Magentic-UI in autonomous mode, Magentic-UI with a simulated user powered by a smarter LLM than the Magentic-UI agents, Magentic-UI with a simulated user that has access to side information about the tasks, and human performance. This shows that human-in-the-loop can improve the accuracy of autonomous agents, bridging the gap to human performance at a fraction of the cost.
GAIA is a benchmark for general AI assistants, with multimodal question-answer pairs that are challenging, requiring the agents to navigate the web, process files, and execute code. The traditional evaluation setup with GAIA assumes the system will autonomously complete the task and return an answer, which is compared to the ground-truth answer.
To evaluate the human-in-the-loop capabilities of Magentic-UI, we transform GAIA into an interactive benchmark by introducing the concept of a simulated user. Simulated users provide value in two ways: by having specific expertise that the agent may not possess, and by providing guidance on how the task should be performed.
We experiment with two types of simulated users to show the value of human-in-the-loop: (1) a simulated user that is more intelligent than the Magentic-UI agents and (2) a simulated user with the same intelligence as Magentic-UI agents but with additional information about the task. During co-planning, Magentic-UI takes feedback from this simulated user to improve its plan. During co-tasking, Magentic-UI can ask the (simulated) user for help when it gets stuck. Finally, if Magentic-UI does not provide a final answer, then the simulated user provides an answer instead. These experiments reflect a lower bound on the value of human feedback, since real users can step in at any time and offer any kind of input—not just when the system explicitly asks for help.
The simulated user is an LLM without any tools, instructed to interact with Magentic-UI the way we would expect a human to act. The first type of simulated user relies on OpenAI’s o4-mini, which is more performant at many tasks than the model powering the Magentic-UI agents (GPT-4o). For the second type of simulated user, we use GPT-4o for both the simulated user and the rest of the agents, but the user has access to side information about each task. Each task in GAIA has side information, which includes a human-written plan to solve the task. While this plan is not used as input in the traditional benchmark, in our interactive setting we provide this information to the second type of simulated user, which is powered by an LLM so that it can mimic a knowledgeable user. Importantly, we tuned our simulated user so as not to reveal the ground-truth answer directly, as the answer is usually found inside the human-written plan. Instead, it is prompted to guide Magentic-UI indirectly. We found that this tuning prevented the simulated user from inadvertently revealing the answer in all but 6% of tasks when Magentic-UI provides a final answer.
On the validation subset of GAIA (162 tasks), we show the results of Magentic-One operating in autonomous mode, Magentic-UI operating in autonomous mode (without the simulated user), Magentic-UI with simulated user (1) (smarter model), Magentic-UI with simulated user (2) (side-information), and human performance. We first note that Magentic-UI in autonomous mode is within a margin of error of the performance of Magentic-One. Note that the same LLM (GPT-4o) is used for Magentic-UI and Magentic-One.
Magentic-UI with the simulated user that has access to side information improves the accuracy of autonomous Magentic-UI by 71%, from a 30.3% task-completion rate to a 51.9% task-completion rate. Moreover, Magentic-UI only asks for help from the simulated user in 10% of tasks and relies on the simulated user for the final answer in 18% of tasks. And in those tasks where it does ask for help, it asks for help on average 1.1 times. Magentic-UI with the simulated user powered by a smarter model improves to 42.6% where Magentic-UI asks for help in only 4.3% of tasks, asking for help an average of 1.7 times in those tasks. This demonstrates the potential of even lightweight human feedback for improving performance (e.g., task completion) over autonomous agents working alone, especially at a fraction of the cost compared to people completing tasks entirely manually.
Learning and reusing plans
As described above, once Magentic-UI completes a task, users have the option for Magentic-UI to learn a plan based on the execution of the task. These plans are saved in a plan gallery, which users and Magentic-UI can access in the future.
The user can select a plan from the plan gallery, which is displayed by clicking on the Saved Plans button. Alternatively, as a user enters a task that closely matches a previous task, the saved plan will be displayed even before the user is done typing. If no identical task is found, Magentic-UI can use AutoGen’s Task-Centric Memory (opens in new tab) to retrieve plans for any similar tasks. Our preliminary evaluations show that this retrieval is highly accurate, and when recalling a saved plan can be around 3x faster than generating a new plan. Once a plan is recalled or generated, the user can always accept it, modify it, or ask Magentic-UI to modify it for the specific task at hand.
Safety and control
Magentic-UI can surf the live internet and execute code. With such capabilities, we need to ensure that Magentic-UI acts in a safe and secure manner. The following features, design decisions, and evaluations were made to ensure this:
- Allow-list: Users can set a list of websites that Magentic-UI is allowed to access. If Magentic-UI needs to access a website outside of the allow-list, users must explicitly approve it through the interface.
- Anytime interruptions: At any point of Magentic-UI completing the task, the user can interrupt Magentic-UI and stop any pending code execution or web browsing.
- Docker sandboxing: Magentic-UI controls a browser that is launched inside a Docker container with no credentials, which avoids risks with logged-in accounts and credentials. Moreover, any code execution is also performed inside a separate Docker container to avoid affecting the host environment in which Magentic-UI is running. This is illustrated in the system architecture of Magentic-UI (Figure 3).
- Detection and approval of irreversible agent actions: Users can configure an action-approval policy (action guards) to determine which actions Magentic-UI can perform without user approval. In the extreme, users can specify that any action (e.g., any button click) needs explicit user approval. Users must press an “Accept” or “Deny” button for each action. A minimal sketch of such a policy check follows this list.
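As referenced in the action-guard item above, here is a minimal sketch of what such a policy check could look like. The action names, policy strings, and allow-list handling are assumptions for illustration, not Magentic-UI’s actual configuration surface:

```python
from urllib.parse import urlparse

# Hypothetical set of actions treated as having side effects.
IRREVERSIBLE_ACTIONS = {"submit_form", "purchase", "close_tab", "delete_file", "send_email"}

def requires_approval(action: str, target_url: str, allow_list: set,
                      policy: str = "irreversible-only") -> bool:
    """Decide whether an action needs explicit user approval.

    policy: "always"            -> every action is gated
            "irreversible-only" -> only actions with side effects are gated
    Any navigation outside the allow-list is gated regardless of policy.
    """
    host = urlparse(target_url).netloc
    if host and host not in allow_list:
        return True
    if policy == "always":
        return True
    return action in IRREVERSIBLE_ACTIONS

# Example: a purchase on an allowed site still requires approval under the default policy.
print(requires_approval("purchase", "https://example.com/cart", {"example.com"}))  # True
```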
In addition to the above design decisions, we performed a red-team evaluation of Magentic-UI on a set of internal scenarios, which we developed to challenge the security and safety of Magentic-UI. Such scenarios include cross-site prompt injection attacks, where web pages contain malicious instructions distinct from the user’s original intent (e.g., to execute risky code, access sensitive files, or perform actions on other websites). It also contains scenarios comparable to phishing, which try to trick Magentic-UI into entering sensitive information, or granting permissions on impostor sites (e.g., a synthetic website that asks Magentic-UI to log in and enter Google credentials to read an article). In our preliminary evaluations, we found that Magentic-UI either refuses to complete the requests, stops to ask the user, or, as a final safety measure, is eventually unable to complete the request due to Docker sandboxing. We have found that this layered approach is effective for thwarting these attacks.
We have also released transparency notes, which can be found at: https://github.com/microsoft/magentic-ui/blob/main/TRANSPARENCY_NOTE.md (opens in new tab)
Open research questions
Magentic-UI provides a tool for researchers to study critical questions in agentic systems, particularly around human-agent interaction. In a previous report (opens in new tab), we outlined 12 questions for human-agent communication, and Magentic-UI provides a vehicle to study these questions in a realistic setting. A key question among these is how to enable humans to efficiently intervene and provide feedback to the agent while it is executing a task. Humans should not have to constantly watch the agent. Ideally, the agent should know when to reach out for help and provide the necessary context for the human to assist it. A second question concerns safety. As agents interact with the live web, they may become prone to attacks from malicious actors. We need to study what safeguards are needed to protect the human from side effects without adding a heavy burden on the human to verify every agent action. There are also many other questions surrounding security, personalization, and learning that Magentic-UI can help study.
Conclusion

Magentic-UI is an open-source agent prototype that works with people to complete complex tasks that require multi-step planning and browser use. As agentic systems expand the scope of tasks they can complete, Magentic-UI’s design provides greater transparency into agent actions and gives humans the control needed to ensure safety and reliability. Moreover, by facilitating human intervention, we can improve performance while still reducing the overall human effort required to complete tasks. Today we have released the first version of Magentic-UI. Looking ahead, we plan to continue developing it in the open, with the goal of improving its capabilities and answering research questions on human-agent collaboration. We invite the research community to extend and reuse Magentic-UI for their scientific explorations and domains.
Explore Magentic-UI on Azure AI Foundry
Predicting and explaining AI model performance: A new approach to evaluation
With support from the Accelerating Foundation Models Research (AFMR) grant program, a team of researchers from Microsoft and collaborating institutions has developed an approach to evaluating AI models that predicts how they will perform on unfamiliar tasks and explains why, something current benchmarks struggle to do.
In the paper, “General Scales Unlock AI Evaluation with Explanatory and Predictive Power,” they introduce a methodology that goes beyond measuring overall accuracy. It assesses the knowledge and cognitive abilities a task requires and evaluates them against the model’s capabilities.
ADeLe: An ability-based approach to task evaluation

The framework uses ADeLe (annotated-demand-levels), a technique that assesses how demanding a task is for an AI model by applying measurement scales for 18 types of cognitive and knowledge-based abilities. This difficulty rating is based on a detailed rubric, originally developed for human tasks and shown to work reliably when applied by AI models.
By comparing what a task requires with what a model can do, ADeLe generates an ability profile that not only predicts performance but also explains why a model is likely to succeed or fail—linking outcomes to specific strengths or limitations.
The 18 scales reflect core cognitive abilities (e.g., attention, reasoning), knowledge areas (e.g., natural or social sciences), and other task-related factors (e.g., prevalence of the task on the internet). Each task is rated from 0 to 5 based on how much it draws on a given ability. For example, a simple math question might score 1 on formal knowledge, while one requiring advanced expertise could score 5. Figure 1 illustrates how the full process works—from rating task requirements to generating ability profiles.
Figure 1. Top: For each AI model, (1) run the new system on the ADeLe benchmark, and (2) extract its ability profile. Bottom: For each new task or benchmark, (A) apply 18 rubrics and (B) get demand histograms and profiles that explain what abilities the tasks require. Optionally, predict performance on the new tasks for any system based on the demand and ability profiles, or past performance data, of the systems.

To develop this system, the team analyzed 16,000 examples spanning 63 tasks drawn from 20 AI benchmarks, creating a unified measurement approach that works across a wide range of tasks. The paper details how ratings across 18 general scales explain model success or failure and predict performance on new tasks in both familiar and unfamiliar settings.
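As a rough illustration of the comparison ADeLe performs, the sketch below checks a task's demand ratings against a model's ability profile to guess whether the model is likely to succeed. The scale names and numbers are invented for illustration, and the all-scales-must-clear rule is a simplification of the paper's actual scoring.

```python
# Illustrative sketch (not the paper's estimator): compare a task's demand ratings
# (0-5 on each ADeLe scale) with a model's ability profile to guess success.
# Scale names and thresholds below are hypothetical examples.

task_demands = {"reasoning": 4, "formal_knowledge": 2, "attention": 3}

# A model's ability profile: the highest demand level on each scale at which
# it still succeeds about 50% of the time (see the ability scores described below).
model_abilities = {"reasoning": 3.5, "formal_knowledge": 4.0, "attention": 4.2}

def likely_to_succeed(demands, abilities):
    # Simplified rule: succeed only if ability meets or exceeds the demand on every scale.
    return all(abilities.get(scale, 0.0) >= level for scale, level in demands.items())

print(likely_to_succeed(task_demands, model_abilities))  # False: reasoning demand (4) > ability (3.5)
```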
Evaluation results

Using ADeLe, the team evaluated 20 popular AI benchmarks and uncovered three key findings: 1) Current AI benchmarks have measurement limitations; 2) AI models show distinct patterns of strengths and weaknesses across different capabilities; and 3) ADeLe provides accurate predictions of whether AI systems will succeed or fail on a new task.
1. Revealing hidden flaws in AI testing methods
Many popular AI tests either don’t measure what they claim or only cover a limited range of difficulty levels. For example, the Civil Service Examination benchmark is meant to test logical reasoning, but it also requires other abilities, like specialized knowledge and metacognition. Similarly, TimeQA, designed to test temporal reasoning, only includes medium-difficulty questions—missing both simple and complex challenges.
2. Creating detailed AI ability profiles
Using the 0–5 rating for each ability, the team created comprehensive ability profiles of 15 LLMs. For each of the 18 abilities measured, they plotted “subject characteristic curves” to show how a model’s success rate changes with task difficulty.
They then calculated a score for each ability—the difficulty level at which a model has a 50% chance of success—and used these results to generate radial plots showing each model’s strengths and weaknesses across the different scales and levels, illustrated in Figure 2.
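The following sketch, using made-up per-item results and SciPy's curve fitting, shows how an ability score of this kind could be computed: fit a success-versus-difficulty curve on one scale and read off the difficulty at which success probability crosses 50%. It illustrates the idea only and is not the paper's estimation procedure.

```python
# Minimal sketch (with made-up data) of the "ability score" idea described above:
# fit a curve of success rate vs. task difficulty on one scale, then report the
# difficulty level at which the model has a 50% chance of success.

import numpy as np
from scipy.optimize import curve_fit

def logistic(d, d50, slope):
    # Success probability decreases with difficulty d; crosses 0.5 at d = d50.
    return 1.0 / (1.0 + np.exp(slope * (d - d50)))

# Hypothetical per-item results on one ability scale: difficulty ratings (0-5) and pass/fail.
difficulty = np.array([0, 1, 1, 2, 2, 3, 3, 4, 4, 5], dtype=float)
passed     = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0], dtype=float)

(d50, slope), _ = curve_fit(logistic, difficulty, passed, p0=[2.5, 1.0])
print(f"Ability score on this scale: ~{d50:.1f} (difficulty at 50% success)")
```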
Figure 2. Ability profiles for the 15 LLMs evaluated.This analysis revealed the following:
- When measured against human performance, AI systems show different strengths and weaknesses across the 18 ability scales.
- Newer LLMs generally outperform older ones, though not consistently across all abilities.
- Knowledge-related performance depends heavily on model size and training methods.
- Reasoning models show clear gains over non-reasoning models in logical thinking, learning and abstraction, and social capabilities, such as inferring the mental states of their users.
- Increasing the size of general-purpose models beyond a certain threshold yields only small performance gains.
3. Predicting AI success and failure
In addition to evaluation, the team created a practical prediction system based on demand-level measurements that forecasts whether a model will succeed on specific tasks, even unfamiliar ones.
The system achieved approximately 88% accuracy in predicting the performance of popular models like GPT-4o and LLaMA-3.1-405B, outperforming traditional methods. This makes it possible to anticipate potential failures before deployment, adding an important reliability-assessment step for AI models.
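The snippet below is a simplified stand-in for this kind of predictor: it trains a logistic-regression classifier to map a task's 18 demand ratings to pass/fail for a hypothetical model, using synthetic data. The real assessor and its features differ; this only shows the shape of the approach.

```python
# Rough sketch of the prediction idea (not the paper's actual assessor): train a
# classifier that maps a task's 18 demand ratings to pass/fail for a given model,
# then use it to anticipate failures on new tasks. All data here is synthetic.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_tasks, n_scales = 500, 18

demands = rng.integers(0, 6, size=(n_tasks, n_scales)).astype(float)  # ratings of 0-5 per scale
# Synthetic ground truth: the hypothetical model tends to fail as total demand grows.
passed = (demands.sum(axis=1) + rng.normal(0, 5, n_tasks) < 40).astype(int)

clf = LogisticRegression(max_iter=1000).fit(demands[:400], passed[:400])
print("held-out accuracy:", clf.score(demands[400:], passed[400:]))
```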
Looking ahead

ADeLe can be extended to multimodal and embodied AI systems, and it has the potential to serve as a standardized framework for AI research, policymaking, and security auditing.
This technology marks a major step toward a science of AI evaluation, one that offers both clear explanations of system behavior and reliable predictions about performance. It aligns with the vision laid out in a previous Microsoft position paper on the promise of applying psychometrics to AI evaluation and a recent Societal AI white paper emphasizing the importance of AI evaluation.
As general-purpose AI advances faster than traditional evaluation methods, this work lays a timely foundation for making AI assessments more rigorous, transparent, and ready for real-world deployment. The research team is working toward building a collaborative community to strengthen and expand this emerging field.
Research Focus: Week of May 7, 2025
In this issue:
New research on compound AI systems and causal verification of the Confidential Consortium Framework; release of Phi-4-reasoning; enriching tabular data with semantic structure, and more.
NEW RESEARCH Towards Resource-Efficient Compound AI Systems

This research introduces Murakkab, a prototype system built on a declarative workflow that reimagines how compound AI systems are built and managed in order to significantly improve resource efficiency. Compound AI systems integrate multiple interacting components, such as language models, retrieval engines, and external tools, and are essential for addressing complex AI tasks. However, current implementations use resources inefficiently: application logic is tightly coupled to execution details, the orchestration and resource-management layers are poorly connected, and efficiency is often traded off against quality.
Murakkab addresses critical inefficiencies in current AI architectures, offering a new approach that unifies workflow orchestration and cluster resource management for better performance and sustainability. In preliminary evaluations, it demonstrates speedups of up to ∼3.4× in workflow completion time while delivering ∼4.5× higher energy efficiency, showing promise for optimizing resources and advancing AI system design.
NEW RESEARCH Smart Casual Verification of the Confidential Consortium Framework

This work presents a new, pragmatic verification technique that improves the trustworthiness of distributed systems like the Confidential Consortium Framework (CCF) and proves its effectiveness by catching critical bugs before deployment. Smart casual verification is a novel hybrid approach to validating CCF, an open-source platform for developing trustworthy and reliable cloud applications, which underpins Microsoft’s Azure Confidential Ledger service.
The researchers apply smart casual verification to validate the correctness of CCF’s novel distributed protocols, focusing on its unique distributed consensus protocol and its custom client consistency model. This hybrid approach combines the rigor of formal specification and model checking with the pragmatism of automated testing, specifically binding the formal specification in TLA+ to the C++ implementation. While traditional formal methods are often one-off efforts by domain experts, the researchers have integrated smart casual verification into CCF’s continuous integration pipeline, allowing contributors to continuously validate CCF as it evolves.
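As a toy illustration of the general idea of binding a specification to an implementation, the sketch below replays a log of events from an implementation and checks each transition against a simplified state-machine spec. CCF's actual pipeline uses TLA+ and its tooling against the C++ implementation; none of the states or events here are CCF's.

```python
# Toy illustration of trace checking: verify that every transition in a log of
# implementation events is allowed by a simplified specification. This is not
# CCF's TLA+ tooling; states and events are invented for illustration.

# Simplified spec of a leadership state machine: allowed (state, event) -> next_state.
SPEC = {
    ("follower",  "election_timeout"): "candidate",
    ("candidate", "win_election"):     "leader",
    ("candidate", "see_higher_term"):  "follower",
    ("leader",    "see_higher_term"):  "follower",
}

def check_trace(events, state="follower"):
    for i, event in enumerate(events):
        nxt = SPEC.get((state, event))
        if nxt is None:
            raise AssertionError(f"step {i}: event '{event}' not allowed in state '{state}'")
        state = nxt
    return state

# A trace as it might be logged by the implementation under test.
print(check_trace(["election_timeout", "win_election", "see_higher_term"]))  # -> "follower"
```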
NEW RESEARCH Phi-4-reasoning Technical Report

This report introduces Phi-4-reasoning (opens in new tab), a 14-billion parameter model optimized for complex reasoning tasks. It is trained via supervised fine-tuning of Phi-4 using a carefully curated dataset of high-quality prompts and reasoning demonstrations generated by o3-mini. These prompts span diverse domains—including math, science, coding, and spatial reasoning—and are selected to challenge the base model near its capability boundaries.
Building on recent findings that reinforcement learning (RL) can further improve smaller models, the team developed Phi-4-reasoning-plus, which incorporates an additional outcome-based RL phase using verifiable math problems. This enhances the model’s ability to generate longer, more effective reasoning chains.
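A minimal sketch of what an outcome-based reward on verifiable problems can look like is shown below. The answer format and extraction logic are assumptions for illustration, not Phi-4-reasoning's actual training setup.

```python
# Minimal sketch of the "outcome-based reward on verifiable math problems" idea:
# the reward depends only on whether the final answer checks out, not on the
# intermediate reasoning. Answer extraction and the RL algorithm are simplified.

import re

def extract_final_answer(completion: str) -> str | None:
    # Assumes the model is prompted to end with "ANSWER: <value>" (an assumption, not Phi-4's format).
    match = re.search(r"ANSWER:\s*([-\d./]+)\s*$", completion.strip())
    return match.group(1) if match else None

def outcome_reward(completion: str, verified_answer: str) -> float:
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == verified_answer else 0.0

print(outcome_reward("The sum is 12 + 30 = 42.\nANSWER: 42", "42"))  # 1.0
print(outcome_reward("I think it's probably 41.\nANSWER: 41", "42"))  # 0.0
```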
Despite its smaller size, the Phi-4-reasoning family outperforms significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approaches the performance of full-scale frontier models like DeepSeek-R1. It excels in tasks requiring multi-step problem solving, logical inference, and goal-directed planning.
The work highlights the combined value of supervised fine-tuning and reinforcement learning for building efficient, high-performing reasoning models. It also offers insights into training data design, methodology, and evaluation strategies. Phi-4-reasoning contributes to the growing class of reasoning-specialized language models and points toward more accessible, scalable AI for science, education, and technical domains.
NEW RESEARCH TeCoFeS: Text Column Featurization using Semantic Analysis

This research introduces a practical, cost-effective solution for enriching tabular data with semantic structure, making it more useful for downstream analysis and insights—which is especially valuable in business intelligence, data cleaning, and automated analytics workflows. This approach outperforms baseline models and naive LLM applications on converted text classification benchmarks.
Extracting structured insights from free-text columns in tables—such as product reviews or user feedback—can be time-consuming and error-prone, especially when relying on traditional syntactic methods that often miss semantic meaning. This research introduces the semantic text column featurization problem, which aims to assign meaningful, context-aware labels to each entry in a text column.
The authors propose a scalable, efficient method that combines the power of LLMs with text embeddings. Instead of labeling an entire column manually or applying LLMs to every cell—an expensive process—this new method intelligently samples a diverse subset of entries, uses an LLM to generate semantic labels for just that subset, and then propagates those labels to the rest of the column using embedding similarity.
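The sketch below illustrates this label-and-propagate pattern under simplifying assumptions: the LLM call and the embedding model are stand-ins (a keyword rule and TF-IDF vectors), and the diverse-sampling step is reduced to a caller-supplied index set. It is not the paper's implementation.

```python
# Sketch of the method described above (helper names are placeholders, not the paper's
# implementation): label a small sample of a text column with an LLM, then propagate
# those labels to every other cell via embedding similarity.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def llm_label(text: str) -> str:
    # Placeholder for an LLM call that returns a semantic, context-aware label.
    return "complaint" if "broken" in text.lower() else "praise"

def featurize_column(cells: list[str], sample_idx: list[int]) -> list[str]:
    vectors = TfidfVectorizer().fit_transform(cells)        # stand-in for richer embeddings
    labeled = {i: llm_label(cells[i]) for i in sample_idx}   # LLM labels only the sampled cells
    sims = cosine_similarity(vectors, vectors[sample_idx])   # similarity of every cell to the sample
    return [labeled[sample_idx[int(np.argmax(row))]] for row in sims]

cells = ["Arrived broken, very disappointed", "Works great, love it", "Screen broken on day one"]
print(featurize_column(cells, sample_idx=[0, 1]))
# ['complaint', 'praise', 'complaint'] -- only two LLM calls; labels propagated to the rest
```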
NEW RESEARCH Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning

This work introduces ARTIST (Agentic Reasoning and Tool Integration in Self-improving Transformers), a new paradigm for LLM reasoning that expands beyond traditional language-only inference.
While LLMs have made considerable strides in complex reasoning tasks, they remain limited by their reliance on static internal knowledge and text-only reasoning. Real-world problem solving often demands dynamic, multi-step reasoning, adaptive decision making, and the ability to interact with external tools and environments. ARTIST brings together agentic reasoning, reinforcement learning (RL), and tool integration, enabling LLMs to autonomously decide when and how to invoke external tools within multi-turn reasoning chains. It leverages outcome-based RL to learn robust strategies for tool use and environment interaction without requiring step-level supervision.
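Below is a toy rollout loop illustrating the agentic pattern described above: the policy chooses at each turn whether to call a tool or give a final answer, tool output is appended to the context, and only the final outcome is rewarded. The scripted policy and calculator tool are placeholders, not ARTIST's implementation.

```python
# Toy rollout sketch of the agentic loop described above (not ARTIST's implementation):
# at each turn the model either reasons, calls a tool, or answers; tool results are
# fed back into the context, and only the final outcome is rewarded (outcome-based RL).

def calculator(expression: str) -> str:                      # one "external tool"
    return str(eval(expression, {"__builtins__": {}}))       # toy only; never eval untrusted input

def policy(context: str) -> str:
    # Placeholder for the LLM. Here it is scripted: decide to call the tool, then answer.
    if "TOOL_RESULT" not in context:
        return "CALL calculator: 17 * 24"
    return "FINAL 408"

def rollout(question: str, verified_answer: str) -> float:
    context = question
    for _ in range(4):                                        # bounded multi-turn reasoning chain
        step = policy(context)
        if step.startswith("CALL calculator:"):
            result = calculator(step.split(":", 1)[1])
            context += f"\nTOOL_RESULT {result}"              # tool output re-enters the context
        elif step.startswith("FINAL"):
            answer = step.split()[1]
            return 1.0 if answer == verified_answer else 0.0  # outcome-based reward
    return 0.0

print(rollout("What is 17 * 24?", "408"))  # 1.0
```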
Extensive experiments on mathematical reasoning and multi-turn function calling benchmarks show that ARTIST consistently outperforms state-of-the-art baselines, with up to 22% absolute improvement over base models and strong gains on the most challenging tasks. Detailed studies show that agentic RL training leads to deeper reasoning, more effective tool use, and higher-quality solutions.
PODCAST Materialism Podcast: MatterGen (opens in new tab)

What if you could find materials with tailored properties without ever entering the lab? The Materialism Podcast, which is dedicated to exploring materials science and engineering, talks with Tian Xie from Microsoft Research about MatterGen, an AI tool that accelerates materials science discovery. Tune in to hear a discussion of the new Azure AI Foundry, where MatterGen will interact with and support MatterSim, an advanced deep learning model designed to simulate the properties of materials across a wide range of elements, temperatures, and pressures.
IN THE NEWS: Highlights of recent media coverage of Microsoft Research
When ChatGPT Broke an Entire Field: An Oral History
Quanta Magazine | April 30, 2025
Large language models are everywhere, igniting discovery, disruption and debate in whatever scientific community they touch. But the one they touched first — for better, worse and everything in between — was natural language processing. What did that impact feel like to the people experiencing it firsthand?
To tell that story, Quanta interviewed 19 NLP experts, including Kalika Bali, senior principal researcher at Microsoft Research. From researchers to students, tenured academics to startup founders, they describe a series of moments — dawning realizations, elated encounters and at least one “existential crisis” — that changed their world. And ours.
Microsoft Fusion Summit explores how AI can accelerate fusion research
The pursuit of nuclear fusion as a limitless, clean energy source has long been one of humanity’s most ambitious scientific goals. Research labs and companies worldwide are working to replicate the fusion process that occurs at the sun’s core, where isotopes of hydrogen combine to form helium, releasing vast amounts of energy. While scalable fusion energy is still years away, researchers are now exploring how AI can help accelerate fusion research and bring this energy to the grid sooner.
In March 2025, Microsoft Research held its inaugural Fusion Summit, a landmark event that brought together distinguished speakers and panelists from within and outside Microsoft Research to explore this question.
Ashley Llorens, Corporate Vice President and Managing Director of Microsoft Research Accelerator, opened the Summit by outlining his vision for a self-reinforcing system that uses AI to drive sustainability. Steven Cowley, laboratory director of the U.S. Department of Energy’s Princeton Plasma Physics Laboratory (opens in new tab), professor at Princeton University, and former head of the UK Atomic Energy Authority, followed with a keynote explaining the intricate science and engineering behind fusion reactors. His message was clear: advancing fusion will require international collaboration and the combined power of AI and high-performance computing to model potential fusion reactor designs.
Applying AI to fusion research

North America’s largest fusion facility, DIII-D (opens in new tab), operated by General Atomics and owned by the US Department of Energy (DOE), provides a unique platform for developing and testing AI applications for fusion research, thanks to its pioneering data and digital twin platform.
Richard Buttery (opens in new tab) from DIII-D and Dave Humphreys (opens in new tab) from General Atomics demonstrated how the US DIII-D National Fusion Program (opens in new tab) is already applying AI to advance reactor design and operations, highlighting promising directions for future development. Their examples included applying AI to active plasma control to avoid disruptive instabilities, using AI-controlled trajectories to avoid tearing modes, and implementing feedback control with machine-learning-derived density limits for safer high-density operation.
One persistent challenge in reactor design involves building the interior “first wall,” which must withstand extreme heat and particle bombardment. Zulfi Alam, corporate vice president of Microsoft Quantum (opens in new tab), discussed the potential of using quantum computing in fusion, particularly for addressing material challenges like hydrogen diffusion in reactors.
He noted that silicon nitride shows promise as a barrier to hydrogen and vapor and explained the challenge of binding it to the reaction chamber. He emphasized the potential of quantum computing to improve material prediction and synthesis, enabling more efficient processes. He shared that his team is also investigating advanced silicon nitride materials to protect this critical component from neutron and alpha particle damage—an innovation that could make fusion commercially viable.
Exploring AI’s broader impact on fusion engineering

Lightning talks from Microsoft Research labs addressed the central question of AI’s potential to accelerate fusion research and engineering. Speakers covered a wide range of applications—from using gaming AI for plasma control and robotics for remote maintenance to physics-informed AI for simulating materials and plasma behavior. Closing the session, Archie Manoharan, Microsoft’s director of nuclear engineering for Cloud Operations and Infrastructure, emphasized the need for a comprehensive energy strategy, one that incorporates renewables, efficiency improvements, storage solutions, and carbon-free sources like fusion.
The Summit culminated in a thought-provoking panel discussion moderated by Ade Famoti, featuring Archie Manoharan, Richard Buttery, Steven Cowley, and Chris Bishop, Microsoft Technical Fellow and director of Microsoft Research AI for Science. Their wide-ranging conversation explored the key challenges and opportunities shaping the field of fusion.
The panel highlighted several themes: the role of new regulatory frameworks that balance innovation with safety and public trust; the importance of materials discovery in developing durable fusion reactor walls; and the game-changing role AI could play in plasma optimization and surrogate modelling of fusion’s underlying physics.
They also examined the importance of global research collaboration, citing projects like the International Thermonuclear Experimental Reactor (opens in new tab) (ITER), the world’s largest experimental fusion device under construction in southern France, as testbeds for shared progress. One persistent challenge, however, is data scarcity. This prompted a discussion of using physics-informed neural networks as a potential approach to supplement limited experimental data.
Global collaboration and next steps

Microsoft is collaborating with ITER (opens in new tab), using Microsoft 365 Copilot, Azure OpenAI Service, Visual Studio, and GitHub (opens in new tab), to help advance the technologies and infrastructure needed to achieve fusion ignition—the critical point at which a self-sustaining fusion reaction begins. Microsoft Research is now working with ITER to identify where AI can be applied to model future experiments and optimize the facility’s design and operations.
Microsoft Research has also signed a Memorandum of Understanding with the Princeton Plasma Physics Laboratory (PPPL) (opens in new tab) to foster collaboration through knowledge exchange, workshops, and joint research projects. This effort aims to address key challenges in fusion, materials, plasma control, digital twins, and experiment optimization. Together, Microsoft Research and PPPL will work to drive innovation and advances in these critical areas.
Fusion is a scientific challenge unlike any other and could be key to sustainable energy in the future. We’re excited about the role AI can play in helping make that vision a reality. To learn more, visit the Fusion Summit event page, or connect with us by email at FusionResearch@microsoft.com.