Microsoft Research

URL: https://www.microsoft.com/en-us/research/blog/

Updated: 9 min 10 sec ago

PadChest-GR: A bilingual grounded radiology reporting benchmark for chest X-rays

Thu, 06/26/2025 - 18:08

In our ever-evolving journey to enhance healthcare through technology, we’re announcing a unique new benchmark for grounded radiology report generation—PadChest-GR (opens in new tab). The world’s first multimodal, bilingual sentence-level radiology report dataset, developed by the University of Alicante with Microsoft Research, University Hospital Sant Joan d’Alacant and MedBravo, is set to redefine how AI and radiologists interpret radiological images. Our work demonstrates how collaboration between humans and AI can create powerful feedback loops—where new datasets drive better AI models, and those models, in turn, inspire richer datasets. We’re excited to share this progress in NEJM AI, highlighting both the clinical relevance and research excellence of this initiative.

A new frontier in radiology report generation

It is estimated that over half of people visiting hospitals have radiology scans that must be interpreted by a clinical professional. Traditional radiology reports often condense multiple findings into unstructured narratives. In contrast, grounded radiology reporting demands that each finding be described and localized individually.

This can mitigate the risk of AI fabrications and enable new interactive capabilities that enhance clinical and patient interpretability. PadChest-GR is the first bilingual dataset to address this need with 4,555 chest X-ray studies complete with Spanish and English sentence-level descriptions and precise spatial (bounding box) annotations for both positive and negative findings. It is the first public benchmark that enables us to evaluate generation of fully grounded radiology reports in chest X-rays.

Figure 1. Example of a grounded report from PadChest-GR. The original free-text report in Spanish was ”Motivo de consulta: Preoperatorio. Rx PA tórax: Impresión diagnóstica: Ateromatosis aórtica calcificada. Engrosamiento pleural biapical. Atelectasia laminar basal izquierda. Elongación aórtica. Sin otros hallazgos radiológicos significativos.”

This benchmark isn’t standing alone—it plays a critical role in powering our state-of-the-art multimodal report generation model, MAIRA-2. Leveraging the detailed annotations of PadChest-GR, MAIRA-2 represents our commitment to building more interpretable and clinically useful AI systems. You can explore our work on MAIRA-2 on our project web page, including recent user research conducted with clinicians in healthcare settings.

PadChest-GR is a testament to the power of collaboration. Aurelia Bustos at MedBravo and Antonio Pertusa at the University of Alicante published the original PadChest dataset (opens in new tab) in 2020, with the help of Jose María Salinas from Hospital San Juan de Alicante and María de la Iglesia Vayá from the Center of Excellence in Biomedical Imaging at the Ministry of Health in Valencia, Spain. We started to look at PadChest and were deeply impressed by the scale, depth, and diversity of the data.

As we worked more closely with the dataset, we realized the opportunity to develop this for grounded radiology reporting research and worked with the team at the University of Alicante to determine how to approach this together. Our complementary expertise was a nice fit. At Microsoft Research, our mission is to push the boundaries of medical AI through innovative, data-driven solutions. The University of Alicante, with its deep clinical expertise, provided critical insights that greatly enriched the dataset’s relevance and utility. The result of this collaboration is the PadChest-GR dataset.

A significant enabler of our annotation process was Centaur Labs. The team of senior and junior radiologists from the University Hospital Sant Joan d’Alacant, coordinated by Joaquin Galant, used this HIPAA-compliant labeling platform to perform rigorous study-level quality control and bounding box annotations. The annotation protocol implemented ensured that each annotation was accurate and consistent, forming the backbone of a dataset designed for the next generation of grounded radiology report generation models.

Accelerating PadChest-GR dataset annotation with AI

Our approach integrates advanced large language models with comprehensive manual annotation:

Data Selection & Processing: Leveraging Microsoft Azure OpenAI Service (opens in new tab) with GPT-4, we extracted sentences describing individual positive and negative findings from raw radiology reports, translated them from Spanish to English, and linked each sentence to the existing expert labels from PadChest. This was done for a selected subset of the full PadChest dataset, carefully curated to reflect a realistic distribution of clinically relevant findings.

Manual Quality Control & Annotation: The processed studies underwent meticulous quality checks on the Centaur Labs platform by radiologist from Hospital San Juan de Alicante. Each positive finding was then annotated with bounding boxes to capture critical spatial information.

Standardization & Integration: All annotations were harmonized into coherent grounded reports, preserving the structure and context of the original findings while enhancing interpretability.

Figure 2. Overview of the data curation pipeline. Impact and future directions

PadChest-GR not only sets a new benchmark for grounded radiology reporting, but also serves as the foundation for our MAIRA-2 model, which already showcases the potential of highly interpretable AI in clinical settings. While we developed PadChest-GR to help train and validate our own models, we believe the research community will greatly benefit from this dataset for many years to come. We look forward to seeing the broader research community build on this—improving grounded reporting AI models and using PadChest-GR as a standard for evaluation. We believe that by fostering open collaboration and sharing our resources, we can accelerate progress in medical imaging AI and ultimately improve patient care together with the community.

The collaboration between Microsoft Research and the University of Alicante highlights the transformative power of working together across disciplines. With our publication in NEJM-AI and the integral role of PadChest-GR in the development of MAIRA-2 (opens in new tab) and RadFact (opens in new tab), we are excited about the future of AI-empowered radiology. We invite researchers and industry experts to explore PadChest-GR and MAIRA-2, contribute innovative ideas, and join us in advancing the field of grounded radiology reporting.

Papers already using PadChest-GR:

For further details or to download PadChest-GR, please visit the BIMCV PadChest-GR Project (opens in new tab).

Models in the Azure Foundry that can do Grounded Reporting:

Acknowledgement

Authors: Daniel C. Castro (opens in new tab), Aurelia Bustos (opens in new tab), Shruthi Bannur (opens in new tab), Stephanie L. Hyland (opens in new tab), Kenza Bouzid (opens in new tab), Maria Teodora Wetscherek (opens in new tab), Maria Dolores Sánchez-Valverde (opens in new tab), Lara Jaques-Pérez (opens in new tab), Lourdes Pérez-Rodríguez (opens in new tab), Kenji Takeda (opens in new tab), José María Salinas (opens in new tab), Javier Alvarez-Valle (opens in new tab), Joaquín Galant Herrero (opens in new tab), Antonio Pertusa (opens in new tab)

MSR Health Futures UK: Hannah Richardson, Valentina Salvatelli, Harshita Sharma, Sam Bond-Taylor, Max Ilse, Fernando Perez-Garcia, Anton Schwaighofer, Jonathan Carlson

MSR Flow: Kenji Takeda, Evelyn Viegas, Ashley Llorens

HLS: Matthew Lungren, Naiteek Sangani, Shrey Jain, Ivan Tarapov, Will Guyman, Mert Oez, Chris Burt, David Ardman

Opens in a new tab

The post PadChest-GR: A bilingual grounded radiology reporting benchmark for chest X-rays appeared first on Microsoft Research.

Categories: Microsoft

Learning from other domains to advance AI evaluation and testing

Mon, 06/23/2025 - 18:35

As generative AI becomes more capable and widely deployed, familiar questions from the governance of other transformative technologies have resurfaced. Which opportunities, capabilities, risks, and impacts should be evaluated? Who should conduct evaluations, and at what stages of the technology lifecycle? What tests or measurements should be used? And how can we know if the results are reliable?

Recent research and reports from Microsoft (opens in new tab), the UK AI Security Institute (opens in new tab), The New York Times (opens in new tab), and MIT Technology Review (opens in new tab) have highlighted gaps in how we evaluate AI models and systems. These gaps also form foundational context for recent international expert consensus reports: the inaugural International AI Safety Report (opens in new tab) (2025) and the Singapore Consensus (opens in new tab) (2025). Closing these gaps at a pace that matches AI innovation will lead to more reliable evaluations that can help guide deployment decisions, inform policy, and deepen trust.

Today, we’re launching a limited-series podcast, AI Testing and Evaluation: Learnings from Science and Industry, to share insights from domains that have grappled with testing and measurement questions. Across four episodes, host Kathleen Sullivan speaks with academic experts in genome editing, cybersecurity, pharmaceuticals, and medical devices to find out which technical and regulatory steps have helped to close evaluation gaps and earn public trust.

We’re also sharing written case studies from experts, along with top-level lessons we’re applying to AI. At the close of the podcast series, we’ll offer Microsoft’s deeper reflections on next steps toward more reliable and trustworthy approaches to AI evaluation.

Lessons from eight case studies

Our research on risk evaluation, testing, and assurance models in other domains began in December 2024, when Microsoft’s Office of Responsible AI (opens in new tab) gathered independent experts from the fields of civil aviation, cybersecurity, financial services, genome editing, medical devices, nanoscience, nuclear energy, and pharmaceuticals. In bringing this group together, we drew on our own learnings and feedback received on our e-book, Global Governance: Goals and Lessons for AI (opens in new tab), in which we studied the higher-level goals and institutional approaches that had been leveraged for cross-border governance in the past.

While approaches to risk evaluation and testing vary significantly across the case studies, there was one consistent, top-level takeaway: evaluation frameworks always reflect trade-offs among different policy objectives, such as safety, efficiency, and innovation.

Experts across all eight fields noted that policymakers have had to weigh trade-offs in designing evaluation frameworks. These frameworks must account for both the limits of current science and the need for agility in the face of uncertainty. They likewise agreed that early design choices, often reflecting the “DNA” of the historical moment in which they’re made, as cybersecurity expert Stewart Baker described it, are important as they are difficult to scale down or undo later.

Strict, pre-deployment testing regimes—such as those used in civil aviation, medical devices, nuclear energy, and pharmaceuticals—offer strong safety assurances but can be resource-intensive and slow to adapt. These regimes often emerged in response to well-documented failures and are backed by decades of regulatory infrastructure and detailed technical standards.

In contrast, fields marked by dynamic and complex interdependencies between the tested system and its external environment—such as cybersecurity and bank stress testing—rely on more adaptive governance frameworks, where testing may be used to generate actionable insights about risk rather than primarily serve as a trigger for regulatory enforcement.

Moreover, in pharmaceuticals, where interdependencies are at play and there is emphasis on pre-deployment testing, experts highlighted a potential trade-off with post-market monitoring of downstream risks and efficacy evaluation.

These variations in approaches across domains—stemming from differences in risk profiles, types of technologies, maturity of the evaluation science, placement of expertise in the assessor ecosystem, and context in which technologies are deployed, among other factors—also inform takeaways for AI.

Applying risk evaluation and governance lessons to AI

While no analogy perfectly fits the AI context, the genome editing and nanoscience cases offer interesting insights for general-purpose technologies like AI, where risks vary widely depending on how the technology is applied.

Experts highlighted the benefits of governance frameworks that are more flexible and tailored to specific use cases and application contexts. In these fields, it is challenging to define risk thresholds and design evaluation frameworks in the abstract. Risks become more visible and assessable once the technology is applied to a particular use case and context-specific variables are known.

These and other insights also helped us distill qualities essential to ensuring that testing is a reliable governance tool across domains, including:

Rigor in defining what is being examined and why it matters. This requires detailed specification of what is being measured and understanding how the deployment context may affect outcomes.
Standardization of how tests should be conducted to achieve valid, reliable results. This requires establishing technical standards that provide methodological guidance and ensure quality and consistency.
Interpretability of test results and how they inform risk decisions. This requires establishing expectations for evidence and improving literacy in how to understand, contextualize, and use test results—while remaining aware of their limitations.

Toward stronger foundations for AI testing

Establishing robust foundations for AI evaluation and testing requires effort to improve rigor, standardization, and interpretability—and to ensure that methods keep pace with rapid technological progress and evolving scientific understanding.

Taking lessons from other general-purpose technologies, this foundational work must also be pursued for both AI models and systems. While testing models will continue to be important, reliable evaluation tools that provide assurance for system performance will enable broad adoption of AI, including in high-risk scenarios. A strong feedback loop on evaluations of AI models and systems could not only accelerate progress on methodological challenges but also bring focus to which opportunities, capabilities, risks, and impacts are most appropriate and efficient to evaluate at what points along the AI development and deployment lifecycle.

Acknowledgements

We would like to thank the following external experts who have contributed to our research program on lessons for AI testing and evaluation: Mateo Aboy, Paul Alp, Gerónimo Poletto Antonacci, Stewart Baker, Daniel Benamouzig, Pablo Cantero, Daniel Carpenter, Alta Charo, Jennifer Dionne, Andy Greenfield, Kathryn Judge, Ciaran Martin, and Timo Minssen.

Case studies

Civil aviation: Testing in Aircraft Design and Manufacturing, by Paul Alp

Cybersecurity: Cybersecurity Standards and Testing—Lessons for AI Safety and Security, by Stewart Baker

Financial services (bank stress testing): The Evolving Use of Bank Stress Tests, by Kathryn Judge

Genome editing: Governance of Genome Editing in Human Therapeutics and Agricultural Applications, by Alta Charo and Andy Greenfield

Medical devices: Medical Device Testing: Regulatory Requirements, Evolution and Lessons for AI Governance, by Mateo Aboy and Timo Minssen

Nanoscience: The regulatory landscape of nanoscience and nanotechnology, and applications to future AI regulation, by Jennifer Dionne

Nuclear energy: Testing in the Nuclear Industry, by Pablo Cantero and Gerónimo Poletto Antonacci

Pharmaceuticals: The History and Evolution of Testing in Pharmaceutical Regulation, by Daniel Benamouzig and Daniel Carpenter

Opens in a new tab

The post Learning from other domains to advance AI evaluation and testing appeared first on Microsoft Research.

Categories: Microsoft

Breaking bonds, breaking ground: Advancing the accuracy of computational chemistry with deep learning

Wed, 06/18/2025 - 12:01

We are excited to share our first big milestone in solving a grand challenge that has hampered the predictive power of computational chemistry, biochemistry, and materials science for decades. By using a scalable deep-learning approach and generating an unprecedented quantity of diverse, highly accurate data, we have achieved a breakthrough in the accuracy of density functional theory (DFT), the workhorse method that thousands of scientists use every year to simulate matter at the atomistic level. Within the region of chemical space represented in our large training dataset, our model reaches the accuracy required to reliably predict experimental outcomes, as assessed on the well-known benchmark dataset W4-17 (opens in new tab). This removes a fundamental barrier to shifting the balance of molecule and material design from being driven by laboratory experiments to being driven by computational simulations. The implications for accelerating scientific discovery are far reaching, spanning applications from drugs to batteries and green fertilizers.

What is DFT?

Molecules and materials are made of atoms, which are held together by their electrons. These electrons act as a glue, determining the stability and properties of the chemical structure. Accurately computing the strength and properties of the electron glue is essential for predicting whether a chemical reaction will proceed, whether a candidate drug molecule will bind to its target protein, whether a material is suitable for carbon capture, or if a flow battery can be optimized for renewable energy storage. Unfortunately, a brute-force approach amounts to solving the many-electron Schrödinger equation, which requires computation that scales exponentially with the number of electrons. Considering that an atom has dozens of electrons, and that molecules and materials have large numbers of atoms, we could easily end up waiting the age of the universe to complete our computation unless we restrict our attention to small systems with only a few atoms.

DFT, introduced by Walter Kohn and collaborators in 1964-1965, was a true scientific breakthrough, earning Kohn the Nobel Prize in Chemistry in 1998. DFT provides an extraordinary reduction in the computational cost of calculating the electron glue in an exact manner, from exponential to cubic, making it possible to perform calculations of practical value within seconds to hours.

DFT Timeline What is the grand challenge in DFT?

But there is a catch: the exact reformulation has a small but crucial term—the exchange-correlation (XC) functional—which Kohn proved is universal (i.e., the same for all molecules and materials), but for which no explicit expression is known. For 60 years, people have designed practical approximations for the XC functional. The magazine Science dubbed the gold rush to design better XC models the “pursuit of the Divine Functional (opens in new tab)”. With time, these approximations have grown into a zoo of hundreds of different XC functionals from which users must choose, often using experimental data as a guide. Owing to the uniquely favorable computational cost of DFT, existing functionals have enabled scientists to gain extremely useful insight into a huge variety of chemical problems. However, the limited accuracy and scope of current XC functionals mean that DFT is still mostly used to interpret experimental results rather than predict them.

Why is it important to increase the accuracy of DFT?

We can contrast the present state of computational chemistry with the state of aircraft engineering and design. Thanks to predictive simulations, aeronautical engineers no longer need to build and test thousands of prototypes to identify one viable design. However, this is exactly what we currently must do in molecular and materials sciences. We send thousands of potential candidates to the lab, because the accuracy of the computational methods is not sufficient to predict the experiments. To make a significant shift in the balance from laboratory to in silico experiments, we need to remove the fundamental bottleneck of the insufficient accuracy of present XC functionals. This amounts to bringing the error of DFT calculations with respect to experiments within chemical accuracy, which is around 1 kcal/mol for most chemical processes. Present approximations typically have errors that are 3 to 30 times larger.

How can AI make a difference?

AI can transform how we model molecules and materials with DFT by learning the XC functional directly from highly accurate data. The goal is to learn how the XC functional captures the complex relationship between its input, the electron density, and its output, the XC energy. You can think of the density like a glue, with regions of space where there is a lot of it and other regions with less of it. Traditionally, researchers have built XC functional approximations using the concept of the so-called Jacob’s ladder: a hierarchy of increasingly complex, hand-designed descriptors of the electron density. Including density descriptors from higher rungs of this ladder aims to improve accuracy, but it comes at the price of increased computational cost. Even the few attempts that use machine learning have stayed within this traditional paradigm, thereby taking an approach that is akin to what people were doing in computer vision and speech recognition before the deep-learning era. Progress toward better accuracy has stagnated for at least two decades with this approach.

Our project is driven by the intuition that a true deep learning approach—where relevant representations of the electron density are learned directly from data in a computationally scalable way—has the potential to revolutionize the accuracy of DFT, much like deep learning has transformed other fields. A significant challenge with going down this path, however, is that feature or representation learning is very data-hungry, and there is very little data around—too little to test this hypothesis reliably.

What have we done in this milestone?

The first step was generating data—a lot of it. This posed a major challenge, since the data must come from accurate solutions of the many-electron Schrödinger equation, which is precisely the prohibitively expensive problem that DFT is designed to replace. Fortunately, decades of progress in the scientific community have led to smarter, more efficient variants of brute-force methods, making it possible to compute reference data for small molecules at experimental accuracy. While these high-accuracy methods, also referred to as wavefunction methods, are far too costly for routine use in applications, we made a deliberate investment in them for this project. The reason? The upfront cost of generating high-quality training data is offset by the long-term benefit of enabling vast numbers of industrially relevant applications with cost effective DFT using the trained XC functional. Crucially, we rely on the ability of DFT—and our learned XC functional—to generalize from high-accuracy data for small systems to larger, more complex molecules.

There are many different high-accuracy wavefunction methods, each tailored to different regions of chemical space. However, their use at scale is not well established, as they require extensive expertise—small methodological choices can significantly affect accuracy at the level that we target. We therefore joined forces with Prof. Amir Karton (opens in new tab) from the University of New England, Australia, a world-leading expert who developed widely recognized benchmark datasets for a fundamental thermochemical property: atomization energy—the energy required to break all bonds in a molecule and separate it into individual atoms. To create a training dataset of atomization energies at unprecedented scale, our team at Microsoft built a scalable pipeline to produce highly diverse molecular structures. Using these structures and substantial Azure compute resources via Microsoft’s Accelerating Foundation Models Research program (opens in new tab), Prof. Karton applied a high-accuracy wavefunction method to compute the corresponding energy labels. The result is a dataset (opens in new tab) two orders of magnitude larger than previous efforts. We are releasing a large part of this dataset (opens in new tab) to the scientific community.

Data generation was only half of the challenge. We also needed to design a dedicated deep-learning architecture for the XC functional—one that is both computationally scalable and capable of learning meaningful representations from electron densities to accurately predict the XC energy. Our team of machine learning specialists, assisted by DFT experts, introduced a series of innovations that solve these and other challenges inherent to this complex learning problem. The result is Skala, an XC functional that generalizes to unseen molecules, reaching the accuracy needed to predict experiments. This demonstrates for the first time that deep learning can truly disrupt DFT: reaching experimental accuracy does not require the computationally expensive hand-designed features of Jacob’s ladder. Instead, we can retain the original computational complexity of DFT while allowing the XC functional to learn how to extract meaningful features and predict accurate energies.

We compare the accuracy of Skala against the best existing functionals of varying computational cost. The prediction errors are evaluated on two well-known public benchmark datasets: the W4-17 dataset for atomization energies (y axis, mean absolute error) and the GMTKN55 dataset for general main-group chemistry (x axis, weighted total mean absolute deviation, or WTMAD-2 for short). Skala achieves near “chemical accuracy” (1 kcal/mol) on atomization energies. This is the accuracy required for predictive modeling of laboratory experiments, which, to date, no existing functional has reached. Skala works especially well on the “single reference” subset of this dataset, reaching a groundbreaking 0.85 kcal/mol. On the GMTKN55 dataset, Skala shows competitive accuracy to the best-performing hybrid functionals, at a lower cost.

“Skala is a new density functional for the exchange-correlation energy that employs meta-GGA ingredients plus D3 dispersion and machine-learned nonlocal features of the electron density. Some exact constraints were imposed, and some others “emerge” from the fitting to about 150,000 accurate energy differences for sp molecules and atoms. Skala achieves high, hybrid-like accuracy on a large and diverse data set of properties of main group molecules, which has no overlap with its training set. The computational cost of Skala is higher than that of the r2SCAN meta-GGA for small molecules, but about the same for systems with 1,000 or more occupied orbitals. Its cost seems to be only 10% of the cost of standard hybrids and 1% of the cost of local hybrids. Developed by a Microsoft team of density functional theorists and deep-learning experts, Skala could be the first machine-learned density functional to compete with existing functionals for wide use in computational chemistry, and a sign of things to come in that and related fields. Skala learned from big data and was taught by insightful human scientists.”

— John P. Perdew, Professor of Physics, School of Science and Engineering, Tulane University

This first milestone was achieved for a challenging property in a specific region of chemical space—atomization energies of main group molecules—for which we generated our initial large batch of high-accuracy training data. Building on this foundation, we have started to expand our training dataset to cover a broader range of general chemistry, using our scalable in-house data generation pipeline. With the first small batch of training data beyond atomization energies, we have already extended the accuracy of our model, making it competitive with the best existing XC functionals across a wider spectrum of main group chemistry. This motivates us to continue growing our high-accuracy data generation campaign, engaging with external experts such as Prof. Amir Karton, who noted, “After years of benchmarking DFT methods against experimental accuracy, this is the first time I’ve witnessed such an unprecedented leap in the accuracy–cost trade-off. It is genuinely exciting to see how the creation of our new dataset has enabled these groundbreaking results — opening up a path for transformative advances across chemical, biochemical, and materials research.”

Advancing computational chemistry together

We are excited to work closely with the global computational chemistry community to accelerate progress for all and look forward to openly releasing our first XC functional in the near future.

“Density Functional Theory (DFT) and related technologies are a core Digital Chemistry technology supporting advancements in Merck’s diverse Life Science, Healthcare and Electronics businesses. However, the limitations of traditional DFT methods, which have persisted for the last 50 years, have hindered its full potential. Microsoft Research’s innovative approach to integrating deep learning represents a substantial leap, enhancing its accuracy, robustness, and scalability. We are looking forward to exploring how this can advance Digital Chemistry workflows and unlock new possibilities for the future, aligning with our commitment to developing advanced algorithms and technologies that propel scientific innovation at Merck.”

— Jan Gerit Brandenburg – Director for Digital Chemistry at Merck

“We are entering a golden age for predictive and realistic simulations: very accurate electronic-structure calculations provide vast amounts of consistent data that can be used to train novel machine-learning architectures, delivering the holy grail of precision and computational efficiency.”

— Professor Nicola Marzari, Chair of Theory and Simulation of Materials, EPFL and PSI

We believe that our new functional can help unlock new opportunities for businesses and are eager to work together on real-world applications. Today, we are delighted to launch the DFT Research Early Access Program (DFT REAP) and welcome Flagship Pioneering as the first participant. This program is for companies and research labs to collaborate with us to accelerate innovation across many industries. To find out more about how to join this program please visit: https://aka.ms/DFT-REAP (opens in new tab)

“Microsoft’s effort to enhance the predictive power of computational chemistry reflects a bold but thoughtful step toward a simulation-first future. At Flagship, we believe that openly shared, foundational advances in science – like this leap forward in DFT accuracy – can serve as powerful enablers of innovation. These next-generation tools promise to accelerate discovery across a wide range of sectors, from therapeutics to materials science, by helping researchers navigate chemical and biological space with far greater precision and speed.”

— Junaid Bajwa, M.D., Senior Partner at Flagship Pioneering and Science Partner at Pioneering Intelligence

By making our work available to the scientific community, we hope to enable widespread testing and gather valuable feedback that will guide future improvements. For the first time, deep learning offers a clear and computationally scalable path to building an accurate, efficient, and broadly applicable model of the universal XC functional—one that could transform the computational design of molecules and materials.

Skala Paper Dataset Paper Dataset Acknowledgement

This work is the product of a highly collaborative and interdisciplinary effort led by Microsoft Research AI for Science, in partnership with colleagues from Microsoft Research Accelerator, Microsoft Quantum and the University of New England. The full author list includes Giulia Luise, Chin-Wei Huang, Thijs Vogels , Derk P. Kooi, Sebastian Ehlert, Stephanie Lanius, Klaas J. H. Giesbertz, Amir Karton, Deniz Gunceler, Megan Stanley, Wessel P. Bruinsma, Victor Garcia Satorras, Marwin Segler, Kenji Takeda, Lin Huang, Xinran Wei, José Garrido Torres, Albert Katbashev, Rodrigo Chavez Zavaleta, Bálint Máté, Sékou-Oumar Kaba, Roberto Sordillo, Yingrong Chen, David B. Williams-Young, Christopher M. Bishop, Jan Hermann, Rianne van den Berg and Paola Gori Giorgi.

Opens in a new tab

The post Breaking bonds, breaking ground: Advancing the accuracy of computational chemistry with deep learning appeared first on Microsoft Research.

Categories: Microsoft

New methods boost reasoning in small and large language models

Tue, 06/17/2025 - 18:00

Artificial intelligence is advancing across a wide range of fields, with one of the most important developments being its growing capacity for reasoning. This capability could help AI becomes a reliable partner in critical domains like scientific research and healthcare.

To support this progress, we’ve identified three primary strategies to strengthen reasoning capabilities in both small and large language models: improve architectural design to boost performance in smaller models; incorporate mathematical reasoning techniques to increase reliability; and build stronger generalization capabilities to enable reasoning across a variety of fields.

Smarter reasoning in smaller models

While language models trained on broad world knowledge hold great potential, they lack the ability to learn continuously and refine their understanding. This limitation becomes especially pronounced in smaller models, where limited capacity makes strong reasoning even harder.

The problem stems from how current language models operate. They rely on fast, pattern recognition-based responses that break down in complex scenarios. In contrast, people use deliberate, step-by-step reasoning, test different approaches, and evaluate outcomes. To address this gap, we’re building methods to enable stronger reasoning in smaller systems.

rStar-Math is a method that uses Monte Carlo Tree Search (MCTS) to simulate deeper, more methodical reasoning in smaller models. It uses a three-step, self-improving cycle:

Problem decomposition breaks down complex mathematical problems into manageable steps, creating a thorough and accurate course of reasoning.
Process preference model (PPM) trains small models to predict reward labels for each step, improving process-level supervision.
Iterative refinement applies a four-round, self-improvement cycle in which updated strategy models and PPMs guide MCTS to improve performance.

When tested on four small language models ranging from 1.5 billion to 7 billion parameters, rStar-Math achieved an average accuracy of 53% on the American Invitational Mathematics Examination (AIME)—performance that places it among the top 20% of high school competitors in the US.

Figure 1. The rStar-Math framework

Logic-RL is a reinforcement learning framework that strengthens logical reasoning through a practical system prompt and a structured reward function. By training models on logic puzzles, Logic-RL grants rewards only when both the reasoning process and the final answer meet strict formatting requirements. This prevents shortcuts and promotes analytical rigor.

Language models trained with Logic-RL demonstrate strong performance beyond logic puzzles, generalizing effectively to mathematical competition problems. On the AIME and AMC (American Mathematics Competitions) datasets, 7-billion-parameter models improved accuracy by 125% and 38%, respectively, compared with baseline models.

Building reliable mathematical reasoning

Mathematics poses a unique challenge for language models, which often struggle to meet its precision and rigor using natural language. To address this, we’re creating formal and symbolic methods to enable language models to adopt structured mathematical tools. The goal is to convert language model outputs into code based on the fundamental rules of arithmetic, like 1 + 1 = 2, allowing us to systematically verify accuracy.

LIPS (LLM-based Inequality Prover with Symbolic Reasoning) is a system that combines LLMs’ pattern recognition capabilities with symbolic reasoning. LIPS draws on the strategies participants in math competitions use in order to distinguish between tasks best suited to symbolic solvers (e.g., scaling) and those better handled by language models (e.g., rewriting). On 161 Olympiad-level problems, LIPS achieved state-of-the-art results without additional training data.

Figure 2. An overview of LIPS

However, translating natural-language math problems into precise, machine-readable formats is a challenge. Our goal is to bridge the gap between the one-pass success rate, where the top-ranked generated result is correct, and the k-pass success rate, where at least one of the top k generated results is correct.

We developed a new framework using two evaluation methods. Symbolic equivalence checks whether outputs are logically identical, while semantic consistency uses embedding similarity to detect subtle differences missed by symbolic checks.

When we evaluated this approach on the MATH and miniF2F datasets, which include problems from various math competitions, it improved accuracy by up to 1.35 times over baseline methods.

Figure 3. An overview of the auto-formalization framework

To address the shortage of high-quality training data, we developed a neuro-symbolic framework that automatically generates diverse, well-structured math problems. Symbolic solvers create the problems, while language models translate them into natural language. This approach not only broadens training resources but also supports more effective instruction and evaluation of mathematical reasoning in language models.

Figure 4. An overview of the neuro-symbolic data generation framework Boosting generalization across domains

A key indicator of advanced AI is its ability to generalize—the ability to transfer reasoning skills across different domains. We found that training language models on math data significantly improved performance in coding, science, and other areas, revealing unexpected cross-domain benefits.

This discovery motivated us to develop Chain-of-Reasoning (CoR), an approach that unifies reasoning across natural language, code, and symbolic forms. CoR lets models blend these formats using natural language to frame context, code for precise calculations, and symbolic representations for abstraction. By adjusting prompts, CoR adapts both reasoning depth and paradigm diversity to match specific problem requirements.

Tests of CoR across five math datasets showed its ability to tackle both computational and proof-based problems, demonstrating strong general mathematical problem-solving skills.

Figure 5. CoR’s reasoning process under different types of methods

Current language models often rely on domain-specific solutions, limiting their flexibility across different types of problems. To move beyond this constraint, we developed Critical Plan Step Learning (CPL), an approach focused on high-level abstract planning that teaches models to identify key knowledge, break down problems, and make strategic decisions.

The technique draws on how people solve problems, by breaking them down, identifying key information, and recalling relevant knowledge—strategies we want language models to learn.

CPL combines two key components: plan-based MCTS, which searches multi-step solution paths and constructs planning trees, and step-APO, which learns preferences for strong intermediate steps while filtering out weak ones. This combination enhances reasoning and improves generalization across tasks, moving AI systems closer to the flexible thinking that characterizes human intelligence.

Figure 6. Overview of the CPL framework Looking ahead: Next steps in AI reasoning

From building reliable math solvers to unifying reasoning approaches, researchers are redefining how language models approach complex tasks. Their work sets the stage for more capable and versatile AI systems—applicable to education, science, healthcare, and beyond. Despite these advances, hallucinations and imprecise logic continue to pose risks in critical fields like medicine and scientific research, where accuracy is essential.

These challenges are driving the team’s exploration of additional tools and frameworks to improve language model reasoning. This includes AutoVerus for automated proof generation in Rust code, SAFE for addressing data scarcity in Rust formal verification, and Alchemy, which uses symbolic mutation to improve neural theorem proving.

Together, these technologies represent important progress toward building trustworthy, high-performing reasoning models and signal a broader shift toward addressing some of AI’s current limitations.

Opens in a new tab

The post New methods boost reasoning in small and large language models appeared first on Microsoft Research.

Categories: Microsoft

Rewriting SymCrypt in Rust to modernize Microsoft’s cryptographic library

Tue, 06/10/2025 - 18:00

Outdated coding practices and memory-unsafe languages like C are putting software, including cryptographic libraries, at risk. Fortunately, memory-safe languages like Rust, along with formal verification tools, are now mature enough to be used at scale, helping prevent issues like crashes, data corruption, flawed implementation, and side-channel attacks.

To address these vulnerabilities and improve memory safety, we’re rewriting SymCrypt (opens in new tab)—Microsoft’s open-source cryptographic library—in Rust. We’re also incorporating formal verification methods. SymCrypt is used in Windows, Azure Linux, Xbox, and other platforms.

Currently, SymCrypt is primarily written in cross-platform C, with limited use of hardware-specific optimizations through intrinsics (compiler-provided low-level functions) and assembly language (direct processor instructions). It provides a wide range of algorithms, including AES-GCM, SHA, ECDSA, and the more recent post-quantum algorithms ML-KEM and ML-DSA.

Formal verification will confirm that implementations behave as intended and don’t deviate from algorithm specifications, critical for preventing attacks. We’ll also analyze compiled code to detect side-channel leaks caused by timing or hardware-level behavior.

Proving Rust program properties with Aeneas

Program verification is the process of proving that a piece of code will always satisfy a given property, no matter the input. Rust’s type system profoundly improves the prospects for program verification by providing strong ownership guarantees, by construction, using a discipline known as “aliasing xor mutability”.

For example, reasoning about C code often requires proving that two non-const pointers are live and non-overlapping, a property that can depend on external client code. In contrast, Rust’s type system guarantees this property for any two mutably borrowed references.

As a result, new tools have emerged specifically for verifying Rust code. We chose Aeneas (opens in new tab) because it helps provide a clean separation between code and proofs.

Developed by Microsoft Azure Research in partnership with Inria, the French National Institute for Research in Digital Science and Technology, Aeneas connects to proof assistants like Lean (opens in new tab), allowing us to draw on a large body of mathematical proofs—especially valuable given the mathematical nature of cryptographic algorithms—and benefit from Lean’s active user community.

Compiling Rust to C supports backward compatibility

We recognize that switching to Rust isn’t feasible for all use cases, so we’ll continue to support, extend, and certify C-based APIs as long as users need them. Users won’t see any changes, as Rust runs underneath the existing C APIs.

Some users compile our C code directly and may rely on specific toolchains or compiler features that complicate the adoption of Rust code. To address this, we will use Eurydice (opens in new tab), a Rust-to-C compiler developed by Microsoft Azure Research, to replace handwritten C code with C generated from formally verified Rust. Eurydice (opens in new tab) compiles directly from Rust’s MIR intermediate language, and the resulting C code will be checked into the SymCrypt repository alongside the original Rust source code.

As more users adopt Rust, we’ll continue supporting this compilation path for those who build SymCrypt from source code but aren’t ready to use the Rust compiler. In the long term, we hope to transition users to either use precompiled SymCrypt binaries (via C or Rust APIs), or compile from source code in Rust, at which point the Rust-to-C compilation path will no longer be needed.

Spotlight: AI-POWERED EXPERIENCE

Microsoft research copilot experience

Discover more about research at Microsoft through our AI-powered experience

Start now Opens in a new tab Timing analysis with Revizor

Even software that has been verified for functional correctness can remain vulnerable to low-level security threats, such as side channels caused by timing leaks or speculative execution. These threats operate at the hardware level and can leak private information, such as memory load addresses, branch targets, or division operands, even when the source code is provably correct.

To address this, we’re extending Revizor (opens in new tab), a tool developed by Microsoft Azure Research, to more effectively analyze SymCrypt binaries. Revizor models microarchitectural leakage and uses fuzzing techniques to systematically uncover instructions that may expose private information through known hardware-level effects.

Earlier cryptographic libraries relied on constant-time programming to avoid operations on secret data. However, recent research has shown that this alone is insufficient with today’s CPUs, where every new optimization may open a new side channel.

By analyzing binary code for specific compilers and platforms, our extended Revizor tool enables deeper scrutiny of vulnerabilities that aren’t visible in the source code.

Verified Rust implementations begin with ML-KEM

This long-term effort is in alignment with the Microsoft Secure Future Initiative and brings together experts across Microsoft, building on decades of Microsoft Research investment in program verification and security tooling.

A preliminary version of ML-KEM in Rust is now available on the preview feature/verifiedcrypto (opens in new tab) branch of the SymCrypt repository. We encourage users to try the Rust build and share feedback (opens in new tab). Looking ahead, we plan to support direct use of the same cryptographic library in Rust without requiring C bindings.

Over the coming months, we plan to rewrite, verify, and ship several algorithms in Rust as part of SymCrypt. As our investment in Rust deepens, we expect to gain new insights into how to best leverage the language for high-assurance cryptographic implementations with low-level optimizations.

As performance is key to scalability and sustainability, we’re holding new implementations to a high bar using our benchmarking tools to match or exceed existing systems.

Looking forward

This is a pivotal moment for high-assurance software. Microsoft’s investment in Rust and formal verification presents a rare opportunity to advance one of our key libraries. We’re excited to scale this work and ultimately deliver an industrial-grade, Rust-based, FIPS-certified cryptographic library.

Opens in a new tab

The post Rewriting SymCrypt in Rust to modernize Microsoft’s cryptographic library appeared first on Microsoft Research.

Categories: Microsoft

BenchmarkQED: Automated benchmarking of RAG systems

Thu, 06/05/2025 - 18:00

One of the key use cases for generative AI involves answering questions over private datasets, with retrieval-augmented generation (RAG) as the go-to framework. As new RAG techniques emerge, there’s a growing need to benchmark their performance across diverse datasets and metrics.

To meet this need, we’re introducing BenchmarkQED, a new suite of tools that automates RAG benchmarking at scale, available on GitHub (opens in new tab). It includes components for query generation, evaluation, and dataset preparation, each designed to support rigorous, reproducible testing.

BenchmarkQED complements the RAG methods in our open-source GraphRAG library, enabling users to run a GraphRAG-style evaluation across models, metrics, and datasets. GraphRAG uses a large language model (LLM) to generate and summarize entity-based knowledge graphs, producing more comprehensive and diverse answers than standard RAG for large-scale tasks.

In this post, we walk through the core components of BenchmarkQED that contribute to the overall benchmarking process. We also share some of the latest benchmark results comparing our LazyGraphRAG system to competing methods, including a vector-based RAG with a 1M-token context window, where the leading LazyGraphRAG configuration showed significant win rates across all combinations of quality metrics and query classes.

In the paper, we distinguish between local queries, where answers are found in a small number of text regions, and sometimes even a single region, and global queries, which require reasoning over large portions of or even the entire dataset.

Conventional vector-based RAG excels at local queries because the regions containing the answer to the query resemble the query itself and can be retrieved as the nearest neighbor in the vector space of text embeddings. However, it struggles with global questions, such as, “What are the main themes of the dataset?” which require understanding dataset qualities not explicitly stated in the text.

AutoQ: Automated query synthesis

This limitation motivated the development of GraphRAG a system designed to answer global queries. GraphRAG’s evaluation requirements subsequently led to the creation of AutoQ, a method for synthesizing these global queries for any dataset.

AutoQ extends this approach by generating synthetic queries across the spectrum of queries, from local to global. It defines four distinct classes based on the source and scope of the query (Figure 1, top) forming a logical progression along the spectrum (Figure 1, bottom).

Figure 1. Construction of a 2×2 design space for synthetic query generation with AutoQ, showing how the four resulting query classes map onto the local-global query spectrum.

AutoQ can be configured to generate any number and distribution of synthetic queries along these classes, enabling consistent benchmarking across datasets without requiring user customization. Figure 2 shows the synthesis process and sample queries from each class, using an AP News dataset.

Figure 2. Synthesis process and example query for each of the four AutoQ query classes.

Spotlight: Blog post

MedFuzz: Exploring the robustness of LLMs on medical challenge problems

Medfuzz tests LLMs by breaking benchmark assumptions, exposing vulnerabilities to bolster real-world accuracy.

Read more Opens in a new tab AutoE: Automated evaluation framework

Our evaluation of GraphRAG focused on analyzing key qualities of answers to global questions. The following qualities were used for the current evaluation:

Comprehensiveness: Does the answer address all relevant aspects of the question?
Diversity: Does it present varied perspectives or insights?
Empowerment: Does it help the reader understand and make informed judgments?
Relevance: Does it address what the question is specifically asking?

The AutoE component scales evaluation of these qualities using the LLM-as-a-Judge method. It presents pairs of answers to an LLM, along with the query and target metric, in counterbalanced order. The model determines whether the first answer wins, loses, or ties with the second. Over a set of queries, whether from AutoQ or elsewhere, this produces win rates between competing methods. When ground truth is available, AutoE can also score answers on correctness, completeness, and related metrics.

An illustrative evaluation is shown in Figure 3. Using a dataset of 1,397 AP News articles on health and healthcare, AutoQ generated 50 queries per class (200 total). AutoE then compared LazyGraphRAG to a competing RAG method, running six trials per query across four metrics, using GPT-4.1 as a judge.

These trial-level results were aggregated using metric-based win rates, where each trial is scored 1 for a win, 0.5 for a tie, and 0 for a loss, and then averaged to calculate the overall win rate for each RAG method.

Figure 3. Win rates of four LazyGraphRAG (LGR) configurations across methods, broken down by the AutoQ query class and averaged across AutoE’s four metrics: comprehensiveness, diversity, empowerment, and relevance. LazyGraphRAG outperforms comparison conditions where the bar is above 50%.

The four LazyGraphRAG conditions (LGR_b200_c200, LGR_b50_c200, LGR_b50_c600, LGR_b200_c200_mini) differ by query budget (b50, b200) and chunk size (c200, c600). All used GPT-4o mini for relevance tests and GPT-4o for query expansion (to five subqueries) and answer generation, except for LGR_b200_c200_mini, which used GPT-4o mini throughout.

Comparison systems were GraphRAG (Local, Global, and Drift Search), Vector RAG with 8k- and 120k-token windows, and three published methods: LightRAG (opens in new tab), RAPTOR (opens in new tab), and TREX (opens in new tab). All methods were limited to the same 8k tokens for answer generation. GraphRAG Global Search used level 2 of the community hierarchy.

LazyGraphRAG outperformed every comparison condition using the same generative model (GPT-4o), winning all 96 comparisons, with all but one reaching statistical significance. The best overall performance came from the larger budget, smaller chunk size configuration (LGR_b200_c200). For DataLocal queries, the smaller budget (LGR_b50_c200) performed slightly better, likely because fewer chunks were relevant. For ActivityLocal queries, the larger chunk size (LGR_b50_c600) had a slight edge, likely because longer chunks provide a more coherent context.

Competing methods performed relatively better on the query classes for which they were designed: GraphRAG Global for global queries, Vector RAG for local queries, and GraphRAG Drift Search, which combines both strategies, posed the strongest challenge overall.

Increasing Vector RAG’s context window from 8k to 120k tokens did not improve its performance compared to LazyGraphRAG. This raised the question of how LazyGraphRAG would perform against Vector RAG with 1-million token context window containing most of the dataset.

Figure 4 shows the follow-up experiment comparing LazyGraphRAG to Vector RAG using GPT-4.1 that enabled this comparison. Even against the 1M-token window, LazyGraphRAG achieved higher win rates across all comparisons, failing to reach significance only for the relevance of answers to DataLocal queries. These queries tend to benefit most from Vector RAG’s ranking of directly relevant chunks, making it hard for LazyGraphRAG to generate answers that have greater relevance to the query, even though these answers may be dramatically more comprehensive, diverse, and empowering overall.

Figure 4. Win rates of LazyGraphRAG (LGR) over Vector RAG across different context window sizes, broken down by the four AutoQ query classes and four AutoE metrics: comprehensiveness, diversity, empowerment, and relevance. Bars above 50% indicate that LazyGraphRAG outperformed the comparison condition. AutoD: Automated data sampling and summarization

Text datasets have an underlying topical structure, but the depth, breadth, and connectivity of that structure can vary widely. This variability makes it difficult to evaluate RAG systems consistently, as results may reflect the idiosyncrasies of the dataset rather than the system’s general capabilities.

The AutoD component addresses this by sampling datasets to meet a target specification, defined by the number of topic clusters (breadth) and the number of samples per cluster (depth). This creates consistency across datasets, enabling more meaningful comparisons, as structurally aligned datasets lead to comparable AutoQ queries, which in turn support consistent AutoE evaluations.

AutoD also includes tools for summarizing input or output datasets in a way that reflects their topical coverage. These summaries play an important role in the AutoQ query synthesis process, but they can also be used more broadly, such as in prompts where context space is limited.

Supporting the community with open data and tools

Since the release of the GraphRAG paper, we’ve received many requests to share the dataset of the Behind the Tech (opens in new tab) podcast transcripts we used in our evaluation. An updated version of this dataset is now available in the BenchmarkQED repository (opens in new tab), alongside the AP News dataset containing 1,397 health-related articles, licensed for open release.

We hope these datasets, together with the BenchmarkQED tools (opens in new tab), help accelerate benchmark-driven development of RAG systems and AI question-answering. We invite the community to try them on GitHub (opens in new tab).

Opens in a new tab

The post BenchmarkQED: Automated benchmarking of RAG systems appeared first on Microsoft Research.

Categories: Microsoft

FrodoKEM: A conservative quantum-safe cryptographic algorithm

Tue, 05/27/2025 - 18:00

In this post, we describe FrodoKEM, a key encapsulation protocol that offers a simple design and provides strong security guarantees even in a future with powerful quantum computers.

The quantum threat to cryptography

For decades, modern cryptography has relied on mathematical problems that are practically impossible for classical computers to solve without a secret key. Cryptosystems like RSA, Diffie-Hellman key-exchange, and elliptic curve-based schemes—which rely on the hardness of the integer factorization and (elliptic curve) discrete logarithm problems—secure communications on the internet, banking transactions, and even national security systems. However, the emergence of quantum computing poses a significant threat to these cryptographic schemes.

Quantum computers leverage the principles of quantum mechanics to perform certain calculations exponentially faster than classical computers. Their ability to solve complex problems, such as simulating molecular interactions, optimizing large-scale systems, and accelerating machine learning, is expected to have profound and beneficial implications for fields ranging from chemistry and material science to artificial intelligence.

At the same time, quantum computing is poised to disrupt cryptography. In particular, Shor’s algorithm, a quantum algorithm developed in 1994, can efficiently factor large numbers and compute discrete logarithms—the very problems that underpin the security of RSA, Diffie-Hellman, and elliptic curve cryptography. This means that once large-scale, fault-tolerant quantum computers become available, public-key protocols based on RSA, ECC, and Diffie-Hellman will become insecure, breaking a sizable portion of the cryptographic backbone of today’s digital world. Recent advances in quantum computing, such as Microsoft’s Majorana 1 (opens in new tab), the first quantum processor powered by topological qubits, represent major steps toward practical quantum computing and underscore the urgency of transitioning to quantum-resistant cryptographic systems.

To address this looming security crisis, cryptographers and government agencies have been working on post-quantum cryptography (PQC)—new cryptographic algorithms that can resist attacks from both classical and quantum computers.

The NIST Post-Quantum Cryptography Standardization effort

In 2017, the U.S. National Institute of Standards and Technology (NIST) launched the Post-Quantum Cryptography Standardization project (opens in new tab) to evaluate and select cryptographic algorithms capable of withstanding quantum attacks. As part of this initiative, NIST sought proposals for two types of cryptographic primitives: key encapsulation mechanisms (KEMs)—which enable two parties to securely derive a shared key to establish an encrypted connection, similar to traditional key exchange schemes—and digital signature schemes.

This initiative attracted submissions from cryptographers worldwide, and after multiple evaluation rounds, NIST selected CRYSTALS-Kyber, a KEM based on structured lattices, and standardized it as ML-KEM (opens in new tab). Additionally, NIST selected three digital signature schemes: CRYSTALS-Dilithium, now called ML-DSA; SPHINCS+, now called SLH-DSA; and Falcon, now called FN-DSA.

While ML-KEM provides great overall security and efficiency, some governments and cryptographic researchers advocate for the inclusion and standardization of alternative algorithms that minimize reliance on algebraic structure. Reducing algebraic structure might prevent potential vulnerabilities and, hence, can be considered a more conservative design choice. One such algorithm is FrodoKEM.

International standardization of post-quantum cryptography

Beyond NIST, other international standardization bodies have been actively working on quantum-resistant cryptographic solutions. The International Organization for Standardization (ISO) is leading a global effort to standardize additional PQC algorithms. Notably, European government agencies—including Germany’s BSI (opens in new tab), the Netherlands’ NLNCSA and AIVD (opens in new tab), and France’s ANSSI (opens in new tab)—have shown strong support for FrodoKEM, recognizing it as a conservative alternative to structured lattice-based schemes.

As a result, FrodoKEM is undergoing standardization at ISO. Additionally, ISO is standardizing ML-KEM and a conservative code-based KEM called Classic McEliece. These three algorithms are planned for inclusion in ISO/IEC 18033-2:2006 as Amendment 2 (opens in new tab).

What is FrodoKEM?

FrodoKEM is a key encapsulation mechanism (KEM) based on the Learning with Errors (LWE) problem, a cornerstone of lattice-based cryptography. Unlike structured lattice-based schemes such as ML-KEM, FrodoKEM is built on generic, unstructured lattices, i.e., it is based on the plain LWE problem.

Why unstructured lattices?

Structured lattice-based schemes introduce additional algebraic properties that could potentially be exploited in future cryptanalytic attacks. By using unstructured lattices, FrodoKEM eliminates these concerns, making it a safer choice in the long run, albeit at the cost of larger key sizes and lower efficiency.

It is important to emphasize that no particular cryptanalytic weaknesses are currently known for recommended parameterizations of structured lattice schemes in comparison to plain LWE. However, our current understanding of the security of these schemes could potentially change in the future with cryptanalytic advances.

Lattices and the Learning with Errors (LWE) problem

Lattice-based cryptography relies on the mathematical structure of lattices, which are regular arrangements of points in multidimensional space. A lattice is defined as the set of all integer linear combinations of a set of basis vectors. The difficulty of certain computational problems on lattices, such as the Shortest Vector Problem (SVP) and the Learning with Errors (LWE) problem, forms the basis of lattice-based schemes.

The Learning with Errors (LWE) problem

The LWE problem is a fundamental hard problem in lattice-based cryptography. It involves solving a system of linear equations where some small random error has been added to each equation, making it extremely difficult to recover the original secret values. This added error ensures that the problem remains computationally infeasible, even for quantum computers. Figure 1 below illustrates the LWE problem, specifically, the search version of the problem.

As can be seen in Figure 1, for the setup of the problem we need a dimension \(n\) that defines the size of matrices, a modulus \(q\) that defines the value range of the matrix coefficients, and a certain error distribution \(\chi\) from which we sample \(\textit{“small”}\) matrices. We sample two matrices from \(\chi\), a small matrix \(\text{s}\) and an error matrix \(\text{e}\) (for simplicity in the explanation, we assume that both have only one column); sample an \(n \times n\) matrix \(\text{A}\) uniformly at random; and compute \(\text{b} = \text{A} \times \text{s} + \text{e}\). In the illustration, each matrix coefficient is represented by a colored square, and the “legend of coefficients” gives an idea of the size of the respective coefficients, e.g., orange squares represent the small coefficients of matrix \(\text{s}\) (small relative to the modulus \(q\)). Finally, given \(\text{A}\) and \(\text{b}\), the search LWE problem consists in finding \(\text{s}\). This problem is believed to be hard for suitably chosen parameters (e.g., for dimension \(n\) sufficiently large) and is used at the core of FrodoKEM.

In comparison, the LWE variant used in ML-KEM—called Module-LWE (M-LWE)—has additional symmetries, adding mathematical structure that helps improve efficiency. In a setting similar to that of the search LWE problem above, the matrix \(\text{A}\) can be represented by just a single row of coefficients.

FIGURE 1: Visualization of the (search) LWE problem.

LWE is conjectured to be quantum-resistant, and FrodoKEM’s security is directly tied to its hardness. In other words, cryptanalysts and quantum researchers have not been able to devise an efficient quantum algorithm capable of solving the LWE problem and, hence, FrodoKEM. In cryptography, absolute security can never be guaranteed; instead, confidence in a problem’s hardness comes from extensive scrutiny and its resilience against attacks over time.

How FrodoKEM Works

FrodoKEM follows the standard paradigm of a KEM, which consists of three main operations—key generation, encapsulation, and decapsulation—performed interactively between a sender and a recipient with the goal of establishing a shared secret key:

Key generation (KeyGen), computed by the recipient
- Generates a public key and a secret key.
- The public key is sent to the sender, while the private key remains secret.
Encapsulation (Encapsulate), computed by the sender
- Generates a random session key.
- Encrypts the session key using the recipient’s public key to produce a ciphertext.
- Produces a shared key using the session key and the ciphertext.
- The ciphertext is sent to the recipient.
Decapsulation (Decapsulate), computed by the recipient
- Decrypts the ciphertext using their secret key to recover the original session key.
- Reproduces the shared key using the decrypted session key and the ciphertext.

The shared key generated by the sender and reconstructed by the recipient can then be used to establish secure symmetric-key encryption for further communication between the two parties.

Figure 2 below shows a simplified view of the FrodoKEM protocol. As highlighted in red, FrodoKEM uses at its core LWE operations of the form “\(\text{b} = \text{A} \times \text{s} + \text{e}\)”, which are directly applied within the KEM paradigm.

FIGURE 2: Simplified overview of FrodoKEM. Performance: Strong security has a cost

Not relying on additional algebraic structure certainly comes at a cost for FrodoKEM in the form of increased protocol runtime and bandwidth. The table below compares the performance and key sizes corresponding to the FrodoKEM level 1 parameter set (variant called “FrodoKEM-640-AES”) and the respective parameter set of ML-KEM (variant called “ML-KEM-512”). These parameter sets are intended to match or exceed the brute force security of AES-128. As can be seen, the difference in speed and key sizes between FrodoKEM and ML-KEM is more than an order of magnitude. Nevertheless, the runtime of the FrodoKEM protocol remains reasonable for most applications. For example, on our benchmarking platform clocked at 3.2GHz, the measured runtimes are 0.97 ms, 1.9 ms, and 3.2 ms for security levels 1, 2, and 3, respectively.

For security-sensitive applications, a more relevant comparison is with Classic McEliece, a post-quantum code-based scheme also considered for standardization. In this case, FrodoKEM offers several efficiency advantages. Classic McEliece’s public keys are significantly larger—well over an order of magnitude greater than FrodoKEM’s—and its key generation is substantially more computationally expensive. Nonetheless, Classic McEliece provides an advantage in certain static key-exchange scenarios, where its high key generation cost can be amortized across multiple key encapsulation executions.

TABLE 1: Comparison of key sizes and performance on an x86-64 processor for NIST level 1 parameter sets. A holistic design made with security in mind

FrodoKEM’s design principles support security beyond its reliance on generic, unstructured lattices to minimize the attack surface of potential future cryptanalytic threats. Its parameters have been carefully chosen with additional security margins to withstand advancements in known attacks. Furthermore, FrodoKEM is designed with simplicity in mind—its internal operations are based on straightforward matrix-vector arithmetic using integer coefficients reduced modulo a power of two. These design decisions facilitate simple, compact and secure implementations that are also easier to maintain and to protect against side-channel attacks.

Conclusion

After years of research and analysis, the next generation of post-quantum cryptographic algorithms has arrived. NIST has chosen strong PQC protocols that we believe will serve Microsoft and its customers well in many applications. For security-sensitive applications, FrodoKEM offers a secure yet practical approach for post-quantum cryptography. While its reliance on unstructured lattices results in larger key sizes and higher computational overhead compared to structured lattice-based alternatives, it provides strong security assurances against potential future attacks. Given the ongoing standardization efforts and its endorsement by multiple governmental agencies, FrodoKEM is well-positioned as a viable alternative for organizations seeking long-term cryptographic resilience in a post-quantum world.

Magentic-UI, an experimental human-centered web agent

Mon, 05/19/2025 - 18:00

Modern productivity is rooted in the web—from searching for information and filling in forms to navigating dashboards. Yet, many of these tasks remain manual and repetitive. Today, we are introducing Magentic-UI, a new open-source research prototype of a human-centered agent that is meant to help researchers study open questions on human-in-the-loop approaches and oversight mechanisms for AI agents. This prototype collaborates with users on web-based tasks and operates in real time over a web browser. Unlike other computer use agents that aim for full autonomy, Magentic-UI offers a transparent and controllable experience for tasks that are action-oriented and require activities beyond just performing simple web searches.

Magentic-UI builds on Magentic-One (opens in new tab), a powerful multi-agent team we released last year, and is powered by AutoGen (opens in new tab), our leading agent framework. It is available under MIT license at https://github.com/microsoft/Magentic-UI (opens in new tab) and on Azure AI Foundry Labs (opens in new tab), the hub where developers, startups, and enterprises can explore groundbreaking innovations from Microsoft Research. Magentic-UI is integrated with Azure AI Foundry models and agents. Learn more about how to integrate Azure AI agents into the Magentic-UI multi-agent architecture by following this code sample (opens in new tab).

Magentic-UI can perform tasks that require browsing the web, writing and executing Python and shell code, and understanding files. Its key features include:

Collaborative planning with users (co-planning). Magentic-UI allows users to directly modify its plan through a plan editor or by providing textual feedback before Magentic-UI executes any actions.
Collaborative execution with users (co-tasking). Users can pause the system and give feedback in natural language or demonstrate it by directly taking control of the browser.
Safety with human-in-the-loop (action guards). Magentic-UI seeks user approval before executing potentially irreversible actions, and the user can specify how often Magentic-UI needs approvals. Furthermore, Magentic-UI is sandboxed for the safe operation of tools such as browsers and code executors.
Safety with human-in-the-loop. Magentic-UI seeks user approval before executing potentially irreversible actions, and the user can specify how often Magentic-UI needs approvals. Furthermore, Magentic-UI is sandboxed for the safe operation of tools such as browsers and code executors.
Learning from experience (plan learning). Magentic-UI can learn and save plans from previous interactions to improve task completion for future tasks.

Figure 1: Screenshot of Magentic-UI actively performing a task. The left side of the screen shows Magentic-UI stating its plan and progress to accomplish a user’s complex goal. The right side shows the browser Magentic-UI is controlling. How is Magentic-UI human-centered?

While many web agents promise full autonomy, in practice users can be left unsure of what the agent can do, what it is currently doing, and whether they have enough control to intervene when something goes wrong or doesn’t occur as expected. By contrast, Magentic-UI considers user needs at every stage of interaction. We followed a human-centered design methodology in building Magentic-UI by prototyping and obtaining feedback from pilot users during its design.

Figure 2: Co-planning – Users can collaboratively plan with Magentic-UI.

For example, after a person specifies and before Magentic-UI even begins to execute, it creates a clear step-by-step plan that outlines what it would do to accomplish the task. People can collaborate with Magentic-UI to modify this plan and then give final approval for Magentic-UI to begin execution. This is crucial as users may have expectations of how the task should be completed; communicating that information could significantly improve agent performance. We call this feature co-planning.

During execution, Magentic-UI shows in real time what specific actions it’s about to take. For example, whether it is about to click on a button or input a search query. It also shows in real time what it observed on the web pages it is visiting. Users can take control of the action at any point in time and give control back to the agent. We call this feature co-tasking.

Figure 3: Co-tasking – Magentic-UI provides real-time updates about what it is about to do and what it already did, allowing users to collaboratively complete tasks with the agent. Figure 4: Action-guards – Magentic-UI will ask users for permission before executing actions that it deems consequential or important.

Additionally, Magentic-UI asks for user permission before performing actions that are deemed irreversible, such as closing a tab or clicking a button with side effects. We call these “action guards”. The user can also configure Magentic-UI’s action guards to always ask for permission before performing any action. If the user deems an action risky (e.g., paying for an item), they can reject it.

Figure 5: Plan learning – Once a task is successfully completed, users can request Magentic-UI to learn a step-by-step plan from this experience.

After execution, the user can ask Magentic-UI to reflect on the conversation and infer and save a step-by-step plan for future similar tasks. Users can view and modify saved plans for Magentic-UI to reuse in the future in a saved-plans gallery. In a future session, users can launch Magentic-UI with the saved plan to either execute the same task again, like checking the price of a specific flight, or use the plan as a guide to help complete similar tasks, such as checking the price of a different type of flight.

Combined, these four features—co-planning, co-tasking, action guards, and plan learning—enable users to collaborate effectively with Magentic-UI.

Architecture

Magentic-UI’s underlying system is a team of specialized agents adapted from AutoGen’s Magentic-One system. The agents work together to create a modular system:

Orchestrator is the lead agent, powered by a large language model (LLM), that performs co-planning with the user, decides when to ask the user for feedback, and delegates sub-tasks to the remaining agents to complete.
WebSurfer is an LLM agent equipped with a web browser that it can control. Given a request by the Orchestrator, it can click, type, scroll, and visit pages in multiple rounds to complete the request from the Orchestrator.
Coder is an LLM agent equipped with a Docker code-execution container. It can write and execute Python and shell commands and provide a response back to the Orchestrator.
FileSurfer is an LLM agent equipped with a Docker code-execution container and file-conversion tools from the MarkItDown (opens in new tab) package. It can locate files in the directory controlled by Magentic-UI, convert files to markdown, and answer questions about them.

Figure 6: System architecture diagram of Magentic-UI

To interact with Magentic-UI, users can enter a text message and attach images. In response, Magentic-UI creates a natural-language step-by-step plan with which users can interact through a plan-editing interface. Users can add, delete, edit, regenerate steps, and write follow-up messages to iterate on the plan. While the user editing the plan adds an upfront cost to the interaction, it can potentially save a significant amount of time in the agent executing the plan and increase its chance at success.

The plan is stored inside the Orchestrator and is used to execute the task. For each step of the plan, the Orchestrator determines which of the agents (WebSurfer, Coder, FileSurfer) or the user should complete the step. Once that decision is made, the Orchestrator sends a request to one of the agents or the user and waits for a response. After the response is received, the Orchestrator decides whether that step is complete. If it is, the Orchestrator moves on to the following step.

Once all steps are completed, the Orchestrator generates a final answer that is presented to the user. If, while executing any of the steps, the Orchestrator decides that the plan is inadequate (for example, because a certain website is unreachable), the Orchestrator can replan with user permission and start executing a new plan.

All intermediate progress steps are clearly displayed to the user. Furthermore, the user can pause the execution of the plan and send additional requests or feedback. The user can also configure through the interface whether agent actions (e.g., clicking a button) require approval.

Evaluating Magentic-UI

Magentic-UI innovates through its ability to integrate human feedback in its planning and execution of tasks. We performed a preliminary automated evaluation to showcase this ability on the GAIA benchmark (opens in new tab) for agents with a user-simulation experiment.

Evaluation with simulated users Figure 7: Comparison on the GAIA validation set of the accuracy of Magentic-One, Magentic-UI in autonomous mode, Magentic-UI with a simulated user powered by a smarter LLM than the MAGUI agents, Magentic-UI with a simulated user that has a\access to side information about the tasks, and human performance. This shows that human-in-the-loop can improve the accuracy of autonomous agents, bridging the gap to human performance at a fraction of the cost.

GAIA is a benchmark for general AI assistants, with multimodal question-answer pairs that are challenging, requiring the agents to navigate the web, process files, and execute code. The traditional evaluation setup with GAIA assumes the system will autonomously complete the task and return an answer, which is compared to the ground-truth answer.

To evaluate the human-in-the-loop capabilities of Magentic-UI, we transform GAIA into an interactive benchmark by introducing the concept of a simulated user. Simulated users provide value in two ways: by having specific expertise that the agent may not possess, and by providing guidance on how the task should be performed.

We experiment with two types of simulated users to show the value of human-in-the-loop: (1) a simulated user that is more intelligent than the Magentic-UI agents and (2) a simulated user with the same intelligence as Magentic-UI agents but with additional information about the task. During co-planning, Magentic-UI takes feedback from this simulated user to improve its plan. During co-tasking, Magentic-UI can ask the (simulated) user for help when it gets stuck. Finally, if Magentic-UI does not provide a final answer, then the simulated user provides an answer instead. These experiments reflect a lower bound on the value of human feedback, since real users can step in at any time and offer any kind of input—not just when the system explicitly asks for help.

The simulated user is an LLM without any tools, instructed to interact with Magentic-UI the way we expect a human would act. The first type of simulated user relies on OpenAI’s o4-mini, more performant at many tasks than the one powering the Magentic-UI agents (GPT-4o). For the second type of simulated user, we use GPT-4o for both the simulated user and the rest of the agents, but the user has access to side information about each task. Each task in GAIA has side information, which includes a human-written plan to solve the task. While this plan is not used as input in the traditional benchmark, in our interactive setting we provide this information to the second type of simulated user,which is powered by an LLM so that it can mimic a knowledgeable user. Importantly, we tuned our simulated user so as not to reveal the ground-truth answer directly as the answer is usually found inside the human written plan. Instead, it is prompted to guide Magentic-UI indirectly. We found that this tuning prevented the simulated user from inadvertently revealing the answer in all but 6% of tasks when Magentic-UI provides a final answer.

On the validation subset of GAIA (162 tasks), we show the results of Magentic-One operating in autonomous mode, Magentic-UI operating in autonomous mode (without the simulated user), Magentic-UI with simulated user (1) (smarter model), Magentic-UI with simulated user (2) (side-information), and human performance. We first note that Magentic-UI in autonomous mode is within a margin of error of the performance of Magentic-One. Note that the same LLM (GPT-4o) is used for Magentic-UI and Magentic-One.

Magentic-UI with the simulated user that has access to side information improves the accuracy of autonomous Magentic-UI by 71%, from a 30.3% task-completion rate to a 51.9% task-completion rate. Moreover, Magentic-UI only asks for help from the simulated user in 10% of tasks and relies on the simulated user for the final answer in 18% of tasks. And in those tasks where it does ask for help, it asks for help on average 1.1 times. Magentic-UI with the simulated user powered by a smarter model improves to 42.6% where Magentic-UI asks for help in only 4.3% of tasks, asking for help an average of 1.7 times in those tasks. This demonstrates the potential of even lightweight human feedback for improving performance (e.g., task completion) over autonomous agents working alone, especially at a fraction of the cost compared to people completing tasks entirely manually.

Learning and reusing plans

As described above, once Magentic-UI completes a task, users have the option for Magentic-UI to learn a plan based on the execution of the task. These plans are saved in a plan gallery, which users and Magentic-UI can access in the future.

The user can select a plan from the plan gallery, which is displayed by clicking on the Saved Plans button. Alternatively, as a user enters a task that closely matches a previous task, the saved plan will be displayed even before the user is done typing. If no identical task is found, Magentic-UI can use AutoGen’s Task-Centric Memory (opens in new tab) to retrieve plans for any similar tasks. Our preliminary evaluations show that this retrieval is highly accurate, and when recalling a saved plan can be around 3x faster than generating a new plan. Once a plan is recalled or generated, the user can always accept it, modify it, or ask Magentic-UI to modify it for the specific task at hand.

Safety and control

Magentic-UI can surf the live internet and execute code. With such capabilities, we need to ensure that Magentic-UI acts in a safe and secure manner. The following features, design decisions, and evaluations were made to ensure this:

Allow-list: Users can set a list of websites that Magentic-UI is allowed to access. If Magentic-UI needs to access a website outside of the allow-list, users must explicitly approve it through the interface
Anytime interruptions: At any point of Magentic-UI completing the task, the user can interrupt Magentic-UI and stop any pending code execution or web browsing.
Docker sandboxing: Magentic-UI controls a browser that is launched inside a Docker container with no credentials, which avoids risks with logged-in accounts and credentials. Moreover, any code execution is also performed inside a separate Docker container to avoid affecting the host environment in which Magentic-UI is running. This is illustrated in the system architecture of Magentic-UI (Figure 3).
Detection and approval of irreversible agent actions: Users can configure an action-approval policy (action guards) to determine which actions Magentic-UI can perform without user approval. In the extreme, users can specify that any action (e.g., any button click) needs explicit user approval. Users must press an “Accept” or “Deny” button for each action.

In addition to the above design decisions, we performed a red-team evaluation of Magentic-UI on a set of internal scenarios, which we developed to challenge the security and safety of Magentic-UI. Such scenarios include cross-site prompt injection attacks, where web pages contain malicious instructions distinct from the user’s original intent (e.g., to execute risky code, access sensitive files, or perform actions on other websites). It also contains scenarios comparable to phishing, which try to trick Magentic-UI into entering sensitive information, or granting permissions on impostor sites (e.g., a synthetic website that asks Magentic-UI to log in and enter Google credentials to read an article). In our preliminary evaluations, we found that Magentic-UI either refuses to complete the requests, stops to ask the user, or, as a final safety measure, is eventually unable to complete the request due to Docker sandboxing. We have found that this layered approach is effective for thwarting these attacks.

We have also released transparency notes, which can be found at: https://github.com/microsoft/magentic-ui/blob/main/TRANSPARENCY_NOTE.md (opens in new tab)

Open research questions

Magentic-UI provides a tool for researchers to study critical questions in agentic systems and particularly on human-agent interaction. In a previous report (opens in new tab), we outlined 12 questions for human-agent communication, and Magentic-UI provides a vehicle to study these questions in a realistic setting. A key question among these is how we enable humans to efficiently intervene and provide feedback to the agent while executing a task. Humans should not have to constantly watch the agent. Ideally, the agent should know when to reach out for help and provide the necessary context for the human to assist it. A second question is about safety. As agents interact with the live web, they may become prone to attacks from malicious actors. We need to study what necessary safeguards are needed to protect the human from side effects without adding a heavy burden on the human to verify every agent action. There are also many other questions surrounding security, personalization, and learning that Magentic-UI can help with studying.

Conclusion

Magentic-UI is an open-source agent prototype that works with people to complete complex tasks that require multi-step planning and browser use. As agentic systems expand in the scope of tasks they can complete, Magentic-UI’s design enables better transparency into agent actions and enables human control to ensure safety and reliability. Moreover, by facilitating human intervention, we can improve performance while still reducing human cost in completing tasks on aggregate. Today we have released the first version of Magentic-UI. Looking ahead, we plan to continue developing it in the open with the goal of improving its capabilities and answering research questions on human-agent collaboration. We invite the research community to extend and reuse Magentic-UI for their scientific explorations and domains.

Opens in a new tab

The post Magentic-UI, an experimental human-centered web agent appeared first on Microsoft Research.

Categories: Microsoft

Predicting and explaining AI model performance: A new approach to evaluation

Mon, 05/12/2025 - 18:00

With support from the Accelerating Foundation Models Research (AFMR) grant program, a team of researchers from Microsoft and collaborating institutions has developed an approach to evaluate AI models that predicts how they will perform on unfamiliar tasks and explain why, something current benchmarks struggle to do.

In the paper, “General Scales Unlock AI Evaluation with Explanatory and Predictive Power,” they introduce a methodology that goes beyond measuring overall accuracy. It assesses the knowledge and cognitive abilities a task requires and evaluates them against the model’s capabilities.

ADeLe: An ability-based approach to task evaluation

The framework uses ADeLe (annotated-demand-levels), a technique that assesses how demanding a task is for an AI model by applying measurement scales for 18 types of cognitive and knowledge-based abilities. This difficulty rating is based on a detailed rubric, originally developed for human tasks and shown to work reliably when applied by AI models.

By comparing what a task requires with what a model can do, ADeLe generates an ability profile that not only predicts performance but also explains why a model is likely to succeed or fail—linking outcomes to specific strengths or limitations.

The 18 scales reflect core cognitive abilities (e.g., attention, reasoning), knowledge areas (e.g., natural or social sciences), and other task-related factors (e.g., prevalence of the task on the internet). Each task is rated from 0 to 5 based on how much it draws on a given ability. For example, a simple math question might score 1 on formal knowledge, while one requiring advanced expertise could score 5. Figure 1 illustrates how the full process works—from rating task requirements to generating ability profiles.

Figure 1. Top: For each AI model, (1) run the new system on the ADeLe benchmark, and (2) extract its ability profile. Bottom: For each new task or benchmark, (A) apply 18 rubrics and (B) get demand histograms and profiles that explain what abilities the tasks require. Optionally, predict performance on the new tasks for any system based on the demand and ability profiles, or past performance data, of the systems.

To develop this system, the team analyzed 16,000 examples spanning 63 tasks drawn from 20 AI benchmarks, creating a unified measurement approach that works across a wide range of tasks. The paper details how ratings across 18 general scales explain model success or failure and predict performance on new tasks in both familiar and unfamiliar settings.

Evaluation results

Using ADeLe, the team evaluated 20 popular AI benchmarks and uncovered three key findings: 1) Current AI benchmarks have measurement limitations; 2) AI models show distinct patterns of strengths and weaknesses across different capabilities; and 3) ADeLe provides accurate predictions of whether AI systems will succeed or fail on a new task.

1. Revealing hidden flaws in AI testing methods

Many popular AI tests either don’t measure what they claim or only cover a limited range of difficulty levels. For example, the Civil Service Examination benchmark is meant to test logical reasoning, but it also requires other abilities, like specialized knowledge and metacognition. Similarly, TimeQA, designed to test temporal reasoning, only includes medium-difficulty questions—missing both simple and complex challenges.

2. Creating detailed AI ability profiles

Using the 0–5 rating for each ability, the team created comprehensive ability profiles of 15 LLMs. For each of the 18 abilities measured, they plotted “subject characteristic curves” to show how a model’s success rate changes with task difficulty.

They then calculated a score for each ability—the difficulty level at which a model has a 50% chance of success—and used these results to generate radial plots showing each model’s strengths and weaknesses across the different scales and levels, illustrated in Figure 2.

Figure 2. Ability profiles for the 15 LLMs evaluated.

This analysis revealed the following:

When measured against human performance, AI systems show different strengths and weaknesses across the 18 ability scales.

Newer LLMs generally outperform older ones, though not consistently across all abilities.

Knowledge-related performance depends heavily on model size and training methods.

Reasoning models show clear gains over non-reasoning models in logical thinking, learning and abstraction, and social capabilities, such as inferring the mental states of their users.

Increasing the size of general-purpose models after a given threshold only leads to small performance gains.

3. Predicting AI success and failure

In addition to evaluation, the team created a practical prediction system based on demand-level measurements that forecasts whether a model will succeed on specific tasks, even unfamiliar ones.

The system achieved approximately 88% accuracy in predicting the performance of popular models like GPT-4o and LLaMA-3.1-405B, outperforming traditional methods. This makes it possible to anticipate potential failures before deployment, adding the important step of reliability assessment for AI models.

Looking ahead

ADeLe can be extended to multimodal and embodied AI systems, and it has the potential to serve as a standardized framework for AI research, policymaking, and security auditing.

This technology marks a major step toward a science of AI evaluation, one that offers both clear explanations of system behavior and reliable predictions about performance. It aligns with the vision laid out in a previous Microsoft position paper on the promise of applying psychometrics to AI evaluation and a recent Societal AI white paper emphasizing the importance of AI evaluation.

As general-purpose AI advances faster than traditional evaluation methods, this work lays a timely foundation for making AI assessments more rigorous, transparent, and ready for real-world deployment. The research team is working toward building a collaborative community to strengthen and expand this emerging field.

Opens in a new tab

The post Predicting and explaining AI model performance: A new approach to evaluation appeared first on Microsoft Research.

Categories: Microsoft

Research Focus: Week of May 7, 2025

Thu, 05/08/2025 - 01:25

In this issue:

New research on compound AI systems and causal verification of the Confidential Consortium Framework; release of Phi-4-reasoning; enriching tabular data with semantic structure, and more.

NEW RESEARCH Towards Resource-Efficient Compound AI Systems

This research introduces Murakkab, a prototype system built on a declarative workflow that reimagines how compound AI systems are built and managed to significantly improve resource efficiency. Compound AI systems integrate multiple interacting components like language models, retrieval engines, and external tools. They are essential for addressing complex AI tasks. However, current implementations could benefit from greater efficiencies in resource utilization, with improvements to tight coupling between application logic and execution details, better connections between orchestration and resource management layers, and bridging gaps between efficiency and quality.

Murakkab addresses critical inefficiencies in current AI architectures and offers a new approach that unifies workflow orchestration and cluster resource management for better performance and sustainability. In preliminary evaluations, it demonstrates speedups up to ∼ 3.4× in workflow completion times while delivering ∼ 4.5× higher energy efficiency, showing promise in optimizing resources and advancing AI system design.

NEW RESEARCH Smart Casual Verification of the Confidential Consortium Framework

This work presents a new, pragmatic verification technique that improves the trustworthiness of distributed systems like the Confidential Consortium Framework (CCF) and proves its effectiveness by catching critical bugs before deployment. Smart casual verification is a novel hybrid verification approach to validating CCF, an open-source platform for developing trustworthy and reliable cloud applications which underpins Microsoft’s Azure Confidential Ledger service.

The researchers apply smart casual verification to validate the correctness of CCF’s novel distributed protocols, focusing on its unique distributed consensus protocol and its custom client consistency model. This hybrid approach combines the rigor of formal specification and model checking with the pragmatism of automated testing, specifically binding the formal specification in TLA+ to the C++ implementation. While traditional formal methods are often one-off efforts by domain experts, the researchers have integrated smart casual verification into CCF’s continuous integration pipeline, allowing contributors to continuously validate CCF as it evolves.

NEW RESEARCH Phi-4-reasoning Technical Report

This report introduces Phi-4-reasoning (opens in new tab), a 14-billion parameter model optimized for complex reasoning tasks. It is trained via supervised fine-tuning of Phi-4 using a carefully curated dataset of high-quality prompts and reasoning demonstrations generated by o3-mini. These prompts span diverse domains—including math, science, coding, and spatial reasoning—and are selected to challenge the base model near its capability boundaries.

Building on recent findings that reinforcement learning (RL) can further improve smaller models, the team developed Phi-4-reasoning-plus, which incorporates an additional outcome-based RL phase using verifiable math problems. This enhances the model’s ability to generate longer, more effective reasoning chains.

Despite its smaller size, the Phi-4-reasoning family outperforms significantly larger open-weight models such as DeepSeekR1-Distill-Llama-70B and approaches the performance of full-scale frontier models like DeepSeek R1. It excels in tasks requiring multi-step problem solving, logical inference, and goal-directed planning.

The work highlights the combined value of supervised fine-tuning and reinforcement learning for building efficient, high-performing reasoning models. It also offers insights into training data design, methodology, and evaluation strategies. Phi-4-reasoning contributes to the growing class of reasoning-specialized language models and points toward more accessible, scalable AI for science, education, and technical domains.

NEW RESEARCH TeCoFeS: Text Column Featurization using Semantic Analysis

This research introduces a practical, cost-effective solution for enriching tabular data with semantic structure, making it more useful for downstream analysis and insights—which is especially valuable in business intelligence, data cleaning, and automated analytics workflows. This approach outperforms baseline models and naive LLM applications on converted text classification benchmarks.

Extracting structured insights from free-text columns in tables—such as product reviews or user feedback—can be time-consuming and error-prone, especially when relying on traditional syntactic methods that often miss semantic meaning. This research introduces the semantic text column featurization problem, which aims to assign meaningful, context-aware labels to each entry in a text column.

The authors propose a scalable, efficient method that combines the power of LLMs with text embeddings. Instead of labeling an entire column manually or applying LLMs to every cell—an expensive process—this new method intelligently samples a diverse subset of entries, uses an LLM to generate semantic labels for just that subset, and then propagates those labels to the rest of the column using embedding similarity.

NEW RESEARCH Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning

This work introduces ARTIST (Agentic Reasoning and Tool Integration in Self-improving Transformers), a new paradigm for LLM reasoning that expands beyond traditional language-only inference.

While LLMs have made considerable strides in complex reasoning tasks, they remain limited by their reliance on static internal knowledge and text-only reasoning. Real-world problem solving often demands dynamic, multi-step reasoning, adaptive decision making, and the ability to interact with external tools and environments. In this research, ARTIST brings together agentic reasoning, reinforcement learning (RL), and tool integration, designed to enable LLMs to autonomously decide when and how to invoke internal tools within multi-turn reasoning chains. ARTIST leverages outcome-based reinforcement learning to learn robust strategies for tool use and environment interaction without requiring step-level supervision.

Extensive experiments on mathematical reasoning and multi-turn function calling benchmarks show that ARTIST consistently outperforms state-of-the-art baselines, with up to 22% absolute improvement over base models and strong gains on the most challenging tasks. Detailed studies show that agentic RL training leads to deeper reasoning, more effective tool use, and higher-quality solutions.

PODCAST Materialism Podcast: MatterGen (opens in new tab)

What if you could find materials with tailored properties without ever entering the lab? The Materialism Podcast, which is dedicated to exploring materials science and engineering, talks with Tian Xie from Microsoft Research to discuss MatterGen, an AI tool which accelerates materials science discovery. Tune in to hear a discussion of the new Azure AI Foundry, where MatterGen will interact with and support MatterSim, an advanced deep learning model designed to simulate the properties of materials across a wide range of elements, temperatures, and pressures.

IN THE NEWS: Highlights of recent media coverage of Microsoft Research

When ChatGPT Broke an Entire Field: An Oral History
Quanta Magazine | April 30, 2025
Large language models are everywhere, igniting discovery, disruption and debate in whatever scientific community they touch. But the one they touched first — for better, worse and everything in between — was natural language processing. What did that impact feel like to the people experiencing it firsthand?
To tell that story, Quanta interviewed 19 NLP experts, including Kalika Bali, senior principal researcher at Microsoft Research. From researchers to students, tenured academics to startup founders, they describe a series of moments — dawning realizations, elated encounters and at least one “existential crisis” — that changed their world. And ours.

View more news and awards Opens in a new tab

The post Research Focus: Week of May 7, 2025 appeared first on Microsoft Research.

Categories: Microsoft

Microsoft Fusion Summit explores how AI can accelerate fusion research

Wed, 05/07/2025 - 18:00

The pursuit of nuclear fusion as a limitless, clean energy source has long been one of humanity’s most ambitious scientific goals. Research labs and companies worldwide are working to replicate the fusion process that occurs at the sun’s core, where isotopes of hydrogen combine to form helium, releasing vast amounts of energy. While scalable fusion energy is still years away, researchers are now exploring how AI can help accelerate fusion research and bring this energy to the grid sooner.

In March 2025, Microsoft Research held its inaugural Fusion Summit, a landmark event that brought together distinguished speakers and panelists from within and outside Microsoft Research to explore this question.

Ashley Llorens, Corporate Vice President and Managing Director of Microsoft Research Accelerator, opened the Summit by outlining his vision for a self-reinforcing system that uses AI to drive sustainability. Steven Cowley, laboratory director of the U.S. Department of Energy’s Princeton Plasma Physics Laboratory (opens in new tab), professor at Princeton University, and former head of the UK Atomic Energy Authority, followed with a keynote explaining the intricate science and engineering behind fusion reactors. His message was clear: advancing fusion will require international collaboration and the combined power of AI and high-performance computing to model potential fusion reactor designs.

Applying AI to fusion research

North America’s largest fusion facility, DIII-D (opens in new tab), operated by General Atomics and owned by the US Department of Energy (DOE), provides a unique platform for developing and testing AI applications for fusion research, thanks to its pioneering data and digital twin platform.

Richard Buttery (opens in new tab) from DIII-D and Dave Humphreys (opens in new tab) from General Atomics demonstrated how the US DIII-D National Fusion Program (opens in new tab) is already applying AI to advance reactor design and operations, highlighting promising directions for future development. They provided examples of how to apply AI to active plasma control to avoid disruptive instabilities, using AI-controlled trajectories to avoid tearing modes, and implementing feedback control using machine learning-derived density limits for safer high-density operations.

One persistent challenge in reactor design involves building the interior “first wall,” which must withstand extreme heat and particle bombardment. Zulfi Alam, corporate vice president of Microsoft Quantum (opens in new tab), discussed the potential of using quantum computing in fusion, particularly for addressing material challenges like hydrogen diffusion in reactors.

He noted that silicon nitride shows promise as a barrier to hydrogen and vapor and explained the challenge of binding it to the reaction chamber. He emphasized the potential of quantum computing to improve material prediction and synthesis, enabling more efficient processes. He shared that his team is also investigating advanced silicon nitride materials to protect this critical component from neutron and alpha particle damage—an innovation that could make fusion commercially viable.

Spotlight: Blog post

MedFuzz: Exploring the robustness of LLMs on medical challenge problems

Medfuzz tests LLMs by breaking benchmark assumptions, exposing vulnerabilities to bolster real-world accuracy.

Read more Opens in a new tab Exploring AI’s broader impact on fusion engineering

Lightning talks from Microsoft Research labs addressed the central question of AI’s potential to accelerate fusion research and engineering. Speakers covered a wide range of applications—from using gaming AI for plasma control and robotics for remote maintenance to physics-informed AI for simulating materials and plasma behavior. Closing the session, Archie Manoharan, Microsoft’s director of nuclear engineering for Cloud Operations and Infrastructure, emphasized the need for a comprehensive energy strategy, one that incorporates renewables, efficiency improvements, storage solutions, and carbon-free sources like fusion.

The Summit culminated in a thought-provoking panel discussion moderated by Ade Famoti, featuring Archie Manoharan, Richard Buttery, Steven Cowley, and Chris Bishop, Microsoft Technical Fellow and director of Microsoft Research AI for Science. Their wide-ranging conversation explored the key challenges and opportunities shaping the field of fusion.

The panel highlighted several themes: the role of new regulatory frameworks that balance innovation with safety and public trust; the importance of materials discovery in developing durable fusion reactor walls; and the game-changing role AI could play in plasma optimization and surrogate modelling of fusion’s underlying physics.

They also examined the importance of global research collaboration, citing projects like the International Thermonuclear Experimental Reactor (opens in new tab) (ITER), the world’s largest experimental fusion device under construction in southern France, as testbeds for shared progress. One persistent challenge, however, is data scarcity. This prompted a discussion of using physics-informed neural networks as a potential approach to supplement limited experimental data.

Global collaboration and next steps

Microsoft is collaborating with ITER (opens in new tab) to help advance the technologies and infrastructure needed to achieve fusion ignition—the critical point where a self-sustaining fusion reaction begins, using Microsoft 365 Copilot, Azure OpenAI Service, Visual Studio, and GitHub (opens in new tab). Microsoft Research is now cooperating with ITER to identify where AI can be leveraged to model future experiments to optimize its design and operations.

Now Microsoft Research has signed a Memorandum of Understanding with the Princeton Plasma Physics Laboratory (PPPL) (opens in new tab) to foster collaboration through knowledge exchange, workshops, and joint research projects. This effort aims to address key challenges in fusion, materials, plasma control, digital twins, and experiment optimization. Together, Microsoft Research and PPPL will work to drive innovation and advances in these critical areas.

Fusion is a scientific challenge unlike any other and could be key to sustainable energy in the future. We’re excited about the role AI can play in helping make that vision a reality. To learn more, visit the Fusion Summit event page, or connect with us by email at FusionResearch@microsoft.com.

Opens in a new tab

The post Microsoft Fusion Summit explores how AI can accelerate fusion research appeared first on Microsoft Research.

Categories: Microsoft

Societal AI: Building human-centered AI systems

Mon, 05/05/2025 - 18:00

In October 2022, Microsoft Research Asia hosted a workshop that brought together experts in computer science, psychology, sociology, and law as part of Microsoft’s commitment to responsible AI (opens in new tab). The event led to ongoing collaborations exploring AI’s societal implications, including the Value Compass (opens in new tab) project.

As these efforts grew, researchers focused on how AI systems could be designed to meet the needs of people and institutions in areas like healthcare, education, and public services. This work culminated in Societal AI: Research Challenges and Opportunities, a white paper that explores how AI can better align with societal needs.

What is Societal AI?

Societal AI is an emerging interdisciplinary area of study that examines how AI intersects with social systems and public life. It focuses on two main areas: (1) the impact of AI technologies on fields like education, labor, and governance; and (2) the challenges posed by these systems, such as evaluation, accountability, and alignment with human values. The goal is to guide AI development in ways that respond to real-world needs.

The white paper offers a framework for understanding these dynamics and provides recommendations for integrating AI responsibly into society. This post highlights the paper’s key insights and what they mean for future research.

Tracing the development of Societal AI

Societal AI began nearly a decade ago at Microsoft Research Asia, where early work on personalized recommendation systems uncovered risks like echo chambers, where users are repeatedly exposed to similar viewpoints, and polarization, which can deepen divisions between groups. Those findings led to deeper investigations into privacy, fairness, and transparency, helping inform Microsoft’s broader approach to responsible AI.

The rapid rise of large-scale AI models in recent years has made these concerns more urgent. Today, researchers across disciplines are working to define shared priorities and guide AI development in ways that reflect social needs and values.

Key insights

The white paper outlines several important considerations for the field:

Interdisciplinary framework: Bridges technical AI research with the social sciences, humanities, policy studies, and ethics to address AI’s far-reaching societal effects.

Actionable research agenda: Identifies ten research questions that offer a roadmap for researchers, policymakers, and industry leaders.

Global perspective: Highlights the importance of different cultural perspectives and international cooperation in shaping responsible AI development dialogue.

Practical insights: Balances theory with real-world applications, drawing from collaborative research projects.

“AI’s impact extends beyond algorithms and computation—it challenges us to rethink fundamental concepts like trust, creativity, agency, and value systems,” says Lidong Zhou, managing director of Microsoft Research Asia. “It recognizes that developing more powerful AI models is not enough; we must examine how AI interacts with human values, institutions, and diverse cultural contexts.”

Figure 1. Societal AI research agenda Guiding principles for responsible integration

The research agenda is grounded in three key principles:

Harmony: AI should minimize conflict and build trust to support acceptance.
Synergy: AI should complement human capabilities, enabling outcomes that neither humans nor machines could achieve alone.
Resilience: AI should be robust and adaptable as social and technological conditions evolve.

Ten critical questions

These questions span both technical and societal concerns:

How can AI be aligned with diverse human values and ethical principles?
How can AI systems be designed to ensure fairness and inclusivity across different cultures, regions, and demographic groups?
How can we ensure AI systems are safe, reliable, and controllable, especially as they become more autonomous?
How can human-AI collaboration be optimized to enhance human abilities?
How can we effectively evaluate AI’s capabilities and performance in new, unforeseen tasks and environments?
How can we enhance AI interpretability to ensure transparency in its decision-making processes?
How will AI reshape human cognition, learning, and creativity, and what new capabilities might it unlock?
How will AI redefine the nature of work, collaboration, and the future of global business models?
How will AI transform research methodologies in the social sciences, and what new insights might it enable?
How should regulatory frameworks evolve to govern AI development responsibly and foster global cooperation?

This list will evolve alongside AI’s developing societal impact, ensuring the agenda remains relevant over time. Building on these questions, the white paper underscores the importance of sustained, cross-disciplinary collaboration to guide AI development in ways that reflect societal priorities and public interest.

“This thoughtful and comprehensive white paper from Microsoft Research Asia represents an important early step forward in anticipating and addressing the societal implications of AI, particularly large language models (LLMs), as they enter the world in greater numbers and for a widening range of purposes,” says research collaborator James A. Evans (opens in new tab), professor of sociology at the University of Chicago.

Looking ahead

Microsoft is committed to fostering collaboration and invites others to take part in developing governance systems. As new challenges arise, the responsible use of AI for the public good will remain central to our research.

We hope the white paper serves as both a guide and a call to action, emphasizing the need for engagement across research, policy, industry, and the public.

For more information, and to access the full white paper, visit the Microsoft Research Societal AI page. Listen to the author discuss more about the research in this podcast.

Acknowledgments

We are grateful for the contributions of the researchers, collaborators, and reviewers who helped shape this white paper.

Opens in a new tab

The post Societal AI: Building human-centered AI systems appeared first on Microsoft Research.

Categories: Microsoft

Research Focus: Week of April 21, 2025

Wed, 04/23/2025 - 18:00

In this issue:

Catch a preview of our presentations and papers at CHI 2025 and ICLR 2025. We also introduce new research on causal reasoning and LLMs; enhancing LLM jailbreak capabilities to bolster safety and robustness; understanding how people using AI compared to AI-alone, and Distill-MOS, a compact and efficient model that delivers state-of-the-art speech quality assessment. You’ll also find a replay of a podcast discussion on rural healthcare innovation with Senior Vice President of Microsoft Health Jim Weinstein.

CONFERENCE Microsoft at CHI 2025

Microsoft Research is proud to be a sponsor of the ACM Computer Human Interaction (CHI) 2025 Conference on Human Factors in Computing Systems (opens in new tab). CHI brings together researchers and practitioners from all over the world and from diverse cultures, backgrounds, and positionalities, who share an overarching goal to make the world a better place with interactive digital technologies.

Our researchers will host more than 30 sessions and workshops at this year’s conference in Yokohama, Japan. We invite you to preview our presentations and our two dozen accepted papers.

Microsoft @CHI 2025 CONFERENCE Microsoft at ICLR 2025

Microsoft is proud to be a sponsor of the thirteenth International Conference on Learning Representations (ICLR). This gathering is dedicated to the advancement of representation learning, which is a branch of AI. We are pleased to share that Microsoft has more than 30 accepted papers at this year’s conference, which we invite you to preview.

ICLR is globally renowned for presenting and publishing cutting-edge research on all aspects of deep learning used in the fields of artificial intelligence, statistics and data science, as well as important application areas such as machine vision, computational biology, speech recognition, text understanding, gaming, and robotics.

Microsoft @ICLR 2025 NEW RESEARCH Causal Reasoning and Large Language Models: Opening a New Frontier for Causality

What kinds of causal arguments can large language models (LLMs) generate, how valid are these arguments, and what causal reasoning workflows can this generation support or automate? This paper, which was selected for ICLR 2025, clarifies this debate. It advances our understanding of LLMs and their causal implications, and proposes a framework for future research at the intersection of LLMs and causality.

This discussion has critical implications for the use of LLMs in societally impactful domains such as medicine, science, law, and policy. In capturing common sense and domain knowledge about causal mechanisms and supporting translation between natural language and formal methods, LLMs open new frontiers for advancing the research, practice, and adoption of causality.

Read the paper NEW RESEARCH The Future of AI in Knowledge Work: Tools for Thought at CHI 2025

Can AI tools do more than streamline workflows—can they actually help us think better? That’s the driving question behind the Microsoft Research Tools for Thought initiative. At this year’s CHI conference, this group is presenting four new research papers and cohosting a workshop that dives deep into this intersection of AI and human cognition.

The team provides an overview of their latest research, starting with a study on how AI is changing the way people think and work. They introduce three prototype systems designed to support different cognitive tasks. Finally, through their Tools for Thought workshop, they invite the CHI community to help define AI’s role in supporting human thinking.

Read the blog NEW RESEARCH Building LLMs with enhanced jailbreaking capabilities to bolster safety and robustness

Recent research shows that LLMs are vulnerable to automated jailbreak attacks, where algorithm-generated adversarial suffixes bypass safety alignment and trigger harmful responses. This paper introduces ADV-LLM, an iterative self-tuning process for crafting adversarial LLMs with enhanced jailbreak capabilities—which could provide valuable insights for future safety alignment research.

ADV-LLM is less computationally expensive than prior mechanisms and achieves higher attack success rates (ASR), especially against well-aligned models like Llama2 and Llama3.

It reaches nearly 100% ASR on various open-source LLMs and demonstrates strong transferability to closed-source models—achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4—despite being optimized solely on Llama3. Beyond improving jailbreak performance, ADV-LLM offers valuable insights for future alignment research by enabling large-scale generation of safety-relevant datasets.

Read the paper NEW RESEARCH ChatBench: From Static Benchmarks to Human-AI Evaluation

The rapid adoption of LLM-based chatbots raises the need to understand what people and LLMs can achieve together. However, standard benchmarks like MMLU (opens in new tab) assess LLM capabilities in isolation (i.e., “AI alone”). This paper presents the results of a user study that transforms MMLU questions into interactive user-AI conversations. The researchers seeded the participants with the question and then had them engage in a conversation with the LLM to arrive at an answer. The result is ChatBench, a new dataset comprising AI-alone, user-alone, and user-AI data for 396 questions and two LLMs, including 144,000 answers and 7,336 user-AI conversations.

The researchers’ analysis reveals that AI-alone accuracy does not predict user-AI accuracy, with notable differences across subjects such as math, physics, and moral reasoning. Examining user-AI conversations yields insights into how these interactions differ from AI-alone benchmarks. Finally, the researchers demonstrate that finetuning a user simulator on a subset of ChatBench improves its ability to predict user-AI accuracy, boosting correlation on held-out questions by more than 20 points, thereby enabling scalable interactive evaluation.

Read the paper NEW RESEARCH Distill-MOS: A compact speech-quality assessment model

Distill-MOS is a compact and efficient speech quality assessment model with dramatically reduced size—over 100x smaller than the reference model—enabling efficient, non-intrusive evaluation in real-world, low-resource settings.

This paper investigates the distillation and pruning methods to reduce model size for non-intrusive speech quality assessment based on self-supervised representations. The researchers’ experiments build on XLS-R-SQA, a speech quality assessment model using wav2vec 2.0 XLS-R embeddings. They retrain this model on a large compilation of mean opinion score datasets, encompassing over 100,000 labeled clips.

Read the paper View GitHub PODCAST Collaborating to Affect Change for Rural Health Care with Innovation and Technology

Senior Vice President of Microsoft Health Jim Weinstein joins Dan Liljenquist, Chief Strategy Officer from Intermountain Health, on the NEJM Catalyst podcast for a discussion of their combined expertise and resources and their collaboration to address healthcare challenges in the rural United States. These challenges include limited access to care, rising mortality rates, and severe staffing shortages. Working together, they aim to create a scalable model that can benefit both rural and urban health care systems. Key goals include expanding access through telemedicine and increasing cybersecurity, ultimately improving the quality of care delivered and financial stability for rural communities.

Listen to the podcast PODCAST Empowering patients and healthcare consumers in the age of generative AI

Two champions of patient-centered digital health join Microsoft Research President Peter Lee to talk about how AI is reshaping healthcare in terms of patient empowerment and emerging digital health business models. Dave deBronkart, a cancer survivor and longtime advocate for patient empowerment, discusses how AI tools like ChatGPT can help patients better understand their conditions, navigate the healthcare system, and communicate more effectively with clinicians. Christina Farr, a healthcare investor and former journalist, talks about the evolving digital health–startup ecosystem, highlighting where AI is having the most meaningful impact—particularly in women’s health, pediatrics, and elder care. She also explores consumer trends, like the rise of cash-pay healthcare. 

Listen to the podcast PODCAST Beyond the Image: AI’s Expanding Role in Healthcare

Jonathan Carlson, Managing Director of Microsoft Research Health Futures, joins the Healthcare Unfiltered show to explore the evolution of AI in medicine, from the early days to cutting-edge innovations like ambient clinical intelligence. This podcast explores how pre-trained models and machine learning are transforming care delivery, as well as the future of biomedicine and healthcare, including important ethical and practical questions.

Listen to the podcast Opens in a new tab

The post Research Focus: Week of April 21, 2025 appeared first on Microsoft Research.

Categories: Microsoft

The Future of AI in Knowledge Work: Tools for Thought at CHI 2025

Fri, 04/18/2025 - 18:00

Can AI tools do more than streamline workflows—can they actually help us think better? That’s the driving question behind the Microsoft Research Tools for Thought initiative. At this year’s CHI conference, we’re presenting four new research papers and cohosting a workshop that dives deep into this intersection of AI and human cognition.

This post provides an overview of our latest research, starting with a study on how AI is changing the way we think and work. We also introduce three prototype systems designed to support different cognitive tasks. Finally, through our Tools for Thought workshop, we’re inviting the CHI community to help define AI’s role in supporting human thinking.

AI’s effects on thinking at work

With a single prompt, AI can generate a wide range of outputs, from documents and meeting agendas to answers and automated workflows. But how are people’s thinking processes affected when they delegate these tasks to AI?

One of our goals is to understand how knowledge workers use AI, how they perceive its value, and how it affects cognitive effort.

Our study, “The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers,” surveyed 319 professionals using AI across a variety of occupations. Participants shared 936 real-world AI use cases and reflected on how it influenced their critical thinking and mental effort. We summarize these findings below.

Defining and deploying critical thinking. Knowledge workers describe critical thinking as involving activities like setting clear goals, refining prompts, and verifying AI outputs against external sources and their own expertise. They rely on these practices to maintain work quality when using AI—motivated by the need to avoid errors, produce better results, and develop their skills.

Findings

Balancing cognitive effort. Participants’ reports about critical thinking and the effort involved align with longstanding human tendencies to manage cognitive load at work. For high-stakes tasks requiring accuracy, they say they expend more effort in applying critical thinking with AI than they would performing the same tasks without it. In contrast, during routine, for low-stakes tasks under time pressure, they report spending less effort on critical thinking when using AI compared to completing tasks without it.

Confidence effects. The study found that higher confidence in AI was associated with less critical thinking, while higher self-confidence in one’s own abilities was associated with more critical thinking—though at a perceived higher cognitive cost. This suggests a delicate balance between using AI for efficiency and maintaining active critical engagement.

Shift in the nature of critical thinking. Participants reported a shift in critical thinking activities, with a greater focus on information verification, response integration, and task stewardship. While AI automates certain aspects of knowledge work, it also demands more effort in evaluating the accuracy and relevance of AI-generated content.

Barriers to critical engagement. The study identified several barriers that inhibit critical thinking when using AI. These include a lack of awareness of the need for critical evaluation, limited motivation due to time pressure or perceived job scope, and difficulty in refining prompts—especially in unfamiliar domains.

Recommendations

To foster critical thinking at work, we recommend that AI tools actively encourage awareness, motivation, and skill development.

AI tools should enhance motivators for critical thinking (e.g., quality standards, skill-building) and mitigate inhibitors (e.g., time constraints, low awareness). Proactive prompts can surface overlooked tasks, while reactive features can offer on-demand assistance. Motivation can be strengthened by positioning critical reflection as part of professional growth—not just extra work.

AI tools should also support knowledge workers’ ability to think critically by providing reasoning explanations (as some newer AI models now do), guided critiques, and cross-references. This shift must occur in both the design of the technology and in the mindsets of knowledge workers. Rather than treating AI as a tool for delivering answers, we suggest treating it as a thought partner—one that can also act as a provocateur.

Beyond these insights, our other CHI papers explore practical ways to design AI that augments human cognition.

Enhancing decision-making with AI

Decision-making is central to knowledge work, and AI is increasingly being used to help people make decisions in complex fields like healthcare and finance. However, how much agency do knowledge workers retain when AI is involved?

Our study, “AI, Help Me Think—but for Myself: Exploring How LLMs Can Assist People in Complex Decision-Making by Providing Different Forms of Cognitive Support,” conducted in collaboration with University College London, examines this question. We began with a small formative study involving 10 participants, followed by a comparative study with 21 participants using two different AI-supported decision-making systems.

For a complex financial investment task, we compared two different AI tools (Figure 1): RecommendAI, which provides AI-generated recommendations, and ExtendAI, which encourages users to articulate their reasoning before receiving AI feedback.

Figure 1. Illustrative comparison of the thought process involved when interacting with two types of AI: RecommendAI and ExtendAI. Findings

Both systems were found to offer benefits for augmenting cognition and addressing some of the challenges to critical thinking identified in the knowledge worker survey above, suggesting the potential for a balanced approach.

RecommendAI offered concrete suggestions that inspired users to explore new directions in their decision-making. This often led to fresh insights and reflections. However, the recommendations at times felt disconnected from the user’s own reasoning, reducing the depth of engagement.

In contrast, ExtendAI encouraged users to reflect more deeply on their decisions by providing feedback on their reasoning. This helped them examine their thought processes and consider alternative perspectives. However, some users found the feedback too general and not actionable enough.

When it came to how users integrated the tools into their decision-making process, RecommendAI, introduced perspectives that pushed users to think beyond their usual patterns. By recommending options not based on users’ own reasoning, it encouraged exploration of ideas they might not have considered. However, some users perceived the recommendations as a “black box” solution. This lack of transparency made those recommendations harder to understand, trust, and apply to their own thought processes.

ExtendAI, on the other hand, aligned with users’ existing reasoning, making its feedback easier to incorporate. This helped the users maintain a sense of control and continuity. However, because the feedback often echoed their initial thoughts, it sometimes limited new insights and risked reinforcing existing biases.

These findings suggest that AI tools like ExtendAI, designed to elicit and build on users’ own cognitive processes, may offer a more effective approach to augmentation than simply providing “ready-made solutions” that users must figure out how to interpret and apply.

Are we on track? Making meetings better with AI

Meetings are often criticized for being ineffective. While this is sometimes due to poor practices—such as weak agendas, late starts, and unclear facilitation—we believe the deeper issue is a lack of meeting intentionality: knowing why a meeting is occurring and keeping the discussion focused on that purpose. A key challenge is maintaining goal clarity throughout a meeting.

In the paper “Are We On Track? AI-Assisted Goal Reflection During Meetings,” we explore how AI tools can improve meetings in real time by encouraging reflection—awareness about the meeting’s goals and how well the current conversation is aligned with those goals.

Our study with 15 knowledge workers examined two AI-driven design paradigms: passive goal assistance through ambient visualization (a live chart displaying how conversational topics relate to meeting objectives) and active goal assistance through interactive questioning (nudging participants to consider whether the current conversation aligns with the meeting objectives). These approaches are illustrated in Figure 2.

Figure 2. Technology prototypes exploring passive and active ways to keep meetings focused on established objectives. Recommendations

The findings highlight AI’s potential to help teams with meeting objectives. We found three key design tradeoffs between passive and active support. Based on these, we offer the following AI design recommendations.

Information balance. There is a tradeoff between ambient visualizations in the passive approach—which can risk information overload—and interactive questioning in the active approach, which may lack detail. To be effective, AI should deliver the right amount of information at the right time and tailor content to the individuals who need it most—without overwhelming users, while offering meaningful and timely support for reflection.

Balance of engagement versus interruption. When participants are deeply engaged in discussion, significant interruptions can overwhelm and disrupt the flow. Conversely, during moments of confusion or misalignment, subtle cues may be insufficient to get the team back on track. AI systems should dynamically adjust their level of intervention—from ambient and lightweight to more direct—escalating or de-escalating based on timing thresholds, which can be customized for each team.

Balance of team versus individual goal awareness. AI assistance can nudge team action, such as adjusting agendas. These effects were stronger with the active approach, which required group responses, while the passive approach supported individual thinking without directly influencing team behavior. Team-wide engagement depends on both the visibility of AI cues and how they are introduced into the discussion.

This study helps us understand how AI design choices can support intentionality during meetings and enhance productivity without disrupting natural workflows.

About Microsoft Research

Advancing science and technology to benefit humanity

View our story Opens in a new tab Encouraging diverse problem-solving brainstorming with AI

Diverse perspectives drive creative problem-solving in organizations, but individuals often lack access to varied viewpoints. In the paper “YES AND: An AI-Powered Problem-Solving Framework for Diversity of Thought,” we build on the idea of “design improv” to explore a multi-agent AI prototype that simulates conversations with persona-based agents representing a range of expertise.

The agents follow a classic model of conversational turn-taking, combined with a confidence model to determine when to take or respond to a turn. This allows both the agents and the user to organically build on each others’ ideas and ask clarifying questions. The system enables free-flowing, multi-party idea generation while avoiding common pitfalls of group brainstorming—such as social loafing, production blocking, and groupthink (Figure 3).

Figure 3. The YES AND system supports conversational turn-taking among agents and the user to generate ideas around a problem.

At the end of a session, an AI agent called Sage distills the discussion, leaving it to the user to develop a conclusive approach to the problem. In this way, YES AND helps unblock forward momentum in problem-solving while preserving the agency of knowledge workers to shape their own ideas.

Next steps: Expanding the Tools for Thought community

We believe the best way to advance next-generation tools for thought is by bringing together a wide range of perspectives and approaches. In addition to our four papers, we are also conducting a workshop at CHI on April 26, co-organized with collaborators from industry and academia: Tools for Thought: Research and Design for Understanding, Protecting, and Augmenting Human Cognition with Generative AI.

In this session, over 60 researchers, designers, practitioners, and provocateurs will gather to examine what it means to understand and shape the impact of AI on human cognition. Together, we’ll explore how AI is changing workflows, the opportunities and challenges for design, and which theories, perspectives, and methods are increasingly relevant—or still need to be developed.

The enthusiastic response to this workshop highlights the growing interest in AI’s role in human thought. Our goal is to foster a multidisciplinary community dedicated to ensuring that AI not only accelerates work but also strengthens our ability to think critically, creatively, and strategically.

We look forward to ongoing discussions, new collaborations, and the next wave of innovations in AI-assisted cognition at CHI 2025.

Opens in a new tab

The post The Future of AI in Knowledge Work: Tools for Thought at CHI 2025 appeared first on Microsoft Research.

Categories: Microsoft

Engagement, user expertise, and satisfaction: Key insights from the Semantic Telemetry Project

Mon, 04/14/2025 - 21:00

The Semantic Telemetry Project aims to better understand complex, turn-based human-AI interactions in Microsoft Copilot using a new data science approach.

This understanding is crucial for recognizing how individuals utilize AI systems to address real-world tasks. It provides actionable insights, enhances key use cases, and identifies opportunities for system improvement.

In a recent blog post, we shared our approach for classifying chat log data using large language models (LLMs), which allows us to analyze these interactions at scale and in near real time. We also introduced two of our LLM-generated classifiers: Topics and Task Complexity.

This blog post will examine how our suite of LLM-generated classifiers can serve as early indicators for user engagement and highlight how usage and satisfaction varies based on AI and user expertise.

The key findings from our research are:

When users engage in more professional, technical, and complex tasks, they are more likely to continue utilizing the tool and increase their level of interaction with it.
Novice users currently engage in simpler tasks, but their work is gradually becoming more complex over time.
More expert users are satisfied with AI responses only where AI expertise is on par with their own expertise on the topic, while novice users had low satisfaction rates regardless of AI expertise.

Read on for more information on these findings. Note that all analyses were conducted on anonymous Copilot in Bing interactions containing no personal information.

Classifiers mentioned in article:

Knowledge work classifier: Tasks that involve creating artifacts related to information work typically requiring creative and analytical thinking. Examples include strategic business planning, software design, and scientific research.

Task complexity classifier: Assesses the cognitive complexity of a task if a user performs it without the use of AI. We group into two categories: low complexity and high complexity.

Topics classifier: A single label for the primary topic of the conversation.

User expertise: Labels the user’s expertise on the primary topic within the conversation as one of the following categories: Novice (no familiarity with the topic), Beginner (little prior knowledge or experience), Intermediate (some basic knowledge or familiarity with the topic), Proficient (can apply relevant concepts from conversation), and Expert (deep and comprehensive understanding of the topic).

AI expertise: Labels the AI agent expertise based on the same criteria as user expertise above.

User satisfaction: A 20-question satisfaction/dissatisfaction rubric that the LLM evaluates to create an aggregate score for overall user satisfaction.

What keeps Bing Chat users engaged?

We conducted a study of a random sample of 45,000 anonymous Bing Chat users during May 2024. The data was grouped into three cohorts based on user activity over the course of the month:

Light (1 active chat session per week)
Medium (2-3 active chat sessions per week)
Heavy (4+ active chat sessions per week)

The key finding is that heavy users are doing more professional, complex work.

We utilized our knowledge work classifier to label the chat log data as relating to knowledge work tasks. What we found is knowledge work tasks were higher in all cohorts, with the highest percentage in heavy users.

Figure 1: Knowledge work based on engagement cohort

Analyzing task complexity, we observed that users with higher engagement frequently perform the highest number of tasks with high complexity, while users with lower engagement performed more tasks with low complexity.

Figure 2: High complexity and low complexity tasks by engagement cohort+

Looking at the overall data, we can filter on heavy users and see higher numbers of chats where the user was performing knowledge work tasks. Based on task complexity, we see that most knowledge work tasks seek to apply a solution to an existing problem, primarily within programming and scripting. This is in line with our top overall topic, technology, which we discussed in the previous post.

Figure 3: Heavy users tree diagram

In contrast, light users tended to do more low complexity tasks (“Remember”), using Bing Chat like a traditional search engine and engaging more in topics like business and finance and computers and electronics.

Figure 4: Light users tree diagram Novice queries are becoming more complex

We looked at Bing Chat data from January through August 2024 and we classified chats using our User Expertise classifier. When we looked at how the different user expertise groups were using the tool for professional tasks, we discovered that proficient and expert users tend to do more professional tasks with high complexity in topics like programming and scripting, professional writing and editing, and physics and chemistry.

Figure 5: Top topics for proficient/expert users Figure 6: Task complexity for proficient/expert Figure 7: Top topics for novices

In contrast, novice users engaged more in professional tasks relating to business and finance and education and learning, mainly using the tool to recall information.

Figure 8: Task complexity for novices

However, novices are targeting increasingly more complex tasks over time. Over the eight-month period, we see the percentage of high complexity tasks rise from about 36% to 67%, revealing that novices are learning and adapting quickly (see Figure 9).

Figure 9: High complexity for novices Jan-Aug 2024 How does user satisfaction vary according to expertise?

We classified both the user expertise and AI agent expertise for anonymous interactions in Copilot in Bing. We compared the level of user and AI agent expertise with our user satisfaction classifier.

The key takeaways are:

Experts and proficient users are only satisfied with AI agents with similar expertise (expert/proficient).
Novices are least satisfied, regardless of the expertise of the AI agent.

Figure 10: Copilot in Bing satisfaction intersection of AI expertise and User expertise (August-September 2024) Conclusion

Understanding these metrics is vital for grasping user behavior over time and relating it to real-world business indicators. Users are finding value from complex professional knowledge work tasks, and novices are quickly adapting to the tool and finding these high value use-cases. By analyzing user satisfaction in conjunction with expertise levels, we can tailor our tools to better meet the needs of different user groups. Ultimately, these insights can help improve user understanding across a variety of tasks.

In our next post, we will examine the engineering processes involved in LLM-generated classification.

Opens in a new tab

The post Engagement, user expertise, and satisfaction: Key insights from the Semantic Telemetry Project appeared first on Microsoft Research.

Categories: Microsoft

Debug-gym: an environment for AI coding tools to learn how to debug code like programmers

Thu, 04/10/2025 - 18:00

The ongoing proliferation of AI coding tools is not only boosting developers’ efficiency, it also signals a future where AI will generate a growing share of all new code. GitHub CEO Thomas Dohmke (opens in new tab) predicted as much in 2023, when he said that “sooner than later, 80% of the code is going to be written by Copilot.”

Both large and small software companies are already heavily using AI to generate code. Y Combinator’s Garry Tan (opens in new tab) noted that 95% of code for a quarter of Y Combinator’s latest batch of startups was written by large language models.

In fact, most developers spend the majority of their time debugging code, not writing it. As maintainers of popular open-source repositories, this resonates with us. But what if an AI tool could propose fixes for hundreds of open issues, and all we had to do was approve them before merging? This was what motivated us to maximize the potential time savings from AI coding tools by teaching them to debug code.

By debugging we mean the interactive, iterative process to fix code. Developers typically hypothesize why their code crashed, then gather evidence by stepping through the program and examining variable values. They often use debugging tools like pdb (Python debugger) to assist in gathering information. This process is repeated until the code is fixed.

Today’s AI coding tools boost productivity and excel at suggesting solutions for bugs based on available code and error messages. However, unlike human developers, these tools don’t seek additional information when solutions fail, leaving some bugs unaddressed, as you can see in this simple demo of how a mislabeled column stumps today’s coding tools (opens in new tab). This may leave users feeling like AI coding tools don’t understand the full context of the issues they are trying to solve.

Introducing debug-gym

A natural research question emerges: to what degree can LLMs use interactive debugging tools such as pdb? To explore this question, we released debug-gym (opens in new tab) – an environment that allows code-repairing agents to access tools for active information-seeking behavior. Debug-gym expands an agent’s action and observation space with feedback from tool usage, enabling setting breakpoints, navigating code, printing variable values, and creating test functions. Agents can interact with tools to investigate code or rewrite it, if confident. We believe interactive debugging with proper tools can empower coding agents to tackle real-world software engineering tasks and is central to LLM-based agent research. The fixes proposed by a coding agent with debugging capabilities, and then approved by a human programmer, will be grounded in the context of the relevant codebase, program execution and documentation, rather than relying solely on guesses based on previously seen training data.

Figure 1: Diagram demonstrating the code-repairing process in outline. In most existing approaches (shown in black), an agent rewrites its code conditioned on error messages obtained from executing the code. debug-gym equips the agent with additional tools such as pdb (shown in red), so it can interactively seek necessary information from the semantic space hidden behind the code and therefore have better code-repairing performance.

Debug-gym is designed and developed to:

Handle repository-level information: the full repository is available to agents in debug-gym, allowing them to navigate and edit files.
Be robust and safe: to safeguard both the system and the development process, debug-gym runs code within sandbox Docker containers. This isolates the runtime environment, preventing harmful actions while still allowing thorough testing and debugging.
Be easily extensible: debug-gym was conceived with extensibility in mind and provides practitioners with the possibility of easily adding new tools.
Be text-based: debug-gym represents observation information in structured text (e.g., JSON format) and defines a simple syntax for text actions, making the environment fully compatible with modern LLM-based agents.

With debug-gym, researchers and developers can specify a folder path to work with any custom repository to evaluate their debugging agent’s performance. Additionally, debug-gym includes three coding benchmarks to measure LLM-based agents’ performance in interactive debugging: Aider for simple function-level code generation, Mini-nightmare for short, hand-crafted buggy code examples, and SWE-bench for real-world coding problems requiring a comprehensive understanding of a large codebase and a solution in the format of a GitHub pull request.

To learn more about debug-gym and start using it to train your own debugging agents, please refer to the technical report (opens in new tab) and GitHub (opens in new tab).

Early experimentation: promising signal

For our initial attempt to validate that LLMs perform better on coding tests when they have access to debugging tools, we built a simple prompt-based agent and provided it with access to the following debug tools: eval, view, pdb, rewrite, and listdir. We used nine different LLMs as the backbone for our agent. Detailed results can be found in the technical report (opens in new tab). (opens in new tab)

Even with debugging tools, our simple prompt-based agent rarely solves more than half of the SWE-bench (opens in new tab)Lite issues. We believe this is due to the scarcity of data representing sequential decision-making behavior (e.g., debugging traces) in the current LLM training corpus. However, the significant performance improvement (as shown in the most promising results in the graph below) validates that this is a promising research direction.

Figure 2: The success rate represents the percentage of the 300 SWE-bench Lite issues resolved. The green bars indicate the performance of the agent with debugging tools, while the gray bars show the performance of the agent without debugging tools. Note that both agents use the same backbone LLM to make decisions and propose code edits.

Microsoft research podcast

Ideas: AI and democracy with Madeleine Daepp and Robert Osazuwa Ness

As the “biggest election year in history” comes to an end, researchers Madeleine Daepp and Robert Osazuwa Ness and Democracy Forward GM Ginny Badanes discuss AI’s impact on democracy, including the tech’s use in Taiwan and India.

Listen now Opens in a new tab Future work

We believe that training or fine-tuning LLMs can enhance their interactive debugging abilities. This requires specialized data, such as trajectory data that records agents interacting with a debugger to gather information before suggesting a fix. Unlike conventional reasoning problems, interactive debugging involves generating actions at each step that trigger feedback from the environment. This feedback helps the agent make new decisions, requiring dense data like the problem description and the sequence of actions leading to the solution.

Our plan is to fine-tune an info-seeking model specialized in gathering the necessary information to resolve bugs. The goal is to use this model to actively build relevant context for a code generation model. If the code generation model is large, there is an opportunity to build a smaller info-seeking model that can provide relevant information to the larger one, e.g., a generalization of retrieval augmented generation (RAG), thus saving AI inference costs. The data collected during the reinforcement learning loop to train the info-seeking model can also be used to fine-tune larger models for interactive debugging.

We are open-sourcing debug-gym to facilitate this line of research. We encourage the community to help us advance this research towards building interactive debugging agents and, more generally, agents that can seek information by interacting with the world on demand.

Acknowledgements

We thank Ruoyao Wang for their insightful discussion on building interactive debugging agents, Chris Templeman and Elaina Maffeo for their team coaching, Jessica Mastronardi and Rich Ciapala for their kind support in project management and resource allocation, and Peter Jansen for providing valuable feedback for the technical report.

Opens in a new tab

The post Debug-gym: an environment for AI coding tools to learn how to debug code like programmers appeared first on Microsoft Research.

Categories: Microsoft

Research Focus: Week of April 7, 2025

Wed, 04/09/2025 - 18:00

In this issue:

We introduce a new dataset designed to assist renewable energy infrastructure planners, a new method for denoising MRI imagery, and an AI tool for analyzing distant galaxies. Check out our latest research and other updates.

NEW RESEARCH Global Renewables Watch: A Temporal Dataset of Solar and Wind Energy Derived from Satellite Imagery

Siting renewable energy infrastructure requires careful consideration of the potential impact on ecosystems, cultural and historical resources, agriculture, and scenic landscapes. To help policymakers, researchers, and other stakeholders assess strategies for deployment, researchers from Microsoft, The Nature Conservancy (opens in new tab), and Planet (opens in new tab) present a comprehensive global temporal dataset of commercial solar photovoltaic (PV) farms and onshore wind turbines.

The researchers built the dataset by training deep learning-based segmentation models on high-resolution satellite imagery and then deploying them on over 13 trillion pixels of images covering the world. The final spatial dataset includes 375,197 individual wind turbines and 86,410 solar photovoltaic installations. For each detected feature, they estimate the construction date and the preceding land use type, and aggregate their findings to the country level, along with estimates of total power capacity.

Read the paper NEW RESEARCH SNRAware: Improved Deep Learning MRI Denoising with SNR Unit Training and G-factor Map Augmentation

This research proposes a new training method, SNRAware, to improve the ability of deep learning models to denoise—or remove unwanted random variations—from MRI images. MRI images can suffer from high levels of noise when scanning is accelerated with parallel imaging or when data are acquired using lower cost, low-field MRI systems.

The researchers tested SNRAware on 14 different models, including ones based on transformer and convolutional architectures. The proposed training scheme improved the performance of all the tested models. This broad applicability means that the method is flexible and can be applied to different kinds of models without redesigning them. The testing showed SNRAware significantly improves the quality and clinical utility of MRI images while preserving important diagnostic details.

Read the paper NEW RESEARCH Can AI unlock the mysteries of the universe?

Analyzing the physical properties of individual galaxies is a fundamental skill in astronomy. It requires a thorough understanding of galaxy formation theories and the ability to interpret vast amounts of observational data. However, even for seasoned astronomers, this process can be time-consuming and labor-intensive. To help astronomers accelerate this fundamental process, researchers from Microsoft and external colleagues introduce Mephisto, research designed to analyze extremely distant galaxies observed by the James Webb Space Telescope (JWST).

Mephisto analyzes photometric data from distant galaxies, proposing physical models and interacting with Code Investigating Galaxy Emission (opens in new tab), a commonly used galaxy spectral simulation program. Mephisto can detect discrepancies between models and observational data, identifies potential instrumental errors or limitations in the models, iteratively adjusts parameters, and generates multiple explanations for the observational data.

Read the article APPLIED AI Japan Airlines’ new AI app will make it easier for cabin attendants to report inflight events with Microsoft’s Phi-4 small language model

Japan Airlines (JAL) is using technology developed by Microsoft Research to deploy an AI app that helps flight crews communicate more effectively with ground staff when something unexpected comes up during a flight.

The JAL-AI Report is being developed using Microsoft’s Phi-4 small language model (SLM), which requires less computing power than the large language models (LLMs) most generative AI tools run on, so it can be used offline on a device for specific tasks.

Cabin attendants who have tried it say it can slash the time for writing operation reports by up to two thirds, say, from one hour to 20 minutes, or from 30 minutes to 10 for simpler cases.

Read the story Microsoft Research | In case you missed it AI weather forecast project eyes access through desktop computers

Financial Times | March 20, 2025

Aardvark Weather uses AI to deliver accurate forecasts in just minutes from a desktop computer. Developed by scientists at the University of Cambridge, with support from the Alan Turing Institute, Microsoft Research, and the European Centre for Medium-Range Weather Forecasts, this technology is tens of times faster than existing methods and requires only a fraction of the computing power.

Director of Microsoft Research talks AI for science (what it really means)

The Deep View | March 11, 2025

Chris Bishop, Director, AI for Science, Microsoft Research, discusses what AI is doing for science. This interview dives into how AI is accelerating discovery of new techniques and findings, the benefits of foundation models like Aurora, MatterGen’s capabilities, and AI’s impact on scientists.

Microsoft’s Christopher Bishop: Scientific discovery is AI’s killer application

Financial Times | April 3, 2025

Christopher Bishop runs Microsoft’s AI for Science research unit, which applies the powerful technology to the natural sciences. Bishop sees the mission of the lab, which was founded in 2022, as accelerating scientific discovery using the technology.

In this conversation with the Financial Times’ AI editor Madhumita Murgia, he explains why he believes scientific discovery will prove to be the single most important application of the technology.

Innovation to Impact (ft. Dr M – DGTL Voices with Ed Marx)

DGTL Voices with Ed Marx | March 12, 2025

Matthew Lungren, Chief Scientific Officer, Microsoft Health and Life Sciences, and Jonathan Carlson, Managing Director, Microsoft Health Futures, discuss AI’s transformative impact on radiology and the importance of collaboration in research and product development. They highlight how healthcare organizations can leverage Microsoft’s resources for innovation, emphasizing Microsoft’s progress in developing radiology-specific multimodal models and its broader work in healthcare.

Tech Life – The doctor will see you now

BBC Sounds | March 4, 2025

An update from the live trials in Ghana of Microsoft Research’s Holoportation 3D telemedicine technology. BBC’s Tech Life speaks to lead researcher Spencer Fowers, as well as a patient and doctor benefiting from the portable kit.

Microsoft Unveils New AI Model to Edit Video Games

IEEE Spectrum | March 11, 2025

Lead researcher Katja Hoffman discusses Microsoft’s Muse, a transformer model with 1.6 billion parameters trained on 500,000 hours of player data that can generate gameplay examples from a single screenshot.

National University of Singapore collaborates with Microsoft Research Asia to advance AI research and cultivate computing talent

NUS News | April 2, 2025

The National University of Singapore (NUS) has signed a five-year collaboration agreement with Microsoft Research Asia for a Joint PhD Supervision Program, bringing together NUS’s academic and research excellence with Microsoft Research Asia’s global leadership in AI, computing research, and industrial applications to cultivate talent. As part of this collaboration, NUS and Microsoft Research Asia will nurture PhD students through the Industrial Postgraduate Program, supported by the Singapore Economic Development Board (EDB). This initiative will help to cultivate interdisciplinary, high-caliber tech professionals and drive the integration of AI technology across industries.

How Microsoft made it through 50 years

The Verge | April 4, 2025

A lot has changed since Microsoft was founded, but in many ways, the company’s core business model and ethos remain the same: make software that everyone needs and get it installed everywhere. Adapting to change, including the ongoing AI transformation, has always played an important role in the company’s success.

View more news and awards Opens in a new tab

The post Research Focus: Week of April 7, 2025 appeared first on Microsoft Research.

Categories: Microsoft

VidTok introduces compact, efficient tokenization to enhance AI video processing

Wed, 04/02/2025 - 18:00

Every day, countless videos are uploaded and processed online, putting enormous strain on computational resources. The problem isn’t just the sheer volume of data—it’s how this data is structured. Videos consist of raw pixel data, where neighboring pixels often store nearly identical information. This redundancy wastes resources, making it harder for systems to process visual content effectively and efficiently.

To tackle this, we’ve developed a new approach to compress visual data into a more compact and manageable form. In our paper “VidTok: A Versatile and Open-Source Video Tokenizer,” we introduce a method that converts video data into smaller, structured units, or tokens. This technique provides researchers and developers in visual world modeling—a field dedicated to teaching machines to interpret images and videos—with a flexible and efficient tool for advancing their work.

How VidTok works

VidTok is a technique that converts raw video footage into a format that AI can easily work with and understand, a process called video tokenization. This process converts complex visual information into compact, structured tokens, as shown in Figure 1.

Figure 1. An overview of how video tokenizers work, which form the basis of VidTok.

By simplifying videos into manageable chunks, VidTok can enable AI systems to learn from, analyze, and generate video content more efficiently. VidTok offers several potential advantages over previous solutions:

Supports both discrete and continuous tokens. Not all AI models use the same “language” for video generation. Some perform best with continuous tokens—ideal for high-quality diffusion models—while others rely on discrete tokens, which are better suited for step-by-step generation, like language models for video. VidTok is a tokenizer that has demonstrated seamless support for both, making it adaptable across a range of AI applications.

Operates in both causal and noncausal modes. In some scenarios, video understanding depends solely on past frames (causal), while in others, it benefits from access to both past and future frames (noncausal). VidTok can accommodate both modes, making it suitable for real-time use cases like robotics and video streaming, as well as for high-quality offline video generation.

Efficient training with high performance. AI-powered video generation typically requires substantial computational resources. VidTok can reduce training costs by half through a two-stage training process—delivering high performance and lowering costs.

About Microsoft Research

Advancing science and technology to benefit humanity

View our story Opens in a new tab Architecture

The VidTok framework builds on a classic 3D encoder-decoder structure but introduces 2D and 1D processing techniques to handle spatial and temporal information more efficiently. Because 3D architectures are computationally intensive, VidTok combines them with less resource-intensive 2D and 1D methods to reduce computational costs while maintaining video quality.

Spatial processing. Rather than treating video frames solely as 3D volumes, VidTok applies 2D convolutions—pattern-recognition operations commonly used in image processing—to handle spatial information within each frame more efficiently.

Temporal processing. To model motion over time, VidTok introduces the AlphaBlender operator, which blends frames smoothly using a learnable parameter. Combined with 1D convolutions—similar operations applied over sequences—this approach captures temporal dynamics without abrupt transitions.

Figure 2 illustrates VidTok’s architecture in detail.

Figure 2. VidTok’s architecture. It uses a combination of 2D and 1D operations instead of solely relying on 3D techniques, improving efficiency. For smooth frame transitions, VidTok employs the AlphaBlender operator in its temporal processing modules. This approach strikes a balance between computational speed and high-quality video output. Quantization

To efficiently compress video data, AI systems often use quantization to reduce the amount of information that needs to be stored or transmitted. A traditional method for doing this is vector quantization (VQ), which groups values together and matches them to a fixed set of patterns (known as a codebook). However, this can lead to an inefficient use of patterns and lower video quality.

For VidTok, we use an approach called finite scalar quantization (FSQ). Instead of grouping values, FSQ treats each value separately. This makes the compression process more flexible and accurate, helping preserve video quality while keeping the file size small. Figure 3 shows the difference between the VQ and FSQ approaches.

Figure 3. VQ (left) relies on learning a codebook, while FSQ (right) simplifies the process by independently grouping values into fixed sets, making optimization easier. VidTok adopts FSQ to enhance training stability and reconstruction quality. Training

Training video tokenizers requires significant computing power. VidTok uses a two-stage process:

It first trains the full model on low-resolution videos.
Then, it fine-tunes only the decoder using high-resolution videos.

This approach cuts training costs in half—from 3,072 to 1,536 GPU hours—while maintaining video quality. Older tokenizers, trained on full-resolution videos from the start, were slower and more computationally intensive.

VidTok’s method allows the model to quickly adapt to new types of videos without affecting its token distribution. Additionally, it trains on lower-frame-rate data to better capture motion, improving how it represents movement in videos.

Evaluating VidTok

VidTok’s performance evaluation using the MCL-JCV benchmark—a comprehensive video quality assessment dataset—and an internal dataset demonstrates its superiority over existing state-of-the-art models in video tokenization. The assessment, which covered approximately 5,000 videos of various types, employed four standard metrics to measure video quality:

Peak Signal-to-Noise Ratio (PSNR)
Structural Similarity Index Measure (SSIM)
Learned Perceptual Image Patch Similarity (LPIPS)
Fréchet Video Distance (FVD)

The following table and Figure 4 illustrate VidTok’s performance:

Table 1

The results indicate that VidTok outperforms existing models in both discrete and continuous tokenization scenarios. This improved performance is achieved even when using a smaller model or a more compact set of reference patterns, highlighting VidTok’s efficiency.

Figure 4. Quantitative comparison of discrete and continuous tokenization performance in VidTok and state-of-the-art methods, evaluated using four metrics: PSNR, SSIM, LPIPS, and FVD. Larger chart areas indicate better overall performance. Looking ahead

VidTok represents a significant development in video tokenization and processing. Its innovative architecture and training approach enable improved performance across various video quality metrics, making it a valuable tool for video analysis and compression tasks. Its capacity to model complex visual dynamics could improve the efficiency of video systems by enabling AI processing on more compact units rather than raw pixels.

VidTok serves as a promising foundation for further research in video processing and representation. The code for VidTok is available on GitHub (opens in new tab), and we invite the research community to build on this work and help advance the broader field of video modeling and generation.

Opens in a new tab

The post VidTok introduces compact, efficient tokenization to enhance AI video processing appeared first on Microsoft Research.

Categories: Microsoft

Research Focus: Week of March 24, 2025

Wed, 03/26/2025 - 18:00

In this issue:

We examine a new conversation segmentation method that delivers more coherent and personalized agent conversation, and we review efforts to improve MLLMs’ understanding of geologic maps. Check out the latest research and other updates.

NEW RESEARCH SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents

Researchers from Microsoft and Tsinghua University propose a new method to help conversational AI agents deliver more coherent and personalized responses during complex long-term dialogue.

Large language models (LLMs) are widely used to enable more complicated discussions across a broader range of topics than traditional dialogue systems. However, managing excessively long context that contains irrelevant information is a major challenge. Existing solutions typically perform retrieval augmented response generation by constructing memory banks from conversation history at either the turn-level, session-level, or through summarization.

The proposed new approach, SeCom, constructs the memory bank at segment level by introducing a conversation Segmentation model that partitions long-term conversations into topically coherent segments, while applying Compression based denoising on memory units to enhance memory retrieval. Experimental results show that SeCom exhibits a significant performance advantage over baselines on long-term conversation benchmarks LOCOMO and Long-MT-Bench+. Additionally, the proposed conversation segmentation method demonstrates superior performance on dialogue segmentation datasets such as DialSeg711, TIAGE, and SuperDialSeg.

Read the paper NEW RESEARCH PEACE: Empowering Geologic Map Holistic Understanding with MLLMs

Microsoft Researchers and external colleagues introduce GeoMap-Agent, an AI system specifically designed for geologic map understanding and analysis. In the lab, they measure its effectiveness using a new benchmark called GeoMap-Bench, a novel gauge for evaluating multimodal large language models (MLLMs) in geologic map understanding. Geologic maps provide critical insights into the structure and composition of Earth’s surface and subsurface. They are indispensable in fields including disaster detection, resource exploration, and civil engineering.

Current MLLMs often fall short in understanding geologic maps, largely due to the challenging nature of cartographic generalization, which involves handling high-resolution maps, managing multiple associated components, and requiring domain-specific knowledge.

This paper presents results of experiments in which GeoMap-Agent achieves an overall score of 0.811 on GeoMap-Bench, significantly outperforming the 0.369 score of GPT-4o. The researchers intend to enable advanced AI applications in geology, powering more efficient and accurate geological investigations.

Read the paper NEW RESEARCH The future of the industrial AI edge is cellular

Reliable, high-bandwidth wireless connectivity and local processing at the edge are crucial enablers for emerging industrial AI applications. This work proposes that cellular networking is the ideal connectivity solution for these applications, due to its virtualization and support for open APIs. The researchers project the emergence of a converged industrial AI edge encompassing both computing and connectivity, in which application developers leverage the API to implement advanced functionalities. They present a case study showing evidence of the effectiveness of this approach, evaluated on an enterprise-grade 5G testbed.

Read the paper NEW RESEARCH RE#: High Performance Derivative-Based Regex Matching with Intersection, Complement, and Restricted Lookarounds

A regular expression (regex or RE) is a sequence of characters used to match, search, and manipulate strings in text based on specific criteria. REs are used in programming languages for data validation, text parsing, and search operations.

This paper presents a tool and theory built on symbolic derivatives that does not use backtracking, while supporting both classical operators and complement, intersection, and restricted lookarounds. The researchers show that the main matching algorithm has input-linear complexity both in theory as well as experimentally. They apply thorough evaluation on popular benchmarks that show that RE# is over 71% faster than the next fastest regex engine in Rust on the baseline, and outperforms all state-of-the-art engines on extensions of the benchmarks, often by several orders of magnitude.

This work could potentially enable new applications in LLM prompt engineering frameworks, new applications in medical research and bioinformatics, and new opportunities in access and resource policy language design by web service providers.

Read the paper NEW RESEARCH Toward deep learning sequence–structure co-generation for protein design

Researchers review recent advances in deep generative models for protein design, with a focus on sequence-structure co-generation methods. They describe the key methodological and evaluation principles underlying these methods, highlight recent advances from the literature, and discuss opportunities for continued development of sequence-structure co-generation approaches.

Deep generative models that learn from the distribution of natural protein sequences and structures may enable the design of new proteins with valuable functions. While most of today’s models focus on generating either sequences or structures, emerging co-generation methods promise more accurate and controllable protein design, ideally achieved by modeling both modalities simultaneously.

Read the paper

Spotlight: AI-POWERED EXPERIENCE

Microsoft research copilot experience

Discover more about research at Microsoft through our AI-powered experience

Start now Opens in a new tab PODCAST New Series: The AI Revolution in Medicine, Revisited

Two years ago, OpenAI’s GPT-4 kick-started a new era in AI. In the months leading up to its public release, Peter Lee, president of Microsoft Research, cowrote The AI Revolution in Medicine: GPT-4 and Beyond, a book full of optimism for the potential of advanced AI models to transform the world of healthcare. In this special Microsoft Research Podcast series, Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee.

Watch the series PODCAST The future of generative AI for scientific discovery

Most of us think of generative AI in the context of text or image generation, but it’s also a powerful tool for scientific discovery. In this episode of the Leading the Shift podcast (opens in new tab), host Susan Etlinger speaks with Ade Famoti, a senior leader on the Microsoft Research Accelerator team. Ade discusses what he calls “AI’s physics moment,” and why he believes generative AI feels fundamentally different from past platform shifts. Ade shares examples of the work Microsoft Research is doing to uncover the opportunities of generative AI for materials discovery—to improve energy efficiency and carbon capture, and for drug discovery, to fight disease. Ade also highlights the role of culture in building trust, informing priorities and driving adoption of emerging technologies.

VIDEO Microsoft Research’s Chris Bishop talks AI for Science (what it really means)

In this interview, the director of Microsoft Research AI for Science, Chris Bishop, discusses how AI is unlocking new scientific outcomes, from drug creation to materials generation to improved climate modeling.

Microsoft Research | In case you missed it Tech Life – The doctor will see you now

BBC Sounds | March 4, 2025

An update on live trials in Ghana of 3D telemedicine technology, developed by Microsoft Research and external collaborators. Using portable equipment and holoportation technology, patients in remote locations can connect with a doctor many miles away. The BBC speaks to Spencer Fowers, who is the lead engineer on the project, as well as a patient and a doctor benefiting from the program.

Katja Hofmann: Why we're training AI on video games

TED Talk | October 2024

In a recent TED Talk: Why we’re training AI on video games, Microsoft researcher Katja Hofmann discusses the work the Game Intelligence team at Microsoft Research is doing to develop AI that can transform video games. Using AI trained on years of human gameplay data, the team built World and Human Action Model, which can learn to think, play and innovate alongside humans, enabling video game creators to build more robust games. Hoffmann was also interviewed in a related article: Microsoft’s Muse AI Edits Video Games on the Fly.

View more news and awards Opens in a new tab

The post Research Focus: Week of March 24, 2025 appeared first on Microsoft Research.

Categories: Microsoft

Navigation

Geo Tracker

User login

Monthly archive

Microsoft Research

PadChest-GR: A bilingual grounded radiology reporting benchmark for chest X-rays

Learning from other domains to advance AI evaluation and testing

Breaking bonds, breaking ground: Advancing the accuracy of computational chemistry with deep learning

New methods boost reasoning in small and large language models

Rewriting SymCrypt in Rust to modernize Microsoft’s cryptographic library

BenchmarkQED: Automated benchmarking of RAG systems

FrodoKEM: A conservative quantum-safe cryptographic algorithm

Magentic-UI, an experimental human-centered web agent

Predicting and explaining AI model performance: A new approach to evaluation

Research Focus: Week of May 7, 2025

Microsoft Fusion Summit explores how AI can accelerate fusion research

Societal AI: Building human-centered AI systems

Research Focus: Week of April 21, 2025

The Future of AI in Knowledge Work: Tools for Thought at CHI 2025

Engagement, user expertise, and satisfaction: Key insights from the Semantic Telemetry Project

Debug-gym: an environment for AI coding tools to learn how to debug code like programmers

Research Focus: Week of April 7, 2025

VidTok introduces compact, efficient tokenization to enhance AI video processing

Research Focus: Week of March 24, 2025

Search

Twitter

RSS Feedburner

Add This

Dilbert...

SkyDrive

Security