Microsoft Research


VidTok introduces compact, efficient tokenization to enhance AI video processing

Wed, 04/02/2025 - 18:00

Every day, countless videos are uploaded and processed online, putting enormous strain on computational resources. The problem isn’t just the sheer volume of data—it’s how this data is structured. Videos consist of raw pixel data, where neighboring pixels often store nearly identical information. This redundancy wastes resources, making it harder for systems to process visual content effectively and efficiently.

To tackle this, we’ve developed a new approach to compress visual data into a more compact and manageable form. In our paper “VidTok: A Versatile and Open-Source Video Tokenizer,” we introduce a method that converts video data into smaller, structured units, or tokens. This technique provides researchers and developers in visual world modeling—a field dedicated to teaching machines to interpret images and videos—with a flexible and efficient tool for advancing their work. 

How VidTok works

VidTok is a technique that converts raw video footage into a format that AI can easily work with and understand, a process called video tokenization. It transforms complex visual information into compact, structured tokens, as shown in Figure 1.

Figure 1. An overview of how video tokenizers work, which form the basis of VidTok.

By simplifying videos into manageable chunks, VidTok can enable AI systems to learn from, analyze, and generate video content more efficiently. VidTok offers several potential advantages over previous solutions:

Supports both discrete and continuous tokens. Not all AI models use the same “language” for video generation. Some perform best with continuous tokens—ideal for high-quality diffusion models—while others rely on discrete tokens, which are better suited for step-by-step generation, like language models for video. VidTok is a tokenizer that has demonstrated seamless support for both, making it adaptable across a range of AI applications.

Operates in both causal and noncausal modes. In some scenarios, video understanding depends solely on past frames (causal), while in others, it benefits from access to both past and future frames (noncausal). VidTok can accommodate both modes, making it suitable for real-time use cases like robotics and video streaming, as well as for high-quality offline video generation.

Efficient training with high performance. AI-powered video generation typically requires substantial computational resources. VidTok cuts training costs in half through a two-stage training process while maintaining high performance.

Architecture

The VidTok framework builds on a classic 3D encoder-decoder structure but introduces 2D and 1D processing techniques to handle spatial and temporal information more efficiently. Because 3D architectures are computationally intensive, VidTok combines them with less resource-intensive 2D and 1D methods to reduce computational costs while maintaining video quality.

Spatial processing. Rather than treating video frames solely as 3D volumes, VidTok applies 2D convolutions—pattern-recognition operations commonly used in image processing—to handle spatial information within each frame more efficiently.

Temporal processing. To model motion over time, VidTok introduces the AlphaBlender operator, which blends frames smoothly using a learnable parameter. Combined with 1D convolutions—similar operations applied over sequences—this approach captures temporal dynamics without abrupt transitions.
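To make this concrete, here is a brief PyTorch sketch of a learnable alpha-blending step of the kind described above. It illustrates the general mechanism only; the module name, shapes, and skip-path example are our own, not VidTok's actual AlphaBlender implementation.

```python
import torch
import torch.nn as nn

class AlphaBlend(nn.Module):
    """Blend two feature streams with a single learnable weight in (0, 1)."""
    def __init__(self):
        super().__init__()
        self.mix_logit = nn.Parameter(torch.zeros(1))  # learnable blending parameter

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.mix_logit)          # keep the blend weight between 0 and 1
        return alpha * x_a + (1.0 - alpha) * x_b

# Example: blend the output of a temporal 1D convolution with its input (a skip
# path) so that transitions between frames change smoothly during training.
frames = torch.randn(2, 64, 16)                        # (batch, channels, time)
temporal_conv = nn.Conv1d(64, 64, kernel_size=3, padding=1)
blended = AlphaBlend()(temporal_conv(frames), frames)
print(blended.shape)                                   # torch.Size([2, 64, 16])
```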

Figure 2 illustrates VidTok’s architecture in detail.

Figure 2. VidTok’s architecture. It uses a combination of 2D and 1D operations instead of solely relying on 3D techniques, improving efficiency. For smooth frame transitions, VidTok employs the AlphaBlender operator in its temporal processing modules. This approach strikes a balance between computational speed and high-quality video output.

Quantization

To efficiently compress video data, AI systems often use quantization to reduce the amount of information that needs to be stored or transmitted. A traditional method for doing this is vector quantization (VQ), which groups values together and matches them to a fixed set of patterns (known as a codebook). However, this can lead to an inefficient use of patterns and lower video quality.

For VidTok, we use an approach called finite scalar quantization (FSQ). Instead of grouping values, FSQ treats each value separately. This makes the compression process more flexible and accurate, helping preserve video quality while keeping the file size small. Figure 3 shows the difference between the VQ and FSQ approaches.

Figure 3. VQ (left) relies on learning a codebook, while FSQ (right) simplifies the process by quantizing each value independently to a fixed set of levels, making optimization easier. VidTok adopts FSQ to enhance training stability and reconstruction quality.
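For intuition, here is a hedged PyTorch sketch of the FSQ idea: each latent value is bounded and then rounded independently to a small set of fixed levels, with a straight-through estimator keeping the step differentiable. The level count and tensor shapes are illustrative rather than VidTok's exact configuration.

```python
import torch

def fsq(z: torch.Tensor, levels: int = 5) -> torch.Tensor:
    """Quantize each value of z independently to `levels` evenly spaced values in [-1, 1].

    An odd number of levels keeps this sketch simple; the FSQ formulation
    handles even level counts with a small offset.
    """
    half = (levels - 1) / 2.0
    bounded = torch.tanh(z)                            # squash each scalar into (-1, 1)
    quantized = torch.round(bounded * half) / half     # snap to one of `levels` fixed values
    # Straight-through estimator: quantized values forward, tanh gradients backward.
    return bounded + (quantized - bounded).detach()

z = torch.randn(2, 4, requires_grad=True)              # e.g., 4 latent channels per token
tokens = fsq(z)
print(tokens)                                          # every entry is one of 5 fixed levels
```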

Training

Training video tokenizers requires significant computing power. VidTok uses a two-stage process:

  1. It first trains the full model on low-resolution videos.
  2. Then, it fine-tunes only the decoder using high-resolution videos.

This approach cuts training costs in half—from 3,072 to 1,536 GPU hours—while maintaining video quality. Older tokenizers, trained on full-resolution videos from the start, were slower and more computationally intensive. 
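The sketch below illustrates the second stage under assumed submodule names (encoder, decoder): everything except the decoder is frozen before fine-tuning on high-resolution clips. It is a simplified stand-in, not VidTok's training code.

```python
import torch
import torch.nn as nn

# Stand-in tokenizer with the assumed submodules; VidTok's real encoder and
# decoder are much larger 2D+1D networks.
tokenizer = nn.ModuleDict({
    "encoder": nn.Conv3d(3, 8, kernel_size=3, padding=1),
    "decoder": nn.Conv3d(8, 3, kernel_size=3, padding=1),
})

# Stage 2: freeze the encoder so only the decoder receives gradient updates.
for p in tokenizer["encoder"].parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(tokenizer["decoder"].parameters(), lr=1e-4)

video = torch.randn(1, 3, 8, 128, 128)                 # (batch, channels, frames, height, width)
recon = tokenizer["decoder"](tokenizer["encoder"](video))
loss = nn.functional.mse_loss(recon, video)            # simplified reconstruction loss
loss.backward()                                        # gradients flow only into the decoder
optimizer.step()
```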

VidTok’s method allows the model to quickly adapt to new types of videos without affecting its token distribution. Additionally, it trains on lower-frame-rate data to better capture motion, improving how it represents movement in videos.

Evaluating VidTok

VidTok’s performance evaluation using the MCL-JCV benchmark—a comprehensive video quality assessment dataset—and an internal dataset demonstrates its superiority over existing state-of-the-art models in video tokenization. The assessment, which covered approximately 5,000 videos of various types, employed four standard metrics to measure video quality:

  1. Peak Signal-to-Noise Ratio (PSNR)
  2. Structural Similarity Index Measure (SSIM)
  3. Learned Perceptual Image Patch Similarity (LPIPS)
  4. Fréchet Video Distance (FVD)
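For reference, PSNR, the first of these metrics, is a simple function of the mean squared error between the original and reconstructed frames. The short sketch below shows the standard definition for 8-bit pixel values.

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, peak: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in decibels; higher means a closer reconstruction."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

frame = np.random.randint(0, 256, (64, 64, 3))
noisy = np.clip(frame + np.random.randint(-5, 6, frame.shape), 0, 255)
print(f"PSNR: {psnr(frame, noisy):.1f} dB")
```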

The following table and Figure 4 illustrate VidTok’s performance:

Table 1

The results indicate that VidTok outperforms existing models in both discrete and continuous tokenization scenarios. This improved performance is achieved even when using a smaller model or a more compact set of reference patterns, highlighting VidTok’s efficiency.

Figure 4. Quantitative comparison of discrete and continuous tokenization performance in VidTok and state-of-the-art methods, evaluated using four metrics: PSNR, SSIM, LPIPS, and FVD. Larger chart areas indicate better overall performance.

Looking ahead

VidTok represents a significant development in video tokenization and processing. Its innovative architecture and training approach enable improved performance across various video quality metrics, making it a valuable tool for video analysis and compression tasks. Its capacity to model complex visual dynamics could improve the efficiency of video systems by enabling AI processing on more compact units rather than raw pixels.

VidTok serves as a promising foundation for further research in video processing and representation. The code for VidTok is available on GitHub (opens in new tab), and we invite the research community to build on this work and help advance the broader field of video modeling and generation.


Research Focus: Week of March 24, 2025

Wed, 03/26/2025 - 18:00

In this issue:

We examine a new conversation segmentation method that delivers more coherent and personalized agent conversations, and we review efforts to improve MLLMs’ understanding of geologic maps. Check out the latest research and other updates.

NEW RESEARCH

SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents

Researchers from Microsoft and Tsinghua University propose a new method to help conversational AI agents deliver more coherent and personalized responses during complex long-term dialogue.

Large language models (LLMs) are widely used to enable more complicated discussions across a broader range of topics than traditional dialogue systems. However, managing excessively long context that contains irrelevant information is a major challenge. Existing solutions typically perform retrieval-augmented response generation by constructing memory banks from conversation history at the turn level, at the session level, or through summarization.

The proposed new approach, SeCom, constructs the memory bank at segment level by introducing a conversation Segmentation model that partitions long-term conversations into topically coherent segments, while applying Compression based denoising on memory units to enhance memory retrieval. Experimental results show that SeCom exhibits a significant performance advantage over baselines on long-term conversation benchmarks LOCOMO and Long-MT-Bench+. Additionally, the proposed conversation segmentation method demonstrates superior performance on dialogue segmentation datasets such as DialSeg711, TIAGE, and SuperDialSeg. 

Read the paper

NEW RESEARCH

PEACE: Empowering Geologic Map Holistic Understanding with MLLMs

Microsoft researchers and external colleagues introduce GeoMap-Agent, an AI system designed specifically for geologic map understanding and analysis. They measure its effectiveness using GeoMap-Bench, a new benchmark for evaluating multimodal large language models (MLLMs) on geologic map understanding. Geologic maps provide critical insights into the structure and composition of Earth’s surface and subsurface. They are indispensable in fields including disaster detection, resource exploration, and civil engineering.

Current MLLMs often fall short in understanding geologic maps, largely due to the challenging nature of cartographic generalization, which involves handling high-resolution maps, managing multiple associated components, and requiring domain-specific knowledge.

This paper presents results of experiments in which GeoMap-Agent achieves an overall score of 0.811 on GeoMap-Bench, significantly outperforming the 0.369 score of GPT-4o. The researchers intend to enable advanced AI applications in geology, powering more efficient and accurate geological investigations.

Read the paper

NEW RESEARCH

The future of the industrial AI edge is cellular

Reliable, high-bandwidth wireless connectivity and local processing at the edge are crucial enablers for emerging industrial AI applications. This work proposes that cellular networking is the ideal connectivity solution for these applications, due to its virtualization and support for open APIs. The researchers project the emergence of a converged industrial AI edge encompassing both computing and connectivity, in which application developers leverage the API to implement advanced functionalities. They present a case study showing evidence of the effectiveness of this approach, evaluated on an enterprise-grade 5G testbed.

Read the paper

NEW RESEARCH

RE#: High Performance Derivative-Based Regex Matching with Intersection, Complement, and Restricted Lookarounds

A regular expression (regex or RE) is a sequence of characters used to match, search, and manipulate strings in text based on specific criteria. REs are used in programming languages for data validation, text parsing, and search operations.

This paper presents a tool and theory built on symbolic derivatives that does not use backtracking, while supporting both classical operators and complement, intersection, and restricted lookarounds. The researchers show that the main matching algorithm has input-linear complexity both in theory and in practice. A thorough evaluation on popular benchmarks shows that RE# is over 71% faster than the next fastest regex engine in Rust on the baseline, and outperforms all state-of-the-art engines on extensions of the benchmarks, often by several orders of magnitude.
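To give a flavor of the underlying idea, here is a minimal Python sketch of derivative-based matching over plain characters, including intersection and complement. It illustrates the classic Brzozowski-derivative construction that RE# builds on symbolically; RE# itself operates over symbolic character classes and adds restricted lookarounds.

```python
from dataclasses import dataclass

class Re: pass

@dataclass(frozen=True)
class Empty(Re): pass                 # matches nothing

@dataclass(frozen=True)
class Eps(Re): pass                   # matches only the empty string

@dataclass(frozen=True)
class Lit(Re): ch: str

@dataclass(frozen=True)
class Cat(Re): left: Re; right: Re

@dataclass(frozen=True)
class Alt(Re): left: Re; right: Re

@dataclass(frozen=True)
class And(Re): left: Re; right: Re    # intersection

@dataclass(frozen=True)
class Not(Re): body: Re               # complement

@dataclass(frozen=True)
class Star(Re): body: Re

def nullable(r: Re) -> bool:
    """True if r accepts the empty string."""
    match r:
        case Eps() | Star(): return True
        case Empty() | Lit(): return False
        case Cat(l, x) | And(l, x): return nullable(l) and nullable(x)
        case Alt(l, x): return nullable(l) or nullable(x)
        case Not(b): return not nullable(b)

def deriv(r: Re, c: str) -> Re:
    """Brzozowski derivative: the residual regex after consuming character c."""
    match r:
        case Empty() | Eps(): return Empty()
        case Lit(ch): return Eps() if ch == c else Empty()
        case Cat(l, x):
            d = Cat(deriv(l, c), x)
            return Alt(d, deriv(x, c)) if nullable(l) else d
        case Alt(l, x): return Alt(deriv(l, c), deriv(x, c))
        case And(l, x): return And(deriv(l, c), deriv(x, c))
        case Not(b): return Not(deriv(b, c))
        case Star(b): return Cat(deriv(b, c), r)

def matches(r: Re, s: str) -> bool:
    for c in s:                       # one derivative per input character: input-linear
        r = deriv(r, c)
    return nullable(r)

# Example: strings over {a, b} that contain "ab" but do NOT contain "ba".
def contains(sub: Re) -> Re:
    anything = Star(Alt(Lit("a"), Lit("b")))
    return Cat(anything, Cat(sub, anything))

pattern = And(contains(Cat(Lit("a"), Lit("b"))), Not(contains(Cat(Lit("b"), Lit("a")))))
print(matches(pattern, "aab"))        # True
print(matches(pattern, "aba"))        # False
```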

This work could potentially enable new applications in LLM prompt engineering frameworks, new applications in medical research and bioinformatics, and new opportunities in access and resource policy language design by web service providers.

Read the paper

NEW RESEARCH

Toward deep learning sequence–structure co-generation for protein design

Researchers review recent advances in deep generative models for protein design, with a focus on sequence-structure co-generation methods. They describe the key methodological and evaluation principles underlying these methods, highlight recent advances from the literature, and discuss opportunities for continued development of sequence-structure co-generation approaches.

Deep generative models that learn from the distribution of natural protein sequences and structures may enable the design of new proteins with valuable functions. While most of today’s models focus on generating either sequences or structures, emerging co-generation methods promise more accurate and controllable protein design, ideally achieved by modeling both modalities simultaneously. 

Read the paper

PODCAST

New Series: The AI Revolution in Medicine, Revisited

Two years ago, OpenAI’s GPT-4 kick-started a new era in AI. In the months leading up to its public release, Peter Lee, president of Microsoft Research, cowrote The AI Revolution in Medicine: GPT-4 and Beyond, a book full of optimism for the potential of advanced AI models to transform the world of healthcare. In this special Microsoft Research Podcast series, Lee revisits the book, exploring how patients, providers, and other medical professionals are experiencing and using generative AI today while examining what he and his coauthors got right—and what they didn’t foresee.

Watch the series

PODCAST

The future of generative AI for scientific discovery

Most of us think of generative AI in the context of text or image generation, but it’s also a powerful tool for scientific discovery. In this episode of the Leading the Shift podcast (opens in new tab), host Susan Etlinger speaks with Ade Famoti, a senior leader on the Microsoft Research Accelerator team. Ade discusses what he calls “AI’s physics moment,” and why he believes generative AI feels fundamentally different from past platform shifts. Ade shares examples of the work Microsoft Research is doing to uncover the opportunities of generative AI for materials discovery—to improve energy efficiency and carbon capture, and for drug discovery, to fight disease. Ade also highlights the role of culture in building trust, informing priorities and driving adoption of emerging technologies.

VIDEO

Microsoft Research’s Chris Bishop talks AI for Science (what it really means)

In this interview, the director of Microsoft Research AI for Science, Chris Bishop, discusses how AI is unlocking new scientific outcomes, from drug creation to materials generation to improved climate modeling.

Microsoft Research | In case you missed it

Tech Life – The doctor will see you now

BBC Sounds | March 4, 2025

An update on live trials in Ghana of 3D telemedicine technology, developed by Microsoft Research and external collaborators. Using portable equipment and holoportation technology, patients in remote locations can connect with a doctor many miles away. The BBC speaks to Spencer Fowers, who is the lead engineer on the project, as well as a patient and a doctor benefiting from the program.

Katja Hofmann: Why we're training AI on video games 

TED Talk | October 2024

In a recent TED Talk: Why we’re training AI on video games, Microsoft researcher Katja Hofmann discusses the work the Game Intelligence team at Microsoft Research is doing to develop AI that can transform video games. Using AI trained on years of human gameplay data, the team built the World and Human Action Model, which can learn to think, play and innovate alongside humans, enabling video game creators to build more robust games. Hofmann was also interviewed in a related article: Microsoft’s Muse AI Edits Video Games on the Fly.


Metasurface: Unlocking the future of wireless sensing and communication

Wed, 03/19/2025 - 18:00

As the demand for faster, more reliable wireless communication continues to grow, traditional systems face limitations in efficiency and adaptability. To keep up with evolving needs, researchers are investigating new ways to manipulate electromagnetic waves to improve wireless performance. 

To address these challenges, researchers are exploring new approaches, including metasurfaces—engineered materials that can control wave propagation in unprecedented ways. By dynamically shaping and directing electromagnetic waves, metasurfaces offer a promising path to overcoming the constraints of conventional wireless systems. 

Building on these capabilities, we are developing metasurfaces for a wide range of wireless applications, such as enhancing Low Earth Orbit satellite communication, optimizing acoustic sensing, and enabling acoustic and millimeter-wave technologies for 5G and 6G communication systems with commercial devices. More recently, our work has focused on enabling indoor access to the Global Navigation Satellite System (GNSS), improving millimeter-wave coverage in targeted environments, optimizing heat distribution in microwave ovens, and providing directional sound projection without headphones.

These advances, published at leading networking conferences—including MobiCom 2023 and 2024, MobiSys 2024 and 2025, and NSDI 2023—highlight metasurfaces’ potential in wireless communication and sensing. This post explores some of these applications in more detail. 

Metasurfaces optimize GNSS for accurate indoor positioning

While GNSS is widely used for outdoor positioning and navigation, its indoor performance is often hindered by signal blockage, reflection, and attenuation caused by physical obstacles. Additional technologies like Wi-Fi and Bluetooth Low Energy (BLE) are often employed to address these issues. However, these solutions require extra infrastructure, are costly, and are complicated to deploy. Accurate positioning also typically depends on specialized hardware and software on mobile devices. 

Despite these challenges, GNSS signals hold promise for accurate indoor positioning. By leveraging the vast number of available satellites, GNSS-based solutions eliminate the need for base station deployment and maintenance required by Wi-Fi and BLE systems. This approach also allows seamless integration between indoor and outdoor environments, supporting continuous positioning in scenarios like guiding smart vehicles through indoor and outdoor industrial environments. 

To explore this potential, we conducted indoor measurements and found that GNSS satellite signals can penetrate windows at different angles and reflect or diffract from surfaces like floors and ceilings, resulting in uneven signals. Metasurfaces can control structured arrays of electromagnetic signals, allowing them to capture and redirect more GNSS signals. This allows signals to enter buildings in a path parallel to the ground, achieving broader coverage. Using this capability, we developed a GNSS positioning metasurface system (GPMS) based on passive metasurface technology.

One limitation of passive metasurfaces is their lack of programmability. To overcome this and enable them to effectively guide signals from different angles and scatter them in parallel, we designed a two-layer metasurface system. As shown in Figure 1, this design ensures that electromagnetic waves from different angles follow similar emission trajectories.  

Figure 1: The GPMS two-layer metasurface structure

To improve positioning accuracy, we developed new algorithms that allow signals to pass through metasurfaces, using them as anchor points. Traditional GPS positioning requires signals from at least four satellites to decode location information. In the GPMS system, illustrated in Figure 2, each deployed metasurface functions as a virtual satellite. By deploying at least three metasurfaces indoors, we achieved high-precision positioning through a triangulation algorithm.

Figure 2. Diagram of the GPMS system. Passive metasurfaces guide GNSS signals indoors, while enhanced positioning algorithms provide precise indoor positioning on mobile devices. 
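As a simplified illustration of the positioning step, the sketch below recovers a 2D position from three anchors with known positions and measured ranges using Gauss-Newton least squares. The coordinates and ranges are made-up numbers; in GPMS, each passive metasurface plays the role of such an anchor and the algorithms operate on GNSS pseudoranges.

```python
import numpy as np

anchors = np.array([[0.0, 0.0],     # hypothetical anchor (metasurface) positions, meters
                    [10.0, 0.0],
                    [5.0, 8.0]])
ranges = np.array([6.4, 6.4, 4.0])  # hypothetical measured distances, meters

x = anchors.mean(axis=0)            # initial guess: centroid of the anchors
for _ in range(20):                 # Gauss-Newton iterations
    diffs = x - anchors                         # (3, 2)
    dists = np.linalg.norm(diffs, axis=1)       # predicted ranges
    residuals = dists - ranges
    J = diffs / dists[:, None]                  # Jacobian of each range w.r.t. position
    step, *_ = np.linalg.lstsq(J, residuals, rcond=None)
    x = x - step

print("estimated position:", x)      # converges near (5, 4) for these numbers
```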

To evaluate the system, we deployed the GPMS with six metasurfaces on a 10×50-meter office floor and in a 15×20-meter conference hall. The results show significant improvements in signal quality and availability. C/N₀, the carrier-to-noise-density ratio and a measure of signal quality, increased from 9.1 dB-Hz to 32.2 dB-Hz. The number of visible satellites increased from 3.6 to 21.5. Finally, the absolute positioning error decreased from 30.6 meters to 3.2 meters in the office and from 11.2 meters to 2.7 meters in the conference hall. These findings are promising and highlight the feasibility and advantages of GNSS-based metasurfaces for indoor positioning. 

Metasurfaces extend millimeter-wave coverage

Millimeter waves enable the high-speed, low-latency performance needed for 5G and 6G communication systems. While commercial products like 60 GHz Wi-Fi routers and mobile devices are becoming popular, their limited coverage and susceptibility to signal obstruction restrict their widespread application. 

Traditional solutions include deploying multiple millimeter-wave access points, such as routers or base stations, or placing reflective metal panels in room corners to reflect electromagnetic waves. However, these approaches are both costly and offer limited performance. Metasurfaces offer a promising alternative for improving millimeter-wave applications. Previous research has shown that programmable metasurfaces can enhance signal coverage in blind spots and significantly improve signal quality and efficiency.  

To maximize the benefits of metasurfaces, we developed the AutoMS automation service framework, shown in Figure 3. This proposed framework can optimize millimeter-wave coverage using low-cost passive metasurface design and strategic placement. 

The three main components of AutoMS can address the limitations of traditional solutions: 

  1. Automated joint optimization: AutoMS determines the optimal network deployment configuration by analyzing phase settings, metasurface placement, and access point positioning. It also refines beam-forming configurations to enhance signal coverage. By iteratively identifying and optimizing the number, size, and placement of metasurfaces, AutoMS adjusts the metasurface phase settings and the access point’s configurations to achieve optimal signal coverage. 
Figure 3. The AutoMS framework generates optimized deployment plans for passive metasurface and access points based on environment scanning results. 
  2. Fast 3D ray tracing simulator: Using hardware and software acceleration, our simulator efficiently calculates channel matrices resulting from metasurfaces with tens of thousands of elements. This simulator, capable of tracing 1.3 billion rays in just three minutes on an A100 GPU, significantly accelerates calculations for complex environments.
  3. Low-cost passive metasurface design: We designed a high-reflectivity passive metasurface with near-2π phase control and broadband compatibility for the millimeter-wave frequency band. This metasurface is compatible with low-precision, cost-effective thermoforming processes. This process enables users to create metasurfaces at minimal cost, significantly reducing deployment expenses.

    Shown in Figure 4, users can capture the environment using existing 3D scanning apps on mobile devices, generate a 3D layout model, and upload it to the cloud. AutoMS then generates metasurface settings and placement guidelines.  

    Users can print metasurface patterns using hot stamping and customize them without affecting functionality, as millimeter waves penetrate paint and paper. 
Figure 4: The low-cost passive metasurface creation process 

Evaluation using publicly available 3D layout datasets and real-world tests shows that AutoMS significantly improves millimeter-wave coverage across various scenarios. Compared to a single router setup, AutoMS increased signal strength by 12.1 dB. Onsite tests further confirmed gains of 11 dB in target areas and over 20 dB in blind spots, with signal throughput increasing from 77 Mbps to 373 Mbps. AutoMS adapts to diverse environments, ensuring reliable and flexible deployment in real-world applications. 

Metasurfaces support uniform heating in microwave ovens 

Microwave ovens often heat unevenly, creating cold spots in food. These can allow harmful bacteria and other pathogens to survive, increasing the risk of foodborne illnesses. Uneven heating can cause eggs to burst or create “hot spots” that can scald.

Uneven heating is due to the appliance’s heating mechanism. Microwave ovens generate high-power radio frequency (RF) electromagnetic waves that heat food through dielectric heating. Inside the oven, these waves form a standing-wave pattern with nodes of zero amplitude, where little heating occurs, and antinodes, where heating occurs more rapidly.
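A quick back-of-the-envelope calculation (ours, not from the paper) shows why these hot and cold spots sit only centimeters apart: household ovens operate near 2.45 GHz, so standing-wave nodes repeat roughly every half wavelength.

```python
c = 3.0e8              # speed of light, m/s
f = 2.45e9             # typical magnetron frequency, Hz
wavelength = c / f     # about 0.122 m
node_spacing = wavelength / 2
print(f"wavelength ≈ {wavelength * 100:.1f} cm, node spacing ≈ {node_spacing * 100:.1f} cm")
# wavelength ≈ 12.2 cm, node spacing ≈ 6.1 cm
```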

To address this issue, we developed MicroSurf, a low-cost solution that improves heating by using passive metasurfaces to control electromagnetic energy inside the microwave oven. It uses the resonance effect between the metasurface and electromagnetic waves to modify the standing-wave distribution and achieve more uniform heating. This is shown in Figure 5. 

Figure 5: MicroSurf’s working principle: A. Uneven electric field distribution inside the microwave oven leads to uneven heating. B. Modeling the microwave oven. C. Designing and optimizing a metasurface that can function in a high-power environment to change the standing wave distribution. D. Achieving uniform heating of different foods and selectively heating specific parts. 

Tests across four different microwave oven brands demonstrate that MicroSurf effectively optimizes heating for various liquids and solids, uniformly heating water, milk, bread, and meat. It concentrates heat on specific areas and adapts to differently shaped foods. MicroSurf offers a promising solution for even heating in microwave ovens, demonstrating the potential of metasurface technology in everyday applications. This innovation paves the way for smarter, more efficient home appliances.  

Advancing wireless innovation

Wireless sensing and communication technologies are evolving rapidly, driving innovation across a wide range of applications. We are continuing to push the boundaries of these technologies—particularly in metasurface development—while working to create practical solutions for a variety of use cases. 


Claimify: Extracting high-quality claims from language model outputs

Wed, 03/19/2025 - 18:00

While large language models (LLMs) are capable of synthesizing vast amounts of information, they sometimes produce inaccurate or unsubstantiated content. To mitigate this risk, tools like Azure AI’s Groundedness Detection (opens in new tab) can be used to verify LLM outputs. 

A common strategy for fact-checking LLM-generated texts – especially complex, highly detailed outputs – is claim extraction: instead of evaluating the entire text at once, it’s broken down into simple factual statements that can be verified independently. However, the effectiveness of this approach depends on the quality of the extracted claims: if they are inaccurate or incomplete, the fact-checking results may be compromised. 

Our paper “Towards Effective Extraction and Evaluation of Factual Claims” proposes a framework for evaluating claim extraction and introduces Claimify, a novel LLM-based method that outperforms prior solutions. In this blog post, we discuss common claim extraction issues and explore Claimify’s approach.

Key issues in claim extraction

As a case study, we prompted OpenAI’s GPT-4o model to “provide an overview of challenges in emerging markets” based on a collection of news articles. Here is its output:

# Emerging market challenges
Emerging markets face a myriad of challenges that span economic, social, and environmental dimensions. These challenges are often interconnected, exacerbating the difficulties faced by these regions.

## Economic instability and inflation
Several emerging markets are grappling with severe economic instability. For instance, Argentina’s rampant inflation, with monthly rates reaching as high as 25.5%, has made many goods unobtainable and plunged the value of the currency, causing severe economic hardship. Some experts estimate that the annual inflation rate could potentially double to 300%, while others predict even higher rates.

## Food insecurity and agricultural challenges
Food insecurity is another critical issue in emerging markets. Nigeria, for example, is striving to become self-sufficient in wheat production but is hindered by climate change and violence, exacerbated by high grain prices due to the suspension of the Black Sea Grain Initiative. Climate change has played a pivotal role in creating food insecurity and economic instability in farming-dependent economies, such as Zambia and Mozambique.

## Natural disasters and public health emergencies
Public health crises, as demonstrated by the experience of Derna, Libya, remain a significant concern in emerging markets. Catastrophic flooding caused by the collapse of two dams has led to significant destruction, complicating access to healthcare. The U.N. found that the resulting contaminated water caused many residents to fall ill, highlighting the need for improved water management. Countries like Afghanistan and Sudan have experienced similar challenges to those of Libya.

## Conclusion
Addressing emerging market challenges will require comprehensive strategies that consider economic stability, food security, and public health. International cooperation and targeted interventions are crucial in fostering sustainable development in emerging markets.

We then applied a simple prompt[1] that extracted the following claims from the answer:

1. Emerging markets face economic challenges.
2. Emerging markets face social challenges.
3. Emerging markets face environmental challenges.
4. Argentina experiences severe economic instability.
5. Argentina has a monthly inflation rate as high as 25.5%.
6. Argentina’s inflation has made many goods unobtainable.
7. Argentina’s currency value has plunged.
8. Some experts estimate Argentina’s annual inflation rate could reach 300%.
9. Food insecurity is a critical issue in emerging markets.
10. Nigeria is striving to become self-sufficient in wheat production.
11. Nigeria’s wheat production is hindered by climate change.
12. Nigeria’s wheat production is hindered by violence.
13. High grain prices affect Nigeria due to the suspension of the Black Sea Grain Initiative.
14. Climate change creates food insecurity in Zambia.
15. Climate change creates food insecurity in Mozambique.
16. Climate change causes economic instability in farming-dependent economies.
17. Public health crises are a concern in emerging markets.
18. Derna, Libya, experienced catastrophic flooding.
19. The collapse of two dams caused flooding in Derna, Libya.
20. Flooding in Derna, Libya, complicated access to healthcare.
21. The U.N. found contaminated water in Derna, Libya.
22. Contaminated water in Derna, Libya, caused many residents to fall ill.
23. Afghanistan has experienced challenges similar to those in Libya.
24. Sudan has experienced challenges similar to those in Libya.
25. Addressing emerging market challenges requires comprehensive strategies.
26. Comprehensive strategies should consider economic stability.
27. Comprehensive strategies should consider food security.
28. Comprehensive strategies should consider public health.
29. International cooperation is crucial for sustainable development in emerging markets.
30. Targeted interventions are crucial for sustainable development in emerging markets.

These claims appear to reflect the answer, but a closer look reveals four critical issues:

Issue 1: Some claims are not factual statements that can be verified as true or false.

Excerpt: Addressing emerging market challenges will require comprehensive strategies that consider economic stability, food security, and public health.

Claims:
  • Addressing emerging market challenges requires comprehensive strategies.
  • Comprehensive strategies should consider economic stability.
  • Comprehensive strategies should consider food security.
  • Comprehensive strategies should consider public health.
Explanation: These claims are not verifiable because they are opinions.

Issue 2: Some claims are missing or incomplete.

Excerpt: Argentina’s rampant inflation, with monthly rates reaching as high as 25.5%, has made many goods unobtainable and plunged the value of the currency, causing severe economic hardship. Some experts estimate that the annual inflation rate could potentially double to 300%, while others predict even higher rates.

Claims:
  • Argentina has a monthly inflation rate as high as 25.5%.
  • Argentina’s inflation has made many goods unobtainable.
  • Argentina’s currency value has plunged.
  • Some experts estimate Argentina’s annual inflation rate could reach 300%.
Explanation: The phrases “causing severe economic hardship” and “others predict even higher rates” are not reflected in any of the claims. The third claim also omits the fact that inflation caused the currency depreciation.

Issue 3: Some claims are inaccurate.

Excerpt: The U.N. found that the resulting contaminated water caused many residents to fall ill, highlighting the need for improved water management.

Claims:
  • The U.N. found contaminated water in Derna, Libya.
  • Contaminated water in Derna, Libya, caused many residents to fall ill.
Explanation: The first claim is inaccurate because the U.N. found the link between contaminated water and illness, not the contaminated water itself. The second claim also misrepresents the sentence since it shifts the meaning from a viewpoint of a specific entity (the U.N.) to a general assertion about the effects of contaminated water in Derna, Libya.

Issue 4: Some claims cannot be understood without additional context.

Excerpt: Countries like Afghanistan and Sudan have experienced similar challenges to those of Libya.

Claims:
  • Afghanistan has experienced challenges similar to those in Libya.
  • Sudan has experienced challenges similar to those in Libya.
Explanation: These claims cannot be understood on their own because “those” is not defined.

Introducing Claimify

The case study highlights that claim extraction is surprisingly error-prone. Our paper demonstrates that the issues identified above are common across LLM-based claim extraction methods. To minimize these errors, we created a system called Claimify[2].

Core principles

Claimify is an LLM-based claim extraction system built on the following principles:

Principle 1: The claims should capture all verifiable content in the source text and exclude unverifiable content.

Example: In the sentence “The partnership between John and Jane illustrates the importance of collaboration,” the only verifiable content is the existence of a partnership between John and Jane. The rest is subjective interpretation.

Principle 2: Each claim should be entailed (i.e., fully supported) by the source text.

Example: Consider the sentence “Governments are curtailing emissions from cars and trucks, which are the largest source of greenhouse gases from transportation.” The following claims are incorrect:

  • Cars are the largest source of greenhouse gases from transportation.
  • Trucks are the largest source of greenhouse gases from transportation.
The sentence attributes the highest emissions to cars and trucks collectively, not individually.

Principle 3: Each claim should be understandable on its own, without additional context.

Example: The claim “They will update the policy next year” is not understandable on its own because it’s unclear what “They,” “the policy,” and “next year” refer to.

Principle 4: Each claim should minimize the risk of excluding critical context.

Example: Suppose the claim “The World Trade Organization has supported trade barriers” was extracted from the sentence “An exception to the World Trade Organization’s open-market philosophy is its history of supporting trade barriers when member countries have failed to comply with their obligations.” A fact-checking system would likely classify the claim as false, since there is extensive evidence that the WTO aims to reduce trade barriers. However, if the claim had specified that the WTO has supported trade barriers “when member countries have failed to comply with their obligations,” it would likely have been classified as true. This example demonstrates that missing context can distort the fact-checking verdict.

Principle 5: The system should flag cases where ambiguity cannot be resolved.

Example: The sentence “AI has advanced renewable energy and sustainable agriculture at Company A and Company B” has two mutually exclusive interpretations:

  • AI has advanced renewable energy and sustainable agriculture at both Company A and Company B.
  • AI has advanced renewable energy at Company A and sustainable agriculture at Company B.
If the context does not clearly indicate that one of these interpretations is correct, the system should flag the ambiguity instead of picking one interpretation arbitrarily.

Implementation

Claimify accepts a question-answer pair as input and performs claim extraction in four stages, illustrated in Figure 1:

  1. Sentence splitting and context creation: The answer is split into sentences, with “context” – a configurable combination of surrounding sentences and metadata (e.g., the header hierarchy in a Markdown-style answer) – created for each sentence.
  2. Selection: An LLM identifies sentences that do not contain verifiable content. These sentences are labeled “No verifiable claims” and excluded from subsequent stages. When sentences contain verifiable and unverifiable components, the LLM rewrites the sentence, retaining only the verifiable components.
  3. Disambiguation: For sentences that passed the Selection stage, an LLM detects ambiguity and determines if it can be resolved using the context. If all ambiguity is resolvable, the LLM returns a disambiguated version of the sentence. Otherwise, the sentence is labeled “Cannot be disambiguated” and excluded from the Decomposition stage.
  4. Decomposition: For sentences that are unambiguous or were disambiguated, an LLM creates standalone claims that preserve critical context. If no claims are extracted, the sentence is labeled “No verifiable claims.”

Figure 1: Overview of Claimify’s stages
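For intuition, here is a hedged Python sketch of that four-stage flow. The sentence splitting, prompts, labels, and call_llm helper are simplified placeholders of our own, not Claimify's actual prompts or implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SentenceResult:
    sentence: str
    label: str                      # "Claims", "No verifiable claims", or "Cannot be disambiguated"
    claims: list[str] = field(default_factory=list)

def call_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM such as GPT-4o."""
    raise NotImplementedError

def extract_claims(question: str, answer: str, window: int = 2) -> list[SentenceResult]:
    sentences = answer.split(". ")                      # Stage 1: sentence splitting (simplified)
    results = []
    for i, sentence in enumerate(sentences):
        context = " ".join(sentences[max(0, i - window): i + window + 1])  # surrounding sentences

        # Stage 2: Selection - keep only verifiable content, or drop the sentence.
        selected = call_llm(f"Rewrite, keeping only verifiable content, or reply NONE:\n{sentence}")
        if selected.strip() == "NONE":
            results.append(SentenceResult(sentence, "No verifiable claims"))
            continue

        # Stage 3: Disambiguation - resolve ambiguity from context, or flag the sentence.
        resolved = call_llm(
            f"Context: {context}\nResolve all ambiguity in: {selected}\n"
            "If any ambiguity cannot be resolved, reply CANNOT_DISAMBIGUATE."
        )
        if resolved.strip() == "CANNOT_DISAMBIGUATE":
            results.append(SentenceResult(sentence, "Cannot be disambiguated"))
            continue

        # Stage 4: Decomposition - emit standalone claims that preserve critical context.
        decomposed = call_llm(f"Question: {question}\nDecompose into standalone claims, one per line:\n{resolved}")
        claims = [line.strip() for line in decomposed.splitlines() if line.strip()]
        results.append(SentenceResult(sentence, "Claims" if claims else "No verifiable claims", claims))
    return results
```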

Results

In our paper, we demonstrate that Claimify outperforms existing LLM-based methods[3]. Specifically, we show that: (1) 99% of claims extracted by Claimify are entailed by their source sentence, (2) Claimify strikes the best balance between including verifiable content and excluding unverifiable content, and (3) Claimify is least likely to omit context critical to the fact-checking verdict.

For the above case study on challenges in emerging markets, here are Claimify’s outputs, with source sentences preceded by a letter and claims numbered[4]:

A. Several emerging markets are grappling with severe economic instability.
1. Several emerging markets are grappling with severe economic instability.

B. For instance, Argentina’s rampant inflation, with monthly rates reaching as high as 25.5%, has made many goods unobtainable and plunged the value of the currency, causing severe economic hardship.
1. Argentina has rampant inflation.
2. The monthly inflation rates in Argentina have reached as high as 25.5%.
3. Inflation has made many goods unobtainable in Argentina.
4. Inflation has plunged the value of the currency in Argentina.
5. Inflation has caused severe economic hardship in Argentina.

C. Some experts estimate that the annual inflation rate could potentially double to 300%, while others predict even higher rates.
1. Some experts estimate that Argentina’s annual inflation rate could double to 300% in the future.
2. Some experts predict that Argentina’s annual inflation rate could be higher than 300% in the future.

D. Nigeria, for example, is striving to become self-sufficient in wheat production but is hindered by climate change and violence, exacerbated by high grain prices due to the suspension of the Black Sea Grain Initiative.
1. Nigeria is striving to become self-sufficient in wheat production.
2. Nigeria is hindered by climate change in becoming self-sufficient in wheat production.
3. Nigeria is hindered by violence in becoming self-sufficient in wheat production.
4. High grain prices exacerbate the hindrance to Nigeria’s efforts to become self-sufficient in wheat production.
5. The suspension of the Black Sea Grain Initiative is a reason for high grain prices.

E. Climate change has played a pivotal role in creating food insecurity and economic instability in farming-dependent economies, such as Zambia and Mozambique.
1. Climate change has played a role in creating food insecurity in farming-dependent economies.
2. Zambia is a farming-dependent economy where climate change has played a role in creating food insecurity.
3. Mozambique is a farming-dependent economy where climate change has played a role in creating food insecurity.
4. Climate change has played a role in creating economic instability in farming-dependent economies.
5. Zambia is a farming-dependent economy where climate change has played a role in creating economic instability.
6. Mozambique is a farming-dependent economy where climate change has played a role in creating economic instability.

F. Public health crises, as demonstrated by the experience of Derna, Libya, remain a significant concern in emerging markets.
1. Public health crises are a concern in emerging markets.
2. Derna, Libya, is an example of a public health crisis in emerging markets.

G. Catastrophic flooding caused by the collapse of two dams has led to significant destruction, complicating access to healthcare.
1. There was catastrophic flooding in Derna, Libya.
2. The flooding in Derna, Libya, was caused by the collapse of two dams.
3. The flooding in Derna, Libya, has led to significant destruction.
4. The flooding in Derna, Libya, has complicated access to healthcare.

H. Countries like Afghanistan and Sudan have experienced similar challenges to those of Libya.
1. Afghanistan has experienced challenges related to public health crises.
2. Afghanistan has experienced challenges related to catastrophic flooding.
3. Afghanistan has experienced challenges related to contaminated water.
4. Sudan has experienced challenges related to public health crises.
5. Sudan has experienced challenges related to catastrophic flooding.
6. Sudan has experienced challenges related to contaminated water.

Note that the baseline prompt extracted several claims from the sentence “The U.N. found that the resulting contaminated water caused many residents to fall ill, highlighting the need for improved water management,” but it ignored the phrase “highlighting the need for improved water management.” It also failed to capture that the contaminated water resulted from flooding, as implied by “resulting” in the original sentence.

Claimify took a different approach. First, it found two instances of ambiguity – “resulting contaminated water” and “many residents” – that it determined could be resolved using the context. Here’s an excerpt from its reasoning: “…the context specifies that the contaminated water is a result of the catastrophic flooding in Derna, Libya, and the residents are those of Derna, Libya.”

However, it also found an instance of ambiguity – “highlighting the need for improved water management” – where it concluded that the context does not definitively support a single interpretation: “The sentence could be interpreted as: (1) The U.N. found that the contaminated water caused illness and also highlighted the need for improved water management, (2) The U.N. only found that the contaminated water caused illness, while the need for improved water management is an implication or conclusion drawn by the writer. Readers … would likely fail to reach consensus about the correct interpretation of this ambiguity.” As a result, Claimify labeled the sentence “Cannot be disambiguated” at the Disambiguation stage and did not proceed to the Decomposition stage. 

To the best of our knowledge, Claimify is the first claim extraction system that identifies when the source text has multiple possible interpretations and extracts claims only when there is high confidence in the correct interpretation.

Next steps

We’re currently working on new methods for evaluating LLM-generated texts. We anticipate that the high-quality claims extracted by Claimify will help not only in verifying the veracity of LLM outputs, but also in assessing their overall quality – especially when gold-standard references are difficult to create (e.g., long-form texts where people may disagree on what defines “good” content). For example, we recently used Claimify to evaluate the comprehensiveness and diversity of answers generated by GraphRAG, showing that GraphRAG outperforms traditional Retrieval Augmented Generation (RAG) in these areas.

For an in-depth discussion of Claimify and our evaluation framework, please see our paper “Towards Effective Extraction and Evaluation of Factual Claims.”

[1] We used the “proposition chunking” prompt from NirDiamant’s RAG Techniques repository. We generated multiple responses using GPT-4o, then picked the response that was most representative of the samples.

[2] Claimify is currently used for research purposes only and is not available commercially.

[3] We benchmarked Claimify against VeriScore, DnD, SAFE, AFaCTA, and Factcheck-GPT.

[4] The outputs were generated using GPT-4o. Sentences not shown were either labeled “No verifiable claims” or “Cannot be disambiguated.”


Introducing KBLaM: Bringing plug-and-play external knowledge to LLMs

Tue, 03/18/2025 - 18:00

Large language models (LLMs) have demonstrated remarkable capabilities in reasoning, language understanding, and even creative tasks. Yet, a key challenge persists: how to efficiently integrate external knowledge.

Traditional methods such as fine-tuning and Retrieval-Augmented Generation (RAG) come with trade-offs—fine-tuning demands costly retraining, while RAG introduces separate retrieval modules that increase complexity and prevent seamless, end-to-end training. In-context learning, on the other hand, becomes increasingly inefficient as knowledge bases grow, facing quadratic computational scaling that hinders its ability to handle large repositories. A comparison of these approaches can be seen in Figure 1.

A new way to integrate knowledge

To address these challenges, we introduce the Knowledge Base-Augmented Language Model (KBLaM) —a novel approach that integrates structured knowledge bases into pre-trained LLMs. Instead of relying on external retrieval modules or costly fine-tuning, KBLaM encodes knowledge into continuous key-value vector pairs, efficiently embedding them within the model’s attention layers using a specialized rectangular attention mechanism, which implicitly performs retrieval in an integrated manner.

We use structured knowledge bases to represent the data, allowing us to consolidate knowledge and leverage structure. This design allows it to scale linearly with the size of the knowledge base while maintaining dynamic updates without retraining, making it far more efficient than existing methods.

Scalable, efficient, and future-ready

At its core, KBLaM is designed to integrate structured knowledge into LLMs, making them more efficient and scalable. It achieves this by converting external knowledge bases—collections of facts structured as triples consisting of an entity, a property, and a value—into a format that LLMs can process naturally.  Such knowledge bases allow for consolidated, reliable sources of knowledge.

To create these knowledge bases, we first extract structured data in JSON format using small language models. We then apply Project Alexandria’s probabilistic clustering. Once we have this structured knowledge base, KBLaM follows a three-step pipeline:

  1. Knowledge Encoding: Each knowledge triple is mapped into a key-value vector pair using a pre-trained sentence encoder with lightweight linear adapters. The key vector, derived from the entity name and property, encodes “index information,” while the value vector captures the corresponding property value. This allows us to create continuous, learnable key-value representations.
  2. Integration with LLMs: These key-value pairs, or knowledge tokens, are augmented into the model’s attention layers using a specialized rectangular attention structure. Unlike traditional transformer models that process all tokens equally and come with quadratic cost—such as GPT-4, Phi, and Llama—rectangular attention enables the model to attend over knowledge with linear cost, as illustrated in Figure 2. Compared to standard attention mechanisms in generative language models, where each token attends to all preceding tokens, our approach introduces a more efficient structure. In this setup, language tokens (such as those from a user’s question) attend to all knowledge tokens. However, knowledge tokens do not attend to one another, nor do they attend back to the language tokens. This selective attention pattern significantly reduces computational cost while preserving the model’s ability to incorporate external knowledge effectively.

    This linear cost, which is crucial for the efficiency of KBLaM, effectively amounts to treating each fact independently—an assumption that holds for most facts. For example, the model’s name, KBLaM, and the fact that the research was conducted at Microsoft Research are very weakly correlated. This rectangular attention is implemented as an extension of standard attention. During training, we keep the base model’s weights frozen, ensuring that when no knowledge tokens are provided, the model functions exactly as it did originally.
  3. Efficient Knowledge Retrieval: Through this rectangular attention, the model learns to dynamically retrieve relevant knowledge tokens during inference, eliminating the need for separate retrieval steps.
Figure 1: KBLaM allows for attention over the entire knowledge base instead of having an external retriever.

Figure 2: By having the user’s question attend to the knowledge base, while treating facts in the knowledge base independently, KBLaM scales efficiently and linearly with the size of the knowledge base.
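The sketch below shows one way to express the rectangular attention pattern described above as a boolean mask in PyTorch. It is our simplified illustration (knowledge tokens are allowed to attend to themselves so that their attention rows remain well defined), not the released KBLaM code.

```python
import torch

def rectangular_attention_mask(num_kb: int, num_lang: int) -> torch.Tensor:
    """Boolean mask of shape (num_kb + num_lang, num_kb + num_lang); True means 'may attend'."""
    n = num_kb + num_lang
    mask = torch.zeros(n, n, dtype=torch.bool)
    # Knowledge tokens attend only to themselves: each fact is treated independently.
    mask[:num_kb, :num_kb] = torch.eye(num_kb, dtype=torch.bool)
    # Language tokens attend causally to earlier language tokens...
    mask[num_kb:, num_kb:] = torch.tril(torch.ones(num_lang, num_lang)).bool()
    # ...and to every knowledge token, which is where the linear-cost retrieval happens.
    mask[num_kb:, :num_kb] = True
    return mask

print(rectangular_attention_mask(num_kb=3, num_lang=4).int())
```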

Unlike RAG, which appends retrieved document chunks to prompts, KBLaM allows for direct integration of knowledge into the model. Compared to in-context learning,  KBLaM’s rectangular attention maintains a linear memory footprint, making it vastly more scalable for large knowledge bases. 

Its efficiency is a game-changer. While traditional in-context learning methods struggle with quadratic memory growth due to self-attention overhead, KBLaM’s linear overhead means we can store much more knowledge in the context. In practice, this means KBLaM can store and process over 10,000 knowledge triples, the equivalent of approximately 200,000 text tokens on a single GPU—a feat that would be computationally prohibitive with conventional in-context learning. The results across a wide range of triple counts can be seen in Figure 3. Remarkably, it achieves this while extending a base model that has a context length of only 8K tokens. Additionally, KBLaM enables dynamic updates: modifying a single knowledge triple does not require retraining or re-computation of the entire knowledge base. 

Figure 3: KBLaM is much faster and uses much less memory than adding the equivalent number of triples in the context using conventional RAG-like approaches. In particular, we have lower time to first token with 4,096 triples in the context with KBLaM than we would with 5 triples in the context.

Enhancing interpretability and reliability

Another major benefit of KBLaM is its interpretability. Unlike in-context learning, where knowledge injection is opaque, KBLaM’s attention weights provide clear insights into how the model utilizes knowledge tokens. Experiments show that KBLaM assigns high attention scores to relevant knowledge triples, effectively mimicking a soft retrieval process.

Furthermore, KBLaM enhances model reliability by learning through its training examples when not to answer a question if the necessary information is missing from the knowledge base. In particular, with knowledge bases larger than approximately 200 triples, we found that the model refuses to answer questions it has no knowledge about more precisely than a model given the information as text in context. This feature helps reduce hallucinations, a common problem in LLMs that rely on internal knowledge alone, making responses more accurate and trustworthy.

The future of knowledge-augmented AI

KBLaM represents a major step forward in integrating structured knowledge into LLMs. By offering a scalable, efficient, and interpretable alternative to existing techniques, it paves the way for AI systems that can stay up to date and provide reliable, knowledge-driven responses. In fields where accuracy and trust are critical—such as medicine, finance, and scientific research—this approach has the potential to transform how language models interact with real-world information.

As AI systems increasingly rely on dynamic knowledge rather than static model parameters, we hope KBLaM will serve as a bridge between raw computational power and real-world understanding.

However, there is still work to be done before it can be deployed at scale. Our current model has been trained primarily on factual question-answer pairs, and further research is needed to expand its capabilities across more complex reasoning tasks and diverse knowledge domains.

To accelerate progress, we are releasing KBLaM’s code and datasets (opens in new tab) to the research community, and we are planning integrations with the Hugging Face transformers library. By making these resources available, we hope to inspire further research and adoption of scalable, efficient knowledge augmentation for LLMs. The future of AI isn’t just about generating text—it’s about generating knowledge that is accurate, adaptable, and deeply integrated with the evolving world. KBLaM is a step in that direction.


Semantic Telemetry: Understanding how users interact with AI systems

Mon, 03/10/2025 - 18:00

AI tools are proving useful across a range of applications, from helping to drive the new era of business transformation to helping artists craft songs. But which applications are providing the most value to users? We’ll dig into that question in a series of blog posts that introduce the Semantic Telemetry project at Microsoft Research. In this initial post, we will introduce a new data science approach that we will use to analyze topics and task complexity of Copilot in Bing usage.

Human-AI interactions can be iterative and complex, requiring a new data science approach to understand user behavior to build and support increasingly high value use cases. Imagine the following chat:

Here we see that chats can be complex and span multiple topics, such as event planning, team building, and logistics. Generative AI has ushered in a two-fold paradigm shift. First, LLMs give us something new to measure: how people interact with AI systems. Second, they give us a new way to measure those interactions: the capability to understand and make inferences about these interactions at scale. The Semantic Telemetry project has created new measures to classify human-AI interactions and understand user behavior, contributing to efforts in developing new approaches for measuring generative AI (opens in new tab) across various use cases.

Semantic Telemetry is a rethink of traditional telemetry–in which data is collected for understanding systems–designed for analyzing chat-based AI. We employ an innovative data science methodology that uses a large language model (LLM) to generate meaningful categorical labels, enabling us to gain insights into chat log data.

Figure 1: Prompting an LLM to classify a conversation based on LLM generated label taxonomy

This process begins with developing a set of classifications and definitions. We create these classifications by instructing an LLM to generate a short summary of the conversation, and then iteratively prompting the LLM to generate, update, and review classification labels on a batched set of summaries. This process is outlined in the paper: TnT-LLM: Text Mining at Scale with Large Language Models. We then prompt an LLM with these generated classifiers to label new unstructured (and unlabeled) chat log data.

Description of LLM generated label taxonomy process
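A minimal sketch of this labeling step is shown below. It is our own illustration, not the project’s production pipeline: the taxonomy, the example summary, and the `call_llm` helper are placeholders for whatever taxonomy and chat-completion client you have available.

```python
# Illustrative sketch of LLM-based labeling: given an LLM-generated taxonomy,
# ask the model to assign exactly one label to each conversation summary.
# `call_llm` is a placeholder for your own chat-completion client.

TAXONOMY = {
    "Technology": "Programming, scripting, computers, and electronics.",
    "Entertainment": "Personal activities such as games, hobbies, and events.",
    "Travel and tourism": "Trip planning, flights, destinations.",
}

def build_prompt(summary: str) -> str:
    labels = "\n".join(f"- {name}: {desc}" for name, desc in TAXONOMY.items())
    return (
        "Classify the conversation summary into exactly one of these topics.\n"
        f"{labels}\n\n"
        f"Summary: {summary}\n"
        "Answer with the topic name only."
    )

def classify(summary: str, call_llm) -> str:
    label = call_llm(build_prompt(summary)).strip()
    return label if label in TAXONOMY else "Other"   # guard against off-taxonomy output

if __name__ == "__main__":
    fake_llm = lambda prompt: "Technology"   # stubbed model call for demonstration
    print(classify("User asks for help fixing a SQL syntax error.", fake_llm))
```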

With this approach, we have analyzed how people interact with Copilot in Bing. In this blog, we examine insights into how people are using Copilot in Bing, including how that differs from traditional search engines. Note that all analyses were conducted on anonymous Copilot interactions containing no personal information.

Topics

To get a clear picture of how people are using Copilot in Bing, we need to first classify sessions into topical categories. To do this, we developed a topic classifier. We used the LLM classification approach described above to label the primary topic (domain) for the entire content of the chat. Although a single chat can cover multiple topics, for this analysis, we generated a single label for the primary topic of the conversation. We sampled five million anonymized Copilot in Bing chats during August and September 2024, and found that globally, 21% of all chats were about technology, with a high concentration of these chats in programming and scripting and computers and electronics.

Figure 2: Top Copilot in Bing topics based on anonymized data (August-September 2024)

Figure 3: Frequent topic summaries in Technology

Figure 4: Frequent topic summaries in Entertainment

Diving into the technology category, we find a lot of professional tasks in programming and scripting, where users request problem-specific assistance such as fixing a SQL query syntax error. In computers and electronics, we observe users getting help with tasks like adjusting screen brightness and troubleshooting internet connectivity issues. We can compare this with our second most common topic, entertainment, in which we see users seeking information related to personal activities like hiking and game nights.

We also note that top topics differ by platform. The figure below depicts topic popularity based on mobile and desktop usage. Mobile device users tend to use the chat for more personal-related tasks such as helping to plant a garden or understanding medical symptoms whereas desktop users conduct more professional tasks like revising an email.

Figure 5: Top topics for desktop users and mobile users

Spotlight: Event Series

Microsoft Research Forum

Join us for a continuous exchange of ideas about research in the era of general AI. Watch the first four episodes on demand.

Watch on-demand

Search versus Copilot

Beyond analyzing topics, we compared Copilot in Bing usage to that of traditional search. Chat extends beyond traditional online search by enabling users to summarize, generate, compare, and analyze information. Human-AI interactions are conversational and more complex than traditional search (Figure 6).

Figure 6: Bing Search Query compared to Copilot in Bing Conversation

A major differentiation between search and chat is the ability to ask more complex questions, but how can we measure this? We think of complexity as a scale ranging from simply asking chat to look up information to evaluating several ideas. We aim to understand the difficulty of a task if performed by a human without the assistance of AI. To achieve this, we developed the task complexity classifier, which assesses task difficulty using Anderson and Krathwohl’s Taxonomy of Learning Objectives (opens in new tab). For our analysis, we have grouped the learning objectives into two categories: low complexity and high complexity. Any task more complicated than information lookup is classified as high complexity. Note that this would be very challenging to classify using traditional data science techniques.

Description of task complexity and 6 categories of the Anderson and Krathwohl’s Taxonomy of Learning Objectives
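As a rough sketch of the two-way grouping (our own illustration; the actual classifier prompts an LLM and is not shown here), the snippet below maps the six Anderson and Krathwohl levels onto the low/high complexity split described above, assuming that simple information lookup corresponds to the “Remember” level.

```python
# Illustrative grouping of Anderson and Krathwohl's six learning objectives into
# the low/high complexity split used in the analysis. Assumes "Remember"
# corresponds to information lookup; anything above it is high complexity.

LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]

def complexity_bucket(level: str) -> str:
    if level not in LEVELS:
        raise ValueError(f"Unknown level: {level}")
    return "low complexity" if level == "Remember" else "high complexity"

if __name__ == "__main__":
    for level in LEVELS:
        print(f"{level:<10} -> {complexity_bucket(level)}")
```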

Comparing low versus high complexity tasks, most chat interactions were categorized as high complexity (78.9%), meaning that they were more complex than looking up information. Programming and scripting, marketing and sales, and creative and professional writing are topics in which users engage in higher complexity tasks (Figure 7) such as learning a skill, troubleshooting a problem, or writing an article.

Figure 7: Most and least complex topics based on percentage of high complexity tasks.

Travel and tourism and history and culture scored lowest in complexity, with users looking up information like flight times and latest news updates.

Demo of task complexity and topics on anonymous Copilot interactions

When should you use chat instead of search? A 2024 Microsoft Research study, The Use of Generative Search Engines for Knowledge Work and Complex Tasks, suggests that people see value in using chat for technical, complex tasks such as web development and data analysis. Bing Search contained more queries with lower complexity focused on non-professional areas, like gaming and entertainment, travel and tourism, and fashion and beauty, while chat had a greater share of complex technical tasks (Figure 8).

Figure 8: Comparison of Bing Search and Copilot in Bing for anonymized sample data (May-June 2023)

Conclusion

LLMs have enabled a new era of high-quality human-AI interaction, and with it, the capability to analyze those same interactions with high fidelity, at scale, and in near real-time. We are now able to obtain actionable insight from complex data that is not possible with traditional data science pattern-matching methods. LLM-generated classifications are pushing research into new directions that will ultimately improve user experience and satisfaction when using chat and other user-AI interaction tools.

This analysis indicates that Copilot in Bing is enabling users to do more complex work, specifically in areas such as technology. In our next post, we will explore how Copilot in Bing is supporting professional knowledge work and how we can use these measures as indicators for retention and engagement.

FOOTNOTE: This research was conducted at the time the feature Copilot in Bing was available as part of the Bing service; since October 2024 Copilot in Bing has been deprecated in favor of the standalone Microsoft Copilot service.

References:

  1. Krathwohl, D. R. (2002). A Revision of Bloom’s Taxonomy: An Overview. Theory Into Practice, 41(4), 212–218. https://doi.org/10.1207/s15430421tip4104_2 (opens in new tab)

The post Semantic Telemetry: Understanding how users interact with AI systems appeared first on Microsoft Research.

Categories: Microsoft

Advancing biomedical discovery: Overcoming data challenges in precision medicine

Wed, 03/05/2025 - 19:00
Introduction

Modern biomedical research is driven by the promise of precision medicine—tailored treatments for individual patients through the integration of diverse, large-scale datasets. Yet, the journey from raw data to actionable insights is fraught with challenges. Our team of researchers at Microsoft Research in the Health Futures group, in collaboration with the Perelman School of Medicine at the University of Pennsylvania (opens in new tab), conducted an in-depth exploration of these challenges in a study published in Nature Scientific Reports. The goal of this research was to identify pain points in the biomedical data lifecycle and offer actionable recommendations that enable secure data sharing, improve interoperability, support robust analysis, and foster collaboration across the biomedical research community.

Study at a glance

A deep understanding of the biomedical discovery process is crucial for advancing modern precision medicine initiatives. To explore this, our study involved in-depth, semi-structured interviews with biomedical research professionals spanning various roles including bench scientists, computational biologists, researchers, clinicians, and data curators. Participants provided detailed insights into their workflows, from data acquisition and curation to analysis and result dissemination. We used an inductive-deductive thematic analysis to identify key challenges occurring at each stage of the data lifecycle—from raw data collection to the communication of data-driven findings.

Some key challenges identified include:
  • Data procurement and validation: Researchers struggle to identify and secure the right datasets for their research questions, often battling inconsistent quality and manual data validation.
  • Computational hurdles: The integration of multiomic data requires navigating disparate computational environments and rapidly evolving toolsets, which can hinder reproducible analysis.
  • Data distribution and collaboration: The absence of a unified data workflow and secure sharing infrastructure often leads to bottlenecks when coordinating between stakeholders across university labs, pharmaceutical companies, clinical settings, and third-party vendors.
Main takeaways and recommendations:
  1. Establishing a unified biomedical data lifecycle 

    This study highlights the need for a unified process that spans all phases of the biomedical discovery process—from data-gathering and curation to analysis and dissemination. Such a data jobs-to-be-done framework would streamline standardized quality checks, reduce manual errors such as metadata reformatting, and ensure that the flow of data across different research phases remains secure and consistent. This harmonization is essential to accelerate research and build more robust, reproducible models that propel precision medicine forward.
  2. Empowering stakeholder collaboration and secure data sharing 

    Effective biomedical discovery requires collaboration across multiple disciplines and institutions. A key takeaway from our interviews was the critical importance of collaboration and trust among stakeholders. Secure, user-friendly platforms that enable real-time data sharing and open communication among clinical trial managers, clinicians, computational scientists, and regulators can bridge the gap between isolated research silos. As a possible solution, implementing centralized cloud-based infrastructures and democratizing data access can dramatically reduce data handoff issues and accelerate scientific discovery.
  3. Adopting actionable recommendations to address data pain points 

    Based on the insights from this study, the authors propose a list of actionable recommendations such as:
    • Creating user-friendly platforms to transition from manual (bench-side) data collection to electronic systems.
    • Standardizing analysis workflows to facilitate reproducibility, including version control and the seamless integration of notebooks into larger workflows.
    • Leveraging emerging technologies such as generative AI and transformer models for automating data ingestion and processing of unstructured text.

If implemented, the recommendations from this study would help forge a reliable, scalable infrastructure for managing the complexity of biomedical data, ultimately advancing research and clinical outcomes.

Looking ahead

At Microsoft Research, we believe in the power of interdisciplinarity and innovation. This study not only identifies the critical pain points that have slowed biomedical discovery but also illustrates a clear path toward improved data integrity, interoperability, and collaboration. By uniting diverse stakeholders around a common, secure, and scalable data research lifecycle, we edge closer to realizing individualized therapeutics for every patient.

We encourage our colleagues, partners, and the broader research community to review the full study and consider these insights as key steps toward a more integrated biomedical data research infrastructure. The future of precision medicine depends on our ability to break down data silos and create a research data lifecycle that is both robust and responsive to the challenges of big data.

Explore the full paper (opens in new tab) in Nature Scientific Reports to see how these recommendations were derived, and consider how they might integrate into your work. Let’s reimagine biomedical discovery together—where every stakeholder contributes to a secure, interoperable, and innovative data ecosystem that transforms patient care.

We look forward to engaging with the community on these ideas as we continue to push the boundaries of biomedical discovery at Microsoft Research.

Access the full paper

The post Advancing biomedical discovery: Overcoming data challenges in precision medicine appeared first on Microsoft Research.

Categories: Microsoft

Magma: A foundation model for multimodal AI agents across digital and physical worlds

Tue, 02/25/2025 - 21:08

Imagine an AI system capable of guiding a robot to manipulate physical objects as effortlessly as it navigates software menus. Such seamless integration of digital and physical tasks has long been the stuff of science fiction.  

Today, Microsoft researchers are bringing that vision closer to reality with Magma (opens in new tab), a multimodal AI foundation model designed to process information and generate action proposals across both digital and physical environments. It is designed to enable AI agents to interpret user interfaces and suggest actions like button clicks, while also orchestrating robotic movements and interactions in the physical world.  

Built on the foundation model paradigm, Magma is pretrained on an expansive and diverse dataset, allowing it to generalize better across tasks and environments than smaller, task-specific models. As illustrated in Figure 1, Magma synthesizes visual and textual inputs to generate meaningful actions—whether executing a command in software or grabbing a tool in the physical world. This new model represents a significant step toward AI agents that can serve as versatile, general-purpose assistants. 

Figure 1: Magma is one of the first foundation models that is capable of interpreting and grounding multimodal inputs within both digital and physical environments. Given a described goal, Magma can formulate plans and execute actions to achieve it. By effectively transferring knowledge from freely available visual and language data, Magma bridges verbal, spatial and temporal intelligence to navigate complex tasks and settings.

Vision-Language-Action (VLA) models integrate visual perception, language comprehension, and action reasoning to enable AI systems to interpret images, process textual instructions, and propose actions. These models bridge the gap between multimodal understanding and real-world interaction. Typically pretrained on large-scale VLA datasets, they acquire the ability to understand visual content, process language, and perceive and interact with the spatial world, allowing them to perform a wide range of tasks. However, due to the dramatic differences among digital and physical environments, separate VLA models are trained and used for different environments. As a result, these models struggle to generalize to new tasks and environments outside of their training data. Moreover, most of these models do not leverage pretrained vision-language (VL) models or diverse VL datasets, which hampers their understanding of VL relations and generalizability.  

Magma, to the best of our knowledge, is one of the first VLA foundation models that can adapt to new tasks in both digital and physical environments, which helps AI-powered assistants or robots understand their surroundings and suggest appropriate actions. For example, it could enable a home assistant robot to learn how to organize a new type of object it has never encountered or help a virtual assistant generate step-by-step user interface navigation instructions for an unfamiliar task. Through Magma, we demonstrate the advantages of pretraining a single VLA model for AI agents across multiple environments while still achieving state-of-the-art results on user interface navigation and robotic manipulation tasks, outperforming previous models that are tailored to these specific domains. On VL tasks, Magma also compares favorably to popular VL models that are trained on much larger datasets. 

Building a foundation model that spans such different modalities has required us to rethink how we train and supervise AI agents. Magma introduces a novel training paradigm centered on two key innovations: Set-of-Mark (SoM) and Trace-of-Mark (ToM) annotations. These techniques, developed by Microsoft Research, imbue the model with a structured understanding of tasks in both user interface navigation and robotic manipulation domains. 

  • Set-of-Mark (SoM): SoM is an annotated set of key objects, or interface elements that are relevant to achieving a given goal. For example, if the task is to navigate a web page, the SoM includes all the bounding boxes for clickable user interface elements. In a physical task like setting a table, the SoM could include the plate, the cup, and the position of each item on the table. By providing SoM, we give Magma a high-level hint of “what needs attention”—the essential elements of the task—without yet specifying the order or method.
Figure 2: Set-of-Mark (SoM) for Action Grounding. Set-of-Mark prompting enables effective action grounding in images for UI screenshots (left), robot manipulation (middle), and human video (right) by having the model predict numeric marks for clickable buttons or robot arms in image space. These marks give Magma a high-level hint of “what needs attention” – the essential elements of the task.
  • Trace-of-Mark (ToM): In ToM we extend the strategy of “overlaying marks” from static images to dynamic videos, by incorporating tracing lines following object movements over time. While SoM highlights key objects or interface elements relevant to a task, ToM captures how these elements change or move throughout an interaction. For example, in a physical task like moving an object on a table, ToM might illustrate the motion of a hand placing the object and adjusting its position. By providing these temporal traces, ToM offers Magma a richer understanding of how actions unfold, complementing SoM’s focus on what needs attention.
Figure 3: Trace-of-Mark (ToM) for Action Planning. Trace-of-Mark supervisions for robot manipulation (left) and human action (right). It compels the model to comprehend temporal video dynamics and anticipate future states before acting, while using fewer tokens than next-frame prediction to capture longer temporal horizons and action-related dynamics without ambient distractions.

Performance and evaluation

Zero-shot agentic intelligence

Table 1: Zero-shot evaluation on agentic intelligence. We report the results for pretrained Magma without any domain-specific finetuning. In this experiment, Magma is the only model that can conduct the full task spectrum.

Figure 4: Zero-shot evaluation on Google Robots and Bridge with SimplerEnv. Magma shows strong zero-shot cross-domain robustness and demonstrates impressive results in cross-embodiment manipulation simulation tasks.

Efficient finetuning

Table 2: Efficient finetuning on Mind2Web for web UI navigation.

Figure 5: Few-shot finetuning on Widow-X robot (left) and LIBERO (right). Magma achieves a significantly higher average success rate in all task suites. Additionally, removing SoM and ToM during pretraining has a negative impact on model performance.

Table 3: Without task-specific data, Magma performs competitively and even outperforms some state-of-the-art approaches such as Video-Llama2 and ShareGPT4Video on most benchmarks, despite using much less video instruction tuning data.

Relation to broader research

Magma is one component of a much larger vision within Microsoft Research for the future of agentic AI systems. Across various teams and projects at Microsoft, we are collectively exploring how AI systems can detect, analyze, and respond in the world to amplify human capabilities.

Earlier this month, we announced AutoGen v0.4, a fully reimagined open-source library for building advanced agentic AI systems. While AutoGen focuses on the structure and management of AI agents, Magma enhances those agents by empowering them with a new level of capability. Developers can already use AutoGen to set up an AI assistant that leverages a conventional LLM for planning and dialogue. Now with Magma, if developers want to build agents that execute physical or user interface/browser tasks, that same assistant would call upon Magma to understand the environment, perform reasoning, and take a sequence of actions to complete the task. 

The reasoning ability of Magma can be further developed by incorporating test-time search and reinforcement learning, as described in ExACT. ExACT shows an approach for teaching AI agents to explore more effectively, enabling them to intelligently navigate their environments, gather valuable information, evaluate options, and identify optimal decision-making and planning strategies.

At the application level, we are also exploring new user experiences (UX) powered by foundation models for the next generation of agentic AI systems. Data Formulator is a prime example. Announced late last year, Data Formulator is an AI-driven visualization tool developed by Microsoft Research that translates high-level analytical intents into rich visual representations by handling complex data transformations behind the scenes.

Looking ahead, the integration of reasoning, exploration and action capabilities will pave the way for highly capable, robust agentic AI systems.

Magma is available on Azure AI Foundry Labs (opens in new tab) as well as on HuggingFace (opens in new tab) with an MIT license. Please refer to the Magma project page (opens in new tab) for more technical details. We invite you to test and explore these cutting-edge agentic model innovations from Microsoft Research.


The post Magma: A foundation model for multimodal AI agents across digital and physical worlds appeared first on Microsoft Research.

Categories: Microsoft

Exploring the structural changes driving protein function with BioEmu-1

Thu, 02/20/2025 - 17:13

From forming muscle fibers to protecting us from disease, proteins play an essential role in almost all biological processes in humans and other life forms alike. There has been extraordinary progress in recent years toward better understanding protein structures using deep learning, enabling the accurate prediction of protein structures from their amino acid sequences. However, predicting a single protein structure from its amino acid sequence is like looking at a single frame of a movie—it offers only a snapshot of a highly flexible molecule. Biomolecular Emulator-1 (BioEmu-1) is a deep-learning model that provides scientists with a glimpse into the rich world of different structures each protein can adopt, or structural ensembles, bringing us a step closer to understanding how proteins work. A deeper understanding of proteins enables us to design more effective drugs, as many medications work by influencing protein structures to boost their function or prevent them from causing harm.

One way to model different protein structures is through molecular dynamics (MD) simulations. These tools simulate how proteins move and deform over time and are widely used in academia and industry. However, in order to simulate functionally important changes in structure, MD simulations must be run for a long time. This is a computationally demanding task and significant effort has been put into accelerating simulations, going as far as designing custom computer architectures (opens in new tab). Yet, even with these improvements, many proteins remain beyond what is currently possible to simulate and would require simulation times of years or even decades. 

Enter BioEmu-1 (opens in new tab)—a deep learning model that can generate thousands of protein structures per hour on a single graphics processing unit. Today, we are making BioEmu-1 open-source (opens in new tab), following our preprint (opens in new tab) from last December, to empower protein scientists in studying structural ensembles with our model. It provides orders of magnitude greater computational efficiency compared to classical MD simulations, thereby opening the door to insights that have, until now, been out of reach. BioEmu-1 is featured in Azure AI Foundry Labs (opens in new tab), a hub for developers, startups, and enterprises to explore groundbreaking innovations from research at Microsoft.

Spotlight: Blog post

MedFuzz: Exploring the robustness of LLMs on medical challenge problems

MedFuzz tests LLMs by breaking benchmark assumptions, exposing vulnerabilities to bolster real-world accuracy.

Read more

We have enabled this by training BioEmu-1 on three types of data sets: (1) AlphaFold Database (AFDB) (opens in new tab) structures, (2) an extensive MD simulation dataset, and (3) an experimental protein folding stability dataset (opens in new tab). Training BioEmu-1 on the AFDB structures is like mapping distinct islands in a vast ocean of possible structures. When preparing this dataset, we clustered similar protein sequences so that BioEmu-1 can recognize that a protein sequence maps to multiple distinct structures. The MD simulation dataset helps BioEmu-1 predict physically plausible structural changes around these islands, mapping out the plethora of possible structures that a single protein can adopt. Finally, through fine-tuning on the protein folding stability dataset, BioEmu-1 learns to sample folded and unfolded structures with the right probabilities.

Figure 1: BioEmu-1 predicts diverse structures of LapD protein unseen during training. We sampled structures independently and reordered the samples to create a movie connecting two experimentally known structures.

Combining these advances, BioEmu-1 successfully generalizes to unseen protein sequences and predicts multiple structures. In Figure 1, we show that BioEmu-1 can predict structures of the LapD protein (opens in new tab) from Vibrio cholerae bacteria, which causes cholera. BioEmu-1 predicts structures of LapD when it is bound and unbound with c-di-GMP molecules, both of which are experimentally known but not in the training set. Furthermore, our model offers a view on intermediate structures, which have never been experimentally observed, providing viable hypotheses about how this protein functions. Insights into how proteins function pave the way for further advancements in areas like drug development.

Figure 2: BioEmu-1 reproduces the D. E. Shaw Research (DESRES) simulation of Protein G accurately with a fraction of the computational cost. On the top, we compare the distributions of structures obtained by extensive MD simulation (left) and independent sampling from BioEmu-1 (right). Three representative sample structures are shown at the bottom.

Moreover, BioEmu-1 reproduces MD equilibrium distributions accurately with a tiny fraction of the computational cost. In Figure 2, we compare 2D projections of the structural distribution of the D. E. Shaw Research (DESRES) simulation of Protein G (opens in new tab) and samples from BioEmu-1. BioEmu-1 reproduces the MD distribution accurately, while requiring 10,000-100,000 times fewer GPU hours.

Figure 3: BioEmu-1 accurately predicts protein stability. On the left, we plot the experimentally measured free energy differences ΔG against those predicted by BioEmu-1. On the right, we show a protein in folded and unfolded structures.

Furthermore, BioEmu-1 accurately predicts protein stability, which we measure by computing the folding free energies—a way to quantify the ratio between the folded and unfolded states of a protein. Protein stability is an important factor when designing proteins, e.g., for therapeutic purposes. Figure 3 shows the folding free energies predicted by BioEmu-1, obtained by sampling protein structures and counting folded versus unfolded protein structures, compared against experimental folding free energy measurements. We see that even on sequences that BioEmu-1 has never seen during training, the predicted free energy values correlate well with experimental values.
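As a sketch of that bookkeeping (our own illustration with made-up counts, not BioEmu-1’s actual evaluation pipeline), the folding free energy can be estimated from the ratio of sampled structures classified as folded versus unfolded, ΔG_fold = -k_B·T·ln(p_folded / p_unfolded); the temperature and units below are assumptions for the example.

```python
import math

# Illustrative only: estimate a folding free energy from sampled structure counts.
# The counts and the folded/unfolded classifier are placeholders.

K_B = 0.0019872041   # Boltzmann constant in kcal/(mol*K)
T = 298.15           # assumed temperature in kelvin

def folding_free_energy(n_folded: int, n_unfolded: int) -> float:
    """Delta G_fold = -k_B * T * ln(n_folded / n_unfolded), in kcal/mol.
    A negative value means the folded state is the more populated (more stable) one."""
    if n_folded == 0 or n_unfolded == 0:
        raise ValueError("Need at least one folded and one unfolded sample.")
    return -K_B * T * math.log(n_folded / n_unfolded)

# Example: 900 of 1,000 sampled structures classified as folded (made-up numbers).
print(f"{folding_free_energy(900, 100):.2f} kcal/mol")
```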

Professor Martin Steinegger (opens in new tab) of Seoul National University, who was not part of the study, says “With highly accurate structure prediction, protein dynamics is the next frontier in discovery. BioEmu marks a significant step in this direction by enabling blazing-fast sampling of the free-energy landscape of proteins through generative deep learning.”

We believe that BioEmu-1 is a first step toward generating the full ensemble of structures that a protein can take. In these early days, we are also aware of its limitations. With this open-source release, we hope scientists will start experimenting with BioEmu-1, helping us identify its potential and shortcomings so we can improve it in the future. We are looking forward to hearing how it performs on various proteins you care about.

Acknowledgements

BioEmu-1 is the result of a highly collaborative team effort at Microsoft Research AI for Science. The full author list: Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew Y. K. Foong, Victor García Satorras, Osama Abdin, Bastiaan S. Veeling, Iryna Zaporozhets, Yaoyi Chen, Soojung Yang, Arne Schneuing, Jigyasa Nigam, Federico Barbero, Vincent Stimper, Andrew Campbell, Jason Yim, Marten Lienen, Yu Shi, Shuxin Zheng, Hannes Schulz, Usman Munir, Ryota Tomioka, Cecilia Clementi, Frank Noé


The post Exploring the structural changes driving protein function with BioEmu-1 appeared first on Microsoft Research.

Categories: Microsoft

Introducing Muse: Our first generative AI model designed for gameplay ideation

Wed, 02/19/2025 - 18:05

Today, the journal Nature (opens in new tab) is publishing our latest research, which introduces the first World and Human Action Model (WHAM). The WHAM, which we’ve named “Muse,” is a generative AI model of a video game that can generate game visuals, controller actions, or both.

The paper in Nature offers a detailed look at Muse, which was developed by the Microsoft Research Game Intelligence (opens in new tab) and Teachable AI Experiences (opens in new tab) (Tai X) teams in collaboration with Xbox Game Studios’ Ninja Theory (opens in new tab). Simultaneously, to help other researchers explore these models and build on our work, we are open sourcing the weights and sample data and making available the executable for the WHAM Demonstrator—a concept prototype that provides a visual interface for interacting with WHAM models and multiple ways of prompting the models. Developers can learn and experiment with the weights, sample data, and WHAM Demonstrator on Azure AI Foundry (opens in new tab).

In our research, we focus on exploring the capabilities that models like Muse need to effectively support human creatives. I’m incredibly proud of our teams and the milestone we have achieved, not only by showing the rich structure of the game world that a model like Muse can learn, as you see in the video demo below, but also, and even more importantly, by demonstrating how to develop research insights to support creative uses of generative AI models.

Generated gameplay examples

Example gameplay sequences generated by Muse (based on WHAM-1.6B) demonstrate that our model can generate complex gameplay sequences that are consistent over several minutes. All examples shown here were generated by prompting the model with 10 initial frames (1 second) of human gameplay and the controller actions of the whole play sequence. Muse is used in “world model mode” meaning that it is used to predict how the game will evolve from the initial prompt sequence. The more closely the generated gameplay sequence resembles the actual game, the more accurately Muse has captured the dynamics of that game.

What motivated this research?

As we release our research insights and model today, I keep thinking back to how this all started.  There was a key moment back in December 2022 that I remember clearly. I had recently returned from maternity leave, and while I was away the machine learning world had changed in fundamental ways. ChatGPT had been publicly released, and those who had tried it were in awe of OpenAI’s technical achievements and the model’s capabilities. It was a powerful demonstration of what transformer-based generative models could do when trained on large amounts of (text) data. Coming back from leave at that moment, the key question on my mind was, “What are the implications of this achievement for our team’s work at the intersection of artificial intelligence and video games?”

A new research opportunity enabled by data

In our team, we had access to a very different source of data. For years, we had collaborated with Xbox Game Studios’ Ninja Theory (based in Cambridge, UK, just like our research team) to collect gameplay data from Bleeding Edge, their 2020 Xbox game. Bleeding Edge is a 4-versus-4 game where all games are played online, and matches are recorded if the player agrees to the End User License Agreement (EULA). We worked closely with our colleagues at Ninja Theory and with Microsoft compliance teams to ensure that the data was collected ethically and used responsibly for research purposes.

“It’s been amazing to see the variety of ways Microsoft Research has used the Bleeding Edge environment and data to explore novel techniques in a rapidly moving AI industry,” said Gavin Costello, technical director at Ninja Theory. “From the hackathon that started it all, where we first integrated AI into Bleeding Edge, to building AI agents that could behave more like human players, to the World and Human Action Model being able to dream up entirely new sequences of Bleeding Edge gameplay under human guidance, it’s been eye-opening to see the potential this type of technology has.” 

Muse Training Data

Current Muse instances were trained on human gameplay data (visuals and controller actions) from the Xbox game Bleeding Edge – shown here at the 300×180 px resolution at which we train current models. Muse (using WHAM-1.6B) has been trained on more than 1 billion images and controller actions, corresponding to over 7 years of continuous human gameplay.

The Game Intelligence and Teachable AI Experiences teams playing the Bleeding Edge game together.

Until that point in late 2022, we had used Bleeding Edge as a platform for human-like navigation experiments, but we had not yet made meaningful use of the large amount of human player data we now had available. With the powerful demonstration of text-models, the next question was clear: “What could we achieve if we trained a transformer-based model on large amounts of human gameplay data?” 

Scaling up model training

As the team got to work, some of the key challenges included scaling up the model training. We initially used a V100 cluster, where we were able to prove out how to scale up to training on up to 100 GPUs; that eventually paved the way to training at scale on H100s. Key design decisions we made early focused on how to best leverage insights from the large language model (LLM) community and included choices such as how to effectively represent controller actions and especially images.

The first sign that the hard work of scaling up training was paying off came in the form of a demo that thoroughly impressed me. Tim Pearce, at that time a researcher in Game Intelligence, had put together examples of what happened early versus later in training. You can see the demo here – it’s like watching the model learn. This led to our follow-up work showing how scaling laws emerge in these kinds of models.

Muse consistency over the course of training

Comparison of ground truth human gameplay with game visuals generated by Muse (206M parameters), conditioned on 1 second of real gameplay and 9 seconds of controller actions, at 10k, 100k, and 1M training updates.

Comparing ground truth human gameplay (left) to visuals generated using Muse (using WHAM-206M) when prompted with 1 second of human gameplay (visuals and controller actions) and 9 seconds of controller actions from the ground truth. In this setting, if Muse can generate visuals that closely match the ground truth, then it has captured the game dynamics. We see that the quality of generated visuals improves visibly over the course of training. In early training (10k training updates) we see signs of life, but quality deteriorates quickly. After 100k training updates, the model is consistent over time but does not yet capture relatively less frequent aspects of the game dynamics, such as the flying mechanic. Consistency with the ground truth continues to improve with additional training, e.g., the flying mechanic is captured after 1M training updates.

Multidisciplinary collaboration: Involving users from the beginning

We had started to investigate how to evaluate these types of models early on. For example, we wanted to understand the representations learned using linear probing, which was driven by Research Intern Gunshi Gupta and Senior Research Scientist Sergio Valcarcel Macua; to explore online evaluation, driven by Senior Research Scientist Raluca Georgescu; and to generate both visuals and actions, initially termed “full dreaming” and driven by Research Intern Tarun Gupta. But working through how to systematically evaluate Muse required a much broader set of insights. More importantly, we needed to understand how people might use these models in order to know how to evaluate them.  

This was where the opportunity for multidisciplinary research became crucial. We had discussed aspects of this work with Senior Principal Research Manager Cecily Morrison and her Teachable AI Experiences team for several months. And we had already partnered on an engagement with game creatives (driven by Cecily, Design Researcher Linda Wen, and Principal Research Software Development Engineer Martin Grayson) to investigate how game creators would like to use generative AI capabilities in their creative practice.

“It was a great opportunity to join forces at this early stage to shape model capabilities to suit the needs of creatives right from the start, rather than try to retrofit an already developed technology,” Cecily said. 

Linda offered some valuable insights about how we approached the work: “We’ve seen how technology-driven AI innovation has disrupted the creative industry—often catching creators off guard and leaving many feeling excluded,” she said. “This is why we invited game creators to help us shape this technology from the start. Recognizing that most AI innovations are developed in the Global North, we also made it a priority to recruit game creators from underrepresented backgrounds and geographies. Our goal was to create a technology that benefits everyone—not just those already in positions of privilege.” 

Unlocking new creative use cases with the WHAM Demonstrator

Now, with the model’s emerging capabilities and user insights in mind, it was time to put all the pieces together. The teams joined forces during a Microsoft internal hackathon to explore new interaction paradigms and creative uses that Muse could unlock. As a result, we developed a prototype that we call the WHAM Demonstrator, which allows users to directly interface with the model.

“The Global Hackathon was the perfect opportunity for everyone to come together and build our first working prototype,” Martin said. “We wanted to develop an interface for the WHAM model that would allow us to explore its creative potential and start to test ideas and uses we had learned from our interviews with game developers.” 

WHAM Demonstrator

For interacting with World and Human Action Models like Muse, the WHAM Demonstrator provides a visual interface for interacting with a WHAM instance.

In this example, the user is loading a visual as an initial prompt to the model, here a single promotional image for the game Bleeding Edge. They use Muse to generate multiple potential continuations from this starting point. The user explores the generated sequences and can tweak them, for example using a game controller to direct the character. These features demonstrate how Muse’s capabilities can enable iteration as part of the creative process.

Identifying key capabilities and how to evaluate them

The hands-on experience of exploring Muse capabilities with the WHAM Demonstrator, and drawing on insights we gained from the user study, allowed us to systematically identify capabilities that game creatives would require to use generative models like Muse. This in turn allowed us to establish evaluation protocols for three key capabilities: consistency, diversity, and persistency. Consistency refers to a model’s ability to generate gameplay sequences that respect the dynamics of the game. For example, the character moves consistently with controller actions, does not walk through walls, and generally reflects the physics of the underlying game. Diversity refers to a model’s ability to generate a range of gameplay variants given the same initial prompt, covering a wide range of ways in which gameplay could evolve. Finally, persistency refers to a model’s ability to incorporate (or “persist”) user modifications into generated gameplay sequences, such as a character that is copy-pasted into a game visual. We give an overview of these capabilities below. 

Muse evaluation of consistency, diversity and persistency

Consistency

We evaluate consistency by prompting the model with ground truth gameplay sequences and controller actions, and letting the model generate game visuals. The videos shown here are generated using Muse (based on WHAM-1.6B) and demonstrate the model’s ability to generate consistent gameplay sequences of up to two minutes. In our paper, we also compare the generated visuals to the ground truth visuals using FVD (Fréchet Video Distance), an established metric in the video generation community.

Diversity

Muse (based on WHAM-1.6B) generated examples of behavioral and visual diversity, conditioned on the same initial 10 frames (1 second) of real gameplay. The three examples at the top show behavioral diversity (diverse camera movement, loitering near the spawn location, and navigating various paths to the middle jump pad). The three examples below show visual diversity (different hoverboards for the character). In the paper, we also quantitatively assess diversity using the Wasserstein distance, a measure of distance between two distributions, to compare the model-generated sequences to the diversity reflected in human gameplay recordings.
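To give a flavor of that diversity measurement, here is a minimal sketch (our own illustration, not the paper’s evaluation code) that uses the one-dimensional Wasserstein distance from SciPy to compare a per-sequence summary statistic of generated gameplay against the same statistic computed from human gameplay recordings; the statistic and the data are synthetic placeholders.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Illustrative only: compare the distribution of some per-sequence summary
# statistic (e.g., total camera rotation or path length) between human gameplay
# recordings and model-generated gameplay. The arrays below are synthetic stand-ins.

rng = np.random.default_rng(42)
human_stat = rng.normal(loc=1.0, scale=0.5, size=500)      # stand-in for human recordings
generated_stat = rng.normal(loc=1.1, scale=0.4, size=500)  # stand-in for generated samples

# A smaller distance means the generated sequences cover a similar spread of
# behavior to the human data for this statistic.
print(f"Wasserstein distance: {wasserstein_distance(human_stat, generated_stat):.3f}")
```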

With our evaluation framework in place, and access to an H100 compute allocation, the team was able to further improve Muse instances, including higher resolution image encoders (our current models generate visuals at a resolution of 300×180 pixels, up from the 128×128 resolution of our earliest models) and larger models, and expand to all seven Bleeding Edge maps. To show some of the capabilities of the model we are publishing today, we have included videos of 2-minute-long generated gameplay sequences above, which give an impression of the consistency and diversity of gameplay sequences that the model can generate.

According to Senior Researcher Tabish Rashid: “Being handed an allocation of H100s was initially quite daunting, especially in the early stages figuring out how to make best use of it to scale to larger models with the new image encoders. After months of experimentation, it was immensely rewarding to finally see outputs from the model on a different map (not to knock the lovely greenery of Skygarden) and not have to squint so much at smaller images. I’m sure at this point many of us have watched so many videos from Muse that we’ve forgotten what the real game looks like.”

One of my favorite capabilities of the model is how it can be prompted with modifications of gameplay sequences and persist newly introduced elements. For example, in the demo below, we’ve added a character onto the original visual from the game. Prompting the model with the modified visual, we can see how the model “persists” the added character and generates plausible variants of how the gameplay sequence could have evolved from this modified starting point.

Persistency

Demonstrations of how Muse (based on WHAM-1.6B) can persist modifications. A visual is taken from the original gameplay data and an image of an additional character is edited into the image. The generated gameplay sequence shows how the character is adapted into the generated gameplay sequence.

Conclusion

Today, our team is excited to be publishing our work in Nature and simultaneously releasing Muse open weights, the WHAM Demonstrator, and sample data to the community.

I look forward to seeing the many ways in which the community will explore these models and build on our research. I cannot wait to see all the ways that these models and subsequent research will help shape and increase our understanding of how generative AI models of human gameplay may support gameplay ideation and pave the way for future, novel, AI-based game experiences, including the use cases that our colleagues at Xbox (opens in new tab) have already started to explore.


The post Introducing Muse: Our first generative AI model designed for gameplay ideation appeared first on Microsoft Research.

Categories: Microsoft

Microsoft Research and Physics Wallah team up to enhance AI-based tutoring

Wed, 02/12/2025 - 23:01

In India, limited resources, geographical constraints, and economic factors present barriers to quality education for some students.

A shortage of teachers, particularly in remote or low-income areas, makes it harder for students to receive the guidance they need to prepare for highly competitive professional and academic programs. Microsoft Research is developing new algorithms and techniques that are enabling Physics Wallah (opens in new tab), a growing educational company, to make its AI-based tutoring services more accurate and reliable, to better support students on their education journey.

As in other countries, many Indian students purchase coaching and tutoring services to prepare for entrance exams at top institutions. This includes offline coaching, where hundreds of students meet in a classroom staffed by teachers covering a structured curriculum. Online coaching enables students to learn remotely in a virtual classroom. Hybrid coaching delivers virtual lessons in a physical classroom.

Offline courses can cost as much as 100,000 Indian rupees a year—equivalent to hundreds of U.S. dollars. This puts them out of reach for many lower income students living in smaller and mid-sized Indian cities, as well as rural villages. Online courses are much more affordable. They allow students to work at their own pace by providing high-quality web-based content supported by teachers who work remotely.

Vineet Govil

Meeting this need is the mission of Physics Wallah. The company uses AI to offer on-demand tutoring at scale, curating volumes of standard science- and math-related content to provide the best answers. Some 2 million students use the Physics Wallah platform every day, at a fraction of the cost of offline tutoring. For example, its prep courses for the Joint Entrance Examination (JEE), which is required for admission to engineering and technology programs, and the National Eligibility cum Entrance Test (NEET), a required entrance exam for medical and dental school candidates, cost between 4,200 and 4,500 rupees per year. That’s roughly 50 U.S. dollars.

“The mantra here really is how do we provide quality education in an affordable manner and accessible to every student, regardless of who they are or where they come from.”

—Vineet Govil, Chief Technology and Product Officer, Physics Wallah

Microsoft Research India’s collaboration with Physics Wallah is part of a 20-year legacy of supporting emerging Indian companies, underscored by the January 2025 announcement that Microsoft will invest $3 billion (opens in new tab) in cloud and AI infrastructure to accelerate the adoption of AI, skilling, and innovation.  

Physics Wallah has developed an AI-driven educational suite, Alakh AI, leveraging OpenAI’s GPT-4o model through Microsoft Azure OpenAI Service. Alakh AI’s flagship offerings include AI Guru and the Smart Doubt Engine, both designed to transform the learning experience in and beyond the classroom.

  • AI Guru acts as a personal academic tutor, delivering adaptive guidance based on a student’s progress, real-time question-solving, and customized content that evolves with their learning journey.
  • Smart Doubt Engine is an AI tool through which students can ask questions (also known as “doubts” in Indian English) during live classes and receive instant responses.

Additionally, the Alakh AI suite includes:

  • AI Grader for subjective answer evaluation without human intervention
  • Sahayak for crafting hyper-personalized learning paths tailored to individual students’ needs

This innovative ecosystem elevates learning efficiency and accessibility for students.

AI Guru in action – A student asks, “Explain Newton’s First Law,” and the AI tutor provides a detailed explanation along with two videos for further learning.

Smart Doubt Engine in action – A student asks a clarifying question during a live class, and the AI provides a detailed explanation in real time.

How does AI Guru work?

Let’s say a student had a question about Newton’s laws of motion, a core concept in physics. She would type her query into the AI Guru chat window (she could also just talk to it or upload an image from a textbook) and receive a text answer plus images derived from standard textbooks and curated content, typically in just a few seconds. AI Guru also provides a short video where a teacher offers additional context.

Getting the technology right

The Alakh AI suite is powered by OpenAI’s foundational models GPT-4 and GPT-4o, integrated with a retrieval-augmented generation (RAG) architecture. It leverages Physics Wallah’s rich repository of high-quality curated content—developed and refined over several years—along with continuous updates from subject matter experts to ensure new materials, textbooks, tutorials, and question banks are seamlessly incorporated. Despite considerable progress, the existing AI sometimes falters when navigating complex academic problems.
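As background, the retrieval-augmented pattern described above can be sketched roughly as follows. This is our own illustration of the general flow, not Physics Wallah’s implementation: the `search_curated_content` retriever and the `chat_completion` helper are placeholders for a real vector index over curated content and a real chat-completion client.

```python
# Illustrative RAG flow: retrieve curated content, then ground the model's answer in it.
# `search_curated_content` and `chat_completion` are placeholders, not a real API.

def search_curated_content(query: str, top_k: int = 3) -> list[str]:
    """Placeholder retriever over curated textbooks, tutorials, and question banks."""
    return ["Newton's first law states that an object remains at rest or in uniform "
            "motion unless acted upon by a net external force."][:top_k]

def answer_student_query(query: str, chat_completion) -> str:
    passages = search_curated_content(query)
    context = "\n\n".join(passages)
    prompt = (
        "You are a tutor. Answer the student's question step by step, using only "
        "the reference material below.\n\n"
        f"Reference material:\n{context}\n\nQuestion: {query}"
    )
    return chat_completion(prompt)

if __name__ == "__main__":
    fake_model = lambda prompt: "By Newton's first law, the book stays at rest until a net force acts on it."
    print(answer_student_query("Explain Newton's First Law", fake_model))
```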

“The accuracy level of today’s large language models (LLMs) is not up to the mark where we can provide reliable and satisfactory answers to the students all the time—specifically, if it’s a hard mathematical problem involving complex equations,” Govil said.

That’s one important focus of the collaboration. Researchers from Microsoft Research are developing new algorithms and techniques to enhance the accuracy and reasoning capabilities of AI models. They are now collaborating with Physics Wallah to apply these advancements to the Alakh AI suite, improving its ability to solve complex problems and provide more reliable, step-by-step guidance to students. A key challenge is the nature of student queries, which are often ambiguous and involve multimodal inputs—text, images, videos, or audio—requiring unified capabilities to address the problem. Many STEM problems require breaking down complex queries into logical sub-problems and applying high-order, step-by-step reasoning for consistency. Additionally, integrating domain-specific knowledge in advanced math, physics, chemistry, and biology requires contextualization and seamless retrieval of specialized, grade-appropriate information. 

Microsoft Research is working with Physics Wallah to move beyond traditional next-token prediction and develop AI systems that approach reliable, systematic, step-by-step problem-solving.

That includes ongoing work to enhance the model’s reasoning capabilities and deliver more accurate query answers on complex JEE math problems. Instead of just providing the final answer, the underlying models now break problems into step-by-step solutions. That helps students learn how to solve the actual problems. The AI can also review student answers, detect mistakes, and give detailed feedback, acting as a personal tutor to guide students, improve their understanding, and enhance their learning experience.

Microsoft research blog

PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts

PromptWizard from Microsoft Research is now open source. It is designed to automate and simplify AI prompt optimization, combining iterative LLM feedback with efficient exploration and refinement techniques to create highly effective prompts in minutes.

Read more

Solving complex problems requires enhancing the reasoning capabilities of both large and small language models by training them to not just generate answers, but to systematically think through and reason about complex problems. This requires high-quality reasoning traces—detailed, step-by-step breakdowns of logical problem-solving processes.

To enable this, researchers collaborated with Physics Wallah to curate a dataset of 150,000 high-quality math reasoning traces. These traces serve as the foundation for training specialized small language models (SLMs) using supervised fine-tuning (SFT). Model performance is further refined through training on carefully curated on-policy preference data, ensuring alignment with high-quality reasoning standards. The team’s current Phi-based models have already outperformed leading LLMs and other baselines on complex math problems.
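To illustrate what a supervised fine-tuning example built from such a reasoning trace might look like, here is a minimal sketch. It is our own illustration: the problem/steps/final_answer schema and the prompt formatting are assumptions, not the actual dataset format used in the collaboration.

```python
# Illustrative only: turning a curated reasoning trace into a supervised
# fine-tuning example (prompt plus step-by-step completion).

def format_trace_for_sft(example: dict) -> dict:
    steps = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(example["steps"]))
    prompt = f"Solve the following problem step by step.\n\nProblem: {example['problem']}\n"
    completion = f"{steps}\nFinal answer: {example['final_answer']}"
    # During SFT, the loss is typically computed only on the completion tokens.
    return {"prompt": prompt, "completion": completion}

if __name__ == "__main__":
    trace = {
        "problem": "If 3x + 5 = 20, what is x?",
        "steps": ["Subtract 5 from both sides: 3x = 15.", "Divide both sides by 3: x = 5."],
        "final_answer": "x = 5",
    }
    print(format_trace_for_sft(trace))
```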

“Building AI systems capable of human-like thinking and reasoning represents a significant challenge.”

—Akshay Nambi, Principal Researcher at Microsoft Research India

The next step is to develop a self-evolving learning pipeline using online reinforcement learning techniques, allowing the model to continuously generate high-quality synthetic data that further enhances its capabilities. Additionally, researchers are building a reward model and integrating it with Monte Carlo Tree Search (MCTS) to optimize reasoning and improve inference-time decision-making.

“The goal is to develop tools that complement education. To do this, we are enhancing the model’s capabilities to process, break down, and solve problems step-by-step. We do this by incorporating high-quality data into training to teach the model how to approach such tasks, alongside algorithmic innovations that enable the model to think and reason more effectively.”

Opening new doors for students

Chandramouleswar Parida

Getting an education at a top university can be life changing for anyone. For Chandramouleswar Parida, it could change the lives of everyone in his home village in Baniatangi, Khordha, Odisha State, India. Chandra decided to become a doctor after watching his grandfather die from a heart attack. The nearest doctor who could have treated him was at a regional hospital 65 kilometers away.

“He could have been saved if certain procedures had been followed,” Chandra said. He wants to study medicine, perhaps receiving advanced training overseas, and then return home. “I want to be a doctor here in our village and serve our people, because there is a lack of treatment. Being a doctor is a very noble kind of job in this society.”

Chandra is the only student in Baniatangi Village, Khordha, Odisha, currently preparing for the NEET. Without Physics Wallah, students like Chandra would likely have no access to this kind of support, because such resources can’t be found locally.

Anushka Sunil Dhanwade

Another student, Anushka Sunil Dhanwade, is optimistic that Physics Wallah will help her dramatically improve her initial score on the NEET exam. While in 11th class, or grade, she joined an online NEET prep class with 800 students. But she struggled to follow the coursework, as the teachers tailored the content to the strongest students. After posting a low score on the NEET exam, her hopes of becoming a doctor were fading.

But after a serious stomach illness reminded her of the value of having a doctor in her family, she tried again, this time with Physics Wallah and AI Guru. After finishing 12th class, she began preparing for NEET and plans to take the exams again in May, confident that she will increase her score.

“AI Guru has made my learning so smooth and easy because it provides me answers related to my study and study-related doubt just within a click.”

—Anushka Sunil Dhanwade, Student

Next steps in the collaboration

The collaboration between Microsoft Research and Physics Wallah aims to apply the advancements in solving math problems across additional subjects, ultimately creating a unified education LLM with enhanced reasoning capabilities and improved accuracy to support student learning.

“We’re working on an education-specific LLM that will be fine-tuned using the extensive data we’ve gathered and enriched by Microsoft’s expertise in LLM training and algorithms. Our goal is to create a unified model that significantly improves accuracy and raises student satisfaction rates to 95% and beyond,” Govil explained.

The teams are also integrating a new tool from Microsoft Research called PromptWizard (opens in new tab), an automated framework for optimizing the instructions given to a model, into Physics Wallah’s offerings. New prompts can now be generated in minutes, eliminating months of manual work, while providing more accurate and aligned answers for students.

For Nambi and the Microsoft Research India team, the collaboration is the latest example of their deep commitment to cultivating the AI ecosystem in India and translating new technology from the lab into useful business applications.

“By leveraging advanced reasoning techniques and domain expertise, we are transforming how AI addresses challenges across multiple subjects. This represents a key step in building AI systems that act as holistic personal tutors, enhancing student understanding and creating a more engaging learning experience,” Nambi said.

Explore more

The post Microsoft Research and Physics Wallah team up to enhance AI-based tutoring appeared first on Microsoft Research.

Categories: Microsoft

Advances to low-bit quantization enable LLMs on edge devices

Wed, 02/05/2025 - 19:32

Large language models (LLMs) are increasingly being deployed on edge devices—hardware that processes data locally near the data source, such as smartphones, laptops, and robots. Running LLMs on these devices supports advanced AI and real-time services, but their massive size, with billions of parameters, requires significant memory and computational power, limiting widespread adoption. Low-bit quantization, a technique that compresses models and reduces memory demands, offers a solution by enabling more efficient operation.

Recent advances in low-bit quantization have made mixed-precision matrix multiplication (mpGEMM) viable for LLMs. This deep learning technique allows data of the same or different formats to be multiplied, such as int8*int1, int8*int2, or FP16*int4. By combining a variety of precision levels, mpGEMM strikes a balance among speed, memory efficiency, and computational accuracy. 
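
As a rough illustration of what mpGEMM computes, the following NumPy sketch multiplies int8 activations by 4-bit weights (held in an int8 container) and accumulates in int32 before dequantizing. This is a reference computation only, not the optimized kernels discussed in this post, and the scale values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)

    # int8 activations and int4 weights (values in [-8, 7]) held in an int8 container.
    A = rng.integers(-128, 128, size=(4, 64), dtype=np.int8)    # activations
    W = rng.integers(-8, 8, size=(64, 16), dtype=np.int8)       # low-bit weights
    scale_a, scale_w = 0.02, 0.05                               # arbitrary per-tensor scales

    # Accumulate in int32 to avoid overflow, then dequantize to float.
    acc = A.astype(np.int32) @ W.astype(np.int32)
    Y = acc.astype(np.float32) * (scale_a * scale_w)
    print(Y.shape, Y.dtype)    # (4, 16) float32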

However, most hardware supports only symmetric computations—operations on data of similar formats—creating challenges for mixed-precision calculations during General Matrix Multiplication (GEMM), a critical operation for LLMs. Overcoming these hardware limitations is essential to fully benefit from mpGEMM and support asymmetrical computations. 

To unlock the potential of low-bit quantization on resource-constrained edge devices, hardware must natively support mpGEMM. To address this, we developed the following three approaches for computing kernels and hardware architectures: 

  • Ladder data type compiler: Supports various low-precision data types by converting unsupported types into hardware-compatible ones without data loss, while also generating high-performance conversion code. 
  • T-MAC mpGEMM library: Implements GEMM using a lookup table (LUT) approach, eliminating multiplications to significantly reduce computational overhead. Optimized for diverse CPUs, T-MAC delivers several times the speed of other libraries. 
  • LUT Tensor Core hardware architecture: Introduces a cutting-edge design for next-generation AI hardware, tailored for low-bit quantization and mixed-precision computations.

The following sections describe these techniques in detail.

Ladder: Bridging the gap between custom data and hardware limits

Cutting-edge hardware accelerators, such as GPUs, TPUs, and specialized chips, are designed to speed up computationally intensive tasks like deep learning by efficiently handling large-scale operations. Alongside standard FP32 support, these accelerators now integrate lower-bit computing units, such as FP16 and even FP8, into their architectures.

However, constraints in chip area and hardware costs limit the availability of these units for standard data types. For instance, the NVIDIA V100 Tensor Core GPU supports only FP16, while the A100 supports int2, int4, and int8 but not newer formats like FP8 or OCP-MXFP. Additionally, the rapid development of LLMs often outpaces hardware upgrades, leaving many new data formats unsupported and complicating deployment.

However, even when hardware accelerators lack direct support for custom data types, their memory systems can store those types in fixed-width data blocks, which can hold any data format. For instance, NF4 tensors can be converted into FP16 or FP32 for floating-point operations.

Building on these insights, we developed the Ladder data type compiler, a method to separate data storage from computation, enabling broader support for custom data types. It bridges the gap between emerging custom data formats and the precision types supported by current hardware.

Ladder offers a flexible system for converting between algorithm-specific and hardware-supported data types without data loss. For low-bit applications, it optimizes performance by translating low-bit data into the most efficient formats for the hardware being used. As shown in Figure 1, this includes mapping low-bit computations to supported instructions and efficiently managing data storage across the memory hierarchy. 
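
The core idea, separating how data is stored from how it is computed, can be illustrated with a small NumPy sketch: weights are stored in a packed int4 format that the hardware cannot multiply directly, then losslessly unpacked to FP16 for the actual computation. This is only an illustration of the concept, not code generated by Ladder.

    import numpy as np

    def pack_int4(values):
        # Pack pairs of int4 values (range [-8, 7]) into one uint8 each.
        v = (values.astype(np.int16) & 0x0F).astype(np.uint8).reshape(-1, 2)
        return (v[:, 0] | (v[:, 1] << 4)).astype(np.uint8)

    def unpack_int4_to_fp16(packed):
        # Unpack to signed int4 values, then convert losslessly to FP16 for computation.
        lo = (packed & 0x0F).astype(np.int16)
        hi = ((packed >> 4) & 0x0F).astype(np.int16)
        signed = np.stack([lo, hi], axis=1).reshape(-1)
        signed = np.where(signed >= 8, signed - 16, signed)    # restore the sign
        return signed.astype(np.float16)

    w = np.random.default_rng(1).integers(-8, 8, size=128)
    roundtrip = unpack_int4_to_fp16(pack_int4(w))
    assert np.array_equal(roundtrip.astype(np.int64), w)       # storage format is lossless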

Figure 1: The Ladder architecture

Evaluating Ladder

Evaluations of Ladder on NVIDIA and AMD GPUs show that it outperforms existing deep neural network (DNN) compilers for natively supported data types. It also handles custom data types not supported by GPUs, achieving speedups of up to 14.6 times. 

As the first system to support custom low-precision data types for running DNNs on modern hardware accelerators, Ladder provides researchers with flexibility in optimizing data types. It also enables hardware developers to support a wider range of data types without requiring hardware modifications. 

T-MAC: Table-lookup for mpGEMM without multiplication

Deploying low-bit quantized LLMs on edge devices often requires dequantizing models to ensure hardware compatibility. However, this approach has two major drawbacks: 

  1. Performance: Dequantization overhead can result in poor performance, negating the benefits of low-bit quantization.
  2. Development: Developers must redesign data layouts and kernels for different mixed precisions.

To address these challenges, we introduce T-MAC, a novel LUT-based method that enables mpGEMM without dequantization or multiplication. 

T-MAC replaces traditional multiplication operations with bit-wise table lookups, offering a unified and scalable solution for mpGEMM. It incorporates techniques to reduce the size of tables and store them directly on the chip, minimizing the overhead of accessing data from memory. By eliminating dequantization and lowering computational costs, T-MAC enables efficient inference of low-bit LLMs on resource-constrained edge devices. Figure 2 illustrates T-MAC’s architecture. 
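
To illustrate the lookup-table idea in its simplest form, the toy sketch below computes a dot product with 1-bit ({-1, +1}) weights: for each group of four activations, the 16 possible signed sums are precomputed once, and the weight bits are then used only as a table index. T-MAC's real kernels use optimized bit-serial tables and CPU-specific instructions; this sketch only conveys the concept.

    import numpy as np
    from itertools import product

    G = 4                                            # activations per lookup group
    rng = np.random.default_rng(0)
    x = rng.standard_normal(64).astype(np.float32)   # activations
    w_bits = rng.integers(0, 2, size=64)             # 1-bit weights: 0 -> -1, 1 -> +1

    def build_table(group):
        # All 2^G signed sums of one activation group, indexed by the weight bit pattern.
        signs = np.array(list(product([-1.0, 1.0], repeat=G)), dtype=np.float32)   # (16, 4)
        return signs @ group                                                        # (16,)

    def lut_dot(x, w_bits):
        total = 0.0
        for i in range(0, len(x), G):
            table = build_table(x[i:i + G])
            idx = int(np.dot(w_bits[i:i + G], [8, 4, 2, 1]))    # weight bits -> table index
            total += float(table[idx])                           # lookup replaces multiplies
        return total

    reference = float(np.dot(x, np.where(w_bits == 1, 1.0, -1.0)))
    assert abs(lut_dot(x, w_bits) - reference) < 1e-3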

Figure 2. Overview of the T-MAC system

Evaluating T-MAC

Performance evaluations of T-MAC on low-bit models demonstrated substantial benefits in efficiency and speed. On the Surface Laptop 7 with the Qualcomm Snapdragon X Elite chipset, T-MAC achieved: 

  • 48 tokens per second for the 3B BitNet-b1.58 model 
  • 30 tokens per second for the 2-bit 7B Llama model 
  • 20 tokens per second for the 4-bit 7B Llama model

These speeds far exceed average human reading rates, outperforming llama.cpp by 4–5 times and doubling the speed of a dedicated NPU accelerator. Even on lower-end devices like the Raspberry Pi 5, T-MAC made it possible for the 3B BitNet-b1.58 model to generate 11 tokens per second. It also proved highly power-efficient, matching llama.cpp’s generation rate while using only 1/4 to 1/6 of the CPU cores.

These results establish T-MAC as a practical solution for deploying LLMs on edge devices with standard CPUs, without relying on GPUs or NPUs. T-MAC allows LLMs to run efficiently on resource-constrained devices, expanding their applicability across a wider range of scenarios.

LUT Tensor Core: Driving hardware for mpGEMM

While T-MAC and Ladder optimize mpGEMM on existing CPU and GPU architectures, improving computational efficiency, they cannot match the performance of dedicated hardware accelerators with built-in LUT support. Achieving significant improvements in performance, power, and area (PPA) requires overcoming four key challenges:

  1. Table precompute and storage: Precomputing and storing LUTs add overhead, increasing area usage, latency, and storage requirements, which can reduce overall efficiency gains.
  2. Bit-width flexibility: Hardware must support various precision levels, such as int4/2/1 for weights and FP16/8 or int8 for activations, along with their combinations. This flexibility is crucial for accommodating diverse model architectures and use cases.
  3. LUT tiling shape: Inefficient tiling shapes can raise storage costs and limit reuse opportunities, adversely affecting performance and efficiency.
  4. Instruction and compilation: LUT-based mpGEMM requires a new instruction set. Existing compilation stacks, designed for standard GEMM hardware, may not optimally map and schedule these instructions, complicating integration with LLM inference software.

In response, we developed LUT Tensor Core, a software-hardware codesign for low-bit LLM inference. To address precomputation overhead in conventional LUT-based methods, we introduce techniques like software-based DFG transformation, operator fusion, and table symmetrization to optimize table precomputation and storage. Additionally, we propose a hardware design with an elongated tiling shape to support table reuse and a bit-serial design to handle various precision combinations in mpGEMM.

To integrate with existing GPU microarchitectures and software stacks, we extended the MMA instruction set, added new LMMA instructions, and developed a cuBLAS-like software stack for easy integration into existing DNN frameworks. We also created a compiler for end-to-end execution planning on GPUs with LUT Tensor Core. This design and workflow, illustrated in Figure 3, enabled the quick and seamless adoption of LUT Tensor Core.

Figure 3. The LUT Tensor Core workflow

Evaluating LUT Tensor Core

Testing LUT Tensor Core on low-bit LLMs, such as BitNet and Llama, showed significant performance gains, achieving 6.93 times the inference speed while using just 38.3% of the area of a traditional Tensor Core. With nearly identical model accuracy, this results in a 20.9-fold increase in computational density and an 11.2-fold boost in energy efficiency. As AI models grow in scale and complexity, LUT Tensor Core enables low-bit LLMs to be applied in new and diverse scenarios.

We believe the LUT technique could drive a paradigm shift in AI model inference. Traditional methods rely on multiplication and accumulation operations, whereas LUT implementations provide higher transistor density, greater throughput per chip area, lower energy costs, and better scalability. As large models adopt low-bit quantization, the LUT method could become the standard for system and hardware design, advancing the next generation of AI hardware innovation.

Unlocking new possibilities for embodied AI

Low-bit quantization improves the efficiency of running large models on edge devices while also enabling model scaling by reducing the bits used to represent each parameter. This scaling enhances model capabilities, generality, and expressiveness, as shown by the BitNet model, which starts with a low-bit configuration and expands.

Technologies like T-MAC, Ladder, and LUT Tensor Core provide solutions for running low-bit quantized LLMs, supporting efficient operation across edge devices and encouraging researchers to design and optimize LLMs using low-bit quantization. By reducing memory and computational demands, low-bit LLMs could power embodied AI systems, such as robots, enabling dynamic perception and real-time environmental interaction.

T-MAC (opens in new tab) and Ladder (opens in new tab) are open source and available on GitHub. We invite you to test and explore these innovations in AI technology with Microsoft Research.


The post Advances to low-bit quantization enable LLMs on edge devices appeared first on Microsoft Research.

Categories: Microsoft

Research Focus: Week of January 13, 2025

Fri, 01/17/2025 - 19:37

In this edition:

  • We introduce privacy enhancements for multiparty deep learning, a framework using smaller, open-source models to provide relevance judgments, and other notable new research.
  • We congratulate Yasuyuki Matsushita, who was named an IEEE Computer Society Fellow.
  • We’ve included a recap of the extraordinary, far-reaching work done by researchers at Microsoft in 2024.  
AI meets materials discovery

Two of the transformative tools that play a central role in Microsoft’s work on AI for science are MatterGen and MatterSim. In the world of materials discovery, each plays a distinct yet complementary role in reshaping how researchers design and validate new materials.

Read the story

NEW RESEARCH

Communication Efficient Secure and Private Multi-Party Deep Learning

Distributed training enables multiple parties to jointly train a machine learning model on their respective datasets, which can help address the challenges posed by requirements in modern machine learning for large volumes of diverse data. However, this can raise security and privacy issues – protecting each party’s data during training and preventing leakage of private information from the model after training through various inference attacks.  

In a recent paper, Communication Efficient Secure and Private Multi-Party Deep Learning, researchers from Microsoft address these concerns simultaneously by designing efficient Differentially Private, secure Multiparty Computation (DP-MPC) protocols for jointly training a model on data distributed among multiple parties. This DP-MPC protocol in the two-party setting is 56-to-794 times more communication-efficient and 16-to-182 times faster than previous such protocols. This work simplifies and improves on previous attempts to combine techniques from secure multiparty computation and differential privacy, especially in the context of training machine learning models. 

Read the paper

NEW RESEARCH

JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment

Training and evaluating retrieval systems requires significant relevance judgments, which are traditionally collected from human assessors. This process is both costly and time-consuming. Large language models (LLMs) have shown promise in generating relevance labels for search tasks, offering a potential alternative to manual assessments. Current approaches often rely on a single LLM. While effective, this approach can be expensive and prone to intra-model biases that can favor systems leveraging similar models.

In a recent paper: JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment, researchers from Microsoft introduce a framework that employs smaller, open-source models to provide relevance judgments by combining evaluations across multiple LLMs (LLMBlender) or multiple prompts (PromptBlender). By leveraging the LLMJudge benchmark, they compare JudgeBlender with state-of-the-art methods and the top performers in the LLMJudge challenge. This research shows that JudgeBlender achieves competitive performance, demonstrating that very large models are often unnecessary for reliable relevance assessments.

Read the paper

NEW RESEARCH

Convergence to Equilibrium of No-regret Dynamics in Congestion Games

Congestion games are used to describe the behavior of agents who share a set of resources. Each player chooses a combination of resources, which may become congested, decreasing utility for the players who choose them. Players can avoid congestion by choosing combinations that are less popular. This is useful for modeling a range of real-world scenarios, such as traffic flow, data routing, and wireless communication networks.

In a recent paper: Convergence to Equilibrium of No-regret Dynamics in Congestion Games, researchers from Microsoft and external colleagues propose CongestEXP, a decentralized algorithm based on the classic exponential weights method. They evaluate CongestEXP in a traffic congestion game setting. As more drivers use a particular route, congestion increases, leading to higher travel times and lower utility. Players can choose a different route every day to optimize their utility, but the utility observed by each player may be subject to randomness due to uncertainty (e.g., bad weather). The researchers show that this approach provides both regret guarantees and convergence to Nash Equilibrium, where no player can unilaterally improve their outcome by changing their strategy.
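
For intuition, here is a generic exponential-weights update over a handful of routes, the classic method that CongestEXP builds on. The losses below are synthetic travel times; this is not the paper's exact algorithm.

    import numpy as np

    def exp_weights(losses_per_day, eta=0.1):
        # losses_per_day: (days, routes) array of observed travel times scaled to [0, 1].
        # Returns the final probability of picking each route.
        weights = np.ones(losses_per_day.shape[1])
        for losses in losses_per_day:
            weights *= np.exp(-eta * losses)    # routes with lower loss gain weight
        return weights / weights.sum()

    rng = np.random.default_rng(0)
    losses = rng.uniform(0.2, 0.8, size=(100, 3))
    losses[:, 0] -= 0.15                        # route 0 is slightly faster on average
    print(exp_weights(losses))                  # probability mass concentrates on route 0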

Read the paper

NEW RESEARCH

RD-Agent: An open-source solution for smarter R&D

Research and development (R&D) plays a pivotal role in boosting industrial productivity. However, the rapid advance of AI has exposed the limitations of traditional R&D automation. Current methods often lack the intelligence needed to support innovative research and complex development tasks, underperforming human experts with deep knowledge.

LLMs trained on vast datasets spanning many subjects are equipped with extensive knowledge and reasoning capabilities that support complex decision-making in diverse workflows. By autonomously performing tasks and analyzing data, LLMs can significantly increase the efficiency and precision of R&D processes.

In a recent article, researchers from Microsoft introduce RD-Agent, a tool that integrates data-driven R&D systems and harnesses advanced AI to automate innovation and development.

At the heart of RD-Agent is an autonomous agent framework with two key components: a) Research and b) Development. Research focuses on actively exploring and generating new ideas, while Development implements these ideas. Both components improve through an iterative process, illustrated in Figure 1 of the article, which ensures the system becomes increasingly effective over time.

Read the article

Microsoft Research | In case you missed it

Microsoft Research 2024: A year in review

December 20, 2024

Microsoft Research did extraordinary work this year, using AI and scientific research to make progress on real-world challenges like climate change, food security, global health, and human trafficking. Here’s a look back at the broad range of accomplishments and advances in 2024.

AIOpsLab: Building AI agents for autonomous clouds 

December 20, 2024

AIOpsLab is a holistic evaluation framework for researchers and developers, to enable the design, development, evaluation, and enhancement of AIOps agents, which also serves the purpose of reproducible, standardized, interoperable, and scalable benchmarks.

Yasuyuki Matsushita, IEEE Computer Society 2025 Fellow 

December 19, 2024

Congratulations to Yasuyuki Matsushita, Senior Principal Research Manager at Microsoft Research, who was named a 2025 IEEE Computer Society Fellow. Matsushita was recognized for contributions to photometric 3D modeling and computational photography.

View more news and awards

The post Research Focus: Week of January 13, 2025 appeared first on Microsoft Research.

Categories: Microsoft

MatterGen: A new paradigm of materials design with generative AI 

Thu, 01/16/2025 - 12:05

Materials innovation is one of the key drivers of major technological breakthroughs. The discovery of lithium cobalt oxide in the 1980s laid the groundwork for today’s lithium-ion battery technology. It now powers modern mobile phones and electric cars, impacting the daily lives of billions of people. Materials innovation is also required for designing more efficient solar cells, cheaper batteries for grid-level energy storage, and adsorbents to recycle CO2 from the atmosphere.

Finding a new material for a target application is like finding a needle in a haystack. Historically, this task has been done via expensive and time-consuming experimental trial-and-error. More recently, computational screening of large materials databases has allowed researchers to speed up this process. Nonetheless, finding the few materials with the desired properties still requires the screening of millions of candidates. 

Today, in a paper published in Nature (opens in new tab), we share MatterGen, a generative AI tool that tackles materials discovery from a different angle. Instead of screening the candidates, it directly generates novel materials given prompts of the design requirements for an application. It can generate materials with desired chemistry, mechanical, electronic, or magnetic properties, as well as combinations of different constraints. MatterGen enables a new paradigm of generative AI-assisted materials design that allows for efficient exploration of materials, going beyond the limited set of known ones.   

Figure 1: Schematic representation of screening and generative approaches to materials design

A novel diffusion architecture

MatterGen is a diffusion model that operates on the 3D geometry of materials. Much like an image diffusion model generates pictures from a text prompt by modifying the color of pixels from a noisy image, MatterGen generates proposed structures by adjusting the positions, elements, and periodic lattice from a random structure. The diffusion architecture is specifically designed for materials, handling properties such as periodicity and 3D geometry.
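
A toy sketch of that analogy, under heavy simplification: starting from random fractional atom coordinates, repeatedly apply a denoising update with shrinking noise while wrapping positions back into the unit cell to respect periodicity. The score_model below is a placeholder; MatterGen's actual model also diffuses element identities and the lattice with materials-specific noise processes.

    import numpy as np

    rng = np.random.default_rng(0)
    n_atoms, n_steps = 8, 50

    def score_model(x, t):
        # Placeholder denoiser: nudges atoms toward an arbitrary target arrangement.
        target = np.tile(np.linspace(0.1, 0.9, n_atoms)[:, None], (1, 3))
        return target - x

    x = rng.random((n_atoms, 3))                  # start from random fractional coordinates
    for step in range(n_steps):
        t = 1.0 - step / n_steps                  # noise level shrinks over time
        x = x + 0.1 * score_model(x, t) + 0.02 * t * rng.standard_normal(x.shape)
        x %= 1.0                                  # wrap into the unit cell (periodicity)
    print(np.round(x, 2))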

Figure 2: Schematic representation of MatterGen: a diffusion model to generate novel and stable materials. MatterGen can be fine-tuned to generate materials under different design requirements such as specific chemistry, crystal symmetry, or materials’ properties.  

The base model of MatterGen achieves state-of-the-art performance in generating novel, stable, diverse materials (Figure 3). It is trained on 608,000 stable materials from the Materials Project (opens in new tab) (MP) and Alexandria (opens in new tab) (Alex) databases. The performance improvement can be attributed to both the architecture advancements, as well as the quality and size of our training data.  

Figure 3: Performance of MatterGen and other methods in the generation of stable, unique, and novel structures. The training dataset for each method is indicated in parentheses. The purple bar highlights performance improvements due to MatterGen’s architecture alone, while the teal bar highlights performance improvements that come also from the larger training dataset. 

MatterGen can be fine-tuned with a labelled dataset to generate novel materials given any desired conditions. We demonstrate examples of generating novel materials given a target’s chemistry and symmetry, as well as electronic, magnetic, and mechanical property constraints (Figure 2).  

Outperforming screening

Figure 4: Performance of MatterGen (teal) and traditional screening (yellow) in finding novel, stable, and unique structures that satisfy the design requirement of having bulk modulus greater than 400 GPa.

The key advantage of MatterGen over screening is its ability to access the full space of unknown materials. In Figure 4, we show that MatterGen continues to generate novel candidate materials with a high bulk modulus (above 400 GPa), meaning they are hard to compress. In contrast, the screening baseline saturates because it eventually exhausts the pool of known candidates.

Handling compositional disorder

Figure 5: Illustration of compositional disorder. Left: a perfect crystal without compositional disorder and with a repeating unit cell (black dashed). Right: crystal with compositional disorder, where each site has 50% probability of yellow and teal atoms.

Compositional disorder (Figure 5) is a commonly observed phenomenon where different atoms can randomly swap their crystallographic sites in a synthesized material. Recently (opens in new tab), the community has been exploring what it means for a material to be novel in the context of computationally designed materials, as widely employed algorithms will not distinguish between pairs of structures where the only difference is a permutation of similar elements in their respective sites.

We provide an initial solution to this issue by introducing a new structure matching algorithm that considers compositional disorder. The algorithm assesses whether a pair of structures can be identified as ordered approximations of the same underlying compositionally disordered structure. This provides a new definition of novelty and uniqueness, which we adopt in our computational evaluation metrics. We also make our algorithm publicly available (opens in new tab) as part of our evaluation package. 

Experimental lab verification

Figure 6: Experimental validation of the proposed compound, TaCr2O6

In addition to our extensive computational evaluation, we have validated MatterGen’s capabilities through experimental synthesis. In collaboration with the team led by Prof Li Wenjie from the Shenzhen Institutes of Advanced Technology (opens in new tab) (SIAT) of the Chinese Academy of Sciences, we have synthesized a novel material, TaCr2O6, whose structure was generated by MatterGen after conditioning the model on a bulk modulus value of 200 GPa. The synthesized material’s structure aligns with the one proposed by MatterGen, with the caveat of compositional disorder between Ta and Cr. Additionally, we experimentally measured a bulk modulus of 169 GPa against the 200 GPa given as the design specification, a relative error below 20%, which is small from an experimental perspective. If similar results can be translated to other domains, this approach could have a profound impact on the design of batteries, fuel cells, and more.

AI emulator and generator flywheel 

MatterGen presents a new opportunity for AI-accelerated materials design, complementing our AI emulator MatterSim. MatterSim follows the fifth paradigm of scientific discovery, significantly accelerating simulations of material properties. MatterGen, in turn, accelerates the exploration of new material candidates through property-guided generation. MatterGen and MatterSim can work together as a flywheel to speed up both the simulation and exploration of novel materials.

Making MatterGen available 

We believe the best way to make an impact in materials design is to make our model available to the public. We release the source code of MatterGen (opens in new tab) under the MIT license, together with the training and fine-tuning data. We welcome the community to use and build on top of our model.  

Looking ahead 

MatterGen represents a new paradigm of materials design enabled by generative AI technology. It explores a significantly larger space of materials than screening-based methods. It is also more efficient by guiding materials exploration with prompts. Similar to how generative AI has impacted drug discovery (opens in new tab), it will have profound impact on how we design materials in broad domains including batteries, magnets, and fuel cells. 

We plan to continue our work with external collaborators to further develop and validate the technology. “At the Johns Hopkins University Applied Physics Laboratory (APL), we’re dedicated to the exploration of tools with the potential to advance discovery of novel, mission-enabling materials. That’s why we are interested in understanding the impact that MatterGen could have on materials discovery,” said Christopher Stiles, a computational materials scientist leading multiple materials discovery efforts at APL.

Acknowledgement 

This work is the result of highly collaborative team efforts at Microsoft Research AI for Science. The full authors include: Claudio Zeni, Robert Pinsler, Daniel Zügner, Andrew Fowler, Matthew Horton, Xiang Fu, Zilong Wang, Aliaksandra Shysheya, Jonathan Crabbé, Shoko Ueda, Roberto Sordillo, Lixin Sun, Jake Smith, Bichlien Nguyen, Hannes Schulz, Sarah Lewis, Chin-Wei Huang, Ziheng Lu, Yichi Zhou, Han Yang, Hongxia Hao, Jielan Li, Chunlei Yang, Wenjie Li, Ryota Tomioka, Tian Xie.  


The post MatterGen: A new paradigm of materials design with generative AI  appeared first on Microsoft Research.

Categories: Microsoft

AutoGen v0.4: Reimagining the foundation of agentic AI for scale, extensibility, and robustness

Tue, 01/14/2025 - 16:33

Over the past year, our work on AutoGen has highlighted the transformative potential of agentic AI and multi-agent applications. Today, we are excited to announce AutoGen v0.4, a significant milestone informed by insights from our community of users and developers. This update represents a complete redesign of the AutoGen library, developed to improve code quality, robustness, generality, and scalability in agentic workflows. 

The initial release of AutoGen generated widespread interest in agentic technologies. At the same time, users struggled with architectural constraints, an inefficient API compounded by rapid growth, and limited debugging and intervention functionality. Feedback highlighted the need for stronger observability and control, more flexible multi-agent collaboration patterns, and reusable components. AutoGen v0.4 addresses these issues with its asynchronous, event-driven architecture.

This update makes AutoGen more robust and extensible, enabling a broader range of agentic scenarios. The new framework includes the following features, inspired by feedback from both within and outside Microsoft.  

  • Asynchronous messaging: Agents communicate through asynchronous messages, supporting both event-driven and request/response interaction patterns. 
  • Modular and extensible: Users can easily customize systems with pluggable components, including custom agents, tools, memory, and models. They can also build proactive and long-running agents using event-driven patterns. 
  • Observability and debugging: Built-in metric tracking, message tracing, and debugging tools provide monitoring and control over agent interactions and workflows, with support for OpenTelemetry for industry-standard observability. 
  • Scalable and distributed: Users can design complex, distributed agent networks that operate seamlessly across organizational boundaries. 
  • Built-in and community extensions: The extensions module enhances the framework’s functionality with advanced model clients, agents, multi-agent teams, and tools for agentic workflows. Community support allows open-source developers to manage their own extensions. 
  • Cross-language support: This update enables interoperability between agents built in different programming languages, with current support for Python and .NET and additional languages in development. 
  • Full type support: Interfaces enforce type checks at build time, improving robustness and maintaining code quality.

New AutoGen framework

As shown in Figure 1, the AutoGen framework features a layered architecture with clearly defined responsibilities across the framework, developer tools, and applications. The framework comprises three layers: core, agent chat, and first-party extensions.  

  • Core: The foundational building blocks for an event-driven agentic system.
  • AgentChat: A task-driven, high-level API built on the core layer, featuring group chat, code execution, pre-built agents, and more. This layer is most similar to AutoGen v0.2 (opens in new tab), making it the easiest API to migrate to.
  • Extensions: Implementations of core interfaces and third-party integrations, such as the Azure code executor and OpenAI model client.
Figure 1. The v0.4 update introduces a cohesive AutoGen ecosystem that includes the framework, developer tools, and applications. The framework’s layered architecture clearly defines each layer’s functionality. It supports both first-party and third-party applications and extensions.

Developer tools

In addition to the framework, AutoGen 0.4 includes upgraded programming tools and applications, designed to support developers in building and experimenting with AutoGen. 

AutoGen Bench (opens in new tab): Enables developers to benchmark their agents by measuring and comparing performance across tasks and environments. 

AutoGen Studio (opens in new tab): Rebuilt on the v0.4 AgentChat API, this low-code interface enables rapid prototyping of AI agents. It introduces several new capabilities: 

  • Real-time agent updates: View agent action streams in real time with asynchronous, event-driven messages.  
  • Mid-execution control: Pause conversations, redirect agent actions, and adjust team composition. Then seamlessly resume tasks. 
  • Interactive feedback through the UI: Add a UserProxyAgent to enable user input and guidance during team runs in real time. 
  • Message flow visualization: Understand agent communication through an intuitive visual interface that maps message paths and dependencies. 
  • Drag-and-drop team builder: Design agent teams visually using an interface for dragging components into place and configuring their relationships and properties. 
  • Third-party component galleries: Import and use custom agents, tools, and workflows from external galleries to extend functionality. 

Magentic-One: A new generalist multi-agent application to solve open-ended web and file-based tasks across various domains. This tool marks a significant step toward creating agents capable of completing tasks commonly encountered in both work and personal contexts.

Migrating to AutoGen v0.4

We implemented several measures to facilitate a smooth upgrade from the previous v0.2 API, addressing core differences in the underlying architecture. 

First, the AgentChat API maintains the same level of abstraction as v0.2, making it easy to migrate existing code to v0.4. For example, AgentChat offers an AssistantAgent and UserProxy agent with similar behaviors to those in v0.2. It also provides a team interface with implementations like RoundRobinGroupChat and SelectorGroupChat, which cover all the capabilities of the GroupChat class in v0.2. Additionally, v0.4 introduces many new functionalities, such as streaming messages, improved observability, saving and restoring task progress, and resuming paused actions where they left off.  
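
As a rough sketch of what the v0.4 AgentChat layer looks like in code (module paths and signatures may differ between releases, so treat this as illustrative rather than definitive), two assistant agents can be placed in a round-robin team and given a task:

    import asyncio

    from autogen_agentchat.agents import AssistantAgent
    from autogen_agentchat.teams import RoundRobinGroupChat
    from autogen_agentchat.conditions import TextMentionTermination
    from autogen_ext.models.openai import OpenAIChatCompletionClient

    async def main():
        model_client = OpenAIChatCompletionClient(model="gpt-4o")
        writer = AssistantAgent("writer", model_client=model_client)
        critic = AssistantAgent(
            "critic",
            model_client=model_client,
            system_message="Give feedback; reply with APPROVE when the draft is good.",
        )
        # Agents take turns; the run ends when the critic says "APPROVE".
        team = RoundRobinGroupChat(
            [writer, critic],
            termination_condition=TextMentionTermination("APPROVE"),
        )
        result = await team.run(task="Draft a one-paragraph release note for AutoGen v0.4.")
        print(result)

    asyncio.run(main())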

For detailed guidance, refer to the migration guide (opens in new tab).

Looking forward

This new release sets the stage for a robust ecosystem and strong foundation to drive advances in agentic AI application and research. Our roadmap includes releasing .NET support, introducing built-in, well-designed applications and extensions for challenging domains, and fostering a community-driven ecosystem. We remain committed to the responsible development of AutoGen and its evolving capabilities. 

We encourage you to engage with us on AutoGen’s Discord server (opens in new tab) and share feedback on the official AutoGen repository (opens in new tab) via GitHub Issues. Stay up to date with frequent AutoGen updates via X (opens in new tab).

Acknowledgments

We would like to thank the many individuals whose ideas and insights helped formalize the concepts introduced in this release, including Rajan Chari, Ece Kamar, John Langford, Ching-An Chen, Bob West, Paul Minero, Safoora Yousefi, Will Epperson, Grace Proebsting, Enhao Zhang, and Andrew Ng. 


The post AutoGen v0.4: Reimagining the foundation of agentic AI for scale, extensibility, and robustness appeared first on Microsoft Research.

Categories: Microsoft

AIOpsLab: Building AI agents for autonomous clouds

Sat, 12/21/2024 - 01:56

In our increasingly complex digital landscape, enterprises and cloud providers face significant challenges in the development, deployment, and maintenance of sophisticated IT applications. The broad adoption of microservices and cloud-based serverless architecture has streamlined certain aspects of application development while simultaneously introducing a host of operational difficulties, particularly in fault diagnosis and mitigation. These complexities can result in outages, which have the potential to cause major business disruptions, underscoring the critical need for robust solutions that ensure high availability and reliability in cloud services. As the expectation for five-nines availability grows, organizations must navigate the intricate web of operational demands to maintain customer satisfaction and business continuity. 

To tackle these challenges, recent research on using AIOps agents for cloud operations—such as AI agents for incident root cause analysis (RCA) or triaging—has relied on proprietary services and datasets. Other prior works use frameworks specific to the solutions that they are building, or ad hoc and static benchmarks and metrics that fail to capture the dynamic nature of real-world cloud services. Users developing agents for cloud operations tasks with Azure AI Agent Service can evaluate and improve them using AIOpsLab. Furthermore, current approaches do not agree on standard metrics or a standard taxonomy for operational tasks. This calls for a standardized and principled research framework for building, testing, comparing, and improving AIOps agents. The framework should allow agents to interact with realistic service operation tasks in a reproducible manner. It must be flexible in extending to new applications, workloads, and faults. Importantly, it should go beyond just evaluating the AI agents and enabling users to improve the agents themselves; for example, by providing sufficient observability and even serving as a training environment (“gym”) to generate samples to learn on.  

We developed AIOpsLab, a holistic evaluation framework for researchers and developers that enables the design, development, evaluation, and enhancement of AIOps agents and provides reproducible, standardized, interoperable, and scalable benchmarks. AIOpsLab is open sourced on GitHub (opens in new tab) under the MIT license, so that researchers and engineers can leverage it to evaluate AIOps agents at scale. We recently presented the AIOpsLab vision paper (opens in new tab) at SoCC ’24. Please see the preprint (opens in new tab) for more details about the AIOpsLab framework.

Figure 1. System architecture of AIOpsLab.

Agent-cloud interface (ACI)

AIOpsLab strictly separates the agent and the application service using an intermediate orchestrator. It provides several interfaces for other system parts to integrate and extend. First, it establishes a session with an agent to share information about benchmark problems: (1) the problem description, (2) instructions (e.g., response format), and (3) available APIs to call as actions.

The APIs are a set of documented tools, e.g., get logs, get metrics, and exec shell, designed to help the agent solve a task. There are no restrictions on the agent’s implementation; the orchestrator poses problems and polls it for the next action to perform given the previous result. Each action must be a valid API call, which the orchestrator validates and carries out. The orchestrator has privileged access to the deployment and can take arbitrary actions (e.g., scale-up, redeploy) using appropriate tools (e.g., helm, kubectl) to resolve problems on behalf of the agent. Lastly, the orchestrator calls workload and fault generators to create service disruptions, which serve as live benchmark problems. AIOpsLab provides additional APIs to extend to new services and generators. 

The following example shows how to onboard an agent to AIOpsLab:

    import asyncio

    from aiopslab import Orchestrator

    class Agent:
        def __init__(self, prob, instructs, apis):
            # set_prompt and GPT4 are placeholders for the agent's own prompt builder and LLM client.
            self.prompt = self.set_prompt(prob, instructs, apis)
            self.llm = GPT4()

        async def get_action(self, state: str) -> str:
            return self.llm.generate(self.prompt + state)

    # initialize the orchestrator
    orch = Orchestrator()
    pid = "misconfig_app_hotel_res-mitigation-1"
    prob_desc, instructs, apis = orch.init_problem(pid)

    # register and evaluate the agent
    agent = Agent(prob_desc, instructs, apis)
    orch.register_agent(agent, name="myAgent")
    asyncio.run(orch.start_problem(max_steps=10))

Service

AIOpsLab abstracts a diverse set of services to reflect the variance in production environments. This includes live, running services that are implemented using various architectural principles, including microservices, serverless, and monolithic.

We also leverage open-sourced application suites such as DeathStarBench as they provide artifacts, like source code and commit history, along with run-time telemetry. Adding tools like BluePrint can help AIOpsLab scale to other academic and production services. 

Workload generator

The workload generator in AIOpsLab plays a crucial role by creating simulations of both faulty and normal scenarios. It receives specifications from the orchestrator, such as the task, desired effects, scale, and duration. The generator can use a model trained on real production traces to generate workloads that align with these specifications. Faulty scenarios may simulate conditions like resource exhaustion, exploit edge cases, or trigger cascading failures, inspired by real incidents. Normal scenarios mimic typical production patterns, such as daily activity cycles and multi-user interactions. When various characteristics (e.g., service calls, user distribution, arrival times) can lead to the desired effect, multiple workloads can be stored in the problem cache for use by the orchestrator. In coordination with the fault generator, the workload generator can also create complex fault scenarios with workloads.  

Fault generator

AIOpsLab has a novel push-button fault generator designed for generic applicability across various cloud scenarios. Our approach integrates application and domain knowledge to create adaptable policies and “oracles” compatible with AIOps scenarios. This includes fine-grained fault injection capable of simulating complex failures inspired by production incidents. Additionally, it can inject faults at various system levels, exposing root causes while maintaining semantic integrity and considering interdependencies between cloud microservices. The fault injector’s versatility can enhance the reliability and robustness of cloud systems by enabling thorough testing and evaluation of AIOps capabilities. 

Observability

AIOpsLab is equipped with an extensible observability layer designed to provide comprehensive monitoring capabilities across various system layers for any AIOps tool. AIOpsLab collects a wide array of telemetry data, including (1) traces from Jaeger detailing the end-to-end paths of requests through distributed systems, (2) application logs formatted and recorded by Filebeat and Logstash, and (3) system metrics monitored by Prometheus. Additionally, AIOpsLab also captures lower-level system information such as syscall logs and cluster information. As mentioned, we handle potential data overload through flexible APIs to tune the telemetry data relevant to the AIOps tools. 

AIOpsLab currently supports four key tasks within the AIOps domain: incident detection, localization, root cause diagnosis, and mitigation. Additionally, we support several popular agent frameworks: ReAct, AutoGen, and TaskWeaver. Key insights from the study emphasize the importance of observability and a well-designed ACI. Observability is crucial for clear root-cause diagnosis; for instance, pinpointing a misconfigured API gateway can be vital in preventing service downtime.

Flexibility is another critical factor; the ability to execute arbitrary shell commands allowed for effective troubleshooting in real-time scenarios. Lastly, robust error handling is essential—providing agents with high-quality feedback on execution barriers, such as a failed database connection, ensures swift resolution and continuous improvement. 

Example of ACI on a diagnosis task:

Agent: I need to first look at the logs to understand if there are any errors or warnings that could indicate problems.

Action:
```
get_logs("compose-post-service", "test-social-network")
```

Service: [2024-Aug-04 23:18:49.365494] <info>: … Thrift: Sun Aug 4 23:19:19 2024 TSocket::open() connect() <Host: user-service Port: 9090>: Connection refused

Agent: … unable to connect to the `user-service` at port 9090, even though the pod is running. This suggests a possible network issue or misconfiguration in service discovery.

Next steps

This research project adopts Microsoft security standards and Responsible AI principles, and we envision this research evolving into a vital resource for organizations aiming to optimize their IT operations. Additionally, we plan to collaborate with various generative AI teams to incorporate AIOpsLab as a benchmark scenario for evaluating state-of-the-art models. By doing so, we aim to foster innovation and encourage the development of more advanced AIOps solutions. This research is essential not only for IT professionals but also for anyone invested in the future of technology, as it has the potential to redefine how organizations manage operations, respond to incidents, and ultimately serve their customers in an increasingly automated world. 

Acknowledgements

We would like to thank Yinfang Chen, Manish Shetty, Yogesh Simmhan, Xuchao Zhang, Jonathan Mace, Dax Vandevoorde, Pedro Las-Casas, Shachee Mishra Gupta, and Suman Nath, for contributing to this project.


The post AIOpsLab: Building AI agents for autonomous clouds appeared first on Microsoft Research.

Categories: Microsoft

Research Focus: Week of December 16, 2024

Wed, 12/18/2024 - 19:00

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering

The Compute Express Link (CXL) open standard interconnect enables integration of diverse types of memory into servers via its byte-addressable SerDes links. To fully utilize CXL-based heterogeneous memory systems (which combine different types of memory with varying access speeds), it’s necessary to implement efficient memory tiering—a strategy to manage data placement across memory tiers for optimal performance. Efficiently managing these memory systems is crucial, but has been challenging due to the lack of precise and efficient tools for understanding how memory is accessed.

In a recent paper: NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering, researchers from Microsoft propose a novel solution which features a hardware/software co-design to address this problem. NeoMem offloads memory profiling functions to CXL device-side controllers, integrating a dedicated hardware unit called NeoProf, which monitors memory accesses and provides the operating system (OS) with crucial page hotness statistics and other system state information. On the OS kernel side, the researchers designed a revamped memory-tiering strategy, enabling accurate and timely hot page promotion based on NeoProf statistics. Implemented on a real FPGA-based CXL memory platform and Linux kernel v6.3, NeoMem demonstrated 32% to 67% geomean speedup over several existing memory tiering solutions.
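
As a hedged illustration of the general idea (not NeoMem's actual kernel policy), a tiering layer could turn profiler-reported access counts into a promotion plan roughly like the sketch below; the function name and threshold are hypothetical.

    def plan_promotions(page_access_counts, fast_tier_free_pages, hot_threshold=64):
        # page_access_counts: {page_number: accesses observed by the profiler}.
        # Returns the pages to migrate from CXL memory to the fast (DRAM) tier.
        hot_pages = [p for p, c in page_access_counts.items() if c >= hot_threshold]
        hot_pages.sort(key=lambda p: page_access_counts[p], reverse=True)   # hottest first
        return hot_pages[:fast_tier_free_pages]

    counts = {0x1000: 120, 0x2000: 3, 0x3000: 800, 0x4000: 70}
    print(plan_promotions(counts, fast_tier_free_pages=2))    # promotes pages 0x3000 and 0x1000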

Read the paper

NEW RESEARCH

Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases

Planning and conducting chemical syntheses is a significant challenge in the discovery of functional small molecules, which limits the potential of generative AI for molecular inverse design. Although early machine learning-based retrosynthesis models have shown the ability to predict reasonable routes, they are less accurate for infrequent, yet important reactions.

In a recent paper: Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases, researchers from Microsoft and external colleagues address this limitation, with a new framework for building highly accurate reaction models. Chimera incorporates two newly developed models, each achieving state-of-the-art performance in their respective categories. Evaluations by PhD-level organic chemists show that Chimera’s predictions are preferred for their higher quality compared to baseline models.

The researchers further validate Chimera’s robustness by applying its largest-scale model to an internal dataset from a major pharmaceutical company, demonstrating its ability to generalize effectively under distribution shifts. This new framework shows the potential to substantially accelerate the development of even more accurate and versatile reaction prediction models.

Read the paper

NEW RESEARCH

The GA4GH Task Execution API: Enabling Easy Multicloud Task Execution

In bioinformatics and computational biology, data analysis often involves chaining command-line programs developed by specialized teams at different institutions. These tools, which vary widely in age, software stacks, and dependencies, lack a common programming interface, which makes integration, workflow management and reproducibility challenging.

A recent article (opens in new tab) emphasizes the development, adoption and implementation of the Global Alliance for Genomics and Health (GA4GH) Task Execution Service (TES) API, created in collaboration with researchers at Microsoft and other institutions. The TES API offers a unified schema and interface for submitting and managing tasks, seamlessly bridging gaps between on-premises high-performance and high-throughput computing systems, cloud platforms, and hybrid infrastructures. Its flexibility and extensibility have already made it a critical asset for applications ranging from federated data analysis to load balancing across multi-cloud systems.

Adopted by numerous service providers and integrated into several workflow engines, TES empowers researchers to execute complex computational tasks through a single, abstracted interface. This eliminates compatibility hurdles, accelerates research timelines, reduces costs and enables “compute to data” solutions—essential for tackling the challenges of distributed data analysis.

Read the paper

NEW RESEARCH

RedCode: Risky Code Execution and Generation Benchmark for Code Agents

Increasing use of code agents for AI-assisted coding and software development has brought safety and security concerns, such as generating or executing malicious code, which have become significant barriers to real-world deployment of these agents.

In a recent paper: RedCode: Risky Code Execution and Generation Benchmark for Code Agents, published at NeurIPS 2024, researchers from Microsoft and external colleagues propose comprehensive and practical evaluations on the safety of code agents. RedCode is an evaluation platform with benchmarks grounded in four key principles: real interaction with systems, holistic evaluation of unsafe code generation and execution, diverse input formats, and high-quality safety scenarios and tests.

This research evaluated three agents based on various large language models (LLMs), providing insights into code agents’ vulnerabilities. For instance, results showed that agents are more likely to reject executing unsafe operations on the operating system. Unsafe operations described in natural text lead to a lower rejection rate than those in code format. Additional evaluations revealed that more capable base models and agents with stronger overall coding abilities, such as GPT-4, tend to produce more sophisticated harmful software.

These findings highlight the need for stringent safety evaluations for diverse code agents. The underlying dataset and related code are publicly available at https://github.com/AI-secure/RedCode (opens in new tab).

Read the paper

NEW RESEARCH

Towards industrial foundation models: Integrating large language models with industrial data intelligence

Although large language models (LLMs) excel at language-focused tasks like news writing, document summarization, customer service, and supporting virtual assistants, they can face challenges when it comes to learning and inference on numeric and structured industry data, such as tabular and time series data. To address these issues, researchers from Microsoft propose a new approach to building industrial foundation models (IFMs). As outlined in a recent blog post, they have successfully demonstrated the feasibility of cross-domain universal in-context learning on tabular data and the significant potential it could achieve.

The researchers designed Generative Tabular Learning (GTL), a new framework that integrates multi-industry zero-shot and few-shot learning capabilities into LLMs. This approach allows the models to adapt and generalize to new fields, new data, and new tasks more effectively, responding flexibly to diverse data science tasks. This technical paradigm has been open-sourced to promote broader use.
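To illustrate what in-context learning on tabular data can look like, the sketch below serializes a few labeled rows into a prompt and asks a model to label a new row. The serialization format, column names, and the call_llm placeholder are assumptions for illustration; GTL’s actual instruction-tuning and prompting pipeline is described in the paper.

```python
# Illustrative sketch: few-shot in-context learning over tabular rows.
# The serialization format and the call_llm placeholder are hypothetical.
def row_to_text(row: dict, target: str | None = None) -> str:
    features = ", ".join(f"{k} = {v}" for k, v in row.items())
    return f"{features} -> label: {target if target is not None else '?'}"

few_shot = [
    ({"tenure_months": 3, "monthly_charges": 80.5, "contract": "month-to-month"}, "churn"),
    ({"tenure_months": 48, "monthly_charges": 25.0, "contract": "two-year"}, "stay"),
]
query_row = {"tenure_months": 6, "monthly_charges": 75.0, "contract": "month-to-month"}

prompt = "Predict the label for the final row.\n"
prompt += "\n".join(row_to_text(r, y) for r, y in few_shot)
prompt += "\n" + row_to_text(query_row)

print(prompt)
# prediction = call_llm(prompt)  # call_llm is a placeholder for an LLM client
```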

Read the paper

Microsoft Research in the news

Microsoft’s smaller AI model beats the big guys: Meet Phi-4, the efficiency king

December 12, 2024

Microsoft launched a new artificial intelligence model today that achieves remarkable mathematical reasoning capabilities while using far fewer computational resources than its larger competitors.

Microsoft researcher Ece Kamar discusses the future of AI agents in 2025 

Tech Brew | December 12, 2024

With AI agents widely expected to take off in 2025, the director of Microsoft’s AI Frontiers lab weighs in on the future of this technology, the safeguards needed, and the year ahead in AI research.

A new frontier awaits — computing with light 

December 12, 2024

In the guts of a new type of computer, a bunch of tiny LEDs emit a green glow. Those lights have a job to do. They’re performing calculations. Right now, this math is telling the computer how to identify handwritten images of numbers. The computer is part of a research program at Microsoft.

View more news and awards

The post Research Focus: Week of December 16, 2024 appeared first on Microsoft Research.

Categories: Microsoft

PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts

Tue, 12/17/2024 - 19:00
The challenge of effective prompting

AI is reshaping industries—from education to healthcare—thanks to advancements in large language models (LLMs). These models rely on prompts, carefully crafted inputs that guide them to produce relevant and meaningful outputs. While the impact of prompts is profound, creating prompts that can help with complex tasks is a time-intensive and expertise-heavy process, often involving months of trial and error. 

This challenge grows as new tasks arise and models evolve rapidly, making manual methods for prompt engineering increasingly unsustainable. The question then becomes: How can we make prompt optimization faster, more accessible, and more adaptable across diverse tasks? 

To address this challenge, we developed PromptWizard (PW), a research framework that automates and streamlines the process of prompt optimization. We are open sourcing the PromptWizard codebase to foster collaboration and innovation within the research and development community.

Introducing PromptWizard

PromptWizard (PW) is designed to automate and simplify prompt optimization. It combines iterative feedback from LLMs with efficient exploration and refinement techniques to create highly effective prompts within minutes.

PromptWizard optimizes both the instruction and the in-context learning examples. Central to PW is its self-evolving and self-adaptive mechanism, in which the LLM iteratively generates, critiques, and refines prompts and examples in tandem. This process drives continuous improvement through feedback and synthesis, achieving holistic optimization tailored to the specific task at hand. By evolving instructions and examples simultaneously, PW delivers significant gains in task performance.

Three key insights behind PromptWizard:

  • Feedback-driven refinement: At its core, PW leverages an iterative feedback loop where the LLM generates, critiques, and refines its own prompts and examples. This continuous improvement mechanism ensures that each iteration is better than the last, leading to highly effective prompts and examples. 
  • Joint optimization and synthesis of diverse examples: PW generates synthetic examples that are not only robust and diverse but also task-aware. By optimizing prompts and examples together, it ensures they work in tandem to address specific task requirements effectively. 
  • Self-generated chain-of-thought (CoT) steps: Incorporating CoT reasoning improves the problem-solving capabilities of the model. By using selected few-shot examples, PW generates a detailed reasoning chain for each example, facilitating nuanced and step-by-step problem-solving approaches.
Figure 1. Overview of PromptWizard

How PromptWizard works

PromptWizard begins with a user input: a problem description, an initial prompt instruction, and a few training examples that serve as a foundation for the task at hand.

Its output is a refined, optimized set of prompt instructions paired with carefully curated in-context few-shot examples. These outputs are enriched with detailed reasoning chains, task intent, and an expert profile that bridges human-like reasoning with the AI’s responses. 

Stage 1: Refinement of prompt instruction

The first stage focuses on refining the task instructions of a prompt. PromptWizard generates multiple candidate instructions, evaluates them using feedback from the LLM, and iteratively synthesizes improved versions. This process balances exploration—trying diverse ideas—and exploitation—refining the most promising ones.

For example, if an initial instruction yields suboptimal results, PW incorporates feedback to identify its shortcomings and generates an improved version. Over three to five iterations, this cycle converges on a refined, higher-performing instruction.

Figure 2. Refinement of prompt instruction
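To make the exploration-and-exploitation cycle concrete, here is a rough sketch of how such a refinement loop could be structured. It is illustrative only, not PromptWizard’s implementation: the llm and score helpers are stand-ins for real model calls and task-specific evaluation.

```python
# Conceptual sketch of a generate-critique-refine loop for prompt instructions.
import random

def llm(prompt: str) -> str:
    """Stub for a language model call."""
    return "improved instruction (stub)"

def score(instruction: str, examples: list) -> float:
    """Stub: in practice, the fraction of training examples answered correctly."""
    return random.random()

def refine_instruction(seed: str, examples: list, rounds: int = 3, width: int = 4) -> str:
    best = seed
    for _ in range(rounds):
        # Exploration: ask the LLM for several mutated candidate instructions.
        candidates = [llm(f"Rewrite this task instruction differently:\n{best}")
                      for _ in range(width)] + [best]
        # Exploitation: keep the candidate that scores best on the training examples...
        best = max(candidates, key=lambda c: score(c, examples))
        # ...then fold LLM feedback about its remaining weaknesses into the next round.
        feedback = llm(f"Critique this instruction and list its weaknesses:\n{best}")
        best = llm(f"Improve the instruction using this feedback:\n{feedback}\n{best}")
    return best

print(refine_instruction("Solve the math word problem step by step.", examples=[]))
```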

Stage 2: Joint optimization of instructions and examples

The refined prompt obtained from Stage 1 is combined with carefully selected examples, and both are optimized together. Through the critique-and-synthesis mechanism, PromptWizard ensures alignment between the prompt and examples, simultaneously synthesizing new examples to enhance task performance.

This structured approach makes PromptWizard highly versatile, adapting to tasks as varied as solving math problems or generating creative content. 

Figure 3. Joint optimization of instructions and examples
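The joint critique-and-synthesis step can be sketched in the same spirit. Again, this is an illustrative outline rather than the framework’s code; the llm helper is a stub standing in for real model calls.

```python
# Continuing the sketch above: jointly critique the instruction and its
# examples, then synthesize new task-aware examples with reasoning chains.
def llm(prompt: str) -> str:  # stub LLM call, as in the previous sketch
    return "revised text (stub)"

def optimize_jointly(instruction: str, examples: list[str], rounds: int = 2):
    for _ in range(rounds):
        # Critique the instruction and examples as one unit, not separately.
        feedback = llm(
            "Critique how well these examples support this instruction:\n"
            + instruction + "\n" + "\n".join(examples)
        )
        instruction = llm(f"Revise the instruction using this feedback:\n{feedback}\n{instruction}")
        # Synthesize fresh, diverse examples aligned with the revised instruction,
        # each with a self-generated chain-of-thought rationale.
        examples = [
            llm(f"Write a worked example with step-by-step reasoning for:\n{instruction}")
            for _ in range(max(len(examples), 1))
        ]
    return instruction, examples

print(optimize_jointly("Solve the math word problem.", ["2 + 2 = 4 because ..."]))
```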

Microsoft Research Blog

AIOpsLab: Building AI agents for autonomous clouds

AIOpsLab is an open-source framework designed to evaluate and improve AI agents for cloud operations, offering standardized, scalable benchmarks for real-world testing, enhancing cloud system reliability.

Read more

Results

PromptWizard stands out for its feedback-driven refinement and systematic exploration, delivering exceptional results across a wide variety of tasks while maintaining computational efficiency. 

Comprehensive evaluation across tasks

PromptWizard was rigorously evaluated on over 45 tasks, spanning both general and domain-specific challenges. Benchmarked against state-of-the-art techniques—including Instinct, InstructZero, APE, PromptBreeder, EvoPrompt, DSPy, APO, and PromptAgent—PW consistently outperformed competitors in accuracy, efficiency, and adaptability. Detailed results are available in our paper.

  • Accuracy: PW consistently outperformed other methods, maintaining performance close to the best across all tasks. Figure 4 shows a performance profile curve that highlights PW’s reliability, demonstrating how frequently it achieves near-best accuracy compared with other approaches on the BigBench Instruction Induction (BBII) dataset.
  • Efficiency: Beyond accuracy, PW demonstrates its computational efficiency. Unlike many baseline methods that require extensive API calls and computational resources, PW achieves superior results with minimal overhead by striking an effective balance between exploration and exploitation. Table 1 demonstrates PW’s cost-effectiveness, with significantly reduced token usage for input and output while optimizing prompts effectively.
Figure 4. Performance profile curve on the BBII dataset

Methods        API calls    Total tokens
Instinct       1730         115k
PromptBreeder  18600        1488k
EvoPrompt      5000         400k
PW             69           24k

Table 1. Cost analysis on the BBII dataset

We have also conducted numerous experiments to highlight PromptWizard’s efficacy with limited training data and smaller LLMs. 

Resilience with limited data

Real-world scenarios often lack abundant training data. PW excels in such conditions, requiring as few as five examples to produce effective prompts. Across five diverse datasets, PW demonstrated an average accuracy drop of only 5% when using five examples compared to 25 examples—highlighting its adaptability and efficiency (see Table 2). 

Datasets    5 Examples    25 Examples
MMLU        80.4          89.5
GSM8k       94            95.4
Ethos       86.4          89.4
PubMedQA    68            78.2
MedQA       80.4          82.9
Average     81.9          87

Table 2. PW’s performance with varying numbers of examples

Leveraging smaller models for optimization

PromptWizard also reduces computational costs by using smaller LLMs for prompt generation, reserving more powerful models for inference. For example, using Llama-70B for prompt generation resulted in negligible performance differences compared to GPT-4, while significantly lowering resource usage (see Table 3).

Dataset    Prompt Gen: Llama-70B    Prompt Gen: GPT-4
GSM8k      94.6                     95.4
Ethos      89.2                     89.4
Average    91.9                     92.4

Table 3. Performance with smaller LLMs for prompt generation

PromptWizard shows that effective prompts combine optimized instructions refined through iterative feedback, thoughtfully chosen in-context examples, and a modular design that incorporates expert knowledge and task-specific intent. This approach enables the framework to handle a broad range of tasks, from simple to highly complex, with exceptional efficiency and flexibility.

 Whether you are a researcher addressing cutting-edge challenges or an organization looking to streamline workflows, PromptWizard provides a practical, scalable, and impactful solution for enhancing model performance.


The post PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts appeared first on Microsoft Research.

Categories: Microsoft

Moving to GraphRAG 1.0 – Streamlining ergonomics for developers and users

Mon, 12/16/2024 - 19:00
Introducing GraphRAG 1.0

Microsoft debuted the pre-release version of GraphRAG in July 2024 to advance AI use in complex domains. Since that time, we’ve seen incredible adoption and community engagement (over 20k stars and 2k forks on GitHub as of this writing), with numerous fixes and improvements by the core team and community contributors. We’re deeply grateful for the contributions and feedback we’ve received and are excited to share a number of major ergonomic and structural improvements that culminate in the official release of GraphRAG 1.0.

Ergonomic refactors

Easier setup for new projects

When we first launched GraphRAG, most configuration was done through environment variables, which could be daunting given the many options available. We’ve reduced setup friction by adding an init command that generates a simplified starter settings.yml file with all core required config already set. We recommend developers start here to ensure they get the clearest initial config. With this update, a quick setup no longer requires GraphRAG expertise, only an OpenAI API key in the environment.

New and expanded command line interface

We expanded the functionality and ease of use of the command line interface (CLI) and adopted Typer to provide better inline documentation and a richer CLI experience. The original CLI was intended as a starter demo for users to try GraphRAG on a sample dataset. We’ve since learned from the community that most people actually want to use this as their primary interaction mode for GraphRAG, so as part of this milestone release, we’ve incorporated enhancements that result in a more streamlined experience. From this work, CLI startup times dropped from an average of 148 seconds to 2 seconds.
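For teams that script their workflows, the CLI can also be driven from Python. The sketch below assumes the graphrag package is installed and an API key is configured in the environment; command and flag names reflect the documented CLI and may vary between versions, so treat them as assumptions and confirm with `graphrag --help`.

```python
# Hedged sketch: scripting GraphRAG project setup and a first query from Python.
# Command and flag names may differ between versions; check `graphrag --help`.
import os
import subprocess

root = "./myproject"  # hypothetical project directory
os.makedirs(root, exist_ok=True)

# Generate the starter settings.yml with core config pre-populated.
subprocess.run(["graphrag", "init", "--root", root], check=True)

# Build the index over documents placed under <root>/input.
subprocess.run(["graphrag", "index", "--root", root], check=True)

# Ask a global question over the indexed knowledge model.
subprocess.run(
    ["graphrag", "query", "--root", root, "--method", "global",
     "--query", "What are the main themes in this corpus?"],
    check=True,
)
```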

Consolidated API layer

In August 2024 we introduced a standalone API layer to simplify developer usage. The original CLI contained all the code required to instantiate and execute basic indexing and query commands, which users often needed to replicate. The API layer is still considered provisional as we gather feedback, but is intended to be the primary entry point for developers who wish to integrate GraphRAG functionality into their own applications without deep pipeline or query class customization. In fact, the CLI and Accelerator are built entirely on top of the API layer, acting as a documented example of how to interact with the API. We have also added examples of how to use this API to our notebook collection that we will continue to update as we iterate in future releases. 

Simplified data model

GraphRAG creates several output artifacts to store the indexed knowledge model. The initial model contained a large number of files, fields, and cross-references based on experimental ideas from the early research, which could be overwhelming for both new and routine users. We performed a comprehensive review of the data model and incorporated fixes to add clarity and consistency, remove redundant or unused fields, reduce storage requirements, and simplify the data model. Previously, the output lacked standardization, and relevant outputs could easily be confused with non-critical intermediary files. With GraphRAG 1.0, the output includes only relevant artifacts that are easily readable and traceable.

About Microsoft Research

Advancing science and technology to benefit humanity

View our story

Streamlined vector stores

Embeddings and their vector stores are some of the primary drivers of  GraphRAG’s storage needs. Our original data model stored all embeddings within the parquet output files after data ingestion and indexing. This made the files portable, which was convenient for early research, but for many users it became unnecessary as they configured their own vector stores and the scale of data ingestion grew. We have updated the GraphRAG pipeline to create a default vector store during indexing, so no post-processing is needed, and the query library shares this configuration for seamless use. The benefit of this change is that those vectors (which can be quite large) no longer need to be loaded when the output files are read from disk, saving read time and memory during every query. Coupled with the simplified data model, this resulted in output parquet disk savings of 80%, and total disk space (including embeddings in the vector store) reduction of 43%. GraphRAG supports LanceDB and Azure AI Search out-of-the-box for vector stores. For simple startup, LanceDB is used as the default, and is written to a local database alongside the knowledge model artifacts. 
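For local setups, the default LanceDB store can be inspected directly with the LanceDB Python client. The sketch below is illustrative only: the on-disk path, table name, column names, and embedding dimension depend on your project configuration and GraphRAG version, so treat them as assumptions.

```python
# Hedged illustration: inspecting a local LanceDB vector store such as the one
# GraphRAG writes by default. The path, table choice, and column names below
# are assumptions; check your project's output directory for the actual values.
import lancedb

db = lancedb.connect("./output/lancedb")  # hypothetical project-relative path
print(db.table_names())                   # list the embedding tables present

table = db.open_table(db.table_names()[0])
query_vector = [0.0] * 1536               # must match your embedding dimension
hits = table.search(query_vector).limit(5).to_list()
for hit in hits:
    print(hit.get("id"), hit.get("_distance"))
```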

Flatter, clearer code structure

A key initiative on the road to version 1.0 has been to simplify the codebase so it is easier to maintain and more approachable for third-party users. We’ve flattened much of the directory structure to make the code easier to browse, and co-located code that our own usage patterns indicated did not need to live in separate functional areas.

We have also found that very few users need the declarative configuration that the underlying DataShaper engine provides, so we collapsed 88 verbose workflow definitions into a smaller set of 11 workflows that operate in a functional rather than composed manner. This makes the pipeline easier to understand, is a step toward an architecture better suited to our future research plans, and improves performance across the board. By collapsing workflows, we now have fewer unused output artifacts, reduced data duplication, and fewer disk I/O operations. This streamlining has also reduced the in-memory footprint of the pipeline, enabling users to index and analyze larger datasets with GraphRAG.

Incremental ingest

Until now, an evolving dataset needed complete re-indexing every time new information was acquired in order to regenerate the knowledge model. In GraphRAG 1.0 we are including a new update command in the CLI that computes the deltas between an existing index and newly added content and intelligently merges the updates to minimize re-indexing. GraphRAG uses an LLM caching mechanism to save as much cost as possible when re-indexing, so re-runs over a dataset are often significantly faster and cheaper than an initial run. Adding brand-new content can alter the community structure such that much of an index needs to be recomputed; the update command resolves this while also improving answer quality.
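Conceptually, the delta step amounts to identifying which source documents are new or changed and limiting re-computation to the parts of the knowledge model they affect. The sketch below illustrates that idea with simple content hashing; it is a hypothetical illustration of the concept, not how GraphRAG’s update command is implemented, and the input layout and manifest path are assumptions.

```python
# Conceptual illustration of delta detection for incremental ingest:
# hash each source document and flag only those whose content is new or
# changed for re-processing. This is not GraphRAG's implementation.
import hashlib
import json
from pathlib import Path

def content_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def compute_delta(input_dir: Path, manifest_path: Path) -> list[Path]:
    previous = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    # Assumes plain-text inputs; adjust the glob for your own input layout.
    current = {str(p): content_hash(p) for p in sorted(input_dir.glob("*.txt"))}
    changed = [Path(p) for p, h in current.items() if previous.get(p) != h]
    manifest_path.write_text(json.dumps(current, indent=2))
    return changed  # only these documents need re-indexing and merging

# changed_docs = compute_delta(Path("./input"), Path("./output/manifest.json"))
```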

Availability

GraphRAG version 1.0 is now available on GitHub, and published to PyPI. Check out the Getting Started guide to use GraphRAG 1.0 today.

Migrating

We recommend users migrate to GraphRAG 1.0, which offers a streamlined experience including multiple improvements for both users and developers. However, because of the breadth of its updates, version 1.0 is not backwards compatible. If you’ve used GraphRAG prior to version 1.0 and have existing indexes, there are a handful of breaking changes that need to be addressed, but this should be a straightforward process. To support the community in this migration, we’ve created a migration guide in the repository with more information. 

Future directions

We recently posted about a brand-new approach to GraphRAG called LazyGraphRAG, which performs minimal up-front indexing and defers LLM usage until user queries are executed. This avoids LLM-based summarization of large volumes of content that may never interest users and would therefore go unexplored despite expensive processing. This approach shows strong performance at a fraction of the cost of GraphRAG, and will be added to the core GraphRAG codebase in the near future as a new option for users.

Additionally, Microsoft has been active in exploring how GraphRAG can advance the rate of scientific progress, and is in the process of building relevant GraphRAG capabilities to align with our broader work in AI-enabled scientific discovery.

We continue to refine the codebase and investigate architectural changes that will enable users to use their own language model APIs, storage providers, and vector stores. We’re excited about this major milestone, and the foundation that this refactoring lays for our continued research in the GraphRAG space.


The post Moving to GraphRAG 1.0 – Streamlining ergonomics for developers and users appeared first on Microsoft Research.

Categories: Microsoft
