Microsoft Research
Research Focus: Week of January 13, 2025
In this edition:
- We introduce privacy enhancements for multiparty deep learning, a framework using smaller, open-source models to provide relevance judgments, and other notable new research.
- We congratulate Yasuyuki Matsushita, who was named an IEEE Computer Society Fellow.
- We’ve included a recap of the extraordinary, far-reaching work done by researchers at Microsoft in 2024.
NEW RESEARCH
AI meets materials discovery
Two of the transformative tools that play a central role in Microsoft’s work on AI for science are MatterGen and MatterSim. In the world of materials discovery, each plays a distinct yet complementary role in reshaping how researchers design and validate new materials.
Read the story

NEW RESEARCH
Communication Efficient Secure and Private Multi-Party Deep Learning
Distributed training enables multiple parties to jointly train a machine learning model on their respective datasets, which can help address the challenges posed by modern machine learning’s need for large volumes of diverse data. However, it also raises security and privacy concerns: each party’s data must be protected during training, and private information must not leak from the model after training through various inference attacks.
In a recent paper, Communication Efficient Secure and Private Multi-Party Deep Learning, researchers from Microsoft address these concerns simultaneously by designing efficient differentially private secure multiparty computation (DP-MPC) protocols for jointly training a model on data distributed among multiple parties. In the two-party setting, their DP-MPC protocol is 56 to 794 times more communication-efficient and 16 to 182 times faster than previous such protocols. This work simplifies and improves on previous attempts to combine techniques from secure multiparty computation and differential privacy, especially in the context of training machine learning models.
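To make the combination concrete, here is a toy sketch (not the paper's protocol) of how additive secret sharing and differential privacy can be composed: each party splits its gradient into random shares so that no individual share reveals anything, the shares are aggregated, and calibrated Gaussian noise is added to the reconstructed sum before it is used to update the model. The ring size, fixed-point scale, and noise level below are illustrative assumptions.

```python
import numpy as np

MOD = 2**32      # ring size for additive secret sharing (illustrative choice)
SCALE = 2**16    # fixed-point scaling for gradients (illustrative choice)

def encode(grad):
    """Fixed-point encode a float gradient vector into the ring [0, MOD)."""
    return np.mod(np.round(grad * SCALE).astype(np.int64), MOD)

def share(x, n_parties):
    """Additively secret-share x: the n shares sum to x modulo MOD."""
    shares = [np.random.randint(0, MOD, size=x.shape, dtype=np.int64)
              for _ in range(n_parties - 1)]
    shares.append(np.mod(x - sum(shares), MOD))
    return shares

def reconstruct_with_dp(shares_per_party, sigma):
    """Sum all shares, decode to floats, and add Gaussian noise for differential privacy."""
    total = np.mod(sum(np.mod(sum(s), MOD) for s in shares_per_party), MOD)
    signed = np.where(total >= MOD // 2, total - MOD, total).astype(np.float64) / SCALE
    return signed + np.random.normal(0.0, sigma, size=signed.shape)

# toy run: three parties jointly compute a DP-noised average gradient
grads = [np.array([0.5, -1.0]), np.array([0.25, 0.75]), np.array([-0.5, 0.5])]
shares = [share(encode(g), 3) for g in grads]       # shares[i][j] = party j's share of gradient i
held_by_party = list(zip(*shares))                  # regroup shares by the party that holds them
print(reconstruct_with_dp(held_by_party, sigma=0.1) / 3)
```

A real protocol would run the entire training computation under MPC and account for the privacy budget across iterations; the sketch only conveys the shape of the composition.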
Read the paper

NEW RESEARCH
JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment
Training and evaluating retrieval systems requires a significant number of relevance judgments, which are traditionally collected from human assessors. This process is both costly and time-consuming. Large language models (LLMs) have shown promise in generating relevance labels for search tasks, offering a potential alternative to manual assessments. Current approaches often rely on a single LLM. While effective, this approach can be expensive and prone to intra-model biases that can favor systems leveraging similar models.
In a recent paper: JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment, researchers from Microsoft introduce a framework that employs smaller, open-source models to provide relevance judgments by combining evaluations across multiple LLMs (LLMBlender) or multiple prompts (PromptBlender). By leveraging the LLMJudge benchmark, they compare JudgeBlender with state-of-the-art methods and the top performers in the LLMJudge challenge. This research shows that JudgeBlender achieves competitive performance, demonstrating that very large models are often unnecessary for reliable relevance assessments.
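The pooling step itself is simple to picture: gather a graded relevance label from each small judge and aggregate by mean or majority vote. The sketch below is illustrative only; the judge callables are stand-ins for real LLM calls, not part of the released framework.

```python
from collections import Counter
from statistics import mean

def blend_judgments(query, passage, judges, pool="mean"):
    """Pool graded relevance labels (e.g., 0-3) from several LLM judges.

    `judges` is a list of callables, each wrapping one small open-source model
    (LLMBlender-style) or one prompt variant against the same model (PromptBlender-style).
    """
    labels = [judge(query, passage) for judge in judges]
    if pool == "majority":
        return Counter(labels).most_common(1)[0][0]   # most frequent label wins
    return round(mean(labels))                        # mean label, rounded to a grade

# toy usage with stub judges standing in for real LLM calls
stub_judges = [lambda q, p: 2, lambda q, p: 3, lambda q, p: 2]
print(blend_judgments("treatment for flu", "Influenza is best managed by...", stub_judges))
```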
Read the paper

NEW RESEARCH
Convergence to Equilibrium of No-regret Dynamics in Congestion Games
Congestion games are used to describe the behavior of agents who share a set of resources. Each player chooses a combination of resources, which may become congested, decreasing utility for the players who choose them. Players can avoid congestion by choosing combinations that are less popular. This is useful for modeling a range of real-world scenarios, such as traffic flow, data routing, and wireless communication networks.
In a recent paper: Convergence to Equilibrium of No-regret Dynamics in Congestion Games, researchers from Microsoft and external colleagues propose CongestEXP, a decentralized algorithm based on the classic exponential weights method. They evaluate CongestEXP in a traffic congestion game setting. As more drivers use a particular route, congestion increases, leading to higher travel times and lower utility. Players can choose a different route every day to optimize their utility, but the utility observed by each player may be subject to randomness due to uncertainty (e.g., bad weather). The researchers show that this approach provides both regret guarantees and convergence to Nash equilibrium, where no player can unilaterally improve their outcome by changing their strategy.
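For intuition, the classic exponential weights scheme that CongestEXP builds on keeps a weight per route, plays routes with probability proportional to those weights, and multiplicatively re-weights a route according to the utility observed after playing it. The simulation below is a generic illustration of that update, not the authors' exact algorithm; the utilities and congestion model are made up.

```python
import numpy as np

class ExponentialWeights:
    """Generic exponential weights over routes; a simplified stand-in for CongestEXP."""
    def __init__(self, n_routes, learning_rate=0.1):
        self.eta = learning_rate
        self.weights = np.ones(n_routes)

    def choose_route(self, rng):
        probs = self.weights / self.weights.sum()
        return rng.choice(len(self.weights), p=probs)

    def update(self, route, utility):
        # higher observed utility (e.g., shorter travel time) multiplicatively boosts the route
        self.weights[route] *= np.exp(self.eta * utility)

# toy simulation: one driver choosing among 3 routes whose utility drops with popularity
rng = np.random.default_rng(0)
player = ExponentialWeights(n_routes=3)
base_utility = np.array([1.0, 0.8, 0.6])
for day in range(100):
    r = player.choose_route(rng)
    congestion = player.weights[r] / player.weights.sum()          # crude popularity proxy
    u = base_utility[r] - 0.5 * congestion + rng.normal(0, 0.05)   # noisy observed utility
    player.update(r, u)
print(player.weights / player.weights.sum())                       # learned route preferences
```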
Read the paper

NEW RESEARCH
RD-Agent: An open-source solution for smarter R&D
Research and development (R&D) plays a pivotal role in boosting industrial productivity. However, the rapid advance of AI has exposed the limitations of traditional R&D automation. Current methods often lack the intelligence needed to support innovative research and complex development tasks, underperforming human experts with deep knowledge.
LLMs trained on vast datasets spanning many subjects are equipped with extensive knowledge and reasoning capabilities that support complex decision-making in diverse workflows. By autonomously performing tasks and analyzing data, LLMs can significantly increase the efficiency and precision of R&D processes.
In a recent article, researchers from Microsoft introduce RD-Agent, a tool that integrates data-driven R&D systems and harnesses advanced AI to automate innovation and development.
At the heart of RD-Agent is an autonomous agent framework with two key components: a) Research and b) Development. Research focuses on actively exploring and generating new ideas, while Development implements these ideas. Both components improve through an iterative process, illustrated in Figure 1 of the article, ensuring the system becomes increasingly effective over time.
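Read as a loop, that interplay amounts to propose, implement, evaluate, and feed the results back. The sketch below is our own distillation of that description, not RD-Agent's actual interfaces; `llm` and `evaluate` are caller-supplied placeholders.

```python
def rd_loop(llm, evaluate, task, n_iterations=5):
    """Illustrative Research/Development loop: propose ideas, implement them, learn from feedback."""
    knowledge = []                                   # feedback accumulated across iterations
    best = None
    for _ in range(n_iterations):
        # Research: propose a new idea conditioned on what has been tried so far
        idea = llm(f"Task: {task}\nPrior attempts and feedback: {knowledge}\nPropose one new idea.")
        # Development: turn the idea into a concrete, executable artifact
        artifact = llm(f"Implement this idea as runnable code or an experiment plan:\n{idea}")
        # Feedback: evaluate the artifact and record the result for the next Research step
        score = evaluate(artifact)
        knowledge.append({"idea": idea, "score": score})
        if best is None or score > best["score"]:
            best = {"idea": idea, "artifact": artifact, "score": score}
    return best
```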
Read the article

Microsoft research podcast
Ideas: AI and democracy with Madeleine Daepp and Robert Osazuwa Ness
As the “biggest election year in history” comes to an end, researchers Madeleine Daepp and Robert Osazuwa Ness and Democracy Forward GM Ginny Badanes discuss AI’s impact on democracy, including the tech’s use in Taiwan and India.
Listen now

Microsoft Research | In case you missed it

Microsoft Research 2024: A year in review
December 20, 2024
Microsoft Research did extraordinary work this year, using AI and scientific research to make progress on real-world challenges like climate change, food security, global health, and human trafficking. Here’s a look back at the broad range of accomplishments and advances in 2024.
AIOpsLab: Building AI agents for autonomous clouds
December 20, 2024
AIOpsLab is a holistic evaluation framework that enables researchers and developers to design, develop, evaluate, and enhance AIOps agents, while also serving as a reproducible, standardized, interoperable, and scalable benchmark.
Yasuyuki Matsushita, IEEE Computer Society 2025 Fellow
December 19, 2024
Congratulations to Yasuyuki Matsushita, Senior Principal Research Manager at Microsoft Research, who was named a 2025 IEEE Computer Society Fellow. Matsushita was recognized for contributions to photometric 3D modeling and computational photography.
View more news and awards
MatterGen: A new paradigm of materials design with generative AI
Materials innovation is one of the key drivers of major technological breakthroughs. The discovery of lithium cobalt oxide in the 1980s laid the groundwork for today’s lithium-ion battery technology. It now powers modern mobile phones and electric cars, impacting the daily lives of billions of people. Materials innovation is also required for designing more efficient solar cells, cheaper batteries for grid-level energy storage, and adsorbents to recycle CO2 from the atmosphere.
Finding a new material for a target application is like finding a needle in a haystack. Historically, this task has been done via expensive and time-consuming experimental trial-and-error. More recently, computational screening of large materials databases has allowed researchers to speed up this process. Nonetheless, finding the few materials with the desired properties still requires the screening of millions of candidates.
Today, in a paper published in Nature (opens in new tab), we share MatterGen, a generative AI tool that tackles materials discovery from a different angle. Instead of screening the candidates, it directly generates novel materials given prompts of the design requirements for an application. It can generate materials with desired chemical, mechanical, electronic, or magnetic properties, as well as combinations of different constraints. MatterGen enables a new paradigm of generative AI-assisted materials design that allows for efficient exploration of materials, going beyond the limited set of known ones.
Figure 1: Schematic representation of screening and generative approaches to materials design

A novel diffusion architecture
MatterGen is a diffusion model that operates on the 3D geometry of materials. Much like an image diffusion model generates pictures from a text prompt by modifying the color of pixels from a noisy image, MatterGen generates proposed structures by adjusting the positions, elements, and periodic lattice from a random structure. The diffusion architecture is specifically designed for materials to handle specialties like periodicity and 3D geometry.
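In spirit, sampling from the model runs that diffusion process in reverse: start from random fractional coordinates, element identities, and lattice, then repeatedly apply the learned denoising step to all three jointly. The loop below is a heavily simplified illustration under that assumption; `denoiser` stands in for the trained score network, and the array shapes and step counts are arbitrary.

```python
import numpy as np

def generate_structure(denoiser, n_atoms, n_steps=1000, rng=None):
    """Toy reverse-diffusion loop over atom positions, lattice, and element identities."""
    rng = rng or np.random.default_rng()
    positions = rng.random((n_atoms, 3))                  # fractional coordinates in [0, 1)
    lattice = np.eye(3) + 0.1 * rng.normal(size=(3, 3))   # noisy starting cell
    element_logits = rng.normal(size=(n_atoms, 100))      # scores over ~100 element types

    for t in reversed(range(n_steps)):
        # the trained network predicts how to nudge each component toward a stable crystal
        d_pos, d_lat, d_elem = denoiser(positions, lattice, element_logits, t)
        positions = (positions + d_pos) % 1.0             # respect periodic boundary conditions
        lattice = lattice + d_lat
        element_logits = element_logits + d_elem

    elements = element_logits.argmax(axis=1)              # commit to one element per site
    return positions, lattice, elements
```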
Figure 2: Schematic representation of MatterGen: a diffusion model to generate novel and stable materials. MatterGen can be fine-tuned to generate materials under different design requirements such as specific chemistry, crystal symmetry, or materials’ properties.

The base model of MatterGen achieves state-of-the-art performance in generating novel, stable, diverse materials (Figure 3). It is trained on 608,000 stable materials from the Materials Project (opens in new tab) (MP) and Alexandria (opens in new tab) (Alex) databases. The performance improvement can be attributed to both the architecture advancements and the quality and size of our training data.
Figure 3: Performance of MatterGen and other methods in the generation of stable, unique, and novel structures. The training dataset for each method is indicated in parentheses. The purple bar highlights performance improvements due to MatterGen’s architecture alone, while the teal bar highlights performance improvements that come also from the larger training dataset.

MatterGen can be fine-tuned with a labelled dataset to generate novel materials given any desired conditions. We demonstrate examples of generating novel materials given a target’s chemistry and symmetry, as well as electronic, magnetic, and mechanical property constraints (Figure 2).
Outperforming screening

Figure 4: Performance of MatterGen (teal) and traditional screening (yellow) in finding novel, stable, and unique structures that satisfy the design requirement of having bulk modulus greater than 400 GPa.

The key advantage of MatterGen over screening is its ability to access the full space of unknown materials. In Figure 4, we show that MatterGen continues to generate more novel candidate materials with high bulk modulus above 400 GPa, for example, which are hard to compress. In contrast, the screening baseline saturates as it exhausts known candidates.
Handling compositional disorder

Figure 5: Illustration of compositional disorder. Left: a perfect crystal without compositional disorder and with a repeating unit cell (black dashed). Right: crystal with compositional disorder, where each site has 50% probability of yellow and teal atoms.

Compositional disorder (Figure 5) is a commonly observed phenomenon where different atoms can randomly swap their crystallographic sites in a synthesized material. Recently (opens in new tab), the community has been exploring what it means for a material to be novel in the context of computationally designed materials, as widely employed algorithms will not distinguish between pairs of structures where the only difference is a permutation of similar elements in their respective sites.
We provide an initial solution to this issue by introducing a new structure matching algorithm that considers compositional disorder. The algorithm assesses whether a pair of structures can be identified as ordered approximations of the same underlying compositionally disordered structure. This provides a new definition of novelty and uniqueness, which we adopt in our computational evaluation metrics. We also make our algorithm publicly available (opens in new tab) as part of our evaluation package.
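As a rough illustration of the idea (not the released algorithm), one can treat two ordered structures as equivalent if an ordinary structure match succeeds after relabeling elements that are allowed to share a site. The sketch below assumes pymatgen's StructureMatcher and Structure.replace_species behave as described; the group of interchangeable elements and the matching tolerances are assumptions the caller must supply.

```python
from itertools import permutations
from pymatgen.analysis.structure_matcher import StructureMatcher

def disorder_aware_match(s1, s2, similar_groups, matcher=None):
    """Return True if s1 and s2 match after permuting elements within a 'similar' group.

    A crude illustration of disorder-aware matching: try every relabeling that only
    swaps elements deemed interchangeable (e.g., sites that could host either species),
    and fall back to an ordinary structure match for each relabeled candidate.
    """
    matcher = matcher or StructureMatcher()
    if matcher.fit(s1, s2):
        return True
    for group in similar_groups:                        # e.g., [("Ta", "Cr")]
        for perm in permutations(group):
            mapping = dict(zip(group, perm))
            if any(a != b for a, b in mapping.items()):
                candidate = s2.copy()
                candidate.replace_species(mapping)      # relabel the interchangeable elements
                if matcher.fit(s1, candidate):
                    return True
    return False
```

The released evaluation package implements a more careful criterion; this sketch only conveys the flavor of disorder-aware matching.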
Experimental lab verification

Figure 6: Experimental validation of the proposed compound, TaCr2O6

In addition to our extensive computational evaluation, we have validated MatterGen’s capabilities through experimental synthesis. In collaboration with the team led by Prof. Li Wenjie from the Shenzhen Institutes of Advanced Technology (opens in new tab) (SIAT) of the Chinese Academy of Sciences, we have synthesized a novel material, TaCr2O6, whose structure was generated by MatterGen after conditioning the model on a bulk modulus value of 200 GPa. The synthesized material’s structure aligns with the one proposed by MatterGen, with the caveat of compositional disorder between Ta and Cr. Additionally, we experimentally measured a bulk modulus of 169 GPa against the 200 GPa given as the design specification, a relative error below 20%, which is very close from an experimental perspective. If similar results can be translated to other domains, it will have a profound impact on the design of batteries, fuel cells, and more.
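For reference, the quoted relative error follows directly from the measured and target values:

$$\frac{|169\ \text{GPa} - 200\ \text{GPa}|}{200\ \text{GPa}} = 0.155 \approx 15.5\% < 20\%$$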
AI emulator and generator flywheel

MatterGen presents a new opportunity for AI-accelerated materials design, complementing our AI emulator MatterSim. MatterSim follows the fifth paradigm of scientific discovery, significantly accelerating the simulation of material properties. MatterGen in turn accelerates the exploration of new material candidates with property-guided generation. MatterGen and MatterSim can work together as a flywheel to speed up both the simulation and exploration of novel materials.
Making MatterGen available

We believe the best way to make an impact in materials design is to make our model available to the public. We release the source code of MatterGen (opens in new tab) under the MIT license, together with the training and fine-tuning data. We welcome the community to use and build on top of our model.
Looking ahead

MatterGen represents a new paradigm of materials design enabled by generative AI technology. It explores a significantly larger space of materials than screening-based methods. It is also more efficient by guiding materials exploration with prompts. Similar to how generative AI has impacted drug discovery (opens in new tab), it will have a profound impact on how we design materials in broad domains including batteries, magnets, and fuel cells.
We plan to continue our work with external collaborators to further develop and validate the technology. “At the Johns Hopkins University Applied Physics Laboratory (APL), we’re dedicated to the exploration of tools with the potential to advance discovery of novel, mission-enabling materials. That’s why we are interested in understanding the impact that MatterGen could have on materials discovery,” said Christopher Stiles, a computational materials scientist leading multiple materials discovery efforts at APL.
Acknowledgement

This work is the result of highly collaborative team efforts at Microsoft Research AI for Science. The full author list includes: Claudio Zeni, Robert Pinsler, Daniel Zügner, Andrew Fowler, Matthew Horton, Xiang Fu, Zilong Wang, Aliaksandra Shysheya, Jonathan Crabbé, Shoko Ueda, Roberto Sordillo, Lixin Sun, Jake Smith, Bichlien Nguyen, Hannes Schulz, Sarah Lewis, Chin-Wei Huang, Ziheng Lu, Yichi Zhou, Han Yang, Hongxia Hao, Jielan Li, Chunlei Yang, Wenjie Li, Ryota Tomioka, Tian Xie.
AutoGen v0.4: Reimagining the foundation of agentic AI for scale, extensibility, and robustness
Over the past year, our work on AutoGen has highlighted the transformative potential of agentic AI and multi-agent applications. Today, we are excited to announce AutoGen v0.4, a significant milestone informed by insights from our community of users and developers. This update represents a complete redesign of the AutoGen library, developed to improve code quality, robustness, generality, and scalability in agentic workflows.
The initial release of AutoGen generated widespread interest in agentic technologies. At the same time, users struggled with architectural constraints, an inefficient API compounded by rapid growth, and limited debugging and intervention functionality. Feedback highlighted the need for stronger observability and control, more flexible multi-agent collaboration patterns, and reusable components. AutoGen v0.4 addresses these issues with its asynchronous, event-driven architecture.
This update makes AutoGen more robust and extensible, enabling a broader range of agentic scenarios. The new framework includes the following features, inspired by feedback from both within and outside Microsoft.
- Asynchronous messaging: Agents communicate through asynchronous messages, supporting both event-driven and request/response interaction patterns.
- Modular and extensible: Users can easily customize systems with pluggable components, including custom agents, tools, memory, and models. They can also build proactive and long-running agents using event-driven patterns.
- Observability and debugging: Built-in metric tracking, message tracing, and debugging tools provide monitoring and control over agent interactions and workflows, with support for OpenTelemetry for industry-standard observability.
- Scalable and distributed: Users can design complex, distributed agent networks that operate seamlessly across organizational boundaries.
- Built-in and community extensions: The extensions module enhances the framework’s functionality with advanced model clients, agents, multi-agent teams, and tools for agentic workflows. Community support allows open-source developers to manage their own extensions.
- Cross-language support: This update enables interoperability between agents built in different programming languages, with current support for Python and .NET and additional languages in development.
- Full type support: Interfaces enforce type checks at build time, improving robustness and maintaining code quality.
New AutoGen framework

As shown in Figure 1, the AutoGen framework features a layered architecture with clearly defined responsibilities across the framework, developer tools, and applications. The framework comprises three layers: core, agent chat, and first-party extensions.
- Core: The foundational building blocks for an event-driven agentic system.
- AgentChat: A task-driven, high-level API built on the core layer, featuring group chat, code execution, pre-built agents, and more. This layer is most similar to AutoGen v0.2 (opens in new tab), making it the easiest API to migrate to.
- Extensions: Implementations of core interfaces and third-party integrations, such as the Azure code executor and OpenAI model client.
In addition to the framework, AutoGen 0.4 includes upgraded programming tools and applications, designed to support developers in building and experimenting with AutoGen.
AutoGen Bench: Enables developers to benchmark their agents by measuring and comparing performance across tasks and environments.
AutoGen Studio: Rebuilt on the v0.4 AgentChat API, this low-code interface enables rapid prototyping of AI agents. It introduces several new capabilities:
- Real-time agent updates: View agent action streams in real time with asynchronous, event-driven messages.
- Mid-execution control: Pause conversations, redirect agent actions, and adjust team composition. Then seamlessly resume tasks.
- Interactive feedback through the UI: Add a UserProxyAgent to enable user input and guidance during team runs in real time.
- Message flow visualization: Understand agent communication through an intuitive visual interface that maps message paths and dependencies.
- Drag-and-drop team builder: Design agent teams visually using an interface for dragging components into place and configuring their relationships and properties.
- Third-party component galleries: Import and use custom agents, tools, and workflows from external galleries to extend functionality.
Magentic-One: A new generalist multi-agent application to solve open-ended web and file-based tasks across various domains. This tool marks a significant step toward creating agents capable of completing tasks commonly encountered in both work and personal contexts.
Migrating to AutoGen v0.4

We implemented several measures to facilitate a smooth upgrade from the previous v0.2 API, addressing core differences in the underlying architecture.
First, the AgentChat API maintains the same level of abstraction as v0.2, making it easy to migrate existing code to v0.4. For example, AgentChat offers an AssistantAgent and UserProxyAgent with similar behaviors to those in v0.2. It also provides a team interface with implementations like RoundRobinGroupChat and SelectorGroupChat, which cover all the capabilities of the GroupChat class in v0.2. Additionally, v0.4 introduces many new functionalities, such as streaming messages, improved observability, saving and restoring task progress, and resuming paused actions where they left off.
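As a rough illustration of the v0.4 AgentChat style, a two-agent round-robin team might look like the sketch below. Module paths and signatures reflect the v0.4 packages as we understand them (autogen-agentchat and autogen-ext); treat the exact names as assumptions and consult the migration guide for the authoritative API.

```python
import asyncio
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_agentchat.conditions import TextMentionTermination
from autogen_ext.models.openai import OpenAIChatCompletionClient

async def main():
    model_client = OpenAIChatCompletionClient(model="gpt-4o")
    writer = AssistantAgent("writer", model_client=model_client,
                            system_message="Draft a short answer to the task.")
    reviewer = AssistantAgent("reviewer", model_client=model_client,
                              system_message="Critique the draft; reply TERMINATE when satisfied.")
    # Agents take turns until the termination phrase appears in a message.
    team = RoundRobinGroupChat([writer, reviewer],
                               termination_condition=TextMentionTermination("TERMINATE"))
    result = await team.run(task="Summarize the benefits of event-driven agent frameworks.")
    print(result)

asyncio.run(main())
```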
For detailed guidance, refer to the migration guide (opens in new tab).
Looking forward

This new release sets the stage for a robust ecosystem and strong foundation to drive advances in agentic AI applications and research. Our roadmap includes releasing .NET support, introducing built-in, well-designed applications and extensions for challenging domains, and fostering a community-driven ecosystem. We remain committed to the responsible development of AutoGen and its evolving capabilities.
We encourage you to engage with us on AutoGen’s Discord server (opens in new tab) and share feedback on the official AutoGen repository (opens in new tab) via GitHub Issues. Stay up to date with frequent AutoGen updates via X (opens in new tab).
Acknowledgments

We would like to thank the many individuals whose ideas and insights helped formalize the concepts introduced in this release, including Rajan Chari, Ece Kamar, John Langford, Ching-An Chen, Bob West, Paul Minero, Safoora Yousefi, Will Epperson, Grace Proebsting, Enhao Zhang, and Andrew Ng.
AIOpsLab: Building AI agents for autonomous clouds
In our increasingly complex digital landscape, enterprises and cloud providers face significant challenges in the development, deployment, and maintenance of sophisticated IT applications. The broad adoption of microservices and cloud-based serverless architecture has streamlined certain aspects of application development while simultaneously introducing a host of operational difficulties, particularly in fault diagnosis and mitigation. These complexities can result in outages, which have the potential to cause major business disruptions, underscoring the critical need for robust solutions that ensure high availability and reliability in cloud services. As the expectation for five-nines availability grows, organizations must navigate the intricate web of operational demands to maintain customer satisfaction and business continuity.
To tackle these challenges, recent research on using AIOps agents for cloud operations—such as AI agents for incident root cause analysis (RCA) or triaging—has relied on proprietary services and datasets. Other prior works use frameworks specific to the solutions that they are building, or ad hoc and static benchmarks and metrics that fail to capture the dynamic nature of real-world cloud services. Users developing agents for cloud operations tasks with Azure AI Agent Service can evaluate and improve them using AIOpsLab. Furthermore, current approaches do not agree on standard metrics or a standard taxonomy for operational tasks. This calls for a standardized and principled research framework for building, testing, comparing, and improving AIOps agents. The framework should allow agents to interact with realistic service operation tasks in a reproducible manner. It must be flexible in extending to new applications, workloads, and faults. Importantly, it should go beyond just evaluating the AI agents and enable users to improve the agents themselves, for example by providing sufficient observability and even serving as a training environment (“gym”) to generate samples to learn on.
We developed AIOpsLab, a holistic evaluation framework that enables researchers and developers to design, develop, evaluate, and enhance AIOps agents, and that also serves as a reproducible, standardized, interoperable, and scalable benchmark. AIOpsLab is open sourced on GitHub (opens in new tab) under the MIT license, so that researchers and engineers can leverage it to evaluate AIOps agents at scale. We recently presented the AIOpsLab vision paper (opens in new tab) at SoCC ’24. Please see the preprint (opens in new tab) for more details about the AIOpsLab framework.
Figure 1. System architecture of AIOpsLab.

Agent-cloud interface (ACI)

AIOpsLab strictly separates the agent and the application service using an intermediate orchestrator. It provides several interfaces for other system parts to integrate and extend. First, it establishes a session with an agent to share information about benchmark problems: (1) the problem description, (2) instructions (e.g., response format), and (3) available APIs to call as actions.
The APIs are a set of documented tools, e.g., get logs, get metrics, and exec shell, designed to help the agent solve a task. There are no restrictions on the agent’s implementation; the orchestrator poses problems and polls it for the next action to perform given the previous result. Each action must be a valid API call, which the orchestrator validates and carries out. The orchestrator has privileged access to the deployment and can take arbitrary actions (e.g., scale-up, redeploy) using appropriate tools (e.g., helm, kubectl) to resolve problems on behalf of the agent. Lastly, the orchestrator calls workload and fault generators to create service disruptions, which serve as live benchmark problems. AIOpsLab provides additional APIs to extend to new services and generators.
The following example shows how to onboard an agent to AIOpsLab:

```python
import asyncio
from aiopslab import Orchestrator

class Agent:
    def __init__(self, prob, instructs, apis):
        # set_prompt() and GPT4() are the agent author's own prompt builder and LLM client
        self.prompt = self.set_prompt(prob, instructs, apis)
        self.llm = GPT4()

    async def get_action(self, state: str) -> str:
        return self.llm.generate(self.prompt + state)

# initialize the orchestrator
orch = Orchestrator()
pid = "misconfig_app_hotel_res-mitigation-1"
prob_desc, instructs, apis = orch.init_problem(pid)

# register and evaluate the agent
agent = Agent(prob_desc, instructs, apis)
orch.register_agent(agent, name="myAgent")
asyncio.run(orch.start_problem(max_steps=10))
```

Service

AIOpsLab abstracts a diverse set of services to reflect the variance in production environments. This includes live, running services that are implemented using various architectural principles, including microservices, serverless, and monolithic.
We also leverage open-sourced application suites such as DeathStarBench as they provide artifacts, like source code and commit history, along with run-time telemetry. Adding tools like BluePrint can help AIOpsLab scale to other academic and production services.
Workload generator

The workload generator in AIOpsLab plays a crucial role by creating simulations of both faulty and normal scenarios. It receives specifications from the orchestrator, such as the task, desired effects, scale, and duration. The generator can use a model trained on real production traces to generate workloads that align with these specifications. Faulty scenarios may simulate conditions like resource exhaustion, exploit edge cases, or trigger cascading failures, inspired by real incidents. Normal scenarios mimic typical production patterns, such as daily activity cycles and multi-user interactions. When various characteristics (e.g., service calls, user distribution, arrival times) can lead to the desired effect, multiple workloads can be stored in the problem cache for use by the orchestrator. In coordination with the fault generator, the workload generator can also create complex fault scenarios with workloads.
Fault generator

AIOpsLab has a novel push-button fault generator designed for generic applicability across various cloud scenarios. Our approach integrates application and domain knowledge to create adaptable policies and “oracles” compatible with AIOps scenarios. This includes fine-grained fault injection capable of simulating complex failures inspired by production incidents. Additionally, it can inject faults at various system levels, exposing root causes while maintaining semantic integrity and considering interdependencies between cloud microservices. The fault injector’s versatility can enhance the reliability and robustness of cloud systems by enabling thorough testing and evaluation of AIOps capabilities.
Observability

AIOpsLab is equipped with an extensible observability layer designed to provide comprehensive monitoring capabilities across various system layers for any AIOps tool. AIOpsLab collects a wide array of telemetry data, including (1) traces from Jaeger detailing the end-to-end paths of requests through distributed systems, (2) application logs formatted and recorded by Filebeat and Logstash, and (3) system metrics monitored by Prometheus. Additionally, AIOpsLab also captures lower-level system information such as syscall logs and cluster information. As mentioned, we handle potential data overload through flexible APIs to tune the telemetry data relevant to the AIOps tools.
AIOpsLab currently supports four key tasks within the AIOps domain: incident detection, localization, root cause diagnosis, and mitigation. Additionally, we support several popular agent frameworks: ReAct, AutoGen, and TaskWeaver. Two key insights from the study emphasize the importance of observability and a well-designed ACI. Observability is crucial for clear root-cause diagnosis; for instance, pinpointing a misconfigured API gateway can be vital in preventing service downtime.
Flexibility is another critical factor; the ability to execute arbitrary shell commands allowed for effective troubleshooting in real-time scenarios. Lastly, robust error handling is essential—providing agents with high-quality feedback on execution barriers, such as a failed database connection, ensures swift resolution and continuous improvement.
Example of ACI on diagnosis task

Agent: I need to first look at the logs to understand if there are any errors or warnings that could indicate problems.

Action:
```
get_logs("compose-post-service", "test-social-network")
```

Service: [2024-Aug-04 23:18:49.365494] <info>: … Thrift: Sun Aug 4 23:19:19 2024 TSocket::open() connect() <Host: user-service Port: 9090>: Connection refused

Agent: … unable to connect to the `user-service` at port 9090, even though the pod is running. This suggests a possible network issue or misconfiguration in service discovery.

Next steps

This research project adopts Microsoft security standards and Responsible AI principles, and we envision this research evolving into a vital resource for organizations aiming to optimize their IT operations. Additionally, we plan to collaborate with various generative AI teams to incorporate AIOpsLab as a benchmark scenario for evaluating state-of-the-art models. By doing so, we aim to foster innovation and encourage the development of more advanced AIOps solutions. This research is essential not only for IT professionals but also for anyone invested in the future of technology, as it has the potential to redefine how organizations manage operations, respond to incidents, and ultimately serve their customers in an increasingly automated world.
Acknowledgements

We would like to thank Yinfang Chen, Manish Shetty, Yogesh Simmhan, Xuchao Zhang, Jonathan Mace, Dax Vandevoorde, Pedro Las-Casas, Shachee Mishra Gupta, and Suman Nath, for contributing to this project.
Research Focus: Week of December 16, 2024
Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.
NEW RESEARCH
NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering
The Compute Express Link (CXL) open standard interconnect enables integration of diverse types of memory into servers via its byte-addressable SerDes links. To fully utilize CXL-based heterogeneous memory systems (which combine different types of memory with varying access speeds), it’s necessary to implement efficient memory tiering—a strategy to manage data placement across memory tiers for optimal performance. Efficiently managing these memory systems is crucial, but has been challenging due to the lack of precise and efficient tools for understanding how memory is accessed.
In a recent paper: NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering, researchers from Microsoft propose a novel solution that features a hardware/software co-design to address this problem. NeoMem offloads memory profiling functions to CXL device-side controllers, integrating a dedicated hardware unit called NeoProf, which monitors memory accesses and provides the operating system (OS) with crucial page hotness statistics and other system state information. On the OS kernel side, the researchers designed a revamped memory-tiering strategy, enabling accurate and timely hot page promotion based on NeoProf statistics. Implemented on a real FPGA-based CXL memory platform and Linux kernel v6.3, NeoMem demonstrated 32% to 67% geomean speedup over several existing memory tiering solutions.
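Conceptually, the OS-side policy is a threshold rule over the hotness statistics the hardware reports: promote CXL pages whose access counts exceed a threshold into local DRAM, demoting the coldest DRAM pages when space runs out. The sketch below is our own simplification in Python (the real system is an in-kernel mechanism); the thresholds and capacities are made-up values.

```python
def retier(pages, hotness, promote_threshold=64, dram_capacity=1024):
    """Toy memory-tiering pass: promote hot CXL pages to DRAM, demote cold DRAM pages.

    `pages` maps page id -> current tier ("dram" or "cxl"); `hotness` maps page id ->
    access count reported by the profiler (NeoProf in the paper). Thresholds are illustrative.
    """
    dram = [p for p, tier in pages.items() if tier == "dram"]
    hot_cxl = [p for p, tier in pages.items()
               if tier == "cxl" and hotness.get(p, 0) >= promote_threshold]

    # promote the hottest CXL pages first
    for p in sorted(hot_cxl, key=lambda q: hotness[q], reverse=True):
        if len(dram) >= dram_capacity:
            coldest = min(dram, key=lambda q: hotness.get(q, 0))
            if hotness.get(coldest, 0) >= hotness[p]:
                break                      # nothing colder to evict; stop promoting
            pages[coldest] = "cxl"         # demote the coldest DRAM page
            dram.remove(coldest)
        pages[p] = "dram"
        dram.append(p)
    return pages
```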
Read the paper

NEW RESEARCH
Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases
Planning and conducting chemical syntheses is a significant challenge in the discovery of functional small molecules, which limits the potential of generative AI for molecular inverse design. Although early machine learning-based retrosynthesis models have shown the ability to predict reasonable routes, they are less accurate for infrequent, yet important reactions.
In a recent paper: Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases, researchers from Microsoft and external colleagues address this limitation, with a new framework for building highly accurate reaction models. Chimera incorporates two newly developed models, each achieving state-of-the-art performance in their respective categories. Evaluations by PhD-level organic chemists show that Chimera’s predictions are preferred for their higher quality compared to baseline models.
The researchers further validate Chimera’s robustness by applying its largest-scale model to an internal dataset from a major pharmaceutical company, demonstrating its ability to generalize effectively under distribution shifts. This new framework shows the potential to substantially accelerate the development of even more accurate and versatile reaction prediction models.
Read the paper
NEW RESEARCH
The GA4GH Task Execution API: Enabling Easy Multicloud Task Execution
In bioinformatics and computational biology, data analysis often involves chaining command-line programs developed by specialized teams at different institutions. These tools, which vary widely in age, software stacks, and dependencies, lack a common programming interface, which makes integration, workflow management and reproducibility challenging.
A recent article (opens in new tab) emphasizes the development, adoption and implementation of the Global Alliance for Genomics and Health (GA4GH) Task Execution Service (TES) API, created in collaboration with researchers at Microsoft and other institutions. The TES API offers a unified schema and interface for submitting and managing tasks, seamlessly bridging gaps between on-premises high-performance and high-throughput computing systems, cloud platforms, and hybrid infrastructures. Its flexibility and extensibility have already made it a critical asset for applications ranging from federated data analysis to load balancing across multi-cloud systems.
Adopted by numerous service providers and integrated into several workflow engines, TES empowers researchers to execute complex computational tasks through a single, abstracted interface. This eliminates compatibility hurdles, accelerates research timelines, reduces costs and enables “compute to data” solutions—essential for tackling the challenges of distributed data analysis.
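In practice, a TES task is a small JSON document describing the container image, command, inputs, outputs, and resources, which a client POSTs to the service's tasks endpoint and then polls for status. The example below follows the general shape of the GA4GH TES v1 schema; the endpoint URL, storage URLs, and container image are placeholders, and exact field names should be checked against the spec.

```python
import requests

TES_URL = "https://tes.example.org/ga4gh/tes/v1"   # placeholder endpoint

task = {
    "name": "samtools-flagstat",
    "inputs": [{"url": "s3://bucket/sample.bam", "path": "/data/sample.bam"}],
    "outputs": [{"url": "s3://bucket/flagstat.txt", "path": "/data/flagstat.txt"}],
    "executors": [{
        "image": "quay.io/biocontainers/samtools:1.19",
        "command": ["sh", "-c", "samtools flagstat /data/sample.bam > /data/flagstat.txt"],
    }],
    "resources": {"cpu_cores": 2, "ram_gb": 4},
}

resp = requests.post(f"{TES_URL}/tasks", json=task)
task_id = resp.json()["id"]
state = requests.get(f"{TES_URL}/tasks/{task_id}").json()["state"]   # e.g., QUEUED, RUNNING, COMPLETE
print(task_id, state)
```

Because the same document can be submitted to any TES-compliant service, the same task can run on an on-premises cluster or any supported cloud without modification.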
Read the paper

NEW RESEARCH
RedCode: Risky Code Execution and Generation Benchmark for Code Agents
The increasing use of code agents for AI-assisted coding and software development has brought safety and security concerns, such as generating or executing malicious code, which have become significant barriers to the real-world deployment of these agents.
In a recent paper: RedCode: Risky Code Execution and Generation Benchmark for Code Agents, published at NeurIPS 2024, researchers from Microsoft and external colleagues propose comprehensive and practical evaluations on the safety of code agents. RedCode is an evaluation platform with benchmarks grounded in four key principles: real interaction with systems, holistic evaluation of unsafe code generation and execution, diverse input formats, and high-quality safety scenarios and tests.
This research evaluated three agents based on various large language models (LLMs), providing insights into code agents’ vulnerabilities. For instance, results showed that agents are more likely to reject executing unsafe operations on the operating system. Unsafe operations described in natural text lead to a lower rejection rate than those in code format. Additional evaluations revealed that more capable base models and agents with stronger overall coding abilities, such as GPT-4, tend to produce more sophisticated harmful software.
These findings highlight the need for stringent safety evaluations for diverse code agents. The underlying dataset and related code are publicly available at https://github.com/AI-secure/RedCode (opens in new tab).
Read the paper

NEW RESEARCH
Towards industrial foundation models: Integrating large language models with industrial data intelligence
Although large language models (LLMs) excel at language-focused tasks like news writing, document summarization, customer service, and supporting virtual assistants, they can face challenges when it comes to learning and inference on numeric and structured industry data, such as tabular and time series data. To address these issues, researchers from Microsoft propose a new approach to building industrial foundation models (IFMs). As outlined in a recent blog post, they have successfully demonstrated the feasibility of cross-domain universal in-context learning on tabular data and the significant potential it could achieve.
The researchers designed Generative Tabular Learning (opens in new tab) (GTL), a new framework that integrates multi-industry zero-shot and few-shot learning capabilities into LLMs. This approach allows the models to adapt and generalize to new fields, new data, and new tasks more effectively, flexibly responding to diverse data science tasks. This technical paradigm has been open-sourced (opens in new tab) to promote broader use.
Read the paper

Microsoft Research in the news

Microsoft’s smaller AI model beats the big guys: Meet Phi-4, the efficiency king
December 12, 2024
Microsoft launched a new artificial intelligence model today that achieves remarkable mathematical reasoning capabilities while using far fewer computational resources than its larger competitors.
Microsoft researcher Ece Kamar discusses the future of AI agents in 2025
Tech Brew | December 12, 2024
With AI agents widely expected to take off in 2025, the director of Microsoft’s AI Frontiers lab weighs in on the future of this technology, the safeguards needed, and the year ahead in AI research.
A new frontier awaits — computing with light
December 12, 2024
In the guts of a new type of computer, a bunch of tiny LEDs emit a green glow. Those lights have a job to do. They’re performing calculations. Right now, this math is telling the computer how to identify handwritten images of numbers. The computer is part of a research program at Microsoft.
View more news and awards
PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts
AI is reshaping industries—from education to healthcare—thanks to advancements in large language models (LLMs). These models rely on prompts, carefully crafted inputs that guide them to produce relevant and meaningful outputs. While the impact of prompts is profound, creating prompts that can help with complex tasks is a time-intensive and expertise-heavy process, often involving months of trial and error.
This challenge grows as new tasks arise and models evolve rapidly, making manual methods for prompt engineering increasingly unsustainable. The question then becomes: How can we make prompt optimization faster, more accessible, and more adaptable across diverse tasks?
- Download PromptWizard
To address this challenge, we developed PromptWizard (PW), a research framework that automates and streamlines the process of prompt optimization. We are open sourcing the PromptWizard codebase (opens in new tab) to foster collaboration and innovation within the research and development community.
Introducing PromptWizard

PromptWizard (PW) is designed to automate and simplify prompt optimization. It combines iterative feedback from LLMs with efficient exploration and refinement techniques to create highly effective prompts within minutes.
PromptWizard optimizes both the instruction and the in-context learning examples. Central to PW is its self-evolving and self-adaptive mechanism, where the LLM iteratively generates, critiques, and refines prompts and examples in tandem. This process ensures continuous improvement through feedback and synthesis, achieving a holistic optimization tailored to the specific task at hand. By evolving both instructions and examples simultaneously, PW ensures significant gains in task performance.
Three key insights behind PromptWizard:
- Feedback-driven refinement: At its core, PW leverages an iterative feedback loop where the LLM generates, critiques, and refines its own prompts and examples. This continuous improvement mechanism ensures that each iteration is better than the last, leading to highly effective prompts and examples.
- Joint optimization and synthesis of diverse examples: PW generates synthetic examples that are not only robust and diverse but also task-aware. By optimizing prompts and examples together, it ensures they work in tandem to address specific task requirements effectively.
- Self-generated chain-of-thought (CoT) steps: Incorporating CoT reasoning improves the problem-solving capabilities of the model. By using selected few-shot examples, PW generates a detailed reasoning chain for each example, facilitating nuanced and step-by-step problem-solving approaches.
PromptWizard begins with a user input: a problem description, an initial prompt instruction, and a few training examples that serve as a foundation for the task at hand.
Its output is a refined, optimized set of prompt instructions paired with carefully curated in-context few-shot examples. These outputs are enriched with detailed reasoning chains, task intent, and an expert profile that bridges human-like reasoning with the AI’s responses.
Stage 1: Refinement of prompt instruction
The first stage focuses on refining the task instructions of a prompt. PromptWizard generates multiple candidate instructions, evaluates them using feedback from the LLM, and iteratively synthesizes improved versions. This process balances exploration—trying diverse ideas—and exploitation—refining the most promising ones.
For example, if an initial instruction yields suboptimal results, PW incorporates feedback to identify its shortcomings and generates an improved version. Over three to five iterations, this iterative cycle ensures that the instruction converges to an optimal state.
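Put in pseudocode-style Python, Stage 1 is a small explore-critique-synthesize loop. This is our own distillation of the description above, not the PromptWizard API; `llm` and `score` are caller-supplied callables.

```python
def refine_instruction(llm, score, instruction, train_examples, n_iterations=4, n_candidates=5):
    """Illustrative critique-and-synthesize loop for refining a prompt instruction.

    `llm` is a text-in/text-out callable; `score` evaluates an instruction on the
    training examples (e.g., accuracy). Both are supplied by the caller.
    """
    best = instruction
    for _ in range(n_iterations):
        # explore: mutate the current best instruction into diverse candidates
        candidates = [llm(f"Rewrite this task instruction in a different style:\n{best}")
                      for _ in range(n_candidates)]
        # exploit: keep whichever candidate performs best on the training examples
        best = max(candidates + [best], key=lambda c: score(c, train_examples))
        # critique and synthesize: ask the LLM what is wrong and produce an improved version
        critique = llm(f"Critique this instruction and list its shortcomings:\n{best}")
        best = llm(f"Improve the instruction using this critique:\nCritique: {critique}\nInstruction: {best}")
    return best
```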
Figure 2. Refinement of prompt instruction

Stage 2: Joint optimization of instructions and examples
The refined prompt obtained from Stage 1 is combined with carefully selected examples, and both are optimized together. Through the critique-and-synthesis mechanism, PromptWizard ensures alignment between the prompt and examples, simultaneously synthesizing new examples to enhance task performance.
This structured approach makes PromptWizard highly versatile, adapting to tasks as varied as solving math problems or generating creative content.
Figure 3. Joint optimization of instructions and examples
Results

PromptWizard stands out for its feedback-driven refinement and systematic exploration, delivering exceptional results across a wide variety of tasks while maintaining computational efficiency.
Comprehensive evaluation across tasks

PromptWizard was rigorously evaluated on over 45 tasks, spanning both general and domain-specific challenges. Benchmarked against state-of-the-art techniques—including Instinct, InstructZero, APE, PromptBreeder, EvoPrompt, DSPy, APO, and PromptAgent—PW consistently outperformed competitors in accuracy, efficiency, and adaptability. Please see detailed results in our paper.
- Accuracy: PW consistently outperformed other methods, maintaining performance close to the best across all tasks. Figure 4 shows the performance profile curve that highlights PW’s reliability, demonstrating how frequently it achieves near-best accuracy compared to other approaches for BigBench Instruction Induction dataset (BBII).
- Efficiency: Beyond accuracy, PW demonstrates its computational efficiency. Unlike many baseline methods that require extensive API calls and computational resources, PW achieves superior results with minimal overhead by striking an effective balance between exploration and exploitation. Table 1 demonstrates PW’s cost-effectiveness, with significantly reduced token usage for input and output while optimizing prompts effectively.
We have also conducted numerous experiments to highlight PromptWizard’s efficacy with limited training data and smaller LLMs.
Resilience with limited data

Real-world scenarios often lack abundant training data. PW excels in such conditions, requiring as few as five examples to produce effective prompts. Across five diverse datasets, PW demonstrated an average accuracy drop of only 5% when using five examples compared to 25 examples—highlighting its adaptability and efficiency (see Table 2).
| Datasets | 5 Examples | 25 Examples |
| --- | --- | --- |
| MMLU | 80.4 | 89.5 |
| GSM8k | 94 | 95.4 |
| Ethos | 86.4 | 89.4 |
| PubMedQA | 68 | 78.2 |
| MedQA | 80.4 | 82.9 |
| Average | 81.9 | 87 |

Table 2. PW’s performance with varying number of examples

Leveraging smaller models for optimization

PromptWizard also reduces computational costs by using smaller LLMs for prompt generation, reserving more powerful models for inference. For example, using Llama-70B for prompt generation resulted in negligible performance differences compared to GPT-4, while significantly lowering resource usage (see Table 3).
| Dataset | Prompt Gen: Llama-70B | Prompt Gen: GPT4 |
| --- | --- | --- |
| GSM8k | 94.6 | 95.4 |
| Ethos | 89.2 | 89.4 |
| Average | 91.9 | 92.4 |

Table 3. Performance with smaller LLMs for prompt generation

PromptWizard shows that effective prompts combine optimized instructions refined through iterative feedback, thoughtfully chosen in-context examples, and a modular design that incorporates expert knowledge and task-specific intent. This approach enables the framework to handle a broad range of tasks, from simple to highly complex, with exceptional efficiency and flexibility.
Whether you are a researcher addressing cutting-edge challenges or an organization looking to streamline workflows, PromptWizard provides a practical, scalable, and impactful solution for enhancing model performance.
Moving to GraphRAG 1.0 – Streamlining ergonomics for developers and users
Microsoft debuted (opens in new tab) the pre-release version of GraphRAG (opens in new tab) in July 2024 to advance AI use in complex domains. Since that time, we’ve seen incredible adoption and community engagement (over 20k stars and 2k forks on GitHub as of this writing), with numerous fixes and improvements by the core team and community contributors. We’re deeply grateful for the contributions and feedback we’ve received and are excited to share a number of major ergonomic and structural improvements that culminate in the official release of GraphRAG 1.0.
Ergonomic refactors

Easier setup for new projects

When we first launched GraphRAG, most config was done using environment variables, which could be daunting, given the many options available. We’ve reduced the friction on setup by adding an init command (opens in new tab) that generates a simplified starter settings.yml file with all core required config already set. We recommend developers start here to ensure they get the clearest initial config. With this update, a minimal starting config does not require the user to have expertise with GraphRAG for a quick setup, only an OpenAI API key in their environment.
New and expanded command line interface

We expanded the functionality and ease of use of the command line interface (opens in new tab) (CLI) and adopted Typer (opens in new tab) to provide better inline documentation and a richer CLI experience. The original CLI was intended as a starter demo for users to try GraphRAG on a sample dataset. We’ve since learned from the community that most people actually want to use this as their primary interaction mode for GraphRAG, so as part of this milestone release, we’ve incorporated enhancements that result in a more streamlined experience. From this work, CLI startup times dropped from an average of 148 seconds to 2 seconds.
Consolidated API layer

In August 2024 we introduced a standalone API layer to simplify developer usage. The original CLI contained all the code required to instantiate and execute basic indexing and query commands, which users often needed to replicate. The API layer is still considered provisional as we gather feedback, but is intended to be the primary entry point for developers who wish to integrate GraphRAG functionality into their own applications without deep pipeline or query class customization. In fact, the CLI and Accelerator (opens in new tab) are built entirely on top of the API layer, acting as a documented example of how to interact with the API. We have also added examples of how to use this API to our notebook collection (opens in new tab) that we will continue to update as we iterate in future releases.
Simplified data model

GraphRAG creates several output artifacts to store the indexed knowledge model. The initial model contained a large number of files, fields, and cross-references based on experimental ideas during the early research, which can be overwhelming for both new and routine users. We performed a comprehensive review of the data model and incorporated fixes to add clarity and consistency, remove redundant or unused fields, improve storage space, and simplify the data models. Previously, the output lacked standardization, and relevant outputs could easily be confused with non-critical intermediary output files. Now with GraphRAG 1.0, the output will only include relevant outputs that are easily readable and traceable.
Streamlined vector stores

Embeddings and their vector stores are some of the primary drivers of GraphRAG’s storage needs. Our original data model stored all embeddings within the parquet output files after data ingestion and indexing. This made the files portable, which was convenient for early research, but for many users it became unnecessary as they configured their own vector stores and the scale of data ingestion grew. We have updated the GraphRAG pipeline to create a default vector store during indexing, so no post-processing is needed, and the query library shares this configuration for seamless use. The benefit of this change is that those vectors (which can be quite large) no longer need to be loaded when the output files are read from disk, saving read time and memory during every query. Coupled with the simplified data model, this resulted in output parquet disk savings of 80%, and total disk space (including embeddings in the vector store) reduction of 43%. GraphRAG supports LanceDB and Azure AI Search out-of-the-box for vector stores. For simple startup, LanceDB is used as the default, and is written to a local database alongside the knowledge model artifacts.
Flatter, clearer code structure

A key initiative on the road to version 1.0 has been to simplify the codebase so it is easier to maintain and more approachable for third-party users. We’ve removed much of the code depth from the organization to make it easier to browse, and co-located more code that our own usage patterns indicate was not required to be in separate functional areas.
We have also found that very few users need the declarative configuration that the underlying DataShaper (opens in new tab) engine provides, so we collapsed these 88 verbose workflow definitions into a smaller set of 11 workflows that operate in a functional versus composed manner. This makes the pipeline easier to understand and is a step toward an architecture that is better suited for our future research plans and improves performance across the board. By collapsing workflows, we now have fewer unused output artifacts, reduced data duplication, and fewer disk I/O operations. This streamlining has also reduced the in-memory footprint of the pipeline, enabling users to index and analyze larger datasets with GraphRAG.
Incremental ingest

Until now, an evolving dataset needed complete re-indexing every time new information was acquired in order to re-generate the knowledge model. In GraphRAG 1.0 we are including a new update command in the CLI that computes the deltas between an existing index and newly added content and intelligently merges the updates to minimize re-indexing. GraphRAG uses an LLM caching mechanism to save as much cost as possible when re-indexing, so re-runs over a dataset are often significantly faster and cheaper than an initial run. Adding brand new content can alter the community structure such that much of an index needs to be re-computed – the update command (opens in new tab) resolves this while also improving answer quality.
AvailabilityGraphRAG version 1.0 is now available on GitHub (opens in new tab), and published to PyPI (opens in new tab). Check out the Getting Started (opens in new tab) guide to use GraphRAG 1.0 today.
MigratingWe recommend users migrate to GraphRAG 1.0, which offers a streamlined experience including multiple improvements for both users and developers. However, because of the breadth of its updates, version 1.0 is not backwards compatible. If you’ve used GraphRAG prior to version 1.0 and have existing indexes, there are a handful of breaking changes that need to be addressed, but this should be a straightforward process. To support the community in this migration, we’ve created a migration guide (opens in new tab) in the repository with more information.
Future directionsWe recently posted about a brand-new approach to GraphRAG called LazyGraphRAG, which performs minimal up-front indexing to avoid LLM usage until user queries are executed. This avoids LLM-based summarization of large volumes of content that may not be interesting to users – and therefore never explored even after expensive processing. This approach shows strong performance at a fraction of the cost of GraphRAG, and will be added to the core GraphRAG codebase in the near future as a new option for users.
Additionally, Microsoft has been active in exploring how GraphRAG can advance the rate of scientific progress, and is in the process of building relevant GraphRAG capabilities to align with our broader work in AI-enabled scientific discovery (opens in new tab).
We continue to refine the codebase and investigate architectural changes that will enable users to use their own language model APIs, storage providers, and vector stores. We’re excited about this major milestone, and the foundation that this refactoring lays for our continued research in the GraphRAG space.
The post Moving to GraphRAG 1.0 – Streamlining ergonomics for developers and users appeared first on Microsoft Research.
Research Focus: Week of December 2, 2024
Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.
NEW RESEARCH Adaptive Security, Erasures, and Network Assumptions in Communication-Local MPCMulti-party computation (MPC) is a cryptographic technique that allows n separate parties to securely compute a function on their joint data while keeping their inputs private. To build such a protocol, most works require all pairs of participating parties to be able to communicate securely and reliably with each other. Recently, the problem of Communication-Local (CL) MPC has been explored, where this assumption is modeled more realistically – e.g., by only requiring that participating parties can communicate securely and reliably with a few other participating parties (as, for example, in networks like blockchains). However, few solutions exist that guarantee adaptive security—resilience to dynamic corruption of parties—and most rely on strong assumptions about party actions.
In a recent paper: Adaptive Security, Erasures, and Network Assumptions in Communication-Local MPC, researchers from Microsoft and external collaborators revisit assumptions made in earlier work. The authors conclude that for secure, adaptive CL-MPC, some previously assumed capabilities (like secure erasure and multisend) can be bypassed under certain conditions; however, fully reducing all-to-all to all-to-one communication remains unachievable in CL settings without some minimal assumptions. They propose a new SOS-RMT protocol, enabling more efficient CL-MPC under specific feasibility bounds and additional cryptographic assumptions.
Read the paper NEW RESEARCH Cuttlefish: A Fair, Predictable Execution Environment for Cloud-hosted Financial ExchangesLow-latency algorithmic trading is driving efficiency in modern financial markets by promoting accurate/timely pricing of securities, higher liquidity, and lower trade costs for investors. The goal is to process incoming market data and issue trades as quickly as possible to take advantage of ephemeral market-making and arbitrage opportunities. Interest in cloud-hosted financial exchanges is growing, as they promise a cost-effective platform more accessible to market participants, among other benefits.
Unfortunately, one of the major roadblocks in cloud environments is ensuring equal access to network and compute resources despite unpredictable network latencies and non-deterministic computation times.
In a recent preprint: Cuttlefish: A Fair, Predictable Execution Environment for Cloud-hosted Financial Exchanges, researchers from Microsoft and external collaborators present a fair-by-design algorithmic trading platform that can run in cloud environments. Cuttlefish aims to apply an efficient and robust mapping of real operations to a novel formulation of ‘virtual time’. This allows Cuttlefish to push fairness to the extreme, regardless of the underlying network communication and computation hardware. The researchers’ implementation and evaluation validate the practicality of Cuttlefish and show its operational efficiency on public cloud platforms. This paper builds on previous work: Rethinking Cloud-hosted Financial Exchanges for Response Time Fairness and DBO: Fairness for Cloud-Hosted Financial Exchanges.
Read the paper NEW RESEARCH LLM2CLIP: Powerful language model unlocks richer visual representationCLIP is a prominent multimodal foundational model, aligning visual and textual signals into a shared feature space. It supports various tasks, including zero-shot classification, detection, segmentation, and cross-modal retrieval, significantly influencing the entire multimodal domain. As a feature extractor, it has become dominant in cross-modal representation tasks such as image understanding, video understanding, and text-to-image/video generation. However, rapid advancements in large language models (LLMs) are continually pushing the boundaries of language comprehension and generation. Can the capabilities of LLMs be harnessed to further improve multimodal representation learning?
In a recent article: LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation, researchers from Microsoft and external collaborators propose LLM2CLIP, a novel approach to unlock CLIP’s potential, focusing on fundamental optimizations of promising foundation models. By fine-tuning the LLM in the caption space with contrastive learning, they extract its textual capabilities into the output embeddings, significantly improving the output layer’s textual discriminability. The researchers then design a training process where the fine-tuned LLM acts as a powerful teacher for CLIP’s visual encoder. The LLM’s presence allows them to incorporate longer and more complex captions without being restricted by the context window and capability limitations of CLIP’s text encoder. Their experiments demonstrate that this approach brings substantial improvements in cross-modal tasks.
Read the paper Learn more about LLM2CLIP NEW RESEARCH LORASC: Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded LearningFoundation models, which are large-scale models pre-trained on extensive datasets and subsequently adapted for specific downstream tasks, have become integral to contemporary machine learning frameworks. Fine-tuning these models is essential, yet full parameter fine-tuning often encounters significant memory and computational bottlenecks. Parameter-efficient fine-tuning (PEFT) techniques aim to minimize the number of trainable parameters to reduce training costs and improve training stability. Among these techniques, Low-Rank Adaptation (LoRA) is highly efficient, although limitations in its expressiveness and generalization have been noted.
In a recent paper: LORASC: Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning, researchers from Microsoft and external collaborators present an innovative technique designed to enhance LoRA’s expressiveness and generalization capabilities while preserving its training efficiency. Their cascaded learning strategy enables a mixture-of-low-rank adaptation, thereby increasing the model’s ability to capture complex patterns. They also introduce a slow-fast update mechanism and cascading noisy tuning to bolster generalization. Their extensive experiments on various language and vision datasets, as well as robustness benchmarks, show that the proposed method significantly outperforms existing baselines, while also mitigating overfitting, enhancing model stability, and improving out-of-distribution (OOD) robustness.
Read the paper Microsoft Research in the news Can AI spot the next bomb cyclone far in advance? Microsoft hopes soSeattle Times | November 23, 2024
Microsoft claims that Aurora, a deep-learning model that’s constantly being trained, can produce weather forecasts much faster than — and with accuracy that meets or exceeds — traditional forecasting models.
How Microsoft's next-gen BitNet architecture is turbocharging LLM efficiencyVentureBeat | November 13, 2024
One-bit large language models (LLMs) have emerged as a promising approach to making generative AI more accessible and affordable. In a new paper, Microsoft researchers introduce BitNet a4.8, a new technique that further improves the efficiency of one-bit LLMs without sacrificing their performance.
2024 Ellison Cliffe Lecture: AI in science and medicine with Christopher BishopRoyal Society of Medicine | November 13, 2024
Christopher Bishop, Technical Fellow and Director of Microsoft Research AI for Science, discusses the extraordinary advances in the deep learning technology that underpins the AI revolution, including crucial progress in the fields of scientific discovery and medicine. This recent speech at the Royal Society of Medicine includes current examples of AI’s impact in materials design, drug discovery, and healthcare.
View more news and awards The post Research Focus: Week of December 2, 2024 appeared first on Microsoft Research.
MarS: A unified financial market simulation engine in the era of generative foundation models
Generative foundation models have transformed various domains, creating new paradigms for content generation. Integrating these models with domain-specific data enables industry-specific applications. Microsoft Research has used this approach to develop the large market model (LMM) and the Financial Market Simulation Engine (MarS) for the financial domain. These innovations have the potential to empower financial researchers to customize generative models for diverse scenarios, establishing a new paradigm for applying generative models to downstream tasks in financial markets. This integration may provide enhanced efficiency, more accurate insights, and significant advancements in the financial domain.
Applying generative models to financial marketsIn recent years, generative foundation models have achieved notable success in fields like natural language processing and media generation. Their rise has sparked a new wave of research and industrial adoption, reshaping production processes across industries. These models excel due to three essential elements: a large volume of high-quality training data; effective tokenization and serialization of core information (such as semantic information in text); and an auto-regressive training approach that models data comprehensively, enabling implicit reasoning.
Building on years of AI applications across industries, Microsoft researchers recognized that combining generative models with domain-specific data could lead to impactful solutions, particularly in finance. The financial market is a prime example, notably for its vast amount of order data, which are characterized by three key features:
- Fine granularity: Orders, as the atomic data in the financial market, provide a comprehensive and detailed representation of the real market. Combined with matching rules, one can reproduce the entire market operation process.
- Large scale: Electronic trading has resulted in the accumulation of massive trade-order data across global exchanges.
- Well-structured: The structured nature of order data makes it ideal for tokenization and sequential modeling.
These characteristics position order flow data as a critical foundation for generative modeling in financial markets. To this end, Microsoft Research developed LMM and MarS, which financial researchers can use to customize generative models for various applications, thus fostering a new paradigm of generative solutions for all downstream tasks in finance. This has the potential to advance efficiency and insight generation in the financial industry.
Figure 1: Illustration of stock market and orders Tokenization of order flow informationOrder flow data is vital for generative models in finance, reflecting real-time interactions among market participants. It offers two types of value:
- Fine-grained market feedback: Each order, especially large ones, may influence others’ decisions, providing a micro-level view of pricing behavior.
- Macroscopic market dynamics: Collective interactions shape trading dynamics over time, capturing the evolution and resolution of conflicts between market forces.
Researchers at Microsoft developed LMM by modeling both individual orders and entire order sets over time. This two-tiered approach captures both fine-grained feedback and macro-level dynamics of competition. Figure 2 shows the tokenization techniques for these models, enabling high-fidelity simulations of complex market dynamics.
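As a rough illustration of what order-level tokenization can look like, the sketch below discretizes a single limit order into a handful of symbolic tokens. The bucket sizes and token vocabulary are hypothetical and are not the actual LMM scheme described in the paper.

```python
# Illustrative sketch of turning a single limit order into discrete tokens for
# next-token prediction. The bucket sizes and vocabulary are hypothetical and
# not the actual LMM tokenization scheme.
from dataclasses import dataclass

@dataclass
class Order:
    side: str         # "buy" or "sell"
    price: float      # limit price
    volume: int       # number of shares
    interval_ms: int  # time since the previous order

def tokenize_order(order: Order, tick: float = 0.01) -> list:
    price_bucket = int(round(order.price / tick))   # discretize price to ticks
    volume_bucket = min(order.volume // 100, 99)    # coarse, capped volume buckets
    time_bucket = min(order.interval_ms // 10, 99)  # coarse, capped time buckets
    return [
        f"<side:{order.side}>",
        f"<price:{price_bucket}>",
        f"<vol:{volume_bucket}>",
        f"<dt:{time_bucket}>",
    ]

print(tokenize_order(Order("buy", 10.57, 800, 35)))
```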
Figure 2: Tokenization for individual orders (top) and batch orders (bottom) Expansion law of large market model: Unlocking the potential of financial dataThe effectiveness of generative models improves significantly with larger training datasets and model parameters. Researchers at Microsoft used two tokenization strategies to design models based on the Transformer architecture, testing them across varying data scales. Figure 3 illustrates the scaling behavior of both the order and order batch models, highlighting insights from historical trading data. This integration enhances the model’s ability to generate order flows with a deep understanding of market intricacies, enabling more accurate time-series modeling.
Figure 3: Scaling curves of order and batch order models under different parameter sizes MarS based on LMM: A customizable generative model for financial scenariosGenerative models, once trained, can be easily adapted for a range of downstream tasks, often outperforming traditional models tailored for specific scenarios. Building on the development of LMM, researchers further analyzed the needs of various financial scenarios and designed MarS as a versatile financial market simulation engine. MarS not only serves as a general-purpose simulation tool but also introduces a novel framework for applying generative models across diverse financial tasks, from market prediction and risk assessment to trading strategy optimization.
Figure 4: Framework of MarS Constructing a unified paradigm for prediction and detection tasksTraditional financial prediction solutions often require the development of specialized algorithms, which must be frequently adjusted, consuming time and resources. LMM’s capacity to model financial markets in depth allows for periodic updates based on the latest data. MarS creates a virtual exchange to match order flows generated by LMM, simulating trades and deriving simulated market trajectories (see the top right of Figure 4). This approach can effectively address common prediction and detection tasks in financial scenarios, introducing innovative solutions within the generative model framework.
Applications in prediction tasksPrediction tasks, vital in finance, involve estimating future market metrics. Traditional models require modifications with any change in prediction targets. MarS addresses this by continuously generating future order flows from recent data, which are matched in a virtual exchange, allowing for the simulation of potential future market trajectories. This provides a significant advancement in forecasting capabilities.
Figure 5 demonstrates the use of MarS in forecasting stock-price movements, where its performance significantly outperforms traditional benchmark algorithms. Taking the Order Model (1.02B) as an example, its performance exceeds that of DeepLOB by approximately 13.5% (0.662/0.583 − 1) at a 1-minute horizon and by 22.4% (0.579/0.473 − 1) at a 5-minute horizon. This widening performance gap suggests that the Order Model maintains its predictive accuracy more effectively over longer horizons, highlighting its superior generalization capability compared to the baseline, especially as the prediction task becomes more challenging over extended timeframes. This provides an attractive solution for prediction tasks in financial markets, while also highlighting LMM’s capability in modeling stock market dynamics.
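The relative improvements quoted above follow directly from the reported scores:

```python
# Reproducing the relative-improvement arithmetic quoted above for the
# Order Model (1.02B) versus DeepLOB.
deeplob_1min, order_model_1min = 0.583, 0.662
deeplob_5min, order_model_5min = 0.473, 0.579

print(f"{order_model_1min / deeplob_1min - 1:.1%}")  # ~13.5% at the 1-minute horizon
print(f"{order_model_5min / deeplob_5min - 1:.1%}")  # ~22.4% at the 5-minute horizon
```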
Figure 5: Predicting stock price trends with MarS Applications in detection tasksFor regulators, detecting systemic risks or market abuse is critical for market stability. LMM models typical market patterns, enabling the identification of anomalies by comparing real market trajectories with those generated by MarS. Figure 6 shows the differences in the spread distribution (i.e., the difference between the best buy and sell prices, which reflects asset liquidity) between simulated and real market trajectories during a confirmed malicious market manipulation incident. This comparison can uncover subtle deviations indicative of unusual activities, offering regulators effective tools for monitoring market integrity.
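The spread statistic itself is simple to compute from order-book snapshots; a minimal sketch (with made-up book data) is shown below. Comparing the spread distributions of simulated and real trajectories is what surfaces the anomaly.

```python
# Minimal sketch of the spread statistic compared in Figure 6: the difference
# between the best sell (ask) and best buy (bid) price at each point in time.
# The order-book snapshots here are made up for illustration.
def spread(best_bid: float, best_ask: float) -> float:
    return best_ask - best_bid

real_book = [(10.55, 10.57), (10.54, 10.58), (10.50, 10.62)]       # (bid, ask) over time
simulated_book = [(10.55, 10.56), (10.55, 10.57), (10.54, 10.58)]

real_spreads = [spread(b, a) for b, a in real_book]
sim_spreads = [spread(b, a) for b, a in simulated_book]

# Comparing the two distributions (e.g., their correlation or divergence) is
# what flags anomalies such as the manipulation incident described above.
print(real_spreads, sim_spreads)
```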
Figure 6: Spread correlation between simulated and real market during market manipulation Defining new FinTech scenariosGenerative models can create tailored content from simple descriptions. In MarS, a mechanism generates specific order flows from natural language descriptions of market conditions. To address extreme conditions, researchers developed a control signal system using a hierarchical diffusion model to generate high-fidelity signals during rare events, such as stock market crashes and circuit breakers. This capability transforms broad descriptions into precise order flow controls.
By integrating controlled order generation with real-time feedback, MarS creates a unified framework for prediction and detection tasks, redefining financial research, applications, and market understanding. Key applications include “What If” analyses and training environments for reinforcement learning algorithms in realistic market conditions.
“What If” analysis for financial researchThe question “What would happen if different sizes of trading orders were executed under different market conditions?” is crucial for understanding market behavior. Traditional methods, relying on real orders, experience, and assumptions, are costly and slow. Generative models provide a breakthrough solution.
Figure 7 illustrates how MarS can simulate market impact: the top left shows how buy orders affect asset price trajectories, while the top right presents market impact curves of different strategies, matching traditional patterns. Researchers also used MarS to generate large-scale simulated data, constructing a market impact model using ordinary differential equations (ODEs). The bottom left of Figure 7 shows the derived impact formula, and the bottom right demonstrates its interpretability. These advancements highlight MarS’s potential to enhance “What If” research through deep market modeling.
Figure 7: Sample research results for market impact of orders using MarS Training environments for reinforcement learning in financial marketsReinforcement learning (RL) algorithms require controlled environments for testing and optimization. Financial market behaviors often manifest through order flow changes, impacting the market. If the simulation cannot reflect these impacts accurately, an RL algorithm may fail in real-world scenarios.
MarS provides high-fidelity generation and real-time feedback, creating a comprehensive environment for RL in finance. Figure 8 shows the training process of trading agents, highlighting significant improvements in performance over time and demonstrating MarS’s effectiveness as an RL training ground.
Figure 8: Performance of reinforcement learning trading agents trained in MarS. During training, the agent’s performance improved significantly, showcasing MarS’s ability to aid in developing robust reinforcement learning algorithms for real market conditions.

Disclaimer: The research mentioned in this article, conducted by Microsoft Research, focuses on scientific exploration, aiming to advance knowledge and provide theoretical and technological support for research and applications in the financial field. All studies adhere to Microsoft’s responsible AI guidelines, ensuring principles such as fairness, inclusiveness, reliability and safety, transparency, privacy, and accountability are maintained. The technologies and methods discussed are still under research and development, not yet forming any commercial products or services, nor constituting any financial solutions. Readers are advised to consult certified financial professionals before making any financial decisions.
The post MarS: A unified financial market simulation engine in the era of generative foundation models appeared first on Microsoft Research.
Advances in run-time strategies for next-generation foundation models
Groundbreaking advancements in frontier language models are progressing rapidly, paving the way for boosts in the accuracy and reliability of generalist models and making them highly effective in specialized domains. As part of our ongoing exploration of foundation model capabilities, we developed Medprompt last year—a novel approach to maximize model performance on specialized domains and tasks without fine-tuning. By leveraging multiphase prompting, Medprompt optimizes inference by identifying the most effective chain-of-thought (CoT) examples at run time and drawing on multiple calls to refine output. When deployed with GPT-4, Medprompt achieved an impressive 90.2% accuracy on the MedQA benchmark (USMLE-style), outperforming all other methods.
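One ingredient of this run-time approach is selecting chain-of-thought exemplars dynamically, based on similarity to the incoming question. The sketch below shows that selection step in isolation, with placeholder embeddings and exemplars; it is a simplified illustration, not the Medprompt implementation.

```python
# A minimal sketch of dynamic few-shot selection: at run time, pick the
# chain-of-thought exemplars whose questions are most similar to the incoming
# question, then prepend them to the prompt. The embedding vectors and
# exemplar store here are toy placeholders.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_exemplars(question_vec, exemplar_vecs, exemplars, k=5):
    """Rank stored (question, reasoning, answer) exemplars by similarity to the query."""
    scores = [cosine(question_vec, v) for v in exemplar_vecs]
    top = np.argsort(scores)[::-1][:k]
    return [exemplars[i] for i in top]

# Usage with toy 4-dimensional "embeddings" standing in for a real embedding model:
exemplars = ["Q1 + reasoning + answer", "Q2 + reasoning + answer", "Q3 + reasoning + answer"]
exemplar_vecs = [np.array([1, 0, 0, 0.0]), np.array([0, 1, 0, 0.0]), np.array([0, 0, 1, 0.0])]
question_vec = np.array([0.9, 0.1, 0, 0.0])
print(select_exemplars(question_vec, exemplar_vecs, exemplars, k=2))
```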
Figure 1. Comparative analyses of performance of multiple models on MedQA.Less than a year later, our tests show the OpenAI o1-preview demonstrated superior performance over Medprompt, reaching 96% on the same benchmark (Figure 1)—without using sophisticated prompt guidance and control. This advancement is driven by the model’s integration of run-time strategies at its core, enabling state-of-the-art results on medical licensing exams in the United States and Japan, medical subsets of the Massive Multitask Language Understanding (MMLU) benchmark, and nursing exams (NCLEX) as shown in Figure 2.
Figure 2. Comparisons on a wide range of medical challenge benchmarks.These results are notable, prompting us to publish our recent study, findings, and analyses, From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond (opens in new tab). But the numbers are only part of the story. In this blog, we discuss prompting strategies to make the most of o1-preview models and other factors to consider as well as directions forward for run-time strategies.
Is o1-preview “just” fancy prompting?The introduction of the OpenAI o1 model series marks a significant shift from prior GPT models. Unlike GPT, o1 models are trained using reinforcement learning (RL) techniques that enable them to “think” before generating outputs. While Medprompt relies on a cascade of operations with GPT-4 at run time guided by a multistage prompt, the o1 series incorporates this run-time reasoning directly into its RL-based design. The built-in functionality enables the o1 models to significantly outperform even the best results using GPT-4 and Medprompt. The performance gains come with a notable tradeoff: its per-token cost was approximately six times that of GPT-4o at the time of our evaluation. While the results for GPT-4o with Medprompt fall short of o1-preview model performance, the combination offers a more cost-effective alternative. The cost-benefit tradeoffs are highlighted in the following figure, with the x-axis presented on a logarithmic scale.
Figure 3. Pareto frontier showing accuracy versus total API cost (log scale) on the MedQA benchmark (1273 questions total). o1-preview (Sep 2024) is compared with GPT-4o (Aug 2024) and GPT-4 Turbo (Nov 2023). Can we prompt engineer o1-preview?The o1-preview model exhibits distinct run-time behaviors compared to the GPT series. While some of our more dynamic prompting strategies performed better than expected with o1-preview models, our most tried-and-true strategy was anything but consistent throughout our evaluation. Figure 4 captures specific performance results for Tailored Prompt, Ensembling, and Few-Shot Prompting on o1-preview. Here’s a summary of our findings:
- Tailored Prompt: While minimal prompting—like a brief, one-sentence description followed by a question—offered a strong baseline performance, detailed task descriptions were best for eliciting accurate responses.
- Ensembling: Generating multiple answers per question and using majority voting across different reasoning paths boosted reliability, while shuffling answers between runs produced richer reasoning chains and improved outcomes. Ensembling continues to yield consistent performance improvements (a minimal voting sketch follows this list).
- Few-Shot Prompting: Guiding the model with a few examples produced inconsistent results and, on average, decreased performance compared with GPT models.
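Here is the minimal voting sketch referenced in the ensembling item above. The `ask_model` callable is a placeholder for a call to the model under evaluation, not a real API.

```python
# A minimal sketch of the ensembling strategy described above: sample several
# answers per question (shuffling the multiple-choice options each time) and
# keep the majority vote. `ask_model` is a stand-in, not a real API.
import random
from collections import Counter

def ensemble_answer(question: str, options: list, ask_model, runs: int = 5) -> str:
    votes = []
    for _ in range(runs):
        shuffled = random.sample(options, k=len(options))  # shuffle to vary reasoning paths
        votes.append(ask_model(question, shuffled))        # each call returns the chosen option text
    return Counter(votes).most_common(1)[0][0]

# Usage with a stand-in model that always picks the first option it sees:
fake_model = lambda q, opts: opts[0]
print(ensemble_answer("Which treatment ...?", ["A", "B", "C", "D"], fake_model))
```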
Do results stand in another language? Figure 5. JMLE-2024: National medical licensing exam held in Japan (Feb 2024). We expanded our research to include a new multilingual benchmark based on the Japanese national medical licensing exam. The JMLE (Japanese Medical Licensing Examination) is written in Japanese and was administered in February 2024, after the o1-preview model’s knowledge cutoff. Even without translation to English, the o1-preview model achieved a remarkable score of 98.2% accuracy (Figure 5), well above the exam’s minimum passing score of approximately 80%.
Do reasoning tokens improve performance?For fun, we conducted tests to determine whether increasing the number of reasoning tokens could improve performance. Our findings showed that by adjusting the prompt, we were able to consistently increase the number of reasoning tokens used by o1-preview, and the increase was directly correlated with improved performance as demonstrated in Figure 6.
Figure 6. The effect of two prompting strategies that elicit variable length reasoning chains across benchmark datasets. What’s the takeaway?Bottom line: There’s a little something for everyone when it comes to run-time strategies. We’re excited by the performance gains from GPT models to o1-preview models. While these improvements are significant, so is the cost. For those needing proven accuracy on a budget, Medprompt leveraging calls to GPT-4 is a viable option for medicine and beyond. We summarize the relative performance of prompting strategies in Figure 7 to determine the best option, or check out the paper for a detailed breakdown of every dataset, experimental configuration, and prompt template (opens in new tab).
Figure 7. Heatmap showing absolute accuracy and relative performance over baseline zero-shot prompt (in parenthesis) across all benchmark datasets. Anything more to consider?We highlighted several considerations in the paper that are worth checking out. Here are three opportunities that are top of mind:
- Research on run-time strategies. The research community has largely relied on boosting model capabilities with data, compute, and model size, predictably achieving gains by way of scaling laws. A promising new direction is inference-time scaling—the value of investing in additional computation and machinery for guiding inference at run time. We highlight in the paper opportunities to guide run-time allocations to boost efficiency, accuracy, and intellectual capabilities, including meta reasoning and reflection in real time and learning during the “idle” time (opens in new tab) between problem solving. We see a great deal of opportunity for new research and development on real-time and “offline” reasoning, learning, and reflection.
- Benchmark saturation. With the rapid advancement of state-of-the-art models, many existing medical benchmarks are reaching “saturation,” where models perform extremely well on standing medical competency challenges that were considered extremely difficult just a few years ago. Current benchmarks, such as USMLE and JMLE, were designed to assess the performance of medical students and clinicians and are increasingly inadequate for evaluating cutting-edge AI models. To drive understanding of models and guide research, we need to design more challenging medical benchmarks.
- From benchmarks to clinical applications. We note that, while benchmarks offer valuable insights into performance and accuracy, they often fail to capture the complexities and nuances of real-world clinical decision making and healthcare delivery, more broadly. Conducting clinical trials to rigorously evaluate the impact of AI applications on patient care poses far greater difficulties than benchmarking models against challenge problems drawn from medical competency exams. Yet, studies of AI deployments in realistic clinical settings are essential for understanding model capabilities and for guiding the effective integration of AI into healthcare.
- From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond (opens in new tab)
- Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
- The Power of Prompting
- Steering at the Frontier: Extending the Power of Prompting
- Capabilities of GPT-4 on Medical Challenge Problems (opens in new tab)
The post Advances in run-time strategies for next-generation foundation models appeared first on Microsoft Research.
Accelerating drug discovery with TamGen: A generative AI approach to target-aware molecule generation
The Global Health Drug Discovery Institute (opens in new tab) (GHDDI) and Microsoft Research have reached a milestone in tuberculosis (TB) drug research with TamGen (opens in new tab), an open-source (opens in new tab), transformer-based chemical language model for developing target-specific drug compounds. Working in collaboration, the joint team successfully identified several promising inhibitors for a TB protease, with the most effective compound showing significant bioactivity. Research shows that TamGen can also optimize existing molecules by designing target-aware molecule fragments, potentially enabling the discovery of novel compounds that build on a known molecular core structure.
Generative AI helps overcome limitations in drug discoveryGenerative AI is opening new avenues for scientific exploration by allowing computers to autonomously learn and produce original content. TamGen offers a new approach to drug discovery by applying the principles of generative AI to molecular design. Unlike traditional methods, which depend on systematically screening known compounds—a process that is long, complex, and costly due to its reliance on empirical knowledge and the time-consuming task of exploring a vast chemical library—generative AI provides opportunities for designing entirely new chemical structures.
TamGen goes beyond analyzing existing data by generating chemically diverse compounds that conventional approaches might miss. Figure 1 shows that generative AI expands chemical exploration, allowing for a deeper and more comprehensive search for therapeutic solutions compared to traditional methods.
Figure 1. Compared with the traditional screening-based approach to drug discovery, a generative AI-based approach enables the discovery of novel compounds. TamGen workflowTamGen’s workflow uses generative AI to design target-specific chemical compounds. Building on the success of large language models (LLMs), we adapted a similar approach for molecular generation, using a training method like that of GPT models, which involves next-token prediction. Molecules were first converted into SMILES (simplified molecular-input line-entry system) strings—a notation representing molecular structures as symbol sequences, similar to text. We then developed a protein encoder to process information about proteins, including their 3D structure.
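For readers unfamiliar with SMILES, the short RDKit example below shows how a molecule round-trips through the notation; RDKit is used here only to illustrate the representation and is not part of TamGen itself.

```python
# A small RDKit example of the SMILES notation mentioned above: a molecule is
# parsed from (or serialized to) a symbol sequence that can be tokenized like text.
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(Chem.MolToSmiles(mol))                        # canonical SMILES string
print(mol.GetNumAtoms())                            # heavy-atom count
```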
A contextual encoder combines insights from medical professionals with data on the protein target and existing compounds that have proven to be effective or promising. Using expert knowledge and computational analysis, this encoder guides the compound generator to produce new molecules that are more likely to bind to a given protein. This workflow is illustrated in Figure 2.
Figure 2. TamGen’s workflow Evaluating TamGen computationally
To evaluate TamGen’s performance, we compared it to five other common methods used to create 3D shapes of molecules intended to bind to certain proteins. We evaluated these methods using the CrossDocked benchmark, a dataset used in AI research to assess the quality of molecule generation conditioned on a target protein.
Evaluation metrics (an RDKit sketch of a few of these follows the list):
- Docking score: Measures how well a molecule binds to a target protein.
- Quantitative estimate of drug-likeness (QED): Assesses how good a candidate a molecule is for a drug.
- Synthesis accessibility score (SAS): Measures how easy or difficult it is to synthesize a particular chemical compound in a lab.
- Ro5 (Lipinski’s rule of five): Determines how likely a compound can be developed into an oral drug.
- LogP: Tests a compound’s ability to move between water and fats.
- Diversity: Measures the range of different molecular structures and properties in a collection of compounds.
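A few of these metrics can be computed directly with RDKit, as sketched below for a single example molecule; docking scores and synthesis accessibility require separate tooling and are omitted.

```python
# A small RDKit sketch of computing some of the metrics listed above for a
# candidate molecule (QED, LogP, and a simple Lipinski rule-of-five check).
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example

qed_score = QED.qed(mol)             # drug-likeness, 0 (poor) to 1 (good)
logp = Descriptors.MolLogP(mol)      # octanol/water partition estimate
mw = Descriptors.MolWt(mol)
hbd = Descriptors.NumHDonors(mol)
hba = Descriptors.NumHAcceptors(mol)

ro5_ok = mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10  # Lipinski's rule of five
print(f"QED={qed_score:.2f}, LogP={logp:.2f}, passes Ro5: {ro5_ok}")
```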
The findings, illustrated in Figure 3, show TamGen’s overall performance. While other methods may produce compounds that bind more strongly, they often include multiple interconnected ring structures. Research indicates that more of these structures can lower synthesis accessibility (SAS) and increase cellular toxicity, making these compounds harder to develop. We believe that molecular pretraining of the model contributed to the overall effectiveness of the compounds TamGen generated.
Figure 3. Results from TamGen’s computational performance verification Experimental lab verificationTo ensure real-world applicability, we also validated our findings in a hands-on lab environment. Here, we focused on the ClpP protease in Mycobacterium tuberculosis as the target because it plays a significant role in the bacterium’s survival under stress conditions. We proposed the Design-Refine-Test pipeline to effectively identify molecular compounds for TB drug discovery.
Design stage: We began by using TamGen to analyze the binding pocket of the protease, where molecules can attach and influence its function. TamGen generated about 2,600 potential compounds that could fit into this pocket. We assessed these compounds based on how well they could attach to the protease and their predicted biological effects, narrowing it down to four promising candidates.
Refine stage: Next, we entered the four compounds into TamGen, along with three molecular fragments that had been validated in previous lab experiments. This generated a total of 8,600 new compounds, which we screened again using the same criteria, eventually narrowing the selection to 296 compounds.
Test stage: Because synthesizing all 296 compounds wasn’t feasible, we identified similar compounds available in commercial libraries and tested their initial activity against TB. Five compounds showed promising results. We then synthesized one of the originals and two variants of another. Additionally, we categorized the generated compounds into clusters, selected the top 10% from each cluster based on docking scores, and after manual review, synthesized eight more compounds.
The team from Microsoft Research generated the compounds with TamGen, and the GHDDI team conducted binding analysis, structure–activity relationship studies, and lab experiments to verify the compounds’ inhibitory effect on the ClpP protease, measuring their capacity to interfere with or reduce its activity. Lower IC50 values signify greater potency. Of the 16 compounds tested, 14 showed strong inhibitory activity, measuring under 40 µM, indicating high potential. The most effective compound had a measured IC50 value of 1.88 µM.
Figure 4. The hands-on lab verification process From molecule to fragment generationIn addition to generating new molecules, TamGen can optimize existing ones by designing smaller molecular fragments. In this fragment generation process, TamGen builds on a given protein target and a molecular core structure to design new compounds around that core. By incorporating information about the target protein, it generates fragments that are highly specific to the target. This approach moves beyond traditional methods that rely on pre-existing databases, which often limit both novelty and effectiveness of molecular fragments.
For fragment generation, we adjusted the input to TamGen’s compound generator. We modified the SMILES string to ensure it ended at the desired growth site. This was done by specifying the fragment we wanted to retain and its connection point for further growth. The tailored SMILES string was then fed into the compound generator to extend the molecule.
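A common post-processing step in this kind of fragment-growing pipeline is to keep only generated strings that parse to valid molecules and still contain the intended core. The sketch below illustrates that filter with a hypothetical scaffold and candidates; it is not TamGen’s actual code.

```python
# Illustrative post-processing step for fragment generation: keep only generated
# completions that parse to valid molecules and still contain the core scaffold.
# The scaffold and candidate strings are hypothetical examples.
from rdkit import Chem

scaffold = Chem.MolFromSmiles("c1ccccc1")  # hypothetical core to grow from (benzene ring)
generated = ["c1ccccc1CC(=O)O", "c1ccccc1N", "not-a-valid-smiles"]

def keep_valid(candidates, core):
    kept = []
    for smi in candidates:
        mol = Chem.MolFromSmiles(smi)                        # returns None for invalid SMILES
        if mol is not None and mol.HasSubstructMatch(core):  # valid and contains the core
            kept.append(Chem.MolToSmiles(mol))
    return kept

print(keep_valid(generated, scaffold))
```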
We evaluated this method by targeting the ClpP protease for TB, achieving a more than tenfold improvement in the binding affinity of the generated compound compared to the original. Some compounds also demonstrated slow binding, indicating potential for prolonged action and improved selectivity for the target protein.
AI’s potential in drug discoveryTamGen showcases the transformative potential of generative AI in drug design, combining advanced molecular modeling with researcher-AI collaboration. Tasks that once took years can now be accomplished in a fraction of the time. This research underscores AI’s expanding role in drug discovery and its promise for developing effective treatments against persistent infectious diseases like TB.
Looking ahead, we plan to integrate advanced techniques into TamGen, including diffusion models for generating 3D structures, reinforcement learning to apply physical constraints, and molecular dynamics simulations to capture proteins’ shifting shapes. These enhancements aim to improve how well generated compounds bind to target proteins, increase their ability to be synthesized, and strengthen other critical drug properties.
The post Accelerating drug discovery with TamGen: A generative AI approach to target-aware molecule generation appeared first on Microsoft Research.
LazyGraphRAG: Setting a new standard for quality and cost
The GraphRAG project (opens in new tab) aims to expand the class of questions that AI systems can answer over private datasets by leveraging the implicit relationships within unstructured text.
A key advantage of GraphRAG over conventional vector RAG (or “semantic search”) is its ability to answer global queries that address the entire dataset, such as “what are the main themes in the data?”, or “what are the most important implications for X?”. Conversely, vector RAG excels for local queries where the answer resembles the query and can be found within specific text regions, as is typically the case for “who”, “what”, “when”, and “where” questions.
In recent blog posts, we have shared two new query mechanisms that exploit the rich, summary-based data index created by GraphRAG to improve local search performance and global search costs, respectively.
In this blog post, we introduce a radically different approach to graph-enabled RAG that requires no prior summarization of the source data, avoiding the up-front indexing costs that may be prohibitive for some users and use cases. We call this approach “LazyGraphRAG”.
A key advantage of LazyGraphRAG is its inherent scalability in terms of both cost and quality. Across a range of competing methods (standard vector RAG, RAPTOR (opens in new tab), and GraphRAG local (opens in new tab), global (opens in new tab), and DRIFT (opens in new tab) search mechanisms), LazyGraphRAG shows strong performance across the cost-quality spectrum as follows:
- LazyGraphRAG data indexing costs are identical to vector RAG and 0.1% of the costs of full GraphRAG.
- For comparable query costs to vector RAG, LazyGraphRAG outperforms all competing methods on local queries, including long-context vector RAG and GraphRAG DRIFT search (our recently introduced RAG approach shown to outperform vector RAG) as well as GraphRAG local search.
- The same LazyGraphRAG configuration also shows comparable answer quality to GraphRAG Global Search for global queries, but more than 700 times lower query cost.
- For 4% of the query cost of GraphRAG global search, LazyGraphRAG significantly outperforms all competing methods on both local and global query types, including GraphRAG global search at the C2 level (the third level of the community hierarchy recommended for most applications).
LazyGraphRAG is coming soon to our open-source GraphRAG library (opens in new tab), providing a unified query interface for both local and global queries over a lightweight data index that is comparable in cost to standard vector RAG.
Blending vector RAG and Graph RAG with deferred LLM useLazyGraphRAG aims to blend the advantages of vector RAG and Graph RAG while overcoming their respective limitations:
- Vector RAG is a form of best-first search that uses the similarity with the query to select the best-matching source text chunks. However, it has no sense of the breadth of the dataset to consider for global queries.
- GraphRAG global search is a form of breadth-first search that uses the community structure of source text entities to ensure that queries are answered considering the full breadth of the dataset. However, it has no sense of the best communities to consider for local queries.
LazyGraphRAG combines best-first and breadth-first search dynamics in an iterative deepening manner (Table 1). Compared to the global search mechanism of full GraphRAG, this approach is “lazy” in ways that defer LLM use and dramatically increase the efficiency of answer generation. Overall performance can be scaled via a single main parameter – the relevance test budget – that controls the cost-quality trade-off in a consistent manner.
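A highly simplified sketch of that budgeted, iterative-deepening loop is shown below. It is conceptual pseudocode under our own assumptions, not LazyGraphRAG’s implementation; `rank_chunks` and `is_relevant` stand in for the embedding-based ranker and the LLM-based relevance assessor.

```python
# Conceptual sketch only: communities are visited in best-first order, a
# budgeted relevance test is applied to their top-ranked chunks, and the search
# recurses into sub-communities only where relevant chunks are found.
def lazy_search(communities, query, budget, rank_chunks, is_relevant, z=2):
    relevant, tests_used, misses = [], 0, 0
    for community in communities:                    # best-first community order
        if tests_used >= budget or misses >= z:
            break
        hits = []
        for chunk in rank_chunks(community, query):  # chunks ranked by similarity to the query
            if tests_used >= budget:
                break
            tests_used += 1
            if is_relevant(chunk, query):            # stand-in for the LLM relevance test
                hits.append(chunk)
        if hits:
            misses = 0
            relevant.extend(hits)
            # iterative deepening: spend remaining budget inside this community
            sub_hits, sub_used = lazy_search(
                community.get("children", []), query,
                budget - tests_used, rank_chunks, is_relevant, z)
            relevant.extend(sub_hits)
            tests_used += sub_used
        else:
            misses += 1                              # z misses in a row stops the descent
    return relevant, tests_used

# Toy usage with stub rankers and testers:
toy = [{"chunks": ["a", "b"], "children": []}]
res, used = lazy_search(
    toy, "query", budget=10,
    rank_chunks=lambda c, q: c["chunks"],
    is_relevant=lambda chunk, q: "a" in chunk)
print(res, used)
```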
Table 1. How GraphRAG and LazyGraphRAG differ at each stage:

- Build index – GraphRAG: a) uses an LLM to extract and describe entities and their relationships, b) uses an LLM to summarize all observations of each entity and relationship, c) uses graph statistics to optimize the entity graph and extract hierarchical community structure. LazyGraphRAG: a) uses NLP noun phrase extraction to extract concepts and their co-occurrences, b) uses graph statistics to optimize the concept graph and extract hierarchical community structure.
- Summarize index – GraphRAG: uses an LLM to summarize entities and relationships in each community. LazyGraphRAG: none – the “lazy” approach defers all LLM use until query time.
- Refine query – GraphRAG: none – the original query is used throughout. LazyGraphRAG: uses an LLM to a) identify relevant subqueries and recombine them into a single expanded query, b) refine subqueries with matching concepts from the concept graph.
- Match query – GraphRAG: none – all queries are answered using all community summaries (breadth first). LazyGraphRAG: for each of q subqueries [3-5], uses text chunk embeddings and chunk-community relationships to first rank text chunks by similarity to the query, then rank communities by the rank of their top-k text chunks (best first); uses an LLM-based sentence-level relevance assessor to rate the relevance of the top-k untested text chunks from communities in rank order (breadth first); recurses into relevant sub-communities after z successive communities yield zero relevant text chunks (iterative deepening); terminates when no relevant communities remain or the relevance test budget / q is reached.
- Map answers – GraphRAG: uses an LLM to answer the original query over random batches of community summaries in parallel. LazyGraphRAG: for each of q subqueries [3-5], builds a subgraph of concepts from the relevant text chunks; uses the community assignments of concepts to group related chunks together; uses an LLM to extract subquery-relevant claims from groups of related chunks as a way of focusing on relevant content only; ranks and filters extracted claims to fit a pre-defined context window size.
- Reduce answers – GraphRAG: uses an LLM to answer the original query using the mapped answers. LazyGraphRAG: uses an LLM to answer the expanded query using the extracted map claims.

LazyGraphRAG answer quality is state-of-the-art
We compared LazyGraphRAG at varying levels of relevance test budget to a range of competing methods, as follows:
- Dataset: 5,590 AP news articles (used with license)
- Queries: 100 synthetic queries (50 local and 50 global), generated using a new method to be described in a future blog post
- Metrics: Comprehensiveness, Diversity, Empowerment (as described here, with an LLM used to compare pairs of answers head-to-head on each metric)
- Conditions: Includes LazyGraphRAG with three relevance test budget settings, in addition to eight competing conditions from GraphRAG and literature (Table 2).
Figure 1 shows LazyGraphRAG win rates against each of the eight competing conditions. At the lowest budget level of 100 relevance tests using a low-cost LLM model (same cost as SS_8K), LazyGraphRAG significantly outperforms all conditions on local and global queries, except for GraphRAG global search conditions for global queries. At an increased budget of 500 relevance tests using a more advanced LLM model (4% of the query cost of C2), LazyGraphRAG significantly outperforms all conditions on both local and global queries. LazyGraphRAG’s win rates continue to increase as the relevance test budget increases to 1,500, demonstrating its scalability in terms of cost vs. quality.
Figure 1. Clustered bar charts showing win rates of LazyGraphRAG with relevance test budgets of (100, 500, 1,500) over competing conditions for each combination of (local, global) queries and (comprehensiveness, diversity, empowerment) metricsAn example comparison is shown in Table 3.
Query: Describe what factors individuals in the United States should consider when selecting a health insurance plan during the open enrollment period for 2024 health insurance plans.
Vector RAG (64K context):

# Factors to Consider When Selecting a Health Insurance Plan During Open Enrollment for 2024
Selecting a health insurance plan during the open enrollment period can be a complex process, but understanding the key factors to consider can help individuals make informed decisions. Here are some important aspects to keep in mind:
## Coverage and Networks
One of the primary considerations is the coverage network of the plan. Insurers form coverage networks of hospitals and doctors, and it is crucial to ensure that your preferred healthcare providers are included in the network of the plan you are considering. Many marketplace plans, such as Health Maintenance Organizations (HMOs) and Exclusive Provider Organizations (EPOs), do not cover claims for most non-emergency care outside their networks [Data: Sources (47419, 47421, 47420, 47410, 47408)].
## Costs: Premiums, Deductibles, and Out-of-Pocket Maximums
Understanding the costs associated with a health insurance plan is essential. This includes the monthly premiums, deductibles, copays, and coinsurance. Plans are typically categorized into different levels such as platinum, gold, silver, and bronze, with bronze plans generally having lower premiums but higher out-of-pocket costs [Data: Sources (47411, 47412, 47416, 47414, 47418)]. Additionally, it is important to consider the out-of-pocket maximum, which is the most you will have to pay in a year for covered services before the insurance starts covering all costs [Data: Sources (47416, 47415)].
## Tax Credits and Cost-Sharing Reductions
Individuals with lower incomes may qualify for income-based tax credits to help reduce the cost of premiums. It is important to accurately estimate your income for the coming year when applying for these credits to avoid having to repay any excess credits received [Data: Sources (47412, 47413)]. Additionally, those with incomes below 250% of the federal poverty level may qualify for cost-sharing reductions if they select a silver plan, which can lower deductibles and out-of-pocket maximums [Data: Sources (47417, 47418)].
## Special Considerations for Medicaid and Medicare
For those transitioning from Medicaid, it is important to understand that marketplace plans may come with higher premiums and copays compared to Medicaid. Additionally, many people who have been removed from Medicaid may be eligible to enroll in marketplace plans through an extended enrollment period [Data: Sources (6284, 6285, 6286, 6287)]. For seniors, the Medicare open enrollment period allows for switching between traditional Medicare and Medicare Advantage plans, with considerations for prescription drug coverage and additional benefits like dental and vision [Data: Sources (72060, 72061, 72062, 72063, 72064)].
## Assistance and Resources
Navigating the complexities of health insurance can be challenging, and seeking assistance from insurance agents or government-funded navigators can be beneficial. These professionals can help individuals understand their options and make informed decisions without recommending specific plans [Data: Sources (47424, 47425, 47426)].
## Conclusion
Selecting the right health insurance plan requires careful consideration of various factors including coverage networks, costs, eligibility for tax credits and cost-sharing reductions, and available assistance. By thoroughly evaluating these aspects, individuals can choose a plan that best meets their healthcare needs and financial situation for the upcoming year.

LazyGraphRAG:

### Types of Health Insurance Plans
When selecting a health insurance plan during the open enrollment period for 2024, individuals in the United States have several options to consider:
1. **Marketplace Plans**: These are available under the Affordable Care Act (ACA) and are categorized into platinum, gold, silver, and bronze levels. Bronze plans generally have lower premiums but higher out-of-pocket costs, while platinum plans have higher premiums but lower out-of-pocket costs [Data: Sources (47411, 47420, 47421)].
2. **Low-Cost Marketplace Plans**: These often come with high deductibles, requiring patients to pay thousands of dollars before most coverage kicks in. They also have annual out-of-pocket maximums that can exceed $9,000 for individuals and $18,000 for families [Data: Sources (47415, 47416, 47414)].
3. **Exclusive Provider Organizations (EPOs) and Health Maintenance Organizations (HMOs)**: EPOs and HMOs generally restrict patients to a network of doctors and require a primary care doctor to direct care. They tend to be cheaper but lack out-of-network flexibility [Data: Sources (47420, 47421, 43218, 43217)].
4. **Preferred Provider Organizations (PPOs)**: These plans allow for out-of-network services but at a higher cost. They offer more flexibility compared to HMOs and EPOs [Data: Sources (43217)].
5. **High-Deductible Health Plans (HDHPs)**: Defined as plans with a deductible of at least $1,600 for individual coverage or $3,200 for family coverage, with out-of-pocket maximums of no more than $8,050 or $16,100, respectively. HDHPs usually have lower premiums, and sometimes companies contribute to a health savings account (HSA) to help cover the deductible [Data: Sources (43227, 43226)].
6. **Medicare Advantage**: These are privately run versions of the federal government’s Medicare program, mostly for people aged 65 and over. They often include prescription drug coverage and may offer additional benefits like dental or vision coverage not provided by traditional Medicare [Data: Sources (72063, 72061, 72060, 72062)].
7. **Short-Term Health Insurance Plans**: These plans are limited to three months and can only be renewed for a maximum of four months under new rules. They are intended for temporary coverage but often lack comprehensive benefits [Data: Sources (97999, 97995, 97996, 97997)].
### Cost Factors: Premiums, Deductibles, Co-pays, and Out-of-Pocket Maximums
The overall cost of health insurance plans in 2024 is influenced by several factors:
– **Premiums**: This is the set monthly cost you pay for your health insurance plan. Premiums have been rising, with a notable increase of 7% for both family and single plans in 2023, partly due to inflation [Data: Sources (83383, 83382, 83384, 83385, 83381, +more)].
– **Deductibles**: The amount you pay out-of-pocket for health care services before your insurance starts to pay. For HDHPs, the deductible is at least $1,600 for individual coverage or $3,200 for family coverage [Data: Sources (43226, 43225)].
– **Co-pays and Co-insurance**: These are the costs you pay each time you receive a medical service. Co-pays are fixed amounts, while co-insurance is a percentage of the service cost.
– **Out-of-Pocket Maximums**: This is the maximum amount you will pay for covered services in a year. For example, HDHPs have out-of-pocket maximums of no more than $8,050 for individual coverage or $16,100 for family coverage [Data: Sources (43227, 43226)].
### Provider Networks: In-Network vs. Out-of-Network
The network of healthcare providers is a crucial factor in selecting a health insurance plan:
– **In-Network Providers**: These are doctors and hospitals that have agreements with your insurance plan to provide services at lower rates. Ensuring your preferred doctors and specialists are in-network can save you significant costs [Data: Sources (43216, 47419)].
– **Out-of-Network Providers**: Services from these providers are usually more expensive and may not be covered at all, except in emergencies. PPO plans offer some out-of-network coverage but at a higher cost, while HMOs and EPOs generally do not cover non-emergency out-of-network care [Data: Sources (43217, 47421)].
### Specific Medical Needs and Services
When selecting a health insurance plan, individuals should consider their specific medical needs:
– **Prescription Drugs**: Ensure that your medications are covered by the plan’s formulary, as drug coverage can change annually [Data: Sources (43220, 43218, 43219)].
– **Mental Health Services**: Coverage for mental health treatments is essential, especially with new rules pushing insurers to increase their coverage of these services [Data: Sources (97031, 97028, 97027, 97030, 97033, +more)].
– **Chronic Conditions**: Plans should cover ongoing treatments and medications for chronic conditions. Medicare Supplement Insurance (Medigap) can help cover gaps in Medicare for chronic disease management [Data: Sources (93367, 93368)].
– **Preventive Care**: Coverage for preventive services like cancer screenings and HIV prevention is mandated under the ACA, though its future is uncertain due to ongoing legal battles [Data: Sources (71106, 71109, 71098, 71099, 71100, +more)].
### Key Dates and Steps for Open Enrollment
The open enrollment period for 2024 health insurance plans involves several key dates and steps:
– **Marketplace Plans**: Open enrollment starts on November 1, 2023, and runs through mid-December in most states, ending on January 16, 2024 [Data: Sources (47419, 47411, 47416, 47421, 47409, +more)].
– **Medicare**: Open enrollment for Medicare runs from October 15, 2023, to December 7, 2023. During this period, individuals can choose between traditional Medicare, Medicare Advantage plans, and prescription drug plans [Data: Sources (72061, 72063, 72060, 72062)].
– **Special Enrollment Periods**: Individuals who lose coverage due to life events like job loss or moving may qualify for special enrollment periods. For example, those removed from Medicaid may enroll in marketplace plans through July 2024 [Data: Sources (6288, 6289)].
By considering these factors, individuals can make informed decisions about their health insurance coverage for 2024, ensuring they select plans that best meet their medical needs and financial situations.
Looking forward
LazyGraphRAG shows that it is possible for a single, flexible query mechanism to substantially outperform a diverse range of specialized query mechanisms across the local-global query spectrum, and to do so without the up-front costs of LLM data summarization. Its very fast and almost-free indexing makes LazyGraphRAG ideal for one-off queries, exploratory analysis, and streaming data use cases, while its ability to smoothly increase answer quality with increasing relevance test budget makes it a valuable tool for benchmarking RAG approaches in general (e.g., “RAG approach X beats LazyGraphRAG with budget Y for task Z”).
Does this mean that all graph-enabled RAG should be lazy? We believe the answer is no, for three reasons:
- A GraphRAG data index of entity, relationship, and community summaries has use value beyond question answering (e.g., reading and sharing as reports).
- A GraphRAG data index of entity, relationship, and community summaries, combined with a LazyGraphRAG-like search mechanism, is likely to achieve better results than LazyGraphRAG alone.
- A new kind of GraphRAG data index designed to support a LazyGraphRAG-like search mechanism (e.g., through pre-emptive claim and topic extraction) is likely to achieve the best possible results.
We will be exploring these directions in the coming period, with all advances (including LazyGraphRAG itself) released via the GraphRAG GitHub repository (opens in new tab). Stay tuned!
Introducing Yasuyuki Matsushita: Tackling societal challenges with AI at Microsoft Research Asia – Tokyo
Earlier this year, Microsoft Research announced (opens in new tab) its newest lab in Tokyo, Japan. Today, we are celebrating its grand opening, reinforcing Microsoft Research’s commitment to AI research across the Asia-Pacific region. This new lab will focus on embodied AI, well-being and neuroscience, societal AI, and industry innovation—all areas that align with Japan’s socioeconomic priorities. This initiative will enhance collaboration with local academic and industrial partners, contributing to global innovation and talent development.
We recently spoke with Yasuyuki Matsushita, head of the newly established Tokyo lab. Matsushita, who worked at Microsoft Research Asia from 2003 to 2015, served as a professor at Osaka University for the past decade, before returning in October. He reflects on his journey, the evolution of technology, and the opportunities ahead for Microsoft Research Asia – Tokyo.
Yasuyuki Matsushita, Microsoft Research Asia – Tokyo
Why return to Microsoft Research Asia?
Question: We are excited to have you leading the new lab in Tokyo. You worked at Microsoft Research Asia in Beijing from 2003 to 2015 before transitioning to academia. What motivated you to return after nearly a decade?
Yasuyuki Matsushita: Microsoft Research Asia has always been an exceptional place for conducting cutting-edge research, especially in the AI era. Earlier this year, I learned about Microsoft Research Asia’s expansion, including the establishment of a new lab in Tokyo. This presented an exciting opportunity to make a meaningful impact both locally and globally, sparking my motivation to return. Additionally, Microsoft is at the forefront of AI advancements, making this an ideal moment to re-engage. I’m confident that my work can contribute meaningfully to this dynamic field. The pace of AI development today is unmatched, making this an exhilarating time to be involved.
What has changed over the decade?
Question: Now that you’ve been back for a few weeks, from your perspective, what has changed at Microsoft Research Asia, and what has remained the same since you were last here?
Yasuyuki Matsushita: The most immediate change I’ve noticed is the array of employee tools and resources, which have evolved significantly over the past decade. I’m still familiarizing myself with these new systems, designed to optimize efficiency and collaboration. Over the past ten years, Microsoft has played a key role in driving digital transformation for other companies, and it has also transformed internally.
Beyond these changes, much of what made Microsoft Research Asia unique remains the same. The culture and people continue to foster an environment of innovation and collaboration. The organization still attracts exceptional talent, and the spirit of research is as vibrant as ever. One of its greatest strengths is its open, collaborative approach. It has maintained long-standing partnerships with universities and research institutions, which encourage cross-regional, cross-cultural, and interdisciplinary exchanges. This synergy stimulates innovation and supports industry development. The commitment to excellence remains at the heart of Microsoft Research Asia’s identity.
Plans for the Microsoft Research Asia – Tokyo lab
Question: With Microsoft Research Asia expanding regionally to places like Tokyo, Vancouver, Singapore, and Hong Kong, what are your plans as the head of the Tokyo lab, and how do you see it contributing to the region’s innovation ecosystem?
Yasuyuki Matsushita: My primary goal is to align the Tokyo lab’s growth with Microsoft Research Asia’s mission to advance science and technology for the benefit of humanity. The research efforts we’re focusing on in this lab aim to address pressing societal issues while advancing AI technologies to benefit society as a whole.
For instance, Japan’s aging population presents unique challenges that require efficient societal solutions—an issue faced by many nations today. Through our research, we aim to generate insights that can be applied globally to proactively address and mitigate such challenges.
Japan also has a strong legacy of scientific research in fields like electronics, materials science, and robotics. Its advanced industrial base, featuring renowned companies across the automotive, electronics, and machinery sectors, provides rich application scenarios for our research outcomes. Additionally, Japan’s robust education system supplies an intellectual foundation crucial for our in-depth research.
We’re dedicated to maintaining open research practices. By publishing our findings and open-sourcing our tools, we ensure our work benefits the broader industry and enriches the global knowledge pool. Our goal is to share insights that drive societal progress and innovation worldwide.
Cultivating the next generation
Question: Talent is at the heart of Microsoft Research’s mission and culture. What kind of talent is Microsoft Research Asia – Tokyo looking for? In what ways can the Tokyo lab enhance its efforts to cultivate the next generation of tech innovators for the region?
Yasuyuki Matsushita: One of the key advantages of being part of Microsoft is the close connection we have to real-world applications. This bridge between research and practice allows our work to have a direct societal impact, ensuring that innovative technology results in meaningful and beneficial outcomes.
When recruiting new talent, we seek bright, self-driven individuals with an innate curiosity and a passion for solving societal challenges. The most vital trait we look for is a deep desire to understand the “why” behind complex problems. While technical expertise is essential, a commitment to addressing social issues fuels creativity and drives meaningful progress. This blend of curiosity and purpose sparks innovation and propels us forward at Microsoft Research Asia.
At the Tokyo lab, a core part of our vision is cultivating the next wave of tech innovators. We plan to build on the legacy of successful talent programs that Microsoft Research Asia has championed throughout the region, like joint research initiatives, visiting scholar programs, and internship opportunities. These provide early career professionals and students with invaluable hands-on experiences, equipping them with essential research skills and deepening their understanding of complex technological challenges.
We’re committed to creating a nurturing environment where talent can thrive, collaborate, and contribute to the global tech landscape. By combining innovation with real-world impact, we aim to inspire the next generation to push boundaries and advance society.
Rapid evolution in computer vision
Question: In today’s world, everything is moving toward digitization and intelligence. Ten years ago, your research focused on photometry and video analysis. Can you share some key outcomes from that period and explain how you see emerging technologies like AI influencing the field of computer vision?
Yasuyuki Matsushita: Back then, my research centered on computer vision, specifically on photometry for 3D reconstruction and video analysis aimed at enhancing video quality. One of the standout projects during that period was the development of a gigapixel camera capable of capturing high-resolution 3D information. This camera played a crucial role in the Dunhuang Mogao Grottoes project, which sought to digitally preserve the cultural heritage of Dunhuang’s murals and Buddha shrines with unprecedented accuracy.
Another notable project was the development of video stabilization technology, which was integrated into Windows 7 as part of Media Foundation. This technology improved video quality by compensating for unwanted camera movements, delivering smoother and more professional-looking output. The creation of real-time algorithms capable of processing and stabilizing video was groundbreaking at that time.
Since then, the introduction of deep learning, large datasets, and sophisticated neural network architectures has propelled computer vision to new heights. Tasks that were once considered difficult, such as object detection, recognition, and segmentation, are now standard with modern AI techniques. Current research continues to push the boundaries by exploring innovative network architectures, new learning strategies, and enhanced datasets. A particularly exciting trend is the use of AI in real-world interactive scenarios, leading to the emergence of embodied AI, which is a major focus of my current work.
Understanding embodied AI beyond robotics
Question: Your current research interests include embodied AI, which is also one of the key areas at Microsoft Research Asia – Tokyo. What exactly is embodied AI, and how does it differ from robotics?
Yasuyuki Matsushita: Embodied AI goes beyond traditional robotics. While robots are typically machines equipped with actuators designed to execute specific tasks, embodied AI focuses on developing intelligent systems that can perform complex tasks while understanding and interacting within physical and virtual environments. Robotics and AI have developed independently, but embodied AI is the convergence of these two fields, integrating AI with physical agents that can perceive, act, and learn in dynamic real-world environments.
This field is inherently interdisciplinary, involving aspects such as robotic control, reinforcement learning, spatial awareness, human-robot interaction, reasoning, and more. For instance, embodied AI includes the ability to infer cause and effect, such as understanding that an unsupported laptop will fall due to gravity. These types of interactions and interpretations stem from engaging with and understanding the physical world, making embodied AI an exciting and multifaceted area of research.
Given the complexity of embodied AI, no single organization can cover all aspects of its development alone. We look forward to collaborating with local industry and academic institutions in Japan, leveraging their expertise alongside our strengths in AI to advance the field.
Advice for aspiring researchers in computer vision and AI
Question: You’ve had an extensive career spanning academia and industry. From your experience as both an educator and a researcher, what advice would you give to young people interested in pursuing research in computer vision and AI?
Yasuyuki Matsushita: For students interested in computer vision and AI, a strong foundation in mathematics and computer science is essential, even as specific research topics and technologies evolve. A deep understanding of fundamental mathematical concepts, such as gradients, Jacobians, and vector spaces, is indispensable. Mastery of these principles will be beneficial regardless of changes in programming languages or development platforms.
Maintaining a mindset of continuous learning is equally important, as the field is constantly evolving. For example, deep learning was not as prominent a decade ago but is now central to the field. At Microsoft, we emphasize the importance of a growth mindset—being adaptable, open to new technologies, and willing to pivot with industry advancements. Early career professionals should cultivate the ability to quickly acquire new skills while building on their foundational knowledge. This adaptability is key to long-term success in research and development.
BiomedParse: A foundation model for smarter, all-in-one biomedical image analysis
In cancer diagnosis or advanced treatments like immunotherapy, every detail in a medical image counts. Radiologists and pathologists rely on these images to track tumors, understand their boundaries, and analyze how they interact with surrounding cells. This work demands pinpoint accuracy across several tasks—identifying whether a tumor is present, locating it precisely, and mapping its contours on complex CT scans or pathology slides.
Yet, these crucial steps—object recognition, detection, and segmentation—are often tackled separately, which can limit the depth of analysis. Current tools like MedSAM (opens in new tab) and SAM (opens in new tab) focus on segmentation only, missing the opportunity to blend these insights holistically and relegating object recognition to an afterthought.
In this blog, we introduce BiomedParse (opens in new tab), a new approach to holistic image analysis that treats objects as first-class citizens. By unifying object recognition, detection, and segmentation into a single framework, BiomedParse allows users to specify what they’re looking for through a simple, natural-language prompt. The result is a more cohesive, intelligent way of analyzing medical images that supports faster, more integrated clinical insights.
While biomedical segmentation datasets abound, there are relatively few prior works on object detection and recognition in biomedicine, let alone datasets covering all three tasks. To pretrain BiomedParse, we created the first such dataset by harnessing OpenAI’s GPT-4 for data synthesis from standard segmentation datasets (opens in new tab).
BiomedParse is a single foundation model that can accurately segment biomedical objects across nine modalities, as seen in Figure 1, outperforming prior best methods while requiring orders of magnitude fewer user operations, as it doesn’t require an object-specific bounding box. By learning semantic representation for individual object types, BiomedParse’s superiority is particularly pronounced in the most challenging cases with irregularly shaped objects. Through joint pretraining of object recognition, detection, and segmentation, BiomedParse opens new possibilities for holistic image analysis and image-based discovery in biomedicine.
Figure 1. Overview of BiomedParse and BiomedParseData.
Image parsing: a unifying framework for holistic image analysis
Back in 2005, researchers first introduced the concept of “image parsing”—a unified approach to image analysis that jointly conducts object recognition, detection, and segmentation. Built on Bayesian networks, this early model offered a glimpse into a future of joint learning and reasoning in image analysis, though it was limited in scope and application. Fast forward to today, cutting-edge advances in generative AI have breathed new life into this vision. With our model, BiomedParse, we have created a foundation for biomedical image parsing that leverages interdependencies across the three subtasks, thus addressing key limitations in traditional methods. BiomedParse enables users to simply input a natural-language description of an object, which the model uses to predict both the object label and its segmentation mask, thus eliminating the need for a bounding box (Figure 1c). In other words, this joint learning approach lets users segment objects based on text alone.
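To make the text-prompted workflow concrete, here is a minimal sketch in Python. The `BiomedParseModel` wrapper, its `predict` method, and the confidence threshold value are hypothetical stand-ins for illustration; the actual API is documented in the open-source BiomedParse repository.

```python
# Minimal, illustrative sketch of text-prompted parsing, assuming a hypothetical
# `BiomedParseModel` wrapper; the real open-source API may differ.
import numpy as np

class BiomedParseModel:
    """Hypothetical wrapper mapping (image, text prompt) -> (label, mask, confidence)."""
    def predict(self, image: np.ndarray, prompt: str):
        # Placeholder logic; the real model jointly predicts an object label and a
        # segmentation mask from the natural-language prompt.
        mask = np.zeros(image.shape, dtype=bool)
        return "liver tumor", mask, 0.12

model = BiomedParseModel()
ct_slice = np.zeros((512, 512), dtype=np.float32)  # stand-in for a CT slice

label, mask, confidence = model.predict(ct_slice, prompt="liver tumor in abdominal CT")

# A recognition threshold lets the model reject prompts for objects absent from the image.
THRESHOLD = 0.5  # illustrative value only
if confidence < THRESHOLD:
    print("Object not detected; segmentation request rejected.")
else:
    print(f"label={label}, segmented pixels={int(mask.sum())}")
```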
Harnessing GPT-4 for large-scale data synthesis from existing datasets
We created the first dataset for biomedical imaging parsing by harnessing GPT-4 for large-scale data synthesis from 45 existing biomedical segmentation datasets (Figure 1a and 1b). The key insight is to leverage readily available natural-language descriptions already in these datasets and use GPT-4 to organize this often messy, unstructured text with established biomedical object taxonomies.
Specifically, we use GPT-4 to help create a unifying biomedical object taxonomy for image analysis and harmonize natural language descriptions from existing datasets with this taxonomy. We further leverage GPT-4 to synthesize additional variations of object descriptions to facilitate more robust text prompting.
This enables us to construct BiomedParseData, a biomedical image analysis dataset comprising over 6 million sets of images, segmentation masks, and text descriptions drawn from more than 1 million images. This dataset includes 64 major biomedical object types, 82 fine-grained subtypes, and spans nine imaging modalities.
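The following is an illustrative sketch of the harmonization and variation steps described above, not the authors’ exact pipeline. The toy taxonomy, the prompts, and the example labels are assumptions made for demonstration; only the OpenAI chat-completions call is a standard API.

```python
# Illustrative sketch: GPT-4 maps a messy dataset label onto a small object taxonomy
# and synthesizes prompt variations. Taxonomy, prompts, and labels are made up.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TAXONOMY = ["liver", "liver tumor", "kidney", "kidney cyst"]  # toy taxonomy

def harmonize(raw_description: str) -> str:
    """Ask GPT-4 to pick the closest taxonomy entry for a raw dataset description."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Map the description '{raw_description}' to exactly one of these "
                f"object types: {', '.join(TAXONOMY)}. Reply with the type only."
            ),
        }],
    )
    return resp.choices[0].message.content.strip()

def paraphrase(canonical: str, n: int = 3) -> list[str]:
    """Generate n alternative phrasings of an object description for robust text prompting."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Give {n} short alternative phrasings of '{canonical}', one per line.",
        }],
    )
    return [line.strip("- ").strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

print(harmonize("hepatic neoplasm on contrast-enhanced CT"))
print(paraphrase("liver tumor in abdominal CT"))
```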
Figure 2: Comparison on large-scale biomedical image segmentation datasets.
State-of-the-art performance across 64 major object types in 9 modalities
We evaluated BiomedParse on a large held-out test set with 102,855 image-mask-label sets across 64 major object types in nine modalities. BiomedParse outperformed prior best methods such as MedSAM and SAM, even when oracle per-object bounding boxes were provided. In the more realistic setting when MedSAM and SAM used a state-of-the-art object detector (Grounding DINO) to propose bounding boxes, BiomedParse outperformed them by a wide margin, between 75 and 85 absolute points in Dice score (Figure 2a). BiomedParse also outperforms a variety of other prominent methods such as SegVol, Swin UNETR, nnU-Net, DeepLab V3+, and UniverSeg.
Figure 3. Evaluation on detecting irregular-shaped objects.
Recognizing and segmenting irregular and complex objects
Biomedical objects often have complex and irregular shapes, which present significant challenges for segmentation, even with oracle bounding boxes. By jointly learning object recognition and detection, BiomedParse learns to model object-specific shapes, and its superiority is particularly pronounced for the most challenging cases (Figure 3). Encompassing a large collection of diverse object types in nine modalities, BiomedParseData also provides a much more realistic representation of object complexity in biomedicine.
Figure 4. Evaluation on object recognition.
Promising step toward scaling holistic biomedical image analysis
By operating through a simple text prompt, BiomedParse requires substantially less user effort than prior best methods that typically require object-specific bounding boxes, especially when an image contains a large number of objects (Figure 2c). By modeling an object recognition threshold, BiomedParse can detect invalid prompts and reject segmentation requests when an object is absent from the image. BiomedParse can be used to recognize and segment all known objects in an image in one fell swoop (Figure 4). By scaling holistic image analysis, BiomedParse can potentially be applied to key precision health applications such as early detection, prognosis, treatment decision support, and progression monitoring.
Going forward, there are numerous growth opportunities. BiomedParse can be extended to handle more modalities and object types. It can be integrated into advanced multimodal frameworks such as LLaVA-Med (opens in new tab) to facilitate conversational image analysis by “talking to the data.” To facilitate research in biomedical image analysis, we have made BiomedParse open-source (opens in new tab) with Apache 2.0 license. We’ve also made it available on Azure AI (opens in new tab) for direct deployment and real-time inference. For more information, check out our demo. (opens in new tab)
BiomedParse is a joint work with Providence and the University of Washington’s Paul G. Allen School of Computer Science & Engineering, and brings collaboration from multiple teams within Microsoft*. It reflects Microsoft’s larger commitment to advancing multimodal generative AI for precision health, with other exciting progress such as GigaPath (opens in new tab), BiomedCLIP (opens in new tab), LLaVA-Rad (opens in new tab), BiomedJourney (opens in new tab), MAIRA (opens in new tab), Rad-DINO (opens in new tab), Virchow (opens in new tab).
(Acknowledgment footnote) *: Within Microsoft, it is a wonderful collaboration among Health Futures, MSR Deep Learning, and Nuance.
Paper co-authors: Theodore Zhao, Yu Gu, Jianwei Yang (opens in new tab), Naoto Usuyama (opens in new tab), Ho Hin Lee, Sid Kiblawi, Tristan Naumann (opens in new tab), Jianfeng Gao (opens in new tab), Angela Crabtree, Jacob Abel, Christine Moung-Wen, Brian Piening, Carlo Bifulco, Mu Wei, Hoifung Poon (opens in new tab), Sheng Wang (opens in new tab)
GraphRAG: Improving global search via dynamic community selection
Retrieval-augmented generation (RAG) allows AI systems to provide additional information and context to a large language model (LLM) when generating a response to a user query. However, traditional RAG-based methods can struggle to retrieve information that requires high-level knowledge of the entire dataset, especially with abstract and global questions such as the keywordless query: “Catch me up on the last two weeks of updates.” These types of queries are known as “global” queries, as they require holistic understanding of the dataset to answer the question.
GraphRAG aims to tackle these questions in two main steps: indexing and query. The indexing engine first breaks down a collection of text documents into segments, which are then clustered into hierarchical communities with entities and relationships connecting each segment up through higher levels of abstraction. We then use an LLM to generate a summary of each community, known as a community report. The indexing engine thus creates a hierarchical knowledge graph of the dataset, with each level in the hierarchy representing a different level of abstraction and summarization of the original material. In the query step, GraphRAG uses this structured knowledge to provide additional context to the LLM to help answer the question. In this blog post, we show a new method for conducting “global” queries that efficiently utilizes the knowledge graph representation and optimizes the performance of global search in GraphRAG.
Static vs. dynamic global search
The global search (opens in new tab) algorithm in GraphRAG aims to answer abstract questions that require knowledge of the entire dataset. It generates answers by searching over communities at a predetermined level in the knowledge graph. Then the LLM combines and summarizes all the community reports at this level of abstraction. Finally, the summary is used as additional context for the LLM to generate the response to the user question. This map-reduce process allows the LLM to select relevant text from all the community reports to generate its final answer. This static approach is expensive and inefficient because it includes many lower-level reports that are not informative to the user query. Since it is unlikely that all community reports, especially at a high level, are relevant in answering the query, an approach that first considers the relevancy of the report prior to the resource-intensive map-reduce operation is highly desirable.
Here, we introduce dynamic community selection to the global search algorithm, which leverages the knowledge graph structure of the indexed dataset. Starting from the root of the knowledge graph, we use an LLM to rate how relevant a community report is in answering the user question. If the report is deemed irrelevant, we simply remove it and its nodes (or sub-communities) from the search process. On the other hand, if the report is deemed relevant, we then traverse down its child nodes and repeat the operation. Finally, only relevant reports are passed to the map-reduce operation to generate the response to the user. Figure 1 illustrates the dynamic community selection process in action.
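A minimal sketch of this traversal, assuming a hypothetical `rate_relevance` function standing in for the LLM rater (e.g., GPT-4o-mini), might look like the following; the data structures and placeholder rating logic are illustrative only.

```python
# Minimal sketch of dynamic community selection; `rate_relevance` stands in for the
# LLM rater and is reduced here to a trivial keyword check for illustration.
from dataclasses import dataclass, field

@dataclass
class Community:
    report: str
    children: list["Community"] = field(default_factory=list)

def rate_relevance(report: str, query: str) -> bool:
    """Placeholder rater: True if the community report looks useful for the query."""
    return any(word in report.lower() for word in query.lower().split())

def select_reports(roots: list[Community], query: str) -> list[str]:
    """Traverse the community hierarchy from the root, keeping only relevant branches."""
    selected, frontier = [], list(roots)
    while frontier:
        community = frontier.pop()
        if rate_relevance(community.report, query):
            selected.append(community.report)    # will be passed to map-reduce
            frontier.extend(community.children)  # traverse down into sub-communities
        # irrelevant communities are pruned together with their entire subtree
    return selected

# Example: only the vaccination branch survives for a vaccination-related query.
root = Community("Public health overview: vaccination and outbreaks",
                 [Community("Measles vaccination trends"), Community("Hospital funding")])
print(select_reports([root], "vaccination rates for major diseases"))
```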
Figure 1: Dynamic community selection workflow
The dynamic global search approach has two main benefits. First, it prunes irrelevant reports early on, reducing the total number of community reports to be considered in the map-reduce operation. Second, it enables users to search the entire knowledge graph, instead of predefining a static community level, which can lead to more detailed answers by collecting information at various levels of abstraction. Moreover, the rating operation is a classification problem, which is considerably easier to perform than summarization and text generation; therefore, a less complex model can be used. In our experiments leveraging OpenAI’s models, a GPT-4o-mini rater achieved a very similar retrieval rate as a GPT-4o rater, while operating at a fraction of both cost and time. Overall, we use the smaller and more cost-effective model, GPT-4o-mini, in the rate operation to prune any irrelevant community reports, then we use GPT-4o to perform the map-reduce operation to generate the final response.
Dynamic community selection on the AP News dataset
To demonstrate the cost saving that dynamic global search brings while maintaining a similar response quality, we evaluated the two methods side by side on a dataset from AP News. We tested static and dynamic search on 50 global questions and assessed the final response quality using an LLM evaluator. Moreover, we compared the total token cost of the two methods. To compare the two methods directly, we constrained the maximum search depth on dynamic global search so that both methods used the same underlying information.
We use an LLM evaluator to select the best response (i.e., win rate) on three key metrics (a minimal sketch of this pairwise setup appears after the list):
- Comprehensiveness: How much detail does the answer provide to cover all the aspects and details of the question?
- Diversity: How varied and rich is the answer in providing different perspectives and insights on the question?
- Empowerment: How well does the answer help the reader understand and make informed judgements about the topic?
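The sketch below outlines one way such a pairwise judgment could be implemented; the judge prompt and the choice of judge model are assumptions for illustration, not the exact evaluation setup used in these experiments.

```python
# Illustrative pairwise LLM-judge sketch; prompt wording and judge model are assumptions.
from openai import OpenAI

client = OpenAI()
METRICS = ["comprehensiveness", "diversity", "empowerment"]

def judge(question: str, answer_a: str, answer_b: str, metric: str) -> str:
    """Ask the judge model which answer is better on a single metric; returns 'A' or 'B'."""
    prompt = (
        f"Question: {question}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
        f"Which answer is better in terms of {metric}? Reply with exactly 'A' or 'B'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # judge model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper()[:1]

def win_rate(pairs: list[tuple[str, str, str]], metric: str) -> float:
    """Fraction of (question, static answer, dynamic answer) pairs where dynamic (B) wins."""
    wins = sum(judge(q, static, dynamic, metric) == "B" for q, static, dynamic in pairs)
    return wins / len(pairs)
```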
Significant cost reduction while maintaining output quality
The quality of responses generated from dynamic community selection is comparable to its static counterpart while reducing the total token cost. Our LLM evaluation shows that the output quality of the two methods is similar in the three key metrics across the 50 global questions on the AP News dataset, with no statistically significant difference between them. More importantly, we observed a significant reduction in total token cost when using the new method, with an average cost reduction of 77% over the existing static global search at community level 1. This is because a large number of community reports are eliminated via the rating process, so fewer prompt and output tokens are needed in the map-reduce operation. For instance, the existing static global search method processes about 1,500 level 1 community reports in the map-reduce operation, while only 470 community reports on average are selected in dynamic search to generate the final answer.
Moreover, if we allow dynamic search to continue the rating process further to deeper level community reports, we observe an improvement in its final responses. Here, we conducted the same experiment but allowed dynamic search to continue until community level 3. Out of the 50 global questions, 29 included more community reports than our static search baseline, suggesting that some community reports at deeper levels are relevant to the user question. Indeed, we observed a moderate and statistically significant improvement in both comprehensiveness and empowerment. Using an LLM evaluator to score pairs of responses, we observe that dynamic global search scores a win rate of 58% and 60%, respectively, against static search at level 1. Nevertheless, while the rating operation is performed by a smaller model and hence induces negligible cost, it can still lead to a higher overall cost due to the increased number of community reports that the map-reduce operation processes. In this experiment, the total cost with dynamic search at level 3 is 34% higher on average. Table 1 summarizes the results of static search at level 1 against dynamic search at level 1 and 3.
| Dynamic search | Num. queries | Comprehensiveness | Diversity | Empowerment | Relative cost |
|---|---|---|---|---|---|
| Level 1 | 50 | 49.5% | 46.0% | 48.0% | -77% |
| Level 3 | 29* | 58.8% | 48.0% | 60.0% | +34% |
Table 1. LLM evaluation of dynamic search vs. static search at different community levels on global questions. The comprehensiveness, diversity, and empowerment columns show the win rates of dynamic search at level 1 and level 3 against static search at level 1, that is, how often the LLM evaluator prefers the response generated by dynamic search over static search; the level 3 win rates for comprehensiveness and empowerment are statistically significant, while the others are not. The right-most column shows the average relative token cost against static search at level 1. * Note that we only evaluate answers from dynamic search at community level 3 for the questions that contain more community reports than static search at level 1.
Overall, dynamic community selection achieved a cost reduction of 77% while achieving output quality similar to the baseline static approach. These appear to be promising numbers, but what of the result quality? Let’s examine some direct side-by-side comparisons with an example:
Comparable response quality and detail
Query: Across the dataset, describe the common trends in vaccination rates for major diseases.
Static global search at level 1:
### Common Trends in Vaccination Rates for Major Diseases
#### Decline in Vaccination Rates
A significant trend observed across various regions is the decline in vaccination rates for major diseases. For instance, Minnesota has experienced a decrease in immunization rates, leading to recent measles outbreaks [Data: Reports (335)]. Similarly, the COVID-19 pandemic has exacerbated this issue, causing a drop in childhood immunization rates in Michigan and other areas [Data: Reports (1231, 915)]. This decline is often attributed to misinformation about vaccines, particularly the false claim linking the MMR vaccine to autism [Data: Reports (912, 1335)].
#### Impact of COVID-19
The COVID-19 pandemic has had a profound impact on vaccination rates. Efforts to increase COVID-19 vaccination rates have been a major focus of public health policies, but the pandemic has also led to a decline in routine vaccinations among children [Data: Reports (920, 194)]. The pandemic has influenced public attitudes towards health measures and vaccinations, contributing to broader public health challenges [Data: Reports (920)].
#### Regional Variations
Vaccination rates vary significantly by region. West Virginia, for example, has one of the strictest school vaccination policies in the U.S., resulting in high child immunization rates [Data: Reports (1199)]. In contrast, Idaho has the highest overall childhood vaccination exemption rate in the U.S., posing significant public health risks [Data: Reports (998)]. Illinois faces challenges with low vaccination rates in schools, leading to measles outbreaks [Data: Reports (108, 82)].
#### Public Health Initiatives
Various public health initiatives have been implemented to address vaccination rates. The CDC plays a crucial role in monitoring and reporting on vaccination rates for diseases such as measles, mumps, and polio, and provides vital information on preventable diseases [Data: Reports (914)]. Programs like the ‘Do It For Babydog’ sweepstakes in West Virginia encourage vaccination among residents [Data: Reports (1199)]. Additionally, the Bill & Melinda Gates Foundation is heavily involved in funding and supporting vaccination programs for major diseases such as polio, malaria, HIV, and cholera [Data: Reports (1091)].
### Conclusion
The dataset reveals a complex landscape of vaccination rates influenced by regional policies, public health initiatives, and the impact of the COVID-19 pandemic. While some regions have managed to maintain high vaccination rates through strict policies and public health campaigns, others face significant challenges due to misinformation and exemption rates. Global efforts and collaborations remain crucial in addressing these challenges and improving vaccination coverage worldwide.
Global search with dynamic community selection at level 3:
### Common Trends in Vaccination Rates for Major Diseases
#### Decline in Vaccination Rates
A significant trend observed across the dataset is the decline in vaccination rates for various diseases, including measles, mumps, rubella (MMR), and polio. This decline is particularly evident among U.S. kindergartners, where vaccination coverage has dropped from 95.2% during the 2019-2020 school year to 93.1% in the 2022-2023 school year. This reduction falls below the 95% threshold needed to achieve herd immunity, putting approximately 250,000 kindergartners at risk each year [Data: Reports (5668, 15469, 15703, 15481, 15471, +more)].
#### Impact of the COVID-19 Pandemic
The COVID-19 pandemic has had a profound impact on vaccination rates. Efforts to promote COVID-19 vaccinations have been a major focus, with initiatives in regions like Michigan and Illinois offering free meals to students and promoting updated COVID-19 shots [Data: Reports (19125, 15466)]. However, the pandemic has also disrupted routine vaccination programs, leading to lower coverage for diseases such as measles and contributing to recent outbreaks [Data: Reports (15464, 15674, 15482, 15481, +more)].
#### Regional Variations and Legislative Measures
There are notable regional variations in vaccination rates and exemption rates across the United States. For example, West Virginia has implemented strict immunization requirements for children entering school, resulting in high child immunization rates [Data: Reports (5674, 18874, 18899)]. In contrast, states like Minnesota and Illinois have seen increases in non-medical exemptions, contributing to lower vaccination rates and recent outbreaks [Data: Reports (15483, 15481, 108, 2705, +more)].
#### Efforts to Improve Vaccination Rates
Various initiatives and legislative measures have been introduced to address declining vaccination rates. For instance, the Government of Sindh introduced a polio vaccination bill that includes provisions for imprisonment for parents who do not vaccinate their children [Data: Reports (15398)]. In the U.S., the CDC has recommended new COVID-19 shots for everyone aged 6 months and older and has launched initiatives to ensure equitable access to vaccines, especially in developing countries [Data: Reports (15847, 15571, 15691, 15694, +more)].
### Conclusion
The dataset reveals a complex landscape of vaccination rates influenced by the COVID-19 pandemic, vaccine hesitancy, misinformation, and regional variations. While efforts to improve vaccination rates are ongoing, the decline in immunization coverage poses significant public health risks, highlighting the need for continued vigilance and proactive measures to ensure high vaccination rates and prevent outbreaks of vaccine-preventable diseases.
Table 2. Generated response from static search (level 1) and dynamic search (level 3) to the same global question on the AP News dataset.
Table 2 shows an example output from static search at level 1 and dynamic search at level 3 for the same question. While the two outputs cover similar high-level topics, the response from dynamic search provides specific data, such as the reduction of vaccination rates in certain demographics. We also notice that the response from dynamic search makes significantly more references to the source material, indicated by “[Data: Reports]” in the text. By selectively providing only the information that is relevant to the question, dynamic selection relieves the map-reduce operation from having to filter and process all the community reports at once, and it can therefore generate a response that is more comprehensive and specific to the user question.
Overall, dynamic community selection offers an alternative way to perform global search in GraphRAG by leveraging the indexed knowledge graph and using a cheaper LLM for the relevance-rating operation. These changes lead to a lower total token cost and potential improvements in response detail and quality.
Availability
You can experiment with dynamic global search on the GraphRAG GitHub repository (opens in new tab).
Dynamic global search is the second of several major optimizations to GraphRAG that are being explored. If you are interested in optimizations for local questions, please check out our recent blog post on DRIFT search. Stay tuned for our upcoming work, where we explore a radically different approach to graph-enabled RAG that is significantly more cost-efficient while improving answer quality for both local and global questions.
Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators
Our work on Orca and Orca 2 demonstrated the power of using synthetic data for the post-training of small language models and getting them to levels of performance previously found only in much larger language models. Orca-AgentInstruct is another step in this direction, where we explore using agentic flows to generate diverse and high-quality data at scale. Orca-AgentInstruct is an agentic solution for synthetic-data generation. By leveraging an agentic framework, AgentInstruct can generate tailored datasets, comprising both prompts and responses, from raw data sources, paving the way to building a synthetic data factory for model fine-tuning.
The efficacy of this approach is exemplified by the substantial improvement observed by fine-tuning a base Mistral 7-billion-parameter model and using AgentInstruct to generate a 25-million-pair dataset. The fine-tuned model (which we refer to as Orca-3-Mistral) showcases a notable performance gain across multiple benchmarks. For example, it shows 40% improvement on AGIEval, 19% improvement on MMLU, 54% improvement on GSM8K, 38% improvement on BBH, 45% improvement on AlpacaEval, and a 31.34% reduction of inaccurate or unreliable results across multiple summarization benchmarks.
We are making a 1-million-pair subset (orca-agentinstruct-1M) of this dataset publicly available, along with a report describing the data generation procedure, to encourage research on synthetic data generation and finetuning of language models.
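For readers who want to explore the released subset, a hedged sketch using the Hugging Face `datasets` library follows; the repository id is an assumption, so confirm it against the official release announcement.

```python
# Hedged sketch of inspecting the released 1M-pair subset with Hugging Face `datasets`.
from datasets import load_dataset

ds = load_dataset("microsoft/orca-agentinstruct-1M-v1")  # assumed repo id; verify before use

# List the available splits and peek at one record to see how prompt/response pairs are stored.
for split_name, split in ds.items():
    print(split_name, len(split))

first_split = next(iter(ds))
print(ds[first_split][0])
```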
Figure 1: Effect of using AgentInstruct for post-training Mistral-7B.
Figure 2. This figure provides a thematic overview of the roles played by different groups of agents. Content Transformation Flow converts the seed into an intermediate representation that makes it easier to create high-quality and diverse data. Seed Instruction Generation Flow creates instances of the target tasks following a taxonomy. Instruction Refinement Flow explores the space further by starting from these initial data points and exploring the neighborhood. The expectation is that by picking a random seed we will be able to cover the entire region of data points.
Synthetic Data Accelerated LLM Development: Over the past year, using synthetic data has greatly advanced the training of large language models (LLMs). It sped up model training at all stages, from pre-training (e.g., Phi-3) to instruction-tuning (e.g., Orca and WizardLM) and reinforcement learning from human feedback (e.g., Direct Nash Optimization).
- Demo video
AgentInstruct Methodology
Generating high-quality synthetic data is hard: On the other hand, research indicates that pre-training models on synthetic data produced by other models can result in model collapse, causing models to progressively degrade. Similar concerns have been raised regarding the use of synthetic data for post-training, suggesting that it might lead to an imitation process where the trained model learns only stylistic features rather than actual capabilities.
This discrepancy may be attributed to the challenge of generating high-quality and diverse synthetic data. Successful use of synthetic data involves significant human effort in curating and filtering the data to ensure high quality.
Synthetic data meets agents: Another major development we witnessed during the past year is the rise of agentic (especially multi-agent) workflows, such as with AutoGen. Agentic workflows can generate high-quality data, which surpasses the capabilities of the underlying LLMs, by using flows with reflection and iteration that enable agents to look back at solutions, generate critiques, and improve solutions. They can also use tools like search APIs, calculators, and code interpreters to address LLM limitations.
Multi-agent workflows bring in additional benefits as well, such as simulating scenarios where we can generate both new prompts and the corresponding responses. They also enable automation of data-generation workflows, reducing or eliminating the need for unnecessary human intervention on some tasks.
AgentInstruct: Generating synthetic data for post-training or finetuning often relies on an existing prompt set that is either used as is or as seeds for generating more instructions. In this work, we generalize the problem settings to a broader objective of generating an abundant amount of diverse, challenging, and high-quality data to teach a particular skill to an AI model. We refer to this setting as generative teaching.
AgentInstruct is an agentic solution for generative teaching. AgentInstruct uses raw documents as input to create demonstration and feedback data. When generic data is used as seeds, AgentInstruct can be used to teach an LLM a general capability, such as writing, reasoning, or retrieval-augmented generation (RAG). Domain specific data, like retail or finance, can also be used as seeds to improve the model in a certain specialization. AgentInstruct can create:
- High-quality data: AgentInstruct uses GPT-4, coupled with tools like search and code interpreters, to create high-quality data.
- Diverse data: AgentInstruct creates prompts and responses using a set of specialized agents (with powerful LLMs, tools, and reflection flows) and a taxonomy (of more than 100 subcategories), ensuring diversity and quality.
- Large quantities of data: AgentInstruct can run autonomously and apply flows for verification and data filtering. It does not require seed prompts and uses raw documents for seeding.
Using raw data as seeds offers two advantages: it is plentiful, allowing AgentInstruct to generate large-scale and diverse datasets, and it encourages learning general skills instead of benchmark-specific ones by avoiding using existing prompts.
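To make the flow structure concrete, here is a minimal sketch of the three flows chained over a raw seed document. The `llm` callable and all prompts are placeholders for illustration; they are not the actual Orca-AgentInstruct agent prompts or orchestration code.

```python
# Minimal sketch of AgentInstruct-style generative teaching over a raw seed document.
# `llm` is a stand-in for any chat model call; prompts are illustrative only.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own model client here")

def content_transformation(raw_document: str) -> str:
    """Convert the raw seed into an intermediate representation that is easier to work with."""
    return llm(f"Rewrite the following text as a structured argument passage:\n{raw_document}")

def seed_instruction_generation(transformed: str, task: str) -> str:
    """Create an initial instruction for a target task drawn from the taxonomy."""
    return llm(f"Write one {task} question that can be answered from:\n{transformed}")

def instruction_refinement(instruction: str) -> str:
    """Explore the neighborhood of the seed instruction to increase difficulty and diversity."""
    return llm(f"Make this question more challenging without changing its topic:\n{instruction}")

def generate_pair(raw_document: str, task: str = "reading-comprehension") -> dict:
    transformed = content_transformation(raw_document)
    instruction = instruction_refinement(seed_instruction_generation(transformed, task))
    response = llm(instruction)  # a critic agent could further verify or filter this response
    return {"prompt": instruction, "response": response}
```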
We anticipate agentic flows becoming increasingly important throughout the model-training lifecycle, including pre-training, post-training, and specialization, and ultimately enabling the creation of a synthetic data factory for model customization and continuous improvement. This has the potential to drive AI advances across multiple industries by making high-quality model training more efficient and accessible.
Contributors: Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgou, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, and Ahmed Awadallah
Toward modular models: Collaborative AI development enables model accountability and continuous learning
Today, development of generalizable AI models requires access to sufficient data and compute resources, which may create challenges for some researchers. Democratizing access to technology across the research community can advance the development of generalizable AI models. By applying the core software development concept of modularity to AI, we can build models that are powerful, efficient, adaptable, and transparent.
Until recently, AI models were primarily built using monolithic architecture. Though powerful, these models can be challenging to customize and edit compared to modular models with easily interpretable functional components. Today, developers employ modularity to make services more reliable, faster to refine, and easier for multiple users to contribute to simultaneously. One promising research direction that supports this involves shifting AI development towards a modular approach (opens in new tab), which could enhance flexibility and improve scalability.
One such approach is to use numerous fine-tuned models designed for specific tasks, known as expert models, and coordinate them to solve broader tasks (see Towards Modular LLMs by Building and Reusing a Library of LoRAs – Microsoft Research (opens in new tab), Learning to Route Among Specialized Experts for Zero-Shot Generalization (opens in new tab)). These expert models can be developed in a decentralized way. Similar to the benefits of using a microservice architecture, this modular AI approach can be more flexible, cheaper to develop, and more compliant with relevant privacy and legal policies. However, while substantial research has been done on training optimization, coordination methods remain largely unexplored.
Our team is exploring the potential of modular models by focusing on two themes: i) optimizing the training of expert models and ii) refining how expert models coordinate to form a collaborative model. One method for coordinating expert models is to adaptively select the most relevant independently developed expert models for specific tasks or queries. This approach, called MoErging, is similar to Mixture-of-Experts (MoE) approaches but differs in that the routing mechanism is learned after the individual experts are trained. As an initial step, we contributed to creating a taxonomy for organizing recent MoErging methods with the goal of helping establish a shared language for the research community and facilitating easier and fairer comparisons between different methods.
Assessing existing MoErging methods
Most MoErging methods were developed within the past year, so they don’t reference each other and are difficult to compare. To enable comparison of MoErging methods, we recently collaborated on a survey that establishes a taxonomy for comparing methods and organizes MoErging design choices into three steps:
- Expert design: Identifies and uses expert models trained asynchronously by distributed contributors.
- Routing design: Routes tasks to the appropriate expert models.
- Application design: Applies the merged models to specific tasks or domains.
Each step is broken down into more detailed choices. For example, in expert design, expert training can be custom or standard, and training data can be private or shared. Custom training requires the MoErging method to prescribe a specific training procedure, while standard training does not. Similarly, shared data means that the training data must be accessible for routing; otherwise, the training data is considered private.
The benefits of modular models discussed below assume that training data doesn’t need to be shared. However, a review of current MoErging methods finds that some approaches do require sharing training data, making certain benefits no longer applicable.
The survey evaluates 29 different MoErging methods using its taxonomy, which categorizes the design choices into two expert design choices, five routing design choices, and two application design options, shown in Figure 1.
Figure 1: Taxonomy of model MoErging design choices. References in the leaf nodes link to sections of specific papers that implement each choice. We omit references to methods where a particular choice is not applicable.
One takeaway from the survey is that most MoErging methods can be grouped into four categories based on their routing design choices:
- Classifier-based routing: Methods that train the router as a classifier using expert datasets or unseen data.
- Embedding-based routing: Methods that compute embeddings of expert training sets and compare them to a query embedding for routing (a toy sketch of this idea follows the list).
- Nonrouter methods: Methods that do not explicitly train a router but instead initialize the router in an unsupervised manner.
- Task-specific routing: Methods that learn a task-specific routing distribution over the target dataset to improve performance on a specific task.
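As a toy illustration of the embedding-based category, the sketch below routes a query to the expert whose training-set centroid is most similar to the query embedding. The stand-in `embed` function, the expert names, and the example texts are all hypothetical; in practice any sentence-embedding model could be used.

```python
# Toy sketch of embedding-based MoErging routing: pick the expert whose training-set
# centroid is closest (by cosine similarity) to the query embedding.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Deterministic stand-in embedding; replace with a real sentence encoder."""
    rng = np.random.default_rng(sum(ord(c) for c in text))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

# Each contributor publishes only a centroid of its expert's training data, not the data itself.
expert_centroids = {
    "medical-qa": np.mean([embed(t) for t in ["dosage question", "symptom triage"]], axis=0),
    "legal-summarization": np.mean([embed(t) for t in ["contract clause", "case brief summary"]], axis=0),
}

def route(query: str) -> str:
    """Return the name of the expert model to which the query should be routed."""
    q = embed(query)
    scores = {name: float(q @ c / np.linalg.norm(c)) for name, c in expert_centroids.items()}
    return max(scores, key=scores.get)

print(route("Summarize the indemnification clause in this agreement."))
```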
While the differences within each category are minor, the differences across categories are significant because they determine the level of data access required for implementation. As a result, data access is a primary factor in determining which methods are applicable and feasible in various settings.
Our taxonomy also covers recent approaches to building agentic systems, which could be viewed as specific types of MoErging methods where experts are full language models and routing decisions are made on a step-by-step or example-by-example basis. The optimal level for MoErging may vary depending on the task and the computational resources available to each stakeholder.
Potential benefits and use cases of modular models
Modular models can unlock new benefits and use cases for AI, offering a promising approach to addressing challenges in current AI development. Moving forward, further substantial research is needed to validate this potential and assess feasibility.
Modular AI may:
- Allow privacy-conscious contributions. Teams with sensitive or proprietary data, such as personally identifiable information (PII) and copyrighted content, can contribute expert models and benefit from larger projects without sharing their data. This capacity can make it easier to comply with data privacy and legal standards, which could be valuable for healthcare teams that would benefit from general model capabilities without combining their sensitive data with other training data.
- Drive model transparency and accountability. Modular models allow specific expert models to be identified and, if necessary, removed or retrained. For example, if a module trained on PII, copyrighted, or biased data is identified, it can be removed more easily, eliminating the need for retraining and helping ensure compliance with privacy and ethical standards.
- Facilitate model extensibility and continual improvement. Modularity supports continual improvements, allowing new capabilities from expert models to be integrated as they are available. This approach is akin to making localized edits, allowing for continuous, cost-effective improvement.
- Lower the barrier to AI development for those with limited compute and data resources. Modular AI can reduce the need for extensive data and compute by creating a system where pretrained experts can be reused, benefiting academics, startups, and teams focused on niche use cases. For example, an AI agent tasked with booking flights on a specific website with limited training data could leverage general navigation and booking skills from other trained AI experts, enabling generalizable and broadly applicable skills without requiring domain-specific training data. We explore this process of transferring skills across tasks in our paper “Multi-Head Routing For Cross-Task Generalization.”
- Support personalization. Modular models make it possible to equip AI agents with experts tailored to individual users or systems. For instance, AI designed to emulate five-time World Chess Champion Magnus Carlsen could enhance a player’s preparation to play a match against him. Experiments suggest that storing knowledge or user profiles in on-demand modules can match or surpass the performance of retrieval-augmented generation (RAG), potentially reducing latency and improving the user’s experience in custom AI applications.
In this blog, we focused on a type of modular approach that involves training foundation models, which requires substantial compute power and large amounts of data. Despite the advantages of modularity, such as increased flexibility, efficiency, and adaptability, the development of foundation models remains resource-intensive, necessitating high-performance computing and robust datasets to support fine-tuning.
Recent work has begun to address these challenges by distributing the pretraining process of foundation models (opens in new tab). Looking ahead, a promising research direction focuses on exploring how to create a minimal dataset for training “empty foundation models” while shifting most of their capabilities to external pluggable modules.
Modular methods are evolving rapidly, and we’re excited by their potential. Modularity has the capacity to democratize AI development, improve model accountability, and support efficient continuous learning. With the MoErging taxonomy, we aim to establish a shared language that fosters engagement within the research community. This research is in the early stages, and we welcome community collaboration. If you’re interested in working with us, please reach out to ModularModels@microsoft.com.
Acknowledgements
We would like to thank paper collaborators: Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem Choshen, Edoardo Ponti, Zhan Su, Matheus Pereira, Nicolas Le Roux, Nabil Omi, Siddhartha Sen, Anurag Sarkar, Jordan T. Ash, Oleksiy Ostapenko, and Laurent Charlin.
Research Focus: Week of November 11, 2024
Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.
NEW RESEARCH Look Ma, no markers: holistic performance capture without the hassle
Motion-capture technologies used in film and game production typically focus solely on face, body, or hand capture, requiring complex and expensive hardware and lots of manual intervention from skilled operators. While machine-learning-based approaches can overcome these challenges, they usually only support a single camera, often operate on a single part of the body, do not produce precise world-space results, and rarely generalize outside specific contexts.
In a recent paper: Look Ma, no markers: holistic performance capture without the hassle, researchers from Microsoft introduce a technique for marker-free, high-quality reconstruction of the complete human body, including eyes and tongue, without requiring any calibration, manual intervention, or custom hardware. This approach produces stable world-space results from arbitrary camera rigs while also supporting varied capture environments and clothing. The researchers achieve this through a hybrid approach that leverages machine learning models trained exclusively on synthetic data and powerful parametric models of human shape and motion. They evaluate their method on a number of body, face, and hand reconstruction benchmarks and demonstrate state-of-the-art results that generalize to diverse datasets.
Read the paper
NEW RESEARCH
Building AI Agents for Autonomous Clouds: Challenges and Design Principles
Using AI agents for operational resilience of cloud services, which currently requires significant human effort and domain knowledge, is a high-impact application. Interest is growing in AI for IT Operations (AIOps), which aims to automate complex operational tasks like fault localization and root cause analysis, thereby reducing human intervention and customer impact. However, achieving the vision of autonomous and self-healing clouds through AIOps is hampered by the lack of standardized frameworks for building, evaluating, and improving AIOps agents.
In a recent paper: Building AI Agents for Autonomous Clouds: Challenges and Design Principles, researchers from Microsoft lay the groundwork for such a framework by first framing the requirements and then discussing design decisions that satisfy them. The researchers also propose AIOpsLab, a prototype implementation leveraging an agent-cloud interface that orchestrates an application, injects real-time faults using chaos engineering, and interfaces with an agent to localize and resolve the faults. The paper sets the stage for a modular and robust framework for building, evaluating, and improving agents for autonomous clouds.
Read the paper
Microsoft Research Blog
Microsoft Research Forum Episode 3: Globally inclusive and equitable AI, new use cases for AI, and more
In the latest episode of Microsoft Research Forum, researchers explored the importance of globally inclusive and equitable AI, shared updates on AutoGen and MatterGen, and presented novel use cases for AI, including industrial applications and the potential of multimodal models to improve assistive technologies.
Read more
NEW RESEARCH
Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming
AI-assisted programming offers great promise, but it also raises concerns about the trustworthiness of AI-generated code. Proof-oriented languages like F* enable authoring programs backed by machine-checked proofs of correctness. Using AI to generate code and proofs in proof-oriented languages helps mitigate these concerns, while also making proof-oriented programming more accessible.
In a recent preprint: Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming, researchers from Microsoft and external colleagues explore using AI to automate the construction of proof-oriented programs. The researchers curate a dataset of 940,000 lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux to Python and Firefox. The dataset includes around 54,000 top-level F* definitions, each representing a type-directed program and proof synthesis problem. A program fragment checker queries F* to check the correctness of candidate solutions. With this dataset, the researchers explore using AI to synthesize programs and their proofs in F*, finding that fine-tuned smaller language models compare favorably with large language models, at much lower computational cost.
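To illustrate how such a fragment checker fits into a synthesis loop, here is a hedged sketch: the `generate_candidates` model interface is a placeholder, and invoking the F* binary through a subprocess call is an assumption about the environment rather than the harness used in the paper.

```python
# Sketch of a generate-and-check loop for proof-oriented program synthesis.
# `generate_candidates` stands in for a language model; calling the F* binary
# via subprocess is an assumption, not the paper's actual checking harness.
import subprocess
import tempfile

def fstar_accepts(program_text: str) -> bool:
    """Write a candidate F* module to disk and ask the F* checker to verify it."""
    with tempfile.NamedTemporaryFile("w", suffix=".fst", delete=False) as f:
        f.write(program_text)
        path = f.name
    result = subprocess.run(["fstar.exe", path], capture_output=True)
    return result.returncode == 0

def synthesize(context: str, goal_signature: str, generate_candidates, k: int = 8):
    """Sample k candidate definitions for a goal and return the first that verifies."""
    for candidate in generate_candidates(context, goal_signature, num_samples=k):
        if fstar_accepts(context + "\n" + candidate):
            return candidate
    return None  # no sampled candidate passed the checker
```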
Read the paper
NEW RESEARCH
One-to-many testing for code generation from (just) natural language
The mostly basic Python programs (MBPP) dataset is commonly used for evaluating natural language models on the task of code generation. Despite its popularity, the original MBPP has two major problems: it relies on providing test cases to generate the right signature, and there is poor alignment between “what is asked” and “what is evaluated” using the test cases.
To address these challenges, in their recent “One-to-many testing for code generation from (just) natural language” paper, researchers from Microsoft introduce the “mostly basic underspecified Python programs” or MBUPP dataset. This dataset adapts MBPP to emphasize the natural language aspect by allowing for some syntactic ambiguity (like not specifying the return type of a function) and evaluating generated code on multiple sets of assertions (with each set covering a different return type). Besides iteratively inspecting LLM results to extend the assertion sets, the researchers carefully remove poor alignment from the instructions (like a specific algorithm to use) and perform a majority vote over slightly paraphrased instructions to improve the quality of the dataset. The researchers compare popular open- and closed-weight models on the original MBPP and adapted MBUPP datasets to highlight the effect of paraphrasing and new test cases on code generation evaluation. The MBUPP dataset is publicly available to encourage its use in evaluating code generation models.
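The one-to-many evaluation idea can be illustrated with a short sketch: a generated program is accepted if it passes any one of several alternative assertion sets, each covering a different acceptable interpretation of the underspecified prompt. The function names and the toy example below are illustrative, not the released MBUPP tooling.

```python
# One-to-many evaluation sketch: a candidate passes if it satisfies ANY of the
# alternative assertion sets. Names and examples are illustrative only.

def passes_assertion_set(candidate_src: str, assertions: list) -> bool:
    """Run one assertion set against the candidate in an isolated namespace."""
    namespace = {}
    try:
        exec(candidate_src, namespace)      # define the candidate function
        for assertion in assertions:
            exec(assertion, namespace)      # raises AssertionError on failure
        return True
    except Exception:
        return False

def one_to_many_pass(candidate_src: str, assertion_sets: list) -> bool:
    """Accept the candidate if it matches any acceptable interpretation."""
    return any(passes_assertion_set(candidate_src, s) for s in assertion_sets)

# The prompt is underspecified about the return type, so either
# interpretation below should count as correct.
candidate = (
    "def split_words(s):\n"
    "    return s.split()\n"
)
assertion_sets = [
    ["assert split_words('a b') == ['a', 'b']"],   # list interpretation
    ["assert split_words('a b') == ('a', 'b')"],   # tuple interpretation
]
print(one_to_many_pass(candidate, assertion_sets))  # True: the list interpretation passes
```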
Read the paper
The post Research Focus: Week of November 11, 2024 appeared first on Microsoft Research.
Preventing side-channels in the cloud
Cloud computing delivers scalable and cost-effective compute resources to a wide range of customers. The ability for cloud providers to share components of the hardware stack across customers, or tenants, is essential for running efficient cloud systems. For example, modern central processing units (CPUs) pack hundreds of physical hardware threads sharing terabytes of dynamic random-access memory (DRAM), which can be flexibly assigned to many independent virtual machines (VMs).
Preventing tenants from snooping on others who share the same hardware requires security mechanisms. Microsoft Azure provides strong protection via comprehensive architectural isolation through access control mechanisms implemented across the cloud platform, including the hardware and the hypervisor. Confidential computing powered by trusted execution environments further hardens architectural isolation via hardware memory encryption to protect tenants even against privileged attackers.
A changing threat landscape
Even with perfect architectural isolation, sharing microarchitectural resources, such as CPU caches and DRAM row buffers, can leak small amounts of information, because interference (due to sharing) leads to variations in the latency of memory accesses. This gives rise to so-called microarchitectural side-channel attacks, where a malicious tenant can learn information about another tenant; in the worst case, their cryptographic keys.
Microsoft Azure protects tenants and critical infrastructure against currently practical side-channel attacks. For example, side-channels in on-core resources (e.g., buffers, predictors, private caches) are comprehensively mitigated by Hyper-V HyperClear via core scheduling, microarchitectural flushing and scrubbing, and virtual-processor address space isolation; and our cryptographic libraries are carefully hardened to prevent any secrets from being leaked via microarchitectural side-channels.
However, the threat landscape is changing. First, side-channel attacks are becoming increasingly sophisticated: For example, recent academic research has shown that even cache-coherence directories can be exploited to leak information across cores. Second, future CPUs are likely to employ increasingly sophisticated microarchitectural optimizations, which are prone to new kinds of attacks: For example, the recently introduced data-dependent prefetchers have already been found to leak information.
In Azure Research’s Project Venice, we are investigating principled defenses, to be prepared in case such emerging attacks start posing a risk to Azure customers.
Preventing microarchitectural side-channels with resource-exclusive domains
In a research paper, which received a distinguished paper award at the ACM Conference on Computer and Communications Security (ACM CCS’24), we present a system design that can prevent cross-VM microarchitectural side-channels in the cloud. Our design provides what we call resource-exclusive domains, which extend the architectural abstraction of private physical threads and private memory to the microarchitectural level. That is, resource-exclusive domains guarantee isolation even against powerful attackers that try to mount side-channel attacks on shared microarchitectural resources.
Our approach builds on isolation schemes, a novel abstraction of the way a CPU shares microarchitectural structures between its physical threads. Isolation schemes can be used by the hypervisor and host operating system to assign physical threads and physical memory pages, eliminating the risk of information leakage across resource-exclusive domains. Technically, for a given assignment of physical threads to resource-exclusive domains, the isolation scheme partitions each microarchitectural resource that is shared between domains (as this would leak information), but without partitioning resources that are private to a domain (as this would affect performance). We achieve this using hardware mechanisms, if available, and multi-resource memory coloring, if not.
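To make the coloring idea concrete, here is a hedged sketch of a multi-resource page allocator: every physical page maps to one color per shared resource, and a page may be given to a domain only if each of its colors is exclusively owned by that domain. The resource list and color functions are illustrative assumptions, not the isolation schemes used in Project Venice.

```python
# Illustrative sketch of multi-resource memory coloring. The resources and
# color functions are hypothetical; real isolation schemes are derived from
# the specific CPU's microarchitecture.

RESOURCES = ["llc_slice", "cache_set_group", "dram_bank"]

def colors_of(page_frame: int) -> dict:
    """Map a physical page frame number to one color per shared resource."""
    return {
        "llc_slice": page_frame % 8,              # e.g., 8 last-level cache slices
        "cache_set_group": (page_frame >> 3) % 16,
        "dram_bank": (page_frame >> 7) % 32,
    }

class ColorAllocator:
    def __init__(self):
        # owner[resource][color] = domain that exclusively owns that color
        self.owner = {r: {} for r in RESOURCES}

    def can_assign(self, page_frame: int, domain: str) -> bool:
        colors = colors_of(page_frame)
        return all(self.owner[r].get(colors[r], domain) == domain for r in RESOURCES)

    def assign(self, page_frame: int, domain: str) -> bool:
        """Grant the page only if none of its colors is shared with another domain."""
        if not self.can_assign(page_frame, domain):
            return False
        for r, color in colors_of(page_frame).items():
            self.owner[r][color] = domain
        return True

alloc = ColorAllocator()
print(alloc.assign(0, "tenant_A"))   # True: tenant_A now owns these colors
print(alloc.assign(8, "tenant_B"))   # False: page 8 shares an LLC-slice color with tenant_A
```

Where hardware partitioning mechanisms are available, they can take the place of coloring for some resources, as noted above.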
In a complementary research paper (appearing at ACM CCS’24), we provide the theoretical foundations and practical algorithms for computing such multi-resource memory coloring schemes for existing microarchitectures, as well as design patterns for future microarchitectures to support a large number of resource-exclusive domains.
We have implemented our approach in a research prototype based on Microsoft Hyper-V for a modern cloud chiplet-based CPU, AMD EPYC 7543P, that supports VM-level trusted execution environments. Using a collection of microbenchmarks and cloud benchmarks, we demonstrate that our approach eliminates all identified side-channels and incurs only small performance overheads. For example, when allocating resources at chiplet and channel granularity (i.e., coupling a chiplet with one of the local DRAM channels) we observe an overhead of less than 2%; and only up to 4% when allocating resources at chiplet granularity and coloring with 2MB pages.
Co-designing cloud platforms for future microarchitectural isolation
To validate the effectiveness and practicality of our approach, we inferred isolation schemes for a single CPU by reverse-engineering its microarchitecture. This approach is incomplete and does not scale to the diverse hardware fleet available in the cloud. We are working with CPU vendors to develop isolation schemes for future CPUs, which will then be exposed via the hardware interface for consumption by the hypervisor’s hardware abstraction layer. In this way, we will be able to reap the benefits of microarchitectural performance optimizations while continuing to provide strong security guarantees to cloud tenants.
Additional Contributors
Cédric Fournet, Senior Principal Researcher
Jana Hofmann, Researcher
Oleksii Oleksenko, Senior Researcher
The post Preventing side-channels in the cloud appeared first on Microsoft Research.
From static prediction to dynamic characterization: AI2BMD advances protein dynamics with ab initio accuracy
The essence of the biological world lies in the ever-changing nature of its molecules and their interactions. Understanding the dynamics and interactions of biomolecules is crucial for deciphering the mechanisms behind biological processes and for developing biomaterials and drugs. As Richard Feynman famously said, “Everything that living things do can be understood in terms of the jigglings and wigglings of atoms.” Yet capturing these real-life movements is nearly impossible through experiments.
In recent years, with the development of deep learning methods represented by AlphaFold and RoseTTAFold, predicting static crystal protein structures has been achieved with experimental accuracy (as recognized by the 2024 Nobel Prize in Chemistry). However, accurately characterizing dynamics at atomic resolution remains much more challenging, especially when proteins play their roles and interact with other biomolecules or drug molecules.
As one approach, Molecular Dynamics (MD) simulation combines the laws of physics with numerical simulations to tackle the challenge of understanding biomolecular dynamics. This method has been widely used for decades to explore the relationship between the movements of molecules and their biological functions. In fact, the significance of MD simulations was underscored when the classic version of this technique was recognized with a Nobel Prize in 2013, highlighting its crucial role in advancing our understanding of complex biological systems. Similarly, the quantum mechanical approach—known as Density Functional Theory (DFT)—received its own Nobel Prize in 1998, marking a pivotal moment in computational chemistry.
In MD simulations, molecules are modeled at the atomic level by numerically solving the equations of motion that govern the system’s time evolution, through which kinetic and thermodynamic properties can be computed. MD simulations are used to model the time-dependent motions of biomolecules. If you think of proteins like intricate gears in a clock, AI2BMD doesn’t just capture them in place—it watches them spin, revealing how their movements drive the complex processes that keep life running.
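As a toy illustration of what numerically solving the equations of motion involves, the snippet below advances a set of atoms by one velocity Verlet step. The pairwise force function is a deliberately simplified stand-in chosen for this post; real biomolecular MD uses far richer potentials (classical force fields, or the learned ab initio-level potential described below).

```python
# One velocity Verlet integration step for a toy all-atom system.
# The pairwise force function is illustrative only, not a biomolecular force field.
import numpy as np

def toy_forces(positions: np.ndarray, k: float = 1.0) -> np.ndarray:
    """Simple pairwise attraction; a placeholder for a real potential's forces."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1) + np.eye(len(positions))  # avoid divide-by-zero
    return -k * (diffs / dists[..., None] ** 3).sum(axis=1)

def velocity_verlet_step(positions, velocities, masses, force_fn, dt=1e-3):
    """Advance positions and velocities by one time step of size dt."""
    forces = force_fn(positions)
    velocities = velocities + 0.5 * dt * forces / masses[:, None]
    positions = positions + dt * velocities
    velocities = velocities + 0.5 * dt * force_fn(positions) / masses[:, None]
    return positions, velocities

# Ten atoms in 3D with unit masses; repeating this step traces out a trajectory
# from which kinetic and thermodynamic properties can be estimated.
pos = np.random.rand(10, 3)
vel = np.zeros((10, 3))
pos, vel = velocity_verlet_step(pos, vel, np.ones(10), toy_forces)
```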
MD simulations can be roughly divided into two classes: classical MD and quantum mechanics. Classical MD employs simplified representations of molecular systems, achieving fast simulation speeds for long-timescale conformational changes, but it is less accurate. In contrast, quantum mechanics models, such as Density Functional Theory, provide calculations from first principles but are computationally prohibitive for large biomolecules.
Ab initio biomolecular dynamics simulation by AI
Microsoft Research has been working on the development of efficient methods aiming for simulations of biomolecules with ab initio accuracy. This method, AI2BMD (AI-based ab initio biomolecular dynamics system), has been published in the journal Nature, representing the culmination of a four-year research endeavor.
AI2BMD efficiently simulates a wide range of proteins at all-atom resolution, with more than 10,000 atoms, at approximate ab initio—or first-principles—accuracy. It thus strikes a tradeoff for biomolecular simulations that was previously inaccessible with standard simulation techniques: it achieves higher accuracy than classical simulation, at a computational cost that is higher than classical simulation but orders of magnitude lower than DFT. This development could unlock new capabilities in biomolecular modeling, especially for processes where high accuracy is needed, such as protein-drug interactions.
Figure 1. The flowchart of AI2BMD
AI2BMD employs a newly designed, generalizable protein fragmentation approach that splits proteins into overlapping units, creating a dataset of 20 million snapshots—the largest ever at the DFT level. Based on our previously designed ViSNet, a universal molecular geometry modeling foundation model published in Nature Communications and incorporated into the PyTorch Geometric library, we trained AI2BMD’s potential energy function using machine learning. Simulations are then performed by the highly efficient AI2BMD simulation system, where at each step the ViSNet-based AI2BMD potential calculates the energy and atomic forces for the protein with ab initio accuracy. In comprehensive analyses of both kinetics and thermodynamics, AI2BMD aligns much more closely with wet-lab data, such as the folding free energy of proteins, and captures phenomena that classical MD misses.
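The overall loop can be sketched at a high level as follows. Every component here (the fragmentation routine, the learned potential, and the integration step) is a simplified placeholder standing in for the actual AI2BMD machinery, which fragments proteins into overlapping amino-acid-based units and evaluates a ViSNet-based potential at every step.

```python
# High-level sketch of an MD loop driven by a machine-learned potential,
# in the spirit of AI2BMD. All components are simplified placeholders.
import numpy as np

def fragment_protein(coords: np.ndarray) -> list:
    """Placeholder fragmentation: fixed-size overlapping windows over the atoms
    (AI2BMD instead fragments proteins into overlapping amino-acid-based units)."""
    size, stride = 6, 4
    return [coords[i:i + size] for i in range(0, max(len(coords) - size, 0) + 1, stride)]

class DummyPotential:
    """Placeholder for a learned potential such as the ViSNet-based AI2BMD model."""
    def energy_and_forces(self, coords, fragments):
        energy = sum(float((frag ** 2).sum()) for frag in fragments)  # toy quadratic energy
        forces = -2.0 * coords                                        # toy per-atom forces
        return energy, forces

def run_mlff_md(coords, potential, dt=1e-3, n_steps=100):
    """At every step, the learned potential supplies energy and forces for integration."""
    velocities = np.zeros_like(coords)
    for _ in range(n_steps):
        fragments = fragment_protein(coords)
        energy, forces = potential.energy_and_forces(coords, fragments)
        velocities = velocities + dt * forces          # unit masses, simple Euler step
        coords = coords + dt * velocities
    return coords, energy

coords, energy = run_mlff_md(np.random.rand(30, 3), DummyPotential())
```

The point of the sketch is only to show where the learned potential sits in the simulation loop; the accuracy and generalization claims below come from the full AI2BMD system.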
Spotlight: On-demand video
AI Explainer: Foundation models and the next era of AI
Explore how the transformer architecture, larger models and more data, and in-context learning have helped advance AI from perception to creation.
Watch video
Advancing biomolecular MD simulation
AI2BMD represents a significant advancement in the field of MD simulations in the following aspects:
(1) Ab initio accuracy: AI2BMD introduces a generalizable “machine learning force field,” a machine-learned model of the interactions between atoms and molecules, for full-atom protein dynamics simulations with ab initio accuracy.
Figure 2. Evaluation of the energy calculation error between AI2BMD and Molecular Mechanics (MM) for different proteins.
(2) Addressing generalization: AI2BMD is the first to address the generalization challenge of a machine-learned force field for simulating protein dynamics, demonstrating robust ab initio MD simulations for a variety of proteins.
(3) General compatibility: AI2BMD expands Quantum Mechanics (QM) modeling from small, localized regions to entire proteins without requiring any prior knowledge of the protein. This eliminates the potential incompatibility between QM and MM calculations for proteins and accelerates QM-region calculation by several orders of magnitude, bringing near ab initio calculation for full-atom proteins to reality. Consequently, AI2BMD paves the way for numerous downstream applications and allows for a fresh perspective on characterizing complex biomolecular dynamics.
(4) Speed advantage: AI2BMD is several orders of magnitude faster than DFT and other quantum mechanics methods. It supports ab initio calculations for proteins with more than 10,000 atoms, making it one of the fastest AI-driven MD simulation programs across disciplines.
Figure 3. Comparison of time consumption between AI2BMD, DFT, and other AI-driven simulation software.
(5) Diverse conformational space exploration: For protein folding and unfolding simulated by AI2BMD and MM, AI2BMD explores conformational space that MM cannot detect. AI2BMD therefore opens more opportunities to study flexible protein motions during drug-target binding, enzyme catalysis, allosteric regulation, intrinsically disordered proteins, and so on, better aligning with wet-lab experiments and providing more comprehensive explanations and guidance for probing biological mechanisms and for drug discovery.
Figure 4. AI2BMD folds the protein Chignolin starting from an unfolded structure, achieves smaller energy error than MM, and explores conformational regions that MM cannot detect.
(6) Experimental agreement: AI2BMD outperforms the QM/MM hybrid approach and demonstrates high consistency with wet-lab experiments across different biological application scenarios, including J-coupling, enthalpy, heat capacity, folding free energy, melting temperature, and pKa calculations.
Looking ahead
Achieving ab initio accuracy in biomolecular simulations is challenging but holds great potential for understanding the mysteries of biological systems and designing new biomaterials and drugs. This breakthrough is a testament to the vision of AI for Science—an initiative to channel the capabilities of artificial intelligence to revolutionize scientific inquiry. The proposed framework aims to address limitations regarding accuracy, robustness, and generalization in the application of machine learning force fields. AI2BMD provides generalizability, adaptability, and versatility in simulating various protein systems by considering the fundamental structure of proteins, namely stretches of amino acids. This approach enhances energy and force calculations as well as the estimation of kinetic and thermodynamic properties.
One key application of AI2BMD is its ability to perform highly accurate virtual screening for drug discovery. In 2023, at the inaugural Global AI Drug Development competition, AI2BMD made a breakthrough by predicting a chemical compound that binds to the main protease of SARS-CoV-2. Its precise predictions surpassed those of all other competitors, securing first place and showcasing its immense potential to accelerate real-world drug discovery efforts.
Since 2022, Microsoft Research has also partnered with the Global Health Drug Discovery Institute (GHDDI), a nonprofit research institute founded and supported by the Gates Foundation, to apply AI technology to design drugs for diseases that disproportionately affect low- and middle-income countries (LMICs), such as tuberculosis and malaria. We are now collaborating closely with GHDDI to leverage AI2BMD and other AI capabilities to accelerate the drug discovery process.
AI2BMD can help advance solutions to scientific problems and enable new biomedical research in drug discovery, protein design, and enzyme engineering.
The post From static prediction to dynamic characterization: AI2BMD advances protein dynamics with ab initio accuracy appeared first on Microsoft Research.