Using generative AI to imitate human behavior

Microsoft Research - Thu, 05/04/2023 - 19:00

This research was accepted by the 2023 International Conference on Learning Representations (ICLR), which is dedicated to the advancement of the branch of artificial intelligence generally referred to as deep learning.

Figure 1: Overview of our method.

Diffusion models have emerged as a powerful class of generative AI models. They have been used to generate photorealistic images and short videos, compose music, and synthesize speech. And their uses don’t stop there. In our new paper, Imitating Human Behaviour with Diffusion Models, we explore how they can be used to imitate human behavior in interactive environments.

This capability is valuable in many applications. For instance, it could help automate repetitive manipulation tasks in robotics, or it could be used to create humanlike AI in video games, which could lead to exciting new game experiences—a goal particularly dear to our team.

We follow a machine learning paradigm known as imitation learning (more specifically, behavior cloning). In this paradigm, we are given a dataset of the observations a person saw and the actions they took while acting in an environment, and we would like an AI agent to mimic that behavior. In interactive environments, at each time step an observation \( o_t \) is received (e.g., a screenshot of a video game), and an action \( a_t \) is then selected (e.g., a mouse movement). Given this dataset of many \( o \)’s and \( a \)’s produced by some demonstrator, a model \( \pi \) can try to learn the mapping from observation to action, \( \pi(o) \to a \).
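As a minimal illustration (not the paper's model), behavior cloning can be set up as ordinary supervised learning. The observation size, network, and loss below are placeholder choices; the loss is the "simple choice" whose shortcomings are discussed next.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a flattened 64-d observation and a 2-d continuous action.
OBS_DIM, ACT_DIM = 64, 2

policy = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(), nn.Linear(128, ACT_DIM))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def behavior_cloning_step(obs_batch: torch.Tensor, act_batch: torch.Tensor) -> float:
    """One supervised step of fitting pi(o) -> a to demonstration pairs."""
    pred = policy(obs_batch)
    loss = nn.functional.mse_loss(pred, act_batch)  # mean squared error baseline
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```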

When the actions are continuous, training a model to learn this mapping introduces some interesting challenges. In particular, what loss function should be used? A simple choice is mean squared error, as often used in supervised regression tasks. In an interactive environment, this objective encourages an agent to learn the average of all the behaviors in the dataset.

If the goal of the application is to generate diverse human behaviors, the average might not be very useful. After all, humans are stochastic (they act on whims) and multimodal creatures (different humans might make different decisions). Figure 2 depicts the failure of mean squared error to mimic the true action distribution (marked in yellow) when it is multimodal. It also includes several other popular choices for the loss function when doing behavior cloning.

Figure 2: This toy example (based on an arcade claw game) shows an action space with two continuous action dimensions. Here the demonstration distribution is marked in yellow—it is both multimodal and has correlations between action dimensions. Diffusion models offer a good imitation of the full diversity in the dataset.
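A two-line numerical illustration of this failure mode (a toy stand-in for the situation in Figure 2, not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# For one fixed observation, demonstrators split between two action modes, -1 and +1
# (think "move the claw left" vs. "move the claw right").
actions = np.where(rng.random(1000) < 0.5, -1.0, 1.0)

# The MSE-optimal constant prediction is the sample mean: roughly 0,
# an action that no demonstrator ever took.
print(actions.mean())
```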

Ideally, we’d like our models to learn the full variety of human behaviors. And this is where generative models help. Diffusion models are a specific class of generative model that are both stable to train and easy to sample from. They have been very successful in the text-to-image domain, which shares this one-to-many challenge—a single text caption might be matched by multiple different images.

Our work adapts ideas developed for text-to-image diffusion models to this new paradigm of observation-to-action diffusion. Figure 1 highlights some differences. One obvious point is that the object we are generating is now a low-dimensional action vector rather than an image. This calls for a new design of the denoising network architecture. In image generation, heavy convolutional U-Nets are in vogue, but they are less applicable to low-dimensional vectors. Instead, we designed and tested the three architectures shown in Figure 1.

In observation-to-action models, sampling a single bad action during an episode can throw an agent off course, and hence we were motivated to develop sampling schemes that would more reliably return good action samples (also shown in Figure 1). This problem is less severe in text-to-image models, since users often have the luxury of selecting a single image from among several generated samples and ignoring any bad images. Figure 3 shows an example of this, where a user might cherry-pick their favorite, while ignoring the one with nonsensical text.

Figure 3: Four samples from a text-to-image diffusion model from Bing (note this is not our own work), using the prompt “A cartoon style picture of people playing with arcade claw machine”.
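To make the idea concrete, here is a minimal sketch of an observation-conditioned action diffusion model: a small MLP denoiser (a stand-in for, not a reproduction of, the three architectures in Figure 1) and a vanilla DDPM sampling loop (not the paper's more reliable sampling schemes). All sizes are illustrative.

```python
import torch
import torch.nn as nn

T = 50                                          # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class ActionDenoiser(nn.Module):
    """A small MLP (not a U-Net): predicts the noise added to a low-dimensional
    action vector, conditioned on the observation and the diffusion timestep."""
    def __init__(self, obs_dim=64, act_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )
    def forward(self, obs, noisy_act, t):
        t_feat = t.float().unsqueeze(-1) / T     # crude timestep embedding
        return self.net(torch.cat([obs, noisy_act, t_feat], dim=-1))

@torch.no_grad()
def sample_action(model, obs):
    """Generic DDPM reverse process over a 2-d action, starting from pure noise."""
    a = torch.randn(obs.shape[0], 2)
    for t in reversed(range(T)):
        eps = model(obs, a, torch.full((obs.shape[0],), t))
        a = (a - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            a = a + torch.sqrt(betas[t]) * torch.randn_like(a)
    return a
```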

We tested our diffusion agents in two different environments. The first, a simulated kitchen environment, is a challenging high-dimensional continuous control problem where a robotic arm must manipulate various objects. The demonstration dataset is collected from a variety of humans performing various tasks in differing orders. Hence there is rich multimodality in the dataset.

We found that diffusion agents outperformed baselines in two respects: 1) the diversity of behaviors they learned was broader and closer to the human demonstrations, and 2) their rate of task completion (a proxy for reward) was higher.

The videos below highlight the ability of diffusion to capture multimodal behavior: starting from the same initial conditions, we roll out the diffusion agent eight times, and each time it selects a different sequence of tasks to complete.

The second environment tested was a modern 3D video game, Counter-Strike. We refer interested readers to the paper for results.

In summary, our work has demonstrated how exciting recent advances in generative modeling can be leveraged to build agents that behave in humanlike ways in interactive environments. We’re excited to continue exploring this direction; watch this space for future work.

For more detail on our work, please see our paper and code repo.


Inferring rewards through interaction

Microsoft Research - Thu, 05/04/2023 - 18:00

This research was accepted by the 2023 International Conference on Learning Representations (ICLR), which is dedicated to the advancement of the branch of artificial intelligence generally referred to as deep learning.

Reinforcement learning (RL) hinges on the power of rewards, driving agents—or the models doing the learning—to explore and learn valuable actions. The feedback received through rewards shapes their behavior, culminating in effective policies. Yet, crafting reward functions is a complex, laborious task, even for experts. A more appealing option, particularly for the people ultimately using systems that learn from feedback over time, is an agent that can automatically infer a reward function. The interaction-grounded learning (IGL) paradigm from Microsoft Research enables agents to infer rewards through the very process of interaction, utilizing diverse feedback signals rather than explicit numeric rewards. Although no explicit reward signal is available, the feedback the agent observes depends on a binary latent reward, and from this feedback the agent learns a policy that maximizes the unseen latent reward.

In our paper “Personalized Reward Learning with Interaction-Grounded Learning,” which we’re presenting at the 2023 International Conference on Learning Representations (ICLR), we propose a novel approach to solving the IGL problem: IGL-P. IGL-P is the first IGL strategy for context-dependent feedback, the first use of inverse kinematics as an IGL objective, and the first IGL strategy for more than two latent states. This approach provides a scalable alternative to current personalized agent learning methods, which can require expensive high-dimensional parameter tuning, handcrafted rewards, and/or extensive and costly user studies.

IGL-P in the recommender system setting

IGL-P is particularly useful for interactive learning applications such as recommender systems. Recommender systems help people navigate increasing volumes of content offerings by providing personalized content suggestions. However, without explicit feedback, recommender systems can’t detect for certain whether a person enjoyed the displayed content. To compensate, modern recommender systems treat implicit feedback signals as proxies for user satisfaction. Despite the popularity of this approach, implicit feedback is not the true reward. Even the click-through rate (CTR) metric, the gold standard for recommender systems, is an imperfect reward, and its optimization naturally promotes clickbait.

Interaction-grounded learning (IGL) for the recommender system setting. The recommender system receives features describing a person (x), recommends an item (a), and observes implicit user feedback (y), which is dependent on the latent reward (r) but not r itself, to learn how to better recommend personalized content to the individual.
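As a toy illustration of this setup (not the paper's environment or data), the snippet below simulates one interaction in which the learner observes a context x, an action a, and a feedback signal y whose meaning depends on the (hypothetical) user group, while the latent reward r stays hidden:

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_reward(x: int, a: int) -> int:
    """Hidden from the learner: 1 if this person is satisfied with item a, else 0."""
    return int(a == x % 3)

def implicit_feedback(x: int, r: int) -> str:
    """What the learner actually observes. The same signal can mean different things
    for different user groups, which is exactly what IGL-P must disentangle."""
    if x % 2 == 0:
        return "click" if r else "skip"           # group A clicks only when satisfied
    return "click" if not r else "long_dwell"     # group B clicks on clickbait, dwells when satisfied

# One round of interaction: the agent sees context x, picks action a, and logs (x, a, y).
x = int(rng.integers(0, 10))
a = int(rng.integers(0, 3))
y = implicit_feedback(x, latent_reward(x, a))     # r itself is never revealed
```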

This problem has led to the handcrafting of reward functions with various implicit feedback signals in modern recommender systems. Recommendation algorithms will use hand-defined weights for different user interactions, such as replying to or liking content, when deciding how to recommend content to different people. This fixed weighting of implicit feedback signals might not generalize across a wide variety of people, and thus a personalized learning method can improve user experience by recommending content based on user preferences.
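A sketch of the kind of fixed, hand-engineered reward described here; the signals and weights are invented for illustration:

```python
# Fixed weights over implicit signals, applied identically to every user.
HAND_TUNED_WEIGHTS = {"click": 1.0, "like": 2.0, "reply": 3.0, "share": 4.0, "hide": -5.0}

def handcrafted_reward(interactions: dict[str, int]) -> float:
    """Score one recommendation from counts of observed interactions."""
    return sum(HAND_TUNED_WEIGHTS.get(signal, 0.0) * count
               for signal, count in interactions.items())

handcrafted_reward({"click": 1, "like": 1})   # 3.0 for every user, however they communicate
```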

The choice of reward function is further complicated by differences in how people interact with recommender systems. A growing body of work shows that recommender systems don’t provide consistently good recommendations across demographic groups. Previous research suggests that this inconsistency has its roots in user engagement styles. In other words, a reward function that might work well for one type of user might (and often does) perform poorly for another type of user who interacts with the platform differently. For example, older adults have been found to click on clickbait more often. If the CTR is used as an objective, this group of users will receive significantly more clickbait recommendations than the general public, resulting in higher rates of negative user experiences and leading to user distrust in the recommender system.

IGL-P provides a novel approach to optimize content for latent user satisfaction—that is, rewards that a model doesn’t have direct access to—by learning personalized reward functions for different people rather than requiring a fixed, human-designed reward function. IGL-P learns representations of diverse user communication modalities and how these modalities depend on the underlying user satisfaction. It assumes that people may communicate their feedback in different ways, but that a given person expresses (dis)satisfaction or indifference to all content in the same way. This assumption enables the use of inverse kinematics to recover the latent reward. With the additional assumptions that rewards are rare when the agent acts randomly and that some negatively labeled interactions are directly accessible to the agent, IGL-P recovers the latent reward function and leverages it to learn a personalized policy.

IGL-P successes

The success of IGL-P is demonstrated with experiments using simulations, as well as with real-world production traces. IGL-P is evaluated in three different settings:

  • A simulation using a supervised classification dataset shows that IGL-P can learn to successfully distinguish between different communication modalities.
  • A simulation for online news recommendation based on publicly available data from Facebook users shows that IGL-P leverages insights about different communication modalities to learn better policies and achieve consistent performance among diverse user groups (the dataset, created in 2016, consists of public posts from the official Facebook pages of news companies from 2012 to 2016 and aggregated user reactions; because of this aggregation, identifying information can’t be extracted).
  • A real-world experiment deployed in the Microsoft image recommendation product Windows Spotlight showcases that the proposed method outperforms the hand-engineered reward baseline and succeeds in a practical application serving millions of people.

AI self-play for algorithm design

Microsoft Research - Tue, 05/02/2023 - 18:00

This research was accepted by the 2023 International Conference on Learning Representations (ICLR), which is dedicated to the advancement of the branch of artificial intelligence generally referred to as deep learning.

A self-play pipeline for a language model (LM) to improve itself in a fully automatic manner. First, the LM generates novel puzzles based on a training set of handwritten puzzles. Then, the LM attempts to solve each of these puzzles 100 times. In Step 3, the computer (specifically a Python interpreter) filters the candidate solutions for correctness. Finally, the LM is improved by further training on these verified correct solutions to synthetic puzzles, and the process repeats. This process leads to significant improvements as measured on held-out test puzzles, which were also handwritten.
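The loop described above can be sketched in a few lines. In this sketch, generate_puzzles, propose_solutions, and finetune are hypothetical callables standing in for the language-model and training machinery; only the verification step (running the checker in a Python interpreter) is concrete.

```python
def self_play(lm, seed_puzzles, generate_puzzles, propose_solutions, finetune,
              rounds=3, attempts=100):
    """One possible shape for the self-play loop; the three callables wrap
    whatever LM API and training stack is actually used."""
    for _ in range(rounds):
        verified = []
        for puzzle_src in generate_puzzles(lm, seed_puzzles):            # Step 1: LM writes new puzzles
            for sol_src in propose_solutions(lm, puzzle_src, attempts):  # Step 2: LM attempts each ~100 times
                env = {}
                try:
                    # Step 3: a plain Python interpreter is the judge. A real run would
                    # also sandbox this exec and enforce a per-attempt time limit.
                    exec(puzzle_src, env)                  # defines sat(answer, ...)
                    exec(sol_src, env)                     # defines sol(...)
                    if env["sat"](env["sol"]()):
                        verified.append((puzzle_src, sol_src))
                        break
                except Exception:
                    continue                               # incorrect or crashing attempts are discarded
        lm = finetune(lm, verified)                        # Step 4: train on verified pairs, then repeat
    return lm
```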

Efficient algorithms are crucial for many purposes, including reducing energy consumption in digital devices. While humans outperform AI systems at designing such algorithms, we show how to improve AI programming abilities using self-play, a technique that has helped AI systems dominate in games such as chess and Go.

Designing fast and accurate algorithms requires high-level abstract reasoning, which remains difficult for AI systems. Our approach involves having the AI design and solve its own programming challenges, enabling practice on millions of artificial challenges and exploration of problem types not found in public repositories. We detail our work in a new paper, “Language Models Can Teach Themselves to Program Better,” which we’re presenting at the 2023 International Conference on Learning Representations (ICLR).

The key challenge and our solution

How can an AI system generate novel algorithmic programming problems without knowing the solution?

Our approach uses programming puzzles introduced by Microsoft Research in 2021. These puzzles—known in complexity theory as the class of “NP” decision problems—are easy to check for correctness (no hidden answer key) but often difficult to solve. In this way, they’re like a Rubik’s cube, where it’s trivial to recognize a solution but hard to find one. Three examples are illustrated below: a novel string challenge and the classic Towers of Hanoi and factoring problems. Programming puzzles can range from trivial to major open problems in algorithms and mathematics, and solving them requires all the major algorithmic techniques, such as dynamic programming and greedy algorithms. However, each puzzle checks just a single input, unlike standard algorithm problems, which require a solution that scales efficiently to all inputs and are therefore much harder to test.
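Concretely, a puzzle is just a Python function that returns True when handed a valid answer, and a solution is any program that produces such an answer. A toy example in that style (the names sat and sol follow the convention of the open-source puzzles dataset; the puzzle itself is made up for illustration):

```python
def sat(x: list[int]) -> bool:
    """Puzzle: find five distinct, increasing integers that sum to 100."""
    return len(x) == 5 and sorted(set(x)) == x and sum(x) == 100

def sol() -> list[int]:
    """One of many valid answers."""
    return [18, 19, 20, 21, 22]

assert sat(sol())  # verification is a single, fast function call
```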

Programming puzzle examples

Can computers generate valuable, novel challenges?

Surprisingly, language models such as Codex and GPT-Neo can indeed create novel puzzles when prompted to generate “more like these” from a set of example puzzles without solutions. You may wonder what makes a challenge good. Rather than focusing on how interesting a challenge is, we prioritize how useful it is. Our evaluation has the language model generate, solve, and train on its own puzzles; then we assess whether that training improved its performance on a hidden test set of puzzles. (By now, solutions to our puzzles may have leaked into AI training sets, but with the help of champion competitive programmers, we have created a secret test set that remains unpublished and can be used for uncontaminated evaluation.) In our experiments with small- to medium-sized language models—with a few billion parameters, far fewer than the latest GPT models—self-training more than doubled success rates.

Risks and limitations

This research was conducted prior to GPT-4’s release. While we believe similar techniques may help GPT-4 self-improve in programming, this is an active area of research as we better understand the capabilities and limitations of these models, as well as their appropriate use and the potential consequences of increased programming capabilities. One key limitation of puzzles is that solutions might only work for the specific instance provided. However, this limitation also serves as an advantage in terms of human-AI alignment. Unlike other AI challenges with inherent ambiguities that could lead to unintended consequences if objectives are imprecisely defined (for example, an AI-designed math-tutor app that becomes unintentionally addictive), our programming puzzles encompass exactly those standalone problems that can be perfectly verified for meeting a precise objective. As there remains a risk that any work that substantially advances AI programming capabilities can be used in other systems and with unintended consequences, we continue to encourage taking great care before deploying systems with artificially generated code.

Examples of programming puzzles for AI self-play

Each puzzle is specified by a short Python program that checks a possible answer. Each solution is a Python program that outputs an answer in a limited amount of time.

Example 1: Towers of Hanoi

The goal of the well-known Towers of Hanoi puzzle is to move all the disks from the first tower to the last tower, one by one, without ever putting a bigger disk on top of a smaller disk. It’s easy to check that a solution is correct but hard to find a correct solution. Even though the number of steps required to solve it is exponential in the number of disks, there’s a solution in the form of a short program that is often used to teach recursion. The clever solution program that outputs the moves is easier to find than the sequence of moves itself. Here are the programming puzzle and solution:
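(The original post shows the code as a figure; the sketch below reconstructs it in the sat/solution style of the public puzzle set, so details may differ from the exact code.)

```python
def sat(moves: list, num_disks: int = 8) -> bool:
    """Check a list of [source, target] peg moves for Towers of Hanoi with num_disks disks."""
    state = [list(range(num_disks, 0, -1)), [], []]   # peg 0 holds all disks, largest at the bottom
    for src, dst in moves:
        if not state[src]:
            return False                              # nothing to move from this peg
        disk = state[src].pop()
        if state[dst] and state[dst][-1] < disk:
            return False                              # never put a bigger disk on a smaller one
        state[dst].append(disk)
    return state[2] == list(range(num_disks, 0, -1))  # all disks end up on the last peg


def sol(num_disks: int = 8) -> list:
    """The classic recursion: move n-1 disks aside, move the largest, then move the n-1 back on top."""
    def hanoi(n, src, aux, dst):
        if n == 0:
            return []
        return hanoi(n - 1, src, dst, aux) + [[src, dst]] + hanoi(n - 1, aux, src, dst)
    return hanoi(num_disks, 0, 1, 2)


assert sat(sol())
```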

Example 2: String challenge
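(A reconstruction of the puzzle and solution shown in the original post, in the same sat/solution style; the post's exact code may differ.)

```python
def sat(s: str) -> bool:
    """The string must contain exactly 1,000 'A' characters, with no two A's adjacent."""
    return s.count("A") == 1000 and "AA" not in s


def sol() -> str:
    """A compact solution: alternate 'A' and 'B' a thousand times."""
    return "AB" * 1000


assert sat(sol())
```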

This concise puzzle perplexes AI systems, although humans find it simple. The puzzle requires a string with 1,000 “A” characters but no two consecutive A’s. Most programmers devise solutions like “ABABAB …” (1,000 times), generated by the compact Python solution above. In contrast, AI systems usually need multiple attempts. Fortunately, AI systems can easily verify their attempts by running the checking program. This puzzle exemplifies a straightforward, unique problem specifically created for our dataset.

Example 3: Integer factorization
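(A sketch in the same style; the number below is chosen for illustration and is not taken from the dataset.)

```python
def sat(d: int, n: int = 123456789) -> bool:
    """d must be a nontrivial divisor of n."""
    return 1 < d < n and n % d == 0


def sol(n: int = 123456789) -> int:
    """Trial division: fine for a small n, hopeless for a 309-digit RSA challenge number."""
    return next(d for d in range(2, n) if n % d == 0)


assert sat(sol())
```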

Another classic example is integer factorization. The puzzle above requires a factor of a relatively small number, so it can be solved quickly with a simple loop. However, our dataset also contains factoring challenges like the 309-digit RSA Factoring Challenge number, which was published in 1991 along with a $100,000 prize. The 309-digit number was never factored, and the challenge has since ended.


Research Focus: Week of April 24, 2023

Microsoft Research - Wed, 04/26/2023 - 18:00

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

AWARD: Microsoft researcher Kalai awarded 2022 ACM Prize in Computing

Yael Tauman Kalai, a senior principal researcher at Microsoft Research, has been awarded the 2022 ACM Prize in Computing. Kalai was recognized for breakthroughs in verifiable delegation of computation and fundamental contributions to cryptography. According to the award announcement, “Kalai’s contributions have helped shape modern cryptographic practices and provided a strong foundation for further advancements.”

The ACM Prize in Computing recognizes early-to-mid-career computer scientists whose research contributions have fundamental impact and broad implications.

Among the multiple accomplishments cited for the award, Kalai has developed methods for producing succinct proofs that certify the correctness of any computation. This method enables a weak device to offload any computation to a stronger device in a way that allows the results to be efficiently checked for correctness. Such succinct proofs have been used by blockchain companies to certify transaction validity, thereby overcoming key obstacles in blockchain scalability and enabling faster and more reliable transactions.

Kalai was also cited for her breakthrough work on the security of the “Fiat-Shamir paradigm,” a general technique for eliminating interaction from interactive protocols. This paradigm is extensively utilized in real-world applications, including the most prevalent digital signature scheme (ECDSA), which is used by all iOS and Android mobile devices.

Award announcement

NEW RESEARCH: Empowering Azure Storage with RDMA

High performance and highly reliable storage are fundamental requirements of public clouds. Given the wide adoption of disaggregated storage in the cloud, networking is essential for enabling high performance and high reliability. Microsoft’s Azure cloud service uses remote direct memory access (RDMA) as its transport and aims to enable it for both storage frontend traffic (between compute virtual machines and storage clusters) and backend traffic (within a storage cluster) to fully realize its benefits. As compute and storage clusters may be located in different datacenters within an Azure region, RDMA needs to be supported at regional scale.

In a new paper, Empowering Azure Storage with RDMA, Microsoft Azure and Microsoft Research report on their intra-region RDMA deployment to support storage workloads in Azure. The high complexity and heterogeneity of Azure infrastructure create challenges, such as interoperability between different types of RDMA network interface cards. Several changes were made to the network infrastructure to address these challenges. Today, around 70% of traffic in Azure is RDMA, and intra-region RDMA is supported in all Azure public regions. This helps achieve significant disk I/O performance improvements and CPU core savings.

Read the paper

NEW RESEARCH: LIDA: Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models

Systems that support users in the automatic creation of visualizations must address several subtasks—understand the semantics of data; enumerate relevant visualization goals; and generate visualization specifications. In a new paper, LIDA: Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models, researchers from Microsoft pose visualization generation as a multi-stage generation problem and argue that well-orchestrated pipelines based on large language models (LLMs) and image generation models (IGMs) are well suited to addressing these tasks.

LIDA is a novel tool for generating grammar-agnostic visualizations and infographics. It comprises four modules—a summarizer that converts data into a rich but compact natural language summary; a goal explorer that enumerates visualization goals given the data; a visgenerator that generates, evaluates, refines, executes, and filters visualization code; and an infographer module that yields data-faithful stylized graphics using IGMs. LIDA provides a Python API and a hybrid user interface (direct manipulation and multilingual natural language) for interactive chart, infographic, and data story generation.

Read the paper
Explore the beta

NEW RELEASE: Announcing DeepSpeed-Chat: Easy, fast, affordable RLHF Training of ChatGPT-like models at all scales

Microsoft’s AI at Scale initiative has released DeepSpeed-Chat, an easy, fast, and low-cost open-source solution for reinforcement learning from human feedback (RLHF) training that can create high-quality ChatGPT-like models ranging in size from a few to hundreds of billions of parameters. DeepSpeed-Chat provides a complete RLHF training experience with a single click. It combines the prowess of DeepSpeed-Inference and DeepSpeed-Training to offer 15x faster throughput than the previous state of the art, while also supporting model sizes that are up to 8x larger on the same hardware. With DeepSpeed-Chat, practitioners can train an OPT-13B ChatGPT-like model in under 1.5 hours or a massive 175B model in a day on a modest GPU cluster. For those who don’t have a GPU cluster handy, DeepSpeed-Chat enables practitioners to train a model of up to 13B parameters on a single GPU, or to train on Azure for about $300.

Download the code

NEWS: Gov4git: Decentralized community governance to fuel open-source projects

Communal open-source projects have helped build countless applications for sourcing and sharing information like bug details and scientific data, as well as decentralized planning, design, and policymaking.

But the lack of a standardized and secure governance solution prevents many open-source projects from getting started—and holds them back when they get too big to be managed through ad-hoc methods. These small communities often resort to external mechanisms to manage their projects and protect them from malicious actors.

Microsoft Research and Protocol Labs, an open-source R&D company, are collaborating to develop Gov4git, a decentralized, git-native protocol with configurable governance rules to help launch more open-source projects and communities and support their growth.

Gov4git comes with many of the transparency, decentralization, and security benefits of blockchains while also harnessing the power of formal governance to avoid costly approaches to validation and dispute resolution.

Git is the worldwide standard for version control and management of collaborative software development projects. Gov4git is designed as a secure and cost-effective framework that can be tailored to the specific needs of any community and deployed by non-technical users wherever git is available. Gov4git can strengthen the security of such communities against the risks of malicious actors posing as collaborators with the intent to negatively impact community maintenance.

    Learn more:

    The post Research Focus: Week of April 24, 2023 appeared first on Microsoft Research.

Categories: Microsoft

Research Focus: Week of April 24, 2023

Microsoft Research - Wed, 04/26/2023 - 18:00

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

In this article
  1. Microsoft researcher Kalai awarded 2022 ACM Prize in Computing
  2. Empowering Azure Storage with RDMA
  3. LIDA: Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models
  4. Announcing DeepSpeed-Chat: Easy, fast, affordable RLHF Training of ChatGPT-like models at all scales
  5. Gov4git: Decentralized community governance to fuel open-source projects
  6. Learn more:
  7. AWARD Microsoft researcher Kalai awarded 2022 ACM Prize in Computing

    Yael Tauman Kalai, a senior principal researcher at Microsoft Research, has been awarded the 2022 ACM Prize in Computing. Kalai was recognized for breakthroughs in verifiable delegation of computation and fundamental contributions to cryptography. According to the award announcement, “Kalai’s contributions have helped shape modern cryptographic practices and provided a strong foundation for further advancements.”

    The ACM Prize in Computing recognizes early-to-mid-career computer scientists whose research contributions have fundamental impact and broad implications.

    SPOTLIGHT: AI focus area

    AI and Microsoft Research

    Learn more about the breadth of AI research at Microsoft

    Learn more

    Among the multiple accomplishments cited for the award, Kalai has developed methods for producing succinct proofs that certify the correctness of any computation. This method enables a weak device to offload any computation to a stronger device in a way that enables the results to be efficiently checked for correctness. Such succinct proofs have been used by blockchain companies to certify transaction validity, thereby overcoming key obstacles in blockchain scalability and enabling faster and more reliable transactions.

    Kalai was also cited for her breakthrough work on the security of the “Fiat-Shamir paradigm,” a general technique for eliminating interaction from interactive protocols. This paradigm is extensively utilized in real-world applications, including the most prevalent digital signature scheme (ECDSA), which is used by all iOS and Android mobile devices.

    Award announcement NEW RESEARCH Empowering Azure Storage with RDMA

    High performance and highly reliable storage are fundamental requirements of public clouds. Given the wide adoption of disaggregated storage in the cloud, networking is essential for enabling high performance and high reliability. Microsoft’s Azure cloud service uses remote direct memory access (RDMA) as its transport and aims to enable it for both storage frontend traffic (between compute virtual machines and storage clusters) and backend traffic (within a storage cluster) to fully realize its benefits. As compute and storage clusters may be located in different datacenters within an Azure region, RDMA needs to be supported at regional scale.

    In a new paper: Empowering Azure Storage with RDMA, Microsoft Azure and Microsoft Research report on their intra-region RDMA deployment to support storage workloads in Azure. The high complexity and heterogeneity of Azure infrastructure creates challenges, such as the problem of interoperability between different types of RDMA network interface cards. Several changes were made to the network infrastructure to address these challenges. Today, around 70% of traffic in Azure is RDMA and intra-region RDMA is supported in all Azure public regions. This helps achieve significant disk I/O performance improvements and CPU core savings.

    Read the paper NEW RESEARCH LIDA: Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models


Research Focus: Week of April 24, 2023

Microsoft Research - Wed, 04/26/2023 - 18:00

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

In this article
  1. Microsoft researcher Kalai awarded 2022 ACM Prize in Computing
  2. Empowering Azure Storage with RDMA
  3. LIDA: Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models
  4. Announcing DeepSpeed-Chat: Easy, fast, affordable RLHF Training of ChatGPT-like models at all scales
  5. Gov4git: Decentralized community governance to fuel open-source projects
  6. Learn more:

    AWARD
    Microsoft researcher Kalai awarded 2022 ACM Prize in Computing

    Yael Tauman Kalai, a senior principal researcher at Microsoft Research, has been awarded the 2022 ACM Prize in Computing. Kalai was recognized for breakthroughs in verifiable delegation of computation and fundamental contributions to cryptography. According to the award announcement, “Kalai’s contributions have helped shape modern cryptographic practices and provided a strong foundation for further advancements.”

    The ACM Prize in Computing recognizes early-to-mid-career computer scientists whose research contributions have fundamental impact and broad implications.

    Spotlight: Microsoft Research Podcast

    AI Frontiers: The Physics of AI with Sébastien Bubeck

    What is intelligence? How does it emerge and how do we measure it? Ashley Llorens and machine learning theorist Sébastien Bubeck discuss accelerating progress in large-scale AI and early experiments with GPT-4.

    Listen now

    Among the multiple accomplishments cited for the award, Kalai has developed methods for producing succinct proofs that certify the correctness of any computation. This method enables a weak device to offload any computation to a stronger device in a way that enables the results to be efficiently checked for correctness. Such succinct proofs have been used by blockchain companies to certify transaction validity, thereby overcoming key obstacles in blockchain scalability and enabling faster and more reliable transactions.

    Kalai was also cited for her breakthrough work on the security of the “Fiat-Shamir paradigm,” a general technique for eliminating interaction from interactive protocols. This paradigm is extensively utilized in real-world applications, including the most prevalent digital signature scheme (ECDSA), which is used by all iOS and Android mobile devices.

    Award announcement

    NEW RESEARCH
    Empowering Azure Storage with RDMA

    High performance and highly reliable storage are fundamental requirements of public clouds. Given the wide adoption of disaggregated storage in the cloud, networking is essential for enabling high performance and high reliability. Microsoft’s Azure cloud service uses remote direct memory access (RDMA) as its transport and aims to enable it for both storage frontend traffic (between compute virtual machines and storage clusters) and backend traffic (within a storage cluster) to fully realize its benefits. As compute and storage clusters may be located in different datacenters within an Azure region, RDMA needs to be supported at regional scale.

    In a new paper: Empowering Azure Storage with RDMA, Microsoft Azure and Microsoft Research report on their intra-region RDMA deployment to support storage workloads in Azure. The high complexity and heterogeneity of Azure infrastructure creates challenges, such as the problem of interoperability between different types of RDMA network interface cards. Several changes were made to the network infrastructure to address these challenges. Today, around 70% of traffic in Azure is RDMA and intra-region RDMA is supported in all Azure public regions. This helps achieve significant disk I/O performance improvements and CPU core savings.

    Read the paper

    NEW RESEARCH
    LIDA: Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models

    Systems that support users in the automatic creation of visualizations must address several subtasks: understanding the semantics of the data, enumerating relevant visualization goals, and generating visualization specifications. In a new paper: LIDA: Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models, researchers from Microsoft pose visualization generation as a multi-stage generation problem and argue that well-orchestrated pipelines based on large language models (LLMs) and image generation models (IGMs) are well suited to addressing these tasks.

    LIDA is a novel tool for generating grammar-agnostic visualizations and infographics. It comprises four modules: a summarizer that converts data into a rich but compact natural language summary; a goal explorer that enumerates visualization goals given the data; a visgenerator that generates, evaluates, refines, executes, and filters visualization code; and an infographer module that yields data-faithful stylized graphics using IGMs. LIDA provides a Python API and a hybrid user interface (direct manipulation and multilingual natural language) for interactive chart, infographic, and data story generation.
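
    To make the multi-stage design concrete, here is a hypothetical Python sketch of such a summarize-then-plan-then-generate pipeline. The function names and logic are illustrative stand-ins for the summarizer, goal explorer, and visgenerator stages (the infographer stage is omitted); they are not LIDA’s actual API, which is documented in the paper and repository.

```python
# Hypothetical sketch of a summarize -> goals -> visualization-code pipeline.
# These names and implementations are illustrative, NOT LIDA's actual API.
import pandas as pd

def summarize(df: pd.DataFrame) -> str:
    """Summarizer stage: compact natural-language summary of the data."""
    cols = ", ".join(f"{name} ({dtype})" for name, dtype in df.dtypes.astype(str).items())
    return f"Table with {len(df)} rows; columns: {cols}."

def explore_goals(summary: str, n: int = 3) -> list:
    """Goal-explorer stage: enumerate candidate visualization goals.
    A real system would prompt an LLM with the summary; here we return stubs."""
    return [f"Goal {i + 1} derived from: {summary}" for i in range(n)]

def generate_visualization_code(goal: str) -> str:
    """Visgenerator stage: emit plotting code for a goal (placeholder only)."""
    return f"# plotting code that addresses: {goal}"

df = pd.DataFrame({"year": [2020, 2021, 2022], "sales": [10, 14, 9]})
summary = summarize(df)
specs = [generate_visualization_code(g) for g in explore_goals(summary)]
print(summary)
print(specs[0])
```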

    Read the paper
    Explore the beta

    NEW RELEASE
    Announcing DeepSpeed-Chat: Easy, fast, affordable RLHF Training of ChatGPT-like models at all scales

    Microsoft’s AI at Scale initiative has released DeepSpeed-Chat, an easy, fast, and low-cost open-source solution for reinforcement learning from human feedback (RLHF) training that can create high-quality ChatGPT-like models ranging in size from a few billion to hundreds of billions of parameters. DeepSpeed-Chat provides a complete RLHF training experience with a single click. It combines the prowess of DeepSpeed-Inference and DeepSpeed-Training to offer 15x faster throughput than the previous state of the art, while also supporting model sizes that are up to 8x larger on the same hardware. With DeepSpeed-Chat, practitioners can train an OPT-13B ChatGPT-like model in under 1.5 hours or a massive 175B model in a day on a modest GPU cluster. For those who don’t have a GPU cluster handy, DeepSpeed-Chat makes it possible to train a model of up to 13B parameters on a single GPU, or to train on Azure Cloud for $300.

    Download the code

    NEWS
    Gov4git: Decentralized community governance to fuel open-source projects

    Communal open-source projects have helped build countless applications for sourcing and sharing information like bug details and scientific data, as well as decentralized planning, design and policymaking. 

    But the lack of a standardized and secure governance solution prevents many open-source projects from getting started—and holds them back when they get too big to be managed through ad-hoc methods. These small communities often resort to external mechanisms to manage their projects and protect them from malicious actors.

    Microsoft Research and Protocol Labs, an open-source R&D company, are collaborating to develop Gov4git, a decentralized, git-native protocol with configurable governance rules to help launch more open-source projects and communities and support their growth.

    Gov4git comes with many of the transparency, decentralization, and security benefits of blockchains while also harnessing the power of formal governance to avoid costly approaches to validation and dispute resolution. 

    Git is the worldwide standard for version control and management of collaborative software development projects. Gov4git is designed as a secure and cost-effective framework that can be tailored to the specific needs of any community and deployed by non-technical users anywhere git is available. Gov4git can strengthen the security of such communities against malicious actors who pose as collaborators in order to disrupt community maintenance.

    Learn more:

    The post Research Focus: Week of April 24, 2023 appeared first on Microsoft Research.

Categories: Microsoft

TLA+ Foundation aims to bring math-based software modeling to the mainstream

Microsoft Research - Mon, 04/24/2023 - 18:00

TLA+ is a high-level, open-source, math-based language for modeling computer programs and systems, especially concurrent and distributed ones. It comes with tools to help eliminate fundamental design errors, which are hard to find and expensive to fix once they have been embedded in code or hardware.

The TLA language was first published in 1993 by the pioneering computer scientist Leslie Lamport, now a distinguished scientist with Microsoft Research. After years of Lamport’s stewardship and Microsoft’s support, TLA+ has found a new home. The TLA+ Foundation is launching this month as part of the Linux Foundation, with Microsoft, Amazon Web Services (AWS), and Oracle serving as founding members to help further refine the tools and spur commercial usage and additional research. 

“The foundation will help spread that work among more hands,” said Lamport. 

TLA+ is just one piece of Lamport’s impressive portfolio. He invented the document preparation system LaTeX and won the 2013 Turing Award for his work to clarify distributed systems, in which several autonomous computers communicate with each other by passing messages. 

Spotlight: Microsoft Research Podcast

AI Frontiers: AI for health and the future of research with Peter Lee

Peter Lee, head of Microsoft Research, and Ashley Llorens, AI scientist and engineer, discuss the future of AI research and the potential for GPT-4 as a medical copilot.

Listen now

Along the way he developed an idea to help programmers build systems more effectively by using algorithmic models to specify how the code should work. It’s the same idea as creating blueprints to guide the construction of a bridge. TLA+ (short for Temporal Logic of Actions) comes with a model checker that exhaustively explores a specification’s possible behaviors and checks whether they satisfy the properties the system is supposed to have.

“When programmers write systems, they should start by defining what they are supposed to do and check that their work will do it. That’s a better way than just sitting down to write the code, based on some vague outline,” Lamport said. 

For simple tasks, a trial-and-error approach may be fine. But for more complicated projects, or those where mistakes are unacceptable, a systematic approach makes more sense.

The challenge with writing large programs isn’t necessarily their size, it’s their complexity. They are often distributed across multiple systems and involve multiple processes that need to interact. The number of possible executions becomes astronomical. To reason about and check such a system, it helps to have a mathematical way to think about it ahead of time. Yet engineers often balk at the idea. 
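
To make that point concrete, here is a small, self-contained Python toy, under our own simplifying assumptions; it is not TLA+ or the TLC model checker, just an illustration of what exhaustively checking executions against an intended property looks like. Two processes each perform a non-atomic read-then-write increment of a shared counter, and enumerating every interleaving exposes schedules that break the invariant that the counter ends at 2.

```python
from itertools import permutations

# Toy illustration only (not TLA+ or TLC): two processes each perform a
# non-atomic "read counter, then write counter + 1". Checking every
# interleaving of their steps finds schedules that violate the invariant
# "the counter ends at 2".
def run(schedule):
    counter = 0
    local = {0: 0, 1: 0}   # each process's locally read value
    pc = {0: 0, 1: 0}      # per-process step: 0 = read, 1 = write
    for p in schedule:
        if pc[p] == 0:
            local[p] = counter          # read
        else:
            counter = local[p] + 1      # write (may overwrite the other update)
        pc[p] += 1
    return counter

schedules = set(permutations([0, 0, 1, 1]))      # all interleavings of the 4 steps
violations = [s for s in schedules if run(s) != 2]
print(f"{len(violations)} of {len(schedules)} interleavings violate the invariant")
# e.g. the schedule (0, 1, 0, 1) loses one update and ends with counter == 1
```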

“The difficulty that engineers have is more a fear of math than the math itself. The math, as math goes, is very basic,” Lamport said, though it’s worth noting he holds a PhD in mathematics. “I find that engineers, after using TLA+, understand the benefit.”

In fact, TLA+ has been adopted for industrial use at semiconductor makers, companies that build distributed and database systems, other tech companies, and in more mainstream applications like payment systems in retail stores. It’s likely that some applications aren’t made public—most companies don’t publicly discuss their engineering process or proprietary technology.

That’s where the foundation comes in. A formal system for contributing to the tools and defining their future direction may spawn additional collaboration among engineers and facilitate commercial adoption. The foundation will create a steering committee, similar to other panels that look after public domain programming languages like C or Java.

“I would hope that the new stewards make more subtractions than additions to the language, to remove some things that aren’t needed,” Lamport said. 

Now 82 years old and nearing retirement, Lamport also hopes the foundation gets TLA+ closer to the mainstream of industrial and academic discussion.

“TLA+ is never going to be as popular as Java. And I’d be happy if someone else made it better at helping engineers think more mathematically,” Lamport says. “The ultimate goal is to get engineers to think rigorously at a higher level about what they are doing.”

The post TLA+ Foundation aims to bring math-based software modeling to the mainstream appeared first on Microsoft Research.

Categories: Microsoft

Unifying learning from preferences and demonstration via a ranking game for imitation learning

Microsoft Research - Wed, 04/19/2023 - 18:08

For many people, opening door handles or moving a pen between their fingers is a movement that happens multiple times a day, often without much thought. For a robot, however, these movements aren’t always so easy.

In reinforcement learning, robots learn to perform tasks by exploring their environments, receiving signals along the way that indicate how good their behavior is compared to the desired outcome, or state. For the described movements, for example, we can specify a reward function that is +1 when the door is successfully opened or the pen is at the desired orientation and 0 otherwise. But this makes the learning task complicated for the robot since it has to try out various motions before stumbling on the successful outcome, or a reward of +1.

The imitation learning (IL) paradigm was introduced to mitigate the amount of trial and error. In IL, the robot is provided with demonstrations of a given task performed by an expert from which it can try to learn the task and possibly gain information about the expert’s reward function, or the expert’s intent, similar to how people pick up various skills. Yet, learning remains difficult in instances where we only have access to the change enacted by the expert in the world, known as the expert observation, and not the precise actions the expert took to achieve the change. Another difficulty the robot faces is that even if it sees infinite expert demonstrations, it can’t fully reason about the intent of the expert—that is, compare whether one of its own learned behaviors is closer to the expert’s than another behavior—as it only knows the best behavior and has no notion of ordering over other behaviors.

SPOTLIGHT: AI focus area

AI and Microsoft Research

Learn more about the breadth of AI research at Microsoft

Learn more

In our paper “A Ranking Game for Imitation Learning,” being presented at Transactions on Machine Learning Research 2023 (TMLR), we propose a simple and intuitive framework, \(\texttt{rank-game}\), that unifies learning from expert demonstrations and preferences by generalizing a key approach to imitation learning. Giving robots the ability to learn from preferences, obtained by having an expert rank which behavior aligns better with their objectives, allows the learning of more informative reward functions. Our approach, which enabled us to propose a new objective for training over behavior preferences, makes the learning process easier for a robot and achieves state-of-the-art results in imitation learning. It also enabled the training of a robot that can solve the tasks of opening a door and moving a pen between its fingers in simulation, a first in imitation learning with expert observations alone. The incorporation of preferences has also seen success in language modeling, where chatbots such as ChatGPT are improving themselves by learning a reward function inferred via preferences over several samples of model responses in addition to learning from desired human conversational data.

Robotics has found a place in controlled environments where the tasks at hand are well-defined and repeatable, such as on a factory floor. Our framework has the potential to help enable robot learning of tasks in more dynamic environments, such as helping people with daily chores around the home.

With \(\texttt{rank-game}\), which combines learning from preferences and demonstrations via a two-player ranking-based game, robots in simulation were trained to manipulate a pen with a dexterous hand (left) and open a door with a parallel jaw gripper (right). The successful completion of these tasks marked a first in imitation learning with expert observations alone.

A ranking game for imitation learning

Inverse reinforcement learning (IRL) is a popular and effective method for imitation learning. IRL learns by inferring the reward function, also referred to as the intent of the expert, and a policy, which specifies what actions the agent—or, in our case, the robot—should take in a given state to successfully mimic the expert.

Notation: We use \(\pi\) and \(\pi^E\) to denote the policy of the agent and the expert, respectively, and \(R_{gt}\) to be the reward function of the expert, which is unknown to the agent/robot. \(\rho^\pi\) denotes the state-action/state visitation distribution of policy \(\pi\) in the environment—the probabilistic collection of states the policy visits in the environment. We use \(J(R;\pi)\) to denote the \(\textit{cumulative reward}\), or the performance of policy \(\pi\) under a reward function \(R\). We assume policy \(\pi\) belongs to function class \(\Pi\) and reward function R belongs to function class \(\mathcal{R}\).

The goal of imitation learning is to make the agent have the same performance as the expert under the expert’s unknown reward function \(R_{gt}\). The classical IRL formulation tackles this by minimizing the imitation gap under a reward function that makes the performance gap the largest. We denote this framework by \(\texttt{imit-game}\) and write it below formally:

\(\texttt{imit-game}(\pi,\pi^E): \text{argmin}_{\pi\in\Pi}\text{max}_{R\in\mathcal{R}} [\mathbb{E}_{\rho^E(s,a)}[R(s,a)]-\mathbb{E}_{\rho^\pi(s,a)}[R(s,a)]]\)

Simply stated, the \(\texttt{imit-game}\) tries to find a policy that has the lowest worst-case performance difference with the expert policy. This classical IRL formulation learns from expert demonstrations but provides no mechanism to incorporate learning from preferences. In our work, we ask, does IRL really need to consider the worst-case performance difference? We find that relaxing this requirement allows us to incorporate preferences.

Our proposed method treats imitation as a two-player ranking-based game between a policy and a reward. In this game, the reward agent learns to map more preferred behaviors to a higher total reward for each of the pairwise preferences, while the policy agent learns to maximize the performance on this reward function by interacting with the environment. Contrary to the classical IRL framework, the reward function now has to get only the rankings correct and not optimize for the worst case (see Figure 1).

Figure 1: The proposed \(\texttt{rank-game}\) method treats imitation learning as a two-player ranking-based game between a policy and a reward. The policy agent maximizes the reward function by interacting with the environment. The reward agent satisfies a set of behavior rankings obtained from various sources: generated by the policy agent, automatically generated via data augmentation, or expert-annotated rankings obtained from a human or offline dataset.

To incorporate preferences, we need to quantify the behaviors in order to compare them. In this work, we choose the behaviors (\(\rho\)) to be the state-action or state-only visitation distribution of the agent. A ranking between behaviors is used to specify that the expert would prefer one behavior over the other. A reward function that satisfies the behavior rankings ensures that the average return under a lower-ranked behavior is smaller than the higher-ranked behavior. More formally, the ranking game is defined as a game where the policy agent \(\pi\) maximizes the expected return \(J(R;\pi)\) of the policy under reward function \(R\) when deployed in the environment. The reward player takes the dataset of pairwise rankings \(D^p\) (rankings are denoted as \(\rho^i\preceq\rho^j\)) as an input and attempts to learn a reward function that satisfies those rankings using a ranking loss (denoted by \(L(D^p;R)\)).

\(\underbrace{\text{argmax}_{\pi\in\Pi}J(R;\pi)}_{\text{Policy Agent}}~~~~~~~~~~~~~~~\underbrace{\text{argmin}_{R\in\mathcal{R}}L(D^p;R)}_{\text{Reward Agent}}\)

The ranking loss induces a reward function \(R\) that attempts to satisfy each pairwise preference in the dataset as follows:

\(\mathbb{E}_{\rho^i}[R(s,a)]\le\mathbb{E}_{\rho^j}[R(s,a)]~~,~~\forall \rho^i\preceq\rho^j \in D^p\)
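
For intuition, this condition is just an inequality between expected rewards under the two visitation distributions. The tiny NumPy check below, with made-up numbers, shows one reward vector that satisfies a single preference pair.

```python
import numpy as np

# Illustrative numbers only: a reward over three state-action pairs and two
# visitation distributions, with rho_i ranked below rho_j. The reward satisfies
# the pair if the expected reward under rho_i does not exceed that under rho_j.
R = np.array([0.0, 1.0, 4.0])
rho_i = np.array([0.7, 0.2, 0.1])   # less preferred behavior
rho_j = np.array([0.1, 0.2, 0.7])   # more preferred behavior
print(rho_i @ R, rho_j @ R, rho_i @ R <= rho_j @ R)   # approx. 0.6, 3.0, True
```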

Generalizing prior imitation learning approaches with \(\texttt{rank-game}\)

The \(\texttt{rank-game}\) framework neatly encapsulates both prior work in IRL and prior work in learning from preferences. First, let’s see how classical IRL is a part of this framework. Recall that the classical IRL/\(\texttt{imit-game}\) optimization can be written as:

\(\text{argmin}_{\pi\in\Pi}\text{max}_{R\in\mathcal{R}} [\mathbb{E}_{\rho^E(s,a)}[R(s,a)]-\mathbb{E}_{\rho^\pi(s,a)}[R(s,a)]]\)

The inner optimization learns a reward function that ensures that the return gap under the reward function is maximized between the current policy’s behavior and the expert behavior. Thus, \(\texttt{imit-game}\) can be seen to be a special case of \(\texttt{rank-game}\) with: (1) a ranking dataset that prefers expert behavior more than the current agent behavior and (2) a form of ranking loss that maximizes the performance gap (termed as \(\textit{supremum loss}\)). A number of prior methods in the imitation learning domain can be understood as special cases of \(\texttt{rank-game}\) under various ranking losses, classes of reward functions, and abilities to incorporate preferences (see Figure 2).

Figure 2: Previous methods that learn from expert demonstrations or preferences form a special case of \(\texttt{rank-game}\) under a specific choice of ranking loss and a reward function class. Also noted in the table is whether a method enables learning from demonstration (LfD)—that is, learning from both expert states and actions—or learning from observations (LfO), where an agent learns from expert states alone.

Setting up the ranking game

To develop a framework that successfully combines learning from demonstrations and learning from preferences, we addressed several questions:

  1. What is the ranking loss function that allows for the reward to satisfy the preferences in the dataset?
  2. Where do we get the dataset of pairwise preferences?
  3. How can we effectively optimize this two-player game?
Step 1: A new ranking loss function for reward learning

Our proposed framework requires learning a reward function such that the rankings in the dataset are satisfied. While several loss functions exist in prior literature to enable this, such as Luce Shepard, Lovász-Bregman divergences, and the earlier discussed supremum loss, we introduce a new loss function:

\(L_k(\mathcal{D}^p;R) = \mathbb{E}_{(\rho^{\pi^i},\rho^{\pi^j})\sim \mathcal{D}^p} \Big[\mathbb{E}_{s,a\sim\rho^{\pi^i}}{[(R(s,a)-0)^2]} + \mathbb{E}_{s,a\sim\rho^{\pi^j}}{[(R(s,a)-k)^2]}\Big]\)

The loss function is simple and intuitive: For all the preference pairs in the dataset, the less preferred behavior is regressed to a return of 0 and the more preferred behavior is regressed to a return of a user-defined parameter \(k\). This loss function allows us to learn a reward function with user-defined scale \(k\), which plays an important role in enabling better policy optimization; it’s principled and facilitates near-optimal imitation learning; and by design, it allows us to incorporate preferences.
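
As a concrete illustration, here is a minimal PyTorch-style sketch of this regression-based loss. Because the regression is applied to per-sample rewards drawn from each behavior, their expectations are pushed toward 0 and \(k\) respectively. The names (rank_game_loss, reward_net, feat_dim) and the linear reward model are our own illustrative assumptions, not the paper’s implementation.

```python
import torch

def rank_game_loss(r_low, r_high, k=10.0):
    """Regression-based ranking loss: per-sample rewards from the less preferred
    behavior are regressed toward 0, those from the more preferred behavior
    toward the user-defined scale k."""
    return torch.mean(r_low ** 2) + torch.mean((r_high - k) ** 2)

# Tiny usage sketch with an assumed linear reward model over (s, a) features.
feat_dim = 8
reward_net = torch.nn.Linear(feat_dim, 1)
sa_low = torch.randn(64, feat_dim)    # feature batch from the less preferred behavior
sa_high = torch.randn(64, feat_dim)   # feature batch from the more preferred behavior
loss = rank_game_loss(reward_net(sa_low), reward_net(sa_high), k=10.0)
loss.backward()                       # gradients for one reward-update step
```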

Step 2: Getting the ranking dataset

Besides giving more information about the expert’s intent and being easy to obtain, another benefit of preferences is that they can also help learn a more informative, or shaped, reward function. This form of reward shaping can provide better guidance for policy optimization, reducing the burden of exploring the environment to find the optimal policy and increasing sample efficiency for IRL. Our initial ranking dataset is generated by the policy agent from its interactions with the environment; in these rankings, the expert’s behavior is always treated as preferred over, or equal to, the current policy’s behavior. To further leverage the benefits of preferences, we consider two methods for augmenting this ranking dataset:

  • Expert-annotated rankings: In situations where we have access to additional rankings, provided by humans or obtained from reward-annotated datasets, we can simply add them to our ranking dataset.
  • Automatically generated rankings: It turns out we can improve learning efficiency for imitation by using the rankings already present in the dataset of pairwise preferences to generate more preferences, in a procedure similar to Mixup regularization in trajectory space; a minimal sketch of this idea follows below.
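
The sketch below shows one plausible way to generate such rankings automatically, under our own simplifying assumptions rather than the paper’s exact scheme: mix two already-ranked visitation distributions and slot the mixture between them in the preference ordering.

```python
import numpy as np

# One plausible instantiation of automatically generated rankings (our own
# simplification, not necessarily the paper's exact scheme): interpolate two
# ranked visitation distributions and rank the mixture between them.
rng = np.random.default_rng(0)
rho_low = np.array([0.7, 0.2, 0.1])    # less preferred behavior
rho_high = np.array([0.1, 0.2, 0.7])   # more preferred behavior
lam = rng.uniform(0.2, 0.8)
rho_mix = lam * rho_low + (1.0 - lam) * rho_high
augmented_rankings = [(rho_low, rho_mix), (rho_mix, rho_high)]   # (less, more) preferred
```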
Step 3: Improving optimization stability with a Stackelberg game

Prior work has found the Stackelberg game framework to be a strong candidate for optimizing two-player games in various applications. A Stackelberg game is a bi-level optimization problem:

\(\text{max}_x f(x,y_x),~~~~\text{s.t.}~~y_x\in \text{argmin}_y\, g(x,y)\)

In this optimization, we have two players—Leader \(x\) and Follower \(y\)—that are trying to maximize and minimize their own payoff \(f\) and \(g\), respectively. We cast \(\texttt{rank-game}\) as a Stackelberg game and propose two algorithms depending on which player is set to be the leader:

  • Policy as Leader (PAL): \(\text{max}_\pi J(R;\pi)~~~~~\text{s.t.}~~ R=\text{argmin}_R~L(D^p;R)\)
  • Reward as Leader (RAL): \(\text{min}_R L(D^p;R)~~~\text{s.t.}~~\pi = \text{argmax}_\pi~J(R;\pi)\)

Aside from improving training stability, both methods have complementary benefits in the non-stationary imitation learning setting. PAL can adjust more quickly when the intent of the expert changes, while RAL can handle environmental changes better.
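
To give a feel for how this alternation plays out, here is a self-contained NumPy toy that follows the PAL pattern on a two-action bandit where the expert always picks action 1. The bandit, the tabular reward, and the learning rates are our own simplifications for illustration, not the paper’s implementation.

```python
import numpy as np

# Toy PAL loop on a 2-action bandit; the expert always picks action 1.
k = 10.0                                # reward scale from the ranking loss
expert = np.array([0.0, 1.0])           # expert's action distribution
logits = np.zeros(2)                    # policy parameters (softmax over actions)
reward = np.zeros(2)                    # tabular reward R(a)
rankings = []                           # pairs (less preferred, more preferred)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(200):
    rankings.append((softmax(logits), expert))   # rank current behavior below the expert
    # Follower (reward): a few gradient steps on the ranking loss over all pairs.
    for _ in range(20):
        grad = np.zeros(2)
        for low, high in rankings:
            grad += 2 * low * reward            # d/dR of E_low[(R - 0)^2]
            grad += 2 * high * (reward - k)     # d/dR of E_high[(R - k)^2]
        reward -= 0.05 * grad / len(rankings)
    # Leader (policy): one policy-gradient ascent step on E_pi[R(a)].
    pi = softmax(logits)
    logits += 0.5 * pi * (reward - pi @ reward)

print("learned policy:", softmax(logits))       # most probability mass on action 1
```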

How well does \(\texttt{rank-game}\) perform in practice?

In testing the capabilities of \(\texttt{rank-game}\), one of the scenarios we consider is the learning from observations alone (LfO) setting, in which only expert observations are provided with no expert actions. This more challenging setting better reflects the learning conditions robots will operate under if we want them to be more widely deployed in both controlled and dynamic environments. People can more naturally provide demonstrations by performing tasks themselves (observations only) versus performing the task indirectly by operating a robot (observations and precise actions). We investigate the LfO performance of \(\texttt{rank-game}\) on simulated locomotion tasks like hopping, walking, and running and benchmark it with respect to representative baselines. \(\texttt{Rank-game}\) approaches require fewer environment interactions to succeed and outperform recent methods in final performance and training stability.

Additionally, our experiments reveal that none of the prior LfO methods can solve complex manipulation tasks such as door opening with a parallel jaw gripper and pen manipulation with a dexterous hand. This failure is potentially a result of the exploration requirements of LfO, which are high because of the unavailability of expert actions coupled with the fact that in these tasks observing successes is rare.

In this setting, we show that using only a handful of expert-annotated preferences in the \(\texttt{rank-game}\) framework can allow us to solve these tasks. We cannot solve these tasks using only expert data—adding preferences is key.

Next steps

Equipping agents to learn from different sources of information present in the world is a promising direction toward more capable agents that can better assist people in the dynamic environments in which they live and work. The \(\texttt{rank-game}\) framework has the potential to be extended directly to the setting where humans present their preferences interactively as the robot is learning. There are some promising future directions and open questions for researchers interested in this work. First, preferences obtained in the real world are usually noisy, and one limitation of \(\texttt{rank-game}\) is that it does not suggest a way to handle noisy preferences. Second, \(\texttt{rank-game}\) proposes modifications to learn a reward function amenable to policy optimization, but these hyperparameters are set manually. Future work can explore methods to automate such learning of reward functions. Third, despite learning effective policies, we observed that \(\texttt{rank-game}\) did not learn reusable robust reward functions.

For additional details, including experiments in the learning from demonstration (LfD) setting, non-stationary imitation setting, and further framework analysis, check out the paper, project page, code, and video presentation.

Read the paper
Download code

Acknowledgments

This research was supported in part by the National Science Foundation, Air Force Office of Scientific Research, and Army Research Office.

The post Unifying learning from preferences and demonstration via a ranking game for imitation learning appeared first on Microsoft Research.

Categories: Microsoft

Unifying learning from preferences and demonstration via a ranking game for imitation learning

Microsoft Research - Wed, 04/19/2023 - 18:08

For many people, opening door handles or moving a pen between their fingers is a movement that happens multiple times a day, often without much thought. For a robot, however, these movements aren’t always so easy.

In reinforcement learning, robots learn to perform tasks by exploring their environments, receiving signals along the way that indicate how good their behavior is compared to the desired outcome, or state. For the described movements, for example, we can specify a reward function that is +1 when the door is successfully opened or the pen is at the desired orientation and 0 otherwise. But this makes the learning task complicated for the robot since it has to try out various motions before stumbling on the successful outcome, or a reward of +1.

The imitation learning (IL) paradigm was introduced to mitigate the amount of trial and error. In IL, the robot is provided with demonstrations of a given task performed by an expert from which it can try to learn the task and possibly gain information about the expert’s reward function, or the expert’s intent, similar to how people pick up various skills. Yet, learning remains difficult in instances where we only have access to the change enacted by the expert in the world, known as the expert observation, and not the precise actions the expert took to achieve the change. Another difficulty the robot faces is that even if it sees infinite expert demonstrations, it can’t fully reason about the intent of the expert—that is, compare whether one of its own learned behaviors is closer to the expert’s than another behavior—as it only knows the best behavior and has no notion of ordering over other behaviors.

Spotlight: Microsoft Research Podcast

AI Frontiers: AI for health and the future of research with Peter Lee

Peter Lee, head of Microsoft Research, and Ashley Llorens, AI scientist and engineer, discuss the future of AI research and the potential for GPT-4 as a medical copilot.

Listen now

In our paper “A Ranking Game for Imitation Learning,” being presented at Transactions on Machine Learning Research 2023 (TMLR), we propose a simple and intuitive framework, \(\texttt{rank-game}\), that unifies learning from expert demonstrations and preferences by generalizing a key approach to imitation learning. Giving robots the ability to learn from preferences, obtained by having an expert rank which behavior aligns better with their objectives, allows the learning of more informative reward functions. Our approach, which enabled us to propose a new objective for training over behavior preferences, makes the learning process easier for a robot and achieves state-of-the-art results in imitation learning. It also enabled the training of a robot that can solve the tasks of opening a door and moving a pen between its fingers in simulation, a first in imitation learning with expert observations alone. The incorporation of preferences has also seen success in language modeling, where chatbots such as ChatGPT are improving themselves by learning a reward function inferred via preferences over several samples of model responses in addition to learning from desired human conversational data.

Robotics has found a place in controlled environments where the tasks at hand are well-defined and repeatable, such as on a factory floor. Our framework has the potential to help enable robot learning of tasks in more dynamic environments, such as helping people with daily chores around the home.

With \(\texttt{rank-game}\), which combines learning from preferences and demonstrations via a two-player ranking-based game, robots in simulation were trained to manipulate a pen with a dexterous hand (left) and open a door with a parallel jaw gripper (right). The successful completion of these tasks marked a first in imitation learning with expert observations alone.

A ranking game for imitation learning

Inverse reinforcement learning (IRL) is a popular and effective method for imitation learning. IRL learns by inferring the reward function, also referred to as the intent of the expert, and a policy, which specifies what actions the agent—or, in our case, the robot—should take in a given state to successfully mimic the expert.

Notation: We use \(\pi\) and \(\pi^E\) to denote the policy of the agent and the expert, respectively, and \(R_{gt}\) to be the reward function of the expert, which is unknown to the agent/robot. \(\rho^\pi\) denotes the state-action/state visitation distribution of policy \(\pi\) in the environment—the probabilistic collection of states the policy visits in the environment. We use \(J(R;\pi)\) to denote the \(\textit{cumulative reward}\), or the performance of policy \(\pi\) under a reward function \(R\). We assume policy \(\pi\) belongs to function class \(\Pi\) and reward function R belongs to function class \(\mathcal{R}\).

The goal of imitation learning is to make the agent have the same performance as the expert under the expert’s unknown reward function \(R_{gt}\). The classical IRL formulation tackles this by minimizing the imitation gap under a reward function that makes the performance gap the largest. We denote this framework by \(\texttt{imit-game}\) and write it below formally:

\(\texttt{imit-game}(\pi,\pi^E): \text{argmin}_{\pi\in\Pi}\text{max}_{R\in\mathcal{R}} [\mathbb{E}_{\rho^E(s,a)}[R(s,a)]-\mathbb{E}_{\rho^\pi(s,a)}[R(s,a)]]\)

Simply stated, the \(\texttt{imit-game}\) tries to find a policy that has the lowest worst-case performance difference with the expert policy. This classical IRL formulation learns from expert demonstrations but provides no mechanism to incorporate learning from preferences. In our work, we ask, does IRL really need to consider the worst-case performance difference? We find that relaxing this requirement allows us to incorporate preferences.

Our proposed method treats imitation as a two-player ranking-based game between a policy and a reward. In this game, the reward agent learns to map more preferred behaviors to a higher total reward for each of the pairwise preferences, while the policy agent learns to maximize the performance on this reward function by interacting with the environment. Contrary to the classical IRL framework, the reward function now has to get only the rankings correct and not optimize for the worst case (see Figure 1).

Figure 1: The proposed \(\texttt{rank-game}\) method treats imitation learning as a two-player ranking-based game between a policy and a reward. The policy agent maximizes the reward function by interacting with the environment. The reward agent satisfies a set of behavior rankings obtained from various sources: generated by the policy agent, automatically generated via data augmentation, or expert-annotated rankings obtained from a human or offline dataset.

To incorporate preferences, we need to quantify the behaviors in order to compare them. In this work, we choose the behaviors (\(\rho\)) to be the state-action or state-only visitation distribution of the agent. A ranking between behaviors is used to specify that the expert would prefer one behavior over the other. A reward function that satisfies the behavior rankings ensures that the average return under a lower-ranked behavior is smaller than the average return under a higher-ranked behavior. More formally, the ranking game is defined as a game where the policy agent \(\pi\) maximizes the expected return \(J(R;\pi)\) of the policy under reward function \(R\) when deployed in the environment. The reward player takes the dataset of pairwise rankings \(D^p\) (rankings are denoted as \(\rho^i\preceq\rho^j\)) as an input and attempts to learn a reward function that satisfies those rankings using a ranking loss (denoted by \(L(D^p;R)\)).

\(\underbrace{\text{argmax}_{\pi\in\Pi}J(R;\pi)}_{\text{Policy Agent}}~~~~~~~~~~~~~~~\underbrace{\text{argmin}_{R\in\mathcal{R}}L(D^p;R)}_{\text{Reward Agent}}\)

The ranking loss induces a reward function \(R\) that attempts to satisfy each pairwise preference in the dataset as follows:

\(\mathbb{E}_{\rho^i}[R(s,a)]\le\mathbb{E}_{\rho^j}[R(s,a)]~~,~~\forall \rho^i\preceq\rho^j \in D^p\)
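
Read operationally, the constraint says that the Monte Carlo average of \(R\) over samples from a less preferred behavior should not exceed its average over samples from a more preferred behavior. A minimal check of this condition on toy data (our illustration; the behaviors and reward here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

def avg_reward(reward_fn, samples):
    """Monte Carlo estimate of E_rho[R(s, a)] from (state, action) samples of one behavior."""
    return float(np.mean([reward_fn(s, a) for s, a in samples]))

# Two toy behaviors; rho_high is the one the expert would prefer.
rho_low  = [(rng.normal(0.0), rng.normal(0.0)) for _ in range(200)]
rho_high = [(rng.normal(1.0), rng.normal(0.0)) for _ in range(200)]

# Pairwise ranking dataset D^p, stored as (less preferred, more preferred) pairs.
D_p = [(rho_low, rho_high)]

# A candidate reward; in rank-game this would be the learned reward function.
reward_fn = lambda s, a: s

satisfied = all(avg_reward(reward_fn, lo) <= avg_reward(reward_fn, hi) for lo, hi in D_p)
print("all rankings satisfied:", satisfied)
```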

Generalizing prior imitation learning approaches with \(\texttt{rank-game}\)

The \(\texttt{rank-game}\) framework neatly encapsulates both prior work in IRL and prior work in learning from preferences. First, let’s see how classical IRL is a part of this framework. Recall that the classical IRL/\(\texttt{imit-game}\) optimization can be written as:

\(\text{argmin}_{\pi\in\Pi}\text{max}_{R\in\mathcal{R}} [\mathbb{E}_{\rho^E(s,a)}[R(s,a)]-\mathbb{E}_{\rho^\pi(s,a)}[R(s,a)]]\)

The inner optimization learns a reward function that maximizes the return gap between the current policy’s behavior and the expert’s behavior. Thus, \(\texttt{imit-game}\) can be seen to be a special case of \(\texttt{rank-game}\) with: (1) a ranking dataset that prefers expert behavior over the current agent behavior and (2) a form of ranking loss that maximizes the performance gap (termed the \(\textit{supremum loss}\)). A number of prior methods in the imitation learning domain can be understood as special cases of \(\texttt{rank-game}\) under various ranking losses, classes of reward functions, and abilities to incorporate preferences (see Figure 2).

Figure 2: Previous methods that learn from expert demonstrations or preferences form a special case of \(\texttt{rank-game}\) under a specific choice of ranking loss and a reward function class. Also noted in the table is whether a method enables learning from demonstration (LfD)—that is, learning from both expert states and actions—or learning from observations (LfO), where an agent learns from expert states alone.

Setting up the ranking game

To develop a framework that successfully combines learning from demonstrations and learning from preferences, we addressed several questions:

  1. What is the ranking loss function that allows for the reward to satisfy the preferences in the dataset?
  2. Where do we get the dataset of pairwise preferences?
  3. How can we effectively optimize this two-player game?
Step 1: A new ranking loss function for reward learning

Our proposed framework requires learning a reward function such that the rankings in the dataset are satisfied. While several loss functions from prior literature can enable this, such as the Luce-Shepard loss, Lovász-Bregman divergences, and the supremum loss discussed earlier, we introduce a new loss function:

\(L_k(\mathcal{D}^p;R) = \mathbb{E}_{(\rho^{\pi^i},\rho^{\pi^j})\sim \mathcal{D}^p} \Big[\mathbb{E}_{s,a\sim\rho^{\pi^i}}{[(R(s,a)-0)^2]} + \mathbb{E}_{s,a\sim\rho^{\pi^j}}{[(R(s,a)-k)^2]}\Big]\)

The loss function is simple and intuitive: for each preference pair in the dataset, the less preferred behavior is regressed to a return of 0 and the more preferred behavior is regressed to a return of a user-defined parameter \(k\). This loss function allows us to learn a reward function with a user-defined scale \(k\), which plays an important role in enabling better policy optimization; it is principled and facilitates near-optimal imitation learning; and, by design, it allows us to incorporate preferences.
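
A minimal implementation of this loss on behavior samples might look like the following sketch (ours; in practice the reward is a neural network and this objective is minimized by gradient descent):

```python
import numpy as np

def ranking_loss_k(reward_fn, pref_pairs, k=10.0):
    """L_k(D^p; R): for each (less preferred, more preferred) pair of behaviors,
    regress rewards on the less preferred samples to 0 and on the more preferred samples to k."""
    total = 0.0
    for rho_less, rho_more in pref_pairs:  # each rho is a list of (s, a) samples
        r_less = np.array([reward_fn(s, a) for s, a in rho_less])
        r_more = np.array([reward_fn(s, a) for s, a in rho_more])
        total += np.mean((r_less - 0.0) ** 2) + np.mean((r_more - k) ** 2)
    return total / len(pref_pairs)

# Example: a reward that already orders these toy behaviors correctly incurs a small loss,
# while a constant reward that ignores the ranking incurs a much larger one.
rho_agent  = [(0.1, 0.0), (0.2, 0.0)]
rho_expert = [(9.8, 0.0), (10.1, 0.0)]
print(ranking_loss_k(lambda s, a: s,   [(rho_agent, rho_expert)]))  # approx. 0.05
print(ranking_loss_k(lambda s, a: 5.0, [(rho_agent, rho_expert)]))  # 50.0
```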

Step 2: Getting the ranking dataset

Besides conveying more information about the expert’s intent and being easy to obtain, preferences can also help learn a more informative, or shaped, reward function. This form of reward shaping can provide better guidance for policy optimization, reducing the burden of exploring the environment to find the optimal policy and increasing sample efficiency for IRL. Our initial ranking dataset is generated by the policy agent from its interactions with the environment; in these rankings, the expert’s behavior is always preferred over (or tied with) the current policy’s behavior. To further leverage the benefits of preferences, we consider two methods for augmenting this ranking dataset:

  • Expert-annotated rankings: In situations where we have access to additional rankings, provided by humans or obtained from reward-annotated datasets, we can simply add them to our ranking dataset.
  • Automatically generated rankings: We can improve learning efficiency for imitation by using the rankings already present in the dataset of pairwise preferences to generate more preferences, in a procedure similar to Mixup regularization in trajectory space (sketched below).
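
Below is a simplified sketch of how such rankings can be generated automatically from an existing (less preferred, more preferred) pair: convex combinations of the two behaviors are treated as intermediate behaviors and regressed to intermediate return targets. This is our loose illustration of the Mixup-style idea, not the paper’s exact procedure.

```python
import numpy as np

def mixup_rankings(rho_less, rho_more, k=10.0, lambdas=(0.25, 0.5, 0.75), seed=0):
    """Augment one ranked pair of behaviors with intermediate behaviors and targets.
    rho_less, rho_more: arrays of shape (n, dim) of state(-action) samples, with
    rho_more preferred over rho_less; a lambda-blend is assigned the target lambda * k."""
    rng = np.random.default_rng(seed)
    rho_less, rho_more = np.asarray(rho_less), np.asarray(rho_more)
    targets = [(rho_less, 0.0), (rho_more, k)]      # regression targets of the original pair
    n = min(len(rho_less), len(rho_more))
    for lam in lambdas:
        idx_l = rng.permutation(len(rho_less))[:n]  # randomly pair up samples
        idx_m = rng.permutation(len(rho_more))[:n]
        mixed = (1.0 - lam) * rho_less[idx_l] + lam * rho_more[idx_m]
        targets.append((mixed, lam * k))            # intermediate behavior, intermediate target
    return targets
```
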
Step 3: Improving optimization stability with a Stackelberg game

Prior work has found the Stackelberg game framework to be a strong candidate for optimizing two-player games in various applications. A Stackelberg game is a bi-level optimization problem:

\(\text{max}_x f(x,y_x)~~~~\text{s.t.}~~y_x\in \text{argmin}_y g(x,y)\)

In this optimization, we have two players—Leader \(x\) and Follower \(y\)—that are trying to maximize and minimize their own payoff \(f\) and \(g\), respectively. We cast \(\texttt{rank-game}\) as a Stackelberg game and propose two algorithms depending on which player is set to be the leader:

  • Policy as Leader (PAL): \(\text{max}_\pi J(R;\pi)~~~~~\text{s.t.}~~ R=\text{argmin}_R~L(D^p;R)\)
  • Reward as Leader (RAL): \(\text{min}_R L(D^p;R)~~~\text{s.t.}~~\pi = \text{argmax}_\pi~J(R;\pi)\)

Aside from improving training stability, both methods have complementary benefits in the non-stationary imitation learning setting. PAL can adjust more quickly when the intent of the expert changes, while RAL can handle environmental changes better.
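
Schematically, the two variants only differ in which player takes a full inner-loop update and which takes a single step. The sketch below is ours and purely illustrative: the four callables are hypothetical stand-ins for rollout collection, ranking-dataset updates, ranking-loss minimization, and a policy-improvement step, which in practice are implemented with standard RL machinery.

```python
def train_rank_game(collect_rollouts, add_ranking, fit_reward, improve_policy,
                    leader="PAL", outer_rounds=100):
    """Schematic Stackelberg loop for rank-game (illustrative only; all arguments
    are user-supplied callables, so this sketch only fixes the update ordering)."""
    for _ in range(outer_rounds):
        add_ranking(less_preferred=collect_rollouts())  # expert is always ranked above the agent
        if leader == "PAL":
            fit_reward(converge=True)       # follower: solve argmin_R L(D^p; R)
            improve_policy(converge=False)  # leader: a single improvement step on J(R; pi)
        else:  # "RAL"
            improve_policy(converge=True)   # follower: solve argmax_pi J(R; pi)
            fit_reward(converge=False)      # leader: a single update step on L(D^p; R)
```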

How well does \(\texttt{rank-game}\) perform in practice?

In testing the capabilities of \(\texttt{rank-game}\), one of the scenarios we consider is the learning from observations alone (LfO) setting, in which only expert observations are provided with no expert actions. This more challenging setting better reflects the learning conditions robots will operate under if we want them to be more widely deployed in both controlled and dynamic environments. People can more naturally provide demonstrations by performing tasks themselves (observations only) versus performing the task indirectly by operating a robot (observations and precise actions). We investigate the LfO performance of \(\texttt{rank-game}\) on simulated locomotion tasks like hopping, walking, and running and benchmark it with respect to representative baselines. \(\texttt{Rank-game}\) approaches require fewer environment interactions to succeed and outperform recent methods in final performance and training stability.

Additionally, our experiments reveal that none of the prior LfO methods can solve complex manipulation tasks such as door opening with a parallel jaw gripper and pen manipulation with a dexterous hand. This failure potentially stems from the high exploration requirements of LfO: expert actions are unavailable, and successes are rarely observed in these tasks.

In this setting, we show that using only a handful of expert-annotated preferences in the \(\texttt{rank-game}\) framework can allow us to solve these tasks. We cannot solve these tasks using only expert data—adding preferences is key.

Next steps

Equipping agents to learn from different sources of information present in the world is a promising direction toward more capable agents that can better assist people in the dynamic environments in which they live and work. The \(\texttt{rank-game}\) framework has the potential to be extended directly to the setting where humans present their preferences interactively as the robot is learning. There are some promising future directions and open questions for researchers interested in this work. First, preferences obtained in the real world are usually noisy, and one limitation of \(\texttt{rank-game}\) is that it does not suggest a way to handle noisy preferences. Second, \(\texttt{rank-game}\) proposes modifications to learn a reward function that is amenable to policy optimization, but the hyperparameters of these modifications are set manually; future work can explore methods to automate this part of reward learning. Third, despite learning effective policies, we observed that \(\texttt{rank-game}\) did not learn reusable, robust reward functions.

For additional details, including experiments in the learning from demonstration (LfD) setting, non-stationary imitation setting, and further framework analysis, check out the paper, project page, code, and video presentation.

Acknowledgments

This research was supported in part by the National Science Foundation, Air Force Office of Scientific Research, and Army Research Office.

Automatic post-deployment management of cloud applications

Microsoft Research - Tue, 04/18/2023 - 18:00
Cloud Intelligence/AIOps blog series

In the first two blog posts in this series, we presented our vision for Cloud Intelligence/AIOps (AIOps) research, and scenarios where innovations in AI technologies can help build and operate complex cloud platforms and services effectively and efficiently at scale. In this blog post, we dive deeper into our efforts to automatically manage large-scale cloud services in deployment. In particular, we focus on an important post-deployment cloud management task that is pervasive across cloud services – tuning configuration parameters. And we discuss SelfTune, a horizontal reinforcement learning (RL) platform for successful configuration management of various cloud services in deployment.

Post-deployment management of cloud applications

Managing cloud applications includes mission-critical tasks such as resource allocation, scheduling, pre-provisioning, capacity planning and provisioning, and autoscaling. Currently, several of these tasks rely on hand-tuned and manually designed algorithms, heuristics, and domain knowledge. For a large cloud company like Microsoft, a hand-tuned, manually designed algorithm works well only to a certain extent, because deployments are extremely varied, large-scale, and involve complex interactions of various components. Moreover, user, customer, and application behavior can change over time, making yesterday’s hand-tuning not as relevant today and even less so in the future. The varied nature of today’s cloud technologies forces our engineers to spend an inordinate amount of time on special casing, introducing new configuration parameters, and writing or rewriting heuristics to set them appropriately. This also creates a lot of undocumented domain knowledge and dependence on a few individuals to solve significant problems. All of this, we believe, is unsustainable in the long term.

As we discussed in the earlier posts in this blog series, the right AI/ML formulations and techniques could help to alleviate this problem. Specifically, cloud management tasks are a natural fit for adopting the reinforcement learning paradigm. These tasks are repetitive in space and time; they run simultaneously on multiple machines, clusters, datacenters, and/or regions, and they run once every hour, day, week, or month. For instance, the VM pre-provisioning service for Azure Functions is a continuously running process, pre-provisioning for every application. Scheduling of background jobs on substrate runs separately on every machine. Reinforcement learning also needs a repetitive and iterative platform to converge on an optimized setup and, hence, can go together with the basic functioning of the cloud management task.

Our goal is to reduce manual effort in ensuring service efficiency, performance, and reliability by augmenting, complementing, or replacing existing heuristics for various management tasks with general RL-based solutions. In this blog post, we present our recent solution frameworks for cloud applications, to automatically tune their configuration parameters and to design policies for managing the parameters over time. Our solutions require minimal engineering effort and no AI expertise from the application developers or cloud operators.

Example Microsoft scenarios

O365 Workload Manager: Workload Manager (WLM) is a process that runs on each of the backend Exchange Online (EXO) servers to help schedule resources (CPU, disk, network) to background jobs that periodically execute. WLM has several configuration parameters that need to be carefully set so that the throughput of the scheduler is maximized while also ensuring that the resources are not too strained to execute low-latency user-facing jobs (e.g., Outlook search). Could we help EXO infrastructure manage the various knobs that dictate the control logic implemented in the scheduler for optimizing resource management and user latency?

Azure ML/Spark: Spark is the platform for performing distributed data analytics, and it comes with various configuration knobs that need to be appropriately set by developers based on their job context: Does the query involve JOIN clauses? How big are the data shards? The workload patterns change over time, and pre-trained models for choosing optimal configurations may not suffice. Can we help developers dynamically choose the deployment configuration based on workload signals?

Azure Functions VM management: Can we tune the VM management policy implemented in Azure Functions for VM pre-fetching/eviction to minimize cold starts and memory wastage over time? Our results in simulations are quite encouraging. We want to engage with the Azure and MSR Redmond teams to discuss the possibility of tuning the policy in the production setting.

Azure Kubernetes Service: AKS is chosen by first-party as well as third-party Azure customers for facilitating containerized development and deployment of cloud applications. The in-built workload autoscaling policies in AKS use several configuration parameters, which can be far from optimal in several scenarios. Can we help automatically adjust the parameters that govern resource allocation to containers running microservices based on applications’ workload patterns?

Horizontal solution design for configuration tuning

We see three main reasons why this is the right time to design and incorporate an RL-based solution framework across cloud management tasks:

  1. As the size and complexity of services in the cloud continue to increase, as our hardware footprint continues to include many SKUs, and as configuration and code get larger and more complex, heuristics and hand-tuning cannot provide optimal operations at all times. Not without significant and proportionate investment in human experts and engineers.
  2. While we will have to rely on domain experts for key changes in systems and the services landscape on the cloud, using RL sub-systems can help reduce dependence on expert decisions and domain-knowledge over time.
  3. It is important to have a horizontal framework with a simple yet expressive API, with appropriate algorithms for tuning configuration parameters in an online fashion to optimize a developer-specific metric of interest or reward.
SelfTune framework

We have designed and deployed the SelfTune framework to help cloud service developers automatically tune the configuration parameters in their codebase, which would otherwise be manually set or heuristically tweaked. SelfTune is an RL-based framework that helps developers automate complex post-deployment cloud management tasks such as parameter tuning and performance engineering.

SelfTune is hosted as a service on the public Azure cloud. First-party applications that are interested in post-deployment parameter tuning can use RestAPI calls to access SelfTune endpoints. The SelfTune framework has two components:

  1. Client API provides necessary support to access the SelfTune endpoints via RestAPI calls, namely, Predict for getting the parameters from the framework and SetReward for providing reward/feedback to the framework.
  2. RL Engine implements a suite of ML/RL algorithms for periodically updating the parameters and returning the latest values to the clients as well as for periodically computing the reward metrics.

At the core of the SelfTune framework is the formulation of the post-deployment parameter tuning problem as that of “online learning from bandit feedback.” SelfTune assumes that the only interaction possible with the external system (i.e., the application being tuned) is a black-box access to some form of feedback (e.g., daily P95 latency of the service). The framework repeatedly deploys configuration parameters and observes the corresponding rewards after a developer-defined period. As the operational environment (e.g., production cluster running certain types of workloads) is constantly in flux, there is no single setting of parameters that will remain optimal throughout. Thus, SelfTune continuously runs the explore-exploit paradigm of RL techniques – explore new parameters in the vicinity of the currently deployed parameters, observe rewards, update its internal model based on the reward, and exploit parameters that tend to give high rewards over time.
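
As a toy illustration of this explore-exploit loop (not the actual algorithm inside SelfTune), the sketch below perturbs the currently deployed parameters, observes a synthetic reward standing in for the developer-defined feedback, and keeps whichever configuration performed better:

```python
import numpy as np

rng = np.random.default_rng(0)

def observed_reward(params):
    """Stand-in for the developer-defined feedback, e.g., negative daily P95 latency."""
    target = np.array([0.7, 0.3])  # unknown 'good' operating point
    return -np.sum((params - target) ** 2) + rng.normal(scale=0.01)

params = np.array([0.5, 0.5])      # currently deployed configuration (normalized knobs)
best_reward = observed_reward(params)

for _ in range(200):
    # Explore: try a configuration in the vicinity of the currently deployed one.
    candidate = np.clip(params + rng.normal(scale=0.05, size=params.shape), 0.0, 1.0)
    reward = observed_reward(candidate)  # deploy for one period and observe feedback
    # Exploit: keep whichever configuration has performed better so far.
    if reward > best_reward:
        params, best_reward = candidate, reward

print("tuned parameters:", np.round(params, 3))
```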

We have designed a bandit learning algorithm called Bluefin in SelfTune that crystallizes the aforementioned idea. Our algorithm has lower sample complexity than peer techniques such as multi-armed bandits (the basis of Azure Personalizer), Bayesian Optimization (used by the MLOS framework), and genetic algorithms; that is, it needs fewer rounds to converge to desired values when tuning multiple real-valued parameters simultaneously. This is provable under some assumptions on the reward function, but we observe, across multiple deployments, that the algorithm converges to good solutions in practice even when the theoretical assumptions are violated.

We have open-sourced Bluefin through Vowpal Wabbit, a popular RL library for practitioners, which houses the core algorithms of Azure Personalizer. We are continuing to work on designing vertical RL algorithms and horizontal feature learning for the systems domain. Besides Bluefin, SelfTune supports a suite of black-box optimization (e.g. Bayesian Optimization) and RL techniques (e.g., Deep Deterministic Policy Gradients) that the cloud applications can choose from, based on their needs.

A simple integration use case: Consider the scenario of setting PySpark cluster configuration parameters for Azure ML jobs that are spawned for ML workloads in the O365 MS-AI organization. The workloads are composed of various data processing jobs and run on various Azure ML clusters with different capacities and hardware. It is non-trivial to set parameters for these varied jobs so that the workloads complete quickly and do not fail midway due to resourcing issues, which would lose all completed computation.

Basic SelfTune workflow: The basic integration of SelfTune in the AzureML pipeline is illustrated in the figure below. Here, the developer wants to tune seven key Apache PySpark parameters per job, namely driver memory, driver cores, executor cores, number of executors, executor memory, spark.sql.shuffle.partitions, and spark.default.parallelism. A minimal code sketch of this loop follows the numbered steps.

  1. Developer invokes Predict on the SelfTune instance, asking for the parameters for the next job.
  2. SelfTune service responds with the predicted parameters for the next job.
  3. The developer submits a job using SelfTune’s predicted parameters. //outside SelfTune’s purview
  4. Once the job is complete, the cluster sends job meta data to the data store. // outside SelfTune’s purview
  5. Developer queries rewards for previously completed jobs, if any, from Data Store (e.g., Azure ML workspace).
  6. Data Store responds with the rewards (e.g., job completion times, which is part of the job meta-data) from previously completed jobs.
  7. If the rewards exist in the store, the developer invokes SetReward for those jobs (which pushes the rewards to the SelfTune service endpoint hosted somewhere).
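
A rough Python sketch of steps 1 through 7 is below. Everything here is a placeholder: the endpoint URL, the JSON payload fields, the parameter key names, and the helper functions for submitting the Spark job and looking up its reward are hypothetical, and are only meant to show where Predict and SetReward sit in the loop.

```python
import requests  # illustrative REST usage; the endpoint, payloads, and helpers below are hypothetical

SELFTUNE_URL = "https://<selftune-endpoint>"  # placeholder, not a real endpoint

def tune_one_job(submit_spark_job, get_reward_from_store, instance_id="azureml-pyspark"):
    # Steps 1-2: ask SelfTune for the next set of PySpark parameters.
    params = requests.post(f"{SELFTUNE_URL}/Predict", json={"instance": instance_id}).json()

    # Step 3: submit the job with the predicted parameters (outside SelfTune's purview).
    job_id = submit_spark_job(
        driver_memory=params["driver_memory"],
        driver_cores=params["driver_cores"],
        executor_cores=params["executor_cores"],
        num_executors=params["num_executors"],
        executor_memory=params["executor_memory"],
        shuffle_partitions=params["spark.sql.shuffle.partitions"],
        default_parallelism=params["spark.default.parallelism"],
    )

    # Steps 4-6: once the job completes, look up its reward (e.g., negative completion time)
    # in the data store (outside SelfTune's purview).
    reward = get_reward_from_store(job_id)

    # Step 7: feed the reward back so SelfTune can update its model for future predictions.
    if reward is not None:
        requests.post(f"{SELFTUNE_URL}/SetReward",
                      json={"instance": instance_id, "job": job_id, "reward": reward})
```
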
Self-tuning substrate background jobs scheduler

User-level background job scheduling: All the substrate backend servers in EXO datacenters (that host user mailboxes) run hundreds of low-priority, latency-insensitive, periodic workloads locally (e.g., mailbox replication, encryption, event-driven assistants, etc.). Workload Management (WLM) is a core substrate service that runs on all such backend servers. It helps with the user-level scheduling of workloads on the servers: a) with the goal of completing the tasks when resources become available (at micro-granular timescales), and b) mindful of the fact that high-priority, latency-sensitive workloads will bypass this scheduler. Thus, ensuring high availability of resources especially during peak hours is critical, besides meeting workload SLAs.

Tuning real-valued configuration parameters: The scheduler is implemented today as part of a huge codebase in the substrate core. The scheduler trades off resource utilization and completion rates by dynamically ramping up and ramping down the number of concurrent background tasks requiring access to the resources. This is achieved by carefully setting several configuration knobs (hundreds of real-valued parameters). At the server level, we can achieve better resource utilization and throughput by automatically tuning the key parameters based on the workloads each server receives and the ensuing resource health fluctuations.

Impact of using SelfTune in WLM: We have integrated SelfTune with the substrate background scheduler codebase (the change required is simple, on the order of tens of lines of code, as shown in the figure below). We first deployed in the inner rings of substrate (more than 3,000 servers). The results gathered over 4-5 weeks of deployment clearly indicate that tuning helps on most of the deployed servers, increasing throughput by at least 20% on the heavily throttled servers across multiple forests, with a marginal increase in CPU health and insignificant-to-mild degradation of disk health. Based on this validation, we have now rolled out SelfTune integration to most EXO backend servers (nearly 200,000) across the worldwide production ring.

Ongoing work and future AI+systems research

SelfTune is a general platform and can be readily applied to many RL-for-cloud scenarios without any additional feature engineering or onboarding effort (which is typically required in AIOps). We expect developers to define a suitable spatial and temporal tuning scope for their service or system, for example, tuning the parameters of the service running in a cluster, at the level of individual machines, every hour of every day. Thus, instead of hand-coding the optimal operating points for the various machines or clusters that the service operates in, we could integrate SelfTune into the service codebase and let it discover those operating points over time from real-time feedback at the chosen temporal granularity.

Our work poses a lot of interesting design and algorithmic questions in this space. For instance, can we automatically scope the tuning problem based on some observed context such as cluster type, hardware, workload volumes, etc., and find optimal parameters per scope? Given that typical cloud applications have hundreds, if not thousands, of knobs to tune, can we automatically identify the knobs that impact the performance metric of interest, and then tune those knobs more efficiently?

A combination of system insights, ML formulations, and cross-layer optimization is vital for effective post-deployment management of cloud applications and services. We will post an update to this blog post on our ongoing work in this space soon. Meanwhile, the final blog post in this series will explore how AIOps can be made more comprehensive by spanning the entire cloud stack.

Automatic post-deployment management of cloud applications

Microsoft Research - Tue, 04/18/2023 - 18:00
Cloud Intelligence/AIOps blog series

In the first two blog posts in this series, we presented our vision for Cloud Intelligence/AIOps (AIOps) research, and scenarios where innovations in AI technologies can help build and operate complex cloud platforms and services effectively and efficiently at scale. In this blog post, we dive deeper into our efforts to automatically manage large-scale cloud services in deployment. In particular, we focus on an important post-deployment cloud management task that is pervasive across cloud services – tuning configuration parameters. And we discuss SelfTune, a horizontal reinforcement learning (RL) platform for successful configuration management of various cloud services in deployment.

Read part 1 Read part 2 Post-deployment management of cloud applications

Managing cloud applications includes mission-critical tasks such as resource allocation, scheduling, pre-provisioning, capacity planning and provisioning, and autoscaling. Currently, several of these tasks rely on hand-tuned and manually designed algorithms, heuristics, and domain knowledge. For a large cloud company like Microsoft, a hand-tuned, manually designed algorithm works well only to a certain extent, because deployments are extremely varied, large-scale, and involve complex interactions of various components. Moreover, user, customer, and application behavior can change over time, making yesterday’s hand-tuning not as relevant today and even less so in the future. The varied nature of today’s cloud technologies forces our engineers to spend an inordinate amount of time on special casing, introducing new configuration parameters, and writing or rewriting heuristics to set them appropriately. This also creates a lot of undocumented domain knowledge and dependence on a few individuals to solve significant problems. All of this, we believe, is unsustainable in the long term.

As we discussed in the earlier posts in this blog series, the right AI/ML formulations and techniques could help to alleviate this problem. Specifically, cloud management tasks are a natural fit for adopting the reinforcement learning paradigm. These tasks are repetitive in space and time; they run simultaneously on multiple machines, clusters, datacenters, and/or regions, and they run once every hour, day, week, or month. For instance, the VM pre-provisioning service for Azure Functions is a continuously running process, pre-provisioning for every application. Scheduling of background jobs on substrate runs separately on every machine. Reinforcement learning also needs a repetitive and iterative platform to converge on an optimized setup and, hence, can go together with the basic functioning of the cloud management task.

SPOTLIGHT: AI focus area

AI and Microsoft Research

Learn more about the breadth of AI research at Microsoft

Learn more

Our goal is to reduce manual effort in ensuring service efficiency, performance, and reliability by augmenting, complimenting, or replacing existing heuristics for various management tasks with general RL-based solutions. In this blog post, we present our recent solution frameworks for cloud applications, to automatically tune their configuration parameters and to design policies for managing the parameters over time. Our solutions require minimal engineering effort and no AI expertise from the application developers or cloud operators.

Example Microsoft scenarios

O365 Workload Manager: Workload Manager (WLM) is a process that runs on each of the backend Exchange Online (EXO) servers to help schedule resources (CPU, disk, network) to background jobs that periodically execute. WLM has several configuration parameters that need to be carefully set so that the throughput of the scheduler is maximized while also ensuring that the resources are not too strained to execute low-latency user-facing jobs (e.g., Outlook search). Could we help EXO infrastructure manage the various knobs that dictate the control logic implemented in the scheduler for optimizing resource management and user latency?

Azure ML/Spark: Spark is the platform for performing distributed data analytics, and it comes with various configuration knobs that need to be appropriately set by developers based on their job context: Does the query involve JOIN clauses? How big are the data shards? The workload patterns change over time, and pre-trained models for choosing optimal configurations may not suffice. Can we help developers dynamically choose the deployment configuration based on workload signals?

Azure Functions VM management: Can we tune the VM management policy implemented in Azure Functions for VM pre-fetching/eviction to minimize cold starts and memory wastage over time? Our results in simulations are quite encouraging. We want to engage with the Azure and MSR Redmond teams to discuss the possibility of tuning the policy in the production setting.

Azure Kubernetes Service: AKS is chosen by first-party as well as third-party Azure customers for facilitating containerized development and deployment of cloud applications. The in-built workload autoscaling policies in AKS use several configuration parameters, which can be far from optimal in several scenarios. Can we help automatically adjust the parameters that govern resource allocation to containers running microservices based on applications’ workload patterns?

Horizontal solution design for configuration tuning

We see three main reasons why this is the right time to design and incorporate an RL-based solution framework across cloud management tasks:

  1. As the size and complexity of services in the cloud continue to increase, as our hardware footprint continues to include many SKUs, and as configuration and code get larger and more complex, heuristics and hand-tuning cannot provide optimal operations at all times. Not without significant and proportionate investment in human experts and engineers.
  2. While we will have to rely on domain experts for key changes in systems and the services landscape on the cloud, using RL sub-systems can help reduce dependence on expert decisions and domain-knowledge over time.
  3. It is important to have a horizontal framework with a simple yet expressive API, with appropriate algorithms for tuning configuration parameters in an online fashion to optimize a developer-specific metric of interest or reward.
SelfTune framework

We have designed and deployed the SelfTune framework to help cloud service developers automatically tune the configuration parameters in their codebase, which would otherwise be manually set or heuristically tweaked. SelfTune is an RL-based framework that helps developers automate complex post-deployment cloud management tasks such as parameter tuning and performance engineering.

SelfTune is hosted as a service on the public Azure cloud. First-party applications that are interested in post-deployment parameter tuning can use RestAPI calls to access SelfTune endpoints. The SelfTune framework has two components:

  1. Client API provides necessary support to access the SelfTune endpoints via RestAPI calls, namely, Predict for getting the parameters from the framework and SetReward for providing reward/feedback to the framework.
  2. RL Engine implements a suite of ML/RL algorithms for periodically updating the parameters and returning the latest values to the clients as well as for periodically computing the reward metrics.

At the core of the SelfTune framework is the formulation of the post-deployment parameter tuning problem as that of “online learning from bandit feedback.” SelfTune assumes that the only interaction possible with the external system (i.e., the application being tuned) is a black-box access to some form of feedback (e.g., daily P95 latency of the service). The framework repeatedly deploys configuration parameters and observes the corresponding rewards after a developer-defined period. As the operational environment (e.g., production cluster running certain types of workloads) is constantly in flux, there is no single setting of parameters that will remain optimal throughout. Thus, SelfTune continuously runs the explore-exploit paradigm of RL techniques – explore new parameters in the vicinity of the currently deployed parameters, observe rewards, update its internal model based on the reward, and exploit parameters that tend to give high rewards over time.

We have designed a bandit learning algorithm called Bluefinin SelfTune that crystallizes the aforementioned idea. Our algorithm has lower sample complexity, which means it takes a lower number of rounds for the algorithm to converge to desired values when we want to tune multiple real-valued parameters simultaneously, compared to peer techniques like multi-arm bandits (which is the base of Azure Personalizer), Bayesian Optimization (used by the MLOS framework), or genetic algorithms. This is provable under some assumptions on the reward function, but we observe, across multiple deployments, that the algorithm converges to good solutions in practice even when theoretical assumptions are often violated.

We have open-sourced Bluefin through Vowpal Wabbit, a popular RL library for practitioners, which houses the core algorithms of Azure Personalizer. We are continuing to work on designing vertical RL algorithms and horizontal feature learning for the systems domain. Besides Bluefin, SelfTune supports a suite of black-box optimization (e.g. Bayesian Optimization) and RL techniques (e.g., Deep Deterministic Policy Gradients) that the cloud applications can choose from, based on their needs.

A simple integration use case: Consider the scenario of setting PySpark cluster configuration parameters for Azure ML jobs that are spawned for ML workloads in the O365 MS-AI organization. The workloads are composed of various data processing jobs and run on various Azure ML clusters with different capacities and hardware. It is non-trivial to set parameters for various jobs, such that the workloads complete quickly, and not fail in the middle due to resourcing issues thereby losing all computations.

Basic SelfTune workflow: The basic integration of SelfTune in the AzureML pipeline is illustrated in the figure below. Here, the developer wants to tune seven key Apache PySpark parameters per job, namely driver memory, driver cores, executor cores, number executors, executor memory, spark.sql.shuffle.partitions, and spark.default.parallelism.

  1. Developer invokes Predict on the SelfTune instance, asking for the parameters for the next job.
  2. SelfTune service responds with the predicted parameters for the next job.
  3. The developer submits a job using SelfTune’s predicted parameters. //outside SelfTune’s purview
  4. Once the job is complete, the cluster sends job meta data to the data store. // outside SelfTune’s purview
  5. Developer queries rewards for previously completed jobs, if any, from Data Store (e.g., Azure ML workspace).
  6. Data Store responds with the rewards (e.g., job completion times, which is part of the job meta-data) from previously completed jobs.
  7. If the rewards exist in the store, the developer invokes SetReward for those jobs (which pushes the rewards to the SelfTune service endpoint hosted somewhere).
Self-tuning substrate background jobs scheduler

User-level background job scheduling: All the substrate backend servers in EXO datacenters (that host user mailboxes) run hundreds of low-priority, latency-insensitive, periodic workloads locally (e.g., mailbox replication, encryption, event-driven assistants, etc.). Workload Management (WLM) is a core substrate service that runs on all such backend servers. It helps with the user-level scheduling of workloads on the servers: a) with the goal of completing the tasks when resources become available (at micro-granular timescales), and b) mindful of the fact that high-priority, latency-sensitive workloads will bypass this scheduler. Thus, ensuring high availability of resources especially during peak hours is critical, besides meeting workload SLAs.

Tuning real-valued configuration parameters: The scheduler is implemented today as part of a huge codebase in the substrate core. The scheduler trades off resource utilization and completion rates by dynamically ramping up and ramping down the number of concurrent background tasks requiring access for the resources. This is achieved by carefully setting several configuration settings (hundreds of real-valued parameters). At a server level, we can achieve better resource utilization and throughput, by automatically tuning the key parameters, based on the workloads it receives and the ensuing resource health fluctuations.

Impact of using SelfTune in WLM: We have integrated SelfTune with the substrate background scheduler codebase (the change required is simple, on the order of tens of lines of code, as shown in the figure below). We first deployed in the inner rings of substrate (over 3000+ servers). The results gathered over 4-5 weeks of deployment clearly indicate that tuning helps on most of the deployed servers, increasing throughput at least 20% across multiple forests in their heavily throttled servers, with a marginal increase in CPU health and insignificant-to-mild degradation of disk health. Based on this validation, we have now rolled out SelfTune integration to most EXO backend servers (nearly 200,000) across the worldwide production ring.

Ongoing work and future AI+systems research

SelfTune is a general platform and can be readily applied to many RL-for-cloud scenarios without any additional feature engineering or onboarding efforts (which are typically required in AIOps). We expect that developers can define a suitable spatial and temporal tuning scope in the service/system, tuning the parameters of the service running in the cluster, at the level of machines, every hour of every day. Thus, instead of hand-coding the optimal operating points for various machines or various clusters that the service operates in, we could integrate SelfTune in the service codebase to dynamically figure them out over time, based on the real-time feedback at a determined temporal granularity.

Our work raises many interesting design and algorithmic questions in this space. For instance, can we automatically scope the tuning problem based on observed context, such as cluster type, hardware, or workload volumes, and find optimal parameters per scope? Given that typical cloud applications have hundreds, if not thousands, of knobs to tune, can we automatically identify the knobs that most affect the performance metric of interest and then tune those knobs more efficiently?

A combination of system insights, ML formulations, and cross-layer optimization is vital for effective post-deployment management of cloud applications and services. We will update this blog post with our ongoing work in this space soon. Meanwhile, the final post in this series will explore how AIOps can be made more comprehensive by spanning the entire cloud stack.

The post Automatic post-deployment management of cloud applications appeared first on Microsoft Research.

Categories: Microsoft

Microsoft at NSDI 2023: A commitment to advancing networking and distributed systems

Microsoft Research - Mon, 04/17/2023 - 18:00

Microsoft has made significant contributions to the prestigious USENIX NSDI’23 conference, which brings together experts in computer networks and distributed systems. A silver sponsor for the conference, Microsoft is a leader in developing innovative technologies for networking, and we are proud to have contributed to 30 papers accepted this year. Our team members also served on the program committee, highlighting our commitment to advancing the field.

The accepted research papers span a wide range of topics, including networking for AI workloads, cloud networking, WAN, and wireless networks. These papers showcase some of the latest advancements in networking research.

The paper “DOTE: Rethinking (Predictive) WAN Traffic Engineering,” which revisits traffic engineering in the wide area network (WAN), was selected for one of the Best Paper Awards at the conference. This work was done jointly by researchers at Microsoft and academics at the Hebrew University of Jerusalem and the Technion.

Some other innovations on cloud networking infrastructure include:

Empowering Azure Storage with RDMA, which presents the findings from deploying intra-region Remote Direct Memory Access (RDMA) to support storage workloads in Azure. Today, around 70% of traffic in Azure is RDMA and intra-region RDMA is supported in all Azure public regions. RDMA helps us achieve significant disk I/O performance improvements and CPU core savings. This research is a testament to Microsoft’s ongoing commitment to providing customers with the best possible user experience.

Disaggregating Stateful Network Functions, which introduces a new approach for better reliability and performance at a lower per-server cost for cloud users. The core idea is to move the network function processing off individual servers and into shared resource pools. This technology is now shipping as part of Microsoft Azure Accelerated Connections.

Our colleagues from Microsoft Research Asia will present ARK: GPU-driven Code Execution for Distributed Deep Learning, which overcomes the overhead of GPU communication for large deep learning workloads by having GPUs run their own code and handle communication events autonomously, without CPU intervention.


Microsoft’s collective contributions to the USENIX NSDI’23 conference highlight our commitment to advancing the field of networking research and developing innovative solutions to real-world networking problems, leveraging strong academic collaborations. We look forward to continuing to push the boundaries of what is possible in networking research and delivering cutting-edge solutions to our customers.

A complete list of Microsoft papers accepted at USENIX NSDI’23:

  1. Understanding RDMA Microarchitecture Resources for Performance Isolation, Xinhao Kong and Jingrong Chen, Duke University; Wei Bai, Microsoft; Yechen Xu, Shanghai Jiao Tong University; Mahmoud Elhaddad, Shachar Raindel, and Jitendra Padhye, Microsoft; Alvin R. Lebeck and Danyang Zhuo, Duke University
  2. Empowering Azure Storage with RDMA, Wei Bai, Shanim Sainul Abdeen, Ankit Agrawal, Krishan Kumar Attre, Paramvir Bahl, Ameya Bhagat, Gowri Bhaskara, Tanya Brokhman, Lei Cao, Ahmad Cheema, Rebecca Chow, Jeff Cohen, Mahmoud Elhaddad, Vivek Ette, Igal Figlin, Daniel Firestone, Mathew George, Ilya German, Lakhmeet Ghai, Eric Green, Albert Greenberg, Manish Gupta, Randy Haagens, Matthew Hendel, Ridwan Howlader, Neetha John, Julia Johnstone, Tom Jolly, Greg Kramer, David Kruse, Ankit Kumar, Erica Lan, Ivan Lee, Avi Levy, Marina Lipshteyn, Xin Liu, Chen Liu, Guohan Lu, Yuemin Lu, Xiakun Lu, Vadim Makhervaks, Ulad Malashanka, David A. Maltz, Ilias Marinos, Rohan Mehta, Sharda Murthi, Anup Namdhari, Aaron Ogus, Jitendra Padhye, Madhav Pandya, Douglas Phillips, Adrian Power, Suraj Puri, Shachar Raindel, Jordan Rhee, Anthony Russo, Maneesh Sah, Ali Sheriff, Chris Sparacino, Ashutosh Srivastava, Weixiang Sun, Nick Swanson, Fuhou Tian, Lukasz Tomczyk, Vamsi Vadlamuri, Alec Wolman, Ying Xie, Joyce Yom, Lihua Yuan, Yanzhao Zhang, and Brian Zill, Microsoft
  3. ARK: GPU-driven Code Execution for Distributed Deep Learning, Changho Hwang, KAIST, Microsoft Research; KyoungSoo Park, KAIST; Ran Shu, Xinyuan Qu, Peng Cheng, and Yongqiang Xiong, Microsoft Research
  4. Hydra: Serialization-Free Network Ordering for Strongly Consistent Distributed Applications, Inho Choi, National University of Singapore; Ellis Michael, University of Washington; Yunfan Li, National University of Singapore; Dan R. K. Ports, Microsoft Research; Jialin Li, National University of Singapore
  5. Waverunner: An Elegant Approach to Hardware Acceleration of State Machine Replication, Mohammadreza Alimadadi and Hieu Mai, Stony Brook University; Shenghsun Cho, Microsoft; Michael Ferdman, Peter Milder, and Shuai Mu, Stony Brook University
  6. Scalable Distributed Massive MIMO Baseband Processing, Junzhi Gong, Harvard University; Anuj Kalia, Microsoft; Minlan Yu, Harvard University
  7. Unlocking unallocated cloud capacity for long, uninterruptible workloads, Anup Agarwal, Carnegie Mellon University; Shadi Noghabi, Microsoft Research; Íñigo Goiri, Azure Systems Research; Srinivasan Seshan, Carnegie Mellon University; Anirudh Badam, Microsoft Research
  8. Invisinets: Removing Networking from Cloud Networks, Sarah McClure and Zeke Medley, UC Berkeley; Deepak Bansal and Karthick Jayaraman, Microsoft; Ashok Narayanan, Google; Jitendra Padhye, Microsoft; Sylvia Ratnasamy, UC Berkeley and Google; Anees Shaikh, Google; Rishabh Tewari, Microsoft
  9. Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs, John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, and Yifan Qiao, UCLA; Zhihao Jia, CMU; Minjia Zhang, Microsoft Research; Ravi Netravali, Princeton University; Guoqing Harry Xu, UCLA
  10. OneWAN is better than two: Unifying a split WAN architecture, Umesh Krishnaswamy, Microsoft; Rachee Singh, Microsoft and Cornell University; Paul Mattes, Paul-Andre C Bissonnette, Nikolaj Bjørner, Zahira Nasrin, Sonal Kothari, Prabhakar Reddy, John Abeln, Srikanth Kandula, Himanshu Raj, Luis Irun-Briz, Jamie Gaudette, and Erica Lan, Microsoft
  11. TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches, Aashaka Shah, University of Texas at Austin; Vijay Chidambaram, University of Texas at Austin and VMware Research; Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi, Microsoft Research; Rachee Singh, Microsoft and Cornell University
  12. Synthesizing Runtime Programmable Switch Updates, Yiming Qiu, Rice University; Ryan Beckett, Microsoft; Ang Chen, Rice University
  13. Formal Methods for Network Performance Analysis, Mina Tahmasbi Arashloo, University of Waterloo; Ryan Beckett, Microsoft Research; Rachit Agarwal, Cornell University
  14. Scalable Tail Latency Estimation for Data Center Networks, Kevin Zhao, University of Washington; Prateesh Goyal, Microsoft Research; Mohammad Alizadeh, MIT CSAIL; Thomas E. Anderson, University of Washington
  15. Addax: A fast, private, and accountable ad exchange infrastructure, Ke Zhong, Yiping Ma, and Yifeng Mao, University of Pennsylvania; Sebastian Angel, University of Pennsylvania & Microsoft Research
  16. RECL: Responsive Resource-Efficient Continuous Learning for Video Analytics, Mehrdad Khani, MIT CSAIL and Microsoft; Ganesh Ananthanarayanan and Kevin Hsieh, Microsoft; Junchen Jiang, University of Chicago; Ravi Netravali, Princeton University; Yuanchao Shu, Zhejiang University; Mohammad Alizadeh, MIT CSAIL; Victor Bahl, Microsoft
  17. Tambur: Efficient loss recovery for videoconferencing via streaming codes, Michael Rudow, Carnegie Mellon University; Francis Y. Yan, Microsoft Research; Abhishek Kumar, Carnegie Mellon University; Ganesh Ananthanarayanan and Martin Ellis, Microsoft; K.V. Rashmi, Carnegie Mellon University
  18. Gemel: Model Merging for Memory-Efficient, Real-Time Video Analytics at the Edge, Arthi Padmanabhan, UCLA; Neil Agarwal, Princeton University; Anand Iyer and Ganesh Ananthanarayanan, Microsoft Research; Yuanchao Shu, Zhejiang University; Nikolaos Karianakis, Microsoft Research; Guoqing Harry Xu, UCLA; Ravi Netravali, Princeton University
  19. On Modular Learning of Distributed Systems for Predicting End-to-End Latency, Chieh-Jan Mike Liang, Microsoft Research; Zilin Fang, Carnegie Mellon University; Yuqing Xie, Tsinghua University; Fan Yang, Microsoft Research; Zhao Lucis Li, University of Science and Technology of China; Li Lyna Zhang, Mao Yang, and Lidong Zhou, Microsoft Research
  20. SelfTune: Tuning Cluster Managers, Ajaykrishna Karthikeyan and Nagarajan Natarajan, Microsoft Research; Gagan Somashekar, Stony Brook University; Lei Zhao, Microsoft; Ranjita Bhagwan, Microsoft Research; Rodrigo Fonseca, Tatiana Racheva, and Yogesh Bansal, Microsoft
  21. OpenLoRa: Validating LoRa Implementations through an Extensible and Open-sourced Framework, Manan Mishra, Daniel Koch, Muhammad Osama Shahid, and Bhuvana Krishnaswamy, University of Wisconsin-Madison; Krishna Chintalapudi, Microsoft Research; Suman Banerjee, University of Wisconsin-Madison
  22. ExoPlane: An Operating System for On-Rack Switch Resource Augmentation, Daehyeok Kim, Microsoft and University of Texas at Austin; Vyas Sekar and Srinivasan Seshan, Carnegie Mellon University
  23. Sketchovsky: Enabling Ensembles of Sketches on Programmable Switches, Hun Namkung, Carnegie Mellon University; Zaoxing Liu, Boston University; Daehyeok Kim, Microsoft Research; Vyas Sekar and Peter Steenkiste, Carnegie Mellon University
  24. Acoustic Sensing and Communication Using Metasurface, Yongzhao Zhang, Yezhou Wang, and Lanqing Yang, Shanghai Jiao Tong University; Mei Wang, UT Austin; Yi-Chao Chen, Shanghai Jiao Tong University and Microsoft Research Asia; Lili Qiu, UT Austin and Microsoft Research Asia; Yihong Liu, University of Glasgow; Guangtao Xue and Jiadi Yu, Shanghai Jiao Tong University
  25. Disaggregating Stateful Network Functions, Deepak Bansal, Gerald DeGrace, Rishabh Tewari, Michal Zygmunt, and James Grantham, Microsoft; Silvano Gai, Mario Baldi, Krishna Doddapaneni, Arun Selvarajan, Arunkumar Arumugam, and Balakrishnan Raman, AMD Pensando; Avijit Gupta, Sachin Jain, Deven Jagasia, Evan Langlais, Pranjal Srivastava, Rishiraj Hazarika, Neeraj Motwani, Soumya Tiwari, Stewart Grant, Ranveer Chandra, and Srikanth Kandula, Microsoft
  26. Doing More with Less: Orchestrating Serverless Applications without an Orchestrator, David H. Liu and Amit Levy, Princeton University; Shadi Noghabi and Sebastian Burckhardt, Microsoft Research
  27. NetPanel: Traffic Measurement of Exchange Online Service, Yu Chen, Microsoft 365, China; Liqun Li and Yu Kang, Microsoft Research, China; Boyang Zheng, Yehan Wang, More Zhou, Yuchao Dai, and Zhenguo Yang, Microsoft 365, China; Brad Rutkowski and Jeff Mealiffe, Microsoft 365, USA; Qingwei Lin, Microsoft Research, China
  28. DOTE: Rethinking (Predictive) WAN Traffic Engineering, Yarin Perry, Hebrew University of Jerusalem; Felipe Vieira Frujeri, Microsoft Research; Chaim Hoch, Hebrew University of Jerusalem; Srikanth Kandula and Ishai Menache, Microsoft Research; Michael Schapira, Hebrew University of Jerusalem; Aviv Tamar, Technion
  29. Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker, Yinfang Chen and Xudong Sun, University of Illinois at Urbana-Champaign; Suman Nath, Microsoft Research; Ze Yang and Tianyin Xu, University of Illinois at Urbana-Champaign
  30. Test Coverage for Network Configurations, Xieyang Xu and Weixin Deng, University of Washington; Ryan Beckett, Microsoft; Ratul Mahajan, University of Washington; David Walker, Princeton University

NSDI 2023 Program Committee members:

Members of other committees:

The post Microsoft at NSDI 2023: A commitment to advancing networking and distributed systems appeared first on Microsoft Research.

Categories: Microsoft


Hunting speculative information leaks with Revizor

Microsoft Research - Thu, 04/13/2023 - 18:00

Spectre and Meltdown are two security vulnerabilities that affect the vast majority of CPUs in use today. CPUs, or central processing units, act as the brains of a computer, directing the functions of its other components. By targeting a feature of the CPU implementation that optimizes performance, attackers could access sensitive data previously considered inaccessible. 

For example, Spectre exploits speculative execution, an aggressive strategy for increasing processing speed by postponing certain security checks. But it turns out that before the CPU performs the security check, attackers might have already extracted secrets via so-called side channels. This vulnerability went undetected for years before it was discovered and mitigated in 2018. Security researchers warned that thieves could use it to target countless computers, phones, and mobile devices. Researchers began hunting for more vulnerabilities, and they continue to find them. But this process was manual, and progress came slowly. With no tools available to help them search, researchers had to analyze documentation, read through patents, and experiment with different CPU generations.

A group of researchers from Microsoft and academic partners began exploring a method for systematically finding and analyzing CPU vulnerabilities. This effort would produce a tool called Revizor (REV-izz-or), which automatically detects microarchitectural leakage in CPUs—with no prior knowledge about the internal CPU components. Revizor achieves this by differentiating between expected and unexpected information leaks on the CPU. 


The Revizor process begins by describing what is expected from the CPU in a so-called “leakage contract.” Revizor then searches the CPU to find any violations of this contract. It creates random programs, runs them on the CPU, records the information they expose, and compares the information with the contract. When it finds a mismatch that violates the contract, it reports it as a potential vulnerability. 
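
At a high level, this search can be pictured as the loop sketched below. Every function here is a hypothetical stand-in (for Revizor’s test-case generator, its hardware trace collection, and its contract model), so this is a conceptual sketch of the contract-violation check rather than Revizor’s actual interface.

```python
def generate_random_program(seed):
    """Stand-in for Revizor's test-case generator (random instruction sequences)."""
    return f"program-{seed}"

def run_on_cpu(program, public, secret):
    """Stand-in for running the program and collecting a hardware trace (e.g., the
    cache state observed via a side channel). This toy 'CPU' leaks the secret for
    some programs, emulating a speculative leak."""
    leaked = secret if program.endswith("7") else 0
    return hash((program, public, leaked))

def contract_trace(program, public, secret):
    """Stand-in for the leakage contract: it describes what the CPU is allowed to
    expose. Here only the program and public input may be exposed, so the secret
    deliberately does not influence the returned trace."""
    return hash((program, public))

def fuzz(rounds=20):
    for seed in range(rounds):
        program = generate_random_program(seed)
        public = 1
        # Two runs that the contract says must look identical (same public input,
        # different secrets); if the hardware traces still differ, the CPU exposes
        # more than the contract permits, and we report a potential vulnerability.
        same_under_contract = contract_trace(program, public, 111) == contract_trace(program, public, 222)
        distinguishable_on_hw = run_on_cpu(program, public, 111) != run_on_cpu(program, public, 222)
        if same_under_contract and distinguishable_on_hw:
            print(f"potential contract violation found by {program}")

fuzz()
```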

Details were published in 2022 in the paper: Revizor: Testing Black-box CPUs against Speculation Contracts

To demonstrate Revizor’s effectiveness, the researchers tested a handful of commercial CPUs and found several known vulnerabilities, including Spectre, MDS, and LVI, as well as several previously unknown variants. 

However, the search was still slow, which hindered the discovery of entirely new classes of leaks. The team identified the root causes of the performance limitations, and proposed techniques to overcome them, improving the testing speed by up to two orders of magnitude. The improvements are described in a newly published paper: Hide and Seek with Spectres: Efficient discovery of speculative information leaks with random testing

These improvements supported a testing campaign of unprecedented depth on Intel and AMD CPUs. In the process, the researchers found two types of previously unknown speculative leaks (affecting string comparison and division) that had escaped previous analyses—both manual and automated. These results show that work which previously required persistent hacking and painstaking manual labor can now be automated and rapidly accelerated. 

The team began working with the Microsoft Security Response Center and hardware vendors, and together they continue to find vulnerabilities so they can be closed before they are discovered by hackers—thereby protecting customers from risk. 

Revizor is part of Project Venice, which investigates novel mechanisms for the secure sharing and partitioning of computing resources, together with techniques for specifying and rigorously validating their resilience to side-channel attacks. 


The post Hunting speculative information leaks with Revizor appeared first on Microsoft Research.

Categories: Microsoft