In this project, we investigate a new application of meta-RL. Previous work has centered on using meta-RL to help a single agent quickly learn new, unseen tasks drawn from the same task distribution. Here, we explore how meta-RL-trained agents perform in a competitive multi-agent RL setting when they must play against new, unseen opponents. We specifically implement PEARL, the off-policy meta-RL algorithm developed by Kate Rakelly et al. Instead of relying on LSTM recurrence in the policy network to memorize skills, PEARL decouples skill memory from the policy by using a separate network to encode previous experiences into a latent context variable. This latent context variable is used to condition both the actor (policy) and critic (value) networks. We use Derk's gym as the multi-agent environment for testing PEARL. We first train the agent against an opponent that only performs random actions. After collecting this experience, we expose the agent to new opponents with different skill sets. Using meta-RL, we anticipate that the agent will adapt to new opponents more quickly. The code and video presentation can be accessed at this link: https://drive.google.com/drive/folders/1JTQ2ycmXNRA2OSZ7f-fipg94HBnQCR0V?usp=sharing

Introduction

Conventional reinforcement learning algorithms aim to learn a policy that maximizes the trajectory reward for a single Markov decision process (eq.1). This learned policy generally does not transfer to new, unseen tasks: a distribution shift in the observation space may cause the policy to perform suboptimal actions, and a change in the environment's rewards and state transitions means the same actions may yield different trajectories. Therefore, conventional reinforcement learning algorithms need to learn a separate policy for each new task, making them extremely sample-inefficient.

In contrast, meta reinforcement learning algorithms are able to learn new, unseen tasks more efficiently, much like how people can learn new skills quickly by building on previously learned ones. In meta reinforcement learning, the algorithm models a function $f_{\theta}$ that takes a Markov decision process as input (eq.3). It then learns to optimize this function so that the resulting policy maximizes the trajectory reward over any Markov decision process drawn from the task distribution (eq.2). This can be accomplished by learning how to encode context from experience and using this encoding to condition a policy so that it applies to every task in the distribution. PEARL, an efficient off-policy meta reinforcement learning algorithm, operates on this basis.

Reinforcement Learning Objective

$$ (eq.1) \quad \theta^* = \arg\max_{\theta} \, E_{\pi_{\theta}(\tau)}\left[R(\tau)\right] $$

Meta Reinforcement Learning Objective

$$ (eq.2) \quad \theta^* = \arg\max_{\theta} \sum_{i=1}^{n} E_{\pi_{\phi_{i}}(\tau)}\left[R(\tau)\right] $$

$$ (eq.3) \quad \phi_{i} = f_{\theta}(M_{i}) $$
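Substituting eq.3 into eq.2 makes the meta-objective explicit: a single set of meta-parameters $\theta$ is optimized so that the adapted policy for each task performs well. In PEARL, the adaptation function $f_{\theta}$ is realized by an inference network over a latent context variable, as described in the Algorithm section below.

$$ \theta^* = \arg\max_{\theta} \sum_{i=1}^{n} E_{\pi_{f_{\theta}(M_{i})}(\tau)}\left[R(\tau)\right] $$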

This project applies the PEARL algorithm to Derk's gym, a competitive multi-agent environment. In meta reinforcement learning, the train and test tasks are generally different from each other but are drawn from the same distribution. We create this setting by assigning train and test tasks that consist of agents with random combinations of abilities. During testing, the policy controls a combination of agents it has never played with and competes against a combination of agents it has never played against. This effectively changes the nature of every task while ensuring each task still belongs to the same distribution, recreating the problem that meta reinforcement learning aims to solve.

We hypothesize that conventional reinforcement learning algorithms will struggle to learn a policy that can control, and play against, agents with abilities it has not seen before. With meta reinforcement learning, we hope that by exposing the algorithm to a variety of agents with different randomized abilities, it learns to make inferences from context and uses them to optimize a policy for the unseen task. In other words, the PEARL algorithm should be able to use its experience playing against an assortment of agents as context to form an action hypothesis when it faces a combination of agents it has not seen before. It can then evaluate this hypothesis and learn to make better context inferences and policy outputs.

Algorithm

NOTE: The code for the algorithms can be found under CS269-projects-2022fall/assets/images/team21/CS269_Project.zip/

Random Policy

The opposing agents in Derk’s gym will be controlled using a random policy that simply samples the action space. This random policy will be deployed on a local Derk’s gym server where we will train our baseline and PEARL algorithms.
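For reference, a random opponent of this kind can be written in a few lines against the Derk's gym Python API. The sketch below follows the random-agent example in the documentation [2] and the random-agent repository [3]; the exact class and attribute names should be verified against those sources.

```python
# Minimal sketch of a random opponent policy for Derk's gym, assuming the
# gym_derk API documented in [2]: one action per agent, sampled uniformly
# from the action space at every step.
from gym_derk.envs import DerkEnv

env = DerkEnv()                     # connects to a local Derk's gym instance
observations = env.reset()
done = False
while not done:
    # sample one random action per agent directly from the action space
    actions = [env.action_space.sample() for _ in range(env.n_agents)]
    observations, rewards, dones, info = env.step(actions)
    done = all(dones)
env.close()
```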

Baseline Policy

The baseline algorithm (used to represent conventional reinforcement learning) models the policy as a neural network. We apply an evolutionary strategy to train this network: at every episode, we perturb the network weights with Gaussian noise and keep the weights that yield the highest trajectory reward.
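A minimal sketch of this evolutionary strategy is shown below. Here `evaluate` is a hypothetical stand-in for rolling out one Derk's gym episode with the given policy weights and returning its total reward; the noise scale and weight count are illustrative, not the values used in the project.

```python
# Gaussian-perturbation evolutionary strategy (hill climbing on episode reward).
import numpy as np

def evaluate(weights: np.ndarray) -> float:
    """Placeholder: roll out one episode with these policy weights, return total reward."""
    return -np.sum((weights - 1.0) ** 2)   # toy objective for illustration only

rng = np.random.default_rng(0)
weights = rng.normal(size=128)             # flattened policy-network weights
best_reward = evaluate(weights)

for episode in range(200):
    candidate = weights + 0.1 * rng.normal(size=weights.shape)   # Gaussian perturbation
    reward = evaluate(candidate)
    if reward > best_reward:               # keep the perturbation only if it improved the return
        weights, best_reward = candidate, reward
```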

PEARL Policy

PEARL Training Loop

Fig 1. PEARL Algorithm: The inference network $q_{\phi}$ uses context data to infer the posterior over the latent context variable $Z$, which conditions the actor and critic and is optimized with gradients from the critic as well as from an information bottleneck on $Z$.

**Inference network (also referred to as the context encoder).** This is the main novelty of the PEARL algorithm. PEARL retrieves experiences from the replay buffer as context, which it uses to hypothesize what actions to perform for the current task. This is done by passing the context $c$ into an inference network $q_{\phi}(z|c)$ to generate a probabilistic latent variable $z$, which is effectively a posterior over the context. This latent variable is then sampled and used to condition the policy $\pi_{\theta}(a|s,z)$. Essentially, the context is used to adapt the agents' policy behaviors to the new, unseen task. The weights of the inference network are optimized jointly with the weights of the actor-critic networks using gradients from the critic update, which is discussed further below.

**Soft actor-critic reinforcement learning.** The PEARL algorithm is built on top of the soft actor-critic (SAC) algorithm, an off-policy method that augments the discounted returns with the policy entropy. The critic consists of two Q-value networks $Q_{\theta,i}(s,a,z)$ and one value network $V_{\phi}(s,z)$, and the actor consists of one policy $\pi_{\theta}(a|s,z)$. As mentioned earlier, the novelty of PEARL is that these actor-critic networks also take the latent variable returned by the inference network as input, so that they are conditioned on the context posterior. We optimize the parameters of the actor-critic networks using the Adam optimizer and the loss functions given below in **Update Rules**.

> Update Rules

**Loss Functions**

$$ L_{V} = E_{s,a,r,s'\sim \mathcal{B},\, z\sim q_{\phi}(z|c)} \left[ V_{\phi}(s,z) - \left( \min_{i=1,2} Q_{\theta,i}(s,a',z) - \log \pi_{\theta}(a'|s,z) \right) \right]^2, \quad a' \sim \pi_{\theta}(\cdot|s,z) $$

$$ L_{Q} = E_{s,a,r,s'\sim \mathcal{B},\, z\sim q_{\phi}(z|c)} \left[ Q_{\theta}(s,a,z) - \left( r + V_{\phi}(s',z) \right) \right]^2 $$

$$ L_{\pi} = E_{s \sim \mathcal{B},\, a \sim \pi_{\theta},\, z \sim q_{\phi}(z|c)} \left[ D_{KL}\!\left( \pi_{\theta}(a|s,z) \,\Big\|\, \frac{\exp\left(Q_{\theta}(s,a,z)\right)}{Z_{\theta}(s)} \right) \right] $$

**Stochastic Gradient Descent**

$$ \phi_{V} \leftarrow \phi_{V} - \alpha \nabla_{\phi_{V}} \sum_{i} L_{V}^{i} $$

$$ \theta_{Q} \leftarrow \theta_{Q} - \alpha \nabla_{\theta_{Q}} \sum_{i} L_{Q}^{i} $$

$$ \theta_{\pi} \leftarrow \theta_{\pi} - \alpha \nabla_{\theta_{\pi}} \sum_{i} L_{\pi}^{i} $$

$$ \phi_{q(z|c)} \leftarrow \phi_{q(z|c)} - \alpha \nabla_{\phi_{q(z|c)}} \sum_{i} \left( L_{Q}^{i} + L_{KL}^{i} \right) $$

Here $L_{KL}^{i}$ is the information-bottleneck term from Fig. 1, a KL divergence that keeps the inferred posterior $q_{\phi}(z|c^{i})$ close to the prior over $z$.

## Contribution

In this project, we used the PEARL implementation by Kate Rakelly et al. as the basis of our code and changed the environment from MuJoCo, a single-agent environment, to Derk's Gym, a multi-agent environment. In Derk's gym, there are 3 agents on the home team and 3 agents on the opposing team. For each agent, the observation space has 64 dimensions; some of these include the abilities of the controlled agent, the abilities of the opposing agent, and agent positions on the map. For each agent, the action space has 5 dimensions; the actions include rotating, moving, and using any of the three abilities. To facilitate multi-agent learning, we implement one shared actor (policy) to control the actions of all 3 agents and one shared critic to evaluate the policy. These actor-critic networks take in the states, actions, and latent variables of all 3 agents concatenated together.
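To make the inference step and the shared, context-conditioned policy more concrete, the sketch below shows one way they could be wired up in PyTorch. It is an illustration under assumed layer sizes and a simple Gaussian-factor posterior (as in the PEARL paper [4]), not the project's actual code, which adapts the implementation of Rakelly et al. [5]; the 3-agent concatenation mirrors the description above.

```python
# Sketch: a context encoder q_phi(z|c) producing a Gaussian posterior over z,
# and a shared policy pi_theta(a|s,z) acting on the concatenated 3-agent state.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, ACT_DIM, N_AGENTS, LATENT_DIM = 64, 5, 3, 5
JOINT_OBS = OBS_DIM * N_AGENTS                 # observations of all 3 agents, concatenated
JOINT_ACT = ACT_DIM * N_AGENTS                 # actions of all 3 agents, concatenated
CONTEXT_DIM = 2 * JOINT_OBS + JOINT_ACT + 1    # one flattened (s, a, r, s') tuple

class ContextEncoder(nn.Module):
    """q_phi(z|c): maps each context transition to a Gaussian factor over z."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CONTEXT_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * LATENT_DIM),     # per-transition mean and variance
        )

    def forward(self, context):                    # context: (num_transitions, CONTEXT_DIM)
        params = self.net(context)
        mu, sigma_sq = params[..., :LATENT_DIM], F.softplus(params[..., LATENT_DIM:])
        # Multiply the per-transition Gaussian factors into a single posterior.
        var_post = 1.0 / torch.sum(1.0 / sigma_sq, dim=0)
        mu_post = var_post * torch.sum(mu / sigma_sq, dim=0)
        return torch.distributions.Normal(mu_post, var_post.sqrt())

class SharedPolicy(nn.Module):
    """pi_theta(a|s,z): one network outputs the joint action for all 3 agents."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(JOINT_OBS + LATENT_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, JOINT_ACT),
        )

    def forward(self, joint_obs, z):
        return self.net(torch.cat([joint_obs, z], dim=-1))

# Usage: infer z from recent transitions, then act conditioned on it.
encoder, policy = ContextEncoder(), SharedPolicy()
context = torch.randn(32, CONTEXT_DIM)             # 32 recent (s, a, r, s') tuples
q_z = encoder(context)
z = q_z.rsample()                                  # reparameterized sample keeps gradients
joint_action = policy(torch.randn(1, JOINT_OBS), z.unsqueeze(0))
```

Because the posterior is formed by multiplying per-transition Gaussian factors, the encoder is permutation-invariant in the context, which is what allows PEARL to infer $z$ from off-policy replay data.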

Derk's Gym Environment

Fig 2. Derk's Gym Environment

In MuJoCo, the agent trained using the PEARL algorithm was able to learn new tasks more quickly than agents trained with other reinforcement learning methods. For example, a half-cheetah trained to move at one velocity was able to efficiently learn to run at another target velocity during testing. We wish to explore whether this success translates to a multi-agent environment. We aim to answer the following questions:

1. Is the experience the agent collected playing with and against characters of different abilities transferable to future tasks?
2. Will the inference network be able to encode strategies (latent variables) from its experiences (context) in order to hypothesize how to successfully play with and against characters it has not seen before?

## Results

> Qualitative Observations

The agents trained using the baseline algorithm learned to control a few characters very well, particularly those whose actions had previously returned positive rewards. However, when a new character that the agent has not been trained on is introduced, the agent ends up circling around and falling off the map. This behavior is reflected in the baseline rewards across game episodes, which alternate between 100 and -100. In contrast, the agents trained with the PEARL algorithm learned to consistently focus on the enemy tower and the enemies, and on a teammate when healing is needed. However, the agent would occasionally hesitate to use its abilities, eventually leading to zero rewards. Overall, the agent trained using the PEARL algorithm performed more consistently than the one trained with the baseline algorithm, likely because the actor and critic are conditioned on the latent posterior over the context of past experiences.

> Average Path Reward

The table below summarizes the average trajectory reward across all agents for each game episode. These values were collected during testing, when the trained policy controlled 3 agents with an unseen combination of abilities and played against another 3 agents, also with an unseen combination of abilities, controlled by the random policy.

| Game Episode | Baseline Reward | PEARL Reward |
| :-: | :-: | :-: |
| 1 | 100 | 0 |
| 2 | -100 | 30 |
| 3 | -100 | 106 |
| 4 | -100 | 90 |
| 5 | 100 | 20 |
| 6 | 100 | 120 |
| 7 | -100 | 18 |
| 8 | -100 | 0 |
| 9 | 100 | 0 |
| 10 | -100 | 45 |
| Mean | -10 | 43 |

> Future Improvements

Several improvements could be made to the environment. First, the environment returns a positive reward when an opposing agent dies or an opposing tower is destroyed, regardless of who is responsible for the action (even if the opposing agent itself is responsible). This causes the algorithm to learn actions that do not necessarily lead to the rewards. Second, the environment does not sufficiently penalize agent actions that kill a teammate or destroy its own tower; these rewards could be made more negative so the policy is discouraged from performing such actions. Finally, the maximum episode length could be increased, since several episodes terminate before the agent has killed an opposing agent or destroyed an opposing tower. This leads to some episodes with zero reward even when the agent has started taking favorable actions, so the policy fails to learn these optimal behaviors.
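As a concrete illustration of the second suggestion, Derk's gym exposes a configurable reward function. The sketch below is hedged: the `reward_function` parameter and the key names reflect our reading of the Derk's gym documentation [2] and should be verified there, and the numeric values are purely illustrative.

```python
# Hedged sketch of reward shaping in Derk's gym: penalize friendly fire more
# heavily so the policy is discouraged from harming teammates.
from gym_derk.envs import DerkEnv

reward_function = {
    "damageEnemyUnit": 2,
    "damageEnemyStatue": 4,
    "killEnemyUnit": 10,
    "killEnemyStatue": 100,
    "friendlyFire": -10,     # harsher penalty for harming teammates
    "damageTaken": -1,
}

env = DerkEnv(reward_function=reward_function)
```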
## References

[1] Meta reinforcement learning tutorial: https://metalearning-cvpr2019.github.io/assets/CVPR_2019_Metalearning_Tutorial_Chelsea_Finn.pdf

[2] Derkgym documentation: http://docs.gym.derkgame.com/#examples

[3] Random agent in Derkgym: https://github.com/MountRouke/Randos

[4] PEARL paper: https://arxiv.org/pdf/1903.08254.pdf

[5] PEARL code: https://github.com/katerakelly/oyster