Introduction
Reinforcement learning (RL) is a powerful learning paradigm for sequential decision making under uncertainty. However, RL agents often fail to perform well in reality because the deployment environment differs from the training simulator in many respects, such as friction, obstacles, and human behavior. Moreover, the environment is sometimes subject to natural or deliberate disturbances, which makes it even harder for agents to perform well. Researchers therefore categorize this variability into two types of uncertainty [@garcia2015comprehensive]: a) inherent uncertainty, related to the stochastic nature of the system, and b) parameter uncertainty, arising because some parameters of the MDP are not known exactly.
Furthermore, when these factors are modeled within an MDP, robustness can be categorized into four types [@zhang2017robustness]: a) robustness to reward perturbation, b) robustness to state perturbation, c) robustness to action perturbation, and d) robustness to different transition dynamics.
Most existing work improves robustness by modifying the update equation so that the policy converges to the optimum of a particular criterion. In contrast, few studies explore the idea of using adversarial samples to improve the robustness of RL models, even though this is a well-known technique in machine learning, especially in computer vision.
In this project, we study whether keeping the update equation unchanged but changing the sampling policy can improve the robustness of RL. We model the uncertainty as a robust MDP and focus on worst-case performance. Based on this formulation, we propose a sampling policy for on-policy algorithms, the adversarial sampling policy. We prove that SARSA with the adversarial sampling policy converges to the optimum of a certain criterion. We also conduct experiments that support the practicality of using adversarial samples under some conditions.
Related Work
The main goal of robust reinforcement learning is to improve the performance of an agent in environments with unpredictable uncertainty. Although robust reinforcement learning is not a mainstream topic, many different ideas have been proposed to achieve this goal. To the best of our knowledge, there are three main trends in robust reinforcement learning.
The first trend modifies the optimality criterion to introduce the concept of risk [@garcia2015comprehensive]. \(\hat{Q}\)-learning and \(\beta\)-pessimistic Q-learning are two representative methods based on the "worst-case" criterion [@heger1994risk; @gaskett2003reinforcement]; they seek an agent that maximizes the reward in the worst scenario. Another commonly studied criterion, the "risk-sensitive" criterion, takes the variance of the rewards into consideration [@prashanth2013actor]. The main idea is that the agent has to strike a balance between obtaining large rewards and avoiding catastrophic situations, even if these occur with very small probability [@garcia2015comprehensive].
The second trend uses stochastic policies instead of deterministic ones. Soft Q-learning is a method for learning maximum-entropy policies over continuous states and actions; it expresses the optimal policy via a Boltzmann distribution. The benefits include improved exploration and robustness in the face of uncertain dynamics [@haarnoja2017reinforcement].
The third trend treats the uncertainty as a two-agent game [@gleave2019adversarial]. Robust Adversarial Reinforcement Learning (RARL) trains an agent to operate in the presence of a destabilizing adversary; it formulates policy learning as a zero-sum game and jointly trains the adversary using RL methods [@pinto2017robust].
Settings
In this section, we introduce the setting of our research problem and clarify the notation.
The Markov Decision Process (MDP) is the basic setting of reinforcement learning problems. In this project we mainly consider MDPs with finite states and finite actions.
Definition We represent an MDP by the tuple \(MDP(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma)\), where \(\mathcal{S}\) is a set of states, \(\mathcal{A}\) is a set of actions, \(\mathcal{R} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}\) is the reward function, \(\mathcal{P} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]\) is the transition probability, and \(\gamma\) is the discount factor.
A conventional way to model uncertainty is to introduce a set of MDPs instead of a single one. We regard the real environment as a set of MDPs whose transition probabilities are similar to, but different from, the one in the laboratory environment.
Definition A robust MDP \((\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma)\) with power \(\xi\) is a set of MDPs that transit according to the transition probability \(\tilde{\mathcal{P}} = (1 - \xi)\mathcal{P} + \xi \overline{\mathcal{P}}\), where \(\overline{\mathcal{P}}\) is an arbitrary transition probability.
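As a concrete illustration, a perturbed kernel \(\tilde{\mathcal{P}}\) can be formed as a convex combination of two transition arrays. The names and shapes below (`P` and `P_bar` of shape `(S, A, S)`) are assumptions for this sketch only.

```python
import numpy as np

def mixed_kernel(P, P_bar, xi):
    """Return the perturbed transition kernel (1 - xi) * P + xi * P_bar.

    P, P_bar : arrays of shape (S, A, S) whose rows sum to 1 over the last axis.
    xi       : perturbation power in [0, 1].
    """
    P_tilde = (1.0 - xi) * P + xi * P_bar
    # A convex combination of probability distributions is still a distribution.
    assert np.allclose(P_tilde.sum(axis=-1), 1.0)
    return P_tilde
```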
We then define two special MDPs within the robust MDP set. The first is the average case: if we consider the uncertainty only in the form of action perturbation, we take the average of the transition probabilities over actions and obtain the "random MDP".
Definition A random MDP \((\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma)\) with power \(\xi\) is an MDP where \(\overline{\mathcal{P}}(s, a, s') = E_{a'}[\mathcal{P}(s, a', s')]\), i.e., the average of the original transition probabilities over actions.
The other special case is the worst case. We still consider the uncertainty in the form of action perturbation, and the MDP whose transition probability leads to the worst expected reward is called the "adversarial MDP".
Definition An adversarial MDP \((\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma)\) with power \(\xi\) is an MDP where \(\overline{\mathcal{P}}\) is the minimizing perturbation in the minimax problem \[ \min_{\overline{\mathcal{P}}} \max_{\pi} E[V(\pi) \mid \overline{\mathcal{P}}]. \]
Rethinking Epsilon Greedy
To explain the inspiration for the adversarial sampling policy, we begin with epsilon greedy. Recall the difference between SARSA and Q-learning: SARSA is an on-policy method, while Q-learning is an off-policy method. During training, SARSA learns a "safer" policy than Q-learning. The explanation is that SARSA gives a lower value estimate to states that are near dangerous positions.
Moreover, suppose we use a constant parameter \(\epsilon_0\) in epsilon greedy for ease of implementation. The resulting policies from SARSA and Q-learning then differ slightly: Q-learning still learns an optimal policy of the original MDP, whereas SARSA converges to the optimum of the random MDP with power \(\epsilon_0\). (This is easy to prove, so we omit the proof.)
This gives us another perspective on epsilon greedy: it not only balances the trade-off between exploration and exploitation but also shifts the policy that the algorithm converges to from one MDP to another. We can regard epsilon greedy with a constant epsilon as adding a few random samples during the learning process, which already improves the robustness of on-policy methods. This observation supports our conjecture that adding adversarial samples can improve the robustness of models.
Adversarial Sampling Policy
Method Description
Epsilon greedy only adds random samples during sampling, which improves the average performance under uncertainty. However, we are more concerned with preventing the agent from running into catastrophes. Thus, we propose adding "worst-case" samples during sampling; in other words, with some probability we choose the worst action according to the current Q-values. We call this sampling policy the "adversarial sampling policy".
The policy requires two parameters, an exploration rate \(\epsilon_e\) and an adversary rate \(\epsilon_a\). For each state \(s_t\), the policy chooses an action according to \[ \left\{ \begin{array}{ll} \text{a uniformly random action from }\mathcal{A}, & \text{w.p. } \epsilon_e \\ \text{argmin}_{a} Q(s_t, a), & \text{w.p. } (1 - \epsilon_e) \epsilon_a \\ \text{argmax}_{a} Q(s_t, a), & \text{otherwise} \end{array} \right. \] Usually, \(\epsilon_e\) decays during training and eventually converges to zero.
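This rule can be sketched in a few lines of Python for a tabular Q. The array `Q` (indexed by state and action), the rates `eps_e` and `eps_a`, and the `rng` generator are illustrative assumptions, not a fixed implementation.

```python
import numpy as np

def adversarial_sample(Q, s, eps_e, eps_a, rng):
    """Choose an action at state s under the adversarial sampling policy.

    With prob. eps_e               : a uniformly random action (exploration).
    With prob. (1 - eps_e) * eps_a : the worst action argmin_a Q[s, a] (adversarial sample).
    Otherwise                      : the greedy action argmax_a Q[s, a].
    """
    n_actions = Q.shape[1]
    u = rng.random()
    if u < eps_e:
        return int(rng.integers(n_actions))
    if u < eps_e + (1.0 - eps_e) * eps_a:
        return int(np.argmin(Q[s]))
    return int(np.argmax(Q[s]))
```

Only the behavior policy changes; the SARSA update \(Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))\) used later in the convergence analysis is left untouched.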
A Simple Example
We conducted a simple experiment to show the effect of the adversarial sampling policy. The environment is FrozenLake-v0, a game in which the player must travel from S to G without stepping on H; otherwise the player dies. To show the difference more clearly, we use a custom map and custom reward values in the experiment.
We test Q-learning, SARSA with constant-epsilon greedy, and SARSA with the adversarial sampling policy. Table 1 shows the results. Q-learning obtains the optimal solution of the original MDP and is therefore the best for very small \(\xi\). SARSA with constant epsilon obtains the optimum of the random MDP and performs best for moderate \(\xi\) (e.g., \(\xi = 0.05\)), while the adversarial sampling policy is the most robust against disturbances and becomes the best when \(\xi\) is large.
| Method | \(\xi=0.001\) | \(\xi=0.05\) | \(\xi=0.1\) |
|---|---|---|---|
| Q-learning | −12.7 | −40.7 | −46.4 |
| SARSA with constant \(\epsilon\) | −14.1 | −25.3 | −36.6 |
| SARSA with adversarial sampling policy | −20.1 | −28.5 | −35.2 |

Table 1: Performance in adversarial MDPs with different \(\xi\).
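To make the evaluation protocol concrete, the following sketch estimates the return of a greedy tabular policy under action perturbation with rate \(\xi\), approximating the adversary by the action that minimizes the learned Q-values. It assumes the classic gym step/reset interface of FrozenLake-v0; the names `env` and `Q` are placeholders, not our exact experiment code.

```python
import numpy as np

def evaluate_adversarial(env, Q, xi, episodes=1000, seed=0):
    """Average return of the greedy policy w.r.t. Q when, with probability xi,
    the executed action is swapped for the (approximately) worst action."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(episodes):
        s, done, total = env.reset(), False, 0.0
        while not done:
            a = int(np.argmax(Q[s]))
            if rng.random() < xi:            # adversarial action perturbation
                a = int(np.argmin(Q[s]))
            s, r, done, _ = env.step(a)
            total += r
        returns.append(total)
    return float(np.mean(returns))
```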
Convergence Analysis
We now show that the adversarial sampling policy leads an on-policy method to the optimum of a certain criterion.
Definition Consider a finite state-action adversarial MDP with power \(\xi\). The value \(V(s)\) is the expected return from state \(s\) to the terminal states under the perturbed transition probability \(\tilde{\mathcal{P}}\). The Q-value \(Q(s, a)\) is the expected return when the agent starts in state \(s\) and performs action \(a\), with the first transition following the nominal \(\mathcal{P}\) and all subsequent transitions following \(\tilde{\mathcal{P}}\).
Definition \(\pi^*\) is the optimal policy of the adversarial MDP, and \(V^*\) and \(Q^*\) are the corresponding optimal value and Q-value.
We have \[ \begin{aligned} V^*(s) &= (1 - \xi) \max_a Q^*(s, a) + \xi \sum_{a} \bar{p}(a \mid s) Q^*(s, a) \\ &= \min_{p} \left((1 - \xi) \max_a Q^*(s, a) + \xi \sum_{a} p(a \mid s) Q^*(s, a)\right) \\ &= (1 - \xi) \max_a Q^*(s, a) + \xi \min_a Q^*(s, a),\end{aligned} \] where \(\bar{p}(\cdot \mid s)\) denotes the adversarial action distribution at state \(s\) induced by \(\overline{\mathcal{P}}\).
This yields the Bellman equation \[ \begin{aligned} Q^*(s, a) &= r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s, a, s') V^*(s') \\ &= r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s, a, s') \left((1 - \xi)\max _{a' \in \mathcal{A}} Q^*(s', a') + \xi \min _{a' \in \mathcal{A}} Q^*(s', a')\right). \qquad (1) \end{aligned} \]
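When the finite MDP is known, equation (1) can be solved directly by iterating the corresponding Bellman backup until it reaches a fixed point. The sketch below assumes reward and transition arrays `R` of shape `(S, A)` and `P` of shape `(S, A, S)`; it is one possible way to compute \(Q^*\) of the adversarial MDP, not the learning algorithm studied in this project.

```python
import numpy as np

def adversarial_value_iteration(R, P, gamma, xi, tol=1e-8, max_iter=10_000):
    """Iterate Q <- R + gamma * P @ ((1 - xi) * max_a Q + xi * min_a Q), i.e. equation (1)."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(max_iter):
        V = (1.0 - xi) * Q.max(axis=1) + xi * Q.min(axis=1)  # candidate V*(s'), shape (S,)
        Q_new = R + gamma * P @ V                            # Bellman backup, shape (S, A)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q
```

Because this backup is a \(\gamma\)-contraction in the max norm (it is the operator \(\mathbf{H}\) analyzed below), the iteration converges to the unique fixed point \(Q^*\).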
Theorem Consider using SARSA with the adversarial sampling policy to solve a finite state-action adversarial MDP with power \(\xi\). Let \(\pi_b\) denote the adversarial sampling policy in which \(\epsilon_a = \xi\) and \(\epsilon_e\) converges to zero. Let \(Q_t(s, a)\) denote the Q-value at the \(t\)-th iteration and \(\pi_t\) the policy induced by \(Q_t\).
Let \(\alpha_t(s, a)\) denote the learning rate of the state-action pair \((s, a)\) at the \(t\)-th iteration (\(\alpha_t(s, a) = 0\) if \((s, a)\) is not updated at the \(t\)-th iteration). The learning rates satisfy \(0 \leq \alpha_t(s, a) \leq 1\), \(\sum_{t} \alpha_t(s, a) = \infty\), and \(\sum_{t} \alpha_t^2(s, a) < \infty\).
Then, \(Q_t\) converges to \(Q^*\) w.p. \(1\), and thus, \(\pi_t\) converges to \(\pi^*\) w.p. \(1\).
The proof is based on a classical lemma.
Lemma Consider a stochastic process \((\alpha_t, \Delta_t, F_t)\), \(t \geq 0\), where \(\alpha_t, \Delta_t, F_t : X \to \mathbb{R}\) satisfy the equations \[ \Delta_{t+1}(x) = (1 - \alpha_t(x))\Delta_t(x) + \alpha_t(x) F_t(x), \quad x \in X, \quad t = 0, 1, 2, \ldots . \] Let \(P_t\) be a sequence of increasing \(\sigma\)-fields such that \(\alpha_0\) and \(\Delta_0\) are \(P_0\)-measurable and \(\alpha_t\), \(\Delta_t\) and \(F_{t-1}\) are \(P_t\)-measurable, \(t = 1, 2, \ldots\). Assume that the following hold:
- the set \(X\) is finite.
- \(0 \leq \alpha_t(x) \leq 1, \sum_{t} \alpha_t(x) = \infty, \sum_{t} \alpha_t^2(x) < \infty \text{ w.p. } 1.\)
- \(\|E[F_t(x)|P_t]\|_W \leq \gamma \|\Delta_t\|_W + c_t\), where \(\gamma \in [0, 1)\) and \(c_t\) converges to zero w.p. \(1\).
- \(Var[F_t(x)|P_t] \leq C(1 + \|\Delta_t\|_W )^2\), where \(C\) is some constant.
Then, \(\Delta_t\) converges to zero w.p. \(1\).
The lemma is proved in [@singh2000convergence], so we do not reproduce the proof here.
Recall the update equation in SARSA. \[ Q_{t+1}(s_t, a_t) = Q_{t}(s_t, a_t) + \alpha_t (r(s_t, a_t) + \gamma Q_{t}(s_{t+1}, a_{t+1}) - Q_{t}(s_t, a_t)). \]
Let \[ \Delta_t(s, a) = Q_t(s, a) - Q^*(s, a) ~~\text{and}~~ \alpha_t(s, a) = \left\{\begin{array}{ll}\alpha_t, &s = s_t, a = a_t \\ 0, &\text{otherwise}\end{array}\right. \]
We substitute these into \(\Delta_{t+1}(s, a) = (1 - \alpha_t(s, a))\Delta_t(s, a) + \alpha_t(s, a) F_t(s, a)\) and obtain \[ \begin{aligned} F_t(s_t, a_t) &= r(s_t, a_t) + \gamma Q_t(s_{t+1}, a_{t+1}) - Q^*(s_t, a_t) \\ &= r(s_t, a_t) + \gamma \left((1 - \xi) \max_a Q_t(s_{t+1}, a) + \xi \min_a Q_t(s_{t+1}, a) \right) - Q^*(s_t, a_t) \\ &\quad + \gamma \left(Q_t(s_{t+1}, a_{t+1}) - \left((1 - \xi) \max_a Q_t(s_{t+1}, a) + \xi \min_a Q_t(s_{t+1}, a) \right)\right) \\ &= G_t(s_t, a_t) + C_t(s_t, a_t),\end{aligned} \] where \[ \begin{aligned} G_t(s_t, a_t) &= r(s_t, a_t) + \gamma \left((1 - \xi) \max_a Q_t(s_{t+1}, a) + \xi \min_a Q_t(s_{t+1}, a) \right) - Q^*(s_t, a_t), \\ C_t(s_t, a_t) &= \gamma \left(Q_t(s_{t+1}, a_{t+1}) - \left((1 - \xi) \max_a Q_t(s_{t+1}, a) + \xi \min_a Q_t(s_{t+1}, a) \right)\right).\end{aligned} \]
We define the contraction operator \(\mathbf{H}\) for a Q-value function \(Q : \mathcal{S} \times \mathcal{A} \to \mathbb{R}\) as \[ (\mathbf{H} Q)(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s, a, s') \left((1 - \xi)\max_{a' \in \mathcal{A}} Q(s', a') + \xi \min_{a' \in \mathcal{A}} Q(s', a')\right). \]
Proposition The optimal Q-value \(Q^*\) is a fixed point of the operator \(\mathbf{H}\).
This is another statement of the Bellman equation, equation (1).
Proposition \(|E[G_t(s_t, a_t) \mid P_t]| \leq \gamma\|\Delta_t\|_{\infty}\).
We prove this by showing a stronger property: \(\mathbf{H}\) is a \(\gamma\)-contraction in the max norm. Let \(Q_1\) and \(Q_2\) be any two Q-value functions. We have \[ \begin{aligned} &\|\mathbf{H}Q_1 - \mathbf{H}Q_2\|_{\infty} \\ ={}& \max_{s, a} \Big|\gamma \sum_{s'} P(s, a, s') \left((1 - \xi)\big(\max_{a'} Q_1(s', a') - \max_{a'} Q_2(s', a')\big) + \xi\big(\min_{a'} Q_1(s', a') - \min_{a'} Q_2(s', a')\big)\right)\Big| \\ \leq{}& \gamma \max_{s'} \left((1 - \xi) \big|\max_{a'} Q_1(s', a') - \max_{a'} Q_2(s', a')\big| + \xi \big|\min_{a'} Q_1(s', a') - \min_{a'} Q_2(s', a')\big| \right) \\ \leq{}& \gamma \max_{s', a'} |Q_1(s', a') - Q_2(s', a')| = \gamma\|Q_1 - Q_2\|_{\infty}.\end{aligned} \]
We also have \(E[G_t(s_t, a_t) \mid P_t] = (\mathbf{H}Q_t)(s_t, a_t) - Q^*(s_t, a_t)\). Since \(Q^*\) is a fixed point of \(\mathbf{H}\), \((\mathbf{H}Q^*)(s_t, a_t) = Q^*(s_t, a_t)\).
Thus, \(|E[G_t(s_t, a_t) \mid P_t]| = |(\mathbf{H}Q_t)(s_t, a_t) - (\mathbf{H}Q^*)(s_t, a_t)| \leq \gamma\|Q_t - Q^*\|_{\infty} = \gamma\|\Delta_t\|_{\infty}\).
Proposition \(E[C_t(s_t, a_t) \mid P_t]\) converges to zero w.p. \(1\)
This part corresponds to the adversarial sampling policy. Under the policy with \(\epsilon_a = \xi\), conditional on \(s_{t+1}\), the action \(a_{t+1}\) is uniformly random with probability \(\epsilon_e\) and otherwise follows the \((1-\xi)\)-greedy, \(\xi\)-adversarial rule, so \[ E[C_t(s_t, a_t) \mid P_t] = \gamma\,\epsilon_e\, E\!\left[\frac{1}{|\mathcal{A}|}\sum_{a} Q_t(s_{t+1}, a) - \left((1 - \xi)\max_{a'} Q_t(s_{t+1}, a') + \xi \min_{a'} Q_t(s_{t+1}, a')\right) \,\Big|\, P_t\right]. \] Since \(\epsilon_e\) converges to zero and \(Q_t\) is bounded, \(E[C_t(s_t, a_t) \mid P_t]\) converges to zero w.p. \(1\).
Now the proof of the convergence theorem is almost done. It remains to check that \(Var[F_t(s_t, a_t) \mid P_t] \leq C(1 + \|\Delta_t\|_\infty )^2\), which follows from the boundedness of the rewards, together with the remaining simple conditions of the Lemma. This completes the proof of the Theorem.
Experiment
We conduct further experiments to show that this idea is practical in various situations. We train standard SARSA and SARSA with the adversarial sampling policy on the standard CartPole-v0 environment.
The training curves are shown in Figure 1. The orange and red lines show that the convergence speed of the adversarial sampling policy is almost the same as that of epsilon greedy, i.e., SARSA with the adversarial sampling policy retains the efficiency of the conventional method.
Then, we test the two agents in environments with different cart masses (see Figure 2). Both agents perform well when the mass is close to the one in the training environment. However, the adversarial sampling policy performs better when the mass differs greatly from the training one, which indicates that it improves the robustness of the model.
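A sketch of how such a mass-robustness test might be scripted is given below. It assumes the classic gym interface for CartPole-v0, that the unwrapped environment exposes `masscart`, `masspole`, and `total_mass`, and that a learned tabular `Q` together with a `discretize` function (mapping the continuous observation to a table index) are available; these names are illustrative, not the exact code behind Figure 2.

```python
import gym
import numpy as np

def evaluate_with_mass(Q, discretize, masscart, episodes=100):
    """Average return of the greedy policy on CartPole-v0 with a modified cart mass."""
    env = gym.make("CartPole-v0")
    env.unwrapped.masscart = masscart
    env.unwrapped.total_mass = env.unwrapped.masspole + masscart  # keep the dynamics consistent
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            s = discretize(obs)                                   # continuous state -> table index
            obs, r, done, _ = env.step(int(np.argmax(Q[s])))
            total += r
        returns.append(total)
    return float(np.mean(returns))

# e.g. sweep the cart mass around the nominal value
# scores = {m: evaluate_with_mass(Q, discretize, m) for m in (0.5, 1.0, 2.0, 4.0)}
```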
Conclusion
Although we mainly discuss the adversarial sampling policy for on-policy methods, the idea can also be applied to off-policy methods with some modifications. However, it does not work for methods that do not compute \(Q(s, a)\), such as actor-critic, since there is no criterion for choosing the worst action. Moreover, the method is not compatible with the replay-memory technique in DQN. Thus, the range of application of this idea is limited, and studying how to use adversarial samples in Q-learning, DQN, and policy-gradient approaches is left for future work.
Despite these limitations, the project also has some highlights. We point out and explore a new idea: adding adversarial samples to improve robustness. The theoretical analysis also provides insights for deeper study.
In summary, we proposed a simple and easy-to-implement sampling policy that can improve the robustness of on-policy methods. We proved that SARSA with the adversarial sampling policy converges to the optimum of the adversarial MDP. We conducted experiments with the adversarial sampling policy and observed that it performs well on some problems. Finally, we pointed out some limitations and directions for future work.
References
[1] Haochen Zhang and Yao Shunyu. Robustness in RL. Course project, 2017.
[2] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
[3] Matthias Heger. Risk and reinforcement learning: Concepts and dynamic programming. ZKW, 1994.
[4] Chris Gaskett. Reinforcement learning under circumstances beyond its control. 2003.
[5] L. A. Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems, pages 252–260, 2013.
[6] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, pages 1352–1361. JMLR.org, 2017.
[7] Adam Gleave, Michael Dennis, Neel Kant, Cody Wild, Sergey Levine, and Stuart Russell. Adversarial policies: Attacking deep reinforcement learning. arXiv preprint arXiv:1905.10615, 2019.
[8] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 2817–2826. JMLR.org, 2017.
[9] Satinder Singh, Tommi Jaakkola, Michael L. Littman, and Csaba Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308, 2000.