Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement Learning (2020)
Goal
- MOReL learns a pessimistic MDP (P-MDP) from the dataset and uses it for policy search.
- The P-MDP partitions the state-action space into known and unknown regions, and forces a transition to a low-reward absorbing state (HALT) from unknown regions. (In the paper's illustration, known regions are shown in green, unknown regions in orange, and blue dots denote the support of the dataset.)
Algorithm
1. Learning the dynamics model
The first step involves using the offline dataset to learn an approximate dynamics model $\hat{P}(\cdot \mid s, a)$. Since the offline dataset may not span the entire state space, the learned model may not be globally accurate. A naïve MBRL approach that directly plans with the learned model may therefore overestimate rewards in unfamiliar parts of the state space, resulting in a highly sub-optimal policy. We overcome this with the next step.
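To make this step concrete, here is a minimal sketch (PyTorch, not the paper's exact architecture or hyperparameters) of fitting an ensemble of Gaussian MLP dynamics models to the offline transitions by maximum likelihood; the ensemble is reused in the later sketches for uncertainty estimation.

```python
# Minimal sketch: fit an ensemble of Gaussian MLP dynamics models to the
# offline dataset by maximum likelihood. Architecture and hyperparameters
# are illustrative, not the paper's.
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """Predicts mean and log-std of the next-state change (s' - s)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * state_dim),          # [mean, log_std]
        )

    def forward(self, s, a):
        mean, log_std = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        return mean, log_std.clamp(-5.0, 2.0)

def train_ensemble(models, optimizers, dataset, steps=10_000, batch_size=256):
    """dataset: dict of tensors with keys 'obs', 'act', 'next_obs'."""
    n = dataset["obs"].shape[0]
    for _ in range(steps):
        for model, opt in zip(models, optimizers):
            idx = torch.randint(0, n, (batch_size,))    # independent mini-batches per member
            s, a = dataset["obs"][idx], dataset["act"][idx]
            delta = dataset["next_obs"][idx] - s        # predict the state change
            mean, log_std = model(s, a)
            # Gaussian negative log-likelihood (up to additive constants)
            nll = (((delta - mean) / log_std.exp()) ** 2 + 2 * log_std).mean()
            opt.zero_grad()
            nll.backward()
            opt.step()
```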
2. Unknown state-action detector (USAD)
Analogous to hypothesis testing, partition the state-action space into known and unknown regions based on the accuracy of the learned model:

$$U^{\alpha}(s, a)= \begin{cases}\text{FALSE (i.e., known)} & \text{if } D_{TV}(\hat{P}(\cdot \mid s, a), P(\cdot \mid s, a)) \leq \alpha \text{ can be guaranteed} \\ \text{TRUE (i.e., unknown)} & \text{otherwise}\end{cases}$$

Here $D_{TV}(\hat{P}(\cdot \mid s, a), P(\cdot \mid s, a))$ denotes the total variation distance between $\hat{P}(\cdot \mid s, a)$ and $P(\cdot \mid s, a)$.
Two factors contribute to USAD's effectiveness:
* data availability: having sufficient data points "close" to the query
* quality of representations: certain representations, like those based on physics, can lead to better generalization guarantees.
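Since the true error $D_{TV}$ cannot be computed from offline data, a practical USAD typically uses ensemble disagreement as a proxy. A minimal sketch, assuming the `GaussianDynamics` ensemble from the snippet above; the threshold is a tunable hyperparameter.

```python
# Sketch of a practical USAD: threshold the maximum pairwise disagreement
# between ensemble mean predictions as a stand-in for the true model error.
import torch

def usad(models, s, a, threshold):
    """Returns a boolean mask: True where (s, a) is treated as unknown."""
    with torch.no_grad():
        means = torch.stack([m(s, a)[0] for m in models])         # (K, B, state_dim)
    # maximum pairwise L2 distance between ensemble members' predictions
    disc = (means.unsqueeze(0) - means.unsqueeze(1)).norm(dim=-1).amax(dim=(0, 1))
    return disc > threshold                                        # shape (B,)
```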
3. Pessimistic MDP construction
Construct the P-MDP with transition dynamics and reward

$$\hat{P}_{p}\left(s^{\prime} \mid s, a\right)=\begin{cases}\delta\left(s^{\prime}=\mathrm{HALT}\right) & \text{if } U^{\alpha}(s, a)=\mathrm{TRUE} \text{ or } s=\mathrm{HALT} \\ \hat{P}\left(s^{\prime} \mid s, a\right) & \text{otherwise}\end{cases} \qquad r_{p}(s, a)=\begin{cases}-\kappa & \text{if } U^{\alpha}(s, a)=\mathrm{TRUE} \text{ or } s=\mathrm{HALT} \\ r(s, a) & \text{otherwise}\end{cases}$$

Here $\delta\left(s^{\prime}=\mathrm{HALT}\right)$ is the Dirac delta function, which forces the MDP to transition to the absorbing state HALT. Unknown state-action pairs receive a reward of $-\kappa$, while all known state-action pairs receive the same reward as in the environment. The P-MDP heavily punishes policies that visit unknown states, thereby providing a safeguard against distribution shift and model exploitation.
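A sketch of one P-MDP transition under the same assumptions as the earlier snippets (`usad` and the ensemble); `reward_fn`, `halt_state`, and `kappa` are placeholders for the environment reward, a designated absorbing-state vector, and the penalty.

```python
# Sketch of a single P-MDP step: unknown (s, a) pairs jump to the absorbing
# HALT state with reward -kappa; known pairs follow the learned model and
# receive the usual reward.
import torch

def pmdp_step(models, reward_fn, s, a, threshold, kappa, halt_state):
    unknown = usad(models, s, a, threshold)                       # (B,) bool
    mean, log_std = models[0](s, a)                               # any single ensemble member
    next_s = s + mean + log_std.exp() * torch.randn_like(mean)    # sample s' from the model
    next_s = torch.where(unknown.unsqueeze(-1), halt_state, next_s)
    r = reward_fn(s, a)                                           # (B,) placeholder reward
    r = torch.where(unknown, torch.full_like(r, -kappa), r)
    done = unknown                                                # HALT is absorbing
    return next_s, r, done
```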
4. Planning
Perform planning in the P-MDP defined above. For simplicity, we assume a planning oracle that returns an $\epsilon_{\pi}$ -sub-optimal policy in the P-MDP. A number of algorithms based on MPC, search-based planning, dynamic programming, or policy optimization can be used to approximately realize this.
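A rough sketch of this step, treating the P-MDP as a simulator for an off-the-shelf policy optimizer; `policy`, `update_policy`, and `init_states` are placeholders (e.g., a neural policy, an NPG/TRPO/SAC update, and start states drawn from the dataset).

```python
# Sketch of planning in the P-MDP: roll out the current policy in the
# pessimistic model and feed the trajectories to any model-free RL update.
def plan_in_pmdp(policy, update_policy, init_states, models, reward_fn,
                 threshold, kappa, halt_state, horizon=500, iters=1000):
    for _ in range(iters):
        s = init_states.clone()                      # start from dataset states
        trajectory = []
        for _ in range(horizon):
            a = policy(s)
            s_next, r, done = pmdp_step(models, reward_fn, s, a,
                                        threshold, kappa, halt_state)
            trajectory.append((s, a, r, s_next, done))
            s = s_next
        update_policy(policy, trajectory)            # placeholder RL update
    return policy
```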
Benchmark results (Toy Gym)
Conclusion
- Importance of pessimistic MDP
- Compare MOReL with a naive MBRL approach that first learns a dynamics model using the offline data without any safeguards against model inaccuracy.
- The naive MBRL approach already works well, achieving results comparable to prior algorithms like BCQ and BEAR. However, MOReL clearly exhibits more stable and monotonic learning progress.
- Furthermore, in the case of naive MBRL, performance can quickly degrade after a few hundred steps of policy improvement.
- Transfer from pessimistic MDP to environment
- The MOReL analysis suggests that the value of a policy in the P-MDP cannot substantially exceed its value in the environment.
- This makes the value in the P-MDP an approximate lower bound on the true performance, and a good surrogate for the true MDP during optimization.
- The authors observe that the value in the true environment closely correlates with the value in the P-MDP. In particular, the P-MDP value never substantially exceeds the true performance, suggesting that the pessimism helps avoid model exploitation.
- But MOReL constructs terminating states based on a hard threshold on uncertainty.
MOPO: Model-based Offline Policy Optimization (2020)
Goal
- Design an offline model-based reinforcement learning algorithm that can take actions that are not strictly within the support of the behavioral distribution.
- Balance the return and risk, since models become increasingly inaccurate further from the behavioral distribution (the failure mode of vanilla model-based policy optimization); the trade-off is between:
- the potential gain in performance by escaping the behavioral distribution and finding a better policy
- the risk of overfitting to the errors of the dynamics at regions far away from the behavioral distribution.
- To achieve the optimal balance, bound the return from below by the return of a constructed model MDP penalized by the uncertainty of the dynamics, and maximize this conservative estimate of the return with an off-the-shelf reinforcement learning algorithm (MOPO).
Preliminaries
- $T\left(s^{\prime} \mid s, a\right)=$ the transition dynamics (True dynamics)
- $\eta_{M}(\pi):=\underset{\pi, T, \mu_{0}}{\mathbb{E}}\left[\sum_{t=0}^{\infty} \gamma^{t} r\left(s_{t}, a_{t}\right)\right]$ = the expected discounted return (the goal is to maximize this)
- $\eta_{\widehat{M}}(\pi)=$ a natural estimator of the true return $\eta_{M}(\pi)$, i.e., the return of $\pi$ under the learned dynamics $\widehat{T}$
- Behavioral distribution $=$ the state-action distribution from which the offline dataset $\mathcal{D}_{\text{env}}$ was sampled
- $\widehat{T}$ defines a model $\operatorname{MDP} \widehat{M}=\left(\mathcal{S}, \mathcal{A}, \widehat{T}, r, \mu_{0}, \gamma\right)$
- $\mathbb{P}_{\widehat{T}, t}^{\pi}(s)=$ the probability of being in state $s$ at time step $t$ if actions are sampled according to $\pi$ and transitions according to $\widehat{T}$
- $\rho_{\widehat{T}}^{\pi}(s, a)=$ the discounted occupancy measure of policy $\pi$ under dynamics $\widehat{T}: \rho_{\widehat{T}}^{\pi}(s, a):=\pi(a \mid s) \sum_{t=0}^{\infty} \gamma^{t} \mathbb{P}_{\widehat{T}, t}^{\pi}(s)$
Algorithm
- Quantifying the uncertainty: from the dynamics to the total return
Let $G_{\widehat{M}}^{\pi}(s, a):=\underset{s^{\prime} \sim \widehat{T}(s, a)}{\mathbb{E}}\left[V_{M}^{\pi}\left(s^{\prime}\right)\right]-\underset{s^{\prime} \sim T(s, a)}{\mathbb{E}}\left[V_{M}^{\pi}\left(s^{\prime}\right)\right]$. A telescoping argument then gives

$$\eta_{\widehat{M}}(\pi)-\eta_{M}(\pi)=\gamma \underset{(s, a) \sim \rho_{\widehat{T}}^{\pi}}{\mathbb{E}}\left[G_{\widehat{M}}^{\pi}(s, a)\right], \tag{1}$$

and therefore

$$\eta_{M}(\pi) \geq \underset{(s, a) \sim \rho_{\widehat{T}}^{\pi}}{\mathbb{E}}\left[r(s, a)-\gamma\left|G_{\widehat{M}}^{\pi}(s, a)\right|\right]. \tag{2}$$
By definition, $G_{\widehat{M}}^{\pi}(s, a)$ measures the difference between $M$ and $\widehat{M}$ under the test function $V^{\pi}$. By equation (1), it governs the difference between the performances of $\pi$ in the two MDPs. If we could estimate $G_{\widehat{M}}^{\pi}(s, a)$ or bound it from above, then we could use the RHS of (1) as an upper bound for the estimation error of $\eta_{M}(\pi)$.
Moreover, equation (2) suggests that a policy that obtains high reward in the estimated MDP while also minimizing $G_{\widehat{M}}^{\pi}(s, a)$ will obtain high reward in the real MDP.
However, computing $G_{\widehat{M}}^{\pi}(s, a)$ remains elusive because it depends on the unknown function $V_{M}^{\pi}$. Leveraging properties of $V_{M}^{\pi}$, we will replace $G_{\widehat{M}}^{\pi}(s, a)$ by an upper bound that depends solely on the error of the dynamics $\widehat{T}$.
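One way to see the bound alluded to here (a sketch of MOPO's argument, stated informally): if $V_{M}^{\pi}$ lies in $c \mathcal{F}$ for some function class $\mathcal{F}$, then by the definition of the integral probability metric $d_{\mathcal{F}}$,

$$\left|G_{\widehat{M}}^{\pi}(s, a)\right|=\left|\underset{s^{\prime} \sim \widehat{T}(s, a)}{\mathbb{E}}\left[V_{M}^{\pi}\left(s^{\prime}\right)\right]-\underset{s^{\prime} \sim T(s, a)}{\mathbb{E}}\left[V_{M}^{\pi}\left(s^{\prime}\right)\right]\right| \leq c \, d_{\mathcal{F}}\left(\widehat{T}(s, a), T(s, a)\right) \leq c \, u(s, a),$$

where an "admissible" error estimator $u(s, a)$ is one that upper-bounds the model error $d_{\mathcal{F}}\left(\widehat{T}(s, a), T(s, a)\right)$ for all $(s, a)$.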
Given an admissible error estimator, we define the uncertainty-penalized reward $\tilde{r}(s, a):=r(s, a)-\lambda u(s, a)$ where $\lambda:=\gamma c$, and the uncertainty-penalized MDP $\widetilde{M}=\left(\mathcal{S}, \mathcal{A}, \widehat{T}, \tilde{r}, \mu_{0}, \gamma\right)$. We observe that $\widetilde{M}$ is conservative in that the return under it bounds the true return from below: $\eta_{\widetilde{M}}(\pi) \leq \eta_{M}(\pi)$ for every policy $\pi$.
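A minimal sketch of the penalized reward, again assuming the Gaussian ensemble from the MOReL section above; here $u(s, a)$ is approximated by the largest predicted standard-deviation norm across ensemble members, one of several heuristics the paper considers.

```python
# Sketch of the uncertainty-penalized reward r~(s, a) = r(s, a) - lambda * u(s, a),
# with u(s, a) approximated by the maximum ensemble standard-deviation norm.
import torch

def penalized_reward(models, reward_fn, s, a, lam):
    with torch.no_grad():
        stds = torch.stack([m(s, a)[1].exp() for m in models])   # (K, B, state_dim)
    u = stds.norm(dim=-1).amax(dim=0)                            # (B,) uncertainty proxy
    return reward_fn(s, a) - lam * u
```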
- Policy optimization on uncertainty-penalized MDPs
Optimize the policy on the uncertainty-penalized MDP $\widetilde{M}$, as specified in Algorithm 1 of the paper.
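A compressed sketch of that optimization loop (in the spirit of the paper's Algorithm 1, not a faithful reproduction): short branched rollouts from dataset states populate a model buffer with penalized rewards, and a standard actor-critic update consumes the data. `sample_batch` and `sac_update` are placeholders.

```python
# Sketch of policy optimization on the penalized MDP: branch short model
# rollouts from real dataset states, penalize rewards by uncertainty, and
# apply any off-the-shelf RL update (placeholder: sac_update).
import torch

def mopo_train(policy, models, reward_fn, dataset, lam,
               rollout_length=5, epochs=1000):
    model_buffer = []
    for _ in range(epochs):
        s = sample_batch(dataset)["obs"]               # branch from real states (placeholder)
        for _ in range(rollout_length):
            a = policy(s)
            mean, log_std = models[0](s, a)
            s_next = s + mean + log_std.exp() * torch.randn_like(mean)
            r_tilde = penalized_reward(models, reward_fn, s, a, lam)
            model_buffer.append((s, a, r_tilde, s_next))
            s = s_next
        sac_update(policy, model_buffer, dataset)      # placeholder RL update
    return policy
```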
Benchmark results (Toy Gym)
- Unlike MOReL, which constructs terminating states based on a hard threshold on uncertainty, MOPO uses a soft reward penalty to incorporate uncertainty. (A potential benefit of a soft penalty is that the policy is allowed to take a few risky actions and then return to the confident area near the behavioral distribution without being terminated.)
- Concretely, MOPO constructs the uncertainty-penalized MDP $\widetilde{M}=\left(\mathcal{S}, \mathcal{A}, \widehat{T}, \tilde{r}, \mu_{0}, \gamma\right)$, where the reward is given by $\tilde{r}(s, a)=\hat{r}(s, a)-\lambda u(s, a)$ and the dynamics by the learned model, and learns a policy in this MDP, which has the property that $\eta_{\widetilde{M}}(\pi) \leq \eta_{M}(\pi)$ for all $\pi$. By constructing and optimizing such a lower bound, offline model-based RL algorithms (MOPO and MOReL) avoid pitfalls like model bias and distribution shift.
- However, both MOReL and MOPO rely strongly on uncertainty quantification, which is challenging for complex datasets or deep neural network models.
References
[1] R. Kidambi et al. MOReL: Model-Based Offline Reinforcement Learning. NeurIPS 2020.
[2] T. Yu et al. MOPO: Model-Based Offline Policy Optimization. NeurIPS 2020.