Model-Based Offline Reinforcement Learning


MOReL: Model-Based Offline Reinforcement Learning (2020)

Goal


  1. MOReL learns a pessimistic MDP (P-MDP) from the dataset and uses it for policy search.
  2. P-MDP partitions the state-action space into known (green) and unknown (orange) regions, and also forces a transition to a low reward absorbing state (HALT) from unknown regions. Blue dots denote the support in the dataset.

Algorithm


1. Learning the dynamics model

The first step involves using the offline dataset to learn an approximate dynamics model $\hat{P}(\cdot \mid s, a)$. Since the offline dataset may not span the entire state space, the learned model may not be globally accurate. A naïve MBRL approach that directly plans with the learned model may therefore overestimate rewards in unfamiliar parts of the state space, resulting in a highly sub-optimal policy. The next step addresses this.
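In the paper's practical instantiation, this step fits an ensemble of MLP dynamics models to the offline data; the ensemble also powers the disagreement-based detector in the next step. Below is a minimal PyTorch sketch under that assumption; the class and helper names are hypothetical, not the authors' code.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """One ensemble member: an MLP that predicts the next-state delta."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),          # predicts s' - s
        )

    def forward(self, s, a):
        return s + self.net(torch.cat([s, a], dim=-1))

def fit_ensemble(dataset, state_dim, action_dim, n_models=4, steps=10_000, lr=1e-3):
    """dataset: dict of tensors with keys 's', 'a', 's_next' (the offline data)."""
    models = [DynamicsModel(state_dim, action_dim) for _ in range(n_models)]
    for model in models:                           # different init + bootstrapped minibatches
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(steps):
            idx = torch.randint(0, dataset['s'].shape[0], (256,))
            pred = model(dataset['s'][idx], dataset['a'][idx])
            loss = ((pred - dataset['s_next'][idx]) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return models
```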

2. Unknown state-action detector (USAD)

Analogous to hypothesis testing, partition the state-action space into known and unknown regions based on the accuracy of the learned model:

$$U^{\alpha}(s, a)=\begin{cases}\text{FALSE (known)} & \text{if } D_{TV}\left(\hat{P}(\cdot \mid s, a), P(\cdot \mid s, a)\right) \leq \alpha\\ \text{TRUE (unknown)} & \text{otherwise.}\end{cases}$$

In practice, the model error is approximated by the disagreement among the members of the learned dynamics ensemble.

3. Constructing the pessimistic MDP (P-MDP)

The P-MDP augments the state space with a low-reward absorbing state HALT and modifies the learned transitions and rewards as

$$\hat{P}_{p}\left(s^{\prime} \mid s, a\right)=\begin{cases}\delta\left(s^{\prime}=\mathrm{HALT}\right) & \text{if } U^{\alpha}(s, a)=\text{TRUE or } s=\mathrm{HALT}\\ \hat{P}\left(s^{\prime} \mid s, a\right) & \text{otherwise,}\end{cases} \qquad r_{p}(s, a)=\begin{cases}-\kappa & \text{if } U^{\alpha}(s, a)=\text{TRUE or } s=\mathrm{HALT}\\ r(s, a) & \text{otherwise.}\end{cases}$$

$\delta\left(s^{\prime}=\mathrm{HALT}\right)$ is the Dirac delta function, which forces the MDP to transition to the absorbing state HALT. For unknown state-action pairs, use a reward of $-\kappa$, while all known state-actions receive the same reward as in the environment. The P-MDP heavily punishes policies that visit unknown states, thereby providing a safeguard against distribution shift and model exploitation.
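To make steps 2 and 3 concrete, here is a minimal sketch that builds on the hypothetical ensemble above: the USAD flags a state-action pair as unknown when the ensemble members disagree beyond a threshold (a practical surrogate, since the true model error is unavailable), and the P-MDP step routes unknown pairs to the absorbing HALT state with reward $-\kappa$. The names (`usad_unknown`, `pmdp_step`, `reward_fn`) are illustrative, and the reward function is assumed to be known.

```python
import torch

def usad_unknown(models, s, a, threshold):
    """USAD: declare (s, a) unknown when ensemble predictions disagree too much."""
    preds = torch.stack([m(s, a) for m in models])           # (n_models, state_dim)
    disagreement = (preds.max(0).values - preds.min(0).values).norm()
    return bool(disagreement > threshold)

def pmdp_step(models, reward_fn, s, a, threshold, kappa):
    """One transition of the pessimistic MDP built from the learned ensemble."""
    if usad_unknown(models, s, a, threshold):
        # Unknown region: jump to the absorbing HALT state with reward -kappa.
        return None, -kappa, True
    s_next = torch.stack([m(s, a) for m in models]).mean(0)  # e.g. the ensemble mean
    return s_next, float(reward_fn(s, a)), False
```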

4. Planning

Perform planning in the P-MDP defined above. For simplicity, we assume a planning oracle that returns an $\epsilon_{\pi}$-sub-optimal policy in the P-MDP. A number of algorithms based on MPC, search-based planning, dynamic programming, or policy optimization can be used to approximately realize this.
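Any off-the-shelf optimizer can play the role of the planning oracle; MOReL itself uses natural policy gradient. Purely as an illustration, here is a REINFORCE-style loop that trains a policy on P-MDP rollouts alone. `pmdp_step(s, a)` is expected to return `(s_next, reward, halted)` and can be, e.g., a `functools.partial` of the hypothetical step function above; `policy(s)` is assumed to return a `torch.distributions` object.

```python
import torch

def plan_in_pmdp(policy, pmdp_step, start_states, horizon=500, iters=1000,
                 gamma=0.999, lr=3e-4):
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(iters):
        # Roll out the current policy entirely inside the pessimistic MDP.
        s = start_states[torch.randint(0, start_states.shape[0], (1,))].squeeze(0)
        log_probs, rewards = [], []
        for _ in range(horizon):
            dist = policy(s)
            a = dist.sample()
            log_probs.append(dist.log_prob(a).sum())
            s, r, halted = pmdp_step(s, a)
            rewards.append(float(r))
            if halted:                     # HALT is absorbing: the rollout ends here
                break
        # Discounted returns-to-go, then a plain REINFORCE update.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        loss = -(torch.stack(log_probs) * torch.tensor(returns)).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return policy
```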

Benchmark results (Toy Gym)


Conclusion

  1. Importance of pessimistic MDP


  2. Transfer from pessimistic MDP to environment

MOPO: Model-based Offline Policy Optimization (2020)

Goal

  1. Design an offline model-based reinforcement learning algorithm that can take actions that are not strictly within the support of the behavioral distribution.

  2. Balance return and risk, since the learned model becomes increasingly inaccurate further from the behavioral distribution (the failure mode of vanilla model-based policy optimization).

  3. To achieve this balance, bound the true return from below by the return in a constructed model MDP penalized by the uncertainty of the dynamics, and maximize this conservative estimate with an off-the-shelf reinforcement learning algorithm (MOPO).

Preliminaries

We consider an MDP $M=\left(\mathcal{S}, \mathcal{A}, T, r, \mu_{0}, \gamma\right)$ with dynamics $T$, reward $r$, initial state distribution $\mu_{0}$, and discount $\gamma$. $V_{M}^{\pi}$ denotes the value function of a policy $\pi$ in $M$ and $\eta_{M}(\pi)$ its expected discounted return. $\widehat{M}$ is the MDP obtained by replacing $T$ with the learned dynamics $\widehat{T}$, and $\rho_{\widehat{T}}^{\pi}$ denotes the discounted state-action occupancy of $\pi$ under $\widehat{T}$.

Algorithm

  1. Quantifying the uncertainty: from the dynamics to the total return

Define

$$G_{\widehat{M}}^{\pi}(s, a):=\underset{s^{\prime} \sim \widehat{T}(s, a)}{\mathbb{E}}\left[V_{M}^{\pi}\left(s^{\prime}\right)\right]-\underset{s^{\prime} \sim T(s, a)}{\mathbb{E}}\left[V_{M}^{\pi}\left(s^{\prime}\right)\right].$$

A telescoping argument over the horizon then gives

$$\eta_{\widehat{M}}(\pi)-\eta_{M}(\pi)=\gamma \underset{(s, a) \sim \rho_{\widehat{T}}^{\pi}}{\mathbb{E}}\left[G_{\widehat{M}}^{\pi}(s, a)\right], \tag{1}$$

and rearranging,

$$\eta_{M}(\pi)=\underset{(s, a) \sim \rho_{\widehat{T}}^{\pi}}{\mathbb{E}}\left[r(s, a)\right]-\gamma \underset{(s, a) \sim \rho_{\widehat{T}}^{\pi}}{\mathbb{E}}\left[G_{\widehat{M}}^{\pi}(s, a)\right] \geq \underset{(s, a) \sim \rho_{\widehat{T}}^{\pi}}{\mathbb{E}}\left[r(s, a)-\gamma\left|G_{\widehat{M}}^{\pi}(s, a)\right|\right]. \tag{2}$$

By definition, $G_{\widehat{M}}^{\pi}(s, a)$ measures the difference between $M$ and $\widehat{M}$ under the test function $V_{M}^{\pi}$. By equation (1), it governs the difference between the performance of $\pi$ in the two MDPs. If we could estimate $G_{\widehat{M}}^{\pi}(s, a)$ or bound it from above, then we could use the RHS of (1) as an upper bound for the estimation error of $\eta_{M}(\pi)$.

Moreover, equation (2) suggests that a policy that obtains high reward in the estimated MDP while also minimizing $G_{\widehat{M}}^{\pi}(s, a)$ will obtain high reward in the real MDP.

However, computing $G_{\widehat{M}}^{\pi}(s, a)$ remains elusive because it depends on the unknown function $V_{M}^{\pi}$. Leveraging properties of $V_{M}^{\pi}$, we can replace $G_{\widehat{M}}^{\pi}(s, a)$ by an upper bound that depends solely on the error of the dynamics $\widehat{T}$: if $V_{M}^{\pi}$ lies in $c \mathcal{F}$ for some function class $\mathcal{F}$, then $\left|G_{\widehat{M}}^{\pi}(s, a)\right| \leq c\, d_{\mathcal{F}}\left(\widehat{T}(s, a), T(s, a)\right)$, where $d_{\mathcal{F}}$ is the integral probability metric induced by $\mathcal{F}$. An error estimator $u(s, a)$ is called admissible if it upper-bounds this distance, i.e. $d_{\mathcal{F}}\left(\widehat{T}(s, a), T(s, a)\right) \leq u(s, a)$.

Given an admissible error estimator, we define the uncertainty-penalized reward $\tilde{r}(s, a):=r(s, a)-\lambda u(s, a)$ where $\lambda:=\gamma c$, and the uncertainty-penalized MDP $\widetilde{M}=\left(\mathcal{S}, \mathcal{A}, \widehat{T}, \tilde{r}, \mu_{0}, \gamma\right)$. We observe that $\widetilde{M}$ is conservative in that the return under it bounds from below the true return:

$$\eta_{M}(\pi) \geq \underset{(s, a) \sim \rho_{\widehat{T}}^{\pi}}{\mathbb{E}}\left[r(s, a)-\lambda u(s, a)\right]=\eta_{\widetilde{M}}(\pi).$$
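In the practical version described in the paper, the dynamics model is an ensemble of Gaussian networks and the error estimator $u(s, a)$ is taken as the largest predicted standard-deviation norm across the ensemble. A small sketch of that heuristic, with hypothetical names:

```python
import torch

def uncertainty(gaussian_models, s, a):
    """u(s, a): the largest predicted next-state std-dev norm across the ensemble.
    Each model is assumed to return (mean, std) of a Gaussian over the next state."""
    return max(model(s, a)[1].norm().item() for model in gaussian_models)

def penalized_reward(r, s, a, gaussian_models, lam):
    """The uncertainty-penalized reward r_tilde(s, a) = r(s, a) - lambda * u(s, a)."""
    return r - lam * uncertainty(gaussian_models, s, a)
```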

  2. Policy optimization on uncertainty-penalized MDPs

Optimize the policy on the uncertainty-penalized MDP $\widetilde{M}$, as summarized in Algorithm 1 (see the sketch below).
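The loop itself follows the MBPO-style recipe the paper builds on: short rollouts branched from dataset states inside the learned model, rewards penalized as above, and SAC updates on a mixture of real and model transitions. A structural sketch in which every callable is a hypothetical stand-in for an off-the-shelf component:

```python
def mopo_train(d_env, model_step, uncertainty, policy, sac_update,
               sample_start_states, sample_mixed_batches,
               epochs=1000, n_rollouts=400, rollout_len=5, lam=1.0):
    """d_env: the offline dataset; model_step(s, a) -> (s_next, r) samples the
    learned dynamics ensemble; uncertainty(s, a) is u(s, a) from the sketch above."""
    d_model = []                                   # buffer of model-generated transitions
    for _ in range(epochs):
        # 1) Branch short rollouts in the learned model from states in the real data,
        #    storing the uncertainty-penalized reward r_tilde = r - lambda * u.
        for s in sample_start_states(d_env, n_rollouts):
            for _ in range(rollout_len):
                a = policy.sample(s)
                s_next, r = model_step(s, a)
                d_model.append((s, a, r - lam * uncertainty(s, a), s_next))
                s = s_next
        # 2) Update the policy with SAC on a mixture of real and model transitions.
        for batch in sample_mixed_batches(d_env, d_model):
            sac_update(policy, batch)
    return policy
```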


Benchmark results (Toy Gym)


References

[1] R. Kidambi et al. MOReL: Model-Based Offline Reinforcement Learning. 2020.

[2] T. Yu et al. MOPO: Model-based Offline Policy Optimization. 2020.
