Efficient Online Reinforcement Learning with Offline Data
Three key points they use:
- Symmetric sampling: draw 50% of each batch from the online replay buffer and the remaining 50% from the offline data buffer (see the sketch after this list).
- Layer normalization in the critic to mitigate catastrophic overestimation of the value function.
- Improving sample efficiency by increasing the number of gradient updates per environment step (the update-to-data ratio). To counter the overfitting this introduces, regularization from prior work such as L2 normalization, dropout, or random ensemble distillation can be used.
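
A minimal sketch of how these three points fit together in a SAC-style training loop. The helper names (`online_buffer`, `offline_buffer`, `critic_update`) and hyperparameters are illustrative assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q-network with LayerNorm after each hidden layer to curb value over-estimation."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def symmetric_sample(online_buffer, offline_buffer, batch_size):
    """Draw half the batch from online replay data, half from the offline dataset."""
    half = batch_size // 2
    on = online_buffer.sample(half)               # assumed .sample(n) -> dict of tensors
    off = offline_buffer.sample(batch_size - half)
    return {k: torch.cat([on[k], off[k]], dim=0) for k in on}

# High update-to-data (UTD) ratio: several gradient updates per environment step.
UTD_RATIO = 20

def train_step(online_buffer, offline_buffer, critic_update, batch_size=256):
    for _ in range(UTD_RATIO):
        batch = symmetric_sample(online_buffer, offline_buffer, batch_size)
        critic_update(batch)                      # assumed gradient step on the critic(s)
```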
They also use clipped double Q-learning to mitigate over-estimation of the value function, and add a maximum-entropy reward term to encourage exploration.
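
A sketch of the corresponding critic target, assuming two target critics `q1_t` and `q2_t`, a policy `actor` that returns an action and its log-probability, and an entropy coefficient `alpha` (all names are illustrative):

```python
import torch

@torch.no_grad()
def td_target(reward, next_obs, done, q1_t, q2_t, actor, alpha=0.2, gamma=0.99):
    """Clipped double Q-learning target with a maximum-entropy bonus (SAC-style)."""
    next_act, next_logp = actor(next_obs)         # assumed to return (action, log pi)
    # Take the minimum of the two target critics to mitigate over-estimation.
    q_min = torch.min(q1_t(next_obs, next_act), q2_t(next_obs, next_act)).squeeze(-1)
    # Subtracting alpha * log pi is the entropy reward that drives exploration.
    return reward + gamma * (1.0 - done) * (q_min - alpha * next_logp)
```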