Three key points they used:
- Symmetric sampling: sample 50% of each training batch from the online replay buffer and the remaining 50% from the offline data buffer (see the sampling sketch after this list).
- Layer normalization in the critic to mitigate catastrophic overestimation of the value function (a sketch follows the list).
- Improving sample efficiency by increasing the number of gradient updates per collected transition (the update-to-data, or UTD, ratio). To avoid the overfitting this introduces, techniques from prior work such as L2 normalization, dropout, and random ensemble distillation can be used (see the loop sketch below).
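
A minimal sketch of the symmetric sampling scheme. The buffer interface (a `sample(n)` method returning a dict of arrays) and the field names are assumptions for illustration, not the paper's actual code:

```python
import numpy as np

def sample_symmetric(online_buffer, offline_buffer, batch_size):
    """Build a training batch: half online transitions, half offline.

    Assumes each buffer exposes sample(n) returning a dict of numpy arrays
    keyed by 'obs', 'action', 'reward', 'next_obs', 'done' (hypothetical
    interface).
    """
    half = batch_size // 2
    online = online_buffer.sample(half)                  # fresh interaction data
    offline = offline_buffer.sample(batch_size - half)   # fixed offline dataset
    # Concatenate the two halves field by field into a single batch.
    return {k: np.concatenate([online[k], offline[k]], axis=0) for k in online}
```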
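For the second point, a sketch of a Q-network with layer normalization after each hidden layer, written in PyTorch; the hidden sizes and depth are illustrative, not necessarily the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LayerNormCritic(nn.Module):
    """Q(s, a) MLP with LayerNorm after each hidden layer (illustrative sizes)."""

    def __init__(self, obs_dim, act_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),   # normalize activations to curb value blow-up
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),   # scalar Q-value
        )

    def forward(self, obs, action):
        return self.net(torch.cat([obs, action], dim=-1))
```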
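A high UTD ratio then just means several critic updates per environment step. A rough skeleton under that assumption, where the buffers and the `collect_step`, `update_critic`, `update_actor` callables are hypothetical placeholders supplied by the caller:

```python
def train(env_steps, utd_ratio, collect_step, online_buffer, offline_buffer,
          update_critic, update_actor, batch_size=256):
    """Run `utd_ratio` critic updates per environment step (hypothetical interface)."""
    for _ in range(env_steps):
        online_buffer.add(collect_step())          # gather one new transition
        for _ in range(utd_ratio):                 # many gradient steps per data point
            batch = sample_symmetric(online_buffer, offline_buffer, batch_size)
            update_critic(batch)
        update_actor(online_buffer.sample(batch_size))  # actor updated less often
```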
They also use clipped double Q-learning to mitigate the overestimation problem of the value function, and add a maximum-entropy reward term to encourage exploration (see the sketch below).
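
A sketch of how the clipped double-Q target and the entropy bonus combine, SAC-style. The temperature `alpha`, the discount, and the tensor shapes are assumptions for illustration:

```python
import torch

def soft_clipped_double_q_target(reward, done, next_obs, next_action,
                                 next_log_prob, target_q1, target_q2,
                                 gamma=0.99, alpha=0.2):
    """TD target with clipped double Q-learning and a maximum-entropy bonus.

    target_q1 / target_q2 are target critic networks; next_action and
    next_log_prob come from the current policy evaluated at next_obs.
    """
    with torch.no_grad():
        q1 = target_q1(next_obs, next_action)
        q2 = target_q2(next_obs, next_action)
        min_q = torch.min(q1, q2)                    # clipped double-Q: take the minimum
        soft_value = min_q - alpha * next_log_prob   # maximum-entropy reward term
        return reward + gamma * (1.0 - done) * soft_value
```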