The core idea of this paper is simple. They introduce action chunking into temporal difference (TD)-based RL to mitigate the exploration challenge by:

  1. introducing a Q-Chunking policy that outputs a chunk of h actions at once
  2. introducing a Q-Chunking critic that takes the h-step action chunk as input
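
For concreteness, here is a minimal sketch of what such a chunked policy and critic could look like; the network shapes, names, and hyperparameters are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class ChunkedPolicy(nn.Module):
    """Outputs a chunk of h actions from a single observation (illustrative shapes)."""
    def __init__(self, obs_dim, act_dim, h, hidden=256):
        super().__init__()
        self.h, self.act_dim = h, act_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, h * act_dim),   # one forward pass -> h actions
        )

    def forward(self, obs):                   # obs: (B, obs_dim)
        return self.net(obs).view(-1, self.h, self.act_dim)   # (B, h, act_dim)

class ChunkedCritic(nn.Module):
    """Q(s, a_{t:t+h}): scores an entire h-step action chunk."""
    def __init__(self, obs_dim, act_dim, h, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + h * act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, action_chunk):     # action_chunk: (B, h, act_dim)
        x = torch.cat([obs, action_chunk.flatten(1)], dim=-1)
        return self.net(x).squeeze(-1)        # (B,)
```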

Several interesting ideas from the paper:

  1. The standard n-step TD estimator is referred to as the uncorrected n-step return estimator because it is biased when the data-collection policy differs from the current (updated) policy. The Q-chunking backup, in contrast, propagates value into an h-step Q-function that is conditioned on exactly the actions that were taken to obtain the n-step rewards, which removes this source of bias (see the backup sketch after this list).

  2. Offline datasets often exhibit non-Markovian structure (e.g., data from scripted policies or human tele-operators …). Markovian structure means each action is generated from the current observation alone; scripted policies and human tele-operators instead rely on fixed mechanisms such as a time-step counter, so the same state can be paired with different actions.

  3. Learning from such non-Markovian data with pure RL can be difficult and unreliable, so they also introduce a behavior constraint (a KL divergence between the policy and the dataset distribution) as a second component, i.e., an imitation-learning loss (see the actor-loss sketch after this list).

  4. On the question of how RL differs from IL, or when to use each: one answer is to use IL on the offline dataset (often collected with scripted policies) and RL during online learning for further improvement.
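
To make point 1 concrete, here is a minimal sketch of the chunked TD target (function and batch-key names are assumptions). The critic being trained is regressed toward this target at Q(s_t, a_{t:t+h}), where a_{t:t+h} is exactly the dataset chunk that produced the summed rewards, so no off-policy correction is needed:

```python
import torch

def chunked_td_target(critic_target, policy, batch, gamma, h):
    """Backup target for the h-step critic Q(s_t, a_{t:t+h}) (illustrative sketch).

    batch["rewards"]:  (B, h)        rewards r_t ... r_{t+h-1} stored with the chunk
    batch["next_obs"]: (B, obs_dim)  observation s_{t+h}
    batch["dones"]:    (B,)          1.0 if the episode terminated within the chunk
    """
    discounts = gamma ** torch.arange(h, dtype=batch["rewards"].dtype)
    n_step_reward = (batch["rewards"] * discounts).sum(dim=-1)    # sum_i gamma^i * r_{t+i}

    with torch.no_grad():
        next_chunk = policy(batch["next_obs"])                    # a_{t+h:t+2h} ~ pi(. | s_{t+h})
        bootstrap = critic_target(batch["next_obs"], next_chunk)  # Q_target(s_{t+h}, a_{t+h:t+2h})

    # The critic is conditioned on the very chunk that generated batch["rewards"],
    # so the bias of the plain "uncorrected" n-step return does not appear.
    return n_step_reward + (gamma ** h) * (1.0 - batch["dones"]) * bootstrap
```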
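
And for point 3, a sketch of how a behavior constraint can enter the actor loss. Here a simple mean-squared-error term toward the dataset chunks stands in for the KL-style constraint (the coefficient `alpha` and batch keys are assumptions); for a Gaussian behavior model with fixed variance this MSE is equivalent, up to constants, to a log-likelihood penalty:

```python
def chunked_policy_loss(policy, critic, batch, alpha=1.0):
    """Actor loss = -Q(s, pi(s)) + alpha * behavior-constraint term (sketch)."""
    pred_chunk = policy(batch["obs"])                                # (B, h, act_dim)
    q_term = critic(batch["obs"], pred_chunk).mean()                 # exploit the chunked critic
    bc_term = ((pred_chunk - batch["action_chunks"]) ** 2).mean()    # stay close to dataset chunks
    return -q_term + alpha * bc_term
```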