Human-in-the-Loop learning combines a “human”, who collects and relabels data, with a “machine”, which trains a policy on that data; the two sides augment each other.
- The human labels data only when the machine has low confidence, which makes labeling much more efficient.
- The machine learns a more accurate policy from the correct human-provided labels (see the sketch after this list).
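As an illustration, here is a minimal active-learning-style sketch of one human-in-the-loop round. It assumes a scikit-learn-style classifier with `predict_proba`/`fit` and a hypothetical `ask_human` annotation callback; names and the threshold are illustrative, not a fixed recipe.

```python
import numpy as np

def human_in_the_loop_round(model, unlabeled_x, labeled_x, labeled_y,
                            ask_human, threshold=0.8):
    """One round: the human labels only the low-confidence samples,
    then the machine retrains on the enlarged labeled set."""
    probs = model.predict_proba(unlabeled_x)
    confidence = probs.max(axis=1)          # probability of the top class
    uncertain = confidence < threshold

    # Human relabels only the uncertain samples (efficient labeling).
    new_x = unlabeled_x[uncertain]
    new_y = ask_human(new_x)                # hypothetical human annotator callback

    # Machine retrains on the correct human-labeled data.
    labeled_x = np.concatenate([labeled_x, new_x])
    labeled_y = np.concatenate([labeled_y, new_y])
    model.fit(labeled_x, labeled_y)

    # High-confidence samples stay machine-labeled and are not sent to the human.
    return model, labeled_x, labeled_y, unlabeled_x[~uncertain]
```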
This is very similar to the human-environment relationship in reinforcement learning, where the human (the agent) generates data by interacting with the environment, and the environment gives feedback (reward) back to the human.
- The human generates data by interacting with the environment, and the environment returns a reward, so over time the human generates more useful data.
- With a prioritized experience replay buffer (ERB), sampling high-priority data from the buffer helps refine the human's policy further (a minimal sketch follows this list).
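A minimal sketch of such a prioritized replay buffer, assuming priorities are derived from absolute TD errors as in prioritized experience replay; the class and parameter names are illustrative.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Replay buffer where higher-priority transitions are sampled more often."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha            # how strongly priority shapes sampling
        self.data = []
        self.priorities = []

    def add(self, transition, priority=1.0):
        if len(self.data) >= self.capacity:
            self.data.pop(0)          # drop the oldest transition
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        prios = np.asarray(self.priorities) ** self.alpha
        probs = prios / prios.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=probs)
        return [self.data[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + 1e-6   # keep priority strictly positive
```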
Besides, we can also seed the replay buffer with human demonstration data first. Early in training we may need a large share of expert data, but as training goes on the policy improves and eventually we do not need expert data at all.
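One simple way to realize this is to keep demonstrations and the agent's own experience in two buffers and anneal the fraction of each batch drawn from the demonstrations. The sketch below is an assumption about how the mixing could work; the buffer names and the annealing schedule are illustrative, and both buffers are assumed non-empty.

```python
import numpy as np

def demo_ratio(step, start=0.5, end=0.0, decay_steps=100_000):
    """Fraction of each batch drawn from expert demonstrations,
    linearly annealed from `start` to `end` over `decay_steps`."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

def sample_batch(demo_buffer, agent_buffer, batch_size, step):
    """Mix expert demonstrations with the agent's own experience.
    Early in training most samples come from demonstrations; later,
    almost all come from the agent itself."""
    n_demo = int(round(demo_ratio(step) * batch_size))
    demo_idx = np.random.randint(len(demo_buffer), size=n_demo)
    agent_idx = np.random.randint(len(agent_buffer), size=batch_size - n_demo)
    return [demo_buffer[i] for i in demo_idx] + [agent_buffer[i] for i in agent_idx]
```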