The most time-consuming step is copying data from GPU memory to the compute units. So the trick is: organize computation to maximize GPU utilization by minimizing data movement.
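For example, fusing elementwise operations into a single kernel avoids extra round trips through GPU memory. A minimal sketch, assuming PyTorch (the notes name no framework); `torch.compile` is one way to get such fusion:

```python
import torch

# Two elementwise ops written separately: the intermediate y is
# written out to GPU memory and read back in between the two kernels.
def unfused(x):
    y = x * 2
    return y + 1

# torch.compile can fuse both ops into one kernel, so each element
# makes a single round trip through memory instead of two.
fused = torch.compile(unfused)

x = torch.randn(4096, 4096, device="cuda" if torch.cuda.is_available() else "cpu")
out = fused(x)
```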

Data movement between GPUs is even slower, but the same "minimize data movement" principle holds.

  • Use collective operations (e.g., gather, reduce, all-reduce); see the sketch after this list
  • Shard (parameters, activations, gradients, optimizer states) across GPUs
  • How to split computation: {data, tensor, pipeline, sequence} parallelism
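As a concrete sketch of the first two bullets, here is a minimal data-parallel all-reduce, assuming PyTorch's `torch.distributed` (the notes name no framework): each rank computes its gradients locally, and one collective averages them, so only gradient tensors cross the interconnect, never the raw training data.

```python
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT for us.
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU clusters
    rank, world = dist.get_rank(), dist.get_world_size()

    # Stand-in for locally computed gradients on this rank.
    local_grad = torch.full((4,), float(rank))

    # One all-reduce sums gradients across ranks; dividing by the
    # world size turns the sum into the data-parallel average.
    dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)
    local_grad /= world

    print(f"rank {rank}: averaged grad = {local_grad.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run with, e.g., `torchrun --nproc_per_node=2 allreduce_demo.py` (the filename is illustrative).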

Scaling law:

  • With more FLOPs, you can train a bigger model on the same number of tokens (data).
  • With more FLOPs, you can train on more tokens (data) with the same model size.
  • TL;DR: more FLOPs buy you a bigger model, more data, or both (see the back-of-envelope sketch below).
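A back-of-envelope sketch of this trade-off, using the common approximation C ≈ 6·N·D (training FLOPs ≈ 6 × parameters × tokens); the budget and model sizes below are illustrative, not from these notes:

```python
# C = training FLOPs, N = parameters, D = tokens; C ≈ 6 * N * D.
C = 1e21  # fixed FLOP budget (assumed for illustration)

for N in (1e9, 2e9):          # 1B vs. 2B parameters
    D = C / (6 * N)           # tokens affordable at this budget
    print(f"{N:.0e} params -> {D:.2e} tokens")
# 1e+09 params -> 1.67e+11 tokens
# 2e+09 params -> 8.33e+10 tokens  (double the model, half the tokens)
```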