ZeRO (Zero Redundancy Optimizer)

Eliminates memory redundancy across data-parallel processes. Model states (optimizer states, gradients, parameters) are the largest source of memory usage.

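  • Data Parallelism (DP): good compute and communication efficiency, but poor memory efficiency (model states are replicated on every device)
  • Model Parallelism (MP): good memory efficiency, but poor compute efficiency and high communication overhead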

ZeRO-DP: compute and communication efficiency of DP + memory efficiency of MP

  • Optimizer state partitioning (P_os): 4x memory reduction, same communication volume as DP
  • Plus gradient partitioning (P_os+g): 8x memory reduction, same communication volume as DP
  • Plus parameter partitioning (P_os+g+p): memory reduction proportional to the number of partitions (DP degree), but 50% more communication volume
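
The reductions above can be checked with a little arithmetic. The sketch below assumes mixed-precision Adam training, where each parameter costs 2 bytes (fp16 parameter) + 2 bytes (fp16 gradient) + K = 12 bytes (fp32 parameter copy, momentum, variance), so 16 bytes in total. The function names and the `k` argument are illustrative, not part of any library.

```python
def zero_dp_memory_gb(num_params, dp_degree, k=12):
    """Per-device model-state memory in GB (1 GB = 1e9 bytes) per ZeRO-DP stage.

    Assumes mixed-precision Adam: 2 bytes/param for fp16 parameters,
    2 bytes/param for fp16 gradients, k bytes/param for optimizer states.
    """
    params = 2 * num_params   # fp16 parameters
    grads = 2 * num_params    # fp16 gradients
    optim = k * num_params    # fp32 parameter copy + momentum + variance
    stages = {
        "baseline_dp": params + grads + optim,               # everything replicated
        "P_os": params + grads + optim / dp_degree,          # optimizer states split
        "P_os+g": params + (grads + optim) / dp_degree,      # gradients also split
        "P_os+g+p": (params + grads + optim) / dp_degree,    # parameters also split
    }
    return {name: nbytes / 1e9 for name, nbytes in stages.items()}


def zero_dp_comm_volume(num_params):
    """Elements communicated per device per step, per ZeRO-DP stage.

    An all-reduce is a reduce-scatter plus an all-gather, each moving about
    num_params elements. With parameter partitioning, parameters are
    all-gathered in both the forward and backward pass, and gradients are
    reduce-scattered once.
    """
    all_reduce = 2 * num_params
    return {
        "baseline_dp": all_reduce,
        "P_os": all_reduce,
        "P_os+g": all_reduce,
        "P_os+g+p": 3 * num_params,
    }


if __name__ == "__main__":
    for stage, gb in zero_dp_memory_gb(7.5e9, 64).items():
        print(f"{stage:12s} {gb:8.2f} GB")
```

For a 7.5B-parameter model on 64 devices this gives roughly 120 GB, 31.4 GB, 16.6 GB and 1.9 GB per device. As the DP degree grows, the first two stages approach 4 and 2 bytes per parameter, which is where the 4x and 8x figures come from, and the 3:2 ratio of communicated elements is the 50% overhead of the last stage.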

What is left is the residual state: activations, temporary buffers, and unusable fragmented memory (gradients are model state and already handled by ZeRO-DP)

ZeRO-R optimizes residual state memory

  • Identifies and removes activation replication in MP by partitioning activation checkpoints
  • Defines an appropriate size for temporary buffers to balance memory and compute efficiency
  • Manages memory based on tensor lifetimes to prevent memory fragmentation

Is ZeRO + MP useful? Yes: combined with ZeRO-R, MP can reduce the activation memory footprint of very large models.