ZeRO (Zero Redundancy Optimizer)

Eliminates memory redundancy across data-parallel processes. Model states (parameters, gradients, optimizer states) are the largest source of memory usage.

ZeRO-DP: keeps the compute and communication efficiency of data parallelism, but partitions model states across processes instead of replicating them

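Optimizer state partitioning (P_os): up to 4x memory reduction, same communication volume as DP
Adding gradient partitioning (P_os+g): up to 8x memory reduction, same communication volume as DP
Adding parameter partitioning (P_os+g+p): memory reduction proportional to the number of partitions (DP degree), at the cost of 50% more communication volume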
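
Rough sketch of where the 4x / 8x / linear figures come from, assuming mixed-precision Adam: 2 bytes per fp16 parameter, 2 bytes per fp16 gradient, and K = 12 bytes per parameter of fp32 optimizer state (master weights, momentum, variance). The function and key names are made up for these notes; communication volume is counted in parameters moved per process per step.

```python
def zero_dp_model_state_bytes(num_params, dp_degree, k=12):
    """Per-device model-state memory in bytes for each ZeRO-DP stage.

    Assumes mixed-precision Adam: 2 bytes/param (fp16 weights),
    2 bytes/param (fp16 gradients), k bytes/param (fp32 optimizer state).
    """
    psi, nd = num_params, dp_degree
    return {
        "baseline": (2 + 2 + k) * psi,             # everything replicated
        "os":       2 * psi + 2 * psi + k * psi / nd,  # optimizer state split
        "os+g":     2 * psi + (2 + k) * psi / nd,      # + gradients split
        "os+g+p":   (2 + 2 + k) * psi / nd,            # + parameters split
    }


def zero_dp_comm_volume(num_params):
    """Per-step communication volume per process, in parameter counts."""
    psi = num_params
    return {
        "baseline": 2 * psi,  # all-reduce = reduce-scatter + all-gather
        "os":       2 * psi,
        "os+g":     2 * psi,
        "os+g+p":   3 * psi,  # extra parameter all-gathers: 1.5x baseline
    }


if __name__ == "__main__":
    # Illustrative sizes: 7.5B parameters, DP degree 64.
    mem = zero_dp_model_state_bytes(7.5e9, 64)
    for stage, nbytes in mem.items():
        print(f"{stage:8s} {nbytes / 1e9:7.1f} GB")
```

With K = 12 the baseline is 16 bytes per parameter. As the DP degree grows, P_os tends to 4 bytes per parameter (4x) and P_os+g to 2 bytes per parameter (8x), so both figures are upper bounds; only P_os+g+p keeps shrinking with the DP degree.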

What is left is residual state: activations, temporary buffers, and fragmented memory

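ZeRO-R reduces residual memory: partitions activations (optionally offloading them to CPU), uses constant-size temporary buffers, and defragments memory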

Is ZeRO + MP (model parallelism) useful? Yes: combined with ZeRO, MP can reduce the activation memory footprint of very large models.

Foundation Model 4 Climate Notes