Micro-batch size of 96

#84
by jtourille - opened

Hello,

Thanks for the nice work, it is a really impressive model 🚀.

I've implemented a training script using accelerate and H100 cards (94GB version). Everything is working well, even the batch-size warmup.
However, I am saturating my GPUs with a micro-batch size of 48 sequences. In the ModernBERT paper, I see that you set the micro-batch size to 96 👀. What am I missing? I am using FA2 like you. Even though I am not packing unpadded sequences to saturation as you do in the original paper, there is no way I can fit 96*1024 tokens in a micro-batch...
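
For reference, here is a stripped-down sketch of what my training loop looks like (the checkpoint name, corpus, and hyperparameters below are placeholders, not my actual script):

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

MODEL_NAME = "answerdotai/ModernBERT-base"  # placeholder checkpoint
MICRO_BATCH_SIZE = 48                       # what currently saturates one 94GB H100
MAX_LENGTH = 1024

accelerator = Accelerator(mixed_precision="bf16")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(
    MODEL_NAME,
    attn_implementation="flash_attention_2",  # FA2
    torch_dtype=torch.bfloat16,
)

# Stand-in corpus; in practice this is the real pre-training data.
texts = ["A placeholder sentence."] * 512
encodings = [tokenizer(t, truncation=True, max_length=MAX_LENGTH) for t in texts]
collator = DataCollatorForLanguageModeling(tokenizer)  # masking rate left at its default here
train_loader = DataLoader(
    encodings, batch_size=MICRO_BATCH_SIZE, shuffle=True, collate_fn=collator
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

model.train()
for batch in train_loader:
    outputs = model(**batch)
    accelerator.backward(outputs.loss)
    optimizer.step()
    optimizer.zero_grad()
```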

Julien

Update!

I was able to increase my micro-batch size to 88 by using gradient checkpointing.
I'll get back to you once I am able to squeeze the last 8 sequences into the GPU.
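
In case it is useful to someone else, this was the only change needed on the modelling side; a minimal sketch (the keyword argument is only available in recent transformers versions):

```python
# Trade compute for memory by recomputing activations during the backward pass.
model.gradient_checkpointing_enable()

# On recent transformers versions you can also opt out of the reentrant
# checkpointing implementation, which tends to be more robust:
# model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
```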

Julien

Using ZeRO-2, I am able to fit 96 sequences on one GPU. I could combine it with gradient checkpointing to fit even more sequences.
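
For anyone who lands here later, this is roughly what it looks like with Accelerate's DeepSpeed integration; a sketch with illustrative values, not my exact config:

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# ZeRO stage 2 shards optimizer states and gradients across GPUs,
# which frees enough memory for the larger micro-batch.
ds_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=1)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)

# The model, optimizer and dataloader are then prepared exactly as before:
# model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
```

The script is launched with `accelerate launch` as usual (DeepSpeed needs to be installed).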
Closing the discussion.

jtourille changed discussion status to closed
