Micro-batch size of 96
Hello,
Thanks for the nice work; it is a really impressive model.
I've implemented a training script using accelerate and H100 cards (94GB version). Everything is working well, even the batch-size warmup.
However, I am saturating my GPUs with a micro-batch size of 48 sequences. In the ModernBERT paper, I see that you set the micro-batch size to 96. What am I missing? I am using FA2 like you. Even though I am not saturating my unpadded sequences as you do in the original paper, there is no way I can fit 96 * 1024 tokens in a micro-batch...
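For context, here is a minimal sketch of my setup (the checkpoint name, dtype, and constants are illustrative, not the exact values from my script):

```python
import torch
from accelerate import Accelerator
from transformers import AutoModelForMaskedLM

MICRO_BATCH_SIZE = 48  # sequences per GPU that currently fit on a 94GB H100
SEQ_LEN = 1024         # 48 * 1024 = 49,152 tokens per micro-batch,
                       # vs. 96 * 1024 = 98,304 reported in the paper

accelerator = Accelerator()

# Flash Attention 2 in bf16, matching the paper's setup; the checkpoint name
# is only a placeholder for the model actually being pre-trained.
model = AutoModelForMaskedLM.from_pretrained(
    "answerdotai/ModernBERT-base",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
model = accelerator.prepare(model)
```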
Julien
Update!
I was able to increase my micro-batch size to 88 by using gradient checkpointing.
I'll get back to you once I am able to squeeze the last 8 sequences into the GPU.
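For reference, enabling it is a one-liner on the model from the sketch above (standard transformers API):

```python
# Recompute activations during the backward pass instead of storing them;
# this trades extra compute for memory and is what raised the micro-batch
# size from 48 to 88 sequences here.
model.gradient_checkpointing_enable()
```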
Julien
Using ZeRO-2, I am able to fit 96 sequences on one GPU. I could combine it with gradient checkpointing to fit even more sequences.
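A minimal sketch of that setup through Accelerate's DeepSpeed integration (the accumulation value is illustrative):

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# ZeRO stage 2 shards optimizer states and gradients across GPUs, which frees
# enough per-GPU memory to fit 96 sequences of 1024 tokens in a micro-batch.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,
    gradient_accumulation_steps=1,  # illustrative value
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)

# Gradient checkpointing can still be enabled on top of this for more headroom.
```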
Closing the comment.