Micro-batch size of 96

#84
by jtourille - opened

Hello,

Thanks for the nice work, it is a really impressive model 🚀.

I've implemented a training script using accelerate and H100 cards (94GB version). Everything is working well, even the batch-size warmup.
However, I am saturating my GPUs with a micro-batch size of 48 sequences. In the ModernBERT paper, I see that you set the micro-batch size to 96 👀. What am I missing? I am using FA2 like you. Even though I am not packing unpadded sequences to saturation as you do in the original paper, there is no way I can fit 96*1024 tokens in a micro-batch...
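
For reference, here is a stripped-down sketch of what my training loop looks like (the checkpoint name, corpus, and hyperparameters below are placeholders, not my actual script):

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

MODEL_NAME = "answerdotai/ModernBERT-base"  # placeholder checkpoint
MICRO_BATCH_SIZE = 48                       # what currently saturates one 94GB H100
MAX_LENGTH = 1024

accelerator = Accelerator(mixed_precision="bf16")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(
    MODEL_NAME,
    attn_implementation="flash_attention_2",  # FA2
    torch_dtype=torch.bfloat16,
)

# Stand-in corpus; in practice this is the real pre-training data.
texts = ["A placeholder sentence."] * 512
encodings = [tokenizer(t, truncation=True, max_length=MAX_LENGTH) for t in texts]
collator = DataCollatorForLanguageModeling(tokenizer)  # masking rate left at its default here
train_loader = DataLoader(
    encodings, batch_size=MICRO_BATCH_SIZE, shuffle=True, collate_fn=collator
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

model.train()
for batch in train_loader:
    outputs = model(**batch)
    accelerator.backward(outputs.loss)
    optimizer.step()
    optimizer.zero_grad()
```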

Julien

Update!

I was able to increase my micro-batch size to 88 by using gradient checkpointing.
I'll get back to you once I am able to squeeze the last 8 sequences into the GPU.
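
In case it is useful to someone else, this was the only change needed on the modelling side; a minimal sketch (the keyword argument is only available in recent transformers versions):

```python
# Trade compute for memory by recomputing activations during the backward pass.
model.gradient_checkpointing_enable()

# On recent transformers versions you can also opt out of the reentrant
# checkpointing implementation, which tends to be more robust:
# model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
```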

Julien

Using ZeRO-2, I am able to fit 96 sequences on one GPU. I could combine it with gradient checkpointing to fit even more sequences.
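
For anyone who lands here later, this is roughly what it looks like with Accelerate's DeepSpeed integration; a sketch with illustrative values, not my exact config:

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# ZeRO stage 2 shards optimizer states and gradients across GPUs,
# which frees enough memory for the larger micro-batch.
ds_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=1)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)

# The model, optimizer and dataloader are then prepared exactly as before:
# model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
```

The script is launched with `accelerate launch` as usual (DeepSpeed needs to be installed).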
Closing the discussion.

jtourille changed discussion status to closed
