Podcast: Arxiv Papers
Episode: [QA] Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful