Fwd
Getting a model to fwd is the greatest feeling in the world.
You've got an immense artifact, the weight snapshot, which takes on an almost holy level of gravity -- the entire purpose of the company is to create this thing; the world's greatest collection of genius scientists and engineers has poured untold billions of dollars of compute hardware into producing this artifact.
And now your job is to make it think, give it wings.
When you're bringing up a model like this, there will probably be a number of tweaks in the forward pass, implemented separately for the training and inference configurations. Oh, for this one we've transposed that tensor, this norm is computed slightly differently, that sort of thing. So inference needs to go in and muck around with the weights, rearranging and manipulating the brains of Claude to fit them into a form that can be made fast.
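A minimal sketch of what that weight-mucking can look like. Everything here is hypothetical -- the parameter names, the transpose, and the norm-scale folding are illustrative stand-ins for the kinds of training-to-inference layout tweaks described above, not any real model's conversion code:

```python
import numpy as np

def convert_for_inference(ckpt: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Rewrite a training-layout checkpoint into an inference-friendly layout.

    Hypothetical example: parameter names and transformations are illustrative.
    """
    out = {}
    for name, w in ckpt.items():
        if name.endswith("attn.out_proj"):
            # Suppose the inference matmul kernel wants the transposed layout;
            # make it contiguous so the kernel gets a clean memory stride.
            out[name] = np.ascontiguousarray(w.T)
        elif name.endswith("norm.scale"):
            # Suppose training parameterized the norm scale as (1 + g) while the
            # inference kernel multiplies by the scale directly: fold it in once
            # at conversion time instead of on every forward pass.
            out[name] = 1.0 + w
        else:
            # Most weights pass through untouched.
            out[name] = w
    return out
```

The point is that these rewrites happen once, offline, so the hot path never pays for them -- the serving kernels see exactly the layout they were written for.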
Understand every bit of a massive distributed system from the systolic array to the web server and make it all sing a chorus of output tokens at 100 Hz.
And then it works, and Claude goes out to the people. Everyone loves Claude, and hopefully Claude loves everyone, and our accelerators get hugged to death, but that's a good problem to have. So begins a slog of optimizing inference performance, racing to keep ahead of the demand curve. But it helps to know that it's for the sake of more Claude, and Claude is a force for good.