
And on the topic of prefill: Do you know what the role of GPUs is in prefill vs. in inference?


Prefill is part of inference. It's the first major step, where you calculate the keys and values for all of the input tokens.

Decode is the next major step where you start generating output tokens one at a time.

Both run on GPUs but have slightly different workloads:

1. Prefill has very little I/O from HBM (VRAM) relative to its compute, so it is compute-heavy.

2. Decode is light on compute but has to I/O the keys and values computed in the prefill stage for every output token.
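To make that split concrete, here's a toy single-head attention sketch in numpy (hypothetical names and shapes, ignoring batching, multi-head, and positional encodings): prefill builds the KV cache for the whole prompt with a few large matmuls, while each decode step handles a single new token but has to re-read the entire cache.

    import numpy as np

    d = 64                                        # head dimension (made up)
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def prefill(x):
        # x: (n_prompt_tokens, d). A few big matmuls over the whole
        # prompt at once -> compute-bound, good GPU utilization.
        K, V, Q = x @ Wk, x @ Wv, x @ Wq
        out = softmax(Q @ K.T / np.sqrt(d)) @ V
        return out, (K, V)                        # (K, V) is the KV cache

    def decode_step(x_t, kv_cache):
        # x_t: (1, d), one new token. Tiny amount of compute, but K and V
        # (and, in a real model, all of the weights) must be re-read from
        # HBM for this single token -> memory-bandwidth-bound.
        K, V = kv_cache
        K = np.vstack([K, x_t @ Wk])
        V = np.vstack([V, x_t @ Wv])
        q = x_t @ Wq
        out = softmax(q @ K.T / np.sqrt(d)) @ V
        return out, (K, V)

Running prefill once on the prompt and then decode_step in a loop mirrors what inference servers do, just without the paging, batching, or fused kernels of a real engine.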


Doesn't decode also need to stream in the whole of the model weights, making it very I/O heavy?


Yes, decoding is very I/O heavy. It has to stream in the whole of the model weights from HBM for every token decoded. However, that cost can be shared between the requests in the same batch. So if the system has more GPU RAM to hold larger batches, the I/O cost per request can be lowered.
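Rough illustrative arithmetic (numbers are made up, not from the thread): a 70B-parameter model at fp16 means roughly 140 GB of weights streamed from HBM per decode step; with batching, that stream is read once and reused for every request in the batch, so the per-request share shrinks linearly.

    # Back-of-envelope sketch: per-request weight traffic per decode step.
    # All numbers are illustrative assumptions (70B params, fp16).
    params = 70e9
    bytes_per_param = 2                        # fp16
    weight_bytes = params * bytes_per_param    # ~140 GB read from HBM per step

    for batch_size in (1, 8, 64):
        per_request = weight_bytes / batch_size
        print(f"batch={batch_size:3d}: ~{per_request / 1e9:.0f} GB of weight reads "
              f"per request per token")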



