Batching in vLLM doesn't combine prompts into the same context: each request keeps its own KV cache and attention state, and requests are processed in parallel while sharing compute resources. There's no perplexity tradeoff, just a throughput gain.
It's worth noting that the reason this works is that basically every LLM architecture currently in use is limited by memory bandwidth during decoding, not by compute. Each decode step is dominated by streaming the weights from VRAM, so the GPU can apply those weights to several requests at once essentially for free while it waits for the next block of weights to arrive.
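A quick back-of-envelope roofline model makes this concrete. The numbers below are assumptions (A100-like: ~312 TFLOP/s fp16, ~2 TB/s HBM bandwidth), the layer is a single fp16 matmul, and the sketch ignores activations and KV-cache traffic; it only illustrates why per-step time stays flat as batch size grows while the kernel is bandwidth-bound.

```python
# Roofline sketch: time for one decode step through a (d, d) fp16 weight matrix.
# Hardware numbers are ASSUMED, A100-like; not measured.
PEAK_FLOPS = 312e12   # fp16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12      # HBM bandwidth, bytes/s

def step_time(batch: int, d_model: int = 8192) -> float:
    """Estimated time for a (batch, d) x (d, d) fp16 matmul: the kernel
    can't finish faster than either its compute or its memory traffic allows."""
    flops = 2 * batch * d_model * d_model       # one multiply-add per weight per request
    bytes_moved = 2 * d_model * d_model         # weights streamed from VRAM once, 2 bytes each
    return max(flops / PEAK_FLOPS, bytes_moved / PEAK_BW)

# While memory-bound, 64 requests take the same wall time as 1:
print(step_time(64) / step_time(1))
```

With these numbers the per-request arithmetic intensity works out to roughly `batch` FLOPs per byte, so the kernel stays bandwidth-bound until the batch reaches the hardware's compute/bandwidth ratio (~156 here); only beyond that does adding requests start to cost wall time. Real serving is messier (KV-cache reads do grow with batch size), but this is the core effect.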