
llama.cpp enabled continuous batching by default half a year ago: https://github.com/ggerganov/llama.cpp/pull/6231

There is no need for a new API endpoint. Just send multiple requests at once.
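A minimal sketch of the idea: with continuous batching on, the client just fires N ordinary completion requests in parallel and the server batches them. The completion function here (`fake_complete`) is a stand-in for the blocking HTTP call; against a real llama.cpp server you would POST each prompt to its OpenAI-compatible completions endpoint instead.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_complete(prompt: str) -> str:
    # Stand-in for one blocking HTTP completion request to the server.
    return f"completion for: {prompt}"

prompts = ["Summarize doc A.", "Summarize doc B.", "Summarize doc C."]

# Each in-flight request becomes one sequence in the server's running
# batch -- no dedicated batch endpoint required on the client side.
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    results = list(pool.map(fake_complete, prompts))
```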



The point of the endpoint is to standardize my codebase: a provider-agnostic LLM interface that works the same regardless of backend.

Continuous batching is helpful for this type of thing, but it really isn't everything you need. You'd ideally maintain a low priority queue for the batch endpoint and a high priority queue for your real-time chat/completions endpoint.

That would let you make much better use of your hardware.





