
I'm a total newb when it comes to the implementation details, but I'm curious whether a hybrid (GPU+CPU) setup is possible, to enable inference with models larger than what fits in consumer GPU VRAM.


llama.cpp does this already. You tell it how many layers to offload to the GPU (the -ngl / --n-gpu-layers option), and it runs the remaining ones on the CPU.
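To make the idea concrete, here is a rough sketch of how you might pick a layer count for that split. This is not llama.cpp's actual logic: the function names, the uniform per-layer size, and the "reserve" for the KV cache and scratch buffers are all assumptions made for illustration, and which layers end up on which device is likewise just a choice made here.

```python
# Hypothetical helpers, not part of llama.cpp: estimate a GPU/CPU layer split.

MIB = 1024 ** 2
GIB = 1024 ** 3


def plan_offload(n_layers: int, layer_bytes: int, vram_bytes: int,
                 reserve_bytes: int = 0) -> int:
    """Return how many layers fit in VRAM, assuming every layer is the same size.

    reserve_bytes is VRAM held back for things other than weights
    (an assumed budget for KV cache, scratch buffers, the desktop, etc.).
    """
    if layer_bytes <= 0:
        raise ValueError("layer_bytes must be positive")
    usable = max(vram_bytes - reserve_bytes, 0)
    return min(n_layers, usable // layer_bytes)


def assign_devices(n_layers: int, n_gpu_layers: int) -> list[str]:
    """Label each layer 'gpu' or 'cpu' given a requested GPU layer count.

    The count is clamped to [0, n_layers]; the ordering (GPU layers first)
    is an arbitrary choice for this sketch.
    """
    n_gpu = max(0, min(n_gpu_layers, n_layers))
    return ["gpu"] * n_gpu + ["cpu"] * (n_layers - n_gpu)


if __name__ == "__main__":
    # Assumed numbers: 80 layers of ~500 MiB each, 24 GiB card, 2 GiB held back.
    n_gpu = plan_offload(80, 500 * MIB, 24 * GIB, reserve_bytes=2 * GIB)
    placement = assign_devices(80, n_gpu)
    print(n_gpu, placement.count("cpu"))
```

In practice you would pass the resulting number as the layer count and then adjust it by trial, since real layers are not all the same size and memory use grows with context length.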



