
gigel82 on June 13, 2023 | on: Llama.cpp: Full CUDA GPU Acceleration

I'm a total newb about the implementation details, but I'm curious whether a hybrid (GPU+CPU) setup is possible, to enable inference with larger models than fit in consumer GPU VRAM.

skirmish on June 13, 2023

llama.cpp does this already. You tell it how many layers to offload to the GPU, and it runs the remaining ones on the CPU.
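To make the layer-split idea concrete, here is a minimal Python sketch of the bookkeeping involved. It is not llama.cpp's code: the function names `layers_that_fit` and `assign_devices`, the per-layer size, and the choice to put the first layers on the GPU are all assumptions made for illustration. In llama.cpp itself the layer count is passed with the `--n-gpu-layers` (`-ngl`) option.

```python
def layers_that_fit(vram_bytes: int, bytes_per_layer: int, reserve_bytes: int = 0) -> int:
    """Estimate how many whole layers fit in VRAM after reserving headroom.

    Hypothetical helper: a real runtime also has to budget for the KV cache,
    scratch buffers, and non-layer tensors, which this sketch lumps into
    reserve_bytes.
    """
    if bytes_per_layer <= 0:
        raise ValueError("bytes_per_layer must be positive")
    usable = max(vram_bytes - reserve_bytes, 0)
    return usable // bytes_per_layer


def assign_devices(n_layers: int, n_gpu_layers: int) -> list[str]:
    """Return a device label per layer: n_gpu_layers on the GPU, the rest on the CPU.

    Which layers end up on the GPU is an assumption of this sketch; only the
    counts matter for the memory argument.
    """
    n_gpu = max(0, min(n_gpu_layers, n_layers))  # clamp to the valid range
    return ["gpu"] * n_gpu + ["cpu"] * (n_layers - n_gpu)
```

For example, with made-up numbers (8 GiB of VRAM, 1 GiB reserved, 400 MiB per layer, 40 layers), `layers_that_fit` gives the count to offload and `assign_devices(40, that_count)` shows the resulting split. The model no longer has to fit in VRAM as a whole; the layers left on the CPU just run slower.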
