
I'll bite: can anyone please ELI5 for the non-PhDs among us?


It restricts the freedom of one of the parameters (A) to make training substantially more efficient (easier for a GPU to churn through).

The actual FLOPs involved are similar to the original SSM-based version's, but that formulation is harder to express as strictly matrix multiplications.


TL;DR: The authors show that if you simplify Mamba so its state-space layer uses a diagonal matrix A that is a scalar times the identity matrix, the state-space transformation can be expressed as a form of causal linear attention.[a] That's the duality the authors refer to in the title. The key practical benefit is that it enables more efficient (faster) training on GPUs.
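To make the duality concrete, here is a minimal numpy sketch (all names here — `a`, `B`, `C`, `x`, etc. — are illustrative, not from the paper's code): when A is a scalar times the identity, the step-by-step SSM recurrence and a causally masked attention-style matrix multiplication produce the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                       # sequence length, state size
a = rng.uniform(0.5, 1.0, T)      # per-step scalar decay (A_t = a_t * I)
B = rng.standard_normal((T, N))   # input projections
C = rng.standard_normal((T, N))   # output projections
x = rng.standard_normal(T)        # one scalar input channel

# Recurrent (SSM) form: h_t = a_t * h_{t-1} + B_t * x_t ;  y_t = C_t . h_t
h = np.zeros(N)
y_rec = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# Dual (linear-attention-like) form: y = (L * (C @ B.T)) @ x, where
# L[t, s] = a_{s+1} * ... * a_t for s <= t, and 0 otherwise (causal mask).
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1 : t + 1])
y_att = (L * (C @ B.T)) @ x

assert np.allclose(y_rec, y_att)  # both forms agree
```

The attention-like form is one big masked matmul over the whole sequence, which is the shape of computation GPUs are fastest at — hence the training speedup.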

---

[a] https://arxiv.org/abs/2006.16236


TL;DR: Mamba is not as good as the Transformer.


Can you elaborate?



