
I'll bite: can anyone please ELI5 for the non-PhDs among us?


It restricts the freedom of one of the parameters (A) to make training substantially more efficient (easier for a GPU to churn through).

The actual FLOPs involved are similar to the original SSM-based version's, but that formulation is harder to express as strictly matrix multiplications.


TL;DR: The authors show that if you simplify Mamba so its state-space layer uses a diagonal matrix A that is a scalar times the identity matrix, the state-space transformation can be expressed as a form of causal linear attention.[a] That's the duality the authors refer to in the title. The key practical benefit is that it enables more efficient (faster) training on GPUs.
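To make the duality concrete, here is a minimal numpy sketch (all names here — `a`, `B`, `C`, `x`, etc. — are illustrative, not from the paper's code): when A is a scalar times the identity, the step-by-step SSM recurrence and a causally masked attention-style matrix multiplication produce the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                       # sequence length, state size
a = rng.uniform(0.5, 1.0, T)      # per-step scalar decay (A_t = a_t * I)
B = rng.standard_normal((T, N))   # input projections
C = rng.standard_normal((T, N))   # output projections
x = rng.standard_normal(T)        # one scalar input channel

# Recurrent (SSM) form: h_t = a_t * h_{t-1} + B_t * x_t ;  y_t = C_t . h_t
h = np.zeros(N)
y_rec = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# Dual (linear-attention-like) form: y = (L * (C @ B.T)) @ x, where
# L[t, s] = a_{s+1} * ... * a_t for s <= t, and 0 otherwise (causal mask).
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1 : t + 1])
y_att = (L * (C @ B.T)) @ x

assert np.allclose(y_rec, y_att)  # both forms agree
```

The attention-like form is one big masked matmul over the whole sequence, which is the shape of computation GPUs are fastest at — hence the training speedup.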

---

[a] https://arxiv.org/abs/2006.16236


TL;DR: Mamba is not as good as the Transformer.


Can you elaborate?



