
gajjanag · 2026-05-16T16:42:53 1778949773

> Communication matters most when you're dealing with cross-org concerns and those that master it are usually the more friendly and pleasant ones.

I don't agree with the second one, but agree with the first.

Throughout my corporate career so far, I have found plenty of hot air/pretty picture slide decks that exist solely for ladder climbers to climb. Said ladder climbers are usually all smiles in public and "friendly", but you have to watch out for knives behind your back.

reactordev · 2026-05-16T20:28:18 1778963298

At that level it's all Hunger Games isn't it?

gajjanag · 2026-04-27T11:17:29 1777288649

There are also more papers on similar themes.

For example, TurboQuant makes use of QJL (quantized Johnson-Lindenstrauss transformations). One of the first papers to characterize QJL, and indeed the rate-distortion tradeoff for quantized matrix multiplication in general, is "Optimal Quantization for Matrix Multiplication" (https://arxiv.org/abs/2410.13780) by Ordentlich and Polyanskiy.
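For readers unfamiliar with QJL, here is a minimal sketch of the core idea as I understand it: store only the sign of each Gaussian random projection of a key vector plus the key's norm, then rescale so the inner-product estimate is unbiased. This is an assumption-level illustration, not TurboQuant's implementation; the dimension `d`, sketch size `m`, and function names are hypothetical.

```python
import math
import random


def qjl_encode(k, projections):
    """Keep one sign bit per random projection of k, plus ||k||."""
    signs = [1 if sum(s_i * k_i for s_i, k_i in zip(s, k)) >= 0 else -1
             for s in projections]
    return signs, math.sqrt(sum(x * x for x in k))


def qjl_inner_product(q, signs, k_norm, projections):
    """Unbiased estimate of <q, k> from the 1-bit sketch of k.

    For s ~ N(0, I): E[<s, q> * sign(<s, k>)] = sqrt(2/pi) * <q, k> / ||k||,
    so scaling the empirical mean by sqrt(pi/2) * ||k|| removes the bias.
    """
    m = len(projections)
    acc = sum(sum(s_i * q_i for s_i, q_i in zip(s, q)) * b
              for s, b in zip(projections, signs))
    return math.sqrt(math.pi / 2) * k_norm * acc / m


rng = random.Random(0)
d, m = 8, 20000  # hypothetical head dimension and sketch size
projections = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
q = [rng.gauss(0.0, 1.0) for _ in range(d)]
k = [rng.gauss(0.0, 1.0) for _ in range(d)]

signs, k_norm = qjl_encode(k, projections)
estimate = qjl_inner_product(q, signs, k_norm, projections)
exact = sum(a * b for a, b in zip(q, k))
print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

The estimate's standard deviation shrinks like `1/sqrt(m)`, which is the rate-distortion knob: fewer stored bits per key means a noisier inner product.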

There is also a more accessible survey paper around quantized matrix multiplication called "High-Rate Quantized Matrix Multiplication: Theory and Practice" (https://arxiv.org/abs/2601.17187), by the same authors.

TurboQuant cites none of them.

kumarhn · 2026-04-27T11:40:13 1777290013

TurboQuant is starting to look like a case study in how to turn a fragile paper into a breakthrough story.

The attribution is thin, the “6x compression” headline is not clearly separated from prior KV-cache quantization baselines like KIVI, and the RaBitQ comparison is hard to take seriously: single-core CPU for the baseline, A100 GPU for TurboQuant. It is comparing apples-to-datacenter. Worse, there are also public OpenReview comments saying that even the reported accuracy results are not reproducible.

Hard to believe this is the standard for something being promoted as a breakthrough. If this came from a random startup blog, people would be much harsher about it.

oofbey · 2026-04-27T15:21:08 1777303268

But how can these poor googlers be expected to sift through the thousands of research papers published on these topics to find relevant citations? They don’t have time for such trivialities. They have far more important work to be doing not being evil. /s

fnordpiglet · 2026-04-28T05:32:19 1777354339

Gemini helped them build it but didn't / couldn't attribute it from its corpus. I think we will see a surge of "rediscovery": prior work that wasn't widely recognized at the time resurfacing through training data, without attribution.

oofbey · 2026-04-28T13:40:37 1777383637

Gemini is perfectly capable of searching the web. Pretty good at it really. As are most agents. If such a surge happens, it’s purely because of laziness.

fnordpiglet · 2026-04-28T21:38:46 1777412326

Laziness, aka, the human condition?

amitport · 2026-04-27T11:35:59 1777289759

I believe our claim at this point is more fundamental than just lack of citation.

The quantizer in TurboQuant is EDEN quantization (2021) applied to the KV-cache. It is neither a novel quantizer nor an improvement in quantization techniques.

In DRIVE/EDEN, we already introduced the version used in the "TurboQuant" paper and suggested optimal scale configurations that are better in both the MSE-minimizing and unbiased settings.
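A rough sketch of the rotate-then-quantize idea in its simplest 1-bit form may help readers follow this exchange. Assumptions: a randomized Hadamard transform stands in for the random rotation, and the two scales shown (a least-squares scale, and one chosen so that the reconstruction's projection onto `x` is exact) are my reading of "MSE-minimizing" vs "unbiased" scaling in this family of methods, not a verified reproduction of the DRIVE/EDEN formulas. All names are hypothetical.

```python
import math
import random


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of 2.

    Scaling by 1/sqrt(n) makes the transform orthogonal and its own inverse.
    """
    v = list(v)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(n) for x in v]


def rotate(x, signs):
    """R x = H D x: random sign flips (D) followed by a Hadamard transform (H)."""
    return fwht([s * xi for s, xi in zip(signs, x)])


def unrotate(y, signs):
    """R^{-1} y = D H y, since H and D are both symmetric and self-inverse."""
    return [s * yi for s, yi in zip(signs, fwht(y))]


def quantize_1bit(x, signs, unbiased):
    """Rotate, keep one sign bit per coordinate, rescale, rotate back."""
    y = rotate(x, signs)
    bits = [1.0 if yi >= 0 else -1.0 for yi in y]
    l1 = sum(abs(yi) for yi in y)
    if unbiased:
        # Makes <x_hat, x> == ||x||^2 for every rotation (bias-correcting choice).
        scale = sum(xi * xi for xi in x) / l1
    else:
        # Least-squares fit of scale * bits to y, i.e. minimizes ||x_hat - x||^2.
        scale = l1 / len(y)
    return unrotate([scale * b for b in bits], signs)


def squared_error(x_hat, x):
    return sum((a - b) ** 2 for a, b in zip(x_hat, x))


rng = random.Random(1)
d = 16  # hypothetical vector length (power of 2 for the Hadamard transform)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]

x_mse = quantize_1bit(x, signs, unbiased=False)
x_unb = quantize_1bit(x, signs, unbiased=True)
print(f"MSE-scale error {squared_error(x_mse, x):.3f}, "
      f"bias-corrected error {squared_error(x_unb, x):.3f}")
```

Because the rotation is orthogonal, the reconstruction error equals the error in the rotated domain, so the least-squares scale always has the lower squared error for a given rotation; the other scale trades some of that error for the bias correction.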

gajjanag · 2026-04-27T13:59:27 1777298367

Wow, yes - you are completely correct (read through the note in detail now).

Though, as your paper also notes, the quantizer values themselves aren't fundamentally novel to either paper. Lloyd-Max scalar quantizers have been studied for a very, very long time, and the specific Lloyd-Max values for a Gaussian input distribution have been derived in many papers across signal processing and information theory.
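The classical Lloyd-Max values for a Gaussian input can be reproduced in a few lines with Lloyd's alternating algorithm. This sketch assumes a standard normal input and uses the closed-form conditional mean of a truncated normal; the function names and the starting codebook are my own choices.

```python
import math


def pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)


def cdf(t):
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))


def lloyd_max_gaussian(levels, iters=500):
    """Lloyd-Max scalar quantizer (levels >= 2) for a N(0, 1) input.

    Alternates the two optimality conditions:
      * thresholds are midpoints between adjacent reconstruction levels;
      * each level is the conditional mean E[X | X in its cell], which for
        N(0, 1) on (a, b) equals (pdf(a) - pdf(b)) / (cdf(b) - cdf(a)).
    """
    # Start from evenly spaced levels on [-2, 2].
    codebook = [-2 + 4 * i / (levels - 1) for i in range(levels)]
    for _ in range(iters):
        mids = [(a + b) / 2 for a, b in zip(codebook, codebook[1:])]
        edges = [-math.inf] + mids + [math.inf]
        codebook = [(pdf(a) - pdf(b)) / (cdf(b) - cdf(a))
                    for a, b in zip(edges, edges[1:])]
    return codebook


print(lloyd_max_gaussian(2))
print(lloyd_max_gaussian(4))
```

The 2-level codebook is exactly plus or minus sqrt(2/pi) (about 0.798), and the 4-level one should land close to the classical table values of plus or minus 0.4528 and 1.510.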

amitport · 2026-04-27T14:07:14 1777298834

Thanks for that!

It is worth noting that taking advantage of the post-rotation distribution was not actually done until DRIVE (2021), which was made possible by our proper scaling. Furthermore, applying a Lloyd-Max codebook post-rotation was introduced in EDEN.

We consider these to be the foundational works in this regard.

gajjanag · 2026-04-27T14:19:02 1777299542

> Thanks for that! It is worth noting that taking advantage of the post-rotation distribution

I again feel this claim is too strong. Rotations have been used in information theory/wireless communications for decades at this point, with appropriate scaling done at channel inputs/outputs to hit channel capacity. The signals then pass through the appropriate codebooks that take advantage of the post-rotated+whitened signal.

Our cellphones today are powered by such technology.

I agree with your claim when restricted to deep learning. But I do not agree with the broad characterization that taking advantage of post-rotation distributions was only first done in your work.

amitport · 2026-04-27T15:13:56 1777302836

Thanks for the pushback, and I appreciate the reference to classical information theory.

While I probably overstated things by using the very general phrase "taking advantage," I want to be very precise about the claim, as I believe these works are foundational to quantization, beyond the scope of deep learning. The mechanism of applying a deterministic biased quantizer, such as Lloyd-Max, to the induced post-rotation distribution, alongside mathematically correcting its inherent bias, is a distinct contribution (which asymptotically improves the worst-case error).

If there is a classical paper that utilizes such a combination, I would genuinely be very eager to review it. But to my knowledge, this was not introduced prior to DRIVE and EDEN.

gajjanag · 2026-04-19T12:55:21 1776603321

TurboQuant is known across the industry not to be state of the art. There are superior schemes for KV quantization at every bitrate, e.g., SpectralQuant: https://github.com/Dynamis-Labs/spectralquant, among many, many papers.

> Given that TurboQuant results in a 6x reduction in memory usage for KV caches

It all depends on the baseline. The "6x" comes from a simplistic comparison against a BF16 KV cache, not against a state-of-the-art 8- or 4-bit KV cache scheme.
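A quick back-of-the-envelope check of how much the baseline matters. Assumptions: the "6x" is per-element storage relative to BF16, quantization metadata (scales, zero-points) is ignored, and the model shape below is hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bits_per_elem):
    """Keys + values (the factor 2); ignores scales, zero-points and other metadata."""
    elements = 2 * layers * kv_heads * head_dim * seq_len * batch
    return elements * bits_per_elem / 8


# Hypothetical model shape, for illustration only.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=32_768, batch=1)

bf16_bytes = kv_cache_bytes(**shape, bits_per_elem=16)
headline_bytes = bf16_bytes / 6  # "6x smaller than BF16", i.e. ~2.67 bits/element

ratios = {
    name: kv_cache_bytes(**shape, bits_per_elem=bits) / headline_bytes
    for name, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]
}
print(f"BF16 KV cache: {bf16_bytes / 2**30:.1f} GiB")
for name, ratio in ratios.items():
    print(f"saving vs {name}: {ratio:.2f}x")
```

Under these assumptions the same headline shrinks to 3x against an 8-bit cache and 1.5x against a 4-bit one, independent of the model shape.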

gajjanag · 2026-04-18T11:41:39 1776512499

The bigger challenge is on GPUs/NPUs. Branching between fast and accurate paths gets costlier there, among other things. On CPUs this cost is smaller.

Most published libms on the GPU/NPU side accept a few ULP of error as part of the performance vs. accuracy tradeoff. This is documented explicitly in the CUDA programming guide, for example: https://docs.nvidia.com/cuda/cuda-programming-guide/05-appen...

Prof. Zimmermann and collaborators have a great table at https://members.loria.fr/PZimmermann/papers/accuracy.pdf (Feb 2026) comparing the accuracy of various libms.
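For anyone who wants to see what "a few ULP" means concretely, here is a minimal sketch of measuring binary32 ULP error against a double-precision reference. Assumptions: `math.exp` in binary64 serves as the "exact" reference, and `fast_exp_f32` is a hypothetical, deliberately crude stand-in for a fast path, not any real libm's algorithm.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def ulp_f32(x):
    """Spacing between |x| rounded to binary32 and the next larger binary32."""
    a = to_f32(abs(x))
    bits = struct.unpack("<I", struct.pack("<f", a))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0] - a


def ulp_error(approx, exact):
    """Error of a binary32 result, in binary32 ULPs of the exact value."""
    return abs(approx - exact) / ulp_f32(exact)


def fast_exp_f32(x):
    """Hypothetical 'fast path': degree-4 Taylor polynomial, rounded to binary32."""
    x = to_f32(x)
    acc = to_f32(1.0)
    term = to_f32(1.0)
    for n in range(1, 5):
        term = to_f32(term * x / n)
        acc = to_f32(acc + term)
    return acc


xs = [i / 100 for i in range(-50, 51)]
worst_rounded = max(ulp_error(to_f32(math.exp(x)), math.exp(x)) for x in xs)
worst_fast = max(ulp_error(fast_exp_f32(x), math.exp(x)) for x in xs)
print(f"rounded reference: {worst_rounded:.2f} ULP, crude fast path: {worst_fast:.0f} ULP")
```

A correctly rounded result stays within 0.5 ULP; published GPU math libraries document bounds above that for many functions in exchange for speed.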

gajjanag · 2026-03-04T01:23:08 1772587388

> That's why you need to put your scope

The problem is, "scope" is often equated to "how many people worked in my empire" rather than "how much business value did my work X generate".

The two things are vastly different, and I have seen the distinction/oversimplification play out over and over in my own career as well as many others around me.

As an extreme on the "individual technical expert" side, there are things out there that can pretty much only be accomplished by a few people around the world who possess the dedicated expertise. These results can't be replicated by a cobbled-together team of 10 or 100 people, even though the latter sounds more impressive for "scope".

Some organizations do a decent job of recognizing these different "archetypes", many don't.

Swizec · 2026-03-04T01:43:56 1772588636

I agree. What counts as a positive signal for "scope" really depends very much on what you're hiring for.

When looking for a manager type, people under management are a decent proxy. When looking for the world's greatest Postgres optimization expert, some version of queries-per-second is probably the metric you want.

Or realistically, if I needed the world's greatest Postgres expert (and could afford them), I would go talk to experts in the field and ask "Who's the best Postgres person you know?" and work from there. At that point your resume is but a formality.

gajjanag · 2025-11-06T14:38:57 1762439937

>80%-90% or so of real life vectorization can be achieved in C or C++ just by writing code in a way that it can be autovectorized.

Yep. I was pleasantly surprised by the quality of autovectorization in recent clang at work a few days ago. If you write code from which the compiler can infer that loop counts are multiples of 4, 8, etc., it goes off and emits pretty decent NEON/AVX code. The rest, as you say, is handled quite well by intrinsics these days.

Autovectorization was definitely poorer 5-10 years ago on older compiler toolchains.

gajjanag · 2025-10-25T22:30:05 1761431405

Welcome to the brave new world these days:

1 - Very few people practice "proper scholarship"; most fail to trace ideas back to their original inception and cite them correctly. This happens time and again in deep learning, where 30+ year old ideas are claimed as "novel" over and over, many times out of malice by the authors, sometimes out of ignorance.

2 - Peer review in many parts of industry and research is a joke. It is mostly shouldered by early-stage graduate students who don't really know the field well, and the process is incredibly noisy.

3 - It is common practice now to dump out one's "kitchen sink" of ideas rather than properly refined stuff. Hence the increase in LinkedIn spam, blog spam, arXiv spam style of papers.

gajjanag · 2025-09-17T15:38:48 1758123528

> I don't think there are many (or any) upsides to the well documented downsides.

C++ template metaprogramming still remains extremely powerful. Projects like CUTLASS could not be written in Rust to deliver the best performance in as ergonomic a way.

There is a reason why the ML infra community mostly goes with Python-like DSLs or template metaprogramming frameworks.

Last I checked there are no alternatives at scale for this.

gajjanag · 2025-09-11T06:01:51 1757570511

As others have pointed out, these phenomena are well known to many folks across companies in the AI infra space. It doesn't really break new ground. This article is a good exposition of the basic strategies though.

What I would have loved is a discussion of collectives/multi-node setups, showing how to get determinism with a low performance penalty for multi-node reduction collectives.
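To make the reduction point concrete, here is a single-process sketch (not a real collective library) of why arrival order breaks determinism and how a fixed reduction order restores it. Assumptions: each rank contributes one partial sum, and the hypothetical `deterministic_reduce` orders contributions by rank and combines them with a fixed pairwise tree, so the floating-point association pattern depends only on the number of ranks.

```python
import random


def naive_reduce(arrivals):
    """Accumulate partial sums in whatever order they arrive."""
    total = 0.0
    for _rank, value in arrivals:
        total += value
    return total


def deterministic_reduce(arrivals):
    """Order by rank, then combine with a fixed pairwise tree.

    The association pattern depends only on the number of ranks, so the
    result is bitwise identical regardless of arrival order.
    """
    values = [value for _rank, value in sorted(arrivals)]
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])
        values = paired
    return values[0] if values else 0.0


rng = random.Random(42)
# Hypothetical per-rank partial sums spanning many orders of magnitude.
partials = [(rank, rng.uniform(-1, 1) * 10 ** rng.randint(-8, 8)) for rank in range(16)]

naive_results, tree_results = set(), set()
for _ in range(200):
    arrivals = partials[:]
    rng.shuffle(arrivals)  # simulate nondeterministic arrival order
    naive_results.add(naive_reduce(arrivals))
    tree_results.add(deterministic_reduce(arrivals))

print(len(naive_results), "distinct naive results;", len(tree_results), "tree result")
```

The cost in a real multi-node setting is that a rank may have to wait for a specific peer instead of consuming whichever message arrives first, which is where the performance penalty the comment asks about comes from.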

gajjanag · 2025-07-27T15:45:18 1753631118

+1 - there are just so many Asian recipes that cannot be done anywhere near as easily on induction stovetops (high heat from direct flame for flatbreads, etc.).

Plus a whole bunch of cookware doesn't work with induction (clay pots, non-ferromagnetic bases, etc.). I do wonder if any of these "environmental" estimates factor in the environmental cost of replacing a bunch of cookware just to satisfy induction requirements.

g8oz · 2025-07-27T16:08:22 1753632502

Simply not true. There are induction woks available for East Asian recipes.

South Asian flatbreads like naans, rotis, dosas and parathas can definitely be made well with induction. Plus the precision control of heating opens up new possibilities with all cuisine types.

As for embodied replacement costs - that talking point has been used or rather misused to dismiss everything from solar panels to EVs to wind turbines. Just because there is a payback period doesn't mean that it's insurmountable. What's the payback period on fast fashion and other consumerist nonsense? Infinity right?

gajjanag · 2025-07-27T17:28:17 1753637297

I guess you have never worked with a slow induction cooktop. We literally had to spend 15 more minutes cooking things on induction compared with the gas connection in our previous apartment.

Maybe they are better now, but it is certainly not the case that all induction cooktops have these magical properties; many are cheap and skimp on something. Meanwhile, in the 5+ apartments I have lived in, gas has always delivered the same heating experience that I can rely on.

And to your point about rotis, no: it cannot be done unless you get a different, heavier-bottomed pan suitable for induction. That is exactly what I was saying about replacement costs.

SilverElfin · 2025-07-27T16:12:21 1753632741

Yep - a gas ban basically bans major parts of various cultures. But also even for typical recipes, you can’t do things like tilt a pan to use the flame to heat different parts differently.

As for environmental costs - the thing that surprises me is that induction easily warps even higher end pans. But yes you’re right, you can’t use many different materials.

ponector · 2025-07-27T17:23:56 1753637036

The gas stove is a modern invention. Culture will be fine with other ways to heat the pan.

SilverElfin · 2025-07-27T20:47:51 1753649271

It’s a modern invention that replaces cooking over a flame. Like from wood. Having a flame to cook over is core to many cultures.

ViewTrick1002 · 2025-07-28T17:14:27 1753722867

Buy a propane torch? Or a tiny single pan portable gas stove? Or just use your gas barbecue? Or use your charcoal fired barbecue?