How to make LLMs go fast (vgel.me)
215 points by tosh on Dec 22, 2023 | 54 comments


I wonder if the transformer will remain the de facto arch for LLMs in a couple years. We already have Mamba, RWKV, etc. which are huge improvements over transformer in terms of memory requirement and speed. I wonder why there's still so much work (and hacks and workarounds) on transformer to make it go faster and consume less VRAM whereas the same effort could be spent on other architectures that solve some of those problems fundamentally.


Theoretically, transformers are fundamentally more powerful than architectures like Mamba and RWKV because they don't "forget"; the newest token can attend perfectly to the oldest token if it wants. Mamba and RWKV, on the other hand, compress the old state based on information available at the time of compression. If the new token needs information from an old token that was discarded because the compression judged it unnecessary at the time, the new token has no way to access it.
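
A minimal sketch of that difference (toy NumPy, my own illustration rather than either paper's actual formulation): causal attention keeps every past key/value around and can look anything up later, while a recurrent/state-space-style layer folds each token into a fixed-size state and can only read what survived that compression.

    import numpy as np

    d = 8  # toy hidden size

    def attend(q, K, V):
        # q: (d,), K/V: (t, d) -- every past token's key/value is still addressable
        scores = K @ q / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V  # any old token can be recovered if the query asks for it

    def recurrent_step(state, x, A, B):
        # state: (d,) no matter how many tokens came before
        return A @ state + B @ x  # whatever A discards is gone for good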


This is fair -- the newest token can attend perfectly to the oldest token, within the context window.

But also, on a broader scale, if a transformer model is presented with a long input that does not fit in its context (e.g. you are building a chatbot and have a very long chat history), it must "compress" or "forget" some of that information (e.g. repeatedly summarizing historical messages, dropping them, and prepending the summary to the input).

Mamba/RWKV/other "recurrent" architectures can theoretically operate on unbounded input lengths; they "forget" information from earlier tokens over time, but is that not comparable to what a transformer must do with inputs longer than its context window?
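
A hedged sketch of that rolling-summary approach (count_tokens and summarize are stand-ins for whatever tokenizer and summarization prompt you'd actually use):

    def fit_history(messages, summary, max_tokens, count_tokens, summarize):
        # Fold the oldest raw messages into a running summary until
        # the summary plus remaining messages fit the context window.
        while messages and count_tokens(summary) + sum(map(count_tokens, messages)) > max_tokens:
            oldest = messages.pop(0)
            summary = summarize(summary + "\n" + oldest)  # the lossy "forgetting" step
        return summary, messages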


I didn’t know those alternatives had those weaknesses. I did have a whole list of alternatives to check out. Since you've already looked at some, do you know of any transformer alternatives that don’t forget things but are still sub-quadratic? What rabbit holes are worth digging deeper into?


Doesn't sub-quadratic attention fundamentally imply forgetting things?


The jury very much still seems out on this. Computationally speaking, I believe Mamba is Turing complete, while transformers aren't (they can't do loops), so technically Mamba is more expressive. But of course, the question is always whether it ends up with lower loss.


(Well, actually it's not Turing complete, but it definitely seems closer.)


Yet this all-to-all approach of transformers is bottlenecked on compute, while Mamba can take in a context length of a million tokens, which might have a positive effect on long-range tasks.


What use is it to have infinite context if new tokens can only access a lossy compression of past tokens?

There is definitely a use for this kind of model, but this also shows why Transformers are still the main architecture we use today.


I know almost nothing about this space but “lossy compression of millions of past tokens” feels a lot more like how actual human memory works than “perfect access to a small number of recent tokens”.


> feels a lot more like how actual human memory works

Beware of feel-good traps like this.

If you were to map the human connectome to a computational neural network down to the ion channel, it'd be at least 500 quadrillion parameters*. That's at least 5-6 orders of magnitude beyond what is currently possible with SOTA ML, which means that even if the human brain were 99% devoted to compressing those tokens, the 1% that could actually do work with them would still be a thousand times bigger than GPT-4. There be emergent dragons.

* This is a facile argument to begin with, since biological neuron signals aren't quantized and ion channels are far too complex to map to a single static parameter.
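
Taking the parent's own numbers at face value (which the footnote rightly cautions against), the arithmetic does check out to within an order of magnitude:

    connectome = 5e17  # "at least 500 quadrillion parameters"
    sota       = 1e12  # rough order of magnitude of the largest current models
    print(connectome / sota)         # ~5e5, i.e. the claimed 5-6 orders of magnitude
    print(0.01 * connectome / sota)  # ~5e3: even 1% of that budget dwarfs SOTA models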


> biological neuron signals aren't quantized

Perhaps, but they inherently have imprecision due to the nature of being biological. Expose the same neuron to the same input N times, and you'll get a range of output. The effect of noisy analog data is, broadly speaking, similar to the effect of low-resolution digital data.


Expose the same neuron to the same input N times and you won't get a randomly distributed range of outputs, you'll get a signal that is either attenuated or potentiated. That's not noise, that's the simplest form of neuroplasticity at work (which is why static parameters can never represent ion channels).


Maybe against HN rules, but: thank you. Very helpful comment!


RNNs and LSTMs from the past did this as well (but cannot be trained in parallel as each token has to be compressed sequentially). Transformers ate their cake.

Newer methods are going back to similar concepts but trying to get past previous bottlenecks given what we've learned since then about transformers.


There are hacks and workarounds because billions of dollars have been poured into transformers by the biggest tech companies in the world.

Your assumption that the same effort could be given to Mamba or RWKV isn’t necessarily true. I’m sure there are research arms looking to see if they scale to the level OpenAI or Google have their transformers at now, but they’re still very much in their infancy.

This all ignores the risk of them not panning out, and the time lost by not focusing some energy on transformers.


The current best-in-class performance is from transformer models so of course research is focusing on this. You're creating a false dichotomy between "work on transformers" and "work on something new and fundamentally better." All of the current work on optimizing transformer architectures is exploring and better understanding the space, and so it IS leading us to better architectures. This is the most fertile ground for new insights that will outperform the current approaches.

New technologies aren't generally created out of whole cloth by some genius; they're built up of layers of incremental improvements, each of which is modest in its own right but which, taken together, are groundbreaking.


The speed of transformers is memory bandwidth limited. I think it's possible to speed them way up with different chip architectures, and I know there are lots of people working on this, e.g. https://etched.ai making ASICs and https://untether.ai making chips with co-located memory and compute cells. So unless some architecture really starts beating transformers badly on language tasks, I think the speed problem is going to disappear as silicon architectures adapt.
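
For intuition on "memory bandwidth limited": at batch size 1, each decoded token has to stream essentially all of the weights through the memory system once, so tokens/sec is bounded by roughly bandwidth divided by model size. A rough sketch with assumed, illustrative numbers:

    weights_gb    = 70e9 * 2 / 1e9  # e.g. a 70B model in fp16, ~140 GB
    bandwidth_gbs = 3350            # e.g. ~3.35 TB/s of HBM on a top-end datacenter GPU
    print(bandwidth_gbs / weights_gb)  # ~24 tokens/sec ceiling, ignoring KV cache, batching, and compute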


Yup, the company I work for has the world's fastest LLM appliance, based on a hardware design co-locating memory and compute. It requires compiler innovation as well, in a sort of hardware-software codesign. We don't see transformer models as being an impediment in terms of memory requirements or speed in the near future.

[I would be happy to say more, but I just got a top-level comment flag-killed, presumably because they thought I was advertising, so I won't mention the company name.]


If I understand correctly, Groq chips have 220MB SRAM and the next best level is DDR4? How many chips are needed to run Llama2-70B at those speeds?


Cool that you know the tech specs of the GroqChip! Yes, that's right, 220 MB of SRAM per chip. I think the demo where we first broke 200 tokens / sec was running on 1 GroqRack, so 64 chips. The live public demo that's currently running at 275 tokens / sec I think might be running on two GroqRacks, so 128 chips. I'm not certain of either of these figures so please don't quote me! But those are the right ball-park.


This article from less than a month ago says that it is on 576 chips https://www.nextplatform.com/2023/11/27/groq-says-it-can-dep...


Thanks, looks like you're right and this demo is running on 9 GroqRacks (576 chips). I think we may also have an 8 rack version in progress. We've tried a variety of different configurations to improve performance, which is possible because of the high level of flexibility and configurability of our architecture and compiler tool chain.
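
For anyone wondering how 9 racks squares with 220 MB of SRAM per chip, a rough sanity check (my assumptions about weight precision, not Groq's published configuration):

    params        = 70e9   # Llama 2 70B
    bytes_per_w   = 2      # assuming fp16 weights; less if quantized
    sram_per_chip = 220e6  # per the thread
    print(params * bytes_per_w / sram_per_chip)  # ~640 chips just to hold the weights,
                                                 # in the same ballpark as 576 chips (9 racks)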


Link to HN discussion about the appliance: https://news.ycombinator.com/item?id=38739199


Are you guys hiring? I'm actually looking for a job in this industry.


Yes we are! (I put links in my profile because I'm worried if link here or mention the name I'll be flag killed again.)


Awesome, thanks! Is there someone there you could connect me with for an introductory chat? I put my contact info in my profile, I completely understand if it's not possible. Thanks!


Yes, sure! I've emailed a recruiter to introduce you. Thanks for reaching out and showing interest.


There is still so much performance to gain by applying software-based optimizations to transformer LLMs on GPUs: going from ~30% GPU utilization to over 90% is possible. I wrote a white paper on this, see pages 5 and 6: http://tinyurl.com/uujr3z4f


> why there's still so much work (and hacks and workarounds) on transformer to make it go faster and consume less VRAM whereas the same effort could be spent on other architectures

Probably because the pace of development is now no longer determined by ML researchers but by CS engineers. ML is "easy to pick up" until you have to understand the reasons certain architectures work or don't for different applications.


The best technology doesn't necessarily win. First to market matters, especially when models on the scale of GPT-4 cost ~$100 million to train.


"best" is subjective, but I do think the industry will eventually converge onto an architecture that is significantly more cost effective than current state of the art. Regardless of who is first to market, everyone is incentivized to continue down this path with their research on improving LLM performance.


While the first mover advantage exists, I don't think it's a sufficient moat in the long term.

OpenAI's moat is the high cost of training, as you mentioned, but this might become obsolete a few papers down the road.

They themselves realized this by trying to turn it into a platform but I don't think it's enough.


technical debt


LLMs generate 'the next token', which requires serializing the data and prevents 'too much' parallelization.

I wonder if some sort of diffusion hybrid is being worked on.

Something that approximates a complete text answer, but 'increases the resolution' of that text over time.

One key benefit of a 'shotgun/top-down/all-at-once' approach, as opposed to 'guess the next token => add and repeat', is the ability for tokens at the end of the text (a twist in a surprise mystery story) to have a direct effect on the beginning of that story, which (as far as I know) is not possible in current LLMs due to their architecture.

How self-attention and positional encoding would work in a diffusion model... that's the open question I have: whether this could even work.

To be clear, I don't mean Stable Diffusion rendering text as an image. I mean diffusion run on raw (random) text, turning that into a legible response.
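
One existing flavor of this is mask-predict style iterative refinement rather than diffusion proper; a hedged toy sketch (model here is a stand-in for a non-autoregressive network that scores every position in parallel):

    import numpy as np

    def refine(model, length, mask_id, steps=8):
        tokens = np.full(length, mask_id)  # start from an all-masked, "low resolution" sequence
        for step in range(steps):
            probs = model(tokens)          # (length, vocab_size), predicted all at once
            tokens = probs.argmax(-1)
            conf = probs.max(-1)
            # re-mask the least confident positions, fewer each round,
            # so the whole text (including the ending) sharpens together
            n_mask = int(length * (1 - (step + 1) / steps))
            if n_mask:
                tokens[np.argsort(conf)[:n_mask]] = mask_id
        return tokens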


People have been trying to get non-autoregressive generation working for years. There are some methods which work okay but they're always behind AR generation. And yeah they do look very diffusion-esque.

One of the issues is that you'd need to be really sold on this being a better paradigm before spending the money to pretrain a huge non-AR model.


I figured the raw diffusion-esque approach had been tried (without techniques lifted from newer LLMs; ChatGPT is only a year old). I'm thinking those attempts were made largely before LLMs were popular. Self-attention and positional encoding are the claimed 'secret sauce': are there any models (or attempts) anyone can point to that try to combine SA+PE into a diffusion-like (non-autoregressive) model?

What are some models that 'work okay'?


Self-attention and positional embeddings are ancient at this point (i.e., more than 5 years old). Here are a couple examples of methods: https://arxiv.org/abs/2112.06749 https://arxiv.org/abs/2205.12558


Thanks for the blog post, I also enjoyed their previous post on making a transformer by hand: https://vgel.me/posts/handmade-transformer/


> ... we could be running models that handily beat GPT-4 on consumer hardware,

The implication there is that consumer hardware will have more and faster RAM than the current defaults. AI features are probably going to require more than the 8GB minimum that MacBook Airs have. Sure, Mistral 7B can fit, but I think generally useful models start at 32B, and Mixtral 8x7B still falls short of GPT-4. Mixtral requires about 28GB when 4-bit quantised.
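
The 28GB figure roughly checks out; ballpark arithmetic (assuming ~47B total parameters for Mixtral 8x7B and 4-bit weights):

    params = 46.7e9            # approximate total parameter count of Mixtral 8x7B
    print(params * 0.5 / 1e9)  # ~23 GB of weights at 4 bits each; KV cache and
                               # runtime overhead push the real footprint toward ~28 GB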

It will be interesting to see if Llama etc. drive a step up in consumer hardware defaults, but always-connected, subscription-based cloud services are the trend with commercial offerings.


Fast is great but uncensored is better


They had a whole section on guided decoding, which can be used to, among other things, break censorship/alignment efforts.
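
At its core, guided decoding is just masking the next-token logits before sampling so that only tokens allowed by your grammar/regex/schema survive; a minimal sketch (my own illustration, not the article's code):

    import numpy as np

    def constrained_sample(logits, allowed_ids, temperature=1.0):
        # logits: (vocab_size,) scores for the next token; allowed_ids: indices the guide permits
        masked = np.full_like(logits, -np.inf)
        masked[allowed_ids] = logits[allowed_ids]
        probs = np.exp((masked - masked.max()) / temperature)  # softmax over allowed tokens only
        probs /= probs.sum()
        return np.random.choice(len(logits), p=probs)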


Just run/train it yourself.


For anyone else who wishes to do so, Fireship has a low-barrier video[1] on training the uncensored model mentioned in the sibling comment (Dolphin Mistral) for local use.

[1] https://www.youtube.com/watch?v=GyllRd2E6fg



What do you want to use it for? Out of all the things I use LLMs for, there isn't one situation where I care about political correctness or whatever.


Copilot tells me all the time that it doesn't want to answer my questions on multiple topics, never political ones and not usually about illegal things (talking about something illegal isn't itself illegal anyway). So even if you don't care about political correctness, you should care about censorship.

And two days ago I was able to use MS's Copilot (not GitHub's) to generate Python scripts; now it tells me it is incapable. Just tried it right now and the functionality came back, though.

They just keep removing functionalities. To me that is pretty much censoring, or whatever you want to call it. The quality degrades even faster than Google search quality.

Here is a very simple example: https://i.imgur.com/vydRUwn.png (I would never ask that question usually, but I can't remember which questions I previously asked because they deleted my history.)


I told Bing to stop hallucinating. It got upset, told me to come back when I am ready to talk, and hung up. True story!


Court documents. One mention of rape or murder and there's a 50:50 chance GPT/Claude/Bard will just shut down, when all I want is to extract the entities mentioned in transcripts or opinions.


You probably want to talk to Lexis/Nexis instead if you’re dealing with legal materials. I believe they are working with some RELX affiliates in this space.


If AI has as big of an impact as it seems like it will, we will ultimately have to design consumer-grade chips around neural networks to run it at the edge. (kind of like what https://www.etched.ai is doing)

Either that or some massive efficiency breakthrough, probably (hopefully!) both. Otherwise AI will be limited to enormous computing banks at datacenters, and it will be severely limited in terms of privacy (because no one wants to upload their personal data to datacenters, especially not to power some black-box "AI")


Isn’t that what the Neural Engine in Apple Silicon is? I believe some Pixel phones have TPUs in them as well.


Intel Meteor Lake and AMD Phoenix also introduced dedicated AI acceleration hardware.


We use some of these techniques at Groq to make the world's fastest LLM appliance! You can see for yourself how fast it is at https://groq.com/ (click "Launch demo" in the top right). The model is Llama 2 70B.

Our hardware is an "LPU" (language processing unit). The silicon has a very regular/uniform/homogeneous design with compute located close to memory, which allows us to get a much lower latency for language models than a GPU can. A lot of the smarts is in the compiler toolchain, which can fit high-level model descriptions (say Torch, ONNX) to the hardware without the need for handwritten kernels. I work on the assembler part of the toolchain, which is written in Haskell.

EDIT: To the downvoters, I guarantee that once you've tried the demo you'll be so surprised at how fast it is that you'll change your vote.



