
Linus Torvalds:

  And I claim that that is the real problem with AVX-512 (and pretty much any vectorization). I personally cannot find a single benchmark that does anything I would ever do - not even remotely close. So if you aren't into some chess engine, if you aren't into parsing (but not using) JSON, if you aren't into software raytracing (as opposed to raytracing in games, which is clearly starting to take off thanks to GPU support), what else is there?
Answer? Neural net inference, e.g., https://NN-512.com

If you need a little bit of inference (say, 20 ResNet-50s per second per CPU core) as part of a larger system, there's nothing cheaper. If you're doing a small amount of inference, perhaps limited by other parts of the system, you can't keep a GPU fed and the GPU is a huge waste of money.

AVX-512, with its masked operations and dual-input permutations, is an expressive and powerful SIMD instruction set. It's a pleasure to write code for, but we need good hardware support (which is literally years overdue).
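
To make that concrete, here's a minimal sketch of my own (not from the parent post) of those two features as they appear in <immintrin.h>, compiled with -mavx512f:

  #include <immintrin.h>

  // acc[i] += x[i] only in lanes where the corresponding bit of `keep` is set;
  // other lanes keep their old value (that's the "masked operation" part).
  __m512i masked_add(__m512i acc, __m512i x, __mmask16 keep) {
    return _mm512_mask_add_epi32(acc, keep, acc, x);
  }

  // Interleave the low dwords of two registers: index entries < 16 pick from
  // `a`, entries >= 16 pick from `b` (that's the dual-input permute, vpermt2d).
  __m512i interleave_low_dwords(__m512i a, __m512i b) {
    const __m512i idx = _mm512_setr_epi32(0, 16, 1, 17, 2, 18, 3, 19,
                                          4, 20, 5, 21, 6, 22, 7, 23);
    return _mm512_permutex2var_epi32(a, idx, b);
  }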



I'd say AES encryption/decryption (aka: every HTTPS connection out there) and SHA256 hashing are big. As is CRC32 (the PCLMULQDQ instruction), and others.

There's... a ton of applications for AVX512. I know that Linus loves his hot takes, but he's pretty ignorant on this particular subject.

I'd say that most modern computers are probably reading from TLS1.2 (aka: AES decryption), processing some JSON, and then writing back out to TLS1.2 (aka: AES encryption), with probably some CRC32 checks in between.

--------

Aside from that, CPU signal filtering (aka: GIMP image processing, Photoshop, JPEG encoding/decoding, audio / musical stuff). There's also raytracing with more than the 8GB to 16GB found in typical GPUs (i.e., modern CPUs support 128GB easily, and 2TB if you go server-class), and Moana back in 2016 was using up 100+ GB per scene. So even if GPUs are faster, they still can't hold modern movie raytraced scenes in memory, so you're kinda forced to use CPUs right now.


> AES Encryption/Decryption (aka: every HTTPS connection out there),

That's already had dedicated hardware on most x86 CPUs for a good few years now. Fuck, I have some tiny ARM core with like 32kB of RAM somewhere that rocks AES acceleration...

> So even if GPUs are faster, they still can't hold modern movie raytraced scenes in memory, so you're kinda forced to use CPUs right now.

Can't GPUs just use system memory at a performance penalty?


> that already have dedicated hardware on most of the x86 CPUs for good few years now

Yeah, and that "dedicated hardware" is called AES-NI, which is implemented as instructions operating on the SIMD (XMM/YMM/ZMM) registers.

In AVX512 (the VAES extension), they now apply to 4 blocks at a time (512 bits wide is 128 bits x 4 parallel instances). AES-NI getting upgraded with AVX512 is... well... a big, important update to AES-NI.

AES-NI's next-generation implementation _IS_ AVX512. And it works because AES-GCM is embarrassingly parallel (apologies to all who are stuck on AES-CBC, whose encryption is inherently sequential).
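
Roughly, the "4 blocks at a time" point looks like this in intrinsics. This is a hedged sketch of mine, not a full GCM implementation; in CTR/GCM mode the four blocks would just be four consecutive counter blocks (compile with -maes -mvaes -mavx512f):

  #include <immintrin.h>

  // One AES encryption round on four independent 128-bit blocks at once
  // (VAESENC on a ZMM register).
  __m512i aes_round_x4(__m512i four_blocks, __m512i round_keys) {
    return _mm512_aesenc_epi128(four_blocks, round_keys);
  }

  // The classic 128-bit AES-NI form: one block per instruction.
  __m128i aes_round_x1(__m128i block, __m128i round_key) {
    return _mm_aesenc_si128(block, round_key);
  }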


> Can't GPUs just use system memory at performance penalty ?

CPUs can access DDR4/DDR5 RAM at ~50 nanoseconds. A GPU accessing that same system RAM is more like 5000 nanoseconds, 100x slower than the CPU. There's no hope for the GPU to keep up, especially since raytracing is _very_ heavy on RAM latency. Each ray "bounce" is basically a chain of dependent memory reads (traversing a BVH tree).
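
As a rough illustration (my own schematic, with made-up types and a simplified single-path descent rather than a real stack-based traversal), the problem is that each iteration's load address comes out of the previous load:

  struct BVHNode {
    float bmin[3], bmax[3];  // axis-aligned bounding box
    int   left, right;       // child indices; -1 marks a leaf
  };

  struct Ray { float origin[3], inv_dir[3]; };

  static bool hit_box(const BVHNode &n, const Ray &r) {
    float tmin = 0.0f, tmax = 1e30f;
    for (int a = 0; a < 3; ++a) {          // standard slab test
      float t0 = (n.bmin[a] - r.origin[a]) * r.inv_dir[a];
      float t1 = (n.bmax[a] - r.origin[a]) * r.inv_dir[a];
      if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; }
      tmin = t0 > tmin ? t0 : tmin;
      tmax = t1 < tmax ? t1 : tmax;
    }
    return tmin <= tmax;
  }

  int descend(const BVHNode *nodes, int root, const Ray &ray) {
    int current = root;
    while (current >= 0 && nodes[current].left >= 0) {
      const BVHNode &n = nodes[current];   // dependent read: this address came
                                           // from the previous iteration's read
      current = hit_box(nodes[n.left], ray) ? n.left : n.right;
    }
    return current;                        // leaf index (or -1 if the tree is empty)
  }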

It's just better to use a CPU if you end up using DDR4/DDR5 RAM to hold the data. There are algorithms that break a scene up into octrees holding, say, 8GB worth of data each; then the GPU can calculate all the light bounces within a box (and write out the "bounces" that leave the box), etc. etc. But this is very advanced and under heavy research.

For now, it's easier to just use a CPU that can access all 100GB+ and render the scene without splitting it up. Maybe eventually this split-into-octrees / process-each-chunk-on-the-GPU approach will become better researched and better implemented, and GPUs will traverse system RAM a bit better.

GPUs will be better eventually. But CPUs are still better at the task today.


I am confused: CPUs have dedicated instructions for AES encryption and CRC32. Are they slower than AVX512?


> I am confused: CPUs have dedicated instructions for AES encryption and CRC32. Are they slower than AVX512?

Those instructions are literally AVX instructions, and have been _upgraded_ in AVX512 (VAES, VPCLMULQDQ) to be 512 bits wide now.

If you use the older 128-bit wide AES-NI instructions, rather than the AVX512 AES-NI (VAES) instructions, you're 4x slower than me. AVX512 upgrades _ALL_ AVX instructions to 512 bits (and mind you, AES-NI was stuck at 128 bits, so the jump to 512 bits is a huge upgrade in practice).

-----

EDIT: CRC32 is implemented with the PCLMULQDQ (carry-less multiply) instruction, which has also been upgraded to 512 bits in AVX512 as VPCLMULQDQ.
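
For reference, a small sketch of mine showing what that widening buys: the 512-bit form does four independent 128-bit carry-less multiplies per instruction, which is the folding step at the heart of PCLMULQDQ-based CRC32 (compile with -mvpclmulqdq -mavx512f):

  #include <immintrin.h>

  // Four independent 128-bit carry-less multiplies in one VPCLMULQDQ.
  __m512i crc_fold_x4(__m512i data, __m512i fold_constants) {
    // imm 0x00: multiply the low 64-bit half of each 128-bit lane of `data`
    // by the low 64-bit half of the matching lane of `fold_constants`.
    return _mm512_clmulepi64_epi128(data, fold_constants, 0x00);
  }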


True, but the problem is that today that is better done on vector hardware like a GPU or other ML hardware. The world has sort of diverged into two camps: vectorizable problems that can be massively parallelized (graphics, simulation, ML), for which we use GPUs, and then everything else, which is CPU. What I think Linus is saying is that there are few reasons to use AVX-512 on a CPU when there is a GPU much better suited for those kinds of problems.

You could say that the intersecting area in the Venn diagram of "Has to run on CPU" and "Can use vector instructions" is small.


GPUs are still an unworkable target for wide end-user audiences because of all the fragmentation: mutually incompatible APIs on macOS/Windows/Linux, proprietary languages, poor dev experience, buggy driver stacks, etc.

Not to mention a host of other smaller problems (e.g. no standard way to write tightly coupled CPU/GPU code, spotty virtualization support in GPUs, lack of integration into established high-level languages, and other chilling factors).

The ML niche that can require specific kinds of NVIDIA GPUs seems to be an island of its own that works for some things, but it's not great.


While true, it is still easier to write shader code than to try to understand the low-level details of SIMD and similar instruction sets, which are only exposed in a few select languages.

Even JavaScript has easier ways to call into GPU code than exposing vector instructions.


Yes, one is easier to write and the other is easier to ship, except for WebGL.

The JS/browser angle has another GPU-related parallel here. WebAssembly SIMD has been shipping for a couple of years now and, like WebGL, makes the browser platform one of the few portable ways to access this parallel-programming functionality.

(But the functionality is limited to approximately the same as 1999-vintage x86 SSE1.)


> You could say that the intersecting area in the Venn diagram of "Has to run on CPU" and "Can use vector instructions" is small.

People are forgetting the "Could run on a GPU but I don't know how" factor. There are tons of situations where GPU offloading would be faster or more energy efficient, but importing all the libraries, dealing with drivers, etc. really is not worth the effort, whereas doing it on a CPU is really just a simple include away.


> You could say that the intersecting area in the Venn diagram of "Has to run on CPU" and "Can use vector instructions" is small.

I dunno, JSON parsing is stupid hot these days because of web stacks. Given the neat parsing tricks by simdjson mentioned upthread, it seems like AVX512 could accelerate many applications that boil down to linear searches through memory, which includes lots of parsing and network problems.
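
For a flavor of what that looks like, here's a sketch of mine (nothing to do with simdjson's actual code): scan 64 bytes per iteration, get the matches back as a 64-bit mask, and let a masked load cover the tail (compile with -mavx512f -mavx512bw):

  #include <immintrin.h>
  #include <cstddef>

  // Return the index of the first occurrence of `needle` in `buf`, or -1.
  std::ptrdiff_t find_byte(const char *buf, std::size_t len, char needle) {
    const __m512i target = _mm512_set1_epi8(needle);
    for (std::size_t i = 0; i < len; i += 64) {
      std::size_t rem = len - i;
      __mmask64 valid = rem >= 64 ? ~0ULL : ((1ULL << rem) - 1);
      __m512i chunk   = _mm512_maskz_loadu_epi8(valid, buf + i);  // tail-safe load
      __mmask64 hits  = _mm512_mask_cmpeq_epi8_mask(valid, chunk, target);
      if (hits)
        return (std::ptrdiff_t)(i + __builtin_ctzll(hits));       // first match
    }
    return -1;
  }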


Firing up a GPU just to do a little bit of inference sounds quite expensive.

CPUs (let alone CPUs talking to a GPU) spend huge numbers of cycles shunting data around already.


> few reasons to use AVX-512 on a CPU,

Memcpy and memset are massively parallel operations used on a CPU all the time.

But let's ignore the _easy_ problems. AES-GCM mode is massively parallel as well: each 128-bit block of AES-GCM can run in parallel, so AVX512 AES encryption can process 4 blocks in parallel per clock tick.

Linus is just somehow ignorant of this subject...


Icelake and later CPUs have a REP MOVS / REP STOS implementation that is generally optimal for memcpy and memset, so there’s no reason to use AVX512 for those except in very specific cases.
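
For the curious, a memcpy built on REP MOVS really is about this small; a sketch assuming GCC/Clang inline asm on x86-64, where the microcode on ERMS/FSRM parts picks the copy strategy:

  #include <cstddef>

  void *repmovsb_memcpy(void *dst, const void *src, std::size_t n) {
    void *ret = dst;
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(n)  // RDI/RSI/RCX are consumed
                 :
                 : "memory");
    return ret;
  }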


Does AMD support enhanced REP MOVS?

I know when I use GCC to compile with AVX512 flags, it seems to emit memcpy as AVX register / ZMM stores and stuff...

Auto-vectorization usually sucks for most code. But very simple structure-setting / memcpy / memset-like code is ideal for AVX512. It's a pretty common use case (think a C++ vector<SomeClass> where the default constructor sets the 128-byte structure to some defaults).
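
Something like this made-up example of that pattern (the names are mine, not anyone's real code); with -O2 -mavx512f a compiler may lower the fill into 512-bit ZMM stores, though that's up to the auto-vectorizer:

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  struct SomeClass {
    std::uint64_t id = 0;
    double        weights[14] = {};  // pads the struct out to 128 bytes
    std::uint64_t flags = 1;
  };
  static_assert(sizeof(SomeClass) == 128, "expected a 128-byte structure");

  std::vector<SomeClass> make_defaults(std::size_t n) {
    return std::vector<SomeClass>(n);  // default-constructs n objects
  }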


AVX512 doesn't itself imply Icelake+; the actual feature is FSRM (fast short rep movs), which is distinct from AVX512. In particular, Skylake Xeon, Cannon Lake, Cascade Lake, and Cooper Lake all have AVX512 but not FSRM. My expectation is that all future architectures will have support, so I would expect memcpy and memset implementations tuned for Icelake and onwards to take advantage of it.
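
Since FSRM is a separate CPUID bit from AVX512, a memcpy/memset implementation can probe for it at startup. A minimal sketch, assuming GCC/Clang's <cpuid.h> helper and that FSRM is reported in CPUID leaf 7, sub-leaf 0, EDX bit 4:

  #include <cpuid.h>

  bool has_fsrm() {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
      return false;        // CPUID leaf 7 not available
    return (edx >> 4) & 1; // Fast Short REP MOVSB
  }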


The two ARM64 systems on the desk next to me have neural net engines built in.


Software defined radio.


"boy, if I exclude all the places where it's actually used, there's really not any places that use it, amirite guys?"

begging the fucking question to the max, post better, Linus


We already have GPUs for that


GPUs have high enough latency that for O(n) operations, the time it takes to move the data to the GPU will be higher than the time it takes to run the problem on the CPU. AVX-512 is great because it makes it easy to speed up code to the point that it's memory-bottlenecked.


Not sure you're using O(n) right



