True, but the problem is that that is today better done on vector hardware like ...

fulafel · on Nov 28, 2022

GPUs are still an unworkable target for wide end user audiences because of all the fragmentation, mutually incompatible APIs on macOS/Windows/Linux, proprietary languages, poor dev experience, buggy driver stacks etc.

Not to mention a host of other smaller problems (e.g. no standard way to write tightly coupled CPU/GPU code, spotty virtualization support in GPUs, lack of integration in established high-level languages, and other chilling factors).

The ML niche that can require specific kinds of Nvidia GPUs seems to be an island of its own that works for some things, but it's not great.

pjmlp · on Nov 28, 2022

While true, it is still easier to write shader code than to understand the low-level details of SIMD and similar instruction sets, which are only exposed in a few select languages.

Even JavaScript has easier ways to call into GPU code than exposing vector instructions.

fulafel · on Nov 29, 2022

Yes, one is easier to write and the other is easier to ship, except for WebGL.

The JS/browser angle has another GPU-related parallel here. WebAssembly SIMD has been shipping for a couple of years and, like WebGL, makes the browser platform one of the few portable ways to access this parallel-programming functionality now.

(But the functionality is limited to approximately the same as 1999-vintage x86 SSE1.)

dtx1 · on Nov 28, 2022

> You could say that the intersecting area in the Venn diagram of "Has to run on CPU" and "Can use vector instructions" is small.

People are forgetting the "Could run on a GPU but I don't know how" factor. There are tons of situations where GPU offloading would be faster or more energy efficient, but importing all the libraries, dealing with drivers, etc. really is not worth the effort, whereas doing it on a CPU is really just a simple include away.

titzer · on Nov 28, 2022

> You could say that the intersecting area in the Venn diagram of "Has to run on CPU" and "Can use vector instructions" is small.

I dunno, JSON parsing is stupid hot these days because of web stacks. Given the neat parsing tricks by simdjson mentioned upthread, it seems like AVX512 could accelerate many applications that boil down to linear searches through memory, which includes lots of parsing and network problems.
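
To make the "linear search through memory" shape concrete, here is a minimal portable sketch. It is not simdjson's method and uses no AVX512; it applies the same idea at a much smaller scale by testing 8 bytes per step inside an ordinary 64-bit integer (a SWAR trick). The function name `find_byte_swar` is hypothetical, made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from simdjson): returns the index of the first
// byte equal to `needle` in data[0..len), or len if there is none.
std::size_t find_byte_swar(const unsigned char* data, std::size_t len,
                           unsigned char needle) {
    const std::uint64_t ones  = 0x0101010101010101ULL;
    const std::uint64_t highs = 0x8080808080808080ULL;
    const std::uint64_t pattern = ones * needle;  // needle copied into all 8 lanes

    std::size_t i = 0;
    // Wide loop: examine 8 bytes per iteration.
    for (; i + 8 <= len; i += 8) {
        std::uint64_t word;
        std::memcpy(&word, data + i, 8);    // unaligned-safe load
        const std::uint64_t x = word ^ pattern;  // matching lanes become 0x00
        // Nonzero exactly when at least one lane of x is zero.
        if (((x - ones) & ~x & highs) != 0) {
            break;  // a match is somewhere in these 8 bytes
        }
    }
    // Byte loop: pinpoints the match inside the flagged word, and also
    // handles the tail shorter than 8 bytes. Endianness does not matter here.
    for (; i < len; ++i) {
        if (data[i] == needle) {
            return i;
        }
    }
    return len;
}
```

A real AVX512 version would compare 64 bytes at once and get a bitmask of matches back, but the control flow (wide compare, then locate the hit) has the same outline.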

mhh__ · on Nov 28, 2022

Firing up a GPU just to do a little bit of inference sounds quite expensive.

CPUs (let alone CPUs talking to a GPU) spend huge numbers of cycles shunting data around already.

dragontamer · on Nov 28, 2022

> few reasons to use AVX-512 on a CPU,

Memcpy and memset are massively parallel operations used on a CPU all the time.

But let's ignore the _easy_ problems. AES-GCM mode is massively parallel as well: each 128-bit block of AES-GCM can be processed independently, so AVX512-AES encryption can process 4 blocks in parallel per clock tick.

Linus is just somehow ignorant of this subject...

stephencanon · on Nov 28, 2022

Icelake and later CPUs have a REP MOVS / REP STOS implementation that is generally optimal for memcpy and memset, so there’s no reason to use AVX512 for those except in very specific cases.

dragontamer · on Nov 28, 2022

Does AMD support enhanced REP MOVS?

I know when I use GCC to compile with AVX512 flags, it seems to output memcpy as AVX registers / ZMMs and stuff...

Auto-vectorization usually sucks for most code. But very simple structure-setting, memcpy/memset-like code is ideal for AVX512. It's a pretty common use case (think a C++ vector<SomeClass> where the default constructor sets the 128-byte structure to some defaults).
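
A minimal sketch of that use case, assuming a made-up field layout for `SomeClass` (the comment only says it is 128 bytes with defaults) and a hypothetical helper `make_defaults`. Whether the fill loop actually becomes ZMM stores depends on the compiler, flags, and target; inspecting the assembly from something like `g++ -O3 -mavx512f -S` is one way to check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative layout only: 4 + 4 + 120 = 128 bytes on common ABIs.
struct SomeClass {
    std::uint32_t id = 0;
    float scale = 1.0f;
    std::uint8_t payload[120] = {};
};
static_assert(sizeof(SomeClass) == 128, "sketch assumes a 128-byte structure");

// Hypothetical helper: default-constructs n elements. The vector writes the
// same 128-byte default pattern n times, which is the memset-like loop that
// a vectorizing compiler may turn into wide stores.
std::vector<SomeClass> make_defaults(std::size_t n) {
    return std::vector<SomeClass>(n);
}
```

Calling `make_defaults(1000)` yields 1000 elements that each have `scale == 1.0f` and a zeroed `payload`.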

stephencanon · on Nov 29, 2022

AVX512 doesn't itself imply Icelake+; the actual feature is FSRM (fast short rep movs), which is distinct from AVX512. In particular, Skylake Xeon, Cannon Lake, Cascade Lake, and Cooper Lake all have AVX512 but not FSRM. My expectation is that all future architectures will support it, so I would expect memcpy and memset implementations tuned for Icelake and onwards to take advantage of it.
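
Since the two features are reported separately, a runtime check has to look at both. The following is a rough sketch, assuming GCC or Clang on x86 (it uses `__get_cpuid_count` from `<cpuid.h>`) and falling back to "nothing detected" elsewhere. The names `CpuFeatures`, `decode_leaf7`, and `query_cpu_features` are hypothetical. The bit positions are those of CPUID leaf 7, subleaf 0: EBX bit 9 (ERMS), EBX bit 16 (AVX512F), EDX bit 4 (FSRM).

```cpp
#include <cassert>
#include <cstdint>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#endif

// Hypothetical result type for this sketch.
struct CpuFeatures {
    bool avx512f;  // AVX-512 Foundation (CPU support only; OS state not checked)
    bool erms;     // Enhanced REP MOVSB/STOSB
    bool fsrm;     // Fast Short REP MOV
};

constexpr bool bit_set(std::uint32_t reg, unsigned bit) {
    return ((reg >> bit) & 1u) != 0;
}

// Decodes the EBX/EDX values returned by CPUID leaf 7, subleaf 0.
CpuFeatures decode_leaf7(std::uint32_t ebx, std::uint32_t edx) {
    return CpuFeatures{bit_set(ebx, 16), bit_set(ebx, 9), bit_set(edx, 4)};
}

// Queries the running CPU; reports all-false where CPUID leaf 7 is unavailable.
CpuFeatures query_cpu_features() {
#ifdef SKETCH_HAVE_CPUID
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        return decode_leaf7(ebx, edx);
    }
#endif
    return CpuFeatures{false, false, false};
}
```

On a Skylake Xeon one would expect `avx512f` to be true and `fsrm` false, matching the distinction above. A production check for AVX512 would also need to confirm that the OS has enabled the ZMM register state, which this sketch skips.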