Hacker News

A general question for gamedevs here: how useful is SIMD now that we have compute shaders on the GPU? If it is still useful, which workloads call for SIMD, and why would you choose one over the other?


Physics, specifically, benefits from CPU processing. Efficient rendering pipelines are typically one-way (CPU -> GPU), whereas the results of physics calculations are depended on by both the game logic and the rendering, and it's much simpler (and probably more efficient) to keep that computation on the CPU. The exception to this could be UMA architectures like the Apple M-series and the PS4, where memory transport isn't a limiting factor – though memory/cache invalidation might be an issue?
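To make the round-trip point concrete, here's a toy frame loop (Python rather than game-engine code, and all the names are my own illustration, not from the comment): the same physics results feed gameplay immediately and are then handed off one-way for rendering, which is why keeping the physics step on the CPU is so convenient.

```python
# Toy sketch of a frame: physics output is consumed by game logic the same
# frame, then packaged one-way for the GPU. A GPU physics step would insert
# a GPU round trip before the gameplay line.

def step_physics(positions, velocities, dt):
    # CPU: integrate positions and report which bodies hit the ground (y <= 0)
    new_pos = [(x, y + vy * dt) for (x, y), vy in zip(positions, velocities)]
    contacts = [i for i, (_, y) in enumerate(new_pos) if y <= 0.0]
    return new_pos, contacts

def frame(positions, velocities, dt):
    positions, contacts = step_physics(positions, velocities, dt)
    # Game logic reads the contact list immediately, on the CPU...
    sounds = [f"thud_{i}" for i in contacts]
    # ...and the same positions are then prepared one-way for rendering.
    draw_list = [("sprite", p) for p in positions]
    return sounds, draw_list

sounds, draw_list = frame([(0.0, 1.0), (0.0, 0.05)], [0.0, -1.0], 0.1)
assert sounds == ["thud_1"]  # only the second body touched the ground
```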


Even with UMA architectures where you eliminate the memory transport costs, it still costs a ton of time to actually launch a GPU kernel from the CPU.


Yeah, that's why I qualified with 'could'. It really depends on what facilities the hardware and driver provide. If the GPU is on the same die, perhaps the launch latency isn't that big a deal, but I really don't have the data on that. I'd really like to see something like voxel deformable/destructible environments leveraging UMA on the Apple M-series. That seems like something that could be groundbreaking, if only Apple really cared about gaming at all.


With graphics you mostly prepare everything you want to render and then transfer all of it to the GPU. Physics lends itself fairly well to GPU acceleration too (compared to other things), but simply preparing something, transferring it to the GPU and being done is not enough. You need to at least get the results back, even just to render them, and likely also to have gameplay depend on them.

In graphics programming the expensive part is often the communication between the CPU and the GPU, and you try hard to avoid synchronization (especially with the old graphics APIs), so transferring there and back is expensive. Physics code is also full of branches, while graphics code usually is not, and GPUs (or really wide vectorization generally) don't like branches much. And if you do only certain parts of the physics simulation on the GPU, then you need to transfer there and back (and synchronize) even more.

I'm just a hobby gamedev, and I know people have done physics on the GPU (PhysX), but to me the things I mentioned sound like big hurdles.
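To make the branching point concrete: lanes in a SIMD register or a GPU warp execute in lockstep, so when they disagree on a branch, the whole group typically pays for both sides with masking. A toy cost model (Python, my own illustration; the group size and costs are made up):

```python
# Toy cost model for a data-dependent branch under lockstep execution.
# Assumption (mine, not from the comment): if lanes in a group disagree,
# the group executes BOTH sides with masking, paying the sum of both costs.

def lockstep_cost(lane_conditions, cost_then, cost_else):
    """Cost for one group of lanes evaluating a branch together."""
    if all(lane_conditions):       # coherent: everyone takes the 'then' side
        return cost_then
    if not any(lane_conditions):   # coherent: everyone takes the 'else' side
        return cost_else
    return cost_then + cost_else   # divergent: pay for both sides

# Coherent branches are cheap...
assert lockstep_cost([True] * 32, 10, 50) == 10
# ...but a single disagreeing lane makes the whole group pay for both paths.
assert lockstep_cost([True] * 31 + [False], 10, 50) == 60
```

Branch-heavy physics code hits the divergent case constantly (contact vs. no contact, sleeping vs. awake bodies), which is part of why it maps less cleanly onto wide vector hardware than shading does.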

EDIT: one more big thing is that, at least for AAA games, you want to keep the GPU busy doing graphics so the game looks good. You rarely have GPU cycles to spare.


I'm not a gamedev, but I do a lot of numerical work. GPUs are great, but they're no replacement for SIMD.

For example, I just made a little example on my desktop where I summed up 256 random Float32 numbers, and doing it in serial takes around 152 nanoseconds, whereas doing it with SIMD took just 10 nanoseconds. Doing the exact same thing with my GPU took 20 microseconds, so 2000x slower:

    julia> using CUDA, SIMD, BenchmarkTools

    julia> function vsum(::Type{Vec{N, T}}, v::Vector{T}) where {N, T}
               s = Vec{N, T}(0)
               lane = VecRange{N}(0)
               for i ∈ 1:N:length(v)
                   s += v[lane + i]
               end
               sum(s)
           end;

    julia> let L = 256
               print("Serial benchmark:  "); @btime vsum(Vec{1, Float32}, v)  setup=(v=rand(Float32, $L))
               print("SIMD benchmark:    "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
               print("GPU benchmark:     "); @btime sum(v)                    setup=(v=CUDA.rand($L))
           end;
    Serial benchmark:    152.239 ns (0 allocations: 0 bytes)
    SIMD benchmark:      10.359 ns (0 allocations: 0 bytes)
    GPU benchmark:       19.917 μs (56 allocations: 1.47 KiB)

The reason is simply that it takes that long to communicate with the GPU and launch a kernel; almost none of that time was actually spent doing the computation. E.g. here's what the benchmark looks like if instead I have 256^2 numbers:

    julia> let L = 256^2
               print("Serial benchmark:  "); @btime vsum(Vec{1, Float32}, v)  setup=(v=rand(Float32, $L))
               print("SIMD benchmark:    "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
               print("GPU benchmark:     "); @btime sum(v)                    setup=(v=CUDA.rand($L))
           end;
    Serial benchmark:    42.370 μs (0 allocations: 0 bytes)
    SIMD benchmark:      2.669 μs (0 allocations: 0 bytes)
    GPU benchmark:       27.592 μs (112 allocations: 2.97 KiB)
So we're now at the point where the GPU is faster than serial, but still slower than SIMD. If we go up to 256^3 numbers, we finally see a convincing advantage for the GPU:

    julia> let L = 256^3
               print("Serial benchmark:  "); @btime vsum(Vec{1, Float32}, v)  setup=(v=rand(Float32, $L))
               print("SIMD benchmark:    "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
               print("GPU benchmark:     "); @btime sum(v)                    setup=(v=CUDA.rand($L))
           end;
    Serial benchmark:    11.024 ms (0 allocations: 0 bytes)
    SIMD benchmark:      2.061 ms (0 allocations: 0 bytes)
    GPU benchmark:       353.119 μs (113 allocations: 2.98 KiB)
So the lesson here is that GPUs are only worth it if you actually have enough data to saturate them; otherwise you're way better off using SIMD.
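You can estimate the crossover point from the timings above with a back-of-the-envelope model (sketched in Python rather than Julia for brevity; the rates and the ~20 μs overhead are derived from the posted numbers, so treat them as rough assumptions, not measurements):

```python
# Model: t_simd(n) = n / r_simd            (no fixed cost)
#        t_gpu(n)  = overhead + n / r_gpu  (fixed launch/sync cost)
# Rates backed out of the benchmarks above:
r_simd = 256**3 / 2.061e-3   # ~8e9 elem/s from the 256^3 SIMD timing
r_gpu = (256**3 - 256**2) / (353.119e-6 - 27.592e-6)  # ~5e10 elem/s incremental
overhead = 20e-6             # ~20 us fixed GPU cost, from the 256-element case

# Break-even size where the GPU's throughput edge pays off its fixed cost:
n_star = overhead / (1 / r_simd - 1 / r_gpu)  # ~2e5 elements

# Consistent with the benchmarks: at 256^2 the GPU still loses, at 256^3 it wins.
assert 256**2 < n_star < 256**3
```

Below roughly n_star elements the fixed launch cost dominates and SIMD wins; well above it, the GPU's raw throughput takes over, matching the three benchmark runs.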

GPUs are also just generally a lot more limiting than SIMD in many other ways.


Thank you for your reply!

> GPUs are also just generally a lot more limiting than SIMD in many other ways.

What do you mean? (besides things like CUDA being available only on Nvidia/fragmentation issues.)


Here are a few random limitations I can think of, other than those already mentioned:

* Float64 math is typically around 30x slower than Float32 math on consumer-grade GPUs, due to an arbitrary limitation meant to stop people from using consumer chips for "workstation" purposes. That turns out not to be a big deal for things like machine learning, but lots of computational processes are rather sensitive to rounding errors and benefit a lot from 64-bit numbers, which are very slow on GPUs.

* Writing GPU-specific functions can be quite labour-intensive compared to writing CPU code. Julia's CUDA.jl and KernelAbstractions.jl packages do make a lot of things quite a bit nicer than in most languages, but it's still a lot of work to write good GPU code.

* Profiling and understanding the performance of GPU programs is typically a lot more complicated than CPU programs (even if there are some great tools for it!) because the performance model is just fundamentally more complex with more stuff going on and more random pitfalls and gotchas.



