Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I've been nerdsniped as well. I can't say I'm going to go ahead and try and solve it, but the methodology presented in the post seems suboptimal.

The best method I personally think would work, is the "compaction algorithm" documented here: http://www.davidespataro.it/cuda-stream-compaction-efficient...

True, that's a CUDA implementation, but AVX512 is closely related to GPU programmers. Effectively, you calculate the prefix sum of the "matches".

The paper the above code is based on is very clear on how this works: http://www.cse.chalmers.se/~uffe/streamcompaction.pdf

Pay close attention to "figure 1" on page 2. That's the crux of the algorithm. Assuming 8-bit characters, you can generate a prefix-sum in just 6-steps (Each step is a constant, pre-defined byte-shift + Add). A prefix sum is best described by the following picture: https://en.wikipedia.org/wiki/Prefix_sum#/media/File:Hillis-...

Full Wikipedia page on Prefix Sum: https://en.wikipedia.org/wiki/Prefix_sum

Prefix Sum is just 6-steps for a AVX512 register on 8-bit ints. That generates the full AVX512-space permute (ie: if the prefix sum is 5 for an element, that means that element belongs in index #5)., but AVX512 has "in lane" permutes only. I dunno how many steps you'd need to get a "in lane" permute into a "cross lane" permute... but it doesn't seem too difficult of a problem (and IIRC, I think i read a blogpost about how to convert the in-lane AVX512 permutes into a cross-lane one).

I bet that the above sketch of the AVX512 algorithm can be implemented in less than 30 assembly instructions for the full AVX512 / 64-byte space, maybe less than 20. That should definitely run faster than the scalar version.

-------

EDIT: Herp-derp. It doesn't seem like VPERMB is affected by AVX Lanes (!!). https://www.felixcloutier.com/x86/vpermb

So I guess you can just run VPERMB at the end on the calculated prefix-sum. The end.

-------

The Stream Compaction algorithm is a very important 1-dimentional work-balancing paradigm in the GPU programming world. It is used to select which rays are still active in a Raytracing scenario (so that all SIMD registers have something to do).



Prefix sum was my first thought as well.

One approach I haven't benchmarked is to vpmulllq (64-bit in-lane multiply) by 0x0101010101010101. That produces an 8-byte prefix sum in each lane, so then you need to prefix-sum the high bytes (either by mul or 3 rounds of shift/add) and broadcast them back to their respective lanes to sum the whole sequence.

I can't figure out the latencies on uops.info for vpmullq, but it's probably 3-5 cycles followed by a shuffle, ~6 cycles for the high-byte prefix sum, and then a shuffle and add. ~15 cycles including the final vpermb (also forgot timings for that).


Interesting, I started out thinking along these lines, but once I figured out I could use PEXT, I just went with that.

I think this approach needs some tweaks, though. Mainly that the vpermb at the end is the inverse of what we want--the bytes at dense indices get spread out to the sparse indices (it works analogously to gather, but we want scatter). I can't think of a way around this right now...

That said, it's an interesting approach. I think the PEXTs would be the bottleneck in my code (looks like there's only one execution unit for them, whereas there's two for the VPADDs), and finding a way to parallelize all the VPADDs could lead to a nice speedup.


You're right.

I did a brief look through AVX512 instructions to look for a solution, and unfortunatley, it seems we both may have been overthinking this.

vpcompressb more or less does the job in one instruction. Agner Fog doesn't have a latency listed however.

---------

My search methodology was basically this: https://software.intel.com/sites/landingpage/IntrinsicsGuide...

Search for __m512i (integer-based ZMM registers), with the category "swizzle" (which includes permute, insert, and other such instructions). I figure any potential AVX512 instruction would be a "Swizzle" style instruction.

-------

Note: I originally responded to the wrong location in this thread. I copy/pasted my text to here, which is where I originally intended to respond.


Do you recall which machines VPCOMPRESSB works on? I think it's next generation Icelake? Or is it there already on Cannonlake? And along the same lines, is there a good general way of looking this up?

Coincidentally, searching for this, I found Geoff Langdale's blog post where in addition to describing VPCOMPRESSB as 'dynamiting the trout stream', he also describes something very close to zwegner's PEXT approach: https://branchfree.org/2018/05/22/bits-to-indexes-in-bmi2-an...


It's not in Cannonlake (nor the W variants). The D and Q versions are in SKX though and they are 4L2T IIRC.

You need a CPU with VBMI2 for the B variant, can't remember off the top of my head if Icelake has that.


Oh sweet! That's an awesome instruction. I'd imagine that would be useful for lots of things. I believe I've seen vcompressd before, but totally forgot about it.

Unfortunately it looks like the byte-wise version is part of AVX512-VBMI2, which won't be out until Ice Lake...


You might have seen vcompessd in context of sorting; I used it for partition part in qsort.


It actually would've been during my time at Intel working on the graphics stack for Larrabee, in the 2010-2011 timeframe--vcompressd was part of LRBNI. I was mainly doing infrastructure/compiler/optimization type work, and not much graphics stuff, so I can't recall using the instruction personally, but pretty sure it was used in various places around the stack.


> I've been nerdsniped as well. I can't say I'm going to go ahead and try and solve it, but the methodology presented in the post seems suboptimal.

Let me explain it. I do know the presented approach is extremely naive, but... My initial question was: "how slow this might be?", and it turned out that's not that bad as I supposed, so shared this finding with others. :)

Thank you for pointing this article.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: