I've been nerdsniped as well. I can't say I'm going to go ahead and try and solv...

kwillets · on Jan 7, 2019

Prefix sum was my first thought as well.

One approach I haven't benchmarked is to vpmullq (64-bit in-lane multiply) by 0x0101010101010101. That produces an 8-byte prefix sum in each lane, so then you need to prefix-sum the high bytes (either by mul or 3 rounds of shift/add) and broadcast them back to their respective lanes to sum the whole sequence.
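For illustration, here is a scalar sketch of the multiply trick on a single 64-bit lane. It only models the arithmetic in Python, not the actual vector instruction, and the names `lane_prefix_sum` and `bytes_of` are made up for this example. It assumes every running sum stays below 256; otherwise carries leak into the next byte.

```python
MASK64 = (1 << 64) - 1
ONES = 0x0101010101010101  # one in every byte of the 64-bit lane


def lane_prefix_sum(lane: int) -> int:
    """Model the low 64 bits of lane * 0x0101010101010101.

    Byte i of the result holds the sum of input bytes 0..i, as long as
    no running sum reaches 256 (assumption: inputs are small, e.g. 0/1).
    """
    return (lane * ONES) & MASK64


def bytes_of(lane: int) -> list:
    """Split a 64-bit value into its 8 bytes, least significant first."""
    return [(lane >> (8 * i)) & 0xFF for i in range(8)]


# Eight 0/1 bytes, least significant byte first.
lane = int.from_bytes(bytes([1, 0, 1, 1, 0, 0, 1, 1]), "little")
print(bytes_of(lane_prefix_sum(lane)))
```

The top byte of each lane ends up holding that lane's total, which is the value that would still have to be prefix-summed across lanes.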

I can't figure out the latencies on uops.info for vpmullq, but it's probably 3-5 cycles followed by a shuffle, ~6 cycles for the high-byte prefix sum, and then a shuffle and add. ~15 cycles including the final vpermb (also forgot timings for that).

zwegner · on Jan 6, 2019

Interesting, I started out thinking along these lines, but once I figured out I could use PEXT, I just went with that.

I think this approach needs some tweaks, though. Mainly that the vpermb at the end is the inverse of what we want--the bytes at dense indices get spread out to the sparse indices (it works analogously to gather, but we want scatter). I can't think of a way around this right now...
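To make the gather/scatter distinction concrete, here is a small scalar sketch in Python. The helper names are hypothetical, and `gather_permute` only models the general "dst[i] = src[idx[i]]" shape of a byte permute, not the exact instruction.

```python
def dense_slots(mask_bits):
    """Exclusive prefix sum: the packed slot each position would land in."""
    slots, count = [], 0
    for bit in mask_bits:
        slots.append(count)
        count += bit
    return slots


def scatter_indices(mask_bits):
    """What we want: write each set position into its dense slot."""
    slots = dense_slots(mask_bits)
    out = [None] * sum(mask_bits)
    for pos, bit in enumerate(mask_bits):
        if bit:
            out[slots[pos]] = pos  # scatter: dst[idx[i]] = src[i]
    return out


def gather_permute(src, idx):
    """Permute-style gather: dst[i] = src[idx[i]]."""
    return [src[j] for j in idx]


mask_bits = [0, 1, 0, 0, 1, 0, 1, 0]
print(scatter_indices(mask_bits))                               # packed set positions
print(gather_permute(list(range(8)), dense_slots(mask_bits)))   # not the same thing
```

With the same index vector, the gather form reads from the dense slots and spreads those values over the sparse positions, which is the opposite of packing the sparse positions down.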

That said, it's an interesting approach. I think the PEXTs would be the bottleneck in my code (looks like there's only one execution unit for them, whereas there's two for the VPADDs), and finding a way to parallelize all the VPADDs could lead to a nice speedup.

dragontamer · on Jan 6, 2019

You're right.

I took a brief look through the AVX512 instructions for a solution, and unfortunately, it seems we both may have been overthinking this.

vpcompressb more or less does the job in one instruction. Agner Fog doesn't list a latency for it, however.
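As a rough scalar model of what a byte compress does (a Python sketch assuming the zero-filling form, with a made-up function name, not the intrinsic itself): elements whose mask bit is set are packed to the low end, and the remaining slots are zeroed.

```python
def compress_bytes(src, mask):
    """Pack src[i] for every set bit i of mask into the low slots; zero the rest."""
    kept = [b for i, b in enumerate(src) if (mask >> i) & 1]
    return kept + [0] * (len(src) - len(kept))


# Compressing the identity vector 0..7 yields the positions of the set bits.
print(compress_bytes(list(range(8)), 0b01010010))
```

Feeding it the identity vector is what turns a bitmask into a packed list of set-bit indices in one step.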

---------

My search methodology was basically this: https://software.intel.com/sites/landingpage/IntrinsicsGuide...

Search for __m512i (integer-based ZMM registers), with the category "swizzle" (which includes permute, insert, and other such instructions). I figure any potential AVX512 instruction would be a "Swizzle" style instruction.

-------

Note: I originally responded to the wrong location in this thread. I copy/pasted my text to here, which is where I originally intended to respond.

nkurz · on Jan 6, 2019

Do you recall which machines VPCOMPRESSB works on? I think it's next generation Icelake? Or is it there already on Cannonlake? And along the same lines, is there a good general way of looking this up?

Coincidentally, searching for this, I found Geoff Langdale's blog post where in addition to describing VPCOMPRESSB as 'dynamiting the trout stream', he also describes something very close to zwegner's PEXT approach: https://branchfree.org/2018/05/22/bits-to-indexes-in-bmi2-an...

BeeOnRope · on Jan 7, 2019

It's not in Cannonlake (nor the W variants). The D and Q versions are in SKX though and they are 4L2T IIRC.

You need a CPU with VBMI2 for the B variant, can't remember off the top of my head if Icelake has that.

zwegner · on Jan 6, 2019

Oh sweet! That's an awesome instruction. I'd imagine that would be useful for lots of things. I believe I've seen vcompressd before, but totally forgot about it.

Unfortunately it looks like the byte-wise version is part of AVX512-VBMI2, which won't be out until Ice Lake...

wmu · on Jan 8, 2019

You might have seen vcompressd in the context of sorting; I used it for the partition step in qsort.
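A scalar sketch of how a compress can drive a partition step, in Python. This is only an illustration under my own assumptions, with hypothetical helper names, and not the code referred to above: compare a chunk against the pivot to get a mask, then compress once with the mask and once with its inverse.

```python
def compress(values, keep):
    """Keep the elements whose flag is set, preserving their order."""
    return [v for v, k in zip(values, keep) if k]


def partition_chunk(chunk, pivot):
    """Split one vector-sized chunk into (< pivot) and (>= pivot) parts."""
    less = [v < pivot for v in chunk]
    return compress(chunk, less), compress(chunk, [not k for k in less])


print(partition_chunk([5, 1, 9, 3, 7, 2, 8, 4], 5))
```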

zwegner · on Jan 8, 2019

It actually would've been during my time at Intel working on the graphics stack for Larrabee, in the 2010-2011 timeframe--vcompressd was part of LRBNI. I was mainly doing infrastructure/compiler/optimization type work, and not much graphics stuff, so I can't recall using the instruction personally, but pretty sure it was used in various places around the stack.

wmu · on Jan 6, 2019

> I've been nerdsniped as well. I can't say I'm going to go ahead and try and solve it, but the methodology presented in the post seems suboptimal.

Let me explain. I do know the presented approach is extremely naive, but... My initial question was "how slow might this be?", and it turned out to be not as bad as I had supposed, so I shared the finding with others. :)

Thank you for pointing out this article.