having implemented it, with unoptimized verilog (ok, i wrote a verilog generator to generate it), it requires about 30% fewer LUTs on a FPGA relative to berkeley hardfloat implementation.
Is the variable length regime handling that much easier to deal with (space-wise) than the NaN and subnormal handling needed in IEEE floats? I'd think that the regime scheme would effectively be equivalent to creating a multitude of different-width subnormal routes. Is it really the NaN handling that kills IEEE float performance?
It's basically a barrel shifter; for addition you're going to need it anyways. Multiplication is a bit nastier, but most of multiplier gates are the adder gates anyways. I made a useful insight that negative numbers are basically the same as positives, with a "minus two" invisible bit.
Here is a sample 8-bit multiplier. All code was generated using a verilog DSL I wrote in Julia for the specific purpose. All verilog is tested by transpiling to c using verilator and mounting the shared object into a Julia runtime with a Julia implementation.