In my experience, for the best balance between compression speed and compression ratio, nothing beats 7zip with the right options:
-mmt=$(nproc) # use all available cores
-ms=off # disable solid archives (compress each file separately)
-m0=lzma2 # lzma2 has better threading than lzma1
-md=64m # dictionary size
-ma=0 # "fast" mode
-mmf=hc4 # hash chain match finder
-mfb=64 # number of "fast bits"
-mf=off # disable filters
The biggest gains are:
- using all available cores,
- setting the match finder (the binary tree match finders are terribly slow; I haven't played much with the newer patricia tree match finders),
- disabling solid archives (this seems to cause 7zip to distribute the work more evenly between cores, though it may still only use a few cores if there are many small files),
- using "fast" mode (whatever that is, it gives a noticeable performance boost and doesn't seem to affect the compression ratio much).
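If you want to experiment with these knobs without the 7z CLI, Python's stdlib lzma module exposes roughly the same LZMA2 parameters. A sketch, not an exact 7zip clone (the parameter names are Python's, and 7zip's container/threading behavior isn't reproduced):

```python
import lzma

# Approximate mapping of the 7z flags above onto lzma filter options:
#   -md=64m  -> dict_size
#   -ma=0    -> mode=MODE_FAST
#   -mmf=hc4 -> mf=MF_HC4 (hash-chain match finder)
#   -mfb=64  -> nice_len ("fast bytes")
filters = [{
    "id": lzma.FILTER_LZMA2,
    "dict_size": 64 * 1024 * 1024,
    "mode": lzma.MODE_FAST,
    "mf": lzma.MF_HC4,
    "nice_len": 64,
}]

data = b"some repetitive payload " * 10_000
compressed = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)
assert lzma.decompress(compressed) == data
print(len(data), "->", len(compressed))
```

Handy for A/B-ing the match finders (MF_HC4 vs MF_BT4) on your own data before committing to a set of 7z flags.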
Every few years I try zstd and others, and for the data I work with (primarily a mix of json and fixed-width-field binary data), lots of tools beat 7zip out of the box, but they fall short of 7zip with the above command-line options.
A comparable zstd call that uses a 64 MB window size and all cores is:
zstd --long=26 -T0
From there you can tune the compression level, or increase the window size up to 2 GB (--long=31). zstd won't beat the compression of xz, but it can compress much faster if you trade off some space.
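As a sanity check on those numbers: the argument to --long is a window log, i.e. a power-of-two window size in bytes.

```python
# zstd's --long=N sets the match window to 2**N bytes.
def window_bytes(window_log: int) -> int:
    return 2 ** window_log

assert window_bytes(26) == 64 * 1024 * 1024   # --long=26 -> 64 MiB
assert window_bytes(31) == 2 * 1024 ** 3      # --long=31 -> 2 GiB
```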
Perhaps it's the use of a dictionary? As far as I'm aware, tar, zstd, and xz don't use one by default; it's an extra set of hoops to create a training set, build the dictionary, use it for compression, pack it away somewhere so that it's available at decompression time, and then actually use it for decompression. If 7zip does all of that just from passing -md=64m, that's pretty cool.
Edit: Ahh, I was confused. Neither requires a separate training step: zstd merely offers one as an option. Both always use dictionaries, with a default size that can optionally be changed.
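Right: in LZMA the "dictionary" is just a sliding window over recent input, so there's nothing to train or ship, and -md only sets how far back matches can reach. A quick demonstration with Python's stdlib lzma (a sketch; the sizes are arbitrary): a 1 MiB random block repeated twice only compresses to roughly half its size when the window is large enough to reach back to the first copy.

```python
import lzma
import os

block = os.urandom(1 << 20)          # 1 MiB of incompressible bytes
data = block * 2                     # second copy is one long-range match

def lzma2_size(payload: bytes, dict_size: int) -> int:
    """Compressed size of payload under LZMA2 with the given window."""
    filters = [{"id": lzma.FILTER_LZMA2, "dict_size": dict_size}]
    return len(lzma.compress(payload, format=lzma.FORMAT_XZ, filters=filters))

small = lzma2_size(data, 1 << 16)    # 64 KiB window: can't see the first copy
large = lzma2_size(data, 1 << 21)    # 2 MiB window: can
print(small, large)                  # small is roughly 2 MiB, large roughly 1 MiB
```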
$ time 7zr a -mmt=$(nproc) -ms=off -m0=lzma2 -md=64m -ma=0 -mmf=hc4 -mfb=64 -mf=off linux-5.0.8.tar{.7z,}
real 60.49 user 158.94 sys 3.06 maxrss 8995040
$ stat -c '%s %n' linux-5.0.8.tar.7z
127700475 linux-5.0.8.tar.7z
$ time 7zr e -so linux-5.0.8.tar.7z >/dev/null
real 14.09 user 13.96 sys 0.12 maxrss 282208
Basically:
- it took twice as long to compress the data, even compared to xz -2 (which also uses LZMA2 under the hood),
- it is comparable to zstd/bzip2 ratio-wise,
- it used almost 6 times (!) more RAM than even zstd -12 --long,
- it only used about 2.5 CPU cores out of 4 while compressing (which aligns pretty well with your reasoning for using -ms=off).
----
But hey, source code is not that regular. Since you mentioned JSON and fixed-width-field binary data, I decided to re-run the benchmarks on 10M lines of nginx access logs: they're way more regular in structure (repetitive URLs, timestamps, Mozilla/5.0 user-agent strings, and so on), which might benefit from larger window sizes.
$ time lbzip2 -k access-log-10m.log
real 90.59 user 313.04 sys 18.46 maxrss 117904
$ time ~/zstd-1.4.0/zstd -T0 -k -12 access-log-10m.log -o access-log-10m.log.zst-12
real 77.34 user 277.21 sys 1.55 maxrss 886416
$ time ~/zstd-1.4.0/zstd -T0 -k -12 --long access-log-10m.log -o access-log-10m.log.zst-12-long
real 69.24 user 242.18 sys 1.85 maxrss 1411872
$ time 7zr a -mmt=$(nproc) -ms=off -m0=lzma2 -md=64m -ma=0 -mmf=hc4 -mfb=64 -mf=off access-log-10m.log{.7z,}
real 109.10 user 356.42 sys 4.69 maxrss 9777664
$ stat -c '%s %n' access-log-10m.log* | sort -n
208537395 access-log-10m.log.bz2
231953002 access-log-10m.log.zst-12-long
237566691 access-log-10m.log.zst-12
249412192 access-log-10m.log.7z
3386733539 access-log-10m.log
Now the tweaked 7z did better CPU- and time-wise, but it's still behind zstd and bz2 on every metric, especially RAM, of which it requires so much (literally gigabytes) that it becomes impractical in a number of situations. And we needed a pretty regular input (not just some fairly compressible text like source code or a Wikipedia dump) to close that gap. So I can't really recommend your suggestion, unless you have some niche input that benefits from that particular set of options (but then, who has time to learn the lzma internals and how every option interacts with different kinds of input?).