Hacker News | NovaX's comments

Go tried that [1], a failed experiment that was a complex NIH version of the generational hypothesis. They currently use a CMS-style collector.

[1] https://docs.google.com/document/d/1gCsFxXamW8RRvOe5hECz98Ft...


That is just a normal JVM with optional Graal components that, even if enabled, are not being used here. The default memory allocation is based on a percentage of available memory and is uncommitted (meaning it's available to other programs). When people mention Graal they usually mean an AOT-compiled executable that can run without a JVM installed; sometimes they mean the Graal JIT as a replacement for C1/C2, which is also available in VM mode. You are using a plain HotSpot VM in server mode, as the optimized client mode was removed when desktop use cases were deprioritized (e.g. JWS was discontinued).


You are correct and I apologize for the misimpression.

`native-image -jar helloworld.jar helloworld` did take a whopping 17 seconds to compile what might be the smallest possible project. That does make me worry about iterating to get better perf in a context where startup overhead matters, BUT the executable it produced did run much faster - only about 1.8x slower than `tcc -run`:

    97.0 +- 1.7 μs  (AlreadySubtracted)Overhead
    98.2 +- 2.8 μs  awk '{}'</n
    376.4 +- 4.4 μs tcc -run /tmp/true.c
    527.9 +- 4.0 μs perl</n
    686.5 +- 3.9 μs ./helloworld>/n
Perl has 2 more shared libraries for ld.so to link, but is somehow faster. So, there may still be some room for improvement, but anyway, thank you for the correction.

(Also, I included 4 of the faster comparative programs to show additionally that the error bars are vaguely credible. In truth, on time shared OSes, the distributions have heavier tails than Gaussian and so a single +- is inadequate.)

--

EDIT: So, the ld.so/dynamic linking overhead was bothering me. I had to get a musl + zlib build environment going, but I did after a few minutes and then found this result with a fully statically linked binary executable:

    398.2 +- 4.2 μs ./helloworld-sta>/n
(I should have noted earlier that /n -> /dev/null is just a convenience symlink I put on all my systems. Also, this is all on Linux 6.18.2 and the same CPU as before.)

Also, it used only around 4.2 MiB of RSS compared to ~1.8 MiB for dash & awk. So, 2.4x the space and only ~4x the time of static awk & dash. That might sound like criticism, but those are usually the efficiency champs. The binary size is kind of hefty (~2x, like the RAM use):

    $ size *sta
       text    data     bss     dec     hex filename
    2856644 3486552    3184 6346380  60d68c helloworld-sta
So, I guess, in 2026, Java start-up overhead is pretty acceptable. I hope that these "real numbers" can maybe add some precision to the discussion. Someone saying "mere milliseconds" just does not mean as much to me as 400 +- 4 microseconds, and perhaps there are others like me.


Thanks for the corrected evaluation. Just for your awareness, the HotSpot team is working on offering a spectrum of AOT/JIT options under the Leyden project [1]. Currently one has to choose between either a fully open world (JIT) or a closed world (AOT). The mixed world allows for a tunable knob, e.g. pre-warming by AOT while retaining dynamic class loading and fast compile times for the application. This will soften the hard edges so developers can choose the constraints that best fit their application's deployment/usage model.

[1] https://openjdk.org/projects/leyden


This was a good attempt but flawed.

1. You should use JMH to handle warmup, JIT compilation, timings, averaging of runs, etc. You might enjoy the following talks on the subject, though I am sure there are other gems out there.

- Performance Anxiety: https://wiki.jvmlangsummit.com/images/1/1d/PerformanceAnxiet...

- Benchmarking for Good: https://www.youtube.com/watch?v=SKPdqgD1I2U

- (The Art of) (Java) Benchmarking: https://shipilev.net/talks/j1-Oct2011-21682-benchmarking.pdf

- A Crash Course in Modern Hardware: https://www.youtube.com/watch?v=OFgxAFdxYAQ

- How NOT to Write a Microbenchmark https://www.slideshare.net/slideshow/2002-microbenchmarks/28...

- Anatomy of a flawed microbenchmark https://web.archive.org/web/20110513090823/http://www.ibm.co...

2. An uncontended lock is basically free, as you noticed.

3. Guava does implement the Map interface via asMap(). It also has a concurrencyLevel setting to adjust its performance. It is based on Java 5's CHM.

4. Caffeine is a Guava Cache rewrite based on Java 8's CHM rewrite, plus many learnings. You might look at its benchmarks for ideas: https://github.com/ben-manes/caffeine/wiki/Benchmarks

5. Data structures are surprisingly tricky. For example, see this analysis showing an accidental misunderstanding that degraded an LRU to O(n) eviction: https://gist.github.com/ben-manes/6312727adfa2235cb7c5e25cae...

6. It is important to remember that the goal of a benchmark is never which is faster or by how much. It is to ask (a) is it fast enough? (b) might I reach a point where it will not be? and (c) does it degrade unexpectedly? This is to say the winner is of little interest; once all the choices are good enough, then it is about usability, features, documentation, friendliness, and so on.
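On point 5, the contrast is stark because the JDK already ships the O(1) building block. A toy sketch (not the gist's code, just the canonical JDK idiom):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal O(1) LRU built on the JDK's LinkedHashMap. accessOrder=true moves
// an entry to the tail of the internal list on each get, and overriding
// removeEldestEntry evicts the head (least recently used) after each insert.
final class Lru<K, V> extends LinkedHashMap<K, V> {
  private final int capacity;

  Lru(int capacity) {
    super(16, 0.75f, /* accessOrder */ true);
    this.capacity = capacity;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > capacity;  // evict when over capacity; O(1) per operation
  }
}
```

For example, with capacity 2: put a, put b, get a, put c evicts b, since touching a made b the least recently used entry.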


I completely agree with point 6, obviously these microbenchmarks are not representative of much of anything, I was just trying to check my assumptions on some of this stuff.

I actually knew about the Guava asMap(), but I wasn't sure if that introduced overhead and I thought that might be an unfair test.

I have heard of Caffeine, but I haven't used it yet. Maybe I'll write an updated post talking about it.

I'll look into JMH and make a part 2 of this post tomorrow.


In both libraries, the asMap() view unwraps the opinionated Cache facade to the underlying bounded map.

There are a lot of little gotchas, so it is best not to believe your own results until proven otherwise. A Rust influencer wrote EVMap, eagerly giving a talk about its performance equaling concurrent maps' while being much simpler. However, since he generated the random numbers inside the test loop, he did not realize that the generator takes a lock to compute the next seed. This throttled the benchmark and made the faster maps look slower, as they created more contention, fully invalidating and inverting the results. Sadly, since this was presented as part of a PhD, such truths are of little importance and the false claims continue to be shown. That's just to share that innocent mistakes are the norm, but incentives not to correct them are doubly so, and that you should never trust your own results until you've exhausted all attempts to prove them invalid.
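The same trap exists on the JVM: java.util.Random funnels every thread through one atomically updated seed (a CAS retry loop), so a shared generator in the measured loop can dominate a multithreaded benchmark. A hypothetical sketch of the two shapes:

```java
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.LongAdder;

// Sketch of the benchmark-loop pitfall: a shared java.util.Random serializes
// all threads on one seed, so the generator, not the data structure under
// test, becomes the bottleneck. ThreadLocalRandom keeps per-thread state and
// is the usual fix in JMH-style benchmarks.
class RngContention {
  static long run(boolean shared, int threads, int iterations) {
    Random sharedRng = new Random(42);     // one seed contended by all threads
    LongAdder draws = new LongAdder();     // LongAdder avoids adding our own hotspot
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        for (int j = 0; j < iterations; j++) {
          int next = shared
              ? sharedRng.nextInt(100)                     // contended path
              : ThreadLocalRandom.current().nextInt(100);  // uncontended path
          if (next >= 0) {
            draws.increment();             // count each draw; keeps `next` live
          }
        }
      });
      workers[i].start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        throw new RuntimeException(e);
      }
    }
    return draws.sum();                    // == threads * iterations
  }
}
```

Both shapes produce the same numbers; only the throughput differs, which is exactly why the results looked plausible.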

Sounds like you have the right perspective and a great attitude. Feel welcome to ping me if you run into any troubles or questions as I collab'd on Guava's cache and wrote Caffeine.


I actually just got JMH set up in Gradle, and I'm rewriting my tests to try and give more accurate results. Once I have numbers from JMH I'll write another post, and I'll put a caveat to take them with a grain of salt.

In this specific example, I was actually afraid that randomness taking indeterminate time could be an issue, which is why I pre-allocated all the UUIDs at the beginning and used the same list as input data for each test; I figured that if all the setup was allocated before I started the timer, and every test had the same input, then we could factor out any inefficiencies, since they should affect all of them roughly equally.
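Concretely, the shape I'm aiming for looks something like this (a plain-JDK sketch with hypothetical names; in JMH the keys would live in a @State(Scope.Benchmark) class filled by an @Setup method):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the "allocate inputs before the timer" pattern: all keys are
// generated up front, so the measured loop touches only the map under test.
class PreallocatedInputs {
  // Setup phase: runs before any timing starts.
  static List<String> makeKeys(int count) {
    List<String> keys = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      keys.add(UUID.randomUUID().toString());  // UUID cost paid outside the timer
    }
    return keys;
  }

  // Measured phase: only map operations, identical input for every candidate.
  static int measuredLoop(List<String> keys) {
    ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
    for (String key : keys) {
      map.put(key, key.length());
    }
    return map.size();  // UUIDs are unique, so this equals keys.size()
  }
}
```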

Anyway, I have wanted to learn JMH for awhile, so this is the push I needed to actually do it.


How about the senior engineer level error of forking Doug Lea's concurrent data structures only to make them operate in worst-case time complexity? Found this doozy recently.

[1] https://gist.github.com/ben-manes/6312727adfa2235cb7c5e25cae...


I think the idea is that the cache is so large that hot data won't be forced out by one-hit wonders. In this 2017 talk [1], the speaker says that Twitter's SLA depends on having a 99.9% hit rate. It is very common for popular sites to have an extremely over-provisioned remote caching tier. That makes eviction less important, and reducing their operational costs comes from purging expired data more proactively. Hence, memcached switched away from relying on its LRU to discard expired entries to using a sweeper. Caffeine's approach, a timing wheel, was considered, but dormando felt it was too much of an internal change for memcached and that the sweeper could serve multiple purposes.

[1] https://www.youtube.com/watch?v=kxMKnx__uso
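For the curious, a single-level timing wheel is small enough to sketch. This toy (not Caffeine's hierarchical wheel) assumes delays shorter than the wheel's span:

```java
import java.util.ArrayList;
import java.util.List;

// Toy single-level timing wheel: entries hash into the bucket for their
// deadline tick, so advancing the clock expires a whole bucket at once
// instead of scanning an LRU list for stale entries. A real wheel (as in
// Caffeine) is hierarchical so long delays don't wrap; here a delay must be
// strictly less than the wheel size.
final class TimingWheel {
  private final List<List<String>> buckets = new ArrayList<>();
  private int cursor;

  TimingWheel(int size) {
    for (int i = 0; i < size; i++) {
      buckets.add(new ArrayList<>());
    }
  }

  void schedule(String entry, int ticksFromNow) {
    buckets.get((cursor + ticksFromNow) % buckets.size()).add(entry);
  }

  // One clock tick: move the cursor and drain the bucket that just came due.
  List<String> advance() {
    cursor = (cursor + 1) % buckets.size();
    List<String> bucket = buckets.get(cursor);
    List<String> expired = new ArrayList<>(bucket);
    bucket.clear();
    return expired;
  }
}
```

An entry scheduled three ticks out survives two calls to advance() and is returned by the third, with no per-tick scan of the other entries.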


You might be interested in this thread [1] where I described an idea for how to incorporate the latency penalty into the eviction decision. A developer even hacked a prototype that showed promise. The problem is that there is not enough variety in the available trace data to be confident that a design isn't overly fit to a particular workload and doesn't generalize. As more data sets become available it will become possible to experiment with ideas and fix unexpected issues until a correct, simple, elegant design emerges.

[1] https://github.com/ben-manes/caffeine/discussions/1744


I'll check it out.

The thing is that the math changes when the system is saturated. The closer you get to tipping over, the more each new request costs. I feel like I can clearly recall times when I had to make a Sophie's Choice between p50 and p95 times because of issues of this sort.


As the article mentions, Caffeine's approach is to monitor the workload and adapt to these phase changes. This stress test [1] demonstrates shifting back and forth between LRU and MRU request patterns, and the cache reconfiguring itself to maximize the hit rate. Unfortunately most policies are not adaptive or do it poorly.

Thankfully most workloads follow a relatively consistent pattern, so it is an atypical worry. Algorithm designers usually have a target scenario, like a CDN or database, so they generally skip reporting the low-performing workloads. That may work for a research paper, but when providing a library we cannot know what our users' workloads are, nor should we expect engineers to invest in selecting the optimal algorithm. Caffeine's adaptivity removes this burden and broadens its applicability, and other language ecosystems have been slowly adopting similar ideas in their caching libraries.

[1] https://github.com/ben-manes/caffeine/wiki/Efficiency#adapti...
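In practice none of this needs configuring; bounding the cache enables the adaptive policy, and recordStats() lets you observe the hit rate on your own workload rather than guess. A minimal usage sketch:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Minimal Caffeine usage: setting maximumSize bounds the cache (the eviction
// policy then adapts to the workload automatically), and recordStats()
// enables cheap counters so the hit rate can be measured.
class CaffeineStats {
  static double sampleHitRate() {
    Cache<String, String> cache = Caffeine.newBuilder()
        .maximumSize(10_000)   // bounded: adaptive eviction is enabled
        .recordStats()         // enables stats().hitRate(), evictionCount(), ...
        .build();
    cache.put("k", "v");
    cache.getIfPresent("k");          // recorded as a hit
    cache.getIfPresent("missing");    // recorded as a miss
    return cache.stats().hitRate();   // 1 hit over 2 lookups
  }
}
```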


Yep, no bots. A real bug not only means that I wasted someone else's time, but the report is a gift for an improvement. If it's a misunderstanding, then it's motivation that my project is used and deserves a generous reply. This perspective, and treating it strictly as a hobby rather than as a criticism or demand for work, makes OSS feel more sustainable.


Thanks Jonathan!


I tried to reimplement Linux's algorithm in [1], but I cannot be sure about its correctness. They adjust the fixed sizes at construction based on the device's total memory, so it varies between a phone and a server. The fast trace simulation in the CI [2] may be informative (see DClock). Segmentation is very common; algorithms differ in how they promote entries and in how/if they adapt the segment sizes.

[1] https://github.com/ben-manes/caffeine/blob/master/simulator/...

[2] https://github.com/ben-manes/caffeine/actions/runs/130865965...
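To make "segmentation" concrete, here is a toy SLRU sketch (my own illustrative names, not the Linux or Caffeine code): new entries land in a probationary segment and are promoted to a protected segment on a second hit, so one-hit wonders never displace the hot set.

```java
import java.util.LinkedHashMap;

// Toy segmented LRU (SLRU). New entries enter `probation`; a hit there
// promotes the entry into `shielded` (named so because `protected` is a Java
// keyword). When the protected segment overflows, its coldest entry is
// demoted back to probation, and probation overflow evicts. Assumes distinct
// keys for simplicity.
final class Slru<K, V> {
  private final int probationCap;
  private final int protectedCap;
  private final LinkedHashMap<K, V> probation = new LinkedHashMap<>(16, 0.75f, true);
  private final LinkedHashMap<K, V> shielded = new LinkedHashMap<>(16, 0.75f, true);

  Slru(int probationCap, int protectedCap) {
    this.probationCap = probationCap;
    this.protectedCap = protectedCap;
  }

  V get(K key) {
    V value = shielded.get(key);     // access-order map refreshes recency on get
    if (value != null) {
      return value;
    }
    value = probation.remove(key);
    if (value != null) {             // second hit: promote to the protected segment
      shielded.put(key, value);
      if (shielded.size() > protectedCap) {
        demoteEldest();
      }
    }
    return value;
  }

  void put(K key, V value) {
    probation.put(key, value);       // newcomers must earn promotion
    evictIfNeeded();
  }

  private void demoteEldest() {      // coldest protected entry returns to probation
    K eldest = shielded.keySet().iterator().next();
    probation.put(eldest, shielded.remove(eldest));
    evictIfNeeded();
  }

  private void evictIfNeeded() {
    if (probation.size() > probationCap) {
      probation.remove(probation.keySet().iterator().next());
    }
  }
}
```

The knobs the Linux code sizes at construction correspond here to probationCap and protectedCap; adaptive policies resize that split at runtime instead of fixing it.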

