The official challenge is in Java, but there are unofficial solutions in the "Show and tell" category, using other languages, databases, and so on (there's even a Fortran solution that takes 3 seconds without going too far into optimization):
I like this discussion because while the original is about "what can you do in 2024 stock java", I'm more interested in "what can you do with naive, obvious solutions". AWK does this in 6 minutes but, as pointed out in this comment:
> It's either 6 minutes with 20 lines of AWK written in five minutes. Or 6 seconds in 300 lines of java written in five days. Classic "XKCD: Is It Worth the Time?".
You can have both the five minutes of work and some extra JVM speed juice over awk. The following Kotlin script runs in ~2m46s on my M1 Mac. That's faster than the baseline, and it's a bit more complex than it otherwise could be in order to avoid using too much RAM.
I think it's also more readable than the awk for people who don't already know the language, while being of a similar length, but that might be just me. More usefully, you get type checking, autocompletion, the ability to use any libraries you want, and so on.
import kotlin.io.path.Path
import kotlin.io.path.useLines
import kotlin.math.max
import kotlin.math.min

// Running min/max/sum/count per station; the mean is derived on demand.
// NB: hi starts at NEGATIVE_INFINITY, not Double.MIN_VALUE, which is the
// smallest *positive* double and would give a wrong max for all-negative stations.
class Result(var lo: Double = Double.POSITIVE_INFINITY, var hi: Double = Double.NEGATIVE_INFINITY, var total: Double = 0.0, var samples: Int = 0) {
    val mean get() = total / samples
}

val results = HashMap<String, Result>()
Path("measurements.txt").useLines { lines ->
    lines
        .map { it.substringBefore(";") to it.substringAfter(";").toDouble() }
        .forEach { (name, measurement) ->
            with(results.getOrPut(name) { Result() }) {
                lo = min(lo, measurement)
                hi = max(hi, measurement)
                total += measurement
                samples++
            }
        }
}
results.toSortedMap().mapValues { (_, m) -> "%.1f/%.1f/%.1f".format(m.lo, m.mean, m.hi) }.also(::println)
You can do this sort of thing with jbang, although for some reason Gunnar's baseline is slower. Maybe I'm doing something wrong.
> although for some reason Gunnar's baseline is slower. Maybe I'm doing something wrong.
I suspect it's because the baseline code is adding each line to a TreeMap before processing, effectively sorting 1B rows and then processing them. Your code reduces to the 413 unique stations and then sorts those.
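To make the difference concrete, here's an illustrative sketch (not Gunnar's actual baseline code; the station names and values are made up) contrasting sort-while-inserting against reduce-then-sort:

```kotlin
import java.util.TreeMap

fun main() {
    val rows = listOf("Oslo;1.0", "Lima;2.0", "Oslo;3.0")

    // Sort-as-you-go: every insert pays an O(log n) tree walk,
    // a billion times over on the full input.
    val treeSorted = TreeMap<String, Double>()
    for (r in rows) treeSorted.merge(r.substringBefore(";"), r.substringAfter(";").toDouble(), Double::plus)

    // Reduce-then-sort: cheap O(1) hash inserts, then one sort
    // over the ~413 unique station keys at the very end.
    val reduced = HashMap<String, Double>()
    for (r in rows) reduced.merge(r.substringBefore(";"), r.substringAfter(";").toDouble(), Double::plus)

    println(treeSorted == reduced.toSortedMap())  // true: same result, different cost
}
```

Same output either way; only the second variant keeps the comparison cost off the billion-row hot path.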
I feel like Gunnar purposefully wrote the baseline with lots of room for improvement to encourage participation.
Yes, I think you're right. The way he defines two classes for holding the results is another hint that it was built to be easily optimized :) And, well, you gotta hand it to him. That really went viral!
That's really interesting, and I like seeing what can be done in more "evolved" languages than awk. Awk has the advantage of being everywhere with no installation step, but installation is a price you only pay once.
Your Kotlin code feels more imperative to me, even though I'm used to that kind of thing; I like that awk still feels more like describing what you want, but maybe that's just me.
I didn't do it that way because, with one billion rows, the naive functional version runs out of memory. The imperative version that mutates in place doesn't.
You can probably write a functional version that also doesn't run out of memory using .groupingBy {}, since that works on lazy sequences too; I just didn't bother working out how. The Java streams framework can do the same things, with the added advantage that it sometimes makes parallelizing easy.
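For what it's worth, a sketch of that groupingBy idea, with a tiny in-memory sequence standing in for measurements.txt (the station names here are made up). Grouping.fold keeps one small accumulator per station, so memory scales with the number of stations rather than the number of rows:

```kotlin
import kotlin.math.max
import kotlin.math.min

// One immutable accumulator per station, folded as the sequence streams past.
data class Agg(val lo: Double, val hi: Double, val total: Double, val samples: Int)

fun main() {
    val lines = sequenceOf("Oslo;-3.0", "Lima;21.5", "Oslo;1.0")  // lazy, like useLines
    val results = lines
        .map { it.substringBefore(";") to it.substringAfter(";").toDouble() }
        .groupingBy { (name, _) -> name }
        .fold(Agg(Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, 0.0, 0)) { acc, (_, m) ->
            Agg(min(acc.lo, m), max(acc.hi, m), acc.total + m, acc.samples + 1)
        }
        .toSortedMap()
        .mapValues { (_, a) -> "%.1f/%.1f/%.1f".format(a.lo, a.total / a.samples, a.hi) }
    println(results)  // {Lima=21.5/21.5/21.5, Oslo=-3.0/-1.0/1.0} in a dot-decimal locale
}
```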
> It's either 6 minutes with 20 lines of AWK written in five minutes. Or 6 seconds in 300 lines of java written in five days. Classic "XKCD: Is It Worth the Time?".
I'd bet that a good part of the baseline's 2 min 50 s is taken by the byte -> String conversion. Java is pretty bad at this, especially when the default charset is UTF-8. Even specifying ASCII in the Files.lines() call may improve performance dramatically.
All the leaders work directly with bytes; the other tricks are just for sub-second improvements.
I solved a similar problem with log files.
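A hedged sketch of that Files.lines() charset point, using a temp file as a stand-in for measurements.txt. Files.lines(path) decodes UTF-8 by default; passing a single-byte charset explicitly selects a cheaper decoder. One caveat: the real 1BRC data contains non-ASCII station names, so a lossless single-byte charset like ISO-8859-1 is a safer variant of this experiment than US-ASCII:

```kotlin
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import java.nio.file.Path
import kotlin.io.path.createTempFile
import kotlin.io.path.writeText

// Count lines using an explicit single-byte charset instead of the UTF-8 default.
fun countLines(path: Path): Long =
    Files.lines(path, StandardCharsets.ISO_8859_1).use { lines -> lines.count() }

fun main() {
    // Tiny stand-in for measurements.txt so the sketch is self-contained.
    val sample = createTempFile(suffix = ".txt")
    sample.writeText("Lima;21.5\nOslo;-3.0\n")
    println(countLines(sample))  // 2
}
```

Whether the cheaper decoder actually shaves minutes off the baseline would need measuring; the big wins reported in the thread came from skipping String creation entirely.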
Yeah, strings are very expensive; I saw huge gains in my submission after removing them. Parallelization gets the runtime down to 30s, and after removing strings you get to around 15s on the eval machine. The rest of the tricks are still very relevant for getting to sub-10s.
> It's either 6 minutes with 20 lines of AWK written in five minutes. Or 6 seconds in 300 lines of java written in five days. Classic "XKCD: Is It Worth the Time?".
It's exactly this. Is it going to be run many times or just once or a few times? Is it going to be run in production where it impacts customers or runs up the AWS bill? If it's a throwaway script, the Linux terminal and some Awk is hard to beat. There are always tradeoffs.
https://github.com/gunnarmorling/1brc/discussions/categories...
https://github.com/gunnarmorling/1brc/discussions/171#discus...