> The microservice managed and processed large files, including encrypting them and then storing them on S3. The problem was that large files, such as 100 Gbytes, seemed to take forever to upload. Hours. Smaller files, as large as 40 Gbytes, were relatively quick, only taking minutes.
> The 100 Gbyte file doesn't fit in the 48 Gbytes of page cache, so we have many page cache misses that will cause disk I/O and relatively poor performance.
This is the kind of thing that is becoming more and more common as literally no one wants to think about how to process anything that does not fit in memory.
> The quickest fix is to move to a larger-memory instance that does fit 100 Gbyte files. The developers can also rework the code with the memory constraint in mind to improve performance (e.g., processing parts of the file, instead of making multiple passes over the entire file).
It is not trivial for a team that has never thought about why things that can be done in a constant memory footprint ought to be done that way to make this change. Ideally, your team will adopt the view that while slurping whole files into memory might be OK in toy examples, starting with a constant-memory-footprint goal eliminates a whole range of issues at the outset.
Your biggest problem then becomes getting everyone to see the value of this approach, because they will never have experienced the crashing machines, corrupted processing pipelines, sleepless nights, and missed deadlines that come when every step of the pipeline wants to read everything into memory.
The whole encrypt/upload step could be done in a single pass over fixed-size chunks here, reading the file linearly (I am not sure why the S3 bucket is not encrypted, or, if it is, what the advantage of the "double" encryption is). Incidentally, this would not even require much extra programming.
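Roughly what that single pass could look like, as a hedged sketch in Java: the cipher choice, the chunk size, and the uploadPart() helper are all assumptions for illustration, not anyone's actual API, and the all-zero IV is demo-only.

```java
// Sketch only: single-pass, constant-memory encrypt-then-upload.
// AES-CTR, the 8 MiB chunk size, and uploadPart() are assumptions;
// the all-zero IV is for illustration and must not be used for real.
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SinglePassUpload {
    static final int CHUNK = 8 * 1024 * 1024; // fixed-size chunks: memory use is independent of file size

    static void encryptAndUpload(Path file, byte[] key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                    new IvParameterSpec(new byte[16]));          // demo IV only
        try (InputStream in = new CipherInputStream(Files.newInputStream(file), cipher)) {
            byte[] buf = new byte[CHUNK];
            int n;
            while ((n = in.read(buf)) != -1) {   // one linear read of the file
                uploadPart(buf, n);              // hypothetical stand-in for a multipart-upload call
            }
        }
    }

    static void uploadPart(byte[] buf, int len) {
        // placeholder: hand the encrypted chunk to whatever uploader the service uses
    }
}
```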
The cache being full is exactly what you want. During a linear read, the kernel's readahead prefetches pages before the application asks for them, so most reads will still be satisfied from the cache. And the OS will do a much better job than the application of deciding how much of each file should remain cached.
Say you move this thing to a server with 128 GB memory. What happens when the service actually has to handle four uploads at the same time?
I have often thought that perhaps 50% of Go's success in the high-performance networking space has little to do with any of its headline features and more to do with the fact that it shipped with the io.Reader and io.Writer interfaces in the standard library from the get-go, and as a result the entire ecosystem tends to support working in terms of io.Reader and io.Writer. One of the joys of working with networking in Go is picking up some half-obscure library, like a JSON validator or an encryption/decryption library, and finding out that it defaults to stream processing because it uses Reader/Writer correctly, and that the byte-array or string interface is just a convenience wrapper around the stream processing. Anyone who posts a string- or byte-array-only library to some Go discovery mailing list or message board, where there isn't a good reason for it to take only those, will get as their first piece of feedback that they should convert the library to be based on Reader/Writer.
There is almost no technical reason any current popular language couldn't work this way. (Though C has some serious challenges with its anemic memory management, pretty much anything else can do this.) It is all down to the culture of the language community rather than the language itself, and a sort of inductive process of "all previous N libraries worked on strings, so the person writing the N+1'th library also wrote it to work on strings". It is one of the problems I have when going back to Python for the sort of work I do... it only takes one library in a pipeline that works solely on strings to ruin the ability to stream-process for the entire pipeline.
(Note this isn't praise of Go qua Go; again, almost any language is technically capable of pulling this off. It's the libraries that accumulate in a community based on strings, and the fact that it only takes one string-based library in your stack to make stream programming impossible, which means such libraries tend to "pollute" the community's library culture if you don't start from the beginning with stream processing in mind. Otherwise a language community ends up having to create a whole parallel library ecosystem based on stream processing, like Twisted used to be for Python, and that parallel ecosystem is never quite able to keep up with the main one. There are other language communities that also do this successfully, I think, but there are certain communities where the language is perfectly capable of streaming yet libraries tend to be written against fully-manifested strings.)
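To make the "stream-first, convenience wrapper on top" pattern concrete, here is a hypothetical sketch in Java (showing the design really is language-agnostic): the stream-based method is the real implementation, and the string form just delegates to it. Everything here, including the toy brace-balancing "validation", is made up for illustration.

```java
// Hypothetical example of a stream-first API with a string convenience wrapper.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public final class JsonValidator {

    // Primary API: validate while streaming, using bounded memory.
    public static boolean isValid(InputStream in) throws IOException {
        int depth = 0, c;
        while ((c = in.read()) != -1) {          // toy check: balanced braces only
            if (c == '{') depth++;
            if (c == '}' && --depth < 0) return false;
        }
        return depth == 0;
    }

    // Convenience wrapper: the String form is just a thin shim over the stream form.
    public static boolean isValid(String doc) throws IOException {
        return isValid(new ByteArrayInputStream(doc.getBytes(StandardCharsets.UTF_8)));
    }
}
```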
A few years ago I glommed onto one of Brendan Gregg's structured models, the USE method, and I've had a lot of mileage out of it.
Utilisation - have a quick squizz at what resources are being used on the host. In this case I'm guessing you'd see elevated %sys time, which would lead you to look at I/O next, so disk would probably be first on your hit list…
Saturation - where are we bottlenecking?
Errors - anything going wrong on the host (or indeed in the process).
This model isn't general though; it wouldn't help you identify a deadlock, for example.
One for the back pocket. It’s certainly earned a space in mine.
I remember him as my go-to eBPF guy / Netflix perf demigod, but now that you've brought this old vid to my attention, that old-guy-yelling-at-the-cloud act definitely works too :D
A few years back we were doing heavy file processing in Java, with most of the time being spent in file I/O. Initially it was designed as: receive a zip file, extract the zip to a directory, then read the extracted files one by one and process them. If the zip file is 1 GB and expands to 10 GB when extracted, the amount of I/O being done is significant: 1 GB read -> 10 GB write -> 10 GB read. Supposing our AWS instance type is capable of 50 MB/s, that 21 GB of I/O means we were spending a minimum of about 420 seconds in I/O alone.
This limited the throughput: the number of files that could be processed before the next batch of zip files arrived at its fixed interval. Since we only pass over each file once during processing, we had to eliminate the zip extraction and instead read the files inside the zip one by one as a decompressed byte stream. This was possible with ZipInputStream.getNextEntry() and reading the bytes, but it meant a major refactoring and the inconvenience of dealing with byte[] instead of File everywhere.
Then came the NIO and FileSystem provider features (they had been available for a while, but we only learned of their benefits then). All we had to do was replace File with Path and new FileInputStream() with Files.newInputStream(). For the zip decompression, we simply replaced ZipInputStream and ZipEntry with FileSystems.newFileSystem(). Almost instantly we had reduced 21 GB of I/O to just 1 GB, cutting the total processing time dramatically.
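For anyone wanting to try this, a rough sketch of the read side (assumes Java 13+ for the single-argument FileSystems.newFileSystem(Path); the archive name and process() are placeholders, not our actual code):

```java
// Sketch: mount the zip as a FileSystem and stream each entry directly,
// so nothing is ever extracted to disk.
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ZipReadWithoutExtract {
    public static void main(String[] args) throws IOException {
        Path zip = Path.of("input.zip");                         // placeholder archive
        try (FileSystem fs = FileSystems.newFileSystem(zip)) {   // mounts the zip; nothing is extracted
            List<Path> files;
            try (Stream<Path> walk = Files.walk(fs.getPath("/"))) {
                files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            for (Path entry : files) {
                try (InputStream in = Files.newInputStream(entry)) {  // decompressed on the fly
                    process(in);                                      // same code path as a plain file
                }
            }
        }
    }

    static void process(InputStream in) throws IOException {
        // placeholder for the existing per-file processing
        byte[] buf = new byte[8192];
        while (in.read(buf) != -1) { /* consume */ }
    }
}
```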
Based on this understanding, we were able to apply a similar approach to zip file creation: files are written directly into the zip as a compressed stream, using Path everywhere.
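And a similarly hedged sketch of the write side, creating the archive and writing an entry straight into it (the "create" option tells the zip provider to create the archive if missing; paths and contents here are placeholders):

```java
// Sketch: write straight into a zip without staging files on disk first.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

public class ZipWriteWithoutStaging {
    public static void main(String[] args) throws IOException {
        Path zip = Path.of("output.zip");                                  // placeholder path
        try (FileSystem fs = FileSystems.newFileSystem(zip, Map.of("create", "true"))) {
            Path entry = fs.getPath("report.txt");                         // entry inside the archive
            Files.write(entry, "hello".getBytes(StandardCharsets.UTF_8));  // compressed as it is written
        }
    }
}
```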
Developers need a mental model not just of memory constraints but also of the volume of disk I/O, where the added latency will quickly kill application performance once the page cache can no longer hold the files in memory.