> The microservice managed and processed large files, including encrypting them and then storing them on S3. The problem was that large files, such as 100 Gbytes, seemed to take forever to upload. Hours. Smaller files, as large as 40 Gbytes, were relatively quick, only taking minutes.
> The 100 Gbyte file doesn't fit in the 48 Gbytes of page cache, so we have many page cache misses that will cause disk I/O and relatively poor performance.
This is the kind of thing that is becoming more and more common as literally no one wants to think about how to process anything that does not fit in memory.
> The quickest fix is to move to a larger-memory instance that does fit 100 Gbyte files. The developers can also rework the code with the memory constraint in mind to improve performance (e.g., processing parts of the file, instead of making multiple passes over the entire file).
This change is not trivial for a team that has never thought about why work that can be done in a constant memory footprint ought to be done in a constant memory footprint. Ideally, your team adopts the view that while slurping whole files into memory might be fine in toy examples, starting with a constant-memory-footprint goal eliminates a whole range of issues at the outset.
Your biggest problem then becomes getting everyone to see the value of this approach, because they will never have experienced the crashing machines, corrupted processing pipelines, sleepless nights, and missed deadlines that come when every step of the pipeline wants to read everything into memory.
The whole encrypt/upload job could be done here in a single pass over fixed-size chunks, reading the file linearly (I am not sure why the S3 bucket is not encrypted, or, if it is, what the advantage of the "double" encryption is). Incidentally, this would not even require much extra programming.
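To make that concrete, here is a minimal Go sketch of the single-pass approach under some assumptions: read the file linearly in fixed-size chunks, encrypt with AES-CTR as a stream, and hand the ciphertext to whatever writer does the upload. The cipher choice, the key handling, and the stand-in destination are all illustrative, not the original service's code; an S3 multipart uploader exposed as an io.Writer would slot in where os.Stdout is.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"io"
	"log"
	"os"
)

// encryptTo streams src through AES-CTR into dst in fixed-size chunks,
// so memory use stays constant regardless of file size.
func encryptTo(dst io.Writer, src io.Reader, key []byte) error {
	block, err := aes.NewCipher(key)
	if err != nil {
		return err
	}
	// Random IV, written ahead of the ciphertext so the reader can decrypt.
	iv := make([]byte, block.BlockSize())
	if _, err := rand.Read(iv); err != nil {
		return err
	}
	if _, err := dst.Write(iv); err != nil {
		return err
	}
	sw := cipher.StreamWriter{S: cipher.NewCTR(block, iv), W: dst}
	// io.CopyBuffer moves data in fixed-size chunks; 1 MiB here.
	buf := make([]byte, 1<<20)
	_, err = io.CopyBuffer(sw, src, buf)
	return err
}

func main() {
	f, err := os.Open("bigfile.bin") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	key := make([]byte, 32) // placeholder key; real code would load one securely
	// os.Stdout stands in for the upload writer in this sketch.
	if err := encryptTo(os.Stdout, f, key); err != nil {
		log.Fatal(err)
	}
}
```

Memory use stays at roughly one buffer's worth whether the file is 40 Gbytes or 100 Gbytes.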
The cache being full is exactly what you want. During a linear read, most of the reads will be satisfied from the cache thanks to readahead. And the OS will do a much better job than you of deciding how much of each file should remain in the cache.
Say you move this thing to a server with 128 GB memory. What happens when the service actually has to handle four uploads at the same time?
I have often thought that perhaps 50% of Go's success in the high-performance networking space has little to do with any of its headline features and more to do with the fact that it shipped with the io.Reader and io.Writer interfaces in the standard library from the get-go, and as a result the entire ecosystem tends to support working in terms of io.Reader and io.Writer. One of the joys of working with networking in Go is picking up some half-obscure library, like a JSON validator or an encryption/decryption library, and finding that it defaults to stream processing because it uses Reader/Writer correctly, with the byte-slice or string interface as just a convenience wrapper around the stream processing. Anyone who posts a string- or byte-slice-only library to a Go discovery mailing list or message board, where there isn't a good reason for it to take only those, will get as their first piece of feedback that they should convert the library to be based on Reader/Writer.
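For what that pattern looks like in practice, here is an illustrative sketch (the package and function names are made up, not a real library): the streaming io.Reader form is the real implementation, and the byte-slice form is a thin convenience wrapper over it.

```go
package jsoncheck

import (
	"bytes"
	"encoding/json"
	"io"
)

// Validate reads r as a stream of JSON tokens and reports the first
// syntax error it hits. Because it works on an io.Reader, arbitrarily
// large inputs never have to be held in memory.
func Validate(r io.Reader) error {
	dec := json.NewDecoder(r)
	for {
		if _, err := dec.Token(); err != nil {
			if err == io.EOF {
				return nil // reached the end without a syntax error
			}
			return err
		}
	}
}

// ValidateBytes is the convenience wrapper: the byte-slice form is a
// thin layer over the streaming implementation, not the other way around.
func ValidateBytes(b []byte) error {
	return Validate(bytes.NewReader(b))
}
```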
There is almost no technical reason any current popular language couldn't work this way. (C has some serious challenges because of its anemic memory management, but pretty much anything else can do this.) It comes down to the culture of the language community rather than the language itself, and a sort of inductive process: all previous N libraries worked on strings, so the person writing the N+1'th library also wrote it to work on strings. It is one of the problems I have when going back to Python for the sort of work I do... it only takes one library in a pipeline that works solely on strings to ruin the ability to stream process for the entire pipeline.
(Note this isn't praise of Go qua Go; again, almost any language is technically capable of pulling this off. It's that string-based libraries accumulate in a community, and because it only takes one string-based library in your stack to make stream programming impossible, they tend to "pollute" the community's library culture unless it starts from the beginning with stream processing in mind. Otherwise a language community ends up having to create a whole parallel library ecosystem based on stream processing, like Twisted used to be for Python, and that parallel ecosystem is never quite able to keep up with the main one. There are other language communities that do this successfully, I think, but there are certain communities where the language is perfectly capable of streaming yet libraries tend to be written against fully-manifested strings.)