Hacker News

Did you consider emulating mmap yourselves?

  "Memory mapped files work by mapping the full file into a virtual address space and then using page faults to determine which chunks to load into physical memory. In essence it allows you to access the file as if you had read the whole thing into memory, without actually doing so."
I feel like this could be done in C++ directly, by maintaining an internal cache for each file that keeps track of which parts of the file are loaded and uses read() to load chunks on demand. Error handling would be a lot simpler (no signals, just a failed read()) and there would be less OS-specific code.


This is essentially how databases like PostgreSQL work, but it only avoids the syscall overhead. The OS is already caching the file, regardless of mmap, so using pread would likely have been enough for us.

It totally would have been simpler overall, but each incremental step we made was significantly less work than the refactoring required for pread.


> The OS is already caching the file

Not necessarily. With O_DIRECT, pread() doesn't put pages into page cache: it just DMAs them directly into your process. Using O_DIRECT and the process-private caching we've been discussing, sophisticated programs (like databases) can (and do!) implement their own "page cache" systems. And because databases have access pattern information that the generic kernel VM subsystem doesn't, such a database can frequently do a better job doing this caching on its own.


I might have undersold the performance advantage of writing your own cache, but let me reiterate the point I was trying to make: the reason we didn't consider doing so was that we weren't having a performance issue. Writing our own cache would have been strictly more work than just using pread, and would have accomplished the same thing.


Yeah. For your application, you did the right thing. I was speaking more abstractly.


> It totally would have been simpler overall, but each incremental step we made was significantly less work than the refactoring required for pread.

Question.

In 10 years, will you be saying this about the next incremental problem that you run into? If you think that's likely, then the next incremental problem is an excuse to do it right.


If it's less work to solve that problem than to refactor all the related code, and the impact on maintainability is minimal, likely yes. But considering the number of users we have and the current lack of any crashes relating to mmap, there are unlikely to be any unforeseen issues in the future.


Mmap is right, though. Pread would also be right. There's a tradeoff and the complexity argument would only win if they knew all this when they started.


Well, then you have to implement some kind of plan for efficient caching - an LRU scheme, for example, to prevent the cache from ballooning to unusable sizes - at which point you're reinventing the kernel page cache (poorly). mmap does have a big advantage here if you really need a lot of random accesses.


It’s easy enough to read a file in chunks, parsing out the information as you go. This limits memory use as long as you release each chunk when you no longer need it. The operating system can swap out memory as needed, even if you didn’t get the memory from mmap, so it’s irrelevant where you store the parsed data.

Unless you actually need to read the file multiple times (compared to looking at the parsed in-memory data multiple times), this should be fast enough.




