"What do I want? [...] Shared objects across [micro-]processes with copy-on-write; then you can efficiently share objects (like modules!) across concurrent processes without the danger of shared state, but without the overhead of copying everything you want to share."
Is the overhead of modules really an issue on servers with gigabytes of RAM? More importantly, almost every multi-threaded string library (like the C++ STL's) has moved away from sharing across threads and COW because the cost of the atomic reference-count operations is too high. See Herb Sutter's "Optimizations That Aren't (In a Multithreaded World)": http://www.gotw.ca/publications/optimizations.htm
Also, to say that speed is "really uninteresting" but latency is important seems like a contradiction. Latency is absolutely gated by speed for non-parallelizable problems. And even when a problem can be parallelized, raising raw speed is usually a simpler way to decrease latency than parallelizing.
I guess this reads to me more like a list of theoretically interesting ideas than a set of features that will actually help anyone in the real world.
RAM seems relatively constrained to me -- at least, running a multiprocess server in less than a few GB of RAM is hard to do (and memory seems to be the most expensive part of servers). And when it goes wrong (you hit whatever your limit is), things tend to fail in less than ideal ways. Tools for tracking memory usage are also quite poor, so while performance gets optimized, memory seldom does. And a culture of cavalier memory usage doesn't help either -- too many people are trading memory for performance, compounded in the case of an application that is typically an aggregation of many people's work.
The amount of code involved in systems also keeps going up, so the amount of memory you use before you've done any real work -- just loading all the code -- is constantly rising. In the world of static/compiled languages this might not be as notable, since there's a clear distinction between "code" and "data" and the code segment is shared between processes. Python has no such distinction, so if you don't share anything explicitly, nothing gets shared -- not even the code.
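To make that concrete, here's a rough sketch (the module choices are arbitrary stand-ins for a real application's imports) of measuring how much heap a process allocates just by importing code, using the stdlib tracemalloc module:

```python
import tracemalloc

# In Python there is no static "code segment": importing a module
# allocates ordinary heap objects (code objects, dicts, strings),
# so every process pays for its own private copy of the code.
tracemalloc.start()

import json, email, xml.dom.minidom  # arbitrary stand-ins for an app's imports

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"heap allocated just by importing: {current / 1024:.0f} KiB")
```

In a real application that pulls in hundreds of modules, this baseline can reach tens of MB per process before a single request is served.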
WRT speed-vs-latency -- definitely related, but most benchmarking specifically removes latency from the picture and tests throughput instead. E.g., tests frequently throw away the slowest run, even though abnormally slow runs happen in real programs and have real effects. (Of course they aren't predictable and might be affected by other things on the system -- which creates a kind of bias against optimizing things that are hard to measure.) But latency is mostly a matter of simplicity of implementation, so no, I wouldn't expect parallelizing to help much.
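For example (a rough sketch -- the percentile choice and run count are arbitrary), the difference between the usual throughput-style report and a latency-oriented one is just which end of the distribution you keep:

```python
import time

def measure(func, runs=1000):
    """Time each run individually instead of averaging a batch."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        func()
        times.append(time.perf_counter() - t0)
    return times

def summarize(times):
    times = sorted(times)
    return {
        "best": times[0],                      # what throughput-style benchmarks report
        "median": times[len(times) // 2],
        "p99": times[int(len(times) * 0.99)],  # what your slowest requests feel like
    }
```

On a loaded machine, "best" and "p99" routinely differ by an order of magnitude -- and throwing away the slow runs hides exactly that.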
I guess I can't disagree with your specific points about memory, but OS pages are already copy-on-write after a fork() -- if you're really concerned about that kind of overhead, you could always fork() individual OS processes from one that has already loaded everything.
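A rough sketch of that pattern (POSIX-only, via os.fork(); the worker function is hypothetical):

```python
import json   # stand-in for a large set of expensive imports
import os

def worker(n):
    # Reads the parent's already-imported modules; the OS only
    # copies a page once this process actually writes to it.
    return json.dumps({"worker": n})

# Parent loads everything once, then forks; children share the
# parent's pages copy-on-write.
pids = []
for n in range(4):
    pid = os.fork()
    if pid == 0:      # child process
        worker(n)
        os._exit(0)   # exit the child immediately, skipping cleanup handlers
    pids.append(pid)

for pid in pids:
    os.waitpid(pid, 0)
```

One caveat: CPython's reference counting writes into object headers even when you only read an object, so forked children still end up dirtying (and copying) many of the "shared" pages over time -- which is part of why plain COW doesn't fully solve this.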