Note that 0x3D is the "=" character in ASCII, so "=3D" in QP is "=" in ASCII. :)
This email has probably been through a few conversions to QP and back again between different email clients. Perhaps some buggy client got confused between an ASCII "=" and a QP escape sequence or something like that.
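You can see the round trip with Python's stdlib `quopri` module if you want to poke at it (quick sketch; the "buggy client" scenario here is my guess at what happened, not a diagnosis):

```python
import quopri

# "=" is 0x3D in ASCII, so quoted-printable escapes it as "=3D".
encoded = quopri.encodestring(b"a = b")
print(encoded)                        # the "=" comes out as "=3D"
print(quopri.decodestring(encoded))   # round-trips back to b'a = b'

# A buggy client that QP-encodes already-encoded text turns "=3D" into
# "=3D3D"; a single decode then leaves stray "=3D" litter in the message.
double = quopri.encodestring(encoded)
print(quopri.decodestring(double))    # still one decode short of the original
```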
Another FYI type comment I guess. :) Some of my more gripey statements here may be outdated info, so DYOR I guess.
Dynamic reconfig in 3.5 addresses the "restarting every zookeeper instance" problem. [0] You stand up an initial quorum with seed config, then tie in new servers with "reconfig -add". Not sure how well it would tie into cloudy autoscaling stuff though. I wouldn't start there myself.
A much bigger pain IMO is the handling of DNS in the official Java ZK client earlier than 3.4.13/3.5.5 (and by association, Curator, ZkClient, etc.). [1] The former was released mid 2018 and the latter this year, so tons of stuff out there that just won't find a host if IPs change. If you "own" all the clients it's maybe not a problem, but if you've got a lot of services owned by a ton of teams it's ... challenging.
Even with the fix for ZOOKEEPER-2184 in place I'm pretty sure DNS lookups are only retried if a connect fails, so there's still the issue of IPs "swapping" unexpectedly at the wrong time in cloud environments which can lead to a ZK server in cluster A talking to a ZK server in cluster B (or worse: clients of cluster A talking to cluster B mistakenly thinking that they're talking to cluster A). I'm sure this problem's not unique to ZK though.
Authentication helps prevent the worst-case scenarios, but I'm not sure if it helps from an uptime perspective.
TL;DR: ZK in the cloud can get messy (even if you play it relatively "safe").
Exactly this. Or at least it's a way this can be achieved, assuming solid testing & some tooling in the mix.
For folks unfamiliar with it, the issue is something like:
1. You find a bug in a library A.
2. Libraries B, C and D depend on A.
3. B, C and D in turn are used by various applications.
How do you fix a bug in A? Well, "normal" workflow would be something like: fix the bug in A, submit a PR, wait for a CI build, get the PR signed off, merge, wait for another CI build, cut a release of A. Bump versions in B, C and D, submit PRs, get them signed off, CI builds, cut a release of each. Now find all users of B, C and D, submit PRs, get them signed off, CI builds, cut more releases ...
Now imagine the same problem where dependency chains are a lot more than three levels deep. Then throw in a rat's nest of interdependencies so it's not some nice clean tree but some sprawling graph. Hundreds/thousands of repos owned by dozens/hundreds of teams.
See where this is going? A small change can take hours and hours just to land. Remember this pain applies to every change you might need to make in any shared dependency. Bug fixes become a headache. Large-scale refactors are right out. Every project pays for earlier bad decisions. And all this ignores version incompatibilities because folks don't stay on the latest & greatest versions of things. Productivity grinds to a halt.
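To put the fan-out in concrete terms: the set of releases a single fix forces is a reverse-dependency traversal of the graph. Toy sketch (all names hypothetical):

```python
from collections import deque

# Edges point from a library to the things that depend on it.
dependents = {
    "A": ["B", "C", "D"],
    "B": ["app1", "app2"],
    "C": ["app2", "app3"],
    "D": ["app4"],
}

def blast_radius(root):
    """Every project needing a version bump + release once `root` changes."""
    seen, queue = set(), deque([root])
    while queue:
        lib = queue.popleft()
        for dep in dependents.get(lib, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(seen)

print(blast_radius("A"))  # ['B', 'C', 'D', 'app1', 'app2', 'app3', 'app4']
```

Seven PR/review/CI/release cycles for a three-level toy graph; now scale the graph up and it's easy to see where the hours go.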
It's easy to think "oh, well that's just bad engineering", but there's more to it than that I think. It seems like most companies die young/small/simple & existing dependency management tooling doesn't really lend itself well to fast-paced internal change at scale.
So having run into this problem, folks like Google, Twitter, etc. use monorepos to help address some of this. Folks like Netflix stuck it out with the multi-repo thing, but lean on tooling [0] to automate some of the version bumping silliness. I think most companies that hit this problem just give up on sharing any meaningful amount of code & build silos at the organizational/process level. Each approach has its own pros & cons.
Again, it's easy to underestimate the pain when the company is young & able to move quickly. Once upon a time I was on the other side of this argument, arguing against a monorepo -- but now here I am effectively arguing the opposition's point. :)
> So having run into this problem, folks like Google, Twitter, etc. use monorepos to help address some of this.
I think you’re retroactively claiming that Google actively anticipated this in their original choice of Perforce as an SCM. They may believe that it’s still the best option for them, but as I understand it, to make it work they bought a license to the Perforce source code, forked it, and practically rewrote it.
My theory (I wonder if someone can confirm this) is that Google was at that point under pressure from team size and Perforce’s limitations. Ditching p4 for Git would have taken them in an entirely different direction. What would have happened in the Git space if they’d gone that way earlier? Fun to think about... but maybe Go would have had a package manager earlier ;)
> I think you’re retroactively claiming that Google actively anticipated this in their choice at the beginning of using Perforce as an SCM.
Oh I didn't mean to imply exactly that, but really good point. I just meant that it seems like folks don't typically _anticipate_ these issues so much as they're forced into it by ossifying velocity in the face of sudden wild success. I know at least a few examples of this happening -- but you're right, those folks were using Git.
In Google's case, maybe it's simply that their centralized VCS led them down a certain path, their tooling grew out of that & they came to realize some of the benefits along the way. I'd be interested to know too. :)
Maybe Google’s choice for monorepo was pure chance. However, on many occasions the choice was challenged and these kinds of arguments were (successfully) made in order for it to stay.
There's a subtler, and potentially more important thing that can crop up with your scenario:
Library A realises that its interface could be improved, but the change would not be backwards compatible. In the best-case scenario, with semver, there is a cost to this change. Users have to bump versions and rewrite code, and maybe the maintainer of Library A has to keep 2 versions of a function to ease the pain for users. It may just be that B, C and D trust A less because the interface keeps changing. All this can mean an unconscious pressure not to change and improve interfaces, and adds pain when they do.
Doing it in a monorepo can mean that the developers of A can just go around and fix all the calls if they want to make the change, allowing for greater freedom to fix issues with interfaces between modules. And that is really important in large complex systems with interdependent pieces.
> The answer would be no in general, I think, since it is unsafe.
> ...
> for example, the statement may include an rpc, and we don't want to make that rpc under the lock
I do agree that it's not a "safe" optimization in the extreme general case (so don't go rewriting your code assuming it's equivalent!), but in the case where the loop is a candidate for unrolling it works just fine. With a more CPU- or memory-bound workload, these benchmarks get a whole lot more interesting.
Put another way: if there's an RPC call in the for loop, the time spent in the RPC will dwarf the work involved in executing the loop itself, so ... odds are good it's not going to be a candidate for unrolling anyways. :)
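A rough Python stand-in for why the equivalence only holds for cheap, non-blocking loop bodies (toy names mine):

```python
import threading

lock = threading.Lock()
counter = 0

def per_iteration(n):
    # Acquire/release on every iteration: always safe, but pays the
    # lock overhead n times.
    global counter
    for _ in range(n):
        with lock:
            counter += 1

def hoisted(n):
    # One acquire for the whole loop: equivalent here only because the
    # body is trivial. If the body made an RPC, we'd be holding the lock
    # across every call -- exactly the hazard quoted above.
    global counter
    with lock:
        for _ in range(n):
            counter += 1

per_iteration(1000)
hoisted(1000)
print(counter)  # 2000 either way; only the locking cost differs
```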
If any Python devs are out there reading: my understanding is that removing the GIL itself isn't the hard part so much as removing the GIL while satisfying certain constraints deemed necessary by GvR and/or the rest of the community. I know some of those constraints relate to compatibility with existing C extensions -- but there must be others too?
The reason I ask is that Larry's buffered ref counting attempt surely has implications for single-threaded code that maybe relies on the existing semantics -- e.g. a program like this may no longer reliably print "Deallocated!":
Python 2.7.13 (default, Mar 5 2017, 00:33:10)
[GCC 6.3.0 20170205] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> class Foo(object):
...     def __del__(self):
...         print 'Deallocated!'
...
>>> foo = Foo()
>>> foo = None
Deallocated!
>>>
A bad example in some ways since in this particular case we could wait for all ref counting operations to be processed before letting the interpreter exit, but hopefully my point is still clear.
Similarly, what about multi-threaded Python code that isn't written to operate in a GIL-free environment -- absent locks, atomic reads/writes, etc.? At best, you might expect some bad results. At worst, segfaults.
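As a toy illustration (runnable under today's CPython): an unsynchronized read-modify-write can already lose updates with the GIL, since threads can be switched between the read and the write, and it only gets worse without it -- whereas explicitly locked code keeps working either way:

```python
import threading

unsafe = 0
safe = 0
lock = threading.Lock()

def bump_unsafe(n):
    global unsafe
    for _ in range(n):
        unsafe += 1      # read-modify-write: updates can be lost

def bump_safe(n):
    global safe
    for _ in range(n):
        with lock:
            safe += 1    # explicit synchronization survives GIL removal

threads = [threading.Thread(target=f, args=(50_000,))
           for f in (bump_unsafe, bump_safe) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(safe)    # always 200000
print(unsafe)  # may be less than 200000 -- and no guarantees at all sans GIL
```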
Are these all bridges that need to be crossed once a realistic solution to the core GIL removal issue is proposed? As glad as I am that folks are still thinking hard about this problem, I'm personally sort of pessimistic that the GIL can be killed off without a policy change wrt backward compatibility. Still, I do sort of wonder if some rules of engagement wrt departures from existing semantics might help drive a solution.
If I'm understanding you, some or all of these questions are explicitly addressed in the Q&A. My apologies if you got that far and I simply didn't understand you.
For example, your first question seems to be asking about whether there's a semantic change coming from a lack of immediacy in when __del__ will run. And the answer is explicitly "yes, and the docs already told you not to count on that".
As for multi-threaded Python code... and perhaps also multi-threaded C code in extensions... I think the clear answer is "yes, our whole goal is to remove some guarantees that were previously provided, so if you counted on those guarantees you're in trouble". Again, c.f. the Q&A in case that helps.
From the talk, it doesn't look to me like Larry Hastings has a plan for the policy change in question; so maybe "bridges that need to be crossed once [the technical issues are smaller]" is correct?
The big constraint (aside from backwards compatibility) is performance: Guido has indicated that he is unwilling to accept much (if any) slowdown of single-threaded code in order to remove the GIL. It's (relatively) easy to remove the GIL and replace it with a bunch of fine-grained locks (or atomic increments, etc), but doing so tends to slow things down. The challenge is in figuring out how to avoid synchronization overhead for common operations (mainly reference counts).
It's true that buffered refcounting probably means that `__del__` would no longer be called immediately as it is now, but I'm not sure if that's a requirement - pypy and jython don't do this either, and destructors are generally discouraged in favor of `with` blocks these days.
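For anyone unfamiliar with the `with`-block alternative: cleanup there is deterministic by construction rather than tied to refcount timing, so it's portable across CPython/pypy/jython. Minimal sketch:

```python
events = []

class Resource:
    def __enter__(self):
        events.append("acquired")
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs at block exit, guaranteed -- regardless of when (or whether)
        # the runtime ever finalizes the object.
        events.append("released")

with Resource():
    events.append("in use")

print(events)  # ['acquired', 'in use', 'released']
```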
In his talk at last year's pycon, Hastings said the three constraints GvR laid out are:
1. Can't degrade single-threaded performance
2. Can't break existing extensions
3. Can't make the implementation of cpython much more complicated (i.e., can't raise the barrier to entry to participating in the development of python)
All of these are pretty reasonable, if tough, targets to meet, and Hastings agrees with all of them. For 1 and 2 he was generally looking at making GIL-less cpython a compiled mode so that the default was the single threaded version, thus retaining compatibility and performance, but offering a true multi-threaded binary for those who would use it.
> The goal should be, and is kind of what Larry Hastings is looking for, is that any program should run 8 times faster on a 8-core CPU compared to a 1-core.
A program that's inherently single-threaded is unlikely to benefit from more CPUs. When you say "any program" here, you mean "any program with >=8 threads", right?
I mean that the developer (using a high-level language like Python) ideally should get more performance on an 8-core than on a 1-core CPU.
Erlang, which btw. is older than Python, will perform better the more cores you have due to its message-oriented nature. Python (2.7), on the other hand, performed worse with multithreading on multicore.
I was hoping that Python would take the same direction in the future, but unfortunately we are getting the async/await mess, instead of a simple async object model (sorry, my pet peeve)
> Servers are where the problem is. The GIL makes python functionally single-threaded, which is a bummer for your server at any kind of scale.
Right, agreed. I can imagine some of the frustration you might experience using CPython for high throughput systems: kind of like NodeJS without the benefits of a standard library written with async/non-blocking I/O in mind.
A bit curious about a few things you mention here, though:
> Or that a python process never really releases memory back to the system, just within itself, so the process slowly grows over the course of a few weeks.
I'm not sure this is true in general, is it? Can you elaborate? It's been a while since I've dug around in Python innards, but if Py_DECREF(x) leads to a refcount of zero IIRC free(x) is ultimately called -- albeit in an indirect manner via a layer or six of tp_dealloc calls and tp_free. :) I suppose calling free(x) may only return the memory associated with x to (g)libc's free list and not necessarily back to the OS [0]. No different to C/C++ in that regard, I guess.
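A quick way to watch refcount-driven deallocation happen (CPython-specific -- pypy/jython defer this to their GC):

```python
import weakref

class Foo:
    pass

f = Foo()
r = weakref.ref(f)   # weak ref doesn't keep the object alive
print(r() is f)      # True: object still alive
f = None             # last strong reference dropped
print(r())           # None: CPython freed it immediately, no GC cycle needed
```

Whether that `free()` ever makes it back to the OS is then down to pymalloc's arenas and the allocator below it, which is the part your "slowly grows for weeks" observation would hinge on.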
> I came to the conclusion that python is unsuitable for servers, but until Go came out, there wasn't a realistic alternative, since C++ and Java are too heavyweight, and Ruby suffers from similar problems (don't know about a GIL).
"Too heavyweight" in that they're relatively difficult to write in comparison? Maybe true of Java-the-language, but the JVM itself is an absolute workhorse when it comes to high performance. Plenty of languages to choose from there, typically without a GIL. Jython, for example, has no GIL [1].
Common Lisp implementations generally do multithread really well and give you lovely syntactic abstraction capabilities while also running significantly faster than comparably high-level languages.
I can't speak to the rest of Facebook's stuff, but I think the build tool problem is a special case. Per their docs:
> Buck is designed for building multiple deliverables from a single repository (a monorepo) rather than across multiple repositories. It has been Facebook's experience that maintaining dependencies in the same repository makes it easier to ensure that all developers have the correct version of all of the code, and simplifies the process of making atomic commits.
Having been on the "other side" of the monorepo argument where we tried to make do with improving/extending existing build tools etc. in a rapidly growing engineering org, let me say that Facebook (with Buck), Twitter (Pants? I think?) and Google (with Bazel/Blaze) almost certainly built these to deal with the problem of scaling build management with an ever-growing organization.
The popular model of a dozen or so small repositories in GitHub + Jenkins with Maven/NPM/Rake+Bundler/whatever works fine for maybe a few dozen engineers or more, but one day you wake up and realize there are hundreds of repositories spread across dozens of _teams_ and hundreds of developers. Obviously you've then got a big ol' dependency graph between repos to deal with, so if you need to fix something near the root suddenly you need to run off bumping version numbers and/or fixing intermediate libraries all the way down the graph. Plus version incompatibilities between the dependencies of different libraries ... it's a total mess, and it doesn't make for an org that can easily "move fast and break things", so to speak.
So then to avoid paralysis your options are basically either to silo up (this team owns their stuff, that team owns their stuff, don't bother with shared dependencies) or you go the monorepo route. If you do, then maybe you go and pull all your hundreds of smaller repos into a monorepo. Having everything in one place makes it easier to police the dependency issues within the org & makes it easier for a single engineer to deal with those sort of "cascading changes" instead of shunting that problem onto the entire organization. But in exchange for this "agility" you've then got the problem that builds take multiple hours & the associated tools are often highly language-centric (Maven+Java, NPM+Node, Ruby+Rake, etc.). They don't typically make any reproducibility guarantees either.
Anyway, to make a short story really long: at the time FB, Google and Twitter were hitting these organizational scaling walls, making these decisions and building these tools internally, there really weren't any great tools out there for the monorepo use case. I think that's why all these tools have appeared as side-by-side alternatives rather than improvements on one another or to tools like Maven et al.
Whether or not consolidation is warranted, for the folks who have the problems that Buck/Bazel/Pants solve, it's likely to save 'em a hell of a lot of time, effort and money IMO. It's a good thing that they have been published, even if the value's maybe not immediately obvious.
This. Also, I think that the build system itself is just the tip of the iceberg. At least in Google's case it has recently been very nicely documented [1] that blaze is "just" one piece of how Google keeps velocity high.
You could, for example, write a compiler in Python that generated native code (or LLVM bitcode to be passed to llc or whatever) via the LLVM API. Writing a compiler in Python vs. C/C++ would be a lot easier in a number of ways.
I'm not aware of any "production grade" compilers that do this, but no hard reason why not, I guess. Seems like it'd be nice for prototyping etc. if nothing else.
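The shape of the idea, as a pure-string toy (a real project would use actual bindings such as llvmlite rather than hand-building IR text; AST format and names here are made up):

```python
import itertools

def compile_expr(node, lines, counter):
    """Emit LLVM IR text for `node` into `lines`; return the SSA register
    holding its value."""
    kind = node[0]
    if kind == "const":
        reg = f"%{next(counter)}"
        lines.append(f"  {reg} = add i32 0, {node[1]}")  # materialize constant
        return reg
    if kind == "add":
        lhs = compile_expr(node[1], lines, counter)
        rhs = compile_expr(node[2], lines, counter)
        reg = f"%{next(counter)}"
        lines.append(f"  {reg} = add i32 {lhs}, {rhs}")
        return reg
    raise ValueError(f"unknown node kind: {kind!r}")

ast = ("add", ("const", 2), ("const", 3))   # i.e. 2 + 3
lines = ["define i32 @main() {"]
result = compile_expr(ast, lines, itertools.count(1))
lines.append(f"  ret i32 {result}")
lines.append("}")
print("\n".join(lines))  # textual IR you could hand to llc/opt
```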