
I've maintained some pretty big libraries inside Google at one time or another in the last 16 years. I believe our system is bad for both sides (library maintainers and users). Here are some of my frustrations:

* the One Version Rule means that library authors can't change a library "upstream": your changes must work now or they can't get in. This sets the bar for landing changes very high, so development moves at a snail's pace. The version control and build system we have makes it difficult to work inside a branch, so collaboration on experimental/new things between engineers is difficult.

* users can create any test they wish that exercises their integration with your library. They can depend on things you never promised (Hyrum's law, addressed in the book). Each of these tests becomes a promise you, the library maintainer, make to that user: we will never break that test, no matter how weird it is. This is another huge burden on library maintainers.

* the One Version Rule means that, as a user, I can't just pin my dependency on some software to version X. For example, if I depend on the Bigtable client, just give me version 1.4. I don't need the meager performance improvements in version 1.5, because I don't want to risk breaking my project. This means every roll-up release you make, every sync to HEAD you do, risks bringing in some bug from the bleeding-edge version of literally everything you depend on.



I worked on one of those infrastructure services (like Bigtable). It was a huge boon to my productivity that the one version rule existed.

If we had to maintain the past six months of releases, we would never get anything done. Since breakages in our client library were our problem, a months-old bug report meant we could just tell you to recompile rather than figure out how to backport a fix for you.

Hyrum's Law considerations got really weird, especially when people take them too far. I think this is based on the kernel idea of "don't break userspace," but its practical implications are nuts. Hyrum's law has killed infrastructure projects that could have made things a lot better, and has resulted in crazy test-only behavior being the norm.

One person on an adjacent team loved taking behaviors as promises (and he also had a reputation as one of the most prolific coders at G). We had to clean up his messes every time he relied on unspecified behavior. I pushed back on his nonsense a few times, particularly when he used an explicitly banned combination of configuration options that we forgot to check-fail on, but always lost. 1.5 SWEs on our team were full-time cleaning up after him.


You are a rare engineer who actually pushes back on nonsense. For that, I say "thank you". I feel that there are too few people at Google who will say "stop, that is going to create a problem in the future". Some people are very enthusiastic and well meaning and "productive" (high change count), but create a burden for their colleagues in their fervor to change things.


To the point in your last sentence, a few years ago I worked at a software company where a prolific code writer similarly tied up ~2 SWEs cleaning up his messes. He was a Perl programmer who was set loose on C# code. The broken ways the company tracked productivity meant the prolific writer rated as doing great, while the three SWEs who had to spend 2/3rds of their time fixing the messes he left behind were rated as unproductive.


How did the company track productivity? Why did the one get rated as doing great?


Features checked off as "working" when they pass a test/demo instead of passing code review?


Probably lines of new code?


Agree. I've maintained some widely used libraries at Google for over a decade (the larger one serving over 5e9 qps at peak) and I'm very grateful for the build-everything-from-head practice (and short build horizons). Yeah, it introduces a few small problems, but I think they are ~easy to deal with: e.g., protect new functionality with flags to control canary/rollout when needed; file bugs against (and, eventually, deliberately break) customers with unreasonable tests.

Just to pick on the last example, having customers with unreasonable tests _is_ a real problem. But letting clients pin down specific/older builds of their dependencies (your code) to deal with this doesn't solve the problem, just pushes it down the road, imo making things worse.

.. and these problems are absolutely worth having in return for the simplicity of ~only having to support head (and, in some cases, like client LB policies, just a relatively short build horizon).


Any ideas on mitigating, kneecapping "prolific coders"?

My observation has been the torrent-of-poo coder also demands quick (pro forma ~~performa~~) code reviews while nitpicking and concern-trolling other people's PRs. Not sure what to call this? Gatekeeping, control freak, KPI hacking, passive-aggressive, or maybe just being an asshole.


It's a culture thing IMO. I used to be an electrical engineer, which has a strong culture of "everything I don't specify is a black box that I am free to change" (since the core products of EEs literally are black boxes with some wires sticking out the bottom). I was shocked that programmers don't think the same way. Also, the super-coders having bad attitudes on PRs matched my experience, and is incredibly toxic.

I think part of it is that programmers don't like to write a lot of documentation, but infrastructure services have very long lives and naturally build up a lot of documentation. If you write down a lot of promises, you also have documentation about what you don't promise. Unfortunately, the culture of G was that docs don't get updated and you promise all observable behavior.


This is always the result of someone looking at productivity as zero-sum; some management practices encourage zero-sum behavior. If management doesn't actively incentivize non-zero-sum behavior, then zero-sum behavior rules. Consider the following list of highly effective zero-sum behaviors.

Effective Zero Sum behaviors that management should avoid:

0) I can ship faster than all of my team mates if I nitpick their CRs and get mine quickly approved by our easier reviewers.

1) I can make sure my CR reviews are always faster if I am the nitpicky a-hole on every other CR

2) I can ship faster if I don't add as many tests as my peers. I'll promise to add them in a later CR and forget about it.

3) I can ship faster if I write software that is difficult for others to work on.

4) I can seem like a smarter engineer than everyone else by using jargon and writing difficult to understand code.

5) I can convince management to look at my X,Y,Z productivity metrics, while the team remains ignorant. (Code Review iterations is a dangerous example of this)

Big thing is to recognize signs of zero-sum behavior, and investigate whether it's being done intentionally or unintentionally. Even if it's intentional, it may simply be a bad reaction to the surrounding incentive model of the organization.


Agreed.

I think about Eli Goldratt's The Goal all the time.

Paraphrasing: A team (system) is only as fast as its slowest member (task); to speed up the team, focus on making the bottleneck faster.

Said another way, Goldratt also explains how a process that outpaces the others works to slow down the system overall. Local optimization leads to global suboptimization.

Pretty basic queueing theory stuff. This aspect is also known as the Theory of Constraints.

Anyhoo. I don't buy most "10x programmer" tales. I just wonder who's on the receiving end of such awesomeness. And at what cost.


Holy shit, I just got flashbacks from the dude on my team who used to do that asymmetric code review warfare. Hardly saw his PRs because he had one person on the team insta-stamp them. At the same time he’d happy-glad you to death with blocking change requests. Said requests would trickle in over the course of a day or two because he didn’t just sit down and do a review; he’d graze on it.


> At the same time he’d happy-glad you to death

Can you explain what “happy-glad” means here please?


I think you meant "pro forma" not "performa".


Thank you for saying this. I bought the physical copy of the book a few years ago, before I was a heavy user of Google OSS projects. I cracked it open after I was shocked by how bad many of the practices seemed to be... and it's all working as intended? The bit on docs is confusing: Google's docs are some of the most poorly organized, least easily referenced docs I've ever seen. [0]

Skip forward a few years and my projects are full of workarounds for years old Google bugs. It feels like fixing basic functionality just isn't a priority. Most of them are literally labeled "NOT A PRIORITY".

[0]: You can read Scrapy's docs, and the docs for most major Python libraries, from beginning to end and just "know" how to use them (https://docs.scrapy.org/_/downloads/en/latest/pdf/). With Google docs you have to piece together fragments of information connected by a complex web of "[barely] Related Topics".


Google overcompensates for a number of practices that don't scale by throwing ungodly computational power at them. I don't know how much CPU forge/TAP consumes these days, but I remember when it was at least 90K cores in a single cluster. It's insane to me that hundreds of thousands of giga-brains are pinned 100% 24/7 to dynamically check literally trillions of things because the combinatorial space was too hard to factor.

This is not to disparage the people who built those systems, but there is only so much concrete you can put in a rocket ship.


I'd agree with you that this would be bad for most companies I've worked at barring arguably one (Amazon, which is also gigantic) -- at any startup/medium sized company, these practices would not be right-sized to the work being done.

Even during my tenure at AWS inside Amazon, I have difficulty believing this would be useful; for example, at AWS, we'd separate services by tiers based on how close to the center they were inside the internal service graph. Running EC2/S3 or another tier 1 service? Yes, you're probably going to index on moving a bit slower to reduce operational risk than not. Running a new AWS service that is more of a "leaf node" than a center node? You can go ahead and move fairly quickly so long as you're obeying AWS best practices, which while somewhat hefty by startup standards are quite relaxed by other corporate standards.

What I wonder is whether this kind of heterogeneity would have been a better path for Google than what you describe. Or, is it the case that the sheer monolithic scale of search/ads at Google is such that it just wouldn't make sense, and that continuing to pile incremental resources into the monstrosity (and I mean this gently/positively) of search is what the company must do and so what the engineering culture must enable.

But, as you might be alluding to, perhaps the current approach doesn't even suit the needs of the company and is purely bad even for Google's specific problems -- in that case, is it simply there due to cultural cruft/legacy? I haven't worked at Google before, so it's hard for me to say something based on my experience with it.


I thought Google did this versioning thing for libraries before, but it was stopped for reasonable reasons (g3 components).

Basically, if you could pin lib versions, everyone would be stuck on old versions for a long time, causing difficult debugging work for each user of the library. You'd then also have all sorts of diamond problems: what if you want the newest absl but an older bigtable client?

It's a difficult problem no matter which way you go.


This is inherently contradictory with the trunk-based development model.

I get the feeling that you need to pin to v1.4, but ideally, by being at trunk head at all times, you force everyone (especially the library owners, and yourself, by writing tests around your wrapper) to do things properly, such as having enough tests in place. Otherwise, you dig yourself a grave for when the time comes to migrate from v1.4 to v1.7, and it becomes grunt work that nobody wants to take on.


Re: the one version rule

On the other hand, my users can pin versions, and we maintain a longer LTS window for those features. To this day, that LTS window has never been exercised because we end up having to build backwards compatibility into everything we do. The backwards compatibility promise also means our testing is extremely verbose.


>The version control and build system we have makes it difficult to work inside a branch

nit: use fig


Not sure how much we should discuss in a public forum, but while you may be technically correct, I don't think this is the right solution here.

Long-running shared Git branches are a useful tool for large scale changes and integrations. They're not ideal in small teams, but unavoidable at some level and useful if done well. Fig isn't doing that.


I use fig, it doesn't solve the problem I'm talking about.



