
Huh, fascinating. Git was initially created for Linux kernel development, and I haven't heard of any issues there. Offhand I would have said that, as a codebase, the Linux kernel would be larger and more complex than Facebook's, but I don't have a great sense of everything involved in both cases.

So what's the story here: do kernel developers put up with longer git times, is the kernel better organized, is Facebook's scope even more massive than the Linux kernel's, or is there some inherent design in git that works better for kernel work than web work?



The Linux kernel has an order of magnitude fewer files than Facebook's test repository (25,000 in version 2.6.27, according to http://www.schoenitzer.de/lks/lks_en.html) and only 9 million lines of text.

This is on the largish side for a single project, but if Facebook likes to keep all their work in a single repo, then it isn't too difficult to go way beyond those stats. Think of keeping all GNU projects in a single repo.


It isn't surprising if Facebook has a large, highly coupled code base. Given their reputation for tight timelines and maverick-advocacy, I'm continually surprised the thing works at all.


I wouldn't say that a large repository implies that the code is highly coupled. There are advantages to keeping certain code together in a single repo: being able to easily discover users of a library's functions, being able to "transactionally" commit an update to a library (or set of libraries) together with the code that uses it, being able to do code review over changes to code in various places, being able to discover whether someone else has solved a problem before, and so forth. If you only have your project and its libraries checked out, you don't serendipitously discover things in other projects.
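
For example, finding every caller of a library function across the whole repo is a one-liner with git grep (the function name and pathspecs below are made up for illustration):

    # list every call site of a hypothetical helper, with line numbers
    git grep -n 'render_profile_card(' -- '*.php'

    # or restrict the search to particular top-level directories
    git grep -n 'render_profile_card(' -- lib/ www/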

As mentioned in this talk on how Facebook visualized interdependence between modules to drive down coupling (https://www.facebook.com/note.php?note_id=10150187460703920), there are at least 10k modules with clear dependency information in a front-end repo, and the situation is probably a lot better now that they have that information-dense representation to work from. (I don't work on the PHP/www side of things; I spend most of my time in back-end and operations repos.)


None of what you mention here precludes breaking up the code into many smaller repositories, and then having them all linked together in one super-repository.

Then tags at the super-repository level can record the exact state of all submodules.

It's not about not checking the other modules out; you could make that the standard behavior, sure. It's about having git manage reasonably sized blocks of the code base.
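
A minimal sketch of that setup with stock git submodules (the repository names and URLs are hypothetical):

    # build the super-repository by linking in the existing project repos
    git init super && cd super
    git submodule add git://example.com/libfoo.git libfoo
    git submodule add git://example.com/webapp.git webapp
    git commit -m "Add libfoo and webapp as submodules"

    # every super-repo commit pins an exact commit of each submodule,
    # so a tag here records the state of the entire code base
    git tag -a release-1 -m "Known-good combination of libfoo and webapp"

    # a fresh clone restores exactly those pinned versions
    git clone --recursive git://example.com/super.git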


Three big problems with a split-up codebase:

1) Instead of doing one large release every week (which Facebook does: http://www.facebook.com/video/video.php?v=10100259101684977), you now have dozens or hundreds of smaller releases, and a lot more heterogeneity to test for.

2) If you have interdependencies between modules you have to grapple with the "diamond dependency" problem. Say module A depends on modules B and C, and module B also depends on C, but B depends on C v2.0 while A depends on C v1.0. If they're all split across repositories, it's not possible to update a core API in an atomic commit (see the sketch after this list).

3) Now you rely on changes being merged "up" to the root and then merged "down" to your project. This is one of the reasons Vista was such a slow-motion train wreck: http://moishelettvin.blogspot.com/2006/11/windows-shutdown-c... -- kernel changes had to be merged up to the root, then down to the UI, requiring months of very slow iterations to get it right.
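
To make the atomicity point in 2) concrete: in a single repository, a core API change and every caller can land in one commit, whereas with split repositories the same change needs separate commits plus coordinated version bumps. The paths and names below are invented for illustration:

    # single repo: change C's API and fix both dependents in one atomic commit
    git checkout -b change-c-api
    # ...edit libs/C, then update the callers in modules/A and modules/B to match...
    git add libs/C modules/A modules/B
    git commit -m "C: change frobnicate() signature, update A and B callers"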


Keep in mind that the talk in question is about the web site (and some other stuff) going into production; as mentioned in the talk, that happens more than once a week, and the whole shebang can be pushed in some number of minutes (I forget the exact figure mentioned).

Back-end services have their own release schedules and times, and obviously are made to be highly backward compatible so that they don't need to be done in lock-step with the front-end.

I think you're right about the "diamond dependency" problem, but I think the merge-up and merge-down in Vista had more to do with having multiple independent branches in flight at the same time.


I disagree.

There are ALREADY interdependency issues if you're using git. Anyone could have changed anything before you did your last commit. If you pull and then push your changes without running tests, you're already risking breaking the build.

If a diamond dependency conflict came up, it shouldn't ever be committed to the TRUNK. Whoever made the change that causes a conflict should ideally discover that conflict before they submit it to TRUNK. But it's still not relevant to the decision to have one big repo or a hundred smaller ones, since the exact same problem can come up in either case if you fail to test before you submit.

With the proper workflow, having a lot of smaller trees can be functionally equivalent to having one massive tree. Checking in your branch to the TRUNK would require that you have the latest version of the TRUNK, and if you don't, then you'd need to pull the latest, just like how git works now. And updates to the TRUNK ARE atomic: Before you update TRUNK, it's pointing to the previously working version, and after you update, it's pointing to the one you just tested to make sure it still works.

And, just like now, a developer should therefore test to make sure the merge works after they've grabbed the latest.

It sounds like you're assuming that, if I update one of the submodules, it could break TRUNK? That can't happen without someone trying to commit it to TRUNK. Git remembers a specific commit for each submodule, and doesn't move forward to a new commit without being told to.
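
That pinning behavior is easy to demonstrate with plain git submodules (the submodule name below is hypothetical):

    # inside the super-repo: libfoo only advances when you explicitly say so
    cd libfoo
    git pull origin master   # new upstream commits land in the submodule checkout...
    cd ..
    git status               # ...but the super-repo merely reports "libfoo (new commits)"
    git add libfoo
    git commit -m "Bump libfoo to the new API"   # only now does TRUNK move forward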


I'm not sure this would work - if an operation needs to stat() every file in the repo (for example), then whether it's 10k files in 1 repo or 1k files each in 10 repos will probably be just as bad?


An operation in git only makes you stat() each file in the current repo -- so things like check-ins and other local operations would be done 100% within the current repo.

Any time you were pulling the entire repo tree, it could be slow, yes. But assuming people are only working in one or a small number of repos at once, you can imagine a workflow that didn't involve nearly so many operations on the entire tree.
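
For the curious, you can get a rough count of the stat traffic on Linux with strace; the exact stat variant git uses varies by version and platform, so treat this as a sketch:

    # tally the stat-family syscalls made by a single `git status`
    strace -f -c -e trace=stat,lstat,newfstatat git status > /dev/null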


Ah, I see. Yes, that might make some of those operations better. Other operations that are common in our workflow might still need to look through the whole super-repo - for example getting a list of all changes (staged or unstaged) in the repo and generating a code review based on that.

(I almost habitually run "git status" whenever I've task switched away from code for even a few seconds to make sure I know exactly what I've done, which would have to look over the whole super-repo as well.)

Thankfully we're still a long way from the times in that synthetic test - it's not something I notice at all, but I probably write less code than most engineers here.
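
If it ever did become a bottleneck, both of those operations can be scoped down; the paths below are placeholders:

    # only look at the part of the tree you're actually working in
    git status -- www/profile/
    git diff --stat HEAD -- www/profile/

    # or, in a super-repo of submodules, check each submodule separately
    git submodule foreach git status --short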


The FB mobile auth workflow breaks at least once a month.


From the sounds of it, Facebook has a really, really big ball of highly coupled code.


Not necessarily.

In the open-source world, when you want to change an API, you have to either add the change as a new API (leaving the existing API intact) or break backward compatibility and maintain parallel versions, gradually migrating users off of the old version.

Both of these options are a huge pain, and have a direct cost (a larger API surface, or parallel maintenance and migration efforts). When the API and all of its callers live in the same codebase, you have a much more attractive option: change the API and all callers in a single changelist. You've now cleaned up your API without incurring any of the costs of the two open-source options.

This is why it can be nice, even if you have a bunch of nicely structured components, to have all code in a single repository.


Sounds like PHP.

Oh wait, it is.


Not really relevant.


While it may not be, I do find it an everyday battle to keep my PHP well 'styled'. Sure, PHP is the first language I learnt, and I do use a framework, but sometimes a long method is easier in the short run than writing good model functions. And I have models, but mostly for the ORM, and they're all completely interconnected. PHP makes me lazy, fast.


That's pretty much the same in any language - if you don't have the discipline to keep your code in a good state in one, why would you suddenly gain that discipline in another?


It's for this type of thinking that I've been looking at Rails recently.


If downvoting me, I would appreciate at least if you explain yourself :)


Funny.


The Linux kernel is well over an order of magnitude smaller. They are talking about 1.3 million files totaling nearly 10GB for the working tree; my kernel checkout has 39 thousand files totaling 489MB.



