Facebook engineer here, working on this problem with Joshua.
What this comes down to is that git uses a lot of essentially O(n) data structures, and when n gets big, that can be painful.
A few examples:
* There's no secondary index from file or path name to commit hash. This is what slows down operations like "git blame": they have to search every commit to see if it touched a file.
* Since git uses lstat to see if files have been changed, the sheer number of system calls on a large filesystem becomes an issue. If the dentry and inode caches aren't warm, you spend a ton of time waiting on disk I/O. (There's a rough sketch of what that stat sweep looks like after this list.)
An inotify daemon could help, but it's not perfect: it needs a long time to warm up in the case of a reboot or crash. Also, inotify is an incredibly tricky interface to use efficiently and reliably. (I wrote the inotify support in Mercurial, FWIW.)
* The index is also a performance problem. On a big repo, it's 100MB+ in size (hence expensive to read), and the whole thing is rewritten from scratch any time it needs to be touched (e.g. a single file's stat entry goes stale).
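To make the lstat point concrete, here's a rough sketch (Python, and emphatically not git's actual code) of the kind of full-tree stat sweep "git status" has to perform; cached_stats stands in for the stat data git keeps in its index:

    # Rough sketch, not git's implementation: one lstat per tracked file.
    # With a million files that's a million syscalls, and a cold dentry/inode
    # cache turns each one into disk I/O.
    import os

    def find_dirty(root, cached_stats):
        """cached_stats: dict of relative path -> (mtime, size) from the index."""
        dirty = []
        for dirpath, dirnames, filenames in os.walk(root):
            if ".git" in dirnames:
                dirnames.remove(".git")
            for name in filenames:
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root)
                st = os.lstat(full)  # one syscall per file
                if cached_stats.get(rel) != (st.st_mtime, st.st_size):
                    dirty.append(rel)
        return dirty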
None of these problems is insurmountable, but neither is any of them amenable to an easy solution. (And no, "split up the tree" is not an easy solution.)
An inotify daemon could help, but it's not perfect: it needs a long time to warm up in the case of a reboot or crash
So does, presumably, the cache when you use lstat. (Scratch "presumably": it does. Bonus points if you can't use Linux and are stuck with an OS that seems to drop its caches as soon as it possibly can.)
I hope I'm wrong, but the proper solution to this seems to be a custom file system: not only would it let you obtain a "modified since" list of files more easily, it would also let you fetch files locally only "on demand". (E.g. http://google-engtools.blogspot.com/2011/06/build-in-cloud-a...)
That still doesn't solve the data structure issues in git, but at least it takes some of the insane amount of I/O off the table.
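To be clear about the "on demand" part, here's a toy sketch of the shape of that idea: a read-only FUSE filesystem (using the fusepy bindings) that materializes file contents from a backing store only when they're read. The fetch_blob function and the in-memory BLOBS dict are made up for illustration; a real system would talk to a content-addressed server and cache locally.

    # Toy sketch only: serve file contents on demand through FUSE (fusepy).
    # BLOBS/fetch_blob are stand-ins for a remote content store.
    import errno, stat
    from fuse import FUSE, Operations, FuseOSError

    BLOBS = {"/hello.txt": b"fetched on first read\n"}

    def fetch_blob(path):
        try:
            return BLOBS[path]  # real version: remote fetch plus local cache
        except KeyError:
            raise FuseOSError(errno.ENOENT)

    class OnDemandFS(Operations):
        def getattr(self, path, fh=None):
            if path == "/":
                return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
            data = fetch_blob(path)
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1, st_size=len(data))

        def readdir(self, path, fh):
            return [".", ".."] + [p.lstrip("/") for p in BLOBS]

        def read(self, path, size, offset, fh):
            return fetch_blob(path)[offset:offset + size]

    if __name__ == "__main__":
        FUSE(OnDemandFS(), "/mnt/ondemand", foreground=True, ro=True)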
I'm looking forward to seeing what you guys cook up :)
So for your first item, it seems like it should be possible to add a (mostly immutable) cache file doing the job of Mercurial's files field in changesets, right? I.e. for each commit, list the files changed. Should be more efficient than searching through trees/manifests for changed files, at least.
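Something along those lines can even be prototyped outside git with plumbing commands; a toy sketch (git diff-tree lists the paths a commit touched, git rev-list enumerates commits):

    # Toy sketch of a per-commit "files changed" cache built from git plumbing.
    # A real version would live inside git and be updated incrementally at
    # commit time rather than rebuilt like this.
    import subprocess

    def changed_files(commit):
        out = subprocess.check_output(
            ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", commit],
            text=True)
        return out.splitlines()

    def build_cache(rev_range="HEAD"):
        commits = subprocess.check_output(
            ["git", "rev-list", rev_range], text=True).splitlines()
        return {c: changed_files(c) for c in commits}

Invert that mapping (path -> commits) and a blame-style query no longer has to look at every commit.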
For large (in files) trees, it seems like there's no easy solution, except for developing some kind of subtree support. However, that's similar to just splitting up the repository (along the lines of hg subrepo support), in the sense that now you have no real verification that non-checked-out parts of the tree will work with the changes in the part you do have checked out.
Still, the inotify daemon seems like it could alleviate things a bunch; particularly if the repository is on a server anyway, i.e. it's not rebooted that often.
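For what it's worth, the dirty-set idea behind such a daemon is simple enough to sketch; this uses the cross-platform watchdog library rather than raw inotify (which, as noted above, is much trickier to get right):

    # Minimal sketch of a watcher daemon keeping a set of paths dirtied since
    # startup; a status query then only needs to lstat those paths instead of
    # the whole tree. Uses the watchdog library, not raw inotify.
    import threading
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    class DirtyTracker(FileSystemEventHandler):
        def __init__(self):
            self.lock = threading.Lock()
            self.dirty = set()

        def on_any_event(self, event):
            if not event.is_directory:
                with self.lock:
                    self.dirty.add(event.src_path)

        def drain(self):
            with self.lock:
                paths, self.dirty = self.dirty, set()
            return paths

    tracker = DirtyTracker()
    observer = Observer()
    observer.schedule(tracker, "/path/to/repo", recursive=True)
    observer.start()
    # a "status" query now asks for tracker.drain() instead of walking the tree

Of course the hard parts (overflow, crashes, the warm-up window after a reboot) are exactly what the parent comment warns about, so treat this as the easy 10%.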
Out of curiosity, why do these benchmarks use regular disk and flash? At only 15 GB, what happens on a RAM disk? Sure, SSD is fast, but for this kind of workload it's still really slow.
Sorry for asking the obvious, but do you really need such a huge amount of data to keep development productive? How often do you use history that is several years old? Could you not archive it?
Or is the sheer number of files the problem, even ignoring history?
This is not a "git is perfect, fix your workflow" post, but I'm genuinely interested in what you have to say. Also, it seems like making git faster is an increasingly difficult task, given the amount of effort that has already been put into it.
Do you think adapting git to use, say, LevelDB and letting that do its job with incremental updates and maintaining a secondary index by path could help?
And that might dovetail nicely with an inotify daemon?
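Concretely, I'm imagining something like this (a toy sketch using the plyvel LevelDB bindings; the key layout is just one possibility):

    # Toy sketch of a path -> commits secondary index in LevelDB (via plyvel).
    # Keys are "<path>\0<commit>", so every commit touching a path is a single
    # prefix scan instead of a walk over the whole history. (Results come back
    # in key order, not history order; a real index would store more.)
    import plyvel

    db = plyvel.DB("/tmp/path-index", create_if_missing=True)

    def index_commit(commit, paths):
        with db.write_batch() as batch:
            for path in paths:
                batch.put(path.encode() + b"\0" + commit.encode(), b"")

    def commits_touching(path):
        prefix = path.encode() + b"\0"
        return [key[len(prefix):].decode() for key, _ in db.iterator(prefix=prefix)]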
You seem to know enough about this problem to solve it given enough time.
Why doesn't Facebook solve this git-on-huge-repos problem and put out a patch for others to see? Oh, right, you want somebody else to solve the problem for you, for free!