Facebook engineer here, working on this problem with Joshua.
What this comes down to is that git uses a lot of essentially O(n) data structures, and when n gets big, that can be painful.
A few examples:
* There's no secondary index from file or path name to commit hash. This is what slows down operations like "git blame": they have to search every commit to see if it touched a file.
* Since git uses lstat to see if files have been changed, the sheer number of system calls on a large filesystem becomes an issue. If the dentry and inode caches aren't warm, you spend a ton of time waiting on disk I/O. (There's a rough sketch of what that stat sweep looks like after this list.)
An inotify daemon could help, but it's not perfect: it needs a long time to warm up in the case of a reboot or crash. Also, inotify is an incredibly tricky interface to use efficiently and reliably. (I wrote the inotify support in Mercurial, FWIW.)
* The index is also a performance problem. On a big repo, it's 100MB+ in size (hence expensive to read), and the whole thing is rewritten from scratch any time it needs to be touched (e.g. a single file's stat entry goes stale).
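To make the lstat point concrete, here's a rough sketch (Python, and emphatically not git's actual code) of the kind of full-tree stat sweep "git status" has to perform; cached_stats stands in for the stat data git keeps in its index:

    # Rough sketch, not git's implementation: one lstat per tracked file.
    # With a million files that's a million syscalls, and a cold dentry/inode
    # cache turns each one into disk I/O.
    import os

    def find_dirty(root, cached_stats):
        """cached_stats: dict of relative path -> (mtime, size) from the index."""
        dirty = []
        for dirpath, dirnames, filenames in os.walk(root):
            if ".git" in dirnames:
                dirnames.remove(".git")
            for name in filenames:
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root)
                st = os.lstat(full)  # one syscall per file
                if cached_stats.get(rel) != (st.st_mtime, st.st_size):
                    dirty.append(rel)
        return dirty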
None of these problems is insurmountable, but neither is any of them amenable to an easy solution. (And no, "split up the tree" is not an easy solution.)
An inotify daemon could help, but it's not perfect: it needs a long time to warm up in the case of a reboot or crash
So does, presumably, the cache when you use lstat. (Scratch "presumably": it does. Bonus points if you can't use Linux and are stuck with an OS that seems to drop its caches as soon as it possibly can.)
I hope I'm wrong, but the proper solution to this seems to be a custom file system: not only would it let you obtain a "modified since" list of files more easily, it would also let you fetch files locally only "on demand". (E.g. http://google-engtools.blogspot.com/2011/06/build-in-cloud-a...)
That still doesn't solve the data structure issues in git, but at least it takes some of the insane amount of I/O off the table.
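To be clear about the "on demand" part, here's a toy sketch of the shape of that idea: a read-only FUSE filesystem (using the fusepy bindings) that materializes file contents from a backing store only when they're read. The fetch_blob function and the in-memory BLOBS dict are made up for illustration; a real system would talk to a content-addressed server and cache locally.

    # Toy sketch only: serve file contents on demand through FUSE (fusepy).
    # BLOBS/fetch_blob are stand-ins for a remote content store.
    import errno, stat
    from fuse import FUSE, Operations, FuseOSError

    BLOBS = {"/hello.txt": b"fetched on first read\n"}

    def fetch_blob(path):
        try:
            return BLOBS[path]  # real version: remote fetch plus local cache
        except KeyError:
            raise FuseOSError(errno.ENOENT)

    class OnDemandFS(Operations):
        def getattr(self, path, fh=None):
            if path == "/":
                return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
            data = fetch_blob(path)
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1, st_size=len(data))

        def readdir(self, path, fh):
            return [".", ".."] + [p.lstrip("/") for p in BLOBS]

        def read(self, path, size, offset, fh):
            return fetch_blob(path)[offset:offset + size]

    if __name__ == "__main__":
        FUSE(OnDemandFS(), "/mnt/ondemand", foreground=True, ro=True)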
I'm looking forward to seeing what you guys cook up :)
So for your first item, it seems like it should be possible to add a (mostly immutable) cache file doing the job of Mercurial's files field in changesets, right? I.e. for each commit, list the files changed. Should be more efficient than searching through trees/manifests for changed files, at least.
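Something along those lines can even be prototyped outside git with plumbing commands; a toy sketch (git diff-tree lists the paths a commit touched, git rev-list enumerates commits):

    # Toy sketch of a per-commit "files changed" cache built from git plumbing.
    # A real version would live inside git and be updated incrementally at
    # commit time rather than rebuilt like this.
    import subprocess

    def changed_files(commit):
        out = subprocess.check_output(
            ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", commit],
            text=True)
        return out.splitlines()

    def build_cache(rev_range="HEAD"):
        commits = subprocess.check_output(
            ["git", "rev-list", rev_range], text=True).splitlines()
        return {c: changed_files(c) for c in commits}

Invert that mapping (path -> commits) and a blame-style query no longer has to look at every commit.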
For large (in files) trees, it seems like there's no easy solution, except for developing some kind of subtree support. However, that's similar to just splitting up the repository (along the lines of hg subrepo support), in the sense that now you have no real verification that non-checked-out parts of the tree will work with the changes in the part you do have checked out.
Still, the inotify daemon seems like it could alleviate things a bunch; particularly if the repository is on a server anyway, i.e. it's not rebooted that often.
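For what it's worth, the dirty-set idea behind such a daemon is simple enough to sketch; this uses the cross-platform watchdog library rather than raw inotify (which, as noted above, is much trickier to get right):

    # Minimal sketch of a watcher daemon keeping a set of paths dirtied since
    # startup; a status query then only needs to lstat those paths instead of
    # the whole tree. Uses the watchdog library, not raw inotify.
    import threading
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    class DirtyTracker(FileSystemEventHandler):
        def __init__(self):
            self.lock = threading.Lock()
            self.dirty = set()

        def on_any_event(self, event):
            if not event.is_directory:
                with self.lock:
                    self.dirty.add(event.src_path)

        def drain(self):
            with self.lock:
                paths, self.dirty = self.dirty, set()
            return paths

    tracker = DirtyTracker()
    observer = Observer()
    observer.schedule(tracker, "/path/to/repo", recursive=True)
    observer.start()
    # a "status" query now asks for tracker.drain() instead of walking the tree

Of course the hard parts (overflow, crashes, the warm-up window after a reboot) are exactly what the parent comment warns about, so treat this as the easy 10%.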
Out of curiosity, why do these benchmarks use regular disk and flash? At only 15 GB, what happens on a RAM disk? Sure, SSD is fast, but for this kind of workload it's still really slow.
Sorry for asking the obvious, but do you really need such a huge amount of data to keep development productive? How often do you use history that is several years old? Could you not archive it?
Or is the sheer number of files the problem, even ignoring history?
This is not a "git is perfect, fix your workflow" post, but I'm genuinely interested in what you have to say. Also, it seems like making git faster is an increasingly difficult task, given the amount of effort that has already been put into it.
Do you think adapting git to use, say, LevelDB and letting that do its job with incremental updates and maintaining a secondary index by path could help?
And that might dovetail nicely with an inotify daemon?
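Concretely, I'm imagining something like this (a toy sketch using the plyvel LevelDB bindings; the key layout is just one possibility):

    # Toy sketch of a path -> commits secondary index in LevelDB (via plyvel).
    # Keys are "<path>\0<commit>", so every commit touching a path is a single
    # prefix scan instead of a walk over the whole history. (Results come back
    # in key order, not history order; a real index would store more.)
    import plyvel

    db = plyvel.DB("/tmp/path-index", create_if_missing=True)

    def index_commit(commit, paths):
        with db.write_batch() as batch:
            for path in paths:
                batch.put(path.encode() + b"\0" + commit.encode(), b"")

    def commits_touching(path):
        prefix = path.encode() + b"\0"
        return [key[len(prefix):].decode() for key, _ in db.iterator(prefix=prefix)]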
You seem to know enough about this problem to solve it given enough time.
Why doesn't Facebook solve this git-on-huge-repos problem and put out a patch for others to see? Oh, right, you want somebody else to solve the problem for you, for free!