> It seems like the main thing that bun does to stay ahead is cache the manifest responses. PNPM, for example, resolves all package versions when installing (without a lockfile), which is slower.
This isn't the main optimization. The main optimization is the system calls used to copy/link files. To see the difference, compare `bun install --backend=copyfile` with `bun install --backend=hardlink` (hardlink should be the default). The other big optimization is the binary formats for both the lockfile and the manifest. npm clients waste a lot of time parsing JSON.
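To make the copy-vs-hardlink difference concrete, here's a toy sketch (not Bun's implementation): copying a cached package rewrites every byte, while hardlinking only creates a new directory entry pointing at the same inode.

```python
import os
import shutil
import tempfile

cache = tempfile.mkdtemp()
src = os.path.join(cache, "package.tgz")
with open(src, "wb") as f:
    f.write(b"x" * 1024 * 1024)  # pretend this is a cached tarball

copied = os.path.join(cache, "copied.tgz")
shutil.copyfile(src, copied)  # reads and rewrites every byte

linked = os.path.join(cache, "linked.tgz")
os.link(src, linked)  # a single link(2) call; no file data is moved

# The hardlink shares the source's inode; the copy gets its own.
same_inode_as_link = os.stat(src).st_ino == os.stat(linked).st_ino
same_inode_as_copy = os.stat(src).st_ino == os.stat(copied).st_ino
```

The cost of the copy scales with package size; the cost of the hardlink doesn't, which is why the backend choice dominates install time.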
The more minor optimizations have to do with reducing memory usage. The binary lockfile format interns strings (they're very repetitive). However, many of these strings are tiny, so it's actually more expensive to store a hash and a length separately from the string itself. Instead, Bun stores the string as 8 bytes, and one bit says whether the entire string is contained inside those 8 bytes or whether it's an offset into the lockfile's string buffer (this works because 64-bit pointers don't use the full 64-bit address space, and bun currently only targets 64-bit CPUs).
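A toy sketch of that tagging idea (the bit layout here is invented for the example and doesn't match Bun's real format): the high bit of a 64-bit word says whether the string is stored inline or as a (length, offset) pair into a shared buffer.

```python
INLINE_FLAG = 1 << 63  # high bit set = string lives inside the word itself

def intern(s: str, buf: bytearray) -> int:
    """Pack a string into one 64-bit word (toy layout, invented for this example)."""
    data = s.encode()
    if len(data) <= 7:
        # Inline: length in bits 56..58, the bytes themselves in the low 56 bits.
        return INLINE_FLAG | (len(data) << 56) | int.from_bytes(data.ljust(7, b"\0"), "little")
    off = len(buf)
    buf += data  # long strings go into the shared string buffer
    return (len(data) << 32) | off  # high bit clear => (length, offset) pair

def resolve(word: int, buf: bytearray) -> str:
    if word & INLINE_FLAG:
        n = (word >> 56) & 0x7F
        return (word & ((1 << 56) - 1)).to_bytes(7, "little")[:n].decode()
    off, n = word & 0xFFFFFFFF, word >> 32
    return buf[off:off + n].decode()

buf = bytearray()
short = intern("react", buf)
long_ = intern("@babel/plugin-transform-runtime", buf)
```

Short names like `react` never touch the buffer at all, so the common case costs one word and zero pointer chasing.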
yarn also caches the manifest responses.
> If I install a simple nextjs app, then remove node_modules, the lockfile, and the ~/.bun/install/cache/.npm files (i.e. keep the contents, remove the manifests) and then install, bun takes around ~3-4s. PNPM is consistently faster for me at around ~2-3s.
This sounds like a concurrency bug with scheduling tasks from the main thread to the HTTP thread. I would love someone to help review the code for the thread pool & async io.
> One piece of feedback, having the lockfile be binary is a HUGE turn off for me. Impossible to diff. Is there another format?
If you do `bun install -y`, it will output as a yarn v1 lockfile.
Of course, I can't say for sure that he looked at the fastest possible way to parse json here, but my intuition would be that if he didn't, it's because he had an educated guess that it'd still be slower.
You don't need to go straight to simdjson et al; something like Rust's serde, which deserializes to typed structs with data like strings borrowed from the input, can be very fast.
Nobody is arguing that JSON is as performant as binary formats. What the others are saying is that the amount of JSON in your average lock file should be small enough that parsing it is negligible.
If you were dealing with a multi-gigabyte lock file then it would be a different matter but frankly I agree with their point that parsing a lock file which is only a few KB shouldn’t be a differentiator (and if it is, then the JSON parser is the issue, and fixing that should be the priority rather than changing to a binary format).
Moreover the earlier comment about lock files needing to be human readable is correct. Being able to read, diff and edit them is absolutely a feature worth preserving even if it costs you a fraction of a second in execution time.
> I agree with their point that parsing a lock file which is only a few KB
You mean a few MB? NPM projects typically have thousands of dependencies. A 10MB lock file wouldn't be atypical and parse time for a 10MB JSON file can absolutely be significant. Especially if you have to do it multiple times.
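As a rough sanity check (fully synthetic data; the structure below is invented, not a real lockfile format), a lockfile-sized JSON blob is measurable work to parse:

```python
import json
import time

# Build a synthetic lockfile-shaped document of thousands of entries,
# just to reach a realistic size of a few MB.
entries = {
    f"node_modules/pkg-{i}": {
        "version": "1.0.0",
        "resolved": f"https://registry.example/pkg-{i}/-/pkg-{i}-1.0.0.tgz",
        "integrity": "sha512-" + "a" * 64,
    }
    for i in range(20_000)
}
blob = json.dumps({"packages": entries})
size_mb = len(blob) / 1e6

start = time.perf_counter()
parsed = json.loads(blob)
elapsed = time.perf_counter() - start
```

Actual numbers depend on the parser and machine, but the point stands: at multi-MB sizes, parse time is no longer free, especially if it happens on every invocation.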
> Being able to read, diff and edit them is absolutely a feature worth preserving even if it costs you a fraction of a second in execution time.
You can read and edit a SQLite file way easier than a huge JSON file.
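A minimal illustration of that point using Python's stdlib `sqlite3` (the schema is invented for the example): once the data is in SQLite, "reading" it becomes a query instead of scrolling through a giant JSON object.

```python
import sqlite3

# Invented schema for illustration; nothing here matches a real lockfile.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE deps (name TEXT PRIMARY KEY, version TEXT)")
con.executemany(
    "INSERT INTO deps VALUES (?, ?)",
    [("react", "18.2.0"), ("next", "13.4.1"), ("left-pad", "1.3.0")],
)

# Filter and sort with SQL rather than eyeballing nested JSON.
rows = con.execute(
    "SELECT name, version FROM deps WHERE name LIKE 're%' ORDER BY name"
).fetchall()
```

The same applies to editing: an `UPDATE` statement is a more targeted tool than hand-editing a line buried in a 10MB JSON file.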
GitHub (disclosure: where I work) does respect some directives in a repo’s .gitattributes file. For example, you can use them to override language detection or mark files as generated or vendored to change diff presentation. You can also improve the diff hunk headers we generate by default by specifying e.g. `*.rb diff=ruby` (although come to think of it I don’t know why that’s necessary since we already know the filetype — I’ll look into it)
In principle there’s no reason we couldn’t extend our existing rich diff support used for diffing things like images to enhance the presentation of lockfile diffs. There’s not a huge benefit for text-based lock files, but for binary ones (if such a scheme were to take off) it would be a lot more useful.
Any way to use `.gitattributes` to specify that a file is _not_ generated? I work on a repo whose build/ directory contains build scripts, which is unfortunately excluded by default from GitHub's file search and quick-file selection (T).
Does this really work for jump to file? (We're not talking about language statistics or suppressing diffs on PRs, which is mostly what the linguist readme covers.)
> File finder results exclude some directories like build, log, tmp, and vendor. To search for files within these directories, use the filename code search qualifier.
(The inability of quick jumping to files from /build/ folder with `T` has been driving me crazy for YEARS!)
Correct me if I'm wrong, but checking those two files:
I don't see `/build` matching anything there. So this `/build` suppression from search results seems to be controlled by some other piece of software at GitHub :/
I checked and you're right: The endpoint that returns the file list has a hardcoded set of excludes and pays no attention to `.gitattributes`.
I think it's reasonable to respect the linguist overrides here so I'll open a PR to remove entries from the exclude if the repo has a `-linguist-generated` or `-linguist-vendored` gitattribute for that directory [1]. So in your case you can add
    build/** -linguist-generated
to `.gitattributes` and once my PR lands files under `build` should be findable in file-finder.
Thanks for pointing this out! Feel free to DM me on twitter (@cbrasic) if you have more questions.
On Linux, not yet. I don't have a machine that supports reflinks right now and I am hesitant to push code for this without manually testing it works. That being said, it does use copy_file_range if --backend=copyfile, which can use reflinks.
Still don't understand why we even need all these inodes. The repo is centrally accessible (and should be read-only, btw). Resolving that shouldn't be a problem. It's been more than a decade and npm is still a mess.
If you add this to your .gitattributes:
It will print the diff as a yarn lockfile.
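For reference, this is a sketch of the setup Bun's documentation describes for diffing the lockfile (worth double-checking against the current docs):

```shell
# .gitattributes — mark the lockfile as binary and route it through a
# custom diff driver named "lockb":
#
#   *.lockb binary diff=lockb
#
# Then tell git how to render it; bun itself acts as the textconv tool,
# printing the lockfile as yarn-v1-format text:
git config diff.lockb.textconv bun
git config diff.lockb.binary true
```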