It's a solved problem: the physics simply makes it really inefficient.
> ... we'd need a system 12.5 times bigger, i.e., roughly 531 square metres, or about 2.6 times the size of the relevant solar array. This is now going to be a very large satellite, dwarfing the ISS in area, all for the equivalent of three standard server racks on Earth.

(Source: https://taranis.ie/datacenters-in-space-are-a-terrible-horri...)
The gist of it is that about 99% of cooling on earth works by cold air molecules (or water) bumping into hot ones, and transferring heat. There's no air in space, so you need a radiator 99x larger than you would down here. That adds up real fast.
That’s the secret plan - cover LEO with solar cells and radiators, limit sunlight on the ground, render ground-based solar ineffective, cool the Earth and create more demand for heating; then sell expensive space electricity at a huge premium. Genius!
I think you may be thinking of cooling to habitable temperatures (20°C). You can run GPUs at 70°C, so radiative cooling density goes up steeply (radiated power scales with the fourth power of absolute temperature). You should need about 1/3 of the array in radiators.
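A rough back-of-the-envelope for the temperature effect (my numbers, assuming ideal blackbody radiators and the Stefan-Boltzmann law):

```latex
\frac{P_{70^\circ\mathrm{C}}}{P_{20^\circ\mathrm{C}}}
  = \left(\frac{343\,\mathrm{K}}{293\,\mathrm{K}}\right)^{4} \approx 1.9
```

So each square metre of radiator sheds roughly twice as much heat at 70°C as at 20°C; how far that actually shrinks the article's 2.6x figure depends on what radiating temperature the article assumed.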
Yes, slack is intermittently down for me and the rest of my company. It's intermittently losing messages, failing to load new ones, and showing error pages :-(
Now you understand why this app costs more than 2x the price of alternatives such as diskDedupe.
Any halfway-competent developer can write some code that does a SHA256 hash of all your files and uses the Apple filesystem APIs to replace duplicates with shared clones. I know Swift; I could probably do it in an hour or two. Should you trust my bodgy quick script? Heck no.
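For illustration only, here's roughly what that bodgy quick script might look like in Swift (hypothetical and untested; it assumes everything lives on one APFS volume and cheerfully ignores metadata, permissions, hard links, and race conditions - exactly the stuff a careful app has to worry about):

```swift
// Bodgy dedupe sketch: hash every regular file, clone-replace duplicates.
// Assumes macOS, one APFS volume, and that losing file metadata is acceptable.
import Foundation
import CryptoKit

func sha256(of url: URL) throws -> String {
    var hasher = SHA256()
    let handle = try FileHandle(forReadingFrom: url)
    defer { try? handle.close() }
    while let chunk = try handle.read(upToCount: 1 << 20), !chunk.isEmpty {
        hasher.update(data: chunk)
    }
    return hasher.finalize().map { String(format: "%02x", $0) }.joined()
}

let root = URL(fileURLWithPath: CommandLine.arguments[1])
let fm = FileManager.default
var firstSeen: [String: URL] = [:]   // content hash -> first file seen with that content

guard let walker = fm.enumerator(at: root, includingPropertiesForKeys: [.isRegularFileKey]) else { exit(1) }

for case let url as URL in walker {
    guard (try? url.resourceValues(forKeys: [.isRegularFileKey]))?.isRegularFile == true else { continue }
    guard let hash = try? sha256(of: url) else { continue }
    if let original = firstSeen[hash] {
        // Clone the original next to the duplicate, then atomically swap it in.
        let temp = url.deletingLastPathComponent().appendingPathComponent(".clone-\(UUID().uuidString)")
        if clonefile(original.path, temp.path, 0) == 0 {
            _ = try? fm.replaceItemAt(url, withItemAt: temp)
        }
    } else {
        firstSeen[hash] = url
    }
}
```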
The author - John Siracusa - has been a professional programmer for decades and is an exceedingly meticulous kind of person. I've been listening to the ATP podcast where they've talked about it, and the app has undergone an absolute ton of testing. Look at the guardrails on the FAQ page https://hypercritical.co/hyperspace/ for an example of some of the extra steps the app takes to keep things safe. Plus you can review all the proposed file changes before you touch anything.
You're not paying for the functionality, but rather the care and safety that go around it. Personally, I would trust this app over just about any other on the Mac.
Best course of action is to not trust John, and just wait a year with the app out in the wild, until everyone else trusts John. I have enough hard drive space in the meantime to not rush into trusting John.
I can't find a specific EULA or disclaimer for the Hyperspace app, but given that the EULAs for major things like Microsoft Office basically say "we offer you no warranty or recourse no matter what this software does", I would hardly expect an indie app to offer anything more.
Yep. At a previous job we had a file server that we published Windows build output to.
There were about 1000 copies of the same prerequisite .NET and VC++ runtimes (each build had one) and we only paid the cost of storing them once. It was great.
It is worth pointing out, though, that on Windows Server this deduplication is a background process: when new duplicate files are created, they genuinely are duplicates and take up extra space, but once in a while the background process comes along and "reclaims" them, much like the Hyperspace app here does.
Because of this (the background sweep process is expensive), it doesn't run all the time and you have to tell it which directories to scan.
If you want "real" de-duplication, where a duplicate file will never get written in the first place, then you need something like ZFS
Both ZFS and WinSvr offer "real" dedupe. One is on-write, which requires a significant amount of available memory, the other is on a defined schedule, which uses considerably less memory (300MB + 10MB/TB).
ZFS is great if you believe you'll exceed some threshold of space while writing. I don't personally plan my volumes with that in mind but rather make sure I have some amount of excess free space.
WinSvr allows you to disable dedupe if you want (don't know why you would), whereas ZFS is a one-way street without exporting the data.
Both have pros and cons. I can live with the WinSvr cons while ZFS cons (memory) would be outside of my budget, or would have been at the particular time with the particular system.
> This is basically only a win on macOS, and only because Apple charges through the nose for disk space
You do realize that this software is only available on macOS, and only works because of Apple's APFS filesystem? You're essentially complaining that medicine is only a win for people who are sick.
This is NOT a novel or new feature in filesystems... Basically any CoW filesystem will do it, and lots of other filesystems have hacks built on top to support this kind of feature.
---
My point is that people are only "sick" because the company prices storage outrageously. Not that Apple is the only offender in this space - but man, are they the most egregious.
He "only" saved 30%? That's amazing. I really doubt most people are going to get anywhere near that.
When I run it on my home folder (roughly 500GB of data) I find 124 MB of duplicated files.
At this stage I'd like it to tell me what those files are - the dupes are probably dumb ones that I could simply go delete by hand - but I can understand why he'd want people to pay up first, since by simply telling me what the dupes are he's already proved the app's value :-)
> He "only" saved 30%? That's amazing. I really doubt most people are going to get anywhere near that.
You misunderstood my comment. I ran it on my home folder, which contains 165GB of data, and it found 1.3GB in savings. That isn't significant enough for me to care about because I currently have 225GB free of my 512GB drive.
BTW I highly recommend the free "disk-inventory-x" utility for macOS space management.
His comment is pretty understandable if you've done frontend work in JavaScript.
The node_modules folder is so ripe for duplicate content that some tools explicitly call out that they're disk efficient (it's literally in the tagline for PNPM, "Fast, disk space efficient package manager": https://github.com/pnpm/pnpm)
So he got OK results (~13% savings) on possibly the best target content available in a user's home directory.
Then he got results so bad it's utterly not worth doing on the rest (0.10% - not 10%, literally 1/10 of a single percent).
---
Deduplication isn't super simple, isn't always obviously better, and can require other system resources in unexpected ways (e.g., lots of CPU and RAM). It's a cool tech to fiddle with on a NAS, and I'm generally a fan of modern CoW filesystems (incl. APFS).
But I want to be really clear - this is picking-spare-change-out-of-the-couch savings. Penny wise, pound foolish. The only people who are likely to actually save anything by buying this app probably already know it, and have a large set of real options available. Everyone else is falling into the "download more RAM" trap.
Another 30% more than the 1GB saved in node modules, for 1.3GB total. Not 30% of total disk space.
For reference, from the comment they’re talking about:
> I then tried again including my user home folder (731K files, 127K folders, 2755 eligible files) to hopefully catch more savings and I only ended up at 1.3GB of savings (300MB more than just what was in the NodeJS folders.)
In order to check if a file is a duplicate of another, you need to check it against _every other possible file_. You need some kind of "lookup key".
If we took the first 1024 bytes of each file as the lookup key, then our key size would be 1024 bytes. If you have 1 million files on your disk, that's about 1GB of RAM just to store all the keys. That's not a big deal these days, but it's also annoying if you have a bunch of files that all start with the same 1024 bytes -- e.g. perhaps all the Photoshop documents start with the same header. You'd need a 2-stage comparison, where you first match the key (1024 bytes) and then do a full comparison to see if it really matches.
Far more efficient, and less work, if you just use a SHA256 of the file's contents. That gets you a much smaller 32-byte key, and you don't need to bother with 2-stage comparisons.
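For scale, the key-storage arithmetic behind those two options (my numbers, using the 1 million files above):

```latex
10^{6} \times 1024\,\mathrm{B} \approx 1\,\mathrm{GB}\ \text{of prefix keys}
\qquad\text{vs.}\qquad
10^{6} \times 32\,\mathrm{B} = 32\,\mathrm{MB}\ \text{of SHA256 keys}
```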
I understand the concept. My main point is that it's probably not a huge advantage to store hashes of the first 1KB, which requires CPU to calculate, over just the raw bytes, which requires storage. There's a tradeoff either way.
I don't think it would be far more efficient to hash the entire contents, though. If you have a million files storing a terabyte of data, the 2-stage comparison would read at most 1GB (1 million * 1KB) of data, and less for smaller files. If you hash the whole contents, you have to read the entire 1TB. There are a hundred confounding variables, for sure. I don't think you could confidently estimate which would be more efficient without a lot of experimenting.
If you're going to keep partial hashes in memory, you may as well align them on whatever boundary is the minimal block/sector size that your drives give back to you. Hashing (say) 8kB takes less time than fetching it from SSD (much less from disk), so if you only used the first 1kB, you'd (eventually) need to re-fetch the same block to calculate the hash for the rest of the bytes in that block.
... okay, so as long as you always feed chunks of data into your hash in the same deterministic order, it doesn't matter for the sake of correctness what that order is or even if you process some bytes multiple times. You could hash the first 1kB, then the second-through-last disk blocks, then the entire first disk block again (double-hashing the first 1kB) and it would still tell you whether two files are identical.
If you're reading from an SSD and seek times don't matter, it's in fact probable that on average a lot of files are going to differ near the start and end (file formats with a header and/or footer) more than in the middle, so maybe a good strategy is to use the first 32k and the last 32k, and then if they're still identical, continue with the middle blocks.
etc, and only calculate the latter partial hashes when there is a collision between earlier ones. If you have 10M files and none of them have the same length, you don't need to hash anything. If you have 10M files and 9M of them are copies of each other except for a metadata tweak that resides in the last handful of bytes, you don't need to read the entirety of all 10M files, just a few blocks from each.
A further refinement would be to have per-file-format hashing strategies... but then hashes wouldn't be comparable between different formats, so if you had 1M pngs, 1M zips, and 1M png-but-also-zip polyglot files, it gets weird. Probably not worth it to go down this road.
I don't know exactly what Siracusa is doing here, but I can take an educated guess:
For each candidate file, you need some "key" that you can use to check if another candidate file is the same. There can be millions of files so the key needs to be small and quick to generate, but at the same time we don't want any false positives.
The obvious answer today is a SHA256 hash of the file's contents; It's very fast, not too large (32 bytes) and the odds of a false positive/collision are low enough that the world will end before you ever encounter one. SHA256 is the de-facto standard for this kind of thing and I'd be very surprised if he'd done anything else.
You can start with the file size, which is probably close to unique for most files. That would likely cut down the search space fast.
At that point maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash anyway, and if you just compare the bytes there is no chance of a hash collision, however small.
Plus if you find a difference at byte 1290 you can just stop there instead of reading the whole thing to finish the hash.
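A sketch of that early-exit comparison (hypothetical Swift, assuming the two files already matched on size):

```swift
// Compare two files chunk by chunk and bail out at the first difference.
import Foundation

func identicalContents(_ a: URL, _ b: URL, chunkSize: Int = 1 << 20) throws -> Bool {
    let ha = try FileHandle(forReadingFrom: a)
    let hb = try FileHandle(forReadingFrom: b)
    defer { try? ha.close(); try? hb.close() }
    while true {
        let ca = try ha.read(upToCount: chunkSize) ?? Data()
        let cb = try hb.read(upToCount: chunkSize) ?? Data()
        if ca != cb { return false }   // first mismatching chunk: stop reading
        if ca.isEmpty { return true }  // both hit EOF without a difference
    }
}
```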
I don’t think John has said exactly how on ATP (his podcast with Marco and Casey), but as a longtime listener/reader I know he’s being very careful. And I think he’s said that on the podcast too.
To make dedup[0] fast, I use a tree with device id, size, first byte, last byte, and finally SHA-256. Each of those is only used if there is a collision to avoid as many reads as possible. dedup doesn’t do a full file compare, because if you’ve found a file with the same size, first and last bytes, and SHA-256 you’ve also probably won the lottery several times over and can afford data recovery.
This is the default for ZFS deduplication, and git does something similar with size and the far weaker SHA-1. I would add a test for SHA-256 collisions, but no one seems to have found a working example yet.
How much time is saved by not comparing full file contents? Given that this is a tool some people will only run occasionally, having it take 30 seconds instead of 15 is a small price to pay for ensuring it doesn't treat two differing files as equal.
FWIW, when I wrote a tool like this I used same size + some hash function, not MD5 but maybe SHA1, don't remember. First and last bytes is a good idea, didn't think of that.
Yeah, there is definitely some merit to more efficient hashing. Trees with a lot of duplicates require a lot of hashing, but hashing the entire file would be required regardless of whether partial hashes are done or not.
I have one data set where `dedup` was 40% faster than `dupe-krill` and another where `dupe-krill` was 45% faster than `dedup`.
`dupe-krill` uses blake3, which, last I checked, was not hardware accelerated on M-series processors. What's interesting is that because of hardware acceleration, `dedup` is mostly CPU-idle, waiting on the hash calculation, while `dupe-krill` is maxing out 3 cores.
Wonder what the distribution of file sizes is here, on average? I know certain file types tend to cluster in specific ranges.
>maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash
Definitely, for comparing any two files. But if you're searching for duplicates across the entire disk, then you're potentially checking each file multiple times, and each file gets compared against multiple times. So hashing them on the first pass could conceivably be more efficient.
>if you just compare the bytes there is no chance of hash collision
You could then compare hashes and, only in the exceedingly rare case of a collision, do a byte-by-byte comparison to rule out false positives.
But, if your first optimization (the file size comparison) really does dramatically reduce the search space, then you'd also dramatically cut down on the number of re-comparisons, meaning you may be better off not hashing after all.
You could probably run the file size check, then based on how many comparisons you'll have to do for each matched set, decide whether hashing or byte-by-byte is optimal.
To have a mere one in a billion chance of getting a SHA-256 collision, you'd need to spend 160 million times more energy than the total annual energy production on our planet (and that's assuming our best bitcoin mining efficiency; actual file hashing needs way more energy).
The probability of a collision is so astronomically small, that if your computer ever observed a SHA-256 collision, it would certainly be due to a CPU or RAM failure (bit flips are within range of probabilities that actually happen).
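For a sense of scale, a rough birthday-bound estimate (my arithmetic, not the parent's):

```latex
p \approx \binom{n}{2}\,2^{-256} \approx \frac{n^{2}}{2^{257}},
\qquad
n = 10^{12} \;\Rightarrow\; p \approx \frac{10^{24}}{2.3\times10^{77}} \approx 4\times10^{-54}
```

Even a trillion hashed files leaves you roughly fifty orders of magnitude away from anything you'd ever notice.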
You know, I've been hearing people warn of handling potential collisions for years and knew the odds were negligible, but never really delved into it in any practical sense.
You can group all files into buckets, and as soon as a bucket is down to a single file, discard it. If at the end there are still multiple files in the same bucket, they are duplicates.
Initially all files are in the same bucket.
You now iterate over differentiators which, given two files, tell you whether they are maybe equal or definitely not equal. They become more and more costly but also more and more exact. You run each differentiator on all files in a bucket to split the bucket into finer equivalence classes.
For example (a code sketch follows the list):
* Differentiator 1 is the file size. It's really cheap: you only look at metadata, not the file contents.
* Differentiator 2 can be a hash over the first file block. Slower since you need to open every file, but still blazingly fast and O(1) in file size.
* Differentiator 3 can be a hash over the whole file. O(N) in file size, but so precise that if you use a cryptographic hash you're very unlikely to have false positives.
* Differentiator 4 can compare files bit for bit. Whether that is really needed depends on how much you trust the collision resistance of your chosen hash function. Don't discard this step though; Git got bitten by this.
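A sketch of that bucket-splitting loop (hypothetical Swift; the block size and choice of differentiators are just examples, and a paranoid tool would add the bit-for-bit check as a final stage):

```swift
// Group candidate files by ever-more-expensive keys; only buckets with
// two or more members survive each round, so most files drop out cheaply.
import Foundation
import CryptoKit

func hashPrefix(_ url: URL, bytes: Int?) throws -> String {
    var hasher = SHA256()
    let handle = try FileHandle(forReadingFrom: url)
    defer { try? handle.close() }
    var remaining = bytes ?? Int.max
    while remaining > 0, let chunk = try handle.read(upToCount: min(remaining, 1 << 20)), !chunk.isEmpty {
        hasher.update(data: chunk)
        remaining -= chunk.count
    }
    return hasher.finalize().map { String(format: "%02x", $0) }.joined()
}

func refine(_ buckets: [[URL]], by key: (URL) -> String?) -> [[URL]] {
    var result: [[URL]] = []
    for bucket in buckets where bucket.count > 1 {          // singletons can't be duplicates
        var split: [String: [URL]] = [:]
        for url in bucket {
            guard let k = key(url) else { continue }
            split[k, default: []].append(url)
        }
        result.append(contentsOf: split.values.filter { $0.count > 1 })
    }
    return result
}

func duplicateGroups(in files: [URL]) -> [[URL]] {
    var buckets = [files]
    // Differentiator 1: file size (metadata only).
    buckets = refine(buckets) { url in
        (try? url.resourceValues(forKeys: [.fileSizeKey]))?.fileSize.map { String($0) }
    }
    // Differentiator 2: hash of the first 4 KiB.
    buckets = refine(buckets) { try? hashPrefix($0, bytes: 4096) }
    // Differentiator 3: hash of the whole file.
    buckets = refine(buckets) { try? hashPrefix($0, bytes: nil) }
    return buckets
}
```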
Not surprisingly, differentiator 2 can just be the first byte (or machine word). Differentiator 3 can be the last byte (or word). At that point, 99.99% (in practice more 9s) of files are different and you've read at most 2 blocks per file. I haven't figured out a good differentiator 4 prior to hashing, but it's already so rare that it's not worth it, in my experience.
I experimented with a similar, "hardlink farm"-style approach for deduplicated, browseable snapshots. It resulted in a small bash script which did the following:
- compute SHA256 hashes for each file on the source side
- copy files which are not already known to a "canonical copies" folder on the destination (this step uses the hash itself as the file name, which makes it easy to check whether I already had a copy of the same file earlier)
- mirror the source directory structure to the destination
- create hardlinks in the destination directory structure for each source file; these should use the original file name but point to the canonical copy.
Hard links are not a suitable alternative here. When you deduplicate files, you typically want copy-on-write: if an app writes to one file, it should not change the other. Because of this, I would be extremely scared to use anything based on hard links.
In any case, a good design is to ask the kernel to do the dedupe step after user space has found duplicates. The kernel can double-check for you that they are really identical before doing the dedupe. This is available on Linux as the ioctl BTRFS_IOC_FILE_EXTENT_SAME.
It was for me. I was using rsync with "--link-dest" earlier for this purpose, but that only works if the file is present in consecutive backups. I wanted to have the option of seeing a potentially different subset of files for each backup and saving disk space at the same time.
Restic and Borg can do this at the block level, which is more effective but requires the tool to be installed when I want to check out something.
Oh, SHA-256 hashes are precisely what I used for a quick script I put together to parse through various backups of my work laptop in different places (tool changes and laziness). I had 10 different backups going back 4 years, and I wanted to make sure I 1) preserved all unique files, and 2) preserved the latest folder structure they showed up in.
xxHash (or XXH3, which I believe is even faster) is massively faster than SHA256, at the cost of cryptographic security, which is unnecessary here.
Of course, engineering being what it is, it's possible that only one of these has hardware support and thus might end up actually being faster in realtime.
Blake3 is my favorite for this kind of thing. It's a cryptographic hash (maybe not the world's strongest, but considered secure), and also fast enough that in real-world scenarios it performs just as well as non-crypto hashes like xxHash.
I think the probability is not so low. I remember reading here about a person getting a photo from another chat in a chat application that was using SHA hashes in the background. I don't recall all the details; it's improbable, but possible.