Even when I had this in the hot path 10 years ago, owning hundreds of objects in a particle filter, handing out ownership copies and creating new ones ended up taking ~5% (i.e. that is what I got back by making it a contiguous vector without any shared_ptr). It can be expensive, but in those cases you probably shouldn’t be using shared_ptr.
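To make the two layouts concrete, here is a minimal sketch. `Particle`, its fields, and both `resample_*` functions are hypothetical names made up for illustration, not the code from that project, and the ~5% figure may well not reproduce on other workloads.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical particle type; the real fields are not given above.
struct Particle {
    double state;
    double weight;
};

// Shared-ownership layout: every surviving particle is a separate heap
// allocation with its own control block and atomic reference count.
std::vector<std::shared_ptr<Particle>> resample_shared(
    const std::vector<std::shared_ptr<Particle>>& particles,
    const std::vector<std::size_t>& picks) {
    std::vector<std::shared_ptr<Particle>> next;
    next.reserve(picks.size());
    for (std::size_t i : picks) {
        assert(i < particles.size());
        // Deep copy so duplicated particles can diverge later.
        next.push_back(std::make_shared<Particle>(*particles[i]));
    }
    return next;
}

// Contiguous layout: particles are plain values in one buffer, so a
// resample is a run of small copies with no reference counting at all.
std::vector<Particle> resample_values(
    const std::vector<Particle>& particles,
    const std::vector<std::size_t>& picks) {
    std::vector<Particle> next;
    next.reserve(picks.size());
    for (std::size_t i : picks) {
        assert(i < particles.size());
        next.push_back(particles[i]);
    }
    return next;
}
```

Both functions take the list of indices chosen by the resampling step and return the next generation; only the storage strategy differs.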
Oh, and incrementing an integer by itself (non-atomically) is stupid fast. Like, you can do a billion of them per second. The CPU doesn’t write that to RAM immediately, and you’re not adding a huge amount of cache pressure compared to everything else your program is normally doing.
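For reference, a rough sketch of the two kinds of count being compared. `PlainCount` and `AtomicCount` are made-up names, and the memory orderings are a common choice for reference counts rather than anything stated above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Non-atomic count: only safe when a single thread touches it.
struct PlainCount {
    std::uint32_t n = 1;

    void retain() { ++n; }

    // Returns true when the last reference was dropped.
    bool release() {
        assert(n > 0);
        return --n == 0;
    }
};

// Atomic count: safe across threads, but every update is an atomic
// read-modify-write on the cache line holding the counter.
struct AtomicCount {
    std::atomic<std::uint32_t> n{1};

    void retain() { n.fetch_add(1, std::memory_order_relaxed); }

    // fetch_sub returns the previous value, so 1 means we were the last owner.
    bool release() {
        return n.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
```

If you want to time these, keep in mind that an optimizer can fold a loop of plain increments into a single addition, so a naive benchmark may measure nothing.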
If you write to cache, then depending on architecture that change has to be made visible to every other thread as well. Reading is not subject to such a constraint.
MESI cache coherency (and its derivatives) [1] means that you can have exclusive writes to cache if and only if no other core tries to access that data. I would think most if not all microarchitectures have moved to MESI (or equivalent) cache coherency protocols as they avoid unnecessary writes to memory.
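One place this shows up in practice is false sharing: two cores writing to different variables that happen to sit on the same cache line keep stealing the line from each other. A minimal layout sketch, assuming a 64-byte line (an assumption, check your target; C++17 also offers `std::hardware_destructive_interference_size`). All names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; common on x86-64 but not guaranteed anywhere.
constexpr std::size_t kAssumedLineSize = 64;

// Both counters fit in one line, so writers on different cores contend.
struct PackedCounters {
    std::uint64_t a;
    std::uint64_t b;
};

// Each counter is aligned (and therefore padded) to its own line.
struct alignas(kAssumedLineSize) PaddedCounter {
    std::uint64_t value;
};

struct PaddedCounters {
    PaddedCounter a;
    PaddedCounter b;
};

// True when both addresses fall inside the same assumed cache line.
bool same_line(const void* p, const void* q) {
    assert(p != nullptr && q != nullptr);
    const std::uintptr_t x = reinterpret_cast<std::uintptr_t>(p) / kAssumedLineSize;
    const std::uintptr_t y = reinterpret_cast<std::uintptr_t>(q) / kAssumedLineSize;
    return x == y;
}
```

With the padded layout each core can hold its own line in an exclusive state; with the packed one the line may bounce between cores on every write.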
Only for atomics. Single-threaded refcounts are fine, and WebKit uses hybrid RC where you use an atomic shared ptr to share across threads and then you downgrade it to a non atomic version in-thread (and can hand out another atomic copy at any time).
Atomics are rarely needed: you should really try to avoid sharing ownership across threads, and change your design so you don't have to, if you can.
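The hybrid scheme mentioned above could look roughly like this. It is a sketch built on `std::shared_ptr`, not WebKit's actual types or API; `LocalRef` and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// In-thread handle: holds exactly one atomic reference through a
// std::shared_ptr, and counts its own copies with a plain integer.
// Copies of a LocalRef must stay on the thread that created it.
template <typename T>
class LocalRef {
    struct Block {
        std::shared_ptr<T> shared;  // the single atomic reference
        std::size_t local_count;    // non-atomic, one thread only
    };
    Block* block_;

public:
    // "Downgrade": take one atomic reference, then count locally.
    explicit LocalRef(std::shared_ptr<T> shared)
        : block_(new Block{std::move(shared), 1}) {
        assert(block_->shared != nullptr);
    }

    // Copying bumps a plain integer, not the atomic count.
    LocalRef(const LocalRef& other) : block_(other.block_) {
        ++block_->local_count;
    }

    LocalRef& operator=(const LocalRef&) = delete;

    // The last local copy releases the single atomic reference.
    ~LocalRef() {
        if (--block_->local_count == 0) delete block_;
    }

    T& operator*() const { return *block_->shared; }
    T* operator->() const { return block_->shared.get(); }
    std::size_t local_count() const { return block_->local_count; }

    // Hand out another atomic copy, e.g. to pass to a different thread.
    std::shared_ptr<T> share() const { return block_->shared; }
};
```

The atomic count only moves when you convert in either direction; everything done with `LocalRef` copies inside the thread is ordinary integer arithmetic.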
>> If you write to cache, then depending on architecture that change has to be made visible to every other thread as well
> Only for atomics
I don't think cache coherency is aware of threads; they live at a level above it (IANAExpert though)
> where you use an atomic shared ptr to share across threads and then you downgrade it to a non atomic version in-thread (and can hand out another atomic copy at any time).
Your idea of GC is very different from others'; you are happy to do a ton of manual stuff. GC is generally about not doing a ton of manual stuff.
Your single-threaded RC will still have to write back to memory. No one thinks that incrementing an integer is the slow part; destroying cache is.