
Most people's mental model of garbage collection is still just stop-the-world mark-and-sweep, which is extremely old. Game developers hate GC because of unpredictable latency pauses, with no way to bound them, eating into the frame budget - but there are tons of new, modern GC implementations that have no pauses, do collection and scanning incrementally on a separate thread, have extremely low overhead for write barriers, only have to scan a minimal subset of changed objects to recompute liveness, etc., and are probably a great choice for a new programming language for a game engine. Game data models are famously highly interconnected and have non-trivial ownership, and games already use things like deferred freeing of objects (via arena allocators) to avoid doing memory management on the hot path - something a GC could do automatically for everything.
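As a rough sketch of the "only scan a minimal subset of changed objects" part: generational and incremental collectors pair a cheap write barrier with a card table (or remembered set), so the collector only rescans regions that were actually mutated since the last cycle. Something like this - made-up names, C++-flavored, not any particular runtime's code:

    #include <cstddef>
    #include <cstdint>

    // Toy card-marking write barrier: the heap is divided into fixed-size
    // "cards"; storing a pointer marks the source card dirty, and the
    // collector later rescans only the dirty cards.
    constexpr std::size_t kCardShift = 9;        // 512-byte cards
    constexpr std::size_t kHeapSize  = 1 << 20;  // 1 MiB toy heap
    static std::uint8_t card_table[kHeapSize >> kCardShift];
    static char heap[kHeapSize];

    inline void write_barrier(void** slot, void* new_value) {
        *slot = new_value;  // the actual pointer store
        std::size_t offset = reinterpret_cast<char*>(slot) - heap;
        card_table[offset >> kCardShift] = 1;  // one byte store of overhead
    }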


After being a Java programmer for the greater part of a decade and then spending a bunch of years doing Objective C/Swift programming, I don't really understand why the Automatic Reference Counting approach hasn't won out over all others. In my opinion it's basically the best of all possible worlds:

1. Completely deterministic, so no worrying about pauses or endlessly tweaking GC settings.

2. Manual ref counting is a pain, but again ARC basically makes all that tedious and error prone bookkeeping go away. As a developer I felt like I had to think about memory management in Obj C about as much as I did for garbage collected languages, namely very little.

3. True, you need to worry about stuff like ref cycles, but in my experience I had to worry about potential memory leaks in Obj C about the same as I did in Java. I.e. I've seen Java programs crash with OOMs because weak references weren't used when they were needed, and I've seen similar in Obj C.
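For what it's worth, the cycle-breaking pattern looks the same in most refcounted settings; a minimal C++ sketch, with std::shared_ptr/std::weak_ptr standing in for ARC's strong/weak references:

    #include <memory>

    struct Child;
    struct Parent {
        std::shared_ptr<Child> child;  // strong reference: parent owns child
    };
    struct Child {
        std::weak_ptr<Parent> parent;  // weak back-pointer: no cycle
    };

    int main() {
        auto p = std::make_shared<Parent>();
        p->child = std::make_shared<Child>();
        p->child->parent = p;  // with a shared_ptr here, neither object
                               // would ever be freed
    }                          // p dropped -> both objects freed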


On the contrary, as others have pointed out, automatic reference counting seems to be the worst of all worlds.

1. Automatic reference counting adds a barrier every time you pass around an object, even if only for reading, unlike both manual memory management and tracing GC.

2. ARC still requires thought about ownership, since you have to ensure there are no cycles in the ownership graph unlike tracing/copying/compacting GC.

3. ARC still means that you can't look at a piece of code and tell how much work it will do, since you never know if a particular piece of code is holding the last reference to a piece of data, and must free the entire object graph, unlike manual memory management. In the presence of concurrency, cleanup is actually non-deterministic, as 'last one to drop its reference' is a race.

4. ARC doesn't solve the problem of memory fragmentation, unlike arena allocators and compacting/copying GC.

5. ARC requires expensive allocations; it can't use cheap bump-pointer allocation like copying/compacting GC can. This is related to the previous point.

6. ARC cleanup still takes time proportional to the number of unreachable objects, unlike tracing GC (proportional to number of live objects) or arena allocation (constant time).

Reference counting is a valid strategy in certain pieces of manually memory managed code (particularly in single-threaded contexts), but using it universally is almost always much worse than tracing/copying/compacting GC.

Note that there are (soft) realtime tracing GCs, but this can't be achieved with ARC.
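To make point 1 concrete, here's a small C++ sketch using std::shared_ptr, whose control block is updated atomically much like an ARC retain/release; passing by value pays for two atomic operations even though the callee only reads:

    #include <cstddef>
    #include <memory>
    #include <vector>

    // Pass by value: one atomic increment on entry, one atomic decrement
    // on exit, even though the data is only read.
    std::size_t count_by_value(std::shared_ptr<std::vector<int>> v) {
        return v->size();
    }

    // Pass by const reference (or borrow a raw pointer): no reference count
    // is touched - the read-only case a tracing GC also gets for free.
    std::size_t count_by_ref(const std::shared_ptr<std::vector<int>>& v) {
        return v->size();
    }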


It depends on the implementation. In traditional ARC implementations these are all issues to a degree, but even then they tend toward much lower overhead with more predictable behavior. Though I agree tracing GCs can be really fast and better in many ways.

1. I'm guessing you mean lock-based implementations? There are several non-lock, non-atomics-based ARC designs that still handle threading safely. That means a retain or release is little more than a single integer operation.

2. True, but for many contexts this is easy to do and makes it easier to understand data flows. In other cases it's possible to use index based references pretty readily, like in Rust. Or add a cycle collector.

3. In theory you can't, but in practice it's often pretty easy to tell, at least in non-OOP languages. I use Nim with its ARC (1) on an RTOS and this is really only a concern for large lists or large dynamic structures. It can be managed using the same patterns as RAII, where you call child functions that know they won't be the last ones holding the bag. Also, you can use the same trick as some Rust code does, where you pass the memory to another thread to dispose of (2).

4/5. It depends on the implementation, but you can use pools or arenas or other options. Nim provides an allocator algorithm (TLSF) with proven O(1) times and known fragmentation properties (3). Still, tracing GCs can make better use of short-lifetime arenas, true. Though with ARC you get similar benefits using stack-based objects.

6. It's tradeoffs. Tracing GCs also end up needing to scan the entire heap every so often. ARC only needs to check a root object during usage, and only accesses the entire graph when deallocating.

Your last point isn't accurate, as you can use an appropriately designed ARC in a hard realtime context. I've found it quite easy to do - granted, it takes a little bit of care, but any realtime system does. For items like interrupt handlers I ensure no memory is allocated or destroyed.

1) https://nim-lang.org/blog/2020/10/15/introduction-to-arc-orc... 2) https://abramov.io/rust-dropping-things-in-another-thread 3) http://www.gii.upv.es/tlsf/


I mostly agree with all of your points, and the core message - it's all tradeoffs.

Just one nitpick though:

> 6. It's tradeoffs. Tracing gcs also end up needing to scan the entire heap every so often.

This is not accurate: tracing GCs always start from the roots and only ever visit live objects. By definition, the objects that they free are not reachable from anywhere. "Full scans" typically refer to various optimization strategies that tracing GCs implement to avoid scanning all roots (e.g. generations, per-thread scans) which do rely on still occasionally doing a full scan (scanning all live objects in all generations, or in all threads).


Yah you’re right. I didn’t quite describe that correctly, but mostly I meant scanning the “live” objects of the heap.


> non lock, non atomics based ARC designs that still handle threading safely

Don't think that's even doable at all, at least not portably. Do you have some examples?


There are a couple of methods, like deferred counting, differential counting, or biased references. They're usually not completely atomics-free, but they generally provide guaranteed constant overhead or tweak when or how the memory can be shared.

- Nim's ARC only allows one thread to manage a reference count at a time, but enables an isolated graph of memory to be moved to another thread. The concept is called `Isolate`, and is very similar to Rust's single owner of mutable references. There's still WIP to have the compiler automate the checks, but it's usable now (I used it with FreeRTOS's xQueue mechanism just fine). https://github.com/nim-lang/RFCs/issues/244

- Python's new non-GIL proposal that does this using biased references: https://hackaday.com/2021/11/03/python-ditches-the-gils-and-...

- The source of Python's biased references: https://iacoma.cs.uiuc.edu/iacoma-papers/pact18.pdf

- Deferred reference counting: https://dl.acm.org/doi/pdf/10.1145/3453483.3454060

It's pretty cool stuff!
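To give a flavor of the biased approach (very loosely after the PACT'18 paper above - the structure and names here are made up for illustration): the owning thread updates a plain, non-atomic counter, and only foreign threads fall back to an atomic one:

    #include <atomic>
    #include <thread>

    // Hypothetical biased reference count: the owner thread gets the cheap
    // path, every other thread pays for the atomic.
    struct BiasedRc {
        std::thread::id owner = std::this_thread::get_id();
        unsigned biased = 1;              // owner-only, plain integer
        std::atomic<unsigned> shared{0};  // all other threads

        void retain() {
            if (std::this_thread::get_id() == owner)
                ++biased;  // non-atomic increment
            else
                shared.fetch_add(1, std::memory_order_relaxed);
        }

        // release() mirrors this and merges the two counters when the
        // owner's count hits zero (elided here).
    };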


If you check ARC code in a profiler, there's a shocking amount of time spent in the retain/release calls:

https://floooh.github.io/2016/01/14/metal-arc.html

There are ways around this overhead at least in Metal, but this requires at least as much effort as not using ARC to begin with.


I think the M1 CPU lowers some of the ARC retain/release to silicon -- it doesn't remove the problem, but it does seem to reduce the relative overhead significantly.

https://news.ycombinator.com/item?id=25203924

> this requires at least as much effort as not using ARC to begin with.

Designing a brand new CPU architecture certainly counts as "at least as much effort", yes. ^_^

P.S. While I'm here, your "handles are the better pointers" blog post is one of my all-time favorites. I appreciate you sharing your experiences!


Isn't ARC absolutely worst-case for most multi-threading patterns? You'll thrash objects between cores just when you reference them. Every object becomes a mutable object!


Reference counting is a form of garbage collection. Remember that reference counting and tracing garbage collection are simply two extremes of a continuum of approaches. Must-read paper: "A Unified Theory of Garbage Collection".


Compared to other, state-of-the-art GC strategies, it's slow and has bad cache behavior, especially for multithreading. Also, it's not deterministic perf-wise.


Can you explain these or point me to any links? I'd like to learn more.

> has bad cache behavior for multithreading

Why is this? Doesn't seem like that would be something inherent to ref counting.

> Also, not deterministic

I always thought that with ref counting, when you decrement a ref count that goes to 0 that it will then essentially call free on the object. Is this not the case?


Bad cache behavior: you're on core B, and the object is used by core A and in A's L2 cache. Just by getting a pointer to the object, you have to mutate it. Mutation invalidates A's cache entry for it and forces it to load into B's cache.

determinism: you reset a pointer variable. Once in a while, you're holding the last reference and now have to free the object. That takes more instructions and causes cache invalidation.
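A tiny C++ illustration of that second point - the cost of the one reset line depends entirely on non-local state:

    #include <memory>
    #include <vector>

    int main() {
        // One vector holding refcounted pointers to a million heap objects.
        std::vector<std::shared_ptr<std::vector<int>>> graph;
        for (int i = 0; i < 1'000'000; ++i)
            graph.push_back(std::make_shared<std::vector<int>>(16));

        auto alias = graph;  // a second owner (a million atomic increments)
        alias.clear();       // counts drop 2 -> 1: nothing is freed, cheap
        graph.clear();       // last owner: this single line frees a million
                             // objects, one deallocation at a time
    }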


> Bad cache behavior: you're on core B, and the object is used by core A and in A's L2 cache. Just by getting a pointer to the object, you have to mutate it. Mutation invalidates A's cache entry for it and forces it to load into B's cache.

Thank you! This was the response that made the cache issues clearest to me.


> Why is this? Doesn't seem like that would be something inherent to ref counting.

Once I know I can read the data, it usually doesn't matter that another thread is also reading it. Reference counting changes that because we both need to write to the count every time either of us takes or drops a reference to the data, and in the latter case we need to know what's happened on the other core, too. This means a lot more moving of changing data between processor cores.

> > Also, not deterministic

> I always thought that with ref counting, when you decrement a ref count that goes to 0 that it will then essentially call free on the object. Is this not the case?

That's my understanding, but is that "deterministic" as we mean it here? It's true that the same program state leads to the same behavior, but it's non-local program state, and it leads to you doing that work - potentially a lot of work, if you (eg) wind up freeing all of a huge graph structure - at relatively predictable places in your code.

There are good workarounds (free lists, etc.) but "blindly throw language-level reference counting at everything" isn't a silver bullet (or maybe even a good idea) for getting low latency from your memory management.
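For example, the free-list workaround might look roughly like this (a toy sketch; real pools add alignment, growth limits, and thread safety):

    #include <vector>

    // Toy free list: recycle fixed-size objects instead of hitting the
    // general-purpose allocator (and the refcount/GC machinery) on every use.
    template <typename T>
    class FreeList {
        std::vector<T*> free_;
    public:
        T* acquire() {
            if (free_.empty()) return new T{};
            T* p = free_.back();
            free_.pop_back();
            *p = T{};  // reset the recycled object to a fresh state
            return p;
        }
        void release(T* p) { free_.push_back(p); }  // keep it for reuse
        ~FreeList() { for (T* p : free_) delete p; }
    };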


> Doesn't seem like that would be something inherent to ref counting.

It is inherently so. A reference count is an atomic mutable value that must be updatable by any thread.

A significant selling point of ARC compared to traditional ObjC reference counting was that the compiler could elide retain/release calls better than the programmer could, thus preventing a ton of stalls.


Writing to the rc field even for read-only access dirties cache lines and causes cache-line ping-pong in the case of multi-core access, where you also need slower, synchronised refcount updates so as not to corrupt the count. Other GC strategies don't require dirtying cache lines when accessing the objects.

Determinism: the time taken is not deterministic because (1) malloc/free, which ARC uses but other GCs usually don't, are not deterministic - both can do arbitrary amounts of work like coalescing or defragmenting allocation arenas, or performing system calls that reconfigure process virtual memory - and (2) deallocations cascade, as objects whose rc hits 0 trigger rc decrements and deallocation of other objects.


> Also, not deterministic

> not deterministic perf wise.

was what parent wrote (emphasis added), I assume referring to the problem that when an object is destroyed, an arbitrarily large number of other objects -- a subset of the first object's members, recursively -- may need to be destroyed as a consequence.


Because (mainstream) refcounting GCs are just slower than modern tracing GCs. GC pauses are virtually gone these days (Java gives you <1ms maximum pauses for heaps up to 16TB), and modern tracing GCs are actually more deterministic than refcounting.


RC can be deterministic in terms of when a destructor gets called. GC languages usually don't support destructors for that reason.


But RC isn't deterministic either. All you know is that the destructor will be called when some reference is cleared. You don't know which and when, and you don't know how much work it will do. With modern tracing GCs, though, there are no more pauses, and mostly just a constant, small CPU tax paid in the background.

The only significant cost tracing has these days is in memory footprint.


> The only significant cost tracing has these days is in memory footprint.

And that's not insignificant. The top-of-the-line Pixel 6 Pro has twice as much RAM as the top-of-the-line iPhone 13 Pro. Maybe the Android camp is just more eager to play the specs game, but I've long speculated that iOS simply needs less RAM because of its refusal to use tracing GC.


Windows Phones of the Lumia series historically had less RAM than equally priced Android models, and the performance with .NET Native was much better.

I used all my models until their hardware died.


How can there be degrees of determinism?


Determinism in programming doesn't just mean "X was caused by Y" (e.g. determinism as a stand-in for the causal chain).

It mostly means "I can know (as in "determine") how much time, or instructions, or calls, or memory an operation will take".

And this knowledge can come in degrees (be more or less fuzzy).


I can tell you I'll come fix your plumbing between 10:00 and 14:00, and it will take between 30 minutes and two hours, or that I'll make a delivery at 9:45.


The destruction of a graph of several billion nodes and about five times as many edges took several seconds in my C++ program. I constructed it in main(), and the delay between the message printed just before the "return 0;" statement and the program actually exiting was quite long.

"Deterministic" is a double-edged sword. You get deterministic allocation and release times, but they might be longer than what is really achievable.


Because it is worse, regardless of the sales pitch.

https://github.com/ixy-languages/ixy-languages

When you do all the required optimizations, it turns into tracing GC optimizations under another, more politically accepted name.


Some game engines (C++) do allocations in multiple memory areas. If some allocated memory is known to be needed only for the current frame, then it is allocated from that frame's memory area. At the end of the frame the whole region is freed; this is an explicit garbage collection at the end of each frame. Memory allocation can further be split per CPU thread, thus avoiding global locking in the allocator. With double-buffering, the next updated/prepared frame can use its own memory area while the previous one finishes rendering.

The problem with GC or with reference counting is that it needs to operate on each allocated object separately. If the task for the GC can be reduced to operate on whole memory areas only, its overhead is greatly reduced.
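A minimal sketch of such a per-frame arena in C++ (hypothetical; ignores alignment and out-of-space handling):

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Per-frame bump allocator: allocation is a pointer increment, and
    // "freeing" the whole frame's garbage is a single reset at frame end.
    class FrameArena {
        std::vector<std::byte> buf_;
        std::size_t used_ = 0;
    public:
        explicit FrameArena(std::size_t bytes) : buf_(bytes) {}

        void* alloc(std::size_t n) {
            assert(used_ + n <= buf_.size());  // toy: no fallback path
            void* p = buf_.data() + used_;
            used_ += n;
            return p;
        }
        void reset() { used_ = 0; }  // end of frame: everything freed at once
    };

With double-buffering you would keep two of these and reset the one whose frame has finished rendering.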


> there are tons of new, modern GC implementations that have no pauses

There are no GC implementations that have no pauses. You couldn't make one without having prohibitively punitive barriers on at least one of reads and writes.


There are no CPUs that have no pauses; who knows, you may have to wait a microsecond for that cache line to come in from RAM.

There are no operating systems that have no pauses; if you don't want to share the CPU, you're going to have to take responsibility for the whole thing yourself. Most people are not even using RTOS.

There are no reference counting implementations that have no pauses. Good luck crawling that object graph. Most people are not even using deferred forms of reference counting.

As far as I know, there is one malloc implementation which runs in constant time. No one uses it.

There are tracing GCs whose pause times are bounded to 1ms. That is enough for soft-real-time video and audio (which is what matters to video games). In general, you are not going to get a completely predictable environment unless you pick up an in-order CPU with no cache and write all your code in assembly.


I think the more accurate / useful statement is: there's no GC with bounded pause times that's also guaranteed to free memory fast enough that you don't run out when you shouldn't. In other words, GC can never fully replace thinking about your allocation patterns.

(I suspect this is obvious to a lot of programmers, but especially after working with Java programmers who just skim JVM release notes and then repeat "pauseless", it's clearly not obvious to many.)


> GC can never fully replace thinking about your allocation patterns

Of course not! Nothing can. But—two things:

1. Per Knuth on premature optimization, 97% of your code should not care about such things. For those parts of your code that do need to effect their own allocation strategies, they can generally do so as well in a GCed language as another. Tracing GC forces minimal cognitive overhead on the remaining 97%, and allows it to interoperate smoothly with the 3%.

2. Tracing GC has better performance characteristics than other automatic memory management policies, in a few important respects. Compacting adds locality; bumping gives very fast allocation; copying gives very fast deallocation of deep graphs in the nursery.


The problem is that garbage collection does impact that 3%... since, unless you're using a naive GC, garbage collection limits your ability to manage memory manually. In general, reference counting is more tolerant of mixing refcounted and non-refcounted code. This hurts more since GC gives you performance in the parts of the code where performance doesn't matter, while hurting you where it does! And yes, this is alleviated somewhat by memory pools, but not entirely.


> Of course not!

Like I said - "of course" to me, and apparently you, and probably anyone who has worked with multiple GCs and unmanaged languages over many years - but that's not most programmers. The vast majority of Java developers I interview, up to those with 5-6y experience, have no clue how their idioms affect allocations.

> Per Knuth on premature optimization, 97% of your code should not care about such things.

No, this is the classic misstatement of Knuth. 97% of the time, I shouldn't worry about such things. But 100% of my program still might be affected by a design needed to reduce GC pressure, if that means switching core data structures to pools or SoA or something.


Right, but if you're allocating way too much, no lifetime management model will save you.

As a game dev I still will take a modern GC over refcounting or manual lifetime management any day. There are scenarios where you simply don't want to put up with the GC's interference, but those are rare - and modern stacks like .NET let you do manual memory management in cases where you are confident it's what you want to do. For the vast majority of the code I write, GC is the right approach - the best-case performance is good and the worst-case scenario is bad performance instead of crashes. The tooling is out there for me to do allocation and lifetime profiling and that's usually sufficient to find all the hotspots causing GC pressure and optimize them out without having to write any unsafe code.

The cost of having a GC walk a huge heap is a pain, though. I have had to go out of my way to ensure that large buffers don't have GC references in them so that the GC isn't stuck scanning hundreds of megabytes worth of static data. Similar to what I said above though, the same problem would occur if those data structures had a smart pointer in them :/


This is a meaningless assertion; if you insist on going that far, we have to discuss whether OS context switches cause your process to pause or whether the allocator you're using causes page faults. When people say 'GC implementations that have no pauses' they mean no stop-the-world, which IS an achievable thing in production collectors in many cases.


GC has a lot of issues, especially in engine-level code, but in practice every game I've worked on or shipped has had at least one garbage collector running for UI or gameplay code: Lua, ActionScript (Scaleform), UnrealScript, JavaScript, managed C#. Every game had GC performance issues as well, and we wrote code to generate minimal or no garbage.


Flipping this: in the C++ world, "garbage collection" is more than just freeing unused references. "Resource release is destruction."

Try-with-resources is ok, but not great.

Ultimately developers, especially those concerned with performance, prefer full control. And that means deterministic lifetime of memory and resources.


Part of it is also due to Unity3D sticking with the ancient stop-the-world Mono GC for a very long time and discouraging developers from doing allocation in official talks.


Another big cost of garbage collection is memory usage: it's not uncommon for a high-performance GC to require 2x-3x the memory for the same application compared to non-GC. For games this is not trivial given how heavy some of the assets are.



