I'd love to hear more! What kind of profiling issues are you running into? I'm assuming the inuse memory profiles are sometimes not good enough to track down leaks since they only show the allocation stack traces? Have you tried goref [1]? What kind of memory pressure issues are you dealing with?
I for one am still mystified how it's possible that a GC language can't expose the GC roots in a memory profile. I've lost so many hours of my life manually trying to figure out what might be keeping some objects live, information the GC figures out every single time it runs...
Do you think the GC roots alone (goroutine stacks with goroutine id, package globals) would be enough?
I think in many cases you'd want the reference chains.
The GC could certainly keep track of those, but at the expense of making things slower. My colleagues Nick and Daniel prototyped this at some point [1].
Alternatively the tracing of reference chains can be done on heap dumps, but it requires maintaining a partial replica of the GC in user space, see goref [2] for that approach.
So it's not entirely trivial, but rest assured that it's definitely being considered by the Go project. You can see some discussions related to it here [3].
Disclaimer: I contribute to the Go runtime as part of my job at Datadog. I can't speak on behalf of the Go team.
no, haven't heard of goref yet but will give it a shot!
usually I go with pprof, basic stuff, and it helps. I would NOT say memory leaks are the biggest or most common issue I see. However, as time goes on and services become more complicated, what I often see in the metrics is RAM getting eaten and never freed, so the app consumes more and more memory over time and only a restart helps.
It's hard to call it a memory leak in the original sense of the term, but the memory does not get cleaned up because of choices I made, and I want to understand how to do better.
Hacking into the Go runtime with eBPF is definitely fun.
But for a more reliable, lower-overhead long-term solution, it might be worth raising this as a feature request for the Go runtime itself. Type information could be provided via pprof labels on the allocation profiles.
Not sure if there is consensus yet on what a solution looks like for adding labels to non-point-in-time[^1] profiles like the heap profile without leaking memory: https://go.dev/issue/23458.
[^1]: As opposed to profiles that collect data only while activated, like the CPU profile. The heap profile is active from the beginning if `MemProfileRate` is set.
+1. In particular []byte slice allocations are often a significant driver of GC pace while also being relatively easy to optimize (e.g. via sync.Pool reuse).
I'm skeptical that it's worth it myself, this was just a fun research project for me. But once hardware shadow stacks are available, I think this could be great.
To answer your first question: For most Go applications, the average stack trace depth for profiling/execution tracing is below 32 frames. But some applications use heavy middleware layers that can push the average above this limit.
That being said, I think this technique will amortize much earlier when the fixed cost per frame walk is higher, e.g. when using DWARF or gopclntab unwinding. For Go that doesn't really matter while the compiler emits frame pointers. But it's always good to have options when it comes to evolving the compiler and runtime ...
Fedora 40 and later have shadow stack support in userspace. It's currently opt-in with glibc (`export GLIBC_TUNABLES=glibc.cpu.x86_shstk=on` is one way to switch it on I believe). The plan is to make this self-tuning eventually in glibc upstream, once the quirks have been ironed out.
It will not work with Go as-is: the Go scheduler will have to be taught to switch the shadow stack along with the regular stack, and panic/recover needs to walk the shadow stack. But for binaries that do not use CGO, it would be possible to enable this fairly quickly. Hardware support is already widely available. The SHSTK-specific code paths are well-isolated. You would not lose compatibility with older CPUs or kernels.
What does the API for accessing the shadow stack from user space look like? I didn't see anything for it in the kernel docs [1].
I agree about the need for switching the shadow stacks in the Go scheduler. But this would probably require an API that is a bit at odds with the security goals of the kernel feature.
I'm not sure I follow your thoughts on CGO and how this would work on older CPUs and kernels.
You can get the shadow stack pointer using the RDSSPQ instruction. The kernel documentation shows how the shadow stack size is obtained for the main thread. Threads created explicitly using clone3 have the specified shadow stack size. I think this is sufficient to determine the shadow stack boundaries.
Regarding older CPUs, what I wanted to point out is that the code to enable and maintain shadow stacks will not be smeared across the instruction stream (unlike using APX instructions, or direct use of LSE atomics on AArch64). It's possible to execute the shadow stack code only conditionally.
Note that it's actually GLIBC_TUNABLES=glibc.cpu.hwcaps=SHSTK to enable it with Fedora 40 glibc (and the program needs to be built with -fcf-protection).
I know that at least two engineers from the runtime team have seen the post in the #darkarts channel of gopher slack. One of them left a fire emoji :).
I'll probably bring it up in the bi-weekly Go runtime diagnostics sync [1] next Thursday, but my guess is that they'll have the same conclusion as me: neat trick, but not a good idea for the runtime until hardware shadow stacks become widely available and accessible.
Thanks! And to answer your question: No, it won't speed up Go programs for now. This was mostly a fun research project for me.
The low-hanging fruit for speeding up stack unwinding in the Go runtime is to switch to frame pointer unwinding in more places. In go1.21 we contributed patches to do this for the execution tracer. For the upcoming go1.23 release, my colleague Nick contributed patches to upgrade the block and mutex profilers. Once the go1.24 tree opens, we're hoping to tackle the memory profiler as well as copystack. The latter would benefit all Go programs, even those not using profiling, but it's likely going to be a relatively small win (<= 1%).
Once all of this is done, shadow stacks have the potential to make things even faster. But the problem is that we'll be deeply in diminishing returns territory at that point. Speeding up stack capturing is great when it makes up 80-90% of your overhead (this was the case for the execution tracer before frame pointers). But once we're down to 1-2% (the current situation for the execution tracer), another 8x speedup is not going to buy us much, especially when it has downsides.
The only future in which shadow stacks could speed up real Go programs is one where we decide to drop frame pointer support in the compiler, which could provide 1-2% speedup for all Go programs. Once hardware shadow stacks become widely available and accessible, I think that would be worth considering. But that's likely to be a few years down the road from now.
Do you think of, or know of, any areas in the Go codebase where a change would enable a performance jump bigger than, say, 10%? I'm very grateful for any work done on the Go codebase; for me this language is plenty fast. I'm just curious about the state of Go's internals: are there any techniques left to speed it up significantly, or are parts of the codebase or old architectural decisions holding it back? And thank you for your work!
I don't think any obvious 10%+ opportunities have been overlooked. Go is optimizing for fast and simple builds, which is a bit at odds with optimal code gen. So I think the biggest opportunity is to use Go implementations that are based on aggressively optimizing compilers such as LLVM and GCC. But those implementations tend to be a few major versions behind and are likely to be less stable than the official compiler.
That being said, I'm sure there are a lot of remaining incremental optimization opportunities that could add up to 10% over time. For example a faster map implementation [1]. I'm sure there is more.
Another recent perf opportunity is using pgo [2] which can get you 10% in some cases. Shameless plug: We recently GA'ed our support for it at Datadog [3].
Go's limitation is that it's a high-level language with a very simple compiler. Providing true zero-cost abstractions (full monomorphization rather than GC shape stenciling) and advanced optimizations is a bridge it won't cross, because it would mean much greater engineering effort spent on the compiler and a roughly fivefold increase in LOC, especially if the compiler wants to preserve its throughput.
Though I find it unfortunate that the industry considers Go a choice for performance-sensitive scenarios when C# exists. C# went the above route and does not sacrifice performance or the ability to offer performance-specific APIs (like cross-platform SIMD), paying the price of a higher-effort, more complex compiler implementation. It also does in-runtime PGO (DynamicPGO): since long-running server workloads usually run on a JIT where it's available, you don't need to carefully craft a sample workload and hope it matches production behavior - the JIT does it for you, and it yields anything from 10% to 35% depending on how abstraction-heavy the codebase is.
Reminder: the Go team considers code generation optimizations only as far as the compiler stays fast. That's why the Go compiler doesn't have as many optimization passes as C/C++/Rust compilers.
As a developer I like that approach: it keeps a great developer experience, helps me stay focused, and gives me great productivity.
Dynamic patching of return addresses is a very cool trick. I don't think I've seen this before. Have you run into any situations where this crashes programs or otherwise interferes with their execution?
If the program's already doing weird stuff with the stack/control flow/etc., yes. But that should be relatively rare, and for the majority of programs it should work fine.
It should support C++ exceptions. The trampolines have exception landing pads included to catch and rethrow any exceptions which are thrown through them.
[1] https://github.com/cloudwego/goref
Disclaimer: I work on continuous profiling for Datadog and contribute to the profiling features in the runtime.