I'd love to hear more! What kind of profiling issues are you running into? I'm assuming the inuse memory profiles are sometimes not good enough to track down leaks since they only show the allocation stack traces? Have you tried goref [1]? What kind of memory pressure issues are you dealing with?
I for one am still mystified how it's possible that a GC language can't expose the GC roots in a memory profile. I've lost so many hours of my life manually trying to figure out what might be keeping some objects live, information the GC figures out every single time it runs...
Do you think the GC roots alone (goroutine stacks with goroutine id, package globals) would be enough?
I think in many cases you'd want the reference chains.
The GC could certainly keep track of those, but at the expense of making things slower. My colleagues Nick and Daniel prototyped this at some point [1].
Alternatively the tracing of reference chains can be done on heap dumps, but it requires maintaining a partial replica of the GC in user space, see goref [2] for that approach.
So it's not entirely trivial, but rest assured that it's definitely being considered by the Go project. You can see some discussions related to it here [3].
Disclaimer: I contribute to the Go runtime as part of my job at Datadog. I can't speak on behalf of the Go team.
no, haven't heard of goref yet but will give it a shot!
usually I go with pprof, basic stuff, and it helps. I would NOT say memory leaks are the biggest or most common issue I see. However, as time goes on and services become more complicated, what I often see in the metrics is RAM getting eaten and never freed, so the app consumes more and more memory over time and only a restart helps.
It's hard to call it a memory leak in the original sense of the term, but the memory does not get cleaned up because of choices I made, and I want to understand how to do better.
Hacking into the Go runtime with eBPF is definitely fun.
But for a more reliable, lower-overhead long-term solution, it might be worth raising this as a feature request for the Go runtime itself. Type information could be provided via pprof labels on the allocation profiles.
Not sure if there is consensus yet on what a solution looks like for adding labels to non-point-in-time[^1] profiles like the heap profile without leaking memory: https://go.dev/issue/23458.
[^1]: As opposed to profiles that collect data only while activated, like the CPU profile. The heap profile is active from the beginning if `MemProfileRate` is set.
+1. In particular []byte slice allocations are often a significant driver of GC pace while also being relatively easy to optimize (e.g. via sync.Pool reuse).
I'm skeptical that it's worth it myself, this was just a fun research project for me. But once hardware shadow stacks are available, I think this could be great.
To answer your first question: For most Go applications, the average stack trace depth for profiling/execution tracing is below 32 frames. But some applications use heavy middleware layers that can push the average above this limit.
That being said, I think this technique will amortize much earlier when the fixed cost per frame walk is higher, e.g. when using DWARF or gopclntab unwinding. For Go that doesn't really matter while the compiler emits frame pointers. But it's always good to have options when it comes to evolving the compiler and runtime ...
Fedora 40 and later have shadow stack support in userspace. It's currently opt-in with glibc (`export GLIBC_TUNABLES=glibc.cpu.x86_shstk=on` is one way to switch it on I believe). The plan is to make this self-tuning eventually in glibc upstream, once the quirks have been ironed out.
It will not work with Go as-is: the Go scheduler will have to be taught to switch the shadow stack along with the regular stack, and panic/recover needs to walk the shadow stack. But for binaries that do not use CGO, it would be possible to enable this fairly quickly. Hardware support is already widely available. The SHSTK-specific code paths are well-isolated. You would not lose compatibility with older CPUs or kernels.
What does the API for accessing the shadow stack from user space look like? I didn't see anything for it in the kernel docs [1].
I agree about the need for switching the shadow stacks in the Go scheduler. But this would probably require an API that is a bit at odds with the security goals of the kernel feature.
I'm not sure I follow your thoughts on CGO and how this would work on older CPUs and kernels.
You can get the shadow stack pointer using the RDSSPQ instruction. The kernel documentation shows how the shadow stack size is obtained for the main thread. Threads created explicitly using clone3 have the specified shadow stack size. I think this is sufficient to determine the shadow stack boundaries.
Regarding older CPUs, what I wanted to point out is that the code to enable and maintain shadow stacks will not be smeared across the instruction stream (unlike using APX instructions, or direct use of LSE atomics on AArch64). It's possible to execute the shadow stack code only conditionally.
Note that it's actually GLIBC_TUNABLES=glibc.cpu.hwcaps=SHSTK to enable it with Fedora 40 glibc (and the program needs to be built with -fcf-protection).
I know that at least two engineers from the runtime team have seen the post in the #darkarts channel of gopher slack. One of them left a fire emoji :).
I'll probably bring it up in the bi-weekly Go runtime diagnostics sync [1] next Thursday, but my guess is that they'll have the same conclusion as me: neat trick, but not a good idea for the runtime until hardware shadow stacks become widely available and accessible.
Thanks! And to answer your question: No, it won't speed up Go programs for now. This was mostly a fun research project for me.
The low-hanging fruit for speeding up stack unwinding in the Go runtime is to switch to frame pointer unwinding in more places. In go1.21 we contributed patches to do this for the execution tracer. For the upcoming go1.23 release, my colleague Nick contributed patches to upgrade the block and mutex profilers. Once the go1.24 tree opens, we're hoping to tackle the memory profiler as well as copystack. The latter would benefit all Go programs, even those not using profiling, but it's likely going to be a relatively small win (<= 1%).
Once all of this is done, shadow stacks have the potential to make things even faster. But the problem is that we'll be deeply in diminishing returns territory at that point. Speeding up stack capturing is great when it makes up 80-90% of your overhead (this was the case for the execution tracer before frame pointers). But once we're down to 1-2% (the current situation for the execution tracer), another 8x speedup is not going to buy us much, especially when it has downsides.
The only future in which shadow stacks could speed up real Go programs is one where we decide to drop frame pointer support in the compiler, which could provide 1-2% speedup for all Go programs. Once hardware shadow stacks become widely available and accessible, I think that would be worth considering. But that's likely to be a few years down the road from now.
Do you think of, or know of, any areas in the Go codebase where a change would enable a performance jump bigger than, say, 10%? I'm very grateful for any work done on the Go codebase; for me this language is plenty fast. I'm just curious about the state of Go's internals: are there any techniques left to speed it up significantly, or are parts of the codebase or old architectural decisions holding it back? And thank you for your work!
I don't think any obvious 10%+ opportunities have been overlooked. Go is optimizing for fast and simple builds, which is a bit at odds with optimal code gen. So I think the biggest opportunity is to use Go implementations that are based on aggressively optimizing compilers such as LLVM and GCC. But those implementations tend to be a few major versions behind and are likely to be less stable than the official compiler.
That being said, I'm sure there are a lot of remaining incremental optimization opportunities that could add up to 10% over time. For example a faster map implementation [1]. I'm sure there is more.
Another recent perf opportunity is using pgo [2] which can get you 10% in some cases. Shameless plug: We recently GA'ed our support for it at Datadog [3].
Go's limitation is that it's a high-level language with a very simple compiler. Providing true zero-cost abstractions (full monomorphization rather than GC shape stenciling) and advanced optimizations is a bridge it won't cross, because it would mean much greater engineering effort spent on the compiler and a roughly fivefold increase in LOC, especially if the compiler wants to preserve its throughput.
Though I find it unfortunate that the industry considers Go a choice for performance-sensitive scenarios when C# exists. C# went the above route and does not sacrifice performance or the ability to offer performance-specific APIs (like cross-platform SIMD), paying the price of a higher-effort, more complex compiler implementation. It also does in-runtime PGO (DynamicPGO): since long-running server workloads usually run on a JIT where it's available, you don't need to carefully craft a sample workload and hope it matches production behavior - the JIT does it for you, and it yields anything from 10% to 35% depending on how abstraction-heavy the codebase is.
Reminder: the Go team considers code generation optimizations only as far as the compiler stays fast. That's why the Go compiler doesn't have as many optimization passes as C/C++/Rust compilers.
As a developer I like that approach: it keeps a great developer experience, helps me stay focused, and gives me great productivity.
Dynamic patching of return addresses is a very cool trick. I don't think I've seen this before. Have you run into any situations where this crashes programs or otherwise interferes with their execution?
If the program's already doing weird stuff with the stack/control flow/etc., yes. But that should be relatively rare, and for the majority of programs it should work fine.
It should support C++ exceptions. The trampolines have exception landing pads included to catch and rethrow any exceptions which are thrown through them.
[1] https://github.com/cloudwego/goref
Disclaimer: I work on continuous profiling for Datadog and contribute to the profiling features in the runtime.