Cache hit rate is probably the most immediately useful. Although given that this...

porridgeraisin · 2025-10-29T18:20:26 1761762026

It depends on which counter.

[ All from my experience on home GPUs, and in a lab with 2 nodes with 2 80GB H100s each. Not extensively benchmarked ]

Events like kernel launches, which this profiler reads right now, add very little overhead (1-2%). Kernel-level metrics like DRAM utilisation, cache hit rate, SM occupancy, etc. usually cost you 5-10%. If you want to plot a flame graph at an instruction level (mostly useful for learning purposes) then you go off the rails: I have seen even 25% overhead. And finally, full traces add tons of overhead, but that's pretty much expected since they produce GBs of profiling data anyway.

sirhcm · 2025-10-29T18:39:13 1761763153

Occupancy and RAM utilization are available from static analysis. A sampling profiler would also obviously not be suitable for this always-on profiler case. But reading the counters [0] from the GSP should be cheap.

[0] https://en.wikipedia.org/wiki/Hardware_performance_counter
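
To make the static-analysis point concrete, here is a rough sketch of how theoretical occupancy can be estimated from compile-time kernel properties alone, with no runtime counters. The function name and the per-SM limits used as defaults are assumptions for illustration, not values taken from any particular GPU's documentation, and the sketch ignores register and shared-memory allocation granularity, so treat the result as an upper-bound estimate.

```python
def theoretical_occupancy(
    threads_per_block: int,
    regs_per_thread: int,
    smem_per_block: int,
    # The limits below are illustrative assumptions; look up the real
    # values for your GPU before relying on the result.
    max_warps_per_sm: int = 64,
    max_blocks_per_sm: int = 32,
    regs_per_sm: int = 65536,
    smem_per_sm: int = 96 * 1024,
    warp_size: int = 32,
) -> float:
    """Estimate the fraction of an SM's warp slots a kernel can fill."""
    # Round the block size up to whole warps.
    warps_per_block = -(-threads_per_block // warp_size)

    # How many blocks fit under each per-SM resource limit.
    by_warps = max_warps_per_sm // warps_per_block
    regs_per_block = regs_per_thread * warps_per_block * warp_size
    by_regs = regs_per_sm // regs_per_block if regs_per_block else max_blocks_per_sm
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm

    # The tightest limit decides how many blocks are resident at once.
    blocks = min(max_blocks_per_sm, by_warps, by_regs, by_smem)
    return blocks * warps_per_block / max_warps_per_sm


if __name__ == "__main__":
    # Doubling register use per thread halves the resident blocks here.
    print(theoretical_occupancy(256, 32, 0))
    print(theoretical_occupancy(256, 64, 0))
```

In practice the CUDA runtime exposes `cudaOccupancyMaxActiveBlocksPerMultiprocessor` for the same estimate, which does account for the allocation details skipped above. Note this is theoretical occupancy only; achieved occupancy at runtime can be lower.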