
> Authors Charity Majors, Liz Fong-Jones, and George Miranda from Honeycomb explain what constitutes good observability, show you how to improve upon what you're doing today, and provide practical dos and don'ts for migrating from legacy tooling, such as metrics, monitoring, and log management. You'll also learn the impact observability has on organizational culture (and vice versa).

No wonder: it's either strong bias from people working at a tracing vendor, or an outright sales pitch.

It's totally false, though. Each pillar (metrics, logs, and traces) has its place and serves a different purpose. You won't use traces to measure the number of requests hitting your load balancer, the number of objects in an async queue, CPU utilisation, network latency, or any number of other things. Logs can also be richer than traces, and a nice pattern I've used with Grafana is linking the two, so you can jump from a trace to the corresponding log lines, which describe the different actions performed during that span.
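
A minimal sketch of that linking pattern, assuming an OpenTelemetry-instrumented Python service whose structured logs carry the current trace and span IDs (the field names and "Grafana Tempo + Loki" backend are just illustrative assumptions):

    import logging
    from opentelemetry import trace

    logger = logging.getLogger("app")
    tracer = trace.get_tracer("example")

    def handle_job(job_id: str) -> None:
        with tracer.start_as_current_span("handle_job") as span:
            ctx = span.get_span_context()
            # Emit the trace/span IDs alongside the log line so a backend such as
            # Grafana (Tempo + Loki) can jump from a span to its log lines.
            logger.info(
                "processing job",
                extra={
                    "job_id": job_id,
                    "trace_id": format(ctx.trace_id, "032x"),
                    "span_id": format(ctx.span_id, "016x"),
                },
            )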



While I was at Google, circa 2015-2016, working on some Ad project, I happened to be on call when our system started doing something wonky. Following our playbook, I called the SRE for the sub-system we were using (Spanner? something else, I don't remember) to check what was up.

They asked me to enable tracing for 30 seconds (we had a Chrome extension that sent a common URL parameter, which turned on full (100%) tracing in our web server for a short amount of time), and then I performed the operations our internal customers were complaining about.

This produced quite a hefty amount of trace data, but only for 30 seconds, which was enough for them to trace back where the issue might be coming from. It was basically end to end: from me doing something in the browser, down to our server/backend, down to their systems, and so on.

That's how I learned how important tracing is for cases like this. But you can't have it at 100% all the time; not even 1%, I think...
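
For illustration, a hypothetical sketch of that kind of "force full tracing for a short window" switch; the header name, window length, and sampling logic here are all assumptions, not any particular vendor's mechanism:

    import random
    import time

    FORCE_TRACE_HEADER = "x-debug-trace"   # assumed header name, not a standard
    FORCE_WINDOW_SECONDS = 30              # how long 100% sampling stays on

    _force_until = 0.0

    def should_sample(request_headers: dict, default_rate: float = 0.01) -> bool:
        """Decide whether to record a full trace for this request."""
        global _force_until
        if request_headers.get(FORCE_TRACE_HEADER) == "1":
            # A debugging client (e.g. a browser extension) asked for full tracing:
            # turn it on, but only for a short, bounded window.
            _force_until = time.time() + FORCE_WINDOW_SECONDS
        if time.time() < _force_until:
            return True
        # Otherwise fall back to the usual low sampling rate.
        return random.random() < default_rate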


Oh yeah, tracing can be extremely useful, precisely because it should be end to end.

As for the numbers, that's why all tracing collectors and receivers support downsampling out of the box. Recording only 1% or 10% of all traces, or 10% of successful ones and 100% of failures, is a good way to make use of tracing without overburdening storage.
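
As a concrete sketch, head-based ratio sampling with the OpenTelemetry Python SDK could look like this (the 1% rate is just an example; keeping 100% of failures is typically done tail-side in a collector, since success isn't known when the trace starts):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

    # Keep roughly 1% of new traces; child spans follow their parent's decision
    # so sampled traces stay complete end to end.
    sampler = ParentBased(root=TraceIdRatioBased(0.01))
    trace.set_tracer_provider(TracerProvider(sampler=sampler))

    tracer = trace.get_tracer("example")
    with tracer.start_as_current_span("handle_request"):
        pass  # only ~1% of requests produce exported spans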


You can sort of measure some of this with traces. For example, sampled traces that carry the sampling rate in their metadata let you re-weight counts, which allows you to accurately estimate "number of requests to x". Similarly, network latency can absolutely be measured from a good sampling of trace data. Metrics will always have their place, though, for the reasons you mention: measuring CPU utilisation, number of objects in something, etc. Logs vs. traces is more nuanced, I think. A trace is nothing more than a collection of structured logs, and I would wager that nearly all use cases for structured logging could be wholesale replaced by tracing. Security logging and big-object logging are exceptions, although that also depends on your vendor or backend.
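
A toy illustration of that re-weighting, assuming each kept span records the rate at which it was sampled (the attribute name and data shape here are made up for the example):

    # A span kept with probability p stands in for roughly 1/p real requests.
    kept_spans = [
        {"route": "/checkout", "sample_rate": 0.01},
        {"route": "/checkout", "sample_rate": 0.01},
        {"route": "/health", "sample_rate": 0.10},
    ]

    estimated_requests = {}
    for span in kept_spans:
        route = span["route"]
        estimated_requests[route] = (
            estimated_requests.get(route, 0) + 1 / span["sample_rate"]
        )

    print(estimated_requests)  # {'/checkout': 200.0, '/health': 10.0}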



