
> Authors Charity Majors, Liz Fong-Jones, and George Miranda from Honeycomb explain what constitutes good observability, show you how to improve upon what you're doing today, and provide practical dos and don'ts for migrating from legacy tooling, such as metrics, monitoring, and log management. You'll also learn the impact observability has on organizational culture (and vice versa).

No wonder: it's either strong bias from people working at a tracing vendor, or an outright sales pitch.

It's totally false, though. Each pillar (metrics, logs, and traces) has its place and serves a different purpose. You won't use traces to measure the number of requests hitting your load balancer, the number of objects in an async queue, CPU utilisation, network latency, or any number of other things. Logs can also be richer than traces, and a nice pattern I've used with Grafana is linking the two, so you can jump from a trace to the corresponding log lines, which describe the different actions performed during that span.
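
A minimal sketch of that linking pattern, assuming an OpenTelemetry-instrumented Python service whose structured logs carry the current trace and span IDs (the field names and "Grafana Tempo + Loki" backend are just illustrative assumptions):

    import logging
    from opentelemetry import trace

    logger = logging.getLogger("app")
    tracer = trace.get_tracer("example")

    def handle_job(job_id: str) -> None:
        with tracer.start_as_current_span("handle_job") as span:
            ctx = span.get_span_context()
            # Emit the trace/span IDs alongside the log line so a backend such as
            # Grafana (Tempo + Loki) can jump from a span to its log lines.
            logger.info(
                "processing job",
                extra={
                    "job_id": job_id,
                    "trace_id": format(ctx.trace_id, "032x"),
                    "span_id": format(ctx.span_id, "016x"),
                },
            )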



While I was at Google, circa 2015-2016, working on some Ad project, I happened to be on call when our system started doing something wonky. Following our playbook, I called the SRE for the sub-system we were using (Spanner? something else, I don't remember) to check what was up.

They asked me to enable tracing for 30 seconds (we had a Chrome extension that sent a common URL parameter, which turned on full (100%) tracing in our web server for a short amount of time), and then I performed the operations our internal customers were complaining about.

This produced quite a hefty amount of trace data, but only for 30 seconds, which was enough for them to trace back where the issue might be coming from. It was basically end to end: from me doing something in the browser, down to our server/backend, down to their systems, and so on.

That's how I learned how important tracing is for cases like this. But you can't have it at 100% all the time; not even 1%, I think...
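
For illustration, a hypothetical sketch of that kind of "force full tracing for a short window" switch; the header name, window length, and sampling logic here are all assumptions, not any particular vendor's mechanism:

    import random
    import time

    FORCE_TRACE_HEADER = "x-debug-trace"   # assumed header name, not a standard
    FORCE_WINDOW_SECONDS = 30              # how long 100% sampling stays on

    _force_until = 0.0

    def should_sample(request_headers: dict, default_rate: float = 0.01) -> bool:
        """Decide whether to record a full trace for this request."""
        global _force_until
        if request_headers.get(FORCE_TRACE_HEADER) == "1":
            # A debugging client (e.g. a browser extension) asked for full tracing:
            # turn it on, but only for a short, bounded window.
            _force_until = time.time() + FORCE_WINDOW_SECONDS
        if time.time() < _force_until:
            return True
        # Otherwise fall back to the usual low sampling rate.
        return random.random() < default_rate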


Oh yeah, tracing can be extremely useful, precisely because it should be end to end.

As for the numbers, that's why all tracing collectors and receivers support downsampling out of the box. Recording only 1% or 10% of all traces, or 10% of successful ones and 100% of failures, is a good way to make use of tracing without overburdening storage.
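
As a concrete sketch, head-based ratio sampling with the OpenTelemetry Python SDK could look like this (the 1% rate is just an example; keeping 100% of failures is typically done tail-side in a collector, since success isn't known when the trace starts):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

    # Keep roughly 1% of new traces; child spans follow their parent's decision
    # so sampled traces stay complete end to end.
    sampler = ParentBased(root=TraceIdRatioBased(0.01))
    trace.set_tracer_provider(TracerProvider(sampler=sampler))

    tracer = trace.get_tracer("example")
    with tracer.start_as_current_span("handle_request"):
        pass  # only ~1% of requests produce exported spans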


You can sort of measure some of this with traces. For example, sampled traces that carry the sampling rate in their metadata let you re-weight counts, which allows you to accurately estimate "number of requests to x". Similarly, network latency can absolutely be measured from a good sampling of trace data. Metrics will always have their place, though, for the reasons you mention: measuring CPU utilisation, number of objects in something, etc. Logs vs. traces is more nuanced, I think. A trace is nothing more than a collection of structured logs, and I would wager that nearly all use cases for structured logging could be wholesale replaced by tracing. Security logging and big-object logging are exceptions, although that also depends on your vendor or backend.
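
A toy illustration of that re-weighting, assuming each kept span records the rate at which it was sampled (the attribute name and data shape here are made up for the example):

    # A span kept with probability p stands in for roughly 1/p real requests.
    kept_spans = [
        {"route": "/checkout", "sample_rate": 0.01},
        {"route": "/checkout", "sample_rate": 0.01},
        {"route": "/health", "sample_rate": 0.10},
    ]

    estimated_requests = {}
    for span in kept_spans:
        route = span["route"]
        estimated_requests[route] = (
            estimated_requests.get(route, 0) + 1 / span["sample_rate"]
        )

    print(estimated_requests)  # {'/checkout': 200.0, '/health': 10.0}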



