95.8% of the time it's kind of obvious what happened, at least with reasonable m...

stochastimus · on June 17, 2022

I feel that vendors have really muddied the playing field by claiming ML features and then not delivering value. That’s why we tend not to talk about ML. But I think saying that ML is unhelpful for the problem is a bit like saying programming is unhelpful for the problem: the solution space is vast and has hardly been explored.

coward123 · on June 18, 2022

FWIW: In my career doing this kind of work, I found logs to be archeologically interesting, but seldom of value. Products that claimed to do things with logs made me skeptical, because they were often like trying to do brain surgery through your foot - better to just go straight to the head. That is to say, if you could replicate the problem, it was better to watch it in real time via full-stack monitoring tools than to wade through gigabytes (terabytes?) of log data after the fact, which often had a randomness to its shape (i.e., it consisted of whatever garbage some developer thought was useful long ago and far away).

throwaway81523 · on June 18, 2022

If you knew exactly what to monitor and control, you would also have put in mitigations for the problem. Log analysis is for when something went wrong that you didn't anticipate, such as a security exploit. It's part of a post-incident investigation.

coward123 · on June 18, 2022

My point is that log analysis is noise to the signal: a poor way to discern what went wrong or to proactively monitor to avoid an incident in the first place. There are loads of tools out there, some of which have been mentioned in this thread, that monitor from network to user to app layer and are superior for triage. If someone is down in the bowels of logs, it's gonna be a bad time. I spent a decade triaging high-profile incidents around the world and teaching organizations how to do this stuff.

throwaway81523 · on June 18, 2022

The logs are what you have. It's like the investigation after a plane crash, where you have some black boxes, some radar images, the observed distribution of wreckage, whatever. You probably don't have all the data you would like, but you use whatever you can get your hands on.

Better tools for analyzing logs are fine, but the idea of some ML tool that you throw random logs through and have it automatically identify significant events seems like a pipe dream.

stochastimus · on June 18, 2022

This is all true. So for the times you end up there, wouldn’t you prefer a tool to surface for you the things you were going to have to spend hours digging for?