One thing I don't really like about Prometheus is it seems to prefer the pull aka scraping model over the push model.
I think the push model is better in terms of security and discovery (which is how I think most of the other metric aggregators work).
I don't even like log scraping. I just push the log data through Kafka or RabbitMQ and have something else pick it up.
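A rough sketch of what I mean, assuming kafka-python as the client; the topic name and record schema here are made up for illustration:

```python
import json
import time

def encode_log_event(service, level, message):
    """Serialize a log event as JSON bytes for the bus (schema is illustrative)."""
    record = {
        "ts": time.time(),
        "service": service,
        "level": level,
        "message": message,
    }
    return json.dumps(record).encode("utf-8")

# With a real broker you would push via kafka-python, e.g.:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9092")
# producer.send("app-logs", encode_log_event("api", "ERROR", "upstream timeout"))
```

The consumer on the other side is then free to fan the stream out to storage, alerting, whatever.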
I do like how Prometheus has a dimensional model instead of just raw timeseries.
Speaking of which I still haven't found an effective way of merging or correlating metric data with log data (particularly since it is two different systems).
I sort of made some experimental headway with Druid since it kind of has generic event and metric support. This was only possible because the events are being pushed and not pulled (i.e. pushing to a bus allows for syndication).
The #1 reason Prometheus went with pull versus push is that it's fundamentally impossible to overload a pull-based system. If you have a push-based system, there's a hard problem of knowing whether you're getting all the messages or if system load or outages at scale are causing you to drop some metrics on the floor (as was the case at SoundCloud which drove them to start Prometheus).
I'd recommend listening to their recent CNCF online meetup [1] for some more of the background on push/pull and why they made the choices they did.
I will for sure check it out, but I have found the opposite to be true (well, not the opposite but a different problem). With pull I lose invaluable data because either the pulling system is down or it just doesn't know about all the nodes. Of course the converse could be said: what happens if nodes just choose not to push? Consequently I made the nodes smarter. If they can't push, they crash.
I guess it is an architecture choice, but honestly if our nodes can't push to the bus it is a fatal state for us (this bus carries more than just droppable metrics; it is critical business traffic).
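The "crash if you can't push" behavior is simple to sketch. `publish` here is a hypothetical callable wrapping whatever real client send you use (Kafka, RabbitMQ, etc.), not a real library API:

```python
import sys

def push_or_die(publish, payload, attempts=3):
    """Try to publish to the bus; treat persistent failure as fatal.

    `publish` is any callable that raises on failure (e.g. a wrapped
    Kafka/RabbitMQ send). Hypothetical helper, not a real client API.
    """
    for _ in range(attempts):
        try:
            publish(payload)
            return True
        except Exception:
            continue
    # Can't reach the bus: crash loudly rather than silently run blind.
    sys.exit("bus unreachable; refusing to run blind")
```

A supervisor (systemd, Kubernetes) then restarts the node, so the failure is visible instead of silently dropping data.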
I think if you have a good enough bus like Kafka or RabbitMQ you can create a big enough buffer to prevent absolute chaos of overload (and now you just monitor queue size). If you see overload happening (ie massive queue) you can selectively drop messages (particularly with RabbitMQ as it has routing). But you are absolutely right that bad stuff can happen.
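The selective-drop idea can be sketched as a bounded buffer that sheds droppable messages first; this is an illustrative in-process model, not a real Kafka/RabbitMQ API:

```python
from collections import deque

class DropOldestBuffer:
    """Bounded buffer sketch: monitor depth, and under overload shed the
    oldest droppable message first. Critical messages are never evicted
    (so the buffer may temporarily exceed capacity rather than lose them)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()  # entries are (message, droppable)

    def offer(self, message, droppable=True):
        if len(self.queue) >= self.capacity:
            # Overloaded: evict the oldest droppable entry if one exists.
            for i, (_, d) in enumerate(self.queue):
                if d:
                    del self.queue[i]
                    break
            else:
                if droppable:
                    return False  # nothing evictable; drop the newcomer
        self.queue.append((message, droppable))
        return True

    def depth(self):
        """Queue size -- the thing you'd alert on to spot overload early."""
        return len(self.queue)
```

With RabbitMQ you'd get roughly the same effect declaratively via per-queue routing and TTL/length limits instead of application code.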
> *One thing I don't really like about Prometheus is it seems to prefer the pull aka scraping model over the push model.*
This is also closely related to one of my (well, our) wishlist items for any TS system. Quoting from the README.md:
> allows a scraper configured for a high-frequency 15s resolution
The above extract feels like a crystallisation of cognitive dissonance.
For us, and especially our engineers, a 15-second resolution is just barely acceptable. Our senior engineers would like 2-second granularity for most metrics, and would generally accept 5 seconds as a reasonable compromise between storage requirements and data accuracy. (We have 10-15 second resolution now, and all too frequently find it annoyingly coarse.)
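For what it's worth, scrape resolution in Prometheus is per-job configuration, so finer granularity can be limited to the hot metrics. A sketch (job names and targets are made up):

```yaml
# prometheus.yml fragment -- job names and targets are illustrative
global:
  scrape_interval: 15s          # default resolution
scrape_configs:
  - job_name: latency-critical
    scrape_interval: 2s         # finer granularity where it matters
    static_configs:
      - targets: ['app1:9100', 'app2:9100']
```

The trade-off is storage and scrape load, which is presumably why 15s is the commonly quoted default.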
> I think the push model is better in terms of security and discovery (which is how I think most of the other metric aggregators work).
Interesting, I think the pull model is better at least in terms of discovery, and you need less configuration overall for a pull-based monitoring system, as now you need to only configure the monitoring system to know about your services, and the service instances don't need to know: a) who they are, b) where the monitoring system lives. In a push-based world, you need to configure both sides.
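To make that concrete: in the pull model all the topology knowledge can live in one place on the monitoring side, and the instances only have to expose a metrics endpoint. A sketch with made-up hostnames:

```yaml
# prometheus.yml sketch: discovery lives entirely on the monitoring side;
# service instances just expose /metrics and know nothing about Prometheus.
scrape_configs:
  - job_name: api
    static_configs:
      - targets: ['api1:8080', 'api2:8080']   # illustrative hostnames
```

Swap `static_configs` for one of the service-discovery mechanisms and even this list goes away; the push model has no equivalent single point of configuration.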
It's specifically a TSDB built for metrics and instrumentation. Basically it's intended to be Google's Borgmon for everybody else, in the same way Kubernetes is essentially Borg for everybody else.
It is more than a TSDB because it has dimensions. Most metric systems (e.g. Graphite) are just: <time> <name> <number> <maybe unit>.
It is sort of nice to have dimensions for rollup/filtering instead of doing weird name mangling like Graphite does. For example one might add a dimension of build # or release version to each metric.
With a dimension like build # you could even effectively tell whether your releases are improving performance-wise, and graph it. TBH I haven't done it yet, but I think it would be useful instead of just doing it manually with time filtering.
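A toy sketch of the kind of rollup I mean, with plain dicts standing in for a real TSDB; the metric and label names are made up:

```python
# Dimensional samples: the build is a label, not mangled into the metric name
# (contrast Graphite-style names like request_latency.build_1041).
samples = [
    {"name": "request_latency_ms", "labels": {"build": "1041"}, "value": 120},
    {"name": "request_latency_ms", "labels": {"build": "1041"}, "value": 140},
    {"name": "request_latency_ms", "labels": {"build": "1042"}, "value": 90},
]

def mean_by_label(samples, name, label):
    """Roll up one metric by one dimension, e.g. average latency per build."""
    buckets = {}
    for s in samples:
        if s["name"] != name:
            continue
        buckets.setdefault(s["labels"][label], []).append(s["value"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}

# mean_by_label(samples, "request_latency_ms", "build")
# -> {"1041": 130.0, "1042": 90.0}
```

In Prometheus this is just an `avg by (build)` query; the point is the dimension makes "did the new build get faster?" a one-liner.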
We do this with telegraf + influxdb. It has support for tags in measurements where we add the release version among other metadata which we can then visualize in grafana with a simple group by in the query.
You can also shove in multiple related metrics as part of the same measurement.
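Under the hood this ends up as InfluxDB line protocol: one measurement, tags (the indexed metadata like release version), and multiple fields per point. A simplified builder, with made-up names and no escaping of special characters:

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Build an InfluxDB line-protocol string: measurement, comma-joined
    tags (indexed metadata like release version), then fields, then a
    nanosecond timestamp. Escaping of spaces/commas is omitted here."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

# to_line_protocol("http", {"release": "v1.4.2", "host": "web1"},
#                  {"latency_ms": 42, "requests": 310}, 1465839830100400200)
# -> 'http,host=web1,release=v1.4.2 latency_ms=42,requests=310 1465839830100400200'
```

Telegraf emits this for you, so in practice you only ever add the tag in its config and then `GROUP BY` it in Grafana.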
Kinda don't get why DigitalOcean "forked" rather than solving the long-term retention problem by working with upstream, especially given the number of comments which state "Development in this area should be done in Prometheus first then merged into Vulcan".
One of the devs here: we are actively working with the Prometheus devs. We are all at Prometheus Conf this week. Upstream Prometheus is unlikely to ever have a distributed backend; however, it will support projects like this.
The Prometheus core team has been pretty clear on their stance about some of the things that Vulcan is focusing on (things like long-term storage, for example), so it didn't make sense to fight with them to make them change their mind ;)
From looking at their architecture diagram it seems that this is different enough from Prometheus's single golang binary to "justify" a fork.
I haven't looked close enough, but hopefully they've done so in a way that common things (query functions, alertmanager integration, etc) can be kept in sync with upstream easily.