Sounds like the situation in the third paragraph could at least be eased by switching every service to structured logging and pointing them all at a local aggregator.
Moving everything to distributed tracing would be even better, but that's a larger investment, since the tracing metadata then has to be unified across the board to track requests properly.
It would likely help ops more than dev, but it should help nonetheless: even just getting proper span information can provide a lot of insight into a clusterfuck.
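To make that concrete, here is a rough sketch of the cheaper half of the idea: every service emits JSON log lines that carry a shared trace id, so a local aggregator can stitch one request back together. Everything named here (`trace_id_var`, `JsonFormatter`, `make_logger`, `handle_request`, the field names) is made up for illustration, and it only uses the Python standard library rather than any particular tracing product.

```python
import contextvars
import io
import json
import logging
import uuid

# Hypothetical context variable holding the trace id of the request in flight.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        return json.dumps({
            "ts": record.created,
            "level": record.levelname,
            "service": self.service,
            "trace_id": trace_id_var.get(),
            "msg": record.getMessage(),
        })


def make_logger(service, stream):
    """Build a logger for one service that writes JSON lines to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    return logger


def handle_request(logger, incoming_trace_id=None):
    """Reuse the caller's trace id if one was passed along, else start a new one."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("handling request")
        return trace_id
    finally:
        trace_id_var.reset(token)
```

In a real setup the stream would be stdout or a socket to the aggregator, and the trace id would travel between services in a request header. The "unified across the board" cost is exactly that every service has to agree on that header and those field names.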
> A breakpoint on one end is a timeout on the other.
No dev mode to increase or disable timeouts?
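Something like the following is what I have in mind, as a sketch only: the environment variable names (`APP_DEV_MODE`, `APP_DEV_TIMEOUT_S`) and the 5 second default are invented, and I'm assuming the services read their timeout from one place.

```python
import os

# Assumed production default, in seconds.
DEFAULT_TIMEOUT_S = 5.0


def effective_timeout(default=DEFAULT_TIMEOUT_S, env=os.environ):
    """Return the timeout in seconds, or None to mean "no timeout".

    In dev mode the timeout is disabled unless an explicit, usually much
    longer, value is supplied.
    """
    if env.get("APP_DEV_MODE") == "1":
        raw = env.get("APP_DEV_TIMEOUT_S")
        return float(raw) if raw else None
    return default
```

`None` is what `socket.settimeout` takes to mean "block indefinitely", so the result can be passed straight through, and a breakpoint on one end no longer trips the other.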