Downtime is expensive. You could just bypass your infra and manually get it working so that you can fix your infra while production is up instead of when it's down.
That's in fact how most high-impact events should be handled: mitigate the issue with a potentially short-term solution; then, once things are back up, find the root cause, fix it, and perform a thorough analysis of events to ensure it won't happen again.
Depending on the level of automation, that may not be possible. That’s like saying that if a factory-line robot fails, “you just bypass the line and manually weld those car bodies.”