My guess is that they do not want to serve those requests synchronously. They just put the request in a queue and have some workers fulfilling them at their own pace, without worrying about HTTP timeouts. It sucks for the user of the API, because handling that involves a lot of additional complexity.
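Roughly the pattern being described, as a minimal Python sketch (the names and the in-process queue are my own invention; a real system would use a durable queue and a poll/callback mechanism):

```python
import queue
import threading
import uuid

jobs = queue.Queue()
results = {}

def handle_request(payload):
    """Called by the web layer; enqueues the work and returns a job id
    immediately (e.g. with HTTP 202), instead of blocking on the result."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, payload))
    return job_id

def worker():
    # Drains the queue at its own pace; no HTTP connection waits on this.
    while True:
        job_id, payload = jobs.get()
        results[job_id] = payload.upper()  # stand-in for the real work
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

job = handle_request("hello")
jobs.join()  # here only to make the example deterministic; a real client polls
print(results[job])  # HELLO
```

The extra complexity for the API consumer is visible even here: instead of one request/response, the client has to keep the job id and come back for the result.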
Right. I think when you're dealing with massive traffic and you want to create a highly scalable API, this is one technique. But then, if everything is hitting a queue, why have an application-level rate limit? Adding items to the queue costs essentially nothing, and you get to it when you get to it. If you think an app is abusing your API, then change the rate at which you process the queue for that app, or, you know, reach out to the app developer and ask them to stop.
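The "change the rate you process the queue for that app" idea could look something like this sketch (the app names and rates are invented for illustration):

```python
import collections
import time

DEFAULT_RATE = 100.0   # jobs per second for well-behaved apps
THROTTLED_RATE = 10.0  # reduced rate for an app suspected of abuse

app_rates = collections.defaultdict(lambda: DEFAULT_RATE)
app_rates["abusive-app"] = THROTTLED_RATE

processed = []

def process(job):
    processed.append(job)  # stand-in for the real work

def drain(app_id, app_queue):
    """Drain one app's queue, pacing dequeues to that app's rate.
    The only 'rate limit' the client ever observes is extra latency."""
    interval = 1.0 / app_rates[app_id]
    while app_queue:
        process(app_queue.pop(0))
        time.sleep(interval)

drain("normal-app", ["a", "b", "c"])
drain("abusive-app", ["x", "y"])
print(processed)  # ['a', 'b', 'c', 'x', 'y']
```

Everything still gets accepted and eventually processed; abuse just slows the abuser down rather than producing 429s.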
The queue has to be stored somewhere, and that takes memory, disk, or both. That capacity is finite and prone to spikes (e.g. from another service suddenly waking up and calling you).
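That finiteness is easy to demonstrate with a bounded queue (the `maxsize` and spike size here are arbitrary):

```python
import queue

jobs = queue.Queue(maxsize=3)  # the backing store is not unlimited
accepted, rejected = 0, 0

for i in range(10):  # a sudden burst of 10 requests
    try:
        jobs.put_nowait(i)
        accepted += 1
    except queue.Full:
        rejected += 1  # this is where a 429 / backpressure response comes from

print(accepted, rejected)  # 3 7
```

So even a queue-everything design needs some admission control once the burst outruns the buffer.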
That's true. And really, of all the things, the app-level rate limit is perhaps the least worth mentioning. Not wanting to get into the weeds, what I didn't say is that there's no way of querying the extent of the data for a given user. So the approach recommended by the support team is to always request all data, which easily maxes out the rate limit. Feedback that this lack of transparency is a problem for consumer and provider alike fell on deaf ears.
The author's comment that they "might call you back in an hour" suggests that keeping a connection open for that long is impractical. I don't doubt there are APIs that misuse this pattern, but in some places it is the right way to go, even though it might not be obvious to the user.
How...? Why?