The model is massive and heavy. I have a hard time seeing this used in real-time...

The model is massive and heavy. I have a hard time seeing this used in real-time. But it's so flexible and accurate it's an amazing teacher for lean CNNs; that's where the real value lies.

I don't even care about the numbers; a visual transformer encoder with output that is too heavy for many edge compute CNNs to use as input isn't gonna cut it.