They train the model on only five frames, and then use it to detect all nearby frames. They say that with five frames they are able to get 500 new labelled frames.
This means 100 new frames per original frame. Because movies run at 24 frames per second, each original frame therefore gives enough information to analyse roughly 4 seconds of video on average.
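A quick sanity check on those numbers, assuming the post's figures of 5 seed frames, 500 recovered frames, and standard 24 fps film:

```python
seed_frames = 5
recovered_frames = 500
fps = 24

frames_per_seed = recovered_frames / seed_frames  # 100 new frames per seed
seconds_per_seed = frames_per_seed / fps          # ~4.17 s of video per seed frame

print(frames_per_seed, round(seconds_per_seed, 2))
```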
As the clips in the post show, the short clips do indeed portray Batman in very similar positions, with similar shading, lighting, etc. The micro model is able to detect Batman because the frames are all very similar to one another. It is very likely that this micro model, as is, wouldn't be able to detect Batman in a completely different scene of the movie.
So the model is indeed overfitted, meaning that it is able to detect Batman only in a very specific slice of the data. Of course, overfitting can happen to different degrees. They do not overfit to the point that the model would detect only the five original frames; the overfitting stops while there is still scope for capturing new data with the overfitted model.
The smart idea of the authors is then to use these micro models to generate a lot of labelled data and "stitch" the micro models together, so that they end up with a much larger dataset to train on, and a much more general model.
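The "micro model labels nearby frames" idea can be sketched with a toy stand-in: this is not the post's actual pipeline, just a minimal nearest-neighbour detector deliberately fit to a handful of seed vectors (pretend each vector is a frame embedding), which then labels every frame of a synthetic "video" that sits close to a seed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: five hand-labelled "seed frames" (feature vectors), then a
# video containing 60 near-duplicates of the seeds (same scene, slightly
# different shading) and 140 unrelated frames far away in feature space.
seeds = rng.normal(size=(5, 16))
similar = seeds[rng.integers(0, 5, size=60)] + 0.05 * rng.normal(size=(60, 16))
unrelated = rng.normal(size=(140, 16)) + 5.0
video = np.vstack([similar, unrelated])

def micro_model_label(frames, seeds, threshold=1.0):
    """Label a frame positive if it lies close to any seed frame --
    a 'micro model' that is intentionally specific to the seed scene."""
    dists = np.linalg.norm(frames[:, None, :] - seeds[None, :, :], axis=-1)
    return dists.min(axis=1) < threshold

labels = micro_model_label(video, seeds)
print(int(labels.sum()))  # only the near-duplicate frames get labelled
```

Running several such micro models on different scenes and pooling their positive frames is the "stitching" step: the pooled, much larger labelled set is what the final, more general model trains on.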
I agree that is what the post describes, and that it could be a useful process. I don't think "overfitting" describes that process, though. Overfitting means increasing performance on the training set to the point that the model performs worse on the data it is actually used on.
If overfitting were happening here, it wouldn't be beneficial. There is no reason to prefer a model that is better on the training set if you are going to use it to collect Batman images across a film. It would be better if your model weren't overfit: if it performed better on the rest of the footage, it would collect more images.