You actually don't need that. You only need a set of real videos and a generator that produces fake ones. Then train the discriminator to tell the two classes apart, and exploit its differentiability to backpropagate through it and update the generator in tandem.
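To make that concrete, here's a toy sketch of the mechanism with a one-parameter generator and a logistic discriminator on scalar data instead of videos. Everything here (the data distribution, the learning rates, the batch size) is made up for illustration; the point is just the two alternating steps, with the generator's gradient flowing through the discriminator:

```python
import math
import random

def train_toy_gan(steps=3000, lr=0.02, batch=64, seed=0):
    """Train a 1-parameter generator against a logistic discriminator.
    'Real' data ~ N(3, 1); fake samples are theta + z, z ~ N(0, 1).
    Returns theta, which should drift toward the real mean (3.0)."""
    rng = random.Random(seed)
    theta = -2.0        # generator parameter (starts far from the real mean)
    w, b = 0.0, 0.0     # discriminator: D(x) = sigmoid(w*x + b)

    def sigmoid(s):
        return 1.0 / (1.0 + math.exp(-s))

    for _ in range(steps):
        # --- discriminator step: push D(real) -> 1, D(fake) -> 0 ---
        gw = gb = 0.0
        for _ in range(batch):
            x_real = 3.0 + rng.gauss(0, 1)
            x_fake = theta + rng.gauss(0, 1)
            ds_real = sigmoid(w * x_real + b) - 1.0  # dBCE/ds, real sample
            ds_fake = sigmoid(w * x_fake + b)        # dBCE/ds, fake sample
            gw += ds_real * x_real + ds_fake * x_fake
            gb += ds_real + ds_fake
        w -= lr * gw / batch
        b -= lr * gb / batch

        # --- generator step: push D(fake) -> 1 (non-saturating loss) ---
        # The gradient reaches theta only *through* the discriminator:
        # d(-log D)/dtheta = (sigmoid(s) - 1) * w, since ds/dtheta = w.
        gt = 0.0
        for _ in range(batch):
            x_fake = theta + rng.gauss(0, 1)
            gt += (sigmoid(w * x_fake + b) - 1.0) * w
        theta -= lr * gt / batch
    return theta
```

A real setup swaps the scalar generator for a video model and the logistic regression for a deep classifier, but the alternation is the same: the discriminator is both the loss function and the gradient conduit for the generator.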
Not if you have a balanced dataset. If it's imbalanced, you're already running into the kind of training instability GANs are notorious for, which is part of why diffusion models superseded them on image generation tasks.