My understanding is no: they collect a cache of documents from the training set, then, after pre-training, prompt the model about those topics. A separate verifier is given both the relevant source documents and the generated response, and is tasked with checking for factual conflicts.
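Roughly, the loop I'm describing looks something like this. This is just a sketch of my reading of the setup; the function names, prompt format, and boolean verdict are hypothetical stand-ins, not the paper's actual API:

```python
def trainee_generate(prompt: str) -> str:
    """Hypothetical placeholder: sample a response from the model under training."""
    raise NotImplementedError

def verifier_judge(source_docs: list[str], response: str) -> bool:
    """Hypothetical placeholder: the verifier model checks the response
    against the source documents and returns True if it finds no
    factual conflicts."""
    raise NotImplementedError

def score_factuality(doc_cache: dict[str, list[str]]) -> dict[str, bool]:
    """For each cached topic, prompt the trainee about it, then have the
    verifier check the response against that topic's source documents."""
    results = {}
    for topic, docs in doc_cache.items():
        response = trainee_generate(f"Tell me about {topic}.")
        results[topic] = verifier_judge(docs, response)
    return results
```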
They describe using Qwen 32B as the verifier, while the model under training is Qwen 8B. So the verifier is in fact beefier than the trainee model, though it's unclear whether that has to hold as you scale up.