
The only difference I see from XLNet is how they use it during inference.


Hey! I'm Arnaud, first author of the paper. XLNet also shuffles the data during training, but it uses a masking mechanism instead of the causal attention + double positional encoding we use. The application also differs: XLNet is not, AFAIK, focused on generation (even if it can be used for that), and the burst-sampling idea is new.
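
To make the double positional encoding more concrete, here is a minimal sketch of the shuffled-order training setup (illustrative names, not the exact code from the paper): each input token gets an embedding for its own position plus an embedding for the position of the token it has to predict next in the shuffled order.

    # Minimal sketch of shuffled-order training with double positional encoding
    # (illustrative names, not the paper's exact code).
    import torch
    import torch.nn as nn

    class DoublePosEmbedding(nn.Module):
        def __init__(self, vocab_size, max_len, d_model):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d_model)
            self.pos_here = nn.Embedding(max_len, d_model)  # position of the token we condition on
            self.pos_next = nn.Embedding(max_len, d_model)  # position of the token to predict next

        def forward(self, tokens, order):
            # tokens: (batch, seq_len) token ids; order: (batch, seq_len) random permutation
            x = tokens.gather(1, order)         # tokens rearranged into the shuffled order
            here = order                        # where each conditioning token sits
            nxt = order.roll(-1, dims=1)        # where the token to be predicted sits
            # the last prediction wraps around and is usually dropped from the loss
            return self.tok(x) + self.pos_here(here) + self.pos_next(nxt)

A standard causal decoder then runs over these embeddings, so at each step it predicts the token living at the "next" position rather than at a fixed left-to-right offset.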


Are there any obvious practical applications of this algorithm for existing large (10B+) text / image models?

Does the rejection sampling lead to a statistically correct sample from the joint probability distribution, or is that just a (possibly rough) approximation?


For the application: being able to prompt anywhere in the sequence can be of interest. From what we've seen in the experiments, the rejection sampling leads to generations similar to the autoregressive ones; we did not see any mode collapse or anything of that kind.
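
For intuition on the correctness question, the accept/reject step can be written in the same style as the verification step of speculative decoding (a sketch under that assumption; the exact acceptance rule in the paper may differ): candidates are drafted in parallel given the current context, re-scored with earlier accepted candidates included, and kept with probability min(1, p_rescored / p_draft), stopping at the first rejection.

    # Sketch of the accept/reject step for burst sampling, in the style of
    # speculative decoding (illustrative, not necessarily the paper's exact rule).
    import torch

    def accept_prefix(candidates, p_draft, p_rescored):
        # candidates: drafted tokens in the order they would be committed
        # p_draft[i]: probability of candidates[i] when drafted in parallel
        # p_rescored[i]: probability of candidates[i] when re-evaluated with
        #                candidates[:i] already placed in the context
        kept = []
        for tok, q, p in zip(candidates, p_draft, p_rescored):
            # keep with probability min(1, p / q); stop at the first rejection
            if torch.rand(()).item() < min(1.0, float(p) / float(q)):
                kept.append(tok)
            else:
                break
        return kept  # accepted tokens; the remaining positions are redrafted next round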


Thanks for the clarification!



