
The only difference I see from XLNet is how they use it during inference.


Hey! I'm Arnaud, first author of the paper. XLNet also shuffles the data during training, but it uses a masking mechanism instead of the causal attention + double positional encoding we use. The application also differs: XLNet is not, AFAIK, focused on generation (even if it can be used for that), and the burst-sampling idea is new.
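
To make the double positional encoding more concrete, here is a minimal sketch of the shuffled-order training setup (illustrative names, not the exact code from the paper): each input token gets an embedding for its own position plus an embedding for the position of the token it has to predict next in the shuffled order.

    # Minimal sketch of shuffled-order training with double positional encoding
    # (illustrative names, not the paper's exact code).
    import torch
    import torch.nn as nn

    class DoublePosEmbedding(nn.Module):
        def __init__(self, vocab_size, max_len, d_model):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d_model)
            self.pos_here = nn.Embedding(max_len, d_model)  # position of the token we condition on
            self.pos_next = nn.Embedding(max_len, d_model)  # position of the token to predict next

        def forward(self, tokens, order):
            # tokens: (batch, seq_len) token ids; order: (batch, seq_len) random permutation
            x = tokens.gather(1, order)         # tokens rearranged into the shuffled order
            here = order                        # where each conditioning token sits
            nxt = order.roll(-1, dims=1)        # where the token to be predicted sits
            # the last prediction wraps around and is usually dropped from the loss
            return self.tok(x) + self.pos_here(here) + self.pos_next(nxt)

A standard causal decoder then runs over these embeddings, so at each step it predicts the token living at the "next" position rather than at a fixed left-to-right offset.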


Are there any obvious practical applications of this algorithm for existing large (10B+) text / image models?

Does the rejection sampling lead to a statistically correct sample from the joint probability distribution, or is that just a (possibly rough) approximation?


For the application: being able to prompt anywhere in the sequence can be of interest. From what we've seen in the experiments, the rejection sampling leads to generations similar to the autoregressive ones; we did not see any mode collapse or anything of that kind.
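
For intuition on the correctness question, the accept/reject step can be written in the same style as the verification step of speculative decoding (a sketch under that assumption; the exact acceptance rule in the paper may differ): candidates are drafted in parallel given the current context, re-scored with earlier accepted candidates included, and kept with probability min(1, p_rescored / p_draft), stopping at the first rejection.

    # Sketch of the accept/reject step for burst sampling, in the style of
    # speculative decoding (illustrative, not necessarily the paper's exact rule).
    import torch

    def accept_prefix(candidates, p_draft, p_rescored):
        # candidates: drafted tokens in the order they would be committed
        # p_draft[i]: probability of candidates[i] when drafted in parallel
        # p_rescored[i]: probability of candidates[i] when re-evaluated with
        #                candidates[:i] already placed in the context
        kept = []
        for tok, q, p in zip(candidates, p_draft, p_rescored):
            # keep with probability min(1, p / q); stop at the first rejection
            if torch.rand(()).item() < min(1.0, float(p) / float(q)):
                kept.append(tok)
            else:
                break
        return kept  # accepted tokens; the remaining positions are redrafted next round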


Thanks for the clarification!



