That sounds like something I could get an LLM to do. And then of course I can do it iteratively until all the code has been laundered. Maybe that's how Microsoft can justify training on all the GitHub data.
IANAL, but my understanding of copyright jurisprudence is that using an LLM to automate the process would substantially increase the likelihood that you'd be found to be infringing.
It’s entirely possible that the model and all of its outputs would be deemed derivative works of the training inputs. If that happens, then, oh boy, not good things for anyone using it, I’m sure.