Would you use an AI text generator to write a thesis? No: there's a risk that a whole chunk of it will be considered plagiarism, because you have no idea what the source of the AI's output is, yet you know it was trained on unknown copyrighted material. This has nothing to do with the way humans learn; it's about correct attribution.
There is no technical reason Microsoft couldn't make Copilot respect licenses. But that would mean more work and less training input, so instead they launder code and excuse it with comparisons to human learning, because making AI seem more advanced than it is has always worked well in marketing.
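To make that concrete, here is a minimal sketch of the kind of license filter that could be run over a training corpus before training. Everything here is my own assumption for illustration - the record format, the `spdx` field, and the allowlist - not anything Microsoft has described:

```python
# Hypothetical pre-training filter: keep only files whose repository
# license is on a permissive allowlist. The record format is assumed.
ALLOWED = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "Unlicense"}

def filter_training_set(files):
    """files: iterable of dicts like {"path": ..., "spdx": ..., "text": ...}"""
    for f in files:
        if f.get("spdx") in ALLOWED:
            yield f  # keep permissively licensed code
        # everything else (copyleft, proprietary, unknown) is excluded

corpus = [
    {"path": "a.py", "spdx": "MIT", "text": "..."},
    {"path": "b.py", "spdx": "GPL-3.0-only", "text": "..."},
]
print([f["path"] for f in filter_training_set(corpus)])  # ['a.py']
```

Even this naive version shows the point: excluding incompatible licenses is a data-pipeline decision, not a hard research problem - it just shrinks the training set.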
Edit: And where do you draw the line between "learning" and copying? I can train a network to reproduce licensed code (or books, or movies) exactly, just as a human can memorize it given enough time - and both would be considered a copyright violation if used without correct attribution. If you train an AI model on copyrighted data, you get copyrighted results with random variation, which might be enough to make them unrecognizable if you're lucky.
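To illustrate the memorization point, here is a toy sketch (assuming PyTorch is installed) that overfits a tiny character-level model until it regurgitates its training text verbatim. The snippet, model, and hyperparameters are all made up for the demonstration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
SNIPPET = "def secret_sauce(x): return x * 0x5f3759df"  # stand-in for licensed code
chars = sorted(set(SNIPPET))
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}

# Inputs are characters, targets are the next character.
xs = torch.tensor([stoi[c] for c in SNIPPET[:-1]])
ys = torch.tensor([stoi[c] for c in SNIPPET[1:]])

class TinyLM(nn.Module):
    def __init__(self, vocab, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x).unsqueeze(0))
        return self.out(h).squeeze(0)

model = TinyLM(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(1000):  # keep training until the loss is ~0: pure memorization
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(xs), ys)
    loss.backward()
    opt.step()

# Greedy decoding now regurgitates the training text verbatim.
idx = torch.tensor([stoi[SNIPPET[0]]])
out = [SNIPPET[0]]
for _ in range(len(SNIPPET) - 1):
    logits = model(idx)
    idx = torch.cat([idx, logits[-1].argmax().unsqueeze(0)])
    out.append(itos[idx[-1].item()])
print("".join(out) == SNIPPET)  # True once memorized
```

Scale that up and the only thing standing between a model and verbatim reproduction is how often a given passage appears in the training data and how hard the decoder samples away from the argmax.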
> Would you use an AI text generator to write a thesis? No: there's a risk that a whole chunk of it will be considered plagiarism, because you have no idea what the source of the AI's output is, yet you know it was trained on unknown copyrighted material.
Of course, but that's a separate issue. We're not talking about whether the AI's output infringes copyright. We're talking about whether it's OK for it to learn from copyrighted material in the first place.
Again, you can say exactly the same about humans. I am perfectly capable of plagiarising or reproducing copyrighted material. That doesn't mean it's illegal for me to learn from that material, only to output it verbatim.
So the fundamental issue is that it's harder to tell when an AI is plagiarising than it is when you produce something yourself. But that is a technical (and probably solvable) issue, not a legal one. And it's not the subject of this lawsuit.