Leaking sensitive data and infringement are separate (though related) concerns. They may not want to do what you say, even if it's completely infringement-safe.
Are they separate? Or is it the same concern but from opposite view points?
Both sides are worried about IP leaking, but one side is worried about its own IP leaking, and the other is worried about liability if it inadvertently implements any leaked IP. Either way, the concern is leaked IP.
Yes, if I ask something like "Can you describe Microsoft's internal security processes and the names of upcoming products?", the output would be original and not covered by copyright, but it would be sensitive internal information covered by NDAs. But any code publicly posted and available to be scraped won't contain such sensitive info.
I don’t think GitHub Copilot can respond to prompts like that. I thought it was ostensibly sophisticated source-code completion. If so, source code is absolutely covered under copyright.
But even if that were true, it’s a moot point, because we are talking about the copyrighted content the models were trained on. Hence the OP's point: if Microsoft really wanted to reassure people, they’d promote models trained on Microsoft’s own code rather than handwave away these concerns with gestures of assuming theoretical liability.
Ah, ok. As for testing in court, that will be useful, but a rather official source says "created by a human author" [0] in defining the notion of copyright, which I assume paraphrases the actual law, and which I assume a judge would interpret similarly. However, I will concede it's conceivable that if a human authors a work that then itself authors another work, the second work could be attributed to the human for purposes of copyright eligibility.