IMHO they built the opposite of what's actually useful in the real world. Copilot should have been trained to describe what a selected block of code does, not to write a block of code from a description. It would be extremely useful, when looking at a new or under-documented codebase, to have an AI that gives you a rough hint about what some code might be doing. For example, if you could select some heinous spaghetti-code function, press a button, and get back a response like "This code looks like it's parsing HTML with regex (74.2% confidence)", it would be much easier for folks to be productive on big codebases.
No, presumably Copilot skirted that need by just analyzing the AST of code they host and using the nearby comments to identify what a section of code is meant to do. This would use the same dataset but solve the opposite problem: generate a description from a block of code's AST as input.
> Copilot skirted that need by just analyzing the AST of code they host and using the nearby comments to identify what a section of code is meant to do.
I'm curious what it spits out for comments like "TODO" or "this is probably broken", etc.
I'm not sure I understand how you envision this working, given the underlying technology. You'd have to have a pretty large cache of such analyses to train on, right?
GitHub hosts a huge amount of source code, and for Copilot they likely already had to transform it into ASTs to pair comments with the nearby code. This would use the same dataset but build the opposite model: input a block of code's AST and get a guess at what its description (i.e. the comment) should be.
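A minimal sketch of that dataset-mining idea, assuming Python and its built-in ast module; the function names here are made up for illustration, and this is not a claim about how Copilot is actually built:

```python
# Illustrative only: mine (code, description) pairs from Python source by
# walking the AST and pairing each function with its docstring.
import ast

def extract_pairs(source: str) -> list[tuple[str, str]]:
    """Return (function_source, docstring) pairs found in `source`."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                # ast.unparse (Python 3.9+) reconstructs the node's source text
                pairs.append((ast.unparse(node), doc))
    return pairs

if __name__ == "__main__":
    example = '''
def parse_title(html):
    """Extract the <title> text from an HTML string (regex, not a real parser)."""
    import re
    m = re.search(r"<title>(.*?)</title>", html, re.S)
    return m.group(1) if m else None
'''
    for code, description in extract_pairs(example):
        print("description:", description)
        print("code:", code)
```

Each (code, description) pair mined this way could serve as a training example for the inverted model described above: code in, comment out.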