
IMHO they built the opposite of what's actually useful in the real world. Copilot should have been trained to describe what a selected block of code does, not to write a block of code from a description. It would be extremely useful, when looking at a new or under-documented codebase, to have an AI that gives you a rough hint as to what some code might be doing. For example, if you could select some heinous spaghetti-code function, press a button, and get back a hint that says "This code looks like it's parsing HTML using regex (74.2% confidence)", it would be much easier for folks to be productive on big codebases.
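
To make this concrete, here's a purely hypothetical illustration; both the function and the imagined output are invented for this comment, and the sketch assumes Python:

    import re

    # The kind of spaghetti function you'd point the tool at.
    def proc(s):
        out = {}
        for m in re.finditer(r'<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>', s, re.S):
            out[m.group(2).strip()] = m.group(1)
        return out

    # Imagined tool output:
    #   "This code looks like it's extracting link text and hrefs from
    #    HTML using regex (74.2% confidence)"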


But that would require hiring tons of software engineers to label training data accurately.

Why do that when you can just train a GPT-3 model on public repositories and call it a day?


No, presumably Copilot skirted that need by just analyzing the AST of code they host and using the nearby comments to identify what a section of code is meant to do. This would use the same dataset but solve the opposite problem: take the AST of a code block as input and generate a description.


> Copilot skirted that need by just analyzing the AST of code they host and using the nearby comments to identify what a section of code is meant to do.

I'm curious what it spits out for things like "TODO" or "this is probably broken", etc.


Sorry for adding just noise, but I think this is the most insightful comment I've read on HN this year. Excellent analysis and idea!


Something like this would be amazing, particularly for poorly written, obfuscated or even disassembled/decompiled code!


Now that is a damn good idea!


It’s a good idea. Depending on how “smart” it is, it can be extremely hard to pull off.


Ideally, you'd train/teach it using PR code reviews. Human labeling and all that jazz.


> Ideally, you'd train/teach it using PR code reviews.

Which is why, based on the state of Windows, it will never come out of Microsoft.


I'm not sure I understand how you envision this working, given the underlying technology. You'd have to have a pretty large cache of such analyses to train on, right?


GitHub has a huge amount of source code, and for Copilot they likely already had to transform it into an AST to look at comments and nearby code. This would use the same dataset but build the opposite model: input the AST of a code block and get a guess at what the description (i.e. the comment) should be.
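
A rough sketch of what mining those (code, description) pairs could look like, assuming a Python-only corpus and using docstrings as the "description" (this is a guess at the pipeline, not how Copilot actually works):

    import ast

    def code_comment_pairs(source):
        # Yield (code, description) pairs from one Python source file.
        # A real pipeline would also harvest nearby comments and cover
        # many languages; docstrings are just the easiest stand-in.
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                doc = ast.get_docstring(node)
                if doc:
                    # ast.unparse needs Python 3.9+; strip the docstring from
                    # the body if you don't want it leaking into the input.
                    yield ast.unparse(node), doc

    # Copilot's direction is description -> code; the model proposed above
    # just flips each pair: code in, description out.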


My naive assumption is that they don't have nearly that level of control. I'd be surprised if they have an AST step before the tokenizer, or in it.
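
For context, the publicly described GPT-style models consume source as plain text through a byte-pair-encoding tokenizer. A minimal sketch with the Hugging Face transformers library (whether Copilot's pipeline looks anything like this is an assumption on my part):

    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    # Tokens are sub-word strings over the raw text; no parse tree involved.
    print(tok.tokenize("def proc(s): return s.strip()"))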



