
Wouldn't the "account creation" date be shared between test and train data, and so essentially constitute a train/test set leak?

E.g. a user in the training set has metadata about account creation.

Any test set case would only need to look at the account creation date to identify the user.



Yes, this is silly. They address it, and admit that using that field alone basically results in perfect classification, but they don't do the logical thing and give this whole exercise up as pointless. Instead they just break the account creation time up into individual features: "Account Creation Hour", "Account Creation Minute". Seriously?

The reality is that the inclusion of that field in the metadata makes identifying a user from metadata trivial, and not an interesting case for ML. In order to publish, they "degraded" the data until it was just interesting enough to be headline-worthy. Insulting.


They do address this in the paper, and in fact note that a simple KNN approach using only the account creation time gives 99.98% accuracy.
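
To make the leak concrete, here is a rough sketch of a 1-nearest-neighbour lookup on the creation timestamp alone. Everything in it is an assumption for illustration: the data is synthetic, the user names and helper functions (make_users, build_index, predict, leak_accuracy) are made up, and it is not the paper's code or dataset. It only shows why a field that is unique per user and identical in train and test rows hands the classifier the answer.

```python
import bisect
import random


def make_users(n_users, seed=0):
    """Hypothetical users, each with a unique account creation timestamp."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no two users collide.
    stamps = rng.sample(range(1_000_000_000), n_users)
    return {f"user{i}": ts for i, ts in enumerate(stamps)}


def build_index(train_rows):
    """Sort the distinct (created_at, user) pairs seen in training."""
    pairs = sorted(set(train_rows))
    return [ts for ts, _ in pairs], [user for _, user in pairs]


def predict(index, created_at):
    """1-nearest-neighbour on the creation timestamp only."""
    stamps, users = index
    i = bisect.bisect_left(stamps, created_at)
    # The nearest stored timestamp is either just below or at/above position i.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(stamps)]
    best = min(candidates, key=lambda j: abs(stamps[j] - created_at))
    return users[best]


def leak_accuracy(n_users=1000, posts_per_user=3):
    """Train on some posts per user, test on another post from the same users."""
    users = make_users(n_users)
    # Every post carries its author's account creation time as metadata.
    train_rows = [(ts, u) for u, ts in users.items() for _ in range(posts_per_user)]
    test_rows = [(ts, u) for u, ts in users.items()]
    index = build_index(train_rows)
    hits = sum(predict(index, ts) == u for ts, u in test_rows)
    return hits / len(test_rows)
```

On this synthetic setup leak_accuracy() comes out as 1.0, since every test row has an exact timestamp match in the training rows. Real data would have the occasional collision between users, which would account for a figure slightly below 100%.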



