
Data Mining


+1 It may not be sexy or glamorous but it will be in high demand.


Any justifications for that? You could have said the same thing 10 or 20 years ago, but it still hasn't come into great demand.


Webapps. The rise of the webapps will herald a new era for analytics and reporting. There's gold in them thar logs!


Yes, I'm starting to work for an online company that's sitting on 20 TB of clickstream data that no one's gotten around to even looking at.


10 or 20 years ago there wasn't nearly as much readily available data to be mined. Today even moderately high-traffic sites generate GBs of log files a day, not to mention the enormous quantity of high-value data available through various APIs.


You don't actually need all of the traffic to draw meaningful conclusions. Tracking a statistically sound random sample of user sessions provides most of the benefit for pattern analysis.
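
To make the sampling idea concrete, here is a minimal sketch of one way to do it: hash each session ID and keep the session if the hash falls below a threshold. Note that this is deterministic hash-based sampling standing in for a true random sample. The log format (session ID as the first space-separated field), the names `in_sample` and `sample_sessions`, and the 5% rate are all assumptions made up for illustration.

```python
import hashlib


def in_sample(session_id: str, rate: float = 0.05) -> bool:
    """Decide whether a session belongs to the sample.

    The session ID is hashed and mapped to a roughly uniform number
    between 0 and 1; the session is kept when that number is below `rate`.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def sample_sessions(log_lines, rate: float = 0.05):
    """Yield only the log lines whose session falls in the sample.

    Assumes (hypothetically) that each line starts with the session ID
    followed by a space.
    """
    for line in log_lines:
        session_id, _, _ = line.partition(" ")
        if in_sample(session_id, rate):
            yield line
```

Hashing on the session ID instead of sampling individual requests keeps every hit from a chosen session together, which is what you want if you are looking at patterns within a session.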


You've actually got the processing power to do interesting things. I've currently got a 250M-record database in my domain of interest. A few years ago, crunching on this database was prohibitively time-consuming, but now it flies, without even getting into what it means to be able to run stuff on EC2 with arbitrary power. That's direct experience with the same database, by the way, not supposition based on two different databases. Next, consider how much more data is being generated. It should not be difficult to believe that drawing interesting conclusions from data is and will continue to be interesting.


and visualization


Any thoughts on the best resources for learning data mining, i.e., can anyone suggest must-read sites, blogs, or textbooks on the subject? Thanks.


First, you need to grok the basics ( http://academicearth.org/courses/machine-learning ), but data mining can't be learned from a passive standpoint. You need to find a large dataset ( http://aws.amazon.com/publicdatasets/ ) and try to do something with it. I've been pleased with http://neuroph.sourceforge.net/ for a lot of the stuff I do.


Data mining is a lot of statistical work and machine learning, so if your stats knowledge is rusty, basic or non-existent, then I would suggest you read up on that.


SCPD at Stanford offers a certificate program in data mining

http://scpd.stanford.edu/public/category/courseCategoryCerti...

(if the above is broken, http://tinyurl.com/stanford-graduate-certs )


Ben Fry's dissertation 'Computational Information Design' [http://benfry.com/phd/] is a great start. He breaks down many of the skills needed to succeed at information design and processing.

He's recently released Mastering Data with O'Reilly, which is essentially an expanded second edition of his dissertation.


I think you mean Visualizing Data and not Mastering Data.


I'd recommend starting with data visualization to intuit how mean, median, and variance don't fully capture the data, then moving on to normal distribution approximation of a dataset, then linear regression, and finally Bayesian classification/probability models.

Only after you get a good grasp of the basics should you move on to advanced topics such as neural networks, SVMs, etc.
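
As a small illustration of the point about summary statistics, here is a sketch with two made-up datasets that share the same mean, median, and population variance but look nothing alike once you plot them. The data and the helper names (`summarize`, `text_histogram`) are invented for this example, and the "plot" is just a text histogram to keep it dependency-free.

```python
from collections import Counter
from statistics import mean, median, pvariance

# Two invented datasets: one split into two clusters,
# one tightly peaked with two outliers.
clustered = [2, 2, 2, 2, 8, 8, 8, 8]
peaked = [-1, 5, 5, 5, 5, 5, 5, 11]


def summarize(data):
    """Return (mean, median, population variance) for a dataset."""
    return mean(data), median(data), pvariance(data)


def text_histogram(data):
    """Render one line per distinct value, with one '#' per occurrence."""
    counts = Counter(data)
    return "\n".join(
        f"{value:>3} {'#' * counts[value]}" for value in sorted(counts)
    )


if __name__ == "__main__":
    for name, data in (("clustered", clustered), ("peaked", peaked)):
        print(name, summarize(data))
        print(text_histogram(data))
```

Both datasets produce identical summary numbers, yet the histograms show one is bimodal and the other is a single spike with outliers, which is why looking at the data first is worth the effort.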



