Data Mining

blender · on Dec 31, 2009

+1. It may not be sexy or glamorous, but it will be in high demand.

bioweek · on Dec 31, 2009

Any justifications for that? You could have said the same thing 10 or 20 years ago, but it still hasn't come into great demand.

kordless · on Dec 31, 2009

Webapps. The rise of the webapps will herald a new era for analytics and reporting. There's gold in them thar logs!

lsb · on Dec 31, 2009

Yes, I'm starting to work for an online company that's sitting on 20 TB of clickstream data that no one's gotten around to even looking at.

simonw · on Dec 31, 2009

10 or 20 years ago there wasn't nearly as much readily available data to be mined. Today even moderately high-traffic sites generate GBs of log files a day, not to mention the enormous quantity of high-value data available through various APIs.

aaronblohowiak · on Dec 31, 2009

You don't actually need all of the traffic to draw meaningful conclusions. Tracking a statistically sound random sample of user sessions provides most of the benefit for pattern analysis.
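One way to sample whole sessions rather than individual hits is to hash the session id and keep a fixed fraction of the hash space. The following is only a minimal sketch: the function names, the `session_id` field, and the 10% default rate are assumptions made for illustration, not anything described in this thread.

```python
import hashlib


def keep_session(session_id: str, rate: float = 0.1) -> bool:
    """Return True for roughly `rate` of all session ids, deterministically.

    Hashing the id (instead of calling random()) means every event from
    the same session gets the same decision, so sessions stay intact.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


def sample_events(events, rate: float = 0.1):
    """Keep only the events whose session falls inside the sample."""
    # `session_id` is a hypothetical field name for this sketch.
    return [e for e in events if keep_session(e["session_id"], rate)]
```

Because the decision depends only on the id, the same sessions are selected every time the logs are reprocessed, which keeps results comparable from day to day.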

keefe · on Dec 31, 2009

You've actually got the processing power to do interesting things now. I've currently got a 250M-record database in my domain of interest. A few years ago, crunching on this database was prohibitively time-consuming, but now it flies, and that's without even getting into what it means to be able to run stuff on EC2 with arbitrary power. That's direct experience with the same database, by the way, not supposition based on two different databases. Next, consider how much more data is being generated. It should not be difficult to believe that drawing conclusions from data is, and will continue to be, interesting.

est · on Dec 31, 2009

And visualization.

sga · on Dec 31, 2009

Any thoughts on the best resources for learning data mining? I.e., can anyone suggest must-read sites, blogs, or textbooks on the subject? Thanks.

keefe · on Dec 31, 2009

First, you need to grok the basics ( http://academicearth.org/courses/machine-learning ), but data mining can't be learned from a passive standpoint. You need to find a large dataset ( http://aws.amazon.com/publicdatasets/ ) and try to do something with it. I've been pleased with http://neuroph.sourceforge.net/ for a lot of the stuff I do.

elai · on Dec 31, 2009

Data mining is a lot of statistical work and machine learning, so if your stats knowledge is rusty, basic, or non-existent, I would suggest you read up on that first.

aaronblohowiak · on Dec 31, 2009

SCPD at Stanford offers a certificate program in data mining:

http://scpd.stanford.edu/public/category/courseCategoryCerti...

(if the above is broken, http://tinyurl.com/stanford-graduate-certs )

hack_edu · on Dec 31, 2009

Ben Fry's dissertation 'Computational Information Design' [http://benfry.com/phd/] is a great start. He breaks down many of the skills needed to succeed at information design and processing.

He's recently released Mastering Data with O'Reilly, which is essentially an expanded second edition of his dissertation.

evgen · on Dec 31, 2009

I think you mean Visualizing Data and not Mastering Data.

paraschopra · on Jan 1, 2010

I'd recommend starting with data visualization to intuit how mean, median, and variance don't capture the data, then moving on to approximating a dataset with a normal distribution, then linear regression, and finally Bayesian classification/probability models.

Only after you get a good grasp of the basics should you move to advanced topics such as neural networks, SVMs, etc.
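To make the point about summary statistics concrete, here is a small sketch using two made-up datasets (invented for this illustration, not taken from any real source) that share the same mean, median, and variance but have very different shapes. It uses only the Python standard library; `summarize` is a hypothetical helper name.

```python
from collections import Counter
from statistics import mean, median, pvariance

# Two invented datasets: one split into two clusters,
# one concentrated at the centre with two outliers.
bimodal = [3, 3, 3, 3, 7, 7, 7, 7]
peaked = [1, 5, 5, 5, 5, 5, 5, 9]


def summarize(values):
    """Return (mean, median, population variance) for a list of numbers."""
    return mean(values), median(values), pvariance(values)


def text_histogram(values):
    """Render one line per distinct value, with one '#' per occurrence."""
    counts = Counter(values)
    return "\n".join(f"{v:>2} {'#' * counts[v]}" for v in sorted(counts))


for name, data in (("bimodal", bimodal), ("peaked", peaked)):
    print(name, summarize(data))
    print(text_histogram(data))
```

Both datasets report a mean of 5, a median of 5, and a population variance of 4, yet the histograms look nothing alike. That gap is what plotting the data first is meant to expose.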