Machine Learning on the Cheap and Easy

January 17, 2012

For those not in the know, Weka is the excellent data-mining toolkit made available by the nice folks at the University of Waikato. Weka provides a few features that are worth evaluating for your next datamining/machine learning project:

  1. Datamining workbench with strong visualization tools and data manipulation.
  2. Machine learning experiment environment – a great way to compare different ML algorithms for performance.
  3. Knowledge Flow Environment – datamining workflow engine. I haven’t played with this yet, but the documentation implies that you have a simple “ETL” system that can help you create scripts to automate your file processing and learning tasks.
  4. GPL codebase – so you can use and modify it however you need to, within the terms of the GPL.

Weka is a Java library, and if you have a problem where you think serious ML will help out, I doubt you have time to be a language bigot. The examples in the books below are just as easy to express in JRuby, Groovy, Scala, Clojure or your JVM language of choice.


Data Mining – Practical Machine Learning Tools and Techniques

This is the book on using Weka. If you want a tutorial on the libraries you might as well buy a copy. The description “Practical” is accurate. They don’t waste a lot of space with intricately laid out LaTeX diagrams. Most of the space is spent showing practical examples of how to make use of the different parts of Weka. I would recommend this as a good book to figure out “what can I do with this stuff”. The examples are easy to follow along with, so you can get started with practical data-mining without a lot of theoretical issues getting in the way. Granted, without at least some understanding of that theory, you might be predicting out of left field (and into right field), so caveat lector.

The only sad point here is that the third edition is not available as an ebook. In this day and age how can this be?


Elements of Statistical Learning

This book is highly recommended as a great survey of the huge number of methods (pun intended) at your disposal for getting started with data-mining. The best part about this book is that you can download the complete text from the authors’ website.

It’s a good book to get a feel for what’s out there. The math in this book is silly at times, but don’t let that dissuade you. Read the chapters several times and you’ll get through it. Also, the summary of statistics in the back of the book is a great crash course: it’s short but manages to cover quite a bit of ground in a read of one or two hours.


Introduction to Applied Bayesian Statistics and Estimation for Social Scientists (Statistics for Social and Behavioral Sciences) by Scott Lynch

The title alone on this one is long enough to be a tongue twister. That being said, the step-by-step explanations (including working R code) that show you how to actually build a Monte Carlo simulation are instructive, informative and easy to follow.

…Okay, easy to follow might be a stretch, but this is one of the few books that offers plenty of detailed explanation of why you’re doing what you’re doing and where all those magic numbers come from. For example, in section 7.3 of the book Lynch walks you through a complete Bayesian analysis that asks the question “Are people in the South nicer than other people?” The walkthrough includes evaluating the performance of the resulting model.
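To give a flavor of what a Monte Carlo answer to that kind of question looks like, here is a minimal sketch in Python. To be clear, this is not Lynch’s code (his examples are in R) and it is not his model: I’m assuming a much simpler Beta-Binomial setup, and the survey counts are made up purely for illustration. The function names `posterior_draws` and `prob_first_greater` are my own.

```python
import random


def posterior_draws(successes, trials, n_draws, rng, prior_a=1.0, prior_b=1.0):
    """Sample a proportion from its Beta posterior.

    Assumes a Beta(prior_a, prior_b) prior and a binomial likelihood, so the
    posterior is Beta(prior_a + successes, prior_b + failures).
    """
    a = prior_a + successes
    b = prior_b + (trials - successes)
    return [rng.betavariate(a, b) for _ in range(n_draws)]


def prob_first_greater(group_a, group_b, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_a > rate_b | data).

    Each group is a (successes, trials) pair. The estimate is the share of
    paired posterior draws in which group A's rate exceeds group B's.
    """
    rng = random.Random(seed)
    draws_a = posterior_draws(group_a[0], group_a[1], n_draws, rng)
    draws_b = posterior_draws(group_b[0], group_b[1], n_draws, rng)
    wins = sum(1 for a, b in zip(draws_a, draws_b) if a > b)
    return wins / n_draws


# Hypothetical counts, invented for this example (NOT from the book):
# respondents rated "nice" out of respondents surveyed.
south = (130, 200)
elsewhere = (110, 200)

p = prob_first_greater(south, elsewhere)
print("Estimated P(South rate > elsewhere rate):", round(p, 3))
```

The point of the exercise is that the answer comes out as a probability you can argue about, rather than a bare yes or no. Checking whether a model this simple is even appropriate for your data is exactly the part the book spends its time on.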

Far too often, the questions of whether the model works, how to fix it if it doesn’t, and what the alternatives are go unanswered in the more academically minded texts. Oddly enough, because so much of what we’re trying to do in the user-big-data technology world revolves around user behavior, there might be more to be gleaned from reading social science texts than their more cosmologically minded counterparts.


The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation (Springer Texts in Statistics)

This is a text that I recommend with some reservations. It’s incredibly heavy on the theory of Bayesian statistics. Indeed, this book, more than many others, suffers from equation overkill of the type that almost always ensures that the uninitiated reader will feel as though she has been set upon by a pack of ravenous marmosets and left for dead. Be that as it may, it is one of the few books that seems to have the occasional reasonable description of some of the esoteric bits of Bayesian modeling. If you’re looking for an introductory theory book, this may or may not be it, depending on your strength in Maths.

Potentially Useful Books

The following books are on my reading list, but I haven’t had time to vet them fully:

Future Posts

In my next post on Data Mining, I’ll go into some practical JRuby examples of how to make use of the Weka toolkit.