Getting Started with Machine Learning

November 9, 2013

“Being a data scientist is when you learn more and more about more and more, until you know nothing about everything.”
— Will Cukierski via @drelu

Machine learning, “big data”, and “data science” are all the rage because information is currency in an information economy. In an era where more data is being collected and more questions are being asked, the latest arrow in the data analysis quiver goes under the deceptively simple name of “Machine Learning”.

Machine Learning (or ML) offers the promise of having a computer pore over your data and automatically extract or create the information that you’re interested in. At least in theory. On closer inspection, machine learning is a loosely defined collection of poorly explained statistical and algorithmic tricks that are magical when they work and fail fantastically when they don’t. Acquiring the skills needed to work with data safely is not easy, so we’ve assembled a solid set of suggestions and resources for getting started.

Getting into data

Knowledge is a treasure, but practice is the key to it.
– Thomas Fuller

Everyone comes to machine learning with different gaps in the basic knowledge that makes machine learning problems less painful. This post provides a concise list of references and resources along with a path for getting through the list. This list aims for compactness and utility. Doubtless we overlook many notable references, tutorials and introductions. When getting oriented in an opaque domain like machine learning, your first goal is to figure out the right questions; your second goal is to figure out where to look for answers.

Step 0: Learn What you Don’t Know

Learning what you don’t know that you don’t know – the unknown unknowns as it were – is a challenge. The lack of suitable introductory material hardly helps. One bright spot in the otherwise pitch black landscape of machine learning books is Machine Learning by Peter Flach[1].

Having a good framework for arranging the concepts and lingo that are used without explanation in textbooks is essential both to avoid frustration and to link the conceptual with the practical. Flach’s book has the best compact introduction to the sorts of problems that fall into the ML category[2] along with the basic terms that appear in ML papers again and again. Reading chapter 1 of the book should be your starting point before tackling any other reading in the ML space.

Step 1: Learn Basic Statistics

Machine Learning has its foundations in statistics. It is inadvisable to attempt to jump into machine learning without a basic understanding of – well – the basics. Without a general statistics knowledge it is difficult to build up an intuition covering how machine learning techniques work. At some level all machine learning systems achieve the same result with different approaches. Connecting these equivalencies often relies on basic statistics and algebra.

It is possible to learn statistics by writing programs in general purpose languages like C, Java, Python, or Clojure, but it’s not recommended. Using a language that was specifically designed for statistical exploration simplifies learning the basics. General purpose computing languages are usually not up to the task of exploring statistical concepts[3]. A leading language in this space is “R”[4]. The essential point here: you should have a REPL and a way to draw pictures. Statistics is completely intuitive right up until the point where it isn’t. Drawing pictures helps incrementally build layers of understanding that become essential when dealing with the 99% of statistics that is, in purely mathematical terms, algebraically bat-shit crazy.
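As a minimal sketch of the “REPL plus pictures” workflow, the following R snippet can be pasted straight into the R prompt. It draws random samples from a normal distribution, summarizes them numerically, and then overlays the theoretical density on a histogram. The seed, the sample size, and the variable names are arbitrary choices made for illustration.

```r
# Draw 1,000 samples from a standard normal distribution.
# The seed is fixed only so that repeated runs give the same numbers.
set.seed(42)
samples <- rnorm(1000, mean = 0, sd = 1)

# Summarize numerically first ...
sample_mean <- mean(samples)
sample_sd <- sd(samples)
print(c(mean = sample_mean, sd = sample_sd))

# ... then look at the picture: a histogram on the density scale with the
# theoretical normal curve drawn over it.
hist(samples, breaks = 30, freq = FALSE,
     main = "1,000 draws from N(0, 1)", xlab = "value")
curve(dnorm(x, mean = 0, sd = 1), add = TRUE, lwd = 2)
```

Changing the sample size from 1,000 to 10 and re-running is a quick way to see how noisy small samples look compared with the smooth curve.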

The best book for getting started with R and statistics is Introductory Statistics with R, 2nd edition by Peter Dalgaard. In addition to getting you familiar with the R environment, it takes you through the basic blocking and tackling of statistics[5].

A detailed reading of the text isn’t required but you should have a “read it twice” understanding of the following chapters:

Getting started with R (Chapters 1 and 2)
Distributions (Chapter 3)
Descriptive Statistics (Chapter 4)
Simple Linear Regression (Chapter 6)
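
To get a feel for the kind of work those chapters cover, here is a small sketch using cars, a data set that ships with R. It is chosen here purely for illustration and is not taken from Dalgaard’s book. It computes descriptive statistics, fits a simple linear regression, and draws the result.

```r
# 'cars' ships with base R: stopping distance (ft) versus speed (mph).
data(cars)

# Descriptive statistics: quartiles, min, max and mean for each column.
print(summary(cars))

# Simple linear regression: model stopping distance as a linear function of speed.
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))  # intercept and slope of the fitted line

# Picture: the raw points with the fitted line on top.
plot(dist ~ speed, data = cars,
     xlab = "speed (mph)", ylab = "stopping distance (ft)")
abline(fit, lwd = 2)
```

Calling summary(fit) afterwards prints the standard errors and P-values that the later chapters explain.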

Chapter 5 is on the “optional” reading list. One- and two-sample tests are worth reading if only to understand what a “P-value” is[6].
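
For a concrete look at a P-value, the sketch below runs a one-sample t-test on simulated data. The measurements are made up (30 draws from a population whose true mean is 0.5), and the null hypothesis of a zero mean is an assumption picked for the example.

```r
# Hypothetical measurements: 30 values drawn from a population whose true mean is 0.5.
set.seed(1)
x <- rnorm(30, mean = 0.5, sd = 1)

# One-sample t-test of the null hypothesis "the population mean is 0".
result <- t.test(x, mu = 0)

# The P-value is the probability, assuming the null hypothesis is true, of
# seeing a test statistic at least as extreme as the one actually observed.
p_value <- result$p.value
print(p_value)
```

Re-running with mean = 0 in the call to rnorm shows what the test reports when the null hypothesis really does hold.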

Chapters 11, 12, and 13 are “skim now read later”. Read the introductions and look for words that seem important.

Chapter 10 (Advanced Data Handling) is a toss-up. Getting the data into a usable format is often more work than the statistical analysis: irregular text files, comma separated files… with optional commas, 100 line files with 99 lines. While being able to manipulate data inside of R is important, it is not always the easiest choice. If you’re coming to R with experience in another language, odds are you can more easily massage your data outside of R. Heavy data manipulation work inside of R is usually done with a package dedicated to data wrangling, such as reshape.
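
If you do stay inside R, it pays to check a file’s shape before trusting it. The following is a rough sketch using only base R. The file contents, column names, and variable names are all invented for the example.

```r
# Hypothetical messy input: the third line is missing its last field.
raw <- c("id,name,score",
         "1,alice,90",
         "2,bob",
         "3,carol,85")
path <- tempfile(fileext = ".csv")
writeLines(raw, path)

# Count the comma-separated fields on each line before loading the file.
field_counts <- count.fields(path, sep = ",")

# Line numbers (header included) whose field count differs from the header's.
bad_lines <- which(field_counts != field_counts[1])
print(bad_lines)

# read.csv pads short rows with NA (fill = TRUE is its default), so the
# ragged row still loads, but the gap has to be dealt with afterwards.
scores <- read.csv(path)
print(scores)
```

The point of counting fields first is that read.csv quietly fills the gap, so without the check a malformed row can slip into the analysis unnoticed.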

Step 2: Dipping a Toe Into Machine Learning

Machine Learning by Peter Flach is a highly recommended book. When first approaching a new subject, having a book written for the motivated beginner is a godsend. Machine Learning represents the “if you only buy one book” suggestion on getting into machine learning. Its strength lies in helping one understand enough to immediately start applying machine learning to real problems. The Preface contains something rare in a textbook: useful information on how to approach the material covered in the book. A suggested reading order is as follows:

Preface
Prologue: A Machine Learning Sampler
Chapter 1: The Ingredients of Machine Learning
Chapter 2: Binary Classification and Related Tasks
Chapter 3: Beyond Binary Classification
Epilogue: Where to go from here

After covering those chapters read the introductions to the techniques covered in other chapters. This provides a fairly complete survey of what’s available and in use today.

Step 3: Now that you Know Enough To Be Dangerous…

The time has come to actually work with some data. Just not your own data. Getting things right under ideal circumstances is hard. Getting it right when the “answers” are unknown and you’re trying to learn is damn near impossible. The difficulty arises from the following proposition:

Functional machine learning and dysfunctional machine learning look exactly the same

Which leads to the “Machine Learning Production Corollary”:

Functional machine learning and dysfunctional machine learning look exactly the same…. until you put it into production.

Or to put it another way: Getting machine learning not wrong is way harder than getting it right.

Worked examples of “real world” data analysis are available (and invaluable) for learning. A Handbook Of Statistical Analysis Using R [7] is a great place to start.

Run these commands from the R Prompt:


install.packages("HSAUR") # Install the book

vignette(package = "HSAUR") # List the chapters

vignette("Ch_cluster_analysis", package = "HSAUR") # Show a chapter listed by the previous command

I recommend going through the following chapters:

Ch_introduction_to_R
Ch_principal_components_analysis
Ch_recursive_partitioning
Ch_multiple_linear_regression

Walking in the footsteps of people who know what they are doing is essential if you ever hope to be one of them.

Data Mining With R should be your next stop. This book contains four detailed case studies. Working through them is akin to looking over the shoulder of someone tackling common tasks with applied statistics and having them explain exactly what they’re doing along with why they are doing it. These tutorials show exactly how one carries out a data analysis task.

In Data Mining With R, start with Chapter 2 “Predicting Algae Blooms” and then pick the next chapter based on your personal interests and the sorts of data tasks that you would like to tackle.

When working through these examples it is important to go back to other sources and look up additional reference material related to the technique being applied. When you’re working through the HSAUR chapter on linear regression, read through chapter 7 of Machine Learning so that you can get a broader sense of how the “simple statistical” technique directly connects to the “advanced machine learning” concept of linear classifiers and predictors. When you come across a new technique return to Flach and Wikipedia as necessary. Try to ensure that you’re getting a broad understanding of the underpinnings. It is just as important not to get stuck on one particular equation or one particular technique. Spend some time on the concept; try developing an intuition on how the concept applies, and be generous to yourself if you don’t understand exactly how the dual formulation of a quadratic optimization problem results in an elegant optimization for support vector machines… after only one read.
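
One way to see the connection between linear regression and linear classifiers for yourself is sketched below. It does not reproduce Flach’s presentation. The data are simulated, and the 0.5 threshold and all variable names are choices made for this example. An ordinary least-squares fit to a 0/1 label becomes a classifier as soon as you threshold its output.

```r
# Hypothetical data: one feature, two classes of 100 points each, with class
# means three standard deviations apart.
set.seed(7)
n <- 100
x <- c(rnorm(n, mean = 0), rnorm(n, mean = 3))
y <- rep(c(0, 1), each = n)

# Ordinary least-squares regression of the 0/1 label on the feature.
fit <- lm(y ~ x)

# Turn the regression into a classifier: predict class 1 when the fitted
# value exceeds 0.5, class 0 otherwise.
predicted <- as.integer(fitted(fit) > 0.5)
accuracy <- mean(predicted == y)
print(accuracy)

# The decision boundary is the x value where the fitted line crosses 0.5.
boundary <- (0.5 - coef(fit)[["(Intercept)"]]) / coef(fit)[["x"]]
print(boundary)
```

Because the two classes are the same size, the fitted line passes through the point (mean of x, 0.5), so the boundary lands on the overall mean of the feature. Unbalancing the classes moves it, which is a good prompt for going back to the reading.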

Step 4: Know the Tools of the Trade

Often having the right tool makes the impossible merely impractical. And when tackling a challenging subject like machine learning a little help goes a long way.

R

R is to statistics as Linux is to computing. What still costs thousands of dollars commercially[8] is free and under active development by people who use R as an essential tool. Unless you have a good reason to build R from source (and if you’re not sure if you do – you don’t), download binaries from an approved R Mirror and save yourself days of agony.

RStudio

Highly recommended for working with R. It is an IDE that understands R, has a built-in help window, and does a great job at keeping you from shooting your foot off. Frankly, I’ve found that good autocomplete and built-in help is half the battle in learning a new language. So while you love Vim, and wouldn’t dream of typing anything without Emacs, and SublimeText is your best friend – if you’re learning R, try RStudio first.

Python

Python is the king snake in the room when it comes to working with data in a general purpose language. A quick scan through PyData and you can see why. Tools like SciPy, Pandas and scikit-learn are all actively maintained (and more importantly used) by a wide community tackling problems at every scale imaginable.

Orange Data Mining

I have yet to get time to dive into Orange. However, others with sound technical judgement have vouched for Orange and the system is under active development, which is always a good sign.

Resources and References

Online

Coursera Stats 1

I have not taken this course. But I have taken other Coursera classes and I’ve found their content well organized. The syllabus is a great list of concepts that you should be familiar with to consider yourself conversant with essential statistics.

Wikipedia Statistics Portal

When it comes to concise articles on everything statistics Wikipedia is your friend.

Wolfram Mathworld

Mathworld’s Probability and Statistics Portal is a concise glossary for terms that show up in the reference material, but aren’t exactly intuitive.

Books

R In Action

Just about the gentlest introduction to R that I can imagine. There is a Second Edition of the book in the works.

Machine Learning

Peter Flach’s book is an approachable introduction to machine learning. Flach manages to make the esoteric approachable and, more importantly, easy to follow. He uses jargon sparingly, and never without explanation. The true value of the book shines in his simple explanations of the whys behind common algorithms and techniques in machine learning. He excels at providing straightforward explanations of complex topics. (It has one of the best explanations of how a Support Vector Machine works.) In short, buy the book unless you’re into learning things the hard way.

Where to Next

There’s a huge world of data out there. And according to the marketing weasels, the next generation of companies will be competing on their ability to manufacture information out of the raw piles of data that are building up on servers all over the inter-webs. Machine Learning can be the key to making sense out of all this data, but only if you’re willing to take the time and do the learning before the machine does.

[1] Peter Flach makes Chapter 1 of Machine Learning available for free online. You have to trade your email for it. But seeing as you’re going to buy the book anyway (you don’t want to dive into machine learning without this book) why not give him an email address to let you know about updates?
[2] It can be just as important to identify which problems aren’t a good fit for machine learning.
[3] The one exception I make to this blanket statement is SciPy. Frankly, the Python community is so far ahead of most other open source scripting environments that it is difficult to imagine how any of them are going to catch up.
[4] There are many other alternatives: Fortran (don’t laugh), Incanter, Mathematica, Octave, and Julia to name a few. Understanding the tradeoffs involved with choosing another language or framework is important. When starting out, simpler is usually better.
[5] Data Mining With R also has a great high speed introduction to the R environment. It might prove a better tutorial if you’ve programmed extensively in other languages.
[6] I’ve found the traditional tests less helpful when dealing with practical ML issues. YMMV.
[7] Best part: this book is built into R as a package.
[8] I tried to determine a representative price for SAS. One look at this page and you can tell why I failed.