I'm a computer programmer with some interest in machine learning, but I don't know where to start if I want to get into it really fast, in a "Dive Into Python" kind of way. It would also be nice to know what other knowledge I would need.

This question is marked "community wiki".

asked Jul 02 '10 at 19:52

partoa


edited Jul 02 '10 at 20:44

Joseph Turian ♦♦

10 Answers:

Here's a crazy idea: Post a "How do I build X" question on this site.


NLP and ML are not that hard to implement, as long as you know what to implement. So come up with a concrete project that involves NLP or ML, and then ask us what technique you should use. We'll explain to you how to do it, and you can implement it.

Once you do this for a few projects, you'll start to see patterns and get the hang of which technique is applicable to which problem.

This answer is marked "community wiki".

answered Jul 02 '10 at 22:49

Joseph Turian ♦♦


I get really frustrated when someone asks for research-related machine learning advice (as in, how to begin studying, etc.) and at the same time shies away from implementing the learning algorithms themselves. I'm sure this puts an undesirable upper bound on how well and how fast they can understand things. I guess it's just not clear to most people that you can implement almost any ML method in a few hundred lines of clear code (assuming you have access to numerical libraries). Recently this even includes notoriously troublesome things like SVMs and l1 regularization.

(Jul 03 '10 at 06:21) Alexandre Passos ♦
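
To give a rough sense of scale for the "few hundred lines" claim above, here is a minimal sketch of one of the simplest learning algorithms, a perceptron, in plain Python. It is only an illustration, not code from the commenter: the function names `train_perceptron` and `predict` and the toy logical-AND data set are assumptions made for this example, and a real project would normally lean on a numerical library.

```python
def train_perceptron(examples, epochs=100):
    """Train a perceptron on (features, label) pairs with labels in {-1, +1}."""
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in examples:
            score = sum(w * x for w, x in zip(weights, features)) + bias
            if label * score <= 0:  # misclassified (or on the boundary)
                # Move the decision boundary toward the misclassified example.
                weights = [w + label * x for w, x in zip(weights, features)]
                bias += label
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: training has converged
            break
    return weights, bias


def predict(weights, bias, features):
    """Return +1 or -1 depending on which side of the boundary the point is."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else -1


# Hypothetical toy data: logical AND, which is linearly separable.
and_examples = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
weights, bias = train_perceptron(and_examples)
predictions = [predict(weights, bias, x) for x, _ in and_examples]
print(predictions)  # → [-1, -1, -1, 1]
```

The perceptron only converges when the classes are linearly separable, which is exactly the kind of limitation that becomes obvious once you implement and run it yourself.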

I like this idea. Still, it's nice to have some fundamentals first, I think.

(Jul 03 '10 at 06:52) partoa

Another point I would like to add to this discussion is about using Matlab as a starting point. Matlab lets you focus on the mathematical aspects of ML and makes things such as matrix multiplication easier; the matrix view and command-line-like interface also help a lot. R is a good alternative to Matlab.

(Jul 04 '10 at 00:27) DirectedGraph

@Alexandre the problem with implementing ML algorithms isn't the difficulty of writing the code itself. It's rather the difficulty of grasping the math and the intuition behind the idea. An SVM, as you mentioned, may not take many KLOC to implement (and the complexity of an algorithm isn't measured in LOC anyway). But it is usually hard for a regular, inexperienced person (like me) to understand the math and turn it into an algorithm. I know this because I recently had to implement plenty of ML algorithms (including SVMs). There is a second problem: even once you have written the code for a specific ML algorithm, for production use it has to be efficient and reliable, which can be hard to get right. So you may have to make tweaks, use different data structures, and so on, which complicates the problem further. Therefore I don't recommend implementing an ML algorithm yourself if there is a good, working, stable library for the task (similar to cryptographic applications). Of course, if the existing tools don't meet your needs, you can implement your own, but you do so at your own risk.

(Jul 05 '10 at 00:08) cglr

cglr, I agree with you that the difficulty lies in grasping the math and intuition and translating them into code. But if a person is actually trying to understand and use machine learning, I think facing this difficulty is necessary. In the process of struggling with the equations, definitions, training sets, optimizations, etc., a person starts to develop an intuition for why some things work and others don't, and how to fix the ones that don't. This is especially important in ML because what we're trying to do is impossible in the limit (you can't generalize from a biased sample in the most general setting possible, per the no free lunch theorem or even just David Hume's writings). So while we try to develop and apply nice black boxes, they will fail and need tweaking, and a person who never struggled with the failures found in simple problems will just not be ready to understand the more complex issues that arise in production environments. I see some professors and colleagues who just "use Weka" (or some Matlab toolbox, or libsvm) without any intuition, doing very silly things (like reporting 100% classification accuracy on the training set, with an SVM using a Gaussian kernel and over 5000 features for 500 examples, as a breakthrough and a useful model) and being completely helpless at fixing or improving a learning system, while those who have struggled to implement at least a few basic models tend not to make these mistakes.

For me, at least, this has proven true many times. In my first experiences with LDA, SVMs, graphical models (and probably others) I did the traditional download-run-and-test loop and got really bad results, forming weird, incorrect intuitions in my head. Only after reading the original source, scratching my head and implementing some simpler variants did I start to understand how these things work, how I can apply them to this or that specific problem, why those people are writing all those very similar, very obscure papers on improving some little detail of those things, etc.

Just reading an overview of a machine learning technique and using pre-packaged software encourages you to skip the construction of very important mental models that can be the difference between a rational application of machine learning and an overfitted mess.

Of course, when designing a production system one has very different priorities than when one is trying to learn about some technique or technology. In that scenario it is very useful to find a stable, tested, fast, accurate, production-ready implementation and just use it. But this holds only because I'm assuming that the person using it is already familiar with the techniques and knows why they work and how to debug them.

(Jul 05 '10 at 13:05) Alexandre Passos ♦
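
The training-set accuracy pitfall described in the comment above can be reproduced in a few lines. The sketch below is only an illustration under stated assumptions: it swaps the SVM with a Gaussian kernel for a 1-nearest-neighbour classifier (another model that can memorize its training set), uses purely random features and labels so that there is nothing real to learn, and all names (`predict_1nn`, `random_dataset`, the sizes and the seed) are made up for the example.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_1nn(train, point):
    """Return the label of the training example closest to `point`."""
    _, label = min(train, key=lambda example: squared_distance(example[0], point))
    return label


def accuracy(train, data):
    """Fraction of `data` classified correctly by 1-NN over `train`."""
    correct = sum(1 for features, label in data if predict_1nn(train, features) == label)
    return correct / len(data)


def random_dataset(rng, n_examples, n_features):
    """Features and labels are independent noise: there is no signal to find."""
    return [
        ([rng.gauss(0.0, 1.0) for _ in range(n_features)], rng.choice([0, 1]))
        for _ in range(n_examples)
    ]


rng = random.Random(0)
# More features than examples, as in the situation described above.
train = random_dataset(rng, n_examples=50, n_features=100)
test = random_dataset(rng, n_examples=200, n_features=100)

# Every training point is its own nearest neighbour, so this is 1.0.
train_accuracy = accuracy(train, train)
# On fresh data the model can only guess, so this hovers around 0.5.
test_accuracy = accuracy(train, test)

print("training accuracy:", train_accuracy)
print("held-out accuracy:", test_accuracy)
```

A perfect score on the training set says nothing here, because the labels are coin flips; only the held-out number reflects what the model has actually learned.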

@Alexandre thanks for your valuable comment. I noticed that we're thinking in the same direction.

(Jul 05 '10 at 17:50) cglr

I agree with Alexandre. Downloading and using ready-made packages doesn't help in understanding the algorithms and their various nuances. That's true of any problem you are trying to solve, not just in the ML domain. Coding always makes it clearer what the real deal is and also lets you debug stuff.

(Sep 05 '10 at 03:46) Aman

For what it is worth, as I see it, there are at least three major components to knowing what you are doing when applying machine learning:

  • Understanding the mathematics behind the methods.
  • Understanding the tools for implementing them.
  • Understanding the domain in which you are applying your tools and methods.

If you are mathematically mature, I would recommend working through The Elements of Statistical Learning and learning R, if you don't already know it. If you are not, you might want to start with StatSoft's Electronic Statistics Textbook and/or purchase the Handbook of Statistical Analysis and Data Mining Applications instead of The Elements of Statistical Learning. That will get you two-thirds of the way to a basic skill set, fast.

There are some caveats, though. If you are interested in certain areas of commercial work, you might want to learn SAS programming, or some other language/platform, instead of R. While R seems to be valued by high-end, analytically oriented companies, many companies less focused on the bleeding edge use SAS, MS SQL SSAS, etc.

answered Jul 02 '10 at 20:33

John L Taylor

edited Jul 03 '10 at 19:52

The best dive-into-python-ish way to get into ML is by reading and following the implementations of Programming Collective Intelligence. It's lightweight on the theory and motivations side, but it will teach you some important methods and how they work. Then I'd recommend reading Bishop's Pattern Recognition and Machine Learning if you want to really get into the field.

This answer is marked "community wiki".

answered Jul 02 '10 at 20:47

Alexandre Passos ♦

First, if you want to understand ML, watch the Stanford ML course lectures by Andrew Ng. If you haven't heard of him, he is a really successful ML researcher. The lectures are available freely here: http://academicearth.org/courses/machine-learning

Then you can start by implementing Hidden Markov Models. They are simple, yet they will make you implement a famous inference algorithm and also understand how to store random variables, probability tables, etc. If you are an experienced programmer, you will enjoy learning how to implement the data structures for Machine Learning algorithms. It is true that most of the ML algorithms can be implemented in about 100 lines of code. But you need to build a lot of coding infrastructure around it to make it usable.
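
As a rough idea of what such an exercise looks like, here is a minimal sketch of the Viterbi algorithm, which finds the most likely sequence of hidden states for a sequence of observations. The answer above does not say which inference algorithm it has in mind, so choosing Viterbi is an assumption, and the states, observations and probabilities are toy values picked purely for illustration. The probability tables are stored as nested dictionaries, one simple way of handling the "probability tables" mentioned above.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence and its probability."""
    # best[t][s]: probability of the best path that ends in state s at time t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    # back[t][s]: the state at time t-1 on that best path.
    back = [{}]
    for obs in observations[1:]:
        previous = best[-1]
        scores, pointers = {}, {}
        for s in states:
            prob, source = max((previous[r] * trans_p[r][s], r) for r in states)
            scores[s] = prob * emit_p[s][obs]
            pointers[s] = source
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state to recover the path.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy model (illustrative numbers only).
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, round(prob, 5))  # → ['Healthy', 'Healthy', 'Fever'] 0.01512
```

For long observation sequences the raw probabilities underflow, so a more careful version would work with log-probabilities; that is part of the extra infrastructure the answer refers to.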

And yes, try looking at other people's code.

This answer is marked "community wiki".

answered Sep 05 '10 at 03:54

Aman


edited Sep 05 '10 at 03:56

The best advice for people interested in machine learning that I have found so far is the following blog post:


I hope it is helpful.

This answer is marked "community wiki".

answered Jul 04 '10 at 14:17

Pierre Rosado

edited Jul 04 '10 at 15:13

That's a great resource, but also extremely disheartening. There is a lifetime of material on that page.

(Oct 27 '10 at 18:49) Nate Murray

Indeed. Another resource, somewhat similar: http://www.quora.com/How-do-I-become-a-data-scientist

(Oct 28 '10 at 10:45) Lucian Sasu

I would look into an ML framework like Weka or Orange or RapidMiner. Weka, for example, is written in Java, has a GUI, but is also easy to use programmatically.

I disagree with the other posters above who suggest that the best way to start in ML is by implementing algorithms. That's a great way to understand more, but at first I think it's more helpful to just use a few of the more common algorithms like Bayes and SVMs (that other people have implemented) on pet projects. Then, later on, implement them yourself for deeper understanding.

This answer is marked "community wiki".

answered Jul 03 '10 at 13:32

Cory Giles

Ahh, nice. Frameworks. These I will definitely use. Thanks.

(Jul 03 '10 at 19:27) partoa

Another approach: look at code! I learned a lot from looking at the code of my advisor, Dan Klein. It taught me a lot about the right kind of abstractions for NLP and machine learning more broadly. Not that my code is a shining example, but I intend to teach my future students from my code base: http://github.com/aria42/umass-nlp.

This answer is marked "community wiki".

answered Jul 02 '10 at 22:31

aria42


I have collected a number of books in this post. You might be interested in checking them out.

This answer is marked "community wiki".

answered Jul 03 '10 at 06:39

Mark Alen

I'm somewhat surprised no one has mentioned simply reading academic journals on the subject. Many academic journal search engines let you search by keyword, so just search for machine learning, neural networks, or any number of other related topics. As you read those papers, you'll come across things you don't know or understand; you can then study those things until you understand them well enough to follow the article, and so on.

That would be my recommendation, though it may be wise to get a very fundamental understanding under your belt first.

This answer is marked "community wiki".

answered Jul 05 '10 at 03:50

Rueben


I like the answers from Joseph, John and aria42. For me, these aspects are best represented in David Barber's upcoming book Bayesian Reasoning and Machine Learning.

This answer is marked "community wiki".

answered Jul 04 '10 at 08:03

osdf


edited Jul 04 '10 at 08:18

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.