|
I'm a math/stats guy, so I understand the theory (linear algebra, probability/statistics, optimization, etc.) behind machine learning. But I'm pretty sure I'd be horrible at doing machine learning in real life. I mean, I'm a decent high-level programmer. I can write machine learning algorithms in Ruby or Java no problem, and even throw them onto Amazon's AWS/Elastic MapReduce. But I have no idea how to do scalable, low-level/non-MapReduce machine learning. What kinds of things do I need to learn? (e.g., multithreaded programming, memory management, ...?) What are some good introductory resources? [A superficial introduction is fine, preferred even, since I'm just trying to get a broad overview of what I need to learn. In other words, the meta-problem is that I don't know what I don't know.] I'm primarily looking for non-MapReduce stuff, but MapReduce/Hadoop suggestions are welcome as well. (I can write MapReduce/Hadoop algorithms, and know at a high level how they work, but I'm not familiar with the low-level details.) |
|
Here are a few references: John Langford has a blog post about general guidelines for making your ML algorithms fast--both low-level and high-level suggestions. JL is the man behind Vowpal Wabbit, which is regarded by many as the most scalable ML toolkit around. Here's another post by jl summarizing some recent excitement on the internet about "how to structure your ML experiments/code". The references within are a must-follow. Lastly, there's another recent post about machine learning in the wild. |
|
To complement other people's replies, I would like to add data structures. Having a good command of data structures--e.g., knowing their time/space complexity and their access/insertion/deletion costs--is very useful. But again, as others pointed out, this is not really limited to machine learning.
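For instance, here is a minimal Python sketch (sizes made up) of how much the choice between a list and a set matters for membership tests once data gets big:

```python
import timeit

n = 1000000
items_list = list(range(n))
items_set = set(items_list)

# List membership scans elements one by one: O(n) per query.
t_list = timeit.timeit(lambda: (n - 1) in items_list, number=100)

# Set membership uses hashing: O(1) on average per query.
t_set = timeit.timeit(lambda: (n - 1) in items_set, number=100)

print(f"list lookup: {t_list:.4f}s  set lookup: {t_set:.6f}s")
```

On typical hardware the list version is several orders of magnitude slower, which is exactly the kind of gap that bites as datasets grow.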
I agree. While this is not limited to machine learning, in many other areas you can get away with dumber data structures far more easily than in ML, especially since datasets have been growing wildly.
(Jul 29 '11 at 16:13)
Alexandre Passos ♦
|
|
I won't talk about the machine learning or algorithm part, as others have covered it. I will complement their answers by describing how you can efficiently implement a model that you want. As others said, the first step is to profile, profile, and profile again. Don't be shy about profiling :) After you profile, check the parts that take too much time and try to understand why they take so long. You will need to choose when to stop profiling and optimizing, as there is always a way to make it faster. Other things to take care of, in decreasing order of importance:

- Algorithm selection: use a good algorithm (e.g., an n log(n) algorithm will always beat an n^2 algorithm when n is big). If you want a parallel sort, use merge sort, not quicksort, ...
- Use already-optimized libraries such as BLAS, LAPACK, ... Optimization specialists have spent countless hours on them; you probably won't beat them unless you solve a more specialized version of their problem or spend much time on it.
- Check your bottlenecks to see if they are memory- or CPU-bound. More and more often, the problem is memory-related. An optimized gemm is 10 times faster than a naive one, and the majority of the speedup comes from memory-access optimization, not specialized instructions!
- After those steps, you can start to look at parallelizing or using a GPU.

As someone else said, when you have profiled and hit a wall on what to optimize, don't hesitate to ask. |
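To make the BLAS point concrete, here is a minimal Python sketch (numpy assumed) comparing a naive triple-loop matrix multiply with numpy's dot, which dispatches to an optimized BLAS gemm:

```python
import time
import numpy as np

n = 200
A = np.random.rand(n, n)
B = np.random.rand(n, n)

def naive_gemm(A, B):
    """Triple loop: poor memory locality, no vectorization."""
    rows, inner, cols = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            s = 0.0
            for k in range(inner):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

t0 = time.time()
C_naive = naive_gemm(A, B)
t1 = time.time()
C_blas = A.dot(B)  # delegates to the optimized BLAS gemm
t2 = time.time()

print(f"naive: {t1 - t0:.2f}s  BLAS: {t2 - t1:.4f}s")
print("results match:", np.allclose(C_naive, C_blas))
```

The gap you'll see is mostly memory-access and vectorization work done for you by the BLAS authors, which is the point above: don't try to beat them.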
|
Learn C. No matter what you eventually use for high-level algorithm implementation, knowing C will be an asset, if only so you are aware of all the details (or most of them; C compilers still take care of a lot). In a pinch, other languages (okay, maybe not Java so easily) can interface with external libraries written in C, so you can write time-critical/bottleneck stuff in C even if most of your code is written in Python, or Ruby, or R. If I were just getting started I'd also try to read up on GPU programming: what it's good for and, basically, how to use it. Note that Theano can really help with this part, but you ought to know what sorts of problems GPUs are good at before looking at specific software implementations. |
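As a hedged sketch of that interfacing point, here is how Python's ctypes can call into a C routine; the fastdot.c source and library name are hypothetical stand-ins for whatever bottleneck you'd rewrite in C:

```python
# Hypothetical C routine, compiled to a shared library:
#
#   /* fastdot.c */
#   double dot(const double *x, const double *y, int n) {
#       double s = 0.0;
#       for (int i = 0; i < n; i++) s += x[i] * y[i];
#       return s;
#   }
#
#   $ cc -O3 -shared -fPIC -o libfastdot.so fastdot.c

import ctypes

lib = ctypes.CDLL("./libfastdot.so")  # hypothetical library name
lib.dot.restype = ctypes.c_double
lib.dot.argtypes = [ctypes.POINTER(ctypes.c_double),
                    ctypes.POINTER(ctypes.c_double),
                    ctypes.c_int]

n = 3
Arr = ctypes.c_double * n       # a C array type holding n doubles
x = Arr(1.0, 2.0, 3.0)
y = Arr(4.0, 5.0, 6.0)
print(lib.dot(x, y, n))         # -> 32.0
```

Ruby (via FFI) and R (via .C / .Call) have analogous mechanisms.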
|
As Robert said, machine learning programming is no easier and no harder than actual programming. When you have a well-specified algorithm for some task, implementing it is usually fairly trivial, as is the rest of programming. Parallelizing and scaling machine learning techniques, however, are just as hard as parallelizing and scaling regular programs: sometimes it's easy, sometimes it's hard, sometimes it's a research problem. I advise you to search a bit on Google Scholar before attempting to deploy a specific learning algorithm on a large scale, to see if any researchers have already published papers detailing how they did so: it's a good rule of thumb, at first, that something about which a paper was written is not exactly trivial.

This is not to say that there aren't any extra issues: as most machine learning techniques are rather algorithmic (i.e., they solve a specific, well-defined problem as a subproblem of learning), it might be necessary for you to take some care in making sure that the transformations you apply from the specification to your program don't break any of the assumptions that guarantee the correct result. Once you're very familiar with the family of techniques you're implementing, you should know when this or that assumption can be relaxed and how to "map back" from code to a model that you can make sense of; at first, however, it is better to always start from a clear model, as algebraic errors are sometimes annoying to debug (especially since, as most machine learning techniques are inherently fuzzy, it gets hard to test that you're actually doing the right calculations and not something meaningless that happens to have similar formulas).

One last tip is to always try a dumb, non-object-oriented, non-functional, non-parallel, pseudocode-looking implementation of any ML algorithm, so you can benchmark others against it in performance and correctness (see the sketch below). Usually most papers either give you pseudocode for their algorithms directly or reference some paper that does; if there is no pseudocode, maybe you're better off implementing whatever simple version there is pseudocode for. Researchers usually oversell their improvements, and even very old and unsophisticated things such as the perceptron can still produce good results, and in some situations are depressingly close to the state of the art. For an example of this phenomenon at play, see the EMNLP 2010 paper "Two Decades of Unsupervised POS Induction: How Far Have We Come?". It takes some ingenuity and familiarity to know not to jump straight to the keyboard upon seeing a cool new algorithm.

Also search for tutorials and workshop papers on the technique you're trying to implement; some of these can shed new light on hard decisions. And last, but not least, don't be afraid of asking specific questions in forums like this, where people with actual experience might be able to help you. |
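To illustrate the "dumb, pseudocode-looking baseline" tip, here is a plain numpy perceptron--the kind of reference implementation you can benchmark fancier code against (the synthetic data and epoch count are made up for the example):

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """Classic perceptron; labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Update only on mistakes, exactly as in the textbook pseudocode.
            if yi * (np.dot(w, xi) + b) <= 0:
                w += yi * xi
                b += yi
    return w, b

# Tiny made-up linearly separable problem as a sanity check.
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
w, b = perceptron_train(X, y)
preds = np.where(X.dot(w) + b > 0, 1, -1)
print("training accuracy:", (preds == y).mean())
```

If a clever new method can't clearly beat something like this, that tells you a lot.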
|
I would also suggest looking around at libraries to see how they implement different basic algorithms. This should give you an idea of what you need to do. I personally would recommend using Theano, a library for linear algebra and more that takes care of many of the pitfalls for you. Note that Theano is not a machine learning library per se. You can look at the Deep Learning Tutorials to see Theano in action. There are also other libraries around; I will just throw in some names: PyBrain, Oger, ... |
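As a small, hedged taste of what Theano buys you (symbolic expressions plus automatic differentiation), here is a sketch of one logistic-regression gradient step; the dimensionality and learning rate are made up:

```python
import numpy as np
import theano
import theano.tensor as T

# Symbolic inputs and a shared (persistent) weight vector.
x = T.dvector('x')
t = T.dscalar('t')
w = theano.shared(np.zeros(5), name='w')

# Logistic-regression probability and cross-entropy cost.
p = T.nnet.sigmoid(T.dot(w, x))
cost = -t * T.log(p) - (1 - t) * T.log(1 - p)

# Theano differentiates the expression graph for you.
grad_w = T.grad(cost, w)
train = theano.function([x, t], cost,
                        updates=[(w, w - 0.1 * grad_w)])

print(train(np.ones(5), 1.0))  # one made-up gradient step
```

The same graph can be compiled for the GPU without changing the model code, which is one of the pitfalls Theano takes care of.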
|
First of all, you don't need to be 'low level' to program for machine learning; you just need to be aware of the pitfalls of coding at higher levels. Usually, things like appending to lists and concatenation are slow. Creating objects (such as indirect copies, or creating a new object from a modifying operation) is usually a telltale sign as well. Learn how to profile your code: this tells you when your code is slow, and which bit is slow. The quote 'premature optimisation is the root of all evil' is true, but with machine learning, slow won't cut it. I program mainly in Python, using numpy a lot, and when profiling code, most bottlenecks turn out to be in completely unexpected spots. Another thing is to make code modular. This allows you to speed up code (with unit testing to ensure that it still works the same) without worrying about breaking it. It has another benefit too: it allows you to break code up by resources. Finally, use libraries when they exist--mature libraries, for as much as possible. If you can't find the algorithm already implemented, at least make sure you use linear algebra libraries.
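As a minimal sketch of what "learn how to profile" looks like in Python (the function names and sizes are made up), cProfile shows per-function where the time goes--including the list-concatenation pitfall mentioned above:

```python
import cProfile

def slow_build(n):
    """Repeated list concatenation copies the list every time: O(n^2) overall."""
    result = []
    for i in range(n):
        result = result + [i]   # builds a brand-new list each iteration
    return result

def fast_build(n):
    """append mutates in place: amortized O(1) per element."""
    result = []
    for i in range(n):
        result.append(i)
    return result

# Per-function timings reveal which version dominates the runtime.
cProfile.run("slow_build(10000); fast_build(10000)", sort="cumulative")
```

The point is that the profiler, not your intuition, tells you which function is the bottleneck.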
+1 especially for the bit about making it modular. Clearly your code is less useful if it's not easy to adapt to other algorithms. On the other hand, you can run into the same pitfalls as with premature optimization. The extreme of this is spending so much time trying to make it modular that you hardly get anything done.
(Jul 29 '11 at 17:52)
Brian Vandenberg
|
|
Just start implementing the algorithms. Start measuring the rate at which your algorithms converge. The visceral experience of seeing how your algorithms actually perform is very valuable. Machine learning is as much a craft as it is a science, and direct experience with code will solidify your understanding further. Scalable machine learning has started to be explored by several researchers. Leon Bottou has some recent papers on large-scale learning. You might find the papers at this recent NIPS workshop useful: http://www.select.cs.cmu.edu/meetings/biglearn09/ |
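As a hedged sketch of "measure the rate your algorithms converge", here is plain SGD on a made-up least-squares problem, printing the training loss each epoch so you can watch convergence directly:

```python
import numpy as np

# Made-up linear-regression data.
rng = np.random.RandomState(0)
X = rng.randn(500, 3)
w_true = np.array([2.0, -1.0, 0.5])
y = X.dot(w_true) + 0.1 * rng.randn(500)

# Plain SGD on squared error.
w = np.zeros(3)
lr = 0.01
for epoch in range(10):
    for xi, yi in zip(X, y):
        grad = 2.0 * (xi.dot(w) - yi) * xi   # gradient of (x.w - y)^2
        w -= lr * grad
    mse = np.mean((X.dot(w) - y) ** 2)
    print(f"epoch {epoch}: MSE = {mse:.5f}")
```

Plotting curves like this for different algorithms or step sizes is the quickest way to build the visceral feel mentioned above.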
This is a very subjective question. It should be marked community wiki.