Papers, workshops, courses, software...
To bring together people interested in processing Big Data, distributed computing, and machine learning, I created a new website: processingbigdata.com
Mining of Massive Datasets is a great free e-book by two Stanford professors. It focuses more on the MapReduce way of doing large-scale machine learning than on things like online methods. If you're interested in natural language processing in particular, Jimmy Lin also has a good free e-book, Data-Intensive Text Processing with MapReduce, which is likewise MapReduce-focused. Vowpal Wabbit is a great piece of software for fast online learning on huge datasets. If you're more interested in the research frontiers of large-scale machine learning, Stanford hosts a Workshop on Algorithms for Modern Massive Datasets that has a bunch of great papers. There's also an upcoming book, Scaling up Machine Learning, by Ron Bekkerman, Misha Bilenko, and John Langford.
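As a rough illustration of what makes VW fast (online SGD over hashed features, so memory stays constant however many examples stream past), here's a sketch in Python. This is not VW itself; the hash size, learning rate, and toy data are arbitrary choices of mine:

    # Sketch of VW-style online learning: logistic regression trained by
    # SGD over hashed features. Illustrative only; real VW adds adaptive
    # learning rates, feature namespaces, and much more.
    import math

    NUM_BUCKETS = 2 ** 20   # size of the hashed weight vector (arbitrary)
    LEARNING_RATE = 0.1     # fixed step size (VW uses adaptive rates)

    weights = [0.0] * NUM_BUCKETS

    def bucket(feature):
        # The "hashing trick": map a feature name to a weight index.
        # Note: Python randomizes str hashes per process; a real system
        # would use a stable hash such as MurmurHash.
        return hash(feature) % NUM_BUCKETS

    def predict(features):
        # Probability of the positive class for a list of feature names.
        score = sum(weights[bucket(f)] for f in features)
        return 1.0 / (1.0 + math.exp(-score))

    def learn(features, label):
        # One online SGD update; label is 0 or 1.
        gradient = predict(features) - label
        for f in features:
            weights[bucket(f)] -= LEARNING_RATE * gradient

    # Stream over examples one at a time: memory use is constant in the
    # number of examples, which is what makes this approach scale.
    stream = [(["user:alice", "word:cheap", "word:pills"], 1),
              (["user:bob", "word:meeting", "word:notes"], 0)]
    for features, label in stream:
        learn(features, label)

    print(predict(["word:cheap", "word:pills"]))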
I'm having trouble finding the paper, but I believe it was either Alex Krizhevsky (more likely) or Graham Taylor who co-authored a paper that parallelized the training of a large RBM across many networked machines. Their basic approach was to use a very large number of hidden units, then parallelize the up and down passes of Gibbs sampling, splitting the work somewhat like a MapReduce-style framework would.
Comment from alex (Jun 17 '11): probably Krizhevsky's thesis, Learning Multiple Layers of Features from Tiny Images.
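Since the exact paper is unclear above, the following is only a hedged sketch of the partitioning idea as described: split the hidden units of an RBM across workers so the up and down passes of Gibbs sampling run in parallel slices and get combined, much like map and reduce steps. Workers are simulated sequentially here, and all sizes and constants are made up rather than taken from the paper:

    # Hedged sketch: partition an RBM's hidden units across workers so
    # each up/down pass of Gibbs sampling runs in parallel slices,
    # map/reduce style. Not the paper's actual scheme.
    import numpy as np

    rng = np.random.default_rng(0)
    n_visible, n_hidden, n_workers = 784, 4096, 4
    W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
    # Each worker owns one slice of hidden units (columns of W).
    slices = np.array_split(np.arange(n_hidden), n_workers)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def up_pass(v):
        # "Map": each worker computes hidden probabilities for its slice.
        parts = [sigmoid(v @ W[:, s]) for s in slices]
        return np.concatenate(parts)

    def down_pass(h):
        # "Reduce": workers emit partial visible inputs, which are summed.
        partial = [W[:, s] @ h[s] for s in slices]
        return sigmoid(np.sum(partial, axis=0))

    # One CD-1 style Gibbs step on a random binary "image".
    v0 = (rng.random(n_visible) < 0.5).astype(float)
    h0 = up_pass(v0)
    h0_sample = (rng.random(n_hidden) < h0).astype(float)
    v1 = down_pass(h0_sample)
    h1 = up_pass(v1)
    # Gradient estimate: positive phase minus negative phase (biases omitted).
    W += 0.01 * (np.outer(v0, h0) - np.outer(v1, h1))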
I enjoyed reading Léon Bottou's tutorial on large-scale learning with SVMs and CRFs trained by stochastic gradient descent.
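To give a flavour of the SVM part, here's a hedged sketch of a linear SVM trained by SGD on the hinge loss. The 1/(lambda*t) step-size schedule is a common choice for this setting (the Pegasos-style update), not necessarily the tutorial's exact recipe, and the toy data is made up:

    # Hedged sketch: linear SVM trained by SGD on the hinge loss.
    import numpy as np

    def sgd_svm(X, y, lam=0.01, epochs=5, seed=0):
        # X: (n, d) array; y: labels in {-1, +1}. Returns weight vector.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        t = 1
        for _ in range(epochs):
            for i in rng.permutation(n):
                eta = 1.0 / (lam * t)          # decaying step size
                margin = y[i] * (w @ X[i])
                # Subgradient of lam/2 * ||w||^2 + max(0, 1 - margin):
                w *= (1.0 - eta * lam)         # regularization shrink
                if margin < 1:
                    w += eta * y[i] * X[i]     # hinge-loss correction
                t += 1
        return w

    # Toy usage: two Gaussian blobs.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(+1, 1, (50, 2)), rng.normal(-1, 1, (50, 2))])
    y = np.array([+1] * 50 + [-1] * 50)
    w = sgd_svm(X, y)
    print("training accuracy:", (np.sign(X @ w) == y).mean())

Each example is touched once per pass, so the cost per update is independent of the dataset size, which is the point of the tutorial's SGD approach.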
I follow John Langford's machine learning blog, hunch.net. There I found this link to a list of resources for learning about large-scale machine learning: http://www.quora.com/Machine-Learning/What-are-some-introductory-resources-for-learning-about-large-scale-machine-learning#ans104989
Even though it is more on the data mining side, I think this course and the accompanying book may be of interest: Data Mining: Learning from large datasets.
This paper, "A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification" (Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin; JMLR, 2010), is a good overview of that direction. My favourite source for reading on various ML topics is still JMLR, because of the tutorial-like nature of many of its papers.
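For a hands-on taste: LIBLINEAR, from Chih-Jen Lin's group, is one of the packages compared in that paper, and scikit-learn wraps it, so a minimal L1-regularized run looks like this (the dataset and regularization strength are arbitrary choices of mine):

    # Hedged example: L1-regularized logistic regression via LIBLINEAR
    # (one of the packages compared in the paper), as wrapped by
    # scikit-learn. Dataset and regularization strength are arbitrary.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, n_features=200,
                               n_informative=10, random_state=0)
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(X, y)
    # The L1 penalty drives most weights to exactly zero: a sparse model.
    print("nonzero weights:", (clf.coef_ != 0).sum(), "of", clf.coef_.size)

The sparsity checked in the last line is the point of the L1 penalty, and it's why these methods matter at large scale: most features can be discarded after training.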
I think large-scale machine learning is still very much an area of research. Given that, most of the recent advances are still being published, and as far as I know there is no comprehensive book or class on the topic. I would recommend, then, that you watch the talks from the "Learning on Cores, Clusters, and Clouds" workshop at NIPS 2010 and the "Large-Scale Machine Learning" workshop at NIPS 2009. Watching the talks and reading the papers might point you towards other interesting resources on this topic. Edit: actually, a book just came out on the subject: Scaling up Machine Learning, by Bekkerman, Bilenko, and Langford.
Comment from jjossarin (Feb 26 '12): I'd like to add Alex Smola's talk on graphical models for the internet. (link)
I can recommend Apache Mahout: a Hadoop-based, open-source Java library implementing large-scale machine learning and collaborative filtering algorithms.