
Recently, I've been working with a bunch of non-machine-learning/NLP people. We were talking about the most influential ideas (or papers) in machine learning that everyone in computer science, and perhaps even beyond, should know about. The paper "Top 10 algorithms in data mining" lists: C4.5, k-means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. I feel the field has introduced a number of key ideas beyond this list that people should know about. The only rule is that it has to be from between 1995 and 2005, just to make sure we choose things that have had some time to sink in. My question to you:

What are the most influential ideas we should tell other computer scientists about?

I'll kick-start the discussion by mentioning my absolute favorite:

Message Passing for Inference: although the underlying ideas had been around for a while, it was only around 2000 that the community realized they could all be written in the language of message passing. This idea has spurred a tremendous amount of theoretical research and has had a huge practical impact.
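To make this concrete, here is a minimal sketch of sum-product message passing on a chain-structured model (it recovers the familiar forward-backward recursions). This is my own toy illustration with made-up NumPy potentials, not code from any particular paper:

    import numpy as np

    def chain_marginals(unary, pairwise):
        """unary: (T, K) node potentials; pairwise: (K, K) edge potentials.
        Returns the (T, K) marginals of each of the T chain variables."""
        T, K = unary.shape
        fwd = np.zeros((T, K))
        bwd = np.zeros((T, K))
        fwd[0] = unary[0]
        for t in range(1, T):
            fwd[t] = unary[t] * (fwd[t - 1] @ pairwise)      # message passed from the left
        bwd[-1] = 1.0
        for t in range(T - 2, -1, -1):
            bwd[t] = pairwise @ (unary[t + 1] * bwd[t + 1])  # message passed from the right
        marg = fwd * bwd
        return marg / marg.sum(axis=1, keepdims=True)

    # toy usage: 4 variables, 3 states each
    marginals = chain_marginals(np.random.rand(4, 3), np.random.rand(3, 3))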

asked Jul 08 '10 at 19:13

Jurgen


The most influential and upcoming ideas during the era specified (and current times) include, but are not limited to, the following: increased use of ensemble methods in machine learning, MapReduce, ML techniques for privacy preservation in data mining, large-scale text mining, techniques pertaining to business intelligence, text mining algorithms (especially TF-IDF and document indexing by Latent Dirichlet Allocation), ontology-building techniques (RDF et al.), social media analysis, data mining with differential privacy, advances in boosting/bagging techniques, increased adoption of mining and learning with graphs, BIRCH, and HITS, to name a few key ideas/algorithms.

(Jul 09 '10 at 20:04) Adnan Masood

I second the nomination of ensemble methods. Quite a few other ideas have emerged in the time period specified, but I think the general idea of model ensembles has offered the largest movement forward in model performance.

(Feb 13 '11 at 13:22) Will Dwinnell

8 Answers:

So, the paper that didn't invent sequence CRFs but certainly popularized them: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data by Lafferty, McCallum, and Pereira. I think it is one of the most cited papers in CS from the 2000s.

answered Jul 09 '10 at 17:14

aria42


Which paper invented CRFs, then? I thought it was that one.

(Jul 09 '10 at 21:19) Alexandre Passos ♦

Well, it's complicated. In what sense is a CRF different from an instantiation of an MRF where the $y$s are first-order Markov? The way you get marginals is a particularly efficient instantiation of the message passing algorithm. So there isn't a specific earlier paper that says, by the way, you can do sequence MRFs; it's just that for the case of chains things are a lot more efficient than in the general case.

Don't get me wrong, this is a great paper; it's just that its primary contribution is explaining why global normalization is good for sequence modeling and showing that empirically this thing outperforms the then-common MEMMs and HMMs.
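For readers following the thread, here is a rough sketch of the tractability point: the global normalizer Z(x) that "global normalization" refers to can be computed for a chain with a single forward (message passing) recursion in log space. This is my own illustration under assumed per-position emission and transition scores, not code from the paper:

    import numpy as np
    from scipy.special import logsumexp

    def crf_log_partition(emissions, transitions):
        """emissions: (T, K) per-position label scores for one input x;
        transitions: (K, K) label-to-label scores. Returns log Z(x)."""
        alpha = emissions[0]                     # log forward messages, shape (K,)
        for t in range(1, len(emissions)):
            # alpha_t[j] = logsumexp_i(alpha_{t-1}[i] + transitions[i, j]) + emissions[t, j]
            alpha = logsumexp(alpha[:, None] + transitions, axis=0) + emissions[t]
        return logsumexp(alpha)

    # toy usage: a length-5 sequence with 4 possible labels
    print(crf_log_partition(np.random.randn(5, 4), np.random.randn(4, 4)))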

(Jul 09 '10 at 21:27) aria42

So in what sense is a chain CRF different from an identical chain MRF with hidden variables? I thought the conditional training of an MRF-like model was their main methodological contribution (the conceptual one is the fact that global normalization beats per-state normalization, as you mentioned), but then again this distinction has always been slightly fuzzy to me.

A question you might be able to answer, as well: do speech recognition people use CRFs, or are complex HMMs still the state of the art?

(Jul 09 '10 at 21:36) Alexandre Passos ♦

Also, for MRFs there definitely is work, which I can't recall off the top of my head, where they are estimated using conditional likelihood. It definitely wasn't common, but it existed.

For speech, I believe it's still HMMs, since they have lots of latent variables at training time.

(Jul 09 '10 at 21:43) aria42

Thanks for the answers.

(Jul 09 '10 at 21:44) Alexandre Passos ♦

I thought the 1971 Hammersley and Clifford paper was the granddaddy of CRFs; at least it covered most of the theory, and was nicely readable too.

(Jul 10 '10 at 17:03) teejy

@teejy It certainly seems to have introduced MRFs, but I have to read it carefully to see if the CRF ideas are already there.

(Jul 10 '10 at 17:13) Alexandre Passos ♦

@Alex - true, I was thinking along the lines of aria42, where CRFs stem from MRFs.

(Jul 10 '10 at 18:06) teejy

@Alexandre: The speech community is starting to pick up on CRFs, see Geoffrey Zweig's latest work: http://research.microsoft.com/en-us/people/gzweig.

(Jul 22 '10 at 18:23) Frank

This surely is very influential. The interesting thing is that the paper does not introduce anything too radical, but rather the right combination of things (discriminative training, graphical models, optimization, application to information extraction).

(Aug 30 '10 at 02:02) Truyen Tran

I'd say nonparametric Bayesian models for language are a very interesting idea. Also, conditional random fields are useful for a surprising number of things.

answered Jul 08 '10 at 20:30

Alexandre Passos ♦


Definitely +1 for CRFs, and although I am writing my thesis on NPBayes, I feel like there is still a lot of work to do on proving their worth to put them on par with CRFs, for example.

(Jul 09 '10 at 14:32) Jurgen

That's true, but I think the applied work by Griffiths, Johnson, Goldwater, and Steyvers (in nearly any possible ordering of subsets of these names) has had quite an impact in the NLP community (although less than CRFs and the general structured linear model explosion that followed).

(Jul 09 '10 at 14:52) Alexandre Passos ♦

Despite liking these approaches, I think it's definitely way too early to say whether Bayesian nonparametric ideas for language modeling will be important or not.

(Jul 09 '10 at 17:12) aria42

Oh well, this is the sort of thing that I'm missing.

(Jul 09 '10 at 17:34) Alexandre Passos ♦

CRFs are extremely important, but I wonder if they haven't been slightly overused as well. In many cases it seems that non-structured models are actually just as good as (and in some cases better than) the structured variants, while being much more efficient.

(Feb 14 '11 at 11:31) Oscar Täckström

Here are a few that come to my mind:
1) Statistical learning theory by Vapnik
2) The MapReduce paradigm for distributed computing (a toy sketch follows below)
3) The availability of cloud computing for rent at cheap rates (around 20 cents per hour per machine)

And most significantly, I feel:
4) [Upcoming, in my opinion] the recent rise of complex networks and their associated problems and solutions

Feel free to correct me.
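Here is a toy, single-process sketch of the MapReduce idea from point 2; real MapReduce distributes the map, shuffle, and reduce phases across machines, and this word-count example and its function names are purely illustrative:

    from collections import defaultdict

    def map_phase(document):
        for word in document.split():
            yield (word.lower(), 1)            # emit (key, value) pairs

    def reduce_phase(pairs):
        grouped = defaultdict(int)
        for key, value in pairs:               # group by key, then reduce (sum)
            grouped[key] += value
        return dict(grouped)

    docs = ["the cat sat", "the dog sat on the cat"]
    pairs = (pair for doc in docs for pair in map_phase(doc))
    print(reduce_phase(pairs))                 # {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'on': 1}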

This answer is marked "community wiki".

answered Jul 09 '10 at 19:01

DirectedGraph

Can you give a few pointers or a bit more description with regard to point 4? This area has passed me by, it seems.

(Jul 24 '10 at 14:17) Noel Welsh

You can have a look at the papers about social networks from the WWW 2010 conference.

There are two broad problems:

1. Community detection in networks [which is similar to clustering, in a sense]

2. Link prediction [which is similar to supervised classification/regression]

You can have a look at papers by Jon Kleinberg, Aaron Clauset, Mark Newman, and Jure Leskovec, to name a few.

(Jul 24 '10 at 14:40) DirectedGraph

Several I'd be tempted to mention (boosting, active learning, a recognition that input dimensionality is largely irrelevant with proper regularization, etc.) originated a bit before 1995 even if they came to fruition in that interval.

So my vote for one of the most important ideas that originated in 1995-2005 would be the understanding that more data is better than more optimization in compute-limited contexts, providing part of the explanation for why stochastic gradient can be a great learning approach even though it's a lousy optimization approach. See in particular:

Léon Bottou and Yann LeCun: Large Scale Online Learning, Advances in Neural Information Processing Systems 16, Edited by Sebastian Thrun, Lawrence Saul and Bernhard Schölkopf, MIT Press, Cambridge, MA, 2004.

Léon Bottou and Olivier Bousquet: The Tradeoffs of Large Scale Learning, Advances in Neural Information Processing Systems, 20:161–168, Edited by J.C. Platt, D. Koller, Y. Singer and S. Roweis, NIPS Foundation (http://books.nips.cc), 2008.

Prior to this work, I used to say there was no such thing as a "learning algorithm", just loss functions and optimization algorithms. Bottou changed my mind.
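To illustrate the "lousy optimizer, great learner" point above, here is a toy sketch of plain SGD on the logistic loss, making a single pass over a stream of synthetic examples; the data, dimensions, and step-size schedule are all made up for the example:

    import numpy as np

    def sgd_logistic(stream, dim, lr=0.1):
        """stream yields (x, y) pairs with x a length-`dim` array and y in {-1, +1}."""
        w = np.zeros(dim)
        for t, (x, y) in enumerate(stream, start=1):
            margin = y * w.dot(x)
            grad = -y * x / (1.0 + np.exp(margin))   # gradient of log(1 + exp(-y * w.x))
            w -= (lr / np.sqrt(t)) * grad            # decaying step size
        return w

    # usage: one pass over 10,000 synthetic examples
    rng = np.random.default_rng(0)
    stream = ((x, 1.0 if x[0] > 0 else -1.0) for x in rng.standard_normal((10000, 5)))
    w = sgd_logistic(stream, dim=5)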

answered Aug 26 '11 at 15:20

Dave Lewis

ICA was quite influential, but has completely disappeared from the limelight. LDA, while not exactly original (directed graphical models were known, and it is just a more Bayesian version of pLSA), has also sparked a lot of derivative research.

Maybe deep learning (or the neural net renaissance in general) should be in the list as well, although it still seems to be in the process of proving itself useful.

L1 regularization, the lasso, and the general issue of sparsity also fit the bill, and have been influential (though not as influential as they should be outside ML itself?).
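As a small illustration of why L1 gives sparsity, here is a sketch of proximal gradient descent (ISTA) for the lasso, where the soft-thresholding step is what zeroes out coefficients. This is a toy example of mine, not a reference implementation:

    import numpy as np

    def soft_threshold(v, tau):
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def lasso_ista(X, y, lam, n_iters=500):
        """Minimize 0.5 * ||Xw - y||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
        step = 1.0 / np.linalg.norm(X, 2) ** 2       # 1 / Lipschitz constant of the smooth part
        w = np.zeros(X.shape[1])
        for _ in range(n_iters):
            grad = X.T @ (X @ w - y)
            w = soft_threshold(w - step * grad, step * lam)
        return w

    # toy usage: only the first 3 of 20 features matter
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    y = X[:, :3] @ np.array([2.0, -1.0, 0.5])
    print(np.round(lasso_ista(X, y, lam=5.0), 2))    # most coefficients should come out (near) zero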

answered Jul 11 '10 at 17:31

Alexandre Passos ♦

edited Jul 11 '10 at 17:33


Deep learning didn't appear on the scene until late 2006 (or was it 2007?), unless you count LeCun's convolutional networks.

(Jul 13 '10 at 11:22) Joseph Turian ♦♦

Yeah, deep learning is great, but it's still in a somewhat embryonic stage. Convolutional nets definitely don't count as real deep learning; trust me, I've worked with them. They're effective and interesting, sure, but not on the same level of potential as RBMs and their progeny, IMO.

(Jul 22 '10 at 19:44) Jacob Jensen

I'd also count Mechanical Turk as one of those.

answered Jul 28 '10 at 02:19

spinxl39

Not sure if all of these really fall in the specified era, but I think they overlap with it: ensemble methods, kernel methods, dimensionality reduction methods (LSA, pLSA, LDA, random projections [the Johnson-Lindenstrauss lemma dates from 1984, but we were all a little slow getting it :)], and their approximations like the Nyström method and coresets), and L1 regularization (lasso).
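As a quick illustration of the random-projection idea in the Johnson-Lindenstrauss spirit, here is a toy sketch (all the numbers are made up) showing that a Gaussian projection roughly preserves pairwise distances:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 200, 10000, 400                       # points, original dim, target dim
    X = rng.standard_normal((n, d))
    R = rng.standard_normal((d, k)) / np.sqrt(k)    # random projection matrix
    Z = X @ R                                       # projected data, shape (n, k)

    # pairwise distances are approximately preserved
    i, j = 3, 17
    orig = np.linalg.norm(X[i] - X[j])
    proj = np.linalg.norm(Z[i] - Z[j])
    print(orig, proj, proj / orig)                  # the ratio should be close to 1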

answered Aug 26 '11 at 18:15

Daniel Mahler

edited Aug 26 '11 at 18:16

FDR (False discovery rate) methods, specifically the BH method.

It is a way to correct for multiple comparisons when performing multiple hypothesis testing.

To my knowledge, it is one of the more influential and important notions in the "data intensive statistical analysis" era.

Here is one of the first articles that introduced this notion to the scientific community: Benjamini, Yoav; Hochberg, Yosef (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing". Journal of the Royal Statistical Society, Series B, 57(1), 289-300.
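For the curious, here is a short sketch of the BH step-up procedure, written from the standard description of the method rather than taken from the paper:

    import numpy as np

    def benjamini_hochberg(pvals, alpha=0.05):
        """Return a boolean mask of hypotheses rejected at FDR level alpha."""
        p = np.asarray(pvals)
        m = len(p)
        order = np.argsort(p)
        thresholds = alpha * np.arange(1, m + 1) / m        # i/m * alpha
        below = p[order] <= thresholds
        rejected = np.zeros(m, dtype=bool)
        if below.any():
            k = np.max(np.nonzero(below)[0])                # largest rank passing its threshold
            rejected[order[: k + 1]] = True                 # reject it and every smaller p-value
        return rejected

    # usage
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]))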

answered Jul 11 '10 at 04:40

Tal Galili

What is the BH method?

(Jul 13 '10 at 11:21) Joseph Turian ♦♦

I think BH is Benjamini-Hochberg, authors of the paper (linked by Tal) which describes the method.

(Jul 24 '10 at 03:42) ars