
The list of ICML accepted papers is out.

I looked through it yesterday and read most of the abstracts.

First, I'll mention a few papers that caught my interest (caveat: an amateur's opinion, based only on a first glance):

  • The Hierarchical Beta Process for Convolutional Factor Analysis and Deep Learning: Deep learning with a graphical model. I like this sort of cross-fertilization.
  • On the Integration of Topic Modeling and Dictionary Learning: The title is self-explanatory. Another interesting hybrid of ideas from topic modeling and other areas of machine learning.
  • Infinite SVM: a Dirichlet Process Mixture of Large-margin Kernel Machines: Is there anything we can't mash together with a nonparametric process? Apparently not, and I mean that as a compliment.
  • Uncovering the Temporal Dynamics of Diffusion Networks: A tough problem that one of the authors has attacked before. This paper continues to elucidate the nature of the problem, simplifies it a great deal, and produces solid results: all you have to do is plug a simple program into existing solver software.
  • Automatic Feature Decomposition for Single View Co-training: Find an optimal split of your training set into sets of features, then use classifiers based on the different sets to help each other make predictions. A neat extension of ordinary co-training (a minimal sketch of the ordinary co-training loop follows this list).
  • The Constrained Weight Space SVM: Learning with Ranked Features: Label features instead of samples, then use an SVM built on this idea for transfer learning.
  • On Random Weights and Unsupervised Feature Learning: Another "mythbusters" type paper out of Stanford, confirming that random filters can find reasonable features.
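For reference, here is a minimal sketch of the ordinary co-training loop that the single-view paper extends, assuming the split into two feature views (X1/X2) is already given; finding that split automatically is the paper's contribution and is not shown here. The classifiers, confidence rule, and batch size are generic placeholder choices of mine.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1_L, X2_L, y_L, X1_U, X2_U, rounds=5, k=10):
    """Ordinary co-training: two classifiers, one per feature view, take turns
    pseudo-labeling the unlabeled points they are most confident about and
    feeding those points back into the shared labeled set."""
    clf1 = LogisticRegression(max_iter=1000)
    clf2 = LogisticRegression(max_iter=1000)
    unlabeled = np.arange(len(X1_U))
    for _ in range(rounds):
        if len(unlabeled) == 0:
            break
        clf1.fit(X1_L, y_L)
        clf2.fit(X2_L, y_L)
        for clf, X_U in ((clf1, X1_U), (clf2, X2_U)):
            if len(unlabeled) == 0:
                break
            conf = clf.predict_proba(X_U[unlabeled]).max(axis=1)
            picked = unlabeled[np.argsort(conf)[-k:]]   # k most confident unlabeled points
            pseudo = clf.predict(X_U[picked])           # this view's predicted labels
            X1_L = np.vstack([X1_L, X1_U[picked]])      # both views receive the new points
            X2_L = np.vstack([X2_L, X2_U[picked]])
            y_L = np.concatenate([y_L, pseudo])
            unlabeled = np.setdiff1d(unlabeled, picked)
    return clf1, clf2
```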

Two more papers I've gotten through recently:

  • Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection: This is a great, clear paper on why greed is good for dictionary selection and similar problems; it introduces a "submodularity-ishness" measure that seems to have a lot of potential. It won an award.
  • Efficient Sparse Modeling with Automatic Feature Grouping: A really interesting paper on an alternative to feature subset selection methods like the LASSO: feature grouping. Apart from being neat in and of itself, it looks like a better way to regularize many problems than the LASSO: features are divided into groups in which every feature gets the same weight, so each group's sum acts as a sort of "super-feature". Empirically it outperforms the LASSO and the elastic net on some problems (a rough sketch of the shared-weight idea follows this list).
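To make the "super-feature" idea concrete, here is a minimal numpy sketch of what a fixed grouping does: columns sharing a group are summed into one super-feature that receives a single shared weight. This only illustrates the shared-weight structure; the paper's method learns the grouping automatically, whereas the group assignment below is supplied by hand, and the ridge penalty is just a placeholder regularizer.

```python
import numpy as np

def fit_grouped_ridge(X, y, groups, lam=1.0):
    """Least squares with one shared weight per feature group: columns that
    share a group id are summed into a single 'super-feature', a ridge
    penalty is applied to the per-group weights, and every original feature
    then inherits its group's weight."""
    group_ids = sorted(set(groups))
    # Collapse each group's columns into one summed column.
    Z = np.column_stack([X[:, [j for j, g in enumerate(groups) if g == gid]].sum(axis=1)
                         for gid in group_ids])
    d = Z.shape[1]
    w_group = np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
    return np.array([w_group[group_ids.index(g)] for g in groups])

# Toy usage: six features, forced into two groups of three.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 0.8 * X[:, :3].sum(axis=1) - 0.5 * X[:, 3:].sum(axis=1) + rng.normal(scale=0.1, size=200)
print(fit_grouped_ridge(X, y, groups=[0, 0, 0, 1, 1, 1]))  # roughly [0.8]*3 + [-0.5]*3
```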

The word counts for some selected terms (whim-based; maybe someone should try topic modeling on the abstracts instead), counting repeated mentions within a single abstract, break down as follows (a rough sketch of how such counts can be computed appears after the list):

Data: 163, Learning: 210, Feature: 86

Text: 27, Image: 21, Audio: 7, Video: 5, Time series: 4

Supervised: 6, Unsupervised: 9, Semi-supervised: 14, Active: 11, Multi-view: 9

Cluster: 69, Discriminative: 4, Generative: 18, Embedding: 15, Classification: 52, Ranking: 15, Policy: 32, Hashing: 6, Recognition: 10, Margin: 20, Regression: 18

Kernel: 72, Support vector: 16, SVM: 32, RKHS: 5, Hilbert: 5, Block coordinate descent: 5

Manifold: 16, PCA: 21 (about 17 in a single abstract), Projection: 6

Spars*: 37, Completion: 7, Decomposition: 12, Matrix: 45, Optimiz*: 54, Objective: 20, Dictionary: 7

Tree: 55, Graphical: 9, Topic: 13, MCMC: 10, MAP: 25, Variational: 19, Bayesian: 27

Graph: 60, Network: 44, Edge: 21, Pair: 18, Link: 8

Deep: 26, Boltzmann: 5, RBM: 8, Convolution: 8, Neural net: 10

Nearest neighbor: 6, Sampling: 22, Boost: 19, Ensemble: 5, Forest: 1

State-of-the-art: 23, Improve*: 40
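For what it's worth, counts like these are easy to reproduce. Here is a minimal sketch, assuming the abstracts have been saved as one plain-text file per paper in a local directory; the directory name and term list below are placeholders, not how the numbers above were actually produced.

```python
import re
from pathlib import Path

# Hypothetical location: one plain-text file per abstract.
abstracts = [p.read_text(encoding="utf-8").lower()
             for p in Path("icml2011_abstracts").glob("*.txt")]

# Every term is counted as a raw substring, so a stem like "spars" also
# counts "sparse", "sparsity", etc., and repeated mentions within a single
# abstract are all counted, as in the list above.
terms = ["data", "learning", "feature", "kernel", "svm", "spars", "optimiz", "improve"]

for term in terms:
    total = sum(len(re.findall(re.escape(term), text)) for text in abstracts)
    print(f"{term}: {total}")
```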

asked May 26 '11 at 12:15 by Jacob Jensen

edited Jun 30 '11 at 17:50

I find it a bit strange that the accepted papers are just listed in a single random list. Has anyone tried clustering this, or categorizing it in some way? I think I had some document clustering code lying around...hmm

(May 28 '11 at 15:40) karpathy
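A minimal sketch of the kind of document clustering karpathy mentions, assuming the abstracts are available as a list of strings: TF-IDF plus k-means with scikit-learn, with the number of clusters an arbitrary choice of mine.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_abstracts(abstracts, n_clusters=10):
    """Cluster paper abstracts by TF-IDF similarity with k-means and print
    the most characteristic terms of each cluster."""
    tfidf = TfidfVectorizer(stop_words="english", max_df=0.8)
    X = tfidf.fit_transform(abstracts)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(X)
    terms = tfidf.get_feature_names_out()
    for c in range(n_clusters):
        top = km.cluster_centers_[c].argsort()[::-1][:8]   # highest-weight terms in the centroid
        print(c, [terms[i] for i in top])
    return labels

# Toy usage with four stand-in "abstracts".
docs = ["sparse coding and dictionary learning for image patches",
        "convolutional deep belief networks for image recognition",
        "dirichlet process mixtures for nonparametric bayesian clustering",
        "variational inference for topic models and bayesian clustering"]
cluster_abstracts(docs, n_clusters=2)
```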

4 Answers:

Bayesian Learning via Stochastic Gradient Langevin Dynamics

Abstract: In this paper we propose a new framework for learning from large scale datasets based on iterative learning from small mini-batches. By adding the right amount of noise to a standard stochastic gradient optimization algorithm we show that the iterates will converge to samples from the true posterior distribution as we anneal the stepsize. This seamless transition between optimization and Bayesian posterior sampling provides an in-built protection against overfitting. We also propose a practical method for Monte Carlo estimates of posterior statistics which monitors a "sampling threshold" and collects samples after it has been surpassed. We apply the method to three models: a mixture of Gaussians, logistic regression and ICA with natural gradients.

(I won't try to summarize the paper; as a whole it is well written and astonishing.)
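For intuition, here is a minimal numpy sketch of the update the abstract describes, shown for Bayesian logistic regression: an ordinary stochastic gradient step on a minibatch estimate of the log-posterior gradient, plus Gaussian noise whose variance equals the annealed step size. The prior, step-size schedule, burn-in rule, and toy data are my own placeholder choices; in particular, the paper's "sampling threshold" is a more principled criterion than the crude burn-in used here.

```python
import numpy as np

def sgld_logistic_regression(X, y, n_iters=5000, batch=32, a=0.01, b=10.0, gamma=0.55):
    """Stochastic Gradient Langevin Dynamics for Bayesian logistic regression
    with a standard normal prior: each iterate takes a stochastic gradient
    step on the minibatch estimate of the log-posterior gradient and adds
    Gaussian noise whose variance equals the annealed step size."""
    N, d = X.shape
    theta = np.zeros(d)
    rng = np.random.default_rng(0)
    samples = []
    for t in range(1, n_iters + 1):
        eps_t = a * (b + t) ** (-gamma)                      # annealed step size
        idx = rng.choice(N, size=batch, replace=False)
        p = 1.0 / (1.0 + np.exp(-(X[idx] @ theta)))
        grad_lik = X[idx].T @ (y[idx] - p)                   # minibatch log-likelihood gradient
        grad = -theta + (N / batch) * grad_lik               # prior gradient + rescaled likelihood term
        noise = rng.normal(scale=np.sqrt(eps_t), size=d)     # injected Langevin noise
        theta = theta + 0.5 * eps_t * grad + noise
        if t > n_iters // 2:                                 # crude burn-in, not the paper's threshold
            samples.append(theta.copy())
    return np.array(samples)

# Toy usage on synthetic data: the posterior mean should sit near the true weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)
print(sgld_logistic_regression(X, y).mean(axis=0))
```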

answered May 27 '11 at 08:15 by osdf

I skimmed over this abstract, even though it's actually really closely related to some stuff I've studied! Glad you pointed it out.

(May 27 '11 at 11:24) Jacob Jensen

This paper is amazing, thank you very much for bringing it to my attention. I'm looking forward to trying this technique on my next models and seeing how it compares to the usual culprits.

(Jun 02 '11 at 09:28) Alexandre Passos ♦

There is, of course, the use of Martens' Hessian-free optimizer to train recurrent neural networks that generate text.

"Generating Text with Recurrent Neural Networks" and "Learning Recurrent Neural Networks with Hessian-Free Optimization". They beat the longstanding state of the art for learning long-term dependencies in time series, the LSTM. While the achievement is big, I wonder how HF+LSTM would work.

Anyway, here is what their RNN produced, character by character.

The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger. In the show’s agreement unanimously resurfaced. The wild pasteured with consistent street forests were incorporated by the 15th century BE. In 1996 the primary rapford undergoes an effort that the reserve conditioning, written into Jewish cities, sleepers to incorporate the .St Eurasia that activates the population. María Nationale, Kelli, Zedlat-Dukastoe, Florendon, Ptu’s thought is. To adapt in most parts of North America, the dynamic fairy Dan please believes, the free speech are much related to the
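As a reminder of what "character by character" means mechanically, here is a minimal numpy sketch of the sampling loop for a plain tanh RNN: feed in the previous character, update the hidden state, and draw the next character from the softmax over the outputs. The paper actually uses a multiplicative RNN trained with HF; the weights below are random and untrained, so the output is gibberish and purely illustrative.

```python
import numpy as np

def sample_chars(Wxh, Whh, Why, bh, by, chars, seed_ix, n_chars, temperature=1.0):
    """Sample text from a plain tanh RNN one character at a time: feed the
    previous character in, update the hidden state, and draw the next
    character from the softmax over the output logits."""
    rng = np.random.default_rng(0)
    V = len(chars)                        # vocabulary size
    h = np.zeros(Whh.shape[0])            # hidden state
    x = np.zeros(V)
    x[seed_ix] = 1.0                      # one-hot encoding of the seed character
    out = [chars[seed_ix]]
    for _ in range(n_chars):
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        logits = Why @ h + by
        p = np.exp(logits / temperature)
        p /= p.sum()                      # softmax over the next character
        ix = rng.choice(V, p=p)
        x = np.zeros(V)
        x[ix] = 1.0
        out.append(chars[ix])
    return "".join(out)

# Toy usage with random, untrained weights over a tiny alphabet.
chars = list("abcdefgh ")
V, H = len(chars), 16
rng = np.random.default_rng(1)
print(sample_chars(rng.normal(scale=0.1, size=(H, V)),
                   rng.normal(scale=0.1, size=(H, H)),
                   rng.normal(scale=0.1, size=(V, H)),
                   np.zeros(H), np.zeros(V), chars, seed_ix=0, n_chars=60))
```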

answered May 29 '11 at 05:00 by Justin Bayer

Martens and Sutskever actually had two papers accepted to ICML: one on using an RNN to generate text, the other on training RNNs with a Hessian-free CG optimizer. I'm in the process of constructing an RNN right now, though it's slow going (mostly due to time constraints).

(Jun 09 '11 at 11:23) Brian Vandenberg

Is there an implementation of the Hessian-free optimization technique that one can download?

(Jun 13 '11 at 14:39) Frank

The code for the first HF paper (ICML 2010) is available on J. Martens' site here: http://www.cs.toronto.edu/~jmartens/docs/HFDemo.zip

(Jun 13 '11 at 14:45) crdrn

"On RandomWeights and Unsupervised Feature Learning"

I wanted to highlight this paper again because I noticed that it makes this comment about pre-training convolutional architectures:

"However, we find that the performance improvement can be modest and sometimes smaller than the performance differences due to architectural parameters."

This result is surprising, given the growing body of work these days on pre-training neural networks. Isn't LeCun moving in the direction of applying pre-training to his CNNs? (Hinton mentioned it in one of his video lectures.)
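For a sense of what "random filters" means in practice, here is a minimal sketch of the kind of pipeline evaluated in that line of work: an untrained bank of Gaussian random convolutional filters, rectification, and average pooling, with the resulting features handed to any linear classifier. The filter size, pooling, and nonlinearity below are generic choices of mine, not the paper's exact architectures.

```python
import numpy as np
from scipy.signal import convolve2d

def random_conv_features(images, n_filters=16, k=7, pool=4, seed=0):
    """Extract features with random, *untrained* convolutional filters:
    convolve each image with a bank of Gaussian random filters, rectify,
    and average-pool; the result can be fed to any linear classifier."""
    rng = np.random.default_rng(seed)
    filters = rng.normal(size=(n_filters, k, k))
    features = []
    for img in images:
        maps = []
        for f in filters:
            r = np.maximum(convolve2d(img, f, mode="valid"), 0.0)  # rectified response map
            h, w = r.shape
            r = r[:h - h % pool, :w - w % pool]                    # crop to a multiple of the pool size
            r = r.reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3))  # average pooling
            maps.append(r.ravel())
        features.append(np.concatenate(maps))
    return np.array(features)

# Toy usage: five random 28x28 "images" -> a (5, 16 * 5 * 5) feature matrix.
imgs = np.random.default_rng(1).normal(size=(5, 28, 28))
print(random_conv_features(imgs).shape)
```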

answered Jun 09 '11 at 10:50 by crdrn

edited Jun 09 '11 at 10:50

Active Learning from Crowds

Abstract: Obtaining labels is expensive or time-consuming, but unlabeled data is often abundant and easy to obtain. Many learning tasks can profit from intelligently choosing unlabeled instances to be labeled by an oracle, also known as active learning, instead of simply labeling all the data or randomly selecting data to be labeled. Supervised learning traditionally relies on an oracle playing the role of a teacher. In the multiple annotator paradigm, an oracle, who knows the ground truth, no longer exists; instead, multiple labelers, with varying expertise, are available for querying. This paradigm poses new challenges to the active learning scenario. We can ask which data sample should be labeled next and which annotator should be queried to benefit our learning model the most. In this paper, we develop a probabilistic model for learning from multiple annotators that can also learn the annotator expertise even when their expertise may not be consistently accurate (or inaccurate) across the task domain. In addition, we provide an optimization formulation that allows us to simultaneously learn the most uncertain sample and the annotator/s to query the labels from for active learning. Our active learning approach combines both intelligently selecting samples to label and learning from expertise among multiple labelers to improve learning performance.

I feel like this is a problem many of us tackle often while trying to do what we really care about. Having somebody take the time to solve this in a reasonable fashion is very nice.
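Just to illustrate the two decisions the abstract couples, here is a toy sketch that picks the most uncertain unlabeled sample (by predictive entropy) and the currently most trusted annotator. This is emphatically not the paper's probabilistic model or its joint optimization; in particular, the paper lets annotator expertise vary across the input space, whereas the sketch uses one global accuracy estimate per annotator.

```python
import numpy as np

def pick_query(proba_unlabeled, annotator_accuracy):
    """Toy stand-in for the joint choice: pick the unlabeled sample the
    current model is most uncertain about (maximum predictive entropy)
    and the annotator currently estimated to be most accurate."""
    entropy = -(proba_unlabeled * np.log(proba_unlabeled + 1e-12)).sum(axis=1)
    sample = int(np.argmax(entropy))                 # most uncertain sample
    annotator = int(np.argmax(annotator_accuracy))   # most trusted annotator
    return sample, annotator

# Toy usage: three unlabeled samples, three annotators.
proba = np.array([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]])
print(pick_query(proba, annotator_accuracy=np.array([0.7, 0.9, 0.6])))
```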

answered Jun 11 '11 at 00:32 by zaxtax ♦
