|
When training a neural network (or autoencoder) with regularizing penalty terms added to the objective (L1 or L2 weight decay, a contractive penalty, or a manifold tangent penalty), should the validation error used for early stopping be the unpenalized objective function? I have not seen this stated anywhere in the literature, but it makes sense to me... The penalty terms modify the gradient in a way that regularizes training, but the actual measure of how well the model generalizes to unseen data should be based on the unpenalized objective function (which purely measures goodness of fit), right? Am I missing something?
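For concreteness, here is a minimal NumPy sketch of the setup I mean, with made-up data and placeholder hyperparameters (`lam`, `lr`, and `patience` are illustrative, not recommendations): the training gradient includes the L2 penalty, but early stopping monitors the plain, unpenalized validation MSE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X @ w_true + noise, split into train/validation.
X = rng.normal(size=(200, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.5 * rng.normal(size=200)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

lam = 0.1        # L2 penalty strength (assumed value)
lr = 0.01        # learning rate
patience = 10    # early-stopping patience in epochs

w = np.zeros(10)
best_val, best_w, bad_epochs = np.inf, w.copy(), 0

for epoch in range(1000):
    # Gradient of the PENALIZED training objective: MSE + lam * ||w||^2.
    resid = X_tr @ w - y_tr
    grad = 2 * X_tr.T @ resid / len(y_tr) + 2 * lam * w
    w -= lr * grad

    # Early stopping watches the UNPENALIZED validation loss:
    # pure goodness of fit, no lam * ||w||^2 term.
    val_mse = np.mean((X_va @ w - y_va) ** 2)
    if val_mse < best_val:
        best_val, best_w, bad_epochs = val_mse, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

w = best_w  # restore the weights with the best unpenalized validation loss
```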
|
Yes, regularization should improve your model's generalizability, but it should not affect how you assess the model's quality on holdout data. I think it is clearer if you step away from the neural network context and think of a "statistical model" as a black box trained on some training data. You then have a second set of data that you use to modify the black box, in your case by setting regularization parameters.

A model's accuracy can be measured against its training data (uh-oh), against the second set of data (still iffy), or preferably against a third holdout set that has not been observed even for regularization. You can take this to the extreme by having several holdout sets, each observed only after the parameters (e.g., the L1 lambda, the neuron weights) have been fixed using the earlier holdout data.

Sometimes I use a structurally different accuracy or fitness measure, instead of the training loss function, just for discipline. For example, you might use an L1-regularized squared loss as the learner's loss function, but evaluate on the holdout with absolute loss; see the sketch below. The autoencoder case changes nothing here, since the reconstruction target plays the same role as a supervised label.

So, to put it simply: measuring accuracy is fundamentally different from optimizing a learning or training objective.
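Here is a minimal sketch of that last recipe, assuming a lasso-style linear model and made-up data (`w_true`, `lam`, and `lr` are illustrative values only): the learner minimizes the L1-regularized squared loss by subgradient descent, while the holdout is scored with mean absolute error and no penalty term at all.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data split: train for fitting, holdout only for final evaluation.
X = rng.normal(size=(300, 8))
w_true = np.array([2.0, -1.0, 0, 0, 0.5, 0, 0, 0])  # sparse ground truth
y = X @ w_true + 0.3 * rng.normal(size=300)
X_tr, y_tr, X_ho, y_ho = X[:240], y[:240], X[240:], y[240:]

lam, lr = 0.05, 0.01  # assumed penalty strength and learning rate
w = np.zeros(8)

# Learner's loss: L1-regularized squared loss, minimized by subgradient
# descent. np.sign(w) is a valid subgradient of ||w||_1 (0 at w = 0).
for _ in range(2000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr) + lam * np.sign(w)
    w -= lr * grad

# Evaluation: mean ABSOLUTE error on the holdout -- a structurally
# different measure from the training loss, with no penalty term.
holdout_mae = np.mean(np.abs(X_ho @ w - y_ho))
print(f"holdout MAE: {holdout_mae:.3f}")
```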