I have a simple neural network that assigns IOB labels to words, and 99%+ of my labels are O (outside any sequence). The architecture is simple: word embeddings as the input layer, a hidden layer of 80 neurons, and a categorical output layer of 7 neurons (O plus 3 pairs of B/I labels). With Adagrad and categorical cross-entropy loss, the loss drops to very small values very quickly, and I wonder whether the network can still train at all with gradients that small. For example, these are my training and validation loss/accuracy values after 146 epochs:
Is there anything I can do to make training effective in such a case? I have already created balanced training and validation sets (the same number of sentences with and without interesting B/I sequences), but that doesn't solve the problem.
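
For reference, here is a minimal sketch of the setup described above (Keras-style; the vocabulary size, embedding dimension, and per-word context window are placeholder assumptions, not my actual values):

```
# Minimal sketch of the described model; VOCAB_SIZE, EMBED_DIM and WINDOW
# are illustrative placeholders.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense

VOCAB_SIZE = 20000  # assumed vocabulary size
EMBED_DIM = 100     # assumed embedding dimension
WINDOW = 5          # assumed number of context words per prediction

model = Sequential([
    Input(shape=(WINDOW,)),
    Embedding(VOCAB_SIZE, EMBED_DIM),   # word embeddings as the input layer
    Flatten(),
    Dense(80, activation="relu"),       # hidden layer of 80 neurons
    Dense(7, activation="softmax"),     # O + 3 pairs of B/I labels
])
model.compile(
    optimizer="adagrad",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```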