Has anybody ever considered developing a more abstract language model, based solely on characters instead of words? I thought an interesting approach would be to train a language model by showing it various paragraphs of text, but without ever explicitly modeling words themselves, i.e. the network learns only from the patterns of individual characters. In this way it could discover words, spellings, syntax, punctuation, phonetics, semantics, topics, unknown features, etc., all as a by-product of the character patterns it learns. I thought the same kinds of techniques used in image recognition could apply (characters instead of pixels). The original reason for this idea was that I wanted to develop an elegant (simple) model that could be used to correct spellings and add missing punctuation in an input document. Here are a couple of research papers where people have looked into character-based text generation.
Both of the authors are looking at ways of generating a probability distribution for the next character in a sequence. But I'm interested in looking at it in terms of finding the probability distribution over a set of likely alternative documents, with all the characters in the document given as inputs at the same time. Does anybody think this would be a good idea, and which kind of NN architecture would be most suitable? And how would one actually do it?
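To make the next-character formulation concrete, here is a minimal sketch of a character-level model. It uses plain n-gram counts as a stand-in for the neural network being discussed, and the `PAD` symbol, the `order=4` context length, and the toy training string are illustrative assumptions, not anything from the papers:

```python
from collections import Counter, defaultdict

PAD = "\x00"  # hypothetical padding symbol for the start of a document

def train_char_ngram(text, order=4):
    """Count how often each character follows each length-`order` context."""
    counts = defaultdict(Counter)
    padded = PAD * order + text
    for i in range(order, len(padded)):
        counts[padded[i - order:i]][padded[i]] += 1
    return counts

def next_char_distribution(counts, context, order=4):
    """P(next char | last `order` characters), read off the raw counts."""
    key = context[-order:].rjust(order, PAD)
    total = sum(counts[key].values())
    return {ch: n / total for ch, n in counts[key].items()} if total else {}

# Word boundaries and spellings emerge purely from character statistics:
model = train_char_ngram("the cat sat on the mat. the cat ran away.")
print(next_char_distribution(model, "the c"))  # -> {'a': 1.0}
```

A neural version would replace the count table with a learned function of the context, but the interface is the same: characters in, a distribution over the next character out.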
Working at the character level is very, very difficult. One reason is purely computational: you have to process many more time steps (and have a model that learns dependencies spanning many more time steps) when a single time step is one character rather than an entire word token. You also need models that waste a lot of capacity just on getting words spelled mostly correctly. Working with an input representation below whole words but above characters seems best to me, especially if the model can handle morphology cleanly and generalize across words based on their spellings. That said, I have no idea which neural net architecture is best for this. I also don't understand the part of your question about "the probability distribution of a set of likely alternative documents, with all the characters in the document being given as inputs at the same time".
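To illustrate what a representation between characters and whole words could look like, here is a minimal sketch of greedy longest-match subword segmentation. The vocabulary here is a hypothetical hand-picked set of chunks; a real one would be learned from corpus statistics:

```python
def segment(word, subwords):
    """Greedy longest-match segmentation into subword units; falls back to
    single characters, so any string can be encoded."""
    units, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in subwords or j == i + 1:
                units.append(word[i:j])
                i = j
                break
    return units

# Hypothetical vocabulary; a learned one would come from frequency counts,
# e.g. by repeatedly merging the most frequent adjacent pairs in a corpus.
vocab = {"un", "believ", "ing", "ly", "like"}
print(segment("unbelievingly", vocab))  # -> ['un', 'believ', 'ing', 'ly']
print(segment("xyz", vocab))            # -> ['x', 'y', 'z']
```

Because the units share pieces across words ("un" + "believ" + "ing" + "ly"), a model over them can generalize morphologically without spending capacity on spelling every word letter by letter.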
I don't know whether it helps, but check out this paper: Subword language modeling with neural networks.
Yes, that's an interesting paper. In the conclusion they say: "This means that none of the studied models, including RNNs, is powerful enough to learn all discoverable patterns from the character-level input. This provides further motivation for research of training algorithms for recurrent neural network models." I wonder if they have considered the deep networks described by Hinton instead of the time-based RNNs.
Ilya Sutskever and the other authors are well aware of deep neural networks.