One of the problems in bioinformatics is gene finding, where we're given training data of the form:

X: a string of characters from a finite alphabet (e.g. A, C, G, T)

Y: a string of labels, one for each character in X, also from some finite alphabet (e.g. in a gene, not in a gene, or in a reverse-coding gene)

At test time we're given another string X and are asked to produce Y, or, for a given nucleotide x, predict its function y.

Traditionally this problem has been solved with Hidden Markov Models or Conditional Random Fields, but I'm wondering what other models would be a good fit.
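For reference, the HMM baseline mentioned above can be sketched in its simplest supervised form: since every character in the training data carries a label, the start, transition and emission probabilities reduce to smoothed counts, and Viterbi decoding produces Y for a new X. This is only a minimal sketch, not a real gene finder; the two-letter label alphabet (N = not in a gene, G = in a gene), the function names and the pseudocount are all assumptions made for illustration.

```python
import numpy as np

NUCS = "ACGT"
LABELS = "NG"  # assumed toy label alphabet: N = not in a gene, G = in a gene

def fit_supervised_hmm(pairs, pseudocount=1.0):
    """Estimate log-probabilities by counting over fully labelled (X, Y) pairs."""
    S, K = len(LABELS), len(NUCS)
    start = np.full(S, pseudocount)
    trans = np.full((S, S), pseudocount)
    emit = np.full((S, K), pseudocount)
    for x, y in pairs:
        states = [LABELS.index(c) for c in y]
        symbols = [NUCS.index(c) for c in x]
        start[states[0]] += 1
        for t, (s, o) in enumerate(zip(states, symbols)):
            emit[s, o] += 1
            if t > 0:
                trans[states[t - 1], s] += 1
    # Normalise each row to a distribution and move to log space.
    normalise = lambda m: np.log(m / m.sum(axis=-1, keepdims=True))
    return normalise(start), normalise(trans), normalise(emit)

def viterbi(x, start, trans, emit):
    """Return the most probable label string for the sequence x."""
    symbols = [NUCS.index(c) for c in x]
    T, S = len(symbols), len(LABELS)
    score = start + emit[:, symbols[0]]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans  # cand[i, j]: best path ending in i, then i -> j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emit[:, symbols[t]]
    # Follow the back-pointers from the best final state.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return "".join(LABELS[s] for s in reversed(path))

# Toy usage with made-up training data.
params = fit_supervised_hmm([("AAAACCCCAAAA", "NNNNGGGGNNNN")])
print(viterbi("AACCAA", *params))
```

Decoding is linear in the sequence length, which is one reason this family of models copes with very long sequences; the price is that each label depends only on its immediate neighbour unless the state diagram is elaborated by hand.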

Recurrent Neural Networks, trained appropriately, might be able to model the long-range dependencies required to actually predict where a gene starts and stops. However, the modern papers on this model (Graves' thesis using LSTM, and Martens' Hessian-free optimized version with structural damping) always refer to the input vectors as real-valued, whereas a symbol in a DNA sequence is a one-of-K vector. Additionally, Hinton at one point refers to sequences of length 100 as "long sequences" in the context of RNNs. Is it safe to say that RNNs would be unable to process sequences of many thousands of characters?
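On the input-representation point, it may help to note that a one-of-K vector is itself a real-valued vector whose entries happen to be 0.0 or 1.0, so a DNA string can be turned into a matrix of floats with one row per nucleotide. The following is a minimal sketch using NumPy; the function name `one_hot` and the alphabet ordering are assumptions chosen for illustration, not something taken from the papers mentioned.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed ordering; any fixed ordering works

def one_hot(seq, alphabet=ALPHABET):
    """Encode a string as a (len(seq), len(alphabet)) float matrix."""
    index = {c: i for i, c in enumerate(alphabet)}
    cols = np.array([index[c] for c in seq], dtype=int)
    out = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    out[np.arange(len(seq)), cols] = 1.0  # one 1.0 per row, at the symbol's column
    return out

encoded = one_hot("GATTACA")
print(encoded.shape)  # (7, 4)
```

Each row sums to 1, and the matrix can be passed to any model that expects a sequence of real-valued input vectors.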

HMMs used as gene predictors usually incorporate a rather elaborate state diagram to encode prior biological knowledge. A recent development in nonparametric Bayesian statistics is the "infinite" HMM, where the model structure is learned from the data using a hierarchical Dirichlet process. Emily Fox used a modified version of this model to do speaker separation in audio, while Eric Sudderth (I think) did change-point detection in NMR data. Neither of these is a supervised learning problem, and I don't think the sequences are long enough to be comparable. I don't know enough about this framework to determine whether it would be feasible to use it for gene prediction, so any input is welcome in that regard.

Suggestions for any other models that are good at long-range dependencies and can handle long sequences would be much appreciated. Thanks a lot!

asked Feb 05 '14 at 15:21

ncryer

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.