
I have a dataset of observations of variables over a period of x years for 1000 samples. Different samples have different observation dates, and each observation is recorded with its date.

Data looks like this:

SampleID Date Var
S1    Date1 Var1 
S1    Date1 Var20 
S1    Date2 Var32 
S1    Date2 Var11 
S1    Date2 Var1001
S2    Date1 Var3411
...

Given such a dataset, I am looking for an ideal learning algorithm that can be used to predict the next Var for a new sample. What techniques do you use to select a machine learning algorithm for a given dataset? Is there any literature or protocol that I can follow?

EDIT: I am looking for opinions on whether there is any best practice for selecting machine learning frameworks for a given data type. For example, IMHO, if someone needs to develop a multi-class classifier, an SVM is a good choice; for a binary classifier, both SVM and RF could be considered. In a similar way, is there any review, best practice, or literature on adapting machine learning frameworks to a given dataset?

Thanks.

asked Sep 26 '11 at 11:02

Khader Shameer

edited Sep 27 '11 at 10:34

I am confused by your dataset. Can you give a more specific example? What is a var? Is it a value? e.g. In what way are Var1 and Var20 related?

(Sep 26 '11 at 14:48) Justin Bayer

The data is derived from electronic medical records. Here, variables are ICD-9 codes (See: http://en.wikipedia.org/wiki/List_of_ICD-9_codes).

(Sep 26 '11 at 15:03) Khader Shameer

Is there any data about the samples besides a set of (date, ICD-9 code) pairs? What kind of dependence do you expect between the dates and the codes? Is the earliest date of a sample the date it was taken? What is the distribution of the number of observations per sample, and over what kind of time frames are they distributed? Are the codes relatively constant per sample? (I would not expect the ICD-9 classification of a single sample to vary wildly from day to day.)

(Sep 26 '11 at 19:17) Daniel Mahler

If you only have the dates, I see little you can actually do. A simple linear regression would probably be your best choice. In this simple example it looks like the Var label does not repeat, so it is infeasible to use any other algorithm.

If you have some more features, you can use nice techniques, like Markov chains, which work quite nicely for temporal data.

(Sep 27 '11 at 01:07) Leon Palafox
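
To make the Markov chain suggestion concrete, here is a minimal sketch of a first-order chain over code sequences. It assumes each sample's codes have already been put into a single date order (codes sharing a date have no natural order, so that step is an assumption), and the sequences below are made-up placeholders rather than real records.

```python
from collections import Counter, defaultdict

def fit_transitions(sequences):
    """Count first-order transitions (code -> next code) across all samples."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, last_code):
    """Return the most frequent successor of last_code, or None if it was never seen."""
    if last_code not in counts:
        return None
    return counts[last_code].most_common(1)[0][0]

# Hypothetical date-ordered code sequences, one list per sample.
sequences = [
    ["Var1", "Var20", "Var32"],
    ["Var1", "Var20", "Var11"],
    ["Var3411", "Var20", "Var32"],
]
counts = fit_transitions(sequences)
print(predict_next(counts, "Var20"))
```

With only a handful of observations per sample, most transitions will be rare, so some smoothing or backing off to overall code frequencies would probably be needed in practice.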

@Daniel: Thanks for the pointers. Samples (S1 ... Sn) also have basic demographic information such as age and sex. The dependency should be derived from, for example, how many times Var1 occurs together with Var20, and similar combinations (I am working on this aspect). The data covers approximately 20 years, so there is variation in the temporal observations of ICD-9 codes between Date1 and Date2.

(Sep 27 '11 at 10:36) Khader Shameer


@Leon: Var labels do repeat, but not in a uniform way. Thanks for pointing out this important aspect; I have edited the data to illustrate it. I am afraid linear regression may not be an ideal choice. I was looking at association rule mining, and I will also check how I can adapt my problem to Markov chains.

(Sep 27 '11 at 10:41) Khader Shameer

4 Answers:

Here's a quick answer sketch to the general question asked in the title "Given a dataset, how do you choose a machine learning algorithm?"

  • Input: Are the inputs vectors or general objects? In the latter case, k-NN or kernel-based approaches can be used.

  • Output: Is the output a real number (regression), a binary variable (binary classification), a categorical variable (multiclass classification), a set of categorical variables (multi-label prediction), a sequence (tagging) or a general object (structured prediction)? For the latter two, CRFs, structured SVMs and the structured Perceptron can be used.

  • Dimensionality and Sparsity: For high-dimensional sparse data, Pegasos or the passive-aggressive Perceptron are state-of-the-art.

  • Linear separability: For highly non-linear datasets, ANNs, Decision Trees or kernel SVMs are probably the way to go.

  • Scalability (how many training instances do you have): stochastic and randomized algorithms.

  • Prediction speed (how many predictions per second do you need to do): linear models penalized with a sparsity-inducing norm!

  • Model size: same as above.

  • Memory requirement (e.g. in mobile phones): online algorithms.
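
As an illustration of the "Dimensionality and Sparsity" bullet, here is a rough sketch of a binary passive-aggressive classifier on sparse inputs. It uses the basic update without a slack term (step size = hinge loss / squared norm of the input), stores features as dicts so that only non-zero entries are touched, and the feature names and labels are hypothetical toy data.

```python
def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())

def pa_train(examples, epochs=5):
    """Train on (features, label) pairs, with label in {-1, +1}."""
    w = {}
    for _ in range(epochs):
        for x, y in examples:
            loss = max(0.0, 1.0 - y * dot(w, x))  # hinge loss on this example
            sq_norm = sum(v * v for v in x.values())
            if loss > 0.0 and sq_norm > 0.0:
                tau = loss / sq_norm  # smallest step that removes the loss
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + tau * y * v
    return w

def pa_predict(w, x):
    """Predict +1 or -1 from the sign of the score."""
    return 1 if dot(w, x) >= 0 else -1

# Hypothetical sparse training set.
examples = [
    ({"a": 1.0, "b": 1.0}, 1),
    ({"c": 1.0, "d": 1.0}, -1),
    ({"a": 1.0}, 1),
    ({"d": 1.0}, -1),
]
w = pa_train(examples)
print(pa_predict(w, {"a": 1.0}), pa_predict(w, {"c": 1.0, "d": 1.0}))
```

Each update costs time proportional to the number of non-zero features in one example, which is why this family of online learners suits high-dimensional sparse data.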

answered Sep 27 '11 at 13:17

Mathieu Blondel

What kinds of stochastic and randomized algorithms are you referring to in your Scalability bullet point?

(Oct 03 '11 at 16:00) grautur

It will depend on what your data looks like. Have you tried plotting it? Can you assume independence in the data?

If there is some correlation between a variable and a sample, you might try something very simple like linear regression. If you are looking for the probability of VarX occurring given a randomly chosen sample, you might want to look into something like naive Bayes.

answered Sep 26 '11 at 13:35

nop

I agree, it all depends on my data. I looked at distributions too, but here I am interested in finding an ideal ML framework that can be used to train a model on 3/4 of my 1000 samples and then predict the next Var for a new sample.

(Sep 26 '11 at 15:06) Khader Shameer

To use naive Bayes, he would need the Var label to depend on some other kind of features, rather than just dates (it would be like classifying spam based on the date an email was sent).

(Sep 27 '11 at 01:05) Leon Palafox

Thanks, but deriving features is a bottleneck here. So I am thinking about something like association rule mining or a recommendation engine.

(Sep 27 '11 at 14:33) Khader Shameer

@Leon: Yes, if it's just a list of dates then I agree. If you are looking for patterns in when things arrive (e.g. bimonthly) then it might be useful.

@Khader: If you are going to try any analysis involving correlations, then you should be able to eyeball the expected correlation of the data from the covariance matrices (i.e. the shape of the ellipse you would draw around the data points), and also get some idea of the separability of the data.

(Sep 28 '11 at 11:04) nop

If you are primarily interested in the co-occurrence of the ICD-9 codes in the same sample over long periods of time, then you can adapt many of the techniques from document modelling and IR. Basically, you can treat the samples as documents and the codes as words. You can take the bag-of-words view, ignoring the temporal structure, with methods such as LSA, LDA and other spectral methods or various kinds of clustering, or you can try to capture the temporal structure with things like hidden Markov models. Pretty much anything that can be applied to documents should be applicable here. However, I suspect you will have fewer observations per sample than a typical document has words, in which case you will need to make the same kind of adjustments to your methods that people make when analysing short texts like tweets, web queries and various kinds of online comments.

In a similar way methods from collaborative filtering (recommendation systems) can be adapted by viewing samples as users and the codes as items. Actually these methods tend to be similar to the text modelling methods since the abstract structure of the data is the same.
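
A minimal sketch of the samples-as-users, codes-as-items view, using plain co-occurrence counts rather than any particular published recommender: count how often two codes appear in the same sample, then score unseen codes for a new sample by how often they co-occur with its observed codes. The sample IDs and codes are placeholders, and dates are ignored entirely.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(samples):
    """Count how many samples contain each unordered pair of codes."""
    pair_counts = Counter()
    for codes in samples.values():
        for a, b in combinations(sorted(set(codes)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def recommend(pair_counts, observed, top_n=3):
    """Rank codes not yet observed by their co-occurrence with the observed ones."""
    observed = set(observed)
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a in observed and b not in observed:
            scores[b] += n
        elif b in observed and a not in observed:
            scores[a] += n
    return [code for code, _ in scores.most_common(top_n)]

# Hypothetical training samples (bag-of-codes view).
samples = {
    "S1": ["Var1", "Var20", "Var32"],
    "S2": ["Var1", "Var20"],
    "S3": ["Var20", "Var32", "Var11"],
}
pair_counts = cooccurrence(samples)
print(recommend(pair_counts, ["Var1"]))
```

Raw counts favour codes that are simply common, so some normalisation by code frequency would likely be a sensible next step.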

[EDIT: Another thing I forgot to mention is that you could train a Bayes net or a DBN over your training samples. This would allow you to take a test sample and activate its codes, which would in turn activate other codes by association.]

As to the more general question of choosing an appropriate learning method for your data, you really need more information than regression vs binary classification vs multiclass. Some methods, like random forests, are competitive across a very wide range of problems, but to make intelligent choices about more specialised methods like boosting, kernels and SVMs, GLMs and GAMs, or regularization techniques, you really want to look at things like:

  • size of training data
  • number, sparsity & general distribution shape of your features
  • feature & label noise
  • correlated & irrelevant features
  • potential nonlinearities & feature interactions

answered Sep 27 '11 at 11:33

Daniel Mahler

edited Sep 27 '11 at 17:17

Thanks Daniel. I will explore document modelling and recommendation systems. Could you please share a few URLs for good libraries so I can explore this further? If IR = Information Retrieval, I did think of it, but I can't really generalize my data to an IR framework.

Thanks for your note on the subjective aspect of my question. I agree there are no direct indications for selecting an ML framework for a given dataset. I was looking from a broader perspective at how an appropriate framework can be selected given all these limitations. Thanks for your pointers, they are useful.

(Sep 27 '11 at 16:08) Khader Shameer

The generalization to IR is simple. If you index your training samples as documents using a search engine like Lucene, the Lemur toolkit, or Xapian, you can enter the codes of a test sample as a query and get back the set of most relevant/related samples. You can then aggregate the codes occurring in these, probably weighted by their relevance score, and this should give you a kind of smoothed distribution over codes related to your sample.
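
The retrieve-then-aggregate idea can be sketched without a search engine. In this rough example Jaccard similarity between code sets stands in for the engine's relevance score (an assumption; Lucene, Lemur and Xapian use their own ranking functions), and the training samples and query are hypothetical.

```python
from collections import Counter

def jaccard(a, b):
    """Set overlap between two code lists, used here as a stand-in relevance score."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def related_codes(train, query, k=2):
    """Aggregate the codes of the k most similar training samples, weighted by similarity."""
    ranked = sorted(train.items(), key=lambda kv: jaccard(kv[1], query), reverse=True)[:k]
    scores = Counter()
    for _, codes in ranked:
        sim = jaccard(codes, query)
        for code in set(codes) - set(query):  # only codes the query lacks
            scores[code] += sim
    return scores

# Hypothetical indexed training samples and a test sample's codes.
train = {
    "S1": ["Var1", "Var20", "Var32"],
    "S2": ["Var1", "Var20", "Var11"],
    "S3": ["Var3411"],
}
scores = related_codes(train, ["Var1", "Var20"])
print(scores.most_common())
```

Dividing the scores by their sum would turn them into the smoothed distribution over related codes described above.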

To learn more about recommendation systems, it is probably best to start at the Netflix challenge website. There were papers presented by the top teams at the end, and lots of discussions and pointers on the forums. The 2007 KDD Cup was also based around the Netflix data, and there were papers published from that. You will want to look into the "Simon Funk" method, which, while not a winner, was the best bang-for-buck method developed there.

I will see if I have a good reference handy for document/topic modelling, but if you search for latent Dirichlet allocation (LDA), latent semantic analysis (LSA) [sometimes called latent semantic indexing (LSI)] and singular value decomposition (SVD), or just "topic modelling", you will get lots of hits. Most mathematical libraries will have an SVD implementation that will scale at least to your dataset (NumPy and R do for sure). There is a topicmodels package for R on CRAN that should cover all of this.

(Sep 27 '11 at 17:12) Daniel Mahler

Thanks a lot for the suggestions and pro-tip on IR framework, very useful.

(Sep 27 '11 at 17:45) Khader Shameer

Unless you have some knowledge of your problem, I doubt you'll be able to do better than an empirical selection. An interesting way to approach this would be to "meta optimize": look at many problems of all kinds, derive some descriptive covariates, build many models on each data set, and then build a "meta" model predicting the performance of those models on the data. Then, when a new problem appears, try to use this meta model to guess which model to use.
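
For the empirical selection part (not the meta model itself), a minimal sketch is to score every candidate with the same cross-validation split and keep the best one. The two candidate models and the one-dimensional toy data below are illustrative assumptions, chosen only so the example runs with the standard library.

```python
from collections import Counter

def majority_model(train):
    """Baseline: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def nearest_neighbour_model(train):
    """1-NN on a single numeric feature."""
    def predict(x):
        return min(train, key=lambda ex: abs(ex[0] - x))[1]
    return predict

def cv_accuracy(make_model, data, folds=4):
    """Accuracy over interleaved folds: fold i tests on items i, i+folds, ..."""
    correct = 0
    for i in range(folds):
        test = data[i::folds]
        train = [ex for j, ex in enumerate(data) if j % folds != i]
        model = make_model(train)
        correct += sum(model(x) == y for x, y in test)
    return correct / len(data)

# Hypothetical (feature, label) pairs.
data = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (0.3, "a"),
        (1.0, "b"), (1.1, "b"), (1.2, "b"), (1.3, "b")]
candidates = {"majority": majority_model, "1-NN": nearest_neighbour_model}
scores = {name: cv_accuracy(make, data) for name, make in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The meta model described above would go one step further and learn, from scores like these collected over many datasets, which candidate to try first.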

answered Sep 27 '11 at 08:39

downer

The ICD-9 code itself is knowledge. What I want to achieve is to adapt this data to a suitable machine learning framework so that I can use the dataset to predict the next ICD-9 code for a new sample. Thanks for the suggestion on the meta-optimize approach. Do you have any references or literature to share?

(Sep 27 '11 at 14:34) Khader Shameer

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.