|
I have a dataset (observations of variables over a period of x years for 1000 samples). Different samples have different observation dates, and each observation is recorded with its date. The data looks like this:
Given such a dataset, I am looking for an ideal learning algorithm that can be used to predict the next Var for a new sample. What techniques do you use to select a machine learning algorithm for a given dataset? Is there any literature or protocol that I can follow? EDIT: I am looking for opinions on whether there is any best practice for selecting machine learning frameworks for a given data type. For example, IMHO, if someone needs to develop a multi-class classifier, an SVM is a good choice; for a binary classifier, both SVM and RF could be considered. In a similar way, is there any review, best practice, or literature on adapting machine learning frameworks to a given dataset? Thanks.
|
|
Here's a quick answer sketch to the general question asked in the title "Given a dataset, how do you choose a machine learning algorithm?"
What kinds of stochastic and randomized algorithms are you referring to in your Scalability bullet point?
(Oct 03 '11 at 16:00)
grautur
|
|
It will depend on what your data looks like. Have you tried plotting it? Can you assume independence in the data? If you have some correlation between a variable and a sample, you might try something very simple like linear regression. If you are looking for the probability of VarX occurring given a randomly chosen sample, you might want to look into something like naive Bayes.
I agree, it all depends on my data. I looked at the distributions too, but here I am interested in finding an ideal ML framework that I can train on 3/4 of my 1000 samples and then use to predict the next Var for a new sample.
(Sep 26 '11 at 15:06)
Khader Shameer
To use naive Bayes, he would need the Var label to depend on some other kind of features rather than just dates (otherwise it would be like classifying spam based on the date an email was sent).
(Sep 27 '11 at 01:05)
Leon Palafox
Thanks, but deriving features is a bottleneck here, so I am thinking about something like association rule mining or a recommendation engine.
(Sep 27 '11 at 14:33)
Khader Shameer
@Leon: Yes, if it's just a list of dates then I agree. If you are looking for patterns in when things arrive (e.g. bimonthly) then it might be useful. @Khader: If you are going to try any analysis involving correlations, then you should be able to eyeball the expected correlation of the data from the covariance matrices (i.e. the shape of the ellipse you draw around the data points), and also get some idea about the separability of the data.
(Sep 28 '11 at 11:04)
nop
|
|
If you are primarily interested in the co-occurrence of the ICD-9 codes in the same sample over long periods of time, then you can adapt many of the techniques from document modelling and IR. Basically, you can treat the samples as documents and the codes as words. You can take the bag-of-words view, ignoring the temporal structure, with methods like LSA, LDA and other spectral methods or various kinds of clustering, or you can try to capture the temporal structure with things like hidden Markov models. Pretty much anything that can be applied to documents should be applicable here. However, I suspect you will have fewer observations per sample than a typical document has words, in which case you will need to make the same kind of adjustments to your methods that people make when analysing short texts like tweets, web queries and various kinds of online comments.
In a similar way, methods from collaborative filtering (recommendation systems) can be adapted by viewing samples as users and codes as items. These methods tend to be similar to the text modelling methods, since the abstract structure of the data is the same.
[EDIT: Another thing I forgot to mention is that you could train a Bayes net or a DBN over your training samples. This would allow you to take a test sample and activate its codes, which would in turn activate other codes by association.]
As to the more general question of choosing an appropriate learning method for your data, you really need more information than regression vs. binary classification vs. multiclass. Some methods, like random forests, are competitive across a very wide range of problems, but to make intelligent choices about more specialised methods like boosting, kernels and SVMs, GLMs and GAMs, or regularization techniques, you really want to look at things like:
Thanks Daniel. I will explore document modelling and recommendation systems. Could you please share a few URLs for good libraries to explore this further? If IR = information retrieval, I did think of it, but I can't really generalize my data to an IR framework. Thanks for your note on the subjective aspect of my question. I agree there is no direct indication of which ML framework to select for a given dataset. I was looking from a broader perspective at how an appropriate framework can be selected given all these limitations. Thanks for your pointers, they are useful.
(Sep 27 '11 at 16:08)
Khader Shameer
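As a starting point for the document-modelling view suggested above (samples as documents, codes as words), here is a minimal sketch using only the Python standard library. The toy data, the `Var…` labels and the function names are made up for illustration, and the TF-IDF weighting is one common choice from IR, not something prescribed in this thread.

```python
import math
from collections import Counter

# Hypothetical toy data: each sample is a list of (date, code) observations.
samples = {
    "S1": [("2001-03-01", "Var1"), ("2003-07-12", "Var20"), ("2005-01-20", "Var1")],
    "S2": [("2002-05-05", "Var20"), ("2004-09-30", "Var7")],
    "S3": [("2000-11-11", "Var1"), ("2001-02-02", "Var7"), ("2006-06-06", "Var9")],
}

def bag_of_codes(observations):
    """Count how often each code occurs in one sample, ignoring the dates."""
    return Counter(code for _date, code in observations)

def tfidf_vectors(samples):
    """Return {sample_id: {code: tf-idf weight}} for all samples."""
    bags = {sid: bag_of_codes(obs) for sid, obs in samples.items()}
    n = len(bags)
    # Document frequency: in how many samples does each code appear?
    df = Counter(code for bag in bags.values() for code in bag)
    vectors = {}
    for sid, bag in bags.items():
        total = sum(bag.values())
        vectors[sid] = {
            code: (count / total) * math.log(n / df[code])
            for code, count in bag.items()
        }
    return vectors

vectors = tfidf_vectors(samples)
```

The resulting sparse vectors are the kind of input that clustering, LSA/SVD or a topic model would then work on. Rare codes get a higher weight than codes that appear in most samples.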
The generalization to IR is simple. If you index your training samples as documents using a search engine like Lucene, the Lemur toolkit, or Xapian, you can enter the codes of a test sample as a query and get back the set of most relevant/related samples. You can then aggregate the codes occurring in these, probably weighted by their relevance score, and this should give you a kind of smoothed distribution over the codes related to your sample. To learn more about recommendation systems, it is probably best to start at the Netflix challenge website. The top teams presented papers, and there are lots of discussions and pointers on the forums. The 2007 KDD Cup was also based around the Netflix data, and there were papers published from that. You will want to look into the "Simon Funk" method, which, while not a winner, was the best bang-for-buck method developed there. I will see if I have a good reference handy for document/topic modelling, but if you search for latent Dirichlet allocation (LDA), latent semantic analysis (LSA) [sometimes called latent semantic indexing (LSI)] and singular value decomposition (SVD), or just "topic modelling", you will get lots of hits. Most mathematical libraries have an SVD implementation that will scale at least to your dataset (NumPy and R do for sure). There is a topicmodels package for R on CRAN that should cover all of this.
(Sep 27 '11 at 17:12)
Daniel Mahler
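The retrieve-then-aggregate idea in the comment above can be sketched without a search engine. This is only an illustration under stated assumptions: plain cosine similarity over raw counts stands in for whatever relevance score Lucene, Lemur or Xapian would compute, the data and the names `cosine` and `related_codes` are hypothetical, and dropping the query's own codes from the result is my choice, not part of the original suggestion.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training samples as bags of codes (dates already dropped).
train = {
    "S1": Counter({"Var1": 2, "Var20": 1}),
    "S2": Counter({"Var20": 1, "Var7": 1}),
    "S3": Counter({"Var1": 1, "Var7": 1, "Var9": 1}),
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def related_codes(query_codes, train, top_k=2):
    """Retrieve the top_k most similar samples and return a relevance-weighted
    distribution over the codes they contain (excluding the query's codes)."""
    query = Counter(query_codes)
    scored = sorted(
        ((cosine(query, bag), sid) for sid, bag in train.items()),
        reverse=True,
    )[:top_k]
    weights = defaultdict(float)
    for score, sid in scored:
        for code, count in train[sid].items():
            if code not in query:
                weights[code] += score * count
    total = sum(weights.values())
    return {code: w / total for code, w in weights.items()} if total else {}

result = related_codes(["Var1"], train)
```

For a new sample that so far only has `Var1`, `result` ranks the codes found in the most similar training samples; with real data you would tune `top_k` and the scoring function.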
Thanks a lot for the suggestions and pro-tip on IR framework, very useful.
(Sep 27 '11 at 17:45)
Khader Shameer
|
|
Unless you have some knowledge of your problem, I doubt you'll be able to do better than an empirical selection. An interesting way to approach this would be to "meta optimize": look at many problems of all kinds, derive some descriptive covariates, build many models on each dataset, and then build a "meta" model predicting the performance of those models on the data. Then, when a new problem appears, use this meta model to guess which model to use.
The ICD-9 code itself is knowledge. What I want to achieve is to adapt this data to a suitable machine learning framework so that I can use the dataset to predict the next ICD-9 code for a new sample. Thanks for the suggestion on the meta-optimize approach. Do you have any references or literature to share?
(Sep 27 '11 at 14:34)
Khader Shameer
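For what "empirical selection" could look like here, a minimal sketch: fit each candidate on 3/4 of the samples (the split mentioned in the question) and compare how often it predicts the last code of each held-out sample. Both candidates are deliberately trivial baselines, the data is made up, and all names are hypothetical. The "meta" model idea itself is not implemented.

```python
from collections import Counter

# Hypothetical per-sample code sequences, already sorted by date.
sequences = [
    ["Var1", "Var20", "Var20"],
    ["Var7", "Var1", "Var20"],
    ["Var1", "Var20", "Var7"],
    ["Var20", "Var1", "Var20"],
]

def fit_most_frequent(train):
    """Baseline: always predict the globally most common code."""
    counts = Counter(code for seq in train for code in seq)
    top = counts.most_common(1)[0][0]
    return lambda history: top

def fit_repeat_last(train):
    """Baseline: predict that the last observed code repeats."""
    return lambda history: history[-1]

def holdout_accuracy(fit, sequences, train_fraction=0.75):
    """Fit on the first train_fraction of samples, then score next-code
    prediction on the final code of each held-out sample."""
    cut = int(len(sequences) * train_fraction)
    train, test = sequences[:cut], sequences[cut:]
    predict = fit(train)
    hits = sum(predict(seq[:-1]) == seq[-1] for seq in test)
    return hits / len(test)

candidates = {"most_frequent": fit_most_frequent, "repeat_last": fit_repeat_last}
scores = {name: holdout_accuracy(fit, sequences) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
```

With 1000 real samples you would shuffle before splitting and probably use cross-validation instead of a single hold-out split, but the comparison loop stays the same for any model that exposes a fit/predict pair.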
|
I am confused by your dataset. Can you give a more specific example? What is a Var? Is it a value? For example, in what way are Var1 and Var20 related?
The data is derived from electronic medical records. Here, variables are ICD-9 codes (See: http://en.wikipedia.org/wiki/List_of_ICD-9_codes).
Is there any data about the samples besides a set of (date, ICD-9 code) pairs? What kind of dependence do you expect between the dates and the codes? Is the earliest date of a sample the date it was taken? What is the distribution of the number of observations per sample, and over what kind of time frames are they distributed? Are the codes relatively constant per sample? (I would not expect the ICD-9 classification of a single sample to vary wildly from day to day.)
If you only have your dates, I find there is little you can actually do. A simple linear regression would probably be your best choice. In this simple example it looks like the Var label does not repeat, so it is infeasible to use any other algorithm.
If you have some more features, you can use nice techniques like Markov chains, which work quite nicely for template data.
@Daniel: Thanks for the pointers. Samples (S1 ... Sn) also have basic demographic information like age, sex, etc. The dependency should be derived from, for example, how many times Var1 occurs together with Var20, and similar combinations (I am working on this aspect). The data is available for approximately 20 years, so there is variation in the temporal observations of ICD-9 codes between Date1 and Date2.
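Counting how many times Var1 occurs together with Var20 can be sketched as follows. The data and names are hypothetical, and "occurs together" is taken here to mean "appears in the same sample at any date", which is an assumption.

```python
from collections import Counter
from itertools import combinations

# Hypothetical samples: the set of distinct codes seen in each one.
samples = {
    "S1": {"Var1", "Var20"},
    "S2": {"Var20", "Var7"},
    "S3": {"Var1", "Var20", "Var9"},
}

def cooccurrence_counts(samples):
    """Count, for every unordered pair of codes, the samples containing both."""
    pairs = Counter()
    for codes in samples.values():
        # Sorting gives each pair one canonical (a, b) ordering.
        for a, b in combinations(sorted(codes), 2):
            pairs[(a, b)] += 1
    return pairs

pairs = cooccurrence_counts(samples)
# Fraction of samples containing both codes (the "support" of the pair).
support = pairs[("Var1", "Var20")] / len(samples)
```

The pair support computed at the end is the basic quantity association rule mining starts from, so the same counts can feed that approach later.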
@Leon: Var labels do repeat, but not in a uniform way. Thanks for pointing out this important aspect; I have edited the data to illustrate it. I am afraid linear regression may not be an ideal choice. I was looking at association rule mining, and I will also check how I can adapt my problem to Markov chains.
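One way the problem might be adapted to Markov chains is a first-order chain over codes: order each sample's codes by date, count which code follows which, and read off the distribution of the next code given the last one. This is a rough sketch on made-up data with hypothetical function names. It ignores the time gaps between observations and the demographic information.

```python
from collections import Counter, defaultdict

# Hypothetical code sequences per sample, sorted by observation date.
sequences = [
    ["Var1", "Var20", "Var7"],
    ["Var1", "Var20", "Var20"],
    ["Var7", "Var1", "Var20"],
]

def fit_transitions(sequences):
    """Count how often each code is immediately followed by each other code."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            transitions[current][nxt] += 1
    return transitions

def next_code_distribution(transitions, last_code):
    """Estimate P(next code | last code) from the transition counts."""
    counts = transitions.get(last_code, Counter())
    total = sum(counts.values())
    return {code: c / total for code, c in counts.items()} if total else {}

transitions = fit_transitions(sequences)
dist = next_code_distribution(transitions, "Var20")
```

A code that never appears as a predecessor in the training data yields an empty distribution, so with sparse data some smoothing or a fallback to overall code frequencies would be needed.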