I am new to machine learning. I am trying to train a model of the execution profile of a parallel program on a small-scale system (say, my dual-core laptop) and then test the accuracy of the model against the execution profile of the same program running on a large-scale system (e.g., a 128-node cluster). Obviously, the features I can extract from the small system differ from those I get from the large system. Suppose the same set of latent variables controls both sets of features, since they come from the same program. What method would be a good fit for this problem?
I'll be more specific below. Feel free to skip the details; advice based on the summary above alone would still be very helpful. Thank you.
The aim is to detect anomalies in the volume of inter-node communication. In case you are not familiar with parallel computers, just think of one as a network of N computers collaborating on a computation and talking to each other when necessary. The feature I use to model communication behavior is the number of messages sent from a node to the other nodes in the system. For example, in a 16-node system, every node is represented by a 16-component vector consisting of the number of messages the node sends to each of the other nodes (including 0 messages to itself). This is why I have different feature spaces for different systems: a node in a 16-node system is represented by a 16-component vector, while a node in a 128-node system is represented by a 128-component vector. Now I want to train a model of communication volume and use it to predict when there is abnormal communication behavior. My current approach is to first use k-means to cluster the nodes into groups, then for each group use PCA to derive the normal range of communication based on the squared prediction error (the Q statistic). If any node's communication volume exceeds that limit on the squared error, it is predicted to be abnormal.
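For concreteness, here is a minimal sketch of the approach described above (k-means over nodes, then a per-cluster PCA with a Q-statistic threshold), using scikit-learn. The message counts, cluster count, number of retained components, and the 3-sigma control limit are all made-up illustrative choices, not part of the original setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 16 nodes, each a 16-component vector of message counts.
X = rng.poisson(lam=20.0, size=(16, 16)).astype(float)

# Step 1: cluster nodes into groups by communication pattern.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 2: for each group, fit PCA and flag nodes whose squared prediction
# error (the Q statistic on the residual subspace) exceeds a control limit.
anomalous = np.zeros(len(X), dtype=bool)
for c in range(kmeans.n_clusters):
    members = np.where(kmeans.labels_ == c)[0]
    if len(members) < 3:
        continue  # too few nodes in this group to fit a subspace
    Xc = X[members]
    pca = PCA(n_components=2).fit(Xc)           # retained subspace
    recon = pca.inverse_transform(pca.transform(Xc))
    q = ((Xc - recon) ** 2).sum(axis=1)         # squared prediction error
    limit = q.mean() + 3 * q.std()              # crude 3-sigma control limit
    anomalous[members] = q > limit

print(anomalous)
```

In practice the Q-statistic limit is usually derived from its approximate distribution under normal operation (e.g., the Jackson-Mudholkar limit) rather than a 3-sigma rule, but the structure is the same.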
My question is: how can I derive a model (e.g., via PCA) on a small system and test it on a large system with a different feature space?
Thanks for your patience and advice!
asked Nov 18 '10 at 01:30
Since you say that both sets of features (train and test) have the same underlying latent variables, I think one way would be to train a factor analysis model (with the same number of factors, say K) separately on the training and the test data. Once you have the new representations of the training and test data (in terms of the K latent factors), you can learn a model from the training data and apply it to the test data.
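A minimal sketch of this idea with scikit-learn's `FactorAnalysis`, fitting the two systems separately with a shared K (the data sizes and K=4 are made-up placeholders; note the identifiability caveat raised in the edit below):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
K = 4  # assumed number of shared latent factors

# Hypothetical profiles: 16 nodes x 16 features (small system),
# 128 nodes x 128 features (large system).
X_train = rng.poisson(20.0, size=(16, 16)).astype(float)
X_test = rng.poisson(20.0, size=(128, 128)).astype(float)

# Fit a separate factor analysis model on each system, same K,
# so both end up with a K-dimensional latent representation.
Z_train = FactorAnalysis(n_components=K, random_state=0).fit_transform(X_train)
Z_test = FactorAnalysis(n_components=K, random_state=0).fit_transform(X_test)

print(Z_train.shape, Z_test.shape)
```

Both representations are K-dimensional, so a model trained on `Z_train` can at least be evaluated on `Z_test`; whether the two factor spaces are actually aligned is exactly the issue discussed in the edit below.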
Edit: As Alexandre pointed out, the factor analysis approach I suggested above actually wouldn't do the right thing in this case, due to the identifiability issue in factor analysis. One hack you might try is to cluster the test data features (128 in number) into 16 clusters (i.e., the number of features in the training data), and then take each cluster center as a feature for the test data, which gives a new feature representation for the test data. Another possibility is to use something like weakly paired maximum covariance analysis on the training and test data, a multimodal dimensionality reduction technique somewhat like canonical correlation analysis (CCA), but one that does not require matchings between pairs of examples in the two datasets (and the number of examples may differ between the two datasets, unlike in CCA).
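The feature-clustering hack can be sketched as follows: cluster the columns of the test matrix and use the cluster centers (transposed back) as the reduced features. The data here is a made-up placeholder of the same shape as in the question.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Hypothetical test profiles: 128 nodes x 128 per-destination message counts.
X_test = rng.poisson(20.0, size=(128, 128)).astype(float)

# Cluster the 128 columns (features) into 16 groups; each cluster center,
# transposed back, becomes one aggregate feature for every node.
km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(X_test.T)
X_test_16 = km.cluster_centers_.T  # shape: (128 nodes, 16 features)

print(X_test_16.shape)
```

The resulting 16-dimensional test representation matches the training feature count, so a model fit on the small system can be applied directly, though there is no guarantee the aggregated features line up semantically with the training ones.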