I have two datasets from a Web store (like Amazon). Both datasets have the same structure; each record has the following attributes:
The first dataset is a collection of records for randomly selected users. The second dataset is a collection of records only for users who also clicked on a particular advertisement located on the same page as the product they were shopping for.

Problem: find dependencies that distinguish users in the second dataset from users in the first dataset. To solve this I would calculate statistics such as the mean and standard deviation for all parameters in both datasets and compare them. Unfortunately, a classifier (no matter which classification algorithm one uses) does not answer the question "What feature inter-dependencies distinguish records in the second dataset from records in the first one?"

Any other ideas on how to find characteristic features distinguishing these datasets? I am new to this kind of problem, so please bear with me, and let me know if my question makes no sense at all! Thanks!
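For reference, the per-feature comparison I had in mind looks something like this. It is only a minimal sketch; the function name, the example data, and the feature names are all made up, and it just reports per-feature means and standard deviations plus a standardized mean difference:

```python
import statistics

def compare_features(ds1, ds2, names):
    """Per-feature comparison of two datasets: mean and standard deviation
    in each, plus the difference in means scaled by the pooled std."""
    report = {}
    for j, name in enumerate(names):
        a = [row[j] for row in ds1]
        b = [row[j] for row in ds2]
        m1, m2 = statistics.mean(a), statistics.mean(b)
        s1, s2 = statistics.pstdev(a), statistics.pstdev(b)
        pooled = ((s1 ** 2 + s2 ** 2) / 2) ** 0.5
        # A large standardized difference flags a feature whose
        # distribution shifts between the two datasets.
        effect = (m2 - m1) / pooled if pooled else 0.0
        report[name] = {"mean_1": m1, "mean_2": m2,
                        "std_1": s1, "std_2": s2, "effect": effect}
    return report
```

This only looks at one feature at a time, which is exactly why it cannot surface the feature *inter*-dependencies I am asking about.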
Perhaps you could pool all of the entries, adding a class attribute for whether each record came from dataset one or dataset two. Then, if you run a classifier on the pooled data, you could infer the major determinants. For example, if you classified it all with a decision tree, you could look at the top one or two splits, because they will probably be good overall determinants of how the datasets differ. You can apply this idea in different ways to all kinds of classifiers, although for each one you will need a way to identify the "important" dimensions. Trees make this easy because they put the most important split at the root.
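To make this concrete, here is a minimal, self-contained sketch of the idea (the data and feature values below are invented for illustration): pool the records, label each with its source dataset, and find the single best axis-aligned split by information gain, which is what a decision tree would choose as its root:

```python
import math

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def best_split(rows, labels):
    """Return (feature_index, threshold, info_gain) for the single best
    axis-aligned split -- the 'top split' a decision tree would pick."""
    base = entropy(labels)
    best = (None, None, 0.0)
    n_feats = len(rows[0])
    for j in range(n_feats):
        for t in sorted({r[j] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[j] <= t]
            right = [l for r, l in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            remainder = (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(labels)
            gain = base - remainder
            if gain > best[2]:
                best = (j, t, gain)
    return best
```

Here `rows` would be the concatenation of both datasets and `labels` marks each record's source (0 = first dataset, 1 = second); the feature with the highest-gain split is your strongest single determinant.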
|
As I read your question, I realized that you may be interested in methods that take descriptive statistics as inputs, since you already have these features readily available. One method you might find useful is to build a "yin-yang" classifier using minimum mean squared error (MSE). This is basically a really simple clustering technique that tries to minimize intra-group differences and, by consequence, maximizes inter-group differences. It follows this general set of steps:

1) randomly assign group membership
2) measure the MSE across all features for these groups
3) [search] try all possible assignments; every time you find a solution with a lower MSE, record it
3) [heuristic] alternatively, stop once no reassignment reduces the MSE any further, i.e. you have reached a locally optimal split
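The steps above can be sketched as follows. This is only a toy illustration under my own naming (there is no standard "yin-yang classifier" API); it implements the exhaustive [search] variant of step 3, which is only feasible for very small datasets since it enumerates all 2^n assignments (hence the [heuristic] variant for real data):

```python
import itertools

def within_group_mse(data, labels):
    """Mean squared deviation of each point from its group's feature means."""
    total, count = 0.0, 0
    for g in set(labels):
        members = [row for row, l in zip(data, labels) if l == g]
        n_feats = len(members[0])
        means = [sum(row[j] for row in members) / len(members)
                 for j in range(n_feats)]
        for row in members:
            total += sum((row[j] - means[j]) ** 2 for j in range(n_feats))
            count += n_feats
    return total / count

def yin_yang_split(data):
    """Exhaustive search (step 3, [search] variant): try every two-group
    assignment and keep the one with the lowest within-group MSE."""
    best_labels, best_mse = None, float("inf")
    for labels in itertools.product((0, 1), repeat=len(data)):
        if len(set(labels)) < 2:
            continue  # both groups must be non-empty
        mse = within_group_mse(data, labels)
        if mse < best_mse:
            best_mse, best_labels = mse, list(labels)
    return best_labels, best_mse
```

On two well-separated blobs this recovers the natural grouping; you would then inspect which features differ most between the recovered groups.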
So what do you want exactly: to build a classifier, or to build a classifier that a human can interpret? The latter is a vague concept, so please elaborate if that is what you want. Generally, machine learning is concerned with building a black-box classifier that generalizes well, regardless of the dependencies between the features it uses.

Decision trees are typically easy to interpret, and you can keep them simple with pruning. But this comes at a cost: more sophisticated models like random forests might show better performance.