|
Say I want to do classification where each instance is a "website". I have website visitors. For each visitor, I have some real-valued attributes (e.g. age) and categorical attributes (e.g. gender). What is the appropriate way to create an aggregate feature vector for a website over its visitors? If I build a distribution or buckets over the individual attributes, I lose pairwise information like "percent of visitors who are female and over 60". So I should build pairwise attributes per visitor and then aggregate. Any other suggestions, or things I am missing? I also want statistics that compare this website to other websites that I define as "related". So instead of looking only at the statistics for the website in isolation, I also want features that measure how its visitor demographics differ from those of the related sites. For example, if this site gets significantly more females over 60 than the related sites, that's useful information. How do I capture these relative features with respect to the related websites? |
|
I'd define a feature vector per visitor and build a website-specific feature vector by concatenating a few different pooling options: the mean visitor vector and the elementwise max/min of each feature over the visitors (if you have real-valued features). Another family of features: cluster all visitors with k-means, encode each visitor's feature vector as a linear combination of the cluster means, then take these coefficients and do average / max / min pooling over them. As you mentioned, the per-visitor feature vector should probably include feature crosses. To get the relative features, I'd just compute the mean feature vector of each group of websites and look at the difference between that and each website's feature vector, and also at which quantile of the group's distribution each of the website's features falls in. |
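A minimal sketch of the pooling and relative features described above, using NumPy. The function names, the "over 60" cross, and the toy visitor data are all made up for illustration, and the k-means encoding step is left out.

```python
import numpy as np

def visitor_features(age, is_female):
    """Per-visitor vector: raw attributes plus one feature cross (hypothetical choice)."""
    over_60 = float(age > 60)
    return np.array([age, is_female, over_60, is_female * over_60])

def pool_website(visitors):
    """Concatenate mean, max and min pooling over a (n_visitors, d) array."""
    return np.concatenate(
        [visitors.mean(axis=0), visitors.max(axis=0), visitors.min(axis=0)]
    )

def relative_features(site_vec, related_vecs):
    """Difference from the related-group mean, plus a per-feature quantile rank."""
    diff = site_vec - related_vecs.mean(axis=0)
    # Fraction of related sites whose value is <= this site's value, per feature.
    quantile = (related_vecs <= site_vec).mean(axis=0)
    return np.concatenate([diff, quantile])

# Toy data: (age, is_female) pairs for one site and two "related" sites.
site = np.array([visitor_features(a, f) for a, f in [(65, 1), (30, 0), (70, 1), (25, 1)]])
related_a = np.array([visitor_features(a, f) for a, f in [(20, 0), (35, 1)]])
related_b = np.array([visitor_features(a, f) for a, f in [(61, 1), (40, 0), (22, 0)]])

site_vec = pool_website(site)
related_vecs = np.stack([pool_website(related_a), pool_website(related_b)])
rel = relative_features(site_vec, related_vecs)
```

Here the mean of the crossed feature, `site_vec[3]`, is the fraction of visitors who are female and over 60, which is exactly the pairwise statistic from the question. The final classifier input could be `np.concatenate([site_vec, rel])`.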
|
No idea if it works, but you could try recursive autoencoders (RAEs). You can use an RAE as a kind of "reduce" operation, as in map/reduce, to boil a sequence of visitor representations of arbitrary length down to a fixed-length descriptor. |
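To make the "reduce" idea concrete, here is a rough sketch of only the encoder side, under loud assumptions: the weights `W` and `b` are random rather than trained, the decoder and reconstruction loss that would normally be used to train an RAE are omitted, and the visitors are combined in a simple left-to-right chain. The name `rae_reduce` is hypothetical.

```python
import numpy as np

def rae_reduce(visitors, W, b):
    """Fold a (n_visitors, d) array into one d-vector with a shared encoder."""
    state = visitors[0]
    for v in visitors[1:]:
        # Encode the pair (current state, next visitor) back into d dimensions.
        state = np.tanh(W @ np.concatenate([state, v]) + b)
    return state

rng = np.random.default_rng(0)
d = 4
W = rng.normal(scale=0.1, size=(d, 2 * d))  # untrained, random weights (assumption)
b = np.zeros(d)

# Sites with different numbers of visitors map to descriptors of the same length.
short_site = rae_reduce(rng.normal(size=(3, d)), W, b)
long_site = rae_reduce(rng.normal(size=(50, d)), W, b)
```

Both descriptors have length `d` regardless of how many visitors a site has, which is what makes them usable as fixed-size classifier inputs. Whether the learned descriptor is actually useful would have to be checked empirically.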