|
What is the common balance between query-related features and document-related features in search engines? If we aggregate them, which group of features has more influence? rank = a(all query-related features) + b(all document-related features) I'm not sure if the question is formal enough, but I want to find some common sense.
This question is marked "community wiki".
|
|
It really depends on the setting and the data. Usually what people do is try many values and see what does best. Another common approach is to use tfidf to generate a document list and then rank these documents by pagerank, ignoring their tfidf weights. Most modern search engines also consider many other things, and the ranking equations get rather complex (which is one of the reasons there's been a lot of interest in learning to rank).
So you think that modern search engines first get the 1000 best documents by tfidf rank and then sort them by pagerank? It looks like a hard-coded heuristic rather than a theoretical probabilistic model.
(Sep 01 '10 at 05:57)
yura
It's a mixture of probabilistic models and hacky heuristics, yes. What I described is, AFAIK, mostly correct, except there are a lot more factors (other than tfidf) involved in choosing the 1000 documents and a lot more factors involved in reranking them (other than pagerank). Also, 1000 is not always used. Unfortunately it gets really hard to understand what is going on without actually working on a search engine, mainly because all modern search engines treat this as a trade secret. I built this informal description by looking at the problem settings and datasets used in learning to rank papers. See for example http://research.microsoft.com/en-us/um/beijing/projects/letor/
(Sep 01 '10 at 07:21)
Alexandre Passos ♦
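The two-stage scheme discussed in this thread (pick candidates by tfidf, then reorder them by pagerank alone) could be sketched roughly as below. This is only an illustration under stated assumptions: the function names, the whitespace tokenizer, the particular tfidf variant, and the precomputed `pagerank` dictionary are all hypothetical, and a real engine uses many more signals in both stages.

```python
import math
from collections import Counter


def tfidf_scores(query, docs):
    """Score each document against the query with a plain TF-IDF sum.

    docs maps doc_id -> text. The tokenizer (lowercase + whitespace split)
    and the TF-IDF formula are simplifying assumptions for illustration.
    """
    n = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        total = 0.0
        for term in query.lower().split():
            if tf[term]:
                total += (tf[term] / len(tokens)) * math.log(n / df[term])
        scores[doc_id] = total
    return scores


def two_stage_rank(query, docs, pagerank, k=1000):
    """Stage 1: keep the k best tfidf matches. Stage 2: sort them by pagerank."""
    scores = tfidf_scores(query, docs)
    matching = [d for d in scores if scores[d] > 0]
    candidates = sorted(matching, key=lambda d: scores[d], reverse=True)[:k]
    # The tfidf weights are deliberately ignored from here on.
    return sorted(candidates, key=lambda d: pagerank.get(d, 0.0), reverse=True)


# Toy data, made up for this example.
docs = {
    "a": "learning to rank for search",
    "b": "search engine ranking search",
    "c": "cooking pasta recipes",
}
pagerank = {"a": 0.5, "b": 0.2, "c": 0.9}
result = two_stage_rank("search", docs, pagerank, k=2)
```

Note that document "c" has the highest pagerank but never shows up, because it does not survive the tfidf stage. That is the sense in which the first stage is query-driven and the second is document-driven.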
|
|
If your system can take feedback from users, you can try reinforcement learning based on that feedback. For example, if the user opens multiple documents from the set you provide, then your set is most probably not properly ordered. Based on this feedback you might attempt to adjust some coefficients.
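As a rough sketch of adjusting coefficients from feedback, here is a simple pairwise, perceptron-style update on the weights a and b from the question. To be clear, this is a much simpler stand-in for full reinforcement learning, and everything in it is an assumption made for illustration: the single aggregated query feature and document feature per result, the names `score` and `update_from_clicks`, the learning rate, and the rule that a clicked document should outrank any unclicked document shown above it.

```python
def score(weights, features):
    """rank = a * (query-related feature) + b * (document-related feature)."""
    a, b = weights
    query_feature, doc_feature = features
    return a * query_feature + b * doc_feature


def update_from_clicks(weights, ranked, clicked, lr=0.1):
    """Nudge (a, b) so clicked documents outscore skipped ones shown above them.

    ranked:  list of (doc_id, (query_feature, doc_feature)) in displayed order.
    clicked: set of doc_ids the user opened.
    """
    a, b = weights
    for pos, (doc_id, feats) in enumerate(ranked):
        if doc_id not in clicked:
            continue
        for skipped_id, skipped_feats in ranked[:pos]:
            if skipped_id in clicked:
                continue
            # The user skipped a higher-ranked result and opened this one:
            # move the weights toward the clicked document's features.
            a += lr * (feats[0] - skipped_feats[0])
            b += lr * (feats[1] - skipped_feats[1])
    return (a, b)


# Toy data, made up for this example: "x" is shown first, but the user opens "y".
ranked = [("x", (0.9, 0.1)), ("y", (0.2, 0.7))]
old_weights = (1.0, 1.0)
new_weights = update_from_clicks(old_weights, ranked, clicked={"y"}, lr=0.5)
```

After the update the weight on the document-related feature has grown relative to the query-related one, so "y" now scores above "x". In practice click data is noisy and position-biased, so a single click should move the weights far less than in this toy example.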
This answer is marked "community wiki".
|