|
For add-alpha smoothing there is a well-known interpretation (the marginal mode of the posterior under a Dirichlet prior), as there is for L1 and L2 regularization (Laplace and Gaussian priors on the weight vector); for Kneser-Ney smoothing there are hierarchical Pitman-Yor processes, and so on. Is there a well-known probabilistic grounding of tf-idf weighting? More generally, is there a probabilistic rationale for weighting terms by global log-factors, i.e. some sort of distribution that gives rise to posteriors that look like this?
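To make the kind of grounding I mean concrete (my own sketch of the standard Dirichlet argument, plus one common definition of the weight I am asking about), with $c(w)$ the count of $w$ out of $N$ tokens and $V$ the vocabulary size:

$$\hat p(w) = \frac{c(w) + \alpha}{N + \alpha V} \qquad \text{(add-}\alpha\text{ smoothing as a posterior estimate under a symmetric Dirichlet}(\alpha)\text{ prior)}$$

$$\text{tf-idf}(w, d) = \text{tf}(w, d) \cdot \log\frac{N_{\text{docs}}}{\text{df}(w)} \qquad (N_{\text{docs}} = \text{number of documents},\ \text{df}(w) = \text{document frequency})$$

What I am after is a prior/likelihood pair whose estimator or score comes out proportional to the second expression, the way the Dirichlet prior produces the first.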
|
Some justification was given in this paper (section 3 in particular). Some connections were studied in this paper too. |
|
I believe work by Zhai and Lafferty also discussed the ranking similarity between language models and the tf-idf approach. |
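Roughly, as I recall the argument (notation mine): write the smoothed document model as $p(w \mid d) = p_s(w \mid d)$ for words seen in $d$, and $\alpha_d \, p(w \mid C)$ otherwise, where $C$ is the collection. The query log-likelihood then decomposes as

$$\log p(q \mid d) = \sum_{w \in q,\, c(w; d) > 0} c(w; q) \log \frac{p_s(w \mid d)}{\alpha_d \, p(w \mid C)} + n \log \alpha_d + \sum_{w \in q} c(w; q) \log p(w \mid C),$$

where $n$ is the query length and the last term does not depend on the document. The first sum grows with term frequency through $p_s(w \mid d)$ and with term rarity through $1 / p(w \mid C)$, which is the idf-like factor, while $n \log \alpha_d$ behaves like length normalization. That is the sense in which the language-model score ranks documents much like tf-idf does.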
I am aware that there have been efforts to formalize TF*IDF in probabilistic terms, but I cannot remember the details.
I think for linear models, at least, an answer should focus more on the weight vector than on the term weights: something like an asymmetric regularization term, regularizing ||tf-idf * theta|| instead of ||theta|| (see the sketch below).
The closest thing I can think of is confidence-weighted learning, but there the formulas don't quite come out looking tf-idf-like.
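To spell out what that asymmetric penalty would mean probabilistically, here is a minimal numerical sketch (my own illustration; the random data, the idf variant, and the ridge setting are just placeholders): penalizing $\lambda\|D\theta\|^2$ for a diagonal matrix $D$ of global per-term factors is exactly ordinary L2 regularization after rescaling the features by $D^{-1}$, i.e. an independent zero-mean Gaussian prior on each weight with standard deviation proportional to $1/D_{jj}$.

```python
import numpy as np

# Sketch: a weighted ridge penalty lam * ||D @ theta||^2, with D a diagonal
# matrix of global per-term factors (here idf), equals standard ridge on
# features rescaled by D^{-1}. Equivalently: a Gaussian prior on each weight
# with standard deviation proportional to 1 / D_jj.

rng = np.random.default_rng(0)
n_docs, n_terms = 50, 10
X = rng.poisson(1.0, size=(n_docs, n_terms)).astype(float)  # raw term counts (tf)
y = rng.normal(size=n_docs)                                 # placeholder targets
idf = np.log(n_docs / (1.0 + (X > 0).sum(axis=0)))          # one common idf variant
D = np.diag(idf)
lam = 0.5

# (1) Ridge with the weighted penalty lam * ||D theta||^2, in closed form:
#     theta = (X'X + lam * D'D)^{-1} X'y
theta_weighted = np.linalg.solve(X.T @ X + lam * D.T @ D, X.T @ y)

# (2) Standard ridge on rescaled features Z = X D^{-1}, mapped back via theta = D^{-1} beta
Z = X @ np.linalg.inv(D)
beta = np.linalg.solve(Z.T @ Z + lam * np.eye(n_terms), Z.T @ y)
theta_rescaled = np.linalg.inv(D) @ beta

print(np.allclose(theta_weighted, theta_rescaled))  # True: the two solutions coincide
```

So the question becomes which prior over the weights would make corpus-level statistics like idf the natural choice for those per-weight scales.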