For add-alpha smoothing there is a well-known interpretation (the mean, or mode with a shifted hyperparameter, of the posterior under a symmetric Dirichlet prior); the same goes for L1 and L2 regularization (Laplace and Gaussian priors on the weight vector) and for Kneser-Ney smoothing (hierarchical Pitman-Yor processes), among many other techniques.
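(To make the first of those concrete, since this is the kind of grounding I am hoping exists for tf-idf: with term counts $c_i$ summing to $N$ over a vocabulary of size $V$, a symmetric Dirichlet($\alpha$) prior on the multinomial parameters gives the posterior mean

    \hat{p}_i = \frac{c_i + \alpha}{N + V\alpha}

which is exactly add-$\alpha$ smoothing; the posterior mode gives the same form with $\alpha$ replaced by $\alpha - 1$. This is a standard textbook identity, stated here only to fix what I mean by a probabilistic grounding.)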

Is there a well-known probabilistic grounding of tf-idf weighting?

Also, is there a probabilistic rationale for weighting things by global log-factors, in general? Some sort of distribution that gives rise to posteriors that look like this?
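To fix notation, by tf-idf I mean the usual scheme in which a raw term count is multiplied by a global factor log(N / df). A minimal sketch of the plain, unsmoothed textbook variant (many smoothed and normalized versions exist):

    import math
    from collections import Counter

    def tfidf(docs):
        """Weight each term count by the global log factor log(N / df)."""
        n_docs = len(docs)
        df = Counter()                      # document frequency of each term
        for doc in docs:
            df.update(set(doc))
        weighted = []
        for doc in docs:
            tf = Counter(doc)               # raw term frequency in this doc
            weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
        return weighted

    # A word occurring in every document gets idf = log(1) = 0.
    print(tfidf([["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]))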

asked Aug 20 '10 at 12:46


Alexandre Passos ♦

edited Dec 03 '10 at 07:17


I am aware that there have been efforts to formalize TF*IDF in probabilistic terms, but I cannot remember the details.

(Aug 20 '10 at 14:31) Joseph Turian ♦♦

I think for linear models, at least, a solution should focus more on the weight vector than on the term weights: something like an asymmetric regularization penalty, regularizing ||tf-idf * theta|| instead of ||theta||.

The closest thing I can think of is confidence-weighted learning, but there the formulas don't quite come out looking tf-idf-like.

(Aug 20 '10 at 14:37) Alexandre Passos ♦
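(Spelling out the comment above, as a hypothetical reading rather than a result from the literature: penalizing $\lambda \lVert D\theta \rVert_2^2$ with $D = \mathrm{diag}(w_1, \dots, w_d)$ holding per-feature idf/tf-idf weights is the same as placing independent Gaussian priors on the weights,

    \lambda \sum_j w_j^2 \theta_j^2 \quad\Longleftrightarrow\quad \theta_j \sim \mathcal{N}\!\left(0, \tfrac{1}{2\lambda w_j^2}\right)

up to additive constants in the log-posterior, so the asymmetry amounts to a per-feature prior variance $1/(2\lambda w_j^2)$.)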

3 Answers:

Some justification was given in this paper (section 3 in particular). Some connections were studied in this paper too.

answered Aug 20 '10 at 15:13


spinxl39

Thanks. I'll read them.

(Aug 20 '10 at 15:17) Alexandre Passos ♦

I'd also suggest looking at this one (it takes more of an information-theoretic perspective, but I think it's related).

(Aug 20 '10 at 15:20) spinxl39

Yes, there is. See this paper: "Deriving TF-IDF as a Fisher Kernel" by Charles Elkan.
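(For the shape of the argument, and hedging on the specifics of the generative model used in the paper: the Fisher kernel of Jaakkola and Haussler maps a document $x$ to its Fisher score under a fitted model $p(x \mid \theta)$,

    U_x = \nabla_\theta \log p(x \mid \theta), \qquad K(x, y) = U_x^\top \mathcal{I}^{-1} U_y, \qquad \mathcal{I} = \mathbb{E}\left[ U_x U_x^\top \right]

and, as the title suggests, the paper shows that for a suitable document model these score features come out proportional to tf-idf-style weights.)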

answered Sep 04 '10 at 18:31


Alexandre Passos ♦

I believe work by Zhai and Lafferty also discussed the ranking similarity between language models and the tf-idf approach.
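For reference (sketching that result from memory, so treat the exact form as an assumption): with query likelihood and Dirichlet-smoothed document language models, the document-dependent part of the log-likelihood can be written as

    \log p(q \mid d) \;\stackrel{\mathrm{rank}}{=}\; \sum_{w \in q,\; c(w,d) > 0} c(w,q) \log\!\left(1 + \frac{c(w,d)}{\mu\, p(w \mid C)}\right) + |q| \log \frac{\mu}{|d| + \mu}

where c(w,d) / p(w|C) pairs a term-frequency count with an inverse collection-frequency factor, i.e. a tf-idf-like weight, and the second term plays the role of document-length normalization.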

answered Dec 14 '10 at 19:18


dataengines
