|
Is there a benefit to training a logistic regression classifier on data whose labels happen to be in the form of probabilities already? I know logistic regression was intended for classification, but I feel that cases where P_c1 = 51% and P_c2 = 49% should be taken into account during training. Also, if I were to train it in such a manner, would I need to use a different training procedure from iteratively reweighted least squares? |
|
I don't see a reason why you shouldn't use this information if you have it. Your likelihood function would probably look different, and you'd have a different gradient. You could probably still optimize it with iteratively reweighted least squares, but usually you don't want to do that anyway, since methods like L-BFGS work better.
So my likelihood would no longer be the binomial distribution? Is there a standard pdf for the case where the labels are probability values? If I treat the likelihood as a normal distribution, it seems my model would just become a neural network with no hidden layer.
(Oct 19 '11 at 22:27)
crdrn
Logistic regression is already a kind of neural network with no hidden layer. You can use a nicer loss function, such as minimizing the KL divergence between the empirical distribution and your predicted distribution, which, up to a constant that does not depend on the model, is - sum_x sum_y observed-P(Y|X) * log(predicted-P(Y|X))
(Oct 19 '11 at 22:34)
Alexandre Passos ♦
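To make the loss in the comment above concrete, here is a minimal sketch for the two-class case, assuming plain batch gradient descent rather than IRLS or L-BFGS. The function names, the learning rate, and the toy data are all made up for illustration. With a soft label p and a prediction q = sigmoid(w.x + b), the gradient with respect to the logit is simply q - p, which is the same expression as with hard 0/1 labels.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_cross_entropy(w, b, xs, ps):
    # Mean of -[p*log(q) + (1-p)*log(1-q)] with q = sigmoid(w.x + b).
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, p in zip(xs, ps):
        q = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        total -= p * math.log(q + eps) + (1.0 - p) * math.log(1.0 - q + eps)
    return total / len(xs)

def fit_soft_logistic(xs, ps, lr=0.5, steps=2000):
    # Batch gradient descent on the soft-label cross-entropy (illustrative settings).
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for x, p in zip(xs, ps):
            q = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = q - p  # gradient of the per-example loss w.r.t. the logit
            for j in range(d):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

# Toy data: soft labels generated by a known model with w = 1.5, b = -0.5.
xs = [[-2.0], [-1.0], [0.0], [1.0], [2.0]]
ps = [sigmoid(1.5 * x[0] - 0.5) for x in xs]
w, b = fit_soft_logistic(xs, ps)
```

Because the toy labels come from a logistic model, the fitted `w` and `b` should approach the generating values; with real probability labels the minimizer is whatever logistic model is closest in this cross-entropy sense.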
Can I interpret the probabilities I get from logistic regression outputs as the probability of the class Y of a given datapoint X assuming the pdf of y given x follows a logistic distribution?
(Oct 25 '11 at 23:52)
crdrn
|
|
If you have that information then you should use it, especially if your goal is to get a good estimate of the conditional probability, rather than just getting accurate classification. Most implementations of logistic regression will let you weight examples differently. You can use this feature to encode an example as having probability p of being positive. Just split it into two examples, one with weight p and a positive label and the other with weight (1-p) and a negative label. This is an example of using "Rao-Blackwellized" or "distributional" examples, in which some of the attributes of an example are observed precisely and others are specified only up to a probability distribution. |
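The splitting trick described above can be sketched as follows. This is only an illustration: the helper names and the numbers are made up, and the check simply confirms that the weighted negative log-likelihood on the split, hard-labeled examples equals the soft-label loss on the originals for an arbitrary parameter setting.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def split_soft_examples(xs, ps):
    # Turn each (x, p) into a positive copy with weight p
    # and a negative copy with weight 1 - p.
    xs2, ys2, ws2 = [], [], []
    for x, p in zip(xs, ps):
        xs2.append(x); ys2.append(1); ws2.append(p)
        xs2.append(x); ys2.append(0); ws2.append(1.0 - p)
    return xs2, ys2, ws2

def weighted_nll(w, b, xs, ys, weights):
    # Weighted negative log-likelihood with hard 0/1 labels.
    total = 0.0
    for x, y, wt in zip(xs, ys, weights):
        q = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        total -= wt * math.log(q if y == 1 else 1.0 - q)
    return total

def soft_nll(w, b, xs, ps):
    # Negative log-likelihood with probability-valued labels.
    total = 0.0
    for x, p in zip(xs, ps):
        q = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        total -= p * math.log(q) + (1.0 - p) * math.log(1.0 - q)
    return total

# Made-up examples and parameters, purely for the equivalence check.
xs = [[0.5, -1.0], [2.0, 0.3], [-1.2, 0.8]]
ps = [0.51, 0.9, 0.2]
w, b = [0.7, -0.4], 0.1
xs2, ys2, ws2 = split_soft_examples(xs, ps)
```

The split arrays can then be handed to any implementation that accepts per-example weights, for example through the `sample_weight` argument of scikit-learn's `LogisticRegression.fit`.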