I'm copying here a question that I had already posted on stackexchange.com but couldn't get an exhaustive answer to. I am trying to understand the difference between logistic regression probabilities and linear regression prediction intervals.

For example, say we have a database of student test scores in the range 1 to 100, along with some predictors. The goal of the study is to build a model that predicts whether other students will reach a score of at least 60, with 80% confidence. To simplify, assume that all the assumptions of linear modeling are satisfied by the data.

The first method would be to run a linear regression on the observed data, compute the 80% prediction intervals, and then determine whether a student will reach a score of 60 or higher based on the lower end of the prediction interval. The other approach is to binarize the data (each student score < 60 becomes 0, each score >= 60 becomes 1) and run a logistic regression on those observations.

Is there any benefit to using the logistic regression approach in this case? Or will linear regression give the same level of accuracy when using prediction intervals?
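
To make the two approaches concrete, here is a minimal sketch of what I mean, using statsmodels on made-up data (the predictor `hours` and all the numbers are invented purely for illustration):

```python
import numpy as np
import statsmodels.api as sm

# Toy data: one invented predictor (hours of study) and scores in [1, 100].
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, 200)
score = np.clip(30 + 5 * hours + rng.normal(0, 8, size=200), 1, 100)

X = sm.add_constant(hours)
new_X = sm.add_constant(np.array([6.0]), has_constant="add")  # a new student

# Approach 1: linear regression, then an 80% prediction interval;
# call it a pass only if the interval's lower end is at least 60.
ols = sm.OLS(score, X).fit()
lower, upper = ols.get_prediction(new_X).conf_int(obs=True, alpha=0.20)[0]
print("approach 1 says pass:", lower >= 60)

# Approach 2: logistic regression on the binarized outcome;
# call it a pass only if the predicted P(score >= 60) is at least 0.80.
passed = (score >= 60).astype(int)
logit = sm.Logit(passed, X).fit(disp=0)
print("approach 2 says pass:", logit.predict(new_X)[0] >= 0.80)
```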

Yes, the nuances here are confusing. A prediction interval tries to estimate the variance (instability) of the estimated output for a given input. In other words, how trustworthy is the point estimate produced by the linear regression model? In general, this will depend both on the amount of data used to fit the model and on the (in)stability of the model class. (If the notion of stable/unstable model classes is unfamiliar, search for the bias-variance decomposition of learning algorithms, particularly in the context of ensemble learning.)

Note that one might want a prediction interval for a logistic regression model too; the interpretation is how stable/reliable the predicted output is for the given input value. I suspect statisticians have already worked out how to estimate such an interval, although I have not seen it in standard statistics books.

What is confusing, for me at least, is how the machine learning community often treats the point estimate from logistic regression as a confidence score for whether the input belongs to class 1 or class 0 (in a classification setting). If the output is a well-calibrated probability, this is reasonable; if it isn't, that "confidence" can be misleading. Recently I've been reading about conformal prediction, and it is the clearest thinking I've seen on confidence estimates for classification settings.

Finally, if it is possible to specify a prior for the output value, one can treat this problem with Bayesian statistics: derive the posterior distribution of the output value (after seeing the observed (x, y) pairs), then integrate that distribution to see whether 80% of the mass lies above 60.
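
To make that last step concrete: suppose (purely hypothetically) the posterior predictive distribution for a new student's score came out normal with mean 68 and standard deviation 7. Then the check is a one-liner:

```python
from scipy.stats import norm

# Hypothetical posterior predictive: Normal(68, 7). Numbers invented for illustration.
mass_above_60 = 1 - norm.cdf(60, loc=68, scale=7)  # P(score > 60) under the posterior
print(mass_above_60, mass_above_60 >= 0.80)        # ~0.87, so predict "passes"
```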

One thing to note is that using linear regression on a bounded range (1, 100) is a misfit, because linear regression can produce values outside that range. Logistic regression is a type of generalized linear model: the output of the linear part (Bx) is passed through the logistic function, which maps it into the range (0, 1). The typical interpretation is that this value is a probability, but if you want to ignore the probabilistic nature you can just see it as a bounded value. If you need the value in (1, 100), you could use a linear scaling to get there; a sketch of this transform-and-rescale idea appears below. Keep in mind, though, that this may not be the best course of action, because the logistic function builds in assumptions about the relationship between the linear part and the output. There are many functions that map into different ranges (e.g. probit and Poisson regression). As for accuracy, this is often an empirical question: run both regressions and find out which is more accurate on the problem of interest to you.

Good observation about the outcomes being range-bounded. I'll look into linear scaling and other functions. As for the main question, the only way I know to analyze this is through MSE and confusion matrices on out-of-sample (forward) data.
(Feb 05 '12 at 14:39)
Robert Kubrick
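
Here is the minimal sketch promised above of the logistic transform plus linear rescaling (the coefficients B and input x are invented, just to show the mechanics):

```python
import numpy as np

def logistic(z):
    """Map the linear part z = Bx into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def rescale(p, lo=1.0, hi=100.0):
    """Linearly map a value in (0, 1) into (lo, hi)."""
    return lo + (hi - lo) * p

B = np.array([0.4, -0.1])
x = np.array([5.0, 2.0])
z = B @ x            # linear part: 0.4*5 - 0.1*2 = 1.8
p = logistic(z)      # ~0.86, bounded in (0, 1)
score = rescale(p)   # ~86, bounded in (1, 100)
print(p, score)
```
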
MSE is fine; just make sure you are using out-of-sample (non-training) data to calculate it. If you know that some types of errors are worse than others (e.g. overpredicting a score versus underpredicting), you can try to bake that into your loss function, as in the sketch below. This may make using a library version of logistic/linear regression harder, but may yield better results.
(Feb 05 '12 at 14:43)
Travis Wolfe
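
To illustrate the kind of asymmetric loss Travis describes, here is a sketch in which overpredicting a score costs twice as much as underpredicting (the 2x factor is an arbitrary choice for the example):

```python
import numpy as np

def asymmetric_squared_error(y_true, y_pred, over_penalty=2.0):
    """Squared error that penalizes overpredicting a score more than underpredicting."""
    err = y_pred - y_true
    weights = np.where(err > 0, over_penalty, 1.0)  # err > 0 means we overpredicted
    return np.mean(weights * err ** 2)

# Same absolute errors (5 points each way), but the overprediction costs more.
print(asymmetric_squared_error(np.array([60, 60]), np.array([65, 55])))  # 37.5
```

Minimizing a custom loss like this generally means fitting with a general-purpose optimizer rather than a library's closed-form or default routine, which is where the extra difficulty mentioned above comes in.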