Wednesday, August 15, 2018

Errors vs Residuals in Regression

People often confuse errors and residuals in linear regression.  Recall our linear model is given by
\begin{align}
y_i=\beta^T X^{(i)}+\epsilon_i
\end{align}
where $X^{(i)}$ is a feature vector, $y_i$ is our observation/response, and the $\epsilon_i$ are our 'true' errors.  We then fit a linear regression model via OLS and learn a coefficient vector $\hat{\beta}$, with fitted values $\hat{y}_i=\hat{\beta}^T X^{(i)}$.  Our residuals, on the other hand, are
\begin{align}
y_i-\hat{y}_i&=\left(\beta-\hat{\beta}\right)^T X^{(i)}+\epsilon_i
\end{align}
The assumptions of linear regression (homoskedasticity, normality, etc.) are really assumptions on the $\epsilon_i$, but we generally run diagnostics on $y_i-\hat{y}_i$, since we never observe $\epsilon_i$.
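To make the distinction concrete, here's a minimal simulation sketch (all names and parameter values are illustrative, not from any particular dataset): we generate data with known errors, fit OLS, and see that the residuals approximate, but do not equal, the true errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3

X = rng.normal(size=(n, p))           # feature matrix, rows are X^(i)
beta = np.array([1.0, -2.0, 0.5])     # 'true' coefficients
eps = rng.normal(scale=0.7, size=n)   # 'true' errors epsilon_i -- unobservable in practice
y = X @ beta + eps                    # observations y_i

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit
residuals = y - X @ beta_hat                      # what we can actually compute

# Residuals approximate, but do not equal, the true errors:
print(np.max(np.abs(residuals - eps)))  # nonzero; shrinks as n grows for fixed p
```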

What can running a test on $y_i-\hat{y}_i$ tell us about $\epsilon_i$?  Let's say we conclude from diagnostic plots that the $y_i-\hat{y}_i$ are approximately iid Gaussian.  Can we conclude that the $\epsilon_i$ are approximately iid Gaussian?

Assume that our residuals are iid Gaussian:
\begin{align}
\left(\beta-\hat{\beta}\right)^T X^{(i)}+\epsilon_i\sim \mathcal{N}(0,\sigma^2_{\textrm{res}})
\end{align}
then, since $\beta^T X^{(i)}$ is a constant, subtracting the residual from it gives
\begin{align}
\hat{\beta}^T X^{(i)}-\epsilon_i\sim \mathcal{N}(\beta^T X^{(i)},\sigma^2_{\textrm{res}})
\end{align}
but this tells us nothing about the individual distributions of $\hat{\beta}^T X^{(i)}$ and $\epsilon_i$: the difference of two random variables can be Gaussian without either being Gaussian (e.g. $A=Z+C$ and $B=C$, with $Z$ Gaussian and $C$ arbitrary and independent of $Z$).

On the other hand, assume that the $\epsilon_i$ are iid mean-zero Gaussian with variance $\sigma^2$, and that $j$ indexes a new point drawn from the same model but not in the training set (so that $\hat{\beta}$ and $\epsilon_j$ are independent).  Then $\hat{\beta}\sim \mathcal{N}\left(\beta,\sigma^2 (\boldsymbol{X}^T\boldsymbol{X})^{-1}\right)$, where $\boldsymbol{X}$ is the design matrix with rows $X^{(i)T}$, so that
\begin{align}
\left(\beta-\hat{\beta}\right)^T X^{(j)}+\epsilon_j\sim \mathcal{N}\left(0,\sigma^2 X^{(j)T} (\boldsymbol{X}^T\boldsymbol{X})^{-1}X^{(j)}+\sigma^2\right)
\end{align}
Thus Gaussian errors imply Gaussian residuals, but not vice versa.
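As a sanity check, here's a rough Monte Carlo sketch of the variance formula above, under the stated assumptions (iid mean-zero Gaussian errors, a new point held fixed across trials); the setup values are illustrative, not from the post.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 3, 0.7
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
x_new = rng.normal(size=p)            # the new point X^(j), fixed across trials

trials = 20_000
pred_errors = np.empty(trials)
for t in range(trials):
    eps = rng.normal(scale=sigma, size=n)
    y = X @ beta + eps
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    eps_new = rng.normal(scale=sigma)               # independent of beta_hat
    pred_errors[t] = (beta - beta_hat) @ x_new + eps_new

# Theoretical variance: sigma^2 x_j^T (X^T X)^{-1} x_j + sigma^2
theory = sigma**2 * x_new @ np.linalg.inv(X.T @ X) @ x_new + sigma**2
print(np.var(pred_errors), theory)    # should agree closely
```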

So approximately iid Gaussian residuals are a necessary but not sufficient condition for the errors to be iid Gaussian.  In practice, it seems unlikely that real data would produce two highly non-Gaussian random variables whose sum is approximately Gaussian, so in a very heuristic sense, approximately Gaussian residuals 'probably' indicate approximately Gaussian errors.
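For completeness, here is a sketch of one way to check residual normality in practice, using a Shapiro-Wilk test (assuming scipy is available; the simulated data is illustrative, and in practice a Q-Q plot is a common visual alternative).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.7, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Shapiro-Wilk: a small p-value is evidence against Gaussian residuals
stat, pval = stats.shapiro(residuals)
print(stat, pval)
```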

It's worth keeping the distinction between errors and residuals in mind: I've made the mistake of confusing the two in the past.
