<b>Errors vs Residuals in Regression</b><br />
People often confuse errors and residuals in linear regression. Recall our linear model is given by<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js" type="text/javascript">
MathJax.Hub.Config({
HTML: ["input/TeX","output/HTML-CSS"],
TeX: { extensions: ["AMSmath.js","AMSsymbols.js"],
equationNumbers: { autoNumber: "AMS" } },
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true },
"HTML-CSS": { availableFonts: ["TeX"],
linebreaks: { automatic: true } }
});
</script><br />
\begin{align}<br />
y_i=\beta^T X^{(i)}+\epsilon_i<br />
\end{align}<br />
where $X^{(i)}$ is a feature vector, $y_i$ is our observation/response, and $\epsilon_i$ are our 'true' errors. We then fit a linear regression model via OLS and learn a vector $\hat{\beta}$. Our residuals, on the other hand, are<br />
\begin{align}<br />
y_i-\hat{y}_i=(\beta-\hat{\beta})^TX^{(i)}+\epsilon_i<br />
\end{align}<br />
The assumptions of linear regression (homoskedasticity, independence, etc.) are really assumptions on the $\epsilon_i$, but we generally run tests on the residuals $y_i-\hat{y}_i$, since we don't observe $\epsilon_i$.<br />
<br />
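To make the distinction concrete, here is a small simulation sketch (not from the original post; the coefficients and noise level are made up): we generate data from a known linear model, fit simple OLS in closed form, and compare the observable residuals with the unobservable true errors.

```python
import random

random.seed(0)

# Hypothetical true model: y = 2*x + 1 + eps, eps ~ N(0, 0.5^2)
n = 1000
beta, intercept, sigma = 2.0, 1.0, 0.5
x = [random.uniform(0, 10) for _ in range(n)]
eps = [random.gauss(0, sigma) for _ in range(n)]  # true errors (unobserved in practice)
y = [beta * xi + intercept + ei for xi, ei in zip(x, eps)]

# Closed-form OLS fit for slope and intercept (simple linear regression)
x_bar = sum(x) / n
y_bar = sum(y) / n
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
           sum((xi - x_bar) ** 2 for xi in x)
intercept_hat = y_bar - beta_hat * x_bar

# Residuals: the observable proxies for the errors
resid = [yi - (beta_hat * xi + intercept_hat) for xi, yi in zip(x, y)]

# Residuals are close to, but not equal to, the true errors
max_gap = max(abs(r - e) for r, e in zip(resid, eps))
print(f"beta_hat={beta_hat:.3f}, max |residual - error| = {max_gap:.4f}")
```

The gap between each residual and its error is exactly $(\beta-\hat{\beta})^TX^{(i)}$ plus the intercept estimation error, which is why the two coincide only when the fit is perfect.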
What can running a test on $y_i-\hat{y}_i$ tell us about $\epsilon_i$? Let's say that we conclude from plots that $y_i-\hat{y}_i$ is approximately iid Gaussian. Can we conclude that $\epsilon_i$ is approximately iid Gaussian?<br />
<br />
Assume that our residuals are iid Gaussian,<br />
\begin{align}<br />
\left(\beta-\hat{\beta}\right)^TX^{(i)}+\epsilon_i\sim \mathcal{N}(0,\sigma^2_{\textrm{res}})<br />
\end{align}<br />
then<br />
\begin{align}<br />
\hat{\beta}^T X^{(i)}-\epsilon_i\sim \mathcal{N}(\beta^T X^{(i)},\sigma^2_{\textrm{res}})<br />
\end{align}<br />
but this does not let us say anything about the distribution of $\hat{\beta}$ or $\epsilon_i$ individually.<br />
<br />
On the other hand, assume that the $\epsilon_i$ are iid Gaussian with variance $\sigma^2$, and that $j$ is a new point not in the training dataset but from the same model (so that $\hat{\beta}$ and $\epsilon_j$ are independent). Then $\hat{\beta}\sim \mathcal{N}(\beta,\sigma^2 (\boldsymbol{X}^T\boldsymbol{X})^{-1})$, so that<br />
\begin{align}<br />
(\beta-\hat{\beta})^TX^{(j)}+\epsilon_j\sim \mathcal{N}(0,\sigma^2 X^{(j)T} (\boldsymbol{X}^T\boldsymbol{X})^{-1}X^{(j)}+\sigma^2)<br />
\end{align}<br />
Thus Gaussian errors imply Gaussian residuals, but not vice versa.<br />
<br />
So checking whether the residuals are iid Gaussian is a necessary but not sufficient condition for the errors to be iid Gaussian. In practice, it seems unlikely that, in real data, the sum of two highly non-Gaussian random variables would be approximately Gaussian, so in a very heuristic sense, approximately Gaussian residuals 'probably' indicate that the errors are approximately Gaussian.<br />
<br />
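One supporting intuition for this heuristic is that the residuals converge to the true errors as the sample grows, since $\hat{\beta}\to\beta$, so diagnostics on residuals increasingly reflect the errors themselves. A small stdlib-only simulation sketch (the model and parameters are made up for illustration):

```python
import random

def max_resid_error_gap(n, seed=0):
    """Fit simple OLS on n points and return max |residual - true error|."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    eps = [rng.gauss(0, 1.0) for _ in range(n)]       # true errors
    y = [3.0 * xi + eps_i for xi, eps_i in zip(x, eps)]
    x_bar, y_bar = sum(x) / n, sum(y) / n
    beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
               sum((xi - x_bar) ** 2 for xi in x)
    a_hat = y_bar - beta_hat * x_bar
    return max(abs((yi - (beta_hat * xi + a_hat)) - e)
               for xi, yi, e in zip(x, y, eps))

# As n grows, beta_hat -> beta, so residuals approach the true errors
for n in (50, 500, 5000):
    print(n, round(max_resid_error_gap(n), 4))
```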
It's worth remembering this and distinguishing between errors and residuals: I've made the mistake of confusing the two in the past.<br />
<br />
<b>Hypothesis testing 2: two sample t-test</b>
In a previous post, we introduced the basic terminology of hypothesis testing and set up the problem of testing the null hypothesis $H_0$ that two websites have the same clickthrough rate, i.e. $p_1=p_2$, against the alternative hypothesis $H_1$ that $p_1\neq p_2$. Here we show how to do it.<br />
<br />
<b>Test Statistic and the t-test</b><br />
A test statistic is some function of a sample that is used for hypothesis testing. The name of a test typically denotes the distribution of its test statistic: for instance, a z-test refers to a hypothesis test whose test statistic is Gaussian distributed under the null hypothesis. However, to use an exact z-test, we require the standard deviation of our data to be known. A related alternative is the t-test. Consider iid observations $X_1,\cdots,X_n\sim \mathcal{N}(\mu,\sigma^2)$ with sample mean $\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i\sim \mathcal{N}(\mu,\sigma^2/n)$ for some unknown parameters $\mu,\sigma^2$. Let $S^2=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})^2$, which is an unbiased estimate of the variance. Then<br />
\begin{align}<br />
\frac{\bar{X}-\mu}{S/\sqrt{n}}<br />
\end{align}<br />
has a Student's t-distribution with $n-1$ degrees of freedom.<br />
<br />
If we wanted to test whether the true (population) mean equals some value $p$, we would set a threshold for the p-value, for instance 0.05, let $t$ be Student's t-distributed with $n-1$ degrees of freedom, and then check whether<br />
\begin{align}<br />
P(t\leq \frac{\bar{X}-p}{S/\sqrt{n}})&\leq 0.025\\P(t\geq \frac{\bar{X}-p}{S/\sqrt{n}})&\leq 0.025<br />
\end{align}<br />
If either of these holds, then we reject the null hypothesis.<br />
<br />
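In practice one would compute the p-value from a t CDF in a statistics library; as a self-contained sketch using only the Python standard library, we can instead approximate the two-sided p-value by simulating the null distribution of the statistic. The dataset and hypothesized means below are made up:

```python
import math
import random
import statistics

def t_statistic(sample, mu0):
    n = len(sample)
    s = statistics.stdev(sample)  # square root of the unbiased variance S^2
    return (statistics.mean(sample) - mu0) / (s / math.sqrt(n))

def one_sample_t_pvalue(sample, mu0, n_sim=20000, seed=0):
    """Two-sided p-value, estimating the null distribution of t by simulation."""
    rng = random.Random(seed)
    n = len(sample)
    t_obs = abs(t_statistic(sample, mu0))
    # Under H0, the distribution of the t statistic does not depend on mu or
    # sigma, so we can simulate from N(0, 1) and test against mu0 = 0.
    exceed = sum(
        abs(t_statistic([rng.gauss(0, 1) for _ in range(n)], 0.0)) >= t_obs
        for _ in range(n_sim)
    )
    return exceed / n_sim

data = [5.1, 4.9, 5.4, 5.0, 5.3, 5.2, 4.8, 5.5]  # hypothetical measurements
print("p-value for H0: mu = 5.0 ->", one_sample_t_pvalue(data, 5.0))
print("p-value for H0: mu = 3.0 ->", one_sample_t_pvalue(data, 3.0))
```

The Monte Carlo p-value is just the fraction of simulated null statistics at least as extreme as the observed one, which is exactly what the tail probabilities above measure.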
<b>Two Samples</b><br />
<b><br /></b>
In our setting we don't want to test whether our sample mean matches some hypothesized true mean; instead we want to test whether the true means of two samples are equal. That is, we have two samples $X_1^{(1)},\cdots,X_{n_1}^{(1)}$ with mean $p_1$ and $X_1^{(2)},\cdots,X_{n_2}^{(2)}$ with mean $p_2$, and we assume the sample means satisfy $\bar{X}^{(1)}\sim \mathcal{N}(p_1,\sigma_1^2/n_1)$ and $\bar{X}^{(2)}\sim \mathcal{N}(p_2,\sigma_2^2/n_2)$. Then $\bar{X}^{(1)}-\bar{X}^{(2)}\sim \mathcal{N}(p_1-p_2,\sigma_1^2/n_1+\sigma_2^2/n_2)$ is the difference of the sample means. If $S^{2,(1)}$ is the unbiased estimator of $\sigma_1^2$ and $S^{2,(2)}$ that of $\sigma_2^2$, then under the null hypothesis $p_1=p_2$ we have $\bar{X}^{(1)}-\bar{X}^{(2)}\sim \mathcal{N}(0,\sigma_1^2/n_1+\sigma_2^2/n_2)$, so that<br />
\begin{align}<br />
\frac{\bar{X}^{(1)}-\bar{X}^{(2)}}{\sqrt{S^{2,(1)}/n_1+S^{2,(2)}/n_2}}<br />
\end{align}<br />
is approximately Student's t-distributed. When the two variances are assumed equal and a pooled variance estimate is used, the statistic is exactly t-distributed with $n_1+n_2-2$ degrees of freedom; with separate variance estimates as above (Welch's t-test), the degrees of freedom are instead given by the Welch&ndash;Satterthwaite approximation.<br />
<br />
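The two-sample statistic above can be computed directly. Here is a stdlib-only sketch (the two samples are hypothetical, standing in for measurements from two website variants):

```python
import math
import statistics

def two_sample_t(sample1, sample2):
    """Unpooled (Welch-style) two-sample t statistic from the formula above."""
    n1, n2 = len(sample1), len(sample2)
    s2_1 = statistics.variance(sample1)  # unbiased estimate S^2 of sigma_1^2
    s2_2 = statistics.variance(sample2)  # unbiased estimate S^2 of sigma_2^2
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    return diff / math.sqrt(s2_1 / n1 + s2_2 / n2)

# Hypothetical measurements from two website variants
site_a = [1, 2, 3, 4, 5]
site_b = [2, 4, 6, 8, 10]
print("t =", two_sample_t(site_a, site_b))
```

A large $|t|$ relative to the appropriate t-distribution's tails is then evidence against $p_1=p_2$.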
Here are some practice problems that people may find useful to further solidify understanding:<br />
1. Show that if $\bar{X}\sim \mathcal{N}(\mu,\sigma^2/n)$, then<br />
\begin{align}<br />
\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}<br />
\end{align}<br />
is $\mathcal{N}(0,1)$.<br />
2. Show that if $X_1,\cdots,X_n$ are iid $\mathcal{N}(\mu,\sigma^2)$, then<br />
\begin{align}<br />
\frac{\bar{X}-\mu}{S/\sqrt{n}}<br />
\end{align}<br />
has Student's t-distribution with $n-1$ degrees of freedom.<br />
3. Show that $S^2=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})^2$ is an unbiased estimator of the variance.
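For problem 3, a quick Monte Carlo sanity check (not a proof; the parameters are arbitrary) is to average $S^2$ over many simulated datasets and compare the result with $\sigma^2$:

```python
import random
import statistics

# Average the unbiased sample variance S^2 over many simulated datasets;
# unbiasedness predicts the average is close to sigma^2.
random.seed(0)
sigma2 = 4.0
n, reps = 5, 20000
mean_s2 = statistics.mean(
    statistics.variance([random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)])
    for _ in range(reps)
)
print("average S^2 over", reps, "samples:", round(mean_s2, 3), "(sigma^2 =", sigma2, ")")
```

Note that `statistics.variance` uses the $n-1$ denominator; with the $1/n$ denominator the average would come out systematically below $\sigma^2$.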