Basic Idea

Fit a line to a bunch of points.

Example

[Figure: distribution of college GPAs]

Without any extra information, the best constant prediction is the mean GPA, 2.47.

Average squared error = $\mathbb{E} [(\text{student GPA} - \text{predicted GPA})^2]$, which equals the variance when we predict the mean.
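As a quick check of this identity, here is a minimal NumPy sketch; the GPA values are made up for illustration:

```python
# Minimal sketch (NumPy): predicting the mean of some hypothetical GPAs
# gives an average squared error equal to the variance of the data.
import numpy as np

gpas = np.array([2.1, 3.4, 2.0, 2.8, 1.9, 3.5, 2.6])  # made-up GPAs

prediction = gpas.mean()                   # the best constant predictor
mse = np.mean((gpas - prediction) ** 2)    # average squared error

print(mse, gpas.var())  # identical values: MSE of the mean == variance
```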

If we have SAT scores, then we can fit a line.

[Figure: college GPA vs. SAT score, with the fitted line]

Now if we predict based on this line, the MSE drops to 0.43.

This is a regression problem with:

  • Predictor variable: SAT score
  • Response variable: College GPA
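A minimal sketch of such a fit with NumPy's least-squares `polyfit`; the SAT/GPA pairs below are invented stand-ins for the real dataset, so the resulting MSE will not match the 0.43 quoted above:

```python
# Sketch: fitting the SAT -> GPA line with NumPy's least-squares polyfit.
# The SAT/GPA pairs are invented stand-ins for the real dataset.
import numpy as np

sat = np.array([900, 1100, 1250, 1320, 1450, 1550], dtype=float)
gpa = np.array([1.9, 2.3, 2.6, 2.8, 3.2, 3.5])

a, b = np.polyfit(sat, gpa, deg=1)   # slope a and intercept b of y = ax + b
pred = a * sat + b
mse = np.mean((gpa - pred) ** 2)     # MSE under the fitted line
print(a, b, mse)
```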

Formula

For $x \in \mathbb{R}$: $$y = ax + b$$

For $x \in \mathbb{R}^{10}$: $$f(x) = w_1x_1 + w_2x_2 + \dots + w_{10}x_{10} + b = w \cdot x + b$$

where $w = (w_1, w_2, \dots, w_{10})$
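In code, the $\mathbb{R}^d$ model is just a dot product plus a bias; a direct transcription of the formula:

```python
# Direct transcription of the d-dimensional model f(x) = w . x + b.
import numpy as np

def predict(w: np.ndarray, b: float, x: np.ndarray) -> float:
    # w and x are vectors in R^d; b is a scalar bias
    return float(np.dot(w, x) + b)

w = np.array([0.5, -1.2, 0.3])                    # example weights
print(predict(w, b=0.1, x=np.array([1.0, 0.0, 2.0])))
```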

Loss Function

The mean squared error (MSE) is $$MSE(a, b) = \frac{1}{n} \sum_{i=1}^n \left(y^{(i)} - (ax^{(i)} + b)\right)^2$$

and the loss function for $\mathbb{R}$ is $$L(a, b) = \sum_{i=1}^n \left(y^{(i)} - (ax^{(i)} + b)\right)^2$$

Analogously, for $\mathbb{R}^d$:

$$L(w, b) = \sum_{i=1}^n \left(y^{(i)} - (w \cdot x^{(i)} + b)\right)^2$$ where $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, $w \in \mathbb{R}^d$, and $b \in \mathbb{R}$

Alternatively, we can write this as: $$f(x) = \tilde{w} \cdot \tilde{x}$$ $$L( \tilde{w} ) = \sum_{i=1}^n (y^{(i)} - \tilde{w} \cdot \tilde{x}^{(i)})^2$$ where $\tilde{x} = (1, x) \in \mathbb{R}^{d+1}$ and $\tilde{w} = (b, w) \in \mathbb{R}^{d+1}$. The loss is minimized at $\tilde{w} = (X^TX)^{-1}(X^Ty)$, where $X$ is the $n \times (d+1)$ matrix whose rows are the $\tilde{x}^{(i)}$ and $y = (y^{(1)}, \dots, y^{(n)})$.
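A sketch of this closed form on synthetic data, using the tilde trick to fold the bias into $\tilde{w}$ and solving the normal equations as a linear system rather than forming the inverse explicitly:

```python
# Sketch: the closed-form least-squares solution on synthetic data,
# using the tilde trick to fold the bias b into w~ and solving the
# normal equations (X^T X) w~ = X^T y instead of inverting X^T X.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.7          # made-up targets
y = X @ true_w + true_b + 0.1 * rng.normal(size=n)

X_tilde = np.hstack([np.ones((n, 1)), X])                 # rows are (1, x)
w_tilde = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)

b_hat, w_hat = w_tilde[0], w_tilde[1:]
print(b_hat, w_hat)   # close to 0.7 and [2.0, -1.0, 0.5]
```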

Regularization

A regularization term is crucial for avoiding overly complex models. It is introduced as follows:

$$L(w, b) = \sum_{i=1}^n (y^{(i)} -(w \cdot x^{(i)} + b))^2 + \lambda||w||^2$$ and the solution is $w = (X^TX + \lambda I)^{-1}(X^Ty)$

With the help of the regularizer, we avoid an overly complex model (overfitting the data) and obtain better performance on the test set.

λ         training MSE   test MSE
0.00001   0              585.81
0.0001    0              564.28
0.001     0              404.08
0.01      0.01           83.48
0.1       0.03           19.26
1         0.07           7.02
10        0.35           2.84
100       2.4            5.79
1000      8.19           10.97
10000     10.83          12.63
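The following sketch reproduces the qualitative trend on synthetic data, plugging different $\lambda$ values into the closed-form solution above (the bias term is omitted for brevity, and the exact numbers will differ from the table):

```python
# Sketch reproducing the qualitative lambda trend on synthetic data:
# w = (X^T X + lambda I)^{-1} X^T y, scored on a held-out test set.
# Bias omitted for brevity; exact numbers will differ from the table.
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 100                        # few samples, many features -> overfitting
X = rng.normal(size=(n, d))
w_true = np.zeros(d); w_true[:5] = 3.0
y = X @ w_true + rng.normal(size=n)
X_test = rng.normal(size=(200, d))
y_test = X_test @ w_true + rng.normal(size=200)

for lam in [1e-5, 1e-3, 1e-1, 1.0, 10.0, 100.0, 1000.0]:
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    train_mse = np.mean((y - X @ w) ** 2)
    test_mse = np.mean((y_test - X_test @ w) ** 2)
    print(f"lambda={lam:g}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```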

Ridge Regression

$$L(w, b) = \sum_{i=1}^n (y^{(i)} -(w \cdot x^{(i)} + b))^2 + \lambda||w||_2^2$$

Lasso Regression (produces a sparse $w$)

$$L(w, b) = \sum_{i=1}^n (y^{(i)} -(w \cdot x^{(i)} + b))^2 + \lambda||w||_1$$

Key differences:

  • Ridge shrinks all weights and works well against overfitting
  • Lasso shrinks the weights of less important features all the way to zero, which makes it useful for feature selection (see the sketch below)
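A sketch contrasting the two on the same synthetic data, using scikit-learn's Ridge and Lasso; the data and the alpha values are arbitrary illustrative choices:

```python
# Sketch contrasting ridge and lasso on the same synthetic data:
# lasso drives most coefficients exactly to zero, ridge only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20); w_true[:3] = [4.0, -2.0, 1.5]   # only 3 relevant features
y = X @ w_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))  # typically all 20
print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))  # typically ~3
```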