Issues with Closed Form Solution
Regression as Parameter Fitting
We seek the coefficients that minimize the sum of squared errors over the points, taken over all possible coefficient values.
$$
J(w_0, w_1) = \frac{1}{2n}\sum_{i=1}^n(y_i-f(x_i))^2
$$
Here the regression line is:
$$
f(x) = w_0 + w_1x
$$
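As a minimal sketch of evaluating this objective (assuming NumPy arrays `x` and `y` of equal length; the helper name `squared_error` is just illustrative):

```python
import numpy as np

def squared_error(w0, w1, x, y):
    """J(w0, w1): halved mean of squared residuals for the line f(x) = w0 + w1*x."""
    n = len(x)
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2) / (2 * n)
```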
Lines in Parameter Space
The error function $J(w_0, w_1)$ is convex, making it easy to find the single local/global minimum.
Gradient Descent Search
A search space with only one local/global minimum is called convex.
When a search space is convex, it is easy to find the minimum: just keep walking down.
The fastest direction down is defined by the slope or tangent at the current point.
Gravity will take it down.
The Fastest Way Down
The direction down at a point is given by its derivative, which is specified by its tangent line.
This can be computed approximately by evaluating the point $(x+dx, y(x+dx))$ and fitting a line through it and $(x, y(x))$.
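A minimal sketch of that finite-difference idea (the function name `approx_derivative` and the step size `dx` are illustrative assumptions, not from the text):

```python
def approx_derivative(y, x, dx=1e-6):
    """Slope of the line through (x, y(x)) and (x + dx, y(x + dx))."""
    return (y(x + dx) - y(x)) / dx

# Example: the derivative of x**2 at x = 3 is approximately 6.
slope = approx_derivative(lambda x: x ** 2, 3.0)
```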
Partial Derivatives
The symbolic way of computing the gradient requires computing the partial derivative of the objective function:
$$
\begin{align}
\frac{\partial}{\partial w_j}J(w_0, w_1) &= \frac{\partial}{\partial w_j}\frac{1}{2n}\sum_{i=1}^n\bigl(f(x_i)-y_i\bigr)^2 \\
&= \frac{\partial}{\partial w_j}\frac{1}{2n}\sum_{i=1}^n\bigl(w_0 + w_1x_i - y_i\bigr)^2
\end{align}
$$
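Carrying the differentiation through with the chain rule gives the two gradient components (a standard result, written in the notation above):

$$
\begin{align}
\frac{\partial J}{\partial w_0} &= \frac{1}{n}\sum_{i=1}^n\bigl(f(x_i)-y_i\bigr) \\
\frac{\partial J}{\partial w_1} &= \frac{1}{n}\sum_{i=1}^n\bigl(f(x_i)-y_i\bigr)x_i
\end{align}
$$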
Gradient Descent for Regression
Gradient descent algorithm
$$
\begin{align}
&\text{repeat until convergence } \{ \\
&\quad\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) \\
&\quad \text{(for } j = 1 \text{ and } j = 0\text{)} \\
&\}
\end{align}
$$
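A minimal sketch of this update loop for the two-parameter line, using the gradient components derived above (the learning rate `alpha` and the fixed iteration count standing in for a real convergence test are illustrative assumptions):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iters=1000):
    """Fit f(x) = w0 + w1*x by gradient descent on J(w0, w1)."""
    w0, w1 = 0.0, 0.0
    n = len(x)
    for _ in range(iters):
        residuals = (w0 + w1 * x) - y           # f(x_i) - y_i
        grad_w0 = residuals.sum() / n            # dJ/dw0
        grad_w1 = (residuals * x).sum() / n      # dJ/dw1
        w0, w1 = w0 - alpha * grad_w0, w1 - alpha * grad_w1  # simultaneous update
    return w0, w1

# Usage: recovers approximately w0 = 1, w1 = 2 for points on the line y = 1 + 2x.
w0, w1 = gradient_descent(np.array([0.0, 1.0, 2.0, 3.0]),
                          np.array([1.0, 3.0, 5.0, 7.0]),
                          alpha=0.1, iters=2000)
```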
Linear Regression Model
Which Functions are Convex
Getting Trapped in Local Optima
Effect of Learning Rate/Step Size
What is the Right Learning Rate?
Stochastic Gradient Descent
Too Many Features?
Providing a rich set of features to regression is good, but remember Occam’s Razor: ‘The simplest explanation is best.’
Ideally our regression would select the most important variables and fit them, but our objective function only tries to minimize the sum-of-squares error.
Regularization
The trick is to add a term to the objective function that seeks to keep the coefficients small:
$$
J(w) = \frac{1}{2n}\sum_{i=1}^n\bigl(y_i - f(x_i)\bigr)^2 + \lambda\sum_j w_j^2
$$
We pay a penalty proportional to the sum of squares of the coefficients, thus ignoring sign.
This rewards us for setting coefficients to zero.
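As a sketch of how this penalty changes the gradient update from the earlier loop (leaving the intercept $w_0$ unpenalized is a common convention rather than something stated above, and `lam` standing in for the penalty weight $\lambda$ is an illustrative choice):

```python
import numpy as np

def ridge_gradient_descent(x, y, alpha=0.1, lam=0.1, iters=2000):
    """Gradient descent on J(w0, w1) + lam * w1**2 (an L2 penalty on the slope)."""
    w0, w1 = 0.0, 0.0
    n = len(x)
    for _ in range(iters):
        residuals = (w0 + w1 * x) - y
        grad_w0 = residuals.sum() / n                        # intercept left unpenalized
        grad_w1 = (residuals * x).sum() / n + 2 * lam * w1   # penalty contributes 2*lam*w1
        w0, w1 = w0 - alpha * grad_w0, w1 - alpha * grad_w1
    return w0, w1
```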
Interpreting/Penalizing Coefficients
When a variable has mean zero, the magnitude of its coefficient is a measure of its value to the objective function.
Penalizing the sum of squared coefficients is ridge regression or Tikhonov regularization.
Penalizing the sum of the absolute values of the coefficients (the $L_1$ metric instead of $L_2$) is LASSO regression.
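A small comparison using scikit-learn (the toy data and the `alpha` penalty weights are illustrative assumptions) showing the qualitative difference: the $L_2$ penalty shrinks coefficients, while the $L_1$ penalty can zero them out entirely.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Toy data: five features, only the first actually drives y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: coefficients shrink toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: coefficients can become exactly zero

print(ridge.coef_)   # small but generally nonzero values for the irrelevant features
print(lasso.coef_)   # irrelevant features typically driven to exactly zero
```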