Gradient Descent Search and Regularization

Issues with Closed Form Solution

Regression as Parameter Fitting

We seek, over all possible coefficient values, the coefficients that minimize the sum of squared errors of the points:
$$
J(w_0, w_1) = \frac{1}{2n}\sum_{i=1}^n(y_i-f(x_i))^2
$$
Here the regression line is:
$$
f(x) = w_0 + w_1x
$$
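
A minimal Python sketch of this objective, assuming NumPy and some made-up data points for illustration:

```python
import numpy as np

def predict(x, w0, w1):
    """Regression line f(x) = w0 + w1 * x."""
    return w0 + w1 * x

def squared_error_cost(x, y, w0, w1):
    """J(w0, w1) = 1/(2n) * sum_i (y_i - f(x_i))^2."""
    n = len(x)
    residuals = y - predict(x, w0, w1)
    return np.sum(residuals ** 2) / (2 * n)

# Illustrative (made-up) data points.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(squared_error_cost(x, y, w0=0.0, w1=2.0))
```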

Lines in Parameter Space

The error function $J(w_0, w_1)$ is convex, making it easy to find its single local/global minimum.

A space with only one local/global minimum is called convex.

When a search space is convex, it is easy to find the minimum: just keep walking down.

The fastest direction down is defined by the slope or tangent at the current point.

Gravity will take it down.

The Fastest Way Down

The direction down at a point is given by its derivative, which is specified by its tangent line.

This can be computed approximately by finding the point $(x+dx, y(x+dx))$ and fitting a line through it and $(x, y(x))$.
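
A minimal sketch of this finite-difference approximation; the function and step size below are just for illustration:

```python
def numerical_slope(y, x, dx=1e-6):
    """Approximate the slope of y at x by fitting a line through
    (x, y(x)) and (x + dx, y(x + dx))."""
    return (y(x + dx) - y(x)) / dx

# Example: the slope of y = x^2 at x = 3 is approximately 6.
print(numerical_slope(lambda x: x ** 2, 3.0))
```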

Partial Derivatives

The symbolic way of computing the gradient requires taking the partial derivative of the objective function with respect to each parameter $w_j$:
$$
\begin{align}
\frac{\partial}{\partial w_j}J(w_0, w_1) &= \frac{\partial}{\partial w_j}\frac{1}{2n}\sum_{i=1}^n\bigl(f(x_i)-y_i\bigr)^2\\
&= \frac{\partial}{\partial w_j}\frac{1}{2n}\sum_{i=1}^n\bigl(w_0 + w_1x_i - y_i\bigr)^2
\end{align}
$$
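
Carrying out the differentiation (a routine step left implicit above) gives the two partial derivatives that gradient descent will use:
$$
\frac{\partial J}{\partial w_0}=\frac{1}{n}\sum_{i=1}^n\bigl(f(x_i)-y_i\bigr),\qquad
\frac{\partial J}{\partial w_1}=\frac{1}{n}\sum_{i=1}^n\bigl(f(x_i)-y_i\bigr)x_i
$$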

Gradient Descent for Regression

Gradient descent algorithm

$$
\begin{align}
&\text{repeat until convergence } \{\\
&\quad\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1)\\
&\quad \text{(for } j = 1 \text{ and } j = 0\text{)}\\
&\}
\end{align}
$$
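
A minimal Python sketch of this update rule applied to the linear regression objective above; the learning rate, iteration count, and data are illustrative choices:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iters=5000):
    """Fit f(x) = w0 + w1 * x by gradient descent on J(w0, w1)."""
    n = len(x)
    w0, w1 = 0.0, 0.0
    for _ in range(iters):
        error = (w0 + w1 * x) - y        # f(x_i) - y_i for every point
        grad_w0 = np.sum(error) / n      # dJ/dw0
        grad_w1 = np.sum(error * x) / n  # dJ/dw1
        w0, w1 = w0 - alpha * grad_w0, w1 - alpha * grad_w1  # simultaneous update
    return w0, w1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(gradient_descent(x, y))  # approaches roughly w0 ≈ 0, w1 ≈ 2
```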

Linear Regression Model

Which Functions are Convex

Getting Trapped in Local Optima

Effect of Learning Rate/Step Size

What is the Right Learning Rate?

Stochastic Gradient Descent

Too Many Features?

Providing a rich set of features to regression is good, but remember Occam’s Razor: ‘The simplest explanation is best.’

Ideally our regression would select the most important variables and fit them, but our objective function only tries to minimize the sum-of-squares error.

Regularization

The trick is to add terms to the objective function that seek to keep the coefficients small:
$$
J(w_0, w_1) = \frac{1}{2n}\sum_{i=1}^n\bigl(y_i-f(x_i)\bigr)^2 + \lambda\sum_{j}w_j^2
$$
We pay a penalty proportional to the sum of squares of the coefficients, thus ignoring sign.

This rewards us for keeping coefficients small, and in particular for setting unneeded coefficients to zero.
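
A minimal sketch of the penalized objective, reusing the illustrative data above; the penalty weight `lam` stands in for $\lambda$:

```python
import numpy as np

def ridge_cost(x, y, w0, w1, lam):
    """Squared-error cost plus a penalty proportional to the sum of
    squared coefficients (in practice the intercept is often excluded)."""
    n = len(x)
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2) / (2 * n) + lam * (w0 ** 2 + w1 ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(ridge_cost(x, y, w0=0.0, w1=2.0, lam=0.1))
```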

Interpreting/Penalizing Coefficients

When a variable has mean zero, its coefficient magnitude is a measure of its value to the objective function.

Penalizing the sum of squared coefficients is called ridge regression or Tikhonov regularization.

Penalizing the sum of the absolute values of the coefficients uses the $L_1$ metric instead of the $L_2$ metric.
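
As an illustration of the difference between the two penalties (this sketch assumes scikit-learn, which the slides do not mention), the $L_1$ penalty tends to drive unimportant coefficients exactly to zero, while the $L_2$ penalty only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # five zero-mean features
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)  # only two matter

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 (sum of squares) penalty
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 (sum of absolute values) penalty

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
```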

What is the Right Lambda?