Better Regression Models

Proper treatment of variables yields better models:

  • Removing outliers
  • Fitting nonlinear functions
  • Feature/target scaling
  • Collapsing highly correlated variables

Outliers and Linear Regression

Because of the quadratic weight of residuals, outlying points can greatly affect the fit.

Identifying outlying points and removing them in a principled way can yield a more robust fit.
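As a rough sketch of one such principled approach (assuming NumPy; the helper name robust_fit and the z_cut threshold are illustrative choices, not a standard recipe): fit once, drop points whose residuals lie several standard deviations out, and refit on the rest.

```python
import numpy as np

def robust_fit(x, y, z_cut=3.0):
    # Initial least-squares fit of y ~ c1*x + c0
    A = np.column_stack([x, np.ones_like(x)])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ w
    # Keep only points whose residual is within z_cut standard deviations of the mean
    keep = np.abs(resid - resid.mean()) < z_cut * resid.std()
    # Refit on the remaining points for a more robust fit
    w_clean, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
    return w_clean, keep
```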

Fitting Non-Linear Functions

Linear regression fits lines, not high-order curves.

But we can fit quadratics by adding another variable with the value $x^2$ to our data matrix.

We can fit arbitrary polynomials (including square roots) and exponentials/logarithms by explicitly including the component variables in our data matrix: $\sqrt{x}$, $\log(x)$, $x^3$, $1/x$.
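As a minimal sketch of this idea (assuming NumPy, with made-up data): once the transformed columns are added to the data matrix, the fit is still ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 2.0 * x**2 - 3.0 * np.sqrt(x) + rng.normal(0, 1, x.size)  # toy target

# Data matrix with explicit nonlinear columns: x, x^2, sqrt(x), log(x), intercept
A = np.column_stack([x, x**2, np.sqrt(x), np.log(x), np.ones_like(x)])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
```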

However, explicit inclusion of all possible non-linear terms quickly becomes intractable.

Feature Scaling: Z-scores

Features with wildly different numerical ranges (say, national population vs. fractions) require coefficients on wildly different scales to bring them together:
$$
V = c_1 \cdot 300{,}000{,}000 + c_2 \cdot 0.02
$$
In gradient descent, a fixed learning rate (step size) will overshoot on one such scale while undershooting on the other.

Scale the features in your matrix to Z-scores.
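A minimal sketch of this (assuming NumPy; zscore_columns is just an illustrative helper name):

```python
import numpy as np

def zscore_columns(X):
    # Convert each feature column to Z-scores: zero mean, unit variance
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard against constant columns
    return (X - mu) / sigma

# A population-scale column and a fraction-scale column end up on comparable scales
X = np.array([[3.0e8, 0.02],
              [1.2e8, 0.05],
              [6.0e7, 0.01]])
Xz = zscore_columns(X)
```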

Dominance of Power Law Features

Consider a linear model for years of education, which ranges from 0 to 12+4+5 = 21:
$$
Y = c_1 \cdot \text{income} + c_2
$$
No such model can give sensible answers for both my kids and Bill Gates’ kids.

Z-scores of such power law variables don’t help because they are just a linear transformation.

Feature Scaling: Sublinear Functions

An enormous gap between the largest/smallest and median values means no coefficient can use the feature without blowing up on the big values.

The key is to replace/augment such features $x$ with sublinear functions like $\log(x)$ and $\sqrt{x}$.

Z-scores of these variables will prove much more meaningful.
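A small numeric sketch of why this helps (assuming NumPy; the income values are invented):

```python
import numpy as np

income = np.array([2.0e4, 3.5e4, 5.0e4, 8.0e4, 1.0e9])  # one enormous value dominates
log_income = np.log(income)

# Z-score of the raw feature is still dominated by the single huge value;
# Z-score of log(income) spreads the ordinary values over a meaningful range.
z_raw = (income - income.mean()) / income.std()
z_log = (log_income - log_income.mean()) / log_income.std()
```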

Small Coefficients Need Small Targets

Trying to predict income from Z-scored variables will need large coefficients: how can you get to $100,000 from functions of -3 to +3?

If your features are normally distributed, you can only do a good job regressing to a similarly distributed target.

Taking logs of big targets can give better models.
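A minimal sketch of this, assuming NumPy and positive targets (fit_log_target and predict_income are illustrative helper names):

```python
import numpy as np

def fit_log_target(A, income):
    # Regress on log(income) rather than income itself
    w, *_ = np.linalg.lstsq(A, np.log(income), rcond=None)
    return w

def predict_income(A, w):
    # Map predictions back from log-space to dollars
    return np.exp(A @ w)
```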

Avoid Highly Correlated Features

Suppose you have two perfectly-correlated features (e.g. height in feet, height in meters).

This is confusing (how should the weight be distributed between them?), but it gets worse:

The rows of the covariance matrix are linearly dependent (since $r_1 = c \times r_2$), so $w = (A^T A)^{-1} A^T b$ requires inverting a singular matrix.
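A small NumPy sketch of the problem (the height values are invented):

```python
import numpy as np

height_ft = np.array([5.0, 5.5, 6.0, 6.5])
height_m = 0.3048 * height_ft                  # an exact linear copy of the first column
A = np.column_stack([height_ft, height_m])

print(np.linalg.matrix_rank(A.T @ A))          # 1, not 2: the normal equations are singular
# Inverting A.T @ A is ill-posed here: np.linalg.inv either raises LinAlgError
# or returns numerically meaningless values.
```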

Punting Highly Correlated Features

Perfectly correlated features provide no additional information for modeling.

Identify them by computing the covariance matrix: either one can go with little loss.

This motivates the problem of dimension reduction: e.g. singular value decomposition, principal component analysis.
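A small sketch of the identification step (assuming NumPy; it uses the normalized correlation matrix rather than the raw covariance matrix, and the 0.99 threshold is an arbitrary illustrative cutoff):

```python
import numpy as np

def correlated_pairs(X, threshold=0.99):
    corr = np.corrcoef(X, rowvar=False)        # feature-by-feature correlation matrix
    d = corr.shape[0]
    pairs = []
    for i in range(d):
        for j in range(i + 1, d):
            if abs(corr[i, j]) >= threshold:
                pairs.append((i, j))           # either column i or j can go with little loss
    return pairs
```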