Better Regression Models
Proper treatment of variables yields better models:
- Removing outliers
- Fitting nonlinear functions
- Feature/target scaling
- Collapsing highly correlated variables
Outliers and Linear Regression
Because of the quadratic weight of residuals, outlying points can greatly affect the fit.
Identifying outlying points and removing them in a principled way can yield a more robust fit.
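A minimal sketch of one such approach, residual trimming (the threshold `k` and helper names are illustrative, not from the source):

```python
import numpy as np

def fit(X, y):
    # Ordinary least squares with an explicit intercept column.
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def fit_without_outliers(X, y, k=3.0):
    # Fit once, drop points whose residual exceeds k standard deviations,
    # then refit on the surviving points.
    w = fit(X, y)
    resid = y - np.column_stack([np.ones(len(X)), X]) @ w
    keep = np.abs(resid) < k * resid.std()
    return fit(X[keep], y[keep])
```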
Fitting Non-Linear Functions
Linear regression fits lines, not high-order curves.
But we can fit quadratics by adding another variable with the value $x^2$ to our data matrix.
We can fit arbitrary polynomials, as well as roots, logarithms, and reciprocals, by explicitly including the component variables in our data matrix: $\sqrt{x}$, $\log(x)$, $x^3$, $1/x$.
However explicit inclusion of all possible non-linear terms quickly becomes intractable.
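A small sketch of the quadratic case, using ordinary least squares over an augmented data matrix (the synthetic data here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = 2.0 * x**2 - 3.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# Linear regression still fits a "line" -- but in the columns [1, x, x^2],
# so the resulting model is a quadratic in x.
A = np.column_stack([np.ones_like(x), x, x**2])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)   # roughly [1, -3, 2]
```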
Feature Scaling: Z-scores
Features over widely different numerical ranges (say, national population vs. small fractions) require coefficients on very different scales to combine:
$$
V = c_1 \cdot 300{,}000{,}000 + c_2 \cdot 0.02
$$
In gradient descent, a fixed learning rate (step size) will overshoot on some coefficients and undershoot on others over such a range.
Scale the features in your matrix to Z-scores.
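A minimal sketch of column-wise Z-scoring (assuming a NumPy feature matrix with rows as observations):

```python
import numpy as np

def zscore_columns(X):
    # Rescale each feature column to zero mean and unit standard deviation,
    # so gradient descent can use one step size for all coefficients.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma
```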
Dominance of Power Law Features
Consider a linear model for years of education, which ranges from 0 to $12+4+5=21$:
$$
Y = c_1 \cdot \text{income} + c_2
$$
No such model can give sensible answers for both my kids and Bill Gates’ kids.
Z-scores of such power law variables don’t help because they are just a linear transformation.
Feature Scaling: Sublinear Functions
An enormous gap between the largest (or smallest) and median values means no single coefficient can use the feature without blowing up on the big values.
The key is to replace/augment such features x with sublinear functions like log(x) and sqrt(x).
Z-scores of these variables will prove much more meaningful.
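An illustrative sketch (the income values are made up) contrasting raw Z-scores with Z-scores of the log-transformed feature:

```python
import numpy as np

income = np.array([15_000, 40_000, 80_000, 250_000, 10_000_000], dtype=float)

# Raw z-scores: the single huge value dominates; typical incomes are squashed together.
z_raw = (income - income.mean()) / income.std()

# Log first, then z-score: the spacing between typical values becomes meaningful.
log_income = np.log(income)
z_log = (log_income - log_income.mean()) / log_income.std()

print(np.round(z_raw, 2))
print(np.round(z_log, 2))
```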
Small Coefficients Need Small Targets
Trying to predict income from Z-scored variables will need large coefficients: how can you get to $100,000 from features that range from $-3$ to $+3$?
If your features are normally distributed, you can only do a good job regressing to a similarly distributed target.
Taking logs of big targets can give better models.
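A minimal sketch of regressing on the log of the target and mapping predictions back (function names are illustrative):

```python
import numpy as np

def fit_log_target(A, y):
    # Regress on log(y) so coefficients over z-scored features
    # only need to span a modest range to reach large targets.
    w, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
    return w

def predict_original_scale(A, w):
    # Undo the log transform to report predictions in the original units.
    return np.exp(A @ w)
```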
Avoid Highly Correlated Features
Suppose you have two perfectly-correlated features (e.g. height in feet, height in meters).
This is confusing (how should the weight be distributed between them?), but worse:
the rows of the covariance matrix $A^TA$ become linearly dependent, since $r_1 = c \times r_2$, so computing $w=(A^TA)^{-1}A^Tb$ requires inverting a singular matrix.
Punting Highly Correlated Features
Perfectly correlated features provide no additional information for modeling.
Identify them by computing the covariance matrix: either one can go with little loss.
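One simple way to flag and drop redundant features (the threshold and function name are illustrative; the correlation matrix is used here as a normalized stand-in for the covariance matrix):

```python
import numpy as np

def drop_correlated_features(X, names, threshold=0.95):
    # Compute the feature-by-feature correlation matrix and drop one member
    # of every pair whose absolute correlation exceeds the threshold.
    corr = np.corrcoef(X, rowvar=False)
    keep = list(range(X.shape[1]))
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            if i in keep and j in keep and abs(corr[i, j]) > threshold:
                keep.remove(j)
    return X[:, keep], [names[i] for i in keep]
```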
This motivates the problem of dimension reduction: e.g. singular value decomposition, principal component analysis.