Better Regression Models
Proper treatment of variables yields better models:
- Removing outliers
- Fitting nonlinear functions
- Feature/target scaling
- Collapsing highly correlated variables
Outliers and Linear Regression
Because of the quadratic weight of residuals, outlying points can greatly affect the fit.
Identifying outlying points and removing them in a principled way can yield a more robust fit.
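A minimal sketch of one such approach, residual trimming (the threshold `k` and helper names are illustrative, not from the source):

```python
import numpy as np

def fit(X, y):
    # Ordinary least squares with an explicit intercept column.
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def fit_without_outliers(X, y, k=3.0):
    # Fit once, drop points whose residual exceeds k standard deviations,
    # then refit on the surviving points.
    w = fit(X, y)
    resid = y - np.column_stack([np.ones(len(X)), X]) @ w
    keep = np.abs(resid) < k * resid.std()
    return fit(X[keep], y[keep])
```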
Fitting Non-Linear Functions
Linear regression fits lines, not high-order curves.
But we can fit quadratics by adding another variable with the value $x^2$ to our data matrix.
We can fit arbitrary polynomials, as well as roots, logarithms, and reciprocals, by explicitly including the component variables in our data matrix: $\sqrt{x}$, $\log(x)$, $x^3$, $1/x$.
However explicit inclusion of all possible non-linear terms quickly becomes intractable.
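A small sketch of the quadratic case, using ordinary least squares over an augmented data matrix (the synthetic data here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = 2.0 * x**2 - 3.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# Linear regression still fits a "line" -- but in the columns [1, x, x^2],
# so the resulting model is a quadratic in x.
A = np.column_stack([np.ones_like(x), x, x**2])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)   # roughly [1, -3, 2]
```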
Feature Scaling: Z-scores
Features over widely different numerical ranges (say, national population vs. small fractions) require coefficients on very different scales to combine:
$$
V = c_1 \cdot 300{,}000{,}000 + c_2 \cdot 0.02
$$
In gradient descent, a fixed learning rate (step size) will overshoot on some coefficients and undershoot on others over such a range.
Scale the features in your matrix to Z-scores.
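A minimal sketch of column-wise Z-scoring (assuming a NumPy feature matrix with rows as observations):

```python
import numpy as np

def zscore_columns(X):
    # Rescale each feature column to zero mean and unit standard deviation,
    # so gradient descent can use one step size for all coefficients.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma
```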
Dominance of Power Law Features
Consider a linear model for years of education, which ranges from 0 to $12+4+5=21$:
$$
Y = c_1 \cdot \text{income} + c_2
$$
No such model can give sensible answers for both my kids and Bill Gates’ kids.
Z-scores of such power law variables don’t help because they are just a linear transformation.
Feature Scaling: Sublinear Functions
An enormous gap between the largest (or smallest) and median values means no single coefficient can use the feature without blowing up on the big values.
The key is to replace/augment such features x with sublinear functions like log(x) and sqrt(x).
Z-scores of these variables will prove much more meaningful.
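An illustrative sketch (the income values are made up) contrasting raw Z-scores with Z-scores of the log-transformed feature:

```python
import numpy as np

income = np.array([15_000, 40_000, 80_000, 250_000, 10_000_000], dtype=float)

# Raw z-scores: the single huge value dominates; typical incomes are squashed together.
z_raw = (income - income.mean()) / income.std()

# Log first, then z-score: the spacing between typical values becomes meaningful.
log_income = np.log(income)
z_log = (log_income - log_income.mean()) / log_income.std()

print(np.round(z_raw, 2))
print(np.round(z_log, 2))
```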
Small Coefficients Need Small Targets
Trying to predict income from Z-scored variables will need large coefficients: how can you get to $100,000 from features that range from $-3$ to $+3$?
If your features are normally distributed, you can only do a good job regressing to a similarly distributed target.
Taking logs of big targets can give better models.
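A minimal sketch of regressing on the log of the target and mapping predictions back (function names are illustrative):

```python
import numpy as np

def fit_log_target(A, y):
    # Regress on log(y) so coefficients over z-scored features
    # only need to span a modest range to reach large targets.
    w, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
    return w

def predict_original_scale(A, w):
    # Undo the log transform to report predictions in the original units.
    return np.exp(A @ w)
```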
Avoid Highly Correlated Features
Suppose you have two perfectly-correlated features (e.g. height in feet, height in meters).
This is confusing (how should the weight be distributed between them?), but worse:
the rows of the covariance matrix $A^TA$ become linearly dependent, since $r_1 = c \times r_2$, so computing $w=(A^TA)^{-1}A^Tb$ requires inverting a singular matrix.
Punting Highly Correlated Features
Perfectly correlated features provide no additional information for modeling.
Identify them by computing the covariance matrix: either one can go with little loss.
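One simple way to flag and drop redundant features (the threshold and function name are illustrative; the correlation matrix is used here as a normalized stand-in for the covariance matrix):

```python
import numpy as np

def drop_correlated_features(X, names, threshold=0.95):
    # Compute the feature-by-feature correlation matrix and drop one member
    # of every pair whose absolute correlation exceeds the threshold.
    corr = np.corrcoef(X, rowvar=False)
    keep = list(range(X.shape[1]))
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            if i in keep and j in keep and abs(corr[i, j]) > threshold:
                keep.remove(j)
    return X[:, keep], [names[i] for i in keep]
```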
This motivates the problem of dimension reduction: e.g. singular value decomposition, principal component analysis.