Summary
- Linear regression has a beautiful theoretical foundation, yet in practice this algebraic formulation is generally discarded in favor of faster, more heuristic optimization.
- Linear regression models are, by definition, linear. This provides an opportunity to witness the limitations of such models, as well as develop clever techniques to generalize to other forms.
- Linear regression simultaneously encourages model building with hundreds of variables, and regularization techniques to ensure that most of them will get ignored.
Linear Regression
Given a collection of n points, find the line which best approximates or fits the points. There are many reasons why we might want to do this.
One class of goals involves simplification and compression:
we can replace a large set of noisy data points in the xy-plane by a tidy line that describes them.
This regression line is useful for visualization, by showing the underlying trend in the data and highlighting the location and magnitude of outliers.
Linear Regression and Duality
When solving linear systems, we seek the single point that lies on n given lines. In regression, we are instead given n points, and we seek the line that lies on ‘all’ points. There are two differences here:
- the interchange of points for lines
- finding the best fit under constraints versus a totally constrained problem
By the duality transformation, lines are equivalent to points in another space.
Error in Linear Regression
The residual error is the difference between the predicted and actual values.
Least squares regression minimizes the sum of the squares of the residuals of all points.
This metric is chosen because
- squaring the residuals ignores the signs of the errors, so positive and negative residuals do not offset each other
- it leads to a surprisingly nice closed form for finding the coefficients of the best-fitting line
Finding the Optimal Fit
Linear regression seeks the line $y = f(x)$ which minimizes the sum of the squared errors over all the training points, i.e. the coefficient vector $w$ that minimizes
$$
\sum_{i=1}^n(y_i - f(x_i))^2, \text{ where } f(x) = w_0 + \sum_{j=1}^{m-1}w_jx_j
$$
Suppose we are trying to fit a set of n points, each of which is m-dimensional. The first m-1 dimensions of each point form the feature vector $(x_1, \dots, x_{m-1})$, with the last value $y=x_m$ serving as the target or dependent variable.
We can encode these n feature vectors as an $n\times (m-1)$ matrix. We can make it an $n\times m$ matrix $A$ by prepending a column of ones to the matrix. This column can be thought of as a ‘constant’ feature, one that when multiplied by the appropriate coefficient becomes the y-intercept of the fitted line. Further, the n target values can be nicely represented in an $n\times 1$ vector $b$.
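As a concrete illustration, here is a minimal sketch (using NumPy, with made-up data) of how $A$ and $b$ might be assembled from a set of feature vectors and targets:

```python
import numpy as np

# Hypothetical data: n = 5 points, each with m - 1 = 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 1.5],
              [3.0, 3.5],
              [4.0, 2.5],
              [5.0, 4.0]])
y = np.array([3.1, 4.2, 7.8, 8.1, 10.9])     # target values

# Prepend a column of ones: its coefficient w_0 becomes the y-intercept.
A = np.column_stack([np.ones(len(X)), X])    # shape (n, m) = (5, 3)
b = y.reshape(-1, 1)                         # shape (n, 1)
```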
How can we find the coefficients of the best fitting line? The vector $w$ is given by:
$$
w=(A^TA)^{-1}A^Tb
$$
The dimensions of the term on the right are
$$
((m\times n)(n\times m))^{-1}(m\times n)(n\times 1)\rightarrow (m\times 1)
$$
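Continuing the sketch above (with the hypothetical $A$ and $b$), the closed form can be evaluated directly, though in practice a least-squares solver is preferred over forming the explicit inverse:

```python
# Normal-equations solution, continuing the A and b defined above.
w = np.linalg.inv(A.T @ A) @ A.T @ b            # shape (m, 1)

# Numerically safer equivalents that avoid the explicit inverse:
w_solve = np.linalg.solve(A.T @ A, A.T @ b)
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```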
Consider the case of a single variable $x$, where we seek the best-fitting line of the form $y=w_0+w_1x$. The slope of this line is given by
$$
w_1=\frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^n(x_i-\bar x)^2}=r_{xy}\frac{\sigma_y}{\sigma_x}, \text{ with } w_0 = \bar y - w_1\bar x
$$
The intercept follows because the best-fitting line passes through the point of means $(\bar x, \bar y)$.
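A sketch of this one-variable case, with made-up $x$ and $y$ arrays, confirming that the two forms of the slope agree:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar

# Equivalent form: w1 = r_xy * (sigma_y / sigma_x)
r_xy = np.corrcoef(x, y)[0, 1]
w1_alt = r_xy * y.std() / x.std()   # agrees with w1 up to floating-point error
```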
Connections with Correlation
The connection with the correlation coefficient ($r_{xy}$) here is clear.
- If $x$ were uncorrelated with $y$, $w_1$ should be zero.
- If $x, y$ are perfectly correlated, we must scale $x$ to bring it into the right size range of $y$. This is the role of $\frac{\sigma_y}{\sigma_x}$
Where Does This Come From?
It should be clear that, for the best-fitting line, we cannot change any of the coefficients $w$ and hope to make a better fit. This means that the error vector $(b-Aw)$ must be orthogonal to the vector for each variable, or else there would be a way to change the coefficients to fit better.
Orthogonal vectors have dot products of zero. Since each column of $A$ (each row of $A^T$) has a zero dot product with the error vector, $A^T(b-Aw)=\bar 0$, where $\bar 0$ is a vector of all zeros. Straightforward algebra then yields
$$
w = (A^TA)^{-1}A^Tb
$$
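A quick numerical check of the orthogonality argument, continuing the earlier sketch (hypothetical $A$, $b$, and fitted $w$):

```python
# The residual of the least-squares fit should be orthogonal to every column of A.
residual = b - A @ w
print(A.T @ residual)   # each entry is ~0, up to floating-point error
```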
Better Regression Models
There are several steps one can take which can lead to better regression models. Some of them involve manipulating the input data to increase the likelihood of an accurate model, but others involve more conceptual questions about what our model should look like.
Treating the variables properly can yield better models, as follows:
- Removing outliers
- Fitting non-linear functions
- Feature/target scaling
- Collapsing highly correlated variables
Removing Outliers
Because of the quadratic weight of residuals, outlying points can greatly affect the fit. Identifying outlying points and removing them in a principled way can yield a more robust fit.
It is important to convince yourself that these points really represent errors before deleting them. Otherwise, you will be left with an impressively linear fit that works well only on the examples you didn’t delete.
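One possible (hypothetical) way to do this in a principled fashion: fit once, flag points whose residuals exceed a chosen multiple of the residual standard deviation, inspect them, and refit without them. The 3-standard-deviation threshold below is a common rule of thumb, not part of the method itself.

```python
import numpy as np

def refit_without_outliers(A, b, threshold=3.0):
    """Fit, drop points with residuals beyond `threshold` std deviations, refit."""
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    residuals = (b - A @ w).ravel()
    keep = np.abs(residuals) < threshold * residuals.std()
    w_clean, *_ = np.linalg.lstsq(A[keep], b[keep], rcond=None)
    return w_clean, keep
```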
Fitting Non-Linear Functions
Linear relationships are easier to understand than non-linear ones, and are broadly appropriate as a default assumption in the absence of better data. Many phenomena are linear in nature, with the dependent variable growing roughly proportionally with the input variables:
- Income grows roughly linearly with the amount of time worked.
- The price of a home grows roughly linearly with the size of the living area it contains.
- People’s weight increases roughly linearly with the amount of food eaten.
Linear regression does great when it tries to fit data that in fact has an underlying linear relationship. But, generally speaking, no interesting function is perfectly linear.
Linear regression fits lines, not higher-order curves. But we can fit quadratics by adding another variable with the value $x^2$ to our data matrix. We can fit arbitrarily complex functions by adding the right higher-order variables to our data matrix. However, explicit inclusion of all possible non-linear terms quickly becomes intractable.
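For instance, a quadratic $y = w_0 + w_1 x + w_2 x^2$ can be fit with ordinary linear regression by adding an $x^2$ column; a sketch with made-up, roughly parabolic data:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([4.2, 1.1, 0.1, 0.9, 4.1, 8.8])

# Augment the data matrix with an x**2 column; the model stays linear in w.
A_quad = np.column_stack([np.ones_like(x), x, x ** 2])
w_quad, *_ = np.linalg.lstsq(A_quad, y, rcond=None)
print(w_quad)   # roughly [0, 0, 1] for this near-parabolic data
```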
One must be judicious about which non-linear terms to consider for a role in the model. Indeed, one of the advantages of more powerful learning methods, like support vector machines, will be that they can incorporate non-linear terms without explicit enumeration.
Feature and Target Scaling
In principle, linear regression can find the best linear model fitting any data set. But we should do whatever we can to help it find the right model. This generally involves preprocessing the data to optimize for expressibility, interpretability, and numerical stability.
Example
Suppose we wanted to build a model to predict the gross domestic product (GDP) of countries in dollars, as a function of their population size $x_1$ and literacy rate $x_2$. Both factors seem like reasonable components of such a model. The problem is that these two features operate on entirely different scales: national populations vary from tens of thousands to over a billion people, while the fraction of people who can read is, by definition, between zero and one. One might imagine the resulting fitted model as looking somewhat like this:
$$
\text{GDP} = \$10{,}000\,x_1+\$10{,}000{,}000{,}000{,}000\,x_2
$$
This is very bad, for several reasons:
- Unreadable coefficients
- Numerical imprecision
- Inappropriate formulations
Feature Scaling: Z-Scores
Let $\mu$ be the mean value of the given feature, and $\sigma$ the standard deviation. Then the Z-score of $x$ is $Z(x)=\frac{x-\mu}{\sigma}$. Using Z-scores in regression addresses the question of interpretability. Since all features will have similar means and variances, the magnitude of the coefficients will determine the relative importance of these factors towards the forecast.
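A sketch of Z-scoring the feature columns of a data matrix, under the earlier layout where column 0 of $A$ is the all-ones constant column (which is left alone):

```python
import numpy as np

def z_score_features(A):
    """Z-score every column of A except the leading column of ones."""
    features = A[:, 1:]
    Z = (features - features.mean(axis=0)) / features.std(axis=0)
    return np.column_stack([np.ones(len(A)), Z])
```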
Sublinear Feature Scaling
Consider a linear model for predicting the number of years of education $y$ that a child will receive as a function of household income. Education levels can vary between 0 and $12+4+5=21$ years, since we consider up to the possible completion of a Ph.D. A family's income level $x$ can vary between 0 and Bill Gates. But observe that no model of the form
$$
y = w_1x+w_0
$$
can possibly give sensible answers for ordinary kids and Bill Gates’ kids. Z-scores of such power law variables don’t help because they are just a linear transformation.
Features are often power-law distributed: there are many small poor countries compared to very few large rich ones. No linear combination of normally distributed variables can effectively realize a power-law distributed target. The solution here is to predict the logarithm of a power-law target $y$ instead of $y$ itself.
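A sketch of this idea, reusing the GDP example from above with made-up numbers: fit the model to $\log y$ rather than $y$, then exponentiate predictions to return to the dollar scale.

```python
import numpy as np

# Made-up data for the GDP example: population, literacy rate, GDP in dollars.
population = np.array([5e5, 3e6, 2e7, 9e7, 1.3e9])
literacy   = np.array([0.55, 0.70, 0.85, 0.92, 0.96])
gdp        = np.array([2e9, 3e10, 4e11, 2e12, 1.4e13])

# Predict log(GDP) instead of GDP itself; population is also log-transformed
# here since it is power-law distributed as well (an assumption, not a rule).
A_gdp = np.column_stack([np.ones_like(population), np.log(population), literacy])
w_gdp, *_ = np.linalg.lstsq(A_gdp, np.log(gdp), rcond=None)
gdp_pred = np.exp(A_gdp @ w_gdp)   # predictions back on the dollar scale
```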
Dealing with Highly-Correlated Features
It is great to have features that are highly correlated with the target: these enable us to build highly-predictive models. However, having multiple features which are highly correlated with each other can be asking for trouble.
Perfectly correlated features provide no additional information for modeling. They can be identified by computing the covariance (or, after scaling, the correlation) matrix of the features (see the sketch below); of any perfectly correlated pair, either one can be dropped with little loss. This motivates the problem of dimension reduction, addressed by techniques such as singular value decomposition and principal component analysis.
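A sketch of spotting highly correlated feature pairs via the correlation matrix (hypothetical feature matrix `X`; a pair with $|r|$ near 1 is a candidate for dropping one member):

```python
import numpy as np

def correlated_pairs(X, cutoff=0.95):
    """Return index pairs of feature columns whose |correlation| exceeds cutoff."""
    corr = np.corrcoef(X, rowvar=False)       # feature-by-feature correlations
    m = corr.shape[0]
    return [(i, j) for i in range(m) for j in range(i + 1, m)
            if abs(corr[i, j]) > cutoff]

# Example: the third column is (almost) a copy of the first.
X = np.random.default_rng(0).normal(size=(100, 3))
X[:, 2] = 2.0 * X[:, 0] + 0.01 * X[:, 2]
print(correlated_pairs(X))    # -> [(0, 2)]
```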
Reference
Lecture 16
Chapter 9-1, Chapter 9-2