Logistic Regression and Classification

Classification Problems

Often we are given collections of examples labeled by class:

  • male/female
  • democrat/republican
  • spam/non-spam (ham)
  • cancer/benign

Classification assigns a label to an input record.

Regression for Classification

We could use linear regression to build a classifier from annotated examples by converting the class names to numbers:

  • male=0/female=1
  • democrat=0/republican=1
  • spam=1/non-spam=0
  • cancer=1/benign=0

Zero/one works for binary classifiers. By convention the ‘positive’ class gets 1 and the ‘negative’ one 0.

Class Labels from Regression Lines

The regression line will cut through these classes, even when a clean separator exists.

Adding extreme positive/negative examples shifts the regression line, hurting the decision boundary.

Decision Boundaries

Ideally, our two classes will be well-separated in feature space, so a line can partition them.

Logistic regression is a method to find the best separating line for a given training set.

Non-Linear Decision Boundaries

Logistic regression can find non-linear boundaries if seeded with non-linear features.

To get separating circles, we need to explicitly add features like $x_i^2$ and $x_1 \cdot x_2$.
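A minimal sketch of this idea (assuming scikit-learn and NumPy are available, and using a synthetic inside-the-circle data set as an illustration): the degree-2 feature expansion supplies the squared and interaction terms, so an ordinary logistic regression can learn a circular boundary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                     # raw features (x1, x2)
y = (X[:, 0]**2 + X[:, 1]**2 < 1.0).astype(int)   # positive = inside the unit circle

# Degree-2 expansion adds x1^2, x2^2, and x1*x2, so a linear boundary in the
# expanded feature space becomes a circle/ellipse in the original space.
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression(max_iter=1000))
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```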

The Logit Function

We want a way to take a real variable and convert it to a probability:
$$
f(x)=\frac{1}{1+e^{-cx}}
$$

  • $f(0)=\frac{1}{2}$
  • $f(\infty)=1$
  • $f(-\infty)=0$

Logit can give the probability of x being in a particular class.
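A tiny sketch (assuming NumPy) checking these properties numerically; the helper name logit_function and the steepness parameter c just mirror the formula above.

```python
import numpy as np

def logit_function(x, c=1.0):
    """f(x) = 1 / (1 + exp(-c*x)); c controls the steepness."""
    return 1.0 / (1.0 + np.exp(-c * x))

print(logit_function(0.0))     # 0.5
print(logit_function(50.0))    # very close to 1
print(logit_function(-50.0))   # very close to 0
```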

Logit for Classification

To extend logit to m variables, and set the threshold and steepness parameters, fit
$$
h(x, w)=w_0 +\sum_{i=1}^{m-1}w_i\cdot x_i
$$
Plug into logit to get a classifier:
$$
f(x) = \frac{1}{1+e^{-h(x,w)}}
$$
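A minimal sketch of this classifier (assuming NumPy; the weights and threshold below are hypothetical):

```python
import numpy as np

def h(x, w):
    """Linear score h(x, w) = w0 + sum_i w_i * x_i (w[0] is the intercept w0)."""
    return w[0] + np.dot(w[1:], x)

def classify(x, w, threshold=0.5):
    """Push the linear score through the logit to get P(class 1), then threshold."""
    p = 1.0 / (1.0 + np.exp(-h(x, w)))
    return p, int(p >= threshold)

w = np.array([-1.0, 2.0, 0.5])            # hypothetical weights for two features
print(classify(np.array([1.0, 0.0]), w))  # h = 1, so p = 1/(1+e^-1) ~ 0.73 -> class 1
```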

Scoring for Logistic Regression

Assume each of our n training examples is labeled as belonging to either class 0 or 1.

We seek to identify the coefficients w that give high probability on positive (class 1) items, and low probability on negative (class 0) items.

We need a cost function to value a probability p for an item of class c.

Costs for Positive/Negative Cases

We want zero error for the best probability, and increasing cost as the class prediction gets more and more wrong.

Logarithms of Probabilities

Observe that log(1) = 0. Thus it gives proper cost for correct predictions of positive items.

Observe that $\log(0) \rightarrow -\infty$, so confidently wrong predictions receive arbitrarily large cost. For positive instances, the cost function is:
$$
cost(x_i, 1) = -\log(f(x_i))
$$
For negative instances, the cost function is:
$$
cost(x_i, 0)=-\log(1-f(x_i))
$$

Cost/Loss Function

We can use the zero/one labels as indicator variables to compute the loss function:
$$
\begin{align}
J(w) &= \frac{1}{n}\sum_{i=1}^n cost(f(x_i, w), y_i)\\
&= -\frac{1}{n}\left[\sum_{i=1}^n y_i \log f(x_i, w) + (1-y_i)\log(1-f(x_i, w))\right]
\end{align}
$$
This works because the class indicator variables are 0 and 1.
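A small sketch of this loss in code (assuming NumPy; the clipping constant eps is an implementation detail to avoid log(0), not part of the formula):

```python
import numpy as np

def logistic_loss(y_true, p_pred, eps=1e-12):
    """J(w): average cost of predicted probabilities p_pred = f(x_i, w)
    against 0/1 labels y_true, using the formula above."""
    p = np.clip(p_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.1, 0.8, 0.3])
print(logistic_loss(y, p))   # small loss, since the predictions are mostly right
```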

Logistic Regression via Gradient Descent

The loss function here is convex, so we can find the parameters which best fit it by gradient descent.

Thus we can find the best linear separator between two classes (linear in the possibly non-linear features).
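A minimal batch gradient descent sketch for this loss (assuming NumPy; the learning rate, iteration count, and toy data are illustrative choices, not recommendations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=5000):
    """Batch gradient descent on the convex loss J(w).
    X carries an explicit column of ones so w[0] acts as the intercept."""
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(iters):
        p = sigmoid(X @ w)           # predicted probabilities f(x_i, w)
        grad = X.T @ (p - y) / n     # gradient of J(w) with respect to w
        w -= lr * grad
    return w

# Tiny illustrative data set: the positives have larger feature values.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])  # first column = 1s
y = np.array([0.0, 0.0, 1.0, 1.0])
print(fit_logistic(X, y))
```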

Issues in Logistic Classification

  • Balanced training class sizes
  • Multi-class classification
  • Hierarchical classification
  • Partition functions

Balanced Training Classes

Consider the optimal separating line for grossly unbalanced class sizes, say 1 positive example vs. 1,000,000 negative examples.

The best-scoring line from logistic regression will sit far from the midpoint between the classes, because misclassifying the lone positive example costs almost nothing relative to getting the million negatives right.

Use equal numbers of positive and negative examples.

Ways to Balance Classes

  • Work harder to find members of the minority class.
  • Discard elements from the bigger class.
  • Weigh the minority class more heavily, but beware of overfitting.
  • Replicate members of the smaller class, ideally with random perturbation (see the sketch below).
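A minimal sketch of the replication-with-perturbation strategy (assuming NumPy; the noise scale is an illustrative choice):

```python
import numpy as np

def oversample_with_noise(X_minority, target_size, scale=0.01, seed=0):
    """Sample minority-class rows with replacement and add small Gaussian
    perturbations so the copies are not exact duplicates."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_minority), size=target_size)
    noise = rng.normal(scale=scale, size=(target_size, X_minority.shape[1]))
    return X_minority[idx] + noise

X_pos = np.array([[1.0, 2.0], [1.2, 1.8]])                  # tiny minority class
print(oversample_with_noise(X_pos, target_size=6).shape)    # (6, 2)
```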

Multi-class Classification

Classification tasks are not always binary.

Is a given movie a comedy, drama, action, documentary, etc.?

Encoding Multi-classes: Bad Idea

It is natural to represent multiple classes by distinct numbers: blond=0, brown=1, red=2.

But unless the ordering of the classes reflects an increasing scale, the numbering is meaningless.

Classes on a Likert scale are ordinal:

4 stars - 3 stars - 2 stars - 1 star

One Versus All Classifiers

We can build multi-class classifiers by building multiple independent binary classifiers.

Select the class of highest probability as the predicted label.
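A minimal one-versus-all sketch (assuming scikit-learn and NumPy; the toy data is hypothetical): fit one binary logistic regression per class against everything else, then take the class whose classifier reports the highest probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_one_vs_all(X, y):
    """One binary logistic regression per class: 'this class' vs. everything else."""
    return {c: LogisticRegression().fit(X, (y == c).astype(int)) for c in np.unique(y)}

def predict_one_vs_all(models, X):
    """Column k holds P(class k | x); predict the column with the highest probability."""
    probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models.values()])
    labels = np.array(list(models.keys()))
    return labels[np.argmax(probs, axis=1)]

# Three well-separated clusters, one per class.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.0, 3.0], [0.1, 3.2], [3.0, 0.0], [3.1, 0.2]])
y = np.array([0, 0, 1, 1, 2, 2])
models = fit_one_vs_all(X, y)
print(predict_one_vs_all(models, X))   # predicted labels for the training points
```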

The Challenges of Many Classes

Multi-class classification gets much harder as the number of classes increases.

A monkey guessing at random is right only 1/k of the time for k classes.

  • With many available classes, certain mistakes are more acceptable than others.
  • The actual population sizes of rare classes can easily become vanishingly small.

Hierarchical Classification

Grouping classes by similarity and building a taxonomy reduces the effective number of classes.

ImageNet has 27 high-level categories (e.g. animal, appliance, food, furniture) on top of 21,841 subcategories.

Classify top-down through the tree.

Bayesian Priors

Having honest estimates for the probability distribution of classes is essential to avoid domination by rare classes (‘rock stars’):
$$
P(A|B)=\frac{P(B|A)P(A)}{P(B)}
$$
Here A is the given class, and B is the event your classifier returned a guess of class A.
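A worked example with purely illustrative numbers (the prior, detection rate, and false-positive rate below are assumptions, not data): even when the classifier flags someone, a rare class remains unlikely once the prior is applied.

```python
# Purely illustrative numbers: a rare 'rock star' class with prior 1/10,000,
# a classifier that flags true rock stars 95% of the time, and a 1% false
# positive rate on everyone else.
p_A = 1.0 / 10_000           # P(A): prior probability of the rare class
p_B_given_A = 0.95           # P(B|A): classifier fires on a true rock star
p_B_given_notA = 0.01        # P(B|not A): classifier fires on an ordinary person

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # total probability P(B)
p_A_given_B = p_B_given_A * p_A / p_B                  # Bayes' theorem
print(p_A_given_B)   # ~0.0094: a flagged person is still almost surely not a rock star
```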

Partition Functions

There is no reason why independent binary classifiers should yield probabilities that sum to 1.

To get proper probabilities for Bayes' theorem, normalize the score of each class by the total over all $k$ classes:
$$
T=\sum_{i=1}^{k}F_i(x)
$$
and
$$
P(c)=\frac{F_c(x)}{T}
$$
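A minimal sketch of this normalization (assuming NumPy; the per-class scores are hypothetical):

```python
import numpy as np

def normalize_scores(F):
    """Divide each class score F_i(x) by the partition function T = sum_i F_i(x)."""
    T = np.sum(F)
    return F / T

F = np.array([0.9, 0.6, 0.3])        # hypothetical per-class classifier outputs
print(normalize_scores(F))           # [0.5, 0.333..., 0.167...]
print(normalize_scores(F).sum())     # 1.0
```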