The World of Many Weak Features
Often we have many relatively weak features to apply to a classification problem.
In text classification problems, we often have the frequency of each word in documents of positive and negative classes: e.g. the frequency of "sale" in spam versus real email.
Bayesian Classifiers
To classify a vector $X = (x_1, \ldots, x_n)$ into one of $m$ classes, we can use Bayes' theorem:
$$
p(C_i|X)=\frac{p(C_i)\,p(X|C_i)}{p(X)}
$$
This reduces the problem of estimating the probability of the class given the input to estimating the probability of the input given the class.
Identifying the Most Probable Class
The predicted class is the argmax, the class with the highest posterior probability:
$$
C(X)=\arg\max_{i=1}^m\frac{p(C_i)\,p(X|C_i)}{p(X)}=\arg\max_{i=1}^m p(C_i)\,p(X|C_i)
$$
$p(C_i)$ is the prior probability of class $i$
$p(X)$ is the probability of seeing input $X$ over all classes. This is dicey to estimate, but it can be ignored for classification because it is constant across classes.
Independence and Naive Bayes
What is $p(X|C)$, where $X$ is a complex feature vector?
If $a$ and $b$ are independent, then $p(ab) = p(a)p(b)$. This calculation is much simpler than modeling the correlations and interactions of multiple factors, but:
What’s the probability of having two size 9 feet?
Complete Naive Bayes Formulation
Putting the pieces together, the decision rule becomes:
$$
C(X) = \arg\max_{i=1}^m p(C_i)\,p(X|C_i)=\arg\max_{i=1}^m p(C_i)\prod_{j=1}^n p(x_j|C_i)
$$
Multiplying many small probabilities risks numerical underflow, so we take logarithms:
$$
C(X)=\arg\max_{i=1}^m\left(\log(p(C_i)) + \sum_{j=1}^n\log(p(x_j|C_i))\right)
$$
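As a minimal sketch of this log-space decision rule (my own illustration, assuming smoothed, nonzero word probabilities are already available; naive_bayes_classify is a hypothetical helper, not code from the text):

```python
import math

def naive_bayes_classify(words, priors, word_probs):
    """Pick the class maximizing log p(C_i) + sum_j log p(x_j | C_i).

    priors[c]        -- prior probability of class c
    word_probs[c][w] -- smoothed probability of word w given class c (nonzero)
    words            -- the observed features, here the tokens of a document
    """
    best_class, best_score = None, float("-inf")
    for c in priors:
        # Summing logs avoids underflow from multiplying many small probabilities.
        score = math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in words)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage with made-up probabilities:
priors = {"spam": 0.4, "real": 0.6}
word_probs = {"spam": {"sale": 0.05, "meeting": 0.001},
              "real": {"sale": 0.005, "meeting": 0.02}}
print(naive_bayes_classify(["sale", "sale"], priors, word_probs))  # -> spam
```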
Dealing with Zero Counts
You may never have seen it before, but what is the probability that my next word is defenestrate?
Observed counts do not accurately capture the frequency of rare events, for which there is typically a long tail.
Laplace asked: What is the probability the sun will rise tomorrow?
+1 Discounting
Discounting is a statistical technique to adjust counts for as-yet-unseen events. The simplest technique is add-one (Laplace) discounting, where we add one to the frequency of every outcome, including the unseen ones. Thus after seeing 5 reds and 3 greens, with a third color possible but so far unseen,
$$
p(\text{new color}) = \frac{0+1}{(5+1)+(3+1)+(0+1)} = \frac{1}{11}
$$
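A small illustrative sketch of add-one discounting over a fixed set of possible outcomes (the color example above; add_one_probability is a hypothetical helper of mine):

```python
from collections import Counter

def add_one_probability(outcome, observations, num_possible_outcomes):
    """Estimate p(outcome) with add-one (Laplace) discounting.

    Every possible outcome receives an extra pseudo-count of one, so events
    never seen in the observations still get nonzero probability.
    """
    counts = Counter(observations)
    total = len(observations) + num_possible_outcomes
    return (counts[outcome] + 1) / total

# After 5 reds and 3 greens, with a third color possible but unseen:
draws = ["red"] * 5 + ["green"] * 3
print(add_one_probability("blue", draws, num_possible_outcomes=3))  # 1/11
```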
Feature Engineering
Domain-dependent data cleaning is important:
- Z-scores and normalization (see the sketch after this list)
- Creating bell-shaped distributions
- Imputing missing values
- Dimension reduction, like SVD
- Explicit incorporation of non-linear combinations like products and ratios
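As a rough sketch of the z-score and imputation items above (my own toy heights column; impute_missing_with_mean and zscore_normalize are illustrative helpers, not from the text):

```python
import numpy as np

def impute_missing_with_mean(column):
    """Replace NaN entries with the mean of the observed values."""
    column = np.asarray(column, dtype=float)
    return np.where(np.isnan(column), np.nanmean(column), column)

def zscore_normalize(column):
    """Map a feature column to zero mean and unit variance (z-scores)."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std()

heights = [160.0, 175.0, np.nan, 182.0]          # hypothetical raw feature
print(zscore_normalize(impute_missing_with_mean(heights)))
```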
Deep Learning
The hottest area of machine learning today involves large, deep neural network architectures.
Basic Principles of Deep Learning
- Because the weight of each edge is a distinct parameter, large networks can exploit large training sets.
- The depth of the networks means they can build up hierarchical representations of features: e.g. pixels, edges, regions, objects
- Toolkits like TensorFlow make it easy to build DL models if you have the data
Node Computations
Each node in the network typically computes a nonlinear function $\phi(v)$ of a weighted input sum:
$$
v_j = \beta + \sum_iw_ix_i
$$
The $\beta$ term is the bias, the activation in the absence of input.
Many dot products implies matrix multiplication!
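A brief sketch of one layer's node computations done as a single matrix-vector product (illustrative code of mine; the layer sizes and ReLU choice are arbitrary):

```python
import numpy as np

def relu(v):
    """One common choice for the nonlinearity phi(v)."""
    return np.maximum(0.0, v)

def layer_forward(x, W, beta):
    """Compute phi(beta + W x) for every node in a layer at once.

    x    -- input vector, shape (n_in,)
    W    -- weight matrix, shape (n_out, n_in); row j holds node j's weights w_i
    beta -- bias vector, shape (n_out,)
    """
    v = W @ x + beta          # all the weighted sums as one matrix-vector product
    return relu(v)

x = np.array([1.0, -2.0, 0.5])
W = np.random.randn(4, 3)     # a layer of 4 nodes over 3 inputs
beta = np.zeros(4)
print(layer_forward(x, W, beta))
```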
Non-Linearity
The logit and ReLU functions make good candidates for $\phi$. Linear functions cannot exploit depth, because composing linear layers yields another linear function, so the hidden layers add no power.
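Concretely, stacking two purely linear layers collapses into a single linear map, so nothing is gained from the extra layer:
$$
W_2(W_1x + \beta_1) + \beta_2 = (W_2W_1)x + (W_2\beta_1 + \beta_2)
$$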
Backpropagation
NNs are trained by a stochastic gradient descent-like algorithm, with the error gradient for each training example propagated back through the lower layers. The non-linear activation functions make the optimization problem non-convex, but in practice this generally produces good results.
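Here is a minimal sketch of such a training loop for a one-hidden-layer ReLU network with squared-error loss (my own toy target function, layer sizes, and hyperparameters; an illustration, not production backpropagation code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sketch: a one-hidden-layer ReLU network trained with
# stochastic gradient descent on a made-up target, y = x1 * x2.
W1, b1 = 0.5 * rng.normal(size=(8, 2)), np.zeros(8)
W2, b2 = 0.5 * rng.normal(size=(1, 8)), np.zeros(1)
lr = 0.05                                  # learning rate (arbitrary choice)

def relu(v):
    return np.maximum(0.0, v)

for step in range(2000):
    x = rng.uniform(-1, 1, size=2)         # one training example per update
    y = x[0] * x[1]

    # Forward pass.
    v1 = W1 @ x + b1
    h = relu(v1)
    y_hat = W2 @ h + b2

    # Backward pass: push the error gradient from the output layer back
    # through the hidden layer (the chain rule in action).
    d_out = y_hat - y                      # gradient of (y_hat - y)^2 / 2
    dW2 = np.outer(d_out, h)
    db2 = d_out
    d_h = W2.T @ d_out
    d_v1 = d_h * (v1 > 0)                  # ReLU passes gradient only where v1 > 0
    dW1 = np.outer(d_v1, x)
    db1 = d_v1

    # Gradient descent update.
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print("prediction for (0.5, 0.5):",
      W2 @ relu(W1 @ np.array([0.5, 0.5]) + b1) + b2)
```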
Word Embeddings
One NN application I have found particularly useful is word2vec, which constructs 100-dimensional word representations from text corpora.
The goal is to try to predict missing words by context: We would ____ to improve
Thus large volumes of training data can be constructed from text without supervision.
Graph Embeddings (DeepWalk)
Networks based on similarity or links define very sparse feature vectors.
Random walks on networks (sequences of vertices) look like sentences (sequences of words).
Thus we can use word2vec to train network representations.
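A rough sketch of the DeepWalk recipe, assuming the networkx and gensim libraries are available (the graph, walk counts, and embedding size are my own choices, and the Word2Vec parameter names follow recent gensim versions):

```python
import random

import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph, walks_per_node=10, walk_length=20):
    """Generate random walks, each written as a 'sentence' of vertex names."""
    walks = []
    for _ in range(walks_per_node):
        for start in graph.nodes():
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(v) for v in walk])
    return walks

G = nx.karate_club_graph()                        # a small standard example graph
model = Word2Vec(random_walks(G), vector_size=64,
                 window=5, min_count=1, sg=1)     # skip-gram, as DeepWalk uses
print(model.wv["0"][:5])                          # embedding of vertex 0
```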
Support Vector Machines
SVMs are an important way to build non-linear classifiers.
They work by seeking maximum margin linear separators between the two classes.
Optimization Problem
Minimize the coefficient size $\|\mathbf{w}\|$ subject to the constraints $y_i(\mathbf{w} \cdot \mathbf{x_i} - b)\geq 1$ for all $i = 1, \ldots, n$.
Only a few points (the support vectors) touch the boundary of the separating channel. Although the planes $\mathbf{w}\cdot\mathbf{x} - b = \pm 1$ differ by 2 in value, the geometric width of the channel is $2/\|\mathbf{w}\|$, so separators with large coefficients give narrow channels; hence we minimize $\|\mathbf{w}\|$ to maximize the margin.
SVMs vs. Logistic Regression
Both methods find separating planes, but different ones.
LR weighs all the points, but the SVM cares only about the points at the boundary.
Projecting to Higher Dimensions
Adding enough dimensions makes everything linearly separable.
For points separated by a circle, the projection $(x, y)\rightarrow (x, y, x^2+y^2)$ does the job.
Efficient solvers like LibSVM are available for this.
Projecting to Higher Dimensions
The non-linearity depends upon how the space is projected to higher dimensions. The distances from a target point to all $n$ input points create an $n$-dimensional feature vector.
Kernel functions give us the power to use such features efficiently, without explicitly building the $n\times n$ matrix.
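As a closing sketch, here is a non-linear SVM fit with scikit-learn's SVC (which wraps LibSVM) using an RBF kernel; the circular data set is my own toy illustration of the $(x, y)\rightarrow (x, y, x^2+y^2)$ idea:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Hypothetical toy data: points inside the unit circle are class 1, points
# outside are class 0, so the classes are not linearly separable in the plane.
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)

clf = SVC(kernel="rbf")   # the RBF kernel implicitly projects to higher dimensions
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
print("number of support vectors:", len(clf.support_vectors_))
```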