The World of Many Weak Features
Often we have many relatively weak features to apply to a classification problem.
In text classification problems, we often have the frequency of each word in documents of positive and negative classes: e.g. the frequency of "sale" in spam versus real email.
Bayesian Classifiers
To classify a vector $X = (x_1, \ldots, x_n)$ into one of $m$ classes, we can use Bayes' theorem:
$$
p(C_i|X)=\frac{p(C_i)\,p(X|C_i)}{p(X)}
$$
This reduces the problem of estimating the probability of the class given the input to estimating the probability of the input given the class.
Identifying the Most Probable Class
The predicted class is the argmax, the class with the highest posterior probability:
$$
C(X)=\arg\max_{i=1}^m\frac{p(C_i)\,p(X|C_i)}{p(X)}=\arg\max_{i=1}^m p(C_i)\,p(X|C_i)
$$
$p(C_i)$ is the prior probability of class $i$
$p(X)$ is the probability of seeing input $X$ over all classes. This is dicey to estimate, but it can be ignored for classification because it is constant across classes.
Independence and Naive Bayes
What is $p(X|C)$, where $X$ is a complex feature vector?
If $a$ and $b$ are independent, then $p(ab) = p(a)p(b)$. This calculation is much simpler than modeling the correlations and interactions of multiple factors, but:
What’s the probability of having two size 9 feet?
Complete Naive Bayes Formulation
Putting the pieces together, the decision rule becomes:
$$
C(X) = \arg\max_{i=1}^m p(C_i)\,p(X|C_i)=\arg\max_{i=1}^m p(C_i)\prod_{j=1}^n p(x_j|C_i)
$$
Multiplying many small probabilities risks numerical underflow, so we take logarithms:
$$
C(X)=\arg\max_{i=1}^m\left(\log(p(C_i)) + \sum_{j=1}^n\log(p(x_j|C_i))\right)
$$
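As a minimal sketch of this log-space decision rule (my own illustration, assuming smoothed, nonzero word probabilities are already available; naive_bayes_classify is a hypothetical helper, not code from the text):

```python
import math

def naive_bayes_classify(words, priors, word_probs):
    """Pick the class maximizing log p(C_i) + sum_j log p(x_j | C_i).

    priors[c]        -- prior probability of class c
    word_probs[c][w] -- smoothed probability of word w given class c (nonzero)
    words            -- the observed features, here the tokens of a document
    """
    best_class, best_score = None, float("-inf")
    for c in priors:
        # Summing logs avoids underflow from multiplying many small probabilities.
        score = math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in words)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage with made-up probabilities:
priors = {"spam": 0.4, "real": 0.6}
word_probs = {"spam": {"sale": 0.05, "meeting": 0.001},
              "real": {"sale": 0.005, "meeting": 0.02}}
print(naive_bayes_classify(["sale", "sale"], priors, word_probs))  # -> spam
```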
Dealing with Zero Counts
You may never have seen it before, but what is the probability that my next word is defenestrate?
Observed counts do not accurately capture the frequency of rare events, for which there is typically a long tail.
Laplace asked: What is the probability the sun will rise tomorrow?
+1 Discounting
Discounting is a statistical technique to adjust counts for as-yet-unseen events. The simplest technique is add-one (Laplace) discounting, where we add one to the frequency of every outcome, including the unseen ones. Thus after seeing 5 reds and 3 greens, with a third color possible but so far unseen,
$$
p(\text{new color}) = \frac{0+1}{(5+1)+(3+1)+(0+1)} = \frac{1}{11}
$$
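A small illustrative sketch of add-one discounting over a fixed set of possible outcomes (the color example above; add_one_probability is a hypothetical helper of mine):

```python
from collections import Counter

def add_one_probability(outcome, observations, num_possible_outcomes):
    """Estimate p(outcome) with add-one (Laplace) discounting.

    Every possible outcome receives an extra pseudo-count of one, so events
    never seen in the observations still get nonzero probability.
    """
    counts = Counter(observations)
    total = len(observations) + num_possible_outcomes
    return (counts[outcome] + 1) / total

# After 5 reds and 3 greens, with a third color possible but unseen:
draws = ["red"] * 5 + ["green"] * 3
print(add_one_probability("blue", draws, num_possible_outcomes=3))  # 1/11
```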
Feature Engineering
Domain-dependent data cleaning is important:
- Z-scores and normalization (see the sketch after this list)
- Creating bell-shaped distributions
- Imputing missing values
- Dimension reduction, like SVD
- Explicit incorporation of non-linear combinations like products and ratios
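As a rough sketch of the z-score and imputation items above (my own toy heights column; impute_missing_with_mean and zscore_normalize are illustrative helpers, not from the text):

```python
import numpy as np

def impute_missing_with_mean(column):
    """Replace NaN entries with the mean of the observed values."""
    column = np.asarray(column, dtype=float)
    return np.where(np.isnan(column), np.nanmean(column), column)

def zscore_normalize(column):
    """Map a feature column to zero mean and unit variance (z-scores)."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std()

heights = [160.0, 175.0, np.nan, 182.0]          # hypothetical raw feature
print(zscore_normalize(impute_missing_with_mean(heights)))
```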
Deep Learning
The hottest area of machine learning today involves large, deep neural network architectures.
Basic Principles of Deep Learning
- Because the weight of each edge is a distinct parameter, large networks can exploit large training sets.
- The depth of the networks means they can build up hierarchical representations of features: e.g. pixels, edges, regions, objects
- Toolkits like TensorFlow make it easy to build DL models if you have the data
Node Computations
Each node in the network typically computes a nonlinear function $\phi(v)$ of a weighted input sum:
$$
v_j = \beta + \sum_iw_ix_i
$$
The $\beta$ term is the bias, the activation in the absence of input.
Many dot products implies matrix multiplication!
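A brief sketch of one layer's node computations done as a single matrix-vector product (illustrative code of mine; the layer sizes and ReLU choice are arbitrary):

```python
import numpy as np

def relu(v):
    """One common choice for the nonlinearity phi(v)."""
    return np.maximum(0.0, v)

def layer_forward(x, W, beta):
    """Compute phi(beta + W x) for every node in a layer at once.

    x    -- input vector, shape (n_in,)
    W    -- weight matrix, shape (n_out, n_in); row j holds node j's weights w_i
    beta -- bias vector, shape (n_out,)
    """
    v = W @ x + beta          # all the weighted sums as one matrix-vector product
    return relu(v)

x = np.array([1.0, -2.0, 0.5])
W = np.random.randn(4, 3)     # a layer of 4 nodes over 3 inputs
beta = np.zeros(4)
print(layer_forward(x, W, beta))
```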
Non-Linearity
The logit and ReLU functions make good candidates for $\phi$. Linear functions cannot exploit depth, because composing linear layers yields another linear function, so the hidden layers add no power.
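Concretely, stacking two purely linear layers collapses into a single linear map, so nothing is gained from the extra layer:
$$
W_2(W_1x + \beta_1) + \beta_2 = (W_2W_1)x + (W_2\beta_1 + \beta_2)
$$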
Backpropagation
NNs are trained by a stochastic gradient descent-like algorithm, with the error gradient for each training example propagated back through the lower layers. The non-linear activation functions make the optimization problem non-convex, but in practice this generally produces good results.
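Here is a minimal sketch of such a training loop for a one-hidden-layer ReLU network with squared-error loss (my own toy target function, layer sizes, and hyperparameters; an illustration, not production backpropagation code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sketch: a one-hidden-layer ReLU network trained with
# stochastic gradient descent on a made-up target, y = x1 * x2.
W1, b1 = 0.5 * rng.normal(size=(8, 2)), np.zeros(8)
W2, b2 = 0.5 * rng.normal(size=(1, 8)), np.zeros(1)
lr = 0.05                                  # learning rate (arbitrary choice)

def relu(v):
    return np.maximum(0.0, v)

for step in range(2000):
    x = rng.uniform(-1, 1, size=2)         # one training example per update
    y = x[0] * x[1]

    # Forward pass.
    v1 = W1 @ x + b1
    h = relu(v1)
    y_hat = W2 @ h + b2

    # Backward pass: push the error gradient from the output layer back
    # through the hidden layer (the chain rule in action).
    d_out = y_hat - y                      # gradient of (y_hat - y)^2 / 2
    dW2 = np.outer(d_out, h)
    db2 = d_out
    d_h = W2.T @ d_out
    d_v1 = d_h * (v1 > 0)                  # ReLU passes gradient only where v1 > 0
    dW1 = np.outer(d_v1, x)
    db1 = d_v1

    # Gradient descent update.
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print("prediction for (0.5, 0.5):",
      W2 @ relu(W1 @ np.array([0.5, 0.5]) + b1) + b2)
```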
Word Embeddings
One NN application I have found particularly useful is word2vec, which constructs 100-dimensional word representations from text corpora.
The goal is to try to predict missing words by context: We would ____ to improve
Thus large volumes of training data can be constructed from text without supervision.
Graph Embeddings (DeepWalk)
Networks based on similarity or links define very sparse feature vectors.
Random walks on networks (sequences of vertices) look like sentences (sequences of words).
Thus we can use word2vec to train network representations.
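A rough sketch of the DeepWalk recipe, assuming the networkx and gensim libraries are available (the graph, walk counts, and embedding size are my own choices, and the Word2Vec parameter names follow recent gensim versions):

```python
import random

import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph, walks_per_node=10, walk_length=20):
    """Generate random walks, each written as a 'sentence' of vertex names."""
    walks = []
    for _ in range(walks_per_node):
        for start in graph.nodes():
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(v) for v in walk])
    return walks

G = nx.karate_club_graph()                        # a small standard example graph
model = Word2Vec(random_walks(G), vector_size=64,
                 window=5, min_count=1, sg=1)     # skip-gram, as DeepWalk uses
print(model.wv["0"][:5])                          # embedding of vertex 0
```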
Support Vector Machines
SVMs are an important way to build non-linear classifiers.
They work by seeking maximum margin linear separators between the two classes.
Optimization Problem
Minimize the coefficient size $\|\mathbf{w}\|$ subject to the constraints $y_i(\mathbf{w} \cdot \mathbf{x_i} - b)\geq 1$ for all $i = 1, \ldots, n$.
Only a few points (the support vectors) touch the boundary of the separating channel. Although the planes $\mathbf{w}\cdot\mathbf{x} - b = \pm 1$ differ by 2 in value, the geometric width of the channel is $2/\|\mathbf{w}\|$, so separators with large coefficients give narrow channels; hence we minimize $\|\mathbf{w}\|$ to maximize the margin.
SVMs vs. Logistic Regression
Both methods find separating planes, but different ones.
LR weighs all the points, but the SVM cares only about the points at the boundary.
Projecting to Higher Dimensions
Adding enough dimensions makes everything linearly separable.
For points separated by a circle, the projection $(x, y)\rightarrow (x, y, x^2+y^2)$ does the job.
Efficient solvers like LibSVM are available for this.
Projecting to Higher Dimensions
The non-linearity depends upon how the space is projected to higher dimensions. The distances from a target point to all $n$ input points create an $n$-dimensional feature vector.
Kernel functions give us the power to use such features efficiently, without explicitly building the $n\times n$ matrix.
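As a closing sketch, here is a non-linear SVM fit with scikit-learn's SVC (which wraps LibSVM) using an RBF kernel; the circular data set is my own toy illustration of the $(x, y)\rightarrow (x, y, x^2+y^2)$ idea:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Hypothetical toy data: points inside the unit circle are class 1, points
# outside are class 0, so the classes are not linearly separable in the plane.
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)

clf = SVC(kernel="rbf")   # the RBF kernel implicitly projects to higher dimensions
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
print("number of support vectors:", len(clf.support_vectors_))
```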