Support Vector Machines

Sometimes also called a large margin classifier.

Logistic regression:

$$\min_\theta\frac{1}{m}\left[\sum_{i=1}^my^{(i)}\left(-\log h_\theta(x^{(i)})\right)+(1-y^{(i)})\left(-\log(1-h_\theta(x^{(i)}))\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$$

Support vector machine:

$$\min_\theta C\sum_{i=1}^m\left[y^{(i)}\text{cost}_1(\theta^Tx^{(i)})+(1-y^{(i)})\text{cost}_0(\theta^Tx^{(i)})\right]+\frac{1}{2}\sum_{j=1}^n\theta_j^2$$

SVM hypothesis

$$h_\theta(x)=\begin{cases}
1 & \text{ if } \theta^Tx\geq0 \\
0 & \text{ otherwise }
\end{cases}$$

Figure: SVM cost functions (left: $\text{cost}_1(z)$; right: $\text{cost}_0(z)$).
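The notes only sketch these two curves; a common concrete choice (an assumption here, not stated in the notes) is the hinge form

$$\text{cost}_1(z)=\max(0,\,1-z),\qquad\text{cost}_0(z)=\max(0,\,1+z)$$

so $\text{cost}_1(z)=0$ once $z\geq1$ and $\text{cost}_0(z)=0$ once $z\leq-1$, which is what produces the margin conditions used below.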

SVM Decision Boundary

The SVM tries to separate the positive and negative examples with as big a margin as possible.

When $C$ is not too large, the SVM can ignore a few outliers and still behave reasonably even when the data is not linearly separable.

  • $\min_\theta\frac{1}{2}\sum_{j=1}^n\theta_j^2=\min_\theta\frac{1}{2}(\theta_1^2+\theta_2^2+\dots)=\min_\theta\frac{1}{2}\lVert\theta\rVert^2$
  • $\begin{cases}\theta^Tx^{(i)}\geq1&\text{ if } y^{(i)}=1\\\theta^Tx^{(i)}\leq-1&\text{ if } y^{(i)}=0\end{cases}\Rightarrow\begin{cases}p^{(i)}\cdot\lVert\theta\rVert\geq1&\text{ if } y^{(i)}=1\\p^{(i)}\cdot\lVert\theta\rVert\leq-1&\text{ if } y^{(i)}=0\end{cases}$ (where $p^{(i)}$ is the projection of $x^{(i)}$ onto the vector $\theta$)

From $p^{(i)}\cdot\lVert\theta\rVert\geq1$: when $p^{(i)}$ is large, $\lVert\theta\rVert$ can be small, so the objective $\min_\theta\frac{1}{2}\sum_{j=1}^n\theta_j^2$ can be made smaller.
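A quick numerical illustration (not in the original notes): if a positive example has projection $p^{(i)}=0.1$ onto $\theta$, then $p^{(i)}\cdot\lVert\theta\rVert\geq1$ forces $\lVert\theta\rVert\geq10$; if the decision boundary is oriented so that $p^{(i)}=2$, then $\lVert\theta\rVert\geq0.5$ suffices. Minimizing $\frac{1}{2}\lVert\theta\rVert^2$ therefore pushes the SVM toward boundaries that give large projections $p^{(i)}$, i.e. a large margin.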

Kernels

Given $x$, compute new features depending on proximity to landmarks $l$.

$$f=\text{similarity}(x,l)=\exp\left(-\frac{\lVert x-l\rVert^2}{2\sigma^2}\right)\ \text{(Gaussian kernel)}$$

If $x\approx l$:
$$f\approx\exp\left(-\frac{0^2}{2\sigma^2}\right)\approx1$$

If $x$ is far from $l$:
$$f=\exp\left(-\frac{(\text{large number})^2}{2\sigma^2}\right)\approx0$$
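A minimal sketch of this similarity computation in Python (NumPy only; the function and variable names are illustrative, not from the notes):

```python
import numpy as np

def gaussian_similarity(x, l, sigma):
    """Gaussian (RBF) kernel: exp(-||x - l||^2 / (2 * sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(l, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

landmark = np.array([3.0, 5.0])
print(gaussian_similarity([3.0, 5.0], landmark, sigma=1.0))  # x close to l    -> ~1.0
print(gaussian_similarity([9.0, 0.0], landmark, sigma=1.0))  # x far from l    -> ~0.0
```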

Predict $y=1$ if $\theta^Tf\geq0$ (i.e. $\theta_0+\theta_1f_1+\theta_2f_2+\dots\geq0$)

SVM with kernels

Given $(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),…,(x^{(m)},y^{(m)})$, choose $l^{(1)}=x^{(1)},l^{(2)}=x^{(2)},…,l^{(m)}=x^{(m)}$.

Given example $x$:

    $f_1=\text{similarity}(x,l^{(1)})$

    $f_2=\text{similarity}(x,l^{(2)})$

    …

Hypothesis: Given $x$, compute features $f\in\mathbb{R}^{m+1}$; parameters $\theta\in\mathbb{R}^{m+1}$

    Predict $y=1$ if $\theta^Tf\geq0$

Training:

$$\min_\theta C\sum_{i=1}^m\left[y^{(i)}\text{cost}_1(\theta^Tf^{(i)})+(1-y^{(i)})\text{cost}_0(\theta^Tf^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n(=m)}\theta_j^2$$
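In practice this optimization is handed to a library rather than coded by hand (see "Using An SVM" below). A minimal sketch with scikit-learn's `SVC`, assuming that library; its RBF kernel is parameterized by `gamma`, which corresponds to $\frac{1}{2\sigma^2}$:

```python
import numpy as np
from sklearn.svm import SVC

# Toy training set: X has shape (m, n), y holds 0/1 labels.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])

sigma = 0.5
clf = SVC(C=1.0, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))
clf.fit(X, y)

print(clf.predict([[0.1, 0.0], [0.9, 1.0]]))  # expected: [0 1]
```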

Parameters:

$C$ ($=\frac{1}{\lambda}$):

Large $C$: Lower bias, high variance.

Small $C$: Higher bias, low variance.

$\sigma^2$:

Large $\sigma^2$: Features $f_i$ vary more smoothly. Higher bias, lower variance.

Small $\sigma^2$: Features $f_i$ vary less smoothly. Lower bias, higher variance.
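Because $C$ and $\sigma^2$ trade bias against variance, they are typically chosen by cross-validation. A minimal sketch (scikit-learn assumed; `gamma` again stands in for $\frac{1}{2\sigma^2}$, and the grids are illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data; replace with the actual training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],   # large C: lower bias, higher variance
    "gamma": [0.01, 0.1, 1, 10],    # small gamma (large sigma^2): higher bias, lower variance
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```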

Using An SVM

Use an SVM software package (e.g. liblinear, libsvm, …) to solve for the parameters $\theta$.

Need to specify:

    Choice of parameter $C$.

    Choice of kernel (similarity function):

        No kernel (“linear kernel”): $n$ large, $m$ small.

            Predict $y=1$ if $\theta^Tx\geq0$

        Gaussian kernel: $n$ small, $m$ large.
$$f_i=\exp\left(-\frac{\lVert x-l^{(i)}\rVert^2}{2\sigma^2}\right),\text{ where }l^{(i)}=x^{(i)}$$

            Need to choose $\sigma^2$

        More kernel functions.

Many off-the-shelf kernels are available.

Not all similarity functions make valid kernels. (They need to satisfy a technical condition called "Mercer's Theorem" so that SVM packages' optimizations run correctly and do not diverge.)

  • Polynomial kernel: $k(x,l)=(x^Tl+\text{constant})^{\text{degree}}$ (usually used when $x$ and $l$ are all strictly non-negative); see the sketch after this list.
  • More esoteric: String kernel, chi-square kernel, histogram intersection kernel, …
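A minimal sketch of the polynomial kernel, computed directly and via scikit-learn's built-in `poly` kernel (the constant and degree values are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def poly_kernel(x, l, constant=1.0, degree=2):
    """Polynomial kernel: (x^T l + constant)^degree."""
    return (np.dot(x, l) + constant) ** degree

print(poly_kernel(np.array([1.0, 2.0]), np.array([3.0, 0.5])))  # (4 + 1)^2 = 25.0

# Roughly the same choice when using a library (coef0 is the additive constant):
clf = SVC(kernel="poly", degree=2, coef0=1.0)
```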

Multi-class classification

Many SVM packages already have built-in multi-class classification functionality.

Otherwise, use the one-vs.-all method: train $K$ SVMs, one to distinguish $y=i$ from the rest, for $i=1,2,\dots,K$; get $\theta^{(1)},\theta^{(2)},\dots,\theta^{(K)}$; pick the class $i$ with the largest $(\theta^{(i)})^Tx$.
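A hedged sketch of the one-vs.-all scheme built on binary SVMs (scikit-learn's `LinearSVC` is used here only as the binary learner; `decision_function(x)` plays the role of $(\theta^{(i)})^Tx$):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 3-class data: three clusters of 20 points each.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2)) + np.repeat([[0, 0], [4, 0], [0, 4]], 20, axis=0)
y = np.repeat([0, 1, 2], 20)

# Train K binary SVMs: class i vs. the rest.
classifiers = [LinearSVC().fit(X, (y == i).astype(int)) for i in range(3)]

# Pick the class whose classifier gives the largest decision value.
scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
pred = np.argmax(scores, axis=1)
print((pred == y).mean())
```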

Logistic regression vs. SVMs

$n=$ number of features ($x\in\mathbb{R}^{n+1}$), $m=$ number of training examples

if $n$ is large (relative to $m$):

Use logistic regression, or SVM without a kernel ("linear kernel")

if $n$ is small, $m$ is intermediate:

Use SVM with Gaussian kernel

if $n$ is small, $m$ is large:

Create/add more features, then use logistic regression or SVM without a kernel (Gaussian kernel will be slow)

A neural network is likely to work well for most of these settings, but may be slower to train.