Anomaly detection

  • Problem Motivation
  • Review of Gaussian distribution
  • Density estimation
  • Evaluation
  • Multivariate Gaussian distribution

Example

  • Fraud detection:
    1.     $x^{(i)}$ = features of user $i$’s activities
    2.     Model $p(x)$ from data
    3.     Identify unusual users by checking which have $p(x)\lt\epsilon$
  • Manufacturing
  • Monitoring computers in a data center

Gaussian (Normal) distribution

$$X\sim N(\mu,\sigma^2)$$

$$P(x;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi}\sigma}\exp{\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]}$$

$$\mu=\frac{1}{m}\sum^m_{i=1}x^{(i)}, \sigma^2=\frac{1}{m}\sum_{i=1}^m\left(x^{(i)}-\mu\right)^2$$
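A minimal sketch of fitting these parameters and evaluating the density, assuming NumPy/SciPy; the data and variable names are illustrative:

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1-D data: m examples of a single feature
x = np.array([2.1, 1.9, 2.3, 2.0, 1.8, 2.2])

mu = x.mean()                       # mu = (1/m) * sum(x^(i))
sigma2 = ((x - mu) ** 2).mean()     # sigma^2 = (1/m) * sum((x^(i) - mu)^2)

# Density of a new value under N(mu, sigma^2)
p = norm.pdf(1.5, loc=mu, scale=np.sqrt(sigma2))
print(mu, sigma2, p)
```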

Algorithm

Density estimation

Training set: $\{x^{(1)}, …, x^{(m)}\}$

Each example is $x\in\mathbb{R}^n$

$$P(x)=\prod_{j=1}^nP(x_j;\mu_j,\sigma_j^2)$$

Anomaly detection algorithm

  1. Choose features $x_i$ that you think might be indicative of anomalous examples.
  2. Fit parameters $\mu_1, …, \mu_n, \sigma_1^2, …, \sigma_n^2$
    • $\mu_j=\frac{1}{m}\sum^m_{i=1}x_j^{(i)}$
    • $\sigma_j^2=\frac{1}{m}\sum_{i=1}^m\left(x_j^{(i)}-\mu_j\right)^2$ ($\frac{1}{m-1}$ is also OK, but in machine learning this formula is the one usually used)
  3. Given new example $x$, compute $p(x)$:
    • $P(x)=\prod_{j=1}^nP(x_j;\mu_j,\sigma_j^2)=\prod_{j=1}^n\frac{1}{\sqrt{2\pi}\sigma_j}\exp{\left[-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right]}$
    • Anomaly if $P(x)\lt\epsilon$
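A sketch of the whole algorithm in Python, assuming the training examples are the rows of an $m\times n$ NumPy array; `fit_gaussian_params`, `density`, and the data are illustrative names, not from the notes:

```python
import numpy as np

def fit_gaussian_params(X):
    """Per-feature estimates mu_j and sigma_j^2 for an m x n matrix X."""
    mu = X.mean(axis=0)
    sigma2 = ((X - mu) ** 2).mean(axis=0)   # 1/m variance, as in the notes
    return mu, sigma2

def density(X, mu, sigma2):
    """p(x) = prod_j N(x_j; mu_j, sigma_j^2), evaluated for each row of X."""
    coef = 1.0 / np.sqrt(2 * np.pi * sigma2)
    exponent = -((X - mu) ** 2) / (2 * sigma2)
    return np.prod(coef * np.exp(exponent), axis=1)

X_train = np.random.randn(1000, 2)           # illustrative training set
mu, sigma2 = fit_gaussian_params(X_train)

X_new = np.array([[0.1, -0.2], [8.0, 9.0]])  # second row is far from the data
p = density(X_new, mu, sigma2)
epsilon = 1e-3
print(p < epsilon)                           # True marks an anomaly
```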

Evaluation

Assume there is some labeled data of anomalous and non-anomalous examples ($y=0$ for normal, $y=1$ for anomalous).

Training set: $x^{(i)}, i=1,2,…,m_{train}$

Cross validation set: $(x_{cv}^{(i)}, y_{cv}^{(i)}), i=1,2,…,m_{cv}$

Test set: $(x_{test}^{(i)}, y_{test}^{(i)}), i=1,2,…,m_{test}$

Possible evaluation metrics:

  • True positive, false positive, false negative, true negative
  • Precision/Recall
  • $F_1$-score

Do not use classification accuracy, because the dataset is very skewed.

The cross-validation set can also be used to choose the parameter $\epsilon$.
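One way to do this, sketched below under the assumption that `p_cv` holds $P(x)$ for the cross-validation examples and `y_cv` their labels (both names are illustrative), is to scan candidate thresholds and keep the one with the best $F_1$-score:

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Pick the epsilon that maximizes F1 on the cross-validation set.

    p_cv : array of densities P(x) for the CV examples
    y_cv : array of labels (1 = anomalous, 0 = normal)
    """
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        preds = (p_cv < eps).astype(int)
        tp = np.sum((preds == 1) & (y_cv == 1))
        fp = np.sum((preds == 1) & (y_cv == 0))
        fn = np.sum((preds == 0) & (y_cv == 1))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```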

Anomaly detection vs. Supervised learning

  • Anomaly detection
    • Very small number of positive examples
    • Large number of negative examples
    • Many different “types” of anomalies. It is hard for any algorithm to learn from positive examples what anomalies look like
    • Future anomalies may look nothing like any of the anomalous examples we’ve seen so far
  • Supervised learning
    • Large number of positive and negative examples.
    • Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to the ones in the training set

Feature selection

Non-Gaussian features

  • Try $\log(x)$ or $\exp(x)$ or $x^a, a\in\mathbb{R}$
  • $\log(x+c)$ is also OK (see the sketch after this list)
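A small sketch of such transforms (the exponential data and the constant $c=1$ are just illustrative; in practice, inspect a histogram and keep whichever transform looks most bell-shaped):

```python
import numpy as np

x = np.random.exponential(scale=2.0, size=1000)  # skewed, non-Gaussian feature

x_log = np.log(x + 1.0)   # log(x + c), with c keeping the argument positive
x_pow = x ** 0.5          # x^a with 0 < a < 1 compresses the long right tail
```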

Error analysis for anomaly detection

Want:

  • $P(x)$ large for normal examples $x$
  • $P(x)$ small for anomalous examples $x$

Most common problem:

  • $P(x)$ is comparable (both large) for normal and anomalous examples.

Try to find a new feature that distinguishes anomalous examples. Choose features that might take on unusually large or small values in the event of an anomaly.

Multivariate Gaussian distribution

Model $P(x)$ all in one go, rather than feature by feature. ($x\in\mathbb{R}^n$, $\mu\in\mathbb{R}^n$, $\Sigma\in\mathbb{R}^{n\times n}$ is the covariance matrix)

$$P(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp{\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]}$$

$$\mu=\frac{1}{m}\sum_{i=1}^mx^{(i)}, \Sigma=\frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu)(x^{(i)}-\mu)^T$$

Algorithm

  1. Fit model $P(x)$ by setting $\mu, \Sigma$
  2. Given a new example $x$, compute $P(x)$, Flag an anomaly if $P(x)\lt\epsilon$
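A sketch of this variant, assuming SciPy's `multivariate_normal` and illustrative data:

```python
import numpy as np
from scipy.stats import multivariate_normal

X_train = np.random.randn(500, 2)                # illustrative m x n training set

# Maximum-likelihood (1/m) estimates of mu and Sigma
mu = X_train.mean(axis=0)
Sigma = (X_train - mu).T @ (X_train - mu) / X_train.shape[0]

# Density of new examples; flag an anomaly when P(x) < epsilon
X_new = np.array([[0.2, -0.1], [6.0, 6.0]])
p = multivariate_normal.pdf(X_new, mean=mu, cov=Sigma)
epsilon = 1e-3
print(p < epsilon)
```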

Original model vs. Multivariate Gaussian distribution

  • Original model: $P(x)=\prod_{j=1}^nP(x_j;\mu_j,\sigma_j^2)$
    • Manually create features to capture anomalies where $x_1, x_2$ take unusual combinations of values
    • Computationally cheaper (works well with large number of features $n$)
    • Ok even if $m$ (training set size) is small
  • Multivariate Gaussian distribution: $P(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp{\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]}$ (the original model is a special case of the multivariate Gaussian model when $\Sigma$ is diagonal; see the derivation after this list)
    • Automatically captures correlations between features
    • Computationally more expensive (need to compute $\Sigma$)
    • Must have $m\gt n$ (mathematically; in practice, use it only when $m\geq 10n$), or else $\Sigma$ is non-invertible
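To see why the original model is the diagonal-$\Sigma$ special case, note that with $\Sigma=\mathrm{diag}(\sigma_1^2,\dots,\sigma_n^2)$ the multivariate density factorizes:

$$|\Sigma|=\prod_{j=1}^n\sigma_j^2,\qquad (x-\mu)^T\Sigma^{-1}(x-\mu)=\sum_{j=1}^n\frac{(x_j-\mu_j)^2}{\sigma_j^2}$$

$$P(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}\prod_{j=1}^n\sigma_j}\exp\left[-\sum_{j=1}^n\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right]=\prod_{j=1}^n\frac{1}{\sqrt{2\pi}\sigma_j}\exp\left[-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right]$$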