# Anomaly detection

• Problem Motivation
• Review of Gaussian distribution
• Density estimation
• Evaluation
• Multivariate Gaussian distribution

# Examples

• Fraud detection:
1.     $x^{(i)}$ = features of user $i$’s activities
2.     Model $p(x)$ from data
3.     Identify unusual users by checking which have $p(x)\lt\epsilon$
• Manufacturing
• Monitoring computers in a data center

# Gaussian (Normal) distribution

$$X\sim N(\mu,\sigma^2)$$

$$P(x;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi}\sigma}\exp{\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]}$$

$$\mu=\frac{1}{m}\sum^m_{i=1}x^{(i)}, \sigma^2=\frac{1}{m}\sum_{i=1}^m\left(x^{(i)}-\mu\right)^2$$
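The density and the parameter estimates above can be sketched in a few lines of numpy (a minimal illustration; the function names are my own):

```python
import numpy as np

def fit_gaussian(x):
    """Maximum-likelihood estimates of mu and sigma^2 for a 1-D sample."""
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()  # 1/m normalization, as in the notes
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
```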

# Algorithm

## Density estimation

Training set: $\{x^{(1)}, …, x^{(m)}\}$

Each example is $x\in\mathbb{R}^n$

$$P(x)=\prod_{j=1}^nP(x_j;\mu_j,\sigma_j^2)$$

## Anomaly detection algorithm

1. Choose features $x_i$ that you think might be indicative of anomalous examples.
2. Fit parameters $\mu_1, …, \mu_n, \sigma_1^2, …, \sigma_n^2$
• $\mu_j=\frac{1}{m}\sum^m_{i=1}x_j^{(i)}$
• $\sigma_j^2=\frac{1}{m}\sum_{i=1}^m\left(x_j^{(i)}-\mu_j\right)^2$ (normalizing by $\frac{1}{m-1}$ is also fine, but in machine learning the $\frac{1}{m}$ formula is the usual convention)
3. Given new example $x$, compute $p(x)$:
• $P(x)=\prod_{j=1}^nP(x_j;\mu_j,\sigma_j^2)=\prod_{j=1}^n\frac{1}{\sqrt{2\pi}\sigma_j}\exp{\left[-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right]}$
• Anomaly if $P(x)\lt\epsilon$
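The three steps above can be sketched with numpy as follows (a minimal sketch; the function names are my own, and $\epsilon$ is assumed to be chosen separately, e.g. on a cross-validation set):

```python
import numpy as np

def fit_params(X):
    """Step 2: per-feature mu_j and sigma_j^2 from an (m, n) training matrix."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # numpy's default var uses 1/m normalization
    return mu, sigma2

def p(X, mu, sigma2):
    """Step 3: product over features of the per-feature Gaussian densities."""
    densities = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)

def is_anomaly(X, mu, sigma2, epsilon):
    """Flag rows with p(x) < epsilon."""
    return p(X, mu, sigma2) < epsilon
```

In practice it is numerically safer to sum log-densities instead of multiplying densities, since the product underflows quickly as $n$ grows.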

# Evaluation

Assume we have some labeled data containing both anomalous and non-anomalous examples ($y=0$ for normal, $y=1$ for anomalous).

Training set: $x^{(i)}, i=1,2,…,m_{train}$

Cross validation set: $(x^{(i)}_{cv}, y^{(i)}_{cv}), i=1,2,…,m_{cv}$

Test set: $(x^{(i)}_{test}, y^{(i)}_{test}), i=1,2,…,m_{test}$

Possible evaluation metrics:

• True positive, false positive, false negative, true negative
• Precision/Recall
• $F_1$-score

Do not use classification accuracy, because the dataset is very skewed: always predicting $y=0$ would already score highly.

Can also use the cross validation set to choose the parameter $\epsilon$.
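One common way to pick $\epsilon$ is to scan candidate thresholds over the cross-validation densities and keep the one with the best $F_1$-score. A minimal sketch (the helper names and the 1000-point grid are my own choices, not from the notes):

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 from true/false positives and false negatives (y=1 is anomalous)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv):
    """Scan thresholds between min and max p(x) on the CV set; keep the best F1."""
    best_eps, best_f1 = 0.0, -1.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        f1 = f1_score(y_cv, (p_cv < eps).astype(int))
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```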

# Anomaly detection vs. Supervised learning

• Anomaly detection
• Very small number of positive examples
• Large number of negative examples
• Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like
• Future anomalies may look nothing like any of the anomalous examples we’ve seen so far
• Supervised learning
• Large number of positive and negative examples.
• Enough positive examples for algorithm to get a sense of what positive examples are like, future positive examples likely to be similar to ones in training set

# Feature selection

## Non-Gaussian features

• try $\log(x)$ or $\exp(x)$ or $x^a, a\in\mathbb{R}$
• $\log(x+c)$ is also ok
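For example, a right-skewed feature often looks much more Gaussian after a log transform. A small illustration (the synthetic exponential data and the skewness helper are my own, for demonstration only):

```python
import numpy as np

def skewness(v):
    """Sample skewness; 0 for a perfectly symmetric distribution."""
    return np.mean(((v - v.mean()) / v.std()) ** 3)

# A right-skewed feature, e.g. exponentially distributed.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)

# log(x + c) with c = 1 keeps the argument positive and reduces the skew.
x_log = np.log(x + 1.0)
```

Plotting a histogram of the feature before and after the transform is the usual way to check the effect by eye.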

## Error analysis for anomaly detection

Want:

• $P(x)$ large for normal examples $x$
• $P(x)$ small for anomalous examples $x$

Most common problem:

• $P(x)$ is comparable (both large) for normal and anomalous examples.

Try to find a new feature that distinguishes the anomalous examples. Choose features that might take on unusually large or small values in the event of an anomaly.

# Multivariate Gaussian distribution

Model $P(x)$ all in one go. ($x\in\mathbb{R}^n, \mu\in\mathbb{R}^n, \Sigma\in\mathbb{R}^{n\times n} (\mathrm{cov})$)

$$P(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp{\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]}$$

$$\mu=\frac{1}{m}\sum_{i=1}^mx^{(i)}, \Sigma=\frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu)(x^{(i)}-\mu)^T$$

## Algorithm

1. Fit model $P(x)$ by setting $\mu, \Sigma$
2. Given a new example $x$, compute $P(x)$; flag an anomaly if $P(x)\lt\epsilon$
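The two steps above can be sketched with numpy's linear algebra routines (a minimal sketch; the function names are my own, and $\Sigma$ is assumed invertible, i.e. $m>n$ and no redundant features):

```python
import numpy as np

def fit_multivariate(X):
    """Estimate mu and the full covariance matrix Sigma from an (m, n) matrix."""
    mu = X.mean(axis=0)
    D = X - mu
    Sigma = D.T @ D / X.shape[0]  # 1/m normalization, matching the notes
    return mu, Sigma

def p_multivariate(X, mu, Sigma):
    """Multivariate Gaussian density for each row of X."""
    n = mu.size
    D = X - mu
    inv = np.linalg.inv(Sigma)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed per row.
    exponent = -0.5 * np.einsum('ij,jk,ik->i', D, inv, D)
    return np.exp(exponent) / norm
```

With a diagonal $\Sigma$ this reduces exactly to the product of per-feature univariate densities used by the original model.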

## Original model vs. Multivariate Gaussian distribution

• Original model: $P(x)=\prod_{j=1}^nP(x_j;\mu_j,\sigma_j^2)$
• Must manually create features to capture anomalies where $x_1, x_2$ take unusual combinations of values
• Computationally cheaper (works well with large number of features $n$)
• Ok even if $m$ (training set size) is small
• Multivariate Gaussian distribution: $P(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp{\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]}$ (the original model is the special case of the multivariate Gaussian in which $\Sigma$ is diagonal, i.e. all off-diagonal entries are zero)
• Automatically captures correlations between features
• Computationally more expensive (need to compute $\Sigma$)
• Must have $m\gt n$ (mathematically; in practice use it only when $m\geq 10n$), or else $\Sigma$ is non-invertible