Machine learning diagnostic

Evaluating a hypothesis

  • training set ($m$ examples) & test set ($m_{test}$ examples)
  • training & cross-validation (CV) & test sets

Train/validation/test error

  • Training error:
    $$J_{train}(\theta)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$$
  • Cross Validation error:
    $$J_{cv}(\theta)=\frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}(h_\theta(x_{cv}^{(i)})-y_{cv}^{(i)})^2$$
  • Test error:
    $$J_{test}(\theta)=\frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}(h_\theta(x_{test}^{(i)})-y_{test}^{(i)})^2$$
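
A minimal sketch of these three quantities in code, assuming a linear hypothesis $h_\theta(x)=\theta^Tx$ and NumPy arrays for each split (variable and function names are illustrative, not from the notes):

```python
import numpy as np

def squared_error_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2 for h_theta(x) = theta^T x."""
    m = len(y)
    residuals = X @ theta - y              # prediction error on every example
    return residuals @ residuals / (2 * m)

# The same formula evaluated on different splits gives the three errors:
# J_train = squared_error_cost(theta, X_train, y_train)
# J_cv    = squared_error_cost(theta, X_cv,    y_cv)
# J_test  = squared_error_cost(theta, X_test,  y_test)
```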

Bias vs. variance (underfitting vs. overfitting)

  • High bias (underfit):
    high training error ($J_{train}$), high validation error ($J_{cv}$); $J_{train}\approx J_{cv}$
  • High variance (overfit):
    low training error ($J_{train}$), high validation error ($J_{cv}$); $J_{train}\ll J_{cv}$ (a rough diagnostic sketch follows below)
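
One hedged way to turn the rule of thumb above into a quick check; the `desired_error` and `gap_factor` knobs are arbitrary illustrative thresholds, not values from the notes:

```python
def diagnose(j_train, j_cv, desired_error, gap_factor=2.0):
    """Rough bias/variance read-out from training and cross-validation error."""
    if j_train > desired_error and j_cv > desired_error:
        return "high bias (underfit): J_train and J_cv both high, roughly equal"
    if j_train <= desired_error and j_cv > gap_factor * max(j_train, 1e-12):
        return "high variance (overfit): J_train low, J_cv much larger"
    return "no clear high-bias or high-variance signature"
```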

Regularization and bias/variance

  • use regularization to address overfitting
  • High bias (underfit):
    $\lambda\rightarrow\infty$, $\theta_j\rightarrow0$ (for $j\ge1$), so $h_\theta(x)$ flattens toward a horizontal line.
  • High variance (overfit):
    $\lambda=0$ (or very small), i.e. effectively no regularization.

How to choose the regularization parameter $\lambda$

Model:

$$h_\theta(x)=\theta_0+\theta_1x+\theta_2x^2+\theta_3x^3+\theta_4x^4$$

$$J(\theta)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$$

Try $\lambda\in\{0,\ 0.01,\ 0.02,\ 0.04,\ 0.08,\ \ldots,\ 10\}$ (start at $0.01$ and roughly double each step, up to about $10$); train $\theta$ for each candidate and pick the $\lambda$ with the lowest $J_{cv}(\theta)$, as sketched below.
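
A sketch of that selection loop, using scikit-learn's `Ridge` as a stand-in for regularized linear regression (an assumption; the notes do not prescribe a library, and `Ridge`'s `alpha` is scaled differently from $\frac{\lambda}{2m}$). Note that $J_{cv}$ is evaluated without the regularization term:

```python
from sklearn.linear_model import Ridge

def split_error(model, X, y):
    """Unregularized squared-error cost on one data split."""
    r = model.predict(X) - y
    return r @ r / (2 * len(y))

def pick_lambda(X_train, y_train, X_cv, y_cv):
    """Fit one model per candidate lambda; keep the one with the lowest J_cv."""
    lambdas = [0.0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.0]
    best = None
    for lam in lambdas:
        model = Ridge(alpha=lam).fit(X_train, y_train)  # minimize the regularized cost
        err = split_error(model, X_cv, y_cv)            # J_cv: no regularization term here
        if best is None or err < best[1]:
            best = (lam, err, model)
    return best  # (lambda, J_cv, fitted model)
```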

Learning Curves

  • Plot error against training set size $m$, by training on subsets of the first $m$ examples for several values of $m$ (see the sketch after this list)
  • $J_{train}(\theta)$ will tend to increase as $m$ increases
  • $J_{cv}(\theta)$ will tend to decrease as $m$ increases
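
A sketch of plotting such a curve, again using `Ridge` as an illustrative regularized model and training on the first $m$ examples for each size:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

def split_error(model, X, y):
    """Unregularized squared-error cost on one data split."""
    r = model.predict(X) - y
    return r @ r / (2 * len(y))

def plot_learning_curve(X_train, y_train, X_cv, y_cv, lam=1.0):
    sizes, train_errs, cv_errs = [], [], []
    for m in range(2, len(y_train) + 1):
        model = Ridge(alpha=lam).fit(X_train[:m], y_train[:m])           # train on the first m examples
        train_errs.append(split_error(model, X_train[:m], y_train[:m]))  # J_train on those m examples
        cv_errs.append(split_error(model, X_cv, y_cv))                   # J_cv on the full CV set
        sizes.append(m)
    plt.plot(sizes, train_errs, label="J_train")
    plt.plot(sizes, cv_errs, label="J_cv")
    plt.xlabel("training set size m")
    plt.ylabel("error")
    plt.legend()
    plt.show()
```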

High bias

$J_{train}(\theta)$ rises with $m$ and then plateaus; $J_{cv}(\theta)$ falls with $m$ and then plateaus, and the two curves converge:
$$J_{train}(\theta)\approx J_{cv}(\theta)\gg 0$$

More training data will not help much.

High variance

$J_{train}(\theta)$ rises slowly with $m$ but stays small ($\approx 0$); $J_{cv}(\theta)$ falls slowly with $m$ but stays large ($\gg 0$).
And $J_{train}(\theta)\ll J_{cv}(\theta)$, with the gap closing only as the training set grows: $\lim_{m\rightarrow\infty}\left(J_{cv}(\theta)-J_{train}(\theta)\right)=0$.

More training data is likely to help.

What to try next (revisited)

  • More training examples: fixes high variance
  • Try smaller sets of features: fixes high variance
  • Try getting additional features: fixes high bias
  • Try adding polynomial features ($x_1^2,x_2^2,x_1x_2,\text{etc}$): fixes high bias
  • Try decreasing $\lambda$: fixes high bias
  • Try increasing $\lambda$: fixes high variance

Machine learning system design

Spam classifier

  • Collect lots of data (“honeypot” project)
  • Develop sophisticated features based on email routing information (from email header)
  • Develop sophisticated features for message body (“discount” & “discounts”; “deal” & “Dealer”; punctuation)
  • Develop sophisticated algorithm to detect misspellings (m0rtgage, med1cine, w4tches)

Error Analysis

  • Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
  • Plot learning curves to decide if more data, more features, etc. are likely to help.
  • Error analysis: Manually examine the examples (in cross validation set) that your algorithm made errors on. See if you spot any systematic trend in what types of examples it is making errors on.

Important: have a single numerical evaluation metric (e.g. cross-validation error) so that each change can be judged quickly.

Error Metrics for skewed classes

  • skewed classes: many more examples from one class than from the other class.

Precision/Recall

              Actual 1          Actual 0
Predicted 1   True positive     False positive
Predicted 0   False negative    True negative
  • Precision = True positives / (True positives + False positives)
  • Recall = True positives / (True positives + False negatives)
  • Higher is better for both (a sketch computing them from counts follows below)
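
A minimal sketch of both metrics computed from the confusion-matrix counts, assuming 0/1 NumPy label arrays (names are illustrative):

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (rare) class, with labels in {0, 1}."""
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall
```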

Trading off precision and recall

How to compare precision/recall numbers?

  • Average: $\frac{P+R}{2}$ (not very good, e.g. P=0.01, R=1.0 still averages to about 0.5)
  • $F_1$ Score: $2\frac{PR}{P+R}$
  • Try different prediction thresholds and keep the one with the highest $F_1$ on the cross-validation set (see the sketch after this list).
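
A sketch of that threshold sweep, assuming the classifier outputs probabilities `p_cv` for the cross-validation set (e.g. $h_\theta(x)$ from logistic regression) and using scikit-learn's `f1_score` for convenience:

```python
import numpy as np
from sklearn.metrics import f1_score

def pick_threshold(p_cv, y_cv):
    """Sweep thresholds on the CV set; keep the one with the best F1 = 2PR/(P+R)."""
    best_t, best_f1 = None, -1.0
    for t in np.arange(0.05, 1.0, 0.05):
        y_pred = (p_cv >= t).astype(int)              # predict 1 when h_theta(x) >= threshold
        f1 = f1_score(y_cv, y_pred, zero_division=0)  # defined as 0 when P + R = 0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```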

Data for machine learning

It’s not who has the best algorithm that wins. It’s who has the most data.

  • Large data rationale: assume the features $x$ contain enough information to predict $y$ (sanity check: could a human expert predict $y$ from $x$?) and use a learning algorithm with many parameters (low bias); then a very large training set makes overfitting unlikely, so $J_{train}(\theta)\approx J_{test}(\theta)$ and both are low.