What’s in this article

  • Introduction
  • Linear Algebra review
  • Linear Regression
  • Classification

Chinese notes

Machine Learning Personal Notes (Complete Edition)

What is machine learning

  • definition
  • supervised learning: regression problem, classification
  • unsupervised learning: clustering
  • others: reinforcement learning, recommender systems

Linear Algebra review

Matrices and vectors

Matrices

  • Matrix: Rectangular array of numbers
  • Dimension of matrix: number of rows $\times$ number of columns
  • Matrix Elements(entries of matrix): $a_{ij}=(i^{th}\ \mathrm{row}, j^{th}\ \mathrm{column})$

Vector

  • Vector: An $n\times1$ matrix
  • $y_i=i^{th}\ \mathrm{element}$
  • 1-indexed/0-indexed: element indices start at 1/0

Addition and scalar (numerical) multiplication

Addition

$$A+B=C\Rightarrow a_{ij}+b_{ij}=c_{ij}$$

Scalar (numerical) multiplication

$$k\cdot A=B\Rightarrow k\cdot a_{ij}=b_{ij}$$

Matrix-matrix multiplication

  • Matrix times vector = vector
  • Vector times matrix: invalid (an $n\times1$ column vector cannot left-multiply an $m\times n$ matrix)
  • Vector times vector = matrix or number (column times row = matrix; row times column = number)
  • Requirement: number of columns of $A$ = number of rows of $B$
  • Rows of result = rows of $A$
  • Columns of result = columns of $B$

if $A(m\times p),\ B(p\times n)$:

$$C\,(m\times n)=AB=A\times B\Rightarrow (AB)_{ij}=\sum_{k=1}^{p}a_{ik}b_{kj}$$

Matrix-vector multiplication

if $A(m\times n),\ \vec{y}(n)$:

$$\vec{z}=A\times\vec{y}\Rightarrow \vec{z}_i=\sum_{k=1}^{n}a_{ik}\cdot\vec{y}_k$$
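
As a quick sanity check, here is a minimal NumPy sketch (the matrices are made up for illustration) comparing the `@` operator against the two sum formulas above:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])        # m x p = 2 x 3
B = np.array([[7., 8.],
              [9., 10.],
              [11., 12.]])          # p x n = 3 x 2
y = np.array([1., 0., -1.])         # length-p vector

C = A @ B                           # matrix-matrix product, shape (2, 2)
z = A @ y                           # matrix-vector product, shape (2,)

# entry-wise check against (AB)_{ij} = sum_k a_{ik} b_{kj}
i, j = 0, 1
assert np.isclose(C[i, j], sum(A[i, k] * B[k, j] for k in range(A.shape[1])))
# check z_i = sum_k a_{ik} y_k
assert np.isclose(z[i], sum(A[i, k] * y[k] for k in range(A.shape[1])))
print(C.shape, z.shape)             # (2, 2) (2,)
```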

Inverse and transpose

  • Inverse: only a square matrix can have an inverse ($AA^{-1}=A^{-1}A=I$)
  • Singular/degenerate matrix: a matrix that has no inverse
  • Transpose ($A^T$); a short NumPy sketch of these operations follows this list
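
A short NumPy sketch of these three ideas, with made-up matrices: the inverse satisfies $AA^{-1}=I$, a singular matrix has no inverse, and the transpose is `A.T`.

```python
import numpy as np

A = np.array([[4., 7.],
              [2., 6.]])                        # invertible 2 x 2 matrix
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))        # True: A A^{-1} = I
print(A.T)                                      # transpose A^T

S = np.array([[1., 2.],
              [2., 4.]])                        # singular: second row = 2 * first row
try:
    np.linalg.inv(S)
except np.linalg.LinAlgError as e:
    print("singular/degenerate matrix:", e)
```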

Linear Regression (one-variable/univariate)

  • dataset
  • training examples (training set)
  • hypothesis $h(x)$

Hypothesis: $h_\theta(x)=\theta_0+\theta_1x$

Parameters: $\theta_i$

Cost function (squared error function):

$$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x_i)-y_i)^2$$

Goal:

$$\min_{\theta_0,\theta_1}J(\theta_0,\theta_1)$$
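
A minimal Python sketch of this cost function, with a made-up toy dataset:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = (1/2m) * sum (h(x_i) - y_i)^2."""
    m = len(x)
    h = theta0 + theta1 * x              # hypothesis evaluated on all examples
    return np.sum((h - y) ** 2) / (2 * m)

x = np.array([1., 2., 3., 4.])           # toy inputs
y = np.array([2., 4., 6., 8.])           # toy targets (y = 2x)

print(cost(0.0, 2.0, x, y))              # 0.0 -- perfect fit
print(cost(0.0, 1.0, x, y))              # > 0 -- worse parameters cost more
```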

Gradient descent

Have function $J(\theta_0,\theta_1)$

Want $\min_{\theta_0,\theta_1}J(\theta_0,\theta_1)$

Outline:

  • Start with some $\theta_0,\theta_1$
  • Keep changing $\theta_0,\theta_1$ to reduce $J(\theta_0,\theta_1)$ until we hopefully end up at a minimum

ALGORITHM

repeat until convergence {
$$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial \theta_j}J(\theta_0,\theta_1)\quad(\mathrm{for}\ j=0\ \mathrm{and}\ j=1)$$
Simultaneous update:
$$\begin{aligned}
\mathrm{temp0}&:=\theta_0-\alpha\frac{\partial}{\partial \theta_0}J(\theta_0,\theta_1)\\
\mathrm{temp1}&:=\theta_1-\alpha\frac{\partial}{\partial \theta_1}J(\theta_0,\theta_1)\\
\theta_0&:=\mathrm{temp0}\\
\theta_1&:=\mathrm{temp1}
\end{aligned}$$
For linear regression the partial derivatives give:
$$\begin{aligned}
\theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)\\
\theta_1&:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)\cdot x_i
\end{aligned}$$
}

$\alpha$: learning rate
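
A minimal Python sketch of the algorithm above, assuming made-up toy data and arbitrary choices for $\alpha$ and the iteration count; note the simultaneous update of both parameters:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, iters=2000):
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        h = theta0 + theta1 * x                           # current predictions
        temp0 = theta0 - alpha * np.sum(h - y) / m        # partial wrt theta0
        temp1 = theta1 - alpha * np.sum((h - y) * x) / m  # partial wrt theta1
        theta0, theta1 = temp0, temp1                     # simultaneous update
    return theta0, theta1

x = np.array([1., 2., 3., 4.])
y = np.array([3., 5., 7., 9.])                            # y = 2x + 1
print(gradient_descent(x, y))                             # approx (1.0, 2.0)
```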

Linear Regression (multiple variables/features)

  • $n$: number of features
  • $x^{(i)}$: input(features) of $i^{th}$ training example
  • $x^{(i)}_j$: value of feature $j$ in $i^{th}$ training example

Hypothesis:

$$h_\theta(x)=\theta_0+\sum_{i=1}^n\theta_i x_i$$

for convenience, define $x_0=1$:

$$h_\theta(x)=\theta^Tx$$

Gradient descent for multiple variables

Hypothesis: $h_\theta(x)=\theta^Tx$

Parameters: $\theta_0,\theta_1,…,\theta_n$

Cost function (squared error function):

$$J(\theta_0,\theta_1,…,\theta_n)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$$

Gradient descent:

Repeat {
$$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\ \Leftrightarrow\ \theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$$
$$\left(h_\theta(x)=\theta^Tx\right)$$
}
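
A vectorized NumPy sketch of this update, assuming a design matrix `X` whose first column is the $x_0=1$ column (data and hyperparameters are illustrative):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Batch gradient descent for h_theta(x) = theta^T x.
    X: (m, n+1) design matrix with a leading column of ones."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # (1/m) * sum (h - y) * x_j, all j at once
        theta -= alpha * grad              # simultaneous update of every theta_j
    return theta

# toy data: y = 1 + 2*x1 + 3*x2
X = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [1., 1., 1.],
              [1., 2., 2.]])
y = np.array([4., 3., 6., 11.])
print(gradient_descent(X, y))              # approx [1., 2., 3.]
```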

Some tips for gradient descent

  • Feature scaling: make sure features are on a similar scale (a sketch follows this list)
  • Mean normalization: replace $x_i$ with $x_i-\mu_i$ so that features have approximately zero mean
  • Make sure gradient descent is working correctly: $J(\theta)$ should decrease after every iteration
  • Automatic convergence test: declare convergence if $J(\theta)$ decreases by less than some threshold $\epsilon$ (e.g. $10^{-3}$) in one iteration
  • Better convergence test: look at plots of $J(\theta)$ versus the number of iterations
  • If $J(\theta)$ increases, use a smaller $\alpha$
  • If $\alpha$ is too small: the algorithm can be slow to converge
  • If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration and the algorithm may not converge; slow convergence is also possible
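
A short NumPy sketch of feature scaling with mean normalization, using the standard deviation as the scale (one common choice; the feature values are made up):

```python
import numpy as np

X = np.array([[2104., 3.],            # e.g. size in square feet, number of bedrooms
              [1600., 3.],
              [2400., 4.],
              [1416., 2.]])

mu = X.mean(axis=0)                   # per-feature mean
sigma = X.std(axis=0)                 # per-feature scale (standard deviation)
X_scaled = (X - mu) / sigma           # mean-normalized, similarly-scaled features

print(X_scaled.mean(axis=0))          # approximately [0, 0]
print(X_scaled.std(axis=0))           # [1, 1]
```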

Features and Polynomial Regression

  • create new features (variables) if necessary
  • Polynomial regression (don’t forget feature scaling): $h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+…=\theta_0+\theta_1x+\theta_2x^2+…$

Normal Equation

  • Normal equation: method for solving for $\theta$ analytically
  • Derived by setting the (partial) derivatives to zero
  • No need for feature scaling
  • Only applies to linear regression

for design matrix $X$ and target vector $y$:

$$\theta=(X^TX)^{-1}X^Ty$$

This formula will give the optimal value of $\theta$. (reference)
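
A NumPy sketch of this formula on made-up data; pinv() is used rather than inv(), anticipating the Noninvertibility note below:

```python
import numpy as np

# toy design matrix (first column is x_0 = 1) and targets: y = 1 + 2*x1 + 3*x2
X = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [1., 1., 1.],
              [1., 2., 2.]])
y = np.array([4., 3., 6., 11.])

theta = np.linalg.pinv(X.T @ X) @ X.T @ y   # theta = (X^T X)^{-1} X^T y
print(theta)                                # approx [1., 2., 3.]
```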

Noninvertibility

Use pinv() (pseudo-inverse) instead of inv() (inverse) to compute $(X^TX)^{-1}$; then you can always obtain $\theta$.

  • The difference between pinv() and inv() involves more advanced numerical computing concepts

What if $X^TX$ is non-invertible?

  • Redundant features(linearly dependent)
  • Too many features (e.g. number of training examples $m\leq$ number of features $n$)

Gradient Descent vs Normal Equation

for $m$ training examples, $n$ features:

Gradient Descent:

  • Need to choose $\alpha$
  • Needs many iterations
  • Works well even when $n$ is large

Normal Equation:

  • No need to choose $\alpha$
  • No need to iterate
  • Slow if $n$ is very large (computing $(X^TX)^{-1}$ is $O(n^3)$)

Rule of thumb: up to about $n=10^3$ features, the Normal Equation is fine; around $n=10^4$ or more, prefer Gradient Descent.

Classification (binary)

  • algorithm: Logistic Regression
  • $y\in\{0,1\}$

Logistic Regression

Hypothesis Representation

  • Sigmoid function/Logistic function: $g(z)=\frac{1}{1+e^{-z}}$
  • $h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}=P(y=1|x;\theta)$ (see the sketch after this list)
  • Decision boundary: a property of the hypothesis, not of the dataset
  • Non-linear decision boundaries: add more features such as $x^2$
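
A minimal Python sketch of the sigmoid and of $h_\theta(x)$ as a probability; the values of $\theta$ and $x$ are made up:

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """h_theta(x) = g(theta^T x) = P(y = 1 | x; theta)."""
    return sigmoid(theta @ x)

theta = np.array([-3., 1., 1.])      # decision boundary: x1 + x2 = 3
x = np.array([1., 2., 2.])           # x_0 = 1, x1 = 2, x2 = 2
print(h(theta, x))                   # about 0.73 > 0.5, so predict y = 1
```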

Cost Function

  • Training set: $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\ldots,(x^{(m)},y^{(m)})\}$
  • $m$ examples: $x=\begin{bmatrix}x_0\\x_1\\\vdots\\x_n\end{bmatrix},\quad x_0=1,\quad y\in\{0,1\}$
  • $h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}$
  • fit the parameters $\theta$

Old cost function: $J(\theta)=\frac{1}{m}\sum_{i=1}^m \frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2$

If we define $Cost(h_\theta(x),y)=\frac{1}{2}(h_\theta(x)-y)^2$,

then the non-linearity of $h_\theta(x)$ makes $J(\theta)$ non-convex, so gradient descent is not guaranteed to reach the global minimum.

  • Logistic regression cost function

$$Cost(h_\theta(x),y)=\left\{\begin{matrix}
-\log(h_\theta(x))&\mathrm{if}\ y=1\\
-\log(1-h_\theta(x))&\mathrm{if}\ y=0
\end{matrix}\right.$$

If $y=1$: $h_\theta(x)=1$ gives $Cost=0$; as $h_\theta(x)\rightarrow0$, $Cost\rightarrow\infty$.

If $y=0$: $h_\theta(x)=0$ gives $Cost=0$; as $h_\theta(x)\rightarrow1$, $Cost\rightarrow\infty$.

Simplified cost function and gradient descent

Since $y$ must be 0 or 1, the two cases can be combined:

$$Cost(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$$

So:

$$\begin{aligned}
J(\theta)&=\frac{1}{m}\sum_{i=1}^m Cost(h_\theta(x^{(i)}),y^{(i)})\\
&=-\frac{1}{m}\left[\sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]
\end{aligned}$$
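
A vectorized NumPy sketch of this cost on made-up data; the small clip that keeps $h$ away from exactly 0 or 1 is an implementation detail, not part of the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum[ y log h + (1 - y) log(1 - h) ]."""
    m = len(y)
    h = np.clip(sigmoid(X @ theta), eps, 1 - eps)   # avoid log(0)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

# toy data: X has a leading column of ones, y in {0, 1}
X = np.array([[1., 0.5], [1., 1.5], [1., 3.0], [1., 4.0]])
y = np.array([0., 0., 1., 1.])
print(logistic_cost(np.array([0., 0.]), X, y))      # log(2) ~ 0.693 at theta = 0
```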

Gradient descent:

Repeat {
$$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\ \Leftrightarrow\ \theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$$
$$\left(h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}\right)$$
}

Optimization algorithms (a usage sketch with an off-the-shelf optimizer follows this list):

  • Gradient descent
  • Conjugate gradient
  • BFGS
  • L-BFGS
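
These can be used off the shelf instead of a hand-written learning-rate loop; below is a sketch with scipy.optimize.minimize, where the cost/gradient functions and toy data are assumptions reused from the sketches above:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # logistic regression cost J(theta), clipped to avoid log(0)
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / len(y)

def grad(theta, X, y):
    # gradient: (1/m) * X^T (h - y)
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# made-up, non-separable toy data (first column of X is x_0 = 1)
X = np.array([[1., 0.5], [1., 1.5], [1., 2.5], [1., 3.0], [1., 4.0]])
y = np.array([0., 0., 1., 0., 1.])

# BFGS picks its own step sizes, so no learning rate alpha is needed
res = minimize(cost, x0=np.zeros(X.shape[1]), args=(X, y), jac=grad, method="BFGS")
print(res.x, res.fun)    # fitted theta and final cost
```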

Classification (multi-class)

One-vs-All (One-vs-Rest) classification

Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y=i$.

$$h_\theta^{(i)}(x)=P(y=i|x;\theta)\quad(i=1,2,3,\ldots,k,\ \text{where}\ k\ \text{is the number of classes})$$

On a new input $x$, to make a prediction, pick the class $i$ that maximizes

$$\max_ih_\theta^{(i)}(x)$$
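
A compact sketch of the prediction step, assuming each trained classifier is just a parameter vector $\theta^{(i)}$ stored as a row of a matrix (all values are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one parameter vector theta^{(i)} per class, stacked as rows (values made up)
Theta = np.array([[ 2.0, -1.0,  0.0],    # class 1
                  [-1.0,  1.5, -0.5],    # class 2
                  [-2.0, -0.5,  1.0]])   # class 3

x = np.array([1.0, 0.8, 2.0])            # new input with x_0 = 1

probs = sigmoid(Theta @ x)               # h_theta^{(i)}(x) for every class i
prediction = np.argmax(probs) + 1        # pick the class with the highest probability
print(probs, prediction)
```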

The Problem of Overfitting

  • “Underfit”, “High bias”
  • “Overfit”, “High variance”

Overfitting: If we have too many features, the learned hypothesis may fit the training set very well ($J(\theta)\approx0$), but fail to generalize to new examples.

Addressing overfitting

Reduce number of features

  • Manually select which features to keep.
  • Model selection algorithm.

Regularization

  • Keep all the features, but reduce magnitude/values of parameters $\theta_j$.
  • Works well when we have a lot of features, each of which contributes a bit to predicting $y$.

Regularization

Cost function

Small values for parameters $\theta_0,\theta_1,…,\theta_n$

  • “Simpler” hypothesis
  • Less prone to overfitting

$$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^n\theta_j^2\right]$$

$\lambda$: regularization parameter (controls the trade-off between fitting the training set well and keeping the parameters $\theta_j$ small)

Regularized Linear Regression

Gradient descent

Repeat {
$$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}$$
$$\theta_j:=\theta_j-\alpha\left[\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}\theta_j\right]\quad(j=1,2,3,\ldots,n)$$
}

Normal Equation

$$\theta=\left(X^TX+\lambda\begin{bmatrix}0&&&&\\&1&&&\\&&1&&\\&&&\ddots&\\&&&&1\end{bmatrix}_{(n+1)\times(n+1)}\right)^{-1}X^Ty$$

If $\lambda>0$, the matrix $X^TX+\lambda\begin{bmatrix}0&&&\\&1&&\\&&\ddots&\\&&&1\end{bmatrix}$ is guaranteed to be invertible.
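
A NumPy sketch of the regularized normal equation on made-up data; the regularization matrix is the identity with its $(0,0)$ entry zeroed so that $\theta_0$ is not penalized:

```python
import numpy as np

# toy design matrix with leading ones column, and targets
X = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [1., 1., 1.],
              [1., 2., 2.]])
y = np.array([4., 3., 6., 11.])

lam = 1.0                                # regularization parameter lambda
L = np.eye(X.shape[1])                   # (n+1) x (n+1) identity ...
L[0, 0] = 0.0                            # ... with the theta_0 entry zeroed

theta = np.linalg.inv(X.T @ X + lam * L) @ X.T @ y
print(theta)                             # regularized solution (compare to the unregularized [1, 2, 3])
```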

Regularized Logistic Regression

$$J(\theta)=-\left[\frac{1}{m}\sum_{i=1}^{m}y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$$

Repeat {
$$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}$$
$$\theta_j:=\theta_j-\alpha\left[\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}\theta_j\right]\quad(j=1,2,3,\ldots,n)$$
}