Neural Networks

Representation

  • Neuron in the brain
  • Neuron model: Logistic unit

Logistic Unit

  • bias unit $x_0$
  • Sigmoid (logistic) activation function: $g(z)=\frac{1}{1+e^{-z}}$
  • “weights”: parameters

Simple Neural Network

Neural Network

  • Input layer (Layer 1)
  • Hidden layer (Layer 2)
  • Output layer (Layer 3)
  • $a_i^{(j)}$: “activation” of unit $i$ in layer $j$
  • $\Theta^{(j)}$: matrix of weights controlling function mapping from layer $j$ to layer $j+1$
  • architecture: how the different neurons are connected to each other.

if network has $s_j$ units in layer $j$, $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1}\times(s_j+1)$.
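
For example, a network with $s_1=3$ input units and $s_2=5$ hidden units gives a $5\times4$ matrix $\Theta^{(1)}$. Below is a minimal NumPy sketch of the resulting shapes; the layer sizes are made up for illustration:

```python
import numpy as np

# Hypothetical layer sizes: 3 inputs, 5 hidden units, 1 output
layer_sizes = [3, 5, 1]          # s_1, s_2, s_3

# Theta^(l) maps layer l to layer l+1 and has shape s_{l+1} x (s_l + 1)
thetas = [np.zeros((s_next, s_prev + 1))
          for s_prev, s_next in zip(layer_sizes[:-1], layer_sizes[1:])]

for l, theta in enumerate(thetas, start=1):
    print(f"Theta^({l}) shape: {theta.shape}")   # (5, 4), then (1, 6)
```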

Forward propagation

$$a_1^{(2)}=g(\Theta_{10}^{(1)}x_{0}+\Theta_{11}^{(1)}x_{1}+\Theta_{12}^{(1)}x_{2}+\Theta_{13}^{(1)}x_{3})\\
a_2^{(2)}=g(\Theta_{20}^{(1)}x_{0}+\Theta_{21}^{(1)}x_{1}+\Theta_{22}^{(1)}x_{2}+\Theta_{23}^{(1)}x_{3})\\
a_3^{(2)}=g(\Theta_{30}^{(1)}x_{0}+\Theta_{31}^{(1)}x_{1}+\Theta_{32}^{(1)}x_{2}+\Theta_{33}^{(1)}x_{3})$$

$$x=\begin{bmatrix}x_0\\x_1\\x_2\\x_3\end{bmatrix}\qquad z^{(2)}=\begin{bmatrix}z^{(2)}_1\\z^{(2)}_2\\z^{(2)}_3\end{bmatrix}$$

$$z^{(2)}=\Theta^{(1)}x=\Theta^{(1)}a^{(1)}\\
a^{(2)}=g(z^{(2)})$$

Add the bias unit $a^{(2)}_0=1$.

$$z^{(3)}=\Theta^{(2)}a^{(2)}\\
h_\Theta(x)=a^{(3)}=g(z^{(3)})$$

Finally,

$$h_\Theta(x)=g(\Theta_{10}^{(2)}a_{0}^{(2)}+\Theta_{11}^{(2)}a_{1}^{(2)}+\Theta_{12}^{(2)}a_{2}^{(2)}+\Theta_{13}^{(2)}a_{3}^{(2)})$$
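
Putting the steps above together, here is a minimal NumPy sketch of vectorized forward propagation through the 3-3-1 network described above; the weight values are arbitrary and the variable names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, theta1, theta2):
    """Forward propagation for a single example x of shape (3,)."""
    a1 = np.concatenate(([1.0], x))             # add bias unit x_0 = 1
    z2 = theta1 @ a1                            # z^(2) = Theta^(1) a^(1)
    a2 = np.concatenate(([1.0], sigmoid(z2)))   # add bias unit a_0^(2) = 1
    z3 = theta2 @ a2                            # z^(3) = Theta^(2) a^(2)
    return sigmoid(z3)                          # h_Theta(x) = a^(3)

# Illustrative weights for a 3-3-1 network (values are arbitrary)
rng = np.random.default_rng(0)
theta1 = rng.standard_normal((3, 4))            # maps layer 1 -> layer 2
theta2 = rng.standard_normal((1, 4))            # maps layer 2 -> layer 3
print(forward(np.array([1.0, 0.5, -2.0]), theta1, theta2))
```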

Logic Gate

Multi-class classification

  • One-vs-all

Multiple output units

Set $h_\Theta(x)\approx\begin{bmatrix}1\\0\\0\end{bmatrix}$ when class 1, $h_\Theta(x)\approx\begin{bmatrix}0\\1\\0\end{bmatrix}$ when class 2, etc.
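
A minimal NumPy sketch of recoding integer labels $y^{(i)}\in\{1,\dots,K\}$ as such indicator vectors; the function name and example labels are made up for illustration:

```python
import numpy as np

def one_hot(y, num_classes):
    """Recode integer labels 1..K as K-dimensional indicator vectors."""
    Y = np.zeros((y.size, num_classes))
    Y[np.arange(y.size), y - 1] = 1.0      # labels assumed to start at 1
    return Y

y = np.array([1, 3, 2, 3])                 # made-up labels for illustration
print(one_hot(y, 3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```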

Cost function

  • training set: $\left\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\right\}$
  • total layers in network: $L$
  • number of units (not counting the bias unit) in layer $l$: $s_l$

Logistic regression:

$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^my^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] +\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$$

  • a generalization of logistic regression
  • regularization term

Neural network:

For $h_\Theta(x)\in\mathbb{R}^K$, where $(h_\Theta(x))_k$ denotes the $k^{th}$ output:

$$\begin{align}J(\Theta)=&-\frac{1}{m}\left[\sum_{i=1}^m\sum_{k=1}^K y_k^{(i)}\log (h_\Theta(x^{(i)}))_k+(1-y_k^{(i)})\log(1-(h_\Theta(x^{(i)}))_k)\right]\\&+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2\end{align}$$
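
A minimal NumPy sketch of this cost, assuming H holds the hypotheses $h_\Theta(x^{(i)})$ row by row, Y is the matching one-hot label matrix, and thetas is the list of weight matrices (all names are illustrative):

```python
import numpy as np

def nn_cost(H, Y, thetas, lam):
    """Regularized neural-network cost J(Theta).

    H      : (m, K) matrix of outputs h_Theta(x^(i))
    Y      : (m, K) one-hot label matrix
    thetas : list of Theta^(l) matrices, each of shape s_{l+1} x (s_l + 1)
    lam    : regularization parameter lambda
    """
    m = Y.shape[0]
    cross_entropy = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Regularization skips the first column of each matrix (bias weights, j = 0)
    reg = lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in thetas)
    return cross_entropy + reg
```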

Backpropagation algorithm

To get $\min_\Theta J(\Theta)$, we need to compute:

  • $J(\Theta)$
  • $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)$

Gradient computation: Backpropagation algorithm

Intuition: $\delta_j^{(l)}=$”error” of node $j$ in layer $l$.

$$\delta^{(L)}=a^{(L)}-y\\
\delta^{(L-1)}=(\Theta^{(L-1)})^T\delta^{(L)}\cdot g'(z^{(L-1)})\\
\delta^{(L-2)}=(\Theta^{(L-2)})^T\delta^{(L-1)}\cdot g'(z^{(L-2)})\\
\dots\\
\delta^{(2)}=(\Theta^{(2)})^T\delta^{(3)}\cdot g'(z^{(2)})\\
(\mathrm{No}\ \delta^{(1)};\ \mathrm{it\ is\ the\ input})$$

Here $\cdot$ denotes the element-wise product.

Meanwhile,
$$g'(z^{(l)})=a^{(l)}\cdot(1-a^{(l)})\\
\left(\mathrm{because}\ g(z)=\frac{1}{1+e^{-z}}\Rightarrow g'(z)=\frac{e^{-z}}{(1+e^{-z})^2}=g(z)(1-g(z))\right)$$
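
A minimal NumPy sketch of the sigmoid gradient, using the identity above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gradient(z):
    """g'(z) = g(z) * (1 - g(z)), evaluated element-wise."""
    g = sigmoid(z)
    return g * (1.0 - g)

# Equivalently, if a = g(z) is already available: g'(z) = a * (1 - a)
print(sigmoid_gradient(np.array([-1.0, 0.0, 1.0])))  # maximum value 0.25 at z = 0
```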

Finally,

$$\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)=a_j^{(l)}\delta_i^{(l+1)}\qquad(\mathrm{ignoring\ regularization,\ i.e.}\ \lambda=0)$$

Implementation of the backpropagation algorithm

Given the training set:

$\left\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\right\}$

Set $\Delta_{ij}^{(l)}=0\ (\mathrm{for\ all}\ l,i,j)$.

For $i=1$ to $m$:

    Set $a^{(1)}=x^{(i)}$

    Perform forward propagation to compute $a^{(l)}$ for $l=2,3,…,L$

    Using $y^{(i)}$, compute $\delta^{(L)}=a^{(L)}-y^{(i)}$

    Compute $\delta^{(L-1)},\delta^{(L-2)},…,\delta^{(2)}\ (\mathrm{No}\ \delta^{(1)})$

    $\Delta_{ij}^{(l)}:=\Delta_{ij}^{(l)}+a_j^{(l)}\delta_{i}^{(l+1)}\Rightarrow\Delta^{(l)}:=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T$

$$D_{ij}^{(l)}:=\begin{cases}\frac{1}{m}\Delta_{ij}^{(l)}+\frac{\lambda}{m}\Theta_{ij}^{(l)}&\text{if } j\neq0\\\frac{1}{m}\Delta_{ij}^{(l)}&\text{if } j=0\end{cases}$$

$$\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)=D_{ij}^{(l)}$$
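
A minimal NumPy sketch of this procedure for a 3-layer network (one hidden layer), looping over the training examples exactly as above; the variable names are illustrative and the regularization term uses $\lambda/m$ as in the formula for $D_{ij}^{(l)}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(X, Y, theta1, theta2, lam):
    """Gradients D^(1), D^(2) for a 3-layer network (input, hidden, output)."""
    m = X.shape[0]
    Delta1 = np.zeros_like(theta1)
    Delta2 = np.zeros_like(theta2)

    for i in range(m):
        # Forward propagation
        a1 = np.concatenate(([1.0], X[i]))
        z2 = theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        a3 = sigmoid(theta2 @ a2)

        # Backpropagate the errors (drop the bias component of delta^(2))
        delta3 = a3 - Y[i]                                        # delta^(L)
        delta2 = (theta2.T @ delta3)[1:] * a2[1:] * (1 - a2[1:])  # uses g'(z^(2))

        # Accumulate: Delta^(l) += delta^(l+1) (a^(l))^T
        Delta2 += np.outer(delta3, a2)
        Delta1 += np.outer(delta2, a1)

    # D^(l): average, regularizing every column except the bias column (j = 0)
    D1 = Delta1 / m
    D2 = Delta2 / m
    D1[:, 1:] += lam / m * theta1[:, 1:]
    D2[:, 1:] += lam / m * theta2[:, 1:]
    return D1, D2
```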

Gradient Checking

  1. Numerical estimation of gradients
    $\theta\in\mathbb{R}$:
    $$\frac{\mathrm{d} }{\mathrm{d}\theta}J(\theta)\approx\frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon}$$
  2. $\theta\in\mathbb{R}^n$:
    $$\frac{\partial}{\partial\theta_i}J(\theta)\approx\frac{J(\theta_1,\theta_2,\dots,\theta_i+\epsilon,\dots,\theta_n)-J(\theta_1,\theta_2,\dots,\theta_i-\epsilon,\dots,\theta_n)}{2\epsilon}\qquad(\text{gradApprox})$$
  3. Check that gradApprox $\approx$ DVec (from backpropagation); a sketch follows after this list.
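
A minimal NumPy sketch of the numerical estimate, assuming a cost(theta) function that evaluates $J$ for an unrolled parameter vector (the names are illustrative):

```python
import numpy as np

def numerical_gradient(cost, theta, eps=1e-4):
    """Two-sided finite-difference approximation of the gradient of cost."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad_approx[i] = (cost(theta + e) - cost(theta - e)) / (2 * eps)
    return grad_approx

# Usage: compare against the unrolled backprop gradient (DVec)
# assert np.allclose(numerical_gradient(cost, theta), DVec, atol=1e-7)
```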

Implementation Note

  • Implement backprop to compute DVec (Unrolled $D^{(1)},D^{(2)},D^{(3)}$)
  • Implement numerical gradient check to compute gradApprox
  • Make sure they give similar values
  • Turn off gradient checking; use the backprop code for learning

Important

  • Be sure to disable your gradient checking code before training your classifier. If you run numerical gradient computation on every iteration of gradient descent (or in the inner loop of costFunction()), your code will be very slow.

Random initialization

Initial value of $\Theta$

  • Does NOT work: initialTheta = zeros(n,1) (every hidden unit then computes the same function of the input)
  • Random initialization: symmetry breaking:
    Initialize each $\Theta_{ij}^{(l)}$ to a random value in $[-\epsilon,\epsilon]$ (a sketch follows below)
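
A minimal NumPy sketch of such an initialization; the default $\epsilon$ value here is a common heuristic, not prescribed by these notes:

```python
import numpy as np

def random_init(s_in, s_out, eps_init=0.12):
    """Initialize Theta^(l) uniformly in [-eps, eps] to break symmetry.

    s_in, s_out: number of units in layer l and l+1; the extra column is the bias.
    """
    return np.random.uniform(-eps_init, eps_init, size=(s_out, s_in + 1))

theta1 = random_init(3, 5)   # shape (5, 4)
theta2 = random_init(5, 1)   # shape (1, 6)
```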

Summary

To train a neural network:

Pick a network architecture

  • connectivity pattern between neurons: how many hidden layers and how many hidden units in each layer
  • Number of input units: Dimension of features $x^{(i)}$
  • Number of output units: Number of classes
  • Reasonable default: 1 hidden layer; if more than 1 hidden layer, use the same number of hidden units in every layer (usually, the more units the better)

Training process of neural network

  1. Randomly initialize the weights to small values close to 0, between $-\epsilon$ and $+\epsilon$
  2. Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
  3. Implement code to compute cost function $J(\Theta)$
  4. Implement backprop to compute partial derivatives $\frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)$
    for i = 1:m
        Perform forward propagation and backpropagation using example $(x^{(i)},y^{(i)})$
        (Get activations $a^{(l)}$ and delta terms $\delta^{(l)}$ for $l=2,…,L$)
        $\Delta^{(l)}:=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T$
    compute $\frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)$
  5. Use gradient checking to compare $\frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)$ computed using backprop vs. using a numerical estimate of the gradient of $J(\Theta)$

Then disable gradient checking code.

  6. Use either gradient descent or one of the advanced optimization algorithms together with backprop to try to minimize $J(\Theta)$ as a function of the parameters $\Theta$ (a sketch follows below).
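
As a sketch of this last step, the parameters are usually unrolled into a single vector before being handed to an optimizer; the helpers below and the cost_and_grad wrapper mentioned in the comments are illustrative, not part of these notes:

```python
import numpy as np

def unroll(thetas):
    """Flatten a list of Theta matrices into one long parameter vector."""
    return np.concatenate([T.ravel() for T in thetas])

def reshape_params(vec, shapes):
    """Inverse of unroll, given the list of matrix shapes."""
    thetas, start = [], 0
    for rows, cols in shapes:
        thetas.append(vec[start:start + rows * cols].reshape(rows, cols))
        start += rows * cols
    return thetas

# With a cost_and_grad(vec) -> (J, grad_vec) wrapper around the cost and
# backprop code above (hypothetical), an advanced optimizer could be used:
#
#   from scipy.optimize import minimize
#   result = minimize(cost_and_grad, unroll(initial_thetas),
#                     jac=True, method="L-BFGS-B")
#   trained_thetas = reshape_params(result.x, [t.shape for t in initial_thetas])
```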