
Deep learning for action

• Input: observation
• Output: action

[Figure: a deep network (Conv → ReLU → Conv → Avg Pool → Linear) mapping observations to actions]

Acting in an environment

• Actions change the state of the world
• Non-differentiable
• Often non-repeatable
• Long-range dependencies

[Figure: the same deep network applied at each time step t=0, t=1, t=2]

Acting in an environment

[Diagram: agent–environment loop. The agent sends an action (left, right, up, down, jump, shoot) to the environment; the environment returns a state / observation]

Acting in an environment

[Diagram: the same agent–environment loop, with the environment labeled "Video Game" and the agent labeled "Deep Network"]

How to train the agent?

• What should the agent learn to do?
• Minimize a loss?
• Reward from the environment

[Diagram: agent–environment loop (Video Game ↔ Deep Network)]

Markov decision process (MDP) – Formal definition

• state s ∈ S
• action a ∈ A
• reward r_s ∈ ℝ
• transition T(s_{t+1} | s_t, a_t)
• policy π(a | s)

[Diagram: agent–environment loop; the environment implements T(s_{t+1} | s_t, a_t), the agent implements π(a | s)]

MDP – objective

• Trajectory: τ = {s_0, a_0, s_1, a_1, …}
• Return: R(τ) = Σ_t γ^t r_{s_t}
• Objective: maximize_π E_{τ∼P_{π,T}}[R(τ)]

[Diagram: agent–environment loop; the environment implements T(s_{t+1} | s_t, a_t), the agent implements π(a | s)]
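The return is a plain discounted sum of per-step rewards, so it can be sketched in a few lines (the discount γ and the reward list below are made-up illustration values):

```python
def discounted_return(rewards, gamma=0.9):
    """R(tau) = sum_t gamma^t * r_t for one trajectory's reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Cart-pole style reward: 1 per step the pole stays upright.
r = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```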

Partially observed Markov decision process (POMDP)

• state s ∈ S
• action a ∈ A
• reward r_s ∈ ℝ
• transition T(s_{t+1} | s_t, a_t)
• observation o ∈ O
• observation function O(o | s)
• policy π(a | o)

[Diagram: agent–environment loop; the agent now sees only the observation o and implements π(a | o)]

Examples – Cart-pole

• MDP
• objective: balance a pole on a movable cart
• state: angle, angular velocity, position, velocity
• action: force applied to the cart
• reward: 1 for each time step the pole is upright

Image source: https://commons.wikimedia.org/wiki/File:Balancer_with_wine_3.JPG

Examples – Robot locomotion

• MDP
• objective: make the robot move
• state: joint angles and positions
• action: torques applied to joints
• reward: 1 for each time step upright and moving

Video source: https://commons.wikimedia.org/wiki/File:Passive_dynamic_walker.gif

Examples – Games

• POMDP
• objective: beat the game
• state: position and state of all objects, agents, and the world
• action: game controls
• reward: score increase/decrease, completing a level, dying

Video source: SuperTuxKart 1.0 Official Trailer, https://www.youtube.com/watch?v=Lm1TFDBiIIg

Examples – Go

• MDP
• objective: win the game
• state: position of all stones
• action: where to place the next stone
• reward: 0 for a loss, 1 for a win

Image source: https://en.wikipedia.org/wiki/Go_(game)#/media/File:FloorGoban.JPG

Examples – supervised learning

• MDP
• objective: minimize the training (or validation) loss
• state: weights and hyper-parameters
• action: gradient update
• reward: change in loss

Everything is a (PO)MDP

• Very general concept
• NP-hard to solve in general
• Specialized algorithms still work well

[Diagram: agent–environment loop; the environment implements T(s_{t+1} | s_t, a_t), the agent implements π(a | o)]

Learning to act

• How do we train our policy?
• Imitation learning
• Ask an expert

[Figure: the deep network applied at each time step t=0, t=1, t=2]

Imitation learning – definition

• Oracle / expert
• Provides trajectories τ with high return
• Policy
• Trained by supervised learning: maximize_π E_τ [Σ_t log π(a_t | s_t)]

[Diagram: agent–environment loop; the original objective remains maximize_π E_{τ∼P_{π,T}}[R(τ)]]

Imitation learning – Issues

• Expert annotations are sometimes expensive
• Drift away from the expert
• Distribution mismatch between training and testing

[Figure: agent trajectory drifting away from the expert trajectory]

Imitation learning – bag of tricks

• Use a pre-trained architecture
• More robust to slight mismatch in observations
• Better generalization

[Figure: a network pre-trained on a pre-training task, then fine-tuned on demonstrations]

Imitation learning – bag of tricks

• Data augmentation
• Encourage the policy to get back on track

[Figure: augmented observations pull the agent back toward the expert trajectory]

Imitation learning with tricks

• Easy to train
• supervised learning
• Sometimes works
• Major issues with drift / distribution mismatch
• Requires expert annotations

Exploring the Limitations of Behavior Cloning for Autonomous Driving, Codevilla et al., 2019

Imitation learning

• Drift
• Mismatch in training and testing distribution

[Figure: agent trajectory drifting away from the expert trajectory]

Imitation learning – Alternative interpretation

• Expert policy π_E
• Agent policy π
• Iterate:
  • Take action a_t^E ∼ π_E(⋅ | s_t)
  • Imitate: maximize log π(a_t^E | s_t)
  • State update s_{t+1} ∼ T(⋅ | a_t^E, s_t)

Dataset Aggregation (DAgger)

• Expert policy π_E
• Agent policy π
• Iterate:
  • Take expert action a_t^E ∼ π_E(⋅ | s_t)
  • Take agent action a_t ∼ π(⋅ | s_t)
  • Imitate: maximize log π(a_t^E | s_t)
  • State update s_{t+1} ∼ T(⋅ | a_t, s_t)

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, Ross et al., AISTATS 2011
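The DAgger iteration can be sketched on a toy problem; `expert_policy` and `transition` below are stand-in functions for the oracle and the environment, and the supervised retraining step is only indicated by a comment:

```python
def dagger(expert_policy, transition, s0, rounds=3, steps=5):
    """Sketch of DAgger: roll out the *agent*, label the visited states with
    the expert, and aggregate (state, expert_action) pairs into one dataset."""
    dataset = []                     # aggregated supervised data
    agent_policy = expert_policy     # stand-in for a behavior-cloned policy
    for _ in range(rounds):
        s = s0
        for _ in range(steps):
            a = agent_policy(s)                    # agent acts (on-policy states)
            dataset.append((s, expert_policy(s)))  # expert labels the state
            s = transition(s, a)                   # environment step
        # here one would retrain agent_policy by supervised learning on dataset
    return dataset

# Toy 1-D chain: states are integers, the expert always moves right (+1).
data = dagger(expert_policy=lambda s: +1,
              transition=lambda s, a: s + a,
              s0=0)
```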

DAgger – Issues

• Requires an expert oracle
• Very hard for humans to provide: the expert must label states the agent visits

DAgger

• On-policy imitation learning
• Guaranteed to work for agents with enough capacity and a good enough expert

Deep learning for action

• Why not just learn a policy that maximizes reward?
• Hard to optimize!

[Diagram: agent–environment loop; the agent π(a | s) is a CNN]

Deep learning for action

• Two sources of non-differentiability
• Sampling
• Environment

[Diagram: agent–environment loop; the agent π(a | s) is a CNN]

Differentiating sampling

• Compute the gradient of
  E_{a∼P_θ}[g_θ(a)] = Σ_a P_θ(a) g_θ(a)
• ∂/∂θ E_{a∼P_θ}[g_θ(a)] = Σ_a g_θ(a) ∂P_θ(a)/∂θ + Σ_a P_θ(a) ∂g_θ(a)/∂θ

[Diagram: sample a from P_θ, feed it into g, obtain g_θ(a)]

Differentiating sampling – Issues

• Large sum over all samples / actions
• Generally intractable

∂/∂θ E_{a∼P_θ}[g_θ(a)] = Σ_a g_θ(a) ∂P_θ(a)/∂θ + Σ_a P_θ(a) ∂g_θ(a)/∂θ

Reparametrization trick

• For continuous distributions
• Rewrite P_θ(a) = (1/σ_θ) P̃((a − μ_θ)/σ_θ)
• e.g. P̃ a standard normal
• E_{a∼P_θ}[g_θ(a)] = ∫_Ω P̃(b) g_θ(b σ_θ + μ_θ) db

Auto-Encoding Variational Bayes, Kingma and Welling, ICLR 2014

Reparametrization trick

• Compute the gradient
  ∂/∂θ E_{b∼P̃}[g_θ(b σ_θ + μ_θ)] = E_{b∼P̃}[∂/∂θ g_θ(b σ_θ + μ_θ)]
• Gradient computation by sampling

[Diagram: sample b from P̃, compute a = b σ_θ + μ_θ, feed it into g]
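A minimal numeric check of the reparametrization gradient, assuming a Gaussian P_θ with fixed σ: for a = μ + σb with b ∼ N(0, 1), the gradient of E[a²] with respect to μ is E[2a] = 2μ, and the sampled estimate should match:

```python
import random

def reparam_grad_mu(mu, sigma, n=200_000, seed=0):
    """Estimate d/dmu E_{a~N(mu, sigma^2)}[a^2] via the reparametrization
    a = mu + sigma*b, b ~ N(0, 1): per sample the gradient of a^2 is 2a."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        b = rng.gauss(0.0, 1.0)
        a = mu + sigma * b
        total += 2 * a          # d(a^2)/dmu = 2a, since da/dmu = 1
    return total / n

g = reparam_grad_mu(1.5, 0.5)   # analytic answer: 2 * 1.5 = 3.0
```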

Reparametrization trick – discrete variables

• E_{a∼P_θ}[g_θ(a)] = Σ_a P_θ(a) g_θ(a)
• No change of variables
• No differentiable function that maps to a discrete distribution
• Continuous relaxation of one-hot vectors
• Gumbel softmax

• The Concrete Distribution: a Continuous Relaxation of Discrete Random Variables, Maddison et al., ICLR 2017
• Categorical Reparameterization with Gumbel-Softmax, Jang et al., ICLR 2017

Differentiating the environment

• Quite hard
• Up next

[Diagram: agent–environment loop; the agent π(a | s) is a CNN]

Non-differentiability

• Compute the gradient of
  E_{τ∼P_{π,T}}[R(τ)] = Σ_τ P_{π,T}(τ) R(τ)

[Diagram: agent–environment loop; neither the transition T(s_{t+1} | s_t, a_t) nor the sampling in π(a | s) is differentiable]

The log-derivative trick

• Simple chain rule:
  ∇_θ p_θ(x) = p_θ(x) ∇_θ log p_θ(x)
• Gradient of the expected return:
  ∇ E_{τ∼P_{π,T}}[R(τ)] = Σ_τ P_{π,T}(τ) R(τ) ∇ log P_{π,T}(τ) = E_{τ∼P_{π,T}}[R(τ) ∇ log P_{π,T}(τ)]

REINFORCE

• Compute the gradient using Monte Carlo sampling:
  E_{τ∼P_{π,T}}[R(τ) ∇ log P_{π,T}(τ)] ≈ (1/N) Σ_{τ_i∼P_{π,T}} R(τ_i) ∇ log P_{π,T}(τ_i)

Simple statistical gradient-following algorithms for connectionist reinforcement learning, Williams, Machine Learning 1992
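The estimator can be checked on a one-step toy problem, a "bandit" with a softmax policy over two actions (the reward values are made up for illustration); for a softmax, ∇_{θ_i} log π(a) = 1[i = a] − π(i), so the Monte Carlo average should match the analytic gradient:

```python
import math
import random

def reinforce_grad(theta, rewards, n=100_000, seed=0):
    """One-step REINFORCE: pi = softmax(theta); estimate grad_theta E[R(a)]
    as the sample average of R(a) * grad log pi(a)."""
    rng = random.Random(seed)
    exps = [math.exp(t) for t in theta]
    probs = [e / sum(exps) for e in exps]
    grad = [0.0] * len(theta)
    for _ in range(n):
        a = rng.choices(range(len(theta)), weights=probs)[0]
        for i in range(len(theta)):
            # grad log pi(a) w.r.t. theta_i = 1[i == a] - pi(i)
            grad[i] += rewards[a] * ((1.0 if i == a else 0.0) - probs[i])
    return [g / n for g in grad]

g = reinforce_grad(theta=[0.0, 0.0], rewards=[1.0, 0.0])
# Analytic: grad_i = pi(i) * (R_i - E[R]); here pi = [.5, .5], E[R] = .5
```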

REINFORCE issues

• Needs lots of samples for a good gradient
• High-variance gradient estimator
• Cannot reuse rollouts τ

(1/N) Σ_{τ_i∼P_{π,T}} R(τ_i) ∇ log P_{π,T}(τ_i)

Policy gradient

• REINFORCE on steroids
• lower variance
  • baseline
• off-policy
  • reuse rollouts

(1/N) Σ_{τ_i∼P_{π,T}} R(τ_i) ∇ log P_{π,T}(τ_i)

Vanilla policy gradient algorithm

• For i iterations:
  • Collect rollouts
  • Estimate the sample gradient (1/N) Σ_{τ_i∼P_{π,T}} R(τ_i) ∇ log P_{π,T}(τ_i)
  • Take a gradient step

Variance of REINFORCE

• What happens if all rewards are positive?
• Only learn to do the things in τ "more"
• SGD zig-zags
• RL works best if we have positive and negative returns

(1/N) Σ_{τ_i∼P_{π,T}} R(τ_i) ∇ log P_{π,T}(τ_i)

Baselines

• The gradient for a constant return is zero:
  E_{τ∼P_{π,T}}[b ∇ log P_{π,T}(τ)] = 0
• Subtracting a baseline b reduces variance
  • Positive and negative returns
• Unbiased gradient estimate

(1/N) Σ_{τ_i∼P_{π,T}} R(τ_i) ∇ log P_{π,T}(τ_i)
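A quick numeric check that the baseline term really has zero expectation; the softmax policy parameters and the constant b = 5 below are arbitrary illustration choices:

```python
import math
import random

def baseline_term(theta, b, n=100_000, seed=1):
    """Monte Carlo estimate of E_{a~pi}[b * grad_theta log pi(a)] for a
    softmax policy; analytically this is exactly zero for any constant b."""
    rng = random.Random(seed)
    exps = [math.exp(t) for t in theta]
    probs = [e / sum(exps) for e in exps]
    grad = [0.0] * len(theta)
    for _ in range(n):
        a = rng.choices(range(len(theta)), weights=probs)[0]
        for i in range(len(theta)):
            grad[i] += b * ((1.0 if i == a else 0.0) - probs[i])
    return [g / n for g in grad]

g = baseline_term(theta=[0.3, -0.2, 0.1], b=5.0)  # all entries near zero
```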

On- vs off-policy

• REINFORCE is on-policy
• Trajectories (rollouts) need to come from the current policy
• No reuse of trajectories between gradient updates

(1/N) Σ_{τ_i∼P_{π,T}} R(τ_i) ∇ log P_{π,T}(τ_i)

Off-policy

• Importance sampling:
  (1/N) Σ_{τ_i∼Q} (P_{π,T}(τ_i) / Q(τ_i)) R(τ_i) ∇ log P_{π,T}(τ_i)
• Many variants

Policy gradient algorithm

• For i iterations:
  • Collect rollouts
  • Add to replay buffer
  • Update baseline network
  • For j batches:
    • Estimate the sample gradient on the replay buffer
    • Take a gradient step

Policy gradient

• REINFORCE with many tricks
• Not very sample efficient
• Gradient estimate by sampling from an exponentially large trajectory space

(1/N) Σ_{τ_i∼P_{π,T}} R(τ_i) ∇ log P_{π,T}(τ_i)

Why do we need a gradient?

parameters θ → policy π → environment T → reward / return E_{τ∼P}[R(τ)]

getting good gradients is hard!

Why do we need a gradient?

is the forward pass hard?

parameters θ → policy π → environment T → reward / return E_{τ∼P}[R(τ)]

What if we had an oracle?

• Given policies π_A and π_B
• which one is better?
• rollout and compute the return

Gradient Free Optimization

• maximize f(θ) w.r.t. θ
• can only evaluate function values f(θ)
  • no gradients
• f is smooth
  • similar θ produce similar f(θ)

[Diagram: a black box computing f(θ)]

Random Search

• Randomly sample θ
• pick the highest f(θ)

[Figure: random samples in parameter space]

Iterative Random Search

• Randomly sample θ
• pick the highest f(θ)
• Sample more points around the maxima
• repeat

[Figure: samples concentrating around the maxima in parameter space]

Cross entropy method

• Initialize μ, σ
• sample θ ∼ N(μ, σ²)
• compute reward f(θ)
• select the top p% (e.g. p = 20)
• fit a Gaussian for the new μ, σ
• repeat

[Figure: a Gaussian over parameter space shrinking around the best samples]
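The loop above in a minimal 1-D sketch; the toy reward and the hyper-parameters (n = 100 samples, top 20%, 30 iterations) are illustration choices:

```python
import random

def cross_entropy_method(f, mu=0.0, sigma=5.0, n=100, top_p=0.2,
                         iters=30, seed=0):
    """CEM for a single scalar parameter: sample theta ~ N(mu, sigma^2),
    keep the top fraction by f(theta), refit mu and sigma, repeat."""
    rng = random.Random(seed)
    k = max(2, int(n * top_p))
    for _ in range(iters):
        thetas = [rng.gauss(mu, sigma) for _ in range(n)]
        elite = sorted(thetas, key=f, reverse=True)[:k]     # top p%
        mu = sum(elite) / k                                 # refit mean
        sigma = (sum((t - mu) ** 2 for t in elite) / k) ** 0.5 + 1e-6
    return mu

# Maximize a toy "reward" whose optimum is at theta = 3.
best = cross_entropy_method(lambda t: -(t - 3.0) ** 2)
```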

Evolution Strategies

• Initialize θ
• Iterate:
  • Sample ε_0, ε_1, …, ε_n ∼ N(0, I)
  • Compute returns F_i = R(θ + σ ε_i)
  • Normalize F̃_i = (F_i − μ_F) / σ_F
  • Update θ := θ + (α / (σ n)) Σ_{i=1}^n F̃_i ε_i

Evolution strategies as a scalable alternative to reinforcement learning, Salimans et al., arXiv 2017

Augmented random search

• Initialize θ
• Iterate:
  • Sample ε_0, ε_1, …, ε_n ∼ N(0, I)
  • Compute returns F_i^+ = R(θ + σ ε_i) and F_i^− = R(θ − σ ε_i)
  • Update θ := θ + (α / (σ n)) Σ_{i=1}^n (F_i^+ − F_i^−) ε_i

• Simple random search provides a competitive approach to reinforcement learning, Mania et al., NIPS 2018
• Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation, Spall, Automatic Control, 1992
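One ARS update for a single scalar parameter, following the update rule above; the toy return and the step sizes α and σ are illustration values:

```python
import random

def ars_step(R, theta, alpha=0.02, sigma=0.1, n=16, rng=None):
    """One augmented-random-search update:
    theta += alpha/(sigma*n) * sum_i (R(theta+sigma*e_i) - R(theta-sigma*e_i)) * e_i"""
    rng = rng or random.Random(0)
    update = 0.0
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        update += (R(theta + sigma * e) - R(theta - sigma * e)) * e
    return theta + alpha / (sigma * n) * update

# Climb a toy return whose maximum is at theta = 2.
R = lambda t: -(t - 2.0) ** 2
theta, rng = 0.0, random.Random(0)
for _ in range(300):
    theta = ars_step(R, theta, rng=rng)
```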

Gradient free optimization

• Sample complexity is exponential in the parameter dimension
• works better if
  • the parameter space is small
  • parameters correlate with the expected return

[Diagram: a black box computing f(θ)]

Vision and action

• In humans and animals
• vision developed as a side product of acting
• no explicit supervision for vision
• emerges from model structure (connections)

Image source: https://en.wikipedia.org/wiki/Cambrian_explosion#/media/File:Opabinia_BW2.jpg

Vision and action

• In computer vision
• lots of data and labels
• explicit supervision
• emerges from data
• Classical robotics
• Planning happens after computer vision

[Figure: a labeled dataset (example 0 … example 999, each an image labeled "Zebra") feeding a deep network]

Open problem

• Why this disconnect?

[Figure: the same labeled dataset feeding a deep network]

Hypothesis 1 – Too narrow tasks

• On a single (narrow) task
• Data and labels always win
• On multiple tasks
• Generalization between tasks creates a visual representation

[Figure: multiple tasks: hunt, escape, search, move, …]

Hypothesis 2 – Wrong models and algorithms

• Backprop + SGD is biased
• Doesn't work equally well on all tasks and architectures

[Figure: the labeled dataset feeding a deep network]

Hypothesis 3 – No evolution

• Insufficient optimization of models in the outer loop
• Meta-learning can find visual representations without much supervision
• supervision: acting well / survival

Implications

• If any of the above hypotheses are right
• We are wasting time with labeled data

[Figure: the labeled dataset with a question mark]

Hypothesis 4 – Value of labels

• Our current approach is fine
• labeled data provides an abstract representation without the need for evolution and massive optimization

[Figure: the labeled dataset feeding a deep network]

Summary

• Learning to act using deep networks
• Option 1: Imitate an expert
• Option 2: Policy gradient
• Option 3: Gradient free

Which one should I choose?

• Is it easy for a human to perform the task?
  • Imitation
• Do we want to do better than humans?
  • Gradient free
• Can I use existing code?
  • Policy gradient

[Diagram: agent–environment loop (Video Game ↔ Deep Network)]

Looking at your data

• E.g. images
• Random ones
• Smallest / largest file size
• Try solving the task manually

[Figure: example panels: random images, largest file size, smallest file size, solving the task manually]

Dataset

• Training set
  • Learn model parameters
• Validation set
  • Learn hyper-parameters
• Test set
  • Measure generalization performance

Why split the data?

• Overfitting
• Goal: Learn a model that works well in the real world
• Optimization objective: Learn a model that works well on the training data

Training set

• Used to train all parameters of the model
• Model will work very well on the training set
• Size: 60-80% of data

Validation set

• Used to determine how well the model works
• Used to tune model and hyper-parameters
• Size: 10-20% of data

Testing set

• Used to measure performance of the model on unseen data
• Used exactly once
• Size: 10-20% of data

How to split the data?

• Random sampling without replacement
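A minimal split by random sampling without replacement; the 70/15/15 fractions follow the size guidelines above:

```python
import random

def split_data(items, train=0.7, valid=0.15, seed=0):
    """Shuffle once (sampling without replacement), then cut into
    train / validation / test portions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_valid = int(len(items) * valid)
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])

tr, va, te = split_data(range(100))
```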

Distribution of data

• Low dimensions: D_data ≈ D_train ≈ D_valid ≈ D_test
• High dimensions: D_data ≠ D_train ≠ D_valid ≠ D_test

Graduate student descent

[Diagram: a loop: design and train your model (automated) → evaluate your model on the validation set (semi-automated) → look at your data / model output (manual) → repeat]

Network initialization

• What do we set the initial parameters to?

[Figure: data → Linear → ReLU → Linear]

Idea 1: All zero

Forward:
  z_1 = W_1 x
  z_2 = max(z_1, 0)
  o = W_3 z_2

Backward:
  v_3 = ∂ℓ(o)/∂o
  v_2 = W_3^⊤ v_3
  v_1 = v_2 [z_1 > 0]
  v_0 = W_1^⊤ v_1
  ∂ℓ(o)/∂W_3 = v_3 z_2^⊤
  ∂ℓ(o)/∂W_1 = v_1 x^⊤

Idea 1: All zero

• Does not work
• No gradient
• Saddle point

Idea 2: constant

Forward:
  z_1 = W_1 x
  z_2 = max(z_1, 0)
  o = W_3 z_2

Backward:
  v_3 = ∂ℓ(o)/∂o
  v_2 = W_3^⊤ v_3
  v_1 = v_2 [z_1 > 0]
  v_0 = W_1^⊤ v_1
  ∂ℓ(o)/∂W_3 = v_3 z_2^⊤
  ∂ℓ(o)/∂W_1 = v_1 x^⊤

Idea 2: constant

• Does not break symmetries

Solution

• Random initialization: W_i ∼ N(μ_i, σ_i² I)

Random initialization

• Initialize weights W_1 ∼ N(μ_1, σ_1² I), W_3 ∼ N(μ_3, σ_3² I)
• Normal distribution
• Uniform distribution
• What should μ_i and σ_i be?
• For simplicity μ_i = 0 and bias = 0

[Figure: x → Linear → ReLU → Linear → o]

Scaling matters

o = W_2 W_1 x
∂ℓ(o)/∂W_1 = (W_2^⊤ ∂ℓ(o)/∂o) x^⊤
∂ℓ(o)/∂W_2 = ∂ℓ(o)/∂o (W_1 x)^⊤

[Figure: x → Linear (W_1) → Linear (W_2) → o]

How do we scale the initialization?

• By hand
  • A lot of tuning
• Automatically
  • A lot of math

[Figure: x → Linear → ReLU → Linear → ReLU → Linear → o]

Xavier and Kaiming initialization

• Strategy to set the variance σ² of the Normal initialization
• All activations are of similar scale

[Figure: x → Linear (W_1 ∼ N(μ_1, σ_1² I)) → ReLU → Linear (W_3 ∼ N(μ_3, σ_3² I)) → o]

Random matrix multiplication

a^⊤x ∼ N(μ_a Σ_i x_i, ‖x‖² σ_a²)  for a ∼ N(μ_a, σ_a² I)

for a derivation see: https://en.m.wikipedia.org/wiki/Multivariate_normal_distribution

Random matrix multiplication

z_i = W_{i−1} z_{i−1} ∼ N(0, ‖z_{i−1}‖² σ_{W_{i−1}}² I)  for W_{i−1} ∼ N(0, σ_{W_{i−1}}² I)

Random ReLU

z_{i+1} = max(z_i, 0)  for z_i ∼ N(0, σ_i² I)

E[‖z_{i+1}‖²] = ½ n_{z_i} σ_i²

Putting things together

z_i ∼ N(0, σ_i² I)
z_{i+2} ∼ N(0, ‖z_{i+1}‖² σ_{W_{i+1}}² I)  with  ‖z_{i+1}‖² ≈ E[‖z_{i+1}‖²] = ½ n_{z_i} σ_i²

σ_{i+2}² = ½ σ_{W_{i+1}}² n_{z_i} σ_i²

σ_i² = ∏_{k=0}^{(i−1)/2} (½ σ_{W_{2k+1}}² n_{z_{2k}}) σ_x²

Randomly initialized network

σ_i² = ∏_{k=0}^{(i−1)/2} (½ σ_{W_{2k+1}}² n_{z_{2k}}) σ_x²

[Figure: x → Linear → ReLU → Linear → ReLU → Linear → o]

Variance of back-propagation graph

σ̂_i² = ∏_{k=i/2}^{(N−1)/2} (½ σ_{W_{2k+1}}² n_{z_{2k+2}}) σ̂_N²

[Figure: the backward graph mirrors the forward graph, propagating ∂ℓ(o)/∂o back from o to x]

Xavier initialization

• Try to keep both activation and gradient magnitudes constant
• σ_W² = 2 / (n_{z_i} + n_{z_{i+1}})

Forward: σ_i² = ∏_{k=0}^{(i−1)/2} (½ σ_{W_{2k+1}}² n_{z_{2k}}) σ_x²
Backward: σ̂_i² = ∏_{k=i/2}^{(N−1)/2} (½ σ_{W_{2k+1}}² n_{z_{2k+2}}) σ̂_N²

Kaiming initialization

• Try to keep either activation or gradient magnitudes constant
• σ_W² = 2 / n_{z_i}  (activations)
• σ_W² = 2 / n_{z_{i+1}}  (gradients)

Forward: σ_i² = ∏_{k=0}^{(i−1)/2} (½ σ_{W_{2k+1}}² n_{z_{2k}}) σ_x²
Backward: σ̂_i² = ∏_{k=i/2}^{(N−1)/2} (½ σ_{W_{2k+1}}² n_{z_{2k+2}}) σ̂_N²
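A quick numeric check of the forward-variance rule, assuming square Linear layers without biases: with W ∼ N(0, (2/n_in) I), the activation norm stays roughly constant with depth:

```python
import numpy as np

def kaiming_forward(x, depth=10, seed=0):
    """Pass x through `depth` Linear+ReLU layers with Kaiming-initialized
    weights W ~ N(0, 2/n_in); the activation norm should stay stable."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
        x = np.maximum(W @ x, 0.0)       # Linear then ReLU
    return x

x = np.random.default_rng(1).normal(size=1000)
out = kaiming_forward(x)
ratio = np.linalg.norm(out) / np.linalg.norm(x)   # close to 1
```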

Initialization in practice

• Xavier (the default) is often good enough
• Initialize the last layer to zero

Optimization

• Stochastic Gradient Descent
• Convergence speed
• Training accuracy
• Generalization performance

[Plot: training and validation accuracy vs iterations]

Input normalization

• Input: x_i
• Apply an affine transformation: x̂_i = α x_i + β

[Figure: x → Linear → ReLU → Linear → o]

Gradients of uncentered inputs: A simple example

• Input vector x ∈ ℝ^c
• Output scalar o ∈ ℝ
• ∂ℓ(o)/∂w = v x^⊤  with v ∈ ℝ

[Figure: x → Linear → o]

Mean subtraction

• Input: x_i
• Apply an affine transformation: x̂_i = x_i − μ_x

Gradients of unnormalized inputs: A simple example

• Input vector x ∈ ℝ² with |x[0]| ≪ |x[1]|
• Output scalar o ∈ ℝ
• ∂ℓ(o)/∂w = v x^⊤  with v ∈ ℝ

[Figure: x → Linear → o; the gradient is dominated by the large input dimension]

Input normalization

• Input: x_i
• Apply an affine transformation: x̂_i = (x_i − μ_x) / σ_x

Exploding gradients

• Weights of one layer grow
• Gradients of all other layers grow
• Weights of other layers grow
• Bad feedback loop

[Figure: a deep stack of Conv/ReLU layers from x to o, with ∂ℓ(o)/∂o propagating back]

Exploding gradients

• Not a big issue for most networks
• An issue for recurrent networks, and networks that share parameters across layers

Detecting exploding gradients

• Plot the loss
• Plot weight and gradient magnitudes per layer

Vanishing gradients

• Weights of one layer shrink
• Gradients of all other layers shrink
• Weights of other layers stay constant
• No progress

[Figure: a deep stack of Conv/ReLU layers from x to o, with ∂ℓ(o)/∂o propagating back]

Vanishing gradient

• Big issue for larger networks
• Issue for recurrent networks and weights tied across layers

Detecting vanishing gradients

• Plot the loss
• Plot weight and gradient magnitudes per layer

Normalization

• How to prevent vanishing (or exploding) gradients?

[Figure: a deep stack of Conv/ReLU layers from x to o, with ∂ℓ(o)/∂o propagating back]

Batch normalization

• Make activations zero mean and unit variance

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015

[Figure: Conv → BN → ReLU]

Batch normalization

• Normalize by channel-wise mean and standard deviation; for Z ∈ ℝ^{B×W×H×C}:

(Z_{k,x,y,c} − μ_c) / σ_c

μ_c = (1 / BWH) Σ_{k,x,y} Z_{k,x,y,c}
σ_c² = (1 / BWH) Σ_{k,x,y} (Z_{k,x,y,c} − μ_c)²
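The channel-wise statistics above in a few lines of NumPy (training-mode batch norm, without the learned scale/bias):

```python
import numpy as np

def batch_norm(Z, eps=1e-5):
    """Channel-wise batch normalization for Z of shape (B, W, H, C):
    subtract mu_c and divide by sigma_c, computed over batch and space."""
    mu = Z.mean(axis=(0, 1, 2), keepdims=True)                    # mu_c
    sigma = np.sqrt(Z.var(axis=(0, 1, 2), keepdims=True) + eps)   # sigma_c
    return (Z - mu) / sigma

Z = np.random.default_rng(0).normal(3.0, 2.0, size=(8, 4, 4, 16))
Zn = batch_norm(Z)   # each channel now has ~zero mean, ~unit variance
```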

What does batch normalization do?

[Figure: a B×W×H×C activation tensor; batch norm computes statistics over B, W, H for each channel]

What does batch normalization do?

• The good:
  • Regularizes the network
  • Handles badly scaled weights
• The bad:
  • Mixes gradient information between samples

Batch norm and batch size

• Large batch sizes work better
• More stable mean and standard deviation estimates

Batch norm at test time

• Compute mean and standard deviation on the training set using a running average
• At test time (Z ∈ ℝ^{1×W×H×C}), normalize with the stored μ_c and σ_c

Layer normalization

• Make activations zero mean and unit variance without collecting statistics across batches

Layer Normalization, Ba, J., Kiros, J. R. and Hinton, G., arXiv preprint arXiv:1607.06450, 2016

[Figure: Conv → LN → ReLU]

Layer normalization

• Normalize by per-sample (image-wise) mean and standard deviation; for Z ∈ ℝ^{B×W×H×C}:

(Z_{k,x,y,c} − μ_k) / σ_k

μ_k = (1 / WHC) Σ_{x,y,c} Z_{k,x,y,c}
σ_k² = (1 / WHC) Σ_{x,y,c} (Z_{k,x,y,c} − μ_k)²

What does layer normalization do?

[Figure: a B×W×H×C activation tensor; layer norm computes statistics over W, H, C for each sample]

Comparison to batch norm

• No summary statistics
  • Training and testing are the same
• Works well for sequence models
• Does not scale activations individually

[Figure: dimensions normalized by batch norm vs layer norm]

Instance normalization

• Batch norm per input

Ulyanov, Dmitry, Andrea Vedaldi, and Victor Lempitsky. "Instance normalization: The missing ingredient for fast stylization." arXiv 2016.

[Figure: Conv → IN → ReLU]

Instance normalization

• Normalize by spatial mean and standard deviation; for Z ∈ ℝ^{B×W×H×C}:

(Z_{k,x,y,c} − μ_{kc}) / σ_{kc}

μ_{kc} = (1 / WH) Σ_{x,y} Z_{k,x,y,c}
σ_{kc}² = (1 / WH) Σ_{x,y} (Z_{k,x,y,c} − μ_{kc})²

What does instance normalization do?

[Figure: a B×W×H×C activation tensor; instance norm computes statistics over W, H for each sample and channel]

Comparison to batch norm

• No summing over batches
• Works well for graphics applications
• Not used much in recognition
  • Unstable statistics

[Figure: dimensions normalized by batch norm vs layer norm vs instance norm]

Group normalization

• Normalize groups of G channels together; for Z ∈ ℝ^{B×W×H×C} and g = ⌊c/G⌋:

(Z_{k,x,y,c} − μ_{kg}) / σ_{kg}

μ_{kg} = (1 / WHG) Σ_{c′=gG}^{(g+1)G−1} Σ_{x,y} Z_{k,x,y,c′}
σ_{kg}² = (1 / WHG) Σ_{c′=gG}^{(g+1)G−1} Σ_{x,y} (Z_{k,x,y,c′} − μ_{kg})²

Yuxin Wu, and Kaiming He. "Group normalization." ECCV. 2018.

[Figure: channels 0 … C−1 partitioned into group 1, group 2, group 3, group 4]
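A sketch of the grouped statistics above in NumPy, using the slide's "G consecutive channels per group" convention (no learned scale/bias):

```python
import numpy as np

def group_norm(Z, G, eps=1e-5):
    """Group normalization for Z of shape (B, W, H, C): normalize each
    group of G consecutive channels, per sample."""
    B, W, H, C = Z.shape
    assert C % G == 0
    Zg = Z.reshape(B, W, H, C // G, G)        # split channels into groups
    mu = Zg.mean(axis=(1, 2, 4), keepdims=True)
    var = Zg.var(axis=(1, 2, 4), keepdims=True)
    return ((Zg - mu) / np.sqrt(var + eps)).reshape(B, W, H, C)

Z = np.random.default_rng(0).normal(1.0, 3.0, size=(2, 4, 4, 8))
Zn = group_norm(Z, G=4)   # 2 groups of 4 channels, normalized per sample
```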

What does group normalization do?

[Figure: a B×W×H×C activation tensor; group norm computes statistics over W, H and a group of channels for each sample]

Comparison to other norms

• More stable statistics than instance norm (one channel per group)
• Not all channels tied as in layer norm (all channels in one group)

[Figure: dimensions normalized by batch, layer, instance, and group norm]

Local response normalization

• "Generalization" of group norm
• Parameters α and β; for Z ∈ ℝ^{B×W×H×C}:

Z_{k,x,y,c} (γ + (α / n) Σ_{c′=c−n/2}^{c+n/2} Z²_{k,x,y,c′})^{−β}

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." NIPS 2012

Differences between LRN and GN

• Group norm
  • Normalizes over all spatial locations
  • Subtracts the mean
  • Scale and bias transformation
• Local response normalization
  • More flexible parametrization

Where to add normalization?

• Option A
  • After the convolution
• Option B
  • After the ReLU (non-linearity)

[Figure: the two placements of BN within a Conv/ReLU block]

Option A

• No bias needed in the conv
  • Activations are zero mean
• Half of the activations are zeroed out by the ReLU
• Solution:
  • Learn a scale and bias after the norm: (Z_{k,x,y,c} − μ) / σ ⋅ s_c + b_c

Option B

• Scale and bias optional: (Z_{k,x,y,c} − μ) / σ ⋅ s_c + b_c

Where to add normalization?

• Both work
• Option A is more popular
• Option B is easier
  • Scale and bias optional
  • Conv unchanged

Where not to add batch norm?

• After fully connected layers
  • Mean and standard deviation estimates too unstable

Why does normalization work?

• Regularizes the network
• Handles badly scaled weights
  • Single parameter s_c to learn the scale: (Z_{k,x,y,c} − μ) / σ ⋅ s_c + b_c

Deep networks

• Without normalization
  • Max depth 10-12
• With normalization
  • Max depth 20-30

What happens to deeper networks?

• They do not train well

[Figure source: Kaiming He et al., "Deep Residual Learning for Image Recognition", CVPR 2016]

What happens to deeper networks?

• Training a shallower network and adding identity layers works better

[Figure: a stack of Conv/ReLU layers with an identity layer inserted]

Solution: Residual connections

• Parametrize layers as f(x) = x + g(x), where g is the Conv → ReLU → Conv block
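The parametrization f(x) = x + g(x) in a toy NumPy form, with g as Linear → ReLU → Linear stand-ins for the Conv layers (square matrices, no biases); note that with zero weights the block is exactly the identity:

```python
import numpy as np

def residual_block(x, W1, W2):
    """f(x) = x + g(x), where g(x) = W2 @ relu(W1 @ x)."""
    g = W2 @ np.maximum(W1 @ x, 0.0)
    return x + g

x = np.arange(4.0)
zero = np.zeros((4, 4))
eye = np.eye(4)
# zero weights -> identity; identity weights -> x + relu(x)
```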

Fun fact

• The backward graph is symmetric to the forward graph

[Figure: forward block (Conv → ReLU → +) and its mirrored backward block (+ → ReLU (back) → Conv (back))]

Residual Networks

[Figure source: Kaiming He et al., "Deep Residual Learning for Image Recognition", CVPR 2016]

How well do residual connections work?

• Can train networks of up to 1000 layers

[Figure: a residual network with 998 repeated Conv/BN/ReLU blocks, followed by Pool and Linear]

Why do residual connections work? – Practical answer

• The gradient travels further without major modifications (no vanishing)
• Reuse of patterns
  • Only update patterns
• Dropping some layers does not even hurt performance
  • As the weights of g → 0, the block models the identity

[Gao Huang et al., "Deep Networks with Stochastic Depth", ECCV 2016]

Why do residual connections work? – Theoretical answer

• Without ReLU
  • Invertible functions
• Very wide
  • SGD finds the global optimum

[Simon S. Du, et al., "Gradient Descent Finds Global Minima of Deep Neural Networks", ICML 2019]
[Moritz Hardt and Tengyu Ma, "Identity matters in deep learning", ICLR 2017]

Residual connections – Summary

• Used in most modern networks
• Allow for much deeper networks

Stochastic Gradient Descent with Momentum

• Default optimizer
• Works well in most cases
• Tune the learning rate

for n epochs:
  for batches B_i:
    g := E_{x,y∈B_i}[∂ℓ(x, y | θ)/∂θ]
    v := ρ v + g
    θ := θ − ϵ v

RMSProp

• Very specialized
• Auto-tunes the learning rate
• Momentum optional
  • Doesn't play nice with momentum
• Works well on some reinforcement learning problems

m := v := 0
for n epochs:
  for batches B_i:
    g := E_{x,y∈B_i}[∂ℓ(x, y | θ)/∂θ]
    m := α m + (1 − α) g²
    v := ρ v + g / (√m + ε)
    θ := θ − ϵ v

Tijmen Tieleman and Geoffrey Hinton. "Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude." Neural networks for machine learning 4.2, 2012

ADAM

• Less learning rate tuning
• Works well on small networks and problems
• Trains well, generalizes worse
• Mathematically not correct

v := m := 0
for n epochs:
  for batches B_i:
    g := E_{x,y∈B_i}[∂ℓ(x, y | θ)/∂θ]
    v := β_1 v + (1 − β_1) g
    m := β_2 m + (1 − β_2) g²
    s := ϵ √(1 − β_2^step) / (1 − β_1^step)
    θ := θ − s v / (√m + ε)

Kingma, D. P., & Ba, J. L.. Adam: a Method for Stochastic Optimization. ICLR 2015
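The ADAM update above as a function; the square-root placements follow the standard Adam formulation, and the quadratic toy objective is an illustration:

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: first/second moment running averages with bias
    correction. `state` is (v, m, step)."""
    v, m, step = state
    step += 1
    v = beta1 * v + (1 - beta1) * grad             # first moment
    m = beta2 * m + (1 - beta2) * grad ** 2        # second moment
    s = lr * np.sqrt(1 - beta2 ** step) / (1 - beta1 ** step)  # bias correction
    theta = theta - s * v / (np.sqrt(m) + eps)
    return theta, (v, m, step)

# Minimize theta^2 starting from 5.0.
theta = np.array([5.0])
state = (np.zeros(1), np.zeros(1), 0)
for _ in range(5000):
    grad = 2 * theta
    theta, state = adam_step(theta, grad, state, lr=0.01)
```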

What optimizer to use?

• Large models and data
  • SGD with momentum
• Small models and data
  • ADAM

Optimization algorithms

• Hyper-parameters
  • Learning rate
  • Momentum
  • Batch size

What learning rate to use?

• Rule of thumb: the largest LR that trains
• Train for a few epochs and measure validation accuracy

Learning rate vs batch size

• Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.

Priya Goyal et al., "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour", arXiv 2017

Learning rate schedules

• Step schedule
• Linear schedule
• Cosine schedule
• Cyclical schedules

[Plots: learning rate vs training progress for each schedule]
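The step and cosine schedules above as functions; the drop interval and decay factor are made-up illustration values:

```python
import math

def step_schedule(base_lr, epoch, drop_every=30, factor=0.1):
    """Step schedule: multiply the LR by `factor` every `drop_every` epochs."""
    return base_lr * factor ** (epoch // drop_every)

def cosine_schedule(base_lr, epoch, total_epochs):
    """Cosine schedule: smoothly decay from base_lr to 0 over total_epochs."""
    return base_lr * 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
```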

Summary

• Use close to the largest LR that trains
• Step schedule

How do we train a small network?

• Idea 1:
  • Randomly initialize and train the network
• Idea 2:
  • Train a larger network and make it small

Network distillation

• Train an ensemble of large networks
• Train a small network to mimic its output (with cross entropy)
• Important: Reduce the confidence of the ensemble prediction (soft targets)

Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv 2015.

Why does distillation work?

• Dark knowledge
  • Networks learn about (visual) relationships of classes
• Boosts the training signal

Network pruning / factorization

• Train a wide network (many channels)
• Remove the channels/weights that are used the least
• 90% of parameters can be removed after training
• Training the small network directly is challenging

H Li, A Kadav, I Durdanovic, H Samet, HP Graf, "Pruning Filters for Efficient ConvNets", ICLR 2017
S Han, J Pool, J Tran, W Dally, "Learning both Weights and Connections for Efficient Neural Network", NIPS 2015

Possible explanation: Lottery ticket hypothesis

• Not all initializations are created equal

"A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations."

Jonathan Frankle, Michael Carbin, "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks", ICLR 2019

Lottery ticket hypothesis

• Very nice idea
• Likely not the full story

Zhuang Liu et al., "Rethinking the Value of Network Pruning", ICLR 2019

Overfitting

• Fit the model to the training set
• The model does not work on unseen data (e.g. the validation set)

[Plots: training loss decreasing over iterations; training accuracy rising while validation accuracy plateaus]

How to detect overfitting?

• Plot training and validation accuracy

Is overfitting always bad?


Why do we overfit?

• Sampling bias

• Optimization fits patterns that only exist in the training set

Do we overfit with infinite training data?

How do we prevent overfitting?

• Collect more data

• Make the model simpler

• Regularize

• Transfer learning

Training and overfitting


Early stopping in practice

• No need for a stop button

• Measure validation accuracy periodically

• Save your model periodically
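The periodic save amounts to keeping whichever checkpoint had the best validation accuracy; a minimal bookkeeping sketch (the class name and the accuracy numbers are made up):

```python
class BestCheckpoint:
    # Track the best validation accuracy seen so far; "saving" here
    # just records the iteration, a real loop would write model weights.
    def __init__(self):
        self.best_acc = float("-inf")
        self.best_iteration = None

    def update(self, iteration, val_acc):
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.best_iteration = iteration
            return True   # would save the model here
        return False

ckpt = BestCheckpoint()
for it, acc in [(100, 0.61), (200, 0.72), (300, 0.70), (400, 0.69)]:
    ckpt.update(it, acc)
# Validation accuracy peaked at iteration 200 -> use that checkpoint.
```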

Signs of overfitting

• Does not capture invariances in data

[Diagram: the same ConvNet labels an image "dog" but labels transformed copies "cat"]

How to capture invariances?

• Build them into the model

• Convolutions

• All-convolutional models

• Build them into the data

• Data augmentation

Conv

Conv

…

ReLU

Pool

Linear

Data augmentation

• Capture invariances in data

• (Randomly) transform data during training

• Reuse the label

[Images: four augmented versions of the same photo, all labeled "dog"]

Image augmentations

flip

shift

rotate

saturation

brightness

scale tint/hue
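As a toy example of the transforms above, a random horizontal flip on a raw pixel grid — the label is simply reused (a real pipeline would use a library such as torchvision.transforms):

```python
import random

def random_hflip(image, p=0.5):
    # image: 2D list of pixel rows; flip left-right with probability p.
    if random.random() < p:
        return [list(reversed(row)) for row in image]
    return image

img = [[1, 2, 3],
       [4, 5, 6]]
flipped = random_hflip(img, p=1.0)   # p=1 forces the flip
# flipped == [[3, 2, 1], [6, 5, 4]], and the label ("dog") is unchanged
```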

Training with data augmentation

• (Randomly) augment every single iteration

• Network never sees the exact same data twice

Dog

Augmentation

Conv

Conv

…

Loss

Linear

Unsupervised data augmentation

• Captures invariances on unseen and unlabeled data

Augmentation

Conv

Conv

…

Linear

Conv

Conv

…

Consistency

Linear

Xie, Dai, Hovy, Luong, Le, “Unsupervised Data Augmentation”, arXiv 2019

Data augmentation

• Always use data augmentation if possible

• Some augmentations require augmentation of labels

• e.g. for dense prediction tasks

Overfitting in deep networks

• Overfitting: exploit patterns that exist in the training data, but not in the validation / test data

• Not all activations overfit

Linear

Linear

Linear

ReLU

ReLU

Activation 1

Activation 2

Overfitting in deep networks

• Deeper layers overfit more

• They rely on overfit activations from previous layers

Linear

Linear

Linear

ReLU

ReLU

Activation 1

Activation 2

Preventing overfitting in deep networks

• Reduce reliance on specific activations in the previous layer

• Randomly remove activations

Linear

Linear

Linear

ReLU

ReLU

Activation 1

Activation 2

Dropout

• During training

• With probability α set activation a_l(i) to zero

• During evaluation

• Use all activations, but scale by 1 − α

Dropout in practice

• A separate layer: torch.nn.Dropout

• During training

• With probability α set activation a_l(i) to zero

• Scale remaining activations by 1/(1 − α)

• During evaluation: identity

Linear

Linear

Linear

ReLU

ReLU

Dropout

Dropout

Where to add dropout?

• Before any large fully connected layer

• Before some 1×1 convolutions

• Not before general convolutions

Conv 1×1

ReLU
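The train/eval behavior can be written out directly; a pure-Python sketch of inverted dropout, matching the convention where scaling happens at training time:

```python
import random

def dropout(activations, alpha, training):
    # Training: zero each activation with probability alpha and scale
    # survivors by 1 / (1 - alpha). Evaluation: identity.
    if not training:
        return list(activations)
    keep = 1.0 - alpha
    return [0.0 if random.random() < alpha else a / keep
            for a in activations]

random.seed(0)
a = [1.0, 2.0, 3.0, 4.0]
print(dropout(a, alpha=0.5, training=True))   # some zeros, survivors doubled
print(dropout(a, alpha=0.5, training=False))  # [1.0, 2.0, 3.0, 4.0]
```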

Simpler models

• Traditional wisdom

• Simpler model = less

overfitting

Softmax

Conv 7×7

ReLU

Conv 3×3

ReLU

Linear

Avg Pool

(H, W, 64)

(H, W, 128)

(H, W, 3)

… …

Idea 1: Smaller model

• Overfits less

• Fits less

• Worse generalization

(H, W, 32)

(H, W, 64)

(H, W, 3)

… …

Softmax

Conv 7×7

ReLU

Conv 3×3

ReLU

Linear

Avg Pool

Idea 2: Big model with regularization

• Weight decay

• Keep weights small (L2 norm)

• Works sometimes

• Keeps weights at the same magnitude

(H, W, 64)

(H, W, 128)

(H, W, 3)

… …

Softmax

Conv 7×7

ReLU

Conv 3×3

ReLU

Linear

Avg Pool

How to use weight decay?

• Parameter in the optimizer, e.g. torch.optim.SGD or torch.optim.Adam

• weight_decay

• Use 1e-4 as a default
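For plain SGD, weight decay just adds λ·w to the gradient (the L2 penalty view); a one-weight sketch with the slide's default of 1e-4:

```python
def sgd_step(w, grad, lr=0.1, weight_decay=1e-4):
    # Gradient descent with an L2 term added to the gradient: g' = g + wd * w.
    return w - lr * (grad + weight_decay * w)

w = 1.0
for _ in range(100):
    w = sgd_step(w, grad=0.0)   # zero loss gradient: pure decay
# With no loss gradient the weight shrinks geometrically toward zero:
# w == (1 - 0.1 * 1e-4) ** 100, roughly 0.999
```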

Other reasons to use weight decay

• Network weights cannot grow infinitely large

• Helps handle exploding gradients

(H, W, 64)

(H, W, 128)

(H, W, 3)

… …

Softmax

Conv 7×7

ReLU

Conv 3×3

ReLU

Linear

Avg Pool

Ensembles

• Train multiple models

• Average predictions of

multiple models

Softmax

Conv 7×7

ReLU

Conv 3×3

ReLU

Linear

Avg Pool

Softmax

Conv 7×7

ReLU

Conv 3×3

ReLU

Linear

Avg Pool

Softmax

Conv 7×7

ReLU

Conv 3×3

ReLU

Linear

Avg Pool

Ensembles

• Pre-deep learning

• Use different subsets of

training data

• Deep learning

• Use different random

initializations / data

augmentation

• Different local minima

Softmax

Conv 7×7

ReLU

Conv 3×3

ReLU

Linear

Avg Pool

Softmax

Conv 7×7

ReLU

Conv 3×3

ReLU

Linear

Avg Pool

Softmax

Conv 7×7

ReLU

Conv 3×3

ReLU

Linear

Avg Pool

Why do ensembles work?

• Fewer parameters per model

• Each model overfits in its own way

• Usually a 1-3% accuracy boost on most tasks

• Downside: longer training

Softmax

Conv 7×7

ReLU

Conv 3×3

ReLU

Linear

Avg Pool

Softmax

Conv 7×7

ReLU

Conv 3×3

ReLU

Linear

Avg Pool

Softmax

Conv 7×7

ReLU

Conv 3×3

ReLU

Linear

Avg Pool

Why do we average predictions?

• For a convex loss function

• loss of average prediction ≤ average loss of individual models (Jensen's inequality)
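This is Jensen's inequality; a numeric check with the convex negative log likelihood of the true class (the member probabilities are made up):

```python
import math

def nll(p_true):
    # Negative log likelihood of the correct class; convex in p.
    return -math.log(p_true)

# Two ensemble members' predicted probability for the correct class.
p1, p2 = 0.9, 0.4
avg_prediction_loss = nll((p1 + p2) / 2)    # loss of the averaged prediction
avg_of_losses = (nll(p1) + nll(p2)) / 2     # average of the individual losses
# Jensen: loss(average prediction) <= average loss of the members.
```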

When to use ensembles?

• If you have the

compute power

• If you really need the

last bit of accuracy

• e.g. production,

competitions

Training on small datasets

• How to train a large model on a small dataset?

Softmax

Conv

ReLU

Linear

…

Solution: Pre-training / fine-tuning

• Train model on a large dataset (pre-training)

• on a related task

• Copy the model

• Continue training on the small dataset (fine-tuning)

Softmax

Conv

ReLU

Linear

…

Softmax

Conv

ReLU

Linear

…

Pre-training

• Computer vision

• Supervised

• e.g. ImageNet

• Natural Language

Processing

• Self-supervised

• Unlabeled text

Pre-training / fine-tuning in practice

• Download a pre-trained model

• Model zoo

• Directly from the authors

• Run a few training iterations on the small dataset

Softmax

Conv

ReLU

Linear

…

Why does transfer learning work?

• Similar inputs

• e.g. Images, Text, …

• Transfer between tasks

• Good initialization

• Learned weights are tuned well

Softmax

Conv

ReLU

Linear

…

Softmax

Conv

ReLU

Linear

…

When to use transfer

learning?

• Whenever possible

• In early experiments

• Large pre-trained

model exists

Generalization in deep

learning

• Standard wisdom

• Bigger/wider models

overfit more

Softmax

Conv

ReLU

Linear

…

Conv

ReLU

Deep networks are big enough to remember all training data

• Deep networks easily fit random labels

• Memorize all data

• Works even for random noise inputs

Understanding deep learning requires rethinking generalization, Zhang et al. 2017

[Plot: wide ResNet on CIFAR-10, accuracy over epochs for random vs. correct labels]

Why does SGD still work?

• SGD gradually minimizes the objective

• Prefers solutions close to initialization

• Implicitly regularizes

• Random labels take SGD on a longer path

Exploring generalization in Deep Learning, Neyshabur et al. 2017

Larger networks overfit less

• Without data augmentation

• 100% training accuracy

• Larger models generalize better

• Hence overfit less

Understanding deep learning requires rethinking generalization, Zhang et al. 2017

[Plots: wide ResNet on CIFAR-10, training and validation accuracy over epochs for widths 16, 32, 48]

Larger networks overfit less

• All models overfit massively on loss (log likelihood)

On Calibration of Modern Neural Networks, Guo et al. 2017

[Plots: wide ResNet on CIFAR-10, training and validation loss over epochs for widths 16, 32, 48]

Larger networks overfit less

• Do we need a new

learning theory?

• Do we need new

intuitions?

In summary

• Models can overfit, but do not with SGD and data augmentation

• Implicit regularization

• How do we make it explicit?

• Overfitting depends on the learning algorithm (e.g. Adam overfits more)

• How can we measure overfitting?

Graduate student descent

Look at your

data / model output

Design and

train your model

Evaluate

your model on

validation set

automated

semiautomated

manual

Evaluation on validation set

• Run during training

• Every epoch or n

iterations

• Log in TensorBoard

Look at your

data / model output

Design and

train your

model

Evaluate

your model

on

validation set

Look at your data / model

output

• Run during training

• Every epoch or n

iterations

• Log in TensorBoard

• Select same training

and validation images

Look at your

data / model output

Design and

train your

model

Evaluate

your model on

validation set

Design and train your model

• Mostly manual work

Look at your

data / model output

Design and

train your

model

Evaluate

your model on

validation set

Design and train your model

• Network does not train

• Vanishing or exploding gradients?

• Fix initialization and learning rate

• Slow training

• Add normalization

• Residual connections

• Iterate until the model trains


Design and train your model

• Network overfits to training data

• Add data augmentation

• Early stopping

• Try a pre-trained network

• Collect more data

• Iterate until the model generalizes well


Design and train your model

• Network fits training

and validation data well

• Stop graduate student

descent

• Take a break

• Evaluate on test set

Look at your

data / model output

Design and

train your

model

Evaluate

your model on

validation set

High dimensional inputs

[Diagram: fully connected network, Linear → Activation → Linear → Activation → Linear, applied to a 128×128 input]

Images and structure

• Fully connected

networks are not shift

invariant

Finding shift-invariant

patterns

Convolutions

• “Sliding” linear

transformation

a b c

d e f

g h i

* =

Examples of convolutions

Original Vertical edges Horizontal edges Laplace filter

-1 0 1

-1 0 1

-1 0 1

-1 -1 -1

0 0 0

1 1 1

-1 -1 -1

-1 8 -1

-1 -1 -1

Convolutions on multiple

channels

Original Vertical edges Horizontal edges Laplace filter

-1 0 1

-1 0 1

-1 0 1

-1 -1 -1

0 0 0

1 1 1

-1 -1 -1

-1 8 -1

-1 -1 -1

Formal definition

• Input: X ∈ ℝ^(H×W×C1)

• Kernel: w ∈ ℝ^(h×w×C1×C2)

• Bias: b ∈ ℝ^(C2)

• Output: Z ∈ ℝ^((H−h+1)×(W−w+1)×C2)

Z_{a,b,c} = b_c + Σ_{i=0}^{h−1} Σ_{j=0}^{w−1} Σ_{k=0}^{C1−1} X_{a+i,b+j,k} · w_{i,j,k,c}

Conv w×h
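The definition transcribes directly into loops; a naive pure-Python sketch on nested lists (slow, only for checking shapes and values):

```python
def conv2d(X, w, b):
    # X: H x W x C1 input, w: h x w_ x C1 x C2 kernel, b: length-C2 bias.
    # Implements Z[a][bb][c] = b[c] + sum_{i,j,k} X[a+i][bb+j][k] * w[i][j][k][c]
    H, W, C1 = len(X), len(X[0]), len(X[0][0])
    h, w_, C2 = len(w), len(w[0]), len(w[0][0][0])
    Z = [[[b[c] for c in range(C2)]
          for _ in range(W - w_ + 1)]
         for _ in range(H - h + 1)]
    for a in range(H - h + 1):
        for bb in range(W - w_ + 1):
            for c in range(C2):
                for i in range(h):
                    for j in range(w_):
                        for k in range(C1):
                            Z[a][bb][c] += X[a + i][bb + j][k] * w[i][j][k][c]
    return Z

# 3x3 single-channel input, 2x2 kernel of ones, zero bias:
X = [[[1], [2], [3]], [[4], [5], [6]], [[7], [8], [9]]]
w = [[[[1.0]], [[1.0]]], [[[1.0]], [[1.0]]]]
Z = conv2d(X, w, [0.0])
# Output is (3-2+1) x (3-2+1) = 2 x 2; Z[0][0][0] sums the top-left 2x2 patch.
```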

Stacking multiple layers

Conv 3×3

ReLU

Conv 3×3

ReLU

Conv 3×3

ReLU

Convolution as a linear layer

[Diagram: a 2×2×1×1 kernel with entries a, b, c, d applied to a 3×3×1 input (flattened to length 9) equals a 4×9 weight matrix whose rows contain a, b, c, d at shifted positions, producing the flattened 2×2×1 output (length 4)]

Special case: 1×1 convolution

• Pixel-wise linear

transformation

• Kernel: 1 × 1 × *C*1 × *C*2

* w =

Output size

• Input: X ∈ ℝ^(H×W×C1)

• Kernel: w ∈ ℝ^(h×w×C1×C2)

• Output: Z ∈ ℝ^((H−h+1)×(W−w+1)×C2)

Padding

• Add p_w, p_h zeros on each side in each dimension

• Input: X ∈ ℝ^(H×W×C1)

• Kernel: w ∈ ℝ^(h×w×C1×C2)

• Output: Z ∈ ℝ^((H−h+2p_h+1)×(W−w+2p_w+1)×C2)

[Diagram: input padded with a border of zeros, convolved with a 2×2 kernel]

Output resolution

• High output resolution

• Slow computation

Conv 3×3

Conv 3×3

Conv 3×3

(H, W, 1024)

(H, W, 512)

(H, W, 3)

Striding

• Only compute every s_w-th / s_h-th output

• Input: X ∈ ℝ^(H×W×C1)

• Kernel: w ∈ ℝ^(h×w×C1×C2)

• Output: Z ∈ ℝ^((⌊(H−h+2p_h)/s_h⌋+1)×(⌊(W−w+2p_w)/s_w⌋+1)×C2)

[Diagram: zero-padded input convolved with a 3×3 kernel at stride 2]

Output size with striding

[Diagrams: the same 3×3 kernel applied at different strides yields different output sizes]
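The padded, strided output size is worth keeping as a helper, since strided convolution rounds down; a sketch of the per-dimension formula:

```python
def conv_output_size(n, kernel, padding=0, stride=1):
    # floor((n - kernel + 2 * padding) / stride) + 1
    return (n - kernel + 2 * padding) // stride + 1

print(conv_output_size(5, 3))                       # 3  (no padding, stride 1)
print(conv_output_size(5, 3, padding=1))            # 5  ("same" padding)
print(conv_output_size(5, 3, padding=1, stride=2))  # 3  (rounds down)
```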

Parameters

Parameters

• Every input channel is connected to every output channel

[Diagram: C1 = 6 input channels, C2 = 4 output channels]

Grouping

• Split channels into g groups

• Reduce parameters and computation by a factor of g

[Diagram: C1 = 6, C2 = 4 with grouped connectivity]

Depthwise convolution

• Special grouping: C1 = C2 = g

[Diagram: C1 = 3, C2 = 3]
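The factor-g saving can be verified by counting parameters, since each output channel only connects to C1/g input channels; a small sketch:

```python
def conv_params(h, w, c_in, c_out, groups=1, bias=True):
    # Each of the c_out filters sees only c_in / groups input channels.
    assert c_in % groups == 0 and c_out % groups == 0
    weights = h * w * (c_in // groups) * c_out
    return weights + (c_out if bias else 0)

dense     = conv_params(3, 3, 6, 4)             # groups = 1
grouped   = conv_params(3, 3, 6, 4, groups=2)
depthwise = conv_params(3, 3, 3, 3, groups=3)   # C1 = C2 = g
# grouped has half the weights of dense; depthwise has one filter per channel
```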

Hyper-parameters of convolutions

• Kernel size: w × h

• Padding: p_w, p_h

• Stride: s_w, s_h

Convolutional operators

• Run arbitrary operation

“over” image

*f*(x)

*f*(x)

Average pooling

• Convolutional operator

• *f**c*(x) = mean*i*,*j*(x*i*,*j*,*c*)

Where to use average

pooling?

• Older networks:

• Inside a network

…

Conv 3×3

ReLU

Avg Pool

Conv 3×3

ReLU

(H, W, 1024)

(H/2, W/2, 1024)

Where to use average

pooling?

• Modern networks

• Global average pooling

…

Linear

Softmax

Avg Pool

Conv 3×3

ReLU

(H’, W’, 1024)

(1024)

Max pooling

• Convolutional operator

• f_c(x) = max_{i,j}(x_{i,j,c})

Where to use max pooling?

• Inside a network

• With strides, as

down-sampling

…

Conv 3×3

ReLU

Max Pool

Conv 3×3

ReLU

(H, W, 1024)

(H/2, W/2, 1024)

Max pooling as a nonlinearity

• Similar to maxout

Conv

Max Pool

Conv

…

Max Pool

Receptive fields

• Can input x_{abc} affect output z_{ijk}?

[Diagram: 3×3×1 input, Conv 2×2, 2×2×1 output]

Receptive fields

• Can input x_{abc} affect output z_{ijk}?

[Diagram: 3×4×1 input, two Conv 2×2 layers, 2×2×1 output]

How do we compute the receptive field?

• Option 1: lots of math

• Option 2: computationally

• Feed an image of 0s to the network

• Change a single input element to NaN

• See which outputs change

Conv

Conv

…
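The NaN trick works because NaN propagates through arithmetic; a 1D sketch with a hand-rolled convolution (no framework needed):

```python
def conv1d_valid(x, k):
    # Valid 1D convolution: output[i] = sum_j x[i + j] * k[j].
    n = len(x) - len(k) + 1
    return [sum(x[i + j] * k[j] for j in range(len(k))) for i in range(n)]

def receptive_outputs(n_inputs, kernels, poke_index):
    # Feed zeros, set one input to NaN, and see which outputs turn NaN.
    x = [0.0] * n_inputs
    x[poke_index] = float("nan")
    for k in kernels:
        x = conv1d_valid(x, k)
    return [i for i, v in enumerate(x) if v != v]  # NaN != NaN detects NaN

# Two 3-tap convolutions: each final output depends on 5 consecutive inputs.
affected = receptive_outputs(9, [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]], poke_index=4)
# Poking the middle input affects all 5 outputs.
```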

Structure of receptive field?

Use striding, increase

channels

• Trade spatial resolution

for channels

• Balance computation

Keep kernels small

• 3×3 kernels almost

everywhere

• exception:

• first layer up to 7×7

Conv 7×7

Conv 3×3

Conv 3×3

Conv 3×3

…

Repeat patterns

• First layer or two are

special and not

repeated

• All others usually follow

a fixed pattern

Conv 1×1

ReLU

Conv 3×3

ReLU

Conv 1×1

ReLU

All-convolutional

• Average in the end

• Fewer parameters

• Better training signal

• “Ensemble”/voting

effect for testing

softmax

softmax

Linear

…

Conv Conv

…

…

Flatten Avg Pool

Linear Linear

Structure of input data

• Images

• Repeating patterns

• at various scales

7×7 patches

Structure of convolutional

networks

• Exploit repeating

structure of images

Conv 7×7

Conv 3×3

Conv 3×3

Conv 3×3

…

Network Dissection: Quantifying Interpretability of Deep Visual Representations, D. Bau et al., CVPR 2017

What do networks learn?

Linear layered networks do not work well on images

• Largest linear network for computer vision

• Locally connected

Building high-level features using large-scale unsupervised learning, Q. Le et al., ICML 2012

Large Scale Distributed Deep Networks, J. Dean et al., NeurIPS 2012

Locally connected Block

L2 Pool

Local Linear

Local Res. Norm

Locally connected Block

Locally connected Block

Segmentation

Receptive field

• How to increase

receptive field?

• Large kernel size

• Striding

Conv 3×3

ReLU

Conv 3×3

ReLU

Conv 3×3

ReLU

…

Receptive field

receptive field = 3

receptive field = 5

receptive field = 7

Receptive field and striding

receptive field = 3

receptive field = 7

receptive field = 15
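Receptive-field growth can be computed by tracking each layer's output spacing ("jump") in input coordinates; a sketch that reproduces both sequences from the slides (3, 5, 7 without striding; 3, 7, 15 with stride 2):

```python
def receptive_fields(layers):
    # layers: list of (kernel, stride). Returns the receptive field
    # (in input pixels) after each layer.
    rf, jump, out = 1, 1, []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # new input pixels each output sees
        jump *= stride              # spacing of outputs in input coordinates
        out.append(rf)
    return out

print(receptive_fields([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
print(receptive_fields([(3, 2), (3, 2), (3, 2)]))  # [3, 7, 15]
```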

Dilation

• Add 0-padding between

values in convolutional

kernel

a b c

d e f

g h i

*

a b c

d e f

g h i

*

Dilation vs striding

receptive field = 3

receptive field = 7

receptive field = 15

Dilation vs striding

receptive field = 3

receptive field = 7

receptive field = 15

The many names of dilation

• hole convolution

• à trous convolution (French: "with holes")

Inverse of strided

convolution

Strided

conv

Upsampling

Up-convolution

• Dilation of the input

a b c

d e f

g h i

*

Rounding

• Strided convolution

rounds down

• How to correct for this?

Up-convolution in action

• Used closer to output

layers

Conv

Conv

Conv

Up-Conv

Up-Conv

Up-Conv

Up-convolutions and skip

connections

• Provides lower-level

high-resolution features

to output

Conv

Conv

Conv

Up-Conv

Up-Conv

Up-Conv

The many names of up-convolution

• Transpose convolution

• “Deconvolution”

• Fractionally strided convolution

Convolutional networks

• Linear transformations

• Convolutions

• Convolutional nonlinearities

• Pooling

Softmax

Conv 7×7

ReLU

Conv 3×3

ReLU

Linear

Avg Pool

ConvNet Design

• Stride and increase

channels

• Small kernel size

• All convolutional

• Up-convolution close to

output (optional)

Softmax

Conv 7×7

ReLU

Conv 3×3

ReLU

Linear

Avg Pool

(H, W, 3)

(H/2, W/2, 64)

(H/4, W/4, 128)

(128)

Applications of ConvNets

• Autonomous vehicles

• Analyze medical images

• Geoscience: Analyze

scans of rock

formations to find oil
