Deep Learning in Python

Deep Learning Homework in Python
Please see the attached PDF for the assignment requirements. I have also attached the starter code.

Deep learning for action
• Input
• Observation
• Output
• Action

[Figure: a deep network (Conv → ReLU → Conv → Avg Pool → Linear) maps the observation to an action]
Acting in an environment
• Actions change the state of the world
• Non-differentiable
• Often non-repeatable
• Long-range dependencies

[Figure: the same network applied at each time step t=0, t=1, t=2]
Acting in an environment
[Figure: agent–environment loop — the agent sends an action (left, right, up, down, jump, shoot) to the environment, which returns a state / observation]
[Figure repeated: the same loop with a video game as the environment and a deep network as the agent]
How to train the agent?
• What should the agent
learn to do?
• Minimize loss
• Reward from
environment
Markov decision process (MDP) – Formal definition
• state $s \in S$
• action $a \in A$
• reward $r_s \in \mathbb{R}$
• transition $T(s_{t+1}|s_t, a_t)$
• policy $\pi(a|s)$
[Figure: the agent's policy $\pi(a|s)$ picks action $a$; the environment's transition $T(s_{t+1}|s_t, a_t)$ returns the next state $s$ and reward $r_s$]
MDP – objective
• Trajectory $\tau = \{s_0, a_0, s_1, a_1, \ldots\}$
• Return $R(\tau) = \sum_t \gamma^t r_{s_t}$
• Objective $\max_\pi \; \mathbb{E}_{\tau \sim P_{\pi,T}}[R(\tau)]$
Partially observed Markov decision process (POMDP)
• state $s \in S$
• action $a \in A$
• reward $r_s \in \mathbb{R}$
• transition $T(s_{t+1}|s_t, a_t)$
• observation $o \in O$
• observation function $O(o|s)$
• policy $\pi(a|o)$
[Figure: the same loop, but the agent only receives an observation $o$ and acts with $\pi(a|o)$]
Examples – Cart-pole
• MDP
• objective: balance a pole on
movable cart
• state: angle, angular velocity,
position, velocity
• action: force applied
• reward: 1 for each time step
pole is upright
Image source: https://commons.wikimedia.org/wiki/
File:Balancer_with_wine_3.JPG
Examples – Robot locomotion
• MDP
• objective: make the robot move
• state: joint angles and positions
• action: torques applied to joints
• reward: 1 for each time step upright + moving
Video source: https://commons.wikimedia.org/wiki/File:Passive_dynamic_walker.gif
Examples – Games
• POMDP
• objective: beat the game
• state: position, location, state of
all objects, agents and world
• action: game controls
• reward: score increase/
decrease, complete level, die
Video source: SuperTuxKart 1.0 Official Trailer, https://www.youtube.com/watch?
v=Lm1TFDBiIIg
Examples – Go
• MDP
• objective: win the game
• state: position of pieces
• action: next piece
• reward: 0 lose, 1 win
Image source: https://en.wikipedia.org/wiki/Go_(game)#/media/File:FloorGoban.JPG
Examples – supervised
learning
• MDP
• objective: Minimize the
training (or validation) loss
• state: weights and hyperparameters
• action: gradient update
• reward: change in loss
Everything is a (PO)MDP
• Very general concept
• NP-hard
• Specialized algorithms
still work well
Learning to act
• How do we train our
policy?
• Imitation learning
• Ask an expert

Imitation learning – definition
• Oracle / expert
• Provides trajectories $\tau$ with high return
• Policy
• Supervised learning: $\max_\pi \; \mathbb{E}_\tau \left[ \sum_t \log \pi(a_t|s_t) \right]$
• (compare to the RL objective $\max_\pi \; \mathbb{E}_{\tau \sim P_{\pi,T}}[R(\tau)]$)
Imitation learning – Issues
• Expert annotations are
sometimes expensive
• Drift from expert
• Distribution mismatch
between training and
testing
[Figure: the agent's trajectory drifting away from the expert's]
Imitation learning – bag of
tricks
• Use pre-trained
architecture
• More robust to slight
mismatch in
observations
• Better generalization

[Figure: pre-train the network on a pre-training task, then train on demonstrations]
Imitation learning – bag of
tricks
• Data augmentation
• Encourage policy to
get back on track
Agent
Expert
Imitation learning with tricks
• Easy to train
• supervised learning
• Sometimes works
• Major issues with drift /
distribution mismatch
• Requires expert
annotations
Exploring the Limitations of Behavior Cloning for Autonomous Driving, Codevilla et al., 2019
Imitation learning
• Drift
• Mismatch in training
and testing
distribution
Agent
Expert
Imitation learning – Alternative interpretation
• Expert policy $\pi^E$
• Agent policy $\pi$
• Iterate
• Take action $a_t^E \sim \pi^E(\cdot|s_t)$
• Imitate: maximize $\log \pi(a_t^E|s_t)$
• State update $s_{t+1} \sim T(\cdot|a_t^E, s_t)$
Dataset Aggregation (DAgger)
• Expert policy $\pi^E$
• Agent policy $\pi$
• Iterate
• Take expert action $a_t^E \sim \pi^E(\cdot|s_t)$
• Take agent action $a_t \sim \pi(\cdot|s_t)$
• Imitate: maximize $\log \pi(a_t^E|s_t)$
• State update $s_{t+1} \sim T(\cdot|a_t, s_t)$
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, Ross et al., AISTATS 2011
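A minimal DAgger sketch of the loop above; the `env`, `expert`, and `policy` objects and their methods are hypothetical stand-ins for illustration, not part of the lecture or the starter code.

```python
# Minimal DAgger sketch (interfaces are assumptions: env, expert, policy).
def dagger(env, expert, policy, iterations=10, steps=1000):
    dataset = []                               # aggregated (state, expert action) pairs
    for it in range(iterations):
        s = env.reset()
        for _ in range(steps):
            a_expert = expert.act(s)           # query the expert in every visited state
            dataset.append((s, a_expert))      # aggregate into one growing dataset
            a = a_expert if it == 0 else policy.act(s)
            s, _, done = env.step(a)           # after the first iteration, roll out the agent's action
            if done:
                s = env.reset()
        policy.fit(dataset)                    # supervised learning on all data collected so far
    return policy
```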
DAgger – Issues
• Requires an expert oracle
• Very hard for humans
DAgger
• On-policy imitation
learning
• Guaranteed to work
for agents with
enough capacity and
good enough expert
Deep learning for action
• Why not just learn a
policy that maximizes
reward?
• Hard to optimize!
Deep learning for action
• Two sources of non-differentiability
• Sampling
• Environment
Differentiating sampling
• Compute gradient of
$\mathbb{E}_{a \sim P_\theta}[g_\theta(a)] = \sum_a P_\theta(a)\, g_\theta(a)$
$\nabla_\theta\, \mathbb{E}_{a \sim P_\theta}[g_\theta(a)] = \sum_a g_\theta(a)\, \nabla_\theta P_\theta(a) + \sum_a P_\theta(a)\, \nabla_\theta g_\theta(a)$
[Figure: a network $f$ outputs $P_\theta$, a sample $a$ is drawn from it and scored by $g$ as $g_\theta(a)$]
Differentiating sampling – Issues
• Large sum over all samples / actions
• Generally intractable
$\nabla_\theta\, \mathbb{E}_{a \sim P_\theta}[g_\theta(a)] = \sum_a g_\theta(a)\, \nabla_\theta P_\theta(a) + \sum_a P_\theta(a)\, \nabla_\theta g_\theta(a)$
Reparametrization trick
• For continuous distributions
• Rewrite $P_\theta(a) = \frac{1}{\sigma_\theta} P\!\left(\frac{a - \mu_\theta}{\sigma_\theta}\right)$
• e.g. standard normal
$\mathbb{E}_{a \sim P_\theta}[g_\theta(a)] = \int_\Omega P(b)\, g_\theta(b\sigma_\theta + \mu_\theta)\, db$
Auto-Encoding Variational Bayes, Kingma and Welling, ICLR 2014
Reparametrization trick
• Compute gradient



• Gradient computation by sampling:
$\nabla_\theta\, \mathbb{E}_{b \sim P}[g_\theta(b\sigma_\theta + \mu_\theta)] = \mathbb{E}_{b \sim P}\!\left[\frac{\partial}{\partial\theta}\, g_\theta(b\sigma_\theta + \mu_\theta)\right]$
[Figure: the network $f$ outputs $\mu_\theta, \sigma_\theta$; a sample $b$ from $P$ is transformed to $b\sigma_\theta + \mu_\theta$ and passed to $g$]
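A small PyTorch sketch of reparametrized sampling, contrasting it with direct sampling; the tensor shapes and the stand-in score $g$ are illustrative assumptions.

```python
# Minimal reparametrization-trick sketch in PyTorch.
import torch

mu = torch.randn(8, requires_grad=True)
log_sigma = torch.zeros(8, requires_grad=True)

# Direct sampling (not differentiable w.r.t. mu / sigma):
# a = torch.normal(mu, log_sigma.exp())

# Reparametrized sampling: the randomness b is independent of the parameters,
# so gradients flow through mu and sigma.
b = torch.randn(8)
a = b * log_sigma.exp() + mu
g = (a ** 2).sum()          # stand-in for g_theta(a)
g.backward()                # mu.grad and log_sigma.grad are now populated
```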
Reparametrization trick – discrete variables
$\mathbb{E}_{a \sim P_\theta}[g_\theta(a)] = \sum_a P_\theta(a)\, g_\theta(a)$
• No change of variables
• No differentiable function that maps to a discrete distribution
• Continuous relaxation of one-hot vectors
• Gumbel softmax
• The Concrete Distribution: a Continuous Relaxation of Discrete Random Variables, Maddison et al., ICLR 2017
• Categorical Reparameterization with Gumbel-Softmax, Jang et al., ICLR 2017
Differentiating the
environment
• Quite hard
• Up next
Non-differentiability
• Compute gradient of
$\mathbb{E}_{\tau \sim P_{\pi,T}}[R(\tau)] = \sum_\tau P_{\pi,T}(\tau)\, R(\tau)$
The log-derivative trick
• Simple chain rule: $\nabla_\theta\, p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)$
• Gradient of expected return:
$\nabla\, \mathbb{E}_{\tau \sim P_{\pi,T}}[R(\tau)] = \sum_\tau P_{\pi,T}(\tau)\, R(\tau)\, \nabla \log P_{\pi,T}(\tau) = \mathbb{E}_{\tau \sim P_{\pi,T}}\!\left[R(\tau)\, \nabla \log P_{\pi,T}(\tau)\right]$
REINFORCE
• Compute gradient using Monte Carlo sampling
$\mathbb{E}_{\tau \sim P_{\pi,T}}\!\left[R(\tau)\, \nabla \log P_{\pi,T}(\tau)\right] \approx \frac{1}{N} \sum_{\tau \sim P_{\pi,T}} R(\tau)\, \nabla \log P_{\pi,T}(\tau)$
Simple statistical gradient-following algorithms for connectionist reinforcement
learning, Williams, Machine learning 1992
REINFORCE issues
• Needs lots of samples
for a good gradient
• High-variance gradient
estimator
• Cannot reuse rollouts ($\tau$)
$\frac{1}{N} \sum_{\tau \sim P_{\pi,T}} R(\tau)\, \nabla \log P_{\pi,T}(\tau)$
Policy gradient
• REINFORCE on steroids
• lower variance
• baseline
• off-policy
• reuse rollouts
$\frac{1}{N} \sum_{\tau \sim P_{\pi,T}} R(\tau)\, \nabla \log P_{\pi,T}(\tau)$
Vanilla policy gradient
algorithm
• For i iterations
• Collect rollouts
• Estimate the sample
gradient
• Take a gradient step
$\frac{1}{N} \sum_{\tau \sim P_{\pi,T}} R(\tau)\, \nabla \log P_{\pi,T}(\tau)$
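A minimal REINFORCE / vanilla policy gradient sketch in PyTorch, assuming a Gym-style environment with the older `(obs, reward, done, info)` step API; `obs_dim`, `n_actions`, the tiny policy network, and all hyper-parameters are illustrative assumptions.

```python
# Minimal vanilla policy gradient (REINFORCE) sketch.
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2   # CartPole-like sizes (assumption)
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.SGD(policy.parameters(), lr=1e-3)

for it in range(1000):
    log_probs, rewards = [], []
    obs, done = env.reset(), False                    # env: assumed Gym-style environment
    while not done:                                   # collect one rollout
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))       # log pi(a_t | s_t)
        obs, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    R = sum(rewards)                                  # return R(tau) of the trajectory
    loss = -R * torch.stack(log_probs).sum()          # -R(tau) * sum_t log pi(a_t | s_t)
    opt.zero_grad()
    loss.backward()                                   # Monte Carlo estimate of the policy gradient
    opt.step()
```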
Variance of REINFORCE
• What happens if all rewards are positive?
• Only learns to do "more" of the actions in $\tau$
• SGD zig-zags
• RL works best if we have positive and negative returns
$\frac{1}{N} \sum_{\tau \sim P_{\pi,T}} R(\tau)\, \nabla \log P_{\pi,T}(\tau)$
Baselines
• Gradient for a constant return is zero: $\mathbb{E}_{\tau \sim P_{\pi,T}}\!\left[b\, \nabla \log P_{\pi,T}(\tau)\right] = 0$
• Reduces variance
• Positive and negative returns
• Unbiased gradient estimate
$\frac{1}{N} \sum_{\tau \sim P_{\pi,T}} R(\tau)\, \nabla \log P_{\pi,T}(\tau)$
On- vs off-policy
• REINFORCE is on-policy
• Trajectories (rollouts)
need to come from
current policy
• No reuse of
trajectories between
gradient update
$\frac{1}{N} \sum_{\tau \sim P_{\pi,T}} R(\tau)\, \nabla \log P_{\pi,T}(\tau)$
Off-policy
• Importance sampling
• Many variants
$\frac{1}{N} \sum_{\tau \sim Q} \frac{P_{\pi,T}(\tau)}{Q(\tau)}\, R(\tau)\, \nabla \log P_{\pi,T}(\tau)$
Policy gradient algorithm
• For i iterations
• Collect rollouts
• Add to replay buffer
• Update baseline network
• For j batches
• Estimate the sample
gradient on replay buffer
• Take a gradient step
Policy gradient
• REINFORCE with many
tricks
• Not very sample
efficient
• Gradient estimate by
sampling from an
exponential trajectory
space
$\frac{1}{N} \sum_{\tau \sim P_{\pi,T}} R(\tau)\, \nabla \log P_{\pi,T}(\tau)$
Why do we need a gradient?
parameters $\theta$ → policy $\pi$ → environment $T$ → reward / return $\mathbb{E}_{\tau \sim P}[R(\tau)]$
getting good gradients is hard!
Why do we need a gradient?
• is the forward pass hard?
parameters $\theta$ → policy $\pi$ → environment $T$ → reward / return $\mathbb{E}_{\tau \sim P}[R(\tau)]$
What if we had an oracle?
• Given policies $\pi_A$ and $\pi_B$
• which one is better?
• rollout and compute return
Gradient Free Optimization
• maximize $f(\theta)$ w.r.t. $\theta$
• can only evaluate the function value $f(\theta)$
• no gradients $\nabla_\theta f(\theta)$
• $f$ is smooth
• similar $\theta$ produce similar $f(\theta)$
[Figure: black box $f(\theta)$]
Random Search
• Randomly sample $\theta$
• pick the highest $f(\theta)$
[Figure: samples in parameter space]
Iterative Random Search
• Randomly sample $\theta$
• pick the highest $f(\theta)$
• Sample more points around the maxima
• repeat
[Figure: samples in parameter space]
Cross entropy method
• Initialize $\mu, \sigma$
• sample $\theta \sim \mathcal{N}(\mu, \sigma^2)$
• compute reward $f(\theta)$
• select top $p\%$ ($p = 20$)
• fit a Gaussian for the new $\mu, \sigma$
• repeat
[Figure: samples in parameter space]
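A minimal cross-entropy method sketch in NumPy following the steps above; the black-box objective `f` and all hyper-parameters are illustrative assumptions.

```python
# Minimal cross-entropy method (CEM) sketch.
import numpy as np

def cem(f, dim, iterations=50, samples=100, top_p=0.2):
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(iterations):
        thetas = mu + sigma * np.random.randn(samples, dim)            # theta ~ N(mu, sigma^2)
        returns = np.array([f(t) for t in thetas])
        elite = thetas[np.argsort(returns)[-int(top_p * samples):]]    # keep the top p%
        mu, sigma = elite.mean(axis=0), elite.std(axis=0)              # refit the Gaussian
    return mu

best = cem(lambda t: -np.sum((t - 3.0) ** 2), dim=5)                   # toy objective
```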
Evolution Strategies
• Initialize $\theta$
• Iterate
• Sample $\epsilon_0, \epsilon_1, \ldots, \epsilon_n \sim \mathcal{N}(0, I)$
• Compute returns $F_i = R(\theta + \sigma \epsilon_i)$
• Normalize $\tilde{F}_i = \frac{F_i - \mu_F}{\sigma_F}$
• Update $\theta := \theta + \frac{\alpha}{\sigma n} \sum_{i=1}^{n} \tilde{F}_i\, \epsilon_i$
[Figure: samples in parameter space]
Evolution strategies as a scalable alternative to reinforcement learning, Salimans et al., arXiv 2017
Augmented random search
• Initialize $\theta$
• Iterate
• Sample $\epsilon_0, \epsilon_1, \ldots, \epsilon_n \sim \mathcal{N}(0, I)$
• Compute returns $F_i^+ = R(\theta + \sigma \epsilon_i)$, $F_i^- = R(\theta - \sigma \epsilon_i)$
• Update $\theta := \theta + \frac{\alpha}{\sigma n} \sum_{i=1}^{n} (F_i^+ - F_i^-)\, \epsilon_i$
[Figure: samples in parameter space]
• Simple random search provides a competitive approach to reinforcement learning, Mania et al., NIPS 2018
• Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation, Spall, Automatic Control, 1992
Gradient free optimization
• Sample complexity exponential in the size of the parameter space
• works better if
• the parameter space is small
• parameters correlate with the expected return
[Figure: black box $f(\theta)$]
Vision and action
• In humans and animals
• vision developed as a side
product for action
• no explicit supervision for
vision
• emerges from model
structure (connections)
Image source: https://en.wikipedia.org/wiki/
Cambrian_explosion#/media/
File:Opabinia_BW2.jpg
Vision and action
• In computer vision
• lots of data and labels
• explicit supervision
• emerges from data
• Classical robotics
• Planning after computer vision
[Figure: a labeled dataset — example 0 … example 999: (image, "Zebra") — feeding a deep network]
Open problem
• Why this disconnect?
Hypothesis 1 – Too narrow
tasks
• On single (narrow) task
• Data and labels
always win
• On multiple tasks
• Generalization
between tasks creates
visual representation
[Figure: multiple tasks — hunt, escape, search, move]
Hypothesis 2 – Wrong models
and algorithms
• Backprop + SGD biased
• Doesn’t work on all
tasks and
architectures equally
well
Hypothesis 3 – No evolution
• Insufficient optimization
of models in outer loop
• Meta-learning can find
visual representations
without much
supervision
• supervision: acting well / survival
Implications
• If any of the above hypotheses are right
• We are wasting time
with labeled data
Hypothesis 4 – Value of
labels
• Our current approach is
fine
• labeled data provides
abstract
representation without
need for evolution and
massive optimization
Summary
• Learning to act using
deep networks
• Option 1: Imitate
expert
• Option 2: Policy
gradient
• Option 3: Gradient
free
Which one should I choose?
• Is it easy for a human to perform the task?
• Imitation
• Do we want to do better than humans?
• Gradient free
• Can I use existing code?
• Policy gradient

 

Looking at your data
• E.g. images
• Random ones
• Smallest / largest file
size
• Try solving the task
manually
[Figures: random images; images with the largest file size; images with the smallest file size; solving the task manually]
Dataset
• Training set
• Learn model parameters
• Validation set
• Learn hyper-parameters
• Test set
• Measure generalization
performance
Why split the data?
• Overfitting
• Goal: Learn a model
that works well in the
real world
• Optimization objective:
Learn a model that
works well in training
data
Training set
• Used to train all
parameters of the
model
• Model will work very
well on training set
• Size: 60-80% of data
Validation set
• Used to determine how
well the model works
• Used to tune model and
hyper-parameters
• Size: 10-20% of data
Testing set
• Used to measure
performance of model
on unseen data
• Used exactly once
• Size: 10-20% of data
How to split the data?
• Random sampling
without replacement
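A minimal sketch of a random split without replacement using `torch.utils.data.random_split`; `dataset` is assumed to be any torch `Dataset`, and the 60/20/20 proportions follow the rough sizes quoted above.

```python
# Minimal random train/valid/test split sketch (sampling without replacement).
import torch
from torch.utils.data import random_split

n = len(dataset)                                # `dataset`: any torch Dataset (assumption)
n_train, n_valid = int(0.6 * n), int(0.2 * n)
train_set, valid_set, test_set = random_split(
    dataset, [n_train, n_valid, n - n_train - n_valid],
    generator=torch.Generator().manual_seed(0)  # fixed seed so the split is reproducible
)
```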
Distribution of data
[Figure: how $D_{data}$ is split into $D_{train}$, $D_{valid}$, $D_{test}$ in low dimensions vs. high dimensions]
Graduate student descent
Look at your
data / model output
Design and
train your model
Evaluate
your model on
validation set
automated
semiautomated manual
Network initialization
• What do we set the
initial parameters to?
Linear
ReLU
Linear
data
Idea 1: All zero
Forward pass:
$z_1 = W_1 x$
$z_2 = \max(z_1, 0)$
$o = W_3 z_2$
Backward pass:
$v_3 = \frac{\partial \ell(o)}{\partial o}$
$v_2 = W_3^\top v_3$
$v_1 = v_2[z_1 > 0]$
$v_0 = W_1^\top v_1$
$\frac{\partial \ell(o)}{\partial W_3} = v_3 z_2^\top$
$\frac{\partial \ell(o)}{\partial W_1} = v_1 x^\top$
Idea 1: All zero
• Does not work
• No gradient
• Saddle point
Idea 2: constant
(same forward and backward pass as above)
Idea 2: constant
• Does not break
symmetries
Solution
• Random initialization $W_i \sim \mathcal{N}(\mu_i, \sigma_i^2 I)$
Random initialization
• Initialize weights
• Normal distribution: $W_1 \sim \mathcal{N}(\mu_1, \sigma_1^2 I)$, $W_3 \sim \mathcal{N}(\mu_3, \sigma_3^2 I)$
• Uniform distribution
• What should $\mu_i$ and $\sigma_i$ be?
• For simplicity $\mu_i = 0$ and bias $= 0$
Scaling matters
$o = W_2 W_1 x$
$\frac{\partial \ell(o)}{\partial W_1} = \left(W_2^\top \frac{\partial \ell(o)}{\partial o}\right) x^\top$
$\frac{\partial \ell(o)}{\partial W_2} = \frac{\partial \ell(o)}{\partial o} (W_1 x)^\top$
How do we scale the
initialization?
• By hand
• A lot of tuning
• Automatically
• A lot of math
Xavier and Kaiming
initialization
• Strategy to set
variance of Normal
initialization
• All activations are of
similar scale
[Figure: Linear → ReLU → Linear with $W_1 \sim \mathcal{N}(\mu_1, \sigma_1^2 I)$ and $W_3 \sim \mathcal{N}(\mu_3, \sigma_3^2 I)$ — what should $\sigma^2$ be?]
Random matrix multiplication
$a^\top x \sim \mathcal{N}\!\left(\mu_a \textstyle\sum_i x_i,\; \|x\|^2 \sigma_a^2\right)$ for $a \sim \mathcal{N}(\mu_a, \sigma_a^2 I)$ (for a derivation see: https://en.m.wikipedia.org/wiki/Multivariate_normal_distribution)
Random matrix multiplication
$z_i = W_{i-1} z_{i-1} \sim \mathcal{N}\!\left(0,\; \|z_{i-1}\|^2 \sigma_{W_{i-1}}^2 I\right)$ for $W_{i-1} \sim \mathcal{N}(0, \sigma_{W_{i-1}}^2 I)$
Random ReLU
$z_{i+1} = \max(z_i, 0)$ for $z_i \sim \mathcal{N}(0, \sigma_i^2 I)$
$\mathbb{E}[\|z_{i+1}\|^2] = \tfrac{1}{2}\, n_{z_i} \sigma_i^2$
Putting things together
$z_i \sim \mathcal{N}(0, \sigma_i^2 I)$
$z_{i+2} \sim \mathcal{N}\!\left(0,\; \|z_{i+1}\|^2 \sigma_{W_{i+1}}^2 I\right)$ with $\|z_{i+1}\|^2 \approx \mathbb{E}[\|z_{i+1}\|^2] = \tfrac{1}{2}\, n_{z_i} \sigma_i^2$
$\sigma_{i+2} = \sqrt{\tfrac{1}{2}\, n_{z_i}}\; \sigma_{W_{i+1}} \sigma_i$
$\sigma_i = \left[\prod_{k=0}^{(i-1)/2} \tfrac{1}{2}\, \sigma_{W_{2k+1}}^2 n_{z_{2k}}\right]^{1/2} \sigma_x$
Randomly initialized network
$\sigma_i = \left[\prod_{k=0}^{(i-1)/2} \tfrac{1}{2}\, \sigma_{W_{2k+1}}^2 n_{z_{2k}}\right]^{1/2} \sigma_x$
[Figure: a deep stack of Linear → ReLU blocks from input $x$ to output $o$]
Variance of back-propagation
$\hat{\sigma}_i = \left[\prod_{k=i/2}^{(N-1)/2} \tfrac{1}{2}\, \sigma_{W_{2k+1}}^2 n_{z_{2k+2}}\right]^{1/2} \hat{\sigma}_N$
[Figure: the same stack viewed backwards from the gradient $\frac{\partial \ell(o)}{\partial o}$]
Xavier initialization
• Try to keep both the activation and the gradient magnitudes constant
$\sigma_W^2 = \frac{2}{n_{z_i} + n_{z_{i+1}}}$
Kaiming initialization
• Try to keep either the activation or the gradient magnitude constant
• $\sigma_W^2 = \frac{2}{n_{z_i}}$ or $\sigma_W^2 = \frac{2}{n_{z_{i+1}}}$
Initialization in practice
• Xavier (default) is often
good enough
• Initialize last layer to
zero
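A minimal sketch of setting the initialization by hand in PyTorch and zeroing the last layer, as suggested above; the layer sizes are illustrative.

```python
# Minimal initialization sketch.
import torch.nn as nn

net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
nn.init.xavier_normal_(net[0].weight)   # Xavier: variance 2 / (n_in + n_out)
nn.init.zeros_(net[0].bias)
nn.init.zeros_(net[2].weight)           # initialize the last layer to zero, as suggested above
nn.init.zeros_(net[2].bias)
# For ReLU networks, Kaiming is the common alternative:
# nn.init.kaiming_normal_(net[0].weight, nonlinearity='relu')
```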
Optimization
• Stochastic Gradient
Descent
• Convergence speed
• Training accuracy
• Generalization
performance
[Figure: training and validation accuracy vs. iterations]
Input normalization
• Input: $x_i$
• Apply an affine transformation $\hat{x}_i = \alpha x_i + \beta$
Gradients of uncentered inputs: A simple example
• Input vector $x \in \mathbb{R}^c$
• Output scalar $o \in \mathbb{R}$
$\frac{\partial \ell(o)}{\partial w} = v x^\top$ with $v \in \mathbb{R}$
Mean subtraction
• Input: $x_i$
• Apply an affine transformation $\hat{x}_i = x_i - \mu_x$
Gradients of unnormalized inputs: A simple example
• Input vector $x \in \mathbb{R}^2$ with $|x[0]| \ll |x[1]|$
• Output scalar $o \in \mathbb{R}$
$\frac{\partial \ell(o)}{\partial w} = v x^\top$ with $v \in \mathbb{R}$
Input normalization
• Input: $x_i$
• Apply an affine transformation $\hat{x}_i = (x_i - \mu_x)/\sigma_x$
Exploding gradients
• Weight of one layer
grows
• Gradient of all other
layers grow
• Weights of other
layers grow
• Bad feedback loop
Exploding gradients
• Not a big issue for most networks
• An issue for recurrent
networks, and networks
that share parameters
across layers
Detecting exploding
gradients
• Plot loss
• Plot weight and
gradient magnitude per
layer
Vanishing gradients
• Weight of one layer
shrink
• Gradient of all other
layers shrink
• Weights of other
layers stay constant
• No progress
Vanishing gradient
• Big issue for larger
networks
• Issue for recurrent
networks and weights
tied across layers
Detecting vanishing gradients
• Plot loss
• Plot weight and
gradient magnitude per
layer
Normalization
• How to prevent
vanishing (or exploding)
gradients?
Batch normalization
• Make activations zero
mean and unit variance
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. In ICML, 2015
ReLU
Conv
BN
Batch normalization
• Normalize by channel-wise mean and standard deviation
$Z \in \mathbb{R}^{B \times W \times H \times C}$
$\hat{Z}_{k,x,y,c} = \frac{Z_{k,x,y,c} - \mu_c}{\sigma_c}$
$\mu_c = \frac{1}{BWH} \sum_{k,x,y} Z_{k,x,y,c} \qquad \sigma_c^2 = \frac{1}{BWH} \sum_{k,x,y} (Z_{k,x,y,c} - \mu_c)^2$
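A hand-written sketch of the statistics above, using PyTorch's channel-first (B, C, H, W) layout rather than the B×W×H×C notation of the slide; the shapes are illustrative.

```python
# Minimal batch-norm statistics sketch.
import torch

Z = torch.randn(16, 32, 8, 8)                           # (B, C, H, W)
mu = Z.mean(dim=(0, 2, 3), keepdim=True)                # per-channel mean over batch and space
sigma = Z.std(dim=(0, 2, 3), keepdim=True, unbiased=False)
Z_hat = (Z - mu) / (sigma + 1e-5)

# The built-in layer also keeps running statistics and a learned scale/bias:
bn = torch.nn.BatchNorm2d(32)
```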
What does batch
normalization do?
[Figure: which of the dimensions B, W, H, C the statistics are pooled over]
What does batch
normalization do?
• The good:
• Regularizes the network
• Handles badly scaled
weights
• The bad:
• Mixes gradient
information between
samples
Batch norm and batch size
• Large batch sizes work
better
• More stable mean and
standard deviation
estimates
Batch norm at test time
• Compute mean and
standard deviation on
training set using
running average
$Z \in \mathbb{R}^{1 \times W \times H \times C}$, normalized as $\frac{Z_{k,x,y,c} - \mu_c}{\sigma_c}$
Layer normalization
• Make activations zero
mean and unit variance
without collecting
statistics across batches
Layer Normalization, Ba, J., Kiros, J. R. and Hinton, G., arXiv preprint arXiv:
1607.06450, 2016
ReLU
Conv
LN
Layer normalization
• Normalize by image-wise mean and standard deviation
$Z \in \mathbb{R}^{B \times W \times H \times C}$
$\hat{Z}_{k,x,y,c} = \frac{Z_{k,x,y,c} - \mu_k}{\sigma_k}$
$\mu_k = \frac{1}{WHC} \sum_{x,y,c} Z_{k,x,y,c} \qquad \sigma_k^2 = \frac{1}{WHC} \sum_{x,y,c} (Z_{k,x,y,c} - \mu_k)^2$
What does layer
normalization do?
Comparison to batch norm
• No summary statistics
• Training and testing
are the same
• Works well for
sequence models
• Does not scale
activations individually
batch norm
layer norm
Instance normalization
• Batch norm per input
Ulyanov, Dmitry, Andrea Vedaldi, and Victor Lempitsky. “Instance normalization:
The missing ingredient for fast stylization.” arXiv 2016.
ReLU
Conv
IN
Instance normalization
• Normalize by spatial mean and standard deviation
$Z \in \mathbb{R}^{B \times W \times H \times C}$
$\hat{Z}_{k,x,y,c} = \frac{Z_{k,x,y,c} - \mu_{kc}}{\sigma_{kc}}$
$\mu_{kc} = \frac{1}{WH} \sum_{x,y} Z_{k,x,y,c} \qquad \sigma_{kc}^2 = \frac{1}{WH} \sum_{x,y} (Z_{k,x,y,c} - \mu_{kc})^2$
What does instance
normalization do?
Comparison to batch norm
• No summing over
batches
• Works well for graphics
applications
• Not used much in
recognition
• Unstable statistics
batch norm
layer norm
instance norm
Group normalization
• Normalize groups of G channels together
Yuxin Wu and Kaiming He. "Group normalization." ECCV 2018.
$Z \in \mathbb{R}^{B \times W \times H \times C}$, group index $g = \lfloor c / G \rfloor$
$\hat{Z}_{k,x,y,c} = \frac{Z_{k,x,y,c} - \mu_{kg}}{\sigma_{kg}}$
$\mu_{kg} = \frac{1}{WHG} \sum_{c'=gG}^{(g+1)G-1} \sum_{x,y} Z_{k,x,y,c'} \qquad \sigma_{kg}^2 = \frac{1}{WHG} \sum_{c'=gG}^{(g+1)G-1} \sum_{x,y} (Z_{k,x,y,c'} - \mu_{kg})^2$
What does group
normalization do?
Comparison to other norms
• More stable statistics
than instance norm
• G=C
• Not all channels tied as
in layer norm
• G=1
batch norm
layer norm
instance norm
group norm
Local response normalization
• "Generalization" of group norm
• Parameters $\alpha$ and $\beta$
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012
$Z \in \mathbb{R}^{B \times W \times H \times C}$
$\hat{Z}_{k,x,y,c} = Z_{k,x,y,c} \left(\gamma + \frac{\alpha}{n} \sum_{c' = c - n/2}^{c + n/2} Z_{k,x,y,c'}^2\right)^{-\beta}$
Differences between LRN
and GN
• Group norm
• Normalize over all spatial
locations
• Subtract mean
• Scale and bias
transformation
• Local response normalization
• More flexible
parametrization
Where to add normalization?
• Option A
• After convolution
• Option B
• After ReLU (nonlinearity)
ReLU
Conv
BN
ReLU
Conv
BN
Option A Option B
Option A
• No bias in the conv
• Activations are zero mean
• Half of the activations are zeroed out in the ReLU
• Solution:
• Learn a scale and bias after the norm: $\frac{Z_{k,x,y,c} - \mu}{\sigma}\, s_c + b_c$
Option B
• Scale and bias optional: $\frac{Z_{k,x,y,c} - \mu}{\sigma}\, s_c + b_c$
Where to add normalization?
• Both work
• Option A is more popular
• Option B is easier
• Scale and bias
optional
• Conv unchanged
ReLU
Conv
BN
ReLU
Conv
BN
Option A Option B
Where not to add batch
norm?
• After fully connected
layers
• Mean and standard
deviation estimates
too unstable
Why does normalization work?
• Regularizes the network
• Handles badly scaled weights
• Single parameter to learn the scale $s_c$: $\frac{Z_{k,x,y,c} - \mu}{\sigma}\, s_c + b_c$
Deep networks
• Without normalization
• Max depth 10-12
• With normalization
• Max depth 20-30
ReLU
Conv
ReLU
Conv

What happens to deeper
networks?
• It does not train well
[Figure source: Kaiming He et al., “Deep Residual
Learning for Image Recognition”, CVPR 2016]
What happens to deeper
networks?
• Training a shallower
network and adding
identity layers works
better
ReLU
Conv
ReLU
Conv
ReLU
Conv
ReLU
Conv
ReLU
Conv
I
Solution: Residual connections
• Parametrize layers as $f(x) = x + g(x)$
[Figure: a Conv → ReLU → Conv block $g(x)$ with a skip connection adding $x$ to its output]
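A minimal residual block sketch in PyTorch implementing $f(x) = x + g(x)$; the channel count and kernel sizes are illustrative.

```python
# Minimal residual block sketch.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.g = nn.Sequential(                          # g(x): Conv -> ReLU -> Conv
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.g(x)                             # skip connection
```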
Fun fact
• Backward graph is
symmetric ReLU
Conv
+
+
ReLU (back)
Conv (back)
Residual Networks
[Figure source: Kaiming He et al., “Deep Residual Learning for Image Recognition”,
CVPR 2016]
How well do residual
connections work?
• Can train networks of
up to 1000 layers
ReLU
Conv
Conv
Linear
Pool
BN
x998
Why do residual connections work? – Practical answer
• Gradient travels further without major modifications (vanishing)
• Reuse of patterns
• Only update patterns
• Dropping some layers does not even hurt performance
• Weights $\to 0$
• Model the identity
[Gao Huang et al., "Deep Networks with Stochastic Depth", ECCV 2016]

Why do residual connection
work? – Theoretical answer
• Without ReLU
• Invertible functions
• Very wide
• SGD find global optimum
ReLU
Conv
+
[Simon S. Du, et al., “Gradient Descent Finds Global Minima of Deep Neural Networks”, ICML 2019]
[Moritz Hardt and Tengyu Ma, “Identity matters in deep learning”, ICLR 2017]
Residual connections –
Summary
• Used in most modern
networks
• Allow for much deeper
networks
Stochastic Gradient Descent with Momentum
• Default optimizer
• Works well in most cases
• Tune the learning rate
for n epochs
  for batches $B_i$
    $g := \mathbb{E}_{x,y \sim B_i}\!\left[\frac{\partial \ell(x, y | \theta)}{\partial \theta}\right]$
    $v := \rho v + g$
    $\theta := \theta - \epsilon v$
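A sketch of the momentum update above written out by hand; in practice this is what `torch.optim.SGD` with a `momentum` argument does.

```python
# Minimal momentum update sketch (params, grads, velocity: lists of tensors).
import torch

def sgd_momentum_step(params, grads, velocity, rho=0.9, eps=0.01):
    for p, g, v in zip(params, grads, velocity):
        v.mul_(rho).add_(g)           # v := rho * v + g
        p.data.add_(v, alpha=-eps)    # theta := theta - eps * v

# In practice: torch.optim.SGD(model.parameters(), lr=eps, momentum=rho)
```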
RMSProp
• Very specialized
• Auto-tunes the learning rate
• Momentum optional
• Doesn't play nice with momentum
• Works well on some reinforcement learning problems
$m := 0, \; v := 0$
for n epochs
  for batches $B_i$
    $g := \mathbb{E}_{x,y \sim B_i}\!\left[\frac{\partial \ell(x, y | \theta)}{\partial \theta}\right]$
    $m := \alpha m + (1 - \alpha) g^2$
    $v := \rho v + \frac{g}{\sqrt{m} + \varepsilon}$
    $\theta := \theta - \epsilon v$
Tijmen Tieleman and Geoffrey Hinton. "Lecture 6.5 – RMSProp: Divide the gradient by a running average of its recent magnitude." Neural Networks for Machine Learning 4.2, 2012
ADAM
• Less learning rate tuning
• Works well on small networks and problems
• Trains well, generalizes worse
• Mathematically not correct
$v := 0, \; m := 0$
for n epochs
  for batches $B_i$
    $g := \mathbb{E}_{x,y \sim B_i}\!\left[\frac{\partial \ell(x, y | \theta)}{\partial \theta}\right]$
    $v := \beta_1 v + (1 - \beta_1) g$
    $m := \beta_2 m + (1 - \beta_2) g^2$
    $s := \epsilon\, \frac{\sqrt{1 - \beta_2^{\mathrm{step}}}}{1 - \beta_1^{\mathrm{step}}}$
    $\theta := \theta - s\, \frac{v}{\sqrt{m} + \varepsilon}$
Kingma, D. P., & Ba, J. L. Adam: a Method for Stochastic Optimization. ICLR 2015
What optimizer to use?
• Large models and data
• SGD with momentum
• Small models and data
• ADAM
Optimization algorithms
• Hyper parameters
• Learning rate
• Momentum
• Batch size
What learning rate to use?
• Rule of thumb: Largest
LR that trains
• Train for a few
epochs and measure
validation accuracy
Learning rate vs batch size
• Linear Scaling Rule: When the minibatch size is multiplied by $k$, multiply the learning rate by $k$.
Priya Goyal et al., "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour", arXiv 2017
Learning rate schedules
• Step schedule
• Linear schedule
• Cosine schedule
• Cyclical schedules
[Figure: learning rate over training for each schedule]
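A minimal sketch of a step schedule using PyTorch's built-in schedulers; `model`, the milestones, and the `train_one_epoch` helper are hypothetical placeholders.

```python
# Minimal learning-rate schedule sketch.
import torch

opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)  # drop LR 10x every 30 epochs
# Cosine alternative: torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)

for epoch in range(100):
    train_one_epoch(model, opt)   # hypothetical training loop
    sched.step()                  # advance the schedule once per epoch
```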
Summary
• Use close to the largest
LR that trains
• Step schedule
How do we train a small
network?
• Idea 1:
• Randomly initialize
and train network
• Idea 2:
• Train a larger network
and make it small
Network distillation
• Train an ensemble of large
networks
• Train a small network to mimic its output (with cross entropy)
• Important: Reduce
confidence of ensemble
prediction (soft targets)
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural
network.” arXiv 2015.
Why does distillation work?
• Dark knowledge
• Networks learn about
(visual) relationships
of classes
• Boost training signal
Network pruning /
factorization
• Train a wide network (many
channels)
• Remove channels/weights
that are used the least
• 90% of parameters can be
removed after training
• Training the small network is
challenging
H Li, A Kadav, I Durdanovic, H Samet, HP Graf, “Pruning Filters for Efficient ConvNets”,
ICLR 2017
S Han, J Pool, J Tran, W Dally, “Learning both Weights and Connections for Efficient Neural
Network” NIPS 2015
Possible explanation: Lottery
ticket hypothesis
• Not all initializations are created equal
• Train network
A randomly-initialized, dense
neural network contains a
subnetwork that is initialized
such that—when trained in
isolation—it can match the test
accuracy of the original network
after training for at most the
same number of iterations.
Jonathan Frankle, Michael Carbin, “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural
Networks”, ICLR 2019
Lottery ticket hypothesis
• Very nice idea
• Likely not the full story
Zhuang Liu et al., “Rethinking the Value of Network Pruning”, ICLR 2019
Overfitting
• Fit model to training set
• Model does not work
on unseen data (e.g.
validation set)
[Figures: loss and accuracy vs. iterations for training and validation]
How to detect overfitting?
• Plot training and
validation accuracy
Is overfitting always bad?
Why do we overfit?
• Sampling bias
• Optimization fits
patterns that only
exist in training set
Do we overfit with infinite
training data?
How do we prevent
overfitting?
• Collect more data
• Make the model simpler
• Regularize
• Transfer learning
Training and overfitting
[Figure: training vs. validation accuracy over iterations]
Early stopping in practice
• No need for stop button
• Measure validation
accuracy periodically
• Save your model
periodically
Signs of overfitting
• Does not capture
invariances in data
[Figure: three copies of the network predicting dog, cat, cat]
How to capture invariances?
• Build them into the
model
• Convolutions
• All-convolutional
models
• Build them into the data
• Data augmentation
Conv
Conv

ReLU
Pool
Linear
Data augmentation
• Capture invariances in
data
• (Randomly) transform
data during training
• Reuse a label
( , dog)
( , dog)
( , dog)
( , dog)
Image augmentations
flip
shift
rotate
saturation
brightness
scale tint/hue
Training with data
augmentation
• (Randomly) augment
every single iteration
• Network never sees
exact same data twice
[Figure: an augmented image labeled "Dog" flows through Conv → … → Linear into the loss]
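A minimal torchvision augmentation pipeline as a sketch; the particular transforms and parameters are illustrative, not the ones used in the lecture.

```python
# Minimal image-augmentation sketch applied every training iteration.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),                 # random scale + shift
    transforms.ColorJitter(brightness=0.2, saturation=0.2, hue=0.05),
    transforms.ToTensor(),
])
# Applied inside the Dataset / DataLoader: every epoch sees a different version
# of each training image while the label is reused unchanged.
```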
Unsupervised data
augmentation
• Captures invariances on
unseen and unlabeled
data
Augmentation
Conv
Conv

Linear
Conv
Conv

Consistency
Linear
Xie, Dai, Hovy, Luong, Le,
“Unsupervised Data Augmentation”,
arXiv 2019
Data augmentation
• Always use data
augmentation if possible
• Some augmentations
require augmentation of
labels
• e.g. for dense
prediction tasks
Overfitting in deep networks
• Overfitting
• Exploit patterns that
exist in training data,
but not in the
validation / test data
• Not all activations
overfit
Linear
Linear
Linear
ReLU
ReLU
Activation 1
Activation 2
Overfitting in deep networks
• Deeper layers overfit
more
• Rely on overfit
activations from
previous layers
Linear
Linear
Linear
ReLU
ReLU
Activation 1
Activation 2
Preventing overfitting in
deep networks
• Reduce reliance on specific activations in the previous layer
• Randomly remove
activations
Linear
Linear
Linear
ReLU
ReLU
Activation 1
Activation 2
Dropout
• During training
• With probability $\alpha$ set activation $a_l^{(i)}$ to zero
• During evaluation
• Use all activations, but scale by $1 - \alpha$
Dropout in practice
• A separate layer: torch.nn.Dropout
• During training
• With probability $\alpha$ set activation $a_l^{(i)}$ to zero
• Scale activations by $\frac{1}{1 - \alpha}$
• During evaluation: identity
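A minimal sketch of where `torch.nn.Dropout` typically goes (before large fully connected layers); the layer dimensions are illustrative. The layer handles the training-time scaling and becomes the identity in `eval()` mode.

```python
# Minimal dropout placement sketch.
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Dropout(p=0.5),        # p is the drop probability (alpha above)
    nn.Linear(4096, 512),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(512, 10),
)
```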
Where to add dropout?
• Before any large fully
connected layer
• Before some 1×1
convolutions
• Not before general
convolutions
Conv 1×1
ReLU
Simpler models
• Traditional wisdom
• Simpler model = less
overfitting
Softmax
Conv 7×7
ReLU
Conv 3×3
ReLU
Linear
Avg Pool
(H, W, 64)
(H, W, 128)
(H, W, 3)
… …
Idea 1: Smaller model
• Overfits less
• Fits less
• Worse generalization
(H, W, 32)
(H, W, 64)
(H, W, 3)
… …
Softmax
Conv 7×7
ReLU
Conv 3×3
ReLU
Linear
Avg Pool
Idea 2: Big model with
regularization
• Weight decay
• Keep weights small
(L2 norm)
• Works sometimes
• Keep weight at same
magnitude
(H, W, 64)
(H, W, 128)
(H, W, 3)
… …
Softmax
Conv 7×7
ReLU
Conv 3×3
ReLU
Linear
Avg Pool
How to use weight decay?
• Parameter in optimizer,
e.g. torch.optim.SGD or
torch.optim.Adam
• weight_decay
• Use 1e-4 as default
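A minimal sketch of enabling weight decay through the optimizer; the model and learning rate are illustrative.

```python
# Minimal weight-decay sketch.
import torch

opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
# or: torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```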
Other reasons to use weight
decay
• Network weights cannot
grow infinitely large
• Helps handle
exploding gradients
(H, W, 64)
(H, W, 128)
(H, W, 3)
… …
Softmax
Conv 7×7
ReLU
Conv 3×3
ReLU
Linear
Avg Pool
Ensembles
• Train multiple models
• Average predictions of
multiple models
[Figure: three independently trained networks, each Conv → … → Linear → Softmax]
Ensembles
• Pre-deep learning
• Use different subsets of
training data
• Deep learning
• Use different random
initializations / data
augmentation
• Different local minima
Why do ensembles work?
• Fewer parameters /
model
• Each model overfits in
its own way
• Usually a 1-3% accuracy
boost on most tasks
• longer training
Why do we average
predictions?
• For a convex loss
function
• loss of average
prediction < average
loss of individual
models
When to use ensembles?
• If you have the
compute power
• If you really need the
last bit of accuracy
• e.g. production,
competitions
Training on small datasets
• How to train
• a large model
• on a small dataset?
Softmax
Conv
ReLU
Linear

Solution: Pre-training / fine-tuning
• Train a model on a large dataset (pre-training)
• on a related task
• Copy the model
• Continue training on the small dataset (fine-tuning)
[Figure: the pre-trained network is copied and fine-tuned on the new task]
Pre-training
• Computer vision
• Supervised
• e.g. ImageNet
• Natural Language
Processing
• Self-supervised
• Unlabeled text
Pre-training / fine-tuning in
practice
• Download a pre-trained
model
• Model-zoo
• Directly from the author
• Run a few training iterations on the small dataset
[Figure: the downloaded pre-trained network]
Why does transfer learning
work?
• Similar inputs
• e.g. Images, Text, …
• Transfer between
tasks
• Good initialization
• Learned weights are
tuned well Softmax
Conv
ReLU
Linear

Softmax
Conv
ReLU
Linear

When to use transfer
learning?
• Whenever possible
• In early experiments
• Large pre-trained
model exists
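A minimal fine-tuning sketch along the lines above, using a torchvision model-zoo network; the number of output classes and the learning rate are illustrative assumptions.

```python
# Minimal pre-training / fine-tuning sketch.
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)   # download pre-trained weights
                                                        # (newer torchvision uses weights=...)
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # replace the head for the new task
opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# Then run a few training iterations on the small dataset as usual.
```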
Generalization in deep
learning
• Standard wisdom
• Bigger/wider models
overfit more
Softmax
Conv
ReLU
Linear

Conv
ReLU
Deep networks are big enough
to remember all training data
• Deep networks easily fit
random labels
• Memorize all data
• Works even for
random noise inputs
Understanding deep learning requires rethinking generalization, Zhang et al., 2017
[Figure: wide ResNet on CIFAR-10 — training accuracy vs. epochs for correct labels and random labels]
Why does SGD still work?
• SGD gradually minimizes
objective
• Prefers solutions close
to initialization
• Implicitly regularizes
• Random labels take SGD
on a longer path
Exploring generalization in Deep Learning, Neyshabur et al., 2017
Larger networks overfit less
• Without data
augmentation
• 100% training accuracy
• Larger models
generalize better
• Hence overfit less
Understanding deep learning requires rethinking generalization, Zhang et al., 2017
[Figure: wide ResNet on CIFAR-10 — training and validation accuracy vs. epochs for widths 16, 32, 48]
Larger networks overfit less
• All models overfit
massively on loss (log
likelihood)
On Calibration of Modern Neural Networks, Guo et al., 2017
[Figure: wide ResNet on CIFAR-10 — training and validation loss vs. epochs for widths 16, 32, 48]
Larger networks overfit less
• Do we need a new
learning theory?
• Do we need new
intuitions?
In summary
• Models can overfit, but do not with
SGD and data augmentation
• Implicit regularization
• How do we make it explicit?
• Overfitting is dependent on learning
algorithms (e.g. Adam overfits more)
• How can we measure overfitting?
Graduate student descent
Look at your
data / model output
Design and
train your model
Evaluate
your model on
validation set
automated
semiautomated
manual
Evaluation on validation set
• Run during training
• Every epoch or n
iterations
• Log in TensorBoard
Look at your
data / model output
Design and
train your
model
Evaluate
your model
on
validation set
Look at your data / model
output
• Run during training
• Every epoch or n
iterations
• Log in TensorBoard
• Select same training
and validation images
Look at your
data / model output
Design and
train your
model
Evaluate
your model on
validation set
Design and train your model
• Mostly manual work
Look at your
data / model output
Design and
train your
model
Evaluate
your model on
validation set
Design and train your model
• Network does not train
• Vanishing or exploding
gradients?
• Fix initialization and learning
rate
• Slow training
• Add normalization
• Residual connections
• Iterate until model trains
[Figures: loss vs. iterations for three training runs]
Design and train your model
• Network overfits to training
data
• Add data augmentation
• Early stopping
• Try a pre-trained network
• Collect more data
• Iterate until model
generalizes well
Design and train your model
• Network fits training
and validation data well
• Stop graduate student
descent
• Take a break
• Evaluate on test set
Look at your
data / model output
Design and
train your
model
Evaluate
your model on
validation set

 

 

High dimensional inputs
Linear
Activation
Linear
o
Activation
Linear
128
128
Images and structure
• Fully connected
networks are not shift
invariant
Finding shift-invariant
patterns
Convolutions
• “Sliding” linear
transformation
a b c
d e f
g h i
* =
Examples of convolutions
Original | Vertical edges | Horizontal edges | Laplace filter
Vertical edges:
-1 0 1
-1 0 1
-1 0 1
Horizontal edges:
-1 -1 -1
 0  0  0
 1  1  1
Laplace filter:
-1 -1 -1
-1  8 -1
-1 -1 -1
Convolutions on multiple channels
[Figure: the same filters applied to a multi-channel input]
Formal definition
• Input: $X \in \mathbb{R}^{H \times W \times C_1}$
• Kernel: $w \in \mathbb{R}^{h \times w \times C_1 \times C_2}$
• Bias: $b \in \mathbb{R}^{C_2}$
• Output: $Z \in \mathbb{R}^{(H - h + 1) \times (W - w + 1) \times C_2}$
$Z_{a,b,c} = b_c + \sum_{i=0}^{h-1} \sum_{j=0}^{w-1} \sum_{k=0}^{C_1 - 1} X_{a+i,\, b+j,\, k}\; w_{i,j,k,c}$
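A quick shape check of the output-size formula above using PyTorch (channel-first layout); the numbers are illustrative.

```python
# Minimal convolution shape check.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                                    # C1=3, H=W=32
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)   # h=w=5, C2=8
print(conv(x).shape)                                             # torch.Size([1, 8, 28, 28]) = (H-h+1, W-w+1)
```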
Stacking multiple layers
Conv 3×3
ReLU
Conv 3×3
ReLU
Conv 3×3
ReLU
Convolution as a linear layer
input: 3 x 3 x 1 input: 9
a b
c d
kernel: 2 x 2 x 1 x 1
output: 2 x 2 x 1
a b c d
a b c d
a b c d
a b c d
weight: 4 x 9
output: 4
Special case: 1×1 convolution
• Pixel-wise linear
transformation
• Kernel: 1 × 1 × C1 × C2
* w =
Output size
• Input: $X \in \mathbb{R}^{H \times W \times C_1}$
• Kernel: $w \in \mathbb{R}^{h \times w \times C_1 \times C_2}$
• Output: $Z \in \mathbb{R}^{(H - h + 1) \times (W - w + 1) \times C_2}$
Padding
• Add $p_w$, $p_h$ zeros in each dimension
• Input: $X \in \mathbb{R}^{H \times W \times C_1}$
• Kernel: $w \in \mathbb{R}^{h \times w \times C_1 \times C_2}$
• Output: $Z \in \mathbb{R}^{(H - h + 2p_h + 1) \times (W - w + 2p_w + 1) \times C_2}$
[Figure: input bordered by zeros, convolved with a 2×2 kernel]
Output resolution
• High output resolution
• Slow computation
Conv 3×3
Conv 3×3
Conv 3×3
(H, W, 1024)
(H, W, 512)
(H, W, 3)
Striding
• Only compute every $s_w$-th / $s_h$-th output
• Input: $X \in \mathbb{R}^{H \times W \times C_1}$
• Kernel: $w \in \mathbb{R}^{h \times w \times C_1 \times C_2}$
• Output: $Z \in \mathbb{R}^{\left(\lfloor \frac{H - h + 2p_h}{s_h} \rfloor + 1\right) \times \left(\lfloor \frac{W - w + 2p_w}{s_w} \rfloor + 1\right) \times C_2}$
[Figure: zero-padded input convolved with a 3×3 kernel, keeping every other output]
Output size with striding
a b c
d e f
g h i
* =
a b c
d e f
g h i
* =
Parameters
• Every input channel
is connected to every
output channel
C
1
C2
C
1 = 6 C2 = 4
Grouping
• Split channels into g
groups
• Reduce parameters and
computation by factor g
C
1 = 6 C2 = 4
Depthwise convolution
• Special grouping
• •
C
1 = g
C2
= g
C
1 = 3 C2 = 3
Hyper-parameters of convolutions
• Kernel size: $w \times h$
• Padding: $p_w$, $p_h$
• Stride: $s_w$, $s_h$
[Figure: a 3×3 kernel with padding and stride annotated]
Convolutional operators
• Run arbitrary operation
“over” image
f(x)
f(x)
Average pooling
• Convolutional operator $f_c(x) = \mathrm{mean}_{i,j}(x_{i,j,c})$
Where to use average
pooling?
• Older networks:
• Inside a network

Conv 3×3
ReLU
Avg Pool
Conv 3×3
ReLU
(H, W, 1024)
(H/2, W/2, 1024)
Where to use average
pooling?
• Modern networks
• Global average pooling

Linear
Softmax
Avg Pool
Conv 3×3
ReLU
(H’, W’, 1024)
(1024)
Max pooling
• Convolutional operator $f_c(x) = \max_{i,j}(x_{i,j,c})$
Where to use max pooling?
• Inside a network
• With strides, as
down-sampling

Conv 3×3
ReLU
Max Pool
Conv 3×3
ReLU
(H, W, 1024)
(H/2, W/2, 1024)
Max pooling as a nonlinearity
• Similar to maxout
Conv
Max Pool
Conv

Max Pool
Receptive fields
• Can input $x_{abc}$ affect output $z_{ijk}$?
[Figure: input 3×3×1 → Conv 2×2 → output 2×2×1]
Receptive fields
• Can input $x_{abc}$ affect output $z_{ijk}$?
[Figure: input 3×4×1 → Conv 2×2 → Conv 2×2 → output 2×2×1]
How do we compute the
receptive field
• Option 1: lots of math
• Option 2: Computationally
• Feed an image of 0s to the network
• Change a single
element to NaN
• See output changes
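A minimal sketch of the NaN trick described above: feed zeros, poison one input pixel with NaN, and see which outputs turn NaN; `net` is assumed to be any convolutional network.

```python
# Minimal receptive-field check via NaN propagation.
import torch

x = torch.zeros(1, 3, 64, 64)
x[:, :, 32, 32] = float('nan')        # perturb a single input location
with torch.no_grad():
    out = net(x)
receptive = torch.isnan(out)          # True wherever the poisoned pixel can reach
```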
Structure of receptive field?
Use striding, increase
channels
• Trade spatial resolution
for channels
• Balance computation
Keep kernels small
• 3×3 kernels almost
everywhere
• exception:
• first layer up to 7×7
Conv 7×7
Conv 3×3
Conv 3×3
Conv 3×3

Repeat patterns
• First layer or two are
special and not
repeated
• All others usually follow
a fixed pattern
Conv 1×1
ReLU
Conv 3×3
ReLU
Conv 1×1
ReLU
All-convolutional
• Average in the end
• Fewer parameters
• Better training signal
• “Ensemble”/voting
effect for testing
[Figure: Conv → Flatten → Linear → softmax vs. Conv → Avg Pool → Linear → softmax]
Structure of input data
• Images
• Repeating patterns
• at various scales
7×7 patches
Structure of convolutional
networks
• Exploit repeating
structure of images
Conv 7×7
Conv 3×3
Conv 3×3
Conv 3×3

Network Dissection: Quantifying Interpretability of Deep Visual
Representations, D. Bau etal, CVPR 2017
What do networks learn?
Linear layered networks do
not work well on images
• Largest linear network
for computer vision
• Locally connected
Building high-level features using large-scale unsupervised learning, Q. Le etal, ICML 2012
Large Scale Distributed Deep Networks, J. Dean etal. NeurIPS 2012
Locally connected Block
L2 Pool
Local Linear
Local Res. Norm
Locally connected Block
Locally connected Block
Segmentation
Receptive field
• How to increase
receptive field?
• Large kernel size
• Striding
Conv 3×3
ReLU
Conv 3×3
ReLU
Conv 3×3
ReLU

Receptive field
receptive field = 3
receptive field = 5
receptive field = 7
Receptive field and striding
receptive field = 3
receptive field = 7
receptive field = 15
Dilation
• Add 0-padding between
values in convolutional
kernel
a b c
d e f
g h i
*
a b c
d e f
g h i
*
Dilation vs striding
receptive field = 3
receptive field = 7
receptive field = 15
Dilation vs striding
receptive field = 3
receptive field = 7
receptive field = 15
The many names of dilation
• hole convolution
• à trous convolution
Inverse of strided
convolution
Strided
conv
Upsampling
Up-convolution
• Dilation of the input
a b c
d e f
g h i
*
Rounding
• Strided convolution
rounds down
• How to correct for this?
Up-convolution in action
• Used closer to output
layers
Conv
Conv
Conv
Up-Conv
Up-Conv
Up-Conv
Up-convolutions and skip
connections
• Provides lower-level
high-resolution features
to output
Conv
Conv
Conv
Up-Conv
Up-Conv
Up-Conv
The many names of up-convolution
• Transpose convolution
• “Deconvolution”
• fractionally strided
convolutions
Convolutional networks
• Linear transformations
• Convolutions
• Convolutional nonlinearities
• Pooling
Softmax
Conv 7×7
ReLU
Conv 3×3
ReLU
Linear
Avg Pool
ConvNet Design
• Stride and increase
channels
• Small kernel size
• All convolutional
• Up-convolution close to
output (optional)
Softmax
Conv 7×7
ReLU
Conv 3×3
ReLU
Linear
Avg Pool
(H, W, 3)
(H/2, W/2, 64)
(H/4, W/4, 128)
(128)
Applications of ConvNets
• Autonomous vehicles
• Analyze medical images
• Geoscience: Analyze
scans of rock
formations to find oil
