Похожие презентации:

# Back propagation example

## 1.

19back-propagation training

## 2.

Error1.0

0.0

3.7

2.9

1

• Computed output: y = .76

• Correct output: t = 1.0

⇒ How do we adjust the weights?

20

.90

.17

1

-5.2

.76

## 3.

Key Concepts• Gradient descent

–

–

–

–

–

error is a function of the weights

we want to reduce the error

gradient descent: move towards the error minimum

compute gradient → get direction to the error minimum

adjust weights towards direction of lower error

• Back-propagation

– first adjust last set of weights

– propagate error back to each previous layer

– adjust their weights

21

## 4.

Gradient Descent22

error(λ)

λ

optimal λ

current λ

## 5.

Gradient DescentGradient for w1

Current Point

Gradient for w2

Optimum

23

## 6.

Coborârea gradientului (gradient descent)în spațiul ponderilor

Din cartea Machine Learning, de Tom Mitchel.

http://profsite.um.ac.ir/~monsefi/machine-learning/pdf/MachineLearning-Tom-Mitchell.pdf

## 7.

Derivative of Sigmoid• Sigmoid

sigmoid(x) =

1

1 + e−x

• Reminder: quotient rule

• Derivative

d sigmoid(x)

dx

=

=

d

1

dx 1 + e − x

0 × (1 − e − x ) − ( − e − x )

(1 + e −x ) 2

e−x

1

=

1 + e−x 1 + e−x

1

1

1

−

=

1 + e−x

1 + e−x

= sigmoid(x)(1 − sigmoid(x))

24

## 8.

Final Layer Update• Linear combination of weights

• Activation function y = sigmoid(s)

• Error (L2 norm) E = 12(t −y)2

• Derivative of error with regard to one weight wk

25

## 9.

Final Layer Update (1)• Linear combination of weights

• Activation function y = sigmoid(s)

• Error (L2 norm) E = 12(t −y)2

• Derivative of error with regard to one weight wk

dE

dE dy ds

=

dwk

dy dsdwk

• Error E is defined with respect to y

2

26

## 10.

Final Layer Update (2)• Linear combination of weights

• Activation function y = sigmoid(s)

• Error (L2 norm) E = 12(t −y)2

• Derivative of error with regard to one weight wk

dE

dE dy ds

=

dwk

dy dsdwk

• y with respect to x is sigmoid(s)

dy = d sigmoid(s) = sigmoid(s)(1 − sigmoid(s)) = y(1 − y)

ds

ds

27

## 11.

Final Layer Update (3)• Linear combination of weights s =

Σ

k

wkhk

• Activation function y = sigmoid(s)

• Error (L2 norm) E = 12(t −y)2

• Derivative of error with regard to one weight wk

dE

dE dy ds

=

dwk

dy dsdwk

• x is weighted linear combination of hidden node values hk

28

## 12.

Putting it All Together• Derivative of error with regard to one weight wk

dE

dE dy ds

=

dwk

dy dsdwk

= −(t − y) y(1 − y) hk

– error

– derivative of sigmoid: y'

• Weight adjustment will be scaled by a fixed learning rate µ

29

## 13.

Multiple Output Nodes• Our example only had one output node

• Typically neural networks have multiple output nodes

• Error is computed over all j output nodes

• Weights k → j are adjusted according to the node they point to

30

## 14.

Hidden Layer Update31

• In a hidden layer, we do not have a target output value

• But we can compute how much each node contributed to downstream error

• Definition of error term of each node

• Back-propagate the error term

(why this way? there is math to back it up...)

• Universal update formula

∆w j←k = µ δj hk

## 15.

Our ExampleA

1.0

3.7

D

.90

G

E

B

0.0

C

.17

2.9

-5.2

F

1

32

1

• Computed output: y = .76

• Correct output: t = 1.0

• Final layer weight updates (learning rate µ = 10)

– δG = (t − y) y' = (1 − .76) 0.181 = .0434

– ∆wGD = µ δG hD = 10 × .0434 × .90 = .391

– ∆wGE = µ δG hE = 10 × .0434 × .17 = .074

– ∆wGF = µ δG hF = 10 × .0434 × 1 = .434

.76

## 16.

Our ExampleA

1.0

3.7

D

.90

E

B

0.0

C

.17

2.9

-5.126 -—5.—2

F

1

33

1

• Computed output: y = .76

• Correct output: t = 1.0

• Final layer weight updates (learning rate µ = 10)

– δG = (t − y) y' = (1 − .76) 0.181 = .0434

– ∆wGD = µ δG hD = 10 × .0434 × .90 = .391

– ∆wGE = µ δG hE = 10 × .0434 × .17 = .074

– ∆wGF = µ δG hF = 10 × .0434 × 1 = .434

G

.76

## 17.

Hidden Layer UpdatesA

1.0

3.7

0.0

C

.17

2.9

F

1

• Hidden node E

.90

E

B

• Hidden node D

D

1

-5.126 -—5.—2

G

.76

34

## 18.

35some additional aspects

## 19.

Initialization of Weights• Weights are initialized randomly

e.g., uniformly from interval [−0.01, 0.01]

• Glorot and Bengio (2010) suggest

– for shallow neural networks

n is the size of the previous layer

– for deep neural networks

n j is the size of the previous layer, n j size of next layer

36

## 20.

Neural Networks for Classification• Predict class: one output node per class

• Training data output: ”One-hot vector”, e.g., ˙

• Prediction

– predicted class is output node yi with highest value

– obtain posterior probability distribution by soft-max

37

## 21.

Problems with Gradient Descent Trainingerror(λ)

λ

Too high learning rate

38

## 22.

Problems with Gradient Descent Training39

error(λ)

λ

Bad initialization

Philipp Koehn

Machine Translation: Introduction to Neural Networks

27 September 2018

## 23.

Problems with Gradient Descent Trainingerror(λ)

local optimum

global optimum

Local optimum

λ

40

## 24.

Speedup: Momentum Term41

• Updates may move a weight slowly in one direction

• To speed this up, we can keep a memory of prior updates

∆wj←k (n −1)

• ... and add these to any new updates (with decay factor ρ)

∆wj←k (n) = µ δj hk + ρ∆wj←k (n − 1)

Philipp Koehn

Machine Translation: Introduction to Neural Networks

27 September 2018

## 25.

Adagrad42

• Typically reduce the learning rate µ over time

– at the beginning, things have to change a lot

– later, just fine-tuning

• Adapting learning rate per parameter

• Adagrad update

based on error E with respect to the weight w at time t = gt = dE

dw

∆ wt = . Σ

µ

t τ

=1

gτ2

gt

## 26.

Dropout43

• A general problem of machine learning: overfitting to training data

(very good on train, bad on unseen test)

• Solution: regularization, e.g., keeping weights from having extreme values

• Dropout: randomly remove some hidden units during training

– mask: set of hidden units dropped

– randomly generate, say, 10–20 masks

– alternate between the masks during training

• Why does that work?

→ bagging, ensemble, ...

## 27.

Mini Batches• Each training example yields a set of weight updates ∆wi .

• Batch up several training examples

– sum up their updates

– apply sum to model

• Mostly done or speed reasons

44

## 28.

45computational aspects

## 29.

Vector and Matrix Multiplications• Forward computation:

• Activation function:

• Error term:

• Propagation of error term:

• Weight updates:

46

## 30.

GPU• Neural network layers may have, say, 200 nodes

• Computations such as

multiplications

require 200 × 200 = 40, 000

• Graphics Processing Units (GPU) are designed for such computations

– image rendering requires such vector and matrix operations

– massively mulit-core but lean processing units

– example: NVIDIA Tesla K20c GPU provides 2496 thread processors

• Extensions to C to support programming of GPUs, such as CUDA

47

## 31.

Toolkits• Theano

• Tensorflow (Google)

• PyTorch (Facebook)

• MXNet (Amazon)

• DyNet

• С (easy api)

48

## 32.

[email protected]## 33.

NextTema: Modele secvențiale

Prezentator: Tudor Bumbu