Neural network compression and speedup

1.

Deep neural networks compression
Alexander Chigorin
Head of research projects
VisionLabs
[email protected]

2.

Deep neural networks compression. Motivation
Neural net architectures and size
Architecture   | # params (millions) | Size in MB | Accuracy on ImageNet
VGG-16         | 138                 | 552        | 71%
AlexNet        | 61                  | 244        | 57%
ResNet-18      | ~34                 | 138        | 68%
GoogleNet V1   | ~5                  | 20         | 69%
GoogleNet V4   | ~40                 | 163        | 80%
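A quick sanity check on the size column, assuming 32-bit floats: 138M parameters × 4 bytes ≈ 552 MB for VGG-16, and ~5M × 4 bytes ≈ 20 MB for GoogleNet V1.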
Too large for some types of devices (mobile phones, embedded systems).
Some models can be compressed up to 50x without loss of accuracy.

3.

Deep neural networks compression. Overview
Methods to review:
• Learning both Weights and Connections for Efficient Neural Networks
• Deep Compression: Compressing Deep Neural Networks with Pruning,
Trained Quantization and Huffman Coding
• Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights

4.

Network Pruning

5.

Network Pruning. Generic idea

6.

Network Pruning. More details
Repeat until accuracy is no longer acceptable:
• start from the current weights
• prune weights with small absolute values (set them to zero)
• retrain the remaining weights
A minimal code sketch follows the example below.
[Figure: pruning on a 4×4 example weight matrix; entries with small absolute value (e.g. 0.1, -0.1) are set to zero, and the surviving weights shift during retraining (e.g. 2.5 → 2.9).]
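A minimal NumPy sketch of this loop, assuming magnitude-based pruning with a binary mask; prune_by_magnitude is an illustrative name, not the paper's code, and retraining is only indicated in comments:

import numpy as np

def prune_by_magnitude(weights, sparsity):
    # Zero out the fraction `sparsity` of entries with the smallest |w|;
    # return the pruned weights and the binary mask of survivors.
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Demo on the kind of 4x4 matrix the slide uses.
W = np.array([[ 2.5, -0.9,  0.1,  0.2],
              [ 0.2, -0.1,  2.0, -0.2],
              [ 0.5, -1.9, -0.2, -0.1],
              [-0.3,  0.1, -1.5,  1.2]])
W_pruned, mask = prune_by_magnitude(W, sparsity=0.5)
print(W_pruned)  # half of the entries are now exactly zero

# During retraining only the surviving weights are updated, e.g.
#   W -= learning_rate * grad * mask
# so pruned connections stay at zero; the prune/retrain cycle then repeats.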

7.

Network Pruning. Results
9-13x overall compression

8.

Network Pruning. Results
~60% weight sparsity in conv layers
~96% weight sparsity in fc layers

9.

Deep Compression
ICLR 2016 Best Paper

10.

Deep Compression. Overview
Algorithm:
• Iterative weights pruning
• Weights quantization
• Huffman encoding

11.

Deep Compression. Weight pruning
Already discussed

12.

Deep Compression. Weight quantization
Initial weights are clustered (k-means); every weight is replaced by the index of its cluster.
Write to the disk: with 4 clusters, each index can be compressed to 2 bits.
Centroids are then fine-tuned by retraining with weight sharing (all weights in a cluster are tied to one shared value).
Write to the disk: the 2-bit indices plus the fine-tuned centroids take only about ¼ of the original weight storage, a ~4x reduction on this example.
[Figure: the 4×4 example matrix, its 2-bit cluster indices, and the four centroids before and after fine-tuning.]
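A small sketch of the quantization step, assuming 1-D k-means over the flattened weights with linear centroid initialization; quantize_weights is an illustrative name, and centroid fine-tuning is only described in comments:

import numpy as np

def quantize_weights(weights, n_clusters=4, n_iters=20):
    # 1-D k-means over the flattened weights: returns a cluster index per
    # weight and the cluster centroids (the shared weight values).
    flat = weights.ravel()
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)  # linear init
    for _ in range(n_iters):
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = flat[idx == k].mean()
    return idx.reshape(weights.shape), centroids

W = np.array([[ 2.5, -0.9,  0.1,  0.2],
              [ 0.2, -0.1,  2.0, -0.2],
              [ 0.5, -1.9, -0.2, -0.1],
              [-0.3,  0.1, -1.5,  1.2]])
idx, centroids = quantize_weights(W)
W_shared = centroids[idx]  # every entry is one of the 4 shared values

# On disk: a 2-bit index per weight plus 4 float centroids.
# Fine-tuning with weight sharing: the gradient of each centroid is the sum
# of the gradients of all weights assigned to it.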

13.

Deep Compression. Huffman coding
Distribution of the weight indices: some indices are much more frequent than others!
Huffman coding is a lossless compression method whose output is a variable-length code table for encoding the source symbols.
Frequent symbols are encoded with fewer bits.
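A compact, illustrative Python sketch of the idea using the standard-library heapq (not the paper's implementation); huffman_code is a hypothetical helper that builds the code table from symbol frequencies:

import heapq
from collections import Counter

def huffman_code(symbols):
    # Build a prefix-code table: merge the two least frequent nodes until
    # one tree remains; frequent symbols end up with shorter codes.
    freq = Counter(symbols)
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, tie, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
    return heap[0][2]

# After pruning and quantization, one index (the pruned weights) dominates,
# so a fixed 2-bit encoding is wasteful.
indices = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3]
table = huffman_code(indices)
encoded = "".join(table[i] for i in indices)
print(table)                                   # e.g. {0: '1', 1: '00', ...}
print(len(encoded), "bits vs", 2 * len(indices), "bits fixed-width")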

14.

Deep Compression. Results
35-49x overall compression

15.

Deep Compression. Results
~13x reduction (pruning)
~31x reduction (pruning + quantization)
~49x reduction (pruning + quantization + Huffman coding)

16.

Incremental Network Quantization

17.

Incremental Network Quantization. Idea
Idea:
• let’s quantize weights incrementally (as we do during pruning)
• let’s quantize to powers of 2 (a rough worked example below)
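Roughly speaking (the exact INQ rounding rule differs in its thresholds), each quantized weight is replaced by the nearest ±2^e: a weight of -0.9 becomes -2^0 = -1 and a weight of 0.4 becomes 2^(-1) = 0.5, so only a small exponent needs to be stored per weight.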

18.

Incremental Network Quantization. Overview
Repeat until everything is quantized:
• Partitioning: split the not-yet-quantized weights into two groups by magnitude.
• Power-of-2 quantization: replace each weight in the larger-magnitude group with the nearest value of the form ±2^e.
• Retraining: retrain the remaining floating-point weights, keeping the quantized ones fixed.
[Figure: first iteration on a 4×4 example matrix; the largest entries become powers of two, e.g. 2.5 → 2^1 and -0.9 → -2^0, while the rest stay floating point and are retrained.]

19.

Incremental Network Quantization. Overview
Repeat until everything is quantized:
Second iteration on the same example: a further portion of the remaining weights is quantized (e.g. 0.5 → 2^(-1)), and the still-floating weights are retrained again.
[Figure: the 4×4 example matrix after the second partition / quantize / retrain pass.]

20.

Incremental Network Quantization. Overview
After the last iteration every weight is a power of two.
Write to the disk. The set of exponents is {-3, -2, -1, 0, 1}, so each weight can be represented with 3 bits.
~10x reduction (3 bits instead of 32).
A code sketch of one iteration follows below.
[Figure: the fully quantized 4×4 example matrix with entries such as 2^1, -2^0, 2^(-1), -2^(-3).]
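A NumPy sketch of one INQ iteration under these assumptions (magnitude-based partitioning, nearest-power-of-two rounding); nearest_pow2 and inq_step are illustrative names, and the retraining between steps is only noted in a comment:

import numpy as np

def nearest_pow2(w, exponents):
    # Map each weight to the nearest value in {+-2^e for e in exponents} or 0.
    levels = np.concatenate(([0.0], 2.0 ** exponents, -(2.0 ** exponents)))
    flat = w.reshape(-1, 1)
    return levels[np.abs(levels[None, :] - flat).argmin(axis=1)].reshape(w.shape)

def inq_step(W, quantized, fraction, exponents):
    # Quantize the largest-magnitude `fraction` of the matrix among the
    # still-floating weights; return new weights and updated quantized mask.
    remaining = np.where(quantized, 0.0, np.abs(W))
    k = int(round(fraction * W.size))
    pick = np.unravel_index(np.argsort(remaining, axis=None)[-k:], W.shape)
    W = W.copy()
    W[pick] = nearest_pow2(W[pick], exponents)
    quantized = quantized.copy()
    quantized[pick] = True
    return W, quantized

W = np.array([[ 2.5, -0.9,  0.1,  0.2],
              [ 0.2, -0.1,  2.0, -0.2],
              [ 0.5, -1.9, -0.2, -0.1],
              [-0.3,  0.1, -1.5,  1.2]])
exponents = np.array([-3.0, -2.0, -1.0, 0.0, 1.0])  # the slide's 5-value set
mask = np.zeros_like(W, dtype=bool)
W, mask = inq_step(W, mask, fraction=0.5, exponents=exponents)
print(W)
# Between steps, the still-floating weights would be retrained with the
# quantized ones frozen; repeat until mask is all True (e.g. 50%, 75%, 100%).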

21.

Incremental Network Quantization. Results
~7x reduction, accuracy increased (!)

22.

Incremental Network Quantization. Results
No big drop in accuracy even with 3 bits
for ResNet-18

23.

Incremental Network Quantization. Results
~53x reduction if combined with pruning
(better than Deep Compression)

24.

Future: native hardware support

25.

Future: native hardware support
~92 tera 8-bit OPS/sec (e.g. Google's TPU)

26.

Stages of a typical platform deployment
Alexander Chigorin
Head of research projects
VisionLabs
[email protected]