
Local LLM deployment

1.

Local LLM deployment
01 4 bits QLoRA quantization
02 8 bits QLoRA quantization
03 Code generation
04 *Tuned Code generation

2.

4 bits QLoRA quantization
Quantization is the process of discretizing an input from a representation that holds more information to a representation that holds less.
Basic FP32 -> Int8 quantization:
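As a rough illustration of basic FP32 -> Int8 quantization, the sketch below uses symmetric absmax scaling: every value is multiplied by 127 / max|x| and rounded to the nearest integer. The choice of absmax scaling, the function names, and the sample weights are assumptions made for this example; the formula shown on the original slide is not reproduced here.

```python
# Minimal sketch of symmetric absmax FP32 -> Int8 quantization.
# Assumption: one shared scale for the whole input; names are illustrative.

def quantize_absmax_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    absmax = max(abs(v) for v in values)
    scale = 127.0 / absmax if absmax > 0 else 1.0
    codes = [round(v * scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c / scale for c in codes]


weights = [0.3, -1.2, 0.75, 0.0]  # hypothetical sample weights
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)

print(codes)
print(restored)
```

The restored values differ from the originals by at most half a quantization step (0.5 / scale), which is the information lost in exchange for storing one byte per value instead of four.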
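Full fine-tuning is expensive.
QLoRA advantages:
• 4-bit NormalFloat (NF4): a data type that is information-theoretically optimal for normally distributed weights
• Double Quantization: the quantization constants are quantized as well as the model
• LoRA: the base model is frozen; only the adapters are trained
• Paged Optimizers: optimizer states are paged to CPU memory automatically when the GPU runs out of memory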
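The LoRA point can be made concrete with a small sketch: the frozen weight matrix W is left untouched, and a trainable low-rank product B·A is added to its output. The helper names, the `scaling` parameter, and the layer size 4096 x 4096 with rank 16 are assumptions chosen for illustration, not values from the slides.

```python
# Minimal sketch of a LoRA layer in plain Python (no framework).
# Assumption: W is frozen; only A (rank x d_in) and B (d_out x rank) would be trained.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scaling=1.0):
    """Compute W x + scaling * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """Return (frozen, trainable) parameter counts for one adapted layer."""
    frozen = d_out * d_in
    trainable = rank * (d_in + d_out)
    return frozen, trainable


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights
A = [[1.0, 1.0]]               # rank-1 down-projection
B = [[1.0], [2.0]]             # rank-1 up-projection
print(lora_forward(W, A, B, [1.0, 2.0]))

frozen, trainable = lora_param_counts(4096, 4096, 16)  # hypothetical layer size
print(frozen, trainable, trainable / frozen)
```

For the hypothetical 4096 x 4096 layer with rank 16, the adapters hold under 1% of the parameters of the frozen matrix, which is why training only the adapters is so much cheaper than full fine-tuning.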
QLORA: Efficient Finetuning of Quantized LLMs https://arxiv.org/pdf/2305.14314.pdf
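The effect of Double Quantization can be checked with a short calculation. The sketch below uses the block sizes reported in the cited QLoRA paper (64 weights per block with a 32-bit constant; after double quantization, 8-bit constants grouped in blocks of 256 with one 32-bit second-level constant). The function names are illustrative.

```python
# Worked arithmetic: memory overhead of quantization constants, in bits per parameter.
# Block sizes and bit widths follow the cited QLoRA paper; function names are illustrative.

def constant_overhead_bits(block_size, constant_bits):
    """Overhead when each block of weights stores one constant."""
    return constant_bits / block_size


def double_quant_overhead_bits(block_size, first_bits, second_block_size, second_bits):
    """Overhead when the constants themselves are quantized in blocks."""
    first_level = first_bits / block_size
    second_level = second_bits / (block_size * second_block_size)
    return first_level + second_level


plain = constant_overhead_bits(64, 32)
double = double_quant_overhead_bits(64, 8, 256, 32)

print(plain, double, plain - double)
```

Under these settings the overhead drops from 0.5 to about 0.127 bits per parameter, a saving of roughly 0.37 bits per parameter on top of the 4-bit weights.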

3.

4 bits QLoRA quantization
No quantization:
4 bits quantization:

4.

8 bits QLoRA quantization
No quantization:
8 bits quantization:

5.

Codegen
Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf

6.

Codegen
No quantization:
Codegen quantization:

7.

Tuned Codegen
How to prepare a search map?
1. User defined space (~, ~)
2. Random space (pretty big and not very effective)
3. Meta-parameters analysis (kinda small, highly effective)
(a) 32 and 128 for tile sizes;
(b) 8 and 32 for micro-tile sizes; and
(c) 1 and 4 for nano-tile sizes.
How to find the best from the prepared search map?
1. Brute-force search
2. Randomized grid search
3. RL-agent (TVM approach)
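The two questions on this slide can be sketched together: first build the search map from the tile, micro-tile and nano-tile ranges, then pick the best configuration by brute force or by random sampling. The sketch assumes the ranges are sampled at powers of two, and `toy_cost` is a hypothetical stand-in for a measured kernel runtime; neither is taken from the slides.

```python
import itertools
import random

# Minimal sketch of building a tuning search space and searching it.
# Assumptions: ranges are sampled at powers of two; toy_cost replaces a real benchmark.

def powers_of_two(low, high):
    """Return the powers of two from low to high inclusive (low must be a power of two)."""
    values = []
    v = low
    while v <= high:
        values.append(v)
        v *= 2
    return values


def build_search_space():
    """All (tile, micro_tile, nano_tile) combinations."""
    tiles = powers_of_two(32, 128)
    micro_tiles = powers_of_two(8, 32)
    nano_tiles = powers_of_two(1, 4)
    return list(itertools.product(tiles, micro_tiles, nano_tiles))


def toy_cost(config):
    """Hypothetical cost: pretend (64, 16, 2) is the fastest configuration."""
    tile, micro, nano = config
    return abs(tile - 64) + abs(micro - 16) + abs(nano - 2)


def brute_force_search(space, cost):
    """Evaluate every configuration and return the cheapest."""
    return min(space, key=cost)


def random_search(space, cost, trials, seed=0):
    """Evaluate a random subset of configurations and return the cheapest seen."""
    rng = random.Random(seed)
    sample = rng.sample(space, min(trials, len(space)))
    return min(sample, key=cost)


space = build_search_space()
best = brute_force_search(space, toy_cost)
sampled_best = random_search(space, toy_cost, trials=8)
print(len(space), best, sampled_best)
```

Brute force is guaranteed to find the optimum but evaluates every point, which is affordable only for small spaces such as this one; random sampling costs fewer evaluations and may settle for a worse configuration.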

8.

Tuned Codegen
No quantization:
Tuned codegen quantization:

9.

*Throughput optimization
Latency is the delay between two events of interest, for example between sending a request and receiving its response.
Throughput is the amount of data, or any other unit of work, that can be transferred or processed within a given period.

10.

*Throughput optimization
No quantization:
Batched (batch size 4) tuned codegen quantization:
2.41 s per batch / 4 items = 0.6025 s/item
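The per-item figure above can be reproduced, together with the matching throughput, in a few lines. The function names are illustrative, and the 2.41 figure is read as seconds per batch.

```python
# Worked arithmetic: per-item time and throughput for a batch of 4.
# Assumption: 2.41 is the wall-clock time in seconds for the whole batch.

def per_item_seconds(batch_seconds, batch_size):
    """Average time attributed to one item in the batch."""
    return batch_seconds / batch_size


def throughput_items_per_second(batch_seconds, batch_size):
    """Items completed per second."""
    return batch_size / batch_seconds


batch_seconds = 2.41
batch_size = 4

item_time = per_item_seconds(batch_seconds, batch_size)
rate = throughput_items_per_second(batch_seconds, batch_size)
print(item_time, rate)
```

Batching improves throughput, not latency: each of the four requests still waits the full 2.41 s for the batch to finish, even though the average cost per item is about 0.6 s.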

11.

Result
[Bar chart. Bar values, with the relative factor in parentheses: 8.14 (x1), 7.69 (x1), 2.83 (x3), 2.69 (x3), 2.34 (x3), 0.625 (x13). Categories: No quantization, 4 bits quant, 8 bits quant, Codegen, Tuned codegen, Batched tuned CG.]