mirror of
https://github.com/hoshikawa2/qlora_training.git
synced 2026-03-06 18:21:01 +00:00
Adjustments

This commit is contained in:

131 README.md
@@ -35,7 +35,7 @@ Install required packages:

```
pip install torch transformers datasets peft bitsandbytes accelerate
```

Download a training dataset in this JSON format:

```json
[
    {"text": "Question: What is QLoRA?\nAnswer: An efficient fine-tuning technique using 4-bit quantization."},
```
@@ -45,10 +45,137 @@ Download the training dataset in JSON format:

Save as `./datasets/qa_dataset.json`.
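One way to produce that file is a short standard-library script; this is a minimal sketch, and the second record is an illustrative extra entry, not part of the original dataset:

```python
import json
import os

# Records in the "text" format shown above; extend with your own Q&A pairs.
records = [
    {"text": "Question: What is QLoRA?\nAnswer: An efficient fine-tuning technique using 4-bit quantization."},
    {"text": "Question: What does LoRA train?\nAnswer: Small low-rank adapter matrices instead of the full model weights."},
]

# Write the dataset where the training script expects it
os.makedirs("./datasets", exist_ok=True)
with open("./datasets/qa_dataset.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```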
## ⚡ GPU Compatibility and Recommendations

Compatible GPU: preferably one with 16 GB of VRAM or more (e.g., RTX 3090, 4090, 5090).
Running and fine-tuning large language models is computationally intensive. While it is possible to execute inference on CPUs, performance is significantly slower and often impractical for real-time or large-scale workloads. For this reason, using a dedicated GPU is strongly recommended.

### Why GPUs?

GPUs are optimized for massively parallel matrix operations, which are at the core of transformer models. This parallelism translates directly into:

- Faster training and inference: models that would take hours on CPU can be executed in minutes on GPU.
- Support for larger models: GPUs provide high-bandwidth memory (VRAM), enabling models with billions of parameters to fit and run efficiently.
- Energy efficiency: despite high TDPs, GPUs often consume less power per generated token than CPUs, thanks to their optimized architecture.

### Why NVIDIA?

Although other vendors are entering the AI market, NVIDIA GPUs remain the de facto standard for LLM training and deployment because:

- They offer CUDA and cuDNN, mature software stacks with deep integration into PyTorch, TensorFlow, and Hugging Face Transformers.
- Popular quantization and fine-tuning libraries (e.g., BitsAndBytes, PEFT, Accelerate) are optimized for NVIDIA architectures.
- Broad ecosystem support ensures driver stability, optimized kernels, and multi-GPU scalability.
### Examples of GPUs

**Consumer RTX series (e.g., RTX 3090, 4090, 5090):** widely available GPUs offering 16–24 GB of VRAM (3090/4090) or more in newer generations like the RTX 5090. They are suitable for personal or workstation setups, providing excellent performance for inference and QLoRA fine-tuning.

**Data center GPUs (e.g., NVIDIA A10, A100, H100):** enterprise GPUs designed for continuous workloads, offering higher VRAM (24–80 GB), ECC memory for reliability, and optimized virtualization capabilities. For example:

- An A10 (24 GB) is an affordable option for cloud deployments.
- An A100 (40–80 GB) or H100 supports massive batch sizes and full fine-tuning of very large models.
### Clusters and Memory Considerations

VRAM capacity is the main limiting factor. A 7B-parameter model typically requires ~14–16 GB in FP16, but can run on ~8 GB when quantized (e.g., 4-bit QLoRA). Larger models (13B, 34B, 70B) may require 24 GB or more.
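These estimates follow from simple arithmetic: weight memory ≈ parameter count × bits per parameter / 8, before activations, KV cache, and optimizer overhead. A quick sketch:

```python
def model_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Rough weight-memory estimate in GB (decimal), ignoring activations,
    KV cache, and optimizer state."""
    return n_params * bits_per_param / 8 / 1e9

# 7B model in FP16 (16 bits/param): ~14 GB for the weights alone
print(model_memory_gb(7e9, 16))  # 14.0

# The same model quantized to 4 bits: ~3.5 GB of weights, which is
# why it fits on an ~8 GB GPU even with runtime overhead
print(model_memory_gb(7e9, 4))   # 3.5
```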
Clustered GPU setups (multi-GPU training) enable splitting the model across devices using tensor or pipeline parallelism. This is essential for very large models but introduces complexity in synchronization and scaling.

Cloud providers often expose A10, A100, or L4 instances that scale horizontally. Choosing between a single powerful GPU and a cluster of smaller GPUs depends on the workload:

- Single GPU: simpler setup, fewer synchronization bottlenecks.
- Multi-GPU cluster: better for large-scale training or serving multiple requests concurrently.
### Summary

For most developers:

- RTX 4090/5090 is an excellent choice for local fine-tuning and inference of 7B–13B models.
- A10/A100 GPUs are better suited for enterprise clusters or cloud deployments, where high availability and VRAM capacity matter more than cost.

When planning GPU resources, always balance VRAM requirements, throughput, and scalability needs. Proper hardware selection ensures that training and inference are both feasible and efficient.
## 📝 Step-by-step code

This step-by-step pipeline is organized to keep **memory usage low** while preserving model quality and reproducibility:
- **Quantization first (4-bit with BitsAndBytes)** reduces VRAM so you can load a 7B model on a single modern GPU without sacrificing too much accuracy. Then we layer **LoRA adapters** on top, updating only low-rank matrices instead of full weights, which dramatically cuts the number of trainable parameters and speeds up training.
- We explicitly **set the tokenizer's `pad_token` to `eos_token`** to avoid padding issues with causal LMs and keep batching simple and efficient.
- Using **`device_map="auto"`** delegates placement (GPU/CPU/offload) to Accelerate, ensuring the model fits while exploiting available GPU memory.
- The **data pipeline** keeps labels equal to inputs for next-token prediction and uses an **LM data collator** (no MLM), which is the correct objective for decoder-only transformers.
- **Trainer** abstracts the training loop (forward, backward, optimization, logging, checkpoints), reducing boilerplate and error-prone code.
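The points above can be sketched together in code; this is a minimal sketch assuming the Hugging Face stack named above, and the model name is a placeholder, not necessarily the checkpoint this repo uses:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder; substitute your base model

# 4-bit NF4 quantization keeps the frozen base model small in VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # causal LMs usually ship without a pad token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",  # let Accelerate decide GPU/CPU/offload placement
)
```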
### 🏗️ Training Architecture (End-to-End)
```mermaid
flowchart LR
    A[A-Raw JSON Dataset] --> B[B-Tokenization & Formatting]
    B --> C[C-4-bit Quantized Base Model]
    C --> D[D-LoRA Adapters PEFT]
    D --> E[E-Trainer Loop FP16/4-bit]
    E --> F[F-Checkpoints qlora-output]
    F --> G[G-Merge Adapters → Base]
    G --> H[H-Unified Export safetensors]
    H --> I[I-Inference Transformers or GGUF Ollama/llama.cpp]
```
- **C → D**: Only LoRA layers are trainable; the base model stays quantized/frozen.
- **F → G → H**: After training, you **merge** LoRA into the base, then **export** to a production-friendly format (single or sharded `safetensors`) or convert to **GGUF** for lightweight runtimes.
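The merge-and-export step can be sketched as follows, assuming PEFT's `merge_and_unload`; the base model name is illustrative and `./qlora-merged` is a placeholder output path:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"  # placeholder base checkpoint
ADAPTERS = "./qlora-output"        # Trainer's adapter checkpoint directory
MERGED = "./qlora-merged"          # placeholder export directory

# Reload the base in half precision (merging happens on unquantized weights)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, ADAPTERS)

# Fuse the LoRA deltas into the base weights and drop the adapter wrappers
model = model.merge_and_unload()

# Save as safetensors shards, plus the tokenizer, for standalone inference
model.save_pretrained(MERGED, safe_serialization=True)
AutoTokenizer.from_pretrained(ADAPTERS).save_pretrained(MERGED)
```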
### 🚶 The core steps (and why each matters)
1. **Main configurations**: centralize base model, dataset path, output directory, and sequence length so experiments are reproducible and easy to tweak.
2. **Quantization with BitsAndBytes (4-bit)**: shrinks memory footprint and bandwidth pressure; crucial for single-GPU training.
3. **Load tokenizer & model**: set `pad_token = eos_token` (causal LM best practice) and let `device_map="auto"` place weights efficiently.
4. **Prepare LoRA (PEFT)**: target attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) for maximal quality/latency gains per parameter trained.
5. **Dataset & tokenization**: labels mirror inputs for next-token prediction; truncation/padding give stable batch shapes.
6. **Data collator (no MLM)**: correct objective for decoder-only models; ensures clean, consistent batches.
7. **Training arguments**: small batch + gradient accumulation balances VRAM limits with throughput; `fp16=True` saves memory.
8. **Trainer**: handles the full loop (forward/backward/optimization/checkpoints/logging) to reduce complexity and bugs.
9. **Train & save**: persist adapters, PEFT config, and tokenizer for later merge or continued training.
10. **(Post) Merge & export**: fuse adapters into the base, export to `safetensors`, or convert to **GGUF** for Ollama/llama.cpp if you need a single-file runtime.
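Steps 4 and 7 might look like the following sketch; the hyperparameters are illustrative defaults, not necessarily this repo's exact values:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Step 4: LoRA on the attention projections; only these small
# low-rank matrices receive gradient updates
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Step 7: a tiny per-device batch with gradient accumulation gives an
# effective batch of 1 x 8 = 8 while staying inside VRAM limits
training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,  # half-precision activations/gradients save memory
    logging_steps=10,
    save_strategy="epoch",
)
```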
1. Main configurations

```python