From 85c301121f69c5e08a3b6de5921a5c8dd52eb9d8 Mon Sep 17 00:00:00 2001
From: hoshikawa2
Date: Mon, 20 Oct 2025 22:48:26 -0300
Subject: [PATCH] Adjustments

---
 README.md | 131 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 129 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index e95ce0b..d6e65c5 100644
--- a/README.md
+++ b/README.md
@@ -35,7 +35,7 @@ Install required packages:
 pip install torch transformers datasets peft bitsandbytes accelerate
 ```
 
-Download the training dataset in JSON format:
+Download a training dataset in the following JSON format:
 
 ```json
 [
   {"text": "Question: What is QLoRA?\nAnswer: An efficient fine-tuning technique using 4-bit quantization."},
@@ -45,10 +45,137 @@ Download the training dataset in JSON format:
 
 Save as ./datasets/qa_dataset.json.
 
-Compatible GPU: preferably one with 16 GB of VRAM or more (e.g., RTX 3090, 4090, 5090).
+## ⚡ GPU Compatibility and Recommendations
+
+Running and fine-tuning large language models is computationally intensive. While it is possible to run inference on CPUs, performance is significantly slower and often impractical for real-time or large-scale workloads. For this reason, a dedicated GPU is strongly recommended.
+
+### Why GPUs?
+
+GPUs are optimized for massively parallel matrix operations, which are at the core of transformer models. This parallelism translates directly into:
+
+- **Faster training and inference** — models that would take hours on a CPU can run in minutes on a GPU.
+- **Support for larger models** — GPUs provide high-bandwidth memory (VRAM), enabling models with billions of parameters to fit and run efficiently.
+- **Energy efficiency** — despite high TDPs, GPUs often consume less power per generated token than CPUs, thanks to their optimized architecture.
+
+### Why NVIDIA?
+
+Although other vendors are entering the AI market, NVIDIA GPUs remain the de facto standard for LLM training and deployment because:
+
+- They offer CUDA and cuDNN, mature software stacks with deep integration into PyTorch, TensorFlow, and Hugging Face Transformers.
+- Popular quantization and fine-tuning libraries (e.g., BitsAndBytes, PEFT, Accelerate) are optimized for NVIDIA architectures.
+- Broad ecosystem support ensures driver stability, optimized kernels, and multi-GPU scalability.
+
+### Examples of GPUs
+
+**Consumer RTX Series** (e.g., RTX 3090, 4090, 5090):
+These GPUs are widely available, offering 24 GB of VRAM (3090/4090) or more in newer generations such as the RTX 5090. They are suitable for personal or workstation setups, providing excellent performance for inference and QLoRA fine-tuning.
+
+**Data Center GPUs** (e.g., NVIDIA A10, A100, H100):
+Enterprise GPUs are designed for continuous workloads, offering higher VRAM (24–80 GB), ECC memory for reliability, and optimized virtualization capabilities. For example:
+
+- An A10 (24 GB) is an affordable option for cloud deployments.
+- An A100 (40–80 GB) or H100 supports massive batch sizes and full fine-tuning of very large models.
+
+### Clusters and Memory Considerations
+
+VRAM capacity is the main limiting factor. A 7B-parameter model typically requires ~14–16 GB in FP16 but can run in ~8 GB when quantized (e.g., 4-bit QLoRA). Larger models (13B, 34B, 70B) may require 24 GB or more.
+
+Clustered GPU setups (multi-GPU training) enable splitting the model across devices using tensor or pipeline parallelism. This is essential for very large models but introduces complexity in synchronization and scaling.
+
+Cloud providers often expose A10, A100, or L4 instances that scale horizontally. Choosing between a single powerful GPU and a cluster of smaller GPUs depends on the workload:
+
+- **Single GPU**: simpler setup, fewer synchronization bottlenecks.
+
+- **Multi-GPU cluster**: better for large-scale training or serving multiple requests concurrently.
+
+### Summary
+
+For most developers:
+
+- An **RTX 4090/5090** is an excellent choice for local fine-tuning and inference of 7B–13B models.
+- **A10/A100** GPUs are better suited for enterprise clusters or cloud deployments, where high availability and VRAM capacity matter more than cost.
+
+When planning GPU resources, always balance VRAM requirements, throughput, and scalability needs. Proper hardware selection ensures that training and inference are both feasible and efficient.
+
 ## 📝 Step-by-step code
 
+This step-by-step pipeline is organized to keep **memory usage low** while preserving model quality and reproducibility:
+
+- **Quantization first (4-bit with BitsAndBytes)** reduces VRAM so you can load a 7B model on a single modern GPU without sacrificing too much accuracy. We then layer **LoRA adapters** on top, updating only low-rank matrices instead of the full weights --- which dramatically cuts the number of trainable parameters and speeds up training.
+- We explicitly **set the tokenizer's `pad_token` to `eos_token`** to avoid padding issues with causal LMs and keep batching simple and efficient.
+- Using **`device_map="auto"`** delegates placement (GPU/CPU/offload) to Accelerate, ensuring the model fits while exploiting available GPU memory.
+- The **data pipeline** keeps labels equal to inputs for next-token prediction and uses an **LM data collator** (no MLM), which is the correct objective for decoder-only transformers.
+- The **Trainer** abstracts the training loop (forward, backward, optimization, logging, checkpoints), reducing boilerplate and error-prone code.
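To make the memory argument above concrete, here is a rough back-of-the-envelope sketch. The numbers are illustrative only: the layer count, hidden size, and LoRA rank below assume a Llama-7B-like shape and are not taken from this repository's configuration.

```python
# Rough estimate of weight memory and trainable-parameter counts for a
# quantized base model with LoRA adapters. Illustrative numbers only.

def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the model weights alone (excludes activations/optimizer)."""
    return n_params * bits_per_weight / 8 / 1024**3

def lora_trainable_params(n_layers: int, hidden: int, rank: int,
                          n_target_modules: int = 4) -> int:
    """LoRA adds two low-rank matrices (hidden x r and r x hidden) per
    targeted projection (e.g., q/k/v/o) in every transformer layer."""
    return n_layers * n_target_modules * 2 * hidden * rank

n_params = 7e9                              # a 7B-parameter base model
fp16_gb = weight_memory_gb(n_params, 16)    # weights in FP16
int4_gb = weight_memory_gb(n_params, 4)     # weights in 4-bit

# Assumed Llama-7B-like shape: 32 layers, hidden size 4096, LoRA rank 16
trainable = lora_trainable_params(n_layers=32, hidden=4096, rank=16)

print(f"FP16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
print(f"LoRA trainable params: {trainable / 1e6:.1f}M "
      f"({trainable / n_params:.4%} of the base model)")
```

Weights alone come to roughly 13 GB in FP16 versus about 3.3 GB at 4 bits, which is why quantization-first plus LoRA (here on the order of 17M trainable parameters, well under 1% of the model) fits on a single 16–24 GB GPU once activations and adapter optimizer state are added on top.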
+
+### 🏗️ Training Architecture (End-to-End)
+
+```mermaid
+flowchart LR
+    A[A-Raw JSON Dataset] --> B[B-Tokenization & Formatting]
+    B --> C[C-4-bit Quantized Base Model]
+    C --> D[D-LoRA Adapters PEFT]
+    D --> E[E-Trainer Loop FP16/4-bit]
+    E --> F[F-Checkpoints qlora-output]
+    F --> G[G-Merge Adapters → Base]
+    G --> H[H-Unified Export safetensors]
+    H --> I[I-Inference Transformers or GGUF Ollama/llama.cpp]
+```
+
+- **C → D**: Only the LoRA layers are trainable; the base model stays quantized/frozen.
+- **F → G → H**: After training, you **merge** LoRA into the base, then **export** to a production-friendly format (single or sharded `safetensors`) or convert to **GGUF** for lightweight runtimes.
+
+### 🚶 The core steps (and why each matters)
+
+1. **Main configurations** --- Centralize the base model, dataset path, output directory, and sequence length so experiments are reproducible and easy to tweak.
+2. **Quantization with BitsAndBytes (4-bit)** --- Shrinks the memory footprint and bandwidth pressure; crucial for single-GPU training.
+3. **Load tokenizer & model** --- Set `pad_token = eos_token` (causal LM best practice) and let `device_map="auto"` place weights efficiently.
+4. **Prepare LoRA (PEFT)** --- Target the attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) for maximal quality per trainable parameter.
+5. **Dataset & tokenization** --- Labels mirror inputs for next-token prediction; truncation/padding give stable batch shapes.
+6. **Data collator (no MLM)** --- The correct objective for decoder-only models; ensures clean, consistent batches.
+7. **Training arguments** --- Small batches plus gradient accumulation balance VRAM limits with throughput; `fp16=True` saves memory.
+8. **Trainer** --- Handles the full loop (forward/backward/optimizer/checkpoints/logging) to reduce complexity and bugs.
+9. **Train & save** --- Persist the adapters, PEFT config, and tokenizer for later merge or continued training.
+10. 
+    **(Post) Merge & export** --- Fuse the adapters into the base model, export to `safetensors`, or convert to **GGUF** for Ollama/llama.cpp if you need a single-file runtime.
+
 
 1. Main configurations
 ```python