mirror of
https://github.com/hoshikawa2/qlora_training.git
synced 2026-03-06 18:21:01 +00:00
Adjustments

This commit is contained in:

131 README.md
@@ -35,7 +35,7 @@ Install required packages:

```
pip install torch transformers datasets peft bitsandbytes accelerate
```

Download a training dataset in this JSON format:

```json
[
    {"text": "Question: What is QLoRA?\nAnswer: An efficient fine-tuning technique using 4-bit quantization."},
```
@@ -45,10 +45,137 @@ Download the training dataset in JSON format:

Save as `./datasets/qa_dataset.json`.
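One way to produce that file is a short standard-library script; this is a minimal sketch, and the second record is an illustrative extra entry, not part of the original dataset:

```python
import json
import os

# Records in the "text" format shown above; extend with your own Q&A pairs.
records = [
    {"text": "Question: What is QLoRA?\nAnswer: An efficient fine-tuning technique using 4-bit quantization."},
    {"text": "Question: What does LoRA train?\nAnswer: Small low-rank adapter matrices instead of the full model weights."},
]

# Write the dataset where the training script expects it
os.makedirs("./datasets", exist_ok=True)
with open("./datasets/qa_dataset.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```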
## ⚡ GPU Compatibility and Recommendations

Compatible GPU: preferably one with 16 GB of VRAM or more (e.g., RTX 3090, 4090, 5090).
Running and fine-tuning large language models is computationally intensive. While it is possible to execute inference on CPUs, performance is significantly slower and often impractical for real-time or large-scale workloads. For this reason, using a dedicated GPU is strongly recommended.

### Why GPUs?

GPUs are optimized for massively parallel matrix operations, which are at the core of transformer models. This parallelism translates directly into:

- Faster training and inference: models that would take hours on CPU can be executed in minutes on GPU.
- Support for larger models: GPUs provide high-bandwidth memory (VRAM), enabling models with billions of parameters to fit and run efficiently.
- Energy efficiency: despite high TDPs, GPUs often consume less power per generated token than CPUs, thanks to their optimized architecture.

### Why NVIDIA?

Although other vendors are entering the AI market, NVIDIA GPUs remain the de facto standard for LLM training and deployment because:

- They offer CUDA and cuDNN, mature software stacks with deep integration into PyTorch, TensorFlow, and Hugging Face Transformers.
- Popular quantization and fine-tuning libraries (e.g., BitsAndBytes, PEFT, Accelerate) are optimized for NVIDIA architectures.
- Broad ecosystem support ensures driver stability, optimized kernels, and multi-GPU scalability.
### Examples of GPUs

**Consumer RTX series (e.g., RTX 3090, 4090, 5090):** widely available GPUs offering 16–24 GB of VRAM (3090/4090) or more in newer generations like the RTX 5090. They are suitable for personal or workstation setups, providing excellent performance for inference and QLoRA fine-tuning.

**Data center GPUs (e.g., NVIDIA A10, A100, H100):** enterprise GPUs designed for continuous workloads, offering higher VRAM (24–80 GB), ECC memory for reliability, and optimized virtualization capabilities. For example:

- An A10 (24 GB) is an affordable option for cloud deployments.
- An A100 (40–80 GB) or H100 supports massive batch sizes and full fine-tuning of very large models.
### Clusters and Memory Considerations

VRAM capacity is the main limiting factor. A 7B-parameter model typically requires ~14–16 GB in FP16, but can run on ~8 GB when quantized (e.g., 4-bit QLoRA). Larger models (13B, 34B, 70B) may require 24 GB or more.
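These estimates follow from simple arithmetic: weight memory ≈ parameter count × bits per parameter / 8, before activations, KV cache, and optimizer overhead. A quick sketch:

```python
def model_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Rough weight-memory estimate in GB (decimal), ignoring activations,
    KV cache, and optimizer state."""
    return n_params * bits_per_param / 8 / 1e9

# 7B model in FP16 (16 bits/param): ~14 GB for the weights alone
print(model_memory_gb(7e9, 16))  # 14.0

# The same model quantized to 4 bits: ~3.5 GB of weights, which is
# why it fits on an ~8 GB GPU even with runtime overhead
print(model_memory_gb(7e9, 4))   # 3.5
```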
Clustered GPU setups (multi-GPU training) enable splitting the model across devices using tensor or pipeline parallelism. This is essential for very large models but introduces complexity in synchronization and scaling.

Cloud providers often expose A10, A100, or L4 instances that scale horizontally. Choosing between a single powerful GPU and a cluster of smaller GPUs depends on the workload:

- Single GPU: simpler setup, fewer synchronization bottlenecks.
- Multi-GPU cluster: better for large-scale training or serving multiple requests concurrently.
### Summary

For most developers:

- RTX 4090/5090 is an excellent choice for local fine-tuning and inference of 7B–13B models.
- A10/A100 GPUs are better suited for enterprise clusters or cloud deployments, where high availability and VRAM capacity matter more than cost.

When planning GPU resources, always balance VRAM requirements, throughput, and scalability needs. Proper hardware selection ensures that training and inference are both feasible and efficient.
## 📝 Step-by-step code

This step-by-step pipeline is organized to keep **memory usage low** while preserving model quality and reproducibility:
- **Quantization first (4-bit with BitsAndBytes)** reduces VRAM so you can load a 7B model on a single modern GPU without sacrificing too much accuracy. Then we layer **LoRA adapters** on top, updating only low-rank matrices instead of full weights, which dramatically cuts the number of trainable parameters and speeds up training.
- We explicitly **set the tokenizer's `pad_token` to `eos_token`** to avoid padding issues with causal LMs and keep batching simple and efficient.
- Using **`device_map="auto"`** delegates placement (GPU/CPU/offload) to Accelerate, ensuring the model fits while exploiting available GPU memory.
- The **data pipeline** keeps labels equal to inputs for next-token prediction and uses an **LM data collator** (no MLM), which is the correct objective for decoder-only transformers.
- **Trainer** abstracts the training loop (forward, backward, optimization, logging, checkpoints), reducing boilerplate and error-prone code.
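The points above can be sketched together in code; this is a minimal sketch assuming the Hugging Face stack named above, and the model name is a placeholder, not necessarily the checkpoint this repo uses:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder; substitute your base model

# 4-bit NF4 quantization keeps the frozen base model small in VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # causal LMs usually ship without a pad token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",  # let Accelerate decide GPU/CPU/offload placement
)
```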
### 🏗️ Training Architecture (End-to-End)
```mermaid
flowchart LR
    A[A-Raw JSON Dataset] --> B[B-Tokenization & Formatting]
    B --> C[C-4-bit Quantized Base Model]
    C --> D[D-LoRA Adapters PEFT]
    D --> E[E-Trainer Loop FP16/4-bit]
    E --> F[F-Checkpoints qlora-output]
    F --> G[G-Merge Adapters → Base]
    G --> H[H-Unified Export safetensors]
    H --> I[I-Inference Transformers or GGUF Ollama/llama.cpp]
```
- **C → D**: Only LoRA layers are trainable; the base model stays quantized/frozen.
- **F → G → H**: After training, you **merge** LoRA into the base, then **export** to a production-friendly format (single or sharded `safetensors`) or convert to **GGUF** for lightweight runtimes.
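The merge-and-export step can be sketched as follows, assuming PEFT's `merge_and_unload`; the base model name is illustrative and `./qlora-merged` is a placeholder output path:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"  # placeholder base checkpoint
ADAPTERS = "./qlora-output"        # Trainer's adapter checkpoint directory
MERGED = "./qlora-merged"          # placeholder export directory

# Reload the base in half precision (merging happens on unquantized weights)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, ADAPTERS)

# Fuse the LoRA deltas into the base weights and drop the adapter wrappers
model = model.merge_and_unload()

# Save as safetensors shards, plus the tokenizer, for standalone inference
model.save_pretrained(MERGED, safe_serialization=True)
AutoTokenizer.from_pretrained(ADAPTERS).save_pretrained(MERGED)
```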
### 🚶 The core steps (and why each matters)
1. **Main configurations**: centralize base model, dataset path, output directory, and sequence length so experiments are reproducible and easy to tweak.
2. **Quantization with BitsAndBytes (4-bit)**: shrinks memory footprint and bandwidth pressure; crucial for single-GPU training.
3. **Load tokenizer & model**: set `pad_token = eos_token` (causal LM best practice) and let `device_map="auto"` place weights efficiently.
4. **Prepare LoRA (PEFT)**: target attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) for maximal quality/latency gains per parameter trained.
5. **Dataset & tokenization**: labels mirror inputs for next-token prediction; truncation/padding give stable batch shapes.
6. **Data collator (no MLM)**: correct objective for decoder-only models; ensures clean, consistent batches.
7. **Training arguments**: small batch + gradient accumulation balances VRAM limits with throughput; `fp16=True` saves memory.
8. **Trainer**: handles the full loop (forward/backward/optimization/checkpoints/logging) to reduce complexity and bugs.
9. **Train & save**: persist adapters, PEFT config, and tokenizer for later merge or continued training.
10. **(Post) Merge & export**: fuse adapters into the base, export to `safetensors`, or convert to **GGUF** for Ollama/llama.cpp if you need a single-file runtime.
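Steps 4 and 7 might look like the following sketch; the hyperparameters are illustrative defaults, not necessarily this repo's exact values:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Step 4: LoRA on the attention projections; only these small
# low-rank matrices receive gradient updates
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Step 7: a tiny per-device batch with gradient accumulation gives an
# effective batch of 1 x 8 = 8 while staying inside VRAM limits
training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,  # half-precision activations/gradients save memory
    logging_steps=10,
    save_strategy="epoch",
)
```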
1. Main configurations

```python