first commit

2026-03-03 16:09:36 +00:00 · 2025-10-20 20:31:55 -03:00
commit 781949dccb
7 changed files with 19274 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,516 @@
 # Fine-tuning with QLoRA on Mistral 7B
 ## 🎯 Introduction
 Training large language models (LLMs) from scratch is infeasible for most people, as it requires hundreds of GPUs and billions of tokens. However, it is possible to adapt pre-trained models (such as Mistral-7B or any other model) for specific use cases using lighter techniques such as LoRA and QLoRA. LoRA trains a small set of added parameters instead of the full model, and QLoRA additionally stores the frozen base weights in 4-bit precision, drastically reducing memory consumption.
 ### Real Applications
 - Specialized chatbots: healthcare, legal, finance.
 - Question and Answer (QA) systems over document bases.
 - Business automation: agents that understand internal processes.
 - Academic research: summarization and analysis of technical articles.
 In this tutorial, you will learn how to fine-tune a Mistral 7B model with QLoRA using a local JSON dataset of questions and answers.
 ## 🧩 Key Concepts
 - **BitsAndBytes (bnb):** library for 4-bit quantization, enabling training of large models on a single modern GPU.
 - **LoRA (Low-Rank Adaptation):** technique that adds small trainable low-rank matrices to selected layers while keeping the original weights frozen.
 - **QLoRA:** combination of LoRA + 4-bit quantization, resulting in efficient and accessible fine-tuning.
 - **Trainer (Hugging Face):** abstraction that manages dataset, batches, checkpoints, and metrics automatically.
 - **JSON dataset:** in the expected format, each entry should have a "text" field containing the instruction or training data.
 ## ⚙️ Prerequisites
 Install required packages:
 ```bash
 pip install torch transformers datasets peft bitsandbytes accelerate
 ```
 Prepare the training dataset as a JSON array (the repository ships one in datasets/qa_dataset.json), for example:
 ```json
 [
  {"text": "Question: What is QLoRA?\nAnswer: An efficient fine-tuning technique using 4-bit quantization."},
  {"text": "Question: Who created the Mistral model?\nAnswer: The company Mistral AI."}
 ]
 ```
 Save as ./datasets/qa_dataset.json.
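 Before training, it can help to check that every record has the expected shape. The following is a minimal sketch, assuming the file is a JSON array of objects with a non-empty string "text" field, as in the example above. The helper name `find_invalid_records` and the inline sample are hypothetical, not part of the repository.

```python
import json


def find_invalid_records(records):
    """Return the indices of entries that lack a non-empty string "text" field."""
    bad = []
    for i, entry in enumerate(records):
        text = entry.get("text") if isinstance(entry, dict) else None
        if not isinstance(text, str) or not text.strip():
            bad.append(i)
    return bad


# Inline sample in the same format as qa_dataset.json (illustrative data only).
sample = json.loads("""
[
  {"text": "Question: What is QLoRA?\\nAnswer: An efficient fine-tuning technique."},
  {"text": ""},
  {"question": "missing the text field"}
]
""")
print(find_invalid_records(sample))
```

 To check the real file, load it with `json.load` and pass the resulting list to the same function.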
 Compatible GPU: preferably one with 16 GB of VRAM or more (e.g., RTX 3090, 4090, 5090).
 ## 📝 Step-by-step code
 1. Main configurations
 ```python
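 from datasets import load_dataset
 from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
 from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
 from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments
 model_name = "mistralai/Mistral-7B-Instruct-v0.2"
 data_file_path = "./datasets/qa_dataset.json"
 output_dir = "./qlora-output"
 max_length = 512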
 ```
 Defines base model, dataset, output, and token limit per sample.
 2. Quantization with BitsAndBytes
 ```python
 bnb_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_use_double_quant=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_compute_dtype="float16"
 )
 ```
 Activates 4-bit quantization (nf4) to reduce memory.
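 To see why 4-bit storage matters, a rough back-of-the-envelope estimate of the memory taken by the weights alone is sketched below. The parameter count of about 7.2 billion is an assumption for a 7B-class model, and the estimate ignores quantization constants, activations, gradients, optimizer state, and the LoRA weights themselves.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: roughly 7.2 billion parameters for a 7B-class model.
N_PARAMS = 7_200_000_000

for label, bits in [("float32", 32), ("float16", 16), ("nf4 (4-bit)", 4)]:
    print(f"{label:>12}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```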
 3. Model and tokenizer loading
 ```python
 tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
 tokenizer.pad_token = tokenizer.eos_token
 model = AutoModelForCausalLM.from_pretrained(
   model_name,
   quantization_config=bnb_config,
   device_map="auto",
   trust_remote_code=True
 )
 ```
 Sets pad_token to eos_token, because padding to max_length needs a padding token and the Mistral tokenizer does not define one by default.
 Loads the quantized model.
 4. Preparing for LoRA
 ```python
 model = prepare_model_for_kbit_training(model)
 peft_config = LoraConfig(
   r=8,
   lora_alpha=16,
   target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
   lora_dropout=0.05,
   bias="none",
   task_type="CAUSAL_LM"
 )
 model = get_peft_model(model, peft_config)
 ```
 Configures LoRA layers only on attention projections.
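 The number of trainable parameters this configuration adds can be estimated by hand. The sketch below assumes the layer shapes of Mistral 7B (hidden size 4096, key/value projection size 1024, 32 layers); these come from the published model configuration, not from this tutorial, so treat them as assumptions.

```python
def lora_params(in_features, out_features, r):
    """A LoRA pair adds A (r x in) and B (out x r) for one linear layer."""
    return r * (in_features + out_features)


# Assumed shapes (in_features, out_features) of the targeted projections.
HIDDEN, KV_DIM, N_LAYERS, R = 4096, 1024, 32, 8
modules = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

per_layer = sum(lora_params(i, o, R) for i, o in modules.values())
total = per_layer * N_LAYERS
print(f"trainable LoRA parameters: {total:,}")
```

 On the real model, `model.print_trainable_parameters()` (available on the object returned by `get_peft_model`) reports the actual count.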
 5. Loading and tokenizing the dataset
 ```python
 dataset = load_dataset("json", data_files=data_file_path, split="train")
 def tokenize(example):
    tokenized = tokenizer(
        example["text"],
        truncation=True,
        max_length=max_length,
        padding="max_length"
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized
 tokenized_dataset = dataset.map(tokenize, remove_columns=dataset.column_names)
 ```
 Tokenizes entries, sets labels equal to input_ids.
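 The effect of `truncation=True` with `padding="max_length"` on a single sequence can be sketched in plain Python. This is an illustration only: the token ids are made up, and right padding is assumed (the real side depends on the tokenizer's `padding_side`).

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Cut a sequence to max_length, then right-pad it; return (input_ids, attention_mask)."""
    ids = token_ids[:max_length]
    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad


# A short sequence is padded, a long one is truncated.
print(pad_or_truncate([5, 6, 7], max_length=5, pad_id=2))
print(pad_or_truncate([1, 2, 3, 4], max_length=2, pad_id=0))
```

 Every sample ends up with exactly `max_length` tokens, so short question/answer pairs spend most of their length on padding. A smaller `max_length` reduces memory use if your samples are short.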
 6. Data Collator
 ```python
 data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
 ```
 Builds batches for causal language modeling (MLM disabled): labels are copied from input_ids, and padding positions are set to -100 so the loss ignores them.
 7. Training arguments
 ```python
 training_args = TrainingArguments(
   output_dir=output_dir,
   per_device_train_batch_size=2,
   gradient_accumulation_steps=4,
   num_train_epochs=3,
   learning_rate=2e-4,
   fp16=True,
   logging_steps=10,
   save_strategy="epoch",
   report_to="none"
 )
 ```
 Small batch size (2), accumulating gradients to effective batch of 8.
 fp16 enabled to reduce memory.
 8. Initializing the Trainer
 ```python
 trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_dataset,
   data_collator=data_collator
 )
 ```
 Trainer organizes training, checkpoints, and logging.
 9. Running the training
 ```python
 trainer.train()
 model.save_pretrained(output_dir)
 tokenizer.save_pretrained(output_dir)
 ```
 Saves the LoRA adapter and the tokenizer in qlora-output.
 ### 🚀 Expected result
 After training, you will have:
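 - A LoRA adapter trained for your domain (./qlora-output).
 - Ability to run domain-specific inference. The snippet below relies on the PEFT integration in transformers to resolve the base model from the adapter directory; if your version does not support that, load the base model and attach the adapter with PeftModel.from_pretrained, as inference_qlora.py does: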
 ```python
 from transformers import pipeline
 pipe = pipeline("text-generation", model="./qlora-output", tokenizer="./qlora-output")
 print(pipe("Question: What is QLoRA?\nAnswer:")[0]["generated_text"])
 ```
 ## Complete QLoRA Pipeline
 So far we have seen model configuration, quantization, LoRA application, and dataset preparation.
 Now let's understand what happens afterwards:
 1. Data Collator
 ```python
 data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
 ```
 ### 📌 Why use it?
 The Data Collator is responsible for dynamically building batches during training.
 In causal models, we do not use MLM (Masked Language Modeling) as in BERT, so we set mlm=False.
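 With `mlm=False`, the collator derives the labels from the inputs. The label-building step can be sketched in plain Python as follows; the token ids are made up, and the function name `causal_lm_labels` is hypothetical.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def causal_lm_labels(input_ids, pad_id):
    """Copy input_ids into labels, replacing padding positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]


# Made-up token ids; 2 plays the role of both EOS and PAD here.
batch = [[5, 6, 7, 2, 2, 2]]
print([causal_lm_labels(seq, pad_id=2) for seq in batch])
```

 Because this tutorial sets `pad_token` to `eos_token`, any EOS token in a sequence is masked together with the padding. If the model should learn to stop on its own, a dedicated padding token may be worth considering.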
 2. Training Arguments
 ```python
 training_args = TrainingArguments(
   output_dir=output_dir,
   per_device_train_batch_size=2,
   gradient_accumulation_steps=4,
   num_train_epochs=3,
   learning_rate=2e-4,
   fp16=True,
   logging_steps=10,
   save_strategy="epoch",
   report_to="none"
 )
 ```
 ### 📌 Why configure this?
 - per_device_train_batch_size=2 → limits batch size (GPU constraint).
 - gradient_accumulation_steps=4 → accumulates gradients before update → effective batch = 2 x 4 = 8.
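 - num_train_epochs=3 → makes 3 passes over the dataset.
 - learning_rate=2e-4 → a learning rate commonly used for LoRA adapters.
 - fp16=True → mixed-precision training, which saves VRAM.
 - logging_steps=10 → logs every 10 optimizer steps (each step covers 4 accumulated batches here).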
 - save_strategy="epoch" → saves checkpoint after each epoch.
 - report_to="none" → avoids external integrations.
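 The arithmetic behind these settings can be sketched as follows. The dataset size of 1000 samples is hypothetical, and the step count is an approximation: how a final partial batch is rounded depends on the Trainer version.

```python
import math


def training_schedule(n_samples, per_device_batch, accumulation_steps, epochs, n_devices=1):
    """Return (effective batch size, approximate optimizer steps over the whole run)."""
    effective_batch = per_device_batch * accumulation_steps * n_devices
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# 1000 samples is a hypothetical dataset size.
print(training_schedule(1000, per_device_batch=2, accumulation_steps=4, epochs=3))
```

 With `logging_steps=10`, a run of this size would produce a log entry roughly every 80 samples.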
 3. Defining Trainer
 ```python
 trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_dataset,
   data_collator=data_collator
 )
 ```
 ### 📌 Why use Trainer?
 Trainer automates:
 - Forward pass (batch propagation).
 - Backpropagation (gradient calculation).
 - Weight optimization.
 - Logging and checkpoints.
 4. Running Training
 ```python
 trainer.train()
 ```
 ### 📌 What happens here?
 Trainer processes data, batches, loss calculation, weight updates.
 Only LoRA layers are updated.
 5. Saving Adapted Model
 ```python
 model.save_pretrained(output_dir)
 tokenizer.save_pretrained(output_dir)
 ```
 ### 📌 Why save?
 The ./qlora-output/ directory will contain:
 - LoRA weights (adapter_model.safetensors, or adapter_model.bin with older PEFT versions).
 - PEFT config (adapter_config.json).
 - Tokenizer files.
 This output can be loaded later for inference or further training.
 ## Interactive Inference
 After training with QLoRA, we usually merge the LoRA weights into the base model (merge_qlora.py writes the merged model to ./merged_model).
 The code below shows how to load the merged model and use it interactively.
 ```python
 # -*- coding: utf-8 -*-
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 # device for the model and the inputs: "cuda", "mps", or "cpu"
 gpu = "cuda"
 MODEL_PATH = "./merged_model"
 tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
 tokenizer.pad_token = tokenizer.eos_token
 model = AutoModelForCausalLM.from_pretrained(
   MODEL_PATH,
   device_map=gpu,
   offload_folder="./offload",
   torch_dtype=torch.float16
 )
 model.eval()
 ```
 ### Function to generate
 ```python
 def generate_answer(prompt, max_new_tokens=64):
    # move the inputs to whichever device the model was loaded on
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False  # faster and deterministic
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
 ```
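 For decoder-only models, `generate` returns the prompt tokens followed by the new tokens, so the decoded string usually starts with the prompt. A small post-processing helper can strip it. This is a sketch with a hypothetical function name; it falls back to the full text because decoding does not always reproduce the prompt character for character.

```python
def extract_answer(decoded, prompt):
    """Drop the echoed prompt from the decoded text, if it is there."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    return decoded.strip()


prompt = "Question: What is QLoRA?\nAnswer:"
decoded = prompt + " An efficient fine-tuning technique using 4-bit quantization."
print(extract_answer(decoded, prompt))
```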
 ### Interactive loop
 ```python
 if __name__ == "__main__":
    print("🤖 Model loaded! Type your question or 'exit' to quit.")
    while True:
        question = input("\n📝 Instruction: ")
        if question.strip().lower() in ["exit", "quit"]:
            break
        # use the same Question/Answer format as the training examples
        formatted = f"Question: {question}\nAnswer:"
        answer = generate_answer(formatted)
        print("\n📎 Answer:")
        print(answer)
 ```
 ## 📊 Inference Flow
 ```mermaid
 flowchart TD
 A[User types question] --> B[Tokenization]
 B --> C[QLoRA merged model]
 C --> D[Token decoding]
 D --> E[Answer text]
 E --> A
 ```
 ## ✅ Conclusion of Inference
 - This stage validates if the model learned the desired domain.
 - Can be expanded to:
  - Flask/FastAPI APIs.
  - Chatbots (web interfaces).
  - Corporate integration (API Gateway, MCP servers).
 ## 📊 Full Pipeline Summary
 ```mermaid
 flowchart TD
 A[JSON Dataset] --> B[Tokenization]
 B --> C[Tokenized Dataset]
 C --> D[Data Collator]
 D --> E[Trainer]
 E --> F[QLoRA Training]
 F --> G[Adapted Model]
 G --> H[Inference or Deploy]
 ```
 ## 🔄 Post-training: Merge and Unified Model Export
 After training with QLoRA, the adapter weights are saved in the `./qlora-output/` folder.  
 The adapter can be used for inference on top of the base model (as inference_qlora.py does), but for deployment it is usually simpler to **merge** the LoRA adapters with the base model and then export the result in a format ready for inference.
 ### 1. Merge Stage
 The merging process fuses the LoRA adapter weights into the original Mistral 7B model.  
 This produces a `./merged_model/` directory that contains the complete model weights in shards (e.g., `model-00001-of-000xx.safetensors`, or `pytorch_model-00001-of-000xx.bin` with older transformers versions).
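 Numerically, merging adds the scaled low-rank product to the frozen weight: W_merged = W + (alpha / r) * B * A. The toy example below, with made-up 2x2 values and rank 1, is a sketch of that idea only, not of how PEFT implements `merge_and_unload`. It checks that the merged layer gives the same output as the base layer plus the adapter path.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the weight a merged layer would use."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x2 layer with rank r=1 (values are illustrative only).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # shape (r, in_features)
B = [[0.5], [0.25]]     # shape (out_features, r)
x = [[1.0], [1.0]]      # one input column vector
ALPHA, R = 2, 1

merged = merge_lora(W, A, B, ALPHA, R)

# Adapter path: W x + (alpha / r) * B (A x); merged path: W_merged x.
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_out = [[b[0] + (ALPHA / R) * l[0]] for b, l in zip(base_out, lora_out)]
print(merged, matmul(merged, x) == adapter_out)
```

 After merging, the adapter matrices are no longer needed at inference time, which is why the merged directory can be loaded with plain transformers.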
 ### 2. Unification into Hugging Face format
 To simplify inference, we can re-export the model into a **Hugging Face-compatible format** that uses the `safetensors` standard.  
 In this step, we obtain a folder such as `./final_model_single/` containing:
 - `model.safetensors` (the consolidated weights, possibly in one large file or multiple shards depending on size)  
 - `config.json`, `generation_config.json`  
 - Tokenizer files (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, etc.)
 ⚠️ Note: although this looks like a "single model", Hugging Face does not bundle tokenizer and weights into a single binary. The folder still contains multiple files.  
 If you want a **truly single-file format** (weights + tokenizer + metadata in one blob), you should convert the model to **GGUF**, which is the standard used by `llama.cpp` and `Ollama`.
 ### 3. Alternative Inference with Unified Files
 Once exported, you can load the model for inference like this:
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 model_dir = "./final_model_single"
 tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
 model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype="float16",
    device_map="auto",
    trust_remote_code=True
 )
 ```
 ```mermaid
 flowchart TD
    A[Training Output<br>qlora-output/] --> B[Merge Step<br>merged_model/]
    B --> C[Unification<br>final_model_single/]
    C --> D[Inference<br>Transformers / Pipeline]
    C --> E[Optional Conversion<br>GGUF for llama.cpp/Ollama]
 ```
 ## 🗜️ Conversion to GGUF (Optional)
 Although the Hugging Face format (`safetensors` + config + tokenizer) is widely supported, it still requires multiple files.  
 If you prefer a **true single-file model** that bundles weights, tokenizer, and metadata, you can convert the model to **GGUF** — the format used by `llama.cpp`, `Ollama`, and other lightweight inference engines.
 ### 1. Install llama.cpp
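 Clone the repository and build it. Recent llama.cpp versions build with CMake (older checkouts used `make`):
 ```bash
 git clone https://github.com/ggerganov/llama.cpp
 cd llama.cpp
 cmake -B build
 cmake --build build --config Release
 ```
 ### 2. Convert Hugging Face model to GGUF
 Use the conversion script provided in llama.cpp. Recent versions ship `convert_hf_to_gguf.py`; older checkouts called it `convert.py`:
 ```bash
 pip install -r requirements.txt
 python3 convert_hf_to_gguf.py ./final_model_single --outfile mistral-qlora.gguf
 ```
 This will generate a single GGUF file (e.g., mistral-qlora.gguf) containing all weights and tokenizer data.
 ### 3. Run inference with llama.cpp
 You can now run the model with the CLI binary (`llama-cli` under `build/bin` in recent versions, `./main` in older ones):
 ```bash
 ./build/bin/llama-cli -m mistral-qlora.gguf -p "Explain what OMOP CDM is in a few sentences."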
 ```
 ### 4. Run with Ollama
 If you prefer Ollama, create a Modelfile in the same directory as the .gguf file:
 ```docker
 FROM ./mistral-qlora.gguf
 ```
 Then run:
 ```bash
 ollama create mistral-qlora -f Modelfile
 ollama run mistral-qlora
 ```
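 > **Note:**
 > GGUF models are usually quantized (e.g., Q4_K_M, Q5_K_M, Q8_0) to reduce size and run efficiently on consumer GPUs or CPUs. The conversion script alone does not quantize to these formats; that is a separate step with llama.cpp's quantize tool.
 > A quantized 7B .gguf file is much smaller (often 4–8 GB) than the unquantized Hugging Face export, which is about 14 GB in FP16 or about 28 GB in FP32. merge_qlora.py loads the base model without a torch_dtype, so with default settings the export is likely FP32.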
 ## ✅ Final Conclusion
 - This tutorial showed:
  - How to apply QLoRA on large models.
  - How to train on consumer GPUs with reduced memory.
  - How to adapt Mistral 7B for specialized Q&A.
 👉 This technique can be extended to other datasets and domains such as banking, legal, or healthcare.
 ## References
 [Deploy NVIDIA RTX Virtual Workstation on Oracle Cloud Infrastructure](https://docs.oracle.com/en/learn/deploy-nvidia-rtx-oci/index.html)
 ## Acknowledgments
 - **Author** - Cristiano Hoshikawa (Oracle LAD A-Team Solution Engineer)
--- a/datasets/qa_dataset.json
+++ b/datasets/qa_dataset.json
--- a/export_single_file.py
+++ b/export_single_file.py
@@ -0,0 +1,34 @@
 # -*- coding: utf-8 -*-
 """
 Exports the merged LoRA model to a single .safetensors file + tokenizer
 """
 import os
 from transformers import AutoModelForCausalLM, AutoTokenizer
 # Directories
 src_dir = "./merged_model"          # where the merged model (shards) already lives
 dst_dir = "./final_model_single"    # consolidated destination
 base_model_name = "mistralai/Mistral-7B-Instruct-v0.2"
 os.makedirs(dst_dir, exist_ok=True)
 print("🔹 Loading merged model...")
 model = AutoModelForCausalLM.from_pretrained(
    src_dir,
    torch_dtype="auto",   # more stable than a fixed float16
    device_map="cpu"      # force loading on CPU (avoids running out of GPU memory)
 )
 print("🔹 Saving as a single safetensors file...")
 model.save_pretrained(
    dst_dir,
    safe_serialization=True,   # writes model.safetensors
    max_shard_size="30GB"      # large enough for everything to fit in one file
 )
 print("🔹 Saving tokenizer...")
 tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
 tokenizer.save_pretrained(dst_dir)
 print(f"✅ Single file saved to {dst_dir}/model.safetensors")
--- a/infer_single_file.py
+++ b/infer_single_file.py
@@ -0,0 +1,48 @@
 # -*- coding: utf-8 -*-
 """
 Interactive terminal chat with the model consolidated into .safetensors
 """
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
 # Path of the consolidated model
 model_dir = "./final_model_single"
 print("🔹 Loading tokenizer...")
 tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
 if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
 print("🔹 Loading consolidated model...")
 model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",         # places the model on the GPU if one is available
    torch_dtype=torch.float16, # uses FP16 (less memory)
    trust_remote_code=True
 )
 model.eval()
 # Inference function
 def generate_text(prompt, max_new_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            temperature=0.7
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
 # Chat loop
 if __name__ == "__main__":
    print("💬 Chat started! Type your question (or 'sair' to quit).")
    while True:
        user_input = input("📝 You: ")
        if user_input.strip().lower() in ["sair", "exit", "quit"]:
            print("👋 Closing the chat.")
            break
        resposta = generate_text(user_input)
        print("🤖 Model:", resposta)
--- a/inference_qlora.py
+++ b/inference_qlora.py
@@ -0,0 +1,55 @@
 from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
 from peft import PeftModel
 import torch
 # Path of the trained LoRA adapter
 output_dir = "./qlora-output"
 base_model_name = "mistralai/Mistral-7B-Instruct-v0.2"
 # 4-bit quantization configuration
 bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
 )
 # Load the tokenizer
 tokenizer = AutoTokenizer.from_pretrained(output_dir, trust_remote_code=True)
 tokenizer.pad_token = tokenizer.eos_token
 # Load the base model with quantization
 base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
 )
 # Apply the trained LoRA weights
 model = PeftModel.from_pretrained(base_model, output_dir)
 model.eval()
 # Function to generate an answer
 def gerar_resposta(prompt, max_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,
            top_p=0.9,
            temperature=0.7
        )
    resposta = tokenizer.decode(output[0], skip_special_tokens=True)
    return resposta
 # Test example
 if __name__ == "__main__":
    while True:
        prompt = input("\nType your question (or 'sair' to quit): ")
        if prompt.strip().lower() in ["sair", "exit", "quit"]:
            break
        resultado = gerar_resposta(prompt)
        print("\n📎 Generated answer:")
        print(resultado)
--- a/merge_qlora.py
+++ b/merge_qlora.py
@@ -0,0 +1,20 @@
 from transformers import AutoModelForCausalLM, AutoTokenizer
 from peft import PeftModel
 base_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
 lora_path = "./qlora-output"
 output_path = "./merged_model"
 # Load the base model (no torch_dtype is given, so the library's default precision is used)
 base_model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)
 # Load the LoRA adapter and merge it into the base weights
 model = PeftModel.from_pretrained(base_model, lora_path)
 model = model.merge_and_unload()
 # Save the merged weights in multiple shards
 model.save_pretrained(output_path, max_shard_size="4GB")
 # Copy the tokenizer
 tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
 tokenizer.save_pretrained(output_path)
--- a/requirements.txt
+++ b/requirements.txt
@@ -0,0 +1,11 @@
 chainlit==1.0.500
 langchain==0.1.16
 langchain-community==0.0.34
 langchain-core==0.1.47
 python-dotenv
 pydantic==1.10.13
 transformers==4.41.2
 peft==0.10.0
 accelerate==0.30.1
 datasets
 --prefer-binary
 --extra-index-url https://jllllll.github.io/bitsandbytes-wheels/cu121/
 bitsandbytes