first commit

2025-10-20 20:31:55 -03:00
commit 781949dccb
7 changed files with 19274 additions and 0 deletions

516
README.md Normal file

@@ -0,0 +1,516 @@
# Fine-tuning with QLoRA on Mistral 7B
## 🎯 Introduction
Training large language models (LLMs) from scratch is infeasible for most people, as it requires hundreds of GPUs and billions of tokens. However, it is possible to adapt pre-trained models (such as Mistral 7B) to specific use cases with lighter techniques such as LoRA and QLoRA, which train only a small set of additional low-rank parameters while the 4-bit-quantized base weights stay frozen, drastically reducing memory consumption.
### Real Applications
- Specialized chatbots: healthcare, legal, finance.
- Question and Answer (QA) systems over document bases.
- Business automation: agents that understand internal processes.
- Academic research: summarization and analysis of technical articles.
In this tutorial, you will learn how to fine-tune a Mistral 7B model with QLoRA using a local JSON dataset of questions and answers.
## 🧩 Key Concepts
- **BitsAndBytes (bnb):** library for 4-bit quantization, enabling training of large models on a single modern GPU.
- **LoRA (Low-Rank Adaptation):** technique that adds lightweight learning layers without modifying the original weights.
- **QLoRA:** combination of LoRA + 4-bit quantization, resulting in efficient and accessible fine-tuning.
- **Trainer (Hugging Face):** abstraction that manages dataset, batches, checkpoints, and metrics automatically.
- **JSON dataset:** in the expected format, each entry should have a "text" field containing the instruction or training data.
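To make the LoRA idea concrete, here is a tiny pure-Python sketch (toy 2×2 shapes, not Mistral's real projection sizes): the adapted weight is the frozen `W` plus the scaled low-rank product `(alpha / r) * B @ A`.

```python
# Toy LoRA update: W' = W + (alpha / r) * (B @ A)
# Dimensions are illustrative only (d=2, r=1).
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, r, alpha = 2, 1, 16
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (d x d)
A = [[0.1, 0.2]]              # trainable low-rank factor (r x d)
B = [[0.5], [0.0]]            # trainable low-rank factor (d x r)

delta = matmul(B, A)          # rank-1 update (d x d)
scale = alpha / r
W_adapted = [[w + scale * u for w, u in zip(w_row, u_row)]
             for w_row, u_row in zip(W, delta)]
print(W_adapted)  # base weight plus the scaled low-rank correction
```

Only `A` and `B` (2·d·r values) are trained instead of all d² entries of `W`, which is why LoRA checkpoints stay small.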
## ⚙️ Prerequisites
Install required packages:
```bash
pip install torch transformers datasets peft bitsandbytes accelerate
```
Create the training dataset in JSON format, for example:
```json
[
  {"text": "Question: What is QLoRA?\nAnswer: An efficient fine-tuning technique using 4-bit quantization."},
  {"text": "Question: Who created the Mistral model?\nAnswer: The company Mistral AI."}
]
```
Save it as `./datasets/qa_dataset.json`.
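A quick stdlib sanity check that the file is valid JSON and matches the expected schema (this snippet also writes the two sample entries above; the path matches the tutorial):

```python
import json
import os

os.makedirs("./datasets", exist_ok=True)
samples = [
    {"text": "Question: What is QLoRA?\nAnswer: An efficient fine-tuning technique using 4-bit quantization."},
    {"text": "Question: Who created the Mistral model?\nAnswer: The company Mistral AI."},
]
with open("./datasets/qa_dataset.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)

# Validate: a list of objects, each with a non-empty "text" field.
with open("./datasets/qa_dataset.json", encoding="utf-8") as f:
    data = json.load(f)
assert isinstance(data, list) and all(e.get("text") for e in data)
print(f"{len(data)} valid samples")
```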
Compatible GPU: preferably one with 16 GB of VRAM or more (e.g., RTX 3090, 4090, 5090).
## 📝 Step-by-step code
1. Main configurations
```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
data_file_path = "./datasets/qa_dataset.json"
output_dir = "./qlora-output"
max_length = 512
```
Defines the shared imports, base model, dataset path, output directory, and token limit per sample.
2. Quantization with BitsAndBytes
```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16"
)
```
Activates 4-bit quantization (nf4) to reduce memory.
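Back-of-the-envelope numbers on why NF4 matters (weights only; activations, the LoRA parameters, and optimizer state add more on top):

```python
# Rough memory footprint of 7B weights at different precisions.
params = 7e9
gib = 1024 ** 3
fp16_gib = params * 2 / gib    # 2 bytes per weight
nf4_gib = params * 0.5 / gib   # 4 bits per weight
print(f"FP16 weights: ~{fp16_gib:.1f} GiB")  # ~13.0 GiB
print(f"NF4 weights:  ~{nf4_gib:.1f} GiB")   # ~3.3 GiB
```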
3. Model and tokenizer loading
```python
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
```
Sets pad_token equal to eos_token (Mistral's tokenizer ships no dedicated pad token, and batching needs one).
Loads the model with 4-bit quantization applied.
4. Preparing for LoRA
```python
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_config)
```
Configures LoRA layers only on attention projections.
5. Loading and tokenizing the dataset
```python
dataset = load_dataset("json", data_files=data_file_path, split="train")

def tokenize(example):
    tokenized = tokenizer(
        example["text"],
        truncation=True,
        max_length=max_length,
        padding="max_length"
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

tokenized_dataset = dataset.map(tokenize, remove_columns=dataset.column_names)
```
Tokenizes entries, sets labels equal to input_ids.
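One refinement the snippet above omits: with `padding="max_length"`, pad positions also count toward the loss unless their labels are set to -100, the index that PyTorch's cross-entropy ignores. A pure-Python sketch (hypothetical token ids; note that because pad reuses eos here, genuine eos tokens would be masked too):

```python
PAD_TOKEN_ID = 2  # assumed pad id (eos reused as pad in this setup)

def mask_pad_labels(input_ids, pad_token_id):
    # Replace pad positions with -100 so the loss ignores them.
    return [tok if tok != pad_token_id else -100 for tok in input_ids]

input_ids = [581, 774, 902, 2, 2, 2]  # hypothetical ids followed by padding
print(mask_pad_labels(input_ids, PAD_TOKEN_ID))  # → [581, 774, 902, -100, -100, -100]
```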
6. Data Collator
```python
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```
Ensures consistent batches without masking (MLM disabled).
7. Training arguments
```python
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none"
)
```
Small batch size (2), accumulating gradients to effective batch of 8.
fp16 enabled to reduce memory.
8. Initializing the Trainer
```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)
```
Trainer organizes training, checkpoints, and logging.
9. Running the training
```python
trainer.train()
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```
Saves the trained LoRA adapter and tokenizer in `./qlora-output` (with PEFT, `save_pretrained` writes only the adapter weights, not the full model).
### 🚀 Expected result
After training, you will have:
- A LoRA adapter tuned to your domain (`./qlora-output`).
- The ability to run domain-specific inference, for example:
```python
from transformers import pipeline
pipe = pipeline("text-generation", model="./qlora-output", tokenizer="./qlora-output")
print(pipe("Question: What is QLoRA?\nAnswer:")[0]["generated_text"])
```
## Complete QLoRA Pipeline
So far we have seen model configuration, quantization, LoRA application, and dataset preparation.
Now let's understand what happens afterwards:
1. Data Collator
```python
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```
### 📌 Why use it?
The Data Collator is responsible for dynamically building batches during training.
In causal models, we do not use MLM (Masked Language Modeling) as in BERT, so we set mlm=False.
2. Training Arguments
```python
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none"
)
```
### 📌 Why configure this?
- per_device_train_batch_size=2 → limits batch size (GPU constraint).
- gradient_accumulation_steps=4 → accumulates gradients before each update → effective batch = 2 × 4 = 8.
- num_train_epochs=3 → trains dataset 3 times.
- learning_rate=2e-4 → adequate LR for LoRA.
- fp16=True → saves VRAM.
- logging_steps=10 → logs every 10 batches.
- save_strategy="epoch" → saves checkpoint after each epoch.
- report_to="none" → avoids external integrations.
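These settings determine how many optimizer steps the run takes. A quick check, assuming a hypothetical dataset of 1,000 examples:

```python
import math

num_examples = 1000  # hypothetical dataset size
per_device_batch = 2
grad_accum = 4
epochs = 3

effective_batch = per_device_batch * grad_accum       # 8
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
print(effective_batch, steps_per_epoch, total_steps)  # → 8 125 375
```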
3. Defining Trainer
```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)
```
### 📌 Why use Trainer?
Trainer automates:
- Forward pass (batch propagation).
- Backpropagation (gradient calculation).
- Weight optimization.
- Logging and checkpoints.
4. Running Training
```python
trainer.train()
```
### 📌 What happens here?
The Trainer iterates over batches, computes the loss, and applies the weight updates.
Only the LoRA layers are updated; the quantized base weights remain frozen.
5. Saving Adapted Model
```python
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```
### 📌 Why save?
The ./qlora-output/ directory will contain:
- LoRA weights (`adapter_model.safetensors`, or `adapter_model.bin` in older peft versions).
- PEFT config (adapter_config.json).
- Adapted tokenizer.
This output can be loaded later for inference or further training.
## Interactive Inference
After training with QLoRA, we usually merge LoRA weights into the base model.
The code below shows how to load the merged model and use it interactively.
```python
# -*- coding: utf-8 -*-
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

gpu = "cuda"  # device: "cuda", "mps", or "cpu"
MODEL_PATH = "./merged_model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map=gpu,
    offload_folder="./offload",
    torch_dtype=torch.float16
)
model.eval()
```
### Function to generate
```python
def generate_answer(prompt, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False  # greedy decoding: faster and deterministic
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
### Interactive loop
```python
if __name__ == "__main__":
    print("🤖 Model loaded! Type your question or 'exit' to quit.")
    while True:
        question = input("\n📝 Instruction: ")
        if question.strip().lower() in ["exit", "quit"]:
            break
        formatted = f"Question: {question}\nAnswer:"  # match the training data format
        answer = generate_answer(formatted)
        print("\n📎 Answer:")
        print(answer)
```
## 📊 Inference Flow
```mermaid
flowchart TD
A[User types question] --> B[Tokenization]
B --> C[QLoRA merged model]
C --> D[Token decoding]
D --> E[Answer text]
E --> A
```
## ✅ Conclusion of Inference
- This stage validates if the model learned the desired domain.
- Can be expanded to:
- Flask/FastAPI APIs.
- Chatbots (web interfaces).
- Corporate integration (API Gateway, MCP servers).
📊 Full Pipeline Summary
```mermaid
flowchart TD
A[JSON Dataset] --> B[Tokenization]
B --> C[Tokenized Dataset]
C --> D[Data Collator]
D --> E[Trainer]
E --> F[QLoRA Training]
F --> G[Adapted Model]
G --> H[Inference or Deploy]
```
## 🔄 Post-training: Merge and Unified Model Export
After training with QLoRA, the adapter weights are saved in the `./qlora-output/` folder.
However, these are not directly usable for inference in production. The next step is to **merge** the LoRA adapters with the base model and then export them into a format ready for inference.
### 1. Merge Stage
The merging process fuses the LoRA adapter weights into the original Mistral 7B model.
This produces a `./merged_model/` directory that contains the complete model weights in shards (e.g., `pytorch_model-00001-of-000xx.bin`).
### 2. Unification into Hugging Face format
To simplify inference, we can re-export the model into a **Hugging Face-compatible format** that uses the `safetensors` standard.
In this step, we obtain a folder such as `./final_model_single/` containing:
- `model.safetensors` (the consolidated weights, possibly in one large file or multiple shards depending on size)
- `config.json`, `generation_config.json`
- Tokenizer files (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, etc.)
⚠️ Note: although this looks like a "single model", Hugging Face does not bundle tokenizer and weights into a single binary. The folder still contains multiple files.
If you want a **truly single-file format** (weights + tokenizer + metadata in one blob), you should convert the model to **GGUF**, which is the standard used by `llama.cpp` and `Ollama`.
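For the curious, a GGUF file announces itself with a fixed header, which is what lets tools like `llama.cpp` and `Ollama` treat it as one self-describing blob. A minimal sketch that checks the magic bytes and version (per the GGUF spec the header continues with tensor and metadata counts, omitted here; the demo file is synthetic):

```python
import struct

def read_gguf_version(path):
    # A GGUF file starts with the ASCII magic "GGUF" followed by a
    # little-endian uint32 format version.
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        (version,) = struct.unpack("<I", f.read(4))
        return version

# Synthetic demo header (a real file continues with counts and tensor data).
with open("demo.gguf", "wb") as f:
    f.write(b"GGUF" + struct.pack("<I", 3))
print(read_gguf_version("demo.gguf"))  # → 3
```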
### 3. Alternative Inference with Unified Files
Once exported, you can load the model for inference like this:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model_dir = "./final_model_single"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype="float16",
    device_map="auto",
    trust_remote_code=True
)
```
```mermaid
flowchart TD
A[Training Output<br>qlora-output/] --> B[Merge Step<br>merged_model/]
B --> C[Unification<br>final_model_single/]
C --> D[Inference<br>Transformers / Pipeline]
C --> E[Optional Conversion<br>GGUF for llama.cpp/Ollama]
```
## 🗜️ Conversion to GGUF (Optional)
Although the Hugging Face format (`safetensors` + config + tokenizer) is widely supported, it still requires multiple files.
If you prefer a **true single-file model** that bundles weights, tokenizer, and metadata, you can convert the model to **GGUF** — the format used by `llama.cpp`, `Ollama`, and other lightweight inference engines.
### 1. Install llama.cpp
Clone the repository and build it:
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```
### 2. Convert Hugging Face model to GGUF
Use the conversion script shipped with llama.cpp (older checkouts name it `convert.py`; newer ones provide `convert_hf_to_gguf.py`):
```bash
python3 convert.py ./final_model_single --outfile mistral-qlora.gguf
```
This will generate a single GGUF file (e.g., mistral-qlora.gguf) containing all weights and tokenizer data.
### 3. Run inference with llama.cpp
You can now run the model with (newer llama.cpp builds name the binary `llama-cli` instead of `main`):
```bash
./main -m mistral-qlora.gguf -p "Explain what OMOP CDM is in a few sentences."
```
### 4. Run with Ollama
If you prefer Ollama, place the .gguf file in your Ollama models directory and create a Modelfile:
```docker
FROM mistral-qlora.gguf
```
Then run:
```bash
ollama create mistral-qlora -f Modelfile
ollama run mistral-qlora
```
>**Note:**
GGUF models are usually quantized (e.g., Q4_K_M, Q5_K_M, Q8_0) to reduce size and run efficiently on consumer GPUs or CPUs.
A quantized `.gguf` file is much smaller (typically 4-8 GB for a 7B model, depending on the quantization level) than the roughly 14 GB FP16 Hugging Face export.
## ✅ Final Conclusion
This tutorial showed:
- How to apply QLoRA on large models.
- How to train on consumer GPUs with reduced memory.
- How to adapt Mistral 7B for specialized Q&A.
👉 This technique can be extended to other datasets and domains such as banking, legal, or healthcare.
## References
[Deploy NVIDIA RTX Virtual Workstation on Oracle Cloud Infrastructure](https://docs.oracle.com/en/learn/deploy-nvidia-rtx-oci/index.html)
## Acknowledgments
- **Author** - Cristiano Hoshikawa (Oracle LAD A-Team Solution Engineer)

18590
datasets/qa_dataset.json Normal file

File diff suppressed because it is too large

34
export_single_file.py Normal file

@@ -0,0 +1,34 @@
# -*- coding: utf-8 -*-
"""
Exports the merged LoRA model as a single .safetensors file + tokenizer
"""
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

# Directories
src_dir = "./merged_model"        # where the merged model (shards) already lives
dst_dir = "./final_model_single"  # consolidated destination
base_model_name = "mistralai/Mistral-7B-Instruct-v0.2"

os.makedirs(dst_dir, exist_ok=True)

print("🔹 Loading merged model...")
model = AutoModelForCausalLM.from_pretrained(
    src_dir,
    torch_dtype="auto",  # more stable than hard-coding float16
    device_map="cpu"     # force CPU loading (avoids exhausting GPU memory)
)

print("🔹 Saving as a single safetensors file...")
model.save_pretrained(
    dst_dir,
    safe_serialization=True,  # produces model.safetensors
    max_shard_size="30GB"     # large enough to fit everything in one file
)

print("🔹 Saving tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.save_pretrained(dst_dir)
print(f"✅ Single file saved at {dst_dir}/model.safetensors")

48
infer_single_file.py Normal file

@@ -0,0 +1,48 @@
# -*- coding: utf-8 -*-
"""
Interactive terminal chat with the model consolidated into .safetensors
"""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path to the consolidated model
model_dir = "./final_model_single"

print("🔹 Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("🔹 Loading consolidated model...")
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",          # moves to GPU if available
    torch_dtype=torch.float16,  # FP16 (less memory)
    trust_remote_code=True
)
model.eval()

# Inference function
def generate_text(prompt, max_new_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            temperature=0.7
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Chat loop
if __name__ == "__main__":
    print("💬 Chat started! Type your question (or 'exit' to quit).")
    while True:
        user_input = input("📝 You: ")
        if user_input.strip().lower() in ["sair", "exit", "quit"]:
            print("👋 Ending chat.")
            break
        resposta = generate_text(user_input)
        print("🤖 Model:", resposta)

55
inference_qlora.py Normal file

@@ -0,0 +1,55 @@
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Path to the trained LoRA model
output_dir = "./qlora-output"
base_model_name = "mistralai/Mistral-7B-Instruct-v0.2"

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(output_dir, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Load the base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Apply the trained LoRA weights
model = PeftModel.from_pretrained(base_model, output_dir)
model.eval()

# Answer-generation function
def gerar_resposta(prompt, max_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,
            top_p=0.9,
            temperature=0.7
        )
    resposta = tokenizer.decode(output[0], skip_special_tokens=True)
    return resposta

# Quick interactive test
if __name__ == "__main__":
    while True:
        prompt = input("\nType your question (or 'sair' to quit): ")
        if prompt.lower() == "sair":
            break
        resultado = gerar_resposta(prompt)
        print("\n📎 Generated answer:")
        print(resultado)

20
merge_qlora.py Normal file

@@ -0,0 +1,20 @@
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
lora_path = "./qlora-output"
output_path = "./merged_model"

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)

# Load the LoRA adapter and merge it into the base weights
model = PeftModel.from_pretrained(base_model, lora_path)
model = model.merge_and_unload()

# Save the merged weights in multiple shards
model.save_pretrained(output_path, max_shard_size="4GB")

# Copy the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
tokenizer.save_pretrained(output_path)

11
requirements.txt Normal file

@@ -0,0 +1,11 @@
chainlit==1.0.500
langchain==0.1.16
langchain-community==0.0.34
langchain-core==0.1.47
python-dotenv
pydantic==1.10.13
transformers==4.41.2
peft==0.10.0
accelerate==0.30.1
datasets
--prefer-binary
--extra-index-url https://jllllll.github.io/bitsandbytes-wheels/cu121/
bitsandbytes