# Fine-tuning with QLoRA on Mistral 7B
## 🎯 Introduction
Training large language models (LLMs) from scratch is infeasible for most people, as it requires hundreds of GPUs and billions of tokens. However, it is possible to adapt pre-trained models (such as Mistral-7B or any other model) to specific use cases using lighter techniques such as LoRA and QLoRA. These allow training only a fraction of the original parameters, combined with 4-bit quantization, drastically reducing memory consumption.

### Real Applications

- Specialized chatbots: healthcare, legal, finance.
- Question and Answer (QA) systems over document bases.
- Business automation: agents that understand internal processes.
- Academic research: summarization and analysis of technical articles.

In this tutorial, you will learn how to fine-tune a Mistral 7B model with QLoRA using a local JSON dataset of questions and answers.

## 🧩 Key Concepts

- **BitsAndBytes (bnb):** library for 4-bit quantization, enabling training of large models on a single modern GPU.
- **LoRA (Low-Rank Adaptation):** technique that adds lightweight trainable layers without modifying the original weights.
- **QLoRA:** combination of LoRA + 4-bit quantization, resulting in efficient and accessible fine-tuning.
- **Trainer (Hugging Face):** abstraction that manages the dataset, batches, checkpoints, and metrics automatically.
- **JSON dataset:** in the expected format, each entry has a "text" field containing the instruction or training data.

## ⚙️ Prerequisites

Install required packages:

```bash
pip install torch transformers datasets peft bitsandbytes accelerate
```

Download the training dataset in JSON format:

```json
[
  {"text": "Question: What is QLoRA?\nAnswer: An efficient fine-tuning technique using 4-bit quantization."},
  {"text": "Question: Who created the Mistral model?\nAnswer: The company Mistral AI."}
]
```

Save as `./datasets/qa_dataset.json`.

Compatible GPU: preferably one with 16 GB of VRAM or more (e.g., RTX 3090, 4090, 5090).

## 📝 Step-by-step code

1. Main configurations

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
data_file_path = "./datasets/qa_dataset.json"
output_dir = "./qlora-output"
max_length = 512
```

Imports the libraries used throughout and defines the base model, dataset path, output directory, and token limit per sample.

2. Quantization with BitsAndBytes

```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16"
)
```

Activates 4-bit quantization (nf4) to reduce memory.
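
As a rough sanity check, the memory saved by 4-bit weights can be estimated with simple arithmetic (a sketch; the 7.3B parameter count is approximate and the figures cover the weights only, ignoring activations, gradients, and quantization overhead):

```python
# Approximate footprint of the weights alone for a ~7.3B-parameter model.
params = 7.3e9
fp16_gb = params * 2 / 1e9    # 2 bytes per weight in fp16
nf4_gb = params * 0.5 / 1e9   # 4 bits (half a byte) per weight in nf4
print(f"fp16: ~{fp16_gb:.1f} GB, nf4: ~{nf4_gb:.1f} GB")
```

This is why a 16 GB GPU can hold the quantized model with room left over for LoRA gradients and activations.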

3. Model and tokenizer loading

```python
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
```

Sets pad_token equal to eos_token (needed because Mistral's tokenizer defines no padding token) and loads the quantized model.

4. Preparing for LoRA

```python
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_config)
```

Configures LoRA layers only on the attention projections.
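
To get a feel for how few parameters this adds, the LoRA weight count can be worked out by hand (a sketch; the layer count and projection sizes below are the standard Mistral 7B dimensions, assumed here rather than read from the checkpoint):

```python
# Each adapted projection W (d_out x d_in) gains two low-rank matrices:
# A (r x d_in) and B (d_out x r).
hidden = 4096    # Mistral 7B hidden size
kv_dim = 1024    # k_proj / v_proj output dim (8 KV heads x head dim 128)
layers = 32
r = 8

def lora_params(d_in, d_out, r):
    return r * d_in + d_out * r

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv_dim, r)  # k_proj
    + lora_params(hidden, kv_dim, r)  # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
total = per_layer * layers
print(f"{total:,} trainable LoRA parameters")  # a few million vs ~7 billion
```

That is roughly 0.1% of the base model's parameters; `model.print_trainable_parameters()` reports the exact figure for your setup.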

5. Loading and tokenizing the dataset

```python
dataset = load_dataset("json", data_files=data_file_path, split="train")

def tokenize(example):
    tokenized = tokenizer(
        example["text"],
        truncation=True,
        max_length=max_length,
        padding="max_length"
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

tokenized_dataset = dataset.map(tokenize, remove_columns=dataset.column_names)
```

Tokenizes each entry and sets labels equal to input_ids, which is what causal language modeling expects.
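
One subtlety: with `padding="max_length"` and labels copied from input_ids, the loss is also computed on padding positions. A common refinement is to set those labels to -100, the index Hugging Face's loss function ignores. A minimal sketch of the idea on plain lists (the token ids are made up for illustration):

```python
def mask_labels(input_ids, attention_mask, ignore_index=-100):
    # Keep the label where the attention mask is 1; ignore padded positions.
    return [tok if m == 1 else ignore_index
            for tok, m in zip(input_ids, attention_mask)]

ids = [1, 523, 28766, 2, 2, 2]   # hypothetical ids, padded with eos (id 2)
mask = [1, 1, 1, 1, 0, 0]
print(mask_labels(ids, mask))    # [1, 523, 28766, 2, -100, -100]
```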

6. Data Collator

```python
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```

Ensures consistent batches without masking (MLM disabled).

7. Training arguments

```python
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none"
)
```

Uses a small per-device batch size (2) with gradient accumulation for an effective batch of 8, and enables fp16 to reduce memory.
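
The effective batch size and the number of optimizer updates per epoch follow from simple arithmetic (sketch; the 1,000-example dataset size is hypothetical):

```python
per_device_bs = 2
grad_accum = 4
dataset_size = 1000  # hypothetical dataset size

effective_batch = per_device_bs * grad_accum         # gradients from 8 samples per update
updates_per_epoch = dataset_size // effective_batch  # optimizer steps in one epoch
print(effective_batch, updates_per_epoch)
```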

8. Initializing the Trainer

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)
```

The Trainer organizes training, checkpoints, and logging.

9. Running the training

```python
trainer.train()
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```

Saves the adapted model (LoRA adapter and tokenizer) in `./qlora-output`.

### 🚀 Expected result

After training, you will have:

- An adapted model for your domain (`./qlora-output`).
- The ability to run specific inferences using:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="./qlora-output", tokenizer="./qlora-output")
print(pipe("Question: What is QLoRA?\nAnswer:")[0]["generated_text"])
```

## Complete QLoRA Pipeline

So far we have seen model configuration, quantization, LoRA application, and dataset preparation. Now let's understand what happens afterwards.

1. Data Collator

```python
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```

### 📌 Why use it?

The Data Collator is responsible for dynamically building batches during training. Causal models do not use MLM (Masked Language Modeling) as BERT does, so we set mlm=False.
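
Why can the labels simply equal the input_ids? Because a causal LM's loss shifts them internally: the logits at position t are scored against the token at position t+1. A toy illustration with made-up token ids:

```python
input_ids = [5, 9, 12, 7]
labels = input_ids.copy()  # exactly what our tokenize() does

# Internally, position t's prediction is compared with label t+1:
pairs = list(zip(input_ids[:-1], labels[1:]))
print(pairs)  # [(5, 9), (9, 12), (12, 7)]
```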

2. Training Arguments

```python
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none"
)
```

### 📌 Why configure this?

- per_device_train_batch_size=2 → limits batch size (GPU constraint).
- gradient_accumulation_steps=4 → accumulates gradients before each update → effective batch = 2 x 4 = 8.
- num_train_epochs=3 → passes over the dataset 3 times.
- learning_rate=2e-4 → an adequate LR for LoRA.
- fp16=True → saves VRAM.
- logging_steps=10 → logs every 10 steps.
- save_strategy="epoch" → saves a checkpoint after each epoch.
- report_to="none" → avoids external integrations.

3. Defining the Trainer

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)
```

### 📌 Why use Trainer?

Trainer automates:

- Forward pass (batch propagation).
- Backpropagation (gradient calculation).
- Weight optimization.
- Logging and checkpoints.

4. Running Training

```python
trainer.train()
```

### 📌 What happens here?

Trainer handles data loading, batching, loss calculation, and weight updates. Only the LoRA layers are updated; the quantized base weights stay frozen.

5. Saving the Adapted Model

```python
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```

### 📌 Why save?

The `./qlora-output/` directory will contain:

- LoRA weights (adapter_model.bin or adapter_model.safetensors).
- PEFT config (adapter_config.json).
- The adapted tokenizer.

This output can be loaded later for inference or further training.

## Interactive Inference

After training with QLoRA, we usually merge the LoRA weights into the base model. The code below shows how to load the merged model and use it interactively.

```python
# -*- coding: utf-8 -*-
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Target device: "auto", "mps", or "cuda"
gpu = "cuda"

MODEL_PATH = "./merged_model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map=gpu,
    offload_folder="./offload",
    torch_dtype=torch.float16
)
model.eval()
```

### Function to generate

```python
def generate_answer(prompt, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs = {k: v.to(gpu) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False  # greedy decoding: faster and deterministic
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

### Interactive loop

```python
if __name__ == "__main__":
    print("🤖 Model loaded! Type your question or 'exit' to quit.")
    while True:
        question = input("\n📝 Instruction: ")
        if question.strip().lower() in ["exit", "quit"]:
            break

        formatted = f"### Instruction:\n{question}\n\n### Answer:"
        answer = generate_answer(formatted)
        print("\n📎 Answer:")
        print(answer)
```

## 📊 Inference Flow

```mermaid
flowchart TD
    A[User types question] --> B[Tokenization]
    B --> C[QLoRA merged model]
    C --> D[Token decoding]
    D --> E[Answer text]
    E --> A
```

## ✅ Conclusion of Inference

- This stage validates whether the model learned the desired domain.
- It can be expanded to:
  - Flask/FastAPI APIs.
  - Chatbots (web interfaces).
  - Corporate integration (API Gateway, MCP servers).

## 📊 Full Pipeline Summary

```mermaid
flowchart TD
    A[JSON Dataset] --> B[Tokenization]
    B --> C[Tokenized Dataset]
    C --> D[Data Collator]
    D --> E[Trainer]
    E --> F[QLoRA Training]
    F --> G[Adapted Model]
    G --> H[Inference or Deploy]
```

## 🔄 Post-training: Merge and Unified Model Export

After training with QLoRA, the adapter weights are saved in the `./qlora-output/` folder. However, these are not directly usable for inference in production. The next step is to **merge** the LoRA adapters with the base model and then export them into a format ready for inference.

### 1. Merge Stage

The merging process fuses the LoRA adapter weights into the original Mistral 7B model. This produces a `./merged_model/` directory that contains the complete model weights in shards (e.g., `pytorch_model-00001-of-000xx.bin`).

### 2. Unification into Hugging Face format

To simplify inference, we can re-export the model into a **Hugging Face-compatible format** that uses the `safetensors` standard. In this step, we obtain a folder such as `./final_model_single/` containing:

- `model.safetensors` (the consolidated weights, possibly in one large file or multiple shards depending on size)
- `config.json`, `generation_config.json`
- Tokenizer files (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, etc.)

⚠️ Note: although this looks like a "single model", Hugging Face does not bundle tokenizer and weights into a single binary. The folder still contains multiple files. If you want a **truly single-file format** (weights + tokenizer + metadata in one blob), you should convert the model to **GGUF**, which is the standard used by `llama.cpp` and `Ollama`.

### 3. Alternative Inference with Unified Files

Once exported, you can load the model for inference like this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "./final_model_single"

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
```

```mermaid
flowchart TD
    A[Training Output<br>qlora-output/] --> B[Merge Step<br>merged_model/]
    B --> C[Unification<br>final_model_single/]
    C --> D[Inference<br>Transformers / Pipeline]
    C --> E[Optional Conversion<br>GGUF for llama.cpp/Ollama]
```

## 🗜️ Conversion to GGUF (Optional)

Although the Hugging Face format (`safetensors` + config + tokenizer) is widely supported, it still requires multiple files. If you prefer a **true single-file model** that bundles weights, tokenizer, and metadata, you can convert the model to **GGUF** — the format used by `llama.cpp`, `Ollama`, and other lightweight inference engines.

### 1. Install llama.cpp

Clone the repository and build it:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```

### 2. Convert Hugging Face model to GGUF

Use the conversion script provided in llama.cpp:

```bash
python3 convert.py ./final_model_single --outfile mistral-qlora.gguf
```

This generates a single GGUF file (e.g., mistral-qlora.gguf) containing all weights and tokenizer data. In recent llama.cpp checkouts the script is named `convert_hf_to_gguf.py`.

### 3. Run inference with llama.cpp

You can now run the model with:

```bash
./main -m mistral-qlora.gguf -p "Explain what OMOP CDM is in a few sentences."
```

(Newer llama.cpp builds name this binary `llama-cli`.)

### 4. Run with Ollama

If you prefer Ollama, place the .gguf file in your Ollama models directory and create a Modelfile:

```docker
FROM mistral-qlora.gguf
```

Then run:

```bash
ollama create mistral-qlora -f Modelfile
ollama run mistral-qlora
```

> **Note:**
> GGUF models are typically quantized (e.g., Q4_K_M, Q5_K_M, Q8_0) to reduce size and run efficiently on consumer GPUs or CPUs.
> The resulting .gguf file will be much smaller (often 4–8 GB) than the roughly 14 GB FP16 Hugging Face export.

## ✅ Final Conclusion

This tutorial showed:

- How to apply QLoRA to large models.
- How to train on consumer GPUs with reduced memory.
- How to adapt Mistral 7B for specialized Q&A.

👉 This technique can be extended to other datasets and domains such as banking, legal, or healthcare.

## References

- [Deploy NVIDIA RTX Virtual Workstation on Oracle Cloud Infrastructure](https://docs.oracle.com/en/learn/deploy-nvidia-rtx-oci/index.html)

## Acknowledgments

- **Author** - Cristiano Hoshikawa (Oracle LAD A-Team Solution Engineer)