> Mirror of https://github.com/hoshikawa2/qlora_training.git, synced 2026-03-03 16:09:36 +00:00.

# Fine-tuning with QLoRA on Mistral 7B

## 🎯 Introduction

Training large language models (LLMs) from scratch is infeasible for most people, as it requires hundreds of GPUs and billions of tokens. However, pre-trained models (such as Mistral-7B or any other model) can be adapted to specific use cases with lighter techniques such as LoRA and QLoRA, which train only a fraction of the original parameters under 4-bit quantization and thus drastically reduce memory consumption.

### Real Applications

- Specialized chatbots: healthcare, legal, finance.
- Question-answering (QA) systems over document bases.
- Business automation: agents that understand internal processes.
- Academic research: summarization and analysis of technical articles.

In this tutorial, you will learn how to fine-tune a Mistral 7B model with QLoRA using a local JSON dataset of questions and answers.

## 🧩 Key Concepts

- **BitsAndBytes (bnb):** a library for 4-bit quantization, enabling training of large models on a single modern GPU.
- **LoRA (Low-Rank Adaptation):** a technique that adds lightweight trainable layers without modifying the original weights.
- **QLoRA:** the combination of LoRA + 4-bit quantization, resulting in efficient and accessible fine-tuning.
- **Trainer (Hugging Face):** an abstraction that manages the dataset, batches, checkpoints, and metrics automatically.
- **JSON dataset:** in the expected format, each entry has a "text" field containing the instruction or training data.
## ⚙️ Prerequisites

Install the required packages:

```bash
pip install torch transformers datasets peft bitsandbytes accelerate
```

Download the training dataset in JSON format:

```json
[
  {"text": "Question: What is QLoRA?\nAnswer: An efficient fine-tuning technique using 4-bit quantization."},
  {"text": "Question: Who created the Mistral model?\nAnswer: The company Mistral AI."}
]
```

Save it as `./datasets/qa_dataset.json`.

Compatible GPU: preferably one with 16 GB of VRAM or more (e.g., RTX 3090, 4090, 5090).
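
If you are building the dataset yourself, a short script can generate it. This is a sketch, not part of the original tutorial; the two entries simply mirror the JSON example above, and the path is the one the tutorial expects:

```python
import json
import os

# Two toy entries in the "text" format the tutorial expects
examples = [
    {"text": "Question: What is QLoRA?\nAnswer: An efficient fine-tuning technique using 4-bit quantization."},
    {"text": "Question: Who created the Mistral model?\nAnswer: The company Mistral AI."},
]

os.makedirs("./datasets", exist_ok=True)
with open("./datasets/qa_dataset.json", "w", encoding="utf-8") as f:
    json.dump(examples, f, ensure_ascii=False, indent=2)
```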

## 📝 Step-by-step code

1. Main configurations

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
data_file_path = "./datasets/qa_dataset.json"
output_dir = "./qlora-output"
max_length = 512
```

Imports everything used in the following steps and defines the base model, dataset path, output directory, and token limit per sample.

2. Quantization with BitsAndBytes

```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16"
)
```

Activates 4-bit NF4 quantization to reduce memory.
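
To see why this matters, here is a back-of-the-envelope calculation (an illustration, assuming roughly 7.2B parameters for Mistral 7B) of the raw weight memory at different precisions:

```python
# Approximate weight memory for a ~7.2B-parameter model (weights only,
# ignoring activations, optimizer state, and quantization overhead)
params = 7_200_000_000

fp32_gb = params * 4 / 1024**3    # 4 bytes per weight
fp16_gb = params * 2 / 1024**3    # 2 bytes per weight
nf4_gb  = params * 0.5 / 1024**3  # 4 bits per weight

print(f"fp32: {fp32_gb:.1f} GB, fp16: {fp16_gb:.1f} GB, nf4: {nf4_gb:.1f} GB")
```

The 4-bit figure is what makes a single 16 GB consumer GPU viable for this model.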

3. Model and tokenizer loading

```python
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
```

Sets pad_token equal to eos_token (needed because causal models usually ship without a padding token), then loads the quantized model.

4. Preparing for LoRA

```python
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_config)
```

Configures LoRA layers only on the attention projections.
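
As a rough sanity check of how few parameters this adds, we can count the LoRA matrices by hand. This sketch assumes Mistral 7B's published dimensions (32 layers, hidden size 4096, 1024-dimensional k/v projections due to grouped-query attention):

```python
# Each LoRA adapter adds A (r x d_in) and B (d_out x r) per target module,
# i.e., r * (d_in + d_out) trainable parameters.
r = 8
layers = 32
# (d_in, d_out) for q_proj, k_proj, v_proj, o_proj in Mistral 7B
projections = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]

trainable = layers * sum(r * (d_in + d_out) for d_in, d_out in projections)
print(f"LoRA trainable parameters: {trainable:,}")  # ~6.8M vs ~7.2B total
```

Under 0.1% of the model is trained; PEFT's `model.print_trainable_parameters()` reports the exact figure at runtime.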

5. Loading and tokenizing the dataset

```python
dataset = load_dataset("json", data_files=data_file_path, split="train")

def tokenize(example):
    tokenized = tokenizer(
        example["text"],
        truncation=True,
        max_length=max_length,
        padding="max_length"
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

tokenized_dataset = dataset.map(tokenize, remove_columns=dataset.column_names)
```

Tokenizes the entries and sets labels equal to input_ids.
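
Why labels = input_ids? In causal language modeling the model predicts token t+1 from tokens up to t, and the Trainer shifts the labels internally. A toy illustration of that shift (the token IDs here are made up):

```python
# Toy sequence of token IDs (hypothetical values)
input_ids = [101, 7, 42, 9, 102]
labels = input_ids.copy()  # same sequence, as in the tokenize() function above

# Inside the model, the loss pairs each position with the NEXT label:
context_tokens = input_ids[:-1]  # model sees tokens 0..n-1
shifted_labels = labels[1:]      # and must predict tokens 1..n

pairs = list(zip(context_tokens, shifted_labels))
print(pairs)  # [(101, 7), (7, 42), (42, 9), (9, 102)]
```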

6. Data Collator

```python
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```

Ensures consistent batches without masking (MLM disabled).

7. Training arguments

```python
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none"
)
```

Small batch size (2), accumulating gradients to an effective batch of 8. fp16 is enabled to reduce memory.
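
The interaction between batch size and gradient accumulation determines how many optimizer updates one epoch performs. A quick sketch, assuming a hypothetical dataset of 1,000 examples:

```python
import math

dataset_size = 1_000  # hypothetical number of training examples
per_device_batch = 2
grad_accum = 4

effective_batch = per_device_batch * grad_accum                 # 8
batches_per_epoch = math.ceil(dataset_size / per_device_batch)  # 500
updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)   # 125

print(effective_batch, batches_per_epoch, updates_per_epoch)
```

Accumulation trades wall-clock time for memory: gradients from 4 small batches are summed before a single weight update, as if a batch of 8 had been used.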

8. Initializing the Trainer

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)
```

The Trainer organizes training, checkpoints, and logging.

9. Running the training

```python
trainer.train()
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```

Saves the adapted model in `./qlora-output`.

### 🚀 Expected result

After training, you will have:

- An adapted model for your domain (`./qlora-output`).
- The ability to run domain-specific inferences using:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="./qlora-output", tokenizer="./qlora-output")
print(pipe("Question: What is QLoRA?\nAnswer:")[0]["generated_text"])
```

## Complete QLoRA Pipeline

So far we have seen model configuration, quantization, LoRA application, and dataset preparation. Now let's understand what happens afterwards:

1. Data Collator

```python
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```

### 📌 Why use it?

The Data Collator is responsible for dynamically building batches during training. In causal models, we do not use MLM (Masked Language Modeling) as in BERT, so we set mlm=False.
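
The collator also pads each batch and keeps padding out of the loss. A minimal pure-Python sketch of that behavior (an illustration only, not the actual Hugging Face implementation inside `DataCollatorForLanguageModeling`):

```python
def toy_causal_collate(batch, pad_id=0):
    """Pad sequences to the longest in the batch; label padding as -100
    so the loss function ignores those positions."""
    max_len = max(len(seq) for seq in batch)
    input_ids, labels = [], []
    for seq in batch:
        pad = [pad_id] * (max_len - len(seq))
        input_ids.append(seq + pad)
        labels.append(seq + [-100] * len(pad))
    return {"input_ids": input_ids, "labels": labels}

out = toy_causal_collate([[5, 6, 7], [8, 9]])
print(out["labels"])  # [[5, 6, 7], [8, 9, -100]]
```

The special value -100 is the label that PyTorch's cross-entropy loss ignores by default.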

2. Training Arguments

```python
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none"
)
```

### 📌 Why configure this?

- per_device_train_batch_size=2 → limits batch size (GPU constraint).
- gradient_accumulation_steps=4 → accumulates gradients before each update → effective batch = 2 x 4 = 8.
- num_train_epochs=3 → passes over the dataset 3 times.
- learning_rate=2e-4 → an adequate LR for LoRA.
- fp16=True → saves VRAM.
- logging_steps=10 → logs every 10 steps.
- save_strategy="epoch" → saves a checkpoint after each epoch.
- report_to="none" → avoids external integrations.

3. Defining the Trainer

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)
```

### 📌 Why use Trainer?

Trainer automates:

- Forward pass (batch propagation).
- Backpropagation (gradient calculation).
- Weight optimization.
- Logging and checkpoints.

4. Running Training

```python
trainer.train()
```

### 📌 What happens here?

The Trainer handles data loading, batching, loss calculation, and weight updates. Only the LoRA layers are updated.

5. Saving the Adapted Model

```python
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```

### 📌 Why save?

The ./qlora-output/ directory will contain:

- The LoRA weights (adapter_model.safetensors, or adapter_model.bin in older PEFT versions).
- The PEFT config (adapter_config.json).
- The adapted tokenizer.

This output can be loaded later for inference or further training.

## Interactive Inference

After training with QLoRA, we usually merge the LoRA weights into the base model. The code below shows how to load the merged model and use it interactively.

```python
# -*- coding: utf-8 -*-
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Device map: "auto", "mps", or "cuda"
gpu = "cuda"

MODEL_PATH = "./merged_model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map=gpu,
    offload_folder="./offload",
    torch_dtype=torch.float16
)
model.eval()
```

### Function to generate

```python
def generate_answer(prompt, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}  # works for any device_map
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False  # faster and deterministic
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

### Interactive loop

```python
if __name__ == "__main__":
    print("🤖 Model loaded! Type your question or 'exit' to quit.")
    while True:
        question = input("\n📝 Instruction: ")
        if question.strip().lower() in ["exit", "quit"]:
            break

        formatted = f"### Instruction:\n{question}\n\n### Answer:"
        answer = generate_answer(formatted)
        print("\n📎 Answer:")
        print(answer)
```
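
Note that the training examples used a `Question:`/`Answer:` layout, while the loop above builds an `### Instruction:` template; inference prompts generally work best when they match the training format. A tiny helper (hypothetical name, not in the original repo) that reproduces the training layout:

```python
def format_training_prompt(question: str) -> str:
    """Build a prompt in the same Question/Answer layout as the JSON dataset."""
    return f"Question: {question}\nAnswer:"

prompt = format_training_prompt("What is QLoRA?")
print(prompt)
```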

## 📊 Inference Flow

```mermaid
flowchart TD
    A[User types question] --> B[Tokenization]
    B --> C[QLoRA merged model]
    C --> D[Token decoding]
    D --> E[Answer text]
    E --> A
```

## ✅ Conclusion of Inference

- This stage validates whether the model learned the desired domain.
- It can be expanded to:
  - Flask/FastAPI APIs.
  - Chatbots (web interfaces).
  - Corporate integration (API Gateway, MCP servers).

## 📊 Full Pipeline Summary

```mermaid
flowchart TD
    A[JSON Dataset] --> B[Tokenization]
    B --> C[Tokenized Dataset]
    C --> D[Data Collator]
    D --> E[Trainer]
    E --> F[QLoRA Training]
    F --> G[Adapted Model]
    G --> H[Inference or Deploy]
```

## 🔄 Post-training: Merge and Unified Model Export

After training with QLoRA, the adapter weights are saved in the `./qlora-output/` folder. However, these are not directly usable for inference in production. The next step is to **merge** the LoRA adapters with the base model and then export them into a format ready for inference.

### 1. Merge Stage

The merging process fuses the LoRA adapter weights into the original Mistral 7B model. This produces a `./merged_model/` directory that contains the complete model weights in shards (e.g., `pytorch_model-00001-of-000xx.bin`).
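
Mathematically, merging folds each adapter back into its base weight matrix: W' = W + (alpha / r) · B · A. A minimal pure-Python sketch with tiny made-up matrices (the real work is done by PEFT's `merge_and_unload()`):

```python
# Tiny illustration: d_in = d_out = 2, rank r = 1, alpha = 2
W = [[1.0, 0.0],
     [0.0, 1.0]]   # frozen base weight (identity for clarity)
A = [[0.5, 0.5]]   # LoRA down-projection, shape (r, d_in)
B = [[1.0],
     [0.0]]        # LoRA up-projection, shape (d_out, r)
alpha, r = 2, 1
scale = alpha / r

# W' = W + scale * (B @ A)
merged = [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
           for j in range(2)] for i in range(2)]
print(merged)  # [[2.0, 1.0], [0.0, 1.0]]
```

After the merge, the adapter matrices disappear: the result is an ordinary dense model with no runtime dependency on PEFT.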

### 2. Unification into Hugging Face format

To simplify inference, we can re-export the model into a **Hugging Face-compatible format** that uses the `safetensors` standard. In this step, we obtain a folder such as `./final_model_single/` containing:

- `model.safetensors` (the consolidated weights, possibly in one large file or multiple shards depending on size)
- `config.json`, `generation_config.json`
- Tokenizer files (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, etc.)

⚠️ Note: although this looks like a "single model", Hugging Face does not bundle tokenizer and weights into a single binary; the folder still contains multiple files. If you want a **truly single-file format** (weights + tokenizer + metadata in one blob), you should convert the model to **GGUF**, the standard used by `llama.cpp` and `Ollama`.

### 3. Alternative Inference with Unified Files

Once exported, you can load the model for inference like this:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "./final_model_single"

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype="float16",
    device_map="auto",
    trust_remote_code=True
)
```

```mermaid
flowchart TD
    A[Training Output<br>qlora-output/] --> B[Merge Step<br>merged_model/]
    B --> C[Unification<br>final_model_single/]
    C --> D[Inference<br>Transformers / Pipeline]
    C --> E[Optional Conversion<br>GGUF for llama.cpp/Ollama]
```

## 🗜️ Conversion to GGUF (Optional)

Although the Hugging Face format (`safetensors` + config + tokenizer) is widely supported, it still requires multiple files. If you prefer a **true single-file model** that bundles weights, tokenizer, and metadata, you can convert the model to **GGUF**, the format used by `llama.cpp`, `Ollama`, and other lightweight inference engines.

### 1. Install llama.cpp

Clone the repository and build it:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```

### 2. Convert Hugging Face model to GGUF

Use the conversion script provided in llama.cpp (recent versions ship it as `convert_hf_to_gguf.py`; older versions called it `convert.py`):

```bash
python3 convert.py ./final_model_single --outfile mistral-qlora.gguf
```

This generates a single GGUF file (e.g., mistral-qlora.gguf) containing all weights and tokenizer data.

### 3. Run inference with llama.cpp

You can now run the model with the `main` binary (named `llama-cli` in newer llama.cpp builds):

```bash
./main -m mistral-qlora.gguf -p "Explain what OMOP CDM is in a few sentences."
```

### 4. Run with Ollama

If you prefer Ollama, place the .gguf file next to a Modelfile that references it:

```docker
FROM ./mistral-qlora.gguf
```

Then run:

```bash
ollama create mistral-qlora -f Modelfile
ollama run mistral-qlora
```

> **Note:**
> GGUF models are quantized (e.g., Q4_K_M, Q5_K_M, Q8_0) to reduce size and run efficiently on consumer GPUs or CPUs.
> The resulting .gguf file will be much smaller (often 4–8 GB) than the roughly 14 GB FP16 Hugging Face export.
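
The size reduction follows directly from the bits per weight. A quick estimate (assuming ~7.2B parameters and the nominal bit-widths; real GGUF files add some per-block metadata on top):

```python
params = 7_200_000_000

def size_gb(bits_per_weight):
    """Approximate on-disk size from parameter count and bit-width."""
    return params * bits_per_weight / 8 / 1024**3

for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q5_K_M", 5), ("Q4_K_M", 4)]:
    print(f"{name}: ~{size_gb(bits):.1f} GB")
```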
## ✅ Final Conclusion
|
||||
|
||||
- This tutorial showed:
|
||||
|
||||
- How to apply QLoRA on large models.
|
||||
|
||||
- How to train on consumer GPUs with reduced memory.
|
||||
|
||||
- How to adapt Mistral 7B for specialized Q&A.
|
||||
|
||||
|
||||
👉 This technique can be extended to other datasets and domains such as banking, legal, or healthcare.
|
||||
|
||||
## References
|
||||
|
||||
[Deploy NVIDIA RTX Virtual Workstation on Oracle Cloud Infrastructure](https://docs.oracle.com/en/learn/deploy-nvidia-rtx-oci/index.html)
|
||||
|
||||
|
||||
## Acknowledgments
|
||||
|
||||
- **Author** - Cristiano Hoshikawa (Oracle LAD A-Team Solution Engineer)
|
||||

## datasets/qa_dataset.json

An 18,590-line JSON file with the training examples (diff suppressed because it is too large).

## export_single_file.py

```python
# -*- coding: utf-8 -*-
"""
Exports the merged LoRA model as a single .safetensors file plus tokenizer
"""

import os
from transformers import AutoModelForCausalLM, AutoTokenizer

# Directories
src_dir = "./merged_model"        # where the merged (sharded) model already lives
dst_dir = "./final_model_single"  # consolidated destination
base_model_name = "mistralai/Mistral-7B-Instruct-v0.2"

os.makedirs(dst_dir, exist_ok=True)

print("🔹 Loading merged model...")
model = AutoModelForCausalLM.from_pretrained(
    src_dir,
    torch_dtype="auto",  # more stable than forcing float16
    device_map="cpu"     # force CPU loading (avoids exhausting GPU memory)
)

print("🔹 Saving as a single safetensors file...")
model.save_pretrained(
    dst_dir,
    safe_serialization=True,  # produces model.safetensors
    max_shard_size="30GB"     # large enough to fit everything in one file
)

print("🔹 Saving tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.save_pretrained(dst_dir)

print(f"✅ Single file saved at {dst_dir}/model.safetensors")
```

## infer_single_file.py

```python
# -*- coding: utf-8 -*-
"""
Interactive terminal chat with the consolidated .safetensors model
"""

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path to the consolidated model
model_dir = "./final_model_single"

print("🔹 Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("🔹 Loading consolidated model...")
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",          # moves to GPU if available
    torch_dtype=torch.float16,  # FP16 (less memory)
    trust_remote_code=True
)
model.eval()

# Inference function
def generate_text(prompt, max_new_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            temperature=0.7
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Chat loop
if __name__ == "__main__":
    print("💬 Chat started! Type your question (or 'exit' to quit).")
    while True:
        user_input = input("📝 You: ")
        if user_input.strip().lower() in ["sair", "exit", "quit"]:
            print("👋 Closing the chat.")
            break
        resposta = generate_text(user_input)
        print("🤖 Model:", resposta)
```

## inference_qlora.py

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Path to the trained LoRA model
output_dir = "./qlora-output"
base_model_name = "mistralai/Mistral-7B-Instruct-v0.2"

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(output_dir, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Load the base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Apply the trained LoRA weights
model = PeftModel.from_pretrained(base_model, output_dir)
model.eval()

# Answer-generation function
def gerar_resposta(prompt, max_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,
            top_p=0.9,
            temperature=0.7
        )
    resposta = tokenizer.decode(output[0], skip_special_tokens=True)
    return resposta

# Example test
if __name__ == "__main__":
    while True:
        prompt = input("\nType your question (or 'sair' to quit): ")
        if prompt.lower() == "sair":
            break
        resultado = gerar_resposta(prompt)
        print("\n📎 Generated answer:")
        print(resultado)
```

## merge_qlora.py

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
lora_path = "./qlora-output"
output_path = "./merged_model"

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)

# Load the LoRA adapter and fuse it into the base weights
model = PeftModel.from_pretrained(base_model, lora_path)
model = model.merge_and_unload()

# Save the merged weights in multiple shards
model.save_pretrained(output_path, max_shard_size="4GB")

# Copy the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
tokenizer.save_pretrained(output_path)
```

## requirements.txt

```text
chainlit==1.0.500
langchain==0.1.16
langchain-community==0.0.34
langchain-core==0.1.47
python-dotenv
pydantic==1.10.13
transformers==4.41.2
peft==0.10.0
accelerate==0.30.1
datasets
bitsandbytes --prefer-binary --extra-index-url https://jllllll.github.io/bitsandbytes-wheels/cu121/
```