adjustments

2025-07-02 00:16:25 -03:00
parent 90f770ce74
commit ead1861eec
7 changed files with 554 additions and 20 deletions

BIN .DS_Store vendored

README.md (159 changed lines)

@@ -35,6 +35,7 @@ Download the Python code from here:
- [`requirements.txt`](./files/requirements.txt)
- [`oci_genai_llm_context.py`](./files/oci_genai_llm_context.py)
- [`oci_genai_llm_context_fast.py`](./files/oci_genai_llm_context_fast.py)
- [`oci_genai_llm_graphrag.py`](./files/oci_genai_llm_graphrag.py)
You can find the PDF documents here:
@@ -89,9 +90,9 @@ Create a folder named `Manuals` and move these PDFs there.
![img_5.png](./images/img_5.png "image")
- In the first processing execution, the vector data will be saved in a FAISS database.
![img_6.png](./images/img_6.png "image")
- **Define the Prompt:**
@@ -110,7 +111,7 @@ Create a folder named `Manuals` and move these PDFs there.
- `llm` generates a response based on the retrieved documents.
- `StrOutputParser` formats the final output.
![img_7.png](./images/img_7.png "image")
- **Question and Answer Loop:**
@@ -124,7 +125,7 @@ Create a folder named `Manuals` and move these PDFs there.
![img_8.png](./images/img_8.png "image")
### Fixed Size Chunking
**(A Faster Alternative: Fixed-Size Chunking)**
@@ -174,15 +175,15 @@ In applications like semantic search or Retrieval-Augmented Generation (RAG), po
- This is a very simple method to split text:
![img_10.png](./images/img_10.png "image")
- And this is the main process of fixed chunking:
![img_11.png](./images/img_11.png "image")
>**Note:** Download this code to process the **fixed chunking** faster: [`oci_genai_llm_context_fast.py`](./files/oci_genai_llm_context_fast.py)
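As a rough illustration of the idea (a minimal sketch, not the repository's actual implementation in `oci_genai_llm_context_fast.py`), fixed-size chunking with a small overlap can be written in a few lines:

```python
def fixed_size_chunks(text, chunk_size=1000, overlap=100):
    """Split text into fixed-size character chunks with a small overlap.

    The overlap steps back a little at each cut so that context spanning a
    boundary appears in both neighboring chunks. Sizes are assumptions for
    illustration; real code may count tokens instead of characters.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back to preserve context across chunks
    return chunks

sample = "A" * 2500
parts = fixed_size_chunks(sample, chunk_size=1000, overlap=100)
print(len(parts))     # → 3
print(len(parts[0]))  # → 1000
```

Because it only counts characters, this runs orders of magnitude faster than an LLM-driven split, which is exactly the trade-off this section describes.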
### Semantic Chunking
**What is Semantic Chunking?**
@@ -197,6 +198,7 @@ Instead of cutting text arbitrarily, Semantic Chunking tries to preserve the ful
A traditional chunking process, based on fixed size, is fast: the system just counts tokens or characters and cuts accordingly.
With Semantic Chunking, several extra steps of semantic analysis are required:
1. Reading and interpreting the full text (or large blocks) before splitting:
The LLM needs to “understand” the content to identify the best chunk boundaries.
2. Running LLM prompts or topic classification models:
@@ -207,17 +209,129 @@ With Semantic Chunking, several extra steps of semantic analysis are required:
4. Sequential and incremental decision-making:
Semantic chunking often works in steps (e.g., analyzing 10,000-token blocks and then refining chunk boundaries inside that block), which increases total processing time.
>**Note:**
> - Depending on your machine processing power, you will wait a long, long time to finalize the first execution using **Semantic Chunking**.
> - You can use this algorithm to produce customized chunking using **OCI Generative AI**.
- This is the main document process. It uses:
- **`smart_split_text()`:** Splits the full text into small pieces of about 10 KB (you can configure other strategies). The mechanism watches the last paragraph of each piece: if part of a paragraph falls into the next text piece, that part is skipped in the current processing and appended to the next text group.
- **`semantic_chunk()`:** Uses the OCI LLM to separate the paragraphs. It includes the intelligence to identify titles, table components, and paragraphs in order to perform a smart chunk. The strategy here is the **Semantic Chunking** technique, so it takes longer than common processing. The first run will take a long time, but subsequent runs will load the pre-saved FAISS data.
- **`split_llm_output_into_chapters()`:** Finalizes the chunking, separating the chapters.
![img.png](./images/img_9.png "image")
>**Note:** Download this code to process the **semantic chunking**: [`oci_genai_llm_context.py`](./files/oci_genai_llm_context.py)
### Combining Semantic Chunking with GraphRAG
**GraphRAG (Graph-Augmented Retrieval-Augmented Generation)** is an advanced AI architecture that combines traditional vector-based retrieval with structured knowledge graphs. In a standard RAG pipeline, a language model retrieves relevant document chunks using semantic similarity from a vector database (like FAISS). However, vector-based retrieval operates in an unstructured manner, relying purely on embeddings and distance metrics, which sometimes miss deeper contextual or relational meanings.
**GraphRAG** enhances this process by introducing a knowledge graph layer, where entities, concepts, components, and their relationships are explicitly represented as nodes and edges. This graph-based context enables the language model to reason over relationships, hierarchies, and dependencies that vector similarity alone cannot capture.
### Why Combine Semantic Chunking with GraphRAG?
Semantic chunking is the process of intelligently splitting large documents into meaningful units, or "chunks," based on the content's structure, such as chapters, headings, sections, or logical divisions. Rather than breaking documents purely by character limits or naive paragraph splitting, semantic chunking produces higher-quality, context-aware chunks that align better with human understanding.
When combined with GraphRAG, semantic chunking offers several powerful advantages:
1. Enhanced Knowledge Representation:
- Semantic chunks preserve logical boundaries in the content.
- Knowledge graphs extracted from these chunks maintain accurate relationships between entities, systems, APIs, processes, or services.
2. Multi-Modal Contextual Retrieval (the language model retrieves both):
- Unstructured context from the vector database (semantic similarity).
- Structured context from the knowledge graph (entity-relation triples).
- This hybrid approach leads to more complete and accurate answers.
3. Improved Reasoning Capabilities:
- Graph-based retrieval enables relational reasoning.
- The LLM can answer questions like:
> “What services does the Order API depend on?”
> “Which components are part of the SOA Suite?”
- These relational queries are often impossible with embedding-only approaches.
4. Higher Explainability and Traceability:
- Graph relationships are human-readable and transparent.
- Users can inspect how answers are derived from both textual and structural knowledge.
5. Reduced Hallucination:
- The graph acts as a constraint on the LLM, anchoring responses to verified relationships and factual connections extracted from the source documents.
6. Scalability Across Complex Domains:
- In technical domains (e.g., APIs, microservices, legal contracts, healthcare standards), relationships between components are as important as the components themselves.
- GraphRAG combined with semantic chunking scales effectively in these contexts, preserving both textual depth and relational structure.
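The hybrid retrieval described in point 2 can be sketched concretely. In this minimal, self-contained example, `vector_hits` and `graph_triples` are illustrative stand-ins for real FAISS retriever results and Neo4j query results:

```python
def build_hybrid_context(vector_hits, graph_triples):
    """Merge unstructured chunks and structured triples into one prompt context."""
    doc_part = "\n\n".join(vector_hits)
    graph_part = "\n".join(f"{s} -[{r}]-> {o}" for s, r, o in graph_triples)
    return f"Document context:\n{doc_part}\n\nGraph context:\n{graph_part}"

# Stand-in data: in the real pipeline these come from FAISS and Neo4j.
vector_hits = ["Oracle SOA Suite provides BPEL-based orchestration."]
graph_triples = [("SOA Suite", "HAS_COMPONENT", "BPEL Process")]

context = build_hybrid_context(vector_hits, graph_triples)
print(context)
```

The LLM then receives both kinds of context in a single prompt, which is what lets it combine semantic similarity with explicit relationships.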
>**Note 1:** Download this code to process the **semantic chunking** with **GraphRAG**: [`oci_genai_llm_graphrag.py`](./files/oci_genai_llm_graphrag.py)
>**Note 2:** You will need:
>
> - Docker installed and running, to test with the open-source **Neo4j** graph database
> - The **neo4j** Python library installed
There are two methods:
### create_knowledge_graph
- This method automatically extracts entities and relationships from text chunks and stores them in a Neo4j knowledge graph.
- For each document chunk, it sends the content to an LLM (Large Language Model) with a prompt asking it to extract entities (like systems, components, services, and APIs) and their relationships.
- It parses each line, extracts Entity1, RELATION, and Entity2.
- Stores this information as nodes and edges in the Neo4j graph database using Cypher queries:
```cypher
MERGE (e1:Entity {name: $entity1})
MERGE (e2:Entity {name: $entity2})
MERGE (e1)-[:RELATION {source: $source}]->(e2)
```
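The `Entity1 -[RELATION]-> Entity2` lines returned by the LLM can be parsed with logic like the following standalone sketch, which mirrors the string handling in `create_knowledge_graph` (simplified here for illustration):

```python
def parse_triple(line):
    """Parse 'Entity1 -[RELATION]-> Entity2' into an (entity1, relation, entity2) tuple.

    Returns None for malformed lines, matching the skip-and-continue
    behavior in create_knowledge_graph.
    """
    parts = line.split("-[")
    if len(parts) != 2:
        return None  # malformed: missing '-[' separator
    right = parts[1].split("]->")
    if len(right) != 2:
        return None  # malformed: missing ']->' separator
    relation, entity2 = right
    return (
        parts[0].strip(),
        relation.strip().replace(" ", "_").upper(),  # normalize relation name
        entity2.strip(),
    )

print(parse_triple("SOA Suite -[HAS_COMPONENT]-> BPEL Process"))
# → ('SOA Suite', 'HAS_COMPONENT', 'BPEL Process')
```

Each parsed tuple then feeds the `MERGE` query's `$entity1`, relation type, and `$entity2` parameters.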
![img.png](images/img_12.png)
### query_knowledge_graph
- This method queries the Neo4j knowledge graph to retrieve relationships related to a specific keyword or concept.
- Executes a Cypher query that searches for any relationship `(e1)-[r]->(e2)` where `e1.name`, `e2.name`, or the relationship type contains the `query_text` (case-insensitive).
- Returns up to 20 matching triples formatted as `Entity1 -[RELATION]-> Entity2`.
![img_1.png](images/img_13.png)
![img_2.png](images/img_14.png)
>**Disclaimer: Neo4j Usage Notice**
>
> This implementation uses Neo4j as an embedded knowledge graph database for demonstration and prototyping purposes. While Neo4j is a powerful and flexible graph database suitable for development, testing, and small to medium workloads, it may not meet the requirements for enterprise-grade, mission-critical, or highly secure workloads, especially in environments that demand high availability, scalability, and advanced security compliance.
>
> For production environments and enterprise scenarios, we recommend leveraging **Oracle Database** with **Graph** capabilities, which offers:
>
> - Enterprise-grade reliability and security.
> - Scalability for mission-critical workloads.
> - Native graph models (Property Graph and RDF) integrated with relational data.
> - Advanced analytics, security, high availability, and disaster recovery features.
> - Full Oracle Cloud Infrastructure (OCI) integration.
>
> By using Oracle Database for graph workloads, organizations can unify structured, semi-structured, and graph data within a single, secure, and scalable enterprise platform.
## Task 3: Run Query for Oracle Integration and Oracle SOA Suite Contents
@@ -233,6 +347,12 @@ FOR SEMANTIC CHUNKING TECHNIQUE
python oci_genai_llm_context.py --device="mps" --gpu_name="M2Max GPU 32 Cores"
```
```
FOR SEMANTIC CHUNKING COMBINED WITH GRAPHRAG TECHNIQUE
python oci_genai_llm_graphrag.py --device="mps" --gpu_name="M2Max GPU 32 Cores"
```
> **Note:** The `--device` and `--gpu_name` parameters can be used to accelerate the processing in Python, using GPU if your machine has one. Consider that this code can be used with local models too.
The provided context distinguishes Oracle SOA Suite from Oracle Integration; you can test the code considering these points:
@@ -244,15 +364,12 @@ The provided context distinguishes Oracle SOA Suite and Oracle Integration, you
We can define the following context, which greatly helps in interpreting the documents correctly.
![img_7.png](./images/img_7.png "image")
The following image shows the example of comparison between Oracle SOA Suite and Oracle Integration.
![img.png](./images/img.png "image")
The following image shows the example for Kafka.
![img_1.png](./images/img_1.png "image")
## Next Steps
@@ -270,6 +387,8 @@ This approach can be applied in various fields, such as legal, compliance, techn
- [Introduction to Custom and Built-in Python LangChain Agents](https://wellsr.com/python/working-with-python-langchain-agents/)
- [Oracle Database Insider - Graph RAG: Bring the Power of Graphs to Generative AI](https://blogs.oracle.com/database/post/graph-rag-bring-the-power-of-graphs-to-generative-ai)
## Acknowledgments
- **Author** - Cristiano Hoshikawa (Oracle LAD A-Team Solution Engineer)

files/oci_genai_llm_graphrag.py

@@ -0,0 +1,414 @@
import os
import pickle
import re
import atexit
import subprocess
import socket
import time
from tqdm import tqdm
from neo4j import GraphDatabase
from langchain_community.chat_models.oci_generative_ai import ChatOCIGenAI
from langchain_core.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain_community.embeddings import OCIGenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.schema.runnable import RunnableMap
from langchain_community.document_loaders import UnstructuredPDFLoader, PyMuPDFLoader
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
# =========================
# Graph Database: Neo4j
# =========================
def is_port_open(host, port):
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
sock.settimeout(2)
return sock.connect_ex((host, port)) == 0
def start_neo4j_if_not_running():
if is_port_open('127.0.0.1', 7687):
print("🟢 Neo4j is already running on port 7687.")
return
print("🟡 Neo4j not found on port 7687. Starting via Docker...")
try:
subprocess.run([
"docker", "run", "--name", "neo4j-graphrag",
"-p", "7474:7474", "-p", "7687:7687",
"-d",
"-e", f"NEO4J_AUTH={NEO4J_USER}/{NEO4J_PASSWORD}",
"--restart", "unless-stopped",
"neo4j:5"
], check=True)
except subprocess.CalledProcessError as e:
print("🚫 Error starting Neo4j via Docker:", e)
raise
print("⏳ Waiting for Neo4j to start...", end="")
for _ in range(10):
if is_port_open('127.0.0.1', 7687):
print("✅ Neo4j is ready!")
return
print(".", end="", flush=True)
time.sleep(2)
print("\n❌ Failed to connect to Neo4j.")
raise ConnectionError("Neo4j did not start correctly.")
# =========================
# Global Configurations
# =========================
INDEX_PATH = "./faiss_index"
PROCESSED_DOCS_FILE = os.path.join(INDEX_PATH, "processed_docs.pkl")
chapter_separator_regex = r"^(#{1,6} .+|\*\*.+\*\*)$"
# Neo4j Configuration
NEO4J_URI = "bolt://localhost:7687"
NEO4J_USER = "neo4j"
NEO4J_PASSWORD = "your_password_here"
#LLM Definitions
llm = ChatOCIGenAI(
model_id="meta.llama-3.1-405b-instruct",
service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
auth_profile="DEFAULT",
model_kwargs={"temperature": 0.7, "top_p": 0.75, "max_tokens": 4000},
)
llm_for_rag = ChatOCIGenAI(
model_id="meta.llama-3.1-405b-instruct",
service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
auth_profile="DEFAULT",
)
embeddings = OCIGenAIEmbeddings(
model_id="cohere.embed-multilingual-v3.0",
service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
auth_profile="DEFAULT",
)
start_neo4j_if_not_running()
graph_driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
atexit.register(lambda: graph_driver.close())
# =========================
# Helper Functions
# =========================
def create_knowledge_graph(chunks):
with graph_driver.session() as session:
for doc in chunks:
text = doc.page_content
source = doc.metadata.get("source", "unknown")
if not text.strip():
print(f"⚠️ Skipping empty chunk from {source}")
continue
prompt = f"""
You are an expert in knowledge extraction.
Given the following technical text:
{text}
Extract key entities (systems, components, processes, protocols, APIs, services) and their relationships in the following format:
- Entity1 -[RELATION]-> Entity2
Use UPPERCASE for RELATION types, replacing spaces with underscores.
Example output:
SOA Suite -[HAS_COMPONENT]-> BPEL Process
Integration Flow -[USES]-> REST API
Order Service -[CALLS]-> Inventory Service
Important:
- If there are no entities or relationships, return: NONE
- Only output the list, no explanation.
"""
response = llm_for_rag.invoke(prompt)
if not hasattr(response, "content"):
print("[ERROR] Failed to get graph triples.")
continue
result = response.content.strip()
if result.upper() == "NONE":
print(f" No entities found in chunk from {source}")
continue
triples = result.splitlines()
for triple in triples:
parts = triple.split("-[")
if len(parts) != 2:
print(f"⚠️ Skipping malformed triple: {triple}")
continue
right_part = parts[1].split("]->")
if len(right_part) != 2:
print(f"⚠️ Skipping malformed relation part: {parts[1]}")
continue
relation, entity2 = right_part
relation = relation.strip().replace(" ", "_").upper()
entity1 = parts[0].strip()
entity2 = entity2.strip()
session.run(
f"""
MERGE (e1:Entity {{name: $entity1}})
MERGE (e2:Entity {{name: $entity2}})
MERGE (e1)-[:{relation} {{source: $source}}]->(e2)
""",
entity1=entity1,
entity2=entity2,
source=source
)
# =========================
# GraphRAG - Query the graph
# =========================
def query_knowledge_graph(query_text):
with graph_driver.session() as session:
result = session.run(
"""
MATCH (e1)-[r]->(e2)
WHERE toLower(e1.name) CONTAINS toLower($search)
OR toLower(e2.name) CONTAINS toLower($search)
OR toLower(type(r)) CONTAINS toLower($search)
RETURN e1.name AS from, type(r) AS relation, e2.name AS to
LIMIT 20
""",
search=query_text
)
relations = []
for record in result:
relations.append(f"{record['from']} -[{record['relation']}]-> {record['to']}")
return "\n".join(relations) if relations else "No related entities found."
# =========================
# Semantic Chunking
# =========================
def split_llm_output_into_chapters(llm_text):
chapters = []
current_chapter = []
lines = llm_text.splitlines()
for line in lines:
if re.match(chapter_separator_regex, line):
if current_chapter:
chapters.append("\n".join(current_chapter).strip())
current_chapter = [line]
else:
current_chapter.append(line)
if current_chapter:
chapters.append("\n".join(current_chapter).strip())
return chapters
def semantic_chunking(text):
prompt = f"""
You received the following text extracted via OCR:
{text}
Your task:
1. Identify headings (short uppercase or bold lines, no period at the end)
2. Separate paragraphs by heading
3. Indicate columns with [COLUMN 1], [COLUMN 2] if present
4. Indicate tables with [TABLE] in markdown format
"""
response = llm_for_rag.invoke(prompt)
return response
def read_pdfs(pdf_path):
if "-ocr" in pdf_path:
doc_pages = PyMuPDFLoader(str(pdf_path)).load()
else:
doc_pages = UnstructuredPDFLoader(str(pdf_path)).load()
full_text = "\n".join([page.page_content for page in doc_pages])
return full_text
def smart_split_text(text, max_chunk_size=10_000):
chunks = []
start = 0
text_length = len(text)
while start < text_length:
end = min(start + max_chunk_size, text_length)
split_point = max(
text.rfind('.', start, end),
text.rfind('!', start, end),
text.rfind('?', start, end),
text.rfind('\n\n', start, end)
)
if split_point == -1 or split_point <= start:
split_point = end
else:
split_point += 1
chunk = text[start:split_point].strip()
if chunk:
chunks.append(chunk)
start = split_point
return chunks
def load_previously_indexed_docs():
if os.path.exists(PROCESSED_DOCS_FILE):
with open(PROCESSED_DOCS_FILE, "rb") as f:
return pickle.load(f)
return set()
def save_indexed_docs(docs):
with open(PROCESSED_DOCS_FILE, "wb") as f:
pickle.dump(docs, f)
# =========================
# Main Function
# =========================
def chat():
pdf_paths = [
'./Manuals/SOASUITE.pdf',
'./Manuals/SOASUITEHL7.pdf',
'./Manuals/using-integrations-oracle-integration-3.pdf'
]
already_indexed_docs = load_previously_indexed_docs()
updated_docs = set()
try:
vectorstore = FAISS.load_local(INDEX_PATH, embeddings, allow_dangerous_deserialization=True)
print("✔️ FAISS index loaded.")
except Exception:
print("⚠️ FAISS index not found, creating a new one.")
vectorstore = None
new_chunks = []
for pdf_path in tqdm(pdf_paths, desc=f"📄 Processing PDFs"):
print(f" {os.path.basename(pdf_path)}")
if pdf_path in already_indexed_docs:
print(f"✅ Document already indexed: {pdf_path}")
continue
full_text = read_pdfs(pdf_path=pdf_path)
text_chunks = smart_split_text(full_text, max_chunk_size=10_000)
overflow_buffer = ""
for chunk in tqdm(text_chunks, desc=f"📄 Processing text chunks", dynamic_ncols=True, leave=False):
current_text = overflow_buffer + chunk
treated_text = semantic_chunking(current_text)
if hasattr(treated_text, "content"):
chapters = split_llm_output_into_chapters(treated_text.content)
last_chapter = chapters[-1] if chapters else ""
if last_chapter and not last_chapter.strip().endswith((".", "!", "?", "\n\n")):
print("📌 Last chapter seems incomplete, saving for the next cycle")
overflow_buffer = last_chapter
chapters = chapters[:-1]
else:
overflow_buffer = ""
for chapter_text in chapters:
doc = Document(page_content=chapter_text, metadata={"source": pdf_path})
new_chunks.append(doc)
print(f"✅ New chapter indexed:\n{chapter_text}...\n")
else:
print(f"[ERROR] semantic_chunking returned unexpected type: {type(treated_text)}")
updated_docs.add(str(pdf_path))
if new_chunks:
if vectorstore:
vectorstore.add_documents(new_chunks)
else:
vectorstore = FAISS.from_documents(new_chunks, embedding=embeddings)
vectorstore.save_local(INDEX_PATH)
save_indexed_docs(already_indexed_docs.union(updated_docs))
print(f"💾 {len(new_chunks)} chunks added to FAISS index.")
print("🧠 Building knowledge graph...")
create_knowledge_graph(new_chunks)
else:
print("📁 No new documents to index.")
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 50, "fetch_k": 100})
template = """
Document context:
{context}
Graph context:
{graph_context}
Question:
{input}
Interpretation rules:
- You can search for a step-by-step tutorial about a subject
- You can search a concept description about a subject
- You can search for a list of components about a subject
"""
prompt = PromptTemplate.from_template(template)
def get_context(x):
query = x.get("input") if isinstance(x, dict) else x
return retriever.invoke(query)
chain = (
RunnableMap({
"context": RunnableLambda(get_context),
"graph_context": RunnableLambda(lambda x: query_knowledge_graph(x.get("input") if isinstance(x, dict) else x)),
"input": lambda x: x.get("input") if isinstance(x, dict) else x
})
| prompt
| llm
| StrOutputParser()
)
print("✅ READY")
while True:
query = input("❓ Question (or 'quit' to exit): ")
if query.lower() == "quit":
break
response = chain.invoke(query)
print("\n📜 RESPONSE:\n")
print(response)
print("\n" + "=" * 80 + "\n")
# 🚀 Run
if __name__ == "__main__":
chat()

files/requirements.txt

@@ -14,3 +14,4 @@ chardet
lxml
oci
oci-cli
neo4j

BIN images/img_12.png (new file, 129 KiB)

BIN images/img_13.png (new file, 74 KiB)

BIN images/img_14.png (new file, 56 KiB)