adjustments

This commit is contained in:
2025-06-19 09:33:12 -03:00
parent 0a6583752c
commit 83020a54e8
30 changed files with 727 additions and 245 deletions

BIN
.DS_Store vendored


163
README.md

@@ -1,163 +0,0 @@
# Consult PDF Documents in Natural Language with OCI Generative AI
## Introduction
Oracle Cloud Generative AI is an advanced generative artificial intelligence solution that enables companies and developers to create intelligent applications using cutting-edge language models. Based on powerful technologies such as LLMs (Large Language Models), this solution allows the automation of complex tasks, making processes faster, more efficient, and accessible through natural language interactions.
One of the most impactful applications of Oracle Cloud Generative AI is in PDF document analysis. Companies frequently deal with large volumes of documents, such as contracts, financial reports, technical manuals, and research papers. Manually searching for information in these files can be time-consuming and prone to errors.
With the use of generative artificial intelligence, it is possible to extract information instantly and accurately, allowing users to query complex documents simply by formulating questions in natural language. This means that instead of reading entire pages to find a specific clause in a contract or a relevant data point in a report, users can just ask the model, which quickly returns the answer based on the analyzed content.
Beyond information retrieval, Oracle Cloud Generative AI can also be used to summarize lengthy documents, compare content, classify information, and even generate strategic insights. These capabilities make the technology essential for various fields, such as legal, finance, healthcare, and engineering, optimizing decision-making and increasing productivity.
By integrating this technology with tools such as Oracle AI Services, OCI Data Science, and APIs for document processing, companies can build intelligent solutions that completely transform the way they interact with their data, making information retrieval faster and more effective.
### Prerequisites
To use the demo, you need to have the following pre-installed:
- Python 3.10 or higher
- OCI CLI
### Install Python Packages
The Python code requires certain libraries for using OCI Generative AI. Install the required Python packages by running:
```
pip install -r requirements.txt
```
## Understand the Code
This is a demo of OCI Generative AI for querying functionalities of Oracle SOA SUITE and Oracle Integration.
Both tools are currently used for hybrid integration strategies, meaning they operate in both Cloud and on-prem environments.
Since these tools share functionalities and processes, this code helps in understanding how to implement the same integration approach in each. Additionally, it allows users to explore common characteristics and differences.
You can find the Python code at:
- [requirements.txt](source/requirements.txt)
- [oci_genai_llm_context.py](source/oci_genai_llm_context.py)
Below, we will explain each section of the code.
### Import Libraries
Imports the necessary libraries for processing PDFs, Oracle generative AI, text vectorization, and storage in vector databases (FAISS and Chroma).
• PyPDFLoader is used to extract text from PDFs.
• ChatOCIGenAI enables the use of Oracle Cloud Generative AI models to answer questions.
• OCIGenAIEmbeddings creates embeddings (vector representations) of texts for semantic search.
![img_6.png](images/img_6.png)
### Load and Process PDFs
Lists the PDF files to be processed.
• PyPDFLoader reads each document and splits it into pages for easier indexing and searching.
• Document IDs are stored for future reference.
![img_7.png](images/img_7.png)
### Configure the Oracle Generative Model
Configures the Llama-3.1-405b model hosted on Oracle Cloud to generate responses based on the loaded documents.
• Defines parameters such as temperature (randomness control), top_p (diversity control), and token limit.
![img_8.png](images/img_8.png)
>**Note:** Please confirm the version of Llama available in your tenancy. Depending on when you are reading this tutorial, this model may no longer be available.
### Create Embeddings and Vector Indexing
Uses Oracle's embedding model to transform texts into numerical vectors, facilitating semantic searches in documents.
![img_9.png](images/img_9.png)
• FAISS (Facebook AI Similarity Search) stores the embeddings of the PDF documents for quick queries.
• retriever allows retrieving the most relevant excerpts based on the semantic similarity of the user's query.
![img_10.png](images/img_10.png)
### Define the Prompt
Creates an intelligent prompt for the generative model, guiding it to consider only relevant documents for each query.
• This improves the accuracy of responses and avoids unnecessary information.
![img_12.png](images/img_12.png)
### Create the Processing Chain (RAG - Retrieval-Augmented Generation)
Implements an RAG (Retrieval-Augmented Generation) flow, where:
1. retriever searches for the most relevant document excerpts.
2. prompt organizes the query for better context.
3. llm generates a response based on the retrieved documents.
4. StrOutputParser formats the final output.
![img_13.png](images/img_13.png)
### Question and Answer Loop
Maintains a loop where users can ask questions about the loaded documents.
• The AI responds using the knowledge base extracted from the PDFs.
• Typing "quit" exits the program.
![img_14.png](images/img_14.png)
## Query for OIC and SOA Suite contents
Run the following command:
```
python oci_genai_llm_context.py --device="mps" --gpu_name="M2Max GPU 32 Cores"
```
>**Note:** The `--device` and `--gpu_name` parameters can be used to accelerate processing in Python, using a GPU if your machine has one. Note that this code can also be used with local models.
Thanks to the context provided to distinguish between SOA SUITE and Oracle Integration, you can test the code considering these points:
- The query should be made only for SOA SUITE: Therefore, only SOA SUITE documents should be considered.
- The query should be made only for Oracle Integration: Therefore, only Oracle Integration documents should be considered.
- The query requires a comparison between SOA SUITE and Oracle Integration: Therefore, all documents should be considered.
We can define the following context, which greatly helps in interpreting the documents correctly:
![img_3.png](images/img_3.png)
Example of comparison between SOA SUITE and Oracle Integration:
![img.png](images/img.png)
Example regarding Kafka:
![img_1.png](images/img_1.png)
## Conclusion
This code demonstrates an application of Oracle Cloud Generative AI for intelligent PDF analysis. It enables users to efficiently query large volumes of documents using semantic searches and a generative AI model to generate accurate natural language responses.
This approach can be applied in various fields, such as legal, compliance, technical support, and academic research, making information retrieval much faster and smarter.
## References
- [Extending SaaS by AI/ML features - Part 8: OCI Generative AI Integration with LangChain Use Cases](https://www.ateam-oracle.com/post/oci-generative-ai-integration-with-langchain-usecases)
- [Bridging cloud and conversational AI: LangChain and OCI Data Science platform](https://blogs.oracle.com/ai-and-datascience/post/cloud-conversational-ai-langchain-oci-data-science)
- [Install OCI CLI](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm#Quickstart)
- [Introduction to Custom and Built-in Python LangChain Agents](https://wellsr.com/python/working-with-python-langchain-agents/)
## Acknowledgments
- **Author** - Cristiano Hoshikawa (Oracle LAD A-Team Solution Engineer)

14
_config.yml Normal file

@@ -0,0 +1,14 @@
title: Analyze PDF Documents in Natural Language with OCI Generative AI
description: Learn how to analyze PDF documents in natural language using Oracle Cloud Infrastructure Generative AI (OCI Generative AI).
baseurl: /en/learn/oci-genai-pdf
copyright: 2025
copyright-last: 2025
partno: G29356-03
print-month: June
print-year: 2025
duration: PT01H0M0S
level: Beginner
roles: Application Administrator;Application Developer;DevOps Engineer;Developer
products: en/cloud/oracle-cloud-infrastructure/oci;en/cloud/oracle-cloud-infrastructure/generative-ai
keywords: Cloud Native
inject-note: true


@@ -0,0 +1,258 @@
from langchain_community.chat_models.oci_generative_ai import ChatOCIGenAI
from langchain_core.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain_community.embeddings import OCIGenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.schema.runnable import RunnableMap
from langchain_community.document_loaders import UnstructuredPDFLoader, PyMuPDFLoader
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
from tqdm import tqdm
import os
import pickle
import re

INDEX_PATH = "./faiss_index"
PROCESSED_DOCS_FILE = os.path.join(INDEX_PATH, "processed_docs.pkl")

# Markdown-style headings ('# Title' .. '###### Title') or bold '**Title**' lines
chapter_separator_regex = r"^(#{1,6} .+|\*\*.+\*\*)$"


def split_llm_output_into_chapters(llm_text):
    """
    Splits the LLM output text into chapters, assuming the LLM separates
    chapters using markdown-style headings like '# Title'.
    """
    chapters = []
    current_chapter = []
    lines = llm_text.splitlines()
    for line in lines:
        if re.match(chapter_separator_regex, line):
            if current_chapter:
                chapters.append("\n".join(current_chapter).strip())
            current_chapter = [line]
        else:
            current_chapter.append(line)
    if current_chapter:
        chapters.append("\n".join(current_chapter).strip())
    return chapters


def semantic_chunking(text):
    llm = ChatOCIGenAI(
        model_id="meta.llama-3.1-405b-instruct",
        service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
        compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        auth_profile="DEFAULT",
    )
    prompt = f"""
You received the following text extracted via OCR:

{text}

Your task:
1. Identify headings (short uppercase or bold lines, no period at the end)
2. Separate paragraphs by heading
3. Indicate columns with [COLUMN 1], [COLUMN 2] if present
4. Indicate tables with [TABLE] in markdown format
"""
    response = llm.invoke(prompt)
    return response


def read_pdfs(pdf_path):
    # PDFs produced by OCR are read with PyMuPDF; the rest with Unstructured
    if "-ocr" in pdf_path:
        doc_pages = PyMuPDFLoader(str(pdf_path)).load()
    else:
        doc_pages = UnstructuredPDFLoader(str(pdf_path)).load()
    full_text = "\n".join([page.page_content for page in doc_pages])
    return full_text


def smart_split_text(text, max_chunk_size=10_000):
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = min(start + max_chunk_size, text_length)
        # Try to find the last sentence end before the limit (., ?, !, \n\n)
        split_point = max(
            text.rfind('.', start, end),
            text.rfind('!', start, end),
            text.rfind('?', start, end),
            text.rfind('\n\n', start, end)
        )
        # If not found, make a hard cut
        if split_point == -1 or split_point <= start:
            split_point = end
        else:
            split_point += 1  # Include the ending character
        chunk = text[start:split_point].strip()
        if chunk:
            chunks.append(chunk)
        start = split_point
    return chunks


def load_previously_indexed_docs():
    if os.path.exists(PROCESSED_DOCS_FILE):
        with open(PROCESSED_DOCS_FILE, "rb") as f:
            return pickle.load(f)
    return set()


def save_indexed_docs(docs):
    with open(PROCESSED_DOCS_FILE, "wb") as f:
        pickle.dump(docs, f)


def append_text_to_file(file_path, text):
    """
    Appends text to the end of a file. If the file doesn't exist, it will be created.

    Args:
        file_path (str): Path to the file where the text will be saved.
        text (str): Text to append.
    """
    with open(file_path, "a", encoding="utf-8") as f:
        f.write(text + "\n")


def chat():
    llm = ChatOCIGenAI(
        model_id="meta.llama-3.1-405b-instruct",
        service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
        compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        auth_profile="DEFAULT",  # Replace with your profile name
        model_kwargs={"temperature": 0.7, "top_p": 0.75, "max_tokens": 4000},
    )
    embeddings = OCIGenAIEmbeddings(
        model_id="cohere.embed-multilingual-v3.0",
        service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
        compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        auth_profile="DEFAULT",  # Replace with your profile name
    )
    pdf_paths = [
        './Manuals/using-integrations-oracle-integration-3.pdf',
        './Manuals/SOASE.pdf',
        './Manuals/SOASUITEHL7.pdf'
    ]
    already_indexed_docs = load_previously_indexed_docs()
    updated_docs = set()
    # Try loading an existing FAISS index
    try:
        vectorstore = FAISS.load_local(INDEX_PATH, embeddings, allow_dangerous_deserialization=True)
        print("✔️ FAISS index loaded.")
    except Exception:
        print("⚠️ FAISS index not found, creating a new one.")
        vectorstore = None
    new_chunks = []
    for pdf_path in tqdm(pdf_paths, desc="📄 Processing PDFs"):
        print(f"   {os.path.basename(pdf_path)}")
        if pdf_path in already_indexed_docs:
            print(f"✅ Document already indexed: {pdf_path}")
            continue
        full_text = read_pdfs(pdf_path=pdf_path)
        # Split the text into ~10 KB chunks (~10,000 characters)
        text_chunks = smart_split_text(full_text, max_chunk_size=10_000)
        overflow_buffer = ""  # Remainder from the previous chapter, if any
        for chunk in tqdm(text_chunks, desc="📄 Processing text chunks", dynamic_ncols=True, leave=False):
            # Join with the leftover from the previous chunk
            current_text = overflow_buffer + chunk
            # Send the text to the LLM for semantic splitting
            treated_text = semantic_chunking(current_text)
            if hasattr(treated_text, "content"):
                chapters = split_llm_output_into_chapters(treated_text.content)
                # Check if the last chapter seems incomplete
                last_chapter = chapters[-1] if chapters else ""
                # Simple criterion: the text ends without punctuation (., !, ?)
                if last_chapter and not last_chapter.strip().endswith((".", "!", "?", "\n\n")):
                    print("📌 Last chapter seems incomplete, saving for the next cycle")
                    overflow_buffer = last_chapter
                    chapters = chapters[:-1]  # Don't index the incomplete chapter yet
                else:
                    overflow_buffer = ""  # Nothing left over
                # Save complete chapters as document chunks
                for chapter_text in chapters:
                    doc = Document(page_content=chapter_text, metadata={"source": pdf_path})
                    new_chunks.append(doc)
                    print(f"✅ New chapter indexed:\n{chapter_text}...\n")
            else:
                print(f"[ERROR] semantic_chunking returned unexpected type: {type(treated_text)}")
        # Index any remainder left after the final chunk so it is not dropped
        if overflow_buffer:
            new_chunks.append(Document(page_content=overflow_buffer, metadata={"source": pdf_path}))
        updated_docs.add(str(pdf_path))
    # If there are new documents, index them
    if new_chunks:
        if vectorstore:
            vectorstore.add_documents(new_chunks)
        else:
            vectorstore = FAISS.from_documents(new_chunks, embedding=embeddings)
        vectorstore.save_local(INDEX_PATH)
        save_indexed_docs(already_indexed_docs.union(updated_docs))
        print(f"💾 {len(new_chunks)} chunks added to FAISS index.")
    else:
        print("📁 No new documents to index.")
    retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 50, "fetch_k": 100})
    template = """
Document context:
{context}

Question:
{input}

Interpretation rules:
Rule 1: SOA SUITE documents: `SOASUITE.pdf` and `SOASUITEHL7.pdf`
Rule 2: Oracle Integration (known as OIC) document: `using-integrations-oracle-integration-3.pdf`
Rule 3: If the query is not a comparison between SOA SUITE and Oracle Integration (OIC), only consider documents relevant to the product.
Rule 4: If the question is a comparison between SOA SUITE and OIC, consider all documents and compare between them.

Mention at the beginning which tool is being addressed: {input}
"""
    prompt = PromptTemplate.from_template(template)

    def get_context(x):
        query = x.get("input") if isinstance(x, dict) else x
        return retriever.invoke(query)

    chain = (
        RunnableMap({
            "context": RunnableLambda(get_context),
            "input": lambda x: x.get("input") if isinstance(x, dict) else x
        })
        | prompt
        | llm
        | StrOutputParser()
    )
    print("READY")
    while True:
        query = input()
        if query == "quit":
            break
        response = chain.invoke(query)
        print(response)


chat()


@@ -0,0 +1,164 @@
from langchain_community.chat_models.oci_generative_ai import ChatOCIGenAI
from langchain_core.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain_community.embeddings import OCIGenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.schema.runnable import RunnableMap
from langchain_community.document_loaders import UnstructuredPDFLoader, PyMuPDFLoader
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
from tqdm import tqdm
import os
import pickle

INDEX_PATH = "./faiss_index"
PROCESSED_DOCS_FILE = os.path.join(INDEX_PATH, "processed_docs.pkl")


def read_pdfs(pdf_path):
    if "-ocr" in pdf_path:
        doc_pages = PyMuPDFLoader(str(pdf_path)).load()
    else:
        doc_pages = UnstructuredPDFLoader(str(pdf_path)).load()
    full_text = "\n".join([page.page_content for page in doc_pages])
    return full_text


def smart_split_text(text, max_chunk_size=2000):
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = min(start + max_chunk_size, text_length)
        # Prefer to cut at the last sentence end (., !, ?, or blank line)
        split_point = max(
            text.rfind('.', start, end),
            text.rfind('!', start, end),
            text.rfind('?', start, end),
            text.rfind('\n\n', start, end)
        )
        if split_point == -1 or split_point <= start:
            split_point = end  # Hard cut when no boundary is found
        else:
            split_point += 1  # Include the ending character
        chunk = text[start:split_point].strip()
        if chunk:
            chunks.append(chunk)
        start = split_point
    return chunks


def load_previously_indexed_docs():
    if os.path.exists(PROCESSED_DOCS_FILE):
        with open(PROCESSED_DOCS_FILE, "rb") as f:
            return pickle.load(f)
    return set()


def save_indexed_docs(docs):
    with open(PROCESSED_DOCS_FILE, "wb") as f:
        pickle.dump(docs, f)


def chat():
    llm = ChatOCIGenAI(
        model_id="meta.llama-3.1-405b-instruct",
        service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
        compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        auth_profile="DEFAULT",
        model_kwargs={"temperature": 0.7, "top_p": 0.75, "max_tokens": 4000},
    )
    embeddings = OCIGenAIEmbeddings(
        model_id="cohere.embed-multilingual-v3.0",
        service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
        compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        auth_profile="DEFAULT",
    )
    pdf_paths = [
        './Manuals/using-integrations-oracle-integration-3.pdf',
        './Manuals/SOASE.pdf',
        './Manuals/SOASUITEHL7.pdf'
    ]
    already_indexed_docs = load_previously_indexed_docs()
    updated_docs = set()
    try:
        vectorstore = FAISS.load_local(INDEX_PATH, embeddings, allow_dangerous_deserialization=True)
        print("✔️ FAISS index loaded.")
    except Exception:
        print("⚠️ FAISS index not found, creating a new one.")
        vectorstore = None
    new_chunks = []
    for pdf_path in tqdm(pdf_paths, desc="📄 Processing PDFs"):
        print(f"   {os.path.basename(pdf_path)}")
        if pdf_path in already_indexed_docs:
            print(f"✅ Already indexed: {pdf_path}")
            continue
        full_text = read_pdfs(pdf_path=pdf_path)
        text_chunks = smart_split_text(full_text, max_chunk_size=2000)
        for chunk_text in tqdm(text_chunks, desc="📄 Splitting text", dynamic_ncols=True, leave=False):
            doc = Document(page_content=chunk_text, metadata={"source": pdf_path})
            new_chunks.append(doc)
            print(f"✅ Indexed chunk with {len(chunk_text)} chars.")
        updated_docs.add(str(pdf_path))
    if new_chunks:
        if vectorstore:
            vectorstore.add_documents(new_chunks)
        else:
            vectorstore = FAISS.from_documents(new_chunks, embedding=embeddings)
        vectorstore.save_local(INDEX_PATH)
        save_indexed_docs(already_indexed_docs.union(updated_docs))
        print(f"💾 {len(new_chunks)} chunks saved to FAISS index.")
    else:
        print("📁 No new documents to index.")
    retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 50, "fetch_k": 100})
    template = """
Document context:
{context}

Question:
{input}

Interpretation rules:
Rule 1: SOA SUITE documents: `SOASUITE.pdf` and `SOASUITEHL7.pdf`
Rule 2: Oracle Integration (OIC) document: `using-integrations-oracle-integration-3.pdf`
Rule 3: If not a comparison between SOA SUITE and OIC, only consider documents relevant to the product.
Rule 4: If the question compares SOA SUITE and OIC, compare both.

Mention at the beginning which tool is being addressed: {input}
"""
    prompt = PromptTemplate.from_template(template)

    def get_context(x):
        query = x.get("input") if isinstance(x, dict) else x
        return retriever.invoke(query)

    chain = (
        RunnableMap({
            "context": RunnableLambda(get_context),
            "input": lambda x: x.get("input") if isinstance(x, dict) else x
        })
        | prompt
        | llm
        | StrOutputParser()
    )
    print("READY")
    while True:
        query = input()
        if query == "quit":
            break
        response = chain.invoke(query)
        print(response)


chat()

16
files/requirements.txt Normal file

@@ -0,0 +1,16 @@
langchain==0.2.0
langchain-community==0.0.30
langchain-core==0.2.0
tqdm
faiss-cpu
unstructured[pdf,ppt]==0.13.2
PyMuPDF==1.24.1
PyPDF2==3.0.1
ocrmypdf==14.1.0 # optional, if you want an OCR fallback
pypandoc # required by some .pptx loaders
pillow
python-docx
chardet
lxml
oci
oci-cli

BIN
images/.DS_Store vendored

BIN
images/img_1a.png Normal file

BIN
images/img_a.png Normal file
275
index.md Normal file

@@ -0,0 +1,275 @@
# Analyze PDF Documents in Natural Language with OCI Generative AI
## Introduction
Oracle Cloud Infrastructure Generative AI (OCI Generative AI) is an advanced generative artificial intelligence solution that enables companies and developers to create intelligent applications using cutting-edge language models. Based on powerful technologies such as Large Language Models (LLMs), this solution allows the automation of complex tasks, making processes faster, more efficient, and accessible through natural language interactions.
One of the most impactful applications of OCI Generative AI is in PDF document analysis. Companies frequently deal with large volumes of documents, such as contracts, financial reports, technical manuals, and research papers. Manually searching for information in these files can be time-consuming and prone to errors.
With the use of generative artificial intelligence, it is possible to extract information instantly and accurately, allowing users to query complex documents simply by formulating questions in natural language. This means that instead of reading entire pages to find a specific clause in a contract or a relevant data point in a report, users can just ask the model, which quickly returns the answer based on the analyzed content.
Beyond information retrieval, OCI Generative AI can also be used to summarize lengthy documents, compare content, classify information, and even generate strategic insights. These capabilities make the technology essential for various fields, such as legal, finance, healthcare, and engineering, optimizing decision-making and increasing productivity.
By integrating this technology with tools such as Oracle AI services, OCI Data Science, and APIs for document processing, companies can build intelligent solutions that completely transform the way they interact with their data, making information retrieval faster and more effective.
### Prerequisites
- Install Python `version 3.10` or higher and Oracle Cloud Infrastructure Command Line Interface (OCI CLI).
## Task 1: Install Python Packages
The Python code requires certain libraries for using OCI Generative AI. Run the following command to install the required Python packages.
```
pip install -r requirements.txt
```
## Task 2: Understand the Python Code
This is a demo of OCI Generative AI for querying functionalities of Oracle SOA Suite and Oracle Integration. Both tools are currently used for hybrid integration strategies, which means they operate in both cloud and on-premises environments.
Since these tools share functionalities and processes, this code helps in understanding how to implement the same integration approach in each tool. Additionally, it allows users to explore common characteristics and differences.
Download the Python code from here:
- [`requirements.txt`](./files/requirements.txt)
- [`oci_genai_llm_context.py`](./files/oci_genai_llm_context.py)
- [`oci_genai_llm_context_fast.py`](./files/oci_genai_llm_context_fast.py)
You can find the PDF documents here:
- [`SOASE.pdf`](https://docs.oracle.com/middleware/12211/soasuite/develop/SOASE.pdf)
- [`SOASUITEHL7.pdf`](https://docs.oracle.com/en/learn/oci-genai-pdf/files/SOASUITEHL7.pdf)
- [`using-integrations-oracle-integration-3.pdf`](https://docs.oracle.com/en/cloud/paas/application-integration/integrations-user/using-integrations-oracle-integration-3.pdf)
Create a folder named `Manuals` and move these PDFs there.
- **Import Libraries:**
Imports the necessary libraries for processing PDFs, OCI Generative AI, text vectorization, and storage in vector databases (Facebook AI Similarity Search (FAISS) and ChromaDB).
- `UnstructuredPDFLoader` is used to extract text from PDFs.
- `ChatOCIGenAI` enables the use of OCI Generative AI models to answer questions.
- `OCIGenAIEmbeddings` creates embeddings (vector representations) of text for semantic search.
![img.png](./images/img_a.png "image")
- **Load and Process PDFs:**
Lists the PDF files to be processed.
- `UnstructuredPDFLoader` reads each document and splits it into pages for easier indexing and searching.
- Document IDs are stored for future reference.
![img.png](./images/img_1.png "image")
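The page texts returned by the loader are joined into a single string before chunking. A minimal sketch of that step, using a hypothetical `Page` stand-in instead of LangChain's loaded document objects:

```python
class Page:
    """Stand-in for a loaded PDF page (LangChain pages expose .page_content)."""
    def __init__(self, page_content):
        self.page_content = page_content

def join_pages(doc_pages):
    # Concatenate the extracted text of every page into one string
    return "\n".join(page.page_content for page in doc_pages)
```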
- **Configure the OCI Generative AI Model:**
Configures the `Llama-3.1-405b` model hosted on OCI to generate responses based on the loaded documents.
Defines parameters such as `temperature` (randomness control), `top_p` (diversity control), and `max_tokens` (token limit).
![img_2.png](./images/img_2.png "image")
> **Note:** The available LLaMA version may change over time. Please check the current version in your tenancy and update your code if needed.
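The generation parameters can be kept in a plain dictionary and passed as `model_kwargs`; the values below mirror the ones used in this tutorial:

```python
# Generation parameters passed to the ChatOCIGenAI model via model_kwargs
model_kwargs = {
    "temperature": 0.7,  # randomness control: higher values give more varied output
    "top_p": 0.75,       # diversity control: nucleus-sampling probability mass
    "max_tokens": 4000,  # upper limit on the length of the generated response
}
```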
- **Create Embeddings and Vector Indexing:**
Uses Oracle's embedding model to transform text into numerical vectors, facilitating semantic searches in documents.
![img_3.png](./images/img_3.png "image")
- FAISS stores the embeddings of the PDF documents for quick queries.
- `retriever` allows retrieving the most relevant excerpts based on the semantic similarity of the user's query.
![img_5.png](./images/img_5.png "image")
- During the first processing execution, the vector data is saved in a FAISS database on disk.
![img_6.png](./images/img_6.png "image")
- **Define the Prompt:**
Creates an intelligent prompt for the generative model, guiding it to consider only relevant documents for each query.
This improves the accuracy of responses and avoids unnecessary information.
![img_4.png](./images/img_4.png "image")
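Conceptually, the prompt is a template with two slots, `{context}` and `{input}`. A simplified sketch using plain `str.format` (the actual code uses LangChain's `PromptTemplate`, and the full template also lists the per-product interpretation rules):

```python
# Simplified version of the routing prompt used in this tutorial
TEMPLATE = """Document context:
{context}

Question:
{input}

Only consider the documents relevant to the product mentioned in the question."""

def build_prompt(context, question):
    # Fill both slots of the template with the retrieved context and the query
    return TEMPLATE.format(context=context, input=question)
```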
- **Create the Processing Chain (RAG - Retrieval-Augmented Generation):**
Implements a RAG flow, where:
- `retriever` searches for the most relevant document excerpts.
- `prompt` organizes the query for better context.
- `llm` generates a response based on the retrieved documents.
- `StrOutputParser` formats the final output.
![img_7.png](./images/img_7.png "image")
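Stripped of the LangChain plumbing, the chain is plain function composition. A sketch of the four steps with stub components (not the Runnable API itself):

```python
def make_rag_chain(retriever, format_prompt, llm, parse_output):
    """Compose the four RAG steps into one callable (an illustrative sketch)."""
    def chain(question):
        context = retriever(question)              # 1. retrieve relevant excerpts
        prompt = format_prompt(context, question)  # 2. organize the query with context
        answer = llm(prompt)                       # 3. generate a response
        return parse_output(answer)                # 4. format the final output
    return chain
```

With stub functions in place of the retriever and the model, the same question-in, answer-out shape emerges as in the LangChain pipeline.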
- **Question and Answer Loop:**
Maintains a loop where users can ask questions about the loaded documents.
- The AI responds using the knowledge base extracted from the PDFs.
- If you enter `quit`, it exits the program.
![img_8.png](./images/img_8.png "image")
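The interactive loop can be factored so it is testable: feed it an iterable of questions and stop on the `quit` sentinel. A sketch mirroring the script's while-loop:

```python
def qa_loop(ask, questions):
    """Answer each question with `ask` until the sentinel 'quit' appears,
    mirroring the interactive loop in the script."""
    answers = []
    for question in questions:
        if question == "quit":
            break
        answers.append(ask(question))
    return answers
```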
## Fixed Size Chunking
**(A Faster Alternative: Fixed-Size Chunking)**
Fixed-Size Chunking is a simple and efficient text-splitting strategy where documents are divided into chunks based on predefined size limits, typically measured in tokens, characters, or lines.
This method does not analyze the meaning or structure of the text. It simply slices the content at fixed intervals, regardless of whether the cut happens in the middle of a sentence, paragraph, or idea.
**How Fixed-Size Chunking Works:**

- **Example rule:** Split the document every 1,000 tokens (or every 3,000 characters).
- **Optional overlap:** To reduce the risk of splitting relevant context, some implementations add an overlap between consecutive chunks (for example, a 200-token overlap) to ensure that important context isn't lost at the boundary.
**Benefits of Fixed-Size Chunking:**
- **Fast processing:**
No need for semantic analysis, LLM inference, or content understanding. Just count and cut.
- **Low resource consumption:**
Minimal CPU/GPU and memory usage, making it scalable for large datasets.
- **Easy to implement:**
Works with simple scripts or standard text processing libraries.
**Limitations of Fixed-Size Chunking:**
- **Poor semantic awareness:**
Chunks may cut off sentences, paragraphs, or logical sections, leading to incomplete or fragmented ideas.
- **Reduced retrieval precision:**
In applications like semantic search or Retrieval-Augmented Generation (RAG), poor chunk boundaries can affect the relevance and quality of retrieved answers.
**When to Use Fixed-Size Chunking:**
- When processing speed and scalability are top priorities.
- For large-scale document ingestion pipelines where semantic precision is not critical.
- As a first step in scenarios where later refinement or semantic analysis will happen downstream.
- This is a very simple method to split text:
![img_10.png](images/img_10.png)
- And this is the main process of fixed chunking:
![img_11.png](images/img_11.png)
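Fixed-size chunking with an optional overlap fits in a few lines. A minimal sketch (character-based for simplicity; the same idea applies to tokens):

```python
def fixed_size_chunks(text, chunk_size=1000, overlap=200):
    """Slice text at fixed intervals; each chunk repeats the last `overlap`
    characters of the previous one so boundary context is not lost."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the text
        start += chunk_size - overlap
    return chunks
```

Note that no semantic analysis happens anywhere: the function just counts and cuts, which is exactly why this strategy is fast.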
>**Note:** Download the following code to run the **fixed chunking** process much **faster**: [`oci_genai_llm_context_fast.py`](./files/oci_genai_llm_context_fast.py).
## Semantic Chunking
**What is Semantic Chunking?**
Semantic Chunking is a text pre-processing technique where large documents (such as PDFs, presentations, or articles) are split into smaller parts called “chunks”, with each chunk representing a semantically coherent block of text.
Unlike traditional fixed-size chunking (e.g., splitting every 1000 tokens or every X characters), Semantic Chunking uses Artificial Intelligence (typically Large Language Models - LLMs) to detect natural content boundaries, respecting topics, sections, and context.
Instead of cutting text arbitrarily, Semantic Chunking tries to preserve the full meaning of each section, creating standalone, context-aware pieces.
**Why Can Semantic Chunking Make Processing Slower?**
A traditional chunking process, based on fixed size, is fast: the system just counts tokens or characters and cuts accordingly.
With Semantic Chunking, several extra steps of semantic analysis are required:
1. Reading and interpreting the full text (or large blocks) before splitting:
The LLM needs to “understand” the content to identify the best chunk boundaries.
2. Running LLM prompts or topic classification models:
The system often queries the LLM with questions like:
“Is this the end of an idea?” or “Does this paragraph start a new section?”
3. Higher memory and CPU/GPU usage:
Because the model processes larger text blocks before making chunking decisions, resource consumption is significantly higher.
4. Sequential and incremental decision-making:
Semantic chunking often works in steps (e.g., analyzing 10,000-token blocks and then refining chunk boundaries inside that block), which increases total processing time.
>**Note 1:** Depending on your machine's processing power, the first execution using **Semantic Chunking** can take a very long time.
>**Note 2:** You can use this algorithm to produce customized chunking using **OCI Gen AI**.
- This is the main document process. It uses:
    - **smart_split_text()**: splits the full text into small pieces of about 10 KB (you can configure other strategies). The mechanism is aware of the last paragraph: if part of a paragraph spills into the next text piece, that part is skipped in the current pass and appended to the next processing group.
    - **semantic_chunk()**: uses the OCI LLM mechanism to separate the paragraphs. It includes the intelligence to identify titles, table components, and paragraph boundaries to execute a smart chunk. The strategy here is the **Semantic Chunking** technique, so it takes more time to complete than the common fixed-size processing. The first run takes a long time, but subsequent runs load the pre-saved FAISS data.
    - **split_llm_output_into_chapters()**: finalizes the chunking, separating the chapters.
![img.png](images/img_9.png)
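The paragraph carry-over described for `smart_split_text()` can be sketched like this (a simplified illustration; the real script's piece size and boundary rules may differ):

```python
def smart_split_text(text: str, max_bytes: int = 10_000) -> list[str]:
    """Split text into roughly max_bytes pieces without cutting a paragraph:
    a paragraph that would overflow a piece is carried over to the next one."""
    pieces, current, size = [], [], 0
    for para in text.split("\n\n"):
        para_len = len(para.encode("utf-8"))
        if current and size + para_len > max_bytes:
            # Close the current piece; this paragraph starts the next one.
            pieces.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += para_len + 2  # account for the paragraph separator
    if current:
        pieces.append("\n\n".join(current))
    return pieces
```

Keeping paragraphs whole means each ~10 KB block handed to the LLM is self-contained, which is what lets the semantic pass make clean boundary decisions inside it.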
## Task 3: Run Query for Oracle Integration and Oracle SOA Suite Contents
Run the following command.
```
# For the fixed chunking technique (faster method)
python oci_genai_llm_context_fast.py --device="mps" --gpu_name="M2Max GPU 32 Cores"
```
```
# For the semantic chunking technique
python oci_genai_llm_context.py --device="mps" --gpu_name="M2Max GPU 32 Cores"
```
> **Note:** The `--device` and `--gpu_name` parameters can be used to accelerate the processing in Python, using GPU if your machine has one. Consider that this code can be used with local models too.
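The `--device` and `--gpu_name` flags might be wired up along these lines (a hypothetical `argparse` sketch; the actual scripts may define them differently):

```python
import argparse

parser = argparse.ArgumentParser(description="OCI GenAI PDF chat")
parser.add_argument("--device", default="cpu",
                    help="compute device: cpu, cuda, or mps (Apple Silicon)")
parser.add_argument("--gpu_name", default="",
                    help="free-text label for the GPU, shown in logs")

# Parsing the flags from the example command above:
args = parser.parse_args(["--device=mps", "--gpu_name=M2Max GPU 32 Cores"])
```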
The provided context distinguishes between Oracle SOA Suite and Oracle Integration, so you can test the code considering these points:
- The query should be made only for Oracle SOA Suite: Therefore, only Oracle SOA Suite documents should be considered.
- The query should be made only for Oracle Integration: Therefore, only Oracle Integration documents should be considered.
- The query requires a comparison between Oracle SOA Suite and Oracle Integration: Therefore, all documents should be considered.
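The document-scoping rule above can be sketched as a simple keyword router (an illustration only; the sample code implements this behavior through the prompt instead of explicit routing):

```python
def route_query(query: str) -> list[str]:
    """Decide which document sets a query should search."""
    q = query.lower()
    mentions_soa = "soa suite" in q
    mentions_oic = "oracle integration" in q or "oic" in q
    if mentions_soa and mentions_oic:
        return ["soa_suite", "oracle_integration"]  # comparison: use all docs
    if mentions_soa:
        return ["soa_suite"]
    if mentions_oic:
        return ["oracle_integration"]
    return ["soa_suite", "oracle_integration"]  # ambiguous: search everything
```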
We can define the following context, which greatly helps in interpreting the documents correctly.
![img_7.png](images/img_7.png)
The following image shows an example of a comparison between Oracle SOA Suite and Oracle Integration.
![img.png](./images/img.png "image")
The following image shows an example for Kafka.
![img_1.png](./images/img_1.png "image")
## Next Steps
This code demonstrates an application of OCI Generative AI for intelligent PDF analysis. It enables users to efficiently query large volumes of documents using semantic searches and a generative AI model to generate accurate natural language responses.
This approach can be applied in various fields, such as legal, compliance, technical support, and academic research, making information retrieval much faster and smarter.
## Related Links
- [Extending SaaS by AI/ML features - Part 8: OCI Generative AI Integration with LangChain Use Cases](https://www.ateam-oracle.com/post/oci-generative-ai-integration-with-langchain-usecases)
- [Bridging cloud and conversational AI: LangChain and OCI Data Science platform](https://blogs.oracle.com/ai-and-datascience/post/cloud-conversational-ai-langchain-oci-data-science)
- [Install OCI CLI](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm#Quickstart)
- [Introduction to Custom and Built-in Python LangChain Agents](https://wellsr.com/python/working-with-python-langchain-agents/)
## Acknowledgments
- **Author** - Cristiano Hoshikawa (Oracle LAD A-Team Solution Engineer)

For reference, a minimal version of the chat script that loads all PDF pages directly into a FAISS vector store, without advanced chunking:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.chat_models.oci_generative_ai import ChatOCIGenAI
from langchain_core.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain_community.embeddings import OCIGenAIEmbeddings
from langchain_community.vectorstores import FAISS


def chat():
    caminhos_pdf = [
        './Manuals/using-integrations-oracle-integration-3.pdf',
        './Manuals/SOASUITE.pdf',
        './Manuals/SOASUITEHL7.pdf',
    ]

    # Load every PDF and split it into pages
    pages = []
    for caminho_pdf in caminhos_pdf:
        doc_pages = PyPDFLoader(caminho_pdf).load_and_split()
        pages.extend(doc_pages)

    llm = ChatOCIGenAI(
        model_id="meta.llama-3.1-405b-instruct",
        service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
        compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        auth_profile="DEFAULT",  # replace with your profile name
        model_kwargs={"temperature": 0.7, "top_p": 0.75, "max_tokens": 1000},
    )

    embeddings = OCIGenAIEmbeddings(
        model_id="cohere.embed-multilingual-v3.0",
        service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
        compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        auth_profile="DEFAULT",  # replace with your profile name
    )

    vectorstore = FAISS.from_documents(pages, embedding=embeddings)
    retriever = vectorstore.as_retriever()
    # You can persist your vector store here

    template = """
    If the query in question is not a comparison between SOA SUITE and OIC, consider only the documents relevant to the subject,
    that is, if the question is about SOA SUITE, consider only the SOA SUITE documents. If the question is about OIC,
    consider only the OIC document. If the question is a comparison between SOA SUITE and OIC, consider all documents.
    Inform at the beginning which tool is being discussed: {input}
    """
    prompt = PromptTemplate.from_template(template)

    chain = (
        {"context": retriever, "input": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

    # Interactive loop: type a question, or "quit" to exit
    while True:
        query = input()
        if query == "quit":
            break
        print(chain.invoke(query))


chat()
```

The project's `requirements.txt`:

```
faiss-cpu
oci-cli
langchain
langchain_community
langchain_cohere
langchain-core
langchain-text-splitters
```