adjustments

This commit is contained in:
2025-06-19 09:33:12 -03:00
parent 0a6583752c
commit 83020a54e8
30 changed files with 727 additions and 245 deletions

BIN
.DS_Store vendored


163
README.md

@@ -1,163 +0,0 @@
# Consult PDF Documents in Natural Language with OCI Generative AI
## Introduction
Oracle Cloud Generative AI is an advanced generative artificial intelligence solution that enables companies and developers to create intelligent applications using cutting-edge language models. Based on powerful technologies such as LLMs (Large Language Models), this solution allows the automation of complex tasks, making processes faster, more efficient, and accessible through natural language interactions.
One of the most impactful applications of Oracle Cloud Generative AI is in PDF document analysis. Companies frequently deal with large volumes of documents, such as contracts, financial reports, technical manuals, and research papers. Manually searching for information in these files can be time-consuming and prone to errors.
With the use of generative artificial intelligence, it is possible to extract information instantly and accurately, allowing users to query complex documents simply by formulating questions in natural language. This means that instead of reading entire pages to find a specific clause in a contract or a relevant data point in a report, users can just ask the model, which quickly returns the answer based on the analyzed content.
Beyond information retrieval, Oracle Cloud Generative AI can also be used to summarize lengthy documents, compare content, classify information, and even generate strategic insights. These capabilities make the technology essential for various fields, such as legal, finance, healthcare, and engineering, optimizing decision-making and increasing productivity.
By integrating this technology with tools such as Oracle AI Services, OCI Data Science, and APIs for document processing, companies can build intelligent solutions that completely transform the way they interact with their data, making information retrieval faster and more effective.
### Prerequisites
To use the demo, you need to have the following pre-installed:
- Python 3.10 or higher
- OCI CLI
### Install Python Packages
The Python code requires certain libraries for using OCI Generative AI. Install the required Python packages by running:
```
pip install -r requirements.txt
```
## Understand the Code
This is a demo of OCI Generative AI for querying functionalities of Oracle SOA SUITE and Oracle Integration.
Both tools are currently used for hybrid integration strategies, meaning they operate in both Cloud and on-prem environments.
Since these tools share functionalities and processes, this code helps in understanding how to implement the same integration approach in each. Additionally, it allows users to explore common characteristics and differences.
You can find the Python code at:
- [requirements.txt](source/requirements.txt)
- [oci_genai_llm_context.py](source/oci_genai_llm_context.py)
Below, we will explain each section of the code.
### Import Libraries
Imports the necessary libraries for processing PDFs, Oracle generative AI, text vectorization, and storage in vector databases (FAISS and Chroma).
• PyPDFLoader is used to extract text from PDFs.
• ChatOCIGenAI enables the use of Oracle Cloud Generative AI models to answer questions.
• OCIGenAIEmbeddings creates embeddings (vector representations) of texts for semantic search.
![img_6.png](images/img_6.png)
### Load and Process PDFs
Lists the PDF files to be processed.
• PyPDFLoader reads each document and splits it into pages for easier indexing and searching.
• Document IDs are stored for future reference.
![img_7.png](images/img_7.png)
### Configure the Oracle Generative Model
Configures the Llama-3.1-405b model hosted on Oracle Cloud to generate responses based on the loaded documents.
• Defines parameters such as temperature (randomness control), top_p (diversity control), and token limit.
![img_8.png](images/img_8.png)
>**Note:** Please confirm the version of Llama available in your tenancy. Depending on when you are reading this tutorial, this model may no longer be available.
### Create Embeddings and Vector Indexing
Uses Oracle's embedding model to transform texts into numerical vectors, facilitating semantic searches in documents.
![img_9.png](images/img_9.png)
• FAISS (Facebook AI Similarity Search) stores the embeddings of the PDF documents for quick queries.
• retriever allows retrieving the most relevant excerpts based on the semantic similarity of the user's query.
![img_10.png](images/img_10.png)
### Define the Prompt
Creates an intelligent prompt for the generative model, guiding it to consider only relevant documents for each query.
• This improves the accuracy of responses and avoids unnecessary information.
![img_12.png](images/img_12.png)
### Create the Processing Chain (RAG - Retrieval-Augmented Generation)
Implements an RAG (Retrieval-Augmented Generation) flow, where:
1. retriever searches for the most relevant document excerpts.
2. prompt organizes the query for better context.
3. llm generates a response based on the retrieved documents.
4. StrOutputParser formats the final output.
![img_13.png](images/img_13.png)
### Question and Answer Loop
Maintains a loop where users can ask questions about the loaded documents.
• The AI responds using the knowledge base extracted from the PDFs.
• Typing "quit" exits the program.
![img_14.png](images/img_14.png)
## Query for OIC and SOA Suite contents
Run the following command:
```
python oci_genai_llm_context.py --device="mps" --gpu_name="M2Max GPU 32 Cores"
```
>**Note:** The `--device` and `--gpu_name` parameters can be used to accelerate processing in Python, using a GPU if your machine has one. Note that this code can also be used with local models.
Thanks to the context provided to distinguish between SOA SUITE and Oracle Integration, you can test the code considering these points:
- The query should be made only for SOA SUITE: Therefore, only SOA SUITE documents should be considered.
- The query should be made only for Oracle Integration: Therefore, only Oracle Integration documents should be considered.
- The query requires a comparison between SOA SUITE and Oracle Integration: Therefore, all documents should be considered.
We can define the following context, which greatly helps in interpreting the documents correctly:
![img_3.png](images/img_3.png)
Example of comparison between SOA SUITE and Oracle Integration:
![img.png](images/img.png)
Example regarding Kafka:
![img_1.png](images/img_1.png)
## Conclusion
This code demonstrates an application of Oracle Cloud Generative AI for intelligent PDF analysis. It enables users to efficiently query large volumes of documents using semantic searches and a generative AI model to generate accurate natural language responses.
This approach can be applied in various fields, such as legal, compliance, technical support, and academic research, making information retrieval much faster and smarter.
## References
- [Extending SaaS by AI/ML features - Part 8: OCI Generative AI Integration with LangChain Use Cases](https://www.ateam-oracle.com/post/oci-generative-ai-integration-with-langchain-usecases)
- [Bridging cloud and conversational AI: LangChain and OCI Data Science platform](https://blogs.oracle.com/ai-and-datascience/post/cloud-conversational-ai-langchain-oci-data-science)
- [Install OCI CLI](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm#Quickstart)
- [Introduction to Custom and Built-in Python LangChain Agents](https://wellsr.com/python/working-with-python-langchain-agents/)
## Acknowledgments
- **Author** - Cristiano Hoshikawa (Oracle LAD A-Team Solution Engineer)

14
_config.yml Normal file

@@ -0,0 +1,14 @@
title: Analyze PDF Documents in Natural Language with OCI Generative AI
description: Learn how to analyze PDF documents in natural language using Oracle Cloud Infrastructure Generative AI (OCI Generative AI).
baseurl: /en/learn/oci-genai-pdf
copyright: 2025
copyright-last: 2025
partno: G29356-03
print-month: June
print-year: 2025
duration: PT01H0M0S
level: Beginner
roles: Application Administrator;Application Developer;DevOps Engineer;Developer
products: en/cloud/oracle-cloud-infrastructure/oci;en/cloud/oracle-cloud-infrastructure/generative-ai
keywords: Cloud Native
inject-note: true


@@ -0,0 +1,258 @@
from langchain_community.chat_models.oci_generative_ai import ChatOCIGenAI
from langchain_core.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain_community.embeddings import OCIGenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.schema.runnable import RunnableMap
from langchain_community.document_loaders import UnstructuredPDFLoader, PyMuPDFLoader
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
from tqdm import tqdm
import os
import pickle
import re

INDEX_PATH = "./faiss_index"
PROCESSED_DOCS_FILE = os.path.join(INDEX_PATH, "processed_docs.pkl")

# Markdown-style headings ('# Title' .. '###### Title') or bold '**Title**' lines
chapter_separator_regex = r"^(#{1,6} .+|\*\*.+\*\*)$"


def split_llm_output_into_chapters(llm_text):
    """
    Splits the LLM output text into chapters, assuming the LLM separates
    chapters using markdown-style headings like '# Title'.
    """
    chapters = []
    current_chapter = []
    lines = llm_text.splitlines()
    for line in lines:
        if re.match(chapter_separator_regex, line):
            if current_chapter:
                chapters.append("\n".join(current_chapter).strip())
            current_chapter = [line]
        else:
            current_chapter.append(line)
    if current_chapter:
        chapters.append("\n".join(current_chapter).strip())
    return chapters


def semantic_chunking(text):
    llm = ChatOCIGenAI(
        model_id="meta.llama-3.1-405b-instruct",
        service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
        compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        auth_profile="DEFAULT",
    )
    prompt = f"""
You received the following text extracted via OCR:

{text}

Your task:
1. Identify headings (short uppercase or bold lines, no period at the end)
2. Separate paragraphs by heading
3. Indicate columns with [COLUMN 1], [COLUMN 2] if present
4. Indicate tables with [TABLE] in markdown format
"""
    response = llm.invoke(prompt)
    return response


def read_pdfs(pdf_path):
    # PDFs produced by OCR are read with PyMuPDF; the rest with Unstructured
    if "-ocr" in pdf_path:
        doc_pages = PyMuPDFLoader(str(pdf_path)).load()
    else:
        doc_pages = UnstructuredPDFLoader(str(pdf_path)).load()
    full_text = "\n".join([page.page_content for page in doc_pages])
    return full_text


def smart_split_text(text, max_chunk_size=10_000):
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = min(start + max_chunk_size, text_length)
        # Try to find the last sentence end before the limit (., ?, !, \n\n)
        split_point = max(
            text.rfind('.', start, end),
            text.rfind('!', start, end),
            text.rfind('?', start, end),
            text.rfind('\n\n', start, end)
        )
        # If not found, make a hard cut
        if split_point == -1 or split_point <= start:
            split_point = end
        else:
            split_point += 1  # Include the ending character
        chunk = text[start:split_point].strip()
        if chunk:
            chunks.append(chunk)
        start = split_point
    return chunks


def load_previously_indexed_docs():
    if os.path.exists(PROCESSED_DOCS_FILE):
        with open(PROCESSED_DOCS_FILE, "rb") as f:
            return pickle.load(f)
    return set()


def save_indexed_docs(docs):
    with open(PROCESSED_DOCS_FILE, "wb") as f:
        pickle.dump(docs, f)


def append_text_to_file(file_path, text):
    """
    Appends text to the end of a file. If the file doesn't exist, it will be created.

    Args:
        file_path (str): Path to the file where the text will be saved.
        text (str): Text to append.
    """
    with open(file_path, "a", encoding="utf-8") as f:
        f.write(text + "\n")


def chat():
    llm = ChatOCIGenAI(
        model_id="meta.llama-3.1-405b-instruct",
        service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
        compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        auth_profile="DEFAULT",  # Replace with your profile name
        model_kwargs={"temperature": 0.7, "top_p": 0.75, "max_tokens": 4000},
    )
    embeddings = OCIGenAIEmbeddings(
        model_id="cohere.embed-multilingual-v3.0",
        service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
        compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        auth_profile="DEFAULT",  # Replace with your profile name
    )
    pdf_paths = [
        './Manuals/using-integrations-oracle-integration-3.pdf',
        './Manuals/SOASE.pdf',
        './Manuals/SOASUITEHL7.pdf'
    ]
    already_indexed_docs = load_previously_indexed_docs()
    updated_docs = set()
    # Try loading an existing FAISS index
    try:
        vectorstore = FAISS.load_local(INDEX_PATH, embeddings, allow_dangerous_deserialization=True)
        print("✔️ FAISS index loaded.")
    except Exception:
        print("⚠️ FAISS index not found, creating a new one.")
        vectorstore = None
    new_chunks = []
    for pdf_path in tqdm(pdf_paths, desc="📄 Processing PDFs"):
        print(f"   {os.path.basename(pdf_path)}")
        if pdf_path in already_indexed_docs:
            print(f"✅ Document already indexed: {pdf_path}")
            continue
        full_text = read_pdfs(pdf_path=pdf_path)
        # Split the text into ~10 KB chunks (~10,000 characters)
        text_chunks = smart_split_text(full_text, max_chunk_size=10_000)
        overflow_buffer = ""  # Remainder from the previous chapter, if any
        for chunk in tqdm(text_chunks, desc="📄 Processing text chunks", dynamic_ncols=True, leave=False):
            # Join with the leftover from the previous chunk
            current_text = overflow_buffer + chunk
            # Send the text to the LLM for semantic splitting
            treated_text = semantic_chunking(current_text)
            if hasattr(treated_text, "content"):
                chapters = split_llm_output_into_chapters(treated_text.content)
                # Check if the last chapter seems incomplete
                last_chapter = chapters[-1] if chapters else ""
                # Simple criterion: the text ends without punctuation (., !, ?)
                if last_chapter and not last_chapter.strip().endswith((".", "!", "?", "\n\n")):
                    print("📌 Last chapter seems incomplete, saving for the next cycle")
                    overflow_buffer = last_chapter
                    chapters = chapters[:-1]  # Don't index the incomplete chapter yet
                else:
                    overflow_buffer = ""  # Nothing left over
                # Save complete chapters as document chunks
                for chapter_text in chapters:
                    doc = Document(page_content=chapter_text, metadata={"source": pdf_path})
                    new_chunks.append(doc)
                    print(f"✅ New chapter indexed:\n{chapter_text}...\n")
            else:
                print(f"[ERROR] semantic_chunking returned unexpected type: {type(treated_text)}")
        # Index any remainder left after the final chunk so it is not dropped
        if overflow_buffer:
            new_chunks.append(Document(page_content=overflow_buffer, metadata={"source": pdf_path}))
        updated_docs.add(str(pdf_path))
    # If there are new documents, index them
    if new_chunks:
        if vectorstore:
            vectorstore.add_documents(new_chunks)
        else:
            vectorstore = FAISS.from_documents(new_chunks, embedding=embeddings)
        vectorstore.save_local(INDEX_PATH)
        save_indexed_docs(already_indexed_docs.union(updated_docs))
        print(f"💾 {len(new_chunks)} chunks added to FAISS index.")
    else:
        print("📁 No new documents to index.")
    retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 50, "fetch_k": 100})
    template = """
Document context:
{context}

Question:
{input}

Interpretation rules:
Rule 1: SOA SUITE documents: `SOASUITE.pdf` and `SOASUITEHL7.pdf`
Rule 2: Oracle Integration (known as OIC) document: `using-integrations-oracle-integration-3.pdf`
Rule 3: If the query is not a comparison between SOA SUITE and Oracle Integration (OIC), only consider documents relevant to the product.
Rule 4: If the question is a comparison between SOA SUITE and OIC, consider all documents and compare between them.

Mention at the beginning which tool is being addressed: {input}
"""
    prompt = PromptTemplate.from_template(template)

    def get_context(x):
        query = x.get("input") if isinstance(x, dict) else x
        return retriever.invoke(query)

    chain = (
        RunnableMap({
            "context": RunnableLambda(get_context),
            "input": lambda x: x.get("input") if isinstance(x, dict) else x
        })
        | prompt
        | llm
        | StrOutputParser()
    )
    print("READY")
    while True:
        query = input()
        if query == "quit":
            break
        response = chain.invoke(query)
        print(response)


chat()


@@ -0,0 +1,164 @@
from langchain_community.chat_models.oci_generative_ai import ChatOCIGenAI
from langchain_core.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain_community.embeddings import OCIGenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.schema.runnable import RunnableMap
from langchain_community.document_loaders import UnstructuredPDFLoader, PyMuPDFLoader
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
from tqdm import tqdm
import os
import pickle

INDEX_PATH = "./faiss_index"
PROCESSED_DOCS_FILE = os.path.join(INDEX_PATH, "processed_docs.pkl")


def read_pdfs(pdf_path):
    if "-ocr" in pdf_path:
        doc_pages = PyMuPDFLoader(str(pdf_path)).load()
    else:
        doc_pages = UnstructuredPDFLoader(str(pdf_path)).load()
    full_text = "\n".join([page.page_content for page in doc_pages])
    return full_text


def smart_split_text(text, max_chunk_size=2000):
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = min(start + max_chunk_size, text_length)
        # Prefer to cut at the last sentence end (., !, ?, or blank line)
        split_point = max(
            text.rfind('.', start, end),
            text.rfind('!', start, end),
            text.rfind('?', start, end),
            text.rfind('\n\n', start, end)
        )
        if split_point == -1 or split_point <= start:
            split_point = end  # Hard cut when no boundary is found
        else:
            split_point += 1  # Include the ending character
        chunk = text[start:split_point].strip()
        if chunk:
            chunks.append(chunk)
        start = split_point
    return chunks


def load_previously_indexed_docs():
    if os.path.exists(PROCESSED_DOCS_FILE):
        with open(PROCESSED_DOCS_FILE, "rb") as f:
            return pickle.load(f)
    return set()


def save_indexed_docs(docs):
    with open(PROCESSED_DOCS_FILE, "wb") as f:
        pickle.dump(docs, f)


def chat():
    llm = ChatOCIGenAI(
        model_id="meta.llama-3.1-405b-instruct",
        service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
        compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        auth_profile="DEFAULT",
        model_kwargs={"temperature": 0.7, "top_p": 0.75, "max_tokens": 4000},
    )
    embeddings = OCIGenAIEmbeddings(
        model_id="cohere.embed-multilingual-v3.0",
        service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
        compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        auth_profile="DEFAULT",
    )
    pdf_paths = [
        './Manuals/using-integrations-oracle-integration-3.pdf',
        './Manuals/SOASE.pdf',
        './Manuals/SOASUITEHL7.pdf'
    ]
    already_indexed_docs = load_previously_indexed_docs()
    updated_docs = set()
    try:
        vectorstore = FAISS.load_local(INDEX_PATH, embeddings, allow_dangerous_deserialization=True)
        print("✔️ FAISS index loaded.")
    except Exception:
        print("⚠️ FAISS index not found, creating a new one.")
        vectorstore = None
    new_chunks = []
    for pdf_path in tqdm(pdf_paths, desc="📄 Processing PDFs"):
        print(f"   {os.path.basename(pdf_path)}")
        if pdf_path in already_indexed_docs:
            print(f"✅ Already indexed: {pdf_path}")
            continue
        full_text = read_pdfs(pdf_path=pdf_path)
        text_chunks = smart_split_text(full_text, max_chunk_size=2000)
        for chunk_text in tqdm(text_chunks, desc="📄 Splitting text", dynamic_ncols=True, leave=False):
            doc = Document(page_content=chunk_text, metadata={"source": pdf_path})
            new_chunks.append(doc)
            print(f"✅ Indexed chunk with {len(chunk_text)} chars.")
        updated_docs.add(str(pdf_path))
    if new_chunks:
        if vectorstore:
            vectorstore.add_documents(new_chunks)
        else:
            vectorstore = FAISS.from_documents(new_chunks, embedding=embeddings)
        vectorstore.save_local(INDEX_PATH)
        save_indexed_docs(already_indexed_docs.union(updated_docs))
        print(f"💾 {len(new_chunks)} chunks saved to FAISS index.")
    else:
        print("📁 No new documents to index.")
    retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 50, "fetch_k": 100})
    template = """
Document context:
{context}

Question:
{input}

Interpretation rules:
Rule 1: SOA SUITE documents: `SOASUITE.pdf` and `SOASUITEHL7.pdf`
Rule 2: Oracle Integration (OIC) document: `using-integrations-oracle-integration-3.pdf`
Rule 3: If not a comparison between SOA SUITE and OIC, only consider documents relevant to the product.
Rule 4: If the question compares SOA SUITE and OIC, compare both.

Mention at the beginning which tool is being addressed: {input}
"""
    prompt = PromptTemplate.from_template(template)

    def get_context(x):
        query = x.get("input") if isinstance(x, dict) else x
        return retriever.invoke(query)

    chain = (
        RunnableMap({
            "context": RunnableLambda(get_context),
            "input": lambda x: x.get("input") if isinstance(x, dict) else x
        })
        | prompt
        | llm
        | StrOutputParser()
    )
    print("READY")
    while True:
        query = input()
        if query == "quit":
            break
        response = chain.invoke(query)
        print(response)


chat()

16
files/requirements.txt Normal file

@@ -0,0 +1,16 @@
langchain==0.2.0
langchain-community==0.0.30
langchain-core==0.2.0
tqdm
faiss-cpu
unstructured[pdf,ppt]==0.13.2
PyMuPDF==1.24.1
PyPDF2==3.0.1
ocrmypdf==14.1.0 # optional, if you want an OCR fallback
pypandoc # required by some .pptx loaders
pillow
python-docx
chardet
lxml
oci
oci-cli

BIN
images/.DS_Store vendored

BIN
images/img_1a.png Normal file

BIN
images/img_a.png Normal file
275
index.md Normal file

@@ -0,0 +1,275 @@
# Analyze PDF Documents in Natural Language with OCI Generative AI
## Introduction
Oracle Cloud Infrastructure Generative AI (OCI Generative AI) is an advanced generative artificial intelligence solution that enables companies and developers to create intelligent applications using cutting-edge language models. Based on powerful technologies such as Large Language Models (LLMs), this solution allows the automation of complex tasks, making processes faster, more efficient, and accessible through natural language interactions.
One of the most impactful applications of OCI Generative AI is in PDF document analysis. Companies frequently deal with large volumes of documents, such as contracts, financial reports, technical manuals, and research papers. Manually searching for information in these files can be time-consuming and prone to errors.
With the use of generative artificial intelligence, it is possible to extract information instantly and accurately, allowing users to query complex documents simply by formulating questions in natural language. This means that instead of reading entire pages to find a specific clause in a contract or a relevant data point in a report, users can just ask the model, which quickly returns the answer based on the analyzed content.
Beyond information retrieval, OCI Generative AI can also be used to summarize lengthy documents, compare content, classify information, and even generate strategic insights. These capabilities make the technology essential for various fields, such as legal, finance, healthcare, and engineering, optimizing decision-making and increasing productivity.
By integrating this technology with tools such as Oracle AI services, OCI Data Science, and APIs for document processing, companies can build intelligent solutions that completely transform the way they interact with their data, making information retrieval faster and more effective.
### Prerequisites
- Install Python `version 3.10` or higher and Oracle Cloud Infrastructure Command Line Interface (OCI CLI).
## Task 1: Install Python Packages
The Python code requires certain libraries for using OCI Generative AI. Run the following command to install the required Python packages.
```
pip install -r requirements.txt
```
## Task 2: Understand the Python Code
This is a demo of OCI Generative AI for querying functionalities of Oracle SOA Suite and Oracle Integration. Both tools are currently used for hybrid integration strategies, which means they operate in both cloud and on-premises environments.
Since these tools share functionalities and processes, this code helps in understanding how to implement the same integration approach in each tool. Additionally, it allows users to explore common characteristics and differences.
Download the Python code from here:
- [`requirements.txt`](./files/requirements.txt)
- [`oci_genai_llm_context.py`](./files/oci_genai_llm_context.py)
- [`oci_genai_llm_context_fast.py`](./files/oci_genai_llm_context_fast.py)
You can find the PDF documents here:
- [`SOASE.pdf`](https://docs.oracle.com/middleware/12211/soasuite/develop/SOASE.pdf)
- [`SOASUITEHL7.pdf`](https://docs.oracle.com/en/learn/oci-genai-pdf/files/SOASUITEHL7.pdf)
- [`using-integrations-oracle-integration-3.pdf`](https://docs.oracle.com/en/cloud/paas/application-integration/integrations-user/using-integrations-oracle-integration-3.pdf)
Create a folder named `Manuals` and move these PDFs there.
- **Import Libraries:**
Imports the necessary libraries for processing PDFs, OCI Generative AI, text vectorization, and storage in vector databases (Facebook AI Similarity Search (FAISS) and ChromaDB).
- `UnstructuredPDFLoader` is used to extract text from PDFs.
- `ChatOCIGenAI` enables the use of OCI Generative AI models to answer questions.
- `OCIGenAIEmbeddings` creates embeddings (vector representations) of text for semantic search.
![img.png](./images/img_a.png "image")
- **Load and Process PDFs:**
Lists the PDF files to be processed.
- `UnstructuredPDFLoader` reads each document and splits it into pages for easier indexing and searching.
- Document IDs are stored for future reference.
![img.png](./images/img_1.png "image")
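The page texts returned by the loader are joined into a single string before chunking. A minimal sketch of that step, using a hypothetical `Page` stand-in instead of LangChain's loaded document objects:

```python
class Page:
    """Stand-in for a loaded PDF page (LangChain pages expose .page_content)."""
    def __init__(self, page_content):
        self.page_content = page_content

def join_pages(doc_pages):
    # Concatenate the extracted text of every page into one string
    return "\n".join(page.page_content for page in doc_pages)
```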
- **Configure the OCI Generative AI Model:**
Configures the `Llama-3.1-405b` model hosted on OCI to generate responses based on the loaded documents.
Defines parameters such as `temperature` (randomness control), `top_p` (diversity control), and `max_tokens` (token limit).
![img_2.png](./images/img_2.png "image")
> **Note:** The available LLaMA version may change over time. Please check the current version in your tenancy and update your code if needed.
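The generation parameters can be kept in a plain dictionary and passed as `model_kwargs`; the values below mirror the ones used in this tutorial:

```python
# Generation parameters passed to the ChatOCIGenAI model via model_kwargs
model_kwargs = {
    "temperature": 0.7,  # randomness control: higher values give more varied output
    "top_p": 0.75,       # diversity control: nucleus-sampling probability mass
    "max_tokens": 4000,  # upper limit on the length of the generated response
}
```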
- **Create Embeddings and Vector Indexing:**
Uses Oracle's embedding model to transform text into numerical vectors, facilitating semantic searches in documents.
![img_3.png](./images/img_3.png "image")
- FAISS stores the embeddings of the PDF documents for quick queries.
- `retriever` allows retrieving the most relevant excerpts based on the semantic similarity of the user's query.
![img_5.png](./images/img_5.png "image")
- During the first processing execution, the vector data is saved in a FAISS database on disk.
![img_6.png](./images/img_6.png "image")
- **Define the Prompt:**
Creates an intelligent prompt for the generative model, guiding it to consider only relevant documents for each query.
This improves the accuracy of responses and avoids unnecessary information.
![img_4.png](./images/img_4.png "image")
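Conceptually, the prompt is a template with two slots, `{context}` and `{input}`. A simplified sketch using plain `str.format` (the actual code uses LangChain's `PromptTemplate`, and the full template also lists the per-product interpretation rules):

```python
# Simplified version of the routing prompt used in this tutorial
TEMPLATE = """Document context:
{context}

Question:
{input}

Only consider the documents relevant to the product mentioned in the question."""

def build_prompt(context, question):
    # Fill both slots of the template with the retrieved context and the query
    return TEMPLATE.format(context=context, input=question)
```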
- **Create the Processing Chain (RAG - Retrieval-Augmented Generation):**
Implements a RAG flow, where:
- `retriever` searches for the most relevant document excerpts.
- `prompt` organizes the query for better context.
- `llm` generates a response based on the retrieved documents.
- `StrOutputParser` formats the final output.
![img_7.png](./images/img_7.png "image")
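Stripped of the LangChain plumbing, the chain is plain function composition. A sketch of the four steps with stub components (not the Runnable API itself):

```python
def make_rag_chain(retriever, format_prompt, llm, parse_output):
    """Compose the four RAG steps into one callable (an illustrative sketch)."""
    def chain(question):
        context = retriever(question)              # 1. retrieve relevant excerpts
        prompt = format_prompt(context, question)  # 2. organize the query with context
        answer = llm(prompt)                       # 3. generate a response
        return parse_output(answer)                # 4. format the final output
    return chain
```

With stub functions in place of the retriever and the model, the same question-in, answer-out shape emerges as in the LangChain pipeline.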
- **Question and Answer Loop:**
Maintains a loop where users can ask questions about the loaded documents.
- The AI responds using the knowledge base extracted from the PDFs.
- If you enter `quit`, it exits the program.
![img_8.png](./images/img_8.png "image")
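The interactive loop can be factored so it is testable: feed it an iterable of questions and stop on the `quit` sentinel. A sketch mirroring the script's while-loop:

```python
def qa_loop(ask, questions):
    """Answer each question with `ask` until the sentinel 'quit' appears,
    mirroring the interactive loop in the script."""
    answers = []
    for question in questions:
        if question == "quit":
            break
        answers.append(ask(question))
    return answers
```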
## Fixed Size Chunking
**(A Faster Alternative: Fixed-Size Chunking)**
Fixed-Size Chunking is a simple and efficient text-splitting strategy where documents are divided into chunks based on predefined size limits, typically measured in tokens, characters, or lines.
This method does not analyze the meaning or structure of the text. It simply slices the content at fixed intervals, regardless of whether the cut happens in the middle of a sentence, paragraph, or idea.
**How Fixed-Size Chunking Works:**

- **Example rule:** Split the document every 1,000 tokens (or every 3,000 characters).
- **Optional overlap:** To reduce the risk of splitting relevant context, some implementations add an overlap between consecutive chunks (for example, a 200-token overlap) to ensure that important context isn't lost at the boundary.
**Benefits of Fixed-Size Chunking:**
- **Fast processing:**
No need for semantic analysis, LLM inference, or content understanding. Just count and cut.
- **Low resource consumption:**
Minimal CPU/GPU and memory usage, making it scalable for large datasets.
- **Easy to implement:**
Works with simple scripts or standard text processing libraries.
**Limitations of Fixed-Size Chunking:**
- **Poor semantic awareness:**
Chunks may cut off sentences, paragraphs, or logical sections, leading to incomplete or fragmented ideas.
- **Reduced retrieval precision:**
In applications like semantic search or Retrieval-Augmented Generation (RAG), poor chunk boundaries can affect the relevance and quality of retrieved answers.
**When to Use Fixed-Size Chunking:**
- When processing speed and scalability are top priorities.
- For large-scale document ingestion pipelines where semantic precision is not critical.
- As a first step in scenarios where later refinement or semantic analysis will happen downstream.
- This is a very simple method to split text:
![img_10.png](images/img_10.png)
- And this is the main process of fixed chunking:
![img_11.png](images/img_11.png)
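Fixed-size chunking with an optional overlap fits in a few lines. A minimal sketch (character-based for simplicity; the same idea applies to tokens):

```python
def fixed_size_chunks(text, chunk_size=1000, overlap=200):
    """Slice text at fixed intervals; each chunk repeats the last `overlap`
    characters of the previous one so boundary context is not lost."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the text
        start += chunk_size - overlap
    return chunks
```

Note that no semantic analysis happens anywhere: the function just counts and cuts, which is exactly why this strategy is fast.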
>**Note:** Download the following code to run the **fixed chunking** process much **faster**: [`oci_genai_llm_context_fast.py`](./files/oci_genai_llm_context_fast.py).
## Semantic Chunking
**What is Semantic Chunking?**
Semantic Chunking is a text pre-processing technique where large documents (such as PDFs, presentations, or articles) are split into smaller parts called “chunks”, with each chunk representing a semantically coherent block of text.
Unlike traditional fixed-size chunking (e.g., splitting every 1000 tokens or every X characters), Semantic Chunking uses Artificial Intelligence (typically Large Language Models - LLMs) to detect natural content boundaries, respecting topics, sections, and context.
Instead of cutting text arbitrarily, Semantic Chunking tries to preserve the full meaning of each section, creating standalone, context-aware pieces.
**Why Can Semantic Chunking Make Processing Slower?**
A traditional chunking process, based on fixed size, is fast: the system just counts tokens or characters and cuts accordingly.
With Semantic Chunking, several extra steps of semantic analysis are required:
1. Reading and interpreting the full text (or large blocks) before splitting:
The LLM needs to “understand” the content to identify the best chunk boundaries.
2. Running LLM prompts or topic classification models:
The system often queries the LLM with questions like:
“Is this the end of an idea?” or “Does this paragraph start a new section?”
3. Higher memory and CPU/GPU usage:
Because the model processes larger text blocks before making chunking decisions, resource consumption is significantly higher.
4. Sequential and incremental decision-making:
Semantic chunking often works in steps (e.g., analyzing 10,000-token blocks and then refining chunk boundaries inside that block), which increases total processing time.
>**Note 1:** Depending on your machine's processing power, the first execution using **Semantic Chunking** can take a very long time.
>**Note 2:** You can use this algorithm to produce customized chunking using **OCI Gen AI**.
- This is the main document process. It uses:
    - **smart_split_text()**: splits the full text into small pieces of about 10 KB (you can configure other strategies). The mechanism is aware of the last paragraph: if part of a paragraph spills into the next text piece, that part is skipped in the current pass and appended to the next processing group.
    - **semantic_chunk()**: uses the OCI LLM mechanism to separate the paragraphs. It includes the intelligence to identify titles, table components, and paragraph boundaries to execute a smart chunk. The strategy here is the **Semantic Chunking** technique, so it takes more time to complete than the common fixed-size processing. The first run takes a long time, but subsequent runs load the pre-saved FAISS data.
    - **split_llm_output_into_chapters()**: finalizes the chunking, separating the chapters.
![img.png](images/img_9.png)
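The paragraph carry-over described for `smart_split_text()` can be sketched like this (a simplified illustration; the real script's piece size and boundary rules may differ):

```python
def smart_split_text(text: str, max_bytes: int = 10_000) -> list[str]:
    """Split text into roughly max_bytes pieces without cutting a paragraph:
    a paragraph that would overflow a piece is carried over to the next one."""
    pieces, current, size = [], [], 0
    for para in text.split("\n\n"):
        para_len = len(para.encode("utf-8"))
        if current and size + para_len > max_bytes:
            # Close the current piece; this paragraph starts the next one.
            pieces.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += para_len + 2  # account for the paragraph separator
    if current:
        pieces.append("\n\n".join(current))
    return pieces
```

Keeping paragraphs whole means each ~10 KB block handed to the LLM is self-contained, which is what lets the semantic pass make clean boundary decisions inside it.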
## Task 3: Run Query for Oracle Integration and Oracle SOA Suite Contents
Run the following command.
```
# For the fixed chunking technique (faster method)
python oci_genai_llm_context_fast.py --device="mps" --gpu_name="M2Max GPU 32 Cores"
```
```
# For the semantic chunking technique
python oci_genai_llm_context.py --device="mps" --gpu_name="M2Max GPU 32 Cores"
```
> **Note:** The `--device` and `--gpu_name` parameters can be used to accelerate the processing in Python, using GPU if your machine has one. Consider that this code can be used with local models too.
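The `--device` and `--gpu_name` flags might be wired up along these lines (a hypothetical `argparse` sketch; the actual scripts may define them differently):

```python
import argparse

parser = argparse.ArgumentParser(description="OCI GenAI PDF chat")
parser.add_argument("--device", default="cpu",
                    help="compute device: cpu, cuda, or mps (Apple Silicon)")
parser.add_argument("--gpu_name", default="",
                    help="free-text label for the GPU, shown in logs")

# Parsing the flags from the example command above:
args = parser.parse_args(["--device=mps", "--gpu_name=M2Max GPU 32 Cores"])
```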
The provided context distinguishes between Oracle SOA Suite and Oracle Integration, so you can test the code considering these points:
- The query should be made only for Oracle SOA Suite: Therefore, only Oracle SOA Suite documents should be considered.
- The query should be made only for Oracle Integration: Therefore, only Oracle Integration documents should be considered.
- The query requires a comparison between Oracle SOA Suite and Oracle Integration: Therefore, all documents should be considered.
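The document-scoping rule above can be sketched as a simple keyword router (an illustration only; the sample code implements this behavior through the prompt instead of explicit routing):

```python
def route_query(query: str) -> list[str]:
    """Decide which document sets a query should search."""
    q = query.lower()
    mentions_soa = "soa suite" in q
    mentions_oic = "oracle integration" in q or "oic" in q
    if mentions_soa and mentions_oic:
        return ["soa_suite", "oracle_integration"]  # comparison: use all docs
    if mentions_soa:
        return ["soa_suite"]
    if mentions_oic:
        return ["oracle_integration"]
    return ["soa_suite", "oracle_integration"]  # ambiguous: search everything
```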
We can define the following context, which greatly helps in interpreting the documents correctly.
![img_7.png](images/img_7.png)
The following image shows an example of a comparison between Oracle SOA Suite and Oracle Integration.
![img.png](./images/img.png "image")
The following image shows an example for Kafka.
![img_1.png](./images/img_1.png "image")
## Next Steps
This code demonstrates an application of OCI Generative AI for intelligent PDF analysis. It enables users to efficiently query large volumes of documents using semantic searches and a generative AI model to generate accurate natural language responses.
This approach can be applied in various fields, such as legal, compliance, technical support, and academic research, making information retrieval much faster and smarter.
## Related Links
- [Extending SaaS by AI/ML features - Part 8: OCI Generative AI Integration with LangChain Use Cases](https://www.ateam-oracle.com/post/oci-generative-ai-integration-with-langchain-usecases)
- [Bridging cloud and conversational AI: LangChain and OCI Data Science platform](https://blogs.oracle.com/ai-and-datascience/post/cloud-conversational-ai-langchain-oci-data-science)
- [Install OCI CLI](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm#Quickstart)
- [Introduction to Custom and Built-in Python LangChain Agents](https://wellsr.com/python/working-with-python-langchain-agents/)
## Acknowledgments
- **Author** - Cristiano Hoshikawa (Oracle LAD A-Team Solution Engineer)

For reference, a minimal version of the chat script that loads all PDF pages directly into a FAISS vector store, without advanced chunking:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.chat_models.oci_generative_ai import ChatOCIGenAI
from langchain_core.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain_community.embeddings import OCIGenAIEmbeddings
from langchain_community.vectorstores import FAISS


def chat():
    caminhos_pdf = [
        './Manuals/using-integrations-oracle-integration-3.pdf',
        './Manuals/SOASUITE.pdf',
        './Manuals/SOASUITEHL7.pdf',
    ]

    # Load every PDF and split it into pages
    pages = []
    for caminho_pdf in caminhos_pdf:
        doc_pages = PyPDFLoader(caminho_pdf).load_and_split()
        pages.extend(doc_pages)

    llm = ChatOCIGenAI(
        model_id="meta.llama-3.1-405b-instruct",
        service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
        compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        auth_profile="DEFAULT",  # replace with your profile name
        model_kwargs={"temperature": 0.7, "top_p": 0.75, "max_tokens": 1000},
    )

    embeddings = OCIGenAIEmbeddings(
        model_id="cohere.embed-multilingual-v3.0",
        service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
        compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
        auth_profile="DEFAULT",  # replace with your profile name
    )

    vectorstore = FAISS.from_documents(pages, embedding=embeddings)
    retriever = vectorstore.as_retriever()
    # You can persist your vector store here

    template = """
    If the query in question is not a comparison between SOA SUITE and OIC, consider only the documents relevant to the subject,
    that is, if the question is about SOA SUITE, consider only the SOA SUITE documents. If the question is about OIC,
    consider only the OIC document. If the question is a comparison between SOA SUITE and OIC, consider all documents.
    Inform at the beginning which tool is being discussed: {input}
    """
    prompt = PromptTemplate.from_template(template)

    chain = (
        {"context": retriever, "input": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

    # Interactive loop: type a question, or "quit" to exit
    while True:
        query = input()
        if query == "quit":
            break
        print(chain.invoke(query))


chat()
```

The project's `requirements.txt`:

```
faiss-cpu
oci-cli
langchain
langchain_community
langchain_cohere
langchain-core
langchain-text-splitters
```