diff --git a/README.md b/README.md
index 75db6ae..64b31e7 100644
--- a/README.md
+++ b/README.md
@@ -64,15 +64,23 @@ Each question is parsed into a structured requirement:
 
 Facts are extracted **only when explicitly stated** in documentation and stored as graph triples:
 
 ```
-REQUIREMENT -[HAS_METRIC]-> messages per hour
-REQUIREMENT -[HAS_VALUE]-> < 1 hour
-REQUIREMENT -[SUPPORTED_BY]-> Document section
+    SERVICE -[SUPPORTS_CAPABILITY]-> CAPABILITY
+    SERVICE -[DOES_NOT_SUPPORT]-> CAPABILITY
+    SERVICE -[HAS_LIMITATION]-> LIMITATION
+    SERVICE -[HAS_SLA]-> SLA_VALUE
 ```
 
-There are three types of information:
-- What metric: HAS_METRIC
-- Value of metric: HAS_VALUE
-- Font of information: SUPPORTED_BY
+There are four types of structured relationships extracted explicitly from documentation:
+* Capability support: SERVICE -[SUPPORTS_CAPABILITY]-> CAPABILITY
+* Capability exclusion: SERVICE -[DOES_NOT_SUPPORT]-> CAPABILITY
+* Technical limitation: SERVICE -[HAS_LIMITATION]-> LIMITATION
+* Service level definition: SERVICE -[HAS_SLA]-> SLA_VALUE
+
+Each relationship is:
+* Extracted strictly from explicit documentary evidence
+* Linked to a specific document chunk (CHUNK_HASH)
+* Associated with structured JSON node properties
+* Backed by an evidence table for full auditability
 
 This ensures:
 - No hallucination
@@ -178,7 +186,7 @@ POST /chat
 
 This code implements a **GraphRAG-based pipeline focused on RFP (Request for Proposal) validation**, not generic Q&A.
 
->**Download** the code [graphrag_rerank.py](./files/graphrag_rerank.py)
+>**Download** the [Source Code](./files/source_code.zip)
 
 The main goal is to:
 - Extract **explicit, verifiable facts** from large PDF contracts and datasheets
@@ -212,7 +220,7 @@ This represents a **strategic shift** from concept-based LLM answers to **compli
 - `REQUIREMENT -[HAS_VALUE]-> 1 hour`
 - Stored in Oracle Property Graph tables
 
-![img_1.png](img_1.png)
+![img_3.png](img_3.png)
 
 5.
**RFP Requirement Parsing**
   - Each user question is converted into a structured requirement:
@@ -294,8 +302,7 @@ First of all, you need to run the code to prepare the Vector and Graph database.
 
 ![img.png](img.png)
 
-![img_1.png](img_1.png)
-
+![img_3.png](img_3.png)
 
 After the execution, the code will chat with you to test. You can give some questions like:
 
@@ -368,13 +375,13 @@ root
 
 Open http://localhost:8100 in your browser.
 
-![img_2.png](img_2.png)
+![img_4.png](img_4.png)
 
 There is also a REST service implemented in the code, so you can automate an RFP list, calling it item by item to obtain the responses you want: YES/NO
 
-    curl -X POST http://localhost:8100/chat \
-    -H "Content-Type: application/json" \
-    -d '{"question": "What is the RTO of Oracle Application?"}'
+    curl -X POST http://demo-orcl-api-ai.hoshikawa.com.br:8100/rest/chat \
+    -H "Content-Type: application/json" -u app_user:app_password \
+    -d '{ "question": "Does Oracle Cloud Infrastructure (OCI) Compute support online resizing of memory for running virtual machine instances?" }'
 
 ---
diff --git a/README_COMPLETE_TUTORIAL.md b/README_COMPLETE_TUTORIAL.md
new file mode 100644
index 0000000..06f147d
--- /dev/null
+++ b/README_COMPLETE_TUTORIAL.md
@@ -0,0 +1,295 @@
+# 🧠 Oracle GraphRAG RFP AI -- Complete Tutorial
+
+Enterprise-grade deterministic RFP validation engine built with:
+
+- Oracle Autonomous Database 23ai
+- Oracle Property Graph
+- OCI Generative AI (LLMs + Embeddings)
+- FAISS Vector Search
+- Flask REST API
+- Hybrid Graph + Vector + JSON reasoning
+
+------------------------------------------------------------------------
+
+# πŸ“Œ Introduction
+
+This project implements a **deterministic RFP validation engine**.
+ +Unlike traditional RAG systems that generate conceptual answers, this +solution is designed to: + +- Validate contractual and compliance requirements +- Produce only: YES / NO / PARTIAL +- Provide exact documentary evidence +- Eliminate hallucination risk +- Ensure full traceability + +This tutorial walks through the full architecture and implementation. + +------------------------------------------------------------------------ + +# πŸ—οΈ Full Architecture + + PDF Documents + └─► Semantic Chunking + β”œβ”€β–Ί FAISS Vector Index + β”œβ”€β–Ί LLM Triple Extraction + β”‚ └─► Oracle 23ai Property Graph + β”‚ β”œβ”€β–Ί Structured JSON Node Properties + β”‚ β”œβ”€β–Ί Edge Confidence Weights + β”‚ └─► Evidence Table + └─► Hybrid Retrieval Layer + β”œβ”€β–Ί Vector Recall + β”œβ”€β–Ί Graph Filtering + β”œβ”€β–Ί Oracle Text + └─► Graph-aware Reranking + └─► Deterministic LLM Decision + └─► REST Response + +------------------------------------------------------------------------ + +# 🧩 Step 1 -- Environment Setup + +You need: + +- Oracle Autonomous Database 23ai +- OCI Generative AI enabled +- Python 3.10+ +- FAISS installed +- Oracle Python driver (`oracledb`) + +Install dependencies: + + pip install oracledb langchain faiss-cpu flask pypandoc + +------------------------------------------------------------------------ + +# πŸ“„ Step 2 -- PDF Ingestion + +- Load PDFs +- Perform semantic chunking +- Normalize headings and tables +- Store chunk metadata including: + - chunk_hash + - source_url + +Chunks feed both: + +- FAISS +- Graph extraction + +------------------------------------------------------------------------ + +# 🧠 Step 3 -- Triple Extraction (Graph Creation) + +The function: + + create_knowledge_graph(chunks) + +Uses LLM to extract ONLY explicit relationships: + + SERVICE -[SUPPORTS_CAPABILITY]-> CAPABILITY + SERVICE -[DOES_NOT_SUPPORT]-> CAPABILITY + SERVICE -[HAS_LIMITATION]-> LIMITATION + SERVICE -[HAS_SLA]-> SLA_VALUE + +No inference allowed. 
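The extraction step above can be sketched as a small post-processing filter. This is a minimal illustration, not the project's actual implementation: it assumes the LLM returns one `SOURCE -[RELATION]-> TARGET` triple per line (or `NONE`), and it discards anything whose relation is not one of the four allowed types, which is how "no inference allowed" is enforced mechanically.

```python
import re

# The four relation types the extractor is allowed to keep (from Step 3).
ALLOWED_RELATIONS = {
    "SUPPORTS_CAPABILITY",
    "DOES_NOT_SUPPORT",
    "HAS_LIMITATION",
    "HAS_SLA",
}

# One triple per line, e.g. "OCI Compute -[SUPPORTS_CAPABILITY]-> live migration"
TRIPLE_RE = re.compile(r"^(.+?)\s*-\[(\w+)\]->\s*(.+)$")


def parse_triples(llm_output: str) -> list[tuple[str, str, str]]:
    """Keep only well-formed triples whose relation is explicitly allowed."""
    triples = []
    for line in llm_output.strip().splitlines():
        m = TRIPLE_RE.match(line.strip())
        if not m:
            continue  # not a triple (e.g. "NONE" or free text)
        source, relation, target = (g.strip() for g in m.groups())
        if relation.upper() in ALLOWED_RELATIONS:
            triples.append((source, relation.upper(), target))
    return triples
```

Anything malformed or outside the whitelist is silently dropped rather than guessed at, matching the extraction rules.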
+ +------------------------------------------------------------------------ + +# πŸ›οΈ Step 4 -- Oracle Property Graph Setup + +Graph is created automatically: + + CREATE PROPERTY GRAPH GRAPH_NAME + VERTEX TABLES (...) + EDGE TABLES (...) + +Nodes are stored in: + + KG_NODES_GRAPH_NAME + +Edges in: + + KG_EDGES_GRAPH_NAME + +Evidence in: + + KG_EVIDENCE_GRAPH_NAME + +------------------------------------------------------------------------ + +# 🧩 Step 5 -- Structured Node Properties (Important) + +Each node includes structured JSON properties. + +Default structure: + +``` json +{ + "metadata": { + "created_by": "RFP_AI_V2", + "version": "2.0", + "created_at": "UTC_TIMESTAMP" + }, + "analysis": { + "confidence_score": null, + "source": "DOCUMENT_RAG", + "extraction_method": "LLM_TRIPLE_EXTRACTION" + }, + "governance": { + "validated": false, + "review_required": false + } +} +``` + +Implementation: + +``` python +def build_default_node_properties(): + return { + "metadata": { + "created_by": "RFP_AI_V2", + "version": "2.0", + "created_at": datetime.utcnow().isoformat() + }, + "analysis": { + "confidence_score": None, + "source": "DOCUMENT_RAG", + "extraction_method": "LLM_TRIPLE_EXTRACTION" + }, + "governance": { + "validated": False, + "review_required": False + } + } +``` + +This guarantees: + +- No empty `{}` stored +- Auditability +- Governance extension capability +- Enterprise extensibility + +------------------------------------------------------------------------ + +# πŸ”Ž Step 6 -- Hybrid Retrieval Strategy + +The system combines: + +1. FAISS semantic recall +2. Graph filtering via Oracle Text +3. Graph-aware reranking +4. 
Deterministic LLM evaluation + +This ensures: + +- High recall +- High precision +- No hallucination + +------------------------------------------------------------------------ + +# 🎯 Step 7 -- RFP Requirement Parsing + +Each question becomes structured: + +``` json +{ + "requirement_type": "NON_FUNCTIONAL", + "subject": "authentication", + "expected_value": "MFA", + "keywords": ["authentication", "mfa"] +} +``` + +This structure guides retrieval and evaluation. + +------------------------------------------------------------------------ + +# πŸ“Š Step 8 -- Deterministic Decision Engine + +LLM output format: + +``` json +{ + "answer": "YES | NO | PARTIAL", + "confidence": "HIGH | MEDIUM | LOW", + "justification": "Short factual explanation", + "evidence": [ + { + "quote": "Exact document text", + "source": "Document reference" + } + ] +} +``` + +Rules: + +- If not explicitly stated β†’ NO +- No inference +- Must provide documentary evidence + +------------------------------------------------------------------------ + +# 🌐 Step 9 -- Running the Application + +Run preprocessing once: + + python graphrag_rerank.py + +Run web UI: + + python app.py + +Open: + + http://localhost:8100 + +Or use REST: + + curl -X POST http://localhost:8100/chat -H "Content-Type: application/json" -d '{"question": "Does the platform support MFA?"}' + +------------------------------------------------------------------------ + +# πŸ§ͺ Example RFP Questions + +Security, SLA, Performance, Compliance, Vendor Lock-in, Backup, +Governance. + +The engine validates each with deterministic logic. 
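The deterministic rules above can be enforced with a small guard around the LLM's JSON output. This is a sketch under the Step 8 schema, not the engine's actual code; the function name and fallback shape are illustrative assumptions. Any malformed response, unknown verdict, or affirmative verdict without documentary evidence collapses to NO.

```python
import json

VALID_ANSWERS = {"YES", "NO", "PARTIAL"}


def enforce_decision_rules(raw: str) -> dict:
    """Post-process an LLM decision so the engine stays deterministic."""
    try:
        decision = json.loads(raw.replace("```json", "").replace("```", ""))
    except json.JSONDecodeError:
        # Malformed output never becomes an approval.
        return {"answer": "NO", "confidence": "LOW",
                "justification": "Invalid JSON from LLM", "evidence": []}

    answer = str(decision.get("answer", "")).upper()
    evidence = decision.get("evidence") or []

    # Rule: if the capability is not explicitly evidenced, the answer is NO.
    if answer not in VALID_ANSWERS or (answer != "NO" and not evidence):
        decision["answer"] = "NO"
        decision["confidence"] = "LOW"
    decision.setdefault("evidence", [])
    return decision
```

The design choice is that the guard only demotes answers, never promotes them, so the LLM cannot introduce an approval that lacks evidence.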
+ +------------------------------------------------------------------------ + +# πŸ” Design Principles + +- Evidence-first +- Deterministic outputs +- Zero hallucination tolerance +- Enterprise auditability +- Structured graph reasoning + +------------------------------------------------------------------------ + +# πŸš€ Future Extensions + +- Confidence scoring via graph density +- Weighted edge scoring +- SLA numeric comparison engine +- JSON-based filtering +- PGQL advanced reasoning +- Enterprise governance workflows + +------------------------------------------------------------------------ + +# πŸ“Œ Conclusion + +Oracle GraphRAG RFP AI is not a chatbot. + +It is a compliance validation engine built for enterprise RFP +automation, legal due diligence, and procurement decision support. + +Deterministic. Traceable. Expandable. diff --git a/files/app.py b/files/app.py index 077e50d..9831ede 100644 --- a/files/app.py +++ b/files/app.py @@ -1,67 +1,83 @@ -from flask import Flask, render_template, request, jsonify -import traceback -import json +from flask import Flask -# πŸ”₯ IMPORTA SEU PIPELINE -from graphrag_rerank import answer_question +from modules.users import users_bp +from modules.home.routes import home_bp +from modules.chat.routes import chat_bp +from modules.excel.routes import excel_bp +from modules.health.routes import health_bp +from modules.architecture.routes import architecture_bp +from modules.admin.routes import admin_bp +from modules.auth.routes import auth_bp +from modules.rest.routes import rest_bp -app = Flask(__name__) - -def parse_llm_json(raw: str) -> dict: - try: - raw = raw.replace("```json", "") - raw = raw.replace("```", "") - return json.loads(raw) - except Exception: - return { - "answer": "ERROR", - "justification": "LLM returned invalid JSON", - "raw_output": raw - } - -# ========================= -# Health check (Load Balancer) -# ========================= -@app.route("/health", methods=["GET"]) -def health(): - return 
jsonify({"status": "UP"}), 200
+from config_loader import load_config
+from modules.excel.queue_manager import start_excel_worker
+from modules.users.service import create_user
+from modules.users.db import get_pool
+import bcrypt
+import oracledb
+from werkzeug.security import generate_password_hash
-# =========================
-# PΓ‘gina Web
-# =========================
-@app.route("/", methods=["GET"])
-def index():
-    return render_template("index.html")
+def ensure_default_admin():
+    """
+    Creates the default admin directly in Oracle (no SQLAlchemy)
+    """
+
+    pool = get_pool()
+
+    sql_check = "SELECT id FROM app_users WHERE user_role='admin'"
+    sql_insert = """
+        INSERT INTO app_users (name,email,user_role,password_hash,active)
+        VALUES (:1,:2,'admin',:3,1) \
+    """
+
+    with pool.acquire() as conn:
+        with conn.cursor() as cur:
+            cur.execute(sql_check)
+            if not cur.fetchone():
+                pwd = generate_password_hash("admin123")
+                cur.execute(sql_insert, ["Admin", "admin@local", pwd])
+                conn.commit()
+                print("Default admin created: admin@local / admin123")
-# =========================
-# Endpoint de Chat
-# =========================
-@app.route("/chat", methods=["POST"])
-def chat():
-    try:
-        data = request.get_json()
-        question = data.get("question", "").strip()
+def create_app():
-        if not question:
-            return jsonify({"error": "Empty question"}), 400
+    app = Flask(__name__)
+    app.secret_key = "super-secret"
-        raw_answer = answer_question(question)
-        parsed_answer = parse_llm_json(raw_answer)
+    # SQLite is no longer used
+    # SQLAlchemy is no longer used
-        return jsonify({
-            "question": question,
-            "result": parsed_answer
-        })
+    start_excel_worker()
-    except Exception as e:
-        traceback.print_exc()
-        return jsonify({"error": str(e)}), 500
+    # create the default admin in Oracle
+    ensure_default_admin()
+
+    app.register_blueprint(users_bp, url_prefix="/admin/users")
+    app.register_blueprint(chat_bp)
+    app.register_blueprint(excel_bp)
+    app.register_blueprint(health_bp)
+
app.register_blueprint(architecture_bp)
+    app.register_blueprint(home_bp)
+    app.register_blueprint(admin_bp, url_prefix="/admin")
+    app.register_blueprint(auth_bp)
+    app.register_blueprint(rest_bp)
+
+    from modules.core.security import get_current_user
+
+    @app.context_processor
+    def inject_user():
+        return dict(current_user=get_current_user())
+
+    return app
+
+
+app = create_app()
+
+config = load_config()
+API_BASE_URL = f"{config.app_base}:{config.service_port}"
 
 if __name__ == "__main__":
-    app.run(
-        host="0.0.0.0",
-        port=8100,
-        debug=False
-    )
\ No newline at end of file
+    app.run(host="0.0.0.0", port=config.service_port)
\ No newline at end of file
diff --git a/files/config.json b/files/config.json
new file mode 100644
index 0000000..0ead701
--- /dev/null
+++ b/files/config.json
@@ -0,0 +1,26 @@
+{
+  "wallet_path": "Wallet_oradb23aiDev",
+  "db_alias": "oradb23aiDev_high",
+  "username": "admin",
+  "password": "**********",
+
+  "service_endpoint": "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
+  "compartment_id": "ocid1.compartment.oc1..aaaaaaaaexpiw4a7dio64mkfv2t273s2hgdl6mgfvvyv7tycalnjlvpvfl3q",
+  "auth_profile": "LATINOAMERICA",
+
+  "llm_model": "meta.llama-3.1-405b-instruct",
+  "embedding_model": "cohere.embed-multilingual-v3.0",
+
+  "index_path": "./faiss_index",
+  "docs_path": "./docs",
+
+  "graph_name": "OCI_5",
+  "service_port": 8102,
+  "app_base": "http://127.0.0.1",
+  "dev_mode": 0,
+  "service_server": "10.0.1.136",
+
+  "bucket_profile": "LATINOAMERICA-SaoPaulo",
+  "oci_bucket": "genai_hoshikawa_bucket",
+  "oci_namespace": "idi1o0a010nx"
+}
diff --git a/files/faiss_to_oracle_vector.py b/files/faiss_to_oracle_vector.py
new file mode 100644
index 0000000..86053cf
--- /dev/null
+++ b/files/faiss_to_oracle_vector.py
@@ -0,0 +1,205 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+"""
+FAISS β†’ Oracle 23ai Vector migration (FULL GOVERNANCE VERSION)
+
+Migrates:
+- content
+- source
+- chunk_hash
+- origin
+- created_at
+-
status
+- embedding
+
+"""
+
+import os
+import json
+import argparse
+import hashlib
+import datetime
+import oracledb
+
+from langchain.vectorstores import FAISS
+from langchain.embeddings import HuggingFaceEmbeddings
+from tqdm import tqdm
+
+
+# =====================================================
+# CONFIG
+# =====================================================
+
+VECTOR_DIM = 1024
+TABLE_NAME = "RAG_DOCS"
+BATCH_SIZE = 500
+
+
+# =====================================================
+# CLI
+# =====================================================
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--faiss", required=True)
+parser.add_argument("--dsn", required=True)
+parser.add_argument("--user", required=True)
+parser.add_argument("--password", required=True)
+args = parser.parse_args()
+
+
+# =====================================================
+# HELPERS
+# =====================================================
+
+def chunk_hash(text: str) -> str:
+    return hashlib.sha256(text.encode("utf-8")).hexdigest()
+
+
+# =====================================================
+# 1) LOAD FAISS
+# =====================================================
+
+print("πŸ”„ Loading FAISS index...")
+
+dummy_embeddings = HuggingFaceEmbeddings(
+    model_name="sentence-transformers/all-MiniLM-L6-v2"
+)
+
+vs = FAISS.load_local(
+    args.faiss,
+    dummy_embeddings,
+    allow_dangerous_deserialization=True
+)
+
+docs = vs.docstore._dict
+index = vs.index
+vectors = index.reconstruct_n(0, index.ntotal)
+
+print(f"βœ… Loaded {len(docs)} vectors")
+
+# =========================
+# Oracle Autonomous Configuration
+# =========================
+WALLET_PATH = "Wallet_oradb23aiDev"
+DB_ALIAS = "oradb23aiDev_high"
+USERNAME = "admin"
+PASSWORD = "**********"
+os.environ["TNS_ADMIN"] = WALLET_PATH
+
+# =====================================================
+# 2) CONNECT ORACLE
+# =====================================================
+
+print("πŸ”Œ Connecting to Oracle...")
+
+conn =
oracledb.connect(
+    user=USERNAME,
+    password=PASSWORD,
+    dsn=DB_ALIAS,
+    config_dir=WALLET_PATH,
+    wallet_location=WALLET_PATH,
+    wallet_password=PASSWORD
+)
+
+cur = conn.cursor()
+
+
+# =====================================================
+# 3) CREATE TABLE (FULL SCHEMA)
+# =====================================================
+
+print("πŸ“¦ Creating table if not exists...")
+
+cur.execute(f"""
+BEGIN
+    EXECUTE IMMEDIATE '
+        CREATE TABLE {TABLE_NAME} (
+            ID NUMBER GENERATED BY DEFAULT AS IDENTITY,
+            CONTENT CLOB,
+            SOURCE VARCHAR2(1000),
+            CHUNK_HASH VARCHAR2(64),
+            STATUS VARCHAR2(20),
+            ORIGIN VARCHAR2(50),
+            CREATED_AT TIMESTAMP,
+            EMBED VECTOR({VECTOR_DIM})
+        )';
+EXCEPTION
+    WHEN OTHERS THEN
+        IF SQLCODE != -955 THEN RAISE; END IF;
+END;
+""")
+
+conn.commit()
+
+
+# =====================================================
+# 4) INSERT BATCH
+# =====================================================
+
+print("⬆️ Migrating vectors...")
+
+sql = f"""
+INSERT INTO {TABLE_NAME}
+(CONTENT, SOURCE, CHUNK_HASH, STATUS, ORIGIN, CREATED_AT, EMBED)
+VALUES (:1, :2, :3, :4, :5, :6, :7)
+"""
+
+batch = []
+
+for i, (doc_id, doc) in enumerate(tqdm(docs.items())):
+
+    content = doc.page_content
+    source = doc.metadata.get("source", "")
+    origin = doc.metadata.get("origin", "FAISS")
+    created = doc.metadata.get(
+        "created_at",
+        datetime.datetime.utcnow()
+    )
+
+    h = doc.metadata.get("chunk_hash") or chunk_hash(content)
+
+    batch.append((
+        content,
+        source,
+        h,
+        "ACTIVE",
+        origin,
+        created,
+        json.dumps(vectors[i].tolist())
+    ))
+
+    if len(batch) >= BATCH_SIZE:
+        cur.executemany(sql, batch)
+        batch.clear()
+
+if batch:
+    cur.executemany(sql, batch)
+
+conn.commit()
+
+print("βœ… Insert finished")
+
+
+# =====================================================
+# 5) CREATE VECTOR INDEX
+# =====================================================
+
+print("⚑ Creating HNSW index...")
+
+cur.execute(f"""
+BEGIN
+    EXECUTE IMMEDIATE '
+        CREATE VECTOR INDEX {TABLE_NAME}_IDX
+        ON
{TABLE_NAME}(EMBED)
+        ORGANIZATION HNSW
+        DISTANCE COSINE
+    ';
+EXCEPTION
+    WHEN OTHERS THEN
+        IF SQLCODE != -955 THEN RAISE; END IF;
+END;
+""")
+
+conn.commit()
+
+print("πŸŽ‰ Migration complete!")
\ No newline at end of file
diff --git a/files/graphrag_rerank.py b/files/graphrag_rerank.py
deleted file mode 100644
index e7be703..0000000
--- a/files/graphrag_rerank.py
+++ /dev/null
@@ -1,1104 +0,0 @@
-from langchain_community.chat_models.oci_generative_ai import ChatOCIGenAI
-from langchain_core.prompts import PromptTemplate
-from langchain.schema.output_parser import StrOutputParser
-from langchain_community.embeddings import OCIGenAIEmbeddings
-from langchain_community.vectorstores import FAISS
-from langchain.schema.runnable import RunnableMap
-from langchain_community.document_loaders import UnstructuredPDFLoader, PyMuPDFLoader
-from langchain_core.documents import Document
-from langchain_core.runnables import RunnableLambda
-from pathlib import Path
-from tqdm import tqdm
-import os
-import pickle
-import re
-import atexit
-import oracledb
-import json
-import base64
-
-# =========================
-# Oracle Autonomous Configuration
-# =========================
-WALLET_PATH = "Wallet_oradb23ai"
-DB_ALIAS = "oradb23ai_high"
-USERNAME = "admin"
-PASSWORD = "**********"
-os.environ["TNS_ADMIN"] = WALLET_PATH
-
-# =========================
-# Global Configurations
-# =========================
-INDEX_PATH = "./faiss_index"
-PROCESSED_DOCS_FILE = os.path.join(INDEX_PATH, "processed_docs.pkl")
-chapter_separator_regex = r"^(#{1,6} .+|\*\*.+\*\*)$"
-GRAPH_NAME = "OCI_GRAPH"
-
-# =========================
-# LLM Definitions
-# =========================
-llm = ChatOCIGenAI(
-    model_id="meta.llama-3.1-405b-instruct",
-    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
-    compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
-    auth_profile="DEFAULT",
-    model_kwargs={"temperature": 0, "top_p": 1,
"max_tokens": 4000}, -) - -llm_for_rag = ChatOCIGenAI( - model_id="meta.llama-3.1-405b-instruct", - service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com", - compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", - auth_profile="DEFAULT", -) - -embeddings = OCIGenAIEmbeddings( - model_id="cohere.embed-multilingual-v3.0", - service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com", - compartment_id="ocid1.compartment.oc1..aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", - auth_profile="DEFAULT", -) - -oracle_conn = oracledb.connect( - user=USERNAME, - password=PASSWORD, - dsn=DB_ALIAS, - config_dir=WALLET_PATH, - wallet_location=WALLET_PATH, - wallet_password=PASSWORD -) -atexit.register(lambda: oracle_conn.close()) - -def filename_to_url(filename: str, suffix: str = ".pdf") -> str: - if filename.endswith(suffix): - filename = filename[: -len(suffix)] - decoded = base64.urlsafe_b64decode(filename.encode("ascii")) - return decoded.decode("utf-8") - -# ========================= -# Oracle Graph Client -# ========================= -def ensure_oracle_text_index( - conn, - table_name: str, - column_name: str, - index_name: str -): - cursor = conn.cursor() - - cursor.execute(""" - SELECT status - FROM user_indexes - WHERE index_name = :idx - """, {"idx": index_name.upper()}) - - row = cursor.fetchone() - index_exists = row is not None - index_status = row[0] if row else None - - if not index_exists: - print(f"πŸ› οΈ Creating Oracle Text index {index_name}") - - cursor.execute(f""" - CREATE INDEX {index_name} - ON {table_name} ({column_name}) - INDEXTYPE IS CTXSYS.CONTEXT - """) - - conn.commit() - cursor.close() - print(f"βœ… Index {index_name} created (sync deferred)") - return - - if index_status != "VALID": - print(f"⚠️ Index {index_name} is {index_status}. 
Recreating...") - - try: - cursor.execute(f"DROP INDEX {index_name}") - conn.commit() - except Exception as e: - print(f"❌ Failed to drop index {index_name}: {e}") - cursor.close() - return - - cursor.execute(f""" - CREATE INDEX {index_name} - ON {table_name} ({column_name}) - INDEXTYPE IS CTXSYS.CONTEXT - """) - conn.commit() - cursor.close() - print(f"♻️ Index {index_name} recreated (sync deferred)") - return - - print(f"πŸ”„ Syncing Oracle Text index: {index_name}") - try: - cursor.execute(f""" - BEGIN - CTX_DDL.SYNC_INDEX('{index_name}', '2M'); - END; - """) - conn.commit() - print(f"βœ… Index {index_name} synced") - except Exception as e: - print(f"⚠️ Sync failed for {index_name}: {e}") - print("⚠️ Continuing without breaking pipeline") - - cursor.close() - -def create_tables_if_not_exist(conn): - cursor = conn.cursor() - - try: - cursor.execute(f""" - BEGIN - EXECUTE IMMEDIATE ' - CREATE TABLE ENTITIES_{GRAPH_NAME} ( - ID NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY PRIMARY KEY, - NAME VARCHAR2(500) - ) - '; - EXCEPTION - WHEN OTHERS THEN - IF SQLCODE != -955 THEN - RAISE; - END IF; - END; - """) - cursor.execute(f""" - BEGIN - EXECUTE IMMEDIATE ' - CREATE TABLE RELATIONS_{GRAPH_NAME} ( - ID NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY PRIMARY KEY, - SOURCE_ID NUMBER, - TARGET_ID NUMBER, - RELATION_TYPE VARCHAR2(100), - SOURCE_TEXT VARCHAR2(4000) - ) - '; - EXCEPTION - WHEN OTHERS THEN - IF SQLCODE != -955 THEN - RAISE; - END IF; - END; - """) - conn.commit() - print("βœ… ENTITIES and RELATIONS tables created or already exist.") - except Exception as e: - print(f"[ERROR] Failed to create tables: {e}") - finally: - cursor.close() - - -create_tables_if_not_exist(oracle_conn) - -# IF GRAPH INDEX PROBLEM, Reindex -# ensure_oracle_text_index( -# oracle_conn, -# "ENTITIES_" + GRAPH_NAME, -# "NAME", -# "IDX_ENT_" + GRAPH_NAME + "_NAME" -# ) -# -# ensure_oracle_text_index( -# oracle_conn, -# "RELATIONS_" + GRAPH_NAME, -# "RELATION_TYPE", -# "IDX_REL_" + 
GRAPH_NAME + "_RELTYPE" -# ) - -def create_knowledge_graph(chunks): - cursor = oracle_conn.cursor() - - # Creates graph if it does not exist - try: - cursor.execute(f""" - BEGIN - EXECUTE IMMEDIATE ' - CREATE PROPERTY GRAPH {GRAPH_NAME} - VERTEX TABLES (ENTITIES_{GRAPH_NAME} - KEY (ID) - LABEL ENTITIES - PROPERTIES (NAME)) - EDGE TABLES (RELATIONS_{GRAPH_NAME} - KEY (ID) - SOURCE KEY (SOURCE_ID) REFERENCES ENTITIES_{GRAPH_NAME}(ID) - DESTINATION KEY (TARGET_ID) REFERENCES ENTITIES_{GRAPH_NAME}(ID) - LABEL RELATIONS - PROPERTIES (RELATION_TYPE, SOURCE_TEXT)) - '; - EXCEPTION - WHEN OTHERS THEN - IF SQLCODE != -55358 THEN -- ORA-55358: Graph already exists - RAISE; - END IF; - END; - """) - print(f"🧠 Graph '{GRAPH_NAME}' created or already exists.") - except Exception as e: - print(f"[GRAPH ERROR] Failed to create graph: {e}") - - # Inserting vertices and edges into the tables - for doc in chunks: - text = doc.page_content - source = doc.metadata.get("source", "unknown") - - if not text.strip(): - continue - - prompt = f""" - You are extracting structured RFP evidence from technical documentation. - - Given the text below, identify ONLY explicit, verifiable facts. - - Text: - {text} - - Extract triples in ONE of the following formats ONLY: - - 1. REQUIREMENT -[HAS_SUBJECT]-> - 2. REQUIREMENT -[HAS_METRIC]-> - 3. REQUIREMENT -[HAS_VALUE]-> - 4. 
REQUIREMENT -[SUPPORTED_BY]-> - - Rules: - - Use REQUIREMENT as the source entity - - Use UPPERCASE relation names - - Do NOT infer or assume - - If nothing explicit is found, return NONE - """ - try: - response = llm_for_rag.invoke(prompt) - result = response.content.strip() - except Exception as e: - print(f"[ERROR] Gen AI call error: {e}") - continue - - if result.upper() == "NONE": - continue - - triples = result.splitlines() - for triple in triples: - parts = triple.split("-[") - if len(parts) != 2: - continue - - right_part = parts[1].split("]->") - if len(right_part) != 2: - continue - - raw_relation, entity2 = right_part - relation = re.sub(r'\W+', '_', raw_relation.strip().upper()) - entity1 = parts[0].strip() - entity2 = entity2.strip() - - if entity1.upper() != "REQUIREMENT": - entity1 = "REQUIREMENT" - - try: - # Insertion of entities (with existence check) - cursor.execute(f"MERGE INTO ENTITIES_{GRAPH_NAME} e USING (SELECT :name AS NAME FROM dual) src ON (e.name = src.name) WHEN NOT MATCHED THEN INSERT (NAME) VALUES (:name)", [entity1, entity1]) - cursor.execute(f"MERGE INTO ENTITIES_{GRAPH_NAME} e USING (SELECT :name AS NAME FROM dual) src ON (e.name = src.name) WHEN NOT MATCHED THEN INSERT (NAME) VALUES (:name)", [entity2, entity2]) - # Retrieve the IDs - cursor.execute(f"SELECT ID FROM ENTITIES_{GRAPH_NAME} WHERE NAME = :name", [entity1]) - source_id = cursor.fetchone()[0] - cursor.execute(f"SELECT ID FROM ENTITIES_{GRAPH_NAME} WHERE NAME = :name", [entity2]) - target_id = cursor.fetchone()[0] - # Create relations - cursor.execute(f""" - INSERT INTO RELATIONS_{GRAPH_NAME} (SOURCE_ID, TARGET_ID, RELATION_TYPE, SOURCE_TEXT) - VALUES (:src, :tgt, :rel, :txt) - """, [source_id, target_id, relation, source]) - print(f"βœ… {entity1} -[{relation}]-> {entity2}") - except Exception as e: - print(f"[INSERT ERROR] {e}") - - oracle_conn.commit() - cursor.close() - print("πŸ’Ύ Knowledge graph updated.") - -def parse_rfp_requirement(question: str) -> dict: - 
prompt = f""" - You are an RFP requirement NORMALIZER for Oracle Cloud Infrastructure (OCI). - - Your job is NOT to summarize the question. - Your job is to STRUCTURE the requirement so it can be searched in: - - Technical documentation - - Knowledge Graph - - Vector databases - - ──────────────────────────────── - STEP 1 β€” Understand the requirement - ──────────────────────────────── - From the question, identify: - 1. The PRIMARY OCI SERVICE CATEGORY involved - 2. The MAIN TECHNICAL SUBJECT (short and precise) - 3. The EXPECTED TECHNICAL CAPABILITY or CONDITION (if any) - - IMPORTANT: - - Ignore marketing language - - Ignore phrases like "possui", "permite", "oferece" - - Focus ONLY on concrete technical meaning - - ──────────────────────────────── - STEP 2 β€” Mandatory service classification - ──────────────────────────────── - You MUST choose ONE primary technology from the list below - and INCLUDE IT EXPLICITLY in the keywords list. - - Choose the MOST SPECIFIC applicable item. - - ServiΓ§os da Oracle Cloud Infrastructure (OCI): - - Compute (IaaS) - β€’ Compute Instances (VM) - β€’ Bare Metal Instances - β€’ Dedicated VM Hosts - β€’ GPU Instances - β€’ Confidential Computing - β€’ Capacity Reservations - β€’ Autoscaling (Instance Pools) - β€’ Live Migration - β€’ Oracle Cloud VMware Solution (OCVS) - β€’ HPC (High Performance Computing) - β€’ Arm-based Compute (Ampere) - - Storage - - Object Storage - β€’ Object Storage - β€’ Object Storage – Archive - β€’ Pre-Authenticated Requests - β€’ Replication - - Block & File - β€’ Block Volume - β€’ Boot Volume - β€’ Volume Groups - β€’ File Storage - β€’ File Storage Snapshots - β€’ Data Transfer Service - - Networking - β€’ Virtual Cloud Network (VCN) - β€’ Subnets - β€’ Internet Gateway - β€’ NAT Gateway - β€’ Service Gateway - β€’ Dynamic Routing Gateway (DRG) - β€’ FastConnect - β€’ Load Balancer (L7 / L4) - β€’ Network Load Balancer - β€’ DNS - β€’ Traffic Management Steering Policies - β€’ IP Address 
Management (IPAM) - β€’ Network Firewall - β€’ Web Application Firewall (WAF) - β€’ Bastion - β€’ Capture Traffic (VTAP) - β€’ Private Endpoints - - Security, Identity & Compliance - β€’ Identity and Access Management (IAM) - β€’ Compartments - β€’ Policies - β€’ OCI Vault - β€’ OCI Key Management (KMS) - β€’ OCI Certificates - β€’ OCI Secrets - β€’ OCI Bastion - β€’ Cloud Guard - β€’ Security Zones - β€’ Vulnerability Scanning Service - β€’ Data Safe - β€’ Logging - β€’ Audit - β€’ OS Management / OS Management Hub - β€’ Shielded Instances - β€’ Zero Trust Packet Routing - - Databases - - Autonomous - β€’ Autonomous Database (ATP) - β€’ Autonomous Data Warehouse (ADW) - β€’ Autonomous JSON Database - - Databases Gerenciados - β€’ Oracle Database Service - β€’ Oracle Exadata Database Service - β€’ Exadata Cloud@Customer - β€’ Base Database Service - β€’ MySQL Database Service - β€’ MySQL HeatWave - β€’ NoSQL Database Cloud Service - β€’ TimesTen - β€’ PostgreSQL (OCI managed) - β€’ MongoDB API (OCI NoSQL compatibility) - - Analytics & BI - β€’ Oracle Analytics Cloud (OAC) - β€’ OCI Data Catalog - β€’ OCI Data Integration - β€’ OCI Streaming Analytics - β€’ OCI GoldenGate - β€’ OCI Big Data Service (Hadoop/Spark) - β€’ OCI Data Science - β€’ OCI AI Anomaly Detection - β€’ OCI AI Forecasting - - AI & Machine Learning - - Generative AI - β€’ OCI Generative AI - β€’ OCI Generative AI Agents - β€’ OCI Generative AI RAG - β€’ OCI Generative AI Embeddings - β€’ OCI AI Gateway (OpenAI-compatible) - - AI Services - β€’ OCI Vision (OCR, image analysis) - β€’ OCI Speech (STT / TTS) - β€’ OCI Language (NLP) - β€’ OCI Document Understanding - β€’ OCI Anomaly Detection - β€’ OCI Forecasting - β€’ OCI Data Labeling - - Containers & Cloud Native - β€’ OCI Container Engine for Kubernetes (OKE) - β€’ Container Registry (OCIR) - β€’ Service Mesh - β€’ API Gateway - β€’ OCI Functions (FaaS) - β€’ OCI Streaming (Kafka-compatible) - β€’ OCI Queue - β€’ OCI Events - β€’ OCI Resource 
Manager (Terraform) - - Integration & Messaging - β€’ OCI Integration Cloud (OIC) - β€’ OCI Service Connector Hub - β€’ OCI Streaming - β€’ OCI GoldenGate - β€’ OCI API Gateway - β€’ OCI Events Service - β€’ OCI Queue - β€’ Real Applications Clusters (RAC) - - Developer Services - β€’ OCI DevOps (CI/CD) - β€’ OCI Code Repository - β€’ OCI Build Pipelines - β€’ OCI Artifact Registry - β€’ OCI Logging Analytics - β€’ OCI Monitoring - β€’ OCI Notifications - β€’ OCI Bastion - β€’ OCI CLI - β€’ OCI SDKs - - Observability & Management - β€’ OCI Monitoring - β€’ OCI Alarms - β€’ OCI Logging - β€’ OCI Logging Analytics - β€’ OCI Application Performance Monitoring (APM) - β€’ OCI Operations Insights - β€’ OCI Management Agent - β€’ OCI Resource Discovery - - Enterprise & Hybrid - β€’ Oracle Cloud@Customer - β€’ Exadata Cloud@Customer - β€’ Compute Cloud@Customer - β€’ Dedicated Region Cloud@Customer - β€’ OCI Roving Edge Infrastructure - β€’ OCI Alloy - - Governance & FinOps - β€’ OCI Budgets - β€’ Cost Analysis - β€’ Usage Reports - β€’ Quotas - β€’ Tagging - β€’ Compartments - β€’ Resource Search - - Regions & Edge - β€’ OCI Regions (Commercial, Government, EU Sovereign) - β€’ OCI Edge Services - β€’ OCI Roving Edge - β€’ OCI Dedicated Region - - ──────────────────────────────── - STEP 3 β€” Keywords rules (CRITICAL) - ──────────────────────────────── - The "keywords" field MUST: - - ALWAYS include at least ONE OCI service keyword (e.g. "compute", "object storage", "oke") - - Include technical capability terms (e.g. resize, autoscaling, encryption) - - NEVER include generic verbs (permitir, possuir, oferecer) - - NEVER include full sentences - - ──────────────────────────────── - STEP 4 β€” Output rules - ──────────────────────────────── - Return ONLY valid JSON between tags. - Do NOT explain your reasoning. 
- - Question: - {question} - - - {{ - "requirement_type": "COMPLIANCE | FUNCTIONAL | NON_FUNCTIONAL", - "subject": "", - "expected_value": "", - "decision_type": "YES_NO | YES_NO_PARTIAL", - "keywords": ["mandatory_oci_service", "technical_capability", "additional_term"] - }} - - """ - - resp = llm_for_rag.invoke(prompt) - raw = resp.content.strip() - - try: - # remove ```json ``` ou ``` ``` - raw = re.sub(r"```json|```", "", raw).strip() - - match = re.search(r"\s*(\{.*?\})\s*", raw, re.DOTALL) - if not match: - raise ValueError("No JSON block found") - json_text = match.group(1) - - return json.loads(json_text) - - except Exception as e: - print("⚠️ RFP PARSER FAILED") - print("RAW RESPONSE:") - print(raw) - - return { - "requirement_type": "UNKNOWN", - "subject": question, - "expected_value": "", - "decision_type": "YES_NO_PARTIAL", - "keywords": re.findall(r"\b\w+\b", question.lower())[:5] - } - -def extract_graph_keywords_from_requirement(req: dict) -> str: - keywords = set(req.get("keywords", [])) - if req.get("subject"): - keywords.add(req["subject"].lower()) - if req.get("expected_value"): - keywords.add(str(req["expected_value"]).lower()) - return ", ".join(sorted(keywords)) - -def build_oracle_text_query(text: str) -> str | None: - ORACLE_TEXT_STOPWORDS = { - "and", "or", "the", "with", "between", "of", "to", "for", - "in", "on", "by", "is", "are", "was", "were", "be", "within", "between" - } - - tokens = [] - text = text.lower() - text = re.sub(r"[^a-z0-9\s]", " ", text) - - for token in text.split(): - if len(token) >= 4 and token not in ORACLE_TEXT_STOPWORDS: - tokens.append(f"{token}") - - tokens = sorted(set(tokens)) - return " OR ".join(tokens) if tokens else None - -def query_knowledge_graph(raw_keywords: str, top_k: int = 20, min_score: int = 50): - cursor = oracle_conn.cursor() - - safe_query = build_oracle_text_query(raw_keywords) - if not safe_query: - cursor.close() - return [] - - sql = f""" - SELECT - e1.NAME AS source_name, - 
r.RELATION_TYPE, - e2.NAME AS target_name, - GREATEST(SCORE(1), SCORE(2)) AS relevance_score - FROM RELATIONS_{GRAPH_NAME} r - JOIN ENTITIES_{GRAPH_NAME} e1 ON e1.ID = r.SOURCE_ID - JOIN ENTITIES_{GRAPH_NAME} e2 ON e2.ID = r.TARGET_ID - WHERE e1.NAME = 'REQUIREMENT' - AND ( - CONTAINS(e2.NAME, '{safe_query}', 1) > 0 - OR CONTAINS(r.RELATION_TYPE, '{safe_query}', 2) > 0 - ) - AND GREATEST(SCORE(1), SCORE(2)) >= {min_score} - ORDER BY relevance_score DESC - FETCH FIRST {top_k} ROWS ONLY - """ - - print("πŸ”Ž GRAPH QUERY (ranked):") - print(sql) - - cursor.execute(sql) - rows = cursor.fetchall() - cursor.close() - - print(f"πŸ“Š GRAPH FACTS (top {top_k}):") - for s, r, t, sc in rows: - print(f" [{sc:>3}] REQUIREMENT -[{r}]-> {t}") - - # mantΓ©m compatibilidade com o pipeline atual - return [(s, r, t) for s, r, t, _ in rows] - -# RE-RANK - -def extract_terms_from_graph_text(graph_context): - if not graph_context: - return set() - - if isinstance(graph_context, list): - terms = set() - for row in graph_context: - for col in row: - if isinstance(col, str): - terms.add(col.lower()) - return terms - - if isinstance(graph_context, str): - terms = set() - pattern = re.findall(r"([\w\s]+)-$begin:math:display$\[\\w\_\]\+$end:math:display$->([\w\s]+)", graph_context) - for e1, e2 in pattern: - terms.add(e1.strip().lower()) - terms.add(e2.strip().lower()) - return terms - - return set() - -def rerank_documents_with_graph_terms(docs, query, graph_terms): - query_terms = set(re.findall(r'\b\w+\b', query.lower())) - all_terms = query_terms.union(graph_terms) - - scored_docs = [] - for doc in docs: - doc_text = doc.page_content.lower() - score = sum(1 for term in all_terms if term in doc_text) - scored_docs.append((score, doc)) - - top_docs = sorted(scored_docs, key=lambda x: x[0], reverse=True)[:5] - return [doc.page_content for _, doc in top_docs] - -# SEMANTIC CHUNKING - -def split_llm_output_into_chapters(llm_text): - chapters = [] - current_chapter = [] - lines = 
llm_text.splitlines() - - for line in lines: - if re.match(chapter_separator_regex, line): - if current_chapter: - chapters.append("\n".join(current_chapter).strip()) - current_chapter = [line] - else: - current_chapter.append(line) - - if current_chapter: - chapters.append("\n".join(current_chapter).strip()) - - return chapters - - -def semantic_chunking(text): - prompt = f""" - You received the following text extracted via OCR: - - {text} - - Your task: - 1. Identify headings (short uppercase or bold lines, no period at the end) putting the Product Name (Application Name) and the Subject - 2. Separate paragraphs by heading - 3. Indicate columns with [COLUMN 1], [COLUMN 2] if present - 4. Indicate tables with [TABLE] in markdown format - 5. ALWAYS PUT THE URL if there is a Reference - 6. Indicate explicity metrics (if it exists) - Examples: - - Oracle Financial Services RTO is 1 hour - - The Oracle Banking Supply Chain Finance Cloud Service A maximum number of 10K Hosted Transactions - - The Oracle Banking Payments Cloud Service, Additional Non-Production Environment: You may purchase up to a maximum of ten (10) additional Non-Production Environments - """ - - get_out = False - while not get_out: - try: - response = llm_for_rag.invoke(prompt) - get_out = True - except: - print("[ERROR] Gen AI call error") - - return response - -def read_pdfs(pdf_path): - if "-ocr" in pdf_path: - doc_pages = PyMuPDFLoader(str(pdf_path)).load() - else: - doc_pages = UnstructuredPDFLoader(str(pdf_path)).load() - full_text = "\n".join([page.page_content for page in doc_pages]) - return full_text - - -def smart_split_text(text, max_chunk_size=10_000): - chunks = [] - start = 0 - text_length = len(text) - - while start < text_length: - end = min(start + max_chunk_size, text_length) - split_point = max( - text.rfind('.', start, end), - text.rfind('!', start, end), - text.rfind('?', start, end), - text.rfind('\n\n', start, end) - ) - if split_point == -1 or split_point <= start: - 
split_point = end - else: - split_point += 1 - - chunk = text[start:split_point].strip() - if chunk: - chunks.append(chunk) - - start = split_point - - return chunks - - -def load_previously_indexed_docs(): - if os.path.exists(PROCESSED_DOCS_FILE): - with open(PROCESSED_DOCS_FILE, "rb") as f: - return pickle.load(f) - return set() - - -def save_indexed_docs(docs): - with open(PROCESSED_DOCS_FILE, "wb") as f: - pickle.dump(docs, f) - - -# ========================= -# Main Function -# ========================= -def chat(): - PDF_FOLDER = Path("docs") # pasta onde estΓ£o os PDFs - - pdf_paths = sorted( - str(p) for p in PDF_FOLDER.glob("*.pdf") - ) - - already_indexed_docs = load_previously_indexed_docs() - updated_docs = set() - - try: - vectorstore = FAISS.load_local(INDEX_PATH, embeddings, allow_dangerous_deserialization=True) - print("βœ”οΈ FAISS index loaded.") - except Exception: - print("⚠️ FAISS index not found, creating a new one.") - vectorstore = None - - new_chunks = [] - - for pdf_path in tqdm(pdf_paths, desc=f"πŸ“„ Processing PDFs"): - print(f" {os.path.basename(pdf_path)}") - if pdf_path in already_indexed_docs: - print(f"βœ… Document already indexed: {pdf_path}") - continue - full_text = read_pdfs(pdf_path=pdf_path) - path_url = filename_to_url(os.path.basename(pdf_path)) - - text_chunks = smart_split_text(full_text, max_chunk_size=10_000) - overflow_buffer = "" - - for chunk in tqdm(text_chunks, desc=f"πŸ“„ Processing text chunks", dynamic_ncols=True, leave=False): - current_text = overflow_buffer + chunk - - treated_text = semantic_chunking(current_text) - - if hasattr(treated_text, "content"): - chapters = split_llm_output_into_chapters(treated_text.content) - - last_chapter = chapters[-1] if chapters else "" - - if last_chapter and not last_chapter.strip().endswith((".", "!", "?", "\n\n")): - print("πŸ“Œ Last chapter seems incomplete, saving for the next cycle") - overflow_buffer = last_chapter - chapters = chapters[:-1] - else: - overflow_buffer 
= "" - - for chapter_text in chapters: - reference_url = "Reference: " + path_url - chapter_text = chapter_text + "\n" + reference_url - doc = Document(page_content=chapter_text, metadata={"source": pdf_path, "reference": reference_url}) - new_chunks.append(doc) - print(f"βœ… New chapter indexed:\n{chapter_text}...\n") - - else: - print(f"[ERROR] semantic_chunking returned unexpected type: {type(treated_text)}") - - updated_docs.add(str(pdf_path)) - - if new_chunks: - if vectorstore: - vectorstore.add_documents(new_chunks) - else: - vectorstore = FAISS.from_documents(new_chunks, embedding=embeddings) - - vectorstore.save_local(INDEX_PATH) - save_indexed_docs(already_indexed_docs.union(updated_docs)) - print(f"πŸ’Ύ {len(new_chunks)} chunks added to FAISS index.") - - print("🧠 Building knowledge graph...") - create_knowledge_graph(new_chunks) - - else: - print("πŸ“ No new documents to index.") - - retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 50, "fetch_k": 100}) - - RFP_DECISION_TEMPLATE = """ - You are answering an RFP requirement with risk awareness. - - Requirement: - Type: {requirement_type} - Subject: {subject} - Expected value: {expected_value} - - Document evidence: - {text_context} - - Graph evidence: - {graph_context} - - Decision rules: - - Answer ONLY with YES, NO or PARTIAL - - If value differs, answer PARTIAL - - If not found, answer NO - - Interpretation rules (MANDATORY): - - If a capability is supported but requires reboot, downtime, or restart, it STILL counts as YES unless the requirement explicitly forbids it. - - "Servidor em funcionamento" means the resource exists and is active before the operation, not that it must remain online without interruption. - - Only answer NO if the operation is NOT supported at all or requires destroying and recreating the resource. - - Reboot, restart, or brief unavailability MUST NOT be interpreted as lack of support. 
- - Confidence rules: - - HIGH: Explicit evidence directly answers the requirement - - MEDIUM: Evidence partially matches or requires light interpretation - - LOW: Requirement is ambiguous OR evidence is indirect OR missing - - Ambiguity rules: - - ambiguity_detected = true if: - - The requirement can be interpreted in more than one way - - Keywords are vague (e.g. "support", "integration", "capability") - - Evidence does not clearly bind to subject + expected value - - Service scope rules (MANDATORY): - - Do NOT use evidence from a different Oracle Cloud service to justify another. - - OUTPUT CONSTRAINTS (MANDATORY): - - Return ONLY a valid JSON object - - Do NOT include explanations, comments, markdown, lists, or code fences - - Do NOT write any text before or after the JSON - - The response must start with an opening curly brace and end with a closing curly brace - - JSON schema (return exactly this structure): - {{ - "answer": "YES | NO | PARTIAL", - "confidence": "HIGH | MEDIUM | LOW", - "ambiguity_detected": true, - "confidence_reason": "", - "justification": "", - "evidence": [ - {{ - "quote": "", - "source": "" - }} - ] - }} - """ - prompt = PromptTemplate.from_template(RFP_DECISION_TEMPLATE) - - def get_context_from_requirement(req: dict): - query_terms = extract_graph_keywords_from_requirement(req) - - docs = retriever.invoke(query_terms) - graph_context = query_knowledge_graph(query_terms) - - return { - "text_context": "\n\n".join(doc.page_content for doc in docs), - "graph_context": graph_context, - "requirement_type": req["requirement_type"], - "subject": req["subject"], - "expected_value": req.get("expected_value", "") - } - - parse_requirement_runnable = RunnableLambda( - lambda q: parse_rfp_requirement(q) - ) - chain = ( - parse_requirement_runnable - | RunnableMap({ - "text_context": RunnableLambda( - lambda req: get_context_from_requirement(req)["text_context"] - ), - "graph_context": RunnableLambda( - lambda req: 
get_context_from_requirement(req)["graph_context"] - ), - "requirement_type": lambda req: req["requirement_type"], - "subject": lambda req: req["subject"], - "expected_value": lambda req: req.get("expected_value", "") - }) - | prompt - | llm - | StrOutputParser() - ) - - print("βœ… READY") - - while True: - query = input("❓ Question (or 'quit' to exit): ") - if query.lower() == "quit": - break - response = chain.invoke(query) - print("\nπŸ“œ RESPONSE:\n") - print(response) - print("\n" + "=" * 80 + "\n") - -def get_context_from_requirement(req: dict): - query_terms = extract_graph_keywords_from_requirement(req) - - docs = retriever.invoke(query_terms) - graph_context = query_knowledge_graph(query_terms) - - graph_terms = extract_terms_from_graph_text(graph_context) - reranked_chunks = rerank_documents_with_graph_terms( - docs, - query_terms, - graph_terms - ) - - return { - "text_context": "\n\n".join(reranked_chunks), - "graph_context": graph_context, - "requirement_type": req["requirement_type"], - "subject": req["subject"], - "expected_value": req.get("expected_value", "") - } - -try: - vectorstore = FAISS.load_local( - INDEX_PATH, - embeddings, - allow_dangerous_deserialization=True - ) - - retriever = vectorstore.as_retriever( - search_type="similarity", - search_kwargs={"k": 50, "fetch_k": 100} - ) -except: - print("No Faiss") - -RFP_DECISION_TEMPLATE = """ -You are answering an RFP requirement with risk awareness. - -Requirement: -Type: {requirement_type} -Subject: {subject} -Expected value: {expected_value} - -Document evidence: -{text_context} - -Graph evidence: -{graph_context} - -Decision rules: -- Answer ONLY with YES, NO or PARTIAL -- If value differs, answer PARTIAL -- If not found, answer NO - -Interpretation rules (MANDATORY): -- If a capability is supported but requires reboot, downtime, or restart, it STILL counts as YES unless the requirement explicitly forbids it. 
-- "Servidor em funcionamento" means the resource exists and is active before the operation, not that it must remain online without interruption. -- Only answer NO if the operation is NOT supported at all or requires destroying and recreating the resource. -- Reboot, restart, or brief unavailability MUST NOT be interpreted as lack of support. - -Confidence rules: -- HIGH: Explicit evidence directly answers the requirement -- MEDIUM: Evidence partially matches or requires light interpretation -- LOW: Requirement is ambiguous OR evidence is indirect OR missing - -Ambiguity rules: -- ambiguity_detected = true if: - - The requirement can be interpreted in more than one way - - Keywords are vague (e.g. "support", "integration", "capability") - - Evidence does not clearly bind to subject + expected value - -Service scope rules (MANDATORY): -- Do NOT use evidence from a different Oracle Cloud service to justify another. - -OUTPUT CONSTRAINTS (MANDATORY): -- Return ONLY a valid JSON object -- Do NOT include explanations, comments, markdown, lists, or code fences -- Do NOT write any text before or after the JSON -- The response must start with an opening curly brace and end with a closing curly brace - -JSON schema (return exactly this structure): -{{ - "answer": "YES | NO | PARTIAL", - "confidence": "HIGH | MEDIUM | LOW", - "ambiguity_detected": true, - "confidence_reason": "", - "justification": "", - "evidence": [ - {{ - "quote": "", - "source": "" - }} - ] -}} -""" -prompt = PromptTemplate.from_template(RFP_DECISION_TEMPLATE) - -parse_requirement_runnable = RunnableLambda( - lambda q: parse_rfp_requirement(q) -) - -chain = ( - parse_requirement_runnable - | RunnableMap({ - "text_context": RunnableLambda( - lambda req: get_context_from_requirement(req)["text_context"] - ), - "graph_context": RunnableLambda( - lambda req: get_context_from_requirement(req)["graph_context"] - ), - "requirement_type": lambda req: req["requirement_type"], - "subject": lambda req: 
req["subject"], - "expected_value": lambda req: req.get("expected_value", "") -}) - | prompt - | llm - | StrOutputParser() -) - -def answer_question(question: str) -> str: - return chain.invoke(question) - -# πŸš€ Run -if __name__ == "__main__": - chat() \ No newline at end of file diff --git a/files/index.html b/files/index.html deleted file mode 100644 index f5ae32a..0000000 --- a/files/index.html +++ /dev/null @@ -1,297 +0,0 @@ - - - - - Oracle AI RFP Response - - - - - -
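The pipeline above extracts a strict JSON object from raw LLM output in two places (the requirement parser and the decision chain) by stripping Markdown fences and regex-matching the first object. A minimal, standalone sketch of that extraction step; the sample reply is illustrative:

```python
import json
import re

def extract_json_block(raw):
    """Strip Markdown fences and parse the first JSON object in an LLM reply."""
    cleaned = re.sub(r"```json|```", "", raw).strip()
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if not match:
        raise ValueError("No JSON block found in LLM output")
    return json.loads(match.group(0))

reply = 'Here is the result:\n```json\n{"answer": "YES", "confidence": "HIGH"}\n```'
print(extract_json_block(reply))  # {'answer': 'YES', 'confidence': 'HIGH'}
```

Raising on a missing object (rather than returning a default) lets the caller decide on a fallback, as the pipeline does with its keyword-based requirement stub.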

🧠 Oracle AI RFP Response

- - -
- -

- Oracle LAD A-Team
- Cristiano Hoshikawa
- cristiano.hoshikawa@oracle.com -

- -

- Tutorial: https://docs.oracle.com/en/learn/oci-genai-pdf
- REST Service Endpoint: http://demo-orcl-api-ai.hoshikawa.com.br:8101/chat

- -
- -

Overview

- -

This application provides an AI-assisted RFP response engine for Oracle Cloud Infrastructure (OCI). It analyzes natural language requirements and returns a structured, evidence-based technical response.

- -
    -
  • Official Oracle technical documentation
  • -
  • Semantic search using vector embeddings
  • -
  • Knowledge Graph signals
  • -
  • Large Language Models (LLMs)
  • -
- -
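These signals are combined by reranking vector-search hits against knowledge-graph terms (the `rerank_documents_with_graph_terms` helper in this repository's pipeline). A simplified, self-contained sketch of that scoring idea; the document strings are illustrative:

```python
import re

def rerank_by_terms(docs, query, graph_terms):
    """Score each document by how many query/graph terms it contains."""
    query_terms = set(re.findall(r"\b\w+\b", query.lower()))
    all_terms = query_terms | {t.lower() for t in graph_terms}
    scored = [(sum(1 for t in all_terms if t in d.lower()), d) for d in docs]
    return [d for _, d in sorted(scored, key=lambda x: x[0], reverse=True)]

docs = [
    "OCI Compute supports live memory resize on flexible shapes.",
    "Object Storage offers pre-authenticated requests.",
]
ranked = rerank_by_terms(docs, "compute memory resize", {"compute"})
print(ranked[0])  # the Compute document ranks first
```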
- - -
- -

Try It — Live RFP Question

- -

Enter an RFP requirement or technical question below. The API will return a structured JSON response.

- - - - -

AI Response

-

-
-
- - -
- -

REST API Usage

- -

The service exposes a POST endpoint that accepts a JSON payload.

    curl -X POST http://demo-orcl-api-ai.hoshikawa.com.br:8101/chat \
      -H "Content-Type: application/json" \
      -d '{
        "question": "Does Oracle Cloud Infrastructure (OCI) Compute support online resizing of memory for running virtual machine instances?"
      }'
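The same call can be made from Python using only the standard library. This mirrors the curl example above; the helper name is illustrative, and actually sending the request requires network access to the demo host:

```python
import json
import urllib.request

API_URL = "http://demo-orcl-api-ai.hoshikawa.com.br:8101/chat"  # endpoint shown above

def build_chat_request(question):
    """Build the same POST request as the curl example."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps({"question": question}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Does OCI Compute support online memory resizing?")
# Sending it requires network access to the demo host:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.loads(resp.read()))
print(req.get_method(), req.full_url)
```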

Request Parameters

- -

question (string)
Natural language description of an RFP requirement or technical capability. Small wording changes may affect how intent and evidence are interpreted.

- -
- - -
- -

AI Response JSON Structure

- -

The API always returns a strict, normalized JSON structure designed for traceability, auditing, and human validation.

- -

answer

-

Final assessment of the requirement: YES, NO, or PARTIAL. A NO means the requirement is not explicitly satisfied as written.

- -

confidence

-

Indicates the strength of the supporting evidence: HIGH, MEDIUM, or LOW.

- -

ambiguity_detected

-

Flags whether the requirement is vague, overloaded, or open to interpretation.

- -

confidence_reason

-

Short explanation justifying the confidence level.

- -

justification

-

Technical rationale connecting the evidence to the requirement. This is not marketing text.

- -

evidence

-

List of supporting references:

-
    -
  • quote – Exact extracted text
  • -
  • source – URL or document reference
  • -
- -
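Because the schema is fixed, a client can validate a response before trusting it. A minimal validation sketch; the sample payload is illustrative, not a real service response:

```python
REQUIRED_FIELDS = {"answer", "confidence", "ambiguity_detected",
                   "confidence_reason", "justification", "evidence"}

def validate_rfp_response(payload):
    """Check the structural contract of the response described above."""
    if not REQUIRED_FIELDS.issubset(payload):
        return False
    if payload["answer"] not in {"YES", "NO", "PARTIAL"}:
        return False
    if payload["confidence"] not in {"HIGH", "MEDIUM", "LOW"}:
        return False
    return all({"quote", "source"} <= set(item) for item in payload["evidence"])

sample = {
    "answer": "YES",
    "confidence": "HIGH",
    "ambiguity_detected": False,
    "confidence_reason": "Explicit documentation match",
    "justification": "Documentation states memory can be resized online.",
    "evidence": [{"quote": "placeholder quote", "source": "https://docs.oracle.com/"}],
}
print(validate_rfp_response(sample))  # True
```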
- - -
- -

Important Notes

- -
    -
  • - Responses are generated by an LLM. - Even with low temperature, minor variations may occur across executions. -
  • -
  • - Results depend on wording, terminology, and framing of the requirement. -
  • -
  • - In many RFPs, an initial NO can be reframed into a valid - YES by mapping the requirement to the correct OCI service. -
  • -
  • - Human review is mandatory. - This tool supports architects and RFP teams β€” it does not replace them. -
  • -
- -

GraphRAG • Oracle Autonomous Database 23ai • Embeddings • Knowledge Graph • LLM • Flask API

- -
- - - - - \ No newline at end of file diff --git a/files/modules/admin/routes.py b/files/modules/admin/routes.py new file mode 100644 index 0000000..c76b4c9 --- /dev/null +++ b/files/modules/admin/routes.py @@ -0,0 +1,160 @@ +from flask import Blueprint, render_template, request, jsonify, redirect, flash +from modules.core.security import requires_admin_auth +from modules.core.audit import audit_log +import threading +from modules.core.audit import audit_log + +from oci_genai_llm_graphrag_rerank_rfp import ( + search_chunks_for_invalidation, + revoke_chunk_by_hash, + get_chunk_metadata, + add_manual_knowledge_entry, + reload_all +) + +admin_bp = Blueprint("admin", __name__) + + +# ========================= +# ADMIN HOME (invalidate UI) +# ========================= +@admin_bp.route("/") +@requires_admin_auth +def admin_home(): + return render_template("admin_menu.html") + +@admin_bp.route("/invalidate") +@requires_admin_auth +def invalidate_page(): + return render_template( + "invalidate.html", + results=[], + statement="" + ) + +# ========================= +# SEARCH CHUNKS +# ========================= +@admin_bp.route("/search", methods=["POST"]) +@requires_admin_auth +def search_for_invalidation(): + + statement = request.form["statement"] + + docs = search_chunks_for_invalidation(statement) + + hashes = [d.metadata.get("chunk_hash") for d in docs if d.metadata.get("chunk_hash")] + meta = get_chunk_metadata(hashes) + + results = [] + + for d in docs: + h = d.metadata.get("chunk_hash") + m = meta.get(h, {}) + + results.append({ + "chunk_hash": h, + "source": d.metadata.get("source"), + "text": d.page_content, + "origin": m.get("origin"), + "status": m.get("status") + }) + + return render_template( + "invalidate.html", + statement=statement, + results=results + ) + + +# ========================= +# REVOKE +# ========================= +@admin_bp.route("/revoke", methods=["POST"]) +@requires_admin_auth +def revoke_chunk_ui(): + + data = request.get_json() + + 
chunk_hash = str(data["chunk_hash"]) + reason = str(data.get("reason", "Manual revoke")) + audit_log("INVALIDATE", f"chunk_hash={chunk_hash}") + + print("chunk_hash", chunk_hash) + print("reason", reason) + + revoke_chunk_by_hash(chunk_hash, reason) + + return {"status": "ok", "chunk_hash": chunk_hash} + + +# ========================= +# ADD MANUAL KNOWLEDGE +# ========================= +@admin_bp.route("/add-knowledge", methods=["POST"]) +@requires_admin_auth +def add_manual_knowledge(): + + data = request.get_json(force=True) + + chunk_hash = add_manual_knowledge_entry( + text=data["text"], + author="ADMIN", + reason=data.get("reason"), + source="MANUAL_INPUT", + origin="MANUAL", + also_update_graph=True + ) + audit_log("ADD_KNOWLEDGE", f"chunk_hash={chunk_hash}") + + return jsonify({ + "status": "OK", + "chunk_hash": chunk_hash + }) + +# ========================= +# UPDATE CHUNK +# ========================= +@admin_bp.route("/update-chunk", methods=["POST"]) +@requires_admin_auth +def update_chunk(): + + data = request.get_json() or {} + + chunk_hash = str(data.get("chunk_hash", "")).strip() + text = str(data.get("text", "")).strip() + + print("chunk_hash", chunk_hash) + print("text", text) + + if not chunk_hash: + return {"status": "error", "message": "missing hash"}, 400 + + reason = str(data.get("reason", "Manual change")) + + revoke_chunk_by_hash(chunk_hash, reason=reason) + chunk_hash = add_manual_knowledge_entry( + text=text, + author="ADMIN", + reason=reason, + source="MANUAL_INPUT", + origin="MANUAL", + also_update_graph=True + ) + audit_log("UPDATE CHUNK", f"chunk_hash={chunk_hash}") + + return jsonify({ + "status": "OK", + "chunk_hash": chunk_hash + }) + +@admin_bp.route("/reboot", methods=["POST"]) +@requires_admin_auth +def reboot_service(): + # roda em background pra nΓ£o travar request + threading.Thread(target=reload_all, daemon=True).start() + + return jsonify({ + "status": "ok", + "message": "Knowledge reload started" + }) \ No newline at end of 
file diff --git a/files/modules/architecture/routes.py b/files/modules/architecture/routes.py new file mode 100644 index 0000000..c9378a6 --- /dev/null +++ b/files/modules/architecture/routes.py @@ -0,0 +1,83 @@ +from flask import Blueprint, request, jsonify +import uuid +import json +from pathlib import Path +from modules.core.audit import audit_log + +from modules.core.security import requires_app_auth +from .service import start_architecture_job +from .store import ARCH_JOBS, ARCH_LOCK + +architecture_bp = Blueprint("architecture", __name__) + +ARCH_FOLDER = Path("architecture") + +@architecture_bp.route("/architecture/start", methods=["POST"]) +@requires_app_auth +def architecture_start(): + data = request.get_json(force=True) or {} + question = (data.get("question") or "").strip() + + if not question: + return jsonify({"error": "Empty question"}), 400 + + job_id = str(uuid.uuid4()) + audit_log("ARCHITECTURE", f"job_id={job_id}") + + with ARCH_LOCK: + ARCH_JOBS[job_id] = { + "status": "RUNNING", + "logs": [] + } + + start_architecture_job(job_id, question) + return jsonify({"job_id": job_id}) + + +@architecture_bp.route("/architecture//status", methods=["GET"]) +@requires_app_auth +def architecture_status(job_id): + job_dir = ARCH_FOLDER / job_id + status_file = job_dir / "status.json" + + # fallback 1: status persistido + if status_file.exists(): + try: + return jsonify(json.loads(status_file.read_text(encoding="utf-8"))) + except Exception: + return jsonify({"status": "ERROR", "detail": "Invalid status file"}), 500 + + # fallback 2: status em memΓ³ria + with ARCH_LOCK: + job = ARCH_JOBS.get(job_id) + + if job: + return jsonify({"status": job.get("status", "PROCESSING")}) + + return jsonify({"status": "NOT_FOUND"}), 404 + + +@architecture_bp.route("/architecture//logs", methods=["GET"]) +@requires_app_auth +def architecture_logs(job_id): + with ARCH_LOCK: + job = ARCH_JOBS.get(job_id, {}) + return jsonify({"logs": job.get("logs", [])}) + 
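The architecture endpoints above follow a start/poll/fetch job pattern. A client-side sketch of that flow; the base URL is an assumption, and the routes are assumed to take a job id in the URL path:

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8100"  # assumed deployment address; adjust as needed

TERMINAL_STATES = {"DONE", "ERROR", "NOT_FOUND"}

def is_terminal(status):
    """A job stops being polled once it reaches a terminal state."""
    return status in TERMINAL_STATES

def call_api(path, payload=None):
    """POST when a payload is given, GET otherwise; returns parsed JSON."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())

def run_architecture_job(question, poll_every=2.0):
    """Start a job, poll its status until terminal, then fetch the plan."""
    job_id = call_api("/architecture/start", {"question": question})["job_id"]
    while not is_terminal(call_api(f"/architecture/{job_id}/status")["status"]):
        time.sleep(poll_every)
    return call_api(f"/architecture/{job_id}/result")
```

Note that the server persists status to `status.json` per job, so polling keeps working even if the in-memory job table is lost on restart.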
+@architecture_bp.route("/architecture//result", methods=["GET"]) +@requires_app_auth +def architecture_result(job_id): + job_dir = ARCH_FOLDER / job_id + result_file = job_dir / "architecture.json" + + # ainda nΓ£o terminou + if not result_file.exists(): + return jsonify({"error": "not ready"}), 404 + + try: + raw = result_file.read_text(encoding="utf-8") + plan = json.loads(raw) + return jsonify(plan) + + except Exception as e: + return jsonify({"error": str(e)}), 500 \ No newline at end of file diff --git a/files/modules/architecture/service.py b/files/modules/architecture/service.py new file mode 100644 index 0000000..1bfedeb --- /dev/null +++ b/files/modules/architecture/service.py @@ -0,0 +1,56 @@ +import threading +import json +from pathlib import Path + +from .store import ARCH_JOBS, ARCH_LOCK +from oci_genai_llm_graphrag_rerank_rfp import call_architecture_planner, architecture_to_mermaid + +ARCH_FOLDER = Path("architecture") +ARCH_FOLDER.mkdir(exist_ok=True) + +def make_job_logger(job_id: str): + def _log(msg): + with ARCH_LOCK: + job = ARCH_JOBS.get(job_id) + if job: + job["logs"].append(str(msg)) + return _log + +def start_architecture_job(job_id: str, question: str): + job_dir = ARCH_FOLDER / job_id + job_dir.mkdir(parents=True, exist_ok=True) + + status_file = job_dir / "status.json" + result_file = job_dir / "architecture.json" + + def write_status(state: str, detail: str | None = None): + payload = {"status": state} + if detail: + payload["detail"] = detail + status_file.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8") + + with ARCH_LOCK: + if job_id in ARCH_JOBS: + ARCH_JOBS[job_id]["status"] = state + if detail: + ARCH_JOBS[job_id]["detail"] = detail + + write_status("PROCESSING") + + def background(): + try: + logger = make_job_logger(job_id) + + plan = call_architecture_planner(question, log=logger) + if not isinstance(plan, dict): + raise TypeError(f"Planner returned {type(plan)}") + + plan["mermaid"] = 
architecture_to_mermaid(plan) + + result_file.write_text(json.dumps(plan, ensure_ascii=False, indent=2), encoding="utf-8") + write_status("DONE") + + except Exception as e: + write_status("ERROR", str(e)) + + threading.Thread(target=background, daemon=True).start() \ No newline at end of file diff --git a/files/modules/architecture/store.py b/files/modules/architecture/store.py new file mode 100644 index 0000000..a3346e5 --- /dev/null +++ b/files/modules/architecture/store.py @@ -0,0 +1,4 @@ +from threading import Lock + +ARCH_LOCK = Lock() +ARCH_JOBS = {} \ No newline at end of file diff --git a/files/modules/auth/routes.py b/files/modules/auth/routes.py new file mode 100644 index 0000000..ea40e2c --- /dev/null +++ b/files/modules/auth/routes.py @@ -0,0 +1,64 @@ +from flask import Blueprint, render_template, request, redirect, url_for, flash, jsonify, session +from modules.users.service import signup_user +from config_loader import load_config +from modules.users.service import create_user, authenticate_user + +auth_bp = Blueprint( + "auth", + __name__, + template_folder="../../templates/users" +) + +config = load_config() + +@auth_bp.route("/signup", methods=["GET", "POST"]) +def signup(): + + if request.method == "POST": + email = request.form.get("email") + name = request.form.get("name") + + try: + link = signup_user(email, name) + + if link and config.dev_mode == 1: + flash(f"DEV MODE: password link β†’ {link}", "success") + else: + flash("User created and email sent", "success") + except Exception as e: + flash(str(e), "danger") + + return redirect(url_for("auth.signup")) + + return render_template("users/signup.html") + +@auth_bp.route("/register", methods=["POST"]) +def register(): + data = request.json + create_user(data["username"], data["password"]) + return jsonify({"status": "ok"}) + +@auth_bp.route("/login", methods=["POST"]) +def login(): + + email = request.form.get("username") + password = request.form.get("password") + + ok = 
authenticate_user(email, password) + + if not ok: + flash("Invalid credentials") + return redirect("/login") + + session["user_email"] = email + + return redirect("/") + +@auth_bp.route("/login", methods=["GET"]) +def login_page(): + return render_template("users/login.html") + +@auth_bp.route("/logout") +def logout(): + session.clear() # remove tudo da sessΓ£o + return redirect("/login") diff --git a/files/modules/chat/routes.py b/files/modules/chat/routes.py new file mode 100644 index 0000000..95efc05 --- /dev/null +++ b/files/modules/chat/routes.py @@ -0,0 +1,82 @@ +import json +from flask import Blueprint, request, jsonify +from modules.core.security import requires_app_auth +from oci_genai_llm_graphrag_rerank_rfp import answer_question, search_active_chunks +from modules.core.audit import audit_log +from .service import start_chat_job +from .store import CHAT_JOBS, CHAT_LOCK + +chat_bp = Blueprint("chat", __name__) + +def parse_llm_json(raw: str) -> dict: + try: + if not isinstance(raw, str): + return {"answer": "ERROR", "justification": "LLM output is not a string", "raw_output": str(raw)} + raw = raw.replace("```json", "").replace("```", "").strip() + return json.loads(raw) + except Exception: + return {"answer": "ERROR", "justification": "LLM returned invalid JSON", "raw_output": raw} + +@chat_bp.route("/chat", methods=["POST"]) +@requires_app_auth +def chat(): + data = request.get_json(force=True) or {} + question = (data.get("question") or "").strip() + + if not question: + return jsonify({"error": "Empty question"}), 400 + + raw_answer = answer_question(question) + parsed_answer = parse_llm_json(raw_answer) + audit_log("RFP_QUESTION", f"question={question}") + + # (opcional) manter comportamento antigo de evidence/full_text se vocΓͺ quiser + # docs = search_active_chunks(question) + + return jsonify({ + "question": question, + "result": parsed_answer + }) + +@chat_bp.post("/chat/start") +def start(): + + question = request.json["question"] + + job_id = 
start_chat_job(question)
+
+    return jsonify({"job_id": job_id})
+
+@chat_bp.get("/chat/<job_id>/status")
+def status(job_id):
+
+    with CHAT_LOCK:
+        job = CHAT_JOBS.get(job_id)
+
+    if not job:
+        return jsonify({"error": "not found"}), 404
+
+    return jsonify({"status": job["status"]})
+
+@chat_bp.get("/chat/<job_id>/result")
+def result(job_id):
+
+    with CHAT_LOCK:
+        job = CHAT_JOBS.get(job_id)
+
+    if not job:
+        return jsonify({"error": "not found"}), 404
+
+    return jsonify({
+        "result": parse_llm_json(job["result"]),
+        "error": job["error"]
+    })
+
+@chat_bp.get("/chat/<job_id>/logs")
+def logs(job_id):
+
+    with CHAT_LOCK:
+        job = CHAT_JOBS.get(job_id)
+
+    if not job:
+        return jsonify({"error": "not found"}), 404
+
+    return jsonify({"logs": job["logs"]})
+
diff --git a/files/modules/chat/service.py b/files/modules/chat/service.py
new file mode 100644
index 0000000..0ba1cd4
--- /dev/null
+++ b/files/modules/chat/service.py
@@ -0,0 +1,44 @@
+import threading
+import uuid
+from .store import CHAT_JOBS, CHAT_LOCK
+from oci_genai_llm_graphrag_rerank_rfp import answer_question
+
+
+def start_chat_job(question: str):
+
+    job_id = str(uuid.uuid4())
+
+    with CHAT_LOCK:
+        CHAT_JOBS[job_id] = {
+            "status": "PROCESSING",
+            "result": None,
+            "error": None,
+            "logs": []
+        }
+
+    def log(msg):
+        with CHAT_LOCK:
+            CHAT_JOBS[job_id]["logs"].append(str(msg))
+
+    def background():
+        try:
+            log("Starting answer_question()")
+
+            result = answer_question(question)
+
+            with CHAT_LOCK:
+                CHAT_JOBS[job_id]["result"] = result
+                CHAT_JOBS[job_id]["status"] = "DONE"
+
+            log("DONE")
+
+        except Exception as e:
+            with CHAT_LOCK:
+                CHAT_JOBS[job_id]["error"] = str(e)
+                CHAT_JOBS[job_id]["status"] = "ERROR"
+
+            log(f"ERROR: {e}")
+
+    threading.Thread(target=background, daemon=True).start()
+
+    return job_id
\ No newline at end of file
diff --git a/files/modules/chat/store.py b/files/modules/chat/store.py
new file mode 100644
index 0000000..8787749
--- /dev/null
+++ b/files/modules/chat/store.py
@@ -0,0 +1,4 @@
+import threading
+
+CHAT_JOBS = {}
+CHAT_LOCK = threading.Lock()
\ No
newline at end of file diff --git a/files/modules/core/audit.py b/files/modules/core/audit.py new file mode 100644 index 0000000..484f3f0 --- /dev/null +++ b/files/modules/core/audit.py @@ -0,0 +1,11 @@ +from flask import session, request +from datetime import datetime + +def audit_log(action: str, detail: str = ""): + email = session.get("user_email", "anonymous") + ip = request.remote_addr + + line = f"{datetime.utcnow().isoformat()} | {email} | {ip} | {action} | {detail}\n" + + with open("audit.log", "a", encoding="utf-8") as f: + f.write(line) \ No newline at end of file diff --git a/files/modules/core/security.py b/files/modules/core/security.py new file mode 100644 index 0000000..2b78874 --- /dev/null +++ b/files/modules/core/security.py @@ -0,0 +1,92 @@ +from functools import wraps +from flask import request, Response, url_for, session, redirect +from werkzeug.security import check_password_hash +from modules.core.audit import audit_log +from modules.users.db import get_pool + +# ========================= +# Base authentication +# ========================= + +def authenticate(): + return redirect(url_for("auth.login_page")) + +def get_current_user(): + + email = session.get("user_email") + if not email: + return None + + sql = """ + SELECT id, username, email, user_role, active + FROM app_users + WHERE email = :1 \ + """ + + pool = get_pool() + + with pool.acquire() as conn: + with conn.cursor() as cur: + cur.execute(sql, [email]) + row = cur.fetchone() + + if not row: + return None + + return { + "id": row[0], + "username": row[1], + "email": row[2], + "role": row[3], + "active": row[4] + } + + +# ========================= +# Decorators +# ========================= + +def requires_login(f): + @wraps(f) + def wrapper(*args, **kwargs): + user = get_current_user() + if not user: + return authenticate() + return f(*args, **kwargs) + return wrapper + + +def requires_app_auth(f): + @wraps(f) + def wrapper(*args, **kwargs): + user = get_current_user() + + if not 
user:
+            return authenticate()
+
+        role = (user.get("role") or "").strip().lower()
+
+        if role not in ["user", "admin"]:
+            return authenticate()
+
+        audit_log("LOGIN_SUCCESS", f"user={user}")
+
+        return f(*args, **kwargs)
+    return wrapper
+
+
+def requires_admin_auth(f):
+    @wraps(f)
+    def wrapper(*args, **kwargs):
+        user = get_current_user()
+
+        if not user:
+            return authenticate()
+
+        if user.get("role") != "admin":
+            return authenticate()
+
+        audit_log("LOGIN_ADMIN_SUCCESS", f"user={user}")
+
+        return f(*args, **kwargs)
+    return wrapper
\ No newline at end of file
diff --git a/files/modules/excel/queue_manager.py b/files/modules/excel/queue_manager.py
new file mode 100644
index 0000000..632bb74
--- /dev/null
+++ b/files/modules/excel/queue_manager.py
@@ -0,0 +1,113 @@
+from queue import Queue
+import threading
+import logging
+
+logger = logging.getLogger(__name__)
+
+# =========================================
+# CONFIG
+# =========================================
+
+MAX_CONCURRENT_EXCEL = 10
+
+# =========================================
+# STATE
+# =========================================
+
+EXCEL_QUEUE = Queue()
+EXCEL_LOCK = threading.Lock()
+
+ACTIVE_JOBS = set()  # jobs currently running
+
+# =========================================
+# Helpers
+# =========================================
+
+def get_queue_position(job_id: str) -> int:
+    """
+    Returns:
+        0    = already running
+        1..N = position in the queue
+        -1   = not found
+    """
+    with EXCEL_LOCK:
+
+        if job_id in ACTIVE_JOBS:
+            return 0
+
+        items = list(EXCEL_QUEUE.queue)
+
+        for i, item in enumerate(items):
+            if item["job_id"] == job_id:
+                return i + 1
+
+        return -1
+
+
+# =========================================
+# WORKER
+# =========================================
+
+def _worker(worker_id: int):
+    logger.info(f"🟢 Excel worker-{worker_id} started")
+
+    while True:
+        job = EXCEL_QUEUE.get()
+
+        job_id = job["job_id"]
+
+        try:
+            with EXCEL_LOCK:
+                ACTIVE_JOBS.add(job_id)
+
+            logger.info(f"🚀 
[worker-{worker_id}] Processing {job_id}")
+
+            job["fn"](*job["args"], **job["kwargs"])
+
+            logger.info(f"✅ [worker-{worker_id}] Finished {job_id}")
+
+        except Exception as e:
+            logger.exception(f"❌ [worker-{worker_id}] Failed {job_id}: {e}")
+
+        finally:
+            with EXCEL_LOCK:
+                ACTIVE_JOBS.discard(job_id)
+
+            EXCEL_QUEUE.task_done()
+
+
+# =========================================
+# START POOL
+# =========================================
+
+def start_excel_worker():
+    """
+    Start a pool of N concurrent workers.
+    """
+    for i in range(MAX_CONCURRENT_EXCEL):
+        threading.Thread(
+            target=_worker,
+            args=(i + 1,),
+            daemon=True
+        ).start()
+
+    logger.info(f"🔥 Excel queue started with {MAX_CONCURRENT_EXCEL} workers")
+
+
+# =========================================
+# ENQUEUE
+# =========================================
+
+def enqueue_excel_job(job_id, fn, *args, **kwargs):
+    job = {
+        "job_id": job_id,
+        "fn": fn,
+        "args": args,
+        "kwargs": kwargs
+    }
+
+    with EXCEL_LOCK:
+        EXCEL_QUEUE.put(job)
+        position = EXCEL_QUEUE.qsize()
+
+    return position
\ No newline at end of file
diff --git a/files/modules/excel/routes.py b/files/modules/excel/routes.py
new file mode 100644
index 0000000..a19a234
--- /dev/null
+++ b/files/modules/excel/routes.py
@@ -0,0 +1,110 @@
+from flask import Blueprint, request, jsonify, send_file, render_template
+from pathlib import Path
+import uuid
+import json
+from config_loader import load_config
+from modules.core.audit import audit_log
+from modules.core.security import get_current_user
+
+from modules.core.security import requires_app_auth
+from .service import start_excel_job
+from .store import EXCEL_JOBS, EXCEL_LOCK
+
+excel_bp = Blueprint("excel", __name__)
+config = load_config()
+API_BASE_URL = f"{config.app_base}:{config.service_port}"
+
+UPLOAD_FOLDER = Path("./uploads")
+UPLOAD_FOLDER.mkdir(exist_ok=True)
+
+ALLOWED_EXTENSIONS = {"xlsx"}
+API_URL = API_BASE_URL + "/chat"
+
+
+def allowed_file(filename):
+    return "." 
in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS
+
+
+# =========================
+# Upload + start processing
+# =========================
+@excel_bp.route("/upload/excel", methods=["POST"])
+@requires_app_auth
+def upload_excel():
+    file = request.files.get("file")
+    email = request.form.get("email")
+
+    if not file or not email:
+        return jsonify({"error": "file and email required"}), 400
+
+    if not allowed_file(file.filename):
+        return jsonify({"error": "invalid file type"}), 400
+
+    job_id = str(uuid.uuid4())
+    audit_log("UPLOAD_EXCEL", f"job_id={job_id}")
+
+    job_dir = UPLOAD_FOLDER / job_id
+    job_dir.mkdir(parents=True, exist_ok=True)
+
+    input_path = job_dir / "input.xlsx"
+    file.save(input_path)
+
+    with EXCEL_LOCK:
+        EXCEL_JOBS[job_id] = {"status": "RUNNING"}
+
+    user = get_current_user()
+
+    start_excel_job(
+        job_id=job_id,
+        input_path=input_path,
+        email=email,
+        auth_user=None,
+        auth_pass=None,
+        api_url=API_URL
+    )
+
+    return jsonify({"status": "STARTED", "job_id": job_id})
+
+
+# =========================
+# Status
+# =========================
+@excel_bp.route("/job/<job_id>/status")
+@requires_app_auth
+def job_status(job_id):
+    status_file = UPLOAD_FOLDER / job_id / "status.json"
+
+    if not status_file.exists():
+        return jsonify({"status": "PROCESSING"})
+
+    return jsonify(json.loads(status_file.read_text()))
+
+
+# =========================
+# Download result
+# =========================
+@excel_bp.route("/download/<job_id>")
+@requires_app_auth
+def download(job_id):
+    result_file = UPLOAD_FOLDER / job_id / "result.xlsx"
+
+    if not result_file.exists():
+        return jsonify({"error": "not ready"}), 404
+
+    return send_file(
+        result_file,
+        as_attachment=True,
+        download_name=f"RFP_result_{job_id}.xlsx"
+    )
+
+@excel_bp.route("/job/<job_id>/logs", methods=["GET"])
+@requires_app_auth
+def excel_logs(job_id):
+    with EXCEL_LOCK:
+        job = EXCEL_JOBS.get(job_id, {})
+    return jsonify({"logs": job.get("logs", [])})
+
+@excel_bp.route("/excel/job/<job_id>")
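+# --- Illustrative client sketch (comments only, not part of this module).
+# --- It shows how a caller might drive the upload/status/download endpoints
+# --- above; the host, e-mail address, and file name are assumptions for the
+# --- example, and an already-authenticated session is assumed because the
+# --- routes are guarded by requires_app_auth.
+#
+#   import requests
+#   session = requests.Session()  # assumed to hold a logged-in session cookie
+#   with open("rfp_questions.xlsx", "rb") as f:
+#       resp = session.post("http://localhost:8100/upload/excel",
+#                           files={"file": ("rfp_questions.xlsx", f)},
+#                           data={"email": "user@example.com"})
+#   job_id = resp.json()["job_id"]
+#   status = session.get(f"http://localhost:8100/job/{job_id}/status").json()
+#   # once status["status"] == "DONE", fetch the result from the download route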
+@requires_app_auth
+def job_page(job_id):
+    return render_template("excel/job_status.html", job_id=job_id)
\ No newline at end of file
diff --git a/files/modules/excel/service.py b/files/modules/excel/service.py
new file mode 100644
index 0000000..51d02f2
--- /dev/null
+++ b/files/modules/excel/service.py
@@ -0,0 +1,115 @@
+import threading
+import json
+from pathlib import Path
+from datetime import datetime
+from flask import jsonify
+from .storage import upload_file, generate_download_url
+
+from rfp_process import process_excel_rfp
+from .store import EXCEL_JOBS, EXCEL_LOCK
+from modules.users.email_service import send_completion_email
+from modules.excel.queue_manager import enqueue_excel_job
+
+EXECUTION_METHOD = "QUEUE"  # THREAD OR QUEUE
+
+UPLOAD_FOLDER = Path("uploads")
+UPLOAD_FOLDER.mkdir(exist_ok=True)
+
+
+def make_job_logger(job_id: str):
+    """Simple logger: keeps logs in memory (same approach as the architect module)."""
+    def _log(msg):
+        with EXCEL_LOCK:
+            job = EXCEL_JOBS.get(job_id)
+            if job is not None:
+                job.setdefault("logs", []).append(str(msg))
+    return _log
+
+
+def start_excel_job(job_id: str, input_path: Path, email: str, auth_user: str, auth_pass: str, api_url: str):
+
+    job_dir = UPLOAD_FOLDER / job_id
+    job_dir.mkdir(parents=True, exist_ok=True)
+
+    output_path = job_dir / "result.xlsx"
+    status_file = job_dir / "status.json"
+    object_name = f"{job_id}/result.xlsx"
+
+    logger = make_job_logger(job_id)
+
+    def write_status(state: str, detail: str | None = None):
+        payload = {
+            "status": state,
+            "updated_at": datetime.utcnow().isoformat(),
+        }
+        if detail:
+            payload["detail"] = detail
+
+        status_file.write_text(
+            json.dumps(payload, ensure_ascii=False, indent=2),
+            encoding="utf-8"
+        )
+
+        with EXCEL_LOCK:
+            job = EXCEL_JOBS.get(job_id)
+            if job is not None:
+                job["status"] = state
+                if detail:
+                    job["detail"] = detail
+
+    # ensure the job structure exists in memory
+    with EXCEL_LOCK:
+        EXCEL_JOBS.setdefault(job_id, {})
+
EXCEL_JOBS[job_id].setdefault("logs", [])
+        EXCEL_JOBS[job_id]["status"] = "PROCESSING"
+
+    write_status("PROCESSING")
+    logger(f"Starting Excel job {job_id}")
+    logger(f"Input: {input_path}")
+    logger(f"Output: {output_path}")
+
+    def background():
+        download_url = None  # stays None if processing fails before upload
+        try:
+            # main processing
+            process_excel_rfp(
+                input_excel=input_path,
+                output_excel=output_path,
+                api_url=api_url,
+                auth_user=auth_user,
+                auth_pass=auth_pass,
+            )
+
+            write_status("DONE")
+            logger("Excel processing DONE")
+
+            upload_file(output_path, object_name)
+            download_url = generate_download_url(object_name)
+
+            write_status("DONE", download_url)
+
+            # email / dev message
+            dev_message = send_completion_email(email, download_url, job_id)
+            if dev_message:
+                logger(f"DEV email message/link: {dev_message}")
+
+        except Exception as e:
+            logger(f"ERROR: {e}")
+            write_status("ERROR", str(e))
+
+            try:
+                # send_completion_email() only accepts (email, download_url, job_id);
+                # download_url may still be None if the failure happened before upload
+                dev_message = send_completion_email(email, download_url, job_id)
+                if dev_message:
+                    logger(f"DEV email error message/link: {dev_message}")
+            except Exception as mail_err:
+                logger(f"EMAIL ERROR: {mail_err}")
+
+    if EXECUTION_METHOD == "THREAD":
+        threading.Thread(target=background, daemon=True).start()
+    else:
+        enqueue_excel_job(job_id, background)
diff --git a/files/modules/excel/storage.py b/files/modules/excel/storage.py
new file mode 100644
index 0000000..cef3817
--- /dev/null
+++ b/files/modules/excel/storage.py
@@ -0,0 +1,67 @@
+import oci
+from datetime import datetime, timedelta
+from config_loader import load_config
+from oci.object_storage.models import CreatePreauthenticatedRequestDetails
+
+config = load_config()
+
+
+oci_config = oci.config.from_file(
+    file_location="~/.oci/config",
+    profile_name=config.bucket_profile
+)
+
+object_storage = oci.object_storage.ObjectStorageClient(oci_config)
+
+
+def _namespace():
+    if config.oci_namespace != "auto":
+        return config.oci_namespace
+
+    return 
object_storage.get_namespace().data + + +# ========================= +# Upload file +# ========================= +def upload_file(local_path: str, object_name: str): + + with open(local_path, "rb") as f: + object_storage.put_object( + namespace_name=_namespace(), + bucket_name=config.oci_bucket, + object_name=object_name, + put_object_body=f + ) + print(f"SUCCESS on Upload {object_name}") + +# ========================= +# Pre-authenticated download URL +# ========================= +def generate_download_url(object_name: str, hours=24): + + expire = datetime.utcnow() + timedelta(hours=hours) + + details = CreatePreauthenticatedRequestDetails( + name=f"job-{object_name}", + access_type="ObjectRead", + object_name=object_name, + time_expires=expire + ) + + response = object_storage.create_preauthenticated_request( + namespace_name=_namespace(), + bucket_name=config.oci_bucket, + create_preauthenticated_request_details=details + ) + + par = response.data + + download_link = ( + f"https://objectstorage.{oci_config['region']}.oraclecloud.com{par.access_uri}" + ) + + print("PAR CREATED OK") + print(download_link) + + return download_link \ No newline at end of file diff --git a/files/modules/excel/store.py b/files/modules/excel/store.py new file mode 100644 index 0000000..cca4ced --- /dev/null +++ b/files/modules/excel/store.py @@ -0,0 +1,4 @@ +from threading import Lock + +EXCEL_JOBS = {} +EXCEL_LOCK = Lock() \ No newline at end of file diff --git a/files/modules/health/routes.py b/files/modules/health/routes.py new file mode 100644 index 0000000..f7396de --- /dev/null +++ b/files/modules/health/routes.py @@ -0,0 +1,8 @@ +from flask import Blueprint, jsonify + +health_bp = Blueprint("health", __name__) + + +@health_bp.route("/health") +def health(): + return jsonify({"status": "UP"}) \ No newline at end of file diff --git a/files/modules/home/routes.py b/files/modules/home/routes.py new file mode 100644 index 0000000..305f512 --- /dev/null +++ 
b/files/modules/home/routes.py
@@ -0,0 +1,16 @@
+from flask import Blueprint, render_template
+from modules.core.security import requires_app_auth
+from config_loader import load_config
+
+home_bp = Blueprint("home", __name__)
+config = load_config()
+API_BASE_URL = f"{config.app_base}:{config.service_port}"
+
+@home_bp.route("/")
+@requires_app_auth
+def index():
+    return render_template(
+        "index.html",
+        api_base_url=API_BASE_URL,
+        config=config
+    )
\ No newline at end of file
diff --git a/files/modules/rest/routes.py b/files/modules/rest/routes.py
new file mode 100644
index 0000000..706b3ad
--- /dev/null
+++ b/files/modules/rest/routes.py
@@ -0,0 +1,29 @@
+from flask import Blueprint, request, jsonify
+from modules.rest.security import rest_auth_required
+from modules.chat.service import answer_question  # reuses the chat answering logic
+
+rest_bp = Blueprint("rest", __name__, url_prefix="/rest")
+
+
+import json
+
+@rest_bp.route("/chat", methods=["POST"])
+@rest_auth_required
+def rest_chat():
+    data = request.get_json(force=True) or {}
+
+    question = (data.get("question") or "").strip()
+    if not question:
+        return jsonify({"error": "question required"}), 400
+
+    raw_result = answer_question(question)
+
+    try:
+        parsed = json.loads(raw_result)
+    except Exception:
+        return jsonify({
+            "error": "invalid LLM response",
+            "raw": raw_result
+        }), 500
+
+    # jsonify sets the application/json content type (json.dumps would not)
+    return jsonify(parsed)
\ No newline at end of file
diff --git a/files/modules/rest/security.py b/files/modules/rest/security.py
new file mode 100644
index 0000000..e6a3f66
--- /dev/null
+++ b/files/modules/rest/security.py
@@ -0,0 +1,30 @@
+import base64
+from flask import request, jsonify
+from functools import wraps
+from modules.users.service import authenticate_user
+
+
+def rest_auth_required(f):
+    @wraps(f)
+    def wrapper(*args, **kwargs):
+        auth = request.headers.get("Authorization")
+
+        if not auth or not auth.startswith("Basic "):
+            return jsonify({"error": "authorization required"}), 401
+
+        try:
+            decoded = 
base64.b64decode(auth.split(" ")[1]).decode()
+            username, password = decoded.split(":", 1)
+        except Exception:
+            return jsonify({"error": "invalid authorization header"}), 401
+
+        user = authenticate_user(username, password)
+        if not user:
+            return jsonify({"error": "invalid credentials"}), 401
+
+        # optional: pass the authenticated user downstream
+        request.rest_user = user
+
+        return f(*args, **kwargs)
+
+    return wrapper
\ No newline at end of file
diff --git a/files/modules/users/__init__.py b/files/modules/users/__init__.py
new file mode 100644
index 0000000..a4b7f91
--- /dev/null
+++ b/files/modules/users/__init__.py
@@ -0,0 +1,4 @@
+from .routes import users_bp
+from .model import db
+
+__all__ = ["users_bp", "db"]
\ No newline at end of file
diff --git a/files/modules/users/db.py b/files/modules/users/db.py
new file mode 100644
index 0000000..90eaebc
--- /dev/null
+++ b/files/modules/users/db.py
@@ -0,0 +1,50 @@
+from pathlib import Path
+import os
+import re
+import oracledb
+import json
+import base64
+import hashlib
+from datetime import datetime
+import requests
+import textwrap
+import unicodedata
+from typing import Optional
+from collections import deque
+from config_loader import load_config
+
+def chunk_hash(text: str) -> str:
+    return hashlib.sha256(text.encode("utf-8")).hexdigest()
+
+config = load_config()
+
+# =========================
+# Oracle Autonomous Configuration
+# =========================
+WALLET_PATH = config.wallet_path
+DB_ALIAS = config.db_alias
+USERNAME = config.username
+PASSWORD = config.password
+os.environ["TNS_ADMIN"] = WALLET_PATH
+
+_pool = None
+
+def get_pool():
+    global _pool
+
+    if _pool:
+        return _pool
+
+    _pool = oracledb.create_pool(
+        user=USERNAME,
+        password=PASSWORD,
+        dsn=DB_ALIAS,
+        config_dir=WALLET_PATH,
+        wallet_location=WALLET_PATH,
+        wallet_password=PASSWORD,
+        min=2,
+        max=8,
+        increment=1
+    )
+
+    return _pool
diff --git a/files/modules/users/email_service.py b/files/modules/users/email_service.py
new file mode 100644
index 0000000..73248f8
--- /dev/null
+++ b/files/modules/users/email_service.py
@@ -0,0 +1,72 @@
+import os
+import smtplib
+from email.message import EmailMessage
+from flask import current_app
+from config_loader import load_config
+
+config = load_config()
+API_BASE_URL = f"{config.app_base}:{config.service_port}"
+
+def send_user_created_email(email, link, name=""):
+    """
+    DEV  -> return link only
+    PROD -> send real email
+    """
+
+    if config.dev_mode == 1:
+        return link  # DEV: just return the link
+
+    host = os.getenv("RFP_SMTP_HOST", "localhost")
+    port = int(os.getenv("RFP_SMTP_PORT", 25))
+
+    msg = EmailMessage()
+    msg["Subject"] = "Your account has been created"
+    msg["From"] = "noreply@rfp.local"
+    msg["To"] = email
+
+    msg.set_content(f"""
+    Hello {name or email},
+
+    Your account was created.
+
+    Set your password here:
+    {link}
+    """)
+
+    with smtplib.SMTP(host, port) as s:
+        s.send_message(msg)
+
+    return link
+
+def send_completion_email(email, download_url, job_id):
+    """
+    DEV  -> return download link
+    PROD -> send real email
+    """
+
+    if config.dev_mode == 1:
+        return download_url  # DEV: just return the link
+
+    host = os.getenv("RFP_SMTP_HOST", "localhost")
+    port = int(os.getenv("RFP_SMTP_PORT", 25))
+
+    msg = EmailMessage()
+    msg["Subject"] = "Your RFP processing is complete"
+    msg["From"] = "noreply@rfp.local"
+    msg["To"] = email
+
+    msg.set_content(f"""
+Hello,
+
+Your RFP Excel processing has finished successfully. 
+
+Download your file here:
+{download_url}
+
+Job ID: {job_id}
+""")
+
+    with smtplib.SMTP(host, port) as s:
+        s.send_message(msg)
+
+    return None
\ No newline at end of file
diff --git a/files/modules/users/model.py b/files/modules/users/model.py
new file mode 100644
index 0000000..108c114
--- /dev/null
+++ b/files/modules/users/model.py
@@ -0,0 +1,27 @@
+from datetime import datetime
+from flask_sqlalchemy import SQLAlchemy
+
+db = SQLAlchemy()
+
+
+class User(db.Model):
+    __tablename__ = "users"
+
+    id = db.Column(db.Integer, primary_key=True)
+
+    name = db.Column(db.String(120), nullable=False)
+    email = db.Column(db.String(160), unique=True, nullable=False, index=True)
+
+    role = db.Column(db.String(50), default="app")  # app | admin
+    active = db.Column(db.Boolean, default=True)
+
+    password_hash = db.Column(db.String(255))
+    must_change_password = db.Column(db.Boolean, default=True)
+
+    reset_token = db.Column(db.String(255))
+    reset_expire = db.Column(db.DateTime)
+
+    created_at = db.Column(db.DateTime, default=datetime.utcnow)
+
+    def __repr__(self):
+        return f"<User {self.email}>"
\ No newline at end of file
diff --git a/files/modules/users/routes.py b/files/modules/users/routes.py
new file mode 100644
index 0000000..99c7e0b
--- /dev/null
+++ b/files/modules/users/routes.py
@@ -0,0 +1,157 @@
+from flask import Blueprint, render_template, request, redirect, url_for, flash
+from modules.core.security import requires_admin_auth
+
+from .service import (
+    signup_user,
+    list_users as svc_list_users,
+    create_user,
+    update_user,
+    delete_user as svc_delete_user,
+    get_user_by_token,
+    set_password_service
+)
+
+from .token_service import generate_token, expiration, is_expired
+from .email_service import send_user_created_email
+from config_loader import load_config
+
+users_bp = Blueprint(
+    "users",
+    __name__,
+    template_folder="../../templates/users"
+)
+
+config = load_config()
+
+
+# =========================
+# LIST USERS (Oracle)
+# =========================
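+# --- Illustrative flow sketch (comments only, not part of this module).
+# --- How the admin-driven password setup in the routes below fits together;
+# --- the /admin/users prefix matches the link built in signup_user(), but
+# --- the blueprint registration itself lives outside this file, so treat the
+# --- exact URLs as assumptions:
+#   1. POST /admin/users/new            (admin session) -> create_user() stores a reset token
+#   2. the user receives a /set-password/<token> link by email (or a DEV-mode flash)
+#   3. GET  /set-password/<token>       -> renders the form while the token is unexpired
+#   4. POST /set-password/<token>       -> set_password_service() stores the password hash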
+@users_bp.route("/")
+@requires_admin_auth
+def list_users():
+    users = svc_list_users()
+    return render_template("list.html", users=users)
+
+
+# =========================
+# PUBLIC SIGNUP (Oracle)
+# =========================
+@users_bp.route("/signup", methods=["GET", "POST"])
+def signup():
+
+    if request.method == "POST":
+        email = request.form.get("email", "").strip()
+        name = request.form.get("name", "").strip()
+
+        try:
+            link = signup_user(email=email, name=name)
+        except Exception as e:
+            flash(str(e), "danger")
+            return render_template("users/signup.html")
+
+        if link and config.dev_mode == 1:
+            flash(f"DEV MODE: password link → {link}", "success")
+        else:
+            flash("User created and email sent", "success")
+
+        return redirect(url_for("users.signup"))
+
+    return render_template("users/signup.html")
+
+
+# =========================
+# CREATE USER (Oracle)
+# =========================
+@users_bp.route("/new", methods=["GET", "POST"])
+@requires_admin_auth
+def new_user():
+
+    if request.method == "POST":
+
+        token = generate_token()
+
+        create_user(
+            name=request.form["name"],
+            email=request.form["email"],
+            role=request.form["role"],
+            active="active" in request.form,
+            token=token
+        )
+
+        link = url_for("users.set_password", token=token, _external=True)
+
+        dev_link = send_user_created_email(
+            request.form["email"],
+            link,
+            request.form["name"]
+        )
+
+        flash("User created and email sent", "success")
+        return redirect(url_for("users.list_users"))
+
+    return render_template("form.html", user=None)
+
+
+# =========================
+# EDIT USER (Oracle)
+# =========================
+@users_bp.route("/edit/<int:user_id>", methods=["GET", "POST"])
+@requires_admin_auth
+def edit_user(user_id):
+
+    if request.method == "POST":
+        update_user(
+            user_id=user_id,
+            name=request.form["name"],
+            email=request.form["email"],
+            role=request.form["role"],
+            active="active" in request.form
+        )
+
+        return redirect(url_for("users.list_users"))
+
+    # fetch the full list and filter it (simple, and works well here)
+    users = svc_list_users()
+    user = next((u for u in users if u["id"] == user_id), None)
+
+    return render_template("form.html", user=user)
+
+
+# =========================
+# DELETE USER (Oracle)
+# =========================
+@users_bp.route("/delete/<int:user_id>")
+@requires_admin_auth
+def delete_user(user_id):
+
+    svc_delete_user(user_id)
+    return redirect(url_for("users.list_users"))
+
+
+# =========================
+# SET PASSWORD (Oracle)
+# =========================
+@users_bp.route("/set-password/<token>", methods=["GET", "POST"])
+def set_password(token):
+
+    user = get_user_by_token(token)
+
+    if not user or is_expired(user["expire"]):
+        return render_template("set_password.html", expired=True)
+
+    if request.method == "POST":
+
+        pwd = request.form["password"]
+        pwd2 = request.form["password2"]
+
+        if pwd != pwd2:
+            flash("Passwords do not match")
+            return render_template("set_password.html", expired=False)
+
+        set_password_service(user["id"], pwd)
+
+        flash("Password updated successfully")
+        return redirect("/")
+
+    return render_template("set_password.html", expired=False)
\ No newline at end of file
diff --git a/files/modules/users/service.py b/files/modules/users/service.py
new file mode 100644
index 0000000..d06af9c
--- /dev/null
+++ b/files/modules/users/service.py
@@ -0,0 +1,204 @@
+#from .model import db, User
+from .token_service import generate_token, expiration
+from .email_service import send_user_created_email
+from config_loader import load_config
+from .db import get_pool
+import bcrypt
+from werkzeug.security import generate_password_hash, check_password_hash
+
+config = load_config()
+
+def authenticate_user(username: str, password: str):
+
+    print("LOGIN TRY:", username)  # never log the password itself
+
+    sql = """
+          SELECT password_hash
+          FROM app_users
+          WHERE email = :1 \
+          """
+
+    pool = get_pool()
+
+    with pool.acquire() as conn:
+        with conn.cursor() as cur:
+            cur.execute(sql, [username])
+            row = cur.fetchone()
+
+    # print("ROW:", row)
+
+    if 
not row:
+        # print("USER NOT FOUND")
+        return False
+
+    stored_hash = row[0]
+    # print("HASH:", stored_hash)
+
+    ok = check_password_hash(stored_hash, password)
+
+    # print("MATCH:", ok)
+
+    return ok
+
+# NOTE: this two-argument create_user is shadowed by the five-argument
+# create_user defined later in this module; the later definition is the
+# one bound at import time.
+def create_user(username: str, password: str):
+
+    hashed = bcrypt.hashpw(password.encode(), bcrypt.gensalt()).decode()
+
+    sql = """
+          INSERT INTO app_users (username, password_hash)
+          VALUES (:1, :2) \
+          """
+
+    pool = get_pool()
+
+    with pool.acquire() as conn:
+        with conn.cursor() as cur:
+            cur.execute(sql, [username, hashed])
+            conn.commit()
+
+def _default_name(email: str) -> str:
+    return (email or "").split("@")[0]
+
+
+def signup_user(email: str, name: str = ""):
+
+    if not email:
+        raise ValueError("Email required")
+
+    email = email.lower().strip()
+    name = name or email.split("@")[0]
+
+    token = generate_token()
+
+    pool = get_pool()
+
+    sql_check = """
+                SELECT id
+                FROM app_users
+                WHERE email = :1 \
+                """
+
+    sql_insert = """
+                 INSERT INTO app_users
+                 (name,email,user_role,active,reset_token,reset_expire,must_change_password)
+                 VALUES (:1,:2,'user',1,:3,:4,1) \
+                 """
+
+    sql_update = """
+                 UPDATE app_users
+                 SET reset_token=:1,
+                     reset_expire=:2,
+                     must_change_password=1
+                 WHERE email=:3 \
+                 """
+
+    with pool.acquire() as conn:
+        with conn.cursor() as cur:
+
+            cur.execute(sql_check, [email])
+            row = cur.fetchone()
+
+            if not row:
+                cur.execute(sql_insert, [name, email, token, expiration()])
+            else:
+                cur.execute(sql_update, [token, expiration(), email])
+
+        conn.commit()
+
+    link = f"{config.app_base}:{config.service_port}/admin/users/set-password/{token}"
+
+    dev_link = send_user_created_email(email, link, name)
+
+    return dev_link or link
+
+def list_users():
+    sql = """
+          SELECT id, name, email, user_role, active
+          FROM app_users
+          ORDER BY name \
+          """
+
+    pool = get_pool()
+
+    with pool.acquire() as conn:
+        with conn.cursor() as cur:
+            cur.execute(sql)
+            cols = [c[0].lower() for c in cur.description]
+            return [dict(zip(cols, r)) for r in 
cur.fetchall()] + +def create_user(name, email, role, active, token): + sql = """ + INSERT INTO app_users + (name,email,user_role,active,reset_token,reset_expire,must_change_password) + VALUES (:1,:2,:3,:4,:5,SYSTIMESTAMP + INTERVAL '1' DAY,1) \ + """ + + pool = get_pool() + + with pool.acquire() as conn: + with conn.cursor() as cur: + cur.execute(sql, [name, email, role, active, token]) + conn.commit() + +def update_user(user_id, name, email, role, active): + sql = """ + UPDATE app_users + SET name=:1, email=:2, user_role=:3, active=:4 + WHERE id=:5 \ + """ + + pool = get_pool() + + with pool.acquire() as conn: + with conn.cursor() as cur: + cur.execute(sql, [name, email, role, active, user_id]) + conn.commit() + +def delete_user(user_id): + sql = "DELETE FROM app_users WHERE id=:1" + + pool = get_pool() + + with pool.acquire() as conn: + with conn.cursor() as cur: + cur.execute(sql, [user_id]) + conn.commit() + +def get_user_by_token(token): + sql = """ + SELECT id, reset_expire + FROM app_users + WHERE reset_token=:1 \ + """ + + pool = get_pool() + + with pool.acquire() as conn: + with conn.cursor() as cur: + cur.execute(sql, [token]) + row = cur.fetchone() + + if not row: + return None + + return {"id": row[0], "expire": row[1]} + +def set_password_service(user_id, pwd): + hashed = generate_password_hash(pwd) + + sql = """ + UPDATE app_users + SET password_hash=:1, + must_change_password=0, + reset_token=NULL, + reset_expire=NULL + WHERE id=:2 \ + """ + + pool = get_pool() + + with pool.acquire() as conn: + with conn.cursor() as cur: + cur.execute(sql, [hashed, user_id]) + conn.commit() + diff --git a/files/modules/users/token_service.py b/files/modules/users/token_service.py new file mode 100644 index 0000000..284b26a --- /dev/null +++ b/files/modules/users/token_service.py @@ -0,0 +1,14 @@ +import secrets +from datetime import datetime, timedelta + + +def generate_token(): + return secrets.token_urlsafe(48) + + +def expiration(hours=24): + return 
datetime.utcnow() + timedelta(hours=hours) + + +def is_expired(expire_dt): + return not expire_dt or expire_dt < datetime.utcnow() \ No newline at end of file diff --git a/files/oci_genai_llm_graphrag_rerank_rfp.py b/files/oci_genai_llm_graphrag_rerank_rfp.py new file mode 100644 index 0000000..9a9ef93 --- /dev/null +++ b/files/oci_genai_llm_graphrag_rerank_rfp.py @@ -0,0 +1,3095 @@ +from langchain_community.chat_models.oci_generative_ai import ChatOCIGenAI +from langchain_core.prompts import PromptTemplate +from langchain.schema.output_parser import StrOutputParser +from langchain_community.embeddings import OCIGenAIEmbeddings +from langchain_community.vectorstores import FAISS +from langchain.schema.runnable import RunnableMap +from langchain_community.document_loaders import UnstructuredPDFLoader, PyMuPDFLoader +from langchain_core.documents import Document +from langchain_core.runnables import RunnableLambda +from pathlib import Path +from tqdm import tqdm +import os +import pickle +import re +import atexit +import oracledb +import json +import base64 +import hashlib +from datetime import datetime +import requests +import textwrap +import unicodedata +from typing import Optional +from collections import deque +from langchain.callbacks.base import BaseCallbackHandler +from langdetect import detect +from config_loader import load_config +from concurrent.futures import ThreadPoolExecutor, as_completed +import threading + +def chunk_hash(text: str) -> str: + return hashlib.sha256(text.encode("utf-8")).hexdigest() + +config = load_config() + +# ========================= +# Oracle Autonomous Configuration +# ========================= +WALLET_PATH = config.wallet_path +DB_ALIAS = config.db_alias +USERNAME = config.username +PASSWORD = config.password +os.environ["TNS_ADMIN"] = WALLET_PATH + +# ========================= +# Global Configurations +# ========================= +INDEX_PATH = config.index_path +PROCESSED_DOCS_FILE = os.path.join(INDEX_PATH, 
"processed_docs.pkl")
+chapter_separator_regex = r"^(#{1,6} .+|\*\*.+\*\*)$"
+GRAPH_NAME = config.graph_name
+LOG_BUFFER = deque(maxlen=500)
+MAX_ATTEMPTS = 3
+GENAI_MAX_CONCURRENT = 1000
+GENAI_SEMAPHORE = threading.Semaphore(GENAI_MAX_CONCURRENT)
+
+def call_llm(fn, *args, **kwargs):
+    with GENAI_SEMAPHORE:
+        return fn(*args, **kwargs)
+
+# =========================
+# LLM Definitions
+# =========================
+
+llm = ChatOCIGenAI(
+    model_id=config.llm_model,
+    service_endpoint=config.service_endpoint,
+    compartment_id=config.compartment_id,
+    auth_profile=config.auth_profile,
+    model_kwargs={"temperature": 0, "top_p": 1, "max_tokens": 4000},
+)
+
+llm_for_rag = ChatOCIGenAI(
+    model_id=config.llm_model,
+    service_endpoint=config.service_endpoint,
+    compartment_id=config.compartment_id,
+    auth_profile=config.auth_profile,
+)
+
+lrm_for_architect = ChatOCIGenAI(
+    model_id=config.llm_model,
+    service_endpoint=config.service_endpoint,
+    compartment_id=config.compartment_id,
+    auth_profile=config.auth_profile,
+)
+
+embeddings = OCIGenAIEmbeddings(
+    model_id=config.embedding_model,
+    service_endpoint=config.service_endpoint,
+    compartment_id=config.compartment_id,
+    auth_profile=config.auth_profile,
+)
+
+oracle_conn = oracledb.connect(
+    user=USERNAME,
+    password=PASSWORD,
+    dsn=DB_ALIAS,
+    config_dir=WALLET_PATH,
+    wallet_location=WALLET_PATH,
+    wallet_password=PASSWORD
+)
+atexit.register(lambda: oracle_conn.close())
+
+def filename_to_url(filename: str, suffix: str = ".pdf") -> str:
+    if filename.endswith(suffix):
+        filename = filename[: -len(suffix)]
+    decoded = base64.urlsafe_b64decode(filename.encode("ascii"))
+    return decoded.decode("utf-8")
+
+def default_logger(msg):
+    print(msg)
+
+# =========================================
+# ARCHITECTURE-SPECIFIC SOURCE RANKING
+# =========================================
+
+class BrowserLogCallback(BaseCallbackHandler):
+
+    def __init__(self, logger):
+        self.log = logger
+
+    # ---------- CHAIN ----------
+    def
on_chain_start(self, serialized, inputs, **kwargs):
+        self.log("🔵 Chain started")
+
+    def on_chain_end(self, outputs, **kwargs):
+        self.log("🟢 Chain finished")
+
+    # ---------- LLM ----------
+    def on_llm_start(self, serialized, prompts, **kwargs):
+        self.log("🤖 LLM call started")
+        self.log(f"📝 Prompt size: {len(prompts[0])} chars")
+
+    def on_llm_end(self, response, **kwargs):
+        self.log("✅ LLM response received")
+
+    # ---------- RETRIEVER ----------
+    def on_retriever_start(self, serialized, query, **kwargs):
+        self.log(f"🔍 Searching vector store: {query}")
+
+    def on_retriever_end(self, documents, **kwargs):
+        self.log(f"📚 Retrieved {len(documents)} chunks")
+
+    # ---------- ERRORS ----------
+    def on_chain_error(self, error, **kwargs):
+        self.log(f"❌ ERROR: {error}")
+
+ARCH_GOOD_HINTS = [
+    "overview",
+    "concept",
+    "architecture",
+    "service",
+    "use-case"
+]
+
+ARCH_BAD_HINTS = [
+    "home",
+    "index",
+    "portal",
+    "release-notes",
+    "troubleshoot",
+    "known-issues"
+]
+
+
+def score_arch_url(url: str) -> int:
+    if not url:
+        return 0
+
+    u = url.lower()
+    score = 0
+
+    for g in ARCH_GOOD_HINTS:
+        if g in u:
+            score += 3
+
+    for b in ARCH_BAD_HINTS:
+        if b in u:
+            score -= 5
+
+    if "docs.oracle.com" in u:
+        score += 2
+
+    return score
+
+
+def resolve_arch_source(doc):
+    """
+    Same as resolve_chunk_source, but with architecture-specific ranking.
+    Does NOT affect the rest of the pipeline.
+    """
+    text = doc.page_content or ""
+    md = doc.metadata or {}
+
+    candidates = []
+
+    candidates += URL_REGEX.findall(text)
+
+    if md.get("reference"):
+        candidates.append(md["reference"])
+
+    if md.get("source"):
+        candidates.append(md["source"])
+
+    if not candidates:
+        return "Oracle Cloud Infrastructure documentation"
+
+    candidates = list(set(candidates))
+    candidates.sort(key=score_arch_url, reverse=True)
+
+    return candidates[0]
+
+def strip_accents(s: str) -> str:
+    return ''.join(
+        c for c in unicodedata.normalize('NFD', s)
+        if unicodedata.category(c) != 'Mn'
+    )
+
+def normalize_lang(code: str) -> str:
+    mapping = {
+        "pt": "Portuguese",
+        "en": "English",
+        "es": "Spanish",
+        "fr": "French",
+        "de": "German",
+        "it": "Italian"
+    }
+    return mapping.get(code, "English")
+
+# =========================
+# SOURCE VALIDATION (POST LLM)
+# =========================
+
+INVALID_SOURCE_TOKEN = "---------"
+URL_TIMEOUT = 3
+
+_url_cache = {}
+
+def url_exists(url: str) -> bool:
+
+    if not url or not url.startswith("http"):
+        return False
+
+    if url in _url_cache:
+        return _url_cache[url]
+
+    try:
+        r = requests.get(
+            url,
+            timeout=URL_TIMEOUT,
+            allow_redirects=True,
+            headers={"User-Agent": "Mozilla/5.0"}
+        )
+
+        if r.status_code >= 400:
+            _url_cache[url] = False
+            return False
+
+        html = (r.text or "").lower()
+
+        # ====================================
+        # 🔥 ORACLE SOFT-404 TEMPLATE DETECTION
+        # ====================================
+        soft_404_patterns = [
+            "page not found",
+            "that page is not available",
+            "class=\"page-not-found\"",
+            "redwood-light-404.css",
+            "error-container"
+        ]
+
+        if any(p in html for p in soft_404_patterns):
+            _url_cache[url] = False
+            return False
+
+        # ====================================
+        # 🔥 minimal REAL content (without menus)
+        # ====================================
+
+        # lazy import: only needed once a page has passed the soft-404 check
+        from bs4 import BeautifulSoup
+
+        soup = BeautifulSoup(html, "html.parser")
+
+        for tag in soup(["script", "style", "nav", "footer", "header"]):
tag.decompose()
+
+        visible_text = soup.get_text(" ", strip=True)
+
+        # if almost no text remains once layout is removed => treat as a stub
+        # if len(visible_text) < 300:
+        #     _url_cache[url] = False
+        #     return False
+
+        _url_cache[url] = True
+        return True
+
+    except Exception:
+        _url_cache[url] = False
+        return False
+
+def validate_and_sanitize_sources(answer: dict) -> dict:
+    if not isinstance(answer, dict):
+        return {"answer": "NO", "confidence": "LOW", "ambiguity_detected": True,
+                "confidence_reason": "Invalid answer type", "justification": "", "evidence": []}
+
+    # shallow copy of the top level + copy of the evidence list
+    out = dict(answer)
+    evidences = out.get("evidence", [])
+
+    if not isinstance(evidences, list):
+        return out
+
+    new_evidences = []
+    for ev in evidences:
+        if not isinstance(ev, dict):
+            new_evidences.append(ev)
+            continue
+
+        ev2 = dict(ev)
+        src = ev2.get("source")
+
+        if isinstance(src, list):
+            ev2["source"] = [
+                s if (isinstance(s, str) and url_exists(s)) else INVALID_SOURCE_TOKEN
+                for s in src
+            ]
+        else:
+            ev2["source"] = (
+                src if (isinstance(src, str) and url_exists(src)) else INVALID_SOURCE_TOKEN
+            )
+
+        if ev2["source"] == INVALID_SOURCE_TOKEN:
+            print(src)
+
+        new_evidences.append(ev2)
+
+    out["evidence"] = new_evidences
+    return out
+
+# =========================
+# LRM Definitions
+# =========================
+def build_architecture_evidence(docs, max_chunks=30):
+    ranked = sorted(
+        docs,
+        key=lambda d: score_arch_url(resolve_arch_source(d)),
+        reverse=True
+    )
+
+    evidence = []
+
+    for d in ranked[:max_chunks]:
+        quote = d.page_content[:3000]
+        quote = re.sub(r"Reference:\s*\S+", "", quote)
+
+        evidence.append({
+            "quote": quote,
+            "source": resolve_arch_source(d)
+        })
+
+    return evidence
+
+def enforce_architecture_sources(plan: dict, evidence: list[dict]) -> dict:
+    if not evidence:
+        return plan
+
+    ev_list = [e for e in evidence if e.get("source")]
+    valid_sources = {
+        str(e.get("source", ""))
+        for e in ev_list
+        if e.get("source")
+    }
+ + def pick_best_source(service: str) -> dict | None: + if not service: + return None + + s = service.lower() + + best = None + best_score = -1 + + service_terms = [t for t in re.findall(r"[a-z0-9]+", s) if len(t) >= 3] + + for e in ev_list: + hay = (e.get("quote", "") + " " + e.get("source", "")).lower() + score = sum(1 for t in service_terms if t in hay) + + if score > best_score: + best_score = score + best = e + + return best if best_score > 0 else None + + for d in plan.get("decisions", []): + service = d.get("service", "") + ev = d.get("evidence", {}) or {} + + if ev.get("source") in valid_sources: + continue + + best_ev = pick_best_source(service) + if best_ev: + d["evidence"] = { + "quote": best_ev.get("quote", ev.get("quote", "")), + "source": best_ev["source"], + } + continue + + if not ev.get("source"): + d["evidence"] = { + "quote": ev.get("quote", ""), + "source": ev_list[0]["source"], + } + + return plan + +def build_architecture_chain(): + + ARCH_PROMPT = PromptTemplate.from_template(""" + You are a senior OCI Cloud Architect. + + TASK: + Design an OCI architecture for: + + {question} + + You MUST design the solution using the provided documentation evidence. + + **DOCUMENT EVIDENCE** (JSON): + {text_context} + + **GRAPH FACTS**: + {graph_context} + + Rules (MANDATORY): + - Use ONLY services supported by **DOCUMENT EVIDENCE**. + - The architecture may involve MULTIPLE OCI services; therefore decisions may require DIFFERENT sources (do not reuse a single URL for everything unless it truly applies). 
+ - Set each service from this OCI Services: + Oracle Cloud Infrastructure Service (OCI): + + Compute (IaaS) + β€’ Compute Instances (VM) + β€’ Bare Metal Instances + β€’ Dedicated VM Hosts + β€’ GPU Instances + β€’ Confidential Computing + β€’ Capacity Reservations + β€’ Autoscaling (Instance Pools) + β€’ Live Migration + β€’ Oracle Cloud VMware Solution (OCVS) + β€’ HPC (High Performance Computing) + β€’ Arm-based Compute (Ampere) + + Storage + + Object Storage + β€’ Object Storage + β€’ Object Storage – Archive + β€’ Pre-Authenticated Requests + β€’ Replication + + Block & File + β€’ Block Volume + β€’ Boot Volume + β€’ Volume Groups + β€’ File Storage + β€’ File Storage Snapshots + β€’ Data Transfer Service + + Networking + β€’ Virtual Cloud Network (VCN) + β€’ Subnets + β€’ Internet Gateway + β€’ NAT Gateway + β€’ Service Gateway + β€’ Dynamic Routing Gateway (DRG) + β€’ FastConnect + β€’ Load Balancer (L7 / L4) + β€’ Network Load Balancer + β€’ DNS + β€’ Traffic Management Steering Policies + β€’ IP Address Management (IPAM) + β€’ Network Firewall + β€’ Web Application Firewall (WAF) + β€’ Bastion + β€’ Capture Traffic (VTAP) + β€’ Private Endpoints + + Security, Identity & Compliance + β€’ Identity and Access Management (IAM) + β€’ Compartments + β€’ Policies + β€’ OCI Vault + β€’ OCI Key Management (KMS) + β€’ OCI Certificates + β€’ OCI Secrets + β€’ OCI Bastion + β€’ Cloud Guard + β€’ Security Zones + β€’ Vulnerability Scanning Service + β€’ Data Safe + β€’ Logging + β€’ Audit + β€’ OS Management / OS Management Hub + β€’ Shielded Instances + β€’ Zero Trust Packet Routing + + Databases + + Autonomous + β€’ Autonomous Database (ATP) + β€’ Autonomous Data Warehouse (ADW) + β€’ Autonomous JSON Database + + Databases Gerenciados + β€’ Oracle Database Service + β€’ Oracle Exadata Database Service + β€’ Exadata Cloud@Customer + β€’ Base Database Service + β€’ MySQL Database Service + β€’ MySQL HeatWave + β€’ NoSQL Database Cloud Service + β€’ TimesTen + β€’ 
PostgreSQL (OCI managed) + β€’ MongoDB API (OCI NoSQL compatibility) + + Analytics & BI + β€’ Oracle Analytics Cloud (OAC) + β€’ OCI Data Catalog + β€’ OCI Data Integration + β€’ OCI Streaming Analytics + β€’ OCI GoldenGate + β€’ OCI Big Data Service (Hadoop/Spark) + β€’ OCI Data Science + β€’ OCI AI Anomaly Detection + β€’ OCI AI Forecasting + + AI & Machine Learning + + Generative AI + β€’ OCI Generative AI + β€’ OCI Generative AI Agents + β€’ OCI Generative AI RAG + β€’ OCI Generative AI Embeddings + β€’ OCI AI Gateway (OpenAI-compatible) + + AI Services + β€’ OCI Vision (OCR, image analysis) + β€’ OCI Speech (STT / TTS) + β€’ OCI Language (NLP) + β€’ OCI Document Understanding + β€’ OCI Anomaly Detection + β€’ OCI Forecasting + β€’ OCI Data Labeling + + Containers & Cloud Native + β€’ OCI Container Engine for Kubernetes (OKE) + β€’ Container Registry (OCIR) + β€’ Service Mesh + β€’ API Gateway + β€’ OCI Functions (FaaS) + β€’ OCI Streaming (Kafka-compatible) + β€’ OCI Queue + β€’ OCI Events + β€’ OCI Resource Manager (Terraform) + + Integration & Messaging + β€’ OCI Integration Cloud (OIC) + β€’ OCI Service Connector Hub + β€’ OCI Streaming + β€’ OCI GoldenGate + β€’ OCI API Gateway + β€’ OCI Events Service + β€’ OCI Queue + β€’ Real Applications Clusters (RAC) + + Developer Services + β€’ OCI DevOps (CI/CD) + β€’ OCI Code Repository + β€’ OCI Build Pipelines + β€’ OCI Artifact Registry + β€’ OCI Logging Analytics + β€’ OCI Monitoring + β€’ OCI Notifications + β€’ OCI Bastion + β€’ OCI CLI + β€’ OCI SDKs + + Observability & Management + β€’ OCI Monitoring + β€’ OCI Alarms + β€’ OCI Logging + β€’ OCI Logging Analytics + β€’ OCI Application Performance Monitoring (APM) + β€’ OCI Operations Insights + β€’ OCI Management Agent + β€’ OCI Resource Discovery + + Enterprise & Hybrid + β€’ Oracle Cloud@Customer + β€’ Exadata Cloud@Customer + β€’ Compute Cloud@Customer + β€’ Dedicated Region Cloud@Customer + β€’ OCI Roving Edge Infrastructure + β€’ OCI Alloy + + 
Governance & FinOps + β€’ OCI Budgets + β€’ Cost Analysis + β€’ Usage Reports + β€’ Quotas + β€’ Tagging + β€’ Compartments + β€’ Resource Search + + Regions & Edge + β€’ OCI Regions (Commercial, Government, EU Sovereign) + β€’ OCI Edge Services + β€’ OCI Roving Edge + β€’ OCI Dedicated Region + + STRICT SERVICE GROUNDING RULE (MANDATORY): + - For each decision, use evidence from the SAME service_group as the decision service. + - Do NOT justify one service using evidence from another service's documentation. + + SOURCE RULES (STRICT): + - Copy ONLY URLs that appear EXACTLY in DOCUMENT EVIDENCE or GRAPH FACTS + - NEVER create or guess URLs + - If no URL is explicitly present, set source = null + - It is allowed to return null + - GIVE MANY SOURCES URL + - GIVE A COMPLETE PATH OF URL SOURCES TO UNDERSTAND THE CONCEPTS THEME + - GIVE one or more OVERVIEW SOURCE URL + - GIVE one or more SOLUTION AND ARCHITECTURE SOURCE URL + + MANDATORY: + - Break into requirements + - Map each requirement to OCI services + - Justify each choice + + LANGUAGE RULE (MANDATORY): **DO TRANSLATION AS THE LAST STEP** + - Write ALL textual values in {lang} + - Keep JSON keys in English + - Do NOT translate keys + + Return ONLY JSON: + + {{ + "problem_summary": "...", + "architecture": {{ + "components": [ + {{ + "id": "api", + "service": "OCI API Gateway", + "purpose": "...", + "source": ["__AUTO__"], + "connects_to": [] + }} + ] + }}, + "decisions": [ + {{ + "service": "...", + "reason": "must cite evidence", + "evidence": {{ + "quote": "...", + "source": ["__AUTO__"] + }} + }} + ] + }} + """) + + callback = BrowserLogCallback(default_logger) + + chain_arch = ( + RunnableLambda(lambda q: { + "question": q, + "req": parse_rfp_requirement(q) + }) + | RunnableMap({ + "question": lambda x: x["question"], + "text_context": lambda x: get_architecture_context(x["req"])["text_context"], + "graph_context": lambda x: get_architecture_context(x["req"])["graph_context"], + "lang": lambda x: 
normalize_lang(detect(x["question"]))
+        })
+        | ARCH_PROMPT
+        | lrm_for_architect
+        | StrOutputParser()
+    ).with_config(callbacks=[callback])
+
+    return chain_arch
+
+def score_url_quality(url: str) -> int:
+    if not url:
+        return 0
+
+    u = url.lower()
+    score = 0
+
+    if "/solutions/" in u:
+        score += 8
+
+    elif "youtube.com" in u or "youtu.be" in u:
+        score += 7
+
+    elif any(x in u for x in [
+        "architecture", "overview", "concept", "how-to", "use-case"
+    ]):
+        score += 5
+
+    elif "docs.oracle.com" in u:
+        score += 3
+
+    if any(x in u for x in [
+        "home", "index", "portal", "release-notes", "faq", "troubleshoot"
+    ]):
+        score -= 10
+
+    return score
+
+
+def score_architecture_plan(plan: dict) -> int:
+    if not plan:
+        return -1
+
+    score = 0
+
+    comps = plan.get("architecture", {}).get("components", [])
+    decisions = plan.get("decisions", [])
+
+    score += len(comps) * 3
+    score += len(decisions) * 4
+
+    for d in decisions:
+        ev = d.get("evidence", {}) or {}
+        srcs = ev.get("source", [])
+
+        if isinstance(srcs, str):
+            srcs = [srcs]
+
+        for s in srcs:
+            score += score_url_quality(s)
+
+        quote = ev.get("quote", "") or ""
+
+        score += min(len(quote) // 500, 4)
+
+    return score
+
+def call_architecture_planner(
+    question: str,
+    parallel_attempts: int = MAX_ATTEMPTS,
+    log=default_logger
+):
+    log("\n🏗️ ARCHITECTURE (PARALLEL SELF-CONSISTENCY)")
+
+    def worker(i):
+        # raw = chain_architecture.invoke(question)
+        raw = call_llm(chain_architecture.invoke, question)
+        plan = safe_parse_architecture_json(raw)
+
+        try:
+            score = score_architecture_plan(plan)
+        except Exception:
+            print("Error scoring", plan)
+            score = 0
+
+        return {
+            "attempt": i,
+            "plan": plan,
+            "score": score
+        }
+
+    results = []
+
+    with ThreadPoolExecutor(max_workers=parallel_attempts) as executor:
+        futures = [
+            executor.submit(worker, i)
+            for i in range(1, parallel_attempts + 1)
+        ]
+
+        for f in as_completed(futures):
results.append(f.result())
+
+    results.sort(key=lambda r: r["score"], reverse=True)
+
+    for r in results:
+        log(f"⚡ Attempt {r['attempt']} score={r['score']}")
+
+    best = results[0]["plan"]
+
+    log(f"\n🏆 Selected architecture from attempt {results[0]['attempt']}")
+
+    # best is already a parsed plan dict; only source validation is needed here
+    plan = validate_architecture_sources(best)
+
+    return plan
+
+def architecture_to_mermaid(plan: dict) -> str:
+
+    architecture = plan.get("architecture", {})
+    comps = architecture.get("components", [])
+
+    if not comps:
+        return "flowchart TB\nEmpty[No components]"
+
+    direction = "TB" if len(comps) > 6 else "LR"
+
+    lines = [f"flowchart {direction}"]
+
+    # nodes
+    for c in comps:
+        cid = c["id"]
+        purpose = "\n".join(textwrap.wrap(c["purpose"], 28))
+        label = f'{c["service"]}\\n{purpose}'
+        lines.append(f'{cid}["{label}"]')
+
+    # edges
+    for c in comps:
+        src = c["id"]
+
+        for target in c.get("connects_to", []):
+
+            if isinstance(target, dict):
+                dst = target.get("id")
+
+            elif isinstance(target, str):
+                dst = target
+
+            else:
+                continue
+
+            if dst:
+                lines.append(f"{src} --> {dst}")
+
+    return "\n".join(lines)
+
+def get_architecture_context(req: dict):
+    query_terms = extract_graph_keywords_from_requirement(req)
+
+    docs = search_active_chunks(query_terms)
+    graph_context = query_knowledge_graph(query_terms, top_k=20, min_score=1)
+
+    graph_terms = extract_terms_from_graph_text(graph_context)
+    reranked_chunks = rerank_documents_with_graph_terms(
+        docs,
+        query_terms,
+        graph_terms,
+        top_k=8
+    )
+    structured_evidence = build_architecture_evidence(reranked_chunks)
+
+    return {
+        "text_context": structured_evidence,
+        "graph_context": graph_context,
+        "requirement_type": req["requirement_type"],
+        "subject": req["subject"],
+        "expected_value": req.get("expected_value", "")
+    }
+
+def extract_first_balanced_json(text: str) -> str | None:
+    """
+    Extract ONLY the first balanced JSON object.
+    Ignore ANYTHING after it.
+    """
+
+    start = text.find("{")
+    if start == -1:
+        return None
+
+    depth = 0
+
+    for i, c in enumerate(text[start:], start):
+        if c == "{":
+            depth += 1
+        elif c == "}":
+            depth -= 1
+
+            if depth == 0:
+                return text[start:i+1]
+
+    return None
+
+def sanitize_json_string(text: str) -> str:
+    return re.sub(r'\\(?!["\\/bfnrtu])', r'\\\\', text)
+
+
+def recover_json_object(text: str) -> str | None:
+    """
+    Extract the first possible JSON object.
+    - ignores junk before/after it
+    - tolerates truncation
+    - closes braces automatically
+    """
+
+    if not text:
+        return None
+
+    # remove markdown fences
+    text = text.replace("```json", "").replace("```", "")
+
+    start = text.find("{")
+    if start == -1:
+        return None
+
+    text = text[start:]
+
+    depth = 0
+    end_pos = None
+
+    for i, c in enumerate(text):
+        if c == "{":
+            depth += 1
+        elif c == "}":
+            depth -= 1
+
+            if depth == 0:
+                end_pos = i + 1
+                break
+
+    # ✅ normal case → complete JSON
+    if end_pos:
+        return text[:end_pos]
+
+    # 🔥 TRUNCATED case → close the braces automatically
+    opens = text.count("{")
+    closes = text.count("}")
+
+    missing = opens - closes
+
+    if missing > 0:
+        text = text + ("}" * missing)
+
+    text = sanitize_json_string(text)
+    text = extract_first_balanced_json(text)
+
+    return text
+
+def safe_parse_architecture_json(raw):
+    """
+    Robust JSON parser for LLM output.
+    Removes control chars and fixes invalid escapes.
+    """
+
+    if isinstance(raw, dict):
+        return raw
+
+    if not raw:
+        return {}
+
+    raw = raw.replace("```json", "").replace("```", "").strip()
+
+    # take only the JSON block
+    m = re.search(r"\{.*\}", raw, re.DOTALL)
+    if not m:
+        raise ValueError(f"Invalid architecture JSON:\n{raw}")
+
+    json_text = m.group(0)
+
+    # 🔥 strip invisible control characters
+    json_text = re.sub(r"[\x00-\x1F\x7F]", " ", json_text)
+
+    # 🔥 normalize newlines
+    json_text = json_text.replace("\r", " ").replace("\n", " ")
+
+    try:
+        return json.loads(json_text)
+    except Exception as e:
+        print(e)
+        print("⚠️ JSON RAW (debug):")
+        print(json_text)
+        try:
+            return json.loads(recover_json_object(json_text))
+        except Exception as e1:
+            print(e1)
+            raise e1
+
+# =========================
+# Oracle Graph Client
+# =========================
+ANSWER_RANK = {
+    "YES": 3,
+    "PARTIAL": 2,
+    "NO": 1
+}
+
+CONFIDENCE_RANK = {
+    "HIGH": 3,
+    "MEDIUM": 2,
+    "LOW": 1
+}
+
+def score_answer(parsed: dict) -> int:
+    ans = parsed.get("answer", "NO")
+    conf = parsed.get("confidence", "LOW")
+    evidence = parsed.get("evidence", [])
+
+    # 🔹 logical base score
+    base = ANSWER_RANK.get(ans, 0) * 10 + CONFIDENCE_RANK.get(conf, 0)
+
+    unique_sources = set()
+    quote_size = 0
+    source_quality = 0
+
+    for e in evidence:
+        src = (e.get("source") or "").lower()
+        quote = e.get("quote", "")
+
+        if not src:
+            continue
+
+        unique_sources.add(src)
+        # accumulate quote length across ALL evidence items
+        # (do not reset quote_size inside the loop)
+        if quote:
+            quote_size += len(quote)
+
+        # 🔥 quality-based source weights
+        if "/solutions/" in src:
+            source_quality += 6
+
+        elif "youtube.com" in src:
+            source_quality += 5
+
+        elif any(x in src for x in ["architecture", "overview", "concepts", "how-to"]):
+            source_quality += 4
+
+        elif "docs.oracle.com" in src:
+            source_quality += 2
+
+        elif any(x in src for x in ["home", "index", "portal"]):
+            source_quality -= 5
+
+    evidence_score = (
+        len(unique_sources) * 3
+        + min(quote_size // 500, 5)
+        + source_quality
+    )
+
+    return base +
evidence_score + +def ensure_oracle_text_index( + conn, + table_name: str, + column_name: str, + index_name: str +): + cursor = conn.cursor() + + cursor.execute(""" + SELECT status + FROM user_indexes + WHERE index_name = :idx + """, {"idx": index_name.upper()}) + + row = cursor.fetchone() + index_exists = row is not None + index_status = row[0] if row else None + + if not index_exists: + print(f"πŸ› οΈ Creating Oracle Text index {index_name}") + + cursor.execute(f""" + CREATE INDEX {index_name} + ON {table_name} ({column_name}) + INDEXTYPE IS CTXSYS.CONTEXT + """) + + conn.commit() + cursor.close() + print(f"βœ… Index {index_name} created (sync deferred)") + return + + if index_status != "VALID": + print(f"⚠️ Index {index_name} is {index_status}. Recreating...") + + try: + cursor.execute(f"DROP INDEX {index_name}") + conn.commit() + except Exception as e: + print(f"❌ Failed to drop index {index_name}: {e}") + cursor.close() + return + + cursor.execute(f""" + CREATE INDEX {index_name} + ON {table_name} ({column_name}) + INDEXTYPE IS CTXSYS.CONTEXT + """) + conn.commit() + cursor.close() + print(f"♻️ Index {index_name} recreated (sync deferred)") + return + + print(f"πŸ”„ Syncing Oracle Text index: {index_name}") + try: + cursor.execute(f""" + BEGIN + CTX_DDL.SYNC_INDEX('{index_name}', '2M'); + END; + """) + conn.commit() + print(f"βœ… Index {index_name} synced") + except Exception as e: + print(f"⚠️ Sync failed for {index_name}: {e}") + print("⚠️ Continuing without breaking pipeline") + + cursor.close() + +def _col_exists(conn, table: str, col: str) -> bool: + cur = conn.cursor() + cur.execute(""" + SELECT 1 + FROM user_tab_cols + WHERE table_name = :t + AND column_name = :c + """, {"t": table.upper(), "c": col.upper()}) + ok = cur.fetchone() is not None + cur.close() + return ok + + +def create_tables_if_not_exist(conn): + cursor = conn.cursor() + + try: + # --------------------------- + # KG_NODES + # --------------------------- + cursor.execute(f""" + BEGIN + 
EXECUTE IMMEDIATE ' + CREATE TABLE KG_NODES_{GRAPH_NAME} ( + ID NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY PRIMARY KEY, + NODE_TYPE VARCHAR2(100), + NAME VARCHAR2(1000), + DESCRIPTION CLOB, + PROPERTIES CLOB, + CREATED_AT TIMESTAMP DEFAULT SYSTIMESTAMP + ) + '; + EXCEPTION WHEN OTHERS THEN + IF SQLCODE != -955 THEN RAISE; END IF; + END; + """) + + # --------------------------- + # KG_EDGES (estrutura + ponte pro chunk) + # --------------------------- + cursor.execute(f""" + BEGIN + EXECUTE IMMEDIATE ' + CREATE TABLE KG_EDGES_{GRAPH_NAME} ( + ID NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY PRIMARY KEY, + SOURCE_ID NUMBER, + TARGET_ID NUMBER, + EDGE_TYPE VARCHAR2(100), + + -- βœ… governanΓ§a / revogaΓ§Γ£o + CHUNK_HASH VARCHAR2(64), + + -- βœ… link principal (melhor url do chunk) + SOURCE_URL VARCHAR2(2000), + + CONFIDENCE_WEIGHT NUMBER DEFAULT 1, + CREATED_AT TIMESTAMP DEFAULT SYSTIMESTAMP + ) + '; + EXCEPTION WHEN OTHERS THEN + IF SQLCODE != -955 THEN RAISE; END IF; + END; + """) + + cursor.execute(f""" + BEGIN + EXECUTE IMMEDIATE ' + CREATE UNIQUE INDEX KG_EDGE_UNQ_{GRAPH_NAME} + ON KG_EDGES_{GRAPH_NAME} + (SOURCE_ID, TARGET_ID, EDGE_TYPE, CHUNK_HASH) + '; + EXCEPTION WHEN OTHERS THEN + IF SQLCODE != -955 THEN RAISE; END IF; + END; + """) + + # --------------------------- + # KG_EVIDENCE (prova) + # --------------------------- + cursor.execute(f""" + BEGIN + EXECUTE IMMEDIATE ' + CREATE TABLE KG_EVIDENCE_{GRAPH_NAME} ( + ID NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY PRIMARY KEY, + EDGE_ID NUMBER, + CHUNK_HASH VARCHAR2(64), + SOURCE_URL VARCHAR2(2000), + QUOTE CLOB, + CREATED_AT TIMESTAMP DEFAULT SYSTIMESTAMP + ) + '; + EXCEPTION WHEN OTHERS THEN + IF SQLCODE != -955 THEN RAISE; END IF; + END; + """) + + conn.commit() + + finally: + cursor.close() + + # --------------------------- + # MigraΓ§Γ£o leve (se seu KG_EDGES jΓ‘ existia antigo) + # --------------------------- + edges_table = f"KG_EDGES_{GRAPH_NAME}" + + if not _col_exists(conn, edges_table, 
"CHUNK_HASH"): + cur = conn.cursor() + cur.execute(f"ALTER TABLE {edges_table} ADD (CHUNK_HASH VARCHAR2(64))") + conn.commit() + cur.close() + ensure_oracle_text_index(conn, f"KG_NODES_{GRAPH_NAME}", "NAME", f"KG_NODES_{GRAPH_NAME}_NAME") + + print("βœ… Graph schema (probatΓ³rio) ready.") + +#create_tables_if_not_exist(oracle_conn) + +# IF GRAPH INDEX PROBLEM, Reindex +# ensure_oracle_text_index( +# oracle_conn, +# "ENTITIES_" + GRAPH_NAME, +# "NAME", +# "IDX_ENT_" + GRAPH_NAME + "_NAME" +# ) +# +# ensure_oracle_text_index( +# oracle_conn, +# "RELATIONS_" + GRAPH_NAME, +# "RELATION_TYPE", +# "IDX_REL_" + GRAPH_NAME + "_RELTYPE" +# ) + +def create_knowledge_graph(chunks): + + cursor = oracle_conn.cursor() + inserted_counter = 0 + COMMIT_BATCH = 500 + + # ===================================================== + # 1️⃣ CREATE PROPERTY GRAPH (se nΓ£o existir) + # ===================================================== + try: + cursor.execute(f""" + BEGIN + EXECUTE IMMEDIATE ' + CREATE PROPERTY GRAPH {GRAPH_NAME} + VERTEX TABLES ( + KG_NODES_{GRAPH_NAME} + KEY (ID) + LABEL NODE_TYPE + PROPERTIES (NAME, DESCRIPTION, PROPERTIES) + ) + EDGE TABLES ( + KG_EDGES_{GRAPH_NAME} + KEY (ID) + SOURCE KEY (SOURCE_ID) REFERENCES KG_NODES_{GRAPH_NAME}(ID) + DESTINATION KEY (TARGET_ID) REFERENCES KG_NODES_{GRAPH_NAME}(ID) + LABEL EDGE_TYPE + PROPERTIES (CHUNK_HASH, SOURCE_URL, CONFIDENCE_WEIGHT) + ) + '; + EXCEPTION + WHEN OTHERS THEN + IF SQLCODE NOT IN (-55358, -955) THEN + RAISE; + END IF; + END; + """) + print(f"🧠 Graph '{GRAPH_NAME}' ready.") + except Exception as e: + print(f"[GRAPH ERROR] {e}") + + # ===================================================== + # 2️⃣ Helper: MERGE NODE (otimizado) + # ===================================================== + def build_default_node_properties(): + return { + "metadata": { + "created_by": "RFP_AI_V2", + "version": "2.0", + "created_at": datetime.utcnow().isoformat() + }, + "analysis": { + "confidence_score": None, + "source": "DOCUMENT_RAG", 
+ "extraction_method": "LLM_TRIPLE_EXTRACTION" + }, + "governance": { + "validated": False, + "review_required": False + } + } + + + def ensure_node_properties_structure(properties): + base = build_default_node_properties() + + if not properties: + return base + + def merge(d1, d2): + for k, v in d1.items(): + if k not in d2: + d2[k] = v + elif isinstance(v, dict): + d2[k] = merge(v, d2.get(k, {})) + return d2 + + return merge(base, properties) + + def merge_node(node_type, name, description=None, properties=None): + + name = (name or "").strip()[:500] + + # 1️⃣ Try existing + cursor.execute(f""" + SELECT ID + FROM KG_NODES_{GRAPH_NAME} + WHERE NAME = :name_val + AND NODE_TYPE = :node_type_val + """, { + "name_val": name, + "node_type_val": node_type + }) + + row = cursor.fetchone() + if row: + return row[0] + + # 2️⃣ Insert safely + node_id_var = cursor.var(oracledb.NUMBER) + + try: + cursor.execute(f""" + INSERT INTO KG_NODES_{GRAPH_NAME} + (NODE_TYPE, NAME, DESCRIPTION, PROPERTIES) + VALUES (:node_type_val, :name_val, :desc_val, :props_val) + RETURNING ID INTO :node_id + """, { + "node_type_val": node_type, + "name_val": name, + "desc_val": description, + "props_val": json.dumps( + ensure_node_properties_structure(properties) + ), + "node_id": node_id_var + }) + + return int(node_id_var.getvalue()[0]) + + except oracledb.IntegrityError: + # if unique constraint exists + cursor.execute(f""" + SELECT ID + FROM KG_NODES_{GRAPH_NAME} + WHERE NAME = :name_val + AND NODE_TYPE = :node_type_val + """, { + "name_val": name, + "node_type_val": node_type + }) + row = cursor.fetchone() + if row: + return row[0] + raise + + def extract_sentence_with_term(text, term): + sentences = re.split(r'(?<=[.!?]) +', text) + for s in sentences: + if term.lower() in s.lower(): + return s.strip() + return text[:1000] + + # ===================================================== + # 3️⃣ PROCESS CHUNKS + # ===================================================== + for doc in chunks: + + text = 
doc.page_content or ""
        chunk_hash_value = doc.metadata.get("chunk_hash")
        source_url = resolve_chunk_source(doc)

        if not text.strip() or not chunk_hash_value:
            continue

        prompt = f"""
        Extract explicit technical capabilities from the text below.

        Text:
        {text}

        Return triples ONLY in format:

        SERVICE -[SUPPORTS_CAPABILITY]-> CAPABILITY
        SERVICE -[DOES_NOT_SUPPORT]-> CAPABILITY
        SERVICE -[HAS_LIMITATION]-> LIMITATION
        SERVICE -[HAS_SLA]-> SLA_VALUE

        Rules:
        - Use exact service names if present
        - Use UPPERCASE relation names
        - No inference
        - If none found return NONE
        """

        try:
            response = call_llm(llm_for_rag.invoke, prompt)
            result = response.content.strip()
        except Exception as e:
            print(f"[LLM ERROR] {e}")
            continue

        if result.upper() == "NONE":
            continue

        triples = result.splitlines()

        for triple in triples:

            parts = triple.split("-[")
            if len(parts) != 2:
                continue

            right = parts[1].split("]->")
            if len(right) != 2:
                continue

            entity1 = parts[0].strip()
            raw_relation = right[0].strip().upper()
            entity2 = right[1].strip()

            # Keep node names within the NAME column limit
            MAX_NODE_NAME = 500

            entity1 = entity1[:MAX_NODE_NAME]
            entity2 = entity2[:MAX_NODE_NAME]

            relation = re.sub(r'\W+', '_', raw_relation)

            source_id = merge_node(
                node_type="SERVICE",
                name=entity1,
                description=None,
                properties={
                    "chunk_hash": chunk_hash_value,
                    "source_url": source_url
                }
            )

            if relation == "DOES_NOT_SUPPORT":
                target_type = "UNSUPPORTED_CAPABILITY"
            elif relation == "HAS_LIMITATION":
                target_type = "LIMITATION"
            elif relation == "HAS_SLA":
                target_type = "SLA"
            else:
                target_type = "CAPABILITY"

            description_text = extract_sentence_with_term(text, entity2)

            target_id = merge_node(
                node_type=target_type,
                name=entity2,
                description=description_text
            )

            # πŸ”₯ Avoid inserting a duplicate edge for the same chunk
            cursor.execute(f"""
                SELECT ID
                FROM KG_EDGES_{GRAPH_NAME}
                WHERE SOURCE_ID = :s
                AND TARGET_ID = :t
                AND EDGE_TYPE = :r
                AND CHUNK_HASH = :h
            """, {
                "s": source_id,
                "t": target_id,
                "r": relation,
                "h": chunk_hash_value
            })

            if cursor.fetchone():
                continue

            # =====================================================
            # INSERT EDGE + RETURNING ID
            # =====================================================
            edge_id_var = cursor.var(oracledb.NUMBER)

            cursor.execute(f"""
                INSERT INTO KG_EDGES_{GRAPH_NAME}
                (SOURCE_ID, TARGET_ID, EDGE_TYPE, CHUNK_HASH, SOURCE_URL, CONFIDENCE_WEIGHT)
                VALUES (:src, :tgt, :rel, :h, :url, :w)
                RETURNING ID INTO :edge_id
            """, {
                "src": source_id,
                "tgt": target_id,
                "rel": relation,
                "h": chunk_hash_value,
                "url": source_url,
                "w": 1,
                "edge_id": edge_id_var
            })

            edge_id = int(edge_id_var.getvalue()[0])

            # =====================================================
            # INSERT EVIDENCE
            # =====================================================
            quote = text[:1500]

            cursor.execute(f"""
                INSERT INTO KG_EVIDENCE_{GRAPH_NAME}
                (EDGE_ID, CHUNK_HASH, SOURCE_URL, QUOTE)
                VALUES (:eid, :h, :url, :q)
            """, {
                "eid": edge_id,
                "h": chunk_hash_value,
                "url": source_url,
                "q": quote
            })

            inserted_counter += 1

            print(f"βœ… {entity1} -[{relation}]-> {entity2}")

            # =====================================================
            # COMMIT EVERY COMMIT_BATCH RECORDS
            # =====================================================
            if inserted_counter % COMMIT_BATCH == 0:
                oracle_conn.commit()
                print(f"πŸ’Ύ Batch commit ({inserted_counter} records)")

    # Final commit
    oracle_conn.commit()
    cursor.close()

    print(f"πŸ’Ύ Knowledge graph updated. Total inserted: {inserted_counter}")

def parse_rfp_requirement(question: str) -> dict:
    prompt = f"""
    You are an RFP requirement NORMALIZER for Oracle Cloud Infrastructure (OCI).

    Your job is NOT to summarize the question.
+ Your job is to STRUCTURE the requirement so it can be searched in: + - Technical documentation + - Knowledge Graph + - Vector databases + + ──────────────────────────────── + STEP 1 β€” Understand the requirement + ──────────────────────────────── + From the question, identify: + 1. The PRIMARY OCI SERVICE CATEGORY involved + 2. The MAIN TECHNICAL SUBJECT (short and precise) + 3. The EXPECTED TECHNICAL CAPABILITY or CONDITION (if any) + + IMPORTANT: + - Ignore marketing language + - Ignore phrases like "possui", "permite", "oferece" + - Focus ONLY on concrete technical meaning + + ──────────────────────────────── + STEP 2 β€” Mandatory service classification + ──────────────────────────────── + You MUST choose ONE primary technology from the list below + and INCLUDE IT EXPLICITLY in the keywords list. + + Choose the MOST SPECIFIC applicable item. + + ServiΓ§os da Oracle Cloud Infrastructure (OCI): + + Compute (IaaS) + β€’ Compute Instances (VM) + β€’ Bare Metal Instances + β€’ Dedicated VM Hosts + β€’ GPU Instances + β€’ Confidential Computing + β€’ Capacity Reservations + β€’ Autoscaling (Instance Pools) + β€’ Live Migration + β€’ Oracle Cloud VMware Solution (OCVS) + β€’ HPC (High Performance Computing) + β€’ Arm-based Compute (Ampere) + + Storage + + Object Storage + β€’ Object Storage + β€’ Object Storage – Archive + β€’ Pre-Authenticated Requests + β€’ Replication + + Block & File + β€’ Block Volume + β€’ Boot Volume + β€’ Volume Groups + β€’ File Storage + β€’ File Storage Snapshots + β€’ Data Transfer Service + + Networking + β€’ Virtual Cloud Network (VCN) + β€’ Subnets + β€’ Internet Gateway + β€’ NAT Gateway + β€’ Service Gateway + β€’ Dynamic Routing Gateway (DRG) + β€’ FastConnect + β€’ Load Balancer (L7 / L4) + β€’ Network Load Balancer + β€’ DNS + β€’ Traffic Management Steering Policies + β€’ IP Address Management (IPAM) + β€’ Network Firewall + β€’ Web Application Firewall (WAF) + β€’ Bastion + β€’ Capture Traffic (VTAP) + β€’ Private Endpoints + 
+ Security, Identity & Compliance + β€’ Identity and Access Management (IAM) + β€’ Compartments + β€’ Policies + β€’ OCI Vault + β€’ OCI Key Management (KMS) + β€’ OCI Certificates + β€’ OCI Secrets + β€’ OCI Bastion + β€’ Cloud Guard + β€’ Security Zones + β€’ Vulnerability Scanning Service + β€’ Data Safe + β€’ Logging + β€’ Audit + β€’ OS Management / OS Management Hub + β€’ Shielded Instances + β€’ Zero Trust Packet Routing + + Databases + + Autonomous + β€’ Autonomous Database (ATP) + β€’ Autonomous Data Warehouse (ADW) + β€’ Autonomous JSON Database + + Databases Gerenciados + β€’ Oracle Database Service + β€’ Oracle Exadata Database Service + β€’ Exadata Cloud@Customer + β€’ Base Database Service + β€’ MySQL Database Service + β€’ MySQL HeatWave + β€’ NoSQL Database Cloud Service + β€’ TimesTen + β€’ PostgreSQL (OCI managed) + β€’ MongoDB API (OCI NoSQL compatibility) + + Analytics & BI + β€’ Oracle Analytics Cloud (OAC) + β€’ OCI Data Catalog + β€’ OCI Data Integration + β€’ OCI Streaming Analytics + β€’ OCI GoldenGate + β€’ OCI Big Data Service (Hadoop/Spark) + β€’ OCI Data Science + β€’ OCI AI Anomaly Detection + β€’ OCI AI Forecasting + + AI & Machine Learning + + Generative AI + β€’ OCI Generative AI + β€’ OCI Generative AI Agents + β€’ OCI Generative AI RAG + β€’ OCI Generative AI Embeddings + β€’ OCI AI Gateway (OpenAI-compatible) + + AI Services + β€’ OCI Vision (OCR, image analysis) + β€’ OCI Speech (STT / TTS) + β€’ OCI Language (NLP) + β€’ OCI Document Understanding + β€’ OCI Anomaly Detection + β€’ OCI Forecasting + β€’ OCI Data Labeling + + Containers & Cloud Native + β€’ OCI Container Engine for Kubernetes (OKE) + β€’ Container Registry (OCIR) + β€’ Service Mesh + β€’ API Gateway + β€’ OCI Functions (FaaS) + β€’ OCI Streaming (Kafka-compatible) + β€’ OCI Queue + β€’ OCI Events + β€’ OCI Resource Manager (Terraform) + + Integration & Messaging + β€’ OCI Integration Cloud (OIC) + β€’ OCI Service Connector Hub + β€’ OCI Streaming + β€’ OCI 
GoldenGate + β€’ OCI API Gateway + β€’ OCI Events Service + β€’ OCI Queue + β€’ Real Applications Clusters (RAC) + + Developer Services + β€’ OCI DevOps (CI/CD) + β€’ OCI Code Repository + β€’ OCI Build Pipelines + β€’ OCI Artifact Registry + β€’ OCI Logging Analytics + β€’ OCI Monitoring + β€’ OCI Notifications + β€’ OCI Bastion + β€’ OCI CLI + β€’ OCI SDKs + + Observability & Management + β€’ OCI Monitoring + β€’ OCI Alarms + β€’ OCI Logging + β€’ OCI Logging Analytics + β€’ OCI Application Performance Monitoring (APM) + β€’ OCI Operations Insights + β€’ OCI Management Agent + β€’ OCI Resource Discovery + + Enterprise & Hybrid + β€’ Oracle Cloud@Customer + β€’ Exadata Cloud@Customer + β€’ Compute Cloud@Customer + β€’ Dedicated Region Cloud@Customer + β€’ OCI Roving Edge Infrastructure + β€’ OCI Alloy + + Governance & FinOps + β€’ OCI Budgets + β€’ Cost Analysis + β€’ Usage Reports + β€’ Quotas + β€’ Tagging + β€’ Compartments + β€’ Resource Search + + Regions & Edge + β€’ OCI Regions (Commercial, Government, EU Sovereign) + β€’ OCI Edge Services + β€’ OCI Roving Edge + β€’ OCI Dedicated Region + + ──────────────────────────────── + STEP 3 β€” Keywords rules (CRITICAL) + ──────────────────────────────── + The "keywords" field MUST: + - ALWAYS include at least ONE OCI service keyword (e.g. "compute", "object storage", "oke") + - Include technical capability terms (e.g. resize, autoscaling, encryption) + - NEVER include generic verbs (permitir, possuir, oferecer) + - NEVER include full sentences + + ──────────────────────────────── + STEP 4 β€” Output rules + ──────────────────────────────── + Return ONLY valid JSON between <json> tags. + Do NOT explain your reasoning. + + Question: + {question} + + <json> + {{ + "requirement_type": "COMPLIANCE | FUNCTIONAL | NON_FUNCTIONAL", + "subject": "<short technical subject, e.g. 
'Compute Instances'>",
        "expected_value": "<technical capability or condition, or empty string>",
        "decision_type": "YES_NO | YES_NO_PARTIAL",
        "keywords": ["mandatory_oci_service", "technical_capability", "additional_term"]
    }}
    </json>
    """

    # resp = llm_for_rag.invoke(prompt)
    resp = call_llm(llm_for_rag.invoke, prompt)
    raw = resp.content.strip()

    try:
        # strip ```json ``` or bare ``` fences
        raw = re.sub(r"```json|```", "", raw).strip()

        match = re.search(r"<json>\s*(\{.*?\})\s*</json>", raw, re.DOTALL)
        if not match:
            raise ValueError("No JSON block found")
        json_text = match.group(1)

        return json.loads(json_text)

    except Exception as e:
        print("⚠️ RFP PARSER FAILED")
        print("RAW RESPONSE:")
        print(raw)

        return {
            "requirement_type": "UNKNOWN",
            "subject": question,
            "expected_value": "",
            "decision_type": "YES_NO_PARTIAL",
            "keywords": re.findall(r"\b\w+\b", question.lower())[:5]
        }

def extract_graph_keywords_from_requirement(req: dict) -> str:
    keywords = set(req.get("keywords", []))
    if req.get("subject"):
        keywords.add(req["subject"].lower())
    if req.get("expected_value"):
        keywords.add(str(req["expected_value"]).lower())
    return ", ".join(sorted(keywords))

# English plus Portuguese stopwords (RFP questions may arrive in either language)
STOPWORDS = {
    "and", "or", "not",
    "de", "da", "do", "das", "dos", "a", "o", "as", "os", "e", "em", "no", "na", "nos", "nas", "para", "por", "com"
}

def build_oracle_text_query(text: str) -> Optional[str]:
    if not text:
        return None

    text = strip_accents(text.lower())

    # grab "phrasable" sequences (words joined by spaces/hyphens)
    phrases = re.findall(r"[a-z0-9][a-z0-9\- ]{2,}", text)

    tokens: list[str] = []

    for p in phrases:
        p = p.strip()
        if not p:
            continue

        # 1) split into words, dropping hyphens (word-level)
        #    e.g. "store-and-forward" -> ["store", "and", "forward"]
        words = re.findall(r"[a-z0-9]+", p)

        # 2) drop stopwords and short words
        words = [w for w in words if w not in STOPWORDS and len(w) >= 4]

        if not words:
continue

        # 3) recombine
        if len(words) == 1:
            tokens.append(words[0])
        else:
            # To preserve hyphens you would rejoin with '-' and always quote;
            # here we normalize to spaces, which is safer for Oracle Text.
            tokens.append(f"\"{' '.join(words)}\"")

    tokens = sorted(set(tokens))
    return " OR ".join(tokens) if tokens else None

def detect_negative_conflict(graph_context, req):

    expected = (req.get("expected_value") or "").lower()
    subject = (req.get("subject") or "").lower()

    for row in graph_context:
        service, edge_type, target, *_ = row

        if edge_type == "DOES_NOT_SUPPORT":
            if expected in target.lower() or subject in target.lower():
                return {
                    "conflict": True,
                    "service": service,
                    "capability": target
                }

    return {"conflict": False}

def query_knowledge_graph(raw_keywords: str, top_k: int = 20, min_score: int = 0):

    cursor = oracle_conn.cursor()

    safe_query = build_oracle_text_query(raw_keywords)
    if not safe_query:
        cursor.close()
        return []

    # The Oracle Text query is passed as the bind variable :q;
    # top_k and min_score are integers, so interpolating them is safe.
    sql = f"""
    select * FROM (
        SELECT
            s.NAME AS service_name,
            e.EDGE_TYPE AS relation_type,
            t.NAME AS target_name,
            e.SOURCE_URL,
            e.CONFIDENCE_WEIGHT,
            CASE
                WHEN CONTAINS(s.NAME, :q) > 0 AND CONTAINS(t.NAME, :q) > 0 THEN 3
                WHEN CONTAINS(s.NAME, :q) > 0 THEN 2
                WHEN CONTAINS(t.NAME, :q) > 0 THEN 1
                ELSE 0
            END AS relevance_score
        FROM KG_EDGES_{GRAPH_NAME} e
        JOIN KG_NODES_{GRAPH_NAME} s ON s.ID = e.SOURCE_ID
        JOIN KG_NODES_{GRAPH_NAME} t ON t.ID = e.TARGET_ID
        WHERE s.NODE_TYPE = 'SERVICE'
        AND (
            CONTAINS(t.NAME, :q) > 0
            OR CONTAINS(s.NAME, :q) > 0
        )
        AND e.CHUNK_HASH NOT IN (
            SELECT CHUNK_HASH
            FROM RAG_CHUNKS_GOV
            WHERE STATUS = 'REVOKED'
        )
    )
    WHERE relevance_score >= {min_score}
    AND CONFIDENCE_WEIGHT > 0
    ORDER BY relevance_score DESC
    FETCH FIRST {top_k} ROWS ONLY
    """

    print(sql)

    cursor.execute(sql, {"q": safe_query})
    rows = cursor.fetchall()
cursor.close()

    for row in rows:
        print(row)

    return rows

# RE-RANK

def extract_terms_from_graph_text(graph_context):
    if not graph_context:
        return set()

    if isinstance(graph_context, list):
        terms = set()
        for row in graph_context:
            for col in row:
                if isinstance(col, str):
                    terms.add(col.lower())
        return terms

    if isinstance(graph_context, str):
        terms = set()
        # matches textual triples such as "SERVICE -[SUPPORTS_CAPABILITY]-> CAPABILITY"
        pattern = re.findall(r"([\w\s]+)-\[[\w_]+\]->([\w\s]+)", graph_context)
        for e1, e2 in pattern:
            terms.add(e1.strip().lower())
            terms.add(e2.strip().lower())
        return terms

    return set()

def rerank_documents_with_graph_terms(docs, query, graph_terms, top_k=12, per_source=2):
    query_terms = set(re.findall(r'\b\w+\b', query.lower()))
    all_terms = query_terms.union(graph_terms)

    scored = []

    for doc in docs:
        text = doc.page_content.lower()
        src = (doc.metadata.get("source") or "").lower()

        term_hits = sum(text.count(t) for t in all_terms)

        density = term_hits / max(len(text.split()), 1)

        url_score = score_arch_url(src)

        score = (term_hits * 2) + (density * 20) + url_score

        scored.append((score, doc))

    scored.sort(key=lambda x: x[0], reverse=True)

    selected = []
    by_source = {}

    for score, doc in scored:
        src = doc.metadata.get("source")

        if by_source.get(src, 0) >= per_source:
            continue

        selected.append(doc)
        by_source[src] = by_source.get(src, 0) + 1

        if len(selected) >= top_k:
            break

    return selected

def load_processed_hashes_from_graph():
    cursor = oracle_conn.cursor()
    cursor.execute(f"""
        SELECT DISTINCT CHUNK_HASH
        FROM KG_EDGES_{GRAPH_NAME}
    """)
    hashes = {r[0] for r in cursor.fetchall()}
    cursor.close()
    return hashes

def rebuild_graph_from_faiss(
    faiss_path: str,
    reverse_resume: bool = True,
    consecutive_threshold: int = 20
):
    from langchain_community.vectorstores import FAISS

    print("πŸ”„ Loading FAISS index...")

    vectorstore =
FAISS.load_local( + faiss_path, + embeddings, + allow_dangerous_deserialization=True + ) + + docs = list(vectorstore.docstore._dict.values()) + print(f"πŸ“„ {len(docs)} chunks loaded") + + if reverse_resume: + print("πŸ” Reverse resume mode active") + + processed_hashes = load_processed_hashes_from_graph() + + docs_to_process = [] + consecutive_processed = 0 + + for d in reversed(docs): + h = d.metadata.get("chunk_hash") + + if h in processed_hashes: + consecutive_processed += 1 + if consecutive_processed >= consecutive_threshold: + print("πŸ›‘ Boundary detected. Stopping reverse scan.") + break + continue + else: + consecutive_processed = 0 + docs_to_process.append(d) + + docs = list(reversed(docs_to_process)) + print(f"πŸš€ Will process {len(docs)} chunks") + + for chunk in tqdm(docs, desc="🧠 Building Graph"): + create_knowledge_graph([chunk]) + + print("βœ… Graph rebuild completed.") + +# SEMANTIC CHUNKING + +def split_llm_output_into_chapters(llm_text): + chapters = [] + current_chapter = [] + lines = llm_text.splitlines() + + for line in lines: + if re.match(chapter_separator_regex, line): + if current_chapter: + chapters.append("\n".join(current_chapter).strip()) + current_chapter = [line] + else: + current_chapter.append(line) + + if current_chapter: + chapters.append("\n".join(current_chapter).strip()) + + return chapters + + +def semantic_chunking(text): + prompt = f""" + You received the following text extracted via OCR: + + {text} + + Your task: + 1. Identify headings (short uppercase or bold lines, no period at the end) putting the Product Name (Application Name) and the Subject + 2. Separate paragraphs by heading + 3. Indicate columns with [COLUMN 1], [COLUMN 2] if present + 4. Indicate tables with [TABLE] in markdown format + 5. ALWAYS PUT THE URL if there is a Reference + 6. 
Indicate explicit metrics (if present)
       Examples:
       - Oracle Financial Services RTO is 1 hour
       - The Oracle Banking Supply Chain Finance Cloud Service A maximum number of 10K Hosted Transactions
       - The Oracle Banking Payments Cloud Service, Additional Non-Production Environment: You may purchase up to a maximum of ten (10) additional Non-Production Environments
    """

    # Retry with a bound so a persistent failure cannot loop forever
    # and `response` can never be referenced before assignment.
    response = None
    attempts = 0
    while response is None:
        try:
            # response = llm_for_rag.invoke(prompt)
            response = call_llm(llm_for_rag.invoke, prompt)
        except Exception as e:
            attempts += 1
            print(f"[ERROR] Gen AI call error: {e}")
            if attempts >= 3:
                raise

    return response

def read_pdfs(pdf_path):
    if "-ocr" in pdf_path:
        doc_pages = PyMuPDFLoader(str(pdf_path)).load()
    else:
        doc_pages = UnstructuredPDFLoader(str(pdf_path)).load()
    full_text = "\n".join([page.page_content for page in doc_pages])
    return full_text


def smart_split_text(text, max_chunk_size=10_000):
    chunks = []
    start = 0
    text_length = len(text)

    while start < text_length:
        end = min(start + max_chunk_size, text_length)
        split_point = max(
            text.rfind('.', start, end),
            text.rfind('!', start, end),
            text.rfind('?', start, end),
            text.rfind('\n\n', start, end)
        )
        if split_point == -1 or split_point <= start:
            split_point = end
        else:
            split_point += 1

        chunk = text[start:split_point].strip()
        if chunk:
            chunks.append(chunk)

        start = split_point

    return chunks


def load_previously_indexed_docs():
    if os.path.exists(PROCESSED_DOCS_FILE):
        with open(PROCESSED_DOCS_FILE, "rb") as f:
            return pickle.load(f)
    return set()


def save_indexed_docs(docs):
    with open(PROCESSED_DOCS_FILE, "wb") as f:
        pickle.dump(docs, f)

def retrieve_active_docs(query_terms: str, k: int = 50):
    docs = retriever.invoke(query_terms)

    hashes = [d.metadata.get("chunk_hash") for d in docs if d.metadata.get("chunk_hash")]
    if not hashes:
        return docs[:k]

    cursor = oracle_conn.cursor()
    cursor.execute("""
        SELECT chunk_hash
        FROM RAG_CHUNKS_GOV
        WHERE chunk_hash IN (SELECT COLUMN_VALUE FROM TABLE(:hashes))
        AND status = 'ACTIVE'
    """, {"hashes": hashes})

    active = {r[0] for r in cursor.fetchall()}
    cursor.close()

    return [d for d in docs if d.metadata.get("chunk_hash") in active][:k]

URL_REGEX = re.compile(r"https?://[^\s\)\]\}<>\"']+", re.IGNORECASE)

def best_url(urls):
    def score(u):
        u = u.lower()
        s = 0

        if "docs.oracle.com" in u:
            s += 3

        if any(x in u for x in [
            "compute","database","oke","storage","network","security"
        ]):
            s += 5

        if any(x in u for x in [
            "overview","architecture","concepts","how-to","use-case"
        ]):
            s += 4

        if any(x in u for x in [
            "home","index","portal","release-notes","faq"
        ]):
            s -= 10

        return s

    return max(urls, key=score)

def resolve_chunk_source(doc):
    text = doc.page_content or ""
    md = doc.metadata or {}

    # πŸ”₯ 1) real URLs found inside the chunk content
    text_urls = URL_REGEX.findall(text)

    # drop generic URLs
    text_urls = [
        u for u in text_urls
        if not any(x in u.lower() for x in ["home", "index", "portal"])
    ]

    if text_urls:
        return best_url(text_urls)

    # πŸ”₯ 2) reference (usually better than source)
    ref = md.get("reference")
    if ref and ref.startswith("http"):
        return ref

    # πŸ”₯ 3) source
    src = md.get("source")
    if src and src.startswith("http"):
        return src

    return "Oracle Cloud Infrastructure documentation"

def extract_first_url_from_chunk(*texts: str) -> str | None:
    """
    Receives multiple texts (chunk, metadata, etc.) and returns the FIRST URL found.
    """
    for text in texts:
        if not text:
            continue
        m = URL_REGEX.search(text)
        if m:
            return m.group(0)
    return None

def aggregate_chunks_by_source(docs):
    buckets = {}

    for d in docs:
        md = d.metadata or {}
        key = (
            extract_first_url_from_chunk(
                d.page_content,
                md.get("reference", ""),
                md.get("source", "")
            )
            or md.get("reference")
            or md.get("source")
            or "UNKNOWN"
        )

        buckets.setdefault(key, []).append(d.page_content)

    return buckets

def search_active_chunks(statement: str, k: int = 3000):
    docs = retriever.invoke(statement)

    hashes = [
        d.metadata.get("chunk_hash")
        for d in docs
        if d.metadata.get("chunk_hash")
    ]

    if not hashes:
        return docs[:k]

    in_clause = ",".join(f"'{h}'" for h in hashes)

    sql = f"""
        SELECT chunk_hash
        FROM RAG_CHUNKS_GOV
        WHERE status = 'ACTIVE'
        AND chunk_hash IN ({in_clause})
    """

    cursor = oracle_conn.cursor()
    cursor.execute(sql)

    active_hashes = {r[0] for r in cursor.fetchall()}
    cursor.close()

    final_docs = []
    for d in docs:
        h = d.metadata.get("chunk_hash")
        if h in active_hashes:
            d.metadata["source"] = resolve_chunk_source(d)
            final_docs.append(d)

    return final_docs[:k]

def search_manual_chunks_by_text(statement: str):
    sql = """
        SELECT
            chunk_hash,
            source,
            created_at,
            origin,
            status
        FROM rag_chunks_gov
        WHERE status = 'ACTIVE'
        AND (
            LOWER(source) LIKE '%' || LOWER(:q) || '%'
            OR LOWER(origin) LIKE '%' || LOWER(:q) || '%'
            OR LOWER(chunk_hash) LIKE '%' || LOWER(:q) || '%'
        )
        ORDER BY created_at DESC
        FETCH FIRST 50 ROWS ONLY
    """

    cursor = oracle_conn.cursor()
    cursor.execute(sql, {"q": statement})
    rows = cursor.fetchall()
    cursor.close()

    from langchain.schema import Document

    docs = []
    for h, source, created_at, origin, status in rows:
        docs.append(
            Document(
                page_content="",
                metadata={
                    "chunk_hash": h,
                    "source": source,
                    "created_at": created_at,
                    "origin": origin,
                    "status": status,
                }
            )
        )

    return docs

def search_chunks_for_invalidation(statement: str, k: int = 3000):
    results = []

    manual_chunks = search_manual_chunks_by_text(statement)
    results.extend(manual_chunks)

    semantic_chunks = search_active_chunks(statement, k)

    seen = set()
    final = []

    for d in results + semantic_chunks:
        h = d.metadata.get("chunk_hash")
        if h and h not in seen:
            seen.add(h)
            final.append(d)

    return final

def revoke_chunk_by_hash(chunk_hash: str, reason: str):
    cursor = oracle_conn.cursor()
    cursor.execute("""
        UPDATE RAG_CHUNKS_GOV
        SET status = 'REVOKED',
            revoked_at = SYSTIMESTAMP,
            revocation_reason = :reason
        WHERE chunk_hash = :h
    """, {"h": chunk_hash, "reason": reason})

    # Zero out the confidence of every edge derived from the revoked chunk.
    # Edges are keyed by CHUNK_HASH; SOURCE_URL holds the document URL, not the hash.
    cursor.execute(f"""
        UPDATE KG_EDGES_{GRAPH_NAME}
        SET CONFIDENCE_WEIGHT = 0
        WHERE CHUNK_HASH = :h
    """, {"h": chunk_hash})


    oracle_conn.commit()
    cursor.close()

def get_chunk_metadata(chunk_hashes: list[str]) -> dict:
    if not chunk_hashes:
        return {}

    cursor = oracle_conn.cursor()

    cursor.execute(f"""
        SELECT
            CHUNK_HASH,
            ORIGIN,
            CREATED_AT,
            STATUS
        FROM RAG_CHUNKS_GOV
        WHERE CHUNK_HASH IN ({",".join([f":h{i}" for i in range(len(chunk_hashes))])})
    """, {f"h{i}": h for i, h in enumerate(chunk_hashes)})

    rows = cursor.fetchall()
    cursor.close()

    return {
        r[0]: {
            "origin": r[1],
            "created_at": r[2],
            "status": r[3]
        }
        for r in rows
    }

def add_manual_knowledge_entry(
    *,
    text: str,
    author: str,
    reason: str,
    source: str = "MANUAL_INPUT",
    origin: str = "MANUAL",
    index_path: str = INDEX_PATH,
    also_update_graph: bool = True,
) -> str:
    text = (text or "").strip()
    reason = (reason or "").strip()
    author = (author or "").strip() or "unknown"

    h = chunk_hash(text)

    doc = Document(
        page_content=text,
        metadata={
            "source": source,
            "author": author,
            "reason": reason,
            "origin": origin,
            "created_at": datetime.utcnow().isoformat(),
            "chunk_hash": h,
        },
    )

    cur =
oracle_conn.cursor()
    cur.execute(
        """
        MERGE INTO RAG_CHUNKS_GOV g
        USING (SELECT :h AS h FROM dual) src
        ON (g.CHUNK_HASH = src.h)
        WHEN NOT MATCHED THEN
            INSERT (CHUNK_HASH, SOURCE, STATUS, CREATED_AT, ORIGIN)
            VALUES (:h, :src, 'ACTIVE', SYSTIMESTAMP, :origin)
        """,
        {"h": h, "src": source, "origin": origin},
    )
    oracle_conn.commit()
    cur.close()

    try:
        vs = FAISS.load_local(index_path, embeddings, allow_dangerous_deserialization=True)
        vs.add_documents([doc])
    except Exception:
        vs = FAISS.from_documents([doc], embedding=embeddings)

    vs.save_local(index_path)

    if also_update_graph:
        try:
            create_knowledge_graph([doc])
        except Exception:
            pass

    return h

def build_structured_evidence(docs, max_chunks=150):
    evidence = []

    for d in docs[:max_chunks]:
        quote = d.page_content[:3000]

        # πŸ”₯ strip any textual "Reference:" marker from the quote
        quote = re.sub(r"Reference:\s*\S+", "", quote)

        evidence.append({
            "quote": quote,
            "source": resolve_chunk_source(d)
        })

    return evidence

def get_context_from_requirement(req: dict):
    query_terms = extract_graph_keywords_from_requirement(req)

    docs = search_active_chunks(query_terms)

    graph_context = query_knowledge_graph(query_terms, top_k=50, min_score=1)

    neg = detect_negative_conflict(graph_context, req)

    if neg["conflict"]:
        print("⚠️ Negative capability found in graph.")
        graph_context.append((
            neg["service"],
            "NEGATIVE_CONFLICT_DETECTED",
            neg["capability"],
            None,
            999
        ))

    graph_terms = extract_terms_from_graph_text(graph_context)
    reranked_chunks = rerank_documents_with_graph_terms(
        docs,
        query_terms,
        graph_terms,
        top_k=30
    )
    structured_evidence = build_structured_evidence(reranked_chunks)

    return {
        "text_context": structured_evidence,
        "graph_context": graph_context,
        "requirement_type": req["requirement_type"],
        "subject": req["subject"],
        "expected_value": req.get("expected_value", "")
    }

# =========================
# Main Function
# =========================
def chat():
    PDF_FOLDER = Path("docs")  # folder containing the PDFs

    pdf_paths = sorted(
        str(p) for p in PDF_FOLDER.glob("*.pdf")
    )

    already_indexed_docs = load_previously_indexed_docs()
    updated_docs = set()

    try:
        vectorstore = FAISS.load_local(INDEX_PATH, embeddings, allow_dangerous_deserialization=True)
        print("βœ”οΈ FAISS index loaded.")
    except Exception:
        print("⚠️ FAISS index not found, creating a new one.")
        vectorstore = None

    new_chunks = []

    for pdf_path in tqdm(pdf_paths, desc="πŸ“„ Processing PDFs"):
        print(f" {os.path.basename(pdf_path)}")
        if pdf_path in already_indexed_docs:
            print(f"βœ… Document already indexed: {pdf_path}")
            continue
        full_text = read_pdfs(pdf_path=pdf_path)
        path_url = filename_to_url(os.path.basename(pdf_path))

        text_chunks = smart_split_text(full_text, max_chunk_size=10_000)
        overflow_buffer = ""

        for chunk in tqdm(text_chunks, desc="πŸ“„ Processing text chunks", dynamic_ncols=True, leave=False):
            current_text = overflow_buffer + chunk

            treated_text = semantic_chunking(current_text)

            if hasattr(treated_text, "content"):
                chapters = split_llm_output_into_chapters(treated_text.content)

                last_chapter = chapters[-1] if chapters else ""

                if last_chapter and not last_chapter.strip().endswith((".", "!", "?", "\n\n")):
                    print("πŸ“Œ Last chapter seems incomplete, saving for the next cycle")
                    overflow_buffer = last_chapter
                    chapters = chapters[:-1]
                else:
                    overflow_buffer = ""

                for chapter_text in chapters:
                    reference_url = "Reference: " + path_url
                    chapter_text = chapter_text + "\n" + reference_url
                    # doc = Document(page_content=chapter_text, metadata={"source": pdf_path, "reference": reference_url})

                    h = chunk_hash(chapter_text)

                    cursor = oracle_conn.cursor()

                    cursor.execute("""
                        MERGE INTO RAG_CHUNKS_GOV g
                        USING (SELECT :h AS h FROM dual) src
                        ON (g.CHUNK_HASH = src.h)
                        WHEN NOT MATCHED THEN
                            INSERT (
                                CHUNK_HASH,
+ SOURCE, + STATUS, + CREATED_AT, + ORIGIN + ) + VALUES ( + :h, + :src, + 'ACTIVE', + SYSTIMESTAMP, + :origin + ) + """, { + "h": h, + "src": pdf_path, + "origin": "PDF" + }) + oracle_conn.commit() + cursor.close() + + doc = Document( + page_content=chapter_text, + metadata={ + "source": pdf_path, + "reference": reference_url, + "chunk_hash": h, + "created_at": datetime.utcnow().isoformat() + } + ) + + new_chunks.append(doc) + print(f"βœ… New chapter indexed:\n{chapter_text}...\n") + + else: + print(f"[ERROR] semantic_chunking returned unexpected type: {type(treated_text)}") + + updated_docs.add(str(pdf_path)) + + if new_chunks: + if vectorstore: + vectorstore.add_documents(new_chunks) + else: + vectorstore = FAISS.from_documents(new_chunks, embedding=embeddings) + + vectorstore.save_local(INDEX_PATH) + save_indexed_docs(already_indexed_docs.union(updated_docs)) + print(f"πŸ’Ύ {len(new_chunks)} chunks added to FAISS index.") + + print("🧠 Building knowledge graph...") + create_knowledge_graph(new_chunks) + + else: + print("πŸ“ No new documents to index.") + + retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 50, "fetch_k": 100}) + + RFP_DECISION_TEMPLATE = """ + You are answering an RFP requirement with risk awareness. + + You MUST validate the answer ONLY using CAPABILITY nodes returned in GRAPH FACTS. + If no capability exists, answer MUST be "UNKNOWN". + + Requirement: + Type: {requirement_type} + Subject: {subject} + Expected value: {expected_value} + + **Document evidence**: + {text_context} + + **Graph evidence**: + {graph_context} + + Decision rules: + - Answer ONLY with YES, NO or PARTIAL + - If value differs, answer PARTIAL + - If not found, answer NO + + Interpretation rules (MANDATORY): + - If a capability is supported but requires reboot, downtime, or restart, it STILL counts as YES unless the requirement explicitly forbids it. 
+ - "Servidor em funcionamento" means the resource exists and is active before the operation, not that it must remain online without interruption. + - Only answer NO if the operation is NOT supported at all or requires destroying and recreating the resource. + - Reboot, restart, or brief unavailability MUST NOT be interpreted as lack of support. + + Confidence rules: + - HIGH: Explicit evidence directly answers the requirement + - MEDIUM: Evidence partially matches or requires light interpretation + - LOW: Requirement is ambiguous OR evidence is indirect OR missing + + Ambiguity rules: + - ambiguity_detected = true if: + - The requirement can be interpreted in more than one way + - Keywords are vague (e.g. "support", "integration", "capability") + - Evidence does not clearly bind to subject + expected value + + Service scope rules (MANDATORY): + - Evidence is valid ONLY if it refers to the SAME service category as the requirement. + - Do NOT use evidence from a different Oracle Cloud service to justify another. 
    - ALWAYS prefer a URL over a file path as the source

    SOURCE RULES:
    - Provide multiple source URLs
    - Provide a complete set of source URLs needed to understand the concepts involved
    - Provide one or more OVERVIEW source URLs
    - Provide one or more SOLUTION AND ARCHITECTURE source URLs

    OUTPUT CONSTRAINTS (MANDATORY):
    - Return ONLY a valid JSON object
    - Do NOT include explanations, comments, markdown, lists, or code fences
    - Do NOT write any text before or after the JSON
    - The response must start with an opening curly brace and end with a closing curly brace

    LANGUAGE RULE (MANDATORY): **DO TRANSLATION AS THE LAST STEP**
    - Write ALL textual values in {lang}
    - Keep JSON keys in English
    - Do NOT translate keys

    JSON schema (return exactly this structure):
    {{
        "answer": "YES | NO | PARTIAL",
        "confidence": "HIGH | MEDIUM | LOW",
        "ambiguity_detected": true,
        "confidence_reason": "<short reason>",
        "justification": "<short factual explanation>",
        "evidence": [
            {{ "quote": "...",
               "source": "..." }},
            {{ "quote": "...",
               "source": "..." }},
            {{ "quote": "...",
               "source": "..."
}}, + ] + }} + """ + prompt = PromptTemplate.from_template(RFP_DECISION_TEMPLATE) + + chain = ( + RunnableLambda(lambda q: { + "question": q, + "req": parse_rfp_requirement(q) + }) + | RunnableMap({ + "text_context": lambda x: get_context_from_requirement(x["req"])["text_context"], + "graph_context": lambda x: get_context_from_requirement(x["req"])["graph_context"], + "requirement_type": lambda x: x["req"]["requirement_type"], + "subject": lambda x: x["req"]["subject"], + "expected_value": lambda x: x["req"].get("expected_value", ""), + "lang": lambda x: normalize_lang(detect(x["question"])) + }) + | prompt + | llm + | StrOutputParser() + ) + + print("βœ… READY") + + while True: + query = input("❓ Question (or 'quit' to exit): ") + if query.lower() == "quit": + break + # response = chain.invoke(query) + response = answer_question_with_retries(query, max_attempts=3) + + print("\nπŸ“œ RESPONSE:\n") + print(response) + print("\n" + "=" * 80 + "\n") + + +def safe_parse_llm_answer(raw: str) -> dict: + try: + raw = raw.replace("```json", "").replace("```", "").strip() + return json.loads(raw) + except Exception: + return { + "answer": "NO", + "confidence": "LOW", + "confidence_reason": "Invalid JSON from LLM", + "justification": "", + "evidence": [] + } + +def answer_question_with_retries( + question: str, + max_attempts: int = 3 +) -> dict: + + best = None + best_score = -1 + + # πŸ”₯ Importante: precisamos do graph_context aqui + req = parse_rfp_requirement(question) + graph_context = query_knowledge_graph( + extract_graph_keywords_from_requirement(req), + top_k=20 + ) + + for attempt in range(1, max_attempts + 1): + + raw = answer_question(question) + parsed = safe_parse_llm_answer(raw) + + # ===================================================== + # πŸ”₯ AQUI entra o tratamento de conflito negativo + # ===================================================== + if parsed.get("answer") == "YES": + for row in graph_context: + service, edge_type, target, *_ = row + + if 
edge_type in ("DOES_NOT_SUPPORT", "NEGATIVE_CONFLICT_DETECTED"):
+                    print("❌ Conflict detected β€” forcing downgrade to NO")
+
+                    parsed["answer"] = "NO"
+                    parsed["confidence"] = "HIGH"
+                    parsed["confidence_reason"] = \
+                        "Graph contains explicit negative capability"
+
+                    break
+
+        # =====================================================
+
+        ans = parsed.get("answer", "NO")
+        conf = parsed.get("confidence", "LOW")
+
+        score = score_answer(parsed)
+
+        print(
+            f"πŸ” Attempt {attempt}: "
+            f"answer={ans} confidence={conf} score={score}"
+        )
+
+        if score > best_score:
+            best = parsed
+            best_score = score
+
+        # ideal stopping condition
+        if ans == "YES" and conf == "HIGH":
+            print("βœ… Optimal answer found (YES/HIGH)")
+            return parsed
+
+    print("⚠️ Optimal answer not found, returning best available")
+
+    best = validate_and_sanitize_sources(best)
+
+    return best
+
+# =========================
+# ARCHITECTURE SOURCE VALIDATION
+# =========================
+
+def _sanitize_source_field(src):
+    """
+    Normalize a string or a list of URLs.
+    """
+    if not src:
+        return INVALID_SOURCE_TOKEN
+
+    if isinstance(src, list):
+        cleaned = []
+        for s in src:
+            cleaned.append(s if url_exists(s) else INVALID_SOURCE_TOKEN)
+        return cleaned
+
+    return src if url_exists(src) else INVALID_SOURCE_TOKEN
+
+
+def validate_architecture_sources(plan: dict) -> dict:
+    if not plan:
+        return plan
+
+    # -------------------------
+    # components[].source
+    # -------------------------
+    comps = plan.get("architecture", {}).get("components", [])
+
+    for c in comps:
+        c["source"] = _sanitize_source_field(c.get("source"))
+
+    # -------------------------
+    # decisions[].evidence.source
+    # -------------------------
+    decisions = plan.get("decisions", [])
+
+    for d in decisions:
+        ev = d.get("evidence", {}) or {}
+        ev["source"] = _sanitize_source_field(ev.get("source"))
+        d["evidence"] = ev
+
+    return plan
+
+
+# =========================
+# LOADERS
+# =========================
+
+def load_all():
+    global vectorstore, retriever, graph, chain, RFP_DECISION_TEMPLATE, chain_architecture
+    print("πŸ”„ Loading FAISS + Graph + Chain...")
+
+    try:
+        vectorstore = FAISS.load_local(
+            INDEX_PATH,
+            embeddings,
+            allow_dangerous_deserialization=True
+        )
+
+        retriever = vectorstore.as_retriever(
+            search_type="similarity",
+            search_kwargs={"k": 50, "fetch_k": 100}
+        )
+    except Exception:
+        print("⚠️ FAISS index not available")
+
+    RFP_DECISION_TEMPLATE = """
+    You are answering an RFP requirement with risk awareness.
+
+    Requirement:
+    Type: {requirement_type}
+    Subject: {subject}
+    Expected value: {expected_value}
+
+    Document evidence:
+    {text_context}
+
+    Graph evidence:
+    {graph_context}
+
+    Decision rules:
+    - Answer ONLY with YES, NO or PARTIAL
+    - If value differs, answer PARTIAL
+    - If not found, answer NO
+
+    Interpretation rules (MANDATORY):
+    - If a capability is supported but requires reboot, downtime, or restart, it STILL counts as YES unless the requirement explicitly forbids it.
+    - "Servidor em funcionamento" (server in operation) means the resource exists and is active before the operation, not that it must remain online without interruption.
+    - Only answer NO if the operation is NOT supported at all or requires destroying and recreating the resource.
+    - Reboot, restart, or brief unavailability MUST NOT be interpreted as lack of support.
+
+    Confidence rules:
+    - HIGH: Explicit evidence directly answers the requirement
+    - MEDIUM: Evidence partially matches or requires light interpretation
+    - LOW: Requirement is ambiguous OR evidence is indirect OR missing
+
+    Ambiguity rules:
+    - ambiguity_detected = true if:
+        - The requirement can be interpreted in more than one way
+        - Keywords are vague (e.g. "support", "integration", "capability")
+        - Evidence does not clearly bind to subject + expected value
+
+    Service scope rules (MANDATORY):
+    - Do NOT use evidence from a different Oracle Cloud service to justify another.
+
+    SOURCE RULE:
+    - PROVIDE MULTIPLE SOURCE URLs
+    - PROVIDE A COMPLETE SET OF SOURCE URLs NEEDED TO UNDERSTAND THE CONCEPT
+    - PROVIDE one or more OVERVIEW SOURCE URLs
+    - PROVIDE one or more SOLUTION AND ARCHITECTURE SOURCE URLs
+
+    OUTPUT CONSTRAINTS (MANDATORY):
+    - Return ONLY a valid JSON object
+    - Do NOT include explanations, comments, markdown, lists, or code fences
+    - Do NOT write any text before or after the JSON
+    - The response must start with an opening curly brace and end with a closing curly brace
+
+    LANGUAGE RULE (MANDATORY): **DO THE TRANSLATION AS THE LAST STEP**
+    - Write ALL textual values in {lang}
+    - Keep JSON keys in English
+    - Do NOT translate keys
+
+    JSON schema (return exactly this structure):
+    {{
+        "answer": "YES | NO | PARTIAL",
+        "confidence": "HIGH | MEDIUM | LOW",
+        "ambiguity_detected": true,
+        "confidence_reason": "<short reason>",
+        "justification": "<short factual explanation>",
+        "evidence": [
+            {{ "quote": "...",
+               "source": "..." }},
+            {{ "quote": "...",
+               "source": "..." }},
+            {{ "quote": "...",
+               "source": "..."
}}
+        ]
+
+    }}
+    """
+    prompt = PromptTemplate.from_template(RFP_DECISION_TEMPLATE)
+
+    chain = (
+        RunnableLambda(lambda q: {
+            "question": q,
+            "req": parse_rfp_requirement(q)
+        })
+        | RunnableMap({
+            "text_context": lambda x: get_context_from_requirement(x["req"])["text_context"],
+            "graph_context": lambda x: get_context_from_requirement(x["req"])["graph_context"],
+            "requirement_type": lambda x: x["req"]["requirement_type"],
+            "subject": lambda x: x["req"]["subject"],
+            "expected_value": lambda x: x["req"].get("expected_value", ""),
+            "lang": lambda x: normalize_lang(detect(x["question"]))
+        })
+        | prompt
+        | llm
+        | StrOutputParser()
+    )
+
+    chain_architecture = build_architecture_chain()
+
+    print("βœ… Loaded!")
+
+def answer_question(
+    question: str,
+    max_attempts: int = MAX_ATTEMPTS
+) -> str:
+
+    def worker(i):
+        try:
+            # raw = chain.invoke(question)
+            raw = call_llm(chain.invoke, question)
+
+            parsed = safe_parse_llm_answer(raw)
+            score = score_answer(parsed)
+
+            return {
+                "attempt": i,
+                "raw": raw,
+                "parsed": parsed,
+                "score": score
+            }
+
+        except Exception as e:
+            print(f"❌ Attempt {i} failed: {e}")
+
+            return {
+                "attempt": i,
+                "raw": "",
+                "parsed": {},
+                "score": -1
+            }
+
+    results = []
+
+    with ThreadPoolExecutor(max_workers=max_attempts) as executor:
+        futures = [
+            executor.submit(worker, i)
+            for i in range(1, max_attempts + 1)
+        ]
+
+        for f in as_completed(futures):
+            try:
+                r = f.result()
+                if r["score"] >= 0:
+                    results.append(r)
+            except Exception as e:
+                print("❌ Future crashed:", e)
+
+    # πŸ”₯ absolute fallback
+    if not results:
+        print("⚠️ All attempts failed β€” returning safe fallback")
+        return json.dumps({
+            "answer": "NO",
+            "confidence": "LOW",
+            "ambiguity_detected": True,
+            "confidence_reason": "All LLM attempts failed",
+            "justification": "",
+            "evidence": []
+        })
+
+    results.sort(key=lambda r: r["score"], reverse=True)
+
+    for r in results:
+        print(
+            f"⚑ Attempt {r['attempt']} | "
+            f"answer={r['parsed'].get('answer')} | "
+            f"confidence={r['parsed'].get('confidence')} | "
+            f"score={r['score']}"
+        )
+
+    best = results[0]
+
+    print(f"\nπŸ† Selected attempt {best['attempt']} (score={best['score']})")
+
+    #return best["raw"]
+    sanitized = validate_and_sanitize_sources(best["parsed"])
+    return json.dumps(sanitized)
+
+
+def reload_all():
+    load_all()
+
+reload_all()
+
+# πŸš€ Run
+if __name__ == "__main__":
+    chat()
+    #rebuild_graph_from_faiss(INDEX_PATH, reverse_resume=True)
diff --git a/files/pgql_oracle23ai.sql b/files/pgql_oracle23ai.sql
new file mode 100644
index 0000000..86b45fb
--- /dev/null
+++ b/files/pgql_oracle23ai.sql
@@ -0,0 +1,112 @@
+-- Entity table
+CREATE TABLE entities (
+    id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
+    name VARCHAR2(255) UNIQUE NOT NULL
+);
+
+-- Relation table
+CREATE TABLE relations (
+    id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
+    from_entity_id NUMBER REFERENCES entities(id),
+    to_entity_id NUMBER REFERENCES entities(id),
+    relation_type VARCHAR2(255),
+    source_text VARCHAR2(1000)
+);
+
+BEGIN
+    ordsadmin.graph_view_admin.create_graph_view(
+        graph_view_name => 'my_graph',
+        vertex_table_names => 'ENTITIES',
+        edge_table_names => 'RELATIONS',
+        vertex_id_column => 'ID',
+        edge_source_column => 'FROM_ENTITY_ID',
+        edge_destination_column => 'TO_ENTITY_ID'
+    );
+END;
+/
+
+
+CREATE PROPERTY GRAPH my_graph
+    VERTEX TABLES (ENTITIES
+        KEY (ID)
+        LABEL ENTITIES
+        PROPERTIES (NAME))
+    EDGE TABLES (RELATIONS
+        KEY (ID)
+        SOURCE KEY (FROM_ENTITY_ID) REFERENCES ENTITIES(ID)
+        DESTINATION KEY (TO_ENTITY_ID) REFERENCES ENTITIES(ID)
+        LABEL RELATIONS
+        PROPERTIES (RELATION_TYPE, SOURCE_TEXT))
+    options (PG_PGQL);
+
+-- Drop the old indexes, if necessary
+DROP INDEX ent_name_text_idx;
+DROP INDEX rel_type_text_idx;
+
+-- Recreate them with the correct index type
+CREATE INDEX ent_name_text_idx ON ENTITIES(NAME) INDEXTYPE IS CTXSYS.CONTEXT;
+CREATE INDEX rel_type_text_idx ON RELATIONS(RELATION_TYPE) INDEXTYPE IS CTXSYS.CONTEXT;
+
+EXEC
CTX_DDL.SYNC_INDEX('ENT_NAME_TEXT_IDX');
+EXEC CTX_DDL.SYNC_INDEX('REL_TYPE_TEXT_IDX');
+
+SELECT from_entity,
+       relation_type,
+       to_entity
+FROM GRAPH_TABLE(
+        my_graph
+        MATCH (e1 is ENTITIES)-[r is RELATIONS]->(e2 is ENTITIES)
+        WHERE CONTAINS(LOWER(e1.name), 'gateway') > 0
+           OR CONTAINS(LOWER(e2.name), 'gateway') > 0
+           OR CONTAINS(LOWER(r.RELATION_TYPE), 'gateway') > 0
+        COLUMNS (
+            e1.name AS from_entity, r.RELATION_TYPE AS relation_type, e2.name AS to_entity
+        )
+    )
+FETCH FIRST 20 ROWS ONLY;
+
+---------------
+-- # 2026-01-29 - VECTOR 23ai
+
+CREATE TABLE rag_docs (
+    id NUMBER GENERATED BY DEFAULT AS IDENTITY,
+    content CLOB,
+    source VARCHAR2(1000),
+    chunk_hash VARCHAR2(64),
+    status VARCHAR2(20),
+    embed VECTOR(1024)
+);
+
+CREATE VECTOR INDEX rag_docs_idx
+ON rag_docs(embed)
+ORGANIZATION HNSW
+DISTANCE COSINE;
+
+-------------------
+-- #2026-02-07 - app_users
+
+DROP TABLE app_users;
+
+CREATE TABLE app_users (
+    id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
+
+    username VARCHAR2(100) UNIQUE,
+    name VARCHAR2(200),
+    email VARCHAR2(200) UNIQUE,
+
+    user_role VARCHAR2(50),
+
+    password_hash VARCHAR2(300),
+
+    active NUMBER(1) DEFAULT 1,
+
+    reset_token VARCHAR2(300),
+    reset_expire TIMESTAMP,
+
+    must_change_password NUMBER(1) DEFAULT 0,
+
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+
+CREATE INDEX idx_users_email ON app_users(email);
+CREATE INDEX idx_users_token ON app_users(reset_token);
diff --git a/files/process_excel_rfp.py b/files/process_excel_rfp.py
new file mode 100644
index 0000000..e3e206d
--- /dev/null
+++ b/files/process_excel_rfp.py
@@ -0,0 +1,286 @@
+import pandas as pd
+import requests
+import json
+from pathlib import Path
+import os
+import re
+
+# =========================
+# Configuration
+# =========================
+EXCEL_PATH = "<YOUR_EXCEL_XLSX_FILE>"
+API_URL = "http://demo-orcl-api-ai.hoshikawa.com.br:8101/rest/chat"
+QUERY_LOG_FILE = Path("queries_with_low_confidence_or_no.txt")
+TIMEOUT = 120
+APP_USER =
os.environ.get("APP_USER", "<YOUR_USER_NAME>")
+APP_PASS = os.environ.get("APP_PASS", "<YOUR_PASSWORD>")
+
+CONTEXT_COLUMNS = [1, 2]   # USE IF YOU HAVE A NON-HIERARCHICAL STRUCTURE
+ORDER_COLUMN = 0           # INDEX OF THE COLUMN THAT HOLDS THE ITEM NUMBER/ORDER
+QUESTION_COLUMN = 4        # INDEX OF THE QUESTION/TEXT COLUMN to submit to the RFP AI
+ALLOWED_STRUCTURES = [
+    "x.x",
+    "x.x.x",
+    "x.x.x.x",
+    "x.x.x.x.x",
+    "x.x.x.x.x.x"
+]
+ALLOWED_SEPARATORS = [".", "-", "/", "_", ">"]
+
+ANSWER_COL = "ANSWER"      # NAME OF THE COLUMN for the YES/NO/PARTIAL result
+JSON_COL = "RESULT_JSON"   # NAME OF THE COLUMN for the RFP AI automation results
+
+CONFIDENCE_COL = "CONFIDENCE"
+AMBIGUITY_COL = "AMBIGUITY"
+CONF_REASON_COL = "CONFIDENCE_REASON"
+JUSTIFICATION_COL = "JUSTIFICATION"
+
+# =========================
+# Helpers
+# =========================
+
+def normalize_structure(num: str, separators: list[str]) -> str:
+    if not num:
+        return ""
+
+    pattern = "[" + re.escape("".join(separators)) + "]"
+    return re.sub(pattern, ".", num.strip())
+
+def should_process(num: str, allowed_patterns: list[str], separators: list[str]) -> bool:
+    normalized = normalize_structure(num, separators)
+
+    if not is_hierarchical(normalized):
+        return True
+
+    depth = normalized.count(".") + 1
+
+    allowed_depths = {
+        pattern.count(".") + 1
+        for pattern in allowed_patterns
+    }
+
+    return depth in allowed_depths
+
+def register_failed_query(query: str, answer: str, confidence: str):
+    QUERY_LOG_FILE.parent.mkdir(parents=True, exist_ok=True)
+    print("Negative/doubtful result")
+    with QUERY_LOG_FILE.open("a", encoding="utf-8") as f:
+        f.write("----------------------------\n")
+        f.write(f"Query:\n{query}\n\n")
+        f.write(f"Answer: {answer}\n")
+        f.write(f"Confidence: {confidence}\n\n")
+
+def normalize_num(num: str) -> str:
+    return num.strip().rstrip(".")
+
+def build_question_from_columns(row, context_cols: list[int], question_col: int) -> str:
+    context_parts = []
+
+    for col in context_cols:
+        value = str(row.iloc[col]).strip()
+        if value:
context_parts.append(value)
+
+    question = str(row.iloc[question_col]).strip()
+
+    if not context_parts:
+        return question
+
+    context = " > ".join(dict.fromkeys(context_parts))
+    return f'Considering the context of "{context}", {question}'
+
+def build_question(hierarchy: dict, current_num: str) -> str:
+    if not is_hierarchical(current_num):
+        return hierarchy[current_num]["text"]
+
+    parts = current_num.split(".")
+
+    main_subject = None
+    main_key = None
+
+    # highest existing ancestor
+    for i in range(1, len(parts) + 1):
+        key = ".".join(parts[:i])
+        if key in hierarchy:
+            main_subject = hierarchy[key]["text"]
+            main_key = key
+            break
+
+    if not main_subject:
+        raise ValueError(f"No valid root subject for {current_num}")
+
+    subtopics = []
+    for i in range(1, len(parts)):
+        key = ".".join(parts[: i + 1])
+        if key in hierarchy and key != main_key:
+            subtopics.append(hierarchy[key]["text"])
+
+    specific = hierarchy[current_num]["text"]
+
+    if subtopics:
+        context = " > ".join(subtopics)
+        return (
+            f'Considering the context of "{context}", '
+            f'what is the {specific} of {main_subject}?'
+        )
+
+    return f'What is the {specific} of {main_subject}?'
+
+def normalize_api_response(api_response: dict) -> dict:
+    if isinstance(api_response, dict) and "result" in api_response and isinstance(api_response["result"], dict):
+        if "answer" in api_response["result"]:
+            return api_response["result"]
+    return api_response
+
+def call_api(question: str) -> dict:
+    payload = {"question": question}
+
+    response = requests.post(
+        API_URL,
+        json=payload,
+        auth=(APP_USER, APP_PASS),  # πŸ” BASIC AUTH
+        timeout=TIMEOUT
+    )
+
+    response.raise_for_status()
+    return response.json()
+
+def is_explicit_url(source: str) -> bool:
+    return source.startswith("http://") or source.startswith("https://")
+
+def is_hierarchical(num: str) -> bool:
+    return bool(
+        num
+        and "."
in num + and all(p.isdigit() for p in num.split(".")) + ) + +def normalize_evidence_sources(evidence: list[dict]) -> list[dict]: + normalized = [] + + for ev in evidence: + source = ev.get("source", "").strip() + quote = ev.get("quote", "").strip() + + if is_explicit_url(source): + normalized.append(ev) + continue + + normalized.append({ + "quote": quote, + "source": source or "Oracle Cloud Infrastructure documentation" + }) + + return normalized + +# ========================= +# Main +# ========================= +def main(): + df = pd.read_excel(EXCEL_PATH, dtype=str).fillna("") + + if ANSWER_COL not in df.columns: + df[ANSWER_COL] = "" + + if JSON_COL not in df.columns: + df[JSON_COL] = "" + + for col in [ + ANSWER_COL, + JSON_COL, + CONFIDENCE_COL, + AMBIGUITY_COL, + CONF_REASON_COL, + JUSTIFICATION_COL + ]: + if col not in df.columns: + df[col] = "" + + hierarchy = {} + for idx, row in df.iterrows(): + num = normalize_num(str(row.iloc[ORDER_COLUMN])) + text = str(row.iloc[QUESTION_COLUMN]).strip() + + if num and text: + hierarchy[num] = { + "text": text, + "row": idx + } + + for num, info in hierarchy.items(): + if not should_process(num, ALLOWED_STRUCTURES, ALLOWED_SEPARATORS): + print(f"⏭️ SKIPPED (structure not allowed): {num}") + continue + + try: + row = df.loc[info["row"]] + num = normalize_num(str(row.iloc[ORDER_COLUMN])) + + if is_hierarchical(num): + question = build_question(hierarchy, num) + else: + question = build_question_from_columns( + row, + CONTEXT_COLUMNS, + QUESTION_COLUMN + ) + + print(f"\n❓ QUESTION SENT TO API:\n{question}") + + api_response_raw = call_api(question) + api_response = normalize_api_response(api_response_raw) + + if "evidence" in api_response: + api_response["evidence"] = normalize_evidence_sources( + api_response.get("evidence", []) + ) + + if ( + api_response.get("answer") == "NO" + or api_response.get("confidence") in ("MEDIUM", "LOW") + ): + register_failed_query( + query=question, + answer=api_response.get("answer", 
""), + confidence=api_response.get("confidence", "") + ) + + print("πŸ“„ JSON RESPONSE (normalized):") + print(json.dumps(api_response, ensure_ascii=False, indent=2)) + print("-" * 80) + + df.at[info["row"], ANSWER_COL] = api_response.get("answer", "ERROR") + df.at[info["row"], CONFIDENCE_COL] = api_response.get("confidence", "") + df.at[info["row"], AMBIGUITY_COL] = str(api_response.get("ambiguity_detected", "")) + df.at[info["row"], CONF_REASON_COL] = api_response.get("confidence_reason", "") + df.at[info["row"], JUSTIFICATION_COL] = api_response.get("justification", "") + + df.at[info["row"], JSON_COL] = json.dumps(api_response, ensure_ascii=False) + + except Exception as e: + error_json = { + "answer": "ERROR", + "confidence": "LOW", + "ambiguity_detected": True, + "confidence_reason": "Processing error", + "justification": str(e), + "evidence": [] + } + + df.at[info["row"], ANSWER_COL] = "ERROR" + df.at[info["row"], CONFIDENCE_COL] = "LOW" + df.at[info["row"], AMBIGUITY_COL] = "True" + df.at[info["row"], CONF_REASON_COL] = "Processing error" + df.at[info["row"], JUSTIFICATION_COL] = str(e) + df.at[info["row"], JSON_COL] = json.dumps(error_json, ensure_ascii=False) + + print(f"❌ ERROR processing item {num}: {e}") + + output_path = Path(EXCEL_PATH).with_name( + Path(EXCEL_PATH).stem + "_result.xlsx" + ) + df.to_excel(output_path, index=False) + + print(f"\nβœ… Saved in: {output_path}") + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/files/rfp_process.py b/files/rfp_process.py new file mode 100644 index 0000000..db264ef --- /dev/null +++ b/files/rfp_process.py @@ -0,0 +1,428 @@ +import pandas as pd +import requests +import json +from pathlib import Path +import os +import re +import logging +from config_loader import load_config +from concurrent.futures import ThreadPoolExecutor, as_completed +import time +from queue import Queue +import threading +from oci_genai_llm_graphrag_rerank_rfp import answer_question + +config = 
load_config()
+
+logger = logging.getLogger(__name__)
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s"
+)
+
+EXCEL_QUEUE = Queue()
+
+# =========================
+# Configuration
+# =========================
+API_URL = "http://127.0.0.1:" + str(config.service_port) + "/chat"
+QUERY_LOG_FILE = Path("queries_with_low_confidence_or_no.txt")
+
+CONTEXT_COLUMNS = [1, 2]   # USE IF YOU HAVE A NON-HIERARCHICAL STRUCTURE
+ORDER_COLUMN = 0           # INDEX OF THE COLUMN THAT HOLDS THE ITEM NUMBER/ORDER
+QUESTION_COLUMN = 4        # INDEX OF THE QUESTION/TEXT COLUMN to submit to the RFP AI
+ALLOWED_STRUCTURES = [
+    "x.x",
+    "x.x.x",
+    "x.x.x.x",
+    "x.x.x.x.x",
+    "x.x.x.x.x.x"
+]
+ALLOWED_SEPARATORS = [".", "-", "/", "_", ">"]
+
+ANSWER_COL = "ANSWER"      # NAME OF THE COLUMN for the YES/NO/PARTIAL result
+JSON_COL = "RESULT_JSON"   # NAME OF THE COLUMN for the RFP AI automation results
+ARCH_PLAN_COL = "ARCH_PLAN"
+MERMAID_COL = "MERMAID"
+
+CONFIDENCE_COL = "CONFIDENCE"
+AMBIGUITY_COL = "AMBIGUITY"
+CONF_REASON_COL = "CONFIDENCE_REASON"
+JUSTIFICATION_COL = "JUSTIFICATION"
+
+# =========================
+# Helpers
+# =========================
+
+def normalize_structure(num: str, separators: list[str]) -> str:
+    if not num:
+        return ""
+
+    pattern = "[" + re.escape("".join(separators)) + "]"
+    return re.sub(pattern, ".", num.strip())
+
+def should_process(num: str, allowed_patterns: list[str], separators: list[str]) -> bool:
+    normalized = normalize_structure(num, separators)
+
+    if not is_hierarchical(normalized):
+        return True
+
+    depth = normalized.count(".") + 1
+
+    allowed_depths = {
+        pattern.count(".") + 1
+        for pattern in allowed_patterns
+    }
+
+    return depth in allowed_depths
+
+def register_failed_query(query: str, answer: str, confidence: str):
+    QUERY_LOG_FILE.parent.mkdir(parents=True, exist_ok=True)
+    logger.info("Negative/doubtful result")
+    with QUERY_LOG_FILE.open("a", encoding="utf-8") as f:
+        f.write("----------------------------\n")
f.write(f"Query:\n{query}\n\n")
+        f.write(f"Answer: {answer}\n")
+        f.write(f"Confidence: {confidence}\n\n")
+
+def normalize_num(num: str) -> str:
+    return num.strip().rstrip(".")
+
+def build_question_from_columns(row, context_cols: list[int], question_col: int) -> str:
+    context_parts = []
+
+    for col in context_cols:
+        value = str(row.iloc[col]).strip()
+        if value:
+            context_parts.append(value)
+
+    question = str(row.iloc[question_col]).strip()
+
+    if not context_parts:
+        return question
+
+    context = " > ".join(dict.fromkeys(context_parts))
+    return f'Considering the context of "{context}", {question}'
+
+def build_question(hierarchy: dict, current_num: str) -> str:
+    if not is_hierarchical(current_num):
+        return hierarchy[current_num]["text"]
+
+    parts = current_num.split(".")
+
+    main_subject = None
+    main_key = None
+
+    # highest existing ancestor
+    for i in range(1, len(parts) + 1):
+        key = ".".join(parts[:i])
+        if key in hierarchy:
+            main_subject = hierarchy[key]["text"]
+            main_key = key
+            break
+
+    if not main_subject:
+        raise ValueError(f"No valid root subject for {current_num}")
+
+    subtopics = []
+    for i in range(1, len(parts)):
+        key = ".".join(parts[: i + 1])
+        if key in hierarchy and key != main_key:
+            subtopics.append(hierarchy[key]["text"])
+
+    specific = hierarchy[current_num]["text"]
+
+    if subtopics:
+        context = " > ".join(subtopics)
+        return (
+            f'Considering the context of "{context}", '
+            f'what is the {specific} of {main_subject}?'
+        )
+
+    return f'What is the {specific} of {main_subject}?'
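Both `process_excel_rfp.py` and `rfp_process.py` rely on the same numbering helpers to decide which spreadsheet rows become questions: separators are normalized to dots, then the item's depth is matched against the allowed `x.x...` patterns. A standalone sketch of that filtering logic (the three functions are copied from the file above; the demo values are illustrative):

```python
import re

def normalize_structure(num: str, separators: list[str]) -> str:
    # Replace any allowed separator ("-", "/", "_", ...) with a dot
    if not num:
        return ""
    pattern = "[" + re.escape("".join(separators)) + "]"
    return re.sub(pattern, ".", num.strip())

def is_hierarchical(num: str) -> bool:
    # "2.1.3" is hierarchical; "A1" or "Appendix" is not
    return bool(num and "." in num and all(p.isdigit() for p in num.split(".")))

def should_process(num: str, allowed_patterns: list[str], separators: list[str]) -> bool:
    # Non-hierarchical rows are always processed; hierarchical rows
    # only when their depth matches one of the allowed "x.x..." patterns
    normalized = normalize_structure(num, separators)
    if not is_hierarchical(normalized):
        return True
    depth = normalized.count(".") + 1
    allowed_depths = {pattern.count(".") + 1 for pattern in allowed_patterns}
    return depth in allowed_depths

if __name__ == "__main__":
    patterns = ["x.x", "x.x.x"]
    seps = [".", "-", "/", "_", ">"]
    print(should_process("2-1-3", patterns, seps))    # depth 3 -> processed
    print(should_process("2.1.3.4", patterns, seps))  # depth 4 -> skipped
    print(should_process("Appendix", patterns, seps)) # not hierarchical -> processed
```

Note that normalization only affects depth matching: the original `num` string (dashes and all) is still what gets logged and looked up in the hierarchy dictionary.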
+
+def normalize_api_response(api_response) -> dict:
+    # --------------------------------
+    # πŸ”₯ STRING β†’ JSON
+    # --------------------------------
+    if isinstance(api_response, str):
+        try:
+            api_response = json.loads(api_response)
+        except Exception:
+            return {"error": f"Invalid string response: {api_response[:300]}"}
+
+    if not isinstance(api_response, dict):
+        return {"error": f"Invalid type: {type(api_response)}"}
+
+    if "error" in api_response:
+        return api_response
+
+    if isinstance(api_response.get("result"), dict):
+        return api_response["result"]
+
+    if "answer" in api_response:
+        return api_response
+
+    return {"error": f"Unexpected format: {str(api_response)[:300]}"}
+
+def call_api(
+    question: str,
+    *,
+    api_url: str,
+    timeout: int,
+    auth_user: str | None,
+    auth_pass: str | None,
+) -> dict:
+
+    payload = {"question": question}
+
+    response = requests.post(
+        api_url,
+        json=payload,
+        auth=(auth_user, auth_pass) if auth_user else None,
+        timeout=timeout
+    )
+
+    if response.status_code >= 500:
+        raise RuntimeError(
+            f"Server error {response.status_code}: {response.text}"
+        )
+
+    text = response.text.lower()
+
+    if "gateway time" in text or "timeout" in text:
+        raise RuntimeError(response.text)
+
+    try:
+        return response.json()
+    except ValueError:
+        raise RuntimeError(
+            f"Invalid JSON: {response.text[:300]}"
+        )
+
+def is_explicit_url(source: str) -> bool:
+    return source.startswith("http://") or source.startswith("https://")
+
+def is_hierarchical(num: str) -> bool:
+    return bool(
+        num
+        and "."
in num + and all(p.isdigit() for p in num.split(".")) + ) + +def normalize_evidence_sources(evidence: list[dict]) -> list[dict]: + normalized = [] + + for ev in evidence: + source = ev.get("source", "").strip() + quote = ev.get("quote", "").strip() + + if is_explicit_url(source): + normalized.append(ev) + continue + + normalized.append({ + "quote": quote, + "source": source or "Oracle Cloud Infrastructure documentation" + }) + + return normalized + +def build_justification_with_links(justification: str, evidence: list[dict]) -> str: + """ + Combine justification text + evidence URLs in a readable format for Excel. + """ + + if not evidence: + return justification or "" + + urls = [] + + for ev in evidence: + src = ev.get("source", "").strip() + if is_explicit_url(src): + urls.append(src) + + if not urls: + return justification or "" + + links_text = "\n".join(f"- {u}" for u in sorted(set(urls))) + + if justification: + return f"{justification}\n\nSources:\n{links_text}" + + return f"Sources:\n{links_text}" + +def call_api_with_retry(question, max_minutes=30, **kwargs): + start = time.time() + attempt = 0 + delay = 5 + + while True: + try: + return call_api(question, **kwargs) + + except Exception as e: + attempt += 1 + elapsed = time.time() - start + + msg = str(e).lower() + if any(x in msg for x in ["401", "403", "400", "invalid json format"]): + raise + + if elapsed > max_minutes * 60: + raise RuntimeError( + f"Timeout after {attempt} attempts / {int(elapsed)}s" + ) + + logger.info( + f"πŸ” Retry {attempt} | waiting {delay}s | {e}" + ) + + time.sleep(delay) + + delay = min(delay * 1.5, 60) + +def call_local_engine(question: str) -> dict: + return answer_question(question) + +# ========================= +# Main +# ========================= +def process_excel_rfp( + input_excel: Path, + output_excel: Path, + *, + api_url: str, + timeout: int = 120, + auth_user: str | None = None, + auth_pass: str | None = None, +) -> Path: + + df = pd.read_excel(input_excel, 
dtype=str).fillna("")
+
+    for col in [
+        ANSWER_COL,
+        JSON_COL,
+        CONFIDENCE_COL,
+        AMBIGUITY_COL,
+        CONF_REASON_COL,
+        JUSTIFICATION_COL
+    ]:
+        if col not in df.columns:
+            df[col] = ""
+
+    hierarchy = {}
+    for idx, row in df.iterrows():
+        num = normalize_num(str(row.iloc[ORDER_COLUMN]))
+        text = str(row.iloc[QUESTION_COLUMN]).strip()
+
+        if num and text:
+            hierarchy[num] = {"text": text, "row": idx}
+
+    # =========================================
+    # πŸ”₯ PARALLEL WORKER
+    # =========================================
+    def process_row(num, info):
+        try:
+            row = df.loc[info["row"]]
+
+            if is_hierarchical(num):
+                question = build_question(hierarchy, num)
+            else:
+                question = build_question_from_columns(
+                    row,
+                    CONTEXT_COLUMNS,
+                    QUESTION_COLUMN
+                )
+
+            logger.info(f"\nπŸ”Έ QUESTION {num} SENT TO API:\n{question}")
+
+            # raw = call_api_with_retry(
+            #     question,
+            #     api_url=api_url,
+            #     timeout=timeout,
+            #     auth_user=auth_user,
+            #     auth_pass=auth_pass
+            # )
+            raw = call_local_engine(question)
+
+            resp = normalize_api_response(raw)
+
+            return info["row"], question, resp
+
+        except Exception as e:
+            return info["row"], "", {"error": str(e)}
+
+    # =========================================
+    # PARALLEL EXECUTION - FUTURE - OCI CURRENTLY ACCEPTS ONLY 1 WORKER HERE
+    # =========================================
+    futures = []
+
+    with ThreadPoolExecutor(max_workers=1) as executor:
+
+        for num, info in hierarchy.items():
+
+            if not should_process(num, ALLOWED_STRUCTURES, ALLOWED_SEPARATORS):
+                continue
+
+            futures.append(executor.submit(process_row, num, info))
+
+        for f in as_completed(futures):
+
+            row_idx, question, api_response = f.result()
+            api_response = normalize_api_response(api_response)
+
+            try:
+                if "error" in api_response:
+                    raise Exception(api_response["error"])
+
+                if "evidence" in api_response:
+                    api_response["evidence"] = normalize_evidence_sources(
+                        api_response["evidence"]
+                    )
+
+                if (
+                    api_response.get("answer") == "NO"
+                    or api_response.get("confidence") in ("MEDIUM",
"LOW") + ): + register_failed_query( + query=question, + answer=api_response.get("answer", ""), + confidence=api_response.get("confidence", "") + ) + + df.at[row_idx, ANSWER_COL] = api_response.get("answer", "ERROR") + df.at[row_idx, CONFIDENCE_COL] = api_response.get("confidence", "") + df.at[row_idx, AMBIGUITY_COL] = str(api_response.get("ambiguity_detected", "")) + df.at[row_idx, CONF_REASON_COL] = api_response.get("confidence_reason", "") + df.at[row_idx, JUSTIFICATION_COL] = build_justification_with_links( + api_response.get("justification", ""), + api_response.get("evidence", []) + ) + df.at[row_idx, JSON_COL] = json.dumps(api_response, ensure_ascii=False) + + logger.info(json.dumps(api_response, indent=2)) + + except Exception as e: + df.at[row_idx, ANSWER_COL] = "ERROR" + df.at[row_idx, CONFIDENCE_COL] = "LOW" + df.at[row_idx, JUSTIFICATION_COL] = str(e) + + logger.info(f"❌ ERROR: {e}") + + df.to_excel(output_excel, index=False) + + return output_excel + +if __name__ == "__main__": + import sys + + input_path = Path(sys.argv[1]) + output_path = input_path.with_name(input_path.stem + "_result.xlsx") + + process_excel_rfp( + input_excel=input_path, + output_excel=output_path, + api_url=API_URL, + ) \ No newline at end of file diff --git a/files/source_code.zip b/files/source_code.zip new file mode 100644 index 0000000..541d58a Binary files /dev/null and b/files/source_code.zip differ diff --git a/files/templates/admin_menu.html b/files/templates/admin_menu.html new file mode 100644 index 0000000..a10e0cb --- /dev/null +++ b/files/templates/admin_menu.html @@ -0,0 +1,67 @@ +{% extends "base.html" %} +{% block content %} + + <h1>βš™οΈ Admin Panel</h1> + + <!-- USERS --> + <div class="card"> + + <h2>πŸ‘€ Users</h2> + + <p class="small"> + Create, edit and manage system users and permissions. 
+ </p> + + <a href="{{ url_for('users.list_users') }}" class="btn"> + Open User Management + </a> + + </div> + + + <!-- KNOWLEDGE --> + <div class="card"> + + <h2>πŸ” Knowledge Governance</h2> + + <p class="small"> + Invalidate outdated knowledge or manually add validated information to the RAG base. + </p> + + <a href="{{ url_for('admin.invalidate_page') }}" class="btn"> + Open Governance Tools + </a> + + </div> + + <div class="card"> + + <h2>♻️ Maintenance</h2> + + <p class="small"> + Reload all knowledge indexes, embeddings and caches without restarting the server. + </p> + + <button class="btn" onclick="rebootSystem()"> + Reload Knowledge + </button> + + <pre id="rebootResult" style="margin-top:10px;"></pre> + + </div> + <script> + async function rebootSystem() { + + const box = document.getElementById("rebootResult"); + box.textContent = "⏳ Reloading..."; + + const res = await fetch("/admin/reboot", { + method: "POST" + }); + + const data = await res.json(); + + box.textContent = "βœ… " + data.message; + } + </script> +{% endblock %} \ No newline at end of file diff --git a/files/templates/base.html b/files/templates/base.html new file mode 100644 index 0000000..ac6bf08 --- /dev/null +++ b/files/templates/base.html @@ -0,0 +1,321 @@ +<!DOCTYPE html> +<html lang="en"> +<head> + <meta charset="UTF-8" /> + <meta name="viewport" content="width=device-width, initial-scale=1" /> + <title>ORACLE RFP AI Platform + + + + + + + + + + +
+ + {% with messages = get_flashed_messages(with_categories=true) %} + {% if messages %} + {% for category, message in messages %} +
+ {{ message }} +
+ {% endfor %} + {% endif %} + {% endwith %} + + {% block content %} + {% endblock %} + +
+ + + \ No newline at end of file diff --git a/files/templates/excel/job_status.html b/files/templates/excel/job_status.html new file mode 100644 index 0000000..e51bdb0 --- /dev/null +++ b/files/templates/excel/job_status.html @@ -0,0 +1,82 @@ +{% extends "base.html" %} + +{% block content %} + + + +
+ +

Excel Processing

+

Job ID: {{ job_id }}

+ +
+
+

Processing...

+
+ + + + + +
+ + + +{% endblock %} \ No newline at end of file diff --git a/files/templates/index.html b/files/templates/index.html new file mode 100644 index 0000000..edd6010 --- /dev/null +++ b/files/templates/index.html @@ -0,0 +1,1023 @@ +{% extends "base.html" %} +{% block content %} + + + + + + Oracle AI RFP Response + + + + + + + +
+ +

+ Oracle LAD A-Team
+ Cristiano Hoshikawa
+ cristiano.hoshikawa@oracle.com +

+ +

+ Tutorial
+
+ Oracle Learn – OCI Generative AI PDF RAG
+
+ + Oracle GraphRAG for RFP Validation + + +

+ + REST Service Endpoint
+ + {{ api_base_url }}/rest/chat + +

+ +
+ +

Overview

+ +

+ This application provides an AI-assisted RFP response engine for + Oracle Cloud Infrastructure (OCI). + It analyzes natural language requirements and returns a + structured, evidence-based technical response. +

+ +
    +
  • Official Oracle technical documentation
  • +
  • Semantic search using vector embeddings
  • +
  • Knowledge Graph signals
  • +
  • Large Language Models (LLMs)
  • +
+ +
+ + +
+ +

Important Notes

+ +
    +
  • + Responses are generated by an LLM. + Even with low temperature, minor variations may occur across executions. +
  • +
  • + Results depend on wording, terminology, and framing of the requirement. +
  • +
  • + In many RFPs, an initial NO can be reframed into a valid + YES by mapping the requirement to the correct OCI service. +
  • +
  •
+ Human review is mandatory.
+ This tool supports architects and RFP teams; it does not replace them.
+
  • +
+ +

+ GraphRAG • Oracle Autonomous Database 23ai • Embeddings • Knowledge Graph • LLM • Flask API
+

+ +
+ + +{% if current_user and current_user.role in ("admin", "user") %} +
+ +

Try It β€” Live RFP Question

+ +

+ Enter an RFP requirement or technical question below. + The API will return a structured JSON response. +

+ + + + + + + +
+{% endif %} + + +{% if current_user and current_user.role in ("admin", "user") %} +
+

πŸ— Architecture Planner

+ +

+ This is an advanced analysis engine for designing architectural solutions based on OCI resources.
+ It uses an LRM mechanism with Chain-of-Thought reasoning to prepare solutions that require a set of OCI components.
+

+ + + + + + + + + +
+{% endif %} + + +
+

Submit your RFP (Excel)

+ +

+ Upload an Excel file and receive the processed result by email. + You do not need to keep this page open. +

+

+ Follow the Excel format:
Column A: MUST contain a sequential number
Columns B and C: MUST be filled with context, such as a domain and sub-domain for the question
+ Column D: Optional
+ Column E: MUST be the main question
+
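As a sketch, a spreadsheet matching this layout can be generated with pandas; the file name and sample rows below are hypothetical, not part of the shipped tooling:

```python
import pandas as pd

# Hypothetical sample rows matching the required layout:
# A = sequential number, B/C = context, D = optional, E = main question
rows = [
    [1, "Security", "IAM", "", "Does OCI support multi-factor authentication?"],
    [2, "Networking", "Load Balancing", "", "Does OCI provide a managed load balancer?"],
]
df = pd.DataFrame(rows)
# Write positionally (no header row), as the format above is column-based
df.to_excel("rfp_input.xlsx", index=False, header=False)  # requires openpyxl
```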

+

+ Example Excel spreadsheet

+ +

+ + +

+ + + + +
+ + +
+ +

REST API Usage

+ +

+ The service exposes a POST endpoint that accepts a JSON payload. +

+
+      curl -X POST {{ api_base_url }}/rest/chat \
+        -H "Content-Type: application/json" \
+        -u app_user:app_password \
+        -d '{
+          "question": "Does Oracle Cloud Infrastructure (OCI) Compute support online resizing of memory for running virtual machine instances?"
+        }'
+

Request Parameters

+ +

+ question (string)
+ Natural language description of an RFP requirement or technical capability. + Small wording changes may affect how intent and evidence are interpreted. +

+ +
+ + +
+ +

AI Response JSON Structure

+ +

+ The API always returns a strict and normalized JSON structure, + designed for traceability, auditing, and human validation. +

+ +

answer

+

+ Final assessment of the requirement: + YES, NO, or PARTIAL. + A NO means the requirement is not explicitly satisfied as written. +

+ +

confidence

+

+ Indicates the strength of the supporting evidence: + HIGH, MEDIUM, or LOW. +

+ +

ambiguity_detected

+

+ Flags whether the requirement is vague, overloaded, or open to interpretation. +

+ +

confidence_reason

+

+ Short explanation justifying the confidence level. +

+ +

justification

+

+ Technical rationale connecting the evidence to the requirement. + This is not marketing text. +

+ +

evidence

+

+ List of supporting references: +

+
    +
  • quote – Exact extracted text
  • +
  • source – URL or document reference
  • +
+ +
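For illustration, a response following this structure might look like the sketch below; every field value here is hypothetical, not actual service output:

```python
import json

# Hypothetical response following the documented fields
response = {
    "answer": "YES",
    "confidence": "HIGH",
    "ambiguity_detected": False,
    "confidence_reason": "Explicit statement found in official documentation.",
    "justification": "The cited documentation explicitly confirms the capability.",
    "evidence": [
        {
            "quote": "Exact extracted text would appear here.",
            "source": "https://docs.oracle.com/en-us/iaas/",
        }
    ],
}
print(json.dumps(response, indent=2))
```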
+ +
+ +
+ +

How to Use the RFP AI with Custom Python Code

+ +

+ This solution exposes a REST API that allows RFP questions to be evaluated programmatically. + By consuming this API, users can execute a Python automation that reads spreadsheet files, + builds contextualized questions, sends them to the AI service, and writes the results back + to the same spreadsheet. +

+ +

+ The automation supports both hierarchical and non-hierarchical + spreadsheet structures. Depending on how the spreadsheet is organized, the Python code + automatically determines how to construct each question, ensuring that the context sent + to the AI is accurate, consistent, and auditable. +

+ +

+ This approach enables large RFP documents to be processed in bulk, replacing manual analysis + with a repeatable and controlled workflow driven by a REST interface and a Python execution layer. +

+ +
+ Source Code Download

+ The Python script responsible for reading RFP spreadsheets, calling the REST API, + and writing results back to the file can be downloaded below: +

+
+ 📥 process_excel_rfp.py
+
+ +
+ +

1. Hierarchical Spreadsheet

+ +

+ A spreadsheet is considered hierarchical when it contains a numbering column + that represents a tree structure, such as: +

+ +
+    1
+    1.1
+    1.1.1
+    1.2
+      
+ +

+ In this format: +

+ +
    +
  • The hierarchy is explicitly defined by the numbering
  • +
  • Parent items provide contextual meaning
  • +
  • Leaf items (those without children) are sent to the AI for evaluation
  • +
+ +
+ Example:
+ Item 1.2.3 inherits context from 1 → 1.2
+
+ +
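As an illustrative sketch (the helper name is hypothetical, not part of the shipped script), leaf items can be found by checking whether any other numbering extends the current one with a further level:

```python
def leaf_items(nums):
    # A row is a leaf when no other row's numbering extends it (e.g. "1.1" -> "1.1.1")
    return [n for n in nums if not any(m.startswith(n + ".") for m in nums)]

rows = ["1", "1.1", "1.1.1", "1.2"]
print(leaf_items(rows))  # ['1.1.1', '1.2'] -- only leaves are sent to the AI
```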

2. Non-Hierarchical Spreadsheet

+ +

+ A spreadsheet is considered non-hierarchical when no valid hierarchical + numbering exists or when numbering does not represent a logical structure. +

+ +

+ In these cases, context is distributed across specific columns, for example: +

+ +
+    Domain | Subdomain | Topic | Question
+      
+ +

+ The pipeline uses only explicitly declared context columns, preventing semantic noise + such as internal IDs or technical codes from being included in the prompt. +

+ +

3. How the Pipeline Selects the Mode

+ +
+    If the order value is hierarchical:
+        use numeric hierarchy
+    Else:
+        use column-based hierarchy
+      
+ +
+ This decision ensures deterministic and auditable behavior. +
+ +
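The decision above can be sketched in Python; the function name and return labels are illustrative assumptions, not the script's actual identifiers:

```python
def select_mode(order_values):
    # Hierarchical mode only when every non-empty order value is dotted-numeric
    values = [str(v).strip() for v in order_values if str(v).strip()]
    if values and all(p.isdigit() for v in values for p in v.split(".")):
        return "numeric-hierarchy"
    return "column-based"

print(select_mode(["1", "1.1", "1.1.1"]))  # numeric-hierarchy
print(select_mode(["ID-01", "ID-02"]))     # column-based
```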

4. Key Code Sections

+ +

4.1 Hierarchy Detection

+ +

+    def is_hierarchical(num: str) -> bool:
+        if not num:
+            return False
+        parts = num.split(".")
+        return all(p.isdigit() for p in parts)
+      
+ +

+ This function determines whether a row belongs to the numeric hierarchy. +

+ +
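To make the behavior concrete, here are a few quick checks, reproducing the function above so the snippet is self-contained:

```python
def is_hierarchical(num: str) -> bool:
    if not num:
        return False
    parts = num.split(".")
    return all(p.isdigit() for p in parts)

print(is_hierarchical("1.2.3"))  # True
print(is_hierarchical("A.1"))    # False -- non-numeric segment
print(is_hierarchical(""))       # False -- empty order value
```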

4.2 Hierarchical Question Builder

+ +

+    def build_question(hierarchy: dict, current_num: str) -> str:
+        ...
+        return f'Considering the context of "{context}", {specific}'
+      
+ +

+ This logic walks up the hierarchy tree to build a contextualized question. +

+ +
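The elided body might be reconstructed roughly as follows; this is a hypothetical sketch assuming `hierarchy` maps each numbering to its row text, and the real script's data structure may differ:

```python
def build_question(hierarchy: dict, current_num: str) -> str:
    # hierarchy is assumed to map numbering -> row text, e.g. {"1": "Security", ...}
    parts = current_num.split(".")
    ancestors = [".".join(parts[:i]) for i in range(1, len(parts))]
    context = " > ".join(hierarchy[a] for a in ancestors if a in hierarchy)
    specific = hierarchy[current_num]
    return f'Considering the context of "{context}", {specific}'

h = {"1": "Security", "1.2": "Encryption", "1.2.3": "is data encrypted at rest?"}
print(build_question(h, "1.2.3"))
# Considering the context of "Security > Encryption", is data encrypted at rest?
```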

4.3 Column-Based Question Builder

+ +

+    def build_question_from_columns(row, context_cols, question_col):
+        ...
+        return f'Considering the context of "{context}", {question}'
+      
+ +

+ This builder is used only when no numeric hierarchy exists. +

+ +
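Again as a hypothetical sketch (the row is shown as a plain dict for self-containment; the real script passes a pandas row):

```python
def build_question_from_columns(row, context_cols, question_col):
    # Join only the explicitly declared context columns, skipping empty cells
    context = " > ".join(str(row[c]) for c in context_cols if str(row[c]).strip())
    question = str(row[question_col])
    return f'Considering the context of "{context}", {question}'

row = {"B": "Networking", "C": "Load Balancing", "E": "does OCI provide a managed load balancer?"}
print(build_question_from_columns(row, ["B", "C"], "E"))
# Considering the context of "Networking > Load Balancing", does OCI provide a managed load balancer?
```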

4.4 Correct Row Retrieval (Critical)

+ +

+    row = df.loc[info["row"]]
+    num = normalize_num(str(row.iloc[ORDER_COLUMN]))
+      
+ +
+ Important:
+ Always retrieve the correct DataFrame row before accessing column values. + If this step is skipped, hierarchical processing will not work correctly. +
+ +
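A minimal sketch of the point being made (the DataFrame contents are hypothetical): fetch the full row with `df.loc` first, then read its columns positionally.

```python
import pandas as pd

ORDER_COLUMN = 0
df = pd.DataFrame({"order": ["1", "1.1"], "question": ["Security", "Is MFA supported?"]})
info = {"row": 1}  # index recorded earlier while scanning the hierarchy

row = df.loc[info["row"]]                  # correct: fetch the row first
num = str(row.iloc[ORDER_COLUMN]).strip()  # then read its order value
print(num)  # 1.1
```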

5. Best Practices

+ +
    +
  • Do not include internal IDs or technical codes as semantic context
  • +
  • Explicitly define context columns (CONTEXT_COLUMNS)
  • +
  • Avoid heuristic-based hierarchy guessing
  • +
  • Prefer deterministic, auditable logic
  • +
+ +

6. Summary

+ +
+ This pipeline is designed to: +
    +
  • Support multiple RFP spreadsheet formats
  • +
  • Eliminate semantic noise
  • +
  • Produce consistent, high-quality prompts
  • +
  • Scale for enterprise usage
  • +
+
+ +
+
+ +
+ +
+

Configure and Test your Custom Python Code

+ + Before running process_excel_rfp.py, you must configure the input spreadsheet, the REST endpoint, + and authentication. These parameters can be set directly in the script or provided via environment variables. +
+ +

0. Prerequisites

+
    +
  • Python 3.10+ installed
  • +
  • Install dependencies: pip install pandas requests openpyxl
  • +
  • Access to the REST API endpoint (network + credentials)
  • +
+ +
+ +

1. Script Parameters (Edit in the .py file)

+ +
+ Main configuration variables

+ Open process_excel_rfp.py and update the values below: +
+ +
+    EXCEL_PATH      = "/path/to/your/RFP.xlsx"
+    API_URL         = "{{ api_base_url }}/rest/chat"
+    TIMEOUT         = 120
+
+    ORDER_COLUMN    = 0   # column index containing the order/numbering
+    QUESTION_COLUMN = 1   # column index containing the question text
+
+    # Use this only for NON-hierarchical spreadsheets:
+    CONTEXT_COLUMNS = [1, 2]  # columns that contain context (domain, topic, section, etc.)
+
+    # Output column names (created if missing):
+    ANSWER_COL      = "ANSWER"
+    JSON_COL        = "RESULT_JSON"
+    
+ +

EXCEL_PATH

+

+ Full path to the spreadsheet you want to process. The script reads this file and writes a new output file + named _resultado.xlsx in the same folder. +

+ +

API_URL

+

+ The REST endpoint exposed by the AI service. It must accept a
+ POST request with the JSON payload {"question": "..."}.

+ +

ORDER_COLUMN and QUESTION_COLUMN

+

+ The script uses ORDER_COLUMN to identify hierarchy (e.g., 1, 1.1, 1.1.1).
+ QUESTION_COLUMN identifies the text that will be sent to the AI.

+ +

CONTEXT_COLUMNS (Non-hierarchical mode)

+

+ If your spreadsheet is not hierarchical, context comes from fixed columns (for example: Domain → Subdomain → Topic).
+ Only the columns listed in CONTEXT_COLUMNS will be used to build context.
+ This avoids adding noisy values such as internal IDs or codes.

+ +
+ +

2. Authentication (Environment Variables)

+ +
+ Important:
+ Do not hardcode credentials inside the spreadsheet or the script if the file will be shared. + Prefer environment variables. +
+ +

+ The script uses HTTP Basic Auth to call the API. Configure credentials using environment variables: +

+ +
+    export APP_USER="YOUR USER"
+    export APP_PASS="YOUR PASSWORD"
+    
+ +

+ On Windows PowerShell: +

+ +
+    setx APP_USER "YOUR USER"
+    setx APP_PASS "YOUR PASSWORD"
+    
+ +

+ If not provided, the script falls back to the defaults defined in the code: + APP_USER / APP_PASS. +

+ +
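Inside the script, this fallback can be sketched as follows; the default strings are placeholders, not real credentials:

```python
import os

# Environment variables win; otherwise fall back to in-code defaults
APP_USER = os.environ.get("APP_USER", "app_user")
APP_PASS = os.environ.get("APP_PASS", "app_password")
```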
+ +

3. How to Run

+ +
+    python process_excel_rfp.py
+    
+ +

+ The script will: +

+
    +
  • Load the spreadsheet
  • +
  • Detect whether the sheet is hierarchical (based on numbering)
  • +
  • Build a contextual question for each leaf item
  • +
  • Send each question to the REST API
  • +
  • Write ANSWER + JSON results back to a new spreadsheet file
  • +
  • Log LOW/MEDIUM confidence or NO answers into queries_with_low_confidence_or_no.txt
  • +
+ +
+ Output

+ A new file will be created next to the original spreadsheet:
+ RFP_result.xlsx +
+
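Putting the pieces together, a single question can be submitted programmatically. This is a condensed sketch: the endpoint default, credentials, and timeout mirror the tutorial's examples and are assumptions for your environment.

```python
import os
import requests  # third-party: pip install requests

API_URL = os.environ.get("API_URL", "http://localhost:8100/rest/chat")
AUTH = (os.environ.get("APP_USER", "app_user"), os.environ.get("APP_PASS", "app_password"))

def ask(question: str, timeout: int = 120) -> dict:
    """POST one RFP question and return the parsed JSON verdict."""
    resp = requests.post(API_URL, json={"question": question}, auth=AUTH, timeout=timeout)
    resp.raise_for_status()
    return resp.json()

# Example (requires a running service):
# verdict = ask("Does OCI Compute support online memory resizing?")
# print(verdict["answer"], verdict["confidence"])
```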
+ + + + + + + +{% endblock %} \ No newline at end of file diff --git a/files/templates/invalidate.html b/files/templates/invalidate.html new file mode 100644 index 0000000..8d49dae --- /dev/null +++ b/files/templates/invalidate.html @@ -0,0 +1,203 @@ +{% extends "base.html" %} +{% block content %} + +

🔍 RAG Knowledge Governance

+ + + + +
+

❌ Invalidate Knowledge

+ +
+ +

+ +
+
+ +
+ + + + +
+

➕ Add Manual Knowledge

+ +
+ + +

+ + + +

+ + +
+ +

+
+ +
+ + + + +
+

📚 Knowledge Matches

+ + {% if results|length == 0 %} +

No matching knowledge found.

+ {% endif %} + + {% for r in results %} +
+
+ Chunk Hash:
+ {{ r.chunk_hash or "—" }}
+ Origin: {{ r.origin or "UNKNOWN" }}
+ Created at: {{ r.created_at or "—" }}
+ Status: {{ r.status }}
+ Source: {{ r.source }}
+ +
+ Content: +
{{ r.text }}
+
+ Change to: + + +
+
+ + {% if r.chunk_hash %} + + + {% else %} +

Derived from Knowledge Graph (non-revocable)

+ {% endif %} +
+ {% endfor %} +
+ + +{% endblock %} \ No newline at end of file diff --git a/files/templates/users/form.html b/files/templates/users/form.html new file mode 100644 index 0000000..2b086c4 --- /dev/null +++ b/files/templates/users/form.html @@ -0,0 +1,25 @@ +{% extends "base.html" %} +{% block content %} + +
+

{{ "Edit User" if user else "New User" }}

+ +
+ + + + + + + + +
+
+ +{% endblock %} \ No newline at end of file diff --git a/files/templates/users/list.html b/files/templates/users/list.html new file mode 100644 index 0000000..a38c2fb --- /dev/null +++ b/files/templates/users/list.html @@ -0,0 +1,33 @@ +{% extends "base.html" %} +{% block content %} + +
+

Users

+ + + New User + + + + + + + + + + + {% for u in users %} + + + + + + + + {% endfor %} +
NameEmailRoleActive
{{ u.name }}{{ u.email }}{{ u.role }}{{ "Yes" if u.active else "No" }} + Edit | + Delete +
+
+ +{% endblock %} \ No newline at end of file diff --git a/files/templates/users/login.html b/files/templates/users/login.html new file mode 100644 index 0000000..677f66c --- /dev/null +++ b/files/templates/users/login.html @@ -0,0 +1,114 @@ +{% extends "base.html" %} + +{% block content %} + + + + + +{% endblock %} \ No newline at end of file diff --git a/files/templates/users/set_password.html b/files/templates/users/set_password.html new file mode 100644 index 0000000..11317b7 --- /dev/null +++ b/files/templates/users/set_password.html @@ -0,0 +1,20 @@ +{% extends "base.html" %} +{% block content %} + +
+ +

Set Password

+ + {% if expired %} +

Link expired or invalid.

+ {% else %} +
+ + + +
+ {% endif %} + +
+ +{% endblock %} \ No newline at end of file diff --git a/files/templates/users/signup.html b/files/templates/users/signup.html new file mode 100644 index 0000000..b18790d --- /dev/null +++ b/files/templates/users/signup.html @@ -0,0 +1,51 @@ +{% extends "base.html" %} +{% block content %} + +
+ +

Create Access

+ +

+ Enter your email to receive a secure link and set your password. +

+ +
+ + + +

+ + + +

+ + + +
+ +
+ + {% with messages = get_flashed_messages(with_categories=true) %} + {% if messages %} + {% for cat, msg in messages %} +
+ {{ msg }} +
+ {% endfor %} + {% endif %} + {% endwith %} + +
+ +{% endblock %} \ No newline at end of file diff --git a/img_3.png b/img_3.png new file mode 100644 index 0000000..3363846 Binary files /dev/null and b/img_3.png differ diff --git a/img_4.png b/img_4.png new file mode 100644 index 0000000..0a3425c Binary files /dev/null and b/img_4.png differ