Python API Reference#
Core Application#
The main chatbot and user interface functionality.
Key Functions#
The core application provides functions for:
Configuring retrievers for document search
Handling user input and responses
Setting up the Streamlit interface
Streamlit chatbot interface for the Euclid AI Assistant. Uses an existing FAISS vector store and E5 embeddings for retrieval.
- euclid.rag.chatbot.configure_retriever(config: dict, index_dir: str) VectorStoreRetriever #
Build and cache a FAISS-based retriever for Euclid knowledge sources.
- Parameters:
config (dict) – Configuration dictionary.
index_dir (str) – Directory containing the FAISS index.
- Returns:
Configured retriever with search_type="similarity" and k=6.
- Return type:
VectorStoreRetriever
- Raises:
RuntimeError – If the FAISS index cannot be found or loaded.
- euclid.rag.chatbot.create_euclid_router(config: dict) Callable[[dict, list[BaseCallbackHandler] | None], dict] #
Return a Euclid-AI router that always delegates to at least one sub-agent.
- euclid.rag.chatbot.handle_user_input(router: Callable[[dict, list[BaseCallbackHandler] | None], dict], msgs: StreamlitChatMessageHistory) None #
Display chat history and handle new user input in Streamlit.
- Parameters:
router (Callable) – Callable that routes the query to the correct tools for the response.
msgs (StreamlitChatMessageHistory) – Chat history object for managing messages.
- euclid.rag.chatbot.submit_text() None #
Flag that the user pressed <enter> in the chat box (Streamlit callback).
Set up the sidebar, landing page, and header/footer for a Streamlit app that interacts with the chatbot.
Set up the header and footer for the Streamlit app.
Create a Streamlit callback handler that dynamically updates a UI container with new tokens from a language model.
- euclid.rag.streamlit_callback.get_streamlit_cb(parent_container: DeltaGenerator) BaseCallbackHandler #
Create a Streamlit callback handler that updates the provided Streamlit container with new tokens.
- Parameters:
parent_container (DeltaGenerator) – The Streamlit container where the text will be rendered.
- Returns:
An instance of a callback handler configured for Streamlit.
- Return type:
BaseCallbackHandler
Retrieval System#
Tools for retrieving and formatting information from document stores.
Overview#
The retrieval system provides specialized tools for querying Euclid mission documents. The system retrieves relevant documents from vector stores, ranks them using similarity and metadata scoring, and provides the top-ranked sources as context for response generation.
Source Attribution#
The chatbot responses include the top-ranked documents that were provided as context to the language model. These sources represent the retrieved, reranked, and deduplicated documents used to generate the response, not necessarily a selection made by the language model itself.
Generic Retriever Tool for querying Euclid-Consortium documents.
- euclid.rag.retrievers.generic_retrieval_tool.bonus_overlap(q: set[str], field: str | None, weight: float) float #
Compute weighted count of query tokens in a metadata field.
- euclid.rag.retrievers.generic_retrieval_tool.format_source(m: dict) str #
Format a source line based on document metadata.
- euclid.rag.retrievers.generic_retrieval_tool.get_generic_retrieval_tool(llm: BaseLanguageModel, retriever: VectorStoreRetriever) Tool #
Return a generic Euclid retrieval tool that answers questions from any document type (publications, DPDD, etc.).
- Parameters:
llm (BaseLanguageModel) – The language model used to generate answers.
retriever (VectorStoreRetriever) – The retriever for accessing vectorstore documents.
- Returns:
A callable tool that answers questions and formats sources for all Euclid document types.
- Return type:
Tool
- euclid.rag.retrievers.generic_retrieval_tool.normalize_url(url: str | None) str | None #
Remove URL fragments and query parameters for deduplication.
- euclid.rag.retrievers.generic_retrieval_tool.semantic_rerank(query: str, docs: list) list #
Rerank a list of documents by semantic similarity to the query.
- euclid.rag.retrievers.generic_retrieval_tool.tokenize(text: str) set[str] #
Convert text into a set of lowercase tokens ≥3 characters.
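The helper descriptions above (URL normalization, tokenization, and the weighted overlap bonus) can be illustrated with a minimal standard-library sketch. This is not the package's code: the function bodies, the choice of alphanumeric tokens, and the example URL are assumptions made for illustration only.

```python
import re
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: Optional[str]) -> Optional[str]:
    """Drop the query string and fragment so equivalent links compare equal."""
    if not url:
        return None
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def tokenize(text: str) -> set[str]:
    """Lowercase the text and keep alphanumeric tokens of 3+ characters.

    Assumption: tokens are runs of ASCII letters and digits.
    """
    return {tok for tok in re.findall(r"[a-z0-9]+", text.lower()) if len(tok) >= 3}


def bonus_overlap(q: set[str], field: Optional[str], weight: float) -> float:
    """Return weight times the number of query tokens present in the field."""
    if not field:
        return 0.0
    return weight * len(q & tokenize(field))


query_tokens = tokenize("What is the VIS PSF?")
score = bonus_overlap(query_tokens, "VIS instrument overview", 0.5)
```

A bonus of this kind would typically be added to a similarity score so that documents whose title or other metadata mention the query terms rank slightly higher.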
Tool for querying Euclid-Consortium publications and metadata.
- euclid.rag.retrievers.publication_tool.bonus_overlap(q: set[str], field: str | None, weight: float) float #
Compute a weighted count of query tokens found in a metadata field.
- euclid.rag.retrievers.publication_tool.get_publication_tool(llm: BaseLanguageModel, retriever: VectorStoreRetriever) Tool #
Return a tool that answers questions using Euclid Consortium publications.
Uses a language model and vectorstore retriever to find and summarize relevant papers.
- Parameters:
llm (BaseLanguageModel) – The language model used to generate answers.
retriever (VectorStoreRetriever) – The retriever for accessing publication documents.
- Returns:
A callable tool that answers questions based on EC publications.
- Return type:
Tool
- euclid.rag.retrievers.publication_tool.semantic_rerank(query: str, docs: list) list #
Rerank a list of documents based on semantic similarity to a query.
- euclid.rag.retrievers.publication_tool.tokenize(text: str) set[str] #
Convert text into a set of lowercase tokens with ≥3 characters.
Punctuation is removed before tokenization.
Tool for querying Euclid-Consortium Redmine and metadata.
- class euclid.rag.retrievers.redmine_tool.RedmineRetrieverHelper(query: str, dedup_hash: HashDeduplicator, dedup_semantic: SemanticSimilarityDeduplicator)#
Bases:
object
Helper class to encapsulate the multi-stage retrieval logic for Redmine.
- remove_duplicate_docs(scored_docs: list[tuple]) list[tuple] #
Remove exact and semantic duplicates from retrieved documents.
- Parameters:
scored_docs (list of tuple) – A list of (document, score) pairs. Each document should have a .page_content attribute containing its text.
- Returns:
Deduplicated (score, document) pairs, sorted by score in descending order.
- Return type:
list[tuple]
Notes
Exact duplicates are removed using self.dedup_hash.
Semantic duplicates are removed using self.dedup_semantic if enabled.
The input is expected in the form (document, score), while the output
is returned in the form (score, document) for downstream ranking.
- score_by_metadata(docs: list, top_k_docs: int = 10) list[tuple] #
Score documents based on metadata keyword overlap and recency.
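The ordering convention noted for remove_duplicate_docs (input as (document, score), output as (score, document) sorted by descending score) can be sketched as follows. The Doc class and the dedup_and_flip function are hypothetical stand-ins, and only exact-text duplicates are handled here; the semantic stage is omitted.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a document with a .page_content attribute."""

    page_content: str


def dedup_and_flip(scored_docs):
    """Drop exact-text repeats, then return (score, document) pairs, best first.

    The first occurrence of each text is kept, regardless of its score.
    """
    seen = set()
    kept = []
    for doc, score in scored_docs:  # input order: (document, score)
        if doc.page_content in seen:
            continue
        seen.add(doc.page_content)
        kept.append((score, doc))  # output order: (score, document)
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


pairs = [(Doc("a"), 0.2), (Doc("b"), 0.9), (Doc("a"), 0.7)]
result = dedup_and_flip(pairs)
```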
- euclid.rag.retrievers.redmine_tool.get_redmine_tool(llm: BaseLanguageModel, retriever: VectorStoreRetriever) Tool #
Return a tool that answers questions using Euclid Consortium Redmine. Uses a language model and vector store retriever to find and summarize relevant Redmine wikis.
- Parameters:
llm (BaseLanguageModel) – The language model used to generate answers.
retriever (VectorStoreRetriever) – The retriever for accessing Redmine documents.
- Returns:
A callable tool that answers questions based on EC Redmine.
- Return type:
Tool
Data Ingestion#
Ingest Euclid Science Ground Segment Data Product Description Document (DPDD).
This script downloads the DPDD from the Euclid website, processes it, and ingests the data into a FAISS vectorstore for use in the Euclid RAG system.
- class euclid.rag.ingestion.ingest_dpdd.EuclidDPDDIngestor(vector_store_dir: Path, dpdd_config_path: Path)#
Bases:
object
Downloads and ingests DPDD data into the vector store.
- ingest_new_data() None #
Ingest new data into the vector store.
This method fetches DPDD entries, processes them, and adds them to the vector store, avoiding duplicates based on the ‘source’ metadata field.
- Raises:
RuntimeError – If the vector store directory is missing or cannot be created.
- Returns:
This function does not return anything; it performs the ingestion.
- Return type:
None
- euclid.rag.ingestion.ingest_dpdd.run_dpdd_ingestion(config: dict) None #
Run the DPDD ingestion process.
- Parameters:
config (dict) – Configuration dictionary containing paths and settings.
- Raises:
RuntimeError – If the vector store directory is missing or cannot be created.
- Returns:
This function does not return anything; it performs the ingestion.
- Return type:
None
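The duplicate avoidance based on the 'source' metadata field can be sketched as a simple filter. The entry layout (a dict with a "metadata" mapping), the function name select_new_entries, and the sample source names are assumptions for illustration, not the ingestor's actual data model.

```python
def select_new_entries(entries, known_sources):
    """Return entries whose metadata 'source' has not been seen yet."""
    seen = set(known_sources)
    fresh = []
    for entry in entries:
        source = entry["metadata"]["source"]
        if source in seen:
            continue  # already in the vector store, or repeated in this batch
        seen.add(source)
        fresh.append(entry)
    return fresh


existing = {"dpdd/page-a"}
incoming = [
    {"content": "A", "metadata": {"source": "dpdd/page-a"}},
    {"content": "B", "metadata": {"source": "dpdd/page-b"}},
    {"content": "B again", "metadata": {"source": "dpdd/page-b"}},
]
to_add = select_new_entries(incoming, existing)
```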
Ingest publications into a FAISS vector store from the official EC BibTeX. Each paper is embedded immediately after download and deleted afterward.
- class euclid.rag.ingestion.ingest_publications.EuclidBibIngestor(index_dir: Path, temp_dir: Path, data_config: dict)#
Bases:
object
Downloads and updates the vector store from the Euclid BibTeX file.
- euclid.rag.ingestion.ingest_publications.run_bibtex_ingestion(config: dict) None #
Run the BibTeX ingestion script.
Module to ingest JSON-exported pages into a FAISS vector store.
- class euclid.rag.ingestion.ingest_redmine.JSONIngestor(index_dir: Path, json_dir: Path, cleaner: RedmineCleaner, data_config: dict)#
Bases:
object
Ingest JSON-exported pages into a FAISS vector store.
The JSON structure should be as follows:
{
  "content": "Full text of the page...",
  "metadata": { "field1": "", "field2": "", ... }
}
- Parameters:
index_dir (Path) – Directory where the FAISS index will be stored.
json_dir (Path) – Directory containing JSON files to ingest.
cleaner (RedmineCleaner) – Text cleaning utility for preprocessing content.
data_config (dict) – Configuration dictionary containing embedding and processing settings.
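As a rough illustration of the JSON layout described above, the following sketch reads every JSON file in a directory and keeps pages with non-empty content. It assumes one page object per file; read_pages is a hypothetical helper, not part of JSONIngestor.

```python
import json
import tempfile
from pathlib import Path


def read_pages(json_dir: Path) -> list[dict]:
    """Read every *.json file in json_dir and keep pages with non-empty content."""
    pages = []
    for path in sorted(json_dir.glob("*.json")):
        page = json.loads(path.read_text(encoding="utf-8"))
        if page.get("content"):
            pages.append(
                {"content": page["content"], "metadata": page.get("metadata", {})}
            )
    return pages


# Write one sample page to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = {"content": "Full text of the page...", "metadata": {"field1": ""}}
    (Path(tmp) / "page.json").write_text(json.dumps(sample), encoding="utf-8")
    pages = read_pages(Path(tmp))
```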
Utilities#
Module providing functions to load, extract, and expand acronyms from a JSON file and within a given text.
- euclid.rag.utils.acronym_handler.expand_acronyms_in_query(query: str, acronyms: dict) str #
Expand acronyms found in a given query string by replacing them with their definitions.
- Parameters:
query (str) – The input string containing potential acronyms.
acronyms (dict) – A dictionary mapping acronyms (keys) to their definitions (values) based on http://ycopin.pages.euclid-sgs.uk/euclidator/ by Y. Copin
- Returns:
The modified query string with acronyms expanded to include their definitions.
- Return type:
str
Example
>>> acro = {"DSS": "Data Storage System | Distributed Storage System"}
>>> expand_acronyms_in_query("What is DSS?", acro)
'What is DSS (Data Storage System | Distributed Storage System)?'
- euclid.rag.utils.acronym_handler.extract_acronyms(text: str) set[str] #
Extract acronyms from a string.
- Parameters:
text (str) – The string containing acronyms.
- Returns:
A set of the acronyms.
- Return type:
set[str]
- euclid.rag.utils.acronym_handler.load_acronyms(path: str | Path) dict[str, str] #
Load acronyms from a JSON file.
- Parameters:
path (str | Path) – The file path to the JSON file containing acronyms.
- Returns:
A dictionary where keys are acronyms and values are their corresponding definitions.
- Return type:
dict[str, str]
- euclid.rag.utils.acronym_handler.match_acronyms(text: str, acronym_dict: dict[str, str]) dict[str, str] #
Match acronyms between a string and a dictionary of acronyms.
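One way the extraction and expansion behaviour documented above could work is sketched below. The functions are re-implemented here for illustration and are not the module's code; in particular, treating an acronym as a run of two or more uppercase letters is an assumption.

```python
import re

# Assumption: an acronym is a standalone run of two or more uppercase letters.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")


def extract_acronyms(text: str) -> set[str]:
    """Return the set of acronym-like tokens found in the text."""
    return set(ACRONYM_RE.findall(text))


def expand_acronyms_in_query(query: str, acronyms: dict) -> str:
    """Append '(definition)' after each acronym that has a known definition."""

    def expand(match):
        acro = match.group(0)
        definition = acronyms.get(acro)
        return f"{acro} ({definition})" if definition else acro

    return ACRONYM_RE.sub(expand, query)


acro = {"DSS": "Data Storage System | Distributed Storage System"}
expanded = expand_acronyms_in_query("What is DSS?", acro)
```

Acronyms without an entry in the dictionary are left unchanged in this sketch.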
Utility for loading and parsing config files.
- euclid.rag.utils.config.load_config(config_path: Path) dict #
Load YAML configuration from a file.
- Parameters:
config_path (Path) – Path to the config YAML file.
- Returns:
Parsed configuration dictionary.
- Return type:
dict
Utility for loading current device type.
- euclid.rag.utils.device.get_device() device #
Return the torch device to use for embedding.
Checks for available hardware acceleration in the following order: CUDA, MPS, then CPU.
- Returns:
The selected device (‘cuda’, ‘mps’, or ‘cpu’).
- Return type:
torch.device
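The selection order (CUDA, then MPS, then CPU) can be sketched without importing torch. The pick_device function below is hypothetical; in practice the two flags would presumably come from torch.cuda.is_available() and torch.backends.mps.is_available().

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name, checking CUDA first, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Example: a machine with Apple-silicon acceleration but no CUDA.
device_name = pick_device(cuda_available=False, mps_available=True)
```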
Module providing a utility class for cleaning and preparing Redmine-exported data.
- class euclid.rag.utils.redmine_cleaner.RedmineCleaner(max_chunk_length: int = 1000)#
Bases:
object
A utility class for cleaning and preparing Redmine-exported data for ingestion in a RAG pipeline.
- Parameters:
max_chunk_length (int, optional) – Maximum length for each split content chunk, by default 1000.
- convert_redmine_bold_italic(line: str) str #
Convert *bold* and _italic_ Redmine syntax to Markdown bold and italic.
- convert_redmine_code_blocks(lines: list[str]) list[str] #
Convert HTML pre tags to Markdown code blocks.
Supports multi-line <pre> sections by converting them to triple-backtick code blocks for better Markdown compatibility.
- Parameters:
lines (list of str) – List of text lines that may contain Redmine-style code blocks.
- Returns:
List of lines with <pre> tags converted to Markdown code blocks.
- Return type:
list[str]
Examples
>>> cleaner = RedmineCleaner()
>>> lines = ["Some text", "<pre>code here</pre>", "more text"]
>>> result = cleaner.convert_redmine_code_blocks(lines)
>>> print(result)
['Some text', '```', 'code here', '```', 'more text']
- convert_redmine_headers(line: str) str | None #
Convert Redmine headers (h1. to h6.) to Markdown (# to ######).
- convert_redmine_images(line: str) str #
Convert Redmine image syntax !image.png! or !image.png|widthxheight! to Markdown image syntax.
- convert_redmine_linebreaks(line: str) str #
Convert explicit Redmine line breaks in text to Markdown double spaces + newline.
- convert_redmine_lists(line: str) str | None #
Convert Redmine nested list markers (such as *) to Markdown lists with indentation.
- convert_redmine_table(lines: list[str]) tuple[list[str], int] #
Convert Redmine table block lines starting with | to a Markdown table. Returns a tuple (converted_lines, number_of_lines_consumed).
- enrich_with_context(entry: dict[str, Any], chunk: str) str #
Add page hierarchy information as a context prefix to the content.
- Parameters:
entry – Original Redmine entry.
chunk – A chunk of cleaned text content.
- Returns:
Chunk prefixed with hierarchy context.
- Return type:
str
- filter_valid_entries(data: list[dict[str, Any]]) list[dict[str, Any]] #
Keep only entries whose metadata status is not ‘NOK’.
- normalize_metadata(metadata: dict[str, Any]) dict[str, Any] #
Clean and normalize metadata fields (e.g., timestamps).
- Parameters:
metadata – The metadata dictionary from a Redmine page.
- Returns:
A normalized metadata dictionary.
- Return type:
dict[str, Any]
- prepare_for_ingestion(raw_data: list[dict[str, Any]]) list[dict[str, Any]] #
Full pipeline: filter, clean, split and enrich Redmine data.
- Parameters:
raw_data – List of Redmine entries (JSON-like).
- Returns:
List of prepared documents ready for ingestion.
- Return type:
list[dict[str, Any]]
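Two of the conversions described for RedmineCleaner, headers and <pre> blocks, can be sketched with regular expressions. The functions convert_header and convert_pre_blocks are hypothetical simplifications and not the class's methods; the second reproduces the output shown in the convert_redmine_code_blocks example.

```python
import re

FENCE = "`" * 3  # Markdown code fence, built here to avoid a literal fence line

HEADER_RE = re.compile(r"^h([1-6])\.\s+(.*)$")


def convert_header(line: str) -> str:
    """Turn 'h2. Title' into '## Title'; leave other lines unchanged."""
    match = HEADER_RE.match(line)
    if not match:
        return line
    return "#" * int(match.group(1)) + " " + match.group(2)


def convert_pre_blocks(lines: list[str]) -> list[str]:
    """Replace <pre> and </pre> tags with fence lines, keeping the code between."""
    out = []
    for line in lines:
        if "<pre>" not in line and "</pre>" not in line:
            out.append(line)  # ordinary line (including blank lines): keep as-is
            continue
        # Split on the tags but keep them, so single-line and multi-line
        # blocks are handled the same way.
        for piece in re.split(r"(</?pre>)", line):
            if piece in ("<pre>", "</pre>"):
                out.append(FENCE)
            elif piece:
                out.append(piece)
    return out


converted = convert_pre_blocks(["Some text", "<pre>code here</pre>", "more text"])
```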
Extra Tools#
Deduplication filter using hash, FAISS similarity, and cross-encoder re-ranking.
- class euclid.rag.extra_scripts.deduplication.ChunkClusterer(distance_threshold: float = 0.1)#
Bases:
object
Cluster embedding vectors and return one representative text per cluster.
This class uses clustering on embedding vectors to identify similar groups of text chunks.
- Parameters:
distance_threshold (float, optional) – Maximum cosine distance between elements in a cluster. Lower values produce tighter, more conservative clusters. Default is 0.1.
- class euclid.rag.extra_scripts.deduplication.HashDeduplicator#
Bases:
object
Deduplicator using SHA256 hashes for exact match detection.
Tracks seen inputs by their hash and filters out exact duplicates.
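Exact-match detection with SHA256 hashes can be sketched in a few lines. The class name HashDedupSketch and its method names are hypothetical; the actual HashDeduplicator interface is not shown in this reference.

```python
import hashlib


class HashDedupSketch:
    """Track SHA256 digests of seen texts and flag exact repeats."""

    def __init__(self):
        self._seen: set[str] = set()

    def is_duplicate(self, text: str) -> bool:
        """Return True if this exact text was seen before; otherwise record it."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False

    def filter(self, texts):
        """Keep the first occurrence of each text, preserving input order."""
        return [t for t in texts if not self.is_duplicate(t)]


unique = HashDedupSketch().filter(["alpha", "beta", "alpha"])
```

Storing digests rather than full texts keeps memory use bounded by the number of distinct inputs, at the cost of catching only byte-for-byte duplicates.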
- class euclid.rag.extra_scripts.deduplication.SemanticSimilarityDeduplicator(vectorstore: FAISS | None, reranker_model: str, similarity_threshold: float, rerank_threshold: float, k_candidates: int = 5)#
Bases:
object
Deduplicator using semantic similarity and optional reranking.
Uses FAISS to find similar texts and CrossEncoder to refine scoring. Texts are considered duplicates if both thresholds are exceeded.
Embedding and vector store management utilities for Euclid document ingestion.
This module provides:
- An E5 embedding class with support for MPS/CUDA/CPU.
- A function to load or create a FAISS vector store from PDFs.
- class euclid.rag.extra_scripts.vectorstore_embedder.Embedder(model_name: str = 'intfloat/e5-small-v2', batch_size: int = 16)#
Bases:
Embeddings
Embeds text into dense vectors using a HuggingFace model.
Supports MPS, CUDA, or CPU.
Pooling strategy (CLS or mean) is inferred automatically.
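- Parameters:
model_name (str, optional) – Name of the HuggingFace model, by default 'intfloat/e5-small-v2'.
batch_size (int, optional) – Batch size used when embedding, by default 16.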
- property device: device#
Return the torch device used by the model.
- euclid.rag.extra_scripts.vectorstore_embedder.load_json_documents(json_paths: list[Path]) list[Document] #
Load documents from a list of JSON files.
Each JSON file should contain a list of dicts with at least a “content” field and optionally a “metadata” field.
- Parameters:
json_paths (List[Path]) – List of paths to JSON files.
- Returns:
A list of LangChain Document objects.
- Return type:
List[Document]
- euclid.rag.extra_scripts.vectorstore_embedder.load_or_create_index(index_dir: Path, embedder: Embeddings, pdf_paths: None | list[Path] = None, json_paths: None | list[Path] = None) FAISS #
Load an existing FAISS index, or build one from given documents.
- Parameters:
index_dir (Path) – Directory where the FAISS index is stored (or will be created).
embedder (Embeddings) – Embedding model implementing the LangChain Embeddings interface.
pdf_paths (list[Path], optional) – List of PDF files to embed.
json_paths (list[Path], optional) – List of JSON files to embed.
- Returns:
A ready-to-use FAISS vectorstore.
- Return type:
FAISS