Python API Reference#

Core Application#

The main chatbot and user interface functionality.

Key Functions#

The core application provides functions for:

Configuring retrievers for document search
Handling user input and responses
Setting up the Streamlit interface

Streamlit chatbot interface for the Euclid AI Assistant. Uses an existing FAISS vector store and E5 embeddings for retrieval.

euclid.rag.chatbot.configure_retriever(config: dict, index_dir: str) → VectorStoreRetriever#

Build and cache a FAISS-based retriever for Euclid knowledge sources.

Parameters:

config (dict) – Embedding model configuration.
index_dir (str) – Path to the directory containing the FAISS index.

Returns:

Configured Retriever with search_type="similarity" and k=6.

Return type:

VectorStoreRetriever

Raises:

RuntimeError – If the FAISS index cannot be found or loaded.

euclid.rag.chatbot.create_euclid_router(config: dict) → Callable[[dict, list[BaseCallbackHandler] | None], dict]#: Return the Euclid-AI router callable, which always delegates to at least one sub-agent.

euclid.rag.chatbot.handle_user_input(router: Callable[[dict, list[BaseCallbackHandler] | None], dict], msgs: StreamlitChatMessageHistory) → None#

Display chat history and handle new user input in Streamlit.

Parameters:

router (Callable) – Callable that routes to the correct tools for the response.
msgs (StreamlitChatMessageHistory) – Chat history object for managing messages.

euclid.rag.chatbot.submit_text() → None#: Flag that the user pressed <enter> in the chat box (Streamlit callback).

Set up the sidebar, landing page, and header/footer for a Streamlit app that interacts with the chatbot.

euclid.rag.layout.setup_header_and_footer(msgs: StreamlitChatMessageHistory) → None#: Set up the header and footer for the Streamlit app.

euclid.rag.layout.setup_landing_page() → None#: Set up the landing page for the Streamlit app.

euclid.rag.layout.setup_sidebar() → None#: Set up the sidebar for the Streamlit app.

Create a Streamlit callback handler that dynamically updates a UI container with new tokens from a language model.

euclid.rag.streamlit_callback.get_streamlit_cb(parent_container: DeltaGenerator) → BaseCallbackHandler#

Create a Streamlit callback handler that updates the provided Streamlit container with new tokens.

Parameters:

parent_container (DeltaGenerator) – The Streamlit container where the text will be rendered.

Returns:

An instance of a callback handler configured for Streamlit.

Return type:

BaseCallbackHandler

Retrieval System#

Tools for retrieving and formatting information from document stores.

Overview#

The retrieval system provides specialized tools for querying Euclid mission documents. The system retrieves relevant documents from vector stores, ranks them using similarity and metadata scoring, and provides the top-ranked sources as context for response generation.

Source Attribution#

The chatbot responses include the top-ranked documents that were provided as context to the language model. These sources represent the retrieved, reranked, and deduplicated documents used to generate the response, not necessarily a selection made by the language model itself.

Generic Retriever Tool for querying Euclid-Consortium documents.

euclid.rag.retrievers.generic_retrieval_tool.bonus_overlap(q: set[str], field: str | None, weight: float) → float#: Compute weighted count of query tokens in a metadata field.

euclid.rag.retrievers.generic_retrieval_tool.format_source(m: dict) → str#: Format a source line based on document metadata.

euclid.rag.retrievers.generic_retrieval_tool.get_generic_retrieval_tool(llm: BaseLanguageModel, retriever: VectorStoreRetriever) → Tool#

Return a generic Euclid retrieval tool that answers questions from any document type (publications, DPDD, etc.).

Parameters:

llm (BaseLanguageModel) – The language model used to generate answers.
retriever (VectorStoreRetriever) – The retriever for accessing vectorstore documents.

Returns:

A callable tool that answers questions and formats sources for all Euclid document types.

Return type:

Tool

euclid.rag.retrievers.generic_retrieval_tool.normalize_url(url: str | None) → str | None#: Remove URL fragments and query parameters for deduplication.
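The deduplication step described for normalize_url can be illustrated with the standard library. This is a minimal sketch, not the package's implementation: the name normalize_url_sketch is hypothetical, and it assumes that only the query string and the fragment are dropped while scheme, host, and path are kept.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def normalize_url_sketch(url: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in: strip query string and fragment from a URL."""
    if url is None:
        return None
    parts = urlsplit(url)
    # Keep scheme, host, and path; blank out the query and the fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(normalize_url_sketch("https://example.org/paper.pdf?download=1#page=3"))
```

Two links that differ only in their query or fragment then map to the same string, so they can be compared directly when removing duplicate sources.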

euclid.rag.retrievers.generic_retrieval_tool.semantic_rerank(query: str, docs: list) → list#: Rerank a list of documents by semantic similarity to the query.

euclid.rag.retrievers.generic_retrieval_tool.tokenize(text: str) → set[str]#: Convert text into a set of lowercase tokens ≥3 characters.

Tool for querying Euclid-Consortium publications and metadata.

euclid.rag.retrievers.publication_tool.bonus_overlap(q: set[str], field: str | None, weight: float) → float#

Compute a weighted count of query tokens found in a metadata field.

Parameters:

q (set of str) – Query tokens.
field (str or None) – Metadata field to search for token matches.
weight (float) – Weight to multiply the count by.

Returns:

Weighted token overlap score.

Return type:

float
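A weighted overlap score of this kind might look like the following. This is a sketch under stated assumptions rather than the actual function: bonus_overlap_sketch is a hypothetical name, the field is split into lowercase alphanumeric tokens, and each distinct matching token is counted once.

```python
import re
from typing import Optional


def bonus_overlap_sketch(q: set, field: Optional[str], weight: float) -> float:
    """Hypothetical stand-in: weight times the number of query tokens in the field."""
    if not field:
        # A missing or empty metadata field contributes no bonus.
        return 0.0
    field_tokens = set(re.findall(r"[a-z0-9]+", field.lower()))
    return weight * len(q & field_tokens)


print(bonus_overlap_sketch({"weak", "lensing"}, "Weak lensing cosmology", 2.0))
```

Counting distinct tokens instead of raw occurrences is a design assumption here; it keeps a long metadata field from dominating the score by repetition.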

euclid.rag.retrievers.publication_tool.get_publication_tool(llm: BaseLanguageModel, retriever: VectorStoreRetriever) → Tool#

Return a tool that answers questions using Euclid Consortium publications.

Uses a language model and a vectorstore retriever to find and summarize relevant papers.

Parameters:

llm (BaseLanguageModel) – The language model used to generate answers.
retriever (VectorStoreRetriever) – The retriever for accessing publication documents.

Returns:

A callable tool that answers questions based on EC publications.

Return type:

Tool

euclid.rag.retrievers.publication_tool.semantic_rerank(query: str, docs: list) → list#

Rerank a list of documents based on semantic similarity to a query.

Parameters:

query (str) – The input query string.
docs (list) –
A list of documents to rerank.

Each document must have a page_content attribute.

Returns:

The input documents sorted by descending relevance to the query.

Return type:

list

euclid.rag.retrievers.publication_tool.tokenize(text: str) → set[str]#

Convert text into a set of lowercase tokens with ≥3 characters.

Punctuation is removed before tokenization.

Parameters:: text (str) – The input text string.
Returns:: Set of cleaned, lowercase tokens at least 3 characters long.
Return type:: set of str
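The tokenization rule above (lowercase, punctuation removed, tokens of at least 3 characters) can be sketched as follows. The name tokenize_sketch is hypothetical, and replacing punctuation with a space before splitting is an assumption; the real function may strip it differently.

```python
import re


def tokenize_sketch(text: str) -> set:
    """Hypothetical stand-in: lowercase tokens with at least 3 characters."""
    # Replace anything that is not a word character or whitespace with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return {token for token in cleaned.split() if len(token) >= 3}


print(sorted(tokenize_sketch("What is the VIS PSF?")))
```

Returning a set rather than a list makes the overlap computations in the scoring helpers a plain set intersection.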

Tool for querying Euclid-Consortium Redmine and metadata.

class euclid.rag.retrievers.redmine_tool.RedmineRetrieverHelper(query: str, dedup_hash: HashDeduplicator, dedup_semantic: SemanticSimilarityDeduplicator)#

Bases: object

Helper class to encapsulate the multi-stage retrieval logic for Redmine.

remove_duplicate_docs(scored_docs: list[tuple]) → list[tuple]#

Remove exact and semantic duplicates from retrieved documents.

Parameters:: scored_docs (list of tuple) – A list of (document, score) pairs. Each document should have a .page_content attribute containing its text.
Returns:: Deduplicated (score, document) pairs, sorted by score in descending order.
Return type:: list of tuple

Notes

Exact duplicates are removed using self.dedup_hash.
Semantic duplicates are removed using self.dedup_semantic if enabled.
The input is expected in the form (document, score), while the output is returned in the form (score, document) for downstream ranking.
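The change of tuple order noted above is easy to miss, so a small illustration may help. This is only a sketch: Doc and dedupe_and_flip are hypothetical names, exact duplicates are detected by comparing page_content directly, and the semantic deduplication stage is left out.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical document type exposing a page_content attribute."""
    page_content: str


def dedupe_and_flip(scored_docs):
    """Drop exact duplicates, then return (score, document) pairs, best first."""
    seen = set()
    kept = []
    for doc, score in scored_docs:  # input order: (document, score)
        if doc.page_content in seen:
            continue  # keep only the first occurrence of each text
        seen.add(doc.page_content)
        kept.append((score, doc))  # output order: (score, document)
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept


pairs = dedupe_and_flip([(Doc("a"), 0.2), (Doc("b"), 0.9), (Doc("a"), 0.5)])
print([(score, doc.page_content) for score, doc in pairs])
```

In this sketch the first occurrence of a duplicated text wins, even when a later copy has a higher score; whether the real helper behaves the same way is not stated in the reference.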

score_by_metadata(docs: list, top_k_docs: int = 10) → list[tuple]#: Score documents based on metadata keyword overlap and recency.

semantic_rerank(docs: list) → list#

Rerank a list of documents based on semantic similarity to a query.

Parameters:

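query (str) – The input query string. The method signature above takes no query argument, so this is presumably the query supplied to the RedmineRetrieverHelper constructor.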
docs (list) –
A list of (LangChain) documents to rerank.

Each document must have a page_content attribute.

Returns:

The input documents sorted by descending relevance to the query.

Return type:

list

euclid.rag.retrievers.redmine_tool.get_redmine_tool(llm: BaseLanguageModel, retriever: VectorStoreRetriever) → Tool#

Return a tool that answers questions using the Euclid Consortium Redmine. Uses a language model and a vector store retriever to find and summarize relevant Redmine wiki pages.

Parameters:

llm (BaseLanguageModel) – The language model used to generate answers.
retriever (VectorStoreRetriever) – The retriever for accessing Redmine documents.

Returns:

A callable tool that answers questions based on the EC Redmine.

Return type:

Tool

Data Ingestion#

Ingest Euclid Science Ground Segment Data Product Description Document (DPDD).

This script downloads the DPDD from the Euclid website, processes it, and ingests the data into a FAISS vectorstore for use in the Euclid RAG system.

class euclid.rag.ingestion.ingest_dpdd.EuclidDPDDIngestor(vector_store_dir: Path, dpdd_config_path: Path)#

Bases: object

Downloads and ingests DPDD data into the vector store.

ingest_new_data() → None#

Ingest new data into the vector store.

This method fetches DPDD entries, processes them, and adds them to the vector store, avoiding duplicates based on the ‘source’ metadata field.

Raises:: RuntimeError – If the vector store directory is missing or cannot be created.
Returns:: This function does not return anything; it performs the ingestion.
Return type:: None
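Duplicate avoidance keyed on the 'source' metadata field could work roughly as shown below. This is a sketch only: select_new_entries is a hypothetical helper, and the entry layout (a dict with a "metadata" dict containing "source") is an assumption made for illustration.

```python
def select_new_entries(entries, existing_sources):
    """Hypothetical helper: keep entries whose 'source' is not yet indexed."""
    known = set(existing_sources)
    fresh = []
    for entry in entries:
        source = entry["metadata"]["source"]
        if source in known:
            continue  # already in the vector store, or repeated in this batch
        known.add(source)
        fresh.append(entry)
    return fresh


batch = [
    {"content": "A", "metadata": {"source": "dpdd/page1"}},
    {"content": "B", "metadata": {"source": "dpdd/page2"}},
    {"content": "A again", "metadata": {"source": "dpdd/page1"}},
]
print([e["metadata"]["source"] for e in select_new_entries(batch, {"dpdd/page2"})])
```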

euclid.rag.ingestion.ingest_dpdd.main() → None#: Run the ingestion script.

euclid.rag.ingestion.ingest_dpdd.run_dpdd_ingestion(config: dict) → None#

Run the DPDD ingestion process.

Parameters:: config (dict) – Configuration dictionary containing paths and settings.
Raises:: RuntimeError – If the vector store directory is missing or cannot be created.
Returns:: This function does not return anything; it performs the ingestion.
Return type:: None

Ingest publications into a FAISS vector store from the official EC BibTeX. Each paper is embedded immediately after download and deleted afterward.

class euclid.rag.ingestion.ingest_publications.EuclidBibIngestor(index_dir: Path, temp_dir: Path, data_config: dict)#

Bases: object

Downloads and updates the vector store from the Euclid BibTeX file.

update_from_bibtex() → None#: Fetch new BibTeX entries and update the vector store.

euclid.rag.ingestion.ingest_publications.main() → None#: Run the ingestion script.

euclid.rag.ingestion.ingest_publications.run_bibtex_ingestion(config: dict) → None#: Run the bibtex ingestion script.

Module to ingest JSON-exported pages into a FAISS vector store.

class euclid.rag.ingestion.ingest_redmine.JSONIngestor(index_dir: Path, json_dir: Path, cleaner: RedmineCleaner, data_config: dict)#

Bases: object

Ingest JSON-exported pages into a FAISS vector store.

The JSON structure should be as follows:

{
    "content": "Full text of the page...",
    "metadata": {
        "field1": "",
        "field2": "",
        ...
    }
}

Parameters:

index_dir (Path) – Directory where the FAISS index will be stored.
json_dir (Path) – Directory containing JSON files to ingest.
cleaner (RedmineCleaner) – Text cleaning utility for preprocessing content.
data_config (dict) – Configuration dictionary containing embedding and processing settings.
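A file in the JSON layout shown above can be read with the standard library alone. The snippet below is a sketch, with read_page as a hypothetical helper; it writes a sample file to a temporary directory first so that it runs on its own, and it treats a missing "metadata" key as an empty dict, which is an assumption.

```python
import json
import tempfile
from pathlib import Path


def read_page(path: Path):
    """Hypothetical helper: return (content, metadata) from one exported page."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return data["content"], data.get("metadata", {})


with tempfile.TemporaryDirectory() as tmp:
    page_path = Path(tmp) / "page.json"
    sample = {"content": "Full text of the page...", "metadata": {"field1": ""}}
    page_path.write_text(json.dumps(sample), encoding="utf-8")
    content, metadata = read_page(page_path)

print(content, metadata)
```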

ingest_json_files() → None#: Ingest documents from JSON files into the FAISS vector store.

euclid.rag.ingestion.ingest_redmine.main() → None#: Run the ingestion script.

euclid.rag.ingestion.ingest_redmine.run_redmine_ingestion(config: dict) → None#: Run the ingestion process for Redmine pages using the provided configuration.

Utilities#

Module providing functions to load, extract, and expand acronyms from a JSON file and within a given text.

euclid.rag.utils.acronym_handler.expand_acronyms_in_query(query: str, acronyms: dict) → str#

Expand acronyms found in a given query string by replacing them with their definitions.

Parameters:

query (str) – The input string containing potential acronyms.
acronyms (dict) – A dictionary mapping acronyms (keys) to their definitions (values), based on http://ycopin.pages.euclid-sgs.uk/euclidator/ by Y. Copin.

Returns:

The modified query string with acronyms expanded to include their definitions.

Return type:

str

Example

>>> acro = {"DSS": "Data Storage System | Distributed Storage System"}
>>> expand_acronyms_in_query("What is DSS?", acro)
'What is DSS (Data Storage System | Distributed Storage System)?'
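One way to obtain the output shown in the example is a regular-expression substitution. This is a sketch, not the module's code: expand_acronyms_sketch is a hypothetical name, and treating any run of two or more uppercase letters or digits (starting with a letter) as a candidate acronym is an assumption.

```python
import re


def expand_acronyms_sketch(query: str, acronyms: dict) -> str:
    """Hypothetical stand-in: append the definition after each known acronym."""

    def replace(match):
        token = match.group(0)
        definition = acronyms.get(token)
        # Unknown uppercase tokens are left untouched.
        return f"{token} ({definition})" if definition else token

    return re.sub(r"\b[A-Z][A-Z0-9]+\b", replace, query)


acro = {"DSS": "Data Storage System | Distributed Storage System"}
print(expand_acronyms_sketch("What is DSS?", acro))
```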

euclid.rag.utils.acronym_handler.extract_acronyms(text: str) → set[str]#

Extract acronyms from a string.

Parameters:: text (str) – The string containing acronyms.
Returns:: A set of the acronyms.
Return type:: set[str]

euclid.rag.utils.acronym_handler.load_acronyms(path: str | Path) → dict[str, str]#

Load acronyms from a JSON file.

Parameters:: path (str | Path) – The file path to the JSON file containing acronyms.
Returns:: A dictionary where keys are acronyms and values are their corresponding definitions.
Return type:: dict[str, str]

euclid.rag.utils.acronym_handler.match_acronyms(text: str, acronym_dict: dict[str, str]) → dict[str, str]#

Match acronyms between a string and a dictionary of acronyms.

Parameters:

text (str) – The string containing acronyms.
acronym_dict (dict[str, str]) – A dictionary mapping acronyms (keys) to their definitions (values).

Returns:

A subset of acronym_dict.

Return type:

dict[str, str]

Utility for loading and parsing config files.

euclid.rag.utils.config.load_config(config_path: Path) → dict#

Load YAML configuration from a file.

Parameters:: config_path (Path) – Path to the config YAML file.
Returns:: Parsed configuration dictionary.
Return type:: dict

Utility for loading current device type.

euclid.rag.utils.device.get_device() → device#

Return the torch device to use for embedding.

Checks for available hardware acceleration in the following order: CUDA, MPS, then CPU.

Returns:: The selected device (‘cuda’, ‘mps’, or ‘cpu’).
Return type:: torch.device

Module providing a utility class for cleaning and preparing Redmine-exported data.

class euclid.rag.utils.redmine_cleaner.RedmineCleaner(max_chunk_length: int = 1000)#

Bases: object

A utility class for cleaning and preparing Redmine-exported data for ingestion in a RAG pipeline.

Parameters:: max_chunk_length (int, optional) – Maximum length for each split content chunk, by default 1000.

convert_redmine_bold_italic(line: str) → str#: Convert Redmine *bold* and _italic_ syntax to Markdown bold and italic.

convert_redmine_code_blocks(lines: list[str]) → list[str]#

Convert HTML pre tags to Markdown code blocks.

Supports multi-line <pre> sections by converting them to triple-backtick code blocks for better Markdown compatibility.

Parameters:: lines (list of str) – List of text lines that may contain Redmine-style code blocks.
Returns:: List of lines with <pre> tags converted to Markdown code blocks.
Return type:: list of str

Examples

>>> cleaner = RedmineCleaner()
>>> lines = ["Some text", "<pre>code here</pre>", "more text"]
>>> result = cleaner.convert_redmine_code_blocks(lines)
>>> print(result)
['Some text', '```', 'code here', '```', 'more text']

convert_redmine_headers(line: str) → str | None#: Convert Redmine headers (h1. to h6.) to Markdown (# to ######).
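The header conversion can be pictured as a single regular expression. This is a sketch under assumptions: convert_header_sketch is a hypothetical name, and returning None for lines that are not headers is inferred from the str | None return annotation, not confirmed by the reference.

```python
import re
from typing import Optional

# Matches "h1. Title" through "h6. Title" at the start of a line.
_HEADER = re.compile(r"^h([1-6])\.\s+(.*)$")


def convert_header_sketch(line: str) -> Optional[str]:
    """Hypothetical stand-in: map 'hN. Title' to N '#' characters plus the title."""
    match = _HEADER.match(line)
    if match is None:
        return None
    level = int(match.group(1))
    return "#" * level + " " + match.group(2).strip()


print(convert_header_sketch("h2. Data products"))
```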

convert_redmine_images(line: str) → str#: Convert Redmine image syntax !image.png! or !image.png|widthxheight! to Markdown ![alt](image.png).

convert_redmine_linebreaks(line: str) → str#: Convert explicit Redmine line breaks in text to Markdown double spaces + newline.

convert_redmine_links(line: str) → str#: Convert Redmine links "text":url to Markdown [text](url).
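The link rewrite can likewise be sketched with one substitution. The name convert_links_sketch is hypothetical, and the pattern assumes straight double quotes around the link text and a URL that runs up to the next whitespace.

```python
import re

# "text":url, where the URL extends to the next whitespace character.
_LINK = re.compile(r'"([^"]+)":(\S+)')


def convert_links_sketch(line: str) -> str:
    """Hypothetical stand-in: rewrite Redmine links as Markdown links."""
    return _LINK.sub(r"[\1](\2)", line)


print(convert_links_sketch('See "the wiki":https://example.org/wiki for details'))
```

A URL followed directly by punctuation, such as a sentence-ending period, would be captured together with that character by this simple pattern; a production version would need to handle that case.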

convert_redmine_lists(line: str) → str | None#: Convert Redmine nested lists (*, **) to Markdown lists with indentation.

convert_redmine_table(lines: list[str]) → tuple[list[str], int]#: Convert Redmine table block lines starting with | to Markdown table. Returns tuple (converted_lines, number_of_lines_consumed).

enrich_with_context(entry: dict[str, Any], chunk: str) → str#

Add page hierarchy information as a context prefix to the content.

Parameters:

entry (dict[str, Any]) – Original Redmine entry.
chunk (str) – A chunk of cleaned text content.

Returns:

Chunk prefixed with hierarchy context.

Return type:

str

filter_valid_entries(data: list[dict[str, Any]]) → list[dict[str, Any]]#

Keep only entries whose metadata status is not ‘NOK’.

Parameters:: data (list of dict) – A list of raw Redmine-exported entries.
Returns:: A filtered list of entries with acceptable statuses.
Return type:: list of dict
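The status filter amounts to a one-line list comprehension. In this sketch, filter_valid_sketch is a hypothetical name, and the status is assumed to live under entry["metadata"]["status"]; entries without a status are kept.

```python
def filter_valid_sketch(data):
    """Hypothetical stand-in: drop entries whose metadata status is 'NOK'."""
    return [
        entry
        for entry in data
        if entry.get("metadata", {}).get("status") != "NOK"
    ]


entries = [
    {"content": "kept", "metadata": {"status": "OK"}},
    {"content": "dropped", "metadata": {"status": "NOK"}},
    {"content": "no status", "metadata": {}},
]
print([e["content"] for e in filter_valid_sketch(entries)])
```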

normalize_metadata(metadata: dict[str, Any]) → dict[str, Any]#

Clean and normalize metadata fields (e.g., timestamps).

Parameters:: metadata (dict[str, Any]) – The metadata dictionary from a Redmine page.
Returns:: A normalized metadata dictionary.
Return type:: dict[str, Any]

prepare_for_ingestion(raw_data: list[dict[str, Any]]) → list[dict[str, Any]]#

Full pipeline: filter, clean, split and enrich Redmine data.

Parameters:: raw_data (list of dict) – List of Redmine entries (JSON-like).
Returns:: List of prepared documents ready for ingestion.
Return type:: list of dict

redmine_to_markdown(text: str) → str#: Convert a multiline Redmine-formatted text to Markdown, handling headers, lists, links, bold/italic, images, tables, code blocks, and line breaks.

split_content(content: str) → list[str]#

Split long text into smaller chunks based on sentence boundaries.

Parameters:: content (str) – The cleaned full page text.
Returns:: A list of shorter text chunks.
Return type:: list of str
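Sentence-based chunking with a length limit might be implemented along these lines. This is a sketch, not the class's method: split_content_sketch is a hypothetical name, sentences are assumed to end with '.', '!' or '?' followed by whitespace, and max_chunk_length mirrors the constructor default of 1000 characters.

```python
import re


def split_content_sketch(content: str, max_chunk_length: int = 1000) -> list:
    """Hypothetical stand-in: pack whole sentences into chunks up to a limit."""
    sentences = re.split(r"(?<=[.!?])\s+", content.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chunk_length:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_content_sketch("One. Two. Three.", max_chunk_length=10))
```

In this sketch a single sentence longer than the limit is kept whole rather than cut mid-sentence, so a chunk can exceed max_chunk_length in that case.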

Extra Tools#

Deduplication filter using hash, FAISS similarity, and cross-encoder re-ranking.

class euclid.rag.extra_scripts.deduplication.ChunkClusterer(distance_threshold: float = 0.1)#

Bases: object

Cluster embedding vectors and return one representative text per cluster.

This class uses clustering on embedding vectors to identify similar groups of text chunks.

Parameters:: distance_threshold (float, optional) – Maximum cosine distance between elements in a cluster. Lower values produce tighter, more conservative clusters. Default is 0.1.

filter(texts: list[str], embeddings: ndarray) → list[str]#

Filter semantically similar texts using clustering.

Parameters:

texts (list of str) – The input text chunks.
embeddings (np.ndarray) – The embedding vectors.

Returns:

One text per cluster.

Return type:

list of str

class euclid.rag.extra_scripts.deduplication.HashDeduplicator#

Bases: object

Deduplicator using SHA256 hashes for exact match detection.

Tracks seen inputs by their hash and filters out exact duplicates.

filter(text: str) → bool#

Check if the text has already been seen via hashing.

Parameters:: text (str) – Input text.
Returns:: True if the text is a duplicate, False otherwise.
Return type:: bool
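The behaviour described for HashDeduplicator can be reproduced with hashlib. The class below is a sketch with a hypothetical name; encoding the text as UTF-8 before hashing and storing hex digests in a set are assumptions about details the reference does not state.

```python
import hashlib


class HashDeduplicatorSketch:
    """Hypothetical stand-in: remember SHA256 digests of texts already seen."""

    def __init__(self):
        self._seen = set()

    def filter(self, text: str) -> bool:
        """Return True if the text was seen before, False otherwise."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False


dedup = HashDeduplicatorSketch()
print(dedup.filter("same chunk"), dedup.filter("same chunk"))
```

Note that filter returns True for a duplicate, so callers keep a text when the call returns False.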

class euclid.rag.extra_scripts.deduplication.SemanticSimilarityDeduplicator(vectorstore: FAISS | None, reranker_model: str, similarity_threshold: float, rerank_threshold: float, k_candidates: int = 5)#

Bases: object

Deduplicator using semantic similarity and optional reranking.

Uses FAISS to find similar texts and CrossEncoder to refine scoring. Texts are considered duplicates if both thresholds are exceeded.

filter(text: str) → bool#

Check if the text is semantically similar to existing vectors based on a similarity and reranking threshold.

Parameters:: text (str) – Input text.
Returns:: True if semantically duplicate, False otherwise.
Return type:: bool

Embedding and vector store management utilities for Euclid document ingestion.

This module provides:

An E5 embedding class with support for MPS/CUDA/CPU.
A function to load or create a FAISS vector store from PDF or JSON files.

class euclid.rag.extra_scripts.vectorstore_embedder.Embedder(model_name: str = 'intfloat/e5-small-v2', batch_size: int = 16)#

Bases: Embeddings

Embeds text into dense vectors using a HuggingFace model.

Supports MPS, CUDA, or CPU.

Pooling strategy (CLS or mean) is inferred automatically.

Parameters:

model_name (str, optional) – HuggingFace model to use. Default is “intfloat/e5-small-v2”.
batch_size (int, optional) – Number of texts per batch. Default is 16.

property device: device#: Return the torch device used by the model.

embed_documents(texts: list[str]) → list[list[float]]#: Embed a list of documents into dense vectors.

embed_query(text: str) → list[float]#: Embed a single query into a dense vector.

euclid.rag.extra_scripts.vectorstore_embedder.load_json_documents(json_paths: list[Path]) → list[Document]#

Load documents from a list of JSON files.

Each JSON file should contain a list of dicts with at least a “content” field and optionally a “metadata” field.

Parameters:: json_paths (List[Path]) – List of paths to JSON files.
Returns:: A list of LangChain Document objects.
Return type:: List[Document]

euclid.rag.extra_scripts.vectorstore_embedder.load_or_create_index(index_dir: Path, embedder: Embeddings, pdf_paths: None | list[Path] = None, json_paths: None | list[Path] = None) → FAISS#

Load an existing FAISS index, or build one from given documents.

Parameters:

index_dir (Path) – Directory where the FAISS index is stored (or will be created).
embedder (Embeddings) – Embedding model implementing the LangChain Embeddings interface.
pdf_paths (list[Path], optional) – Paths to PDF documents to embed.
json_paths (list[Path], optional) – Paths to JSON documents to embed.

Returns:

A ready-to-use FAISS vectorstore.

Return type:

FAISS

euclid.rag.extra_scripts.vectorstore_embedder.load_pdf_documents(pdf_paths: list[Path]) → list[Document]#

Load documents from a list of PDF files using PyMuPDF.

Parameters:: pdf_paths (List[Path]) – List of paths to PDF files.
Returns:: A list of LangChain Document objects.
Return type:: List[Document]
