Python API Reference#

Core Application#

The main chatbot and user interface functionality.

Key Functions#

The core application provides functions for:

  • Configuring retrievers for document search

  • Handling user input and responses

  • Setting up the Streamlit interface

Streamlit chatbot interface for the Euclid AI Assistant. Uses an existing FAISS vector store and E5 embeddings for retrieval.

euclid.rag.chatbot.configure_retriever(config: dict, index_dir: str) VectorStoreRetriever#

Build and cache a FAISS-based retriever for Euclid knowledge sources.

Parameters:
  • config (dict) – Embedding model configuration.

  • index_dir (str) – Path to the directory containing the FAISS index.

Returns:

Configured Retriever with search_type="similarity" and k=6.

Return type:

VectorStoreRetriever

Raises:

RuntimeError – If the FAISS index cannot be found or loaded.

euclid.rag.chatbot.create_euclid_router(config: dict) Callable[[dict, list[BaseCallbackHandler] | None], dict]#

Return the Euclid-AI router, which always delegates to at least one sub-agent.

euclid.rag.chatbot.handle_user_input(router: Callable[[dict, list[BaseCallbackHandler] | None], dict], msgs: StreamlitChatMessageHistory) None#

Display chat history and handle new user input in Streamlit.

Parameters:
  • router (Callable) – Callable that routes the query to the appropriate tools for the response.

  • msgs (StreamlitChatMessageHistory) – Chat history object for managing messages.

euclid.rag.chatbot.submit_text() None#

Flag that the user pressed <enter> in the chat box (Streamlit callback).

Set up the sidebar, landing page, and header/footer for a Streamlit app that interacts with the chatbot.

Set up the header and footer for the Streamlit app.

euclid.rag.layout.setup_landing_page() None#

Set up the landing page for the Streamlit app.

euclid.rag.layout.setup_sidebar() None#

Set up the sidebar for the Streamlit app.

Create a Streamlit callback handler that dynamically updates a UI container with new tokens from a language model.

euclid.rag.streamlit_callback.get_streamlit_cb(parent_container: DeltaGenerator) BaseCallbackHandler#

Create a Streamlit callback handler that updates the provided Streamlit container with new tokens.

Parameters:

parent_container (DeltaGenerator) – The Streamlit container where the text will be rendered.

Returns:

An instance of a callback handler configured for Streamlit.

Return type:

BaseCallbackHandler

Retrieval System#

Tools for retrieving and formatting information from document stores.

Overview#

The retrieval system provides specialized tools for querying Euclid mission documents. The system retrieves relevant documents from vector stores, ranks them using similarity and metadata scoring, and provides the top-ranked sources as context for response generation.

Source Attribution#

The chatbot responses include the top-ranked documents that were provided as context to the language model. These sources represent the retrieved, reranked, and deduplicated documents used to generate the response, not necessarily a selection made by the language model itself.

Generic Retriever Tool for querying Euclid-Consortium documents.

euclid.rag.retrievers.generic_retrieval_tool.bonus_overlap(q: set[str], field: str | None, weight: float) float#

Compute weighted count of query tokens in a metadata field.

euclid.rag.retrievers.generic_retrieval_tool.format_source(m: dict) str#

Format a source line based on document metadata.

euclid.rag.retrievers.generic_retrieval_tool.get_generic_retrieval_tool(llm: BaseLanguageModel, retriever: VectorStoreRetriever) Tool#

Return a generic Euclid retrieval tool that answers questions from any document type (publications, DPDD, etc.).

Parameters:
  • llm (BaseLanguageModel) – The language model used to generate answers.

  • retriever (VectorStoreRetriever) – The retriever for accessing vectorstore documents.

Returns:

A callable tool that answers questions and formats sources for all Euclid document types.

Return type:

Tool

euclid.rag.retrievers.generic_retrieval_tool.normalize_url(url: str | None) str | None#

Remove URL fragments and query parameters for deduplication.
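The normalization described above can be sketched with the standard library. This is a minimal illustration, not the package's implementation: the name normalize_url_sketch and the example URL are hypothetical, and keeping the scheme, host, and path unchanged is an assumption.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def normalize_url_sketch(url: Optional[str]) -> Optional[str]:
    """Drop the query string and fragment so URL variants compare equal."""
    if url is None:
        return None
    parts = urlsplit(url)
    # Keep scheme, host, and path; blank out the query and the fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(normalize_url_sketch("https://example.org/paper?id=3#sec2"))
# prints: https://example.org/paper
```

Two links to different sections of the same page then map to one key, which is what deduplication by URL needs.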

euclid.rag.retrievers.generic_retrieval_tool.semantic_rerank(query: str, docs: list) list#

Rerank a list of documents by semantic similarity to the query.

euclid.rag.retrievers.generic_retrieval_tool.tokenize(text: str) set[str]#

Convert text into a set of lowercase tokens ≥3 characters.

Tool for querying Euclid-Consortium publications and metadata.

euclid.rag.retrievers.publication_tool.bonus_overlap(q: set[str], field: str | None, weight: float) float#

Compute a weighted count of query tokens found in a metadata field.

Parameters:
  • q (set of str) – Query tokens.

  • field (str or None) – Metadata field to search for token matches.

  • weight (float) – Weight to multiply the count by.

Returns:

Weighted token overlap score.

Return type:

float
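As a rough illustration of the score described above, the sketch below counts the distinct query tokens that also occur in the field and multiplies the count by the weight. The name bonus_overlap_sketch is hypothetical, and both the way the field is split into tokens and the choice to count distinct tokens are assumptions; the package may do either differently.

```python
import re
from typing import Optional


def bonus_overlap_sketch(q: set[str], field: Optional[str], weight: float) -> float:
    """Weighted count of query tokens that also occur in a metadata field."""
    if not field:
        # A missing or empty field contributes no bonus.
        return 0.0
    # Assumption: the field is split into lowercase alphanumeric tokens.
    field_tokens = set(re.findall(r"[a-z0-9]+", field.lower()))
    return weight * len(q & field_tokens)


query_tokens = {"euclid", "photometry"}
print(bonus_overlap_sketch(query_tokens, "Euclid preparation: NISP photometry", 0.5))
# prints: 1.0
```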

euclid.rag.retrievers.publication_tool.get_publication_tool(llm: BaseLanguageModel, retriever: VectorStoreRetriever) Tool#

Return a tool that answers questions using Euclid Consortium publications.

Uses a language model and a vectorstore retriever to find and summarize relevant papers.

Parameters:
  • llm (BaseLanguageModel) – The language model used to generate answers.

  • retriever (VectorStoreRetriever) – The retriever for accessing publication documents.

Returns:

A callable tool that answers questions based on EC publications.

Return type:

Tool

euclid.rag.retrievers.publication_tool.semantic_rerank(query: str, docs: list) list#

Rerank a list of documents based on semantic similarity to a query.

Parameters:
  • query (str) – The input query string.

  • docs (list) –

    A list of documents to rerank.

    Each document must have a page_content attribute.

Returns:

The input documents sorted by descending relevance to the query.

Return type:

list

euclid.rag.retrievers.publication_tool.tokenize(text: str) set[str]#

Convert text into a set of lowercase tokens with ≥3 characters.

Punctuation is removed before tokenization.

Parameters:

text (str) – The input text string.

Returns:

Set of cleaned, lowercase tokens at least 3 characters long.

Return type:

set of str
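A minimal sketch of the tokenization described above, for illustration only. The name tokenize_sketch is hypothetical, and replacing punctuation with spaces (rather than deleting it outright) is an assumption made here so that hyphenated words split into separate tokens.

```python
import string


def tokenize_sketch(text: str) -> set[str]:
    """Lowercase the text, strip punctuation, keep tokens of 3+ characters."""
    # Assumption: each ASCII punctuation character is replaced by a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = text.lower().translate(table)
    return {token for token in cleaned.split() if len(token) >= 3}


print(sorted(tokenize_sketch("What is the NISP, in Euclid?")))
# prints: ['euclid', 'nisp', 'the', 'what']
```

Short words such as "is" and "in" fall below the three-character limit and are dropped.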

Tool for querying Euclid-Consortium Redmine and metadata.

class euclid.rag.retrievers.redmine_tool.RedmineRetrieverHelper(query: str, dedup_hash: HashDeduplicator, dedup_semantic: SemanticSimilarityDeduplicator)#

Bases: object

Helper class to encapsulate the multi-stage retrieval logic for Redmine.

remove_duplicate_docs(scored_docs: list[tuple]) list[tuple]#

Remove exact and semantic duplicates from retrieved documents.

Parameters:

scored_docs (list of tuple) – A list of (document, score) pairs. Each document should have a .page_content attribute containing its text.

Returns:

Deduplicated (score, document) pairs, sorted by score in descending order.

Return type:

list of tuple

Notes

  • Exact duplicates are removed using self.dedup_hash.

  • Semantic duplicates are removed using self.dedup_semantic if enabled.

  • The input is expected in the form (document, score), while the output is returned in the form (score, document) for downstream ranking.
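The tuple ordering in the notes above is easy to get wrong, so here is a small sketch of the exact-duplicate stage only. Doc and dedup_exact_sketch are hypothetical stand-ins: the real method delegates to self.dedup_hash and may also apply the semantic stage, and keeping the first occurrence of a duplicate is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a document with a page_content attribute."""

    page_content: str


def dedup_exact_sketch(scored_docs: list[tuple[Doc, float]]) -> list[tuple[float, Doc]]:
    """Drop exact duplicates, flip each pair, and sort by descending score."""
    seen: set[str] = set()
    kept: list[tuple[float, Doc]] = []
    for doc, score in scored_docs:  # input order: (document, score)
        if doc.page_content in seen:
            continue  # assumption: the first occurrence wins
        seen.add(doc.page_content)
        kept.append((score, doc))  # output order: (score, document)
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept


pairs = [(Doc("a"), 0.2), (Doc("b"), 0.9), (Doc("a"), 0.5)]
print(dedup_exact_sketch(pairs))
# prints: [(0.9, Doc(page_content='b')), (0.2, Doc(page_content='a'))]
```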

score_by_metadata(docs: list, top_k_docs: int = 10) list[tuple]#

Score documents based on metadata keyword overlap and recency.

semantic_rerank(docs: list) list#

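Rerank a list of documents based on semantic similarity to the query.

Parameters:
  • query (str) – The input query string. It does not appear in this method's signature and is presumably the query supplied to the constructor.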

  • docs (list) –

    A list of (LangChain) documents to rerank.

    Each document must have a page_content attribute.

Returns:

The input documents sorted by descending relevance to the query.

Return type:

list

euclid.rag.retrievers.redmine_tool.get_redmine_tool(llm: BaseLanguageModel, retriever: VectorStoreRetriever) Tool#

Return a tool that answers questions using Euclid Consortium Redmine. Uses a language model and a vector store retriever to find and summarize relevant Redmine wikis.

Parameters:
  • llm (BaseLanguageModel) – The language model used to generate answers.

  • retriever (VectorStoreRetriever) – The retriever for accessing Redmine documents.

Returns:

A callable tool that answers questions based on EC Redmine.

Return type:

Tool

Data Ingestion#

Ingest Euclid Science Ground Segment Data Product Description Document (DPDD).

This script downloads the DPDD from the Euclid website, processes it, and ingests the data into a FAISS vectorstore for use in the Euclid RAG system.

class euclid.rag.ingestion.ingest_dpdd.EuclidDPDDIngestor(vector_store_dir: Path, dpdd_config_path: Path)#

Bases: object

Downloads and ingests DPDD data into the vector store.

ingest_new_data() None#

Ingest new data into the vector store.

This method fetches DPDD entries, processes them, and adds them to the vector store, avoiding duplicates based on the ‘source’ metadata field.

Raises:

RuntimeError – If the vector store directory is missing or cannot be created.

Returns:

This function does not return anything; it performs the ingestion.

Return type:

None

euclid.rag.ingestion.ingest_dpdd.main() None#

Run the ingestion script.

euclid.rag.ingestion.ingest_dpdd.run_dpdd_ingestion(config: dict) None#

Run the DPDD ingestion process.

Parameters:

config (dict) – Configuration dictionary containing paths and settings.

Raises:

RuntimeError – If the vector store directory is missing or cannot be created.

Returns:

This function does not return anything; it performs the ingestion.

Return type:

None

Ingest publications into a FAISS vector store from the official EC BibTeX. Each paper is embedded immediately after download and deleted afterward.

class euclid.rag.ingestion.ingest_publications.EuclidBibIngestor(index_dir: Path, temp_dir: Path, data_config: dict)#

Bases: object

Downloads and updates the vector store from the Euclid BibTeX file.

update_from_bibtex() None#

Fetch new BibTeX entries and update the vector store.

euclid.rag.ingestion.ingest_publications.main() None#

Run the ingestion script.

euclid.rag.ingestion.ingest_publications.run_bibtex_ingestion(config: dict) None#

Run the bibtex ingestion script.

Module to ingest JSON-exported pages into a FAISS vector store.

class euclid.rag.ingestion.ingest_redmine.JSONIngestor(index_dir: Path, json_dir: Path, cleaner: RedmineCleaner, data_config: dict)#

Bases: object

Ingest JSON-exported pages into a FAISS vector store.

The JSON structure should be as follows:

{
    "content": "Full text of the page...",
    "metadata": {
        "field1": "",
        "field2": "",
        ...
    }
}
Parameters:
  • index_dir (Path) – Directory where the FAISS index will be stored.

  • json_dir (Path) – Directory containing JSON files to ingest.

  • cleaner (RedmineCleaner) – Text cleaning utility for preprocessing content.

  • data_config (dict) – Configuration dictionary containing embedding and processing settings.

ingest_json_files() None#

Ingest documents from JSON files into the FAISS vector store.

euclid.rag.ingestion.ingest_redmine.main() None#

Run the ingestion script.

euclid.rag.ingestion.ingest_redmine.run_redmine_ingestion(config: dict) None#

Run the ingestion process for Redmine pages using the provided configuration.

Utilities#

Module providing functions to load, extract, and expand acronyms from a JSON file and within a given text.

euclid.rag.utils.acronym_handler.expand_acronyms_in_query(query: str, acronyms: dict) str#

Expand acronyms found in a given query string by replacing them with their definitions.

Parameters:
  • query (str) – The input string containing potential acronyms.

  • acronyms (dict) – A dictionary mapping acronyms (keys) to their definitions (values), based on http://ycopin.pages.euclid-sgs.uk/euclidator/ by Y. Copin.

Returns:

The modified query string with acronyms expanded to include their definitions.

Return type:

str

Example

>>> acro = {"DSS": "Data Storage System | Distributed Storage System"}
>>> expand_acronyms_in_query("What is DSS?", acro)
'What is DSS (Data Storage System | Distributed Storage System)?'

euclid.rag.utils.acronym_handler.extract_acronyms(text: str) set[str]#

Extract acronyms from a string.

Parameters:

text (str) – The string containing acronyms.

Returns:

A set of the acronyms.

Return type:

set[str]

euclid.rag.utils.acronym_handler.load_acronyms(path: str | Path) dict[str, str]#

Load acronyms from a JSON file.

Parameters:

path (str | Path) – The file path to the JSON file containing acronyms.

Returns:

A dictionary where keys are acronyms and values are their corresponding definitions.

Return type:

dict[str, str]

euclid.rag.utils.acronym_handler.match_acronyms(text: str, acronym_dict: dict[str, str]) dict[str, str]#

Match acronyms between a string and a dictionary of acronyms.

Parameters:
  • text (str) – The string containing acronyms.

  • acronym_dict (dict[str, str]) – A dictionary mapping acronyms (keys) to their definitions (values).

Returns:

A subset of acronym_dict.

Return type:

dict[str, str]
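The three acronym helpers fit together roughly as sketched below. This is an illustration under stated assumptions, not the module's code: the function names are hypothetical, and the regular expression (an uppercase letter followed by one or more uppercase letters or digits) is a guess at what counts as an acronym. The expansion format follows the example given for expand_acronyms_in_query.

```python
import re

# Assumption: an acronym is an uppercase letter followed by 1+ uppercase letters or digits.
ACRONYM_RE = re.compile(r"\b[A-Z][A-Z0-9]+\b")


def extract_acronyms_sketch(text: str) -> set[str]:
    """Collect every acronym-like token in the text."""
    return set(ACRONYM_RE.findall(text))


def match_acronyms_sketch(text: str, acronym_dict: dict[str, str]) -> dict[str, str]:
    """Keep only the dictionary entries whose acronym occurs in the text."""
    found = extract_acronyms_sketch(text)
    return {acr: acronym_dict[acr] for acr in found if acr in acronym_dict}


def expand_acronyms_sketch(query: str, acronyms: dict[str, str]) -> str:
    """Append the definition in parentheses after each known acronym."""
    matched = match_acronyms_sketch(query, acronyms)

    def expand(match: re.Match) -> str:
        acr = match.group(0)
        return f"{acr} ({matched[acr]})" if acr in matched else acr

    return ACRONYM_RE.sub(expand, query)


acro = {"DSS": "Data Storage System | Distributed Storage System"}
print(expand_acronyms_sketch("What is DSS?", acro))
# prints: What is DSS (Data Storage System | Distributed Storage System)?
```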

Utility for loading and parsing config files.

euclid.rag.utils.config.load_config(config_path: Path) dict#

Load YAML configuration from a file.

Parameters:

config_path (Path) – Path to the config YAML file.

Returns:

Parsed configuration dictionary.

Return type:

dict

Utility for loading current device type.

euclid.rag.utils.device.get_device() device#

Return the torch device to use for embedding.

Checks for available hardware acceleration in the following order: CUDA, MPS, then CPU.

Returns:

The selected device (‘cuda’, ‘mps’, or ‘cpu’).

Return type:

torch.device

Module providing a utility class for cleaning and preparing Redmine-exported data.

class euclid.rag.utils.redmine_cleaner.RedmineCleaner(max_chunk_length: int = 1000)#

Bases: object

A utility class for cleaning and preparing Redmine-exported data for ingestion in a RAG pipeline.

Parameters:

max_chunk_length (int, optional) – Maximum length for each split content chunk, by default 1000.

convert_redmine_bold_italic(line: str) str#

Convert *bold* and _italic_ Redmine syntax to Markdown bold and italic.

convert_redmine_code_blocks(lines: list[str]) list[str]#

Convert HTML pre tags to Markdown code blocks.

Supports multi-line <pre> sections by converting them to triple-backtick code blocks for better Markdown compatibility.

Parameters:

lines (list of str) – List of text lines that may contain Redmine-style code blocks.

Returns:

List of lines with <pre> tags converted to Markdown code blocks.

Return type:

list of str

Examples

>>> cleaner = RedmineCleaner()
>>> lines = ["Some text", "<pre>code here</pre>", "more text"]
>>> result = cleaner.convert_redmine_code_blocks(lines)
>>> print(result)
['Some text', '```', 'code here', '```', 'more text']

convert_redmine_headers(line: str) str | None#

Convert Redmine headers (h1. to h6.) to Markdown (# to ######).
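The header mapping can be illustrated with a single regular expression. The sketch below is hypothetical (convert_header_sketch is not part of the package), and returning None for a line that is not a header is an assumption about what the str | None return type means.

```python
import re
from typing import Optional

# Matches "h1. Title" through "h6. Title" at the start of a line.
HEADER_RE = re.compile(r"^h([1-6])\.\s+(.*)$")


def convert_header_sketch(line: str) -> Optional[str]:
    """Turn a Redmine header line into a Markdown header, else return None."""
    match = HEADER_RE.match(line)
    if match is None:
        return None  # assumption: None signals "not a header line"
    level, title = int(match.group(1)), match.group(2)
    return "#" * level + " " + title


print(convert_header_sketch("h2. Data products"))
# prints: ## Data products
```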

convert_redmine_images(line: str) str#

Convert Redmine image syntax !image.png! or !image.png|widthxheight! to Markdown ![alt](image.png).

convert_redmine_linebreaks(line: str) str#

Convert explicit Redmine line breaks in text to Markdown double spaces + newline.

Convert Redmine links “text”:url to Markdown [text](url).
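The image and link conversions described above can be sketched as two substitutions. This is an illustration only: convert_inline_sketch is a hypothetical name, the patterns assume straight double quotes around the link text, and reusing the file name as the Markdown alt text is an assumption, since the reference does not say what the alt text is.

```python
import re

# Assumption: images look like !file.png! or !file.png|WxH!, links like "text":url.
IMAGE_RE = re.compile(r"!([^!|\s]+)(?:\|[^!]*)?!")
LINK_RE = re.compile(r'"([^"]+)":(\S+)')


def convert_inline_sketch(line: str) -> str:
    """Rewrite Redmine images and links on one line as Markdown."""
    # The optional |WxH size hint is dropped; the file name doubles as alt text.
    line = IMAGE_RE.sub(r"![\1](\1)", line)
    return LINK_RE.sub(r"[\1](\2)", line)


print(convert_inline_sketch('See "the wiki":https://example.org/wiki for details'))
# prints: See [the wiki](https://example.org/wiki) for details
print(convert_inline_sketch("!plot.png|200x100!"))
# prints: ![plot.png](plot.png)
```

A URL followed directly by punctuation would be captured together with that punctuation in this sketch, so a production version would need a stricter URL pattern.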

convert_redmine_lists(line: str) str | None#

Convert Redmine nested lists (*, **) to Markdown lists with indentation.

convert_redmine_table(lines: list[str]) tuple[list[str], int]#

Convert Redmine table block lines starting with | to Markdown table. Returns tuple (converted_lines, number_of_lines_consumed).

enrich_with_context(entry: dict[str, Any], chunk: str) str#

Add page hierarchy information as a context prefix to the content.

Parameters:
  • entry (dict[str, Any]) – Original Redmine entry.

  • chunk (str) – A chunk of cleaned text content.

Returns:

Chunk prefixed with hierarchy context.

Return type:

str

filter_valid_entries(data: list[dict[str, Any]]) list[dict[str, Any]]#

Keep only entries whose metadata status is not ‘NOK’.

Parameters:

data (list of dict) – A list of raw Redmine-exported entries.

Returns:

A filtered list of entries with acceptable statuses.

Return type:

list of dict
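A minimal sketch of the filter described above. The name filter_valid_entries_sketch is hypothetical, and two points are assumptions: that the status is stored under entry["metadata"]["status"], and that entries without a status are kept.

```python
from typing import Any


def filter_valid_entries_sketch(data: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep every entry whose metadata status is not 'NOK'."""
    # Assumption: the status lives at entry["metadata"]["status"].
    return [entry for entry in data if entry.get("metadata", {}).get("status") != "NOK"]


entries = [
    {"content": "kept", "metadata": {"status": "OK"}},
    {"content": "dropped", "metadata": {"status": "NOK"}},
    {"content": "no status", "metadata": {}},
]
print([entry["content"] for entry in filter_valid_entries_sketch(entries)])
# prints: ['kept', 'no status']
```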

normalize_metadata(metadata: dict[str, Any]) dict[str, Any]#

Clean and normalize metadata fields (e.g., timestamps).

Parameters:

metadata (dict[str, Any]) – The metadata dictionary from a Redmine page.

Returns:

A normalized metadata dictionary.

Return type:

dict[str, Any]

prepare_for_ingestion(raw_data: list[dict[str, Any]]) list[dict[str, Any]]#

Full pipeline: filter, clean, split and enrich Redmine data.

Parameters:

raw_data (list of dict) – List of Redmine entries (JSON-like).

Returns:

List of prepared documents ready for ingestion.

Return type:

list of dict

redmine_to_markdown(text: str) str#

Convert a multiline Redmine-formatted text to Markdown, handling headers, lists, links, bold/italic, images, tables, code blocks, and line breaks.

split_content(content: str) list[str]#

Split long text into smaller chunks based on sentence boundaries.

Parameters:

content (str) – The cleaned full page text.

Returns:

A list of shorter text chunks.

Return type:

list of str
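Sentence-based chunking of this kind can be sketched as a greedy packer: sentences are appended to the current chunk until the next one would exceed the limit. The function below is a hypothetical illustration, not the class's code; the sentence boundary rule (whitespace after ".", "!" or "?") and the handling of over-long sentences are assumptions.

```python
import re


def split_content_sketch(content: str, max_chunk_length: int = 1000) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chunk_length."""
    # Assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", content.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chunk_length:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_content_sketch("One two. Three four. Five six.", max_chunk_length=20))
# prints: ['One two. Three four.', 'Five six.']
```

In this sketch a single sentence longer than the limit is kept whole rather than cut mid-sentence.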

Extra Tools#

Deduplication filter using hash, FAISS similarity, and cross-encoder re-ranking.

class euclid.rag.extra_scripts.deduplication.ChunkClusterer(distance_threshold: float = 0.1)#

Bases: object

Cluster embedding vectors and return one representative text per cluster.

This class uses clustering on embedding vectors to identify similar groups of text chunks.

Parameters:

distance_threshold (float, optional) – Maximum cosine distance between elements in a cluster. Lower values produce tighter, more conservative clusters. Default is 0.1.

filter(texts: list[str], embeddings: ndarray) list[str]#

Filter semantically similar texts using clustering.

Parameters:
  • texts (list of str) – The input text chunks.

  • embeddings (np.ndarray) – The embedding vectors.

Returns:

One text per cluster.

Return type:

list of str

class euclid.rag.extra_scripts.deduplication.HashDeduplicator#

Bases: object

Deduplicator using SHA256 hashes for exact match detection.

Tracks seen inputs by their hash and filters out exact duplicates.

filter(text: str) bool#

Check if the text has already been seen via hashing.

Parameters:

text (str) – Input text.

Returns:

True if the text is a duplicate, False otherwise.

Return type:

bool
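The behaviour described above (first sighting returns False, repeats return True) can be sketched with hashlib. HashDeduplicatorSketch is a hypothetical stand-in; hashing the UTF-8 bytes of the unmodified text is an assumption, and the real class may normalize the text first.

```python
import hashlib


class HashDeduplicatorSketch:
    """Remember SHA256 digests of seen texts and flag exact repeats."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def filter(self, text: str) -> bool:
        """Return True if this exact text was seen before, else record it."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False


dedup = HashDeduplicatorSketch()
print(dedup.filter("same text"), dedup.filter("same text"), dedup.filter("other text"))
# prints: False True False
```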

class euclid.rag.extra_scripts.deduplication.SemanticSimilarityDeduplicator(vectorstore: FAISS | None, reranker_model: str, similarity_threshold: float, rerank_threshold: float, k_candidates: int = 5)#

Bases: object

Deduplicator using semantic similarity and optional reranking.

Uses FAISS to find similar texts and CrossEncoder to refine scoring. Texts are considered duplicates if both thresholds are exceeded.

filter(text: str) bool#

Check if the text is semantically similar to existing vectors based on a similarity and reranking threshold.

Parameters:

text (str) – Input text.

Returns:

True if semantically duplicate, False otherwise.

Return type:

bool
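The two-threshold rule can be sketched independently of FAISS and the cross-encoder. In the hypothetical function below both scorers are passed in as plain callables, and a toy word-overlap (Jaccard) score stands in for the real models. Treating "exceeded" as strictly greater, and treating higher scores as more similar, are assumptions.

```python
from typing import Callable, Iterable

Scorer = Callable[[str, str], float]


def jaccard(a: str, b: str) -> float:
    """Toy word-overlap score in [0, 1]; a stand-in for a real model score."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def is_duplicate_sketch(
    text: str,
    candidates: Iterable[str],
    similarity: Scorer,
    rerank: Scorer,
    similarity_threshold: float,
    rerank_threshold: float,
) -> bool:
    """Flag a duplicate only when a candidate exceeds both thresholds."""
    for candidate in candidates:
        if similarity(text, candidate) <= similarity_threshold:
            continue  # the first stage fails, so the second stage is skipped
        if rerank(text, candidate) > rerank_threshold:
            return True
    return False


print(is_duplicate_sketch("a b c", ["a b c d"], jaccard, jaccard, 0.7, 0.7))
# prints: True
```

Running the cheaper similarity check first means the second scorer only sees the few candidates that pass it.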

Embedding and vector store management utilities for Euclid document ingestion.

This module provides:

  • An E5 embedding class with support for MPS/CUDA/CPU.

  • A function to load or create a FAISS vector store from PDFs.

class euclid.rag.extra_scripts.vectorstore_embedder.Embedder(model_name: str = 'intfloat/e5-small-v2', batch_size: int = 16)#

Bases: Embeddings

Embeds text into dense vectors using a HuggingFace model.

Supports MPS, CUDA, or CPU.

Pooling strategy (CLS or mean) is inferred automatically.

Parameters:
  • model_name (str, optional) – HuggingFace model to use. Default is “intfloat/e5-small-v2”.

  • batch_size (int, optional) – Number of texts per batch. Default is 16.

property device: device#

Return the torch device used by the model.

embed_documents(texts: list[str]) list[list[float]]#

Embed a list of documents into dense vectors.

embed_query(text: str) list[float]#

Embed a single query into a dense vector.

euclid.rag.extra_scripts.vectorstore_embedder.load_json_documents(json_paths: list[Path]) list[Document]#

Load documents from a list of JSON files.

Each JSON file should contain a list of dicts with at least a “content” field and optionally a “metadata” field.

Parameters:

json_paths (List[Path]) – List of paths to JSON files.

Returns:

A list of LangChain Document objects.

Return type:

List[Document]
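The expected file layout (a list of dicts with "content" and optional "metadata") can be read as sketched below. To stay self-contained the sketch uses a small dataclass, DocumentSketch, as a hypothetical stand-in for the LangChain Document class; the function name and the empty-dict default for missing metadata are also assumptions.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class DocumentSketch:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def load_json_documents_sketch(json_paths: list[Path]) -> list[DocumentSketch]:
    """Read each JSON file (a list of dicts) into document objects."""
    documents: list[DocumentSketch] = []
    for path in json_paths:
        entries = json.loads(Path(path).read_text(encoding="utf-8"))
        for entry in entries:
            # "content" is required; "metadata" falls back to an empty dict.
            documents.append(DocumentSketch(entry["content"], entry.get("metadata", {})))
    return documents
```

An entry without a "content" field raises KeyError in this sketch, which surfaces malformed files early instead of silently embedding empty text.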

euclid.rag.extra_scripts.vectorstore_embedder.load_or_create_index(index_dir: Path, embedder: Embeddings, pdf_paths: None | list[Path] = None, json_paths: None | list[Path] = None) FAISS#

Load an existing FAISS index, or build one from given documents.

Parameters:
  • index_dir (Path) – Directory where the FAISS index is stored (or will be created).

  • embedder (Embeddings) – Embedding model implementing the LangChain Embeddings interface.

  • pdf_paths (list[Path], optional) – PDF files to load and embed.

  • json_paths (list[Path], optional) – JSON files to load and embed.

Returns:

A ready-to-use FAISS vectorstore.

Return type:

FAISS

euclid.rag.extra_scripts.vectorstore_embedder.load_pdf_documents(pdf_paths: list[Path]) list[Document]#

Load documents from a list of PDF files using PyMuPDF.

Parameters:

pdf_paths (List[Path]) – List of paths to PDF files.

Returns:

A list of LangChain Document objects.

Return type:

List[Document]