Document Ingestion#

Before using the chatbot, you must ingest documents into the vector store. This process downloads, processes, and indexes documents for semantic search.

Prerequisites#

  • Installation completed

  • Configuration file set up

  • Internet connection for downloading documents

  • Sufficient disk space for vector stores

Warning

If the vector store is missing, the app will fail to start with:

RuntimeError: Vector store missing. Please run ingestion before launching the app.

Publications Ingestion#

Ingest Euclid collaboration publications from the official BibTeX bibliography.

The EuclidBibIngestor is specifically designed to process publications from the Euclid collaboration’s official bibliography file, which contains peer-reviewed papers, conference proceedings, and preprints related to the Euclid mission.
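
For illustration, here is a minimal sketch of how the bibliography could be fetched and parsed, assuming the requests and bibtexparser packages; the field names (eprint, title) follow common BibTeX conventions and are assumptions, not a description of the EuclidBibIngestor internals:

# Sketch: fetch the Euclid BibTeX file and derive arXiv PDF URLs.
# Field names (eprint, title) are assumptions, not ingestor internals.
import requests
import bibtexparser

BIBTEX_URL = "https://eceb.astro.uni-bonn.de/public/Euclid.bib"
ARXIV_PDF_BASE_URL = "https://arxiv.org/pdf/"

bib_text = requests.get(BIBTEX_URL, timeout=60).text
bib_db = bibtexparser.loads(bib_text)

for entry in bib_db.entries:
    title = entry.get("title", "untitled")
    eprint = entry.get("eprint")  # arXiv identifier, when present
    if eprint:
        print(f"{title} -> {ARXIV_PDF_BASE_URL}{eprint}")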

Basic Usage#

python python/euclid/rag/ingestion/ingest_publications.py --config path/to/app_config.yaml

Configuration Requirements#

Ensure your app_config.yaml includes:

pdf_data:
  bibtex_url: "https://eceb.astro.uni-bonn.de/public/Euclid.bib"
  arxiv_pdf_base_url: "https://arxiv.org/pdf/"
  chunk_size: 800
  chunk_overlap: 100
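
To confirm the section is picked up as expected, it can be loaded with PyYAML; a minimal sketch (the ingestor's own configuration loader may differ):

# Sketch: read the pdf_data section from app_config.yaml with PyYAML.
import yaml

with open("path/to/app_config.yaml") as fh:
    config = yaml.safe_load(fh)

pdf_cfg = config["pdf_data"]
print(pdf_cfg["bibtex_url"])
print(pdf_cfg["chunk_size"], pdf_cfg["chunk_overlap"])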

Data Source#

The ingestion process downloads and parses the official Euclid collaboration BibTeX file, which includes:

  • Key project papers - Foundational Euclid mission publications

  • Data release papers - Documentation for specific data releases (e.g., Q1 Special Issue)

  • Technical papers - Instrument descriptions, data processing methods

  • Scientific results - Analysis and findings from Euclid observations

Key Features#

BibTeX Processing
  • Parses official Euclid collaboration bibliography

  • Extracts publication metadata (authors, titles, abstracts)

  • Downloads full-text PDFs from arXiv when available

Content Extraction
  • Processes PDF content for full-text search

  • Maintains publication metadata for proper attribution

  • Handles various publication formats (journal articles, preprints, conference papers)

Semantic Deduplication
  • Prevents ingestion of semantically similar content across publications

  • Particularly effective for reducing redundant introductory material

  • Uses similarity thresholds to identify and filter duplicate content sections (see the sketch after this list)

Vector Store Integration
  • Chunks document content for semantic search

  • Configurable chunk sizes and overlap for optimal retrieval

  • Prevents duplicate ingestion of existing publications
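
The exact model and thresholds used for semantic deduplication are internal to the ingestor; the following is only a sketch of the general idea, comparing chunk embeddings with cosine similarity (the sentence-transformers model name and the 0.95 threshold are illustrative assumptions):

# Sketch: threshold-based semantic deduplication over chunk embeddings.
# The embedding model and threshold are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
THRESHOLD = 0.95                                 # assumed similarity threshold

def deduplicate(chunks: list[str]) -> list[str]:
    kept: list[str] = []
    kept_vecs: list[np.ndarray] = []
    for chunk, vec in zip(chunks, model.encode(chunks, normalize_embeddings=True)):
        # With normalized vectors, cosine similarity is a plain dot product.
        if all(float(np.dot(vec, k)) < THRESHOLD for k in kept_vecs):
            kept.append(chunk)
            kept_vecs.append(vec)
    return kept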

Custom Configuration#

To use a different configuration file:

python python/euclid/rag/ingestion/ingest_publications.py -c /path/to/custom_config.yaml

# Or using the long form
python python/euclid/rag/ingestion/ingest_publications.py --config /path/to/custom_config.yaml

JSON Document Ingestion#

Ingest documents from JSON files. While commonly used for Redmine wiki exports, this ingestion method works with any JSON document structure.

Basic Usage#

python python/euclid/rag/ingestion/ingest_json.py -c /path/to/custom_config.yaml

# Or using the long form
python python/euclid/rag/ingestion/ingest_json.py --config /path/to/custom_config.yaml

This processes the JSON files in the directory specified by the json_data.redmine_json_dir configuration key.

Configuration Requirements#

Ensure your app_config.yaml includes:

json_data:
  redmine_json_dir: "/path/to/json/files"
  chunk_size: 800
  json_root_key: pages

JSON File Format#

The expected JSON structure (example using Redmine export format):

{
  "pages": [
    {
      "project_id": "project-name",
      "page_name": "Wiki Page Title",
      "content": "Page content text...",
      "updated_on": "2024-01-15T10:30:00Z",
      "url": "https://redmine.example.com/page",
      "metadata": {
        "author": "username",
        "version": 5
      }
    }
  ]
}
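
A minimal sketch of how such a file can be traversed using the configured root key (the field mapping below is an assumption; the actual ingestor may map fields differently):

# Sketch: iterate over documents under the configured JSON root key.
import json

JSON_ROOT_KEY = "pages"  # matches json_data.json_root_key in app_config.yaml

with open("/path/to/json/files/export.json") as fh:  # hypothetical file name
    data = json.load(fh)

for page in data[JSON_ROOT_KEY]:
    text = page["content"]
    metadata = {
        "project_id": page.get("project_id"),
        "page_name": page.get("page_name"),
        "url": page.get("url"),
        "updated_on": page.get("updated_on"),
    }
    print(metadata["page_name"], len(text), "characters")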

Key Features#

Deduplication

Uses a hardcoded set of key fields to prevent ingesting the same content multiple times.

Content Processing
  • Extracts text from JSON structure

  • Preserves metadata for source attribution

  • Handles nested JSON structures

Chunking Strategy
  • Respects document boundaries

  • Maintains context between related sections

  • Configurable chunk sizes for different content types

Custom JSON Structures#

For different JSON document formats, modify the configuration:

json_data:
  json_root_key: "documents" # Change root key for your JSON structure
  redmine_json_dir: "/path/to/your/json/files" # Any JSON documents

DPDD Ingestion#

The DPDD (Data Product Description Document) ingestion downloads and processes Euclid DPDD pages from the official website.

Basic Usage#

python python/euclid/rag/ingestion/ingest_dpdd.py --config path/to/app_config.yaml

Features#

The DPDD ingestion process:

  • Downloads DPDD pages from the Euclid website

  • Extracts subtopics and sections, skipping banned sections

  • Splits text into chunks using RecursiveCharacterTextSplitter

  • Stores chunks in a FAISS vector store for semantic search

  • Prevents duplicate ingestion by checking existing sources

Configuration Options#

The DPDD ingestion behavior is controlled by the dpdd_ingest_config.yaml file:

Selective Ingestion

Specify particular topics to ingest:

scrape_all: false
topics:
  - name: Purpose and Scope
    link: purpose.html
  - name: LE1 Data Products
    link: le1dpd/le1index.html

Complete Ingestion

Ingest all available content:

scrape_all: true
topics_number_limit: 0  # No limit

Limited Ingestion

Limit the number of topics:

scrape_all: true
topics_number_limit: 5  # Only first 5 topics

Ingestion Process Details#

Text Processing Pipeline#

  1. Document Download: Fetch content from configured URLs

  2. Content Extraction: Parse HTML and extract relevant text

  3. Section Filtering: Skip banned sections (headers, navigation, etc.)

  4. Text Chunking: Split long documents into manageable chunks

  5. Embedding Generation: Create vector embeddings for each chunk

  6. Vector Storage: Store embeddings in FAISS index

  7. Metadata Storage: Save document metadata for source attribution
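
Steps 5 through 7 can be pictured with LangChain's FAISS wrapper; this is a minimal sketch assuming the langchain_community and langchain_huggingface packages, an illustrative embedding model, and an assumed store directory name, not the pipeline's actual classes:

# Sketch of embedding generation and vector storage (pipeline steps 5-7).
# Model name, store directory, and metadata are illustrative assumptions.
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

chunks = ["Purpose and scope of the DPDD ...", "LE1 data products are ..."]
docs = [Document(page_content=text, metadata={"source": "purpose.html"}) for text in chunks]

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(docs, embeddings)
vector_store.save_local("public_data_vector_store")  # assumed directory name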

Chunk Size and Overlap#

The system uses RecursiveCharacterTextSplitter with optimized settings:

  • Chunk size: Balanced for context and performance

  • Chunk overlap: Ensures continuity between chunks

  • Separator handling: Respects document structure (paragraphs, sentences)
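
A minimal sketch of configuring the splitter with the chunk size and overlap from the example configuration above (the separator list is an assumption about how document structure is respected):

# Sketch: split text into overlapping chunks with RecursiveCharacterTextSplitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_document_text = "Purpose and scope of the DPDD. " * 200  # stand-in for a real page

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,     # characters per chunk, as in the example config
    chunk_overlap=100,  # overlap preserves continuity between adjacent chunks
    separators=["\n\n", "\n", ". ", " "],  # paragraphs, lines, sentences, words
)

chunks = splitter.split_text(long_document_text)
print(f"Created {len(chunks)} text chunks")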

Duplicate Prevention#

The ingestion process automatically:

  • Checks existing sources in the vector store

  • Skips already processed documents to avoid duplicates

  • Updates metadata for modified documents

  • Maintains consistency across ingestion runs
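
One way to picture the source check, assuming a LangChain FAISS store whose in-memory docstore exposes chunk metadata (the _dict attribute is an implementation detail, used here only for illustration):

# Sketch: keep only documents whose 'source' is not already in the store.
# Accessing docstore._dict is an implementation detail, shown for illustration only.
def filter_new_documents(vector_store, candidate_docs):
    existing_sources = {
        doc.metadata.get("source")
        for doc in vector_store.docstore._dict.values()
    }
    return [d for d in candidate_docs if d.metadata.get("source") not in existing_sources]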

Monitoring Ingestion Progress#

Command Line Output#

The ingestion scripts provide progress information:

Processing topic: Purpose and Scope
Extracting sections from: purpose.html
Skipping banned section: Header
Creating 15 text chunks
Storing embeddings in vector store
✓ Completed: Purpose and Scope (15 chunks)

Logging#

Detailed logs are available for troubleshooting:

# Enable verbose logging
python python/euclid/rag/ingestion/ingest_dpdd.py --config config.yaml --verbose

Storage Requirements#

Estimate disk space needed for your vector stores:

Small Dataset (< 100 documents)
  • Vector store: ~50-100 MB

  • Metadata: ~10-20 MB

Medium Dataset (100-1000 documents)
  • Vector store: ~500 MB - 1 GB

  • Metadata: ~50-100 MB

Large Dataset (> 1000 documents)
  • Vector store: > 1 GB

  • Metadata: > 100 MB

Note

Actual sizes depend on document length, embedding dimensions, and chunk sizes.
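
As a rough back-of-the-envelope example, assuming 384-dimensional float32 embeddings, 1,000 documents split into 20 chunks each occupy about 1,000 × 20 × 384 × 4 bytes ≈ 31 MB of raw vectors, before FAISS index structures and metadata are added.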

Batch Processing#

For large document collections, consider batch processing:

# Process publications first
python python/euclid/rag/ingestion/ingest_publications.py -c config.yaml

# Then process DPDD documents
python python/euclid/rag/ingestion/ingest_dpdd.py --config config.yaml

# Verify vector stores were created
ls -la *_vector_store/

Ingestion Validation#

After ingestion, verify the vector stores:

# Check vector store contents
import os

# Verify vector store files exist
vector_store_dirs = ["redmine_vector_store", "public_data_vector_store"]

for dir_name in vector_store_dirs:
    if os.path.exists(dir_name):
        files = os.listdir(dir_name)
        print(f"{dir_name}: {files}")
    else:
        print(f"⚠️  Missing: {dir_name}")

Performance Optimization#

For faster ingestion:

Parallel Processing

The ingestion scripts support concurrent processing where possible.

Network Optimization

Use a stable, fast internet connection for downloading documents.

Storage Optimization

Use SSD storage for better I/O performance during indexing.

Memory Management

Ensure sufficient RAM for large document processing.

Next Steps#

After successful ingestion, the vector stores are in place and the chatbot app can be launched. Common ingestion issues and solutions are covered in Troubleshooting.