Document Ingestion#

Before using the chatbot, you must ingest documents into the vector store. This process downloads, processes, and indexes documents for semantic search.

Prerequisites#

  • Installation completed

  • Configuration file set up

  • Internet connection for downloading documents

  • Sufficient disk space for vector stores

Warning

If the vector store is missing, the app will fail to start with:

RuntimeError: Vector store missing. Please run ingestion before launching the app.

Publications Ingestion#

Ingest Euclid collaboration publications from the official BibTeX bibliography.

The EuclidBibIngestor is specifically designed to process publications from the Euclid collaboration’s official bibliography file, which contains peer-reviewed papers, conference proceedings, and preprints related to the Euclid mission.
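
For illustration, here is a minimal sketch of how the bibliography could be fetched and parsed, assuming the requests and bibtexparser packages; the field names (eprint, title) follow common BibTeX conventions and are assumptions, not a description of the EuclidBibIngestor internals:

# Sketch: fetch the Euclid BibTeX file and derive arXiv PDF URLs.
# Field names (eprint, title) are assumptions, not ingestor internals.
import requests
import bibtexparser

BIBTEX_URL = "https://eceb.astro.uni-bonn.de/public/Euclid.bib"
ARXIV_PDF_BASE_URL = "https://arxiv.org/pdf/"

bib_text = requests.get(BIBTEX_URL, timeout=60).text
bib_db = bibtexparser.loads(bib_text)

for entry in bib_db.entries:
    title = entry.get("title", "untitled")
    eprint = entry.get("eprint")  # arXiv identifier, when present
    if eprint:
        print(f"{title} -> {ARXIV_PDF_BASE_URL}{eprint}")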

Basic Usage#

python python/euclid/rag/ingestion/ingest_publications.py --config path/to/app_config.yaml

Configuration Requirements#

Ensure your app_config.yaml includes:

pdf_data:
  bibtex_url: "https://eceb.astro.uni-bonn.de/public/Euclid.bib"
  arxiv_pdf_base_url: "https://arxiv.org/pdf/"
  chunk_size: 800
  chunk_overlap: 100
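
To confirm the section is picked up as expected, it can be loaded with PyYAML; a minimal sketch (the ingestor's own configuration loader may differ):

# Sketch: read the pdf_data section from app_config.yaml with PyYAML.
import yaml

with open("path/to/app_config.yaml") as fh:
    config = yaml.safe_load(fh)

pdf_cfg = config["pdf_data"]
print(pdf_cfg["bibtex_url"])
print(pdf_cfg["chunk_size"], pdf_cfg["chunk_overlap"])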

Data Source#

The ingestion process downloads and parses the official Euclid collaboration BibTeX file, which includes:

  • Key project papers - Foundational Euclid mission publications

  • Data release papers - Documentation for specific data releases (e.g., Q1 Special Issue)

  • Technical papers - Instrument descriptions, data processing methods

  • Scientific results - Analysis and findings from Euclid observations

Key Features#

BibTeX Processing
  • Parses official Euclid collaboration bibliography

  • Extracts publication metadata (authors, titles, abstracts)

  • Downloads full-text PDFs from arXiv when available

Content Extraction
  • Processes PDF content for full-text search

  • Maintains publication metadata for proper attribution

  • Handles various publication formats (journal articles, preprints, conference papers)

Semantic Deduplication
  • Prevents ingestion of semantically similar content across publications

  • Particularly effective for reducing redundant introductory material

  • Uses similarity thresholds to identify and filter duplicate content sections (see the sketch after this list)

Vector Store Integration
  • Chunks document content for semantic search

  • Configurable chunk sizes and overlap for optimal retrieval

  • Prevents duplicate ingestion of existing publications
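
The exact model and thresholds used for semantic deduplication are internal to the ingestor; the following is only a sketch of the general idea, comparing chunk embeddings with cosine similarity (the sentence-transformers model name and the 0.95 threshold are illustrative assumptions):

# Sketch: threshold-based semantic deduplication over chunk embeddings.
# The embedding model and threshold are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
THRESHOLD = 0.95                                 # assumed similarity threshold

def deduplicate(chunks: list[str]) -> list[str]:
    kept: list[str] = []
    kept_vecs: list[np.ndarray] = []
    for chunk, vec in zip(chunks, model.encode(chunks, normalize_embeddings=True)):
        # With normalized vectors, cosine similarity is a plain dot product.
        if all(float(np.dot(vec, k)) < THRESHOLD for k in kept_vecs):
            kept.append(chunk)
            kept_vecs.append(vec)
    return kept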

Custom Configuration#

To use a different configuration file:

python python/euclid/rag/ingestion/ingest_publications.py -c /path/to/custom_config.yaml

# Or using the long form
python python/euclid/rag/ingestion/ingest_publications.py --config /path/to/custom_config.yaml

JSON Document Ingestion#

Ingest documents from JSON files. While commonly used for Redmine wiki exports, this ingestion method works with any JSON document structure.

Basic Usage#

python python/euclid/rag/ingestion/ingest_json.py -c /path/to/custom_config.yaml

# Or using the long form
python python/euclid/rag/ingestion/ingest_json.py --config /path/to/custom_config.yaml

This processes the JSON files in the directory specified by the json_data.redmine_json_dir configuration key.

Configuration Requirements#

Ensure your app_config.yaml includes:

json_data:
  redmine_json_dir: "/path/to/json/files"
  chunk_size: 800
  json_root_key: pages

JSON File Format#

The expected JSON structure (example using Redmine export format):

{
  "pages": [
    {
      "project_id": "project-name",
      "page_name": "Wiki Page Title",
      "content": "Page content text...",
      "updated_on": "2024-01-15T10:30:00Z",
      "url": "https://redmine.example.com/page",
      "metadata": {
        "author": "username",
        "version": 5
      }
    }
  ]
}
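
A minimal sketch of how such a file can be traversed using the configured root key (the field mapping below is an assumption; the actual ingestor may map fields differently):

# Sketch: iterate over documents under the configured JSON root key.
import json

JSON_ROOT_KEY = "pages"  # matches json_data.json_root_key in app_config.yaml

with open("/path/to/json/files/export.json") as fh:  # hypothetical file name
    data = json.load(fh)

for page in data[JSON_ROOT_KEY]:
    text = page["content"]
    metadata = {
        "project_id": page.get("project_id"),
        "page_name": page.get("page_name"),
        "url": page.get("url"),
        "updated_on": page.get("updated_on"),
    }
    print(metadata["page_name"], len(text), "characters")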

Key Features#

Deduplication

Uses a hardcoded set of key fields to prevent ingesting the same content multiple times.

Content Processing
  • Extracts text from JSON structure

  • Preserves metadata for source attribution

  • Handles nested JSON structures

Chunking Strategy
  • Respects document boundaries

  • Maintains context between related sections

  • Configurable chunk sizes for different content types

Custom JSON Structures#

For different JSON document formats, modify the configuration:

json_data:
  json_root_key: "documents" # Change root key for your JSON structure
  redmine_json_dir: "/path/to/your/json/files" # Any JSON documents

DPDD Ingestion#

The DPDD (Data Product Description Document) ingestion downloads and processes Euclid DPDD pages from the official website.

Basic Usage#

python python/euclid/rag/ingestion/ingest_dpdd.py --config path/to/app_config.yaml

Features#

The DPDD ingestion process:

  • Downloads DPDD pages from the Euclid website

  • Extracts subtopics and sections, skipping banned sections

  • Splits text into chunks using RecursiveCharacterTextSplitter

  • Stores chunks in a FAISS vector store for semantic search

  • Prevents duplicate ingestion by checking existing sources

Configuration Options#

The DPDD ingestion behavior is controlled by the dpdd_ingest_config.yaml file:

Selective Ingestion

Specify particular topics to ingest:

scrape_all: false
topics:
  - name: Purpose and Scope
    link: purpose.html
  - name: LE1 Data Products
    link: le1dpd/le1index.html

Complete Ingestion

Ingest all available content:

scrape_all: true
topics_number_limit: 0  # No limit

Limited Ingestion

Limit the number of topics:

scrape_all: true
topics_number_limit: 5  # Only first 5 topics

Ingestion Process Details#

Text Processing Pipeline#

  1. Document Download: Fetch content from configured URLs

  2. Content Extraction: Parse HTML and extract relevant text

  3. Section Filtering: Skip banned sections (headers, navigation, etc.)

  4. Text Chunking: Split long documents into manageable chunks

  5. Embedding Generation: Create vector embeddings for each chunk

  6. Vector Storage: Store embeddings in FAISS index

  7. Metadata Storage: Save document metadata for source attribution
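
Steps 5 through 7 can be pictured with LangChain's FAISS wrapper; this is a minimal sketch assuming the langchain_community and langchain_huggingface packages, an illustrative embedding model, and an assumed store directory name, not the pipeline's actual classes:

# Sketch of embedding generation and vector storage (pipeline steps 5-7).
# Model name, store directory, and metadata are illustrative assumptions.
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

chunks = ["Purpose and scope of the DPDD ...", "LE1 data products are ..."]
docs = [Document(page_content=text, metadata={"source": "purpose.html"}) for text in chunks]

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(docs, embeddings)
vector_store.save_local("public_data_vector_store")  # assumed directory name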

Chunk Size and Overlap#

The system uses RecursiveCharacterTextSplitter with optimized settings:

  • Chunk size: Balanced for context and performance

  • Chunk overlap: Ensures continuity between chunks

  • Separator handling: Respects document structure (paragraphs, sentences)
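
A minimal sketch of configuring the splitter with the chunk size and overlap from the example configuration above (the separator list is an assumption about how document structure is respected):

# Sketch: split text into overlapping chunks with RecursiveCharacterTextSplitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_document_text = "Purpose and scope of the DPDD. " * 200  # stand-in for a real page

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,     # characters per chunk, as in the example config
    chunk_overlap=100,  # overlap preserves continuity between adjacent chunks
    separators=["\n\n", "\n", ". ", " "],  # paragraphs, lines, sentences, words
)

chunks = splitter.split_text(long_document_text)
print(f"Created {len(chunks)} text chunks")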

Duplicate Prevention#

The ingestion process automatically:

  • Checks existing sources in the vector store

  • Skips already processed documents to avoid duplicates

  • Updates metadata for modified documents

  • Maintains consistency across ingestion runs
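
One way to picture the source check, assuming a LangChain FAISS store whose in-memory docstore exposes chunk metadata (the _dict attribute is an implementation detail, used here only for illustration):

# Sketch: keep only documents whose 'source' is not already in the store.
# Accessing docstore._dict is an implementation detail, shown for illustration only.
def filter_new_documents(vector_store, candidate_docs):
    existing_sources = {
        doc.metadata.get("source")
        for doc in vector_store.docstore._dict.values()
    }
    return [d for d in candidate_docs if d.metadata.get("source") not in existing_sources]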

Monitoring Ingestion Progress#

Command Line Output#

The ingestion scripts provide progress information:

Processing topic: Purpose and Scope
Extracting sections from: purpose.html
Skipping banned section: Header
Creating 15 text chunks
Storing embeddings in vector store
✓ Completed: Purpose and Scope (15 chunks)

Logging#

Detailed logs are available for troubleshooting:

# Enable verbose logging
python python/euclid/rag/ingestion/ingest_dpdd.py --config config.yaml --verbose

Storage Requirements#

Estimate disk space needed for your vector stores:

Small Dataset (< 100 documents)
  • Vector store: ~50-100 MB

  • Metadata: ~10-20 MB

Medium Dataset (100-1000 documents)
  • Vector store: ~500 MB - 1 GB

  • Metadata: ~50-100 MB

Large Dataset (> 1000 documents)
  • Vector store: > 1 GB

  • Metadata: > 100 MB

Note

Actual sizes depend on document length, embedding dimensions, and chunk sizes.
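
As a rough back-of-the-envelope example, assuming 384-dimensional float32 embeddings, 1,000 documents split into 20 chunks each occupy about 1,000 × 20 × 384 × 4 bytes ≈ 31 MB of raw vectors, before FAISS index structures and metadata are added.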

Batch Processing#

For large document collections, consider batch processing:

# Process publications first
python python/euclid/rag/ingestion/ingest_publications.py -c config.yaml

# Then process DPDD documents
python python/euclid/rag/ingestion/ingest_dpdd.py --config config.yaml

# Verify vector stores were created
ls -la *_vector_store/

Ingestion Validation#

After ingestion, verify the vector stores:

# Check vector store contents
import os

# Verify vector store files exist
vector_store_dirs = ["redmine_vector_store", "public_data_vector_store"]

for dir_name in vector_store_dirs:
    if os.path.exists(dir_name):
        files = os.listdir(dir_name)
        print(f"{dir_name}: {files}")
    else:
        print(f"⚠️  Missing: {dir_name}")

Performance Optimization#

For faster ingestion:

Parallel Processing

The ingestion scripts support concurrent processing where possible.

Network Optimization

Use a stable, fast internet connection for downloading documents.

Storage Optimization

Use SSD storage for better I/O performance during indexing.

Memory Management

Ensure sufficient RAM for large document processing.

Next Steps#

After successful ingestion, the vector stores are in place and the chatbot app can be launched. Common ingestion issues and solutions are covered in Troubleshooting.