Document Ingestion#
Before using the chatbot, you must ingest documents into the vector store. Ingestion downloads, processes, and indexes documents for semantic search.
Prerequisites#
Installation completed
Configuration file set up
Internet connection for downloading documents
Sufficient disk space for vector stores
Warning
If the vector store is missing, the app will fail to start with:
RuntimeError: Vector store missing. Please run ingestion before launching the app.
Publications Ingestion#
Ingest Euclid collaboration publications from the official BibTeX bibliography.
The EuclidBibIngestor is specifically designed to process publications from the Euclid collaboration’s official bibliography file, which contains peer-reviewed papers, conference proceedings, and preprints related to the Euclid mission.
Basic Usage#
python python/euclid/rag/ingestion/ingest_publications.py --config path/to/app_config.yaml
Configuration Requirements#
Ensure your app_config.yaml includes:
pdf_data:
  bibtex_url: "https://eceb.astro.uni-bonn.de/public/Euclid.bib"
  arxiv_pdf_base_url: "https://arxiv.org/pdf/"
  chunk_size: 800
  chunk_overlap: 100
Data Source#
The ingestion process downloads and parses the official Euclid collaboration BibTeX file, which includes:
Key project papers - Foundational Euclid mission publications
Data release papers - Documentation for specific data releases (e.g., Q1 Special Issue)
Technical papers - Instrument descriptions, data processing methods
Scientific results - Analysis and findings from Euclid observations
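For orientation, the download-and-parse step can be sketched with the bibtexparser package. This is a minimal sketch, not the ingestor's actual code; the eprint field is a common BibTeX convention for arXiv identifiers, assumed here rather than guaranteed by this bibliography.

# Minimal sketch: fetch and parse the Euclid BibTeX file.
# The "eprint" field is a common BibTeX convention for arXiv IDs,
# assumed here rather than guaranteed by this bibliography.
import requests
import bibtexparser

BIBTEX_URL = "https://eceb.astro.uni-bonn.de/public/Euclid.bib"

bib_text = requests.get(BIBTEX_URL, timeout=60).text
database = bibtexparser.loads(bib_text)

for entry in database.entries[:5]:
    title = entry.get("title", "untitled")
    arxiv_id = entry.get("eprint", "no arXiv id")
    print(f"{title} [{arxiv_id}]")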
Key Features#
- BibTeX Processing
Parses official Euclid collaboration bibliography
Extracts publication metadata (authors, titles, abstracts)
Downloads full-text PDFs from arXiv when available
- Content Extraction
Processes PDF content for full-text search
Maintains publication metadata for proper attribution
Handles various publication formats (journal articles, preprints, conference papers)
- Semantic Deduplication
Prevents ingestion of semantically similar content across publications
Particularly effective for reducing redundant introductory material
Uses similarity thresholds to identify and filter duplicate content sections (see the sketch after this list)
- Vector Store Integration
Chunks document content for semantic search
Configurable chunk sizes and overlap for optimal retrieval
Prevents duplicate ingestion of existing publications
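The semantic deduplication step can be pictured as a similarity filter over chunk embeddings. Below is a minimal sketch assuming an embed function that returns embedding vectors; the actual model and threshold used by the ingestor may differ.

# Minimal sketch of threshold-based semantic deduplication.
# `embed` and the threshold are assumptions, not the ingestor's
# actual model or setting.
import numpy as np

SIMILARITY_THRESHOLD = 0.95  # illustrative value

def deduplicate(chunks, embed):
    kept_texts, kept_vecs = [], []
    for chunk in chunks:
        vec = np.asarray(embed(chunk), dtype=float)
        vec /= np.linalg.norm(vec)  # unit-normalize so cosine = dot product
        if any(float(vec @ v) >= SIMILARITY_THRESHOLD for v in kept_vecs):
            continue  # near-duplicate of a chunk we already kept
        kept_texts.append(chunk)
        kept_vecs.append(vec)
    return kept_texts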
Custom Configuration#
To use a different configuration file:
python python/euclid/rag/ingestion/ingest_publications.py -c /path/to/custom_config.yaml
# Or using the long form
python python/euclid/rag/ingestion/ingest_publications.py --config /path/to/custom_config.yaml
JSON Document Ingestion#
Ingest documents from JSON files. While commonly used for Redmine wiki exports, this ingestion method works with any JSON document structure.
Basic Usage#
python python/euclid/rag/ingestion/ingest_json.py -c /path/to/custom_config.yaml
# Or using the long form
python python/euclid/rag/ingestion/ingest_json.py --config /path/to/custom_config.yaml
This processes JSON files from the directory specified by the json_data.redmine_json_dir configuration option.
Configuration Requirements#
Ensure your app_config.yaml includes:
json_data:
  redmine_json_dir: "/path/to/json/files"
  chunk_size: 800
  json_root_key: pages
JSON File Format#
The expected JSON structure (example using Redmine export format):
{
  "pages": [
    {
      "project_id": "project-name",
      "page_name": "Wiki Page Title",
      "content": "Page content text...",
      "updated_on": "2024-01-15T10:30:00Z",
      "url": "https://redmine.example.com/page",
      "metadata": {
        "author": "username",
        "version": 5
      }
    }
  ]
}
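How documents are collected under the configured json_root_key can be sketched as follows; the load_documents function here is illustrative, not the module's actual API.

# Illustrative sketch: gather documents from every JSON file in a
# directory, keyed by json_root_key ("pages" for Redmine exports).
import json
from pathlib import Path

def load_documents(json_dir, root_key="pages"):
    docs = []
    for path in sorted(Path(json_dir).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        docs.extend(data.get(root_key, []))
    return docs

pages = load_documents("/path/to/json/files", root_key="pages")
print(f"Loaded {len(pages)} documents")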
Key Features#
- Deduplication
Uses a hardcoded set of key fields to detect and skip content that has already been ingested
- Content Processing
Extracts text from JSON structure
Preserves metadata for source attribution
Handles nested JSON structures
- Chunking Strategy
Respects document boundaries
Maintains context between related sections
Configurable chunk sizes for different content types
Custom JSON Structures#
For different JSON document formats, modify the configuration:
json_data:
  json_root_key: "documents"  # Change root key for your JSON structure
  redmine_json_dir: "/path/to/your/json/files"  # Any JSON documents
DPDD Ingestion#
The DPDD (Data Product Description Document) ingestion downloads and processes Euclid DPDD pages from the official website.
Basic Usage#
python python/euclid/rag/ingestion/ingest_dpdd.py --config path/to/app_config.yaml
Features#
The DPDD ingestion process:
Downloads DPDD pages from the Euclid website
Extracts subtopics and sections, skipping banned sections
Splits text into chunks using RecursiveCharacterTextSplitter
Stores chunks in a FAISS vector store for semantic search
Prevents duplicate ingestion by checking existing sources
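The download, extraction, and filtering steps can be sketched with requests and BeautifulSoup. This is a hedged sketch: the heading tags and the banned-section names are assumptions, not the ingestor's actual selectors (though "Header" appears as a banned section in the sample output below).

# Hedged sketch of the download/extract/filter steps. Heading tags
# and banned-section names are assumptions, not the real selectors.
import requests
from bs4 import BeautifulSoup

BANNED_SECTIONS = {"Header", "Navigation"}  # illustrative

def extract_sections(url):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for heading in soup.find_all(["h2", "h3"]):
        title = heading.get_text(strip=True)
        if title in BANNED_SECTIONS:
            continue  # skip banned sections
        parts = []
        for sibling in heading.find_next_siblings():
            if sibling.name in ("h2", "h3"):
                break  # next section begins
            parts.append(sibling.get_text(" ", strip=True))
        sections[title] = "\n".join(parts)
    return sections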
Configuration Options#
The DPDD ingestion behavior is controlled by the dpdd_ingest_config.yaml file:
- Selective Ingestion
Specify particular topics to ingest:
scrape_all: false
topics:
  - name: Purpose and Scope
    link: purpose.html
  - name: LE1 Data Products
    link: le1dpd/le1index.html
- Complete Ingestion
Ingest all available content:
scrape_all: true
topics_number_limit: 0  # No limit
- Limited Ingestion
Limit the number of topics:
scrape_all: true
topics_number_limit: 5  # Only first 5 topics
Ingestion Process Details#
Text Processing Pipeline#
1. Document Download: Fetch content from configured URLs
2. Content Extraction: Parse HTML and extract relevant text
3. Section Filtering: Skip banned sections (headers, navigation, etc.)
4. Text Chunking: Split long documents into manageable chunks
5. Embedding Generation: Create vector embeddings for each chunk
6. Vector Storage: Store embeddings in FAISS index
7. Metadata Storage: Save document metadata for source attribution
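Steps 5-7 can be sketched with LangChain's FAISS wrapper. The embedding model name and the use of the langchain_huggingface package are assumptions about the stack, not the app's documented configuration; the store directory name matches the one checked in the validation section below.

# Hedged sketch of steps 5-7: embed chunks and persist them in a
# FAISS store with source metadata. Model name is illustrative.
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

chunks = ["Purpose and scope of the DPDD ...", "LE1 data products are ..."]
metadatas = [{"source": "purpose.html"}, {"source": "le1dpd/le1index.html"}]

store = FAISS.from_texts(chunks, embeddings, metadatas=metadatas)
store.save_local("public_data_vector_store")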
Chunk Size and Overlap#
The system uses RecursiveCharacterTextSplitter with optimized settings:
Chunk size: Balanced for context and performance
Chunk overlap: Ensures continuity between chunks
Separator handling: Respects document structure (paragraphs, sentences)
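The splitter can be exercised on its own; this sketch reuses the chunk_size and chunk_overlap values from the example configuration earlier, while the separator list is an illustrative assumption.

# Illustrative use of LangChain's RecursiveCharacterTextSplitter
# with the chunk settings from the example configuration above.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " "],  # paragraphs, lines, sentences, words
)

text = "First paragraph about data products.\n\nSecond paragraph ...\n\n" * 50
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks, first chunk {len(chunks[0])} characters")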
Duplicate Prevention#
The ingestion process automatically:
Checks existing sources in the vector store
Skips already processed documents to avoid duplicates
Updates metadata for modified documents
Maintains consistency across ingestion runs
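The source-checking logic amounts to comparing each candidate chunk's source against sources already present in the store. A minimal sketch, in which the dict shape and the "source" field name are assumptions:

# Hedged sketch of source-based duplicate prevention; the chunk
# dict shape and the "source" field name are assumptions.
def filter_new_chunks(chunks, existing_sources):
    return [c for c in chunks if c["metadata"]["source"] not in existing_sources]

existing_sources = {"purpose.html"}  # e.g. collected from stored metadata
candidates = [
    {"text": "Purpose and scope ...", "metadata": {"source": "purpose.html"}},
    {"text": "LE1 data products ...", "metadata": {"source": "le1dpd/le1index.html"}},
]
print(filter_new_chunks(candidates, existing_sources))  # only the LE1 chunk remains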
Monitoring Ingestion Progress#
Command Line Output#
The ingestion scripts provide progress information:
Processing topic: Purpose and Scope
Extracting sections from: purpose.html
Skipping banned section: Header
Creating 15 text chunks
Storing embeddings in vector store
✓ Completed: Purpose and Scope (15 chunks)
Logging#
Detailed logs are available for troubleshooting:
# Enable verbose logging
python python/euclid/rag/ingestion/ingest_dpdd.py --config config.yaml --verbose
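If the flag alone is not enough and you invoke the ingestion code from your own script, standard Python logging can be raised to DEBUG. This is a generic sketch; the scripts' internal logger names are not documented here, so it raises the root logger.

# Generic logging setup; the ingestion modules' logger names are
# not documented here, so this raises the root logger instead.
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)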
Storage Requirements#
Estimate disk space needed for your vector stores:
- Small Dataset (< 100 documents)
Vector store: ~50-100 MB
Metadata: ~10-20 MB
- Medium Dataset (100-1000 documents)
Vector store: ~500 MB - 1 GB
Metadata: ~50-100 MB
- Large Dataset (> 1000 documents)
Vector store: > 1 GB
Metadata: > 100 MB
Note
Actual sizes depend on document length, embedding dimensions, and chunk sizes.
Batch Processing#
For large document collections, consider batch processing:
# Process publications first
python python/euclid/rag/ingestion/ingest_publications.py -c config.yaml
# Then process DPDD documents
python python/euclid/rag/ingestion/ingest_dpdd.py --config config.yaml
# Verify vector stores were created
ls -la *_vector_store/
Ingestion Validation#
After ingestion, verify the vector stores:
# Check that the vector store directories exist and contain files
import os

# Directory names used by the ingestion scripts
vector_store_dirs = ["redmine_vector_store", "public_data_vector_store"]

for dir_name in vector_store_dirs:
    if os.path.exists(dir_name):
        files = os.listdir(dir_name)
        print(f"{dir_name}: {files}")
    else:
        print(f"⚠️ Missing: {dir_name}")
Performance Optimization#
For faster ingestion:
- Parallel Processing
The ingestion scripts support concurrent processing where possible.
- Network Optimization
Use a stable, fast internet connection for downloading documents.
- Storage Optimization
Use SSD storage for better I/O performance during indexing.
- Memory Management
Ensure sufficient RAM for large document processing.
Next Steps#
After successful ingestion:
Running the Chatbot - Run the chatbot interface
Troubleshooting - Common ingestion issues and their solutions
Verify that documents are searchable through the chatbot interface