[Data Ingestion and Storage] Design Ingestion API for External Application Integration
Goal
Define a unified ingestion API to simplify integration with external applications across multiple data sources.
Current Ingestion Workflow Overview

1. Register an artifact in the Governance API (see the first sketch after this list)
   - Requires:
     - name: name of the artifact (usually the file or dataset name)
     - catalogue_id: ID of the dataset or catalogue to associate the artifact with
   - Returns: artifact_id
2. Based on the ingestion type, proceed differently:
   A. Bulk Upload (see the bulk-upload sketch after this list)
      - Requires:
        - file
        - name
        - catalogue_id
      - Flow:
        - Get a signed upload URL from the StorageAPI
        - Upload the file to MinIO using the signed URL
        - Trigger the processing pipeline
   B. Source-Based Ingestion (see the trigger sketch after this list)
      - Requires:
        - catalogue_id
        - ingestion_type: "db" or "external_storage"
        - ingestion_config
      - Flow:
        - No file upload occurs
        - The connection parameters in ingestion_config are passed to the pipeline
        - The backend triggers the ingestion pipeline with the artifact metadata and source config
3. Trigger the pipeline (common to both flows; sketched below)
   - Requires:
     - artifact_id
     - ingestion_type
     - ingestion_config (for DB and external storage)
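A minimal sketch of step 1 as a client-side call, assuming the Governance API exposes a POST route for artifact registration; the base URL, path, and response field name are illustrative, not the actual contract:

```python
import requests

GOVERNANCE_API = "https://governance.example.internal"  # assumed base URL


def register_artifact(name: str, catalogue_id: str) -> str:
    """Register an artifact and return its artifact_id (route and payload assumed)."""
    resp = requests.post(
        f"{GOVERNANCE_API}/artifacts",  # hypothetical registration route
        json={"name": name, "catalogue_id": catalogue_id},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["artifact_id"]
```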
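Flow A under the same caveats: the StorageAPI signed-URL route and the pipeline trigger route are hypothetical, and the only firm assumption is that MinIO accepts the upload as a plain HTTP PUT against the signed URL:

```python
import requests

STORAGE_API = "https://storage.example.internal"    # assumed base URL
PIPELINE_API = "https://pipeline.example.internal"  # assumed base URL


def ingest_bulk(artifact_id: str, file_path: str) -> None:
    """Flow A: signed-URL upload to MinIO, then a pipeline trigger."""
    # 1. Get a pre-signed upload URL from the StorageAPI (hypothetical route).
    resp = requests.post(
        f"{STORAGE_API}/signed-upload-url",
        json={"artifact_id": artifact_id},
        timeout=10,
    )
    resp.raise_for_status()
    signed_url = resp.json()["url"]

    # 2. Upload the file bytes directly to MinIO via the signed URL.
    with open(file_path, "rb") as f:
        requests.put(signed_url, data=f, timeout=300).raise_for_status()

    # 3. Trigger the processing pipeline; "bulk" as a type value is an assumption.
    requests.post(
        f"{PIPELINE_API}/trigger",
        json={"artifact_id": artifact_id, "ingestion_type": "bulk"},
        timeout=10,
    ).raise_for_status()
```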
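Flow B collapses into the common trigger call of step 3: no bytes pass through the backend, only artifact metadata and the source config. A sketch, with the trigger route again hypothetical:

```python
import requests

PIPELINE_API = "https://pipeline.example.internal"  # assumed base URL


def ingest_from_source(artifact_id: str, ingestion_type: str, ingestion_config: dict) -> None:
    """Flow B / step 3: pass the source connection config straight to the pipeline."""
    if ingestion_type not in ("db", "external_storage"):
        raise ValueError(f"unsupported ingestion_type: {ingestion_type}")
    requests.post(
        f"{PIPELINE_API}/trigger",  # hypothetical trigger route
        json={
            "artifact_id": artifact_id,
            "ingestion_type": ingestion_type,
            # e.g. connection string, credentials reference, table or bucket name
            "ingestion_config": ingestion_config,
        },
        timeout=10,
    ).raise_for_status()
```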
Proposal
Implement two dedicated API paths: one for bulk file uploads and another for ingestion from external sources (e.g., databases or object storage).
The two ingestion methods have different input requirements and behaviours. Splitting them into distinct API paths improves clarity and simplifies validation.
Example:
/ingest/bulk
/ingest/source
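A minimal sketch of how the split could look server-side. FastAPI is an arbitrary choice here, and the request model and handler bodies are placeholders; the point is that each path declares and validates only its own inputs:

```python
from fastapi import FastAPI, File, Form, UploadFile
from pydantic import BaseModel

app = FastAPI()


class SourceIngestRequest(BaseModel):
    catalogue_id: str
    ingestion_type: str     # "db" or "external_storage"
    ingestion_config: dict  # connection parameters for the source


@app.post("/ingest/bulk")
async def ingest_bulk(
    file: UploadFile = File(...),
    name: str = Form(...),
    catalogue_id: str = Form(...),
):
    # Register the artifact, upload via signed URL, then trigger the pipeline.
    return {"status": "accepted"}


@app.post("/ingest/source")
async def ingest_source(req: SourceIngestRequest):
    # No file involved: register the artifact and trigger the pipeline
    # with the source connection config.
    return {"status": "accepted"}
```

Because /ingest/bulk is multipart form data and /ingest/source is plain JSON, the framework rejects malformed requests per path instead of one shared handler branching on ingestion_type.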
/cc @jpla