[Data Ingestion and Storage] Design Ingestion API for External Application Integration
Goal
Define a unified ingestion API to simplify integration with external applications across multiple data sources.
Current Ingestion Workflow Overview

1. Register an artifact in the Governance API (see the first sketch after this list)
   - Requires:
     - name: name of the artifact (usually the file or dataset name)
     - catalogue_id: ID of the dataset or catalogue to associate the artifact with
   - Returns: artifact_id
2. Based on the ingestion type, proceed differently:
   A. Bulk Upload (see the bulk-upload sketch after this list)
      - Requires:
        - file
        - name
        - catalogue_id
      - Flow:
        - Get a signed upload URL from the StorageAPI
        - Upload the file to MinIO using the signed URL
        - Trigger the processing pipeline
   B. Source-Based Ingestion (see the trigger sketch after this list)
      - Requires:
        - catalogue_id
        - ingestion_type: "db" or "external_storage"
        - ingestion_config
      - Flow:
        - No file upload occurs
        - The connection parameters in ingestion_config are passed to the pipeline
        - The backend triggers the ingestion pipeline with the artifact metadata and source config
3. Trigger the pipeline (common to both flows; sketched below)
   - Requires:
     - artifact_id
     - ingestion_type
     - ingestion_config (for DB and external storage)
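A minimal sketch of step 1 as a client-side call, assuming the Governance API exposes a POST route for artifact registration; the base URL, path, and response field name are illustrative, not the actual contract:

```python
import requests

GOVERNANCE_API = "https://governance.example.internal"  # assumed base URL


def register_artifact(name: str, catalogue_id: str) -> str:
    """Register an artifact and return its artifact_id (route and payload assumed)."""
    resp = requests.post(
        f"{GOVERNANCE_API}/artifacts",  # hypothetical registration route
        json={"name": name, "catalogue_id": catalogue_id},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["artifact_id"]
```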
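Flow A under the same caveats: the StorageAPI signed-URL route and the pipeline trigger route are hypothetical, and the only firm assumption is that MinIO accepts the upload as a plain HTTP PUT against the signed URL:

```python
import requests

STORAGE_API = "https://storage.example.internal"    # assumed base URL
PIPELINE_API = "https://pipeline.example.internal"  # assumed base URL


def ingest_bulk(artifact_id: str, file_path: str) -> None:
    """Flow A: signed-URL upload to MinIO, then a pipeline trigger."""
    # 1. Get a pre-signed upload URL from the StorageAPI (hypothetical route).
    resp = requests.post(
        f"{STORAGE_API}/signed-upload-url",
        json={"artifact_id": artifact_id},
        timeout=10,
    )
    resp.raise_for_status()
    signed_url = resp.json()["url"]

    # 2. Upload the file bytes directly to MinIO via the signed URL.
    with open(file_path, "rb") as f:
        requests.put(signed_url, data=f, timeout=300).raise_for_status()

    # 3. Trigger the processing pipeline; "bulk" as a type value is an assumption.
    requests.post(
        f"{PIPELINE_API}/trigger",
        json={"artifact_id": artifact_id, "ingestion_type": "bulk"},
        timeout=10,
    ).raise_for_status()
```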
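Flow B collapses into the common trigger call of step 3: no bytes pass through the backend, only artifact metadata and the source config. A sketch, with the trigger route again hypothetical:

```python
import requests

PIPELINE_API = "https://pipeline.example.internal"  # assumed base URL


def ingest_from_source(artifact_id: str, ingestion_type: str, ingestion_config: dict) -> None:
    """Flow B / step 3: pass the source connection config straight to the pipeline."""
    if ingestion_type not in ("db", "external_storage"):
        raise ValueError(f"unsupported ingestion_type: {ingestion_type}")
    requests.post(
        f"{PIPELINE_API}/trigger",  # hypothetical trigger route
        json={
            "artifact_id": artifact_id,
            "ingestion_type": ingestion_type,
            # e.g. connection string, credentials reference, table or bucket name
            "ingestion_config": ingestion_config,
        },
        timeout=10,
    ).raise_for_status()
```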
Proposal
Implement two dedicated API paths: one for bulk file uploads and another for ingestion from external sources (e.g., databases or object storage).
The two ingestion methods have different input requirements and behaviours. Splitting them into distinct API paths improves clarity and simplifies validation.
Example:
/ingest/bulk
/ingest/source
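A minimal sketch of how the split could look server-side. FastAPI is an arbitrary choice here, and the request model and handler bodies are placeholders; the point is that each path declares and validates only its own inputs:

```python
from fastapi import FastAPI, File, Form, UploadFile
from pydantic import BaseModel

app = FastAPI()


class SourceIngestRequest(BaseModel):
    catalogue_id: str
    ingestion_type: str     # "db" or "external_storage"
    ingestion_config: dict  # connection parameters for the source


@app.post("/ingest/bulk")
async def ingest_bulk(
    file: UploadFile = File(...),
    name: str = Form(...),
    catalogue_id: str = Form(...),
):
    # Register the artifact, upload via signed URL, then trigger the pipeline.
    return {"status": "accepted"}


@app.post("/ingest/source")
async def ingest_source(req: SourceIngestRequest):
    # No file involved: register the artifact and trigger the pipeline
    # with the source connection config.
    return {"status": "accepted"}
```

Because /ingest/bulk is multipart form data and /ingest/source is plain JSON, the framework rejects malformed requests per path instead of one shared handler branching on ingestion_type.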
/cc @jpla