
[Quality Evaluator] Improvement - Object Storage Data profiling new endpoint

Goals:

  • Implement a new endpoint that accepts the Object Storage (MinIO in this case) URL of the file (a Parquet file) as a parameter pointing to the data to be profiled (see the sketch after this list).
  • Update the KPI computation, choosing the most efficient method for computing KPIs on Parquet files. For the UDRG computation, the suggestion here is to use the UDRG as a Python library and avoid REST API calls.
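
A minimal sketch of what the new endpoint could look like, assuming FastAPI, Pydantic, and pyarrow; the route name, the payload attribute, the URL scheme, and the placeholder profiling logic are all assumptions, not the final design:

```python
# Sketch only: route name, payload attribute, and the placeholder
# profiling logic are assumptions, not the final design.
import pyarrow.parquet as pq
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ProfileRequest(BaseModel):
    # MinIO URL of the Parquet file to profile, e.g. "s3://bucket/data.parquet"
    # (attribute name and URL scheme are assumptions).
    object_url: str

@app.post("/profile")
def profile(request: ProfileRequest) -> dict:
    # pyarrow can read s3:// URIs directly; for MinIO, the endpoint and
    # credentials would come from the deployment's environment configuration.
    table = pq.read_table(request.object_url)
    # Placeholder profiling: per-column row and null counts, standing in
    # for the real KPI computation.
    return {
        field.name: {"rows": table.num_rows, "nulls": table[field.name].null_count}
        for field in table.schema
    }
```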

Note on the Parquet format: Parquet files optimize storage and processing in Spark through columnar storage and compression, which yield faster query performance.
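
To illustrate the efficiency point: a columnar read fetches and decodes only the requested columns, instead of parsing every row as a CSV reader must. The file path and column names below are hypothetical:

```python
import pyarrow.parquet as pq

# Only the two listed columns are read from the file; the rest is
# skipped entirely thanks to Parquet's columnar layout.
table = pq.read_table("data/sample.parquet", columns=["price", "quantity"])
print(table.num_rows, table.column_names)
```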

Proposal (draft): Datamite_Discovery_Connectors_12_

Considerations:

  • The returned KPIs (1, 2) keep the same JSON structure, but now cover several columns instead of a single one. The response model must be revised accordingly.
  • A new payload must be designed around the URL of the Datamite internal Object Storage (MinIO). In this issue you can find an example: payload example and the full attribute description. A hedged sketch follows this list.
  • Recommendation: use the UDRG service as a library rather than through its REST API (@imurua is up to date on this).
  • The Quality Evaluator should support the following types of files:
  1. Parquet: generated when working with DB sources.
  2. CSV: the current implementation for manual bulk ingestion.
  3. Other formats: pending (still open).
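
As a starting point, a hedged sketch of the payload model, the file-type dispatch, and the per-column KPI structure; all attribute and KPI names are placeholders to be replaced by the payload example referenced above:

```python
import pandas as pd
from pydantic import BaseModel

class ProfilePayload(BaseModel):
    # Placeholder attributes; the real schema is in the payload example
    # referenced in this issue.
    object_url: str               # MinIO URL of the file to profile
    file_format: str = "parquet"  # "parquet" | "csv"; other formats pending

def load_dataframe(payload: ProfilePayload) -> pd.DataFrame:
    # Dispatch on file type; pandas can read s3:// URLs through s3fs once
    # the MinIO endpoint and credentials are configured in the environment.
    readers = {"parquet": pd.read_parquet, "csv": pd.read_csv}
    if payload.file_format not in readers:
        raise ValueError(f"Unsupported file format: {payload.file_format}")
    return readers[payload.file_format](payload.object_url)

def compute_kpis(df: pd.DataFrame) -> dict:
    # Same JSON structure as today, but keyed per column instead of a
    # single column; "completeness" is a placeholder KPI name.
    return {col: {"completeness": float(df[col].notna().mean())} for col in df.columns}
```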

We propose addressing this in two iterations:

1st iteration (as soon as possible):

  • Start testing with Parquet files (and CSV) to adapt the core algorithm.

2nd iteration:

  • Use the UDRG as a library instead of REST API calls (a hedged sketch follows).
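
A hedged sketch of the intended change; the service URL, the udrg module, and its compute entry point are all hypothetical names that must be confirmed (e.g. with @imurua) before implementation:

```python
import requests

# Current approach: one REST round-trip per KPI computation.
def compute_via_rest(payload: dict) -> dict:
    # The URL is a placeholder for the deployed UDRG service.
    response = requests.post("http://udrg-service/api/compute", json=payload)
    response.raise_for_status()
    return response.json()

# 2nd iteration: the same computation as an in-process library call,
# removing HTTP serialization and network overhead from every request.
def compute_via_library(payload: dict) -> dict:
    from udrg import compute  # hypothetical import; confirm the real API
    return compute(payload)
```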

/cc @jpla, @imurua, @marijotecnalia
