[Quality Evaluator] Improvement - Object Storage Data profiling new endpoint
Goals:
- Implement a new endpoint that accepts the Object Storage (MinIO in this case) URL of the file to be profiled (a Parquet file) as a parameter.
- Update the KPI computation, considering the most efficient method for computing KPIs when working with Parquet files. For UDRG computation, the suggestion is to use UDRG as a Python library and avoid REST API calls.
Note on the Parquet format: Parquet files optimize storage and processing efficiency in Spark through columnar storage, compression, and faster query performance.
Considerations:
- The returned KPIs (1, 2) keep the same JSON structure but now cover several columns instead of a single one. The response model must be revised accordingly.
- A new payload must be designed carrying the URL of the Datamite internal Object Storage (MinIO). In this issue you can find an example: payload example and the full attribute description
- Recommendation: use the UDRG service as a library rather than through its REST API (@imurua is up to date on that)
- Quality Evaluator should be able to support the following types of files:
- Parquet file: generated when working with DB sources.
- CSV: the format of the current manual bulk ingestion implementation.
- Other formats: pending (still open).
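To make the two schema changes above concrete, a sketch of the new request payload (MinIO URL plus file format) and the revised multi-column KPI response, using stdlib dataclasses so it stays framework-agnostic. All field names here are hypothetical placeholders, not the agreed schema; the actual payload example is linked in this issue.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ProfileRequest:
    """Hypothetical request payload: where the file lives and what it is."""
    object_url: str              # e.g. "s3://<bucket>/<key>.parquet" (illustrative)
    file_format: str = "parquet"  # "parquet" | "csv"


@dataclass
class ProfileResponse:
    """Revised response: KPIs keyed per column instead of a single column."""
    kpis: dict  # column name -> {kpi name -> value}


resp = ProfileResponse(
    kpis={
        "age": {"completeness": 1.0},
        "income": {"completeness": 0.97},
    }
)
body = json.dumps(asdict(resp))
```

The JSON structure of each per-column entry stays as it is today; only the top level gains one key per profiled column.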
We propose addressing this in two iterations:
1st iteration (as soon as possible):
- Start testing with Parquet files (and CSV) to rework the core algorithm
2nd iteration:
- Use of the UDRG as a library instead of REST API calls.
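The second iteration swaps a per-request HTTP round-trip for an in-process call. The sketch below only illustrates that call shape: `compute_udrg` is a local stand-in defined here, since the real UDRG library's import path and signature are not documented in this issue and may differ.

```python
def compute_udrg(values):
    """Local stand-in for an in-process UDRG computation (illustrative only).

    Replaces the current pattern of POSTing each column to the UDRG REST
    endpoint and parsing the JSON response, removing per-call network and
    serialization overhead.
    """
    non_null = [v for v in values if v is not None]
    return len(non_null) / len(values) if values else 0.0


# In-process call: no HTTP round-trip per column.
score = compute_udrg([1, 2, None, 4])
```

With the library in-process, the per-column KPI loop from the first iteration can call UDRG directly on each column's values.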
/cc @jpla, @imurua, @marijotecnalia
Edited by Antoni Gimeno