
[Quality Evaluator] Improvement - Object Storage Data profiling new endpoint

Goals:

  • Implement a new endpoint that accepts the Object Storage (MinIO in this case) URL of the file (a Parquet file) as a parameter pointing to the data to be profiled (see the sketch after this list).
  • Update the KPI computation, choosing the most efficient method for computing KPIs on Parquet files. For the UDRG computation, the suggestion here is to use the UDRG as a Python library and avoid REST API calls.
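
A minimal sketch of what the new endpoint could look like, assuming FastAPI, Pydantic, and pyarrow; the route name, the payload attribute, the URL scheme, and the placeholder profiling logic are all assumptions, not the final design:

```python
# Sketch only: route name, payload attribute, and the placeholder
# profiling logic are assumptions, not the final design.
import pyarrow.parquet as pq
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ProfileRequest(BaseModel):
    # MinIO URL of the Parquet file to profile, e.g. "s3://bucket/data.parquet"
    # (attribute name and URL scheme are assumptions).
    object_url: str

@app.post("/profile")
def profile(request: ProfileRequest) -> dict:
    # pyarrow can read s3:// URIs directly; for MinIO, the endpoint and
    # credentials would come from the deployment's environment configuration.
    table = pq.read_table(request.object_url)
    # Placeholder profiling: per-column row and null counts, standing in
    # for the real KPI computation.
    return {
        field.name: {"rows": table.num_rows, "nulls": table[field.name].null_count}
        for field in table.schema
    }
```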

Note on the Parquet format: Parquet files optimize storage and processing in Spark through columnar storage and compression, which yield faster query performance.
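
To illustrate the efficiency point: a columnar read fetches and decodes only the requested columns, instead of parsing every row as a CSV reader must. The file path and column names below are hypothetical:

```python
import pyarrow.parquet as pq

# Only the two listed columns are read from the file; the rest is
# skipped entirely thanks to Parquet's columnar layout.
table = pq.read_table("data/sample.parquet", columns=["price", "quantity"])
print(table.num_rows, table.column_names)
```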

Proposal (draft): Datamite_Discovery_Connectors_12_

Considerations:

  • The returned KPIs (1, 2) keep the same JSON structure, but now cover several columns instead of a single one. The response model must be revised accordingly.
  • A new payload must be designed around the URL of the Datamite internal Object Storage (MinIO). In this issue you can find an example: payload example and the full attribute description. A hedged sketch follows this list.
  • Recommendation: use the UDRG service as a library rather than through its REST API (@imurua is up to date on this).
  • The Quality Evaluator should support the following types of files:
  1. Parquet: generated when working with DB sources.
  2. CSV: the current implementation for manual bulk ingestion.
  3. Other formats: pending (still open).
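
As a starting point, a hedged sketch of the payload model, the file-type dispatch, and the per-column KPI structure; all attribute and KPI names are placeholders to be replaced by the payload example referenced above:

```python
import pandas as pd
from pydantic import BaseModel

class ProfilePayload(BaseModel):
    # Placeholder attributes; the real schema is in the payload example
    # referenced in this issue.
    object_url: str               # MinIO URL of the file to profile
    file_format: str = "parquet"  # "parquet" | "csv"; other formats pending

def load_dataframe(payload: ProfilePayload) -> pd.DataFrame:
    # Dispatch on file type; pandas can read s3:// URLs through s3fs once
    # the MinIO endpoint and credentials are configured in the environment.
    readers = {"parquet": pd.read_parquet, "csv": pd.read_csv}
    if payload.file_format not in readers:
        raise ValueError(f"Unsupported file format: {payload.file_format}")
    return readers[payload.file_format](payload.object_url)

def compute_kpis(df: pd.DataFrame) -> dict:
    # Same JSON structure as today, but keyed per column instead of a
    # single column; "completeness" is a placeholder KPI name.
    return {col: {"completeness": float(df[col].notna().mean())} for col in df.columns}
```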

We propose addressing this in two iterations:

1st iteration (as soon as possible):

  • Start testing with Parquet files (and CSV) to adapt the core algorithm.

2nd iteration:

  • Use the UDRG as a library instead of REST API calls (a hedged sketch follows).
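
A hedged sketch of the intended change; the service URL, the udrg module, and its compute entry point are all hypothetical names that must be confirmed (e.g. with @imurua) before implementation:

```python
import requests

# Current approach: one REST round-trip per KPI computation.
def compute_via_rest(payload: dict) -> dict:
    # The URL is a placeholder for the deployed UDRG service.
    response = requests.post("http://udrg-service/api/compute", json=payload)
    response.raise_for_status()
    return response.json()

# 2nd iteration: the same computation as an in-process library call,
# removing HTTP serialization and network overhead from every request.
def compute_via_library(payload: dict) -> dict:
    from udrg import compute  # hypothetical import; confirm the real API
    return compute(payload)
```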

/cc @jpla, @imurua, @marijotecnalia
