Quality Evaluator - Implement a better way to pass job results to the Quality Evaluator middleware and other job containers
Refactor the current file-based mechanism used to pass job results to the Quality Evaluator (QE) middleware. The file-based approach is neither reliable nor scalable in Kubernetes (k8s) clusters because of storage and coordination limitations.
Proposal
Use a temporary bucket in MinIO to store and pass job results between containers. This enables a more reliable, scalable, and decoupled architecture. Temporary results can be deleted once they have been read, and MinIO also supports bucket lifecycle policies that automatically delete objects after a specified time.
Challenges to Solve
Result File Naming and Identification
- A consistent and unique file naming convention (object key structure) is required to prevent name collisions and allow the QE service to unambiguously locate the results of a specific Spark job.
- The key should include job-specific metadata (e.g., dataset ID, job ID, timestamp) to ensure traceability.
The QE payload already contains the "dataset_id" field.
Example, based on the payload received by the QE:
Payload:
{
  "object_storage_url": "https://minio-datamite-ds.iti.es/default/c11fa555-3a0c-4bb5-bd50-9582b2dbd612/E-REDES-ConnectionsElectricMobility-DS%20.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20250718T104210Z&X-Amz-SignedHeaders=host&X-Amz-Credential=minioadmin%2F20250718%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Expires=600&X-Amz-Signature=f7fc40cd57ee31dbc88a80ca4b699610ab3b5f9479d8e510d4784bd2629a4537",
  "dataset_id": "83cc2b24-a1b2-40b0-9190-aa88c2f57721",
  "column_types": {
    "Year": "numeric",
    "Semester": "numeric",
    "CodConcelho": "numeric",
    "Municipality": "text",
    "Executed Network Connection Requests": "numeric"
  },
  "csv_options": {
    "field_delimiter": ";",
    "decimal_delimiter": "."
  },
  "charset": "UTF_8",
  "mime_type": "text/csv"
}
Suggested name:
Given:
- dataset_id: 83cc2b24-a1b2-40b0-9190-aa88c2f57721
- original_filename: E-REDES-ConnectionsElectricMobility-DS .csv (extracted from the object_storage_url)
- timestamp: 20250722T112034Z (format YYYYMMDDThhmmssZ)
- uuid: short hash or UUID v4 suffix, like 3f21a8d1
Resulting key (the space before the file extension in the original filename is removed):
83cc2b24-a1b2-40b0-9190-aa88c2f57721/20250722T112034Z_3f21a8d1_E-REDES-ConnectionsElectricMobility-DS.csv
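As an illustration of the naming scheme above, the key could be assembled roughly as follows. This is a minimal sketch, not the QE service's actual code: the function names (original_filename, sanitize_filename, build_result_key), the 8-character UUID suffix, and the whitespace handling are assumptions chosen to match the example key.

```python
import re
import uuid
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import unquote, urlsplit


def original_filename(object_storage_url: str) -> str:
    """Return the URL-decoded last path segment of the (presigned) URL."""
    path = urlsplit(object_storage_url).path  # drops the query string
    return unquote(PurePosixPath(path).name)


def sanitize_filename(name: str) -> str:
    """Assumed rule: trim the stem and replace inner whitespace with '_'."""
    p = PurePosixPath(name)
    stem = re.sub(r"\s+", "_", p.stem.strip())
    return f"{stem}{p.suffix}"


def build_result_key(
    dataset_id: str,
    filename: str,
    now: Optional[datetime] = None,
    suffix: Optional[str] = None,
) -> str:
    """Build '<dataset_id>/<timestamp>_<uuid>_<filename>'."""
    now = now or datetime.now(timezone.utc)
    timestamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # Hypothetical choice: first 8 hex characters of a UUID v4.
    suffix = suffix or uuid.uuid4().hex[:8]
    return f"{dataset_id}/{timestamp}_{suffix}_{sanitize_filename(filename)}"
```

Calling build_result_key with the dataset_id and filename from the payload, a fixed timestamp of 2025-07-22 11:20:34 UTC, and the suffix 3f21a8d1 reproduces the key shown above. Grouping objects under the dataset_id prefix also makes it possible to list all pending results for one dataset.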
Result Lifecycle Management
- First Iteration Approach: As an initial implementation, the application that reads the result from the object storage bucket (likely the QE service) deletes the object immediately after reading it. This keeps lifecycle control simple and avoids the need for a separate cleanup process.
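The read-then-delete flow could look roughly like the sketch below. It is written against a small hypothetical interface (ObjectStore, ObjectResponse) rather than a concrete client; the method names get_object, remove_object, read, close, and release_conn mirror those of the MinIO Python SDK, so a Minio client instance should fit, but that wiring is an assumption and is not shown here.

```python
from typing import Protocol


class ObjectResponse(Protocol):
    """Minimal response interface (hypothetical, modeled on the SDK response)."""

    def read(self) -> bytes: ...
    def close(self) -> None: ...
    def release_conn(self) -> None: ...


class ObjectStore(Protocol):
    """Minimal storage interface (hypothetical, modeled on the MinIO client)."""

    def get_object(self, bucket_name: str, object_name: str) -> ObjectResponse: ...
    def remove_object(self, bucket_name: str, object_name: str) -> None: ...


def read_and_delete(store: ObjectStore, bucket: str, key: str) -> bytes:
    """Read a result object fully, then delete it from the bucket."""
    response = store.get_object(bucket, key)
    try:
        data = response.read()
    finally:
        # Always release the connection, even if the read fails.
        response.close()
        response.release_conn()
    # Reached only after a successful read, so a failed read leaves the
    # object in place for a retry.
    store.remove_object(bucket, key)
    return data
```

One consequence of this design is that if the consumer crashes after reading but before processing, the result is gone; deleting only after processing succeeds would be a stricter variant.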
NOTE: An automatic system might be implemented to delete or archive old result files based on configurable retention policies (e.g., age, job status, last access time).
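For the age-based part of such a retention policy, a bucket lifecycle rule is one option. The sketch below only builds a rule document in the S3 lifecycle configuration XML format; the function name lifecycle_xml, the rule ID, and the default values are assumptions, and how the rule is applied to the bucket (client SDK, mc, or deployment tooling) is left open. Expiration is expressed in whole days, so this acts as a safety net rather than a replacement for deleting results right after they are read. Retention based on job status or last access time would need separate logic.

```python
import xml.etree.ElementTree as ET


def lifecycle_xml(
    prefix: str = "",
    expire_days: int = 1,
    rule_id: str = "expire-temporary-results",  # hypothetical rule name
) -> str:
    """Build an S3-style lifecycle document that expires objects by age."""
    if expire_days < 1:
        raise ValueError("expire_days must be a positive number of days")

    root = ET.Element("LifecycleConfiguration")
    rule = ET.SubElement(root, "Rule")
    ET.SubElement(rule, "ID").text = rule_id
    ET.SubElement(rule, "Status").text = "Enabled"

    # An empty prefix applies the rule to every object in the bucket.
    rule_filter = ET.SubElement(rule, "Filter")
    ET.SubElement(rule_filter, "Prefix").text = prefix

    expiration = ET.SubElement(rule, "Expiration")
    ET.SubElement(expiration, "Days").text = str(expire_days)

    return ET.tostring(root, encoding="unicode")
```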