
Job Merging Optimization Proposal

Current Architecture Analysis

The current two-job approach has the following flow (see the sketch after the list):

  • Job 1 (KPI Library): Downloads dataset from MinIO → processes KPIs → saves to filesystem
  • Job 2 (UDRG): Re-reads same dataset from filesystem → processes user-defined rules → saves results
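
For reference, a minimal PySpark sketch of this two-application flow is shown below. The application names, paths, and bucket names are hypothetical, and the actual KPI/UDRG processing is elided as comments; only the I/O pattern matters here.

```python
from pyspark.sql import SparkSession

# --- Job 1 (KPI Library): its own Spark application ---
spark = SparkSession.builder.appName("kpi-library").getOrCreate()
dataset = spark.read.parquet("s3a://datasets/input/")           # download from MinIO via s3a
# ... KPI library processing on `dataset`, results saved to the filesystem ...
dataset.write.mode("overwrite").parquet("/shared/fs/dataset/")  # re-materialised for Job 2
spark.stop()                                                    # driver and executors torn down

# --- Job 2 (UDRG): a second Spark application, full cold start ---
spark = SparkSession.builder.appName("udrg").getOrCreate()
dataset = spark.read.parquet("/shared/fs/dataset/")             # same dataset read a second time
# ... user-defined rule processing on `dataset`, results saved ...
spark.stop()
```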

Observed Performance Issues

Based on execution monitoring:

  • Cold start overhead: Job 2 requires a full Spark initialization
  • Redundant I/O: the dataset is read twice (MinIO → filesystem, then filesystem → memory)
  • Resource inefficiency: Job 1 completes and releases its resources, then Job 2 requests the same resources again

Proposed Optimization: Job Merging

Merge the lightweight UDRG processing into the main KPI job, eliminating the second job entirely. Implementation approach (see the sketch after this list):

  • Keep the dataset in memory after KPI processing
  • Call the UDRG function directly with the loaded data and computed KPIs
  • Generate a single consolidated result for MinIO storage
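
A minimal sketch of the merged job, assuming PySpark: `compute_kpis`, `apply_udrg`, and all paths/bucket names are hypothetical stand-ins for the real KPI library and UDRG entry points, not the actual APIs.

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

def compute_kpis(df: DataFrame) -> DataFrame:
    """Placeholder for the KPI library entry point."""
    return df.groupBy("entity_id").agg(F.avg("value").alias("kpi_avg"))

def apply_udrg(df: DataFrame, kpis: DataFrame) -> DataFrame:
    """Placeholder for the user-defined rule (UDRG) entry point."""
    return df.join(kpis, "entity_id").withColumn("rule_flag", F.col("value") > F.col("kpi_avg"))

spark = SparkSession.builder.appName("kpi-and-udrg").getOrCreate()

dataset = spark.read.parquet("s3a://datasets/input/")  # single read from MinIO
dataset.cache()                                        # keep the dataset in memory for both stages

kpis = compute_kpis(dataset)                           # stage 1: KPI library
results = apply_udrg(dataset, kpis)                    # stage 2: UDRG, called in-process

# One consolidated write back to MinIO; no intermediate filesystem hand-off.
results.write.mode("overwrite").parquet("s3a://results/consolidated/")

spark.stop()
```

Caching the input DataFrame is what removes the second read; since `DataFrame.cache()` defaults to a memory-and-disk storage level, the job degrades to disk spill rather than failing if the dataset does not fit in executor memory on a loaded cluster.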

Benefits for Object Storage Migration

  • Simplified storage pattern: Only final results need MinIO storage/retrieval
  • Reduced object operations: No intermediate file sharing between jobs
  • Faster execution: Eliminates cold start + dataset reload overhead
  • Better resource utilization: Single job uses cluster resources more efficiently

Considerations

  • Respects the rationale behind Tasos's original architectural decision
  • Compatible with high-load Spark cluster scenarios
  • Maintains the same functionality with improved performance
  • Can be implemented incrementally alongside the object storage changes

@agimenobono @jabefa
