Job Merging Optimization Proposal
Current Architecture Analysis
The current two-job approach has the following flow (sketched below):
- Job 1 (KPI Library): Downloads dataset from MinIO → processes KPIs → saves to filesystem
- Job 2 (UDRG): Re-reads same dataset from filesystem → processes user-defined rules → saves results
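For context, a rough PySpark sketch of this flow follows; the bucket names, filesystem paths, and entry points (`compute_kpis`, `apply_udrg_rules`) are illustrative assumptions, not the actual KPI-library or UDRG API.

```python
# Illustrative sketch of the current two-job flow; names and paths are placeholders.
from pyspark.sql import SparkSession

def compute_kpis(df):
    return df  # stand-in for the real KPI-library processing

def apply_udrg_rules(df, kpis_df):
    return df  # stand-in for the real user-defined rule processing

# --- Job 1 (KPI Library) ---
spark = SparkSession.builder.appName("kpi-job").getOrCreate()
dataset = spark.read.parquet("s3a://datasets/input/")          # download from MinIO
kpis = compute_kpis(dataset)
dataset.write.mode("overwrite").parquet("/shared/dataset/")    # intermediate copy on the filesystem
kpis.write.mode("overwrite").parquet("/shared/kpis/")
spark.stop()                                                   # Job 1 releases its resources

# --- Job 2 (UDRG) ---
spark = SparkSession.builder.appName("udrg-job").getOrCreate() # full Spark init again (cold start)
dataset = spark.read.parquet("/shared/dataset/")               # same dataset read a second time
kpis = spark.read.parquet("/shared/kpis/")
results = apply_udrg_rules(dataset, kpis)
results.write.mode("overwrite").parquet("/shared/results/")
spark.stop()
```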
Observed Performance Issues
Based on execution monitoring:
- Cold start overhead: Job 2 requires full Spark initialization
- Redundant I/O: The dataset is read twice (MinIO → filesystem, then filesystem → memory)
- Resource inefficiency: Job 1 completes and releases its resources, then Job 2 requests the same resources again
Proposed Optimization: Job Merging
Merge the lightweight UDRG processing into the main KPI job, eliminating the second job entirely. Implementation approach (a code sketch follows the list below):
- Keep dataset in memory after KPI processing
- Call UDRG function directly with loaded data and computed KPIs
- Generate single consolidated result for MinIO storage
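A rough sketch of the merged job, under the same assumptions as above (illustrative paths and hypothetical `compute_kpis` / `apply_udrg_rules` entry points):

```python
# Illustrative sketch of the merged single job; names and paths are placeholders.
from pyspark.sql import SparkSession

def compute_kpis(df):
    return df  # stand-in for the real KPI-library processing

def apply_udrg_rules(df, kpis_df):
    return df  # stand-in for the real user-defined rule processing

spark = SparkSession.builder.appName("kpi-udrg-job").getOrCreate()

# Single read from MinIO; cache so the UDRG step reuses the in-memory dataset
dataset = spark.read.parquet("s3a://datasets/input/").cache()

kpis = compute_kpis(dataset)

# UDRG runs in-process on the already-loaded data and computed KPIs:
# no intermediate filesystem copy and no second Spark cold start
results = apply_udrg_rules(dataset, kpis)

# Only the consolidated result goes back to object storage
results.write.mode("overwrite").parquet("s3a://results/consolidated/")
spark.stop()
```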
Benefits for Object Storage Migration
- Simplified storage pattern: Only final results need MinIO storage/retrieval (see the configuration sketch after this list)
- Reduced object operations: No intermediate file sharing between jobs
- Faster execution: Eliminates cold start + dataset reload overhead
- Better resource utilization: Single job uses cluster resources more efficiently
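As a sketch of the simplified storage pattern, the merged job only needs the standard Hadoop s3a connector settings to talk to MinIO; the endpoint and credential values below are placeholders, not the actual deployment configuration.

```python
# Typical s3a settings for pointing Spark at a MinIO endpoint; all values are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kpi-udrg-job")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.internal:9000")
    .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")  # MinIO is typically served path-style
    .getOrCreate()
)

# With this in place, the merged job reads the input and writes only the final,
# consolidated results via s3a:// URIs; no intermediate objects are created.
```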
Considerations
- Respects the rationale behind the original architectural decision from Tasos
- Compatible with high-load Spark cluster scenarios
- Maintains same functionality with improved performance
- Can be implemented incrementally alongside object storage changes