
Job Merging Optimization Proposal

Current Architecture Analysis

The current two-job approach has the following flow (see the sketch after the list):

  • Job 1 (KPI Library): Downloads dataset from MinIO → processes KPIs → saves to filesystem
  • Job 2 (UDRG): Re-reads same dataset from filesystem → processes user-defined rules → saves results
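
For reference, a minimal PySpark sketch of this two-application flow is shown below. The application names, paths, and bucket names are hypothetical, and the actual KPI/UDRG processing is elided as comments; only the I/O pattern matters here.

```python
from pyspark.sql import SparkSession

# --- Job 1 (KPI Library): its own Spark application ---
spark = SparkSession.builder.appName("kpi-library").getOrCreate()
dataset = spark.read.parquet("s3a://datasets/input/")           # download from MinIO via s3a
# ... KPI library processing on `dataset`, results saved to the filesystem ...
dataset.write.mode("overwrite").parquet("/shared/fs/dataset/")  # re-materialised for Job 2
spark.stop()                                                    # driver and executors torn down

# --- Job 2 (UDRG): a second Spark application, full cold start ---
spark = SparkSession.builder.appName("udrg").getOrCreate()
dataset = spark.read.parquet("/shared/fs/dataset/")             # same dataset read a second time
# ... user-defined rule processing on `dataset`, results saved ...
spark.stop()
```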

Observed Performance Issues

Based on execution monitoring:

  • Cold start overhead: Job 2 requires a full Spark initialization
  • Redundant I/O: the dataset is read twice (MinIO → filesystem, then filesystem → memory)
  • Resource inefficiency: Job 1 completes and releases its resources, then Job 2 requests the same resources again

Proposed Optimization: Job Merging

Merge the lightweight UDRG processing into the main KPI job, eliminating the second job entirely. Implementation approach (see the sketch after this list):

  • Keep the dataset in memory after KPI processing
  • Call the UDRG function directly with the loaded data and computed KPIs
  • Generate a single consolidated result for MinIO storage
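
A minimal sketch of the merged job, assuming PySpark: `compute_kpis`, `apply_udrg`, and all paths/bucket names are hypothetical stand-ins for the real KPI library and UDRG entry points, not the actual APIs.

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

def compute_kpis(df: DataFrame) -> DataFrame:
    """Placeholder for the KPI library entry point."""
    return df.groupBy("entity_id").agg(F.avg("value").alias("kpi_avg"))

def apply_udrg(df: DataFrame, kpis: DataFrame) -> DataFrame:
    """Placeholder for the user-defined rule (UDRG) entry point."""
    return df.join(kpis, "entity_id").withColumn("rule_flag", F.col("value") > F.col("kpi_avg"))

spark = SparkSession.builder.appName("kpi-and-udrg").getOrCreate()

dataset = spark.read.parquet("s3a://datasets/input/")  # single read from MinIO
dataset.cache()                                        # keep the dataset in memory for both stages

kpis = compute_kpis(dataset)                           # stage 1: KPI library
results = apply_udrg(dataset, kpis)                    # stage 2: UDRG, called in-process

# One consolidated write back to MinIO; no intermediate filesystem hand-off.
results.write.mode("overwrite").parquet("s3a://results/consolidated/")

spark.stop()
```

Caching the input DataFrame is what removes the second read; since `DataFrame.cache()` defaults to a memory-and-disk storage level, the job degrades to disk spill rather than failing if the dataset does not fit in executor memory on a loaded cluster.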

Benefits for Object Storage Migration

  • Simplified storage pattern: Only final results need MinIO storage/retrieval
  • Reduced object operations: No intermediate file sharing between jobs
  • Faster execution: Eliminates cold start + dataset reload overhead
  • Better resource utilization: Single job uses cluster resources more efficiently

Considerations

  • Respects the rationale behind Tasos's original architectural decision
  • Compatible with high-load Spark cluster scenarios
  • Maintains the same functionality with improved performance
  • Can be implemented incrementally alongside the object storage changes

@agimenobono @jabefa
