
K8s Workload Simulator

A modular and extensible simulator for evaluating diverse pod scheduling strategies in Kubernetes-like environments. Supports customizable clusters, workloads, and schedulers — including rule-based and learning-based approaches.


Installation

Requires Python 3.10+. The simulator can be installed as a regular Python package for external use:

pip install git+https://gitlab.eclipse.org/eclipse-research-labs/hyper-ai-project/k8s-workload-simulator.git

Alternatively, for local development run:

pip install -e .

For internal use, just clone the repository:

git clone https://gitlab.eclipse.org/eclipse-research-labs/hyper-ai-project/k8s-workload-simulator.git

The KWOK binary is optional and needed only for KWOK-based clusters. In that case:

  • Install KWOK
  • Ensure kubectl is configured

Standalone Simulation

To run a one-time simulation:

python3 scripts/simulation_controller.py configs/config.yaml

This uses the YAML configuration (configs/config.yaml) to:

  • Deploy a synthetic cluster (KWOK or Python)
  • Generate pod workloads based on task/pod distributions
  • Apply the selected scheduler (e.g., ROUNDROBIN, DEFAULT, or DAROTRAIN)
  • Save simulation traces and rewards (if enabled)

YAML Parameters for Simulation include:

🔹 Cluster Parameters

  • cluster_type, cluster_reset
  • cluster_nodes_cloud, cluster_nodes_edge, cluster_nodes_iot
  • cluster_node_cloud_cpu_dist, cluster_node_cloud_mem_dist, cluster_node_cloud_max_pods
  • cluster_node_edge_cpu_dist, cluster_node_edge_mem_dist, cluster_node_edge_max_pods
  • cluster_node_iot_cpu_dist, cluster_node_iot_mem_dist, cluster_node_iot_max_pods

🔹 Workload Parameters

  • workload_tasks
  • workload_pods_number_dist, workload_pods_cpu_dist, workload_pods_mem_dist
  • workload_pods_interarrival_dist, workload_pods_duration_dist, workload_pods_max_restarts

🔹 Scheduler Parameters

  • scheduler_type

🔹 Simulation Settings

  • simulation_speedup
  • simulation_save_trace, simulation_save_basic_stats, simulation_save_detail_stats

🔹 Training Parameters

  • training_episodes
  • training_cloud_nodes_per_episode_min, training_cloud_nodes_per_episode_max
  • training_edge_nodes_per_episode_min, training_edge_nodes_per_episode_max
  • training_iot_nodes_per_episode_min, training_iot_nodes_per_episode_max
  • training_tasks_per_episode_min, training_tasks_per_episode_max
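Put together, a minimal flattened configuration might look like the sketch below. The parameter names come from the lists above; the concrete values and distribution blocks are illustrative assumptions, not the project's defaults:

```yaml
# Illustrative sketch of configs/config.yaml -- values are assumptions
cluster_type: python            # or a KWOK-based cluster
cluster_reset: true
cluster_nodes_cloud: 2
cluster_nodes_edge: 4
cluster_nodes_iot: 8
cluster_node_cloud_cpu_dist: {type: fixed, value: 8000}   # millicores
cluster_node_cloud_mem_dist: {type: fixed, value: 16384}  # Mi
cluster_node_cloud_max_pods: 110

workload_tasks: 10
workload_pods_number_dist: {type: poisson, mean: 6, min: 2, max: 8, round: 0}
workload_pods_cpu_dist: {type: normal, mean: 500, stdev: 100, min: 100, max: 1000, round: 0}
workload_pods_mem_dist: {type: uniform, min: 128, max: 1024, round: 0}
workload_pods_interarrival_dist: {type: fixed, value: 1}  # seconds
workload_pods_duration_dist: {type: uniform, min: 30, max: 120, round: 0}
workload_pods_max_restarts: 3

scheduler_type: ROUNDROBIN

simulation_speedup: 10
simulation_save_trace: true
simulation_save_basic_stats: true
simulation_save_detail_stats: false
```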

Multi-Episode Training

To launch MARL-based training using the DAROTRAIN scheduler:

python3 scripts/training_controller.py configs/config.yaml

The training process will:

  • Randomize cluster size and workload per episode
  • Schedule pods using the DAROTRAIN (QMIX) agent
  • Train and update the agent using reward feedback
  • Save model weights (qmix_latest.pth) and logs

Additional YAML Parameters for Training:

  • All scheduler_daro_* hyperparameters (learning rate, gamma, etc.)
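A training run extends the same flattened YAML. The scheduler_daro_* key names below are hypothetical examples of such hyperparameters (only the prefix and the presence of a learning rate and gamma are stated above); check the code for the real keys:

```yaml
scheduler_type: DAROTRAIN
training_episodes: 100
training_cloud_nodes_per_episode_min: 1
training_cloud_nodes_per_episode_max: 4
training_edge_nodes_per_episode_min: 2
training_edge_nodes_per_episode_max: 8
training_iot_nodes_per_episode_min: 4
training_iot_nodes_per_episode_max: 16
training_tasks_per_episode_min: 5
training_tasks_per_episode_max: 20
# Hypothetical hyperparameter names -- not a verified list
scheduler_daro_learning_rate: 0.0005
scheduler_daro_gamma: 0.99
```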

Output Artifacts

File                         Description
simulation_trace.csv         Trace with deployment and termination events
simulation_basic_stats.csv   Basic statistics on deployment and termination events
simulation_detail_stats.csv  Detailed statistics on deployment and termination events
reward_trace.csv             Reward values per pod and node (only for DAROTRAIN)
qmix_latest.pth              Trained QMIX model (only for DAROTRAIN)

Configurable Components

All settings are defined in a single flattened YAML (configs/config.yaml):

  • Cluster type, size, and node resource distributions
  • Workload task structure and pod arrival/duration/resource distributions
  • Scheduler type and parameters (including DAROTRAIN hyperparameters)
  • Simulation toggles and speed
  • Training episode counts and node/task ranges

Supported Schedulers

Scheduler      Description
DEFAULT        Native Kubernetes (KWOK) scheduler
ROUNDROBIN     Simple round-robin node selection
DAROTRAIN      Decentralized RL scheduler using QMIX
MOSTAVAILABLE  Schedules onto the node with the most available CPU and memory
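As an illustration of the simplest strategy, round-robin node selection can be sketched in a few lines. The class and method names are hypothetical and not the simulator's actual API:

```python
class RoundRobinScheduler:
    """Minimal sketch of round-robin node selection (hypothetical API)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)  # node names in a fixed order
        self._next = 0            # index of the next node to use

    def select_node(self, pod=None):
        # Pick the next node in cyclic order, ignoring pod requirements.
        node = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        return node

scheduler = RoundRobinScheduler(["edge-0", "edge-1", "cloud-0"])
placements = [scheduler.select_node() for _ in range(4)]
# placements == ["edge-0", "edge-1", "cloud-0", "edge-0"]
```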

Supported Distributions

You can configure the following statistical distributions:

  • fixed, normal, poisson, uniform, pareto
  • Fields: CPU, memory, pod interarrival, duration, number of pods per task

Type     Format Example
Fixed    {type: fixed, value: 4}
Normal   {type: normal, mean: 6, stdev: 2, min: 2, max: 8, round: 1}
Poisson  {type: poisson, mean: 6, min: 2, max: 8, round: 1}
Uniform  {type: uniform, min: 2, max: 8, round: 1}
Pareto   {type: pareto, alpha: 2, min: 2, max: 8, round: 1}

Units:

  • CPU: millicores
  • Memory: Mi (Kubernetes expects integer memory values for pods)
  • Time (Interarrival/Duration): seconds

round (optional): rounds sampled values to the given number of decimal places.
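Put together, sampling from one of these distribution specs amounts to drawing a value, clamping it to the optional [min, max] range, and applying the optional rounding. The helper below is an illustrative sketch, not the simulator's own implementation:

```python
import math
import random

def sample(dist, rng=random):
    """Draw one value from a spec like {type: normal, mean: 6, stdev: 2, ...}.

    Illustrative sketch: clamps to optional min/max, then rounds to the
    optional number of decimal places given by `round`.
    """
    kind = dist["type"]
    if kind == "fixed":
        value = dist["value"]
    elif kind == "normal":
        value = rng.gauss(dist["mean"], dist["stdev"])
    elif kind == "poisson":
        # Simple Poisson draw via inversion (adequate for small means).
        limit, k, p = math.exp(-dist["mean"]), 0, 1.0
        while p > limit:
            k += 1
            p *= rng.random()
        value = k - 1
    elif kind == "uniform":
        value = rng.uniform(dist["min"], dist["max"])
    elif kind == "pareto":
        value = rng.paretovariate(dist["alpha"])
    else:
        raise ValueError(f"unknown distribution type: {kind}")
    # Clamp to [min, max] when those bounds are present.
    value = max(dist.get("min", value), min(value, dist.get("max", value)))
    if "round" in dist:
        value = round(value, dist["round"])
    return value

print(sample({"type": "fixed", "value": 4}))  # -> 4
```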


Contact

Developed and maintained by the CUT.
For issues or contributions, please contact us or submit a merge request.