I found the following open-source framework that can measure energy consumption (if the node's hardware supports power telemetry interfaces such as RAPL (Intel), IPMI (BMC), or Redfish) OR estimate energy consumption (using machine learning models trained on real data that take inputs like CPU usage, memory usage, etc.).
Kepler (Kubernetes-based Efficient Power Level Exporter) is an open-source tool designed to monitor energy consumption at both the pod and node levels. It reads actual power consumption through interfaces like RAPL (Intel), IPMI (BMC), and Redfish, and falls back to model-based estimation when direct measurement is unavailable.
Kepler exposes Prometheus-compatible metrics such as:
- `kepler_node_joules_total`: Total energy consumption per node (in Joules)
- `kepler_container_joules_total`: Energy consumption per container/pod
- Additional metrics include CPU package energy, DRAM energy, and full platform-level energy use
These metrics can be visualized via Grafana and queried through Prometheus for integration into our control logic.
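As an illustration, here is a minimal sketch of pulling the node-level counter from Prometheus and turning it into average power. The Prometheus URL and the `instance` label value are assumptions about the deployment, not part of Kepler itself:

```python
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumption: adjust to the actual Prometheus endpoint


def node_power_watts(instance: str, window: str = "5m") -> float:
    """Average power of one node over `window`, derived from Kepler's Joules counter.

    rate() over a Joules counter yields Joules per second, i.e. Watts;
    sum() collapses any additional labels (e.g. per-mode breakdowns).
    """
    query = f'sum(rate(kepler_node_joules_total{{instance="{instance}"}}[{window}]))'
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    # "worker-1:9102" is a hypothetical instance label value
    print(f"{node_power_watts('worker-1:9102'):.1f} W")
```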
Interesting tool. We would consider installing it on-premises to check its capabilities. One drawback to avoid is retrieving data from sources other than the node itself, even from the Control Plane. When the tool needs a measurement about a node, it should rely on information available locally on that node, not on other nodes or external tools (e.g., Prometheus). But it is definitely a good starting point.
Current Approach
We have compiled a dataset of Intel and AMD CPUs that includes their Thermal Design Power (TDP) values. Our energy efficiency estimation pipeline works as follows:
TDP Lookup: We first attempt to retrieve the TDP from our dataset based on the detected CPU model.
FLOPS Estimation: We run an artificial workload (e.g., a dot product using NumPy) and time it to estimate the system's FLOPS (Floating Point Operations Per Second).
Energy Efficiency Calculation: We then compute the energy efficiency in GFLOPS (giga-FLOP) per Joule using the formula (see the sketch after this list):
GFLOPS/J = FLOPS / (TDP × 10^9)
where FLOPS is the measured floating-point operations per second and TDP is the CPU's Thermal Design Power in Watts.
Fallback Mechanism: If the CPU model is not found in the dataset, we use a pre-trained regression model as a fallback to estimate the TDP.
Regressor Prediction: The regression model is loaded from a saved file and used to predict the TDP based on extracted CPU features, ensuring coverage for unknown or newer CPUs.
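Below is a minimal sketch of this pipeline. It assumes the regressor is a scikit-learn model serialized with joblib; the dataset entries, file name, benchmark size, and feature vector are illustrative placeholders, not the actual implementation:

```python
import time

import joblib  # assumption: the saved regressor is a scikit-learn model serialized with joblib
import numpy as np

# Illustrative subset of the Intel/AMD TDP dataset (values in Watts).
TDP_DATASET = {
    "Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz": 125.0,
    "AMD Ryzen 7 5800X 8-Core Processor": 105.0,
}


def lookup_or_predict_tdp(cpu_model: str, cpu_features: list) -> float:
    """TDP Lookup, with the Fallback Mechanism / Regressor Prediction for unknown CPUs."""
    tdp = TDP_DATASET.get(cpu_model)
    if tdp is None:
        regressor = joblib.load("tdp_regressor.pkl")  # hypothetical file name
        tdp = float(regressor.predict([cpu_features])[0])
    return tdp


def estimate_flops(n: int = 4096, repeats: int = 5) -> float:
    """FLOPS Estimation: time an artificial workload (NumPy dot product) and derive FLOP/s.

    A dense n x n matrix product performs roughly 2 * n^3 floating-point operations.
    """
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    start = time.perf_counter()
    for _ in range(repeats):
        np.dot(a, b)
    elapsed = time.perf_counter() - start
    return (2 * n ** 3 * repeats) / elapsed


def gflops_per_joule(flops: float, tdp_watts: float) -> float:
    """Energy Efficiency Calculation: GFLOPS/J = FLOPS / (TDP * 1e9)."""
    return flops / (tdp_watts * 1e9)


if __name__ == "__main__":
    # The feature vector (cores, threads, base GHz, cache MiB) is hypothetical.
    tdp = lookup_or_predict_tdp("AMD Ryzen 7 5800X 8-Core Processor", cpu_features=[8, 16, 3.8, 32])
    flops = estimate_flops()
    print(f"TDP: {tdp:.0f} W, ~{flops / 1e9:.1f} GFLOPS, {gflops_per_joule(flops, tdp):.3f} GFLOPS/J")
```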
The original code didn't work because here 36e6bdef the "," should be replaced with "." in order to cast the str to float.
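In other words, something along these lines (the variable name is illustrative):

```python
raw_tdp = "125,0"                        # value read with a comma as the decimal separator
tdp = float(raw_tdp.replace(",", "."))   # normalize the separator before casting str -> float
```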
The L3_Cache was not calculated correctly on my PC. The value appears as: L3: 12 MiB (1 instance). The parsing keys off the last letter, which is not always M or B in this format, so it falls back to the default.
The default values should be explained with comments about why they were chosen, rather than, for example, just returning "unknown".
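One possible way to make the parsing robust to the output above, with the default documented inline as requested (the regex, helper name, and the 0.0 default are assumptions, not the existing code):

```python
import re

# Matches strings such as "12 MiB (1 instance)", "512 KiB", or "8M".
_CACHE_RE = re.compile(r"(?P<size>[\d.]+)\s*(?P<unit>[KMG])i?B?", re.IGNORECASE)

_UNIT_TO_MIB = {"K": 1 / 1024, "M": 1, "G": 1024}


def parse_l3_cache_mib(raw: str) -> float:
    """Parse an lscpu L3 cache value into MiB.

    Falls back to 0.0 (meaning "cache size unknown") instead of the string
    "unknown", so downstream numeric code does not break and the regressor
    can treat 0.0 as a missing feature.
    """
    match = _CACHE_RE.search(raw)
    if not match:
        return 0.0  # default: no parsable size found in the lscpu output
    size = float(match.group("size"))
    unit = match.group("unit").upper()
    return size * _UNIT_TO_MIB[unit]


print(parse_l3_cache_mib("12 MiB (1 instance)"))  # 12.0
```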
Should we extract the FLOPS as a separate label or annotation? @hherodotou
Thank you for your review and edits, Michael.
Mafooq will address the issues with the necessary changes.
Yes, the FLOPS should be extracted as a separate annotation.
Perfect. Please also include the relevant documentation explaining how they are calculated. (I think most of the functionality is covered in the energyEfficiency docs)