I found the following open-source framework that can measure energy consumption (if the node's hardware supports power telemetry interfaces such as RAPL (Intel), IPMI (BMC), or Redfish) OR estimate energy consumption (using machine learning models trained on real data that take inputs like CPU usage, memory usage, etc.).
Kepler (Kubernetes-based Efficient Power Level Exporter) is an open-source tool designed to monitor energy consumption at both the pod and node levels. It reads actual power consumption through interfaces like RAPL (Intel), IPMI (BMC), and Redfish, and falls back to model-based estimation when direct measurement is unavailable.
Kepler exposes Prometheus-compatible metrics such as:
- `kepler_node_joules_total`: Total energy consumption per node (in Joules)
- `kepler_container_joules_total`: Energy consumption per container/pod
- Additional metrics include CPU package energy, DRAM energy, and full platform-level energy use
These metrics can be visualized via Grafana and queried through Prometheus for integration into our control logic.
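As an illustration, here is a minimal sketch of pulling the node-level counter from Prometheus and turning it into average power. The Prometheus URL and the `instance` label value are assumptions about the deployment, not part of Kepler itself:

```python
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumption: adjust to the actual Prometheus endpoint


def node_power_watts(instance: str, window: str = "5m") -> float:
    """Average power of one node over `window`, derived from Kepler's Joules counter.

    rate() over a Joules counter yields Joules per second, i.e. Watts;
    sum() collapses any additional labels (e.g. per-mode breakdowns).
    """
    query = f'sum(rate(kepler_node_joules_total{{instance="{instance}"}}[{window}]))'
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    # "worker-1:9102" is a hypothetical instance label value
    print(f"{node_power_watts('worker-1:9102'):.1f} W")
```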
Interesting tool. We would consider installing it on-premises to check its capabilities. One drawback to avoid is retrieving data from sources other than the node itself, even from the Control Plane. When the tool needs a measurement about a node, it should rely on information available locally on that node, not on other nodes or external tools (e.g., Prometheus). But it is definitely a good starting point.
Current Approach
We have compiled a dataset of Intel and AMD CPUs that includes their Thermal Design Power (TDP) values. Our energy efficiency estimation pipeline works as follows:
TDP Lookup: We first attempt to retrieve the TDP from our dataset based on the detected CPU model.
FLOPS Estimation: We run an artificial workload (e.g., a dot product using NumPy) and time it to estimate the system's FLOPS (Floating Point Operations Per Second).
Energy Efficiency Calculation: We then compute the energy efficiency in GFLOPS (giga-FLOP) per Joule using the formula (see the sketch after this list):
GFLOPS/J = FLOPS / (TDP × 10^9)
where FLOPS is the measured floating-point operations per second and TDP is the CPU's Thermal Design Power in Watts.
Fallback Mechanism: If the CPU model is not found in the dataset, we use a pre-trained regression model as a fallback to estimate the TDP.
Regressor Prediction: The regression model is loaded from a saved file and used to predict the TDP based on extracted CPU features, ensuring coverage for unknown or newer CPUs.
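Below is a minimal sketch of this pipeline. It assumes the regressor is a scikit-learn model serialized with joblib; the dataset entries, file name, benchmark size, and feature vector are illustrative placeholders, not the actual implementation:

```python
import time

import joblib  # assumption: the saved regressor is a scikit-learn model serialized with joblib
import numpy as np

# Illustrative subset of the Intel/AMD TDP dataset (values in Watts).
TDP_DATASET = {
    "Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz": 125.0,
    "AMD Ryzen 7 5800X 8-Core Processor": 105.0,
}


def lookup_or_predict_tdp(cpu_model: str, cpu_features: list) -> float:
    """TDP Lookup, with the Fallback Mechanism / Regressor Prediction for unknown CPUs."""
    tdp = TDP_DATASET.get(cpu_model)
    if tdp is None:
        regressor = joblib.load("tdp_regressor.pkl")  # hypothetical file name
        tdp = float(regressor.predict([cpu_features])[0])
    return tdp


def estimate_flops(n: int = 4096, repeats: int = 5) -> float:
    """FLOPS Estimation: time an artificial workload (NumPy dot product) and derive FLOP/s.

    A dense n x n matrix product performs roughly 2 * n^3 floating-point operations.
    """
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    start = time.perf_counter()
    for _ in range(repeats):
        np.dot(a, b)
    elapsed = time.perf_counter() - start
    return (2 * n ** 3 * repeats) / elapsed


def gflops_per_joule(flops: float, tdp_watts: float) -> float:
    """Energy Efficiency Calculation: GFLOPS/J = FLOPS / (TDP * 1e9)."""
    return flops / (tdp_watts * 1e9)


if __name__ == "__main__":
    # The feature vector (cores, threads, base GHz, cache MiB) is hypothetical.
    tdp = lookup_or_predict_tdp("AMD Ryzen 7 5800X 8-Core Processor", cpu_features=[8, 16, 3.8, 32])
    flops = estimate_flops()
    print(f"TDP: {tdp:.0f} W, ~{flops / 1e9:.1f} GFLOPS, {gflops_per_joule(flops, tdp):.3f} GFLOPS/J")
```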
The original code didn't work because here 36e6bdef the "," should be replaced with "." in order to cast the str to float.
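In other words, something along these lines (the variable name is illustrative):

```python
raw_tdp = "125,0"                        # value read with a comma as the decimal separator
tdp = float(raw_tdp.replace(",", "."))   # normalize the separator before casting str -> float
```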
The L3_Cache was not calculated correctly on my PC. The value appears as: L3: 12 MiB (1 instance). The parsing keys off the last letter, which is not always M or B in this format, so it falls back to the default.
The default values should be explained with comments about why they were chosen, rather than, for example, just returning "unknown".
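One possible way to make the parsing robust to the output above, with the default documented inline as requested (the regex, helper name, and the 0.0 default are assumptions, not the existing code):

```python
import re

# Matches strings such as "12 MiB (1 instance)", "512 KiB", or "8M".
_CACHE_RE = re.compile(r"(?P<size>[\d.]+)\s*(?P<unit>[KMG])i?B?", re.IGNORECASE)

_UNIT_TO_MIB = {"K": 1 / 1024, "M": 1, "G": 1024}


def parse_l3_cache_mib(raw: str) -> float:
    """Parse an lscpu L3 cache value into MiB.

    Falls back to 0.0 (meaning "cache size unknown") instead of the string
    "unknown", so downstream numeric code does not break and the regressor
    can treat 0.0 as a missing feature.
    """
    match = _CACHE_RE.search(raw)
    if not match:
        return 0.0  # default: no parsable size found in the lscpu output
    size = float(match.group("size"))
    unit = match.group("unit").upper()
    return size * _UNIT_TO_MIB[unit]


print(parse_l3_cache_mib("12 MiB (1 instance)"))  # 12.0
```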
Should we extract the FLOPS as a separate label or annotation? @hherodotou
Thank you for your review and edits, Michael.
Mafooq will address the issues with the necessary changes.
Yes, the FLOPS should be extracted as a separate annotation.
Perfect. Please also include the relevant documentation explaining how they are calculated. (I think most of the functionality is covered in the energyEfficiency docs)