Despite no other pods running, the resource quota blocks launching new build agents. Builds are piling up. This seems to be the same situation as in #4892 (closed).
At the time of triggering the job, no other pods were running. Thus, I'm confused why the "used" line shows any values different from 0. The "limited" line shows enough resources for the current Jenkins job to fit. AFAIK there haven't been any related changes to the Jenkinsfile recently. Also, according to https://api.eclipse.org/cbi/sponsorships, the upcoming reassignment of resource packs is not yet finished and there are still two packs assigned.
Can it also be related to our Windows agents? I noticed there is still the "old" agent up, which isn't used anymore. Seems this wasn't clearly communicated in the associated ticket.
Just to confirm: the current situation (stale pods) is not the same as in #4892 (closed)?
If nothing else works at the moment, lowering the resources in the Jenkinsfile would be OK for us as an intermediate solution; a sketch of what that could look like is below. I'd just like to make sure that we can really use all assigned resources correctly.
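For illustration, a minimal sketch of what lowering the requests could look like, assuming the agent pod is defined inline via the Kubernetes plugin's yaml block; the container name, build step, and the concrete CPU/memory values are placeholders, not our actual settings:

```groovy
pipeline {
  agent {
    kubernetes {
      yaml '''
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: gt-gen-dev                       # placeholder container name
    image: xiaopanansys/gt-gen-dev:latest
    command: ["sleep"]
    args: ["infinity"]
    resources:
      requests:
        cpu: "2"                           # example values, lowered so the pod fits the remaining quota
        memory: "4Gi"
      limits:
        cpu: "2"
        memory: "4Gi"
'''
    }
  }
  stages {
    stage('Build') {
      steps {
        container('gt-gen-dev') {
          sh 'echo build'                  // placeholder build step
        }
      }
    }
  }
}
```

Depending on how the quota is configured, it is typically the requests that count against it, so lowering those (and not only the limits) is what would make the pod schedulable again.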
> At the time of triggering the job, no other pods were running. Thus, I'm confused why the "used" line shows any values different from 0.
The job runs multiple pods in parallel. That's why "used" is > 0.
The "limited" line shows enough resources for the current Jenkins job to fit.
19 used + 16 requested = 35, which exceeds the limit of 30.
> Also, according to https://api.eclipse.org/cbi/sponsorships, the upcoming reassignment of resource packs is not yet finished and there are still two packs assigned.
The API had not been updated yet, but the resource packs had already been removed on the Jenkins level. The API has since been updated.
> Can it also be related to our Windows agents? I noticed there is still the "old" agent up, which isn't used anymore. Seems this wasn't clearly communicated in the associated ticket.
It's unrelated to the Windows agents, but I will clean up the old agent (b9qls-windows-10).
The actual issue was the reassignment of the resource packs to the SCM instance, which led to a lack of resources. Meanwhile, new requests have been created to add more resource packs to the openPASS instance. 10 resource packs have been added, so the build should work again.
I'm still not sure about that. The stages running in parallel are related to our Linux and Windows builds. Inside the Linux stage everything should run in sequence, but I have to look into that in more detail; see the sketch below.
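For reference, a minimal sketch of the structure I mean, assuming declarative parallel branches with sequential sub-stages on the Linux side; stage names, the Windows label, and the build commands are placeholders:

```groovy
pipeline {
  agent none
  stages {
    stage('Build') {
      parallel {
        stage('Linux') {
          agent {
            kubernetes {
              yaml '''
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: gt-gen-dev                       # placeholder container name
    image: xiaopanansys/gt-gen-dev:latest
    command: ["sleep"]
    args: ["infinity"]
'''
            }
          }
          stages {                                   // sequential sub-stages, all on the same Linux pod
            stage('Compile') { steps { container('gt-gen-dev') { sh 'make' } } }        // placeholder command
            stage('Test')    { steps { container('gt-gen-dev') { sh 'make test' } } }   // placeholder command
          }
        }
        stage('Windows') {
          agent { label 'windows' }                  // placeholder label for the Windows agent
          steps { bat 'build.cmd' }                  // placeholder command
        }
      }
    }
  }
}
```

In this shape only the two top-level branches run concurrently, so there should be at most one Linux pod per build at any time.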
It looks like the pod cannot start within the specified timeout of 3 minutes.
That's probably because pulling xiaopanansys/gt-gen-dev:latest takes a long time. Maybe adding imagePullPolicy: IfNotPresent to the container definition would solve the issue.
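A minimal sketch of that suggestion, assuming the container is declared in the inline pod YAML of the Jenkinsfile (the container name is a placeholder); this fragment would be merged into the existing agent definition rather than added as-is:

```groovy
agent {
  kubernetes {
    yaml '''
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: gt-gen-dev                       # placeholder container name
    image: xiaopanansys/gt-gen-dev:latest
    imagePullPolicy: IfNotPresent          # reuse an image already cached on the node instead of pulling every time
'''
  }
}
```

Note that IfNotPresent only helps on nodes that have pulled the image before, and it will keep using a stale :latest tag until the cached image is refreshed.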
I'm not sure if we can override the timeout, but can you add the following to your Jenkinsfile:
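(The snippet that followed here is not preserved in this thread; judging from the reply below, it was presumably an activeDeadlineSeconds entry in the pod spec, roughly like the following; the value and container name are placeholders:)

```groovy
agent {
  kubernetes {
    yaml '''
apiVersion: v1
kind: Pod
spec:
  activeDeadlineSeconds: 600               # presumed suggestion; example value
  containers:
  - name: gt-gen-dev                       # placeholder container name
    image: xiaopanansys/gt-gen-dev:latest
'''
  }
}
```

Note that in Kubernetes, activeDeadlineSeconds bounds how long a pod may run after it has started, so it would not by itself extend the 180-second start-up timeout.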
I added the activeDeadlineSeconds here, but unfortunately it has no effect; it's still timing out after 180s. Did I specify the setting correctly?
As we've had some quite large images recently, I checked the current one, which is < 1 GiB. I don't think that's a size to worry about, and a container should be able to start even within the 180s limit. Could this incident be related?
I noticed that the opSimulation Jenkinsfile was also missing this change related to timeout issues with mounting the persistent volume. I've added it now to the current test branch, and the pod startup was smooth in the latest run.