[Sirius] OOM killer regularly kills Sirius test jobs on CI (SIGKILL)
Summary
For a while now, we have observed that some of our test jobs are killed with the message "process returned error code 137(SIGKILL received?)". We never investigated this in depth because the same tests are also run in parallel on a "private" infrastructure, where they pass.
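For reference, the value 137 matches the usual shell convention of 128 plus the signal number, and signal 9 is SIGKILL on Linux. A minimal sketch of that decoding follows; the helper name `describe_exit_code` is only illustrative and is not part of the CI tooling.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status.

    By convention, a status above 128 means the process was terminated
    by signal number (status - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"unknown signal {signum}"
        return f"terminated by {name}"
    return f"exited normally with status {code}"


print(describe_exit_code(137))  # on Linux: terminated by SIGKILL
```

This only tells us that the process received SIGKILL, not who sent it, so it does not prove by itself that the OOM killer is involved.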
Steps to reproduce
The problem can be reproduced by launching the Sirius Tests Master job. The swtbot-part2 suite is almost always the one affected.
What is the current bug behavior?
The test suite is killed before it finishes, with only the message "process returned error code 137(SIGKILL received?)".
The problem seems similar to Bugzilla 560654. The JDT team seems to have solved its problem by changing the Xmx parameter to 1g, but our test jobs are already configured with an Xmx of 2g.
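One thing that might be worth checking is how the configured heap compares to the memory limit of the build container, since Xmx only bounds the Java heap and not the whole process. The sketch below assumes the CI agent runs in a Linux cgroup and that the limit is exposed at one of the standard cgroup v2 or v1 paths; we have not verified this on the CI, and the function names are only illustrative.

```python
from pathlib import Path
from typing import Optional

# Standard cgroup locations: v2 first, then v1 (assumed, not verified on the CI).
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),
)


def parse_xmx(option: str) -> int:
    """Convert a JVM option such as '-Xmx2g' into a number of bytes."""
    value = option.removeprefix("-Xmx").lower()
    units = {"k": 1024, "m": 1024**2, "g": 1024**3}
    if value and value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)


def read_memory_limit(candidates=CGROUP_LIMIT_FILES) -> Optional[int]:
    """Return the cgroup memory limit in bytes, or None if unlimited or unavailable."""
    for path in candidates:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue  # file missing or unreadable, try the next location
        if raw == "max":  # cgroup v2 spelling of "no limit"
            return None
        return int(raw)
    return None


if __name__ == "__main__":
    heap = parse_xmx("-Xmx2g")
    limit = read_memory_limit()
    if limit is None:
        print("no cgroup memory limit found")
    else:
        print(f"heap max: {heap} bytes, limit: {limit} bytes, headroom: {limit - heap} bytes")
```

If the headroom turns out to be small, the non-heap memory of the JVM could be enough to push the process over the limit, but this is only a hypothesis at this stage.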
What is the expected correct behavior?
The suite should run to completion.
Relevant logs and/or screenshots
Here are some examples of jobs that ended with "process returned error code 137(SIGKILL received?)":
- https://ci.eclipse.org/sirius/job/sirius.tests-master/PLATFORM=2023-03,SUITE=swtbot-part1,jdk=openjdk-jdk17-latest,label=migration/3272/console
- https://ci.eclipse.org/sirius/job/sirius.tests-master/PLATFORM=2023-03,SUITE=swtbot-part2,jdk=openjdk-jdk17-latest,label=migration/3271/console
- https://ci.eclipse.org/sirius/job/sirius.tests-master/PLATFORM=2023-03,SUITE=swtbot-part2,jdk=openjdk-jdk17-latest,label=migration/3270/console
- https://ci.eclipse.org/sirius/job/sirius.tests-master/PLATFORM=2023-03,SUITE=swtbot-part2,jdk=openjdk-jdk17-latest,label=migration/3269/console
The duration does not seem to be the cause, because it varies from one run to the next. The last executed test is not always the same either.
Priority
- Urgent
- High
- Medium
- Low
Severity
- Blocker
- Major
- Normal
- Low
Impact
The impact is low for us because we have another, internal CI in a specific context, but the situation is not ideal.
We are considering profiling locally with YourKit, but the local environment is not necessarily similar to the CI, so we are not sure this will be effective.
Do you have any advice for investigating the "OOM killer" on the CI server?
- Maybe we could access the oom_score file of the process to confirm that the OOM killer is the "culprit"?
- Any other idea is welcome.
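To make the oom_score idea concrete, here is a minimal sketch of what we could run on the agent, assuming we are allowed to read these files there. `/proc/<pid>/oom_score` only exists while the process is alive, so it would have to be sampled during the run. The `oom_kill` counter in the cgroup v2 `memory.events` file is cumulative and might be easier to check afterwards. The path used for `memory.events` is an assumption about the CI container layout, and the function names are only illustrative.

```python
import os
from pathlib import Path
from typing import Optional


def read_oom_score(pid: int, proc_root: Path = Path("/proc")) -> Optional[int]:
    """Return the kernel's current OOM score for a live process, or None if unavailable."""
    try:
        return int((proc_root / str(pid) / "oom_score").read_text().strip())
    except (OSError, ValueError):
        return None


def read_oom_kill_count(
    events_file: Path = Path("/sys/fs/cgroup/memory.events"),
) -> Optional[int]:
    """Return the 'oom_kill' counter from a cgroup v2 memory.events file, or None."""
    try:
        lines = events_file.read_text().splitlines()
    except OSError:
        return None
    for line in lines:
        key, _, value = line.partition(" ")
        if key == "oom_kill":
            return int(value)
    return None


if __name__ == "__main__":
    print("oom_score of this process:", read_oom_score(os.getpid()))
    print("oom_kill events in this cgroup:", read_oom_kill_count())
```

A non-zero `oom_kill` count after a failed run would be a strong hint that the OOM killer is responsible, whereas a high `oom_score` only says that the process would be a likely victim.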