We did have a successful build a couple of days ago, but all builds yesterday failed, some of them for other (expected) reasons, but many for the above reason.
I am trying to publish 2024-03 M3 today. If it's possible to restart the Jenkins instance, that would be great. Feel free to stop any and all running jobs to do that (unless I make a further comment below).
I've restarted it before the last build started. I wanted to wait before reporting success, but then I realized that the build will be running for 1.5+ hours.
I now see that it failed at the same point, while copying the archives around for the PHP bundle. I am going to do some experiments to see if I can get a partial build to work: first one where I only build PHP, and then one with everything but PHP.
If you are still working and see something in the meantime, let me know.
A build of the PHP package alone works (the job fails, but only later, because I only built one package). Building the PHP package on my own machine also works. Now running a build of everything except PHP.
Could it be that I am hitting limits or a full hard drive? Not sure what may have changed on our side, but if I run a simple build that just tries to create the same number of files (quickly), then the build fails. See e.g. https://ci.eclipse.org/packaging/job/test3/3/console
My best estimate is it takes 80-120 GB of disk space to build EPP.
I am running a build now that does not produce all the artifacts (no aarch64 and only the 5 most popular products) to try to reduce total disk usage dramatically, hopefully enough that it can fit in the currently available space.
The final output size of one EPP build is ~30GB, and there are approximately three full copies of the final output in various stages while the build is running.
Is it possible to re-jig the build in stages so that all the space is not required at one moment in time?
Yes - I can if needed. FWIW it has been running this way for a long time. In fact in the last few months we actually went down ~10% in usage.
How did this work previously? I don't think our containers have changed, so I'm wondering what's new.
I don't know. Is the space mounted on / just what is available on the local machine? Has a limit been put on that recently? The last build that worked was Tuesday morning (Ottawa time) and it would have used the 80-120GB of space.
The builds are using an emptyDir volume without a size limit. So you're only limited by the disk size of the k8s node and the disk usage of other pods on the same node. The current build is running on a node that has ~800GB of free disk space. So disk space should not be an issue.
Can you tell if the earlier failing builds were running on a different node with less space?
Builds #3133 to #3136 used only two different nodes that both have the same disk size (~800GB). I wasn't able to track down earlier pod-to-node assignments. From what I found so far, I don't think disk usage is the issue here.
While this might not help for this iteration, we recommend building the different bundles sequentially in separate jobs or using parallel builds to improve stability and speed. Bonus: convert the job to a Jenkins pipeline job.
which isn't what I expected, because I thought it was in /:
00:00:00.343 overlay 32G 20K 32G 1% /
So that is why I thought we were running out of space.
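(A quick way to double-check where the workspace actually lives - assuming the standard Jenkins WORKSPACE variable, which isn't shown in the output above - is something like:
df -h "$WORKSPACE"
df -h /
which shows whether the build is writing to the 32G overlay on / or to a separate mount.)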
I still think there is a problem writing to disk, but it may only be indirectly related to disk space. My current best guess is that I am exceeding limits somewhere. So I tried to reproduce a build that does the same quantity of writing files to disk, without taking as long as an EPP build.
So I did a job that did this:
for i in $(seq 1 1000); do echo $i; dd if=/dev/urandom of=output$i.dat bs=100M count=1; done
That should make 1000 files of 100M each, which equals the ~100GB that an EPP build writes.
When I run the above it fails fairly reliably on the 40th iteration, i.e. after ~4G, which is suspiciously similar to the memory limit.
However, if I add a sync output$i.dat after the dd command, then it is able to create all 1000 files.
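(For reference, the modified loop looks roughly like this - a sketch that just mirrors the test above, not the exact job script:)
for i in $(seq 1 1000); do
  echo $i
  # write 100M of random data, then flush that file's dirty pages before the next write
  dd if=/dev/urandom of=output$i.dat bs=100M count=1
  sync output$i.dat
done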
I did the above example on the CDT CI instance as I didn't want to keep messing with EPP's available resources while the EPP build was still running. You can compare run 1, which fails, and run 4, which succeeds: https://ci.eclipse.org/cdt/job/test/
Run #3 - Starts at 185152 kB, was at 6304908 kB right before the dd call that was killed on iteration 120.
Run #7 - Starts at 167592 kB, was at 4003800 kB right before the dd call that was killed on iteration 113.
Run #8 - Starts at 119336 kB, was at 3514768 kB when I manually aborted after 377 iterations.
Very interesting: it seems just adding the cat /proc/meminfo | grep Dirty into the loop causes the number of successful iterations to go up a lot. For run 9 I removed it and the run only lasted 42 iterations.
So for run 10 I changed it to only run cat /proc/meminfo | grep Dirty if dd failed; the value after failure was 3954484 kB on iteration 123.
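(For reference, the instrumented loop from runs 3, 7 and 8 looks roughly like this; run 9 dropped the meminfo line and run 10 only ran it when dd failed. A sketch, not the exact job script:)
for i in $(seq 1 1000); do
  echo $i
  # report the kernel's dirty page-cache size before each write
  cat /proc/meminfo | grep Dirty
  dd if=/dev/urandom of=output$i.dat bs=100M count=1 || break
done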
Yesterday when I was doing this, all my test runs (without sync) were dying. This morning I am seeing some runs (like 8 above) that are getting into the hundreds without issue.
Note that around iteration 40 there is a very noticeable slowdown in dd, as I assume at that point the cache is being written back to disk (e.g. 18.9318 s to create the file instead of the normal sub-second time). Every time I see dd being killed, it is running very slowly on that iteration.
With Sync:
Run #12 - Starts at 676380 kB and bounces around a range of 161648 - 3428252 kB (only ~50 iterations were above 3000000 kB) for the full run to 1000 iterations.
Thanks. So the kernel is preferring to buffer writes rather than flush them to disk, but it's unaware of the cgroup memory limitation. For now, is it possible for you to introduce a "sync" shell script call during the build process?
I'll look at tweaking /proc/sys/vm/vfs_cache_pressure across all our nodes. 100 is the Linux default, but it does try quite aggressively to stay off the disk as much as it can.
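(A non-persistent tweak like that would look something like the following on each node; the exact commands used aren't shown here, and 400 is the value observed later in this thread:)
# raise vfs_cache_pressure for the current boot only (not persisted across reboots)
sysctl -w vm.vfs_cache_pressure=400
# equivalently:
echo 400 > /proc/sys/vm/vfs_cache_pressure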
I don't know how to put a sync in the correct place as it is all one giant Maven run, but I'll try something.
I spoke to Ed and I am also working on making a partial build for M3 today and refactoring the build so that we are not using so much resource in any one job. That will have the secondary advantage of making it possible to re-run smaller parts if one of the sub-jobs fails.
@jograham I've tweaked the setting but it's not permanent. Whenever you can, I'd be curious to see if that makes a difference on the stability of the build.
I ran a "sync" every 5 seconds in the background on https://ci.eclipse.org/packaging/job/simrel.epp-tycho-build/3143/ and it worked. That build was started at 9:05 this morning, so I don't know if the magic sauce was the "sync", your tweak or a mixture of both. But regardless we have a successful build - and the first successful full build since Tuesday morning.
My sync in the background is probably terrible for everyone else sharing the machine as IIUC it will affect all pods assigned to that machine (VM?) so I will turn that off and run a new build with just your tweak to see if we get to the positive result (once I complete the other parts of my releng work for 2024-03 M3)
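(The background sync was nothing fancier than something along these lines wrapped around the build step; a sketch of the approach rather than the exact job script:)
# flush dirty pages every 5 seconds while the build runs
while true; do sync; sleep 5; done &
SYNC_PID=$!
# ... run the EPP Maven build here ...
kill $SYNC_PID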
Au contraire, sync is always good. I'm glad you've got a good build. It would be interesting to run it without the sync to make sure the vfs parameter addresses this.
Mind you, this is also a good way to throttle disk I/O on heavy consumers
I see that vfs_cache_pressure is now 400. Unfortunately my dd test failed in #15 with a final Dirty of 4018564 kB after 97 iterations. As of now test #14 is still running.
#14 seems to be running slower than #15 (different physical machines?).
I have the EPP build running (with no sync) in #3145
97 x 100 MB, right? So the job failed after attempting to write 9.7G? I don't think that's necessarily a bad thing, but vfs_cache_pressure did nothing to improve things, as you used to get to 113.
That seems bad to me :-( but there is much I don't understand here about limits in containerized spaces. (I found this article about the OOM Killer interesting.) I think it is bad because 9.7 GB of file operations is still pretty small when compared to a full build generating ~30GB of artifacts, and I agree 100% with your earlier statement: "I'm shocked that Linux retains dirty buffer page allocation to the user space. You should not have to manage this."
but vfs_cache_pressure did nothing to improve things, as you used to get to 113.
I get random failures - yesterday almost all my failures were ~4GB (40 iterations), this morning many went longer. But yes, I don't think vfs_cache_pressure made any substantial difference.
The failures on EPP all happen when it is writing/copying results around; it is creating and copying ~3.5GB of zips at a time.
It looks like I have a workaround of background sync calls until I can get the refactored build in place.
I have refactored the build and I am getting ready to replace the old build with a restructured one that should have less pressure on the filesystem and dirty cache. In particular I have removed one of the places that was doing a copy of ~3.5 GB at a time and split the build into multiple stages.
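(Purely as an illustration of the staged shape, not the actual job definitions - the module and profile names below are made up. The idea is that each stage writes and syncs a smaller slice of the output before the next one starts.)
# hypothetical staging - real module/profile names differ
mvn verify -pl packages/org.eclipse.epp.package.php -am
sync
mvn verify -pl packages -am -P all-but-php   # hypothetical profile covering the remaining packages
sync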