We did have a successful build a couple of days ago, but all builds yesterday failed, some of them for other (expected) reasons, but many for the above reason.
I am trying to publish 2024-03 M3 today. If it's possible to restart the Jenkins instance, that would be great. Feel free to stop any and all running jobs to do that (unless I make a further comment below).
I've restarted it before the last build started. I wanted to wait before reporting success, but then I realized that the build will be running for 1.5+ hours.
I now see that it failed at the same point, while copying the archives around for the PHP bundle. I am going to do some experiments to see if I can get a partial build to work: first one where I only build PHP, and then one with everything but PHP.
If you are still working and see something in the meantime, let me know.
A build of the PHP package alone works (the job fails, but only later, because I only built one package). Building the PHP package on my own machine also works. Now running a build of everything except PHP.
Could it be that I am hitting limits or a full hard drive? Not sure what may have changed on our side, but if I run a simple build that just tries to create the same number of files (quickly), then the build fails. See e.g. https://ci.eclipse.org/packaging/job/test3/3/console
My best estimate is it takes 80-120 GB of disk space to build EPP.
I am running a build now that does not produce all the artifacts (no aarch64 and only the 5 most popular products) to try to reduce total disk usage dramatically, hopefully enough that it can fit in the currently available space.
The final output size of one EPP build is ~30GB, and there are approximately three full copies of the final output in various stages while the build is running.
Is it possible to re-jig the build in stages so that all the space is not required at one moment in time?
Yes - I can if needed. FWIW it has been running this way for a long time. In fact in the last few months we actually went down ~10% in usage.
How did this work previously? I don't think our containers have changed, so I'm wondering what's new.
I don't know. Is the space mounted on / just what is available on the local machine? Has a limit been put on that recently? The last build that worked was Tuesday morning (Ottawa time) and it would have used the 80-120GB of space.
The builds are using an emptyDir volume without a size limit. So you're only limited by the disk size of the k8s node and the disk usage of other pods on the same node. The current build is running on a node that has ~800GB of free disk space. So disk space should not be an issue.
Can you tell if the earlier failing builds were running on a different node with less space?
Builds #3133 to #3136 used only two different nodes that both have the same disk size (~800GB). I wasn't able to track down earlier pod-to-node assignments. From what I found so far, I don't think disk usage is the issue here.
While this might not help for this iteration, we recommend building the different bundles sequentially in separate jobs or using parallel builds to improve stability and speed. Bonus: convert the job to a Jenkins pipeline job.
which isn't what I expected, because I thought it was in /:
00:00:00.343 overlay 32G 20K 32G 1% /
So that is why I thought we were running out of space.
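(A quick way to double-check where the workspace actually lives - assuming the standard Jenkins WORKSPACE variable, which isn't shown in the output above - is something like:
df -h "$WORKSPACE"
df -h /
which shows whether the build is writing to the 32G overlay on / or to a separate mount.)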
I still think there is a problem writing to disk, but it may only be indirectly related to disk space. My current best guess is that I am exceeding limits somewhere. So I tried to reproduce a build that does the same quantity of writing files to disk, without taking as long as an EPP build.
So I did a job that did this:
for i in $(seq 1 1000); do echo $i; dd if=/dev/urandom of=output$i.dat bs=100M count=1; done
That should make 1000 files of 100M each, which equals the ~100GB that an EPP build writes.
When I run the above it fails fairly reliably on the 40th iteration, i.e. after ~4G, which is suspiciously similar to the memory limit.
However, if I add a sync output$i.dat after the dd command, then it is able to create all 1000 files.
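(For reference, the modified loop looks roughly like this - a sketch that just mirrors the test above, not the exact job script:)
for i in $(seq 1 1000); do
  echo $i
  # write 100M of random data, then flush that file's dirty pages before the next write
  dd if=/dev/urandom of=output$i.dat bs=100M count=1
  sync output$i.dat
done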
I did the above example on the CDT CI instance as I didn't want to keep messing with EPP's available resources while the EPP build was still running. You can compare run 1, which fails, and run 4, which succeeds: https://ci.eclipse.org/cdt/job/test/
Run #3 - Starts at 185152 kB, was at 6304908 kB right before the dd call that was killed on iteration 120.
Run #7 - Starts at 167592 kB, was at 4003800 kB right before the dd call that was killed on iteration 113.
Run #8 - Starts at 119336 kB, was at 3514768 kB when I manually aborted after 377 iterations.
Very interesting: it seems just adding the cat /proc/meminfo | grep Dirty into the loop causes the number of successful iterations to go up a lot. For run 9 I removed it and the run only lasted 42 iterations.
So for run 10 I changed it to only run cat /proc/meminfo | grep Dirty if dd failed; the value after failure was 3954484 kB on iteration 123.
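(For reference, the instrumented loop from runs 3, 7 and 8 looks roughly like this; run 9 dropped the meminfo line and run 10 only ran it when dd failed. A sketch, not the exact job script:)
for i in $(seq 1 1000); do
  echo $i
  # report the kernel's dirty page-cache size before each write
  cat /proc/meminfo | grep Dirty
  dd if=/dev/urandom of=output$i.dat bs=100M count=1 || break
done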
Yesterday when I was doing this, all my test runs (without sync) were dying. This morning I am seeing some runs (like 8 above) that are getting into the hundreds without issue.
Note that around iteration 40 there is a very noticeable slowdown in dd, as I assume at that point the cache is being written back to disk (e.g. 18.9318 s to create the file instead of the normal sub-second time). Every time I see dd being killed, it is running very slowly on that iteration.
With Sync:
Run #12 - Starts at 676380 kB and bounces around a range of 161648 - 3428252 kB (only ~50 iterations were above 3000000 kB) for the full run to 1000 iterations.
Thanks. So the kernel is preferring to buffer writes rather than flush them to disk, but it's unaware of the cgroup memory limitation. For now, is it possible for you to introduce a "sync" shell script call during the build process?
I'll look at tweaking /proc/sys/vm/vfs_cache_pressure across all our nodes. 100 is the Linux default, but it does try quite aggressively to stay off the disk as much as it can.
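(A non-persistent tweak like that would look something like the following on each node; the exact commands used aren't shown here, and 400 is the value observed later in this thread:)
# raise vfs_cache_pressure for the current boot only (not persisted across reboots)
sysctl -w vm.vfs_cache_pressure=400
# equivalently:
echo 400 > /proc/sys/vm/vfs_cache_pressure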
I don't know how to put a sync in the correct place as it is all one giant Maven run, but I'll try something.
I spoke to Ed and I am also working on making a partial build for M3 today and refactoring the build so that we are not using so much resource in any one job. That will have the secondary advantage of making it possible to re-run smaller parts if one of the sub-jobs fails.
@jograham I've tweaked the setting but it's not permanent. Whenever you can, I'd be curious to see if that makes a difference on the stability of the build.
I ran a "sync" every 5 seconds in the background on https://ci.eclipse.org/packaging/job/simrel.epp-tycho-build/3143/ and it worked. That build was started at 9:05 this morning, so I don't know if the magic sauce was the "sync", your tweak or a mixture of both. But regardless we have a successful build - and the first successful full build since Tuesday morning.
My sync in the background is probably terrible for everyone else sharing the machine as IIUC it will affect all pods assigned to that machine (VM?) so I will turn that off and run a new build with just your tweak to see if we get to the positive result (once I complete the other parts of my releng work for 2024-03 M3)
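(The background sync was nothing fancier than something along these lines wrapped around the build step; a sketch of the approach rather than the exact job script:)
# flush dirty pages every 5 seconds while the build runs
while true; do sync; sleep 5; done &
SYNC_PID=$!
# ... run the EPP Maven build here ...
kill $SYNC_PID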
Au contraire, sync is always good. I'm glad you've got a good build. It would be interesting to run it without the sync to make sure the vfs parameter addresses this.
Mind you, this is also a good way to throttle disk I/O on heavy consumers
I see that vfs_cache_pressure is now 400. Unfortunately my dd test failed in #15 with a final Dirty of 4018564 kB after 97 iterations. As of now test #14 is still running.
#14 seems to be running slower than #15 (different physical machines?).
I have the EPP build running (with no sync) in #3145
97 x 100 MB, right? So the job failed after attempting to write 9.7G? I don't think that's necessarily a bad thing, but vfs_cache_pressure did nothing to improve things, as you used to get to 113.
That seems bad to me :-( but there is much I don't understand here about limits in containerized spaces. (I found this article about the OOM Killer interesting.) I think it is bad because 9.7 GB of file operations is still pretty small when compared to a full build generating ~30GB of artifacts, and I agree 100% with your earlier statement: "I'm shocked that Linux retains dirty buffer page allocation to the user space. You should not have to manage this."
but vfs_cache_pressure did nothing to improve things, as you used to get to 113.
I get random failures - yesterday almost all my failures were ~4GB (40 iterations), this morning many went longer. But yes, I don't think vfs_cache_pressure made any substantial difference.
The failures on EPP all happen when it is writing/copying results around; it is creating and copying ~3.5GB of zips at a time.
It looks like I have a workaround of background sync calls until I can get the refactored build in place.
I have refactored the build and I am getting ready to replace the old build with a restructured one that should have less pressure on the filesystem and dirty cache. In particular I have removed one of the places that was doing a copy of ~3.5 GB at a time and split the build into multiple stages.
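(Purely as an illustration of the staged shape, not the actual job definitions - the module and profile names below are made up. The idea is that each stage writes and syncs a smaller slice of the output before the next one starts.)
# hypothetical staging - real module/profile names differ
mvn verify -pl packages/org.eclipse.epp.package.php -am
sync
mvn verify -pl packages -am -P all-but-php   # hypothetical profile covering the remaining packages
sync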