Cache is not automatically cleaned
Description
The method to clean up the Bazel cache, introduced in !179 (merged), uses the flag --experimental_disk_cache_gc_max_size to keep the cache size under 130 GB. However, the cache has not been cleaned automatically and has grown until the disk quota was exceeded, causing the CI to fail. It is currently at about 165 GB (see: https://ci.eclipse.org/openpass/job/GtGenCore/job/MR-208/1/pipeline-console/?start-byte=0&selected-node=25#log-22).
15:48:03 73G /home/jenkins/cache/gtgen_core/_bazel_1002870000
15:48:03 18M /home/jenkins/cache/gtgen_core/ac
15:48:10 92G /home/jenkins/cache/gtgen_core/cas
15:48:10 114M /home/jenkins/cache/gtgen_core/downloads
15:48:10 0 /home/jenkins/cache/gtgen_core/tmp
The documentation for the disk cache states:
Bazel will automatically garbage collect the disk cache while idling between builds; the idle timer can be set with --experimental_disk_cache_gc_idle_delay (defaulting to 5 minutes).
So the cleanup only happens when the Bazel server is running and has been idle for 5 minutes. We can set this to a lower value, but I am not sure the Bazel server stays alive between CI runs, because it prints "Starting local Bazel server and connecting to it..." at the start of every run.
Setting idle time
I found that there is a --max_idle_secs option that controls how long the Bazel server stays alive when idle, which for us seems to be 15 seconds. Here it is explained that when the environment variable TEST_TMPDIR is used, the value is set to 15 seconds. So I think there are some options we could use, such as:
- Set --max_idle_secs=10800 (back to the default of 3 hours)
- Use a lower value of --experimental_disk_cache_gc_idle_delay
- Change the output_root as described here by doing either:
  - Use export XDG_CACHE_HOME=/home/jenkins/cache/gtgen_core/
  - Use bazel --output_base=/home/jenkins/cache/gtgen_core/ for all bazel commands (or --output_user_root?)
Not using TEST_TMPDIR might mean that we have to set --test_tmpdir=/home/jenkins/cache/gtgen_core/ for all bazel commands as well, but I am not sure.
I have a PR for this at !209 (closed), but it seems that the Bazel server is killed anyway because it runs in a container. We would have to set up another long-running container with a Bazel server whose only job is to clear the disk cache.
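To make the placement of these options concrete, here is a minimal sketch that only prints a command line and does not run Bazel. The function name print_bazel_build_cmd, the variables, the 1m idle delay and the two subdirectories are assumptions for illustration, not what the CI currently uses. The point it shows is that --max_idle_secs and --output_user_root are startup options and go before the command, while the disk cache flags go after it.

```shell
#!/bin/sh
# Sketch only: prints a Bazel command line instead of running it.
# All names and default values below are illustrative assumptions.
CACHE_DIR="${CACHE_DIR:-/home/jenkins/cache/gtgen_core}"
GC_MAX_SIZE="${GC_MAX_SIZE:-130G}"      # limit mentioned in the description
GC_IDLE_DELAY="${GC_IDLE_DELAY:-1m}"    # assumed value, below the 5m default

print_bazel_build_cmd() {
  # Startup options come before the command name (build),
  # command options come after it, followed by the caller's targets.
  echo "bazel" \
    "--max_idle_secs=10800" \
    "--output_user_root=${CACHE_DIR}/output_user_root" \
    "build" \
    "--disk_cache=${CACHE_DIR}/disk_cache" \
    "--experimental_disk_cache_gc_max_size=${GC_MAX_SIZE}" \
    "--experimental_disk_cache_gc_idle_delay=${GC_IDLE_DELAY}" \
    "$@"
}
```

Calling print_bazel_build_cmd //... shows the full command for review before it is wired into the pipeline. The server still has to survive between runs for the idle GC to trigger, which is exactly the problem described above.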
Standalone GC tool approach
The alternative mentioned in the docs is to use the standalone garbage collection tool, which lets us control manually when the GC happens. I cannot find an easy way to run this with the Bazel binary. I think it involves cloning the Bazel repository and building the tool from source. We would then have to store the tool on the CI server. I would recommend moving the CACHEDIR one directory down to /home/jenkins/cache/gtgen_core/bazel_cache/, and then storing the built gc tool binary somewhere like /home/jenkins/cache/gtgen_core/diskcache_gc/. It takes up about 500 MB of space including all the runfiles needed.
This approach would check whether the gc binary exists at the correct version and, if not, build it with the following general approach:
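# Shallow, blobless clone of the Bazel sources at the branch or tag named by bazel_version
git clone --depth=1 --branch "${bazel_version}" --filter=blob:none https://github.com/bazelbuild/bazel.git "${bazel_gc_build}"
cd "${bazel_gc_build}"
# Keep the temporary build state away from the shared CI cache
export TEST_TMPDIR="${temp_gc_cache}"
export BAZELISK_HOME="${temp_gc_cache}"
# --disk_cache and --compilation_mode are options of the build command
bazel build --disk_cache="${temp_gc_cache}" --compilation_mode=opt //src/tools/diskcache:gc
# Copy the binary together with its runfiles, resolving symlinks into real files
cp --recursive --dereference bazel-bin/src/tools/diskcache/. "${gc_tool_dir}"
rm -rf "${temp_gc_cache}"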
Then it can be used to clean the cache in the CI like so:
"${gc_tool_dir}"/gc --disk_cache "${CACHE_PATH}" --max_size "${MAX_CACHE_SIZE}"
I have a PR for this at !210 (closed).
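The "exists at the correct version" check could be sketched as below. This is an assumption about how to track the version, not what !210 does: it supposes that the build step writes the Bazel version into a marker file named .bazel_version next to the gc binary. Both the marker file and the function name gc_tool_is_current are hypothetical.

```shell
#!/bin/sh
# Sketch only: decides whether the stored gc tool can be reused.
# Assumes the build step wrote the Bazel version it was built from
# into "<tool_dir>/.bazel_version" (a hypothetical marker file).
gc_tool_is_current() {
  tool_dir="$1"
  wanted_version="$2"
  # The binary must exist and be executable.
  [ -x "${tool_dir}/gc" ] || return 1
  # The marker must exist and match the requested version exactly.
  [ -f "${tool_dir}/.bazel_version" ] || return 1
  [ "$(cat "${tool_dir}/.bazel_version")" = "${wanted_version}" ]
}
```

In the CI script this would guard the build steps, for example: if ! gc_tool_is_current "${gc_tool_dir}" "${bazel_version}"; then rebuild the tool and write the marker; fi.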
Other findings
Regardless of which method we use, I found that we have to be careful when the disk cache is in the same directory as the output base. The GC will try to clean all files, including the unpacked Bazel installation in /home/jenkins/cache/gtgen_core/_bazel_*/install. If some files are deleted from there, Bazel commands will fail until the install directory is deleted completely. After that, the Bazel installation is re-extracted the next time a Bazel command is run. So we should check for this at the start of the CI.
There are a lot of different locations where the build output is written. As I understand it, each MR creates a new output directory named after the md5sum of the working directory path, which is something like /home/jenkins/agent/workspace/GtGenCore_MR-xxx/repo/. These could either be cleaned up immediately once the MR is merged, or left to the GC. The GC should prioritize the oldest entries.
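To map a workspace to its output directory, the name can be reproduced roughly as follows. This is a sketch that assumes Bazel hashes the workspace path exactly as passed in, without a trailing slash or newline; output_base_name is a hypothetical helper and relies on md5sum from coreutils.

```shell
#!/bin/sh
# Sketch only: prints the MD5 hex digest of a path string, which should
# correspond to the output base directory name for that workspace.
output_base_name() {
  # printf avoids the trailing newline that echo would add to the input.
  printf '%s' "$1" | md5sum | cut -d ' ' -f 1
}
```

Running output_base_name "$(pwd)" inside a checked-out workspace and comparing the result with the directory names under _bazel_1002870000 would confirm or refute the assumption.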
On further investigation, leaving the output base to be cleaned by the GC can also cause issues if important files are cleaned from an output base that is still being used. It is probably best to separate the disk cache and output base, and do a manual cleanup job of the output base.
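A manual cleanup job could start from something like the sketch below, which only lists candidates and deletes nothing. It assumes that the modification time of an output base directory is a usable proxy for when it was last used, which may not hold; touching a marker at the start of each run would be more reliable. The function name list_stale_output_bases is hypothetical.

```shell
#!/bin/sh
# Sketch only: lists output base directories that look unused.
# Usage: list_stale_output_bases <output_user_root> <max_age_days>
list_stale_output_bases() {
  root="$1"
  max_age_days="$2"
  # Direct subdirectories not modified within the last max_age_days days.
  find "${root}" -mindepth 1 -maxdepth 1 -type d -mtime "+${max_age_days}" |
    while IFS= read -r dir; do
      name="${dir##*/}"
      # Output bases are named with 32 lowercase hex characters;
      # this skips other entries such as install.
      if [ "${#name}" -eq 32 ]; then
        case "${name}" in
          *[!0-9a-f]*) ;;
          *) printf '%s\n' "${dir}" ;;
        esac
      fi
    done
}
```

The output can be reviewed first and then piped into a removal loop. Some files in an output base may be read-only, so a chmod -R u+w before rm -rf might be needed.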
Example sizes of output bases:
15:49:06 3.9G /home/jenkins/cache/gtgen_core/_bazel_1002870000/2b3e2569aba6493268664049ed810a62
15:50:14 3.9G /home/jenkins/cache/gtgen_core/_bazel_1002870000/4b60b72c35184d033392db474a188059
15:52:50 5.7G /home/jenkins/cache/gtgen_core/_bazel_1002870000/53c576ef38f7163af512460de0707593
15:53:58 3.8G /home/jenkins/cache/gtgen_core/_bazel_1002870000/583404cc1755eed46b7112053616dc5e
15:55:50 5.6G /home/jenkins/cache/gtgen_core/_bazel_1002870000/699116b68ef84b3550ef7249c5f45f24
15:59:41 6.6G /home/jenkins/cache/gtgen_core/_bazel_1002870000/7557f90bf58c388e978416c203e1e537
16:01:03 4.1G /home/jenkins/cache/gtgen_core/_bazel_1002870000/83dbc1e4b16191a7c6d778170d7d5248
16:02:25 6.1G /home/jenkins/cache/gtgen_core/_bazel_1002870000/8d755fb6e770395eccbce00a66b7f73d
16:05:16 4.3G /home/jenkins/cache/gtgen_core/_bazel_1002870000/8d8e8ab1ce58152667bc288097031de9
16:05:20 185M /home/jenkins/cache/gtgen_core/_bazel_1002870000/a2f188e71722c3392d63e0058de21ecb
16:06:43 5.0G /home/jenkins/cache/gtgen_core/_bazel_1002870000/d83cc5304c7db492a5214149d8b2ba2e
16:07:39 3.8G /home/jenkins/cache/gtgen_core/_bazel_1002870000/d9d9415f51262d7347f704857bd90d3c
16:08:36 3.8G /home/jenkins/cache/gtgen_core/_bazel_1002870000/e4bccd05f3bb4e2461466b7397a323da
16:10:57 6.7G /home/jenkins/cache/gtgen_core/_bazel_1002870000/eadd47911cca8f8a65d7df3908c8824e
16:12:34 4.6G /home/jenkins/cache/gtgen_core/_bazel_1002870000/ff8952601e7a37ca40a9b7244d252e54
16:13:55 3.8G /home/jenkins/cache/gtgen_core/_bazel_1002870000/ffb28ad5ef7ef5b664efe231772988f3