Before completing the calculation of namespace, pod, and concurrency limits, we need to define exactly how the JIRO and GRAC infrastructure processes relate and how the number of resource packs contributes to each parameter.
TODO: prepare the spreadsheet.
As discussed some time ago, I have experimented with a spreadsheet-based, more refined approach to pod template typologies for builds.
The idea is to increase the number of parallel builds while using the same resources as proposed by the JIRO calculation. This also allows us to offer an on-demand service based on the job's needs. Therefore, a pipeline that requires more resources can run on a runner with more resources, while micro-tasks that require minimal resources can use a runner with very little CPU/RAM.
It is important to consider this microservice aspect in the execution of GitLab CI pipelines, where the aim is to have dedicated jobs for specific tasks and thus to have a large number of jobs running in parallel to execute a pipeline.
As a consequence, I propose three types of build containers with the following specifications:
|           | Basic  | Advance | Expert |
|-----------|--------|---------|--------|
| cpu req   | 250m   | 1000m   | 2000m  |
| cpu limit | 500m   | 2000m   | 4000m  |
| mem       | 1024Mi | 4096Mi  | 8192Mi |
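For illustration, this is roughly how the "Basic" figures above could translate into a container `resources` block on the Kubernetes side. This is a minimal sketch only: the pod and image names are placeholders, and I am assuming that "mem" applies to both request and limit.

```yaml
# Sketch only: "Basic" figures from the table above as container resources.
# Pod name and image are placeholders, not the real pod template.
apiVersion: v1
kind: Pod
metadata:
  name: basic-build-example
spec:
  containers:
    - name: build
      image: alpine:latest          # placeholder image
      resources:
        requests:
          cpu: 250m
          memory: 1024Mi            # assuming "mem" is the request
        limits:
          cpu: 500m
          memory: 1024Mi            # assuming "mem" is also the limit
```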
I have established a distribution of concurrency based on the resource pack specifications as follows:
| # Resource packs   | 1 | 2 | 3  | 4  | 5  | 10 |
|--------------------|---|---|----|----|----|----|
| Concurrent Basic   | 3 | 5 | 7  | 9  | 11 | 21 |
| Concurrent Advance | 1 | 2 | 3  | 4  | 5  | 10 |
| Concurrent Expert  | 0 | 0 | 1  | 1  | 2  | 5  |
| Max concurrency    | 4 | 7 | 11 | 14 | 18 | 36 |
NOTE:
- 2 Basic per resource pack, starting from 3
- 1 Advance per resource pack
- 1 Expert every 2 resource packs
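Reading the note against the table, a worked example: with 5 resource packs, Basic = 3 + 2 * (5 - 1) = 11, Advance = 5, Expert = 2, so max concurrency = 11 + 5 + 2 = 18, which matches the table above.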
The calculation method is also different from the Jiro method. I did not adopt the top-down approach with restrictions coming from the resource quotas definition. Instead, I used a classic bottom-up iterative approach, starting with requirements, balancing between resources and concurrency to achieve satisfactory results. The other approach does not allow mixing different types of pod templates without potentially breaking a build.
For details, see the spreadsheet: Grac resource pack calculator.
It also includes, for comparison, a GRAC calculation based on the JIRO method, as well as the JIRO calculation itself.
In terms of the microservices aspect, I will rely on your judgment and experience with the everyday requirements of GitLab CI builds.
I'm a bit concerned about the following:
We have a few projects that require more than 4 GB of RAM for a build. In JIRO, we can create custom pod templates (or the projects can do it themselves) to allow a build to use 6 GB of RAM (for example). Do we have the same flexibility with your proposed solution, or would the project need to find a sponsor for a resource pack to get an "Expert runner"?
While projects gain some flexibility with a higher concurrency of smaller runners, they cannot run the same number of "Advance runners" as in JIRO (2x 2CPUs, 4GB RAM). This could be a potential dealbreaker for projects.
My biggest concern is how projects deal with all the options that we are giving them. Is this easy enough to understand and adjust by a non-releng-expert? How would they know when to use a "Basic runner" and when to use an "Advance Runner"? Trial and error? Are there common scenarios like "a Hugo build will always be fine with a 'Basic Runner'"?
Maybe it would be easier to have fewer options (e.g. no basic runner) and a bit more overhead, but less trial and error with the runner size for the projects?
@heurtemattes can you create a spreadsheet with our current capacity and utilization of resource packs with JIRO and GRAC (imagine every Eclipse project on GitLab had 1 resource pack)?
> We have a few projects that require more than 4 GB of RAM for a build. In JIRO, we can create custom pod templates (or the projects can do it themselves) to allow a build to use 6 GB of RAM (for example). Do we have the same flexibility with your proposed solution, or would the project need to find a sponsor for a resource pack to get an "Expert runner"?
This possibility has been considered in the approach. The pod templates determine the total quota per namespace, and within this quota you can define as many additional pod templates as needed.
The trade-off, and it's the same for Jenkins, is that if all the pods consume more than expected, Kubernetes will trigger an OOM kill. It's up to the maintainer to be careful.
Requests for custom pods must be submitted through a helpdesk ticket or a merge request.
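As a rough sketch of what that envelope could look like for 1 resource pack, assuming the namespace quota is simply the sum of the default pod templates at full concurrency (3x Basic + 1x Advance, figures from the tables above; the object name is a placeholder):

```yaml
# Sketch only: namespace quota for 1 resource pack, assuming it is the sum
# of the default pod templates at full concurrency.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: grac-quota-example       # placeholder name
spec:
  hard:
    requests.cpu: "1750m"        # 3 x 250m   + 1 x 1000m
    requests.memory: 7168Mi      # 3 x 1024Mi + 1 x 4096Mi
    limits.cpu: "3500m"          # 3 x 500m   + 1 x 2000m
    limits.memory: 7168Mi        # assuming "mem" is both request and limit
```

Any custom pod template (e.g. a 6 GB build) can then be defined freely, as long as the pods actually running stay within this envelope.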
> While projects gain some flexibility with a higher concurrency of smaller runners, they cannot run the same number of "Advance runners" as in JIRO (2x 2CPUs, 4GB RAM). This could be a potential dealbreaker for projects.
Running multiple jobs in parallel within a pipeline allows it to finish even faster than a sequential pipeline, especially since resource allocations are often set higher than required, particularly for micro-tasks (e.g., ECA check, DCO check, some analysis tools, ...).
A pipeline that ends quickly provides developers with faster feedback and aligns with the concept of a 'fail-fast' pipeline.
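A minimal sketch of such a fail-fast layout (job names and scripts are hypothetical placeholders, not an actual project configuration): the micro-tasks fan out in parallel in an early stage, and the heavy build only runs once they pass.

```yaml
# Sketch of a fail-fast pipeline: micro-tasks fan out in parallel in an early
# stage; the heavy build only runs once they pass.
stages:
  - checks
  - build

eca-check:
  stage: checks
  script:
    - ./scripts/eca-check.sh     # placeholder micro-task

dco-check:
  stage: checks
  script:
    - ./scripts/dco-check.sh     # placeholder micro-task

full-build:
  stage: build
  script:
    - mvn -B verify              # placeholder heavy job
```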
> My biggest concern is how projects deal with all the options that we are giving them. Is this easy enough to understand and adjust by a non-releng-expert? How would they know when to use a "Basic runner" and when to use an "Advance Runner"? Trial and error? Are there common scenarios like "a Hugo build will always be fine with a 'Basic Runner'"?
What do you mean by 'all the options'? Does it refer to the 3 pod templates instead of just one? Keep in mind that, as you pointed out, projects in JIRO can already configure their own pod templates.
By default, runners can pick up any job. If a job requires more resources than the others, the project maintainer will need to configure the job with the tag of the runner that has more resources. It's not automatic, but it's not very complicated to explain and implement: you just need to add a tag to a job definition in the .gitlab-ci.yml file.
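For example, a minimal sketch (the job name and tag value are hypothetical placeholders, not the actual runner tags):

```yaml
# Minimal sketch: routing a resource-hungry job to the larger runner via a tag.
integration-tests:
  tags:
    - expert        # tag of the runner / pod template with more resources
  script:
    - mvn -B verify -Pintegration
```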
> Maybe it would be easier to have fewer options (e.g. no basic runner) and a bit more overhead, but less trial and error with the runner size for the projects?
More overhead means less concurrency if we want to keep CPU/RAM consumption equal to JIRO's, and therefore less interest in GitLab CI.
> Can you create a spreadsheet with our current capacity and utilization of resource packs with JIRO and GRAC (imagine every Eclipse project on GitLab had 1 resource pack)?
Good approach for increasing concurrent builds. This has always been a problem with CI tasks: people wait for jobs to finish ASAP, ask what's wrong, and ask how to speed up pipelines.
Specifically for GitLab, parallelism is much easier to achieve, and it tends to be higher and more frequently used than in both Jenkins and GitHub Actions. This is an observation I made only recently.
On the other hand, I'm not sure if basic, advance, expert are good names. From what I've been seeing, tasks are usually dispatched based on tags like small, medium, large, more like the naming of AWS instance sizes. Obviously, information about how many resources each runner has must be published so that pipeline creators can serve themselves.
Also, it is important to show resource consumption. In Jenkins we have logs. I would propose activating Grafana and showing both how many builds are running and how many are waiting in queues. I have done that several times and I can implement it here. Big projects have many pipelines, many jobs, and merge trains; they always want to know when jobs will finish and want to tune parallelism.
Also, this will give clearer visibility into how fast the EF GitLab is running in comparison to GitLab's own gitlab.com. We need to be prepared for tuning.
I just want to emphasise, though, how important it is to have clear visibility into how many resources are available to a project, based on the resource packs available and required. Too much complication and too many different resource options may create more helpdesk issues.
Also, there are other parameters we will want to really consider:
- Build cache
- Artifacts size
I don't see calculations for these here (?). This is addressed in jsonnet, though.
As an example, Oniro has several TB of cache and needed it to be fast; NVMe was used.
Also, it's worth noting that creating a build environment usually means large transfers to populate caches and artifacts. This is also essential for job speed and resource consumption.
But I propose we work on Sebastien's spreadsheet, as my approach was strictly JIRO-like; we need to focus on providing standardised runners in different sizes, which my approach was not doing.
> On the other hand, I'm not sure if basic, advance, expert are good names. From what I've been seeing, tasks are usually dispatched based on tags like small, medium, large, more like the naming of AWS instance sizes. Obviously, information about how many resources each runner has must be published so that pipeline creators can serve themselves.
I have changed the names to small, medium, and large; this aligns better with the cloud naming convention.
> Also, it is important to show resource consumption. In Jenkins we have logs. I would propose activating Grafana and showing both how many builds are running and how many are waiting in queues.
Next step ;-) Service metrics are not yet fully implemented and exposed.
> Also, this will give clearer visibility into how fast the EF GitLab is running in comparison to GitLab's own gitlab.com. We need to be prepared for tuning.
I'm sure gitlab.com runners are faster in every way than the EF infra, except for edge cases, e.g. if a project needs 32 CPUs and 64 GB of RAM.
gitlab.com runners also have the limitation of CI/CD minutes, which may not be suitable for project activities.
> Build cache
Out of scope, since NFS can't.
> Artifacts size
Good point! There is no limitation in JIRO on artifact size, but the impact on GitLab is not negligible.
Artifacts can be uploaded to object storage; we could address this point as well once Ceph is operational.
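In the meantime, projects can already keep their artifact footprint bounded with an expiry; a minimal sketch (the job name and paths are placeholders):

```yaml
# Sketch: bounding the artifact footprint with an expiry until object storage
# (Ceph) is available.
package:
  script:
    - mvn -B package
  artifacts:
    paths:
      - target/*.jar
    expire_in: 1 week      # expired artifacts are cleaned up automatically
```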
@fgurr @pstankie I will start to merge these 2 MRs this week unless you have any comments.
As discussed, this implementation is different from the JIRO project and therefore must be tested first with guinea-pig projects.
We need more community feedback on this approach, which favours concurrency over capacity.
We also need to validate that the resource quota and limit range definitions in Kubernetes are well defined.