Most of the services we care about are defined at the level of the common lib, so we should move this there and explore health checks in those services first.
I've started looking into this. All of our APIs should already have this baked in at /q/health, which Quarkus serves through its SmallRye Health extension. By default, this check only reports whether the application is up, but we can add more checks by registering readiness health checks. At the very least, I'm thinking we can add a readiness check for database connections (a basic ping to make sure the DB connection is alive). We could also add checks for external APIs so that we can report on general API readiness. If all of our APIs serve this endpoint, we could build a web of health checks without much trouble.
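To make the database ping idea concrete, here is a minimal sketch in plain JDK Java. The class name `DbReadinessCheck`, the `Status` enum, and the `fakeConnection` helper are hypothetical and exist only for illustration. In a Quarkus service the same logic would presumably live in a bean annotated with `@Readiness` that implements the MicroProfile Health `HealthCheck` interface and gets its connection from the configured datasource.

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.util.concurrent.Callable;

/** Hypothetical readiness check: UP only if a connection can be obtained and validated. */
final class DbReadinessCheck {
    enum Status { UP, DOWN }

    // Stands in for a DataSource; assumed to hand out a fresh or pooled connection.
    private final Callable<Connection> connections;
    private final int timeoutSeconds;

    DbReadinessCheck(Callable<Connection> connections, int timeoutSeconds) {
        this.connections = connections;
        this.timeoutSeconds = timeoutSeconds;
    }

    Status check() {
        // Connection.isValid performs a driver-level liveness check within the timeout.
        try (Connection conn = connections.call()) {
            return conn.isValid(timeoutSeconds) ? Status.UP : Status.DOWN;
        } catch (Exception e) {
            // Any failure to obtain or validate the connection counts as not ready.
            return Status.DOWN;
        }
    }

    /** Illustration helper: a fake Connection whose isValid() returns a fixed answer. */
    static Connection fakeConnection(boolean valid) {
        return (Connection) Proxy.newProxyInstance(
            Connection.class.getClassLoader(),
            new Class<?>[] { Connection.class },
            (proxy, method, args) -> method.getName().equals("isValid") ? (Object) valid : null);
    }

    public static void main(String[] args) {
        System.out.println(new DbReadinessCheck(() -> fakeConnection(true), 2).check());
        System.out.println(new DbReadinessCheck(() -> fakeConnection(false), 2).check());
    }
}
```

Treating every exception as DOWN keeps the check conservative: a pod whose database is unreachable would stop receiving traffic rather than fail on each request.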
Quarkus supports readiness and liveness checks, so we should be able to set up all of the checks. Do we want to start encoding this into the configuration files of the APIs, or do we want to do something through the UI for this? I think the config is the way to go, as described in the OKD application monitoring page, but sometimes I get surprised.
Yes, please add those probes to the configuration files. The most important probe is liveness, then readiness, and finally startup (see details).
Liveness: restarts the pod / container after a failure.
Readiness: prevents traffic from being routed to the pod while this probe fails.
Startup: liveness and readiness probing does not start until this probe passes. This is important to avoid having to encode a large initialDelaySeconds in the readiness and liveness probes when applications take a long time to start.
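The way the three probes interact can be summarized with a small toy model. This is only a sketch of the semantics described above, not Kubernetes or OKD code; the class name `ProbeModel` and the `Action` names are hypothetical. The `failureThreshold` and `periodSeconds` parameters mirror the probe fields of the same names, and the real kubelet also applies thresholds to liveness and readiness, which this model ignores.

```java
/** Toy model of the probe semantics; a simplification, not Kubernetes code. */
final class ProbeModel {
    enum Action { WAIT_FOR_STARTUP, RESTART_CONTAINER, STOP_ROUTING_TRAFFIC, SERVE_TRAFFIC }

    /** What happens to a container given the latest result of each probe. */
    static Action decide(boolean startupPassed, boolean live, boolean ready) {
        if (!startupPassed) return Action.WAIT_FOR_STARTUP;   // other probes are not run yet
        if (!live) return Action.RESTART_CONTAINER;           // liveness failure restarts
        if (!ready) return Action.STOP_ROUTING_TRAFFIC;       // readiness failure removes traffic
        return Action.SERVE_TRAFFIC;
    }

    /** Longest startup a startup probe tolerates before the container is restarted. */
    static int startupBudgetSeconds(int failureThreshold, int periodSeconds) {
        return failureThreshold * periodSeconds;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, false, false));
        System.out.println(decide(true, true, false));
        System.out.println(startupBudgetSeconds(30, 10) + "s startup budget");
    }
}
```

With a failure threshold of 30 and a period of 10 seconds, for instance, a slow application gets up to 300 seconds to start without inflating the delays on the other two probes.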
Read the full doc above, it explains all of this very well.
I think the best way to experiment and test will be to create an endpoint with a toggle, so that I can flip its state at will and force certain events to test the binding before we start rolling it out. I'll use kubectl port-forward so that we don't need to host it live for me to poke at it and flip the states.
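A toggleable test endpoint along these lines could be sketched with the JDK's built-in HTTP server. Everything here is an assumption made for illustration: the class name `ToggleHealthServer`, the `/health` and `/toggle` paths, and port 8080. The real test service would more likely be a small Quarkus app exposing its state under /q/health.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical test service: /health reports the current state, /toggle flips it. */
final class ToggleHealthServer {
    private final AtomicBoolean healthy = new AtomicBoolean(true);
    private final HttpServer server;

    ToggleHealthServer(int port) throws IOException {
        // Bind to all interfaces so probes hitting the pod IP can reach it too.
        server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/health", ex -> {
            boolean up = healthy.get();
            respond(ex, up ? 200 : 503, up ? "UP" : "DOWN");
        });
        server.createContext("/toggle", ex -> {
            boolean prev;
            do {
                prev = healthy.get();
            } while (!healthy.compareAndSet(prev, !prev)); // atomic flip
            respond(ex, 200, !prev ? "now UP" : "now DOWN");
        });
    }

    void start() { server.start(); }

    void stop() { server.stop(0); }

    /** Actual bound port; useful when the server was created with port 0. */
    int port() { return server.getAddress().getPort(); }

    private static void respond(HttpExchange ex, int status, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        ex.sendResponseHeaders(status, bytes.length);
        try (OutputStream out = ex.getResponseBody()) {
            out.write(bytes);
        }
    }

    public static void main(String[] args) throws IOException {
        new ToggleHealthServer(8080).start(); // port 8080 is an arbitrary choice
    }
}
```

Once deployed, something like `kubectl port-forward pod/<pod-name> 8080:8080` followed by a request to `localhost:8080/toggle` would flip the state, after which the probes pointed at `/health` should start failing and the resulting restart or traffic removal can be observed.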
I'm thinking I'll be starting on this sometime this week since other stuff is moving along well. Let me know if you have any issues related to how I plan to test this. Otherwise, I'll let you know how it goes!
Some work has been done to start exposing these endpoints in Quarkus, though there are some concerns with potential collision based on how the non-application endpoints are mounted. Examples of this are live for the Working Groups API with the following URL: https://api.eclipse.org/working-groups/q/health
The root of the application https://api.eclipse.org/working-groups is used for requests already, with an additional endpoint available to retrieve singular working groups like https://api.eclipse.org/working-groups/jakarta-ee. While there is no real danger of collisions in this API, it is something to be aware of. We are looking for help/guidance on the best way to mount and expose health endpoints from Quarkus through OKD/nginx.
This has been added to every repo and just needs to be deployed. Not all health checks are publicly available, but they have been configured internally so that OKD, at least, can use them for more sustainable uptime. I'm closing this issue as it has been functionally resolved.