Replies: 3 comments
-
The proposal suggests prolonging the time to finish a job successfully to 2+ hours [*]. In the case of a similar network hiccup, prolonging the job length would make the problem even more pronounced, since the chance of getting the job finished would decrease. That would require a restart, leading to even bigger job congestion. This seems to increase the problem while at the same time decreasing developer comfort, which under normal circumstances means obtaining full results in under 1 hour.

However, the core of this issue - too big a number of required runners per PR - is still there, and we should rather question the need to always run all the combinations. There are already certain jobs that only run if they could be affected and are otherwise quickly terminated, e.g. the clustering or cross-DC tests. The same could be applied to map-hot-rod or operator, which might run only if some of the classes within their respective module have changed, as long as this gives a sufficient stability guarantee (see the sketch below). WDYT @mhajas, @martin-kanis, @vmuzikar? In a matter of months I would also question the need to run the tests on WildFly, given that the main distro is moving towards Quarkus.

[*] According to the previous successful builds, preparation before running each test group takes 2-3 minutes; most of the time is spent in the tests. It seems that this proposal expects to save 8x 2-3 minutes of total running time at the cost of getting the results later: for example, in the case of WildFly from this run, the results would be available in approx. 2 + 43 + 46 + 41 = 132 minutes rather than max(2+43, 2+46, 2+41) = 48 minutes.
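A minimal sketch of what such a path-based condition could look like, using the dorny/paths-filter action; the job names, module path and Maven invocation below are illustrative assumptions, not the actual workflow definitions:

```yaml
# Sketch only: job names, the 'operator/**' path and the Maven command
# are placeholders, not the real Keycloak CI configuration.
jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      operator: ${{ steps.filter.outputs.operator }}
    steps:
      - uses: actions/checkout@v3
      - id: filter
        uses: dorny/paths-filter@v2
        with:
          filters: |
            operator:
              - 'operator/**'

  operator-tests:
    needs: changes
    # run the operator tests only when something under the operator module changed
    if: needs.changes.outputs.operator == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: mvn -B verify -pl operator
```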
-
Agree, let's try to look for a solution! For some modules we can introduce path inclusions/exclusions (a simple variant is sketched below), but we should make sure that those jobs still run if a transitive dependency gets touched (I can take care of the operator). Any other alternative off the top of my head would require some budget allocation.
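For the simple cases, a workflow-level paths filter could be enough; the caveat is that every transitive dependency has to be listed explicitly. The module paths below are hypothetical:

```yaml
# Illustrative only: the paths are placeholders, and keeping the list of
# dependent modules in sync by hand is the main cost of this approach.
name: Operator CI
on:
  pull_request:
    paths:
      - 'operator/**'
      # modules the operator (transitively) depends on
      - 'core/**'
      - 'server-spi/**'
```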
-
There are a few things we can look at:

- Reduce the number of jobs running - making sure we only have one concurrent job for a branch (PR, main) running/scheduled at a time (see the sketch below).
- Reduce the number of tests running for a matrix - for example, right now we are running the full testsuite for every combination.
- Reduce the size of the matrix - getting rid of WildFly will help quite a bit here. If we can also replace Undertow (KeycloakServer) with Quarkus, that would further help.
- Be smart about what jobs execute for a PR - using path inclusions/exclusions would probably be a nightmare; it would be better to have something more central/shared.
- Make tests more stable - fairly regularly there's a failure in the tests, which causes having to re-run the whole job. From my observation, quite a lot of PRs fail due to unstable tests, guessing something like 10-20%.
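A minimal sketch of the per-branch concurrency idea; the group key is an assumption, not the workflow's current setting:

```yaml
# Cancel any in-progress run of the same workflow for the same branch/PR,
# so only one run per ref consumes runners at a time.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```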
-
Description
Yesterday we observed enormous delays and queuing of GH Action pipelines.
The underlying issue was probably network hiccups; still, it took more than 6 hours to recover (timing out most of the queued jobs).
The current setup is optimized for CI speed but is clogging the available concurrent runners.
in the "testing phase", after the build of the distribution, we count 7 independent jobs + 12 matrix jobs, given that the available runners on a Free plan are 20 ( ref ) it means that this project CI is (on average) serving 1/2 PRs concurrently, queuing all the rest of the jobs.
I think that an acceptable compromise is to slow the CI down by removing one axis from the matrix build but allowing more PR CIs to progress concurrently.
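For illustration, this is roughly what removing one axis means in workflow terms; the axis names, values and the test-group flag are made up for the example, not taken from the actual workflow:

```yaml
# Today each (server, test-group) combination is its own matrix job,
# e.g. 2 servers x 4 test groups = 8 runners per PR. Collapsing the
# test-group axis keeps only the server axis (2 runners per PR) and runs
# the groups sequentially, so one run takes roughly the sum of the group
# durations instead of the longest one.
jobs:
  tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        server: [quarkus, wildfly]   # the remaining axis
    steps:
      - uses: actions/checkout@v3
      # test groups run sequentially inside one job instead of being a
      # second matrix axis; -Dtest-group is a placeholder flag
      - run: mvn -B verify -Dtest-group=base
      - run: mvn -B verify -Dtest-group=adapter
      - run: mvn -B verify -Dtest-group=clustering
      - run: mvn -B verify -Dtest-group=crossdc
```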
Discussion
No response
Motivation
No response
Details
No response