Replies: 3 comments
-
The proposal suggests prolonging the time to finish a job successfully to 2+ hours [*]. In the case of a similar network hiccup, prolonging the job length would make the problem even more pronounced, since the chance of getting the job finished would decrease. That would require a restart, leading to even bigger job congestion. This seems to increase the problem while at the same time decreasing developer comfort, which under normal circumstances means obtaining full results in under 1 hour.

However, the core of this issue - too big a number of required runners per PR - is still there, and we should rather question the need to always run all the combinations. There are already certain jobs that only run if they could be affected and are otherwise quickly terminated, e.g. the clustering or cross-DC tests. The same could be applied to map-hot-rod or operator, which might run only if some of the classes within their respective module have changed, as long as this gives a sufficient stability guarantee (see the sketch below). WDYT @mhajas, @martin-kanis, @vmuzikar? In a matter of months I would also question the need to run the tests on WildFly, given that the main distro is moving towards Quarkus.

[*] According to the previous successful builds, preparation before running each test group takes 2-3 minutes; most of the time is spent in the tests. It seems that this proposal expects to save 8x 2-3 minutes of total running time at the cost of getting the results later: for example, in the case of WildFly from this run, the results would be available in approx. 2 + 43 + 46 + 41 = 132 minutes rather than max(2+43, 2+46, 2+41) = 48 minutes.
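A minimal sketch of what such a path-based condition could look like, using the dorny/paths-filter action; the job names, module path and Maven invocation below are illustrative assumptions, not the actual workflow definitions:

```yaml
# Sketch only: job names, the 'operator/**' path and the Maven command
# are placeholders, not the real Keycloak CI configuration.
jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      operator: ${{ steps.filter.outputs.operator }}
    steps:
      - uses: actions/checkout@v3
      - id: filter
        uses: dorny/paths-filter@v2
        with:
          filters: |
            operator:
              - 'operator/**'

  operator-tests:
    needs: changes
    # run the operator tests only when something under the operator module changed
    if: needs.changes.outputs.operator == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: mvn -B verify -pl operator
```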
-
Agree, let's try to look for a solution! For some modules we can introduce path inclusions/exclusions (a simple variant is sketched below), but we should make sure that those jobs still run if a transitive dependency gets touched (I can take care of the operator). Any other alternative off the top of my head would require some budget allocation.
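For the simple cases, a workflow-level paths filter could be enough; the caveat is that every transitive dependency has to be listed explicitly. The module paths below are hypothetical:

```yaml
# Illustrative only: the paths are placeholders, and keeping the list of
# dependent modules in sync by hand is the main cost of this approach.
name: Operator CI
on:
  pull_request:
    paths:
      - 'operator/**'
      # modules the operator (transitively) depends on
      - 'core/**'
      - 'server-spi/**'
```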
-
There are a few things we can look at:

- Reduce the number of jobs running - making sure we only have one concurrent job for a branch (PR, main) running/scheduled at a time (see the sketch below).
- Reduce the number of tests running for a matrix - for example, right now we are running the full testsuite for every combination.
- Reduce the size of the matrix - getting rid of WildFly will help quite a bit here. If we can also replace Undertow (KeycloakServer) with Quarkus, that would further help.
- Be smart about what jobs execute for a PR - using path inclusions/exclusions would probably be a nightmare; it would be better to have something more central/shared.
- Make tests more stable - fairly regularly there's a failure in the tests, which causes having to re-run the whole job. From my observation, quite a lot of PRs fail due to unstable tests, guessing something like 10-20%.
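A minimal sketch of the per-branch concurrency idea; the group key is an assumption, not the workflow's current setting:

```yaml
# Cancel any in-progress run of the same workflow for the same branch/PR,
# so only one run per ref consumes runners at a time.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```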
-
Description
Yesterday we observed enormous delays and queuing of GH Action pipelines.
The underlying issue was probably network hiccups; still, it took more than 6 hours to recover (timing out most of the queued jobs).
The current setup is optimized for CI speed but is clogging the available concurrent runners.
in the "testing phase", after the build of the distribution, we count 7 independent jobs + 12 matrix jobs, given that the available runners on a Free plan are 20 ( ref ) it means that this project CI is (on average) serving 1/2 PRs concurrently, queuing all the rest of the jobs.
I think that an acceptable compromise is to slow the CI down by removing one axis from the matrix build but allowing more PR CIs to progress concurrently.
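For illustration, this is roughly what removing one axis means in workflow terms; the axis names, values and the test-group flag are made up for the example, not taken from the actual workflow:

```yaml
# Today each (server, test-group) combination is its own matrix job,
# e.g. 2 servers x 4 test groups = 8 runners per PR. Collapsing the
# test-group axis keeps only the server axis (2 runners per PR) and runs
# the groups sequentially, so one run takes roughly the sum of the group
# durations instead of the longest one.
jobs:
  tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        server: [quarkus, wildfly]   # the remaining axis
    steps:
      - uses: actions/checkout@v3
      # test groups run sequentially inside one job instead of being a
      # second matrix axis; -Dtest-group is a placeholder flag
      - run: mvn -B verify -Dtest-group=base
      - run: mvn -B verify -Dtest-group=adapter
      - run: mvn -B verify -Dtest-group=clustering
      - run: mvn -B verify -Dtest-group=crossdc
```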
Discussion
No response
Motivation
No response
Details
No response