
Block shutdown if the Infinispan cluster is not stable#48341

Open
pruivo wants to merge 1 commit into keycloak:main from pruivo:t_44620_block_shutdown

Conversation

@pruivo (Member) commented Apr 21, 2026

If a rebalance is in progress, block the shutdown procedure until it finishes or a timeout is reached.

Closes #44620
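Conceptually, the blocking behavior described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `awaitStableCluster` and the `BooleanSupplier` are hypothetical stand-ins for Infinispan's topology state.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the described behavior (not the PR's actual code):
// poll a "rebalance in progress" flag until the cluster is stable or the
// timeout elapses. The supplier stands in for Infinispan's topology state.
class BlockingShutdown {

    static boolean awaitStableCluster(BooleanSupplier rebalanceInProgress, Duration timeout)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (rebalanceInProgress.getAsBoolean()) {
            if (Instant.now().isAfter(deadline)) {
                return false; // timed out: proceed with shutdown anyway
            }
            Thread.sleep(100); // back off before re-checking the topology
        }
        return true; // cluster stable, safe to continue the shutdown
    }
}
```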

```java
@Override
public void close() {
    logger.debug("Closing provider");
    shutdownManager.onShutdown();
}
```
Contributor

Seems like blocking the other close calls could eventually be an issue as we're serializing the orderly shutdown.

Should we think about allowing a close method that returns a future, or a higher-level way to register a shutdown hook? The latter relates to #42670, which proposed adding post-start and pre-stop events; that proposal could be expanded so that shutdown hooks can be registered on the pre-stop event. cc @thomasdarimont

Member Author

> Seems like blocking the other close calls could eventually be an issue as we're serializing the orderly shutdown.

We require this behavior. As a concrete example, the JPA layer must be closed after Infinispan/JGroups because we have to clean up some tables from stale data. I can't see how the future or the shutdown hook would help with this case; at the end of the day, order matters.

Note that this PR is a workaround (and not a good one) until the feature is implemented in Infinispan; CacheManager.stop() will then do the blocking that I'm doing manually in this PR. See infinispan/infinispan#17016

Contributor

@shawkins commented Apr 22, 2026

> I can't see how the future or the shutdown hook will help with this case - at the end of the day, order matters.

Sure, and ordering can easily be preserved with a CompletableFuture. You just need to compose them in the desired order, then wait for everything to complete. I can provide an example of what that could look like if you want.

This also relates to something @ahus1 was looking to do for startup parallelization, which used virtual threads for init - the same principle could be applied to close.
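For illustration, the ordered composition described here could look roughly like the sketch below. The factory names and methods are hypothetical, not Keycloak's actual SPI; the point is that `thenCompose` preserves the required order while a single bounded wait caps the whole shutdown.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: each close returns a CompletableFuture; thenCompose
// preserves the required order (Infinispan before JPA), and a single
// bounded get() caps the entire shutdown sequence.
class OrderedAsyncClose {

    static final List<String> closed = new CopyOnWriteArrayList<>();

    static CompletableFuture<Void> closeInfinispan() {
        return CompletableFuture.runAsync(() -> closed.add("infinispan"));
    }

    static CompletableFuture<Void> closeJpa() {
        return CompletableFuture.runAsync(() -> closed.add("jpa"));
    }

    static void shutdown() throws Exception {
        // Compose in the desired order, then wait once for the whole chain.
        closeInfinispan()
                .thenCompose(ignored -> closeJpa())
                .get(30, TimeUnit.SECONDS);
    }
}
```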

Member Author

Please check whether the shutdown time is long enough to justify the work of parallelizing everything. It only takes 200 ms on my machine.

Contributor

> Please check whether the shutdown time is long enough to justify the work of parallelizing everything. It only takes 200 ms on my machine.

If you are certain that it won't block for long, then there's no need to do additional work.

Member Author

No, it can block for up to --shutdown-timeout at most.

The point I want to make is that the parallel approach may only be a couple hundred milliseconds faster than the serial one.

If you have a PoC, we can measure both times to see what the gain is and if it is significant.

Contributor

What I'm roughly describing is like: https://github.com/keycloak/keycloak/compare/main...shawkins:asyncClose?expand=1

The expectation is that factories with long-running close methods would override the new method and return a CompletionStage (it could just as easily be a CompletableFuture).

Alternatively, if we don't want to expand the API directly this way, it could be done as a separate interface. Or, if we assume it is unlikely that factories will provide a future themselves, there could be an annotation like AsyncClose so that the provider-close logic would delegate the task to a thread pool to create the future.
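A rough sketch of the separate-interface variant, under the assumption that the interface and method names here are purely illustrative and not part of Keycloak's actual SPI:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

// Hypothetical sketch, not actual Keycloak API: a default method keeps
// existing blocking factories source-compatible, while factories with
// long-running close logic override closeAsync() to return a real stage.
interface AsyncCloseableFactory {

    void close();

    // By default, run the blocking close and return an already-completed stage.
    default CompletionStage<Void> closeAsync() {
        close();
        return CompletableFuture.completedFuture(null);
    }
}
```

An existing factory implementing this interface needs no changes; only factories with genuinely long-running close logic would override `closeAsync()`.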

Member

My discussion on startup time was driven by a customer need to see a fast startup after a failover. The customer chose "4 seconds" as a goal here.

I don't see a similar need for the shutdown. In addition to that, I consider this out of scope for the problem this PR is planning to solve.

Contributor

> My discussion on startup time was driven by a customer need to see a fast startup after a failover. The customer chose "4 seconds" as a goal here.

Yes, I understand that it has a different rationale, but there is a symmetry between long-running blocking operations in our start and shutdown sequences. And I'm not sure whether we're trying to account, there or here, for the possibility of unknown init/close times from user-deployed factories, not just our built-in ones.

> I don't see a similar need for the shutdown. In addition to that, I consider this out of scope for the problem this PR is planning to solve.

The problem it is trying to solve is highly related to the concern you are bringing up in #48341 (review): if we have serial closure of factories, a single long-running closure could consume the entire timeout and prevent or delay the closure of other factories.

If at this point in time there is a well-defined ordering of long-running factory closures, then I agree parallelizing is out of scope.


@keycloak-github-bot (Bot) left a comment

Unreported flaky test detected, please review


Unreported flaky test detected

If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.

org.keycloak.testsuite.model.singleUseObject.SingleUseObjectModelTest#testCluster

Keycloak CI - Store Model Tests

java.lang.AssertionError: 
threads didn't terminate in time: [main (RUNNABLE):
	at [email protected]/sun.management.ThreadImpl.dumpThreads0(Native Method)
	at [email protected]/sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:505)
	at [email protected]/sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:493)
...


org.keycloak.testsuite.model.user.UserModelTest#testAddRemoveUserConcurrent

Keycloak CI - Store Model Tests

org.opentest4j.MultipleFailuresError: 
Multiple Failures (2 failures)
	java.lang.NullPointerException: <no message>
	org.infinispan.commons.IllegalLifecycleStateException: Cache container has been stopped and cannot be reused. Recreate the cache container.
	Suppressed: java.lang.NullPointerException
...


org.keycloak.testsuite.model.user.UserModelTest#testAddRemoveUser

Keycloak CI - Store Model Tests

org.opentest4j.MultipleFailuresError: 
Multiple Failures (2 failures)
	java.lang.NullPointerException: Cannot invoke "org.keycloak.models.KeycloakSessionFactory.create()" because "factory" is null
	java.lang.NullPointerException: Null keys are not supported!
	Suppressed: java.lang.NullPointerException: Cannot invoke "org.keycloak.models.KeycloakSessionFactory.create()" because "factory" is null
...


If a rebalance is in progress, block the shutdown procedure until it finishes or a timeout is reached.

Closes keycloak#44620

Signed-off-by: Pedro Ruivo <[email protected]>
@ahus1 (Member) left a comment

Thank you for this PR.

How I understand what is happening here: Keycloak will first wait for running HTTP requests to finish. And then it will wait again for the same period for the Infinispan cluster to stabilize.

Assuming the value is set to 30 seconds, this then doubles to 60 seconds if both are maxed out.

From a user's perspective, they should only have to set one timeout. Can you please check if this is possible?



Development

Successfully merging this pull request may close these issues.

Possible data loss when scaling Keycloak StatefulSet down (on Kubernetes)

3 participants