
Block shutdown if the Infinispan cluster is not stable#48341

Open
pruivo wants to merge 1 commit into keycloak:main from pruivo:t_44620_block_shutdown

Conversation

@pruivo (Member) commented Apr 21, 2026

If a rebalance is in progress, block the shutdown procedure until it finishes or a timeout is reached.

Closes #44620
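Conceptually, the blocking behavior described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `awaitStableCluster` and the `BooleanSupplier` are hypothetical stand-ins for Infinispan's topology state.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the described behavior (not the PR's actual code):
// poll a "rebalance in progress" flag until the cluster is stable or the
// timeout elapses. The supplier stands in for Infinispan's topology state.
class BlockingShutdown {

    static boolean awaitStableCluster(BooleanSupplier rebalanceInProgress, Duration timeout)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (rebalanceInProgress.getAsBoolean()) {
            if (Instant.now().isAfter(deadline)) {
                return false; // timed out: proceed with shutdown anyway
            }
            Thread.sleep(100); // back off before re-checking the topology
        }
        return true; // cluster stable, safe to continue the shutdown
    }
}
```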

```java
@Override
public void close() {
    logger.debug("Closing provider");
    shutdownManager.onShutdown();
}
```
Contributor

Seems like blocking the other close calls could eventually be an issue as we're serializing the orderly shutdown.

Should we think about allowing a close method that returns a future, or a higher-level way to register a shutdown hook? The latter relates to #42670, which proposed adding post-start and pre-stop events; that proposal could be expanded so that shutdown hooks can be registered on the pre-stop event. cc @thomasdarimont

Member Author

> Seems like blocking the other close calls could eventually be an issue as we're serializing the orderly shutdown.

We require this behavior. As a concrete example, the JPA layer must be closed after Infinispan/JGroups because we have to clean up some tables from stale data. I can't see how the future or the shutdown hook would help with this case; at the end of the day, order matters.

Note that this PR is a workaround (and not a good one) until the feature is implemented in Infinispan; CacheManager.stop() will then do the blocking that I'm doing manually in this PR. See infinispan/infinispan#17016

Contributor

@shawkins commented Apr 22, 2026

> I can't see how the future or the shutdown hook will help with this case - at the end of the day, order matters.

Sure, and ordering can easily be preserved with a CompletableFuture. You just need to compose them in the desired order, then wait for everything to complete. I can provide an example of what that could look like if you want.

This also relates to something @ahus1 was looking to do for startup parallelization, which used virtual threads for init - the same principle could be applied to close.
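For illustration, the ordered composition described here could look roughly like the sketch below. The factory names and methods are hypothetical, not Keycloak's actual SPI; the point is that `thenCompose` preserves the required order while a single bounded wait caps the whole shutdown.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: each close returns a CompletableFuture; thenCompose
// preserves the required order (Infinispan before JPA), and a single
// bounded get() caps the entire shutdown sequence.
class OrderedAsyncClose {

    static final List<String> closed = new CopyOnWriteArrayList<>();

    static CompletableFuture<Void> closeInfinispan() {
        return CompletableFuture.runAsync(() -> closed.add("infinispan"));
    }

    static CompletableFuture<Void> closeJpa() {
        return CompletableFuture.runAsync(() -> closed.add("jpa"));
    }

    static void shutdown() throws Exception {
        // Compose in the desired order, then wait once for the whole chain.
        closeInfinispan()
                .thenCompose(ignored -> closeJpa())
                .get(30, TimeUnit.SECONDS);
    }
}
```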

Member Author

Please check whether the shutdown time is long enough to justify the work of parallelizing everything. It only takes 200 ms on my machine.

Contributor

> Please check whether the shutdown time is long enough to justify the work of parallelizing everything. It only takes 200 ms on my machine.

If you are certain that it won't block for long, then there's no need to do additional work.

Member Author

No, it can block for up to --shutdown-timeout at most.

The point I want to make is that the parallel approach may only be a couple hundred milliseconds faster than the serial one.

If you have a PoC, we can measure both times to see what the gain is and if it is significant.

Contributor

What I'm roughly describing is like: https://github.com/keycloak/keycloak/compare/main...shawkins:asyncClose?expand=1

The expectation is that factories with long-running close methods would override the new method and return a CompletionStage (it could just as easily be a CompletableFuture).

Alternatively, if we don't want to expand the API directly this way, it could be done as a separate interface. Or, if we assume it is unlikely that factories will provide a future themselves, there could be an annotation like AsyncClose so that the provider-close logic would delegate the task to a thread pool to create the future.
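A rough sketch of the separate-interface variant, under the assumption that the interface and method names here are purely illustrative and not part of Keycloak's actual SPI:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

// Hypothetical sketch, not actual Keycloak API: a default method keeps
// existing blocking factories source-compatible, while factories with
// long-running close logic override closeAsync() to return a real stage.
interface AsyncCloseableFactory {

    void close();

    // By default, run the blocking close and return an already-completed stage.
    default CompletionStage<Void> closeAsync() {
        close();
        return CompletableFuture.completedFuture(null);
    }
}
```

An existing factory implementing this interface needs no changes; only factories with genuinely long-running close logic would override `closeAsync()`.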

Member

My discussion on startup time was driven by a customer need to see a fast startup after a failover. The customer chose "4 seconds" as a goal here.

I don't see a similar need for the shutdown. In addition to that, I consider this out of scope for the problem this PR is planning to solve.

Contributor

> My discussion on startup time was driven by a customer need to see a fast startup after a failover. The customer chose "4 seconds" as a goal here.

Yes, I understand that it has a different rationale, but there is a symmetry between long-running blocking operations in our start and shutdown sequences. And I'm not sure whether we're trying to account, there or here, for the possibility of unknown init/close times from user-deployed factories, not just our built-in ones.

> I don't see a similar need for the shutdown. In addition to that, I consider this out of scope for the problem this PR is planning to solve.

The problem it is trying to solve is highly related to the concern you are bringing up in #48341 (review): if we have serial closure of factories, a single long-running closure could consume the entire timeout and prevent or delay the closure of other factories.

If at this point in time there is a well-defined ordering of long-running factory closures, then I agree parallelizing is out of scope.


@keycloak-github-bot (Bot) left a comment

Unreported flaky test detected, please review


Unreported flaky test detected

If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.

org.keycloak.testsuite.model.singleUseObject.SingleUseObjectModelTest#testCluster

Keycloak CI - Store Model Tests

java.lang.AssertionError: 
threads didn't terminate in time: [main (RUNNABLE):
	at [email protected]/sun.management.ThreadImpl.dumpThreads0(Native Method)
	at [email protected]/sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:505)
	at [email protected]/sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:493)
...


org.keycloak.testsuite.model.user.UserModelTest#testAddRemoveUserConcurrent

Keycloak CI - Store Model Tests

org.opentest4j.MultipleFailuresError: 
Multiple Failures (2 failures)
	java.lang.NullPointerException: <no message>
	org.infinispan.commons.IllegalLifecycleStateException: Cache container has been stopped and cannot be reused. Recreate the cache container.
	Suppressed: java.lang.NullPointerException
...


org.keycloak.testsuite.model.user.UserModelTest#testAddRemoveUser

Keycloak CI - Store Model Tests

org.opentest4j.MultipleFailuresError: 
Multiple Failures (2 failures)
	java.lang.NullPointerException: Cannot invoke "org.keycloak.models.KeycloakSessionFactory.create()" because "factory" is null
	java.lang.NullPointerException: Null keys are not supported!
	Suppressed: java.lang.NullPointerException: Cannot invoke "org.keycloak.models.KeycloakSessionFactory.create()" because "factory" is null
...


If a rebalance is in progress, block the shutdown procedure until it finishes or a timeout is reached.

Closes keycloak#44620

Signed-off-by: Pedro Ruivo <[email protected]>
@ahus1 (Member) left a comment

Thank you for this PR.

How I understand what is happening here: Keycloak will first wait for running HTTP requests to finish. And then it will wait again for the same period for the Infinispan cluster to stabilize.

Assuming the value is set to 30 seconds, this then doubles to 60 seconds if both are maxed out.

From a user's perspective, they should only have to set one timeout. Can you please check if this is possible?



Development

Successfully merging this pull request may close these issues.

Possible data loss when scaling Keycloak StatefulSet down (on Kubernetes)

3 participants