Block shutdown if the Infinispan cluster is not stable #48341
pruivo wants to merge 1 commit into keycloak:main
Conversation
```java
@Override
public void close() {
    logger.debug("Closing provider");
    shutdownManager.onShutdown();
```
Seems like blocking the other close calls could eventually be an issue as we're serializing the orderly shutdown.
Should we think about allowing for a close method that returns a future, or a higher-level way to register a shutdown hook? The latter relates to #42670, which proposed adding post-start and pre-stop events; that proposal could be expanded so that shutdown hooks can be registered for the pre-stop event. cc @thomasdarimont
> Seems like blocking the other close calls could eventually be an issue as we're serializing the orderly shutdown.
We require this behavior. As a concrete example, the JPA layer must be closed after Infinispan/JGroups because we have to clean up stale data from some tables. I can't see how the future or the shutdown hook would help with this case; at the end of the day, order matters.
Note that this PR is a workaround (and not a good one) until the feature is implemented in Infinispan. The CacheManager.stop() will do the blocking that I'm doing manually in this PR. infinispan/infinispan#17016
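The manual blocking described above is essentially a bounded poll until the cluster reports a stable topology. A minimal generic sketch of that pattern, where `isStable` is a placeholder for whatever rebalance-in-progress check the cache layer exposes (not the actual Infinispan API):

```java
import java.util.function.BooleanSupplier;

public class StableClusterWait {

    // Poll a stability check until it passes or the deadline expires.
    // Returns false on timeout so the caller can proceed with shutdown anyway.
    static boolean awaitStable(BooleanSupplier isStable, long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!isStable.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                return false; // timed out
            }
            Thread.sleep(50); // back off between checks
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Simulated cluster that becomes "stable" after ~200ms,
        // well inside the 1000ms budget.
        boolean ok = awaitStable(() -> System.currentTimeMillis() - start > 200, 1000);
        System.out.println(ok);
    }
}
```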
> I can't see how the future or the shutdown hook will help with this case - at the end of the day, order matters.
Sure, and ordering can easily be preserved with a CompletableFuture. You just need to compose them in the desired order, then wait for everything to complete. I can provide an example of what that could look like if you want.
This also relates to something @ahus1 was looking to do for startup parallelization, which used virtual threads for init - the same principle could be applied to close.
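A rough sketch of the composition idea, assuming a hypothetical `AsyncCloseable` with a `closeAsync()` method (not an existing Keycloak interface): each factory's close starts only after the previous one completes, so the required ordering (e.g. JPA after Infinispan) is preserved while the caller still gets a single future to wait on.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

public class OrderedShutdown {

    interface AsyncCloseable {
        CompletableFuture<Void> closeAsync();
    }

    // Chain the closures in the given order via thenCompose: each one
    // starts only after the previous future has completed.
    static CompletableFuture<Void> closeInOrder(List<AsyncCloseable> factories) {
        CompletableFuture<Void> chain = CompletableFuture.completedFuture(null);
        for (AsyncCloseable f : factories) {
            chain = chain.thenCompose(ignored -> f.closeAsync());
        }
        return chain;
    }

    public static void main(String[] args) {
        List<String> order = new CopyOnWriteArrayList<>();
        AsyncCloseable infinispan = () -> CompletableFuture.runAsync(() -> order.add("infinispan"));
        AsyncCloseable jpa = () -> CompletableFuture.runAsync(() -> order.add("jpa"));
        closeInOrder(List.of(infinispan, jpa)).join();
        System.out.println(String.join(",", order)); // infinispan,jpa
    }
}
```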
Please check whether the shutdown time is long enough to compensate for the work we need to parallelize everything. It only takes 200ms on my machine.
> Please check whether the shutdown time is long enough to compensate for the work we need to parallelize everything. It only takes 200ms on my machine.
If you are certain that it won't block for long, then there's no need to do additional work.
No, it can block for as long as --shutdown-timeout at max.
The point I want to make is that the parallel approach may only be a couple hundred milliseconds faster than the serial one.
If you have a PoC, we can measure both times to see what the gain is and if it is significant.
What I'm roughly describing is like: https://github.com/keycloak/keycloak/compare/main...shawkins:asyncClose?expand=1
With the expectation that factories with long-running close methods should override the new method and provide a CompletionStage (or it could just as easily be a CompletableFuture).
Alternatively, if we don't want to directly expand the API this way, it could be done as a separate interface. Or, if we assume it is unlikely that factories would provide a future themselves, there could be an annotation like AsyncClose so that the close-provider logic would delegate the task to a thread pool to create the future.
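One way to expand the API without breaking existing factories is a default method that wraps the blocking close. This is a hypothetical sketch, not Keycloak's actual `ProviderFactory` contract: only factories with long-running shutdown work would override `closeAsync()`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

public class AsyncCloseApi {

    interface ProviderFactory {
        void close();

        // Default keeps backward compatibility: run the blocking close()
        // and return an already-completed stage. Long-running factories
        // override this to return a real asynchronous stage.
        default CompletionStage<Void> closeAsync() {
            close();
            return CompletableFuture.completedFuture(null);
        }
    }

    public static void main(String[] args) {
        // A factory that never overrides closeAsync() still works unchanged.
        ProviderFactory quick = () -> System.out.println("quick close");
        quick.closeAsync().toCompletableFuture().join();
    }
}
```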
My discussion on startup time was driven by a customer need to see a fast startup after a failover. The customer chose "4 seconds" as a goal here.
I don't see a similar need for the shutdown. In addition to that, I consider this out of scope for the problem this PR is planning to solve.
> My discussion on startup time was driven by a customer need to see a fast startup after a failover. The customer chose "4 seconds" as a goal here.
Yes, I understand that it has a different rationale, but there is a symmetry to the problem of long-running blocking operations in our start and shutdown sequences. And I'm not sure if we're trying to account, there or here, for the possibility of unknown init/close times from user-deployed factories, not just our built-in ones.
> I don't see a similar need for the shutdown. In addition to that, I consider this out-of-scope for the problem this one is planning to solve.
The problem it is trying to solve is highly related to the concern you are bringing up in #48341 (review): if we have a serial closure of factories, a single long-running closure could consume the entire timeout and prevent or delay the closure of other factories.
If at this point in time there is a well-defined ordering of long-running factory closures, then I agree parallelizing is out of scope.
Unreported flaky test detected
If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.
- org.keycloak.testsuite.model.singleUseObject.SingleUseObjectModelTest#testCluster (Keycloak CI - Store Model Tests)
- org.keycloak.testsuite.model.user.UserModelTest#testAddRemoveUserConcurrent (Keycloak CI - Store Model Tests)
- org.keycloak.testsuite.model.user.UserModelTest#testAddRemoveUser (Keycloak CI - Store Model Tests)
If a rebalance is in progress, block the shutdown procedure until it finishes or a timeout is reached.
Closes keycloak#44620
Signed-off-by: Pedro Ruivo <[email protected]>
ahus1 left a comment
Thank you for this PR.
How I understand what is happening here: Keycloak will first wait for running HTTP requests to finish, and then it will wait again, for the same period, for the Infinispan cluster to stabilize.
Assuming the value is set to 30 seconds, this then doubles to 60 seconds if both are maxed out.
From a user's perspective, I want them to set only one timeout. Can you please check if this is possible?
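One possible way to honor a single user-facing timeout across both waits is to fix one deadline up front and give each phase only the remaining budget. This is a sketch of that idea under assumed names (`awaitHttpQuiesce` is a placeholder, not Keycloak code):

```java
public class SharedShutdownDeadline {

    // Time left until the deadline, clamped at zero.
    static long remaining(long deadlineNanos) {
        return Math.max(0, deadlineNanos - System.nanoTime());
    }

    public static void main(String[] args) {
        long timeoutMillis = 30_000; // the single --shutdown-timeout value
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;

        // Phase 1: wait for in-flight HTTP requests, bounded by the deadline.
        long httpBudget = remaining(deadline);
        // ... awaitHttpQuiesce(httpBudget) ...

        // Phase 2: wait for the cluster to stabilize with whatever is left,
        // so the two phases together never exceed the configured timeout.
        long clusterBudget = remaining(deadline);
        System.out.println(clusterBudget <= httpBudget);
    }
}
```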