Add early exit kill switch #175
Merged
sfc-gh-mwyatt merged 1 commit into main on May 6, 2025
sfc-gh-jrasley
approved these changes
May 5, 2025
Comment on lines +336 to +337
+ if self.config.kill_switch_path.exists():
+     self.early_stop = True
Collaborator
This won't work in a multi-node setup: each node checks its own filesystem, but the kill switch file is normally created on only one node, so this requires a shared filesystem.
I think the check above should look at global rank 0's filesystem when the path is local and then signal the other nodes to exit; otherwise it will break.
For a quick solution, make the default path relative to where the code was launched from, and perhaps use our util that maps the filesystem type to ensure the file lands on a shared filesystem. If not, assert until signaling is implemented.
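The rank 0 approach suggested here could be sketched roughly as follows. This is only an illustration, not the code in this PR: `check_kill_switch`, `broadcast_from_rank0`, and `simulate` are hypothetical names, and the broadcast is a single-process stand-in. In a real run it would presumably be a collective such as `torch.distributed.broadcast` on a one-element tensor with `src=0`.

```python
from pathlib import Path
from typing import Callable, Dict, List


def check_kill_switch(
    rank: int,
    kill_switch_path: Path,
    broadcast_from_rank0: Callable[[bool], bool],
) -> bool:
    """Return rank 0's view of the kill switch on every rank."""
    # Only rank 0 touches its filesystem; other ranks pass a placeholder.
    local_flag = kill_switch_path.exists() if rank == 0 else False
    # Every rank must call the collective; it returns rank 0's value.
    return broadcast_from_rank0(local_flag)


def simulate(world_size: int, kill_switch_path: Path) -> List[bool]:
    """Single-process stand-in for a multi-rank run (assumption for the demo)."""
    shared: Dict[str, bool] = {}

    def fake_broadcast_for(rank: int) -> Callable[[bool], bool]:
        def fake_broadcast(flag: bool) -> bool:
            if rank == 0:
                shared["flag"] = flag  # rank 0 publishes its decision
            return shared["flag"]  # all ranks read rank 0's decision
        return fake_broadcast

    # Rank 0 runs first so its value is available to the other ranks.
    return [
        check_kill_switch(rank, kill_switch_path, fake_broadcast_for(rank))
        for rank in range(world_size)
    ]
```

With this shape, the file only needs to exist on the node hosting rank 0, and all ranks set `early_stop` on the same step, which avoids some ranks exiting while others wait in a collective.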
Allow user to manually set `early_stop=True` during training by creating a file. Defaults to `/tmp/at_kill_switch`, but can be modified with `kill_switch_path` in the trainer config.

Resolves #126
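As a rough single-process illustration of the behavior described above, the sketch below polls for the file once per step. `TrainerConfig`, `Trainer`, `train`, and `steps_run` are hypothetical stand-ins rather than the project's actual classes; only `kill_switch_path`, `early_stop`, and the default `/tmp/at_kill_switch` come from this PR.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TrainerConfig:
    # Default taken from the PR description; the class itself is hypothetical.
    kill_switch_path: Path = Path("/tmp/at_kill_switch")


class Trainer:
    def __init__(self, config: TrainerConfig) -> None:
        self.config = config
        self.early_stop = False
        self.steps_run = 0

    def train(self, max_steps: int) -> None:
        for _ in range(max_steps):
            # Same check as in the diff: the file's existence flips the flag.
            if self.config.kill_switch_path.exists():
                self.early_stop = True
            if self.early_stop:
                break
            self.steps_run += 1  # placeholder for a real training step
```

Under these assumptions, running `touch /tmp/at_kill_switch` on the training node would end the loop at the next step boundary. The file should be removed before the next run, since the check only tests for existence.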