Add early exit kill switch #175

Merged
sfc-gh-mwyatt merged 1 commit into main from mwyatt/kill-switch
May 6, 2025
@sfc-gh-mwyatt
Collaborator

Allow the user to manually set early_stop=True during training by creating a file. The path defaults to /tmp/at_kill_switch and can be changed with kill_switch_path in the trainer config.
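A minimal sketch of how such a file-based kill switch might sit in a training loop. Only early_stop, kill_switch_path, and the /tmp/at_kill_switch default come from this PR; the TrainerConfig and Trainer classes, check_kill_switch, and the steps_done counter are hypothetical stand-ins, not the project's actual trainer.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class TrainerConfig:
    # Default path taken from the PR description; the class itself is hypothetical.
    kill_switch_path: Path = Path("/tmp/at_kill_switch")


@dataclass
class Trainer:
    config: TrainerConfig = field(default_factory=TrainerConfig)
    early_stop: bool = False
    steps_done: int = 0

    def check_kill_switch(self) -> None:
        # Same check as the PR diff: the file's existence is the signal.
        if self.config.kill_switch_path.exists():
            self.early_stop = True

    def train(self, max_steps: int) -> None:
        for _ in range(max_steps):
            self.check_kill_switch()
            if self.early_stop:
                break
            self.steps_done += 1  # stand-in for one real training step
```

With this shape, running `touch /tmp/at_kill_switch` from another shell on the same machine would make the loop exit at the next check.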

Resolves #126

@sfc-gh-mwyatt sfc-gh-mwyatt merged commit a34de41 into main May 6, 2025
6 checks passed
@sfc-gh-mwyatt sfc-gh-mwyatt deleted the mwyatt/kill-switch branch May 6, 2025 15:10
Comment on lines +336 to +337
if self.config.kill_switch_path.exists():
    self.early_stop = True
Collaborator

@sfc-gh-sbekman sfc-gh-sbekman May 6, 2025

This isn't going to work in a multi-node setup: each node checks its own fs, but the kill switch file is normally created on only one node. So as written this requires a shared fs.

I think the check above needs to look at global rank 0's fs when the path is local and then signal the other nodes to exit; otherwise it will break.

If you want a quick solution, make the default path relative to where the code was launched from, and perhaps use our util that maps the fs type to ensure the file lands on a shared fs. If not, assert until signaling is implemented.
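The rank 0 suggestion above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: should_stop, the Broadcast callable, and make_fake_broadcast are hypothetical names. In a real job the callable would wrap an actual collective (for example a broadcast from rank 0 via torch.distributed); here a single-process fake stands in so the logic can be followed without a cluster.

```python
from pathlib import Path
from typing import Callable

# Hypothetical signature: (flag, is_source) -> flag as seen by the source rank.
Broadcast = Callable[[bool, bool], bool]


def should_stop(rank: int, kill_switch_path: Path, broadcast: Broadcast) -> bool:
    """Return rank 0's view of the kill switch on every rank."""
    is_source = rank == 0
    # Only rank 0 touches the filesystem; other ranks send a placeholder.
    local_flag = kill_switch_path.exists() if is_source else False
    return broadcast(local_flag, is_source)


def make_fake_broadcast() -> Broadcast:
    """Single-process stand-in for a real collective (illustration only).

    Rank 0 must call it first, which a real collective would not require.
    """
    box = {}

    def broadcast(flag: bool, is_source: bool) -> bool:
        if is_source:
            box["flag"] = flag
        return box["flag"]

    return broadcast
```

Because every rank returns the value that rank 0 observed, all ranks would set early_stop on the same step even when the file exists only on rank 0's local fs.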

Development

Successfully merging this pull request may close these issues.

[feature request] Kill switch

3 participants