Skip to content

Bugfix/slurm job checking#405

Merged
FrankD412 merged 4 commits intollnl:developfrom
jwhite242:bugfix/slurm_job_checking
Aug 12, 2022
Merged

Bugfix/slurm job checking#405
FrankD412 merged 4 commits intollnl:developfrom
jwhite242:bugfix/slurm_job_checking

Conversation

@jwhite242
Copy link
Copy Markdown
Collaborator

Fixes issue with job checking on slurm systems that can result in steps never being marked finished. The squeue command appears to flush finished/cancelled/killed jobs from the queue before maestro can check on the jobs and update their status. Have yet to reproduce this in maestro studies, but can reproduces with manual job submissions and using ctrl-c to kill them and watching the squeue iterate output drop the job almost immediately. This fix replaces squeue with sacct which preserves the job info for much longer.

@jwhite242 jwhite242 added the Bugfix Discussion/Implementation of a solution to a bug (usually a PR) label Jun 30, 2022
@jwhite242 jwhite242 requested a review from FrankD412 June 30, 2022 20:14
@jwhite242 jwhite242 self-assigned this Jun 30, 2022
@jwhite242
Copy link
Copy Markdown
Collaborator Author

Screen shot for documenting the default outputs from running sacct on the hello_bye_parameterized spec:

sacct --jobs=6564055,6564056,6564057,6564058,6564059,6564060,6564061,6564062
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
6564055      hello_wor+     pdebug    wbronze          1  COMPLETED      0:0 
6564055.bat+      batch               wbronze          1  COMPLETED      0:0 
6564055.0          echo               wbronze          1  COMPLETED      0:0 
6564056      hello_wor+     pdebug    wbronze          1  COMPLETED      0:0 
6564056.bat+      batch               wbronze          1  COMPLETED      0:0 
6564056.0          echo               wbronze          1  COMPLETED      0:0 
6564057      hello_wor+     pdebug    wbronze          1  COMPLETED      0:0 
6564057.bat+      batch               wbronze          1  COMPLETED      0:0 
6564057.0          echo               wbronze          1  COMPLETED      0:0 
6564058      hello_wor+     pdebug    wbronze          1  COMPLETED      0:0 
6564058.bat+      batch               wbronze          1  COMPLETED      0:0 
6564058.0          echo               wbronze          1  COMPLETED      0:0 
6564059      bye_world+     pdebug    wbronze          1  COMPLETED      0:0 
6564059.bat+      batch               wbronze          1  COMPLETED      0:0 
6564059.0          echo               wbronze          1  COMPLETED      0:0 
6564060      bye_world+     pdebug    wbronze          1  COMPLETED      0:0 
6564060.bat+      batch               wbronze          1  COMPLETED      0:0 
6564060.0          echo               wbronze          1  COMPLETED      0:0 
6564061      bye_world+     pdebug    wbronze          1  COMPLETED      0:0 
6564061.bat+      batch               wbronze          1  COMPLETED      0:0 
6564061.0          echo               wbronze          1  COMPLETED      0:0 
6564062      bye_world+     pdebug    wbronze          1  COMPLETED      0:0 
6564062.bat+      batch               wbronze          1  COMPLETED      0:0 
6564062.0          echo               wbronze          1  COMPLETED      0:0

The machinery used to split lines currently ignores the job steps and simply uses the cumulative job state.

"""
LOGGER.debug("Received SLURM State -- %s", slurm_state)
if slurm_state == "R":
if slurm_state == "R" or slurm_state == "RUNNING":
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these actually be abbreviated in sacct?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll play the format options and see, but it's not clear that restricting state to 2 letters will actually do the same thing since running is only a 1 letter abbreviation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry -- I was unclear. My question was whether or not we need the or clause in there.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though -- if we fall back to squeue if sacct doesn't work, then we'd need those there.

@FrankD412
Copy link
Copy Markdown
Member

FrankD412 commented Jul 2, 2022

@jwhite242 -- For not being able to test this directly, these changes look like they're just fine. Just one question, but otherwise this is looking just fine to me.

@FrankD412
Copy link
Copy Markdown
Member

Random thought, might it be worth checking if the call to sacct fails and then backing off to squeue?

@jwhite242
Copy link
Copy Markdown
Collaborator Author

Yeah, I could go back to that if you prefer; initial implementation of this used a fallback to sacct when squeue failed and kept the same end check for both of them losing track of jobs. One other related hook that might be interesting here is to add a cli hook to conductor to make it possible for users to manually change the status of lost steps when this happens so the workflow can continue? There'd have to be logging of this of course so the provenance doesn't become unreliable.

@FrankD412
Copy link
Copy Markdown
Member

I think it's a good idea to have the backup since it's seemingly not guaranteed that sacct will actually be available (as it seems to not be usable on the clusters I'm on).

@FrankD412 FrankD412 merged commit db2cb0f into llnl:develop Aug 12, 2022
jwhite242 added a commit that referenced this pull request Dec 12, 2023
1.1.10 Release (#432)

* Sync up read the docs config with dev environments using poetry (#399)
* Print usage on command line when no args are provided (#404)
* Add sacct fallback to slurm adapter to improve robustness of job tracking (#405)
* Update Flurm Job State mappings for flux versions >= 0.26 (#407)
* Bump certifi from 2021.10.8 to 2022.12.7 to address security issue (#409)
* Bump cryptography from 37.0.1 to 38.0.3 to address security issue (#410)
* Add missing shbang in unscheduled scripts from lsf adapter (#411)
* Update poetry lockfile to address dependabot flagged security issues (#412)
* Fix for Dockerfile smell DL3006 (#418)
* Port Maestro documentation to mkdocs and expand coverage of features and tutorials (#403)
* Update version info to be driven from pyproject.toml exclusively, and hook up to command line (#419)
* Pin mermaid to < 10.x due to api change (#422)
* Bump lock file certifi from 2022.12.7 to 2023.7.22 to address security issue (#426)
* Refactor flux adapter to avoid using pickle to talk to flux brokers installed in external environments (#415)
   Also adds flux integration tests to exercise against real flux brokers
* Add pager functionality to status command (#420)
* Patch broken flux job cancellation (#428)
* Insulate slurm adapters from user customization of squeue and sacct output formats (#431)
   Also adds live unit and integration tests for slurm adapter

---------

Co-authored-by: Francesco Di Natale <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Bruno P. Kinoshita <[email protected]>
Co-authored-by: Charles Doutriaux <[email protected]>
Co-authored-by: Giovanni Rosa <[email protected]>
Co-authored-by: Brian Gunnarson <[email protected]>
jwhite242 added a commit that referenced this pull request Feb 6, 2024
1.1.10 Release (#432)

* Sync up read the docs config with dev environments using poetry (#399)
* Print usage on command line when no args are provided (#404)
* Add sacct fallback to slurm adapter to improve robustness of job tracking (#405)
* Update Flurm Job State mappings for flux versions >= 0.26 (#407)
* Bump certifi from 2021.10.8 to 2022.12.7 to address security issue (#409)
* Bump cryptography from 37.0.1 to 38.0.3 to address security issue (#410)
* Add missing shbang in unscheduled scripts from lsf adapter (#411)
* Update poetry lockfile to address dependabot flagged security issues (#412)
* Fix for Dockerfile smell DL3006 (#418)
* Port Maestro documentation to mkdocs and expand coverage of features and tutorials (#403)
* Update version info to be driven from pyproject.toml exclusively, and hook up to command line (#419)
* Pin mermaid to < 10.x due to api change (#422)
* Bump lock file certifi from 2022.12.7 to 2023.7.22 to address security issue (#426)
* Refactor flux adapter to avoid using pickle to talk to flux brokers installed in external environments (#415)
   Also adds flux integration tests to exercise against real flux brokers
* Add pager functionality to status command (#420)
* Patch broken flux job cancellation (#428)
* Insulate slurm adapters from user customization of squeue and sacct output formats (#431)
   Also adds live unit and integration tests for slurm adapter

---------

Co-authored-by: Francesco Di Natale <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Bruno P. Kinoshita <[email protected]>
Co-authored-by: Charles Doutriaux <[email protected]>
Co-authored-by: Giovanni Rosa <[email protected]>
Co-authored-by: Brian Gunnarson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bugfix Discussion/Implementation of a solution to a bug (usually a PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants