Screenshot documenting the default output from running sacct on the hello_bye_parameterized spec. The machinery used to split lines currently ignores the job steps and simply uses the cumulative job state.
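To illustrate the "ignore the job steps" behavior described here, a minimal sketch follows. It is not the adapter's actual code: the function name `parse_sacct_states` is hypothetical, it assumes sacct was run with `--parsable2 --noheader --format=JobID,State` (so fields are pipe-delimited), and the sample text is made up for illustration rather than captured from a real run.

```python
from typing import Dict


def parse_sacct_states(sacct_output: str) -> Dict[str, str]:
    """Map top-level job IDs to their cumulative state, skipping job steps.

    Assumes pipe-delimited rows of the form "JobID|State" (hypothetical
    helper; not Maestro's implementation).
    """
    states = {}
    for line in sacct_output.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        if len(fields) < 2:
            continue
        job_id, state = fields[0], fields[1]
        # Step rows carry a suffix such as "1234.batch" or "1234.0";
        # only the bare job ID row holds the cumulative job state.
        if "." in job_id:
            continue
        states[job_id] = state
    return states


# Illustrative input only, not real sacct output.
sample = "1234|COMPLETED\n1234.batch|COMPLETED\n1234.0|COMPLETED\n1235|RUNNING\n"
print(parse_sacct_states(sample))
```

Keeping the parser a pure function of the text makes it testable without a live Slurm installation.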
    """
    LOGGER.debug("Received SLURM State -- %s", slurm_state)
-   if slurm_state == "R":
+   if slurm_state == "R" or slurm_state == "RUNNING":
Can these actually be abbreviated in sacct?
I'll play with the format options and see, but it's not clear that restricting the state to 2 letters will actually do the same thing, since running is only a 1-letter abbreviation.
Sorry -- I was unclear. My question was whether or not we need the `or` clause in there.
Though -- if we fall back to squeue if sacct doesn't work, then we'd need those there.
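One way to sidestep the question of whether a given command reports `R` or `RUNNING` is to normalize the state string before comparing it. The sketch below is only an illustration of that idea, not the adapter's code: `normalize_slurm_state` and the `"UNKNOWN"` placeholder are hypothetical names, and the table covers only a subset of Slurm's state codes.

```python
# Subset of Slurm's short job state codes and their long names.
_SHORT_TO_LONG = {
    "PD": "PENDING",
    "R": "RUNNING",
    "CG": "COMPLETING",
    "CD": "COMPLETED",
    "CA": "CANCELLED",
    "F": "FAILED",
    "TO": "TIMEOUT",
    "NF": "NODE_FAIL",
}


def normalize_slurm_state(raw: str) -> str:
    """Return the long-form state name for either a short code or a long name.

    Hypothetical helper: only the first whitespace-separated token is used,
    so a value such as "CANCELLED by 1000" reduces to "CANCELLED".
    """
    tokens = raw.strip().split()
    if not tokens:
        return "UNKNOWN"  # placeholder chosen for this sketch
    # A trailing "+" can appear when a long name is truncated in the output.
    token = tokens[0].rstrip("+").upper()
    return _SHORT_TO_LONG.get(token, token)


print(normalize_slurm_state("R"))
print(normalize_slurm_state("RUNNING"))
```

With this in place the check in the diff could compare against a single value, at the cost of maintaining the lookup table.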
|
@jwhite242 -- For not being able to test this directly, these changes look just fine to me. Just the one question above.
|
Random thought, might it be worth checking if the call to |
|
Yeah, I could go back to that if you prefer; the initial implementation of this used a fallback to sacct when squeue failed, and kept the same end check for the case where both of them lose track of jobs. One other related hook that might be interesting here: add a cli hook to conductor so users can manually change the status of lost steps when this happens, letting the workflow continue. There'd have to be logging of this, of course, so the provenance doesn't become unreliable.
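The fallback order described in this comment can be sketched roughly as follows. This is an assumption-laden illustration, not the code in this PR: `lookup_job_state` and its two callable parameters are hypothetical, and each callable is assumed to return a state string or `None` when the scheduler no longer knows the job.

```python
from typing import Callable, Optional

# A query takes a job ID and returns its state, or None if the job is unknown.
StateQuery = Callable[[str], Optional[str]]


def lookup_job_state(
    job_id: str, query_squeue: StateQuery, query_sacct: StateQuery
) -> Optional[str]:
    """Ask squeue first; fall back to sacct if squeue has dropped the job.

    Returns None only when both sources have lost track of the job, which
    is the point where the shared "end check" would apply.
    """
    state = query_squeue(job_id)
    if state is None:
        state = query_sacct(job_id)
    return state


# Stand-ins for the real commands, for illustration only.
squeue_view = {"101": "RUNNING"}
sacct_view = {"101": "RUNNING", "102": "COMPLETED"}
print(lookup_job_state("102", squeue_view.get, sacct_view.get))
```

Passing the queries in as callables keeps the ordering logic separate from how each command is invoked and parsed.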
|
I think it's a good idea to have the backup since it's seemingly not guaranteed that |
1.1.10 Release (#432)

* Sync up read the docs config with dev environments using poetry (#399)
* Print usage on command line when no args are provided (#404)
* Add sacct fallback to slurm adapter to improve robustness of job tracking (#405)
* Update Flurm Job State mappings for flux versions >= 0.26 (#407)
* Bump certifi from 2021.10.8 to 2022.12.7 to address security issue (#409)
* Bump cryptography from 37.0.1 to 38.0.3 to address security issue (#410)
* Add missing shbang in unscheduled scripts from lsf adapter (#411)
* Update poetry lockfile to address dependabot flagged security issues (#412)
* Fix for Dockerfile smell DL3006 (#418)
* Port Maestro documentation to mkdocs and expand coverage of features and tutorials (#403)
* Update version info to be driven from pyproject.toml exclusively, and hook up to command line (#419)
* Pin mermaid to < 10.x due to api change (#422)
* Bump lock file certifi from 2022.12.7 to 2023.7.22 to address security issue (#426)
* Refactor flux adapter to avoid using pickle to talk to flux brokers installed in external environments (#415). Also adds flux integration tests to exercise against real flux brokers
* Add pager functionality to status command (#420)
* Patch broken flux job cancellation (#428)
* Insulate slurm adapters from user customization of squeue and sacct output formats (#431). Also adds live unit and integration tests for slurm adapter

---------

Co-authored-by: Francesco Di Natale <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Bruno P. Kinoshita <[email protected]>
Co-authored-by: Charles Doutriaux <[email protected]>
Co-authored-by: Giovanni Rosa <[email protected]>
Co-authored-by: Brian Gunnarson <[email protected]>
Fixes an issue with job checking on slurm systems that can result in steps never being marked finished. The squeue command appears to flush finished/cancelled/killed jobs from the queue before maestro can check on the jobs and update their status. I have yet to reproduce this in maestro studies, but can reproduce it with manual job submissions: killing them with ctrl-c and watching the squeue iterate output shows the job dropped almost immediately. This fix replaces squeue with sacct, which preserves the job info for much longer.
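For reference, a status query through sacct might be assembled along these lines. This is a sketch under assumptions rather than the adapter's implementation: the helper names `build_sacct_command` and `run_sacct` are hypothetical, and the choice of the `JobID,State` columns is an illustration. The flags themselves (`--jobs`, `--format`, `--parsable2`, `--noheader`) are standard sacct options.

```python
import subprocess
from typing import List


def build_sacct_command(job_ids: List[str]) -> List[str]:
    """Build an sacct invocation that prints "JobID|State" rows, no header."""
    return [
        "sacct",
        "--noheader",             # omit the column header row
        "--parsable2",            # pipe-delimited, no trailing delimiter
        "--format=JobID,State",   # fixed columns, independent of user defaults
        "--jobs=" + ",".join(job_ids),
    ]


def run_sacct(job_ids: List[str]) -> str:
    """Run sacct and return its raw stdout (requires Slurm on the host)."""
    result = subprocess.run(
        build_sacct_command(job_ids),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


print(" ".join(build_sacct_command(["101", "102"])))
```

Pinning the output format explicitly also avoids depending on whatever defaults a user or site has configured for sacct.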