Correction to a bug for attempting a restart when a restart isn't specified.#133
Correction to a bug for attempting a restart when a restart isn't specified.#133
Conversation
jsemler
left a comment
There was a problem hiding this comment.
I tested the change and everything ran. Approved.
|
@jsemler -- You previously had steps that timed out, was the timeout properly reflected in the status? |
|
Yes, the timeouts are reflected. Looking through the status again I'm noticing some of the steps don’t include the restart count in the status.csv. |
|
@jsemler -- Can you provide an example? |
|
It looks like the restart count is in the status.csv file, but there is an issue with it displaying via the console for some reason: |
jsemler
left a comment
There was a problem hiding this comment.
There seems to still be an issue with a successfully completed step restarting twice. I'm going to do some debugging.
|
Yeah, I was just looking at the log. The weird thing is the conductor launches it, then for a while reports it's complete and that it's being skipped. Then after some point, it's picked up again. Still not sure why. |
|
@jsemler -- Alright, after a good bit of digging around, I think I've narrowed the bug down. I apologize if this explanation is somewhat convoluted, but I'll do what I can to make it clear. In the I noticed that in the log you provided that steps that were repeated would always trigger the "All dependencies completed. Staging." log message and I never saw the passing message in the I don't expect the fix to this will be difficult, so I'll try and have it completed today. We can still go through it if you'd like. Edit: As an extra note, I think this issue only appears when throttling because that is the only time you'll have steps queued that are still in the |
|
@jsemler -- I'm thinking the bug fix for the |
|
@FrankD412 -- I agree. This PR fixes the original issue. |
* Update setup.py to 1.1.3dev * Updated the getting started documentation * Added examples for running both lulesh specs * Flux for Spectrum bugfixes (#116) * Additional arguments can be passed through batch. * Correction of the use of extend to append. * Removal of OMPI vars from env. * Reversal of env altering. * Addition of mpi exe to batch. * Removal of -gpu flag * Addition of "allocated" to PENDING set. * Correction of the check_status method call. * Correction of the Flux import * Addition of the EnvironmentError to try/catch for check_status. * Addition of jobid to submission INFO logging. * Workflow setup fix (#117) * Fixed a workflow setup issue caused by spaces in the spec name * Additional dirname formatting * Refactor of the Study class to breakdown complex APIs (#118) * Minor docstring correction. * Change to study setup API to break out workspace creation. * Update Maestro frontend to use new Study method. * Renamed Study.setup to be better communicate its functionality. * Correction to style for configure_study method. * Moved environment application to add_step. * Split out acquiring environment elements to its own method. * Updates to staging checks. * Renamed _setup_linear and _setup_parameterized from setup to stage for clarity. * Tweaks and addition of starting to SpectrumFluxAdapter. (#119) * Updated the LULESH examples to use the LULESH git repository for cloning the repo (#121) * Add the generation of metadata to Study construction. (#120) * Addition of method and tweaks to support metadata. * Addition of call to the method for generating metadata. * Correction to encode YAML string for py3 compatibility. * Another attempt at py2-py3 compatible encoding. * Addition of load_metadata method. * Additional debug logging for the Flux Spectrum Adapter (#122) * Additional logging for debugging of the adapter. * Additional logging. * Updated Exception.message references to Exception.args (#125) * Fix status.csv 'State' column writeout. (#127) * #107 Enhance/return codes (#128) * #107 Added StudyStatus enumeration Changed DAG execute_ready_steps to return StudyStatus enum instead of a boolean conductor and maestro also accept StudyStatus and convert to return value * #128 fixed issues found in pull request * #107 added inline comments per PC * #107 pulling out _check_study_completion to avoid initial implementaion copy and paste error * #107 fixed line length issue * Bugfix that would cause the ExecutionGraph to not update when cancelled. (#131) * Addition of custom parameter generation when running studies. (#129) * Command line parameter for custom ParameterGenerators. * Addition of sample custom parameter generation. * Correction to missing parameters in SIZE. * Addition of utility method for importing custom gen py files. * Correction of syntax. * Conversion of DOS newlines to Unix. * Addition of call to check for custom code. * Correction of a bad docstring. * Addition of parameter metadata to Maestro. * Addition of parameters to study metadata. * Addition of copying of the parameter file to study workspace. * Correction to a bug for attempting a restart when a restart isn't specified. (#133) * Initial fix for steps that cannot restart. * Comments to clarify logic. * More flexible ExecutionGraph description API and logging of description. (#137) * Tweaks to how the ExecutionGraph adds descriptions. Added logging. * Updated study to use the new API and some minor refactor. * Fixed #134 * Fix to Variable verification. (#138) Fixes #123 * Updates to Maestro documentation (#114) * Updates to the README. * Added initial getting started example documentation * Addition of Quick Start documentation. (#130) * Addition of Quick Start. * Addition of cancellation status example. * Corrections to quick start and fix for build errors. * Addition of a LULESH study specification with heavier comments. (#132) * Start of a LULESH spec that's commented. * Addition of explanation of step structure. * Additional comments about steps and dependencies. * Minor tweaks to comments. * Addition of parameter block comments. * Addition of reference to launcher_token YAML * Tweaks and edits based on feedback. * Throttled workflows push steps into the ready queue multiple times. (#136) * Moved 'mark_submitted' into ExecutionGraph 'execute' * Conversion of ready_steps deque to use step names. * Capping the available slots to the length of ready_steps. * Fixes Popen output to use universal newlines. (#143) * Addition of a shell util to centralize process creation settings. * Switch to start_process method. * Addition of docstring for start_process. * Update to fix universal newline from false to true. * Addition of a check for a list and a hard set of shell to False. * Made conductor launch command a string. * Fix behavior of start_process for localscriptadapter. The shell=False flag was not being passed, nor were the env or cwd arguments being used in the actual function. * Clean up of start_process * Cleans up GitDependency errors to not rely on return codes. (#144) * Addition of a ping method. * Tweaks to error logging. * Moved specific existence exception out of return code check. * Cleanup of some merge characters. * Tweak for connectivity checking. * Addition of info to the error message. * Removal of git return codes since they don't mean anything. * Update version number to 1.1.3.
Previously, the conductor made the assumption that if a step timed out that the user specified a way to restart. That assumption does not necessarily hold because a restart can not be specified. This PR corrects that assumption by adding the capability to check if a record can be restarted and does the appropriate book keeping if it cannot be.