Correction to a bug for attempting a restart when a restart isn't specified. by FrankD412 · Pull Request #133 · llnl/maestrowf

FrankD412 · 2018-08-15T22:43:58Z

Previously, the conductor made the assumption that if a step timed out that the user specified a way to restart. That assumption does not necessarily hold because a restart can not be specified. This PR corrects that assumption by adding the capability to check if a record can be restarted and does the appropriate book keeping if it cannot be.

jsemler

I tested the change and everything ran. Approved.

FrankD412 · 2018-08-16T15:59:29Z

@jsemler -- You previously had steps that timed out, was the timeout properly reflected in the status?

jsemler · 2018-08-16T17:11:44Z

Yes, the timeouts are reflected. Looking through the status again I'm noticing some of the steps don’t include the restart count in the status.csv.

FrankD412 · 2018-08-16T17:12:38Z

@jsemler -- Can you provide an example?

jsemler · 2018-08-17T02:27:12Z

It looks like the restart count is in the status.csv file, but there is an issue with it displaying via the console for some reason:
run-sedov run-sedov RUNNING -1 day 22:29:02.592852 0:01:00.479318 2018-08-16 19:03:40.142983 2018-08-16 17:31:42.256517 2018-08-16 17:32:42.735835

jsemler

There seems to still be an issue with a successfully completed step restarting twice. I'm going to do some debugging.

FrankD412 · 2018-08-17T02:45:37Z

Yeah, I was just looking at the log. The weird thing is the conductor launches it, then for a while reports it's complete and that it's being skipped. Then after some point, it's picked up again. Still not sure why.

FrankD412 · 2018-08-17T17:19:32Z

@jsemler -- Alright, after a good bit of digging around, I think I've narrowed the bug down. I apologize if this explanation is somewhat convoluted, but I'll do what I can to make it clear.

In the ExecutionGraph method execute_ready_steps, there are a number of back end sets that track various aspects of execution, including failed steps, in progress steps, completed steps, dependency completions, and various other things (which for this discussion aren't important). The major thing that gets tracked are the ready steps which are in a deque object. When a step is deemed to be ready, its StepRecord object is pushed to the ready_steps queue. However, in the execute_ready_steps method, the following code currently exists (line 785 in executiongraph.py):

# If the gating dependencies set is empty, we can execute.
if not self._dependencies[record.name]:
if key not in self.ready_steps:
    logger.debug("All dependencies completed. Staging.")
    self.ready_steps.append(record)
else:
    logger.debug("Already staged. Passing.")
    continue

I noticed that in the log you provided that steps that were repeated would always trigger the "All dependencies completed. Staging." log message and I never saw the passing message in the else. That's likely because when I refactored and cleaned up the method last time, I made the transition to tracking records by their name instead of the object in order to clean up a lot of the record.name referencing I was doing. Somewhere along the way, it looks like I forgot to transition the ready_steps queue to the same structure and so it goes to check if the key is in the deque and never finds it since it's a queue of StepRecord objects.

I don't expect the fix to this will be difficult, so I'll try and have it completed today. We can still go through it if you'd like.

Edit: As an extra note, I think this issue only appears when throttling because that is the only time you'll have steps queued that are still in the INITIALIZED state. If not throttling, you'd never run into a duplicate in the queue since those steps would be in at least PENDING when submitted.

FrankD412 · 2018-08-17T17:49:32Z

@jsemler -- I'm thinking the bug fix for the ready_steps bug is a new PR. If you can, review and comment on this PR in terms of fixing the original issue? I'll start a new branch for the other bug.

jsemler · 2018-08-17T19:21:51Z

@FrankD412 -- I agree. This PR fixes the original issue.

* Update setup.py to 1.1.3dev * Updated the getting started documentation * Added examples for running both lulesh specs * Flux for Spectrum bugfixes (#116) * Additional arguments can be passed through batch. * Correction of the use of extend to append. * Removal of OMPI vars from env. * Reversal of env altering. * Addition of mpi exe to batch. * Removal of -gpu flag * Addition of "allocated" to PENDING set. * Correction of the check_status method call. * Correction of the Flux import * Addition of the EnvironmentError to try/catch for check_status. * Addition of jobid to submission INFO logging. * Workflow setup fix (#117) * Fixed a workflow setup issue caused by spaces in the spec name * Additional dirname formatting * Refactor of the Study class to breakdown complex APIs (#118) * Minor docstring correction. * Change to study setup API to break out workspace creation. * Update Maestro frontend to use new Study method. * Renamed Study.setup to be better communicate its functionality. * Correction to style for configure_study method. * Moved environment application to add_step. * Split out acquiring environment elements to its own method. * Updates to staging checks. * Renamed _setup_linear and _setup_parameterized from setup to stage for clarity. * Tweaks and addition of starting to SpectrumFluxAdapter. (#119) * Updated the LULESH examples to use the LULESH git repository for cloning the repo (#121) * Add the generation of metadata to Study construction. (#120) * Addition of method and tweaks to support metadata. * Addition of call to the method for generating metadata. * Correction to encode YAML string for py3 compatibility. * Another attempt at py2-py3 compatible encoding. * Addition of load_metadata method. * Additional debug logging for the Flux Spectrum Adapter (#122) * Additional logging for debugging of the adapter. * Additional logging. * Updated Exception.message references to Exception.args (#125) * Fix status.csv 'State' column writeout. (#127) * #107 Enhance/return codes (#128) * #107 Added StudyStatus enumeration Changed DAG execute_ready_steps to return StudyStatus enum instead of a boolean conductor and maestro also accept StudyStatus and convert to return value * #128 fixed issues found in pull request * #107 added inline comments per PC * #107 pulling out _check_study_completion to avoid initial implementaion copy and paste error * #107 fixed line length issue * Bugfix that would cause the ExecutionGraph to not update when cancelled. (#131) * Addition of custom parameter generation when running studies. (#129) * Command line parameter for custom ParameterGenerators. * Addition of sample custom parameter generation. * Correction to missing parameters in SIZE. * Addition of utility method for importing custom gen py files. * Correction of syntax. * Conversion of DOS newlines to Unix. * Addition of call to check for custom code. * Correction of a bad docstring. * Addition of parameter metadata to Maestro. * Addition of parameters to study metadata. * Addition of copying of the parameter file to study workspace. * Correction to a bug for attempting a restart when a restart isn't specified. (#133) * Initial fix for steps that cannot restart. * Comments to clarify logic. * More flexible ExecutionGraph description API and logging of description. (#137) * Tweaks to how the ExecutionGraph adds descriptions. Added logging. * Updated study to use the new API and some minor refactor. * Fixed #134 * Fix to Variable verification. (#138) Fixes #123 * Updates to Maestro documentation (#114) * Updates to the README. * Added initial getting started example documentation * Addition of Quick Start documentation. (#130) * Addition of Quick Start. * Addition of cancellation status example. * Corrections to quick start and fix for build errors. * Addition of a LULESH study specification with heavier comments. (#132) * Start of a LULESH spec that's commented. * Addition of explanation of step structure. * Additional comments about steps and dependencies. * Minor tweaks to comments. * Addition of parameter block comments. * Addition of reference to launcher_token YAML * Tweaks and edits based on feedback. * Throttled workflows push steps into the ready queue multiple times. (#136) * Moved 'mark_submitted' into ExecutionGraph 'execute' * Conversion of ready_steps deque to use step names. * Capping the available slots to the length of ready_steps. * Fixes Popen output to use universal newlines. (#143) * Addition of a shell util to centralize process creation settings. * Switch to start_process method. * Addition of docstring for start_process. * Update to fix universal newline from false to true. * Addition of a check for a list and a hard set of shell to False. * Made conductor launch command a string. * Fix behavior of start_process for localscriptadapter. The shell=False flag was not being passed, nor were the env or cwd arguments being used in the actual function. * Clean up of start_process * Cleans up GitDependency errors to not rely on return codes. (#144) * Addition of a ping method. * Tweaks to error logging. * Moved specific existence exception out of return code check. * Cleanup of some merge characters. * Tweak for connectivity checking. * Addition of info to the error message. * Removal of git return codes since they don't mean anything. * Update version number to 1.1.3.

FrankD412 added 2 commits August 15, 2018 15:14

Initial fix for steps that cannot restart.

7757e8f

Comments to clarify logic.

d7014dd

FrankD412 added the Bugfix Discussion/Implementation of a solution to a bug (usually a PR) label Aug 15, 2018

FrankD412 self-assigned this Aug 15, 2018

FrankD412 requested a review from jsemler August 15, 2018 22:43

jsemler approved these changes Aug 16, 2018

View reviewed changes

jsemler requested changes Aug 17, 2018

View reviewed changes

FrankD412 mentioned this pull request Aug 17, 2018

Throttled workflows push steps into the ready queue multiple times. #136

Merged

jsemler approved these changes Aug 17, 2018

View reviewed changes

FrankD412 merged commit 786d523 into develop Aug 17, 2018

FrankD412 deleted the bugfix/restart_exception branch August 17, 2018 19:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Correction to a bug for attempting a restart when a restart isn't specified.#133

Correction to a bug for attempting a restart when a restart isn't specified.#133
FrankD412 merged 2 commits intodevelopfrom
bugfix/restart_exception

FrankD412 commented Aug 15, 2018

Uh oh!

jsemler left a comment

Uh oh!

FrankD412 commented Aug 16, 2018

Uh oh!

jsemler commented Aug 16, 2018

Uh oh!

FrankD412 commented Aug 16, 2018

Uh oh!

jsemler commented Aug 17, 2018

Uh oh!

jsemler left a comment

Uh oh!

FrankD412 commented Aug 17, 2018

Uh oh!

FrankD412 commented Aug 17, 2018 •

edited

Loading

Uh oh!

FrankD412 commented Aug 17, 2018

Uh oh!

jsemler commented Aug 17, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FrankD412 commented Aug 15, 2018

Uh oh!

jsemler left a comment

Choose a reason for hiding this comment

Uh oh!

FrankD412 commented Aug 16, 2018

Uh oh!

jsemler commented Aug 16, 2018

Uh oh!

FrankD412 commented Aug 16, 2018

Uh oh!

jsemler commented Aug 17, 2018

Uh oh!

jsemler left a comment

Choose a reason for hiding this comment

Uh oh!

FrankD412 commented Aug 17, 2018

Uh oh!

FrankD412 commented Aug 17, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

FrankD412 commented Aug 17, 2018

Uh oh!

jsemler commented Aug 17, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

FrankD412 commented Aug 17, 2018 •

edited

Loading