Update flux exclusive behavior modify task info passed to allocation/launcher#467
Update flux exclusive behavior modify task info passed to allocation/launcher#467
Conversation
…ions and cores/gpus from launcher
…equirements to set nodes if exclusive also set
| ngpus_per_slot = int(ceil(ngpus / nodes)) | ||
| else: | ||
| ngpus_per_slot = None | ||
| LOGGER.warn(f"NODES recieved: {nodes}") |
There was a problem hiding this comment.
typo and also why is it a wraning? Shouldn't this be an info? (unless you want it in the else)
|
|
||
| args = ["flux", "run", "-n", str(procs)] | ||
|
|
||
| args = ["flux", "run"] #, "-n", str(procs)] |
There was a problem hiding this comment.
do we want to keep the commented out bit?
| if nodes is not None and nodes != '': | ||
| args.append("-N") | ||
| args.append(str(nodes)) | ||
| # ntasks = nodes if nodes else 1 |
There was a problem hiding this comment.
same comment do we keep this?
| # args.append(str(ntasks)) | ||
|
|
||
| # NOTE: should we raise an exception here instead of just logging the error? | ||
| # better pre-run validation might be more useful... |
There was a problem hiding this comment.
Yes I agree I'd raise an error since we log it, or downgrade the log to a warning.
|
|
||
| if procs: | ||
| args.append("-n") | ||
| args.append(str(procs)) |
There was a problem hiding this comment.
args.append(f"- n {procs}") ?
There was a problem hiding this comment.
convention on all the adapters so far is each arg as a separate list item, following some of the subprocessing style arguments for shell commands, so going to leave it this way to be consistent
| nodes = None | ||
| if nodes is not None and nodes != '': # Have to account for empty string! | ||
| nodes = int(nodes) | ||
| # if not isinstance(nodes, int): |
|
|
||
| # Calculate nprocs | ||
| ncores = cores_per_task * nodes | ||
| ncores = cores_per_task * processors # What was this check doing? |
There was a problem hiding this comment.
seems like a big change, was it broken before?
There was a problem hiding this comment.
allowing nodes to be 0 or None (never could be before) broke it.
will likely remove it entirely later.. not clear this check is useful given there's no place to eval against the machine... might only catch negative values now, which will be better checked upstream after future refactoring.
| conf: | ||
| resource: | ||
| rediscover: True # Flux wants 'true', not 'True' that python bool serializes to | ||
| rediscover: True |
There was a problem hiding this comment.
stray debris from prior pr: already fixed it so it's not a user facing problem anymore
| launcher_args: | ||
| setopt: | ||
| fastload: | ||
| fastload: |
| conf: | ||
| resource: | ||
| rediscover: "true" # Flux wants 'true', not 'True' that python bool serializes to | ||
| rediscover: "true" |
There was a problem hiding this comment.
so string here but True above? Is it a flux issue?
There was a problem hiding this comment.
either one works: it's json syntax. see 'render_additional_args' in the flux0_49_0.py for explanation
Updates handling of exclusive keys to modify the batch directives and launcher rendering
Updates default renderings:
Unblocks scheduling to nodes with configurable resources: subdividing gpus happens after allocation startup, and prior behavior of attaching tasks/gpus to allocation would cause flux to reject such jobs (unsatisfiable).