improvements to controller/fsm disconnection handling #883

Open
Aurashk wants to merge 8 commits into develop from aurashk/improve-handling-of-connection-problems-in-fsm
@Aurashk Aurashk commented Apr 20, 2026

Description

Fixes issue #516
Several improvements and fixes to the controller, particularly disconnection handling.

Type of change

  • New feature / enhancement
  • Optimization
  • Bug fix
  • Breaking change
  • Documentation

Change log

  1. The status command now checks that there is a valid connection to every child. Children that can't be reached are shown as disconnected, without changing their internal state. Disconnected children are no longer marked as in error.
  2. Before any fsm command is issued, the connection to every child in the chain is checked. If any child is disconnected, the shell emits a warning and aborts the command, suggesting that you exclude the disconnected child if you want to carry on without it. Once all disconnected children are excluded, fsm commands can go ahead.
  3. Killed REST apps now work properly after a restart (status shows them as not in error, and fsm commands are possible); see this comment for an example. If a connection can't be established, the new REST endpoint is refreshed from the connectivity service, similar to the connection refreshing for gRPC children.
  4. When you exclude a particular target application, its parent won't be excluded by default, which avoids this behaviour.
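
The pre-flight check in items 2 and 4 can be pictured with a small sketch. This is an illustration only, not the code in this PR: `Node`, `find_disconnected`, `apply_state` and `run_fsm_command` are hypothetical names, and reachability is modelled as a plain boolean rather than a real gRPC or REST connection test.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a controller or application in the tree."""
    name: str
    reachable: bool = True   # assumption: result of some connection probe
    excluded: bool = False
    state: str = "initial"
    children: list["Node"] = field(default_factory=list)


def find_disconnected(node: Node, prefix: str = "") -> list[str]:
    """Return the paths of included descendants of `node` that cannot be reached."""
    path = f"{prefix}/{node.name}" if prefix else node.name
    missing = []
    for child in node.children:
        if child.excluded:
            continue  # excluded children are skipped; their parent stays included
        if not child.reachable:
            missing.append(f"{path}/{child.name}")
        else:
            missing.extend(find_disconnected(child, path))
    return missing


def apply_state(node: Node, new_state: str) -> None:
    """Set `new_state` on `node` and on every included descendant."""
    node.state = new_state
    for child in node.children:
        if not child.excluded:
            apply_state(child, new_state)


def run_fsm_command(root: Node, new_state: str) -> list[str]:
    """Abort with one warning per disconnected child, otherwise apply the command."""
    missing = find_disconnected(root)
    if missing:
        # Nothing is modified on abort: internal states are left untouched.
        return [
            f"{p} is disconnected; run 'exclude --target {p}' to continue without it"
            for p in missing
        ]
    apply_state(root, new_state)
    return []


# Usage: df-01 has been killed, so the first command is aborted.
df01 = Node("df-01", reachable=False)
df_controller = Node("df-controller", children=[df01])
root = Node("root-controller", children=[df_controller])

first_try = run_fsm_command(root, "configured")
df01.excluded = True  # only the target is excluded, not df-controller
second_try = run_fsm_command(root, "configured")
```

In this sketch the first call returns a warning for `root-controller/df-controller/df-01` and changes nothing, while the second call configures `root-controller` and `df-controller` and leaves the excluded `df-01` alone.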

Suggested manual testing checklist

  1. drunc-unified-shell ssh-standalone daqsystemtest/config/daqsystemtest/example-configs.data.xml local-1x1-config MyTest
     boot
     status
     kill --name df-01
     status

  2. drunc-unified-shell ssh-standalone daqsystemtest/config/daqsystemtest/example-configs.data.xml local-1x1-config MyTest
     boot
     status
     kill --name df-01
     conf
     exclude --target root-controller/df-controller/df-01
     conf

  3. drunc-unified-shell ssh-standalone daqsystemtest/config/daqsystemtest/example-configs.data.xml local-1x1-config MyTest
     boot
     status
     kill --name df-01
     status
     restart --name df-01
     status
     conf

     Note: you may need to run status twice, because re-connection is asynchronous and doesn't happen instantly.

  4. Same as 2; observe that the parent df-controller will be configured.
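
Because re-connection is asynchronous, a scripted version of the restart test would need to poll rather than check status once. A minimal sketch of such a polling helper follows. `wait_until` is a hypothetical name, not part of drunc, and the status source here is faked with an iterator, so a real test would have to supply its own check.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 10.0, interval: float = 0.5) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Fake status source (assumption): the child reports as connected on the third poll.
replies = iter(["disconnected", "disconnected", "connected"])
reconnected = wait_until(lambda: next(replies) == "connected", timeout=5.0, interval=0.01)
```

The helper uses `time.monotonic()` so that the timeout is not affected by system clock adjustments.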

Developer checklist

Prior to marking this as "Ready for Review"

Tests ran on: hep cluster from develop

Unit tests - some tests can't be run on the CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.

Integration tests - the daqsystemtest_integtest_bundle requires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .

  • Unit tests (pytest --marker) passed
    • With relevant marker
    • Without marker
  • Integration tests passed
    • Only daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py
    • Full daqsystemtest_integtest_bundle.sh
  • Testing skipped as there are no core code changes in this PR, this only relates to documentation/CI workflows

Final checklist prior to marking this as "Ready for Review"

  • Code is clearly commented.
  • New unit tests have been added, or their absence is documented in # ISSUE NUMBER
  • A suitable reviewer has been chosen from this list.

Reviewer checklist

  • This branch has been rebased with develop prior to testing.
  • Suggested manual tests show changes.
  • CI workflow failures documented (if present)
  • Integration tests passed
    • Only concern yourself if failures related to drunc are in the log files
    • If non-drunc failure appears:
      • Validate failure in fresh working area
      • Contact Pawel if unsure

Once the features are validated and both the unit and integration tests pass, the PR is ready to be merged.

Prior to merging

Choose one of the following and complete all substeps
  • Changes only affect the Run Control, are in a single repository, and do not affect the end user.
    • Changes are documented in docstrings and code comments
    • Wiki has been updated if architectural or endpoint changes
  • Otherwise
    • Workflow changes demonstrated in the Change Log (if necessary)
    • Wiki has been updated (if necessary)
    • #daq-sw-librarians Slack channel notified (see below)

Once completed, the reviewer can merge the PR.

Notification message for a Slack channel

Note - send this to #dunedaq-integration for general workflow changes outside a release candidate period, and to #daq-release-prep otherwise.

For a single merge that changes the user workflow

The CCM WG has an isolated PR ready to merge that affects user workflows. The PR is:

_URL_

I will leave time for any comments, otherwise I will merge it at the end of the work day _Insert your time zone_.

For co-ordinated merge

The CCM WG has a set of co-ordinated merges ready to merge. The PRs are:

_URL_

_URL_


I will leave time for any comments, otherwise will merge these at the end of the day.

@Aurashk Aurashk changed the title from "improvements to contorller handling" to "improvements to controller/fsm disconnection handling" Apr 20, 2026
@Aurashk Aurashk requested a review from jamesturner246 April 22, 2026 13:41
@Aurashk Aurashk marked this pull request as ready for review April 22, 2026 13:41
