Correct terminate order #891

Open
wanyunSu wants to merge 2 commits into develop from wanyunSu/SSH-terminate

Conversation

@wanyunSu
Contributor

@wanyunSu wanyunSu commented Apr 27, 2026

Description

Fixes issue #867

During shutdown, the SSH process manager incorrectly classified deeply nested applications as having an unknown role. The root cause was in ProcessMetadata.compute_role_from_tree_id, which used a hard-coded dot-depth check.

  • Added an is_controller: bool = False parameter to compute_role_from_tree_id.
  • Role mapping now mirrors K8s:
    • controller + 0.* → segment-controller
    • non-controller + 0.* → application (at any depth)
    • anything else → infrastructure-applications
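
For reviewers, the three mapping rules above can be sketched roughly as follows. This is not the code from the PR: sketch_compute_role is a hypothetical stand-in for ProcessMetadata.compute_role_from_tree_id, the tree IDs are made up, and "0.*" is assumed to mean "any tree ID starting with 0.". The description does not say how the root controller is identified, so that case is left out.

```python
def sketch_compute_role(tree_id: str, is_controller: bool = False) -> str:
    """Hypothetical stand-in for ProcessMetadata.compute_role_from_tree_id."""
    # Assumption: "0.*" means any tree ID that starts with "0.", at any depth,
    # so no dot counting is involved.
    if tree_id.startswith("0."):
        return "segment-controller" if is_controller else "application"
    # Anything else falls through to the infrastructure role.
    return "infrastructure-applications"


# Made-up tree IDs, for illustration only.
print(sketch_compute_role("0.0", is_controller=True))  # segment-controller
print(sketch_compute_role("0.0.1.0"))                  # application
print(sketch_compute_role("1"))                        # infrastructure-applications
```

The point of the sketch is that the role depends only on the prefix and the is_controller flag, so an application nested at any depth still resolves to application.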

Example:

drunc-unified-shell ssh-standalone config/tests/nestedConfig.data.xml test-config claudia-test boot terminate

drunc-unified-shell > terminate
[2026/04/28 08:36:41 UTC] INFO       ssh_process_manager.py:200               drunc.process_manager.SSH_SHELL_process_manager    Terminating
[2026/04/28 08:36:41 UTC] INFO       ssh_process_manager.py:203               drunc.process_manager.SSH_SHELL_process_manager    Killing all the known processes before exiting
[2026/04/28 08:36:41 UTC] INFO       ssh_process_lifetime_manager_shell.py:61 drunc.drunc.processes.ssh_process_lifetime_manager --- Shutdown stage: Terminating role 
'application' from provided UUIDs ---
[2026/04/28 08:36:41 UTC] INFO       ssh_process_lifetime_manager_shell.py:53 drunc.drunc.processes.ssh_process_lifetime_manager Killing 3 process(es) with role 'application' 
from 8 candidates
[2026/04/28 08:36:52 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.drunc.processes.ssh_process_lifetime_manager Remote process 
364ad6be-6b33-4a04-863c-75e0eb82e3e7 (PID 307729) did not terminate after SIGQUIT signal.
[2026/04/28 08:36:52 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.drunc.processes.ssh_process_lifetime_manager Remote process 
dd1e5ce3-1059-48a7-ad52-c659f4966a8b (PID 309350) did not terminate after SIGQUIT signal.
[2026/04/28 08:36:52 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.drunc.processes.ssh_process_lifetime_manager Remote process 
7f4a29b7-4a94-4538-9833-971436258518 (PID 308845) did not terminate after SIGQUIT signal.
[2026/04/28 08:36:53 UTC] INFO       ssh_process_manager.py:302               drunc.process_manager.SSH_SHELL_process_manager    Process 'bottom-segment-1-application' 
(session: 'claudia-test', user: 'wasu') process exited with exit code 137
[2026/04/28 08:36:53 UTC] INFO       ssh_process_manager.py:302               drunc.process_manager.SSH_SHELL_process_manager    Process 'bottom-segment-2-application' 
(session: 'claudia-test', user: 'wasu') process exited with exit code 137
[2026/04/28 08:36:53 UTC] INFO       ssh_process_manager.py:302               drunc.process_manager.SSH_SHELL_process_manager    Process 'nested-segment-application' (session:
'claudia-test', user: 'wasu') process exited with exit code 137
[2026/04/28 08:36:53 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.drunc.processes.ssh_process_lifetime_manager Remote process 
364ad6be-6b33-4a04-863c-75e0eb82e3e7 (PID 307729) terminated forcibly following SIGKILL signal.
[2026/04/28 08:36:53 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.drunc.processes.ssh_process_lifetime_manager Remote process 
dd1e5ce3-1059-48a7-ad52-c659f4966a8b (PID 309350) terminated forcibly following SIGKILL signal.
[2026/04/28 08:36:53 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.drunc.processes.ssh_process_lifetime_manager Remote process 
7f4a29b7-4a94-4538-9833-971436258518 (PID 308845) terminated forcibly following SIGKILL signal.
[2026/04/28 08:36:54 UTC] INFO       ssh_process_lifetime_manager_shell.py:62 drunc.drunc.processes.ssh_process_lifetime_manager --- Shutdown stage: Role 'application' 
complete ---
[2026/04/28 08:36:54 UTC] INFO       ssh_process_lifetime_manager_shell.py:61 drunc.drunc.processes.ssh_process_lifetime_manager --- Shutdown stage: Terminating role 
'segment-controller' from provided UUIDs ---
[2026/04/28 08:36:54 UTC] INFO       ssh_process_lifetime_manager_shell.py:53 drunc.drunc.processes.ssh_process_lifetime_manager Killing 3 process(es) with role 
'segment-controller' from 8 candidates
[2026/04/28 08:36:54 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.drunc.processes.ssh_process_lifetime_manager Remote process 
54fe6167-1ef8-4df7-abe6-9ee2be87e92e (PID 306816) terminated gracefully following SIGQUIT signal.
[2026/04/28 08:36:55 UTC] INFO       ssh_process_manager.py:302               drunc.process_manager.SSH_SHELL_process_manager    Process 'bottom-segment-1-controller' 
(session: 'claudia-test', user: 'wasu') process exited with exit code 0
[2026/04/28 08:36:55 UTC] INFO       ssh_process_manager.py:302               drunc.process_manager.SSH_SHELL_process_manager    Process 'nested-segment-controller' (session: 
'claudia-test', user: 'wasu') process exited with exit code 0
[2026/04/28 08:36:55 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.drunc.processes.ssh_process_lifetime_manager Remote process 
e0047635-c00c-42e7-a812-21a07ca7d632 (PID 308521) terminated gracefully following SIGQUIT signal.
[2026/04/28 08:36:55 UTC] INFO       ssh_process_manager.py:302               drunc.process_manager.SSH_SHELL_process_manager    Process 'bottom-segment-2-controller' 
(session: 'claudia-test', user: 'wasu') process exited with exit code 0
[2026/04/28 08:36:55 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.drunc.processes.ssh_process_lifetime_manager Remote process 
5d1de534-0f2b-4c7d-85f2-0790444defa7 (PID 306234) terminated gracefully following SIGQUIT signal.
[2026/04/28 08:36:56 UTC] INFO       ssh_process_lifetime_manager_shell.py:62 drunc.drunc.processes.ssh_process_lifetime_manager --- Shutdown stage: Role 'segment-controller' 
complete ---
[2026/04/28 08:36:56 UTC] INFO       ssh_process_lifetime_manager_shell.py:61 drunc.drunc.processes.ssh_process_lifetime_manager --- Shutdown stage: Terminating role 
'root-controller' from provided UUIDs ---
[2026/04/28 08:36:56 UTC] INFO       ssh_process_lifetime_manager_shell.py:53 drunc.drunc.processes.ssh_process_lifetime_manager Killing 1 process(es) with role 
'root-controller' from 8 candidates
[2026/04/28 08:36:58 UTC] INFO       ssh_process_manager.py:302               drunc.process_manager.SSH_SHELL_process_manager    Process 'top-segment-controller' (session: 
'claudia-test', user: 'wasu') process exited with exit code 0
[2026/04/28 08:36:59 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.drunc.processes.ssh_process_lifetime_manager Remote process 
1f7dcdf4-db1a-4c3d-91b1-0314d670068a (PID 305600) terminated gracefully following SIGQUIT signal.
[2026/04/28 08:36:59 UTC] INFO       ssh_process_lifetime_manager_shell.py:62 drunc.drunc.processes.ssh_process_lifetime_manager --- Shutdown stage: Role 'root-controller' 
complete ---
[2026/04/28 08:36:59 UTC] INFO       ssh_process_lifetime_manager_shell.py:61 drunc.drunc.processes.ssh_process_lifetime_manager --- Shutdown stage: Terminating role 
'infrastructure-applications' from provided UUIDs ---
[2026/04/28 08:36:59 UTC] INFO       ssh_process_lifetime_manager_shell.py:53 drunc.drunc.processes.ssh_process_lifetime_manager Killing 1 process(es) with role 
'infrastructure-applications' from 8 candidates
[2026/04/28 08:37:00 UTC] INFO       ssh_process_manager.py:302               drunc.process_manager.SSH_SHELL_process_manager    Process 'local-connection-server' (session: 
'claudia-test', user: 'wasu') process exited with exit code 0
[2026/04/28 08:37:01 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.drunc.processes.ssh_process_lifetime_manager Remote process 
3354a57f-da79-4b5b-bbc7-a666fa0ae4d4 (PID 303829) terminated gracefully following SIGQUIT signal.
[2026/04/28 08:37:01 UTC] INFO       ssh_process_lifetime_manager_shell.py:62 drunc.drunc.processes.ssh_process_lifetime_manager --- Shutdown stage: Role 
'infrastructure-applications' complete ---

                                                        Terminated process                                                         
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
┃ session      ┃ friendly name                      ┃ user ┃ host      ┃ uuid                                 ┃ alive ┃ exit-code ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
│ claudia-test │ local-connection-server            │ wasu │ localhost │ 08887eb9-0d47-408b-bed8-372c41e1757e │ False │ 0         │
│ claudia-test │ top-segment-controller             │ wasu │ localhost │ 0b6d2be8-16a4-46d4-bbff-5ca162758df9 │ False │ 0         │
│ claudia-test │   nested-segment-controller        │ wasu │ localhost │ 4076042b-66b5-489a-a58f-df05a7133626 │ False │ 0         │
│ claudia-test │     bottom-segment-1-controller    │ wasu │ localhost │ 06ebaea9-d4fe-4954-8481-dea6b0029b39 │ False │ 0         │
│ claudia-test │       bottom-segment-1-application │ wasu │ localhost │ bf3a70ed-b9b3-47c6-8607-efe4c986913c │ False │ 137       │
│ claudia-test │     bottom-segment-2-controller    │ wasu │ localhost │ efa238df-a449-4b5c-9edd-213020a92b0a │ False │ 0         │
│ claudia-test │       bottom-segment-2-application │ wasu │ localhost │ bbe9ed07-411c-4560-a5ea-1ba0c52543b3 │ False │ 137       │
│ claudia-test │     nested-segment-application     │ wasu │ localhost │ ddf067d5-67b6-4871-be47-6822aea141f5 │ False │ 0         │
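
The stage order in the log above (application, then segment-controller, then root-controller, then infrastructure-applications) can be sketched as a simple grouping step. This is an illustration only: plan_shutdown and the short process IDs are hypothetical and are not taken from the drunc code base.

```python
from collections import defaultdict

# Stage order as it appears in the log above.
SHUTDOWN_ORDER = [
    "application",
    "segment-controller",
    "root-controller",
    "infrastructure-applications",
]


def plan_shutdown(roles_by_uuid: dict[str, str]) -> list[tuple[str, list[str]]]:
    """Group process IDs by role and return the non-empty stages in order."""
    by_role: dict[str, list[str]] = defaultdict(list)
    for uuid, role in roles_by_uuid.items():
        by_role[role].append(uuid)
    # Roles outside SHUTDOWN_ORDER never get a stage in this sketch.
    return [(role, by_role[role]) for role in SHUTDOWN_ORDER if role in by_role]


# Made-up IDs, for illustration only.
stages = plan_shutdown(
    {
        "a": "segment-controller",
        "b": "application",
        "c": "infrastructure-applications",
        "d": "root-controller",
        "e": "application",
    }
)
for role, uuids in stages:
    print(role, uuids)
```

In this sketch a process whose role is not in the ordered list is never assigned a stage, which illustrates why a wrong role assignment matters for ordered termination.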

Type of change

  • New feature / enhancement
  • Optimization
  • Bug fix
  • Breaking change
  • Documentation

List of required branches from other repositories

N/A

Change log

See description.

Suggested manual testing checklist

See description.

Developer checklist

Prior to marking this as "Ready for Review"

Tests were run on: np04 019 from release NFD_DEV_260427_A9

Unit tests - some tests can't be run on the CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.

Integration tests - the daqsystemtest_integtest_bundle requires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .

  • Unit tests (pytest --marker) passed
    • With relevant marker
    • Without marker
  • Integration tests passed
    • Only daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py
    • Full daqsystemtest_integtest_bundle.sh
  • Testing skipped as there are no core code changes in this PR, this only relates to documentation/CI workflows

Final checklist prior to marking this as "Ready for Review"

  • Code is clearly commented.
  • New unit tests have been added, or their absence is documented in # ISSUE NUMBER
  • A suitable reviewer has been chosen from this list.

Reviewer checklist

  • This branch has been rebased with develop prior to testing.
  • Suggested manual tests show changes.
  • CI workflow failures documented (if present)
  • Integration tests passed
    • Only concern yourself if failures related to drunc are in the log files
    • If non-drunc failure appears:
      • Validate failure in fresh working area
      • Contact Pawel if unsure

Once the features are validated and both the unit and integration tests pass, the PR is ready to be merged.

Prior to merging

Choose one of the following and complete all substeps
  • Changes only affect the Run Control, are in a single repository, and do not affect the end user.
    • Changes are documented in docstrings and code comments
    • Wiki has been updated if architectural or endpoint changes
  • Otherwise
    • Workflow changes demonstrated in the Change Log (if necessary)
    • Wiki has been updated (if necessary)
    • #daq-sw-librarians Slack channel notified (see below)

Once completed, the reviewer can merge the PR.

Notification message for a Slack channel

Note - this should be to #dunedaq-integration for general workflow that isn't during a release candidate period, and to #daq-release-prep otherwise.

For a single merge that changes the user workflow

The CCM WG has an isolated PR ready to merge that affects user workflows. The PR is:

_URL_

I will leave time for any comments, otherwise will merge these at the end of the work day _Insert your time zone_.

For co-ordinated merge

The CCM WG has a set of co-ordinated merges ready to merge. The PRs are:

_URL_

_URL_


I will leave time for any comments, otherwise will merge these at the end of the day.

Development

Successfully merging this pull request may close these issues.

[Bug]: App roles unknown have only been addressed for the k8s process manager, not for the ssh process manager

2 participants