
add kernel instruction for when pm parent process is killed #878

Open
Aurashk wants to merge 1 commit into develop from aurashk/add-shell-crash-failsafe-for-forked-pm


@Aurashk
Contributor

@Aurashk Aurashk commented Apr 14, 2026

Description

Fixes issue #871

To prevent zombie processes from the PM when the unified shell is killed ungracefully, the kernel sends SIGTERM to the SSH lifetime manager when the main PM process dies.
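
The description does not say which kernel facility is used. One Linux mechanism that fits it is the parent-death signal (prctl(2) with PR_SET_PDEATHSIG), which util-linux exposes as setpriv --pdeathsig. Whether this PR uses it is an assumption. The sketch below only illustrates the mechanism with stand-in processes: a throwaway bash parent and a sleep child. The function names is_gone and pdeathsig_demo are hypothetical, and the script assumes Linux with /proc mounted and util-linux 2.33 or newer.

```shell
#!/usr/bin/env bash
# Illustration only: shows the Linux parent-death signal, not drunc's own code.
# Assumes Linux with /proc mounted and util-linux >= 2.33 (setpriv --pdeathsig).

# Succeed if PID $1 no longer exists or is only an unreaped zombie entry.
is_gone() {
    [ ! -e "/proc/$1/status" ] ||
        grep -q '^State:[[:space:]]*Z' "/proc/$1/status" 2>/dev/null
}

pdeathsig_demo() {
    local pidfile parent child i
    if ! setpriv --help 2>&1 | grep -q -- '--pdeathsig'; then
        echo "setpriv --pdeathsig unavailable"
        return 2
    fi
    pidfile=$(mktemp)
    # Stand-in parent: starts the child with SIGTERM as its parent-death
    # signal, records the child's PID, then blocks in `wait`.
    bash -c 'setpriv --pdeathsig TERM sleep 300 & echo $! > "$0"; wait' "$pidfile" &
    parent=$!
    # Wait (up to about 5 s) for the child's PID to be recorded.
    for i in $(seq 1 50); do
        [ -s "$pidfile" ] && break
        sleep 0.1
    done
    child=$(cat "$pidfile")
    rm -f "$pidfile"
    if [ -z "$child" ]; then
        echo "could not start child"
        return 3
    fi
    # Give setpriv time to call prctl() and exec sleep; if the parent died
    # before that, no signal would be delivered.
    sleep 0.5
    kill -9 "$parent"                 # ungraceful death, as in the test steps
    wait "$parent" 2>/dev/null || true
    # Poll (up to about 5 s) for the kernel-delivered SIGTERM to end the child.
    for i in $(seq 1 50); do
        if is_gone "$child"; then
            echo "child terminated"
            return 0
        fi
        sleep 0.1
    done
    echo "child still running"
    kill "$child" 2>/dev/null || true  # clean up the stand-in child
    return 1
}
```

Calling pdeathsig_demo prints one status line. The setting survives execve for ordinary binaries, which is why setpriv can set it and then exec the target command. The signal is tied to the thread that created the child, so a multi-threaded parent needs extra care.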

Type of change

  • New feature / enhancement
  • Optimization
  • Bug fix
  • Breaking change
  • Documentation

List of required branches from other repositories

NA

Change log

Suggested manual testing checklist

Find the main PID of the unified shell:

main_pid=$(ps -eo pid,ppid,args --forest | awk '/[d]runc-unified-shell/ {print $1; exit}')
echo "$main_pid"

Kill it with SIGKILL:

kill -9 <main pid here>

Then check for zombies:

ps aux | grep '[s]leep'

You shouldn't see the zombie processes from #871 with this branch.
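
The manual steps above can be scripted. This is a minimal sketch, assuming the leftover processes show up as sleep commands as in the checklist. The function names and the two-second grace period are hypothetical choices, not part of drunc.

```shell
#!/usr/bin/env bash
# Sketch of the manual check as a script. Function names are illustrative.

# Read `ps -eo pid,ppid,args` output on stdin and print the PID (column 1)
# of the first line matching the regex in $1.
first_matching_pid() {
    awk -v pat="$1" '$0 ~ pat { print $1; exit }'
}

# Read the same kind of output on stdin and print how many lines match $1.
count_matching() {
    awk -v pat="$1" '$0 ~ pat { n++ } END { print n + 0 }'
}

# Kill the unified shell with SIGKILL, then report leftover sleep processes.
# The [d] and [s] brackets stop the patterns from matching the awk processes
# that carry those patterns on their own command lines.
check_failsafe() {
    local main_pid leftovers
    main_pid=$(ps -eo pid,ppid,args --forest | first_matching_pid '[d]runc-unified-shell')
    if [ -z "$main_pid" ]; then
        echo "no drunc-unified-shell process found" >&2
        return 1
    fi
    kill -9 "$main_pid"
    sleep 2   # assumed grace period for the kernel signal to take effect
    leftovers=$(ps -eo pid,ppid,args | count_matching '[s]leep')
    echo "leftover sleep processes: $leftovers"
    [ "$leftovers" -eq 0 ]
}
```

check_failsafe is only defined here, not run. It counts every sleep process on the host, so unrelated sleep commands from other sessions would also be reported.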

Developer checklist

Prior to marking this as "Ready for Review"

Tests ran on: hep cluster from release RELEASE_NAME

Unit tests - some tests can't be run on the CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.

Integration tests - the daqsystemtest_integtest_bundle requires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .

  • Unit tests (pytest --marker) passed
    • With relevant marker
    • Without marker
  • Integration tests passed
    • Only daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py
    • Full daqsystemtest_integtest_bundle.sh
  • Testing skipped as there are no core code changes in this PR, this only relates to documentation/CI workflows

Final checklist prior to marking this as "Ready for Review"

  • Code is clearly commented.
  • New unit tests have been added, or the need for them is documented in # ISSUE NUMBER
  • A suitable reviewer has been chosen from this list.

Reviewer checklist

  • This branch has been rebased with develop prior to testing.
  • Suggested manual tests show changes.
  • CI workflow failures are documented (if present)
  • Integration tests passed
    • Only concern yourself if failures related to drunc are in the log files
    • If non-drunc failure appears:
      • Validate failure in fresh working area
      • Contact Pawel if unsure

Once the features are validated and both the unit and integration tests pass, the PR is ready to be merged.

Prior to merging

Choose one of the following and complete all substeps
  • Changes only affect the Run Control, are in a single repository, and do not affect the end user.
    • Changes are documented in docstrings and code comments
    • Wiki has been updated if architectural or endpoint changes
  • Otherwise
    • Workflow changes demonstrated in the Change Log (if necessary)
    • Wiki has been updated (if necessary)
    • #daq-sw-librarians Slack channel notified (see below)

Once completed, the reviewer can merge the PR.

Notification message for a Slack channel

Note - this should go to #dunedaq-integration for general workflow changes outside a release candidate period, and to #daq-release-prep otherwise.

For a single merge that changes the user workflow

The CCM WG has an isolated PR ready to merge that affects user workflows. The PR is:

_URL_

I will leave time for any comments, otherwise will merge these at the end of the work day _Insert your time zone_.

For co-ordinated merge

The CCM WG has a set of co-ordinated merges ready to merge. The PRs are:

_URL_

_URL_


I will leave time for any comments, otherwise will merge these at the end of the day.

@Aurashk Aurashk requested a review from PawelPlesniak April 14, 2026 15:33
@Aurashk Aurashk changed the base branch from develop to prep-release/fddaq-v5.6.0 April 22, 2026 12:41
@Aurashk Aurashk changed the base branch from prep-release/fddaq-v5.6.0 to develop April 22, 2026 12:42
@PawelPlesniak
Collaborator

In the production environment, starting from a clean process list as determined by htop -u as np04daq, we saw the following processes running:
[image]
After completion of the BDE-only session, two watcher processes were still active:
[image]

@PawelPlesniak
Collaborator

After discussion, this PR was intended for the case where the unified shell gets killed, so work on the issue is ongoing.

@PawelPlesniak
Collaborator

Running a simple 1x1 integration test in the standard production framework (np04daq user, np04-srv-024, tmux) also gives these stale processes:

[image]
