Skip to content

Zombie process problem#887

Merged
PawelPlesniak merged 5 commits intoprep-release/fddaq-v5.6.0from
PawelPlesniak/WatcherZombies
Apr 24, 2026
Merged

Zombie process problem#887
PawelPlesniak merged 5 commits intoprep-release/fddaq-v5.6.0from
PawelPlesniak/WatcherZombies

Conversation

@PawelPlesniak
Copy link
Copy Markdown
Collaborator

@PawelPlesniak PawelPlesniak commented Apr 23, 2026

Description

Fixes issue #871

Removes the PTY allocation from the subprocesses.

The relevant changes in the user workflow have been documented within the code

Type of change

  • New feature / enhancement
  • Optimization
  • Bug fix
  • Breaking change
  • Documentation

List of required branches from other repositories

N/A

Change log

The runtime policy of users vs superusers is configured differnetly. For users, if you run

loginctl show-session $(cat /proc/self/sessionid)

you will see

State=closing

but if you run this as np04daq you will see

State=active

When the process is closed as a general user, systemd kills everything within the killed process scope. When the process is closed as np04daq, it does not do the same. The watcher process is orphaned, and as the state is closing (active) for the general user (superuser), the process is killed (not killed when -tt is used). When run as a PTY, the process is left persistent for the superuser, which does not get cleaned up. Removing the PTY allocation menas that the OS is not responsible for the cleanup, instead, the process is now a pipe which gets a SIGPIPE, cleaning up the process.

Suggested manual testing checklist

Run an integration test in the 4 combinations of in a tmux, not in a tmux, and as both yourself and as np04daq.
Verify the connection details.
Validate that without this PR, when you run the integration test either in or out of a tmux, the watcher threads are still alive. With this PR, they should clean up.

Developer checklist

Prior to marking this as "Ready for Review"

Tests ran on: np04-srv-029 from release fddaq-v5.6.0-rc3

Unit tests - some tests can't be ran on the CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.

Integration tests - the daqsystemtest_integtest_bundle requires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .

  • Unit tests (pytest --marker) passed
    • With relevant marker
    • Without marker
  • Integration tests passed
    • Only daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py
    • Full daqsystemtest_integtest_bundle.sh
  • Testing skipped as there are no core code changes in this PR, this only relates to documentation/CI workflows

Note - only the MSQT was run as this is a very small change.

Final checklist prior to marking this as "Ready for Review"

  • Code is clearly commented.
  • New unit tests have been added, or is documented in # ISSUE NUMBER - new unit tests have not been written as we want this to be ready for the release.
  • A suitable reviewer has been chosen from this list.

Reviewer checklist

  • This branch has been rebased with develop prior to testing.
  • Suggested manual tests show changes.
  • CI workflows fails documented (if present)
  • Integration tests passed
    • Only concern yourself if failures related to drunc are in the log files
    • If non-drunc failure appears:
      • Validate failure in fresh working area
      • Contact Pawel if unsure

Once the features are validated and both the unit and integration tests pass, the PRs is ready to be merged.

Prior to merging

Choose one of the following an complete all substeps
  • Changes only affect the Run Control, are in a single repository, and do not affect the end user.
    • Changes are documented in docstrings and code comments
    • Wiki has been updated if architectural or endpoint changes
  • Otherwise
    • Workflow changes demonstrated in the Change Log (if necessary)
    • Wiki has been updated (if necessary)
    • #daq-sw-librarians Slack channel notified (see below)

Once completed, the reviewer can merge the PR.

Notification message for a Slack channel

Note - this should be to #dunedaq-integration for general workflow that isn't during a release candidate period, and to #daq-release-prep otherwise.

For an single merge that changes the user workflow

The CCM WG has an isolated PR ready to merge that affects user workflows. The PR is:

_URL_

I will leave time for any comments, otherwise will merge these at the end of the work day _Insert your time zone_.

For co-ordinated merge

The CCM WG has a set of co-ordinated merges ready to merge. The PRs are:

_URL_

_URL_


I will leave time for any comments, otherwise will merge these at the end of the day.

@PawelPlesniak PawelPlesniak changed the title Removing -tt from the SSH options Zombie process problem Apr 23, 2026
@Aurashk
Copy link
Copy Markdown
Contributor

Aurashk commented Apr 24, 2026

Nice one @PawelPlesniak!

We may need to only disable -tt for the ssh watcher threads. I think in the case where the remote pid can't be found/written for whatever reason, and we rely instead on killing the ssh client to clean up, you need -tt for that to work, or you'll have other zombies. Just going to run some tests and get back to you shortly.

@PawelPlesniak
Copy link
Copy Markdown
Collaborator Author

Thanks Aurash. The process termination is correct for the controllers and apps for superusers, so I agree it could just be for the watchers. Unless there is a change in the process watching logic that we could implement to account for the superuser config, this would likely be the safest method

@Aurashk
Copy link
Copy Markdown
Contributor

Aurashk commented Apr 24, 2026

It seems to work although will need checking on np04:

I replaced build_ssh_arguments with this:

    def _build_ssh_arguments(
        self, hostname: str, user_host: str, use_tty: bool = True
    ) -> List[str]:
        """
        Build standard SSH arguments with host key checking policy.

        Args:
            hostname: Target hostname for policy determination
            user_host: User@hostname string for SSH connection
            use_tty: Whether to allocate a pseudo-terminal

        Returns:
            List of SSH command arguments
        """
        disable_host_key_check = self.disable_host_key_check or (
            self.disable_localhost_host_key_check
            and hostname in ("localhost", "127.0.0.1", "::1")
        )

        arguments = [user_host, "-o", "StrictHostKeyChecking=no"]

        if use_tty:
            arguments.append("-tt")

        if disable_host_key_check:
            arguments.extend(
                [
                    "-o",
                    "LogLevel=error",
                    "-o",
                    "GlobalKnownHostsFile=/dev/null",
                    "-o",
                    "UserKnownHostsFile=/dev/null",
                ]
            )

        return arguments

Then only in the call to build_ssh_arguments in _monitor_remote_process you set use_tty to False.

To see/test with the problematic path the easiest hacky way is to set metadata = None in the run method of ProcessWatcherThread If all tty are disabled (without the fix above) a simple boot and terminate leaves zombie processes in all cases.

It might be worth extending these tests to do specific checking for the possible zombies that could be left. As we probably would then catch these issues at the pytest stage, but I'll make a separate issue for that.

@PawelPlesniak
Copy link
Copy Markdown
Collaborator Author

With the current branch, the following tests all did not leave zombie processes as neither the superuser or regular user.

  • Suggested pytests
  • Running an integration test as the superuser

I am now running the full integration test suite on np04-srv-029 as np04daq, will comment the results and if there are any leftover watchers after

@PawelPlesniak
Copy link
Copy Markdown
Collaborator Author

Tests all passed, no zombies observed.

+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++ SUMMARY ++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Fri Apr 24 04:19:26 PM CEST 2026
Log file is: /tmp/pytest-of-np04daq/dunedaq_integtest_bundle_20260424152422.log

⮕ Running daqsystemtest/3ru_1df_multirun_test.py ⬅
======================== 6 passed ✅ in 301.55s (0:05:01) =========================
⮕ Running daqsystemtest/3ru_3df_multirun_test.py ⬅
======================== 6 passed ✅ in 330.65s (0:05:30) =========================
⮕ Running daqsystemtest/example_system_test.py ⬅
======================== 12 passed ✅ in 361.80s (0:06:01) ========================
⮕ Running daqsystemtest/fake_data_producer_test.py ⬅
======================== 6 passed ✅ in 286.72s (0:04:46) =========================
⮕ Running daqsystemtest/long_window_readout_test.py ⬅
============================== 1 skipped 🟡 in 1.71s ==============================
⮕ Running daqsystemtest/minimal_system_quick_test.py ⬅
========================= 4 passed ✅ in 77.67s (0:01:17) =========================
⮕ Running daqsystemtest/readout_type_scan_test.py ⬅
======================== 33 passed ✅ in 860.49s (0:14:20) ========================
⮕ Running daqsystemtest/sample_ehn1_multihost_test.py ⬅
======================== 4 skipped 🟡 in 93.37s (0:01:33) =========================
⮕ Running daqsystemtest/small_footprint_quick_test.py ⬅
========================= 3 passed ✅ in 77.52s (0:01:17) =========================
⮕ Running daqsystemtest/tpg_state_collection_test.py ⬅
======================== 5 passed ✅ in 144.23s (0:02:24) =========================
⮕ Running daqsystemtest/tpreplay_test.py ⬅
======================== 6 passed ✅ in 172.40s (0:02:52) =========================
⮕ Running daqsystemtest/tpstream_writing_test.py ⬅
======================== 4 passed ✅ in 144.31s (0:02:24) =========================
⮕ Running daqsystemtest/trigger_bitwords_test.py ⬅
======================== 18 passed ✅ in 432.86s (0:07:12) ========================

@PawelPlesniak
Copy link
Copy Markdown
Collaborator Author

@Aurashk @mroda88 @MRiganSUSX if one of you can look over again, this is ready for review

@Aurashk
Copy link
Copy Markdown
Contributor

Aurashk commented Apr 24, 2026

LGTM, only adjustment I would make is move the specific comments about -tt to the place where it's non default in monitor_remote_process

Copy link
Copy Markdown
Contributor

@Aurashk Aurashk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm!

@PawelPlesniak PawelPlesniak merged commit fc62f06 into prep-release/fddaq-v5.6.0 Apr 24, 2026
4 checks passed
@PawelPlesniak PawelPlesniak deleted the PawelPlesniak/WatcherZombies branch April 24, 2026 16:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants