Zombie process problem#887
Conversation
|
Nice one @PawelPlesniak! We may need to only disable |
|
Thanks Aurash. The process termination is correct for the controllers and apps for superusers, so I agree it could just be for the watchers. Unless there is a change in the process watching logic that we could implement to account for the superuser config, this would likely be the safest method |
|
It seems to work although will need checking on I replaced Then only in the call to To see/test with the problematic path the easiest hacky way is to set It might be worth extending these tests to do specific checking for the possible zombies that could be left. As we probably would then catch these issues at the |
|
With the current branch, the following tests all did not leave zombie processes as neither the superuser or regular user.
I am now running the full integration test suite on |
|
Tests all passed, no zombies observed. |
|
@Aurashk @mroda88 @MRiganSUSX if one of you can look over again, this is ready for review |
|
LGTM, only adjustment I would make is move the specific comments about -tt to the place where it's non default in |
Description
Fixes issue #871
Removes the PTY allocation from the subprocesses.
The relevant changes in the user workflow have been documented within the code
Type of change
List of required branches from other repositories
N/A
Change log
The runtime policy of users vs superusers is configured differnetly. For users, if you run
loginctl show-session $(cat /proc/self/sessionid)you will see
but if you run this as
np04daqyou will seeWhen the process is closed as a general user,
systemdkills everything within the killed process scope. When the process is closed asnp04daq, it does not do the same. The watcher process is orphaned, and as thestateisclosing(active) for the general user (superuser), the process is killed (not killed when-ttis used). When run as a PTY, the process is left persistent for the superuser, which does not get cleaned up. Removing the PTY allocation menas that the OS is not responsible for the cleanup, instead, the process is now a pipe which gets aSIGPIPE, cleaning up the process.Suggested manual testing checklist
Run an integration test in the 4 combinations of in a tmux, not in a tmux, and as both yourself and as np04daq.
Verify the connection details.
Validate that without this PR, when you run the integration test either in or out of a tmux, the watcher threads are still alive. With this PR, they should clean up.
Developer checklist
Prior to marking this as "Ready for Review"
Tests ran on:
np04-srv-029from releasefddaq-v5.6.0-rc3Unit tests - some tests can't be ran on the CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.
Integration tests - the
daqsystemtest_integtest_bundlerequires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .pytest --marker) passeddaqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.pydaqsystemtest_integtest_bundle.shNote - only the MSQT was run as this is a very small change.
Final checklist prior to marking this as "Ready for Review"
Reviewer checklist
druncare in the log filesdruncfailure appears:Once the features are validated and both the unit and integration tests pass, the PRs is ready to be merged.
Choose one of the following an complete all substepsPrior to merging
Once completed, the reviewer can merge the PR.
Notification message for a Slack channel
Note - this should be to #dunedaq-integration for general workflow that isn't during a release candidate period, and to #daq-release-prep otherwise.
For an single merge that changes the user workflow
For co-ordinated merge