AAP-72269 Change fact processing loop to use file listing #16403

AlanCoding wants to merge 3 commits into ansible:devel
Conversation
📝 Walkthrough

Reworks `finish_fact_cache` to read host fact files by scanning the fact_cache_dir, build changed/unchanged host sets by file mtime, then perform two bulk DB phases: bulk-update changed facts and bulk-clear missing hosts; expands exception handling and adds row-count warnings.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Task as finish_fact_cache
    participant FS as fact_cache_dir (filesystem)
    participant DB as Database (Host rows)
    participant Logger as Logger
    Task ->> FS: list files in fact_cache_dir
    FS -->> Task: filenames + mtimes
    Task ->> Task: parse JSON for files with mtime >= facts_write_time
    alt parse error / unreadable file
        Task ->> Logger: warn and skip file
    end
    Task ->> DB: bulk_update_sorted_by_id(changed_facts)
    DB -->> Task: rows_updated_count
    Task ->> Logger: warn if rows_updated_count != expected
    Task ->> DB: bulk_clear_facts(hosts_to_clear)
    DB -->> Task: rows_cleared_count
    Task ->> Logger: warn if rows_cleared_count != expected
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
I would be pushing to merge this, but I have not been able to get clean YOLO results, so now all my effort is being spent on debugging those failures.
@AlanCoding I've tested this change against my environment and can confirm that it works as expected (no more excessive database work_mem use).
Baby Yolo Results

Ran containerized Baby Yolo build #130 (Apr 20) and compared against reference build #129 (devel, same day).

Verdict: This PR does not introduce test regressions.
Actionable comments posted: 2
🧹 Nitpick comments (1)
awx/main/tests/unit/models/test_jobs.py (1)
115-117: Assert the `name__in` filter instead of returning `hosts[1]` unconditionally.

This mock would still pass if `finish_fact_cache` accidentally requested extra hosts to clear. Make the mock validate the queried names so the test protects the new selection behavior.

Proposed test tightening:

```diff
     hosts_qs = mock.MagicMock()
-    # The new code calls host_qs.filter(name__in=...).select_related('inventory')
-    # Only hosts[1] needs clearing (its file was removed), so return just that host
-    hosts_qs.filter.return_value.select_related.return_value = [hosts[1]]
+    filtered_qs = mock.MagicMock()
+    filtered_qs.select_related.return_value = [hosts[1]]
+    hosts_qs.filter.return_value = filtered_qs

     finish_fact_cache(hosts_qs, artifacts_dir=artifacts_dir, inventory_id=inventory_id)
+
+    hosts_qs.filter.assert_called_once()
+    assert set(hosts_qs.filter.call_args.kwargs['name__in']) == {hosts[1].name}
+    filtered_qs.select_related.assert_called_once_with('inventory')
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@awx/main/tests/unit/models/test_jobs.py` around lines 115 - 117, Modify the test's host_qs.filter mock so it asserts the name__in argument contains exactly the expected host names instead of always returning hosts[1]; update the stub for host_qs.filter (the call used by finish_fact_cache) to inspect its kwargs (look for name__in) and only return an object whose select_related(...).return_value is [hosts[1]] when the provided name__in matches the expected list, otherwise raise or return an empty list to fail the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@awx/main/tasks/facts.py`:
- Around line 111-113: The scan currently accumulates all changes into
facts_updates and hosts_to_clear then performs large name__in DB lookups; change
the loop that iterates os.scandir()/file entries to collect hostnames/facts into
fixed-size batches (e.g. BATCH_SIZE), and when the batch fills call
bulk_update_sorted_by_id to process that batch (perform the name__in lookup and
updates), then clear facts_updates and hosts_to_clear for the next batch; repeat
until scan completes and flush any remaining items at the end; keep seen_in_dir
logic but ensure any lookups that relied on global lists are performed per-batch
to avoid huge in-memory maps and oversized DB queries.
- Around line 120-134: Replace the fragile startswith containment check with a
real path containment test using os.path.commonpath: compute real_fact_cache_dir
= os.path.realpath(fact_cache_dir) and real_filepath =
os.path.realpath(filepath) and verify os.path.commonpath([real_fact_cache_dir,
real_filepath]) == real_fact_cache_dir before accepting the file; also protect
the os.path.getmtime call by moving it into a try/except block (catch OSError
and skip the file) so that races where the file is removed between isfile and
getmtime do not crash the facts loop (update references to filepath,
fact_cache_dir, modified, facts_write_time, and seen_in_dir accordingly).
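The batch-and-flush loop described in the first inline comment can be sketched generically. This is a minimal illustration under assumed names (`process_in_batches`, `flush_batch`, and `BATCH_SIZE` are hypothetical, not AWX code); in `finish_fact_cache` the flush step would perform the per-batch `name__in` lookup and bulk update.

```python
# Minimal sketch of the batch-and-flush pattern; all names here are
# hypothetical illustrations, not the actual awx/main/tasks/facts.py code.
BATCH_SIZE = 100


def process_in_batches(entries, flush_batch, batch_size=BATCH_SIZE):
    """Accumulate entries, flush each full batch, then flush any remainder."""
    batch = []
    results = []
    for entry in entries:
        batch.append(entry)
        if len(batch) >= batch_size:
            results.append(flush_batch(batch))  # e.g. name__in lookup + bulk update
            batch = []
    if batch:  # flush whatever is left after the directory scan completes
        results.append(flush_batch(batch))
    return results
```

With this shape, memory held at any time is bounded by `batch_size` entries rather than the whole inventory.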
---
Nitpick comments:
In `@awx/main/tests/unit/models/test_jobs.py`:
- Around line 115-117: Modify the test's host_qs.filter mock so it asserts the
name__in argument contains exactly the expected host names instead of always
returning hosts[1]; update the stub for host_qs.filter (the call used by
finish_fact_cache) to inspect its kwargs (look for name__in) and only return an
object whose select_related(...).return_value is [hosts[1]] when the provided
name__in matches the expected list, otherwise raise or return an empty list to
fail the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 556b60f7-bbb2-420a-a397-29109361d62b
📒 Files selected for processing (2)
awx/main/tasks/facts.py
awx/main/tests/unit/models/test_jobs.py
```diff
             filepath = os.path.join(fact_cache_dir, filename)
             if not os.path.realpath(filepath).startswith(fact_cache_dir):
                 logger.error(f'Invalid path for facts file: {filepath}')
                 continue
             if not os.path.isfile(filepath):
                 continue

             seen_in_dir.add(filename)
             modified = os.path.getmtime(filepath)
-            if not facts_write_time or modified >= facts_write_time:
+            if modified >= facts_write_time:
                 try:
                     with codecs.open(filepath, 'r', encoding='utf-8') as f:
-                        ansible_facts = json.load(f)
-                except ValueError:
+                        facts_updates[filename] = json.load(f)
+                except (ValueError, OSError):
                     continue
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
python - <<'PY'
import os
base = '/tmp/artifacts/fact_cache'
candidate = '/tmp/artifacts/fact_cache_evil/host1'
print("startswith result:", os.path.realpath(candidate).startswith(base))
print("commonpath result:", os.path.commonpath([os.path.realpath(base), os.path.realpath(candidate)]) == os.path.realpath(base))
PY
```

Repository: ansible/awx
Length of output: 103

🏁 Script executed:

```bash
# Check the actual file and surrounding context
cat -n awx/main/tasks/facts.py | sed -n '110,145p'
```

Repository: ansible/awx
Length of output: 1860
Use os.path.commonpath() for path containment and wrap getmtime() in exception handling.
Line 121's startswith check is vulnerable to path-prefix bypass attacks; a symlink or path like fact_cache_evil/host1 will incorrectly pass the string-prefix test. Additionally, os.path.getmtime() at line 128 can raise OSError (e.g., if the file is deleted between the isfile check and the getmtime call) but is only caught by the try-except at line 130, which is too narrow and will crash the entire fact-collection loop.
Proposed fix

```diff
+        real_fact_cache_dir = os.path.realpath(fact_cache_dir)
+
         if os.path.isdir(fact_cache_dir):
             for filename in os.listdir(fact_cache_dir):
                 if filename not in hosts_cached_map:
                     continue  # not an expected host for this job
                 filepath = os.path.join(fact_cache_dir, filename)
-                if not os.path.realpath(filepath).startswith(fact_cache_dir):
+                real_filepath = os.path.realpath(filepath)
+                if os.path.commonpath([real_fact_cache_dir, real_filepath]) != real_fact_cache_dir:
                     logger.error(f'Invalid path for facts file: {filepath}')
                     continue
-                if not os.path.isfile(filepath):
-                    continue
-
-                seen_in_dir.add(filename)
-                modified = os.path.getmtime(filepath)
-                if modified >= facts_write_time:
-                    try:
-                        with codecs.open(filepath, 'r', encoding='utf-8') as f:
-                            facts_updates[filename] = json.load(f)
-                    except (ValueError, OSError):
-                        continue
-                else:
-                    log_data['unmodified_ct'] += 1
+
+                try:
+                    if not os.path.isfile(real_filepath):
+                        continue
+
+                    seen_in_dir.add(filename)
+                    modified = os.path.getmtime(real_filepath)
+                    if modified >= facts_write_time:
+                        with codecs.open(real_filepath, 'r', encoding='utf-8') as f:
+                            facts_updates[filename] = json.load(f)
+                    else:
+                        log_data['unmodified_ct'] += 1
+                except (ValueError, OSError):
+                    continue
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@awx/main/tasks/facts.py` around lines 120 - 134, Replace the fragile
startswith containment check with a real path containment test using
os.path.commonpath: compute real_fact_cache_dir =
os.path.realpath(fact_cache_dir) and real_filepath = os.path.realpath(filepath)
and verify os.path.commonpath([real_fact_cache_dir, real_filepath]) ==
real_fact_cache_dir before accepting the file; also protect the os.path.getmtime
call by moving it into a try/except block (catch OSError and skip the file) so
that races where the file is removed between isfile and getmtime do not crash
the facts loop (update references to filepath, fact_cache_dir, modified,
facts_write_time, and seen_in_dir accordingly).
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
awx/main/tasks/facts.py (1)
151-206: ⚠️ Potential issue | 🟠 Major

Phase 2 update and Phase 3 clear run in separate autocommit transactions; partial failure leaves DB inconsistent.

`finish_fact_cache` is invoked from `post_run_hook` (jobs.py) without `@transaction.atomic`, and `bulk_update_sorted_by_id` internally issues `bulk_update(..., batch_size=1000)`, so each call commits as it goes. Splitting the previous single-phase flow into two bulk phases widens the failure window: if Phase 2 succeeds but Phase 3 raises (DB hiccup, deadlock retry exhaustion, etc.), hosts that Ansible removed facts for will remain with stale `ansible_facts`, and there is no rollback. Consider wrapping both phases in a single `transaction.atomic()` block so either both succeed or neither is persisted.

♻️ Suggested structure

```diff
+from django.db import transaction
+
 ...
-# Phase 2: Save updated facts to database
-if facts_updates:
-    ...
-# Phase 3: Clear facts for hosts whose files were removed by Ansible
-if hosts_to_clear:
-    ...
+with transaction.atomic():
+    # Phase 2: Save updated facts to database
+    if facts_updates:
+        ...
+    # Phase 3: Clear facts for hosts whose files were removed by Ansible
+    if hosts_to_clear:
+        ...
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security."
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 67a892c8-f334-4efd-b39a-38f708ccd8c5
📒 Files selected for processing (1)
awx/main/tasks/facts.py
pb82 left a comment:
Ok, this looks good to me. If I understand correctly this compares the time of host fact updates from the summary against the modified time of host facts in the filesystem. This avoids querying all the hosts from the database. Afterwards it bulk updates or deletes facts.
tvo318 left a comment:
Code Review
The restructuring into file-scan-first phases is clean and the motivation (avoiding WITH HOLD cursor issues from iterating DB hosts) makes sense. Symlink rejection, stat error handling, and the row-count mismatch warnings are good additions. Two observations below — one about lost batching for large inventories, one minor logging gap.
```python
        # if the file goes missing, ansible removed it (likely via clear_facts)
        # if the file goes missing, but the host has not started facts, then we should not clear the facts
        if hosts_to_save:
            rows = bulk_update_sorted_by_id(Host, hosts_to_save, fields=['ansible_facts', 'ansible_facts_modified'])
```
Lost batching for bulk updates
Medium severity — the old code batched bulk_update_sorted_by_id calls in chunks of 100 hosts. The new code loads all hosts with changed facts into memory at once (list(host_qs.filter(...)) at line 153) and does a single unbounded bulk_update here.
For large inventories where a playbook touches many hosts (e.g. setup on 10k hosts where all facts change), this loads all Host objects with their ansible_facts JSON blobs into memory simultaneously, and issues a single large bulk update.
Worth considering whether bulk_update_sorted_by_id handles internal batching already, or whether chunking should be reintroduced here and in the Phase 3 clear path (line 204).
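If chunking were reintroduced, a small helper along these lines would bound both memory and query size. This is a sketch: `chunked` is a hypothetical name, and note that Django's `bulk_update` `batch_size` argument only splits the SQL statements, not the list of model instances already materialized in memory.

```python
# Hypothetical chunking helper; not present in awx/main/tasks/facts.py.
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items, without materializing the whole iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Feeding `host_qs.filter(...).iterator()` through `chunked(..., 100)` and calling `bulk_update_sorted_by_id` once per chunk would restore the old per-100-hosts bound in both the update and clear phases.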
```diff
                         facts_updates[filename] = json.load(f)
-                except ValueError:
+                except (ValueError, OSError):
                     continue
```
Untracked parse failures in log metrics
This is a minor issue — files that hit this except branch are added to seen_in_dir (line 127) so they won't be flagged for clearing, which is correct. But no log_data counter is incremented, so these hosts are invisible in the final log summary. Consider incrementing unmodified_ct or adding a separate error_ct metric.
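One way to surface these failures, sketched with a hypothetical `load_facts` helper and an assumed `error_ct` key (neither exists in the current code):

```python
# Hypothetical sketch: count parse/read failures in the log summary
# instead of silently skipping them. load_facts and error_ct are
# illustrative names, not part of awx/main/tasks/facts.py today.
import json


def load_facts(filepath, log_data):
    """Parse one host fact file; count unreadable or invalid files instead of hiding them."""
    try:
        with open(filepath, encoding='utf-8') as f:
            return json.load(f)
    except (ValueError, OSError):
        log_data['error_ct'] = log_data.get('error_ct', 0) + 1  # visible in final summary
        return None
```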
SUMMARY
Got some pushback against the updated loop in `finish_fact_cache`, because in auto-commit this will do a WITH HOLD cursor or something else crazy, due to crazy reasons assuming connection pooling is a problem.

So this gives up on listing the DB hosts at the end of the job. We didn't really need to do that. The alternative, this, is to loop over the files. We have a list of host facts written to disk, and we have what is on disk now. So we just do what we need to do using the files. We have to update the database hosts, but we'll do that in tiny chunks as we run into them.
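A stripped-down sketch of that file-driven loop, assuming fact files are named after their hosts (path validation, error handling, and the DB phases in the real `awx/main/tasks/facts.py` are omitted here):

```python
# Illustrative sketch only; the real implementation adds symlink checks,
# exception handling, and bulk DB updates.
import json
import os


def scan_fact_cache(fact_cache_dir, expected_hosts, facts_write_time):
    """Return (facts for files modified since the job started writing,
    hosts whose fact files are now missing and should be cleared)."""
    facts_updates = {}
    seen_in_dir = set()
    for filename in os.listdir(fact_cache_dir):
        if filename not in expected_hosts:
            continue  # not a host this job wrote facts for
        filepath = os.path.join(fact_cache_dir, filename)
        seen_in_dir.add(filename)
        if os.path.getmtime(filepath) >= facts_write_time:
            with open(filepath, encoding='utf-8') as f:
                facts_updates[filename] = json.load(f)
    # files Ansible removed (e.g. via clear_facts) show up as expected-but-unseen
    hosts_to_clear = set(expected_hosts) - seen_in_dir
    return facts_updates, hosts_to_clear
```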
Still need to confirm tests pass.
Note
Medium Risk
Changes the end-of-job fact synchronization algorithm and bulk update behavior, which could affect when facts are updated/cleared or missed if file mtimes or directory contents are unexpected.
Overview
Reworks `finish_fact_cache` to scan the `fact_cache/` directory (instead of iterating DB hosts) to determine which hosts have updated facts and which hosts had fact files removed (treated as `clear_facts`).

Updates are now applied in two bulk phases: one to save changed `ansible_facts` for only the hosts with modified files, and another to clear facts for hosts whose pre-written cache files disappeared; it also tightens filesystem handling by rejecting symlink entries and tolerating stat/read errors without attempting DB writes.

Reviewed by Cursor Bugbot for commit 0a4775a.
Summary by CodeRabbit
Bug Fixes
Chores
Tests