AAP-72269 Change fact processing loop to use file listing #16403

Open

AlanCoding wants to merge 3 commits into ansible:devel from AlanCoding:new_fact_loop

Conversation

@AlanCoding AlanCoding (Member) commented Apr 15, 2026

SUMMARY

Got some pushback against the updated loop in finish_fact_cache: under autocommit, iterating DB hosts with a server-side cursor would require a WITH HOLD cursor (or a similar workaround), which is a problem when connection pooling is in play.

So this gives up on listing the DB hosts at the end of the job; we didn't really need to do that. The alternative is to loop over the files instead. We have the list of host facts written to disk, and we have what is on disk now, so we do everything we need using the files. We still have to update the database hosts, but we do that in small chunks as we run into them.

Still need to confirm tests pass.

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API

Note

Medium Risk
Changes the end-of-job fact synchronization algorithm and bulk update behavior, which could affect when facts are updated/cleared or missed if file mtimes or directory contents are unexpected.

Overview
Reworks finish_fact_cache to scan the fact_cache/ directory (instead of iterating DB hosts) to determine which hosts have updated facts and which hosts had fact files removed (treated as clear_facts).

Updates are now applied in two bulk phases: one to save changed ansible_facts for only the hosts with modified files, and another to clear facts for hosts whose pre-written cache files disappeared; it also tightens filesystem handling by rejecting symlink entries and tolerating stat/read errors without attempting DB writes.

Reviewed by Cursor Bugbot for commit 0a4775a.

Summary by CodeRabbit

  • Bug Fixes

    • More robust handling of malformed or unreadable fact files (additional error types caught) and warnings for missing/unstatable files.
    • Added row-count mismatch warnings when persisted updates or clears don't affect expected rows.
  • Chores

    • Faster fact synchronization by scanning on-disk facts, updating only changed hosts, and separately clearing missing entries via bulk operations.
  • Tests

    • Adjusted unit tests to align with narrower host selection and strengthened assertions for invalid data handling.

@coderabbitai coderabbitai Bot commented Apr 15, 2026

📝 Walkthrough

Walkthrough

Reworks finish_fact_cache to read host fact files by scanning the fact_cache_dir, build changed/unchanged host sets by file mtime, then perform two bulk DB phases: bulk-update changed facts and bulk-clear missing hosts; expands exception handling and adds row-count warnings.

Changes

Cohort / File(s) Summary
Fact Cache Refactor
awx/main/tasks/facts.py
Reads fact files by scanning fact_cache_dir, builds facts_updates only for files with mtime >= facts_write_time, tracks seen_in_dir to detect hosts to clear, then does two separate bulk DB operations: bulk-update changed facts and bulk-clear missing hosts. Removes per-host batching, broadens parsing errors to (ValueError, OSError), warns on unreadable files, and logs row-count mismatches for both bulk phases.
Unit Test Adjustments
awx/main/tests/unit/models/test_jobs.py
Aligns mocked query behavior with new implementation: test_finish_job_fact_cache_clear mocks filter(...).select_related('inventory') to return a narrower host set; test_finish_job_fact_cache_with_bad_data stops mocking queryset iteration and asserts bulk_update_sorted_by_id is not called for invalid JSON.

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant Task as finish_fact_cache
  participant FS as fact_cache_dir (filesystem)
  participant DB as Database (Host rows)
  participant Logger as Logger

  Task ->> FS: list files in fact_cache_dir
  FS -->> Task: filenames + mtimes
  Task ->> Task: parse JSON for files with mtime >= facts_write_time
  alt parse error / unreadable file
    Task ->> Logger: warn and skip file
  end
  Task ->> DB: bulk_update_sorted_by_id(changed_facts)
  DB -->> Task: rows_updated_count
  Task ->> Logger: warn if rows_updated_count != expected
  Task ->> DB: bulk_clear_facts(hosts_to_clear)
  DB -->> Task: rows_cleared_count
  Task ->> Logger: warn if rows_cleared_count != expected
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title clearly describes the main change: refactoring the fact processing loop to use file listing instead of database iteration.
  • Linked Issues Check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check (✅ Passed): check skipped because no linked issues were found for this pull request.


@AlanCoding AlanCoding requested review from djyasin and pb82 April 16, 2026 17:07
@AlanCoding AlanCoding (Member, Author) commented:

I would be pushing to merge this, but I have not been able to get clean YOLO results, so now all my effort is being spent on debugging those failures.

@pb82 pb82 (Contributor) commented Apr 16, 2026

@AlanCoding I've tested this change against my environment and can confirm that it works as expected (no more excessive database work_mem use).

@AlanCoding AlanCoding (Member, Author) commented:

Baby Yolo Results

Ran containerized Baby Yolo build #130 (Apr 20) and compared against reference build #129 (devel, same day).

  • 16 total failures in the patch build vs 13 in the reference — no regressions introduced by this PR.
  • 12 of the 16 failures are identical to the reference baseline (AzureAD ODataError, galaxy PermissionsEnum, lightspeed XDS timeout, hub e2e, etc.).
  • The 4 failures unique to this build are RBAC tests that hit "Project failed to enter a successful state" or permission assertion errors during setup — a known intermittent infrastructure flake, unrelated to the fact processing changes.

Verdict: This PR does not introduce test regressions.

@AlanCoding AlanCoding marked this pull request as ready for review April 21, 2026 13:46
@AlanCoding AlanCoding changed the title Change fact processing loop to use file listing AAP-72269 Change fact processing loop to use file listing Apr 21, 2026
@coderabbitai coderabbitai Bot left a comment
Actionable comments posted: 2

🧹 Nitpick comments (1)
awx/main/tests/unit/models/test_jobs.py (1)

115-117: Assert the name__in filter instead of returning hosts[1] unconditionally.

This mock would still pass if finish_fact_cache accidentally requested extra hosts to clear. Make the mock validate the queried names so the test protects the new selection behavior.

Proposed test tightening

```diff
     hosts_qs = mock.MagicMock()
-    # The new code calls host_qs.filter(name__in=...).select_related('inventory')
-    # Only hosts[1] needs clearing (its file was removed), so return just that host
-    hosts_qs.filter.return_value.select_related.return_value = [hosts[1]]
+    filtered_qs = mock.MagicMock()
+    filtered_qs.select_related.return_value = [hosts[1]]
+    hosts_qs.filter.return_value = filtered_qs

     finish_fact_cache(hosts_qs, artifacts_dir=artifacts_dir, inventory_id=inventory_id)
+
+    hosts_qs.filter.assert_called_once()
+    assert set(hosts_qs.filter.call_args.kwargs['name__in']) == {hosts[1].name}
+    filtered_qs.select_related.assert_called_once_with('inventory')
```
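The mock-tightening pattern above can be exercised standalone. This sketch shows only the unittest.mock mechanics, not the AWX test itself; the host name and queryset shape are invented for illustration.

```python
from unittest import mock

# Hypothetical stand-in for the queryset passed to finish_fact_cache.
hosts_qs = mock.MagicMock()
filtered_qs = mock.MagicMock()
filtered_qs.select_related.return_value = ['host-b']
hosts_qs.filter.return_value = filtered_qs

# The code under test would do:
#   hosts_qs.filter(name__in=[...]).select_related('inventory')
result = hosts_qs.filter(name__in=['host-b']).select_related('inventory')

# These assertions lock down *what* was queried, not just what came back,
# so a caller requesting extra hosts would fail the test.
hosts_qs.filter.assert_called_once()
assert hosts_qs.filter.call_args.kwargs['name__in'] == ['host-b']
filtered_qs.select_related.assert_called_once_with('inventory')
```

Inspecting `call_args.kwargs` (available since Python 3.8) is what makes the mock fail when the selection behavior regresses, instead of silently returning the canned host list.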
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@awx/main/tests/unit/models/test_jobs.py` around lines 115 - 117, Modify the
test's host_qs.filter mock so it asserts the name__in argument contains exactly
the expected host names instead of always returning hosts[1]; update the stub
for host_qs.filter (the call used by finish_fact_cache) to inspect its kwargs
(look for name__in) and only return an object whose
select_related(...).return_value is [hosts[1]] when the provided name__in
matches the expected list, otherwise raise or return an empty list to fail the
test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@awx/main/tasks/facts.py`:
- Around line 111-113: The scan currently accumulates all changes into
facts_updates and hosts_to_clear then performs large name__in DB lookups; change
the loop that iterates os.scandir()/file entries to collect hostnames/facts into
fixed-size batches (e.g. BATCH_SIZE), and when the batch fills call
bulk_update_sorted_by_id to process that batch (perform the name__in lookup and
updates), then clear facts_updates and hosts_to_clear for the next batch; repeat
until scan completes and flush any remaining items at the end; keep seen_in_dir
logic but ensure any lookups that relied on global lists are performed per-batch
to avoid huge in-memory maps and oversized DB queries.
- Around line 120-134: Replace the fragile startswith containment check with a
real path containment test using os.path.commonpath: compute real_fact_cache_dir
= os.path.realpath(fact_cache_dir) and real_filepath =
os.path.realpath(filepath) and verify os.path.commonpath([real_fact_cache_dir,
real_filepath]) == real_fact_cache_dir before accepting the file; also protect
the os.path.getmtime call by moving it into a try/except block (catch OSError
and skip the file) so that races where the file is removed between isfile and
getmtime do not crash the facts loop (update references to filepath,
fact_cache_dir, modified, facts_write_time, and seen_in_dir accordingly).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 556b60f7-bbb2-420a-a397-29109361d62b

📥 Commits

Reviewing files that changed from the base of the PR and between e5bae59 and 0e1b3b5.

📒 Files selected for processing (2)
  • awx/main/tasks/facts.py
  • awx/main/tests/unit/models/test_jobs.py

Comment thread awx/main/tasks/facts.py

Comment on lines +120 to 134

```diff
 filepath = os.path.join(fact_cache_dir, filename)
 if not os.path.realpath(filepath).startswith(fact_cache_dir):
     logger.error(f'Invalid path for facts file: {filepath}')
     continue
 if not os.path.isfile(filepath):
     continue

 seen_in_dir.add(filename)
 modified = os.path.getmtime(filepath)
-if not facts_write_time or modified >= facts_write_time:
+if modified >= facts_write_time:
     try:
         with codecs.open(filepath, 'r', encoding='utf-8') as f:
-            ansible_facts = json.load(f)
-    except ValueError:
+            facts_updates[filename] = json.load(f)
+    except (ValueError, OSError):
         continue
```
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
python - <<'PY'
import os

base = '/tmp/artifacts/fact_cache'
candidate = '/tmp/artifacts/fact_cache_evil/host1'

print("startswith result:", os.path.realpath(candidate).startswith(base))
print("commonpath result:", os.path.commonpath([os.path.realpath(base), os.path.realpath(candidate)]) == os.path.realpath(base))
PY
```

Repository: ansible/awx

Length of output: 103


🏁 Script executed:

```bash
# Check the actual file and surrounding context
cat -n awx/main/tasks/facts.py | sed -n '110,145p'
```

Repository: ansible/awx

Length of output: 1860


Use os.path.commonpath() for path containment and wrap getmtime() in exception handling.

Line 121's startswith check is vulnerable to path-prefix bypass attacks; a symlink or path like fact_cache_evil/host1 will incorrectly pass the string-prefix test. Additionally, os.path.getmtime() at line 128 can raise OSError (e.g., if the file is deleted between the isfile check and the getmtime call) but is only caught by the try-except at line 130, which is too narrow and will crash the entire fact-collection loop.

Proposed fix

```diff
+    real_fact_cache_dir = os.path.realpath(fact_cache_dir)
+
     if os.path.isdir(fact_cache_dir):
         for filename in os.listdir(fact_cache_dir):
             if filename not in hosts_cached_map:
                 continue  # not an expected host for this job

             filepath = os.path.join(fact_cache_dir, filename)
-            if not os.path.realpath(filepath).startswith(fact_cache_dir):
+            real_filepath = os.path.realpath(filepath)
+            if os.path.commonpath([real_fact_cache_dir, real_filepath]) != real_fact_cache_dir:
                 logger.error(f'Invalid path for facts file: {filepath}')
                 continue
-            if not os.path.isfile(filepath):
-                continue
-
-            seen_in_dir.add(filename)
-            modified = os.path.getmtime(filepath)
-            if modified >= facts_write_time:
-                try:
-                    with codecs.open(filepath, 'r', encoding='utf-8') as f:
-                        facts_updates[filename] = json.load(f)
-                except (ValueError, OSError):
+
+            try:
+                if not os.path.isfile(real_filepath):
                     continue
+
+                seen_in_dir.add(filename)
+                modified = os.path.getmtime(real_filepath)
+                if modified >= facts_write_time:
+                    with codecs.open(real_filepath, 'r', encoding='utf-8') as f:
+                        facts_updates[filename] = json.load(f)
+                else:
+                    log_data['unmodified_ct'] += 1
+            except (ValueError, OSError):
+                continue
-            else:
-                log_data['unmodified_ct'] += 1
```

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
awx/main/tasks/facts.py (1)

151-206: ⚠️ Potential issue | 🟠 Major

Phase 2 update and Phase 3 clear run in separate autocommit transactions — partial failure leaves DB inconsistent.

finish_fact_cache is invoked from post_run_hook (jobs.py) without @transaction.atomic, and bulk_update_sorted_by_id internally issues bulk_update(..., batch_size=1000), so each call commits as it goes. Splitting the previous single-phase flow into two bulk phases widens the failure window: if Phase 2 succeeds but Phase 3 raises (DB hiccup, deadlock retry exhaustion, etc.), hosts that Ansible removed facts for will remain with stale ansible_facts, and there is no rollback. Consider wrapping both phases in a single transaction.atomic() block so either both succeed or neither is persisted.

♻️ Suggested structure

```diff
+    from django.db import transaction
+    ...
-    # Phase 2: Save updated facts to database
-    if facts_updates:
-        ...
-    # Phase 3: Clear facts for hosts whose files were removed by Ansible
-    if hosts_to_clear:
-        ...
+    with transaction.atomic():
+        # Phase 2: Save updated facts to database
+        if facts_updates:
+            ...
+        # Phase 3: Clear facts for hosts whose files were removed by Ansible
+        if hosts_to_clear:
+            ...
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security."
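The atomicity concern can be illustrated outside Django with sqlite3. This is a sketch of the pattern only: the table, values, and simulated failure are invented, and AWX would use Django's transaction.atomic rather than a raw connection.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE host (name TEXT PRIMARY KEY, facts TEXT)")
conn.execute("INSERT INTO host VALUES ('h1', 'stale'), ('h2', 'stale')")
conn.commit()

try:
    with conn:  # one transaction: both phases commit or neither does
        # Phase 2: bulk-update changed facts
        conn.execute("UPDATE host SET facts = 'fresh' WHERE name = 'h1'")
        # Phase 3: clear facts for hosts whose cache file vanished,
        # here simulated as failing partway through
        raise sqlite3.OperationalError('simulated failure during clear')
except sqlite3.OperationalError:
    pass

# Phase 2's update was rolled back, so no half-applied state remains.
rows = dict(conn.execute("SELECT name, facts FROM host"))
print(rows)  # {'h1': 'stale', 'h2': 'stale'}
```

Without the enclosing transaction, the Phase 2 update would have committed before Phase 3 failed, leaving exactly the inconsistent state the review comment describes.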


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 67a892c8-f334-4efd-b39a-38f708ccd8c5

📥 Commits

Reviewing files that changed from the base of the PR and between 0e1b3b5 and 0a4775a.

📒 Files selected for processing (1)
  • awx/main/tasks/facts.py

@sonarqubecloud

@pb82 pb82 (Contributor) left a comment

Ok, this looks good to me. If I understand correctly this compares the time of host fact updates from the summary against the modified time of host facts in the filesystem. This avoids querying all the hosts from the database. Afterwards it bulk updates or deletes facts.

@tvo318 tvo318 (Member) left a comment

Code Review

The restructuring into file-scan-first phases is clean and the motivation (avoiding WITH HOLD cursor issues from iterating DB hosts) makes sense. Symlink rejection, stat error handling, and the row-count mismatch warnings are good additions. Two observations below — one about lost batching for large inventories, one minor logging gap.

Comment thread awx/main/tasks/facts.py

```python
# if the file goes missing, ansible removed it (likely via clear_facts)
# if the file goes missing, but the host has not started facts, then we should not clear the facts
if hosts_to_save:
    rows = bulk_update_sorted_by_id(Host, hosts_to_save, fields=['ansible_facts', 'ansible_facts_modified'])
```

Lost batching for bulk updates

Medium severity — the old code batched bulk_update_sorted_by_id calls in chunks of 100 hosts. The new code loads all hosts with changed facts into memory at once (list(host_qs.filter(...)) at line 153) and does a single unbounded bulk_update here.

For large inventories where a playbook touches many hosts (e.g. setup on 10k hosts where all facts change), this loads all Host objects with their ansible_facts JSON blobs into memory simultaneously, and issues a single large bulk update.

Worth considering whether bulk_update_sorted_by_id handles internal batching already, or whether chunking should be reintroduced here and in the Phase 3 clear path (line 204).
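If chunking is reintroduced, the loop shape could look like the following. This is a generic sketch: `chunked` and the batch size of 100 are hypothetical, not AWX helpers, and each yielded batch would drive one name__in lookup plus one bulk update.

```python
def chunked(items, size):
    """Yield fixed-size lists from any iterable; the last may be shorter."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch


# Processing per-batch bounds both the size of the name__in query and
# the number of Host objects resident in memory at once.
batches = list(chunked(range(250), 100))
print([len(b) for b in batches])  # [100, 100, 50]
```

The same generator would serve the Phase 3 clear path, keeping memory and query size bounded regardless of inventory size.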

Comment thread awx/main/tasks/facts.py

```diff
-            except ValueError:
+                    facts_updates[filename] = json.load(f)
+            except (ValueError, OSError):
                 continue
```

Untracked parse failures in log metrics

This is a minor issue — files that hit this except branch are added to seen_in_dir (line 127) so they won't be flagged for clearing, which is correct. But no log_data counter is incremented, so these hosts are invisible in the final log summary. Consider incrementing unmodified_ct or adding a separate error_ct metric.
