AAP-72269 Change fact processing loop to use file listing #16403

AlanCoding wants to merge 3 commits into ansible:devel
Conversation
📝 Walkthrough

Reworks `finish_fact_cache` to read host fact files by scanning the fact_cache_dir, build changed/unchanged host sets by file mtime, then perform two bulk DB phases: bulk-update changed facts and bulk-clear missing hosts; expands exception handling and adds row-count warnings.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Task as finish_fact_cache
    participant FS as fact_cache_dir (filesystem)
    participant DB as Database (Host rows)
    participant Logger as Logger
    Task ->> FS: list files in fact_cache_dir
    FS -->> Task: filenames + mtimes
    Task ->> Task: parse JSON for files with mtime >= facts_write_time
    alt parse error / unreadable file
        Task ->> Logger: warn and skip file
    end
    Task ->> DB: bulk_update_sorted_by_id(changed_facts)
    DB -->> Task: rows_updated_count
    Task ->> Logger: warn if rows_updated_count != expected
    Task ->> DB: bulk_clear_facts(hosts_to_clear)
    DB -->> Task: rows_cleared_count
    Task ->> Logger: warn if rows_cleared_count != expected
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
I would be pushing to merge this, but I have not been able to get clean YOLO results, so now all my effort is being spent on debugging those failures.
@AlanCoding I've tested this change against my environment and can confirm that it works as expected (no more excessive database work_mem use).
Baby Yolo Results

Ran containerized Baby Yolo build #130 (Apr 20) and compared against reference build #129 (devel, same day).

Verdict: This PR does not introduce test regressions.
Actionable comments posted: 2
🧹 Nitpick comments (1)
awx/main/tests/unit/models/test_jobs.py (1)
115-117: Assert the `name__in` filter instead of returning `hosts[1]` unconditionally.

This mock would still pass if `finish_fact_cache` accidentally requested extra hosts to clear. Make the mock validate the queried names so the test protects the new selection behavior.

Proposed test tightening:

```diff
     hosts_qs = mock.MagicMock()
-    # The new code calls host_qs.filter(name__in=...).select_related('inventory')
-    # Only hosts[1] needs clearing (its file was removed), so return just that host
-    hosts_qs.filter.return_value.select_related.return_value = [hosts[1]]
+    filtered_qs = mock.MagicMock()
+    filtered_qs.select_related.return_value = [hosts[1]]
+    hosts_qs.filter.return_value = filtered_qs

     finish_fact_cache(hosts_qs, artifacts_dir=artifacts_dir, inventory_id=inventory_id)
+
+    hosts_qs.filter.assert_called_once()
+    assert set(hosts_qs.filter.call_args.kwargs['name__in']) == {hosts[1].name}
+    filtered_qs.select_related.assert_called_once_with('inventory')
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@awx/main/tests/unit/models/test_jobs.py` around lines 115 - 117, Modify the test's host_qs.filter mock so it asserts the name__in argument contains exactly the expected host names instead of always returning hosts[1]; update the stub for host_qs.filter (the call used by finish_fact_cache) to inspect its kwargs (look for name__in) and only return an object whose select_related(...).return_value is [hosts[1]] when the provided name__in matches the expected list, otherwise raise or return an empty list to fail the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@awx/main/tasks/facts.py`:
- Around line 111-113: The scan currently accumulates all changes into
facts_updates and hosts_to_clear then performs large name__in DB lookups; change
the loop that iterates os.scandir()/file entries to collect hostnames/facts into
fixed-size batches (e.g. BATCH_SIZE), and when the batch fills call
bulk_update_sorted_by_id to process that batch (perform the name__in lookup and
updates), then clear facts_updates and hosts_to_clear for the next batch; repeat
until scan completes and flush any remaining items at the end; keep seen_in_dir
logic but ensure any lookups that relied on global lists are performed per-batch
to avoid huge in-memory maps and oversized DB queries.
- Around line 120-134: Replace the fragile startswith containment check with a
real path containment test using os.path.commonpath: compute real_fact_cache_dir
= os.path.realpath(fact_cache_dir) and real_filepath =
os.path.realpath(filepath) and verify os.path.commonpath([real_fact_cache_dir,
real_filepath]) == real_fact_cache_dir before accepting the file; also protect
the os.path.getmtime call by moving it into a try/except block (catch OSError
and skip the file) so that races where the file is removed between isfile and
getmtime do not crash the facts loop (update references to filepath,
fact_cache_dir, modified, facts_write_time, and seen_in_dir accordingly).
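The batch-and-flush loop described in the first inline comment can be sketched generically. This is a minimal illustration under assumed names (`process_in_batches`, `flush_batch`, and `BATCH_SIZE` are hypothetical, not AWX code); in `finish_fact_cache` the flush step would perform the per-batch `name__in` lookup and bulk update.

```python
# Minimal sketch of the batch-and-flush pattern; all names here are
# hypothetical illustrations, not the actual awx/main/tasks/facts.py code.
BATCH_SIZE = 100


def process_in_batches(entries, flush_batch, batch_size=BATCH_SIZE):
    """Accumulate entries, flush each full batch, then flush any remainder."""
    batch = []
    results = []
    for entry in entries:
        batch.append(entry)
        if len(batch) >= batch_size:
            results.append(flush_batch(batch))  # e.g. name__in lookup + bulk update
            batch = []
    if batch:  # flush whatever is left after the directory scan completes
        results.append(flush_batch(batch))
    return results
```

With this shape, memory held at any time is bounded by `batch_size` entries rather than the whole inventory.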
---
Nitpick comments:
In `@awx/main/tests/unit/models/test_jobs.py`:
- Around line 115-117: Modify the test's host_qs.filter mock so it asserts the
name__in argument contains exactly the expected host names instead of always
returning hosts[1]; update the stub for host_qs.filter (the call used by
finish_fact_cache) to inspect its kwargs (look for name__in) and only return an
object whose select_related(...).return_value is [hosts[1]] when the provided
name__in matches the expected list, otherwise raise or return an empty list to
fail the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 556b60f7-bbb2-420a-a397-29109361d62b
📒 Files selected for processing (2)
awx/main/tasks/facts.py
awx/main/tests/unit/models/test_jobs.py
```diff
             filepath = os.path.join(fact_cache_dir, filename)
             if not os.path.realpath(filepath).startswith(fact_cache_dir):
                 logger.error(f'Invalid path for facts file: {filepath}')
                 continue
             if not os.path.isfile(filepath):
                 continue

             seen_in_dir.add(filename)
             modified = os.path.getmtime(filepath)
-            if not facts_write_time or modified >= facts_write_time:
+            if modified >= facts_write_time:
                 try:
                     with codecs.open(filepath, 'r', encoding='utf-8') as f:
-                        ansible_facts = json.load(f)
-                except ValueError:
+                        facts_updates[filename] = json.load(f)
+                except (ValueError, OSError):
                     continue
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
python - <<'PY'
import os
base = '/tmp/artifacts/fact_cache'
candidate = '/tmp/artifacts/fact_cache_evil/host1'
print("startswith result:", os.path.realpath(candidate).startswith(base))
print("commonpath result:", os.path.commonpath([os.path.realpath(base), os.path.realpath(candidate)]) == os.path.realpath(base))
PY
```

Repository: ansible/awx
Length of output: 103

🏁 Script executed:

```bash
# Check the actual file and surrounding context
cat -n awx/main/tasks/facts.py | sed -n '110,145p'
```

Repository: ansible/awx
Length of output: 1860
Use os.path.commonpath() for path containment and wrap getmtime() in exception handling.
Line 121's startswith check is vulnerable to path-prefix bypass attacks; a symlink or path like fact_cache_evil/host1 will incorrectly pass the string-prefix test. Additionally, os.path.getmtime() at line 128 can raise OSError (e.g., if the file is deleted between the isfile check and the getmtime call) but is only caught by the try-except at line 130, which is too narrow and will crash the entire fact-collection loop.
Proposed fix

```diff
+        real_fact_cache_dir = os.path.realpath(fact_cache_dir)
+
         if os.path.isdir(fact_cache_dir):
             for filename in os.listdir(fact_cache_dir):
                 if filename not in hosts_cached_map:
                     continue  # not an expected host for this job
                 filepath = os.path.join(fact_cache_dir, filename)
-                if not os.path.realpath(filepath).startswith(fact_cache_dir):
+                real_filepath = os.path.realpath(filepath)
+                if os.path.commonpath([real_fact_cache_dir, real_filepath]) != real_fact_cache_dir:
                     logger.error(f'Invalid path for facts file: {filepath}')
                     continue
-                if not os.path.isfile(filepath):
-                    continue
-
-                seen_in_dir.add(filename)
-                modified = os.path.getmtime(filepath)
-                if modified >= facts_write_time:
-                    try:
-                        with codecs.open(filepath, 'r', encoding='utf-8') as f:
-                            facts_updates[filename] = json.load(f)
-                    except (ValueError, OSError):
-                        continue
-                else:
-                    log_data['unmodified_ct'] += 1
+
+                try:
+                    if not os.path.isfile(real_filepath):
+                        continue
+
+                    seen_in_dir.add(filename)
+                    modified = os.path.getmtime(real_filepath)
+                    if modified >= facts_write_time:
+                        with codecs.open(real_filepath, 'r', encoding='utf-8') as f:
+                            facts_updates[filename] = json.load(f)
+                    else:
+                        log_data['unmodified_ct'] += 1
+                except (ValueError, OSError):
+                    continue
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@awx/main/tasks/facts.py` around lines 120 - 134, Replace the fragile
startswith containment check with a real path containment test using
os.path.commonpath: compute real_fact_cache_dir =
os.path.realpath(fact_cache_dir) and real_filepath = os.path.realpath(filepath)
and verify os.path.commonpath([real_fact_cache_dir, real_filepath]) ==
real_fact_cache_dir before accepting the file; also protect the os.path.getmtime
call by moving it into a try/except block (catch OSError and skip the file) so
that races where the file is removed between isfile and getmtime do not crash
the facts loop (update references to filepath, fact_cache_dir, modified,
facts_write_time, and seen_in_dir accordingly).
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
awx/main/tasks/facts.py (1)
151-206: ⚠️ Potential issue | 🟠 Major

Phase 2 update and Phase 3 clear run in separate autocommit transactions; partial failure leaves DB inconsistent.

`finish_fact_cache` is invoked from `post_run_hook` (jobs.py) without `@transaction.atomic`, and `bulk_update_sorted_by_id` internally issues `bulk_update(..., batch_size=1000)`, so each call commits as it goes. Splitting the previous single-phase flow into two bulk phases widens the failure window: if Phase 2 succeeds but Phase 3 raises (DB hiccup, deadlock retry exhaustion, etc.), hosts that Ansible removed facts for will remain with stale `ansible_facts`, and there is no rollback. Consider wrapping both phases in a single `transaction.atomic()` block so either both succeed or neither is persisted.

♻️ Suggested structure

```diff
+from django.db import transaction
+
 ...
-# Phase 2: Save updated facts to database
-if facts_updates:
-    ...
-# Phase 3: Clear facts for hosts whose files were removed by Ansible
-if hosts_to_clear:
-    ...
+with transaction.atomic():
+    # Phase 2: Save updated facts to database
+    if facts_updates:
+        ...
+    # Phase 3: Clear facts for hosts whose files were removed by Ansible
+    if hosts_to_clear:
+        ...
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security."
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 67a892c8-f334-4efd-b39a-38f708ccd8c5
📒 Files selected for processing (1)
awx/main/tasks/facts.py
pb82 left a comment:
Ok, this looks good to me. If I understand correctly this compares the time of host fact updates from the summary against the modified time of host facts in the filesystem. This avoids querying all the hosts from the database. Afterwards it bulk updates or deletes facts.
tvo318 left a comment:
Code Review
The restructuring into file-scan-first phases is clean and the motivation (avoiding WITH HOLD cursor issues from iterating DB hosts) makes sense. Symlink rejection, stat error handling, and the row-count mismatch warnings are good additions. Two observations below — one about lost batching for large inventories, one minor logging gap.
```python
        # if the file goes missing, ansible removed it (likely via clear_facts)
        # if the file goes missing, but the host has not started facts, then we should not clear the facts
        if hosts_to_save:
            rows = bulk_update_sorted_by_id(Host, hosts_to_save, fields=['ansible_facts', 'ansible_facts_modified'])
```
Lost batching for bulk updates
Medium severity — the old code batched bulk_update_sorted_by_id calls in chunks of 100 hosts. The new code loads all hosts with changed facts into memory at once (list(host_qs.filter(...)) at line 153) and does a single unbounded bulk_update here.
For large inventories where a playbook touches many hosts (e.g. setup on 10k hosts where all facts change), this loads all Host objects with their ansible_facts JSON blobs into memory simultaneously, and issues a single large bulk update.
Worth considering whether bulk_update_sorted_by_id handles internal batching already, or whether chunking should be reintroduced here and in the Phase 3 clear path (line 204).
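If chunking were reintroduced, a small helper along these lines would bound both memory and query size. This is a sketch: `chunked` is a hypothetical name, and note that Django's `bulk_update` `batch_size` argument only splits the SQL statements, not the list of model instances already materialized in memory.

```python
# Hypothetical chunking helper; not present in awx/main/tasks/facts.py.
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items, without materializing the whole iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Feeding `host_qs.filter(...).iterator()` through `chunked(..., 100)` and calling `bulk_update_sorted_by_id` once per chunk would restore the old per-100-hosts bound in both the update and clear phases.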
```diff
                         facts_updates[filename] = json.load(f)
-                except ValueError:
+                except (ValueError, OSError):
                     continue
```
Untracked parse failures in log metrics
This is a minor issue — files that hit this except branch are added to seen_in_dir (line 127) so they won't be flagged for clearing, which is correct. But no log_data counter is incremented, so these hosts are invisible in the final log summary. Consider incrementing unmodified_ct or adding a separate error_ct metric.
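One way to surface these failures, sketched with a hypothetical `load_facts` helper and an assumed `error_ct` key (neither exists in the current code):

```python
# Hypothetical sketch: count parse/read failures in the log summary
# instead of silently skipping them. load_facts and error_ct are
# illustrative names, not part of awx/main/tasks/facts.py today.
import json


def load_facts(filepath, log_data):
    """Parse one host fact file; count unreadable or invalid files instead of hiding them."""
    try:
        with open(filepath, encoding='utf-8') as f:
            return json.load(f)
    except (ValueError, OSError):
        log_data['error_ct'] = log_data.get('error_ct', 0) + 1  # visible in final summary
        return None
```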
SUMMARY
Got some pushback against the updated loop in `finish_fact_cache`, because in auto-commit this will do a WITH HOLD cursor or something else crazy, due to crazy reasons assuming connection pooling is a problem.

So this gives up on listing the DB hosts at the end of the job. We didn't really need to do that. The alternative, this, is to loop over the files. We have a list of host facts written to disk, and we have what is on disk now. So we just do what we need to do using the files. We have to update the database hosts, but we'll do that in tiny chunks as we run into them.
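A stripped-down sketch of that file-driven loop, assuming fact files are named after their hosts (path validation, error handling, and the DB phases in the real `awx/main/tasks/facts.py` are omitted here):

```python
# Illustrative sketch only; the real implementation adds symlink checks,
# exception handling, and bulk DB updates.
import json
import os


def scan_fact_cache(fact_cache_dir, expected_hosts, facts_write_time):
    """Return (facts for files modified since the job started writing,
    hosts whose fact files are now missing and should be cleared)."""
    facts_updates = {}
    seen_in_dir = set()
    for filename in os.listdir(fact_cache_dir):
        if filename not in expected_hosts:
            continue  # not a host this job wrote facts for
        filepath = os.path.join(fact_cache_dir, filename)
        seen_in_dir.add(filename)
        if os.path.getmtime(filepath) >= facts_write_time:
            with open(filepath, encoding='utf-8') as f:
                facts_updates[filename] = json.load(f)
    # files Ansible removed (e.g. via clear_facts) show up as expected-but-unseen
    hosts_to_clear = set(expected_hosts) - seen_in_dir
    return facts_updates, hosts_to_clear
```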
Still need to confirm tests pass.
Note
Medium Risk
Changes the end-of-job fact synchronization algorithm and bulk update behavior, which could affect when facts are updated/cleared or missed if file mtimes or directory contents are unexpected.
Overview
Reworks `finish_fact_cache` to scan the `fact_cache/` directory (instead of iterating DB hosts) to determine which hosts have updated facts and which hosts had fact files removed (treated as `clear_facts`).

Updates are now applied in two bulk phases: one to save changed `ansible_facts` for only the hosts with modified files, and another to clear facts for hosts whose pre-written cache files disappeared; it also tightens filesystem handling by rejecting symlink entries and tolerating stat/read errors without attempting DB writes.

Reviewed by Cursor Bugbot for commit 0a4775a.
Summary by CodeRabbit
Bug Fixes
Chores
Tests