Skip to content

fix: set SQLite busy_timeout before WAL pragma to prevent concurrent locking errors#1057

Merged
stack72 merged 1 commit intomainfrom
fix/sqlite-busy-timeout-ordering
Apr 2, 2026
Merged

fix: set SQLite busy_timeout before WAL pragma to prevent concurrent locking errors#1057
stack72 merged 1 commit intomainfrom
fix/sqlite-busy-timeout-ordering

Conversation

@adamhjk
Copy link
Copy Markdown
Contributor

@adamhjk adamhjk commented Apr 2, 2026

Summary

  • Swaps the order of PRAGMA busy_timeout=5000 and PRAGMA journal_mode=WAL in the CatalogStore constructor so SQLite has a retry timeout configured before attempting operations that may need a write lock under concurrency
  • Adds a unit test that opens two CatalogStore instances against the same DB file simultaneously to guard against regression

Fixes #1055

Test Plan

  • deno run test src/infrastructure/persistence/catalog_store_test.ts — all 17 tests pass, including new concurrent-open test
  • Reproduced the bug in a scratch repo: 9/10 concurrent runs failed with database is locked before the fix, 0/10 failed after
  • Existing UAT adversarial concurrency_test.ts covers the end-to-end cross-process scenario

🤖 Generated with Claude Code

@stack72 stack72 marked this pull request as draft April 2, 2026 16:03
@stack72 stack72 marked this pull request as draft April 2, 2026 16:03
github-actions[bot]
github-actions Bot previously approved these changes Apr 2, 2026
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Clean, well-scoped fix. The root cause analysis is correct — PRAGMA journal_mode=WAL acquires a write lock, so busy_timeout must be configured first to allow SQLite to retry rather than immediately throwing SQLITE_BUSY.

Blocking Issues

None.

Suggestions

None — the fix is minimal, the test is appropriate, and the commit message clearly explains the reasoning.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial Review

Critical / High

None.

Medium

  1. catalog_store_test.ts:312-329 — Test does not actually reproduce the race. Both CatalogStore instances are opened sequentially in a synchronous Deno test. The constructor calls new DatabaseSync(dbPath) + three synchronous exec() calls, so by the time store2 opens, store1 has already completed all its PRAGMAs and createSchema(). The real bug (per the PR body) required 9/10 concurrent runs across processes to reproduce. This test is a useful smoke test that two handles coexist, but it would pass even with the old ordering on most runs — it doesn't deterministically exercise the contention window. Not blocking since the fix itself is obviously correct and the author confirmed separate adversarial testing.

Low

  1. catalog_store_test.ts:327-328 — No cleanup of temp DB on failure. If assertEquals throws before close(), the two SQLite handles leak (WAL/SHM files left behind). This is consistent with every other test in the file so it's a pre-existing pattern, not a regression.

Verdict

PASS — The fix is a correct, minimal reordering of two PRAGMA statements. Setting busy_timeout before journal_mode=WAL ensures SQLite retries on SQLITE_BUSY rather than failing immediately when the WAL mode switch requires a write lock held by another connection. The test is a useful guard rail even if it can't deterministically reproduce the race.

…base is locked" under concurrency

The CatalogStore constructor set PRAGMA busy_timeout after PRAGMA
journal_mode=WAL. Under concurrent access, the WAL pragma needs a database
lock but had no retry configured yet, causing SQLite to immediately throw
SQLITE_BUSY instead of retrying. Swapping the order ensures the 5-second
retry timeout is active before any lock-acquiring operation.

Fixes #1055
@adamhjk adamhjk force-pushed the fix/sqlite-busy-timeout-ordering branch from f3dd360 to c8ab97c Compare April 2, 2026 17:13
@adamhjk adamhjk marked this pull request as ready for review April 2, 2026 17:13
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Clean, well-targeted fix. Swapping the PRAGMA order is the correct approach — busy_timeout must be configured before any operation that may need a write lock (like journal_mode=WAL), otherwise SQLite returns SQLITE_BUSY immediately instead of retrying.

Blocking Issues

None.

Suggestions

  1. Test timing sensitivity: The test relies on a 200ms setTimeout to release the lock, which could be flaky on slow CI runners if the subprocess hasn't started its busy-wait yet. In practice the 5000ms busy_timeout gives plenty of headroom, so this is unlikely to be an issue — just worth noting.

Overall: minimal, correct fix with a solid regression test. LGTM.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial Review

Critical / High

None.

Medium

  1. catalog_store_test.ts:365 — subprocess has no timeout, can hang forever.
    The proc.output() call has no timeout or AbortSignal. If the subprocess hangs (e.g., Deno deadlocks on module resolution under lock contention, or the busy_timeout itself blocks for 5 full seconds on a slow CI runner before failing), this test blocks indefinitely. The project's CLAUDE.md convention says "every promise must be awaited or explicitly handled" and recommends AbortSignal with timeouts for outbound calls.

    Breaking scenario: CI runner is under heavy I/O load. The subprocess takes >5s to acquire the lock (busy_timeout expires), then exits with a non-zero code, but only after a long delay. Or worse, the Deno subprocess itself hangs during startup with a held exclusive lock on the file. The test suite stalls with no indication of what's wrong.

    Suggested fix: Add a timeout to proc.output() or wrap it in a Promise.race with a setTimeout rejection (e.g., 10s), and ensure the subprocess is killed on timeout:

    const timeout = AbortSignal.timeout(10_000);
    // or use Promise.race with a cleanup that calls proc.kill()
  2. catalog_store_test.ts:360 — 200ms delay is a race condition with subprocess startup.
    The test assumes the subprocess will have started and reached the PRAGMA within 200ms, so it's "waiting under busy_timeout" when the lock is released. On a cold CI runner (no Deno cache, high load), Deno startup + module resolution can exceed 200ms, meaning the lock is already released before the subprocess tries to acquire it. In that case, the test passes vacuously — it doesn't actually validate the busy_timeout retry behavior.

    Breaking scenario: This doesn't break correctness (the test still passes), but it silently stops testing what it claims to test. A future regression could slip through.

    Suggested fix: Instead of a fixed delay, have the subprocess signal readiness (e.g., write to stdout before constructing CatalogStore), and only release the lock after observing that signal.

Low

  1. catalog_store_test.ts:332 — temp script file is never cleaned up. The scriptPath and its parent dir are left behind after the test. Minor since makeTempDbPath already creates temp dirs that aren't cleaned up either — consistent with existing tests, but worth noting.

Verdict

PASS — The production fix (swapping two PRAGMA lines in catalog_store.ts:64-65) is correct and well-motivated. Setting busy_timeout before journal_mode=WAL ensures SQLite retries on lock contention rather than failing immediately. The test has minor robustness concerns (no subprocess timeout, timing-dependent lock scenario) but these don't affect the production code quality.

@stack72 stack72 merged commit 203cd31 into main Apr 2, 2026
20 checks passed
@stack72 stack72 deleted the fix/sqlite-busy-timeout-ordering branch April 2, 2026 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SQLite "database is locked" error under concurrent model method runs

2 participants