
fix(store): preserve underscores in BM25 search terms #404

Merged
tobi merged 1 commit into tobi:main from mvanhorn:osc/305-bm25-underscore-search
Apr 5, 2026

@mvanhorn
Contributor

Fixes #305

Summary

sanitizeFTS5Term stripped underscores from search terms, so BM25 searches for snake_case identifiers silently failed: my_variable became myvariable, which matched nothing.

Changes

  • Add _ to the preserved character set in sanitizeFTS5Term regex (store.ts)
  • Export the function for testability
  • Add 6 unit tests covering snake_case, contractions, punctuation, unicode
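
The regex change listed above might look roughly like the following sketch. The actual expression in store.ts is not shown in this description, so the character classes used here (`\p{L}`, `\p{N}`, the apostrophe) and the lowercasing step are assumptions inferred from the commit message's mention of a "Unicode regex" and of contraction tests. The name `sanitizeFTS5TermBefore` is hypothetical and exists only to show the old behavior side by side.

```typescript
// Hypothetical reconstruction; the real regex in store.ts may differ.

// Before (assumed): keep only Unicode letters, numbers, and apostrophes.
function sanitizeFTS5TermBefore(term: string): string {
  return term.replace(/[^\p{L}\p{N}']/gu, "").toLowerCase();
}

// After (assumed): the same character set, with "_" added to it.
function sanitizeFTS5Term(term: string): string {
  return term.replace(/[^\p{L}\p{N}'_]/gu, "").toLowerCase();
}

console.log(sanitizeFTS5TermBefore("my_variable")); // "myvariable"
console.log(sanitizeFTS5Term("my_variable"));       // "my_variable"
console.log(sanitizeFTS5Term("don't"));             // "don't"
console.log(sanitizeFTS5Term("hello!"));            // "hello"
```

The four calls loosely mirror the test categories named in the list (snake_case, contractions, punctuation); they are not the PR's actual test cases.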

Before / After

sanitizeFTS5Term("my_variable")
  Before: "myvariable"  (no BM25 match)
  After:  "my_variable" (correct match)

The CLI's copy of sanitizeFTS5Term (cli/qmd.ts:1695) already uses \w, which preserves underscores; this change aligns the store's version with it.
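
To illustrate why a `\w`-based pattern already kept underscores: in JavaScript regular expressions, `\w` matches `[A-Za-z0-9_]`, so `_` survives a "strip everything that is not `\w`" replacement. The sketch below contrasts that with a Unicode-property class that omits `_`. Both patterns are illustrative assumptions, not the exact expressions in cli/qmd.ts or store.ts, and the helper names are hypothetical.

```typescript
// Illustrative patterns only; not copied from the repository.

// Strips everything outside [A-Za-z0-9_].
const stripNonWord = (s: string): string => s.replace(/[^\w]/g, "");

// Strips everything that is not a Unicode letter or number (so "_" is removed).
const stripNonLetterNumber = (s: string): string =>
  s.replace(/[^\p{L}\p{N}]/gu, "");

console.log(stripNonWord("my_variable"));         // "my_variable"
console.log(stripNonLetterNumber("my_variable")); // "myvariable"

// \w is ASCII-only, so it drops non-ASCII letters that \p{L} keeps.
console.log(stripNonWord("日本語"));               // ""
console.log(stripNonLetterNumber("日本語"));       // "日本語"
```

The last two calls suggest one reason to add `_` to a Unicode-aware class instead of switching the store to `\w`: the former keeps both underscores and non-ASCII letters.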

This contribution was developed with AI assistance (Claude Code).

sanitizeFTS5Term stripped all non-letter/non-number characters including
underscores, causing snake_case identifiers like `my_variable` to become
`myvariable` and silently fail BM25 matches.

Add underscore to the preserved character set in the Unicode regex.
Export the function and add unit tests covering snake_case, contractions,
punctuation stripping, and unicode.

Fixes tobi#305

Co-Authored-By: Claude Opus 4.6 (1M context)
fxstein pushed a commit to fxstein/qmd that referenced this pull request Mar 18, 2026
Preserve hyphens and underscores in sanitizeFTS5Term so FTS5's unicode61
tokenizer can split them symmetrically at query time, producing precise
phrase matches. Also fix validateSemanticQuery false positive that rejected
hyphenated terms like DEC-0054 as negation syntax in vec/hyde queries.

Complements tobi#404 (underscore-only fix) by also covering hyphens.
Refs: tobi#305, tobi#417
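
The "symmetric split" argument in this commit message can be illustrated with a rough approximation of a tokenizer that treats every character other than a letter or number as a separator. This is a simplified stand-in written for illustration, not FTS5's actual unicode61 implementation (which also handles details such as diacritic folding), and `approxTokenize` and `containsPhrase` are hypothetical helpers.

```typescript
// Simplified stand-in for a tokenizer that splits on anything that is not a
// Unicode letter or number. Not the real unicode61 code.
function approxTokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter((t) => t.length > 0);
}

// True if `phrase` occurs in `doc` as a contiguous run of tokens.
function containsPhrase(doc: string[], phrase: string[]): boolean {
  if (phrase.length === 0) return false;
  for (let i = 0; i + phrase.length <= doc.length; i++) {
    if (phrase.every((tok, j) => doc[i + j] === tok)) return true;
  }
  return false;
}

// Index side: the document text is split at "_" and "-".
const indexed = approxTokenize("const my_variable = DEC-0054;");
console.log(indexed); // ["const", "my", "variable", "dec", "0054"]

// Query side, separator preserved: splits the same way, so the phrase matches.
console.log(containsPhrase(indexed, approxTokenize("my_variable"))); // true
console.log(containsPhrase(indexed, approxTokenize("DEC-0054")));    // true

// Query side, separator stripped by sanitizing: one fused token, no match.
console.log(containsPhrase(indexed, approxTokenize("myvariable")));  // false
```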
zeattacker pushed a commit to zeattacker/qmd that referenced this pull request Mar 26, 2026
sangdo90 pushed a commit to sangdo90/qmd that referenced this pull request Apr 5, 2026
tobi merged commit 1ad3388 into tobi:main on Apr 5, 2026
6 checks passed
@mvanhorn
Contributor Author

mvanhorn commented Apr 6, 2026

Thanks for the quick merge!

tanarchytan referenced this pull request in tanarchytan/lotl Apr 8, 2026
sanitizeFTS5Term stripped all non-letter/non-number characters including
underscores, causing snake_case identifiers like `my_variable` to become
`myvariable` and silently fail BM25 matches.

Add underscore to the preserved character set in the Unicode regex.
Export the function and add unit tests covering snake_case, contractions,
punctuation stripping, and unicode.

Fixes #305

Co-authored-by: Matt Van Horn
Co-authored-by: Claude Opus 4.6 (1M context)
Development

Successfully merging this pull request may close these issues.

BM25 search fails on snake_case identifiers (sanitizeFTS5Term strips underscores)

2 participants