Feat: use LMDB + orjson for better performance by thiswillbeyourgithub · Pull Request #279 · mwmbl/mwmbl

thiswillbeyourgithub · 2025-07-08T12:50:30Z

This PR contains just the commits from #270, minus the ones that were already merged. I then reran poetry lock.

Recopying what I said there:

I added a CLI at the end of the file, so you can run python indexer.py your_index directly. It converts your index into an LMDB index without removing the previous one, and reports a bunch of useful metrics.
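For readers who want a picture of that workflow, here is a minimal sketch of such a CLI. It is not the code from this PR: the output naming rule (lmdb_path_for), the injected convert callable, and the logger name are all hypothetical, and the only metric shown is the source file size.

```python
import argparse
import logging
from pathlib import Path

logger = logging.getLogger("indexer_cli")  # hypothetical logger name


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Convert an mmap index to LMDB (sketch).")
    parser.add_argument("index_path", type=Path, help="path to the existing index file")
    return parser


def lmdb_path_for(index_path: Path) -> Path:
    """Hypothetical naming rule: write next to the original, never over it."""
    return index_path.with_name(index_path.name + ".lmdb")


def main(argv=None, convert=None) -> int:
    """Parse arguments, log a basic metric, and hand off to `convert`.

    `convert(source, target)` stands in for the real conversion routine,
    which this sketch does not implement.
    """
    args = build_parser().parse_args(argv)
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    source = args.index_path
    if not source.is_file():
        logger.error("index not found: %s", source)
        return 1
    target = lmdb_path_for(source)
    logger.info("source size: %d bytes", source.stat().st_size)
    if convert is not None:
        convert(source, target)
    logger.info("original kept at %s, LMDB output at %s", source, target)
    return 0
```

Calling main(["your_index"], convert=some_function) runs the sketch; the original file is only read, never modified or deleted.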

In the penultimate commit (e1fa14d) I actually removed all the unnecessary methods and simplified the code. This was for testing purposes, but it worked fine so I left it in. It does, however, remove the 4096 page size limit, so that requires some thinking from you.
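To make the trade-off concrete, here is a rough sketch of the difference between fixed-size pages and variable-length values. It assumes the old format compresses each page and pads it to exactly 4096 bytes, and it uses the standard library's zlib and json as stand-ins for zstandard and orjson. The function names are hypothetical and not taken from the PR.

```python
import json
import zlib

PAGE_SIZE = 4096  # assumed to be bytes; the limit mentioned in the description


def pack_fixed_page(items: list, page_size: int = PAGE_SIZE) -> bytes:
    """Compress `items` and pad to exactly `page_size` bytes, or fail if too big."""
    compressed = zlib.compress(json.dumps(items).encode("utf-8"))
    if len(compressed) > page_size:
        raise ValueError(f"page needs {len(compressed)} bytes, limit is {page_size}")
    return compressed + b"\x00" * (page_size - len(compressed))


def pack_variable_page(items: list) -> bytes:
    """Compress `items` with no padding and no upper bound on the size."""
    return zlib.compress(json.dumps(items).encode("utf-8"))


def unpack_page(data: bytes) -> list:
    """Decode a page produced by either packer.

    A decompress object stops at the end of the zlib stream, so any
    trailing zero padding is left in `unused_data` and ignored.
    """
    decompressor = zlib.decompressobj()
    return json.loads(decompressor.decompress(data).decode("utf-8"))
```

With a fixed page size, the writer has to decide what to drop when a page overflows. Without it, every page fits, but nothing bounds how large a single value can grow, which is the part that needs a decision.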

feat: replace mmap with lmdb without breaking the db
slightly improved the conversion logic
enh: page key is just its index now
minor
fix: the lmdb does not need to be padded
no need to use bytes for the metadata key
doc: turn comments into docstrings
refactor: store ZstdCompressor and ZstdDecompressor as class attributes
refactor: optimize mmap to LMDB conversion with batch processing and memory management
feat: add minimal progress bar to index conversion using standard library
feat: add corruption detection and file size reporting to mmap to lmdb conversion
bump index version to 2
refactor: Optimize LMDB transactions for shorter, more focused operations
fix: actually yes we need to store as bytes
fix: forgot to strip metadata padding when converting
feat: add CLI interface for troubleshooting index loading in indexer.py
feat: add logging configuration for CLI script output
feat: modify progress reporting to log only every 5 percent during index conversion
feat: add configurable default compression level for zstandard
distinguish types of error on conversion
feat: migrate from JSON to MessagePack for serialization in LMDB indexer
refactor: replace msgpack with orjson for serialization
feat: migrate mmap index to LMDB with orjson and page size fitting
refactor: Fix JSON decoding in index conversion to prevent page size overflow
refactor: remove unnecessary page padding for LMDB storage
perf: set default compression level to 8
refactor: modify _get_page_data to create compressor internally
fix: align LMDB index path creation with initialization method
test: conditionally check serialized data assertion based on VERSION
refactor: Prevent LMDB BadRslotError by using fresh environments per operation
refactor: simplify data storage by removing page size constraints
ran ruff
add orjson to poetry
ran poetry lock

Signed-off-by: thiswillbeyourgithub <[email protected]>
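Several of the commits above go back and forth on key handling ("page key is just its index now", "actually yes we need to store as bytes"). LMDB keys and values are byte strings, so an integer page index has to be encoded before it is used as a key. The sketch below shows one possible encoding; the 8-byte big-endian format and the metadata key name are assumptions for illustration, not what the PR necessarily does.

```python
import struct

KEY_FORMAT = ">Q"  # 8-byte unsigned big-endian; an assumption, not taken from the PR
METADATA_KEY = b"__metadata__"  # hypothetical reserved key; 12 bytes, so it cannot equal a page key


def page_key(index: int) -> bytes:
    """Encode a page index as a fixed-width bytes key."""
    return struct.pack(KEY_FORMAT, index)


def page_index(key: bytes) -> int:
    """Decode a key produced by `page_key` back into the page index."""
    (index,) = struct.unpack(KEY_FORMAT, key)
    return index
```

A fixed-width big-endian encoding has the side benefit that bytewise key order, which LMDB uses by default, matches numeric page order. A decimal string such as str(index).encode() would also work as a key but would sort "10" before "2".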


thiswillbeyourgithub added 30 commits July 8, 2025 14:44

feat: replace mmap with lmdb without breaking the db

ce1a913

slightly improved the conversion logic

19fb439

Signed-off-by: thiswillbeyourgithub <[email protected]>

enh: page key is just its index now

cfc4a81

Signed-off-by: thiswillbeyourgithub <[email protected]>

minor

3fdab4d

Signed-off-by: thiswillbeyourgithub <[email protected]>

fix: the lmdb does not need to be padded

92bf3ca

Signed-off-by: thiswillbeyourgithub <[email protected]>

no need to use bytes for the metadata key

6c7c43b

Signed-off-by: thiswillbeyourgithub <[email protected]>

doc: turn comments into docstrings

246d54b

Signed-off-by: thiswillbeyourgithub <[email protected]>

refactor: store ZstdCompressor and ZstdDecompressor as class attributes

aa2db87

refactor: optimize mmap to LMDB conversion with batch processing and memory management

dc59603

feat: add minimal progress bar to index conversion using standard library

e3808f5

feat: add corruption detection and file size reporting to mmap to lmdb conversion

64ad7b9

bump index version to 2

3da8c7d

Signed-off-by: thiswillbeyourgithub <[email protected]>

refactor: Optimize LMDB transactions for shorter, more focused operations

06639b4

fix: actually yes we need to store as bytes

d743119

Signed-off-by: thiswillbeyourgithub <[email protected]>

fix: forgot to strip metadata padding when converting

763ddeb

Signed-off-by: thiswillbeyourgithub <[email protected]>

feat: add CLI interface for troubleshooting index loading in indexer.py

dd5d9c1

feat: add logging configuration for CLI script output

55976ed

feat: modify progress reporting to log only every 5 percent during index conversion

95c52ac

feat: add configurable default compression level for zstandard

ebda741

distinguish types of error on conversion

b82f6eb

Signed-off-by: thiswillbeyourgithub <[email protected]>

feat: migrate from JSON to MessagePack for serialization in LMDB indexer

95bdb47

refactor: replace msgpack with orjson for serialization

8fc125b

feat: migrate mmap index to LMDB with orjson and page size fitting

3c5913a

refactor: Fix JSON decoding in index conversion to prevent page size overflow

9e7025d

refactor: remove unnecessary page padding for LMDB storage

e2560e4

perf: set default compression level to 8

692ece3

Signed-off-by: thiswillbeyourgithub <[email protected]>

refactor: modify _get_page_data to create compressor internally

7e0c556

fix: align LMDB index path creation with initialization method

7e5d56a

test: conditionally check serialized data assertion based on VERSION

65eb36b

refactor: Prevent LMDB BadRslotError by using fresh environments per operation

021e52e

thiswillbeyourgithub added 4 commits July 8, 2025 14:44

refactor: simplify data storage by removing page size constraints

8de8b1e

ran ruff

22a7fed

Signed-off-by: thiswillbeyourgithub <[email protected]>

add orjson to poetry

78989ff

Signed-off-by: thiswillbeyourgithub <[email protected]>

ran poetry lock

0e2f687

Signed-off-by: thiswillbeyourgithub <[email protected]>

thiswillbeyourgithub wants to merge 34 commits into mwmbl:main from thiswillbeyourgithub:lmdb-orjson
