
Feat: use LMDB + orjson for better performance #279

Open
thiswillbeyourgithub wants to merge 34 commits into mwmbl:main from thiswillbeyourgithub:lmdb-orjson

@thiswillbeyourgithub
Contributor

This contains the commits from #270, minus the ones that were already merged. I then reran poetry lock.

Recopying what I said there:

I added a CLI at the end of the file, so you can directly run python indexer.py your_index and it will convert your index into an LMDB index without removing the previous one, while reporting a number of useful metrics.

In the penultimate commit (e1fa14d) I removed all the unnecessary methods and simplified the code. This was for testing purposes, but it worked fine so I left it in. Note that it removes the 4096 page size limit, so that requires some thinking from you.
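
For context on what dropping the limit changes, here is a minimal sketch of one common way to enforce a fixed page size: items are serialized and compressed, trailing items are dropped until the result fits, and the page is then zero-padded. It is an illustration only, not the code from this PR. It uses the standard library's json and zlib as stand-ins for orjson and zstandard, and fit_page and read_page are hypothetical names.

```python
import json
import zlib

PAGE_SIZE = 4096  # the page size limit mentioned above


def fit_page(items, page_size=PAGE_SIZE):
    """Return (padded_page, kept_items), dropping trailing items until the page fits.

    json + zlib are stand-ins here; the PR itself uses orjson + zstandard.
    """
    kept = list(items)
    while True:
        compressed = zlib.compress(json.dumps(kept).encode("utf-8"))
        if len(compressed) <= page_size:
            # Pad with zero bytes so every page occupies exactly page_size bytes.
            padding = b"\x00" * (page_size - len(compressed))
            return compressed + padding, kept
        if not kept:
            raise ValueError("page_size is too small to hold even an empty page")
        kept.pop()  # drop the last item and try again


def read_page(page):
    """Decode a padded page; bytes after the compressed stream are ignored."""
    decompressor = zlib.decompressobj()
    raw = decompressor.decompress(page)  # padding ends up in unused_data
    return json.loads(raw)
```

Without a size limit, the retry loop and the padding are no longer needed and each page stores whatever its items compress to, which presumably is why the change needs a decision: pages are then no longer bounded in size.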

  • feat: replace mmap with lmdb without breaking the db
  • slightly improved the conversion logic
  • enh: page key is just its index now
  • minor
  • fix: the lmdb does not need to be padded
  • no need to use bytes for the metadata key
  • doc: turn comments into docstrings
  • refactor: store ZstdCompressor and ZstdDecompressor as class attributes
  • refactor: optimize mmap to LMDB conversion with batch processing and memory management
  • feat: add minimal progress bar to index conversion using standard library
  • feat: add corruption detection and file size reporting to mmap to lmdb conversion
  • bump index version to 2
  • refactor: Optimize LMDB transactions for shorter, more focused operations
  • fix: actually yes we need to store as bytes
  • fix: forgot to strip metadata padding when converting
  • feat: add CLI interface for troubleshooting index loading in indexer.py
  • feat: add logging configuration for CLI script output
  • feat: modify progress reporting to log only every 5 percent during index conversion
  • feat: add configurable default compression level for zstandard
  • distinguish types of error on conversion
  • feat: migrate from JSON to MessagePack for serialization in LMDB indexer
  • refactor: replace msgpack with orjson for serialization
  • feat: migrate mmap index to LMDB with orjson and page size fitting
  • refactor: Fix JSON decoding in index conversion to prevent page size overflow
  • refactor: remove unnecessary page padding for LMDB storage
  • perf: set default compression level to 8
  • refactor: modify _get_page_data to create compressor internally
  • fix: align LMDB index path creation with initialization method
  • test: conditionally check serialized data assertion based on VERSION
  • refactor: Prevent LMDB BadRslotError by using fresh environments per operation
  • refactor: simplify data storage by removing page size constraints
  • ran ruff
  • add orjson to poetry
  • ran poetry lock
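
Two of the commits above (page keys being the page index stored as bytes, and progress logged only every 5 percent) can be sketched as follows. This is a rough illustration under stated assumptions, not the PR's implementation: a plain dict stands in for the LMDB database, the fixed-width big-endian key encoding is an assumption (the encoding actually used in the PR is not shown here), and page_key and convert are hypothetical names.

```python
import logging

logger = logging.getLogger("convert_sketch")


def page_key(index: int) -> bytes:
    """Encode a page index as a fixed-width big-endian byte key (assumed encoding).

    With a fixed width, byte-wise ordering of keys matches numeric ordering.
    """
    return index.to_bytes(8, "big")


def convert(pages, store, step_percent=5):
    """Copy pages into store under byte keys, logging once per step_percent.

    store is any mutable mapping; a dict stands in for an LMDB database here.
    Returns the list of percentages that were logged.
    """
    total = len(pages)
    next_report = step_percent
    reported = []
    for done, page in enumerate(pages, start=1):
        store[page_key(done - 1)] = page
        percent = done * 100 // total
        if percent >= next_report:
            logger.info("converted %d/%d pages (%d%%)", done, total, percent)
            reported.append(percent)
            # Move the threshold to the next multiple of step_percent.
            next_report = (percent // step_percent + 1) * step_percent
    return reported
```

Encoding the index as decimal text would also give byte keys, but then b"10" sorts before b"9"; a fixed-width encoding avoids that if key order matters.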

Signed-off-by: thiswillbeyourgithub <[email protected]>