
enh: crawler + lmdb + orjson #270

Closed
thiswillbeyourgithub wants to merge 79 commits into mwmbl:main from thiswillbeyourgithub:enh-crawler-test-lmdb
Conversation

@thiswillbeyourgithub
Contributor

  • feat: replace mmap with lmdb without breaking the db

This is exactly the same as #269, but with targeted changes to mwmbl/tinysearchengine/indexer.py to use lmdb instead of mmap. The test passes with DJANGO_SETTINGS_MODULE="mwmbl.settings_dev" pytest, the build works fine, and the up works.

I've been running it for a few minutes without issues.

There is a conversion method that turns the old format into the LMDB format, so the previous index can be reused. I did not add a fallback in case the conversion fails, but it should be simple to do (just create an indexer_fallback.py and, if anything goes wrong during the creation, return an instance of the previous class?).
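A minimal sketch of what such a fallback could look like. The class names `MmapIndex` and `LmdbIndex`, the `open_index` helper, and the `convert` callable are all hypothetical stand-ins, not the names used in indexer.py:

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


class MmapIndex:
    """Hypothetical stand-in for the previous mmap-backed index class."""

    def __init__(self, path):
        self.path = Path(path)
        self.backend = "mmap"


class LmdbIndex:
    """Hypothetical stand-in for the new LMDB-backed index class."""

    def __init__(self, path, convert):
        self.path = Path(path)
        self.backend = "lmdb"
        # `convert` represents the old-format -> LMDB conversion step;
        # it is expected to raise if the conversion fails.
        convert(self.path)


def open_index(path, convert):
    """Try the LMDB index first; return the old class if creation fails."""
    try:
        return LmdbIndex(path, convert)
    except Exception:
        logger.exception("LMDB creation failed for %s, falling back to mmap", path)
        return MmapIndex(path)


def failing_conversion(path):
    # Simulates a conversion that goes wrong.
    raise RuntimeError("simulated conversion failure")


index = open_index("your_index", failing_conversion)
fallback_backend = index.backend
```

Catching a broad `Exception` here is a deliberate choice for the sketch: any failure during creation leads back to the previous class, and the error is logged rather than swallowed silently.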

What do you think about this?

Signed-off-by: thiswillbeyourgithub <[email protected]>
because loguru takes care of the tracing

@thiswillbeyourgithub
Contributor Author

Okay, so starting from 0a0f8fe I added a lot more improvements: orjson is now used, compression levels are customizable, lmdb seems robust, etc.
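To illustrate the idea of serializing with orjson and compressing at a configurable level, here is a rough sketch. The PR does not say which compressor is used, so stdlib `zlib` stands in for it as an assumption; the helper names `encode_page` and `decode_page` and the sample item are hypothetical too. The sketch falls back to the stdlib `json` module when orjson is not installed:

```python
import json
import zlib

try:
    import orjson  # the serializer the PR switches to
except ImportError:
    orjson = None  # fall back to the stdlib so the sketch still runs


def dumps(obj) -> bytes:
    """Serialize to JSON bytes, with orjson when available."""
    if orjson is not None:
        return orjson.dumps(obj)
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


def loads(data: bytes):
    """Parse JSON bytes, with orjson when available."""
    if orjson is not None:
        return orjson.loads(data)
    return json.loads(data)


def encode_page(items, level=6):
    """Serialize then compress; `level` is the tunable compression level.

    zlib is an assumed stand-in for whatever compressor the index uses.
    """
    return zlib.compress(dumps(items), level)


def decode_page(blob):
    """Inverse of encode_page."""
    return loads(zlib.decompress(blob))


# Hypothetical page content: a list of [url, title, extract] entries.
items = [["https://example.com", "Example", "An example page"]] * 50
blob = encode_page(items, level=9)
restored = decode_page(blob)
```

Higher levels usually trade CPU time for smaller pages, which is why exposing the level as a parameter can be useful.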

The only test change: the padding check now runs only if the version is 1 (lmdb is version 2).

I added a CLI at the end of the file so that you can directly run python indexer.py your_index and it will convert your index into an LMDB index without removing the previous one, and print a bunch of useful metrics.
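The general shape of such a CLI could be sketched as follows. This is not the code from the PR: the `.lmdb` output name, the `convert` callable, and the reported metrics are assumptions chosen for illustration, and the actual conversion is left to the caller:

```python
import argparse
import tempfile
from pathlib import Path


def total_size(path: Path) -> int:
    """Size in bytes of a file, or of all files under a directory."""
    if path.is_file():
        return path.stat().st_size
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def main(argv, convert):
    """Parse the index path, run `convert(src, dst)`, and report sizes."""
    parser = argparse.ArgumentParser(description="Convert an index to LMDB (sketch)")
    parser.add_argument("index", type=Path, help="path to the existing index")
    args = parser.parse_args(argv)

    # Hypothetical naming: write next to the old index, never over it.
    target = args.index.with_name(args.index.name + ".lmdb")
    convert(args.index, target)

    old_bytes = total_size(args.index)
    new_bytes = total_size(target)
    report = {
        "old_bytes": old_bytes,
        "new_bytes": new_bytes,
        "reduction": 1 - new_bytes / old_bytes,
        "old_kept": args.index.exists(),
    }
    print(f"{old_bytes} B -> {new_bytes} B ({report['reduction']:.1%} smaller)")
    return report


# Demo with a fake conversion that just writes a smaller file.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "your_index"
    old.write_bytes(b"x" * 1000)
    report = main([str(old)], convert=lambda src, dst: dst.write_bytes(b"x" * 100))
```

Writing to a separate target path is what keeps the previous index intact, so a failed or unsatisfying conversion can simply be discarded.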

Notably, instead of 390MB my db is now 43MB.

In the last commit I ran ruff.

In the penultimate commit (e1fa14d) I actually removed all the unnecessary methods and simplified the code. This was for testing purposes, but it worked fine so I left it in. However, it removes the 4096 page size limit, so that requires some thinking from you.

Feedback much appreciated.

thiswillbeyourgithub changed the title from "enh: crawler + lmdb" to "enh: crawler + lmdb + orjson" on Jun 19, 2025
thiswillbeyourgithub marked this pull request as ready for review on June 19, 2025 at 11:19
@thiswillbeyourgithub
Contributor Author

I just noticed that, because those changes are restricted to mwmbl/tinysearchengine/indexer.py and one test file, I could actually have branched from main directly. Let me know if you prefer that.
