enh: crawler + lmdb + orjson #270
Signed-off-by: thiswillbeyourgithub <[email protected]>
…ADME with workflow details
because loguru takes care of the tracing Signed-off-by: thiswillbeyourgithub <[email protected]>
Okay, so starting from 0a0f8fe I added a lot more improvements. orjson is now used, compression levels are customizable, lmdb seems robust, etc. The only test I modified now checks the padding only if the version is 1 (lmdb is version 2). I added a CLI at the end of the file so that you can run it directly. Notably, my db is now 43MB instead of 390MB. In the last commit I ran ruff. In the penultimate commit (e1fa14d) I removed all the unnecessary methods and simplified the code. This was for testing purposes, but it worked fine so I left it in. Note that it removes the 4096 page size limit, so that requires some thinking from you. Feedback much appreciated.
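To make the "customizable compression levels" point concrete, here is a rough sketch of the idea, not the code from this PR. It uses only the standard library: `json` stands in for orjson and `zlib` stands in for whichever compressor the indexer actually uses. The names `encode_page` and `decode_page` and the sample item layout are made up for illustration.

```python
import json
import zlib


def encode_page(items, level=6):
    """Serialise a list of items to compact JSON, then compress at `level`."""
    # json is a stand-in for orjson here; zlib is a stand-in compressor.
    raw = json.dumps(items, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, level)


def decode_page(blob):
    """Inverse of encode_page: decompress, then parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical page content: 50 identical [url, title, extract, score] rows.
page = [["https://example.com", "Example title", "Example extract", 1.0]] * 50

fast = encode_page(page, level=1)   # cheaper to compute
small = encode_page(page, level=9)  # spends more CPU on compression
```

Both blobs decode back to the same page; the level only trades CPU time against stored size, which is the knob that is now exposed.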
I just noticed that, because those changes are restricted to mwmbl/tinysearchengine/indexer.py and one test file, I could actually have branched from main directly. Let me know if you prefer that.
This is exactly the same as #269, but with targeted changes to mwmbl/tinysearchengine/indexer.py to use lmdb instead of mmap. The test passes with
DJANGO_SETTINGS_MODULE="mwmbl.settings_dev" pytest, the build works fine, and the up works. I've been running it for a few minutes without issues.
There is a conversion method that turns the old format into LMDB format, so we can reuse the previous index. I did not add a fallback in case the conversion fails, but it should be simple to do (just create an indexer_fallback.py and, if anything goes wrong during the creation, return an instance of the previous class?).
What do you think about this?
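For the fallback idea, something along these lines is what I have in mind. This is only a sketch: `LegacyIndex`, `LmdbIndex`, `open_index`, and the `converter` argument are hypothetical names for illustration, not the classes in indexer.py.

```python
class LegacyIndex:
    """Hypothetical stand-in for the previous mmap-based index class."""

    def __init__(self, path):
        self.path = path


class LmdbIndex:
    """Hypothetical stand-in for the new LMDB-backed index class."""

    def __init__(self, path, converter):
        self.path = path
        # Convert the old format on creation; this call may raise.
        converter(path)


def open_index(path, converter):
    """Prefer the LMDB backend, fall back to the old class if creation fails."""
    try:
        return LmdbIndex(path, converter)
    except Exception as exc:
        print(f"conversion failed ({exc!r}); falling back to the legacy index")
        return LegacyIndex(path)
```

Callers would only ever go through `open_index`, so a failed conversion degrades to the old behaviour instead of crashing at startup.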