
enhancements playground #242

Draft
thiswillbeyourgithub wants to merge 42 commits into mwmbl:main from thiswillbeyourgithub:enh_crawler

Conversation

@thiswillbeyourgithub (Contributor) commented May 30, 2025

Hi,

While teaching myself how the crawler works, I tend to write missing docs using LLMs, as checking them is far faster than understanding the raw code.

Hence I'm making this PR as a suggestion.

  • I think it goes in the right direction, but there are probably inaccuracies in the docs, and it would be nice if you could give me your thoughts on them.

  • If add docker compose for new crawler #241 is merged, I can add a comment in the docker compose pointing to the new readme for the env var settings.

  • If there are env var settings you do not want users to modify, let me know.

  • For the others, there are probably ranges of acceptable values, so it might be good to add asserts to enforce them and mention them in the readme too (see the sketch below).

I can do all that if you tell me to.
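As a sketch of the kind of range check meant above (the env var name, default, and bounds here are invented for illustration, not actual crawler settings):

```python
import os

def get_int_env(name: str, default: int, lo: int, hi: int) -> int:
    """Read an integer env var and assert it falls in an acceptable range."""
    value = int(os.environ.get(name, default))
    assert lo <= value <= hi, f"{name}={value} must be within [{lo}, {hi}]"
    return value

# Hypothetical setting: cap the number of crawler threads.
NUM_THREADS = get_int_env("MWMBL_NUM_THREADS", default=4, lo=1, hi=64)
```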

Edit: I went ahead and added a bunch more things. Feedback welcome. I have been running it for the past hour with no issues.

Signed-off-by: thiswillbeyourgithub <[email protected]>
@daoudclarke (Contributor) left a comment

In general this looks good. Have you tested it locally? When I have time I will try and test it myself.

Using uv is a no-brainer, so thanks for that.

@thiswillbeyourgithub (Contributor, Author) commented May 30, 2025

I have built it and run it locally, yes, but I haven't run any specific test suite, if that's your question.

@thiswillbeyourgithub (Contributor, Author) commented Jun 2, 2025

A few more changes for the crawler:

  • applied black to clean up the code
  • switched to loguru for better logging
  • added an env var for a configurable log level

And something a bit more ambitious for #249:

  • tried to switch the backend of the local index to LMDB and remove the zstd step (the lmdb-dict wrapper handles compression on the fly, and LMDB itself provides mmap, thread-safe concurrency, and atomic writes)

That last point seems to mostly work. I am running it right now.
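For context, here is a minimal sketch of what the LMDB-backed store looks like through py-lmdb directly (the branch actually goes through lmdb-dict-full; the path, key encoding, and JSON values are illustrative assumptions):

```python
import json
import lmdb

env = lmdb.open("crawl-index.lmdb", map_size=2 * 1024**3)  # 2 GB map, as set in the branch

def put_page(url: str, page: dict) -> None:
    # Each write transaction is atomic; LMDB serialises writers internally.
    with env.begin(write=True) as txn:
        txn.put(url.encode(), json.dumps(page).encode())

def get_page(url: str) -> dict | None:
    # Reads go through the memory map; readers never block writers.
    with env.begin() as txn:
        raw = txn.get(url.encode())
        return json.loads(raw) if raw is not None else None
```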

Btw, if I'm not mistaken there is no way for the main server to send messages to the crawler. Wouldn't that be useful? It could carry things like "please update your crawler", "you are sending invalid data", "you're a top 50 contributor today", or "downtime planned tomorrow". As it stands, if a client is broken and sends rubbish, there is no way for the master server to tell the crawler about the issue. Should I open an issue to track this?
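Purely as a hypothetical sketch of what such a message could look like (nothing like this exists in mwmbl today; every field here is invented):

```python
from dataclasses import dataclass

@dataclass
class ServerMessage:
    kind: str  # e.g. "update_required", "invalid_data", "notice"
    text: str  # human-readable detail for the crawler to log or display

# The central server could piggyback these on an existing batch response, e.g.:
# {"status": "ok", "messages": [{"kind": "notice", "text": "downtime planned tomorrow"}]}
```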

Also, there's something I was not sure about: how does the local index grow in the crawler? Is its size bounded? LMDB requires a hard size limit at creation (it can be changed on the fly, but the default is very low) and I didn't know what to set. I've set it to 2 GB for now.

@thiswillbeyourgithub (Contributor, Author) commented Jun 2, 2025

I've been running the crawler for a couple of hours now and it seems to be running very smoothly, apart from 504 errors from mwmbl.org, which seems to be struggling; I don't think that has anything to do with me. The crawl-index.tinysearch file weighs 10 MB.

@daoudclarke (Contributor) commented

This is looking good. I'm concerned about the LMDB change - I think it's potentially a good idea but it will break the existing index which is used on the central server.

I would ideally like to see benchmarks comparing it to our existing index. Ideally it would be faster, but even if it's a little slower that would be fine. You could just generate, e.g., 1000 random queries and run them against the same index translated to LMDB.
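A minimal sketch of such a benchmark; the query generator and the `retrieve` callables are stand-ins, not mwmbl's actual API:

```python
import random
import string
import time

def random_queries(n: int = 1000, words: int = 2) -> list[str]:
    def word() -> str:
        return "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 8)))
    return [" ".join(word() for _ in range(words)) for _ in range(n)]

def benchmark(retrieve, queries: list[str]) -> float:
    """Run every query through `retrieve` and return total wall-clock seconds."""
    start = time.perf_counter()
    for query in queries:
        retrieve(query)
    return time.perf_counter() - start

queries = random_queries()
# benchmark(tiny_index.retrieve, queries)  # existing index
# benchmark(lmdb_index.retrieve, queries)  # same data translated to LMDB
```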

Eventually we need to change the database for the central server and LMDB may be a good fit but we will need a way to migrate the existing index. Also, the plan for the central server is that it ends up as more of a cache on S3-like storage. That way we can make the index as big as we like without being constrained by the disk of the central server. LMDB may still work for the cache though.

However, LMDB looks like a good idea for the crawler. Perhaps you can create a separate PR where we keep the TinyIndexer as it is to be used on the central server, and create an LMDBIndexer to be used in the crawler. That will be a good way to start using it properly and iron out any issues.

@daoudclarke (Contributor) commented

Forgot to say, the size of the current index is fixed, based on the number of pages and the size of each page.
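For illustration, the arithmetic that follows from a fixed page count and page size (both numbers below are assumptions, not the server's real configuration):

```python
PAGE_SIZE = 4096        # bytes per page (assumed)
NUM_PAGES = 10_240_000  # hypothetical page count
print(f"index size: {PAGE_SIZE * NUM_PAGES / 1024**3:.1f} GiB")  # -> 39.1 GiB, fixed at creation
```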

@daoudclarke (Contributor) commented

Sending messages to the crawler sounds good. Please try and use separate PRs for each feature; that way, the no-brainers can get merged quickly without having to wait for experimental stuff like new index formats 😉

@daoudclarke (Contributor) commented

Another thought - if you're rewriting the crawler using LMDB as the index, this might be a good time to rewrite it in Rust, if you feel confident doing that.

@thiswillbeyourgithub (Contributor, Author) commented Jun 3, 2025

Btw, here is the most important Python doc for LMDB. They state:

> On 64-bit there is no penalty for making this huge (say 1TB). Must be <2GB on 32-bit. The default map size is set low to encourage a crash, so users can figure out a good value before learning about this option too late.

So I can set the size limit way higher. I'd appreciate it if you could give a ballpark estimate of the size of the crawler's local index. I don't know how much it grows or what exactly triggers an increase in size.
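In code, the behaviour that quote describes, using py-lmdb (path and sizes illustrative):

```python
import lmdb

# On 64-bit, a huge virtual map costs nothing up front, so it can be generous.
env = lmdb.open("crawl-index.lmdb", map_size=1024**4)  # 1 TB virtual map

# If an existing environment was created too small, it can be grown live:
env.set_mapsize(2 * 1024**4)  # no need to recreate the database
```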

In any case, I used lmdb-dict-full for simplicity and because it takes care of the zstd compression + caching IIRC, but it might be better to rely only on py-lmdb? Not sure if that's an issue. The thread-safe aspect is apparently intrinsic to LMDB.

> This is looking good. I'm concerned about the LMDB change - I think it's potentially a good idea but it will break the existing index which is used on the central server.

Yeah, for sure, it could be a problem. But I don't think I can easily run the central server to troubleshoot, so I could only rely on the tests, and I had to modify the tests somewhat blindly, so that's somewhere intense scrutiny would be good. Otherwise I'm not so concerned about the migration of the main server: the migration is "just" reading all key/values from one db and putting them in the other one, no? The downtime shouldn't be too big. I'm sure responsible professionals run both DBs side by side for a month, monitor any deviation of the new backend from the reference one, and make the switch once everything has settled.
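A rough sketch of that copy loop, assuming a hypothetical `old_index.items()` iterator yielding raw key/value bytes:

```python
import lmdb

def migrate(old_index, lmdb_path: str, map_size: int = 2 * 1024**3) -> None:
    env = lmdb.open(lmdb_path, map_size=map_size)
    # One write transaction makes the whole copy atomic: either all entries
    # land in the new database or none do.
    with env.begin(write=True) as txn:
        for key, value in old_index.items():
            txn.put(key, value)
    env.sync()  # flush to disk before switching over
```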

I don't know how to do this, so I'm asking you instead: if you could write some tests for the crawler that mimic the central server, so that I can see in advance whether a change on the crawler breaks the architecture, that would be helpful. I don't feel up to that myself.

> I would ideally like to see benchmarks comparing it to our existing index - ideally it would be faster, but even if it's a little slower that would be fine. You could just generate e.g. 1000 random queries and run them against the same index translated to LMDB.

Would you mind telling me exactly which API you are referring to? Ideally, give me the Python code you have in mind for the original index and I'll port it to LMDB, or write pseudocode with the exact method names and parameters you have in mind.

> Eventually we need to change the database for the central server and LMDB may be a good fit but we will need a way to migrate the existing index. Also, the plan for the central server is that it ends up as more of a cache on S3-like storage. That way we can make the index as big as we like without being constrained by the disk of the central server. LMDB may still work for the cache though.

Btw, if you planned on using MinIO for FOSS S3, there might be trouble ahead.

> However, LMDB looks like a good idea for the crawler. Perhaps you can create a separate PR where we keep the TinyIndexer as it is to be used on the central server, and create an LMDBIndexer to be used in the crawler. That will be a good way to start using it properly and iron out any issues.

Yeah, good call. I'll try to do that. Btw, I was confused at some point between tinysearchengine/indexer and indexer/*.py; would you mind explaining the difference? In particular, it's not easy to figure out which parts of the code are crawler-only and which parts affect the master server, for example.

> Forgot to say, the size of the current index is fixed, based on the number of pages and the size of each page.

Are you talking about the local index of the crawler or the index of the master server? I thought the size of each page was fixed but that new pages could be added, which would mean the total size is not fixed. What am I misunderstanding here?

> Sending messages to the crawler sounds good.

I'm not comfortable making changes on the central server side. I think I would be if you wrote an ARCHITECTURE.md file that recaps the relationships between subdirs and the role of each file. Quite a lot of the backend is still very obscure to me, and a technical FAQ would help here.

> Please try and use separate PRs for each feature, that way, the no-brainers can get merged quickly without having to wait for experimental stuff like new index formats 😉

Yeah, as you can guess, that was the plan, but some things are hard to scope down. For example, loguru is, I think, superior to the standard logging module in every way, but switching would be a large change. Same for LMDB: I didn't know in advance whether changing the local crawler index would have side effects elsewhere. Also, I would find it awesome if you could add a commit hook to run black on the Python files, but that might just be my taste. So yeah, I don't intend this PR to be merged; it became more of a playground.

Here's a proposal:

  1. ASAP, you run black everywhere and add a commit hook (I'm fine with doing it in a separate PR, but I could probably hide malicious things in a huge-SLOC change, so I prefer to ask), then push.
  2. You give me, or write, a performance test function in a pytest file; maybe just something that gets run when we call python pytest/indexer_perf.py, since it's not a unit test.
  3. I do a PR for a repo-wide change to use loguru instead of the standard logger.
  4. I redo a PR with the no-brainers plus the mostly-crawler enhancements (like the redis ping, env vars, etc., but not LMDB of course).
  5. Then I can properly redo the more invasive changes for LMDBIndexer (and possibly fastbloom-rs). I'll probably then use py-lmdb instead of lmdb-dict-full and keep your zstd code, which seems like a more future-proof plan.

What do you think of this? I'm fine with any other plan of your choosing. If you're swamped, that's fine obviously, but I'd prefer to keep going as long as I'm fully nerdsniped :)

@thiswillbeyourgithub (Contributor, Author) commented

> Another thought - if you're rewriting the crawler using LMDB as the index, this might be a good time to rewrite it in Rust, if you feel confident doing that.

I cannot read a single line of Rust, unfortunately. And tbh, with my current (low) knowledge of mwmbl, I'm not sure that should be prioritized: Python is fast for prototyping (as I've shown), you benefit from a larger skill pool (I'm basically only good at Python), there seems to be low-hanging fruit left for crawler performance (fastbloom-rs, hopefully LMDB), and, importantly, the crawler bottleneck has more to do with network IO and the number of processes and threads than with raw computing power. My 2 cents (again, I'm a psychiatry resident, take it with a grain of salt) is that mwmbl's design will probably change in the future, and it's currently moving slowly enough that I'd bet enhancing the docs to attract users would be a far better bet. I'd gladly be proven wrong, of course, but then I can just watch the repo from afar :)

thiswillbeyourgithub marked this pull request as draft on Jun 3, 2025, 10:17
thiswillbeyourgithub changed the title from "enh crawler" to "enhancement playground" on Jun 3, 2025
thiswillbeyourgithub changed the title from "enhancement playground" to "enhancements playground" on Jun 3, 2025