
enhancements playground #242

Draft
thiswillbeyourgithub wants to merge 42 commits into mwmbl:main from thiswillbeyourgithub:enh_crawler

Conversation

@thiswillbeyourgithub (Contributor) commented May 30, 2025

Hi,

While teaching myself how the crawler works, I tend to write missing docs using LLMs, as checking them is far faster than understanding the raw code.

Hence I'm making this PR as a suggestion.

  • I think it goes in the right direction, but there are probably inaccuracies in the docs, and it would be nice if you could give me your thoughts on them.

  • If add docker compose for new crawler #241 is merged, I can add a comment in the docker compose pointing to the new readme for the env var settings.

  • If there are env var settings you do not want users to modify, let me know.

  • For the others, there are probably ranges of acceptable values, so it might be good to add asserts to enforce them and mention them in the readme too (see the sketch below).

I can do all that if you tell me to.
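As a sketch of the kind of range check meant above (the env var name, default, and bounds here are invented for illustration, not actual crawler settings):

```python
import os

def get_int_env(name: str, default: int, lo: int, hi: int) -> int:
    """Read an integer env var and assert it falls in an acceptable range."""
    value = int(os.environ.get(name, default))
    assert lo <= value <= hi, f"{name}={value} must be within [{lo}, {hi}]"
    return value

# Hypothetical setting: cap the number of crawler threads.
NUM_THREADS = get_int_env("MWMBL_NUM_THREADS", default=4, lo=1, hi=64)
```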

Edit: I went ahead and added a bunch more things. Feedback welcome. I have been running it for the past hour with no issues.

Signed-off-by: thiswillbeyourgithub <[email protected]>
@daoudclarke (Contributor) left a comment

In general this looks good. Have you tested it locally? When I have time I will try and test it myself.

Using uv is a no-brainer, so thanks for that.

@thiswillbeyourgithub (Contributor, Author) commented May 30, 2025

I have built it and run it locally, yes, but I haven't run any specific test suite, if that's your question.

@thiswillbeyourgithub (Contributor, Author) commented Jun 2, 2025

A few more changes for the crawler:

  • applied black to clean up the code
  • switched to loguru for better logging
  • added an env var for a configurable log level

And something a bit more ambitious for #249:

  • tried to switch the backend of the local index to LMDB and remove the zstd step (the lmdb-dict wrapper handles compression on the fly, and LMDB itself provides mmap, thread-safe concurrency, and atomic writes)

That last point seems to mostly work. I am running it right now.
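For context, here is a minimal sketch of what the LMDB-backed store looks like through py-lmdb directly (the branch actually goes through lmdb-dict-full; the path, key encoding, and JSON values are illustrative assumptions):

```python
import json
import lmdb

env = lmdb.open("crawl-index.lmdb", map_size=2 * 1024**3)  # 2 GB map, as set in the branch

def put_page(url: str, page: dict) -> None:
    # Each write transaction is atomic; LMDB serialises writers internally.
    with env.begin(write=True) as txn:
        txn.put(url.encode(), json.dumps(page).encode())

def get_page(url: str) -> dict | None:
    # Reads go through the memory map; readers never block writers.
    with env.begin() as txn:
        raw = txn.get(url.encode())
        return json.loads(raw) if raw is not None else None
```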

Btw, if I'm not mistaken there is no way for the main server to send messages to the crawler. Wouldn't that be useful? It could carry things like "please update your crawler", "you are sending invalid data", "you're a top 50 contributor today", or "downtime planned tomorrow". As it stands, if a client is broken and sends rubbish, there is no way for the master server to tell the crawler about the issue. Should I open an issue to track this?
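Purely as a hypothetical sketch of what such a message could look like (nothing like this exists in mwmbl today; every field here is invented):

```python
from dataclasses import dataclass

@dataclass
class ServerMessage:
    kind: str  # e.g. "update_required", "invalid_data", "notice"
    text: str  # human-readable detail for the crawler to log or display

# The central server could piggyback these on an existing batch response, e.g.:
# {"status": "ok", "messages": [{"kind": "notice", "text": "downtime planned tomorrow"}]}
```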

Also, there's something I was not sure about: how does the local index grow in the crawler? Is its size bounded? LMDB requires a hard size limit at creation (it can be changed on the fly, but the default is very low) and I didn't know what to set. I've set it to 2 GB for now.

@thiswillbeyourgithub (Contributor, Author) commented Jun 2, 2025

I've been running the crawler for a couple of hours now and it seems to be running very smoothly, apart from 504 errors from mwmbl.org, which seems to be struggling; I don't think that has anything to do with me. The crawl-index.tinysearch file weighs 10 MB.

@daoudclarke (Contributor) commented

This is looking good. I'm concerned about the LMDB change - I think it's potentially a good idea but it will break the existing index which is used on the central server.

I would ideally like to see benchmarks comparing it to our existing index. Ideally it would be faster, but even if it's a little slower that would be fine. You could just generate, e.g., 1000 random queries and run them against the same index translated to LMDB.
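A minimal sketch of such a benchmark; the query generator and the `retrieve` callables are stand-ins, not mwmbl's actual API:

```python
import random
import string
import time

def random_queries(n: int = 1000, words: int = 2) -> list[str]:
    def word() -> str:
        return "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 8)))
    return [" ".join(word() for _ in range(words)) for _ in range(n)]

def benchmark(retrieve, queries: list[str]) -> float:
    """Run every query through `retrieve` and return total wall-clock seconds."""
    start = time.perf_counter()
    for query in queries:
        retrieve(query)
    return time.perf_counter() - start

queries = random_queries()
# benchmark(tiny_index.retrieve, queries)  # existing index
# benchmark(lmdb_index.retrieve, queries)  # same data translated to LMDB
```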

Eventually we need to change the database for the central server and LMDB may be a good fit but we will need a way to migrate the existing index. Also, the plan for the central server is that it ends up as more of a cache on S3-like storage. That way we can make the index as big as we like without being constrained by the disk of the central server. LMDB may still work for the cache though.

However, LMDB looks like a good idea for the crawler. Perhaps you can create a separate PR where we keep the TinyIndexer as it is to be used on the central server, and create an LMDBIndexer to be used in the crawler. That will be a good way to start using it properly and iron out any issues.

@daoudclarke (Contributor) commented

Forgot to say, the size of the current index is fixed, based on the number of pages and the size of each page.
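For illustration, the arithmetic that follows from a fixed page count and page size (both numbers below are assumptions, not the server's real configuration):

```python
PAGE_SIZE = 4096        # bytes per page (assumed)
NUM_PAGES = 10_240_000  # hypothetical page count
print(f"index size: {PAGE_SIZE * NUM_PAGES / 1024**3:.1f} GiB")  # -> 39.1 GiB, fixed at creation
```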

@daoudclarke (Contributor) commented

Sending messages to the crawler sounds good. Please try and use separate PRs for each feature; that way, the no-brainers can get merged quickly without having to wait for experimental stuff like new index formats 😉

@daoudclarke (Contributor) commented

Another thought - if you're rewriting the crawler using LMDB as the index, this might be a good time to rewrite it in Rust, if you feel confident doing that.

@thiswillbeyourgithub (Contributor, Author) commented Jun 3, 2025

Btw, here is the most important Python doc for LMDB. They state:

> On 64-bit there is no penalty for making this huge (say 1TB). Must be <2GB on 32-bit. The default map size is set low to encourage a crash, so users can figure out a good value before learning about this option too late.

So I can set the size limit way higher. I'd appreciate it if you could give a ballpark estimate of the size of the crawler's local index. I don't know how much it grows or what exactly triggers an increase in size.
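In code, the behaviour that quote describes, using py-lmdb (path and sizes illustrative):

```python
import lmdb

# On 64-bit, a huge virtual map costs nothing up front, so it can be generous.
env = lmdb.open("crawl-index.lmdb", map_size=1024**4)  # 1 TB virtual map

# If an existing environment was created too small, it can be grown live:
env.set_mapsize(2 * 1024**4)  # no need to recreate the database
```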

In any case, I used lmdb-dict-full for simplicity and because it takes care of the zstd compression + caching IIRC, but it might be better to rely only on py-lmdb? Not sure if that's an issue. The thread-safe aspect is apparently intrinsic to LMDB.

> This is looking good. I'm concerned about the LMDB change - I think it's potentially a good idea but it will break the existing index which is used on the central server.

Yeah, for sure, it could be a problem. But I don't think I can easily run the central server to troubleshoot, so I could only rely on the tests, and I had to modify the tests somewhat blindly, so that's somewhere intense scrutiny would be good. Otherwise I'm not so concerned about the migration of the main server: the migration is "just" reading all key/values from one db and putting them in the other one, no? The downtime shouldn't be too big. I'm sure responsible professionals run both DBs side by side for a month, monitor any deviation of the new backend from the reference one, and make the switch once everything has settled.
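A rough sketch of that copy loop, assuming a hypothetical `old_index.items()` iterator yielding raw key/value bytes:

```python
import lmdb

def migrate(old_index, lmdb_path: str, map_size: int = 2 * 1024**3) -> None:
    env = lmdb.open(lmdb_path, map_size=map_size)
    # One write transaction makes the whole copy atomic: either all entries
    # land in the new database or none do.
    with env.begin(write=True) as txn:
        for key, value in old_index.items():
            txn.put(key, value)
    env.sync()  # flush to disk before switching over
```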

I don't know how to do this, so I'm asking you instead: if you could write some tests for the crawler that mimic the central server, so that I can see in advance whether a change on the crawler breaks the architecture, that would be helpful. I don't feel up to that myself.

> I would ideally like to see benchmarks comparing it to our existing index - ideally it would be faster, but even if it's a little slower that would be fine. You could just generate e.g. 1000 random queries and run them against the same index translated to LMDB.

Would you mind telling me exactly which API you are referring to? Ideally, give me the Python code you have in mind for the original index and I'll port it to LMDB, or write pseudocode with the exact method names and parameters you have in mind.

> Eventually we need to change the database for the central server and LMDB may be a good fit but we will need a way to migrate the existing index. Also, the plan for the central server is that it ends up as more of a cache on S3-like storage. That way we can make the index as big as we like without being constrained by the disk of the central server. LMDB may still work for the cache though.

Btw, if you planned on using MinIO for FOSS S3, there might be trouble ahead.

> However, LMDB looks like a good idea for the crawler. Perhaps you can create a separate PR where we keep the TinyIndexer as it is to be used on the central server, and create an LMDBIndexer to be used in the crawler. That will be a good way to start using it properly and iron out any issues.

Yeah, good call. I'll try to do that. Btw, I was confused at some point between tinysearchengine/indexer and indexer/*.py; would you mind explaining the difference? In particular, it's not easy to figure out which parts of the code are crawler-only and which parts affect the master server, for example.

> Forgot to say, the size of the current index is fixed, based on the number of pages and the size of each page.

Are you talking about the local index of the crawler or the index of the master server? I thought the size of each page was fixed but that new pages could be added, which would mean the total size is not fixed. What am I misunderstanding here?

> Sending messages to the crawler sounds good.

I'm not comfortable making changes on the central server side. I think I would be if you wrote an ARCHITECTURE.md file that recaps the relationships between subdirs and the role of each file. Quite a lot of the backend is still very obscure to me, and a technical FAQ would help here.

> Please try and use separate PRs for each feature, that way, the no-brainers can get merged quickly without having to wait for experimental stuff like new index formats 😉

Yeah, as you can guess, that was the plan, but some things are hard to scope down. For example, loguru is, I think, superior to the standard logging module in every way, but switching would be a large change. Same for LMDB: I didn't know in advance whether changing the local crawler index would have side effects elsewhere. Also, I would find it awesome if you could add a commit hook to run black on the Python files, but that might just be my taste. So yeah, I don't intend this PR to be merged; it became more of a playground.

Here's a proposal:

  1. ASAP, you run black everywhere and add a commit hook (I'm fine with doing it in a separate PR, but I could probably hide malicious things in a huge-SLOC change, so I prefer to ask), then push.
  2. You give me, or write, a performance test function in a pytest file; maybe just something that gets run when we call python pytest/indexer_perf.py, since it's not a unit test.
  3. I do a PR for a repo-wide change to use loguru instead of the standard logger.
  4. I redo a PR with the no-brainers plus the mostly-crawler enhancements (like the redis ping, env vars, etc., but not LMDB of course).
  5. Then I can properly redo the more invasive changes for LMDBIndexer (and possibly fastbloom-rs). I'll probably then use py-lmdb instead of lmdb-dict-full and keep your zstd code, which seems like a more future-proof plan.

What do you think of this? I'm fine with any other plan of your choosing. If you're swamped, that's fine obviously, but I'd prefer to keep going as long as I'm fully nerdsniped :)

@thiswillbeyourgithub (Contributor, Author) commented

> Another thought - if you're rewriting the crawler using LMDB as the index, this might be a good time to rewrite it in Rust, if you feel confident doing that.

I cannot read a single line of Rust, unfortunately. And tbh, with my current (low) knowledge of mwmbl, I'm not sure that should be prioritized: Python is fast for prototyping (as I've shown), you benefit from a larger skill pool (I'm basically only good at Python), there seems to be low-hanging fruit left for crawler performance (fastbloom-rs, hopefully LMDB), and, importantly, the crawler bottleneck has more to do with network IO and the number of processes and threads than with raw computing power. My 2 cents (again, I'm a psychiatry resident, take it with a grain of salt) is that mwmbl's design will probably change in the future, and it's currently moving slowly enough that I'd bet enhancing the docs to attract users would be a far better bet. I'd gladly be proven wrong, of course, but then I can just watch the repo from afar :)

thiswillbeyourgithub marked this pull request as draft on Jun 3, 2025, 10:17
thiswillbeyourgithub changed the title from "enh crawler" to "enhancement playground" on Jun 3, 2025
thiswillbeyourgithub changed the title from "enhancement playground" to "enhancements playground" on Jun 3, 2025