
add docker compose for new crawler#241

Merged
daoudclarke merged 2 commits intomwmbl:mainfrom
thiswillbeyourgithub:add_docker_compose_for_new_crawler
May 30, 2025

Conversation

@thiswillbeyourgithub
Contributor

@thiswillbeyourgithub thiswillbeyourgithub commented May 30, 2025

This follows the issue thread #45

After using it for a few hours, there are a few things I noticed:

  1. it seems the run alternates between a large CPU spike when a crawl burst happens and a quiet period while, I think, the results are uploaded.
    1.1. I don't think that's good for the hardware, especially consumer-level hardware.
    1.2. Couldn't we use async to crawl, pull and push in parallel? Edit: reading the code, it seems to be async already.
    1.3. I fear that on consumer hardware this would cause an annoying alternation between full fan and quiet.
  2. I don't think any crawler should be shipped without an option to set a rate limit. The only tuning parameter right now is apparently the number of workers. **Edit: addressed in enhancements playground #242 **
  3. I think you should rename a bunch of things to crawlerv1 or crawlerv2, because it's very confusing right now: there is a GitHub repo, a Docker image, a "command line crawler", etc. The Dockerfile.crawler is not even in the mwmbl-crawler repo :). Apparently something is being ported to Rust but isn't there yet, so it's still impossible to know which is which. I also think the text related to crawlerv1 should mention that v2 is in the works: you're not particularly proud of how v1 works, so you don't want newcomers to think you haven't noticed its shortcomings.
  4. I saw some time ago that you have a webpage somewhere with charts of the current crawlers; I think it would be nice for the new crawler to print a link to that page in the log at startup, because I have never been able to find that page again!
  5. The build would be way faster using uv instead of pip, no? **Edit: addressed in enhancements playground #242 **
  6. I do think crawlerv2 deserves a readme of its own. I don't like running code without knowing how it's supposed to work. Is it usually CPU bound? Bandwidth bound? Are there hardware requirements? Are there situations where you don't want contributions (e.g. unreliable consumer hardware)? Does it crawl sites that are sometimes restricted (porn, political, whatever)? Do you mind if someone runs the crawler intermittently (only at night)? **Edit: added a readme in enhancements playground #242 **
  7. If I'm not mistaken, there is a single requirements.txt file shared by the whole repo. Wouldn't it be better to split it depending on whether the code is run directly, from Docker, or as just the engine without the crawler?
  8. I didn't see any versioning in the crawler code. It might be good to declare a VERSION var somewhere and upload it as part of the results, as otherwise you might have issues with later versions in mwmbl, no? Edit: addressed in enhancements playground #242 and so far it works on my end
  9. Does mwmbl have any plan for a reputation system to ban crawlers that return wrong results? I know it's not crypto based, so there's hardly an incentive to do that, but if any competing search engine can poison the whole index in just a few weeks, that seems bad.
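On point 2, a rate limit could sit on top of the existing worker count with something as small as an asyncio pacer; this is a rough sketch with illustrative names, not the actual mwmbl code:

```python
import asyncio
import time

class RateLimiter:
    """Simple pacer: allows at most `rate` acquisitions per second, evenly spaced."""
    def __init__(self, rate: float):
        self.interval = 1.0 / rate
        self.next_time = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self) -> None:
        # Serialize acquisitions and delay each one until its scheduled slot.
        async with self.lock:
            now = time.monotonic()
            if now < self.next_time:
                await asyncio.sleep(self.next_time - now)
            self.next_time = max(now, self.next_time) + self.interval

async def crawl_url(limiter: RateLimiter, url: str) -> str:
    await limiter.acquire()
    # a real crawler would fetch and parse the page here
    return url

async def main() -> list:
    limiter = RateLimiter(rate=50)  # 50 requests/second shared by all workers
    urls = [f"https://example.com/{i}" for i in range(10)]
    return await asyncio.gather(*(crawl_url(limiter, u) for u in urls))

results = asyncio.run(main())
```

A shared limiter like this also smooths out the crawl bursts mentioned in point 1, since requests get spaced evenly instead of fired all at once.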

That's it for me. What do you think about all this?

Also, could you clearly state where the code for that new crawler resides? I might have a go at improving it a bit, especially the async part, depending on how the code is written. Edit: well, actually it's just ./mwmbl/crawl I think; really confusing IMO :)
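On point 8, the versioning idea could be as small as a module-level constant included in every upload; a hypothetical sketch, not the actual payload format:

```python
# Hypothetical: tag every uploaded batch with the crawler version so the
# server can reject or migrate results from outdated clients.
CRAWLER_VERSION = "0.1.0"  # assumed name; bump on incompatible changes

def build_batch(results: list, user_id: str) -> dict:
    """Wrap crawl results in an upload payload carrying provenance info."""
    return {
        "version": CRAWLER_VERSION,
        "user_id": user_id,
        "items": results,
    }

batch = build_batch([{"url": "https://example.com", "title": "Example"}],
                    user_id="abc123")
```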

@thiswillbeyourgithub thiswillbeyourgithub force-pushed the add_docker_compose_for_new_crawler branch 3 times, most recently from f4e119f to 467f217 Compare May 30, 2025 11:42
@thiswillbeyourgithub thiswillbeyourgithub force-pushed the add_docker_compose_for_new_crawler branch from 467f217 to 656b852 Compare May 30, 2025 11:43
@daoudclarke
Contributor

Hey, thanks so much for this!

  1. This shouldn't really happen. By default we have ten threads that crawl the web in parallel, plus another that syncs the crawled data with the main server. So I'm not sure what is causing the spikes you are seeing.
  2. Anyway, I need to make the number of threads configurable.
  3. The old repo will be archived shortly and this will be the main crawler. The rust crawler doesn't exist yet (we have the ranking code in rust so far, but that is it).
  4. Adding this is high priority
  5. Absolutely! Thanks
  6. Thanks!
  7. If we're using uv do we still need requirements.txt?
  8. Yes, thanks!
  9. Yes, but I don't know what this looks like yet. At the moment, the reputation system is you have to talk to me to get an API key, and I will only give you one if I don't think you're a spammer...
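For reference, swapping pip for uv in a Docker build can be as small as this; the image tags, paths and entrypoint below are illustrative, not mwmbl's actual Dockerfile:

```dockerfile
FROM python:3.11-slim
# uv installs dependencies much faster than pip; copy the static binary
# from the official astral-sh image
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
WORKDIR /app
COPY requirements.txt .
RUN uv pip install --system -r requirements.txt
COPY . .
CMD ["python", "-m", "mwmbl.crawl"]
```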

@daoudclarke daoudclarke merged commit a1b84ab into mwmbl:main May 30, 2025
1 check passed
@thiswillbeyourgithub
Contributor Author

  1. I think you meant processes, not threads
  2. I think I did in the other PR
  3. Yeah. Personally I prefer using setup.py instead, and this would allow doing things like "install .[dockerversion]" iirc
  4. Right. Well, that's an entire area of research of its own, so I have to say I'm kind of surprised you're coding the new crawler and archiving the old one without having your mind set on this. To me that's the major issue to tackle when thinking about distributed systems, and also why "blockchains" are being tried left and right for this (because of incentives via proof of work, etc.).
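For what it's worth, extras like `install .[dockerversion]` work with pyproject.toml too, via `[project.optional-dependencies]`; the package names below are purely illustrative:

```toml
[project]
name = "mwmbl"
version = "0.1.0"
dependencies = [
    "requests",
]

[project.optional-dependencies]
crawler = ["beautifulsoup4"]
docker = ["gunicorn"]
```

Installable with e.g. `pip install ".[crawler]"` or `uv pip install ".[crawler]"`.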

@daoudclarke
Contributor

daoudclarke commented May 30, 2025

  1. I did - you are 100% correct
  2. Actually the number of processes is already configurable via the CRAWL_WORKERS env var
  3. Is this 7? I think the markdown messed up your numbers. In that case, I think setup.py is obsolete now that pyproject.toml exists.
  4. Is this 9? In that case you're right, this could do with attention. But right now things are falling over and I need to stop them falling over. Trust wasn't such an issue with the old crawler, but it is with the new one. And I can't keep running the old one because the server can't cope. I don't think blockchain is necessary; we just need to remember who put what in the index, and then have moderation processes for identifying dodgy stuff and blocking the crawlers that submit it.
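The "remember who put what in the index" idea could start as a simple provenance map; a minimal sketch with hypothetical names, not a proposal for the actual server code:

```python
from collections import defaultdict

class ProvenanceIndex:
    """Track which crawler contributed each URL, so a misbehaving crawler's
    contributions can be found and removed later. All names hypothetical."""
    def __init__(self):
        self.by_url = {}                    # url -> crawler_id
        self.by_crawler = defaultdict(set)  # crawler_id -> set of urls
        self.blocked = set()

    def add(self, url: str, crawler_id: str) -> bool:
        """Record a contribution; refuse submissions from blocked crawlers."""
        if crawler_id in self.blocked:
            return False
        self.by_url[url] = crawler_id
        self.by_crawler[crawler_id].add(url)
        return True

    def block(self, crawler_id: str) -> set:
        """Block a crawler and return its URLs for removal from the index."""
        self.blocked.add(crawler_id)
        urls = self.by_crawler.pop(crawler_id, set())
        for url in urls:
            self.by_url.pop(url, None)
        return urls

index = ProvenanceIndex()
index.add("https://example.com/a", "crawler-1")
index.add("https://example.com/b", "crawler-2")
removed = index.block("crawler-2")
```

A moderation workflow then only needs a way to flag dodgy results back to the contributing crawler ID.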

