
add docker compose for new crawler#241

Merged
daoudclarke merged 2 commits intomwmbl:mainfrom
thiswillbeyourgithub:add_docker_compose_for_new_crawler
May 30, 2025

Conversation

@thiswillbeyourgithub
Contributor

@thiswillbeyourgithub thiswillbeyourgithub commented May 30, 2025

This follows the issue thread #45

After using it for a few hours, there are a few things I noticed:

  1. it seems the run alternates between a large CPU spike when a crawl burst happens and a quiet period while, I think, the results are uploaded.
    1.1. I don't think that's good for the hardware, especially consumer-level hardware.
    1.2. Couldn't we use async to crawl, pull and push in parallel? Edit: reading the code, it seems to be async already.
    1.3. I fear that on consumer hardware this would cause an annoying alternation between full fan and quiet.
  2. I don't think any crawler should be shipped without an option to set a rate limit. The only tuning parameter right now is apparently the number of workers. **Edit: addressed in enhancements playground #242 **
  3. I think you should rename a bunch of things to crawlerv1 or crawlerv2, because it's very confusing right now: there is a GitHub repo, a Docker image, a "command line crawler", etc. The Dockerfile.crawler is not even in the mwmbl-crawler repo :). Apparently something is being ported to Rust but isn't there yet, so it's still impossible to know which is which. I also think the text related to crawlerv1 should mention that v2 is in the works: you're not particularly proud of how v1 works, so you don't want newcomers to think you haven't noticed its shortcomings.
  4. I saw some time ago that you have a webpage somewhere with charts of the current crawlers; I think it would be nice for the new crawler to print a link to that page in the log at startup, because I have never been able to find that page again!
  5. The build would be way faster using uv instead of pip, no? **Edit: addressed in enhancements playground #242 **
  6. I do think crawlerv2 deserves a readme of its own. I don't like running code without knowing how it's supposed to work. Is it usually CPU bound? Bandwidth bound? Are there hardware requirements? Are there situations where you don't want contributions (e.g. unreliable consumer hardware)? Does it crawl sites that are sometimes restricted (porn, political, whatever)? Do you mind if someone runs the crawler intermittently (only at night)? **Edit: added a readme in enhancements playground #242 **
  7. If I'm not mistaken, there is a single requirements.txt file shared by the whole repo. Wouldn't it be better to split it depending on whether the code is run directly, from Docker, or as just the engine without the crawler?
  8. I didn't see any versioning in the crawler code. It might be good to declare a VERSION var somewhere and upload it as part of the results, as otherwise you might have issues with later versions in mwmbl, no? Edit: addressed in enhancements playground #242 and so far it works on my end
  9. Does mwmbl have any plan for a reputation system to ban crawlers that return wrong results? I know it's not crypto based, so there's hardly an incentive to do that, but if any competing search engine can poison the whole index in just a few weeks, that seems bad.
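On point 2, a rate limit could sit on top of the existing worker count with something as small as an asyncio pacer; this is a rough sketch with illustrative names, not the actual mwmbl code:

```python
import asyncio
import time

class RateLimiter:
    """Simple pacer: allows at most `rate` acquisitions per second, evenly spaced."""
    def __init__(self, rate: float):
        self.interval = 1.0 / rate
        self.next_time = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self) -> None:
        # Serialize acquisitions and delay each one until its scheduled slot.
        async with self.lock:
            now = time.monotonic()
            if now < self.next_time:
                await asyncio.sleep(self.next_time - now)
            self.next_time = max(now, self.next_time) + self.interval

async def crawl_url(limiter: RateLimiter, url: str) -> str:
    await limiter.acquire()
    # a real crawler would fetch and parse the page here
    return url

async def main() -> list:
    limiter = RateLimiter(rate=50)  # 50 requests/second shared by all workers
    urls = [f"https://example.com/{i}" for i in range(10)]
    return await asyncio.gather(*(crawl_url(limiter, u) for u in urls))

results = asyncio.run(main())
```

A shared limiter like this also smooths out the crawl bursts mentioned in point 1, since requests get spaced evenly instead of fired all at once.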

That's it for me. What do you think about all this?

Also, could you clearly state where the code for that new crawler resides? I might have a go at improving it a bit, especially the async part, depending on how the code is written. Edit: well, actually it's just ./mwmbl/crawl I think; really confusing IMO :)
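On point 8, the versioning idea could be as small as a module-level constant included in every upload; a hypothetical sketch, not the actual payload format:

```python
# Hypothetical: tag every uploaded batch with the crawler version so the
# server can reject or migrate results from outdated clients.
CRAWLER_VERSION = "0.1.0"  # assumed name; bump on incompatible changes

def build_batch(results: list, user_id: str) -> dict:
    """Wrap crawl results in an upload payload carrying provenance info."""
    return {
        "version": CRAWLER_VERSION,
        "user_id": user_id,
        "items": results,
    }

batch = build_batch([{"url": "https://example.com", "title": "Example"}],
                    user_id="abc123")
```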

@thiswillbeyourgithub thiswillbeyourgithub force-pushed the add_docker_compose_for_new_crawler branch 3 times, most recently from f4e119f to 467f217 Compare May 30, 2025 11:42
@thiswillbeyourgithub thiswillbeyourgithub force-pushed the add_docker_compose_for_new_crawler branch from 467f217 to 656b852 Compare May 30, 2025 11:43
@daoudclarke
Contributor

Hey, thanks so much for this!

  1. This shouldn't really happen. By default we have ten threads that crawl the web in parallel, plus another that syncs the crawled data with the main server. So I'm not sure what is causing the spikes you are seeing.
  2. Anyway, I need to make the number of threads configurable.
  3. The old repo will be archived shortly and this will be the main crawler. The rust crawler doesn't exist yet (we have the ranking code in rust so far, but that is it).
  4. Adding this is high priority
  5. Absolutely! Thanks
  6. Thanks!
  7. If we're using uv do we still need requirements.txt?
  8. Yes, thanks!
  9. Yes, but I don't know what this looks like yet. At the moment, the reputation system is you have to talk to me to get an API key, and I will only give you one if I don't think you're a spammer...
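For reference, swapping pip for uv in a Docker build can be as small as this; the image tags, paths and entrypoint below are illustrative, not mwmbl's actual Dockerfile:

```dockerfile
FROM python:3.11-slim
# uv installs dependencies much faster than pip; copy the static binary
# from the official astral-sh image
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
WORKDIR /app
COPY requirements.txt .
RUN uv pip install --system -r requirements.txt
COPY . .
CMD ["python", "-m", "mwmbl.crawl"]
```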

@daoudclarke daoudclarke merged commit a1b84ab into mwmbl:main May 30, 2025
1 check passed
@thiswillbeyourgithub
Contributor Author

  1. I think you meant processes, not threads
  2. I think I did in the other PR
  3. Yeah. Personally I prefer using setup.py instead, and this would allow doing things like "install .[dockerversion]" iirc
  4. Right. Well, that's an entire area of research of its own, so I have to say I'm kind of surprised you're coding the new crawler and archiving the old one without having your mind set on this. To me that's the major issue to tackle when thinking about distributed systems, and also why "blockchains" are being tried left and right for this (because of incentives via proof of work, etc.).
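For what it's worth, extras like `install .[dockerversion]` work with pyproject.toml too, via `[project.optional-dependencies]`; the package names below are purely illustrative:

```toml
[project]
name = "mwmbl"
version = "0.1.0"
dependencies = [
    "requests",
]

[project.optional-dependencies]
crawler = ["beautifulsoup4"]
docker = ["gunicorn"]
```

Installable with e.g. `pip install ".[crawler]"` or `uv pip install ".[crawler]"`.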

@daoudclarke
Contributor

daoudclarke commented May 30, 2025

  1. I did - you are 100% correct
  2. Actually the number of processes is already configurable via the CRAWL_WORKERS env var
  3. Is this 7? I think the markdown messed up your numbers. In that case, I think setup.py is obsolete now that pyproject.toml exists.
  4. Is this 9? In that case you're right, this could do with attention. But right now things are falling over and I need to stop them falling over. Trust wasn't such an issue with the old crawler, but it is with the new one. And I can't keep running the old one because the server can't cope. I don't think blockchain is necessary; we just need to remember who put what in the index, and then have moderation processes for identifying dodgy stuff and blocking the crawlers that submit it.
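The "remember who put what in the index" idea could start as a simple provenance map; a minimal sketch with hypothetical names, not a proposal for the actual server code:

```python
from collections import defaultdict

class ProvenanceIndex:
    """Track which crawler contributed each URL, so a misbehaving crawler's
    contributions can be found and removed later. All names hypothetical."""
    def __init__(self):
        self.by_url = {}                    # url -> crawler_id
        self.by_crawler = defaultdict(set)  # crawler_id -> set of urls
        self.blocked = set()

    def add(self, url: str, crawler_id: str) -> bool:
        """Record a contribution; refuse submissions from blocked crawlers."""
        if crawler_id in self.blocked:
            return False
        self.by_url[url] = crawler_id
        self.by_crawler[crawler_id].add(url)
        return True

    def block(self, crawler_id: str) -> set:
        """Block a crawler and return its URLs for removal from the index."""
        self.blocked.add(crawler_id)
        urls = self.by_crawler.pop(crawler_id, set())
        for url in urls:
            self.by_url.pop(url, None)
        return urls

index = ProvenanceIndex()
index.add("https://example.com/a", "crawler-1")
index.add("https://example.com/b", "crawler-2")
removed = index.block("crawler-2")
```

A moderation workflow then only needs a way to flag dodgy results back to the contributing crawler ID.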

