Conversation
Signed-off-by: thiswillbeyourgithub <[email protected]>
…ADME with workflow details
ac592d3 to bc46c41
daoudclarke left a comment
In general this looks good. Have you tested it locally? When I have time I will try and test it myself.
Using uv is a no-brainer, so thanks for that.
I have built it and run it locally, yes, but I haven't run any specific test suite, if that's your question.
A few more changes for the crawler:
And something a bit ambitious for #249:
That last point seems to mostly work; I am running it right now.

Btw, if I'm not mistaken there is no way for the main server to send messages to the crawler. Wouldn't that be useful? It might be nice in the future for compatibility, like "Please update your crawler", "you are sending invalid data", "you're in the top 50 contributors today", "downtime planned tomorrow", etc. It seems that if the client is broken and sends rubbish, there is no way for the master server to tell the crawler about the issue. Should I open an issue to track this?

Also there's something I was not sure about: how does the local index grow in the crawler? Is its size bounded? Because there is a hard size limit to set when creating the LMDB (it can be changed on the fly, but the default is very low) and I didn't know what to set. I set it at 2G for now.
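For context, this is roughly how that limit is set with py-lmdb; the path is made up and 2 GiB is just the value I picked, not a recommendation:

```python
import lmdb

# map_size is an upper bound on how large the memory map may grow; on 64-bit
# systems it reserves address space rather than allocating disk up front.
env = lmdb.open("crawl-index.lmdb", map_size=2 * 1024**3)  # hypothetical path

# The bound can also be raised later without recreating the database.
env.set_mapsize(4 * 1024**3)
```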
Signed-off-by: thiswillbeyourgithub <[email protected]>
…-based architecture
282b5c6 to d16b53b
because loguru takes care of the tracing
Signed-off-by: thiswillbeyourgithub <[email protected]>
I've been running the crawler for a couple of hours now and it seems to be running very smoothly, apart from 504 errors from mwmbl.org, which seems to be struggling; I don't think that has anything to do with me. The crawl-index.tinysearch file weighs 10 MB.
This is looking good. I'm concerned about the LMDB change - I think it's potentially a good idea but it will break the existing index, which is used on the central server. I would ideally like to see benchmarks comparing it to our existing index - ideally it would be faster, but even if it's a little slower that would be fine. You could just generate e.g. 1000 random queries and run them against the same index translated to LMDB.

Eventually we need to change the database for the central server, and LMDB may be a good fit, but we will need a way to migrate the existing index. Also, the plan for the central server is that it ends up as more of a cache on S3-like storage. That way we can make the index as big as we like without being constrained by the disk of the central server. LMDB may still work for the cache though.

However, LMDB looks like a good idea for the crawler. Perhaps you can create a separate PR where we keep the TinyIndexer as it is, to be used on the central server, and create an LMDBIndexer to be used in the crawler. That will be a good way to start using it properly and iron out any issues.
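For concreteness, a rough sketch of that benchmark; tiny_lookup and lmdb_lookup are hypothetical wrappers around the two index implementations, and the query generator is illustrative only (sampling real query logs would be more representative):

```python
import random
import string
import time

def random_query(rng: random.Random) -> str:
    # Illustrative query generator: short lowercase terms.
    return "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 8)))

def benchmark(lookup, queries) -> float:
    # Time how long it takes to run every query through one index backend.
    start = time.perf_counter()
    for q in queries:
        lookup(q)
    return time.perf_counter() - start

rng = random.Random(42)
queries = [random_query(rng) for _ in range(1000)]

# tiny_lookup / lmdb_lookup are hypothetical stand-ins for the existing
# index and the LMDB-backed one; plug in the real retrieval calls here.
# print(benchmark(tiny_lookup, queries), benchmark(lmdb_lookup, queries))
```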
Forgot to say, the size of the current index is fixed, based on the number of pages and the size of each page.
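To make that concrete with made-up numbers (the real page count and page size come from the index configuration, not from here):

```python
# Hypothetical values: the index file size is fixed at creation time as
# num_pages * page_size, and does not grow afterwards.
NUM_PAGES = 25_600  # assumption, not the project default
PAGE_SIZE = 4_096   # assumption, bytes per page
index_size = NUM_PAGES * PAGE_SIZE  # 104_857_600 bytes = 100 MiB
```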
Sending messages to the crawler sounds good. Please try to use separate PRs for each feature; that way, the no-brainers can get merged quickly without having to wait for experimental stuff like new index formats 😉
Signed-off-by: thiswillbeyourgithub <[email protected]>
Another thought - if you're rewriting the crawler using LMDB as the index, this might be a good time to rewrite it in Rust, if you feel confident doing that.
Btw, here is the most important Python doc for LMDB. They state:

So I can set the size limit way higher. I'd appreciate it if you could give me a ballpark estimate of the size of the local index for the crawler; I don't know how much it grows or what exactly triggers an increase in size. In any case I used lmdb-dict-full for simplicity and because it takes care of the zstd compression + caching IIRC, but it might be better to rely only on py-lmdb? Not sure if that's an issue. The thread-safety aspect is intrinsic to LMDB apparently.
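If we dropped lmdb-dict-full in favour of plain py-lmdb, the zstd part it handles would need to be reimplemented by hand. A minimal sketch of what that might look like, assuming string keys and bytes values (the path is hypothetical, and zstandard is the compression library):

```python
import lmdb
import zstandard

env = lmdb.open("crawl-index.lmdb", map_size=2 * 1024**3)  # hypothetical path
compressor = zstandard.ZstdCompressor()
decompressor = zstandard.ZstdDecompressor()

def put(key: str, value: bytes) -> None:
    # Compress the value with zstd before storing it in LMDB.
    with env.begin(write=True) as txn:
        txn.put(key.encode(), compressor.compress(value))

def get(key: str) -> bytes | None:
    # Fetch and decompress, returning None for missing keys.
    with env.begin() as txn:
        blob = txn.get(key.encode())
        return decompressor.decompress(blob) if blob is not None else None
```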
Yeah, for sure it can totally be a problem. But I don't think I can easily run the central server to troubleshoot, so I could only rely on the tests. And I had to modify the tests somewhat blindly, so that's somewhere intense scrutiny would be good.

Otherwise I'm not so concerned about the migration of the main server: the migration is "just" reading all key/values from one db and putting them into the other one, no? The downtime shouldn't be too big. I'm sure responsible professionals run both DBs side by side for a month, monitor any deviation of the new backend from the reference one, and make the switch once everything has settled.

I don't know how to do this, so I'm asking you instead: it would be helpful if you could write some tests for the crawler that mimic the central server, so that I can see in advance whether a change on the crawler side breaks the architecture. I don't feel up to that myself.
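To illustrate what I mean by "just" copying; old_index and its iterate_items() are hypothetical stand-ins for the existing index and a way to walk all of its key/value pairs, not an actual mwmbl API:

```python
import lmdb

def migrate(old_index, lmdb_path: str) -> None:
    # Walk every key/value pair of the old index and write it to LMDB.
    env = lmdb.open(lmdb_path, map_size=2 * 1024**3)  # size is an assumption
    with env.begin(write=True) as txn:
        for key, value in old_index.iterate_items():  # hypothetical helper
            txn.put(key, value)
    env.sync()  # flush to disk before cutting over
```

For a real index you would probably commit in batches rather than in one giant transaction, but the shape of the loop is the same.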
Would you mind telling me exactly which API you are referring to? Ideally, give me the Python code you have in mind for the original index and I'll manage to port it to LMDB, or write pseudocode with the exact method names and parameters you have in mind.
Btw, if you planned on using MinIO for FOSS S3, there might be trouble ahead.
Yeah, good call, I'll try to do that. Btw, I was confused at some point between tinysearchengine/indexer and indexer/*.py; would you mind explaining the difference? In particular, it's not easy to figure out which parts of the code are crawler-only and which parts have an impact on the master server, for example.
Are you talking about the local index of the crawler or the index of the master server? And I thought the size of each page was fixed but new pages could be added, so that would not be "fixed". What am I misunderstanding here?
I'm not comfortable making changes on the central server side. I think I would be if you wrote an ARCHITECTURE.md file that recaps the relationship between subdirs and the role of each file. Quite a lot of the backend is still very obscure to me. And the technical FAQ would help here.
Yeah, as you can guess that was the plan, but some things are hard to hold back on: for example, loguru is, I think, superior on all points to the standard logging module, but switching would be a large change. Same for using LMDB: I didn't know in advance whether changing the local crawler index would have side effects elsewhere. Also, I would find it awesome if you could add a commit hook to run black on the Python files, but that might be my taste only. So yeah, I don't intend this PR to be merged and it became more of a playground. Here's a proposal:
What do you think of this? I'm fine with any other plan of your choosing. If you're swamped or something, that's fine obviously, but I'd prefer to keep going as long as I'm fully nerdsniped :)
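On the black hook idea from the previous comment: one dependency-free way to wire it up is a plain git hook. This is just a sketch of one possible setup, saved as .git/hooks/pre-commit and made executable, not necessarily how the project would want it:

```python
#!/usr/bin/env python3
# Hypothetical pre-commit hook: refuse the commit if staged .py files
# are not black-formatted.
import subprocess
import sys

staged = subprocess.run(
    ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
    capture_output=True, text=True, check=True,
).stdout.split()
py_files = [f for f in staged if f.endswith(".py")]
if py_files:
    sys.exit(subprocess.run(["black", "--check", *py_files]).returncode)
```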
I cannot read a single line of Rust, unfortunately. And tbh, with my current (low) knowledge of mwmbl I'm not sure that should be prioritized: Python is fast for prototyping stuff (as I've shown), you benefit from a larger skill pool (I'm basically good only at Python), there seem to be low-hanging fruits left for the crawler performance-wise (fastbloom-rs, hopefully LMDB), and importantly, the crawler bottleneck has more to do with network IO and the number of processes and threads than with actual raw computing power. My 2 cents (again, I'm a psychiatry resident, take them with a pool of salt) is that mwmbl's design will probably change in the future, and it's currently moving slowly enough that I'd bet enhancing the doc to attract users would be a faaaar better bet. I would gladly accept to be proven wrong of course, but then I can just watch the repo from afar :)
Signed-off-by: thiswillbeyourgithub <[email protected]>
Hi,
In the process of teaching myself how the crawler works, I tend to write missing docs using LLMs, as checking them is far faster than understanding things from raw code.
Hence I'm making this PR as a suggestion.
I think it goes in the right direction, but there are probably inaccuracies in the doc, and it would be nice if you could give me your thoughts on it.
If "add docker compose for new crawler" (#241) is merged, I can add a comment in the docker compose pointing to the new README to help figure out the env var settings.
If there are env var settings you do not want to let users modify, let me know.
For the others, there are probably ranges of acceptable values, so it might be good to add asserts to enforce them (see the sketch below) and mention them in the README too.
I can do all that if you tell me to.
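For illustration, the asserts could look something like this; the variable names, defaults, and ranges are invented, not the crawler's actual settings:

```python
import os

# Hypothetical settings: names, defaults, and ranges are assumptions.
NUM_THREADS = int(os.environ.get("CRAWLER_NUM_THREADS", "4"))
BATCH_SIZE = int(os.environ.get("CRAWLER_BATCH_SIZE", "100"))

assert 1 <= NUM_THREADS <= 64, "CRAWLER_NUM_THREADS must be in [1, 64]"
assert 1 <= BATCH_SIZE <= 10_000, "CRAWLER_BATCH_SIZE must be in [1, 10000]"
```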
Edit: I went on and added a bunch more things. Feedback welcome. I have been running it for the past hour with no issue.