Here is the first article of a series on how to build a search engine from scratch, in Rust.
Feel free to give me some feedback!
https://jdrouet.github.io/posts/202503161800-search-engine-intro/
Following the introduction, here is part 1 of my series of articles on how to build a cross-platform search engine from scratch, in #rustlang.
This part covers how we'll store the encrypted data on any platform.
Enjoy reading it, and feel free to provide some feedback, here or directly on GitHub:
https://jdrouet.github.io/posts/202503170800-search-engine-part-1/
If you enjoy it, feel free to share it on other platforms!
@jdrouet just wanted to say you started a great series. Looking forward to upcoming articles, thank you!
@jdrouet good article, looking forward to more!
I've got a question regarding `Directory::files`: when iterating over the files, you're `await`ing them one at a time – doesn't that kind of defeat the whole purpose of async? Could one instead spawn a task for each file, and then `join` all the handles?
I think an even nicer implementation could be some kind of `directory.files().filter(Path::is_file).collect()`, but that would require `AsyncIterator`, which AFAIK Rust doesn't currently have
@pmmeurcatpics thanks for your feedback!
Spawning a task for each file would require pulling in a runtime, which I decided not to do. Here we only use `futures_lite`, which doesn't provide a `join` macro.
Taking a step back, this function will almost never be called, so even if we optimised it, the gain wouldn't be noticeable.
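For readers following along, here is a rough sketch of the "spawn a task per file, then join all the handles" pattern the question describes. To keep it runnable without any async runtime, this hypothetical example uses OS threads from `std` in place of async tasks (with tokio the shape would be similar, using `tokio::spawn` plus joining the handles); the `read_all` function and its fake file contents are made up for illustration.

```rust
use std::thread;

// Hypothetical sketch: process every file concurrently by spawning one
// worker per file, then joining all the handles in order. OS threads here
// stand in for async tasks so the example needs no runtime.
fn read_all(files: Vec<String>) -> Vec<String> {
    let handles: Vec<_> = files
        .into_iter()
        .map(|name| {
            thread::spawn(move || {
                // Stand-in for actually reading the file asynchronously.
                format!("contents of {name}")
            })
        })
        .collect();
    // Joining in spawn order preserves the original file order.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

The trade-off discussed above still applies: this buys concurrency at the cost of a runtime (or threads), which the series deliberately avoids for a rarely-used function.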
@jdrouet You can get some pretty big performance improvements by intersecting the binary indices on the fly.
Depending on how they are laid out, you can intersect any number of postings lists in linear to sublinear time, with zero memory overhead. This scales much better than intersecting hash tables.
"Search Engines: Information Retrieval in Practice" has a section discussing the technique in chapter 5.4.7.
@jdrouet This article discusses the technique in more detail with regards to skip lists, though it does (as noted in SeIRP) work with any sorted list.
@marginalia really interesting! I'll have a look at it. Maybe not for the next article (although its topic is optimisation). Thanks!
@jdrouet It's also entirely possible the juice isn't worth the squeeze for these kinds of optimizations at the scale you're targeting. I figured I'd share it nonetheless, as it's genuinely a very cool optimization that's pretty intuitive.
@marginalia yeah, right now the bottleneck is not here, but more at the encryption/decryption level...