I'm about to release (this week or the next, not much later) a search engine for the Fediverse (including Mastodon, Pleroma, PeerTube, etc.)
If you want to know more about this project, feel free to subscribe to my account here.
I will publish most of the source code under the GPL license, and translation help will be appreciated! See you then!
@archeodon how does it work?
Is it possible to opt out (or opt in)?
@Lapineige It works by indexing the public timelines of a bunch (~50) of Mastodon instances.
Also, it respects the setting "Opt-out of search engine indexing" in your mastodon preferences
(don't know if it works for Pleroma accounts though)
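Concretely, when an account enables "Opt-out of search engine indexing", Mastodon renders a `<meta name="robots" content="noindex">` tag on the profile page. A minimal sketch of how an indexer might honor that preference (this is my illustration, not the project's actual code; `profile_opts_out` is a hypothetical name):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Scans an HTML page for a <meta name="robots"> tag containing "noindex"."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if (a.get("name") or "").lower() == "robots" and \
               "noindex" in (a.get("content") or "").lower():
                self.noindex = True

def profile_opts_out(profile_html: str) -> bool:
    """True if the profile page asks search engines not to index it."""
    p = RobotsMetaParser()
    p.feed(profile_html)
    return p.noindex
```

An indexer would fetch the account's profile page, run it through this check, and skip (or unlist) the account's statuses when it returns `True`.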
@Lapineige if someone changes their "opt out of search engine indexing" account setting, the change is applied in the search engine within 30 days. I'll set up a contact form for faster removal requests if needed.
@archeodon why a contact form?
If we change the setting, shouldn't it be possible to have a button or something on the website to "refresh" the status of a particular account?
@Lapineige yes, that's a good idea, thanks for it :D will do that (added to the TODO...)
@archeodon what about PeerTube and Pixelfed? (or any other indexed platform?)
If we remove a toot/video/etc., will it be removed (instantly?)
@Lapineige I don't know how PeerTube & Pixelfed handle the delete part, but I get the streams from the instances, so I receive the "delete" events and mark those statuses as deleted.
They will be removed from Solr a few minutes later.
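The mark-then-remove flow described above could be sketched like this (my assumption of the design, not the project's published code; Mastodon's streaming API does send `delete` events whose payload is the status id, but `process_event`, `sweep`, and the in-memory `INDEX` standing in for Solr are hypothetical):

```python
DELETED = set()   # status ids flagged for removal from the index
INDEX = {}        # status id -> document; stands in for Solr here

def process_event(event: str, payload: str) -> None:
    """Handle one event from a Mastodon streaming connection.

    The streaming API emits `event: delete` with the status id as the
    payload; we only flag the document here and purge it on the next sweep.
    """
    if event == "delete":
        DELETED.add(payload)

def sweep() -> None:
    """Periodic job: drop flagged documents (a Solr delete+commit in the real thing)."""
    for status_id in DELETED:
        INDEX.pop(status_id, None)
    DELETED.clear()
```

Batching the actual removal into a periodic sweep is what makes "a few minutes later" the expected latency, rather than an instant delete per event.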
@archeodon and if a platform doesn't support it, how can we exclude our content from the search engine?
Wouldn't it be better to ask the developers of each platform to add such a "search engine indexing opt-in/out" option before indexing their content?
@Lapineige I thought about it, and at the moment I use "robots.txt" for that at least (+ the "noindex" meta tag in profile pages for Mastodon). So if a domain doesn't want to be indexed, I will obey (as Google or Bing do).
On the other hand, I will index whatever "other search engines" would be allowed to index under the norms of the web in that regard.
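Honoring robots.txt the way other search engines do is straightforward with Python's standard library; here is a self-contained sketch (the rules are parsed from an inline string so it runs without network access — a real crawler would fetch `https://<instance>/robots.txt` instead, and the "fedi-indexer" user agent is a made-up name):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def allowed(url: str, agent: str = "fedi-indexer") -> bool:
    """True if the parsed robots.txt lets this user agent fetch the URL."""
    return rp.can_fetch(agent, url)
```

With these rules, `allowed("https://example.social/private/x")` is `False` while a public profile URL is allowed, so a domain can exclude itself without any cooperation from the search engine operator.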
@archeodon but that means users have no control if their admin doesn't decide to do it (which is likely to happen).
@Lapineige well, in that case they would *already* be indexed by Google, no? ... I would fear Google a lot more than my own pet project
@archeodon as far as I tested, it's not that easy to find some Fediverse content in Google…
I don't really think of Google (& co) as the threat here (it's public content; if any surveillance tool wanted to index it, it could), but rather of humans willing, for instance, to harass people.
Anyway, in both cases, it's good to add this opt-in/out (at the user level) in those projects, isn't it? 🙂
@Lapineige I have 6000 instances visible from those 50 instances (I record the *federated* stream of those 50 instances). So I guess I have most of the public content.
And of course, I plan to make it better in the weeks to come :) The release of next week will clearly be an Alpha version ;)
@archeodon so only public content, excluding non listed ?
@Lapineige regarding "non listed", I'm not really sure, to be honest. I don't exactly know how all this works on Mastodon's side
@archeodon I'm replying as unlisted right now: this content is not shown in the public timelines. There are various reasons to do so (avoiding spamming with replies within a thread, making those toots less visible (and harder to discover if you don't follow the author), …)
@Lapineige I will look at those toots (yours and mine) to see if they get indexed or not :D
@archeodon > most source code?
@devnull yeah, there are small parts of the indexer I'm not sure I will publish at once:
1. because it's damn ugly
2. because it may trigger unexpected reactions from people managing some instances (although I respect strict rate limits and robots noindex rules...)
@archeodon So, basically, you're expecting people to blindly trust unknown code (that might trigger some people)? 🤔
@devnull I don't know what you mean by "blindly trust" or "trigger some people"
@archeodon "Blindly trust" -> trust unpublished source code.
"Trigger some people" -> I was referring to your own statement "because it may trigger unexpected reactions from people managing some instances". You know better than me what it means.
oh, this one :)
I worked for a search engine that indexed websites such as Craigslist in the past,
and the people who manage such websites have mixed feelings about indexers...
(even when we respect robots.txt)
I'd rather have them use a standard protocol (robots.txt or meta tags) than try to block my bots.
@archeodon I don't know robots.txt perfectly: is there a way to precisely define crawl rates, etc.? If not, you could define one in an optional file and respect it, to be fully transparent; your search engine would be more easily accepted then, I think.
But full source code is always better; if some choose to block your crawler, that's their right :) Open source code or not, transparency is the key IMHO @devnull
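On the crawl-rate question: robots.txt does have a widely honored (though non-standard) `Crawl-delay` directive, and Python's parser has exposed it since 3.6, so a polite indexer can read the rate an admin asks for. A small sketch, again parsed from an inline string rather than a live fetch:

```python
from urllib.robotparser import RobotFileParser

# Crawl-delay is a de-facto extension, not part of the original robots.txt
# spec, but many crawlers (and urllib.robotparser) understand it.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
])

# Seconds to wait between requests for this user agent, or None if unset.
delay = rp.crawl_delay("fedi-indexer")
```

The crawler would then sleep `delay` seconds between requests to that instance, falling back to its own conservative default when the directive is absent.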
as for "trust unpublished source code", I'm only talking about some bots for now;
the search part will be published, for sure, as will some of the indexing bots.
But you can't trust any code running on a remote server anyway, right?
@archeodon > But you can't trust any code running on a remote server anyway, right?
Of course not, who knows what code the remote server actually runs.