The Fediverse is Inefficient (but that's a good trade-off)

Danterious@lemmy.dbzer0.com · 3 days ago

The Fediverse is Inefficient (but that's a good trade-off)

hendrik@palaver.p3x.de · 3 days ago

Mmhm, I’m not sure if I’m entirely on the same page. Admins have complained. Users would like to run their own instances, but they can’t as the media cache is quite demanding and requires a bigger and costly virtual server. And we’re at the brink of DDoSing ourselves with the way ActivityPub syncs (popular) new posts throughout the network. We still have some room to grow, but it’s limited due to the protocol design choices. And it’s chatty as pointed out. Additionally we’ve already had legal concerns, due to media caching…

Up until now everything turned out mostly alright in the end. But I’m not sure if it’s good as is. We could just have been lucky. And we’re forced to implement some minimum standards of handling harassment, online law, copyright and illegal content. Just saying we’re amateurs doesn’t really help. And it shifts burden towards instance admins. Same for protocol inefficiencies.

I agree - however - with the general promise. We’re not a big company. And that’s a good thing. We’re not doing business and not doing economy of scale here. And it’s our garden which we foster and have fun at.

simple@lemm.ee · 3 days ago

The post is from 2 years ago but it does pose an important question which few people really talk about. The Fediverse isn’t scaling.

Any distributed system is inefficient, for one, because it lacks the economy of scale.

Sure, it’s probably worth the tradeoff, but what happens when we actually get so many people that servers start to collapse? Lemmy has ~45k active users, but let’s say we jump to 1 million active users. Small servers will stop working due to too much traffic, medium servers will need way more money to process the thousands of images per day, large servers will become too centralized. We’re already slowly going that way with the instance count steadily going down and users/instance going up.

None of this matters now but within the next 5, 10 years I think we really need a game plan in order for these platforms to succeed. You can’t just increase the servers to spread the load, the load on all instances is steadily going up.

tehn00bi@lemmy.world · 3 days ago

Will we need to move to a subscription service?

originalucifer@moist.catsweat.com · 3 days ago

someone has to pay for infrastructure. i think if we can solve for intra-instance load balancing, community-funded instances and maybe centralized content hosting services instances could subscribe to it might get us to the next level. not sure after that.

kopper [they/them]@lemmy.blahaj.zone · edit-2 3 days ago

Eh, I’d make the argument the fediverse is overly inefficient, way more than it has to be. (But that doesn’t seem to be the actual point of the post, instead rehashing the same “distribution = good” thing without bringing anything new to the table)

Here are just a few things that could be fixed without needing to centralize fedi:

A vast majority of instance software will store all old remote non-media data (that could easily be re-fetched when needed) permanently, even if nobody has seen it in years.
If you’re lucky enough to be on instance software that backfills replies (GoToSocial, the Iceshrimp rewrite as of a few days ago, Mastodon in an extremely limited capacity), it will be done slowly and recursively, when much better alternatives that don’t need to deal with easy-to-get-wrong recursion handing are possible. (There is work going on to improve this, but it may take a while for it to land on enough instance software to make a difference)
The obvious thing everyone harps on: Abysmal media caching defaults.
No batching of activities. And relatedly, all sent activities are individually re-signed for each instance on each delivery (to be fair, handling this in a privacy preserving way is hard)
No batching of fetches.
RSA, just to make the above signature situation even worse
Mastodon. Just in general. It’s by far the most heavyweight fedi software I know of, running on a synchronous and poorly threaded tech stack that’s is not very adequate to the fairly IO bound (when not using authorized fetch) and very concurrent AP use case. Running Mastodon for any instance less than ~500 active users is extremely overkill and you’d likely be suited with other, lighter, instance software if you don’t need any of the features that are Mastodon exclusive (which there aren’t that much of).
Pleroma database rot, an exemplar of why the C2S advocates’ model of “store the raw JSON for everything” is a terrible idea (thankfully the C2S model hasn’t taken off enough to be important)

schizo@forum.uncomfortable.business · 3 days ago

A vast majority of instance software will store all old remote non-media data (that could easily be re-fetched when needed) permanently, even if nobody has seen it in years.

Seriously, this is the most befuddling design decision. There’s no reason to cache that data more than like, maybe a week.

Maybe it’s because I’m a sysadmin background type and not a programmer, but the endless obsession that fedi-software has with caching everything at every stop along the route from the poster to the person reading the post is just the most weird thing to me.

kopper [they/them]@lemmy.blahaj.zone · edit-2 3 days ago

A lot of it boils down to most fedi software not being “native” and only having federation designed more-or-less as an afterthought addition on top of a traditional centralized-ish system (even for ones that have federation from the get-go). Meaning you make assumptions like “it’s fine if I deletes the replies of a post if the post gets deleted”.

This, combined with how much data you can’t re-load and have to track as it comes in (e.g. nobody implements the necessary collections to backfill who liked or boosted what from it’s source, so you have to track that implicitly through Like and Announce activities), makes it extremely infeasible to implement while keeping the same user experience. Hell, even reply collections needed to backfill missing replies are a rarity (though a lot more common than the others given Mastodon implements them).

Additionally, people want the same user experience they’re used to in centralized systems, like search actually searching through everyone, globally. This is something I believe AP simply isn’t “intended” for. ATProto, for example, is much better in this specific regard (but comes at it’s own hefty costs, as an implementor).

I don’t blame the implementors for doing things this way. IMO it’s better to partially implement something like AP as an extension, as opposed to diving in head first into being AP-native. The standards are extremely vague and incomplete once you start looking below the shallow surface, and this way at least if a better protocol comes by migration (or multi-protocol federation) won’t be too difficult compared to if your source of truth was the same AS2 data you federated, the way AP intended you to.

3 days ago

Most of the fediverse is just text and metadata which is relatively small and easy to handle. The issue arises when we start hosting images/videos ourselves which consume far more bandwidth and storage. We are then distributing all the media across to all instances multiplying the requirements.

What we need to do is implement a bittorrent proxy as our cache server. Activpub distributes links that can be accessed directly getting served by a bittorrent proxy or torrented by the client itself. This means u dont need ur instance to have a copy of the media as long as enough other people have it ur good. If the client itself is doing the torrenting then popular posts get more people serving it thus shifting the load to many client and lowering load in each individual instance.

The other solution is to implement fediverse gold. Let it be a monero based donation with a reward split of 50% to instance 50% to content itself ie community, post, comment itself, etc. If u make it so gold ranks things heigher it aligns the incentives quality content generation and quality instance/community management with that of the financial incentives. Instances get paid to host, people get paid for heigh quality content. The whales get to have a power trip effecting rankings with pay to win, the tankie devs get their baby corrupted by capitalism. Win, win, win, win.

schizo@forum.uncomfortable.business · edit-2 3 days ago

If you’ve never seen it, https://github.com/PeerConnect/peer-connect is a clever solution to p2p CDN stuff.

It’s not Bittorrent for images, but it’s still a shockingly elegant thing.

It’s probably a dead project, but I ran across it and went ‘why is this not how everything works’.

lps@lemmy.ml · 3 days ago

Such a great take:)

FaceDeer@fedia.io · 3 days ago

There have been many systems developed over the years for handling decentralized data storage, decentralized user identities, and decentralized decision-making. There are excellent options out there for all this stuff.

IMO the problem is that there’s a huge “not invented here” problem, combined with a popular “ew, I don’t want to be associated with that technology (or more accurately with the group behind that technology)” reflex that has nothing to do with the technology itself. So projects like the Fediverse keep reinventing the wheel over and over, and whenever a project manages to do something right it’s rare for the other projects to abandon their own implementations to borrow from the best.

lambalicious@lemmy.sdf.org · 3 days ago

Usually the issue of media storage (photos, videos, etc) is brought in as an Issue. For now I’ll skirt the “legal ramifications” including copying media and privacy, as those are an ever changing landscape of legal wanking that wankers can speak of much better than one can (and evil wanking still needs to be fought against).

One idea I’ve seen floated around is to have some sort of cooperative CDN for instances. Let’s say four or five relatively kindred instances, make a commitment to last and pool their resources to maintain a joint CDN from from which they’ll get their “media federation” from. This would reduce costs and issues a lot, since by the very nature of the fediverse, if everyone builds their own caches most of those caches are going to be hosting most of the same content. Basically: deduplication, but the poor man’s version.

Another alternative is to just ditch storage of videos and images. Just take links to Elsewhere and let Elsewhere handle it.

kopper [they/them]@lemmy.blahaj.zone · 3 days ago

I mean, I’d say that all instances copying media by default, to be stored forever, is kind of unnecessary. (And as far as I’m aware Mastodon is the only one configured like this by default anyway)

The largest instances? Sure. I’d say they have an obligation to not DoS smaller instances by simply hotlinking or proxying without any kind of cache. But smaller ones can get away with short lived middleware-level caches, and single user ones can often get away with hotlinking (oh boo hoo your firewalled IPv4 behind enough CGNATs to block any incoming connections got exposed)

One idea I’ve seen floated around is to have some sort of cooperative CDN for instances. Let’s say four or five relatively kindred instances, make a commitment to last and pool their resources to maintain a joint CDN from from which they’ll get their “media federation” from. This would reduce costs and issues a lot, since by the very nature of the fediverse, if everyone builds their own caches most of those caches are going to be hosting most of the same content. Basically: deduplication, but the poor man’s version.

https://jortage.com/ already exists, and the code behind is open.

lambalicious@lemmy.sdf.org · edit-2 3 days ago

Interesting! Didn’t know that was what it was for. I always thought it was merely a storage backend.

Any metrics on how many instances are using it and how much deduplication is it doing? EDIT: I see the numbers on their page, I was wondering more about people or instances using their own copy of it, since it’s open source.