Codeberg scaling issue mostly caused by software architecture

aeyes · on Sept 21, 2023

Looking at https://codeberg.org/forgejo/forgejo/issues/220 I'm 99% certain that this is caused by foreign keys because the lock is waiting for primary key index access.

And if you have foreign keys you implicitly lock the referenced rows.

At scale you have to be very careful with your foreign keys, especially on update heavy tables.

A quick fix could be to skip the update of the count if the row is locked (select for update skip locked or nowait). This will fix the lock issue but you'll lose consistency.

The real fix would be to move these counters to separate tables without foreign keys and to batch the updates with background tasks.

It might also be possible to take a weaker lock for these updates but I'd have to review it in much more detail.

Edit: Another problem is that they apparently have not configured reasonable lock / statement timeouts in their database. There is no good reason to wait for a lock for 30 seconds.

morelisp · on Sept 21, 2023

I want Codeberg to succeed, but every time they have a problem they ask for "help", and you go read the issue, and... they're still working on routing contexts through DB queries? Like, really, the problem is essentially a lack of typists?

And stuff like https://codeberg.org/forgejo/forgejo/issues/220#issuecomment... is really worrying, like, you're counting non-critical integers one at a time, but you also think you're outsmarting the MariaDB engineers? I don't know how you productively enter a discussion like this to "help".

aeyes · on Sept 21, 2023

The first iteration of my comment was actually: "Looking at this issue, the developers are clearly clueless."

But that wasn't constructive. It seems to be just volunteers working on this so it makes sense that they ask for help because they can't pay their way out of this by actually having people with experience on their team.

morelisp · on Sept 21, 2023

They are volunteers in the sense participation in the project is voluntary, but are also an organized non-profit accepting donations.

Anyway I just logged in to fork the repo on the idea I might automate the context replacement, but... my subscriptions are somehow covered in spam for projects/users I don't follow? That is also worrying...

aeyes · on Sept 21, 2023

But the organization has no employees, all money goes towards hosting and maintenance of the legal structure.

This is a bit of a problem when it comes to maintaining the infrastructure and code of a large-scale system.

jlkuester7 · on Sept 21, 2023

In my experience, the DB layer is always the hardest to scale. At a certain point, vertically scaling system resources starts producing ever diminishing returns. At that point, the natural progression is to move to a clustered setup with multiple DB nodes, etc, but that introduces a whole new universe of complexity and failure points.

My takeaway is that proper data modeling at the initial stages of designing a software architecture is absolutely essential to long-term scalability. You really need to consider the type of data you are storing, its shape, and the access patterns involved with reading/writing it. Of course, since it is basically impossible to get this correct the first time, it is even more important to have a good abstraction layer between your business logic and your database to make it possible to change your data store in the future.

flashgordon · on Sept 21, 2023

Spot on. You can only go so far with vertical scaling. While you can't just Willy nilly shard and hope to scale, modelling is key as well as introducing and embracing eventually consistency (and read your writes) all the way across stack especially on the front end experience.

fmajid · on Sept 22, 2023

Sometimes vertical scaling is counterproductive because of NUMA memory bottlenecks. In my experience the STREAM benchmark correlates pretty well with RDBMS performance.

wutwutwat · on Sept 21, 2023

Most scaling issues are due to the software, what’s newsworthy about this? Gitea, Gogs, GitLab, name a scm, have all had their fair share of growing pains. If you run a SaaS, scaling issues are par for the course. They are also a good thing. Failing endeavors don’t have scaling issues.

You can likely squeeze some extra performance out of your database by:

slimming down large tables with old data. Move it to an archive table to reduce the index sizes

looking for missing indexes on hot tables, or inefficient ones. Combining multiple indexes into one can help too

Running vacuum full

checking and possibly increasing work_mem, and shared buffers. Enable pg_stat log temp files and look at how often you’re writing to disk due to worker memory exhaustion on complex queries and joins. You can set the logger to log above a certain size and adjust work mem to cover most hot queries. Don’t try to get them all, that’s a losing game and the gains are not worth it for seldom used queries. Just let them hit disk.

Take a look at your max connections and see if you’re at the limit, and increase or look into a connection pooler if so

Think about splitting reads and writes. This can be done in the code, or, using a pooler (pg bouncer I think or pg pooler). Send your writes to a primary and reads to n number of replicas following the primary. Throw replicas at it load balance reads until you optimize the queries.

First thing I would do though, throw beefier hardware at the db. If using managed service like aurora/rds, work mem, shared buffers, and max conns are proportional to instance size. More memory on a larger instance could get you stable again. I guarantee you, you’ve not reached as far as you can go on a single db server, you just have to pony up the dough

> Just adding: The performance problem is also worsened by the fact that we are hit with spam waves for several weeks now. Bots create thousands of issues on Codeberg continuously, and although we got more efficient with cleanign it up, it heavily stresses the database.

Sounds like you need to start blocking ip ranges instead of optimizing spam cleanup. If you can, put a WAF infront of the app servers. Throw captchas on forms that are spammed for suspicious ips. Implement rate limiting for issue creation and if it’s already there, make it more aggressive. People don’t make multiple issues on a repo in an entire day usually

MarcusE1W · on Sept 21, 2023

The way I read it is that at least a good part of the problem is that they get spammed with issues.

How large are you supposed to size your DB so that some <add explicit here> can spam you. What can the conversion rate of this spam possibly be, I wonder.This probably all happens for no reason/result at all.

I know there are also good examples, but in some days you loose hope for mankind a bit. Every time someone creates something nice some <the explicit from above> comes around and tries to misuse it.

rurban · on Sept 21, 2023

Mostly caused by MySQL locking too many tables. So why are they using a read-mostly database, and not a proper write-aware database like PostgresQL?

Simply switching to a proper database would help in scaling.