Postgres: Opinionated Rules-of-Thumb and Gotchas
3 points by hot_gril on Nov 27, 2023 | hide | past | favorite | 7 comments
Postgres is a big and powerful tool that can be used in many ways, but for your typical backend storing state, I've learned to narrow down how I use it a bit:

1. Every table's primary key should be a single `bigserial id` col unless you have strong reasons and benchmarks showing otherwise. ("They have UUIDs" isn't a good enough reason; simply store those in a separate col.)
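A minimal sketch of rule 1 (table and column names are illustrative): keep the compact `bigserial` as the primary key and store the externally supplied UUID in its own unique column.

```sql
-- Internal surrogate key plus a separate column for the UUID callers know about.
CREATE TABLE users (
    id          bigserial PRIMARY KEY,
    external_id uuid NOT NULL UNIQUE,
    name        text NOT NULL
);

-- Joins and foreign keys use the compact bigint key internally:
CREATE TABLE orders (
    id      bigserial PRIMARY KEY,
    user_id bigint NOT NULL REFERENCES users (id)
);
```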

2. To represent a point in time, you almost always want `timestamptz`, not `timestamp`. Counterintuitively, it's the latter whose meaning is timezone-dependent: a plain `timestamp` carries no zone, so how it's interpreted follows the cluster's TimeZone setting.
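To illustrate the distinction (table name is hypothetical): `timestamptz` stores an absolute instant, while `timestamp` stores a bare wall-clock value.

```sql
-- timestamptz stores an absolute instant; timestamp stores a wall-clock
-- value whose interpretation depends on the session's TimeZone setting.
CREATE TABLE events (
    id         bigserial PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- The same stored instant renders differently per session, but compares equally:
SET TIME ZONE 'UTC';
SELECT created_at FROM events;
```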

3. The default transaction mode is not fully isolated (the "I" in "ACID"), and the fully isolated mode (serializable) is often too slow to be useful, so learn to use the default mode safely. See https://www.postgresql.org/docs/current/transaction-iso.html
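One common pattern for staying safe under the default READ COMMITTED level is to lock the row you're about to modify, or to collapse the read-modify-write into one atomic statement (the `accounts` table here is hypothetical):

```sql
-- A "check balance, then deduct" sequence can race with a concurrent
-- transaction under READ COMMITTED. Locking the row up front avoids that:
BEGIN;
SELECT balance FROM accounts WHERE id = 42 FOR UPDATE;  -- blocks rivals
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;

-- Or push the whole read-modify-write into one atomic statement:
UPDATE accounts SET balance = balance - 100
WHERE id = 42 AND balance >= 100;
```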

4. If a cloud service offers high availability via asynchronously replicated standbys (like Heroku pro tiers), beware that a failover event can lose some of the most recently committed writes.

5. If you're designing your schema well, you don't need or want an ORM. Query builders can help, but question if yours really is helping.

6. Start with 3NF, then selectively de-normalize only for performance reasons. 6a. An even stricter option is an "append-only" aka "timeseries" schema, which I'd recommend trying; I can't remember the last time I didn't use one.
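A sketch of the "append-only" idea from 6a (names are illustrative): never UPDATE, only INSERT new versions, and read the latest row.

```sql
-- Every change appends a new row; history is preserved for free.
CREATE TABLE user_settings_log (
    id          bigserial PRIMARY KEY,
    user_id     bigint NOT NULL,
    settings    jsonb NOT NULL,
    recorded_at timestamptz NOT NULL DEFAULT now()
);

-- Current state = most recent row per user:
SELECT DISTINCT ON (user_id) user_id, settings
FROM user_settings_log
ORDER BY user_id, id DESC;
```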

7. If you happen to be testing on a Mac, https://postgresapp.com/ is easier than the MacPorts or Homebrew packages.



>> 1. Every table's primary key should be a single `bigserial id` col unless you have strong reasons and benchmarks showing otherwise. ("They have UUIDs" isn't a good enough reason; simply store those in a separate col.)

As with all "rules", context is really important. Without knowing anything about your context, it's hard to really agree or disagree with your rules.

If I could give my 1995 self a tip though, it would be to use a UUID as the primary key for all tables. More specifically I would tell him to make it a client-generated UUID.

It certainly wasn't necessary then, but 15 years later it would have turned so many problems into not-problems.

Specifically as we wanted to merge data, put subsets on remote devices, and so on. Sequential id fields are by far my biggest regret.

But again, that's me, for my context. Ids were fine in the original context. But then the context changed...

Yes, I'm aware of the space and performance implications - but in my context those are way, way, down the problem tree.


If you'd done UUIDs in 1995 (if that was even possible) you'd have had horrendous indexing problems, mainly relating to clustering. This was an issue when I started using SQL Server in around 2006. It was also a concern in Oracle, so I would be surprised if Postgres wouldn't have had similar problems. I think they are a non-issue now though.

A discussion of some of the pros and cons: https://stackoverflow.com/questions/11938044/what-are-the-be...


In my experience, it's not horrendous in Postgres but still noticeably worse than bigserial.

There are often reasons to have UUIDs as secondary keys. But your DB uses primary keys for disk layout, joins, etc, so I've never seen a good reason to use UUIDs there. Even the CitusDB (horizontally-scaling Postgres) examples use bigserials. I've always regretted using UUID pkeys and never regretted using bigserial.

Side bonus of bigserials is you get ordering, which matters in some contexts.


I tried the strict 3NF approach in the past. There weren't performance problems, but there were query ergonomics problems. Combined with "no ORM", the query layer was a labyrinthine minefield.


If you find yourself needing an ORM, probably what you really need is some jsonb. That may go against strict 3NF, but I would ignore that.

Good example is a "user settings" table that you're mostly just going to store and retrieve, such that splitting them into separate cols or tables doesn't help.
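That "user settings" table might look like this (schema and keys are hypothetical): one jsonb blob per user, with nothing but simple get/set against it.

```sql
CREATE TABLE user_settings (
    user_id  bigint PRIMARY KEY,
    settings jsonb NOT NULL DEFAULT '{}'
);

-- Upsert the whole blob:
INSERT INTO user_settings (user_id, settings)
VALUES (1, '{"theme": "dark", "emails": false}')
ON CONFLICT (user_id) DO UPDATE SET settings = EXCLUDED.settings;

-- Retrieve a single setting as text:
SELECT settings ->> 'theme' FROM user_settings WHERE user_id = 1;
```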


Postgres is one of those technologies that comes with enough power to shoot yourself in the foot. Having a set of heuristics and usage patterns is almost necessary - unless you want to relearn these hard lessons. Thanks!

Re:

1) The default 32 bit serial integer is indeed a huge problem. 2 billion rows can come up fast in an active system and you do NOT want to experience the fun of migrating to a 64 bit serial while under a constant load. I'd advocate for using `bigint primary key generated always as identity` (see https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use...)
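The linked wiki page's recommendation, as a sketch (table name is illustrative): a 64-bit identity column instead of the 32-bit `serial` default.

```sql
-- bigint avoids the 2^31 ceiling; GENERATED ALWAYS AS IDENTITY is the
-- standard-SQL replacement for the older serial/bigserial shorthand.
CREATE TABLE events (
    id      bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    payload text
);
```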

The UUID pk advocates have some good points. And for low-to-mid traffic tables, I agree. The size doesn't bother me, but UUIDs are not sortable in insert order, which means the b-tree index pages are uncorrelated with the position of the row in the table. There is a severe penalty once you start maxing out your shared_buffers cache: every new insert into the btree is likely to require fetching a new page, the cache starts thrashing, and performance grinds to a halt. It's one of those performance issues that doesn't creep up on you, it smacks you in the face! One day, your inserts just get 100x slower and crash your app. The only solution is to buy more and more RAM as your table grows... or go back in time and use an autoincrementing primary key to avoid the whole issue.

2) Timezones - Not intuitive but super important to get right before you start inserting data! Get it wrong and the previous timestamps are likely impossible/ambiguous to interpret.

5) ORMs are a great abstraction over 80% of the basic stuff - CRUD operations on a single table, select by id ... the core functionality to bootstrap a web API on a greenfield project.

But how many of us are working on basic greenfield projects? As soon as you start needing to do anything complex, ORMs tend to fall apart quickly and devolve into raw sql for almost all of the database interactions anyway. Might as well start with SQL or a query-builder. I really prefer the SQL-first hugsql technique (https://www.hugsql.org/) - I wish it was available in more languages.

6) Honestly, though I learned all the relational theory at some point, I rarely need to implement anything more complex than 3NF or Boyce-Codd NF. But I'd say one of the advantages of postgres is the JSONB type which allows you to ignore normalization theory almost entirely :-) It can be great for MVPs, implement the initial functionality in a document-style database and migrate to properly normalized tables once the schema of the document is more stable. So document db? Check. An append-only timeseries db? Check. Spatial database? Check (see PostGIS). Using postgres tables as a work queue? Check. Postgres can behave like many other styles of databases, not just a pure relational one.
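The "Postgres tables as a work queue" pattern mentioned above is commonly built on `FOR UPDATE SKIP LOCKED`, so competing workers never block each other. A sketch (the `jobs` table is hypothetical):

```sql
-- Each worker atomically claims and removes one job; SKIP LOCKED makes
-- rivals skip rows another worker has already locked.
BEGIN;
DELETE FROM jobs
WHERE id = (
    SELECT id FROM jobs
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING payload;
COMMIT;
```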

7) For my money, there's no reason to "install" Postgres at all. Docker-compose and the official postgres docker images are ideal for local development. For production, use managed postgres unless you've got a strong reason not to.
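A minimal docker-compose sketch for that local-development setup (image tag, credentials, and volume name are all illustrative):

```yaml
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: dev
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
```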


1. Yeah, even if you don't have 2B rows, your sequence can go past 2B because it's not affected by transaction rollbacks. This is even more of an issue if you're using serializable xacts, which need a retry mechanism. Much better to avoid this problem.

5. I like jsonb as a compromise. You can shove things into there, and whenever that stops making sense, you split things into their own cols/tables. The thing to watch out for is arrays within jsonb; usually not a good practice to join or do anything other than simple set/get for those.

7. Often I'm doing nodejs dev which doesn't really need Docker, so installed Postgres ends up being more convenient. Also means data persists across runs, which you may or may not want. But Docker is a better solution overall, especially at scale.



