Project Book Retro

It’s four years since the first commit to Work Project #2. With over 10,000 commits, thousands of deployments, a couple of hours of downtime, and lots of paying customers (more please, though), I wanted to take a second to look back on all of the many decisions. What was good, what was crap. Do again, change next time?

Before we begin, we’re a fairly simple stack: Linux (Ubuntu Server), nginx, MySQL, PHP. We have a few other bits, but that’s the core product.

Infrastructure

DigitalOcean

DigitalOcean have been near-rock solid. A few intermittent network issues between managed Redis and apps, but otherwise very simple. Not having to worry about an unexpected bandwidth bill has been great. Setting up VPCs and firewalls is just… simple. Coming from the horror show that is AWS/Azure, it’s just so nice to have a simple UI. 99% of sites are not Netflix, will never have the traffic of Netflix, and don’t need the engineering of Netflix.

Their managed services might not compete with AWS on all features, but what they do do, they do well.

Result: would use again

Ansible

It’s battle-tested at this point. Our traffic isn’t spiky enough to justify spinning servers up and down, so being able to provision one manually and get it running in 10 mins instead of 10 secs is fine. Terraform has been a nightmare elsewhere (storing state, hoping it never gets out of sync).

Result: would use again

Docker & Docker Compose

While we use normal nginx for most endpoints, we run a few services in Docker containers. Not having to worry about different libraries affecting different parts, being able to move services between boxes etc. is great. It’s not perfect (I completely missed restart policies in compose), but it’s been very reliable for us (no downtime in four years).

Result: would use again

Datadog

Where to start? The pricing is just astronomical. The stats and dashboards are nice. One far outweighs the other.

Result: avoid unless you have VC cash that you want to piss up a wall

Clickhouse & Clog

Clickhouse was a bit of a dark horse. Previously I’d used an ELK stack but just hated how it was a bastard to manage. Clickhouse just ran and ingested anything I threw at it. We’re now ingesting tens of thousands of logs an hour: every request, DB query, exception, pageview etc. And what’s more, querying it is still super quick! It’s slowly powering more and more of the app. The SQL-like syntax takes a while to get used to, but it uses a quarter of the resources of Elasticsearch, is easily 10x faster, and also let us move a load of stuff off Datadog.

Result: would definitely use again

Elasticsearch

We initially used one cluster for search and another for logging. I cannot in good conscience recommend Elasticsearch in 2024. It is a huge resource hog that is finicky to manage. Everything is fine until it’s not. For very advanced search, it’s probably still one of the best, but for anything simpler, e.g. give me a list of things that match a simple search term, just use something else.

Result: use a simpler option, like…

Typesense

Typesense now powers our search and also our vector search for AI embeddings. It’s not perfect. CPU contention is a contentious issue and it can hang if processing too much at once. It can be scaled, but that gets trickier. Getting from zero to search, though, is pretty quick and easy.

Result: use until something better comes along

SSO/SAML/SCIM + WorkOS

Everyone wants to sign in to your app a different way. I’ve been hella-burnt before using Auth0, so I am 100% deadset against the very concept of storing your user database in someone else’s business. WorkOS was a good compromise where we can let them abstract away every service’s SAML/SCIM flow and we just handle the normalized endpoints our side. They’re expensive, but I never need to worry about writing an AD/Entra driver ever again.

Result: use if you need more advanced connections

CI/CD: Buddy.Works

I’ve been using Buddy for nearly ten years now. They’re not perfect, they can be expensive, BUT they have integrations for almost every runtime you want to build for and service you’d want to push to. Don’t run CI/CD infrastructure yourself: it’s a time hog, a resource hog, and a pain in the arse. Whatever you do, don’t be using FTP/git pulls for deployments :facepalm:

Result: use some form of CI/CD

Self-Hosting (Image Processing/Websockets/etc)

In direct contradiction to the advice above, we’ve tried to run a lot of services on-prem. Docker makes it dead easy to run them and unless you’re serving millions of daily users, you’d be amazed what a couple of simple boxes can achieve. Libvips, soketi, geoip have all saved us thousands each month in SaaS costs.

Result: where possible, would do it again

App

Separate API & Front-ends

This was a design choice we pulled over from WP#1. A standard REST API that communicates with different front-ends. It’s gone very well, especially as we’ve added mobile, browser extensions etc.

REST is still my go-to as well. I haven’t seen enough tooling around GraphQL outside of the (horrible) JavaScript world to risk touching it. Honestly, if you need that, just use gRPC.

Result: the way to do it

OAuth2

Our previous setup used a custom auth system that issued tokens and had its own patterns, and it was just horrible. Or better put, annoying. HTTP testers had to have custom auth providers written. Want to integrate mobile/browser? Custom code. Just having the API offer a standard OAuth2 server has made life so much simpler. Don’t use session auth for apps, folks.

Wildcard Subdomains

I remember doing a project 10-15 years ago where users could use subdomains as usernames, e.g. dachande663.example.com. It was hell, starting with the cost of a wildcard SSL certificate back then, though I was also just very inexperienced. The moment someone registers their username as www, you’ll know how I feel. You’ve got to watch the rough edges, but this was a great choice because when it came to doing custom CNAMEs, the system was already set up to look at origin and not just path or session.
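One of those rough edges, the www username problem, can be sketched as a check at registration time. This is a minimal illustration rather than our actual code: the reserved list and the function name are made up for the example, and a real list would be a lot longer.

```python
import re

# Hypothetical reserved list for illustration; a real one would be much longer.
RESERVED_LABELS = {"www", "api", "mail", "admin", "secure", "static", "ftp"}

# A single DNS label: 1-63 characters, letters/digits/hyphens,
# and no hyphen at the start or end.
LABEL_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def is_allowed_subdomain(username: str) -> bool:
    """Return True if the username is safe to serve as <username>.example.com."""
    label = username.strip().lower()
    if LABEL_RE.fullmatch(label) is None:
        return False  # not a valid single DNS label (dots, underscores, too long...)
    return label not in RESERVED_LABELS


if __name__ == "__main__":
    for candidate in ("dachande663", "www", "-bad", "a.b"):
        print(candidate, is_allowed_subdomain(candidate))
```

Lower-casing before the check matters because hostnames are case-insensitive, so WWW has to be rejected along with www.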

Result: would do again

Separate Secure Area

I’m including this because it was a random choice that I think has paid off. All authentication in WP#2 is handled on secure.example.com. Want to log in? Secure. Want to start/finish an SSO flow? Secure. 2FA? Secure. The simple win for this is we can really restrict access. It’s got a very strict Content-Security-Policy header, separate rate-limiting at multiple tiers, much more access logging etc. It’s a tiny bit of hardening that pays off when you get your first pen-test.
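To give a feel for what a strict Content-Security-Policy looks like, here is a small sketch that builds a deny-by-default header value. The directive names are standard CSP, but the exact set and the build_csp helper are assumptions for illustration, not the policy we actually ship.

```python
def build_csp(directives: dict) -> str:
    """Join directive names and their source lists into one header value."""
    return "; ".join(
        f"{name} {' '.join(sources)}" for name, sources in directives.items()
    )


# Assumed policy for illustration: deny everything, then allow same-origin only.
STRICT_CSP = {
    "default-src": ["'none'"],
    "script-src": ["'self'"],
    "style-src": ["'self'"],
    "img-src": ["'self'"],
    "connect-src": ["'self'"],
    "form-action": ["'self'"],
    "base-uri": ["'none'"],
    "frame-ancestors": ["'none'"],
}

CSP_HEADER = ("Content-Security-Policy", build_csp(STRICT_CSP))

if __name__ == "__main__":
    print(f"{CSP_HEADER[0]}: {CSP_HEADER[1]}")
```

Because the secure area only does auth, a policy this tight is realistic there in a way it rarely is for the main app.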

Result: do this

Custom CNAMEs + OpenResty

I don’t think I’d realised how much people value having something on their own domain. Seriously, acme.example.com vs example.acme.com has been a sales hit since day one.

Running OpenResty with a Lua script that checks subscriptions and provisions certificates from Let’s Encrypt was a day’s work and has saved us the tens of thousands it would have cost to use Cloudflare’s or others’ SaaS offerings for this.
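The important part of that script is the decision it makes before asking for a certificate. Here is a rough sketch of that decision in Python (the real thing is Lua inside OpenResty). The lookup table, field names and base domain are all hypothetical, and treating our own subdomains as already covered by an existing certificate is an assumption.

```python
from dataclasses import dataclass

BASE_DOMAIN = "example.com"  # placeholder for the product's own domain


@dataclass(frozen=True)
class CustomDomain:
    tenant: str
    subscription_active: bool


# Hypothetical lookup table; in practice this would be a database or cache query.
CUSTOM_DOMAINS = {
    "example.acme.com": CustomDomain(tenant="acme", subscription_active=True),
    "docs.lapsed.io": CustomDomain(tenant="lapsed", subscription_active=False),
}


def normalise_host(host: str) -> str:
    """Lower-case the host and drop any port and trailing dot (IPv6 literals ignored)."""
    return host.strip().lower().split(":", 1)[0].rstrip(".")


def may_issue_certificate(host: str) -> bool:
    """Only request a certificate for known custom domains with a live subscription."""
    name = normalise_host(host)
    if name == BASE_DOMAIN or name.endswith("." + BASE_DOMAIN):
        return False  # our own subdomains: assumed to be served by an existing certificate
    record = CUSTOM_DOMAINS.get(name)
    return record is not None and record.subscription_active


if __name__ == "__main__":
    for host in ("example.acme.com", "docs.lapsed.io", "unknown.test"):
        print(host, may_issue_certificate(host))
```

Refusing unknown hostnames is what stops anyone who points a random domain at the server from burning through Let’s Encrypt rate limits.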

Result: would do again

Back-Office Tool

We’ve used Retool in a previous project. Don’t use Retool. Not only is it expensive and crap to develop in, keeping another tool in line with your main system is just a pain. Our own back-office tool has list and detail views for all entities, lots of actions to perform etc. It has a) cut down the amount of time between a query coming in and CS being able to fix it, and b) saved me making lots of database changes all the time.

Result: do this

Programming

Laravel

Ergh, where to start. I like that Laravel is obviously pushing the PHP community but it has that early-Rails feeling. Every method and example online is now a mishmash of closures and reducers. We have had to swap out almost all of the parts. The only advantage over something like Symfony is the ability to do this without rewriting blocks of YAML.

Result: would use again until a better framework emerges

Type Hinting

We’re running PHP 8.3, which offers a pretty nice level of type-hinting. It’s still missing generics and types over collections, but what it does offer has made code much nicer to use. The other extreme is something like TypeScript, which has made its type system Turing complete. It’s bonkers and I can see a big wave coming to push back against it.

Result: would use again

Testing

We have about 4,000 tests covering ~85% of code in the API. 100% coverage is a myth, don’t bother. This breaks down to around 3,500 full integration tests and 500 unit tests. The entire test suite runs in about 2 minutes on our CI/CD servers, or ~30 seconds with parallelism locally.

This is probably the single biggest thing I’ve pushed for and maintained. In four years, we’ve shipped 2 builds that broke something. Test. It.

Result: do this or don’t bother shipping

Prefixed ULIDs

This is a tiny one but I’m including it because it’s really, really paid off for us. All IDs are 36-character binary strings of the format prefix + underscore + encoded(timestamp + rand) + checkdigit, e.g. doc_0qic3nmshj4zgstwi4bsu3lrvlymgv4p.

First let’s get the elephant in the room out of the way: yes, ints make better IDs and join keys. But I’d sacrifice a few percent of performance for the benefits. Which are:

  • As a human, you can instantly see what an entity is
  • Referential integrity is improved (with int columns, you can accidentally join the wrong columns very easily)
  • Prevents enumeration attacks
  • (A little internal one, our back-office tool has a shortcut to open details page for any ID which uses prefix as selector)
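To make the format concrete, here is a minimal sketch of generating and parsing an ID with that shape. Only the overall layout comes from the description above. The base36 alphabet, the 48-bit timestamp plus 112 random bits, and the checksum rule are all assumptions picked so that a three-letter prefix gives 36 characters. It is not our production encoder.

```python
import secrets
import time
from typing import Optional, Tuple

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # assumed base36 alphabet
BODY_LEN = 31      # 160 bits always fit in 31 base36 characters
RANDOM_BITS = 112  # assumed split: 48-bit millisecond timestamp + 112 random bits


def _encode(number: int, length: int) -> str:
    """Encode a non-negative integer as fixed-width, zero-padded base36."""
    chars = []
    for _ in range(length):
        number, remainder = divmod(number, 36)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def _check_char(body: str) -> str:
    """Hypothetical checksum: position-weighted sum of character values, mod 36."""
    total = sum((i + 1) * ALPHABET.index(c) for i, c in enumerate(body))
    return ALPHABET[total % 36]


def new_id(prefix: str, now_ms: Optional[int] = None) -> str:
    """Build prefix + underscore + encoded(timestamp + rand) + check character."""
    timestamp = int(time.time() * 1000) if now_ms is None else now_ms
    value = (timestamp << RANDOM_BITS) | secrets.randbits(RANDOM_BITS)
    body = _encode(value, BODY_LEN)
    return f"{prefix}_{body}{_check_char(body)}"


def parse_id(value: str) -> Tuple[str, int]:
    """Return (prefix, timestamp in ms); raise ValueError if the ID is malformed."""
    prefix, _, rest = value.partition("_")
    body, check = rest[:-1], rest[-1:]
    if not prefix or len(body) != BODY_LEN or _check_char(body) != check:
        raise ValueError(f"malformed id: {value!r}")
    return prefix, int(body, 36) >> RANDOM_BITS


if __name__ == "__main__":
    doc_id = new_id("doc")
    print(doc_id, parse_id(doc_id))
```

Putting the timestamp in the high bits of a fixed-width encoding means IDs with the same prefix sort by creation time, and parse_id hands back the prefix, which is all a back-office shortcut needs to pick the right details page.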

Result: in the right cases, very useful

Vue

God I hate React. It’s grown into a sprawling mess. Vue isn’t perfect, but it’s done ok for us. We’re looking at doing more code splitting and importmap loading, but it’s worked so far.

Result: would look at what else is out

There are a lot more parts of what we’ve done that I wanted to go through, but my fingers are tired. I think overall I’m moving away from big and complex (AWS/Elasticsearch) and towards smaller and simpler.