Clog

There was recently a discussion on Hacker News around application logging on a budget. At work I’ve been trying to keep things lean: not to the point of absurdity, but also not using a $100 or $1,000/month setup when a $10 one will suffice for now. We settled on a homegrown Clickhouse + PHP solution that has performed admirably for two years now. Like everything, this is all about tradeoffs, so here’s a top-level breakdown of how Clog (Clickhouse + Log) works.

Creation

We have one main app and several smaller apps (you might call them microservices) spread across a few Digital Ocean instances. These generate logs from requests, queries performed, exceptions encountered, remote service calls, etc. We use monolog in PHP, and a standard file writer elsewhere, to write newline-delimited JSON to log files.
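
On the PHP side the monolog setup is nothing special: a file handler with the JSON formatter bolted on. A minimal sketch (the channel name and log path are illustrative, not our real ones):

<?php
// Minimal monolog setup: a single handler writing newline-delimited JSON
// to a file. The channel name and path are illustrative.
use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Formatter\JsonFormatter;

$handler = new StreamHandler('/var/log/app/app.log');
$handler->setFormatter(new JsonFormatter()); // one JSON object per line

$logger = new Logger('app');
$logger->pushHandler($handler);

$logger->info('http_response', ['http_route' => 'api.example']);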

In this way, there is no dependency between applications and the final logs. Everything that follows could fail, and the apps are still generating logs ready for later collection (or recollection).

Collection

On each server, we run a copy of filebeat. I love this little thing. One binary, a basic YAML file, and we have something that watches our log files, adds a few bits of extra data (the app, host, environment, etc.), and then pushes each line onto a Redis queue. This way our central logging instance doesn’t need any knowledge of the individual instances, which can come and go.
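
The whole config fits on one screen. A rough sketch, with illustrative paths, field values and Redis key (the real file has a little more in it):

filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.log
    json.keys_under_root: true   # each line is already a JSON object
    fields:
      app: main-app
      environment: production
    fields_under_root: true

output.redis:
  hosts: ["redis.internal:6379"]
  key: "clog"                    # the list our ingester pops from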

(Weirdly, filebeat is part of Elastic, so it can be used as part of your normal ELK stack, meaning that if we wanted to change systems later we have a natural inflection point.)

There are definitely bits we could change here: checking queue length, managing backpressure, etc. But do you know what? In 24 months of running this in production, ingesting between 750K and 1M logs a day, none of that has actually been a problem. Will it be a problem when we hit 10M or 100M logs a day? Sure. But by then we’ll have a different set of resources to hand.

Ingesting

We now have a Redis queue of JSON log lines. Originally this was a Redis server running on the Clog instance itself, but we later started using a managed Redis server for other things, so we migrated this over too. Our actual Clog instance is a 4GB DO instance. That’s it. Initially it was a 2GB one (which was $10/month), so I don’t think we’re too far off the linked HN discussion.

The app that reads the queue and adds to Clickhouse is… simple. Brutally simple. Written in PHP using the PHP Redis extension in an afternoon, it runs BLPOP in an infinite loop to take an entry, runs some very basic input processing (see the next section), and inserts it into Clickhouse.
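
Stripped to its bones, the loop looks something like the sketch below. The queue name, hosts, and the insert mechanism (here, Clickhouse’s HTTP interface with JSONEachRow, one row at a time, with no batching or error handling) are illustrative; splitKeysAndValues is the processing described next.

<?php
// Sketch of the ingest loop: block on the Redis list, process, insert.
// Hosts, queue name and the single-row HTTP insert are illustrative.
$redis = new Redis();
$redis->connect('redis.internal', 6379);

while (true) {
    $item = $redis->blPop(['clog'], 0); // blocks until a log line arrives
    if (!$item) {
        continue;
    }
    [, $line] = $item; // blPop returns [queue name, payload]

    $log = json_decode($line, true);
    if ($log === null) {
        continue; // ignore lines that aren't valid JSON
    }

    $row = splitKeysAndValues($log); // the key/value split described below

    // Insert via Clickhouse's HTTP interface; in reality you'd batch these.
    $ch = curl_init('http://clickhouse.internal:8123/?query=' .
        urlencode('INSERT INTO logs FORMAT JSONEachRow'));
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => json_encode($row),
        CURLOPT_RETURNTRANSFER => true,
    ]);
    curl_exec($ch);
    curl_close($ch);
}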

That processing is the key to how this system stays (fairly) speedy, and it is 100% not my idea. Uber was one of the first I could find to detail how splitting log keys from log values can make querying much more efficient. Combined with materialized columns, we can get something very robust that will handle 90% of the things we throw at it.

Say we have a JSON log like so:

{
  "created": "2022-12-25T13:37:00.12345Z",
  "event_type": "http_response",
  "http_route": "api.example"
}

This is turned into a set of keys and values based on type:

"datetime_keys": ["created"],
"datetime_values": [DateTime(2022-12-25T13:37:00.12345Z)],
"string_keys": ["event_type", "http_route"],
"string_values": ["http_response", "api.example"]

Our Clickhouse logs table (see the DDL sketch after this list) has:

  • A partition key on the log’s created date;
  • Some top-level columns for things we’ll always have, like application name and environment;
  • Array-based string columns for the *_keys fields;
  • Array-based, type-specific columns for the *_values fields;
  • A set of materialized columns for pre-defined keys, e.g. matcol_event_type String MATERIALIZED string_values[indexOf(string_keys, 'event_type')]. This pulls out the value of event_type and stores it as its own column, which makes queries on these keys much quicker;
  • A data retention policy (TTL) that automatically removes data after 180 days.
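
Put together (and trimmed down), the table looks something like the sketch below. Column names beyond the ones mentioned above, and the ORDER BY key, are illustrative rather than our exact schema.

CREATE TABLE logs
(
    created           DateTime64(6),
    application       LowCardinality(String),
    environment       LowCardinality(String),

    string_keys       Array(String),
    string_values     Array(String),
    datetime_keys     Array(String),
    datetime_values   Array(DateTime64(6)),

    -- Materialized columns pull frequently queried keys out of the arrays
    matcol_event_type String MATERIALIZED string_values[indexOf(string_keys, 'event_type')]
)
ENGINE = MergeTree
PARTITION BY toDate(created)
ORDER BY (application, environment, created)
TTL toDateTime(created) + INTERVAL 180 DAY;

Since indexOf returns 0 for a missing key, and out-of-range array access falls back to the type’s default, logs without an event_type simply end up with an empty string in the materialized column.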

This isn’t perfect. Not by a long shot. But it means we’ve been able to store our logs and just… not worry about costs spiralling out of control. A combination of a short retention window, Clickhouse’s built-in compression, and the realisation that most people aren’t going to be generating TBs of logs a day means this system has served us just fine.

Querying & Analysing

Querying is, again, very simple. Clickhouse offers client packages for most languages, but also supports the MySQL (and other) wire protocols. We already have a back-office tool (in my experience, one of the first things you should build) that makes it drop-dead simple to add a new screen and connect it to Clickhouse.

From there we can list logs with basic filters and facets. The big advantage I’ve found here over other log-specific tools is that we can be a bit smart and link back into the application. For example, if a log includes an “auth_user_id” or “requested_entity_id”, we can automatically link it to the corresponding information page in our back-office.
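
As a flavour of what those filters turn into underneath, a typical query might look something like this (following the illustrative schema above):

-- Fast path: filter on a materialized column...
SELECT created, matcol_event_type, string_keys, string_values
FROM logs
WHERE created >= now() - INTERVAL 1 DAY
  AND matcol_event_type = 'http_response'
  -- ...while ad-hoc keys can still be pulled straight out of the arrays
  AND string_values[indexOf(string_keys, 'auth_user_id')] = '12345'
ORDER BY created DESC
LIMIT 100;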

Conclusions

There are definitely rough edges in Clog. A big one is that it’s an internal tool, which means existing knowledge of other tools doesn’t transfer. Some of the querying and filtering could definitely use some UX love. The alerts are hard-coded. And more.

But in the two-plus years we’ve been using Clog, it has cost us a couple of hundred dollars and, all told, a day or two of my time, and in return it has saved us an order of magnitude more compared with the hosted cloud options we priced up. That has given us a much longer runway.

I 100% wouldn’t recommend DIY/NIH options for everything, but Clog has paid off for what we needed.