DIY Analytics

I did something stupid. Off the back of my last post about looking for something for analytics, I ended up making a system myself. I don’t know why. It’s certainly never going to be as good as one off the shelf, but I did it anyway.

Probably firstly the why (or more why not):

  1. Everything free is privacy invasive, and everything paid is paid. This is a random little blog and I’m already trying to minimise spend where I can (in life and cloud providers).
  2. I wanted something that operated server-side. I don’t want to rely on JS (and more importantly ad blockers). This way I can also log response times, database queries etc.
  3. I’m very much going for 100% NIH, build it myself on this site (often to my detriment, but that’s a choice I made). Adding some logging isn’t outside the realm of difficulty for that.
  4. And last, and probably most importantly, I already have a clickhouse server running on this box for logging other things (locstats, probe monitoring etc). So adding a pageviews table was all of five minutes work.

What I’ve ended up with isn’t the most robust thing by a long-shot, but it appears to be running fine. I’ve ended up with a very basic schema like this:

CREATE TABLE IF NOT EXISTS pageviews (

    id UUID DEFAULT generateUUIDv4(),
    created_at DateTime64(3, 'UTC'),

    request_method LowCardinality(String),
    request_path String,
    request_query_names Array(String),
    request_query_values Array(String),

    response_code UInt16,
    response_time Float32,

    tracking_daily_hash FixedString(100),
    tracking_from_code String,
    tracking_referer_domain String,

    user_country_code LowCardinality(String(3)),
    user_language LowCardinality(String),

    user_agent_bot Bool,
    user_agent_browser_name LowCardinality(String),
    user_agent_browser_version LowCardinality(String),
    user_agent_os_name LowCardinality(String),
    user_agent_os_version LowCardinality(String),
    user_agent_device_type LowCardinality(String),

    system_num_queries UInt16,
    system_is_me Bool
)
ENGINE = MergeTree()
PRIMARY KEY (created_at, id)
ORDER BY (created_at, id);

CREATE TABLE pageviews_buffer
AS pageviews
ENGINE = Buffer(
    default, --db_name
    pageviews, --dest_table
    1, --num_layers
    60, --min_seconds
    300, --max_seconds
    10, --min_rows
    1000, -- max_rows
    100, --min_bytes
    512000 --max_bytes
);

I create two tables. Pageviews is the main store with various anonymised fields such as user agent fields, and a daily hash (stolen from Plausible) for tracking repeat views.

The pageviews_buffer table is a Clickhouse specific optimization that buffers up lots of small inserts and does one large flush to the destination table.

It’s pretty simple, which is what I wanted. As an example, to get the number of pageviews per hour for the last week, I’d run this query.

select
    toStartOfHour(created_at) as d,
    count(*) as num_pageviews,
    count(distinct tracking_daily_hash) as num_visitors
from pageviews
where created_at between toStartOfHour(now() - toIntervalDay(7)) and toStartOfHour(now() + toIntervalHour(1))
and system_is_admin = 0
group by toStartOfHour(created_at)
order by d asc
with fill from toStartOfHour(now() - toIntervalDay(7)) TO toStartOfHour(now() + toIntervalHour(1)) step toIntervalHour(1)

Given at work we’re ingesting tens of thousands of rows an hour and a query like this runs in sub-second time, I’m happy it’ll be fine.

If people want to know more about the structure I’ve chosen, let me know. I’m enjoying using Clickhouse more and more, but there’s a shed-load of stuff I’ve yet to use and lots of pointy edges I’ve found.