Does Cloudflare offer a database solution for Analytics type data?

I'll have a stream of events, potentially TBs/month. Does Cloudflare offer any NoSQL/SQL solution that could support saving this amount of data? There's also the need to do some basic querying on it later.
34 Replies
Hello, I’m Allie!
Workers Analytics Engine (beta) · Cloudflare Analytics docs
Workers Analytics Engine is a new way to quickly get analytics about anything, using Cloudflare Workers. Whether you are developing a new software …
Hello, I’m Allie!
Though it does sample at some point
Chaika · 7mo ago
Yea that's pretty much the closest option, based on ClickHouse, but a few things to keep in mind:
- Limited to 91 days retention
- Currently in beta and free at the moment (pricing yet to be announced)
Paul · 7mo ago
What do you mean by "sample"? Ah, I got it, it's a way of not recording every single event. Seems a little heavy-handed. I already have all of that written, I just need somewhere to push the data
Chaika · 7mo ago
It's not quite as simple as a set amount and then sampled. There's a document here explaining it all if you're interested: https://docs.google.com/document/d/17iThVQ3wr40JlRJ_8Y0qQr2YP1S8Cbn3FujQ5WYLG8E/edit#heading=h.9hfgsp5qfkgp
Google Docs
PUBLIC: Understanding Sampling with Workers Analytics Engine
Understanding Sampling with Workers Analytics Engine Workers Analytics Engine allows you to write an effectively unbounded amount of data and query it back very quickly, at a price that is very low (or free). In order to allow users to write an unbounded amount of data at a bounded cost, we use w...
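The gist of that sampling model can be sketched in a few lines of Python (an illustrative toy, not how Analytics Engine actually implements it): each stored row records the interval it was sampled at, and queries sum that interval instead of counting rows, so totals can still be estimated even though most events are discarded. Analytics Engine exposes this per-row value as the `_sample_interval` column in its SQL API.

```python
import random

def write_events(events, sample_rate):
    """Keep each event with probability 1/sample_rate; tag kept rows
    with the interval so queries can scale counts back up."""
    stored = []
    for e in events:
        if random.random() < 1.0 / sample_rate:
            stored.append({**e, "_sample_interval": sample_rate})
    return stored

def estimated_count(stored):
    # Sum the sample intervals instead of counting rows: each kept
    # row "stands in" for roughly _sample_interval original events.
    return sum(row["_sample_interval"] for row in stored)

random.seed(42)
events = [{"session_id": i % 10} for i in range(100_000)]
stored = write_events(events, sample_rate=10)

# Only ~10,000 rows are stored, but the estimate stays near 100,000.
print(len(stored), estimated_count(stored))
```

The same shape shows up in real Analytics Engine queries, where you'd write `SUM(_sample_interval)` rather than `COUNT(*)` to estimate event totals.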
Paul · 7mo ago
OK wow, checking out ClickHouse now... there is definitely way more beyond NoSQL/SQL when it comes to this stuff. I need to do some learning. I'm using Postgres right now, but it's just incredibly slow to query at 1 million records (with proper indexing)
Chaika · 7mo ago
Analytics Engine probably isn't what you want then, it's more for use alongside Workers and such. You can only input 25 entries per HTTP request; it's not made to have a bunch of stuff shoved into it at once. Maybe look into TimescaleDB then, a Postgres extension for that kind of data
Paul · 7mo ago
I have events grouped by "session_id", so I figured a NoSQL solution would be better, since I can just keep the events in the session document
Hello, I’m Allie!
Yeah, ClickHouse is very powerful. It is what powers Cloudflare's analytics, both for your domains and for Cloudflare Radar
Paul · 7mo ago
The thing is, events for the same session_id could be separated by years... so I'm not sure if I can take advantage of TimescaleDB. I'm guessing it gets its performance boost from having well-defined time bounds?
Unsmart · 7mo ago
NoSQL databases are not meant for analytics. NoSQL databases are generally very OLTP. You want an OLAP database (ClickHouse, for example).
Paul · 7mo ago
Ok excellent. I knew I was missing something. Thanks all
Chaika · 7mo ago
I know ClickHouse would let you partition by session_id or other keys, not sure about TimescaleDB
Paul · 7mo ago
ClickHouse is its own DB? What are they running underneath, SQL?
Chaika · 7mo ago
own db
Chaika · 7mo ago
The main table engine you use with ClickHouse is MergeTree: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree (or the various types of it); it stores data in parts on disk. ClickHouse is also a columnar database rather than row-based
MergeTree | ClickHouse Docs
The MergeTree engine and other engines of this family (*MergeTree) are the most commonly used and most robust ClickHouse table engines.
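The row-vs-columnar distinction is easy to see in a toy Python model (an illustration of the idea, not ClickHouse's actual storage format): storing each column as its own array means an aggregate over one column never has to touch the others.

```python
# Row store: one record per event, as an OLTP database would hold it.
rows = [
    {"session_id": "a", "duration_ms": 120},
    {"session_id": "b", "duration_ms": 340},
    {"session_id": "a", "duration_ms": 85},
]

# Transpose into columns, as a columnar engine lays data out on disk.
columns = {
    "session_id": [r["session_id"] for r in rows],
    "duration_ms": [r["duration_ms"] for r in rows],
}

# An aggregate like SUM(duration_ms) only reads one column's data,
# instead of decoding every full row. With millions of rows (and
# per-column compression) that difference dominates query time.
total = sum(columns["duration_ms"])
print(total)  # 545
```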
Unsmart · 7mo ago
"SQL" is not really a database fwiw, it's just a query language. You do use SQL for ClickHouse
Paul · 7mo ago
ah of course
Chaika · 7mo ago
Yea, ClickHouse supports the Postgres wire protocol as well, an HTTP interface, and a few other ones
Paul · 7mo ago
This is wild. I think I'll need to set up multiple partition keys: I've got bucket_id for different tenants, visitor_id for different people, and session_id for different sessions, all within the "events" table, which I'd like to grow to infinity
Chaika · 7mo ago
One of the big advantages of ClickHouse is materialized views: you can create a view that aggregates by time/tenant/etc. The magic is that unlike normal views, which are put together when you query them, materialized views update as rows are inserted into the table they are made on (like a trigger would), so they're cheap to query
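A minimal Python sketch of the idea behind materialized views (illustrative only, with hypothetical field names like `bucket_id`): the aggregate is maintained at insert time, so reading it back never re-scans the base data.

```python
from collections import defaultdict

events = []                          # the base "events" table
daily_by_tenant = defaultdict(int)   # the "materialized view": pre-aggregated counts

def insert(event):
    """Insert into the base table; the view updates as a side effect,
    like a trigger firing, so queries against it are just lookups."""
    events.append(event)
    daily_by_tenant[(event["day"], event["bucket_id"])] += 1

for day, bucket in [("2024-01-01", "t1"), ("2024-01-01", "t1"),
                    ("2024-01-01", "t2"), ("2024-01-02", "t1")]:
    insert({"day": day, "bucket_id": bucket})

# Reading the view is a dict lookup, not a scan over all events.
print(daily_by_tenant[("2024-01-01", "t1")])  # 2
```

In ClickHouse itself this would be a `CREATE MATERIALIZED VIEW` backed by an aggregating table engine; the point here is only the insert-time-vs-query-time trade.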
Paul · 7mo ago
Ok woah, so they are like indexes... but views? I wonder what happens if you modify a row that was already materialized, does the entire materialized view get rebuilt?
Chaika · 7mo ago
Eh, I think it's easier to just think of them as views where you do more work on insert instead of on query. ClickHouse isn't very modification- or deletion-friendly. If possible, avoid doing that entirely
Paul · 7mo ago
That makes sense, events should be append-only anyways, so I'm good with that. This seems like just what I needed, so glad
Chaika · 7mo ago
Yea, you can't do updates or deletes normally in ClickHouse either (although iirc you can enable that now if you wanted to); they're called mutations: https://clickhouse.com/docs/en/sql-reference/statements/alter#mutations
ALTER | ClickHouse Docs
Most ALTER TABLE queries modify table settings or data:
Chaika · 7mo ago
One of the things that makes ClickHouse fast is that when you define a partition key (let's say, by month), it defines how the data is stored into parts (single files) on the disk. When you insert data, it's super quick because it just writes a new part. In the background, ClickHouse merges all of these parts (well, any that can be merged, because they're in the same partition). Deleting/updating/mutating data thus requires rewriting the part and is expensive
Paul · 7mo ago
So there would be no need to have separate tables per tenant_id, I can simply define a partition key and let ClickHouse do its magic?
Chaika · 7mo ago
I think that is generally not recommended, as you really want to limit those part counts; instead maybe use materialized views for that: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/custom-partitioning-key
In most cases you do not need a partition key, and in most other cases you do not need a partition key more granular than by months. You should never use too granular of partitioning. Don't partition your data by client identifiers or names. Instead, make a client identifier or name the first column in the ORDER BY expression.
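The docs' advice to sort rather than partition by tenant is easy to model in Python (an illustrative sketch, using the thread's `bucket_id` as the tenant column): with rows sorted by tenant first, one tenant's events form a contiguous range, so a query can binary-search to them instead of scanning everything. This is roughly the benefit of putting the tenant identifier first in the ORDER BY key.

```python
import bisect

# Rows kept sorted by (bucket_id, ts), the analogue of a MergeTree
# table with ORDER BY (bucket_id, ts).
rows = sorted([
    ("t2", "2024-01-03"), ("t1", "2024-01-01"),
    ("t1", "2024-02-07"), ("t3", "2024-01-05"),
])

def tenant_slice(bucket_id):
    """Because rows are sorted by tenant first, one tenant's events
    are contiguous: two binary searches bound the range to read."""
    keys = [r[0] for r in rows]
    lo = bisect.bisect_left(keys, bucket_id)
    hi = bisect.bisect_right(keys, bucket_id)
    return rows[lo:hi]

print(tenant_slice("t1"))  # [('t1', '2024-01-01'), ('t1', '2024-02-07')]
```

ClickHouse's sparse primary index works on the same principle at part granularity, which is why one sorted table scales to many tenants without per-tenant tables or partitions.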
Paul · 7mo ago
oh wow, so good
Chaika · 7mo ago
There's a lot of content out there about ClickHouse; this is a good watch: https://www.youtube.com/watch?v=6WICfakG84c I explored it a fair bit a while ago and I'm sure I've forgotten quite a few things by now lol
Altinity
YouTube
Secrets of ClickHouse Query Performance
Here's the link to the updated webinar: https://youtu.be/1TGGCIr6dMY #ClickHouse is famous for speed. That said, you can almost always make it faster! In this webinar (September 2019), Robert Hodges and Altinity Engineering Team use examples to teach you how to deduce what queries are actually doing by reading the system log and system tables. ...
Paul · 7mo ago
How have I not heard of this? Is this a fairly new product? Cloudflare is using this for all of their analytics?
Hello, I’m Allie!
Maybe because it's a comparatively niche product? Might also be because it was originally built by Yandex
Chaika · 7mo ago
It's been around for a bit. Cloudflare uses it for all the major analytics (dns, http, radar, etc), they have a bunch of blog posts about it, ex: https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/
The Cloudflare Blog
HTTP Analytics for 6M requests per second using ClickHouse
One of our large scale data infrastructure challenges here at Cloudflare is around providing HTTP traffic analytics to our customers. HTTP Analytics is available to all our customers via two options:
Paul · 7mo ago
Excellent, time to dive in. Thank you!