Hi!
Let’s say I need 3-4 streams (given the current limit of 5 MB/s per stream) to handle spikes in load:
- Is it possible to direct the corresponding pipelines to a single sink?
- If the sink is an R2 bucket, can I ensure, using custom partitioning, that files are written in lexical order? (I had issues with the legacy Pipelines where files would be created in R2 out of order.)
Use case: event ingestion using Cloudflare Pipelines -> R2 -> ClickHouse ClickPipes S3 integration with continuous ingest (which requires lexical ordering of files).
Okay, so for the first bullet: I managed to direct 2 streams to 1 pipeline using SQL like:
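(The query itself isn't shown above; as a minimal sketch, assuming two hypothetical streams named events_a and events_b and a hypothetical sink named r2_sink, a union could look roughly like this. Check the Pipelines SQL reference for the exact syntax supported.)
```sql
-- Sketch only: stream and sink names are hypothetical placeholders.
-- Merge events from two streams and write them to a single sink.
INSERT INTO r2_sink
SELECT * FROM events_a
UNION ALL
SELECT * FROM events_b;
```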
On the ordering question: I think the old version had that issue because of the way it sharded partitions, with no coordination between them when writing to R2. We're now using something completely different to write to R2 (based on Arroyo), so this shouldn't be the case, but @Micah | Data Platform or @cole | pipelines can confirm for me.
It is possible to union multiple streams, but we can also increase limits for your stream so that's not necessary — DM me your account and stream id and we can discuss
We write files by default with ULID names (https://github.com/ulid/spec), which are lexicographically sorted. We support custom partitioning by date/time fields (see https://developers.cloudflare.com/pipelines/sinks/available-sinks/r2/#partitioning), and in general there will be a single writer, so files will always be written in order.
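(Why that holds: the first 10 characters of a ULID encode its creation time as a 48-bit millisecond timestamp in Crockford base32, so a lexical sort of the names is also a time sort. A purely illustrative listing with made-up names and an assumed .json extension:)
```
01JXF3V8K2ZJ9QWTB4N6RD5HCM.json   <- written first, sorts first
01JXF3V9T7M2XEQ8KD4NWZ0PBS.json   <- written later, sorts after it
```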
Hi @Micah Wylde, I realise just now that I missed your last message, thanks for the explanation!
I was wondering what kind of limits we could expect, in terms of MB/s per stream, once the beta is over?
We currently have a baseline of 10-15 MB/s with spikes at 30-40 MB/s sent to AWS Kinesis straight from Cloudflare Workers.
I am guessing there also has to be a throughput limit at the sink level, where at some point several sinks have to be used?
For GA, we're targeting 1 GB/s per stream. Today we can up your streams to 50 MB/s.
Also—the ingest limits are computed over an hour, so short spikes above the limit aren't a problem.
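(To make that concrete with assumed numbers, treating the hourly budget as the per-second limit times 3600: a 20 MB/s limit gives roughly 72 GB per hour, so 55 minutes at a 15 MB/s baseline (~49.5 GB) plus a 5-minute spike at 40 MB/s (12 GB) comes to about 61.5 GB and still fits within the hour.)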
There are no throughput limits for sinks (although in practice it's somewhat limited by how much data you can get in)
Hi @Micah Wylde, I have DM'd you the stream ID I am going to use to test, if it's possible to up it.
Sure, I'll up that stream to 20 MB/s
Thanks!
Hi @Micah | Data Platform,
do you know how to enable the ULID format? Because on my side the names look like UUIDs, so files are not sorted by time.
I am using %F/%H%M%S partitioning, so in my example those 2 files belonged to the same "second" partition despite containing relatively few events 🤔
Hi Stephane, we actually just changed the default format from UUID v4 to UUID v7, which has the same time-ordering property as ULID. If you want to specifically use ULID, you can specify the format via the file_naming.strategy option on the sink config (supported options are serial, uuid, uuid_v7, and ulid).
Oh that's even better, thanks!
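(A purely illustrative comparison of what that difference looks like under the %F/%H%M%S partitioning above, with made-up object names and an assumed .json extension:)
```
# file_naming.strategy = uuid (v4): lexical listing order is unrelated to write order
2025-06-12/143000/7f3e2d1c-0b9a-4f8e-9c7d-6a5b4c3d2e1f.json   <- written second
2025-06-12/143000/c0a1f2d3-9b4e-4c7a-8d2f-1e5a6b7c8d9e.json   <- written first

# file_naming.strategy = ulid (or the uuid_v7 default): lexical order matches write order
2025-06-12/143000/01JXF3V8K2ZJ9QWTB4N6RD5HCM.json   <- written first
2025-06-12/143000/01JXF3V9T7M2XEQ8KD4NWZ0PBS.json   <- written second
```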