I believe my pipeline (73bd2c7436274b76ab94148aa617dccb) is dropping some events. I have a worker that writes the same event to this pipeline and to Workers Analytics. I expected analytics to be the less reliable path because it uses sampling, yet I see all the events in the Analytics dataset while some are missing from the sink. I also tried sending events directly via the HTTP endpoint and that's not working either. I think end-to-end traceability during the beta period would be useful.
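For reference, a minimal sketch of the dual-write setup described above. The binding names (PIPELINE, ANALYTICS) and the event shape are assumptions for illustration, not details taken from the thread:

```ts
// Worker that writes the same event to a Pipeline and to Workers Analytics Engine.
interface Env {
  PIPELINE: { send(records: object[]): Promise<void> };
  ANALYTICS: {
    writeDataPoint(point: {
      blobs?: string[];
      doubles?: number[];
      indexes?: string[];
    }): void;
  };
}

export default {
  async fetch(_request: Request, env: Env): Promise<Response> {
    const event = { userId: "abc123", action: "login", ts: Date.now() };

    // Analytics Engine write (sampled, fire-and-forget).
    env.ANALYTICS.writeDataPoint({
      blobs: [event.userId, event.action],
      doubles: [event.ts],
      indexes: [event.userId],
    });

    // Pipelines worker binding takes an array of JSON-serializable records.
    await env.PIPELINE.send([event]);

    return new Response("ok");
  },
};
```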
5 Replies
Micah Wylde · 2mo ago
Do you have a sense of how many events you're missing? In our metrics, I see that 4 events were dropped due to schema issues in that pipeline. When you send events, are you ensuring that you get a 200 back from the ingest endpoint (for HTTP), or that the worker binding call returns successfully? We consider durability (the fraction of events acknowledged by the stream that eventually end up in R2) to be the most important property of the product, and any data loss is unacceptable.
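A minimal sketch of what checking for acknowledgement on the sending side could look like; the ingest URL is a placeholder and the PIPELINE binding name is an assumption:

```ts
// Placeholder URL for the pipeline's HTTP ingest endpoint (assumption).
const INGEST_URL = "https://<your-pipeline-http-endpoint>";

// HTTP path: only a 2xx response means the stream accepted the batch.
async function sendViaHttp(events: object[]): Promise<void> {
  const res = await fetch(INGEST_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(events),
  });
  if (!res.ok) {
    throw new Error(`ingest rejected batch: ${res.status} ${await res.text()}`);
  }
}

// Worker-binding path: a rejected promise means the batch was not acknowledged.
async function sendViaBinding(
  env: { PIPELINE: { send(records: object[]): Promise<void> } },
  events: object[],
): Promise<void> {
  try {
    await env.PIPELINE.send(events);
  } catch (err) {
    console.error("pipeline send failed", err);
    throw err;
  }
}
```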
Jkl (OP) · 2mo ago
@Micah | Data Platform Sorry for the delayed response. Yes, just 4 from that day, but I tried some more the next day and 5 were dropped. Nothing is returned. I have a hunch about what could be failing schema validation; I will try with a different schema this weekend. The docs say: "For structured streams, ensure your events match the schema definition. Invalid events will be accepted but dropped, so validate your data before sending to avoid dropped events." A standalone validator function would be great, as would the ability to turn on debug-level observability so that each message generates a log entry at each stage from stream to sink.
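For illustration, a hypothetical standalone validator along the lines Jkl asks for; the schema shape and field names are made up and this is not a Pipelines API:

```ts
type FieldType = "string" | "number" | "boolean";

// Check an event against the expected field types before sending, so schema
// mismatches surface in the worker instead of as silent drops at the sink.
function validateEvent(
  event: Record<string, unknown>,
  schema: Record<string, FieldType>,
): string[] {
  const errors: string[] = [];
  for (const [field, expected] of Object.entries(schema)) {
    if (!(field in event)) {
      errors.push(`missing field "${field}"`);
    } else if (typeof event[field] !== expected) {
      errors.push(`field "${field}" is ${typeof event[field]}, expected ${expected}`);
    }
  }
  return errors;
}

// Example: catches the string/boolean mismatch found later in the thread.
validateEvent({ userId: "abc123", consent: true }, { userId: "string", consent: "string" });
// => ['field "consent" is boolean, expected string']
```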
Micah Wylde · 2mo ago
We know this is a huge pain point right now, and are working on improvements! I'm also happy to help debug in the meantime if you want to send me a sample event.
Jkl (OP) · 5w ago
I figured out what was happening in this case. One value was expected to be a string but I was sometimes getting a boolean (true). IMO the primary use cases for Pipelines would be audit and analytics, and for both it's preferable to coerce values into their expected types when possible rather than drop the records. That's what I'm doing now field by field, but it would be a good built-in feature. Maybe a flag I can pass.
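As a sketch of that field-by-field approach, with illustrative field names (the schema mapping and coercion rules here are assumptions):

```ts
type FieldType = "string" | "number" | "boolean";

// Coerce values to their expected types where a safe conversion exists,
// rather than letting the record be dropped by schema validation.
function coerceEvent(
  event: Record<string, unknown>,
  schema: Record<string, FieldType>,
): Record<string, unknown> {
  const out: Record<string, unknown> = { ...event };
  for (const [field, expected] of Object.entries(schema)) {
    const value = out[field];
    if (value === undefined || typeof value === expected) continue;
    if (expected === "string") {
      out[field] = String(value); // e.g. true -> "true"
    } else if (expected === "number" && !Number.isNaN(Number(value))) {
      out[field] = Number(value);
    } else if (expected === "boolean") {
      out[field] = value === "true" || value === 1;
    }
  }
  return out;
}

// The case from this thread: a boolean arriving where the schema expects a string.
coerceEvent({ consent: true }, { consent: "string" }); // => { consent: "true" }
```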
Micah Wylde · 5w ago
That's good feedback. We can definitely look at supporting looser type coercion rules for our JSON parsing.
