opentelemetry_ash not ready for use or not properly configured?
Hi!
We're struggling with some performance issues in prod and can see gaps between the top-level spans (typically LiveView event handlers) and the DB queries. So there's a suspicion that something might be going slow in building queries and/or processing results.
So we've tried enabling OpenTelemetry with the opentelemetry_ash package (0.1.0). But it doesn't seem to hook correctly into the parent spans; the result looks kind of useless and hard to navigate. We use SigNoz to visualise the spans; it looks like the screenshot.
Here's the config:
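In essence it's the standard wiring, roughly like this (simplified sketch; app/repo names are placeholders, and OpentelemetryAsh as the tracer module is assumed):

```elixir
# application.ex - attach the instrumentation at app start
# (app/repo names here are placeholders)
OpentelemetryLiveView.setup()
OpentelemetryEcto.setup([:my_app, :repo])

# config.exs - tell Ash to report to the OpenTelemetry tracer
# (assuming the tracer module shipped by opentelemetry_ash is OpentelemetryAsh)
config :ash, tracer: [OpentelemetryAsh]
```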
Is there something wrong with our config or with opentelemetry_ash?
Hm... that does look quite strange. Would you be able to contribute a test to opentelemetry_ash that starts a non-Ash span, then calls an action, and asserts on the relationship between the output spans? It's a pretty simple package, and I've moved on from the app where I was using it in anger, so maybe something has changed/broken.
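Something along these lines would do it, I think (a rough sketch: the resource/action is hypothetical, and the exporter plumbing assumes the usual :otel_simple_processor plus pid-exporter test setup, with the Ash tracer configured in test config):

```elixir
defmodule OpentelemetryAshParentSpanTest do
  use ExUnit.Case, async: false

  require OpenTelemetry.Tracer, as: Tracer
  require Record

  # Pull in the :span record so we can pattern-match on exported spans
  @span_fields Record.extract(:span, from_lib: "opentelemetry/include/otel_span.hrl")
  Record.defrecordp(:span, @span_fields)

  setup do
    # Assumes the test config runs the simple processor and that
    # `config :ash, tracer: [...]` is set; finished spans are delivered
    # to the test process as {:span, span_record} messages
    :otel_simple_processor.set_exporter(:otel_exporter_pid, self())
    :ok
  end

  test "ash spans are children of the surrounding span" do
    Tracer.with_span "outer" do
      # Hypothetical resource/action under test
      MyApp.Post
      |> Ash.Changeset.for_create(:create, %{title: "hi"})
      |> Ash.create!()
    end

    assert_receive {:span, span(name: "outer", span_id: outer_id)}
    # at least one Ash-produced span should point back at the outer span
    assert_receive {:span, span(parent_span_id: ^outer_id)}
  end
end
```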
I would love to, just have to find the time. The first priority now is to fix our performance issues, but since there's a suspicion of Ash being involved, I might need to have some Ash telemetry up and running.
It seems like you're getting telemetry there despite it somehow missing a parent span
What can you tell me about your perf issue?
Oh, I could say a lot :). We still don't know what exactly is causing the issues, and they're also "flaky", as in sometimes it's worse than others. We're hosted on Fly, and sometimes a region change or a restart solves everything (for some time).
But the longest telemetry spans have huge gaps between Ecto traces, where no application code is executed.
This one, for example (see screenshot), is produced by this event handler in the LV:
The form submit produces the left-most DB queries; the state_module.start is probably what produces the right-most. The action is a create with one after_action hook that inserts a related record (can't remember why manage_relationship didn't work in that case).
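The shape of the action is roughly this (a sketch; field and module names are invented):

```elixir
create :create do
  accept [:name]

  change fn changeset, _context ->
    Ash.Changeset.after_action(changeset, fn _changeset, record ->
      # insert the related record "by hand", since manage_relationship
      # didn't fit this case
      {:ok, _related} =
        MyApp.Flow.RelatedThing
        |> Ash.Changeset.for_create(:create, %{parent_id: record.id})
        |> Ash.create()

      {:ok, record}
    end)
  end
end
```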
Hmmm, interesting. I've just tried (locally) getting the opentelemetry_ash library from GitHub instead of 0.1.0, and it seems to produce beautiful traces, all ordered correctly.
So perhaps the latest published version is not the greatest? 😅
Well, there are specifically changes to parent span tracking
So yes that could be it
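(For anyone following along: pointing the dependency at GitHub instead of the Hex release is just a mix.exs change; the repo path below is assumed.)

```elixir
# mix.exs
defp deps do
  [
    # instead of {:opentelemetry_ash, "~> 0.1.0"}
    {:opentelemetry_ash, github: "ash-project/opentelemetry_ash"}
  ]
end
```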
Something seems very strange there
Ash doesn't do that much outside of ecto
It runs your hooks etc.
Are there hooks or anything like that on the action?
sorry, you already said so
I'm going to update the tests to do some better assertions on the span relationships
Also FWIW, opentelemetry_ash is like 100 LoC, so you could adapt it to your own needs or fix it pretty easily if you're already proficient with otel.
One of the current suspicions is that we heavily rely on PubSub to keep the LVs up to date. And pretty much every Ash action broadcasts to PubSub, so it might get noisy. But not all the handlers are very efficient at picking those up; some just do a "reload all". So we may be seeing the message boxes of LVs getting backed up. This could explain some of the symptoms, but not everything. It doesn't explain, at least not to me, why the kind of code posted above would take 15 sec to finish...
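To illustrate the difference between the two handler styles, a rough sketch (module, assign, and helper names like replace_by_id/2 are invented, and the payload shape depends on how the notifications are broadcast):

```elixir
# "reload all": every broadcast triggers a full re-query, so a burst of
# notifications piles a burst of DB round-trips into the LV mailbox
def handle_info(%Phoenix.Socket.Broadcast{topic: "document:" <> _}, socket) do
  {:noreply, assign(socket, :documents, Documents.list_documents!(socket.assigns.current_user))}
end

# targeted update: only touch the record that changed
def handle_info(%Phoenix.Socket.Broadcast{event: "updated", payload: %{data: doc}}, socket) do
  {:noreply, update(socket, :documents, &replace_by_id(&1, doc))}
end
```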
🤔 that could be related, yes
But I agree with you on why it would take 15s
How are you tracing the actual event handler itself?
one other question:
the perf issue is something you've observed directly, right? Not just via, like, otel metrics?
The perf issue has been observed by our customers, internal users, and also personally. However, mileage may vary per user (it might depend on which machine you land on; we're running a cluster of 4 machines).
😅 just wanted to rule out some problem w/ the observability tooling itself
LV event handlers are simply traced by OpentelemetryLiveView
Is it localized to this specific event handler?
Or are you seeing this in various places?
Are there any related usage metrics spiking? high mem/high cpu?
Unfortunately, it's all over the place: page mounts/handle_params, all kinds of event handlers. And this does not correlate with any high RAM or CPU usage 😦
🤔 how interesting
Good and bad news potentially though
how about database load?
Hm, I just saw flow:version:create. Are you using Ash.Flow?
Database load is also pretty calm: CPU hovering around 50%, lots of memory, no long-running queries, enough connections.
> are you using Ash.Flow?
No, one of our domains is called "Flow" 😄
okay, good 😅
I was sweating for a sec
since that was deprecated years ago haha
So it being related to notifications sounds very possible, given that we see the after action hook ending but the action not returning
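(For context, "broadcasts to PubSub" presumably means the standard Ash.Notifier.PubSub wiring on the resources, roughly like this; names are invented:)

```elixir
defmodule MyApp.Flow.Document do
  use Ash.Resource,
    domain: MyApp.Flow,
    data_layer: AshPostgres.DataLayer,
    notifiers: [Ash.Notifier.PubSub]

  pub_sub do
    module MyAppWeb.Endpoint
    prefix "document"

    # every successful action ends up broadcasting to subscribers
    publish :create, ["created"]
    publish :update, ["updated", :id]
  end

  # ...attributes, actions, etc.
end
```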
Do you have this reproduced anywhere other than production?
Nope, only production
🤔 you've mentioned a restart sometimes helps
Which implies something is "building up"
Have you seen things like the effects of pubsub "slowing down"?
like live-reactivity feeling very slow?
Are you a very high traffic app? How much of your UI is pubsub driving? What if you just disabled pubsub? Would things all come crashing down?
Disabling PubSub would most probably break quite a few bits of UI; some updates happen deep in the component tree and are implemented as fire-and-let-the-parent-LV-handle-it. So it would require some UI work.
There's a lot of reactivity - everything is live (multiple users can watch the same "document" at the same time), although it's not really a multiplayer-editor experience; there's only one active user.
There are hundreds of LVs open during business hours, so not extreme (B2B).
And when the slowdowns happen then people start complaining that "everything is slow". From what we've gathered and seen ourselves, it's really a lot of things - page loads during navigation & live-navigation, LV events that result in Ash actions and DB state updates.
Are there any paths that are slow that don't include Ash?
That you've seen, at least?
"when the slowdowns happen", so normal state is "things are fine and snappy"?
And then something starts happening and things get slow? Does it return to normal at some point?
I've just pushed something up to main for opentelemetry_ash. The testing exporter doesn't show me parent span ids, nor does the stdout one, as far as I can tell? This tooling is not always the easiest to work with. But there is a place where I can provide the parent span that I wasn't.
But you could try that out locally to see if it resolves your orphan span issues at least
Are you pretty recent on your Ash version?
I could add some additional tracing that could tell you more about what is going on
For instance, I could add tracing for notifiers
in fact, I want that for myself 😄
> Does it return to normal at some point?
It's still slow (compared to what it should be); we have some UI actions taking seconds instead of ms. But once it gets really bad, I don't remember it coming back to "relatively normal" by itself, only a hard reset/redeploy. We've been trying many things; the experiments haven't always been very scientific.
Ash 3.4
I'm familiar with the process of trying to balance "make our customers not lose their shit" and fixing the issue
😅
I think our best bet is to just increase the surface area of the tracing in Ash
If you can't tell if Ash is the problem, then we aren't emitting enough tracing info
I'll do this now, but to use it you'll need to update to the latest
not sure if that's on the table or not
I'll probably be able to see more tracing tomorrow (based on what's on main of opentelemetry_ash).
I unfortunately could not verify that it does what it's supposed to
you'll want to check locally
I don't have a deployed app using otel, and it's probably just me forgetting how it works, but the exporters in test don't say what their parent span is?
Just getting huge tuples; guessing what's what isn't easy, and the otel docs aren't that helpful
Question
what's your logging situation in production?
We've had a lot of customers hit weird perf issues, often caused by backing up against their logger
I found something that works for me locally: install SigNoz locally and use it to visualise OpenTelemetry (all from localhost)
We log in JSON, but just to stdout IIRC; the logs are then picked up by Fly
What's your prod log level?
Have you tuned the logger settings at all?
I've shipped a new notifier trace type to main
prod.exs:
I'd look at the message queue size of your logger process
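Something like this from a remote console shows it (assuming the classic Logger setup where the server process is registered as Logger; Erlang :logger based setups name their handler processes differently):

```elixir
# check how far behind the logger is on the affected node
case Process.whereis(Logger) do
  nil -> :no_process_registered_as_logger
  pid -> Process.info(pid, :message_queue_len)
end
#=> {:message_queue_len, 0}  # a persistently large number means logging is backed up
```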
thanks for all the questions and help, by the way!
you can try increasing your sync threshold
If you're logging like a whole lot
once you have 50 logs waiting to be logged, everything starts waiting on the logger
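With the classic Logger those are the :sync_threshold / :discard_threshold settings; on recent Elixir versions the equivalent knobs live in the Erlang :logger handler config instead. A sketch with illustrative values:

```elixir
# config/prod.exs (classic Logger; values are illustrative, not recommendations)
config :logger,
  level: :info,
  # switch callers to synchronous logging only once this many messages
  # are already queued (raise it if bursts are expected)
  sync_threshold: 200,
  # above this, messages are dropped instead of queued
  discard_threshold: 2_000
```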