Pagination 'count' is not accurate if the SQL query result contains duplicate ids

The pagination count SQL query is using DISTINCT like this:
SELECT coalesce(count(DISTINCT t0."id"::bigint), $1::bigint)::bigint FROM (...) ...
If the result set returned by the subquery has duplicate ids, the count doesn't reflect the actual number of rows the subquery returns, because DISTINCT removes the duplicates from the count. Is there any way to remove DISTINCT from the pagination count query?
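To make the mismatch concrete, here is a minimal sketch in plain Elixir (no SQL, illustrative data only) showing the two counts diverging once an id repeats:

```elixir
# Rows as a fanned-out join might return them: episode 1 appears twice (a rerun).
rows = [%{id: 1}, %{id: 1}, %{id: 2}]

# What a count(DISTINCT id) computes:
distinct_count = rows |> Enum.uniq_by(& &1.id) |> length()

# What the subquery actually returns:
actual_count = length(rows)

IO.inspect({distinct_count, actual_count})
# => {2, 3}, so the pagination count and the page contents disagree
```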
62 Replies
ZachDaniel · 3y ago
🤔 what query are you running? I feel like it might be a bug that it is using DISTINCT there
l00ker (OP) · 3y ago
Yeah. IMO it shouldn't do that.
ZachDaniel · 3y ago
But at the same time, I feel like DISTINCT shouldn't affect the count, given that "id" is almost certainly unique. Maybe I need to see the whole query to see what you mean, though.
l00ker (OP) · 3y ago
Well in my case there are duplicates. Don't know if you remember the published TV episodes from another support question. In that case, there are reruns of episodes, therefore it will have duplicate ids.
ZachDaniel · 3y ago
Can I see the whole action/query that you are running? Just want to wrap my head around it
l00ker (OP) · 3y ago
...
read :published do
  prepare build(
    filter: expr(published),
    sort: [published_air_date: :desc],
    load: [:thumb_url]
  )

  pagination do
    offset? true
    default_limit 10
    countable :by_default
  end
end
...
aggregates do
  first :published_air_date, :air_dates, :start_datetime do
    sort start_datetime: :desc
    filter expr(status == :published)
  end

  first :this_weeks_air_date, :air_dates, :start_datetime do
    sort start_datetime: :desc
    filter expr(status in [:published, :scheduled])
  end
end
...
calculations do
  calculate :published, :boolean, expr(not is_nil(published_air_date))
end
Those are the relevant parts.
ZachDaniel · 3y ago
It would be strange for the id to end up duplicated in the result set. Honestly, I think that might be the bug. Well, and then the DISTINCT would be unnecessary. But what you should get back is one row per thing. Can I see the SQL generated for both? The count and the regular query?
l00ker (OP) · 3y ago
[debug] QUERY OK source="tv_episodes" db=453.1ms idle=721.0ms
SELECT coalesce(count(DISTINCT t0."id"::bigint), $1::bigint)::bigint FROM "tv_episodes" AS t0 LEFT OUTER JOIN LATERAL (SELECT (array_agg(st0."start_datetime"::timestamp ORDER BY st0."start_datetime" DESC ))[1]::timestamp AS "published_air_date", st0."episode_id" AS "episode_id" FROM "public"."tv_episodes_air_dates" AS st0 WHERE (st0."status"::varchar = $2::varchar) AND (t0."id" = st0."episode_id") GROUP BY st0."episode_id") AS s1 ON TRUE WHERE (NOT ((s1."published_air_date"::timestamp IS NULL) = $3::boolean)::boolean::boolean) [0, :published, true]
[debug] QUERY OK source="tv_episodes" db=459.4ms queue=0.1ms idle=857.9ms
SELECT t0."id", t0."title", t0."brief_desc", t0."text", t0."thumb", t0."video_file", t0."video_poster", t0."date_added", t0."first_date", t0."member_level", t0."cat_id", t0."is_featured", t0."times_viewed", s1."published_air_date"::timestamp::timestamp, ($1::varchar::varchar || ($2::varchar::varchar || t0."thumb"::text::varchar)::varchar)::text, ($3::varchar::varchar || ($4::varchar::varchar || t0."video_file"::text::varchar)::varchar)::text, ($5::varchar::varchar || ($6::varchar::varchar || t0."video_poster"::text::varchar)::varchar)::text FROM "tv_episodes" AS t0 LEFT OUTER JOIN LATERAL (SELECT (array_agg(st0."start_datetime"::timestamp ORDER BY st0."start_datetime" DESC ))[1]::timestamp AS "published_air_date", st0."episode_id" AS "episode_id" FROM "public"."tv_episodes_air_dates" AS st0 WHERE (st0."status"::varchar = $7::varchar) AND (t0."id" = st0."episode_id") GROUP BY st0."episode_id") AS s1 ON TRUE WHERE (NOT ((s1."published_air_date"::timestamp IS NULL) = $8::boolean)::boolean::boolean) ORDER BY s1."published_air_date" DESC LIMIT $9 ["https://cdn1.wx3media.net", "/wog/", "https://cdn1.wx3media.net", "/wog-videos/", "https://cdn1.wx3media.net", "/wog/", :published, true, 11]
ZachDaniel · 3y ago
I see...
l00ker (OP) · 3y ago
I did notice that you removed all the calculation parts from the count query in one of the last few updates.
ZachDaniel · 3y ago
and tv_episodes has_many published_air_dates right?
l00ker (OP) · 3y ago
Yes
ZachDaniel · 3y ago
I feel like the query is likely producing duplicate tv_episodes which is problematic
l00ker (OP) · 3y ago
No... it's exactly what I want
ZachDaniel · 3y ago
Well... that's not how it's supposed to work. It should only be giving you one TV episode, sorted by its latest air date. Well, one TV episode per latest air date.
l00ker (OP) · 3y ago
It will, because of reruns of the episodes on different dates.
ZachDaniel · 3y ago
Actually, I'm a bit confused. GROUP BY st0."episode_id" ought to prevent it from duplicating any tv_episodes in this process. Can you check the actual results of the query, and confirm that Enum.count(results, &(&1.id)) is equal to the count of unique ids? i.e.
ids = Enum.map(results, &(&1.id))
IO.inspect(Enum.count(ids))
IO.inspect(Enum.count(Enum.uniq(ids)))
l00ker (OP) · 3y ago
Okay. You're correct. The count does match. Let me tell you what put me onto this then. Go look at the query times in the [debug] - db=453.1ms & db=459.4ms
ZachDaniel · 3y ago
🤔 yeah, that is not ideal How big is the data set?
l00ker (OP) · 3y ago
I was looking for a way to speed that up, so I tried modify_query to use my normal Ecto join from the original app, which runs in like 30ms. But the count was off. So I ran the SQL query in psql and removed the DISTINCT to get the right count.
ZachDaniel · 3y ago
Ah, yeah, that makes sense. Because your join is basically just an inner join, right? Your join would produce duplicates, I mean. Either way, I have plans to add some options to optimize those kinds of queries (and have already put some of them into place), but there may also be some ways to optimize it with what we have now.
l00ker (OP) · 3y ago
If I run the Ash queries in psql and use EXPLAIN ANALYZE, 95% of the time is in the array_agg. It has to loop over the table 2x.
ZachDaniel · 3y ago
Yeah, that tracks. I think the array_agg should not be necessary most of the time, but there are configurations whereby it is necessary for correctness, and what I basically need to add is ways to tell it that it's unnecessary. I've optimized it out for cases where the relationships along the aggregate path are all belongs_to (because we can safely assume that each record can't possibly be related to multiple rows; otherwise tons of things would break).
l00ker (OP) · 3y ago
Right now there are only 2106 rows there and it's taking almost 500ms each. If that table grows to, say, 10_000 rows it will be insane.
ZachDaniel · 3y ago
Try this:
calculate :published, :boolean, expr(exists(published_air_dates, status == :published and not is_nil(start_datetime)))
l00ker (OP) · 3y ago
The modify_query escape hatch was giving me exactly what I needed, and I wanted to have Ash manage the pagination, but the count was off.
ZachDaniel · 3y ago
well, I think the modify_query was giving you duplicate rows
l00ker (OP) · 3y ago
Yeah... it is
ZachDaniel · 3y ago
So if you add a distinct or something to your query or do something to prevent the join from expanding
l00ker (OP) · 3y ago
I want the duplicate rows
ZachDaniel · 3y ago
😢 It's part of the design of resources that they aren't expected to return duplicate records. Like, if that was done over an API or any other interface it would be very strange. Why do you want the duplicates? Not saying you don't have a valid reason, just trying to understand what you want to do.
l00ker (OP) · 3y ago
The other way out was to just do the query from the AirDates resource. That does duplicate and avoids the aggregate.
ZachDaniel · 3y ago
what does the duplication of results help you do?
l00ker (OP) · 3y ago
The list is in descending order by air date, so as a viewer pages back in time the episodes will continue to be displayed in the proper order that they 'aired'. If they are removed from the list query, there are gaps.
ZachDaniel · 3y ago
but you want them to see the same tv episode multiple times in that list
l00ker (OP) · 3y ago
In that case, yes
ZachDaniel · 3y ago
Gotcha. I'll have to think about this, because I can see the use case, but querying the air dates filtered by them having been published is probably your best bet for now. The hooks for telling pagination not to do DISTINCT aren't there at the moment. And there are things that actively expect not to find duplicates in the query response (like when we zip up any loaded data, we expect the first record w/ a matching id is the one we got back from the query).
l00ker (OP) · 3y ago
And back to the query times for a moment. If used in a LiveView, those queries are run 2x. Once for the initial page load and again in the mount.
ZachDaniel · 3y ago
Yeah, I agree, those query times are no good. The basic problem is simply one of optimization on our end. I.e. when a first aggregate is used in a filter, it shouldn't be done in the same way as it's done to load the value. In a filter, we should just join and filter. Well... potentially, because we have to ensure we're only considering the last one according to the sort. But the exists route would be significantly faster for you, I imagine.
l00ker (OP) · 3y ago
I'm sure there are many things to be considered, for sure. I figured there were reasons that DISTINCT was used. That's why I was wondering if there was a way to remove it that I hadn't found.
ZachDaniel · 3y ago
Lemme take a look at why it's even doing that. But even if I fix it, I'd say to make sure you add some tests around the duplication behavior at a minimum, because that isn't really part of the "things we expect you to do with resources" and so could break in the future.
l00ker (OP) · 3y ago
I see that you've been busy optimizing things, but in the meantime that's why I decided to try the modify_query.
ZachDaniel · 3y ago
Yeah, I think there are also some improvements we could make to manual actions here, because I don't think manual actions can return pages/get pagination options. But that would be the best way for you to go about doing what you want to do, I think.
l00ker (OP) · 3y ago
But the count being off was a no go there.
ZachDaniel · 3y ago
manual fn query, ecto_query, _ ->
  # do what you want here, just return records of this resource
end
If you had a way to interact with pagination there, or to do the pagination work yourself, then you could do w/e you want and the duplicates thing would be fine(ish, again, might break some day)
l00ker (OP) · 3y ago
I couldn't get pagination to work using manual but I could with modify_query
ZachDaniel · 3y ago
Out of curiosity, what happened when you did it with manual?
l00ker (OP) · 3y ago
IIRC I got results fine but when I tried to add the pagination it tossed an error
ZachDaniel · 3y ago
jfc
if Map.get(aggregate, :uniq?) do
  Ecto.Query.dynamic([row], count(^field, :distinct))
else
  Ecto.Query.dynamic([row], count(^field, :distinct))
end
I broke it in a recent release
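For illustration, the intended uniq?-dependent behavior can be sketched in plain Elixir (count_for is a hypothetical stand-in for the Ecto dynamic above, operating on in-memory ids rather than SQL):

```elixir
# Hypothetical helper mirroring the branch above: only the uniq? branch
# should deduplicate; the copy-paste bug used :distinct in both branches.
count_for = fn aggregate, ids ->
  if Map.get(aggregate, :uniq?) do
    ids |> Enum.uniq() |> length()
  else
    length(ids)
  end
end

IO.inspect(count_for.(%{uniq?: true}, [1, 1, 2]))
IO.inspect(count_for.(%{uniq?: false}, [1, 1, 2]))
```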
l00ker (OP) · 3y ago
I think the error was about duplicate selects.
ZachDaniel · 3y ago
Sorry, I added a test to confirm the :distinct behavior because I just added the uniq? option, and I guess the previous tests weren't testing the non-unique behavior in the first place. Okay, should be fixed, and I added a test so that won't get broken again. So your query ought to work, but keep in mind I can't guarantee that every feature around it will work. It's a reasonable enough thing for me to not assume that query results represent unique records of a resource, so if you encounter other things that don't work the way you expect, perhaps I can just account for that fact in those places if necessary. And I can't actually think of anything off the top of my head that will break. Just trying to be conservative and not tell you to have at it and then discover there are issues. Need to release that fix, because I'm sure it breaks some things for others.
l00ker (OP) · 3y ago
So you fixed the count for the modify_query? Or the original Ash query?
ZachDaniel · 3y ago
both
l00ker (OP) · 3y ago
Ahh
ZachDaniel · 3y ago
the DISTINCT being added was just a regular old copy-paste bug
l00ker (OP) · 3y ago
Just wanted to make sure I was on the same page
ZachDaniel · 3y ago
👍
l00ker (OP) · 3y ago
Thanks again Zach!
ZachDaniel · 3y ago
My pleasure 🙂 Thanks for highlighting an interesting use case. I'll need to keep that in mind when designing in the future.
l00ker (OP) · 3y ago
You can count on me! 🤣
ZachDaniel · 3y ago
I can COUNT DISTINCT on you 😆
l00ker (OP) · 3y ago
🤣
