Pagination 'count' is not accurate if the SQL query result contains duplicate ids

The pagination count SQL query is using DISTINCT like this:
SELECT coalesce(count(DISTINCT t0."id"::bigint), $1::bigint)::bigint FROM (...) ...
If the result set returned by the subquery has duplicate ids, the count doesn't reflect the actual number of rows the subquery returns, because DISTINCT removes the duplicates from the count. Is there any way to remove DISTINCT from the pagination count query?
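To make the mismatch concrete, here is a minimal sketch in plain Elixir (no SQL, illustrative data only) showing the two counts diverging once an id repeats:

```elixir
# Rows as a fanned-out join might return them: episode 1 appears twice (a rerun).
rows = [%{id: 1}, %{id: 1}, %{id: 2}]

# What a count(DISTINCT id) computes:
distinct_count = rows |> Enum.uniq_by(& &1.id) |> length()

# What the subquery actually returns:
actual_count = length(rows)

IO.inspect({distinct_count, actual_count})
# => {2, 3}, so the pagination count and the page contents disagree
```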
62 Replies
ZachDaniel · 3y ago
🤔 what query are you running? I feel like it might be a bug that it is using DISTINCT there
l00ker (OP) · 3y ago
Yeah. IMO it shouldn't do that.
ZachDaniel · 3y ago
But at the same time, I feel like DISTINCT shouldn't affect the count, given that "id" is almost certainly unique. Maybe I need to see the whole query to see what you mean, though.
l00ker (OP) · 3y ago
Well in my case there are duplicates. Don't know if you remember the published TV episodes from another support question. In that case, there are reruns of episodes, therefore it will have duplicate ids.
ZachDaniel · 3y ago
Can I see the whole action/query that you are running? Just want to wrap my head around it
l00ker (OP) · 3y ago
...
read :published do
  prepare build(
    filter: expr(published),
    sort: [published_air_date: :desc],
    load: [:thumb_url]
  )

  pagination do
    offset? true
    default_limit 10
    countable :by_default
  end
end
...
aggregates do
  first :published_air_date, :air_dates, :start_datetime do
    sort start_datetime: :desc
    filter expr(status == :published)
  end

  first :this_weeks_air_date, :air_dates, :start_datetime do
    sort start_datetime: :desc
    filter expr(status in [:published, :scheduled])
  end
end
...
calculations do
  calculate :published, :boolean, expr(not is_nil(published_air_date))
end
Those are the relevant parts.
ZachDaniel · 3y ago
It would be strange for the id to end up duplicated in the result set. Honestly, I think that might be the bug. Well, and then the DISTINCT would be unnecessary. But what you should get back is one row per thing. Can I see the SQL generated for both? The count and the regular query?
l00ker (OP) · 3y ago
[debug] QUERY OK source="tv_episodes" db=453.1ms idle=721.0ms
SELECT coalesce(count(DISTINCT t0."id"::bigint), $1::bigint)::bigint FROM "tv_episodes" AS t0 LEFT OUTER JOIN LATERAL (SELECT (array_agg(st0."start_datetime"::timestamp ORDER BY st0."start_datetime" DESC ))[1]::timestamp AS "published_air_date", st0."episode_id" AS "episode_id" FROM "public"."tv_episodes_air_dates" AS st0 WHERE (st0."status"::varchar = $2::varchar) AND (t0."id" = st0."episode_id") GROUP BY st0."episode_id") AS s1 ON TRUE WHERE (NOT ((s1."published_air_date"::timestamp IS NULL) = $3::boolean)::boolean::boolean) [0, :published, true]
[debug] QUERY OK source="tv_episodes" db=459.4ms queue=0.1ms idle=857.9ms
SELECT t0."id", t0."title", t0."brief_desc", t0."text", t0."thumb", t0."video_file", t0."video_poster", t0."date_added", t0."first_date", t0."member_level", t0."cat_id", t0."is_featured", t0."times_viewed", s1."published_air_date"::timestamp::timestamp, ($1::varchar::varchar || ($2::varchar::varchar || t0."thumb"::text::varchar)::varchar)::text, ($3::varchar::varchar || ($4::varchar::varchar || t0."video_file"::text::varchar)::varchar)::text, ($5::varchar::varchar || ($6::varchar::varchar || t0."video_poster"::text::varchar)::varchar)::text FROM "tv_episodes" AS t0 LEFT OUTER JOIN LATERAL (SELECT (array_agg(st0."start_datetime"::timestamp ORDER BY st0."start_datetime" DESC ))[1]::timestamp AS "published_air_date", st0."episode_id" AS "episode_id" FROM "public"."tv_episodes_air_dates" AS st0 WHERE (st0."status"::varchar = $7::varchar) AND (t0."id" = st0."episode_id") GROUP BY st0."episode_id") AS s1 ON TRUE WHERE (NOT ((s1."published_air_date"::timestamp IS NULL) = $8::boolean)::boolean::boolean) ORDER BY s1."published_air_date" DESC LIMIT $9 ["https://cdn1.wx3media.net", "/wog/", "https://cdn1.wx3media.net", "/wog-videos/", "https://cdn1.wx3media.net", "/wog/", :published, true, 11]
ZachDaniel · 3y ago
I see...
l00ker (OP) · 3y ago
I did notice that you removed all the calculation parts from the count query in one of the last few updates.
ZachDaniel · 3y ago
and tv_episodes has_many published_air_dates right?
l00ker (OP) · 3y ago
Yes
ZachDaniel · 3y ago
I feel like the query is likely producing duplicate tv_episodes which is problematic
l00ker (OP) · 3y ago
No... it's exactly what I want
ZachDaniel · 3y ago
Well... that's not how it's supposed to work. It should only be giving you one TV episode, sorted by its latest air date. Well, one TV episode per latest air date.
l00ker (OP) · 3y ago
It will, because of reruns of the episodes on different dates.
ZachDaniel · 3y ago
Actually, I'm a bit confused. GROUP BY st0."episode_id" ought to prevent it from duplicating any tv_episodes in this process. Can you check the actual results of the query, and confirm that Enum.count(results, &(&1.id)) is equal to the count of unique ids? i.e.
ids = Enum.map(results, &(&1.id))
IO.inspect(Enum.count(ids))
IO.inspect(Enum.count(Enum.uniq(ids)))
l00ker (OP) · 3y ago
Okay. You're correct. The count does match. Let me tell you what put me onto this then. Go look at the query times in the [debug] - db=453.1ms & db=459.4ms
ZachDaniel · 3y ago
🤔 yeah, that is not ideal How big is the data set?
l00ker (OP) · 3y ago
I was looking for a way to speed that up, so I tried modify_query to use my normal Ecto join from the original app, which runs in like 30ms. But the count was off. So I ran the SQL query in psql and removed the DISTINCT to get the right count.
ZachDaniel · 3y ago
Ah, yeah, that makes sense. Because your join is basically just an inner join, right? Your join would produce duplicates, I mean. Either way, I have plans to add some options to optimize those kinds of queries (and have already put some of them into place), but there may also be some ways to optimize it with what we have now.
l00ker (OP) · 3y ago
If I run the Ash queries in psql and use EXPLAIN ANALYZE, 95% of the time is in the array_agg. It has to loop over the table 2x.
ZachDaniel · 3y ago
Yeah, that tracks. I think the array_agg should not be necessary most of the time, but there are configurations whereby it is necessary for correctness, and what I basically need to add is ways to tell it that it's unnecessary. I've optimized it out for cases where the relationships along the aggregate path are all belongs_to (because we can safely assume that each record can't possibly be related to multiple rows; otherwise tons of things would break).
l00ker (OP) · 3y ago
Right now there are only 2106 rows there and it's taking almost 500ms each. If that table grows to, say, 10_000 rows it will be insane.
ZachDaniel · 3y ago
Try this:
calculate :published, :boolean, expr(exists(published_air_dates, status == :published and not is_nil(start_datetime)))
l00ker (OP) · 3y ago
The modify_query escape hatch was giving me exactly what I needed, and I wanted to have Ash manage the pagination, but the count was off.
ZachDaniel · 3y ago
well, I think the modify_query was giving you duplicate rows
l00ker (OP) · 3y ago
Yeah... it is
ZachDaniel · 3y ago
So if you add a distinct or something to your query or do something to prevent the join from expanding
l00ker (OP) · 3y ago
I want the duplicate rows
ZachDaniel · 3y ago
😢 It's part of the design of resources that they aren't expected to return duplicate records. Like, if that was done over an API or any other interface it would be very strange. Why do you want the duplicates? Not saying you don't have a valid reason, just trying to understand what you want to do.
l00ker (OP) · 3y ago
The other way out was to just do the query from the AirDates resource. That does duplicate and avoids the aggregate.
ZachDaniel · 3y ago
what does the duplication of results help you do?
l00ker (OP) · 3y ago
The list is in descending order by air date, so as a viewer pages back in time the episodes will continue to be displayed in the proper order that they 'aired'. If they are removed from the list query, there are gaps.
ZachDaniel · 3y ago
but you want them to see the same tv episode multiple times in that list
l00ker (OP) · 3y ago
In that case, yes
ZachDaniel · 3y ago
Gotcha. I'll have to think about this, because I can see the use case, but querying the air dates filtered by them having been published is probably your best bet for now. The hooks for telling pagination not to do DISTINCT aren't there at the moment. And there are things that actively expect not to find duplicates in the query response (like when we zip up any loaded data, we expect the first record w/ a matching id is the one we got back from the query).
l00ker (OP) · 3y ago
And back to the query times for a moment. If used in a LiveView, those queries are run 2x. Once for the initial page load and again in the mount.
ZachDaniel · 3y ago
Yeah, I agree, those query times are no good. The basic problem is simply one of optimization on our end. I.e. when a first aggregate is used in a filter, it shouldn't be done in the same way as it's done to load the value. In a filter, we should just join and filter. Well... potentially, because we have to ensure we're only considering the last one according to the sort. But the exists route would be significantly faster for you, I imagine.
l00ker (OP) · 3y ago
I'm sure there are many things to be considered, for sure. I figured there were reasons that DISTINCT was used. That's why I was wondering if there was a way to remove it that I hadn't found.
ZachDaniel · 3y ago
Lemme take a look at why it's even doing that. But even if I fix it, I'd say to make sure you add some tests around the duplication behavior at a minimum, because that isn't really part of the "things we expect you to do with resources" and so could break in the future.
l00ker (OP) · 3y ago
I see that you've been busy optimizing things, but in the meantime that's why I decided to try the modify_query.
ZachDaniel · 3y ago
Yeah, I think there are also some improvements we could make to manual actions here, because I don't think manual actions can return pages/get pagination options. But that would be the best way for you to go about doing what you want to do, I think.
l00ker (OP) · 3y ago
But the count being off was a no go there.
ZachDaniel · 3y ago
manual fn query, ecto_query, _ ->
  # do what you want here, just return records of this resource
end
If you had a way to interact with pagination there, or to do the pagination work yourself, then you could do w/e you want and the duplicates thing would be fine(ish, again, might break some day)
l00ker (OP) · 3y ago
I couldn't get pagination to work using manual but I could with modify_query
ZachDaniel · 3y ago
Out of curiosity, what happened when you did it with manual?
l00ker (OP) · 3y ago
IIRC I got results fine but when I tried to add the pagination it tossed an error
ZachDaniel · 3y ago
jfc
if Map.get(aggregate, :uniq?) do
  Ecto.Query.dynamic([row], count(^field, :distinct))
else
  Ecto.Query.dynamic([row], count(^field, :distinct))
end
I broke it in a recent release
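For illustration, the intended uniq?-dependent behavior can be sketched in plain Elixir (count_for is a hypothetical stand-in for the Ecto dynamic above, operating on in-memory ids rather than SQL):

```elixir
# Hypothetical helper mirroring the branch above: only the uniq? branch
# should deduplicate; the copy-paste bug used :distinct in both branches.
count_for = fn aggregate, ids ->
  if Map.get(aggregate, :uniq?) do
    ids |> Enum.uniq() |> length()
  else
    length(ids)
  end
end

IO.inspect(count_for.(%{uniq?: true}, [1, 1, 2]))
IO.inspect(count_for.(%{uniq?: false}, [1, 1, 2]))
```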
l00ker (OP) · 3y ago
I think the error was about duplicate selects.
ZachDaniel · 3y ago
Sorry, I added a test to confirm the :distinct behavior because I just added the uniq? option, and I guess the previous tests weren't testing the non-unique behavior in the first place. Okay, should be fixed, and I added a test so that won't get broken again. So your query ought to work, but keep in mind I can't guarantee that every feature around it will work. It's a reasonable enough thing for me to not assume that query results represent unique records of a resource, so if you encounter other things that don't work the way you expect, perhaps I can just account for that fact in those places if necessary. And I can't actually think of anything off the top of my head that will break. Just trying to be conservative and not tell you to have at it and then discover there are issues. Need to release that fix, because I'm sure it breaks some things for others.
l00ker (OP) · 3y ago
So you fixed the count for the modify_query? Or the original Ash query?
ZachDaniel · 3y ago
both
l00ker (OP) · 3y ago
Ahh
ZachDaniel · 3y ago
the DISTINCT being added was just a regular old copy-paste bug
l00ker (OP) · 3y ago
Just wanted to make sure I was on the same page
ZachDaniel · 3y ago
👍
l00ker (OP) · 3y ago
Thanks again Zach!
ZachDaniel · 3y ago
My pleasure 🙂 Thanks for highlighting an interesting use case. I'll need to keep that in mind when designing in the future.
l00ker (OP) · 3y ago
You can count on me! 🤣
ZachDaniel · 3y ago
I can COUNT DISTINCT on you 😆
l00ker (OP) · 3y ago
🤣
