Pagination 'count' is not accurate if the SQL query result contains duplicate ids
The pagination count SQL query is using DISTINCT. If the result set returned by the subquery has duplicate ids, then the count doesn't reflect the actual number of rows returned by the subquery, because DISTINCT removes them from the count.
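To illustrate the effect with placeholder data (not the actual generated query):

```elixir
# Three rows come back, but two share an id (a rerun),
# so counting DISTINCT ids reports 2 instead of 3.
rows = [%{id: 1}, %{id: 1}, %{id: 2}]
length(rows)                                          #=> 3
rows |> Enum.map(& &1.id) |> Enum.uniq() |> length()  #=> 2
```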
Is there any way to remove DISTINCT from the pagination count query?
🤔 what query are you running?
I feel like it might be a bug that it is using DISTINCT there.
Yeah. IMO it shouldn't do that.
But at the same time, I feel like DISTINCT shouldn't affect the count, given that "id" is almost certainly unique.
Maybe I need to see the whole query to see what you mean, though.
Well, in my case there are duplicates. Don't know if you remember the published TV episodes from another support question? In that case, there are reruns of episodes, so it will have duplicate ids.
Can I see the whole action/query that you are running? Just want to wrap my head around it.
Those are the relevant parts.
It would be strange for the id to end up duplicated in the result set, honestly.
I think that might be the bug
well, and then the distinct would be unnecessary
but like what you should get back is one row per thing
Can I see the SQL generated for both? The count and the regular query?
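For the hand-written Ecto side, the generated SQL can be dumped with the repo's to_sql/2 (the repo name and query variable here are placeholders):

```elixir
# Ecto SQL repos expose to_sql/2, returning the SQL string and bind params.
{sql, params} = MyApp.Repo.to_sql(:all, query)
IO.puts(sql)
IO.inspect(params)
```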
I see... I did notice that you removed all the calculation parts from the count query in one of the last few updates.
And tv_episodes has_many published_air_dates, right?
Yes.
I feel like the query is likely producing duplicate tv_episodes, which is problematic.
No... it's exactly what I want.
Well... that's not how it's supposed to work.
it should only be giving you one tv episode, sorted by its latest air date
well, one tv episode per latest air date
It will because of reruns of the episodes on a different date
Actually I'm a bit confused
GROUP BY st0."episode_id"
That ought to prevent it from duplicating any tv_episodes in this process.
Can you check the actual results of the query? And confirm that Enum.count(results, &(&1.id)) is equal to the count of unique ids? i.e., something like the following.
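A sketch of that check, with results standing in for the list the action returned:

```elixir
total  = length(results)
unique = results |> Enum.map(& &1.id) |> Enum.uniq() |> length()
total == unique
```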
Okay. You're correct. The count does match. Let me tell you what put me onto this then.
Go look at the query times in the [debug] output: db=453.1ms and db=459.4ms.
🤔 yeah, that is not ideal
How big is the data set?
I was looking for a way to speed that up, so I tried modify_query to use my normal Ecto join from the original app, which runs in like 30ms. But the count was off. So I ran the SQL query in psql and removed the DISTINCT to get the right count.
Ah, yeah, that makes sense. Because your join is basically just an inner join, right?
Your join would produce duplicates, I mean. Either way, I have plans to add some options to optimize those kinds of queries (and have already put some of them into place), but there may also be some ways to optimize it with what we have now.
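For illustration, with schema and field names assumed, an inner join like that yields one row per matching air date, so an episode with reruns appears once per airing:

```elixir
import Ecto.Query

# One row per (episode, air date) pair: reruns duplicate the episode.
from(e in TvEpisode,
  join: a in assoc(e, :published_air_dates),
  order_by: [desc: a.aired_at],
  select: e
)
```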
If I run the Ash queries in psql and use EXPLAIN ANALYZE, 95% of the time is in the array_agg. It has to loop over the table 2x.
Yeah, that tracks. I think the array_agg should not be necessary most of the time, but there are configurations where it is necessary for correctness, and what I basically need to add are ways to tell it that it's unnecessary.
I've optimized it out for cases where the relationships along the aggregate path are all belongs_to (because we can safely assume that each record can't possibly be related to multiple rows; otherwise tons of things would break).
Right now there are only 2106 rows there and it's taking almost 500ms each. If that table grows to, say, 10_000 rows it will be insane.
Try this:
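The suggested snippet isn't shown here; going by the exists remark further down, it was plausibly something along these lines (resource and field names assumed):

```elixir
require Ash.Query

# Filtering via exists lets Postgres use a semi-join instead of
# building an array_agg per row for a first/latest aggregate.
TvEpisode
|> Ash.Query.filter(exists(published_air_dates, not is_nil(published_at)))
```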
The modify_query escape hatch was giving me exactly what I needed, and I wanted to have Ash manage the pagination, but the count was off.
Well, I think the modify_query was giving you duplicate rows.
Yeah... it is.
So if you add a distinct or something to your query, or do something to prevent the join from expanding...
I want the duplicate rows.
😢 It's part of the design of resources that they aren't expected to return duplicate records. Like, if that was done over an API or any other interface it would be very strange.
Why do you want the duplicates?
Not saying you don't have a valid reason, just trying to understand what you want to do
The other way out was to just do the query from the AirDates resource. That does duplicate and avoids the aggregate.
What does the duplication of results help you do?
The list is in descending order by air date, so as a viewer pages back in time, the episodes continue to be displayed in the proper order that they 'aired'. If the duplicates are removed from the list query, there are gaps.
But you want them to see the same tv episode multiple times in that list?
In that case, yes
Gotcha. I'll have to think about this, because I can see the use case, but querying the air dates filtered by them having been published is probably your best bet for now
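A sketch of that approach, with all names assumed (page over the air dates, newest first, loading the episode for each, so reruns repeat naturally):

```elixir
require Ash.Query

# Assumes the read action has pagination enabled.
AirDate
|> Ash.Query.filter(not is_nil(published_at))
|> Ash.Query.sort(aired_at: :desc)
|> Ash.Query.load(:tv_episode)
|> MyApp.Api.read!(page: [limit: 20])
```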
The hooks for telling pagination not to do DISTINCT aren't there at the moment. And there are things that actively expect not to find duplicates in the query response (like when we zip up any loaded data, we expect the first record w/ matching ids to be the one we got back from the query).
And back to the query times for a moment. If used in a LiveView, those queries are run 2x: once for the initial page load and again in the mount.
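That double run is the standard LiveView lifecycle: mount/3 executes once for the static render and again when the socket connects. The usual workaround is to gate the expensive load on connected?/1 (the loader function here is a placeholder):

```elixir
def mount(_params, _session, socket) do
  # Only hit the database on the connected pass; the static render
  # gets an empty list that is immediately replaced.
  episodes = if connected?(socket), do: list_episodes(), else: []
  {:ok, assign(socket, :episodes, episodes)}
end
```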
Yeah, I agree, those query times are no good. The basic problem is simply one of optimization on our end, i.e., when a first aggregate is used in a filter, it shouldn't be done in the same way as it's done to load the value. In a filter, we should just join and filter.
Well... potentially. Because we have to ensure we're only considering the last one according to the sort. But the exists route would be significantly faster for you, I imagine.
I'm sure there are many things to be considered. I figured there were reasons that DISTINCT was used. That's why I was wondering if there was a way to remove it that I hadn't found.
Lemme take a look at why it's even doing that.
but even if I fix it, I'd say to make sure you add some tests around the duplication behavior at a minimum, because that isn't really part of the "things we expect you to do with resources" and so could break in the future.
I see that you've been busy optimizing things, but in the meantime that's why I decided to try the modify_query.
Yeah, I think there are also some improvements we could make to manual actions here, because I don't think manual actions can return pages or get pagination options, but that would be the best way to go about doing what you want, I think.
But the count being off was a no go there.
If you had a way to interact with pagination there, or to do the pagination work yourself, then you could do w/e you want
and the duplicates thing would be fine(ish, again, might break some day)
I couldn't get pagination to work using manual, but I could with modify_query.
Out of curiosity, what happened when you did it with manual?
IIRC I got results fine, but when I tried to add the pagination it tossed an error.
jfc
I broke it in a recent release
I think the error was about duplicate selects.
sorry
I added a test to confirm the :distinct behavior because I just added the uniq? option, and I guess the previous tests weren't testing the non-unique behavior in the first place.
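For context, the uniq? option mentioned there applies to aggregates; roughly, with the aggregate and relationship names assumed:

```elixir
aggregates do
  # Count distinct related rows rather than every joined row.
  count :published_airing_count, :published_air_dates do
    uniq? true
  end
end
```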
okay, should be fixed
and added a test so that won't get broken again.
So your query ought to work, but keep in mind I can't guarantee that every feature around it will work. It's reasonable enough for me not to assume that query results represent unique records of a resource, so if you encounter other things that don't work the way you expect, perhaps I can just account for that fact in those places if necessary.
and I can't actually think of anything off of the top of my head that will break.
Just trying to be conservative and not tell you to have at it and then discover there are issues.
Need to release that fix because I'm sure that breaks some things for others.
So you fixed the count for the modify_query? Or the original Ash query?
Both.
Ahh
The DISTINCT being added was just a regular old copy-paste bug.
Just wanted to make sure I was on the same page.
👍
Thanks again Zach!
My pleasure 🙂 Thanks for highlighting an interesting use case. I'll need to keep that in mind when designing in the future.
You can count on me! 🤣
I can COUNT DISTINCT on you 😆🤣