Best approach to fetch hundreds of thousands of records

I'm wondering what the best approach or best practice is for fetching hundreds of thousands of records in an initial full data load to transfer somewhere else. Currently I'm using JPA + pagination and it is slow. I'm loading around ~400k objects into memory; would the better approach be to use JdbcTemplate? What if you also need to join a few tables? What would you recommend here?
25 Replies
JavaBot · 7mo ago
This post has been reserved for your question.
Hey @Kale Vivi! Please use /close or the Close Post button above when your problem is solved. Please remember to follow the help guidelines. This post will be automatically marked as dormant after 300 minutes of inactivity.
TIP: Narrow down your issue to simple and precise questions to maximize the chance that others will reply in here.
dan1st · 7mo ago
JdbcTemplate might be faster, but it depends on how you are doing it and on the exact issues. Can you profile your application? Is most of the work done in the DB? IO? The application/processing? For profiling, you could use something like VisualVM.
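For illustration, a minimal sketch of the JdbcTemplate route mentioned here, using a RowCallbackHandler so each row is processed as it streams in instead of being collected into one list. The table/column names, the cutoff parameter, and process(...) are made-up placeholders, not from the thread:

// Sketch only: stream rows with JdbcTemplate rather than materializing them.
public void streamSomeObjects(JdbcTemplate jdbcTemplate, ZonedDateTime cutoff) {
    jdbcTemplate.setFetchSize(1000); // hint the JDBC driver to fetch rows in chunks
    jdbcTemplate.query("""
            SELECT n.id, n.name
            FROM some_object n
            JOIN some_other_object e ON e.some_object_id = n.id
            WHERE n.deleted = false AND e.date >= ?
            """,
            rs -> process(rs.getLong("id"), rs.getString("name")), // invoked once per row; process(...) is hypothetical
            Timestamp.from(cutoff.toInstant()));
}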
Kale Vivi (OP) · 7mo ago
@Override
public List<SomeObject> findSomeObjects() {
    return findSomeObjectsQuery().stream()
            .map(SomeObject::from)
            .toList();
}

public List<SomeObject> findSomeObjectsQuery() {
    int page = 0;
    int size = 1000;
    List<SomeObject> allSomeObjects = new ArrayList<>();
    Page<SomeObject> someObjectPage;

    do {
        someObjectPage = someObjectRepository.findSomeObjectsQuery(Constants.DATE, PageRequest.of(page, size));
        // every page is accumulated here, so all ~400k records end up in memory at once
        allSomeObjects.addAll(someObjectPage.getContent());
        page++;
    } while (!someObjectPage.isLast());

    return allSomeObjects;
}

@Query("""
        SELECT n
        FROM SomeObject n
        JOIN n.someOtherObject e
        WHERE n.deleted = false
        AND e.date >= :date
        """)
Page<SomeObject> findSomeObjectsQuery(@Param("date") ZonedDateTime date, Pageable pageable);
Particularly in this area (SomeObject::from), I'm running into a memory issue loading all the objects. Is there a more performant way to do this operation?
dan1st · 7mo ago
DO you really need everything as a List?
Kale Vivi (OP) · 7mo ago
Do you mean I should use a stream instead? At some point I need a list, but maybe I can have a stream until that point, right?
ayylmao123xdd · 7mo ago
I mean, technically you could save it all in a file and then write other code that reads that file. I don't know if that's what you want, though.
Kale Vivi (OP) · 7mo ago
It isn't. It needs to read from the DB.
ayylmao123xdd · 7mo ago
Then fetch it with a stream.
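For reference, Spring Data JPA can hand back a java.util.Stream directly from a repository method, which is one way to do what's suggested here. A sketch reusing the query posted above; the streamSomeObjects/handle names are placeholders, and the stream must be consumed inside a transaction and closed when done:

// In the repository, same query as the paginated version above:
@QueryHints(@QueryHint(name = "org.hibernate.fetchSize", value = "1000"))
@Query("""
        SELECT n
        FROM SomeObject n
        JOIN n.someOtherObject e
        WHERE n.deleted = false
        AND e.date >= :date
        """)
Stream<SomeObject> streamSomeObjects(@Param("date") ZonedDateTime date);

// In the caller:
@Transactional(readOnly = true)
public void processSomeObjects() {
    try (Stream<SomeObject> stream = someObjectRepository.streamSomeObjects(Constants.DATE)) {
        stream.map(SomeObject::from)
              .forEach(this::handle); // handle is a placeholder for the per-record work
    }
}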
Kale Vivi (OP) · 7mo ago
One other thing around this: do you have advice on debugging long-running operations? If you have to wait 2 hours to reproduce an issue, that is too long. Do you normally adjust the code to build a smaller sample that reflects only that object so you can reproduce and debug, or do you have any other tips around that?
ayylmao123xdd · 7mo ago
I mean, if you want to test some scheduler you can just set up a new one that fires every second, and if you are processing a huge file, just use a small one for testing.
dan1st · 7mo ago
Why? Having a List requires storing everything in memory. Ideally you would process the records in batches and never hold them all in memory at once.
straightface · 7mo ago
It depends; we use JdbcTemplate to fetch about 19M rows. What is the problem you are trying to solve? Honestly, if it works, it works, and if you have the hardware you might as well put it to work. If you have a lot of duplicate strings, you might want to look into string interning; it saved me a lot of memory without having to do much.
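A minimal sketch of the deduplication idea described here, assuming the duplicates come from repeated column values: route strings through a shared map while mapping rows, so equal strings share one instance (the pool/dedupe names are made up; a map avoids relying on String.intern()):

private final Map<String, String> pool = new HashMap<>();

private String dedupe(String s) {
    // returns the pooled instance if an equal string was seen before
    return s == null ? null : pool.computeIfAbsent(s, k -> k);
}

// Usage while mapping a row (field/column names hypothetical):
// someObject.setCategory(dedupe(rs.getString("category")));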
dan1st · 7mo ago
Well, they store everything in a List at the end...
Kale Vivi (OP) · 7mo ago
What do you mean by duplicate strings here? I eventually need to send the data to an external API, and the API contract requires the objects to be in an array/list.
ayylmao123xdd · 7mo ago
Maybe just use a big page size, like 10k, if you have to send it in a list anyway.
dan1st · 7mo ago
Do you really need to send all of the data in a single request?
FullStackBadAss · 7mo ago
Really dude?
Kale Vivi (OP) · 7mo ago
Not in a single request; I take the list and make the calls in batches, async.
dan1st · 7mo ago
Then you take each batch from the DB, convert that batch to a list, and make the call for that specific batch. Don't create a list for all of the data.
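Put together, a sketch of what's being suggested: reuse the paginated repository method from earlier, but map and send each page immediately instead of accumulating, so at most one page is in memory at a time. sendBatch(...) is a placeholder for the external API call:

public void exportAll() {
    int page = 0;
    Page<SomeObject> current;
    do {
        current = someObjectRepository.findSomeObjectsQuery(Constants.DATE, PageRequest.of(page, 1000));
        List<SomeObject> batch = current.getContent().stream()
                .map(SomeObject::from)
                .toList();
        sendBatch(batch); // send this page's records, then let the batch become garbage
        page++;
    } while (!current.isLast());
}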
Kale Vivi (OP) · 7mo ago
Makes sense, thank you
