Kord•11mo ago

Bot recurringly not responding at specific times

For a few months now, our bot has been having issues on Sundays. The problem begins every Sunday at 04:00 CET and ends every Monday at 04:00 CET. This coincides with a full 24-hour Sunday in the BRT timezone (which might be relevant given the bot's demographics).

We have invested a lot of time investigating it together on the Kordex Discord (warning, huge thread). We would like to check with Kord whether there is something you can see that we are not seeing.

Basically, this last Sunday, we grabbed a 10-minute CPU profile while the issue was happening. This morning, I grabbed another 10-minute profile during regular behavior. Comparing both, we noticed that during the issue the bot spent 25% of its CPU time in the DefaultGatewayEventInterceptor.handle and GuildEventHandler.handle methods. In a 5-minute profile with a heavy extension of our bot disabled, this number grew to 42% of CPU time. During regular behavior, these methods take only 6% of CPU time.

For a visual, I am attaching what it looked like over the past 2 Sundays. This is what it has looked like every Sunday for the past couple of months.

Interaction Latency means how long ago the command was created, measured at the time the bot handled it. Because of the spike in how long the bot takes to handle commands, a huge percentage of our users are affected: whenever it takes more than 3s to react, they get "the application did not respond".

We have made several improvements to the bot, as well as some Kordex releases, but we are out of ideas about what it could be. Perhaps some of you have more insights or ideas for us to try.
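For anyone wanting to reproduce the Interaction Latency metric, here is a minimal sketch of one way to compute it. It assumes the creation time is read from the interaction's snowflake ID (Discord snowflakes store milliseconds since 2015-01-01T00:00:00Z in the bits above bit 22); the function names are hypothetical and this is not necessarily how our dashboard does it.

```kotlin
// Discord snowflake IDs carry their creation time in the upper bits,
// counted in milliseconds since the Discord epoch (2015-01-01T00:00:00Z).
val DISCORD_EPOCH_MS = 1_420_070_400_000L

// The 3s window mentioned above, after which users see
// "the application did not respond".
val ACK_DEADLINE_MS = 3_000L

/** Creation time of a snowflake as Unix epoch milliseconds. */
fun snowflakeToEpochMs(snowflake: Long): Long =
    (snowflake ushr 22) + DISCORD_EPOCH_MS

/** Hypothetical helper: time between interaction creation and handling. */
fun interactionLatencyMs(interactionId: Long, handledAtMs: Long): Long =
    handledAtMs - snowflakeToEpochMs(interactionId)

/** True when the bot reacted too late for the interaction to succeed. */
fun missedAckDeadline(latencyMs: Long): Boolean =
    latencyMs > ACK_DEADLINE_MS
```

Calling `interactionLatencyMs(interaction.id, System.currentTimeMillis())` at the top of a command handler (with `interaction.id` standing in for however the ID is exposed as a `Long`) would give the number plotted in the graphs.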

11 Replies

TschisOP•11mo ago

Might be worth noting some of our suspects and what we have changed.

The first suspect was a couple of servers our bot was added to that had crypto bot users in them. Every Sunday, these bots would just go crazy with MemberUpdateEvents. Like, hundreds per minute. We assumed they were spending the whole day updating their own names/profiles with information about cryptocurrency values.

Our bot has a use case where we consume MemberUpdateEvents. Two major changes have happened since this was found:
- we worked with Kordex to improve how we filter events, and are now dropping all events coming from bot users before we send them to any handler
- we fixed a bug in our handler that let all remaining (real user) MemberUpdateEvents through, even though only those coming from 1 specific server were desired

Even with these two improvements, the bot still hangs. However, we are now having trouble finding event-related inputs that could overwhelm the bot, and are looking for other possible bottlenecks.
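To make the two changes concrete, this is roughly the shape of the check, written as a plain Kotlin sketch. `MemberUpdate` and `TRACKED_GUILD_ID` are hypothetical stand-ins for illustration, not Kord or Kordex types, and the real filtering is wired into Kordex rather than applied to a list.

```kotlin
// Hypothetical, simplified stand-in for a member update event; not a Kord type.
data class MemberUpdate(val guildId: Long, val userId: Long, val isBot: Boolean)

// Assumed ID of the single server whose member updates are wanted.
val TRACKED_GUILD_ID = 111L

/** Returns true only for updates that should reach a handler. */
fun shouldHandle(event: MemberUpdate): Boolean =
    !event.isBot && event.guildId == TRACKED_GUILD_ID

/** Applies the predicate to a batch of events, keeping only the wanted ones. */
fun keepHandled(events: List<MemberUpdate>): List<MemberUpdate> =
    events.filter(::shouldHandle)
```

The bot check comes first so that the noisy crypto bot updates are rejected as cheaply as possible.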

Moon•11mo ago

@Tschis have you checked the follow up on your previous thread https://discord.com/channels/556525343595298817/1325146059575660565/1327421428374569010

TschisOP•11mo ago

This is a separate issue. That one is for our gateway service and this one is for our actual bot process (they run in different, separate containers). Our bot does not suffer from the same issue with the EOF exceptions.

I have tried filtering the events we get that could be spamming, e.g. MemberUpdateEvents, to drop them directly instead of handling them, but it had no effect.

Unfortunately, the way Intents work is a bit complicated. I cannot ask for just MemberJoinEvents; I have to request the GuildCreate intent, which comes with more than I want. Not to mention some very weird behavior: the Intent.GuildPresence intent, which should only give me PresenceUpdateEvent, changes how GuildCreate events work, even though those come from a completely different intent.

We basically do not need PresenceUpdateEvents, but we requested the intent because it makes starting up the bot faster. We need Member information from all Guilds, and with the presence intent the initial GuildCreate event contains the data for all members. So when the bot starts, we either have to request member data for 16k+ guilds, or we request the GuildPresence intent and get all of that directly, but then get spammed by "undesired" PresenceUpdateEvents.

Moon•11mo ago

@Tschis Hello, I wasn't available most of the day; let me brainstorm this with you. About the filtering, could you elaborate on how you are doing it?

g•11mo ago

they're using Kord Extensions' ability to filter out Kord events from its events flow using a predicate

TschisOP•11mo ago

Hey! Yes, we use this predicate to filter events out so that our handlers do not process them. However, the bot still has to receive the events.
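Since filtered events still cost something to receive, one thing that may help is tallying how many events of each type arrive versus how many pass the predicate. Below is a minimal sketch of such a counter; `EventTally` is a hypothetical helper, not part of Kord or Kordex, and the event type is passed in as a plain string.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.LongAdder

/** Hypothetical tally of events received versus events passed on to handlers. */
class EventTally {
    private val received = ConcurrentHashMap<String, LongAdder>()
    private val handled = ConcurrentHashMap<String, LongAdder>()

    /** Records one incoming event and whether the filter predicate let it through. */
    fun record(eventType: String, passedFilter: Boolean) {
        received.computeIfAbsent(eventType) { LongAdder() }.increment()
        if (passedFilter) {
            handled.computeIfAbsent(eventType) { LongAdder() }.increment()
        }
    }

    fun receivedCount(eventType: String): Long = received[eventType]?.sum() ?: 0L

    fun handledCount(eventType: String): Long = handled[eventType]?.sum() ?: 0L

    /** Event types ordered by how often they were received, busiest first. */
    fun busiest(): List<Pair<String, Long>> =
        received.map { (type, count) -> type to count.sum() }
            .sortedByDescending { it.second }
}
```

Dumping `busiest()` once a minute on a Sunday and again on a weekday would show whether any event type's volume actually differs between the two.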

Moon•11mo ago

I read your thread from the beginning and I reduced it to the following: You need presence update to be a one-time event specifically to get the data of all members

TschisOP•11mo ago

Yes, we need the presence intent to get the member data on GuildCreate events when the bot starts up. However, so far we have not concluded that this is the root cause of our issues, as we have the intent throughout the whole week and the problem only occurs on Sundays.

g•11mo ago

I do think half of your problem is expecting there to be a single, clear root cause. Sometimes you just need to work on optimising things that can obviously be optimised.

TschisOP•11mo ago

I expect that because of the conditions under which the problem appears: it lasts exactly 24h, from Sunday 04:00 CET to Monday 04:00 CET.
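The window lines up with a BRT Sunday because 04:00 CET is 03:00 UTC, which is 00:00 BRT. Here is a small sketch for tagging a timestamp as inside or outside the window, assuming fixed standard-time offsets (CET = UTC+1, BRT = UTC-3); the helper names are made up for illustration.

```kotlin
import java.time.DayOfWeek
import java.time.Instant
import java.time.OffsetDateTime
import java.time.ZoneOffset

// Fixed offsets, assuming standard time: CET = UTC+1, BRT = UTC-3.
val CET: ZoneOffset = ZoneOffset.ofHours(1)
val BRT: ZoneOffset = ZoneOffset.ofHours(-3)

/** True when the instant falls on a Sunday in BRT local time. */
fun isSundayInBrt(instant: Instant): Boolean =
    instant.atOffset(BRT).dayOfWeek == DayOfWeek.SUNDAY

/** Convenience for building an instant from a CET wall-clock time. */
fun cetInstant(year: Int, month: Int, day: Int, hour: Int, minute: Int): Instant =
    OffsetDateTime.of(year, month, day, hour, minute, 0, 0, CET).toInstant()
```

With this, `isSundayInBrt(cetInstant(2025, 1, 19, 4, 0))` is true (Sunday 00:00 BRT) while one minute earlier is still Saturday in BRT, which matches the boundaries we see.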

NoComment•11mo ago

The time consistency would also lead me to think that
