For a few months now, our bot has been having issues on Sundays. The problem begins every Sunday at 04:00 CET and ends every Monday at 04:00 CET. This window coincides with the full 24 hours of Sunday in the BRT timezone, which might be relevant given the bot's demographics.
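To double-check the timezone overlap, here is a minimal sketch using `java.time`. It assumes "CET" means a fixed UTC+1 offset and that BRT corresponds to `America/Sao_Paulo`. The helper name `cetToBrt` and the example date are only for illustration. If the times were actually read off a clock on CEST (UTC+2), the window would shift by one hour.

```kotlin
import java.time.LocalDateTime
import java.time.ZoneId
import java.time.ZoneOffset

// Assumption: CET is a fixed UTC+1 offset (no summer time applied).
val CET: ZoneOffset = ZoneOffset.ofHours(1)

// Assumption: BRT is the America/Sao_Paulo zone.
val BRT: ZoneId = ZoneId.of("America/Sao_Paulo")

// Hypothetical helper: converts a CET wall-clock time to BRT wall-clock time.
fun cetToBrt(cetLocal: LocalDateTime): LocalDateTime =
    cetLocal.atOffset(CET).atZoneSameInstant(BRT).toLocalDateTime()

fun main() {
    // 2024-06-02 is a Sunday; the date itself is arbitrary.
    val start = cetToBrt(LocalDateTime.of(2024, 6, 2, 4, 0))
    val end = cetToBrt(LocalDateTime.of(2024, 6, 3, 4, 0))
    println("window start in BRT: $start (${start.dayOfWeek})")
    println("window end in BRT:   $end (${end.dayOfWeek})")
}
```

Under these assumptions the window runs from Sunday 00:00 to Monday 00:00 in BRT.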
We would like to check with Kord whether there is something you can see that we are missing. This last Sunday, we captured a 10-minute CPU profile while the issue was happening. This morning, I captured another 10-minute profile during normal operation.
Comparing the two, we noticed that during the issue the bot spent 25% of its CPU time in the `DefaultGatewayEventInterceptor.handle` and `GuildEventHandler.handle` methods. In a 5-minute profile taken with a heavy extension of our bot disabled, this share grew to 42% of CPU time.
During normal operation, these methods take only 6% of CPU time.
For a visual, I am attaching what it looked like over the past 2 Sundays. This is what every Sunday has looked like for us for the past couple of months.
Interaction Latency means how long ago the command was created, measured at the time the bot handled it.
Because of the spike in how long the bot takes to handle commands, a large percentage of our users are affected: the bot takes more than 3s to react, so they get "the application did not respond".
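For anyone who wants to reproduce the Interaction Latency metric, here is a rough sketch of one way it could be computed. It assumes the creation time is taken from the interaction's snowflake ID, which may not be exactly how our dashboard derives it. The function names and the `RESPONSE_DEADLINE` constant are illustrative and not part of Kord or Kordex. The only Discord-specific fact used is the snowflake layout: the bits above the lowest 22 hold milliseconds since the Discord epoch.

```kotlin
import java.time.Duration
import java.time.Instant

// Discord epoch (2015-01-01T00:00:00Z) in Unix milliseconds.
const val DISCORD_EPOCH_MS = 1420070400000L

// The 3s limit mentioned in this post, after which users see
// "the application did not respond".
val RESPONSE_DEADLINE: Duration = Duration.ofSeconds(3)

// Drops the lower 22 bits of the snowflake and shifts to Unix time.
fun snowflakeCreatedAt(id: Long): Instant =
    Instant.ofEpochMilli((id ushr 22) + DISCORD_EPOCH_MS)

// Hypothetical helper: time between interaction creation and handling.
fun interactionLatency(interactionId: Long, handledAt: Instant = Instant.now()): Duration =
    Duration.between(snowflakeCreatedAt(interactionId), handledAt)

// True when the handler started too late to respond within the deadline.
fun missedDeadline(latency: Duration): Boolean = latency > RESPONSE_DEADLINE

fun main() {
    val interactionId = 175928847299117063L // example snowflake, not a real interaction
    val createdAt = snowflakeCreatedAt(interactionId)

    // Simulate the bot picking the interaction up 3.5 seconds after creation.
    val handledAt = createdAt.plusMillis(3500)
    val latency = interactionLatency(interactionId, handledAt)

    println("created at: $createdAt")
    println("latency: ${latency.toMillis()} ms, missed deadline: ${missedDeadline(latency)}")
}
```

Note that this compares Discord's clock with the bot host's clock, so any clock skew on the host shows up in the measured latency.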
We have made several improvements to the bot, as well as updating through several Kordex releases, but we have run out of ideas about what the cause could be. Perhaps some of you have further insights or ideas for us to try.