Sometimes a single worker in an endpoint fails because of internal errors, misconfiguration, running out of space (due to memory purging errors), and so on. This happens in less than 1% of cases. Unfortunately, the failed worker then generates endless errors, and every task routed to it fails. So each time, someone has to log in to the account and manually kick that worker out of the endpoint to stop the errors. We definitely need an API for kicking unhealthy workers from endpoints.