crawlee not respecting cgroup resource limits

Crawlee doesn't seem to respect resource limits imposed by cgroups. This poses problems in containerised environments: either Crawlee gets OOM-killed, or it silently slows to a crawl because it thinks it has far more resources available than it actually does. Reading the maximum RAM is pretty easy:
import { existsSync, readFileSync } from 'node:fs';
import { log } from 'crawlee';

function getMaxMemoryMB(): number | null {
    // cgroup v2 unified hierarchy; on cgroup v1 this file does not exist
    const cgroupPath = '/sys/fs/cgroup/memory.max';

    if (!existsSync(cgroupPath)) {
        log.warning('Cgroup v2 memory limit file not found.');
        return null;
    }

    try {
        const data = readFileSync(cgroupPath, 'utf-8').trim();

        if (data === 'max') {
            log.warning('No memory limit set (cgroup reports "max").');
            return null;
        }

        const maxMemoryBytes = parseInt(data, 10);
        return maxMemoryBytes / (1024 * 1024); // Convert bytes to MB
    } catch (error) {
        log.exception(error as Error, 'Error reading cgroup memory limit:');
        return null;
    }
}
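One minimal way to feed the detected limit into Crawlee is via the CRAWLEE_MEMORY_MBYTES environment variable, which Crawlee reads at startup. A sketch (the 25% headroom factor is my own assumption, not anything Crawlee prescribes):

```typescript
// Sketch: set Crawlee's memory cap from a detected cgroup limit before any
// crawler is constructed. CRAWLEE_MEMORY_MBYTES is the env var Crawlee uses
// to cap its autoscaled memory. headroomFactor is a hypothetical safety
// margin so the rest of the process has room; tune to taste.
function applyCgroupMemoryLimit(maxMemoryMB: number | null): void {
    if (maxMemoryMB === null) return; // no limit detected, let Crawlee decide
    const headroomFactor = 0.75;
    process.env.CRAWLEE_MEMORY_MBYTES = String(Math.floor(maxMemoryMB * headroomFactor));
}
```

This must run before the crawler (and its Configuration) is created, since the env var is only consulted at startup.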
This can then be used to set a reasonable RAM limit for Crawlee. However, the CPU limits are proving more difficult. Has anyone found a fix yet?
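For reference, the cgroup v2 CPU limit lives in /sys/fs/cgroup/cpu.max, in the format "<quota> <period>" (microseconds), where quota/period is the number of CPUs allowed. A sketch of reading it in the same style as the memory helper above (parseCpuMax and getMaxCpus are my own hypothetical helpers, not part of Crawlee):

```typescript
import { existsSync, readFileSync } from 'node:fs';

// Parse the cgroup v2 "cpu.max" format: "<quota> <period>" in microseconds.
// "200000 100000" means 2 CPUs; "max 100000" means no CPU limit.
function parseCpuMax(data: string): number | null {
    const [quota, period] = data.trim().split(/\s+/);
    if (quota === 'max') return null; // no limit set
    const quotaUs = parseInt(quota, 10);
    const periodUs = parseInt(period, 10);
    if (!Number.isFinite(quotaUs) || !Number.isFinite(periodUs) || periodUs === 0) {
        return null; // unreadable or malformed
    }
    return quotaUs / periodUs;
}

// Companion to getMaxMemoryMB(): number of CPUs the cgroup allows,
// or null when unlimited/unreadable (e.g. cgroup v1, or not in a container).
function getMaxCpus(): number | null {
    const cgroupPath = '/sys/fs/cgroup/cpu.max';
    if (!existsSync(cgroupPath)) return null;
    try {
        return parseCpuMax(readFileSync(cgroupPath, 'utf-8'));
    } catch {
        return null;
    }
}
```

This only gets you the limit; wiring it into Crawlee's CPU accounting is the open question, though it could at least inform a manual maxConcurrency cap.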
3 Replies
Hall (4mo ago)
Someone will reply to you shortly. In the meantime, this might help:
extended-salmon (OP, 4mo ago)
Reading the code for the getMemoryInfo utility function (https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L53), it relies on the isDocker utility function before reading from cgroups (https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/general.ts#L39). I think my problem may be that, since I'm running in Kubernetes, this check fails and Crawlee defaults to working against the host's resource limits.
extended-salmon (OP, 4mo ago)
I'm going to try tricking this function by manually creating a /.dockerenv file to make isDocker return true. This appears to have worked: fudging in a /.dockerenv file makes Crawlee read the cgroup limits. The same issue still persists for CPU, though; it's reading the host's total CPU usage rather than its cgroup's.
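A sketch of that workaround as a startup step (ensureDockerEnvMarker is my own hypothetical helper; note it needs a writable root filesystem, which Kubernetes securityContext settings often forbid, so an entrypoint `touch /.dockerenv` may be simpler in practice):

```typescript
import { existsSync, writeFileSync } from 'node:fs';

// Create an empty /.dockerenv marker so Crawlee's isDocker() check passes
// inside Kubernetes. The path is parameterised only so the function can be
// exercised outside a container; in production the default is what matters.
function ensureDockerEnvMarker(markerPath = '/.dockerenv'): boolean {
    if (existsSync(markerPath)) return false; // already present, nothing to do
    try {
        writeFileSync(markerPath, '');
        return true;
    } catch {
        return false; // read-only root fs, insufficient permissions, etc.
    }
}
```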