Runpodβ€’17mo ago
BadNoise

Pipeline is not using gpu on serverless

Hi! I'm running bart-large-mnli on serverless, but as far as I can see from the worker stats it's not using the GPU. Do you know what I'm doing wrong? The image is my current handler.py. As the Docker base I'm using "FROM runpod/base:0.6.2-cuda12.2.0"; I also tried "runpod/pytorch:2.2.1-py3.10-cuda12.1.1-devel-ubuntu22.04", but still 0% GPU usage. Let me know if you need more details! Thank you 🙂
No description
57 Replies
digigoblin
digigoblinβ€’17mo ago
How are you running the model?
BadNoise
BadNoiseOPβ€’17mo ago
this is the Dockerfile; I'm building + pushing to my Docker registry and running it on a 24GB GPU on serverless
No description
BadNoise
BadNoiseOPβ€’17mo ago
and this is the model downloader
No description
PatrickR
PatrickRβ€’17mo ago
I have a feeling this line:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
is doing something funky. You should try adding some prints right after it:
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.memory_allocated())
print(torch.cuda.memory_reserved())
And see if your code thinks it is running on a CPU.
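Beyond those prints, a minimal sanity check is to look at where the model's weights actually live (a sketch; it assumes a model variable already loaded in your handler):

print(torch.cuda.is_available())        # False usually means a CPU-only torch wheel
print(next(model.parameters()).device)  # should print "cuda:0" after model.to(device)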
BadNoise
BadNoiseOPβ€’17mo ago
thank you! I'll try it immediately and let you know
BadNoise
BadNoiseOPβ€’17mo ago
@PatrickR this is the output
No description
BadNoise
BadNoiseOPβ€’17mo ago
I can give you the full repo if you need πŸ™‚
digigoblin
digigoblinβ€’17mo ago
Yep, it will be useful for us to help you test it
PatrickR
PatrickRβ€’17mo ago
That would be useful, yes! I'd love to test it out and see what is going on.
BadNoise
BadNoiseOPβ€’17mo ago
here it is! thank you so much for your help
PatrickR
PatrickRβ€’17mo ago
Risky click πŸ˜†
Unknown User
Unknown Userβ€’17mo ago
Message Not Public
BadNoise
BadNoiseOPβ€’17mo ago
if you'd prefer, I can give you the individual files
BadNoise
BadNoiseOPβ€’17mo ago
this is the folder structure
No description
Unknown User
Unknown Userβ€’17mo ago
Message Not Public
digigoblin
digigoblinβ€’17mo ago
it's already doing that
Unknown User
Unknown Userβ€’17mo ago
Message Not Public
BadNoise
BadNoiseOPβ€’17mo ago
with 5 concurrent requests it's ~5s per request
No description
Unknown User
Unknown Userβ€’17mo ago
Message Not Public
BadNoise
BadNoiseOPβ€’17mo ago
let me try again because I don't remember 😅 I'll launch the 32 vCPU and let you know!
Unknown User
Unknown Userβ€’17mo ago
Message Not Public
BadNoise
BadNoiseOPβ€’17mo ago
sure, no problem. I see 100% CPU usage and 0% for the GPU
Unknown User
Unknown Userβ€’17mo ago
Message Not Public
BadNoise
BadNoiseOPβ€’17mo ago
thanks for the tip, but I'm running stress tests, constantly sending requests for 1 minute, to understand how many requests it can handle, so it's always running
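For reference, a minimal stress-test sketch along those lines (the endpoint ID and API key are placeholders, and the 5-way concurrency matches the numbers mentioned above):

import concurrent.futures
import time

import requests

URL = "https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync"  # placeholder endpoint ID
HEADERS = {"Authorization": "Bearer <API_KEY>"}          # placeholder API key
PAYLOAD = {
    "input": {
        "sequence": "The weather is sunny today.",
        "labels": ["weather", "sports", "news"],
    }
}

def one_request():
    # Send one request and return (latency_seconds, status_code).
    start = time.time()
    resp = requests.post(URL, json=PAYLOAD, headers=HEADERS, timeout=120)
    return time.time() - start, resp.status_code

# Fire batches of 5 concurrent requests for roughly one minute.
deadline = time.time() + 60
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    while time.time() < deadline:
        futures = [pool.submit(one_request) for _ in range(5)]
        for f in concurrent.futures.as_completed(futures):
            latency, status = f.result()
            print(f"status={status} latency={latency:.2f}s")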
Unknown User
Unknown Userβ€’17mo ago
Message Not Public
BadNoise
BadNoiseOPβ€’17mo ago
another strange thing is that a cheap CPU on a Hugging Face inference endpoint performs faster than a 24GB GPU on RunPod (that's also why I think it's not using the GPU) 😅 still ~5 seconds with 5 concurrent requests on a 32 vCPU
Unknown User
Unknown Userβ€’17mo ago
Message Not Public
BadNoise
BadNoiseOPβ€’17mo ago
@nerdylive tried now, still 100% CPU usage and 0% for the GPU 😦
Madiator2011
Madiator2011β€’17mo ago
I might look at it
BadNoise
BadNoiseOPβ€’17mo ago
thank you πŸ™‚
PatrickR
PatrickRβ€’17mo ago
Hey, so I went through this, and here's my input:
{
  "input": {
    "sequence": "The weather is sunny today.",
    "labels": ["weather", "sports", "news"]
  }
}
and this output:
{
  "id": "test-822c3793-23b3-4464-8b65-972bb5776867",
  "status": "COMPLETED",
  "output": {
    "classification_result": {
      "sequence": "The weather is sunny today.",
      "labels": [
        "weather",
        "news",
        "sports"
      ],
      "scores": [
        0.989009439945221,
        0.24655567109584808,
        0.008112689480185509
      ]
    },
    "device": "cuda"
  }
}
Here is my Python code:
import torch
import runpod
from runpod.serverless.utils.rp_validator import validate
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(device)

INPUT_SCHEMA = {
    'sequence': {
        'type': str,
        'required': True
    },
    'labels': {
        'type': list,
        'required': True,
    }
}

def classify_text(sequence, labels):
    # Note: the model and tokenizer are re-loaded on every call; loading them
    # once at module scope would avoid repeating this work per request.
    model = AutoModelForSequenceClassification.from_pretrained(
        "facebook/bart-large-mnli",
        local_files_only=False  # set True to load only from the local cache
    ).to(device)
    tokenizer = AutoTokenizer.from_pretrained(
        "facebook/bart-large-mnli",
        local_files_only=False)

    classifier = pipeline(
        "zero-shot-classification",
        model=model,
        tokenizer=tokenizer,
        device=0,  # device index 0 = first GPU
    )

    return classifier(sequence, labels, multi_label=True)

async def handler(job):
    val_input = validate(job['input'], INPUT_SCHEMA)
    if 'errors' in val_input:
        return {"error": val_input['errors']}
    val_input = val_input['validated_input']

    classification_result = classify_text(val_input["sequence"], val_input["labels"])

    return {
        "classification_result": classification_result,
        "device": str(device)
    }

runpod.serverless.start({"handler": handler, "concurrency_modifier": lambda x: 1000})
Unknown User
Unknown Userβ€’17mo ago
Message Not Public
PatrickR
PatrickRβ€’17mo ago
So I am getting the GPU to run through CUDA. Yes, the output of the device is GPU. BTW, I used the CLI tool runpodctl project create for faster iteration cycles / not having to rebuild the Docker image constantly.
Unknown User
Unknown Userβ€’17mo ago
Message Not Public
PatrickR
PatrickRβ€’17mo ago
I rebuilt the Docker image based on another base image:
FROM runpod/base:0.6.1-cuda12.2.0

COPY builder/requirements.txt /requirements.txt
RUN python3.11 -m pip install --upgrade pip && \
    python3.11 -m pip install --upgrade -r /requirements.txt --no-cache-dir && \
    rm /requirements.txt

ADD . /

CMD python3.11 -u /src/handler.py
yhlong00000
yhlong00000β€’17mo ago
I think he's trying to use cache_model.py to cache the model locally when building the Docker image. He set local_files_only=True just to make sure it never downloads from the internet.
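A hypothetical sketch of what such a cache_model.py might look like (the actual file is only shown as a screenshot in the thread, so the names here are assumptions):

# Run at docker build time so the weights are baked into the image; the
# handler can then pass local_files_only=True and never hit the network.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "facebook/bart-large-mnli"

if __name__ == "__main__":
    # Downloading once here populates the Hugging Face cache inside the image.
    AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    AutoTokenizer.from_pretrained(MODEL_NAME)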
Unknown User
Unknown Userβ€’17mo ago
Message Not Public
yhlong00000
yhlong00000β€’17mo ago
I don't see anything wrong with that 😂 I am still wondering what Patrick changed to make it start using the GPU.
Unknown User
Unknown Userβ€’17mo ago
Message Not Public
PatrickR
PatrickRβ€’17mo ago
Sorry, my code was a bit of a red herring. Here is a screenshot of it running on the GPU though.
No description
Unknown User
Unknown Userβ€’17mo ago
Message Not Public
BadNoise
BadNoiseOPβ€’17mo ago
hi! thank you so much for your help, I'll try the suggested Docker image 🙂
yhlong00000
yhlong00000β€’17mo ago
I think this might be the root cause: in your requirements.txt, you have to pin torch==2.2.1
No description
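For illustration, the pinned requirements.txt could look something like this (the entries besides torch are assumptions based on the imports in the handler):

runpod
transformers
torch==2.2.1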
Madiator2011
Madiator2011β€’17mo ago
Make sure to install the CUDA version, not the CPU one
BadNoise
BadNoiseOPβ€’17mo ago
I'll try setting the torch version manually, because it's strange that I still see 0% GPU usage
No description
BadNoise
BadNoiseOPβ€’17mo ago
so I have to remove torch and use pytorch and pytorch-cuda=12.1 instead, right?
digigoblin
digigoblinβ€’17mo ago
pip3 install --no-cache-dir torch==2.3.0+cu121 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 && \
pip3 install --no-cache-dir xformers==0.0.26.post1 --index-url https://download.pytorch.org/whl/cu121
Assuming your base image is CUDA 12.1.
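A quick way to double-check which wheel actually got installed (a sketch; run it inside the image or worker), since a CPU-only build reports no CUDA version:

import torch

print(torch.__version__)          # e.g. "2.3.0+cu121" for a CUDA 12.1 wheel
print(torch.version.cuda)         # e.g. "12.1"; None means a CPU-only build
print(torch.cuda.is_available())  # should be True on a GPU worker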
BadNoise
BadNoiseOPβ€’17mo ago
that's crazy, always 0% 😩
No description
digigoblin
digigoblinβ€’17mo ago
It's using the GPU if the GPU memory is showing as used. That telemetry is not real-time and not reliable.
BadNoise
BadNoiseOPβ€’17mo ago
but it's strange that even when I run a stress test on it for over a minute, it never shows any usage 😅
digigoblin
digigoblinβ€’17mo ago
check nvidia-smi
yhlong00000
yhlong00000β€’17mo ago
I added some logs in the code and it is using the GPU.
No description
yhlong00000
yhlong00000β€’17mo ago
No description
digigoblin
digigoblinβ€’17mo ago
Yep, the GPU utilization telemetry always confuses people because it's not real-time
yhlong00000
yhlong00000β€’17mo ago
this one is interesting, lol πŸ˜‚
No description
Unknown User
Unknown Userβ€’17mo ago
Message Not Public
