Serverless multi-GPU

I have a model deployed on two 48 GB GPUs with 1 worker. It ran correctly the first time with CUDA distributed, but then it fails with this error:
error_message: "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)"
error_traceback: "Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_job.py
What can be the issue here?
9 Replies
codeRetarded
codeRetarded4mo ago
Update: if I stop for a long time and then send a request, it works. It seems to work every time after some refresh. Please help.
ashleyk
ashleyk4mo ago
What model? What are you running on serverless? Impossible to help without full information.
codeRetarded
codeRetarded4mo ago
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

def get_chat_response(job):  # serverless handler
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    input_query = job["input"]["input_query"]
    base_model, llama_tokenizer = create_base_model()
    prompt = f"""
something
"""
    model_input = llama_tokenizer(prompt, return_tensors="pt").to(device)
    prompt_len = len(prompt)

    base_model.eval()
    with torch.no_grad():
        resp = llama_tokenizer.decode(base_model.generate(**model_input, max_new_tokens=500)[0], skip_special_tokens=True)
    resp = extract_regex(resp)
    return resp

def create_base_model():
    model_id = "/base/13B-chat"
    peft_id = "/base/LLM_Finetune/tmp3/llama-output"

    base_model = AutoModelForCausalLM.from_pretrained(
        model_id,
        # quantization_config=quant_config,
        device_map='auto'
    )
    base_model.config.use_cache = False
    base_model.config.pretraining_tp = 1
    llama_tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    llama_tokenizer.pad_token = llama_tokenizer.eos_token
    llama_tokenizer.padding_side = "right"  # Fix for fp16

    base_model = PeftModel.from_pretrained(
        base_model,
        peft_id,
    )

    return base_model, llama_tokenizer
So this is my code, where I am trying to run a chat model; get_chat_response is the handler.
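One thing worth checking (a sketch, not something confirmed in this thread): with device_map='auto' the checkpoint is sharded across cuda:0 and cuda:1, so the hard-coded torch.device('cuda') in the handler resolves to cuda:0 and the inputs can land on a different device than the layer that consumes them. Moving the inputs to the device the model itself reports avoids that; the snippet below reuses create_base_model from the code above and is otherwise an assumption.

import torch

def get_chat_response(job):
    base_model, llama_tokenizer = create_base_model()  # from the snippet above
    # With device_map='auto' the weights are sharded across GPUs; ask the model
    # where it expects its inputs instead of hard-coding 'cuda' (which is cuda:0).
    device = base_model.device
    prompt = job["input"]["input_query"]
    model_input = llama_tokenizer(prompt, return_tensors="pt").to(device)
    base_model.eval()
    with torch.no_grad():
        output_ids = base_model.generate(**model_input, max_new_tokens=500)[0]
    return llama_tokenizer.decode(output_ids, skip_special_tokens=True)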
acomquest
acomquest4mo ago
I am facing a similar issue!
codeRetarded
codeRetarded4mo ago
I don't know if I should make any changes to the RunPod source code for multi-GPU?
ashleyk
ashleyk4mo ago
You usually need to set CUDA_VISIBLE_DEVICES to use more than one GPU, or configure your code to do so; it doesn't happen magically by itself.
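For reference, a minimal sketch of setting that variable from the handler code itself (an assumption, not something prescribed in this thread): it has to happen before PyTorch initialises CUDA, so it goes at the very top of the module.

import os
# Assumption: the worker has two GPUs attached; expose both of them to PyTorch.
# This must be set before CUDA is first initialised, i.e. before importing torch.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

import torch
print(torch.cuda.device_count())  # expect 2 if both GPUs are visible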
codeRetarded
codeRetarded3mo ago
Oh, you mean adding the devices in the Dockerfile while creating the container?
ashleyk
ashleyk3mo ago
No, that won't work
codeRetarded
codeRetarded3mo ago
Then you mean exporting the variable before running the code? But I don't understand why it works correctly the first time the worker is spawned.
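A quick way to see why behaviour changes between requests (a diagnostic sketch, not from the thread) is to log how the layers were actually placed each time the model is loaded; models loaded with device_map='auto' carry that assignment in hf_device_map.

base_model, llama_tokenizer = create_base_model()
# Record of which module landed on which GPU for this load; None if the
# attribute is absent (e.g. hidden behind the PEFT wrapper).
print(getattr(base_model, "hf_device_map", None))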