2024-01-20T00:36:26.942667713Z
2024-01-20T00:36:26.943297221Z ==========
2024-01-20T00:36:26.943372701Z == CUDA ==
2024-01-20T00:36:26.943619654Z ==========
2024-01-20T00:36:26.952680191Z
2024-01-20T00:36:26.952702058Z CUDA Version 11.8.0
2024-01-20T00:36:26.952707724Z
2024-01-20T00:36:26.952711901Z Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2024-01-20T00:36:26.952716521Z
2024-01-20T00:36:26.952720974Z This container image and its contents are governed by the NVIDIA Deep Learning Container License.
2024-01-20T00:36:26.952726474Z By pulling and using the container, you accept the terms and conditions of this license:
2024-01-20T00:36:26.952730628Z https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
2024-01-20T00:36:26.952735187Z
2024-01-20T00:36:26.952739194Z A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
2024-01-20T00:36:26.967114577Z
2024-01-20T00:36:31.398534811Z /usr/local/lib/python3.11/dist-packages/transformers/utils/hub.py:123: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
2024-01-20T00:36:31.398583407Z warnings.warn(
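The FutureWarning above is from the transformers library: TRANSFORMERS_CACHE is deprecated and HF_HOME replaces it in v5. A minimal fix in the worker's startup code, assuming the cache should live on the mounted network volume (the /runpod-volume/huggingface path is illustrative, not taken from this log):

    import os

    # Drop the deprecated variable and point the Hugging Face cache at the
    # network volume via HF_HOME, which transformers v5 will require.
    os.environ.pop("TRANSFORMERS_CACHE", None)
    os.environ["HF_HOME"] = "/runpod-volume/huggingface"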
2024-01-20T00:36:33.126995125Z WARNING 01-20 00:36:33 config.py:175] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-20T00:36:33.127225878Z INFO 01-20 00:36:33 llm_engine.py:73] Initializing an LLM engine with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ', tokenizer='TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir='/runpod-volume/', load_format=auto, tensor_parallel_size=1, quantization=awq, enforce_eager=False, seed=0)
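The engine config logged above can be reproduced with vLLM's offline LLM class. A sketch, assuming the log's max_seq_len corresponds to the max_model_len argument and that download_dir is passed through to EngineArgs:

    from vllm import LLM

    llm = LLM(
        model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
        quantization="awq",           # source of the "not fully optimized" warning
        dtype="float16",
        trust_remote_code=True,
        max_model_len=32768,
        download_dir="/runpod-volume/",
        tensor_parallel_size=1,
        enforce_eager=False,          # False enables CUDA graph capture (below)
        seed=0,
    )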
2024-01-20T00:39:28.047075314Z INFO 01-20 00:39:28 llm_engine.py:223] # GPU blocks: 4982, # CPU blocks: 2048
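The block counts above fix the KV-cache capacity. Assuming vLLM's default block size of 16 tokens per block (not shown in this log), the GPU cache holds 4982 x 16 = 79,712 tokens, i.e. roughly 2.4 concurrent sequences at the full 32,768-token context:

    # Capacity implied by the "# GPU blocks" line, assuming block_size=16.
    gpu_blocks, block_size, max_model_len = 4982, 16, 32768
    token_capacity = gpu_blocks * block_size       # 79,712 tokens
    print(token_capacity / max_model_len)          # ~2.43 full-length sequences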
2024-01-20T00:39:30.772439067Z INFO 01-20 00:39:30 model_runner.py:403] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-01-20T00:39:30.772471720Z INFO 01-20 00:39:30 model_runner.py:407] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
2024-01-20T00:39:50.500041757Z INFO 01-20 00:39:50 model_runner.py:449] Graph capturing finished in 20 secs.
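If the extra 1-3 GiB from graph capture causes out-of-memory errors, the two lines above suggest the remedies: enforce eager mode or lower gpu_memory_utilization. A sketch of both knobs on the same LLM constructor as above (0.90 is vLLM's documented default utilization):

    from vllm import LLM

    llm = LLM(
        model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
        quantization="awq",
        enforce_eager=True,            # skip CUDA graph capture entirely
        gpu_memory_utilization=0.85,   # or shrink the cache below the 0.90 default
    )

The same effect is available as the --enforce-eager flag when launching vLLM's CLI entrypoints.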
2024-01-20T00:39:50.506268314Z --- Starting Serverless Worker | Version 1.5.2 ---
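The banner above comes from RunPod's Python serverless SDK, and the "Finished running generator." line below implies the handler is a generator that streams output. A minimal sketch, assuming the runpod package and the llm object from the earlier snippet (the payload shape job["input"]["prompt"] is an assumption, not taken from this log):

    import runpod
    from vllm import SamplingParams

    def handler(job):
        # Hypothetical payload shape: one prompt in, generated text streamed out.
        prompt = job["input"]["prompt"]
        params = SamplingParams(max_tokens=512)
        for request_output in llm.generate(prompt, params):
            yield request_output.outputs[0].text

    # Generator handlers produce the "Finished running generator." log line;
    # return_aggregate_stream also concatenates chunks for non-streaming callers.
    runpod.serverless.start({"handler": handler, "return_aggregate_stream": True})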
2024-01-20T00:39:57.685484188Z {"requestId": "7756be1f-38eb-4672-9097-7785b367df08-u1", "message": "Finished running generator.", "level": "INFO"}
2024-01-20T00:39:57.755166616Z {"requestId": "7756be1f-38eb-4672-9097-7785b367df08-u1", "message": "Finished.", "level": "INFO"}