Runpod S3 multipart upload error
When I use boto3.client.upload_file for a multipart upload, all parts are successfully uploaded to S3, but I encounter an error: "An error occurred (InternalError) when calling the CompleteMultipartUpload operation (reached max retries: 10): Failed to create final object file".
The file size is 2.1 GB, and both the storage space (20 GB) and the part sizes (200-300 MB) are within limits.
24 Replies
@lsy0882#6315
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #19858
I have the same issue when I try to use boto3, the AWS CLI, or s3api.
Command for upload:
Errors:
I've experienced exactly the same issue you're describing with boto3 multipart uploads to RunPod's S3-compatible storage.
Here's the exact error message I consistently received at the final step (CompleteMultipartUpload):
To confirm, I checked that all individual parts were successfully uploaded. I verified this by observing the temporary files stored in the .s3compat_uploads directory. Here's a direct snapshot from my logs showing that all parts (for a ~2.1 GB file) were successfully uploaded:
When I reached out to RunPod's support team with detailed information, including timestamps and bucket details, they responded as follows:
"A 524 error is generated by Cloudflare, not by RunPod. This error indicates a timeout between Cloudflare and our backend services. One common scenario where this occurs is when a file upload takes too long to complete—this is likely what's happening in your case.
As suggested earlier, using a smaller part size for your uploads can help prevent this timeout. Most S3-compatible clients allow you to set or adjust the part size; we recommend ensuring each part is 500MB or less."
From their explanation, it seems Cloudflare, acting as an intermediary between our client and RunPod's storage backend, triggers this timeout issue specifically during the final object creation step. The suggested workaround from RunPod was explicitly to use smaller chunk sizes, ideally 500 MB or less. Despite implementing smaller chunks, I've continued to experience intermittent issues, suggesting the underlying network timeout problem isn't fully addressed by chunk size alone. I hope sharing these detailed experiences and logs helps clarify the underlying issue for you.
Following the team's suggestion, I tried modifying the upload chunk size.
I tested various settings with 16MB, 64MB, and even 500MB chunks, but unfortunately, the error persists.
I'm in the same boat. I've also tried uploading with different chunk sizes (5 MB, 8 MB, 16 MB, 32 MB, 64 MB, 128 MB, 300 MB, 500 MB), but I'm still facing the same error.
I've confirmed that the multipart parts were uploaded successfully using the command aws s3 ls --summarize --human-readable --recursive --region EU-RO-1 --endpoint-url https://s3api-eu-ro-1.runpod.io/ s3://{bucket_id}/, but the error keeps occurring at the CompleteMultipartUpload stage.
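For what it's worth, none of the chunk sizes tried above come close to S3's multipart limits. A quick sanity check in plain Python (the 2.1 GB file size is taken from this thread):

```python
import math

MB = 1024 * 1024
file_size = int(2.1 * 1024 * MB)  # ~2.1 GiB, as reported in this thread

def part_count(size, chunk):
    """Number of parts a multipart upload produces for a given chunk size."""
    return math.ceil(size / chunk)

# S3 requires parts of at least 5 MiB (except the last) and at most
# 10,000 parts per upload; every size tried here is well within spec.
for chunk_mb in (5, 8, 16, 32, 64, 128, 300, 500):
    print(f"{chunk_mb:>4} MB chunks -> {part_count(file_size, chunk_mb * MB)} parts")
```

Since even 5 MB chunks yield only a few hundred parts, the failures can't be explained by client-side part-count or part-size limits.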
I suspect there is a bug in the RunPod S3 backend's logic for concatenating the multipart data.
I sent a reply to the support ticket but haven't received a response yet.
Please open a new ticket.
I have opened a new ticket for this issue.
I am having exactly the same issue. When using boto3, even the individual parts fail to upload. With aws s3 cp the parts are uploaded, but merging them fails with the error above. This happens even at 8 MB chunks. Basically this makes the network volume unusable, because I cannot upload any files there.
aws s3api list-multipart-uploads lists the failed upload. put-object works for a 300 MB file, but is much slower. For a 10 GB file I get "An error occurred (524) when calling the PutObject operation".
I'm experiencing this issue as well and am following this thread for updates on the bug.
Any update?
Maybe try smaller and bigger chunks?
The commands we used are as follows:
Errors:
- An error occurred (InternalError) when calling the CompleteMultipartUpload operation (reached max retries: 2): Failed to create final object file
- An error occurred (504) when calling the CompleteMultipartUpload operation (reached max retries: 2): Gateway Timeout
- An error occurred (524) when calling the UploadPart operation
Exception caught when parsing error response body: Traceback (most recent call last):
File "awscli/botocore/parsers.py", line 537, in _parse_xml_string_to_dom
xml.etree.ElementTree.ParseError: syntax error: line 1, column 0
This consistent failure, irrespective of the destination path or the multipart configuration, strongly reinforces our belief that this is a server-side issue with the final file assembly process.
This problem is preventing us from using the Network Volume service.
Hey, sorry for no response here over the long weekend. Let me look through this thread and understand the issue.
Do you have any news?
Thanks for the reminder, I can forget things sometimes :wires:
To reiterate support's request: we recommend multipart uploads with parts under 500MB because of limits placed by Cloudflare.
For you, can we see the Python you're using here or what exactly you're doing to get any of these errors? Particularly the 524 for UploadPart?
For everything else, I have someone looking into the issues you're seeing - I'll update you here when I get a message from him
At the following link https://drive.google.com/drive/folders/1lqWoBfgTq9QhJ36IQSKdiQuHtFIcycoa?usp=sharing you will find the various log files and the python script that I used to upload files using boto3 with Python 3.9
Any update from client service?
All I have so far is this:
Based on the timestamps of those files, the "Failed to create final object file" errors were occurring before the fix for subdirectories was deployed, so it is possible that they were still hitting that issue. I'm still waiting to hear back on the 524 and 504, but you may just be good to go? Let me know!
More than a week after I submitted a ticket for this issue, I received the following response:
I have already tried experimenting with a higher read_timeout setting, which resulted in the exact same error.
Let me be clear: the multi-part concatenation logic within RunPod S3 itself seems to be flawed. Instead of testing or suggesting solutions on the configuration side, why don't you inspect the internal logic of RunPod S3?
To be frank, it is technically very disappointing to receive such a simple answer after the RunPod team supposedly spent over a week investigating the cause. Our company is currently reviewing the integration of RunPod S3 into our services. However, if the RunPod team continues to provide these kinds of answers and the bug remains because they have failed to properly identify the root cause, we will be unable to use RunPod.
@Dj
We're aware that Multipart Concat is a little buggy, especially at extremely large file sizes. Admittedly, it can take a very long time for especially technical details to travel through support. You can always ask me here, although after hours like now it could take me a while to reply.
I'm working with our team to make sure users don't get delivered the same suggestions twice. I can't access your ticket while I'm driving, but the engineer working on the S3 API Compatibility was made aware of a potential issue with this code path on Monday and on Tuesday suggested he was still testing for the root cause.
We believe that users with files over 12GB will continue to see issues, but most users today (even those at that file size) should see the issue remedied with a longer timeout. It's a fairly complicated issue: if we're sloppy we risk corrupting files, and a better fix would involve patching the file system that powers our network storage clusters (which is what powers the S3 API).
We're definitely not done working on this.
Hi @lsy0882#6315, Please check this thread - https://discord.com/channels/912829806415085598/1401371439176745100 . Upload works fine with their script. Hope it helps.