Runpod S3 multipart upload error
When I use boto3.client.upload_file for a multipart upload, all parts are successfully uploaded to S3, but I encounter an error: "An error occurred (InternalError) when calling the CompleteMultipartUpload operation (reached max retries: 10): Failed to create final object file".
The file size is 2.1 GB, and both the storage space (20 GB) and the part sizes (200-300 MB) are within limits.
24 Replies
@lsy0882#6315
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #19858
I have the same issue when I try to use boto3, the AWS CLI, or s3api.
Command for upload:
Errors:
I've experienced exactly the same issue you're describing with boto3 multipart uploads to RunPod's S3-compatible storage.
Here's the exact error message I consistently received at the final step (CompleteMultipartUpload):
To confirm, I checked that all individual parts were successfully uploaded. I verified this by observing the temporary files stored in the .s3compat_uploads directory. Here's a direct snapshot from my logs showing that all parts (for a ~2.1 GB file) were successfully uploaded:
When I reached out to RunPod's support team with detailed information, including timestamps and bucket details, they responded as follows:
"A 524 error is generated by Cloudflare, not by RunPod. This error indicates a timeout between Cloudflare and our backend services. One common scenario where this occurs is when a file upload takes too long to complete—this is likely what's happening in your case.
As suggested earlier, using a smaller part size for your uploads can help prevent this timeout. Most S3-compatible clients allow you to set or adjust the part size; we recommend ensuring each part is 500MB or less."
From their explanation, it seems Cloudflare, acting as an intermediary between our client and RunPod's storage backend, triggers this timeout issue specifically during the final object creation step. The suggested workaround from RunPod was explicitly to use smaller chunk sizes, ideally 500 MB or less. Despite implementing smaller chunks, I've continued to experience intermittent issues, suggesting the underlying network timeout problem isn't fully addressed by chunk size alone. I hope sharing these detailed experiences and logs helps clarify the underlying issue for you.
Following the team's suggestion, I tried modifying the upload chunk size.
I tested various settings with 16MB, 64MB, and even 500MB chunks, but unfortunately, the error persists.
I'm in the same boat. I've also tried uploading with different chunk sizes (5 MB, 8 MB, 16 MB, 32 MB, 64 MB, 128 MB, 300 MB, 500 MB), but I'm still facing the same error.
I've confirmed that the multipart parts were uploaded successfully using the command aws s3 ls --summarize --human-readable --recursive --region EU-RO-1 --endpoint-url https://s3api-eu-ro-1.runpod.io/ s3://{bucket_id}/, but the error keeps occurring at the CompleteMultipartUpload stage.
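For what it's worth, none of the chunk sizes tried above come close to S3's multipart limits. A quick sanity check in plain Python (the 2.1 GB file size is taken from this thread):

```python
import math

MB = 1024 * 1024
file_size = int(2.1 * 1024 * MB)  # ~2.1 GiB, as reported in this thread

def part_count(size, chunk):
    """Number of parts a multipart upload produces for a given chunk size."""
    return math.ceil(size / chunk)

# S3 requires parts of at least 5 MiB (except the last) and at most
# 10,000 parts per upload; every size tried here is well within spec.
for chunk_mb in (5, 8, 16, 32, 64, 128, 300, 500):
    print(f"{chunk_mb:>4} MB chunks -> {part_count(file_size, chunk_mb * MB)} parts")
```

Since even 5 MB chunks yield only a few hundred parts, the failures can't be explained by client-side part-count or part-size limits.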
I suspect there is a bug in the RunPod S3 backend's logic for concatenating the multipart data.
I sent a reply to the support ticket but haven't received a response yet.
Please open a new ticket.
I have opened a new ticket for this issue.
I am having exactly the same issue. When using boto3, even the individual parts fail to upload. With aws s3 cp the parts are uploaded, but merging them fails with the error above. This happens even at 8 MB chunks. Basically this makes the network volume unusable, because I cannot upload any files there.
aws s3api list-multipart-uploads lists the failed upload. put-object works for a 300 MB file, but is much slower. For a 10 GB file I get "An error occurred (524) when calling the PutObject operation".
I'm experiencing this issue as well and am following this thread for updates on the bug.
Any update?
Maybe try smaller and bigger chunks?
The commands we used are as follows:
Errors:
- An error occurred (InternalError) when calling the CompleteMultipartUpload operation (reached max retries: 2): Failed to create final object file
- An error occurred (504) when calling the CompleteMultipartUpload operation (reached max retries: 2): Gateway Timeout
- An error occurred (524) when calling the UploadPart operation
Exception caught when parsing error response body: Traceback (most recent call last):
File "awscli/botocore/parsers.py", line 537, in _parse_xml_string_to_dom
xml.etree.ElementTree.ParseError: syntax error: line 1, column 0
This consistent failure, irrespective of the destination path or the multipart configuration, strongly reinforces our belief that this is a server-side issue with the final file assembly process.
This problem is preventing us from using the Network Volume service.
Hey, sorry for no response here over the long weekend. Let me look through this thread and understand the issue.
Do you have any news?
Thanks for the reminder, I can forget things sometimes :wires:
To reiterate support's request: we recommend multipart uploads with parts under 500MB because of limits placed by Cloudflare.
For you, can we see the Python you're using here or what exactly you're doing to get any of these errors? Particularly the 524 for UploadPart?
For everything else, I have someone looking into the issues you're seeing - I'll update you here when I get a message from him
At the following link https://drive.google.com/drive/folders/1lqWoBfgTq9QhJ36IQSKdiQuHtFIcycoa?usp=sharing you will find the various log files and the python script that I used to upload files using boto3 with Python 3.9
Any update from client service?
All I have so far is this:
Based on the timestamps of those files, the "Failed to create final object file" errors were occurring before the fix for subdirectories was deployed, so it is possible that they were still hitting that issue. I'm still waiting to hear back on the 524 and 504, but you may just be good to go? Let me know!
More than a week after I submitted a ticket for this issue, I received the following response:
I have already tried experimenting with a higher read_timeout setting, which resulted in the exact same error.
Let me be clear: the multi-part concatenation logic within RunPod S3 itself seems to be flawed. Instead of testing or suggesting solutions on the configuration side, why don't you inspect the internal logic of RunPod S3?
To be frank, it is technically very disappointing to receive such a simple answer after the RunPod team supposedly spent over a week investigating the cause. Our company is currently reviewing the integration of RunPod S3 into our services. However, if the RunPod team continues to provide these kinds of answers and the bug remains because they have failed to properly identify the root cause, we will be unable to use RunPod.
@Dj
We're aware that Multipart Concat is a little buggy, especially at extremely large file sizes. Admittedly, it can take a very long time for especially technical details to travel through support. You can always ask me here, although after hours like now it could take me a while to reply.
I'm working with our team to make sure users don't get delivered the same suggestions twice. I can't access your ticket while I'm driving, but the engineer working on the S3 API Compatibility was made aware of a potential issue with this code path on Monday and on Tuesday suggested he was still testing for the root cause.
We believe that users with files over 12GB will continue to see issues, but most users today (even those at that file size) should see the issue remedied with a longer timeout. It's a fairly complicated issue: if we're sloppy we risk corrupting files, and a better fix would involve patching the file system that powers our network storage clusters (which is what powers the S3 API).
We're definitely not done working on this.
Hi @lsy0882#6315, Please check this thread - https://discord.com/channels/912829806415085598/1401371439176745100 . Upload works fine with their script. Hope it helps.