Comments (12)
I've been thinking along similar lines - `PutObject` now has `Body`, which can be a string or a `Buffer`. It should also allow `Uint8Array` and `Readable`.
from aws-lite.
@keithlayne an important thing to note: streaming put is actually a different S3 API (which I did not recall or fully appreciate when I left those comments in the codebase), known as a multipart upload (ref: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CreateMultipartUpload.html). This is the underlying API to `@aws-sdk/lib-storage` `Upload` and `aws-sdk` `S3.upload`.
If reading off the filesystem is insufficient for (y)our needs, I believe that will be the approach we'll have to take!
@metadaddy this is very helpful, thanks! There's definitely a lot of overloaded terminology at play here, as well as the significant complexity of the S3 API. In this case, I think most of the streaming stuff we're referring to is just Node.js streams to/from the filesystem (even with single-chunk payloads) to keep things as efficient as possible. But certainly what you're talking about is also highly relevant background for these conversations, and represents functionality we need to ship, as well!
I just really started looking at aws-lite today, and it looks like a great fit for our needs, but streaming is probably essential to us - we're running node in some cases on pretty constrained hardware.
I looked at the `PutObject` source today; looks like even with a `File` input it's calling `fs.readFile`, and then sorta streaming to S3. There are some comments in there already regarding e.g. `fs.createReadStream`. I think there are some potential challenges there with chunking and signatures, but I think streaming is very doable.
I'm going to see if I can get my stuff basically working with aws-lite as-is (we have some other constraints), and if so, I'll look into working on this, probably starting with `PutObject`.
> @keithlayne an important thing to note: streaming put is actually a different S3 API (which I did not recall or fully appreciate when I left those comments in the codebase), known as a multipart upload (ref: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CreateMultipartUpload.html). This is the underlying API to `@aws-sdk/lib-storage` `Upload` and `aws-sdk` `S3.upload`.
Hmmm. I'm sorta aware of that, but have only used lib-storage in the browser, via a bunch of presigned http urls. My existing stuff passes `fs.createReadStream` to `PutObject`.
I will double-check that to make sure I'm not lying to you, and maybe look into the aws source (fun).
My main issue is with this line: `aws-lite/plugins/s3/src/put-object.mjs`, line 155 at commit `993a073`.
Also, some things I found while experimenting:
- The limit for `PutObject` in the sdk is 5G. When I tried to put a 5GiB file with aws-lite, I got the error: `File size (5368709120) is greater than 2 GiB`.
- Then I tried to put a 2GiB file... and got `File size (2147483648) is greater than 2 GiB` - turns out the actual limit is 2Gi - 1 😄
- I was able to upload the 5GiB file using aws-sdk like this: `systemd-run --scope -p MemoryMax=50M -p MemorySwapMax=50M --user node ./test`. The script just does `s3.putObject({ Body: createReadStream(...), ... })` and prints out `process.memoryUsage` every 250ms. There's a lower limit for memory/swap where this works, but it certainly is streaming.
We have on rare occasions needed to upload files > 5GiB, which would necessitate a multipart upload, but that's the exception for us. Getting regular puts to not read the whole file into memory would be a major improvement for us.
I did look at aws-sdk source, which is pretty painful and hard to follow. The easiest thing might be for me to run it through the debugger to get a real sense of what's going on. There's a package there about node streams, which is fairly simple but I didn't look hard enough yet to see how it's being used.
Hah! I was confused - the 2GB limit comes from node, and the off-by-one error seems to be theirs. Please disregard that part.
Yeah, I recently also ran into the 2GB limit and was aware it's S3 related. So we do need to enable streaming of static, fully-written files as well!
I'm working on a POC for streaming right now. I have some criticisms (hopefully constructive!) of the current implementation.
I think some fairly large changes here would:
- be very useful to me
- help align the API with aws-sdk
(There are also some not-quite-as-large changes that I think would do basically the same stuff within the current s3 plugin api)
Aside from other daily work distractions, I'm pretty much focused on this. My hope is that there can be consensus and we can work together to get something acceptable for the s3 plugin proper.
I apologize for kinda hijacking this issue; I can make a new one specific to `PutObject` with my thoughts and ideas, or just make a PR. If anyone has feelings about this, please let me know.
I don't think you hijacked the issue, this all feels very much in the same vein to me. Very curious to see how your streaming proof of concept goes; in general I'm aligned with making `GetObject` + `PutObject` more streamable. Related, streaming `PutObject` is currently lower hanging fruit than `GetObject` just because the lower-level aws-lite request methods already support transmitting via stream; the trick there is going to be handling all the signature calculations.
This whole area gets complicated quickly. Here's some, hopefully, useful information:
You can't really stream data to S3, since S3 uploads are atomic; every upload operation (`PutObject`/`UploadPart`) has to include the length of the content in advance of the payload.
What you can do is stream to a buffer, and upload the buffer to S3.
There are two orthogonal dimensions to uploading data to S3 - single vs multipart uploads, and single chunk vs multiple chunk payloads.
**Single vs Multipart Uploads**
If you're creating an object of 5 GiB or less, you can use a single `PutObject` operation. If you know the size of the object, it's more efficient to use `PutObject`, since it's a single HTTP request vs a minimum of three HTTP requests for a multipart upload (`CreateMultipartUpload`, `UploadPart`, `CompleteMultipartUpload`).
Since, in a streaming situation, the SDK doesn't know the size of the object in advance, it doesn't know whether it will fit into a single PutObject payload. So it makes sense to just use multipart, since a multipart upload can have a single part, and the last part of a multipart upload can be smaller than the minimum part size (5 MiB) enforced for the previous parts.
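That three-request flow can be sketched with an injected client; the `s3` argument is assumed to expose promise-returning `CreateMultipartUpload`, `UploadPart`, and `CompleteMultipartUpload` methods (shapes are illustrative, based on the S3 API, not any particular SDK):

```javascript
// Upload an iterable of pre-sized parts (each 5 MiB-5 GiB, except the
// last, which may be smaller) as one multipart upload.
async function multipartPut (s3, Bucket, Key, parts) {
  const { UploadId } = await s3.CreateMultipartUpload({ Bucket, Key })
  const Parts = []
  let PartNumber = 1
  for (const Body of parts) {
    const { ETag } = await s3.UploadPart({ Bucket, Key, UploadId, PartNumber, Body })
    Parts.push({ ETag, PartNumber }) // S3 needs each part's ETag to finish
    PartNumber++
  }
  // Completing the upload stitches the parts into a single object
  return s3.CompleteMultipartUpload({ Bucket, Key, UploadId, MultipartUpload: { Parts } })
}
```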
**Single Chunk vs Multiple Chunk Payloads**
Now, when uploading a part via `UploadPart`, you can send the payload as a single chunk, including the `Content-Length` header and the entire body, or use AWS' non-standard `aws-chunked` content encoding to send multiple chunks. Even in the latter case, you still have to supply the payload content length upfront, in `x-amz-decoded-content-length`, so it's not useful in this situation. As far as I can see, the only reason to use `aws-chunked` would be if you were memory-constrained and you wanted your upload buffer to be less than the minimum part size of 5 MiB.
So, you need to use multipart uploads, where each part is between 5 MiB and 5 GiB, and a producer-consumer pattern where client code is streaming to a buffer which the SDK periodically uploads to S3.
One more complication: as well as the constraints on part size, an object can comprise a maximum of 10,000 parts, so you have to allow the caller to specify the part size. If you (the SDK) guess a part size that's too small, you'll run out of parts before you've uploaded the entire object. Too big, and you needlessly increase your memory footprint.
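That sizing trade-off reduces to simple arithmetic. A sketch of a default part-size guess under those constraints (a caller-supplied value should still override it):

```javascript
const MIN_PART_SIZE = 5 * 1024 * 1024 // 5 MiB floor for all but the last part
const MAX_PARTS = 10000               // hard cap on parts per object

// Smallest part size that both satisfies the 5 MiB floor and fits the
// whole object within 10,000 parts.
function choosePartSize (objectSize) {
  return Math.max(MIN_PART_SIZE, Math.ceil(objectSize / MAX_PARTS))
}
```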
Hope this makes sense, and is useful. I saw the recent conversation here, and I thought it was worth doing a quick brain dump!
Streaming on `GetObject` and `PutObject` is now available in `@aws-lite/s3` v0.1.21 when used with aws-lite v0.21.0!
For `GetObject`: opt into the stream by passing the new `streamResponsePayload: true` property in your requests, like so:
`await aws.s3.GetObject({ Bucket, Key, streamResponsePayload: true })`
For `PutObject`, simply pass the `File` property, and by default it will stream from disk.