Comments (12)
I've been thinking along similar lines - `PutObject` now has `Body`, which can be a string or a `Buffer`. It should also allow `Uint8Array` and `Readable`.
from aws-lite.
@keithlayne an important thing to note: streaming put is actually a different S3 API (which I did not recall or fully appreciate when I left those comments in the codebase), known as a multipart upload (ref: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CreateMultipartUpload.html). This is the underlying API to `@aws-sdk/lib-storage` `Upload` and `aws-sdk` `S3.upload`.
If reading off the filesystem is insufficient for (y)our needs, I believe that will be the approach we'll have to take!
@metadaddy this is very helpful, thanks! There's definitely a lot of overloaded terminology at play here, as well as the significant complexity of the S3 API. In this case, I think most of the streaming stuff we're referring to is just Node.js streams to/from the filesystem (even with single-chunk payloads) to keep things as efficient as possible. But certainly what you're talking about is also highly relevant background for these conversations, and represents functionality we need to ship, as well!
I just really started looking at aws-lite today, and it looks like a great fit for our needs, but streaming is probably essential to us - we're running node in some cases on pretty constrained hardware.
I looked at the `PutObject` source today; looks like even with a `File` input it's calling `fs.readFile`, and then sorta streaming to S3. There are some comments in there already regarding e.g. `fs.createReadStream`. I think there are some potential challenges there with chunking and signatures, but I think streaming is very doable.
I'm going to see if I can get my stuff basically working with aws-lite as-is (we have some other constraints), and if so, I'll look into working on this, probably starting with `PutObject`.
> @keithlayne an important thing to note: streaming put is actually a different S3 API (which I did not recall or fully appreciate when I left those comments in the codebase), known as a multipart upload (ref: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CreateMultipartUpload.html). This is the underlying API to `@aws-sdk/lib-storage` `Upload` and `aws-sdk` `S3.upload`.
Hmmm. I'm sorta aware of that, but have only used lib-storage in the browser, via a bunch of presigned http urls. My existing stuff passes `fs.createReadStream` to `PutObject`.
I will double-check that to make sure I'm not lying to you, and maybe look into the aws source (fun).
My main issue is with this line: `aws-lite/plugins/s3/src/put-object.mjs`, line 155 at commit `993a073`.
Also, some things I found while experimenting:
- The limit for `PutObject` in the sdk is 5G. When I tried to put a 5GiB file with aws-lite, I got the error: `File size (5368709120) is greater than 2 GiB`.
- Then I tried to put a 2GiB file... and got `File size (2147483648) is greater than 2 GiB` - turns out the actual limit is 2Gi - 1 😄
- I was able to upload the 5GiB file using aws-sdk like this: `systemd-run --scope -p MemoryMax=50M -p MemorySwapMax=50M --user node ./test`. The script just does `s3.putObject({ Body: createReadStream(...), ... })` and prints out `process.memoryUsage` every 250ms. There's a lower limit for memory/swap where this works, but it certainly is streaming.
We have on rare occasions needed to upload files > 5GiB, which would necessitate a multipart upload, but that's the exception for us. Getting regular puts to not read the whole file into memory would be a major improvement for us.
I did look at aws-sdk source, which is pretty painful and hard to follow. The easiest thing might be for me to run it through the debugger to get a real sense of what's going on. There's a package there about node streams, which is fairly simple but I didn't look hard enough yet to see how it's being used.
Hah! I was confused - the 2GB limit comes from node, and the off-by-one error seems to be theirs. Please disregard that part.
Yeah, I recently also ran into the 2GB limit and was aware it's S3 related. So we do need to enable streaming of static, fully-written files as well!
I'm working on a POC for streaming right now. I have some criticisms (hopefully constructive!) of the current implementation.
I think some fairly large changes here would:
- be very useful to me
- help align the API with aws-sdk
(There are also some not-quite-as-large changes that I think would do basically the same stuff within the current s3 plugin api)
Aside from other daily work distractions, I'm pretty much focused on this. My hope is that there can be consensus and we can work together to get something acceptable for the s3 plugin proper.
I apologize for kinda hijacking this issue; I can make a new one specific to `PutObject` with my thoughts and ideas, or just make a PR. If anyone has feelings about this, please let me know.
I don't think you hijacked the issue, this all feels very much in the same vein to me. Very curious to see how your streaming proof of concept goes; in general I'm aligned with making `GetObject` + `PutObject` more streamable. Related, streaming `PutObject` is currently lower hanging fruit than `GetObject` just because the lower-level aws-lite request methods already support transmitting via stream; the trick there is going to be handling all the signature calculations.
This whole area gets complicated quickly. Here's some, hopefully, useful information:
You can't really stream data to S3, since S3 uploads are atomic; every upload operation (`PutObject`/`UploadPart`) has to include the length of the content in advance of the payload.
What you can do is stream to a buffer, and upload the buffer to S3.
There are two orthogonal dimensions to uploading data to S3 - single vs multipart uploads, and single chunk vs multiple chunk payloads.
**Single vs Multipart Uploads**
If you're creating an object of 5 GiB or less, you can use a single `PutObject` operation. If you know the size of the object, it's more efficient to use `PutObject`, since it's a single HTTP request vs a minimum of three HTTP requests for a multipart upload (`CreateMultipartUpload`, `UploadPart`, `CompleteMultipartUpload`).
Since, in a streaming situation, the SDK doesn't know the size of the object in advance, it doesn't know whether it will fit into a single PutObject payload. So it makes sense to just use multipart, since a multipart upload can have a single part, and the last part of a multipart upload can be smaller than the minimum part size (5 MiB) enforced for the previous parts.
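That three-request flow can be sketched with an injected client; the `s3` argument is assumed to expose promise-returning `CreateMultipartUpload`, `UploadPart`, and `CompleteMultipartUpload` methods (shapes are illustrative, based on the S3 API, not any particular SDK):

```javascript
// Upload an iterable of pre-sized parts (each 5 MiB-5 GiB, except the
// last, which may be smaller) as one multipart upload.
async function multipartPut (s3, Bucket, Key, parts) {
  const { UploadId } = await s3.CreateMultipartUpload({ Bucket, Key })
  const Parts = []
  let PartNumber = 1
  for (const Body of parts) {
    const { ETag } = await s3.UploadPart({ Bucket, Key, UploadId, PartNumber, Body })
    Parts.push({ ETag, PartNumber }) // S3 needs each part's ETag to finish
    PartNumber++
  }
  // Completing the upload stitches the parts into a single object
  return s3.CompleteMultipartUpload({ Bucket, Key, UploadId, MultipartUpload: { Parts } })
}
```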
**Single Chunk vs Multiple Chunk Payloads**
Now, when uploading a part via `UploadPart`, you can send the payload as a single chunk, including the `Content-Length` header and the entire body, or use AWS' non-standard `aws-chunked` content encoding to send multiple chunks. Even in the latter case, you still have to supply the payload content length upfront, in `x-amz-decoded-content-length`, so it's not useful in this situation. As far as I can see, the only reason to use `aws-chunked` would be if you were memory-constrained and you wanted your upload buffer to be less than the minimum part size of 5 MiB.
So, you need to use multipart uploads, where each part is between 5 MiB and 5 GiB, and a producer-consumer pattern where client code is streaming to a buffer which the SDK periodically uploads to S3.
One more complication: as well as the constraints on part size, an object can comprise a maximum of 10,000 parts, so you have to allow the caller to specify the part size. If you (the SDK) guess a part size that's too small, you'll run out of parts before you've uploaded the entire object. Too big, and you needlessly increase your memory footprint.
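That sizing trade-off reduces to simple arithmetic. A sketch of a default part-size guess under those constraints (a caller-supplied value should still override it):

```javascript
const MIN_PART_SIZE = 5 * 1024 * 1024 // 5 MiB floor for all but the last part
const MAX_PARTS = 10000               // hard cap on parts per object

// Smallest part size that both satisfies the 5 MiB floor and fits the
// whole object within 10,000 parts.
function choosePartSize (objectSize) {
  return Math.max(MIN_PART_SIZE, Math.ceil(objectSize / MAX_PARTS))
}
```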
Hope this makes sense, and is useful. I saw the recent conversation here, and I thought it was worth doing a quick brain dump!
Streaming on `GetObject` and `PutObject` is now available in `@aws-lite/s3` v0.1.21 when used with aws-lite v0.21.0!
For `GetObject`: opt into the stream by passing the new `streamResponsePayload: true` property in your requests, like so:
`await aws.s3.GetObject({ Bucket, Key, streamResponsePayload: true })`
For `PutObject`, simply pass the `File` property, and by default it will stream from disk.