Comments (8)
Problem: abstracting vendor-specific multipart APIs that require state and capturing response data
Cloud vendors have complex, hard-to-abstract requirements for multipart uploads. For example, with S3 the client is expected to retain the `ETag` header returned for each part and then send them all, encapsulated in a specific XML envelope, to S3 upon completion.
Needless to say, the Azure BlockBlob chunked upload API is quite different.
Q: Does this mean there can't be a generic multipart upload transfer adapter that can handle multiple vendors?
A: It means that most of the `init` / `commit` logic should be handled on the LFS server side, not the client. As long as the client is the one directly pushing the actual data blocks (all the `parts` requests), I still believe there is great benefit in defining this as a transfer protocol. The server will just have to handle vendor-specific logic in `init` / `commit`.
Q: Should we still allow custom `init` and `commit` actions?
A: `commit` - for sure; it has to be sent by the client somehow, even if the URL is an LFS server URL. `init` - not sure, but most likely yes; defining it as part of the protocol can help support additional storage backends in the future.
Q: Should we still support allowing a custom HTTP `method` and `body` in actions?
A: This may be a YAGNI thing; on the other hand, I find this to be lacking from the `basic` transfer protocol, and it is required to support a really flexible array of vendors. I am really on the fence on this one.
Wait, so how do we solve the S3 `ETag` capturing problem?
We can try to let the client upload all the parts without capturing the `ETag` value, and then, when the client calls the server to `commit`, issue a `ListParts` call to get all the uploaded parts and their `ETag`s (aside: this could also be used in `init` to identify missing parts for a pre-existing object). Then send a `CompleteMultipartUpload` request with the `ETag` values to finalize.
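A rough sketch of what the server-side `commit` handler could do, assuming Python with boto3 (the actual network calls appear only in comments; `parts_for_completion` is a hypothetical helper name, not giftless code):

```python
def parts_for_completion(list_parts_response):
    """Convert an S3 ListParts response into the payload expected by
    CompleteMultipartUpload: a list of {PartNumber, ETag} dicts,
    sorted by part number."""
    return {
        "Parts": [
            {"PartNumber": p["PartNumber"], "ETag": p["ETag"]}
            for p in sorted(list_parts_response["Parts"],
                            key=lambda p: p["PartNumber"])
        ]
    }

# In a real commit handler, roughly:
#   resp = s3.list_parts(Bucket=bucket, Key=key, UploadId=upload_id)
#   s3.complete_multipart_upload(
#       Bucket=bucket, Key=key, UploadId=upload_id,
#       MultipartUpload=parts_for_completion(resp))
```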
We should further investigate this approach by:
- Testing it
- Seeing what other vendors (namely Azure and GCP) expect and how it compares
- Thinking about what this means in terms of security and data consistency
Thought: we should define a response to the `commit` action in the protocol that says something like "you still have parts to upload" or "the data is not consistent yet". Most likely, a 422 response with a specific message, basically telling the client to call the batch API again. If the problem is that some parts are missing, the reply should be a new upload reply with only the missing parts listed under `parts`.
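For illustration only (none of these field names are settled), such a 422 reply could look like:

```json
{
  "message": "Upload incomplete: some parts are missing; call the batch API again to get actions for the remaining parts"
}
```

The follow-up batch reply would then list only the missing parts under `parts`.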
from giftless.
Note that if we remove the custom `body` from `commit` / `init`, they will still need to allow some custom JSON attributes beyond the `oid` and `size` of the object. For example, S3 requires an `uploadId` token. Since we want to avoid, or at least minimize, state in the server, the client should retain this value and send it when calling `commit`.
The same goes for the `init` response - it will provide the client with such values, and the client will need to know to store them and send them again when calling `commit`.
For this reason, I think it may make sense to just allow specifying a custom body as in the example above, which means we can keep clients dumb.
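As a purely illustrative sketch (the `uploadId` attribute name follows S3; nothing here is standardized), the `init` response could hand the client an opaque token:

```json
"init": {
  "href": "https://lfs.example.com/multipart/init",
  "uploadId": "VXBsb2FkSWQtZXhhbXBsZQ"
}
```

The client would then echo `"uploadId": "VXBsb2FkSWQtZXhhbXBsZQ"` in the body of its later `commit` call, keeping the server stateless.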
Both S3 and Azure support content integrity verification using the `Content-MD5` header, which is a standard HTTP feature. It would be great if we could somehow incorporate this into the protocol. I'm wondering if we need flexibility (to specify the digest algorithm and the header / other param in use) or can just hard-code support for `Content-MD5`.
A couple of suggestions:
MD5-specific approach
```json
"actions": {
  "parts": [
    {
      "href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=3",
      "header": {
        "Authorization": "Bearer someauthorizationtokenwillbesethere"
      },
      "pos": 7500001,
      "set_content_md5": true
    }
  ]
}
```
Setting the `set_content_md5` flag to `true` will tell the client to calculate an MD5 digest of the part and send it as the `Content-MD5` header.
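On the client side, computing that header value is straightforward; a Python sketch (the function name is mine):

```python
import base64
import hashlib

def content_md5(part_bytes: bytes) -> str:
    """Base64-encoded MD5 digest of a part, as carried by the
    standard Content-MD5 header (RFC 1864)."""
    return base64.b64encode(hashlib.md5(part_bytes).digest()).decode("ascii")

# The client would then send the part with:
#   PUT <part href>
#   Content-MD5: <content_md5(part_bytes)>
```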
More generic approach to content digest
This is inspired by RFC-3230 and RFC-5843, which define a more flexible approach to content digests in HTTP. The following approach specifies what digest header(s) to send with the content, in an RFC-3230-like manner:
```json
"actions": {
  "parts": [
    {
      "href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=3",
      "header": {
        "Authorization": "Bearer someauthorizationtokenwillbesethere"
      },
      "pos": 7500001,
      "want_digest": "contentMD5"
    }
  ]
}
```
The `want_digest` attribute value is to follow the spec in RFC-3230 section 4.3.1, with possible algorithms as specified by RFC-5843.
RFC-3230 defines `contentMD5` as a special value which tells the client to send the `Content-MD5` header with an MD5 digest of the payload in base64 encoding.
Other possible values include a comma-separated list of q-factor flagged algorithms, one or more of `MD5`, `SHA`, `SHA-256` and `SHA-512`. If one or more of these are specified, the digest of the payload is to be sent by the client in the `Digest` header, using the format specified by RFC-3230 section 4.3.2. For example:
```json
"actions": {
  "parts": [
    {
      "href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=3",
      "header": {
        "Authorization": "Bearer someauthorizationtokenwillbesethere"
      },
      "pos": 7500001,
      "want_digest": "sha-256;q=1.0, md5;q=0.5"
    }
  ]
}
```
This will cause the client to send a request like:
```http
PUT /storage/upload/20492a4d0d84?part=3 HTTP/1.1
Authorization: Bearer someauthorizationtokenwillbesethere
Digest: SHA-256=thvDyvhfIqlvFe+A9MYgxAfm1q5=,MD5=qweqweqweqweqweqweqwe=
...
```
NOTE: Azure allows for a CRC32-based check, but also supports `Content-MD5`, so I am not sure CRC32 has any benefit over MD5.
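A sketch of the client side in Python (function and table names are mine; note that per RFC-3230 the `SHA` token means SHA-1, and the special `contentMD5` value from the MD5-specific approach is not handled here):

```python
import base64
import hashlib

# RFC-3230 / RFC-5843 digest tokens mentioned in this proposal,
# mapped to hashlib constructors.
ALGORITHMS = {
    "md5": hashlib.md5,
    "sha": hashlib.sha1,
    "sha-256": hashlib.sha256,
    "sha-512": hashlib.sha512,
}

def digest_header(want_digest: str, payload: bytes) -> str:
    """Build a Digest header value from a want_digest list such as
    'sha-256;q=1.0, md5;q=0.5'. This sketch uses q-values only to
    drop algorithms with q=0; unknown tokens are ignored."""
    parts = []
    for item in want_digest.split(","):
        fields = item.strip().split(";")
        token = fields[0].strip().lower()
        q = 1.0
        for f in fields[1:]:
            if f.strip().startswith("q="):
                q = float(f.strip()[2:])
        if q > 0 and token in ALGORITHMS:
            raw = ALGORITHMS[token](payload).digest()
            parts.append("%s=%s" % (token.upper(),
                                    base64.b64encode(raw).decode("ascii")))
    return ",".join(parts)
```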
Point to consider: add a `cancel` or `abort` URL that can be used to optionally allow clients to cancel a started upload. This could allow cleaning up uploaded parts, if supported / required by the storage vendor.
How vendors handle uncommitted partial uploads:
- S3: You pay for them until you clean them up, but you can set up automatic cleanup after a period of time using object lifecycle management
- Azure: They are cleaned up automatically after 7 days
- Google: Resumable uploads - I couldn't find any reference to how an unfinished resumable upload is managed, so I assume it is just a regular object with some missing data, which means it should be cleaned up manually
Of course, there is no guarantee that the client will be able to successfully call the `abort` action even if supported - so maybe this is moot. Still, it could be nice.
Supporting cleanup of unfinished uploads in GCP could be implemented by tagging objects as "draft" when we `init` and removing that tag when we `commit`. Then, users can write an external script to delete objects tagged as `draft` that are older than a certain age.
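The decision logic for such a script is simple; a Python sketch (names are mine; in a real script you would iterate over `client.list_blobs(bucket)` from google-cloud-storage and call `blob.delete()` when this returns True):

```python
from datetime import datetime, timedelta, timezone

def is_stale_draft(metadata, created, now, max_age=timedelta(days=7)):
    """True if the object is still tagged as a draft (i.e. an
    unfinished upload) and is older than max_age."""
    is_draft = (metadata or {}).get("draft") == "true"
    return is_draft and (now - created) > max_age
```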
Q: Does it make sense to have some kind of "common to all parts" attribute for the operation, specifying headers and other attributes that may be common to all `parts` objects?
These could be:
- `header` values
- `want_digest` values
- Base URL
Pros: more compact and clean messages, less repetition
Cons: more divergence from the `basic` protocol, less encapsulation of objects, and more complex code at least on the client side (the client needs to have more state shared between action requests)
I am leaning against it, but may be convinced otherwise ;)
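For concreteness, such a "common to all parts" block might look like this (purely illustrative syntax, not a proposal):

```json
"actions": {
  "common": {
    "header": {
      "Authorization": "Bearer someauthorizationtokenwillbesethere"
    },
    "want_digest": "contentMD5"
  },
  "parts": [
    {"href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=3", "pos": 7500001}
  ]
}
```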
Undecided:
- Should we name it `multipart` or `multipart-basic`?
- Define a generic way to reply to `commit` if not all parts have been uploaded successfully
- Define a generic way to reply to `init` if some parts have already been uploaded
- Should we allow clients to request a chunk size? Is there a reason for that?
- Should we have a "common to all parts" attribute to specify common headers, `want_digest`, etc.?
- Should we define an "abort" action to support clean cancelling?
- Specify handling of content digests (probably accept the RFC-based suggestion above)
Tasks:
- Get answers for undecided questions
- Write up a "0.9" protocol draft (~1 hr)
- Implement a local storage / giftless internal views based implementation to test the protocol (~1 day)
- Design and estimate implementation for Azure, S3 and GCP (~2 hours each)
- Implement specific backend support (~1-2 days each)
Discussion has been summarized into a spec doc here: https://github.com/datopian/giftless/blob/feature/11-multipart-protocol/multipart-spec.md