
Comments (8)

shevron commented on August 16, 2024

Problem: abstracting vendor-specific multipart APIs that require state and capturing response data

There are complex, hard-to-abstract requirements from cloud vendors for multipart uploads - for example, S3 expects the client to retain the ETag header returned for each part and then send them all, encapsulated in a specific XML envelope, to S3 upon completion.

Needless to say the Azure BlockBlob chunked upload API is quite different.

Q: Does this mean there can't be a generic multi-part upload transfer adapter that can handle multiple vendors?

A: It means that most of the init / commit logic should be handled on the LFS server side, not the client. As long as the client is the one directly pushing the actual data blocks (all the parts requests), I still believe there is great benefit in defining this as a transfer protocol; the server will just have to handle vendor-specific logic in init / commit.

Q: Should we still allow custom init and commit actions?

A: commit - for sure, it has to somehow be sent by the client, even if the URL is an LFS server URL. init - not sure, but most likely yes; defining it as part of the protocol can help support additional storage backends in the future.

Q: Should we still support allowing custom HTTP method and body in actions?

A: This may be a YAGNI thing; on the other hand, I find this to be lacking from the basic transfer protocol, and it is required to support a really flexible array of vendors. I am really on the fence on this one.

Wait, so how do we solve the S3 ETag capturing problem?

We can try to let the client upload all the parts without capturing the ETag value, and then, when the client calls the server to commit, issue a ListParts call to get all the uploaded parts and their ETags (aside: this could also be used in init to identify missing parts for a pre-existing object?). Then send a CompleteMultipartUpload request with the ETag values to finalize.
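For illustration only, here is a rough sketch of what the server-side commit handling could look like using boto3; the bucket, key and upload ID are placeholders assumed to arrive with the commit request, and none of this is settled:

import boto3

s3 = boto3.client("s3")

def complete_upload(bucket: str, key: str, upload_id: str) -> None:
    # Recover the ETag of every uploaded part so the client never has to track them
    parts = []
    kwargs = {"Bucket": bucket, "Key": key, "UploadId": upload_id}
    while True:
        resp = s3.list_parts(**kwargs)
        parts += [{"PartNumber": p["PartNumber"], "ETag": p["ETag"]}
                  for p in resp.get("Parts", [])]
        if not resp.get("IsTruncated"):
            break
        kwargs["PartNumberMarker"] = resp["NextPartNumberMarker"]

    # Finalize the multipart upload with the collected part list
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )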

We should further investigate this approach by:

  1. Testing it
  2. Seeing what other vendors (namely Azure and GCP) expect and how it compares
  3. Thinking about what this means in terms of security and data consistency

Thought: the protocol should define a response to the commit action that says something like "you still have parts to upload" or "the data is not consistent yet". Most likely a 422 response with a specific message, basically telling the client to call the batch API again; if the problem is that some parts are missing, the reply should be a new upload reply with only the missing parts listed under parts.
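As a rough illustration only, such a 422 reply could look something like the following (shown here as the dict a Python handler might return; the field names mirror the JSON examples further down this thread and are assumptions, not a finalized spec):

incomplete_commit_response = {
    "message": "cannot commit: some parts have not been uploaded yet",
    "actions": {
        "parts": [
            # only the missing parts are listed, so the client re-uploads just these
            {"href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=3",
             "pos": 7500001},
        ]
    },
}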


shevron commented on August 16, 2024

Note that if we remove the custom body from commit / init, they will still need to allow some custom JSON attributes beyond the oid and size of the object. For example, S3 requires an uploadId token. Since we want to avoid or at least minimize state in the server, the client should retain this value and send it when calling commit.

Same goes for the init response - it will provide the client with such values and the client will need to know to store and send them again when calling commit.

For this reason, I think it may make sense to just allow specifying a custom body as in the example above, which means we can keep clients dumb.
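To make the "dumb client" point concrete, here is a minimal sketch of what the client side could look like, assuming the init/batch response hands it a commit action carrying an opaque body (the "body" field and the use of requests are assumptions for illustration):

import requests

def commit(commit_action: dict) -> requests.Response:
    # Post back whatever body the server handed us in the init/batch response,
    # verbatim - the client never needs to understand uploadId or any other
    # vendor-specific attribute.
    return requests.post(
        commit_action["href"],
        json=commit_action.get("body", {}),
        headers=commit_action.get("header", {}),
    )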


shevron commented on August 16, 2024

Both S3 and Azure support content integrity verification using the Content-MD5 header, which is a standard HTTP feature. It would be great if we could somehow incorporate this into the protocol. I'm wondering if we need flexibility (to specify the digest algorithm and the header / other param in use) or can just hard-code support for Content-MD5.

A couple of suggestions:

MD5 specific

"actions": {
  "parts": [
    {
      "href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=3",
      "header": {
        "Authorization": "Bearer someauthorizationtokenwillbesethere"
      },
      "pos": 7500001,
      "set_content_md5": true
    }
  ]
}

Setting the set_content_md5 flag to true tells the client to calculate an MD5 digest of the part and send it as the Content-MD5 header.
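On the client side this could be as simple as the following sketch (requests is just an assumed HTTP client; the part dict follows the example above):

import base64
import hashlib
import requests

def upload_part(part: dict, data: bytes) -> requests.Response:
    headers = dict(part.get("header", {}))
    if part.get("set_content_md5"):
        # Content-MD5 carries the base64-encoded binary MD5 digest of the part body
        headers["Content-MD5"] = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
    return requests.put(part["href"], data=data, headers=headers)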

More generic approach to content digest

This is inspired by RFC-3230 and RFC-5843, which define a more flexible approach to content digests in HTTP. The following approach specifies what digest header(s) to send with the content in an RFC-3230 like manner:

"actions": {
  "parts": [
    {
      "href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=3",
      "header": {
        "Authorization": "Bearer someauthorizationtokenwillbesethere"
      },
      "pos": 7500001,
      "want_digest": "contentMD5"
    }
  ]
}

The want_digest attribute value is to follow the spec in RFC-3230 section 4.3.1, with possible algorithms as specified by RFC-5843.

RFC-3230 defines contentMD5 as a special value which tells the client to send the Content-MD5 header with an MD5 digest of the payload in base64 encoding.

Other possible values include a comma-separated list of q-factor flagged algorithms, drawn from MD5, SHA, SHA-256 and SHA-512. If one or more of these are specified, the digest of the payload is to be sent by the client as part of the Digest header, using the format specified by RFC-3230 section 4.3.2. For example:

"actions": {
  "parts": [
    {
      "href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=3",
      "header": {
        "Authorization": "Bearer someauthorizationtokenwillbesethere"
      },
      "pos": 7500001,
      "want_digest": "sha-256;q=1.0, md5;q=0.5"
    }
  ]
}

This will cause the client to send a request along the lines of:

PUT /storage/upload/20492a4d0d84?part=3 HTTP/1.1
Authorization: Bearer someauthorizationtokenwillbesethere
Digest: SHA-256=thvDyvhfIqlvFe+A9MYgxAfm1q5=,MD5=qweqweqweqweqweqweqwe=

...
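A rough sketch of how a client might honor want_digest under this approach; the parsing is simplified (it ignores the q-factor ordering) and the function name is made up:

import base64
import hashlib

# Map RFC-3230 / RFC-5843 algorithm tokens to Digest header labels and hashlib constructors
ALGORITHMS = {
    "md5": ("MD5", hashlib.md5),
    "sha": ("SHA", hashlib.sha1),
    "sha-256": ("SHA-256", hashlib.sha256),
    "sha-512": ("SHA-512", hashlib.sha512),
}

def digest_headers(want_digest: str, payload: bytes) -> dict:
    # "contentMD5" is the special case: send the Content-MD5 header
    if want_digest.strip() == "contentMD5":
        value = base64.b64encode(hashlib.md5(payload).digest()).decode("ascii")
        return {"Content-MD5": value}
    # Otherwise treat the value as a comma-separated, q-factor flagged algorithm list
    values = []
    for token in want_digest.split(","):
        name = token.split(";")[0].strip().lower()  # drop the ";q=..." weight here
        if name in ALGORITHMS:
            label, factory = ALGORITHMS[name]
            digest = base64.b64encode(factory(payload).digest()).decode("ascii")
            values.append(f"{label}={digest}")
    return {"Digest": ",".join(values)}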

NOTE: Azure allows for a crc32 based check, but also supports content-md5 so I am not sure crc32 has any benefit over md5.


shevron commented on August 16, 2024

Point to consider: add a cancel or abort URL that can be used to optionally allow clients to cancel a started upload. This could allow cleaning up uploaded parts if supported / required by the storage vendor.

How vendors handle uncommitted partial uploads:

  • S3: You pay for them until you clean them up, but you can set up automatic cleanup after a period of time using object lifecycle management
  • Azure: They are cleaned up automatically after 7 days
  • Google: Resumable uploads - couldn't find any reference to how an unfinished resumable upload is managed, so I assume it is just a regular object with some missing data, which means it should be cleaned up manually.

Of course, there is no guarantee that the client will be able to successfully call the abort action even if supported - so maybe this is moot. Still could be nice.
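For S3 specifically, both pieces are available through boto3 - something along these lines (bucket, key, upload ID and the lifecycle rule details are placeholders):

import boto3

s3 = boto3.client("s3")

def abort_upload(bucket: str, key: str, upload_id: str) -> None:
    # Backing for an explicit "abort" action: discard all parts uploaded so far
    s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)

def configure_auto_cleanup(bucket: str, days: int = 7) -> None:
    # Safety net for clients that never call abort: expire incomplete
    # multipart uploads automatically via an object lifecycle rule
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={
            "Rules": [{
                "ID": "expire-incomplete-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            }]
        },
    )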


shevron commented on August 16, 2024

Supporting cleanup of unfinished uploads in GCP could be implemented by tagging objects as "draft" when we init, and removing that tag when we commit. Then, users can write an external script to delete objects tagged as draft older than a certain age.
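Roughly, using the google-cloud-storage client, it could look like the sketch below; the bucket name and the "draft" metadata key are made up, and this assumes the object is already visible to metadata updates at init time:

from datetime import datetime, timedelta, timezone
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-lfs-bucket")

def mark_draft(key: str) -> None:
    # init: tag the object as a draft via custom metadata
    blob = bucket.blob(key)
    blob.metadata = {"draft": "true"}
    blob.patch()

def clear_draft(key: str) -> None:
    # commit: drop the draft marker
    blob = bucket.blob(key)
    blob.metadata = {"draft": None}
    blob.patch()

def cleanup_stale_drafts(max_age_days: int = 7) -> None:
    # external script / cron job: delete drafts older than the cutoff
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    for blob in client.list_blobs(bucket):
        if (blob.metadata or {}).get("draft") == "true" and blob.time_created < cutoff:
            blob.delete()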


shevron commented on August 16, 2024

Q: does it make sense to have some kind of "common to all parts" attribute for the operation, specifying headers and other attributes that are common to all part objects?

These could be:

  • header values
  • want_digest values
  • Base URL

Pros: More compact and clean messages, less repetition
Cons: More divergence from the basic protocol, less encapsulation of objects, will require more complex code at least on the client side (client needs to have more state shared between action requests).

I am leaning against it, but may be convinced otherwise ;)


shevron commented on August 16, 2024

Undecided:

  • Should we name it multipart or multipart-basic?
  • Define a generic way to reply to commit if not all parts have been uploaded successfully
  • Define a generic way to reply to init if some parts have already been uploaded
  • Should we allow clients to request a chunk size? Is there reason for that?
  • Should we have a "common to all parts" attribute to specify common headers, want_digest, etc.?
  • Should we define an "abort" action to support clean cancelling?
  • Specify handling of content digest (probably accept the RFC-based suggestion above)

Tasks:

  • Get answers for undecided questions
  • Write up a "0.9" protocol draft (~1 hr)
  • Implement a local storage / giftless internal views based implementation to test the protocol (~1 day)
  • Design and estimate implementation for Azure, S3 and GCP (~2 hours each)
  • Implement specific backend support (~1-2 days each)


shevron commented on August 16, 2024

Discussion has been summarized into a spec doc here: https://github.com/datopian/giftless/blob/feature/11-multipart-protocol/multipart-spec.md

