Comments (8)
Problem: abstracting vendor-specific multipart APIs that require state and capturing response data
Cloud vendors have complex, hard-to-abstract requirements for multipart uploads. For example, with S3 the client is expected to retain the `ETag` header returned for each part and then send them all, encapsulated in a specific XML envelope, to S3 upon completion.
Needless to say, the Azure BlockBlob chunked upload API is quite different.
Q: Does this mean there can't be a generic multipart upload transfer adapter that can handle multiple vendors?
A: It means that most of the `init` / `commit` logic should be handled on the LFS server side, not the client. As long as the client is the one directly pushing the actual data blocks (all the `parts` requests), I still believe there is great benefit in defining this as a transfer protocol. The server will just have to handle vendor-specific logic in `init` / `commit`.
Q: Should we still allow custom `init` and `commit` actions?
A: `commit` - for sure; it has to be sent by the client somehow, even if the URL is an LFS server URL. `init` - not sure, but most likely yes; defining it as part of the protocol can help support additional storage backends in the future.
Q: Should we still support allowing a custom HTTP `method` and `body` in actions?
A: This may be a YAGNI thing; on the other hand, I find this to be lacking from the `basic` transfer protocol, and it is required to support a really flexible array of vendors. I am really on the fence on this one.
Wait, so how do we solve the S3 `ETag` capturing problem?
We can try to let the client upload all the parts without capturing the `ETag` value, and then, when the client calls the server to `commit`, issue a `ListParts` call to get all the uploaded parts and their `ETag`s (aside: this could also be used in `init` to identify missing parts for a pre-existing object). Then send a `CompleteMultipartUpload` request with the `ETag` values to finalize.
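A rough sketch of what the server-side `commit` handler could do, assuming Python with boto3 (the actual network calls appear only in comments; `parts_for_completion` is a hypothetical helper name, not giftless code):

```python
def parts_for_completion(list_parts_response):
    """Convert an S3 ListParts response into the payload expected by
    CompleteMultipartUpload: a list of {PartNumber, ETag} dicts,
    sorted by part number."""
    return {
        "Parts": [
            {"PartNumber": p["PartNumber"], "ETag": p["ETag"]}
            for p in sorted(list_parts_response["Parts"],
                            key=lambda p: p["PartNumber"])
        ]
    }

# In a real commit handler, roughly:
#   resp = s3.list_parts(Bucket=bucket, Key=key, UploadId=upload_id)
#   s3.complete_multipart_upload(
#       Bucket=bucket, Key=key, UploadId=upload_id,
#       MultipartUpload=parts_for_completion(resp))
```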
We should further investigate this approach by:
- Testing it
- Seeing what other vendors (namely Azure and GCP) expect and how it compares
- Thinking about what this means in terms of security and data consistency
Thought: we should define a response to the `commit` action in the protocol that says something like "you still have parts to upload" or "the data is not consistent yet". Most likely, a 422 response with a specific message, basically telling the client to call the batch API again. If the problem is that some parts are missing, the reply should be a new upload reply with only the missing parts listed under `parts`.
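For illustration only (none of these field names are settled), such a 422 reply could look like:

```json
{
  "message": "Upload incomplete: some parts are missing; call the batch API again to get actions for the remaining parts"
}
```

The follow-up batch reply would then list only the missing parts under `parts`.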
from giftless.
Note that if we remove the custom `body` from `commit` / `init`, they will still need to allow some custom JSON attributes beyond the `oid` and `size` of the object. For example, S3 requires an `uploadId` token. Since we want to avoid, or at least minimize, state in the server, the client should retain this value and send it when calling `commit`.
The same goes for the `init` response - it will provide the client with such values, and the client will need to know to store them and send them again when calling `commit`.
For this reason, I think it may make sense to just allow specifying a custom body as in the example above, which means we can keep clients dumb.
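As a purely illustrative sketch (the `uploadId` attribute name follows S3; nothing here is standardized), the `init` response could hand the client an opaque token:

```json
"init": {
  "href": "https://lfs.example.com/multipart/init",
  "uploadId": "VXBsb2FkSWQtZXhhbXBsZQ"
}
```

The client would then echo `"uploadId": "VXBsb2FkSWQtZXhhbXBsZQ"` in the body of its later `commit` call, keeping the server stateless.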
Both S3 and Azure support content integrity verification using the `Content-MD5` header, which is a standard HTTP feature. It would be great if we could somehow incorporate this into the protocol. I'm wondering if we need flexibility (to specify the digest algorithm and the header / other param in use) or can just hard-code support for `Content-MD5`.
A couple of suggestions:
MD5-specific approach
```json
"actions": {
  "parts": [
    {
      "href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=3",
      "header": {
        "Authorization": "Bearer someauthorizationtokenwillbesethere"
      },
      "pos": 7500001,
      "set_content_md5": true
    }
  ]
}
```
Setting the `set_content_md5` flag to `true` will tell the client to calculate an MD5 digest of the part and send it as the `Content-MD5` header.
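On the client side, computing that header value is straightforward; a Python sketch (the function name is mine):

```python
import base64
import hashlib

def content_md5(part_bytes: bytes) -> str:
    """Base64-encoded MD5 digest of a part, as carried by the
    standard Content-MD5 header (RFC 1864)."""
    return base64.b64encode(hashlib.md5(part_bytes).digest()).decode("ascii")

# The client would then send the part with:
#   PUT <part href>
#   Content-MD5: <content_md5(part_bytes)>
```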
More generic approach to content digest
This is inspired by RFC-3230 and RFC-5843, which define a more flexible approach to content digests in HTTP. The following approach specifies what digest header(s) to send with the content, in an RFC-3230-like manner:
```json
"actions": {
  "parts": [
    {
      "href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=3",
      "header": {
        "Authorization": "Bearer someauthorizationtokenwillbesethere"
      },
      "pos": 7500001,
      "want_digest": "contentMD5"
    }
  ]
}
```
The `want_digest` attribute value is to follow the spec in RFC-3230 section 4.3.1, with possible algorithms as specified by RFC-5843.
RFC-3230 defines `contentMD5` as a special value which tells the client to send the `Content-MD5` header with an MD5 digest of the payload in base64 encoding.
Other possible values include a comma-separated list of q-factor flagged algorithms, one or more of `MD5`, `SHA`, `SHA-256` and `SHA-512`. If one or more of these are specified, the digest of the payload is to be sent by the client in the `Digest` header, using the format specified by RFC-3230 section 4.3.2. For example:
```json
"actions": {
  "parts": [
    {
      "href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=3",
      "header": {
        "Authorization": "Bearer someauthorizationtokenwillbesethere"
      },
      "pos": 7500001,
      "want_digest": "sha-256;q=1.0, md5;q=0.5"
    }
  ]
}
```
This will cause the client to send a request like:
```http
PUT /storage/upload/20492a4d0d84?part=3 HTTP/1.1
Authorization: Bearer someauthorizationtokenwillbesethere
Digest: SHA-256=thvDyvhfIqlvFe+A9MYgxAfm1q5=,MD5=qweqweqweqweqweqweqwe=
...
```
NOTE: Azure allows for a CRC32-based check, but also supports `Content-MD5`, so I am not sure CRC32 has any benefit over MD5.
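A sketch of the client side in Python (function and table names are mine; note that per RFC-3230 the `SHA` token means SHA-1, and the special `contentMD5` value from the MD5-specific approach is not handled here):

```python
import base64
import hashlib

# RFC-3230 / RFC-5843 digest tokens mentioned in this proposal,
# mapped to hashlib constructors.
ALGORITHMS = {
    "md5": hashlib.md5,
    "sha": hashlib.sha1,
    "sha-256": hashlib.sha256,
    "sha-512": hashlib.sha512,
}

def digest_header(want_digest: str, payload: bytes) -> str:
    """Build a Digest header value from a want_digest list such as
    'sha-256;q=1.0, md5;q=0.5'. This sketch uses q-values only to
    drop algorithms with q=0; unknown tokens are ignored."""
    parts = []
    for item in want_digest.split(","):
        fields = item.strip().split(";")
        token = fields[0].strip().lower()
        q = 1.0
        for f in fields[1:]:
            if f.strip().startswith("q="):
                q = float(f.strip()[2:])
        if q > 0 and token in ALGORITHMS:
            raw = ALGORITHMS[token](payload).digest()
            parts.append("%s=%s" % (token.upper(),
                                    base64.b64encode(raw).decode("ascii")))
    return ",".join(parts)
```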
Point to consider: add a `cancel` or `abort` URL that can be used to optionally allow clients to cancel a started upload. This could allow cleaning up uploaded parts, if supported / required by the storage vendor.
How vendors handle uncommitted partial uploads:
- S3: You pay for them until you clean them up, but you can set up automatic cleanup after a period of time using object lifecycle management
- Azure: They are cleaned up automatically after 7 days
- Google: Resumable uploads - I couldn't find any reference to how an unfinished resumable upload is managed, so I assume it is just a regular object with some missing data, which means it should be cleaned up manually
Of course, there is no guarantee that the client will be able to successfully call the `abort` action even if supported - so maybe this is moot. Still, it could be nice.
Supporting cleanup of unfinished uploads in GCP could be implemented by tagging objects as "draft" when we `init` and removing that tag when we `commit`. Then, users can write an external script to delete objects tagged as `draft` that are older than a certain age.
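The decision logic for such a script is simple; a Python sketch (names are mine; in a real script you would iterate over `client.list_blobs(bucket)` from google-cloud-storage and call `blob.delete()` when this returns True):

```python
from datetime import datetime, timedelta, timezone

def is_stale_draft(metadata, created, now, max_age=timedelta(days=7)):
    """True if the object is still tagged as a draft (i.e. an
    unfinished upload) and is older than max_age."""
    is_draft = (metadata or {}).get("draft") == "true"
    return is_draft and (now - created) > max_age
```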
Q: Does it make sense to have some kind of "common to all parts" attribute for the operation, specifying headers and other attributes that may be common to all `parts` objects?
These could be:
- `header` values
- `want_digest` values
- Base URL
Pros: more compact and clean messages, less repetition
Cons: more divergence from the `basic` protocol, less encapsulation of objects, and more complex code at least on the client side (the client needs to have more state shared between action requests)
I am leaning against it, but may be convinced otherwise ;)
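For concreteness, such a "common to all parts" block might look like this (purely illustrative syntax, not a proposal):

```json
"actions": {
  "common": {
    "header": {
      "Authorization": "Bearer someauthorizationtokenwillbesethere"
    },
    "want_digest": "contentMD5"
  },
  "parts": [
    {"href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=3", "pos": 7500001}
  ]
}
```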
Undecided:
- Should we name it `multipart` or `multipart-basic`?
- Define a generic way to reply to `commit` if not all parts have been uploaded successfully
- Define a generic way to reply to `init` if some parts have already been uploaded
- Should we allow clients to request a chunk size? Is there a reason for that?
- Should we have a "common to all parts" attribute to specify common headers, `want_digest`, etc.?
- Should we define an "abort" action to support clean cancelling?
- Specify handling of content digests (probably accept the RFC-based suggestion above)
Tasks:
- Get answers for undecided questions
- Write up a "0.9" protocol draft (~1 hr)
- Implement a local storage / giftless internal views based implementation to test the protocol (~1 day)
- Design and estimate implementation for Azure, S3 and GCP (~2 hours each)
- Implement specific backend support (~1-2 days each)
Discussion has been summarized into a spec doc here: https://github.com/datopian/giftless/blob/feature/11-multipart-protocol/multipart-spec.md