giftless's Introduction

Giftless - a Pluggable Git LFS Server

Giftless is a Python implementation of a Git LFS Server. It is designed with flexibility in mind, to allow pluggable storage backends, transfer methods and authentication methods.

Giftless supports the basic Git LFS transfer mode with several storage backends, including Azure Blob Storage, Google Cloud Storage and local filesystem storage.

In addition, Giftless implements a custom transfer mode called multipart-basic, which is designed to take advantage of many vendors' multipart upload capabilities. It requires a specialized Git LFS client, and is currently not supported by standard Git LFS.

See the giftless-client project for a compatible Python Git LFS client.

Additional transfer modes and storage backends could easily be added and configured.

Documentation

License

Copyright (C) 2020, Datopian / Viderum, Inc.

Giftless is free / open source software and is distributed under the terms of the MIT license. See LICENSE for details.

giftless's Issues

[design] Allow custom object metadata in batch API

The Git LFS protocol does not allow setting custom metadata on objects, which would be useful in some cases (e.g. storing additional object metadata such as the original file name, or tags, in the storage backend).

It could be useful to extend the LFS batch API with the ability to set some custom object attributes.

Note that the LFS protocol does not allow arbitrary / custom properties on batch API objects, or at least this is not specified. The Go implementation of git-lfs contains JSON schema that strictly forbids this: https://github.com/git-lfs/git-lfs/blob/df881bf23a08f1b57209825e0f6b2d0b9e6dcd5c/tq/schemas/http-batch-request-schema.json

Batch API Extension

It is suggested that Giftless accept any object attribute that begins with x-, similar to how non-standard HTTP headers are specified:

{
  "transfers": ["basic"],
  "operation": "download",
  "objects": [
    {
      "oid": "123123123123123123123123123123123123123",
      "size": 123,
      "x-filename": "original-data.csv",
      "x-tags": ["data", "csv"]
    }
  ]
}

These custom attributes will be passed to transfer adapters in Giftless, which will be able to use them for any purpose.

To clarify, these x- attributes will be accepted for both upload and download operations.
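
A rough sketch of how a transfer adapter might pick these up, assuming the extra attributes are passed through as keyword arguments (the class and method names here are illustrative, not Giftless's actual API):

from typing import Any, Dict


class ExampleTransferAdapter:
    """Illustration only: shows where x- attributes could surface."""

    def upload(self, organization: str, repo: str, oid: str, size: int,
               **extras: Any) -> Dict[str, Any]:
        # Any "x-..." keys from the batch object would land in `extras`
        filename = extras.get('x-filename')
        tags = extras.get('x-tags', [])
        # ... use filename / tags when generating storage actions ...
        return {"oid": oid, "size": size, "actions": {}}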

basic_external not working on GCP

Acceptance

  • make basic_external work with GCP

Tasks

Analysis

Error we're getting:

$ git lfs push origin master --all

Uploading LFS objects:   0% (0/1), 0 B | 0 B/s, done
LFS: Authorization error: https://storage.googleapis.com/datahub-jjj/hannelita/example-proj-datahub-io/798bee0d79324678f48b32b9dd1a49a7f199946cb2ecd8cd7a16ed08b7f0bdd6?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=dev-impl%40datahub-next-test.iam.gserviceaccount.com%2F20200813%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20200813T122948Z&X-Goog-Expires=899&X-Goog-SignedHeaders=host&response-content-disposition=%7B%7D&X-Goog-Signature=[strip]
Check that you have proper access to the repository

[auth] Pass pre-signed request JWT token to local storage in the URL query string

When downloading with the local storage, in some situations, it is much more convenient to accept a pre-signed download URL which already includes the JWT token. For example, CKAN redirecting to download from local storage will not be able to include the Authorization header in the request, because this is a redirection and not an AJAX request like we do for uploads.

This is currently only a problem with the local storage.

We could perhaps "work around" it by doing some client side 💩 but this is much more complex, and we don't have this issue with other backends because they support URLs that already contain auth tokens.
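
A minimal sketch of what this could look like on the server side (assuming Flask request handling; not the current Giftless code): if no Authorization header is present, fall back to a jwt query string parameter.

from typing import Optional

from flask import request


def token_from_request() -> Optional[str]:
    header = request.headers.get('Authorization', '')
    if header.startswith('Bearer '):
        return header[len('Bearer '):]
    # Fallback for redirect-style downloads that cannot set headers,
    # e.g. GET /org/repo/objects/storage/<oid>?jwt=<token>
    return request.args.get('jwt')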

Document the API and how to extend Giftless

Giftless has a very flexible / extensible design, but without documentation nobody can know how to contribute additional capabilities or modify it for their own needs without reading and understanding the source code.

We should:

  • Create an infrastructure for documentation (probably Sphinx / readthedocs) - this should not be in README.md
  • Document the transfer adapters API
  • Document the API for adding storage backends for the 'basic_streaming' and 'basic_external' transfer adapters
  • Document the authenticators API
  • Document how JWT scopes work (once implemented)
  • Add documentation on development and contributing (setting up an env, testing, coding standards etc.)

Allow passing JWT token in Basic HTTP auth header to better support CLI git integration

Idea: allow passing a JWT token in Basic HTTP auth header to better support CLI git integration

The git / git-lfs CLI isn't very flexible when it comes to passing custom auth headers, but it does support "Basic" HTTP auth quite well. It would be nice if users were allowed to pass a JWT token in Authorization, but instead of passing it as a Bearer token, we would piggyback on Basic authorization: users pass a constant user such as _jwt as the username, and the token as the password.

This is similar to how some other vendors piggyback on basic / digest HTTP auth to provide tokens, e.g. GitHub personal access tokens or Google JSON account keys can be provided in this way.
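
A sketch of the server-side handling under this proposal (the _jwt marker username is the convention suggested above; this is not implemented yet):

import base64
from typing import Optional


def jwt_from_basic_auth(auth_header: str) -> Optional[str]:
    """Extract a JWT passed as 'Basic base64("_jwt:<token>")', if any."""
    if not auth_header.startswith('Basic '):
        return None
    try:
        decoded = base64.b64decode(auth_header[len('Basic '):]).decode('utf-8')
    except (ValueError, UnicodeDecodeError):
        return None
    username, _, password = decoded.partition(':')
    return password if username == '_jwt' else None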

Implement multipart support in Google Cloud storage backend

Following up on #11 and #51 (multipart on Azure), we should implement multipart for GCS as well.

Tasks

  • A little research on Google Cloud's resumable uploads feature; do not be confused by GCS's support for "multipart uploads" - that is unrelated, and refers to uploading from a browser using multipart/form-data payload encoding (see the sketch at the end of this issue).

  • Implement MultipartStorage on GoogleCloudStorage

  • Add tests (VCR) with multipart setup

Also, consider some refactoring of the boundary between transfer adapters and storage adapters, and fixing of "Verify" actions conflicting between Basic and Multipart transfers when both are enabled (and they should always both be enabled).
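
For reference, a minimal sketch of the resumable-upload call mentioned in the first task, assuming the google-cloud-storage library (illustration only, not the Giftless implementation):

from google.cloud import storage


def start_resumable_upload(bucket_name: str, blob_path: str, size: int) -> str:
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_path)
    # Returns a session URL that the client can upload chunks to; this is
    # GCS "resumable upload", not its multipart/form-data browser upload.
    return blob.create_resumable_upload_session(size=size)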

Error on missing argument if installed from pypi

After installing from pypi I get:

Traceback (most recent call last):
  File "${HOME}/.local/lib/python3.7/site-packages/giftless/wsgi_entrypoint.py", line 7, in <module>
    app = init_app()
  File "${HOME}.local/lib/python3.7/site-packages/giftless/app.py", line 43, in init_app
    transfer.init_flask_app(app)
  File "${HOME}/.local/lib/python3.7/site-packages/giftless/transfer/__init__.py", line 71, in init_flask_app
    adapter.register_views(app)
  File "${HOME}/.local/lib/python3.7/site-packages/giftless/transfer/basic_streaming.py", line 158, in register_views
    ObjectsView.register(app, init_argument=self.storage)
  File "${HOME}/.local/lib/python3.7/site-packages/giftless/view.py", line 30, in register
    return super().register(*args, **kwargs)
  File "${HOME}/.local/lib/python3.7/site-packages/flask_classful.py", line 138, in register
    proxy = cls.make_proxy_method(name)
  File "${HOME}/.local/lib/python3.7/site-packages/flask_classful.py", line 230, in make_proxy_method
    i = cls()
TypeError: __init__() missing 1 required positional argument: 'storage'
unable to load app 0 (mountpoint='') (callable not found or import error)

The last call comes from flask_classful, where I have version 0.14.2, if that is important.

Originally posted by @ANaumann85 in #57 (comment)

Add CORS support

As we want to allow browsers to directly talk to Giftless, we may want to consider adding CORS configuration support directly in Giftless.

This should be optional and configurable.

Alternatively, we should document a way to deploy Giftless so it is accessible via a proxy (e.g. nginx) in a way that handles CORS.
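
If we add it directly to Giftless, one possible approach (a sketch, assuming the flask-cors package) would be:

from flask import Flask
from flask_cors import CORS

app = Flask(__name__)
# The allowed origins would come from Giftless configuration rather than being hardcoded
CORS(app, origins=["https://my-ckan-instance.example.com"])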

Implement Azure multipart support

Implement Azure multipart support as described in #11 and #48

Acceptance Criteria

  • multipart transfer adapter + Azure backend
  • Can configure giftless to upload to Azure using multipart transfer
  • Have solid test coverage + vcr
  • abort is supported
  • SAS URLs are provided with correct permissions for each action URL (upload/part, upload/commit, download, abort)
  • want_digest value correctly set for Azure (Content-MD5)
  • Falls back to basic for smaller files (<single part size)

Improve file name sanitization logic used by streaming transfer adapter

Now that we have merged #68, the streaming transfer adapter allows passing the content-disposition filename value as a parameter. However, the filename sanitization logic in https://github.com/datopian/giftless/blob/master/giftless/util.py#L73..L84 is very strict, and only allows latin alphanumerics, dashes, underscores and dots.

While this is a good security measure (we don't want any special characters injected into the HTTP headers), it is overly restrictive. In addition, many users will want file names with international, non-latin characters (Hebrew, Arabic, Chinese, European umlauts and accents, Cyrillic, etc.).

There is really no reason to avoid any special character other than characters that could affect HTTP headers, and even in this case we may be safe depending on Flask / Werkzeug's handling of headers.

Specifically, I think we should avoid / escape anything that is non-printing, as well as semicolons, double quotes and newlines. Other than that, we should be fine.

Perhaps it would be better to escape rather than strip these characters to ensure we never send an empty filename - for example URL-encoding only a handful of "unsafe" characters could be a good solution here.
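
A sketch of the escape-instead-of-strip approach (not the current safe_filename implementation): percent-encode only a small set of header-unsafe characters and keep everything else, including non-latin characters.

from urllib.parse import quote


def escape_filename(original: str) -> str:
    # Escape only characters that could break the Content-Disposition header:
    # control characters, double quotes, semicolons and backslashes.
    unsafe = set('";\\') | {chr(c) for c in range(0x20)} | {chr(0x7f)}
    return ''.join(quote(c) if c in unsafe else c for c in original)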

GCP should not create a bucket if it doesn't exist

We should define whether or not Giftless should create a bucket if it does not exist.

Currently, the Azure adapter will blow up if the bucket does not exist, but GCP will create it (because the API we use auto-creates non-existing buckets).

As a rule of thumb, Giftless shouldn't have the rights to create buckets, but it might. In such a case, should we try to create the bucket?

Decision

Let's not create buckets on purpose, and operate under the assumption we can't.

Acceptance

  • Disable bucket creation by default (if the bucket does not exist on GCS, do not create it)

Allow customizing identity object and scope class used by JWT authorizer

Currently in order to use JWT but have slightly different identity rules (e.g. if you want to default to read-only for unauthorized scopes), or handle scopes in a different way, you need to subclass and replace the entire authorizer.

This is not a huge setback, but it would be nice if the default scope class and identity class could be replaced via config.

Incorrect Content-type header for StreamingStorage factory

This is a bug/new feature request in which Giftless is used with CKAN, ckanext-blob-storage and datapusher, see original discussion.

When using StreamingStorage (factory: giftless.transfer.basic_external:factory) the Content-Type of a stored file is not preserved. For instance, after uploading a CSV file and then downloading it, the Content-Type returned by Giftless is text/html, which causes an error in datapusher, with the result that tabular data cannot be previewed in the CKAN UI. ExternalStorage seems to work correctly in this sense.

Error when pulling, after push worked fine

Hi,
I'm trying to get Git LFS working using local storage on a Raspberry Pi. Just Git was so easy - just set up SSH - but LFS is proving annoyingly hard. Seems like Giftless is just what I need!

I installed giftless from PyPI using these instructions, and I'm running it with the uwsgi command listed there, not with the Flask development server from the Getting Started guide. I'm also using 192.168.0.35:5000 instead of 127.0.0.1:8080, but I don't think that should make a difference. (Should it?)

I'm following the "Getting Started" guide and thought everything was going relatively smoothly; I was able to push two PNG files (instead of bin) and see similarly-sized files appear in storage. However, when I tried pulling from a different clone, this happened:

2021-07-05 23:12:51,917 giftless.app    ERROR Exception on /neatnit/hello/objects/storage/bf6d84c4eb78fce5b34ef70a74a492227783328301096ee375b77cdf30d844b8 [GET]
Traceback (most recent call last):
  File "/srv/git-lfs-server/giftless/venv/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/srv/git-lfs-server/giftless/venv/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/srv/git-lfs-server/giftless/venv/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/srv/git-lfs-server/giftless/venv/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/srv/git-lfs-server/giftless/venv/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/srv/git-lfs-server/giftless/venv/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/srv/git-lfs-server/giftless/venv/lib/python3.7/site-packages/flask_classful.py", line 301, in proxy
    response = view(**request.view_args)
  File "/srv/git-lfs-server/giftless/venv/lib/python3.7/site-packages/giftless/auth/__init__.py", line 90, in decorated_function
    return f(*args, **kwargs)
  File "/srv/git-lfs-server/giftless/venv/lib/python3.7/site-packages/flask_classful.py", line 269, in inner
    return fn(*args, **kwargs)
  File "/srv/git-lfs-server/giftless/venv/lib/python3.7/site-packages/giftless/transfer/basic_streaming.py", line 85, in get
    filename = safe_filename(filename)
  File "/srv/git-lfs-server/giftless/venv/lib/python3.7/site-packages/giftless/util.py", line 84, in safe_filename
    return ''.join(c for c in original_filename if c in valid_chars)
TypeError: 'NoneType' object is not iterable
[pid: 1949|app: 0|req: 63/63] 192.168.0.35 () {28 vars in 1277 bytes} [Mon Jul  5 23:12:51 2021] GET /neatnit/hello/objects/storage/bf6d84c4eb78fce5b34ef70a74a492227783328301096ee375b77cdf30d844b8?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiIsImtpZCI6ImdpZnRsZXNzLWludGVybmFsLWp3dC1rZXkifQ.eyJleHAiOjE2MjU1MTYwMzEsImlhdCI6MTYyNTUxNTk3MSwibmJmIjoxNjI1NTE1OTcxLCJzdWIiOm51bGwsIm5hbWUiOiJhbm9ueW1vdXMiLCJzY29wZXMiOiJvYmo6bmVhdG5pdC9oZWxsby9iZjZkODRjNGViNzhmY2U1YjM0ZWY3MGE3NGE0OTIyMjc3ODMzMjgzMDEwOTZlZTM3NWI3N2NkZjMwZDg0NGI4OnJlYWQifQ.y5IduY4eMS5XT4Jts6SjgXXYlDt3x8lSPL7Nhp5vJ_M => generated 196 bytes in 4 msecs (HTTP/1.1 500) 2 headers in 103 bytes (1 switches on core 0)

This is repeated over and over as the client tries again and again to get the file.

For reference:

neatnit@neatberrypi:~/Documents $ ls -lA /srv/git-lfs-server/giftless/lfs-storage/neatnit/hello
total 28
-rw-r--r-- 1 neatnit neatnit 12815 Jul  5 22:59 b7a90a039c6b38b403d01ffb84d5ba0b969bfbe4fad7246eeff9c0783a9d1b14
-rw-r--r-- 1 neatnit neatnit  9438 Jul  5 22:33 bf6d84c4eb78fce5b34ef70a74a492227783328301096ee375b77cdf30d844b8

I don't think I did any step wrong when following the guide.

Sorry that this might be more of a support request than a bug/issue! I didn't know where else to ask (and for all I know it might actually be a bug).

[research] Git client UX with giftless esp how does authentication work?

What is UX for a git user on the command line using giftless? In particular, how does authentication work?

Tasks

  • draw a sequence diagram

Analysis

We will be using HTTP(S) for the server, e.g. https://giftless.datahub.io. Thus, this is the relevant part of https://github.com/git-lfs/git-lfs/blob/master/docs/api/authentication.md:

Git provides a credentials command [see below] for storing and retrieving credentials through a customizable credential helper. By default, it associates the credentials with a domain. You can enable credential.useHttpPath so different repository paths have different credentials.

Git ships with a really basic credential cacher that stores passwords in memory, so you don't have to enter your password frequently. However, you are encouraged to setup a custom git credential cacher, if a better one exists for your platform

This details the first leg of interactions, i.e. the attempt to authenticate with giftless. In a CKAN setup we want this to go to the CKAN authz API and request the relevant token. In standalone giftless this is a TODO atm.

You can read more about git credentials here:
https://git-scm.com/docs/gitcredentials. Reading this it looks like you would want to configure this as follows:

[credential "https://giftless.datahub.io"]
	helper = /path/to/my/ckan/auth/utility

Then the /path/to/my/ckan/auth/utility would be something that went and got the token from CKAN.

How this token is then used by git-lfs is still not totally clear - I hope it just sends it in the Authorization header. See the excerpt below from git-lfs/git-lfs#2330 (comment):

Research

Authentication docs

https://git-scm.com/docs/gitcredentials

git-lfs/git-lfs#2330 (comment)

By default, Git LFS will attempt to authenticate with no authentication. LFS-Authenticate is an LFS-specific version of the WWW-Authenticate header for web browsers. It is not intended to authenticate the request, but to tell Git LFS how to authenticate.

  • Git LFS makes a request with no auth
  • LFS server returns 401 with something like Lfs-Authenticate: Basic realm="GitHub"
  • Git LFS retries the request with Basic authentication
  • Success!
Instead, it looks like your LFS server is replying information like this on each object:

"expires_in":600,
"header":{"LFS-Authenticate":"TOKEN"}
I don't know how your LFS server is written, but it should probably be sending an Authorization header. Tweak the expires_in value to an acceptable level. At 600s, LFS will re-access the batch API after 10m of running your upload or download command. It looks like LFS is confused by the incorrect usage of LFS-Authenticate. There are two things LFS looks at to determine if one of those URLs is already authenticated:

  • It has an Authorization header. The example above only has Lfs-Authenticate.
  • It has the "authenticated" property enabled. Check out the uploads example in the API docs.
LFS is still sending the Lfs-Authenticate header, but it's also going through git credentials to try to get a valid login for staging.sthse.co. If you change your server to use a valid Authorization header and set the "authenticated" property, you should be good to go.

Consider: switching to dataclasses + marshmallow-dataclasses

This will allow a nicer API for request / response payloads (based on strict object structures rather than dictionaries).

marshmallow-dataclasses should allow us easy marshaling to / from dicts and JSON, as well as remove the need to write Marshmallow schemas as it will all be based on dataclasses (?)
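
A sketch of what this could look like, assuming the marshmallow-dataclass package (field names follow the batch API; this is not existing Giftless code):

from typing import List, Optional

import marshmallow_dataclass


@marshmallow_dataclass.dataclass
class BatchObject:
    oid: str
    size: int


@marshmallow_dataclass.dataclass
class BatchRequest:
    operation: str
    objects: List[BatchObject]
    transfers: Optional[List[str]] = None


# The decorator attaches a generated marshmallow schema as .Schema
request = BatchRequest.Schema().load({
    "operation": "download",
    "objects": [{"oid": "123abc", "size": 123}],
})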

giftless.local.yaml configuration - error

I tried to follow this configuration but it didn't work properly. I received an "Object does not exist" error.

giftless.local.yaml file

AUTH_PROVIDERS: 
  - "giftless.auth.allow_anon:read_write"

TRANSFER_ADAPTERS:
  basic:
    factory: giftless.transfer.basic_external:factory
    options:
      storage_class: ..storage.azure:AzureBlobsStorage
      storage_options:
        connection_string: key
        container_name: lfs-storage
        path_prefix: lfs

Post:

{
    "operation": "upload",
    "transfers": [ "basic" ],
    "ref": { "name": "refs/heads/contrib" },
    "objects": [
      {
        "oid": "8857053d874453bbe8e7613b09874e2d8fc9ddffd2130a579ca918301c31b369",
        "size": 36
      }
    ]
}

console error:

Traceback (most recent call last):
  File "/home/rodeghiero/datopian/giftless-new/.venv/lib/python3.6/site-packages/flask/app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/rodeghiero/datopian/giftless-new/.venv/lib/python3.6/site-packages/flask/app.py", line 2449, in wsgi_app
    response = self.handle_exception(e)
  File "/home/rodeghiero/datopian/giftless-new/.venv/lib/python3.6/site-packages/flask/app.py", line 1866, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/home/rodeghiero/datopian/giftless-new/.venv/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/home/rodeghiero/datopian/giftless-new/.venv/lib/python3.6/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/rodeghiero/datopian/giftless-new/.venv/lib/python3.6/site-packages/flask/app.py", line 1951, in full_dispatch_requ
est
    rv = self.handle_user_exception(e)
  File "/home/rodeghiero/datopian/giftless-new/.venv/lib/python3.6/site-packages/flask/app.py", line 1820, in handle_user_except
ion
    reraise(exc_type, exc_value, tb)
  File "/home/rodeghiero/datopian/giftless-new/.venv/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/home/rodeghiero/datopian/giftless-new/.venv/lib/python3.6/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/rodeghiero/datopian/giftless-new/.venv/lib/python3.6/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/rodeghiero/datopian/giftless-new/.venv/lib/python3.6/site-packages/flask_classful.py", line 301, in proxy
    response = view(**request.view_args)
  File "/home/rodeghiero/datopian/giftless-new/giftless/auth/__init__.py", line 78, in decorated_function
    return f(*args, **kwargs)
  File "/home/rodeghiero/datopian/giftless-new/.venv/lib/python3.6/site-packages/flask_classful.py", line 269, in inner
    return fn(*args, **kwargs)
  File "/home/rodeghiero/datopian/giftless-new/giftless/view.py", line 59, in post
    response['objects'] = [action(**o) for o in payload['objects']]
  File "/home/rodeghiero/datopian/giftless-new/giftless/view.py", line 59, in <listcomp>
    response['objects'] = [action(**o) for o in payload['objects']]
  File "/home/rodeghiero/datopian/giftless-new/giftless/transfer/basic_external.py", line 58, in upload
    if self.storage.verify_object(prefix, oid, size):
  File "/home/rodeghiero/datopian/giftless-new/giftless/transfer/basic_streaming.py", line 54, in verify_object
    return self.exists(prefix, oid) and self.get_size(prefix, oid) == size
  File "/home/rodeghiero/datopian/giftless-new/giftless/transfer/storage/azure.py", line 37, in exists
    self.get_size(prefix, oid)
  File "/home/rodeghiero/datopian/giftless-new/giftless/transfer/storage/azure.py", line 49, in get_size
    raise ObjectNotFound("Object does not exist")
giftless.transfer.exc.ObjectNotFound: Object does not exist
127.0.0.1 - - [08/Apr/2020 10:32:06] "GET /myorg/myrepo/objects/batch?__debugger__=yes&cmd=resource&f=style.css HTTP/1.1" 200 -
127.0.0.1 - - [08/Apr/2020 10:32:06] "GET /myorg/myrepo/objects/batch?__debugger__=yes&cmd=resource&f=jquery.js HTTP/1.1" 200 -
127.0.0.1 - - [08/Apr/2020 10:32:06] "GET /myorg/myrepo/objects/batch?__debugger__=yes&cmd=resource&f=debugger.js HTTP/1.1" 200 -
127.0.0.1 - - [08/Apr/2020 10:32:06] "GET /myorg/myrepo/objects/batch?__debugger__=yes&cmd=resource&f=console.png HTTP/1.1" 200

[auth] support for ssh-initiated auth?

Hi, the git-lfs client supports a somewhat unusual ssh-initiated authentication, whereby it will run ssh <hostname> git-lfs-authenticate group/project.git upload. Is this something in scope for your server?

Google cloud credentials

As a developer trying to deploy Giftless with GCP on Heroku (or any other PaaS), I would like to configure my credential files without having to track them (i.e. as an env var or a group of env vars).

At this time I would need to either put it into the YAML file or specify a path and track the credentials.json file.

[epic] Giftless documentation v1

We want to create some great giftless documentation so that it is easy for others to use and contribute to.

We already have pretty good docs but we need more info on getting started as an integrator and extender.

Job Stories

When configuring and deploying giftless as a developer, I want good documentation on how to do that, so that I can get started quickly

When I have my own cloud storage, I want to be able to use that storage with giftless, e.g. by adding a new storage backend, so that I can store data in my cloud storage

When I want to control access to storage that giftless is the gatekeeper for, I want to add/customize authentication/authorization handlers, so that I control who can store data in my storage (possibly using my existing auth system)

When integrating giftless into a project (probably for data storage, e.g. from a client such as giftless-client-js), I want to understand the overall flow, including how auth works and where data gets stored (and how I control that)

When considering using Giftless, I want to understand what it does and why it is valuable (and why it is designed the way it is)

Acceptance

  • Tutorial showing this running with local git and local giftless and cloud
    • Deploy giftless locally (no auth) (with local storage)
    • Use git with it e.g. Add lfsconfig manually ...
    • Let's add proper cloud storage (Google Cloud) - note that other providers are also possible
    • Let's add auth => tutorial showing how to use it with JWT, e.g. use giftless-client-js (even from Node?) or giftless-client-py with this
      • Generate jwt token on command line
      • Use with code snippet which uses jwt and a client ...
      • Shows direct uploading with git ...
  • Deployment e.g. docker
  • Docs deployed at giftless.datopian.com

Tasks

Filenames not handled by basic streaming transfer adapter

Context: We are looking to set giftless up to work with ckanext-blob-storage.

Problem: Basic streaming with local storage does not currently handle the desired download file names or file types correctly. Instead all files, irrespective of type are downloaded as ".html".

Observations: I can see that the file names with extensions are passed to the batch request by ckanext-blob-storage, but this is simply dropped by the basic streaming transfer adapter. I can't see anywhere that the Content-Type is passed from blob storage to giftless, though.

Solution: We're very happy to help with this if you think that is constructive. I can submit a small PR for your review. Whether or not you use the PR (I really don't mind), I'd still really value the process of digging around giftless and hearing your feedback and thoughts.

Documentation on how CONFIG dictionary is built.

The current logic for overriding giftless's default configuration with the custom one in the .yaml file overrides the default configuration dictionary only where the same key exists in the .yaml file. Is that the desired behaviour or a buggy one?

For example if I have a .yaml file with the following config spec:

PRE_AUTHORIZED_ACTION_PROVIDER:
  options:
    private_key: my-new-private-key

The final config will be:

"PRE_AUTHORIZED_ACTION_PROVIDER": {​
 'factory': 'giftless.auth.jwt:factory',
 'options': {​
 'algorithm': 'HS256',
 'private_key': 'my-new-private-key',
 'private_key_file': None,
 'public_key': None,
 'public_key_file': None,
 'default_lifetime': 60, # 60 seconds for default actions
 'key_id': 'giftless-internal-jwt-key',
 }​
 }​,

I would expect (IMHO) the options element to be identical to the one in my .yaml file, and not a mix of both, since it is not clear why all the other values are there (for example key_id).
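
For illustration, the behaviour above is what you would get from a recursive ("deep") merge of the YAML file over the built-in defaults; a sketch of such a merge (not necessarily the exact code Giftless uses):

def deep_merge(defaults: dict, overrides: dict) -> dict:
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Only 'private_key' is overridden inside 'options'; the remaining default
# keys (algorithm, key_id, ...) are kept, which matches the output above.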

[epic] [design] Custom transfer mode for multipart uploads

This is a container ticket to discuss the design of a custom transfer adapter supporting multipart upload. This is not a part of the official git-lfs spec, but will be extremely valuable to us and if it works, could be used by custom git-lfs clients, and eventually could be proposed as an addition to the LFS protocol.

Goal

Spec a transfer protocol that will allow uploading files in parts to a storage backend, focusing on cloud storage services such as S3 and Azure Blobs.

Design goals:

Must:

  • Abstract vendor specific API and flow into a generic protocol
  • Remain as close as possible to the basic transfer API
  • Work at least with the multi-part APIs of S3 and Azure Blobs, and local storage

Nice / Should:

  • Define how uploads can be resumed by re-doing parts and not-redoing parts that were uploaded successfully (this may be vendor specific and not always supported)

Initial Protocol design

  • The name of the transfer is multipart-basic
  • {"operation": "download"} requests work exactly like basic download request with no change
  • {"operation": "upload"} requests will break the upload into several actions:
    • init (optional), a request to initialize the upload
    • parts (optional), zero or more part upload requests
    • commit (optional), a request to finalize the upload
    • verify (optional), a request to verify the file is in storage, similar to basic upload verify actions
  • Just like basic transfers, if the file fully exists and is committed to storage, no actions will be provided and the upload can simply be skipped
  • Requests are the same as basic requests except that {"transfers": ["multipart-basic", "basic"]} is the expected transfers value.
  • Authentication and authorization behave just like with the basic protocol

Request Objects

The init, commit and each one of the parts actions contain a "request spec". These are similar to basic transfer adapter actions but in addition to href and header also include method (optional) and body (optional) attributes, to indicate the HTTP request method and body. This allows the protocol to be vendor agnostic, especially as the format of init and commit requests tends to vary greatly between storage backends.

The default values for these fields depends on the action:

  • init defaults to no body and POST method
  • commit defaults to no body and POST method
  • parts requests default to PUT method and should include the file part as body, just like with basic transfer adapters.

In addition, each parts request will include the pos attribute to indicate the position in bytes within the file in which the part should begin, and size attribute to indicate the part size in bytes. If pos is omitted, default to 0. If size is omitted, default to read until the end of file.

Examples

Sample Upload Request

The following is a ~10mb file upload request:

{ 
  "transfers": ["multipart-basic", "basic"],
  "operation": "upload",
  "ref": "some-ref",
  "objects": [
    {
      "oid": "20492a4d0d84f8beb1767f6616229f85d44c2827b64bdbfb260ee12fa1109e0e",
      "size": 10000000
    }
  ]
}

Sample Upload Response:

The following is a response for the same request, given an imagined storage backend:

{
  "transfer": "multipart-basic",
  "objects": [
    {
      "oid": "20492a4d0d84f8beb1767f6616229f85d44c2827b64bdbfb260ee12fa1109e0e",
      "size": 10000000,
      "actions": {
        "parts": [
          {
            "href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=0",
            "header": {
              "Authorization": "Bearer someauthorizationtokenwillbesethere"
            },
            "pos": 0,
            "size": 2500000
          },
          {
            "href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=1",
            "header": {
              "Authorization": "Bearer someauthorizationtokenwillbesethere"
            },
            "pos": 2500001,
            "size": 2500000
          },
          {
            "href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=2",
            "header": {
              "Authorization": "Bearer someauthorizationtokenwillbesethere"
            },
            "pos": 5000001,
            "size": 2500000
          },
          {
            "href": "https://foo.cloud.com/storage/upload/20492a4d0d84?part=3",
            "header": {
              "Authorization": "Bearer someauthorizationtokenwillbesethere"
            },
            "pos": 7500001
          }
        ],
        "commit": {
          "href": "https://lfs.mycompany.com/myorg/myrepo/multipart/commit",
          "authenticated": true,
          "header": {
            "Authorization": "Basic 123abc123abc123abc123abc123=",
            "Content-type": "application/vnd.git-lfs+json"
          },
          "body": "{\"oid\": \"20492a4d0d84\", \"size\": 10000000, \"parts\": 4, \"transferId\": \"foobarbazbaz\"}"
        },
        "verify": {
          "href": "https://lfs.mycompany.com/myorg/myrepo/multipart/verify",
          "authenticated": true,
          "header": {
            "Authorization": "Basic 123abc123abc123abc123abc123="
          }
        }
      }
    }
  ]
}

As you can see, the init action is omitted as will be the case with many backend implementations (we assume initialization, if needed, will most likely be done by the LFS server at the time of the batch request).

Chunk sizes

It is up to the LFS server to decide the size of each file chunk.

TBD: Should we allow clients to request a chunk size? Is there reason for that?
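
For illustration, here is a sketch of how a server might compute the part request specs for a given object size and configured part size (the names are illustrative, not part of the spec):

from typing import List


def make_parts(size: int, part_size: int) -> List[dict]:
    parts = []
    pos = 0
    while pos < size:
        parts.append({"pos": pos, "size": min(part_size, size - pos)})
        pos += parts[-1]["size"]
    return parts


# e.g. make_parts(10_000_000, 2_500_000) yields four parts starting at
# positions 0, 2500000, 5000000 and 7500000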

Configuration: replace GCP_CREDENTIALS env var with a more conventional config method

Currently, the GCP storage backend looks for GCP_CREDENTIALS in the environment and uses it as a verbatim JSON string containing the credentials. This breaks our convention of passing all configuration as arguments to the class constructor, hinders testing and complicates deployment by creating a "special case" for GCP. Also, it's not pretty 💩

Let's replace it by adding the following config options to our standard config:

TRANSFER_ADAPTERS:
  basic:
    factory: giftless.transfer.basic_external:factory
    options:
      storage_class: giftless.storage.google_cloud.GoogleCloudBlobStorage
      storage_options:
        account_key_file: /path/to/key/file.json  # Path to account key json file
        account_key_base64: ewogICJ0eXBlIjogInNlcnZp...2NvdW50LmNvbSIKfQo=  # Literal JSON string encoded with base64

This will allow users to provide the key either as a path to a local file or as an inline string containing the JSON encoded in base64. The reason for base64 encoding is to avoid the ugliness of escaped literal JSON inside a YAML string / env var.

This will also allow users to specify the key as an environment variable by setting:

export GIFTLESS_TRANSFER_ADAPTERS_basic_options_storage_options_account_key_base64="ewogICJ0eXBlIjogInNlcnZp...2NvdW50LmNvbSIKfQo"

So it also correctly handles environments where uploading a key file is not well supported.
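
A sketch of how the storage backend could consume these two options (assuming the google-auth library; the option names are the ones proposed above):

import base64
import json
from typing import Optional

from google.oauth2 import service_account


def load_credentials(account_key_file: Optional[str] = None,
                     account_key_base64: Optional[str] = None):
    if account_key_file:
        return service_account.Credentials.from_service_account_file(account_key_file)
    if account_key_base64:
        info = json.loads(base64.b64decode(account_key_base64))
        return service_account.Credentials.from_service_account_info(info)
    raise ValueError("One of account_key_file or account_key_base64 is required")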

Clean up the "components" documentation

Now that we have written some howto guides / tutorials as part of #72 and #73, we should clean up the docs around Transfer adapters, Storage backends and Auth providers to make them more concise.

I'm thinking that each should be composed of a more "reference" style section, detailing each component type and each of the available options, and a "discussion" style section discussing design and abstractions.

Maybe, content that relates to creating custom components (if we have anything like that at all) should be moved to the development guide.

[design] multipart upload transfer implementation for Azure

This ticket includes design notes for an Azure storage backend supporting multipart upload, and a transfer adapter to wrap it.
The protocol is based on the discussion in #11 and on the spec in multipart-spec.md.

Azure specific multipart upload flow

  • Azure multipart upload is based on the Put Block and Put Block List APIs for block blobs. Do not be confused by "append blobs" or "page blobs", these are not what we need.
  • You upload any number (up to 100,000) of "blocks" of a blob; these can be of varying length. Each needs an ID of up to 64 bytes, and all IDs must have the same length. This is done using the Put Block API. Blocks can be up to 4GB in size.
  • Once you finish uploading, you call Put Block List to commit (see the sketch after this list).
  • There is no need to init a multipart upload; It happens automatically when you upload the first block to a blob
  • There is no way / need to abort an upload, as uncommitted parts are deleted after 7 days. However, we may want to implement abort, which really means deleting the entire blob - need to test what happens when you delete an uncommitted blob.
  • We can enable Content-MD5 validation on each block by sending Content-MD5 headers. A 400 response means the content MD5 is not valid.
  • The Put Block List API contains an XML structure that lists the blocks in order to form the full blob
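
A minimal sketch of the Put Block / Put Block List flow above, assuming the azure-storage-blob (v12) SDK (illustration only, not the Giftless implementation):

from typing import Iterable, List

from azure.storage.blob import BlobBlock, BlobClient


def upload_in_blocks(blob_client: BlobClient, chunks: Iterable[bytes],
                     block_ids: List[str]) -> None:
    for block_id, chunk in zip(block_ids, chunks):
        # "Put Block": upload one uncommitted block
        blob_client.stage_block(block_id=block_id, data=chunk)
    # "Put Block List": commit the blocks, in order, into the final blob
    blob_client.commit_block_list([BlobBlock(block_id=b) for b in block_ids])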

Open Questions:

  • Will an uncommitted blob be validated based on listing and size? Do we need to add more checks to get_size / exists / verify_object?
  • How do we know, when batch is called, what parts still need to be uploaded?

Allow automatic fallback to basic for small files

If the request is to upload a file under a configured size, allow the LFS server to negotiate a basic transfer instead of multipart-basic, to simplify the transaction and speed things up. Users can disable this by setting the minimum size for multipart to 0 in the config.

This is a leftover from #51, and is probably not more than a "nice to have" feature.

Document basic_streaming vs basic_external

I managed to get files uploaded to GCP using basic_streaming (in the YAML config file) only. basic_external would always give some sort of error and would never issue a PUT action; is this expected?

Usage example with mapping from url to bucket

I get that this is git-lfs, but I don't quite get from the README how a given request maps to a location in a bucket.

Could we get a short Usage section (referencing git-lfs) explaining how to push a file and where it ends up, location-wise, in the configured bucket?
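
For what it's worth, from the configuration and log snippets elsewhere on this page the mapping appears to be <path_prefix>/<organization>/<repo>/<oid> within the configured bucket; roughly (inferred, to be verified against the storage backend code):

def object_path(path_prefix: str, organization: str, repo: str, oid: str) -> str:
    # e.g. "lfs/myorg/myrepo/8857053d8744..." for the config shown elsewhere on this page
    return f"{path_prefix}/{organization}/{repo}/{oid}"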

Implement permissions based on JWT scopes

As we now have an authorization layer, we should be able to use JWT scopes to implement permission checking.

Scope generation / parsing / mapping to permissions should be flexible

Naming: giftless / gifted / gifts / gitfs / gitcloud

I get that giftless is a bit better as it fits git + lfs a bit nicer. However, it has a bit of a negative sense: you have no gifts for me 😢

gifted or gifts still has git + f (or fs) of lfs and is more positive 🎁

Finally we could go for giftly which has git + lf (but no s).

Options

  • giftless
  • gifted
  • gifts
  • gitfs - this could still be pronounced gifts and also has that sense of git + filesystem
  • gitcloud - this would keep the purpose

What do people think?

GCP: add support for the `x-filename` extra attribute and setting the downloaded file name in content-disposition

It would be good to maintain support for the x-filename "extra" property. This allows us (at least with the Azure backend) to specify a value for the Content-Disposition header, which makes it possible to set the file name in the browser when downloading (otherwise the name suggested in the browser's "Save As" dialog is the sha256, which is not pretty).

I'm not sure how to achieve this in GCS, but there is probably a way.

Originally posted by @shevron in #36
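
One possible direction (a sketch, not verified against Giftless's own URL-signing code): google-cloud-storage's generate_signed_url() accepts a response_disposition argument that could carry the x-filename value:

from datetime import timedelta

from google.cloud import storage


def signed_download_url(bucket_name: str, blob_path: str, filename: str) -> str:
    blob = storage.Client().bucket(bucket_name).blob(blob_path)
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=15),
        method="GET",
        response_disposition=f'attachment; filename="{filename}"',
    )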

Align multipart transfer protocol with the protocol proposed in the official git-lfs repo

After we implemented multipart-basic support, I started working with the Git LFS team on an official proposal to add a similar transfer protocol to Git LFS: git-lfs/git-lfs#4438. This is a very similar, but slightly cleaned up and simplified, version of our own protocol.

Once this proposal is accepted, it would be good to align our own implementation with it.

As it is not 100% compatible with our current transfer protocol, we can maintain backwards compatibility by implementing it as a separate transfer adapter and keeping the old one around, while recommending the new one in our docs.

ImportError: cannot import name 'transfer' from 'giftless'

python version: 3.7.3
pip3 version: 18.1
giftless version: 0.2.0

When running from PyPI with

uwsgi -M -T --threads 2 -p 2 --manage-script-name \
    --module giftless.wsgi_entrypoint --callable app --http 127.0.0.1:8080

I get the error ImportError: cannot import name 'transfer' from 'giftless' .

It seems like the "transfer" folder is missing from the PyPI package.

Make storage adapter API return types consistent across Azure and GCP

It looks like the streaming API for Gunicorn on Heroku does not support the tell() method. ATM there is a workaround where I return the size of the stream, which is not fully compatible with the declared return type of put in the provider implementation. What's a good way to abstract over different providers and WSGI servers? Should we revisit the API?

Note: the code below is bad

# google_cloud.py
def put(self, prefix: str, oid: str, data_stream: BinaryIO) -> int:
    bucket = self.storage_client.get_bucket(self.bucket_name)
    blob = bucket.blob(self._get_blob_path(prefix, oid))
    blob.upload_from_string(data_stream.read())
    return data_stream.get_size()
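
One provider-agnostic option (a sketch, under the assumption that we only need a byte count back from put()): wrap the incoming stream so the backend can report how many bytes it consumed, without relying on tell() or any vendor-specific stream attribute.

from typing import BinaryIO


class CountingStream:
    """Wraps a readable stream and counts the bytes read from it."""

    def __init__(self, stream: BinaryIO):
        self._stream = stream
        self.bytes_read = 0

    def read(self, size: int = -1) -> bytes:
        data = self._stream.read(size)
        self.bytes_read += len(data)
        return data


# In put(): wrap data_stream, pass the wrapper to the SDK upload call, and
# return wrapper.bytes_read instead of calling tell() or get_size()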

GCP: try to refactor our own implementation of `_get_signed_url` to use the Google library

For some reason, using Google's generate_signed_url method is not working for us, and we had to revert to implementing URL signing ourselves for #36.

We should revisit this, because we do not know why Google's own library doesn't work for us, and the assumption is that it should. Removing our own implementation would mean less code, fewer bugs and fewer security issues for us.

[auth] Add option to pull the JWT public key from a URL on startup

When the JWT authenticator is configured to use a public key for verification, it would be nice to have an option to pull that key from a URL (HTTPS only!) as opposed to uploading it to the server / pre-configuring it in an env var. This would allow easier deployment.

We need to consider whether this has security implications (e.g. if a key is spoofed + the URL is hijacked to deliver a matching public key + the server is restarted...). I don't think it does, as long as HTTPS is used.

Note that ckanext-authz-service now offers the public key (if set) in /authz/public_key.
