gcsfs's Introduction

GCSFS

A Python filesystem abstraction of Google Cloud Storage (GCS) implemented as a PyFilesystem2 extension.

https://travis-ci.org/Othoz/gcsfs.svg?branch=master https://readthedocs.org/projects/fs-gcsfs/badge/?version=latest

With GCSFS, you can interact with Google Cloud Storage as if it were a regular filesystem.

Apart from the nicer interface, this largely decouples your code from the underlying storage mechanism: exchanging the storage backend with an in-memory filesystem for testing, or with any other filesystem such as S3FS, becomes as easy as replacing gs://bucket_name with mem:// or s3://bucket_name, as shown in the sketch below.
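
For example, the same function can be exercised against an in-memory filesystem in tests and pointed at a bucket in production (a minimal sketch; process() and the file name are purely illustrative, not part of the library):

from fs import open_fs

def process(filesystem):
    """Works against any PyFilesystem backend."""
    with filesystem.open("report.txt", "w") as f:
        f.write("hello")

process(open_fs("mem://"))              # in-memory filesystem for tests
# process(open_fs("gs://bucket_name"))  # Google Cloud Storage in production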

For a full reference on all the PyFilesystem possibilities, take a look at the PyFilesystem Docs!

Documentation

Installing

Install the latest GCSFS version by running:

$ pip install fs-gcsfs

Or in case you are using conda:

$ conda install -c conda-forge fs-gcsfs

Examples

Instantiating a filesystem on Google Cloud Storage (for a full reference visit the Documentation):

from fs_gcsfs import GCSFS
gcsfs = GCSFS(bucket_name="mybucket")

Alternatively, you can use an FS URL to open a filesystem:

from fs import open_fs
gcsfs = open_fs("gs://mybucket/root_path?project=test&api_endpoint=http%3A//localhost%3A8888&strict=False")

Supported query parameters are:

  • project (str): Google Cloud project to use
  • api_endpoint (str): URL-encoded endpoint that will be passed to the GCS client's client_options
  • strict ("True" or "False"): Whether GCSFS will be opened in strict mode

You can use GCSFS like your local filesystem:

>>> from fs_gcsfs import GCSFS
>>> gcsfs = GCSFS(bucket_name="mybucket")
>>> gcsfs.tree()
├── foo
│   ├── bar
│   │   ├── file1.txt
│   │   └── file2.csv
│   └── baz
│       └── file3.txt
└── file4.json
>>> gcsfs.listdir("foo")
["bar", "baz"]
>>> gcsfs.isdir("foo/bar")
True

Uploading a file is as easy as:

from fs_gcsfs import GCSFS
gcsfs = GCSFS(bucket_name="mybucket")
with open("local/path/image.jpg", "rb") as local_file:
    with gcsfs.open("path/on/bucket/image.jpg", "wb") as gcs_file:
        gcs_file.write(local_file.read())

You can even sync an entire bucket to your local filesystem by using PyFilesystem's utility methods:

from fs_gcsfs import GCSFS
from fs.osfs import OSFS
from fs.copy import copy_fs

gcsfs = GCSFS(bucket_name="mybucket")
local_fs = OSFS("local/path")

copy_fs(gcsfs, local_fs)

To explore all the possibilities of GCSFS and other filesystems implementing the PyFilesystem interface, we recommend visiting the official PyFilesystem Docs!

Development

To develop on this project, make sure you have pipenv installed, then run the following from the root directory of the project:

$ pipenv install --dev --three

This will create a virtualenv with all packages and dev-packages installed.

Tests

All CI tests run against an actual GCS bucket provided by Othoz.

In order to run the tests against your own bucket, make sure to set up a Service Account with all necessary permissions:

  • storage.objects.get
  • storage.objects.list
  • storage.objects.create
  • storage.objects.update
  • storage.objects.delete

All five permissions listed above are included, for example, in the predefined Cloud Storage IAM role roles/storage.objectAdmin.

Expose your bucket name as an environment variable $TEST_BUCKET and run the tests via:

$ pipenv run pytest

Note that the tests mostly wait for I/O, so it makes sense to parallelize them heavily with pytest-xdist, e.g. by running the tests with:

$ pipenv run pytest -n 10

Credits

Credits go to S3FS, which was the main source of inspiration and shares a lot of code with GCSFS.

gcsfs's People

Contributors

bemeyvif, bgroenks96, birnbaum, btschroer, elephantum, mathiaseitz, trendelkampschroer


gcsfs's Issues

After applying fix_storage for the first time, a second fix_storage call hangs

Hello,
I've run into this issue:

  1. I created a fresh bucket
  2. copied data to the bucket with gsutil cp
  3. ran GCSFS.fix_storage() with root_path at its default ("")
  4. GCSFS created directory markers, including a "/" directory marker for the bucket's root
  5. Issue: I want to create a new directory, so I run GCSFS.fix_storage() again, but it gets stuck (a minimal reproduction is sketched below):
gcsfs.fix_storage()
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): www.googleapis.com:443
DEBUG:urllib3.connectionpool:https://www.googleapis.com:443 "GET /storage/v1/b/redacted/o?projection=noAcl&prefix= HTTP/1.1" 200 249836
<nothing happens>
  6. If I delete the "/" directory marker in the bucket root, GCSFS.fix_storage() runs correctly and marks the new directory properly
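
A minimal reproduction of the above (bucket name and objects are hypothetical):

from fs_gcsfs import GCSFS

# Objects were copied into the fresh bucket beforehand, e.g. with `gsutil cp`.
gcsfs = GCSFS(bucket_name="redacted")  # root_path left at its default
gcsfs.fix_storage()  # first run: creates directory markers, including one for the bucket root ("/")
gcsfs.fix_storage()  # second run: hangs after the initial object listing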

Btw thanks for this lib.

GCSFS constructor with retry != 0 produces a DeprecationWarning

The following GCSFS code results in a deprecation warning issued by urllib3:

if retry:
    adapter = HTTPAdapter(max_retries=Retry(total=retry,
                                            status_forcelist=[429, 502, 503, 504],
                                            method_whitelist=False,  # retry on any HTTP method
                                            backoff_factor=0.5))
    self.client._http.mount("https://", adapter)

The warning:
DeprecationWarning: Using 'method_whitelist' with Retry is deprecated and will be removed in v2.0. Use 'allowed_methods' instead

Versions:
fs-gcsfs 1.4.1
urllib3 1.26.2

P.S. retry is not described in the GCSFS.__init__() docstring.
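
A possible fix would be to pass the newer allowed_methods argument and fall back to method_whitelist on older urllib3 versions. A rough, untested sketch (not the actual change shipped in fs-gcsfs):

from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_retrying_adapter(retry):
    """Return an HTTPAdapter that retries on any HTTP method, on old and new urllib3."""
    kwargs = dict(total=retry, status_forcelist=[429, 502, 503, 504], backoff_factor=0.5)
    try:
        r = Retry(allowed_methods=None, **kwargs)    # urllib3 >= 1.26: None means "retry on any method"
    except TypeError:
        r = Retry(method_whitelist=False, **kwargs)  # older urllib3: False has the same effect
    return HTTPAdapter(max_retries=r)

# in GCSFS.__init__:
# self.client._http.mount("https://", build_retrying_adapter(retry))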

Add equivalent of S3Map to gcsfs

s3fs has a handy class called S3Map that wraps the file system in a MutableMapping. This is a necessary feature for use with some third party libraries (in my case, xarray and zarr).

The implementation is pretty simple and can be ported directly from s3fs. I have submitted a pull request doing so.
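
A rough idea of what such a wrapper over the PyFilesystem interface could look like (a hypothetical sketch, not the ported S3Map from the pull request):

from collections.abc import MutableMapping

from fs.errors import ResourceNotFound
from fs.path import dirname
from fs_gcsfs import GCSFS

class GCSMap(MutableMapping):
    """Expose an FS object (e.g. a GCSFS instance) as a key -> bytes mapping."""

    def __init__(self, fs):
        self.fs = fs

    def __getitem__(self, key):
        try:
            return self.fs.readbytes(key)
        except ResourceNotFound:
            raise KeyError(key)

    def __setitem__(self, key, value):
        parent = dirname(key)
        if parent:
            self.fs.makedirs(parent, recreate=True)
        self.fs.writebytes(key, value)

    def __delitem__(self, key):
        self.fs.remove(key)

    def __iter__(self):
        return (path.lstrip("/") for path in self.fs.walk.files())

    def __len__(self):
        return sum(1 for _ in self.fs.walk.files())

# store = GCSMap(GCSFS(bucket_name="mybucket"))
# store["foo/bar.bin"] = b"hello"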

Problems accessing certain files with fs-gcsfs while it works with plain gcsfs

Hi!
I have some problems accessing certain files using fs_gcsfs. I'm struggling to figure out what exactly the problem is, as it does not affect all files. Accessing the files directly through gcsfs works, though.

>>> import gcsfs
>>> t = gcsfs.GCSFileSystem(project='some-company')
>>> p = t.open('gs://my-bucket/my-folder/README.md')
>>> p
<File-like object GCSFileSystem, my-bucket/my-folder/README.md>
>>>
>>> from fs import open_fs
>>> t = open_fs('gs://my-bucket/')
>>> t.open('my-folder/README.md')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/fs/base.py", line 1166, in open
    bin_file = self.openbin(path, mode=bin_mode, buffering=buffering)
  File "/usr/local/lib/python3.7/site-packages/fs_gcsfs/_gcsfs.py", line 348, in openbin
    info = self.getinfo(path)
  File "/usr/local/lib/python3.7/site-packages/fs_gcsfs/_gcsfs.py", line 127, in getinfo
    raise errors.ResourceNotFound(path)
fs.errors.ResourceNotFound: resource 'my-folder/README.md' not found
>>> t.exists('my-folder/README.md')
True
>>> t.open('other-folder/some-file.csv')
    <_io.TextIOWrapper name='other-folder/some-file.csv' encoding='utf-8'>

So for some reason I can access other-folder/some-file.csv but not my-folder/README.md when using fs_gcsfs. From what I see, the ACL and IAM policies for both files look pretty much the same:

% gsutil acl get gs://my-bucket/my-folder/README.md
[
  {
    "entity": "project-owners-605693490522",
    "projectTeam": {
      "projectNumber": "605693490522",
      "team": "owners"
    },
    "role": "OWNER"
  },
  {
    "entity": "project-editors-605693490522",
    "projectTeam": {
      "projectNumber": "605693490522",
      "team": "editors"
    },
    "role": "OWNER"
  },
  {
    "entity": "project-viewers-605693490522",
    "projectTeam": {
      "projectNumber": "605693490522",
      "team": "viewers"
    },
    "role": "READER"
  },
  {
    "email": "[email protected]",
    "entity": "[email protected]",
    "role": "OWNER"
  }
]
% gsutil iam get gs://my-bucket/my-folder/README.md
{
  "bindings": [
    {
      "members": [
        "projectViewer:some-company"
      ], 
      "role": "roles/storage.legacyObjectReader"
    }, 
    {
      "members": [
        "projectOwner:some-company", 
        "projectEditor:some-company", 
        "serviceAccount:[email protected]"
      ], 
      "role": "roles/storage.legacyObjectOwner"
    }
  ], 
  "etag": "CAE="
}

% gsutil acl get gs://my-bucket/other-folder/some-file.csv
[
  {
    "email": "[email protected]",
    "entity": "[email protected]",
    "role": "OWNER"
  },
  {
    "entity": "project-owners-605693490522",
    "projectTeam": {
      "projectNumber": "605693490522",
      "team": "owners"
    },
    "role": "OWNER"
  },
  {
    "entity": "project-editors-605693490522",
    "projectTeam": {
      "projectNumber": "605693490522",
      "team": "editors"
    },
    "role": "OWNER"
  },
  {
    "entity": "project-viewers-605693490522",
    "projectTeam": {
      "projectNumber": "605693490522",
      "team": "viewers"
    },
    "role": "READER"
  }
]
% gsutil iam get gs://my-bucket/other-folder/some-file.csv
{
  "bindings": [
    {
      "members": [
        "serviceAccount:[email protected]",
        "projectOwner:some-company",
        "projectEditor:some-company"
      ],
      "role": "roles/storage.legacyObjectOwner"
    },
    {
      "members": [
        "projectViewer:some-company"
      ],
      "role": "roles/storage.legacyObjectReader"
    }
  ],
  "etag": "CAE="
}

The GOOGLE_APPLICATION_CREDENTIALS env-variable is set to a service-account file for some-service-account.
Any idea how I can debug/solve this? I am using:

fs-gcsfs                   1.0.0    
gcsfs                      0.3.0 

glob without a star (*) returns the wrong value

>>> gcsfs.__version__
'0.2.3'

The glob() function without a star (*) returns the wrong value. This behavior is inconsistent with glob.glob and tensorflow.io.gfile.glob.

>>> fs.glob('gs://mybucket/folder/*.csv')
['mybucket/folder/a.csv']          # why without gs:// ???

>>> fs.glob('gs://mybucket/folder/a.csv')
[]

>>> fs.exists('gs://mybucket/folder/a.csv')
True

Would it make sense to create missing directory-markers in `getinfo`?

I'm wondering: I'm creating files on the bucket from another application that doesn't create markers, so I have to write some additional code somewhere to handle this. But I think a small "does the file exist but the markers along the path are missing? If so, create the markers" snippet in getinfo, under the block

if check_parent_dir:
    [...]
    if parent_dir != "/" and not self._get_blob(parent_dir_key):
        raise errors.ResourceNotFound(path)

would fix this problem. Is there any reason why this is a bad idea?
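
In the meantime, a workaround from the producing side is to create the missing markers explicitly, e.g. with a small helper like this (a hypothetical sketch, assuming fs-gcsfs's convention of empty blobs whose keys end with "/"):

from google.cloud.storage import Client

def create_dir_markers(bucket_name, file_key):
    """Upload empty "<prefix>/" blobs for every parent of file_key so that
    fs-gcsfs's getinfo() finds the intermediate directories."""
    bucket = Client().bucket(bucket_name)
    prefix = ""
    for part in file_key.strip("/").split("/")[:-1]:
        prefix += part + "/"
        marker = bucket.blob(prefix)
        if not marker.exists():
            marker.upload_from_string(b"")

# create_dir_markers("mybucket", "my-folder/sub/file.txt")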

Wrong `fs.errors.CreateFailed: Root path "XXX" does not exist`

Hi,

When doing:

fs = open_fs("gs://my_bucket/path/to/blob")

I'm getting:

  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/fs/opener/registry.py", line 228, in open_fs
    default_protocol=default_protocol,
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/fs/opener/registry.py", line 189, in open
    open_fs = opener.open_fs(fs_url, parse_result, writeable, create, cwd)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/fs_gcsfs/opener.py", line 25, in open_fs
    return GCSFS(bucket_name, root_path=root_path, create=create)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/fs_gcsfs/_gcsfs.py", line 88, in __init__
    raise errors.CreateFailed("Root path \"{}\" does not exist".format(root_path))
fs.errors.CreateFailed: Root path "path/to/blob" does not exist

But I'm sure gs://my_bucket/path/to/blob does exist (running gsutil ls gs://my_bucket/path/to/blob returns files)

Wondering if #9 broke something cc @birnbaum
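
One workaround to try until this is resolved (an untested sketch, assuming open_fs returns a GCSFS instance for gs:// URLs): open the bucket root, repair the directory markers, then descend into the prefix.

from fs import open_fs

root = open_fs("gs://my_bucket")
root.fix_storage()                   # creates missing directory markers for existing blobs
sub = root.opendir("path/to/blob")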

TypeError: __init__() got an unexpected keyword argument 'allowed_methods'

Hi, I get the following error using either of the two options (obviously I replaced this with the real name of my bucket):

from fs_gcsfs import GCSFS
gcsfs = GCSFS("mybucket")

or

from fs import open_fs
gcsfs = open_fs("gs://mybucket/mydir") 

Error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-da26c33d3a55> in <module>
----> 1 gcsfs = open_fs("gs://mybucket/mydir")

/opt/conda/lib/python3.7/site-packages/fs/opener/registry.py in open_fs(self, fs_url, writeable, create, cwd, default_protocol)
    224                 create=create,
    225                 cwd=cwd,
--> 226                 default_protocol=default_protocol,
    227             )
    228         return _fs

/opt/conda/lib/python3.7/site-packages/fs/opener/registry.py in open(self, fs_url, writeable, create, cwd, default_protocol)
    185         opener = self.get_opener(protocol)
    186 
--> 187         open_fs = opener.open_fs(fs_url, parse_result, writeable, create, cwd)
    188         return open_fs, open_path
    189 

/opt/conda/lib/python3.7/site-packages/fs_gcsfs/opener.py in open_fs(self, fs_url, parse_result, writeable, create, cwd)
     37             client.client_options = {"api_endpoint": api_endpoint}
     38 
---> 39         return GCSFS(bucket_name, root_path=root_path, create=create, client=client, strict=strict)

/opt/conda/lib/python3.7/site-packages/fs_gcsfs/_gcsfs.py in __init__(self, bucket_name, root_path, create, client, retry, strict)
     80                                                     status_forcelist=[429, 502, 503, 504],
     81                                                     allowed_methods=False,  # retry on any HTTP method
---> 82                                                     backoff_factor=0.5))
     83             self.client._http.mount("https://", adapter)
     84 

TypeError: __init__() got an unexpected keyword argument 'allowed_methods'
import fs_gcsfs
fs_gcsfs.__version__
# 1.4.2

Update dependencies

The pinned version of packaging is old and conflicts with another of my dependencies when updating:

  Because poetryup (0.7.1) depends on packaging (>=21.3,<22.0)
   and fs-gcsfs (1.4.5) depends on packaging (>=20.0,<21.0), poetryup (0.7.1) is incompatible with fs-gcsfs (1.4.5).
  And because no versions of fs-gcsfs match >1.4.5,<2.0.0, poetryup (0.7.1) is incompatible with fs-gcsfs (>=1.4.5,<2.0.0).
  So, because *** depends on both fs-gcsfs (^1.4.5) and poetryup (0.7.1), version solving failed.

Consider implementing server-side md5

GCP can compute the md5 hash server-side: https://googleapis.dev/python/storage/latest/blobs.html#google.cloud.storage.blob.Blob.md5_hash

hash() is not typically implemented in PyFilesystem, according to https://docs.pyfilesystem.org/en/latest/implementers.html#helper-methods. However, in this case it could be argued that there is a performance benefit to getting the hash from the server rather than downloading and computing it client-side.

Note that this only works for md5 and crc32c, so the implementation could simply fall back to the parent class otherwise. An implementation might look like this (completely untested):

def hash(self, path, name):
    if name.lower() == 'md5':
        _path = self.validatepath(path)
        _key = self._path_to_key(_path)
        blob = self._get_blob(_key)
        if not blob:
            raise errors.ResourceNotFound(path)
        return blob.md5_hash  # base64-encoded MD5 reported by GCS
    else:
        # Fall back to the default download-and-hash implementation for other algorithms
        return super().hash(path, name)

Secret credential stored in public repository

I'm not completely familiar with Travis for CI, but it seems like a credential to deploy to PyPI might be exposed here?

secure: MfxtR41hpvchUoBGIDC8CnsQeO7xJDH22IHK72AjumuASnxeDeJFI6/SLzCzlsm+QLK1bXC3BFu1LLpJQxiLsKmiQVe1zAc6IYs5Hy/u/zSQ1DZSnqEgqS05DnEckf4DPn9QbmOazH7B9IvOAF3Vii5mmGjfUk1I3pT5/ZPBbbl54Tiptm9qtmc96llvx1j3WSi4Ug/aM3w3K3fmjpVBnUxu+TO2dyo78qiP0RbU0KoH85Ec97xqZm98gO4TySUc+X8+hWyRNOB+qtArjeGCxV/VTequkU+HZ4IXnhD2SlnILLicU94NLcH9PjE2F8JV3auCOAyn3Jx3CbgrzsnA5m85DYkiuQjK45fC13SoU2L02fMA/6eeWdeYw69bXQAy/0XMvZVILkGrFhoxbu4kcvZf3EahluVuB5IGSEE4/ZBJYCvKbwX2pxwGFDUy+ile+FrFRRqKA3F3Vc++ToceE1m+pGemP0M3G2oBDQ0UhDUUITqW3buZ6GdinMePksAmfBLu/maVzKge2gjeBOkIj1+5JvuKGcL+6h4idRdRGa6mMJw6PihV9YoQZEfGKIldtub+as2eApFUYX1JKux9m4Cu1KJiyQBeok8zm2X6eKTSz53EwJD5xqpgnOsISbutcmlJ2jjZFUV72EUhSpfctk4jINX4nIZmNG4O8HN8FR0=

Feature request: Passing more arguments to Client object

I'm working on a large codebase using fs-gcsfs and we'd like to test our code without having to: 1) create development versions of buckets and 2) go through thousands of lines of code to find every bucket, change the name during testing, and remember to change every single one back before deployment.

It'd be really great if there were a way to supply the api_endpoint argument to the Client object via the URL. I haven't thought too long about this, but I was thinking of adding client_options as a query parameter whose value is a JSON object that we'd then pass directly to the client_options argument of Client:

fs.open_fs("gs://my-bucket?client_options=%7B%22api_endpoint%22%3A%20%22http%3A//localhost%3A1234%22%7D")

You'd then modify the opener and add:

options_string = parse_result.params.get("client_options")
if options_string is not None:
    # You may need to run options_string through urllib.parse.unquote() first
    client_options = json.loads(options_string)
else:
    client_options = None

return GCSFS(..., client_options=client_options)

Then in the GCSFS constructor you'd add an optional client_options argument and make a small change to how the client is created:

if self.client is None:
    self.client = Client(client_options=client_options)

I know JSON is quite ugly but it'll get the job done and is extensible. Another possibility is using dot notation along with dpath to get something much more readable:

fs.open_fs("gs://my-bucket?client_options.api_endpoint=http%3A//localhost%3A1234")

It's a tad more work but I think it'll be fine? Google's API client only defines one client option so far, and it's a string. If they add another option (say a timeout length) and it's an integer, then we'd have to handle that some other way.
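
A rough sketch of the dot-notation variant inside the opener (hypothetical; supports only one level of nesting and does not use dpath):

from google.cloud.storage import Client

def client_from_parse_result(parse_result):
    """Build a storage Client from "client_options.<key>" query parameters.

    parse_result is the fs.opener.parse.ParseResult handed to the opener.
    """
    client_options = {}
    for key, value in parse_result.params.items():
        if key.startswith("client_options."):
            client_options[key[len("client_options."):]] = value
    return Client(client_options=client_options or None)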

Guess MIME type from file extension

Currently, all files created in GCS get the application/octet-stream MIME type, which makes it impossible to upload images for hotlinking from GCS.

It should be relatively easy to add support for mimetypes.guess_type, and it would increase the usability of the module a lot.
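
Until such support lands, one workaround is to upload via google-cloud-storage directly and pass the guessed content type yourself (a hypothetical sketch):

import mimetypes

from google.cloud.storage import Client

def upload_with_mime(bucket_name, local_path, key):
    """Upload a file and set its Content-Type based on the key's file extension."""
    content_type, _ = mimetypes.guess_type(key)
    blob = Client().bucket(bucket_name).blob(key)
    blob.upload_from_filename(local_path, content_type=content_type or "application/octet-stream")

# upload_with_mime("mybucket", "local/path/image.jpg", "path/on/bucket/image.jpg")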
