fsspec / universal_pathlib

pathlib api extended to use fsspec backends

License: MIT License

Python 98.05% Jupyter Notebook 1.95%

universal_pathlib's Issues

Feature Request: Add adlfs support

Thanks for creating this package; it's a great addition to the fsspec project.

I'd like to see support for adlfs in addition to the currently supported options (S3, GCS, https, and hdfs). Is anyone working on this, or are there any known blockers?

subclassing from UPath

Not sure if I am doing this properly, but subclassing UPath is challenging, in part due to this check in the __new__ method:

https://github.com/Quansight/universal_pathlib/blob/dbc57f7ec4c8bb8b2e22975f6b62e25b2b82ecff/upath/core.py#L12

I am curious if this could be changed to:

        if issubclass(cls, UPath):

This would return True even for issubclass(UPath, UPath).
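
To illustrate the difference, here is a minimal sketch assuming the existing check is an identity comparison of the form `cls is UPath` (the linked line is not reproduced here). `Base` and `Child` are hypothetical stand-ins, not the real UPath classes.

```python
class Base:
    """Hypothetical stand-in for UPath; not the real implementation."""

    @classmethod
    def matches_identity(cls):
        # Mirrors a check of the form `cls is Base`.
        return cls is Base

    @classmethod
    def matches_subclass(cls):
        # Mirrors the proposed `issubclass(cls, Base)` check.
        return issubclass(cls, Base)


class Child(Base):
    """Hypothetical user subclass that adds convenience methods."""


print(Base.matches_identity(), Base.matches_subclass())    # True True
print(Child.matches_identity(), Child.matches_subclass())  # False True
```

With the identity check, a user subclass never takes the branch intended for UPath; with `issubclass`, both the base class and its subclasses do.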

Happy to submit a PR if you are ok with this change.

My surprisingly challenging attempt to subclass UPath to add some convenience methods has really grown my appreciation for what this project has accomplished.

On a related note, after digging deeper into the code, and exploring the pathlib implementation, I am wondering if UPath would benefit from having something like a _URI_Flavour derived from the pathlib._Flavour class. This might make it easier to have consistent path handling across the fsspec based implementations.

Updating admins/maintainers

@normanrz, would you be interested in being added as an admin for this project, as you have already been manning the helm for a while? Also would you be interested in being a maintainer for the conda-forge feedstock?

@ap--, since you have made several contributions, would you be interested in being added as a maintainer?

Updating the example notebook

@andrewfulton9,

As mentioned in PR #32, the notebook output needs to be updated, but there were a couple of issues, as stated in a comment to the PR:

The issues I noticed were that the UPath untested default warning appears, which we may want to filter or suppress, since this notebook is the primary user documentation. Also, the readme.exists() cell is now returning False, so it looks like something somewhere may have broken. I don't think it is in this PR, though: I tested on main and the issue still happens, so I will create an issue to address that.

In addition to those issues, I think it would be worth clarifying this point:

Some filesystems may require extra imports to use.

Since fsspec tries to instantiate registered implementations as needed, I think it would be clearer to state something like:

Some filesystems may require additional packages to be installed.

The import of s3fs is not necessary and may cause confusion about whether target filesystem packages need to be imported prior to usage with UPath.

One idea I had is that it might be a handy reference to import and cleanly print the known fsspec implementations at the end of the notebook.

I'd be interested in any feedback. Hopefully I'll be able to submit another PR sometime soon with some of these changes.

Python 3.10: TypeError: PurePath._from_parts() got an unexpected keyword argument 'init'

let the new python problems begin:

>>> from upath import UPath
>>> UPath('asdfsad')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/fixtures-cache/lib/python3.10/site-packages/upath/core.py", line 39, in __new__
    self = cls._from_parts(args, init=False)
TypeError: PurePath._from_parts() got an unexpected keyword argument 'init'

The noxfile has to be changed to support newer nox versions

Starting with nox>=2022.8.7, the CI breaks: https://github.com/fsspec/universal_pathlib/runs/7724453295?check_suite_focus=true#step:5:61

This requires fixing the noxfile.

Current workaround: bf1ea06

`mkdir(parents=True)` tries to create the bucket if it's empty

I ran into this issue today.
I was using UPath("gs://my-fresh-bucket/some-dir/another-dir/file").parent.mkdir(parents=True, exist_ok=True).
The bucket was empty and didn't contain any objects.
universal-pathlib attempted to create the bucket instead of creating the directory structure, which caused the code to fail.

Glob on s3 bucket should be coherent with other implementations?

Given this

print([f for f in path.glob("**/*") if f.is_file() ])

if path is a Path from pathlib holding a folder, that would return all the files recursively contained in that folder. I'd expect that:

path = UPath("s3://my-bucket")
print([f for f in path.glob("**/*") if f.is_file()])

would return all files, but instead it returns nothing. Using /**/* works instead.

Is this a bug or a feature?

My use case is providing a generic "folder" path, and being able to walk all the files within it recursively
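
For reference, this is the baseline behaviour of plain `pathlib` that the report compares against. It is a minimal sketch on a temporary local directory; the file names and the helper `list_files_recursively` are made up for illustration.

```python
import tempfile
from pathlib import Path


def list_files_recursively(root):
    """Return all regular files below root as sorted POSIX-style relative paths."""
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob("**/*") if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub" / "deeper").mkdir(parents=True)
    (root / "top.txt").write_text("a")
    (root / "sub" / "mid.txt").write_text("b")
    (root / "sub" / "deeper" / "leaf.txt").write_text("c")
    found = list_files_recursively(root)

print(found)  # ['sub/deeper/leaf.txt', 'sub/mid.txt', 'top.txt']
```

The pattern `**/*` needs no leading slash here, which is the behaviour one would hope to see mirrored for bucket paths.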

Updating package dev tools

from #96

Migrate to hatch, stop using legacy setup metadata format.

Update and consolidate dev tools and test dependencies.

New nox tasks and minor changes for convenience.

Add, configure, and apply pre-commit for consistent code formatting.

@brl0, I'm all for this. It might be worth waiting for fsspec/community#6 (and https://github.com/fsspec/cookiecutter-fsspec) to be ready, so that we don't have to do this twice.

I would almost say: let's wait until either the template is ready or Python 3.12 is available as a release candidate, and whichever comes first triggers this refactor.

But I have no strong opinion here.

Adhere to semver and add a changelog

We should start giving more guarantees and adhere to semver, as well as keep a changelog.

I think this should be done together with #98

Add type annotations to UPath

We use UPath in a project with mypy typing. Most functions and attributes are inferred properly, even without type annotations in upath. However, we had to create a stub file to teach mypy the subclass relationship with pathlib.Path and to declare the fs and _kwargs attributes (see below). Perhaps it would be easier for other users if upath shipped with this stub file or had type annotations directly in the code. I'd be happy to help with a PR.

./stubs/upath/__init__.py:

from pathlib import Path
from typing import Any, Dict

import fsspec


class UPath(Path):
    fs: fsspec.AbstractFileSystem
    _kwargs: Dict[str, Any]

Weird result with as_uri() on s3 files

The as_uri() method of an S3 path returns a file:// URI containing the percent-encoded original URL. It should probably behave like the .as_posix() method and return the same string as the input, as shown below.

f = UPath(f's3://{bucket}/{key}') 

f.as_uri()==f"file://s3%3A//{bucket}/{key}"
f.as_posix()==f's3://{bucket}/{key}'

I found this because, for some reason, polars doesn't accept UPaths unless the .as_posix() method is used.
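
One plausible explanation for the `%3A` in the output, not a confirmed trace of the UPath internals: if the whole string is treated as a local path and percent-encoded before `file://` is prepended, the colon after the scheme gets escaped. The bucket and key below are placeholders.

```python
from urllib.parse import quote

url = "s3://bucket/key"  # placeholder bucket and key

# Percent-encoding the whole string as if it were a local path escapes the colon,
# while "/" is left alone because it is in quote()'s default safe set.
encoded = "file://" + quote(url)
print(encoded)  # file://s3%3A//bucket/key
```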

MemoryFileSystem strips leading path separator

I am really excited about this project, thanks for sharing!

I ran into a minor issue using UPath to create a MemoryFile: the leading path separator gets stripped.

For example, this should work, but both examples cause FileNotFoundError.

from fsspec.implementations.memory import MemoryFileSystem
from upath import UPath

fs = MemoryFileSystem()
content = b"a,b,c\n1,2,3\n4,5,6"

# write with upath, read with fsspec
p1 = "memory:///tmp/output1.csv"
upath1 = UPath(p1)
upath1.write_bytes(content)
with fs.open(p1) as file:
    assert file.read() == content

# write with fsspec, read with upath
p2 = "memory:///tmp/output2.csv"
with fs.open(p2, "wb") as file:
    file.write(content)
upath2 = UPath(p2)
assert upath2.read_bytes() == content  # content is bytes, so compare with read_bytes()

Looking at fs.store, it is clear that UPath is not handling the path the same way as fsspec. One is written with the leading path separator, the other is not.

{'tmp/output1.csv': <fsspec.implementations.memory.MemoryFile at 0x7f2cbce685e0>,
 '/tmp/output2.csv': <fsspec.implementations.memory.MemoryFile at 0x7f2cbce6e360>}

I should also mention that I tried this on the current release as well as on the master branch.

move to fsspec?

Merely a suggestion, totally up to you, but many fsspec-related repos have moved recently to the fsspec github org, and you would be welcome there too.

UPath incorrectly creates a parent object after calling UPath.parents method

The bug appears when you access the parents property of a path with a URL scheme prefix like gs://.

Way to reproduce

from upath import UPath

path = UPath('gs://my-bucket/dir1/dir2/')  # Create a path object.
parent = path.parents[0]                   # Get any parent. The object is missing _url attribute.
# The next line will throw an error because the object is incorrect.
str(parent)

Traceback

Traceback (most recent call last):
File "C:\Users\korotkovk\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3437, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
str(a)
File "C:\Users\korotkovk\anaconda3\lib\pathlib.py", line 723, in __str__
self._str = self._format_parsed_parts(self._drv, self._root,
File "C:\Users\korotkovk\anaconda3\lib\site-packages\upath\core.py", line 149, in _format_parsed_parts
scheme, netloc = self._url.scheme, self._url.netloc
AttributeError: 'NoneType' object has no attribute 'scheme'

Possible solution

The parents property is implemented in the PurePath class. There, the _PathParents class is used for the parents sequence. Its __getitem__ method does not pass the "required" url parameter to the _from_parsed_parts method:

# Code block from pathlib.
def __getitem__(self, idx):
    if idx < 0 or idx >= len(self):
        raise IndexError(idx)
    return self._pathcls._from_parsed_parts(self._drv, self._root,     # <---- doesn't give the url param, which leads to an error.
                                            self._parts[:-idx - 1])

I suggest overriding the parents property in UPath to fix this error.

add contributing guidelines

SSH protocol does not work

It looks like the ssh protocol is not working at the moment. I have paramiko installed, and using fsspec with it works fine.
The warning does say that the ssh filesystem is not explicitly implemented. What needs to be done to implement it?

Minimal example:

from upath import UPath

url = "ssh://myserver/tmp/somefile.tar"
gpath = UPath(url)

This crashes on my machine with:

/home/a001673/mambaforge/envs/satpy/lib/python3.10/site-packages/upath/registry.py:25: UserWarning: ssh filesystem path not explicitly implemented. falling back to default implementation. This filesystem may not be tested
  warnings.warn(warning_str, UserWarning)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/a001673/mambaforge/envs/satpy/lib/python3.10/site-packages/upath/core.py", line 156, in __new__
    self = cls._from_parts_init(args, init=False)
  File "/home/a001673/mambaforge/envs/satpy/lib/python3.10/site-packages/upath/core.py", line 306, in _from_parts_init
    return super()._from_parts(args, init=init)
TypeError: PurePath._from_parts() got an unexpected keyword argument 'init'

Webdav tests sporadically error on windows

They fail in the webdav_fixture. Rerunning the job usually recovers it.
We need to make some improvements here to be more reliable on windows.

@pytest.fixture
    def webdav_fixture(local_testdir, webdav_server):
        webdav_path, webdav_url = webdav_server
        if os.path.isdir(webdav_path):
            shutil.rmtree(webdav_path, ignore_errors=True)
        try:
>           shutil.copytree(local_testdir, webdav_path)

upath\tests\conftest.py:339: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
C:\hostedtoolcache\windows\Python\3.8.10\x64\lib\shutil.py:557: in copytree
    return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
C:\hostedtoolcache\windows\Python\3.8.10\x64\lib\shutil.py:458: in _copytree
    os.makedirs(dst, exist_ok=dirs_exist_ok)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

name = 'C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\webdav0'
mode = 511, exist_ok = False

    def makedirs(name, mode=0o777, exist_ok=False):
        """makedirs(name [, mode=0o777][, exist_ok=False])
    
        Super-mkdir; create a leaf directory and all intermediate ones.  Works like
        mkdir, except that any intermediate path segment (not just the rightmost)
        will be created if it does not exist. If the target directory already
        exists, raise an OSError if exist_ok is False. Otherwise no exception is
        raised.  This is recursive.
    
        """
        head, tail = path.split(name)
        if not tail:
            head, tail = path.split(head)
        if head and tail and not path.exists(head):
            try:
                makedirs(head, exist_ok=exist_ok)
            except FileExistsError:
                # Defeats race condition when another thread created the path
                pass
            cdir = curdir
            if isinstance(tail, bytes):
                cdir = bytes(curdir, 'ASCII')
            if tail == cdir:           # xxx/newdir/. exists if xxx/newdir exists
                return
        try:
>           mkdir(name, mode)
E           FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\webdav0'

C:\hostedtoolcache\windows\Python\3.8.10\x64\lib\os.py:223: FileExistsError

accept proper URI

I think this project is a great idea! I hope to make use of it to simplify and standardize some file operations, mostly for gcs and s3 storage but keeping flexibility for local storage.

It would be nice if UPath would accept a proper URI, with double slashes rather than a single slash, such as "s3://spacenet-dataset/".

@andrewfulton9, would this be possible to do? I would be happy to assist in any way I can.

UPath does not play with pydantic-settings like it used to

Hi,

I noticed that a pattern I was using before no longer works well. I am using pydantic-settings (version 2.0.0), which is now a separate package, for settings management. Possibly I should create this issue at pydantic-settings instead of here. My apologies if that is the case; I can of course do that as well.

What I observed is the following. It used to be possible to do this:

from dotenv import find_dotenv
from pydantic import BaseConfig
from pydantic_settings import BaseSettings
from upath import UPath


class Settings(BaseSettings):

    example_path: UPath = UPath(__file__).parent.parent / "example_data"

    class Config(BaseConfig):
        env_prefix = "my_env_"
        env_file = find_dotenv(".env")
        env_file_encoding = 'utf-8'

and I created a .env file with:

MY_ENV_EXAMPLE_PATH=/some/other/path

That worked well in the past. The string in the .env file was converted to a UPath object by pydantic.

Now it gives me this error:

ImportError while loading conftest '..../tests/conftest.py'.
tests/conftest.py:6: in <module>
    from bigwig_loader.collection import BigWigCollection
bigwig_loader/__init__.py:15: in <module>
    config = _Settings()
mambaforge/envs/bigwig/lib/python3.11/site-packages/pydantic_settings/main.py:61: in __init__
    super().__init__(
E   pydantic_core._pydantic_core.ValidationError: 1 validation error for Settings
E   example_data_dir
E     Input should be an instance of UPath [type=is_instance_of, input_value=PosixPath('/bla/bla/r.../example_data'), input_type=PosixPath]
E       For further information visit https://errors.pydantic.dev/2.0.1/v/is_instance_of

It seems like pydantic-settings is more strict than it used to be or something changed with the types of universal_pathlib?

universal_pathlib version: 0.0.23
operating system: RHEL 7.9 and macOS

mkdir does not respect exist_ok when parents is set to True

The mkdir API does not respect exist_ok=True when trying to create a directory tree with parents=True. For example:

>>> from upath import UPath
>>> test_dir = UPath("file:///tmp/parent/child")
>>> test_dir.mkdir(parents=True)
>>> test_dir.exists()
True
>>> test_dir.mkdir(exist_ok=True, parents=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rahuliyer/workspace/nbaiju-team/asset-mgmt-pipelines/icloud-cassandra-hubble-stats/.venv/lib/python3.10/site-packages/upath/core.py", line 403, in mkdir
    self._accessor.mkdir(
  File "/Users/rahuliyer/workspace/nbaiju-team/asset-mgmt-pipelines/icloud-cassandra-hubble-stats/.venv/lib/python3.10/site-packages/upath/core.py", line 67, in mkdir
    return self._fs.mkdir(
  File "/Users/rahuliyer/workspace/nbaiju-team/asset-mgmt-pipelines/icloud-cassandra-hubble-stats/.venv/lib/python3.10/site-packages/fsspec/implementations/local.py", line 46, in mkdir
    raise FileExistsError(path)
FileExistsError: /tmp/parent/child

I don't expect this to fail with FileExistsError as exist_ok is set to True
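
For comparison, this is how plain `pathlib.Path.mkdir` behaves on a local temporary directory. It is a sketch of the expected semantics only and does not involve UPath.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "parent" / "child"
    target.mkdir(parents=True)
    # A second call with exist_ok=True is expected to be a no-op, not an error.
    target.mkdir(parents=True, exist_ok=True)
    still_there = target.is_dir()

    # Without exist_ok, pathlib raises FileExistsError for an existing directory.
    try:
        target.mkdir(parents=True)
        raised = False
    except FileExistsError:
        raised = True

print(still_there, raised)  # True True
```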

The prefix file:// results in ignoring the first path component.

I'm seeing this issue on both macOS and Linux.

PATHS = ['/var/tmp', 'file:/var/tmp', 'file://var/tmp', 'file:///var/tmp']
{p:list(upath.UPath(p).iterdir()) for p in PATHS}

The variant "file://var/tmp" gives me the contents of "/tmp". The other variants work as expected.
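
A likely explanation, assuming the string goes through standard URL splitting: with two slashes, the first component is parsed as the network location (host) rather than as part of the path. The sketch below uses only `urllib.parse` and says nothing about how UPath itself handles the result.

```python
from urllib.parse import urlsplit

PATHS = ["/var/tmp", "file:/var/tmp", "file://var/tmp", "file:///var/tmp"]

# Map each input to its (netloc, path) pair as seen by the URL parser.
parsed = {p: (urlsplit(p).netloc, urlsplit(p).path) for p in PATHS}
for p, (netloc, path) in parsed.items():
    print(f"{p!r}: netloc={netloc!r} path={path!r}")
```

Only `file://var/tmp` yields a non-empty netloc (`var`) with the path `/tmp`; the other three variants all resolve to the path `/var/tmp`.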

Should `UPath.__new__` return `pathlib.Path` instances for local paths?

Summarizing from #84

When creating a UPath instance universal_pathlib returns a pathlib.Path instance for local filesystems, which can cause problems with type checkers:

from upath import UPath     # UPath is a subclass of pathlib.Path

pth = UPath("/local/path")  # returns e.g. a pathlib.PosixPath instance on non-windows
reveal_type(pth)            # Revealed type is "upath.core.UPath"

Possible solutions:

(1) user explicitly type annotates

No changes. We could just document it very explicitly.

pth: pathlib.Path = UPath(...)

(2) always return a UPath instance

I think the reason for returning a pathlib.Path instance for local filesystems is to guarantee that there are no changes in behavior when using UPath as a replacement.

(2A) create an intermediate subclass

We could still guarantee pathlib behavior by creating an intermediate UPath class

class UPath(pathlib.Path):
    def __new__(cls, *args, **kwargs):
        if ...:                         # is local path
            return cls(*args, **kwargs)
        else:                           # promote to a fsspec backed class
            return FSSpecUPath(*args, **kwargs)
    
    # > we could define additional properties here too `.fs`, etc...

class FSSpecUPath(UPath):
    ...  # provides all the methods current UPath provides

(2B) we always return the fsspec LocalFileSystem backed UPath instance for local paths

This would be a simple change, but I think we should then port the CPython pathlib test suite to universal_pathlib, to guarantee as best we can that the behavior is identical. I'm worried that symlink support, the Windows UNC path handling, and other weird edge cases will probably cause quite a few issues.

(3) apply some magic

We could provide UPath as an alias for the UPath class but cast it to type[pathlib.Path] to trick the static type checkers. We could also fall back to some metaclass magic to make runtime type checking work, but this would basically (partially?) revert the cleanup done here: #56

FileNotFoundError in write_bytes to Google Storage

Hey! I am trying to write a pickled object to google storage and I am getting FileNotFoundError. Here is a minimal script reproducing the problem:

from upath import UPath
import pickle

x = [1,2,3]
path = UPath("gs://some-bucket/x.pkl")

path.write_bytes(pickle.dumps(x))

Output:

Traceback (most recent call last):
File "test.py", line 8, in <module>
path.write_bytes(pickle.dumps(x))
File "/home/conda/environments/aes_pinned/lib/python3.8/pathlib.py", line 1246, in write_bytes
return f.write(view)
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/fsspec/spec.py", line 1602, in __exit__
self.close()
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/fsspec/spec.py", line 1569, in close
self.flush(force=True)
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/fsspec/spec.py", line 1435, in flush
self._initiate_upload()
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/gcsfs/core.py", line 1236, in _initiate_upload
self.location = sync(
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/fsspec/asyn.py", line 69, in sync
raise result[0]
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
result[0] = await coro
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/gcsfs/core.py", line 1324, in initiate_upload
headers, _ = await fs._call(
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/gcsfs/core.py", line 340, in _call
status, headers, info, contents = await self._request(
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/decorator.py", line 221, in fun
return await caller(func, *(extras + args), **kw)
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/gcsfs/retry.py", line 110, in retry_request
return await func(*args, **kwargs)
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/gcsfs/core.py", line 332, in _request
validate_response(status, contents, path)
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/gcsfs/retry.py", line 89, in validate_response
raise FileNotFoundError

Interestingly, DataFrame.to_csv works for objects in that bucket, so I don't think it's a matter of permissions.

Copy a UPath object

I would expect that passing a UPath object to the UPath constructor creates a copy of that object. Currently, that works only for the path, but not for any additional kwargs.

a = UPath("s3://bucket/folder", key="...", secret="...")
b = UPath(a)

assert b._kwargs["key"] == "..."
# KeyError: 'key'
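
The expected copy behaviour can be sketched with a hypothetical stand-in class. `ToyPath` is not UPath and only illustrates one way the kwargs could be carried over; letting explicit kwargs override the inherited ones is an assumption made here, not something the library documents.

```python
class ToyPath:
    """Hypothetical stand-in for a path class that carries storage options."""

    def __init__(self, path, **kwargs):
        if isinstance(path, ToyPath):
            # Inherit the source object's options; explicit kwargs take precedence.
            kwargs = {**path._kwargs, **kwargs}
            path = path.path
        self.path = str(path)
        self._kwargs = kwargs


a = ToyPath("s3://bucket/folder", key="...", secret="...")
b = ToyPath(a)               # plain copy keeps key and secret
c = ToyPath(a, key="other")  # copy with one option overridden

print(b.path, b._kwargs["key"], c._kwargs["key"])
```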

UPath broken in Python 3.11

In Python 3.11 (currently slated for release in October 2022), pathlib accessors have been removed (see https://bugs.python.org/issue43012), resulting in UPath breaking.

For example, with an s3 remote mocked with moto (pip install moto[server])

$ moto_server s3                                                                                             
 * Running on http://127.0.0.1:5000 (Press CTRL+C to quit)

from upath import UPath
path = UPath("s3://bucketname")
path.mkdir()

FileNotFoundError                         Traceback (most recent call last)

----> 1 u.mkdir()

File /usr/lib/python3.10/pathlib.py:1175, in Path.mkdir(self, mode, parents, exist_ok)
   1171 """
   1172 Create a new directory at this given path.
   1173 """
   1174 try:
-> 1175     self._accessor.mkdir(self, mode)
   1176 except FileNotFoundError:
   1177     if not parents or self.parent == self:

FileNotFoundError: [Errno 2] No such file or directory: 's3://bucketname/'

or, trying to write something to a file:

f = path / "file"
f.write_text("ciao")

Traceback (most recent call last)
----> 1 f.write_text("ciao")

File /usr/lib/python3.10/pathlib.py:1154, in Path.write_text(self, data, encoding, errors, newline)
   1151     raise TypeError('data must be str, not %s' %
   1152                     data.__class__.__name__)
   1153 encoding = io.text_encoding(encoding)
-> 1154 with self.open(mode='w', encoding=encoding, errors=errors, newline=newline) as f:
   1155     return f.write(data)

File universal_pathlib/upath/core.py:168, in UPath.open(self, *args, **kwargs)
    167 def open(self, *args, **kwargs):
--> 168     return self.fs.open(self, *args, **kwargs)

File universal_pathlib/upath/core.py:122, in UPath.__getattr__(self, item)
    120     return _accessor
    121 else:
--> 122     raise AttributeError(item)

AttributeError: fs

UPath with_name, with_stem, with_suffix is broken

>>> from upath import UPath
>>> UPath("s3://some/file.txt")
S3Path('s3://some/file.txt')
>>> UPath("s3://some/file.txt").with_suffix(".zip")
Traceback (most recent call last):
  File "/Users/poehlmann/Environments/envs/pasu/lib/python3.8/pathlib.py", line 721, in __str__
    return self._str
  File "/Users/poehlmann/Environments/envs/pasu/lib/python3.8/site-packages/upath/core.py", line 125, in __getattr__
    raise AttributeError(item)
AttributeError: _str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/poehlmann/Environments/envs/pasu/lib/python3.8/pathlib.py", line 742, in __repr__
    return "{}({!r})".format(self.__class__.__name__, self.as_posix())
  File "/Users/poehlmann/Environments/envs/pasu/lib/python3.8/pathlib.py", line 734, in as_posix
    return str(self).replace(f.sep, '/')
  File "/Users/poehlmann/Environments/envs/pasu/lib/python3.8/pathlib.py", line 723, in __str__
    self._str = self._format_parsed_parts(self._drv, self._root,
  File "/Users/poehlmann/Environments/envs/pasu/lib/python3.8/site-packages/upath/core.py", line 153, in _format_parsed_parts
    scheme, netloc = self._url.scheme, self._url.netloc
AttributeError: 'NoneType' object has no attribute 'scheme'
>>>

Fixing this requires overriding those methods with ones based on the pathlib.Path.with_* (python=3.10, probably) versions.

URI query component is ignored when opening a file

When opening a resource using universal pathlib, the query component of the URI is dropped, resulting in the wrong resource being opened.

For example, running the following:

>>> UPath("https://duckduckgo.com/?q=upath").read_text()

makes a GET request for https://duckduckgo.com/ without the query parameter, giving unexpected output.
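
The query is a separate URL component, so any code that rebuilds the URL from scheme, host, and path alone will lose it. The sketch below shows this with `urllib.parse` only; whether UPath drops the query at this exact step is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit

url = "https://duckduckgo.com/?q=upath"
parts = urlsplit(url)
print(parts.path, parts.query)  # / q=upath

# Rebuilding from scheme, netloc, and path alone loses the query.
without_query = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
# Passing every component back through keeps the original URL intact.
round_trip = urlunsplit(parts)

print(without_query)  # https://duckduckgo.com/
print(round_trip)     # https://duckduckgo.com/?q=upath
```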

Accessing the underlying filesystem

As I understand, right now it can be done with

path._accessor._fs

This is not very obvious; perhaps a path.fs property would be more convenient?

Should the `UPath` public class API be identical to the public `pathlib.Path` class API

Summarizing from #84

To use universal_pathlib seamlessly with other 3rd party libraries, we need to provide access to the underlying fsspec instances.

(1) Right now UPath provides UPath(...).path and UPath(...).fs to access the fsspec path string and the fsspec FileSystem instance. (We should document this prominently in the README...)

(2) Another option would be to provide this access via functions we would have to provide in universal_pathlib.

The question is: should access to those two be provided as class properties / methods on the UPath class, or should the UPath class have an identical public API to the pathlib.Path class?

I personally think that get_filesystem, get_urlpath, get_openfile, get_storage_options, etc will be simpler to maintain and extend in the future. With Python3.12 we probably want to refactor the entire UPath implementation again, since a bunch of important pathlib changes landed in cpython that should allow easier subclassing of pathlib.Path. I would add deprecation warnings to the two properties (UPath.fs, UPath.path) and keep them around for 6-12 months and recommend the functions in the warning.

But my opinion on that is not set in stone.

URI paths are incorrectly parsed as posix paths

Thanks for this awesome library!

I've noticed that when creating a UPath for a URL ending in a slash, the trailing slash gets stripped:

>>> UPath("http://www.example.com/page/")
HTTPPath('http://www.example.com/page')

I don't think this is desirable behaviour for the following reasons:

This can cause unexpected behaviour, as a web server could return different resources in response to these two URLs
Stripping the slash can often result in the need to follow an additional redirect (adding a performance cost), as URLs without trailing slashes are often redirected to URLs with a slash
It can lead to issues when resolving relative URLs

The way I'd expect this to work would be for HTTPPaths to leave the original URL unchanged, but for HTTPPath.resolve() (currently not implemented) to follow any redirects and return the new resulting URL.
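
The stripping is consistent with POSIX path normalization being applied to the URL path. This is a sketch using only the standard library, on the assumption that UPath inherits this behaviour from pathlib.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

url = "http://www.example.com/page/"

# POSIX path normalization drops the trailing slash from the path component.
posix_path = str(PurePosixPath(urlsplit(url).path))
# URL parsing on its own keeps the path exactly as written.
url_path = urlsplit(url).path

print(posix_path)  # /page
print(url_path)    # /page/
```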

create constructor: UPath.from_fs

Constructing an fsspec filesystem from the UPath arguments can get tricky. As an alternative, a UPath.from_fs constructor could bypass interpreting the arguments and leave building the filesystem to the user.

Creating new path objects of same type

I am trying to write code that is somewhat agnostic to the type of path or string object passed. The basic pattern I would like to use is to take the input path, process as needed, then create new output objects of the same type as the input object. It is safe to assume that the output will be on the same filesystem or bucket as the input.

So with pathlib we can do this:

from pathlib import Path

p = Path(".")
s = str(p)
type(p)(s)  # returns PosixPath or WindowsPath object

But this pattern does not work with UPath objects:

from upath import UPath

u = UPath("s3://test")
s = str(upath)
type(u)(s)  # fails with TypeError

The error message is:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-34-1d3d93c45a9b> in <module>
      3 u = UPath("s3://test")
      4 s = str(upath)
----> 5 type(u)(s)

/home/conda/store/9c828ea9bf6dc65c4cd4f7331f255b67f76ac3534f6383157d0ce0ddc5bdb8e2-datum38/lib/python3.8/pathlib.py in __new__(cls, *args, **kwargs)
   1039         if cls is Path:
   1040             cls = WindowsPath if os.name == 'nt' else PosixPath
-> 1041         self = cls._from_parts(args, init=False)
   1042         if not self._flavour.is_supported:
   1043             raise NotImplementedError("cannot instantiate %r on your system"

TypeError: _from_parts() missing 1 required positional argument: 'args'

PR #32 was created as a step in that direction. It consolidates UniversalPath into UPath, so now all implementations are subclasses of UPath. After those changes, at least for generic UPath objects, something similar to the example above should work (although probably not for s3, since that has its own implementation). Additional changes will likely need to be made to the various implementations to get those to work in a similar manner.

@andrewfulton9, I wanted to get your thoughts on this effort before I go much further. Let me know if you have any questions or see any issues with the approach I am taking in the PR.

UPath instance check

I think it would be nice if something like this worked:

from upath import UPath

path = UPath(".")
assert isinstance(path, UPath)  # this currently returns false

This issue occurs for local and remote paths, so for pathlib and UniversalPath objects.

Thanks to UPath and various implementations inheriting from pathlib.Path, here is one basic approach that seems to allow this to work:

from abc import ABCMeta
from pathlib import Path


class UPathMeta(ABCMeta):
    def __instancecheck__(self, instance):
        return isinstance(instance, Path)

    def __subclasscheck__(self, subclass):
        return issubclass(subclass, Path)

class UPath(Path, metaclass=UPathMeta):
    ...

I am working to submit a PR for #25 and #26, so I am planning to also include this change, but can take it out if undesired or preferred in a separate PR.

Inconsistency with Path

These two examples should probably behave identically, but they don't:

Path("a", "b"), Path("a") / "b", UPath("a", "b"), and UPath("a") / "b" produce the same result: "a/b" (as expected).

However

UPath("s3://a", "b") results in "s3://ab" (no separator), which is probably a bug.

`joinpath` does not update url

There is a bug in joinpath which leaves the URL not updated.

Please see below:

p1 = UPath("s3://a/b").joinpath("c")
p2 = UPath("s3://a/b") / "c"

p1._url # the path is incorrect: missing "c"
>>>  SplitResult(scheme='s3', netloc='a', path='/b', query='', fragment='')

p2._url
>>>  SplitResult(scheme='s3', netloc='a', path='/b/c', query='', fragment='')
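One way to rule out this kind of divergence is to route both joinpath and the / operator through a single child-building method. The sketch below assumes a made-up TinyURLPath class; the attribute name _url mirrors the issue, but nothing here is taken from the upath source.

```python
import posixpath
from urllib.parse import urlsplit


class TinyURLPath:
    """Hypothetical URL-backed path used only to illustrate the idea."""

    def __init__(self, url):
        # Accept either a URL string or an already parsed SplitResult.
        self._url = urlsplit(url) if isinstance(url, str) else url

    def _make_child(self, *segments):
        new_path = posixpath.join(self._url.path or "/", *segments)
        return TinyURLPath(self._url._replace(path=new_path))

    def __truediv__(self, segment):
        return self._make_child(segment)

    def joinpath(self, *segments):
        return self._make_child(*segments)

    def __str__(self):
        return self._url.geturl()


p1 = TinyURLPath("s3://a/b").joinpath("c")
p2 = TinyURLPath("s3://a/b") / "c"
print(p1._url == p2._url, str(p1))
```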

Use `urllib.parse.urljoin` when joining paths

Hello!

Should UPath._make_child replicate the behaviour of pathlib.PurePath._make_child, as it does currently, or should it behave like urllib.parse.urljoin?

>>> UPath("http://example.com/a") / "b/c"
HTTPPath('http://example.com/a/b/c')

>>> UPath("http://example.com/a/") / "b/c"
HTTPPath('http://example.com/a//b/c')  # I think this one is a bug...

>>> urljoin("http://example.com/a", "b/c")
'http://example.com/b/c'

>>> urljoin("http://example.com/a/", "b/c")
'http://example.com/a/b/c'

Personally I would expect it to behave like urljoin.

Thoughts?
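For comparison, a middle ground between the two behaviours is to always treat the base as a directory before calling urljoin. The helper name join_as_directory is hypothetical and only meant to make the trade-off visible.

```python
from urllib.parse import urljoin


def join_as_directory(base, child):
    """Hypothetical helper: join like pathlib, but resolve with urljoin."""
    # Normalise the base to end in exactly one "/" so that urljoin appends
    # to the last segment instead of replacing it.
    return urljoin(base.rstrip("/") + "/", child)


print(join_as_directory("http://example.com/a", "b/c"))
print(join_as_directory("http://example.com/a/", "b/c"))
```

With this variant both spellings of the base give the same result and the double slash disappears, while a child that starts with "/" is still resolved against the host root, as urljoin always does.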

s3 prefix with a + character causes incorrect path when iterating

If you have an object in s3 with a + character in the prefix, iterdir creates incorrect paths.

from upath import UPath
for in_path in UPath("s3://my_bucket/manual__2022-02-19T14:31:25.891270+00:00/dir").iterdir():
    print(in_path)

The actual output is:
s3://my_bucket/manual__2022-02-19T14:31:25.891270+00:00/dir/my_bucket/manual__2022-02-19T14:31:25.891270+00:00/dir/file1

The expected output is:
s3://my_bucket/manual__2022-02-19T14:31:25.891270+00:00/dir/file1
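The duplicated prefix in the actual output would be consistent with a prefix-stripping step that silently fails to match. One plausible cause (an assumption on my part, not something confirmed in this issue) is the prefix being interpolated into a regular expression without escaping, so that "+" acts as a quantifier. The sketch below reproduces that effect with plain re calls; the variable names are made up.

```python
import re

prefix = "my_bucket/manual__2022-02-19T14:31:25.891270+00:00/dir"
listed = prefix + "/file1"  # a key as a listing call might return it

# Unescaped: "+" is read as a regex quantifier, the pattern does not match,
# and nothing is stripped from the listed key.
unescaped = re.sub(f"^{prefix}/", "", listed)

# Escaped: the prefix is matched literally and stripped.
escaped = re.sub(f"^{re.escape(prefix)}/", "", listed)

print(unescaped == listed)
print(escaped)
```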

Path loses leading `//` after `joinpath` operation

I can do this:

from upath import UPath
import pandas as pd

pd.read_csv(str(UPath('gs://bucket/filename.csv')))

but not this:

up = UPath('gs://') 
pd.read_csv(str(up.joinpath('bucket', 'colombia/filename.csv')))

This is because

str(UPath('gs://bucket/filename.csv'))

gives 'gs://bucket/filename.csv' which maintains the leading //
while

str(UPath('gs://').joinpath('bucket', 'filename.csv'))

gives 'gs:/bucket/filename.csv' which is invalid.

This is important because I need to be able to programmatically construct the path separate from the bucket. The idea behind this tool is to keep me from using f strings, but it appears that is my only option here.
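Until joinpath keeps the //, one f-string-free stopgap is to assemble the URL with urllib.parse.urlunsplit. The helper build_url is hypothetical. The last line also shows that plain pathlib parsing collapses the double slash in the same way, which may be related to what happens here (that link is my guess, not something I verified in the upath code).

```python
from pathlib import PurePosixPath
from urllib.parse import urlunsplit


def build_url(scheme, bucket, *key_parts):
    """Hypothetical helper: build scheme://bucket/key from separate pieces."""
    key = "/".join(part.strip("/") for part in key_parts)
    return urlunsplit((scheme, bucket, "/" + key, "", ""))


print(build_url("gs", "bucket", "colombia/filename.csv"))

# Plain pathlib treats "//" as a redundant separator and drops one slash.
print(str(PurePosixPath("gs://bucket/filename.csv")))
```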

fsspec URL chaining

This is not urgent, but based on my very limited and quick testing, it seems that fsspec URL chaining may not work properly.
It is possible I was not doing it properly, or there may be some clever way to work around the limitation, but I thought it was worth raising an issue for now.
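For context, fsspec chains URLs with a "::" separator. The sketch below uses a made-up chained URL and only standard-library calls; it shows why a parser built around a single urlsplit call would see just the first link of the chain. Whether that is the actual limitation in UPath is an assumption.

```python
from urllib.parse import urlsplit

chained = "zip://data.csv::s3://my-bucket/archive.zip"  # made-up example URL

# Each link of the chain has its own protocol.
links = chained.split("::")
protocols = [link.split("://", 1)[0] for link in links]
print(links)
print(protocols)

# A single urlsplit call only reports the first scheme.
first_scheme = urlsplit(chained).scheme
print(first_scheme)
```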

FSSpecFlavour implementation for URI handling

Starting new issue as requested based on comment in issue #26.

after digging deeper into the code, and exploring the pathlib implementation, I am wondering if UPath would benefit from having something like a _URI_Flavour derived from the pathlib._Flavour class. This might make it easier to have consistent path handling across the fsspec based implementations.

The base class: pathlib._Flavour

The posix implementation: pathlib._PosixFlavour

Here is a list of the members of the class for an idea of what may need to be implemented:

pathlib._PosixFlavour

altsep
casefold
casefold_parts
compile_pattern
gethomedir
has_drv
is_reserved
is_supported
join_parsed_parts
make_uri
parse_parts
pathmod
resolve
sep
splitroot

Of course it probably makes sense to base as much as possible on fsspec functionality, especially considering its ability for url chaining.
Here are some possibly related functions from fsspec.core:

_un_chain
url_to_fs
split_protocol
strip_protocol
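As a rough idea of what a URI-aware flavour could look like, here is a sketch of a splitroot that treats scheme://netloc as the "drive". The class URIFlavourSketch is hypothetical, its splitroot signature is simplified compared to the private pathlib one, and it deliberately does not subclass pathlib._Flavour.

```python
from urllib.parse import urlsplit


class URIFlavourSketch:
    """Hypothetical flavour-like object; covers only a few of the members."""

    sep = "/"
    altsep = None
    has_drv = True

    def splitroot(self, part):
        """Return (drive, root, rest), using scheme://netloc as the drive."""
        split = urlsplit(part)
        if split.scheme and split.netloc:
            drv = f"{split.scheme}://{split.netloc}"
            root = self.sep if split.path.startswith(self.sep) else ""
            return drv, root, split.path.lstrip(self.sep)
        # No scheme and netloc: fall back to POSIX-like handling.
        root = self.sep if part.startswith(self.sep) else ""
        return "", root, part.lstrip(self.sep)


flavour = URIFlavourSketch()
print(flavour.splitroot("s3://bucket/a/b"))
print(flavour.splitroot("/a/b"))
```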

UPath constructor does not take UPath object

This isn't really a big problem, but below is an issue I ran into, and I am hoping a little fix might simplify my usage.

from pathlib import Path

from upath import UPath

Path(Path("."))  # this works
UPath(UPath("."))  # this does not work

AttributeError: 'PosixPath' object has no attribute 'decode'

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-c788cff7e39c> in <module>
----> 1 UPath(UPath("."))

~/anaconda3/envs/py38/lib/python3.8/site-packages/upath/core.py in __new__(cls, *args, **kwargs)
     13             args_list = list(args)
     14             url = args_list.pop(0)
---> 15             parsed_url = urllib.parse.urlparse(url)
     16             for key in ["scheme", "netloc"]:
     17                 val = kwargs.get(key)

~/anaconda3/envs/py38/lib/python3.8/urllib/parse.py in urlparse(url, scheme, allow_fragments)
    370     Note that we don't break the components up in smaller bits
    371     (e.g. netloc is a single string) and we don't expand % escapes."""
--> 372     url, scheme, _coerce_result = _coerce_args(url, scheme)
    373     splitresult = urlsplit(url, scheme, allow_fragments)
    374     scheme, netloc, url, query, fragment = splitresult

~/anaconda3/envs/py38/lib/python3.8/urllib/parse.py in _coerce_args(*args)
    122     if str_input:
    123         return args + (_noop,)
--> 124     return _decode_args(args) + (_encode_result,)
    125
    126 # Result objects are more helpful than simple tuples

~/anaconda3/envs/py38/lib/python3.8/urllib/parse.py in _decode_args(args, encoding, errors)
    106 def _decode_args(args, encoding=_implicit_encoding,
    107                        errors=_implicit_errors):
--> 108     return tuple(x.decode(encoding, errors) if x else '' for x in args)
    109
    110 def _coerce_args(*args):

~/anaconda3/envs/py38/lib/python3.8/urllib/parse.py in <genexpr>(.0)
    106 def _decode_args(args, encoding=_implicit_encoding,
    107                        errors=_implicit_errors):
--> 108     return tuple(x.decode(encoding, errors) if x else '' for x in args)
    109
    110 def _coerce_args(*args):

AttributeError: 'PosixPath' object has no attribute 'decode'

Due to these lines:
https://github.com/Quansight/universal_pathlib/blob/dbc57f7ec4c8bb8b2e22975f6b62e25b2b82ecff/upath/core.py#L14-L15

fsspec has a stringify_path utility function that might be appropriate:

fsspec.utils.stringify_path

# Copied from fsspec.utils for reference.
import pathlib

def stringify_path(filepath):
    """Attempt to convert a path-like object to a string.

    Parameters
    ----------
    filepath: object to be converted

    Returns
    -------
    filepath_str: maybe a string version of the object

    Notes
    -----
    Objects supporting the fspath protocol (Python 3.6+) are coerced
    according to its __fspath__ method.

    For backwards compatibility with older Python version, pathlib.Path
    objects are specially coerced.

    Any other object is passed through unchanged, which includes bytes,
    strings, buffers, or anything else that's not even path-like.
    """
    if isinstance(filepath, str):
        return filepath
    elif hasattr(filepath, "__fspath__"):
        return filepath.__fspath__()
    elif isinstance(filepath, pathlib.Path):
        return str(filepath)
    elif hasattr(filepath, "path"):
        return filepath.path
    else:
        return filepath

I think this would fix it:

from fsspec.utils import stringify_path
...
            url = stringify_path(args_list.pop(0))
            parsed_url = urllib.parse.urlparse(url)
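The same idea can be tried without fsspec by using os.fspath from the standard library. The function parse_first_arg below is a hypothetical stand-in for the two constructor lines, not the actual upath code, and os.fspath is swapped in for stringify_path purely to keep the sketch dependency-free.

```python
import os
from pathlib import Path
from urllib.parse import urlparse


def parse_first_arg(arg):
    """Hypothetical stand-in for the constructor's URL parsing step."""
    # Coerce path-like objects (anything with __fspath__) to str first,
    # because urlparse expects str or bytes.
    if hasattr(arg, "__fspath__"):
        arg = os.fspath(arg)
    return urlparse(arg)


print(parse_first_arg(Path(".")))
print(parse_first_arg("s3://bucket/key"))
```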

Happy to submit a small PR and test if you'd like.

Also, thanks for the awesome project!

Heads up: pathlib._PathBase may be coming soon

I've logged a CPython PR that adds a private pathlib._VirtualPath class:

python/cpython#106337

If/when that lands, it's not much more work to drop the underscore and introduce pathlib.VirtualPath to the world. This would be Python 3.13 at the earliest.

Would it be suitable for use in universal_pathlib? It would be great to hear your feedback, thank you.

Add default implementations for is_symlink etc.

Several pathlib.Path methods are marked as not implemented, including is_symlink. See https://github.com/fsspec/universal_pathlib/blob/main/upath/core.py#L112-L117
Calling these methods raises a NotImplementedError. However, in most cases (especially for object stores, which don't support symlinks) the expected behavior of these is_* methods would be to return False.

I was wondering if it would be possible to add default implementations for them that return False in the UPath class. Some subclasses could even choose to implement specific logic.
Again, happy to help with a PR!
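A minimal sketch of what I have in mind, using hypothetical classes rather than the real UPath. Apart from is_symlink, the method names are simply other pathlib.Path predicates picked for illustration.

```python
class RemotePathSketch:
    """Hypothetical base class: predicates default to False."""

    def is_symlink(self):
        return False

    def is_socket(self):
        return False

    def is_fifo(self):
        return False


class LinkAwarePathSketch(RemotePathSketch):
    """Hypothetical subclass for a backend that does know about links."""

    def __init__(self, is_link):
        self._is_link = is_link

    def is_symlink(self):
        return self._is_link


print(RemotePathSketch().is_symlink())
print(LinkAwarePathSketch(is_link=True).is_symlink())
```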

Refactor tests

from #96

Refactor tests and fixtures for improved isolation.
Make it easier and safer to allow for tests to be run live.
Reduce inadvertent test interaction and flakiness.
Eventually try to use UPath's great test suite directly from other fsspec filesystem projects to improve coverage.

@brl0 that would be great. Maybe we could coordinate with fsspec/filesystem_spec#651 ?

Did you run into specific issues when running the current test suite?
Should we add some test order randomization to reveal issues regarding isolation?

Async support

It would be great to have async support like aiopath has.
Currently one needs to fall back to filesystem-specific libraries to write async code (as some of them support asynchronous=True).
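As a generic stopgap, blocking path calls can be pushed to a worker thread with asyncio.to_thread (Python 3.9+). This is thread offloading, not the native async I/O that aiopath or asynchronous=True provide, and pathlib.Path stands in for UPath here so the sketch runs without any backend.

```python
import asyncio
import tempfile
from pathlib import Path


async def read_text_async(path):
    # Run the blocking read in a worker thread so the event loop stays free.
    return await asyncio.to_thread(path.read_text)


async def main():
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "hello.txt"
        target.write_text("hello")
        return await read_text_async(target)


result = asyncio.run(main())
print(result)
```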

Handling of absolute file:// paths

If I use UPath with a file:// URL that points to an absolute path, such as

path = UPath("file:///Users/myname/Documents/foo")
print(path)

then it seems to change the triple slash /// into just a single slash and print "file:/Users/myname/Documents/foo"

If there are only two slashes, it's fine: it handles the prefix as a scheme, like gs, s3, etc. But absolute file paths often start with a leading slash as well.
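The standard library shows the difference between the two parsing styles. urlsplit keeps the empty authority and round-trips the triple slash, while pathlib-style parsing drops the repeated slashes and gives the same string I see from UPath. That UPath falls back to pathlib parsing for this case is my assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

url = "file:///Users/myname/Documents/foo"

# URL parsing: empty netloc, absolute path, lossless round trip.
split = urlsplit(url)
print(split.scheme, repr(split.netloc), split.path)
print(split.geturl())

# pathlib parsing: repeated slashes are treated as redundant separators.
as_posix_path = str(PurePosixPath(url))
print(as_posix_path)
```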

UPath isn't serializing correctly

Currently hashing is inherited directly from pathlib.PurePath. However, that implementation only accounts for the path parts. UPath needs to be updated to also include any other attributes a filesystem path might have, such as scheme and bucket.

Example

gpath = UPath('github://fsspec:universal_pathlib@main/')
import pickle

with open('gpath.pickle', 'wb') as f:
    pickle.dump(gpath, f)

with open('gpath.pickle', 'rb') as f:
    gpath2 = pickle.load(f)

gpath2.path

raises

Truncated Traceback (Use C-c C-$ to view full TB):
/tmp/ipykernel_18499/459951566.py in <module>
      8 
      9 
---> 10 gpath2.path

AttributeError: 'PosixPath' object has no attribute 'path'
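One possible direction is a custom __reduce__ that rebuilds the object from its full URL and options instead of from path parts. The sketch below uses a hypothetical URLPathSketch class and a made-up s3 URL; it is not the upath implementation. It exercises __reduce__ through copy.copy, which uses the same reduce protocol that pickle relies on.

```python
import copy
from urllib.parse import urlsplit


class URLPathSketch:
    """Hypothetical path-like object that carries a parsed URL."""

    def __init__(self, url, **storage_options):
        self._url = urlsplit(url)
        self._storage_options = storage_options

    @property
    def path(self):
        return self._url.path

    def __reduce__(self):
        # Rebuild from the full URL plus options, not from path parts alone.
        return _rebuild, (self._url.geturl(), self._storage_options)


def _rebuild(url, storage_options):
    return URLPathSketch(url, **storage_options)


original = URLPathSketch("s3://bucket/key.txt", anon=True)
clone = copy.copy(original)  # goes through __reduce__, like pickle does
print(type(clone).__name__, clone.path)
```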

Add ability to register custom UPath implementations

It appears that the only way currently to register a custom implementation and avoid falling back on upath.core is to import and append to upath.registry._register.known_implementations. It would be helpful if there were a top-level registration function, similar to fsspec.register_implementation(name, filesystem).
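Something along these lines would cover my use case. This is a sketch of a possible API shape only: _registry, register_implementation, get_implementation and MyProtoPath are all hypothetical names, and the clobber flag is borrowed from the fsspec function as an idea.

```python
_registry = {}  # hypothetical module-level mapping: protocol -> path class


def register_implementation(protocol, cls, clobber=False):
    """Register cls for protocol; refuse to overwrite unless clobber=True."""
    if protocol in _registry and not clobber:
        raise ValueError(f"{protocol!r} is already registered")
    _registry[protocol] = cls


def get_implementation(protocol, default=None):
    """Look up the class for protocol, or return default if unknown."""
    return _registry.get(protocol, default)


class MyProtoPath:  # placeholder for a custom UPath subclass
    pass


register_implementation("myproto", MyProtoPath)
print(get_implementation("myproto").__name__)
```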

Support for data URI scheme?

Would you be interested in supporting data URIs in universal_pathlib?

A data URI consists of a scheme and a path, with no authority part, query string, or fragment. The optional media type, the optional base64 indicator, and the data are all parts of the URI path.

Most methods would be unimplemented, as the "path" is not a real hierarchical path 😃

I think the only useful method would be open, which would return the data encoded in the path.
So a DataPath wouldn't require fsspec.
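To show how little is needed, here is a rough sketch of the open logic using only the standard library. open_data_uri is a hypothetical function name, and the media type is parsed but otherwise ignored.

```python
import base64
import io
from urllib.parse import unquote_to_bytes


def open_data_uri(uri):
    """Hypothetical helper: return a binary file object for a data URI."""
    scheme, _, rest = uri.partition(":")
    if scheme != "data":
        raise ValueError("not a data URI")
    header, comma, data = rest.partition(",")
    if not comma:
        raise ValueError("data URI is missing the ',' separator")
    if header.endswith(";base64"):
        payload = base64.b64decode(data)
    else:
        # Without the base64 indicator the data is percent-encoded.
        payload = unquote_to_bytes(data)
    return io.BytesIO(payload)


print(open_data_uri("data:text/plain;base64,SGVsbG8sIFdvcmxkIQ==").read())
print(open_data_uri("data:,Hello%2C%20World%21").read())
```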