fsspec / universal_pathlib
pathlib api extended to use fsspec backends
License: MIT License
Thanks for creating this package, it's a great addition to the fsspec project.
I'd like to see support for adlfs in addition to the current supported options (S3, GCS, https, and hdfs). Is anyone working on this or are there any known blockers to doing this?
Not sure if I am doing this properly, but subclassing UPath is challenging, in part due to this check in the __new__ method:
I am curious if this could be changed to:
if issubclass(cls, UPath):
This would return True even for issubclass(UPath, UPath).
Happy to submit a PR if you are ok with this change.
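To illustrate the proposal, here is a minimal sketch of how issubclass() treats a class compared with itself. Base and Child are hypothetical stand-ins, not the real UPath classes, and the identity check is only an assumed contrast, since the current check is not reproduced above.

```python
class Base:
    """Hypothetical stand-in for UPath; not the real class."""


class Child(Base):
    """Hypothetical user-defined subclass."""


def is_exactly_base(cls):
    # Identity check: True only for Base itself (assumed contrast case).
    return cls is Base


def is_base_or_subclass(cls):
    # issubclass() is also True when cls is Base itself.
    return issubclass(cls, Base)


print(is_exactly_base(Base), is_exactly_base(Child))
print(is_base_or_subclass(Base), is_base_or_subclass(Child))
```

With a check of the second kind, a subclass would take the same code path as the base class.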
My surprisingly challenging attempt to subclass UPath to add some convenience methods has really grown my appreciation for what this project has accomplished.
On a related note, after digging deeper into the code, and exploring the pathlib implementation, I am wondering if UPath would benefit from having something like a _URI_Flavour derived from the pathlib._Flavour class. This might make it easier to have consistent path handling across the fsspec based implementations.
@normanrz, would you be interested in being added as an admin for this project, as you have already been manning the helm for a while? Also would you be interested in being a maintainer for the conda-forge feedstock?
@ap--, since you have made several contributions, would you be interested in being added as a maintainer?
As mentioned in PR #32, the notebook output needs to be updated, but there were a couple of issues, as stated in a comment to the PR:
The issues I noticed were that the UPath untested default warning appears, which we may want to filter or suppress, since this notebook is the primary user documentation. Also, the readme.exists() cell is now returning False, so it looks like something somewhere may have broken. I don't think it is in this PR: I tested on main and the issue still happens, so I will create an issue to address that.
In addition to those issues, I think it would be worth clarifying this point:
Some filesystems may require extra imports to use.
Since fsspec tries to instantiate registered implementations as needed, I think it would be more clear to state something like:
Some filesystems may require additional packages to be installed.
The import of s3fs is not necessary and may cause confusion about whether target filesystem packages need to be imported prior to usage with UPath.
One idea I had is that it might be a handy reference to import and cleanly print the known fsspec implementations at the end of the notebook.
I'd be interested in any feedback. Hopefully I'll be able to submit another PR sometime soon with some of these changes.
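One way the fallback warning could be filtered in the notebook is with the standard warnings module. This is only a sketch: it assumes the warning is a UserWarning whose text contains "not explicitly implemented", and emit_fallback_warning is a hypothetical stand-in for the library call that triggers it.

```python
import warnings

# Assumption: the warning text contains this phrase; it may differ by version.
PATTERN = r".*not explicitly implemented.*"


def emit_fallback_warning():
    # Hypothetical stand-in for the call that emits the fallback warning.
    warnings.warn(
        "ssh filesystem path not explicitly implemented. "
        "falling back to default implementation.",
        UserWarning,
    )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore only warnings whose message matches the pattern.
    warnings.filterwarnings("ignore", message=PATTERN, category=UserWarning)
    emit_fallback_warning()
    warnings.warn("unrelated warning", UserWarning)

# Only the unrelated warning is recorded.
print([str(w.message) for w in caught])
```

Filtering by message rather than by category keeps other UserWarnings visible to notebook readers.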
let the new python problems begin:
>>> from upath import UPath
>>> UPath('asdfsad')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/envs/fixtures-cache/lib/python3.10/site-packages/upath/core.py", line 39, in __new__
self = cls._from_parts(args, init=False)
TypeError: PurePath._from_parts() got an unexpected keyword argument 'init'
starting with nox>=2022.8.7 the ci breaks: https://github.com/fsspec/universal_pathlib/runs/7724453295?check_suite_focus=true#step:5:61
requires fixing the noxfile.
Current workaround: bf1ea06
I ran into this issue today...
I was using UPath("gs://my-fresh-bucket/some-dir/another-dir/file").parent.mkdir(parents=True, exist_ok=True).
The bucket was empty and didn't contain any objects.
universal-pathlib attempted to create the bucket instead of creating the directory structure, which caused the code to fail.
Given this:
print([f for f in path.glob("**/*") if f.is_file()])
if path is a Path from pathlib pointing to a folder, that would return all the files recursively contained in that folder. I'd expect that:
path = UPath("s3://my-bucket")
print([f for f in path.glob("**/*") if f.is_file()])
would return all files, but instead it returns nothing. Using /**/* works instead.
Is this a bug or a feature?
My use case is providing a generic "folder" path, and being able to walk all the files within it recursively
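For reference, the baseline behaviour described for pathlib.Path can be reproduced with a temporary local directory. This sketch uses only the standard library; the file names are made up for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub" / "deeper").mkdir(parents=True)
    (root / "top.txt").write_text("a")
    (root / "sub" / "mid.txt").write_text("b")
    (root / "sub" / "deeper" / "leaf.txt").write_text("c")

    # "**/*" walks the tree recursively; is_file() drops the directories.
    files = sorted(
        f.relative_to(root).as_posix() for f in root.glob("**/*") if f.is_file()
    )

print(files)
```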
from #96
- Migrate to hatch, stop using legacy setup metadata format.
- Update and consolidate dev tools and test dependencies.
- New nox tasks and minor changes for convenience.
- Add, configure, and apply pre-commit for consistent code formatting.
@brl0, I'm all for this. It might be worth waiting for fsspec/community#6 (and https://github.com/fsspec/cookiecutter-fsspec) to be ready, so that we don't have to do this twice.
I would almost say, let's wait for either the template to be ready or Python 3.12 to be available as a release candidate; whichever comes first will trigger this refactor.
But I have no strong opinion here.
We should start giving more guarantees and adhere to semver, as well as keep a changelog.
I think this should be done together with #98
We use UPath in a project with mypy typing. Most functions and attributes are inferred properly, even without type annotations in upath. However, we had to create a stub file (see below) to teach mypy the subclass relationship with pathlib.Path and to declare the fs and _kwargs attributes. Perhaps it would be easier for other users if upath shipped with this stub file or had type annotations directly in the code. I'd be happy to help with a PR.
./stubs/upath/__init__.py:
from pathlib import Path
from typing import Any, Dict
import fsspec
class UPath(Path):
    fs: fsspec.AbstractFileSystem
    _kwargs: Dict[str, Any]
The as_uri() method of an S3 path appears to return a file URI containing the percent-encoded URL; it should probably behave like .as_posix(), which returns the same string as the input, as shown below.
f = UPath(f's3://{bucket}/{key}')
f.as_uri()==f"file://s3%3A//{bucket}/{key}"
f.as_posix()==f's3://{bucket}/{key}'
Found this because, for some reason, polars doesn't like UPaths unless .as_posix() is used.
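The %3A in the reported output is what percent-encoding produces for the colon after the scheme. A small sketch with urllib.parse.quote shows the effect; the bucket and key names are placeholders, and this only illustrates the encoding, not the library's actual code path.

```python
from urllib.parse import quote

# With the default safe="/" the slashes survive, but ":" is encoded as %3A.
encoded = quote("s3://bucket/key")
print(encoded)
```

This is consistent with the whole URL being treated as a local path and then encoded into a file URI.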
I am really excited about this project, thanks for sharing!
I ran into a minor issue using UPath to create a MemoryFile, the leading path separator gets stripped.
For example, this should work, but both examples cause FileNotFoundError.
from fsspec.implementations.memory import MemoryFileSystem
from upath import UPath
fs = MemoryFileSystem()
content = b"a,b,c\n1,2,3\n4,5,6"
# write with upath, read with fsspec
p1 = "memory:///tmp/output1.csv"
upath1 = UPath(p1)
upath1.write_bytes(content)
with fs.open(p1) as file:
    assert file.read() == content
# write with fsspec, read with upath
p2 = "memory:///tmp/output2.csv"
with fs.open(p2, "wb") as file:
    file.write(content)
upath2 = UPath(p2)
assert upath2.read_bytes() == content
Looking at fs.store, it is clear that UPath is not handling the path the same way as fsspec. One is written with the leading path separator, the other is not.
{'tmp/output1.csv': <fsspec.implementations.memory.MemoryFile at 0x7f2cbce685e0>,
'/tmp/output2.csv': <fsspec.implementations.memory.MemoryFile at 0x7f2cbce6e360>}
I should also mention that I tried this on the current release as well as on the master branch.
Merely a suggestion, totally up to you, but many fsspec-related repos have moved recently to the fsspec github org, and you would be welcome there too.
The bug appears when you try to access the parents property of a path with a URL scheme prefix like gs://.
Way to reproduce
from upath import UPath
path = UPath('gs://my-bucket/dir1/dir2/') # Create a path object.
parent = path.parents[0] # Get any parent. The object is missing _url attribute.
# Next line will throw an error because the object is incorrect.
str(parent)
Traceback (most recent call last):
File "C:\Users\korotkovk\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3437, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
str(a)
File "C:\Users\korotkovk\anaconda3\lib\pathlib.py", line 723, in __str__
self._str = self._format_parsed_parts(self._drv, self._root,
File "C:\Users\korotkovk\anaconda3\lib\site-packages\upath\core.py", line 149, in _format_parsed_parts
scheme, netloc = self._url.scheme, self._url.netloc
AttributeError: 'NoneType' object has no attribute 'scheme'
Possible solution
The parents property is implemented in the PurePath class. There, the _PathParents class is used for the parents sequence. Its __getitem__ method doesn't pass the "required" url parameter to the _from_parsed_parts method:
# Code block from pathlib.
def __getitem__(self, idx):
    if idx < 0 or idx >= len(self):
        raise IndexError(idx)
    return self._pathcls._from_parsed_parts(self._drv, self._root,  # <---- doesn't give the url param, which leads to an error.
                                            self._parts[:-idx - 1])
I suggest that the parents property could be overridden inside UPath to fix this error.
It looks like the ssh protocol is not working at the moment. I have paramiko installed, and using fsspec with it works fine.
The warning does say that the ssh filesystem is not explicitly implemented. What needs to be done for it to be?
Minimal example:
from upath import UPath
url = "ssh://myserver/tmp/somefile.tar"
gpath = UPath(url)
This crashes on my machine with:
/home/a001673/mambaforge/envs/satpy/lib/python3.10/site-packages/upath/registry.py:25: UserWarning: ssh filesystem path not explicitly implemented. falling back to default implementation. This filesystem may not be tested
warnings.warn(warning_str, UserWarning)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/a001673/mambaforge/envs/satpy/lib/python3.10/site-packages/upath/core.py", line 156, in __new__
self = cls._from_parts_init(args, init=False)
File "/home/a001673/mambaforge/envs/satpy/lib/python3.10/site-packages/upath/core.py", line 306, in _from_parts_init
return super()._from_parts(args, init=init)
TypeError: PurePath._from_parts() got an unexpected keyword argument 'init'
They fail in the webdav_fixture. Rerunning the job usually recovers it.
We need to make some improvements here to be more reliable on windows.
@pytest.fixture
def webdav_fixture(local_testdir, webdav_server):
webdav_path, webdav_url = webdav_server
if os.path.isdir(webdav_path):
shutil.rmtree(webdav_path, ignore_errors=True)
try:
> shutil.copytree(local_testdir, webdav_path)
upath\tests\conftest.py:339:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
C:\hostedtoolcache\windows\Python\3.8.10\x64\lib\shutil.py:557: in copytree
return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
C:\hostedtoolcache\windows\Python\3.8.10\x64\lib\shutil.py:458: in _copytree
os.makedirs(dst, exist_ok=dirs_exist_ok)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
name = 'C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\webdav0'
mode = 511, exist_ok = False
def makedirs(name, mode=0o777, exist_ok=False):
"""makedirs(name [, mode=0o777][, exist_ok=False])
Super-mkdir; create a leaf directory and all intermediate ones. Works like
mkdir, except that any intermediate path segment (not just the rightmost)
will be created if it does not exist. If the target directory already
exists, raise an OSError if exist_ok is False. Otherwise no exception is
raised. This is recursive.
"""
head, tail = path.split(name)
if not tail:
head, tail = path.split(head)
if head and tail and not path.exists(head):
try:
makedirs(head, exist_ok=exist_ok)
except FileExistsError:
# Defeats race condition when another thread created the path
pass
cdir = curdir
if isinstance(tail, bytes):
cdir = bytes(curdir, 'ASCII')
if tail == cdir: # xxx/newdir/. exists if xxx/newdir exists
return
try:
> mkdir(name, mode)
E FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\webdav0'
C:\hostedtoolcache\windows\Python\3.8.10\x64\lib\os.py:223: FileExistsError
I think this project is a great idea! I hope to make use of it to simplify and standardize some file operations, mostly for gcs and s3 storage but keeping flexibility for local storage.
It would be nice if UPath would accept a proper URI, with double slashes rather than a single slash, such as "s3://spacenet-dataset/".
@andrewfulton9, would this be possible to do? I would be happy to assist in any way I can.
Hi,
I noticed that a pattern I was using before no longer works. I am using pydantic-settings (version 2.0.0), which is now a separate package, for settings management. Possibly I should create this issue at pydantic-settings instead of here. My apologies if that is the case; I can of course do that as well.
What I observed is the following. It used to be possible to do this:
from dotenv import find_dotenv
from pydantic import BaseConfig
from pydantic_settings import BaseSettings
from upath import UPath
class Settings(BaseSettings):
    example_path: UPath = UPath(__file__).parent.parent / "example_data"
    class Config(BaseConfig):
        env_prefix = "my_env_"
        env_file = find_dotenv(".env")
        env_file_encoding = 'utf-8'
and I created a .env file with:
MY_ENV_EXAMPLE_PATH=/some/other/path
That worked well in the past. The string in the .env file was converted to a UPath object by pydantic.
Now it gives me this error:
ImportError while loading conftest '..../tests/conftest.py'.
tests/conftest.py:6: in <module>
from bigwig_loader.collection import BigWigCollection
bigwig_loader/__init__.py:15: in <module>
config = _Settings()
mambaforge/envs/bigwig/lib/python3.11/site-packages/pydantic_settings/main.py:61: in __init__
super().__init__(
E pydantic_core._pydantic_core.ValidationError: 1 validation error for Settings
E example_data_dir
E Input should be an instance of UPath [type=is_instance_of, input_value=PosixPath('/bla/bla/r.../example_data'), input_type=PosixPath]
E For further information visit https://errors.pydantic.dev/2.0.1/v/is_instance_of
It seems like pydantic-settings is more strict than it used to be or something changed with the types of universal_pathlib?
universal_pathlib version: 0.0.23
operating system: RHEL 7.9 and macOS
The mkdir API does not respect exist_ok=True when trying to create a directory tree with parents=True. Example:
>>> from upath import UPath
>>> test_dir = UPath("file:///tmp/parent/child")
>>> test_dir.mkdir(parents=True)
>>> test_dir.exists()
True
>>> test_dir.mkdir(exist_ok=True, parents=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/rahuliyer/workspace/nbaiju-team/asset-mgmt-pipelines/icloud-cassandra-hubble-stats/.venv/lib/python3.10/site-packages/upath/core.py", line 403, in mkdir
self._accessor.mkdir(
File "/Users/rahuliyer/workspace/nbaiju-team/asset-mgmt-pipelines/icloud-cassandra-hubble-stats/.venv/lib/python3.10/site-packages/upath/core.py", line 67, in mkdir
return self._fs.mkdir(
File "/Users/rahuliyer/workspace/nbaiju-team/asset-mgmt-pipelines/icloud-cassandra-hubble-stats/.venv/lib/python3.10/site-packages/fsspec/implementations/local.py", line 46, in mkdir
raise FileExistsError(path)
FileExistsError: /tmp/parent/child
I don't expect this to fail with FileExistsError, as exist_ok is set to True.
I'm seeing this issue on both macOS and Linux.
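For comparison, this is the behaviour of the standard library's pathlib.Path.mkdir that the report expects UPath to match. The sketch uses a temporary directory, so the paths are illustrative only.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "parent" / "child"
    target.mkdir(parents=True)
    # A second call with exist_ok=True is a no-op rather than an error.
    target.mkdir(parents=True, exist_ok=True)
    created = target.is_dir()

print(created)
```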
PATHS = ['/var/tmp', 'file:/var/tmp', 'file://var/tmp', 'file:///var/tmp']
{p:list(upath.UPath(p).iterdir()) for p in PATHS}
The variant "file://var/tmp" gives me the contents of "/tmp". The other variants work as expected.
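A possible explanation, assuming the URL is parsed with something like urllib.parse.urlsplit (an assumption about the internals, not confirmed here), is that the segment after a double slash is treated as the network location rather than as part of the path:

```python
from urllib.parse import urlsplit

PATHS = ["file:/var/tmp", "file://var/tmp", "file:///var/tmp"]

# Map each variant to its (netloc, path) pair.
parsed = {p: (urlsplit(p).netloc, urlsplit(p).path) for p in PATHS}

for p, (netloc, path) in parsed.items():
    print(f"{p!r}: netloc={netloc!r}, path={path!r}")
```

Under that reading, "file://var/tmp" names the path "/tmp" on a host called "var", which would match the observed result.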
Summarizing from #84
When creating a UPath instance, universal_pathlib returns a pathlib.Path instance for local filesystems, which can cause problems with type checkers:
from upath import UPath # UPath is a subclass of pathlib.Path
pth = UPath("/local/path")  # returns e.g. a pathlib.PosixPath instance on non-windows
reveal_type(pth) # Revealed type is "upath.core.UPath"
Possible solutions:
No changes. We could just document it very explicitly.
pth: pathlib.Path = UPath(...)
I think the reason for returning a pathlib.Path instance for local filesystems is to guarantee that there are no changes in behavior when using UPath as a replacement.
We could still guarantee pathlib behavior by creating an intermediate UPath class
class UPath(pathlib.Path):
    def __new__(cls, *args, **kwargs):
        if ...:  # is local path
            return cls(*args, **kwargs)
        else:  # promote to a fsspec backed class
            return FSSpecUPath(*args, **kwargs)
    # > we could define additional properties here too `.fs`, etc...
class FSSpecUPath(UPath):
    ...  # provides all the methods current UPath provides
This would be a simple change, but I think we should then port the CPython pathlib test suite to universal_pathlib, to guarantee as good as we can that the behavior is identical. I'm worried that symlink support, the windows UNC path stuff, and other weird edge-cases will probably cause quite a few issues.
We could provide UPath as an alias for the UPath class but cast it to type[pathlib.Path] to trick the static typecheckers. And we could fall back to some metaclass magic to make runtime typechecking work, but this would basically (partially?) revert the cleanup done here: #56
Hey! I am trying to write a pickled object to google storage and I am getting FileNotFoundError. Here is a minimal script reproducing the problem:
from upath import UPath
import pickle
x = [1,2,3]
path = UPath("gs://some-bucket/x.pkl")
path.write_bytes(pickle.dumps(x))
Output:
Traceback (most recent call last):
File "test.py", line 8, in
path.write_bytes(pickle.dumps(x))
File "/home/conda/environments/aes_pinned/lib/python3.8/pathlib.py", line 1246, in write_bytes
return f.write(view)
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/fsspec/spec.py", line 1602, in __exit__
self.close()
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/fsspec/spec.py", line 1569, in close
self.flush(force=True)
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/fsspec/spec.py", line 1435, in flush
self._initiate_upload()
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/gcsfs/core.py", line 1236, in _initiate_upload
self.location = sync(
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/fsspec/asyn.py", line 69, in sync
raise result[0]
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
result[0] = await coro
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/gcsfs/core.py", line 1324, in initiate_upload
headers, _ = await fs._call(
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/gcsfs/core.py", line 340, in _call
status, headers, info, contents = await self._request(
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/decorator.py", line 221, in fun
return await caller(func, *(extras + args), **kw)
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/gcsfs/retry.py", line 110, in retry_request
return await func(*args, **kwargs)
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/gcsfs/core.py", line 332, in _request
validate_response(status, contents, path)
File "/home/conda/environments/aes_pinned/lib/python3.8/site-packages/gcsfs/retry.py", line 89, in validate_response
raise FileNotFoundError
Interestingly, DataFrame.to_csv is working for objects in that bucket, so I don't think it's a matter of permissions.
I would expect that creating a new UPath object by passing another UPath object to the constructor would copy the object. Currently, that works only for the path, but not for any additional kwargs.
a = UPath("s3://bucket/folder", key="...", secret="...")
b = UPath(a)
assert b._kwargs["key"] == "..."
# KeyError: 'key'
In Python 3.11 (currently slated for release in October 2022), pathlib accessors have been removed (see https://bugs.python.org/issue43012), resulting in UPath breaking.
For example, with an s3 remote mocked with moto (pip install moto[server])
$ moto_server s3
* Running on http://127.0.0.1:5000 (Press CTRL+C to quit)
from upath import UPath
path = UPath("s3://bucketname")
path.mkdir()
FileNotFoundError Traceback (most recent call last)
----> 1 u.mkdir()
File /usr/lib/python3.10/pathlib.py:1175, in Path.mkdir(self, mode, parents, exist_ok)
1171 """
1172 Create a new directory at this given path.
1173 """
1174 try:
-> 1175 self._accessor.mkdir(self, mode)
1176 except FileNotFoundError:
1177 if not parents or self.parent == self:
FileNotFoundError: [Errno 2] No such file or directory: 's3://bucketname/'
or, trying to write something to a file:
f = path / "file"
f.write_text("ciao")
Traceback (most recent call last)
----> 1 f.write_text("ciao")
File /usr/lib/python3.10/pathlib.py:1154, in Path.write_text(self, data, encoding, errors, newline)
1151 raise TypeError('data must be str, not %s' %
1152 data.__class__.__name__)
1153 encoding = io.text_encoding(encoding)
-> 1154 with self.open(mode='w', encoding=encoding, errors=errors, newline=newline) as f:
1155 return f.write(data)
File universal_pathlib/upath/core.py:168, in UPath.open(self, *args, **kwargs)
167 def open(self, *args, **kwargs):
--> 168 return self.fs.open(self, *args, **kwargs)
File universal_pathlib/upath/core.py:122, in UPath.__getattr__(self, item)
120 return _accessor
121 else:
--> 122 raise AttributeError(item)
AttributeError: fs
>>> from upath import UPath
>>> UPath("s3://some/file.txt")
S3Path('s3://some/file.txt')
>>> UPath("s3://some/file.txt").with_suffix(".zip")
Traceback (most recent call last):
File "/Users/poehlmann/Environments/envs/pasu/lib/python3.8/pathlib.py", line 721, in __str__
return self._str
File "/Users/poehlmann/Environments/envs/pasu/lib/python3.8/site-packages/upath/core.py", line 125, in __getattr__
raise AttributeError(item)
AttributeError: _str
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/poehlmann/Environments/envs/pasu/lib/python3.8/pathlib.py", line 742, in __repr__
return "{}({!r})".format(self.__class__.__name__, self.as_posix())
File "/Users/poehlmann/Environments/envs/pasu/lib/python3.8/pathlib.py", line 734, in as_posix
return str(self).replace(f.sep, '/')
File "/Users/poehlmann/Environments/envs/pasu/lib/python3.8/pathlib.py", line 723, in __str__
self._str = self._format_parsed_parts(self._drv, self._root,
File "/Users/poehlmann/Environments/envs/pasu/lib/python3.8/site-packages/upath/core.py", line 153, in _format_parsed_parts
scheme, netloc = self._url.scheme, self._url.netloc
AttributeError: 'NoneType' object has no attribute 'scheme'
>>>
Fixing this requires overriding those methods with ones based on the pathlib.Path.with_* (python=3.10, probably) versions.
When opening a resource using universal pathlib, the query component of the URI is dropped, resulting in the wrong resource being opened.
For example, running the following:
>>> UPath("https://duckduckgo.com/?q=upath").read_text()
makes a GET request for https://duckduckgo.com/ without the query parameter, giving unexpected output.
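The effect can be illustrated with urllib.parse alone. This sketch does not claim to reproduce the library's internals; it only shows how rebuilding a URL from scheme, netloc, and path loses the query, while passing all components through keeps it.

```python
from urllib.parse import urlsplit, urlunsplit

url = "https://duckduckgo.com/?q=upath"
parts = urlsplit(url)

# Rebuilding from scheme, netloc, and path only drops the query string.
without_query = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Passing every component through preserves it.
with_query = urlunsplit(parts)

print(without_query)
print(with_query)
```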
As I understand, right now it can be done with path._accessor._fs
This is not very obvious. Perhaps a path.fs @property would be more convenient?
Summarizing from #84
To use universal_pathlib seamlessly with other 3rd party libraries, we need to provide access to the underlying fsspec instances.
(1) Right now UPath provides UPath(...).path and UPath(...).fs to access the fsspec path string and the fsspec FileSystem instance. (We should document this prominently in the README...)
(2) Another option would be to provide this access via functions we would have to provide in universal_pathlib.
The question is: should access to those two be provided as class properties / methods on the UPath class, or should the UPath class have an identical public API to the pathlib.Path class?
I personally think that get_filesystem, get_urlpath, get_openfile, get_storage_options, etc. will be simpler to maintain and extend in the future. With Python 3.12 we probably want to refactor the entire UPath implementation again, since a bunch of important pathlib changes landed in CPython that should allow easier subclassing of pathlib.Path. I would add deprecation warnings to the two properties (UPath.fs, UPath.path), keep them around for 6-12 months, and recommend the functions in the warning.
But my opinion on that is not set in stone.
Thanks for this awesome library!
I've noticed that when creating a UPath for a URL ending in a slash, the trailing slash gets stripped:
>>> UPath("http://www.example.com/page/")
HTTPPath('http://www.example.com/page')
I don't think this is desirable behaviour for the following reasons:
The way I'd expect this to work would be for HTTPPaths to leave the original URL unchanged, but for HTTPPath.resolve() (currently not implemented) to follow any redirects and return the new resulting URL.
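The stripping is consistent with how the standard library's pure paths normalize their input. A short sketch, using only pathlib and urllib.parse, shows that the URL's path component keeps the trailing slash until it is passed through a pure path:

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

url = "http://www.example.com/page/"

# The parsed URL keeps the trailing slash in its path component.
url_path = urlsplit(url).path

# pathlib normalizes it away when the string becomes a path object.
normalized = str(PurePosixPath(url_path))

print(url_path, normalized)
```

Keeping the original URL string alongside the parsed path would be one way to avoid losing the slash, though that is a design suggestion rather than something stated above.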
Constructing a fsspec fs can get tricky. Alternatively, bypass trying to interpret the arguments and have that be the user's responsibility.
I am trying to write code that is somewhat agnostic to the type of path or string object passed. The basic pattern I would like to use is to take the input path, process as needed, then create new output objects of the same type as the input object. It is safe to assume that the output will be on the same filesystem or bucket as the input.
So with pathlib we can do this:
from pathlib import Path
p = Path(".")
s = str(p)
type(p)(s) # returns PosixPath or WindowsPath object
But this pattern does not work with UPath objects:
from upath import UPath
u = UPath("s3://test")
s = str(upath)
type(u)(s) # fails with TypeError
The error message is:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-34-1d3d93c45a9b> in <module>
3 u = UPath("s3://test")
4 s = str(upath)
----> 5 type(u)(s)
/home/conda/store/9c828ea9bf6dc65c4cd4f7331f255b67f76ac3534f6383157d0ce0ddc5bdb8e2-datum38/lib/python3.8/pathlib.py in __new__(cls, *args, **kwargs)
1039 if cls is Path:
1040 cls = WindowsPath if os.name == 'nt' else PosixPath
-> 1041 self = cls._from_parts(args, init=False)
1042 if not self._flavour.is_supported:
1043 raise NotImplementedError("cannot instantiate %r on your system"
TypeError: _from_parts() missing 1 required positional argument: 'args'
PR #32 was created as a step in that direction. It consolidates UniversalPath into UPath, so now all implementations are subclasses of UPath. After those changes, at least for generic UPath objects, something similar to the example above should work (although probably not for s3, since that has its own implementation). Additional changes will likely need to be made to the various implementations to get those to work in a similar manner.
@andrewfulton9, I wanted to get your thoughts on this effort before I go much further. Let me know if you have any questions or see any issues with the approach I am taking in the PR.
I think it would be nice if something like this worked:
from upath import UPath
path = UPath(".")
assert isinstance(path, UPath) # this currently returns false
This issue occurs for local and remote paths, so for pathlib and UniversalPath objects.
Thanks to UPath and various implementations inheriting from pathlib.Path, here is one basic approach that seems to allow this to work:
from abc import ABCMeta
from pathlib import Path
class UPathMeta(ABCMeta):
    def __instancecheck__(self, instance):
        return isinstance(instance, Path)
    def __subclasscheck__(self, subclass):
        return issubclass(subclass, Path)
class UPath(Path, metaclass=UPathMeta):
    ...
I am working to submit a PR for #25 and #26, so I am planning to also include this change, but can take it out if undesired or preferred in a separate PR.
These examples should probably behave identically, but they don't:
Path("a", "b"), Path("a") / "b", UPath("a", "b"), and UPath("a") / "b" produce the same result: "a/b" (as expected).
However, UPath("s3://a", "b") results in "s3://ab" (no separator), which is probably a bug.
There is a bug in joinpath which leaves the URL not updated.
Please see below:
p1 = UPath("s3://a/b").joinpath("c")
p2 = UPath("s3://a/b") / "c"
p1._url # the path is incorrect: missing "c"
>>> SplitResult(scheme='s3', netloc='a', path='/b', query='', fragment='')
p2._url
>>> SplitResult(scheme='s3', netloc='a', path='/b/c', query='', fragment='')
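As a point of comparison for the joining behaviour reported above, here is a minimal sketch of joining segments onto a URL so that the separator and the parsed path stay in sync. join_url is a hypothetical helper written for illustration, not part of upath.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def join_url(base, *segments):
    """Hypothetical helper: append path segments to the path of a URL."""
    parts = urlsplit(base)
    # Start from "/" when the URL has no path, so a separator is always present.
    path = posixpath.join(parts.path or "/", *segments)
    return urlunsplit(parts._replace(path=path))


print(join_url("s3://a", "b"))
print(join_url("s3://a/b", "c"))
```

Because this relies on posixpath.join, a segment that starts with "/" replaces the existing path, which may or may not be the desired semantics for URLs.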
Hello!
Should UPath._make_child replicate the behaviour of pathlib.PurePath._make_child, as it does currently, or should it behave like urllib.parse.urljoin?
>>> UPath("http://example.com/a") / "b/c"
HTTPPath('http://example.com/a/b/c')
>>> UPath("http://example.com/a/") / "b/c"
HTTPPath('http://example.com/a//b/c') # I think this one is a bug...
>>> urljoin("http://example.com/a", "b/c")
'http://example.com/b/c'
>>> urljoin("http://example.com/a/", "b/c")
'http://example.com/a/b/c'
Personally I would expect it to behave like urljoin.
Thoughts?
If you have an object in s3 with a + character in the prefix, iterdir creates incorrect paths.
from upath import UPath
for in_path in UPath("s3://my_bucket/manual__2022-02-19T14:31:25.891270+00:00/dir").iterdir():
    print(in_path)
The actual output is:
s3://my_bucket/manual__2022-02-19T14:31:25.891270+00:00/dir/my_bucket/manual__2022-02-19T14:31:25.891270+00:00/dir/file1
The expected output is:
s3://my_bucket/manual__2022-02-19T14:31:25.891270+00:00/dir/file1
I can do this:
from upath import UPath
import pandas as pd
pd.read_csv(str(UPath('gs://bucket/filename.csv')))
but not this:
up = UPath('gs://')
pd.read_csv(str(up.joinpath('bucket', 'colombia/filename.csv')))
This is because str(UPath('gs://bucket/filename.csv')) gives 'gs://bucket/filename.csv', which maintains the leading //, while str(UPath('gs://').joinpath('bucket', 'filename.csv')) gives 'gs:/bucket/filename.csv', which is invalid.
This is important because I need to be able to programmatically construct the path separately from the bucket. The idea behind this tool is to keep me from using f-strings, but it appears that is my only option here.
This is not urgent, but based on my very limited and quick testing, it seems that fsspec URL chaining may not work properly.
It is possible I was not doing it properly, or there may be some clever way to work around the limitation, but I thought it was worth raising an issue for now.
Starting new issue as requested based on comment in issue #26.
after digging deeper into the code, and exploring the pathlib implementation, I am wondering if UPath would benefit from having something like a _URI_Flavour derived from the pathlib._Flavour class. This might make it easier to have consistent path handling across the fsspec based implementations.
The base class: pathlib._Flavour
The posix implementation: pathlib._PosixFlavour
Here is a list of the members of the class for an idea of what may need to be implemented:
Of course it probably makes sense to base as much as possible on fsspec
functionality, especially considering its ability for url chaining.
Here are some possibly related functions from fsspec.core:
This isn't really a big problem, but below is an issue I ran into, and I am hoping a little fix might simplify my usage.
from pathlib import Path
from upath import UPath
Path(Path(".")) # this works
UPath(UPath(".")) # this does not work
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-2-c788cff7e39c> in <module>
----> 1 UPath(UPath("."))
~/anaconda3/envs/py38/lib/python3.8/site-packages/upath/core.py in __new__(cls, *args, **kwargs)
13 args_list = list(args)
14 url = args_list.pop(0)
---> 15 parsed_url = urllib.parse.urlparse(url)
16 for key in ["scheme", "netloc"]:
17 val = kwargs.get(key)
~/anaconda3/envs/py38/lib/python3.8/urllib/parse.py in urlparse(url, scheme, allow_fragments)
370 Note that we don't break the components up in smaller bits
371 (e.g. netloc is a single string) and we don't expand % escapes."""
--> 372 url, scheme, _coerce_result = _coerce_args(url, scheme)
373 splitresult = urlsplit(url, scheme, allow_fragments)
374 scheme, netloc, url, query, fragment = splitresult
~/anaconda3/envs/py38/lib/python3.8/urllib/parse.py in _coerce_args(*args)
122 if str_input:
123 return args + (_noop,)
--> 124 return _decode_args(args) + (_encode_result,)
125
126 # Result objects are more helpful than simple tuples
~/anaconda3/envs/py38/lib/python3.8/urllib/parse.py in _decode_args(args, encoding, errors)
106 def _decode_args(args, encoding=_implicit_encoding,
107 errors=_implicit_errors):
--> 108 return tuple(x.decode(encoding, errors) if x else '' for x in args)
109
110 def _coerce_args(*args):
~/anaconda3/envs/py38/lib/python3.8/urllib/parse.py in <genexpr>(.0)
106 def _decode_args(args, encoding=_implicit_encoding,
107 errors=_implicit_errors):
--> 108 return tuple(x.decode(encoding, errors) if x else '' for x in args)
109
110 def _coerce_args(*args):
AttributeError: 'PosixPath' object has no attribute 'decode'
Due to these lines:
https://github.com/Quansight/universal_pathlib/blob/dbc57f7ec4c8bb8b2e22975f6b62e25b2b82ecff/upath/core.py#L14-L15
fsspec has a stringify_path utility function that might be appropriate:
from fsspec.utils import stringify_path
def stringify_path(filepath):
    """Attempt to convert a path-like object to a string.
    Parameters
    ----------
    filepath: object to be converted
    Returns
    -------
    filepath_str: maybe a string version of the object
    Notes
    -----
    Objects supporting the fspath protocol (Python 3.6+) are coerced
    according to its __fspath__ method.
    For backwards compatibility with older Python version, pathlib.Path
    objects are specially coerced.
    Any other object is passed through unchanged, which includes bytes,
    strings, buffers, or anything else that's not even path-like.
    """
    if isinstance(filepath, str):
        return filepath
    elif hasattr(filepath, "__fspath__"):
        return filepath.__fspath__()
    elif isinstance(filepath, pathlib.Path):
        return str(filepath)
    elif hasattr(filepath, "path"):
        return filepath.path
    else:
        return filepath
I think this would fix it:
from fsspec.utils import stringify_path
...
url = stringify_path(args_list.pop(0))
parsed_url = urllib.parse.urlparse(url)
Happy to submit a small PR and test if you'd like.
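For reference, the same idea can be sketched with the standard library alone, using os.fspath in place of fsspec's helper. `parse_pathlike` is a hypothetical name used only for this illustration; it is not how upath.core is written.

```python
import os
import urllib.parse
from pathlib import Path


def parse_pathlike(url):
    """Coerce os.PathLike objects to str before parsing.

    Hypothetical helper; anything that is not path-like is passed through.
    """
    if isinstance(url, os.PathLike):
        url = os.fspath(url)
    return urllib.parse.urlparse(url)


parsed = parse_pathlike(Path("data"))
print(parsed.scheme, parsed.path)
```

Passing the Path object straight to urlparse is what triggers the AttributeError in the traceback above, so the coercion has to happen first.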
Also, thanks for the awesome project!
I've logged a CPython PR that adds a private pathlib._VirtualPath class:
If/when that lands, it's not much more work to drop the underscore and introduce pathlib.VirtualPath to the world. This would be Python 3.13 at the earliest.
Would it be suitable for use in universal_pathlib? It would be great to hear your feedback, thank you.
Several pathlib.Path methods are marked as not implemented, including is_symlink. See https://github.com/fsspec/universal_pathlib/blob/main/upath/core.py#L112-L117
Calling these methods leads to a NotImplementedError. However, in most cases (especially for object storage that doesn't support symlinks), the expected behavior for these is_* methods would be to return False.
I was wondering if it would be possible to add default implementations for them that return False in the UPath class. Some subclasses could even choose to implement specific logic.
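A rough sketch of the proposed behavior, using stand-in classes rather than the real UPath: `RemotePath` and `LinkAwarePath` are hypothetical names, and the `link_target` attribute is an assumption made only so the override has something to check.

```python
class RemotePath:
    """Hypothetical base class standing in for UPath."""

    # Defaults suited to object stores, which have no such file types.
    def is_symlink(self):
        return False

    def is_socket(self):
        return False

    def is_fifo(self):
        return False


class LinkAwarePath(RemotePath):
    """Hypothetical subclass for a backend that does track links."""

    def __init__(self, link_target=None):
        self.link_target = link_target

    def is_symlink(self):
        # Backend-specific logic overrides the False default.
        return self.link_target is not None


print(RemotePath().is_symlink(), LinkAwarePath("target").is_symlink())
```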
Again, happy to help with a PR!
from #96
Refactor tests and fixtures for improved isolation.
Make it easier and safer to allow for tests to be run live.
Reduce inadvertent test interaction and flakiness.
Eventually try to use UPath's great test suite directly from other fsspec filesystem projects to improve coverage.
@brl0 that would be great. Maybe we could coordinate with fsspec/filesystem_spec#651 ?
Did you run into specific issues when running the current test suite?
Should we add some test order randomization to reveal issues regarding isolation?
It would be great to have async support like aiopath has.
Currently one needs to fall back to filesystem-specific libraries to write async code (as some of them support asynchronous=True).
If I try using UPath with a file URI that points to an absolute path, such as
path = UPath("file:///Users/myname/Documents/foo")
print(path)
then it seems to change the triple slash /// into just a single slash and print "file:/Users/myname/Documents/foo"
If there are only 2 slashes then it's fine; it handles it like a scheme such as gs, s3, etc. But absolute file paths often start with a leading slash as well.
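For comparison, here is how the standard library splits and rebuilds the same URI. This is only an illustration of the expected round trip; it does not show what UPath does internally.

```python
from urllib.parse import urlsplit, urlunsplit

parts = urlsplit("file:///Users/myname/Documents/foo")

# The authority (netloc) is empty and the path keeps its leading slash.
print(repr(parts.netloc), parts.path)

# Rebuilding from the parts restores the triple slash.
rebuilt = urlunsplit(parts)
print(rebuilt)
```

The empty netloc is the likely reason the two slashes get dropped when the URI is reassembled from path parts alone.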
Currently hashing is inherited directly from pathlib.PurePath. However, that implementation only accounts for the path values. upath needs to be updated to also include any other path attributes a filesystem might have, like scheme and bucket.
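One way to picture the change is a hash and equality key that includes the scheme and bucket alongside the path. The class below is a hypothetical stand-in, not UPath, and its attribute names are assumptions for the sketch.

```python
from pathlib import PurePosixPath


class SchemeAwarePath:
    """Hypothetical path type whose identity includes scheme and netloc."""

    def __init__(self, scheme, netloc, path):
        self.scheme = scheme
        self.netloc = netloc
        self.path = PurePosixPath(path)

    def _key(self):
        # Everything that distinguishes one remote location from another.
        return (self.scheme, self.netloc, self.path)

    def __eq__(self, other):
        if not isinstance(other, SchemeAwarePath):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


a = SchemeAwarePath("s3", "bucket", "/data.csv")
b = SchemeAwarePath("gs", "bucket", "/data.csv")
print(a == b, len({a, b}))
```

With a path-only hash, `a` and `b` would collide and compare equal even though they live on different filesystems.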
gpath = UPath('github://fsspec:universal_pathlib@main/')
import pickle
with open('gpath.pickle', 'wb') as f:
pickle.dump(gpath, f)
with open('gpath.pickle', 'rb') as f:
gpath2 = pickle.load(f)
gpath2.path
raises
Truncated Traceback (Use C-c C-$ to view full TB):
/tmp/ipykernel_18499/459951566.py in <module>
8
9
---> 10 gpath2.path
AttributeError: 'PosixPath' object has no attribute 'path'
It appears that the only way currently to register a custom implementation and avoid falling back on upath.core is to import and append to upath.registry._register.known_implementations. It would be helpful if there were a top-level registration, similar to fsspec.register_implementation(name, filesystem).
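A top-level registration hook could look roughly like the following. This is a sketch of the requested feature only: the function names, the `clobber` flag, and the module-level dictionary are all assumptions and do not describe upath's actual registry.

```python
_implementations = {}  # hypothetical registry: protocol name -> path class


def register_implementation(protocol, cls, clobber=False):
    """Register a path class for a protocol (hypothetical API)."""
    if protocol in _implementations and not clobber:
        raise ValueError(f"{protocol!r} is already registered")
    _implementations[protocol] = cls


def get_implementation(protocol, default=None):
    """Look up the class for a protocol, falling back to a default."""
    return _implementations.get(protocol, default)


class MyPath:
    """Placeholder for a user-defined path class."""


register_implementation("myproto", MyPath)
print(get_implementation("myproto") is MyPath)
```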
Would you be interested in supporting data URIs in universal_pathlib?
A data URI consists of a scheme and a path, with no authority part, query string, or fragment. The optional media type, the optional base64 indicator, and the data are all parts of the URI path.
Most methods would be unimplemented, as the "path" is not a real hierarchical path.
I think the only useful method would be open, which would return the data encoded in the path.
So a DataPath wouldn't require fsspec.
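To make the idea concrete, here is a minimal stdlib-only sketch of what such an open could do. `open_data_uri` is a hypothetical function name; it ignores the media type and always returns a binary stream.

```python
import base64
import io
from urllib.parse import unquote_to_bytes


def open_data_uri(uri):
    """Return a binary file-like object holding the data in a data URI.

    Hypothetical helper for illustration; the media type is not interpreted.
    """
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    # Everything before the first comma is the media type and flags.
    header, sep, payload = uri[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URI is missing the comma separator")
    if header.endswith(";base64"):
        data = base64.b64decode(payload)
    else:
        data = unquote_to_bytes(payload)
    return io.BytesIO(data)


with open_data_uri("data:text/plain;base64,SGVsbG8=") as f:
    print(f.read())
```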