backblaze / b2-sdk-python
Python library to access B2 cloud storage.
License: Other
Several times over the past several days we have witnessed b2 upload-file hang. We run these commands with --noProgress, and there is no output from our logging that indicates anything went wrong; the command simply hangs for >24h until a human intervenes. In debugging this we've seen a connection open to Backblaze but no network traffic being sent or received by the process. Every time we kill and restart the process, it succeeds, so debugging this is rather tricky.
We have experienced nearly identical issues with other Python projects that were traced back to a lack of a timeout parameter in python requests. I can't confirm this is exactly what is happening here, but I do see what appear to be similar code issues with .post and .get, which seem to be at the core of the b2 HTTP API (both before the 1.4.0 release and after):
https://github.com/Backblaze/b2-sdk-python/blob/master/b2sdk/b2http.py#L290
https://github.com/Backblaze/b2-sdk-python/blob/master/b2sdk/b2http.py#L358
Without a timeout there, if the server simply stops responding but keeps the socket open, my understanding and experience is that these calls will hang indefinitely.
From the requests docs:
You can tell Requests to stop waiting for a response after a given number of seconds with the timeout parameter. Nearly all production code should use this parameter in nearly all requests. Failure to do so can cause your program to hang indefinitely:
>>> requests.get('https://github.com/', timeout=0.001)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)
Note timeout is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds). If no timeout is specified explicitly, requests do not time out.
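A minimal sketch of the kind of fix this suggests: pass an explicit timeout to each call. The (connect, read) values below are illustrative choices, not values taken from b2sdk:

import requests

# With an explicit timeout, a server that keeps the socket open but stops
# responding raises requests.exceptions.Timeout instead of hanging forever.
# (5, 60) means: 5s to establish the connection, 60s max between bytes read.
response = requests.post('https://httpbin.org/post', timeout=(5, 60))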
In order to delete a file you need to know the file id of the version to delete, but unless I'm missing something there is no straightforward way to list the different versions of a file with a given filename. bucket.ls exists, but since it enforces a trailing / on the folder you pass, the only way I can see right now to get this through the public API is to do something like this:
import os

def get_file_ids(bucket, filename):
    for file_info, folder_name in bucket.ls(folder_to_list=os.path.dirname(filename), show_versions=True):
        if file_info.file_name == filename:
            yield file_info
Obviously this potentially wastes bandwidth by returning all files in the same "folder".
def make_folder_sync_actions(
    source_folder, dest_folder, args, now_millis, reporter, policies_manager=DEFAULT_SCAN_MANAGER
):
    """
    Yields a sequence of actions that will sync the destination
    folder to the source folder.
    """
    if args.skipNewer and args.replaceNewer:
        raise CommandError('--skipNewer and --replaceNewer are incompatible')
    if args.delete and (args.keepDays is not None):
        raise CommandError('--delete and --keepDays are incompatible')
    if (args.keepDays is not None) and (dest_folder.folder_type() == 'local'):
        raise CommandError('--keepDays cannot be used for local files')
In the SDK those should probably be assertions, and we should have similar checks in the CLI with proper reporting to the user.
Request:
Provide type annotations for Python, sufficient to support tools like mypy.
A starting point perhaps: https://mypy.readthedocs.io/en/stable/stubgen.html?highlight=generate
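For example, a first pass with stubgen might look like this (a sketch; the generated .pyi stubs would need manual cleanup before shipping):

pip install mypy
stubgen -p b2sdk -o stubs/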
Not sure if this has already been considered, but I saw no record of it.
It would also make using some of your APIs more intuitive, as I have a hard time discerning return types.
Thanks!
b2-sdk-python/b2sdk/session.py, line 67 in b103939
If the user specifies no AccountInfo object, they get a SqliteAccountInfo and an AuthInfoCache by default. This is a good thing. However, if they specify any other AccountInfo object (InMemoryAccountInfo, for instance), they get a much worse cache: the DummyCache.
What could break if this defaulted to at least the InMemoryCache? Personally I don't see the harm in always defaulting to the AuthInfoCache instead. AbstractAccountInfo already forces the implementation of the functions necessary for AuthInfoCache to function correctly. This could only break existing code where people use their own AccountInfo objects that are themselves broken.
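A sketch of the proposed defaulting, to make the suggestion concrete (hypothetical; names mirror this report, not necessarily the sdk's exact code):

# If the caller supplies any AccountInfo but no cache, fall back to
# AuthInfoCache rather than DummyCache:
if account_info is None:
    account_info = SqliteAccountInfo()
if cache is None:
    cache = AuthInfoCache(account_info)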
Hi Author. Thanks a lot for this great SDK. Very easy to set up and use.
I'm wondering how I can avoid uploading duplicate files?
Say I upload a file twice with the same key; is there a way to reject the second upload somehow?
Thanks!
Move b2sdk/account_info/test_upload_url_concurrency.py, and move test_raw_api, test_raw_api_helper, _clean_and_delete_bucket, _should_delete_bucket, _add_range_header out of b2sdk/raw_api.py, to somewhere else, so that we can still execute them in pre-commit.sh, but so that they do not get test coverage tracking and are not shipped in the package to the library user.
When trying to make a backup with benji to Backblaze B2 I get this error:
ERROR: An exception of type AttributeError occurred: module 'b2' has no attribute 'bucket'
Hello,
I have a Python 3 application leveraging the b2sdk; how can I throttle bucket.upload_local_file(), for example? I'd like to limit it to a set bandwidth if possible, say 100 Mbps; is that possible?
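I'm not aware of a built-in bandwidth option; one generic workaround sketch (plain Python, not a b2sdk feature) is to wrap the stream you upload in a rate-limiting reader:

import time

class ThrottledReader:
    """File-like wrapper that caps read bandwidth at max_bytes_per_second."""

    def __init__(self, stream, max_bytes_per_second):
        self._stream = stream
        self._rate = max_bytes_per_second
        self._start = time.monotonic()
        self._consumed = 0

    def read(self, size=-1):
        data = self._stream.read(size)
        self._consumed += len(data)
        # Sleep just long enough to keep consumed / elapsed <= rate.
        expected = self._consumed / self._rate
        elapsed = time.monotonic() - self._start
        if expected > elapsed:
            time.sleep(expected - elapsed)
        return data

For 100 Mbps, max_bytes_per_second would be 12_500_000 (100,000,000 bits / 8).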
from https://mail.python.org/archives/list/[email protected]/message/EYLXCGGJOUMZSE5X35ILW3UNTJM3MCRE/
Use the development mode to see DeprecationWarning and ResourceWarning: use the "-X dev" command line option or set the PYTHONDEVMODE=1 environment variable. Or you can use the PYTHONWARNINGS=default environment variable to see DeprecationWarning.
You might even want to treat all warnings as errors to ensure that you don't miss any when you run your test suite in your CI. You can use PYTHONWARNINGS=error, and combine it with PYTHONDEVMODE=1.
Warnings filters can be used to ignore warnings in third party code, see the documentation:
https://docs.python.org/dev/library/warnings.html#the-warnings-filter
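For example, to run a test suite with every warning surfaced and treated as an error (pytest here is just an illustrative runner, not the project's configured one):

PYTHONDEVMODE=1 PYTHONWARNINGS=error python -m pytest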
The command run from a cron.hourly script was
b2 sync --noProgress --keepDays 14 /home/data/v2 b2://bhs-backup/ > /dev/null
b2 wrote to stderr
ERROR:b2sdk.bucket:error when uploading, upload_url was https://pod-000-1009-12.backblaze.com/b2api/v2/b2_upload_file/redacted/redacted
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/b2sdk/bucket.py", line 615, in _upload_small_file
content_type, HEX_DIGITS_AT_END, file_info, hashing_stream
File "/usr/local/lib/python2.7/dist-packages/b2sdk/raw_api.py", line 533, in upload_file
return self.b2_http.post_content_return_json(upload_url, headers, data_stream)
File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 297, in post_content_return_json
response = _translate_and_retry(do_post, try_count, post_params)
File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 127, in _translate_and_retry
return _translate_errors(fcn, post_params)
File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 60, in _translate_errors
int(error['status']), error['code'], error['message'], post_params
ServiceError: 500 internal_error incident id 62096cb3b343-0424780b144c
Hi,
First, thanks for the awesome Python SDK.
I have one question: how do we retrieve the friendly and native URL using this SDK?
When I upload a file, I just get a FileVersionInfo object, which according to the docs doesn't have a friendly or a native url property.
I think a way to get these URLs is useful if I want to upload a file and link that file (based on these URLs) to a column (for example file_url) in my database.
For now I'm just hardcoding the url with something like:
response = bucket.upload_bytes() # ....
file_id = response.id_
url = "https://f000.backblazeb2.com/b2api/v1/b2_download_file_by_id?fileId=" + file_id
Thanks
It should be documented what the default values for the parameters are, which ones are required, why it is set up this way, and what the intended usage is.
I have a loop uploading files locally to B2 cloud. After a few thousand uploads over a few hours, I get this error:
b2sdk.exception.UnknownError: Unknown error: 400 bad_request more than one upload using auth token 4_0021a87abca5e6b0000000005_018fe4f8_97f3ec_uplg_wANv-9n9ent4nLDoDI0yMFmZqeQ=
I feel like the upload_local_file method should be able to renegotiate a new auth token on its own rather than failing like this.
def sync_folders(
    source_folder,
    dest_folder,
    args,
    now_millis,
    stdout,
    no_progress,
    max_workers,
    policies_manager=DEFAULT_SCAN_MANAGER,
    dry_run=False,
    allow_empty_source=False,
):
Making args a mandatory argument is a choice that we made back in the CLI days, and an area we can improve on.
Hi --
I'd like to set the bucket info on my bucket to not have any entries. To do that, I tried calling bucket.set_info() and got an assert. Here's a simple failing case:
b2_api = B2Api(InMemoryAccountInfo())
b2_api.authorize_account(url, app_key_id, app_key)
bucket = b2_api.get_bucket_by_name("myBucket")
bucket.set_info({})
This triggers an assert on line 558, in update_bucket:
assert bucket_info or bucket_type
I'm running on macOS 10.15.7.
I'm using Python 3.8.6 installed with mac ports.
I have the following b2 things installed with pip3:
$ pip3 list | grep b2
b2 2.0.2
b2sdk 1.1.4
Here's the stack trace that's generated: stack.txt
A similar thing happens when I use bucket.update(bucket_info={}), which ends up in the same place. I'm currently working around it by passing the bucket's type back in:
bucket.update(bucket_type=bucket.as_dict()['bucketType'], bucket_info={}, lifecycle_rules=[])
clear_large_file_upload_urls may not clear some URLs that are not in the pool right now because they are "rented" by upload threads. Those would need to be blacklisted, or kept in a structure that tracks them even during the rental period, or something like that.
The documentation doesn't say so, but the idea behind that feature in the first place was that a failing pod is very likely to fail any subsequent requests, so the sdk code invalidates "sister" tokens for that same pod in order to save a few failing requests and go directly to retrieval of new upload urls+tokens.
It is not a severe issue - the code improves behavior in a corner case and can be refined later. I am filing this issue so we don't forget about the problem; maybe it can be solved along with some other similar issue.
Not exactly related to the SDK; however, I am trying to access B2 using Python with the requests library to create a new key. My code is as below:
import requests

response = requests.get(
    "https://api.backblazeb2.com/b2api/v2/b2_create_key",
    params={
        'accountId': 'xxxx',
        'capabilities': ['listKeys', 'listBuckets', 'listFiles', 'readFiles',
                         'shareFiles', 'writeFiles', 'deleteFiles'],
        'keyName': 'test',
    },
    headers={'Authorization': 'xxxx'},
)
I have included the values for the account ID and the authorization. I am able to access other B2 services. However, when using the above function I get the following error:
{'code': 'bad_request', 'message': 'duplicate name in query string: capabilities', 'status': 400}
Any idea how to overcome this?
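The B2 docs describe b2_create_key as a POST call with a JSON body; sending the capabilities list as JSON avoids requests encoding it as a repeated query-string parameter, which is the likely cause of the "duplicate name in query string" error. A sketch of that change:

import requests

# POST with a JSON body; the capabilities array is serialized as JSON
# instead of being expanded into repeated query-string parameters.
response = requests.post(
    "https://api.backblazeb2.com/b2api/v2/b2_create_key",
    json={
        'accountId': 'xxxx',
        'capabilities': ['listKeys', 'listBuckets', 'listFiles', 'readFiles',
                         'shareFiles', 'writeFiles', 'deleteFiles'],
        'keyName': 'test',
    },
    headers={'Authorization': 'xxxx'},
)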
The command run from a cron.hourly script was
b2 sync --noProgress --keepDays 14 /home/data/v2 b2://bhs-backup/ > /dev/null
b2 wrote to stderr
ERROR:b2sdk.bucket:error when uploading, upload_url was https://pod-000-1117-16.backblaze.com/b2api/v2/b2_upload_file/xxxx/xxxx
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/b2sdk/bucket.py", line 615, in _upload_small_file
content_type, HEX_DIGITS_AT_END, file_info, hashing_stream
File "/usr/local/lib/python2.7/dist-packages/b2sdk/raw_api.py", line 533, in upload_file
return self.b2_http.post_content_return_json(upload_url, headers, data_stream)
File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 297, in post_content_return_json
response = _translate_and_retry(do_post, try_count, post_params)
File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 127, in _translate_and_retry
return _translate_errors(fcn, post_params)
File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 84, in _translate_errors
raise B2ConnectionError(str(e0))
B2ConnectionError: Connection error: HTTPSConnectionPool(host='pod-000-1117-16.backblaze.com', port=443): Max retries exceeded with url: /b2api/v2/b2_upload_file/xxxx/xxxx (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f139ca04c10>: Failed to establish a new connection: [Errno 111] Connection refused',))
Currently, the SDK is able to synchronize files between two B2 buckets (implemented in #165), but it synchronizes only the latest versions as the whole idea of synchronization works on files and not on file versions.
We may consider adding a feature to sync every version of the files. It may not be b2 sync but something else, or a special b2 sync mode.
The requirements.txt lists arrow>=0.8.0,<0.13.1.
arrow is currently at 0.14.2, so the pin causes issues/warnings in pip.
I reviewed the code; it's just a few lines of very basic usage which doesn't appear to have any issues with later versions. Please update the requirements.txt file.
The command run from a cron.hourly script was
b2 sync --noProgress --keepDays 14 /home/data/v2 b2://bhs-backup/ > /dev/null
b2 wrote to stderr
ERROR:b2sdk.bucket:error when uploading, upload_url was https://pod-000-1128-03.backblaze.com/b2api/v2/b2_upload_file/redacted/redacted
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/b2sdk/bucket.py", line 615, in _upload_small_file
content_type, HEX_DIGITS_AT_END, file_info, hashing_stream
File "/usr/local/lib/python2.7/dist-packages/b2sdk/raw_api.py", line 533, in upload_file
return self.b2_http.post_content_return_json(upload_url, headers, data_stream)
File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 297, in post_content_return_json
response = _translate_and_retry(do_post, try_count, post_params)
File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 119, in _translate_and_retry
return _translate_errors(fcn, post_params)
File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 83, in _translate_errors
raise BrokenPipe()
BrokenPipe: Broken pipe: unable to send entire request
ERROR:b2sdk.bucket:error when uploading, upload_url was https://pod-000-1009-00.backblaze.com/b2api/v2/b2_upload_file/redacted/redacted
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/b2sdk/bucket.py", line 615, in _upload_small_file
content_type, HEX_DIGITS_AT_END, file_info, hashing_stream
File "/usr/local/lib/python2.7/dist-packages/b2sdk/raw_api.py", line 533, in upload_file
return self.b2_http.post_content_return_json(upload_url, headers, data_stream)
File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 297, in post_content_return_json
response = _translate_and_retry(do_post, try_count, post_params)
File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 127, in _translate_and_retry
return _translate_errors(fcn, post_params)
File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 60, in _translate_errors
int(error['status']), error['code'], error['message'], post_params
ServiceError: 503 service_unavailable c001_v0001009_t0016 is too busy
We are in the process of setting up monitoring and alerts for our backblaze backups, so we are notified if one of our backup processes stops working.
Some metrics I'd like to track per bucket are:
For total size and number of files - I found Backblaze/B2_Command_Line_Tool#404, which adds --showSize to the CLI. But I looked at the code that calculates this, and it recursively visits every file in the bucket and adds the sizes up. That's simply not going to perform well. (It worked fine for a small bucket, but when I tried it on one of our larger buckets I didn't get a result after 15 minutes of waiting and killed the process.) What's strange is, I can see this info on the Backblaze website, so it seems like you already know these stats per bucket. Is there a chance they could be exposed somehow?
For time since last update, we could add a "canary file" to the root of each bucket, make sure it gets updated regularly, and check that... but it would be far less brittle if Backblaze provided this info. Do you store it? If so, is there any way to access it?
For number of versions, I could parse lifecycle_rules out of the bucket info, so that's fine.
Any guidance here? We use prometheus for metrics, so the plan is to use this python sdk and write a simple client to export the above metrics. It would be generic enough that we could open-source the project so others could use it. But as it stands right now, I can't figure out a way to get enough information to create useful metrics in a generic way.
Thanks!
Using b2api 1.0.2.
I have a bucket, and I can list files and get URLs and everything, so I think the bucket object is working.
However, when I call bucket.download_file_by_name("filepath/filename", "download_path")
I get "AttributeError: 'str' object has no attribute 'make_file_context'".
I can download the file from the Backblaze webpage, and I can get the URL from the b2api, so it doesn't seem to be an issue with the specific file.
Here is the trace:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-18-88c956cdf6c6> in <module>
----> 1 bucket.download_file_by_name("filepath/filename", "download_path")
~/.pyenv/versions/3.8.1/envs/scipy/lib/python3.8/site-packages/logfury/v0_1/trace_call.py in wrapper(*wrapee_args, **wrapee_kwargs)
82 # actually log the call
83 self.logger.log(self.LEVEL, 'calling %s(%s)%s', function_name, arguments, suffix)
---> 84 return function(*wrapee_args, **wrapee_kwargs)
85
86 return wrapper
~/.pyenv/versions/3.8.1/envs/scipy/lib/python3.8/site-packages/b2sdk/bucket.py in download_file_by_name(self, file_name, download_dest, progress_listener, range_)
256 url_factory=self.api.account_info.get_download_url,
257 )
--> 258 return self.api.transferer.download_file_from_url(
259 url, download_dest, progress_listener, range_
260 )
~/.pyenv/versions/3.8.1/envs/scipy/lib/python3.8/site-packages/logfury/v0_1/trace_call.py in wrapper(*wrapee_args, **wrapee_kwargs)
82 # actually log the call
83 self.logger.log(self.LEVEL, 'calling %s(%s)%s', function_name, arguments, suffix)
---> 84 return function(*wrapee_args, **wrapee_kwargs)
85
86 return wrapper
~/.pyenv/versions/3.8.1/envs/scipy/lib/python3.8/site-packages/b2sdk/transferer/transferer.py in download_file_from_url(self, url, download_dest, progress_listener, range_)
94 )
95
---> 96 with download_dest.make_file_context(
97 metadata.file_id,
98 metadata.file_name,
~/.pyenv/versions/3.8.1/lib/python3.8/contextlib.py in __enter__(self)
111 del self.args, self.kwds, self.func
112 try:
--> 113 return next(self.gen)
114 except StopIteration:
115 raise RuntimeError("generator didn't yield") from None
~/.pyenv/versions/3.8.1/envs/scipy/lib/python3.8/site-packages/b2sdk/download_dest.py in write_file_and_report_progress_context(self, file_id, file_name, content_length, content_type, content_sha1, file_info, mod_time_millis, range_)
211 mod_time_millis, range_
212 ):
--> 213 with self.download_dest.make_file_context(
214 file_id, file_name, content_length, content_type, content_sha1, file_info,
215 mod_time_millis, range_
AttributeError: 'str' object has no attribute 'make_file_context'
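Judging from the trace, download_file_by_name expects a download destination object rather than a path string; a sketch using the v1 class that appears elsewhere in these reports:

from b2sdk.v1 import DownloadDestLocalFile

# Wrap the local path in a DownloadDest object instead of passing a str:
dest = DownloadDestLocalFile('download_path')
bucket.download_file_by_name('filepath/filename', dest)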
Hello,
I've read quite a bit through the documentation, but haven't really found an answer.
Users are uploading their files to the website (using Flask) and I'd like to upload these files to Backblaze B2. I have a FileStorage object which I'd rather not save to disk locally just to upload to Backblaze afterwards, as that creates an unnecessary step.
The way I see it, there are 'upload_local_file' and 'upload'.
The first requires the full path to the file, which I don't have, and for the second one, upload, I don't quite understand how to use it or what upload_source is supposed to be.
Can I achieve what I want, or does the API not support uploading files directly?
Edit: an easier question while I work around this issue: how do I get the generated id when I upload a file with upload_local_file?
Thanks
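One way to avoid the temp file, sketched under the assumption that the payload fits in memory (file_storage is the Werkzeug FileStorage from the Flask request; the destination name is a placeholder):

# Read the uploaded bytes and push them straight to B2; upload_bytes
# returns a FileVersionInfo, whose id_ attribute is the generated file id.
data = file_storage.read()
file_version = bucket.upload_bytes(data, 'uploads/my_file.txt')
print(file_version.id_)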
The package is being built, but the tests are not being run.
... so that if the main thread terminates, they terminate as well on their own. I think we don't want the library user to have to call wait() on the internal threads of b2sdk.
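A generic sketch of the idea (plain threading, not the sdk's actual code): daemon threads are terminated automatically when the interpreter exits, so nobody has to join them:

import threading
import time

def background_work():
    while True:
        time.sleep(1)  # stand-in for internal sdk work

worker = threading.Thread(target=background_work, daemon=True)
worker.start()
# When the main thread exits, the daemon worker is killed along with it.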
Hi,
I know that this question was asked previously here but I had some further questions regarding this functionality.
You previously mentioned that we could use list_file_versions to do this - which is fairly easy (just limit the fetch_count to 1). But my concern with that is that in terms of "API cost" it's more expensive than calling get_file_id, as list_file_versions is in the same class as list_file_names.
So my question is, could you have a call which assumes the latest version for the file you want the information about?
If this is resolved with some other solution then I would be happy to use that but I am just trying to reduce the number of API calls and I couldn't see any obvious alternative.
Thanks
Sorry if this has already been discussed before, but I'm uploading to B2 using:
client.get_bucket_by_name(bucket_name).upload_bytes(b"0" * 2**18, path)
and I see that the uploaded file size on the B2 web portal is 262 KB instead of 256 KB. Is this expected? Is there some automatic padding that is added?
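The arithmetic alone may explain this (no padding involved): 2**18 bytes = 262,144 bytes, which is exactly 256 KiB in 1024-based units but displays as 262 KB if the portal uses 1000-based (SI) units.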
The command run from a cron.hourly script was
b2 sync --noProgress --keepDays 14 /home/data/v2 b2://bhs-backup/ > /dev/null
b2 wrote to stderr
ERROR: FAILED to upload after 5 tries. Encountered exceptions: Connection error: ('Connection aborted.', BadStatusLine("''",))
Connection error: ('Connection aborted.', BadStatusLine("''",))
Broken pipe: unable to send entire request
Connection error: ('Connection aborted.', error(104, 'Connection reset by peer'))
Connection error: ('Connection aborted.', error(104, 'Connection reset by peer'))
Hi there,
Is there functionality to upload a file to a folder when using the upload_local_file() function?
Example: "my_folder/my_file.txt"
I assumed that it would be possible simply by prepending the B2 folder path to the file_name argument. However, this didn't work (no file appears, although the function doesn't throw any errors).
I've confirmed that uploading a file without any folder (e.g. "my_file.txt") does work.
What is the correct way to upload a file to a specific folder?
Just to be clear, the folder does exist.
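For reference, a sketch of the slash-in-file-name approach being described (B2 has no real directories; a "folder" is just a prefix of the file name, so this is the expected pattern; paths here are placeholders):

bucket.upload_local_file(
    local_file='/tmp/my_file.txt',
    file_name='my_folder/my_file.txt',
)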
It would be more convenient for users if the most-used classes, like B2Api, were available in the top-level package. Some things are internal, some are not; it is very hard for the user to tell which things should be used and which shouldn't.
Following the simple installation instructions (in a virtualenv) does not seem to work:
% virtualenv v
% . v/bin/activate
(v) % pip install b2sdk
(v) % python -c 'from b2sdk.v1 import InMemoryAccountInfo'
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'b2sdk.v1'
Extra info:
(v) % python --version
Python 3.7.3
(v) % pip show b2sdk
Name: b2sdk
Version: 0.1.8
Summary: Backblaze B2 SDK
Home-page: https://github.com/Backblaze/b2-sdk-python
Author: Backblaze, Inc.
Author-email: [email protected]
License: MIT
Location: /home/rob/tmp/v/lib/python3.7/site-packages
Requires: logfury, requests, tqdm, six, setuptools, arrow
Required-by:
When one passes a dict to file_info, it is not checked whether too many fields are present.
A cryptic error is thrown instead, i.e.:
FAILED to upload after 5 tries. Encountered exceptions: Broken pipe: unable to send entire request
After getting a Bucket instance by calling B2Api.get_bucket_by_id(), I'm unable to download files with bucket.download_file_by_name(). A minimal example of the problem:
from b2sdk.v1 import InMemoryAccountInfo, B2Api, DownloadDestLocalFile
app_id = '<app_id>'
app_key = '<app_key>'
bucket_id = '<bucket_id>'
bucket_name = '<bucket_name>'
b2 = B2Api(InMemoryAccountInfo())
b2.authorize_account('production', app_id, app_key)
bucket = b2.get_bucket_by_id(bucket_id)
# bucket = b2.get_bucket_by_name(bucket_name)
dst = DownloadDestLocalFile('local_file_name')
bucket.download_file_by_name('<b2_file_path>', dst)
If the commented line that calls get_bucket_by_name() is used instead, the snippet works as expected.
Tried both v1.0.2 on PyPI and installing directly from master (more specifically 1707190).
The problem seems to be that the Bucket class assumes that its name attribute is always set in download_file_by_name(), which is not a correct assumption when the bucket is instantiated with B2Api.get_bucket_by_id().
Is this a bug or have I missed something in the documentation?
The command
b2 sync --noProgress --keepDays 14 /home/data/v2 b2://backup/ > /path/to/logfile
produced this error on stderr.
WARNING:b2sdk.sync.report:could not output the following line with encoding None on stdout due
to 'ascii' codec can't encode character u'\xfa' in position 40: ordinal not in range(128):
upload files/Playlist-for-Derecho-a-la-música-9-15-19.csv
I think this is not my fault as a user of the b2 command.
If b2 uses ASCII on stdout then it should take care not to send any non-ASCII characters to stdout, to avoid this Python error.
Better than that would be to use UTF-8 for stdout, at least when it is not connected to a terminal.
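As a stopgap on the user side, forcing the interpreter's stream encoding may help (standard Python behavior, not an sdk feature):

PYTHONIOENCODING=utf-8 b2 sync --noProgress --keepDays 14 /home/data/v2 b2://backup/ > /path/to/logfile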
We need to significantly alter how uploads are handled in b2sdk to provide better configurability and to properly handle multi-part server-side copy.
The logic of uploads is going to be moved to the transferer, but it is going to be pretty abstract. The same flow will be used by single-part uploads, multi-part uploads, multi-part upload continuation, single-part server-side copy, multi-part server-side copy, as well as "emerging" - synthesizing a file out of pieces, some of which are in the cloud already and some are still local and need to be uploaded. The Emerger will be capable of uploading, server-side copying, and downloading a stream to then upload it; that's why it is not called an Uploader.
A corner case of that is when we need to download some data to then upload it again (due to the server-side lower size limit on a part, currently equal to 5MB), and another is that sometimes a very large file needs to be copied and Emerge will need to break it down into smaller chunks, otherwise the server-side copy will refuse to handle the request.
The copy operation will have an optional flag which will terminate the operation if any data would need to be downloaded in order to fulfill the copy (i.e. if it would not be a purely server-side copy). Setting that flag to True will force the Emerger to consume the entire iterable to verify whether the operation is a pure copy (unless it is not - then it can exit as soon as a range that cannot be satisfied without downloading is found).
There will be two thread pools, one for uploading and another one for downloading. This is to avoid deadlocks when we'd need to download data in order to upload it back and all threads in the pool are busy doing that (waiting for someone to download the data). Since downloads will never wait for uploads (in the current design), two threadpools will ensure we avoid a deadlock.
To decrease the final PR size, we'll move the upload logic as is to transferer first. We'll implement b2_copy_part support in raw_api as well as in the simulator. Some tests can maybe be added before the feature is actually implemented.
The part of the code which decides how to split a file into parts is going to have a simple implementation - it is a very hard problem to optimize and we are not going to spend a lot of effort on the strategy at this point. This may cause some copy operations which could be completed with just upload+copy to not be properly recognized, so if the forbid_downloads flag is set to True, we may return failure because of an imperfect strategy. This is a known limitation of the design, and we will be open to pull requests from anyone wishing to optimize this further (as long as it doesn't degrade the performance of the naive approach too much).
In Transferer we already delegate some functionality away to Downloaders. This is nice because the Transferer class itself is smaller. Here the functionality will be moved to a new class, called Emerger. Emerger, in its main operation, will need to accept an iterable of RangeToEmerge objects, which may be cloud ranges, local ranges, or unfinished parts, and those can be mixed. Sometimes a file will be present both in the cloud and on the local filesystem, and Emerger will decide whether to use the local part or download it. A range may be present in the cloud in multiple objects at different locations, and maybe the same applies to the local side too, but optimization of local reading to maximize streaming is not the goal at this point. Therefore the interface will NOT allow the user to provide multiple locations of a range of the same type (cloud, local, unfinished part). Our API interface policy allows us to easily add such support in the future without breaking interface compatibility for existing users, so we'll match the interface with the current implementation plan and will change the interface if it's ever required.
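A hypothetical sketch of that input type, to make the design concrete (field names are guesses for illustration, not the sdk's actual interface):

from dataclasses import dataclass
from typing import Optional

@dataclass
class RangeToEmerge:
    offset: int                           # position of this range in the emerged file
    length: int                           # number of bytes in the range
    cloud_file_id: Optional[str] = None   # set if the bytes already live in B2
    local_path: Optional[str] = None      # set if the bytes live on local disk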
Transferer will provide a few wrappers for Emerger, allowing the user to easily upload a file or perform a server-side copy.
Both of those operations may break an operation into smaller ones. There are no plans to allow forcing a file to be treated as small or large, for upload or for server-side copy. If someone needed this (though I can't imagine why), the function would be easy to add by swapping out Emerger for another implementation (so one would have to pass it to an explicitly constructed Transferer, which would be passed to B2Api).
When a piece of a file needs to be downloaded from the cloud so that we can upload it again (the simplest case being a "copy" request of two remote-only ranges, at least one of which is less than 5MB), we will NOT save it into temporary storage on the local drive, but will pipe it directly through to the uploading thread. Local storage would be tricky to configure and may not be expected by the user, may cause security issues, etc., but streaming also has its issues - if downloading is faster than uploading, we may have a memory utilization problem. Therefore care will need to be taken to avoid exploding memory usage in the case of asymmetric network performance.
This issue is now open for comments on the design, so that we can improve the concept before a significant amount of work is put into implementation.
When the source is B2 and the file is hidden, the file on the destination won't be deleted, even with KeepOrDeleteMode.DELETE.
We should discuss whether this is expected behavior, or whether the file should be deleted when the destination file is local.
When the destination is B2 (bucket-to-bucket sync of the latest versions, implemented in #165), then we may want to hide that file instead. It may require synchronizing not only the latest versions, as described in #166.
Running tests on a clean checkout on OSX fails:
running nosetests
running egg_info
writing b2sdk.egg-info/PKG-INFO
writing dependency_links to b2sdk.egg-info/dependency_links.txt
writing requirements to b2sdk.egg-info/requires.txt
writing top-level names to b2sdk.egg-info/top_level.txt
reading manifest file 'b2sdk.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'b2sdk.egg-info/SOURCES.txt'
/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/config.py:430: DeprecationWarning: Use of multiple -w arguments is deprecated and support may be removed in a future release. You can get the same behavior by passing directories without the -w argument on the command line, or by using the --tests argument in a configuration file.
warn("Use of multiple -w arguments is deprecated and "
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/runpy.py", line 263, in run_path
return _run_module_code(code, init_globals, run_name,
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/dave/workspace/b2-sdk-python/setup.py", line 52, in <module>
setup(
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/setuptools/__init__.py", line 145, in setup
return distutils.core.setup(**attrs)
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/distutils/core.py", line 148, in setup
dist.run_commands()
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/commands.py", line 158, in run
TestProgram(argv=argv, config=self.__config)
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/core.py", line 118, in __init__
unittest.TestProgram.__init__(
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/unittest/main.py", line 100, in __init__
self.parseArgs(argv)
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/core.py", line 145, in parseArgs
self.config.configure(argv, doc=self.usage())
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/config.py", line 346, in configure
self.plugins.configure(options, self)
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/plugins/manager.py", line 284, in configure
cfg(options, config)
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/plugins/manager.py", line 99, in __call__
return self.call(*arg, **kw)
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/plugins/manager.py", line 167, in simple
result = meth(*arg, **kw)
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/plugins/multiprocess.py", line 239, in configure
_import_mp()
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/plugins/multiprocess.py", line 150, in _import_mp
m = Manager()
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/context.py", line 57, in Manager
m.start()
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/managers.py", line 579, in start
self._process.start()
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
return Popen(process_obj)
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Traceback (most recent call last):
File "setup.py", line 52, in <module>
setup(
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/setuptools/__init__.py", line 145, in setup
return distutils.core.setup(**attrs)
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/distutils/core.py", line 148, in setup
dist.run_commands()
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/commands.py", line 158, in run
TestProgram(argv=argv, config=self.__config)
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/core.py", line 118, in __init__
unittest.TestProgram.__init__(
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/unittest/main.py", line 100, in __init__
self.parseArgs(argv)
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/core.py", line 145, in parseArgs
self.config.configure(argv, doc=self.usage())
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/config.py", line 346, in configure
self.plugins.configure(options, self)
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/plugins/manager.py", line 284, in configure
cfg(options, config)
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/plugins/manager.py", line 99, in __call__
return self.call(*arg, **kw)
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/plugins/manager.py", line 167, in simple
result = meth(*arg, **kw)
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/plugins/multiprocess.py", line 239, in configure
_import_mp()
File "/Users/dave/config/repos/pyenv/versions/b2-sdk/lib/python3.8/site-packages/nose/plugins/multiprocess.py", line 150, in _import_mp
m = Manager()
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/context.py", line 57, in Manager
m.start()
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/managers.py", line 583, in start
self._address = reader.recv()
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/Users/dave/config/repos/pyenv/versions/3.8.2/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
make: *** [test] Error 1
How do I use this SDK in Python? Does anyone have a demo?
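A minimal quickstart sketch, assembled from the v1 calls that appear elsewhere in these reports (credentials and bucket name are placeholders):

from b2sdk.v1 import InMemoryAccountInfo, B2Api

info = InMemoryAccountInfo()
api = B2Api(info)
api.authorize_account('production', '<application_key_id>', '<application_key>')
bucket = api.get_bucket_by_name('<bucket_name>')
bucket.upload_bytes(b'hello world', 'hello.txt')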
I'm running duplicity 0.7.19 on Centos 7, backing up to a Backblaze B2 account. I periodically get an error:
Attempt [x] failed. AttributeError: 'module' object has no attribute 'packages'
When I do, the progress information becomes nonsense. Here's an excerpt from my logs:
Jan 29 14:18:14 fafnir backups: Getting delta of (duplicati-bb8434f1fa55a4cacac317a30d3c9d1ca.dblock.zip.aes reg) and None
Jan 29 14:18:14 fafnir backups: A duplicati-bb8434f1fa55a4cacac317a30d3c9d1ca.dblock.zip.aes
Jan 29 14:18:22 fafnir backups: AsyncScheduler: running task synchronously (asynchronicity disabled)
Jan 29 14:18:22 fafnir backups: Writing duplicity-full.20200123T060005Z.vol362.difftar.gpg
Jan 29 14:18:22 fafnir backups: Put: /tmp/duplicity-B8z5vS-tempdir/mktemp-HfU6lV-7 -> [folder redacted]/duplicity-full.20200123T060005Z.vol362.difftar.gpg
Jan 29 14:18:25 fafnir backups: Backtrace of previous error: Traceback (innermost last):
Jan 29 14:18:25 fafnir backups: File "/usr/lib64/python2.7/site-packages/duplicity/backend.py", line 369, in inner_retry
Jan 29 14:18:25 fafnir backups: return fn(self, *args)
Jan 29 14:18:25 fafnir backups: File "/usr/lib64/python2.7/site-packages/duplicity/backend.py", line 529, in put
Jan 29 14:18:25 fafnir backups: self.__do_put(source_path, remote_filename)
Jan 29 14:18:25 fafnir backups: File "/usr/lib64/python2.7/site-packages/duplicity/backend.py", line 515, in __do_put
Jan 29 14:18:25 fafnir backups: self.backend._put(source_path, remote_filename)
Jan 29 14:18:25 fafnir backups: File "/usr/lib64/python2.7/site-packages/duplicity/backends/b2backend.py", line 121, in _put
Jan 29 14:18:25 fafnir backups: progress_listener=progress_listener_factory())
Jan 29 14:18:25 fafnir backups: File "/usr/lib/python2.7/site-packages/logfury/v0_1/trace_call.py", line 84, in wrapper
Jan 29 14:18:25 fafnir backups: return function(*wrapee_args, **wrapee_kwargs)
Jan 29 14:18:25 fafnir backups: File "/usr/lib/python2.7/site-packages/b2sdk/bucket.py", line 537, in upload_local_file
Jan 29 14:18:25 fafnir backups: progress_listener=progress_listener
Jan 29 14:18:25 fafnir backups: File "/usr/lib/python2.7/site-packages/logfury/v0_1/trace_call.py", line 84, in wrapper
Jan 29 14:18:25 fafnir backups: return function(*wrapee_args, **wrapee_kwargs)
Jan 29 14:18:25 fafnir backups: File "/usr/lib/python2.7/site-packages/b2sdk/bucket.py", line 593, in upload
Jan 29 14:18:25 fafnir backups: upload_source, file_name, content_type, file_info, progress_listener
Jan 29 14:18:25 fafnir backups: File "/usr/lib/python2.7/site-packages/b2sdk/bucket.py", line 677, in _upload_large_file
Jan 29 14:18:25 fafnir backups: part_sha1_array = [interruptible_get_result(f)['contentSha1'] for f in part_futures]
Jan 29 14:18:25 fafnir backups: File "/usr/lib/python2.7/site-packages/b2sdk/utils.py", line 41, in interruptible_get_result
Jan 29 14:18:25 fafnir backups: return future.result(timeout=1.0)
Jan 29 14:18:25 fafnir backups: File "/usr/lib/python2.7/site-packages/concurrent/futures/_base.py", line 429, in result
Jan 29 14:18:25 fafnir backups: return self.__get_result()
Jan 29 14:18:25 fafnir backups: File "/usr/lib/python2.7/site-packages/concurrent/futures/thread.py", line 62, in run
Jan 29 14:18:25 fafnir backups: result = self.fn(*self.args, **self.kwargs)
Jan 29 14:18:25 fafnir backups: File "/usr/lib/python2.7/site-packages/b2sdk/bucket.py", line 765, in _upload_part
Jan 29 14:18:25 fafnir backups: HEX_DIGITS_AT_END, hashing_stream
Jan 29 14:18:25 fafnir backups: File "/usr/lib/python2.7/site-packages/b2sdk/raw_api.py", line 545, in upload_part
Jan 29 14:18:25 fafnir backups: return self.b2_http.post_content_return_json(upload_url, headers, data_stream)
Jan 29 14:18:25 fafnir backups: File "/usr/lib/python2.7/site-packages/b2sdk/b2http.py", line 297, in post_content_return_json
Jan 29 14:18:25 fafnir backups: response = _translate_and_retry(do_post, try_count, post_params)
Jan 29 14:18:25 fafnir backups: File "/usr/lib/python2.7/site-packages/b2sdk/b2http.py", line 119, in _translate_and_retry
Jan 29 14:18:25 fafnir backups: return _translate_errors(fcn, post_params)
Jan 29 14:18:25 fafnir backups: File "/usr/lib/python2.7/site-packages/b2sdk/b2http.py", line 69, in _translate_errors
Jan 29 14:18:25 fafnir backups: if isinstance(e1, requests.packages.urllib3.exceptions.MaxRetryError):
Jan 29 14:18:25 fafnir backups: AttributeError: 'module' object has no attribute 'packages'
Jan 29 14:18:55 fafnir backups: Writing duplicity-full.20200123T060005Z.vol362.difftar.gpg
Jan 29 14:18:55 fafnir backups: Put: /tmp/duplicity-B8z5vS-tempdir/mktemp-HfU6lV-7 -> [folder redacted]/duplicity-full.20200123T060005Z.vol362.difftar.gpg
Hi,
I've noticed a lot of the file operations in the SDK take the ID of the file in order to work. Unfortunately, I am only keeping track of the file names locally. How would I go about getting the ID of a file given its name?
Thanks in advance.
Hi there.
I'm a new B2 user. I've got files stored in a private B2 bucket, and I'd like to be able to generate a download token with a set expiration time that I can give to users so they can temporarily download files from my private B2 bucket.
S3 supports this functionality, and since B2 offers an S3-compatible API, I believe this should be possible.
Unfortunately, when I try to use the official boto3 library to access B2 and generate a presigned URL for a B2 item, I run into issues. For example:
url = s3.generate_presigned_url(
    'get_object',
    Params={
        'Bucket': 'BUCKET',
        'Key': 'FILENAME',
    },
    ExpiresIn=60 * 60 * 24 * 7,  # 7 days in seconds
)
print(url)
The resulting URL I get back from this call to B2 looks something like this:
https://s3.us-west-002.backblazeb2.com/BUCKET/FILE?AWSAccessKeyId=XXX&Signature=gEPp9je72g01htiu5VZJPMZA344%3D&Expires=1591850644
Unfortunately, when I go to visit this URL, I get a B2 authorization error:
<Error>
<Code>UnauthorizedAccess</Code>
<Message>bucket is not authorized: BUCKET</Message>
</Error>
This led me to try to use this native B2 library to accomplish the same feat instead, but unfortunately, it doesn't appear there is a way to make this work using this library.
It'd be cool to get some support added for generating a presigned file download URL =)
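For what it's worth, B2's native equivalent appears to be download authorization tokens; a hedged sketch (method names as found in b2sdk, so verify them against your version; bucket and api come from an authorized B2Api, as in the snippets above):

# Issue a token that permits downloads of matching file names for 7 days,
# then append it to the native download URL as the Authorization parameter.
token = bucket.get_download_authorization(
    file_name_prefix='FILENAME',
    valid_duration_in_seconds=60 * 60 * 24 * 7,
)
base_url = api.get_download_url_for_file_name(bucket.name, 'FILENAME')
url = base_url + '?Authorization=' + token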