Comments (11)
For this to work, I think I will need to switch to using ETags as hashes instead of the targets custom hash in the metadata. I think the reason I didn't do this initially was because I didn't know that S3 was strongly read-after-write consistent.
from targets.
Roadmap for AWS:
- Implement and test aws_s3_list() in the utils. Remember pagination.
- Switch to ETags.
- Modify store_aws_hash() to use a cache. This function should only be called locally in the central controlling R session. I could put guardrails in place to make sure that stays the case.

Unfortunately, list_objects_v2() does not return version IDs, and list_object_versions() returns too much information (never just the most current objects). So it looks like this caching will not be version-aware and will have to fall back on HEAD requests if you git reset your way back to historical metadata.
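The pagination concern above can be sketched independently of any AWS SDK. Below is a hypothetical Python sketch (list_all_keys is illustrative, not an actual targets utility) of a listing helper that follows continuation tokens the way list_objects_v2() does:

```python
def list_all_keys(fetch_page):
    """Collect every key from a paginated listing API.

    fetch_page(token) mimics list_objects_v2(): it returns a dict with
    "Contents" (a list of {"Key": ...} records) and, while the listing
    is truncated, a "NextContinuationToken" to pass to the next call.
    """
    keys, token = [], None
    while True:
        page = fetch_page(token)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        token = page.get("NextContinuationToken")
        if token is None:  # no token means the listing is complete
            break
    return keys

# Fake two-page listing standing in for real S3 responses.
pages = {
    None: {"Contents": [{"Key": "a"}, {"Key": "b"}], "NextContinuationToken": "t1"},
    "t1": {"Contents": [{"Key": "c"}]},
}
print(list_all_keys(pages.__getitem__))  # -> ['a', 'b', 'c']
```

Forgetting the continuation loop silently truncates the listing at the first page, which is exactly the kind of bug the "remember pagination" note is guarding against.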
For GCS, it might be good to just switch to ETags for the next release, then wait for cloudyr/googleCloudStorageR#179.
Hmm.... I don't think we need to switch to ETags for hashes. We could just store the ETag as part of the metadata and use ETags instead of versions to corroborate objects.
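The corroboration idea can be sketched with plain data structures. Assuming the metadata records the ETag observed at upload time (a hypothetical layout, not the actual targets metadata format), a target is current exactly when the bucket's current ETag matches the recorded one:

```python
def object_up_to_date(metadata, bucket_listing, key):
    """Corroborate an object by ETag rather than by version ID.

    metadata: {key: {"etag": ...}} recorded locally at upload time.
    bucket_listing: {key: etag} as returned by a bulk listing call.
    """
    recorded = metadata.get(key, {}).get("etag")
    current = bucket_listing.get(key)
    return recorded is not None and recorded == current

metadata = {"x": {"etag": "XYZ"}}
listing = {"x": "XYZ", "y": "OLD"}
print(object_up_to_date(metadata, listing, "x"))  # True: ETags agree
print(object_up_to_date(metadata, listing, "y"))  # False: no recorded ETag
```

The advantage is that a bulk listing already includes ETags, so no per-object request is needed for this check.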
I thought this through a bit more, and unfortunately this batched caching feature no longer seems feasible.
As I said before, list_object_versions() is not feasible because it lists all the versions of all the objects, without any kind of guardrail to list e.g. only the most recent versions. Any given object could have thousands of versions, and so listing all the versions of all the objects is way too much.
On the other hand, neither list_objects() nor list_objects_v2() lists version IDs at all, so it is impossible to confirm that the version listed in the metadata actually exists or is current. For example, suppose you revert to a historical copy of the metadata, and you see version ABC and ETag XYZ for target x. The bucket's current version could have ETag XYZ, but version ABC may no longer exist. (For example, it might have been automatically deleted by the object retention policy.)
These and similar problems are impossible to reconcile unless targets:
1. sends a HEAD request for each individual object, as it currently does, or
2. sends a batched API request with a list of key-version pairs to learn the existence of each one.
(2) seems impossible, so I think we have to stick with (1).
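Option (1) boils down to one existence check per key-version pair. A minimal sketch, with a mocked head function standing in for a real versioned HEAD request (the bucket dict is illustrative only), shows why the cost scales with the number of targets:

```python
def check_versions(head, pairs):
    """Return, for each (key, version) pair, whether that exact version exists.

    head(key, version) stands in for an S3 HEAD request on a specific
    version ID and returns True if that version still exists. One
    request is issued per pair, which is the per-object cost this
    thread is trying (and failing) to batch away.
    """
    return {(k, v): head(k, v) for k, v in pairs}

# Mock bucket: key -> set of surviving version IDs.
bucket = {"x": {"ABC", "DEF"}, "y": {"GHI"}}
head = lambda k, v: v in bucket.get(k, set())
print(check_versions(head, [("x", "ABC"), ("y", "ZZZ")]))
# {('x', 'ABC'): True, ('y', 'ZZZ'): False}
```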
I just posted https://stackoverflow.com/questions/77454033/is-there-a-way-to-batch-check-the-existence-of-specific-object-versions-in-aws-s
Tried to send a feature request on their feedback form, but it's glitchy today:
I am writing an R package which needs to check the existence of a specific version of each AWS S3 object in its data store. The version of a given object is the version ID recorded in the local metadata, and the recorded version may or may not be the most current version in the bucket. Currently, the package accomplishes this by sending a HEAD request for each relevant object-version pair.
I would like a more efficient, batched way to do this for each key-version pair. list_object_versions() returns every version of every object of interest, which is far too much to download efficiently, and neither list_objects() nor list_objects_v2() returns any version IDs at all. It would be great to have something like delete_objects() that, instead of deleting the objects, accepts the supplied key-version pairs and returns the ETag and custom metadata of each one that exists.
cf. https://repost.aws/questions/QUe-yNsIr0Td2aq2oA1RAQdQ/hudi-and-s3-object-versions
Note to self: if it ever becomes possible to revisit this issue, I will probably need to switch targets to use AWS/GCS ETags when available instead of custom local file hashes. The switch is as simple as this:
- In store_upload_object_aws(), remove the targets-hash custom metadata (line 227 in 13470ef).
- In store_upload_object_aws(), write store$file$hash <- digest_chr64(head$ETag) just above line 249 in 13470ef.
- At the end of store_aws_hash(), return digest_chr64(head$ETag) instead of head$Metadata[["targets-hash"]].
- Test that the correct ETags get into the metadata and that the correct ETags are retrieved by store_aws_hash() to assert that up-to-date targets are indeed up to date.
- Repeat all of the above for GCS.
Taking a step back: this is actually feasible if targets can ignore version IDs. There could be a tar_option_set()-level option to either check or ignore version IDs. Things to consider:
- Should the option be at the level of tar_option_set() and not tar_target()? At first glance, I think so because caching happens in bulk. Maybe the level of tar_resources_aws() could technically work, but those options are all implicitly target-level, which would be counterintuitive even with good documentation.
- Should the version check still be enabled by default? I think so, for compatibility. But it will be slow.
Taking another step back: targets should:
1. Always use the version ID when downloading data, and
2. Always ignore the version ID when checking the hash.
(1) ensures behavior is clear, consistent, compatible, and version-aware. (2) ensures a target reruns if it is not the current object in the bucket. (2) also makes this issue much easier to implement, and it lets us avoid adding a new version argument to tar_resources_aws(). The outcomes will be:
- Pipelines with cloud targets will run dramatically faster.
- The rules for checking/rerunning outdated targets will take into account which objects are the latest versions in the bucket. This makes more conceptual sense.
- Users won't need to do anything extra.
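The two rules above can be sketched together. In this hypothetical model (plain dicts standing in for the metadata record and the bucket state, not the actual targets internals), the version ID is consulted only on download, while the up-to-date check compares ETags alone:

```python
def is_up_to_date(record, current_etag):
    """Rule (2): ignore the version ID; skip the target only when the
    bucket's current ETag matches the one recorded in the metadata."""
    return record["etag"] == current_etag

def download_args(record):
    """Rule (1): always pin the exact recorded version when downloading,
    so reverting to historical metadata still fetches the right data."""
    return {"Key": record["key"], "VersionId": record["version"]}

record = {"key": "x", "etag": "XYZ", "version": "ABC"}
print(is_up_to_date(record, "XYZ"))  # True: skip the target
print(is_up_to_date(record, "OLD"))  # False: rerun the target
print(download_args(record))         # version-aware download request
```

Because the up-to-date check needs only ETags, it can run off a single bulk listing, which is where the "dramatically faster" outcome comes from.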