Comments (11)
For this to work, I think I will need to switch to using ETags as hashes instead of the targets custom hash in the metadata. I think the reason I didn't do this initially was because I didn't know that S3 was strongly read-after-write consistent.
from targets.
Roadmap for AWS:
- Implement and test aws_s3_list() in the utils. Remember pagination.
- Switch to ETags.
- Modify store_aws_hash() to use a cache. This function should only be called locally in the central controlling R session. I could put guardrails in place to make sure that stays the case.

Unfortunately, list_objects_v2() does not return version IDs, and list_object_versions() returns too much information (never just the most current objects). So it looks like this caching will not be version-aware and will have to fall back on HEAD requests if you git reset your way back to historical metadata.
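The pagination concern above can be sketched independently of any AWS SDK. Below is a hypothetical Python sketch (list_all_keys is illustrative, not an actual targets utility) of a listing helper that follows continuation tokens the way list_objects_v2() does:

```python
def list_all_keys(fetch_page):
    """Collect every key from a paginated listing API.

    fetch_page(token) mimics list_objects_v2(): it returns a dict with
    "Contents" (a list of {"Key": ...} records) and, while the listing
    is truncated, a "NextContinuationToken" to pass to the next call.
    """
    keys, token = [], None
    while True:
        page = fetch_page(token)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        token = page.get("NextContinuationToken")
        if token is None:  # no token means the listing is complete
            break
    return keys

# Fake two-page listing standing in for real S3 responses.
pages = {
    None: {"Contents": [{"Key": "a"}, {"Key": "b"}], "NextContinuationToken": "t1"},
    "t1": {"Contents": [{"Key": "c"}]},
}
print(list_all_keys(pages.__getitem__))  # -> ['a', 'b', 'c']
```

Forgetting the continuation loop silently truncates the listing at the first page, which is exactly the kind of bug the "remember pagination" note is guarding against.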
For GCS, it might be good to just switch to ETags for the next release, then wait for cloudyr/googleCloudStorageR#179.
Hmm.... I don't think we need to switch to ETags for hashes. We could just store the ETag as part of the metadata and use ETags instead of versions to corroborate objects.
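The corroboration idea can be sketched with plain data structures. Assuming the metadata records the ETag observed at upload time (a hypothetical layout, not the actual targets metadata format), a target is current exactly when the bucket's current ETag matches the recorded one:

```python
def object_up_to_date(metadata, bucket_listing, key):
    """Corroborate an object by ETag rather than by version ID.

    metadata: {key: {"etag": ...}} recorded locally at upload time.
    bucket_listing: {key: etag} as returned by a bulk listing call.
    """
    recorded = metadata.get(key, {}).get("etag")
    current = bucket_listing.get(key)
    return recorded is not None and recorded == current

metadata = {"x": {"etag": "XYZ"}}
listing = {"x": "XYZ", "y": "OLD"}
print(object_up_to_date(metadata, listing, "x"))  # True: ETags agree
print(object_up_to_date(metadata, listing, "y"))  # False: no recorded ETag
```

The advantage is that a bulk listing already includes ETags, so no per-object request is needed for this check.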
I thought this through a bit more, and unfortunately this batched caching feature no longer seems feasible.
As I said before, list_object_versions() is not feasible because it lists all the versions of all the objects, without any kind of guardrail to list e.g. only the most recent versions. Any given object could have thousands of versions, and so listing all the versions of all the objects is way too much.
On the other hand, neither list_objects() nor list_objects_v2() lists version IDs at all, so it is impossible to confirm that the version listed in the metadata actually exists or is current. For example, suppose you revert to a historical copy of the metadata, and you see version ABC and ETag XYZ for target x. The bucket's current version could have ETag XYZ, but version ABC may no longer exist. (For example, it might have been automatically deleted by the object retention policy.)
These and similar problems are impossible to reconcile unless targets:
1. sends a HEAD request for each individual object, as it currently does, or
2. sends a batched API request with a list of key-version pairs to learn the existence of each one.
(2) seems impossible, so I think we have to stick with (1).
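Option (1) boils down to one existence check per key-version pair. A minimal sketch, with a mocked head function standing in for a real versioned HEAD request (the bucket dict is illustrative only), shows why the cost scales with the number of targets:

```python
def check_versions(head, pairs):
    """Return, for each (key, version) pair, whether that exact version exists.

    head(key, version) stands in for an S3 HEAD request on a specific
    version ID and returns True if that version still exists. One
    request is issued per pair, which is the per-object cost this
    thread is trying (and failing) to batch away.
    """
    return {(k, v): head(k, v) for k, v in pairs}

# Mock bucket: key -> set of surviving version IDs.
bucket = {"x": {"ABC", "DEF"}, "y": {"GHI"}}
head = lambda k, v: v in bucket.get(k, set())
print(check_versions(head, [("x", "ABC"), ("y", "ZZZ")]))
# {('x', 'ABC'): True, ('y', 'ZZZ'): False}
```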
I just posted https://stackoverflow.com/questions/77454033/is-there-a-way-to-batch-check-the-existence-of-specific-object-versions-in-aws-s
Tried to send a feature request on their feedback form, but it's glitchy today:
I am writing an R package which needs to check the existence of a specific version of each AWS S3 object in its data store. The version of a given object is the version ID recorded in the local metadata, and the recorded version may or may not be the most current version in the bucket. Currently, the package accomplishes this by sending a HEAD request for each relevant object-version pair.
I would like a more efficient, batched way to do this for each key-version pair. list_object_versions() returns every version of every object of interest, which is far too much to download efficiently, and neither list_objects() nor list_objects_v2() returns any version IDs at all. It would be great to have something like delete_objects() that, instead of deleting the objects, accepts the supplied key-version pairs and returns the ETag and custom metadata of each one that exists.
cf. https://repost.aws/questions/QUe-yNsIr0Td2aq2oA1RAQdQ/hudi-and-s3-object-versions
Note to self: if it ever becomes possible to revisit this issue, I will probably need to switch targets to use AWS/GCS ETags when available instead of custom local file hashes. The switch is as simple as this:
- In store_upload_object_aws(), remove the targets-hash custom metadata (line 227 in 13470ef).
- In store_upload_object_aws(), write store$file$hash <- digest_chr64(head$ETag) just above line 249 in 13470ef.
- At the end of store_aws_hash(), return digest_chr64(head$ETag) instead of head$Metadata[["targets-hash"]].
- Test that the correct ETags get into the metadata and that the correct ETags are retrieved by store_aws_hash() to assert that up-to-date targets are indeed up to date.
- Repeat all of the above for GCS.
Taking a step back: this is actually feasible if targets can ignore version IDs. There could be a tar_option_set()-level option to either check or ignore version IDs. Things to consider:
- Should the option be at the level of tar_option_set() and not tar_target()? At first glance, I think so because caching happens in bulk. Maybe the level of tar_resources_aws() could technically work, but those options are all implicitly target-level, which would be counterintuitive even with good documentation.
- Should the version check still be enabled by default? I think so, for compatibility. But it will be slow.
Taking another step back: targets should:
1. Always use the version ID when downloading data, and
2. Always ignore the version ID when checking the hash.
(1) ensures behavior is clear, consistent, compatible, and version-aware. (2) ensures a target reruns if it is not the current object in the bucket. (2) also makes this issue much easier to implement, and it lets us avoid adding a new version argument to tar_resources_aws(). The outcomes will be:
- Pipelines with cloud targets will run dramatically faster.
- The rules for checking/rerunning outdated targets will take into account which objects are the latest versions in the bucket. This makes more conceptual sense.
- Users won't need to do anything extra.
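The two rules above can be sketched together. In this hypothetical model (plain dicts standing in for the metadata record and the bucket state, not the actual targets internals), the version ID is consulted only on download, while the up-to-date check compares ETags alone:

```python
def is_up_to_date(record, current_etag):
    """Rule (2): ignore the version ID; skip the target only when the
    bucket's current ETag matches the one recorded in the metadata."""
    return record["etag"] == current_etag

def download_args(record):
    """Rule (1): always pin the exact recorded version when downloading,
    so reverting to historical metadata still fetches the right data."""
    return {"Key": record["key"], "VersionId": record["version"]}

record = {"key": "x", "etag": "XYZ", "version": "ABC"}
print(is_up_to_date(record, "XYZ"))  # True: skip the target
print(is_up_to_date(record, "OLD"))  # False: rerun the target
print(download_args(record))         # version-aware download request
```

Because the up-to-date check needs only ETags, it can run off a single bulk listing, which is where the "dramatically faster" outcome comes from.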