Comments (29)
I really don't care for making the "items" the first class thing for multiple reasons
- Less clear we are making multiple HTTP requests. When you get back an Iterable of response objects it's much more clear the SDK is making multiple requests. This is potentially dangerous as what may work fine in a dev environment where the number of pages are few, may explode in a prod environment where results could be unbounded
- Request level options are more ambiguous. Do request timeouts apply to each request (assuming you know the SDK is making requests), the time it takes to return the iterable (i.e. the first call), or every request to the service (this is how we'd likely implement it).
- Deviates from the service's API. DescribeInstances returns a DescribeInstanceResponse, not Reservations. I feel like we are hiding what the service returns which in some cases is useful data.
- The layering just feels off to me. Iterating over pages is more powerful and more flexible. That should be our primitive and we should build niceties like item iteration on top of it, not the other way around.
- Less clear that service exceptions may be thrown in the middle of iteration. Since you don't know when additional calls are being made (if they are at all) then you just have to know to try/catch around the iteration as a whole.
I propose we flip the suggestion above and have iteration over pages as the first class thing.
public interface SdkIterable<T> extends Iterable<T> {
Stream<T> stream();
}
public interface Paginated<PageT, ItemT> extends SdkIterable<PageT> {
// stream() inherited from SdkIterable
PageT firstPage();
SdkIterable<ItemT> allItems();
// Not sure if we need this? Guess it could be nice
void forEach(BiConsumer<PageT, ItemT> consumer);
}
Example usage
//The standard use-case
for(Reservation r : client.describeInstances().allItems()) {
System.out.println(r.reservationId());
}
//The standard use-case using streams
Optional<Reservation> someId = client.describeInstances().allItems().stream()
.filter(r -> r.reservationId() == "Some Id")
.findFirst();
//I can get the first page if that's the only thing I care about
DescribeInstancesResponse firstPage = client.describeInstances().firstPage();
//If I need to have response information then I can get that
client.describeInstances().
.filter(p -> p.reservations().size() > 20)
.collect(toList());
//If I want to iterate over the items but also have the response detail for a given item
client.describeInstances().forEach((response, item) -> System.out.println(response + ":" + item));
// Can use flatmap if you're into that kind of thing
client.describeInstances().
.flatmap(r -> r.reservations().stream())
.collect(toList());
// Or nested for if that floats your boat
for(DescribeInstancesResponse response : client.describeInstances()) {
for(Reservation r : response.reservations()) {
// Do something
}
}
from aws-sdk-java-v2.
I for my part would like a simple internal iterate for de-pagination that takes a Consumer<AwsResult>
. For example:
ec2.describeInstances(request, result -> System.out::println);
Even a standard pattern of external depagination would require multiple lines instead of the above one liner. But I'm sure whatever you guys come up with will be just the ticket.
from aws-sdk-java-v2.
Do you think it would be idiomatic and easy to read if we provided easy access to a stream of the results?
Making up some syntax:
ec2.describeInstances(request).results().forEach(System.out::println);
from aws-sdk-java-v2.
Both approaches are useful - standard iterators and stream-style internal iterators. ie. ideally you should support:
ec2.describeInstances(request).results().forEach(System.out::println);
and
for (Foo foo : ec2.describeInstances(request).results()) {
// process
}
You get the stream-style method for free if implementing Iterable
.
from aws-sdk-java-v2.
//I can get the first page if that's the only thing I care about
DescribeInstancesResponse firstPage = client.describeInstances().first();
s/first()/firstPage()/
from aws-sdk-java-v2.
Related : aws/aws-sdk-java#1239
from aws-sdk-java-v2.
I think the problem with this approach is that when invoking the operation it's not obvious to the SDK whether we should return a single result or return a stream of them. The thing I like about @rdifalco's suggested approach is that it separates the regular, single call from a call that does the pagination.
If we want to expose it more 'simply' I think we'd have to have a different operation names for Stream
s and Iterable
s:
ec2.describeInstancesAsStream(request)
.flatMap(r -> r.reservations().stream())
.filter(...)
.collect();
for (DescribeInstancesResponse response : ec2.describeInstancesAsIterable(request)) {
//do processing
}
from aws-sdk-java-v2.
We could also consider exposing the different views in the response object, to prevent having multiple methods:
ec2.describeInstances(request).allInstances() // SdkIterable<Instance> (implements Iterable<Instance>)
ec2.describeInstances(request).allInstances().stream() // Stream<Instance>
ec2.describeInstances(request).allInstances().iterator() // Iterator<Instance>
ec2.describeInstances(request).instances() // List<Instance> (single page)
We might be able to then support the DynamoDB-style pagination strategy when customers want standard collection types:
ec2.describeInstances(request).allInstances().asList(DepaginationStrategy.LOAD_ALL); // List<Instance>
from aws-sdk-java-v2.
However it is done, it would be nice if it was consistent across all calls. For example, if added to the response objects, then it would be nice if there was a base interface implemented so that the way to get all results or a single page is not custom per request type. In @millems example, I think the proposal is:
ec2.describeInstances(request).allInstances()
ec2.describeInstances(request).instances()
elb.describeLoadBalancers(request).allLoadBalancers()
elb.describeLoadBalancers(request).loadBalancers()
cloudwatch.listMetrics(request).allMetrics()
cloudwatch.listMetrics(request).metrics()
To the extent possible, I want all of them to work the same way so I can interact with the SDK in a generic way. Something like:
interface PaginatedResponse<T> {
SdkIterable<T> allResults();
List<T> singlePage();
}
ec2.describeInstances(request).allResults()
ec2.describeInstances(request).singlePage()
elb.describeLoadBalancers(request).allResults()
elb.describeLoadBalancers(request).singlePage()
cloudwatch.listMetrics(request).allResults()
cloudwatch.listMetrics(request).singlePage()
As an FYI, at Netflix there is a simple pagination helper for the 1.x sdk in use by some teams to make the access a bit more consistent.
from aws-sdk-java-v2.
One thing we're discussing having is just having an SdkIterable<T>
where T
would be the usual return type of the method for paginated APIs.
So instead of:
interface Ec2Client {
DescribeInstancesResponse describeInstances(DescribeInstancesRequest);
}
we'd have
interface Ec2Client {
SdkIterable<DescribeInstancesResponse> describeInstances(DescribeInstancesRequest);
}
interface SdkIterable<T> extends Iterable<T> {
Stream<T> stream();
}
You could then get a stream of all instances with:
Stream<Instance> allInstances = ec2.describeInstances(request).stream()
.flatMap(response -> response.instances().stream());
I'm not sure how we will handle APIs that start non-paginated and have pagination added as an option later. It's also not ideal if you just want the non-paginated portion of the response:
Owner owner = s3.listBuckets(request).iterator().next().owner(); // ???
from aws-sdk-java-v2.
Not sure I follow the s3.listBuckets
example as it can return many buckets and so the iterable seems like it would fit just fine. I would argue there is a clear distinction between list and describe calls that are expected to return many results and get calls that return a particular item. For example, with s3 I wouldn't expect s3.getS3AccountOwner
to ever return more than one item so it could just return the object. However, for ec2.describeImages
I am expecting many results and it may need pagination, but at least with 1.11.x it is not supported.
Are there any example of getXYZ calls in the 1.11.x SDK that require pagination today? If so, could/should those be listXYZ or describeXYZ instead?
from aws-sdk-java-v2.
The listBuckets
example is just illustrating what it would look like when you only need a member out of the first response in a paginated API. In this example, the buckets themselves aren't needed - just the value of the "owner" member.
The difference between Get
, List
, and Describe
is kind of nuanced in our current API standards.
Get
- Retrieves the contents of one or more resources.
Describe
- Retrieves metadata about one or more resources.
List
- Retrieves the list of instances of a given resource type.
It's possible that there are Get
APIs that could need to be paginated if they have a large set of contents associated with a particular resource, but that seems more rare than List
and Describe
results.
We don't always do a good job making sure we are using precisely the right verb in every case. Some APIs were also created before these standards were really nailed down, so the verbs aren't quite right.
Unfortunately any potentially-wrong verbs are probably here to stay, even in 2.x. We really want to remain consistent with the service-specific wire-protocol documentation and the other SDKs. It would be confusing for a customer to find the GetSprocketsRequest
in the service documentation, but ListSprocketsRequest
in the Java SDK.
from aws-sdk-java-v2.
The following are the two designs we are considering for automatic de-pagination. Please provide your feedback on these designs and let us know which one you like. If you have better ideas, share them too.
Option 1: Pagination on the Response objects
All paginated operations will return a custom iterable that can be used to iterate over multiple response objects. As you process current response, SDK will make service calls internally to get the next one.
Code sample (Print all EC2 reservation ids)
SdkIterable<DescribeInstancesResponse> responseIterable = ec2.describeInstances();
responseIterable.stream()
.flatMap(response -> response.reservations().stream())
.map(r -> r.reservationId())
.forEach(System.out::println);
Use case: You only need first response and don't want additional service calls.
Call the first() method on the response which gives you only the first page and doesn't make any more service calls internally.
SdkIterable<DescribeInstancesResponse> responseIterable = ec2.describeInstances();
DescribeInstancesResponse singleResponse = responseIterable.first();
Option 2: Pagination on both Response objects and the data structure in the response
For each paginated operation supported by service, there will be two operations in the client. One will return the iterable of paginated data structure and the other returns iterable of response objects (same as in Option 1)
a) Work directly with the paginated data structure
Example: When calling describeInstances, majority of users might be interested in the stream of reservations (instances) in the response object and not the other metadata. SDK will return an iterable of the reservations, which can be used to iterate through all the reservations in your account. SDK will make service calls internally to provide you a continuous stream of reservations.
Code Sample
SdkIterable<Reservation> reservations = ec2.describeInstances();
reservations.stream()
.map(r -> r.reservationId())
.forEach(System.out::println);
b) Work with the stream of response objects
This is same as option 1. But the name of operation will be different (The current name is not final and suggestions are welcome)
SdkIterable<DescribeInstancesResponse> responseIterable = ec2.describeInstancesResponses();
responseIterable.stream()
.flatMap(response -> response.reservations().stream())
.map(r -> r.reservationId())
.forEach(System.out::println);
from aws-sdk-java-v2.
For these SdkIterable
s, it's not clear to me what the behaviour is if I call .stream()
twice on the same object. Does it create a new chain of API calls for each stream?
from aws-sdk-java-v2.
Option 2 seems considerably more friendly in the common use case of wanting to access the data. Forcing flatMap
on everyone seems a bit mean.
However I don't see why SdkIterable
can't have both methods in option 2.
// dual generics - most use cases wouldn't actually assign this to a variable
SdkIterable<DescribeInstancesResponse, Reservation> paginated = ec2.describeInstances();
// stream() would stream the objects users generally care about
ec2.describeInstances().stream()
.map(r -> r.reservationId())
.forEach(System.out::println);
// streamResponses() would stream the responses
ec2.describeInstances().streamResponses()
.flatMap(response -> response.reservations().stream())
.map(r -> r.reservationId())
.forEach(System.out::println);
// SdkIterable would also implement Iterable
for (Reservation r : ec2.describeInstances()) {
System.out::println(r.reservationId());
}
// and toList() would return the complete list (with appropriate warnings on memory size)
ec2.describeInstances().toList();
from aws-sdk-java-v2.
Either option is fine with me, but option 2 would be more convenient for our use-cases. More detailed comments below.
I think option 1 would be more palatable if it could be flattened in a generic way. Perhaps by making the response object iterable:
class DescribeInstancesResponse implements SdkIterable<Reservation> {}
SdkIterable<DescribeInstancesResponse> responseIterable = ec2.describeInstances();
responseIterable.stream()
.flatMap(SdkIterable::stream)
.map(r -> r.reservationId())
.forEach(System.out::println);
Option 2 is probably more convenient for many use-cases, but I think it would take a bit longer for a new user to understand why the two variants of the call exist on the client. Purely from a usage standpoint, @jodastephen's proposal would probably work fine, though SdkIterable doesn't seem like the right name for that return type.
Does any of the additional metadata vary across calls? In 1.11, it looks like it is mostly just generic response metadata from AmazonWebServiceResult and the tokens used to drive pagination. Since the user doesn't need to paginate manually, the tokens probably do not need to be exposed (though maybe that is still useful if a user wanted to implement resume if there was a failure in the middle?). If it is just the generic metadata, then that could be paired separately and the DescribeInstancesResponse object could represent the overall response rather than a response for a particular HTTP request. Something like:
// Makes the signature easier to read for the user and maybe keeps options open
// if additional stuff needs to be added later with less risk of compatibility issues.
public class DescribeInstancesResponse implements SdkIterable<Reservation> {
public Stream<Reservation> stream();
public Iterator<Reservation> iterator();
public Stream<Map.Entry<ResponseMetadata, List<Reservation>>> streamWithMetadata();
}
DescribeInstancesResponse response = ec2.describeInstances();
// stream() would stream the objects users generally care about
ec2.describeInstances().stream()
.map(r -> r.reservationId())
.forEach(System.out::println);
// streamWithMetadata() would stream pairs of the metadata and list of reservations
ec2.describeInstances().streamWithMetadata()
.flatMap(entry -> entry.getValue().stream())
.map(r -> r.reservationId())
.forEach(System.out::println);
// SdkIterable would also implement Iterable
for (Reservation r : ec2.describeInstances()) {
System.out::println(r.reservationId());
}
from aws-sdk-java-v2.
Does any of the additional metadata vary across calls?
@brharrington It depends on the service. For the most part it is generic metadata. For DDB it has the very useful consumed capacity which users may definitely want to throttle their requests. https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-dynamodb/src/main/java/com/amazonaws/services/dynamodbv2/model/QueryResult.java#L89
from aws-sdk-java-v2.
For these SdkIterables, it's not clear to me what the behaviour is if I call .stream() twice on the same object. Does it create a new chain of API calls for each stream?
@MikeFHay great question - what do you think would be the expected behaviour? My initial thought was that we'd throw an IllegalStateException
, but it may be equally valid to start a new call chain as you elude to.
from aws-sdk-java-v2.
@kiiadi That's interesting, because my initial impression is that it would behave the same as calling stream()
on a Collection
, or calling StreamSupport.stream(sdkIterable.spliterator(), false);
: you get a new stream.
from aws-sdk-java-v2.
@millems would that mean that it would also refetch the data from the first request? If the first request is made when describeInstances()
is called and then stream is called multiple times with some meaningful delay in between, then it could create an odd result where the first set of data is stale and subsequent pages are current.
from aws-sdk-java-v2.
SdkIterable
probably would only make requests lazily.
from aws-sdk-java-v2.
@kiiadi well if it doesn't support multiple iteration I'd be a bit confused at it being called SdkIterable. Iterables in Java are usually expected to support multiple iteration. There are a few options here, none perfect as far as I can tell:
-
OneShotIterable: Implement Iterable but don't support multiple iteration, and accept the inconsistency with most implementations of Iterable.
-
SdkStream: Don't implement Iterable, and require users to jump a weird hoop to do a for-each loop, i.e.
for(T t : stream::iterator)
-
LazyIterable: Do support multiple traversal, and accept the inconsistency that SdkIterable return values are fully lazy, whereas non-paginated calls are eager (potentially confusing when some errors aren't thrown immediately).
-
EagerIterable: Do support multiple traversal, eagerly load the first page, but have subsequent calls to
.stream()
/.iterator()
load it again. -
Stream: Forget this whole SdkIterable thing and have all paginated operations take a request and immediately return a plain old
Stream<Response>
, avoiding much of this ambiguity but requiring users to flatMap to aStream<T>
, and again making for-each loops slightly awkward.
Personally I like option 5, but I can see the usability argument for option 4. Not sure what your feelings are on eagerly loading the first page.
from aws-sdk-java-v2.
From reading all the feedback, I think what we're saying is having the response type be something like this:
public interface Paginated<PageT, ItemT> extends Iterable<ItemT> {
Stream<ItemT> stream();
PageT firstPage();
Stream<PageT> pageStream();
Iterator<PageT> pageIterator();
void forEach(BiConsumer<PageT, ItemT> consumer);
}
In the above case the PageT
represents each individual SDK response and the ItemT
represents the thing that the response contains a list of. So for describeInstances
the signature would look like so:
Paginated<DescribeInstancesResponse, Reservation> describeInstances(DescribeInstancesRequest request);
I can then use it in the following way:
//The standard use-case
for (Reservation r : client.describeInstances()) {
System.out.println(r.reservationId());
}
//The standard use-case using streams
Optional<Reservation> someId = client.describeInstances().stream()
.filter(r -> r.reservationId() == "Some Id")
.findFirst();
//I can get the first page if that's the only thing I care about
DescribeInstancesResponse firstPage = client.describeInstances().firstPage();
//If I need to have response information then I can get that
client.describeInstances().pageStream()
.filter(p -> p.reservations().size() > 20)
.collect(toList());
//If I want to iterate over the items but also have the response detail for a given item
client.describeInstances().forEach((response, item) -> System.out.println(response + ":" + item));
In order to preserve the 'fail-fast' behaviour, the SDK would greedily request the first page (and thus ensure that the request was valid and well-formed) and this page would be used in the first call to iterator
or stream
. Subsequent calls to iterator
or stream
would create a new request chain.
from aws-sdk-java-v2.
👍 Couple of small tweaks:
- I'd push
first()
down to thePaginated
interface, we may useSdkIterable
for other things where the existence of at least one value is not necessarily guaranteed - I'd potentially rename
items()
toallItems()
or some other name that makes it clear that we're getting all items from all pages.
from aws-sdk-java-v2.
+1, made edits to the comment above.
from aws-sdk-java-v2.
I like it! Do we have an answer for APIs that start off non-paginated and have pagination added later on as an optional feature?
from aws-sdk-java-v2.
The usage experience in shorea@ suggestion looks good. It satisfies both use cases (iterating top level responses and iterating items). And there is no need to use flat map too :)
One problem with all these designs we discussed is potential backwards compatibility issue. If a service team changes an existing non-paginated operation into a paginated operation, this will change the API in the generated SDK and break customers.
Example:
Now Foo is a non-paginated operation
FooResponse foo(FooRequest); // API declaration
FooResponse response = client.foo(FooRequest); //Usage
Service team need to send large lists in the FooResponse and decides to make this a paginated operation. This cases happen occasionally. If we don't detect these cases, then the API in generated SDK will look like:
Paginated<FooResponse, ItemT> foo(FooRequest); // API declaration
Paginated<FooResponse, ItemT> response = client.foo(FooRequest); //Usage
Solutions
- SDK team can detect these changes and make customization to change the API name. So now the client will have following APIs:
FooResponse foo(FooRequest); // Old API
Paginated<FooResponse, ItemT> fooPaginated(FooRequest); // New API
The main problem with this solution is inconsistency of naming across paginated operations. Most paginated operations will have names without "Paginated" keyword (like describeInstances). But few operations that turned from non-paginated to paginated will have "Paginated" keyword. This might cause customer confusion.
The other problem is if SDK validation doesn't catch these cases, customer applications will break.
- The other solution is to always have two methods for paginated operations. I will use describeInstances example. The method with normal name (describeInstances) will be same as in v1. It is not auto-paginated and customers has to use do-while loops (or similar) to get all results. The other method (describeInstancesAutoPaginated) will be the auto-paginated design shorea@ mentioned.
DescribeInstancesResponse describeInstances(DescribeInstancesRequest); // v1 style. no auto-pagination
Paginated<DescribeInstancesResponse, Reservation> describeInstancesAutoPaginated(DescribeInstancesRequest);
// Similarly for foo API
FooResponse foo(FooRequest);
Paginated<FooResponse, ItemT> fooAutoPaginated(FooRequest);
In this solution, we don't need to worry about backwards compatibility as the method declaration (foo or describeInstances) will always be the same. More optional parameters might be added in Request/Response shapes to enable pagination when the API transitions from non-paginated to paginated.
The only problem I see with this is discoverability. When I look describeInstances API and IDE shows describeInstances method first and describeInstancesAutoPaginated next, I might not care about the second method and won't find that SDK supports the awesome auto-pagination feature.
from aws-sdk-java-v2.
I think we are closing down on the Sync discussion and converging towards a solution. Created #185 to continue the discussion for Async clients. Let us know what you think!
from aws-sdk-java-v2.
The feature is released in "2.0.0-preview-9" version. Here is a blog post about this feature with code samples. Please try out the feature and provide us your feedback. Thank you.
from aws-sdk-java-v2.
Related Issues (20)
- ClassNotFoundException after Spring Boot Native Image compilation for s3/internal/ApplyUserAgentInterceptor.java HOT 4
- CognitoIdentityProviderClient.adminInitiateAuth is consistently failing. What is missing? HOT 6
- Add GetSpotInstancesAction to retrieve action from spot instances like on demand instances HOT 4
- AsyncRequestBody: ability to set content-type HOT 5
- AWS Inspector client does not build the request correctly HOT 1
- Streaming of S3 large files using S3AsyncClient and AsyncResponseTransformer.toPublisher(), CRT vs netty backpressure HOT 10
- [Feature] Synchronized version of AwsCrtV4aSigner
- LambdaAsyncClient leads to OutOfMemoryErrors under a high load
- Support Java-based S3 multipart client as an altertivate to AWS CRT-based S3 client HOT 4
- Unable to connect to local Minio server when a proxy configuration is specified, even when the Minio server host is set in the http.nonProxyHosts.
- MediaLive SDK Archive Output Group HOT 2
- Uploading files to S3 bucket is inconsistent HOT 5
- PUT requests do not support setReadLimit option HOT 3
- Inconsistent methods for UpdateSecurityGroupRuleDescriptionsIngressRequest.Builder and AuthorizeSecurityGroupIngressRequest.Builder HOT 4
- Cognito: support for partial UpdateUserPoolClient calls HOT 1
- finalizeMultipartUpload is throwing 404 error HOT 3
- SDK Clients are `Closeable` and documented as thread safe but don't provide an `isClosed()` method HOT 1
- S3Async client fail to upload files (Service: S3, Status Code: 400, Request ID: null) HOT 1
- Getting occasional Crc32MismatchException using DynamoDbClient HOT 5
- Thread locals are lost after the control is returned from SDK methods to the client code HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from aws-sdk-java-v2.