
flyteorg / flyte


Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.

Home Page: https://flyte.org

License: Apache License 2.0

Makefile 0.28% Shell 0.85% Python 0.95% Dockerfile 0.06% Mustache 0.16% Go 97.53% Smarty 0.13% HTML 0.01% Batchfile 0.01% Rust 0.01%
data data-analysis data-science dataops declarative fine-tuning flyte golang grpc kubernetes kubernetes-operator llm machine-learning mlops orchestration-engine production production-grade python scale workflow

flyte's People

Contributors

akhurana001, anandswaminathan, andrewwdye, bnsblue, byronhsu, chanadian, cosmicbboy, davidmirror-ops, ddl-ebrown, eapolinario, enghabu, flyte-bot, future-outlier, goreleaserbot, hamersaw, honnix, jeevb, katrogan, kumare3, mayitbeegh, migueltol22, neverett, pingsutw, pmahindrakar-oss, samhita-alla, sandragh5, smritisatyanv, surindersinghp, wild-endeavor, yindia


flyte's Issues

Default timeout policy

Right now, if a container is misconfigured, the job sticks around forever. Propeller should garbage-collect the execution and mark it failed.

Handle edge cases around schedule updates

Background: we don't have any transactional guarantees for the case where a schedule rule in CloudWatch is deleted but the subsequent database update fails. We return an error and the user can retry (the delete call to CloudWatch is idempotent), but unless the user does retry, we have no guarantee of being in a non-corrupt state.

We could update the scheduled-workflow event dequeuing logic to trigger a call to delete a rule when no active launch plan versions exist. Unfortunately, this exposes a possible race condition when an end-user calls disable in one step and then enable separately after that.

As a solution, [~matthewsmith] proposed adding an epoch to schedule names to distinguish them. Since we already want to make schedule names more descriptive (with some kind of truncated project & domain in the name), that work can fall under this work item.
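The epoch-suffix idea could look something like this sketch (TypeScript for illustration; the truncation length, separator, and exact naming scheme are all assumptions, not the actual proposal):

```typescript
// Hypothetical schedule-name builder: truncated project & domain for
// readability, plus an epoch suffix to distinguish successive
// enable/disable cycles. All formatting choices here are illustrative.
function scheduleName(project: string, domain: string, epoch: number): string {
  const trunc = (s: string): string => s.slice(0, 8);
  return `${trunc(project)}_${trunc(domain)}_${epoch}`;
}
```

A rule created after a disable/enable cycle gets a new epoch, so a stale delete aimed at the old name can no longer race with the newly created rule.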

Switch flyteidl output to be commonjs

flyteidl is currently output as an ES6 module, which makes it incompatible with Node.js unless it is run through webpack first. There's no real reason to do it that way, and protobufjs supports CommonJS output, so we should switch to that.

Node Validators

It should be possible to specify pre and post validators on nodes to prevent advancement of a node (or cache poisoning) if the input/output data does not match standards.

Allow download of Inputs / Outputs

It's unclear exactly what format things should be in, but for I/O types like CSV/Blob/Schema we should be able to provide a download link for the user.

Options:

  1. Convert it to a signed S3 link. This is probably not the right move, because we need to verify the identity of a user before allowing them to download.
  2. Convert the s3:// protocol to an actual S3 link. It would be up to the user to ensure they are assuming the correct role to be able to download the file.

Likely it will be option 2.

For things like CSV list, we have to consider how to display a list of these items.
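For option 2, the rewrite is a simple string transformation. A minimal sketch (the virtual-hosted-style URL format is a general S3 convention; the function name is made up):

```typescript
// Hypothetical sketch of option 2: rewrite an s3:// URI into a browsable
// HTTPS URL (virtual-hosted style). The caller is responsible for having
// the right role/credentials to actually download the object.
function s3UriToHttpsUrl(uri: string): string {
  const match = /^s3:\/\/([^/]+)\/(.+)$/.exec(uri);
  if (!match) {
    throw new Error(`Not an s3:// URI: ${uri}`);
  }
  const [, bucket, key] = match;
  return `https://${bucket}.s3.amazonaws.com/${key}`;
}
```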

Figure out validation / default value implementation for JS

Problem:

The messages coming back from the API are decoded by protobufjs. But since all the fields in a proto message are optional by convention, we have no assurance that the records are valid and usable. This has caused errors on the client side before.

Solution options:

  1. Manual validation of the records and type-casting (message as X) or type-guarding (message is X) to the stricter types present on the client side. This has the advantage of being flexible in the UI requirements, and the disadvantage of being difficult to keep in sync with the protobuf source of truth.
  2. Automated validation via some type of schema definition stored on the client side (JSON Schema is one such option). This has the advantage of generating consistent code on the client side which is kept up to date automatically as the schema is updated, as well as providing a schema document that can be used to validate the JSON output from the API. It has the same disadvantage of being a separate solution which must be updated manually any time the API contract changes.
  3. Switch the console to use protoc-generated JS/TS libraries and decorate all protobuf messages with the appropriate validation. This has the advantage of the validation rules being identical on both server and client (and updating automatically), as well as providing a generic solution for validation (call validate() on the message class coming back from the server). It has the disadvantage of requiring a non-trivial amount of work: switching from protobufjs to protoc, enabling the TS output from protoc, and updating console code to work with the new typings and decoding strategy.

Option 3 is ideal, but the amount of work necessary to do so is concerning (especially considering it may not work correctly and we might have to back it out).
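For reference, the manual type-guarding in option 1 might look like this sketch; the `Workflow` shape here is illustrative, not the real flyteidl type:

```typescript
// Illustrative client-side type for a decoded protobuf message.
interface Workflow {
  id: string;
  name: string;
}

// User-defined type guard: narrows an arbitrary decoded message to the
// stricter client-side type, rejecting records with missing fields.
function isWorkflow(message: unknown): message is Workflow {
  const m = message as Partial<Workflow> | null;
  return m != null && typeof m.id === "string" && typeof m.name === "string";
}
```

The downside the bullet mentions is visible here: every field check must be updated by hand whenever the protobuf definition changes.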

Sorting/filtering by inputs

It's useful to filter executions down by the value of certain inputs. For instance, if a workflow takes a region code as an input and is run frequently with different values for the region code, a user may want to only see executions using one given value of that code ("SEA").

This functionality will require a design spec, since workflows may have many inputs of varying types and indexing across those types and values is non-trivial.

Note: There is an internal design document that could be cleaned up and moved to public in order to provide guidance for this item.

Implement Launch Plan details

This will probably be similar to Workflow Version details, in that it will show information from the closure. But it may not show the graph, or it may optionally allow a user to show a graph view of the workflow at that version.

TODO: Determine which details of a LP are useful to show.

Rework dynamic node relationships in data model

Admin currently allows tasks to be parents of other nodes (1->many) and nodes to be parents of other tasks (1-1). This has led to some confusion/assumptions:

  • While tasks do yield nodes, the tasks finish executing well before those nodes start, so it's not entirely accurate to have this task->node parent relationship.
  • Due to how they are currently presented in the data model, the nested UX looks confusing, with the task row showing success and sub-rows showing running (indicating the yielded nodes are still running).

We have talked separately on different occasions about how this should ideally be represented. This task is to track the concrete steps towards a better model.

Add Auth to Console

Admin handles most of the auth flow. Console needs to properly handle 401 responses and redirect to the auth flow to refresh cookies.

Graph Enhancements

This is to cover any overflow / nice-to-haves on the graph implementation after the initial usable version. Some ideas:

  • Diving into layers of the graph (i.e. expanding subworkflow nodes inline)
  • Zooming/panning
  • Hover animations, including highlighting data flow in adjacent nodes
  • Animations on nodes in progress
  • Different rendering for nodes which were not executed

Replace loading indicators

We want to make some updates to the way we load items:

  • Show no loading indicator if the request returns within 1 second
  • After 1 second, show a shimmer/skeleton state

TODO: Document all the places where we currently use loading spinners.
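The two bullets above reduce to a small piece of state logic. A sketch as a pure function for clarity (a real implementation would drive `elapsedMs` with a timer or React hook; the names are made up):

```typescript
// What to render while a request is in flight: nothing for the first
// second, then a shimmer/skeleton until the request resolves.
type LoadingDisplay = "none" | "skeleton" | "content";

function loadingDisplay(elapsedMs: number, resolved: boolean): LoadingDisplay {
  if (resolved) return "content";
  return elapsedMs < 1000 ? "none" : "skeleton";
}
```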

Breadcrumbs for the UI

We need to determine what info should be available in the breadcrumbs.

  1. Show project, domain, entity type, (sub-entity type), version. In this case, sub-entity is something like an execution or launch plan belonging to a particular workflow.
  2. Show a static project/domain combo just to set context, but don't make them links, then show the same as in #1.
  3. Leave out project/domain entirely.

Move flytegraph into a separate package

The graph components in the console are designed to be a reusable package, but while it's under active development I'm leaving it inside the flyteconsole repo. This ticket is for tracking the work to be done to publish it as a standalone package.

Filter/view executions by SHA in Flyte 2.0 UI

Already in the CLI:

flyte-cli -h flyte.lyft.net -p flytekit -d development list-executions -f "eq(workflow.version,gitsha)"

This is to track potential for this in the UI.

Customer notes:

NOTE: The UI can already filter executions by Version, but we don't show versions in the executions table. The work here is mostly for adding that.

Will require a small amount of UX work to determine how to surface versions in the table rows.

Expanded error message collapses when scrolling out of view

  1. Find an execution in the executions table (workflow details page) that has a long error message.
  2. Click to expand the error message.
  3. Scroll the row out of view.
  4. Scroll the row back into view.

Expected: The error message should still be expanded.

Actual: The error message renders collapsed, but the row is still the size that it would be with the error message expanded. Now the content sits in the middle of a row that is too tall.

Support additional input types in the Launch UI

We don't currently support list/map or some of the less common types. This task is to at least implement list/map and explore if there is anything we can do about supporting the other types.

Console sends `undefined` instead of `false` for unchecked toggle switches

For workflows which take boolean values, the Console renders a toggle switch. When the toggle remains switched to "off", the resulting computed value is undefined instead of false. This translates to passing no value for the input when making the launch request.
For required inputs with no default value, that results in a 400.

At the very least, if a boolean value is required and has no default, we should be translating an unchecked toggle to false to make sure the launch request succeeds.

Once default values are implemented for the form, this should become less of an issue.
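The proposed fix is a one-line coercion applied when building the launch request. A sketch (the function name is illustrative):

```typescript
// Coerce an unchecked toggle (undefined) to an explicit false so the
// launch request always carries a value for required boolean inputs.
function coerceBooleanInput(value: boolean | undefined): boolean {
  return value ?? false;
}
```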

Parallel/Map Node

Allow loose parallelism as a native part of the Flyte spec.  In other words, allow a 'parallel node' to take a list of inputs and map the work out to replicas of the same executable: task, workflow, or launch plan.

Hotkeys

There are probably some hotkeys worth implementing. This is a placeholder to determine what those should be.

Execution IDs aren't copy-pastable across UI, CLI

The full execution ID is ex:project:domain:id.

In the UI we only show the last portion ("id").

The CLI requires the full "ex:project:domain:id", meaning you can't easily copy-paste between the two.

Request from pricing.
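One possible UI-side fix is to offer the full ID for copying. A sketch of the formatting, assuming the fields shown above (the interface shape is illustrative, not the real flyteidl type):

```typescript
// Illustrative execution identifier shape.
interface ExecutionId {
  project: string;
  domain: string;
  name: string;
}

// Build the full "ex:project:domain:id" string the CLI expects, so the UI
// can expose it via a copy button instead of showing only the last portion.
function fullExecutionId(id: ExecutionId): string {
  return `ex:${id.project}:${id.domain}:${id.name}`;
}
```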

HTTP 400 returned when attempting to retrieve data for NodeExecution child of a Dynamic Task

Update:

This is a UI bug. We should not attempt to retrieve inputs if no inputsUri is set, and should not attempt to retrieve outputs if closure.outputsUri is unset. 


Direct child

[https://flyte.lyft.net/api/v1/data/node_executions/flytekit/production/y9n8xi9amd/task1-b0e1be7f74-h-task-sqb5710215b84d56d6770b72f5e3cd4f797910c6e6-0-0]

Grandchild (nested subtask)

[https://flyte.lyft.net/api/v1/data/node_executions/flytekit/production/y9n8xi9amd/task1-b0e1be7f74-h-task-sqb5710215b84d56d6770b72f5e3cd4f797910c6e6-0-0-78d085b30a--sub-taskb5710215b84d56d6770b72f5e3cd4f797910c6e6-0-0]

The above URLs should both return NodeExecution data for the IDs provided, but instead they return an error: "invalid URI".
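The UI-side guard described in the update above might look like this sketch (the entity shapes are simplified stand-ins for the real flyteidl types):

```typescript
// Simplified stand-ins for the NodeExecution shapes.
interface NodeExecutionClosure {
  outputsUri?: string;
}
interface NodeExecution {
  inputsUri?: string;
  closure: NodeExecutionClosure;
}

// Only fetch inputs/outputs when the corresponding URI is actually set,
// avoiding the 400 for dynamic-task children that have no data URIs.
function shouldFetchInputs(ne: NodeExecution): boolean {
  return !!ne.inputsUri;
}
function shouldFetchOutputs(ne: NodeExecution): boolean {
  return !!ne.closure.outputsUri;
}
```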

 

Update visuals used for errors

This is a task to audit our usage of error messages.

  • Ensure that all places where we use error messages are using an appropriately sized component
  • Evaluate messaging used
  • Discover any views/components which currently do not use error messages in their failure states and update them

Plugin Default Behavior Update

{"json":\{"exec_id":"","node":"","ns":"-development","routine":"worker-13","src":"handler.go:216","tasktype":"spark","wf":"***.SparkTasksWorkflow"}

,"level":"warning","msg":"No plugin found for Handler-type [spark], defaulting to [container]","ts":"2019-11-11T21:09:36Z"}

Defaulting Spark to container doesn't make sense; ideally we should fail cleanly at the Propeller level and expose the error to users, instead of executing it as a container task and leading to unknown/weird container failures. I think this also applies to other tasks like Hive/Sidecar.

Audit of UI / UX tests

We need a story around what types of testing we are doing for the UI, and an update of the existing test coverage to move toward that goal.
Right now, we have a mixture of tests implemented with react-testing-library, Enzyme(?), and react-test-renderer (mostly snapshots which we don't really need).

The target will be:

  • Use react-testing-library for all unit/component tests.
  • Remove Enzyme / react-test-renderer
  • Make a decision on whether we need any integration / end-2-end / automated UI testing (something like Cypress / Browserstack / etc.)
  • Choose a target for code coverage and open one or more issues to track hitting that target.

Better document the local testing story

The local testing story is weak; we can do a better job documenting tips for how to improve it.

Our initial idea is that the pyflyte execute command can be run locally, but this has some problems: it uses an auto-deleting temp dir, it might mess up real outputs in S3, etc.

We'll play around with stuff and at least come up with some short term workarounds.

Parallel Node (Propeller Side)

TCS are excited for the native parallelization offered in Flyte 2.0. This task is for the propeller side execution of parallel nodes.

Support specifying notifications when launching workflows via the UI

The inputs for launching a workflow accept a Notifications field, which can be used to specify notification rules for specific states. It's a little complicated (it can be email, PD, or Slack, to multiple recipients, for multiple states), so we'll tackle it as a separate task.

Render Logs directly in the UI

We have enough information from the execution entity to make calls directly to AWS to retrieve log stream events.

Accessing log streams requires specific permissions. These won't exist on the client (nor should they). But the server side could be granted that role and be a proxy for the logs.

So it might look something like this:

  • Client makes a request to UI server side to open logs for a specific execution, passing the execution ID. This opens a long-lived TCP request which will be used to stream the log back to the client
  • Server-side opens a connection to AWS to get the log stream for that execution. These have to be retrieved in chunks. Server-side begins streaming the chunks to the client
  • Server-side listens for (pings? Can AWS do push for these?) additional log stream lines and pushes them to the client as they are discovered.

Questions/Concerns:

  • This could be simpler if there was a way for the UI to retrieve a temporary token to use for AWS access. Can the server generate one of these and return it?
  • How do we know when the log stream has ended and we can close the connection to the client? Can we check for a specific string in it? 
  • Each one of these will consume a connection to the server and hold it open for what could be a long time. This could cause resource constraints, but we can always scale the UI servers to accommodate.
  • Should we consider web sockets for this type of thing? We could have a mechanism where, while an active websocket connection is open watching a particular execution, the server-side will continue to poll for the latest logs and deliver them to whatever listeners are active. This has the benefit of only making the requests to AWS once if there are multiple listeners
  • If we do use Websockets, this functionality is almost complicated enough to warrant spinning up a separate service to handle it.
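The chunked forwarding loop in the second bullet above might be sketched like this (the `fetchPage` callback stands in for an AWS GetLogEvents-style paging call; nothing here is a real AWS SDK API):

```typescript
// One page of log events plus an optional continuation token.
interface LogPage {
  events: string[];
  nextToken?: string;
}

type FetchPage = (token?: string) => LogPage;

// Page through the log stream chunk by chunk, forwarding each event to the
// client via `send`, until the stream reports no further token.
function streamLogs(fetchPage: FetchPage, send: (line: string) => void): void {
  let token: string | undefined = undefined;
  do {
    const page = fetchPage(token);
    page.events.forEach(send);
    token = page.nextToken;
  } while (token !== undefined);
}
```

With websockets, one such loop per watched execution could fan its output out to all active listeners, so the AWS requests happen only once regardless of how many clients are watching.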
