
flyteorg / flyte


Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.

Home Page: https://flyte.org

License: Apache License 2.0

Makefile 0.28% Shell 0.85% Python 0.95% Dockerfile 0.06% Mustache 0.16% Go 97.53% Smarty 0.13% HTML 0.01% Batchfile 0.01% Rust 0.01%
data data-analysis data-science dataops declarative fine-tuning flyte golang grpc kubernetes kubernetes-operator llm machine-learning mlops orchestration-engine production production-grade python scale workflow

flyte's People

Contributors

akhurana001, anandswaminathan, andrewwdye, bnsblue, byronhsu, chanadian, cosmicbboy, davidmirror-ops, ddl-ebrown, eapolinario, enghabu, flyte-bot, future-outlier, goreleaserbot, hamersaw, honnix, jeevb, katrogan, kumare3, mayitbeegh, migueltol22, neverett, pingsutw, pmahindrakar-oss, samhita-alla, sandragh5, smritisatyanv, surindersinghp, wild-endeavor, yindia


flyte's Issues

Default timeout policy

Right now, if a container is misconfigured, the job sticks around forever. Propeller should garbage-collect the execution and mark it failed.

Handle edge cases around schedule updates

Background: we don't have any transactional guarantees for the case where a schedule rule in CloudWatch is deleted but the subsequent database update fails. We return an error and the user can retry (the delete call to CloudWatch is idempotent), but unless the user does retry, we have no guarantee of being in a non-corrupt state.

We could update the scheduled-workflow event dequeuing logic to trigger a call to delete a rule when no active launch plan versions exist. Unfortunately, this exposes a possible race condition when an end-user calls disable in one step and then enable separately after that.

As a solution, [~matthewsmith] proposed adding an epoch to schedule names to distinguish them. Since we already want to make schedule names more descriptive (with some kind of truncated project & domain in the name), that work can fall under this work item.
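The epoch-suffix idea could look something like this sketch (TypeScript for illustration; the truncation length, separator, and exact naming scheme are all assumptions, not the actual proposal):

```typescript
// Hypothetical schedule-name builder: truncated project & domain for
// readability, plus an epoch suffix to distinguish successive
// enable/disable cycles. All formatting choices here are illustrative.
function scheduleName(project: string, domain: string, epoch: number): string {
  const trunc = (s: string): string => s.slice(0, 8);
  return `${trunc(project)}_${trunc(domain)}_${epoch}`;
}
```

A rule created after a disable/enable cycle gets a new epoch, so a stale delete aimed at the old name can no longer race with the newly created rule.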

Switch flyteidl output to be commonjs

flyteidl is currently output as an ES6 module, which makes it incompatible with Node.js unless it is run through webpack first. There's no real reason to do it that way, and protobufjs supports CommonJS output, so we should switch to that.

Node Validators

It should be possible to specify pre and post validators on nodes to prevent advancement of a node (or cache poisoning) if the input/output data does not match standards.

Allow download of Inputs / Outputs

It's unclear exactly what format things should be in, but for I/O types like CSV/Blob/Schema we should be able to provide a download link for the user.

Options:

  1. Convert it to a signed S3 link. This is probably not the right move, because we need to verify the identity of a user before allowing them to download.
  2. Convert the s3:// protocol to an actual S3 link. It would be up to the user to ensure they are assuming the correct role to be able to download the file.

Likely it will be option 2.

For things like CSV list, we have to consider how to display a list of these items.
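For option 2, the rewrite is a simple string transformation. A minimal sketch (the virtual-hosted-style URL format is a general S3 convention; the function name is made up):

```typescript
// Hypothetical sketch of option 2: rewrite an s3:// URI into a browsable
// HTTPS URL (virtual-hosted style). The caller is responsible for having
// the right role/credentials to actually download the object.
function s3UriToHttpsUrl(uri: string): string {
  const match = /^s3:\/\/([^/]+)\/(.+)$/.exec(uri);
  if (!match) {
    throw new Error(`Not an s3:// URI: ${uri}`);
  }
  const [, bucket, key] = match;
  return `https://${bucket}.s3.amazonaws.com/${key}`;
}
```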

Figure out validation / default value implementation for JS

Problem:

The messages coming back from the API are decoded by protobufjs. But since all the fields in a proto message are optional by convention, we have no assurance that the records are valid and usable. This has caused errors on the client side before.

Solution options:

  1. Manual validation of the records and type-casting (message as X) or type-guarding (message is X) to the stricter types present on the client side. This has the advantage of being flexible in the UI requirements, and the disadvantage of being difficult to keep in sync with the protobuf source of truth.
  2. Automated validation via some type of schema definition stored on the client side (JSON Schema is one such option). This has the advantage of generating consistent code on the client side which is kept up to date automatically as the schema is updated, as well as providing a schema document that can be used to validate the JSON output from the API. It has the same disadvantage of being a separate solution which must be updated manually any time the API contract changes.
  3. Switch the console to use protoc-generated JS/TS libraries and decorate all protobuf messages with the appropriate validation. This has the advantage of the validation rules being identical on both server and client (and updating automatically), as well as providing a generic solution for validation (call validate() on the message class coming back from the server). It has the disadvantage of requiring a non-trivial amount of work: switching from protobufjs to protoc, enabling the TS output from protoc, and updating console code to work with the new typings and decoding strategy.

Option 3 is ideal, but the amount of work necessary to do so is concerning (especially considering it may not work correctly and we might have to back it out).
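For reference, the manual type-guarding in option 1 might look like this sketch; the `Workflow` shape here is illustrative, not the real flyteidl type:

```typescript
// Illustrative client-side type for a decoded protobuf message.
interface Workflow {
  id: string;
  name: string;
}

// User-defined type guard: narrows an arbitrary decoded message to the
// stricter client-side type, rejecting records with missing fields.
function isWorkflow(message: unknown): message is Workflow {
  const m = message as Partial<Workflow> | null;
  return m != null && typeof m.id === "string" && typeof m.name === "string";
}
```

The downside the bullet mentions is visible here: every field check must be updated by hand whenever the protobuf definition changes.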

Sorting/filtering by inputs

It's useful to filter executions down by the value of certain inputs. For instance, if a workflow takes a region code as an input and is run frequently with different values for the region code, a user may want to only see executions using one given value of that code ("SEA").

This functionality will require a design spec, since workflows may have many inputs of varying types and indexing across those types and values is non-trivial.

Note: There is an internal design document that could be cleaned up and moved to public in order to provide guidance for this item.

Implement Launch Plan details

This will probably be similar to Workflow Version details, in that it will show information from the closure. But it may not show the graph, or it may optionally allow a user to show a graph view of the workflow at that version.

TODO: Determine which details of a LP are useful to show.

Rework dynamic node relationships in data model

Admin currently allows tasks to be parents of other nodes (1->many) and nodes to be parents of other tasks (1-1). This has led to some confusion/assumptions:

  • While tasks do yield nodes, the tasks finish executing well before those nodes start, so it's not entirely accurate to have this task->node parent relationship.
  • Due to how they are currently presented in the data model, the nested UX looks confusing, with the task row showing success and sub-rows showing running (indicating the yielded nodes are still running).

We have talked separately on different occasions about how this should ideally be represented. This task is to track the concrete steps towards a better model.

Add Auth to Console

Admin handles most of the auth flow. Console needs to properly handle 401 responses and redirect to the auth flow to refresh cookies.

Graph Enhancements

This is to cover any overflow / nice-to-haves on the graph implementation after the initial usable version. Some ideas:

  • Diving into layers of the graph (i.e. expanding subworkflow nodes inline)
  • Zooming/panning
  • Hover animations, including highlighting data flow in adjacent nodes
  • Animations on nodes in progress
  • Different rendering for nodes which were not executed

Replace loading indicators

We want to make some updates to the way we load items:

  • Show no loading indicator if the request returns within 1 second
  • After 1 second, show a shimmer/skeleton state

TODO: Document all the places where we currently use loading spinners.
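The two bullets above reduce to a small piece of state logic. A sketch as a pure function for clarity (a real implementation would drive `elapsedMs` with a timer or React hook; the names are made up):

```typescript
// What to render while a request is in flight: nothing for the first
// second, then a shimmer/skeleton until the request resolves.
type LoadingDisplay = "none" | "skeleton" | "content";

function loadingDisplay(elapsedMs: number, resolved: boolean): LoadingDisplay {
  if (resolved) return "content";
  return elapsedMs < 1000 ? "none" : "skeleton";
}
```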

Breadcrumbs for the UI

We need to determine what info should be available in the breadcrumbs.

  1. Show project, domain, entity type, (sub-entity type), version. In this case, sub-entity is something like an execution or launch plan belonging to a particular workflow.
  2. Show a static project/domain combo just to set context, but don't make them links, then show the same as in #1.
  3. Leave out project/domain entirely.

Move flytegraph into a separate package

The graph components in the console are designed to be a reusable package, but while it's under active development I'm leaving it inside the flyteconsole repo. This ticket is for tracking the work to be done to publish it as a standalone package.

Filter/view executions by SHA in Flyte 2.0 UI

Already in the CLI:

flyte-cli -h flyte.lyft.net -p flytekit -d development list-executions -f "eq(workflow.version,gitsha)"

This is to track potential for this in the UI.

Customer notes:

NOTE: The UI can already filter executions by Version, but we don't show versions in the executions table. The work here is mostly for adding that.

Will require a small amount of UX work to determine how to surface versions in the table rows.

Expanded error message collapses when scrolling out of view

  1. Find an execution in the executions table (workflow details page) that has a long error message.
  2. Click to expand the error message.
  3. Scroll the row out of view.
  4. Scroll the row back into view.

Expected: The error message should still be expanded.

Actual: The error message renders collapsed, but the row is still the size that it would be with the error message expanded. Now the content sits in the middle of a row that is too tall.

Support additional input types in the Launch UI

We don't currently support list/map or some of the less common types. This task is to at least implement list/map and explore if there is anything we can do about supporting the other types.

Console sends `undefined` instead of `false` for unchecked toggle switches

For workflows which take boolean values, the Console renders a toggle switch. When the toggle remains switched to "off", the resulting computed value is undefined instead of false. This translates to passing no value for the input when making the launch request.
For required inputs with no default value, that results in a 400.

At the very least, if a boolean value is required and has no default, we should be translating an unchecked toggle to false to make sure the launch request succeeds.

Once default values are implemented for the form, this should become less of an issue.
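The proposed fix is a one-line coercion applied when building the launch request. A sketch (the function name is illustrative):

```typescript
// Coerce an unchecked toggle (undefined) to an explicit false so the
// launch request always carries a value for required boolean inputs.
function coerceBooleanInput(value: boolean | undefined): boolean {
  return value ?? false;
}
```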

Parallel/Map Node

Allow loose parallelism as a native part of the Flyte spec.  In other words, allow a 'parallel node' to take a list of inputs and map the work out to replicas of the same executable: task, workflow, or launch plan.

Hotkeys

There are probably some hotkeys worth implementing. This is a placeholder to determine what those should be.

Execution IDs aren't copy-pastable across UI, CLI

The full execution ID is ex:project:domain:id.

In the UI we only show the last portion ("id").

The CLI requires the full "ex:project:domain:id", meaning you can't easily copy-paste between the two.

Request from pricing.
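One possible UI-side fix is to offer the full ID for copying. A sketch of the formatting, assuming the fields shown above (the interface shape is illustrative, not the real flyteidl type):

```typescript
// Illustrative execution identifier shape.
interface ExecutionId {
  project: string;
  domain: string;
  name: string;
}

// Build the full "ex:project:domain:id" string the CLI expects, so the UI
// can expose it via a copy button instead of showing only the last portion.
function fullExecutionId(id: ExecutionId): string {
  return `ex:${id.project}:${id.domain}:${id.name}`;
}
```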

HTTP 400 returned when attempting to retrieve data for NodeExecution child of a Dynamic Task

Update:

This is a UI bug. We should not attempt to retrieve inputs if no inputsUri is set, and should not attempt to retrieve outputs if closure.outputsUri is unset. 


Direct child

[https://flyte.lyft.net/api/v1/data/node_executions/flytekit/production/y9n8xi9amd/task1-b0e1be7f74-h-task-sqb5710215b84d56d6770b72f5e3cd4f797910c6e6-0-0]

Grandchild (nested subtask)

[https://flyte.lyft.net/api/v1/data/node_executions/flytekit/production/y9n8xi9amd/task1-b0e1be7f74-h-task-sqb5710215b84d56d6770b72f5e3cd4f797910c6e6-0-0-78d085b30a--sub-taskb5710215b84d56d6770b72f5e3cd4f797910c6e6-0-0]

The above URLs should both return NodeExecution data for the IDs provided, but instead they return an error: "invalid URI".
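The UI-side guard described in the update above might look like this sketch (the entity shapes are simplified stand-ins for the real flyteidl types):

```typescript
// Simplified stand-ins for the NodeExecution shapes.
interface NodeExecutionClosure {
  outputsUri?: string;
}
interface NodeExecution {
  inputsUri?: string;
  closure: NodeExecutionClosure;
}

// Only fetch inputs/outputs when the corresponding URI is actually set,
// avoiding the 400 for dynamic-task children that have no data URIs.
function shouldFetchInputs(ne: NodeExecution): boolean {
  return !!ne.inputsUri;
}
function shouldFetchOutputs(ne: NodeExecution): boolean {
  return !!ne.closure.outputsUri;
}
```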

 

Update visuals used for errors

This is a task to audit our usage of error messages.

  • Ensure that all places where we use error messages are using an appropriately sized component
  • Evaluate messaging used
  • Discover any views/components which currently do not use error messages in their failure states and update them

Plugin Default Behavior Update

{"json":\{"exec_id":"","node":"","ns":"-development","routine":"worker-13","src":"handler.go:216","tasktype":"spark","wf":"***.SparkTasksWorkflow"}

,"level":"warning","msg":"No plugin found for Handler-type [spark], defaulting to [container]","ts":"2019-11-11T21:09:36Z"}

Defaulting Spark to container doesn't make sense; ideally we should fail cleanly at the Propeller level and expose the error to users, instead of executing it as a container task and leading to unknown/weird container failures. I think this also applies to other tasks like Hive/Sidecar.

Audit of UI / UX tests

We need a story around what types of testing we are doing for the UI, and an update of the existing test coverage to move toward that goal.
Right now, we have a mixture of tests implemented with react-testing-library, Enzyme(?), and react-test-renderer (mostly snapshots which we don't really need).

The target will be:

  • Use react-testing-library for all unit/component tests.
  • Remove Enzyme / react-test-renderer
  • Make a decision on whether we need any integration / end-2-end / automated UI testing (something like Cypress / Browserstack / etc.)
  • Choose a target for code coverage and open one or more issues to track hitting that target.

Better document the local testing story

The local testing story is weak; we can do a better job documenting tips for how to improve it.

Our initial idea is that the pyflyte execute command can be run locally, but this has some problems: it uses an auto-deleting temp dir, it might mess up real outputs in S3, etc.

We'll play around with stuff and at least come up with some short term workarounds.

Parallel Node (Propeller Side)

TCS are excited for the native parallelization offered in Flyte 2.0. This task is for the propeller side execution of parallel nodes.

Support specifying notifications when launching workflows via the UI

The inputs for launching a workflow accept a Notifications field, which can be used to specify notification rules for specific states. It's a little complicated (it can be email, PD, or Slack, to multiple recipients, for multiple states), so we'll tackle it as a separate task.

Render Logs directly in the UI

We have enough information from the execution entity to make calls directly to AWS to retrieve log stream events.

Accessing log streams requires specific permissions. These won't exist on the client (nor should they). But the server side could be granted that role and be a proxy for the logs.

So it might look something like this:

  • Client makes a request to UI server side to open logs for a specific execution, passing the execution ID. This opens a long-lived TCP request which will be used to stream the log back to the client
  • Server-side opens a connection to AWS to get the log stream for that execution. These have to be retrieved in chunks. Server-side begins streaming the chunks to the client
  • Server-side listens for (pings? Can AWS do push for these?) additional log stream lines and pushes them to the client as they are discovered.

Questions/Concerns:

  • This could be simpler if there was a way for the UI to retrieve a temporary token to use for AWS access. Can the server generate one of these and return it?
  • How do we know when the log stream has ended and we can close the connection to the client? Can we check for a specific string in it? 
  • Each one of these will consume a connection to the server and hold it open for what could be a long time. This could cause resource constraints, but we can always scale the UI servers to accommodate.
  • Should we consider web sockets for this type of thing? We could have a mechanism where, while an active websocket connection is open watching a particular execution, the server-side will continue to poll for the latest logs and deliver them to whatever listeners are active. This has the benefit of only making the requests to AWS once if there are multiple listeners
  • If we do use Websockets, this functionality is almost complicated enough to warrant spinning up a separate service to handle it.
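The chunked forwarding loop in the second bullet above might be sketched like this (the `fetchPage` callback stands in for an AWS GetLogEvents-style paging call; nothing here is a real AWS SDK API):

```typescript
// One page of log events plus an optional continuation token.
interface LogPage {
  events: string[];
  nextToken?: string;
}

type FetchPage = (token?: string) => LogPage;

// Page through the log stream chunk by chunk, forwarding each event to the
// client via `send`, until the stream reports no further token.
function streamLogs(fetchPage: FetchPage, send: (line: string) => void): void {
  let token: string | undefined = undefined;
  do {
    const page = fetchPage(token);
    page.events.forEach(send);
    token = page.nextToken;
  } while (token !== undefined);
}
```

With websockets, one such loop per watched execution could fan its output out to all active listeners, so the AWS requests happen only once regardless of how many clients are watching.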
