
tensorflow / tfx-addons


Developers helping developers. TFX-Addons is a collection of community projects to build new components, examples, libraries, and tools for TFX. The projects are organized under the auspices of the special interest group, SIG TFX-Addons. Join the group at http://goo.gle/tfx-addons-group

License: Apache License 2.0

Python 46.63% Jupyter Notebook 53.37%
tensorflow tfx special-interest-group machine-learning mlops neural-network python

tfx-addons's Introduction

TFX Addons


SIG TFX-Addons is a community-led open source project. As such, the project depends on public contributions, bug fixes, and documentation. This project adheres to the TensorFlow Code of Conduct. By participating, you are expected to uphold this code.

Maintainership

The maintainers of TFX Addons can be found in the CODEOWNERS file of the repo. If you would like to maintain something, please feel free to submit a PR. We encourage multiple owners for all submodules.

Installation

TFX Addons is available on PyPI for all operating systems. To install the latest version, run the following:

pip install tfx-addons

To ensure you have a compatible version of dependencies for any given project, you can specify the project name as an extra requirement during install:

pip install tfx-addons[feast_examplegen,schema_curation]

To use TFX Addons:

from tfx import v1 as tfx
import tfx_addons as tfxa

# Then you can access any project under tfxa.{project_name}. For example:

tfxa.feast_examplegen.FeastExampleGen(...)

TFX Addons projects

Check out proposals for a list of existing and upcoming TFX Addons project proposals.

Tutorials and examples

See examples/ for end-to-end examples of various addons.

Contributing

TFX Addons is a community-led project. Please have a look at our contributing and development guides if you want to contribute to the project: CONTRIBUTING.md

Meeting cadence:

We meet bi-weekly on Wednesday. Check out our Meeting notes and join [email protected] to get invited to the meeting.

Package releases

Check out RELEASE.md to learn how TFX Addons is released.

Resources


tfx-addons's Issues

TFX MLMD Client library

Currently it's hard to query MLMD because it requires a lot of knowledge about how TFX structures data internally in MLMD. There are several examples (including the Airflow TFX example and Model Cards) that have repetitive functions simply because they are not included in the core library. We can simplify MLMD querying by providing some of those functions ourselves.

cc @casassg
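As a rough illustration of the kind of helper such a client library could provide, here is a minimal sketch built on the existing ml-metadata API (the SQLite path and the 'Examples' type name are assumptions for illustration):

from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Connect to the MLMD database used by the pipeline (the SQLite path is an assumption).
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = '/path/to/metadata.sqlite'
store = metadata_store.MetadataStore(config)

def latest_artifacts_of_type(type_name):
  """Returns artifacts of the given type, most recently created first."""
  artifacts = store.get_artifacts_by_type(type_name)
  return sorted(artifacts, key=lambda a: a.create_time_since_epoch, reverse=True)

# Example: find the most recent Examples artifact produced by any ExampleGen run.
for artifact in latest_artifacts_of_type('Examples')[:1]:
  print(artifact.id, artifact.uri)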

GitHub Release Pusher

This is a project idea for a component that pushes a trained model to the GitHub Releases of a GitHub repository. I think some models are maintained in GitHub Releases. Also, this component could push the model along with the Model Card in markdown format generated by ModelCardGenerator, since it supports markdown templates out of the box.

AIC and/or BIC Component

AIC and/or BIC (Akaike and Bayesian Information Criterion) component to measure generalization error without a test set
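For reference, the two criteria the component would compute are straightforward once the model's log-likelihood, parameter count, and training-set size are known (a minimal sketch; where those values come from in the pipeline is left open):

import math

def aic(log_likelihood, num_params):
  """Akaike Information Criterion: 2k - 2*ln(L)."""
  return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, num_examples):
  """Bayesian Information Criterion: k*ln(n) - 2*ln(L)."""
  return num_params * math.log(num_examples) - 2 * log_likelihood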

GC process for MLMD

@1025KB @BrianSong
Currently the OSS version of TFX/MLMD has no garbage collection (GC), meaning that it's left up to individual developers or organizations to create their own. Since GC is something that many teams will want, this project aims to develop a standardized process for defining a GC policy and running GC. This could easily just be a cron script.

See:
google/ml-metadata#162 (comment)
tensorflow/tfx#5093
https://github.com/tensorflow/tfx/blob/4a6421b6b99f6529ef2c95437ce3f3f08a5c04d8/tfx/orchestration/metadata.py#L197

Upload Predictions to BigQuery

When running BulkInferrer, upload the results to BigQuery in a downstream component. This could actually be more general: simply upload the rows of a dataset to BigQuery.
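A minimal sketch of the Beam write such a component could perform, assuming the prediction results have already been converted to row dicts (the table name and schema are illustrative assumptions):

import apache_beam as beam

def write_rows_to_bigquery(rows, table='my_project:my_dataset.predictions'):
  """Writes row dicts to a BigQuery table (sketch; table name and schema are assumptions)."""
  with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | 'CreateRows' >> beam.Create(rows)
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            table,
            schema='label:STRING, score:FLOAT',  # assumed schema for illustration
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))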

TFLiteToFirebaseML Component

Firebase ML is a great place to store TFLite models.

With Firebase ML, we can guarantee that mobile devices are equipped with the latest ML model without explicitly embedding the model binary at compile time. We can even A/B test different versions of a model with Google Analytics.

It would be nice if there were a TFX component that helps publish TFLite models to Firebase ML.

What do you think?

A load test of the exported model

It would be helpful to have a component (either dedicated or an extension of the InfraValidator TFX pipeline component) that performs a load test of the exported model.

The expected behavior is to load the exported model, create a TensorFlow Serving endpoint, send requests to the prediction endpoint, and measure the response time. The load test could be performed using common open-source tools such as Locust or Vegeta for HTTP, or ghz for gRPC.

The motivation is that prediction time may vary with the model type and structure. With this component, we can check early whether the model will meet the business requirements for prediction time.
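A minimal Locust sketch of the kind of load the component could generate against a TensorFlow Serving REST endpoint (the model name and request payload are assumptions and depend on the model's serving signature):

# Run, for example, with: locust -f loadtest.py --host http://localhost:8501
from locust import HttpUser, task, between

class ServingUser(HttpUser):
  """Sends predict requests to a TF Serving REST endpoint and records latency."""
  wait_time = between(0.1, 0.5)

  @task
  def predict(self):
    # The payload shape is an assumption and must match the serving signature.
    self.client.post(
        '/v1/models/my_model:predict',
        json={'instances': [[1.0, 2.0, 3.0, 4.0]]})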

A SampleGen component

It would be valuable to have a component that generates a sample dataset in addition to the entire dataset, both in TFRecord format and in a columnar format such as CSV. This sample set would help with experimentation and data quality checks, primarily when working with massive datasets.

Expected behavior: create an x% sample dataset (e.g., 1%) in addition to the dataset generated by the ExampleGen component. Sampling should be done in a reproducible way.
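One way to make the sampling reproducible is to hash each serialized example with a fixed seed instead of drawing random numbers; a minimal sketch (the seed and keep-fraction handling are assumptions):

import hashlib

def keep_example(serialized_example, percent, seed=b'sample-gen'):
  """Deterministically keeps roughly `percent`% of examples via a content hash."""
  digest = hashlib.sha256(seed + serialized_example).hexdigest()
  return int(digest, 16) % 10000 < percent * 100

# Usage inside an Apache Beam pipeline, e.g. for a ~1% sample:
#   sampled = examples | beam.Filter(keep_example, percent=1.0)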

[Discussion] TFX Addons Release

Thread to bundle the discussion around TFX Addons releases.

Based on the feedback so far, here are some ideas for TFX Addons:


Feedback from @seanpmorgan (core member for https://github.com/tensorflow/addons)

Best Practices:

  • Create a RELEASE.md in your SIG folder. This improves your bus factor and makes it transparent to all what exactly goes on during a release. Here's ours.
  • Have more than one person authorized and experienced in doing the release process
  • TF-Addons actually does a "nightly" release for each PR that merges. This gives us a lot of trial release runs, so we're aware if anything is broken in the release process. Here you can see us gating the uploading of wheels in our CI: https://github.com/tensorflow/addons/blob/master/.github/workflows/release.yml#L109
  • Utilize a stored secret in the repo for releases so there's no need to share credentials amongst team members

How do you manage the long-term maintenance?

  • Wish I had a better story, but one tool we use is backport. It lets us use a bot to quickly create PRs to several release branches by simply tagging the PR with labels

Criteria for cutting new releases

Incomplete implementation of FilterPredictionToDictFn in predictions-to-bigquery component.

Expected Behavior

In executor.py, the FilterPredictionToDictFn should process an element (PredictionLog) from an Apache Beam PCollection and convert it into a dict with the input features plus the prediction label and score.

Actual Behavior

Parsing of the input features from the element is incomplete and was left as a TODO.

Steps to Reproduce the Problem

N/A

Specifications

  • Version: Python 3.9

ML Lineage UI for TFX

While all the data exists in MLMD, we currently have to rely on the orchestrator to present the executions and other metadata in a nice, organized UI. This project would provide a frontend and backend to generate web UIs from TFX metadata.

cc @casassg

[Discussion] How should we use CI for repo?

As a follow-up to the conversation with @TheMichaelHu in #43 (comment): how should we set up CI?

For now, here's what I added as pre-submit checks for anything under tfx_addons:

  • YAPF autoformatter (2 spaces + pep8) - This is similar to other repos and helps solve a lot of formatting issues/discussions.
  • isort - Helps maintain import orders nicely which helps on merge conflicts.
  • pylint - Used the TFX configuration here to make sure we have some nice python code.
  • Add-license - This is probably one that could be removed. I like it because it makes it easier to contribute code: it's easy to forget to add the license header, and this adds it for you if you have the pre-submit activated. But we could skip it on CI?

Reasoning to not add to folders under examples:

  • Each example may use separate requirements and separate structures since we have not consolidated on one. We could standardize, but then we may add too much friction.
  • Each example can add their own CI job easily using GitHub Actions.

Also, I added an automatic pytest run for jobs added to tfx_addons. I didn't add examples to pytest to avoid issues with different configurations.

What are people's thoughts on whether all these checks are enough or too much? Should we enable the pre-submit hook for the examples folders?

Overall, I took a fast, opinionated stance to get something working for now, but I'd be interested in feedback and in whether we should change things here.

EvalResultToBigQuery

Component to export the output of the Evaluator component to a BigQuery table for further analysis.

Stop pipeline when ExampleValidator finds anomalies

In cases where anomalies found by ExampleValidator should stop pipeline execution, the standard components don't have a way of stopping it. That's because none of the standard components depend on the results of ExampleValidator.

We can write a custom component that takes the ExampleAnomalies artifact from ExampleValidator, along with the Examples artifact from ExampleGen, and parses the ExampleAnomalies artifact.

  1. If no anomalies are found, the component passes through the Examples artifact from ExampleGen
  2. If any anomalies are found, the component will fail

Downstream components which depend on the Examples artifact from ExampleGen can instead depend on the Examples artifact from this new custom component. When it fails, the pipeline will stop.

An enhancement would be to include either a user code module that checks the type of anomaly and decides whether to fail, or some sort of configuration parameter.
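A minimal sketch of such a gating component using the TFX Python function-based component API (the SchemaDiff.pb layout inside the ExampleAnomalies artifact is an assumption and may differ between TFX versions; a real implementation would also copy the Examples data or its properties rather than only recording the source URI):

import os
from tfx import v1 as tfx
from tensorflow_metadata.proto.v0 import anomalies_pb2

@tfx.dsl.components.component
def AnomalyGate(
    examples: tfx.dsl.components.InputArtifact[tfx.types.standard_artifacts.Examples],
    anomalies: tfx.dsl.components.InputArtifact[tfx.types.standard_artifacts.ExampleAnomalies],
    gated_examples: tfx.dsl.components.OutputArtifact[tfx.types.standard_artifacts.Examples],
):
  """Fails if ExampleValidator reported anomalies; otherwise passes Examples through."""
  for split_dir in os.listdir(anomalies.uri):
    path = os.path.join(anomalies.uri, split_dir, 'SchemaDiff.pb')  # assumed layout
    if not os.path.exists(path):
      continue
    diff = anomalies_pb2.Anomalies()
    with open(path, 'rb') as f:
      diff.ParseFromString(f.read())
    if diff.anomaly_info:
      raise RuntimeError(f'Anomalies found in split {split_dir}: {list(diff.anomaly_info)}')
  # No anomalies: point downstream components at the upstream Examples location.
  gated_examples.uri = examples.uri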

XGBoost evaluator

While TFX can train any model (as long as you can feed it the data), the Evaluator needs a custom extractor to run evaluation jobs. An XGBoost evaluator would basically reuse the Evaluator executor but provide an XGBoost extractor internally.

MultiModelPusher

TF Serving supports hosting multiple models in one place. I want to discuss whether it would be useful to have a component that pushes multiple models from multiple Trainers.

Have model cards in `HFPusher`

The HFPusher component recently introduced in this blog post (currently in the project repository, not yet in tfx-addons' HFPusher) has a model cards feature built for the Hugging Face Hub. The reason we had to go with HF model cards is that they follow a different structure: a YAML metadata section on top to enable easy discovery, and a free-text markdown section below it. It would be good to contribute that to the HFPusher in tfx-addons. Later on, we can introduce other things if there's demand; see this issue for discussion. Also, the Hugging Face Hub client library has many functions for parsing and programmatically adding and editing model cards, so it's convenient for people too!
WDYT?

Also pinging @sayakpaul and @deep-diver.

Model Card Component to generate reports about pushed models

Expected Component Behavior

The purpose of this component is to use SchemaGen and StatisticsGen output to generate an HTML version of a model card for the produced model version after the model has been pushed to its destination.

The component is acting as a wrapper for model_card_toolkit.ModelCardToolkit.
Key parameters for ModelCardToolkit like model_card.model_parameters.data.train.graphics.description, model_card.model_parameters.data.eval.graphics.description, and/or model_card.quantitative_analysis.graphics.description are passed to the component as args as they won't change between pipeline runs.

ModelCardToolkit connects to MLMD for dynamic information such as data statistics. With all the information about the pipeline run, ModelCardToolkit generates an HTML page (based on mct.export_format() and tf.io.write_file), saves the HTML page to a provided path, and registers the component output with MLMD.

A component stub based on TFX 0.2X exists and could be used as a starting point.

  • Model cards [link]

Model load test TFX component

A TFX component that performs a load test of the exported TensorFlow model, since prediction time may vary with the model type and structure.

This component could extend a TFX InfraValidator to validate the TensorFlow model serving performance using the load test. The load test results will tell us:

  • whether the model will meet the prediction time requirements (essential in large-scale implementations),
  • whether the changes in the new model make it faster or slower compared to the baseline model.

TFX Exit Handler for Slack Notifications (or Twilio et al.)

Proposal

Since TFX 1.4, TFX supports exit handlers if the pipeline runs on Vertex (and KFP?). At @digits, we created an exit handler component to notify us about the exit state and any messages from the pipelines (failed or succeeded). If anyone finds this setup interesting, we can open source this component to TFX Addons.

This could also work as an example implementation for exit handlers and it could be used to trigger other pipelines.

Expected Component Behavior

The exit handler component would take four parameters (no input artifacts):

    final_status: tfx.dsl.components.Parameter[str],
    slack_token: tfx.dsl.components.Parameter[str],
    slack_channel_id: tfx.dsl.components.Parameter[str],
    on_failure_only: tfx.dsl.components.Parameter[int] = 0,

In addition to the final_status from the pipeline orchestrator, we currently consume the Slack token and channel ID. We added an additional flag to alert only on failures, for frequently running pipelines.

The component could support other communication channels (e.g. Twilio SMS) too.
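A minimal sketch of the component body using the parameters above and the slack_sdk client (the structure of final_status is an assumption since its real format depends on the orchestrator, and the wiring that registers the component as an exit handler on Vertex/KFP is not shown):

import json
from tfx import v1 as tfx
from slack_sdk import WebClient

@tfx.dsl.components.component
def SlackExitHandler(
    final_status: tfx.dsl.components.Parameter[str],
    slack_token: tfx.dsl.components.Parameter[str],
    slack_channel_id: tfx.dsl.components.Parameter[str],
    on_failure_only: tfx.dsl.components.Parameter[int] = 0,
):
  """Posts the pipeline's final status to a Slack channel."""
  status = json.loads(final_status)  # format assumed for illustration
  state = status.get('state', 'UNKNOWN')
  if on_failure_only and state == 'SUCCEEDED':
    return
  WebClient(token=slack_token).chat_postMessage(
      channel=slack_channel_id,
      text=f'Pipeline finished with state: {state}')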

Example Output

Below you can find a screenshot of the current, internal implementation and its output to Slack.

Pipeline Success Message: [screenshot]

Pipeline Failure Message: [screenshot]

Visualization in Google Cloud Vertex Pipelines: [screenshot]

Specifications

  • TFX Version: >= 1.4.0
  • Orchestrator Platforms: Vertex (aka KubeflowRunnerV2)

Out of Scope

  • Initial support for Slack only (unless someone wants to implement other communication platforms)
  • Decryption of credentials isn't supported through this proposal

Prefect as an orchestrator

I've been evaluating Prefect (on GitHub) as an alternative to Airflow for my team and was wondering how much effort it would take to automatically create Prefect flows from a TFX pipeline definition, the way we can automatically create DAGs for Airflow.

Since the Prefect evaluation is going really well and I'm also interested in using TFX, I want to see if I can merge the two.

Align pylintrc with the one used by TFX

In #63 we needed to modify the original pylintrc to ignore super-with-arguments in order to get the sklearn examples to pass pre-commit checks.

Fix the linter errors in the sklearn example and remove the change that was added to pylint configs.

Undersampling Component

I am thinking of implementing an undersampling custom component for TFX, as described in this issue for TFX: tensorflow/tfx#3831. This component may also include oversampling and checks to gauge the difference between high-frequency and low-frequency classes. A possible alternative to a custom component could be a dataset factory - which approach is better for this particular project remains to be seen.

CopyExampleGen component

In cases where data does not need to be shuffled, this component will avoid using a Beam job and instead do a simple copy of the data to create a dataset artifact. This will need to be a completely custom ExampleGen and not extend BaseExampleGen in order to implement this behavior.

  • Does the data need to be pre-split, or will this component also do the split?
  • Should splitting be optional?

@rclough @1025KB

Transform component using Pandas

For developers who are not using TensorFlow for their modeling, the Transform graph (one of the key advantages of TF Transform) does not apply. These users may also already be familiar with Pandas, since it's so widely used, or may want to use other libraries which accept dataframes. So the idea for this project is to develop a TF Transform-like component which uses Pandas dataframes instead of TF Transform.
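A minimal sketch of the core idea: read a split of serialized tf.Examples into a DataFrame and apply a user-supplied function that takes and returns a DataFrame (the helper name and the user-function signature are assumptions, not an existing tfx-addons API):

import pandas as pd
import tensorflow as tf

def transform_split(tfrecord_pattern, preprocessing_fn):
  """Loads serialized tf.Examples into a DataFrame and applies a user function."""
  rows = []
  # ExampleGen usually writes gzip-compressed TFRecords (an assumption here).
  files = tf.io.gfile.glob(tfrecord_pattern)
  for record in tf.data.TFRecordDataset(files, compression_type='GZIP'):
    example = tf.train.Example.FromString(record.numpy())
    row = {}
    for name, feature in example.features.feature.items():
      kind = feature.WhichOneof('kind')  # bytes_list, float_list or int64_list
      values = list(getattr(feature, kind).value)
      row[name] = values[0] if len(values) == 1 else values
    rows.append(row)
  return preprocessing_fn(pd.DataFrame(rows))

# Usage: transformed = transform_split('.../Split-train/*', lambda df: df.fillna(0))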

Per-user pseudonymization component

A component can be developed to do per-user pseudonymization, which pseudonymizes selected features while retaining consistency for each user.

First, the component is given two lists of feature names:

  1. The features which identify a user, such as first and last name, user ID, etc.
  2. The additional features which should be pseudonymized

Each user's identifier(s) are mapped to pseudonymized identifier(s), which are consistently used to replace the original identifier(s) for all examples for that user. For example, in the output the user "Barney Rubble" might always be given the name "Fred Flintstone", so that multiple examples for Barney can be analyzed as a group. This requires the creation of a map of user identifier(s) to pseudonymized user identifier(s), which can be done as data is read, without a full pass over the data. Note that different users with the first name of Barney should be given different pseudonymized first names, to avoid revealing the mapping.

Additional feature values will also be mapped and pseudonymized consistently. For example, "California" might always be given the name "Xanadu" (or rather the result of a pseudonymization algorithm, but you get the point).
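A minimal sketch of the consistent mapping described above, using a keyed hash so the same input always yields the same pseudonym without keeping a lookup table in memory (the token format and secret-key handling are assumptions):

import hashlib
import hmac

def pseudonymize(value, secret_key, prefix='user_', length=12):
  """Deterministically maps a value to an opaque token; same input, same token."""
  digest = hmac.new(secret_key, value.encode('utf-8'), hashlib.sha256).hexdigest()
  return prefix + digest[:length]

# Keying on the full identifier keeps each user's examples grouped together,
# while two different users who happen to share a first name get different tokens.
assert pseudonymize('Barney Rubble|id=42', b'key') == pseudonymize('Barney Rubble|id=42', b'key')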

Note that this is not full anonymization: it retains the information in the data while providing reasonably strong privacy protection. This kind of pseudonymization is highly recommended by the GDPR.

ExampleFilter

Component to filter the examples of a TFRecord dataset using a user-defined predicate function.

cc @sngahane
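A minimal sketch of the Beam core of such a component (the predicate signature, taking a parsed tf.train.Example and returning a bool, is an assumption):

import apache_beam as beam
import tensorflow as tf

def filter_examples(pipeline, input_pattern, output_prefix, predicate):
  """Keeps only the serialized tf.Examples for which predicate(example) is True."""
  return (
      pipeline
      | 'Read' >> beam.io.ReadFromTFRecord(input_pattern)
      | 'Filter' >> beam.Filter(lambda raw: predicate(tf.train.Example.FromString(raw)))
      | 'Write' >> beam.io.WriteToTFRecord(output_prefix, file_name_suffix='.tfrecord'))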

Fix lint issues in predictions_to_bigquery component.

Expected Behavior

Pre-commit checks pass in predictions_to_bigquery Python files.

Actual Behavior

Pre-commit checks are ignored for said files.
i.e., due to the # pylint: skip-file comment in each file.

Steps to Reproduce the Problem

pre-commit run --files tfx_addons/predictions_to_bigquery/*

Integration with HuggingFace Hub

Hi folks (cc: @rcrowe-google , @sayakpaul)

As you know, Hugging Face provides the Hub to host models and prototype applications. I think this is a nice place for several reasons:

  1. If the model is compatible with 🤗 Transformers, it can easily be loaded and used with the library. These days, lots of ML practitioners are familiar with the usage of Transformers.

  2. The HuggingFace Hub comes with the Git-LFS feature out of the box, so it is a good place to store large model files with version control in a Git manner.

  3. HuggingFace Spaces lets us write simple applications with Gradio and Streamlit, and it provides a VM to serve the applications. The VM spec is not great, but it can be improved over time. Also, before the actual production release of the model, we can let people play with experimental versions of the model and get some feedback.

To this end, I want to propose two components:

  • HFModelPusher : It pushes a model from an upstream TFX component such as Trainer to the HuggingFace Model Hub

  • HFSpacePusher : It pushes application code to the HuggingFace Space Hub. This is intended to work as the downstream step of the HFModelPusher. Some placeholders in the application template code will be replaced with the outputs of the HFModelPusher, such as model_repo_id and model_version

The image below shows how they are interconnected:

[component diagram]
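A minimal sketch of what the HFModelPusher's push step could do with the huggingface_hub client (the repo id, token handling, and local model path are assumptions; a real component would read the model path from the upstream Trainer or Pusher artifact):

from huggingface_hub import HfApi

def push_model(local_model_dir, repo_id, token):
  """Uploads a saved model directory to the Hugging Face Model Hub."""
  api = HfApi(token=token)
  api.create_repo(repo_id=repo_id, repo_type='model', exist_ok=True)
  # The returned commit info could be handed to HFSpacePusher as model_version.
  return api.upload_folder(folder_path=local_model_dir, repo_id=repo_id, repo_type='model')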

Two Vertex AI components

With regard to tensorflow/tfx#4125, I think it'd be a good idea to have two components for Vertex AI:

  • One that would import a model into Vertex AI from the location provided by Pusher.
  • One that would deploy a given model from Vertex AI to an Endpoint.

With these two components, users will be able to author end-to-end TFX pipelines, orchestrated on Vertex Pipelines, that take advantage of Vertex AI's services.

@deep-diver and I have worked on a PoC (Colab Notebook) that might be helpful for this purpose. We built the following pipeline:

[pipeline diagram]

Cc: @rcrowe-google

_parse_example error when input contains SparseTensor in predictions-to-bigquery component.

Expected Behavior

executor._parse_example should be able to parse tf.sparse.SparseTensor values that are derived from the input serialized TFRecord data.

Actual Behavior

executor._parse_example raises an error when parsing a tf.sparse.SparseTensor value due to:

value_list = tensor.numpy().tolist()

where tensor is expected to be tf.Tensor.
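A possible fix, as a sketch assuming the surrounding executor code only needs the flattened values, is to branch on the tensor type before converting to a list:

import tensorflow as tf

def _tensor_to_value_list(tensor):
  """Converts either a dense or a sparse tensor to a plain Python list of values."""
  if isinstance(tensor, tf.sparse.SparseTensor):
    # Only the stored values are kept; use tf.sparse.to_dense(tensor) if positions matter.
    return tensor.values.numpy().tolist()
  return tensor.numpy().tolist()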

Steps to Reproduce the Problem

  1. Run executor._parse_example using a serialized tf.Example proto with tf.sparse.SparseTensor value.

TFX + PyTorch Example

There are a few TFX examples showing how to train scikit-learn or JAX models, but I haven't seen an example pipeline for PyTorch.

The pipeline could use a known dataset, e.g. MNIST, ingest the data via CsvExampleGen, run the standard statistics and schema steps, perform a pseudo-transformation (a passthrough of the values) with the new PandasTransform component from tfx-addons, add a custom run_fn for PyTorch, and then add a TFMA example.

Any thoughts?

Outliers detection/removal component

Dear TFX community members,

I'm a Data Science student at the University of Padova (Italy) and I've decided to write my master's thesis on MLOps. I'm very interested in TFX and I'd like to analyze it in depth. In particular, for the experimental part my supervisor and I have in mind a Beam/Spark (MapReduce) implementation of an outlier detection algorithm, especially to deal with large datasets; we think such a preprocessing step may be helpful. We would then like to contribute to this project by creating a custom component.

Could this idea be useful in some way? Are you planning to release data mining components?

Thanks a lot for your advice.

Contact

[email protected]
