
tensorflow / tfx-addons


Developers helping developers. TFX-Addons is a collection of community projects to build new components, examples, libraries, and tools for TFX. The projects are organized under the auspices of the special interest group, SIG TFX-Addons. Join the group at http://goo.gle/tfx-addons-group

License: Apache License 2.0

Python 46.63% Jupyter Notebook 53.37%
tensorflow tfx special-interest-group machine-learning mlops neural-network python

tfx-addons's Introduction

TFX Addons


SIG TFX-Addons is a community-led open source project. As such, the project depends on public contributions, bug fixes, and documentation. This project adheres to the TensorFlow Code of Conduct. By participating, you are expected to uphold this code.

Maintainership

The maintainers of TFX Addons can be found in the CODEOWNERS file of the repo. If you would like to maintain something, please feel free to submit a PR. We encourage multiple owners for all submodules.

Installation

TFX Addons is available on PyPI for all operating systems. To install the latest version, run the following:

pip install tfx-addons

To ensure you have a compatible version of dependencies for any given project, you can specify the project name as an extra requirement during install:

pip install tfx-addons[feast_examplegen,schema_curation]

To use TFX Addons:

from tfx import v1 as tfx
import tfx_addons as tfxa

# Then you can access any project under tfxa.{project_name}. For example:

tfxa.feast_examplegen.FeastExampleGen(...)

TFX Addons projects

Check out proposals for a list of existing and upcoming TFX Addons project proposals.

Tutorials and examples

See examples/ for end-to-end examples of various addons.

Contributing

TFX Addons is a community-led project. Please have a look at our contributing and development guides if you want to contribute to the project: CONTRIBUTING.md

Meeting cadence:

We meet bi-weekly on Wednesday. Check out our Meeting notes and join [email protected] to get invited to the meeting.

Package releases

Check out RELEASE.md to learn how TFX Addons is released.

Resources


tfx-addons's Issues

TFX MLMD Client library

Currently it's hard to query MLMD because it requires a lot of knowledge about how TFX structures data internally in MLMD. There are several examples (including the Airflow TFX example and Model Cards) that have repetitive functions simply because they are not included in the core library. We can simplify MLMD querying by providing some of those functions ourselves.

cc @casassg
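As a rough illustration of the kind of helper such a client library could provide, here is a minimal sketch built on the existing ml-metadata API (the SQLite path and the 'Examples' type name are assumptions for illustration):

from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Connect to the MLMD database used by the pipeline (the SQLite path is an assumption).
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = '/path/to/metadata.sqlite'
store = metadata_store.MetadataStore(config)

def latest_artifacts_of_type(type_name):
  """Returns artifacts of the given type, most recently created first."""
  artifacts = store.get_artifacts_by_type(type_name)
  return sorted(artifacts, key=lambda a: a.create_time_since_epoch, reverse=True)

# Example: find the most recent Examples artifact produced by any ExampleGen run.
for artifact in latest_artifacts_of_type('Examples')[:1]:
  print(artifact.id, artifact.uri)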

GitHub Release Pusher

This is a project idea for a component that pushes a trained model to the GitHub Releases of a GitHub repository. I think some models are maintained in GitHub Releases. Also, this component could push the model along with the Model Card in markdown format generated by ModelCardGenerator, since it supports markdown templates out of the box.

AIC and/or BIC Component

AIC and/or BIC (Akaike and Bayesian Information Criterion) component to measure generalization error without a test set
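For reference, the two criteria the component would compute are straightforward once the model's log-likelihood, parameter count, and training-set size are known (a minimal sketch; where those values come from in the pipeline is left open):

import math

def aic(log_likelihood, num_params):
  """Akaike Information Criterion: 2k - 2*ln(L)."""
  return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, num_examples):
  """Bayesian Information Criterion: k*ln(n) - 2*ln(L)."""
  return num_params * math.log(num_examples) - 2 * log_likelihood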

GC process for MLMD

@1025KB @BrianSong
Currently the OSS version of TFX/MLMD has no garbage collection (GC), meaning that it's left up to individual developers or organizations to create their own. Since GC is something that many teams will want, this project aims to develop a standardized process for defining a GC policy and running GC. This could easily just be a cron script.

See:
google/ml-metadata#162 (comment)
tensorflow/tfx#5093
https://github.com/tensorflow/tfx/blob/4a6421b6b99f6529ef2c95437ce3f3f08a5c04d8/tfx/orchestration/metadata.py#L197

Upload Predictions to BigQuery

When running BulkInferrer, upload the results to BigQuery in a downstream component. This could actually be more general: simply upload the rows of a dataset to BigQuery.
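A minimal sketch of the Beam write such a component could perform, assuming the prediction results have already been converted to row dicts (the table name and schema are illustrative assumptions):

import apache_beam as beam

def write_rows_to_bigquery(rows, table='my_project:my_dataset.predictions'):
  """Writes row dicts to a BigQuery table (sketch; table name and schema are assumptions)."""
  with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | 'CreateRows' >> beam.Create(rows)
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            table,
            schema='label:STRING, score:FLOAT',  # assumed schema for illustration
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))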

TFLiteToFirebaseML Component

Firebase ML is a great place to store TFLite models.

With Firebase ML, we can guarantee that mobile devices are equipped with the latest ML model without explicitly embedding the model binary at compile time. We can even A/B test different versions of a model with Google Analytics.

It would be nice if there were a TFX component that helps publish TFLite models to Firebase ML.

What do you think?

A load test of the exported model

It would be helpful to have a component (either dedicated or an extension of the InfraValidator TFX pipeline component) that performs a load test of the exported model.

The expected behavior is to load the exported model, create a TensorFlow Serving endpoint, send requests to the prediction endpoint, and measure the response time. The load test could be performed using common open-source tools such as Locust or Vegeta for HTTP, or ghz for gRPC.

The motivation is that prediction time may vary with the model type and structure. With this component, we can check early whether the model will meet the business requirements for prediction time.
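A minimal Locust sketch of the kind of load the component could generate against a TensorFlow Serving REST endpoint (the model name and request payload are assumptions and depend on the model's serving signature):

# Run, for example, with: locust -f loadtest.py --host http://localhost:8501
from locust import HttpUser, task, between

class ServingUser(HttpUser):
  """Sends predict requests to a TF Serving REST endpoint and records latency."""
  wait_time = between(0.1, 0.5)

  @task
  def predict(self):
    # The payload shape is an assumption and must match the serving signature.
    self.client.post(
        '/v1/models/my_model:predict',
        json={'instances': [[1.0, 2.0, 3.0, 4.0]]})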

A SampleGen component

It would be valuable to have a component that generates a sample dataset in addition to the entire dataset, both in TFRecord format and in a columnar format such as CSV. This sample set would help with experimentation and data quality checks, primarily when working with massive datasets.

Expected behavior: create an x% sample dataset (e.g., 1%) in addition to the dataset generated by the ExampleGen component. Sampling should be done in a reproducible way.
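One way to make the sampling reproducible is to hash each serialized example with a fixed seed instead of drawing random numbers; a minimal sketch (the seed and keep-fraction handling are assumptions):

import hashlib

def keep_example(serialized_example, percent, seed=b'sample-gen'):
  """Deterministically keeps roughly `percent`% of examples via a content hash."""
  digest = hashlib.sha256(seed + serialized_example).hexdigest()
  return int(digest, 16) % 10000 < percent * 100

# Usage inside an Apache Beam pipeline, e.g. for a ~1% sample:
#   sampled = examples | beam.Filter(keep_example, percent=1.0)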

[Discussion] TFX Addons Release

Thread to bundle the discussion around TFX Addons releases.

Based on the feedback so far, here are some ideas for TFX Addons:


Feedback from @seanpmorgan (core member for https://github.com/tensorflow/addons)

Best Practices:

  • Create a RELEASE.md in your SIG folder. This improves your bus factor and makes it transparent to all what exactly goes on during a release. Here's ours.
  • Have more than one person authorized and experienced in doing the release process
  • TF-Addons actually does a "nightly" release for each PR that merges. This gives us a lot of trial release runs, so we're aware if anything is broken in the release process. Here you can see us gating the uploading of wheels in our CI: https://github.com/tensorflow/addons/blob/master/.github/workflows/release.yml#L109
  • Utilize a stored secret in the repo for releases so there's no need to share credentials amongst team members

How do you manage the long-term maintenance?

  • Wish I had a better story, but one tool we use is backport. It lets us use a bot to quickly create PRs to several release branches by simply tagging the PR with labels

Criteria for cutting new releases

Incomplete implementation of FilterPredictionToDictFn in predictions-to-bigquery component.

Expected Behavior

In executor.py, the FilterPredictionToDictFn should process an element (PredictionLog) from an Apache Beam PCollection and convert it into a dict with the input features plus the prediction label and score.

Actual Behavior

Parsing of the input features from the element is incomplete and was left as a TODO.

Steps to Reproduce the Problem

N/A

Specifications

  • Version: Python 3.9

ML Lineage UI for TFX

While all the data exists in MLMD, we currently have to rely on the orchestrator to present the executions and other metadata in a nice, organized UI. This project would provide a frontend and backend to generate web UIs from TFX metadata.

cc @casassg

[Discussion] How should we use CI for repo?

As a follow-up to the conversation with @TheMichaelHu in #43 (comment): how should we set up CI?

For now, here's what I added as pre-submit checks for anything under tfx_addons:

  • YAPF autoformatter (2 spaces + pep8) - This is similar to other repos and helps solve a lot of formatting issues/discussions.
  • isort - Helps maintain import orders nicely which helps on merge conflicts.
  • pylint - Used the TFX configuration here to make sure we have some nice python code.
  • Add-license - This is probably one that could be removed. I like it because it makes it easier to contribute code: it's easy to forget to add the license header, and this adds it for you if you have the pre-submit activated. But we could skip it on CI?

Reasoning to not add to folders under examples:

  • Each example may use separate requirements and separate structures since we have not consolidated on one. We could standardize, but then we may add too much friction.
  • Each example can add their own CI job easily using GitHub Actions.

Also, I added an automatic pytest run for jobs added to tfx_addons. I didn't add examples to pytest to avoid issues with different configurations.

What are people's thoughts on whether all these checks are enough or too much? Should we enable the pre-submit hook for the examples folders?

Overall, I took a fast, opinionated stance to get something working for now, but I'd be interested in feedback and in whether we should change things here.

EvalResultToBigQuery

Component to export the output of the Evaluator component to a BigQuery table for further analysis.

Stop pipeline when ExampleValidator finds anomalies

In cases where anomalies found by ExampleValidator should stop pipeline execution, the standard components don't have a way of stopping it. That's because none of the standard components depend on the results of ExampleValidator.

We can write a custom component that takes the ExampleAnomalies artifact from ExampleValidator, along with the Examples artifact from ExampleGen, and parses the ExampleAnomalies artifact.

  1. If no anomalies are found, the component passes through the Examples artifact from ExampleGen
  2. If any anomalies are found, the component will fail

Downstream components which depend on the Examples artifact from ExampleGen can instead depend on the Examples artifact from this new custom component. When it fails, the pipeline will stop.

An enhancement would be to include either a user code module that checks the type of anomaly and decides whether to fail, or some sort of configuration parameter.
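A minimal sketch of such a gating component using the TFX Python function-based component API (the SchemaDiff.pb layout inside the ExampleAnomalies artifact is an assumption and may differ between TFX versions; a real implementation would also copy the Examples data or its properties rather than only recording the source URI):

import os
from tfx import v1 as tfx
from tensorflow_metadata.proto.v0 import anomalies_pb2

@tfx.dsl.components.component
def AnomalyGate(
    examples: tfx.dsl.components.InputArtifact[tfx.types.standard_artifacts.Examples],
    anomalies: tfx.dsl.components.InputArtifact[tfx.types.standard_artifacts.ExampleAnomalies],
    gated_examples: tfx.dsl.components.OutputArtifact[tfx.types.standard_artifacts.Examples],
):
  """Fails if ExampleValidator reported anomalies; otherwise passes Examples through."""
  for split_dir in os.listdir(anomalies.uri):
    path = os.path.join(anomalies.uri, split_dir, 'SchemaDiff.pb')  # assumed layout
    if not os.path.exists(path):
      continue
    diff = anomalies_pb2.Anomalies()
    with open(path, 'rb') as f:
      diff.ParseFromString(f.read())
    if diff.anomaly_info:
      raise RuntimeError(f'Anomalies found in split {split_dir}: {list(diff.anomaly_info)}')
  # No anomalies: point downstream components at the upstream Examples location.
  gated_examples.uri = examples.uri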

XGBoost evaluator

While TFX can train any model (as long as you can feed it the data), the Evaluator needs a custom extractor to run evaluation jobs. An XGBoost evaluator would basically reuse the Evaluator executor but provide an XGBoost extractor internally.

MultiModelPusher

TF Serving supports hosting multiple models in one place. I want to discuss whether it would be useful to have a component that pushes multiple models from multiple Trainers.

Have model cards in `HFPusher`

The HFPusher component recently introduced in this blog post (currently in the project repository, not yet in tfx-addons' HFPusher) has a model cards feature built for the Hugging Face Hub. The reason we had to go with HF model cards is that they follow a different structure: a YAML metadata section on top to enable easy discovery, and a free-text markdown section below it. It would be good to contribute that to the HFPusher in tfx-addons. Later on, we can introduce other things if there's demand; see this issue for discussion. Also, the Hugging Face Hub client library has many functions for parsing and programmatically adding and editing model cards, so it's convenient for people too!
WDYT?

Also pinging @sayakpaul and @deep-diver.

Model Card Component to generate reports about pushed models

Expected Component Behavior

The purpose of this component is to use SchemaGen and StatisticsGen output to generate an HTML version of a model card for the produced model version after the model has been pushed to its destination.

The component is acting as a wrapper for model_card_toolkit.ModelCardToolkit.
Key parameters for ModelCardToolkit like model_card.model_parameters.data.train.graphics.description, model_card.model_parameters.data.eval.graphics.description, and/or model_card.quantitative_analysis.graphics.description are passed to the component as args as they won't change between pipeline runs.

ModelCardToolkit connects to MLMD for dynamic information such as data statistics. With all the information about the pipeline run, ModelCardToolkit generates an HTML page (based on mct.export_format() and tf.io.write_file), saves the HTML page to a provided path, and registers the component output with MLMD.

A component stub based on TFX 0.2X exists and could be used as a starting point.

  • Model cards [link]

Model load test TFX component

A TFX component that performs a load test of the exported TensorFlow model, since prediction time may vary with the model type and structure.

This component could extend a TFX InfraValidator to validate the TensorFlow model serving performance using the load test. The load test results will tell us:

  • whether the model will meet the prediction time requirements (essential in large-scale implementations),
  • whether the changes in the new model make it faster or slower compared to the baseline model.

TFX Exit Handler for Slack Notifications (or Twilio et al.)

Proposal

Since TFX 1.4, TFX supports exit handlers if the pipeline runs on Vertex (and KFP?). At @digits, we created an exit handler component to notify us about the exit state and any messages from the pipelines (failed or succeeded). If anyone finds this setup interesting, we can open source this component to TFX Addons.

This could also work as an example implementation for exit handlers and it could be used to trigger other pipelines.

Expected Component Behavior

The exit handler component would take four parameters (no input artifacts):

    final_status: tfx.dsl.components.Parameter[str],
    slack_token: tfx.dsl.components.Parameter[str],
    slack_channel_id: tfx.dsl.components.Parameter[str],
    on_failure_only: tfx.dsl.components.Parameter[int] = 0,

In addition to the final_status from the pipeline orchestrator, we currently consume the Slack token and channel ID. We added an additional flag to alert only on failures, for frequently running pipelines.

The component could support other communication channels (e.g. Twilio SMS) too.
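A minimal sketch of the component body using the parameters above and the slack_sdk client (the structure of final_status is an assumption since its real format depends on the orchestrator, and the wiring that registers the component as an exit handler on Vertex/KFP is not shown):

import json
from tfx import v1 as tfx
from slack_sdk import WebClient

@tfx.dsl.components.component
def SlackExitHandler(
    final_status: tfx.dsl.components.Parameter[str],
    slack_token: tfx.dsl.components.Parameter[str],
    slack_channel_id: tfx.dsl.components.Parameter[str],
    on_failure_only: tfx.dsl.components.Parameter[int] = 0,
):
  """Posts the pipeline's final status to a Slack channel."""
  status = json.loads(final_status)  # format assumed for illustration
  state = status.get('state', 'UNKNOWN')
  if on_failure_only and state == 'SUCCEEDED':
    return
  WebClient(token=slack_token).chat_postMessage(
      channel=slack_channel_id,
      text=f'Pipeline finished with state: {state}')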

Example Output

Below you can find a screenshot of the current, internal implementation and its output to Slack.

Pipeline Success Message: [screenshot]

Pipeline Failure Message: [screenshot]

Visualization in Google Cloud Vertex Pipelines: [screenshot]

Specifications

  • TFX Version: >= 1.4.0
  • Orchestrator Platforms: Vertex (aka KubeflowRunnerV2)

Out of Scope

  • Initial support for Slack only (unless someone wants to implement other communication platforms)
  • Decryption of credentials isn't supported through this proposal

Prefect as an orchestrator

I've been evaluating Prefect (on GitHub) as an alternative to Airflow for my team and was wondering how much effort it would take to automatically create Prefect flows from a TFX pipeline definition, the way we can automatically create DAGs for Airflow.

Since the Prefect evaluation is going really well and I'm also interested in using TFX, I want to see if I can merge the two.

Align pylintrc with the one used by TFX

In #63 we needed to modify the original pylintrc to ignore super-with-arguments in order to get the sklearn examples to pass pre-commit checks.

Fix the linter errors in the sklearn example and remove the change that was added to pylint configs.

Undersampling Component

I am thinking of implementing an undersampling custom component for TFX, as described in this issue for TFX: tensorflow/tfx#3831. This component may also include oversampling and checks to gauge the difference between high-frequency and low-frequency classes. A possible alternative to a custom component could be a dataset factory - which approach is better for this particular project remains to be seen.

CopyExampleGen component

In cases where data does not need to be shuffled, this component will avoid using a Beam job and instead do a simple copy of the data to create a dataset artifact. This will need to be a completely custom ExampleGen and not extend BaseExampleGen in order to implement this behavior.

  • Does the data need to be pre-split, or will this component also do the split?
  • Should splitting be optional?

@rclough @1025KB

Transform component using Pandas

For developers who are not using TensorFlow for their modeling, the Transform graph (one of the key advantages of TF Transform) does not apply. These users may also already be familiar with Pandas, since it's so widely used, or may want to use other libraries which accept dataframes. So the idea for this project is to develop a TF Transform-like component which uses Pandas dataframes instead of TF Transform.
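A minimal sketch of the core idea: read a split of serialized tf.Examples into a DataFrame and apply a user-supplied function that takes and returns a DataFrame (the helper name and the user-function signature are assumptions, not an existing tfx-addons API):

import pandas as pd
import tensorflow as tf

def transform_split(tfrecord_pattern, preprocessing_fn):
  """Loads serialized tf.Examples into a DataFrame and applies a user function."""
  rows = []
  # ExampleGen usually writes gzip-compressed TFRecords (an assumption here).
  files = tf.io.gfile.glob(tfrecord_pattern)
  for record in tf.data.TFRecordDataset(files, compression_type='GZIP'):
    example = tf.train.Example.FromString(record.numpy())
    row = {}
    for name, feature in example.features.feature.items():
      kind = feature.WhichOneof('kind')  # bytes_list, float_list or int64_list
      values = list(getattr(feature, kind).value)
      row[name] = values[0] if len(values) == 1 else values
    rows.append(row)
  return preprocessing_fn(pd.DataFrame(rows))

# Usage: transformed = transform_split('.../Split-train/*', lambda df: df.fillna(0))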

Per-user pseudonymization component

A component can be developed to do per-user pseudonymization, which pseudonymizes selected features while retaining consistency for each user.

First, the component is given two lists of feature names:

  1. The features which identify a user, such as first and last name, user ID, etc.
  2. The additional features which should be pseudonymized

Each user's identifier(s) are mapped to pseudonymized identifier(s), which are consistently used to replace the original identifier(s) for all examples for that user. For example, in the output the user "Barney Rubble" might always be given the name "Fred Flintstone", so that multiple examples for Barney can be analyzed as a group. This requires the creation of a map of user identifier(s) to pseudonymized user identifier(s), which can be done as data is read, without a full pass over the data. Note that different users with the first name of Barney should be given different pseudonymized first names, to avoid revealing the mapping.

Additional feature values will also be mapped and pseudonymized consistently. For example, "California" might always be given the name "Xanadu" (or rather the result of a pseudonymization algorithm, but you get the point).
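A minimal sketch of the consistent mapping described above, using a keyed hash so the same input always yields the same pseudonym without keeping a lookup table in memory (the token format and secret-key handling are assumptions):

import hashlib
import hmac

def pseudonymize(value, secret_key, prefix='user_', length=12):
  """Deterministically maps a value to an opaque token; same input, same token."""
  digest = hmac.new(secret_key, value.encode('utf-8'), hashlib.sha256).hexdigest()
  return prefix + digest[:length]

# Keying on the full identifier keeps each user's examples grouped together,
# while two different users who happen to share a first name get different tokens.
assert pseudonymize('Barney Rubble|id=42', b'key') == pseudonymize('Barney Rubble|id=42', b'key')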

Note that this is not full anonymization: it retains the information in the data while providing reasonably strong privacy protection. This kind of pseudonymization is highly recommended by the GDPR.

ExampleFilter

Component to filter the examples of a TFRecord dataset using a user-defined predicate function.

cc @sngahane
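A minimal sketch of the Beam core of such a component (the predicate signature, taking a parsed tf.train.Example and returning a bool, is an assumption):

import apache_beam as beam
import tensorflow as tf

def filter_examples(pipeline, input_pattern, output_prefix, predicate):
  """Keeps only the serialized tf.Examples for which predicate(example) is True."""
  return (
      pipeline
      | 'Read' >> beam.io.ReadFromTFRecord(input_pattern)
      | 'Filter' >> beam.Filter(lambda raw: predicate(tf.train.Example.FromString(raw)))
      | 'Write' >> beam.io.WriteToTFRecord(output_prefix, file_name_suffix='.tfrecord'))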

Fix lint issues in predictions_to_bigquery component.

Expected Behavior

Pre-commit checks pass in predictions_to_bigquery Python files.

Actual Behavior

Pre-commit checks are ignored for said files.
i.e., due to the # pylint: skip-file comment in each file.

Steps to Reproduce the Problem

pre-commit run --files tfx_addons/predictions_to_bigquery/*

Integration with HuggingFace Hub

Hi folks (cc: @rcrowe-google , @sayakpaul)

As you know, Hugging Face provides the Hub to host models and prototype applications. I think this is a nice place for several reasons:

  1. If the model is compatible with 🤗 Transformers, it can easily be loaded and used with the library. These days, lots of ML practitioners are familiar with the usage of Transformers.

  2. The HuggingFace Hub comes with the Git-LFS feature out of the box, so it is a good place to store large model files with version control in a Git manner.

  3. HuggingFace Spaces lets us write simple applications with Gradio and Streamlit, and it provides a VM to serve the applications. The VM spec is not great, but it can be improved over time. Also, before the actual production release of the model, we can let people play with experimental versions of the model and get some feedback.

To this end, I want to propose two components:

  • HFModelPusher : It pushes a model from an upstream TFX component such as Trainer to the HuggingFace Model Hub

  • HFSpacePusher : It pushes application code to the HuggingFace Space Hub. This is intended to work as the downstream step of the HFModelPusher. Some placeholders in the application template code will be replaced with the outputs of the HFModelPusher, such as model_repo_id and model_version

The image below shows how they are interconnected:

[component diagram]
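A minimal sketch of what the HFModelPusher's push step could do with the huggingface_hub client (the repo id, token handling, and local model path are assumptions; a real component would read the model path from the upstream Trainer or Pusher artifact):

from huggingface_hub import HfApi

def push_model(local_model_dir, repo_id, token):
  """Uploads a saved model directory to the Hugging Face Model Hub."""
  api = HfApi(token=token)
  api.create_repo(repo_id=repo_id, repo_type='model', exist_ok=True)
  # The returned commit info could be handed to HFSpacePusher as model_version.
  return api.upload_folder(folder_path=local_model_dir, repo_id=repo_id, repo_type='model')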

Two Vertex AI components

With regard to tensorflow/tfx#4125, I think it'd be a good idea to have two components for Vertex AI:

  • One that would import a model into Vertex AI from the location provided by Pusher.
  • One that would deploy a given model from Vertex AI to an Endpoint.

With these two components, users will be able to author end-to-end TFX pipelines, orchestrated on Vertex Pipelines, that take advantage of Vertex AI's services.

@deep-diver and I have worked on a PoC (Colab Notebook) that might be helpful for this purpose. We built the following pipeline:

[pipeline diagram]

Cc: @rcrowe-google

_parse_example error when input contains SparseTensor in predictions-to-bigquery component.

Expected Behavior

executor._parse_example should be able to parse tf.sparse.SparseTensor values that are derived from the input serialized TFRecord data.

Actual Behavior

executor._parse_example raises an error when parsing a tf.sparse.SparseTensor value due to:

value_list = tensor.numpy().tolist()

where tensor is expected to be tf.Tensor.
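A possible fix, as a sketch assuming the surrounding executor code only needs the flattened values, is to branch on the tensor type before converting to a list:

import tensorflow as tf

def _tensor_to_value_list(tensor):
  """Converts either a dense or a sparse tensor to a plain Python list of values."""
  if isinstance(tensor, tf.sparse.SparseTensor):
    # Only the stored values are kept; use tf.sparse.to_dense(tensor) if positions matter.
    return tensor.values.numpy().tolist()
  return tensor.numpy().tolist()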

Steps to Reproduce the Problem

  1. Run executor._parse_example using a serialized tf.Example proto with tf.sparse.SparseTensor value.

TFX + PyTorch Example

There are a few TFX examples showing how to train scikit-learn or JAX models, but I haven't seen an example pipeline for PyTorch.

The pipeline could use a known dataset, e.g. MNIST, ingest the data via CsvExampleGen, run the standard statistics and schema steps, perform a pseudo-transformation (a passthrough of the values) with the new PandasTransform component from tfx-addons, add a custom run_fn for PyTorch, and then add a TFMA example.

Any thoughts?

Outliers detection/removal component

Dear TFX community members,

I'm a Data Science student at the University of Padova (Italy) and I've decided to write my master's thesis on MLOps. I'm very interested in TFX and I'd like to analyze it in depth. In particular, for the experimental part my supervisor and I have in mind a Beam/Spark (MapReduce) implementation of an outlier detection algorithm, especially to deal with large datasets; we think such a preprocessing step may be helpful. We would then like to contribute to this project by creating a custom component.

Could this idea be useful in some way? Are you planning to release data mining components?

Thanks a lot for your advice.

Contact

[email protected]
