continuous-adaptation-for-machine-learning-system-to-data-changes's Issues

Feedback from the TFX team

Definitely include the following points from the blog post:

  • Purging training examples based on distribution shifts.
  • TFMA for model evaluation.

Jiyai suggested trying out the Pusher component to replace our custom VertexUploader and Deployer:

  • Try Pusher to replace the two components we currently have for uploading to Vertex and deploying (a sketch follows below).
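
For reference, a minimal sketch of what this could look like, based on the Vertex AI flavor of the standard TFX Pusher. GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_REGION, serving_image, and the endpoint settings are placeholders, and trainer/evaluator stand for the upstream components of the training pipeline:

from tfx import v1 as tfx

# Sketch only: a Vertex-flavored Pusher that uploads the blessed model to
# Vertex AI and deploys it, in place of the custom VertexUploader/Deployer.
# GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_REGION, serving_image, and the endpoint
# settings are placeholders; trainer/evaluator are upstream pipeline components.
vertex_serving_spec = {
    'project_id': GOOGLE_CLOUD_PROJECT,
    'endpoint_name': 'prediction-endpoint',
    'machine_type': 'n1-standard-4',
}

pusher = tfx.extensions.google_cloud_ai_platform.Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    custom_config={
        tfx.extensions.google_cloud_ai_platform.ENABLE_VERTEX_KEY: True,
        tfx.extensions.google_cloud_ai_platform.VERTEX_REGION_KEY: GOOGLE_CLOUD_REGION,
        tfx.extensions.google_cloud_ai_platform.VERTEX_CONTAINER_IMAGE_URI_KEY: serving_image,
        tfx.extensions.google_cloud_ai_platform.SERVING_ARGS_KEY: vertex_serving_spec,
    },
)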

Review the notebooks

02_TFX_Training_Pipeline.ipynb
03_Batch_Prediction_Pipeline.ipynb
98_Batch_Prediction_Test.ipynb

Complete 03_Batch_Prediction_Pipeline.ipynb

The main purpose of this notebook is to build a KFP pipeline that performs the following steps:

  1. Create a batch request input file (file-list format) based on the files uploaded to a GCS bucket.
  2. Run Batch Prediction on the trained model obtained from 02_TFX_Training_Pipeline.ipynb.
  3. Measure the batch prediction model performance in terms of accuracy.
  4. If model performance < threshold (a sketch of this branching is shown below):
    • Copy the testing images to the original (previous) dataset.
    • Trigger the TFX training pipeline with the original data + newly added data.

The functional test for batch prediction is shown in a separate notebook, 98_Batch_Prediction_Test.ipynb.
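
For step 4, a minimal sketch of how the branching could be expressed in the KFP v2 DSL. evaluate_batch_predictions is a hypothetical stand-in for the evaluation step, and the 0.9 threshold is a placeholder:

from kfp.v2 import dsl
from kfp.v2.dsl import component

# Sketch only: a stand-in component that would parse the Batch Prediction
# output under `prediction_dir`, compare it with ground truth, and return accuracy.
@component
def evaluate_batch_predictions(prediction_dir: str) -> float:
    accuracy = 0.0  # placeholder for the real computation
    return accuracy

@dsl.pipeline(name='batch-prediction-pipeline-sketch')
def pipeline(prediction_dir: str):
    eval_task = evaluate_batch_predictions(prediction_dir=prediction_dir)
    # Step 4: only prepare a new span and retrigger training when accuracy
    # drops below the (placeholder) threshold.
    with dsl.Condition(eval_task.output < 0.9):
        pass  # SpanPreparator / PipelineTrigger steps would run here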

Support dynamic GCS location for newly collected dataset

Current behavior

  • The 97_Prepare_Test_Images notebook collects and uploads new image data to a fixed GCS location (i.e. gs://batch-prediction-collection-3).
  • Batch Prediction Pipeline's FileListGen only looks up that fixed GCS location to generate a FileList.
  • This is problematic because FileListGen always uses the cached result when enable_cache=True (since the GCS location input never changes, the pipeline assumes the cached result can be reused).

Updated behavior

  • Collected image data from the 97_Prepare_Test_Images notebook should be stored in a unique GCS location such as gs://batch-prediction-collection-3/YYYY-MM (if we want to store monthly datasets).
  • Batch Prediction Pipeline's FileListGen should take a GCS location as an input of type RuntimeParameter.
  • The Cloud Function from the 04_Cloud_Scheduler_Trigger notebook should calculate the latest GCS location by date, then pass it to the Batch Prediction Pipeline as a RuntimeParameter (see the sketch below).
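
A rough sketch of the Cloud Function side, assuming monthly folders; the project, region, compiled pipeline spec filename, and the 'data-gcs-prefix' parameter name are all placeholders:

from datetime import datetime, timezone

from kfp.v2.google.client import AIPlatformClient

# Sketch only: compute the latest monthly GCS location and pass it to the
# Batch Prediction Pipeline as a RuntimeParameter. Project, region, the
# compiled pipeline spec path, and the parameter name are placeholders.
def trigger_batch_prediction_pipeline(request):
    latest_prefix = ('gs://batch-prediction-collection-3/'
                     + datetime.now(timezone.utc).strftime('%Y-%m'))

    client = AIPlatformClient(project_id='my-gcp-project', region='us-central1')
    client.create_run_from_job_spec(
        'batch_prediction_pipeline.json',
        enable_caching=False,  # avoid the stale FileListGen cache described above
        parameter_values={'data-gcs-prefix': latest_prefix},
    )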

range_config doesn't work in TFX < 1.4.0

In the original concept, range_config should work with {SPAN} in ExampleGen. However, there is a bug in TFX < 1.4.0. As of writing this issue, the bug fix has been merged, but only a nightly build or a build from source works. The nightly build version should be above 1.4.0.dev20211010 (you can find the dev versions here).

When range_config works properly, we can use it together with a Resolver node to dynamically choose the range of spans the training pipeline runs on. For example, the initial pipeline run could depend on span-1 only, but when data drift is detected, the second pipeline run could use span-1 and span-2 together. In that case, we don't need to run ExampleGen for span-1 again but only for span-2; the Resolver node will reuse the ExampleGen artifacts for span-1 generated during the initial pipeline run and combine them with the new ExampleGen output for span-2.

More extensive discussion about this issue can be found in the issue from the TFX official repo.
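
Once the fix lands, a sketch of what this could look like, assuming TFX >= 1.4.0 and the span-N/train|test layout used in this project (the GCS base path is a placeholder):

from tfx import v1 as tfx

# Sketch only: read spans 1-2 in a single ImportExampleGen run using {SPAN}
# patterns plus a static range_config. The GCS base path is a placeholder.
input_config = tfx.proto.Input(splits=[
    tfx.proto.Input.Split(name='train', pattern='span-{SPAN}/train/*.tfrecord'),
    tfx.proto.Input.Split(name='val', pattern='span-{SPAN}/test/*.tfrecord'),
])

range_config = tfx.proto.RangeConfig(
    static_range=tfx.proto.StaticRange(start_span_number=1, end_span_number=2)
)

example_gen = tfx.components.ImportExampleGen(
    input_base='gs://<bucket>/cifar-10',
    input_config=input_config,
    range_config=range_config,
)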

Incorporating TFMA

@rcrowe-google

I have been thinking about ways to incorporate TFMA for the evaluation part.

Currently, we run batch prediction in order to gather the results, and then we compare them against the ground truth to check if the end accuracy is above a threshold. This is implemented in this notebook. We run the batch prediction service because we think it is a common real-world pattern too: upon the arrival of a bulk of data, we perform batch inference, collect the results, and then analyze them.

We understand that what we are doing with the PerformanceEvaluator component could be delegated to TFMA, but given that batch prediction could be an important part of the workflow, where should TFMA be incorporated?
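
For context, a minimal sketch of the TFMA side of such a gate; the label key, metric name, and 0.85 lower bound are assumptions, not the project's actual settings:

import tensorflow_model_analysis as tfma

# Sketch only: an EvalConfig that would let the TFX Evaluator enforce an
# "accuracy above a threshold" gate similar to PerformanceEvaluator.
# 'label_xf' and the 0.85 lower bound are placeholders.
eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key='label_xf')],
    slicing_specs=[tfma.SlicingSpec()],  # overall (unsliced) metrics
    metrics_specs=[
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(
                class_name='SparseCategoricalAccuracy',
                threshold=tfma.MetricThreshold(
                    value_threshold=tfma.GenericValueThreshold(lower_bound={'value': 0.85})
                ),
            )
        ])
    ],
)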

Cc: @deep-diver

`01_Dataset_Prep.ipynb`

When more data with the same distribution arrives, we can update the currently stored dataset. In this case, you should turn on GCS's versioning feature.

Worth providing a link that shows how to enable the feature mentioned here.

When more data with a different distribution arrives, we will create other directories of span-2/test and span-2/test to address data drift. In this way, we can keep data separately for easier maintenance while handling versioning separately for different SPANs.

Did you mean span-2/**train** and span-2/test?

The notebook currently has permission errors. Also,

Please note this section only works within the GCP Vertex Notebook environment due to the authentication issue. If you know how to set up GCS access privileges for TFX, please let me know.

Can't we mitigate this with `from google.colab import auth; auth.authenticate_user()` for a Colab runtime?

Also, could you share the GCS bucket with me?

Update SpanPreparator and PipelineTrigger components

SpanPreparator

Current Behaviour

  1. get the highest span number currently available (i.e. 1)
  2. add 1 to get the next span number (i.e. 2) (a sketch of steps 1-2 follows this list)
  3. create a TFRecord file from the list of raw data
  4. store the train and test TFRecord datasets in the designated locations (cifar-10/span-2/....) that ImportExampleGen consumes
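
A minimal sketch of steps 1-2, assuming the span-N directory layout under a placeholder bucket:

import re

import tensorflow as tf

# Sketch only: derive the next span number from the existing span-N directories
# (steps 1-2 above). The base directory is a placeholder.
def get_next_span_number(base_dir: str = 'gs://<bucket>/cifar-10') -> int:
    span_numbers = []
    for entry in tf.io.gfile.listdir(base_dir):
        match = re.match(r'span-(\d+)', entry)
        if match:
            span_numbers.append(int(match.group(1)))
    return max(span_numbers, default=0) + 1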

Updated Behaviour

After the fourth step, we have to add two more steps.

  1. remove the newly collected data from the original GCS location. This prevents Cloud Scheduler from repeatedly triggering the batch prediction pipeline: since the triggering condition is number of newly collected images > threshold, if we don't remove them, Cloud Scheduler will keep triggering the batch prediction pipeline even though the data has already been handled by the training pipeline.

  2. add an OutputArtifact[Dataset] to carry the new span number. It will be passed to the downstream component, PipelineTrigger, which will use it to set the RuntimeParameter (see the sketch below).
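
A minimal sketch of this hand-off, using the Output/Input artifact types from kfp.v2.dsl; the component bodies are trimmed to the hand-off itself and are not the project's actual implementations:

from kfp.v2.dsl import Dataset, Input, Output, component

# Sketch only: pass the new span number from SpanPreparator to PipelineTrigger
# through a Dataset artifact's metadata.
@component
def span_preparator(latest_span: Output[Dataset]):
    next_span = 2  # placeholder for the value computed in steps 1-2 above
    latest_span.metadata['span_number'] = next_span

@component
def pipeline_trigger(latest_span: Input[Dataset]):
    span_number = latest_span.metadata['span_number']
    # Build the span patterns (e.g. 'span-[12]/train/*.tfrecord') from
    # span_number and call create_run_from_job_spec as shown below.
    ...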

PipelineTrigger

Current Behaviour

  1. simply trigger the training pipeline with the create_run_from_job_spec function.

Updated Behaviour

  1. add InputArtifact[Dataset] to get the span number from the previous component, SpanPreparator.
  2. add parameter_values to the create_run_from_job_spec call to pass RuntimeParameters to the training pipeline. An example can be found in this notebook, or see below:
import json

# `pipelines_client` (presumably a kfp.v2.google.client.AIPlatformClient) and
# PIPELINE_DEFINITION_FILE are defined earlier in the notebook.
# Trigger the training pipeline without caching and override the ExampleGen
# input patterns so that both span-1 and span-2 are consumed.
_ = pipelines_client.create_run_from_job_spec(
    PIPELINE_DEFINITION_FILE,
    enable_caching=False,
    parameter_values={
        'input-config': json.dumps({
            'splits': [
                {'name': 'train', 'pattern': 'span-[12]/train/*.tfrecord'},
                {'name': 'val', 'pattern': 'span-[12]/test/*.tfrecord'}
            ]
        }),
        'output-config': json.dumps({})
    })
