continuous-adaptation-for-machine-learning-system-to-data-changes's Issues

Feedback from the TFX team

Definitely include the following points from the blog post:

  • Purging training examples based on distribution shifts.
  • TFMA for model evaluation.

Jiyai suggested trying out the Pusher component to replace our custom VertexUploader and Deployer:

  • Try Pusher to replace the two components we currently have for uploading to Vertex and deploying (a sketch follows below).
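
For reference, a minimal sketch of what this could look like, based on the Vertex AI flavor of the standard TFX Pusher. GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_REGION, serving_image, and the endpoint settings are placeholders, and trainer/evaluator stand for the upstream components of the training pipeline:

from tfx import v1 as tfx

# Sketch only: a Vertex-flavored Pusher that uploads the blessed model to
# Vertex AI and deploys it, in place of the custom VertexUploader/Deployer.
# GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_REGION, serving_image, and the endpoint
# settings are placeholders; trainer/evaluator are upstream pipeline components.
vertex_serving_spec = {
    'project_id': GOOGLE_CLOUD_PROJECT,
    'endpoint_name': 'prediction-endpoint',
    'machine_type': 'n1-standard-4',
}

pusher = tfx.extensions.google_cloud_ai_platform.Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    custom_config={
        tfx.extensions.google_cloud_ai_platform.ENABLE_VERTEX_KEY: True,
        tfx.extensions.google_cloud_ai_platform.VERTEX_REGION_KEY: GOOGLE_CLOUD_REGION,
        tfx.extensions.google_cloud_ai_platform.VERTEX_CONTAINER_IMAGE_URI_KEY: serving_image,
        tfx.extensions.google_cloud_ai_platform.SERVING_ARGS_KEY: vertex_serving_spec,
    },
)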

Review the notebooks

02_TFX_Training_Pipeline.ipynb
03_Batch_Prediction_Pipeline.ipynb
98_Batch_Prediction_Test.ipynb

Complete 03_Batch_Prediction_Pipeline.ipynb

The main purpose of this notebook is to build a KFP pipeline that performs the following steps:

  1. Create a batch request input file (file-list format) based on the files uploaded to a GCS bucket.
  2. Run Batch Prediction on the trained model obtained from 02_TFX_Training_Pipeline.ipynb.
  3. Measure the batch prediction model performance in terms of accuracy.
  4. If model performance < threshold (a sketch of this branching is shown below):
    • Copy the testing images to the original (previous) dataset.
    • Trigger the TFX training pipeline with the original data + newly added data.

The functional test for batch prediction is shown in a separate notebook, 98_Batch_Prediction_Test.ipynb.
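
For step 4, a minimal sketch of how the branching could be expressed in the KFP v2 DSL. evaluate_batch_predictions is a hypothetical stand-in for the evaluation step, and the 0.9 threshold is a placeholder:

from kfp.v2 import dsl
from kfp.v2.dsl import component

# Sketch only: a stand-in component that would parse the Batch Prediction
# output under `prediction_dir`, compare it with ground truth, and return accuracy.
@component
def evaluate_batch_predictions(prediction_dir: str) -> float:
    accuracy = 0.0  # placeholder for the real computation
    return accuracy

@dsl.pipeline(name='batch-prediction-pipeline-sketch')
def pipeline(prediction_dir: str):
    eval_task = evaluate_batch_predictions(prediction_dir=prediction_dir)
    # Step 4: only prepare a new span and retrigger training when accuracy
    # drops below the (placeholder) threshold.
    with dsl.Condition(eval_task.output < 0.9):
        pass  # SpanPreparator / PipelineTrigger steps would run here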

Support dynamic GCS location for newly collected dataset

Current behavior

  • The 97_Prepare_Test_Images notebook collects and uploads new image data to a fixed GCS location (i.e. gs://batch-prediction-collection-3).
  • Batch Prediction Pipeline's FileListGen only looks up that fixed GCS location to generate a FileList.
  • This is problematic because FileListGen always uses the cached result when enable_cache=True (since the GCS location input never changes, the pipeline assumes the cached result can be reused).

Updated behavior

  • Collected image data from the 97_Prepare_Test_Images notebook should be stored in a unique GCS location such as gs://batch-prediction-collection-3/YYYY-MM (if we want to store monthly datasets).
  • Batch Prediction Pipeline's FileListGen should take a GCS location as an input of type RuntimeParameter.
  • The Cloud Function from the 04_Cloud_Scheduler_Trigger notebook should calculate the latest GCS location by date, then pass it to the Batch Prediction Pipeline as a RuntimeParameter (see the sketch below).
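
A rough sketch of the Cloud Function side, assuming monthly folders; the project, region, compiled pipeline spec filename, and the 'data-gcs-prefix' parameter name are all placeholders:

from datetime import datetime, timezone

from kfp.v2.google.client import AIPlatformClient

# Sketch only: compute the latest monthly GCS location and pass it to the
# Batch Prediction Pipeline as a RuntimeParameter. Project, region, the
# compiled pipeline spec path, and the parameter name are placeholders.
def trigger_batch_prediction_pipeline(request):
    latest_prefix = ('gs://batch-prediction-collection-3/'
                     + datetime.now(timezone.utc).strftime('%Y-%m'))

    client = AIPlatformClient(project_id='my-gcp-project', region='us-central1')
    client.create_run_from_job_spec(
        'batch_prediction_pipeline.json',
        enable_caching=False,  # avoid the stale FileListGen cache described above
        parameter_values={'data-gcs-prefix': latest_prefix},
    )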

range_config doesn't work in TFX < 1.4.0

In the original concept, range_config should work with {SPAN} in ExampleGen. However, there is a bug in TFX < 1.4.0. As of writing this issue, the bug fix has been merged, but only a nightly build or a build from source works. The nightly build version should be above 1.4.0.dev20211010 (you can find the dev versions here).

When range_config works properly, we can use it together with a Resolver node to dynamically choose the range of spans the training pipeline runs on. For example, the initial pipeline run could depend on span-1 only, but when data drift is detected, the second pipeline run could use span-1 and span-2 together. In that case, we don't need to run ExampleGen for span-1 again but only for span-2; the Resolver node will reuse the ExampleGen artifacts for span-1 generated during the initial pipeline run and combine them with the new ExampleGen output for span-2.

More extensive discussion about this issue can be found in the issue from the TFX official repo.
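
Once the fix lands, a sketch of what this could look like, assuming TFX >= 1.4.0 and the span-N/train|test layout used in this project (the GCS base path is a placeholder):

from tfx import v1 as tfx

# Sketch only: read spans 1-2 in a single ImportExampleGen run using {SPAN}
# patterns plus a static range_config. The GCS base path is a placeholder.
input_config = tfx.proto.Input(splits=[
    tfx.proto.Input.Split(name='train', pattern='span-{SPAN}/train/*.tfrecord'),
    tfx.proto.Input.Split(name='val', pattern='span-{SPAN}/test/*.tfrecord'),
])

range_config = tfx.proto.RangeConfig(
    static_range=tfx.proto.StaticRange(start_span_number=1, end_span_number=2)
)

example_gen = tfx.components.ImportExampleGen(
    input_base='gs://<bucket>/cifar-10',
    input_config=input_config,
    range_config=range_config,
)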

Incorporating TFMA

@rcrowe-google

I have been thinking about ways to incorporate TFMA for the evaluation part.

Currently, we run batch prediction in order to gather the results, and then we compare them against the ground truth to check if the end accuracy is above a threshold. This is implemented in this notebook. We run the batch prediction service because we think it is a common real-world pattern too: upon the arrival of a bulk of data, we perform batch inference, collect the results, and then analyze them.

We understand that what we are doing with the PerformanceEvaluator component could be delegated to TFMA, but given that batch prediction could be an important part of the workflow, where should TFMA be incorporated?
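
For context, a minimal sketch of the TFMA side of such a gate; the label key, metric name, and 0.85 lower bound are assumptions, not the project's actual settings:

import tensorflow_model_analysis as tfma

# Sketch only: an EvalConfig that would let the TFX Evaluator enforce an
# "accuracy above a threshold" gate similar to PerformanceEvaluator.
# 'label_xf' and the 0.85 lower bound are placeholders.
eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key='label_xf')],
    slicing_specs=[tfma.SlicingSpec()],  # overall (unsliced) metrics
    metrics_specs=[
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(
                class_name='SparseCategoricalAccuracy',
                threshold=tfma.MetricThreshold(
                    value_threshold=tfma.GenericValueThreshold(lower_bound={'value': 0.85})
                ),
            )
        ])
    ],
)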

Cc: @deep-diver

`01_Dataset_Prep.ipynb`

When more data with the same distribution arrives, we can update the currently stored dataset. In this case, you should turn on GCS's versioning feature.

Worth providing a link that shows how to enable the feature mentioned here.

When more data with a different distribution arrives, we will create other directories of span-2/test and span-2/test to address data drift. In this way, we can keep data separately for easier maintenance while handling versioning separately for different SPANs.

Did you mean span-2/**train** and span-2/test?

The notebook currently has permission errors. Also,

Please note this section only works within the GCP Vertex Notebook environment due to the authentication issue. If you know how to set up GCS access privileges for TFX, please let me know.

Can't we mitigate this with `from google.colab import auth; auth.authenticate_user()` for a Colab runtime?

Also, could you share the GCS bucket with me?

Update SpanPreparator and PipelineTrigger components

SpanPreparator

Current Behaviour

  1. get the highest span number currently available (i.e. 1)
  2. add 1 to get the next span number (i.e. 2) (a sketch of steps 1-2 follows this list)
  3. create a TFRecord file from the list of raw data
  4. store the train and test TFRecord datasets in the designated locations (cifar-10/span-2/....) that ImportExampleGen consumes
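
A minimal sketch of steps 1-2, assuming the span-N directory layout under a placeholder bucket:

import re

import tensorflow as tf

# Sketch only: derive the next span number from the existing span-N directories
# (steps 1-2 above). The base directory is a placeholder.
def get_next_span_number(base_dir: str = 'gs://<bucket>/cifar-10') -> int:
    span_numbers = []
    for entry in tf.io.gfile.listdir(base_dir):
        match = re.match(r'span-(\d+)', entry)
        if match:
            span_numbers.append(int(match.group(1)))
    return max(span_numbers, default=0) + 1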

Updated Behaviour

After the fourth step, we have to add two more steps.

  1. remove the newly collected data from the original GCS location. This prevents Cloud Scheduler from repeatedly triggering the batch prediction pipeline: since the triggering condition is number of newly collected images > threshold, if we don't remove them, Cloud Scheduler will keep triggering the batch prediction pipeline even though the data has already been handled by the training pipeline.

  2. add an OutputArtifact[Dataset] to carry the new span number. It will be passed to the downstream component, PipelineTrigger, which will use it to set the RuntimeParameter (see the sketch below).
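
A minimal sketch of this hand-off, using the Output/Input artifact types from kfp.v2.dsl; the component bodies are trimmed to the hand-off itself and are not the project's actual implementations:

from kfp.v2.dsl import Dataset, Input, Output, component

# Sketch only: pass the new span number from SpanPreparator to PipelineTrigger
# through a Dataset artifact's metadata.
@component
def span_preparator(latest_span: Output[Dataset]):
    next_span = 2  # placeholder for the value computed in steps 1-2 above
    latest_span.metadata['span_number'] = next_span

@component
def pipeline_trigger(latest_span: Input[Dataset]):
    span_number = latest_span.metadata['span_number']
    # Build the span patterns (e.g. 'span-[12]/train/*.tfrecord') from
    # span_number and call create_run_from_job_spec as shown below.
    ...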

PipelineTrigger

Current Behaviour

  1. simply trigger the training pipeline with the create_run_from_job_spec function.

Updated Behaviour

  1. add InputArtifact[Dataset] to get the span number from the previous component, SpanPreparator.
  2. add parameter_values to the create_run_from_job_spec call to pass RuntimeParameters to the training pipeline. An example can be found in this notebook, or see below:
import json

# `pipelines_client` (presumably a kfp.v2.google.client.AIPlatformClient) and
# PIPELINE_DEFINITION_FILE are defined earlier in the notebook.
# Trigger the training pipeline without caching and override the ExampleGen
# input patterns so that both span-1 and span-2 are consumed.
_ = pipelines_client.create_run_from_job_spec(
    PIPELINE_DEFINITION_FILE,
    enable_caching=False,
    parameter_values={
        'input-config': json.dumps({
            'splits': [
                {'name': 'train', 'pattern': 'span-[12]/train/*.tfrecord'},
                {'name': 'val', 'pattern': 'span-[12]/test/*.tfrecord'}
            ]
        }),
        'output-config': json.dumps({})
    })
