deep-diver / continuous-adaptation-for-machine-learning-system-to-data-changes
https://blog.tensorflow.org/2021/12/continuous-adaptation-for-machine.html
License: Apache License 2.0
Definitely include the following point from the blog post:
Jiyai suggested trying out the Pusher component to replace our custom VertexUploader and Deployer:
02_TFX_Training_Pipeline.ipynb
03_Batch_Prediction_Pipeline.ipynb
98_Batch_Prediction_Test.ipynb
The main purpose of this notebook is to build a KFP pipeline that performs the following steps.
The functional test for batch prediction is shown in a separate notebook, 98_Batch_Prediction_Test.ipynb.
Current behavior
- FileListGen only looks up the fixed GCS location (gs://batch-prediction-collection-3) to generate a FileList.
- FileListGen always uses the cached result when enable_cache=True (the same GCS-location input leads the pipeline to assume the cached result can be reused).

To-be-updated behavior
- FileListGen should take a GCS location such as gs://batch-prediction-collection-3/YYYY-MM (if we want to store monthly datasets) as an input of type RuntimeParameter.
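Since the monthly location follows a fixed naming scheme, the RuntimeParameter value could be derived from the current date before each run. A minimal sketch, assuming a month-partitioned layout (monthly_gcs_location is a hypothetical helper, not part of the pipeline code):

```python
from datetime import datetime

def monthly_gcs_location(bucket: str, when: datetime) -> str:
    """Build a month-partitioned GCS prefix, e.g. gs://<bucket>/2021-12."""
    return f"gs://{bucket}/{when.strftime('%Y-%m')}"

print(monthly_gcs_location("batch-prediction-collection-3", datetime(2021, 12, 15)))
# gs://batch-prediction-collection-3/2021-12
```

The resulting string would then be handed to FileListGen as the RuntimeParameter value at pipeline-submission time.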
In the original concept, range_config should work for {SPAN} in ExampleGen. However, there is a bug in TFX < 1.4.0. As of writing this issue, the bug fix has been merged, but only a nightly build or a build from source should work; the nightly build version should be above 1.4.0.dev20211010 (you can find the dev versions here).
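For intuition, the {SPAN} placeholder amounts to a substitution over the split patterns. The sketch below imitates that behavior in plain Python; resolve_span is a hypothetical helper for illustration, not the actual ExampleGen implementation (which also supports features like zero-padded span widths):

```python
def resolve_span(pattern: str, span: int) -> str:
    """Substitute the {SPAN} placeholder into an ExampleGen-style
    split pattern (illustrative only)."""
    return pattern.replace("{SPAN}", str(span))

print(resolve_span("span-{SPAN}/train/*.tfrecord", 2))
# span-2/train/*.tfrecord
```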
When range_config works properly, we can integrate that functionality to dynamically choose the range of spans to run the training pipeline with a Resolver node. For example, the initial pipeline run could depend on span-1, but when data drift is detected, the second pipeline run could use both span-1 and span-2 together. In that case, we don't need to run ExampleGen for span-1 again, only for span-2. The Resolver node will reuse the ExampleGen artifacts for span-1 generated from the initial pipeline run and integrate them with the new ExampleGen artifacts for span-2.
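The reuse logic described above reduces to a set difference over span numbers. A sketch under that framing (spans_to_process is a hypothetical helper; in the real pipeline the Resolver node and MLMD do this bookkeeping):

```python
def spans_to_process(required_spans, cached_spans):
    """Split the required spans into ones whose ExampleGen artifacts can be
    reused from earlier runs and ones that still need to be ingested."""
    reuse = sorted(set(required_spans) & set(cached_spans))
    fresh = sorted(set(required_spans) - set(cached_spans))
    return reuse, fresh

# Initial run produced span 1; after drift is detected, the second run
# needs spans 1 and 2, so only span 2 goes through ExampleGen again.
print(spans_to_process([1, 2], [1]))
# ([1], [2])
```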
More extensive discussion can be found in the corresponding issue in the official TFX repo.
I have been thinking about ways to incorporate TFMA for the evaluation part.
Currently, we run batch prediction to gather the results, and then we compare them against the ground truth to check whether the end accuracy is above a threshold. This is implemented in this notebook. We run the batch prediction service because we think it is also common in real-world settings: upon the arrival of a bulk of data, we perform batch inference, collect the results, and then analyze them.
We understand that what we are doing with the PerformanceEvaluator component could be delegated to TFMA, but given that batch prediction could be an important part of the workflow, where should TFMA be incorporated?
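For reference, the kind of check the PerformanceEvaluator performs reduces to an accuracy-vs-threshold comparison over the batch prediction results. A sketch under that assumption (names and threshold are illustrative, not the repo's actual code):

```python
def accuracy_above_threshold(predictions, ground_truth, threshold=0.9):
    """Compare batch prediction results against ground truth and decide
    whether end accuracy clears the threshold (illustrative only)."""
    correct = sum(p == y for p, y in zip(predictions, ground_truth))
    return correct / len(ground_truth) >= threshold

print(accuracy_above_threshold(["cat", "dog", "cat"], ["cat", "dog", "dog"]))
# False (accuracy is 2/3, below the 0.9 threshold)
```

TFMA could replace this hand-rolled check with sliced metrics and validation against a baseline, which is the crux of the question above.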
Cc: @deep-diver
When more data with the same distribution arrives, we can update the currently stored dataset. In this case, you should turn on GCS's versioning feature.
Worth providing a link that shows how to enable the feature mentioned here.
When more data with a different distribution arrives, we will create other directories, span-2/test and span-2/test, to address data drift. In this way, we can keep data separately for easier maintenance while handling versioning separately for different SPANs.
Did you mean span-2/**train** and span-2/test?
The notebook currently has permission errors. Also,
Please note this section only works within the GCP Vertex Notebook environment due to an authentication issue. If you know how to set up GCS access privileges for TFX, please let me know.
Can't we mitigate this with from google.colab import auth; auth.authenticate_user() for a Colab runtime?
Also, could you share the GCS bucket with me?
1) create a tfrecord file from the list of raw data
2) move the tfrecord dataset to designated locations (cifar-10/span-2/....) where the ImportExampleGen consumes it

After the fourth step, we have to add two more steps.
Remove the newly collected data from the original GCS location. This prevents Cloud Scheduler from triggering the batch prediction pipeline again. Because the triggering condition is "number of newly collected images > threshold", if we don't remove them, Cloud Scheduler will keep triggering the batch prediction pipeline even though they have already been handled by the training pipeline.
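The gating condition described above is just a count comparison; moving handled files out of the watched location drives the count back under the threshold, so the scheduler's check fails on subsequent runs. A sketch (function name and signature are assumptions, not the repo's code):

```python
def should_trigger(num_new_images: int, threshold: int) -> bool:
    """Cloud Scheduler-style gate: fire the batch prediction pipeline
    only when newly collected images exceed the threshold (illustrative)."""
    return num_new_images > threshold

print(should_trigger(120, 100))  # True: enough new images, pipeline fires
print(should_trigger(0, 100))    # False: files were moved away, no re-trigger
```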
Add OutputArtifact[Dataset] to set the new span number. It will be passed to the downstream component, PipelineTrigger; the PipelineTrigger component will then use it to set the RuntimeParameter.
Trigger the training pipeline with the create_run_from_job_spec function.
- Use InputArtifact[Dataset] to get the span number from the previous component, SpanPreparator.
- Use parameter_values in create_run_from_job_spec to pass RuntimeParameters to the training pipeline. An example can be found in this notebook, or see below:

    _ = pipelines_client.create_run_from_job_spec(
        PIPELINE_DEFINITION_FILE,
        enable_caching=False,
        parameter_values={
            'input-config': json.dumps({
                'splits': [
                    {'name': 'train', 'pattern': 'span-[12]/train/*.tfrecord'},
                    {'name': 'val', 'pattern': 'span-[12]/test/*.tfrecord'}
                ]
            }),
            'output-config': json.dumps({})
        })
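The input-config payload above could be derived from whatever span numbers are resolved at runtime instead of being hard-coded. A minimal sketch mirroring the structure of that snippet (build_parameter_values is a hypothetical helper; single-digit spans are assumed so a glob character class like [12] works):

```python
import json

def build_parameter_values(spans):
    """Assemble RuntimeParameter values for the training pipeline,
    covering all given spans with one character-class glob (illustrative)."""
    span_class = "[" + "".join(str(s) for s in spans) + "]"  # e.g. [12]
    return {
        "input-config": json.dumps({
            "splits": [
                {"name": "train", "pattern": f"span-{span_class}/train/*.tfrecord"},
                {"name": "val", "pattern": f"span-{span_class}/test/*.tfrecord"},
            ]
        }),
        "output-config": json.dumps({}),
    }

print(json.loads(build_parameter_values([1, 2])["input-config"])["splits"][0]["pattern"])
# span-[12]/train/*.tfrecord
```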
Create a separate notebook to gather data with a different distribution from the original CIFAR-10 dataset. If possible, we can show how those two datasets differ.