The module in question is `kubeflow_batch_predict.dataflow.batch_prediction.py` and the DoFn is `PredictionDoFn`.
This issue describes a shortcoming of the current design and will serve as the primary venue for discussing it.
In its current state, this DoFn accepts serialized JSON with the following format for the examples:

`{'instances': [ {'input': TFRecord}, ... ] }`

where each item in the list is a dictionary containing the 'input' key mapped to a TFRecord (possibly a base64 encoding of the TFRecord). This format does not allow any extraneous top-level keys, and the DoFn yields back only a list of inputs and outputs.
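For concreteness, here is a minimal sketch of a single serialized element in the currently accepted shape (the base64 encoding step is an assumption; the exact payload encoding may differ):

```python
import base64
import json

# Placeholder bytes standing in for one serialized TFRecord example.
tf_record_bytes = b"\x00\x01\x02"

# The only accepted shape today: each instance dict may contain nothing
# beyond the 'input' key.
element = json.dumps({
    "instances": [
        {"input": base64.b64encode(tf_record_bytes).decode("utf-8")},
    ]
})
```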
The need for extraneous keys exists because there may be extra metadata accompanying each element that is needed to identify the input. For instance, consider a prediction task that embeds "movies" as high-dimensional vectors. If we want to write the final results into a CSV file, we would want each row to carry extra metadata such as "name", and we might want this metadata to be passed around in a dict through the Dataflow pipeline (as PCollections).
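As a concrete illustration of the desired shape (the 'metadata' key below is hypothetical, not a field the DoFn currently accepts), each instance would carry extra keys alongside 'input':

```python
# Hypothetical extended instance format: extra top-level keys such as
# 'metadata' ride along with each example instead of being rejected.
instance = {
    "input": "<base64-encoded TFRecord>",
    "metadata": {"name": "The Matrix", "year": 1999},
}
```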
This is not possible currently because it would violate the input format of `PredictionDoFn`; instead, we have to morph these values into something acceptable. That transformation step is expected, but any downstream DoFns that derive PCollections from `PredictionDoFn` will not be able to access the pre-transformed data. Instead, all we will have is a list of high-dimensional vectors with no way to relate them back to human-readable information like "name".
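One direction worth discussing, shown here purely as a hypothetical sketch (the keying scheme and the stand-in prediction step are assumptions, not part of the current API), is to key each instance by its metadata before prediction and rejoin afterwards, so the relationship survives `PredictionDoFn`:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    records = p | "Read" >> beam.Create([
        # Toy records: 'name' is hypothetical metadata, 'input' stands in
        # for a serialized TFRecord.
        {"name": "The Matrix", "input": b"\x00\x01"},
        {"name": "Inception", "input": b"\x02\x03"},
    ])
    # Key every record by its metadata identifier so the metadata and the
    # model input can travel through the pipeline separately.
    keyed = records | "Key" >> beam.Map(lambda r: (r["name"], r))
    metadata = keyed | "Meta" >> beam.MapTuple(lambda k, r: (k, r["name"]))
    inputs = keyed | "Inputs" >> beam.MapTuple(lambda k, r: (k, r["input"]))
    # PredictionDoFn would sit here, consuming the bare inputs; a stand-in
    # "embedding" keeps the sketch runnable end to end.
    vectors = inputs | "Predict" >> beam.MapTuple(
        lambda k, x: (k, [float(len(x))]))
    # Rejoin the predictions with the metadata PredictionDoFn dropped.
    joined = (
        {"meta": metadata, "vector": vectors}
        | "Join" >> beam.CoGroupByKey()
    )
    joined | "Print" >> beam.Map(print)
```

This workaround keeps the pipeline functional today, but it forces every user to reinvent the same join; a format that allows extra keys to pass through would make it unnecessary.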
We need to converge on a design spec for this so as to accommodate the most generic use cases around batch prediction.