Library to convert video files to TFRecords using Apache Beam.
Note: This is not an officially supported Google product.
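At a high level, a pipeline of this kind reads each video, wraps its contents in a tf.train.Example proto, and writes the serialized protos with Beam's TFRecord sink. The sketch below illustrates only the simplest form of that idea (storing the raw encoded bytes of each video); it is not this library's actual pipeline, and the paths and feature keys are hypothetical.

```python
import apache_beam as beam
import tensorflow as tf


def video_to_example(path):
    # tf.io.gfile handles both local paths and gs:// URIs.
    with tf.io.gfile.GFile(path, "rb") as f:
        data = f.read()
    # Wrap the raw video bytes and the filename in a tf.train.Example.
    # The feature keys here are made up for illustration.
    return tf.train.Example(features=tf.train.Features(feature={
        "video/encoded": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[data])),
        "video/filename": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[path.encode("utf-8")])),
    })).SerializeToString()


with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create(["gs://my-project/videos/example.mp4"])  # hypothetical path
        | beam.Map(video_to_example)
        | beam.io.WriteToTFRecord("gs://my-project/tfrecords/videos")  # output prefix
    )
```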
gcloud auth login
gcloud auth application-default login
gcloud config set project {project-id}
If you will be running the pipeline on the Dataflow runner, the service account key must be accessible to the Dataflow workers, so copy the file from its local path to Google Cloud Storage:
gsutil cp {local-path-to-json} {cloud-storage-path-to-json}
python3 -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt
PROJECT_ID={project-id}
The Apache Beam pipeline will use Google Cloud Storage as its source. Both the README and the Bash scripts assume that you have created a GCS bucket named gs://{project-id}. If you have not created this bucket yet, create it with the following gsutil command:
gsutil mb gs://{project-id}
gsutil -m cp -r gs://ugc-dataset/original_videos/* \
gs://${PROJECT_ID}/videos-to-tfrecords/input/
The Bash scripts below assume that the GCS directory gs://{project-id}/videos-to-tfrecords/input/ exists and contains your training data.
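To verify which files the pipeline would pick up, a quick sketch using Beam's standard fileio module can list the matched inputs (assuming the directory layout above; replace my-project with your project ID):

```python
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | fileio.MatchFiles("gs://my-project/videos-to-tfrecords/input/**")
        | beam.Map(lambda metadata: print(metadata.path))  # each match is a FileMetadata
    )
```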
Running an Apache Beam pipeline locally is helpful for testing and debugging. However, it is not recommended for large datasets; use the Cloud Dataflow runner instead.
bash bin/run.preprocess.sh {path-to-json}
bash bin/run.preprocess.sh {cloud-storage-path-to-json} cloud
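Under the hood, the local/cloud switch presumably maps to Beam's runner selection. A minimal sketch of how the two modes might be configured (the option names are Beam's standard pipeline options; the project, region, and temp_location values are placeholders):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Local run: the DirectRunner executes the pipeline in-process.
local_options = PipelineOptions(runner="DirectRunner")

# Cloud run: the DataflowRunner executes the pipeline on Dataflow workers.
cloud_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",          # placeholder
    region="us-central1",          # placeholder
    temp_location="gs://my-project/videos-to-tfrecords/tmp",  # placeholder
)
```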