
datacatalog-util



A Python package to manage Google Cloud Data Catalog helper commands and scripts.

Disclaimer: This is not an officially supported Google product.

Commands List

Group Command Description Documentation Link Code Repo
tags create Load Tags from CSV file. GO GO
tags delete Delete Tags from CSV file. GO GO
tags export Export Tags to CSV file. GO GO
tag-templates create Load Templates from CSV file. GO GO
tag-templates delete Delete Templates from CSV file. GO GO
tag-templates export Export Templates to CSV file. GO GO
filesets create Create GCS filesets from CSV file. GO GO
filesets enrich Enrich GCS filesets with Tags. GO GO
filesets clean-up-templates-and-tags Clean up the Fileset Template and its Tags. GO GO
filesets delete Delete GCS filesets from CSV file. GO GO
filesets export Export Filesets to CSV file. GO GO
object-storage create-entries Create Entries for each Object Storage File. GO GO
object-storage delete-entries Delete Entries that belong to the Object Storage Files. GO GO

Execute Tutorial in Cloud Shell

Open in Cloud Shell

Table of Contents


0. Executing in Cloud Shell from PyPI

If you want to execute this script directly in Cloud Shell, install it from PyPI:

# Set your SERVICE ACCOUNT, for instructions go to 1.3. Auth credentials
# This name is just a suggestion, feel free to name it following your naming conventions
export GOOGLE_APPLICATION_CREDENTIALS=~/credentials/datacatalog-util-sa.json

# Install datacatalog-util
pip3 install --upgrade datacatalog-util --user

# Add to your PATH
export PATH=~/.local/bin:$PATH

# Look for available commands
datacatalog-util --help

1. Environment setup for local build

1.1. Python + virtualenv

Using virtualenv is optional, but strongly recommended unless you use Docker.

1.1.1. Install Python 3.6+

1.1.2. Get the source code

git clone https://github.com/mesmacosta/datacatalog-util
cd ./datacatalog-util

All paths starting with ./ in the next steps are relative to the datacatalog-util folder.

1.1.3. Create and activate an isolated Python environment

pip install --upgrade virtualenv
python3 -m virtualenv --python python3 env
source ./env/bin/activate

1.1.4. Install the package

pip install --upgrade .

1.2. Docker

Docker may be used as an alternative to run the script. In this case, please disregard the Virtualenv setup instructions.

1.3. Auth credentials

1.3.1. Create a service account and grant it the roles below

  • Data Catalog Admin
  • Storage Admin

1.3.2. Download a JSON key and save it as

This name is just a suggestion, feel free to name it following your naming conventions

  • ./credentials/datacatalog-util-sa.json

1.3.3. Set the environment variables

This step may be skipped if you're using Docker.

export GOOGLE_APPLICATION_CREDENTIALS=~/credentials/datacatalog-util-sa.json

2. Load Tags from CSV file

2.1. Create a CSV file representing the Tags to be created

Tags are composed of as many lines as required to represent all of their fields. The columns are described as follows:

Column Description Mandatory
linked_resource Full name of the asset the Entry refers to. Y
template_name Resource name of the Tag Template for the Tag. Y
column Column of the Entry schema to attach the Tag to. N
field_id Id of the Tag field. Y
field_value Value of the Tag field. Y
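As an illustration, a minimal Tags CSV following the columns above might look like this sketch. All resource and template names are placeholders; the second data row shows a column-level Tag.

```shell
# Hypothetical example only: every name below is a placeholder.
# Row 2 tags the Entry itself; row 3 tags the "company" column of its schema.
cat > tags.csv <<'EOF'
linked_resource,template_name,column,field_id,field_value
//bigquery.googleapis.com/projects/my-project/datasets/my_dataset/tables/my_table,projects/my-project/locations/us-central1/tagTemplates/my_template,,approved,TRUE
//bigquery.googleapis.com/projects/my-project/datasets/my_dataset/tables/my_table,projects/my-project/locations/us-central1/tagTemplates/my_template,company,data_owner,finance-team
EOF
```

A file like this would then be passed to the create command via --csv-file.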

TIPS

2.1.1 Execute Tutorial in Cloud Shell

Open in Cloud Shell

2.2. Run the datacatalog-util script - Create the Tags

  • Python + virtualenv
datacatalog-util tags create --csv-file CSV_FILE_PATH
  • Docker
docker build --rm --tag datacatalog-util .
docker run --rm --tty \
  --volume CREDENTIALS_FILE_FOLDER:/credentials --volume CSV_FILE_FOLDER:/data \
  datacatalog-util tags create --csv-file /data/CSV_FILE_NAME

2.3. Run the datacatalog-util script - Delete the Tags

  • Python + virtualenv
datacatalog-util tags delete --csv-file CSV_FILE_PATH

3. Export Tags to CSV file

3.1. A list of CSV files will be created, one per Template.

A summary file with stats about each Template will also be created in the same directory.

The columns for the summary file are described as follows:

Column Description
template_name Resource name of the Tag Template for the Tag.
tags_count Number of tags found from the template.
tagged_entries_count Number of tagged entries with the template.
tagged_columns_count Number of tagged columns with the template.
tag_string_fields_count Number of used String fields on tags of the template.
tag_bool_fields_count Number of used Bool fields on tags of the template.
tag_double_fields_count Number of used Double fields on tags of the template.
tag_timestamp_fields_count Number of used Timestamp fields on tags of the template.
tag_enum_fields_count Number of used Enum fields on tags of the template.
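For illustration, a summary file with these columns might look like the following sketch. The template name is a placeholder and all counts are made up.

```shell
# Hypothetical sample of the exported summary file; all values are illustrative.
cat > summary.csv <<'EOF'
template_name,tags_count,tagged_entries_count,tagged_columns_count,tag_string_fields_count,tag_bool_fields_count,tag_double_fields_count,tag_timestamp_fields_count,tag_enum_fields_count
projects/my-project/locations/us-central1/tagTemplates/my_template,10,6,4,12,3,0,2,1
EOF
```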

The columns for each template file are described as follows:

Column Description
relative_resource_name Full resource name of the asset the Entry refers to.
linked_resource Full name of the asset the Entry refers to.
template_name Resource name of the Tag Template for the Tag.
tag_name Resource name of the Tag.
column Column of the Entry schema the Tag is attached to.
field_id Id of the Tag field.
field_type Type of the Tag field.
field_value Value of the Tag field.

3.1.1 Execute Tutorial in Cloud Shell

Open in Cloud Shell

3.2. Run tags export

  • Python + virtualenv
datacatalog-util tags export --project-ids my-project --dir-path DIR_PATH

3.3 Run tags export filtering Tag Templates

  • Python + virtualenv
datacatalog-util tags export --project-ids my-project \
--dir-path DIR_PATH \
--tag-templates-names projects/my-project/locations/us-central1/tagTemplates/my-template,\
projects/my-project/locations/us-central1/tagTemplates/my-template-2 

4. Load Templates from CSV file

4.1. Create a CSV file representing the Templates to be created

Templates are composed of as many lines as required to represent all of their fields. The columns are described as follows:

Column Description Mandatory
template_name Resource name of the Tag Template for the Tag. Y
display_name Display name of the Tag Template. Y
field_id Id of the Tag Template field. Y
field_display_name Display name of the Tag Template field. Y
field_type Type of the Tag Template field. Y
enum_values Values for the Enum field. N
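A minimal Templates CSV might look like the sketch below. The template and display names are placeholders, and the "|" separator inside enum_values is an assumption, not confirmed by this README.

```shell
# Hypothetical example: names are placeholders; the "|" separator inside
# enum_values is an assumption, not confirmed by this README.
cat > templates.csv <<'EOF'
template_name,display_name,field_id,field_display_name,field_type,enum_values
projects/my-project/locations/us-central1/tagTemplates/my_template,My Template,approved,Approved,BOOL,
projects/my-project/locations/us-central1/tagTemplates/my_template,My Template,env,Environment,ENUM,DEV|PROD
EOF
```

Note how the same template_name is repeated on every line that describes one of its fields.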

4.1.1 Execute Tutorial in Cloud Shell

Open in Cloud Shell

4.2. Run the datacatalog-util script - Create the Tag Templates

  • Python + virtualenv
datacatalog-util tag-templates create --csv-file CSV_FILE_PATH

4.3. Run the datacatalog-util script - Delete the Tag Templates

  • Python + virtualenv
datacatalog-util tag-templates delete --csv-file CSV_FILE_PATH


5. Export Templates to CSV file

5.1. A CSV file representing the Templates will be created

Templates are composed of as many lines as required to represent all of their fields. The columns are described as follows:

Column Description
template_name Resource name of the Tag Template for the Tag.
display_name Display name of the Tag Template.
field_id Id of the Tag Template field.
field_display_name Display name of the Tag Template field.
field_type Type of the Tag Template field.
enum_values Values for the Enum field.

5.1.1 Execute Tutorial in Cloud Shell

Open in Cloud Shell

5.2. Run the datacatalog-util script

  • Python + virtualenv
datacatalog-util tag-templates export --project-ids my-project --file-path CSV_FILE_PATH

6. Filesets Commands

6.1. Create a CSV file representing the Entry Groups and Entries to be created

Filesets are composed of as many lines as required to represent all of their fields. The columns are described as follows:

Column Description Mandatory
entry_group_name Entry Group Name. Y
entry_group_display_name Entry Group Display Name. N
entry_group_description Entry Group Description. N
entry_id Entry ID. Y
entry_display_name Entry Display Name. Y
entry_description Entry Description. N
entry_file_patterns Entry File Patterns. Y
schema_column_name Schema column name. N
schema_column_type Schema column type. N
schema_column_description Schema column description. N
schema_column_mode Schema column mode. N

Please note that schema_column_type is an open string field and accepts anything. If you want to use your fileset with Dataflow SQL, follow the data types in the official docs.
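Putting the columns above together, a one-fileset CSV might look like this sketch. Every entry group, entry, bucket, and schema value is a placeholder.

```shell
# Hypothetical example: entry group, entry, bucket, and schema values are placeholders.
cat > filesets.csv <<'EOF'
entry_group_name,entry_group_display_name,entry_group_description,entry_id,entry_display_name,entry_description,entry_file_patterns,schema_column_name,schema_column_type,schema_column_description,schema_column_mode
projects/my-project/locations/us-central1/entryGroups/my_entry_group,My Entry Group,Sales filesets,sales_fileset,Sales Fileset,Daily sales files,gs://my_bucket/sales/*.csv,region,STRING,Sales region,NULLABLE
EOF
```

Additional schema columns for the same entry would go on further lines repeating the entry columns.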

6.1.1 Execute Tutorial in Cloud Shell

Open in Cloud Shell

6.2. Create the Filesets Entry Groups and Entries

  • Python + virtualenv
datacatalog-util filesets create --csv-file CSV_FILE_PATH

TIPS

6.2.1 Create the Filesets Entry Groups and Entries - with Dataflow SQL types validation

  • Python + virtualenv
datacatalog-util filesets create --csv-file CSV_FILE_PATH --validate-dataflow-sql-types

6.3. Enrich GCS Filesets with Tags

Users can choose the Tag fields from the list provided at Tags.

datacatalog-util filesets enrich --project-id my-project 

6.3.1 Enrich all fileset entries using Tag Template from a different Project (Good way to reuse the same Template)

If you are using a different Project, make sure the Service Account has the following permissions on that Project or that Template:

  • Data Catalog TagTemplate Creator
  • Data Catalog TagTemplate User
datacatalog-util filesets \
  --project-id my_project \
  enrich --tag-template-name projects/my_different_project/locations/us-central1/tagTemplates/fileset_enricher_findings

6.3.2 Execute Fileset Enricher Tutorial in Cloud Shell

Open in Cloud Shell

6.4. Clean up Template and Tags

Cleans up the Template and Tags from the Fileset Entries; running the main command will recreate them.

datacatalog-util filesets clean-up-templates-and-tags --project-id my-project 

6.5. Delete the Filesets Entry Groups and Entries

  • Python + virtualenv
datacatalog-util filesets delete --csv-file CSV_FILE_PATH

7. Export Filesets to CSV file

7.1. A CSV file representing the Filesets will be created

Filesets are composed of as many lines as required to represent all of their fields. The columns are described as follows:

Column Description Mandatory
entry_group_name Entry Group Name. Y
entry_group_display_name Entry Group Display Name. Y
entry_group_description Entry Group Description. Y
entry_id Entry ID. Y
entry_display_name Entry Display Name. Y
entry_description Entry Description. Y
entry_file_patterns Entry File Patterns. Y
schema_column_name Schema column name. N
schema_column_type Schema column type. N
schema_column_description Schema column description. N
schema_column_mode Schema column mode. N

7.1.1 Execute Tutorial in Cloud Shell

Open in Cloud Shell

7.2. Run the datacatalog-util script

  • Python + virtualenv
datacatalog-util filesets export --project-ids my-project --file-path CSV_FILE_PATH

8. DataCatalog Object Storage commands

8.1 Execute Tutorial in Cloud Shell

Open in Cloud Shell

8.2. Create DataCatalog entries based on object storage files

datacatalog-util \
  object-storage sync-entries --type cloud_storage \
  --project-id my_project \
  --entry-group-name projects/my_project/locations/us-central1/entryGroups/my_entry_group \
  --bucket-prefix my_bucket

8.3. Delete object storage entries on entry group

datacatalog-util \
  object-storage delete-entries --type cloud_storage \
  --project-id my_project \
  --entry-group-name projects/my_project/locations/us-central1/entryGroups/my_entry_group

9. Data Catalog Templates Examples

templates_examples.md

datacatalog-util's People

Contributors

mesmacosta


datacatalog-util's Issues

Split the documentation into separate README files for each command group.

What would you like to be added:
Split the documentation into separate README files for each command group.

Why is this needed:
Currently the README.md documentation is too large; it would be easier to have a separate file for each command group: one for tags, one for tag-templates, and one for filesets and object-storage.

[BUG] - google.api_core.exceptions.MethodNotImplemented: 501 Received http2 header with status: 404

What happened:
I ran datacatalog-util in GCP Cloud Shell and it threw the exception below. Could you please help me resolve it?

Error log:

status = StatusCode.UNIMPLEMENTED
details = "Received http2 header with status: 404"
debug_error_string = "{"created":"@1616126950.574176093","description":"Received http2 :status header with non-200 OK status","file":"src/core/ext/filters/http/client/http_client_filter.cc","file_line":129,"grpc_message":"Received http2 header with status: 404","grpc_status":12,"value":"404"}"

What you expected to happen:
Ideally it should start updating the Dataset table in Data Catalog based on the .csv file data.

How to reproduce it (as minimally and precisely as possible):
Create a GCP project and a service account with the needed permissions.
Run as described in the guide.
Anything else we need to know?:

ADD support for S3 object-storage.

What would you like to be added:
Add support for S3 object storage. This would be a new feature of the datacatalog-object-storage-processor dependency.

Why is this needed:
Currently the object-storage create-entries command supports only cloud_storage; adding support for S3 would let us search for S3 files in Data Catalog.

[BUG] Permission denied when looking up Entry

What happened:
Tag template uploaded, service account has all required permission to data catalog and big query, but unable to push tags: "WARNING:root:Permission denied when looking up Entry for //bigquery.googleapis.com/projects/test****/datasets/us_state_sales/tables/us_state_salesregions. The resource will be skipped."

What you expected to happen:
Tags to be attached to specified datasets/tables.

How to reproduce it (as minimally and precisely as possible):
Just following the tutorial line by line.

Anything else we need to know?:
I can manually add the tags to dataset/table using the UI after the tag template is created using the tutorial, but still unable to attach tags using the tutorial.

ADD document with demo videos explaining each command.

What would you like to be added:
ADD document with demo videos explaining each command.

Why is this needed:
From a user's perspective, we have a lot of utility commands, and it's hard to understand their use cases.
