
1. one-click-mlflow

A tool to deploy a mostly serverless MLflow on a GCP project with one command

1.1. How to use

1.1.1. Pre-requisites

  • A GCP project on which you have the Owner role
  • Terraform, make, and jq installed (a quick check is sketched below)
  • The gcloud SDK initialized with your owner account
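
A minimal way to verify the required CLI tools are on your PATH (an illustrative snippet, not part of the repo):

import shutil

# Check each required tool; shutil.which returns None when a tool is missing.
for tool in ("terraform", "make", "jq", "gcloud"):
    print(tool, "ok" if shutil.which(tool) else "MISSING")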

1.1.2. Deploying

Clone the repo

Run make one-click-mlflow and let the wizard guide you.

If you want to see the innards, you can run it in debug mode: DEBUG=true make one-click-mlflow

1.1.3. What it does

  • Enables the necessary GCP services
  • Builds and deploys the MLflow Docker image
  • Creates a private-IP Cloud SQL (MySQL) database for the tracking server
  • Creates an App Engine Flex app on the default service for the web UI, secured by IAP
  • Sets up all the required networking between these components
  • Creates the mlflow-log-pusher service account

Architecture diagram

1.1.4. Other available make commands

  • make deploy: builds and pushes the application image and (re)deploys the infrastructure
  • make docker: builds and pushes the application image
  • make apply: (re)deploys the infrastructure
  • make destroy: destroys the infrastructure. It will not delete the OAuth consent screen or the App Engine application.

1.1.5. Pushing your first parameters, logs, artifacts

Once the deployment is successful, you can start pushing to your MLflow instance.

cd examples
python3 -m venv venv 
source venv/bin/activate
pip install -r requirements.txt
python track_experiment.py

You can then adapt examples/track_experiment.py and examples/mlflow_config.py to suit your application's needs.
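
For reference, here is a minimal sketch of what such a tracking script can look like. The experiment name and logged values are illustrative; it assumes that importing examples/mlflow_config.py sets the tracking URI and IAP token environment variables, as the example does:

import mlflow
import mlflow_config  # assumed to set the tracking URI and IAP token env vars

mlflow.set_experiment("my-first-experiment")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)  # a parameter
    mlflow.log_metric("accuracy", 0.93)      # a metric
    with open("notes.txt", "w") as f:
        f.write("hello mlflow")
    mlflow.log_artifact("notes.txt")         # an artifact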

one-click-mlflow's Issues

As an Artefact Data Scientist, I want to have access to an MLFlow instance "as a service"

Is your feature request related to a problem? Please describe.
It is not always possible or desirable to use one-click-mlflow on a client's infrastructure. For example, when:

  • Organisation policies block the deployment
  • The DS does not have the required roles/permissions
  • The client is not on GCP
  • The mission is short in duration (a POC, ...)

Describe the solution you'd like
A web app to request an instance of a GCP project with MLFlow deployed

The requester does not have direct access to the project, but is issued a service account to push logs/params/artifacts, along with the link to the MLflow web app

The process is automated, so no one else needs to be involved.
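
As a rough sketch of how the issued service account could be used, assuming the standard IAP identity-token flow (the key-file path and client ID below are hypothetical):

from google.auth.transport.requests import Request
from google.oauth2 import service_account

# Mint an OIDC token for the IAP-protected MLflow app from the issued key.
credentials = service_account.IDTokenCredentials.from_service_account_file(
    "issued-sa-key.json",  # hypothetical key file
    target_audience="IAP_CLIENT_ID.apps.googleusercontent.com",  # hypothetical
)
credentials.refresh(Request())
print(credentials.token)  # usable as MLFLOW_TRACKING_TOKEN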

Make destroy does not work properly

Describe the bug
Need to destroy multiple times to get rid of all the resources

To Reproduce
run make destroy

Expected behavior
All resources are destroyed without errors

Desktop (please complete the following information):

  • macOS Catalina
  • Terraform 0.13.2

Error when running migrations

Describe the bug
Error when deploying the MLflow server:

pymysql.err.ProgrammingError: (1146, "Table 'mlflow.experiments' doesn't exist")
ValueError: Invalid IPv6 URL

To Reproduce
Steps to reproduce the behavior:
Just run make one-click-mlflow on an empty project

Full traceback

 Error: Error waiting to create FlexibleAppVersion: Error waiting for Creating FlexibleAppVersion: Error code 9, message: Flex operation projects/sandbox-thomas-323814/regions/europe-west1/operations/8cade860-1654-4007-91d4-a70f350bed1a error [FAILED_PRECONDITION]: An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2021-08-23T15:51:13.172Z56066.wm.0: 2021/08/23 15:52:57 INFO mlflow.store.db.utils: Updating database tables in preparation for MLflow 1.0 schema migrations 
│ INFO  [alembic.runtime.migration] Context impl MySQLImpl.
│ INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
│ INFO  [alembic.runtime.migration] Running upgrade  -> ff01da956556, ensure_unique_constraint_names
│ Traceback (most recent call last):
│   File "/usr/local/lib/python3.7/dist-packages/sqlalchemy/engine/base.py", line 1246, in _execute_context
│     cursor, statement, parameters, context
│   File "/usr/local/lib/python3.7/dist-packages/sqlalchemy/engine/default.py", line 588, in do_execute
│     cursor.execute(statement, parameters)
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/cursors.py", line 163, in execute
│     result = self._query(query)
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/cursors.py", line 321, in _query
│     conn.query(q)
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/connections.py", line 505, in query
│     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/connections.py", line 724, in _read_query_result
│     result.read()
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/connections.py", line 1069, in read
│     first_packet = self.connection._read_packet()
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/connections.py", line 676, in _read_packet
│     packet.raise_for_error()
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/protocol.py", line 223, in raise_for_error
│     err.raise_mysql_exception(self._data)
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/err.py", line 107, in raise_mysql_exception
│     raise errorclass(errno, errval)
│ pymysql.err.ProgrammingError: (1146, "Table 'mlflow.experiments' doesn't exist")
│ 
│ The above exception was the direct cause of the following exception:
│ 
│ Traceback (most recent call last):
│   File "/usr/local/bin/mlflow", line 8, in <module>
│     sys.exit(cli())
│   File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1137, in __call__
│     return self.main(*args, **kwargs)
│   File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1062, in main
│     rv = self.invoke(ctx)
│   File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1668, in invoke
│     return _process_result(sub_ctx.command.invoke(sub_ctx))
│   File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1668, in invoke
│     return _process_result(sub_ctx.command.invoke(sub_ctx))
│   File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1404, in invoke
│     return ctx.invoke(self.callback, **ctx.params)
│   File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 763, in invoke
│     return __callback(*args, **kwargs)
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/db.py", line 29, in upgrade
│     mlflow.store.db.utils._upgrade_db_initialized_before_mlflow_1(engine)
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/store/db/utils.py", line 179, in _upgrade_db_initialized_before_mlflow_1
│     command.upgrade(config, "heads")
│   File "/usr/local/lib/python3.7/dist-packages/alembic/command.py", line 298, in upgrade
│     script.run_env()
│   File "/usr/local/lib/python3.7/dist-packages/alembic/script/base.py", line 489, in run_env
│     util.load_python_file(self.dir, "env.py")
│   File "/usr/local/lib/python3.7/dist-packages/alembic/util/pyfiles.py", line 98, in load_python_file
│     module = load_module_py(module_id, path)
│   File "/usr/local/lib/python3.7/dist-packages/alembic/util/compat.py", line 184, in load_module_py
│     spec.loader.exec_module(module)
│   File "<frozen importlib._bootstrap_external>", line 728, in exec_module
│   File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/temporary_db_migrations_for_pre_1_users/env.py", line 84, in <module>
│     run_migrations_online()
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/temporary_db_migrations_for_pre_1_users/env.py", line 78, in run_migrations_online
│     context.run_migrations()
│   File "<string>", line 8, in run_migrations
│   File "/usr/local/lib/python3.7/dist-packages/alembic/runtime/environment.py", line 846, in run_migrations
│     self.get_context().run_migrations(**kw)
│   File "/usr/local/lib/python3.7/dist-packages/alembic/runtime/migration.py", line 518, in run_migrations
│     step.migration_fn(**kw)
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/temporary_db_migrations_for_pre_1_users/versions/ff01da956556_ensure_unique_constraint_names.py", line 180, in upgrade
│     condition=column("lifecycle_stage").in_(["active", "deleted"]),
│   File "/usr/lib/python3.7/contextlib.py", line 119, in __exit__
│     next(self.gen)
│   File "/usr/local/lib/python3.7/dist-packages/alembic/operations/base.py", line 354, in batch_alter_table
│     impl.flush()
│   File "/usr/local/lib/python3.7/dist-packages/alembic/operations/batch.py", line 83, in flush
│     fn(*arg, **kw)
│   File "/usr/local/lib/python3.7/dist-packages/alembic/ddl/impl.py", line 244, in add_constraint
│     self._exec(schema.AddConstraint(const))
│   File "/usr/local/lib/python3.7/dist-packages/alembic/ddl/impl.py", line 140, in _exec
│     return conn.execute(construct, *multiparams, **params)
│   File "/usr/local/lib/python3.7/dist-packages/sqlalchemy/engine/base.py", line 982, in execute
│     return meth(self, multiparams, params)
│   File "/usr/local/lib/python3.7/dist-packages/sqlalchemy/sql/ddl.py", line 72, in _execute_on_connection
│     return connection._execute_ddl(self, multiparams, params)
│   File "/usr/local/lib/python3.7/dist-packages/sqlalchemy/engine/base.py", line 1044, in _execute_ddl
│     compiled,
│   File "/usr/local/lib/python3.7/dist-packages/sqlalchemy/engine/base.py", line 1250, in _execute_context
│     e, statement, parameters, cursor, context
│   File "/usr/local/lib/python3.7/dist-packages/sqlalchemy/engine/base.py", line 1476, in _handle_dbapi_exception
│     util.raise_from_cause(sqlalchemy_exception, exc_info)
│   File "/usr/local/lib/python3.7/dist-packages/sqlalchemy/util/compat.py", line 398, in raise_from_cause
│     reraise(type(exception), exception, tb=exc_tb, cause=cause)
│   File "/usr/local/lib/python3.7/dist-packages/sqlalchemy/util/compat.py", line 152, in reraise
│     raise value.with_traceback(tb)
│   File "/usr/local/lib/python3.7/dist-packages/sqlalchemy/engine/base.py", line 1246, in _execute_context
│     cursor, statement, parameters, context
│   File "/usr/local/lib/python3.7/dist-packages/sqlalchemy/engine/default.py", line 588, in do_execute
│     cursor.execute(statement, parameters)
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/cursors.py", line 163, in execute
│     result = self._query(query)
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/cursors.py", line 321, in _query
│     conn.query(q)
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/connections.py", line 505, in query
│     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/connections.py", line 724, in _read_query_result
│     result.read()
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/connections.py", line 1069, in read
│     first_packet = self.connection._read_packet()
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/connections.py", line 676, in _read_packet
│     packet.raise_for_error()
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/protocol.py", line 223, in raise_for_error
│     err.raise_mysql_exception(self._data)
│   File "/usr/local/lib/python3.7/dist-packages/pymysql/err.py", line 107, in raise_mysql_exception
│     raise errorclass(errno, errval)
│ sqlalchemy.exc.ProgrammingError: (pymysql.err.ProgrammingError) (1146, "Table 'mlflow.experiments' doesn't exist")
│ [SQL: ALTER TABLE experiments ADD CONSTRAINT experiments_lifecycle_stage CHECK (lifecycle_stage IN ('active', 'deleted'))]
│ (Background on this error at: http://sqlalche.me/e/f405)
│ 2021/08/23 15:52:58 ERROR mlflow.cli: Error initializing backend store
│ 2021/08/23 15:52:58 ERROR mlflow.cli: Invalid IPv6 URL
│ Traceback (most recent call last):
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/cli.py", line 385, in server
│     initialize_backend_stores(backend_store_uri, default_artifact_root)
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/server/handlers.py", line 146, in initialize_backend_stores
│     _get_tracking_store(backend_store_uri, default_artifact_root)
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/server/handlers.py", line 131, in _get_tracking_store
│     _tracking_store = _tracking_store_registry.get_store(store_uri, artifact_root)
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/tracking/_tracking_service/registry.py", line 37, in get_store
│     builder = self.get_store_builder(store_uri)
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/tracking/registry.py", line 75, in get_store_builder
│     scheme = store_uri if store_uri == "databricks" else get_uri_scheme(store_uri)
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/utils/uri.py", line 157, in get_uri_scheme
│     scheme = urllib.parse.urlparse(uri_or_path).scheme
│   File "/usr/lib/python3.7/urllib/parse.py", line 368, in urlparse
│     splitresult = urlsplit(url, scheme, allow_fragments)
│   File "/usr/lib/python3.7/urllib/parse.py", line 459, in urlsplit
│     raise ValueError("Invalid IPv6 URL")
│ ValueError: Invalid IPv6 URL
│ 
│ 
│   with module.mlflow.module.server.google_app_engine_flexible_app_version.mlflow_app,
│   on modules/mlflow/server/main.tf line 112, in resource "google_app_engine_flexible_app_version" "mlflow_app":
│  112: resource "google_app_engine_flexible_app_version" "mlflow_app" {
│ 
╵
make: *** [apply-terraform] Error 1

Technical Story: refactor the way configuration variables are set prior to deploying

Definition of ready
Ready

Description
We have vars, vars_base, and vars_additionnal files that are treated as sh scripts. This is neither clear nor well designed.

Refactor this to use a JSON file and a parser.

Definition of done

  • Behavior is unchanged from the user's perspective
  • A single JSON file contains all the variables
  • A parser to export the variables from the JSON file so they are accessible to Terraform through env vars (one possible shape is sketched after this list)
  • We don't want to just create a Terraform variables file, because some variables depend on each other, and we want to avoid a monolithic sh script that does everything at once
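
A minimal sketch of such a parser; the file name vars.json and the invocation are illustrative, not decided in this story:

import json

# Read the single JSON variables file and emit `export` lines, so that a
# shell (or the Makefile) can eval them and each value reaches Terraform
# as a TF_VAR_* environment variable.
with open("vars.json") as f:  # hypothetical file name
    variables = json.load(f)

for key, value in variables.items():
    print(f'export TF_VAR_{key}="{value}"')

A Makefile target could then consume it with something like eval "$(python parse_vars.py)".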

make one-click-mlflow not working after make destroy because of undeleted bucket

Describe the bug
Problem encountered by @ucsky. Running make one-click-mlflow fails after make destroy because the artifacts bucket still exists.
It produces the following error:


Setting up your GCP project...
╷
│ Error: googleapi: Error 409: You already own this bucket. Please select another name., conflict
│ 
│   with module.bucket_backend.google_storage_bucket.this,
│   on ../modules/mlflow/artifacts/main.tf line 18, in resource "google_storage_bucket" "this":
│   18: resource "google_storage_bucket" "this" {

To Reproduce
Steps to reproduce the behavior:

  1. run make one-click-mlflow and finish it
  2. run make destroy
  3. run make one-click-mlflow
  4. See error

Expected behavior
The second make one-click-mlflow run should succeed
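
As a hedged workaround (not part of the repo), the leftover bucket can be deleted by hand before redeploying; the bucket name below is hypothetical:

from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.bucket("my-project-mlflow-artifacts")  # hypothetical name
bucket.delete(force=True)  # force=True also deletes the objects inside
                           # (only works for buckets with few objects)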

MLflow app creation crashes because of protobuf version

Hi there,

Describe the bug
When running make one-click-mlflow, an error appears while creating the App Engine Flex app:

Error: Error waiting to create FlexibleAppVersion: Error waiting for Creating FlexibleAppVersion: Error code 9, message: An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2023-03-28T10:26:33.695Z25166.wd.0: Traceback (most recent call last):
│   File "/usr/local/bin/mlflow", line 5, in <module>
│     from mlflow.cli import cli
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/__init__.py", line 32, in <module>
│     import mlflow.tracking._model_registry.fluent
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/tracking/__init__.py", line 8, in <module>
│     from mlflow.tracking.client import MlflowClient
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/tracking/client.py", line 8, in <module>
│     from mlflow.entities import ViewType
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/entities/__init__.py", line 6, in <module>
│     from mlflow.entities.experiment import Experiment
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/entities/experiment.py", line 2, in <module>
│     from mlflow.entities.experiment_tag import ExperimentTag
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/entities/experiment_tag.py", line 2, in <module>
│     from mlflow.protos.service_pb2 import ExperimentTag as ProtoExperimentTag
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/protos/service_pb2.py", line 18, in <module>
│     from .scalapb import scalapb_pb2 as scalapb_dot_scalapb__pb2
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/protos/scalapb/scalapb_pb2.py", line 35, in <module>
│     serialized_options=None, file=DESCRIPTOR)
│   File "/usr/local/lib/python3.7/dist-packages/google/protobuf/descriptor.py", line 561, in __new__
│     _message.Message._CheckCalledFromGeneratedFile()
│ TypeError: Descriptors cannot not be created directly.
│ If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
│ If you cannot immediately regenerate your protos, some other possible workarounds are:
│  1. Downgrade the protobuf package to 3.20.x or lower.
│  2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
│ 
│ More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
│ Traceback (most recent call last):
│   File "/usr/local/bin/mlflow", line 5, in <module>
│     from mlflow.cli import cli
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/__init__.py", line 32, in <module>
│     import mlflow.tracking._model_registry.fluent
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/tracking/__init__.py", line 8, in <module>
│     from mlflow.tracking.client import MlflowClient
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/tracking/client.py", line 8, in <module>
│     from mlflow.entities import ViewType
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/entities/__init__.py", line 6, in <module>
│     from mlflow.entities.experiment import Experiment
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/entities/experiment.py", line 2, in <module>
│     from mlflow.entities.experiment_tag import ExperimentTag
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/entities/experiment_tag.py", line 2, in <module>
│     from mlflow.protos.service_pb2 import ExperimentTag as ProtoExperimentTag
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/protos/service_pb2.py", line 18, in <module>
│     from .scalapb import scalapb_pb2 as scalapb_dot_scalapb__pb2
│   File "/usr/local/lib/python3.7/dist-packages/mlflow/protos/scalapb/scalapb_pb2.py", line 35, in <module>
│     serialized_options=None, file=DESCRIPTOR)
│   File "/usr/local/lib/python3.7/dist-packages/google/protobuf/descriptor.py", line 561, in __new__
│     _message.Message._CheckCalledFromGeneratedFile()
│ TypeError: Descriptors cannot not be created directly.
│ If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
│ If you cannot immediately regenerate your protos, some other possible workarounds are:
│  1. Downgrade the protobuf package to 3.20.x or lower.
│  2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
│ 
│ More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
│ 
│ 
│   with module.mlflow.module.server.google_app_engine_flexible_app_version.mlflow_app,
│   on modules/mlflow/server/main.tf line 112, in resource "google_app_engine_flexible_app_version" "mlflow_app":
│  112: resource "google_app_engine_flexible_app_version" "mlflow_app" {
│ 
╵

To Reproduce
Steps to reproduce the behavior:

terraform version
Terraform v1.4.2
on linux_amd64

  1. Install Terraform v1.4.2
  2. run make one-click-mlflow
  3. See error

Expected behavior
The Terraform script should create the MLflow app without error

Desktop (please complete the following information):

  • OS: Ubuntu
  • Version: 22.04
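
The error message itself suggests two workarounds: pin protobuf to 3.20.x or lower, or force the pure-Python protobuf implementation. A minimal sketch of the second option, which must run before mlflow is imported:

import os

# Must be set before any *_pb2 module is loaded; pure-Python parsing is
# slower but avoids the descriptor error quoted above.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

import mlflow  # imported after the env var on purpose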

Turn ocmlf into a Terraform module

Description

Right now, OCMLF tightly couples the CLI workflow and the infrastructure deployment.
Let's decouple these by making the Terraform part a standalone module available in the Terraform registry.
This repo would then contain only the CLI workflow that generates the deployment config and invokes the TF module.

This will allow the Terraform part of OCMLF to be seamlessly integrated into existing IaC projects.

MLflow 2.2.0 Support

Why this feature is wanted
MLflow 2.2.0 offers the following over MLflow 1.X:

  • A sleeker UI design
  • Simpler management of end-to-end MLOps workflows with the MLflow Pipelines module

Describe the solution you'd like
MLflow Tracking Server running 2.2.0 for enhanced UI and additional features

A new, different AppEngine service is created when doing 2+ deployments on the same project

Describe the bug
If a first deployment created the default App Engine service for MLflow, deploying again will create a new mlflow service instead of updating the first one.

The main issue for the end user is that the URL of the MLflow server depends on how many deployments were made: https://mlflow-dot-{PROJECT_ID}.ew.r.appspot.com versus https://{PROJECT_ID}.ew.r.appspot.com.

To Reproduce
Steps to reproduce the behavior:

  1. make one-click-mlflow -> AppEngine default is created
  2. make one-click-mlflow a second time on the same project -> AppEngine mlflow is created

Expected behavior
If the first deployment created the default service, we expect it to be updated rather than a new one created

Desktop (please complete the following information):

  • OS: macOS Big Sur
  • Terraform 0.14.6

Have an `Editor` role deployment option

Allow deployment with only Editor-level access on a GCP project

Adds pre-requisites:

  • roles/cloudsql.client, roles/secretmanager.secretAccessor, and roles/compute.networkUser to <project-id>@appspot.gserviceaccount.com
  • roles/storage.objectAdmin to <project-id>@gae-api-prod.google.com.iam.gserviceaccount.com

TODO:

  • A Bash script (or similar) to import these bindings into the TF state
  • Readme section "Editor deployment"

Deploying on a shared VPC is not working properly

Describe the bug
Deploying on a shared VPC from another GCP project is not working properly

module.network.google_compute_global_address.private_ip_addresses: Creating...

Error: Error creating GlobalAddress: googleapi: Error 400: Invalid value for field 'resource.network': 'projects/<another project>/global/networks/<shared VPC name>'. The specified network can not come from a different project., invalid

To Reproduce
run make deploy with a shared VPC as TF_VAR_network_name

Expected behavior
MLFlow server deployed on the shared VPC

Failed to build docker image

Describe the bug
When running apt-get update, the build failed with the following error:

#6 1.552 Get:1 http://deb.debian.org/debian buster InRelease [122 kB]                                                                                                                          
#6 1.552 Get:2 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]                                                                                                   
#6 1.625 Get:3 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]                                                                                                                 
#6 1.719 Get:4 https://packages.cloud.google.com/apt cloud-sdk-buster InRelease [6774 B]
#6 6.642 Get:5 https://packages.cloud.google.com/apt cloud-sdk-buster/main amd64 Packages [180 kB]
#6 7.197 Reading package lists...
#6 11.71 E: Repository 'http://security.debian.org/debian-security buster/updates InRelease' changed its 'Suite' value from 'stable' to 'oldstable'
#6 11.71 E: Repository 'http://deb.debian.org/debian buster InRelease' changed its 'Suite' value from 'stable' to 'oldstable'
#6 11.71 E: Repository 'http://deb.debian.org/debian buster-updates InRelease' changed its 'Suite' value from 'stable-updates' to 'oldstable-updates'

To Reproduce
Steps to reproduce the behavior:
Build the docker image

Error: Failed to get existing workspaces

Hi there,

Describe the bug
When running make one-click-mlflow, I get the following error:

Error: Failed to get existing workspaces: querying Cloud Storage failed: storage: bucket doesn't exist

It seems that Terraform is unable to get my workspace.

To Reproduce

> terraform version
Terraform v1.4.2
on linux_amd64
  1. Create a project and make sure your project is part of an organization
  2. Run make one-click-mlflow
  3. See the error

Expected behavior
Terraform should recognize my workspace.

Problem with get_token()

Describe the bug

I get the following error when I try to test one-click-mlflow:

(venv) ucsky@machine:~/try/one-click-mlflow$ python examples/track_experiment.py 
Enter your project ID: ofi-ai-try
Enter the name of your MLFlow experiment: test
Traceback (most recent call last):
  File "/home/ucsky/try/one-click-mlflow/examples/track_experiment.py", line 5, in <module>
    import mlflow_config
  File "/home/ucsky/try/one-click-mlflow/examples/mlflow_config.py", line 61, in <module>
    os.environ["MLFLOW_TRACKING_TOKEN"] = get_token()
  File "/home/ucsky/try/one-click-mlflow/examples/mlflow_config.py", line 17, in get_token
    token = _get_token()
  File "/home/ucsky/try/one-click-mlflow/examples/mlflow_config.py", line 35, in _get_token
    open_id_connect_token = id_token.fetch_id_token(Request(), client_id)
  File "/home/ucsky/try/one-click-mlflow/examples/venv/lib/python3.9/site-packages/google/oauth2/id_token.py", line 252, in fetch_id_token
    credentials = service_account.IDTokenCredentials.from_service_account_info(
  File "/home/ucsky/try/one-click-mlflow/examples/venv/lib/python3.9/site-packages/google/oauth2/service_account.py", line 528, in from_service_account_info
    signer = _service_account_info.from_dict(
  File "/home/ucsky/try/one-click-mlflow/examples/venv/lib/python3.9/site-packages/google/auth/_service_account_info.py", line 46, in from_dict
    missing = keys_needed.difference(six.iterkeys(data))
  File "/home/ucsky/try/one-click-mlflow/examples/venv/lib/python3.9/site-packages/six.py", line 599, in iterkeys
    return iter(d.keys(**kw))
AttributeError: 'NoneType' object has no attribute 'keys'

To Reproduce
Install with make one-click-mlflow, then run:

cd examples
python3 -m venv venv 
source venv/bin/activate
pip install -r requirements.txt
python track_experiment.py

Expected behavior
The experiment is tracked in MLflow.

Desktop (please complete the following information):

lsb_release -a
LSB Version:	core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
Distributor ID:	Pop
Description:	Pop!_OS 21.04
Release:	21.04
Codename:	hirsute
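
The traceback shows fetch_id_token trying to build service-account credentials from empty data, which points at missing or unusable application-default credentials. A hedged sketch of one way to satisfy it, assuming a service-account key file is available (the path and client ID below are hypothetical):

import os

from google.auth.transport.requests import Request
from google.oauth2 import id_token

# fetch_id_token() needs credentials it can mint an ID token from; pointing
# GOOGLE_APPLICATION_CREDENTIALS at a service-account key file is one way.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/sa-key.json"  # hypothetical

token = id_token.fetch_id_token(
    Request(), "IAP_CLIENT_ID.apps.googleusercontent.com"  # hypothetical audience
)
print(token)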
