
spotty's Introduction


Spotty drastically simplifies training of deep learning models on AWS and GCP:

  • it makes training on GPU instances as simple as training on your local machine
  • it automatically manages all necessary cloud resources including images, volumes, snapshots and SSH keys
  • it lets anyone train your model in the cloud with just a couple of commands
  • it uses tmux to easily detach remote processes from their terminals
  • it saves you up to 70% of the costs by using AWS Spot Instances and GCP Preemptible VMs

Documentation

Installation

Requirements:

Use pip to install or upgrade Spotty:

$ pip install -U spotty

Get Started

  1. Prepare a spotty.yaml file and put it in the root directory of your project (a minimal sketch follows this list):

    • See the file specification here.
    • Read this article for a real-world example.
  2. Start an instance:

    $ spotty start

    It will run a Spot Instance, restore snapshots if any, synchronize the project with the running instance and start the Docker container with the environment.

  3. Train a model or run notebooks.

    To connect to the running container via SSH, use the following command:

    $ spotty sh

    It runs a tmux session, so you can detach from it at any time using the Ctrl + b, then d key combination. To attach to that session again later, just use the spotty sh command once more.

    Also, you can run your custom scripts inside the Docker container using the spotty run <SCRIPT_NAME> command. Read more about custom scripts in the documentation: Configuration: "scripts" section.
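For orientation, here is a minimal, hypothetical spotty.yaml of the kind these steps assume. Every concrete value below (project name, image, instance type, script) is a placeholder, so adjust it to your project and check the file specification for the full set of options:

project:
  name: my-project
  syncFilters:
    - exclude:
        - .git/*

containers:
  - projectDir: /workspace/project
    image: tensorflow/tensorflow:latest-gpu  # any Docker image with your environment
    volumeMounts:
      - name: workspace
        mountPath: /workspace

instances:
  - name: i1
    provider: aws
    parameters:
      region: us-east-1
      instanceType: p2.xlarge
      spotInstance: true  # use a Spot Instance instead of an On-Demand one
      volumes:
        - name: workspace
          parameters:
            size: 50
            deletionPolicy: retain  # keep the EBS volume after the instance is deleted

scripts:
  train: |
    python train.py

With a config like this, spotty start launches the instance and spotty run train executes the script inside the container.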

Contributions

Any feedback or contributions are welcome! Please check out the guidelines.

License

MIT License

spotty's People

Contributors

apls777, jucor, tsdalton, wilmeragsgh, zohaibahmed


spotty's Issues

Best way to set one variable string, different values, for each instance?

I am hoping to spin up a fleet of spotty instances, each identical except for one variable (a hyperparameter set name) that differs per instance.

Something like:
HYPERPARAMETERS = ["model1", "model2", etc.]
with each instance having a different value.

I guess an environment variable makes the most sense. I see the ability to add env variables to the containers, but I can't find documentation on what happens if you have multiple containers:

  • Do all the containers go onto one instance? If so, would I pick the container I want when I call spotty for that instance?
  • Or is it one container per instance? In that case, I can't find documentation on how to specify which container you want for a particular instance.

One workaround would be to have a separate spotty.yaml for each instance (and hyperparameter set name) but that seems janky.
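One possible approach, sketched under two assumptions: the -p flag of spotty run (shown working in a later issue on this page as spotty run train -p TAG=master) and the docs' Mustache-style script parameters. If those hold, a single spotty.yaml with one parameterized script might do it:

scripts:
  train: |
    # {{HYPERPARAMS}} is a Mustache-style script parameter; the name is hypothetical
    python train.py --hyperparams {{HYPERPARAMS}}

Each instance would then be started with spotty start <INSTANCE_NAME> and given its own value at run time, e.g. spotty run i1 train -p HYPERPARAMS=model1.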

Stuck at "- launching the instance..."

I have no issues creating t2 Spot Instances, e.g. t2.medium, but for p3.2xlarge spotty gets stuck at "- launching the instance...".

Here is my config:

project:
  name: cvms
  syncFilters:
    - exclude:
      - .git/*
      - .idea/*
      - '*/__pycache__/*'
      - tmp/*
      - dependencies/*

containers:
  - projectDir: /workspace/computer-vision-models
    hostNetwork: true
    file: Dockerfile
    env:
      PYTHONPATH: /workspace/computer-vision-models
    volumeMounts:
      - name: workspace
        mountPath: /workspace

instances:
  - name: train
    provider: aws
    parameters:
      region: eu-west-2
      instanceType: p3.2xlarge
      availabilityZone: eu-west-2a
      spotInstance: true
      dockerDataRoot: /docker
      volumes:
        - name: workspace
          parameters:
            size: 30
            deletionPolicy: retain
        - name: docker
          parameters:
            size: 30
            mountDir: /docker
            deletionPolicy: retain

scripts:
  train: |
    python models/semseg/train.py
  tensorboard: |
    tensorboard --bind_all --port 6006 --logdir /workspace/computer-vision-models/trained_models

Any idea what could go wrong here?

boto3 CreateBucket fails if region is "us-east-1"

If us-east-1 is used as region, spotty start fails with:

Error:
------
An error occurred (InvalidLocationConstraint) when calling the CreateBucket operation: The specified location-constraint is not valid

The solution is to fix https://github.com/apls777/spotty/blob/master/spotty/providers/aws/deployment/project_resources/bucket.py#L31-L32 as

if self._region != 'us-east-1':
    self._s3.create_bucket(ACL='private', Bucket=bucket_name,
                           CreateBucketConfiguration={'LocationConstraint': self._region})
else:
    self._s3.create_bucket(ACL='private', Bucket=bucket_name)

The boto3 issue is described in boto/boto3#125.

Also, thanks for the great project ^^

"ValidationError: Stack with id spotty-instance-my-project-i1 does not exist or has been deleted"

On a new simple spotty project, I have the following error on spotty start. It's very cryptic and hard for me to understand what is wrong:

2020-12-21 19:22:47,946 P3569 [INFO] ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2020-12-21 19:22:47,946 P3569 [INFO] Config mount_volumes
2020-12-21 19:22:47,947 P3569 [INFO] ============================================================
2020-12-21 19:22:47,948 P3569 [INFO] Command mount_volumes
2020-12-21 19:22:49,011 P3569 [INFO] -----------------------Command Output-----------------------
2020-12-21 19:22:49,011 P3569 [INFO]    + cfn-signal -e 0 --stack spotty-instance-renderman-dexed-imain --region eu-central-1 --resource MountingVolumesSignal
2020-12-21 19:22:49,011 P3569 [INFO]    + DEVICE_LETTERS=(f g h i j k l m n o p)
2020-12-21 19:22:49,011 P3569 [INFO]    + MOUNT_DIRS=("/workspace" "/docker")
2020-12-21 19:22:49,011 P3569 [INFO]    + for i in '${!MOUNT_DIRS[*]}'
2020-12-21 19:22:49,011 P3569 [INFO]    + MOUNT_DIR=/workspace
2020-12-21 19:22:49,011 P3569 [INFO]    + DEVICE=/dev/xvdf
2020-12-21 19:22:49,011 P3569 [INFO]    + '[' '!' -b /dev/xvdf ']'
2020-12-21 19:22:49,012 P3569 [INFO]    ++ cfn-get-metadata --stack spotty-instance-my-project-i1 --region us-east-2 --resource VolumeAttachmentF -k VolumeId
2020-12-21 19:22:49,012 P3569 [INFO]    ValidationError: Stack with id spotty-instance-my-project-i1 does not exist or has been deleted
2020-12-21 19:22:49,012 P3569 [INFO]    + VOLUME_ID=
2020-12-21 19:22:49,012 P3569 [INFO] ------------------------------------------------------------
2020-12-21 19:22:49,012 P3569 [ERROR] Exited with error code 1

And this is my spotty.yaml:

project:
  name: renderman-dexed
  syncFilters:
    - exclude:
        - '*.sw*'
        - '*.ipynb'
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - '.ipynb_checkpoints/*'
        - '__pycache__/*'
        - 'data/*'
        - 'preprocessed*.tgz*'
        - '*.log'

containers:
  - projectDir: /workspace/project
    file: docker/Dockerfile.spotty
#    ports:
#      # TensorBoard
#      - containerPort: 6006
#        hostPort: 6006
#      # Jupyter
#      - containerPort: 8888
#        hostPort: 8888
#      # Luigi
#      - containerPort: 8125
#        hostPort: 8125
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '20G']

instances:
  - name: imain
    provider: aws
    parameters:
      region: eu-central-1
      instanceType: m5ad.24xlarge
      spotInstance: True
#      ports: [6006, 8888]
      dockerDataRoot: /docker
      volumes:
        - name: workspace
          parameters:
            size: 1000
            deletionPolicy: retain
            mountDir: /workspace
        - name: docker
          parameters:
            size: 20
            mountDir: /docker

What does the error mean?

I tried to tmux into the instance and ran spotty start -C, which seemed to work, but then when I tmux in, it tells me to run spotty start -C again.

Running on Windows

Can this be run on Windows?
When I do
spotty start
it asks me to open a file; if I open it with Notepad, it shows some Python code, and if I open it with Python, the window quickly disappears.
I'm following the instructions from the Medium article.

Error while creating IAM role for instance

Hi,

Coming back to this project after a while. I'm following the tutorial here: https://towardsdatascience.com/how-to-train-deep-learning-models-on-aws-spot-instances-using-spotty-8d9e0543d365

I get an error while it tries to create the AMI.

Tacotron-2 at master ✖ spotty create-ami
Waiting for the AMI to be created...
  - creating IAM role for the instance...
Error:
------
Stack "spotty-nvidia-docker-ami-qgogoanj" was not created.
See CloudFormation and CloudWatch logs for details.

Can't use existing IAM role in spotty

In more managed environments, you don't always have the flexibility to create your own IAM role rather than use the one assigned to you. Spotty should be able to use an already existing IAM role if it has the right permissions.

Opening this as an issue so I can keep track of it, I'll be submitting a pull request next week likely.

Automatically download cfn-init logs to the local machine in case of an error

Following issues #44 and #48, it would be convenient to automatically download cfn-init logs to the local machine if an error occurred.

Probably, the most appropriate place for these logs is in the temporary directory: for example, on Unix systems, it could be the /tmp/spotty/logs/projects/<PROJECT_NAME>/instances/<INSTANCE_NAME>/cfn-init/ directory.

How to increase Image size

I have an instance that has some quite big files on it. To prepare the Docker image, the machine needs more space: the Docker workspace has plenty of space, but to build the Docker image in the first place, the Spotty AMI image needs to be a bit bigger.

Is there a way to increase its size?
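One hedged possibility: newer Spotty versions appear to expose the size of the instance's root volume as an AWS instance parameter. The key name rootVolumeSize below is an assumption, so verify it against the instance-parameters documentation for your version:

instances:
  - name: i1
    provider: aws
    parameters:
      region: us-east-1
      instanceType: p3.2xlarge
      rootVolumeSize: 60  # size in GB; the key name is an assumption, check your version's docs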

permission denied: spotty

Trying to get this set up using pipenv (Python 3.7): running pip install -U spotty doesn't throw any errors, but running spotty results in permission denied: spotty.

This also happens if I run this outside of the environment, or with sudo.

EDIT:

Tried to clone the repo, and install using python3 setup.py install. Runs the installation OK, but running create-ami results in:

  File "/Users/zohaibahmed/.virtualenvs/Tacotron-2-p0EfLM-4/bin/spotty", line 4, in <module>
    __import__('pkg_resources').run_script('spotty==1.1.1', 'spotty')
  File "/Users/zohaibahmed/.virtualenvs/Tacotron-2-p0EfLM-4/lib/python3.7/site-packages/pkg_resources/__init__.py", line 661, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Users/zohaibahmed/.virtualenvs/Tacotron-2-p0EfLM-4/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1448, in run_script
    exec(script_code, namespace, namespace)
  File "/Users/zohaibahmed/.virtualenvs/Tacotron-2-p0EfLM-4/lib/python3.7/site-packages/spotty-1.1.1-py3.7.egg/EGG-INFO/scripts/spotty", line 7, in <module>
  File "/Users/zohaibahmed/.virtualenvs/Tacotron-2-p0EfLM-4/lib/python3.7/site-packages/spotty-1.1.1-py3.7.egg/spotty/commands/create_ami.py", line 8, in <module>
ModuleNotFoundError: No module named 'cfn_tools'

Even though cfn-tools is installed already.

Running multiple instances

Hello @apls777

Thanks for spotty, I love this project.
I was wondering if there is a way to request several instances at the same time (like with SageMaker) to run the same code but with different configurations in parallel.

Also, is it possible to pass arguments to spotty run, like spotty run train --config "path to file"?

Thanks.

[discussion] Building software in cloud as a usecase

Another use case is remote builds when a fully-fledged CI is overkill, e.g. CUDA or Emscripten builds that are particularly slow.

This is often the case for hobby, open-source work. Being able to run the build script on a big cloud instance (via a simple spotty run command) for a given operating system and download the artefacts or log files would be very useful.

How to persist storage / resume a terminated instance

Is it possible to persist storage, so that if a Spotty-created instance is terminated, it's easy to resume using the same /workspace and instance state as just before it was terminated?

I recently had some data loss as a result of not syncing data back and forth, and I'm hoping there's a way to easily prevent it, or to resume a spotty instance terminated by AWS.
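Judging by the configs quoted elsewhere on this page (and the "Retain Volume" entries in the start-up tables), the relevant knob appears to be the volume's deletionPolicy. A sketch under that assumption, for the volumes list of an instance:

      volumes:
        - name: workspace
          parameters:
            size: 50
            deletionPolicy: retain  # keep the EBS volume when the instance or stack is deleted

With the volume retained, a later spotty start should be able to reattach the same /workspace instead of recreating it.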

[feature request] Launch a Spot instance with a specified duration

Coming from #44, it's sometimes necessary to launch a regular instance for dataset downloads and preprocessing on a CPU instance, without occasional instance terminations (preprocessing scripts get messy otherwise).

For extra niceness: the ability to launch Spot Instances reserved for a fixed duration (e.g. 6h of execution).

Error when trying the default config

Creating IAM role for the instance...
Preparing CloudFormation template...
  - volume "lsma-hw1-i1-workspace" will be created
  - volume "lsma-hw1-i1-docker" will be created
  - availability zone: auto
  - maximum Spot Instance price: on-demand
  - AMI: "Deep Learning AMI (Ubuntu 16.04) Version 26.0" (ami-025ed45832b817a35)
  - Docker data will be stored on the "docker" volume

Volumes:
+-----------+---------------+------------+-----------------+
| Name      | Container Dir | Type       | Deletion Policy |
+===========+===============+============+=================+
| workspace | /workspace    | EBS volume | Retain Volume   |
+-----------+---------------+------------+-----------------+
| docker    | -             | EBS volume | Retain Volume   |
+-----------+---------------+------------+-----------------+

Waiting for the stack to be created...
  - launching the instance...
  - waiting for the Docker container to be ready...
Error:
------
Stack "spotty-instance-lsma-hw1-i1" was not created.
Please, see CloudFormation logs for the details.

What is the recommended image for PyTorch? I can't seem to find any recommendations. I even tried the default TensorFlow one, but I get the same error. I guess it's because of the following error, which I get when I look at the CloudFormation logs by running the following on the instance:

 sudo tail /var/log/cfn-init-cmd.log
ubuntu@ip-172-31-84-193:~$  sudo tail /var/log/cfn-init-cmd.log
2020-02-09 01:03:12,195 P1844 [INFO]    + MOUNT_DIRS=("/mnt/lsma-hw1-i1-workspace" "/docker")
2020-02-09 01:03:12,195 P1844 [INFO]    + for i in '${!MOUNT_DIRS[*]}'
2020-02-09 01:03:12,195 P1844 [INFO]    + DEVICE=/dev/xvdf
2020-02-09 01:03:12,195 P1844 [INFO]    + MOUNT_DIR=/mnt/lsma-hw1-i1-workspace
2020-02-09 01:03:12,195 P1844 [INFO]    + blkid -o value -s TYPE /dev/xvdf
2020-02-09 01:03:12,195 P1844 [INFO]    + mkfs -t ext4 /dev/xvdf
2020-02-09 01:03:12,195 P1844 [INFO]    mke2fs 1.42.13 (17-May-2015)
2020-02-09 01:03:12,195 P1844 [INFO]    The file /dev/xvdf does not exist and no size was specified.
2020-02-09 01:03:12,195 P1844 [INFO] ------------------------------------------------------------
2020-02-09 01:03:12,195 P1844 [ERROR] Exited with error code 1

Do you know why this could be happening?

My spotty config is the following:

project:
  name: test-hw1
  syncFilters:
    - exclude:
      - .git/*
      - .idea/*
      - '*/__pycache__/*'

container:
  projectDir: /workspace/project
  image: tensorflow/tensorflow:latest-gpu-py3-jupyter
  ports: [6006, 8888]
  volumeMounts:
    - name: workspace
      mountPath: /workspace

instances:
  - name: i1
    provider: aws
    parameters:
      region: us-east-1
      instanceType: g4dn.xlarge
      dockerDataRoot: /docker
      volumes:
        - name: workspace
          parameters:
            size: 50
            deletionPolicy: retain
        - name: docker
          parameters:
            size: 10
            mountDir: /docker
            deletionPolicy: retain


scripts:
  jupyter: |
    jupyter notebook --allow-root --ip 0.0.0.0 --notebook-dir=/workspace/project

EBS volume mount/unmount instead of snapshot

Hi again :)

I'm testing Spotty intensively, often running spotty start and spotty stop. My volume data is about 50 GB, and it takes ages to snapshot it and then recreate the volume. I was wondering if it's possible to just mount/unmount the EBS volume on instance start/stop and create a snapshot only with a spotty stop --snapshot command.

Regards

Jarek

Change Docker version to 19.03.2

In my Dockerfile I have:
COPY --chown=$APP_USER:$APP_GID
but $APP_USER is not recognized because of the bug moby/moby#35018.
It is fixed in version 19.03.2 (it does not work with 19.03.0, even though they say so in the issue comments :) ).

Stack "spotty-instance-tacotron-i1" was not created.

I tried to run the Tacotron example following the Medium article, but got an error:

Creating IAM role for the instance...
Preparing CloudFormation template...
  - volume "tacotron-i1-workspace" will be created
  - volume "tacotron-i1-docker" will be created
  - availability zone: auto
  - maximum Spot Instance price: on-demand
  - AMI: "Deep Learning Base AMI (Ubuntu) Version 19.2" (ami-07ab02efef2ffe373)
  - Docker data will be stored on the "docker" volume

Volumes:
+-----------+---------------+------------+-----------------+
| Name      | Container Dir | Type       | Deletion Policy |
+===========+===============+============+=================+
| workspace | /workspace    | EBS volume | Retain Volume   |
+-----------+---------------+------------+-----------------+
| docker    | -             | EBS volume | Retain Volume   |
+-----------+---------------+------------+-----------------+

Waiting for the stack to be created...
  - launching the instance...
Error:
------
Stack "spotty-instance-tacotron-i1" was not created.
Please, see CloudFormation logs for the details.

Where do I find CloudFormation logs?

I am using Windows 10 with Git bash (MinGW64).

And another question: I cloned the Tacotron repo but needed to rename Dockerfile to Dockerfile.spotty. Why is it not named correctly in the repo? I worry whether I'm doing things correctly.

AWS Batch

Hi

Great project!

How about adding support for training using AWS Batch? Basically, I'm looking for a setup where I can develop and test locally then deploy using AWS Batch on spot instances. Do you think this functionality would be a good fit for Spotty or am I best starting something from scratch?

Thanks
Andrew

Problem with setting up "Caching Docker Image on an EBS Volume"

Hi, I set up everything according to the instructions related to "Caching Docker Image on an EBS Volume", and everything worked perfectly in the previous version of Spotty, but now in 1.3.0 I have problems starting an instance using the same settings.

Spotty throws the following error:
Validation error: The "mountDir" of one of the volumes must be a prefix for the "dockerDataRoot" path.

Can you please help me resolve this?
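For reference, the constraint in the error message matches what the working configs elsewhere on this page do: the volume holding the Docker data is mounted at a path that is a prefix of dockerDataRoot. A minimal sketch of that shape:

instances:
  - name: i1
    provider: aws
    parameters:
      dockerDataRoot: /docker  # Docker data goes here...
      volumes:
        - name: docker
          parameters:
            size: 20
            mountDir: /docker  # ...so this mountDir must be a prefix of dockerDataRoot
            deletionPolicy: retain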

Problem creating AMI, fails at SetLogRetentionFunctionRetention

It's quite hard to debug this one.

spotty aws create-ami                                                                                                                                                                                         (py37)
Waiting for the AMI to be created...
  - creating IAM role for the instance...
Error:
------
Stack "spotty-ami-spottyami" was not created.
Please, see CloudFormation logs for the details.

CloudFormation logs: [screenshot]

The Cloudwatch logs (an extract of the issue):

INFO	FAILED:
 ResourceNotFoundException: The specified log group does not exist.
    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:51:27)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:688:14)
    at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:690:12)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'ResourceNotFoundException'

Spotty Version: 1.2.4

Restart instance automatically

Super basic question - I intend to use this to start spot instances and if they are terminated, automatically restart them.

It seems like that's one thing this tool can help with, but it doesn't explicitly say so in the docs, which is why I'm asking here. Can this tool do this?

Wrong key error while using onDemandInstance parameter

Hi there,

I'm having a problem setting the onDemandInstance parameter to true in spotty.yaml; I get the following error while trying to build an AMI with spotty aws create-ami:

Wrong key 'onDemandInstance' in {'region': 'us-east-2', 'onDemandInstance': 'true', 'amiName': 'my-project', 'instanceType': 'p2.xlarge', 'volumes': [{'name': 'Tacotron2', 'directory': '/workspace', 'size': 50}], 'docker': {'file': 'Dockerfile', 'workingDir': '/workspace/project', 'dataRoot': '/workspace/docker'}, 'ports': [6006, 8888]}

Spotty Version Installed: 1.2.0

Can you please advise? Thanks in advance.

Cannot mount second volume

I have an old volume I am trying to attach to a new instance. I renamed the old instance so it should be fine.

Here is my config:

  - name: ietl
    provider: aws
    parameters:
      region: eu-central-1
      # These GPU instances are more reliable to create with the GPU docker so ¯\_(ツ)_/¯
      instanceType: p3.2xlarge
#      instanceType: t3.2xlarge
#      instanceType: t4g.2xlarge
      spotInstance: True
      ports: [6006, 8888]
      dockerDataRoot: /docker
      volumes:
        - name: workspace
          parameters:
            size: 1000
            deletionPolicy: retain
            mountDir: /workspace
        - name: oldworkspace
          parameters:
            size: 600
            deletionPolicy: retain
            mountDir: /oldworkspace
        - name: docker
          parameters:
            size: 20
            mountDir: /docker

While it's loading, I get this message:


Volumes:
+--------------+------------+------------+-----------------+
| Name         | Mount Path | Type       | Deletion Policy |
+==============+============+============+=================+
| workspace    | /workspace | EBS volume | Retain Volume   |
+--------------+------------+------------+-----------------+
| oldworkspace | -          | EBS volume | Retain Volume   |
+--------------+------------+------------+-----------------+
| docker       | -          | EBS volume | Retain Volume   |
+--------------+------------+------------+-----------------+

Note that oldworkspace has no mount path, and I don't see it on the machine!

Need docker run extra params

It would be nice to pass extra parameters to the docker run command, for example --shm-size 8G.

Is this possible already somehow?
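A config quoted in an earlier issue on this page suggests this already exists as a runtimeParameters list on the container; a sketch under that assumption:

containers:
  - projectDir: /workspace/project
    file: Dockerfile
    runtimeParameters: ['--shm-size', '8G']  # extra arguments appended to "docker run"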

the errors module is not installed

There is a missing __init__.py in spotty/errors so on a fresh install I get the following error:

spotty --version
Traceback (most recent call last):
  File "/home/glefalher/.local/share/virtualenvs/autosuggest-analysis-JWy0ztrb/bin/spotty", line 7, in <module>
    from spotty.commands.aws import AwsCommand
  File "/home/glefalher/.local/share/virtualenvs/autosuggest-analysis-JWy0ztrb/lib/python3.7/site-packages/spotty/commands/aws.py", line 3, in <module>
    from spotty.providers.aws.commands.create_ami import CreateAmiCommand
  File "/home/glefalher/.local/share/virtualenvs/autosuggest-analysis-JWy0ztrb/lib/python3.7/site-packages/spotty/providers/aws/commands/create_ami.py", line 6, in <module>
    from spotty.providers.aws.instance_manager import InstanceManager
  File "/home/glefalher/.local/share/virtualenvs/autosuggest-analysis-JWy0ztrb/lib/python3.7/site-packages/spotty/providers/aws/instance_manager.py", line 3, in <module>
    from spotty.errors.instance_not_running import InstanceNotRunningError
ModuleNotFoundError: No module named 'spotty.errors'

Argument parsing doesn't accept parameter settings before SCRIPT_NAME

Despite the usage info, setting -p before SCRIPT_NAME doesn't work. It would be great if the usage documentation were harmonized with the actual behavior.

vadimkantorov@5CD8223LBY:~/convasr/scripts$ spotty run -p TAG=master train
usage: spotty run [-h] [-d] [-c CONFIG] [-s SESSION_NAME] [-S] [-l] [-r]
                  [-p [PARAMETER=VALUE [PARAMETER=VALUE ...]]]
                  [INSTANCE_NAME] SCRIPT_NAME
spotty run: error: the following arguments are required: SCRIPT_NAME

vadimkantorov@5CD8223LBY:~/convasr/scripts$ spotty run train -p TAG=master # works

[feature request] Self destruction command on server

If there were a way to run spotty stop from the remote machine itself, so it could clean up after itself, that would be quite a useful feature.

Additionally, a way to delete the S3 buckets, snapshots, and any roles that were created would help avoid lingering cloud charges too.

spotty create-ami not working on 1.1.9

Hi Oleg,

I was trying to start a new project from scratch with 1.1.9 version and was not able to create AMI.
At the beginning it was hard to spot what caused the problem, because together with the stack removal I was not able to find logs from the instance creating the AMI (are they removed with the stack?).
I was able to catch them before the stack was removed, and here is the cause:

2019-03-12 21:05:58,884 P8710 [INFO] + apt-get install -y nvidia-docker2=2.0.3+docker18.09.2-1
2019-03-12 21:05:58,884 P8710 [INFO] Reading package lists...
2019-03-12 21:05:58,884 P8710 [INFO] Building dependency tree...
2019-03-12 21:05:58,885 P8710 [INFO] Reading state information...
2019-03-12 21:05:58,885 P8710 [INFO] Some packages could not be installed. This may mean that you have
2019-03-12 21:05:58,885 P8710 [INFO] requested an impossible situation or if you are using the unstable
2019-03-12 21:05:58,885 P8710 [INFO] distribution that some required packages have not yet been created
2019-03-12 21:05:58,885 P8710 [INFO] or been moved out of Incoming.
2019-03-12 21:05:58,885 P8710 [INFO] The following information may help to resolve the situation:
2019-03-12 21:05:58,885 P8710 [INFO]
2019-03-12 21:05:58,885 P8710 [INFO] The following packages have unmet dependencies:
2019-03-12 21:05:58,885 P8710 [INFO] nvidia-docker2 : Depends: nvidia-container-runtime (= 2.0.0+docker18.09.2-1) but 2.0.0+docker18.09.3-1 is to be installed
2019-03-12 21:05:58,885 P8710 [INFO] E: Unable to correct problems, you have held broken packages.
2019-03-12 21:05:58,885 P8710 [INFO] ------------------------------------------------------------
2019-03-12 21:05:58,885 P8710 [ERROR] Exited with error code 100

I can see some changes related to nvidia-docker2 in the spotty/data/create_ami.yaml file, which perhaps caused the problem.

I went back to version 1.1.8 and the AMI was created without a problem.

I think dev-1.2 could have the same problem: I started from it and also wasn't able to create an AMI, but I couldn't find the cause, so I went back to the 'stable' :) version 1.1.9 and then down to 1.1.8.

Problem with config file containing multiple AWS instances for the same project

Hi, I'm trying to create a Spotty config file that will contain multiple definitions of AWS instances. Essentially, I want to run larger instances if needed using the exact same config file.

I created definitions of 3 AWS instances i1, i2, and i3, and the only difference between them is instance type p2.xlarge, p3.2xlarge and p3.8xlarge. When I run spotty start i1, the instance i1 is created successfully and all the commands are completed. When I run spotty start i2, I get this error:

Error:
Stack "spotty-instance-tf1-od-api-i2" was not created.
Please, see CloudFormation logs for the details.

When I connect to the i2 instance using spotty ssh i2 and list the working directory, I get different output than when I do the same thing on i1. On i2 the directory structure seems to be nested: the S3 bucket content is synced into /workspace/detection/research instead of /workspace/detection.

Can you please help me figure out what is wrong and how to resolve this issue?

project:
  name: tf1-od-api
  syncFilters:
    - exclude:
        ...

container:
  projectDir: /workspace/detection
  workingDir: /workspace/detection/research
  file: research/object_detection/dockerfiles/tf1/Dockerfile.spotty
  ports: [6006, 8888]
  volumeMounts:
    - name: workspace
      mountPath: /workspace
  commands: |
    protoc object_detection/protos/*.proto --python_out=.
    cp object_detection/metrics/cocoeval.py /usr/local/lib/python3.6/dist-packages/pycocotools/
    cp /workspace/detection/trains.conf /root
    export PYTHONPATH=$PYTHONPATH:`pwd`/research:`pwd`/research/slim:`pwd`/research/slim/nets:`pwd`/research/object_detection

instances:
  - name: i1
    provider: aws
    parameters:
      region: us-west-2
      instanceType: p2.xlarge
      dockerDataRoot: /docker
      amiId: ami-045f15870dccff376
      volumes:
        - name: workspace
          parameters:
            size: 50
            deletionPolicy: retain
        - name: docker
          parameters:
            size: 10
            mountDir: /docker
            deletionPolicy: retain
      commands: |
        $(sudo rm /usr/local/cuda)
        $(sudo ln -s /usr/local/cuda-10.0 /usr/local/cuda)

  - name: i2
    provider: aws
    parameters:
      region: us-west-2
      instanceType: p3.2xlarge
      dockerDataRoot: /docker
      amiId: ami-045f15870dccff376
      volumes:
        - name: workspace
          parameters:
            size: 50
            deletionPolicy: retain
        - name: docker
          parameters:
            size: 10
            mountDir: /docker
            deletionPolicy: retain
      commands: |
        $(sudo rm /usr/local/cuda)
        $(sudo ln -s /usr/local/cuda-10.0 /usr/local/cuda)

  - name: i3
    provider: aws
    parameters:
      region: us-west-2
      instanceType: p3.8xlarge
      dockerDataRoot: /docker
      amiId: ami-045f15870dccff376
      volumes:
        - name: workspace
          parameters:
            size: 50
            deletionPolicy: retain
        - name: docker
          parameters:
            size: 10
            mountDir: /docker
            deletionPolicy: retain
      commands: |
        $(sudo rm /usr/local/cuda)
        $(sudo ln -s /usr/local/cuda-10.0 /usr/local/cuda)

scripts: ...

gcp example yaml missing

I've been using spotty with AWS and it works really well. I'm now struggling to make a functional spotty.yaml for GCP; would it be possible to post a working example like there is for AWS?
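Not an official example, just a hedged sketch of the shape a GCP config might take. The parameter names under provider: gcp (zone, machineType, gpu, preemptibleInstance) are assumptions to be verified against the GCP instance-parameters docs:

instances:
  - name: i1
    provider: gcp
    parameters:
      zone: us-central1-a  # assumed key names throughout; check the GCP provider docs
      machineType: n1-standard-8
      gpu:
        type: nvidia-tesla-t4
        count: 1
      preemptibleInstance: true  # GCP's analogue of an AWS Spot Instance
      volumes:
        - name: workspace
          parameters:
            size: 50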

spotty download fails because /tmp/spotty is missing

spotty download relies on /tmp/spotty/instance/scripts/upload_files.sh being present on an instance
https://github.com/apls777/spotty/blob/master/spotty/providers/aws/deployment/cf_templates/data/instance.yaml#L175

Somehow in my case it's not there. When I SSH into the instance, I see only my uploaded project in /tmp. It seems that my project directory has overwritten the /tmp scripts.

So the command that should upload files from the instance to S3 fails, but quietly: no error is forwarded to the user.

[feature request] Parameter to select EBS volume type

Currently, EBS volumes are created using the standard type, which corresponds to a previous-generation HDD suitable for infrequently accessed storage. I can convert the volume type in the AWS console, but it would be nice to specify the volume type at creation time.

Also, the standard EBS volume type does not support volumes larger than 1 TB.

The fix would be adding 'VolumeType' : 'gp2' (or a configurable volume type) at https://github.com/apls777/spotty/blob/master/spotty/providers/aws/deployment/cf_templates/instance_template.py#L114

Where to find the CloudFormation logs?

I am getting the error "Please, see CloudFormation logs for the details."

I log into the AWS console, make sure I'm in the right region, and go to CloudFormation > Logs. However, I don't see anything relevant there. Where do I find the logs specifically?

NVIDIA Docker installation fails

AMI cannot be created, because NVIDIA Docker installation fails:

nvidia-docker2 : Depends: docker-ce (= 5:18.09.2~3-0~ubuntu-xenial) but 5:18.09.3~3-0~ubuntu-xenial is to be installed or
docker-ee (= 5:18.09.2~3-0~ubuntu-xenial) but it is not installable

Why does it start an On-Demand Instance?

Hi, I tried this config:

project:
  name: aws-project
  syncFilters:
    - exclude:
        - .git/*
        - .idea/*
        - .*
        - '*/__pycache__/*'

containers:
  - projectDir: /home/jr/aws_spot
    image: tensorflow/tensorflow:nightly-gpu
    env:
      PYTHONPATH: /home/jr/aws_spot
    volumeMounts:
      - name: aws-vol
        mountPath: /home/jr/aws_spot # a directory inside the container where the "project" volume should be mounted

instances:
  - name: i2
    provider: aws
    parameters:
      region: us-east-2
      instanceType: p2.xlarge
      volumes:
        - name: aws-vol
          parameters:
            size: 5
            deletionPolicy: retain

scripts:
  python: |
    python analyze_sf.py

and get an On-Demand Instance:

+-------------------+---------------------+
| Instance State | running |
+-------------------+---------------------+
| Instance Type | p2.xlarge |
+-------------------+---------------------+
| Availability Zone | us-east-2c |
+-------------------+---------------------+
| Public IP Address | 3.134.103.146 |
+-------------------+---------------------+
| Purchasing Option | On-Demand Instance |
+-------------------+---------------------+
| Instance Price | $0.9000 (us-east-1) |
+-------------------+---------------------+

Where is the mistake?
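A hedged guess, based on the other configs on this page that do get Spot Instances: they all set spotInstance: true under the instance parameters, which is missing from the config above. So the fix is likely:

instances:
  - name: i2
    provider: aws
    parameters:
      region: us-east-2
      instanceType: p2.xlarge
      spotInstance: true  # without this, an On-Demand Instance is launched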

Stack was not created: DockerReadyWaitCondition

Hi!

Spotty sounds like an excellent idea and I'm eager to try it out. I'm following this article step-by-step.

At the first step of model training, I'm running into an error after running spotty start (for the extended log, see bottom):

Stack "spotty-instance-tacotron-i1" was not created.
Please, see CloudFormation logs for the details.

CloudFormation is showing:

CREATE_FAILED | The following resource(s) failed to create: [DockerReadyWaitCondition].
Received FAILURE signal with UniqueId i-0a62be66c83b8b65d

I can run spotty run ssh, but this shows a tmux session with the message Pane is dead. I can close the window using Ctrl-b x and execute commands from there. However, running spotty run <script> or spotty run -r <script> gives me the same dead tmux session, and the script is not executed. If possible, I'd like to run Spotty commands from my local machine as advertised. What's going on here, and how can I solve it?

Thanks!

$ spotty start
Instance is already running. Are you sure you want to restart it?
Type "y" to confirm: y
Terminating the instance...
Syncing the project with S3 bucket...
Preparing CloudFormation template...
  - volume "tacotron-i1-workspace" (vol-xxxx) will be attached
  - volume "tacotron-i1-docker" (vol-yyyy) will be attached
  - availability zone: eu-west-1c
  - maximum Spot Instance price: on-demand
  - AMI: "Deep Learning Base AMI (Ubuntu) Version 19.1" (ami-06c961cd3240cfd7c)
  - Docker data will be stored on the "docker" volume

Volumes:
+-----------+---------------+------------+-----------------+
| Name      | Container Dir | Type       | Deletion Policy |
+===========+===============+============+=================+
| workspace | /workspace    | EBS volume | Retain Volume   |
+-----------+---------------+------------+-----------------+
| docker    | -             | EBS volume | Retain Volume   |
+-----------+---------------+------------+-----------------+

Waiting for the stack to be deleted...
Waiting for the stack to be created...
  - launching the instance...
  - waiting for the Docker container to be ready...
Error:
------
Stack "spotty-instance-tacotron-i1" was not created.
Please, see CloudFormation logs for the details.

Timeout for DockerReadyWaitCondition too short

Hi @apls777

I think that 10 minutes for building a Docker image is sometimes too short. Installing Miniconda with some additional packages for a PyTorch task takes a little longer, and spotty start failed. In the CloudFormation stack I got the error 'Failed to receive 1 resource signal(s) within the specified duration', but after a few more minutes the Docker image was up and running.

Regards

Jarek
