aws's Introduction

Manage AWS EC2 instances for experiments

Local Installation of Repository

See the instructions here. Then clone and enter the repo.

git clone https://github.com/wesselb/aws
cd aws

Finally, make a virtual environment and install the requirements.

virtualenv -p python3 venv
source venv/bin/activate
pip install -r requirements.txt -e .

Sample Experiment

In the following, values that you need to set are bash variables (like $REPO) or Python constants (like KEY).
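One way to keep these values in one place is to read the Python constants from environment variables of the same name at the top of cluster.py. This is only a sketch: the helper `require_env` is hypothetical and not part of the `aws` package, and the names KEY and REPO simply mirror the constants used in the sample experiment.

```python
import os


def require_env(name, environ=None):
    """Return the value of the variable `name`, or raise a clear error.

    Hypothetical helper for illustration; it is not part of the `aws` package.
    """
    # Fall back to the real process environment when no mapping is given.
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the environment variable {name} first.")
    return value


# Usage with an explicit mapping, so the example runs without any exports.
# In practice you would `export KEY=... REPO=...` in bash and omit `example`.
example = {"KEY": "my-key", "REPO": "my-repo"}
KEY = require_env("KEY", example)
REPO = require_env("REPO", example)
print(f"~/.ssh/{KEY}.pem", f"/home/ec2-user/{REPO}")
```

Failing early with a named variable tends to be easier to debug than a malformed path or an empty key name later on.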

Set Up AWS

  • Install and configure the AWS CLI.

  • Create a new EC2 instance with the Deep Learning Base AMI (Amazon Linux 2). Make sure that you allocate enough disk space.

  • Create and name an appropriate security group.

  • Launch the instance.

Create an Image

  • Log into the instance.

  • Create a key for GitHub.

ssh-keygen -f ~/.ssh/github -t ed25519 -C "[email protected]" \
    && (echo "Host github.com"                 > ~/.ssh/config) \
    && (echo "    IdentityFile ~/.ssh/github" >> ~/.ssh/config) \
    && chmod 644 ~/.ssh/config \
    && echo "Public key:" \
    && cat ~/.ssh/github.pub
  • Add the public key to your GitHub account.

  • Configure the instance:

sudo amazon-linux-extras install python3.8 \
    && sudo yum install -y tmux htop python38-devel \
    && sudo pip3.8 install --upgrade pip setuptools Cython numpy virtualenv
  • Set up the AWS repository. Note: if the path to the repository is ~/aws, then ~/aws/venv must be a virtual environment that has the repository installed in editable mode.
cd ~
git clone git@github.com:wesselb/aws.git \
    && cd aws \
    && virtualenv venv -p python3.8 \
    && source venv/bin/activate \
    && pip install -r requirements.txt -e . \
    && deactivate \
    && cd ..
  • Set up the project repository:
cd ~
git clone git@github.com:$USER/$REPO.git \
    && cd $REPO \
    && virtualenv venv -p python3.8 \
    && source venv/bin/activate \
    && pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html \
    && pip install -r requirements.txt -e . \
    && deactivate \
    && cd ..
  • If necessary, transfer data to the instance:
rsync -e "ssh -i ~/.ssh/$KEY.pem" -Pav $DATA_DIR ec2-user@$IP:/home/ec2-user/$REPO
  • Stop the instance and create an image.

  • Once the image is ready, terminate the instance.

Test the Cluster

  • Create a file cluster.py:
import aws

aws.config["ssh_user"] = "ec2-user"
aws.config["ssh_key"] = f"~/.ssh/{KEY}.pem"
aws.config["setup_commands"] = [
    f"cd /home/ec2-user/{REPO}",
    "ssh-keygen -F github.com || ssh-keyscan github.com >> ~/.ssh/known_hosts",
    "git pull"
]

commands = [
    ["mkdir -p results", "touch results/one.txt"],
    ["mkdir -p results", "touch results/two.txt"],
    ["mkdir -p results", "touch results/three.txt"],
]

aws.manage_cluster(
    commands,
    instance_type="t2.small",
    key_name=KEY,
    security_group_id=SECURITY_GROUP,
    image_id=IMAGE_ID,
    sync_sources=[f"/home/ec2-user/{REPO}/results"],
    sync_target=aws.LocalPath("sync"),
    monitor_call=aws.shutdown_timed_call(duration=60),
    monitor_delay=60,
    monitor_aws_repo="/home/ec2-user/aws",
)
  • Here's what it can do:
usage: cluster.py [-h] [--spawn SPAWN] [--start] [--terminate] [--kill]
                  [--stop] [--sync-stopped] [--sync-sleep SYNC_SLEEP]

optional arguments:
  -h, --help            show this help message and exit
  --spawn SPAWN         Spawn instances.
  --start               Start experiments.
  --terminate           Terminate all instances. This is a kill switch.
  --kill                Kill all running experiments, but keep the instances
                        running.
  --stop                Stop all running instances
  --sync-stopped        Synchronise all stopped instances.
  --sync-sleep SYNC_SLEEP
                        Number of seconds to sleep before syncing again.
  • Make an empty directory to synchronise to:
mkdir sync
  • Test that no instances are running:
$ python cluster.py
Instances still running: 0
Sleeping for two minutes...
  • Kill the script. Now spawn two instances and start the experiment:
$ python cluster.py --spawn 2 --start
...
  • Wait for the instances to have booted and the experiments to have started. The local folder sync/results should eventually contain the files one.txt, two.txt, and three.txt.

  • Wait a bit longer to confirm that the instances eventually shut themselves down.

  • Kill the script. Now that all instances are stopped, remove everything in sync and attempt to sync the stopped instances:

$ python cluster.py --sync-stopped
  • If the contents of sync are restored, then we're golden! Terminate all instances of the cluster.
$ python cluster.py --terminate
Terminating all instances:
Instances still running: 0
Sleeping for two minutes...

Kill the script. You're now good to run your big experiment!
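The three command lists in cluster.py follow one pattern, so for a larger experiment they could be generated rather than typed out, and the same names can be reused to check what has arrived in sync/results. The following is a sketch under assumptions: `make_commands` and `missing_results` are hypothetical helpers, not part of the `aws` package, and the layout sync/results/<name>.txt is taken from the test above.

```python
from pathlib import Path


def make_commands(names, results_dir="results"):
    """Build one list of shell commands per experiment name.

    Hypothetical helper; it only reproduces the structure of `commands`
    in cluster.py.
    """
    return [
        [f"mkdir -p {results_dir}", f"touch {results_dir}/{name}.txt"]
        for name in names
    ]


def missing_results(sync_dir, names, results_dir="results"):
    """Return the names whose result file is not in the local sync folder."""
    root = Path(sync_dir) / results_dir
    return [name for name in names if not (root / f"{name}.txt").is_file()]


names = ["one", "two", "three"]
commands = make_commands(names)  # Same structure as `commands` in cluster.py.
print(commands[0])
print("Missing:", missing_results("sync", names))
```

An empty "Missing" list after `--sync-stopped` would indicate that every expected file was restored.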

Features

Synchronise to a Remote Host

aws.manage_cluster(
    commands,
    ...
    sync_target=aws.RemotePath(
        aws.Remote(user="user", host="host", key=f"~/.ssh/{KEY}"),
        "/path/to/sync"
    ),
    ...
)

Shut Down the Instance When All GPUs Are Idle

aws.manage_cluster(
    commands,
    ...
    monitor_call=aws.shutdown_when_all_gpus_idle_call(duration=60),
    ...
)
