runpod / runpodctl
🧰 | RunPod CLI for pod management
Home Page: https://www.runpod.io/
License: GNU General Public License v3.0
How do we run an existing pod without creating a new pod in Google Colab?
I have a use case where a pod executes work and should automatically exit. The current behaviour is that pods restart after the ENTRYPOINT command has ended, which forces me to use runpodctl from within the pod to put it in the exited state. To test this behaviour, I have the following set-up:
Dockerfile
FROM ubuntu:22.04
COPY . .
RUN chmod +x run.sh
ENTRYPOINT ["./run.sh"]
run.sh
#!/bin/sh
# count down for ~20 seconds
for i in $(seq 20 -1 0); do
echo $i
sleep 1
done
# stop the pod
echo "Stopping the pod"
runpodctl stop pod $RUNPOD_POD_ID
This results in the following output whenever the command is run:
2024-06-11T14:33:40.393686854Z Error: Post "https://api.runpod.io/graphql?api_key=XXX": x509: certificate signed by unknown authority
Any thoughts?
ps: is there any other way to exit the pod after executing, instead of restarting it automatically?
Using runpodctl v1.8.0.
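For what it's worth, the x509 "certificate signed by unknown authority" error usually means the base image ships without a CA bundle, so HTTPS calls to api.runpod.io cannot be verified. A minimal fix sketch, assuming the plain ubuntu:22.04 base shown above:

```dockerfile
FROM ubuntu:22.04
# Install the system CA bundle so runpodctl can verify api.runpod.io's TLS cert
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates \
    && rm -rf /var/lib/apt/lists/*
COPY . .
RUN chmod +x run.sh
ENTRYPOINT ["./run.sh"]
```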
I have been trying to send a 172 MB file for the last hour without any success. I keep retrying to no avail.
Sometimes when I send it will just stop in the middle of the job, and it stays like that, like frozen.
$ runpodctl send samples.zip
Sending 'samples.zip' (172.4 MB)
Code is: 1100-yahoo-boat-friend-0
On the other computer run
runpodctl receive 1100-yahoo-boat-friend-0
Sending (->XX.XX.XXX:40806)
samples.zip 90% |██████████████████  | (156/172 MB, 478.021 kB/s) [3m42s:34s]
... so the download never finishes on my end.
But there's another problem: even when the send side does reach 100%, my receiving end (say, my PC) sometimes stops mid-transfer and stays frozen, as if there were a communication breakdown and neither side knows what to do next.
When I'm doing runpodctl start pod {podId}
Is there any way to pass in a command argument to the pod? For example, sending the docker command, appending something to the docker command, setting a bash environment variable, or any other way to pass an argument string to the pod from the remote command line where I'm invoking runpodctl. My goal is to start a pod remotely and point it at a target URL that it should process. I know I can set a startup docker command from within the web interface, but I'm hoping to do something like that from the command line.
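As a workaround sketch (not a documented runpodctl feature): bake the parameter into the pod's environment, set once on the template, so the pod can be started without per-start arguments. TARGET_URL below is a hypothetical variable name of my choosing:

```shell
# Hypothetical entrypoint helper: read the work item from an env var set on
# the pod/template instead of a CLI argument passed at start time.
process_target() {
  if [ -z "${TARGET_URL:-}" ]; then
    echo "TARGET_URL not set; nothing to process" >&2
    return 1
  fi
  echo "processing ${TARGET_URL}"
  # ... fetch and process the URL here ...
}
```

If the URL changes between runs, another option is to have the entrypoint poll a fixed queue or object-store location for its next work item, which also needs no per-start argument.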
Hi!
I think it would be much better to add a Homebrew installation option.
What do you think?
Thanks
I'm trying to start a pod using the cli by doing:
runpodctl start pod <id>
but I'm getting the error: Error: Cannot resume a spot pod as an on demand pod.
I also tried putting a bid that matches the spot pricing:
runpodctl start pod --bid 0.340 <id>
and that gets me a different error Error: PodBidResume: statuscode 400
Seems like some core functionality is missing; if it's not, it's poorly documented.
Is it possible to get runpodctl to return json?
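Not as far as these reports go, but the GraphQL endpoint the CLI talks to (visible in runpodctl's own error messages) returns JSON directly. A sketch; the query shape is an assumption based on RunPod's GraphQL docs, not guaranteed:

```shell
# Assumed query shape; RUNPOD_API_KEY must hold a valid API key.
get_pods_json() {
  curl -s "https://api.runpod.io/graphql?api_key=${RUNPOD_API_KEY}" \
    -H 'Content-Type: application/json' \
    -d '{"query":"query { myself { pods { id name desiredStatus } } }"}'
}
# Usage: get_pods_json | jq '.data.myself.pods'
```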
In order to orchestrate serverless RunPod.io deployments as part of a continuous-deployment workflow, it would be desirable to update the Serverless template using runpodctl, specifically to change the Container Image setting on the template to point to a new version of the image.
Pointing the template at the :latest tag runs the risk of docker pull caches being out of sync and running an old version of the image, and it makes rollback difficult too.
Ideally, I'd like to be able to execute a runpodctl command that points an existing Serverless template at a new image URL.
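A sketch of what this could look like today via the GraphQL API. The saveTemplate mutation name and its input fields are assumptions taken from RunPod's GraphQL docs, so verify them against the current schema before relying on this:

```shell
# Hypothetical template update: point an existing template at a new image tag.
update_template_image() {
  template_id="$1" image="$2"
  curl -s "https://api.runpod.io/graphql?api_key=${RUNPOD_API_KEY}" \
    -H 'Content-Type: application/json' \
    -d "{\"query\":\"mutation { saveTemplate(input: {id: \\\"${template_id}\\\", imageName: \\\"${image}\\\"}) { id imageName } }\"}"
}
# update_template_image my-template-id registry.example.com/worker:v1.2.3
```

Pinning to a versioned tag like :v1.2.3 this way avoids the stale :latest cache problem and makes rollback a matter of re-running the command with the previous tag.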
wget --quiet --show-progress https://github.com/Run-Pod/runpodctl/releases/download/v1.6.1/runpodctl-linux-amd -O runpodctl
chmod +x runpodctl
cp runpodctl /usr/bin/runpodctl
Hi, is it possible to get balance information via runpodctl, via a GraphQL call, or via the SDK?
I need this information for automatic monitoring purpose, and send alert when balance is low.
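I don't think runpodctl exposes balance, but the GraphQL API may. The clientBalance field name below is an assumption drawn from community examples, so check it against the current schema:

```shell
# Hypothetical balance check for monitoring; clientBalance is an assumed field.
get_balance() {
  curl -s "https://api.runpod.io/graphql?api_key=${RUNPOD_API_KEY}" \
    -H 'Content-Type: application/json' \
    -d '{"query":"query { myself { clientBalance } }"}'
}
# Alert sketch: get_balance | jq -e '.data.myself.clientBalance >= 5' || notify
```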
Is there a way to obtain the hostname and port of a Pod's SSH using runpodctl? I would like to automate benchmarking my models, but I need to automate the ssh connection.
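One approach, assuming the runtime.ports fields (ip, privatePort, publicPort) from RunPod's GraphQL schema: query the pod, then pick the mapping whose privatePort is 22. A dependency-free extraction sketch (a real script would use jq); it relies on ip appearing before publicPort inside each mapping:

```shell
# Pull "ip port" for the SSH port mapping out of a pod's GraphQL runtime JSON.
ssh_endpoint_from_json() {
  echo "$1" |
    grep -o '{[^{}]*"privatePort":22[^{}]*}' |
    sed -n 's/.*"ip":"\([^"]*\)".*"publicPort":\([0-9]*\).*/\1 \2/p'
}
# Then something like: ssh "root@$ip" -p "$port" -i ~/.ssh/id_ed25519 'nvidia-smi'
```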
Every single call, to both the API and runpodctl, ends with errors like:
context deadline exceeded (Client.Timeout exceeded while awaiting headers)
The Windows install URL in the Readme is outdated and no longer works
wget https://github.com/runpod/runpodctl/releases/download/v1.9.0/runpodctl-windows-amd64.exe -O runpodctl.exe
Needs to be updated to
wget https://github.com/runpod/runpodctl/releases/download/v1.14.2/runpodctl-windows-amd64.exe -O runpodctl.exe
On line 21 of cmd/exec/functions.go, python3.11 is used instead of the container's default Python version (python3). This is a problem when using most of the RunPod PyTorch templates.
Is it possible to receive a file and change its name upon receiving it?
For instance, say I'm sending samples.zip, but on my receiving end I'd like it unzipped into a folder named samples-2.
runpodctl receive 5261-goat-module-brasil-8 samples-2
More specifically, say I need to review from my PC a remote folder with changing data in it, such as logs or images being created every n minutes, and I'd like to keep track of changes across different folders.
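For recurring review of a changing remote folder, a one-shot receive is awkward. A sketch using rsync over the pod's exposed SSH port would handle both the rename and the repeated syncing; host, port, and paths here are placeholders:

```shell
# Hypothetical sync helper: pull a remote folder into a differently named
# local one; rsync transfers only changed files on each run.
sync_folder() {
  host="$1" port="$2" remote="$3" local_dir="$4"
  rsync -az -e "ssh -p ${port}" "root@${host}:${remote}/" "${local_dir}/"
}
# e.g. poll every minute:
# while sleep 60; do sync_folder 1.2.3.4 40806 /workspace/logs samples-2; done
```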
This should be as simple as setting GOARCH=arm/arm64 and GOOS=android/linux, but my fork failed to build for some reason. I'm not familiar with this release process; that might be part of it?
Failed to deploy project: Your worker concurrency cannot go beyond the maximum limit of (20). Please contact support if you wish to scale past this number.
Perhaps, this could be checked on before having to wait a few minutes when deploying a new endpoint.
I could see how that experience would frustrate a user.
Looks like the GraphQL spec can return information about a pod's IP.
I want to create a pod with the CLI and then connect to the created pod with ssh.
After creating a pod with runpodctl, how can I get the same connection information that I get on the console to access the pod. I am talking about the ssh connection info (e.g: ssh [email protected] -i ~/.ssh/id_ed1111111)
Right now I have to log in to the console to get this information. What is the preferred way to get this from the CLI?
runpodctl version
-> "runpodctl v1.8.0"
wget -qO- cli.runpod.net | sudo bash
runpodctl version
-> still "runpodctl v1.8.0"
Version 1.14.3 should have been installed, but v1.8.0 is still installed.
root@runpod-pod:~ # which runpodctl
/usr/bin/runpodctl
root@runpod-pod:~ # wget -qO- cli.runpod.net | sudo bash
Installing runpodctl...
jq is not installed.
Installing jq...
Hit:1 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu focal InRelease
Hit:2 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Hit:5 http://security.ubuntu.com/ubuntu focal-security InRelease
Reading package lists... Done
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
libjq1 libonig5
The following NEW packages will be installed:
jq libjq1 libonig5
0 upgraded, 3 newly installed, 0 to remove and 77 not upgraded.
Need to get 313 kB of archives.
After this operation, 1062 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 libonig5 amd64 6.9.4-1 [142 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 libjq1 amd64 1.6-1ubuntu0.20.04.1 [121 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 jq amd64 1.6-1ubuntu0.20.04.1 [50.2 kB]
Fetched 313 kB in 0s (855 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package libonig5:amd64.
(Reading database ... 29857 files and directories currently installed.)
Preparing to unpack .../libonig5_6.9.4-1_amd64.deb ...
Unpacking libonig5:amd64 (6.9.4-1) ...
Selecting previously unselected package libjq1:amd64.
Preparing to unpack .../libjq1_1.6-1ubuntu0.20.04.1_amd64.deb ...
Unpacking libjq1:amd64 (1.6-1ubuntu0.20.04.1) ...
Selecting previously unselected package jq.
Preparing to unpack .../jq_1.6-1ubuntu0.20.04.1_amd64.deb ...
Unpacking jq (1.6-1ubuntu0.20.04.1) ...
Setting up libonig5:amd64 (6.9.4-1) ...
Setting up libjq1:amd64 (1.6-1ubuntu0.20.04.1) ...
Setting up jq (1.6-1ubuntu0.20.04.1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.14) ...
Latest version of runpodctl: v1.14.3
runpodctl 100%[==================================================>] 3.48M 17.8MB/s in 0.2s
runpodctl installed successfully.
root@runpod-pod:~ # which runpodctl
/usr/local/bin/runpodctl
root@runpod-pod:~ # runpodctl version
runpodctl v1.8.0
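In the transcript above, which reports /usr/local/bin/runpodctl yet runpodctl version still prints v1.8.0. That is the classic symptom of the shell having hashed the old /usr/bin path earlier in the session, or of a stale copy shadowing the fresh install. A small diagnostic sketch:

```shell
# Print the first executable found on PATH for a command name. Compare with
# `type -a runpodctl` (every copy, in lookup order), then run `hash -r` to
# drop the shell's cached command locations and delete the stale copy.
first_on_path() {
  command -v "$1"
}
```

If first_on_path runpodctl still prints /usr/bin/runpodctl after `hash -r`, removing that file (or reordering PATH) should let the freshly installed v1.14.3 copy take over.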
It would be awesome if macOS Homebrew users could install this from brew:
I was sending a stable diffusion model which is 2 gigabytes, but around 90%, the transfer just stopped. This happened the other day too, but at 80%.
I'm looking for a way to create a pod in a specific datacenter from the command line. There's a way to do it with the web interface -- it looks like we'd just need to pass dataCenterId
in the graphql request (unless I'm misreading something). I guess it'd need to be added somewhere like this:
Line 138 in 46dbb96
panic: runtime error: index out of range [4] with length 4
goroutine 1 [running]:
cli/cmd/croc.glob..func1(0xc4c9e0, {0xc000121610, 0x1, 0x1})
/home/runner/work/runpodctl/runpodctl/cmd/croc/receive.go:47 +0x3d3
github.com/spf13/cobra.(*Command).execute(0xc4c9e0, {0xc0001215f0, 0x1, 0x1})
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:860 +0x5f8
github.com/spf13/cobra.(*Command).ExecuteC(0xc4bae0)
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:974 +0x3bc
github.com/spf13/cobra.(*Command).Execute(...)
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:902
cli/cmd.Execute({0x9571a4, 0xc0000001a0})
/home/runner/work/runpodctl/runpodctl/cmd/root.go:26 +0x4a
main.main()
/home/runner/work/runpodctl/runpodctl/main.go:8 +0x27
Can't receive data from runpod (docker image with no scp support)
$ runpodctl receive 1208-goat-boat-screen
panic: runtime error: index out of range [4] with length 4
goroutine 1 [running]:
cli/cmd/croc.glob..func1(0xc4c9e0, {0xc0000f1620, 0x1, 0x1})
/home/runner/work/runpodctl/runpodctl/cmd/croc/receive.go:47 +0x3d3
github.com/spf13/cobra.(*Command).execute(0xc4c9e0, {0xc0000f1600, 0x1, 0x1})
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:860 +0x5f8
github.com/spf13/cobra.(*Command).ExecuteC(0xc4bae0)
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:974 +0x3bc
github.com/spf13/cobra.(*Command).Execute(...)
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:902
cli/cmd.Execute({0x9571a4, 0xc0000001a0})
/home/runner/work/runpodctl/runpodctl/cmd/root.go:26 +0x4a
main.main()
/home/runner/work/runpodctl/runpodctl/main.go:8 +0x27
root@bed533d5304a:/workspace/stable-diffusion-webui# ls -l
total 680
I'm not getting the ID from an active pod.
# runpodctl get pod
Error: data is nil: {"data":{"myself":null}}
The RunPod CLI tool to manage resources on runpod.io and develop serverless applications.
Usage:
runpodctl [command]
Aliases:
runpodctl, runpod
Available Commands:
completion Generate the autocompletion script for the specified shell
config Manage CLI configuration
create create a resource
exec Execute commands in a pod
get get resource
help Help about any command
project Manage RunPod projects
receive receive file(s), or folder
remove remove a resource
send send file(s), or folder
ssh SSH keys and commands
start start a resource
stop stop a resource
update update runpodctl
Flags:
-h, --help help for runpodctl
-v, --version Print the version of runpodctl
Some command descriptions start with capitals and others do not; they should be consistent.
Also, don't use "(s)":
Don't put optional plurals in parentheses. Instead, use either plural or singular constructions and keep things consistent throughout your documentation. Choose what is most appropriate for your documentation and your audience. If it's important in a specific context to indicate both, use one or more.
I ran this command.
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
--stage sft \
--model_name_or_path openlm-research/open_llama_7b \
--do_train \
--dataset train \
--template default \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir checkpoint \
--overwrite_cache \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 2000 \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--plot_loss \
--fp16
[INFO|training_args.py:1345] 2023-12-07 06:09:02,164 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1798] 2023-12-07 06:09:02,164 >> PyTorch: setting up devices
[INFO|trainer.py:1760] 2023-12-07 06:09:03,760 >> ***** Running training *****
[INFO|trainer.py:1761] 2023-12-07 06:09:03,761 >> Num examples = 78,303
[INFO|trainer.py:1762] 2023-12-07 06:09:03,761 >> Num Epochs = 3
[INFO|trainer.py:1763] 2023-12-07 06:09:03,761 >> Instantaneous batch size per device = 4
[INFO|trainer.py:1766] 2023-12-07 06:09:03,761 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1767] 2023-12-07 06:09:03,761 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1768] 2023-12-07 06:09:03,761 >> Total optimization steps = 14,682
[INFO|trainer.py:1769] 2023-12-07 06:09:03,762 >> Number of trainable parameters = 4,194,304
0%| | 0/14682 [00:00<?, ?it/s][WARNING|logging.py:290] 2023-12-07 06:09:03,766 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
Traceback (most recent call last):
File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
main()
File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/workspace/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 68, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1591, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1950, in _inner_training_loop
self.accelerator.clip_grad_norm_(
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
self.unscale_gradients()
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
self.scaler.unscale_(opt)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
It worked fine when I used it yesterday, but after I changed the dataset size today this problem appeared. What is going on?
This was working for me just fine and then randomly out of the blue, running runpodctl send <file>
just exits without saying anything.
This happens both locally and on the pod itself. Is there any way to get some verbose output / logging info so I can help you troubleshoot?
I'm running it on a macbook m2, just installed it today v1.9.0. Same behavior on the pod itself so I don't know if it matters.
Using the web interface I'm able to Deploy a spot Instance instead of an On Demand instance. It would be nice to be able to do this using the command line tool too.
I tried naively replacing podFindAndDeployOnDemand with podRentInterruptable, but this failed. I have no idea if this was a permission problem, a server problem, or a client problem. (If I could get it to work, I'd provide a pull request.) I can see the current spot price using runpodctl get cloud
. Once I create a pod through the web, I am able to see it and stop it using the command line interface.
I found this documentation.
Hello,
I would like to create a non-GPU Pod for quick experimenting before running a GPU Pod. I cannot create a CPU Pod, because runpodctl requires gpuType.
Being able to send multiple files at once would be helpful, like:
runpodctl send "t112_38080.safetensors","t112_38080.yaml"
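Until multi-file send exists, a workaround sketch is to bundle the files into a single archive, send that, and unpack on the receiving side:

```shell
# Bundle several files into one archive so a single `runpodctl send` suffices.
bundle() {
  out="$1"; shift
  tar czf "$out" "$@"
}
# bundle models.tgz t112_38080.safetensors t112_38080.yaml
# runpodctl send models.tgz     # then `tar xzf models.tgz` after receiving
```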
Hi, is there any way to update the container image for my running pod, just like the edit pod option?
It seems it's only possible to create a new pod with a new GPU using the create command, not to reuse the GPU I already have.
I hope there's a way to update the container image only, without changing the pod id & GPU.
When you install runpodctl, you can't confirm a successful installation by checking the version; you get an error telling you to run runpodctl config. After adding an API key, runpodctl version works as expected.
I tried to use runpodctl to upload a dataset of around 100 GB to RunPod. To receive the files I had to start the pod; however, it has taken the whole day, which means I pay for the GPUs for a whole day with no chance to use them, because runpodctl always fails.
Could you please make AUR package:
Is there a way to watch container logs when starting a pod using this command line tool?
How to create pod with an existing network volume attached using runpodctl
?
Can't seem to find it in the documentation.
Many thanks in advance.
Not really a 🐛 bug, but not the expected behavior:
When you run runpodctl project start you get a "No 'runpod.toml' found in the current directory." error, when it should be something like "unknown command", or an alias for create.
rp % runpodctl version
runpodctl v1.14.1
rp % runpodctl project start
No 'runpod.toml' found in the current directory.
Please navigate to your project directory and try again.
rp % runpodctl project h
Develop and deploy projects entirely on RunPod's infrastructure.
Usage:
runpodctl project [command]
Available Commands:
build builds Dockerfile for current project
create Creates a new project
deploy deploys your project as an endpoint
dev Start a development session for the current project
Flags:
-h, --help help for project
Use "runpodctl project [command] --help" for more information about a command.
rp % runpodctl project create
Welcome to the RunPod Project Creator!
--------------------------------------
Provide a name for your project:
>