runpod / runpodctl
🧰 | RunPod CLI for pod management
Home Page: https://www.runpod.io/
License: GNU General Public License v3.0
How do we run an existing pod without creating a new pod in Google Colab?
I have a use case where a pod executes work and should automatically exit. The current behaviour is that pods restart after the ENTRYPOINT command has ended, which forces me to use runpodctl from within the pod to put it in the exited state. To test this behaviour, I have the following set-up:
Dockerfile
FROM ubuntu:22.04
COPY . .
RUN chmod +x run.sh
ENTRYPOINT ["./run.sh"]
run.sh
#!/bin/sh
# count down for ~20 seconds
for i in $(seq 20 -1 0); do
echo $i
sleep 1
done
# stop the pod
echo "Stopping the pod"
runpodctl stop pod $RUNPOD_POD_ID
This results in the following output whenever the command is run:
2024-06-11T14:33:40.393686854Z Error: Post "https://api.runpod.io/graphql?api_key=XXX": x509: certificate signed by unknown authority
Any thoughts?
ps: is there any other way to exit the pod after executing, instead of restarting it automatically?
Using runpodctl v1.8.0.
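For what it's worth, the x509 "certificate signed by unknown authority" error usually means the base image ships without a CA bundle, so HTTPS calls to api.runpod.io cannot be verified. A minimal fix sketch, assuming the plain ubuntu:22.04 base shown above:

```dockerfile
FROM ubuntu:22.04
# Install the system CA bundle so runpodctl can verify api.runpod.io's TLS cert
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates \
    && rm -rf /var/lib/apt/lists/*
COPY . .
RUN chmod +x run.sh
ENTRYPOINT ["./run.sh"]
```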
I have been trying to send a 172 MB file for the last hour without any success. I keep retrying to no avail.
Sometimes when I send it will just stop in the middle of the job, and it stays like that, like frozen.
$ runpodctl send samples.zip
Sending 'samples.zip' (172.4 MB)
Code is: 1100-yahoo-boat-friend-0
On the other computer run
runpodctl receive 1100-yahoo-boat-friend-0
Sending (->XX.XX.XXX:40806)
samples.zip 90% |██████████████████  | (156/172 MB, 478.021 kB/s) [3m42s:34s]
... so the download never finishes on my end.
But there's another problem: even when the send side does reach 100%, my receiving end (say, my PC) sometimes stops mid-transfer and stays frozen, as if there were a communication breakdown and neither side knows what to do next.
When I'm doing runpodctl start pod {podId}
Is there any way to pass in a command argument to the pod? For example, sending the docker command, appending something to the docker command, setting a bash environment variable, or any other way to pass an argument string to the pod from the remote command line where I'm invoking runpodctl. My goal is to start a pod remotely and point it at a target URL that it should process. I know I can set a startup docker command from within the web interface, but I'm hoping to do something like that from the command line.
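As a workaround sketch (not a documented runpodctl feature): bake the parameter into the pod's environment, set once on the template, so the pod can be started without per-start arguments. TARGET_URL below is a hypothetical variable name of my choosing:

```shell
# Hypothetical entrypoint helper: read the work item from an env var set on
# the pod/template instead of a CLI argument passed at start time.
process_target() {
  if [ -z "${TARGET_URL:-}" ]; then
    echo "TARGET_URL not set; nothing to process" >&2
    return 1
  fi
  echo "processing ${TARGET_URL}"
  # ... fetch and process the URL here ...
}
```

If the URL changes between runs, another option is to have the entrypoint poll a fixed queue or object-store location for its next work item, which also needs no per-start argument.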
Hi!
I think it would be much better to add a Homebrew installation option.
What do you think?
Thanks
I'm trying to start a pod using the cli by doing:
runpodctl start pod <id>
but I'm getting the error: Error: Cannot resume a spot pod as an on demand pod.
I also tried putting a bid that matches the spot pricing:
runpodctl start pod --bid 0.340 <id>
and that gets me a different error Error: PodBidResume: statuscode 400
Seems like some core functionality is missing; if it's not, it's poorly documented.
Is it possible to get runpodctl to return json?
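Not as far as these reports go, but the GraphQL endpoint the CLI talks to (visible in runpodctl's own error messages) returns JSON directly. A sketch; the query shape is an assumption based on RunPod's GraphQL docs, not guaranteed:

```shell
# Assumed query shape; RUNPOD_API_KEY must hold a valid API key.
get_pods_json() {
  curl -s "https://api.runpod.io/graphql?api_key=${RUNPOD_API_KEY}" \
    -H 'Content-Type: application/json' \
    -d '{"query":"query { myself { pods { id name desiredStatus } } }"}'
}
# Usage: get_pods_json | jq '.data.myself.pods'
```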
In order to orchestrate serverless RunPod.io deployments as part of a continuous-deployment workflow, it would be desirable to update the Serverless template using runpodctl, specifically to change the Container Image setting on the template to point to a new version of the image.
Pointing the template at the :latest tag runs the risk of docker pull caches being out of sync and running an old version of the image, and it makes rollback difficult too.
Ideally, I'd like to be able to execute a runpodctl command that points an existing Serverless template at a new image URL.
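A sketch of what this could look like today via the GraphQL API. The saveTemplate mutation name and its input fields are assumptions taken from RunPod's GraphQL docs, so verify them against the current schema before relying on this:

```shell
# Hypothetical template update: point an existing template at a new image tag.
update_template_image() {
  template_id="$1" image="$2"
  curl -s "https://api.runpod.io/graphql?api_key=${RUNPOD_API_KEY}" \
    -H 'Content-Type: application/json' \
    -d "{\"query\":\"mutation { saveTemplate(input: {id: \\\"${template_id}\\\", imageName: \\\"${image}\\\"}) { id imageName } }\"}"
}
# update_template_image my-template-id registry.example.com/worker:v1.2.3
```

Pinning to a versioned tag like :v1.2.3 this way avoids the stale :latest cache problem and makes rollback a matter of re-running the command with the previous tag.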
wget --quiet --show-progress https://github.com/Run-Pod/runpodctl/releases/download/v1.6.1/runpodctl-linux-amd -O runpodctl
chmod +x runpodctl
cp runpodctl /usr/bin/runpodctl
Hi, is it possible to get balance information via runpodctl, via a GraphQL call, or via the SDK?
I need this information for automatic monitoring purpose, and send alert when balance is low.
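I don't think runpodctl exposes balance, but the GraphQL API may. The clientBalance field name below is an assumption drawn from community examples, so check it against the current schema:

```shell
# Hypothetical balance check for monitoring; clientBalance is an assumed field.
get_balance() {
  curl -s "https://api.runpod.io/graphql?api_key=${RUNPOD_API_KEY}" \
    -H 'Content-Type: application/json' \
    -d '{"query":"query { myself { clientBalance } }"}'
}
# Alert sketch: get_balance | jq -e '.data.myself.clientBalance >= 5' || notify
```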
Is there a way to obtain the hostname and port of a Pod's SSH using runpodctl? I would like to automate benchmarking my models, but I need to automate the ssh connection.
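One approach, assuming the runtime.ports fields (ip, privatePort, publicPort) from RunPod's GraphQL schema: query the pod, then pick the mapping whose privatePort is 22. A dependency-free extraction sketch (a real script would use jq); it relies on ip appearing before publicPort inside each mapping:

```shell
# Pull "ip port" for the SSH port mapping out of a pod's GraphQL runtime JSON.
ssh_endpoint_from_json() {
  echo "$1" |
    grep -o '{[^{}]*"privatePort":22[^{}]*}' |
    sed -n 's/.*"ip":"\([^"]*\)".*"publicPort":\([0-9]*\).*/\1 \2/p'
}
# Then something like: ssh "root@$ip" -p "$port" -i ~/.ssh/id_ed25519 'nvidia-smi'
```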
Every single call, to both the API and runpodctl, ends with errors like:
context deadline exceeded (Client.Timeout exceeded while awaiting headers)
The Windows install URL in the Readme is outdated and no longer works
wget https://github.com/runpod/runpodctl/releases/download/v1.9.0/runpodctl-windows-amd64.exe -O runpodctl.exe
Needs to be updated to
wget https://github.com/runpod/runpodctl/releases/download/v1.14.2/runpodctl-windows-amd64.exe -O runpodctl.exe
On line 21 of cmd/exec/functions.go, python3.11 is used instead of the container's default Python version (python3). This is a problem when using most of the RunPod PyTorch templates.
Is it possible to receive a file and change its name upon receiving it?
For instance, say I'm sending samples.zip, but on my receiving end I'd like it unzipped into a folder named samples-2.
runpodctl receive 5261-goat-module-brasil-8 samples-2
More specifically, say I need to review from my PC a remote folder with changing data in it, such as logs or images being created every n minutes, and I'd like to keep track of changes across different folders.
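For recurring review of a changing remote folder, a one-shot receive is awkward. A sketch using rsync over the pod's exposed SSH port would handle both the rename and the repeated syncing; host, port, and paths here are placeholders:

```shell
# Hypothetical sync helper: pull a remote folder into a differently named
# local one; rsync transfers only changed files on each run.
sync_folder() {
  host="$1" port="$2" remote="$3" local_dir="$4"
  rsync -az -e "ssh -p ${port}" "root@${host}:${remote}/" "${local_dir}/"
}
# e.g. poll every minute:
# while sleep 60; do sync_folder 1.2.3.4 40806 /workspace/logs samples-2; done
```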
This should be as simple as setting GOARCH=arm/arm64 and GOOS=android/linux, but my fork failed to build for some reason. I'm not familiar with this release process; that might be part of it?
Failed to deploy project: Your worker concurrency cannot go beyond the maximum limit of (20). Please contact support if you wish to scale past this number.
Perhaps, this could be checked on before having to wait a few minutes when deploying a new endpoint.
I could see how that experience would frustrate a user.
Looks like the GraphQL spec can return information about a pod's IP.
I want to create a pod with the CLI and then connect to the created pod with ssh.
After creating a pod with runpodctl, how can I get the same connection information that I get on the console to access the pod. I am talking about the ssh connection info (e.g: ssh [email protected] -i ~/.ssh/id_ed1111111)
Right now I have to log in to the console to get this information. What is the preferred way to get this from the CLI?
runpodctl version
-> "runpodctl v1.8.0"
wget -qO- cli.runpod.net | sudo bash
runpodctl version
-> still "runpodctl v1.8.0"
Version 1.14.3 should have been installed, but v1.8.0 is still installed.
root@runpod-pod:~ # which runpodctl
/usr/bin/runpodctl
root@runpod-pod:~ # wget -qO- cli.runpod.net | sudo bash
Installing runpodctl...
jq is not installed.
Installing jq...
Hit:1 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu focal InRelease
Hit:2 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Hit:5 http://security.ubuntu.com/ubuntu focal-security InRelease
Reading package lists... Done
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
libjq1 libonig5
The following NEW packages will be installed:
jq libjq1 libonig5
0 upgraded, 3 newly installed, 0 to remove and 77 not upgraded.
Need to get 313 kB of archives.
After this operation, 1062 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 libonig5 amd64 6.9.4-1 [142 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 libjq1 amd64 1.6-1ubuntu0.20.04.1 [121 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 jq amd64 1.6-1ubuntu0.20.04.1 [50.2 kB]
Fetched 313 kB in 0s (855 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package libonig5:amd64.
(Reading database ... 29857 files and directories currently installed.)
Preparing to unpack .../libonig5_6.9.4-1_amd64.deb ...
Unpacking libonig5:amd64 (6.9.4-1) ...
Selecting previously unselected package libjq1:amd64.
Preparing to unpack .../libjq1_1.6-1ubuntu0.20.04.1_amd64.deb ...
Unpacking libjq1:amd64 (1.6-1ubuntu0.20.04.1) ...
Selecting previously unselected package jq.
Preparing to unpack .../jq_1.6-1ubuntu0.20.04.1_amd64.deb ...
Unpacking jq (1.6-1ubuntu0.20.04.1) ...
Setting up libonig5:amd64 (6.9.4-1) ...
Setting up libjq1:amd64 (1.6-1ubuntu0.20.04.1) ...
Setting up jq (1.6-1ubuntu0.20.04.1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.14) ...
Latest version of runpodctl: v1.14.3
runpodctl 100%[==================================================>] 3.48M 17.8MB/s in 0.2s
runpodctl installed successfully.
root@runpod-pod:~ # which runpodctl
/usr/local/bin/runpodctl
root@runpod-pod:~ # runpodctl version
runpodctl v1.8.0
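In the transcript above, which reports /usr/local/bin/runpodctl yet runpodctl version still prints v1.8.0. That is the classic symptom of the shell having hashed the old /usr/bin path earlier in the session, or of a stale copy shadowing the fresh install. A small diagnostic sketch:

```shell
# Print the first executable found on PATH for a command name. Compare with
# `type -a runpodctl` (every copy, in lookup order), then run `hash -r` to
# drop the shell's cached command locations and delete the stale copy.
first_on_path() {
  command -v "$1"
}
```

If first_on_path runpodctl still prints /usr/bin/runpodctl after `hash -r`, removing that file (or reordering PATH) should let the freshly installed v1.14.3 copy take over.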
It would be awesome if macOS Homebrew users could install this from brew:
I was sending a stable diffusion model which is 2 gigabytes, but around 90%, the transfer just stopped. This happened the other day too, but at 80%.
I'm looking for a way to create a pod in a specific datacenter from the command line. There's a way to do it with the web interface -- it looks like we'd just need to pass dataCenterId
in the graphql request (unless I'm misreading something). I guess it'd need to be added somewhere like this:
Line 138 in 46dbb96
panic: runtime error: index out of range [4] with length 4
goroutine 1 [running]:
cli/cmd/croc.glob..func1(0xc4c9e0, {0xc000121610, 0x1, 0x1})
/home/runner/work/runpodctl/runpodctl/cmd/croc/receive.go:47 +0x3d3
github.com/spf13/cobra.(*Command).execute(0xc4c9e0, {0xc0001215f0, 0x1, 0x1})
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:860 +0x5f8
github.com/spf13/cobra.(*Command).ExecuteC(0xc4bae0)
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:974 +0x3bc
github.com/spf13/cobra.(*Command).Execute(...)
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:902
cli/cmd.Execute({0x9571a4, 0xc0000001a0})
/home/runner/work/runpodctl/runpodctl/cmd/root.go:26 +0x4a
main.main()
/home/runner/work/runpodctl/runpodctl/main.go:8 +0x27
Can't receive data from runpod (docker image with no scp support)
$ runpodctl receive 1208-goat-boat-screen
panic: runtime error: index out of range [4] with length 4
goroutine 1 [running]:
cli/cmd/croc.glob..func1(0xc4c9e0, {0xc0000f1620, 0x1, 0x1})
/home/runner/work/runpodctl/runpodctl/cmd/croc/receive.go:47 +0x3d3
github.com/spf13/cobra.(*Command).execute(0xc4c9e0, {0xc0000f1600, 0x1, 0x1})
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:860 +0x5f8
github.com/spf13/cobra.(*Command).ExecuteC(0xc4bae0)
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:974 +0x3bc
github.com/spf13/cobra.(*Command).Execute(...)
/home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:902
cli/cmd.Execute({0x9571a4, 0xc0000001a0})
/home/runner/work/runpodctl/runpodctl/cmd/root.go:26 +0x4a
main.main()
/home/runner/work/runpodctl/runpodctl/main.go:8 +0x27
root@bed533d5304a:/workspace/stable-diffusion-webui# ls -l
total 680
I'm not getting the ID from an active pod.
# runpodctl get pod
Error: data is nil: {"data":{"myself":null}}
The RunPod CLI tool to manage resources on runpod.io and develop serverless applications.
Usage:
runpodctl [command]
Aliases:
runpodctl, runpod
Available Commands:
completion Generate the autocompletion script for the specified shell
config Manage CLI configuration
create create a resource
exec Execute commands in a pod
get get resource
help Help about any command
project Manage RunPod projects
receive receive file(s), or folder
remove remove a resource
send send file(s), or folder
ssh SSH keys and commands
start start a resource
stop stop a resource
update update runpodctl
Flags:
-h, --help help for runpodctl
-v, --version Print the version of runpodctl
Some command descriptions start with capitals and others do not; they should be consistent.
Also, don't use "(s)":
Don't put optional plurals in parentheses. Instead, use either plural or singular constructions and keep things consistent throughout your documentation. Choose what is most appropriate for your documentation and your audience. If it's important in a specific context to indicate both, use one or more.
I ran this command.
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
--stage sft \
--model_name_or_path openlm-research/open_llama_7b \
--do_train \
--dataset train \
--template default \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir checkpoint \
--overwrite_cache \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 2000 \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--plot_loss \
--fp16
[INFO|training_args.py:1345] 2023-12-07 06:09:02,164 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1798] 2023-12-07 06:09:02,164 >> PyTorch: setting up devices
[INFO|trainer.py:1760] 2023-12-07 06:09:03,760 >> ***** Running training *****
[INFO|trainer.py:1761] 2023-12-07 06:09:03,761 >> Num examples = 78,303
[INFO|trainer.py:1762] 2023-12-07 06:09:03,761 >> Num Epochs = 3
[INFO|trainer.py:1763] 2023-12-07 06:09:03,761 >> Instantaneous batch size per device = 4
[INFO|trainer.py:1766] 2023-12-07 06:09:03,761 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1767] 2023-12-07 06:09:03,761 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1768] 2023-12-07 06:09:03,761 >> Total optimization steps = 14,682
[INFO|trainer.py:1769] 2023-12-07 06:09:03,762 >> Number of trainable parameters = 4,194,304
0%| | 0/14682 [00:00<?, ?it/s][WARNING|logging.py:290] 2023-12-07 06:09:03,766 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
Traceback (most recent call last):
File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
main()
File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/workspace/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 68, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1591, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1950, in _inner_training_loop
self.accelerator.clip_grad_norm_(
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
self.unscale_gradients()
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
self.scaler.unscale_(opt)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
It worked fine when I used it yesterday, but after I changed the dataset size today this problem appeared. What is going on?
This was working for me just fine and then randomly out of the blue, running runpodctl send <file>
just exits without saying anything.
This happens both locally and on the pod itself. Is there any way to get some verbose output / logging info so I can help you troubleshoot?
I'm running it on a macbook m2, just installed it today v1.9.0. Same behavior on the pod itself so I don't know if it matters.
Using the web interface I'm able to Deploy a spot Instance instead of an On Demand instance. It would be nice to be able to do this using the command line tool too.
I tried naively replacing podFindAndDeployOnDemand with podRentInterruptable, but this failed. I have no idea if this was a permission problem, a server problem, or a client problem. (If I could get it to work, I'd provide a pull request.) I can see the current spot price using runpodctl get cloud
. Once I create a pod through the web, I am able to see it and stop it using the command line interface.
I found this documentation.
Hello,
I would like to create a non-GPU Pod for quick experimenting before running a GPU Pod. I cannot create a CPU Pod, because runpodctl requires gpuType.
Being able to send multiple files at once would be helpful, like:
runpodctl send "t112_38080.safetensors","t112_38080.yaml"
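Until multi-file send exists, a workaround sketch is to bundle the files into a single archive, send that, and unpack on the receiving side:

```shell
# Bundle several files into one archive so a single `runpodctl send` suffices.
bundle() {
  out="$1"; shift
  tar czf "$out" "$@"
}
# bundle models.tgz t112_38080.safetensors t112_38080.yaml
# runpodctl send models.tgz     # then `tar xzf models.tgz` after receiving
```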
Hi, is there any way to update the container image for my running pod, just like the edit pod option?
It seems it's only possible to create a new pod with a new GPU using the create command, not to reuse the GPU I already have.
I hope there's a way to update the container image only, without changing the pod id & GPU.
When you install runpodctl, you can't confirm a successful installation by checking the version; you get an error telling you to run runpodctl config. After adding an API key, runpodctl version works as expected.
I tried to use runpodctl to upload a dataset of around 100 GB to RunPod. To receive the files I had to start the pod; however, it has taken the whole day, which means I pay for the GPUs for a whole day with no chance to use them, because runpodctl always fails.
Could you please make AUR package:
Is there a way to watch container logs when starting a pod using this command line tool?
How to create pod with an existing network volume attached using runpodctl
?
Can't seem to find it in the documentation.
Many thanks in advance.
Not really a 🐛 bug, but not the expected behavior:
When you run runpodctl project start you get a "No 'runpod.toml' found in the current directory." error, when it should be something like "unknown command", or an alias for create.
rp % runpodctl version
runpodctl v1.14.1
rp % runpodctl project start
No 'runpod.toml' found in the current directory.
Please navigate to your project directory and try again.
rp % runpodctl project h
Develop and deploy projects entirely on RunPod's infrastructure.
Usage:
runpodctl project [command]
Available Commands:
build builds Dockerfile for current project
create Creates a new project
deploy deploys your project as an endpoint
dev Start a development session for the current project
Flags:
-h, --help help for project
Use "runpodctl project [command] --help" for more information about a command.
rp % runpodctl project create
Welcome to the RunPod Project Creator!
--------------------------------------
Provide a name for your project:
>