patwie / cluster-smi
nvidia-smi but for an entire GPU cluster
License: GNU General Public License v3.0
The server should show a placeholder such as "N/A" for a node/client it has not received any data from within a timeout.
The timeout can probably be a fairly long interval (minutes).
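A minimal sketch of how such a staleness check could look. The nodeStatus type, its fields, and the 3-minute timeout are assumptions made for illustration; they are not taken from the cluster-smi code.

```go
package main

import (
	"fmt"
	"time"
)

// nodeStatus is a hypothetical per-node record kept by the server.
type nodeStatus struct {
	Name     string
	LastSeen time.Time // when the last message from this node arrived
	Util     int       // last reported GPU utilization in percent
}

// displayUtil returns the utilization as text, or "N/A" when the node
// has been silent for longer than timeout.
func displayUtil(n nodeStatus, now time.Time, timeout time.Duration) string {
	if now.Sub(n.LastSeen) > timeout {
		return "N/A"
	}
	return fmt.Sprintf("%d%%", n.Util)
}

func main() {
	now := time.Now()
	timeout := 3 * time.Minute // assumed value; the request only says "minutes"

	fresh := nodeStatus{Name: "node01", LastSeen: now.Add(-10 * time.Second), Util: 42}
	stale := nodeStatus{Name: "node02", LastSeen: now.Add(-10 * time.Minute), Util: 87}

	fmt.Println(fresh.Name, displayUtil(fresh, now, timeout))
	fmt.Println(stale.Name, displayUtil(stale, now, timeout))
}
```

Checking the age at display time, rather than deleting stale entries, keeps the node visible in the table so that a silent node is noticeable.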
Hello,
I'm trying to use this tool on my GPU cluster, but I got the following error during compilation:
# make all
cd proc; go install
go build cluster-smi.go config.go
cluster-smi.go:6:2: cannot find package "github.com/patwie/cluster-smi/cluster" in any of:
/nfs/utils/go/src/github.com/patwie/cluster-smi/cluster (from $GOROOT)
/root/go/src/github.com/patwie/cluster-smi/cluster (from $GOPATH)
cluster-smi.go:7:2: cannot find package "github.com/pebbe/zmq4" in any of:
/nfs/utils/go/src/github.com/pebbe/zmq4 (from $GOROOT)
/root/go/src/github.com/pebbe/zmq4 (from $GOPATH)
cluster-smi.go:8:2: cannot find package "github.com/vmihailenco/msgpack" in any of:
/nfs/utils/go/src/github.com/vmihailenco/msgpack (from $GOROOT)
/root/go/src/github.com/vmihailenco/msgpack (from $GOPATH)
config.go:4:2: cannot find package "gopkg.in/yaml.v2" in any of:
/nfs/utils/go/src/gopkg.in/yaml.v2 (from $GOROOT)
/root/go/src/gopkg.in/yaml.v2 (from $GOPATH)
make: *** [all] Error 1
It's not clear to me how to fix it. Do you have any idea?
Go 1.9 adds sync.Map, a map with built-in synchronization. Currently, the cluster-smi server only updates the node information when it receives a new message. This is OK. But these updates are broadcast immediately to all subscribers, which floods the network with messages (more so with more nodes), because this loop neither pauses nor depends on ticks.
I wonder whether the receiving part could be one goroutine and the broadcasting part another goroutine driven by ticks, to limit the number of messages sent to clients.
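The split proposed above could be sketched roughly as follows. This is not the cluster-smi server code: the update and nodeState types are hypothetical, a Go channel stands in for the ZeroMQ socket, and the 100 ms tick is an arbitrary value.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// update is a hypothetical stand-in for a message received from a node.
type update struct {
	Node string
	Util int
}

// nodeState holds the latest update per node.
type nodeState struct {
	m sync.Map // node name -> update
}

// receive stores every incoming update; it never broadcasts.
func (s *nodeState) receive(in <-chan update) {
	for u := range in {
		s.m.Store(u.Node, u)
	}
}

// snapshot copies the current state into a plain map for sending.
func (s *nodeState) snapshot() map[string]update {
	out := make(map[string]update)
	s.m.Range(func(k, v interface{}) bool {
		out[k.(string)] = v.(update)
		return true
	})
	return out
}

// broadcast sends one snapshot per tick until done is closed.
func (s *nodeState) broadcast(every time.Duration, send func(map[string]update), done <-chan struct{}) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			send(s.snapshot())
		case <-done:
			return
		}
	}
}

func main() {
	var s nodeState
	in := make(chan update)
	done := make(chan struct{})

	go s.receive(in)
	go s.broadcast(100*time.Millisecond, func(snap map[string]update) {
		fmt.Printf("broadcasting %d node(s)\n", len(snap))
	}, done)

	// Many incoming messages, but only a handful of broadcasts.
	for i := 0; i < 50; i++ {
		in <- update{Node: fmt.Sprintf("node%02d", i%3), Util: i}
		time.Sleep(10 * time.Millisecond)
	}
	close(done)
	close(in)
}
```

With this layout the number of outgoing messages depends only on the tick interval, not on how many nodes report or how often they do.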
What are the steps for setting up a cluster on Ubuntu?
Is there a master/slave structure in it?
I'm currently using cluster-smi 24/7, but it seems to get stuck (./cluster-smi stops working, no output at all) once or twice a day.
I'm not exactly sure where the problem comes from, so I just reboot the router+client computer when it stops working.
Any ideas?
nvidia-smi reports which processes are running on the GPUs. However, showing this information for all nodes at the same time would produce a long list, and I'm not sure how to display it.
The information is stored in the nvmlProcessInfo_t structure, which can be obtained via nvmlReturn_t nvmlDeviceGetComputeRunningProcesses ( nvmlDevice_t device, unsigned int* infoCount, nvmlProcessInfo_t* infos ). Furthermore, this only returns the pid and the used memory. Querying the command requires parsing /proc/pid/stat, as cluster-top does, which is OS-dependent.
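For the /proc/pid/stat part, a Linux-only sketch in Go might look like this. The function names commFromStat and commForPID are made up for this example, and this is not how cluster-top is implemented. The command name is the second field of the stat file and is wrapped in parentheses.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// commFromStat extracts the command name from the contents of a
// /proc/<pid>/stat file. The name may itself contain spaces or
// parentheses, so we cut between the first '(' and the last ')'.
func commFromStat(stat string) (string, error) {
	open := strings.IndexByte(stat, '(')
	end := strings.LastIndexByte(stat, ')')
	if open < 0 || end <= open {
		return "", fmt.Errorf("unexpected stat format: %q", stat)
	}
	return stat[open+1 : end], nil
}

// commForPID reads /proc/<pid>/stat (Linux only) and returns the command name.
func commForPID(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return "", err
	}
	return commFromStat(string(data))
}

func main() {
	// Look up this program's own name as a demonstration.
	name, err := commForPID(os.Getpid())
	if err != nil {
		fmt.Println("could not read process name:", err)
		return
	}
	fmt.Println(name)
}
```

In practice the pid would come from the nvmlProcessInfo_t entries rather than from os.Getpid().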
Hi @PatWie ,
I quickly created a Dockerfile in order to build a Docker Image out of cluster-smi.
Have a look here:
https://github.com/ieggel/cluster-smi-docker
It's quite nice in the sense that it automates the cumbersome cluster-smi build process in a reproducible way. Once the Docker image is built, it should be able to run anywhere.
If you think it is useful you can link to the repo. I'm also open for suggestions. I created the readme quite quickly, so there might still be some typos/other mistakes. I will review it again next week.
Cheers,
-Ivan
Currently the app requires libzmq.o and the nvidia*.so stubs. It seems to be possible to create a fully self-contained app (pebbe/zmq4#60).
This would further require turning some constants into environment variables, plus some other changes.
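Reading a former constant from the environment with a fallback could look like the sketch below. The helper envOr, the variable names, and the default values are all hypothetical and are not part of cluster-smi.

```go
package main

import (
	"fmt"
	"os"
)

// envOr returns the value of the environment variable key,
// or fallback when the variable is unset or empty.
func envOr(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return fallback
}

func main() {
	// Both the variable names and the defaults are assumptions.
	host := envOr("CLUSTER_SMI_ROUTER_HOST", "127.0.0.1")
	port := envOr("CLUSTER_SMI_ROUTER_PORT", "9080")
	fmt.Printf("connecting to tcp://%s:%s\n", host, port)
}
```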
Hi again,
Do you know why it happens?
# go build cluster-smi-router.go config.go
# command-line-arguments
./cluster-smi-router.go:81:48: cannot use router_socket (type *"github.com/pebbe/zmq4".Socket) as type *"github.com/patwie/cluster-smi/vendor/github.com/pebbe/zmq4".Socket in argument to messaging.ReceiveMultipartMessage
./cluster-smi-router.go:97:33: cannot use router_socket (type *"github.com/pebbe/zmq4".Socket) as type *"github.com/patwie/cluster-smi/vendor/github.com/pebbe/zmq4".Socket in argument to messaging.SendMultipartMessage