cluster-smi's People

Contributors

ieggel, patwie, thangvubk

cluster-smi's Issues

Add Timeout message

There should be a timeout message like "N/A" if the server didn't receive any data from a node/client.
This can probably be a fairly high interval (minutes).
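
A minimal Go sketch of this idea, assuming the server records the arrival time of each node's last message. The nodeState type, the displayStatus function, and the three-minute timeout are hypothetical names and values chosen for illustration; they are not taken from the cluster-smi code.

```go
package main

import (
	"fmt"
	"time"
)

// nodeState is a hypothetical per-node record kept by the server.
type nodeState struct {
	LastSeen time.Time // arrival time of the last message from this node
	Summary  string    // last reported status line
}

// displayStatus returns the node's last summary, or "N/A" if no message
// has arrived from the node within the given timeout.
func displayStatus(n nodeState, now time.Time, timeout time.Duration) string {
	if now.Sub(n.LastSeen) > timeout {
		return "N/A"
	}
	return n.Summary
}

func main() {
	now := time.Now()
	timeout := 3 * time.Minute // assumed value; the issue only says "minutes"

	nodes := map[string]nodeState{
		"node01": {LastSeen: now.Add(-10 * time.Second), Summary: "2 GPUs, 35% util"},
		"node02": {LastSeen: now.Add(-10 * time.Minute), Summary: "4 GPUs, 80% util"},
	}
	for _, name := range []string{"node01", "node02"} {
		fmt.Printf("%s: %s\n", name, displayStatus(nodes[name], now, timeout))
	}
}
```

Doing the check at display time means that no extra timer is needed per node; the stale entry simply renders as "N/A" the next time the table is drawn.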

[error] make all

Hello,

I'm trying to use this tool in my GPU cluster, but I got the following error during compilation:

# make all
cd proc; go install
go build cluster-smi.go config.go
cluster-smi.go:6:2: cannot find package "github.com/patwie/cluster-smi/cluster" in any of:
        /nfs/utils/go/src/github.com/patwie/cluster-smi/cluster (from $GOROOT)
        /root/go/src/github.com/patwie/cluster-smi/cluster (from $GOPATH)
cluster-smi.go:7:2: cannot find package "github.com/pebbe/zmq4" in any of:
        /nfs/utils/go/src/github.com/pebbe/zmq4 (from $GOROOT)
        /root/go/src/github.com/pebbe/zmq4 (from $GOPATH)
cluster-smi.go:8:2: cannot find package "github.com/vmihailenco/msgpack" in any of:
        /nfs/utils/go/src/github.com/vmihailenco/msgpack (from $GOROOT)
        /root/go/src/github.com/vmihailenco/msgpack (from $GOPATH)
config.go:4:2: cannot find package "gopkg.in/yaml.v2" in any of:
        /nfs/utils/go/src/gopkg.in/yaml.v2 (from $GOROOT)
        /root/go/src/gopkg.in/yaml.v2 (from $GOPATH)
make: *** [all] Error 1

It's not clear to me how to fix it. Do you have any idea?

Use syncMap

Go 1.9 supports sync maps (sync.Map) with built-in locking. Currently, the cluster-smi server only updates the node information when it receives a new message. This is ok. But these updates are broadcast immediately to all subscribers, which floods the network with messages (more so with more nodes), because this loop neither pauses nor depends on ticks.

I wonder if the receiving part can be one goroutine and the broadcasting part another goroutine that depends on ticks, to limit the number of messages sent to clients.
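
The split described above could look roughly like the following Go sketch. It is an illustration under assumptions, not the cluster-smi implementation: the nodeStore type, its Update and Snapshot methods, the string payloads, and the tick interval are all hypothetical, and the network sockets are replaced by a channel and fmt.Println.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nodeStore is a hypothetical wrapper around sync.Map holding the latest
// status per node. sync.Map is safe for concurrent use by several goroutines.
type nodeStore struct {
	m sync.Map // node name -> latest status string
}

// Update stores the most recent status for a node.
func (s *nodeStore) Update(node, status string) {
	s.m.Store(node, status)
}

// Snapshot copies the current contents into a plain map, which is what a
// broadcaster would serialize and send to subscribers.
func (s *nodeStore) Snapshot() map[string]string {
	out := make(map[string]string)
	s.m.Range(func(k, v interface{}) bool {
		out[k.(string)] = v.(string)
		return true
	})
	return out
}

func main() {
	var store nodeStore
	updates := make(chan [2]string)
	done := make(chan struct{})

	// Receiver goroutine: stores every incoming update, never broadcasts.
	go func() {
		for u := range updates {
			store.Update(u[0], u[1])
		}
		close(done)
	}()

	// Simulated nodes sending many updates in quick succession.
	go func() {
		for i := 0; i < 50; i++ {
			updates <- [2]string{fmt.Sprintf("node%02d", i%3), fmt.Sprintf("update %d", i)}
			time.Sleep(2 * time.Millisecond)
		}
		close(updates)
	}()

	// Broadcaster: one snapshot per tick, regardless of the update rate.
	ticker := time.NewTicker(25 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			fmt.Println("broadcast:", store.Snapshot())
		case <-done:
			fmt.Println("final:", store.Snapshot())
			return
		}
	}
}
```

With this layout the number of outgoing messages depends only on the tick interval and the number of subscribers, not on how many nodes report or how often.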

Possible deadlock?

I'm currently using cluster-smi 24/7, but it seems to get stuck (./cluster-smi stops working, with no output at all) once or twice a day.
I'm not exactly sure where the problem comes from, so I just reboot the router+client computer when it stops working.

Any ideas?

Show processes running on the GPUs

nvidia-smi provides information about which processes are running on the GPUs. However, showing this information for all nodes at the same time would produce a long list. I'm not sure how to display it.

The information is stored in the nvmlProcessInfo_t structure, which can be obtained via nvmlReturn_t nvmlDeviceGetComputeRunningProcesses ( nvmlDevice_t device, unsigned int* infoCount, nvmlProcessInfo_t* infos ). However, this only returns the pid and the used memory. Querying the command requires parsing /proc/pid/stat, as cluster-top does, which is OS dependent.
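
For the /proc part, a minimal Linux-only Go sketch is shown here. The function names commFromStat and commandForPID are hypothetical; the only assumption about the file is the documented layout of /proc/<pid>/stat, where the second field is the command name in parentheses. The sketch does not call NVML; the pid would come from the nvmlProcessInfo_t entries mentioned above.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// commFromStat extracts the command name from the contents of
// /proc/<pid>/stat. The name is enclosed in parentheses and may itself
// contain spaces or parentheses, so the first '(' and the last ')' are used.
func commFromStat(stat string) (string, error) {
	start := strings.IndexByte(stat, '(')
	end := strings.LastIndexByte(stat, ')')
	if start < 0 || end <= start {
		return "", fmt.Errorf("unexpected stat format")
	}
	return stat[start+1 : end], nil
}

// commandForPID reads /proc/<pid>/stat and returns the command name.
// This only works on systems that provide a Linux-style /proc.
func commandForPID(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return "", err
	}
	return commFromStat(string(data))
}

func main() {
	// Use our own pid as a stand-in for a pid reported by NVML.
	name, err := commandForPID(os.Getpid())
	if err != nil {
		fmt.Println("could not read process name:", err)
		return
	}
	fmt.Println("command:", name)
}
```

Note that the kernel truncates this name to 15 characters; /proc/<pid>/cmdline holds the full command line if a longer description is needed.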

Docker Image for cluster-smi

Hi @PatWie,

I quickly created a Dockerfile in order to build a Docker Image out of cluster-smi.
Have a look here:
https://github.com/ieggel/cluster-smi-docker

It's quite nice in the sense that it automates the cumbersome cluster-smi build process in a reproducible way. Once the Docker image is built, it should be able to run anywhere.

If you think it is useful, you can link to the repo. I'm also open to suggestions. I wrote the readme quite quickly, so there might still be some typos or other mistakes. I will review it again next week.

Cheers,
-Ivan

Cannot compile

Thank you for your excellent work. However, when I compile it with Go, I get an error. I'm totally new to Go; can you tell me what the problem is?
image

[error] build cluster-smi-router.go config.go

Hi again,
Do you know why it happens?

# go build cluster-smi-router.go config.go
# command-line-arguments
./cluster-smi-router.go:81:48: cannot use router_socket (type *"github.com/pebbe/zmq4".Socket) as type *"github.com/patwie/cluster-smi/vendor/github.com/pebbe/zmq4".Socket in argument to messaging.ReceiveMultipartMessage
./cluster-smi-router.go:97:33: cannot use router_socket (type *"github.com/pebbe/zmq4".Socket) as type *"github.com/patwie/cluster-smi/vendor/github.com/pebbe/zmq4".Socket in argument to messaging.SendMultipartMessage
