matex's Introduction

MaTEx: Machine Learning Toolkit for Extreme Scale

MaTEx is a collection of high-performance parallel machine learning and data mining (MLDM) algorithms targeting desktops, supercomputers, and cloud computing systems.

Getting Started

See the MaTEx wiki at https://github.com/matex-org/matex/wiki for detailed instructions and support.

matex's People

Contributors

abhinavvishnu, charlesmsiegel, ipdrm16, jeffdaily, rlamothe, vamatya

matex's Issues

MPI AllReduce error

I got the following error:

2018-07-16 15:27:27.536541: W tensorflow/core/framework/op_kernel.cc:1192] Unknown: Exception: Message truncated, error stack:
MPI_Allreduce(855)..................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x2049aaa00, count=256, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(712)............:
MPIR_Allreduce_intra(357)...........:
MPIC_Sendrecv(186)..................:
MPIDI_CH3U_Request_unpack_uebuf(599): Message truncated; 1536 bytes received but buffer size is 1024

Any comments?
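
A note on the numbers in the trace, offered as a reading rather than a confirmed diagnosis: count=256 with MPI_FLOAT gives a 1024-byte receive buffer (assuming 4-byte floats), while 1536 bytes corresponds to 384 floats. A "Message truncated" error inside a collective usually suggests that the ranks are not calling MPI_Allreduce with the same count, or not in the same order. The sketch below redoes that arithmetic. It also uses a hypothetical helper, first_mismatch, which is not part of MaTEx or MPI, to show how per-rank logs of Allreduce counts could be compared to find where the ranks diverge.

```python
# Sizes are taken from the error message; MPI_FLOAT is assumed to be 4 bytes.
FLOAT_BYTES = 4


def elements(nbytes, elem_size=FLOAT_BYTES):
    """Convert a byte count into a number of elements."""
    if nbytes % elem_size:
        raise ValueError("byte count is not a multiple of the element size")
    return nbytes // elem_size


def first_mismatch(counts_by_rank):
    """Hypothetical helper: given one list of Allreduce counts per rank,
    return (call_index, counts) for the first call whose count differs
    across ranks, or None if every call agrees."""
    for index, counts in enumerate(zip(*counts_by_rank)):
        if len(set(counts)) > 1:
            return index, counts
    return None


local_count = 256                         # count= argument in the trace
local_bytes = local_count * FLOAT_BYTES   # size of the local buffer
incoming_floats = elements(1536)          # what the peer actually sent

print(local_bytes)       # 1024
print(incoming_floats)   # 384

# Made-up per-rank logs, for illustration only.
print(first_mismatch([[10, 256], [10, 384]]))  # (1, (256, 384))
```

If each rank logs the count of every Allreduce it issues, comparing those logs in this way would point at the first tensor whose shape or ordering differs between ranks.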

How do I get the ordered list of reduction ops to do the layer-wise all-to-all reduction?

I am confused by the MPI Allreduce operator.
The paper says that MaTEx-TensorFlow uses allreduce ops to synchronize each layer across ranks.
I think one Allreduce op reduces all the gradients generated on the same layer (and also returns them at the same time).
Given only the graph defined by the user, I do not know which layer a node belongs to,
so how do I know which gradients are on the same layer?
In other words, how do I get the ordered list of reduction ops to do the layer-wise all-to-all reduction?
I have not found any code that does this in the source.

"Example of a MaTEx TensorFlow executing on four MPI ranks. Each rank will run a model replica and communicate
at each of the reduction points (i.e. the orange bars). Each model is initialized identically due to the broadcast operator at the
beginning (i.e the blue bar)."
https://www.researchgate.net/profile/Abhinav_Vishnu/publication/316184213/figure/fig2/AS:484334135189506@1492485664993/Fig-2-Example-of-a-MaTEx-TensorFlow-executing-on-four-MPI-ranks-Each-rank-will-run-a.png
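
One possible way to build such an ordered list is sketched below. It is not taken from the MaTEx source and makes a loud assumption: variables are named with a layer scope prefix such as conv1/weights, as names from tf.trainable_variables() in TensorFlow 1.x typically are when layers are built inside name or variable scopes. The function names and the variable names in the example are hypothetical. The point is only that every rank must derive the same deterministic order, so that the matching Allreduce calls line up across ranks.

```python
from collections import OrderedDict


def group_by_layer(var_names):
    """Group variable names by scope prefix, keeping first-seen order.

    Assumption: the text before the first '/' identifies the layer.
    """
    layers = OrderedDict()
    for name in var_names:
        layer = name.split("/", 1)[0]
        layers.setdefault(layer, []).append(name)
    return layers


def reduction_order(var_names):
    """Return layer names last-created first, which is the order in which
    backpropagation makes their gradients available."""
    return list(reversed(list(group_by_layer(var_names))))


# Hypothetical variable names in creation order.
names = [
    "conv1/weights", "conv1/biases",
    "conv2/weights", "conv2/biases",
    "fc1/weights", "fc1/biases",
]

groups = group_by_layer(names)
print(groups["conv2"])         # ['conv2/weights', 'conv2/biases']
print(reduction_order(names))  # ['fc1', 'conv2', 'conv1']
```

With this grouping, one reduction per layer could be issued in the listed order, each covering the gradients of that layer's variables. Whether MaTEx-TensorFlow groups gradients this way is not stated in the post or the quoted caption.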

AttributeError: 'Datasets' object has no attribute 'training_data'

I ran the code from /matex/src/deeplearning/tensorflow/examples/glibc_after_2.19/MNIST/tf_lenet3.py with the command python tf_lenet3.py and got an error:

Traceback (most recent call last):
  File "tf_lenet3.py", line 17, in <module>
    mnist = tf.DataSet("MNIST", normalize=255.0)
AttributeError: module 'tensorflow' has no attribute 'DataSet'

Then I modified the code. I removed the line mnist = tf.DataSet("MNIST", normalize=255.0) and added from tensorflow.examples.tutorials.mnist import input_data followed by mnist = input_data.read_data_sets('MNIST_data', one_hot=True). After that I got the following error:

Traceback (most recent call last):
  File "tf_lenet3.py", line 70, in <module>
    for train_batch in range(int(len(mnist.training_data)/args.train_batch)):
AttributeError: 'Datasets' object has no attribute 'training_data'

What should I do to solve the problem?
@jeffdaily @vamatya @abhinavvishnu @charlesmsiegel @cabe1980 Thanks for your help!
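
For context, the object returned by input_data.read_data_sets in TensorFlow 1.x is a namedtuple with the fields train, validation and test, so it has no training_data attribute. Each split exposes images, labels and num_examples. The sketch below uses stand-in namedtuples so it runs without TensorFlow. The split sizes are the tutorial loader's defaults, while the batch size of 128 and the helper batches_per_epoch are hypothetical. It shows how the loop bound on line 70 of the script could be computed from mnist.train instead. The rest of tf_lenet3.py may depend on other attributes of the MaTEx DataSet object that are not visible in this traceback.

```python
from collections import namedtuple

# Stand-ins mirroring the shape of what read_data_sets returns.
Datasets = namedtuple("Datasets", ["train", "validation", "test"])
Split = namedtuple("Split", ["images", "labels", "num_examples"])


def batches_per_epoch(split, batch_size):
    """Number of full batches in one pass over a split (hypothetical helper)."""
    return split.num_examples // batch_size


# Default MNIST split sizes from the TensorFlow 1.x tutorial loader.
mnist = Datasets(
    train=Split(images=None, labels=None, num_examples=55000),
    validation=Split(images=None, labels=None, num_examples=5000),
    test=Split(images=None, labels=None, num_examples=10000),
)

print(hasattr(mnist, "training_data"))      # False, hence the AttributeError
print(batches_per_epoch(mnist.train, 128))  # 429
```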
