
lbann's Introduction

LBANN: Livermore Big Artificial Neural Network Toolkit

The Livermore Big Artificial Neural Network toolkit (LBANN) is an open-source, HPC-centric, deep learning training framework that is optimized to compose multiple levels of parallelism.

LBANN provides model-parallel acceleration through domain decomposition to optimize for strong scaling of network training. It also allows for composition of model parallelism with both data parallelism and ensemble training methods for training large neural networks with massive amounts of data. LBANN is able to take advantage of tightly-coupled accelerators, low-latency high-bandwidth networking, and high-bandwidth parallel file systems.

LBANN supports state-of-the-art training algorithms such as unsupervised, self-supervised, and adversarial (GAN) training methods in addition to traditional supervised learning. It also supports recurrent neural networks via backpropagation through time (BPTT) training, transfer learning, and multi-model and ensemble training methods.

Building LBANN

The preferred method for LBANN users to install LBANN is to use Spack. After some system configuration, this should be as straightforward as

spack install lbann

More detailed instructions for building and installing LBANN are available in the main LBANN documentation.

Running LBANN

The basic template for running LBANN is

<mpi-launcher> <mpi-options> \
    lbann <lbann-options> \
    --model=model.prototext \
    --optimizer=opt.prototext \
    --reader=data_reader.prototext

When using GPGPU accelerators, users should be aware that LBANN is optimized for the case in which one assigns one GPU per MPI rank. This should be borne in mind when choosing the parameters for the MPI launcher.
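
The one-GPU-per-rank guidance comes down to simple launcher arithmetic. The sketch below is illustrative only: `ranks_to_launch` and `device_for_local_rank` are hypothetical helpers, not LBANN functions, and the modulo mapping from node-local rank to GPU index is a common convention assumed here rather than documented LBANN behavior.

```cpp
#include <cassert>

// Hypothetical helper (not an LBANN API): total number of MPI ranks to
// request so that every rank owns exactly one GPU.
int ranks_to_launch(int num_nodes, int gpus_per_node) {
  assert(num_nodes > 0 && gpus_per_node > 0);
  return num_nodes * gpus_per_node;
}

// Hypothetical helper: GPU index for a rank, given its node-local rank.
// Assumes the launcher places gpus_per_node ranks on each node.
int device_for_local_rank(int local_rank, int gpus_per_node) {
  assert(local_rank >= 0 && gpus_per_node > 0);
  return local_rank % gpus_per_node;
}
```

Under these assumptions, a job on 2 nodes with 4 GPUs each would be launched with 8 ranks in total, 4 per node.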

More details about running LBANN are documented here.

Publications

A list of publications, presentations, and posters is available here.

Reporting issues

Issues, questions, and bugs can be raised on the GitHub issue tracker.

lbann's People

Contributors

adammoody, aj-prime, andy-yoo, bburnett66, benson31, bvanessen, davidhysom, dylanmckinney, fiedorowicz1, forsyth2, graham63, jaeseungyeom, jonesholger, jslenderman, kiwabuchi, lukebroskop, lukejaffe, mcneish1, mrwyattii, naoyam, ndryden, nmnobre, oyamay, samadejacobs, soumyadipghosh, spears9, szaman19, tbennun, timmoon10, tnat410


lbann's Issues

Clean up dead code

There's a lot of dead code lying around, or code that's broken but needs to work eventually. We should probably just delete it all and add issues for what needs to be implemented. The old code will still be around in version control if needed.

Off the top of my head, the biggest offender is probably the checkpointing code, which is almost certainly broken.

Handling of partially complete mini-batches

We do not properly handle the case in which a rank does not have a complete mini-batch of data. The solution is to check how many samples are actually valid in each mini-batch and then compensate for that during backprop and update.
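
A minimal sketch of the proposed compensation, assuming gradients are accumulated as per-sample sums and then averaged. This is not LBANN code: `RankContribution`, `average_gradient`, and `sgd_update` are hypothetical names, and the loop over ranks stands in for whatever reduction the real implementation would use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only. Each simulated rank contributes the element-wise SUM
// of its per-sample gradients and the number of samples that were valid.
struct RankContribution {
  std::vector<float> grad_sum;
  std::size_t num_valid;
};

// Sum the contributions (as an all-reduce would), then divide by the number
// of valid samples rather than by the nominal mini-batch size.
std::vector<float> average_gradient(const std::vector<RankContribution>& ranks) {
  assert(!ranks.empty());
  std::vector<float> avg(ranks.front().grad_sum.size(), 0.0f);
  std::size_t total_valid = 0;
  for (const RankContribution& r : ranks) {
    assert(r.grad_sum.size() == avg.size());
    for (std::size_t k = 0; k < avg.size(); ++k) {
      avg[k] += r.grad_sum[k];
    }
    total_valid += r.num_valid;
  }
  assert(total_valid > 0);
  for (float& g : avg) {
    g /= static_cast<float>(total_valid);
  }
  return avg;
}

// Plain SGD step that consumes the corrected average.
void sgd_update(std::vector<float>& weights,
                const std::vector<float>& avg_grad,
                float learning_rate) {
  assert(weights.size() == avg_grad.size());
  for (std::size_t k = 0; k < weights.size(); ++k) {
    weights[k] -= learning_rate * avg_grad[k];
  }
}
```

Dividing by the nominal mini-batch size instead would shrink the step whenever the final mini-batch is short, which is the bias this issue describes.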

Build problem on CentOS

Hi - I'm trying to do an LBANN build on a CentOS 6.8 system. I have a custom build of the gcc 4.9.3 toolchain. I had trouble with the internal build of Elemental that manifested as

gmake[8]: O2: Command not found

but I resolved that by pulling Elemental and building it outside the LBANN tree. However I am now running into the C++ compile errors below.

Thanks for any help,
Bob Olson (ANL)
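
One reading of the `IndexDependentMap` errors in the log below: the candidate at `IndexDependentMap.hpp:26` expects a `std::function<T(Int, Int, const T&)>`, while the call site passes a `std::function<float(int, int, float)>`, so `T` cannot be deduced from the second argument. The sketch reproduces that deduction rule with hypothetical stand-in types (`AbstractMat`, `DerivedMat`, `index_dependent_map`, `scale_rows`); these are not the Elemental classes. Whether LBANN should change its lambdas or the externally built Elemental is simply a different version from the one LBANN expects is not established here.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-ins for a matrix class hierarchy. They are NOT the
// Elemental types, only enough structure to show the deduction rule.
template <class T>
struct AbstractMat {
  int height, width;
  std::vector<T> data;  // column-major storage
  AbstractMat(int h, int w) : height(h), width(w), data(h * w, T()) {}
  T& at(int i, int j) { return data[i + j * height]; }
};

template <class T>
struct DerivedMat : AbstractMat<T> {
  DerivedMat(int h, int w) : AbstractMat<T>(h, w) {}
};

// Same parameter shape as the candidate reported in the log: the callback
// receives the entry by const reference.
template <class T>
void index_dependent_map(AbstractMat<T>& A,
                         std::function<T(int, int, const T&)> func) {
  for (int j = 0; j < A.width; ++j) {
    for (int i = 0; i < A.height; ++i) {
      A.at(i, j) = func(i, j, A.at(i, j));
    }
  }
}

// Divide every entry by (row index + 1).
void scale_rows(DerivedMat<float>& m) {
  // Deduction succeeds because the std::function type takes `const float&`.
  // With std::function<float(int, int, float)> instead, T cannot be deduced
  // from the second argument and the call fails to compile.
  index_dependent_map(
      m, std::function<float(int, int, const float&)>(
             [](int i, int /*j*/, const float& z) -> float {
               return z / static_cast<float>(i + 1);
             }));
}
```

The first argument deduces `T` through the derived-to-base conversion, so only the callback signature has to match exactly.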

[ 28%] Building CXX object CMakeFiles/src.dir/src/layers/lbann_layer_softmax.cpp.o
In file included from /tmp/elemental/include/El/optimization/solvers.hpp:12:0,
                 from /tmp/elemental/include/El/optimization.hpp:14,
                 from /tmp/elemental/include/El.hpp:18,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/optimization/solvers/LP.hpp:274:68: warning: ‘deprecated’ attribute directive ignored [-Wattributes]
   const lp::direct::Ctrl<Real>& ctrl=lp::direct::Ctrl<Real>(false) );
                                                                    ^
/tmp/elemental/include/El/optimization/solvers/LP.hpp:284:68: warning: ‘deprecated’ attribute directive ignored [-Wattributes]
   const lp::direct::Ctrl<Real>& ctrl=lp::direct::Ctrl<Real>(false) );
                                                                    ^
/tmp/elemental/include/El/optimization/solvers/LP.hpp:294:67: warning: ‘deprecated’ attribute directive ignored [-Wattributes]
   const lp::direct::Ctrl<Real>& ctrl=lp::direct::Ctrl<Real>(true) );
                                                                   ^
/tmp/elemental/include/El/optimization/solvers/LP.hpp:304:67: warning: ‘deprecated’ attribute directive ignored [-Wattributes]
   const lp::direct::Ctrl<Real>& ctrl=lp::direct::Ctrl<Real>(true) );
                                                                   ^
/tmp/elemental/include/El/optimization/solvers/LP.hpp:540:63: warning: ‘deprecated’ attribute directive ignored [-Wattributes]
   const lp::affine::Ctrl<Real>& ctrl=lp::affine::Ctrl<Real>() );
                                                               ^
/tmp/elemental/include/El/optimization/solvers/LP.hpp:553:63: warning: ‘deprecated’ attribute directive ignored [-Wattributes]
   const lp::affine::Ctrl<Real>& ctrl=lp::affine::Ctrl<Real>() );
                                                               ^
/tmp/elemental/include/El/optimization/solvers/LP.hpp:566:63: warning: ‘deprecated’ attribute directive ignored [-Wattributes]
   const lp::affine::Ctrl<Real>& ctrl=lp::affine::Ctrl<Real>() );
                                                               ^
/tmp/elemental/include/El/optimization/solvers/LP.hpp:579:63: warning: ‘deprecated’ attribute directive ignored [-Wattributes]
   const lp::affine::Ctrl<Real>& ctrl=lp::affine::Ctrl<Real>() );
                                                               ^
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp: In member function ‘virtual void lbann::SoftmaxLayer::fp_linearity(ElMat&, ElMat&, ElMat&, ElMat&)’:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:158:7: error: no matching function for call to ‘IndexDependentMap(ElMat&, std::function<float(int, int, float)>)’
     }));
       ^
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:158:7: note: candidates are:
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:39:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:15:6: note: template<class T> void El::IndexDependentMap(El::Matrix<Ring>&, std::function<T(int, int, const T&)>)
 void IndexDependentMap( Matrix<T>& A, function<T(Int,Int,const T&)> func )
      ^
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:15:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:158:7: note:   ‘ElMat {aka El::ElementalMatrix<float>}’ is not derived from ‘El::Matrix<Ring>’
     }));
       ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:39:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:26:6: note: template<class T> void El::IndexDependentMap(El::AbstractDistMatrix<Ring>&, std::function<T(int, int, const T&)>)
 void IndexDependentMap
      ^
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:26:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:158:7: note:   mismatched types ‘const T&’ and ‘float’
     }));
       ^
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:158:7: note:   ‘std::function<float(int, int, float)>’ is not derived from ‘std::function<T(int, int, const T&)>’
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:39:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:45:6: note: template<class S, class T> void El::IndexDependentMap(const El::Matrix<Ring>&, El::Matrix<T>&, std::function<T(int, int, const S&)>)
 void IndexDependentMap
      ^
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:45:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:158:7: note:   ‘ElMat {aka El::ElementalMatrix<float>}’ is not derived from ‘const El::Matrix<Ring>’
     }));
       ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:39:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:58:6: note: template<class S, class T, El::DistNS::Dist U, El::DistNS::Dist V, El::DistWrapNS::DistWrap wrap> void El::IndexDependentMap(const El::DistMatrix<S, U, V, wrap>&, El::DistMatrix<T, U, V, wrapType>&, std::function<T(int, int, const S&)>)
 void IndexDependentMap
      ^
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:58:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:158:7: note:   ‘ElMat {aka El::ElementalMatrix<float>}’ is not derived from ‘const El::DistMatrix<S, U, V, wrap>’
     }));
       ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:39:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:82:6: note: template<class S, class T, El::DistNS::Dist U, El::DistNS::Dist V> void El::IndexDependentMap(const El::AbstractDistMatrix<Ring>&, El::DistMatrix<T, U, V>&, std::function<T(int, int, const S&)>)
 void IndexDependentMap
      ^
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:82:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:158:7: note:   ‘std::function<float(int, int, float)>’ is not derived from ‘El::DistMatrix<T, U, V>’
     }));
       ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:39:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:110:6: note: template<class S, class T, El::DistNS::Dist U, El::DistNS::Dist V> void El::IndexDependentMap(const El::AbstractDistMatrix<Ring>&, El::DistMatrix<T, U, V, (El::DistWrapNS::DistWrap)1u>&, std::function<T(int, int, const S&)>)
 void IndexDependentMap
      ^
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:110:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:158:7: note:   ‘std::function<float(int, int, float)>’ is not derived from ‘El::DistMatrix<T, U, V, (El::DistWrapNS::DistWrap)1u>’
     }));
       ^
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:170:168: error: no matching function for call to ‘IndexDependentMap(ElMat&, std::function<float(int, int, float)>)’
                                                                 DataType{Int rL = this->ZsNormExpSumStar.LocalRow(c); return z/this->ZsNormExpSumStar.GetLocal(rL,0);}));
                                                                                                                                                                        ^
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:170:168: note: candidates are:
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:39:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:15:6: note: template<class T> void El::IndexDependentMap(El::Matrix<Ring>&, std::function<T(int, int, const T&)>)
 void IndexDependentMap( Matrix<T>& A, function<T(Int,Int,const T&)> func )
      ^
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:15:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:170:168: note:   ‘ElMat {aka El::ElementalMatrix<float>}’ is not derived from ‘El::Matrix<Ring>’
                                                                 DataType{Int rL = this->ZsNormExpSumStar.LocalRow(c); return z/this->ZsNormExpSumStar.GetLocal(rL,0);}));
                                                                                                                                                                        ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:39:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:26:6: note: template<class T> void El::IndexDependentMap(El::AbstractDistMatrix<Ring>&, std::function<T(int, int, const T&)>)
 void IndexDependentMap
      ^
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:26:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:170:168: note:   mismatched types ‘const T&’ and ‘float’
                                                                 DataType{Int rL = this->ZsNormExpSumStar.LocalRow(c); return z/this->ZsNormExpSumStar.GetLocal(rL,0);}));
                                                                                                                                                                        ^
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:170:168: note:   ‘std::function<float(int, int, float)>’ is not derived from ‘std::function<T(int, int, const T&)>’
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:39:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:45:6: note: template<class S, class T> void El::IndexDependentMap(const El::Matrix<Ring>&, El::Matrix<T>&, std::function<T(int, int, const S&)>)
 void IndexDependentMap
      ^
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:45:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:170:168: note:   ‘ElMat {aka El::ElementalMatrix<float>}’ is not derived from ‘const El::Matrix<Ring>’
                                                                 DataType{Int rL = this->ZsNormExpSumStar.LocalRow(c); return z/this->ZsNormExpSumStar.GetLocal(rL,0);}));
                                                                                                                                                                        ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:39:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:58:6: note: template<class S, class T, El::DistNS::Dist U, El::DistNS::Dist V, El::DistWrapNS::DistWrap wrap> void El::IndexDependentMap(const El::DistMatrix<S, U, V, wrap>&, El::DistMatrix<T, U, V, wrapType>&, std::function<T(int, int, const S&)>)
 void IndexDependentMap
      ^
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:58:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:170:168: note:   ‘ElMat {aka El::ElementalMatrix<float>}’ is not derived from ‘const El::DistMatrix<S, U, V, wrap>’
                                                                 DataType{Int rL = this->ZsNormExpSumStar.LocalRow(c); return z/this->ZsNormExpSumStar.GetLocal(rL,0);}));
                                                                                                                                                                        ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:39:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:82:6: note: template<class S, class T, El::DistNS::Dist U, El::DistNS::Dist V> void El::IndexDependentMap(const El::AbstractDistMatrix<Ring>&, El::DistMatrix<T, U, V>&, std::function<T(int, int, const S&)>)
 void IndexDependentMap
      ^
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:82:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:170:168: note:   ‘std::function<float(int, int, float)>’ is not derived from ‘El::DistMatrix<T, U, V>’
                                                                 DataType{Int rL = this->ZsNormExpSumStar.LocalRow(c); return z/this->ZsNormExpSumStar.GetLocal(rL,0);}));
                                                                                                                                                                        ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:39:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:110:6: note: template<class S, class T, El::DistNS::Dist U, El::DistNS::Dist V> void El::IndexDependentMap(const El::AbstractDistMatrix<Ring>&, El::DistMatrix<T, U, V, (El::DistWrapNS::DistWrap)1u>&, std::function<T(int, int, const S&)>)
 void IndexDependentMap
      ^
/tmp/elemental/include/El/blas_like/level1/IndexDependentMap.hpp:110:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:170:168: note:   ‘std::function<float(int, int, float)>’ is not derived from ‘El::DistMatrix<T, U, V, (El::DistWrapNS::DistWrap)1u>’
                                                                 DataType{Int rL = this->ZsNormExpSumStar.LocalRow(c); return z/this->ZsNormExpSumStar.GetLocal(rL,0);}));
                                                                                                                                                                        ^
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp: In member function ‘DataType lbann::SoftmaxLayer::computeCost(const DistMat&)’:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:227:102: error: no matching function for call to ‘EntrywiseMap(ElMat&, std::function<float(float)>)’
     EntrywiseMap(*Acts, (std::function<DataType(DataType)>)([](DataType z)->DataType{return log(z);}));
                                                                                                      ^
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:227:102: note: candidates are:
In file included from /tmp/elemental/include/El/core.hpp:274:0,
                 from /tmp/elemental/include/El.hpp:14,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/core/Graph/decl.hpp:142:17: note: template<class U, class V> void El::EntrywiseMap(const El::SparseMatrix<U>&, El::SparseMatrix<V>&, std::function<V(U)>)
     friend void EntrywiseMap
                 ^
/tmp/elemental/include/El/core/Graph/decl.hpp:142:17: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:227:102: note:   ‘ElMat {aka El::ElementalMatrix<float>}’ is not derived from ‘const El::SparseMatrix<U>’
     EntrywiseMap(*Acts, (std::function<DataType(DataType)>)([](DataType z)->DataType{return log(z);}));
                                                                                                      ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:28:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:15:6: note: template<class T> void El::EntrywiseMap(El::Matrix<Ring>&, std::function<T(const T&)>)
 void EntrywiseMap( Matrix<T>& A, function<T(const T&)> func )
      ^
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:15:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:227:102: note:   ‘ElMat {aka El::ElementalMatrix<float>}’ is not derived from ‘El::Matrix<Ring>’
     EntrywiseMap(*Acts, (std::function<DataType(DataType)>)([](DataType z)->DataType{return log(z);}));
                                                                                                      ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:28:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:28:6: note: template<class T> void El::EntrywiseMap(El::SparseMatrix<U>&, std::function<T(const T&)>)
 void EntrywiseMap( SparseMatrix<T>& A, function<T(const T&)> func )
      ^
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:28:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:227:102: note:   ‘ElMat {aka El::ElementalMatrix<float>}’ is not derived from ‘El::SparseMatrix<U>’
     EntrywiseMap(*Acts, (std::function<DataType(DataType)>)([](DataType z)->DataType{return log(z);}));
                                                                                                      ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:28:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:38:6: note: template<class T> void El::EntrywiseMap(El::AbstractDistMatrix<Ring>&, std::function<T(const T&)>)
 void EntrywiseMap( AbstractDistMatrix<T>& A, function<T(const T&)> func )
      ^
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:38:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:227:102: note:   mismatched types ‘const T&’ and ‘float’
     EntrywiseMap(*Acts, (std::function<DataType(DataType)>)([](DataType z)->DataType{return log(z);}));
                                                                                                      ^
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:227:102: note:   ‘std::function<float(float)>’ is not derived from ‘std::function<T(const T&)>’
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:28:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:42:6: note: template<class T> void El::EntrywiseMap(El::DistSparseMatrix<U>&, std::function<T(const T&)>)
 void EntrywiseMap( DistSparseMatrix<T>& A, function<T(const T&)> func )
      ^
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:42:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:227:102: note:   ‘ElMat {aka El::ElementalMatrix<float>}’ is not derived from ‘El::DistSparseMatrix<U>’
     EntrywiseMap(*Acts, (std::function<DataType(DataType)>)([](DataType z)->DataType{return log(z);}));
                                                                                                      ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:28:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:52:6: note: template<class T> void El::EntrywiseMap(El::DistMultiVec<Ring1>&, std::function<T(const T&)>)
 void EntrywiseMap( DistMultiVec<T>& A, function<T(const T&)> func )
      ^
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:52:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:227:102: note:   ‘ElMat {aka El::ElementalMatrix<float>}’ is not derived from ‘El::DistMultiVec<Ring1>’
     EntrywiseMap(*Acts, (std::function<DataType(DataType)>)([](DataType z)->DataType{return log(z);}));
                                                                                                      ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:28:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:56:6: note: template<class S, class T> void El::EntrywiseMap(const El::Matrix<Ring>&, El::Matrix<T>&, std::function<T(const S&)>)
 void EntrywiseMap
      ^
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:56:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:227:102: note:   ‘ElMat {aka El::ElementalMatrix<float>}’ is not derived from ‘const El::Matrix<Ring>’
     EntrywiseMap(*Acts, (std::function<DataType(DataType)>)([](DataType z)->DataType{return log(z);}));
                                                                                                      ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:28:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:74:6: note: template<class S, class T> void El::EntrywiseMap(const El::SparseMatrix<U>&, El::SparseMatrix<V>&, std::function<T(const S&)>)
 void EntrywiseMap
      ^
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:74:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:227:102: note:   ‘ElMat {aka El::ElementalMatrix<float>}’ is not derived from ‘const El::SparseMatrix<U>’
     EntrywiseMap(*Acts, (std::function<DataType(DataType)>)([](DataType z)->DataType{return log(z);}));
                                                                                                      ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:28:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:90:6: note: template<class S, class T> void El::EntrywiseMap(const El::AbstractDistMatrix<Ring>&, El::AbstractDistMatrix<T>&, std::function<T(const S&)>)
 void EntrywiseMap
      ^
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:90:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:227:102: note:   ‘std::function<float(float)>’ is not derived from ‘El::AbstractDistMatrix<T>’
     EntrywiseMap(*Acts, (std::function<DataType(DataType)>)([](DataType z)->DataType{return log(z);}));
                                                                                                      ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:28:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:121:6: note: template<class S, class T> void El::EntrywiseMap(const El::DistSparseMatrix<U>&, El::DistSparseMatrix<T>&, std::function<T(const S&)>)
 void EntrywiseMap
      ^
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:121:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:227:102: note:   ‘ElMat {aka El::ElementalMatrix<float>}’ is not derived from ‘const El::DistSparseMatrix<U>’
     EntrywiseMap(*Acts, (std::function<DataType(DataType)>)([](DataType z)->DataType{return log(z);}));
                                                                                                      ^
In file included from /tmp/elemental/include/El/blas_like/level1/impl.hpp:28:0,
                 from /tmp/elemental/include/El/blas_like/level1.hpp:13,
                 from /tmp/elemental/include/El/blas_like.hpp:12,
                 from /tmp/elemental/include/El.hpp:15,
                 from /homes/olson/Pilot1/lbann/include/lbann/lbann_base.hpp:33,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer.hpp:32,
                 from /homes/olson/Pilot1/lbann/include/lbann/layers/lbann_layer_softmax.hpp:32,
                 from /homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:27:
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:138:6: note: template<class S, class T> void El::EntrywiseMap(const El::DistMultiVec<Ring1>&, El::DistMultiVec<T>&, std::function<T(const S&)>)
 void EntrywiseMap
      ^
/tmp/elemental/include/El/blas_like/level1/EntrywiseMap.hpp:138:6: note:   template argument deduction/substitution failed:
/homes/olson/Pilot1/lbann/src/layers/lbann_layer_softmax.cpp:227:102: note:   ‘ElMat {aka El::ElementalMatrix<float>}’ is not derived from ‘const El::DistMultiVec<Ring1>’
     EntrywiseMap(*Acts, (std::function<DataType(DataType)>)([](DataType z)->DataType{return log(z);}));
                                                                                                      ^
make[2]: *** [CMakeFiles/src.dir/src/layers/lbann_layer_softmax.cpp.o] Error 1
make[1]: *** [CMakeFiles/src.dir/all] Error 2
make: *** [all] Error 2
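
The notes in the log show that the candidate at EntrywiseMap.hpp:38 takes a function<T(const T&)>, while the call in lbann_layer_softmax.cpp passes a std::function<float(float)>, so T cannot be deduced. The sketch below reproduces that signature in a stand-alone form and shows one possible fix: a lambda that takes its argument by const reference. The entrywise_map and log_in_place names are hypothetical stand-ins that work on std::vector; they are not Elemental's or LBANN's functions.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

using DataType = float;

// Hypothetical stand-in that copies only the functor signature printed in
// the log, function<T(const T&)>. It is not Elemental's EntrywiseMap.
template <typename T>
void entrywise_map(std::vector<T>& a, std::function<T(const T&)> func) {
  for (auto& x : a) {
    x = func(x);
  }
}

// Taking the argument by const reference makes the std::function type
// match function<T(const T&)> exactly, so T deduces as DataType.
inline void log_in_place(std::vector<DataType>& acts) {
  assert(!acts.empty());
  entrywise_map(acts,
                std::function<DataType(const DataType&)>(
                    [](const DataType& z) -> DataType { return std::log(z); }));
}
```

If this reading of the log is right, the same change to the lambda at line 227 of lbann_layer_softmax.cpp should let the first candidate match.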

Add CUB dependency

When I run LBANN under the Nvidia profiler, I find that much of the GPU time (50%-66%) is occupied with allocating and freeing memory on the GPU. This cost can be eliminated by using a memory pool. We could try implementing our own memory pool, but I think a more robust option is to use CUB, an open-source (New BSD license) package being developed by Nvidia. CUB has a variety of GPU primitives and utilities, which may be useful in the future if we want to do anything fancy on the GPU.
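
To make the memory pool idea concrete, here is a minimal host-side sketch. It is an illustration only: CachingPool is a hypothetical class, std::malloc and std::free stand in for the GPU allocation calls, and none of this is CUB's allocator or its API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <new>
#include <unordered_map>

// Hypothetical caching pool: released blocks are kept and reused instead
// of being returned to the backing allocator. std::malloc/std::free stand
// in for the device allocation calls.
class CachingPool {
 public:
  CachingPool() = default;
  CachingPool(const CachingPool&) = delete;
  CachingPool& operator=(const CachingPool&) = delete;

  ~CachingPool() {
    for (auto& kv : free_blocks_) std::free(kv.second);
    for (auto& kv : live_blocks_) std::free(kv.first);
  }

  void* allocate(std::size_t bytes) {
    assert(bytes > 0);
    // Reuse the smallest cached block that is large enough.
    auto it = free_blocks_.lower_bound(bytes);
    if (it != free_blocks_.end()) {
      void* p = it->second;
      live_blocks_[p] = it->first;
      free_blocks_.erase(it);
      return p;
    }
    void* p = std::malloc(bytes);
    if (p == nullptr) throw std::bad_alloc();
    ++backing_allocations_;
    live_blocks_[p] = bytes;
    return p;
  }

  void release(void* p) {
    auto it = live_blocks_.find(p);
    assert(it != live_blocks_.end());
    free_blocks_.emplace(it->second, p);  // cache the block for reuse
    live_blocks_.erase(it);
  }

  // Number of times the backing allocator was actually called.
  std::size_t backing_allocations() const { return backing_allocations_; }

 private:
  std::multimap<std::size_t, void*> free_blocks_;       // size -> block
  std::unordered_map<void*, std::size_t> live_blocks_;  // block -> size
  std::size_t backing_allocations_ = 0;
};
```

Allocating 1024 bytes, releasing them, and then asking for 512 bytes returns the cached block without touching the backing allocator again, which is the cost the profile points at.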

Quantization optimization

I just merged a bunch of updates to lbann_quantizer, mostly optimization. Lots of stuff runs faster, but this is to track some notes and further work on it:

  • While these updated algorithms seem correct after an initial pass, I need to do more checking/testing to ensure correctness.
  • One-bit quantization is quite slow. Part of this is due to computing averages over the whole data-- sampling to approximate the average would probably work better (this is already done in proportion_threshold_average). The AdaGrad step is also pretty slow.
  • One-bit quantization currently also has an outstanding bug leading to memory corruption that I haven't fixed.
  • Compression could do with some further optimization; it roughly doubles runtime.
  • We have lots of opportunities for multi-threading things to improve performance further.
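
As a rough illustration of the sampling idea in the one-bit quantization item, the sketch below approximates a mean by visiting every stride-th entry. The sampled_mean function and its stride parameter are hypothetical; this is not how lbann_quantizer or proportion_threshold_average is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Approximate the mean by visiting every `stride`-th entry.
// stride == 1 gives the exact mean.
inline double sampled_mean(const std::vector<float>& data, std::size_t stride) {
  assert(stride > 0 && !data.empty());
  double sum = 0.0;
  std::size_t count = 0;
  for (std::size_t i = 0; i < data.size(); i += stride) {
    sum += data[i];
    ++count;
  }
  return sum / static_cast<double>(count);
}
```

A fixed stride keeps the sketch deterministic; random sampling would be the usual choice if the data has periodic structure.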

Some notes:

  • Threshold quantization suffers from a problem where the error feedback means it starts rapidly sending large amounts of data because entries exceed the threshold. (Adaptive quantization does not suffer this problem.)
  • Threshold quantization (before error feedback messes things up) and adaptive quantization consistently perform the same as or faster than the vanilla MPI allreduce for large matrices.
  • Compression requires delta coding with large matrices; otherwise we try to encode position numbers like 33 million, which is terrible with Golomb-Rice encoding.
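
The delta coding point can be checked with a small calculation. The sketch below assumes the Rice variant of Golomb coding (divisor 2^k, unary quotient plus k remainder bits); the function names and the choice of k are illustrative and not taken from lbann_quantizer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Bits used by a Rice code with parameter k: a unary quotient
// (value >> k ones plus a terminating zero) and k remainder bits.
inline std::uint64_t rice_bits(std::uint64_t value, unsigned k) {
  return (value >> k) + 1 + k;
}

// Replace sorted positions by the gap to the previous position.
inline std::vector<std::uint64_t> delta_encode(
    const std::vector<std::uint64_t>& positions) {
  std::vector<std::uint64_t> gaps;
  gaps.reserve(positions.size());
  std::uint64_t prev = 0;
  for (std::uint64_t p : positions) {
    assert(p >= prev);  // positions must be sorted in ascending order
    gaps.push_back(p - prev);
    prev = p;
  }
  return gaps;
}

// Total encoded size of a sequence for a given k.
inline std::uint64_t total_rice_bits(const std::vector<std::uint64_t>& values,
                                     unsigned k) {
  std::uint64_t bits = 0;
  for (std::uint64_t v : values) bits += rice_bits(v, k);
  return bits;
}
```

With k = 4, the position 33000000 alone costs 2062505 bits, while gaps of 10 and 15 cost 5 bits each, so only the first entry of a delta-coded run stays expensive.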

Early stopping breaks multiple models

When training multiple models, if one model terminates due to early stopping, LBANN either:

  • Crashes, because a collective gets messed up (I haven't figured out exactly where this occurs).
  • Hangs, because it tries to do an MPI collective (e.g. in the summarizer) and a portion of the ranks are not participating.

FC and other layers only update at training

(Extracting this from a todo so we don't forget.)

Right now Dnn::EvaluateBatch hackily calls only the update methods of the first and last layers. Instead it should call the update methods of every layer, and each layer should only update when it's supposed to (e.g., fully-connected/softmax at training time).
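
One way to structure this is sketched below. All names here (ExecutionMode, Layer, should_update, update_all) are hypothetical and do not reflect LBANN's actual class layout; the point is only that the driver calls update on every layer and each layer decides whether the current mode applies to it.

```cpp
#include <cassert>
#include <memory>
#include <vector>

enum class ExecutionMode { Training, Evaluation };

// Hypothetical layer interface: every layer receives update(), and each
// layer decides for itself whether the current mode calls for an update.
class Layer {
 public:
  virtual ~Layer() = default;
  void update(ExecutionMode mode) {
    if (should_update(mode)) {
      ++num_updates_;
      do_update();
    }
  }
  int num_updates() const { return num_updates_; }

 protected:
  virtual bool should_update(ExecutionMode) const { return true; }
  virtual void do_update() {}

 private:
  int num_updates_ = 0;
};

// Layers with learnable weights only update while training.
class FullyConnectedLayer : public Layer {
 protected:
  bool should_update(ExecutionMode mode) const override {
    return mode == ExecutionMode::Training;
  }
};

// Input-like layers advance on every batch regardless of mode.
class InputLayer : public Layer {};

// The driver no longer special-cases the first and last layers.
inline void update_all(std::vector<std::unique_ptr<Layer>>& layers,
                       ExecutionMode mode) {
  for (auto& layer : layers) {
    assert(layer != nullptr);
    layer->update(mode);
  }
}
```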

CMake script doesn't always identify --debug

The build_lbann_lc.sh script doesn't always identify its parameters correctly:

  • build_lbann_lc.sh --compiler gnu --debug builds a Release build
  • build_lbann_lc.sh --debug --compiler gnu does the right thing and makes a Debug build

Scripts override model defaults

Our run scripts (e.g. run_lbann_dnn_multi_mnist.sh) set their own defaults for parameters (e.g. mini-batch size). These override the defaults set in the model zoo files. They probably shouldn't do that.

Accuracy suffers as number of MPI processes increases

When I run lbann_dnn_mnist, I find that the accuracy is dependent on the number of MPI ranks. The mini-batch size is 100 and the other parameters are default. With one process, we get 90.52% test accuracy after one epoch. With 16 processes, the test accuracy drops to 80.75%. Looking at the past commits, the error first appears in commit 6be4b03 or commit a6adcf3.

Dropout does not give good accuracy

Training with dropout does not seem to give the good accuracy the dropout paper (and dropout's wide use) would suggest. The dropout paper gives a few experiments on MNIST that I've tried to reproduce:

  1. Dropout NN with three hidden layers of 1024 units, logistic activation, 0.8 keep probability on the input layer, 0.5 keep probability on the hidden layers, gives 1.35% error (98.65% accuracy)
  2. Ditto, with ReLU activation gives 1.25% error (98.75% accuracy)
  3. They also do an evaluation (see figure 4, their JMLR paper) in which they train a NN (e.g. three hidden layers, 1024 units in each) without dropout, then use the exact same hyperparameters but with dropout on the hidden layers. Without dropout, error is ~1.7-1.8%; with dropout error is ~1.0-1.3%.

Trying these same experiments in LBANN, our ReLU accuracy without dropout is comparable (~98.3%). Our sigmoid accuracy without dropout is ~96.8% (this is low due to using the same learning rate as with ReLU-- we can do better). But with dropout, our accuracy is actually worse:

  • ReLU with 0.5 dropout on the hidden layers gets ~97.0% accuracy.
  • Sigmoid with 0.5 dropout on the hidden layers gets ~94.0% accuracy.
  • Adding dropout to the input layer just makes things worse.
    (These all used a learning rate of 0.005.)

Based on this, I suspect there's a bug somewhere in dropout/
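
As a reference point when looking for the bug, here is a minimal sketch of inverted dropout, one common formulation in which kept units are scaled by 1/keep_prob during training so that evaluation is a plain pass-through. (The dropout paper instead scales the weights by the keep probability at test time.) Whether LBANN intends either convention is an assumption, and dropout_forward is a hypothetical function, not LBANN code.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Inverted dropout: during training, zero each unit with probability
// (1 - keep_prob) and scale survivors by 1 / keep_prob so the expected
// activation is unchanged. During evaluation, pass values through.
inline std::vector<float> dropout_forward(const std::vector<float>& x,
                                          float keep_prob, bool training,
                                          std::mt19937& gen) {
  assert(keep_prob > 0.0f && keep_prob <= 1.0f);
  if (!training) return x;
  std::bernoulli_distribution keep(keep_prob);
  std::vector<float> y(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    y[i] = keep(gen) ? x[i] / keep_prob : 0.0f;
  }
  return y;
}
```

A mismatch between training-time and test-time scaling is one thing worth ruling out, since it lowers accuracy without causing any crash.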

Greedy layer-wise auto-encoders

Train multiple network layers (e.g. Greedy layer-wise)
Freeze lower network layers, add new target reconstruction layers, and iteratively train deeper layers of the network.

Standardized metrics class to compute data set balance

For a labeled data set, can we build a metrics package that calculates how many labels exist, and of which type, so that the system can then compute raw accuracy and corrected accuracy?

A secondary question is whether we can create a cross-validation library that splits a single data set into a good cross-validation set. Right now we can perform this partitioning based on a random distribution. A more advanced approach would be to make sure that the held-out portion of the data set is a representative sample of the data set.
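
A minimal sketch of both pieces follows: counting labels per class, and a stratified split that holds out the same fraction of every class so that the held-out set mirrors the label distribution. The function names are hypothetical, and reading "representative" as "stratified by label" is an assumption; the corrected accuracy computation is not covered here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <utility>
#include <vector>

// Count how many samples carry each label.
inline std::map<int, std::size_t> label_counts(const std::vector<int>& labels) {
  std::map<int, std::size_t> counts;
  for (int l : labels) ++counts[l];
  return counts;
}

// Stratified split: hold out the same fraction of every class.
// Returns (training indices, held-out indices).
inline std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
stratified_split(const std::vector<int>& labels, double holdout_fraction,
                 unsigned seed) {
  assert(holdout_fraction >= 0.0 && holdout_fraction <= 1.0);
  // Group sample indices by label.
  std::map<int, std::vector<std::size_t>> by_label;
  for (std::size_t i = 0; i < labels.size(); ++i) {
    by_label[labels[i]].push_back(i);
  }

  std::mt19937 gen(seed);
  std::vector<std::size_t> train, held_out;
  for (auto& kv : by_label) {
    std::vector<std::size_t>& idx = kv.second;
    std::shuffle(idx.begin(), idx.end(), gen);
    const std::ptrdiff_t n_held = static_cast<std::ptrdiff_t>(
        holdout_fraction * static_cast<double>(idx.size()));
    held_out.insert(held_out.end(), idx.begin(), idx.begin() + n_held);
    train.insert(train.end(), idx.begin() + n_held, idx.end());
  }
  return std::make_pair(std::move(train), std::move(held_out));
}
```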

Summarizer causes MPI error in dnn_mnist

When I add an lbann_callback_summary instance to our DNN model in dnn_mnist, I get the following error after the end of the final epoch:

Fatal error in PMPI_Gather: Invalid communicator, error stack:
PMPI_Gather(918): MPI_Gather(sbuf=0x7fffffffc5b0, scount=1, MPI_FLOAT, rbuf=0x644920, rcount=1, MPI_FLOAT, root=0, comm=0x0) failed
PMPI_Gather(795): Invalid communicator
srun: error: catalyst176: task 0: Exited with exit code 1

I suspect the comm instance is being deleted, then the summarizer's destructor is being called, which may try to use the now-deleted comm object.
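
If the suspicion about destruction order is right, one possible remedy is to let the summarizer share ownership of the communicator, so that the final flush in its destructor always sees a live object. The sketch below uses hypothetical Comm and Summarizer stand-ins that only record events in a log; it does not use MPI and is not LBANN's code.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the communicator; it records events in a log.
struct Comm {
  explicit Comm(std::vector<std::string>& log) : log_(log) {}
  ~Comm() { log_.push_back("comm destroyed"); }
  void gather() { log_.push_back("gather"); }
  std::vector<std::string>& log_;
};

// The summarizer shares ownership of the communicator, so the flush in
// its destructor runs before the Comm can be destroyed.
class Summarizer {
 public:
  explicit Summarizer(std::shared_ptr<Comm> comm) : comm_(std::move(comm)) {
    assert(comm_ != nullptr);
  }
  ~Summarizer() { comm_->gather(); }  // final flush

 private:
  std::shared_ptr<Comm> comm_;
};
```

Even if the caller drops its own reference first, the log shows the gather happening before the communicator is destroyed. With real MPI, the flush would also have to happen before MPI is finalized.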

MNIST experiment script broken

The MNIST experiment script does not work without the -u option (i.e., loading image data to local SSD). I think this has to do with copying and pasting from the imagenet script.

Problems building on OS X

[ 18%] Building Fortran object CMakeFiles/scalapack.dir/TOOLS/pielset.f.o
/usr/local/bin/gfortran-4.9 -DAdd_ -Dscalapack_EXPORTS -I/usr/local/Cellar/mpich/3.2_2/include -fPIC -c /Users/vanessen1/Research/DeepLearning/lbann.git/build/vandamme.llnl.gov/download/elemental/build/download/scalapack/source/TOOLS/pielset.f -o CMakeFiles/scalapack.dir/TOOLS/pielset.f.o
/var/folders/t3/c63l2dsn13b_1tg80g00f2v00015k6/T//cchoUJaF.s:80:suffix or operands invalid for `movq'

Layer variable names

We should fix the variable naming scheme in the Layer class and its children. WB and Ds and whatnot are somewhere between mostly and completely unintelligible.

Optimize inter-kernel "white space" for convolution / pooling layers

Explore the possibilities of leaving data on the GPU in between successive convolutional / pooling layers. Note that this would be an optional optimization that would have to accommodate mixing convolutional / pooling layers with fully connected layers as well as convolutional / pooling layers with different fields of view for local receptive fields.

Check for gcc versions

We should change the cmake environment to fail if a new enough version of gcc is not found.

New cmake environment does not seem to properly pick up changes to header files

The new cmake build system will install the header files into the build directory and then use those for compiling the library. However, when a change to a header file is made, it seems that the header files in the build directory are not updated and thus the stale header file is used rather than the new header file.

Branch layer

Create a network layer that can branch / duplicate the activations of the layer and feed them to two dependent layers.
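
A minimal sketch of such a layer follows. BranchLayer and Tensor are hypothetical names, not LBANN types. The forward pass duplicates the activations; the backward pass, which the description above does not mention, sums the error signals from the two children, as the chain rule requires when one output feeds two consumers.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Tensor = std::vector<float>;

// Hypothetical branch layer with two children.
struct BranchLayer {
  // Forward: both children receive a copy of the same activations.
  std::pair<Tensor, Tensor> forward(const Tensor& activations) const {
    return std::make_pair(activations, activations);
  }

  // Backward: the error signal passed to the parent is the sum of the
  // error signals coming back from the two children.
  Tensor backward(const Tensor& grad_a, const Tensor& grad_b) const {
    assert(grad_a.size() == grad_b.size());
    Tensor grad(grad_a.size());
    for (std::size_t i = 0; i < grad.size(); ++i) {
      grad[i] = grad_a[i] + grad_b[i];
    }
    return grad;
  }
};
```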

Signed/unsigned comparisons

We have a lot of signed/unsigned comparison warnings that show up. We should fix them all.

A decent number seem to be from using unsigned loop iteration variables, and comparing them to signed upper bounds, e.g. the width of an Elemental matrix. Since Elemental (and MPI) use signed ints, we should probably use them in such cases.
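
The pattern can be shown in a few lines. SignedMatrix below is a hypothetical stand-in whose Width() returns a signed int; it is not an Elemental type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical matrix stand-in whose dimensions are signed ints.
struct SignedMatrix {
  int width;
  std::vector<float> data;
  int Width() const { return width; }
};

// Using int for the loop variable matches the signed bound, so the
// comparison i < m.Width() does not mix signed and unsigned operands.
inline float sum_first_row(const SignedMatrix& m) {
  assert(m.Width() >= 0);
  float sum = 0.0f;
  for (int i = 0; i < m.Width(); ++i) {
    // Convert explicitly only where a container needs an unsigned index.
    sum += m.data[static_cast<std::size_t>(i)];
  }
  return sum;
}
```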

Non-deterministic performance when varying number of ranks in model

We discovered a set of issues that lead to non-determinism when you vary the number of ranks within the model.

The easy problem is that if you do not explicitly initialize the random seed for LBANN, train and test accuracy vary as you change the number of MPI ranks. The simple solution is to call init_random with a constant seed (e.g. init_random(42);) in the model_zoo file.

A more important issue is that He or Xavier initialization of the matrices uses an Elemental function that generates random numbers in parallel. Given that initialization is done rarely, we need to add a default mode that generates the random sequence from a single root node. As a performance optimization, we can keep the parallel initialization as a compile-time option.

A secondary issue has to do with dropout. In the current implementation, each rank generates its own set of random numbers in parallel to determine which nodes to drop out. We want to keep this behavior for performance reasons, but add a "serial" mode that uses a global, consistent random sequence.
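
The root-generated mode can be sketched without MPI by simulating the ranks. Everything below is hypothetical: the function names, the cyclic distribution, and the mapping of raw generator output to [-1, 1] are illustrative choices. In a real run the root would broadcast or scatter the values rather than every rank indexing a shared vector.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// The "root" generates the whole sequence from one seeded generator.
inline std::vector<float> root_generate(std::size_t n, unsigned seed) {
  std::mt19937 gen(seed);
  std::vector<float> values(n);
  for (auto& v : values) {
    // Map the raw 32-bit output to [-1, 1] by hand, since the raw
    // mt19937 sequence is fixed by the C++ standard.
    v = static_cast<float>(gen() / 4294967296.0 * 2.0 - 1.0);
  }
  return values;
}

// Each simulated rank keeps the entries it owns under a cyclic
// distribution: global index i lives on rank i % num_ranks.
inline std::vector<float> local_part(const std::vector<float>& global,
                                     int rank, int num_ranks) {
  assert(num_ranks > 0 && rank >= 0 && rank < num_ranks);
  std::vector<float> local;
  for (std::size_t i = static_cast<std::size_t>(rank); i < global.size();
       i += static_cast<std::size_t>(num_ranks)) {
    local.push_back(global[i]);
  }
  return local;
}

// Reassemble the global vector from the per-rank pieces.
inline std::vector<float> assemble(std::size_t n, int num_ranks, unsigned seed) {
  const std::vector<float> global = root_generate(n, seed);
  std::vector<float> out(n);
  for (int r = 0; r < num_ranks; ++r) {
    const std::vector<float> local = local_part(global, r, num_ranks);
    for (std::size_t j = 0; j < local.size(); ++j) {
      out[static_cast<std::size_t>(r) +
          j * static_cast<std::size_t>(num_ranks)] = local[j];
    }
  }
  return out;
}
```

Because every entry comes from the same seeded sequence, assembling the vector from one rank or from four gives identical values, which is the property the default mode is meant to provide.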
