galois's People

Contributors

abrooks98, amberhassaan, bigwater, bozhiyou, breakinbad, chenxuhao, danghvu, darthscsi, ddn0, gurbinder533, insertinterestingnamehere, jmftrindade, kessido, kjopek, l-hoang, mrzmann, ndryden, nicelhc13, podsiadlo, roshandathathri, srogatch, swarnendubiswas, uditagar, vishweshjatala, yishanlu

galois's Issues

Removal in MorphGraph is buggy

galois::graphs::MorphGraph is buggy at edge/node removal when the graph is instantiated as directed and tracks both incoming and outgoing edges. We need to trace down the root cause and fix it.

galois::graphs::First_SepInOut_Graph in the experimental features fixed the issue by storing incoming and outgoing edges in separate vectors. We can move this to production code (probably renaming the class).
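The idea of keeping incoming and outgoing edges in separate per-node lists can be illustrated with a toy sketch. SepInOutGraph and its methods are hypothetical and are not the Galois implementation, and we have not traced whether this bookkeeping is where the MorphGraph bug lives. The point is that every removal must update both endpoints' lists: an edge appears in the source's out-list and in the destination's in-list.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy directed graph (hypothetical, not galois::graphs code) that keeps
// outgoing and incoming edges in separate per-node vectors.
class SepInOutGraph {
public:
  explicit SepInOutGraph(std::size_t numNodes) : out_(numNodes), in_(numNodes) {}

  void addEdge(int src, int dst) {
    out_[src].push_back(dst);
    in_[dst].push_back(src);
  }

  // An edge lives in two places: src's out-list and dst's in-list.
  void removeEdge(int src, int dst) {
    eraseOne(out_[src], dst);
    eraseOne(in_[dst], src);
  }

  // Removing a node must also scrub it from its neighbors' opposite lists.
  void removeNode(int v) {
    for (int dst : out_[v]) eraseOne(in_[dst], v);
    for (int src : in_[v]) eraseOne(out_[src], v);
    out_[v].clear();
    in_[v].clear();
  }

  std::size_t outDegree(int v) const { return out_[v].size(); }
  std::size_t inDegree(int v) const { return in_[v].size(); }
  bool hasEdge(int src, int dst) const {
    return std::find(out_[src].begin(), out_[src].end(), dst) != out_[src].end();
  }

private:
  // Erase a single occurrence so parallel edges are removed one at a time.
  static void eraseOne(std::vector<int>& list, int x) {
    auto it = std::find(list.begin(), list.end(), x);
    if (it != list.end()) list.erase(it);
  }

  std::vector<std::vector<int>> out_;
  std::vector<std::vector<int>> in_;
};
```

With separate vectors, each direction can be cleaned up independently without filtering a shared list by edge direction.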

Error setting CUDA personality with multiple GPUs

In the file dist_apps/src/DistBenchStart.cpp, the line

164: if (personality_set.length() == (net.Num / num_nodes)) {

fails to correctly set the personality to GPU_CUDA if an application is launched with a configuration such as -num_nodes=1 -pset="gg", i.e., one host with two GPUs.

Pangolin

  1. Clean up code.
  2. Prepare a small input for testing.
  3. Webpage.

MPI not executed in bfs_pull application (distributed version)

When I use the command: mpirun -n 4 -host=IIPLab-WS2,IIPLab-WS1 ./bfs_pull /home/mpi_share/KW/bgg.gr -numnodes=2
the runtime error is:

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 4774 RUNNING AT IIPLab-WS1
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

[proxy:0:1@IIPLab-WS2] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:1@IIPLab-WS2] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@IIPLab-WS2] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[mpiexec@IIPLab-WS2] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:75): one of the processes terminated badly; aborting
[mpiexec@IIPLab-WS2] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:22): launcher returned error waiting for completion
[mpiexec@IIPLab-WS2] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:215): launcher returned error waiting for completion
[mpiexec@IIPLab-WS2] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion

Constant warnings when compiling

With GCC 8.1, the following warnings always occur when compiling the code.

In file included from /opt/apps/ossw/libraries/boost/boost-1.67.0/c7/gcc-8.1/include/boost/mpl/aux_/na_assert.hpp:23,
from /opt/apps/ossw/libraries/boost/boost-1.67.0/c7/gcc-8.1/include/boost/mpl/arg.hpp:25,
from /opt/apps/ossw/libraries/boost/boost-1.67.0/c7/gcc-8.1/include/boost/mpl/placeholders.hpp:24,
from /opt/apps/ossw/libraries/boost/boost-1.67.0/c7/gcc-8.1/include/boost/iterator/iterator_categories.hpp:16,
from /opt/apps/ossw/libraries/boost/boost-1.67.0/c7/gcc-8.1/include/boost/iterator/iterator_facade.hpp:13,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/FixedSizeRing.h:27,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/worklists/PerThreadChunk.h:23,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/worklists/WorkList.h:25,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/Traits.h:24,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/runtime/Executor_OnEach.h:24,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/Bag.h:24,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/runtime/Executor_Deterministic.h:23,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/Loops.h:23,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/Galois.h:23,
from /h1/byou/Workspace/GaloisCpp/lonestar/connectedcomponents/ConnectedComponents.cpp:20:
/opt/apps/ossw/libraries/boost/boost-1.67.0/c7/gcc-8.1/include/boost/mpl/assert.hpp:188:21: warning: unnecessary parentheses in declaration of ‘assert_arg’ [-Wparentheses]
failed ************ (Pred::************
^
/opt/apps/ossw/libraries/boost/boost-1.67.0/c7/gcc-8.1/include/boost/mpl/assert.hpp:193:21: warning: unnecessary parentheses in declaration of ‘assert_not_arg’ [-Wparentheses]
failed ************ (boost::mpl::not_::************

In file included from /h1/byou/Workspace/GaloisCpp/libllvm/include/llvm/Support/CommandLine.h:25,
from /h1/byou/Workspace/GaloisCpp/lonestar/connectedcomponents/ConnectedComponents.cpp:30:
/h1/byou/Workspace/GaloisCpp/libllvm/include/llvm/ADT/SmallVector.h: In instantiation of ‘static void llvm::SmallVectorTemplateBase<T, true>::uninitialized_copy(T1*, T1*, T2*) [with T1 = const std::pair<const char*, std::pair<int, const char*> >; T2 = std::pair<const char*, std::pair<int, const char*> >; T = std::pair<const char*, std::pair<int, const char*> >]’:
/h1/byou/Workspace/GaloisCpp/libllvm/include/llvm/ADT/SmallVector.h:636:3: required from ‘const llvm::SmallVectorImpl& llvm::SmallVectorImpl::operator=(const llvm::SmallVectorImpl&) [with T = std::pair<const char*, std::pair<int, const char*> >]’
/h1/byou/Workspace/GaloisCpp/libllvm/include/llvm/ADT/SmallVector.h:693:36: required from ‘llvm::SmallVector<T, N>::SmallVector(const llvm::SmallVector<T, N>&) [with T = std::pair<const char*, std::pair<int, const char*> >; unsigned int N = 4]’
/h1/byou/Workspace/GaloisCpp/libllvm/include/llvm/Support/CommandLine.h:444:7: required from ‘llvm::cl::ValuesClass llvm::cl::values(const char*, DataType, const char*, ...) [with DataType = int]’
/h1/byou/Workspace/GaloisCpp/lonestar/connectedcomponents/ConnectedComponents.cpp:82:21: required from here
/h1/byou/Workspace/GaloisCpp/libllvm/include/llvm/ADT/SmallVector.h:244:11: warning: ‘void* memcpy(void*, const void*, size_t)’ writing to an object of type ‘struct std::pair<const char*, std::pair<int, const char*> >’ with no trivial copy-assignment; use copy-assignment or copy-initialization instead [-Wclass-memaccess]
memcpy(Dest, I, (E - I) * sizeof(T));
~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /opt/apps/ossw/applications/gcc/gcc-8.1/c7/include/c++/8.1.0/utility:70,
from /opt/apps/ossw/applications/gcc/gcc-8.1/c7/include/c++/8.1.0/tuple:38,
from /opt/apps/ossw/applications/gcc/gcc-8.1/c7/include/c++/8.1.0/mutex:38,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/substrate/SimpleLock.h:27,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/substrate/PaddedLock.h:23,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/PriorityQueue.h:23,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/gstl.h:23,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/Bag.h:23,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/runtime/Executor_Deterministic.h:23,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/Loops.h:23,
from /h1/byou/Workspace/GaloisCpp/libgalois/include/galois/Galois.h:23,
from /h1/byou/Workspace/GaloisCpp/lonestar/connectedcomponents/ConnectedComponents.cpp:20:
/opt/apps/ossw/applications/gcc/gcc-8.1/c7/include/c++/8.1.0/bits/stl_pair.h:198:12: note: ‘struct std::pair<const char*, std::pair<int, const char*> >’ declared here
struct pair
^~~~
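The -Wclass-memaccess warning fires because the vendored SmallVector copies std::pair elements with memcpy, and std::pair has no trivial copy-assignment. A minimal sketch of one way to avoid it, assuming C++17 (which the repository now uses), is to take the memcpy path only when std::is_trivially_copyable holds and otherwise copy-construct each element. copyRange is a hypothetical helper, not LLVM's code.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>
#include <utility>

// Copies [first, last) into uninitialized storage starting at dest.
// memcpy is used only for trivially copyable types; everything else
// (including std::pair) is copy-constructed with std::uninitialized_copy.
template <typename T>
void copyRange(const T* first, const T* last, T* dest) {
  if constexpr (std::is_trivially_copyable_v<T>) {
    if (first != last)
      std::memcpy(dest, first, static_cast<std::size_t>(last - first) * sizeof(T));
  } else {
    std::uninitialized_copy(first, last, dest);
  }
}
```

Because the branch is chosen at compile time with if constexpr, the memcpy call is never instantiated for types where GCC would warn.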

Release graphgrammar based mesh refinement code

Another group recently put together https://github.com/Podsiadlo/TerrainMeshGenerator/tree/graphgrammar/lonestar/graphgrammar2 using Galois. The algorithm they use there is an excellent demo of Galois and they've said we can add their app as a demo.

[ ] Remove large binary files from their git history before merging
[ ] Fix the warnings present in their code
[ ] Make one of their needed inputs available in our small inputs download
[ ] Write a test

Fix CMake (3.17) warnings

CMake Warning (dev) at /org/centers/cdgc/sw/miniconda/envs/cmake-3.17/share/cmake-3.17/Modules/GNUInstallDirs.cmake:225 (message):
Unable to determine default CMAKE_INSTALL_LIBDIR directory because no
target architecture is known. Please enable at least one language before
including GNUInstallDirs.
Call Stack (most recent call first):
CMakeLists.txt:3 (include)
This warning is for project developers. Use -Wno-dev to suppress it.

CMake Warning (dev) at /org/centers/cdgc/sw/miniconda/envs/cmake-3.17/share/cmake-3.17/Modules/FindCUDA.cmake:593 (option):
Policy CMP0077 is not set: option() honors normal variables. Run "cmake
--help-policy CMP0077" for policy details. Use the cmake_policy command to
set the policy and suppress this warning.

For compatibility with older versions of CMake, option is clearing the
normal variable 'CUDA_SEPARABLE_COMPILATION'.
Call Stack (most recent call first):
CMakeLists.txt:208 (find_package)
This warning is for project developers. Use -Wno-dev to suppress it.

CMake Warning (dev) at /org/centers/cdgc/sw/miniconda/envs/cmake-3.17/share/cmake-3.17/Modules/CheckIncludeFile.cmake:80 (message):
Policy CMP0075 is not set: Include file check macros honor
CMAKE_REQUIRED_LIBRARIES. Run "cmake --help-policy CMP0075" for policy
details. Use the cmake_policy command to set the policy and suppress this
warning.

CMAKE_REQUIRED_LIBRARIES is set to:

m

For compatibility with CMake 3.11 and below this check is ignoring it.
Call Stack (most recent call first):
cmake/Modules/llvm-extras.cmake:39 (check_include_file)
libllvm/CMakeLists.txt:16 (include)
This warning is for project developers. Use -Wno-dev to suppress it.

Clear out special cases for old C++17 code

Some of my older code was already using C++17. There are some special cases in the CMake configurations for sweeps, ILU, and triangular solve that need to be removed now that the whole repository builds with C++17.

libllvm too old to support option categorization

Recent versions of the LLVM command-line parsing library support option categorization, which makes the help output (from the -help option) more systematic and readable. For example, we could group common options like -runs and -t into one category, and put application- or even algorithm-specific options like -edgeTileSize into other categories. Our current libllvm is too old to support this.

Why does running SGD on multiple hosts segfault?

I have a bipartite graph with 11 nodes and 10 edges. Its edge list is:
1 6 1
2 6 1
2 7 1
3 7 1
3 8 1
4 8 1
4 9 1
5 9 1
5 10 1
0 10 1
I converted it to .gr with -edgelist2gr.
When I run SGD on one host, the program runs successfully, but when I run SGD on two hosts, it segfaults:
[0] InitializeGraph::go called
[1] InitializeGraph::go called
yhrun: error: cn7420: task 1: Segmentation fault

The segfault occurs when the program runs:
_graph.sync<writeSource, readAny, Reduce_set_latent_vector>("InitializeGraph");
How can I solve this?

NodeIterator and EdgeIterator for TriangleCounting yield different results

I am trying to run TriangleCounting on the cit-Patents dataset, and I get two different counts from the NodeIterator and EdgeIterator algorithms.
Why is that?

For reference, for the NodeIterator I get 261897 triangles. For the EdgeIterator I get 188649 triangles.

Any explanation would be helpful!
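Both algorithms count each triangle exactly once on a simple undirected graph, so given the same symmetric, deduplicated input they should report the same number. One thing worth ruling out (an assumption, not a confirmed diagnosis) is the input itself: SNAP distributes cit-Patents as a directed citation graph, and if the .gr file is not symmetrized and deduplicated before counting, algorithms that traverse edges differently can see different neighbor sets. The helpers below (buildUndirected, nodeIteratorTC, edgeIteratorTC) are hypothetical sketches, not the Lonestar implementations.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <utility>
#include <vector>

// Adjacency sets for a simple undirected graph (hypothetical helper type).
using Adj = std::vector<std::set<int>>;

// Symmetrize, drop self loops, and deduplicate (std::set ignores repeats).
Adj buildUndirected(int n, const std::vector<std::pair<int, int>>& edges) {
  Adj adj(n);
  for (const auto& [u, v] : edges) {
    if (u == v) continue;
    adj[u].insert(v);
    adj[v].insert(u);
  }
  return adj;
}

// Node iterator: for each v, test every neighbor pair v < u < w for an edge
// (u, w). Each triangle is counted once, at its smallest vertex.
std::size_t nodeIteratorTC(const Adj& adj) {
  std::size_t count = 0;
  for (int v = 0; v < static_cast<int>(adj.size()); ++v)
    for (int u : adj[v]) {
      if (u <= v) continue;
      for (int w : adj[v])
        if (w > u && adj[u].count(w)) ++count;
    }
  return count;
}

// Edge iterator: for each edge (u, v) with u < v, intersect the two sorted
// neighbor sets, keeping only w > v. Each triangle is counted once.
std::size_t edgeIteratorTC(const Adj& adj) {
  std::size_t count = 0;
  std::vector<int> common;
  for (int u = 0; u < static_cast<int>(adj.size()); ++u)
    for (int v : adj[u]) {
      if (v <= u) continue;
      common.clear();
      std::set_intersection(adj[u].upper_bound(v), adj[u].end(),
                            adj[v].upper_bound(v), adj[v].end(),
                            std::back_inserter(common));
      count += common.size();
    }
  return count;
}
```

On any input passed through buildUndirected the two counts agree; comparing them on a small extract of the real input is a cheap way to check whether preprocessing explains the discrepancy.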

Separate repo for Galois core from that for Lonestar apps

Right now, Lonestar apps and Galois core share a common repo. It would be good to separate them into different repos so that external users only need to work with Galois core when writing their own apps. My use cases are OpenSTA (the timer in the OpenROAD project led by UCSD) and Cyclone (an asynchronous timer developed in collaboration with Yale).

triangle counting updates

Update the TC code with some optimizations found in the past few weeks (notably moving a function out of a lambda).

Maybe drop algorithms that don't perform well? Ordered count is generally the fastest if the graph can be sorted.
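One common form of ordered counting ranks vertices by (degree, id) and orients each edge from the lower-ranked endpoint to the higher-ranked one. Every triangle then appears exactly once, and high-degree hubs get short out-lists, which is the usual reason sorting pays off. The sketch below uses a hypothetical orderedTriangleCount (not the Lonestar code) and assumes a simple undirected edge list with each edge listed once and no self loops.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Degree-ordered triangle counting sketch. Edges point from lower to higher
// (degree, id) rank, so triangle a < b < c (by rank) appears only as
// a -> b, a -> c, b -> c and is found once when intersecting out[a] and out[b].
std::size_t orderedTriangleCount(int n, const std::vector<std::pair<int, int>>& edges) {
  std::vector<int> degree(n, 0);
  for (const auto& [u, v] : edges) {
    ++degree[u];
    ++degree[v];
  }
  auto lessRank = [&degree](int a, int b) {
    return degree[a] != degree[b] ? degree[a] < degree[b] : a < b;
  };

  std::vector<std::vector<int>> out(n);
  for (const auto& [u, v] : edges) {
    if (lessRank(u, v)) out[u].push_back(v);
    else out[v].push_back(u);
  }
  for (auto& list : out) std::sort(list.begin(), list.end());  // needed for intersection

  std::size_t count = 0;
  std::vector<int> common;
  for (int u = 0; u < n; ++u)
    for (int v : out[u]) {
      common.clear();
      std::set_intersection(out[u].begin(), out[u].end(), out[v].begin(), out[v].end(),
                            std::back_inserter(common));
      count += common.size();
    }
  return count;
}
```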

Survey Propagation times out

Survey propagation with multiple threads can get into a state where it never terminates.

186: NonTrivial 4 MaxBias 2.22507e-308 Average Bias -nan
186: DECIMATED
186: NonTrivial 4 MaxBias 2.22507e-308 Average Bias -nan
186: DECIMATED
186: NonTrivial 4 MaxBias 2.22507e-308 Average Bias -nan
186: DECIMATED
186: NonTrivial 4 MaxBias 2.22507e-308 Average Bias -nan
186: DECIMATED
186: NonTrivial 4 MaxBias 2.22507e-308 Average Bias -nan
186: DECIMATED
186: NonTrivial 4 MaxBias 2.22507e-308 Average Bias -nan
186: DECIMATED
186: NonTrivial 4 MaxBias 2.22507e-308 Average Bias -nan
186: DECIMATED
333/333 Test #186: run-small2-surveypropagation-18 ................***Timeout 1500.09 sec

https://circleci.com/gh/IntelligentSoftwareSystems/Galois/5415

Upgrade D-IrGL Apps

  1. Support newer architectures
  2. Verify the correctness for CUDA > 10.0 and GCC > 7.0.

Poor Scaling of Exceptions Fix

We currently need exceptions to implement operator aborts, but these scale poorly in practice because stack unwinding involves some locking. This issue originally focused on long-term ways to get this problem resolved upstream, while the short-term fixes were discussed in #72. We later decided to continue both discussions here to avoid confusion.

Config header

Currently, if other projects want to use Galois as a library and write their own Galois apps, they have to let Galois pass its CXX flags to their builds through CMake options.

It would be better to generate these options into a Galois config header that Galois headers include automatically. This way, other projects are decoupled from this tangle of CMake options.

Remove dependency on LLVM in Dist Galois

Dist Galois relies on our vendored copy of libllvm. If headers from a newer LLVM release are installed, the build currently fails. Updating Galois to no longer vendor libllvm is a good fix in general and will fix the build error. More broadly, code that can be installed as a library should not assume it controls the end user's command-line argument handling. If the default options are needed in every distributed app, they should be moved into something like the current liblonestar, which takes care of the common setup and configuration needed by all the Lonestar apps.

Galois does not compile on Intel compiler 19 (icpc)

There appears to be an issue with template argument expansion:

Galois/libgalois/include/galois/gtuple.h(77): error: pack expansion does not make use of any argument packs
using append = integer_seq<T, Is..., I>;
^
detected during instantiation of class "galois::integer_seq<T, Is...> [with T=int, Is=<>]" at line 104

MPI not executed in bfs_pull application (single-node, multiple-process version)

When I use the command: mpirun -n 4 -host=IIPLab-WS2 ./bfs_pull /home/mpi_share/KW/bgg.gr
the runtime error is:
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 32831 RUNNING AT XXXXXX
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions.

Add an option for compiling tools separately

  1. Allow tools to be compiled separately.
  2. When building tools, libllvm should be compiled automatically.
  3. When building libgalois alone, the build should not complain that libllvm is not found.

lonestar-cpu-inputs.tar.xz is re-extracted every build

Hi there,

Every time I run make, the target 'input' somehow does not recognize that the file has been unpacked, and I get this output:

[  0%] Unpacking lonestar-cpu-inputs.tar.xz

which takes a while. I'm using the release-4.0 branch (which seems to be in sync with master).

Fix Config File Machinery

There's currently some old broken machinery for generating a config file with defines that are set at CMake configure time. That header is no longer included anywhere. It should be fixed and used to provide info about how Galois is configured. Specifically, it should be used to provide:

  • version information
  • the type of aborts used inside the lockable type in the Galois shared object.

Both of those things are currently set as compiler flags in our CMakeLists, but we shouldn't expect downstream users of an installed library to be aware of the flags used to compile it. That information needs to be included in an installed config header so they can check it with #ifdef statements. This is especially important for our operator-aborting code, because the headers currently assume exception-based aborts by default even though the library builds with setjmp/longjmp-based aborts by default.
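An installed config header of the kind described above could look roughly like this sketch. Every macro name here (SKETCH_GALOIS_*) is hypothetical, and the numeric values are placeholders rather than Galois's real version or defaults; in a real build CMake would write the file at configure time (for example with configure_file). The second half shows how downstream code could branch on it.

```cpp
// ---- Sketch of an installed config header (all names hypothetical) ----
// CMake would generate these lines at configure time; values are placeholders.
#define SKETCH_GALOIS_VERSION_MAJOR 0
#define SKETCH_GALOIS_VERSION_MINOR 0
// Record exactly one abort mechanism, matching how the shared object was built.
#define SKETCH_GALOIS_ABORT_LONGJMP 1
/* #define SKETCH_GALOIS_ABORT_EXCEPTIONS 1 */

#if defined(SKETCH_GALOIS_ABORT_LONGJMP) && defined(SKETCH_GALOIS_ABORT_EXCEPTIONS)
#error "config header must record exactly one abort mechanism"
#endif

// ---- Downstream code queries the installed configuration ----
enum class AbortMechanism { LongJmp, Exceptions };

constexpr AbortMechanism configuredAbortMechanism() {
#if defined(SKETCH_GALOIS_ABORT_LONGJMP)
  return AbortMechanism::LongJmp;
#elif defined(SKETCH_GALOIS_ABORT_EXCEPTIONS)
  return AbortMechanism::Exceptions;
#else
#error "config header did not record an abort mechanism"
#endif
}

// True if the installed library is at least version major.minor.
constexpr bool versionAtLeast(int major, int minor) {
  return SKETCH_GALOIS_VERSION_MAJOR > major ||
         (SKETCH_GALOIS_VERSION_MAJOR == major && SKETCH_GALOIS_VERSION_MINOR >= minor);
}
```

Because the header travels with the installed library, headers and downstream code see the same abort setting the shared object was built with, instead of relying on matching compiler flags.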

SPRoute

  1. Clean up code. Remove warnings.
  2. Prepare a small input for testing. Large input files.
  3. Webpage.
  4. Move to lonestar/EDA.

MPI_Finalize() not executed in bfs_push application (distributed version)

Hi there,

When I run the distributed bfs_push application (MPI: openMPI 3.0.1a1), I get a runtime error suggesting that the process may exit without calling "MPI_Finalize". I found that the "MPI_Finalize" call is located in the destructor of NetworkInterfaceBuffered. It seems that the destructor of "net", the NetworkInterfaceBuffered object in main(), is never executed, so the process exits without calling "MPI_Finalize".

The runtime error is:

--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 0 on
node cn662 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.
--------------------------------------------------------------------------
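In standard C++, a destructor never runs if its object is heap-allocated and never deleted, or if the process ends through std::exit, which does not destroy objects with automatic storage duration. We have not confirmed which case applies to net here. One defensive pattern keeps the destructor but adds an explicit, idempotent finalize() that exit paths can call first. FinalizeGuard and stubFinalize are hypothetical names; stubFinalize stands in for MPI_Finalize so the sketch builds without MPI, and real code could additionally consult MPI_Finalized before finalizing.

```cpp
// Counts calls so the sketch can be checked; stands in for MPI_Finalize.
inline int& finalizeCallCount() {
  static int calls = 0;
  return calls;
}
inline void stubFinalize() { ++finalizeCallCount(); }

// Hypothetical RAII guard: finalizes in its destructor, and also exposes an
// explicit finalize() for code paths that may bypass destructors.
class FinalizeGuard {
public:
  FinalizeGuard() = default;
  FinalizeGuard(const FinalizeGuard&) = delete;
  FinalizeGuard& operator=(const FinalizeGuard&) = delete;
  ~FinalizeGuard() { finalize(); }

  // Safe to call more than once; only the first call finalizes.
  void finalize() {
    if (!finalized_) {
      finalized_ = true;
      stubFinalize();
    }
  }

private:
  bool finalized_ = false;
};
```

A code path that ends with std::exit would call guard.finalize() first; a normal return from main still finalizes through the destructor.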

Use size_t in test

Anything that ever represents a "collection size" should always be size_t, but I suspect this patch is intended to make the test match what the data structure does to remove a warning, so we can switch everything over to size_t later.

Originally posted by @insertinterestingnamehere in #88

Blitz

  1. Finish timing propagation with false paths
  2. Refactor Blitz code base
  3. Prepare small test cases (w/ ground truth for verification)
  4. Webpage
  5. Not for release yet; make public when the papers are accepted
    • v1: Basic version (with support of false paths)
    • v2: General tagging mechanism

Drop Non-working Build Configs

Cygwin and IBM XL haven't worked for a long time. We should remove the special cases for them from the existing build system code.

DistTC

  1. Clean up the code
  2. Merge to Dist-Dev branch and finally to the master branch.

CMake Errors if ENABLE_HETERO_GALOIS

CMake Error in libgpu/CMakeLists.txt:
Target "galois_gpu" requires the language dialect "CUDA17" , but CMake does
not know the compile flags to use to enable it.

CMake Error in lonestardist/bc/CMakeLists.txt:
Target "bc_level_cuda" requires the language dialect "CUDA17" , but CMake
does not know the compile flags to use to enable it.

...

Sanitizer CMake Configurations

We currently have a CMake configuration to enable using the address sanitizer. It'd be nice to have configurations that enable the other sanitizers as well.

graph-convert tool segfaults when converting a graph using -gr2sortedparentdegreegr

The graph-convert tool segfaults when converting a graph with the sort-by-parent-degree option. I tried multiple graphs (twitter_rv, cit-patents).
I got a backtrace from gdb, and the segfault seems to happen at line 1399 of Galois/tools/graph-convert/graph-convert.cpp, inside the call to std::sort. The segfault only happens when the number of nodes processed (the count variable) is close to the total size (the sz variable).
In the case of cit-patents, the last line printed to the console was "6008832 of 6009555", after it had found an inverse.
I have been running this on Ubuntu 16.04 with 32GB RAM and a 3.2GHz i7 processor.
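We have not inspected the code at line 1399, but one well-known cause of crashes inside std::sort is a comparator that is not a strict weak ordering (for example, one written with <=), which can make libstdc++'s unguarded insertion step read past the end of the range. A minimal sketch of a safe degree comparator follows; NodeDeg, byDegreeThenId, and sortByDegree are hypothetical names, not graph-convert code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical record: a node id plus the degree used as the sort key.
struct NodeDeg {
  std::uint64_t node;
  std::uint64_t degree;
};

// Strict weak ordering: comp(a, a) is always false. Using `<=` on degree
// would violate this and is a classic source of crashes inside std::sort.
inline bool byDegreeThenId(const NodeDeg& a, const NodeDeg& b) {
  if (a.degree != b.degree) return a.degree < b.degree;
  return a.node < b.node;  // deterministic tie-break
}

void sortByDegree(std::vector<NodeDeg>& nodes) {
  std::sort(nodes.begin(), nodes.end(), byDegreeThenId);
}
```

Breaking ties by node id also makes the output order deterministic across runs.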

KDG Related Apps No Longer Compile

I am getting the following errors while trying to compile with the -DUSE_EXP=ON flag:

error: no matching function for call to ‘galois::UserContext<std::pair<unsigned int, int> >::getLocalState()’

/home/mahmad/work/new/Galois/lonestar/experimental/bfs/bfs.cpp:806:11: error: ‘StatManager’ is not a member of ‘galois’
galois::StatManager statManager;

/home/mahmad/work/new/Galois/libgalois/include/galois/Loops.h:73:33: error: no match for call to ‘(const boost::iterators::counting_iterator) (std::tuple<max_dist>&)’
runtime::do_all_gen(rangeMaker(tpl), fn, tpl);

/home/mahmad/work/new/Galois/libgalois/include/galois/Loops.h:54:35: error: no match for call to ‘(const std::_Deque_iterator<unsigned int, unsigned int&, unsigned int*>) (std::tuple<AsyncAlgo, galois::s_wl<galois::worklists::OrderedByIntegerMetric<GNodeIndexer, galois::worklists::internal::ChunkMaster<int, galois::worklists::ConExtLinkedQueue, true, false, 64, true>, 0, true, int, int, false, false, false, true> > >&)’
runtime::for_each_gen(rangeMaker(tpl), fn, tpl);

/home/mahmad/work/new/Galois/libgalois/include/galois/Loops.h:54:35: error: no match for call to ‘(const std::_Deque_iterator<std::pair<unsigned int, int>, std::pair<unsigned int, int>&, std::pair<unsigned int, int>*>) (std::tuple<BarrierAlgo<galois::worklists::BulkSynchronous<galois::worklists::internal::ChunkMaster<int, galois::worklists::ConExtLinkedStack, true, true, 256, true>, int, true>, false>, galois::s_wl<galois::worklists::BulkSynchronous<galois::worklists::internal::ChunkMaster<int, galois::worklists::ConExtLinkedStack, true, true, 256, true>, int, true> > >&)’

Refactor experimental code

We should probably rethink USE_EXP. Once we support importing (library) targets, most of the experimental apps can be moved to a separate repo.

For the experimental library code, I propose we audit it with an aim to either promote it to non-experimental or to delete it.

Example: #88

dist-apps output varies with multiple threads/GPU: bfs_push, road-europe

For bfs_push on road-europe on a single machine, the sanity-check output on both CPU and GPU differs between runs within the same executable invocation.

Example output for road-europe below:

Single thread:
Number of nodes visited from source 0 is 3735649
Max distance from source 0 is 2706

Multiple threads:
[0] BFS::go run 0 called
Number of nodes visited from source 0 is 3774664
Max distance from source 0 is 2643
[0] BFS::go run 1 called
Number of nodes visited from source 0 is 3835497
Max distance from source 0 is 2666
[0] BFS::go run 2 called
Number of nodes visited from source 0 is 3785452
Max distance from source 0 is 2647

Reported by Tal Ben-Nun.
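The set of nodes reachable from a source, and their hop distances, are fixed by the graph, so every correct run should report the same visited count and max distance regardless of thread count. A single-threaded reference gives a ground truth to compare the CPU/GPU runs against. The sketch assumes the graph fits in memory as an adjacency list; referenceBfs and BfsSummary are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <queue>
#include <vector>

struct BfsSummary {
  std::size_t visited;    // nodes reachable from the source, including it
  std::uint32_t maxDist;  // largest finite hop distance
};

// Deterministic single-threaded BFS over an adjacency list.
BfsSummary referenceBfs(const std::vector<std::vector<std::uint32_t>>& adj,
                        std::uint32_t source) {
  const std::uint32_t kInf = std::numeric_limits<std::uint32_t>::max();
  std::vector<std::uint32_t> dist(adj.size(), kInf);
  std::queue<std::uint32_t> frontier;
  dist[source] = 0;
  frontier.push(source);

  BfsSummary summary{0, 0};
  while (!frontier.empty()) {
    std::uint32_t u = frontier.front();
    frontier.pop();
    ++summary.visited;
    if (dist[u] > summary.maxDist) summary.maxDist = dist[u];
    for (std::uint32_t v : adj[u]) {
      if (dist[v] == kInf) {  // first discovery fixes the BFS level
        dist[v] = dist[u] + 1;
        frontier.push(v);
      }
    }
  }
  return summary;
}
```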

Poor Scaling of Exceptions Workaround

This is a companion issue to #71 focused on the question "what do we do now?" We need a short term solution since getting exceptions to scale in parallel contexts will likely require some long term work on upstream projects.

There are a few things that would be nice to achieve:

  • Not relying on a custom toolchain
  • Not killing performance of apps with moderate contention
  • Resolve the crashes in the gmetis and torus apps
  • Not breaking user code in counter-intuitive ways
  • Not polluting the user interface with additional options
  • Not forcing users to worry about another configuration setting

Here are a few plausible options and where they stand:

  1. Sticking with longjmp exclusively provides good performance, but relies on the currently undocumented assumption that operators that use conflict detection do not construct any objects that require nontrivial destruction before reaching their failsafe point. We already warn people not to modify data until after the failsafe point, so this isn't much more of a constraint than that. Switching off setjmp/longjmp-based aborts also fixed our gmetis and torus apps. That said, it's possible that the runtime locks for exceptions are just hiding some other bug in those cases, so more investigation is needed to see what's going on there. This is what we had been doing until we realized the crashes in torus and gmetis were connected to this. If we commit to this option, we should at least document the additional limitations on operators.
  2. Switching to exceptions entirely has severe performance implications for some apps. It does avoid adding additional constraints on the style of code the user writes though. It also fixes the crashes in gmetis and torus. It also avoids any user interface issues.
  3. Modifying the Galois API to no longer rely on exceptions for locking/control flow. It's not currently clear what this would look like, but maybe we could modify our API to make operator aborts something that doesn't have to involve unwinding the stack at all.
  4. Use exceptions in the CI tests and nowhere else. This requires no user options and very little additional work from us right now, but it could result in unexpected memory leaks or crashes in user code that uses the current implementation of conflict detection. It also only hides the issue in gmetis and torus and introduces a disparity in our testing environment that could hide other bugs as well.
  5. Allow switching between setjmp/longjmp based aborts and exception based aborts via a preprocessor flag. This avoids directly polluting the user interface, but still requires users to make a decision about which exception style is better suited to their application. If we get the exception scaling issues fully fixed upstream then the build option can be silently ignored later without any API changes. We can avoid major performance penalties by allowing demo applications that need setjmp/longjmp to opt into that abort mechanism in their build configs.
  6. Allow switching between abort implementations via a template argument flag. This directly pollutes the user API with something that is arguably an implementation choice, but it makes sense if we actually intend setjmp/longjmp and exceptions to be permanent, distinct options for aborts.

Thus far I'm partial to option 5 (see #68), but my opinion isn't the only one that matters. We also need to measure to see exactly what kinds of performance tradeoffs are in play here. If exceptions kill performance in every case where we use termination detection then options 1 and 3 are probably better approaches.
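Option 5 can be pictured as a compile-time switch between two ways of unwinding an aborted iteration. This sketch is hypothetical: SKETCH_ABORT_WITH_EXCEPTIONS, signalConflict, and runIteration are not Galois names, and conflict detection itself is omitted. It also shows the limitation from option 1: on the longjmp path, an operator must not hold objects with nontrivial destructors when it aborts, because the C++ standard makes a longjmp that skips such destructors undefined behavior.

```cpp
#include <csetjmp>

// Hypothetical build switch in the spirit of option 5 (default: longjmp).
#ifndef SKETCH_ABORT_WITH_EXCEPTIONS
#define SKETCH_ABORT_WITH_EXCEPTIONS 0
#endif

struct ConflictAbort {};

#if SKETCH_ABORT_WITH_EXCEPTIONS
// Exception path: unwinding runs destructors, but may scale poorly.
[[noreturn]] inline void signalConflict() { throw ConflictAbort{}; }

template <typename Op>
bool runIteration(Op op) {
  try {
    op();
    return true;
  } catch (const ConflictAbort&) {
    return false;  // iteration aborted; caller may retry it later
  }
}
#else
// longjmp path: cheap, but skips destructors of the operator's locals.
thread_local std::jmp_buf* g_abortTarget = nullptr;

// Must only be called while inside runIteration.
[[noreturn]] inline void signalConflict() { std::longjmp(*g_abortTarget, 1); }

template <typename Op>
bool runIteration(Op op) {
  std::jmp_buf buf;
  std::jmp_buf* const saved = g_abortTarget;  // support nesting
  g_abortTarget = &buf;
  if (setjmp(buf) != 0) {
    g_abortTarget = saved;  // aborted: restore the outer target
    return false;
  }
  op();
  g_abortTarget = saved;
  return true;
}
#endif
```

Because both branches expose the same signalConflict/runIteration interface, operator code compiles unchanged under either setting, which is the property option 5 relies on.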

OpenMPI + CentOS segfault

321: [0c214156352c:13251:0:13288] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f0359deb768)
321: ==== backtrace ====
321:     0  /lib64/libucs.so.0(+0x18bb0) [0x7f035977ebb0]
321:     1  /lib64/libucs.so.0(+0x18d8a) [0x7f035977ed8a]
321:     2  /lib64/libuct.so.0(+0x1655b) [0x7f036d13c55b]
321:     3  /lib64/ld-linux-x86-64.so.2(+0xfd2a) [0x7f042c9ead2a]
321:     4  /lib64/ld-linux-x86-64.so.2(+0xfe2a) [0x7f042c9eae2a]
321:     5  /lib64/ld-linux-x86-64.so.2(+0x13e3f) [0x7f042c9eee3f]
321:     6  /lib64/libc.so.6(_dl_catch_exception+0x77) [0x7f042b270ff7]
321:     7  /lib64/ld-linux-x86-64.so.2(+0x136ae) [0x7f042c9ee6ae]
321:     8  /lib64/libdl.so.2(+0x11ba) [0x7f042a9ca1ba]
321:     9  /lib64/libc.so.6(_dl_catch_exception+0x77) [0x7f042b270ff7]
321:    10  /lib64/libc.so.6(_dl_catch_error+0x33) [0x7f042b271093]
321:    11  /lib64/libdl.so.2(+0x1939) [0x7f042a9ca939]
321:    12  /lib64/libdl.so.2(dlopen+0x4a) [0x7f042a9ca25a]
321:    13  /usr/lib64/openmpi/lib/libopen-pal.so.40(+0x6df05) [0x7f042ac3af05]
321:    14  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_repository_open+0x206) [0x7f042ac18b16]
321:    15  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_find+0x35a) [0x7f042ac17a5a]
321:    16  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_components_register+0x2e) [0x7f042ac233ce]
321:    17  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_register+0x252) [0x7f042ac238b2]
321:    18  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_open+0x15) [0x7f042ac23915]
321:    19  /usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x674) [0x7f042be7b494]
321:    20  /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Init_thread+0x89) [0x7f042beab839]
321:    21  ./pagerank_pull() [0x51d844]
321:    22  ./pagerank_pull() [0x51e8b9]
321:    23  /lib64/libstdc++.so.6(+0xc2b23) [0x7f042bb57b23]
321:    24  /lib64/libpthread.so.0(+0x82de) [0x7f042c7c32de]
321:    25  /lib64/libc.so.6(clone+0x43) [0x7f042b2344b3]
321: ===================

https://circleci.com/gh/IntelligentSoftwareSystems/Galois/5362
