open-mpi / ompi-collectives-tuning
Scripts to collect data for collectives selection tuning
License: Other
I found that there is a discrepancy between the message size that the OSU benchmarks report and the size that coll/tuned uses to make tuning decisions: the OSU benchmarks report the size of the message each rank sends, while coll/tuned bases its decision in allgatherv on the total amount of data to be received. This leads to nonsensical rules and likely suboptimal decisions. This should be fixed in the Python scripts (when generating the decision file and ideally also when writing the best.out file).
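A minimal sketch of the conversion described above, assuming every rank contributes the same number of bytes (as in the OSU benchmark), so the total received is the per-rank size times the communicator size. The function names and the `(size, algorithm)` row layout are hypothetical and not taken from the repository's scripts.

```python
from typing import List, Tuple


def tuned_allgatherv_size(per_rank_bytes: int, comm_size: int) -> int:
    """Convert an OSU-reported per-rank size into the total receive size.

    Assumption: all ranks send the same amount, so the total amount of
    data received is per_rank_bytes * comm_size.
    """
    if per_rank_bytes < 0 or comm_size < 1:
        raise ValueError("sizes must be non-negative and comm_size >= 1")
    return per_rank_bytes * comm_size


def rescale_rows(rows: List[Tuple[int, int]], comm_size: int) -> List[Tuple[int, int]]:
    """Rescale hypothetical (osu_size, best_algorithm) rows before they are
    written to a decision file."""
    return [(tuned_allgatherv_size(size, comm_size), alg) for size, alg in rows]


if __name__ == "__main__":
    # 1 KiB per rank on 16 ranks corresponds to 16 KiB total received.
    print(tuned_allgatherv_size(1024, 16))
    print(rescale_rows([(8, 1), (1024, 3)], 4))
```

Applying the conversion in one place, right before rules are emitted, would keep the raw benchmark output untouched while making the generated thresholds comparable to what coll/tuned evaluates.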
Need to add an easier way to run collectives on different OMPI versions (they have different numbers of algorithms, for example).
This will need a config file change specifying the OMPI version, or at least a parameter in run_and_analyze, and separate collective_jobs directories for each OMPI version we care about. I'll make one for 3.x.x, 4.x.x, 5.x.x, and master.
I toyed around with the idea of creating separate branches for each, but the amount of backporting would be excessive, so I'm not going to do that.
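One way the per-version lookup could work is sketched below. The directory layout (`collective_jobs/4.x.x` and so on) and the function name are assumptions made for illustration; only the four version buckets come from the text above.

```python
import os

# Version buckets named in the issue; anything else is rejected.
SUPPORTED_MAJORS = {"3", "4", "5"}


def jobs_dir_for_version(version: str, base: str = "collective_jobs") -> str:
    """Map an OMPI version string such as "4.0.3" or "master" to a
    hypothetical per-version jobs directory."""
    if version == "master":
        return os.path.join(base, "master")
    major = version.split(".")[0]
    if major not in SUPPORTED_MAJORS:
        raise ValueError("unsupported OMPI version: %r" % version)
    return os.path.join(base, major + ".x.x")


if __name__ == "__main__":
    print(jobs_dir_for_version("4.0.3"))
    print(jobs_dir_for_version("master"))
```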
The code I wrote to analyze multiple directories and average the data, plus the graphing, would be useful. It needs to be cleaned up and organized better.
In order to get more targeted data, we need to add some sort of mapping option, i.e. bind one process per node so we can tune based on internode communication only.
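A rough sketch of how such an option might be threaded into the job command line, using Open MPI's `--map-by ppr:1:node` mapping (one process per resource, here per node). The helper name, its parameters, and the benchmark path are hypothetical; the scripts in the repository may assemble their commands differently.

```python
from typing import List


def build_mpirun_cmd(num_procs: int, benchmark: str,
                     one_per_node: bool = False) -> List[str]:
    """Build an mpirun argument list, optionally placing one process per node.

    With one_per_node=True every rank lands on a different node, so the
    measured latencies reflect internode communication only.
    """
    cmd = ["mpirun", "-np", str(num_procs)]
    if one_per_node:
        cmd += ["--map-by", "ppr:1:node"]
    cmd.append(benchmark)
    return cmd


if __name__ == "__main__":
    print(" ".join(build_mpirun_cmd(8, "./osu_allreduce", one_per_node=True)))
```

With this mapping the number of ranks cannot exceed the number of allocated nodes, so the job generator would also need to cap `num_procs` accordingly.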
Message sizes are taken directly from the OSU and IMPI message sizes; the following issue explains how they are collective specific. The decision file creation code needs to be refactored to reflect the actual code:
open-mpi/ompi#7672
There is probably a better way to report that data is invalid due to errors when running mpirun than relying on the analysis step to detect it. I would like an error output file that lists every file that contains an error. Needs more thought.
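As a starting point, the error file could be produced by a small scan over the output directory, sketched below. The marker strings are assumptions about what a failed run leaves in its output and would need to be adjusted to the messages actually seen; the function names are hypothetical.

```python
import os
from typing import List, Sequence

# Assumed failure markers; extend with whatever real failed runs print.
ERROR_MARKERS = ("mpirun noticed", "Segmentation fault", "MPI_ABORT")


def find_error_files(root: str, markers: Sequence[str] = ERROR_MARKERS) -> List[str]:
    """Return the sorted paths of all files under root containing a marker."""
    bad = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as fh:
                text = fh.read()
            if any(marker in text for marker in markers):
                bad.append(path)
    return sorted(bad)


def write_error_report(root: str, report_path: str) -> int:
    """Write one offending path per line and return how many were found.

    report_path should live outside root so the report itself is not
    scanned on a later run.
    """
    bad = find_error_files(root)
    with open(report_path, "w") as fh:
        for path in bad:
            fh.write(path + "\n")
    return len(bad)
```

Running this before the analysis step would let the analysis skip or flag the listed files instead of inferring failures from malformed data.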
Rather than executing data analysis right away, it's probably better to split the analysis code into a separate, user-invoked step with its own instructions.
The dynamic code does not currently have any safety features or fallbacks if an algorithm that cannot support non-commutative ops is used during a non-commutative op. Decision file creation needs to be adjusted to account for that.
In newer versions, use of the bare "python" command has been deprecated; python3 now needs to be called explicitly. In addition, "python" may be linked to a version of Python 2. Change calls to "python" into calls to "python3".
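Where one Python script launches another, the interpreter name can be avoided altogether by reusing the running interpreter through `sys.executable`, as in the sketch below. The helper name is hypothetical; shell call sites would still need the literal switch to `python3`.

```python
import subprocess
import sys


def run_script(script_path: str, *args: str) -> str:
    """Run another Python script with the same interpreter as the caller.

    sys.executable is the absolute path of the running interpreter, so
    this never falls back to whatever "python" happens to point at.
    """
    if sys.version_info[0] < 3:
        raise RuntimeError("Python 3 is required")
    result = subprocess.run([sys.executable, script_path] + list(args),
                            check=True, stdout=subprocess.PIPE,
                            universal_newlines=True)
    return result.stdout
```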
Reduce Scatter Block hits lots of issues, such as running out of memory, at 128+ ranks. Need to figure out a way to handle this. It may be possible to just ignore the failures and revamp the parsing code a little. Needs more thought.
Need to hook up Travis to help get basic things out of the way. Maybe also tox and shellcheck, given the python and bash.
Two-proc algorithms can hit issues with the generated decision file (i.e. if you have 2-proc and 4-proc tuning, 3 procs will have issues because they use the 2-proc tuning).
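The failure mode can be illustrated with a small lookup sketch. It assumes the dynamic rules select the entry with the largest communicator size that does not exceed the actual one, which is what the example above implies; the function is a stand-in for illustration, not the coll/tuned implementation.

```python
from bisect import bisect_right
from typing import Iterable, Optional


def select_comm_rule(rule_comm_sizes: Iterable[int], comm_size: int) -> Optional[int]:
    """Return the rule's communicator size that would apply to comm_size.

    Assumed policy: pick the largest rule size <= comm_size, or None when
    every rule is for a larger communicator.
    """
    sizes = sorted(rule_comm_sizes)
    idx = bisect_right(sizes, comm_size)
    return sizes[idx - 1] if idx else None


if __name__ == "__main__":
    # With only 2-proc and 4-proc rules, a 3-proc communicator falls back
    # to the 2-proc rule, which may name an algorithm valid only for 2 procs.
    print(select_comm_rule([2, 4], 3))
    # An explicit 3-proc entry would shadow the 2-proc rule.
    print(select_comm_rule([2, 3, 4], 3))
```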