ross-org / ross
Rensselaer's Optimistic Simulation System
Home Page: http://ross-org.github.io
License: BSD 3-Clause "New" or "Revised" License
It might be helpful in ROSS to add a statistical counter that is incremented every time a random number is generated and decremented every time a random number is reversed. That way a user could compare runs and verify that the net number of random numbers generated is consistent across different synchronization modes.
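A minimal sketch of the counter idea, using a toy reversible LCG rather than ROSS's CLCG4. The names counted_rng, counted_rng_next, and counted_rng_reverse are hypothetical and not part of the ROSS API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an RNG stream; not the ROSS stream type. */
typedef struct {
    uint64_t state;      /* current generator state */
    int64_t  net_calls;  /* draws minus reversals */
} counted_rng;

/* Toy LCG parameters; the multiplier is odd, so each step is invertible
 * modulo 2^64. */
#define LCG_A 6364136223846793005ULL
#define LCG_C 1442695040888963407ULL

/* Multiplicative inverse of an odd number modulo 2^64. */
static uint64_t inverse_mod_2_64(uint64_t a) {
    assert(a & 1);               /* only odd numbers are invertible */
    uint64_t inv = a;            /* correct to 3 bits: a*a == 1 (mod 8) */
    for (int i = 0; i < 5; i++)
        inv *= 2 - a * inv;      /* each Newton step doubles the correct bits */
    return inv;
}

/* Forward draw: advance the state and count the call. */
uint64_t counted_rng_next(counted_rng *g) {
    g->state = g->state * LCG_A + LCG_C;
    g->net_calls++;
    return g->state;
}

/* Reverse draw: undo one step and decrement the counter. */
void counted_rng_reverse(counted_rng *g) {
    g->state = (g->state - LCG_C) * inverse_mod_2_64(LCG_A);
    g->net_calls--;
}
```

If a sequential run and an optimistic run of the same model finish with different net_calls totals, some reverse handler is probably missing a reversal (or has one too many).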
Currently, g_tw_rng_max is set in the call chain of tw_init based on the value of g_tw_nRNG_per_lp. Later on, it's used in tw_rand_init_streams in the offset calculation and also as a (unnecessary?) check on the number of RNGs asked for. This makes it awkward for individual models or libraries (e.g., CODES) to specify the use of multiple RNGs (ideally, we'd want to insulate main from having to set it manually for everything to work).
I think the "right" solution is to remove g_tw_rng_max entirely - at this point, it seems only to be used as an artificial upper bound during initialization that's in all likelihood the value 1. tw_define_lps is expected to be called exactly once AFAICT so g_tw_nRNG_per_lp is guaranteed to be the same value for all LPs on the PE.
If there is a case I'm missing, a simple fix that would get around these issues is to set g_tw_rng_max within the call to tw_define_lps, which allows some space for LP-specific initialization code to be run. For example, programs written using CODES have an LP "registration" phase before tw_define_lps is called to set up PE->LP mappings.
Currently we use either a hash table or an AVL tree for remote event tracking during optimistic simulation. However, the AVL tree is hacked in underneath the hash table implementation. We need to do some clean up of the API, so that we can try out different data structures here.
In floating-point arithmetic, ((a + b) - a) is not necessarily equal to b. There are a number of such issues within ross/models when looking at reverse computation (for example the waiting_time variable in the airport model). Floating-point values should typically be saved within the messages before being changed so they can be rolled back to a bit-wise identical value.
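To make the difference concrete, here is a small sketch contrasting the two reverse strategies. The lp_state and msg structs are hypothetical and only loosely modeled on the airport example; they are not the actual model structs.

```c
#include <assert.h>

/* Hypothetical LP state and message layouts for illustration. */
typedef struct { double waiting_time; } lp_state;
typedef struct { double delay; double saved_waiting_time; } msg;

/* Forward handler: save the old value in the message, then modify. */
void forward(lp_state *s, msg *m) {
    assert(s && m);
    m->saved_waiting_time = s->waiting_time;
    s->waiting_time += m->delay;
}

/* Safe reverse handler: restore the saved, bit-wise identical value. */
void reverse(lp_state *s, const msg *m) {
    assert(s && m);
    s->waiting_time = m->saved_waiting_time;
}

/* Fragile reverse handler: undo the addition arithmetically. The
 * addition may already have rounded away part of the old value. */
void reverse_by_subtraction(lp_state *s, const msg *m) {
    assert(s && m);
    s->waiting_time -= m->delay;
}
```

With waiting_time = 1.0 and delay = 1e16, the sum rounds to 1e16, so subtracting the delay again yields 0.0 rather than 1.0, while the saved copy restores 1.0 exactly.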
Copied from #22:
I would love for there to be an option to kill the simulation after N consecutive failed "allocation" attempts. E.g., last evening before I left I set up a sweep of ~700 simulations, the 10th of which entered an infinite "buffers not available" loop I then killed this morning. A "g_tw_max_consec_nobuf_ct" or something (defaulting to ULONG_MAX or similar) would do the trick, along with a bit of extra logic to track and reset the failure count accordingly. Basically, an "opt-in" workaround for issue #6.
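The "extra logic to track and reset the failure count" could look roughly like the sketch below. The nobuf_tracker type and its functions are hypothetical names for illustration, not existing ROSS symbols, and where the scheduler would call them is left open.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tracker for consecutive failed buffer allocations. */
typedef struct {
    unsigned long max_consec_nobuf;  /* ULONG_MAX means "never give up" */
    unsigned long consec_nobuf;      /* current run of failed attempts */
} nobuf_tracker;

void nobuf_tracker_init(nobuf_tracker *t, unsigned long max_consec) {
    assert(t != NULL);
    t->max_consec_nobuf = max_consec;
    t->consec_nobuf = 0;
}

/* Record the outcome of one allocation attempt.
 * Returns true when the simulation should be terminated. */
bool nobuf_tracker_record(nobuf_tracker *t, bool alloc_succeeded) {
    assert(t != NULL);
    if (alloc_succeeded) {
        t->consec_nobuf = 0;         /* any success resets the run */
        return false;
    }
    if (t->consec_nobuf < ULONG_MAX)
        t->consec_nobuf++;           /* saturate instead of wrapping */
    return t->max_consec_nobuf != ULONG_MAX
        && t->consec_nobuf >= t->max_consec_nobuf;
}
```

Defaulting the limit to ULONG_MAX keeps the current behavior, so only users who opt in see the simulation terminate.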
Forgot to make this an issue...
For simulations running over a long period of time with small time deltas (e.g., a small g_tw_lookahead or small random noise to prevent event ties), we can run into floating-point precision limitations - the magnitude of time deltas can be (mostly) lost and in the case of conservative mode, the lookahead check can fail. To fix that (or at least kick the can down the road by a few orders of magnitude), it would be helpful to have some mechanism in ROSS to compile with "long double" support (really, just make the tw_stime datatype a cmake-time option). This involves:
CMake should check for the GetGitRevison module before it errors out (and just issue a warning instead).
From @JohnPJenkins in #59:
While we're on the subject, specifying user parameters through tw_optdef spits out the warning "missing braces around initializer" (via -Wmissing-braces, enabled by -Wall), caused by the TWOPT_END macro not being within braces. I think we had this fix in before, but it may have been lost in the SR transition?
I thought this issue was taken care of when we added C++ support (see this commit).
For which models does this warning occur? I don't see it when building phold or the codes-base tests.
We need to clarify the documentation and remove any deprecated code (the global variable g_tw_rng_seed and the 3rd argument to tw_define_lps).
@JohnPJenkins remembered this quote from @carothersc while discussing 45665b4:
I think we stopped using that parameter since we did not want users picking their own seeds as some seeds are better than others and some are just bad.
So, a little primer here on RNGs. RNGs in fact have only one (hopefully) very, very long stream of random numbers. CLCG4 uses four 32-bit values, or 128 bits. This generator has a period of 2^121, implying that ~2^7 values are missing from the generator.
This is fine.
So, you can imagine a circle with 2^121 tiny points going around the circumference, where each point would represent a random value. Eventually, if you called the generator 2^121 times, you would have made it around the circle.
A very nice property of CLCG4 is that you can jump (or virtually teleport) ahead in that single stream of random numbers. We call it "seed jumping". ROSS starts from the default seed and jumps 2^70 calls ahead in the stream for each LP. That is, each LP's starting point in the random stream circle is 2^72 calls apart. So, in theory, we could support simulations with up to 2^49 LPs, with each LP having its own part of the random stream, uncorrelated from other parts.
Now, to change seed sets, what you really want to do is change where each LP starts on the random stream circle.
You can do this by increasing g_tw_nRNGs_per_lp. So, suppose you use 4. That will give each LP 4 distinct RNG seed sets that start again 2^72 calls apart.
In your model code, you just change which seed set you wish to use -- 0, 1, 2, 3, e.g., lp->rng[0] or lp->rng[1], lp->rng[2] or lp->rng[3].
By default in ROSS, when you use lp->rng you're really saying &(lp->rng[0]). The downside of this approach is you're potentially wasting some memory by having these 4 seed sets around, plus the cost to compute them, but it's a very low cost all around unless you have a model with billions of LPs.
If that is the case, you would have to directly jump or shift the seeds an LP uses by directly invoking the seed jumping routine within the RNG from the LP's init event.
e.g., call tw_rand_initial_seed(&lp->rng[i], lp->gid + offset); where offset would be the number of 2^70 jumps you want to make beyond the LP's global id.
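For readers unfamiliar with seed jumping, the sketch below shows the underlying trick on a single multiplicative LCG: jumping k steps is one multiplication by a^k mod m, computed by square-and-multiply. The parameters are the classic "minimal standard" generator, used here only as a stand-in; CLCG4 combines four such generators with its own moduli and multipliers, and this is not the ROSS implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in multiplicative LCG parameters (not CLCG4's). */
#define MLCG_M 2147483647ULL   /* 2^31 - 1 */
#define MLCG_A 16807ULL

/* One ordinary step: x -> a * x mod m. */
uint64_t mlcg_step(uint64_t x) {
    return (MLCG_A * x) % MLCG_M;
}

/* base^exp mod m by square-and-multiply; operands stay below 2^31,
 * so the 64-bit products cannot overflow. */
static uint64_t pow_mod(uint64_t base, uint64_t exp) {
    uint64_t result = 1;
    base %= MLCG_M;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % MLCG_M;
        base = (base * base) % MLCG_M;
        exp >>= 1;
    }
    return result;
}

/* Jump k steps ahead in O(log k) time: x_{n+k} = a^k * x_n mod m. */
uint64_t mlcg_jump(uint64_t x, uint64_t k) {
    assert(x >= 1 && x < MLCG_M);
    return (pow_mod(MLCG_A, k) * x) % MLCG_M;
}
```

Because the cost grows with log k rather than k, jumps as large as the ones described in the quote are cheap, which is what makes per-LP starting points practical.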
It doesn't look like there are options to include/link in by hand the MPI libraries, meaning we need to use the MPI compiler wrappers to build. Can you include in the build instructions the need to set CMAKE_C_COMPILER (and CMAKE_CXX_COMPILER?) to mpicc/mpicxx?
From "A random number generator based on the combination of four LCGs" by L'Ecuyer and Andres, all seed values must satisfy "1 <= s[0] <= 2147483646, 1 <= s[1] <= 2147483542, 1 <= s[2] <= 2147483422, 1 <= s[3] <= 2147483322."
We need to add checks that maintain this invariant.
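A check along these lines might be enough. The bounds are the ones quoted above; the helper name clcg4_seed_is_valid is hypothetical and not an existing ROSS function, and where it would be called (seed setup, seed jumping, or both) is left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Upper bounds quoted from L'Ecuyer and Andres; each seed component
 * must lie in the inclusive range [1, bound]. */
static const int32_t clcg4_seed_max[4] = {
    2147483646, 2147483542, 2147483422, 2147483322
};

/* Hypothetical helper: true when all four components satisfy the
 * invariant. */
bool clcg4_seed_is_valid(const int32_t s[4]) {
    assert(s != NULL);
    for (int i = 0; i < 4; i++) {
        if (s[i] < 1 || s[i] > clcg4_seed_max[i])
            return false;
    }
    return true;
}
```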
A command-line argument, --no-ross-output or similar, that disables the pre-run (timestamp, ROSS revision) and post-run (statistics) ROSS output would be great to have. That way, it's much easier to handle sims that e.g., dump their own data to stdout. Typically (for me at least), the ROSS output is best for debugging/tuning the simulation under various scenarios - it doesn't need to hang around for the subsequent "production" runs.
A statistic that calculates, on average, how many events are created per event processed. It can be used during model development to understand whether a model has an unstable event population.
ROSS should define an API for model statistics.
For some reason, this isn't working right now.
Right now it's set as a constant (in core/CMakeLists.txt). To change it, the builder needs to edit the CMakeLists.txt file by hand.
I run into this fairly often when scaling up models.
Hi all,
When out of event memory in optimistic mode, the message "WARNING: No free event buffers. Try increasing memory via the --extramem option" is printed to stdout and what looks like memory recollection via forcing a GVT update is attempted (tw-sched.c, lines 177-185). However, if no memory is able to be recollected, then the program seems to enter an infinite loop of checking for free mem and running the GVT. Any way to detect this behavior and terminate the program?
I would love to have a header file or some sort of column title for the ross.csv file that is generated. It gets challenging to use this file after a few runs. Or even a brief description of what the columns are would be very helpful.
It would be a good idea to include the use of CMAKE_INSTALL_PREFIX and make install in the installation instructions, along with the compiler/linker flags needed to compile against ROSS: -I/path/to/ROSS/install/include -L/path/to/ROSS/install/lib -lROSS. Additionally, the "Constructing the Model" and top level README pages should be updated accordingly. Not everyone will want to symlink their code into ROSS to build their models :).
Many times, when developing optimistic models, we are able to determine < LP state, event > pairs which represent infeasible model behavior. These types of simulation states typically arise when time warp causes us to receive and potentially process messages in an order we don't expect.
For example, consider a client/server protocol in which a server sends an ACK to a client upon completion of some event. In optimistic mode, the client can see what amounts to duplicate ACKs from the server due to the server LP rolling back and re-sending an ACK.
While some models can gracefully cope with such issues, more complex models can have troubles (the client in the example could for instance destroy the request metadata after receiving an ACK).
A solution, as noted in the "Dark Side of Risk" paper, is to introduce LP "self-suspend" functionality. If an LP is able to detect a < state, message > pair which is incorrect / unexpected in a well-behaved simulation, the LP should be able to put itself into suspend mode, refusing to process messages until rolled back to a pre < state, message > state. There are two benefits: 1) it greatly reduces the difficulty in tracking down and distinguishing proper model bugs from bugs arising from time-warp related issues such as out-of-order event receipt and 2) it improves simulation performance by pruning the number of processed events that we know are invalid and will be rolled back anyways.
I suggest the function signature tw_suspend(tw_lp *lp, int do_suspend_event_rc, const char * format, ...), with the following semantics:
do_orig_event_rc is nonzero; otherwise, the reverse event handler shall additionally be a no-op. Typically, do_orig_event_rc == 0 is desired, as good coding practices for moderate-or-greater complexity simulations dictate state/event validation prior to modifying LP state (partial rollbacks are very undesirable), but there may be messy logic in the user code for which a partial rollback is warranted (operations that free memory as a side effect of operations, for example).
format string is acceptable for performance purposes, e.g. when doing "production" simulation runs.
I believe this "feature" should be deprecated and removed. PE types are used to do logging and statistics reductions... and only in the propagation, srw, and tlm models.
In splay.c, the function tw_pq_minimum is implemented as "double tw_pq_minimum", but the header has "tw_stime tw_pq_minimum". If any changes are made to the tw_stime variable, this prevents compiling.
Minor issue - encountered when using long double tw_stime to double check my timing in a model I'm working on.
The CODES repos rely on ross-config to provide CFLAGS (really CPPFLAGS) necessary to compile models that are compatible with the ross installation. These cflags seem to always include -g and -Wall (among others) regardless of how ross itself was built.
Ideally these debugging and optimization flags should be left out of ross-config entirely and left up to the project that depends on it, rather than adding possibly conflicting options.
Workaround is to hand edit ross-config after installation to remove unwanted options.
When choosing command line options for ROSS, there are several options that have a logical space in them, such as "gvt interval", "clock rate", "lz4 knob", "avl size" and others. However, the flags are not consistent. Some flags have an underscore, while others have a dash. It would be great if they all used one or the other.
It would be awesome if, given a model, we were able to find optimal batch and gvt interval.
(for general information about pkg-config see http://www.freedesktop.org/wiki/Software/pkg-config/)
ROSS currently installs a ross-config script that allows other projects to pick up the appropriate library and cflag options. pkg-config is the more modern mechanism for doing this, though.
A basic ross.pc file would look something like this, and would be installed in ${prefix}/lib/pkgconfig/ross.pc:
Name: ross
Description: Rensselaer's Optimistic Simulation System
Version: 1.0
URL: https://github.com/carothersc/ROSS
Requires:
Libs: -L/some/dir/lib -lROSS -lm
Cflags: -I/some/dir/include
I'm not exactly sure how to generate a file like this with cmake, but there are probably some examples available in other projects?
In order to grow the ROSS user base, we need a better website/landing page for people interested in our software. It should contain links to publications, both for ROSS core and various ROSS models. We should also keep track of the records that ROSS holds and a history of performance on various machines.
This would simply be a flag without an argument. Currently, ROSS requires arguments for all of its command line argument types. The "value" for this type could simply be an int/bool that is set to 1 if the option is specified and 0 otherwise.
Running optimistic debug mode with a fairly large extramem value does NOT trigger rollbacks.
We need to define/standardize a benchmark suite for ROSS. This will ensure that only changes that improve performance are accepted.
Another thought to document...
By having an unsigned 64-bit integer tw_stime rather than a floating-point one, you can basically eliminate any precision issues while still having a very long time-frame available. For instance, if you work in picoseconds (1e-12 s), the max time you can have is approx 213 days. Bumping that up to nanoseconds would get you approx 213,000 days.
Not sure what the implications are of changing this type - I know for example the RNG is floating point. Though I would think that most uses of tw_stime in ROSS are comparisons, as the users drive event times.
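The range figures can be checked with a few lines of integer arithmetic. The helper below is only a back-of-the-envelope sketch, and it assumes the full unsigned 64-bit range is available for time.

```c
#include <assert.h>
#include <stdint.h>

/* Whole days representable by an unsigned 64-bit tick counter when one
 * second is divided into ticks_per_second ticks. */
uint64_t max_days_u64(uint64_t ticks_per_second) {
    assert(ticks_per_second > 0);
    const uint64_t seconds_per_day = 86400;
    return UINT64_MAX / ticks_per_second / seconds_per_day;
}
```

max_days_u64(1000000000000) gives 213 days for picosecond ticks and max_days_u64(1000000000) gives 213503 days for nanosecond ticks. A signed 64-bit type would halve these figures, to about 106 days at picosecond resolution.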
Currently, the installation instructions are two degrees separated from the code repository (readme links to wiki home, which links to installation). Lazy people such as myself don't get that far :). Maybe a better place for it could be:
I prefer the latter so all instructions are distributed with the code, but either way is fine with me.
Currently in ROSS, dynamic memory management for LP states in optimistic models is strongly discouraged, since it's difficult from the modeler's perspective to know if a piece of memory is truly no longer needed (i.e., rollbacks). A simple function, tw_opt_free(tw_lp *lp, void * data), could be provided that:
Attaches data to the tw_event that the LP is currently executing.
Frees data when a GVT sweeps by that reclaims the event (e.g., the GVT is greater than the event's timestamp).
To make the function more applicable to non-standard allocators, the signature could be extended to support function pointers to functions that free, e.g., tw_opt_free(tw_lp *lp, void (*free_fn)(void *), void *data).
This would be a huge win for model development - currently, I mostly use static memory at the cost of model complexity and just accept leaks in the cases where I malloc within events.
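A standalone sketch of the deferred-free idea follows. It keeps pending frees in a simple linked list keyed by event timestamp instead of attaching them to a tw_event, and every name in it (deferred_free, opt_free_defer, opt_free_collect, opt_free_rollback) is hypothetical rather than ROSS API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record of a free requested by an event at time ts. */
typedef struct deferred_free {
    double ts;                    /* timestamp of the requesting event */
    void (*free_fn)(void *);      /* allocator-specific release function */
    void *data;
    struct deferred_free *next;
} deferred_free;

/* Queue data for release once GVT passes ts. Returns 0 on success. */
int opt_free_defer(deferred_free **head, double ts,
                   void (*free_fn)(void *), void *data) {
    assert(head != NULL && free_fn != NULL);
    deferred_free *d = malloc(sizeof *d);
    if (d == NULL)
        return -1;
    d->ts = ts;
    d->free_fn = free_fn;
    d->data = data;
    d->next = *head;
    *head = d;
    return 0;
}

/* After a GVT computation: release everything with ts < gvt.
 * Returns the number of buffers released. */
int opt_free_collect(deferred_free **head, double gvt) {
    int released = 0;
    while (*head != NULL) {
        deferred_free *d = *head;
        if (d->ts < gvt) {
            *head = d->next;
            d->free_fn(d->data);
            free(d);
            released++;
        } else {
            head = &d->next;
        }
    }
    return released;
}

/* On rollback to rollback_ts: forget frees requested by undone events,
 * leaving their data allocated. Returns the number of records dropped. */
int opt_free_rollback(deferred_free **head, double rollback_ts) {
    int dropped = 0;
    while (*head != NULL) {
        deferred_free *d = *head;
        if (d->ts >= rollback_ts) {
            *head = d->next;
            free(d);              /* drop the record only */
            dropped++;
        } else {
            head = &d->next;
        }
    }
    return dropped;
}
```

The model's forward handler would call opt_free_defer in place of free, so a rollback never has to resurrect memory that is already gone.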
tw_memory is currently disabled by default and it's unclear what level of support there is for it, especially since the SR refactor/merge.
In CODES, our modelnet and local storage models currently piggyback messages on top of modelnet/lsm-specific messages to provide an abstraction over IO/communication. That means the per-event memory we allocate is approx 2x larger than it needs to be! Enabling the passing of event data through tw_memory functionality would greatly improve the memory characteristics and likely performance of our models, not to mention make our current piggybacking mechanism less hairy.
The ROSS tw_lp_settype() could take a user pointer argument and then that could be given as an input argument to the init callback function.
Example: "tw-sched.c:378: [1] Found KP last time 0.000005 > current event time 0.000002 for LP 5[1]". This could be triggered by a bad lookahead value or issuing initial events to non-local LPs. The error message could be more helpful.
include information about the (new) RNG counter
When installed, I get empty strings for every variable (including CC and friends). This is breaking a lot of things - see the recent codes-ross-users emails.
A full example is "node: 0: error: tw-sched.c:306: insufficient event memory". This indicates that you should check if the number of events in your model is growing too quickly or else increase the g_tw_events_per_pe variable.
The error message should describe this or at least print the current limit that has been reached.
ROSS automatically rounds up the size of all messages to either g_tw_event_msg_sz in the normal case or 500 bytes if ROSS_MEMORY is set. We could improve performance for models using multiple LP types by only transmitting the size of each event and handling short receives on the receiving side.
We should document which packages are required for ROSS. The ones that immediately come to mind are:
The g_tw_ts_end variable sets a cap on the global timestamp of the simulation, at which point the simulation will stop running.
There is no clear indication in the output that this has happened however (or I am overlooking it if it is there). This causes confusion if the user expects the simulation to normally end by completing all pending events but it accidentally hits g_tw_ts_end instead.
The default KP-per-PE count is 16. Small runs may not use enough LPs to fill them up. Would be good to generate a warning for this.
With the DUMPI traces and the network models, when the simulation runs out of memory, it throws weird errors that seem to be coming from the model (e.g., assertion failed). However, it's actually the simulation event memory running out: if you use the --extramem option, all these errors go away. The simulation should halt with an explicit 'insufficient memory' error instead.
tw-sched.c, line 180
I noticed in ROSS that we can finally run in serial mode without using the --sync flag - a mistake I have shamefully made many times. Hooray! Currently, it prints a warning to stdout (should be stderr) if you run in serial mode. Perhaps it would make sense to handle sync flags in the following way?
--sync -> no warning, run in serial
--sync=2,3 -> warning + run in serial. Indicating a parallel mode without the goods to run in parallel seems erroneous to me. This already happens for --sync=3, but not for --sync=2 (e.g., lookahead checking still occurs)
--sync=1,4 -> error (already happens)
The following use case has come up for me a few times in the past in different contexts, so I thought I'd document the example here and discuss potential API solutions. In general, the problem I'm looking at is the issuing and recalling of speculative events that you may or may not want to execute depending on factors you don't know ahead of time.
Let's say I'm doing some sort of heartbeat protocol in which, if I don't hear from some process in x units of time, I consider that process dead. So, in the heartbeat process I issue an event representing a timeout for x units in the future for each process in the protocol. Now, the DES logic to handle the case where I do hear back from the server is a bit awkward - I need to process the event, hold onto the time the process responded to me and reissue timeout events for the new time, and in the timeout event check against that time and reissue if necessary. If x is not a tight bound w.r.t. the rate of contact, then the event queue gets filled up with a bunch of effectively dead events. A cleaner solution to this would involve being able to do some simple manipulations to the event queue. Namely, one (or more) of the following:
tw_event_cancel(tw_event *e); // removes previously sent event e from the queue
tw_event_reschedule(tw_event *e, tw_stime new_offset); // reissues event e for time new_offset in the future
With this change, the logic for resetting the heartbeat can reside with the logic that causes the resetting, without upping the event population significantly and having to process defunct events.
Hi all,
Using this as an excuse to learn to file an issue :).
I reached for tw_rand_integer for the first time for some testing code and it wasn't clear whether the arguments are inclusive or exclusive. For example, tw_rand_integer(rng, 0, 3) could either be in the range [0,3] or [0,3). Looking at the source, I believe it is the former given the use of safe_high in the implementation, though that would assume that tw_rand_unif returns values in the range [0.0, 1.0). Similar boundary condition-related questions arise for other functions in the tw_rand family.
Could you add some documentation to tw_random.h describing boundary conditions, special cases (don't use [U]LONG_MAX for high), and other gotchas? Thanks!
:)
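For reference, one common way to get an inclusive [low, high] result from a uniform draw in [0.0, 1.0) is sketched below. This only illustrates the boundary behavior being asked about; it is not the ROSS tw_rand_integer source, and uniform_to_integer is a hypothetical name.

```c
#include <assert.h>

/* Map u in [0.0, 1.0) to an integer in the inclusive range [low, high]. */
long uniform_to_integer(double u, long low, long high) {
    assert(u >= 0.0 && u < 1.0);
    assert(low <= high);
    /* The span computation can overflow for very wide ranges, e.g.
     * low = 0 and high = LONG_MAX, which is one reason to keep high
     * well below LONG_MAX. */
    long span = high - low + 1;
    /* Truncation toward zero acts as floor here because u * span >= 0. */
    return low + (long)(u * (double)span);
}
```

Under this mapping, uniform_to_integer(u, 0, 3) can return 0, 1, 2, or 3. If the uniform draw could return exactly 1.0, the result could exceed high, which is why the range of the uniform function matters for the documentation.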