ross-org / ross

Rensselaer's Optimistic Simulation System

Home Page: http://ross-org.github.io

License: BSD 3-Clause "New" or "Revised" License

CMake 11.45% C 88.22% Shell 0.33%

c discrete-event-simulation parallel-distributed-computing pdes simulation simulation-framework

ross's Issues

report the number of random number generations (minus rollbacks)

It might be helpful in ROSS to add a statistical counter that is incremented every time a random number is generated and decremented every time a random number is reversed. That way a user could compare runs and confirm that the net number of random numbers generated is the same across different synchronization modes.
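A minimal sketch of such a counter, using a toy reversible generator rather than the ROSS RNG. The names counted_rng, counted_rng_next and counted_rng_reverse are hypothetical and only illustrate the bookkeeping:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy reversible generator; a stand-in, not the ROSS RNG. */
typedef struct {
    uint32_t state;     /* current generator state */
    long     net_calls; /* forward draws minus reversed draws */
} counted_rng;

#define TOY_STEP 0x9E3779B9u /* arbitrary odd increment, chosen for illustration */

/* Draw one number and count it. */
static uint32_t counted_rng_next(counted_rng *r)
{
    assert(r != NULL);
    r->state += TOY_STEP; /* unsigned wraparound makes this exactly undoable */
    r->net_calls++;
    return r->state;
}

/* Undo one draw during rollback and uncount it. */
static void counted_rng_reverse(counted_rng *r)
{
    assert(r != NULL);
    r->state -= TOY_STEP;
    r->net_calls--;
}
```

If net_calls differs between a sequential run and an optimistic run of the same model, some reverse handler is probably not undoing every draw its forward handler made.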

defer (or remove?) setting of g_tw_rng_max

Currently, g_tw_rng_max is set in the call chain of tw_init based on the value of g_tw_nRNG_per_lp. Later on, it's used in tw_rand_init_streams in the offset calculation and also as a (unnecessary?) check on the number of RNGs asked for. This makes it awkward for individual models or libraries (e.g., CODES) to specify the use of multiple RNGs (ideally, we'd want to insulate main from having to set it manually for everything to work).

I think the "right" solution is to remove g_tw_rng_max entirely - at this point, it seems only to be used as an artificial upper bound during initialization that's in all likelihood the value 1. tw_define_lps is expected to be called exactly once AFAICT so g_tw_nRNG_per_lp is guaranteed to be the same value for all LPs on the PE.

If there is a case I'm missing, a simple fix that would get around these issues is to set g_tw_rng_max within the call to tw_define_lps, which allows some space for LP-specific initialization code to be run. For example, programs written using CODES have an LP "registration" phase before tw_define_lps is called to set up PE->LP mappings.

Optimistic remote event data structure

Currently we use either a hash table or an AVL tree for remote event tracking during optimistic simulation. However, the AVL tree is hacked in underneath the hash table implementation. We need to do some clean up of the API, so that we can try out different data structures here.

Lossy floating-point ops in example models

((a + b) - a) is not necessarily equal to b in floating-point arithmetic. There are a number of such issues within ross/models when looking at reverse computation (for example, the waiting_time variable in the airport model). Floating-point values should typically be saved within the messages before being changed so they can be rolled back to a bit-wise identical value.
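To make the difference concrete, here is a small sketch contrasting the two reverse strategies. The toy_state and toy_msg types are hypothetical stand-ins for an LP state and an event message, not code from the airport model:

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double waiting_time; } toy_state;     /* hypothetical LP state */
typedef struct { double saved_waiting_time; } toy_msg; /* hypothetical event message */

/* Forward handler: save the old value in the message, then modify the state. */
static void forward(toy_state *s, toy_msg *m, double delta)
{
    assert(s != NULL && m != NULL);
    m->saved_waiting_time = s->waiting_time;
    s->waiting_time += delta;
}

/* Exact reverse: restore the saved value, bit for bit. */
static void reverse(toy_state *s, const toy_msg *m)
{
    assert(s != NULL && m != NULL);
    s->waiting_time = m->saved_waiting_time;
}

/* Lossy reverse: subtracting delta may not recover the original value. */
static void reverse_lossy(toy_state *s, double delta)
{
    assert(s != NULL);
    s->waiting_time -= delta;
}
```

With IEEE 754 doubles, starting from 1.0 and adding 1e17 absorbs the 1.0 entirely, so reverse_lossy leaves 0.0 in the state while reverse restores 1.0.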

Terminate simulation after failed allocations

Copied from #22:

I would love for there to be an option to kill the simulation after N consecutive failed "allocation" attempts. E.g., last evening before I left I set up a sweep of ~700 simulations, the 10th of which entered an infinite "buffers not available" loop I then killed this morning. A "g_tw_max_consec_nobuf_ct" or something (defaulting to ULONG_MAX or similar) would do the trick, along with a bit of extra logic to track and reset the failure count accordingly. Basically, an "opt-in" workaround for issue #6.
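The "extra logic to track and reset the failure count" could look roughly like the sketch below. The nobuf_guard type and nobuf_guard_record function are hypothetical, and max_consec plays the role of the proposed g_tw_max_consec_nobuf_ct:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned long max_consec; /* abort threshold; ULONG_MAX disables the check */
    unsigned long consec;     /* consecutive failed allocation attempts so far */
} nobuf_guard;

/* Record one allocation attempt; returns true when the run should be aborted. */
static bool nobuf_guard_record(nobuf_guard *g, bool alloc_ok)
{
    assert(g != NULL);
    if (alloc_ok) {
        g->consec = 0; /* any success resets the streak */
        return false;
    }
    if (g->consec < ULONG_MAX)
        g->consec++;
    return g->max_consec != ULONG_MAX && g->consec >= g->max_consec;
}
```

Defaulting max_consec to ULONG_MAX keeps today's behavior, so the check stays opt-in.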

support for "long double" tw_stime

Forgot to make this an issue...

For simulations running over a long period of time with small time deltas (e.g., a small g_tw_lookahead or small random noise to prevent event ties), we can run into floating-point precision limitations - the magnitude of time deltas can be (mostly) lost and in the case of conservative mode, the lookahead check can fail. To fix that (or at least kick the can down the road by a few orders of magnitude), it would be helpful to have some mechanism in ROSS to compile with "long double" support (really, just make the tw_stime datatype a cmake-time option). This involves:

fixing printfs
cleaning up type violations (some places in ROSS directly use a double in place of tw_stime)
making sure alignment assumptions don't break (unlikely?)
updating CMakeLists to specify which type to use for tw_stime (placing the type as a macro in config.h?)
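One way the type switch and the printf fixes could fit together is a typedef paired with a format macro. This is only a sketch: TOY_STIME_LONG_DOUBLE, toy_stime, TOY_STIME_FMT and format_stime are hypothetical names, with the macro standing in for whatever config.h would define:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#ifdef TOY_STIME_LONG_DOUBLE /* hypothetical build-time switch */
typedef long double toy_stime;
#define TOY_STIME_FMT "Lf"
#else
typedef double toy_stime;
#define TOY_STIME_FMT "f"
#endif

/* Format a timestamp with six decimals, whichever type is selected. */
static int format_stime(char *buf, size_t n, toy_stime t)
{
    assert(buf != NULL && n > 0);
    return snprintf(buf, n, "%.6" TOY_STIME_FMT, t);
}
```

Routing every print through one format macro means the conversion specifier changes together with the type.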

Check for GetGitRevision module

CMake should check for the GetGitRevision module before it errors out (and just issue a warning instead).

Warning: Missing braces around initializer

From @JohnPJenkins in #59:

While we're on the subject, specifying user parameters through tw_optdef spits out the warning "missing braces around initializer" (via -Wmissing-braces, enabled by -Wall), caused by the TWOPT_END macro not being within braces. I think we had this fix in before, but it may have been lost in the SR transition?

I thought this issue was taken care of when we added C++ support (see this commit).

For which models does this warning occur? I don't see it when building phold or the codes-base tests.

RNG Seeding

We need to clarify the documentation and remove any deprecated code (global variable g_tw_rng_seed and the 3rd argument to tw_define_lps).

@JohnPJenkins remembered this quote from @carothersc while discussing 45665b4:

I think we stopped using that parameter since we did not want users picking
their own seeds as some seeds are better than others and some are just bad.

So, a little primer here on RNGs. RNGs in fact have only one (hopefully) very, very
long stream of random numbers. CLCG4 uses four 32-bit values, or 128 bits.
This generator has a period of 2^121, implying that ~2^7 values are missing
from the generator. This is fine.

So, you can imagine a circle with 2^121 tiny points going around the circumference
where each point would represent a random value. Eventually, if you called
the generator 2^121 times, you would have made it around the circle.

A very nice property of CLCG4 is that you can jump (or virtually teleport) ahead
in that single stream of random numbers. We call it "seed jumping". ROSS
starts from the default seed and jumps 2^70 calls ahead in the stream for each
LP. That is, each LP's starting point in the random stream circle is 2^72 calls
apart. So, in theory, we could support simulation with up to 2^49 LPs and each
LP would have its own part of the random stream, uncorrelated from other parts.

Now, to change seed sets, what you really want to do is change where each LP
starts on the random stream circle.

You can do this by increasing g_tw_nRNG_per_lp. So, suppose you use 4.
That will give each LP 4 distinct RNG seed sets that start again 2^72 calls
apart.

In your model code, you just change which seed set you wish to use -- 0, 1, 2, 3
e.g., lp->rng[0] or lp->rng[1], lp->rng[2] or lp->rng[3].

By default in ROSS, when you use lp->rng you're really saying &(lp->rng[0]).

The downside of this approach is you're potentially wasting some memory by
having these 4 seed sets around and the cost to compute them but it's a very
low cost all around unless you have a model with billions of LPs.

If that is the case, you would have to directly jump or shift seeds an LP uses
by directly invoking the seed jumping routine within the RNG from the LP's
init event.

e.g., call, tw_rand_initial_seed(&lp->rng[i], lp->gid + offset);
where offset would be the number of 2^70 jumps you want to make beyond the
LP's global id.
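The "seed jumping" idea can be illustrated on a single multiplicative LCG: advancing k steps is the same as multiplying the state by a^k mod m, which modular exponentiation computes in O(log k) steps. The sketch below uses the well-known MINSTD parameters (a = 16807, m = 2^31 - 1) purely for illustration. It is not the CLCG4 code in ROSS, which combines four such generators, and the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_M 2147483647ULL /* 2^31 - 1 */
#define TOY_A 16807ULL      /* MINSTD multiplier, used here for illustration */

/* One ordinary step of the generator. */
static uint64_t mlcg_step(uint64_t x)
{
    return (TOY_A * x) % TOY_M;
}

/* base^exp mod mod by repeated squaring; operands stay below 2^31, so no overflow. */
static uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t mod)
{
    assert(mod > 0);
    uint64_t result = 1 % mod;
    base %= mod;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % mod;
        base = (base * base) % mod;
        exp >>= 1;
    }
    return result;
}

/* Jump k steps ahead without generating the intermediate values. */
static uint64_t mlcg_jump(uint64_t x, uint64_t k)
{
    return (mod_pow(TOY_A, k, TOY_M) * (x % TOY_M)) % TOY_M;
}
```

Calling mlcg_jump(x, 1000) yields the same state as applying mlcg_step 1000 times, which is what makes widely spaced per-LP starting points cheap to compute.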

Add note about using the MPI compiler

It doesn't look like there are options to include/link in by hand the MPI libraries, meaning we need to use the MPI compiler wrappers to build. Can you include in the build instructions the need to set CMAKE_C_COMPILER (and CMAKE_CXX_COMPILER?) to mpicc/mpicxx?

Add checks for RNG seed values

From "A random number generator based on the combination of four LCGs" by L'Ecuyer and Andres, all seed values must satisfy "1 <= s[0] <= 2147483646, 1 <= s[1] <= 2147483542, 1 <= s[2] <= 2147483422, 1 <= s[3] <= 2147483322."

We need to add checks that maintain this invariant.
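A check along these lines might be enough. The bounds are the ones quoted above; the function name clcg4_seed_valid and the use of int32_t for the seed words are assumptions for the sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Upper bounds for s[0]..s[3], as quoted from L'Ecuyer and Andres. */
static const int32_t clcg4_seed_max[4] = {
    2147483646, 2147483542, 2147483422, 2147483322
};

/* Returns true if every seed word satisfies 1 <= s[i] <= clcg4_seed_max[i]. */
static bool clcg4_seed_valid(const int32_t s[4])
{
    assert(s != NULL);
    for (int i = 0; i < 4; i++) {
        if (s[i] < 1 || s[i] > clcg4_seed_max[i])
            return false;
    }
    return true;
}
```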

Disable ROSS output

A command-line argument, --no-ross-output or similar, that disables the pre-run (timestamp, ROSS revision) and post-run (statistics) ROSS output would be great to have. That way, it's much easier to handle sims that e.g., dump their own data to stdout. Typically (for me at least), the ROSS output is best for debugging/tuning the simulation under various scenarios - it doesn't need to hang around for the subsequent "production" runs.

Model Event-Efficiency Statistic

A statistic that calculates, on average, how many events are created per event processed. It can be used during model development to understand if a model has an unstable event population.

statistics API

ROSS should define an API for model statistics.

Add lookahead parameter for ROSS command-line

ROSS C code should compile cleanly in a C++ context

For some reason, this isn't working right now.

Make AVL_TREE_SIZE a user-visible cmake option

Right now it's set as a constant (in core/CMakeLists.txt). To change it, the builder needs to edit the CMakeLists.txt file by hand.

I run into this fairly often when scaling up models.

ROSS hangs when out of event memory

Hi all,

When out of event memory in optimistic mode, the message "WARNING: No free event buffers. Try increasing memory via the --extramem option" is printed to stdout and what looks like memory recollection via forcing a GVT update is attempted (tw-sched.c, lines 177-185). However, if no memory is able to be recollected, then the program seems to enter an infinite loop of checking for free mem and running the GVT. Any way to detect this behavior and terminate the program?

ross.csv documentation

I would love to have a header file or some sort of column title for the ross.csv file that is generated. It gets challenging to use this file after a few runs. Or even a brief description of what the columns are would be very helpful.

Installation notes tweak

It would be a good idea to include the use of CMAKE_INSTALL_PREFIX and make install into the installation instructions, and the compiler/linker flags needed to compile against ROSS: -I/path/to/ROSS/install/include -L/path/to/ROSS/install/lib -lROSS. Additionally, the "Constructing the Model" and top level README pages should be updated accordingly. Not everyone will want to symlink their code into ROSS to build their models :).

Enhancement request: LP self-suspend functionality

Many times, when developing optimistic models, we are able to determine < LP state, event > pairs which represent infeasible model behavior. These types of simulation states typically arise when time warp causes us to receive and potentially process messages in an order we don't expect.

For example, consider a client/server protocol in which a server sends an ACK to a client upon completion of some event. In optimistic mode, the client can see what amounts to duplicate ACKs from the server due to the server LP rolling back and re-sending an ACK.

While some models can gracefully cope with such issues, more complex models can have troubles (the client in the example could for instance destroy the request metadata after receiving an ACK).

A solution, as noted in the "Dark Side of Risk" paper, is to introduce LP "self-suspend" functionality. If an LP is able to detect a < state, message > pair which is incorrect / unexpected in a well-behaved simulation, the LP should be able to put itself into suspend mode, refusing to process messages until rolled back to a pre < state, message > state. There are two benefits: 1) it greatly reduces the difficulty in tracking down and distinguishing proper model bugs from bugs arising from time-warp related issues such as out-of-order event receipt and 2) it improves simulation performance by pruning the number of processed events that we know are invalid and will be rolled back anyways.

I suggest the function signature tw_suspend(tw_lp *lp, int do_orig_event_rc, const char * format, ...), with the following semantics:

After a call to tw_suspend, all subsequent events (both forward and reverse) that arrive at the suspended LP shall be processed as if they were no-ops. The reverse event handler of the event that caused the suspend will be run if do_orig_event_rc is nonzero; otherwise, the reverse event handler shall additionally be a no-op. Typically, do_orig_event_rc == 0 is desired, as good coding practices for moderate-or-greater complexity simulations dictate state/event validation prior to modifying LP state (partial rollbacks are very undesirable), but there may be messy logic in the user code for which a partial rollback is warranted (operations that free memory as a side effect of operations, for example).
An LP exits suspend state upon rolling back the event that caused the suspend (whether or not that event is processed as a no-op).
Upon GVT, if an LP is in self-suspend mode and the event that caused the suspend has a timestamp less than that of GVT, then the simulator shall report the format string of suspended LP(s) and exit.
A NULL format string is acceptable for performance purposes, e.g. when doing "production" simulation runs.
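The semantics above can be sketched as a small state machine. Everything here is hypothetical (toy_lp and the toy_* functions are not ROSS APIs), and identifying the triggering event by an integer id is an assumption made for the sketch:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int           suspended;        /* nonzero while the LP is suspended */
    unsigned long cause_id;         /* id of the event that triggered the suspend */
    double        cause_ts;         /* timestamp of that event */
    int           do_orig_event_rc; /* run the cause's reverse handler on rollback? */
} toy_lp;

/* Enter suspend mode from within the handler of event `event_id`. */
static void toy_suspend(toy_lp *lp, unsigned long event_id, double ts,
                        int do_orig_event_rc)
{
    assert(lp != NULL && !lp->suspended);
    lp->suspended = 1;
    lp->cause_id = event_id;
    lp->cause_ts = ts;
    lp->do_orig_event_rc = do_orig_event_rc;
}

/* Forward path: should the user's event handler run? */
static int toy_should_process(const toy_lp *lp)
{
    assert(lp != NULL);
    return !lp->suspended;
}

/* Reverse path: returns 1 if the user's reverse handler should run. */
static int toy_on_reverse(toy_lp *lp, unsigned long event_id)
{
    assert(lp != NULL);
    if (!lp->suspended)
        return 1;
    if (event_id != lp->cause_id)
        return 0;          /* was a no-op going forward, so a no-op in reverse */
    lp->suspended = 0;     /* rolling back the cause ends the suspend */
    return lp->do_orig_event_rc;
}

/* GVT check: nonzero means the suspend can no longer be rolled back. */
static int toy_gvt_violation(const toy_lp *lp, double gvt)
{
    assert(lp != NULL);
    return lp->suspended && lp->cause_ts < gvt;
}
```

The sketch relies on rollbacks undoing events in reverse order, so any event reversed while the LP is still suspended must have arrived after the suspend.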

pe types

I believe this "feature" should be deprecated and removed. PE types are used to do logging and statistics reductions... and only in the propagation, srw, and tlm models.

tw_pq_minimum function implemented with double, but declared as tw_stime

In splay.c, the function tw_pq_minimum is implemented as "double tw_pq_minimum", but the header has "tw_stime tw_pq_minimum". If any changes are made to the tw_stime variable, this prevents compiling.

Minor issue - encountered when using long double tw_stime to double check my timing in a model I'm working on.

ross-config always sets -g and -Wall in CFLAGS

The CODES repos rely on ross-config to provide CFLAGS (really CPPFLAGS) necessary to compile models that are compatible with the ross installation. These cflags seem to always include -g and -Wall (among others) regardless of how ross itself was built.

Ideally these debugging and optimization flags should be left out of ross-config entirely and left up to the project that depends on it, rather than adding possibly conflicting options.

Workaround is to hand edit ross-config after installation to remove unwanted options.

Standardize command-line parameters

When choosing command line options for ROSS, there are several options that have a logical space in them, such as "gvt interval", "clock rate", "lz4 knob", "avl size" and others. However, the flags are not consistent. Some flags have an underscore, while others have a dash. It would be great if they all used one or the other.

Add a parameter sweep mode

It would be awesome if, given a model, we were able to find optimal batch and gvt interval.

add pkg-config support (install a ross.pc file)

(for general information about pkg-config see http://www.freedesktop.org/wiki/Software/pkg-config/)

ROSS currently installs a ross-config script that allows other projects to pick up the appropriate library and cflag options. pkg-config is the more modern mechanism for doing this, though.

A basic ross.pc file would look something like this, and would be installed in ${prefix}/lib/pkgconfig/ross.pc:

Name: ross
Description: Rensselaer's Optimistic Simulation System
Version: 1.0
URL: https://github.com/carothersc/ROSS
Requires:
Libs: -L/some/dir/lib -lROSS -lm
Cflags: -I/some/dir/include

I'm not exactly sure how to generate a file like this with cmake, but there are probably some examples available in other projects?

Website with Details

In order to grow the ROSS user base, we need a better website/landing page for people interested in our software. It should contain links to publications, both for ROSS core and various ROSS models. We should also keep track of the records that ROSS holds and a history of performance on various machines.

Enhancement request: TWOPT_FLAG addition to tw_optdef types

This would simply be a flag without an argument. Currently, ROSS requires arguments for all of its command line argument types. The "value" for this type could simply be an int/bool that is set to 1 if the option is specified and 0 otherwise.
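As a rough illustration of the requested behavior, independent of the actual tw_optdef machinery, a valueless flag could be detected like this. The parse_flag helper and the "--verbose" option name are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if "--<name>" appears among the arguments, 0 otherwise. */
static int parse_flag(int argc, const char *const argv[], const char *name)
{
    assert(argv != NULL && name != NULL);
    for (int i = 1; i < argc; i++) {
        const char *arg = argv[i];
        /* Match the whole argument, so "--name=value" does not count as a flag. */
        if (strncmp(arg, "--", 2) == 0 && strcmp(arg + 2, name) == 0)
            return 1;
    }
    return 0;
}
```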

synch=4 and extramem are not friends

Running optimistic debug mode with a fairly large extramem value does NOT trigger rollbacks.

Benchmarking

We need to define/standardize a benchmark suite for ROSS. This will ensure that only changes that improve performance are accepted.

integer tw_stime

Another thought to document...

By having a 64-bit integer tw_stime rather than a floating-point, you can basically eliminate any precision issues while still having a very long time-frame available. For instance, if you work in picoseconds (1e-12), the max time you can have is approx 213 days. Bumping that up to nanoseconds would get you 213,000 days.

Not sure what the implications are of changing this type - I know for example the RNG is floating point. Though I would think that most uses of tw_stime in ROSS are comparisons, as the users drive event times.
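The range figures above can be checked with a few lines of arithmetic. The sketch assumes an unsigned 64-bit tick counter, which is what the 213-day figure corresponds to (a signed counter would give half that); max_days is a hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>

/* Days representable by an unsigned 64-bit tick counter at the given resolution. */
static double max_days(double ticks_per_second)
{
    assert(ticks_per_second > 0.0);
    const double seconds_per_day = 86400.0;
    return (double)UINT64_MAX / ticks_per_second / seconds_per_day;
}
```

max_days(1e12) is a little over 213 days for picosecond ticks, and max_days(1e9) a little over 213,000 days for nanosecond ticks.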

Make installation instructions more prominent

Currently, the installation instructions are two degrees separated from the code repository (readme links to wiki home, which links to installation). Lazy people such as myself don't get that far :). Maybe a better place for it could be:

a direct link from the readme file?
in an INSTALL file at the top-level code directory?

I prefer the latter so all instructions are distributed with the code, but either way is fine with me.

Enhancement request: optimistic-aware free lists

Currently in ROSS, dynamic memory management for LP states in optimistic models is strongly discouraged, since it's difficult from the modeler's perspective to know if a piece of memory is truly no longer needed (i.e., rollbacks). A simple function, tw_opt_free(tw_lp *lp, void * data) could be provided that:

attaches data to the tw_event that the LP is currently executing.
frees data when a GVT sweeps by that reclaims the event (e.g., the GVT is greater than the event's timestamp).

To make the function more applicable to non-standard allocators, the signature could be extended to support function pointers to functions that free, e.g., tw_opt_free(tw_lp *lp, void (*free_fn)(void *), void *data).
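A sketch of the bookkeeping such a function might need is shown below. It keeps a standalone list rather than attaching to a tw_event, and all names (pending_free, opt_free_defer, opt_free_cancel, opt_free_collect) are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct pending_free {
    double ts;                 /* timestamp of the event that requested the free */
    void (*free_fn)(void *);   /* deallocator to call once the event is committed */
    void *data;
    struct pending_free *next;
} pending_free;

/* Defer freeing `data` until GVT passes `ts`. Returns 0 on success, -1 on failure. */
static int opt_free_defer(pending_free **head, double ts,
                          void (*free_fn)(void *), void *data)
{
    assert(head != NULL && free_fn != NULL);
    pending_free *p = malloc(sizeof *p);
    if (p == NULL)
        return -1;
    p->ts = ts;
    p->free_fn = free_fn;
    p->data = data;
    p->next = *head;
    *head = p;
    return 0;
}

/* Rollback: forget the request for `data` and keep the data alive. Returns 1 if found. */
static int opt_free_cancel(pending_free **head, const void *data)
{
    assert(head != NULL);
    for (; *head != NULL; head = &(*head)->next) {
        if ((*head)->data == data) {
            pending_free *p = *head;
            *head = p->next;
            free(p);
            return 1;
        }
    }
    return 0;
}

/* GVT sweep: release everything requested before `gvt`. Returns the count freed. */
static int opt_free_collect(pending_free **head, double gvt)
{
    assert(head != NULL);
    int freed = 0;
    while (*head != NULL) {
        pending_free *p = *head;
        if (p->ts < gvt) {
            *head = p->next;
            p->free_fn(p->data);
            free(p);
            freed++;
        } else {
            head = &p->next;
        }
    }
    return freed;
}
```

The key property is that memory is only released once no rollback can reach the requesting event.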

This would be a huge win for model development - currently, I mostly use static memory at the cost of model complexity and just accept leaks in the cases where I malloc within events.

Resurrect tw_memory?

tw_memory is currently disabled by default and it's unclear what level of support there is for it, especially since the SR refactor/merge.

In CODES, our modelnet and local storage models currently piggyback messages on top of modelnet/lsm-specific messages to provide an abstraction over IO/communication. That means the per-event memory we allocate is approx 2x larger than it needs to be! Enabling the passing of event data through tw_memory functionality would greatly improve the memory characteristics and likely performance of our models, not to mention make our current piggybacking mechanism less hairy.

Enhancement request: ROSS initialization callbacks should allow a user argument

The ROSS tw_lp_settype() could take a user pointer argument and then that could be given as an input argument to the init callback function.

"Found KP last time > current event time" error is unclear

Example: "tw-sched.c:378: [1] Found KP last time 0.000005 > current event time 0.000002 for LP 5[1]". This could be triggered by a bad lookahead value or issuing initial events to non-local LPs. The error message could be more helpful.

update Random Number wiki page

include information about the (new) RNG counter

ross-config is broken

When installed, I get empty strings for every variable (including CC and friends). This is breaking a lot of things - see the recent codes-ross-users emails.

"insufficient event memory" error is unclear

A full example is "node: 0: error: tw-sched.c:306: insufficient event memory". This indicates that you should check if the number of events in your model is growing too quickly or else increase the g_tw_events_per_pe variable.

The error message should describe this or at least print the current limit that has been reached.

avoid rounding up message sizes in ROSS

ROSS automatically rounds up the size of all messages to either g_tw_event_msg_sz in the normal case or 500 bytes if ROSS_MEMORY is set. We could improve performance for models using multiple LP types by only transmitting the size of each event and handling short receives on the receiving side.

Requirements Documentation

We should document which packages are required for ROSS. The ones that immediately come to mind are:

CMake
MPI (mpich recommended)
anything else?

ROSS doesn't clearly indicate when g_tw_ts_end has been hit

The g_tw_ts_end variable sets a cap on the global timestamp of the simulation, at which point the simulation will stop running.

There is no clear indication in the output that this has happened however (or I am overlooking it if it is there). This causes confusion if the user expects the simulation to normally end by completing all pending events but it accidentally hits g_tw_ts_end instead.

KP count warning

The default KP-per-PE count is 16. Small runs may not use enough LPs to fill them up. Would be good to generate a warning for this.

ROSS not indicating insufficient memory when the simulation runs out of memory

With the DUMPI traces and the network models, when the simulation runs out of memory, it throws weird errors that seem to be coming from the model (e.g., failed assertions). However, it's actually the simulation event memory running out: if you use the --extramem option, all these errors go away. The simulation should halt with an explicit 'insufficient memory' error instead.

Trivial: print "No free event buffers" warning to stderr instead of stdout

tw-sched.c, line 180

sync protocol warnings and PE counts

I noticed in ROSS that we can finally run in serial mode without using the --sync flag - a mistake I have shamefully made many times. Hooray! Currently, it prints to stdout (should be stderr) a warning if you run in serial mode. Perhaps it would make sense to handle sync flags in the following way?

one PE, no --sync -> no warning, run in serial
one PE, --sync=2,3 -> warning + run in serial. Indicating a parallel mode without the goods to run in parallel seems erroneous to me. This already happens for --sync=3, but not for --sync=2 (e.g., lookahead checking still occurs)
multiple PE, --sync=1,4 -> error (already happens)
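The three rules above could be encoded as a small decision function. This is only a sketch of the proposal: the enum, the function name, and the convention that sync == 0 means "no --sync flag given" are assumptions, and combinations the proposal does not mention are passed through unchanged:

```c
#include <assert.h>

typedef enum {
    SYNC_RUN_SERIAL,      /* run serially, no warning */
    SYNC_RUN_SERIAL_WARN, /* run serially and print a warning */
    SYNC_ERROR,           /* refuse to run */
    SYNC_KEEP_REQUESTED   /* not covered by the proposal; keep current behavior */
} sync_decision;

/* sync == 0 means --sync was not given; 1..4 are the modes named in the proposal. */
static sync_decision resolve_sync(int num_pes, int sync)
{
    assert(num_pes >= 1 && sync >= 0);
    if (num_pes == 1) {
        if (sync == 0)
            return SYNC_RUN_SERIAL;
        if (sync == 2 || sync == 3)
            return SYNC_RUN_SERIAL_WARN;
    } else if (sync == 1 || sync == 4) {
        return SYNC_ERROR;
    }
    return SYNC_KEEP_REQUESTED;
}
```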

Enhancement request: event cancellation/deferral/delay

The following use case has come up for me a few times in the past in different contexts, so I thought I'd document the example here and discuss potential API solutions. In general, the problem I'm looking at is the issuing and recalling of speculative events that you may or may not want to execute depending on factors you don't know ahead of time.

Let's say I'm doing some sort of heartbeat protocol in which, if I don't hear from some process in x units of time, I consider that process dead. So, in the heartbeat process I issue an event representing a timeout for x units in the future for each process in the protocol. Now, the DES logic to handle the case where I do hear back from the server is a bit awkward - I need to process the event, hold onto the time the process responded to me and reissue timeout events for the new time, and in the timeout event check against that time and reissue if necessary. If x is not a tight bound w.r.t. the rate of contact, then the event queue gets filled up with a bunch of effectively dead events. A cleaner solution to this would involve being able to do some simple manipulations to the event queue. Namely, one (or more) of the following:

tw_event_cancel(tw_event *e); // removes previously sent event e from the queue
tw_event_reschedule(tw_event *e, tw_stime new_offset); // reissues event e for time new_offset in the future

With this change, the logic for resetting the heartbeat can reside with the logic that causes the resetting, without upping the event population significantly and having to process defunct events.

tw_rand_* documentation

Hi all,

Using this as an excuse to learn to file an issue :).

I reached for tw_rand_integer for the first time for some testing code and it wasn't clear whether the arguments are inclusive or exclusive. For example, tw_rand_integer(rng, 0, 3) could either be in the range [0,3] or [0,3). Looking at the source, I believe it is the former given the use of safe_high in the implementation, though that would assume that tw_rand_unif returns values in the range [0.0, 1.0). Similar boundary condition-related questions arise for other functions in the tw_rand family.

Could you add some documentation to tw_random.h describing boundary conditions, special cases (don't use [U]LONG_MAX for high), and other gotchas? Thanks!
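For reference while documenting, this is one common way to map a uniform value in [0.0, 1.0) onto an inclusive integer range. It is a sketch under that assumption, not the tw_rand_integer implementation, and map_unif_to_range is a hypothetical name:

```c
#include <assert.h>

/* Map u in [0.0, 1.0) to an integer in the inclusive range [low, high]. */
static long map_unif_to_range(double u, long low, long high)
{
    assert(u >= 0.0 && u < 1.0 && low <= high);
    /* Computed in double so that high - low + 1 cannot overflow a long. */
    double span = (double)high - (double)low + 1.0;
    long result = low + (long)(u * span);
    /* Guard against rounding pushing the product up to span itself. */
    return result > high ? high : result;
}
```

With low = 0 and high = 3, each of the four values 0..3 covers a quarter of the unit interval. For very wide ranges a double cannot represent span exactly, which is one reason extreme bounds such as LONG_MAX are best avoided.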
