Giter Club home page Giter Club logo

Comments (10)

nikomatsakis avatar nikomatsakis commented on August 28, 2024 1

So I think what we need to do here...

The current strategy when executing a query is that we lock the map and we swap in a placeholder during execution. That is done here, and the value old_memo contains the value we swapped out from the map:

salsa/src/derived.rs

Lines 248 to 264 in 4a6a626

let mut old_memo = match self.probe(
self.map.upgradable_read(),
runtime,
revision_now,
descriptor,
key,
) {
ProbeState::UpToDate(v) => return v,
ProbeState::StaleOrAbsent(map) => {
let mut map = RwLockUpgradableReadGuard::upgrade(map);
match map.insert(key.clone(), QueryState::in_progress(runtime.id())) {
Some(QueryState::Memoized(old_memo)) => Some(old_memo),
Some(QueryState::InProgress { .. }) => unreachable!(),
None => None,
}
}
};

Under ordinary circumstances, we eventually invoke overwrite_placeholder to remove the placeholder and replace it with a fresh value:

salsa/src/derived.rs

Lines 456 to 466 in 4a6a626

/// Overwrites the `InProgress` placeholder for `key` that we
/// inserted; if others were blocked, waiting for us to finish,
/// the notify them.
fn overwrite_placeholder(
&self,
runtime: &Runtime<DB>,
descriptor: &DB::QueryDescriptor,
key: &Q::Key,
memo: Memo<DB, Q>,
new_value: &StampedValue<Q::Value>,
) {

You can see that this function also has the job of waking up other threads that may be blocked on us.

The danger is that a panic could occur and we would wind up with the placeholder being left in there. That is bad for many reasons: future calls from this thread will error out as if a cycle occurred, and other threads that are blocked waiting for a message will hang indefinitely.

I think that what we want to do is to install an "RAII guard" right here, basically after the InProgress placeholder has been installed, but before we've done anything else:

This guard would hold on to a reference to the key and looks something like this:

struct PanicGuard<'me, DB, Q> {
    map: &'me RwLock<FxHashMap<Q::Key, QueryState<DB, Q>>>,
    key: &'me Q::Key,
}

impl<'me, DB, Q> Drop for PanicGuard<'me, DB, Q> {..}

the Drop for PanicGuard is intended to execute only if an unexpected panic occurs. That means we have to explicitly forget the guard on the "success" condition. I would probably do this by adding a parameter to overwrite_placeholder and invoking forget the first thing:

    fn overwrite_placeholder(
        &self,
        runtime: &Runtime<DB>,
        descriptor: &DB::QueryDescriptor,
        key: &Q::Key,
        memo: Memo<DB, Q>,
        new_value: &StampedValue<Q::Value>,
        panic_guard: PanicGuard<'_, DB, Q>,
    ) {
        // No panic occurred, do not run the panic-guard destructor:
        std::mem::forget(panic_guard);

        // ... as before ...
    }

If a panic does occur, then I think we probably just want to acquire the write lock on the map and remove the key altogether. I think that should always be ok -- other code treats a removed key as evidence that the value is out of date.

Then we need some tests. =)

from salsa.

nikomatsakis avatar nikomatsakis commented on August 28, 2024

Hmm. Ordinarily, I'm not a big fan of trying to be panic safe, and would rather that we just "tear it down". But I suppose that in the case of salsa -- since we are controlling all the mutable state -- so long as we propagate the panic cleanly, we can be sure that the system is in a consistent state afterwards. (And, if not, then there is some hidden state that would probably mess up the increment system anyway.)

from salsa.

nikomatsakis avatar nikomatsakis commented on August 28, 2024

There is a Zulip thread dedicated to this issue.

from salsa.

kleimkuhler avatar kleimkuhler commented on August 28, 2024

So location of this test case may belong in a new test file within parallel since it's not required the threads be true_parallel or race. Both threads need to sum(a).

Ideally thread1 would overwrite the value of a with an InProgress mark, but panic before it finishes the sum. The current implementation of fn sum may need to change to allow a panic here

sum += db.input(ch);

Once it panics, it should sum_signal_on_exit with a stage value that lets thread2 begin it's sum(a). If thread1 properly panicked and the cleanup of the InProgress mark on a happened, then thread2 should be okay to complete. We would then assert thread1 panicked (not sure right now how to assert that) and thread2 is the value of a.

What I could use a pointer on is how to ensure the fn sum of thread1 panics so that the cleanup happens. Right now fn sum waits and signals, but my thought is the panic would need to happen inside the actual db.input(ch).

We could add a panic Cell in KnobsStruct that is somehow checked at that point? It may required a new derived query that has the sole purpose of panicking when getting a certain key?

from salsa.

nikomatsakis avatar nikomatsakis commented on August 28, 2024

@kleimkuhler Hmm. I imagine the simplest test-case would not require threads.

For example, imagine we have two queries, foo and bar, where bar is an input query;

fn foo() -> usize {
}

fn bar() -> usize {
    storage input;
}

and you have

fn foo(db) -> usize {
    assert_eq!(db.bar(), 1);
}

and then you invoke db.foo() without having set db.bar to anything (so that it defaults to 0).

This will panic and — presently — leave an InProgress marker for foo.

Now if you catch that panic (via catch_unwind or by executing db.foo() in another thread), set db.bar to 1 and re-invoke db.foo(), it will give you a cycle error.

But that is incorrect.

from salsa.

nikomatsakis avatar nikomatsakis commented on August 28, 2024

But you are absolutely right that we should test the parallel behavior. In particular, I think my write-up neglected to mention an important detail. I wrote:

If a panic does occur, then I think we probably just want to acquire the write lock on the map and remove the key altogether

That is correct, but we also have to inspect the removed key to see if anyone is waiting for us and — if so — propagate a panic to them. I was imagining we would modify our channel so that instead of sending the final result StampedValue<Q::Value>, we send a Result<StampedValue<Q::Value>, PanicOccurred>, where PanicOccurred is some private newtype we use. If we get an Err value, we can then re-propagate the panic.

I think for testing you might like to create this scenario where:

  • sum begins executing on thread1 but pauses
  • thread2 then tries to execute sum and hence blocks
  • thread1 resumes and panics, hence propagating the panic to thread2

Unfortunately we can't quite force that scenario to occur yet (as you can see from the tests). Maybe it's worth modifying the library here -- the problem is that we have no hook that thread2 can execute once it has begun blocking. But we can create a scenario where thread2 signals only once it is about to block, and hence the race is likely to occur. That may be good enough, we can run in a loop a few times or something. 😛

(The only thing I can imagine to do better would be somehow modifying the salsa runtime to send signals when a blocking event occurs or other things via a channel to some other thread which would then signal thread1 to wake up...?)

from salsa.

matklad avatar matklad commented on August 28, 2024

the Drop for PanicGuard is intended to execute only if an unexpected panic occurs.

One thing that feels slightly off (but maybe is OK!) is that this will catch unexpected panics in salsa itself. An alternate is to create a wrapper around QueryFunction::execute (that is, all the code supplied by the user) which turns panics into results using catch_unwind.

fn execute_safely<Q, DB>(db: &DB, query: Q, key: Q::Key) -> std::thread::Result<Q::Value>
where
    Q: QueryFunction,
    DB: Database
{
    std::catch_panic(|| query.execute(db, key))
}

from salsa.

nikomatsakis avatar nikomatsakis commented on August 28, 2024

(Note that with #63 you can figure out when one thread is blocking for another and thus induce panics more precisely.)

from salsa.

nikomatsakis avatar nikomatsakis commented on August 28, 2024

One thing that feels slightly off (but maybe is OK!) is that this will catch unexpected panics in salsa itself. An alternate is to create a wrapper around QueryFunction::execute (that is, all the code supplied by the user) which turns panics into results using catch_unwind.

To be honest, I can't remember if this was by design or not, but it seems like recovering gracefully from internal salsa panics is a feature, not a bug, no?

Another option would be to use catch_unwind for user code but install a "abort-the-process" guard for salsa's code. This is what rayon does. But I don't think salsa has to manage nearly as many complex invariants as rayon does, so it doesn't seem obviously necessary.

from salsa.

nikomatsakis avatar nikomatsakis commented on August 28, 2024

@matklad points out here that it may be very simple to implement the parallel case after all -- in fact, maybe it even works now! Basically, just the act of dropping the channel without sending anything should trigger the "dependent cases" to panic.

from salsa.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.