Cloud Haskell Supervision Trees

License: BSD 3-Clause "New" or "Revised" License

Haskell 99.88% Shell 0.12%

cloud-haskell distributed-systems haskell supervisor

distributed-process-supervisor's People

Contributors

Stargazers

Watchers

Forkers

qnikst cthulhuden tmcgilchrist hyperthunk lukemshannonhill jhbertra

distributed-process-supervisor's Issues

Test suite failure

Test suite SupervisorTests: RUNNING...
NOTICE: Branch Tests (Relying on Non-Guaranteed Message Order) Can Fail Intermittently
Supervisor Processes:
  Starting And Adding Children:
    Normal (Managed Process) Supervisor Start Stop: [OK]
    Specified By Closure:
      Add Child Without Starting: [OK]
      Start Previously Added Child: [OK]
      Start Unknown Child: [OK]
      Add Duplicate Child: [OK]
      Start Duplicate Child: [OK]
      Started Temporary Child Exits With Ignore: [OK]
      Configured Temporary Child Exits With Ignore: [OK]
      Start Bad Closure: [OK]
      Configured Bad Closure: [OK]
      Started Non-Temporary Child Exits With Ignore: [OK]
      Configured Non-Temporary Child Exits With Ignore: [OK]
    Specified By Delegate/Restarter:
      Add Child Without Starting (Chan): [OK]
      Start Previously Added Child: [OK]
      Start Unknown Child: [OK]
      Add Duplicate Child (Chan): [OK]
      Start Duplicate Child (Chan): [OK]
      Started Temporary Child Exits With Ignore (Chan): [OK]
      Started Non-Temporary Child Exits With Ignore (Chan): [OK]
  Stopping And Deleting Children:
    Delete Existing Child Fails: [OK]
    Delete Stopped Temporary Child (Doesn't Exist): [OK]
    Delete Stopped Child Succeeds: [OK]
    Restart Minus Dropped (Temp) Child: [OK]
    Sequential Shutdown Ordering: [OK]
  Stopping and Restarting Children:
    Permanent Children Always Restart (Closure): [OK]
    Permanent Children Always Restart (Chan): [OK]
    Temporary Children Never Restart (Closure): [OK]
    Temporary Children Never Restart (Chan): [OK]
    Transient Children Do Not Restart When Exiting Normally (Closure): [OK]
    Transient Children Do Not Restart When Exiting Normally (Chan): [OK]
    Transient Children Do Restart When Exiting Abnormally (Closure): [OK]
    Transient Children Do Restart When Exiting Abnormally (Chan): [OK]
    ExitShutdown Is Considered Normal (Closure): [OK]
    ExitShutdown Is Considered Normal (Chan): [OK]
    Intrinsic Children Do Restart When Exiting Abnormally (Closure): [OK]
    Intrinsic Children Do Restart When Exiting Abnormally (Chan): [OK]
    Intrinsic Children Cause Supervisor Exits When Exiting Normally (Closure): [OK]
    Intrinsic Children Cause Supervisor Exits When Exiting Normally (Chan): [OK]
    Explicit Restart Of Running Child Fails (Closure): [OK]
    Explicit Restart Of Running Child Fails (Chan): [OK]
    Explicit Restart Of Unknown Child Fails: [OK]
    Explicit Restart Whilst Child Restarting Fails (Closure): [OK]
    Explicit Restart Whilst Child Restarting Fails (Chan): [OK]
    Explicit Restart Stopped Child (Closure): [OK]
    Explicit Restart Stopped Child (Chan): [OK]
    Immediate Child Termination (Brutal Kill) (Closure): [OK]
    Immediate Child Termination (Brutal Kill) (Chan): [OK]
    Child Termination Exceeds Timeout/Delay (Becomes Brutal Kill): [OK]
    Child Termination Within Timeout/Delay: [OK]
  Branch Restarts:
    Restart All:
      Terminate Child Ignores Siblings: [OK]
      Restart All, Left To Right (Sequential) Restarts: [OK]
      Restart All, Right To Left (Sequential) Restarts: [OK]
      Restart All, Left To Right Stop, Left To Right Start: [Failed]
unexpected signal from pid://127.0.0.1:10501:0:930
      Restart All, Right To Left Stop, Right To Left Start: [OK]
      Restart All, Left To Right Stop, Reverse Start: [Failed]
unexpected signal from pid://127.0.0.1:10501:0:1093
      Restart All, Right To Left Stop, Reverse Start: [OK]
    Restart Left:
      Restart Left, Left To Right (Sequential) Restarts: [OK]
      Restart Left, Leftmost Child Dies: [OK]
      Restart Left, Left To Right Stop, Left To Right Start: [OK]
      Restart Left, Right To Left Stop, Right To Left Start: [OK]
      Restart Left, Left To Right Stop, Reverse Start: [OK]
      Restart Left, Right To Left Stop, Reverse Start: [OK]
    Restart Right:
      Restart Right, Left To Right (Sequential) Restarts: [OK]
      Restart Right, Rightmost Child Dies: [OK]
      Restart Right, Left To Right Stop, Left To Right Start: [Failed]

Expected: equalTo pid://127.0.0.1:10501:0:2854
     but: was pid://127.0.0.1:10501:0:2855
      Restart Right, Right To Left Stop, Right To Left Start: [OK]
      Restart Right, Left To Right Stop, Reverse Start: [Failed]

Expected: equalTo pid://127.0.0.1:10501:0:2972
     but: was pid://127.0.0.1:10501:0:2974
      Restart Right, Right To Left Stop, Reverse Start: [OK]
  Restart Intensity:
    Three Attempts Before Successful Restart: [OK]
    Permanent Child Exceeds Restart Limits: [OK]
  ToChildStart Link Setup:
    Both Local Process Instances Link Appropriately: [OK]

         Test Cases   Total       
 Passed  67           67          
 Failed  4            4           
 Total   71           71          
Test suite SupervisorTests: FAIL
Test suite logged to: /home/ubuntu/haskell/stackage/logs/stackage-nightly-2014-12-23/distributed-process-supervisor-0.1.1/test-run.out

Child start handling for `Process ()` is broken.

First reported as part of haskell-distributed/distributed-process-platform#77, there is a serious fault in the ToChildStart instance for Process ().

Not only do we potentially spin up and subsequently leak starter processes, the whole premise of this approach is wrong, since it can lead to a supervisor waiting indefinitely for a child to start. This breaks the contract between parent and child processes and goes against the design principles of supervision as laid out in the original OTP implementation.

I propose we remove this instance and leave it to implementors to define, but also that we remove StarterPid from the data type, since we have no clean solutions for using these without hitting the issues mentioned above.

Specifically, spawn should be asynchronous, and indication that a child has died should come from monitor signals, such that once the child has spawned, the monitor signal should be established before the code has a chance to proceed (and potentially crash prior to monitoring being properly established).

Some things I think we can/should rule out...

The original fix (from @tavisrudd before the repos were split)

Here's a playback of the commentary there... When the StarterProcess ChildStart variant is used with dynamic supervision, each cycle of startNewChild + terminateChild/deleteChild
leaks a process. The proposed fix was to kill the started process, for which we have an ID.

This was broken for two reasons. The first is that we do not evaluate toChildStart in the supervisor process, thus killing the starter process means we can no longer restart the child. The second reason is that we break location transparency, as per this part of the thread:

... So it is safe for us to kill this ProcessId from our point of view, because we're deleting the child spec (and therefore won't require this "re-starter process" to continue living. But what if the code that created this re-starter tries to then send it to another supervisor? The ProcessId given to that supervisor will be dead/stale and the supervisor will reject it (when trying to start the child spec) with StartFailureDied DiedUnknownId which is confusing. (UPDATE: note that I'd missed the fact this supervisor won't be able to start children either...)

At first glance, I thought what we actually wanted here is a finalizer that is guaranteed to kill off the re-starter process "at some point" after the ChildSpec becomes unreachable (and is gc'ed). However, the System.Mem.Weak documentation isn't clear on whether or not this is guaranteed for an ADT such as ProcessId. In particular, this comment:

WARNING: weak pointers to ordinary non-primitive Haskell types are particularly fragile, because the compiler is free to optimise away or duplicate the underlying data structure. Therefore attempting to place a finalizer on an ordinary Haskell type may well result in the finalizer running earlier than you expected. This is not a problem for caches and memo tables where early finalization is benign.

The alternative to this would be to document this behaviour of ChildSpec, explaining that once deleted from a supervisor, the ChildSpec becomes invalid. But I really really dislike this idea. The problem is that the ChildSpec is now behaving like a shared, mutable (unsafe) data structure. Even if you serialize the ChildSpec and send it to another node, if the same ChildSpec (or even another ChildSpec that shares the same ToChildStart thunk!) is removed from any supervisor in the system, we've just invalidated that same data structure across all nodes, because the ProcessId is no longer valid. That just seems wrong to me, no matter how much documentation we throw at it.

One approach that might work here would be to convert the StarterProcess constructor to take a Weak ProcessId and in the ToChildStart instance create a weak reference to the process id and create a finalizer that kills the process once all its clients - the supervisors using it - go away, i.e., the ChildStart datum becomes garbage. Problem solved? Well no.....

Firstly, we'd need to test that this finalization business works properly for a Weak ProcessId. You've already written profiling code to detect the leak, so that shouldn't be too difficult, though relying on finalization will probably mean having to increase the time bounds of the tests to give the System.Mem.Weak infrastructure time to do the cleanup - maybe even forcing a GC at some point.

Secondly, and this is a serious problem: finalisation is useless if/when the ChildStart datum escapes the local node. Which means, in practice, that you can't serialize and send it remotely, otherwise this whole finalization thing will go wrong - we'll never detect that a remote peer is using the ChildStart at all (i.e., we will loose track of it as soon as it's gone over the wire) and therefore - assuming that finalizing a ProcessId works at all - we'll end up killing the re-starter process whilst remote supervisors are still using it. Nasty. Confusing. Bug. :(

You can see now why I was a fussy little kitty when @roman and I were debating how to support local children in the original ticket that introduced StarterProcess. This issue is fraught with edge cases and we're changing a piece of critical fault-tolerance infrastructure, so we can't afford to screw it up.

I think we have three choices here, as I see it - feel free to suggest alternatives though, as always:

remove StarterProcess and make people use the Closure Process constructor instead
re-work StarterProcess to reference count its clients/supervisors
remove the ToChildStart instance for Process () and export StarterProcess then make people implement this themselves

The idea of (1) is that we're avoiding the issue by making the API less friendly. Not a great option IMO.

UPDATE: I actually think (1) is what we should do now. I went on to say the following, which I'll recant shortly...

The idea behind (2) is that the re-starter process will keep some internal state to track each time a new supervisor tries to use it. Each time a supervisor deletes a ChildSpec that uses this re-starter pid, instead of kill restarterPid ... they should exit restarterPid Shutdown and the re-starter process should catchExit expecting Shutdown and decrement its supervisor/reference count each time. Once that ref-count reaches zero, it terminates. This is still vulnerable to the problem I outlined above though, so I think it's not really viable. We cannot provide an API to the general population that is open to abuse like this.

The idea of (3) is that we either find a way to expose that re-starter pid or just make people implement the ToChildStart themselves. The point is, if you've written your code so that you know which re-starter is associated with which ChildSpec and you know (in your own code) that you're no longer using the ChildSpec or the StarterProcess then you - the one who knows what is going on are best placed to terminate that process yourself when you're sure it is no longer needed. This puts the responsibility where it really IMHO belongs - in the hands of the application developers.

The problem with (2) and (3) is that we introduce a huge degree of complexity for very little benefit. Despite the original report in https://cloud-haskell.atlassian.net/browse/DPP-81, it is not difficult to turn a Process () expression into a closure without template haskell, and therefore we're jumping through hoops for hardly any reason at all. But there's more: The way that StarterProcess works is fundamentally broken, because the supervisor has to interact with this other known process to spawn its children, and get back a process id and monitor ref. That is broken in a fundamental way: if we do not own the code for that starter process then we've no way of ensuring we won't block indefinitely waiting for a reply. Even if we do, if the starter resides on a foreign node, network congestion could cause almost indefinite blocking (even in the presence of heartbeats), so this is a bit no-no. We cannot have the supervisor process blocked waiting like that - spawning a child has to be asynchronous, and there needs to be a guarantee that monitoring has been set up correctly before the child proceeds to run, which requires a wrapper that we need to have written. We cannot leave this up to 3rd parties.

Because of this (above), we have to have a thunk that we can execute, which we can wrap in the relevant spawn/monitor code for safely. Since we cannot send Process a over the wire, only Closure (Process a) is possible - allowing for the CreateHandle instance obviously, which takes Closure (SupervisorPid -> Process (ChildPid, Message)) - and we cannot operate in terms of Process a. A concession would be to break up ChildSpec into those used to define the supervisor and those used for dynamic supervision, allowing Process () in a child spec used to boot the supervisor but not in subsequent calls to addChild etc. I really do not like this idea however, since not only to it create an imbalanced API, it could also prevent us sending supervisor setup data over the wire (since it could potentially contain Process () instead of Closure (Process ()) thunks), and that is a pretty huge disadvantage.

The final concession we might make, would be to separate out the supervisor runtime implementation into two parts, one that handles local children and another for remote. I still don't like this idea though, because we end up with a fractured API, but I will consider it if people shout asking for Process () to be supported as a ToChildStart instance again. The approach I would take to achieve this would involve (a) breaking up the server so that the child start handling could be modified, (b) segregating the client APIs and having two supervisor client handles, one for local only and another for remote only supervisors. The client handle for a local only supervisor would carry an STM.TChan used for sending unserialisable thunks to the server. This would require us to implement haskell-distributed/distributed-process-client-server#9 first.

Publish mx events and sync on these during testing

Currently we trace logs, which is no more or less overhead than publishing mxuser events. Publishing these instead of tracing log messages, we get two benefits. Firstly, you can write a custom formatter for SystemLog and your job is done in terms of logging. Secondly, you can use the management API to track supervisor status. This is especially useful when you're testing, since instead of waiting some arbitrary delay to hope you've ensured the supervisor under test is in a state that is ready for verification, you can wait to see the management events appear. But this also provides a generalised approach to listening for supervisor events, where you might do things like track restart frequency, count the supervisors in your system, etc etc.

Loosen upper bounds on distributed-process-X packages

I'm trying to build this project with the latest version of distributed-process (0.7.3) and currently need to increase the upper-bounds of several packages that this package depends on, and the bounds of some packages they depend on.

The dependency bounds I'm interested in changing are:

- distributed-process <0.7 to  <=0.7.3
- ansi-terminal <0.7 to <=0.8.0.4
- time <1.7 to <=1.8.0.2
- network-transport ==0.4 to ==0.5.2
- network-transport-tcp <0.6 to <=0.6

The only two of these that seem to affect extra-deps (distributed-process-X packages) directly are time and distributed-process, which obviously also have some dependency constraints that would need reconciling.

Would you be open to me submitting sweeping PRs that made these bounds changes across all the relevant distribute-process-X packages:

- distributed-process-extras-0.3.2 
- distributed-process-async-0.2.4  
- distributed-process-systest-0.1.1

as well as this one, distributed-process-supervisor?

I'm really interested in using this package, and will for the mean time use allow-newer: true with stack so that I can build this library in a codebase that uses distributed-process-0.7.3, but would love to get these upper bounds fixed upstream and depend on these repos directly, without the use of allow-newer. Thanks!

A bad `CreateHandle` can crash the supervisor process

The code that handles RunClosure spawns a process to deal with initialising the starting the child. The code in wrapHandle, on the other hand, unpacks and executes the Closure (SupervisorPid -> Process(ProcessId, Message)) in the supervisor's own process, which could crash and bring down the managed process itself.

Since this is a common branch - executing every time we spawn a child using that ToChildStart constructor - we should be more careful.

Add support for `addLocalChild`

As per https://github.com/haskell-distributed/distributed-process-registry/blob/master/src/Control/Distributed/Process/Registry.hs, we can pass unserialisable thunks to a supervisor on the same local node, by creating a binary shim to hold the data, and creating a binary instance that simply errors on read and write.

Since the Unsafe module deliberately avoids forcing evaluation of messages, this enables us to safely pass a Process () to a local supervisor, for the purposes of adding it after initialisation.

There are some issues to be observed about this approach...

Firstly, if we wish to add children in this fashion, we might not want to support putting them in the initialisation config for the supervisor. Supporting that notion would make the init config non-serialisable.

We also need to consider whether refactoring the whole code base for this feature is worth it. Perhaps the simplest option that occurs to me, would be to store unclosured child start procs in a map, against the child keys, and when adding a local child, to ignore the closure field and just add the passed proc. So long as code which usually accesses the closure (i.e. The start/restart routines) always checks the map first, this should work fine, with minimal impact.

It could be useful to add an alternate startSup API, that takes LocalChildSpec records instead of the serialisable ones, which should allow for library users to clearly distinguish between supervisors that can be used locally and those which may be deployed to remote nodes. This would get around the issue of having non-serialisable init config.

An interesting issue here is that of nested supervision trees. If a local child spec is for a supervisor, that local child init could indeed contain non-serialisable local child specs, or regular specs.

This change would be incompatible with remote supervision groups, like those that rabbitmq uses to handle load balancing process groups. It would not prevent ordinary supervisor configurationa from being used in those kind of scenarios though...

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.

haskell-distributed / distributed-process-supervisor Goto Github PK