chimeratk / applicationcore Goto Github PK

View Code? Open in Web Editor NEW

0.0 0.0 2.0 3.36 MB

Core library for creating applications based on the ControlSystemAdapter and DeviceAccess.

License: GNU Lesser General Public License v3.0

CMake 7.18% Shell 0.95% Makefile 0.03% C++ 91.12% Python 0.62% C 0.10%

applicationcore's People

Contributors

Watchers

Forkers

phako smarsching

applicationcore's Issues

Implement persistency option for application output variables

DeviceToControlSystem variables can conceptually not be presisted by the control system. The application has to take care of this in case the values cannot be re-calculated when re-starting the server.

Task: The Output should get an option to persist this variable (off by default). When enabled, the variable is always written to file when a write is called. When the application is started/initialised, the variable content is read from the file.

Details:

The implementation should happen with a decorator. Like this any output (to Device, Control System or to other ApplicationModules) can be saved and restored.
Make sure to flush the file after each write. Leave a comment in the code that this has to be tested manually by killing the application with -9 and check that the last value is restored correctly at server start. This kind of test is not well suited for an automated test.

Exception handling and ConfigReader/Constants

This ticket is a child of #10 and depends on #46.

Variables which are Constants or outputs of the ConfigReader and are connected to a DeviceModule should be written in an initialisation handler (see #46). Currently they are written in ConfigReader::pepare() etc., which might block the application initialisation if an exception occurs in the process of writing these variables.

Hints for implementation:

Start with the implementation for ConfigReader. Constants are nowadays a corner case and not so important (not even sure if they properly work when connected to devices right now). If the ticket takes too long, you might split the Constants off into their own ticket.
The ConfigReader stores ScalarOutput/ArrayOutput in the Var/Array structs. These accessor types can be converted into VariableNetworkNode
From the VariableNetworkNode you can get to the VariableNetwork (-> VariableNetworkNode::getOwner())
Scan all consumers of the VariableNetwork (-> VariableNetwork::getConsumingNodes()). You can test each node if it is a Device node (getType()). If yes, you can get the device alias name (getDeviceAlias()). You can look up the corresponding DeviceModule in Application::deviceModuleList.
Unfortunately this (Application::deviceModuleList) is protected and a plain list (so you have to search). Maybe it would be better to extend the VariableNetworkNode to contain a pointer to the corresponding DeviceModule for all Device nodes.
Once you got the corresponding DeviceModule you can at first add per variable one initialisation handler. Maybe later this can be combined by first collecting all variables which are connected with the same DeviceModule.
For Constants I think the right place to put this gymnastics in is VariableNetworkNode::makeConstant().

ConfigReader: Allow easy access from any ApplicationModule

If only one single instance of the ConfigReader exists, this instance should be available through a member function of ApplicationModule inside the constructor of each module.

Logging module: don't use pointers to process variables

Something like this is not necessary to write:
std::unique_ptr<ctk::ScalarOutput<std::string> > message;

Instead of initialising the pointer in the constructor, use a default constructed ScalarOutput and use message.setMetadata() to set its name etc.

Exception handling: propagate data fault through ApplicationModule

This ticket is a child of #10 and depends on ChimeraTK/ControlSystemAdapter#18.

The data fault flag introduced with ChimeraTK/ControlSystemAdapter#18 needs to be propagated through ApplicationModule. This can be done similar to the propagation of the VersionNumber:

In a read operation, when a set data fault flag is received, a data flag for the module will be set (this flag for the module needs to be introduced first). This flag will be cleared once all input variables of the module have their data fault flag no longer set (a logic how this is achieved needs to be planned first - please keep in mind a module might have many variables, so scanning all data fault flags at each operation is not an option!).

In a write operation, the module's data fault flag status will be attached to the variable to write.

All computations shall be executed normally, whether the data fault flag is set nor not, unless the application author specifically tests for the flag. There might be use cases for this, so please allow testing the module's flag by the application user code.

Implementation
Tests
Documentation: the behaviour should be mentioned on the ExceptionHandling Doxygen page.
Check that your implementation matches the general ideas described in the second comment of #10.

Generic modules for status monitoring - Part 1

Write a StatusMonitor module

Actually a bunch of modules which are very similar (let's call them flavours for now)
Module monitors an input variable and reports one of the four states (see below) depending on the value
The different flavours of the StatusMonitor module perform different type of checks. Implement the following at first (configuration PVs in parenthesis):
- Maximal value (two thresholds for warning and error)
- Minimal value (two thresholds for warning and error)
- Allowed range (each two upper and lower thresholds for warning and error)
- Require exact value (the required value), will be error if different
- On/Off state checker (the desired on/off state), will be error if state is different, otherwise 0 resp. 1 depending on the state

The reported state is an integer with the following values (this is actually an enum, but enums are not supported by ApplicationCore. Please define them as a weakly typed enum and make the PV a plain int):

value 0 = off (intentionally)
value 1 = ok
value 2 = warning
value 3 = error

Not all of these states need to be used for each flavour. E.g. 0 is only used by the on/off state checker described above.

Hints for design/interface/usage:

Allow specifying tags for the output status PV through the constructor
Allow in addition to specify the name and tag of the input variable, and separately for the parameters
StatusMonitor instances hence need tree different tags. As a convenience, make the specification of all three optional and allow specifying one single tag list used for all of them. This often will be the "control system" tag
Usually StatusMonitor instances will have the eliminateHierarchy flag enabled, maybe this could be always the case?
This allows to easily let the StatusMonitor connect to a variable on the same hierarchy level, if it is also connected e.g. to the control system

Definition of Complete:

Implement a first, simple exception handling - part 2

This is a child task of #10 and depends on task #11. The idea is to provide a first rather simple exception handling to investigate how to develop the idea further.

Find all places where a ChimeraTK::runtime_error could be thrown in read and write operations (ignore open for now). This is (list may not be complete);

ScalarAccessor and ArrayAccessor: all read and write functions (they are not yet necessarily all overloaded, this must be done then to catch the exception)
ThreadedFanOut and TriggerFanOut
I think not the FeedingFanOut, as it is always used inside a ScalarAccessor or ArrayAccessor

Catch the exceptions there and feed the error state into the DeviceModule through the new function DeviceModule::reportException() (see #11). Retry the failed operation after reportException() returns.

Please put the ticket for review by @mhier when done.

Add missing test

Add test for generating triggers in an ApplicationModule and distributing it through a FanOut. This is currently not covered, thus the following bug was not detected:

2ea7b17

Update Docu: Device exception handling

Update the exception handling doxygen page. The description should be detailed enough to server as specification for the implementation. Do not describe implementation details like the conditons variables or the exception reporting queue, but the behaviour as seen from the Device and the application programmer.

Exception handling

Document the exact behaviour of handling and reporting of runtime_error thrown by Device and its accessors.
** A central thread is waiting for exceptions to be reported by all accessors of the device.
** If an exception is received it is reported and the device status goes to 1
** The thread tries in a loop to call Device::open(), which is supposed to recover the device, until the device reports isFunctional() again.
** After this the initialisaton seqeuence is run. If it fails this itself reports an exception and the recovery restarts.
** The accessor tries to execute the failed action (read/write) again. If it succeeds the device status reported to the CS is set to OK and the error message from the exception is cleared.
Describe the seen behaviour in case of Devices which
** do not (or cannot) report error states (isFunctional() is always true) and one has to retry the read/write to see if the device is working again
** implement a recovery procedure and can report isFunctional() as false while the device is known to be not operational.
Describe when the initialisation will be executed
Describe the difference of the blocking read() in contrast to readNonBlocking() and readLatest()

Generic modules for status monitoring - Part 2

This ticket depends on #13 - read that ticket first for background information.

Implement a StatusAggregator module

Collect the results of a number of StatusMonitor instances and aggregates them into a single status with the same value scheme
Also other StatusAggregator instances can be connected to this instance so its status gets aggregated into the single status. This allows to build arbitrary hierarchies of aggregators
The StatusMonitor and StatusAggregator instances to be connected into the StatusAggregator should be found automatically as follows:
- A StatusAggregator aggregates all instances in the hierarchy of ApplicationCore Modules which are at the same hierarchy level or below
- Two StatusAggregator instances cannot be on the same hierarchy level - this should throw a logic_error
- If a StatusAggregator is found at some hierarchy level below, the search is stopped for that branch of the hierarchy tree, i.e. all variables which are already aggregated by that aggregator are not aggregated againn
The StatusAggregator is configurable (by a constructor parameter, not a PV) in the way how the statuses are prioritised. The status with the highest priority within the aggregated instances defines the aggregated status. The following options exist (highest priority first):
- error - warning - off - ok
- error - warning - ok - off
- error - warning - ok or off - mixed state of ok and off results in warning
- off - error - warning - ok

Hints for implementation:

Collection of instances to be aggregated has to be done in the constructor, later, e.g. in prepare() is too late, because the variable household is then fixed and inputs from the aggregated monitors can not be added anymore.
Search for instances must be done on the virtual hierarchy, use getOwner()->findTag(".*"), then use getSubmoduleList(), getAccessorList on the VirtualModule (see also comment below)
a dynamic_cast finds out, whether a submodule is a StatusMonitor, StatusAggregator, a ModuleGroup or something else
If it's a ModuleGroup, recurse into it via EntityOwner::getSubmoduleList(), unless a StatusAggregator exists on the same level
"Something else" can only be an other ApplicationModule, which we will ignore.

Hints for design/interface/usage:

Allow specifying tags for the output status PV through the constructor

Definition of Complete:

Implementation
Tests, also with some real life scenarios
Doxygen reference documentation
Doxygen page documenting how to use it with an example
Review by @mhier

Missing tests

Many tests are missing. Here is an incomplete list of missing tests (so less obvious scenarios do not get forgotten):

Exception handling

Currently, exceptions (in particular ChimeraTK::runtime_error) cannot be handled. If a device throws a ChimeraTK::runtime_error, it is not possible to catch this exception by the user code, since often it is thrown inside a FanOut. Ideally, exceptions should be handled by ApplicationCore in a way that the application developer does not have to care much about it. This means:

User code doesn't see the exceptions.
Device goes into error state if an exception is found. This is visible on the control system.
All modules need to communicate with the device go to a "paused" stated, i.e. they do no longer process data.
This paused stated should propagate through process variables and connected modules, so on the control system it is visible which part of the application is currently out-of-order.

Exceptions caught in write() shall not set the data invalid flag

Currently the ExceptionHandlingDecorator is treating all cases equal: They set the data invalid flag and block until the error condition is resolved.

Correct behaviour of write()

it shall block (at least to the current point of discussion)
it shall not set the data invalid flag because the data being send is not necessarily invalid

Related with #56 and (discussion about blocking read).

Testable mode in eviceModule::reportException is broken

To prevent a race condition with deadlocks, reportException the releases TestableMode mutex after increasing the testable_mode_counter, but before pushing to the error queue, which is protected by another mutex. This causes the race condition that the test sees an inconsistent state (counter increased but empty queue) and stalls the execution.

Both mutexes must be held when modifying the queue and the counter to keep them consistent. Use "try_lock" for the second mutex, and release the first mutex with a spinning re-lock to avoid the deadlock.

Implement "HierarchyModifier"

Currently it is already possible to modify the apparent hierarchy structure by setting the flag "eliminateHierarchy" for modules. This concept should be made more flexible and readable by changing the boolean flag into a strongly-typed enum called HierarchyModifier. Default will be HierarchyModifier::none, which is does not change the hierarchy (same as eliminateHierarchy=false). HierarchyModifier::hide will be same as eliminateHierarchy=true (I think this is a much better name then "eliminate").

A new feature will be enable by HierarchyModifier::moveToRoot, which will let the module appear at the root level of the hierarchy tree. All children will be moved along with the group (while keeping the tree structure below the moved group of course).

MicroDaq module should use uint64_t trigger

Currently it is not possible to connect the PeriodicTrigger output tick to the MicroDAQ module. This is because MicroDAQ is expecting an int and PeriodicTrigger delivers uint_64

Connecting ApplicationModule to DeviceModule will lose pushType information from device module

Assuming the following code:

DeviceModulde dev;
ApplicationModule mod;
…
mod.connectTo(dev);

And also assuming that mod has a push-type input and dev a corresponding output, the current code will generate a poll-type output on the dev.

VirtualModule::connectTo() calls operator(std::string) on the device module (see https://github.com/ChimeraTK/ApplicationCore/blob/master/src/VirtualModule.cc#L70) which generates poll-type by default (see https://github.com/ChimeraTK/ApplicationCore/blob/master/src/DeviceModule.cc#L37). The following operator>> call will then connect the poll-type dev variable to the push-type input of mod, even though the device would support that variable as push-type (e.g. being a DOOCS ZMQ variable)

Bug in ExceptionHandling

Seen in testExceptionHandling: Currently the trigger has a device variable as consumer. If a separate variable is created, just to trigger, the exception from the device which is being triggered is not handled.

Missing: provide example code to reproduce the issue

Add descriptions for module instances

Module instances (in particular VariableGroup instances) should get a description, which will become part of the process variable description published to the control system.

Produce proper error message when destroying a module which already has connected variables

Typical example: Module depending on configuration parameters is initialised only in the constructor body of the application. By mistake a connection to that module is made before the module is replaced with the final (properly configured) instance. In the replacement (assignment operation) the destructor of the previous instance is called, which should throw if connections to variables inside the module have already been made. Only during the shutdown phase of the application modules may be destroyed which have connections.

Also default constructed modules should not allow making any connection in the first place.

getCurrentVersionNumber() of VirtualModule crashes due to missing owner

I guess that should not happen when using ApplicationCore in the correct way, but currently my application crashes at this point since owner is a nullptr in my case:

ApplicationCore/include/Module.h

Line 146 in 43b4e47

 VersionNumber getCurrentVersionNumber() const override { return _owner->getCurrentVersionNumber(); } 

Better throw an exception and if possible give a hint how to solve the problem.

Examples don't have a CMakeLists.txt

The examples should be stand-alone so one can compile them out of the box against a system-installation of ApplicationCore. For this, they each need a CMakeLists.txt.

ConfigReader: Wrong error message when using wrong type

When defining a ConfigReader variable with a different type than reading it in the application, the ConfigReader reports that the variable cannot be found. Instead it should report that the type is incorrect and which type is expected by the application.

Exception handling: device initialisation after recovery from exception

This ticket is a child of #10.

Background: If a device is recovered after an exception, it might need to be reinitialised (e.g. because it was power cycled). Currently, device initialisation is completely up to the application. This needs to be changed so devices can be automatically reinitialised after recovery.

Task:

Extend the interface of DeviceModule so the application can define a callback function to be executed when the device needs to be initialised.
The function shall have one argument of the type ChimeraTK::Device (the device to be initialised) and a no return value. The function is executed inside the DeviceModule thread.
The function can be either specified in the DeviceModule constructor or through a later call to a DeviceModule member function named DeviceModule::addInitialisationHandler(). Specifying the function is optional (a device might not require any initialisation).
By calling DeviceModule::addInitialisationHandler() multiple times it is possible to add multiple initialisation handlers. The handlers are called in the order they were added (the one added in the DeviceModule constructor comes first).
The handlers shall called in two situations (by the DeviceModule):
- The device has just been successfully opened for the first time during the initialisation phase
- The device has been successfully opened again after recovery from an exception state
The handler may throw ChimeraTK::runtime_error. This exception should be treated in the same way as it were reported by DeviceModule::reportException(), i.e. the device goes again into the exception state (or does not even leave it) and the exception what() string is shown to the control system. This allows the function to also report errors on an application level by throwing the exception explicitly, e.g. if the hardware state is not like expected. Please mention this in the documentation.
If a handler has thrown an exception, the remaining handlers are not called (until the next attempt to recover the device).

Note: multiple handlers are required for the implementation of #47 (as it will add initialisation handlers to devices which might already have a user-defined handler).

DoD:

Implementation
Tests:
- Handler gets called in initialisation phase of the application.
- Handler gets called after exception.
- Correct behaviour if function throws ChimeraTK::runtime_error.
- Correct behaviour with multiple handlers.
Documentation

Missing boost date_time in CMakeLists?

If building ApplicationCore at home I get error:
libChimeraTK-ApplicationCore.so.01.04.00: Nicht definierter Verweis auf `boost::gregorian::greg_month::as_short_string() const'

adding that in CMakeLists.txt fixed that for me:
FIND_PACKAGE(Boost COMPONENTS date_time REQUIRED)

what's your situation?

Fix device error reporting

The device error reporting scheme is broken. As requested in #11 / #12 , reportError() blocks until the device error is resolved. This is conceptually wrong for the following reasons:

Writing must always succeed. Due to inversion of control a module cannot know whether it is connected to a device and would block (as mentioned as "paused" in #10) or if there is a fan-out in between and the module does not even know that the device blocked.
If a module is blocked it cannot propagate the data invalidity flag, and not even shut down.
Currently, if the device stays in an error mode, it blocks the control system module if directly connected to the device, and prevents shutdown of the application. The only reason that this has not hit us so far is that the DeviceModule thread does not check if the device is back online (not in the interface yet), just sleeps 0.5 seconds and then always resets the error state.

Task:

Extend the DeviceAccess Device and Backend interfaces with bool getDeviceError(), so a backend can signal a device error. By default, this reports falser, so backends do not have to implement this (some maybe even can't).
Change the ExceptionDevice so it overloads getDeviceError and reports true if it would throw an exception.
Change the DeviceModule so it keeps polling the DeviceError flag until it goes to true. Make sure there is an interuption point so the thread can terminate if the device does not become available again. The thread must not get stuck waiting for this flag, but must keep reporting errors that are reported (FIXME: does this make sense? The current implementation does not even report errors, so we might as well wait).
Fix the implementation until all the broken and stuck tests work again. First in the master (part of this ticket), then in the issue46 branch (ticket #46 postponed until this one is done).
Update the documentation with the changed behaviour.
Needs review by Martin K.

Consider the following for changes in the ExceptionHandlingDecorator:

Writing must never block
Read may block, but it might be better to return with invalid data
readLatest/ readNonBlocking must not block but return with invalid data

Restructure class Application

The class Application combines too much functionality in a single class. Some ideas to improve the structure:

Split out all code that creates the connections into its own class
Move all specialised handling of Device and ControlSystem variables into the DeviceModule resp. ControlSystemModule
This may also affect other classes like VariableNetworkNode and VariableNetwork.

MicroDAQ causes segfault when disk is full

I guess closing the file in the exception handling does not work if the disk is full.

ApplicationCore/Modules/src/MicroDAQ.cc

Line 360 in a8bafe0

outFile.close();

See backtrace:

#0  0x00007f0134f0c428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f0134f0e02a in __GI_abort () at abort.c:89
#2  0x00007f013554684d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f01355446b6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f0135544701 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f0135544969 in __cxa_rethrow () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f013750a6d2 in ChimeraTK::detail::H5storage::writeData() () from /usr/lib/libChimeraTK-ApplicationCore.so.01.02xenial2
#7  0x00007f013750ad34 in ChimeraTK::detail::H5storage::processTrigger() () from /usr/lib/libChimeraTK-ApplicationCore.so.01.02xenial2
#8  0x00007f013750d00b in ChimeraTK::MicroDAQ::mainLoop() () from /usr/lib/libChimeraTK-ApplicationCore.so.01.02xenial2
#9  0x00007f01374e2b20 in ChimeraTK::ApplicationModule::mainLoopWrapper() () from /usr/lib/libChimeraTK-ApplicationCore.so.01.02xenial2
#10 0x00007f0136c2f5d5 in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.58.0
#11 0x00007f01360086ba in start_thread (arg=0x7f0021ffb700) at pthread_create.c:333
#12 0x00007f0134fde41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Connecting device module to control system module

Connecting a device module to a control system module e.g.,
dev(RegisterName, ctk::UpdateMode::push) >> csModule[configLocation](RegisterName);
fails with an exeception

terminate called after throwing an instance of 'ChimeraTK::logic_error'
what(): The feeding node has zero (or undefined) length!

Reason is that the VariableNetworkNode operator() in DeviceModule sets the nElements = 0 by default.

ConfigReader: allow building hierarchies of configuration variables

Currently, the ConfigReader module supports only a flat list of variables. This can be inconvenient for more complex applications. A well-structured hierarchy of configuration variables could also be mapped directly onto the structure of an application.

Add the option to the ConfigReader to create submodules with configuration variables (and more submodules) inside. The config file could then look like this:

<configuration>
  <variable name="variableName" type="int32" value="42"/>
  <variable name="anotherVariable" type="string" value="Hello world!"/>
  <module name="foo">
    <variable name="variableInModule" type="int32" value="120"/>
    <variable name="anotherVariable" type="string" value="Hello world!"/>
    <module name="bar">
      <variable name="someName" type="int32" value="666"/>
    </module>
  </module>
</configuration>

The values can be obtained through get() like this: config.get<int>("foo/bar/someName"), or like this: config["foo"]["bar"].get<int>("someName"). When connecting the ConfigReader module with application modules, the square bracket operator behaves in the same way as for any other module with submodule, and connectTo() can also be used to connect the hierarchy on to a hierarchy in another module. For unit testing, it should be sufficient to use: config.connectTo(cs), where "cs" is the ControlSystemModule. The hierarchy of variables should then appear in the control system (with its constant values).

Definition of done:

Implementation
Unit tests in the described way
Create a new doxygen page with a simple example - this page shall later be part of a tutorial-like documentation of ApplicationCore (for now, create stubs for everything which is needed to generate the documentation structure and not directly related with the implementation in this ticket)

Exception handling - part 4

This is a child task of #10 and follows up on task #12. It can be done in parallel with #25, but it depends on ChimeraTK/DeviceAccess#50

Current situation:
runtime_error exceptions coming from devices are now caught and made visible to the control system. While the device is in the exception state, the DeviceModule will try opening the device until it succeeds. On the other hand, the device is already opened during launch of the application. If that fails (because the device is not reachable during server start), the application will currently not start.

Task:
Initial opening of the device should take place in the DeviceModule itself, inside the exception handling loop. That way the device can go to the error state right at the beginning and the server can start despite not all its devices are available.

Since the device is currently opened in some cases already while the connections are made (in Application::makeConnections()), to obtain the device catalogue if connectTo() is used to connect an entire device with directly the control system, we need to make use of the new Device constructor introduced in ChimeraTK/DeviceAccess#50 there instead.

DoD:

Implementation
Test: Application should start even if using an ExceptionBackend which throws exceptions in open even from the beginning.

Application::optimiseConnections() needs to eliminate duplicate connections

In Application::optimiseConnections(), networks which have the same Device variable as feeder are merged. This merging fails, if both networks contain the same consumer (typically a ControlSystem variable). Those duplicates need to be detected first and only one copy is kept.

This happens e.g. if a (sub-)module with HierarchyModifier::moveToRoot is connected in several instances to both the device and the control system. This is often done when using tags.

Write first a test with a VariableGroup HierarchyModifier::moveToRoot with tags "CS" and "DEV", instantiated two times. The tag "CS" is connected with the control system, "DEV" with the device. This should initially fail. Then try to fix this problem.

DoD:

Write the test first
Fix the issue

Server version should be visible in variable description XML file

This requires to specify the server version somehow e.g. in the constructor of Application.

Write test for DeviceModule move constructor and assignment

a4c2e03 has fixed a bug in the move assignment (which is used by the constructor). A test is missing, which would have shown this bug earlier. The test should verify that an application which uses multiple DeviceModules (e.g. moved to a std::vector) still operates normally. Also the DeviceModules need some testing (e.g. subscript operator etc.) after they all have been created.

Introduce additional HierarchyModifier::move1LevelUp etc.

Introduce additional HierarchyModifiers:

move1LevelUp
move2LevelUp
... up to 10 is probably sufficient.

Exception handling: set/clear fault flag in in case of exceptions

This ticket is a child of #10 and depends on #38.

If a read or write operation results in a ChimeraTK:runtime_error, the module's data fault flag should be set and propagated to all outputs of the module. When the operation completes after clearing the exception state, the flag should be cleared as well.

Implementation
Tests
Documentation: the behaviour should be mentioned on the ExceptionHandling Doxygen page.
Check that your implementation matches the general ideas described in the second comment of #10.

Exception handling - improve tests for part 2

Improve tests of the implementation of #12 . Currently only exceptions that occur on write() are properly tested. This should be extended so any possible runtime_error which can be thrown by a backend is tested. This will require to change the ExceptionBackend used for the tests into a backend not based on the NumericAddressedBackend, so the test can control where exactly the exception is thrown. Then it is possible to also test exceptions in preRead()/postRead()/preWrite()/postWrite().

It is probably also useful to have fine control in the ExceptionBackend in which function an exception is thrown (rather everywhere or nowhere right now). The test can then check one place after the other.

Also the test is currently lacking to verify that the read or write operation is retried, i.e. the data is not lost.

readNonBlocking is blocking

In case of runtime_errors thrown by Device accessors, the error reporting blocks all actions until they finally succeed. This is wrong for readNonBlocking and readLatest.

Implement a "non-blocking" version of reportException() in DeviceModule.
It still uses a condition variable to synchronise with DeviceModule thread. However, it does not
block until the device reports isFunctional, but only until the error has been reported and
clearErrorState has been set to false.
Use it in doReadTransferNonBlocking() and doReadTransferLatest() of the ExceptionHandlingDecorator.
These functions shall not block but return with invalid data after a failed transfer.
When implementing after #54 make sure reportException waits until the error state has been reported and then commumicate with the DeviceModule thread such that the device error can be set back (there is no retry for this actions).

Review the shutdown test in testExceptionHandling

The shutdown test is supposed to cover all scenarios where an accessor could block, and check that the application destructor can join all threads and shutdown without blocking.
However, DeviceAccessors can be placed directly into all kinds of objects by the connection code: The control system module, user modules, ConcuminFanOuts, FeedingFanOuts, ThreadedFanOuts, TriggerFanOuts. Please review the test if there are other classes which are not covered, and if all of the above are handled by the test. (The TriggerFanOut is not, but the mechanism is not in place in Device for tests, and probably this will not throw but just not send in case of error).

Sanity check for internal variable connections

When using the "trick" to make ApplicationModule to ApplicationModule connections by connecting them both to the same variables in the ControlSystem, mistakes can be easily made e.g. when variables are renamed in refactoring at only one end of the internal connection. Add an optional sanity check which helps detecting those problems:

Write a function, which can be called on Modules (usually the result of findTag()) or on individual accessors (hence technically each one member function), which allows to set certain flags for process variables.
One of these flags (the only one in the beginning) will be called something like EnforceInternal.
Right after defineConnections(), all process variables with this flag will be checked. The corresponding variable networks should have its feeder and at least one consumer of the type application.

Definition of done:

Clarify names, discuss with a second person
Implementation
Tests
Reference documentation (Doxygen inline)

Fix fluctuating error state

The device status is always reported as OK after 500 ms.

The recovery mechanism for failed devices is to call open() again and wait until the Device reports isFunctional() again. Some devices (network based without permanent connection) don't take any recovery action during open() and always report isFunctional(). Currently all devices do this. This leads to the unwanted behaviour that the device state is always reported as OK after 500 ms (if the device does not have an initialisation procedure which itself throws again).

Correct behaviour:

The device reports error until a successful read or write could be performed.

This behaviour works well with connection-less backends with unstable network, where sometimes network interruptions occur and just go away, as well as with backends which actively have to resolve the error. The latter happens in the DeviceModule thread, which wakes up the failing accessors to retry.

Implementation:

Details are not fully worked out. The DeviceModule thread needs to get notified that the error condition has been resolved. So it after being woken up there can be two actions: send error or resolve error. Do we have two different sleeping points? But what to do with additional errors that come in? Or just one sleeping point? Do we need an extra queue or extend the meaning of the predicate for the condition variable?

When #56 is already implemented, make sure that readNonBlocking and readLatest have the correct behaviour:

They return immediately with data invalid. The DeviceModule thread is allowed to reset the device error flag. However, it will only do this after the next read/write (of another or the same) accessor is successful.

VariableGroup::readAll() fails to update all push-type variables

In the code below, somehow the ch7 (possibly others, but the data is usually constant) was not updated properly after the readAll().

The values are connected to the module it lives in using the trigger from the PeriodicTrigger

    struct : ctk::VariableGroup {
      using ctk::VariableGroup::VariableGroup;

      ctk::ScalarPushInput<int16_t> ch1{this, "CAR_ADC_CH1", "", ""};
      ctk::ScalarPushInput<int16_t> ch2{this, "CAR_ADC_CH2", "", ""};
      ctk::ScalarPushInput<int16_t> ch3{this, "CAR_ADC_CH3", "", ""};
      ctk::ScalarPushInput<int16_t> ch4{this, "CAR_ADC_CH4", "", ""};
      ctk::ScalarPushInput<int16_t> ch5{this, "CAR_ADC_CH5", "", ""};
      ctk::ScalarPushInput<int16_t> ch6{this, "CAR_ADC_CH6", "", ""};
      ctk::ScalarPushInput<int16_t> ch7{this, "CAR_ADC_CH7", "", ""};
      ctk::ScalarPushInput<int16_t> ch8{this, "CAR_ADC_CH8", "", ""};
      ctk::ScalarPushInput<int16_t> ch9{this, "CAR_ADC_CH9", "", ""};
      // 10, 11, 12 handled elsewhere
      ctk::ScalarPushInput<int16_t> ch12{this, "CAR_ADC_CH12", "", ""};
      ctk::ScalarPushInput<int16_t> ch13{this, "CAR_ADC_CH13", "", ""};
      ctk::ScalarPushInput<int16_t> ch14{this, "CAR_ADC_CH14", "", ""};
      ctk::ScalarPushInput<int16_t> ch15{this, "CAR_ADC_CH15", "", ""};
    } input{this, "DeviceIn", "ADC input channels", ctk::HierarchyModifier::hideThis, {"ULOG_IN"}};
  ...
  void mainLoop() {
    input.readAll();
    ...
  }

Implement a first, simple exception handling - part 1

This is a child task of #10. The idea is to provide a first rather simple exception handling to investigate how to develop the idea further. The implementation should look as follows:

Add two error state variables:

"state" (boolean flag if error occurred)
"message" (string with error message)

to DeviceModule. Place them together in a VariableGroup called "DeviceError", but do not list this group in the subModules map, so it doesn't get automatically connected with connectTo(). Instead one needs to connect this explicitly to the control system.

Add a thread safe function to the DeviceModule which sets the error state called "reportException". This requires an internal thread of the DeviceModule which fills these variables. To communicate with this thread, we can directly use cppext::future_queue, as it is multi producer (in contrast to the ProcessArray).

reportException() shall block until the error state has been resolved. This means the internal thread periodically tries to reopen the device. Once this is successfully done, all threads currently blocking reportException() (keep in mind this can be many!) should be unblocked. This can be realised through a std::condition_variable.

After reportException() returns, the callee must retry the last read or write operation (since it has failed). This should be noted in the documentation.

Add test cases for this implementation:

write a small dummy device which can throw an artificial ChimeraTK::runtime_error in open() when asked to do so through a flag, and counts how often open() has been called
close the device in the test, then call reportException(). this should result in the device being opened again.
do the same again but set the flag so open() fails. This time reportException() shall block and the counter shall increase until the flag is cleared (this requires an extra thread in the test).
think of more test cases, e.g. check the error message and state flag variables

Also add documentation in form of Doxygen descriptions for the new function and the VariableGroup, as well as a Doxygen page about exception handling explaining this functionality for users (i.e. Application developers).

Please put the ticket for review by @mhier when done.

Direct connetion Device->ControlSystem always has two hierrarchy levels

When directly connecting a device with just a plain list of registers (test.map for instance with /REG1, /REG2 etc.) they show up as /REG1/REG1, /REG2/REG2 etc. I was excepting just /REG1 etc.

When there is a hierarchy in the map file (test2.map with /MyModule/actuator) no additional hierarchy is created.

Allow unallowed hierarchy building on purpose

Could this be converted to something like a warning:

ApplicationCore/src/ControlSystemModule.cc

Line 28 in 67b8eb5

assert(variableName.find_first_of("/") == std::string::npos);

Currently the https://github.com/ChimeraTK/ServerMockup is not working anymore because of this assertion. I just tested what happens when removing this assertion. The ServerMockup runs fine again and all the ApplicationCore test succeed.

It would be very inconvenient (if possible at all) to reproduce the variable tree using ModuleGroups and VariableGroups.

Test missing: propagation of version number

There is no test for the propagation of the VersionNumber through ApplicationModules and FanOuts etc.

New Project: Write standard application directly connecting one device to the control system

This should be an independent project (i.e. create new repository on github). This very simple application just has one DeviceModule opening a single device. This DeviceModule is then connected to a ControlSystemModule with connectTo(). Since many devices have only poll-type variables, there should also be a configurable trigger. Use a PeriodicTrigger module for this purpose and specify its tick variable in the connectTo() call as second argument. Both the tick and the period of the trigger should be visible in the control system inside a directory "/trigger".

The idea if this application is of course to use it in combination with the LogicalNameMapping backend, so the registers which should be visible in the control system can be selected and properly named. Also register plugins can be used e.g. to convert raw values to physical units.

The application can also double as a starting point for writing new, more complex applications. It is therefore useful to document the code well and use a structure of the project which is suitable for extension, i.e. put a header file into the include directory and the source file into the src directory and follow best practises in CMakeLists.txt. The project should link against a user-selected control system adapter, refer to the LLRF server for an example how to do that (maybe: put that code into a cmake module in project-template and use that module).

DoD:

New source code repository
Implementation
Inline code documentation so beginners understand what's going on
Test based on ApplicationCore testable mode and a dummy device
Example configuration files with a dummy device, a logical name mapping layer and DOOCS config files (if the DOOCS adapter is found)
User documentation doxygen page how to use it as a simple server
Setup Jenkins job

Code Cleanup: Remove shadow warnings

With improved warning flags there are some "shadow" warnings. These should be removed to get rid of the compiler warnings.

Exception handling - part 3

This is a child task of #10 and follows up on task #12. Can be done in parallel with #26.

Current situation:
runtime_error exceptions coming from devices are now caught and made visible to the control system. ApplicationModules which read those variables will be blocked until the exception state is resolved. TriggerFanOuts read multiple poll-type device variables when a trigger arrives. Those variables might come from different devices. If one of the devices fails, currently the entire TriggerFanOuts will block until the device is recovered.

Task:
Change the connection making code (in class Application) so it creates a separate TriggerFanOut for each device. This makes not only sure that faulty devices will not block readout of other devices, but also will parallelise the readout of multiple devices, which is a performance improvement (completely disconnected from exception handling).

DoD:

Implementation
Add test with two ExceptionDummy instances (see test for exception handling) to make sure that readout of on device continues while the other is in an exception state.

DeviceModule uses ConditionVariable without predicate

Due to spurious wakeups, condition_variable::wait must always be used with a boolean predicate. In the DeviceModule the mechanism just uses the wait() without checking a conditon.

Introduce a flag which indicated the error status and use it together with the condition variable.

chimeratk / applicationcore Goto Github PK

applicationcore's People

Contributors

Watchers

Forkers

applicationcore's Issues

Exception handling

Recommend Projects

Recommend Topics

Recommend Org