
snews_coincidence_system's Introduction

SNEWS_Coincidence_System

Coincidence system backend for SNEWS alert trigger.


How to Install

The package can be installed via setuptools or poetry. See the documentation for more details.

Clone this repo and change into the directory:

git clone https://github.com/SNEWS2/SNEWS_Coincidence_System.git
cd SNEWS_Coincidence_System

Usage

snews_cs is the main software running on the servers to initiate coincidence searches and trigger alerts. Basic usage is starting the coincidence search with the following command:

snews_cs run-coincidence --no-firedrill

The command line tool provides information about the available commands and options:

snews_cs --help

The heartbeat feedbacks can be tracked via the following command:

snews_cs run-feedback

For more details and advanced usage, see the documentation.

Note for the developers

Please visit the Developer Notes page for more details. Whenever a new dependency needs to be introduced, please add it via poetry, and regenerate the requirements.txt file for setuptools using poetry as described in the notes.

SNEWS Detector List

  • Baksan
  • Borexino
  • DS-20K
  • DUNE
  • HALO
  • HALO-1kT
  • Hyper-K
  • IceCube
  • JUNO
  • KM3NeT
  • KamLAND
  • LVD
  • LZ
  • MicroBooNe
  • NOvA
  • PandaX-4T
  • SBND
  • SNO+
  • Super-K
  • XENONnT

snews_coincidence_system's People

Contributors

karamelih, storreslara, joesmolsky, whiskey9cjo, sybenzvi, justinvasel, mlinvill


Watchers

Chris Tunnell, Andrey Sheshukov, Kate Scholberg, J Tseng, Rafael Lang, Alec Habig

snews_coincidence_system's Issues

Tracking the snews_pt version

As the snews_pt and snews_cs are developed in parallel, they need to be compatible.

If we make changes to snews_pt in the future, we will need to validate the messages generated by that version properly.

For this, we have introduced a schema_version field in the snews_pt with the idea that snews_cs checks the versioning and processes the message accordingly when consuming messages from the stream. However, we never implemented this check.
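A minimal sketch of what that check could look like when consuming from the stream (the schema_version field comes from snews_pt; the supported-version set and the fallback policy are assumptions):

SUPPORTED_SCHEMA_VERSIONS = {"1.3.0", "1.3.1"}  # hypothetical version set

def check_schema_version(message: dict) -> bool:
    """Return True if snews_cs knows how to process this message."""
    version = message.get("schema_version")
    if version is None:
        return False  # pre-versioning message; route to a legacy parser
    return version in SUPPORTED_SCHEMA_VERSIONS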

Rename auxiliary folder

The folder with configuration files (auxiliary) should be renamed to etc as this is more standard for packages.

Bypass Mongodb if it is a test

We are considering running two CS instances in parallel; dev and production.

For the dev instance it might make sense to bypass the MongoDB insertion, or to create a separate Mongo instance.

Coordination between multiple CS instances

So, it would be nice to have SNEWS servers running redundantly in different places, to dodge network or power problems.

The way it is right now, this would work out of the box for the decision-making algorithms: each server can subscribe to the same hopskotch stream and would make the same decisions independently.

But, we don't want multiple copies of the alerts going out and confusing people, be they emails or slack alerts or whatever. So, at any point in time, only one server should push alerts.

How best to do this? I could imagine something where the first server to reach a decision puts out its alert. Other servers, if they see an alert already published that matches the one they are about to push, would not make a duplicate.

While adding the logic to do this, we would need to be careful about making the operation an "atomic" one, to avoid race conditions. That will take some thought.
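One way to get the atomicity without inventing new machinery is a shared database key: a sketch, assuming the servers can reach a common MongoDB instance and that each alert can be given a deterministic key (e.g. derived from its coincidence window):

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

# The _id field is unique by construction, so exactly one insert succeeds;
# the database and collection names here are assumptions.
published = MongoClient()["snews_db"]["published_alerts"]

def try_claim_alert(alert_key: str) -> bool:
    """Return True only on the first server to reach this alert."""
    try:
        published.insert_one({"_id": alert_key})
        return True   # we won the race; push the alert
    except DuplicateKeyError:
        return False  # another server already pushed it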

Alert webpages

We need a dynamic, self-generating webpage for each alert we produce, where the current state of the alert (after all updates, retractions, etc.) can be displayed, with links to follow-up plans, skymaps, etc.

Kind of like LIGO has: https://gracedb.ligo.org/superevents/public/O3/

Since we publish over Kafka (hopskotch), and Kafka streams are commonly used to drive dynamic webpages, this should be tractable.

(Maybe we can ask to ride on LIGO's infrastructure?)

alert message content

alert_data = snews_utils.data_cs_alert(p_vals=p_vals, nu_times=nu_times, ids=None)

at this line, ids is expected to be a list of detector ids. Is there a reason we pass just a None or is it just not yet implemented?

Also, do we only submit the involved detectors' IDs and not their names?
I think the id-name pairing already exists in snews_pt, so I don't see any harm in passing the detector names in the alert messages.
@Storreslara @joesmolsky

Fire a warning when the number of "alive" detectors <= 2

After we start receiving regular heartbeats, we can also track and take action in case of globally coincident detector outages. This, however, would only trigger after the detectors stop sending heartbeats (or send OFF beats). If we want to take action before that happens, we need to ask the experiments for their "scheduled" operations.

clean coincidence probability calculation

When alerts are sent to snews_cs by snews_pt, they are followed by a heartbeat message. Since the alert is formed from the coincidences first, the coincidence probability can be off by one (or more) detectors because the heartbeat has not been registered in time.

Update the code so that the coincidence probability is correctly calculated. It could be done as follows (see the sketch after the list):

  • Either snews_pt does not send a HB and snews_cs counts every CoincidenceTier message as a HB.
  • In the logic that handles the CoincidenceTier message, we log a heartbeat automatically and just ignore the subsequent HB from snews_pt.
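A minimal sketch of the second option (the cache layout and handler names are assumptions):

from datetime import timedelta

hb_cache = {}  # detector name -> last registered heartbeat time

def register_heartbeat(detector, received_time):
    hb_cache[detector] = received_time

def handle_coincidence_tier(message):
    # Count every CoincidenceTier message as an implicit heartbeat ...
    register_heartbeat(message["detector_name"], message["received_time"])
    # ... then continue with the normal coincidence handling.

def handle_heartbeat(message):
    # ... and ignore the follow-up HB that snews_pt sends right after
    # an observation; anything within a few seconds is a duplicate.
    last = hb_cache.get(message["detector_name"])
    if last is not None and message["received_time"] - last < timedelta(seconds=5):
        return
    register_heartbeat(message["detector_name"], message["received_time"])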

Heartbeat Tracker

I suggest we make a separate script in snews_cs to handle HB messages.

I think we need the following functionalities; please add more or correct as you see fit (a rough sketch follows the list):

  • Track the frequency of HB messages from each experiment.
  • Track received time and machine time, and their differences, to naively infer latency.
  • Track how many detectors are currently running.
  • Raise a warning if the number of online detectors is one or zero.
  • Keep daily (?) statistics of expected vs. received HBs?
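A rough sketch of such a script (the class and method names are assumptions, and the expected heartbeat period is a placeholder):

from collections import defaultdict
from datetime import timedelta
import warnings

class HeartbeatTracker:
    def __init__(self, expected_period=timedelta(minutes=3)):
        self.expected_period = expected_period
        self.beats = defaultdict(list)  # detector -> [(received, machine), ...]

    def register(self, detector, received_time, machine_time):
        self.beats[detector].append((received_time, machine_time))

    def latencies(self, detector):
        # Naive latency: received time minus the detector's machine time.
        return [(r - m).total_seconds() for r, m in self.beats[detector]]

    def online_detectors(self, now, grace=3):
        # Consider a detector online if its last beat is recent enough.
        cutoff = now - grace * self.expected_period
        return [d for d, b in self.beats.items() if b and b[-1][0] > cutoff]

    def check(self, now):
        if len(self.online_detectors(now)) <= 1:
            warnings.warn("One or zero detectors currently online!")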

Online Detectors Check missing logic

Thanks to @Sheshuk we have noticed a bug in the coincidence system.

Normally, in this function:

import math
import numpy as np

def ncr(n, r):
    f = math.factorial
    return int(f(n) / f(r) / f(n - r))

def cache_false_alarm_rate(cache_sub_list, hb_cache):
    """Assume a false alarm rate of 1 per week per detector.
    Returns the combined false alarm rate in years;
    meaning, if there are 8 active detectors, each with a false alarm rate of 1/week,
    we would get a false alarm with a 2-fold coincidence every X years.
    The formula:
    n = number of detectors
    r = number of coincidences
    C(n,r) = \frac{n!}{r!(n-r)!}
    R_{combined} = (C(n,r)+1) \times F_{im,d1} \times F_{im,d2} \cdots \times F_{im,dr} \times \delta t^{r-1}
    """
    seconds_year = 31_556_926
    seconds_week = 604_800
    single_imitation_freq = 1 / seconds_week  # 1/week in seconds
    online_detectors = len(hb_cache.Detector.unique())           # n
    coincident_detectors = len(cache_sub_list['detector_name'])  # r
    time_window = 10  # seconds
    combinations = ncr(online_detectors, coincident_detectors)
    combined_imitation_freq = (combinations + 1) \
        * np.power(single_imitation_freq, coincident_detectors) \
        * np.power(time_window, coincident_detectors - 1)
    comb_Fim_year = combined_imitation_freq / seconds_year
    return 1 / comb_Fim_year

We compute the number of detectors available at the time of the alert to estimate the false-alarm probability of such a trigger. There are two problems at the moment, one of which was already discussed:

  1. We check the number of available detectors at the time when we receive the alerts, not at the time when the neutrino events were detected (see the picture attached to the issue).

  2. We depend on the heartbeat messages arriving frequently; however, if no heartbeats are registered at the time of an Observation message, the heartbeat cache will be empty and the FAR calculation thinks there is a coincidence between 2 detectors out of 0 available detectors (since no HB is registered).

For the second issue, we talked about registering a heartbeat for each observation message, but this is not yet implemented. Currently, when the heartbeat cache is empty, the FAR calculation tries to compute a factorial with 0 detectors and crashes before sending the message. We can apply a quick patch to skip the FAR when the cache is empty (see the sketch below), but we need to fix the real issue.
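A minimal sketch of that quick patch, wrapping the function quoted above:

def safe_false_alarm_rate(cache_sub_list, hb_cache):
    # Quick patch: skip the FAR calculation when fewer detectors are
    # registered in the HB cache than are in the coincidence, instead of
    # letting ncr() crash on n < r (or n = 0).
    n_online = 0 if hb_cache.empty else len(hb_cache.Detector.unique())
    if n_online < len(cache_sub_list['detector_name']):
        return None  # FAR undefined; the real fix is a HB per observation
    return cache_false_alarm_rate(cache_sub_list, hb_cache)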

No default smtp server in cs_email.py

Not having a default SMTP server configured in cs_email.py causes problems when the smtp_server environment variable is not set.
It should default to 127.0.0.1 and be overridden by the environment variable.
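A minimal sketch of the intended behavior (assuming cs_email.py reads the server name once at import time):

import os

# Default to localhost; the smtp_server environment variable overrides it.
smtp_server = os.getenv("smtp_server", "127.0.0.1")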

Unit Tests CI/CD integrations

To keep track of this, I created an issue.

We need unit tests and automated github actions.

After implementing the data models, the tests should be easy to implement. As for the GitHub actions, one of them might/should create requirements.txt from the poetry requirements on each PR to make sure they stay synchronized.
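The export step of that action could be as simple as running poetry's own export command on each PR (assuming the export plugin is available on the runner):

poetry export -f requirements.txt --output requirements.txt --without-hashes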

Clean up project scripts, images, notebooks

The top level of the project is littered with Python scripts and random image files and notebooks. The top level of the project should include only files necessary for installation and tests.

  • Create a snews_cs/scripts folder and copy all driver code there.
  • Copy all images into docs.
  • Copy all stray notebooks into docs.

Unexpected indent in snews_cs/heartbeat_feedbacks.py

I encountered this error recently while working with the snews_cs server.


/project/snews/snewscoinc-venv/bin/python /project/snews/SNEWS_Coincidence_System/server_run.py
Traceback (most recent call last):
  File "/project/snews/SNEWS_Coincidence_System/server_run.py", line 1, in <module>
    from snews_cs.snews_coinc import CoincidenceDistributor
  File "/project/snews/SNEWS_Coincidence_System/snews_cs/snews_coinc.py", line 11, in <module>
    from .cs_remote_commands import CommandHandler
  File "/project/snews/SNEWS_Coincidence_System/snews_cs/cs_remote_commands.py", line 6, in <module>
    from .heartbeat_feedbacks import check_frequencies_and_send_mail, delete_old_figures
  File "/project/snews/SNEWS_Coincidence_System/snews_cs/heartbeat_feedbacks.py", line 163
    latency = pd.to_timedelta(df['Latency'].values).total_seconds()
IndentationError: unexpected indent

Signal Handling

We need to handle at least SIGHUP and SIGINT.

Currently, Ctrl-C doesn't do what you'd expect; the process has to be suspended with Ctrl-Z and killed. You'd like it to shut down gracefully.

When sent a SIGHUP, it should close the current logfile and re-open a new one. This allows log rotation.

Mark points out the docs for doing so.

Related request: stdout should be an option for the logging output. Hopefully this just changes the file open to stdout instead of a file (see the sketch below).
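A minimal sketch of the handlers (the logfile path and shutdown details are assumptions):

import logging
import signal
import sys

LOG_PATH = "snews_cs.log"  # hypothetical logfile location
handler = logging.FileHandler(LOG_PATH)
logging.getLogger().addHandler(handler)

def handle_sigint(signum, frame):
    # Shut down gracefully on Ctrl-C instead of requiring Ctrl-Z + kill.
    logging.shutdown()
    sys.exit(0)

def handle_sighup(signum, frame):
    # Close the current logfile and re-open it, enabling log rotation.
    global handler
    root = logging.getLogger()
    root.removeHandler(handler)
    handler.close()
    handler = logging.FileHandler(LOG_PATH)
    root.addHandler(handler)

signal.signal(signal.SIGINT, handle_sigint)
signal.signal(signal.SIGHUP, handle_sighup)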

Heartbeat feedback messages missing the attachment

When requesting heartbeats we get the feedback as an email.

However, the feedback image is not attached. As an improvement, we also thought of attaching a csv file with the heartbeats from the last 24 hours for that experiment. My experience on the experiment side is that this would be useful for improving our system.
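A minimal sketch of attaching both files with the standard library (the file names and addresses are placeholders):

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "SNEWS heartbeat feedback"
msg["From"] = "snews@example.org"      # hypothetical addresses
msg["To"] = "experiment@example.org"
msg.set_content("Heartbeat feedback for the last 24 hours is attached.")

# Attach the feedback figure and the 24 h heartbeat csv.
with open("feedback_figure.png", "rb") as f:
    msg.add_attachment(f.read(), maintype="image", subtype="png",
                       filename="feedback_figure.png")
with open("heartbeats_24h.csv", "rb") as f:
    msg.add_attachment(f.read(), maintype="text", subtype="csv",
                       filename="heartbeats_24h.csv")

with smtplib.SMTP("127.0.0.1") as server:
    server.send_message(msg)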

Improve email output format

The alert email should be carefully formatted to be human-friendly and polite to the experiments.

The code is in cs_email.py, in the snews_email function, which builds the pretty_alert string that gets sent as the email.

Will attempt to use markdown here to make an example. Let's settle on what it looks like before trying to code it.

Subject: SNEWS2-test SNEWS COINCIDENCE 2023-06-02T08:23:09.833793

Need a better subject line, but one which preserves a timestamp so we can sort them out while developing. For humans, should we parse that ISO timestamp into something friendlier?

Note that this couples to the as yet to be written up issue where we want to make a web page for each alert to provide updates, links to skymaps, etc.

The Supernova Early Warning System reports that the following experiments observed a Supernova neutrino-like signal within 10 seconds of each other:
XENONnT at 2012-06-09T15:31:08.109876 with a probability of None
DS-20K at 2012-06-09T15:31:08.109876 with a probability of None
KamLAND at 2012-06-09T15:31:08.891011 with a probability of None

reported by SNEWS server at avogadro.physics.purdue.edu, which estimates a False Alarm Rate of 0.00%

More information on this alert (including directional information) will be kept at https://snews2.org/alerts/2023-06-02T08:23:09.831463

Internal SNEWS information follows:
_id : SNEWS_Coincidence_ALERT-UPDATE 2023-06-02T08:23:09.831463
alert_type : NEW_MESSAGE
p_values average : nan
sub list number : 2

Currently, here's what it says:

_id : SNEWS_Coincidence_ALERT-UPDATE 2023-06-02T08:23:09.831463
alert_type : NEW_MESSAGE
server_tag : avogadro.physics.purdue.edu
False Alarm Prob : 0.00%
detector_names : ['XENONnT', 'DS-20K', 'KamLAND']
sent_time : 2023-06-02T08:23:09.831463
p_values : [None, None, None]
neutrino_times : ['2012-06-09T15:31:08.109876',
'2012-06-09T15:31:08.109876', '2012-06-09T15:31:08.891011']
p_values average : nan
sub list number : 2

snews-cs docs not building

@justinvasel created a branch and PR (#90) to clean up broken docs generation. I merged it, but I found offline that several of the pinned dependencies in doc/requirements.txt are not resolving properly, causing the sphinx documentation build to fail.

Single detector update triggers alert

When the same detector updates its own message it triggers an "update-alert" even if the first message is not in coincidence with any other detectors.

snews_pt run-scenarios

Then select the 5th option, "same detector submits within 1 sec", and watch the alerts.

We probably need a check within the update-alert function before triggering the alert.
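A minimal sketch of such a check (assuming the sublist rows carry detector_name):

def should_trigger_update_alert(sub_list) -> bool:
    # Refuse to fire an update alert when only one unique detector remains.
    unique_detectors = {msg["detector_name"] for msg in sub_list}
    return len(unique_detectors) > 1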

Implement a Kafka logging handler

Chris and I discussed this topic briefly today.

We should implement logging over Kafka: SNEWPlog. This is foundational for centralized monitoring/operations.
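A rough sketch of what such a handler could look like on top of the hop client (the topic URL and record format are assumptions):

import logging
from hop import Stream  # hopskotch client, already used by snews_cs

class KafkaLoggingHandler(logging.Handler):
    """Forward log records to a hopskotch topic."""

    def __init__(self, topic_url):
        super().__init__()
        self.producer = Stream().open(topic_url, "w")

    def emit(self, record):
        self.producer.write({"level": record.levelname,
                             "name": record.name,
                             "message": self.format(record)})

    def close(self):
        self.producer.close()
        super().close()

# Hypothetical topic name for SNEWPlog:
logging.getLogger().addHandler(
    KafkaLoggingHandler("kafka://kafka.scimma.org/snews.operations-logging"))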

Open Heartbeat Issues

There are still a few things that need to be addressed on the heartbeat side;

  • The warning frequency and possible duplicate warnings:
    We now send a warning after μ+3σ time has elapsed with no new HB from a given experiment. However, by definition this sends a warning roughly once every 100 heartbeats. On top of that, in the XENONnT tests I noticed that each time there is a warning, I get a second warning differing by milliseconds to seconds, which is not supposed to happen. (mentioned in #99)

  • FAR computation at the time of the alert, not at the time of the coincidence:
    The FAR should be computed using the number of detectors that were online at the time of the "coincidence". However, the script looks at the number of detectors with a HB in the HB cache (last 24 h), which means a detector could have sent its last heartbeat 20 h ago, still have a registered beat, and have been offline ever since. At the time of the coincidence it should not be counted. (mentioned in #100)

  • Heartbeat feedback request: this works as expected except for the attachment. For experiments to understand their latencies and cross-check their missed heartbeats, we wanted to send more information, including a figure showing the registered beats, status, and latencies. The figure is created, but it fails to be sent via email. We could also consider sending a csv file with the beats for that experiment within the last 24 h, for the user to play around with.

Change time format to ISO

Timestamps and datetime objects should be in ISO format.

Most of the work seems to be changing how variables are assigned.

Current method example

# Current time
t = datetime.utcnow().strftime("%y/%m/%d %H:%M:%S:%f")

# Converting strings
nu_t = self.times.str_to_datetime(message['neutrino_time'], fmt='%y/%m/%d %H:%M:%S:%f')

New method example

# Current time
t = datetime.utcnow().isoformat()

# Converting strings
nu_t = datetime.fromisoformat(message['neutrino_time'])

This makes TimeStuff() unneeded.

I've started the iso-time branch for this issue.

Append the observation messages to feedback mail

If the user has published an observation message within the last 24 hours, this information can also be added to the feedback email.

Following the discussion in #11 we figured that it is easier and more secure this way.

We need to write a Mongo query that checks whether there was an "Observation Message" within the last 24 hours when the user requests heartbeat feedback. We then add that information on top of the feedback and send the email (see the sketch below).

This way, the user can get a "delivered confirmation" without any danger.
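A minimal sketch of that query (the database, collection, and field names are assumptions):

from datetime import datetime, timedelta
from pymongo import MongoClient

def recent_observations(detector: str):
    # Observation messages from this detector in the last 24 hours;
    # ISO-formatted timestamps compare correctly as strings.
    db = MongoClient()["snews_db"]
    cutoff = (datetime.utcnow() - timedelta(hours=24)).isoformat()
    return list(db["observation_messages"].find({
        "detector_name": detector,
        "received_time": {"$gte": cutoff},
    }))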

Heartbeat Warning emails

Ever since XENONnT connected and started sending heartbeats, I have noticed that the skipped-heartbeat messages are always doubled.
For each skipped heartbeat we receive 2 emails, and the reported times differ from each other by 5-6 seconds.

There seems to be a problem with the hold-after-warning logic, or there is a bug in the code that sends warnings at two different levels.

Heartbeats -> SQLite transitioning

We agreed that it is better to keep things in a database such as SQLite rather than in csv files. This improves both access and storage. I opened this issue to keep track of the transition.

Following this upgrade, we might also rethink the heartbeat feedback messages and consider sending the data as a file when requested.
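A minimal sketch of the SQLite side (the schema is an assumption):

import sqlite3

con = sqlite3.connect("heartbeats.db")
con.execute("""CREATE TABLE IF NOT EXISTS heartbeats (
                   detector      TEXT,
                   received_time TEXT,
                   machine_time  TEXT,
                   status        TEXT)""")
con.execute("INSERT INTO heartbeats VALUES (?, ?, ?, ?)",
            ("XENONnT", "2023-06-02T08:23:09", "2023-06-02T08:23:08", "ON"))
con.commit()

# Access improves too, e.g. the last 24 h of beats for one experiment:
rows = con.execute(
    """SELECT * FROM heartbeats
       WHERE detector = ?
         AND received_time >= strftime('%Y-%m-%dT%H:%M:%S', 'now', '-1 day')""",
    ("XENONnT",)).fetchall()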

Replacing the snews format checker

With SNEWS2/SNEWS_Publishing_Tools#81 we are getting rid of the snews format checker, and instead validating the messages upon generation.

Previously we were using a script called the SNEWS format checker, where the neutrino times, p-values, etc. were checked for each message in a standalone script. The same script was used within the coincidence system to make sure that a message comes from snews_pt and is not a stray message injected into the stream, which could potentially crash the server if not ignored.

Now that we have gotten rid of that, we should either write something that checks similar properties of each received message, or again use snews_pt as a dependency, rebuild a new message, and check whether it validates itself (a minimal sketch of the first option is below).

I'm open to alternative suggestions.
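A minimal sketch of the first option (the required keys are assumptions):

# Reject stray messages before they can reach, and crash, the server logic.
REQUIRED_KEYS = {"_id", "schema_version", "detector_name", "neutrino_time"}

def looks_like_snews_pt(message) -> bool:
    return isinstance(message, dict) and REQUIRED_KEYS.issubset(message.keys())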

Test Connection Ideas

In 80bb6f3 I added a check for whether "test-connection" is passed in the '_id' field of an observation message.
I (stupidly) was printing it on the server, thinking this would serve as a confirmation message. Of course, it is printed on the server terminal, so the user still won't know whether their messages were received by the server.

We need to feed this back to the user. Without private channels, the only option I see is publishing a message in the observation(?) topic saying something like
"2022-04-14 15:45 connection test message received from 'TEST'"; the user would still be required to launch a second terminal and subscribe to something...

An alternative, and maybe unnecessarily sophisticated, solution would be (see the sketch after this list):

  • user submits a message via a function called test_connection()
  • test_connection() generates a hash and publishes {'_id': 'test-connection', 'hash': 'xsASgwe35s'}
  • test_connection() subscribes to a special test channel and listens
  • on the server, run_coincidence receives {'_id': 'test-connection', 'hash': 'xsASgwe35s'} and submits a message to this channel with that specific hash
  • test_connection() receives messages (possibly more than one, if others are also testing) and compares the hashes
  • if there is a hash_sent == hash_read match, it confirms that the message was received by the server.

Is this overkill? :D
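A sketch of that handshake with the stream calls abstracted away (publish() and subscribe() stand in for the hopskotch topics, and the confirmation _id is hypothetical):

import secrets

def test_connection(publish, subscribe) -> bool:
    token = secrets.token_hex(8)
    publish({"_id": "test-connection", "hash": token})
    for message in subscribe():  # the server echoes the hash on the test channel
        if message.get("_id") == "test-connection-confirmed" \
                and message.get("hash") == token:
            return True  # hash_sent == hash_read: the server saw our message
    return False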

Out of order coincidence

In #69 the "out of order" scenario does not trigger an alert, for an unknown reason.

I tested with the current branch and the scenario "3 messages, out of order" works as expected.

In this scenario we first receive a message from

  • DUNE at 12:34:55
  • DS-20k at 12:34:47
  • XENONnT at 12:34:45

and it should still trigger an alert: first after DS-20k, saying DUNE+DS-20k formed a coincidence, and then an update alert saying all three formed a coincidence.

While this works with the main branch, it stopped working with #69, although there is no obvious relation. We need to check what changed.

coincidence stream logic

Currently, we initiate a stream with the first incoming message and then look for coincidences in a given time window; if we receive something outside of this window, we close the first stream and start a new one with the later message.

I suggest we initiate a new stream for every message outside of the existing streams' time windows (a sketch follows the example), i.e.

  • message at t=0s

start a stream df1

  • message at t=12s

'outside of window, but don't kill the old one'
start a new stream df2

  • message at t=24s

'outside of the window, but don't kill the old one'
start a new stream df3

Later, say an experiment submits a message after half an hour, saying that they have now processed their data and have a signal at t=9 s, which forms a coincidence with df1.

Check the existing streams' time windows; if the message forms a coincidence with any of them, accept it and trigger.

Abandon the existing streams after ~an hour.

This way we allow experiments to publish their messages later than the detection time.
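A minimal sketch of the suggested multi-stream bookkeeping (the window and lifetime values follow the text; everything else is an assumption):

from datetime import timedelta

COINC_WINDOW = timedelta(seconds=10)   # coincidence search window
STREAM_LIFETIME = timedelta(hours=1)   # abandon streams after ~an hour

streams = []  # each stream is a list of message times sharing a window

def handle_message(msg_time, now):
    # Abandon streams older than the lifetime.
    streams[:] = [s for s in streams if now - s[0] < STREAM_LIFETIME]
    # Add the message to every stream whose window it falls into ...
    matched = False
    for s in streams:
        if abs(msg_time - s[0]) <= COINC_WINDOW:
            s.append(msg_time)   # coincidence: trigger/update an alert here
            matched = True
    # ... and otherwise seed a new stream so late-comers can still match.
    if not matched:
        streams.append([msg_time])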

Goals for firedrill

Week of Nov 27 or Dec 4
In addition to testing all the new code:

  • light curves (requires coordination with timing group)
  • Email output tests
  • Have some monitoring with minimal grafana pages
  • alert webpage
  • Distributed servers (stretch goal)

Update message triggers another alert even if sublist has single detector

In #36 I noticed another thing that would probably take a bit more work: after retracting a message, if a single detector remains, we send an alert with that detector and say that there was a retraction. This is the intended behavior, so that users know it used to be A & B, but B retracted and only A remained.
However, if A later updates its times or anything else, out goes another alert with only A in it. How critical is this? I think not very, but we should at least be aware of it.

Stamped times, are they TZ aware?

Thanks to @sybenzvi while working on #98 we noticed that tz aware and tz naive times would cause troubles when tracking the heartbeats from the last 24hr.

I'm opening this issue so we can confirm that every timestamp we use is in UTC (by convention, tz-naive).
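A minimal normalization helper that enforces that convention:

from datetime import datetime, timezone

def to_naive_utc(t: datetime) -> datetime:
    # Convert tz-aware stamps to UTC, then drop the tzinfo so that all
    # comparisons in the 24 h heartbeat tracking are like-for-like.
    if t.tzinfo is not None:
        t = t.astimezone(timezone.utc).replace(tzinfo=None)
    return t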

Make Alert Webpages

Alerts should each get a webpage. This allows updates to be written to that webpage, HEALPix maps to be posted, etc. Ideally an alert should send out one email/GCN/etc., all referring to this webpage. This prevents confusing spam as more info comes in.

There has to be a Kafka module that writes HTML based on what comes into a channel; that's how web interfaces to, say, Twitter work. So this issue is probably separate from CS, other than that it consumes the alert channel output, and the URL to build needs to be supplied by CS.

Dependencies are not automatically installed

I created a new environment and installed snews_cs using pip install -e ./. However, when I run snews_cs run-coincidence I get a ModuleNotFoundError: No module named 'pymongo' error. I had to install pymongo manually, even though it is listed under the install requirements.

What would happen in a World without Mongo?

Mongo started out life as the data structure in which we stored alerts to form a coincidence. Now we do that with Python data structures.

We log heartbeats to csv.

We log alerts to Mongo, but we reset Mongo each day due to our data-retention idea of "won't keep it for more than a day".

What if we also logged alerts to a flat file? We already have the machinery to purge old data from the heartbeats; we could re-use that for alerts too (a minimal sketch follows).

Not having Mongo eliminates a lot of complexity: don't have to install that, don't have to include it and its ports in the containers.
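A minimal sketch of flat-file alert logging (the path and format are assumptions):

import json

def log_alert(alert: dict, path: str = "alerts.jsonl"):
    # Append each alert as one JSON line; the existing heartbeat purge
    # machinery can then be re-used to drop entries older than a day.
    with open(path, "a") as f:
        f.write(json.dumps(alert) + "\n")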
