mit-lcp / downcast

Tools for unpacking and converting data from the DWC system

License: GNU General Public License v3.0

Downcast
--------

This repository contains tools for processing and converting data from
the DWC system into WFDB and other open formats.


Requirements
------------

Python 3.4 or later is required.  A Unix-like platform is required -
Debian and CentOS have been tested; Mac OS might work as well.  This
package will not work on Windows.

For processing data in BCP format, the ply package is required.

For processing data directly from SQL Server, the pymssql package is
required.  (This package is now mostly abandoned and should probably
be replaced with a different backend.)


Quick start
-----------

If you have access to the demo DWC database, download and unpack these
files (about 30 GB uncompressed).  You will then need to create a
"server.conf" file, which should look like this:

[demo]
type = bcp
bcp-path = /home/user/dwc-demo

(where /home/user/dwc-demo is the directory containing "Alert.dat",
"Alert.fmt", etc.)  See server.conf.example for other examples.

The demo database spans the time period from 1:00 AM EDT on October
31, 2004, to midnight EST on November 1.  To parse and convert a slice
of the data (say, from 10:00 to 10:05 AM), first we initialize an
output directory and set the starting time:

  $ ./downcast.py --init --server demo \
                  --output-dir /home/user/dwc-test-output \
                  --start "2004-10-31 10:00:00.000 -05:00"

Then run a batch conversion while specifying the end time:

  $ ./downcast.py --batch --server demo \
                  --output-dir /home/user/dwc-test-output \
                  --end "2004-10-31 10:05:00.000 -05:00"

If we wanted to keep going, we could run the same --batch command
again, increasing the end timestamp each time.  We don't need to
specify the starting timestamp for --batch, since the "current"
timestamp is saved automatically.
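
As a rough sketch of that incremental workflow, the following Python
snippet builds one --batch command line per step.  It uses only the
options shown above; the helper names (format_timestamp,
batch_commands) and the five-minute step are illustrative assumptions,
not part of downcast.

```python
from datetime import datetime, timedelta, timezone


def format_timestamp(dt):
    """Format an aware datetime like '2004-10-31 10:05:00.000 -05:00'."""
    offset_minutes = int(dt.utcoffset().total_seconds() // 60)
    sign = "-" if offset_minutes < 0 else "+"
    hours, minutes = divmod(abs(offset_minutes), 60)
    return "{}.{:03d} {}{:02d}:{:02d}".format(
        dt.strftime("%Y-%m-%d %H:%M:%S"),
        dt.microsecond // 1000, sign, hours, minutes)


def batch_commands(first_end, step, count, server, output_dir):
    """Build one --batch argument list per step; nothing is executed here."""
    commands = []
    for i in range(count):
        end = first_end + step * i
        commands.append(["./downcast.py", "--batch", "--server", server,
                         "--output-dir", output_dir,
                         "--end", format_timestamp(end)])
    return commands


EST = timezone(timedelta(hours=-5))
cmds = batch_commands(datetime(2004, 10, 31, 10, 5, tzinfo=EST),
                      timedelta(minutes=5), 3,
                      "demo", "/home/user/dwc-test-output")
for argv in cmds:
    print(" ".join(argv))
    # To actually run a step, pass argv to subprocess.check_call().
```

Each argument list can be handed to subprocess.check_call in order;
the start time is not repeated because it was set by --init.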

To "finalize" the output (and forcibly truncate all patient records at
the specified end time), we use the --terminate option.  This wouldn't
be done for a real database conversion, but it's useful for a simple
test:

  $ ./downcast.py --batch --server demo \
                  --output-dir /home/user/dwc-test-output \
                  --end "2004-10-31 10:05:00.000 -05:00" \
                  --terminate

This should result in a bunch of patient records in WFDB format,
stored in /home/user/dwc-test-output.

downcast's Issues

Correct signal file sample range

In some cases, the stated sample range for a signal (scale_lower/scale_upper) is flat-out wrong. ECG signals in particular are often wrong.

WaveSampleHandler should calculate and report an accurate sample range (adcres/adczero) for each segment. This is required in order to correctly convert the record to other formats (e.g., using wfdb2mat.)
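
One possible way to derive these values from the observed samples is sketched below. It assumes that adcres is a resolution in bits and adczero is the midpoint of the sample range, as in WFDB header files; the function name sample_range is hypothetical and this is not WaveSampleHandler's actual code.

```python
def sample_range(samples):
    """Return (adcres, adczero) covering every value in samples.

    Assumption: the representable range is
    [adczero - 2**(adcres-1), adczero + 2**(adcres-1) - 1].
    """
    lo, hi = min(samples), max(samples)
    adczero = (lo + hi + 1) // 2
    adcres = 1
    while True:
        half = 1 << (adcres - 1)
        if adczero - half <= lo and hi <= adczero + half - 1:
            return adcres, adczero
        adcres += 1


print(sample_range([0, 255]))        # (8, 128)
print(sample_range([-100, 0, 300]))  # (9, 100)
```

The range would have to be computed per segment from the samples actually written, rather than taken from scale_lower/scale_upper.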

Finalization crashes if --output-dir is a relative path

If --output-dir is specified as a relative path, e.g.:

downcast.py --init --server demo --output-dir example-output \
            --start '2004-10-31 10:00:00.000 -05:00'
downcast.py --batch --server demo --output-dir example-output \
            --end '2004-10-31 10:05:00.000 -05:00' --terminate

then something in the finalization process crashes:

  File "/home/benjamin/downcast/downcast/subprocess.py", line 283, in _main1
    self.handler.flush()
  File "/home/benjamin/downcast/downcast/dispatcher.py", line 151, in flush
    self._handler_flush(h)
  File "/home/benjamin/downcast/downcast/dispatcher.py", line 313, in _handler_flush
    handler.flush()
  File "/home/benjamin/downcast/downcast/output/waveforms.py", line 156, in flush
    self.archive.flush()
  File "/home/benjamin/downcast/downcast/output/archive.py", line 165, in flush
    rec.flush(self.deterministic_output)
  File "/home/benjamin/downcast/downcast/output/archive.py", line 271, in flush
    deterministic = deterministic)
  File "/home/benjamin/downcast/downcast/output/archive.py", line 297, in _write_state_file
    os.rename(tmpfname, fname)
FileNotFoundError: [Errno 2] No such file or directory: 'example-output/3d/demo_3d97e525-d794-4aa8-82e8-8821b8da12b4_20041031-1500/__phi_properties.tmp' -> 'example-output/3d/demo_3d97e525-d794-4aa8-82e8-8821b8da12b4_20041031-1500/_phi_properties'

It doesn't do this if the output directory is an absolute path.

This is bizarre. Nothing in the entire package calls chdir, so why should it matter if the path is absolute or relative?
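
Until the cause is found, one possible workaround (not a diagnosis) is to resolve the output directory to an absolute path once at startup, so that later file operations no longer depend on the working directory. The helper name resolve_output_dir is hypothetical.

```python
import os


def resolve_output_dir(path):
    """Return an absolute version of a possibly relative --output-dir."""
    return os.path.abspath(path)


resolved = resolve_output_dir("example-output")
print(resolved)
```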

Generate unique signal names

In some strange cases we might see two simultaneous waveforms with the same label. WFDB requires that each signal in a multi-segment record has a unique name.
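
A minimal sketch of one way to do this is to append a numeric suffix to repeated labels. The suffix scheme ("_2", "_3", ...) and the function name are assumptions for illustration, not an established naming convention.

```python
def unique_signal_names(labels):
    """Return names in the same order, with repeats made distinct."""
    used = set()
    result = []
    for label in labels:
        name = label
        n = 1
        # Keep bumping the suffix until the name has not been used yet.
        while name in used:
            n += 1
            name = "{}_{}".format(label, n)
        used.add(name)
        result.append(name)
    return result


print(unique_signal_names(["II", "II", "V", "II"]))
# ['II', 'II_2', 'V', 'II_3']
```

Note that names would need to stay stable across segments of the same record, which this per-call sketch does not address.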

Correct signal file checksums

Currently, WaveSampleHandler will set all signal checksums to zero. The checksums should instead be set to the sum of all samples in the segment.
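
A sketch of the calculation follows. As far as I understand the WFDB header format, the checksum field is a 16-bit signed value, so the sum is wrapped accordingly here; that wrap, and the function name, should be treated as assumptions to verify against the WFDB documentation.

```python
def signal_checksum(samples):
    """Sum all samples and wrap the result to a signed 16-bit integer."""
    total = sum(samples) & 0xFFFF
    return total - 0x10000 if total >= 0x8000 else total


print(signal_checksum([1, 2, 3]))   # 6
print(signal_checksum([32767, 1]))  # -32768
```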

Negative clock adjustment coincides with DST transition

At fall transition time, when the clock switches from wrongly-labelled winter time to correctly-labelled winter time, there is often a small negative clock adjustment (after fixing the broken timestamps).

For example, the raw data might look like this:

TimeStamp                        SequenceNumber
2020-11-01 01:59:59.123 -05:00   657489599123
2020-11-01 01:00:04.218 -05:00   657489604243

Clearly there's no discontinuity here and the first message is mislabelled as -05:00 when it should be -04:00. But the delta in TimeStamp is only 5095 milliseconds vs. a delta in SequenceNumber of 5120.
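
The arithmetic can be checked with a few lines of Python, assuming (as the comparison above implies) that SequenceNumber counts milliseconds. The first timestamp is entered with the corrected -04:00 offset.

```python
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5))
EDT = timezone(timedelta(hours=-4))

# First message, relabelled from -05:00 to the offset it should have had.
first = datetime(2020, 11, 1, 1, 59, 59, 123000, tzinfo=EDT)
# Second message, already correctly labelled.
second = datetime(2020, 11, 1, 1, 0, 4, 218000, tzinfo=EST)

timestamp_delta_ms = (second - first) // timedelta(milliseconds=1)
sequence_delta_ms = 657489604243 - 657489599123
adjustment_ms = timestamp_delta_ms - sequence_delta_ms

print(timestamp_delta_ms, sequence_delta_ms, adjustment_ms)
# 5095 5120 -25
```

So the wall clock moved back by 25 ms relative to the sequence counter across the transition.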

There even seems to be a clock adjustment in those rare cases that DWC labels the summer timestamps correctly.

Normally a negative clock adjustment creates ambiguity, but in this case it seems it might be possible to disambiguate based on the timezone. Need to investigate further.

(There is usually no adjustment when the clock switches from correctly-labelled summer time to wrongly-labelled winter time. So the one-hour correction is right.)

Finalizing record at end of patient stay

When a patient is discharged, we need to mark the record as finalized.

Currently, we finalize a record automatically when there is a gap - i.e., some period of time when no new messages are seen, then a new message appears - but when the patient is discharged, there are no new messages, so this never happens.

(For testing, we can force all records to be finalized by using the --terminate argument, but that's no good for "real" conversion.)

The tricky thing is that since we are processing messages in parallel, it's hard to say which worker process is responsible for finalizing the record.

One way to deal with this would be to periodically ask, "What is the earliest unprocessed message in any queue?" Call that timestamp T_next. Then, if there are any unfinalized records for which the last processed message is earlier than (T_next - split_interval), those records should be finalized.

I don't think there's a good way to do this without stopping and then restarting all of the worker processes, but we don't need to do so frequently - doing it once for every 3 hours of data should be quite adequate.
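
The check itself could look roughly like the sketch below. The data structures (a list of per-queue head timestamps, a dict of last-processed timestamps per record) and the one-hour split_interval are assumptions made for illustration; they are not how the dispatcher currently stores its state.

```python
from datetime import datetime, timedelta


def records_to_finalize(queue_heads, last_processed, split_interval):
    """Return names of records whose last message predates the cutoff.

    queue_heads: earliest unprocessed timestamp per queue (None if empty).
    last_processed: record name -> timestamp of its last processed message.
    """
    pending = [t for t in queue_heads if t is not None]
    if not pending:
        # With no unprocessed messages, T_next is unknown; finalize nothing.
        return []
    t_next = min(pending)
    cutoff = t_next - split_interval
    return sorted(name for name, t in last_processed.items() if t < cutoff)


heads = [datetime(2004, 10, 31, 13, 0), datetime(2004, 10, 31, 13, 2), None]
last = {"rec_a": datetime(2004, 10, 31, 9, 30),
        "rec_b": datetime(2004, 10, 31, 12, 59)}
print(records_to_finalize(heads, last, timedelta(hours=1)))
# ['rec_a']
```

This only decides which records are due; coordinating the worker processes so that exactly one of them performs the finalization is the harder part described above.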

Handling of "delayed" numerics

Some numeric values (in particular, NBP) have multiple time values and we need to understand what they mean and how to use them.

- TimeStamp seems to have one-second resolution.
- SequenceNumber seems to have 5120-ms resolution.
- Often the two values are wildly different (TimeStamp could be hours earlier).
- Often the same measurement appears multiple times with the same TimeStamp and differing SequenceNumber.

I am guessing, actually, that the TimeStamp is pretty meaningless - that it refers to the time when the measurement was first "requested" rather than when it was actually performed. I'm guessing that the SequenceNumber tells us when the measurement was reported, which might be a few seconds after it was measured.

It might be helpful to hear from somebody who is familiar with using these machines:

- Is NBP measured automatically (on a schedule), or does the nurse press a button to initiate the measurement? Or both?
- How long does it usually take (from inflating the pressure cuff, to deflating it, to when the NBP measurements appear on screen)?
- How long do the values stay on screen afterwards?

Generate multisegment record

When a record is finalized, we need to generate a multi-segment header file for it.

Up until now I've been doing this by hand using a hacked version of 'wfdbjoin', but this should ideally be done by the WaveSampleHandler itself.

This requires re-reading the segment header files, creating a layout header with the composite signal information, and creating a master header containing the names and lengths of the segments (and gaps, if any).
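
A sketch of the master header step is shown below, based on my reading of the WFDB multi-segment header layout: a record line of the form "name/nsegments nsig fs nsamples", a zero-length layout segment listed first, and "~" standing for a gap. The "_layout" naming, the function name, and the example values are assumptions, and the exact format should be checked against the WFDB header documentation before relying on it.

```python
def master_header(record, nsig, fs, segments):
    """Build master header text for a variable-layout record.

    segments: list of (name, length) pairs; a name of None marks a gap.
    """
    total = sum(length for _, length in segments)
    # The layout segment is counted as one extra, zero-length segment.
    lines = ["{}/{} {} {} {}".format(record, len(segments) + 1,
                                     nsig, fs, total),
             "{}_layout 0".format(record)]
    for name, length in segments:
        lines.append("{} {}".format("~" if name is None else name, length))
    return "\n".join(lines) + "\n"


text = master_header("demo", 2, 125,
                     [("demo_0001", 1000), (None, 250), ("demo_0002", 500)])
print(text, end="")
```

Building the layout header itself additionally requires merging the signal specifications from every segment header, which is not covered here.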
