entab's People

Contributors: bovee, ethanbass, jnj16180340

entab's Issues

Entab parity

File parsers:

Other:

Methods:

It's not really clear to me how many of the non-file-reading functions people use(d) in Aston, so some (most?) of the method code could be copied straight over into the Python bindings and doesn't need to be rewritten in Rust.

Allow `read_into` parsers

It should be possible to reuse e.g. the FastaRecord from iteration to iteration to avoid an allocation each time, but I'm having lifetime issues trying to write a function that does this.

I tried the following in buffer.rs, but I kept hitting lifetime errors, with both the state and the ReadBuffer still mutably borrowed in the next iteration of the loop.

pub fn next_into<'n, T>(
    &'n mut self,
    record: &mut T,
    mut state: <T as FromSlice<'n>>::State,
) -> Result<bool, EtError>
where
    T: FromSlice<'n> + 'n,
{
...
}

bug in agilent .uv parser

I was investigating the UV parser more and I think there are still some problems. For example, I was trying to import a UV file from my lab and it looks pretty good for about the first 15 minutes, but then the baseline starts going all over the place. Any idea what might be going on? I'm attaching a picture of the entab imported file in black and the CSV I exported from chemstation in blue.
[image: entab import (black) vs. the ChemStation CSV export (blue)]

The example file that ships with entab doesn't look too good either:
[image: entab's bundled example file]

Below is the code to reproduce what I did in R. You can find the file I tried to convert and the CSV version here https://cornell.box.com/v/example-DAD-files .
Thanks!
Ethan

library(entab)
path <- "~/Library/CloudStorage/Box-Box/kessler-data/lactuca/botrytis_experiment/data/lettuce_roots/ETHAN_01_19_21 2021-01-20 00-27-52/679.D/dad1.uv"
r <- as.data.frame(Reader(path))
ch.entab <- data.frame(tidyr::pivot_wider(r, id_cols = "time",
                        names_from = "wavelength", values_from = "intensity"))

ch.csv <- read.csv("~/Library/CloudStorage/Box-Box/kessler-data/lactuca/botrytis_experiment/data/lettuce_roots/export3D/EXPORT3D_ETHAN_01_19_21 2021-01-20 00-27-52/679.CSV",
                   row.names = 1, header = TRUE,
                   fileEncoding = "utf-16", check.names = FALSE)
par(mfrow = c(1, 1))
matplot(ch.entab$time, ch.entab[, "X280"], type = "l", ylim = c(-100, 800))
matplot(ch.entab$time, ch.csv[, "280.00000"], type = "l", add = TRUE, lty = 2, col = "blue")
abline(v = 15, col = "red", lty = 3)

example_file <- as.data.frame(Reader("~/entab/entab/tests/data/carotenoid_extract.d/dad1.uv"))
df <- data.frame(tidyr::pivot_wider(example_file, id_cols = "time", names_from = "wavelength", values_from = "intensity"))
matplot(df$time, df$X280, type = "l")

entab doesn't process Thermo RAW from Xcalibur 3.1

Hello

I've used entab 0.3.1 on a RAW file generated by a Thermo LCMS using Xcalibur 3.1, but the data isn't getting processed. The file is opened and the columns are created, but the values are not generated. I've included a link to the original files and attached the output CSV (GitHub doesn't support uploading TSVs).

The command used to produce the file was:

entab -i angio_test.raw -p thermo_raw -o angio_test.tsv

Thanks!

PS

Thanks for making this wonderful software! It's changing our lab process entirely.

angio_test.csv

angio_test.raw

bug in thermo raw converter (or maybe missing UV functionality)?

Hi Roderick,
I received a Thermo RAW file from a new Twitter friend that we'd like to be able to convert with entab, but I'm running into an error. I believe the file should contain both MS and UV data. I'm not sure whether the UV data is the source of the problem or whether there's something else going on. Here is the link to the file: https://t.co/TeoGpYdxdx

I tried to call entab from the command line:

entab -i ~/Downloads/20211227_SAJ9571-F07.raw

and received the following error message:

thread 'main' panicked at 'range end index 40 out of range for slice of length 0', entab/src/parsers/thermo/thermo_raw.rs:384:38

I'm not sure what the note about the BACKTRACE means. Ideally we'd like to be able to extract the UV data from the file.

Thanks!
Ethan

Flexible Param input

A lot of parser states are parameterized. For example, both #12 and #13 require a file name so we can open another ReadBuffer to store in the state, TSV/CSV parsers can take delimiters and a number of header lines, and kmer parsers need a k.

Right now, we pass initial params into the state as a "P" object scoped to each state, which we also require to implement Default so that we can pass None across parsers. An alternate design could take an Into<P> object where P: Default + From<Params> for convenience (where Params is some kind of { filename: String, other_params: BTreeMap<&str, Value> }?).

Multithreaded support

I wrote some code that could take the unconsumed part of a ReadBuffer and allow iterating over it (and then repeat those two steps over and over again), which enabled very basic multithreaded support, but the API was pretty gross, and I'm not sure the multithreading was actually that efficient (it was about 4x slower than the normal Readers). It would be nice to support this in a more principled way.

    // imports needed by this snippet (entab's own types come from the crate)
    use std::fs::File;
    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::sync::Arc;

    #[cfg(feature = "std")]
    #[test]
    fn test_multithreaded_read() -> Result<(), EtError> {
        let f = File::open("./tests/data/test.fastq")?;
        let (mut rb, mut state) = init_state::<FastqState, _, _>(f, None)?;
        let seq_len = Arc::new(AtomicUsize::new(0));
        while let Some((slice, mut chunk)) = rb.next_chunk()? {
            let chunk = rayon::scope(|s| {
                while let Some(FastqRecord { sequence, .. }) =
                    chunk.next(slice, &mut state).map_err(|e| e.to_string())?
                {
                    let sl = seq_len.clone();
                    s.spawn(move |_| {
                        let _ = sl.fetch_add(sequence.len(), Ordering::Relaxed);
                    });
                }
                Ok::<_, String>(chunk)
            })?;
            rb.update_from_chunk(chunk);
        }
        assert_eq!(seq_len.load(Ordering::Relaxed), 250000);

        Ok(())
    }

Error installing entab-cli

This seems like a really useful project! Unfortunately, I have not yet been able to install the CLI. I get a number of errors when I run cargo install entab-cli as suggested in the README (reproduced below). I also get a similar set of errors when I try to install the R bindings. I am running macOS 12.2.1 (on an M1 Mac) with rustc 1.59.0. I have not really used Rust before, so I'm not sure if there's some obvious problem I might be missing. Please let me know if there's any further information I can provide. Thanks!

   Compiling entab-cli v0.2.2
error[E0432]: unresolved imports `clap::crate_authors`, `clap::crate_version`
 --> /Users/ethanbass/.cargo/registry/src/github.com-1ecc6299db9ec823/entab-cli-0.2.2/src/main.rs:7:12
  |
7 | use clap::{crate_authors, crate_version, App, Arg};
  |            ^^^^^^^^^^^^^  ^^^^^^^^^^^^^ no `crate_version` in the root
  |            |
  |            no `crate_authors` in the root

error: cannot determine resolution for the macro `crate_authors`
  --> /Users/ethanbass/.cargo/registry/src/github.com-1ecc6299db9ec823/entab-cli-0.2.2/src/main.rs:21:17
   |
21 |         .author(crate_authors!())
   |                 ^^^^^^^^^^^^^
   |
   = note: import resolution is stuck, try simplifying macro imports

error: cannot determine resolution for the macro `crate_version`
  --> /Users/ethanbass/.cargo/registry/src/github.com-1ecc6299db9ec823/entab-cli-0.2.2/src/main.rs:22:18
   |
22 |         .version(crate_version!())
   |                  ^^^^^^^^^^^^^
   |
   = note: import resolution is stuck, try simplifying macro imports

error[E0599]: no method named `about` found for struct `Arg` in the current scope
  --> /Users/ethanbass/.cargo/registry/src/github.com-1ecc6299db9ec823/entab-cli-0.2.2/src/main.rs:26:18
   |
26 |                 .about("Path to read; if not provided stdin will be used")
   |                  ^^^^^ method not found in `Arg<'_>`

error[E0599]: no method named `about` found for struct `Arg` in the current scope
  --> /Users/ethanbass/.cargo/registry/src/github.com-1ecc6299db9ec823/entab-cli-0.2.2/src/main.rs:32:18
   |
32 |                 .about("Path to write to; if not provided stdout will be used")
   |                  ^^^^^ method not found in `Arg<'_>`

error[E0599]: no method named `about` found for struct `Arg` in the current scope
  --> /Users/ethanbass/.cargo/registry/src/github.com-1ecc6299db9ec823/entab-cli-0.2.2/src/main.rs:38:18
   |
38 |                 .about("Parser to use [if not specified, file type will be auto-detected]")
   |                  ^^^^^ method not found in `Arg<'_>`

error[E0599]: no method named `about` found for struct `Arg` in the current scope
  --> /Users/ethanbass/.cargo/registry/src/github.com-1ecc6299db9ec823/entab-cli-0.2.2/src/main.rs:45:18
   |
45 |                 .about("Reports metadata about the file instead of the data itself"),
   |                  ^^^^^ method not found in `Arg<'_>`

Some errors have detailed explanations: E0432, E0599.
For more information about an error, try `rustc --explain E0432`.

Track `record_pos` in state

And have it reportable via StateMetadata? This would allow e.g. MS parsers to report the current "scan" record position in errors instead of how many individual m/z values they've emitted, and it may also make sense for e.g. kmer-based parsing.

New error parsing Chemstation 31 files

Hi Roderick,

I am getting a new error parsing a test file (https://github.com/ethanbass/chromConverterExtraTests/blob/main/inst/chemstation_31.uv) after updating Entab to the latest version. Entab was able to handle this file fine in a previous version (not sure exactly which one though...). This is the error I'm getting:

thread '<unnamed>' panicked at /Users/ethanbass/.cargo/registry/src/index.crates.io-6f17d22bba15001f/extendr-api-0.6.0/src/robj/into_robj.rs:64:13:
called `Result::unwrap()` on an `Err` value: Other("Chemstation 31 header needs to be at least 652 bytes long\n0000000000000000430761140008813E02E280500\n                                 C   v                 . (   P  \n                              ^^ 512\n")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at src/lib.rs:47:1:
explicit panic
Warning in read_chroms(path, progress_bar = FALSE, parser = "entab") :
  Error in .local(.Object, ...) : user function panicked: new

Thoughts?

Better support for multi-threading

I tried writing the following documentation on how to pass records off to e.g. a thread pool, but it doesn't work.

//! If you want to pass records off to a thread pool:
//! ```
//! # #[cfg(feature = "std")] {
//! use std::fs::File;
//! use rayon;
//! use entab::parsers::extract_opt;
//! use entab::readers::init_state;
//! use entab::readers::fastq::FastqRecord;
//!
//! let f = File::open("./tests/data/test.fastq")?;
//! let (mut rb, mut state) = init_state(f, None)?;
//! while let Some(slice) = rb.refill()? {
//!     let consumed = &mut 0;
//!     let eof = rb.eof;
//!     rayon::scope(|s| {
//!         while let Some(FastqRecord { id, ..}) = extract_opt(slice, eof, consumed, &mut state).unwrap() {
//!             s.spawn(move |_| {
//!                 println!("{}", id);
//!             });
//!         }
//!         Ok::<(), &str>(())
//!     });
//!     rb.consumed += *consumed;
//! }
//! # }
//! # use entab::EtError;
//! # Ok::<(), EtError>(())
//! ```

It threw the following errors:

error[E0499]: cannot borrow `state` as mutable more than once at a time
  --> src/lib.rs:51:83
   |
16 |       rayon::scope(|s| {
   |                     - has type `&Scope<'1>`
17 |           while let Some(FastqRecord { id, ..}) = extract_opt(slice, eof, consumed, &mut state).unwrap() {
   |                                                                                     ^^^^^^^^^^ `state` was mutably borrowed here in the previous iteration of the loop
18 | /             s.spawn(move |_| {
19 | |                 println!("{}", id);
20 | |             });
   | |______________- argument requires that `state` is borrowed for `'1`

error[E0503]: cannot use `rb.eof` because it was mutably borrowed
  --> src/lib.rs:49:15
   |
13 | while let Some(slice) = rb.refill()? {
   |                         ----------- borrow of `rb` occurs here
14 |     let consumed = &mut 0;
15 |     let eof = rb.eof;
   |               ^^^^^^ use of borrowed `rb`
16 |     rayon::scope(|s| {
17 |         while let Some(FastqRecord { id, ..}) = extract_opt(slice, eof, consumed, &mut state).unwrap() {
   |                                                             ----- borrow later captured here by closure

The ergonomics here are still pretty bad (e.g. it would be nice to be able to write rb.next_no_refill?), but that might be because I haven't written any crossbeam/rayon/etc. code lately.

Redo "generic Readers"

I'm not crazy about the current impl_reader! design for generating a "reader" for each file type; it would be nice if that could be replaced by a generic solution like Reader<T> (although we'd still need to impl a trait like RecordReader on it so that it can be boxed up for e.g. get_reader, I think?). A generic design like this could potentially also replace the experimental chunk interface for multi-threaded reading (new below is inspired by the init_state in that).

I wrote out the following before I started running into some lifetime issues; I think the type bounds for S and T probably need to be tweaked. Refactoring to next(&'r mut self) -> Result<Option<Vec<Value<'r>>>, EtError> breaks callers because they're no longer able to iterate over multiple records.

use core::fmt::Debug;
use core::marker::PhantomData;
use crate::parsers::FromSlice;
use crate::record::StateMetadata;
// (ReadBuffer, Value, EtError, RecordReader, and BTreeMap are also needed;
// exact paths omitted here)

/// A reader that abstracts over parsing files
#[derive(Debug)]
pub struct Reader<'r, S, T> where 
    S: Debug + 'r,
    T: Debug + FromSlice<'r, State=&'r mut S>,
{
    rb: ReadBuffer<'r>,
    state: S,
    returns: PhantomData<T>,
}

impl<'r, S, T> Reader<'r, S, T> where
    S: Debug + 'r,
    T: Debug + FromSlice<'r, State=&'r mut S> + Into<Vec<Value<'r>>>,
{
    /// Create a new Reader
    pub fn new<B, P>(data: B, params: Option<P>) -> Result<Self, EtError>
    where
        B: TryInto<ReadBuffer<'r>>,
        EtError: From<<B as TryInto<ReadBuffer<'r>>>::Error>,
        S: for<'a> FromSlice<'a, State = P>,
        P: Default,
    {
        let mut rb = data.try_into()?;
        if let Some(state) = rb.next::<S>(params.unwrap_or_default())? {
            Ok(Reader {
                rb,
                state,
                returns: PhantomData,
            })
        } else {
            Err(format!(
                "Could not initialize state {}",
                ::core::any::type_name::<S>()
            )
            .into())
        }
    }

    /// Get the next record from the `Reader`
    #[allow(clippy::should_implement_trait)]
    pub fn next(&mut self) -> Result<Option<T>, EtError> {
        // FIXME: fails on the next line because of lifetime issues with borrowing both `self.rb` and `self.state`
        self.rb.next::<T>(&mut self.state)
    }
}

impl<'r, S, T> RecordReader for Reader<'r, S, T> where
    S: Debug + StateMetadata + 'r,
    T: Debug + FromSlice<'r, State=&'r mut S> + Into<Vec<Value<'r>>>,
{
    fn next_record(&mut self) -> Result<Option<Vec<Value>>, EtError> {
        Ok(self.next()?.map(|v| v.into()))
    }

    fn headers(&self) -> Vec<String> {
        self.state.header().iter().map(|s| s.to_string()).collect()
    }

    fn metadata(&self) -> BTreeMap<String, Value> {
        self.state.metadata()
    }
}

issues installing entab-r on windows

I can't figure out how to install entab-r on Windows 10. I spent way too long today messing around with Makevars.win (https://github.com/ethanbass/entab/blob/windows_dev/entab-r/src/Makevars.win). The version I have now seems to compile, but then I get an error at the very end, "no DLL was created", causing the installation to fail. Maybe you have some insight, as someone who actually understands this stuff? I am basically going by trial and error, because I only have a very dim understanding of how these Makevars files work.

Here's the output from the installer:

rm -Rf entab.dll ../target/x86_64-pc-windows-gnu/release/libentab.a 
mkdir -p ../target/libgcc_mock
cd ../target/libgcc_mock && \
	touch gcc_mock.c && \
	gcc -c gcc_mock.c -o gcc_mock.o && \
	ar -r libgcc_eh.a gcc_mock.o && \
	cp libgcc_eh.a libgcc_s.a
C:\rtools42\x86_64-w64-mingw32.static.posix\bin\ar.exe: creating libgcc_eh.a
# CARGO_LINKER is provided in Makevars.ucrt for R >= 4.2
export PATH="/x86_64-w64-mingw32.static.posix/bin:/usr/bin:/c/Users/eb565/AppData/Local/Programs/R/R-42~1.1/bin/x64:/x86_64-w64-mingw32.static.posix/bin:/usr/bin:/usr/bin:/usr/bin:/x86_64-w64-mingw32.static.posix/bin:/usr/bin:/c/Users/eb565/AppData/Local/Programs/R/R-4.2.1/bin/x64:/c/Windows/System32:/c/Windows:/c/Windows/System32/wbem:/c/Windows/System32/WindowsPowerShell/v1.0:/c/Windows/System32/OpenSSH:/c/Program Files (x86)/AOMEI/AOMEI Backupper 6.4.0:/c/Users/eb565/.cargo/bin:/c/Users/eb565/AppData/Local/Programs/Python/Python37/Scripts:/c/Users/eb565/AppData/Local/Programs/Python/Python37:/c/Users/eb565/AppData/Local/Microsoft/WindowsApps:/c/Users/eb565/AppData/Local/GitHubDesktop/bin:/c/Users/eb565/AppData/Local/Programs/Git/cmd:/c/Users/eb565/AppData/Local/Programs/Microsoft VS Code/bin/:/c/Users/eb565/Documents/.cargo/bin" && \
export CARGO_TARGET_X86_64_PC_WINDOWS_GNU_LINKER="x86_64-w64-mingw32.static.posix-gcc.exe" && \
	export LIBRARY_PATH="${LIBRARY_PATH};/c/Users/eb565/AppData/Local/Temp/RtmpQprVeS/R.INSTALL24982fdd1daa/entab/src/../target/libgcc_mock" && \
	cargo build --target=x86_64-pc-windows-gnu --lib --release --manifest-path=../Cargo.toml --target-dir ../target
    Updating git repository `https://github.com/bovee/entab`
    Updating git repository `https://github.com/extendr/extendr/`
    Updating crates.io index
   Compiling jobserver v0.1.24
   Compiling proc-macro2 v1.0.42
   Compiling winapi-x86_64-pc-windows-gnu v0.4.0
   Compiling libc v0.2.126
   Compiling winapi v0.3.9
   Compiling quote v1.0.20
   Compiling unicode-ident v1.0.2
   Compiling pkg-config v0.3.25
   Compiling autocfg v1.1.0
   Compiling syn v1.0.98
   Compiling either v1.7.0
   Compiling glob v0.3.0
   Compiling encoding_index_tests v0.1.4
   Compiling serde_derive v1.0.140
   Compiling zstd-safe v2.0.6+zstd.1.4.7
   Compiling crc32fast v1.3.2
   Compiling serde v1.0.140
   Compiling extendr-engine v0.2.0 (https://github.com/extendr/extendr/?rev=1d2e87ed49a3e0e5c1a1a2df58140b3f7824fb87#1d2e87ed)
   Compiling memchr v2.5.0
   Compiling adler v1.0.2
   Compiling cfg-if v1.0.0
   Compiling extendr-api v0.2.0 (https://github.com/extendr/extendr/?rev=1d2e87ed49a3e0e5c1a1a2df58140b3f7824fb87#1d2e87ed)
   Compiling bytecount v0.6.3
   Compiling paste v1.0.7
   Compiling lazy_static v1.4.0
   Compiling cc v1.0.73
   Compiling itertools v0.9.0
   Compiling encoding-index-singlebyte v1.20141219.5
   Compiling encoding-index-korean v1.20141219.5
   Compiling encoding-index-japanese v1.20141219.5
   Compiling encoding-index-tradchinese v1.20141219.5
   Compiling encoding-index-simpchinese v1.20141219.5
   Compiling num-traits v0.2.15
   Compiling num-integer v0.1.45
   Compiling miniz_oxide v0.5.3
   Compiling encoding v0.2.33
   Compiling flate2 v1.0.24
   Compiling zstd-sys v1.4.18+zstd.1.4.7
   Compiling lzma-sys v0.1.19
   Compiling bzip2-sys v0.1.11+1.0.8
   Compiling libR-sys v0.2.2
   Compiling bzip2 v0.3.3
   Compiling extendr-macros v0.2.0 (https://github.com/extendr/extendr/?rev=1d2e87ed49a3e0e5c1a1a2df58140b3f7824fb87#1d2e87ed)
   Compiling xz2 v0.1.7
   Compiling zstd v0.5.4+zstd.1.4.7
   Compiling chrono v0.4.19
   Compiling entab v0.3.1 (https://github.com/bovee/entab#b4ea4cef)
   Compiling entab-r v0.3.1 (C:\Users\eb565\AppData\Local\Temp\RtmpQprVeS\R.INSTALL24982fdd1daa\entab)
    Finished release [optimized] target(s) in 38.93s
no DLL was created
ERROR: compilation failed for package 'entab'
* removing 'C:/Users/eb565/AppData/Local/Programs/R/R-4.2.1/library/entab'

Fix error location

With the big v0.3 rewrite, I changed how the parser advances, so the EtErrorContext::byte in most error messages no longer points to the correct position in the record, but rather to the first byte.

I think the best fix is to manually set byte (maybe in EtError::new?) when the error is created and to make sure the calling next function doesn't overwrite it while handling the error, but this involves updating every EtError constructor.

Closure-based reader?

When playing around with read_into, I wrote a reader that operates more functionally and could maybe be modified into something that does e.g. multithreaded map-reduce?

/// Apply `fxn` to each element in the data
#[doc(hidden)]
pub fn reduce<'r: 's, 's, E, T, D, F, P, TS, S>(data: D, init: S, params: Option<P>, fxn: F) -> Result<S, E>
where
    D: TryInto<ReadBuffer<'r>>,
    E: From<EtError>,
    F: Fn(S, &T) -> Result<S, E>,
    EtError: From<<D as TryInto<ReadBuffer<'r>>>::Error>,
    T: Default + FromSlice<'s, 's, State = TS>,
    TS: for<'a> FromSlice<'a, 'a, State = P> + 's,
    P: Default,
{
    let mut rb = data.try_into().map_err(|e| e.into())?;
    let mut user_state = init;
    let mut parser_state = match rb.next(&mut params.unwrap_or_default())? {
        Some(state) => state,
        None => {
            return Err(E::from(EtError::new("Could not initialize state {}")));
        },
    };
    let mut record = T::default();
    while unsafe { rb.next_into(&mut parser_state, &mut record)? } {
        user_state = fxn(user_state, &record)?;
    }
    Ok(user_state)
}

#[cfg(test)]
mod tests {
    use super::*;
    use parsers::fastq::FastqRecord;

    #[test]
    fn test_reduce() -> Result<(), EtError> {
        let data: &[u8] = include_bytes!("../tests/data/test.fastq");
        let count: usize = reduce(data, 0, None, |count, &FastqRecord { sequence, .. }| {
            Ok::<_, EtError>(count + sequence.len())
        })?;
        assert_eq!(count, 250000);
        Ok(())
    }
}

agilent UV converter reports wavelengths incorrectly

The Agilent UV converter seems to just repeat the first wavelength in the "wavelength" column, even though the intensity values are in fact for different wavelengths. For example, here are the first few lines of output from the example file: entab -i ~/entab/entab/tests/data/carotenoid_extract.d/dad1.uv

(I think these are the absorbance values for successive wavelengths, but they all say 200 under the wavelength column.)

time	wavelength	intensity
0.0013333333333333333	200	-15.6675
0.0013333333333333333	200	-31.805
0.0013333333333333333	200	-35.3695
0.0013333333333333333	200	-28.8505
0.0013333333333333333	200	-21.1925
0.0013333333333333333	200	-14.8475
0.0013333333333333333	200	-10.1245
0.0013333333333333333	200	-6.801
0.0013333333333333333	200	-4.542
0.0013333333333333333	200	-3.031
0.0013333333333333333	200	-2.035
0.0013333333333333333	200	-1.3765
0.0013333333333333333	200	-0.933
0.0013333333333333333	200	-0.6475

Python bindings - Datetime formatting panics

Sample file attached, but this probably applies to everything since NaiveDateTime has no time zone?

Environment

  • MacOS 12.3.1 arm64 built from master
  • Docker/linux/x86_64 in above w/ pip install
  • Linux/PopOS 20.04/x86_64 w/ pip install

Sample file

DAD1.UV.remove_the_txt.txt

Reproduce it

from entab import Reader
r = Reader(filename='DAD1.UV')
r.metadata

Backtrace

thread '<unnamed>' panicked at 'a Display implementation returned an error unexpectedly: Error', /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/alloc/src/string.rs:2406:14
stack backtrace:
 0:        0x105988a40 - std::backtrace_rs::backtrace::libunwind::trace::h449592924b3bd63f
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
 1:        0x105988a40 - std::backtrace_rs::backtrace::trace_unsynchronized::ha2aaeafed0c31c90
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
 2:        0x105988a40 - std::sys_common::backtrace::_print_fmt::h58db85a17304976f
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:66:5
 3:        0x105988a40 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h10cf06316d33e2a9
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:45:22
 4:        0x1059a3870 - core::fmt::write::h1faf18c959c3a8df
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/fmt/mod.rs:1190:17
 5:        0x105986308 - std::io::Write::write_fmt::h86ab231360bc97d2
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/io/mod.rs:1657:15
 6:        0x10598a5bc - std::sys_common::backtrace::_print::h771b4aab9b128422
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:48:5
 7:        0x10598a5bc - std::sys_common::backtrace::print::h637de99a9f76e8a7
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:35:9
 8:        0x10598a5bc - std::panicking::default_hook::{{closure}}::h36e628ffaf3cd44f
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:295:22
 9:        0x10598a234 - std::panicking::default_hook::h3ee1564a7544e58f
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:314:9
10:        0x10598ac1c - std::panicking::rust_panic_with_hook::h191339fbd2fe2360
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:698:17
11:        0x10598a9a4 - std::panicking::begin_panic_handler::{{closure}}::h91c230befd9929e3
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:588:13
12:        0x105988f28 - std::sys_common::backtrace::__rust_end_short_backtrace::haaaeebb1d37476b3
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:138:18
13:        0x10598a6e0 - rust_begin_unwind
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:584:5
14:        0x1059ac640 - core::panicking::panic_fmt::h4fe1013b011ef602
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/panicking.rs:143:14
15:        0x1059ac6e4 - core::result::unwrap_failed::hf608a47e6e04ea5d
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/result.rs:1749:5
16:        0x105937dc0 - core::result::Result<T,E>::expect::he5913cd1ad288b54
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/result.rs:1022:23
17:        0x10582a510 - <T as alloc::string::ToString>::to_string::h502c2311386b7f12
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/alloc/src/string.rs:2405:9
18:        0x105811714 - entab::py_from_value::hb1eb6451a367f81b
                             at /Users/nate/BioBright/BMS/Attune QC data/PT-2022-06-09T11.07.18.000/entab/entab-py/src/lib.rs:36:13
19:        0x105813408 - entab::Reader::get_metadata::hfe43c19c7467fb48
                             at /Users/nate/BioBright/BMS/Attune QC data/PT-2022-06-09T11.07.18.000/entab/entab-py/src/lib.rs:158:32
20:        0x1058295d4 - entab::<impl pyo3::class::impl_::PyMethods<entab::Reader> for pyo3::class::impl_::PyClassImplCollector<entab::Reader>>::py_methods::METHODS::__wrap::{{closure}}::hc2c8b9c79d0d57ac
                             at /Users/nate/BioBright/BMS/Attune QC data/PT-2022-06-09T11.07.18.000/entab/entab-py/src/lib.rs:94:1
21:        0x105830e24 - pyo3::callback::handle_panic::{{closure}}::h9c0e60d40a74b860
                             at /Users/nate/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.15.2/src/callback.rs:247:9
22:        0x105820070 - std::panicking::try::do_call::h47893b6f1394ee2e
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:492:40
23:        0x105820b4c - ___rust_try
24:        0x10581f478 - std::panicking::try::h08f3ea0876586e3a
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:456:19
25:        0x1058183b0 - std::panic::catch_unwind::h72128b255fb6372d
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panic.rs:137:14
26:        0x105830acc - pyo3::callback::handle_panic::he5a8dfdaab2835a2
                             at /Users/nate/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.15.2/src/callback.rs:245:24
27:        0x1058136c8 - entab::<impl pyo3::class::impl_::PyMethods<entab::Reader> for pyo3::class::impl_::PyClassImplCollector<entab::Reader>>::py_methods::METHODS::__wrap::h93cc9623eaf06290
                             at /Users/nate/BioBright/BMS/Attune QC data/PT-2022-06-09T11.07.18.000/entab/entab-py/src/lib.rs:94:1
28:        0x102da0124 - __PyObject_GenericGetAttrWithDict
29:        0x102d9f910 - _PyObject_GetAttr
30:        0x102e2a78c - __PyEval_EvalFrameDefault
31:        0x102e26778 - __PyEval_Vector
32:        0x102e266d0 - _PyEval_EvalCode
33:        0x102e23398 - _builtin_exec
34:        0x102d9af84 - _cfunction_vectorcall_FASTCALL
35:        0x102e2f420 - _call_function
36:        0x102e2cad0 - __PyEval_EvalFrameDefault
37:        0x102d6b1f8 - _gen_send_ex2
38:        0x102e28da4 - __PyEval_EvalFrameDefault
39:        0x102d6b1f8 - _gen_send_ex2
40:        0x102e28da4 - __PyEval_EvalFrameDefault
41:        0x102d6b1f8 - _gen_send_ex2
42:        0x102d6b400 - _gen_send
43:        0x102d5f9d0 - _method_vectorcall_O
44:        0x102e2f420 - _call_function
45:        0x102e2ca20 - __PyEval_EvalFrameDefault
46:        0x102e26778 - __PyEval_Vector
47:        0x102e2f420 - _call_function
48:        0x102e2cad0 - __PyEval_EvalFrameDefault
49:        0x102e26778 - __PyEval_Vector
50:        0x102e2f420 - _call_function
51:        0x102e2ca20 - __PyEval_EvalFrameDefault
52:        0x102e26778 - __PyEval_Vector
53:        0x102d58b34 - _method_vectorcall
54:        0x102e2f420 - _call_function
55:        0x102e2cb48 - __PyEval_EvalFrameDefault
56:        0x102e26778 - __PyEval_Vector
57:        0x102e2f420 - _call_function
58:        0x102e2ca20 - __PyEval_EvalFrameDefault
59:        0x102e26778 - __PyEval_Vector
60:        0x102e2f420 - _call_function
61:        0x102e2ca20 - __PyEval_EvalFrameDefault
62:        0x102e26778 - __PyEval_Vector
63:        0x102e2f420 - _call_function
64:        0x102e2ca20 - __PyEval_EvalFrameDefault
65:        0x102e26778 - __PyEval_Vector
66:        0x102d58b34 - _method_vectorcall
67:        0x102d569d8 - _PyVectorcall_Call
68:        0x102e2cd70 - __PyEval_EvalFrameDefault
69:        0x102e26778 - __PyEval_Vector
70:        0x102e2f420 - _call_function
71:        0x102e2cad0 - __PyEval_EvalFrameDefault
72:        0x102e26778 - __PyEval_Vector
73:        0x102e266d0 - _PyEval_EvalCode
74:        0x102e71720 - __PyRun_SimpleFileObject
75:        0x102e71274 - __PyRun_AnyFileObject
76:        0x102e8f3ac - _Py_RunMain
77:        0x102e8f82c - _pymain_main
78:        0x102e8f8a8 - _Py_BytesMain

Cause?

In this line:

d.format("%+").to_string().to_object(py)

d.format("%+") panics, probably because there is no timezone info?

Changing the line to d.format("%Y-%m-%dT%H:%M:%S%.f").to_string().to_object(py), i.e. removing the timezone part of ISO 8601, seems to fix it, and the output then matches entab-cli (e.g. entab -m -i DAD1.UV)

@bovee if this is acceptable, I'm happy to PR it

Rewrite to remove `refill`s from parsers

Right now there's a pattern we use like the following (simplified from the FASTA parser):

    let (start, end) = loop {
        if let Some(p) = memchr(...) {
            break ...;
        } else if rb.eof() {
            return Err(...);
        }
        rb.refill()?;
    };

The advantage of this is that we don't have to reparse the record if we need to refill (because we never exit and re-enter the parser once the buffer is refilled). The downside is that the lifetime of a record returned from the parser is pinned to right after refill is called: every time we parse, we could potentially wipe the underlying buffer and invalidate all previous records. Theoretically this saves time on every refill, but at the cost of code clarity and some lifetime constraints, and the savings may not be that great for larger buffers or for mmapped files.

If we moved the refill logic out of the parser into the caller and returned a sentinel value instead, we could simplify to:

    if let Some(p) = memchr(...) {
        (start, end) = ...
    } else if rb.eof() {
        return Err(...);
    } else {
        return Ok(Incomplete)
    }

This is much closer to how parser libraries like nom handle these situations so it probably won't have a noticeable performance impact? It would require a large rewrite of the parser code and all of the parsers, but I don't think many people are using entab right now that would object to the breakage.
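The sentinel could be a nom-style result type. A minimal sketch of the caller-driven refill contract (names here are hypothetical, not entab's actual API):

```rust
// Hypothetical sketch of the rewrite: the parser never refills, it just
// reports whether it found a complete record in the bytes it was given.
#[derive(Debug, PartialEq)]
enum ParseResult<T> {
    Complete(T),
    Incomplete,
}

// Find one newline-terminated record in `buf`; the caller owns the refill
// loop and calls back in with more data whenever it sees `Incomplete`.
fn next_record(buf: &[u8], eof: bool) -> Result<ParseResult<&[u8]>, String> {
    match buf.iter().position(|&b| b == b'\n') {
        Some(p) => Ok(ParseResult::Complete(&buf[..p])),
        None if eof => Err("record truncated at end of file".to_string()),
        None => Ok(ParseResult::Incomplete),
    }
}
```

Because the parser only ever borrows the slice it was handed, record lifetimes become ordinary borrows of the caller's buffer rather than being tied to the last internal refill.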

Memory corruption in R

There's an intermittent issue when using the R bindings where fields will be out-of-order and parse speed will be abnormally slow. From #28:

This also seems to have improved the issue I mentioned with retention times appearing in the wavelengths column (about 9/10 times). The weird part though (!?) is that this behavior is still happening about one tenth of the time, as in, if I repeatedly run the Reader on the same file. 🧐 (This seems to be independent of the file used).

I thought this might be an issue with R linking to an old version of the dynamic library, but it's also possible it's an issue with all the unsafe libR code or something else entirely.

Support XML

This is mostly necessary to support XML-based file formats like some of the Agilent MassHunter formats, mzML, etc.

There are a couple existing streaming XML Rust parsers that we could possibly wrap, but it may be "easy" enough to just write one on top of the existing ReadBuffer interface:
https://github.com/netvl/xml-rs
https://github.com/tafia/quick-xml

Passing a raw XML file into entab should probably result in a stream with fields like:

  • Materialized path to current node (Value::List([key, subkey, ...]))
  • Attributes for current node (Value::Record<string, string? value?>)
  • Text for current node (String?)

We may not want to actually do that though because it will probably require saving up all the data and emitting the nodes post-traversal (which isn't the most natural format to view and requires more memory).
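The three fields above could be sketched as a record type like this (names are hypothetical; entab's real output would use its own Value types):

```rust
// Sketch of the flattened row an XML stream could emit: one record per
// node, keyed by the materialized path from the root down to that node.
#[derive(Debug)]
struct XmlRecord {
    path: Vec<String>,                 // e.g. ["mzML", "run", "spectrum"]
    attributes: Vec<(String, String)>, // attributes on the current node
    text: Option<String>,              // text content of the node, if any
}
```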

"Unknown doesn't have a parser" is not a very helpful error message

We should maybe add more common magic sequences to filetype.rs?

Also for really unknown files, we should maybe capture the first 8 (?) bytes (change [Unknown](https://github.com/bovee/entab/blob/f4e0f3cb7ca4383145cd3575a0e80962d07dcdee/entab/src/filetype.rs#L79) to Unknown([u8; 8])?) and then rewrite the error message appropriately.
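A sketch of what carrying the magic bytes through to the error message could look like (`FileType` and `parser_error` here are stand-ins, not entab's actual code):

```rust
// Keep the first 8 bytes of an unrecognized file so the error message
// can show them instead of just saying "Unknown".
#[derive(Debug, PartialEq)]
enum FileType {
    Png,
    Unknown([u8; 8]),
}

fn parser_error(ft: &FileType) -> String {
    match ft {
        FileType::Unknown(magic) => {
            format!("no parser for file starting with bytes {:02x?}", magic)
        }
        other => format!("{:?} doesn't have a parser", other),
    }
}
```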

Chemstation .reg Format

I believe the core of the Chemstation .reg format is a CArchive https://learn.microsoft.com/en-us/cpp/mfc/reference/carchive-class?view=msvc-170 and here is a reader/writer implemented in
cpp: https://github.com/pixelspark/corespark/blob/e2aa78fe13e273fcc9bb2665ab4c700e89895741/Libraries/atlmfc/src/mfc/arcobj.cpp

The "data type numbers" are random: they depend on the order in which the different data objects are written to the archive. A class gets a number the first time an object of that class is written to the file.
See this code: https://github.com/pixelspark/corespark/blob/e2aa78fe13e273fcc9bb2665ab4c700e89895741/Libraries/atlmfc/src/mfc/arcobj.cpp#L259
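The numbering scheme can be sketched like this (a minimal illustration of the order-dependent assignment, not the actual CArchive wire format):

```rust
use std::collections::HashMap;

// Each class name gets the next free index the first time an object of
// that class is written, so the "type numbers" in a .reg file reflect
// write order rather than anything intrinsic to the class.
struct ClassMap {
    next_id: u32,
    ids: HashMap<String, u32>,
}

impl ClassMap {
    fn new() -> Self {
        ClassMap { next_id: 1, ids: HashMap::new() }
    }

    fn id_for(&mut self, class_name: &str) -> u32 {
        if let Some(&id) = self.ids.get(class_name) {
            return id;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.ids.insert(class_name.to_string(), id);
        id
    }
}
```

This is why the same class can show up with different numbers in different .reg files: the mapping has to be rebuilt while reading, in the same order the writer assigned it.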

Support interlaced PNGs

I think this is the biggest thing missing from the PNG parser and I just never got around to implementing it. Requires understanding exactly how Adam7 splits up the pixels across the IDAT stream.
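The pixel-to-pass mapping itself is just a lookup into the fixed 8x8 pattern from the PNG specification; the harder part is deinterlacing each pass's scanlines back into the full image. The lookup:

```rust
// The Adam7 pass pattern from the PNG spec: every 8x8 tile of the image
// is split into seven passes, and pass n's pixels are the cells marked n.
const ADAM7: [[u8; 8]; 8] = [
    [1, 6, 4, 6, 2, 6, 4, 6],
    [7, 7, 7, 7, 7, 7, 7, 7],
    [5, 6, 5, 6, 5, 6, 5, 6],
    [7, 7, 7, 7, 7, 7, 7, 7],
    [3, 6, 4, 6, 3, 6, 4, 6],
    [7, 7, 7, 7, 7, 7, 7, 7],
    [5, 6, 5, 6, 5, 6, 5, 6],
    [7, 7, 7, 7, 7, 7, 7, 7],
];

/// Which of the seven Adam7 passes pixel (x, y) belongs to.
fn adam7_pass(x: usize, y: usize) -> u8 {
    ADAM7[y % 8][x % 8]
}
```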

Deploy an example of the JS code

It would be cool to have an actual website up and running with the parser in it for demos, etc. We could do this with a gh-pages publishing action? e.g.

    - uses: peaceiris/actions-gh-pages@v3
      with:
        github_token: ${{ secrets.GITHUB_TOKEN }}
        publish_dir: ./public

-o option doesn't seem to work on CLI

It isn't too big a deal, since you can just use a pipe to save the output from the entab function, but I can't get the -o option to work. For example, if I run:
entab -i ~/entab/entab/tests/data/carotenoid_extract.d/dad1.uv -o test.txt

I get back:

No such file or directory (os error 2)

#####

I'm also sometimes getting:
Bad file descriptor (os error 9)

(I also get the same results if I write a full path after the -o.)

However, it works fine if I pipe it:
entab -i ~/entab/entab/tests/data/carotenoid_extract.d/dad1.uv > test.txt

Support Shimadzu LCD format

This is an LC-MS binary format generated by LabSolutions. There's also a Shimadzu QGD format for GC-MS data; I'm not sure how much crossover there is between these. There don't appear to be any other open-source parsers for either.

Support JSON

Not really a priority, but a solution will probably look like #8 a lot.

How to access readers in Python?

This is probably a stupid question, but I'm someone who was previously leveraging the Aston package and needs to update to this because of Aston's requirement for an older, unsupported Python version....

How do I access the readers in entab? I don't see any section in the README to help illustrate how someone might migrate from Aston to this new package. I used to have code like:

    def test_aston_import():
        from aston.tracefile.agilent_uv import AgilentMWD2

        instance = AgilentMWD2()
        assert isinstance(instance, AgilentMWD2)

but one cannot simply replace the "aston" with "entab". Any insight would be welcome. Thank you!
