entab's People

Contributors: bovee, ethanbass, jnj16180340

entab's Issues

Entab parity

File parsers:

Other:

Methods:

It's not really clear to me how many of the non-file-reading functions people use(d) in Aston, so some (most?) of the method code could be copied straight over into the Python bindings and doesn't need to be rewritten in Rust.

Allow `read_into` parsers

It should be possible to reuse e.g. the FastaRecord from iteration to iteration to avoid an allocation each time, but I'm having lifetime issues trying to write a function that does this.

I tried the following in buffer.rs, but I kept hitting lifetime errors, with both the state and the ReadBuffer still mutably borrowed in the next iteration of the loop.

pub fn next_into<'n, T>(
    &'n mut self,
    record: &mut T,
    mut state: <T as FromSlice<'n>>::State,
) -> Result<bool, EtError>
where
    T: FromSlice<'n> + 'n,
{
...
}

bug in agilent .uv parser

I was investigating the UV parser more and I think there are still some problems. For example, I was trying to import a UV file from my lab and it looks pretty good for about the first 15 minutes, but then the baseline starts going all over the place. Any idea what might be going on? I'm attaching a picture of the entab imported file in black and the CSV I exported from chemstation in blue.
[image: entab import (black) vs. the ChemStation CSV export (blue)]

The example file that ships with entab doesn't look too good either:
[image: entab's bundled example file]

Below is the code to reproduce what I did in R. You can find the file I tried to convert and the CSV version here https://cornell.box.com/v/example-DAD-files .
Thanks!
Ethan

library(entab)
path <- "~/Library/CloudStorage/Box-Box/kessler-data/lactuca/botrytis_experiment/data/lettuce_roots/ETHAN_01_19_21 2021-01-20 00-27-52/679.D/dad1.uv"
r <- as.data.frame(Reader(path))
ch.entab <- data.frame(tidyr::pivot_wider(r, id_cols = "time",
                        names_from = "wavelength", values_from = "intensity"))

ch.csv <- read.csv("~/Library/CloudStorage/Box-Box/kessler-data/lactuca/botrytis_experiment/data/lettuce_roots/export3D/EXPORT3D_ETHAN_01_19_21 2021-01-20 00-27-52/679.CSV",
                   row.names = 1, header = TRUE,
                   fileEncoding = "utf-16", check.names = FALSE)
par(mfrow = c(1, 1))
matplot(ch.entab$time, ch.entab[, "X280"], type = "l", ylim = c(-100, 800))
matplot(ch.entab$time, ch.csv[, "280.00000"], type = "l", add = TRUE, lty = 2, col = "blue")
abline(v = 15, col = "red", lty = 3)

example_file <- as.data.frame(Reader("~/entab/entab/tests/data/carotenoid_extract.d/dad1.uv"))
df <- data.frame(tidyr::pivot_wider(example_file, id_cols = "time", names_from = "wavelength", values_from = "intensity"))
matplot(df$time, df$X280, type = "l")

entab doesn't process Thermo RAW from Xcalibur 3.1

Hello

I've used entab 0.3.1 on a RAW file generated by a Thermo LCMS using Xcalibur 3.1, but the data isn't getting processed. The file is opened and the columns are created, but the values are not generated. I've included a link to the original files and attached the output CSV (GitHub doesn't support uploading TSVs).

The command used to produce the file was:

entab -i angio_test.raw -p thermo_raw -o angio_test.tsv

Thanks!

PS

Thanks for making this wonderful software! It's changing our lab process entirely.

angio_test.csv

angio_test.raw

bug in thermo raw converter (or maybe missing UV functionality)?

Hi Roderick,
I received a Thermo RAW file from a new Twitter friend that we'd like to be able to convert with entab, but I'm running into an error. I believe the file should contain both MS and UV data. I'm not sure whether the UV data is the source of the problem or whether there's something else going on. Here is the link to the file: https://t.co/TeoGpYdxdx

I tried to call entab from the command line:

entab -i ~/Downloads/20211227_SAJ9571-F07.raw

and received the following error message:

thread 'main' panicked at 'range end index 40 out of range for slice of length 0', entab/src/parsers/thermo/thermo_raw.rs:384:38

I'm not sure what the note about the BACKTRACE means. Ideally we'd like to be able to extract the UV data from the file.

Thanks!
Ethan

Flexible Param input

A lot of parser states are parameterized. For example, both #12 and #13 require a file name so we can open another ReadBuffer to store in the state, TSV/CSV parsers can take delimiters and a number of header lines, and kmer parsers need a k.

Right now, we pass initial params into the state as a "P" object scoped to each state, which we also require to implement Default so that we can pass None across parsers. An alternate design could take an Into<P> object where P: Default + From<Params> for convenience (where Params is some kind of { filename: String, other_params: BTreeMap<&str, Value> }?).

Multithreaded support

I wrote some code that could take the unconsumed part of a ReadBuffer and allow iterating over it (and then repeat those two steps over and over again), which enabled very basic multithreaded support, but the API was pretty gross, and I'm not sure the multithreading was actually that efficient (it was about 4x slower than the normal Readers). It would be nice to support this in a more principled way.

    // imports needed by this snippet (entab's own types come from the crate)
    use std::fs::File;
    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::sync::Arc;

    #[cfg(feature = "std")]
    #[test]
    fn test_multithreaded_read() -> Result<(), EtError> {
        let f = File::open("./tests/data/test.fastq")?;
        let (mut rb, mut state) = init_state::<FastqState, _, _>(f, None)?;
        let seq_len = Arc::new(AtomicUsize::new(0));
        while let Some((slice, mut chunk)) = rb.next_chunk()? {
            let chunk = rayon::scope(|s| {
                while let Some(FastqRecord { sequence, .. }) =
                    chunk.next(slice, &mut state).map_err(|e| e.to_string())?
                {
                    let sl = seq_len.clone();
                    s.spawn(move |_| {
                        let _ = sl.fetch_add(sequence.len(), Ordering::Relaxed);
                    });
                }
                Ok::<_, String>(chunk)
            })?;
            rb.update_from_chunk(chunk);
        }
        assert_eq!(seq_len.load(Ordering::Relaxed), 250000);

        Ok(())
    }

Error installing entab-cli

This seems like a really useful project! Unfortunately, I have not yet been able to install the CLI. I get a number of errors when I run cargo install entab-cli as suggested in the README (reproduced below). I also get a similar set of errors when I try to install the R bindings. I am running macOS 12.2.1 (on an M1 Mac) with rustc 1.59.0. I have not really used Rust before, so I'm not sure if there's some obvious problem I might be missing. Please let me know if there's any further information I can provide. Thanks!

   Compiling entab-cli v0.2.2
error[E0432]: unresolved imports `clap::crate_authors`, `clap::crate_version`
 --> /Users/ethanbass/.cargo/registry/src/github.com-1ecc6299db9ec823/entab-cli-0.2.2/src/main.rs:7:12
  |
7 | use clap::{crate_authors, crate_version, App, Arg};
  |            ^^^^^^^^^^^^^  ^^^^^^^^^^^^^ no `crate_version` in the root
  |            |
  |            no `crate_authors` in the root

error: cannot determine resolution for the macro `crate_authors`
  --> /Users/ethanbass/.cargo/registry/src/github.com-1ecc6299db9ec823/entab-cli-0.2.2/src/main.rs:21:17
   |
21 |         .author(crate_authors!())
   |                 ^^^^^^^^^^^^^
   |
   = note: import resolution is stuck, try simplifying macro imports

error: cannot determine resolution for the macro `crate_version`
  --> /Users/ethanbass/.cargo/registry/src/github.com-1ecc6299db9ec823/entab-cli-0.2.2/src/main.rs:22:18
   |
22 |         .version(crate_version!())
   |                  ^^^^^^^^^^^^^
   |
   = note: import resolution is stuck, try simplifying macro imports

error[E0599]: no method named `about` found for struct `Arg` in the current scope
  --> /Users/ethanbass/.cargo/registry/src/github.com-1ecc6299db9ec823/entab-cli-0.2.2/src/main.rs:26:18
   |
26 |                 .about("Path to read; if not provided stdin will be used")
   |                  ^^^^^ method not found in `Arg<'_>`

error[E0599]: no method named `about` found for struct `Arg` in the current scope
  --> /Users/ethanbass/.cargo/registry/src/github.com-1ecc6299db9ec823/entab-cli-0.2.2/src/main.rs:32:18
   |
32 |                 .about("Path to write to; if not provided stdout will be used")
   |                  ^^^^^ method not found in `Arg<'_>`

error[E0599]: no method named `about` found for struct `Arg` in the current scope
  --> /Users/ethanbass/.cargo/registry/src/github.com-1ecc6299db9ec823/entab-cli-0.2.2/src/main.rs:38:18
   |
38 |                 .about("Parser to use [if not specified, file type will be auto-detected]")
   |                  ^^^^^ method not found in `Arg<'_>`

error[E0599]: no method named `about` found for struct `Arg` in the current scope
  --> /Users/ethanbass/.cargo/registry/src/github.com-1ecc6299db9ec823/entab-cli-0.2.2/src/main.rs:45:18
   |
45 |                 .about("Reports metadata about the file instead of the data itself"),
   |                  ^^^^^ method not found in `Arg<'_>`

Some errors have detailed explanations: E0432, E0599.
For more information about an error, try `rustc --explain E0432`.

Track `record_pos` in state

And have it reportable via StateMetadata? This would allow e.g. MS parsers to report the current "scan" record position in errors instead of how many individual m/z values they've emitted, and it may also make sense for e.g. kmer-based parsing.

New error parsing Chemstation 31 files

Hi Roderick,

I am getting a new error parsing a test file (https://github.com/ethanbass/chromConverterExtraTests/blob/main/inst/chemstation_31.uv) after updating Entab to the latest version. Entab was able to handle this file fine in a previous version (not sure exactly which one though...). This is the error I'm getting:

thread '<unnamed>' panicked at /Users/ethanbass/.cargo/registry/src/index.crates.io-6f17d22bba15001f/extendr-api-0.6.0/src/robj/into_robj.rs:64:13:
called `Result::unwrap()` on an `Err` value: Other("Chemstation 31 header needs to be at least 652 bytes long\n0000000000000000430761140008813E02E280500\n                                 C   v                 . (   P  \n                              ^^ 512\n")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at src/lib.rs:47:1:
explicit panic
Warning in read_chroms(path, progress_bar = FALSE, parser = "entab") :
  Error in .local(.Object, ...) : user function panicked: new

Thoughts?

Better support for multi-threading

I tried writing the following documentation on how to pass records off to e.g. a thread pool, but it doesn't work.

//! If you want to pass records off to a thread pool:
//! ```
//! # #[cfg(feature = "std")] {
//! use std::fs::File;
//! use rayon;
//! use entab::parsers::extract_opt;
//! use entab::readers::init_state;
//! use entab::readers::fastq::FastqRecord;
//!
//! let f = File::open("./tests/data/test.fastq")?;
//! let (mut rb, mut state) = init_state(f, None)?;
//! while let Some(slice) = rb.refill()? {
//!     let consumed = &mut 0;
//!     let eof = rb.eof;
//!     rayon::scope(|s| {
//!         while let Some(FastqRecord { id, ..}) = extract_opt(slice, eof, consumed, &mut state).unwrap() {
//!             s.spawn(move |_| {
//!                 println!("{}", id);
//!             });
//!         }
//!         Ok::<(), &str>(())
//!     });
//!     rb.consumed += *consumed;
//! }
//! # }
//! # use entab::EtError;
//! # Ok::<(), EtError>(())
//! ```

It threw the following errors:

error[E0499]: cannot borrow `state` as mutable more than once at a time
  --> src/lib.rs:51:83
   |
16 |       rayon::scope(|s| {
   |                     - has type `&Scope<'1>`
17 |           while let Some(FastqRecord { id, ..}) = extract_opt(slice, eof, consumed, &mut state).unwrap() {
   |                                                                                     ^^^^^^^^^^ `state` was mutably borrowed here in the previous iteration of the loop
18 | /             s.spawn(move |_| {
19 | |                 println!("{}", id);
20 | |             });
   | |______________- argument requires that `state` is borrowed for `'1`

error[E0503]: cannot use `rb.eof` because it was mutably borrowed
  --> src/lib.rs:49:15
   |
13 | while let Some(slice) = rb.refill()? {
   |                         ----------- borrow of `rb` occurs here
14 |     let consumed = &mut 0;
15 |     let eof = rb.eof;
   |               ^^^^^^ use of borrowed `rb`
16 |     rayon::scope(|s| {
17 |         while let Some(FastqRecord { id, ..}) = extract_opt(slice, eof, consumed, &mut state).unwrap() {
   |                                                             ----- borrow later captured here by closure

The ergonomics here are still pretty bad (e.g. it would be nice to be able to write rb.next_no_refill?), but that might be because I haven't written any crossbeam/rayon/etc. code lately.

Redo "generic Readers"

I'm not crazy about the current impl_reader! design for generating a "reader" for each file type; it would be nice if that could be replaced by a generic solution like Reader<T> (although we'd still need to impl a trait like RecordReader on it so that it can be boxed up for e.g. get_reader, I think?). A generic design like this could potentially also replace the experimental chunk interface for multi-threaded reading (new below is inspired by the init_state in that).

I wrote out the following before I started running into some lifetime issues; I think the type bounds for S and T probably need to be tweaked. Refactoring to next(&'r mut self) -> Result<Option<Vec<Value<'r>>>, EtError> breaks callers because they're no longer able to iterate over multiple records.

use core::fmt::Debug;
use core::marker::PhantomData;
use crate::parsers::FromSlice;
use crate::record::StateMetadata;
// (ReadBuffer, Value, EtError, RecordReader, and BTreeMap are also needed;
// exact paths omitted here)

/// A reader that abstracts over parsing files
#[derive(Debug)]
pub struct Reader<'r, S, T> where 
    S: Debug + 'r,
    T: Debug + FromSlice<'r, State=&'r mut S>,
{
    rb: ReadBuffer<'r>,
    state: S,
    returns: PhantomData<T>,
}

impl<'r, S, T> Reader<'r, S, T> where
    S: Debug + 'r,
    T: Debug + FromSlice<'r, State=&'r mut S> + Into<Vec<Value<'r>>>,
{
    /// Create a new Reader
    pub fn new<B, P>(data: B, params: Option<P>) -> Result<Self, EtError>
    where
        B: TryInto<ReadBuffer<'r>>,
        EtError: From<<B as TryInto<ReadBuffer<'r>>>::Error>,
        S: for<'a> FromSlice<'a, State = P>,
        P: Default,
    {
        let mut rb = data.try_into()?;
        if let Some(state) = rb.next::<S>(params.unwrap_or_default())? {
            Ok(Reader {
                rb,
                state,
                returns: PhantomData,
            })
        } else {
            Err(format!(
                "Could not initialize state {}",
                ::core::any::type_name::<S>()
            )
            .into())
        }
    }

    /// Get the next record from the `Reader`
    #[allow(clippy::should_implement_trait)]
    pub fn next(&mut self) -> Result<Option<T>, EtError> {
        // FIXME: fails on the next line because of lifetime issues with borrowing both `self.rb` and `self.state`
        self.rb.next::<T>(&mut self.state)
    }
}

impl<'r, S, T> RecordReader for Reader<'r, S, T> where
    S: Debug + StateMetadata + 'r,
    T: Debug + FromSlice<'r, State=&'r mut S> + Into<Vec<Value<'r>>>,
{
    fn next_record(&mut self) -> Result<Option<Vec<Value>>, EtError> {
        Ok(self.next()?.map(|v| v.into()))
    }

    fn headers(&self) -> Vec<String> {
        self.state.header().iter().map(|s| s.to_string()).collect()
    }

    fn metadata(&self) -> BTreeMap<String, Value> {
        self.state.metadata()
    }
}

issues installing entab-r on windows

I can't figure out how to install entab-r on Windows 10. I spent way too long today messing around with Makevars.win (https://github.com/ethanbass/entab/blob/windows_dev/entab-r/src/Makevars.win). The version I have now seems to compile, but then I get an error at the very end, "no DLL was created", causing the installation to fail. Maybe you have some insight, as someone who actually understands this stuff? I am basically going by trial and error, because I only have a very dim understanding of how these Makevars files work.

Here's the output from the installer:

rm -Rf entab.dll ../target/x86_64-pc-windows-gnu/release/libentab.a 
mkdir -p ../target/libgcc_mock
cd ../target/libgcc_mock && \
	touch gcc_mock.c && \
	gcc -c gcc_mock.c -o gcc_mock.o && \
	ar -r libgcc_eh.a gcc_mock.o && \
	cp libgcc_eh.a libgcc_s.a
C:\rtools42\x86_64-w64-mingw32.static.posix\bin\ar.exe: creating libgcc_eh.a
# CARGO_LINKER is provided in Makevars.ucrt for R >= 4.2
export PATH="/x86_64-w64-mingw32.static.posix/bin:/usr/bin:/c/Users/eb565/AppData/Local/Programs/R/R-42~1.1/bin/x64:/x86_64-w64-mingw32.static.posix/bin:/usr/bin:/usr/bin:/usr/bin:/x86_64-w64-mingw32.static.posix/bin:/usr/bin:/c/Users/eb565/AppData/Local/Programs/R/R-4.2.1/bin/x64:/c/Windows/System32:/c/Windows:/c/Windows/System32/wbem:/c/Windows/System32/WindowsPowerShell/v1.0:/c/Windows/System32/OpenSSH:/c/Program Files (x86)/AOMEI/AOMEI Backupper 6.4.0:/c/Users/eb565/.cargo/bin:/c/Users/eb565/AppData/Local/Programs/Python/Python37/Scripts:/c/Users/eb565/AppData/Local/Programs/Python/Python37:/c/Users/eb565/AppData/Local/Microsoft/WindowsApps:/c/Users/eb565/AppData/Local/GitHubDesktop/bin:/c/Users/eb565/AppData/Local/Programs/Git/cmd:/c/Users/eb565/AppData/Local/Programs/Microsoft VS Code/bin/:/c/Users/eb565/Documents/.cargo/bin" && \
export CARGO_TARGET_X86_64_PC_WINDOWS_GNU_LINKER="x86_64-w64-mingw32.static.posix-gcc.exe" && \
	export LIBRARY_PATH="${LIBRARY_PATH};/c/Users/eb565/AppData/Local/Temp/RtmpQprVeS/R.INSTALL24982fdd1daa/entab/src/../target/libgcc_mock" && \
	cargo build --target=x86_64-pc-windows-gnu --lib --release --manifest-path=../Cargo.toml --target-dir ../target
    Updating git repository `https://github.com/bovee/entab`
    Updating git repository `https://github.com/extendr/extendr/`
    Updating crates.io index
   Compiling jobserver v0.1.24
   Compiling proc-macro2 v1.0.42
   Compiling winapi-x86_64-pc-windows-gnu v0.4.0
   Compiling libc v0.2.126
   Compiling winapi v0.3.9
   Compiling quote v1.0.20
   Compiling unicode-ident v1.0.2
   Compiling pkg-config v0.3.25
   Compiling autocfg v1.1.0
   Compiling syn v1.0.98
   Compiling either v1.7.0
   Compiling glob v0.3.0
   Compiling encoding_index_tests v0.1.4
   Compiling serde_derive v1.0.140
   Compiling zstd-safe v2.0.6+zstd.1.4.7
   Compiling crc32fast v1.3.2
   Compiling serde v1.0.140
   Compiling extendr-engine v0.2.0 (https://github.com/extendr/extendr/?rev=1d2e87ed49a3e0e5c1a1a2df58140b3f7824fb87#1d2e87ed)
   Compiling memchr v2.5.0
   Compiling adler v1.0.2
   Compiling cfg-if v1.0.0
   Compiling extendr-api v0.2.0 (https://github.com/extendr/extendr/?rev=1d2e87ed49a3e0e5c1a1a2df58140b3f7824fb87#1d2e87ed)
   Compiling bytecount v0.6.3
   Compiling paste v1.0.7
   Compiling lazy_static v1.4.0
   Compiling cc v1.0.73
   Compiling itertools v0.9.0
   Compiling encoding-index-singlebyte v1.20141219.5
   Compiling encoding-index-korean v1.20141219.5
   Compiling encoding-index-japanese v1.20141219.5
   Compiling encoding-index-tradchinese v1.20141219.5
   Compiling encoding-index-simpchinese v1.20141219.5
   Compiling num-traits v0.2.15
   Compiling num-integer v0.1.45
   Compiling miniz_oxide v0.5.3
   Compiling encoding v0.2.33
   Compiling flate2 v1.0.24
   Compiling zstd-sys v1.4.18+zstd.1.4.7
   Compiling lzma-sys v0.1.19
   Compiling bzip2-sys v0.1.11+1.0.8
   Compiling libR-sys v0.2.2
   Compiling bzip2 v0.3.3
   Compiling extendr-macros v0.2.0 (https://github.com/extendr/extendr/?rev=1d2e87ed49a3e0e5c1a1a2df58140b3f7824fb87#1d2e87ed)
   Compiling xz2 v0.1.7
   Compiling zstd v0.5.4+zstd.1.4.7
   Compiling chrono v0.4.19
   Compiling entab v0.3.1 (https://github.com/bovee/entab#b4ea4cef)
   Compiling entab-r v0.3.1 (C:\Users\eb565\AppData\Local\Temp\RtmpQprVeS\R.INSTALL24982fdd1daa\entab)
    Finished release [optimized] target(s) in 38.93s
no DLL was created
ERROR: compilation failed for package 'entab'
* removing 'C:/Users/eb565/AppData/Local/Programs/R/R-4.2.1/library/entab'

Fix error location

With the big v0.3 rewrite, I changed how the parser advances, so the EtErrorContext::byte in most error messages no longer points to the correct position in the record, but rather to the first byte.

I think the best fix is to manually set byte (maybe in EtError::new?) when the error is created and to make sure the calling next function doesn't overwrite it while handling the error, but this involves updating every EtError constructor.

Closure-based reader?

When playing around with read_into, I wrote a reader that operates more functionally and could maybe be modified into something that does e.g. multithreaded map-reduce?

/// Apply `fxn` to each element in the data
#[doc(hidden)]
pub fn reduce<'r: 's, 's, E, T, D, F, P, TS, S>(data: D, init: S, params: Option<P>, fxn: F) -> Result<S, E>
where
    D: TryInto<ReadBuffer<'r>>,
    E: From<EtError>,
    F: Fn(S, &T) -> Result<S, E>,
    EtError: From<<D as TryInto<ReadBuffer<'r>>>::Error>,
    T: Default + FromSlice<'s, 's, State = TS>,
    TS: for<'a> FromSlice<'a, 'a, State = P> + 's,
    P: Default,
{
    let mut rb = data.try_into().map_err(|e| e.into())?;
    let mut user_state = init;
    let mut parser_state = match rb.next(&mut params.unwrap_or_default())? {
        Some(state) => state,
        None => {
            return Err(E::from(EtError::new("Could not initialize state {}")));
        },
    };
    let mut record = T::default();
    while unsafe { rb.next_into(&mut parser_state, &mut record)? } {
        user_state = fxn(user_state, &record)?;
    }
    Ok(user_state)
}

#[cfg(test)]
mod tests {
    use super::*;
    use parsers::fastq::FastqRecord;

    #[test]
    fn test_reduce() -> Result<(), EtError> {
        let data: &[u8] = include_bytes!("../tests/data/test.fastq");
        let count: usize = reduce(data, 0, None, |count, &FastqRecord { sequence, .. }| {
            Ok::<_, EtError>(count + sequence.len())
        })?;
        assert_eq!(count, 250000);
        Ok(())
    }
}

agilent UV converter reports wavelengths incorrectly

The Agilent UV converter seems to just repeat the first wavelength in the "wavelength" column, even though the intensity values are in fact for different wavelengths. For example, here are the first few lines of output from the example file: entab -i ~/entab/entab/tests/data/carotenoid_extract.d/dad1.uv

(I think these are the absorbance values for successive wavelengths, but they all say 200 under the wavelength column.)

time	wavelength	intensity
0.0013333333333333333	200	-15.6675
0.0013333333333333333	200	-31.805
0.0013333333333333333	200	-35.3695
0.0013333333333333333	200	-28.8505
0.0013333333333333333	200	-21.1925
0.0013333333333333333	200	-14.8475
0.0013333333333333333	200	-10.1245
0.0013333333333333333	200	-6.801
0.0013333333333333333	200	-4.542
0.0013333333333333333	200	-3.031
0.0013333333333333333	200	-2.035
0.0013333333333333333	200	-1.3765
0.0013333333333333333	200	-0.933
0.0013333333333333333	200	-0.6475

Python bindings - Datetime formatting panics

Sample file attached, but this probably applies to everything since NaiveDateTime has no time zone?

Environment

  • MacOS 12.3.1 arm64 built from master
  • Docker/linux/x86_64 in above w/ pip install
  • Linux/PopOS 20.04/x86_64 w/ pip install

Sample file

DAD1.UV.remove_the_txt.txt

Reproduce it

from entab import Reader
r = Reader(filename='DAD1.UV')
r.metadata

Backtrace

thread '<unnamed>' panicked at 'a Display implementation returned an error unexpectedly: Error', /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/alloc/src/string.rs:2406:14
stack backtrace:
 0:        0x105988a40 - std::backtrace_rs::backtrace::libunwind::trace::h449592924b3bd63f
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
 1:        0x105988a40 - std::backtrace_rs::backtrace::trace_unsynchronized::ha2aaeafed0c31c90
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
 2:        0x105988a40 - std::sys_common::backtrace::_print_fmt::h58db85a17304976f
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:66:5
 3:        0x105988a40 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h10cf06316d33e2a9
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:45:22
 4:        0x1059a3870 - core::fmt::write::h1faf18c959c3a8df
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/fmt/mod.rs:1190:17
 5:        0x105986308 - std::io::Write::write_fmt::h86ab231360bc97d2
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/io/mod.rs:1657:15
 6:        0x10598a5bc - std::sys_common::backtrace::_print::h771b4aab9b128422
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:48:5
 7:        0x10598a5bc - std::sys_common::backtrace::print::h637de99a9f76e8a7
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:35:9
 8:        0x10598a5bc - std::panicking::default_hook::{{closure}}::h36e628ffaf3cd44f
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:295:22
 9:        0x10598a234 - std::panicking::default_hook::h3ee1564a7544e58f
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:314:9
10:        0x10598ac1c - std::panicking::rust_panic_with_hook::h191339fbd2fe2360
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:698:17
11:        0x10598a9a4 - std::panicking::begin_panic_handler::{{closure}}::h91c230befd9929e3
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:588:13
12:        0x105988f28 - std::sys_common::backtrace::__rust_end_short_backtrace::haaaeebb1d37476b3
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:138:18
13:        0x10598a6e0 - rust_begin_unwind
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:584:5
14:        0x1059ac640 - core::panicking::panic_fmt::h4fe1013b011ef602
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/panicking.rs:143:14
15:        0x1059ac6e4 - core::result::unwrap_failed::hf608a47e6e04ea5d
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/result.rs:1749:5
16:        0x105937dc0 - core::result::Result<T,E>::expect::he5913cd1ad288b54
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/result.rs:1022:23
17:        0x10582a510 - <T as alloc::string::ToString>::to_string::h502c2311386b7f12
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/alloc/src/string.rs:2405:9
18:        0x105811714 - entab::py_from_value::hb1eb6451a367f81b
                             at /Users/nate/BioBright/BMS/Attune QC data/PT-2022-06-09T11.07.18.000/entab/entab-py/src/lib.rs:36:13
19:        0x105813408 - entab::Reader::get_metadata::hfe43c19c7467fb48
                             at /Users/nate/BioBright/BMS/Attune QC data/PT-2022-06-09T11.07.18.000/entab/entab-py/src/lib.rs:158:32
20:        0x1058295d4 - entab::<impl pyo3::class::impl_::PyMethods<entab::Reader> for pyo3::class::impl_::PyClassImplCollector<entab::Reader>>::py_methods::METHODS::__wrap::{{closure}}::hc2c8b9c79d0d57ac
                             at /Users/nate/BioBright/BMS/Attune QC data/PT-2022-06-09T11.07.18.000/entab/entab-py/src/lib.rs:94:1
21:        0x105830e24 - pyo3::callback::handle_panic::{{closure}}::h9c0e60d40a74b860
                             at /Users/nate/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.15.2/src/callback.rs:247:9
22:        0x105820070 - std::panicking::try::do_call::h47893b6f1394ee2e
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:492:40
23:        0x105820b4c - ___rust_try
24:        0x10581f478 - std::panicking::try::h08f3ea0876586e3a
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:456:19
25:        0x1058183b0 - std::panic::catch_unwind::h72128b255fb6372d
                             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panic.rs:137:14
26:        0x105830acc - pyo3::callback::handle_panic::he5a8dfdaab2835a2
                             at /Users/nate/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.15.2/src/callback.rs:245:24
27:        0x1058136c8 - entab::<impl pyo3::class::impl_::PyMethods<entab::Reader> for pyo3::class::impl_::PyClassImplCollector<entab::Reader>>::py_methods::METHODS::__wrap::h93cc9623eaf06290
                             at /Users/nate/BioBright/BMS/Attune QC data/PT-2022-06-09T11.07.18.000/entab/entab-py/src/lib.rs:94:1
28:        0x102da0124 - __PyObject_GenericGetAttrWithDict
29:        0x102d9f910 - _PyObject_GetAttr
30:        0x102e2a78c - __PyEval_EvalFrameDefault
31:        0x102e26778 - __PyEval_Vector
32:        0x102e266d0 - _PyEval_EvalCode
33:        0x102e23398 - _builtin_exec
34:        0x102d9af84 - _cfunction_vectorcall_FASTCALL
35:        0x102e2f420 - _call_function
36:        0x102e2cad0 - __PyEval_EvalFrameDefault
37:        0x102d6b1f8 - _gen_send_ex2
38:        0x102e28da4 - __PyEval_EvalFrameDefault
39:        0x102d6b1f8 - _gen_send_ex2
40:        0x102e28da4 - __PyEval_EvalFrameDefault
41:        0x102d6b1f8 - _gen_send_ex2
42:        0x102d6b400 - _gen_send
43:        0x102d5f9d0 - _method_vectorcall_O
44:        0x102e2f420 - _call_function
45:        0x102e2ca20 - __PyEval_EvalFrameDefault
46:        0x102e26778 - __PyEval_Vector
47:        0x102e2f420 - _call_function
48:        0x102e2cad0 - __PyEval_EvalFrameDefault
49:        0x102e26778 - __PyEval_Vector
50:        0x102e2f420 - _call_function
51:        0x102e2ca20 - __PyEval_EvalFrameDefault
52:        0x102e26778 - __PyEval_Vector
53:        0x102d58b34 - _method_vectorcall
54:        0x102e2f420 - _call_function
55:        0x102e2cb48 - __PyEval_EvalFrameDefault
56:        0x102e26778 - __PyEval_Vector
57:        0x102e2f420 - _call_function
58:        0x102e2ca20 - __PyEval_EvalFrameDefault
59:        0x102e26778 - __PyEval_Vector
60:        0x102e2f420 - _call_function
61:        0x102e2ca20 - __PyEval_EvalFrameDefault
62:        0x102e26778 - __PyEval_Vector
63:        0x102e2f420 - _call_function
64:        0x102e2ca20 - __PyEval_EvalFrameDefault
65:        0x102e26778 - __PyEval_Vector
66:        0x102d58b34 - _method_vectorcall
67:        0x102d569d8 - _PyVectorcall_Call
68:        0x102e2cd70 - __PyEval_EvalFrameDefault
69:        0x102e26778 - __PyEval_Vector
70:        0x102e2f420 - _call_function
71:        0x102e2cad0 - __PyEval_EvalFrameDefault
72:        0x102e26778 - __PyEval_Vector
73:        0x102e266d0 - _PyEval_EvalCode
74:        0x102e71720 - __PyRun_SimpleFileObject
75:        0x102e71274 - __PyRun_AnyFileObject
76:        0x102e8f3ac - _Py_RunMain
77:        0x102e8f82c - _pymain_main
78:        0x102e8f8a8 - _Py_BytesMain

Cause?

In this line:

d.format("%+").to_string().to_object(py)

d.format("%+") panics, probably because there is no timezone info?

Changing the line to d.format("%Y-%m-%dT%H:%M:%S%.f").to_string().to_object(py), i.e. removing the timezone part of ISO 8601, seems to fix it, and the output then matches entab-cli (e.g. entab -m -i DAD1.UV)

@bovee if this is acceptable, I'm happy to PR it

Rewrite to remove `refill`s from parsers

Right now there's a pattern we use like the following (simplified from the FASTA parser):

    let (start, end) = loop {
        if let Some(p) = memchr(...) {
            break ...;
        } else if rb.eof() {
            return Err(...);
        }
        rb.refill()?;
    };

The advantage of this is that we don't have to reparse the record if we need to refill (because we never exit and re-enter the parser once the buffer is refilled). The downside is that the lifetime of a record returned from the parser is pinned to right after refill is called: every time we parse, we could potentially wipe the underlying buffer and invalidate all previous records. Theoretically this saves time on every refill, but at the cost of code clarity and some lifetime constraints, and the savings may not be that great for larger buffers or for mmapped files.

If we moved the refill logic out of the parser into the caller and returned a sentinel value instead, we could simplify to:

    if let Some(p) = memchr(...) {
        (start, end) = ...
    } else if rb.eof() {
        return Err(...);
    } else {
        return Ok(Incomplete)
    }

This is much closer to how parser libraries like nom handle these situations so it probably won't have a noticeable performance impact? It would require a large rewrite of the parser code and all of the parsers, but I don't think many people are using entab right now that would object to the breakage.
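The sentinel could be a nom-style result type. A minimal sketch of the caller-driven refill contract (names here are hypothetical, not entab's actual API):

```rust
// Hypothetical sketch of the rewrite: the parser never refills, it just
// reports whether it found a complete record in the bytes it was given.
#[derive(Debug, PartialEq)]
enum ParseResult<T> {
    Complete(T),
    Incomplete,
}

// Find one newline-terminated record in `buf`; the caller owns the refill
// loop and calls back in with more data whenever it sees `Incomplete`.
fn next_record(buf: &[u8], eof: bool) -> Result<ParseResult<&[u8]>, String> {
    match buf.iter().position(|&b| b == b'\n') {
        Some(p) => Ok(ParseResult::Complete(&buf[..p])),
        None if eof => Err("record truncated at end of file".to_string()),
        None => Ok(ParseResult::Incomplete),
    }
}
```

Because the parser only ever borrows the slice it was handed, record lifetimes become ordinary borrows of the caller's buffer rather than being tied to the last internal refill.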

Memory corruption in R

There's an intermittent issue when using the R bindings where fields will be out-of-order and parse speed will be abnormally slow. From #28:

This also seems to have improved the issue I mentioned with retention times appearing in the wavelengths column (about 9/10 times). The weird part though (!?) is that this behavior is still happening about one tenth of the time, as in, if I repeatedly run the Reader on the same file. 🧐 (This seems to be independent of the file used).

I thought this might be an issue with R linking to an old version of the dynamic library, but it's also possible it's an issue with all the unsafe libR code or something else entirely.

Support XML

This is mostly necessary to support XML-based file formats like some of the Agilent MassHunter formats, mzML, etc.

There are a couple existing streaming XML Rust parsers that we could possibly wrap, but it may be "easy" enough to just write one on top of the existing ReadBuffer interface:
https://github.com/netvl/xml-rs
https://github.com/tafia/quick-xml

Passing a raw XML file into entab should probably result in a stream with fields like:

  • Materialized path to current node (Value::List([key, subkey, ...]))
  • Attributes for current node (Value::Record<string, string? value?>)
  • Text for current node (String?)

We may not want to actually do that though because it will probably require saving up all the data and emitting the nodes post-traversal (which isn't the most natural format to view and requires more memory).
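The three fields above could be sketched as a record type like this (names are hypothetical; entab's real output would use its own Value types):

```rust
// Sketch of the flattened row an XML stream could emit: one record per
// node, keyed by the materialized path from the root down to that node.
#[derive(Debug)]
struct XmlRecord {
    path: Vec<String>,                 // e.g. ["mzML", "run", "spectrum"]
    attributes: Vec<(String, String)>, // attributes on the current node
    text: Option<String>,              // text content of the node, if any
}
```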

"Unknown doesn't have a parser" is not a very helpful error message

We should maybe add more common magic sequences to filetype.rs?

Also for really unknown files, we should maybe capture the first 8 (?) bytes (change [Unknown](https://github.com/bovee/entab/blob/f4e0f3cb7ca4383145cd3575a0e80962d07dcdee/entab/src/filetype.rs#L79) to Unknown([u8; 8])?) and then rewrite the error message appropriately.
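A sketch of what carrying the magic bytes through to the error message could look like (`FileType` and `parser_error` here are stand-ins, not entab's actual code):

```rust
// Keep the first 8 bytes of an unrecognized file so the error message
// can show them instead of just saying "Unknown".
#[derive(Debug, PartialEq)]
enum FileType {
    Png,
    Unknown([u8; 8]),
}

fn parser_error(ft: &FileType) -> String {
    match ft {
        FileType::Unknown(magic) => {
            format!("no parser for file starting with bytes {:02x?}", magic)
        }
        other => format!("{:?} doesn't have a parser", other),
    }
}
```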

Chemstation .reg Format

I believe the core of the Chemstation .reg format is a CArchive https://learn.microsoft.com/en-us/cpp/mfc/reference/carchive-class?view=msvc-170 and here is a reader/writer implemented in
cpp: https://github.com/pixelspark/corespark/blob/e2aa78fe13e273fcc9bb2665ab4c700e89895741/Libraries/atlmfc/src/mfc/arcobj.cpp

The "data type numbers" are random: they depend on the order in which the different data objects are written to the archive. A class gets a number the first time an object of that class is written to the file.
See this code: https://github.com/pixelspark/corespark/blob/e2aa78fe13e273fcc9bb2665ab4c700e89895741/Libraries/atlmfc/src/mfc/arcobj.cpp#L259
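The numbering scheme can be sketched like this (a minimal illustration of the order-dependent assignment, not the actual CArchive wire format):

```rust
use std::collections::HashMap;

// Each class name gets the next free index the first time an object of
// that class is written, so the "type numbers" in a .reg file reflect
// write order rather than anything intrinsic to the class.
struct ClassMap {
    next_id: u32,
    ids: HashMap<String, u32>,
}

impl ClassMap {
    fn new() -> Self {
        ClassMap { next_id: 1, ids: HashMap::new() }
    }

    fn id_for(&mut self, class_name: &str) -> u32 {
        if let Some(&id) = self.ids.get(class_name) {
            return id;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.ids.insert(class_name.to_string(), id);
        id
    }
}
```

This is why the same class can show up with different numbers in different .reg files: the mapping has to be rebuilt while reading, in the same order the writer assigned it.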

Support interlaced PNGs

I think this is the biggest thing missing from the PNG parser and I just never got around to implementing it. Requires understanding exactly how Adam7 splits up the pixels across the IDAT stream.
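The pixel-to-pass mapping itself is just a lookup into the fixed 8x8 pattern from the PNG specification; the harder part is deinterlacing each pass's scanlines back into the full image. The lookup:

```rust
// The Adam7 pass pattern from the PNG spec: every 8x8 tile of the image
// is split into seven passes, and pass n's pixels are the cells marked n.
const ADAM7: [[u8; 8]; 8] = [
    [1, 6, 4, 6, 2, 6, 4, 6],
    [7, 7, 7, 7, 7, 7, 7, 7],
    [5, 6, 5, 6, 5, 6, 5, 6],
    [7, 7, 7, 7, 7, 7, 7, 7],
    [3, 6, 4, 6, 3, 6, 4, 6],
    [7, 7, 7, 7, 7, 7, 7, 7],
    [5, 6, 5, 6, 5, 6, 5, 6],
    [7, 7, 7, 7, 7, 7, 7, 7],
];

/// Which of the seven Adam7 passes pixel (x, y) belongs to.
fn adam7_pass(x: usize, y: usize) -> u8 {
    ADAM7[y % 8][x % 8]
}
```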

Deploy an example of the JS code

It would be cool to have an actual website up and running with the parser in it for demos, etc. We could do this with a gh-pages publishing action? e.g.

    - uses: peaceiris/actions-gh-pages@v3
      with:
        github_token: ${{ secrets.GITHUB_TOKEN }}
        publish_dir: ./public

-o option doesn't seem to work on CLI

It isn't too big a deal, since you can just use a pipe to save the output from the entab function, but I can't get the -o option to work. For example, if I run:
entab -i ~/entab/entab/tests/data/carotenoid_extract.d/dad1.uv -o test.txt

I get back:

No such file or directory (os error 2)

#####

I'm also sometimes getting:
Bad file descriptor (os error 9)

(I also get the same results if I write a full path after the -o.)

However, it works fine if I pipe it:
entab -i ~/entab/entab/tests/data/carotenoid_extract.d/dad1.uv > test.txt

Support Shimadzu LCD format

This is an LC-MS binary format generated by LabSolutions. There's also a Shimadzu QGD format for GC-MS data; I'm not sure how much crossover there is between these. There don't appear to be any other open-source parsers for either.

Support JSON

Not really a priority, but a solution will probably look like #8 a lot.

How to access readers in Python?

This is probably a stupid question, but I'm someone who was previously leveraging the Aston package and needs to update to this because of Aston's requirement for an older, unsupported Python version....

How do I access the readers in entab? I don't see any section in the README to help illustrate how someone might migrate from Aston to this new package. I used to have code like:

    def test_aston_import():
        from aston.tracefile.agilent_uv import AgilentMWD2

        instance = AgilentMWD2()
        assert isinstance(instance, AgilentMWD2)

but one cannot simply replace the "aston" with "entab". Any insight would be welcome. Thank you!
