Comments (22)

BurntSushi commented on August 23, 2024

@davidblewett Naively, a checksum would be written at the end of the FST, and it would correspond to a crc32c sum of all previous bytes in the FST. If they don't line up, then you have pretty high confidence that the FST has been corrupted somehow. The checksum would not however help you if the panics you're seeing are a result of a bug in the FST builder itself.
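A minimal sketch of the trailing-checksum idea described above, using only the standard library. The 4-byte little-endian trailer and the names `append_checksum` and `verify_checksum` are assumptions made for illustration; they are not part of the fst crate or its file format.

```rust
/// Bitwise CRC-32C (Castagnoli), using the reflected polynomial 0x82F63B78.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones if the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Hypothetical layout: payload bytes followed by a 4-byte little-endian checksum.
fn append_checksum(mut payload: Vec<u8>) -> Vec<u8> {
    let sum = crc32c(&payload);
    payload.extend_from_slice(&sum.to_le_bytes());
    payload
}

/// Returns the payload if the trailing checksum matches, otherwise `None`.
fn verify_checksum(bytes: &[u8]) -> Option<&[u8]> {
    if bytes.len() < 4 {
        return None;
    }
    let (payload, trailer) = bytes.split_at(bytes.len() - 4);
    let stored = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    if crc32c(payload) == stored {
        Some(payload)
    } else {
        None
    }
}

fn main() {
    let file = append_checksum(b"pretend these are FST bytes".to_vec());
    assert!(verify_checksum(&file).is_some());

    // Flip a single bit in the payload: the checksum no longer lines up.
    let mut corrupted = file.clone();
    corrupted[0] ^= 0x01;
    assert!(verify_checksum(&corrupted).is_none());
}
```

As the comment notes, a check like this only flags bytes that changed after they were written. It says nothing about a builder that wrote wrong bytes in the first place.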

from fst.

davidblewett commented on August 23, 2024

Trying to use fst csv edges or fst csv nodes results in:

fatal runtime error: allocator memory exhausted
Illegal instruction (core dumped)

davidblewett commented on August 23, 2024

I can provide a sample, broken file (and a sample of a functional file) if you like.

BurntSushi commented on August 23, 2024

Interesting bug! Yeah I definitely need a sample to reproduce. Could you also show output with RUST_BACKTRACE=1?

Also, the Python interpreter should not be segfaulting. The ffi layer should be catching panics and converting them to aborts. Otherwise it is UB.

davidblewett commented on August 23, 2024

backtrace:

thread '<unnamed>' panicked at 'index out of bounds: the len is 89225255 but the index is 15119944950614189002', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.3.0/src/raw/node.rs:306:17                     
stack backtrace:                                        
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace                                                 
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49                                                   
   1: std::sys_common::backtrace::_print                
             at libstd/sys_common/backtrace.rs:71       
   2: std::panicking::default_hook::{{closure}}         
             at libstd/sys_common/backtrace.rs:59       
             at libstd/panicking.rs:380                 
   3: std::panicking::default_hook                      
             at libstd/panicking.rs:396                 
   4: std::panicking::rust_panic_with_hook              
             at libstd/panicking.rs:576                 
   5: std::panicking::begin_panic                       
             at libstd/panicking.rs:537                 
   6: std::panicking::begin_panic_fmt                   
             at libstd/panicking.rs:521                 
   7: rust_begin_unwind                                 
             at libstd/panicking.rs:497                 
   8: core::panicking::panic_fmt                        
             at libcore/panicking.rs:71                 
   9: core::panicking::panic_bounds_check
             at libcore/panicking.rs:58
  10: fst::raw::Fst::node
  11: <fst::raw::StreamBuilder<'f, A> as fst::stream::IntoStreamer<'a>>::into_stream
  12: fst_set_streambuilder_finish
  13: ffi_call_unix64
             at ../src/x86/unix64.S:76
  14: ffi_call
             at ../src/x86/ffi64.c:525
  15: cdata_call
             at c/_cffi_backend.c:3025
  16: _PyObject_FastCallDict
             at Objects/abstract.c:2331
  17: call_function
             at Python/ceval.c:4848
  18: _PyEval_EvalFrameDefault
             at Python/ceval.c:3322
  19: _PyFunction_FastCall
             at Python/ceval.c:4906
  20: _PyFunction_FastCallDict
             at Python/ceval.c:5008
  21: _PyObject_FastCallDict
             at Objects/abstract.c:2310
  22: _PyObject_Call_Prepend
             at Objects/abstract.c:2373
  23: PyObject_Call
             at Objects/abstract.c:2261
  24: call_method.constprop.53
             at Objects/typeobject.c:1453
  25: _PyEval_EvalFrameDefault
             at Python/ceval.c:1510
  26: _PyEval_EvalCodeWithName
             at Python/ceval.c:4153
  27: PyEval_EvalCodeEx                                 
             at Python/ceval.c:4174                     
  28: PyEval_EvalCode                                   
             at Python/ceval.c:730                      
  29: PyRun_InteractiveOneObjectEx                      
             at Python/pythonrun.c:1025                 
             at Python/pythonrun.c:246                  
  30: PyRun_InteractiveLoopFlags                        
             at Python/pythonrun.c:114                  
  31: PyRun_AnyFileExFlags                              
             at Python/pythonrun.c:75                   
  32: Py_Main                                           
             at Modules/main.c:338                      
             at Modules/main.c:809                      
  33: main                                              
             at ./Programs/python.c:69                  
  34: __libc_start_main                                 
  35: <unknown>                                         
fatal runtime error: failed to initiate panic, error 5
Aborted (core dumped)                                   

# echo $?
134

davidblewett commented on August 23, 2024

Working on getting the files somewhere.

davidblewett commented on August 23, 2024

@BurntSushi : here are the files: https://drive.google.com/file/d/1xs9NSIEU2yEDoUg0Vl2qEtL06hOvo9f1/view?usp=sharing .

Let me know when you get them so I can unshare them. The data in them should be anonymized to not be a problem, but would like to limit exposure.

BurntSushi commented on August 23, 2024

@davidblewett Thanks! I downloaded it. Won't be able to look into this until a bit later, hopefully today, but no promises. Also, in the future, if it's a concern, you can email me files if they are small enough. [email protected]

Out of curiosity (in case it becomes relevant), how did you build these files? Did you use fst 0.3.0?

davidblewett commented on August 23, 2024

Yes, fst 0.3.0. The process looks like:

  1. Build array in memory
  2. Pass array to fst set --sorted - -
  3. Write output to file, compress, upload to S3
  4. Read from S3, decompress, accumulate files for 5 minutes
  5. fst union ... ... temp_dir/output && mv temp_dir/output foo.fst

Steps 1-4 are actually in Ruby; 5 in Python.

davidblewett commented on August 23, 2024

@BurntSushi could it be step 2, both reading from stdin and writing to stdout that breaks in some circumstances? Should I be outputting to a tempfile on disk?

BurntSushi commented on August 23, 2024

@davidblewett I don't think so. Is it possible to provide the original array of strings that produced one of these FSTs? That would help debugging on my end.

davidblewett commented on August 23, 2024

@BurntSushi : Unfortunately, it would be fairly involved to try to track that down. I don't have the telemetry for what files were combined in step 5 above. I'm going to purge the files that are exhibiting this behavior, and resume the aggregation process. If it occurs again, I might be able to track that down.

BurntSushi commented on August 23, 2024

@davidblewett No worries, thanks! I'll see what I can do with what I have. :)

davidblewett commented on August 23, 2024

I've added some telemetry to our process that should allow me to reconstruct the input data. Will let you know if I see it happen again.

davidblewett commented on August 23, 2024

@BurntSushi : after letting this run for a few days, I have not been able to reproduce the error. It's possible that the sample file here hadn't been completely finalized, and was accidentally included in a globbing expression to load finished files.
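One way to keep half-written files out of a glob is to write under a temporary name and rename only once the bytes are on disk, much like the `&& mv` in step 5 above. The sketch below assumes a `.tmp`/`.fst` naming scheme and the helper names `publish_segment` and `finished_segments`; none of these come from the thread's actual pipeline.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Writes `bytes` to `<name>.tmp`, flushes them to disk, then renames the file
/// to `<name>.fst`. Readers that only match `*.fst` never see a partial file.
fn publish_segment(dir: &Path, name: &str, bytes: &[u8]) -> io::Result<PathBuf> {
    let tmp = dir.join(format!("{}.tmp", name));
    let done = dir.join(format!("{}.fst", name));
    let mut file = fs::File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?;
    fs::rename(&tmp, &done)?;
    Ok(done)
}

/// Lists only finished segments, i.e. files with the `.fst` extension.
fn finished_segments(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.extension().map_or(false, |ext| ext == "fst") {
            out.push(path);
        }
    }
    out.sort();
    Ok(out)
}

fn main() -> io::Result<()> {
    // Scratch directory for the demo; the name is arbitrary.
    let dir = std::env::temp_dir().join(format!("fst-publish-demo-{}", std::process::id()));
    fs::create_dir_all(&dir)?;

    // A writer that is still in progress leaves only a .tmp file behind.
    fs::write(dir.join("in-progress.tmp"), b"half written")?;
    let done = publish_segment(&dir, "segment-0001", b"complete bytes")?;

    assert_eq!(finished_segments(&dir)?, vec![done]);
    fs::remove_dir_all(&dir)
}
```

The rename is only atomic when both names live on the same filesystem, so the temporary file is created next to its final location.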

davidblewett commented on August 23, 2024

@BurntSushi this has raised its ugly head again. I'm pretty sure it's a sequence of events in my larger application that ends up writing out corrupt FST data. However, would you be opposed to using .get(x) on arrays instead of [x]? It would be a fairly invasive change, but would allow a clean way to recover in the face of invalid data.

Alternatively, perhaps some form of check method could be added that can validate the structure of a given FST file?

BurntSushi commented on August 23, 2024

@davidblewett Hmmm... So I haven't had a chance to look into this. There's unfortunately a large amount of context switching overhead required to dive into fst internals and debug this kind of thing.

I think using get(x) instead of [x] is probably not the right direction to take. It would make the code incredibly noisy. In particular, it's not just about writing get(x) instead of [x], but actually doing case analysis everywhere. And when we do try to access an out-of-bounds index, it's not clear to me that we could do much better than panic anyway.
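To make the trade-off concrete, here is a toy reader that follows one stored address, written once with plain indexing and once with checked access. The 8-byte little-endian address layout and the `ReadError` type are invented for this sketch and do not reflect the real fst node format.

```rust
/// Indexing version: short, but panics when the stored address is garbage,
/// like the "index out of bounds" panic in the backtrace above.
fn follow_indexed(data: &[u8], at: usize) -> u8 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&data[at..at + 8]);
    let addr = u64::from_le_bytes(buf) as usize;
    data[addr]
}

#[derive(Debug, PartialEq)]
enum ReadError {
    TruncatedAddress,
    AddressOutOfBounds { addr: u64, len: usize },
}

/// Checked version: every access grows a failure case that the caller must
/// then handle, which is the case analysis the comment above refers to.
fn follow_checked(data: &[u8], at: usize) -> Result<u8, ReadError> {
    let end = at.checked_add(8).ok_or(ReadError::TruncatedAddress)?;
    let raw = data.get(at..end).ok_or(ReadError::TruncatedAddress)?;
    let mut buf = [0u8; 8];
    buf.copy_from_slice(raw);
    let addr = u64::from_le_bytes(buf);
    if addr >= data.len() as u64 {
        return Err(ReadError::AddressOutOfBounds { addr, len: data.len() });
    }
    Ok(data[addr as usize]) // in range by the check above
}

fn main() {
    // Valid input: the address 8 points at the byte right after it.
    let mut good = 8u64.to_le_bytes().to_vec();
    good.push(42);
    assert_eq!(follow_indexed(&good, 0), 42);
    assert_eq!(follow_checked(&good, 0), Ok(42));

    // Corrupt input: reuse the index from the reported panic as the address.
    let mut bad = 15119944950614189002u64.to_le_bytes().to_vec();
    bad.push(42);
    assert!(follow_checked(&bad, 0).is_err());
}
```

Even in this tiny example the checked version is roughly twice as long, and the caller still has to decide what to do with the error.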

A check method or at least some kind of optional checksum seems like a better path to me. In theory, a full check method wouldn't be too bad, but that's only if you use the existing code that deserializes the FST, which I think is the actual problem here, so that doesn't help. A checksum would probably work well to weed out corrupt FST files, but would not fix any issues arising from a non-corrupt but incorrect FST file (e.g., if the builder is producing incorrect data as opposed to some external thing preventing a complete FST from being written).

I'm not too sure what I'll have time for in the short term here unfortunately. How severe would you rate this bug?

davidblewett commented on August 23, 2024

The incidence rate has gone down in the last few days. I like the idea of a checksum, but how would you verify it? Could it be something that is always the "longest" chain in that specific FST?

davidblewett commented on August 23, 2024

@BurntSushi I believe I discovered the root of the issue. I don't think the FST files were corrupt; the problem was caused by running multiple operations in different threads on the same handle. Since the Python binding is basically C, those actions don't trigger any warnings.

In pure Rust, are the Set structs not re-entrant? We have a few hundred thousand FST segments, so need to be careful to not go over the operating system limit on the number of open mmap'd files. If we spawned new Set instances in different threads, we could easily go over that limit if we had multiple concurrent requests for a customer that had tens of thousands of segments.

BurntSushi commented on August 23, 2024

@davidblewett The FST sets themselves are purely immutable and can be inspected from multiple threads/processes simultaneously without issue. However, the streams produced by FSTs require mutable access and are themselves not legal to access from multiple threads simultaneously without explicit synchronization. Of course, you can have as many streams operating simultaneously as you want. You just need to make sure you're only accessing each stream from one thread.
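The sharing rule above can be sketched with std-only stand-ins. `SortedKeys` and `KeyStream` are hypothetical types that only mimic the shape of the rule: the set is immutable and shared, while each stream has a mutable cursor and stays on one thread. They are not the fst crate's API.

```rust
use std::sync::Arc;
use std::thread;

/// Stand-in for an immutable set: sorted keys, never modified after construction.
struct SortedKeys {
    keys: Vec<String>,
}

/// Stand-in for a stream: it carries a mutable cursor, so `next` takes `&mut self`.
struct KeyStream<'a> {
    set: &'a SortedKeys,
    pos: usize,
}

impl SortedKeys {
    fn stream(&self) -> KeyStream<'_> {
        KeyStream { set: self, pos: 0 }
    }
}

impl<'a> KeyStream<'a> {
    fn next(&mut self) -> Option<&'a str> {
        let key = self.set.keys.get(self.pos)?;
        self.pos += 1;
        Some(key.as_str())
    }
}

/// Creates a stream and consumes it entirely on the calling thread.
fn count_with_prefix(set: &SortedKeys, prefix: &str) -> usize {
    let mut stream = set.stream();
    let mut n = 0;
    while let Some(key) = stream.next() {
        if key.starts_with(prefix) {
            n += 1;
        }
    }
    n
}

/// Shares one set across threads; every thread gets its own private stream.
fn count_in_threads(set: Arc<SortedKeys>, prefixes: &[&'static str]) -> Vec<usize> {
    let handles: Vec<_> = prefixes
        .iter()
        .map(|&prefix| {
            let set = Arc::clone(&set);
            thread::spawn(move || count_with_prefix(&set, prefix))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let keys = vec!["bar".to_string(), "baz".to_string(), "foo".to_string()];
    let set = Arc::new(SortedKeys { keys });
    assert_eq!(count_in_threads(set, &["ba", "f", "x"]), vec![2, 1, 0]);
}
```

In Rust the borrow checker rejects handing one `&mut KeyStream` to two threads at once. Across an FFI boundary, such as a C or Python binding, nothing enforces this, which is consistent with the failure described earlier in the thread.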

DiSToAGe commented on August 23, 2024

Perhaps I've found your problem?

In some personal Rust code, I write an FST from a sorted list of strings. I then tried to open that FST with fst-bin and run a regex against it, and got almost the same error:

thread 'main' panicked at 'index out of bounds: the len is 267524517 but the index is 148671801088665313', (...)github.com-1ecc6299db9ec823/fst-0.3.3/src/raw/node.rs:307:17

Then I realised there was an error in my code that writes the FST. I used

  1. SetBuilder::new(file)
  2. build.extend_iter(lines.into_iter())
    => but I forgot the "build.finish()" at the end.

If you don't call "finish()", the FST file is slightly smaller than it should be, and opening it gives the error above.
I don't know whether this is the same problem as yours.

PS: After correcting my code, it creates a correct FST file and the problem no longer occurs.

BurntSushi commented on August 23, 2024

I'm going to close this issue because it seems like there isn't a problem with FST reading/writing itself. Happy to dig into this more with a better reproduction.
