
Comments (6)

deads avatar deads commented on June 24, 2024

Sorry for the delay in responding to your issue. It has been a busy summer at work.

It was my intention to get a fairly comprehensive test suite working before fixing bugs that the community reports. This way we can ensure that one fix does not break earlier fixes. The issue you were experiencing was a bug in the algorithm for dividing the file. In the latest master, the example you gave above seems to work for any number of threads. This is not surprising since the latest master incorporates quite a bit of fuzz-like testing on adversarial input.
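To illustrate why dividing a CSV file between threads is easy to get wrong, here is a minimal sketch of one way to pick chunk boundaries that never fall inside a quoted field. It is not paratext's actual algorithm: the function names are hypothetical, it scans the whole buffer serially, and it assumes RFC 4180-style quoting (embedded quotes are doubled, no backslash escapes).

```python
from bisect import bisect_left


def record_starts(data: bytes) -> list:
    """Offsets at which a new record begins, ignoring newlines inside quotes."""
    starts = [0]
    in_quotes = False
    for i, b in enumerate(data):
        if b == 0x22:  # '"' toggles the quoted state; a doubled "" toggles twice
            in_quotes = not in_quotes
        elif b == 0x0A and not in_quotes:  # newline outside quotes ends a record
            starts.append(i + 1)
    return starts


def chunk_boundaries(data: bytes, num_chunks: int) -> list:
    """Split points close to an even division that always land on record starts."""
    starts = record_starts(data)
    if starts[-1] != len(data):
        starts.append(len(data))  # the end of the data is always a valid boundary
    bounds = [0]
    for k in range(1, num_chunks):
        target = k * len(data) // num_chunks
        # Move the naive split point forward to the next record start.
        pos = starts[bisect_left(starts, target)]
        if pos > bounds[-1]:
            bounds.append(pos)
    if bounds[-1] != len(data):
        bounds.append(len(data))
    return bounds
```

A naive split at `k * len(data) // num_chunks` can land in the middle of a quoted field that contains a newline; snapping each split to a known record start avoids that, at the cost of a serial pre-scan.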

The paratext package provides several helper functions for generating arbitrary data frames for the purpose of testing. One of these functions is called generate_hell_frame.

df=paratext.testing.generate_hell_frame(1000, 5, fmt="mixed")

In this frame, there are UTF-8 columns, arbitrary byte sequences, 7-bit ASCII strings, and printable ASCII strings. The data of these columns will contain arbitrary punctuation, double quoting, newlines, and escape characters as well as non-UTF-8 and non-ASCII data.

There is another utility function that writes the data to a file:

paratext.serial.save_frame("myfile.csv", df, encoding=encoding)

where encoding can be utf-8, ascii, printable-ascii, or arbitrary. In each case, if a sequence is encountered outside the encoding, it is properly escaped. This enables a data frame with Unicode columns, byte buffer columns, and string columns to be written to printable ASCII and read back in a lossless fashion.
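As a rough illustration of what lossless escaping to printable ASCII can look like, here is a minimal, self-contained sketch. It is not the escaping scheme paratext uses (the thread does not describe it); the function names and the `\xNN` convention are assumptions chosen for the example.

```python
def escape_to_printable_ascii(data: bytes) -> str:
    """Encode arbitrary bytes using only printable ASCII characters."""
    out = []
    for b in data:
        if b == 0x5C:  # backslash is the escape character, so it is doubled
            out.append("\\\\")
        elif 0x20 <= b <= 0x7E:  # printable ASCII passes through unchanged
            out.append(chr(b))
        else:  # everything else (newlines, control bytes, non-ASCII) becomes \xNN
            out.append("\\x%02x" % b)
    return "".join(out)


def unescape_from_printable_ascii(text: str) -> bytes:
    """Invert escape_to_printable_ascii."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ord(ch))
            i += 1
        elif text[i + 1:i + 2] == "\\":
            out.append(0x5C)
            i += 2
        elif text[i + 1:i + 2] == "x":
            out.append(int(text[i + 2:i + 4], 16))
            i += 4
        else:
            raise ValueError("invalid escape sequence at offset %d" % i)
    return bytes(out)
```

Because every byte outside the printable range is escaped and the escape character itself is doubled, decoding recovers the original bytes exactly. Quotes and commas pass through unchanged here, so a CSV writer would still need to quote such fields.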

from paratext.

ehiggs avatar ehiggs commented on June 24, 2024

FWIW, this is still broken.


deads avatar deads commented on June 24, 2024

Are you sure? It works for me. I tried:

In [1]: it=paratext.load_raw_csv("/tmp/hello.csv", no_header=True, allow_quoted_newlines=True)

In [2]: it.next()
Out[2]: 
(u'col0',
 array([0,

Also, if I try:

In [3]: paratext.load_csv_to_pandas("/tmp/hello.csv", no_header=True).head()
Out[3]: 
    col0 col1 col2   col3 col4
0  hello    ,       world    !
1  hello    ,       world    !
2  hello    ,       world    !
3  hello    ,       world    !
4  hello    ,       world    !

it works.

Perhaps you do not have the latest source or did not properly rebuild.

Try doing a git pull and removing the build/ directory:

git pull
rm -rf build


ehiggs avatar ehiggs commented on June 24, 2024

I found that I couldn't reproduce it when writing a test in tests/test_paratext.py, but it fails when reading from stdin. IIRC you use mmap when reading the file, so that won't work on stdin; and it certainly wouldn't make any sense to do this with multiple threads, so maybe it's moot. It's fine for this case to fail, but crashing with mysterious errors is not a nice UX.
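For illustration, a reader that relies on mmap can check up front whether its input is a regular file and fail with a clear message otherwise. The sketch below uses only the Python standard library; `map_regular_file` is a hypothetical helper, not part of paratext, and whether paratext actually uses mmap is only recalled from memory above.

```python
import mmap
import os
import stat


def map_regular_file(fd: int) -> mmap.mmap:
    """Memory-map a file descriptor read-only, rejecting pipes and terminals."""
    mode = os.fstat(fd).st_mode
    if not stat.S_ISREG(mode):
        # stdin attached to a pipe or a terminal ends up here.
        raise ValueError(
            "input is not a regular file (pipe or terminal?) and cannot be memory-mapped"
        )
    # Length 0 maps the whole file; Python rejects empty files with ValueError.
    return mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
```

Calling `map_regular_file(sys.stdin.fileno())` when stdin is a pipe then raises a readable `ValueError` instead of failing somewhere deeper in the parser.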


ehiggs avatar ehiggs commented on June 24, 2024

I was able to get paratext added to the game in the end: https://bitbucket.org/ewanhiggs/csv-game


ehiggs avatar ehiggs commented on June 24, 2024

As this is closed, I entered #62 to handle the stdin issue. Thanks!

