Comments (6)
Sorry for the delay in responding to your issue. It has been a busy summer at work.
It was my intention to get a fairly comprehensive test suite working before fixing bugs that the community reports, so that we can ensure one fix does not break earlier fixes. The issue you were experiencing was a bug in the algorithm for dividing the file. In the latest master, the example you gave above seems to work for any number of threads. This is not surprising, since the latest master incorporates quite a bit of fuzz-like testing on adversarial input.
The paratext package provides several helper functions for generating arbitrary data frames for the purpose of testing. One of these functions is called generate_hell_frame.
df=paratext.testing.generate_hell_frame(1000, 5, fmt="mixed")
In this frame, there are UTF-8 columns, arbitrary byte sequences, 7-bit ASCII strings, and printable ASCII strings. The data of these columns will contain arbitrary punctuation, double quoting, newlines, and escape characters as well as non-UTF-8 and non-ASCII data.
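To illustrate the kind of round-trip check that such adversarial data makes possible, here is a minimal sketch that does not use paratext at all. It uses Python's standard csv module as a stand-in reader and writer, and the names ALPHABET, make_adversarial_rows, and round_trip are hypothetical helpers invented for this example.

```python
import csv
import io
import random

# Hypothetical alphabet of "nasty" characters: delimiters, both quote
# styles, backslashes, newlines, and tabs mixed with ordinary letters.
ALPHABET = 'abc ,;"\'\\\n\t!'


def make_adversarial_rows(num_rows, num_cols, seed=0):
    """Build rows of random strings drawn from ALPHABET (seeded, so repeatable)."""
    rng = random.Random(seed)
    return [
        ["".join(rng.choice(ALPHABET) for _ in range(rng.randint(0, 12)))
         for _ in range(num_cols)]
        for _ in range(num_rows)
    ]


def round_trip(rows):
    """Write rows as CSV into memory, then parse them back."""
    # newline="" keeps newlines embedded in quoted fields untranslated.
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    return list(csv.reader(buf))


rows = make_adversarial_rows(1000, 5)
# A correct reader/writer pair must return exactly what was written.
assert round_trip(rows) == rows
```

The same pattern (generate, write, read back, compare) applies to any CSV reader under test; only the reader and writer calls change.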
There is another utility function that writes the data to a file:
paratext.serial.save_frame("myfile.csv", df, encoding=encoding)
where encoding can be utf-8, ascii, printable-ascii, or arbitrary. In each case, if a sequence is encountered outside the encoding, it is properly escaped. This enables a data frame with Unicode columns, byte buffer columns, and string columns to be written to printable ASCII and read back in a lossless fashion.
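As a rough illustration of how escaping can make arbitrary bytes survive a trip through printable ASCII, here is a minimal sketch. The backslash-hex scheme and the function names below are assumptions made for this example; they are not taken from paratext, whose actual escape syntax may differ.

```python
def escape_to_printable_ascii(data: bytes) -> str:
    """Encode arbitrary bytes using only printable ASCII (0x20 to 0x7E)."""
    out = []
    for b in data:
        if b == 0x5C:                 # backslash is the escape character,
            out.append("\\\\")        # so it must itself be escaped
        elif 0x20 <= b <= 0x7E:       # printable ASCII passes through
            out.append(chr(b))
        else:                         # everything else becomes \xNN
            out.append("\\x%02x" % b)
    return "".join(out)


def unescape_from_printable_ascii(text: str) -> bytes:
    """Invert escape_to_printable_ascii."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ord(ch))
            i += 1
        elif text[i + 1:i + 2] == "\\":
            out.append(0x5C)
            i += 2
        elif text[i + 1:i + 2] == "x" and len(text) >= i + 4:
            out.append(int(text[i + 2:i + 4], 16))
            i += 4
        else:
            raise ValueError("malformed escape at offset %d" % i)
    return bytes(out)


# Every possible byte value survives the round trip.
payload = bytes(range(256))
assert unescape_from_printable_ascii(escape_to_printable_ascii(payload)) == payload
```

Because the backslash is always escaped, the encoding is unambiguous, which is what makes the round trip lossless. Quoting of delimiters and double quotes for CSV would be a separate layer on top of this.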
from paratext.
FWIW, this is still broken.
Are you sure? It works for me. I tried:
it=paratext.load_raw_csv("/tmp/hello.csv", no_header=True, allow_quoted_newlines=True)
In [2]: it.next()
Out[2]:
(u'col0',
array([0,
Also, if I try:
In [3]: paratext.load_csv_to_pandas("/tmp/hello.csv", no_header=True).head()
Out[3]:
col0 col1 col2 col3 col4
0 hello , world !
1 hello , world !
2 hello , world !
3 hello , world !
4 hello , world !
it works.
Perhaps you do not have the latest source or did not properly rebuild.
Try doing a git pull and removing the build/ directory:
git pull
rm -rf build
I found that I couldn't reproduce it when writing a test in tests/test_paratext.py. But it fails when reading from stdin. IIRC you use mmap when reading the file, so that won't work, and it certainly wouldn't make any sense to do this with multiple threads, so maybe it's moot. Even if reading from stdin has to fail, crashing with mysterious errors is not a nice UX.
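One way to avoid a mysterious crash would be to check up front whether the input can be memory mapped at all. The sketch below is a hypothetical guard written with the standard library only; can_memory_map and pick_read_strategy are names invented here, and nothing in it reflects how paratext actually opens files.

```python
import os
import stat


def can_memory_map(fd: int) -> bool:
    """Return True if fd is a regular file; pipes and terminals cannot be mapped."""
    return stat.S_ISREG(os.fstat(fd).st_mode)


def pick_read_strategy(fd: int, num_threads: int):
    """Choose how to read fd: parallel mmap for files, one streaming thread otherwise."""
    if can_memory_map(fd):
        return ("mmap", num_threads)
    # Input such as a pipe on stdin has no random access, so splitting it
    # across threads is not possible; fall back to a single sequential reader.
    return ("stream", 1)
```

A caller could pass sys.stdin.fileno() to this function before starting any worker threads, and either use the single-threaded fallback or raise an error with a clear message instead.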
I was able to get paratext added to the game in the end: https://bitbucket.org/ewanhiggs/csv-game
As this is closed, I entered #62 to handle the stdin issue. Thanks!