Comments (8)
@WillAyd
With some modification, the code above works. I will add it as an example in the read_csv docs.
I will also check the test cases. If there isn't one for this, I will add one.
Thx
from pandas.
From the latest pyarrow documentation
newlines_in_values, optional (default False)
Whether newline characters are allowed in CSV values. Setting this to True reduces the performance of multi-threaded CSV reading.
Enabling it by default would probably be a mistake. The pyarrow engine (with its multi-threaded capabilities) is the preferred option for large CSV files, though, so it'd be a shame for it to fail in this scenario.
If the pyarrow engine is here to stay, I'd recommend exposing newlines_in_values to the user.
from pandas.
To keep the pyarrow engine, you'll need to use the pyarrow library directly to handle CSV files that contain newline characters. This involves using the ParseOptions class from pyarrow.csv to set the newlines_in_values option to True.
Example
import pandas as pd
from pyarrow import csv as pv

# Build a sample frame whose text column contains embedded newlines
rows = []
for i in range(1_000_000):
    rows.append({"text": "ab\ncd", "i": i})
df = pd.DataFrame(rows)
# Write it to disk so there is a CSV file to read back
df.to_csv("example.csv", index=False)
# Define parse options to allow newlines in values
parse_options = pv.ParseOptions(newlines_in_values=True)
# Read the CSV file using pyarrow
table = pv.read_csv("example.csv", parse_options=parse_options)
# Convert the Arrow Table to a pandas DataFrame
df = table.to_pandas()
df
from pandas.
take
from pandas.
take
from pandas.
take
from pandas.
@WillAyd
I would like to introduce a new argument to expose pyarrow's 'newlines_in_values' to the user, because I cannot find a suitable one among the current parameters. Could you please suggest a name for the new parameter? 'newlines_in_values' is one option, and it might also be used by other engines in the future.
from pandas.
Reading through the issue I don't think we actually want to change anything here - the solution from @tilovashahrin should work.
Can you check if that works for you? If so, we should add a test for it to pandas (if one doesn't already exist) and maybe update the documentation to show how to do it.
from pandas.