Comments (8)

wooseogchoi commented on August 24, 2024

@WillAyd
With some modifications, the code above works. I will add it as an example in the read_csv docs.
I will also check the test cases and add one if none exists yet.
Thx

from pandas.

matteosantama commented on August 24, 2024

From the latest pyarrow documentation

newlines_in_values, optional (default False)
Whether newline characters are allowed in CSV values. Setting this to True reduces the performance of multi-threaded CSV reading.

Enabling it by default would probably be a mistake. The pyarrow engine (with its multi-threaded capabilities) is the preferred option for large CSV files, though, so it'd be a shame for it to fail in this scenario.

If the pyarrow engine is here to stay, I'd recommend exposing newlines_in_values to the user.
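
As a rough illustration of why newlines inside values complicate parsing, here is a minimal sketch using only Python's standard csv module. It does not show pyarrow's internals; the assumption (not stated in the quoted documentation) is that a multi-threaded reader would like to split its input at row boundaries, and that a newline inside a quoted value makes those boundaries ambiguous without tracking quote state. The sample data is made up for the example.

```python
import csv
import io

# Two data rows whose "text" values contain an embedded newline.
data = 'text,i\n"ab\ncd",0\n"ef\ngh",1\n'

# Naive approach: treat every newline as a row boundary.
naive_rows = data.splitlines()

# Quote-aware approach: the csv module keeps track of quoting across newlines.
parsed_rows = list(csv.reader(io.StringIO(data)))

print(len(naive_rows))   # 5 fragments, rows are split in the middle of a value
print(len(parsed_rows))  # 3 rows: the header plus two records
```

The naive split cuts each record in two, so any reader that divides work on raw newlines has to do extra bookkeeping once such values are allowed.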

tilovashahrin commented on August 24, 2024

To keep the pyarrow engine, you'll need to use the pyarrow library directly to handle CSV files that contain newline characters. This involves using the ParseOptions class from pyarrow.csv to set the newlines_in_values option to True.

Example

import pandas as pd
from pyarrow import csv as pv

# Build a frame whose "text" column contains embedded newlines
rows = []
for i in range(1_000_000):
    rows.append({"text": "ab\ncd", "i": i})

df = pd.DataFrame(rows)
# Write it out so there is a file to read back
df.to_csv("example.csv", index=False)

# Define parse options to allow newlines in values
parse_options = pv.ParseOptions(newlines_in_values=True)

# Read the CSV file using pyarrow
table = pv.read_csv("example.csv", parse_options=parse_options)

# Convert the Arrow Table to a Pandas DataFrame
df = table.to_pandas()
df

gosuchoi commented on August 24, 2024

take

wooseogchoi commented on August 24, 2024

take

wooseogchoi commented on August 24, 2024

take

wooseogchoi commented on August 24, 2024

@WillAyd
I would like to introduce a new argument to expose pyarrow's 'newlines_in_values' to the user, because I cannot find a suitable one among the current parameters. Could you please suggest a name for the new parameter, given that 'newlines_in_values' might also be used by other engines in the future?

WillAyd commented on August 24, 2024

Reading through the issue I don't think we actually want to change anything here - the solution from @tilovashahrin should work.

Can you check if that works for you? If so, we should add a test for it to pandas (if one doesn't already exist) and maybe update the documentation to show how to do it
