
Comments (12)

fpepin commented on May 16, 2024

Is anyone taking a stab at this? It seems like something fairly manageable & useful to get started.

I'm guessing that RFC4180 (https://tools.ietf.org/html/rfc4180) and R's write.csv should be enough for a decent start.

from dataframes.jl.

HarlanH commented on May 16, 2024

Up for grabs! Go for it! You may also find the Python module's design and
the discussions in their PEP of interest:

http://docs.python.org/library/csv.html
http://www.python.org/dev/peps/pep-0305/

Fortunately, writing CSV is much easier than reading! But definitely some
stuff (about dialects and string encoding) to keep in mind in there...

Thanks!
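
For reference, a small sketch of how the Python csv module linked above handles dialects when writing. The sample rows and the `render` helper are made up for illustration; `excel`, `excel-tab` and `unix` are the dialect names that ship with the standard library.

```python
import csv
import io

# Made-up sample data: one field contains both a comma and double quotes.
rows = [["name", "note"], ["Ada", 'likes commas, and "quotes"'], ["Bob", "plain"]]


def render(rows, dialect):
    """Write rows to an in-memory buffer using a named csv dialect."""
    buf = io.StringIO(newline="")  # keep the dialect's line terminator untouched
    csv.writer(buf, dialect=dialect).writerows(rows)
    return buf.getvalue()


excel_text = render(rows, "excel")     # comma-separated, CRLF, minimal quoting
tsv_text = render(rows, "excel-tab")   # tab-separated, CRLF, minimal quoting
unix_text = render(rows, "unix")       # comma-separated, LF, every field quoted

print(excel_text)
print(tsv_text)
print(unix_text)
```

The same dialect name can be passed to `csv.reader`, which is the property that makes a dialect object attractive as the single place to keep separator, terminator and quoting settings.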


fpepin commented on May 16, 2024

I'll take a look and see what I can come up with.


doobwa commented on May 16, 2024

Julia's csvwrite might be good to check out, if you haven't seen it already.


StefanKarpinski commented on May 16, 2024

Code found here: https://github.com/JuliaLang/julia/blob/master/base/datafmt.jl. Real CSV encoding and decoding is unfortunately significantly harder. And unstandardized. I've generally found that tab-separated values (TSV) works better with UNIX commands (though sadly, not all of them; I'm looking at you, join), and since tabs are pretty rare in many kinds of data, whereas commas are everywhere, there's often no need for escaping anything.

Maybe we should create a TSV standard. My proposal is binary data (encoding implies interpretation and this is just a way to express tabular data — if you know something is text or a number, that's the next level up and requires interpretation), where tabs ('\t') delimit fields and newlines ('\n') delimit rows. Embedded tabs, newlines and backslashes get backslash escaped. That's a pretty damned simple format and completely general — any kind of data can be encoded, even binary. And it's trivial to scan and break into pieces: tab characters always delimit fields and newlines always delimit rows. CR ('\r') and any other newlineish characters are just literals. Friends don't let friends end lines with that crap. This isn't DOS or Mac OS 7.
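
A rough sketch of what that proposal could look like, written in Python purely for illustration. One assumption is mine and not stated above: "backslash escaped" is read as writing the two-byte sequences `\t`, `\n` and `\\`, so that a raw tab or newline byte never appears inside a field. All function names are hypothetical.

```python
# Assumed escape table: backslash followed by a letter, never a raw tab/newline.
_UNESCAPE = {b"t": b"\t", b"n": b"\n", b"\\": b"\\"}


def escape_field(field: bytes) -> bytes:
    """Escape backslash first, then tab and newline."""
    return (field.replace(b"\\", b"\\\\")
                 .replace(b"\t", b"\\t")
                 .replace(b"\n", b"\\n"))


def unescape_field(field: bytes) -> bytes:
    """Reverse escape_field; reject unknown or dangling escapes."""
    out = bytearray()
    i = 0
    while i < len(field):
        byte = field[i:i + 1]
        if byte == b"\\":
            nxt = field[i + 1:i + 2]
            if nxt not in _UNESCAPE:
                raise ValueError(f"bad escape at byte {i}")
            out += _UNESCAPE[nxt]
            i += 2
        else:
            out += byte  # CR and every other byte is an ordinary literal
            i += 1
    return bytes(out)


def encode_table(rows):
    """Tabs delimit fields, newlines terminate rows."""
    return b"".join(b"\t".join(escape_field(f) for f in row) + b"\n" for row in rows)


def decode_table(data: bytes):
    """Split on raw newlines and tabs, then unescape each field."""
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final terminator
    return [[unescape_field(f) for f in line.split(b"\t")] for line in lines]


rows = [[b"a\tb", b"c\\d"], [b"line1\nline2", b"\r\x00\xff"]]
encoded = encode_table(rows)
print(encoded)
print(decode_table(encoded) == rows)
```

Because the escaped form never contains a raw tab or newline, the decoder can split blindly before unescaping, which is the "trivial to scan" property described above. Fields are `bytes`, so no text encoding is assumed.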


jfhbrook commented on May 16, 2024

I dig the standard, especially considering the ease of implementation. I'd
be mildly concerned about whether other tools (say, excel) could read it,
but my guess is that excel is pretty forgiving (and you're probably not
embedding tabs or newlines in excel anyway).

> CR ('\r') and any other newlineish characters are just literals. Friends
> don't let friends end lines with that crap. This isn't DOS or Mac OS 7.

I laughed.

--Josh


jfhbrook commented on May 16, 2024

One other thought: Do you specify how to parse the entries into some
numeric datatype? Or do you leave that up to the user/implementation?

--Josh


HarlanH commented on May 16, 2024

Someone on G+ linked to this, actually in reference to Julia, yesterday: http://xkcd.com/927/

Python does do a pretty good job of (on reading) auto-detecting tabular data separators/line endings, and in Universal mode, character sets too. And it does a reasonable job of defining dialects for writing too.

For a minimum viable pair of routines, perhaps we define a type that specifies the separator/terminator/encoding, instantiate some pre-defined dialects (Excel, Unix CSV, TSV, etc.), and use it for both reading and writing? Auto-detection can wait for another day...
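
Something like the following, sketched in Python rather than Julia just to show the shape. The `Dialect` type, the three predefined instances and both functions are hypothetical names, the `cp1252` encoding for the Excel dialect is an assumption, and quoting is deliberately left out to keep the sketch minimal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dialect:
    """Hypothetical bundle of separator, row terminator and text encoding."""
    separator: str
    terminator: str
    encoding: str = "utf-8"


# Predefined dialects; the Excel encoding is an assumption for illustration.
EXCEL = Dialect(separator=",", terminator="\r\n", encoding="cp1252")
UNIX_CSV = Dialect(separator=",", terminator="\n")
TSV = Dialect(separator="\t", terminator="\n")


def write_rows(rows, dialect: Dialect) -> bytes:
    """Serialize rows of strings; no quoting, so fields must not contain
    the separator or the terminator."""
    text = "".join(dialect.separator.join(row) + dialect.terminator for row in rows)
    return text.encode(dialect.encoding)


def read_rows(data: bytes, dialect: Dialect):
    """Parse bytes produced by write_rows with the same dialect."""
    lines = data.decode(dialect.encoding).split(dialect.terminator)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final terminator
    return [line.split(dialect.separator) for line in lines]


rows = [["id", "city"], ["1", "Zürich"]]
for dialect in (EXCEL, UNIX_CSV, TSV):
    assert read_rows(write_rows(rows, dialect), dialect) == rows
```

The one object drives both directions, so a reader and writer cannot disagree about the format.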


johnmyleswhite commented on May 16, 2024

I agree with Harlan: we should start by making a DelimitedData type that lets you use commas, tabs and whitespace. While full-out CSV is hard to get right, I think you'll get a long way just by accounting for quotation rules. I've always preferred TSV as a way to avoid quotations, but I'm pretty sure the majority of data we want to read will use commas.


fpepin commented on May 16, 2024

I really like the idea of a DelimitedData type. The extra advantage is that this can be used for both reading and writing and can encode the many parameters: delimiter, end of file, quoting/escape mechanism. Then it's just a matter of defining the main types. Based on the RFC, standard csv shouldn't be too hard but it's complex enough that once I get that one, the other ones will basically be free.
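
For the writing side, the RFC 4180 quoting rule is small enough to sketch. This is Python for illustration only, and `quote_field` and `format_row` are hypothetical names: a field is wrapped in double quotes when it contains the delimiter, a double quote, CR or LF, and embedded double quotes are doubled.

```python
def quote_field(field: str, delimiter: str = ",", quote: str = '"') -> str:
    """Quote a field only when it contains a character that needs it."""
    needs_quoting = any(ch in field for ch in (delimiter, quote, "\r", "\n"))
    if not needs_quoting:
        return field
    # Embedded quote characters are escaped by doubling them.
    return quote + field.replace(quote, quote * 2) + quote


def format_row(fields, delimiter: str = ",", terminator: str = "\r\n") -> str:
    """Join quoted fields and end the record with CRLF, as RFC 4180 does."""
    return delimiter.join(quote_field(f, delimiter) for f in fields) + terminator


line = format_row(["a", "b,c", 'say "hi"', "two\nlines"])
print(repr(line))
```

Passing a tab as `delimiter` gives a quoted TSV variant from the same two functions, which is roughly the sense in which the other dialects come almost for free.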

As for printing numeric types, csvwrite uses print_shortest for floats and print for the rest, which seems like a decent start to me. I'm not worried too much about it because we're not really going to be using this to communicate from Julia to Julia, so just having a reasonably intuitive printed version is fine, especially if the user can overload it.
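
To make the "user can overload it" part concrete, here is a sketch of the idea in Python, not the Julia code. `format_value` is a hypothetical name, Python's `repr` for floats stands in for print_shortest since it gives the shortest string that round-trips, and the "NA" rendering for missing values is only an example of a user-supplied override.

```python
from functools import singledispatch


@singledispatch
def format_value(value) -> str:
    """Default: fall back to the ordinary printed form."""
    return str(value)


@format_value.register
def _(value: float) -> str:
    # repr() yields the shortest decimal string that parses back to the same float.
    return repr(value)


# A user-supplied override (hypothetical): print missing values as "NA".
@format_value.register(type(None))
def _(value) -> str:
    return "NA"


print([format_value(v) for v in (42, 0.1, 1 / 3, "text", None)])
```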


johnmyleswhite commented on May 16, 2024

While there's still more to be done, write_table goes a long way towards dealing with this issue.


johnmyleswhite commented on May 16, 2024

I'm going to close this because I think write_table is actually sufficient for most purposes. More focused complaints deserve separate issues.
