Comments (6)

SpikyClip commented on June 17, 2024

Thanks John, I came across the issue to add --tsvlite but not #923. So IANA TSV is the standard. Admittedly, I was a bit confused by the decoding/encoding description in the docs, but it makes sense now, given that --tsv has no other way of protecting against those characters.

I have no issues with this behaviour, though I think it would be a good idea to mention in the docs that --tsv does not support double quoting. It's a minor addition that makes the behaviour clear, and it seems worthwhile considering that 3 issues have been raised on this and that there are parsers out there built around (non-existent) RFC4180 compliance for TSV.

TSV (tab-separated values): FS is tab and RS is newline (or carriage return + linefeed for Windows). On input, if fields have \r, \n, \t, or \\, those are decoded as carriage return, newline, tab, and backslash, respectively. On output, the reverse is done -- for example, if a field has an embedded newline, that newline is replaced by \n. TSV does not support RFC-4180-style double-quoting.
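
The escaping rule described in that paragraph can be sketched in a few lines of Python. This is only an illustration of the rule as worded in the docs, not Miller's actual implementation; the function names `decode_field` and `encode_field` are made up for this example.

```python
import re

# Escape table taken from the documentation wording quoted above.
_DECODE = {"r": "\r", "n": "\n", "t": "\t", "\\": "\\"}
_ENCODE = {"\r": "\\r", "\n": "\\n", "\t": "\\t", "\\": "\\\\"}


def decode_field(field: str) -> str:
    """Input side: turn two-character escapes back into the real characters."""
    # One left-to-right pass, so an escaped backslash followed by "n"
    # decodes to backslash + "n" rather than to a newline.
    return re.sub(r"\\([rnt\\])", lambda m: _DECODE[m.group(1)], field)


def encode_field(field: str) -> str:
    """Output side: replace CR, LF, tab and backslash with their escapes."""
    return re.sub(r"[\r\n\t\\]", lambda m: _ENCODE[m.group(0)], field)


if __name__ == "__main__":
    raw = 'first line\nsecond\tline with a "quote"'
    print(encode_field(raw))
```

Note that double quotes pass through untouched in both directions, which is the point of the proposed sentence about RFC-4180-style quoting.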

For my purposes I am simply going to sed those " into ' as they don't contribute any useful information in the TSV. That way I can be sure it will not cause issues with RFC4180 CSV and IANA TSV parsers.
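
The same substitution could also be done without sed. A minimal Python sketch, assuming the quotes carry no meaning in the data (as stated above); `neutralize_quotes` is a hypothetical helper name:

```python
def neutralize_quotes(lines):
    """Yield each TSV line with every double quote replaced by a single quote."""
    for line in lines:
        yield line.replace('"', "'")


if __name__ == "__main__":
    sample = ['2\t"efg\tcsv\n', '4\t"klm"\tsss\n']
    for cleaned in neutralize_quotes(sample):
        print(cleaned, end="")
```

This rewrites every double quote, including any that are legitimately part of a value, so it only fits data where that loss is acceptable.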

Am happy to close this issue if you are.

from miller.

johnkerl commented on June 17, 2024

Thanks @SpikyClip !! And thanks for reminding me of the name "IANA" for the (pseudo-)spec which I ended up following on #923.

Let's leave this issue open as a doc issue. I'll incorporate your suggestions above, and I'll also use the term "IANA" for clarity, etc.

aborruso commented on June 17, 2024

Hi @SpikyClip, is there an RFC for TSV? If yes, could you share the URL?

I think there is only the one for CSV. In fact, maybe the solution is to have a TSV that behaves like a CSV.

I'm not sure I understand your goal, but I'll try. If you run

mlr --icsvlite --ocsv --ragged --fs "\t" cat broken.tsv.txt

you get

col1	col2	col3
1	abc	tsvb
2	"""efg"	csv
3	hij	fdfd
4	"""klm"""	sss
5	"oh"""	sss
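
The doubled and tripled quotes in this output are what RFC4180-style quoting produces on the output side: a field that contains a double quote is wrapped in quotes, and each embedded quote is doubled. A small illustration with Python's csv module, used here only as a stand-in for an RFC4180-style writer (it is not Miller's code, and the input rows are assumed from the output above):

```python
import csv
import io


def write_row(fields):
    """Serialize one record as tab-delimited text with RFC4180-style quoting."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    return buf.getvalue()


if __name__ == "__main__":
    # Assumed field values: a stray leading quote and a stray trailing quote.
    print(write_row(["2", '"efg', "csv"]), end="")
    print(write_row(["5", 'oh"', "sss"]), end="")
```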

SpikyClip commented on June 17, 2024

I get your point that RFC4180 formally applies to CSV files and not necessarily TSV files. But from my experience the two are essentially treated the same except that the delimiter is different.

To expand on my use case, a common package in my field is readr from the R tidyverse. Its read_tsv function has a default argument quote = "\"", which would result in rows being silently dropped (or concatenated). Downstream users of a potentially broken TSV file may not notice this easily, especially in a large file with thousands of rows and hundreds of columns. Sure, it's an easy fix on the reader side (quote = ""), but it requires some due diligence, and I'd rather just avoid the issue.

Besides, it's unusual considering --tsvlite is an option: if --tsv and --tsvlite behave the same with respect to quoting, they don't follow the same semantic pattern as --csvlite and --csv.
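
The failure mode is easy to reproduce with any quote-aware reader. The sketch below uses Python's csv module as a stand-in for readr (the two libraries are not identical, and the data is made up), comparing a reader that treats " as a quote character with one that does not:

```python
import csv
import io

# Hypothetical TSV with stray double quotes in the rows numbered 2 and 4.
DATA = (
    "col1\tcol2\tcol3\n"
    "1\tabc\tx\n"
    '2\t"efg\ty\n'
    "3\thij\tz\n"
    '4\tklm"\tw\n'
)


def read_rows(text, **reader_options):
    """Parse tab-delimited text into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", **reader_options))


# Default dialect: a leading double quote opens a quoted field, which can
# swallow tabs and newlines until the next double quote.
quote_aware = read_rows(DATA)

# QUOTE_NONE: double quotes are ordinary characters, one row per line.
quote_blind = read_rows(DATA, quoting=csv.QUOTE_NONE)

if __name__ == "__main__":
    print("rows with quote handling:   ", len(quote_aware))
    print("rows without quote handling:", len(quote_blind))
```

With quote handling on, the lines between the two stray quotes are folded into a single field and no error is raised, which is why the problem is easy to miss in a large file.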

Thanks for the --fs "\t" solution, will try that out (is --ragged necessary though?).

aborruso commented on June 17, 2024

In my experience, RFC4180 is usually, and rightly, applied to CSV, not to TSV.
The same holds for Miller, so the way to solve your use case is to combine --fs "\t" with --icsvlite and --ocsv.

I don't think --ragged is necessary.

johnkerl commented on June 17, 2024

> I get your point that RFC4180 formally applies to CSV files and not necessarily TSV files. But from my experience the two are essentially treated the same except that the delimiter is different.

@SpikyClip there's the rub! I had agreed with your statement here, and this is what Miller once did, but as of #923 this is no longer the case. Formerly, Miller's TSV was RFC4180 CSV with commas replaced by tabs. Now, while TSV does not have an RFC, as of #923 I follow the behavior as described at https://miller.readthedocs.io/en/6.12.0/file-formats/#csvtsvasvusvetc.

Namely: embedded newlines and tabs are encoded as \n and \t.
