
Comments (17)

kgryte commented on May 16, 2024

Prospective API design:

readDSVLine( [options] )

Returns a function for reading a single DSV line according to provided options.

var opts = {
    'delimiter': ',',
    'comment': '#',
    'whitespace': [ ' ' ]
};

var reader = readDSVLine( opts );
// returns <Function>
  • If options is not provided, default values are used.

reader( line )

Parses a single DSV line and returns an array of values.

var reader = readDSVLine( {} );

var line = reader( 'foo,bar,beep,boop' );
// returns [ 'foo', 'bar', 'beep', 'boop' ]
  • If the reader is unable to parse a provided line, the function must return null.

reader.assign( line, out, stride, offset )

Parses a single DSV line and assigns field values to elements in the provided output array.

var reader = readDSVLine( {} );

var out = [ null, null, null, null, null, null, null, null ];

var o = reader.assign( 'foo,bar,beep,boop', out, 2, 1 );
// returns [ null, 'foo', null, 'bar', null, 'beep', null, 'boop' ]

var bool = ( o === out );
// returns true
  • If unable to parse a line, the method should return null.
  • Users should be aware that, if the method returns null, elements in the provided output array may still have been mutated.
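
To make the proposed interface concrete, here is a minimal sketch. It is not the stdlib implementation. It assumes a single-character delimiter, ignores the comment and whitespace options and any quoting, and treats non-string input as a parse failure. Those choices are assumptions made only for illustration.

```javascript
'use strict';

// Hypothetical sketch of the proposed API; only the `delimiter` option is honored.
function readDSVLine( options ) {
    var opts = options || {};
    var delimiter = opts.delimiter || ',';

    // Parses a single line and returns a new array of field values:
    function reader( line ) {
        if ( typeof line !== 'string' ) {
            return null; // assumption: non-string input counts as unparseable
        }
        return line.split( delimiter );
    }

    // Parses a single line and writes field values into `out`, starting at
    // index `offset` and advancing by `stride` for each field:
    reader.assign = function assign( line, out, stride, offset ) {
        var fields = reader( line );
        var i;
        if ( fields === null ) {
            return null;
        }
        for ( i = 0; i < fields.length; i++ ) {
            out[ offset + ( i*stride ) ] = fields[ i ];
        }
        return out;
    };
    return reader;
}

var reader = readDSVLine( {} );
var out = [ null, null, null, null, null, null, null, null ];
var o = reader.assign( 'foo,bar,beep,boop', out, 2, 1 );
console.log( o );
console.log( o === out );
```

In this sketch, assign only returns null before writing anything, so a failed call leaves the output array untouched. An implementation that parses and writes incrementally would not have that property, which is the reason for the caveat about mutation.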

from stdlib.

kgryte commented on May 16, 2024

No, not yet. This issue is blocked until the base implementation is finished, which it is not.

labiej commented on May 16, 2024

While googling around and deciding on a course of action, I found the csv-spectrum package on npm, which can be used to test an implementation.

Testing the d3-dsv implementation this way, I ran into some trouble: although the output was visibly the same, a test for equality failed.

It turns out you need to use tape's deepEqual method in this case and use the spread operator to create an array without a columns property.
For future reference, this is the code I used to test the implementation:

'use strict';

const spectrum = require('csv-spectrum');
const d3DSV = require('d3-dsv');
const tape = require('tape');

tape( 'Test all cases', function test( t ) {

    spectrum( function onData( err, data ) {
        if ( err ) {
            // Fail the test if the sample data could not be loaded
            t.fail( err.message );
            t.end();
            return;
        }
        for ( const testCase of data ) {
            // Convert the data to a string
            const csvDataString = testCase.csv.toString( 'utf8' );

            // Create a new array without the columns property, which breaks the equality test
            const parsed = [ ...d3DSV.csvParse( csvDataString ) ];
            const control = JSON.parse( testCase.json );

            // Test type of parsed objects
            t.equal( typeof parsed, typeof control, 'testing types of sample: ' + testCase.name );

            // Test equality by value
            t.deepEqual( parsed, control, 'testing sample: ' + testCase.name );
        }
        // End the test only after the asynchronous callback has run all assertions
        t.end();
    });
});

kgryte commented on May 16, 2024

@labiej Thanks for looking into this!

kgryte commented on May 16, 2024

Ref: https://github.com/d3/d3-dsv

kgryte commented on May 16, 2024

CSV-spectrum: https://github.com/maxogden/csv-spectrum

kgryte commented on May 16, 2024

Papaparse: https://github.com/mholt/PapaParse

kgryte commented on May 16, 2024

RFC: https://datatracker.ietf.org/doc/html/rfc4180

kgryte commented on May 16, 2024

Python built-in CSV API: https://docs.python.org/3/library/csv.html

PEP 305: https://peps.python.org/pep-0305/

kgryte commented on May 16, 2024

GoLang: https://pkg.go.dev/encoding/csv

kgryte commented on May 16, 2024

MATLAB: https://www.mathworks.com/help/matlab/ref/csvread.html and https://www.mathworks.com/help/matlab/ref/readmatrix.html

MATLAB's API is interesting insofar as it supports reading only sections of a CSV file.

kgryte commented on May 16, 2024

Pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Pandas has arguably the most complex CSV read API. Some interesting features:

  • Select particular columns.
  • Ability to provide custom converters for particular columns.
  • Comment support.
  • Custom separator support (decimals, thousands, etc).
  • Dialects (as in native Python CSV API)
  • Support for custom date parsing.
  • Support for specifying a particular column as a column of row labels

kgryte commented on May 16, 2024

Streaming CSV parser for Node: https://github.com/mafintosh/csv-parser; however, the issue tracker suggests some concerns with the implementation.

kgryte commented on May 16, 2024

R: https://www.rdocumentation.org/packages/qtl2/versions/0.28/topics/read_csv

kgryte commented on May 16, 2024

Parsing options:

  • delimiter: delimiter to use. For CSV, the value would be ,. For TSV, the value would be a TAB character.
  • thousands: thousands separator. This would allow numbers to be written 1,000,000.
  • decimal: decimal separator. This helps with European data, which formats 3.14 as 3,14.
  • quote: character sequence used to denote the start and end of a quoted item. Quoted items can include the delimiter and the delimiter must be ignored.
  • true: array of values to be considered equal to true. E.g., the string 'True' would be converted to true, a boolean.
  • false: array of values to be considered equal to false.
  • quoted: list of columns which may have quoted field values or false, indicating that no fields should be quoted. (Update: I am not sure of the original intent of this option. However, it may be to provide a fast path for parsing, in which case a value of true could mean that all fields may contain quoted field values.)
  • doublequote: boolean indicating whether to interpret two consecutive quote character sequences inside a quoted field as a single quote character sequence.
  • comment: character sequence indicating that the remainder of a line should not be parsed.
  • escape: character sequence used to escape other characters.
  • columns: list of columns to return. Default is to return all columns.
  • missing: list of strings to recognize as missing values (e.g., NA, NaN, null, etc).
  • whitespace: list of characters to interpret as whitespace.
  • trim: boolean indicating whether to trim leading whitespace in each field value.
  • trimNonNumeric: boolean indicating whether to trim non-numeric characters from a numeric value.
  • transforms: object whose properties are column numbers and whose values are callbacks which should be invoked for the respective column values and which return a transformed value. E.g., a callback which converts a string to a Date object.
  • consecutiveDelimiters: rule specifying how to handle consecutive delimiters: keep, join, error.
  • leadingDelimiters: rule specifying how to handle leading delimiters: keep, ignore, error.
  • trailingDelimiters: rule specifying how to handle trailing delimiters: keep, ignore, error.
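
To illustrate how a few of these options might interact, the following is a rough sketch of a character-by-character parser covering only delimiter, quote, doublequote, and comment. The parseLine name and the default values are hypothetical, the three characters are assumed to be single characters, and an unterminated quoted field is assumed to be a parse failure. None of this is taken from an existing implementation.

```javascript
'use strict';

// Hypothetical helper; option names mirror the list above, but single-character
// values are assumed for `delimiter`, `quote`, and `comment`.
function parseLine( line, options ) {
    var opts = options || {};
    var delimiter = opts.delimiter || ',';
    var quote = opts.quote || '"';
    var comment = opts.comment || '#';
    var doublequote = ( opts.doublequote !== false );
    var fields = [];
    var field = '';
    var inQuotes = false;
    var ch;
    var i;

    for ( i = 0; i < line.length; i++ ) {
        ch = line[ i ];
        if ( inQuotes ) {
            if ( ch === quote ) {
                if ( doublequote && line[ i+1 ] === quote ) {
                    // Two consecutive quotes inside a quoted field yield one quote:
                    field += quote;
                    i += 1;
                } else {
                    inQuotes = false;
                }
            } else {
                // Delimiters and comment characters are literal inside quotes:
                field += ch;
            }
        } else if ( ch === quote ) {
            inQuotes = true;
        } else if ( ch === delimiter ) {
            fields.push( field );
            field = '';
        } else if ( ch === comment ) {
            // Ignore the remainder of the line:
            break;
        } else {
            field += ch;
        }
    }
    if ( inQuotes ) {
        return null; // assumption: an unterminated quoted field is a parse failure
    }
    fields.push( field );
    return fields;
}

console.log( parseLine( 'a,"b,c","say ""hi""",d# trailing comment' ) );
// => [ 'a', 'b,c', 'say "hi"', 'd' ]
```

The sketch does no trimming or type conversion, so options such as trim, thousands, decimal, true, false, and missing would need to be applied as a separate step on each returned field.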

kgryte commented on May 16, 2024

For the line-by-line reader, the proposed package is @stdlib/utils/dsv/base/parse-line.

Once the line-by-line reader is implemented, we can consider a "sniff" package and other CSV/DSV abstraction packages.

Infinage commented on May 16, 2024

Hi @kgryte, is this issue still open? I see that we have implemented an incremental parser here already: @stdlib/utils/dsv/base/parse. Is this open issue now a matter of creating a wrapper around it?
