Comments (17)
Prospective API design:
readDSVLine( [options] )
Returns a function for reading a single DSV line according to provided options.
var opts = {
    'delimiter': ',',
    'comment': '#',
    'whitespace': [ ' ' ]
};
var reader = readDSVLine( opts );
// returns <Function>
- If `options` is not provided, default values are used.
reader( line )
Parses a single DSV line and returns an array of values.
var reader = readDSVLine( {} );
var line = reader( 'foo,bar,beep,boop' );
// returns [ 'foo', 'bar', 'beep', 'boop' ]
- If the reader is unable to parse a provided line, the function must return `null`.
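To make the proposed behavior concrete, here is a minimal sketch of the factory and reader. It is not the stdlib implementation: the default values in `DEFAULTS` are assumptions (the proposal does not list them), quoting is not handled, and treating a non-string input as "unable to parse" is only one possible reading of the `null` rule.

```javascript
'use strict';

// Hypothetical defaults; the proposal does not specify the actual values.
var DEFAULTS = {
    'delimiter': ',',
    'comment': '#',
    'whitespace': [ ' ' ]
};

function readDSVLine( options ) {
    // Fall back to defaults for any option which is not provided:
    var opts = Object.assign( {}, DEFAULTS, options );

    return reader;

    function reader( line ) {
        var idx;
        if ( typeof line !== 'string' ) {
            // Assumption: a non-string line counts as "unable to parse".
            return null;
        }
        // Drop everything from the comment sequence onward:
        if ( opts.comment ) {
            idx = line.indexOf( opts.comment );
            if ( idx >= 0 ) {
                line = line.substring( 0, idx );
            }
        }
        // Naive split; quoted fields are not supported in this sketch.
        return line.split( opts.delimiter );
    }
}

var reader = readDSVLine();
var fields = reader( 'foo,bar,beep,boop#note' );
// returns [ 'foo', 'bar', 'beep', 'boop' ]

var bad = reader( null );
// returns null
```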
reader.assign( line, out, stride, offset )
Parses a single DSV line and assigns field values to elements in the provided output array.
var reader = readDSVLine( {} );
var out = [ null, null, null, null, null, null, null, null ];
var o = reader.assign( 'foo,bar,beep,boop', out, 2, 1 );
// returns [ null, 'foo', null, 'bar', null, 'beep', null, 'boop' ]
var bool = ( o === out );
// returns true
- If unable to parse a line, the method should return `null`.
- Users should beware that, if the method returns `null`, elements in the provided output array could have still been mutated.
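The stride and offset semantics can be sketched as follows. This is a simplified illustration rather than the planned implementation: it splits naively on the delimiter and parses the whole line before writing, so it never leaves `out` partially mutated. An implementation which writes fields as it scans could fail midway, which is presumably why the caveat above exists.

```javascript
'use strict';

function readDSVLine( options ) {
    // Assumption: comma is the default delimiter.
    var delimiter = ( options && options.delimiter ) || ',';

    reader.assign = assign;
    return reader;

    function reader( line ) {
        if ( typeof line !== 'string' ) {
            return null;
        }
        return line.split( delimiter );
    }

    function assign( line, out, stride, offset ) {
        var fields = reader( line );
        var i;
        if ( fields === null ) {
            return null;
        }
        // The i-th field is written to index `offset + i*stride`:
        for ( i = 0; i < fields.length; i++ ) {
            out[ offset + ( i*stride ) ] = fields[ i ];
        }
        return out;
    }
}

var reader = readDSVLine( {} );
var out = [ null, null, null, null, null, null, null, null ];
var o = reader.assign( 'foo,bar,beep,boop', out, 2, 1 );
// returns [ null, 'foo', null, 'bar', null, 'beep', null, 'boop' ]
```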
from stdlib.
No, not yet. This issue is blocked until the base implementation is finished, which it is not.
from stdlib.
While googling around and deciding on a course of action, I found the csv-spectrum package on npm, which can be used to test an implementation.
Testing the d3-dsv implementation this way, I ran into some trouble: while the output was visibly the same, a test for equality failed.
It turns out you need to use tape's `deepEqual` method in this case, and use the spread operator to create an array without a `columns` property.
For future reference, this is the code I used to test the implementation
'use strict';
const spectrum = require( 'csv-spectrum' );
const d3DSV = require( 'd3-dsv' );
const tape = require( 'tape' );
tape( 'Test all cases', function test( t ) {
    spectrum( function onData( err, data ) {
        if ( err ) {
            // Fail the test if the sample data could not be loaded
            return t.end( err );
        }
        for ( const testCase of data ) {
            // Convert the data to a string
            const csvDataString = testCase.csv.toString( 'utf8' );
            // Create a new array without the columns property which breaks the equality test
            const parsed = [ ...d3DSV.csvParse( csvDataString ) ];
            const control = JSON.parse( testCase.json );
            // Test type of parsed objects
            t.equal( typeof parsed, typeof control, 'testing types of sample: ' + testCase.name );
            // Test equality by value
            t.deepEqual( parsed, control, 'testing sample: ' + testCase.name );
        }
        // End the test only after the asynchronous callback has run its assertions
        t.end();
    });
});
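The effect of the spread operator mentioned above can be seen in isolation. The snippet below does not call d3-dsv; it hand-builds an array with an extra `columns` property (the `rows` and `plain` names are illustrative) and shows that spreading copies only the indexed elements.

```javascript
'use strict';

// Simulated parser output: an array of row objects that also carries a
// `columns` property (built by hand here, not produced by d3-dsv).
var rows = [ { 'a': '1', 'b': '2' } ];
rows.columns = [ 'a', 'b' ];

// Spreading copies only the indexed elements, so `columns` is left behind:
var plain = [ ...rows ];

console.log( Object.keys( rows ) );
// => [ '0', 'columns' ]

console.log( Object.keys( plain ) );
// => [ '0' ]
```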
from stdlib.
@labiej Thanks for looking into this!
from stdlib.
Ref: https://github.com/d3/d3-dsv
from stdlib.
CSV-spectrum: https://github.com/maxogden/csv-spectrum
from stdlib.
Papaparse: https://github.com/mholt/PapaParse
from stdlib.
RFC: https://datatracker.ietf.org/doc/html/rfc4180
from stdlib.
Python built-in CSV API: https://docs.python.org/3/library/csv.html
PEP 305: https://peps.python.org/pep-0305/
from stdlib.
GoLang: https://pkg.go.dev/encoding/csv
from stdlib.
MATLAB: https://www.mathworks.com/help/matlab/ref/csvread.html and https://www.mathworks.com/help/matlab/ref/readmatrix.html
MATLAB's API is interesting insofar as it supports reading only sections of a CSV file.
from stdlib.
Pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Pandas has arguably the most complex CSV read API. Some interesting features:
- Select particular columns.
- Ability to provide custom converters for particular columns.
- Comment support.
- Custom separator support (decimals, thousands, etc).
- Dialects (as in native Python CSV API)
- Support for custom date parsing.
- Support for specifying a particular column as a column of row labels
from stdlib.
Streaming CSV parser for Node: https://github.com/mafintosh/csv-parser; however, its issue tracker suggests some concerns with the implementation.
from stdlib.
R: https://www.rdocumentation.org/packages/qtl2/versions/0.28/topics/read_csv
from stdlib.
Parsing options:
- delimiter: delimiter to use. For CSV, the value would be `,`. For TSV, the value would be a TAB character.
- thousands: thousands separator. This would allow numbers to be written as `1,000,000`.
- decimal: decimal separator. This helps with European data, which formats `3.14` as `3,14`.
- quote: character sequence used to denote the start and end of a quoted item. Quoted items can include the delimiter, in which case the delimiter must not be treated as a field separator.
- true: array of values to be considered equal to `true`. E.g., the string `'True'` would be converted to `true`, a boolean.
- false: array of values to be considered equal to `false`.
- quoted: list of columns which may have quoted field values or `false`, indicating that no fields should be quoted. (update: not sure of the original intent of this option. However, it may be to provide a fast path for parsing, in which case a value of `true` could mean that all fields may contain quoted field values.)
- doublequote: boolean indicating whether to interpret two consecutive `quote` character sequences inside a quoted field as a single `quote` character sequence.
- comment: character sequence indicating that the remainder of a line should not be parsed.
- escape: character sequence used to escape other characters.
- columns: list of columns to return. Default is to return all columns.
- missing: list of strings to recognize as missing values (e.g., `NA`, `NaN`, `null`, etc).
- whitespace: list of characters to interpret as whitespace.
- trim: boolean indicating whether to trim leading whitespace in each field value.
- trimNonNumeric: boolean indicating whether to trim non-numeric characters from a numeric value.
- transforms: object whose properties are column numbers and whose values are callbacks which should be invoked for the respective column values and which return a transformed value. E.g., a callback which converts a string to a `Date` object.
- consecutiveDelimiters: rule specifying how to handle consecutive delimiters: `keep`, `join`, `error`.
- leadingDelimiters: rule specifying how to handle leading delimiters: `keep`, `ignore`, `error`.
- trailingDelimiters: rule specifying how to handle trailing delimiters: `keep`, `ignore`, `error`.
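As an illustration of how `delimiter`, `quote`, and `doublequote` could interact, here is a small state-machine sketch. The helper name `splitQuoted` is hypothetical, single-character delimiter and quote values are assumed, and returning `null` for an unterminated quoted field is one possible interpretation of "unable to parse".

```javascript
'use strict';

// Hypothetical helper: splits one line into fields, honoring quoted items.
// Assumes `opts.delimiter` and `opts.quote` are single characters.
function splitQuoted( line, opts ) {
    var inQuotes = false;
    var fields = [];
    var field = '';
    var ch;
    var i;

    for ( i = 0; i < line.length; i++ ) {
        ch = line[ i ];
        if ( inQuotes ) {
            if ( ch === opts.quote ) {
                if ( opts.doublequote && line[ i+1 ] === opts.quote ) {
                    // Two consecutive quotes => one literal quote
                    field += opts.quote;
                    i += 1;
                } else {
                    // Closing quote
                    inQuotes = false;
                }
            } else {
                // Delimiters inside a quoted item are kept as data
                field += ch;
            }
        } else if ( ch === opts.quote && field === '' ) {
            // Opening quote at the start of a field
            inQuotes = true;
        } else if ( ch === opts.delimiter ) {
            fields.push( field );
            field = '';
        } else {
            field += ch;
        }
    }
    if ( inQuotes ) {
        // Unterminated quoted field
        return null;
    }
    fields.push( field );
    return fields;
}

var opts = {
    'delimiter': ',',
    'quote': '"',
    'doublequote': true
};
var parsed = splitQuoted( 'a,"b,c","say ""hi"""', opts );
// returns [ 'a', 'b,c', 'say "hi"' ]

var broken = splitQuoted( 'a,"unterminated', opts );
// returns null
```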
from stdlib.
For the line-by-line reader, proposed package: `@stdlib/utils/dsv/base/parse-line`.
Once the line-by-line reader is implemented, can consider a "sniff" package and other CSV/DSV abstraction packages.
from stdlib.
Hi @kgryte, is this issue still open? I see that we have implemented an incremental parser here already: `@stdlib/utils/dsv/base/parse`. Is this open issue now a matter of creating a wrapper around it?
from stdlib.