Comments (12)
I haven't had the time yet. The solution needs to deal with the streaming nature of the parser: extract a limited number of bytes from the stream, apply a detection algorithm such as the one proposed above, then replay the bytes stored on the side with the detected delimiter. I need some time to do it correctly.
from node-csv.
I would tend to discover the character, as in the second method, after filtering out any character already used in the options (e.g. quotes, record delimiters, ...) and general alphanumeric characters (`[a-zA-Z0-9]`, including accented characters).
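That filtering approach could be sketched like this, under assumed option names (not the real csv-parse API):

```javascript
// A sketch of the filtering approach, using assumed option names: count
// every character in a sample, skipping characters already used by other
// options and any letter or digit (\p{L} also covers accented characters),
// then keep the most frequent survivor as the delimiter candidate.
function discoverDelimiter(sample, { quote = '"', record_delimiter = ['\n', '\r'] } = {}) {
  const excluded = new Set([quote, ...record_delimiter]);
  const counts = new Map();
  for (const ch of sample) {
    if (excluded.has(ch)) continue;
    if (/[\p{L}\p{N}]/u.test(ch)) continue; // filter letters/digits, incl. accented
    counts.set(ch, (counts.get(ch) || 0) + 1);
  }
  let best = null, bestCount = 0;
  for (const [ch, count] of counts) {
    if (count > bestCount) { best = ch; bestCount = count; }
  }
  return best;
}
```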
from node-csv.
A few notes for now:
- Name it `delimiter_auto` and not `auto_delimiter`, and `__discoverDelimiterAuto` and not `__autoDiscoverDelimiter`.
- Disabled by default, the default value is `false`.
- When normalizing the option, add consistency checks, for example it cannot equal the values of `record_delimiter` (all those rules require tests).
- Don't convert to string, you shall compare values directly inside the buffer.
- Write more unit tests, in particular one which writes data one byte at a time (see https://github.com/adaltas/node-csv/blob/master/packages/csv-parse/test/api.stream.events.coffee#L53 as an example).
- My strategy would be to discover the delimiter before any parsing is done; here is how I would start my experiment:
  - Work around `__needMoreData`: if `delimiter_auto` is activated, and only on the first line, you shall allocate a safe buffer size dedicated to discovery.
  - Start discovery (your `__autoDiscoverDelimiter` function) just after BOM handling and before actual parsing (https://github.com/adaltas/node-csv/blob/master/packages/csv-parse/lib/api/index.js#L109).
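The two steps above could be sketched like this, using assumed internal names (`state`, `detect`, `MAX_DISCOVERY_BYTES`); the real csv-parse internals differ:

```javascript
// A sketch of where discovery could hook into the parsing loop, using
// assumed names; not the actual csv-parse internals.
const MAX_DISCOVERY_BYTES = 1024; // safe buffer size dedicated to discovery

// Keep asking for more data until the sample holds a full first line or
// hits the safety cap (comparing bytes directly, no string conversion).
function needMoreDataForDiscovery(buf, recordDelimiter = 0x0a) {
  return buf.length < MAX_DISCOVERY_BYTES && !buf.includes(recordDelimiter);
}

// Runs once before actual parsing: returns true while more bytes are
// needed, otherwise detects the delimiter and locks it into the state.
function discoverOnFirstLine(state, buf, detect) {
  if (!state.delimiter_auto || state.discovered) return false;
  if (needMoreDataForDiscovery(buf)) return true;
  state.delimiter = detect(buf.subarray(0, MAX_DISCOVERY_BYTES));
  state.discovered = true;
  return false;
}
```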
from node-csv.
Hey @cawoodm,
Is anyone working on this one? I'll take it on.
from node-csv.
While I am not against the idea, I can't say that I fully support it either. However, if you come up with a clean `delimiter_auto` option, I'll probably merge it.
from node-csv.
@wdavidw I looked on GitHub to see what other people were doing.
node-csv-string has a `detect` function, which basically looks for the first occurrence of one of the delimiters; I guess it is fine for most cases.
Another, more advanced implementation is the `determineMost` function from detect-csv, which looks at a sample and returns the delimiter with the highest occurrence count.
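The two strategies can be sketched side by side. These are sketches in the spirit of the libraries mentioned, with assumed candidate lists, not their actual code:

```javascript
// First-occurrence strategy (in the spirit of node-csv-string's detect):
// return whichever candidate delimiter appears earliest in the sample.
function detectFirst(sample, candidates = [',', ';', '\t', '|']) {
  let best = null, bestIndex = Infinity;
  for (const d of candidates) {
    const i = sample.indexOf(d);
    if (i !== -1 && i < bestIndex) { best = d; bestIndex = i; }
  }
  return best;
}

// Highest-count strategy (in the spirit of detect-csv's determineMost):
// return the candidate with the most occurrences in the sample.
function detectMost(sample, candidates = [',', ';', '\t', '|']) {
  let best = null, bestCount = 0;
  for (const d of candidates) {
    const n = sample.split(d).length - 1; // occurrence count of d
    if (n > bestCount) { best = d; bestCount = n; }
  }
  return best;
}
```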
What do you think?
from node-csv.
@wdavidw
I created a small proof of concept for the `auto_delimiter` option.
master...hadyrashwan:node-csv:patch-1
When running the tests, the following happens, and I'm not sure why:
- All tests pass when I run the test script, except for the encoding with BOM option.
- When I run the encoding tests (`packages/csv-parse/test/option.encoding.coffee`) on their own, they pass.
- I added a small test for `\t` based on the delimiter tests to see how the logic runs; it did detect the delimiter successfully, however it did not pass the test.
Question:
- We are committing the `dist` files, is this expected?
Missing parts:
- Handling of characters coming from the escape, quote, and record delimiter options.
- Add more tests.
- Add a reference in the TS definitions.
- Add a new page about the `auto_delimiter` option in the docs.
Appreciate your feedback :)
from node-csv.
I'll take some time to review later. In the meantime, what do you mean by "We are committing the dist files"?
from node-csv.
> I'll take some time to review later. In the meantime, what do you mean by "We are committing the dist files"?
When I'm working, I always see the build files in the `dist` folders added to git rather than ignored.
Some projects add those build files to their `.gitignore` file.
I just want to make sure that I'm not adding those files by mistake.
from node-csv.
A couple of comments on the method of detecting delimiters:
- We cannot safely assume that the most common character is THE CSV delimiter. The CSV delimiter is the character that consistently splits each row into the same number of columns.
- CSVs can safely store strings that contain the delimiter, so parsing has to be a little more intelligent (either considering quotes or allowing for a small degree of inconsistency in column count per row).
Python has a great example of handling these cases in its implementation. The pandas library uses this implementation but only reads the first line of the file (here).
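The consistency check described above can be sketched as follows. This is a sketch inspired by the idea behind Python's delimiter sniffing, with an assumed candidate list, and it deliberately ignores quoting for brevity:

```javascript
// A sketch of consistency-based detection: prefer the candidate that
// splits every sampled row into the same number of columns (> 1).
// Quoted fields containing the delimiter are not handled here; a real
// implementation would account for quotes or tolerate small mismatches.
function detectByConsistency(sample, candidates = [',', ';', '\t', '|']) {
  const rows = sample.split(/\r?\n/).filter((r) => r.length > 0);
  for (const d of candidates) {
    const counts = rows.map((r) => r.split(d).length);
    const consistent = counts.every((c) => c === counts[0]);
    if (consistent && counts[0] > 1) return d;
  }
  return null; // no candidate splits all rows consistently
}
```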
from node-csv.
Are there any plans to open a PR for that? As far as I can see the current changes are only present on a branch.
I would definitely love to see that feature.
from node-csv.
Here's another algorithm for detecting the delimiter that seems like a good idea:
https://stackoverflow.com/a/19070276/2180570
from node-csv.