
streamparser-json's Introduction

@streamparser/json


Fast, dependency-free library to parse a JSON stream using UTF-8 encoding in Node.js, Deno, or any modern browser. Fully compliant with the JSON spec and JSON.parse(...).
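A minimal usage sketch (based on the API shown in the issues further down this page, e.g. the Tokenizer examples and the ParsedElementInfo callback; treat the exact callback shape as an assumption):

import { JSONParser } from '@streamparser/json';

const parser = new JSONParser();
// onValue fires for each parsed value; its argument carries value/key/parent/stack
// (see the issue examples below).
parser.onValue = ({ value, key }) => {
  console.log(key, value);
};

// Feed the input in arbitrary chunks; the parser keeps state between writes.
parser.write('{"greeting": "hel');
parser.write('lo"}');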

@streamparser/json ecosystem

There are multiple flavours of @streamparser, including @streamparser/json (the plain parser) and @streamparser/json-node (a Node.js stream wrapper, used in several of the issues below).

License

See LICENSE.md.

streamparser-json's People

Contributors

callumlocke, dependabot[bot], drawmindmap, juanjodiaz, knownasilya, miunau, mrazauskas, slevy85


streamparser-json's Issues

ignore BOM

I ran into an issue where the tokenizer chokes on files with a BOM. It throws Error: Unexpected "ï" at position "0" in state START.

I was able to patch the tokenizer with a quick-and-dirty addition of a TokenizerStates.BOM state. Unfortunately I don't have time to submit a formal PR, but I wanted to raise the issue for tracking.
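In the meantime, a workaround sketch (not part of the library; it assumes the input reaches the tokenizer as UTF-8 bytes): strip the BOM before the first write.

// Strip a UTF-8 BOM (0xEF 0xBB 0xBF) from the first chunk, if present.
function stripBom(firstChunk) {
  if (
    firstChunk.length >= 3 &&
    firstChunk[0] === 0xef &&
    firstChunk[1] === 0xbb &&
    firstChunk[2] === 0xbf
  ) {
    return firstChunk.subarray(3);
  }
  return firstChunk;
}

tokenizer.write(stripBom(firstChunk));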

Can't get streamparser/json-node to extract a specific path

The code below doesn't generate any output, whereas I would expect it to print "this might be a long string". If the path is changed to just $.attachments.0, then it correctly prints this:

$ node src/stream.ts
>>>>>>>> data {
  value: { filename: 'file1', content: 'this might be a long string' },
  key: 0,
  parent: [ <1 empty item> ],
  stack: [
    { key: undefined, value: undefined, mode: undefined, emit: false },
    { key: 'attachments', value: [Object], mode: 0, emit: false }
  ]
}

I'm going through the docs and can't spot the error I'm making. Is it in the path I'm using? Thank you!

import { Readable, Transform } from "stream";
import { JSONParser } from "@streamparser/json-node";

const attachmentContentParser = new JSONParser({
  stringBufferSize: 0,
  keepStack: false,
  paths: ["$.attachments.0.content"],
});

const jsonData = {
  attachments: [
    {
      filename: "file1",
      content: "this might be a long string",
    },
    {
      filename: "file2",
      content: "another long string possibly?",
    },
  ],
};
const myJSON = JSON.stringify(jsonData);

const source = new Readable();
source._read = () => {};
source.push(myJSON);
source.push(null);

const reader = source.pipe(attachmentContentParser);
reader.on("data", (data: any) => console.log(">>>>>>>> data", data));
reader.on("error", (error: any) => console.error(">>>>>>>> error", error));

Tokenizer token offset is incorrect

The Tokenizer outputs the wrong offset for tokens that come after a string token containing escaped special characters. The difference from the expected offset is consistent with the number of escape sequences in the input string.

Some examples

This is the expected behaviour

import * as streamParser from '@streamparser/json'; // import added for completeness

test('testing string 1', async () => {
  const json = JSON.stringify({"abcd": "abcd"});
  console.log('raw string length: ', json.length)
  const tokenizer = new streamParser.Tokenizer()
  tokenizer.onToken = (token) => console.log(token);
  tokenizer.write(json)
  console.log(json[7])
})
  
// raw string length:  15
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: 'abcd', offset: 1 }  
// { token: 4, value: ':', offset: 7 }  // Using this token as the reference
// { token: 9, value: 'abcd', offset: 8 }  
// { token: 1, value: '}', offset: 14 }  
// :  // We print the expected character

Using a single \t special character

test('testing string 2', async () => {  
  const json = JSON.stringify({"ab\t": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[6])  
})  
  
// raw string length:  15 // Same length as above
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: 'ab\t', offset: 1 }  
// { token: 4, value: ':', offset: 6 } // Off by 1 now
// { token: 9, value: 'abcd', offset: 7 }  
// { token: 1, value: '}', offset: 13 }  
// " // This isn't the character we expected

The difference in expected output is consistent with the number of special characters

test('testing string 3', async () => {  
  const json = JSON.stringify({"\t\n": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[5])  
})  
  
// raw string length:  15  // Same length
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: '\t\n', offset: 1 }  
// { token: 4, value: ':', offset: 5 }  // Off by 2 now
// { token: 9, value: 'abcd', offset: 6 }  
// { token: 1, value: '}', offset: 12 }  
// n

My expectation is that the offset should be relative to the raw input. I understand this is a niche use case, but is it something you can fix?

Keep jsonPath to each object

Hey @juanjoDiaz,
could the Tokenizer be extended to keep track of the jsonPath of each emitted object?

something like this:

jsonparser.onValue = (value, key, parent, stack, jsonPath) => {
   console.log(jsonPath);
   //e.g. ['someProp', 0, 'someProp',...]
};

What would be the right place to look at?

Thanks for this awesome parser!
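For reference, a minimal sketch of reconstructing such a path in user land from the existing onValue arguments (this assumes the positional onValue signature used above and that each stack frame exposes a key property, as in the data dump earlier on this page):

jsonparser.onValue = (value, key, parent, stack) => {
  // The first stack frame is the root (key === undefined), so skip it and
  // append the current key to get the full path of this value.
  const jsonPath = stack
    .slice(1)
    .map((frame) => frame.key)
    .concat(key === undefined ? [] : [key]);
  console.log(jsonPath); // e.g. ['someProp', 0, 'someProp', ...]
};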

Question: Sync JSON.parse is more memory efficient than streaming?

Hi @juanjoDiaz, thanks for the library. The API is great and quite nice to use; however, I have a slight problem.

In short, I can describe it like this: streaming a 150 MB JSON file via streamparser-json requires about 1 GB of RAM, while parsing the same file via JSON.parse requires around 500 MB.

Please see a more detailed explanation below:

I have to process JSON files that are usually 150-400 MB. For this example, I am working with a 150 MB file.

For weird reasons, the JSON files I have to work with have this structure:

{
    "TABLE": [
        {
            "OBJECT_1": {
                "OBJECT_1_KEY_1": "OBJECT_1_VALUE_1",
                "OBJECT_1_KEY_2": "OBJECT_1_VALUE_2"
                // ... more properties
            },
            "OBJECT_2": {
                "OBJECT_2_KEY_1": "OBJECT_2_VALUE_1",
                "OBJECT_2_KEY_2": "OBJECT_2_VALUE_2"
                // ... more properties
            }
            // ... more objects
        }
    ]
}

To generalize: my JSON always contains a TABLE property; inside TABLE I always want the first object, and inside that first object I want to stream all nested objects one by one (a weird data structure, but I have to deal with it).

So I wrote a simple script that provides the objects one by one, and that works great: it produces the data object by object.

const fs = require('node:fs')
const { JSONParser } = require('@streamparser/json-node')

const main = async () => {
  const jsonStream = fs.createReadStream('./tab_config.json'); // createReadStream is synchronous, no await needed

  const parser = new JSONParser({ stringBufferSize: undefined, paths: ['$.TABLE.*.*'], keepStack: false });

  const pipeline = jsonStream.pipe(parser);

  pipeline.on('data', (object) => {
    // got OBJECT_1, OBJECT_2 ... etc.
  })
}

main();

However, when I inspect RAM usage while processing the 150 MB file, this script peaks at almost 1 GB of memory.
[screenshot: memory usage graph peaking near 1 GB]

By contrast, if I use a plain JSON.parse and load the entire file into memory, RAM usage goes down to ~600 MB:

const nonStreaming = async () => {
  const file = await fs.promises.readFile('./tab_config.json');
  const parsed = JSON.parse(file);

  console.log('parsed file')
}

nonStreaming();
[screenshot: memory usage graph around 600 MB]
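A small sketch for putting the two runs on equal footing (generic Node.js, nothing specific to this library): sample the process's resident set size while either variant runs and report the peak.

const startMemorySampler = () => {
  let peakRss = 0;
  const timer = setInterval(() => {
    peakRss = Math.max(peakRss, process.memoryUsage().rss);
  }, 100);
  // Returns a function that stops sampling and prints the peak.
  return () => {
    clearInterval(timer);
    console.log(`peak RSS: ${(peakRss / 1024 / 1024).toFixed(1)} MB`);
  };
};

// Usage: const report = startMemorySampler();
// ...run either variant, then call report() once the pipeline emits 'end'
// (or right after JSON.parse returns in the non-streaming case).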

So my questions are:

  • Could you kindly explain why this happens? I would expect streamparser-json to use far less memory and to stream the objects one by one.
  • I assume the problem is somewhere on my side; could you point me to what it might be?

Once again, thanks a lot for the library; I really hope I can use it.

Cannot import package, incorrect exports

I tried to use the package in a vite project and I get the following error:

[vite] Internal server error: Failed to resolve entry for package "@streamparser/json". The package may have incorrect main/module/exports specified in its package.json.

It seems like the "module" key in package.json points to a non-existent file, ./dist/mjs/index.js.

Cannot find module '@streamparser/json/index' or its corresponding type declarations

After upgrading to the latest version of all packages, I'm getting this type error:

../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@types/json2csv__plainjs/src/StreamParser.d.ts:1:58 - error TS2307: Cannot find module '@streamparser/json/index' or its corresponding type declarations.

1 import { Tokenizer, TokenizerOptions, TokenParser } from '@streamparser/json/index';
                                                           ~~~~~~~~~~~~~~~~~~~~~~~~~~


Found 1 error in ../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@types/json2csv__plainjs/src/StreamParser.d.ts:1

Versions:

"@json2csv/plainjs": "6.1.3",
"@types/json2csv__plainjs": "6.1.0",
"@streamparser/json": "0.0.14"

Error importing `JSONParser`

In our code base, when we try to import JSONParser with import { JSONParser } from '@streamparser/json'; we get the following error: TS2307: Cannot find module '@streamparser/json' or its corresponding type declarations.

As a workaround, we are currently importing it with const jsonStreamParsers = require('@streamparser/json');, which works fine.

My question is: do you have any insight into why the usual import is failing? Or is this likely a problem with our project's configuration in some way?

Thanks.
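One possibility worth ruling out (an assumption, not something confirmed in this thread): TypeScript's older module resolution modes ignore a package's "exports" map, which can produce exactly this TS2307 while require() keeps working. The relevant tsconfig.json fields would look something like:

{
  "compilerOptions": {
    "module": "node16",
    "moduleResolution": "node16"
  }
}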

Message parser should be able to support arbitrary whitespace such as '\n', '\t', '\r', and ' ' within and between messages

How do I configure the stream parser to be able to discard whitespace between JSON messages?

I am implementing an RPC system that uses JSONParser to parse binary data into JSON for transmission over RPC. The issue I am facing is that input messages can be separated by a variety of whitespace characters, while the library's current separator option only supports a single separator.

As things stand, we have to keep our input streams in the following form:
{...message}{...message}.

However, to improve readability, we would like to be able to add whitespace between messages, as demonstrated below (a possible workaround sketch follows the examples).

// Normal separation
{...message}{...message}{...message}

// Spaces
{...message} {...message}    {...message}

// New lines
{...message}
{...message}
{...message}

// Any combination
{...message}                                  {...message}
                       {...message}
{...message}
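As a possible workaround while the library only accepts a single separator, here is a sketch (my own, not part of the library) of a Node.js Transform that collapses whitespace between top-level messages into a single '\n', so the parser can keep separator: '\n'. It tracks string literals and nesting depth so whitespace inside messages is left untouched; it assumes chunks are not split in the middle of a multi-byte UTF-8 character (use a StringDecoder for full correctness).

const { Transform } = require('stream');

class NormalizeSeparators extends Transform {
  constructor() {
    super();
    this.inString = false;   // inside a JSON string literal?
    this.escaped = false;    // previous char was a backslash inside a string?
    this.depth = 0;          // current object/array nesting depth
    this.pendingSep = false; // whitespace seen between messages, not yet emitted
    this.sawMessage = false; // has any message content been emitted yet?
  }

  _transform(chunk, _encoding, callback) {
    let out = '';
    for (const ch of chunk.toString('utf8')) {
      if (this.inString) {
        out += ch;
        if (this.escaped) this.escaped = false;
        else if (ch === '\\') this.escaped = true;
        else if (ch === '"') this.inString = false;
        continue;
      }
      if (this.depth === 0 && (ch === ' ' || ch === '\t' || ch === '\r' || ch === '\n')) {
        this.pendingSep = true; // whitespace between messages: remember it, emit nothing yet
        continue;
      }
      if (this.pendingSep) {
        if (this.sawMessage) out += '\n'; // collapse the run into a single separator
        this.pendingSep = false;
      }
      out += ch;
      this.sawMessage = true;
      if (ch === '"') this.inString = true;
      else if (ch === '{' || ch === '[') this.depth++;
      else if (ch === '}' || ch === ']') this.depth--;
    }
    if (out.length > 0) this.push(out);
    callback();
  }
}

// Usage: source.pipe(new NormalizeSeparators()).pipe(new JSONParser({ separator: '\n' }));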

Replace nested objects and arrays

@juanjoDiaz ,

I know this is probably out of scope for this library, but do you think it is possible to adjust the code in order to omit nested objects and arrays?

I have a large json object that looks like this:

{
  "cards": [
    {
      "id": 1,
      "name": "Some card name"
    },
    {},
    {}
  ],
  "meta": {
    "updated": "2022-12-31"
  }
}

The cards array is very large, so it won't fit into memory on its own (even when parsing in chunks).
I'd like to get all objects as flat objects that replace nested arrays with "[...]" and nested objects with "{...}".
The result would look like this:

{
  "cards":"[...]",
  "meta":"{...}"
},
{
 "id": 1,
 "name": "Some card name"
},
{},
{},
{
  "updated": "2022-12-31"
}

I'm aware that this is probably out of scope for this repo, but I would like to apply the changes in my own fork.
Can you point me in the direction of where to look, or where those changes would fit best? (A partial alternative is sketched below.)

Best regards and thanks a lot for the awesome parser :-)
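One approach that may cover part of this without forking (a sketch built on the wildcard-path mechanism used elsewhere on this page; it assumes the paths option accepts more than one path, and the file name is hypothetical): stream each card and the meta object individually instead of emitting the root object with placeholders.

const fs = require('node:fs');
const { JSONParser } = require('@streamparser/json-node');

// Pull each card and the meta object out one at a time, so nothing larger
// than a single card is ever held in memory.
const parser = new JSONParser({
  paths: ['$.cards.*', '$.meta'],
  keepStack: false,
  stringBufferSize: undefined,
});

fs.createReadStream('./cards.json')
  .pipe(parser)
  .on('data', ({ value, key }) => {
    console.log(key, value); // 0, { id: 1, name: 'Some card name' }, ... then 'meta', { updated: '2022-12-31' }
  });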

error TS7029: Fallthrough case in switch

I just tried migrating from the old json2csv package to the new one, and now I'm getting type errors from within node_modules:

../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@streamparser/json/src/tokenizer.ts:374:11 - error TS7029: Fallthrough case in switch.

374           case TokenizerStates.STRING_UNICODE_DIGIT_4:
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@streamparser/json/src/tokenizer.ts:500:11 - error TS7029: Fallthrough case in switch.

500           case TokenizerStates.NUMBER_AFTER_E:
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@streamparser/json/src/tokenizer.ts:697:18 - error TS6133: 'parsedToken' is declared but its value is never read.

697   public onToken(parsedToken: ParsedTokenInfo): void {
                     ~~~~~~~~~~~

../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@streamparser/json/src/tokenparser.ts:324:18 - error TS6133: 'parsedElementInfo' is declared but its value is never read.

324   public onValue(parsedElementInfo: ParsedElementInfo): void {
                     ~~~~~~~~~~~~~~~~~


Found 4 errors in 2 files.

Errors  Files
     3  ../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@streamparser/json/src/tokenizer.ts:374
     1  ../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@streamparser/json/src/tokenparser.ts:324

Formatted JSON failing tokenparse

Hi folks,
When processing a minified version of the .json file, the token parser does not fail. However, if the original file from the customer comes in formatted, then this line:

throw new TokenParserError(

throws the following error:

"stack": [
  "Runtime.UnhandledPromiseRejection: Error: Error on Readable Stream for s3DropBucket Object acmerepsignature_sampleformatted.json.",
  "Error Message: An error has stopped Content Parsing at record 0 for s3 object acmerepsignature_sampleformatted.json.",

  "Error: Unexpected SEPARATOR (\"\\n\") in state VALUE ",

  "    at process.<anonymous> (file:///var/runtime/index.mjs:1276:17)",
  "    at process.emit (node:events:517:28)",
  "    at emit (node:internal/process/promises:149:20)",
  "    at processPromiseRejections (node:internal/process/promises:278:11)",
  "    at process.processTicksAndRejections (node:internal/process/task_queues:96:32)"
]

I first thought there was a value in the JSON containing an embedded newline; however, after inspecting the file very closely, that is not the case.

Editing to add params:
const jsonParser = new JSONParser({
  // numberBufferSize: 64, // 64, 0, or undefined; set to 0 to not buffer
  stringBufferSize: undefined, // 64, 0, or undefined
  separator: '\n', // separator between objects, e.g. \n for NDJSON
  paths: ['$'], // ToDo: possible data transform opportunity
  keepStack: false,
  emitPartialTokens: false // whether to emit tokens mid-parsing
})
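For what it's worth (an observation based on the error and the options above, not a confirmed fix): with separator: '\n', every newline between tokens is treated as a message boundary, and a pretty-printed document contains exactly those newlines. If the input is a single JSON document rather than NDJSON, the same configuration without the separator would look like this:

const jsonParser = new JSONParser({
  stringBufferSize: undefined,
  // no `separator`: the input is one pretty-printed JSON document, not NDJSON
  paths: ['$'],
  keepStack: false,
  emitPartialTokens: false
})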

Here's a short sample of the Original file:
[
  {
    "id": "0051N000006sMsYQAU",
    "email": "[email protected]",
    "first": "Cheryl",
    "last": "Basso",
    "photo": null,
    "Photo": null,
    "brandApprovedPhoto": null,
    "designations": null,
    "jobTitle": null,
    "phone": null,
    "registeredAdvisorTitle": null,
    "updatedAt": "2022-01-20T02:28:13.077Z"
  },
  {
    "id": "0051N000006sMzcQAE",
    "email": "[email protected]",
    "first": "Laura",
    "last": "Dern",
    "photo": null,
    "Photo": null,
    "ApprovedPhoto": null,
    "designations": null,
    "jobTitle": null,
    "phone": null,
    "registeredAdvisorTitle": null,
    "updatedAt": "2022-01-20T02:28:13.077Z"
  },
  {
    "id": "0051N000006sNCiQAM",
    "email": "[email protected]",
    "first": "Brad",
    "last": "Bean",
    "photo": null,
    "Photo": null,
    "ApprovedPhoto": null,
    "designations": null,
    "jobTitle": null,
    "phone": null,
    "registeredAdvisorTitle": null,
    "updatedAt": "2022-01-20T02:28:13.077Z"
  }
]

As a side note, it would be helpful if the error message included the character position within the JSON that causes the exception.

I've parsed the formatted version independently and it does not fail, so I'm hoping someone can provide some insight into how to go about resolving this.

Having the customer not format the file is not an option, just in case that was a thought.

Thanks,
KWL

Stream incomplete values

Hey @juanjoDiaz

Is it possible to stream incomplete values? Currently both onToken and onValue provide only full tokens or values. However, I want to be able to easily access a given string value before it is fully complete. It would need an artificial closing quote generated dynamically until the real closing quote finally arrives.

Any idea how to do it?
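Not an authoritative answer, but a sketch of the direction suggested by the emitPartialTokens option mentioned in the configuration example further up this page (the emitPartialValues option and the partial flag shown here are assumptions on my part and should be checked against the current docs):

const { JSONParser } = require('@streamparser/json');

const parser = new JSONParser({
  emitPartialTokens: true,  // emit tokens before they are complete
  emitPartialValues: true,  // assumed counterpart at the parser level
});

parser.onValue = ({ value, partial }) => {
  if (partial) {
    console.log('string so far:', value); // no artificial closing quote needed
  } else {
    console.log('complete value:', value);
  }
};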
