
streamparser-json's Introduction

@streamparser/json


Fast, dependency-free library to parse a JSON stream using UTF-8 encoding in Node.js, Deno, or any modern browser. Fully compliant with the JSON spec and JSON.parse(...).
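A minimal usage sketch (based on the API shown in the issues further down this page, e.g. the Tokenizer examples and the ParsedElementInfo callback; treat the exact callback shape as an assumption):

import { JSONParser } from '@streamparser/json';

const parser = new JSONParser();
// onValue fires for each parsed value; its argument carries value/key/parent/stack
// (see the issue examples below).
parser.onValue = ({ value, key }) => {
  console.log(key, value);
};

// Feed the input in arbitrary chunks; the parser keeps state between writes.
parser.write('{"greeting": "hel');
parser.write('lo"}');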

@streamparser/json ecosystem

There are multiple flavours of @streamparser, including @streamparser/json (the plain parser) and @streamparser/json-node (a Node.js stream wrapper, used in several of the issues below).

License

See LICENSE.md.

streamparser-json's People

Contributors

callumlocke, dependabot[bot], drawmindmap, juanjodiaz, knownasilya, miunau, mrazauskas, slevy85


streamparser-json's Issues

ignore BOM

I ran into an issue where the tokenizer chokes on files with a BOM. It throws Error: Unexpected "ï" at position "0" in state START.

I was able to patch the tokenizer with a quick-and-dirty addition of a TokenizerStates.BOM state. Unfortunately I don't have time to submit a formal PR, but I wanted to raise the issue for tracking.
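In the meantime, a workaround sketch (not part of the library; it assumes the input reaches the tokenizer as UTF-8 bytes): strip the BOM before the first write.

// Strip a UTF-8 BOM (0xEF 0xBB 0xBF) from the first chunk, if present.
function stripBom(firstChunk) {
  if (
    firstChunk.length >= 3 &&
    firstChunk[0] === 0xef &&
    firstChunk[1] === 0xbb &&
    firstChunk[2] === 0xbf
  ) {
    return firstChunk.subarray(3);
  }
  return firstChunk;
}

tokenizer.write(stripBom(firstChunk));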

Can't get streamparser/json-node to extract a specific path

The code below doesn't generate any output, whereas I would expect it to print "this might be a long string". If the path is changed to just $.attachments.0, then it correctly prints this:

$ node src/stream.ts
>>>>>>>> data {
  value: { filename: 'file1', content: 'this might be a long string' },
  key: 0,
  parent: [ <1 empty item> ],
  stack: [
    { key: undefined, value: undefined, mode: undefined, emit: false },
    { key: 'attachments', value: [Object], mode: 0, emit: false }
  ]
}

I'm going through the docs and can't spot the error I'm making. Is it in the path I'm using? Thank you!

import { Readable, Transform } from "stream";
import { JSONParser } from "@streamparser/json-node";

const attachmentContentParser = new JSONParser({
  stringBufferSize: 0,
  keepStack: false,
  paths: ["$.attachments.0.content"],
});

const jsonData = {
  attachments: [
    {
      filename: "file1",
      content: "this might be a long string",
    },
    {
      filename: "file2",
      content: "another long string possibly?",
    },
  ],
};
const myJSON = JSON.stringify(jsonData);

const source = new Readable();
source._read = () => {};
source.push(myJSON);
source.push(null);

const reader = source.pipe(attachmentContentParser);
reader.on("data", (data: any) => console.log(">>>>>>>> data", data));
reader.on("error", (error: any) => console.error(">>>>>>>> error", error));

Tokenizer token offset is incorrect

The Tokenizer outputs the wrong offset for tokens that come after a string token containing escaped special characters. The difference from the expected offset is consistent with the number of escape sequences in the input string.

Some examples

This is the expected behaviour

import * as streamParser from '@streamparser/json'; // import added for completeness

test('testing string 1', async () => {
  const json = JSON.stringify({"abcd": "abcd"});
  console.log('raw string length: ', json.length)
  const tokenizer = new streamParser.Tokenizer()
  tokenizer.onToken = (token) => console.log(token);
  tokenizer.write(json)
  console.log(json[7])
})
  
// raw string length:  15
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: 'abcd', offset: 1 }  
// { token: 4, value: ':', offset: 7 }  // Using this token as the reference
// { token: 9, value: 'abcd', offset: 8 }  
// { token: 1, value: '}', offset: 14 }  
// :  // We print the expected character

Using a single \t special character

test('testing string 2', async () => {  
  const json = JSON.stringify({"ab\t": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[6])  
})  
  
// raw string length:  15 // Same length as above
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: 'ab\t', offset: 1 }  
// { token: 4, value: ':', offset: 6 } // Off by 1 now
// { token: 9, value: 'abcd', offset: 7 }  
// { token: 1, value: '}', offset: 13 }  
// " // This isn't the character we expected

The difference in expected output is consistent with the number of special characters

test('testing string 3', async () => {  
  const json = JSON.stringify({"\t\n": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[5])  
})  
  
// raw string length:  15  // Same length
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: '\t\n', offset: 1 }  
// { token: 4, value: ':', offset: 5 }  // Off by 2 now
// { token: 9, value: 'abcd', offset: 6 }  
// { token: 1, value: '}', offset: 12 }  
// n

My expectation is that the offset should be relative to the raw input. I understand this is a niche use case, but is it something you can fix?

Keep jsonPath to each object

Hey @juanjoDiaz,
could the Tokenizer be extended to keep track of the jsonPath of each emitted object?

something like this:

jsonparser.onValue = (value, key, parent, stack, jsonPath) => {
   console.log(jsonPath);
   //e.g. ['someProp', 0, 'someProp',...]
};

What would be the right place to look at?

Thanks for this awesome parser!
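For reference, a minimal sketch of reconstructing such a path in user land from the existing onValue arguments (this assumes the positional onValue signature used above and that each stack frame exposes a key property, as in the data dump earlier on this page):

jsonparser.onValue = (value, key, parent, stack) => {
  // The first stack frame is the root (key === undefined), so skip it and
  // append the current key to get the full path of this value.
  const jsonPath = stack
    .slice(1)
    .map((frame) => frame.key)
    .concat(key === undefined ? [] : [key]);
  console.log(jsonPath); // e.g. ['someProp', 0, 'someProp', ...]
};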

Question: Sync JSON.parse is more memory efficient than streaming?

Hi @juanjoDiaz, thanks for the library. The API is great and quite nice to use; however, I have a slight problem.

In short, I can describe it like this: streaming a 150 MB JSON file via streamparser-json requires about 1 GB of RAM, while parsing the same file via JSON.parse requires around 500 MB.

Please see a more detailed explanation below:

I have to process JSON files that are usually 150-400 MB. For this example, I am working with a 150 MB file.

For weird reasons, the JSON files I have to work with have this structure:

{
    "TABLE": [
        {
            "OBJECT_1": {
                "OBJECT_1_KEY_1": "OBJECT_1_VALUE_1",
                "OBJECT_1_KEY_2": "OBJECT_1_VALUE_2"
                // ... more properties
            },
            "OBJECT_2": {
                "OBJECT_2_KEY_1": "OBJECT_2_VALUE_1",
                "OBJECT_2_KEY_2": "OBJECT_2_VALUE_2"
                // ... more properties
            }
            // ... more objects
        }
    ]
}

To generalize: my JSON always contains a TABLE property; inside TABLE I always want the first object, and inside that first object I want to stream all nested objects one by one (a weird data structure, but I have to deal with it).

So I wrote a simple script that provides the objects one by one, and that works great: it produces the data object by object.

const fs = require('node:fs')
const { JSONParser } = require('@streamparser/json-node')

const main = async () => {
  const jsonStream = fs.createReadStream('./tab_config.json'); // createReadStream is synchronous, no await needed

  const parser = new JSONParser({ stringBufferSize: undefined, paths: ['$.TABLE.*.*'], keepStack: false });

  const pipeline = jsonStream.pipe(parser);

  pipeline.on('data', (object) => {
    // got OBJECT_1, OBJECT_2 ... etc.
  })
}

main();

However, when I inspect RAM usage while processing the 150 MB file, this script peaks at almost 1 GB of memory.
[screenshot: memory usage graph peaking near 1 GB]

By contrast, if I use a plain JSON.parse and load the entire file into memory, RAM usage goes down to ~600 MB:

const nonStreaming = async () => {
  const file = await fs.promises.readFile('./tab_config.json');
  const parsed = JSON.parse(file);

  console.log('parsed file')
}

nonStreaming();
[screenshot: memory usage graph around 600 MB]
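A small sketch for putting the two runs on equal footing (generic Node.js, nothing specific to this library): sample the process's resident set size while either variant runs and report the peak.

const startMemorySampler = () => {
  let peakRss = 0;
  const timer = setInterval(() => {
    peakRss = Math.max(peakRss, process.memoryUsage().rss);
  }, 100);
  // Returns a function that stops sampling and prints the peak.
  return () => {
    clearInterval(timer);
    console.log(`peak RSS: ${(peakRss / 1024 / 1024).toFixed(1)} MB`);
  };
};

// Usage: const report = startMemorySampler();
// ...run either variant, then call report() once the pipeline emits 'end'
// (or right after JSON.parse returns in the non-streaming case).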

So my questions are:

  • Could you kindly explain why this happens? I would expect streamparser-json to use far less memory and to stream the objects one by one.
  • I assume the problem is somewhere on my side; could you point me to what it might be?

Once again, thanks a lot for the library; I really hope I can use it.

Cannot import package, incorrect exports

I tried to use the package in a vite project and I get the following error:

[vite] Internal server error: Failed to resolve entry for package "@streamparser/json". The package may have incorrect main/module/exports specified in its package.json.

It seems like the "module" key in package.json points to a non-existent file, ./dist/mjs/index.js.

Cannot find module '@streamparser/json/index' or its corresponding type declarations

After upgrading to the latest version of all packages, I'm getting this type error:

../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@types/json2csv__plainjs/src/StreamParser.d.ts:1:58 - error TS2307: Cannot find module '@streamparser/json/index' or its corresponding type declarations.

1 import { Tokenizer, TokenizerOptions, TokenParser } from '@streamparser/json/index';
                                                           ~~~~~~~~~~~~~~~~~~~~~~~~~~


Found 1 error in ../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@types/json2csv__plainjs/src/StreamParser.d.ts:1

Versions:

"@json2csv/plainjs": "6.1.3",
"@types/json2csv__plainjs": "6.1.0",
"@streamparser/json": "0.0.14"

Error importing `JSONParser`

In our code base, when we try to import JSONParser with import { JSONParser } from '@streamparser/json'; we get the following error: TS2307: Cannot find module '@streamparser/json' or its corresponding type declarations.

As a workaround, we are currently importing it with const jsonStreamParsers = require('@streamparser/json');, which works fine.

My question is: do you have any insight into why the usual import is failing? Or is this likely a problem with our project's configuration in some way?

Thanks.
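One possibility worth ruling out (an assumption, not something confirmed in this thread): TypeScript's older module resolution modes ignore a package's "exports" map, which can produce exactly this TS2307 while require() keeps working. The relevant tsconfig.json fields would look something like:

{
  "compilerOptions": {
    "module": "node16",
    "moduleResolution": "node16"
  }
}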

Message parser should be able to support arbitrary whitespace such as '\n', '\t', '\r', and ' ' within and between messages

How do I configure the stream parser to be able to discard whitespace between JSON messages?

I am implementing an RPC system that uses JSONParser to parse binary data into JSON for transmission over RPC. The issue I am facing is that input messages can be separated by a variety of whitespace characters, while the library's current separator option only supports a single separator.

As things stand, we have to keep our input streams in the following form:
{...message}{...message}.

However, to improve readability, we would like to be able to add whitespace between messages, as demonstrated below (a possible workaround sketch follows the examples).

// Normal separation
{...message}{...message}{...message}

// Spaces
{...message} {...message}    {...message}

// New lines
{...message}
{...message}
{...message}

// Any combination
{...message}                                  {...message}
                       {...message}
{...message}
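As a possible workaround while the library only accepts a single separator, here is a sketch (my own, not part of the library) of a Node.js Transform that collapses whitespace between top-level messages into a single '\n', so the parser can keep separator: '\n'. It tracks string literals and nesting depth so whitespace inside messages is left untouched; it assumes chunks are not split in the middle of a multi-byte UTF-8 character (use a StringDecoder for full correctness).

const { Transform } = require('stream');

class NormalizeSeparators extends Transform {
  constructor() {
    super();
    this.inString = false;   // inside a JSON string literal?
    this.escaped = false;    // previous char was a backslash inside a string?
    this.depth = 0;          // current object/array nesting depth
    this.pendingSep = false; // whitespace seen between messages, not yet emitted
    this.sawMessage = false; // has any message content been emitted yet?
  }

  _transform(chunk, _encoding, callback) {
    let out = '';
    for (const ch of chunk.toString('utf8')) {
      if (this.inString) {
        out += ch;
        if (this.escaped) this.escaped = false;
        else if (ch === '\\') this.escaped = true;
        else if (ch === '"') this.inString = false;
        continue;
      }
      if (this.depth === 0 && (ch === ' ' || ch === '\t' || ch === '\r' || ch === '\n')) {
        this.pendingSep = true; // whitespace between messages: remember it, emit nothing yet
        continue;
      }
      if (this.pendingSep) {
        if (this.sawMessage) out += '\n'; // collapse the run into a single separator
        this.pendingSep = false;
      }
      out += ch;
      this.sawMessage = true;
      if (ch === '"') this.inString = true;
      else if (ch === '{' || ch === '[') this.depth++;
      else if (ch === '}' || ch === ']') this.depth--;
    }
    if (out.length > 0) this.push(out);
    callback();
  }
}

// Usage: source.pipe(new NormalizeSeparators()).pipe(new JSONParser({ separator: '\n' }));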

Replace nested objects and arrays

@juanjoDiaz ,

I know this is probably out of scope for this library, but do you think it is possible to adjust the code in order to omit nested objects and arrays?

I have a large json object that looks like this:

{
  "cards": [
    {
      "id": 1,
      "name": "Some card name"
    },
    {},
    {}
  ],
  "meta": {
    "updated": "2022-12-31"
  }
}

The cards array is very large, so it won't fit into memory on its own (even when parsing in chunks).
I'd like to get all objects as flat objects that replace nested arrays with "[...]" and nested objects with "{...}".
The result would look like this:

{
  "cards":"[...]",
  "meta":"{...}"
},
{
 "id": 1,
 "name": "Some card name"
},
{},
{},
{
  "updated": "2022-12-31"
}

I'm aware that this is probably out of scope for this repo, but I would like to apply the changes in my own fork.
Can you point me in the direction of where to look, or where those changes would fit best? (A partial alternative is sketched below.)

Best regards and thanks a lot for the awesome parser :-)
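One approach that may cover part of this without forking (a sketch built on the wildcard-path mechanism used elsewhere on this page; it assumes the paths option accepts more than one path, and the file name is hypothetical): stream each card and the meta object individually instead of emitting the root object with placeholders.

const fs = require('node:fs');
const { JSONParser } = require('@streamparser/json-node');

// Pull each card and the meta object out one at a time, so nothing larger
// than a single card is ever held in memory.
const parser = new JSONParser({
  paths: ['$.cards.*', '$.meta'],
  keepStack: false,
  stringBufferSize: undefined,
});

fs.createReadStream('./cards.json')
  .pipe(parser)
  .on('data', ({ value, key }) => {
    console.log(key, value); // 0, { id: 1, name: 'Some card name' }, ... then 'meta', { updated: '2022-12-31' }
  });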

error TS7029: Fallthrough case in switch

I just tried migrating from the old json2csv package to the new one, and now I'm getting type errors from within node_modules:

../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@streamparser/json/src/tokenizer.ts:374:11 - error TS7029: Fallthrough case in switch.

374           case TokenizerStates.STRING_UNICODE_DIGIT_4:
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@streamparser/json/src/tokenizer.ts:500:11 - error TS7029: Fallthrough case in switch.

500           case TokenizerStates.NUMBER_AFTER_E:
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@streamparser/json/src/tokenizer.ts:697:18 - error TS6133: 'parsedToken' is declared but its value is never read.

697   public onToken(parsedToken: ParsedTokenInfo): void {
                     ~~~~~~~~~~~

../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@streamparser/json/src/tokenparser.ts:324:18 - error TS6133: 'parsedElementInfo' is declared but its value is never read.

324   public onValue(parsedElementInfo: ParsedElementInfo): void {
                     ~~~~~~~~~~~~~~~~~


Found 4 errors in 2 files.

Errors  Files
     3  ../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@streamparser/json/src/tokenizer.ts:374
     1  ../../common/temp/node_modules/.pnpm/@[email protected]/node_modules/@streamparser/json/src/tokenparser.ts:324

Formatted JSON failing tokenparse

Hi folks,
When processing a minified version of the .json file, the token parser does not fail. However, if the original file from the customer comes in formatted, then this line:

throw new TokenParserError(

throws the following error:

"stack": [
  "Runtime.UnhandledPromiseRejection: Error: Error on Readable Stream for s3DropBucket Object acmerepsignature_sampleformatted.json.",
  "Error Message: An error has stopped Content Parsing at record 0 for s3 object acmerepsignature_sampleformatted.json.",

  "Error: Unexpected SEPARATOR (\"\\n\") in state VALUE ",

  "    at process.<anonymous> (file:///var/runtime/index.mjs:1276:17)",
  "    at process.emit (node:events:517:28)",
  "    at emit (node:internal/process/promises:149:20)",
  "    at processPromiseRejections (node:internal/process/promises:278:11)",
  "    at process.processTicksAndRejections (node:internal/process/task_queues:96:32)"
]

I first thought there was a value in the JSON containing an embedded newline; however, after inspecting the file very closely, that is not the case.

Editing to add params:
const jsonParser = new JSONParser({
  // numberBufferSize: 64, // 64, 0, or undefined; set to 0 to not buffer
  stringBufferSize: undefined, // 64, 0, or undefined
  separator: '\n', // separator between objects, e.g. \n for NDJSON
  paths: ['$'], // ToDo: possible data transform opportunity
  keepStack: false,
  emitPartialTokens: false // whether to emit tokens mid-parsing
})
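For what it's worth (an observation based on the error and the options above, not a confirmed fix): with separator: '\n', every newline between tokens is treated as a message boundary, and a pretty-printed document contains exactly those newlines. If the input is a single JSON document rather than NDJSON, the same configuration without the separator would look like this:

const jsonParser = new JSONParser({
  stringBufferSize: undefined,
  // no `separator`: the input is one pretty-printed JSON document, not NDJSON
  paths: ['$'],
  keepStack: false,
  emitPartialTokens: false
})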

Here's a short sample of the Original file:
[
  {
    "id": "0051N000006sMsYQAU",
    "email": "[email protected]",
    "first": "Cheryl",
    "last": "Basso",
    "photo": null,
    "Photo": null,
    "brandApprovedPhoto": null,
    "designations": null,
    "jobTitle": null,
    "phone": null,
    "registeredAdvisorTitle": null,
    "updatedAt": "2022-01-20T02:28:13.077Z"
  },
  {
    "id": "0051N000006sMzcQAE",
    "email": "[email protected]",
    "first": "Laura",
    "last": "Dern",
    "photo": null,
    "Photo": null,
    "ApprovedPhoto": null,
    "designations": null,
    "jobTitle": null,
    "phone": null,
    "registeredAdvisorTitle": null,
    "updatedAt": "2022-01-20T02:28:13.077Z"
  },
  {
    "id": "0051N000006sNCiQAM",
    "email": "[email protected]",
    "first": "Brad",
    "last": "Bean",
    "photo": null,
    "Photo": null,
    "ApprovedPhoto": null,
    "designations": null,
    "jobTitle": null,
    "phone": null,
    "registeredAdvisorTitle": null,
    "updatedAt": "2022-01-20T02:28:13.077Z"
  }
]

As a side note, it would be helpful if the error message included the character position within the JSON that causes the exception.

I've parsed the formatted version independently and it does not fail, so I'm hoping someone can provide some insight into how to go about resolving this.

Having the customer not format the file is not an option, just in case that was a thought.

Thanks,
KWL

Stream incomplete values

Hey @juanjoDiaz

Is it possible to stream incomplete values? Currently both onToken and onValue provide only full tokens or values. However, I want to be able to easily access a given string value before it is fully complete. It would need an artificial closing quote generated dynamically until the real closing quote finally arrives.

Any idea how to do it?
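Not an authoritative answer, but a sketch of the direction suggested by the emitPartialTokens option mentioned in the configuration example further up this page (the emitPartialValues option and the partial flag shown here are assumptions on my part and should be checked against the current docs):

const { JSONParser } = require('@streamparser/json');

const parser = new JSONParser({
  emitPartialTokens: true,  // emit tokens before they are complete
  emitPartialValues: true,  // assumed counterpart at the parser level
});

parser.onValue = ({ value, partial }) => {
  if (partial) {
    console.log('string so far:', value); // no artificial closing quote needed
  } else {
    console.log('complete value:', value);
  }
};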
