mongodb-schema's Issues

Security issues

I'm thinking about including this project in our repo, but I noticed several security warnings from npm on install:

[Screenshot of the npm output, 2021-02-02]

This makes me wonder if the library is maintained - or maybe needs help maintaining?

Add data inference

numbers

min and max

"$d": {
    min: 0,
    max: 122
}

strings

counts (up to a certain number, 100?)

"$d": {
    "red": 300,
    "green": 250,
    "blue": 129
}
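
The two summaries proposed above (min and max for numbers, capped value counts for strings) could be collected in a single pass. This is only a sketch: `inferData` and `maxDistinct` are hypothetical names, not part of mongodb-schema, and the cap of 100 is taken from the question mark in the proposal.

```javascript
// Hypothetical helper: summarize the values observed for one field.
// Numbers yield a { min, max } range; strings yield per-value counts,
// tracking at most `maxDistinct` distinct strings.
function inferData(values, maxDistinct = 100) {
  let range = null;
  const counts = new Map();

  for (const value of values) {
    if (typeof value === 'number') {
      range = range === null
        ? { min: value, max: value }
        : { min: Math.min(range.min, value), max: Math.max(range.max, value) };
    } else if (typeof value === 'string') {
      if (counts.has(value)) {
        counts.set(value, counts.get(value) + 1);
      } else if (counts.size < maxDistinct) {
        // New strings are ignored once the cap is reached.
        counts.set(value, 1);
      }
    }
  }

  return {
    number: range,
    string: counts.size > 0 ? Object.fromEntries(counts) : null,
  };
}

console.log(inferData([0, 122, 50, 'red', 'red', 'blue']));
```

The `number` and `string` results correspond to the two `"$d"` shapes shown above.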

connect to secondary node with slaveOk=true

I would like to be able to use the CLI version against one of the secondary nodes in our replica set, but if I pass slaveOk=true as a URI option I get errors:

mongodb-schema mongodb://USERNAME:PASSWORD@HOST:PORT/AUTH_DB?slaveOk=true DATABASE.COLLECTION --no-values -n 100

/usr/local/lib/node_modules/mongodb-schema/node_modules/mongodb/lib/utils.js:132
      throw err;
      ^
MongoError: not master and slaveOk=false
    at /usr/local/lib/node_modules/mongodb-schema/node_modules/mongodb-core/lib/connection/pool.js:593:63
    at authenticateStragglers (/usr/local/lib/node_modules/mongodb-schema/node_modules/mongodb-core/lib/connection/pool.js:516:16)
    at Connection.messageHandler (/usr/local/lib/node_modules/mongodb-schema/node_modules/mongodb-core/lib/connection/pool.js:552:5)
    at emitMessageHandler (/usr/local/lib/node_modules/mongodb-schema/node_modules/mongodb-core/lib/connection/connection.js:309:10)
    at Socket.<anonymous> (/usr/local/lib/node_modules/mongodb-schema/node_modules/mongodb-core/lib/connection/connection.js:452:17)
    at emitOne (events.js:116:13)
    at Socket.emit (events.js:211:7)
    at addChunk (_stream_readable.js:263:12)
    at readableAddChunk (_stream_readable.js:250:11)
    at Socket.Readable.push (_stream_readable.js:208:10)

new optional parameter to limit the lengths collection

Hi,

This works very well on a MongoDB instance with a big collection (600,000 items).
But the schema is very big just because of the "lengths" collection. My schema is 30 MB, and it could be just a few KB if the "lengths" collection could be limited to 1000 or 2000 items with an option (the average would be less precise in this case).

For example, the code in stream.js:
type.lengths.push(value.length);

could be something like:
if (type.lengths.length < MYOPTION) type.lengths.push(value.length);
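
One way to keep the average exact while still capping the stored lengths would be to track a running count and sum next to the capped array. The sketch below is an assumption about how that might look, not code from mongodb-schema; `createLengthStats`, `recordLength`, `averageLength` and `maxSamples` are hypothetical names.

```javascript
// Hypothetical accumulator: stores at most `maxSamples` lengths but keeps
// count and sum for every value, so the average is not affected by the cap.
function createLengthStats(maxSamples = 1000) {
  return { count: 0, sum: 0, samples: [], maxSamples };
}

function recordLength(stats, length) {
  stats.count += 1;
  stats.sum += length;
  if (stats.samples.length < stats.maxSamples) {
    stats.samples.push(length);
  }
}

function averageLength(stats) {
  return stats.count === 0 ? 0 : stats.sum / stats.count;
}

// Usage: only the first two lengths are stored, all four are averaged.
const stats = createLengthStats(2);
[1, 2, 3, 4].forEach((n) => recordLength(stats, n));
console.log(stats.samples.length, averageLength(stats));
```

With this approach memory use is bounded by `maxSamples`, while the reported average still reflects every array seen.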

Thanks for your work!

make node.js mode async

Example usage:

var callback = function(err, res) {
    // handle error
    if (err) {
        return console.error( err );
    }
    // else pretty print to console
    console.log( JSON.stringify( res, null, '\t' ) );
};

schema( documents, options, callback );

Make it a mongo shell plugin again

It should extend the DBCollection object to support

db.foo.schema()

and return the serialized version of the schema found in the collection. It should thinly wrap .find() with all its parameters, such as query, skip, and limit.

parseSchema is not a function

Hey!

I'm trying to follow the example given in ReadMe:

const parseSchema = require('mongodb-schema');
const { MongoClient } = require('mongodb');

const dbName = 'test';
const uri = `mongodb://localhost:27017/${dbName}`;
const client = new MongoClient(uri);

async function run() {
  try {
    const database = client.db(dbName);
    const documentStream = database.collection('data').find();

    // Here we are passing in a cursor as the first argument. You can
    // also pass in a stream or an array of documents directly.
    const schema = await parseSchema(documentStream);

    console.log(JSON.stringify(schema, null, 2));
  } finally {
    await client.close();
  }
}

run().catch(console.dir);

But whenever I run this, I'm always getting:

TypeError: parseSchema is not a function

Can anyone point out what I'm doing wrong here? Thanks!
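
One possible cause, offered as an assumption rather than a confirmed diagnosis, is that the installed version of the package exports an object (with the function as a named or `default` property) instead of exporting the function directly. The helper below is a hypothetical sketch that accepts any of those shapes; `resolveParseSchema` is not part of mongodb-schema.

```javascript
// Hypothetical helper: find a callable parseSchema on whatever
// require('mongodb-schema') returned, regardless of export style.
function resolveParseSchema(mod) {
  if (typeof mod === 'function') {
    return mod; // module.exports = parseSchema
  }
  if (mod && typeof mod.parseSchema === 'function') {
    return mod.parseSchema; // exports.parseSchema = ...
  }
  if (mod && typeof mod.default === 'function') {
    return mod.default; // transpiled `export default`
  }
  throw new TypeError('No parseSchema function found in module exports');
}

// Usage (assumes the package is installed):
// const parseSchema = resolveParseSchema(require('mongodb-schema'));
```

Logging `Object.keys(require('mongodb-schema'))` would show which of these shapes the installed version actually uses.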

The command is stuck

Hi,
I am trying to analyze a schema with the following command, but the command does not exit with either a success or an error message:

mongodb-schema 'mongodb+srv://username:[email protected]/?retryWrites=true&w=majority&maxPoolSize=10' dbname.order -n 100 -f table -o true -s true --values false --sampling true

The command hangs, even though I am able to connect to the same MongoDB URI using mongosh.

Additional details:
OS: MacOS 13.3.1
Mongo Shell v1.8.2
MongoDB v4.4.21

Abstract it away from mongodb

This tool would be pretty handy for getting some initial schema information on a large blob of semistructured JSON (or JSON-convertible) data. It doesn't seem like there's anything about the principle of operation that necessarily ties it to mongo, so maybe some part can be factored out that makes it work for arbitrary JSON data.
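
To illustrate the idea, here is a minimal sketch of type counting over plain JSON objects with no MongoDB dependency. It is not how mongodb-schema works internally; `typeName` and `inferFieldTypes` are hypothetical names, and only top-level fields are inspected.

```javascript
// Hypothetical: classify a JSON value by a coarse type name.
function typeName(value) {
  if (value === null) return 'Null';
  if (Array.isArray(value)) return 'Array';
  if (typeof value === 'object') return 'Document';
  const t = typeof value; // 'string', 'number', 'boolean', ...
  return t.charAt(0).toUpperCase() + t.slice(1);
}

// Hypothetical: count how often each top-level field holds each type.
function inferFieldTypes(documents) {
  const fields = new Map();
  for (const doc of documents) {
    for (const [key, value] of Object.entries(doc)) {
      if (!fields.has(key)) fields.set(key, new Map());
      const types = fields.get(key);
      const name = typeName(value);
      types.set(name, (types.get(name) || 0) + 1);
    }
  }
  // Convert nested Maps to plain objects for easy printing.
  return Object.fromEntries(
    [...fields].map(([key, types]) => [key, Object.fromEntries(types)])
  );
}

console.log(inferFieldTypes([{ a: 1, b: 'x' }, { a: 'y', b: null }]));
```

Any source that can be parsed into an array of objects (a JSON file, newline-delimited JSON, an API response) could be fed to a function of this kind.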

Add CHANGELOG.md file

This package has seen a lot of updates in the past year (amazing!), but it would be great if we could create a CHANGELOG.md file to keep track of what has changed between each version.

Luckily, each update has been fairly small so figuring out what is a breaking change has been fairly simple to follow along. I just want to standardise this procedure so that consumers of this package can update this dependency knowing any risks and/or features added.

compute absolute and relative probability

Currently $prob is the probability relative to its parent node, <count> / <parent count>.

Sometimes it's interesting to see the absolute probability as well, which is <count> / <global count>.

$relprob
$prob
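
A sketch of how both values might be computed while walking the schema tree follows. It assumes each node has a `count` and an optional `children` array, that the relative probability is the node's count divided by its parent's count, and that the absolute probability is the node's count divided by the total document count; `annotateProbabilities` and the node shape are hypothetical.

```javascript
// Hypothetical: attach relative ($relprob) and absolute ($prob)
// probabilities to every node of a simple schema tree.
function annotateProbabilities(node, parentCount, globalCount) {
  const children = (node.children || []).map((child) =>
    annotateProbabilities(child, node.count, globalCount)
  );
  return {
    name: node.name,
    count: node.count,
    $relprob: parentCount > 0 ? node.count / parentCount : 0,
    $prob: globalCount > 0 ? node.count / globalCount : 0,
    children,
  };
}

// Usage: 100 documents, 50 have `address`, 10 of those have `address.zip`.
const tree = {
  name: 'address',
  count: 50,
  children: [{ name: 'zip', count: 10 }],
};
console.log(JSON.stringify(annotateProbabilities(tree, 100, 100), null, 2));
```

In this example `zip` appears in one fifth of the documents that have `address`, but in only one tenth of all documents, which is the distinction the issue asks for.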

Runs in cmd, but not in Git Bash...

I'm trying to poll a variety of collections across several local MongoDB instances, and I can only get some smaller collections to run successfully. Large collections are hitting a clearLine error, but only in the Bash shell; they run okay manually under Windows cmd. The error:

C:\Users\bcurley\AppData\Roaming\npm\node_modules\mongodb-schema\node_modules\progress\lib\node-progress.js:177
  this.stream.clearLine();
  ^

TypeError: this.stream.clearLine is not a function
    at ProgressBar.terminate (C:\Users\bcurley\AppData\Roaming\npm\node_modules\mongodb-schema\node_modules\progress\lib\node-progress.js:177:17)
    at ProgressBar.tick (C:\Users\bcurley\AppData\Roaming\npm\node_modules\mongodb-schema\node_modules\progress\lib\node-progress.js:91:10)
    at Stream.<anonymous> (C:\Users\bcurley\AppData\Roaming\npm\node_modules\mongodb-schema\bin\mongodb-schema:179:13)
    at emitOne (events.js:96:13)
    at Stream.emit (events.js:188:7)
    at Stream.write (C:\Users\bcurley\AppData\Roaming\npm\node_modules\mongodb-schema\lib\stream.js:308:10)
    at Stream.stream.write (C:\Users\bcurley\AppData\Roaming\npm\node_modules\mongodb-schema\node_modules\through\index.js:26:11)
    at Stream.ondata (internal/streams/legacy.js:16:26)
    at emitOne (events.js:101:20)
    at Stream.emit (events.js:188:7)

for i in $(mongo localhost:27017 --quiet --eval "db.adminCommand('listDatabases')" \
    |grep -v Hotfix \
    |grep '"name" :' \
    |cut -d\" -f4 \
    |sed 's/"//g;s/,//g;' \
    )

# Loop through each to compile a listing of existing collections...

do printf "\n%s: \n" "${i}" &&
for j in $(mongo localhost:27017/${i} --quiet --eval "db.getCollectionNames().join('\n')" \
    |grep -v Hotfix \
    |sed 's/^/\t/g;' \
    )

# Loop through these to compile a listing of available fields needed on output...

  do printf "\nCREATE TABLE %s ( \n" ${j} \
     && mongodb-schema localhost:27017 --values=false --format=json ${i}.${j} 2>/dev/null \
        |grep '"path":' \
        |cut -d: -f2 \
        |uniq \
        |sed 's/^/         /g;' \
     && printf "         %s ); \n" " \"X\"" 
  done 

done

Thoughts?
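
A plausible explanation, stated here as an assumption, is that the output stream is not a TTY under Git Bash, and in Node.js `clearLine` and `cursorTo` exist only on TTY write streams. A defensive check along the following lines would avoid the TypeError; `safeClearLine` is a hypothetical helper, not part of the `progress` package or of mongodb-schema.

```javascript
// Hypothetical guard: clear the current line only when the stream
// actually provides the TTY methods; otherwise do nothing.
function safeClearLine(stream) {
  if (typeof stream.clearLine === 'function' &&
      typeof stream.cursorTo === 'function') {
    stream.clearLine(0); // 0 clears the entire line
    stream.cursorTo(0);  // move the cursor back to column 0
    return true;
  }
  return false;
}

// Usage: safe to call whether or not stderr is a terminal.
const cleared = safeClearLine(process.stderr);
console.log(cleared ? 'line cleared' : 'not a TTY, skipped');
```

Checking `process.stderr.isTTY` in both shells would confirm or rule out this explanation.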

Add the ability to limit the number of array lengths collected

I believe this is one area of the tool that can cause heap exhaustion when profiling over a large number of documents. In stream.js's addToType:

    // recurse into arrays by calling `addToType` for each element
    if (typeName === 'Array') {
      type.types = type.types || {};
      type.lengths = type.lengths || [];
      type.lengths.push(value.length); // <-- Grows without bound
      value.forEach(v => addToType(path, v, type.types));

It would be useful to have an option that would skip this, use a reservoir, or somehow cap the collection of lengths.
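
As an illustration of the reservoir option, the following sketch keeps a fixed-size uniform sample of the lengths using reservoir sampling (Algorithm R). The names `createReservoir` and `addToReservoir` are hypothetical and nothing here is taken from stream.js.

```javascript
// Hypothetical reservoir: holds at most `capacity` values, each value
// seen so far having an equal chance of being in the sample.
function createReservoir(capacity, random = Math.random) {
  return { capacity, seen: 0, samples: [], random };
}

function addToReservoir(reservoir, value) {
  reservoir.seen += 1;
  if (reservoir.samples.length < reservoir.capacity) {
    reservoir.samples.push(value);
    return;
  }
  // Pick a slot in [0, seen); the new value replaces an existing sample
  // only if the slot falls inside the reservoir.
  const slot = Math.floor(reservoir.random() * reservoir.seen);
  if (slot < reservoir.capacity) {
    reservoir.samples[slot] = value;
  }
}

// Usage: memory stays bounded no matter how many lengths are observed.
const lengths = createReservoir(100);
for (let i = 0; i < 10000; i++) {
  addToReservoir(lengths, i % 7);
}
console.log(lengths.samples.length, lengths.seen);
```

Statistics computed from the sample are then estimates rather than exact values, which is the trade-off for bounded memory.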

Unknown argument: sample

Attempting to run: mongodb-schema localhost:27017 prendacoins.users --sample 10000
Prints out help and the notice Unknown argument: sample

mongodb-schema --version => 9.0.0

Support srv uri host names

Allow the URI of the form mongodb+srv://, which is needed for Atlas connections.
Attaching a fixed mongodb-schema file which loosens the assumption on the URI parameter:

if (!uri.startsWith('mongodb://') && !uri.startsWith('mongodb+srv://')) {
  uri = 'mongodb://' + uri;
}
console.log(`URI: ${uri}`);
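
The same check can be written as a small pure function, which makes it easy to test in isolation. This is a sketch of the idea above; `normalizeUri` is a hypothetical name and not a function in the mongodb-schema CLI.

```javascript
// Hypothetical: prepend the default scheme only when the argument has
// neither the standard nor the SRV scheme.
function normalizeUri(uri) {
  const hasScheme =
    uri.startsWith('mongodb://') || uri.startsWith('mongodb+srv://');
  return hasScheme ? uri : `mongodb://${uri}`;
}

console.log(normalizeUri('localhost:27017'));
console.log(normalizeUri('mongodb+srv://cluster.example.net/test'));
```

The host `cluster.example.net` is a placeholder for illustration only.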

Standard deviation for Array lengths

The example output below (edited for brevity) shows a specific field of interest from a collection analysis: Tasks, which can optionally have Assignees (an array).
Regarding the number of assignees (the array length), it would be very useful to have the standard deviation in addition to the already provided average_length.

   "_id" : ObjectId("5a8d71276397ce1a2dd42bbe"), 
   "name" : "assignees", 
   "path" : "assignees", 
   "count" : NumberInt(44), 
   "types" : [
       {
           "name" : "Undefined", 
           "type" : "Undefined", 
           "path" : "assignees", 
           "count" : NumberInt(56), 
           "total_count" : NumberInt(0), 
           "probability" : 0.56, 
           "unique" : NumberInt(1), 
           "has_duplicates" : true
       }, 
       {
           "name" : "Array", 
           "bsonType" : "Array", 
           "path" : "assignees", 
           "count" : NumberInt(30), 
           "types" : [
               {
                   "name" : "DBRef", 
                   "bsonType" : "DBRef", 
                   "path" : "assignees", 
                   "count" : NumberInt(37), 
                   "values" : [
                       DBRef("cw_user", ObjectId("577e7f1f300488c6676b3406")), 
                       DBRef("cw_user", ObjectId("577e7f08300488c6676b33f5")), 
                       DBRef("cw_role", ObjectId("582493383004c0551c10bc5d")), 
                       DBRef("cw_user", ObjectId("577e7f08300488c6676b33f5")), 
                       DBRef("cw_user", ObjectId("577e7f08300488c6676b33f5")), 
                       DBRef("cw_user", ObjectId("577e7f08300488c6676b33f5")), 
                       DBRef("cw_role", ObjectId("5a7c46bd39c3cc64d3683a18")), 
                       DBRef("cw_user", ObjectId("577e7f08300488c6676b33f5")), 
                       DBRef("cw_user", ObjectId("577e7f08300488c6676b33f5")), 
                       DBRef("cw_role", ObjectId("5a7c46bd39c3cc64d3683a18")), 
                       DBRef("cw_user", ObjectId("577e7f85300488c6676b344c")), 
                       DBRef("cw_user", ObjectId("577e7f08300488c6676b33f5")), 
                       DBRef("cw_user", ObjectId("5a8d51cd6397ce5a3b44496c"))
                   ], 
                   "total_count" : NumberInt(0), 
                   "probability" : NumberInt(1), 
                   "unique" : NumberInt(1), 
                   "has_duplicates" : true
               }
           ], 
           "lengths" : [
               NumberInt(2), 
               NumberInt(1), 
               NumberInt(1), 
               NumberInt(1), 
               NumberInt(1), 
               NumberInt(1), 
               NumberInt(1), 
               NumberInt(1), 
               NumberInt(1), 
               NumberInt(1), 
               NumberInt(1), 
               NumberInt(2), 
               NumberInt(1), 
               NumberInt(1), 
               NumberInt(3), 
               NumberInt(1), 
               NumberInt(1), 
               NumberInt(3), 
               NumberInt(1), 
               NumberInt(1), 
               NumberInt(1), 
               NumberInt(1), 
               NumberInt(1), 
               NumberInt(2), 
               NumberInt(1), 
               NumberInt(0), 
               NumberInt(2), 
               NumberInt(2), 
               NumberInt(0), 
               NumberInt(1)
           ], 
           "total_count" : NumberInt(37), 
           "probability" : 0.3, 
           "average_length" : 1.2333333333333334
       }, 
       {
           "name" : "Null", 
           "bsonType" : "Null", 
           "path" : "assignees", 
           "count" : NumberInt(14), 
           "total_count" : NumberInt(0), 
           "probability" : 0.14, 
           "unique" : NumberInt(1), 
           "has_duplicates" : true
       }
   ], 
   "total_count" : NumberInt(100), 
   "type" : [
       "Undefined", 
       "Array", 
       "Null"
   ], 
   "has_duplicates" : true, 
   "probability" : 0.44
}
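
For reference, the requested statistic could be computed in a streaming fashion, without keeping every length, using Welford's online algorithm. The sketch below is an assumption about one possible implementation; `createRunningStats`, `pushValue` and `populationStdDev` are hypothetical names, and it computes the population (not sample) standard deviation.

```javascript
// Hypothetical running statistics using Welford's online algorithm.
function createRunningStats() {
  return { count: 0, mean: 0, m2: 0 };
}

function pushValue(stats, x) {
  stats.count += 1;
  const delta = x - stats.mean;
  stats.mean += delta / stats.count;
  // m2 accumulates the sum of squared deviations from the running mean.
  stats.m2 += delta * (x - stats.mean);
}

function populationStdDev(stats) {
  return stats.count === 0 ? 0 : Math.sqrt(stats.m2 / stats.count);
}

// Usage with the 30 array lengths from the output above.
const lengths = [
  2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 3,
  1, 1, 3, 1, 1, 1, 1, 1, 2, 1, 0, 2, 2, 0, 1,
];
const stats = createRunningStats();
lengths.forEach((n) => pushValue(stats, n));
console.log(stats.mean, populationStdDev(stats));
```

For these lengths the sum is 37 and the sum of squares is 59, so the population variance is 401/900 and the standard deviation is about 0.667, alongside the reported average_length of about 1.233.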
