mongodb-js / mongodb-schema
Infer a probabilistic schema for a MongoDB collection.
Home Page: https://github.com/mongodb-js/mongodb-schema
License: Apache License 2.0
Sometimes I get > 1 even for non-array fields
min and max:

```
"$d": {
  min: 0,
  max: 122
}
```

counts (up to a certain number, 100?):

```
"$d": {
  "red": 300,
  "green": 250,
  "blue": 129
}
```
Instead of converting `string` to a `text` / `category` type, provide the native type and add a boolean flag `$category`. Provide a parameter to determine how many values make a category.
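The counting proposal above could be sketched like this; `maxCategoryValues` is a hypothetical parameter name for the proposed option, not an existing one:

```javascript
// Count distinct values only up to a cap; beyond the cap, give up on
// treating the field as a category.
function countValues(values, maxCategoryValues = 100) {
  const counts = new Map();
  for (const v of values) {
    if (!counts.has(v) && counts.size >= maxCategoryValues) {
      return null; // too many distinct values: not a category
    }
    counts.set(v, (counts.get(v) || 0) + 1);
  }
  return Object.fromEntries(counts);
}
```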
I'm currently working on PR #141 and Travis seems very slow. Would you like a GitHub Actions workflow instead?
Should return `bson.ObjectId` or `{"$oid": "...."}` extended JSON.
I would like to be able to use the CLI version against one of the secondary nodes in our replica set, but if I pass it as an option I get errors:
```
mongodb-schema mongodb://USERNAME:PASSWORD@HOST:PORT/AUTH_DB?slaveOk=true DATABASE.COLLECTION --no-values -n 100

/usr/local/lib/node_modules/mongodb-schema/node_modules/mongodb/lib/utils.js:132
    throw err;
    ^

MongoError: not master and slaveOk=false
    at /usr/local/lib/node_modules/mongodb-schema/node_modules/mongodb-core/lib/connection/pool.js:593:63
    at authenticateStragglers (/usr/local/lib/node_modules/mongodb-schema/node_modules/mongodb-core/lib/connection/pool.js:516:16)
    at Connection.messageHandler (/usr/local/lib/node_modules/mongodb-schema/node_modules/mongodb-core/lib/connection/pool.js:552:5)
    at emitMessageHandler (/usr/local/lib/node_modules/mongodb-schema/node_modules/mongodb-core/lib/connection/connection.js:309:10)
    at Socket.<anonymous> (/usr/local/lib/node_modules/mongodb-schema/node_modules/mongodb-core/lib/connection/connection.js:452:17)
    at emitOne (events.js:116:13)
    at Socket.emit (events.js:211:7)
    at addChunk (_stream_readable.js:263:12)
    at readableAddChunk (_stream_readable.js:250:11)
    at Socket.Readable.push (_stream_readable.js:208:10)
```
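One thing worth trying, under the assumption that the bundled driver honors the standard `readPreference` connection-string option (which superseded the legacy `slaveOk` flag in newer MongoDB drivers):

```shell
mongodb-schema 'mongodb://USERNAME:PASSWORD@HOST:PORT/AUTH_DB?readPreference=secondaryPreferred' DATABASE.COLLECTION --no-values -n 100
```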
Hi,
this works very well on a MongoDB instance with a big collection (600,000 items).
But the schema is very big just because of the "lengths" arrays. My schema is 30 MB when it could be just a few KB if the "lengths" arrays could be limited to 1000 / 2000 items via an option (the average would be less precise in that case).
For example, the code in stream.js:

```javascript
type.lengths.push(value.length);
```

could be something like:

```javascript
if (type.lengths.length < MYOPTION) type.lengths.push(value.length);
```

Thanks for your work!
Example usage:

```javascript
var callback = function(err, res) {
  // handle error
  if (err) {
    return console.error(err);
  }
  // else pretty-print to console
  console.log(JSON.stringify(res, null, '\t'));
};

schema(documents, options, callback);
```
If the string is longer than, let's say, 100 characters, it's probably a text and not a category. Make the number configurable via `data.maxCategoryLength`.
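A hedged sketch of the heuristic above: a field qualifies as a category only if no sampled value exceeds the threshold (`maxCategoryLength` is the proposed option, not an existing one):

```javascript
// Treat a string field as a category only when every sampled value is
// at most `maxCategoryLength` characters long; otherwise call it text.
function isCategory(values, maxCategoryLength = 100) {
  return values.every(v => v.length <= maxCategoryLength);
}
```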
Should extend the DBCollection object to support `db.foo.schema()` and return the serialized version of the schema found in the collection. Should thinly wrap `.find()` with all its parameters, like query, skip, limit, etc.
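A minimal sketch of what such a helper could look like, using a mock collection so it is self-contained; the schema summarization here is a stand-in for the real analysis, and all names are illustrative:

```javascript
// Mock collection illustrating a thin `.schema()` wrapper over `.find()`.
class MockCollection {
  constructor(docs) {
    this.docs = docs;
  }
  // thin wrapper semantics: forward query/skip/limit to find()
  find(query = {}, { skip = 0, limit = Infinity } = {}) {
    return this.docs
      .filter(d => Object.entries(query).every(([k, v]) => d[k] === v))
      .slice(skip, skip + limit);
  }
  // proposed helper: analyze matching documents and return a serialized
  // schema (here: a map of field name -> sorted list of observed types)
  schema(query, options) {
    const fields = {};
    for (const doc of this.find(query, options)) {
      for (const [k, v] of Object.entries(doc)) {
        (fields[k] = fields[k] || new Set()).add(typeof v);
      }
    }
    return Object.fromEntries(
      Object.entries(fields).map(([k, s]) => [k, [...s].sort()])
    );
  }
}
```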
Hey!
I'm trying to follow the example given in the README:
```javascript
const parseSchema = require('mongodb-schema');
const { MongoClient } = require('mongodb');

const dbName = 'test';
const uri = `mongodb://localhost:27017/${dbName}`;
const client = new MongoClient(uri);

async function run() {
  try {
    const database = client.db(dbName);
    const documentStream = database.collection('data').find();
    // Here we are passing in a cursor as the first argument. You can
    // also pass in a stream or an array of documents directly.
    const schema = await parseSchema(documentStream);
    console.log(JSON.stringify(schema, null, 2));
  } finally {
    await client.close();
  }
}
run().catch(console.dir);
```
But whenever I run this, I'm always getting:

```
TypeError: parseSchema is not a function
```
Can anyone point out what I'm doing wrong here? Thanks!
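One possible cause (an assumption worth checking against the installed version's exports) is a change in the module's export shape: if `parseSchema` became a named export, `require('mongodb-schema')` returns an object rather than a function. A minimal illustration using mock module objects, not the real package:

```javascript
// Illustration only: how switching from `module.exports = fn` to named
// exports produces "parseSchema is not a function" for the old pattern.
const legacyModule = { exports: function parseSchema() {} };
const modernModule = { exports: { parseSchema: function parseSchema() {} } };

const oldStyle = legacyModule.exports; // a function: callable directly
const newStyle = modernModule.exports; // an object: calling it throws

// With named exports, destructure instead:
const { parseSchema } = modernModule.exports;
```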
Hi,
I am trying to analyze a schema with the following command, but the command never exits with either a success or an error message:

```
mongodb-schema 'mongodb+srv://username:[email protected]/?retryWrites=true&w=majority&maxPoolSize=10' dbname.order -n 100 -f table -o true -s true --values false --sampling true
```

The command is stuck, while I am able to connect to the above MongoDB URI using mongosh.
Additional details:
OS: MacOS 13.3.1
Mongo Shell v1.8.2
MongoDB v4.4.21
I'm working on updating dependencies in this repo, and I don't know `zuul`. It's part of the `npm start` script, but I can't find it referenced anywhere else?
{metavar: {prefix: '#'}}
This tool would be pretty handy for getting some initial schema information on a large blob of semistructured JSON (or JSON-convertible) data. It doesn't seem like there's anything about the principle of operation that necessarily ties it to mongo, so maybe some part can be factored out that makes it work for arbitrary JSON data.
Hi,
We are seeing an error thrown when parsing documents that contain `toString` in a property path, including nested properties. The error can be observed with mongodb-schema v11+, though it might also reproduce on earlier versions.
Here is a code sandbox for reproduction; the error can be observed in the web console.
This package has seen a lot of updates in the past year (amazing!), but it would be great if we could create a CHANGELOG.md file to keep track of what has changed between each version.
Luckily, each update has been fairly small so figuring out what is a breaking change has been fairly simple to follow along. I just want to standardise this procedure so that consumers of this package can update this dependency knowing any risks and/or features added.
Currently `$prob` is the probability relative to its parent node, `<count> / <parent count>`.
Sometimes it's interesting to see the absolute probability as well, which is `<count> / <global count>`.
Proposed names: `$relprob` for the relative probability, `$prob` for the absolute one.
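The two quantities could be computed side by side; a sketch with illustrative names (`parentCount` is the parent node's count, `globalCount` the total number of sampled documents):

```javascript
// Relative vs. absolute probability for a schema node.
function probabilities(nodeCount, parentCount, globalCount) {
  return {
    relprob: nodeCount / parentCount, // proposed $relprob
    prob: nodeCount / globalCount,    // proposed absolute $prob
  };
}
```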
schema(docs, {merge: mySchema})
Trying to poll a variety of collections across several local MongoDB instances, and I can only get some smaller collections to run successfully. Large collections are hitting a `clearLine` error, but only in the Bash shell; it runs okay when launched manually under Windows:
```
C:\Users\bcurley\AppData\Roaming\npm\node_modules\mongodb-schema\node_modules\progress\lib\node-progress.js:177
this.stream.clearLine();
            ^

TypeError: this.stream.clearLine is not a function
    at ProgressBar.terminate (C:\Users\bcurley\AppData\Roaming\npm\node_modules\mongodb-schema\node_modules\progress\lib\node-progress.js:177:17)
    at ProgressBar.tick (C:\Users\bcurley\AppData\Roaming\npm\node_modules\mongodb-schema\node_modules\progress\lib\node-progress.js:91:10)
    at Stream.<anonymous> (C:\Users\bcurley\AppData\Roaming\npm\node_modules\mongodb-schema\bin\mongodb-schema:179:13)
    at emitOne (events.js:96:13)
    at Stream.emit (events.js:188:7)
    at Stream.write (C:\Users\bcurley\AppData\Roaming\npm\node_modules\mongodb-schema\lib\stream.js:308:10)
    at Stream.stream.write (C:\Users\bcurley\AppData\Roaming\npm\node_modules\mongodb-schema\node_modules\through\index.js:26:11)
    at Stream.ondata (internal/streams/legacy.js:16:26)
    at emitOne (events.js:101:20)
    at Stream.emit (events.js:188:7)
```
```shell
for i in $(mongo localhost:27017 --quiet --eval "db.adminCommand('listDatabases')"
    |grep -v Hotfix
    |grep '"name" :'
    |cut -d" -f4
    |sed 's/"//g;s/,//g;'
    )
do printf "\n%s: \n" "${i}" &&
    for j in
        |grep -v Hotfix
        |sed 's/^/\t/g;'
        )
    do printf "\nCREATE TABLE %s ( \n" ${j} \
        && mongodb-schema localhost:27017 --values=false --format=json ${i}.${j} 2>/dev/null \
        |grep '"path":' \
        |cut -d: -f2 \
        |uniq \
        |sed 's/^/    /g;' \
        && printf " %s ); \n" " \"X\""
    done
done
```
Thoughts?
I believe this is one area of the tool that can cause heap exhaustion when profiling a large number of documents. In stream.js's `addToType`:
```javascript
// recurse into arrays by calling `addToType` for each element
if (typeName === 'Array') {
  type.types = type.types || {};
  type.lengths = type.lengths || [];
  type.lengths.push(value.length); // <-- grows without bound
  value.forEach(v => addToType(path, v, type.types));
}
```
It would be useful to have an option that would skip this, use a reservoir, or somehow cap the collection of lengths.
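A reservoir-based cap could look like this; a minimal sketch where `MAX_LENGTHS` and `lengthsSeen` are hypothetical names, not actual stream.js internals:

```javascript
const MAX_LENGTHS = 1000; // hypothetical cap option

// Keep a bounded, unbiased sample of array lengths via reservoir sampling.
function pushLength(type, length) {
  type.lengths = type.lengths || [];
  type.lengthsSeen = (type.lengthsSeen || 0) + 1;
  if (type.lengths.length < MAX_LENGTHS) {
    type.lengths.push(length);
  } else {
    // replace a random slot with probability MAX_LENGTHS / lengthsSeen
    const i = Math.floor(Math.random() * type.lengthsSeen);
    if (i < MAX_LENGTHS) type.lengths[i] = length;
  }
}
```

Memory stays constant regardless of how many documents are profiled, at the cost of the average being an estimate rather than exact.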
Attempting to run: `mongodb-schema localhost:27017 prendacoins.users --sample 10000`
Prints out the help and the notice `Unknown argument: sample`.

```
mongodb-schema --version
=> 9.0.0
```
Allow URIs of the form `mongodb+srv://`, which is needed for Atlas connections. Attaching a fixed mongodb-schema file which loosens the assumption on the URI parameter:

```javascript
if (!uri.startsWith('mongodb://') && !uri.startsWith('mongodb+srv://')) {
  uri = 'mongodb://' + uri;
}
console.log(`URI: ${uri}`);
```
Can probably be aliased with `Number`.
Given the example output below (edited for brevity) for a specific field of interest from a collection analysis: Tasks, which can optionally have Assignees (array). Regarding the number of assignees (the array length), it would be very useful to have the standard deviation besides the already provided `average_length`:
"_id" : ObjectId("5a8d71276397ce1a2dd42bbe"),
"name" : "assignees",
"path" : "assignees",
"count" : NumberInt(44),
"types" : [
{
"name" : "Undefined",
"type" : "Undefined",
"path" : "assignees",
"count" : NumberInt(56),
"total_count" : NumberInt(0),
"probability" : 0.56,
"unique" : NumberInt(1),
"has_duplicates" : true
},
{
"name" : "Array",
"bsonType" : "Array",
"path" : "assignees",
"count" : NumberInt(30),
"types" : [
{
"name" : "DBRef",
"bsonType" : "DBRef",
"path" : "assignees",
"count" : NumberInt(37),
"values" : [
DBRef("cw_user", ObjectId("577e7f1f300488c6676b3406")),
DBRef("cw_user", ObjectId("577e7f08300488c6676b33f5")),
DBRef("cw_role", ObjectId("582493383004c0551c10bc5d")),
DBRef("cw_user", ObjectId("577e7f08300488c6676b33f5")),
DBRef("cw_user", ObjectId("577e7f08300488c6676b33f5")),
DBRef("cw_user", ObjectId("577e7f08300488c6676b33f5")),
DBRef("cw_role", ObjectId("5a7c46bd39c3cc64d3683a18")),
DBRef("cw_user", ObjectId("577e7f08300488c6676b33f5")),
DBRef("cw_user", ObjectId("577e7f08300488c6676b33f5")),
DBRef("cw_role", ObjectId("5a7c46bd39c3cc64d3683a18")),
DBRef("cw_user", ObjectId("577e7f85300488c6676b344c")),
DBRef("cw_user", ObjectId("577e7f08300488c6676b33f5")),
DBRef("cw_user", ObjectId("5a8d51cd6397ce5a3b44496c"))
],
"total_count" : NumberInt(0),
"probability" : NumberInt(1),
"unique" : NumberInt(1),
"has_duplicates" : true
}
],
"lengths" : [
NumberInt(2),
NumberInt(1),
NumberInt(1),
NumberInt(1),
NumberInt(1),
NumberInt(1),
NumberInt(1),
NumberInt(1),
NumberInt(1),
NumberInt(1),
NumberInt(1),
NumberInt(2),
NumberInt(1),
NumberInt(1),
NumberInt(3),
NumberInt(1),
NumberInt(1),
NumberInt(3),
NumberInt(1),
NumberInt(1),
NumberInt(1),
NumberInt(1),
NumberInt(1),
NumberInt(2),
NumberInt(1),
NumberInt(0),
NumberInt(2),
NumberInt(2),
NumberInt(0),
NumberInt(1)
],
"total_count" : NumberInt(37),
"probability" : 0.3,
"average_length" : 1.2333333333333334
},
{
"name" : "Null",
"bsonType" : "Null",
"path" : "assignees",
"count" : NumberInt(14),
"total_count" : NumberInt(0),
"probability" : 0.14,
"unique" : NumberInt(1),
"has_duplicates" : true
}
],
"total_count" : NumberInt(100),
"type" : [
"Undefined",
"Array",
"Null"
],
"has_duplicates" : true,
"probability" : 0.44
}
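The requested statistic could be derived from the `lengths` array that is already collected; a plain sample-statistics sketch using the population formula, with illustrative names:

```javascript
// Compute mean and standard deviation of the collected array lengths.
function lengthStats(lengths) {
  const n = lengths.length;
  const mean = lengths.reduce((a, b) => a + b, 0) / n;
  const variance = lengths.reduce((a, b) => a + (b - mean) ** 2, 0) / n;
  return { average_length: mean, stddev_length: Math.sqrt(variance) };
}
```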
Including all the new features, like maxCardinality, metavars, array collapsing, ...