Comments (6)
Thanks for the report @Redsandro
Modern JavaScript engines execute user code in a single thread (I/O operations aside), so truly parallel computation does not happen unless one forks to another process. Unfortunately, that is out of the scope of mingo.
Also, mingo performance cannot be compared to mongodb, which is written in C++ and operates in a different environment (database servers). If you are executing your code on the server, then loading the data into mongodb before querying would most likely be faster.
That said, the execution time you are seeing merits some investigation.
Can you provide some more information?
- What query are you running?
- What is the structure of documents in your collection?
- Which field are you grouping by?
Some auxiliary questions:
- What is the performance observed with custom code that achieves the same thing?
- Do you get better performance using another library that supports grouping?
- If so, which library and by what factor?
from mingo.
Sorry for getting back to you late. I had to make a custom implementation for my use-case, and that was not simple.
The facts you listed in the first half of your post are true but, frankly, irrelevant to the issue. I've previously identified the problem quite clearly: mingo traverses the entire collection once per group operator. This is a performance killer. The collection should be traversed only once, and every operator should share that same traversal.
1. What query are you running?
I'm using data under NDA, so I'll have to obfuscate this, but to give you a basic idea:
[
  {
    "$match": {
      "aaaa.ggg": false,
      "aaaa.oooo": { "$lte": 2200 },
      "ccc.aaaa": true,
      "ccc.ssss": { "$in": [4, 5] },
      "mmmm.cccc": "1111222233334444"
    }
  },
  {
    "$project": {
      "_id": 0,
      "aaaa": 1,
      "ccc": 1,
      "id": 1,
      "iiii": 1,
      "meta": 1,
      "oooo": 1,
      "pppp": 1,
      "rrrr": 1,
      "ttt": 1
    }
  },
  {
    "$group": {
      "_id": "$mmmm.cccc",
      "cccbbbb": { "$sum": { "$cond": [{ "$eq": ["$ccc.bbbb", true] }, 1, 0] } },
      "cccaaaa": { "$sum": { "$cond": [{ "$eq": ["$ccc.aaaa", true] }, 1, 0] } },
      "cccffffddddss": { "$sum": { "$cond": [{ "$gte": ["$ccc.ddddss", 4] }, 1, 0] } },
      "ffffffff": {
        "$sum": {
          "$cond": [
            {
              "$or": [
                { "$eq": ["$aaaa.ffff", "FOO"] },
                { "$eq": ["$aaaa.ffff", "BAR"] }
              ]
            },
            1,
            0
          ]
        }
      },
      "ggg": { "$sum": { "$cond": [{ "$eq": ["$aaaa.ggg", true] }, 1, 0] } },
      "iiiiwwww": { "$sum": { "$cond": [{ "$eq": ["$iiii.wwww", true] }, 1, 0] } },
      "iiiiXXXX": {
        "$sum": {
          "$cond": [
            {
              "$or": [
                { "$ne": ["$iiii.a", false] },
                { "$ne": ["$iiii.b", false] },
                { "$ne": ["$iiii.c", false] },
                { "$ne": ["$iiii.d", false] }
              ]
            },
            1,
            0
          ]
        }
      },
      "iiiizzzz": { "$sum": { "$cond": [{ "$eq": ["$iiii.zzzz", true] }, 1, 0] } },
      "ooaa": { "$sum": { "$cond": [{ "$eq": ["$aaaa.ooaa", true] }, 1, 0] } },
      "ssss": { "$sum": { "$cond": [{ "$eq": ["$aaaa.ssss", true] }, 1, 0] } },
      "total": { "$sum": 1 },
      "uuuu": { "$sum": { "$cond": [{ "$eq": ["$aaaa.uuuu", true] }, 1, 0] } },
      "tttt": { "$sum": { "$cond": [{ "$eq": ["$aaaa.tttt", true] }, 1, 0] } },
      "wwww": { "$sum": { "$cond": [{ "$eq": ["$aaaa.wwww", true] }, 1, 0] } },
      "yyyy": { "$sum": { "$cond": [{ "$gte": ["$aaaa.yyyy", 1] }, 1, 0] } }
    }
  }
]
Some console.log timings:
;;; getting data (0.109s)
;;; getting count (0.233s)
;;; mingo start (0.234s)
;;; mingo end (15.024s)
;;; getting total (15.024s)
;;; return (15.024s)
That's about 14.8 seconds for the group (count) stage.
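For anyone who wants to collect comparable numbers, here is a minimal sketch of a stopwatch that prints cumulative timings in the same format as the log above. The names createStopwatch and runGroupStage are made up for illustration, and runGroupStage is only a stand-in for the real aggregation call.

```javascript
// Hypothetical helper: returns a logger that prints seconds elapsed since creation.
// The clock is injectable so the helper can be checked deterministically.
function createStopwatch(now = () => Date.now()) {
  const startedAt = now()
  return function mark(label) {
    const seconds = ((now() - startedAt) / 1000).toFixed(3)
    const line = `;;; ${label} (${seconds}s)`
    console.log(line)
    return line
  }
}

// Stand-in for the expensive stage; replace with the real aggregation call.
function runGroupStage(docs) {
  return docs.reduce(total => total + 1, 0)
}

const mark = createStopwatch()
mark('group start')
const total = runGroupStage([{ a: 1 }, { a: 2 }, { a: 3 }])
mark('group end')
```

The duration of a stage is the difference between two consecutive marks, which is how the figure quoted above was derived.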
2. What is the structure of documents in your collection?
Again, NDA, but something to the depth of this:
{
  key1: {
    doc1: {
      key1: {
        arr1: [{}, {}, {}, {}, {}, {}]
      }
      // etc
    },
    doc2: {},
    doc3: {},
    doc4: {}
  },
  key2: {
    doc1: {},
    doc2: {}
  },
  key3: {
    doc1: {},
    doc2: {}
  },
  key4: {
    doc1: {},
    doc2: {}
  },
  key5: {
    doc1: {},
    doc2: {}
  }
}
Say every sub-subdoc contains about 20 keys.
3. Which field are you grouping by?
I'm grouping by a common identifier that all objects in the collection share. This results in a single document, so I might as well group by null.
Aux 1. What is the performance observed with custom code that achieves the same thing?
Porting a complex mongo group and count to a custom function is quite the job. That's why I went looking for libraries with mongodb syntax in the first place. But I'm happy with the result.
;;; getting data (0.126s)
;;; getting count (0.240s)
;;; custom count start (0.241s)
;;; custom count done (0.482s)
;;; getting total (0.483s)
;;; return (0.483s)
That's about 0.24 seconds for the counting.
You should glance at these timings to see the realm of realistic performance in comparison, but know that this is a highly specialized function. I think something as versatile as a mongodb stage would be 4 times heavier, but realistically it would still be under one second.
Every "group operator" is a function holding its own value. On every iteration of the single collection loop, every operator function is executed with the current document as its argument and updates its own value.
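A minimal sketch of that idea, not the actual NDA-covered implementation: each operator is a closure that owns its count, and a single loop feeds every document to all of them. The helper names (countWhen, groupOnce) and the sample fields are assumptions for illustration.

```javascript
// Each factory returns an operator whose running count lives in a closure.
function countWhen(predicate) {
  let value = 0
  return {
    step(doc) { if (predicate(doc)) value++ },
    result() { return value },
  }
}

// Hypothetical field names, loosely modelled on the obfuscated query.
// Operators are stateful, so build a fresh set for every run.
const operators = {
  total: countWhen(() => true),
  ggg: countWhen(doc => doc.aaaa?.ggg === true),
  yyyy: countWhen(doc => doc.aaaa?.yyyy >= 1),
  ffff: countWhen(doc => ['FOO', 'BAR'].includes(doc.aaaa?.ffff)),
}

function groupOnce(collection, ops) {
  const entries = Object.entries(ops)
  for (const doc of collection) {
    // The only loop over the collection: every operator sees the same document.
    for (const [, op] of entries) op.step(doc)
  }
  return Object.fromEntries(entries.map(([name, op]) => [name, op.result()]))
}

const sample = [
  { aaaa: { ggg: true, yyyy: 2, ffff: 'FOO' } },
  { aaaa: { ggg: false, yyyy: 0, ffff: 'BAZ' } },
  { aaaa: { ggg: true, yyyy: 1, ffff: 'BAR' } },
]
const counts = groupOnce(sample, operators)
console.log(counts)
```

The cost is one pass over the collection plus one cheap predicate call per operator per document, regardless of how many counters are defined.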
Aux 2. Do you get better performance using another library that supports grouping?
I don't know of any such library in pure javascript. Loki does not support grouping.
from mingo.
The query you provided is quite elaborate. For most scenarios, custom code will yield better performance; however, it should be possible to use only the facilities in mingo that make sense. In your example, that could be the $match and $project parts.
I understand the data is protected by an NDA, but it should be straightforward to write a generator for data of a similar shape, which would make reproducing this for testing even simpler.
Is it possible to provide a generator with random values?
Concerning group operators, each operator takes the collection as input, because every operator requires the collection as input. That does not mean they necessarily iterate it or have to do the same thing. Each $sum in your example needs to be aggregated based on different rules, which cannot be run under the same iteration.
In your special case, because you are running a lot of sums, you can throw in a custom group operator that does exactly that: take the collection and compute all the sums required. That way you still benefit from being able to use the rest of the mingo infrastructure as described above. If, while at it, you find new ways to speed up the more generic implementation for all other operators, please send a PR.
from mingo.
Concerning group operators, each operator takes the collection as input, because every operator requires the collection as input.
True, but operators are like kids. You give them breakfast at the same table and bring them all to the same school in the same car, on the same ride. They don't have to eat the same sandwich or bring home the same drawing, though.
What mingo currently does is set a separate table for each kid and use separate rides to drive them to school individually. Unless there's a big issue that I'm missing, which I doubt, this is a big, unnecessary overhead.
Now this is heavily simplified pseudo-code made up on the spot, but it illustrates that the collection needs to be iterated only once.
const results = {}
// init operators; keep the closures that init() returns
const operators = $group.map(expr => init(expr, results))
// traverse collection once
collection.forEach(document =>
  operators.forEach(operator =>
    operator(document)
  )
)
return results
This only works for a single grouped result; in reality there need to be result sets. If it were simple to write, I'd send a PR instead.
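To give an idea of what the result sets could look like, here is a rough sketch that keeps one set of accumulators per group key while still traversing the collection once. Everything in it (groupBy, makeAccumulators, the team and active fields) is hypothetical, and it assumes group keys are primitives, since a Map compares object keys by identity.

```javascript
// Hypothetical helper: one pass over the collection, one accumulator set per group key.
function groupBy(collection, keyOf, makeAccumulators) {
  const groups = new Map()
  for (const doc of collection) {
    const key = keyOf(doc)
    // Create the accumulators lazily, the first time a key is seen.
    if (!groups.has(key)) groups.set(key, makeAccumulators())
    for (const acc of Object.values(groups.get(key))) acc.step(doc)
  }
  // Flatten each accumulator set into a plain result document.
  return [...groups].map(([key, accs]) => ({
    _id: key,
    ...Object.fromEntries(Object.entries(accs).map(([name, acc]) => [name, acc.value])),
  }))
}

// A fresh set of accumulators is needed for every group.
const makeAccumulators = () => ({
  total: { value: 0, step() { this.value++ } },
  active: { value: 0, step(doc) { if (doc.active === true) this.value++ } },
})

const rows = groupBy(
  [
    { team: 'a', active: true },
    { team: 'b', active: false },
    { team: 'a', active: false },
  ],
  doc => doc.team,
  makeAccumulators
)
console.log(rows)
```

Compound group keys would need to be serialized into a string before being used as Map keys.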
Every operator should be a closure so that it can keep its own result. This loosely illustrates a $sum:
function init(expr, results) {
  // some operator preparation code here
  // switch operator, prepare and return relevant code, etc
  // keep results reference
  // needs to be object, as values aren't referenced
  results[expr.localField] = {}
  // return accumulator function for the pipeline
  // This will effectively be similar to Array.reduce()
  // (the arrow function is parenthesized so that .bind() can be called on it)
  return ((accumulator, document) => {
    if (!accumulator.value) accumulator.value = 0
    if (document.get(expr.foreignField) == expr.fieldValue) {
      accumulator.value++
    }
  }).bind(null, results[expr.localField])
}
This is complex to write for something as versatile as a mongodb pipeline, but it's highly efficient in comparison. There needs to be a cluster of grouped closures so that everything that needs to be iterated will be iterated once only.
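As a rough illustration of what such a cluster of closures could look like for the operators used in the query above, here is a sketch that compiles each expression once and then runs all of them in one loop. It is not how mingo works internally. It assumes only $sum, $cond, $or, $eq, $ne and $gte, ignores the _id key (single group), and uses plain JavaScript comparisons, which do not match MongoDB semantics in every case.

```javascript
// Resolve a "$a.b" field path against a document; any other value is a literal.
function operand(value) {
  if (typeof value === 'string' && value.startsWith('$')) {
    const path = value.slice(1).split('.')
    return doc => path.reduce((node, key) => (node == null ? undefined : node[key]), doc)
  }
  return () => value
}

// Compile a boolean expression once, so no parsing happens inside the loop.
function compile(expr) {
  const [op, args] = Object.entries(expr)[0]
  if (op === '$or') {
    const parts = args.map(compile)
    return doc => parts.some(test => test(doc))
  }
  const [left, right] = args.map(operand)
  switch (op) {
    case '$eq': return doc => left(doc) === right(doc)
    case '$ne': return doc => left(doc) !== right(doc)
    case '$gte': return doc => left(doc) >= right(doc)
    default: throw new Error(`unsupported operator: ${op}`)
  }
}

// Turn { $sum: 1 } or { $sum: { $cond: [expr, 1, 0] } } into a per-document increment.
function compileSum(spec) {
  const sum = spec.$sum
  if (typeof sum === 'number') return () => sum
  const [condition, whenTrue, whenFalse] = sum.$cond
  const test = compile(condition)
  return doc => (test(doc) ? whenTrue : whenFalse)
}

function runGroup(collection, groupSpec) {
  const fields = Object.entries(groupSpec)
    .filter(([name]) => name !== '_id')
    .map(([name, spec]) => [name, compileSum(spec)])
  const result = Object.fromEntries(fields.map(([name]) => [name, 0]))
  for (const doc of collection) {
    // Single traversal: every compiled accumulator sees the same document.
    for (const [name, increment] of fields) result[name] += increment(doc)
  }
  return result
}

const out = runGroup(
  [{ aaaa: { ffff: 'FOO', yyyy: 3 } }, { aaaa: { ffff: 'QUX', yyyy: 0 } }],
  {
    _id: null,
    total: { $sum: 1 },
    ffffffff: { $sum: { $cond: [{ $or: [
      { $eq: ['$aaaa.ffff', 'FOO'] }, { $eq: ['$aaaa.ffff', 'BAR'] }] }, 1, 0] } },
    yyyy: { $sum: { $cond: [{ $gte: ['$aaaa.yyyy', 1] }, 1, 0] } },
  }
)
console.log(out)
```

A general implementation would also have to handle accumulators that really do need all values of a group at once, which is where the complexity mentioned above comes from.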
it should be possible to only use facilities in mingo that make sense. In your example, that could be the $match and $project parts.
This works fine indeed.
Is it possible to provide a generator with random values?
That's not as straightforward as you think. I'm too swamped right now. Here's an interesting tool: https://www.json-generator.com
You can do things like this:
[
  '{{repeat(100)}}',
  {
    _id: '{{objectId()}}',
    index: '{{index()}}',
    guid: '{{guid()}}',
    data: {
      parent: {
        isRelevant: '{{bool()}}',
        age: '{{integer(20, 40)}}',
        eyeColor: '{{random("blue", "brown", "green")}}',
        name: '{{firstName()}} {{surname()}}',
        gender: '{{gender()}}',
        company: '{{company().toUpperCase()}}',
        about: '{{lorem(1, "paragraphs")}}'
      },
      isActive: '{{bool()}}',
      balance: '{{floating(1000, 4000, 2, "$0,0.00")}}',
      picture: 'http://placehold.it/32x32',
      age: '{{integer(20, 40)}}',
      eyeColor: '{{random("blue", "brown", "green")}}',
      name: '{{firstName()}} {{surname()}}',
      gender: '{{gender()}}',
      company: '{{company().toUpperCase()}}',
      email: '{{email()}}',
      phone: '+1 {{phone()}}',
      address: '{{integer(100, 999)}} {{street()}}, {{city()}}, {{state()}}, {{integer(100, 10000)}}',
      about: '{{lorem(1, "paragraphs")}}',
      registered: '{{date(new Date(2014, 0, 1), new Date(), "YYYY-MM-ddThh:mm:ss Z")}}',
      latitude: '{{floating(-90.000001, 90)}}',
      longitude: '{{floating(-180.000001, 180)}}',
      tags: [
        '{{repeat(10,20)}}',
        {
          id: '{{index()}}',
          word: '{{lorem(1, "words")}}'
        }
      ],
      friends: [
        '{{repeat(10,20)}}',
        {
          id: '{{index()}}',
          name: '{{firstName()}} {{surname()}}'
        }
      ]
    },
    lookup1: {
      id1: '{{integer(10000000, 99999999)}}',
      id2: '{{integer(10000000, 99999999)}}',
      id3: '{{integer(10000000, 99999999)}}',
      id4: '{{integer(10000000, 99999999)}}',
      id5: '{{integer(10000000, 99999999)}}',
      tags: [
        '{{repeat(1,10)}}',
        {
          id: '{{index()}}',
          word: '{{lorem(1, "words")}}'
        }
      ],
      friends: [
        '{{repeat(1,10)}}',
        {
          id: '{{index()}}',
          name: '{{firstName()}} {{surname()}}'
        }
      ]
    },
    lookup2: {
      id1: '{{integer(10000000, 99999999)}}',
      id2: '{{integer(10000000, 99999999)}}',
      id3: '{{integer(10000000, 99999999)}}',
      id4: '{{integer(10000000, 99999999)}}',
      id5: '{{integer(10000000, 99999999)}}',
      tags: [
        '{{repeat(1,10)}}',
        {
          id: '{{index()}}',
          word: '{{lorem(1, "words")}}'
        }
      ],
      friends: [
        '{{repeat(1,10)}}',
        {
          id: '{{index()}}',
          name: '{{firstName()}} {{surname()}}'
        }
      ]
    }
  }
]
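If an online tool is not convenient, a small script can produce similar test data. The following is only a sketch: the field names, value ranges and nesting are invented and merely approximate the document shape described earlier, and the seeded PRNG is there so that a benchmark can be repeated on identical data.

```javascript
// Small seeded PRNG (mulberry32-style) so that runs are reproducible.
function mulberry32(seed) {
  let a = seed >>> 0
  return function () {
    a = (a + 0x6D2B79F5) | 0
    let t = Math.imul(a ^ (a >>> 15), 1 | a)
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Hypothetical generator: yields `count` nested documents with random values.
function* generateDocs(count, seed = 1) {
  const rand = mulberry32(seed)
  const bool = () => rand() < 0.5
  const int = (min, max) => min + Math.floor(rand() * (max - min + 1))
  const pick = items => items[int(0, items.length - 1)]
  // A sub-document with 20 keys of mixed types.
  const subdoc = () => {
    const doc = {}
    for (let i = 0; i < 20; i++) {
      doc[`field${i}`] =
        i % 3 === 0 ? bool() : i % 3 === 1 ? int(0, 5000) : pick(['FOO', 'BAR', 'BAZ'])
    }
    return doc
  }
  for (let index = 0; index < count; index++) {
    yield {
      id: index,
      key1: { doc1: { ...subdoc(), arr1: Array.from({ length: 6 }, subdoc) }, doc2: subdoc() },
      key2: { doc1: subdoc(), doc2: subdoc() },
    }
  }
}

const docs = [...generateDocs(1000, 42)]
console.log(docs.length, Object.keys(docs[0].key2.doc1).length)
```

Because the generator is lazy, large collections can also be consumed one document at a time instead of being materialized into an array first.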
from mingo.
The latest version should include significant performance improvements that address some of these concerns. This particular example is a special case for which a custom operator is most suitable.
from mingo.
Good to hear @kofrasa
from mingo.