$group operations are insanely slow. Grouping 3000 do


$group operators should work in parallel (mingo issue, closed, 6 comments)

kofrasa commented on July 18, 2024

$group operators should work in parallel

from mingo.

Comments (6)

kofrasa commented on July 18, 2024

Thanks for the report @Redsandro

Modern JavaScript engines execute user code on a single thread (I/O operations aside), so truly parallel computation does not happen unless one forks another process. Unfortunately, that is outside the scope of mingo.

Also, mingo performance cannot be compared to mongodb, which is written in C++ and operates in a different environment (database servers). If you are executing your code on the server, then loading the data into mongodb before querying would most likely be faster.

That said, the execution time you are seeing merits some investigation.

Can you provide some more information?

1. What query are you running?
2. What is the structure of documents in your collection?
3. Which field are you grouping by?

Some auxiliary questions:

1. What is the performance observed with custom code that achieves the same thing?
2. Do you get better performance using another library that supports grouping?
3. If (2), which library and by what factor?


Redsandro commented on July 18, 2024

Sorry for getting back to you late. I had to make a custom implementation for my use-case, and that was not simple.

The facts you listed in the first half of your post are true but, frankly, irrelevant to the issue. I've previously identified the problem quite clearly: mingo traverses the entire collection once per group operator. This is a killer. The collection should be traversed only once, and every operator should use that same traversal.

1. What query are you running?

I'm using data under NDA, so I'll have to obfuscate this, but to give you a basic idea:

[
    {
        "$match": {
            "aaaa.ggg": false, 
            "aaaa.oooo": {
                "$lte": 2200
            }, 
            "ccc.aaaa": true, 
            "ccc.ssss": {
                "$in": [
                    4, 
                    5
                ]
            }, 
            "mmmm.cccc": "1111222233334444"
        }
    }, 
    {
        "$project": {
            "_id": 0, 
            "aaaa": 1, 
            "ccc": 1, 
            "id": 1, 
            "iiii": 1, 
            "meta": 1, 
            "oooo": 1, 
            "pppp": 1, 
            "rrrr": 1, 
            "ttt": 1
        }
    }, 
    {
        "$group": {
            "_id": "$mmmm.cccc", 
            "cccbbbb": {
                "$sum": {
                    "$cond": [
                        {
                            "$eq": [
                                "$ccc.bbbb", 
                                true
                            ]
                        }, 
                        1, 
                        0
                    ]
                }
            }, 
            "cccaaaa": {
                "$sum": {
                    "$cond": [
                        {
                            "$eq": [
                                "$ccc.aaaa", 
                                true
                            ]
                        }, 
                        1, 
                        0
                    ]
                }
            }, 
            "cccffffddddss": {
                "$sum": {
                    "$cond": [
                        {
                            "$gte": [
                                "$ccc.ddddss", 
                                4
                            ]
                        }, 
                        1, 
                        0
                    ]
                }
            }, 
            "ffffffff": {
                "$sum": {
                    "$cond": [
                        {
                            "$or": [
                                {
                                    "$eq": [
                                        "$aaaa.ffff", 
                                        "FOO"
                                    ]
                                }, 
                                {
                                    "$eq": [
                                        "$aaaa.ffff", 
                                        "BAR"
                                    ]
                                }
                            ]
                        }, 
                        1, 
                        0
                    ]
                }
            }, 
            "ggg": {
                "$sum": {
                    "$cond": [
                        {
                            "$eq": [
                                "$aaaa.ggg", 
                                true
                            ]
                        }, 
                        1, 
                        0
                    ]
                }
            }, 
            "iiiiwwww": {
                "$sum": {
                    "$cond": [
                        {
                            "$eq": [
                                "$iiii.wwww", 
                                true
                            ]
                        }, 
                        1, 
                        0
                    ]
                }
            }, 
            "iiiiXXXX": {
                "$sum": {
                    "$cond": [
                        {
                            "$or": [
                                {
                                    "$ne": [
                                        "$iiii.a", 
                                        false
                                    ]
                                }, 
                                {
                                    "$ne": [
                                        "$iiii.b", 
                                        false
                                    ]
                                }, 
                                {
                                    "$ne": [
                                        "$iiii.c", 
                                        false
                                    ]
                                }, 
                                {
                                    "$ne": [
                                        "$iiii.d", 
                                        false
                                    ]
                                }
                            ]
                        }, 
                        1, 
                        0
                    ]
                }
            }, 
            "iiiizzzz": {
                "$sum": {
                    "$cond": [
                        {
                            "$eq": [
                                "$iiii.zzzz", 
                                true
                            ]
                        }, 
                        1, 
                        0
                    ]
                }
            }, 
            "ooaa": {
                "$sum": {
                    "$cond": [
                        {
                            "$eq": [
                                "$aaaa.ooaa", 
                                true
                            ]
                        }, 
                        1, 
                        0
                    ]
                }
            },  
            "ssss": {
                "$sum": {
                    "$cond": [
                        {
                            "$eq": [
                                "$aaaa.ssss", 
                                true
                            ]
                        }, 
                        1, 
                        0
                    ]
                }
            }, 
            "total": {
                "$sum": 1
            }, 
            "uuuu": {
                "$sum": {
                    "$cond": [
                        {
                            "$eq": [
                                "$aaaa.uuuu", 
                                true
                            ]
                        }, 
                        1, 
                        0
                    ]
                }
            }, 
            "tttt": {
                "$sum": {
                    "$cond": [
                        {
                            "$eq": [
                                "$aaaa.tttt", 
                                true
                            ]
                        }, 
                        1, 
                        0
                    ]
                }
            }, 
            "wwww": {
                "$sum": {
                    "$cond": [
                        {
                            "$eq": [
                                "$aaaa.wwww", 
                                true
                            ]
                        }, 
                        1, 
                        0
                    ]
                }
            }, 
            "yyyy": {
                "$sum": {
                    "$cond": [
                        {
                            "$gte": [
                                "$aaaa.yyyy", 
                                1
                            ]
                        }, 
                        1, 
                        0
                    ]
                }
            }
        }
    }
]

Some console.log timings:

;;; getting data (0.109s)
;;; getting count (0.233s)
;;; mingo start (0.234s)
;;; mingo end (15.024s)
;;; getting total (15.024s)
;;; return (15.024s)

That's about 14.8 seconds for the group (count) stage.

2. What is the structure of documents in your collection?

Again, NDA, but something to the depth of this:

{
	key1: {
		doc1: {
			key1: {
				arr1: [{}, {}, {}, {}, {}, {}]
			}
			// etc
		},
		doc2: {},
		doc3: {},
		doc4: {}
	},
	key2: {
		doc1: {},
		doc2: {}
	},
	key3: {
		doc1: {},
		doc2: {}
	},
	key4: {
		doc1: {},
		doc2: {}
	},
	key5: {
		doc1: {},
		doc2: {}
	}	
}

Say every sub-subdoc contains about 20 keys.

3. Which field are you grouping by?

I'm grouping by a common identifier that all objects in the collection share. This results in a single document, so I might as well group by null.

Aux 1. What is the performance observed with custom code that achieves the same thing?

Porting a complex mongo group and count to a custom function is quite the job. That's why I went looking for libraries with mongodb syntax in the first place. But I'm happy with the result.

;;; getting data (0.126s)
;;; getting count (0.240s)
;;; custom count start (0.241s)
;;; custom count done (0.482s)
;;; getting total (0.483s)
;;; return (0.483s)

That's about 0.24 seconds for the counting.

You should glance at these timings to see the realm of realistic performance in comparison, but know that this is a highly specialized function. I think something as versatile as a mongodb stage would be about 4 times heavier, but realistically it would still be under one second.

Every "group operator" is a function holding its own value. On every iteration of the single collection loop, every operator function is executed with the current document as its argument and updates its own value.

Aux 2. Do you get better performance using another library that supports grouping?

I don't know of any such library in pure javascript. Loki does not support grouping.


kofrasa commented on July 18, 2024

The query you provided is quite elaborate. For most scenarios, custom code will yield better performance. However, it should be possible to use only the facilities in mingo that make sense. In your example, that could be the $match and $project parts.

I understand the data is NDA protected, but it should be straightforward to write a generator for data of a similar shape, which would make reproducing this in a test even simpler.
Is it possible to provide a generator with random values?
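A minimal sketch of what such a generator could look like. The field names are taken from the obfuscated pipeline earlier in the thread; the value ranges, the seeded random number generator, and the function names `makeRng` and `generateDocs` are assumptions made for illustration and say nothing about the real data.

```javascript
// Small seeded pseudo-random generator (mulberry32-style) so that
// generated test data is reproducible between runs.
function makeRng(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Generate `count` documents with the fields the obfuscated query touches.
// All value ranges below are invented for illustration.
function generateDocs(count, seed = 1) {
  const rand = makeRng(seed);
  const bool = () => rand() < 0.5;
  const int = (min, max) => min + Math.floor(rand() * (max - min + 1));
  const pick = (items) => items[int(0, items.length - 1)];

  const docs = [];
  for (let i = 0; i < count; i++) {
    docs.push({
      id: i,
      // shared identifier used as the group key in the query
      mmmm: { cccc: "1111222233334444" },
      aaaa: {
        ggg: bool(),
        oooo: int(0, 4000),
        ffff: pick(["FOO", "BAR", "BAZ"]),
        ooaa: bool(),
        ssss: bool(),
        uuuu: bool(),
        tttt: bool(),
        wwww: bool(),
        yyyy: int(0, 3),
      },
      ccc: { aaaa: bool(), bbbb: bool(), ssss: int(1, 6), ddddss: int(0, 8) },
      iiii: { wwww: bool(), zzzz: bool(), a: bool(), b: bool(), c: bool(), d: bool() },
    });
  }
  return docs;
}
```

Calling `generateDocs(3000, 42)` twice returns identical collections, which keeps timing comparisons repeatable.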

Concerning group operators, each operator takes the collection as input, because every operator requires the collection as input. That does not mean they necessarily iterate it or have to do the same thing. Each $sum in your example needs to be aggregated based on different rules, which cannot be run under the same iteration.

In your special case because you are running a lot of sums, you can throw in a custom group operator which does exactly that, that is, take the collection and compute all the sums required. That way you still benefit from being able to use the rest of mingo infrastructure as described above. If while at it you find new ways to speed up the more generic implementation for all other operators, please send a PR.


Redsandro commented on July 18, 2024

> In your special case because you are running a lot of sums, you can throw in a custom group operator which does exactly that, that is, take the collection and compute all the sums required. That way you still benefit from being able to use the rest of mingo infrastructure as described above. If while at it you find new ways to speed up the more generic implementation for all other operators, please send a PR.

👍

> Concerning group operators, each operator takes the collection as input, because every operator requires the collection as input.

True, but operators are like kids. You give them breakfast on the same table, bring them all to the same school in the same car with the same car ride. They don't have to eat the same sandwich and bring home the same drawing though.

What mingo currently does is set separate tables for the kids and use separate rides to drive them to school individually. Unless there's a big issue that I'm missing, which I doubt, this is a lot of unnecessary overhead.

Now this is heavily simplified pseudo-code made up on the spot, but it is to illustrate that the collection needs to be iterated once.

const results = {}

// init operators, keep the returned closures
const operators = $group.map(expr => init(expr, results))

// traverse collection once
collection.forEach(document =>
    // every operator sees the current document during the same pass
    operators.forEach(operator =>
        operator(document)
    )
)

return results

This only works for a single grouped result, and in reality there need to be resultSets, but if it was simple to write, I'd send a PR instead. 😉

Every operator should be a closure so they can keep their own result. This loosely illustrates a $sum:

function init(expr, results) {
    // some operator preparation code here
    // switch operator, prepare and return relevant code, etc

    // keep results reference
    // needs to be an object, because primitive values are copied, not referenced
    results[expr.localField] = {}

    // return accumulator function for the pipeline
    // This will effectively be similar to Array.reduce()
    // The arrow function is wrapped in parentheses so .bind() can be called on it
    return ((accumulator, document) => {
        if (!accumulator.value) accumulator.value = 0
        if (document.get(expr.foreignField) == expr.fieldValue) {
            accumulator.value++
        }
    }).bind(null, results[expr.localField])
}

This is complex to write for something as versatile as a mongodb pipeline, but it's highly efficient in comparison. There needs to be a cluster of grouped closures so that everything that needs to be iterated will be iterated once only.
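A minimal runnable sketch of the single-traversal idea, extended to per-group result sets. The helper names `sumWhen` and `groupOnce`, and the choice to represent each operator as a factory that creates fresh state per group, are assumptions for illustration; this is not how mingo implements $group.

```javascript
// Factory for a conditional counter, roughly { $sum: { $cond: [test, 1, 0] } }.
// Calling the returned function creates fresh state, so every group
// gets its own independent accumulator.
const sumWhen = (predicate) => () => {
  let value = 0;
  return {
    step(doc) { if (predicate(doc)) value += 1; },
    result() { return value; },
  };
};

// Group `collection` by `keyOf(doc)` in a single traversal.
// `spec` maps output field names to accumulator factories.
function groupOnce(collection, keyOf, spec) {
  const fields = Object.keys(spec);
  const groups = new Map(); // group key -> one accumulator per field

  for (const doc of collection) {
    const key = keyOf(doc);
    let accumulators = groups.get(key);
    if (!accumulators) {
      accumulators = fields.map((field) => spec[field]());
      groups.set(key, accumulators);
    }
    // every operator sees the document during the same pass
    for (const acc of accumulators) acc.step(doc);
  }

  return Array.from(groups, ([key, accumulators]) => {
    const row = { _id: key };
    fields.forEach((field, i) => { row[field] = accumulators[i].result(); });
    return row;
  });
}

// Usage with a few fields borrowed from the obfuscated query.
const docs = [
  { mmmm: { cccc: "x" }, ccc: { bbbb: true }, aaaa: { ffff: "FOO" } },
  { mmmm: { cccc: "x" }, ccc: { bbbb: false }, aaaa: { ffff: "BAR" } },
  { mmmm: { cccc: "y" }, ccc: { bbbb: true }, aaaa: { ffff: "BAZ" } },
];

const grouped = groupOnce(docs, (d) => d.mmmm.cccc, {
  total: sumWhen(() => true),
  cccbbbb: sumWhen((d) => d.ccc.bbbb === true),
  ffffffff: sumWhen((d) => d.aaaa.ffff === "FOO" || d.aaaa.ffff === "BAR"),
});

console.log(grouped);
// [ { _id: 'x', total: 2, cccbbbb: 1, ffffffff: 2 },
//   { _id: 'y', total: 1, cccbbbb: 1, ffffffff: 0 } ]
```

The `Map` compares keys by identity, so this sketch only handles primitive group keys; object-valued keys would need to be serialized first.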

> it should be possible to only use facilities in mingo that make sense. In your example, that could be the $match and $project parts.

This works fine indeed. 👍

> Is it possible to provide a generator with random values?

That's not as straightforward as you think. I'm too swamped right now. Here's an interesting tool: https://www.json-generator.com

You can do things like this:

[
  '{{repeat(100)}}',
  {
    _id: '{{objectId()}}',
    index: '{{index()}}',
    guid: '{{guid()}}',
    data: {
      parent: {
        isRelevant: '{{bool()}}',
        age: '{{integer(20, 40)}}',
        eyeColor: '{{random("blue", "brown", "green")}}',
        name: '{{firstName()}} {{surname()}}',
        gender: '{{gender()}}',
        company: '{{company().toUpperCase()}}',
        about: '{{lorem(1, "paragraphs")}}'
      },
      isActive: '{{bool()}}',
      balance: '{{floating(1000, 4000, 2, "$0,0.00")}}',
      picture: 'http://placehold.it/32x32',
      age: '{{integer(20, 40)}}',
      eyeColor: '{{random("blue", "brown", "green")}}',
      name: '{{firstName()}} {{surname()}}',
      gender: '{{gender()}}',
      company: '{{company().toUpperCase()}}',
      email: '{{email()}}',
      phone: '+1 {{phone()}}',
      address: '{{integer(100, 999)}} {{street()}}, {{city()}}, {{state()}}, {{integer(100, 10000)}}',
      about: '{{lorem(1, "paragraphs")}}',
      registered: '{{date(new Date(2014, 0, 1), new Date(), "YYYY-MM-ddThh:mm:ss Z")}}',
      latitude: '{{floating(-90.000001, 90)}}',
      longitude: '{{floating(-180.000001, 180)}}',
      tags: [
        '{{repeat(10,20)}}',
        {
          id: '{{index()}}',
          word: '{{lorem(1, "words")}}'
        }
      ],
      friends: [
        '{{repeat(10,20)}}',
        {
          id: '{{index()}}',
          name: '{{firstName()}} {{surname()}}'
        }
      ]
    },
    lookup1: {
      id1: '{{integer(10000000, 99999999)}}',
      id2: '{{integer(10000000, 99999999)}}',
      id3: '{{integer(10000000, 99999999)}}',
      id4: '{{integer(10000000, 99999999)}}',
      id5: '{{integer(10000000, 99999999)}}',
      tags: [
        '{{repeat(1,10)}}',
        {
          id: '{{index()}}',
          word: '{{lorem(1, "words")}}'
        }
      ],
      friends: [
        '{{repeat(1,10)}}',
        {
          id: '{{index()}}',
          name: '{{firstName()}} {{surname()}}'
        }
      ]
    },
    lookup2: {
      id1: '{{integer(10000000, 99999999)}}',
      id2: '{{integer(10000000, 99999999)}}',
      id3: '{{integer(10000000, 99999999)}}',
      id4: '{{integer(10000000, 99999999)}}',
      id5: '{{integer(10000000, 99999999)}}',
      tags: [
        '{{repeat(1,10)}}',
        {
          id: '{{index()}}',
          word: '{{lorem(1, "words")}}'
        }
      ],
      friends: [
        '{{repeat(1,10)}}',
        {
          id: '{{index()}}',
          name: '{{firstName()}} {{surname()}}'
        }
      ]
    }
  }
]


kofrasa commented on July 18, 2024

The latest version should have significant performance improvements that address some of these concerns. This particular example is a special case for which a custom operator is most suitable.

Redsandro commented on July 18, 2024

Good to hear @kofrasa 👍
